Hadoop Lessons

I recently completed a very interesting class focused mainly on the Hadoop framework, along with general principles of distributed algorithm design and analysis. It struck a nice balance between theory and practice, although the theory happened in the classroom and the practice was on our own in the assignments.

By the end of the class, I was feeling proficient enough to be dangerous, and after not doing as well as I’d hoped on the final, I decided to do the optional last programming assignment, computing relative frequencies with both the pairs and stripes patterns, for some extra credit and extra experience. I got it done in under 9 hours, thanks largely to the practice I’d built up over our previous projects: word count, natural join, and PageRank.

I learned that with Hadoop, as with most things, practice comes first for me and theory second. Along the way, I picked up a few lessons that I’d like to pass along:

  1. Always test. You’re building a pipeline of functions, and the framework operates under the hood. I did most of my assignments in Python using Hadoop Streaming. I’m a big fan of IPython Notebook, so I would grab a chunk of the data, load it in a notebook, write code against it, and then test the mapper at the UNIX command line:

    cat data | python mapper.py | sort > mapperOutput.txt
    I built my reducer code off of that output, then ran the data through the real mapper with a dummy reducer that simply echoed the sorted mapper output. I would load some of that data into the notebook again and finish building my reducer. Then I would test the whole thing at the command line:

    cat data | python mapper.py | sort | python reducer.py > reducerOutput.txt
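
    For concreteness, here is a minimal sketch of what such a streaming pair can look like. The word-count logic is only an illustration (not the course assignment), written for the Python 2 that EMR runs; the dummy reducer mentioned above can be as simple as echoing stdin to stdout.

    # mapper.py -- emit a tab-separated (word, 1) pair per input word
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print '%s\t%d' % (word, 1)

    # reducer.py -- sum counts, relying on sort to group identical keys
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip('\n').split('\t')
        if word != current_word:
            if current_word is not None:
                print '%s\t%d' % (current_word, count)
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print '%s\t%d' % (current_word, count)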

    Then run it on the cluster. Since we were developing on AWS Elastic MapReduce, I just spun up a cluster and kept it alive so I could run my code against it without waiting for a new cluster to start each time:

    ./elastic-mapreduce --create --alive --instance-type m1.large --instance-count 11

    This returns a job flow ID. I wrote a shell script to capture that ID and use it as a variable in my jobs:

    MYJOB_FLOW_ID_STRING="$(./elastic-mapreduce --create --alive --instance-type m1.large --instance-count 11)"
    MYFLOW_ID=${MYJOB_FLOW_ID_STRING:17:31}

    The first line runs the command and stores the full output in the variable MYJOB_FLOW_ID_STRING. The second line slices that string to extract just the job flow ID (MYFLOW_ID). This let me test and retest quickly without waiting for clusters to spin up and configure, not to mention without bumping into the inevitable EC2 instance quota.
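
    Once the ID is captured, adding a streaming step to the live cluster looks something like this sketch (the S3 paths are placeholders):

    ./elastic-mapreduce -j $MYFLOW_ID --stream \
      --mapper s3://mybucket/mapper.py \
      --reducer s3://mybucket/reducer.py \
      --input s3://mybucket/input/ \
      --output s3://mybucket/output/run1/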

  2. Learn the framework. The mapper and the reducer are generally the least of your troubles. Often my code would be sitting on my computer, ready to run, while I fought with the framework. Tweaking configurations to get a job to run properly is an art, and as much thought needs to go into configuring the framework as into the internal operation of the mappers and reducers. One trick if you’re using the AWS EMR CLI: pass extra arguments using the --arg flag. For instance, to do a secondary sort you need to specify a -partitioner, which isn’t a common flag and isn’t directly supported by the elastic-mapreduce-ruby CLI. But you can pass it like this:

    --arg -partitioner \
    --arg org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \

    This lets you pass arguments even when they aren’t in the standard lexicon of elastic-mapreduce-ruby.
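
    Putting it together with the job flow ID from lesson 1, a full secondary-sort step might look something like this sketch (the S3 paths are placeholders, and I’m assuming the classic Hadoop 1.x streaming -jobconf keys, which is what EMR was running at the time as far as I know):

    ./elastic-mapreduce -j $MYFLOW_ID --stream \
      --mapper s3://mybucket/mapper.py \
      --reducer s3://mybucket/reducer.py \
      --input s3://mybucket/input/ \
      --output s3://mybucket/output/secondary-sort/ \
      --arg -partitioner \
      --arg org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
      --arg -jobconf \
      --arg stream.num.map.output.key.fields=2 \
      --arg -jobconf \
      --arg mapred.text.key.partitioner.options=-k1,1

    With those settings, the first two tab-separated fields of the mapper output form the sort key, while the partitioner hashes only on the first field, so all records sharing a primary key arrive at the same reducer sorted by the secondary field.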

  3. Plan for backward compatibility. EMR isn’t running the latest and greatest software. I had code that worked perfectly under my local Python 2.7 but failed on EMR, which was apparently running 2.6. I had to dig through the logs to find the actual error.
  4. Check the logs. The STDERR and SYSLOG outputs are about as helpful as a postal worker having a bad day: you’ll see that your mapper failed, but you won’t know why. For the actual Python traceback, look in the task attempt logs in S3. I would have saved so much time on earlier projects had I known about these. This is where I found the backward compatibility issue above (which had to do with string formatting in Python; see the sketch just below).
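
    As a footnote to lessons 3 and 4: a classic formatting difference between those Python versions, and a plausible culprit for errors like mine, is auto-numbered format fields, which 2.6 rejects:

    word, count = 'foo', 3

    # Works on Python 2.7; on 2.6 it raises
    # "ValueError: zero length field name in format".
    print '{}\t{}'.format(word, count)

    # Portable to 2.6: give the fields explicit indices.
    print '{0}\t{1}'.format(word, count)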

There are other lessons, I’m sure, but these are the ones I wanted to preserve in some form. I’ll be getting code examples from these projects onto my GitHub. I hope these little tricks are helpful; they sure took me long enough to figure out. If they save someone else 20 minutes or so of searching, I’m happy.
