For the past month or so, we've been investigating how to optimize Hadoop. Hadoop is frequently used as a comparison point for the performance of other distributed systems, so having it be as fast as possible makes our job more challenging - and hopefully makes research more interesting.
In the course of our work we found that Hadoop is actually fairly well optimized (not too surprising), but there are number of simple tweaks that can be made to improve the performance for some common research use cases. For this post, we're concerned with Hadoop's performance on a Pagerank dataset of 1 billion pages. Our initial runs on this dataset took about 9 minutes. Following are some changes we found helped decrease this runtime:
Caching reflected classes
There are a number of legacy hooks that Hadoop uses for creating configuration, key and value objects at runtime. The goal of these legacy code paths is to allow for both the Configuration and JobConf based jobs to work. Unfortunately, there is a significant overhead imposed by this code path. Simply caching the results of a few reflection operations reduces CPU usage by 20%.
Speeding object copies
The default mechanism used by Hadoop to clone objects is robust, but slow. It serializes an object and then deserializes it into the copy. For a variety of simple objects, this can be improved by creating a Copyable interface which allows for direct copies.
Configuration changes
- dfs.datanaode.max.xcievers
Datanodes have an upper bound on the number of files that it will serve at any one time. Turning it large enough is helpful to bear high I/O overload. We configure it to 8192 to run PageRank with 1 billion pages and 250 shards.
- dfs.balance.bandwidthPerSec
The default speed at which HDFS balancing operations take place is 1MB/s - far too low when gigabit networks are the bottom-end.
- mapred.map.output.compression.codec
We use Twitter implemented LZO to replace the original Gzip to compress the output data of Map tasks and Reduce tasks.
- mapred.reduce.parallel.copies
When there are many map tasks serving a job, turning this parameters larger is a good option, especially when each task is small. The larger it is set, the more map outputs are fetched during the Shuffle phase simultanesouly.
After our changes, we measured a 25% performance improvement - down to 7m30s. Honestly, not as much as we hoped, but still significant. We hope to push our code changes to Hadoop upstream in the near future.