
AWS EMR High Performance Bootstrap Actions


In this post, I describe some EMR bootstrap actions that are especially helpful in keeping Hadoop clusters running reliably. As a general rule, I use c3.xlarge or m3.xlarge spot instances and have been consistently deploying medium-sized clusters (>30 nodes).

  1. Aggregated logging: To set up YARN log aggregation, use the configure-hadoop bootstrap action with the following arguments (a sketch of attaching these arguments through the AWS SDK for Java follows the list):
-y, yarn.log-aggregation-enable=true, 
-y, yarn.log-aggregation.retain-seconds=-1, 
-y, yarn.log-aggregation.retain-check-interval-seconds=3000, 
-y, yarn.nodemanager.remote-app-log-dir=s3://s3_log_location/cluster/aggregated_logs/
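
If you launch clusters programmatically, the same arguments can be attached as a bootstrap action through the AWS SDK for Java (1.x). The following is a minimal sketch rather than the exact code I use: the configure-hadoop path is the stock EMR bootstrap script used with the older AMI releases, and s3://s3_log_location/... is the placeholder bucket from the arguments above.

import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;

public class LogAggregationBootstrap {
    // Builds the configure-hadoop bootstrap action carrying the -y arguments above.
    public static BootstrapActionConfig logAggregationAction() {
        return new BootstrapActionConfig()
            .withName("Enable YARN log aggregation")
            .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
                .withPath("s3://elasticmapreduce/bootstrap-actions/configure-hadoop")
                .withArgs(
                    "-y", "yarn.log-aggregation-enable=true",
                    "-y", "yarn.log-aggregation.retain-seconds=-1",
                    "-y", "yarn.log-aggregation.retain-check-interval-seconds=3000",
                    "-y", "yarn.nodemanager.remote-app-log-dir=s3://s3_log_location/cluster/aggregated_logs/"));
    }
}

Pass the returned config to RunJobFlowRequest.withBootstrapActions(...) when creating the cluster, or supply the same path and arguments in the console or CLI.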
  2. Ensuring correct disk utilization – Since some of my inputs are very disk heavy, the first issue I faced was that the cluster would run out of disk space. It became clear that this was due to incorrect disk utilization: all output was written to the boot disk rather than to the ephemeral storage disks (which had plenty of space, 2x40GB per node). The mount points were incorrectly set up, which I override using the configure-hadoop bootstrap action and the following arguments (note: these settings are for EC2 instances with two ephemeral disks per instance; adjust the paths if you have a different number). A quick way to verify the override on a running node is sketched after the argument list:
-m, mapred.local.dir=/mnt/var/lib/hadoop/mapred,/mnt1/var/lib/hadoop/mapred,
-h, dfs.data.dir=/mnt/var/lib/hadoop/dfs,/mnt1/var/lib/hadoop/dfs,
-h, dfs.name.dir=/mnt/var/lib/hadoop/dfs-name,/mnt1/var/lib/hadoop/dfs-name, 
-c, hadoop.tmp.dir=/mnt/var/lib/hadoop/tmp,/mnt1/var/lib/hadoop/tmp, 
-c, fs.s3.buffer.dir=/mnt/var/lib/hadoop/s3,/mnt1/var/lib/hadoop/s3, 
-y, yarn.nodemanager.local-dirs=/mnt/var/lib/hadoop/tmp/nm-local-dir,/mnt1/var/lib/hadoop/tmp/nm-local-dir
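
To confirm the override actually took effect, one option is to read the keys back from the Hadoop configuration on a core node. This is a minimal sketch under the assumption that the node's Hadoop conf directory is on the classpath (for example when run via "hadoop jar"); the class name is just for illustration.

import org.apache.hadoop.conf.Configuration;

public class CheckLocalDirs {
    public static void main(String[] args) {
        Configuration conf = new Configuration(); // loads core-site.xml by default
        // Add the other site files so the HDFS, MapReduce and YARN keys are visible;
        // these resolve from the Hadoop conf directory on the classpath.
        conf.addResource("hdfs-site.xml");
        conf.addResource("mapred-site.xml");
        conf.addResource("yarn-site.xml");

        String[] keys = {
            "mapred.local.dir", "dfs.data.dir", "dfs.name.dir",
            "hadoop.tmp.dir", "fs.s3.buffer.dir", "yarn.nodemanager.local-dirs"
        };
        for (String key : keys) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}

Each key should list the /mnt and /mnt1 paths from the bootstrap arguments; if any still points at the boot volume, the override did not apply.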
  3. Snappy compression – You can set up cluster-wide Snappy compression of map and job outputs using the configure-hadoop bootstrap action and the following arguments (a per-job equivalent using the Hadoop Java API is sketched after the list):
-m, mapreduce.map.output.compress=true, 
-m, mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec, 
-m, mapreduce.output.fileoutputformat.compress=true, 
-m, mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
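
If you would rather enable Snappy for a single job than cluster-wide, the same keys can be set through the standard Hadoop MapReduce API. A minimal sketch, with the class and method names chosen only for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SnappyJobSetup {
    public static Job newSnappyJob(String name) throws IOException {
        Configuration conf = new Configuration();
        // Compress intermediate map output with Snappy.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, name);
        // Compress the final job output with Snappy as well.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        return job;
    }
}

On Hadoop 2, the two FileOutputFormat helpers write the same mapreduce.output.fileoutputformat.* keys listed above.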
  4. Multipart uploads to S3 – You can enable this setting to ensure that S3 uploads don't time out and that extremely large output files can be written to S3. The split size below is 524,288,000 bytes, i.e. 500 MB per part (a per-job equivalent is sketched after the list):
-c, fs.s3n.multipart.uploads.enabled=true,
-c, fs.s3n.multipart.uploads.split.size=524288000
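
The same two keys can also be set from code for a single job rather than cluster-wide; a small sketch (the helper class is hypothetical):

import org.apache.hadoop.conf.Configuration;

public class S3MultipartSettings {
    // Applies the multipart-upload settings to a job Configuration
    // instead of (or in addition to) the cluster-wide bootstrap action.
    public static void enableMultipartUploads(Configuration conf) {
        conf.setBoolean("fs.s3n.multipart.uploads.enabled", true);
        // 524288000 bytes = 500 MB per part.
        conf.setLong("fs.s3n.multipart.uploads.split.size", 524288000L);
    }
}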
  5. Out of Memory and GC errors – A common issue is mappers failing with Out of Memory or "GC overhead limit exceeded" errors. A fairly easy fix is to modify the settings below, either when bootstrapping the cluster (using configure-hadoop) or per job through the Hadoop Configuration API (e.g. Configuration.set()). In our case, we allow the mapper JVM up to 4 GB of heap (a per-job sketch follows the settings):
-m "mapreduce.map.java.opts=-Xmx4096m -XX:-UseGCOverheadLimit"
-m mapreduce.map.memory.mb=4096
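
A per-job version of the same fix, using the Hadoop Configuration API (the helper name is only for illustration):

import org.apache.hadoop.conf.Configuration;

public class MapperMemorySettings {
    // Per-job equivalent of the bootstrap arguments above:
    // give the mapper container 4 GB, raise the mapper JVM heap to match,
    // and disable the "GC overhead limit exceeded" kill switch.
    public static void raiseMapperMemory(Configuration conf) {
        conf.set("mapreduce.map.java.opts", "-Xmx4096m -XX:-UseGCOverheadLimit");
        conf.setInt("mapreduce.map.memory.mb", 4096);
    }
}

One caveat: mapreduce.map.memory.mb is the YARN container size, so if YARN starts killing containers for exceeding physical memory, a common adjustment is to keep -Xmx slightly below the container size to leave room for non-heap JVM overhead.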

Written by anujjaiswal

February 10, 2015 at 7:35 am

Posted in AWS, Hadoop, MapRed

