Running Amazon EMR with a custom AMI?

Question:

I need to run a custom C++ MapReduce job on Amazon, and was planning to use Hadoop streaming for this. The C++ mapper executable relies on dozens of custom libraries, some of which are time-consuming to build.

I expected EMR to support custom AMIs (already have one built). However, after a careful look at the documentation it seems that it is only possible to run EMR on predefined images: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html.

Am I missing something? If, indeed, only predefined AMIs are supported, what is the best option for getting this to run? The executable, obviously, is on S3, but can I actually bundle it up so that it depends on no shared libraries at all?
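For instance, would a fully static build along these lines do the trick? (Just a sketch; mapper.cpp and the -lfoo/-lbar libraries are placeholders for my actual sources.)

    # Sketch: -static asks g++ to link everything, including libstdc++ and
    # libc, statically; mapper.cpp and -lfoo/-lbar are placeholders.
    g++ -O2 -static -o mapper mapper.cpp -lfoo -lbar
    # A fully static binary reports "not a dynamic executable" here:
    ldd ./mapper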

Thanks.

Answer:

You are correct: because of the many software tools and configurations required on a Hadoop cluster node, only Amazon-provided AMIs are allowed on EMR. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-ami.html for details.

You can use standard bootstrapping techniques to install any additional software you need on your cluster.
See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html to learn more about bootstrap actions.
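As a rough sketch (the bucket, script name, and package names are placeholders, and this assumes the Debian-based EMR AMIs where apt-get is available), a bootstrap action is simply a script stored on S3 that EMR runs on every node before your job starts:

    #!/bin/bash
    # install-deps.sh -- hypothetical bootstrap action stored at
    # s3://mybucket/bootstrap/install-deps.sh
    set -e
    sudo apt-get update
    # Placeholder package names; install whatever your mapper links against.
    sudo apt-get install -y libfoo-dev libbar-dev

You would then reference it at cluster creation time, e.g. with the elastic-mapreduce Ruby CLI:

    elastic-mapreduce --create --alive \
      --bootstrap-action s3://mybucket/bootstrap/install-deps.sh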

Back to your use case: why is bootstrapping taking so long? Because there are many packages? Because you're compiling them from source?

In the latter case, it might be worth building your .deb packages ahead of time and installing them from a custom repository to speed up the bootstrap process.
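A minimal sketch of such a bootstrap action (the repository URL and package name are placeholders):

    #!/bin/bash
    set -e
    # Hypothetical APT repository hosting your prebuilt .deb packages.
    echo "deb http://repo.example.com/debian stable main" \
      | sudo tee /etc/apt/sources.list.d/custom.list
    sudo apt-get update
    # Installing prebuilt binaries is far faster than compiling from source.
    sudo apt-get install -y mycustom-libs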

If it is just because you have many packages to install, I am afraid there is no obvious solution today. One idea is to create an EBS volume from a pre-populated snapshot and attach it during bootstrap, but the feasibility of this really depends on your use case.
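To illustrate that last idea, here is a rough sketch of a bootstrap action that creates a volume from a snapshot holding your prebuilt libraries and attaches it to the node (the snapshot ID, device, and mount point are placeholders, and the node needs credentials for the EC2 API):

    #!/bin/bash
    set -e
    # Placeholders: snap-12345678 holds prebuilt libraries; /dev/sdf is unused
    # (the kernel may expose it as /dev/xvdf on some instance types).
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
    # The volume must be created in the same availability zone as the node.
    VOLUME_ID=$(aws ec2 create-volume --snapshot-id snap-12345678 \
      --availability-zone "$AZ" --query VolumeId --output text)
    aws ec2 wait volume-available --volume-ids "$VOLUME_ID"
    aws ec2 attach-volume --volume-id "$VOLUME_ID" \
      --instance-id "$INSTANCE_ID" --device /dev/sdf
    sleep 10   # crude wait for the device node to appear
    sudo mkdir -p /mnt/libs && sudo mount /dev/sdf /mnt/libs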
