Running Amazon EMR with a custom AMI?


I need to run custom C++ code as a MapReduce job on Amazon, and was planning to use Hadoop Streaming for this. The C++ mapper executable relies on dozens of custom libraries, some of which are time-consuming to build.

I expected EMR to support custom AMIs (I already have one built). However, after a careful look at the documentation, it seems that EMR can only run on predefined images.

Am I missing something? If only predefined AMIs are indeed supported, what is the best option for getting this to run? The executable is on S3, of course, but can I actually bundle it up so that it depends on no shared libraries at all?



You are correct. Because of the many software tools and configurations required on a Hadoop cluster node, only Amazon-provided AMIs are allowed on EMR.

You can use standard bootstrapping techniques to install any additional software you need on your cluster.
See the Amazon EMR documentation to learn more about bootstrap actions.
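
To make this concrete, here is a rough sketch of how a bootstrap action could be described if you launch the cluster from Python. The dictionary shape follows what I understand boto3's `run_job_flow` accepts for its `BootstrapActions` parameter; the helper name, bucket, script path and arguments are all made up for illustration, so adapt them to your setup.

```python
# Sketch only: build a bootstrap action entry in the shape that boto3's
# EMR run_job_flow(BootstrapActions=[...]) is expected to accept.
# The helper name, bucket, script path and arguments are hypothetical.

def bootstrap_action(name, script_path, args=()):
    """Return one bootstrap action entry as a plain dictionary."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,   # location of the script to run on each node
            "Args": list(args),    # arguments passed to that script
        },
    }

actions = [
    bootstrap_action(
        "install-mapper-libs",
        "s3://my-bucket/bootstrap/install_libs.sh",  # hypothetical script
        ["--prefix", "/usr/local"],                   # hypothetical arguments
    )
]
```

The `actions` list would then be passed along with the rest of the cluster definition. The script itself runs on every node before Hadoop starts, which is why a slow build inside it hurts so much.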

Back to your use case:
Why is it taking so long to bootstrap? Is it because there are many packages, or because you are compiling them from source?

In the latter case, it might be worth building your own .deb packages and installing them from a custom repository to speed up the bootstrap process.
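
As an illustration of the custom repository idea, the sketch below generates a small bootstrap script that registers an APT repository and installs prebuilt packages from it. It assumes a Debian-based node image with `apt-get` and `sudo` available, and a repository whose signing key the node already trusts. The repository line, package names and function name are placeholders, not anything EMR provides.

```python
import shlex

# Sketch only: render a bootstrap shell script that installs prebuilt .deb
# packages from a custom APT repository. The repository line and package
# names used below are hypothetical.

def render_bootstrap_script(repo_line, packages):
    """Return the text of a shell script that adds a repo and installs packages."""
    if not packages:
        raise ValueError("at least one package is required")
    # Quote every value so it is passed to the shell literally.
    pkg_args = " ".join(shlex.quote(p) for p in packages)
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        # Register the custom repository in its own sources file.
        f"echo {shlex.quote(repo_line)} | sudo tee /etc/apt/sources.list.d/custom.list > /dev/null",
        "sudo apt-get update",
        f"sudo apt-get install -y {pkg_args}",
    ]
    return "\n".join(lines) + "\n"

script = render_bootstrap_script(
    "deb http://repo.example.com/apt stable main",  # hypothetical repository
    ["libfoo1", "libbar2"],                          # hypothetical packages
)
print(script)
```

You would upload the generated script to S3 and reference it as the bootstrap action. Installing binary packages this way should take seconds per node instead of a full compile.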

If it is just because you have many packages to install, I am afraid there is no obvious solution today. I can think of EBS snapshots, with volumes being created and attached during bootstrap, but the feasibility of this really depends on your use case.

