Installing additional libraries for EMR notebooks

Question:

I’m having a surprisingly hard time working with additional libraries via my EMR notebook. The AWS interface for EMR allows me to create Jupyter notebooks and attach them to a running cluster. I’d like to use additional libraries in them. SSHing into the machines and installing manually as ec2-user or root will not make the libraries available to the notebook, as it apparently uses the livy user. Bootstrap actions install things for hadoop. I can’t install from the notebook because its user apparently doesn’t have sudo, git, etc., and it probably wouldn’t install to the slaves anyway.

What is the canonical way of installing additional libraries for notebooks created through the EMR interface?

Answer:

For the sake of an example, let’s assume you need the librosa Python module on a running EMR cluster. We’re going to use Python 2.7 because the procedure is simpler: Python 2.7 is guaranteed to be on the cluster, and it is the default runtime for EMR.

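Create a shell script that installs the package (for librosa this can be as short as a single pip install librosa line, run with sudo if needed) and save it to your home directory, e.g. /home/hadoop/install_librosa.sh. Note the name; we’re going to use it later.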

Next, run this script through a second script, emr_install.py, which is inspired by the Amazon EMR docs. It uses AWS Systems Manager to execute your script on the cluster nodes.
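The listing for emr_install.py is not reproduced here, so what follows is only a minimal sketch of what such a script might look like, not the original. It assumes boto3 is installed and configured with credentials, that the SSM agent is running on the nodes, and that the instance role allows Systems Manager commands. The helper names (node_instance_ids, script_commands, run_script_on_cluster) are made up for illustration. It reads the install script saved earlier and sends its lines to every running node through the AWS-RunShellScript document.

```python
import sys


def node_instance_ids(emr, cluster_id):
    """Return the EC2 instance IDs of the cluster's running nodes."""
    # Only the first page of results is read; a very large cluster
    # would need to follow the pagination marker as well.
    response = emr.list_instances(
        ClusterId=cluster_id,
        InstanceStates=["RUNNING"],
    )
    return [instance["Ec2InstanceId"] for instance in response["Instances"]]


def script_commands(script_text):
    """Split the install script into the list of lines SSM expects."""
    return [line for line in script_text.splitlines() if line.strip()]


def run_script_on_cluster(emr, ssm, cluster_id, script_text):
    """Send the script to every node and return the SSM command ID."""
    response = ssm.send_command(
        InstanceIds=node_instance_ids(emr, cluster_id),
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": script_commands(script_text)},
    )
    return response["Command"]["CommandId"]


def main(argv):
    if len(argv) != 2:
        print("Syntax: python emr_install.py [cluster_id]")
        return
    # Imported here so the helpers above can be tried without AWS access.
    import boto3

    # Assumption: the install script is where the earlier step saved it.
    with open("/home/hadoop/install_librosa.sh") as handle:
        script_text = handle.read()
    command_id = run_script_on_cluster(
        boto3.client("emr"), boto3.client("ssm"), argv[1], script_text
    )
    print("Sent SSM command " + command_id)


if __name__ == "__main__":
    main(sys.argv)
```

Passing the clients in as arguments keeps the AWS calls in one place and makes the helpers easy to try with stand-in objects. The sketch only submits the command; checking whether it succeeded on each node is left out.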

To run it:

python emr_install.py [cluster_id]
