Question:
I’m having a surprisingly hard time working with additional libraries via my EMR notebook. The AWS interface for EMR allows me to create Jupyter notebooks and attach them to a running cluster. I’d like to use additional libraries in them. SSHing into the machines and installing manually as ec2-user or root will not make the libraries available to the notebook, as it apparently uses the livy user. Bootstrap actions install things for hadoop. I can’t install from the notebook because its user apparently doesn’t have sudo, git, etc., and it probably wouldn’t install to the slaves anyway.
What is the canonical way of installing additional libraries for notebooks created through the EMR interface?
Answer:
For the sake of an example, let’s assume you need the librosa Python module on a running EMR cluster. We’re going to use Python 2.7 because the procedure is simpler: Python 2.7 is guaranteed to be on the cluster and it is the default runtime for EMR.
Create a script that installs the package:
    #!/bin/bash
    sudo easy_install-2.7 pip
    sudo /usr/local/bin/pip2 install librosa

and save it to your home directory, e.g. /home/hadoop/install_librosa.sh. Note the name; we’re going to use it later.
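Because the script is later referenced by its path, it has to exist at that path on every node that runs it. Where copying it around is inconvenient, one option is to keep the commands in Python and pass them inline instead. The following is only a minimal sketch: script_to_commands is a hypothetical helper (not part of boto3), and it assumes a simple script with one command per line, whose result would be handed to send_command as the "commands" parameter of the AWS-RunShellScript document.

```python
def script_to_commands(script_text):
    """Turn shell script text into a list of command strings.

    Hypothetical helper: drops blank lines and comment lines (including the
    shebang), so it only suits simple scripts with one command per line.
    """
    commands = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            commands.append(stripped)
    return commands


# Same content as install_librosa.sh above.
INSTALL_SCRIPT = """#!/bin/bash
sudo easy_install-2.7 pip
sudo /usr/local/bin/pip2 install librosa
"""

commands = script_to_commands(INSTALL_SCRIPT)
# The list could then be passed as Parameters={"commands": commands}
# instead of ["bash /home/hadoop/install_librosa.sh"].
```

The rest of this answer keeps the file-based approach, so the script still needs to be saved under the path given above.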
In the next step you’re going to run this script through another script inspired by the Amazon EMR docs: emr_install.py. It uses AWS Systems Manager to execute your script on the core nodes.
    import sys
    import time

    from boto3 import client

    try:
        clusterId = sys.argv[1]
    except IndexError:
        print("Syntax: emr_install.py [ClusterId]")
        sys.exit(1)

    emrclient = client('emr')

    # Get list of core nodes
    instances = emrclient.list_instances(ClusterId=clusterId, InstanceGroupTypes=['CORE'])['Instances']
    instance_list = [x['Ec2InstanceId'] for x in instances]

    # Attach tag to core nodes
    ec2client = client('ec2')
    ec2client.create_tags(Resources=instance_list, Tags=[{"Key": "environment", "Value": "coreNodeLibs"}])

    ssmclient = client('ssm')

    # Run shell script to install libraries
    command = ssmclient.send_command(
        Targets=[{"Key": "tag:environment", "Values": ["coreNodeLibs"]}],
        DocumentName='AWS-RunShellScript',
        Parameters={"commands": ["bash /home/hadoop/install_librosa.sh"]},
        TimeoutSeconds=3600)['Command']['CommandId']

    # Give the command some time to run, then look up its current status
    time.sleep(30)
    command_status = ssmclient.list_commands(CommandId=command)['Commands'][0]['Status']

    print("Command:" + command + ": " + command_status)
To run it:
python emr_install.py [cluster_id]
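The script reads the command status only once, so a long install may still be reported as unfinished when it prints. If you would rather wait for the result, a polling loop is one option. This is a minimal sketch under stated assumptions: wait_for_command is a hypothetical helper (not part of boto3), the poll interval and limit are arbitrary, and it relies only on list_commands returning a 'Commands' list whose entries carry a 'Status' field.

```python
import time

# Statuses after which an SSM command will not change any more.
TERMINAL_STATUSES = {"Success", "Cancelled", "Failed", "TimedOut"}


def wait_for_command(ssmclient, command_id, poll_seconds=10, max_polls=360,
                     sleep=time.sleep):
    """Poll list_commands until the command reaches a terminal status.

    Hypothetical helper: returns the last status seen, which may still be
    non-terminal if max_polls is exhausted.
    """
    status = "Pending"
    for _ in range(max_polls):
        commands = ssmclient.list_commands(CommandId=command_id)["Commands"]
        if commands:
            status = commands[0]["Status"]
            if status in TERMINAL_STATUSES:
                return status
        sleep(poll_seconds)
    return status
```

In emr_install.py this could replace the single status lookup, e.g. command_status = wait_for_command(ssmclient, command).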