Question:
I am working with an AWS Data Pipeline that has a ShellCommandActivity whose script URI points to a bash file located in an S3 bucket. The bash file copies a python script from the same S3 bucket to an EmrCluster and then tries to execute that python script.
This is my pipeline export:
{
  "objects": [
    {
      "name": "DefaultResource1",
      "id": "ResourceId_27dLM",
      "amiVersion": "3.9.0",
      "type": "EmrCluster",
      "region": "us-east-1"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://project/bin/scripts/logs/",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "stage": "true",
      "scriptUri": "s3://project/bin/scripts/RunPython.sh",
      "name": "DefaultShellCommandActivity1",
      "id": "ShellCommandActivityId_hA57k",
      "runsOn": {
        "ref": "ResourceId_27dLM"
      },
      "type": "ShellCommandActivity"
    }
  ],
  "parameters": []
}
This is RunPython.sh:
#!/usr/bin/env bash
aws s3 cp s3://project/bin/scripts/Test.py ./
python ./Test.py
This is Test.py:
__author__ = 'MrRobot'

import re
import os
import sys
import boto3

print "We've entered the python file"
From the stdout log I get:
download: s3://project/bin/scripts/Test.py to ./
From the stderr log I get:
python: can't open file 'Test.py': [Errno 2] No such file or directory
I have also tried replacing python ./Test.py with python Test.py, but I get the same result.
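One more variant worth trying (a sketch on my part; /tmp is just an assumed writable path on the cluster) pins the script to an absolute location, so the copy destination and the path handed to python cannot diverge:

#!/usr/bin/env bash
# Copy to a fixed absolute path instead of the current directory, then
# invoke the interpreter with that same absolute path.
aws s3 cp s3://project/bin/scripts/Test.py /tmp/Test.py
python /tmp/Test.py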
How do I get my AWS Data Pipeline to execute my Test.py script?
EDIT
When I set scriptUri to s3://project/bin/scripts/Test.py I get the following errors:
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 1: author: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 2: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 3: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 4: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 5: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 7: print: command not found
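These errors suggest that, without a shebang line, the staged file is handed to the shell, which parses each Python statement as a shell command. A minimal way to reproduce that failure mode locally (my assumption, not taken from the logs):

# Executing a Python source file with bash makes the shell try each
# line as a command, producing "import: command not found" and similar.
bash Test.py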
EDIT 2
Added the following line to the top of Test.py:
#!/usr/bin/env python
Then I received the following error:
  line 6, in <module>
    import boto3
ImportError: No module named boto3
Using @franklinsijo's advice, I created a Bootstrap Action on the EmrCluster with the following value:
s3://project/bin/scripts/BootstrapActions.sh
This is BootstrapActions.sh:
#!/usr/bin/env bash
sudo pip install boto3
This worked!
Answer:
Configure the ShellCommandActivity as follows:
- Pass the S3 URI of the python file as the Script Uri.
- Add the shebang line #!/usr/bin/env python to the script.
- If any non-default python libraries are used in the script, install them on the target resource:
  - If runsOn is chosen, add the installation commands as a bootstrap action for the EMR resource.
  - If workerGroup is chosen, install all the libraries on the worker group before pipeline activation.
- Use either pip or easy_install to install the python modules (a sketch of such a bootstrap action follows this list).
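A concrete sketch of the bootstrap-action route, assuming runsOn points at an EmrCluster and that boto3 stands in for whatever non-default modules the script imports:

#!/usr/bin/env bash
# Hypothetical bootstrap action: install every non-default python module
# the script imports, before any pipeline activity runs on the cluster.
sudo pip install boto3
# Fallback for AMIs where pip is not on the path:
# sudo easy_install boto3

Save this to S3 (for example s3://project/bin/scripts/BootstrapActions.sh, as in the question) and reference it from the EmrCluster's bootstrap action field.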