How to make an AWS Data Pipeline ShellCommandActivity script execute a Python file

Question:

I am working with an AWS Data Pipeline that has a ShellCommandActivity whose Script Uri points to a bash file located in an S3 bucket. The bash file copies a Python script from the same S3 bucket to an EmrCluster and then tries to execute that Python script.

This is my pipeline export:

This is RunPython.sh:

This is Test.py

From the Stdout Log I get:

download: s3://project/bin/scripts/Test.py to ./

From the Stderr Log I get:

python: can't open file 'Test.py': [Errno 2] No such file or directory

I have also tried replacing python ./Test.py with python Test.py, but I get the same result.

How do I get my AWS Data Pipeline to execute my Test.py script?

EDIT

When I set scriptUri to s3://project/bin/scripts/Test.py, I get the following errors:

/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 1: author: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 2: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 3: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 4: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 5: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 7: print: command not found

EDIT 2

Added the following line to Test.py

Then I received the following error:

error: line 6, in import boto3 ImportError: No module named boto3

Using @franklinsijo's advice, I created a Bootstrap Action on the EmrCluster with the following value:

s3://project/bin/scripts/BootstrapActions.sh

This is BootstrapActions.sh:

This worked!!!!!!!

Answer:

Configure the ShellCommandActivity as follows:

  • Pass the S3 URI of the Python file as the Script Uri.
  • Add the shebang line #!/usr/bin/env python as the first line of the script.
  • If the script uses any non-default Python libraries, install them on the target resource.
    • If runsOn is used, add the installation commands as a bootstrap action for the EMR resource.
    • If workerGroup is used, install all the libraries on the worker group before activating the pipeline.
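
As a rough illustration of the runsOn variant of the list above, the sketch below builds a partial pipeline definition in Python and prints it as JSON. The S3 paths are the ones from the question; the object ids (EmrClusterExample, RunPythonExample) and the helper name build_definition are made up for this example, and a real definition would also need a Default object, a schedule, roles and cluster sizing, all of which are omitted here.

```python
import json


def build_definition(script_uri, bootstrap_uri):
    """Return a partial pipeline definition as a dict (illustrative only)."""
    cluster = {
        "id": "EmrClusterExample",      # hypothetical id
        "name": "EmrClusterExample",
        "type": "EmrCluster",
        # Script that installs the extra Python libraries on the cluster.
        "bootstrapAction": bootstrap_uri,
    }
    activity = {
        "id": "RunPythonExample",       # hypothetical id
        "name": "RunPythonExample",
        "type": "ShellCommandActivity",
        # Points straight at the Python file, which must start with a shebang.
        "scriptUri": script_uri,
        # Run the activity on the cluster defined above.
        "runsOn": {"ref": cluster["id"]},
    }
    return {"objects": [cluster, activity]}


if __name__ == "__main__":
    definition = build_definition(
        "s3://project/bin/scripts/Test.py",
        "s3://project/bin/scripts/BootstrapActions.sh",
    )
    print(json.dumps(definition, indent=2))
```

With workerGroup instead of runsOn, the activity would carry a workerGroup field and no cluster object, and the libraries would have to be installed on the workers by other means.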

Use either pip or easy_install to install the Python modules.
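
To make a missing library easier to spot in the Stderr Log, the script itself can check its imports up front. The following is only a sketch of that idea, not the asker's Test.py: the helper name missing_modules is hypothetical, and boto3 is used because it is the module that failed in the question. The first line is the shebang mentioned above, without which the task runner hands the file to the shell.

```python
#!/usr/bin/env python
"""Sketch: report which required modules are not installed on this node."""
import importlib


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Typically means the bootstrap action did not install the module.
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(["boto3"])
    if absent:
        print("Missing modules: " + ", ".join(absent))
    else:
        print("All required modules are available.")
```

If the check reports a module as missing, add the corresponding pip or easy_install command to the bootstrap action (or install it on the worker group) and rerun the pipeline.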
