How to upload files to a new EMR cluster


I want to create a new EMR cluster, run a PySpark job, and then destroy the cluster. Ideally I'd like to do this by adding a step when creating the cluster. The command I would run locally to start the job looks like this:

spark-submit --input x.csv --output output

What I don't understand is how I can make sure that the script is already available on the master node. I saw a reference to reading the Python script from an S3 bucket, but I couldn't get that to work.

Now I have separate commands for creating the cluster, putting the script on the master node, and adding the steps. The problem with this is that the cluster keeps running after the job step finishes.


I solved this by adding an extra step that just calls hadoop fs -copyToLocal to download the files onto the master node before the job step runs.

I couldn't do this in a bootstrap step because the hadoop command is not installed yet at that stage.

Full working example using boto3:
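What follows is a minimal sketch of the approach described above, not a definitive implementation. The bucket name, script name (job.py), release label, instance types and IAM role names are all assumptions made for illustration, and passing the input and output as S3 URIs is also an assumption. The first step copies the script to the master node with hadoop fs -copyToLocal, the second runs spark-submit on it, and KeepJobFlowAliveWhenNoSteps=False should make the cluster terminate once the steps finish.

```python
def build_steps(script_s3_uri, input_uri, output_uri, local_dir="/home/hadoop"):
    """Return two EMR steps: copy the script to the master node, then run it."""
    # Local path on the master node where the script will be placed.
    local_script = f"{local_dir}/{script_s3_uri.rsplit('/', 1)[-1]}"

    copy_step = {
        "Name": "Copy script to master node",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            # command-runner.jar runs an arbitrary command on the master node.
            "Jar": "command-runner.jar",
            "Args": ["hadoop", "fs", "-copyToLocal", script_s3_uri, local_script],
        },
    }
    spark_step = {
        "Name": "Run PySpark job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", local_script,
                     "--input", input_uri, "--output", output_uri],
        },
    }
    return [copy_step, spark_step]


def build_cluster_request(steps, log_uri):
    """Build the keyword arguments for run_job_flow (all values are assumptions)."""
    return {
        "Name": "pyspark-one-off",           # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",        # assumed release; adjust as needed
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",   # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # Shut the cluster down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": steps,
        "JobFlowRole": "EMR_EC2_DefaultRole",    # default role names, assumed to exist
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def launch_cluster(request, region="us-east-1"):
    """Create the cluster and return its id. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the builders above work without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    bucket = "s3://my-bucket"  # hypothetical bucket
    steps = build_steps(f"{bucket}/job.py", f"{bucket}/x.csv", f"{bucket}/output")
    request = build_cluster_request(steps, log_uri=f"{bucket}/logs/")
    print(launch_cluster(request))
```

Because both steps are submitted together with the cluster, there is no separate command for uploading the script, and the steps run in the order given.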
