Spark step on EMR hangs as “Running” after it finishes writing to S3


I’m running a PySpark 2 job on EMR 5.1.0 as a step. Even after the script finishes, with a _SUCCESS file written to S3 and the Spark UI showing the job as completed, EMR still shows the step as “Running”. I’ve waited over an hour to see whether Spark was just cleaning up after itself, but the step never changes to “Completed”. The last thing written in the logs is:

I didn’t have this problem with Spark 1.6. I’ve tried a bunch of different hadoop-aws and aws-java-sdk jars to no avail.

I’m using the default Spark 2.0 configuration, so I don’t think anything extra, such as metadata, is being written. The size of the data doesn’t seem to affect the problem either.


If you aren’t already doing so, you should stop your Spark context at the end of the script (sc.stop(), or spark.stop() if you are using a SparkSession).
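As a rough illustration, here is a minimal sketch of stopping the session in a finally block so it is stopped even if the job raises. It assumes the script uses a SparkSession; the names run_and_stop, build_session and job, the app name, and the S3 path are all placeholders I made up, not anything from the question. I can't promise this alone fixes the hanging step, but it rules out a context that was left open.

```python
def run_and_stop(build_session, job):
    """Run job(spark) and always stop the session afterwards."""
    spark = build_session()
    try:
        return job(spark)
    finally:
        # Stopping the session also stops the underlying SparkContext,
        # which lets the driver process shut down.
        spark.stop()


def build_session():
    # Imported here so run_and_stop can be exercised without Spark installed.
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName("example-job").getOrCreate()


def job(spark):
    # Hypothetical workload; replace the placeholder path with your own.
    df = spark.range(100)
    df.write.mode("overwrite").parquet("s3://your-bucket/output/")


if __name__ == "__main__":
    run_and_stop(build_session, job)
```

If your script creates a bare SparkContext instead, the same pattern applies with sc.stop() in the finally block.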

Also, if you are watching the Spark web UI in a browser, you should close it, as it sometimes keeps the Spark context alive. I recall seeing this on the Spark dev mailing list, but I can’t find the JIRA for it.

