I’ve been struggling to find out what is wrong with my Spark job, which hangs indefinitely when I try to write its output to either S3 or HDFS (~100 GB of data in Parquet format).
The line that causes the hang:
I have tried this in overwrite as well as append mode, and tried saving to HDFS and S3, but the job will hang no matter what.
In the Hadoop Resource Manager GUI, the state of the Spark application is shown as “RUNNING”, but Spark does not appear to be doing anything, and the Spark UI shows no running jobs.
The one thing that has gotten it to work is increasing the size of the cluster while it is in this hung state (I’m on AWS). It doesn’t matter whether I start the cluster with 6 workers and increase to 7, or start with 7 and increase to 8, which seems somewhat odd to me. The cluster is using all of the available memory in both cases, but I am not getting memory errors.
Any ideas on what could be going wrong?
Thanks for the help all. I ended up figuring out the problem was actually a few separate issues. Here’s how I understand them:
When I was saving directly to S3, the problem was the one Steve Loughran mentioned: renames on S3 are incredibly slow, so it looked like my cluster was doing nothing. On writes to S3, all the data is first written to temporary files and then “renamed” on S3. The problem is that renames on S3 don’t work like they do on a filesystem: each one is a copy followed by a delete, so it takes O(n) time in the size of the data. So all of my data was copied to S3, and then all of the time was spent renaming the files.
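To make the rename cost concrete, here is a toy sketch in plain Python. It is not real S3 or Spark code: `ToyObjectStore`, its methods, and the `out/_temporary/` key layout are all made up for illustration. It only models the idea that a store without a native rename has to copy every byte again.

```python
class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store without a native rename."""

    def __init__(self):
        self.objects = {}      # key -> bytes
        self.bytes_copied = 0  # bytes moved again by "renames"

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src_prefix, dst_prefix):
        # No atomic rename: each object under the prefix is copied to its
        # new key and the old key is deleted, so cost grows with data size.
        for key in [k for k in self.objects if k.startswith(src_prefix)]:
            data = self.objects[key]
            self.objects[dst_prefix + key[len(src_prefix):]] = data
            self.bytes_copied += len(data)
            del self.objects[key]


store = ToyObjectStore()
for i in range(4):
    # Pretend each task wrote a 1000-byte part file to a temporary location.
    store.put(f"out/_temporary/part-{i:05d}.parquet", b"x" * 1000)

# "Committing" the output moves every byte a second time.
store.rename_prefix("out/_temporary/", "out/")
print(sorted(store.objects))
print(store.bytes_copied)  # 4000
```

With ~100 GB of output, that second pass happens after the Spark tasks have finished, which would match a job that still shows as running while the Spark UI looks idle.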
The other problem I faced was with saving my data to HDFS and then moving it to S3 via s3-dist-cp. All of my cluster’s resources were being used by Spark, so when the Application Master tried to give resources to the s3-dist-cp job that moves the data, it was unable to. The move couldn’t happen because of Spark, and Spark wouldn’t shut down because my program was still trying to copy data to S3 (so they were deadlocked).
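The deadlock can be sketched with a toy model as well. This is plain Python, not YARN's actual scheduler: `ToyCluster` and the one-container-per-worker sizing are assumptions for illustration only.

```python
class ToyCluster:
    """Hypothetical resource pool: one container per worker, for simplicity."""

    def __init__(self, total_containers):
        self.total = total_containers
        self.used = 0

    def request(self, n):
        # Grant n containers only if enough are free; otherwise the
        # request stays pending (modelled here by returning False).
        if self.used + n <= self.total:
            self.used += n
            return True
        return False

    def add_worker(self, containers=1):
        self.total += containers


cluster = ToyCluster(total_containers=6)
spark_granted = cluster.request(6)  # Spark takes every container
copy_granted = cluster.request(1)   # the copy job cannot start

cluster.add_worker()                # resize the cluster from 6 to 7
copy_granted_after_resize = cluster.request(1)

print(spark_granted, copy_granted, copy_granted_after_resize)  # True False True
```

If this model is roughly right, it would also explain why adding a single worker unblocked things regardless of the starting size: the new node is the only capacity Spark isn't already holding.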
Hope this can help someone else!