S3A: fails while S3: works in Spark EMR


I’m using EMR 5.5.0 with Spark. If I write a simple file to S3 using an s3://... URL, it writes fine. But if I use an s3a://... URL, it fails with: Service: Amazon S3; Status Code: 403; Error Code: AccessDenied

Using the AWS command line I can cp, mv, and rm any file in the path I’m writing to. But from Spark, s3a fails on the PUT request.

We have server-side encryption enabled, and I know Spark handles it, because the s3:// URLs work. Any ideas?
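One thing worth ruling out: on buckets with SSE-KMS (rather than plain SSE-S3), s3a has to be told the encryption mode explicitly or PUTs can come back 403. A hedged sketch, assuming a recent enough Apache Hadoop build that supports these s3a properties (the job file name and KMS key ARN are placeholders):

```shell
# Sketch: pass s3a server-side-encryption settings through Spark.
# For an SSE-S3 (AES256) bucket:
spark-submit \
  --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256 \
  my_job.py

# For an SSE-KMS bucket, also name the key (ARN below is a placeholder;
# the .key property only exists on newer Hadoop releases):
spark-submit \
  --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=SSE-KMS \
  --conf spark.hadoop.fs.s3a.server-side-encryption.key=arn:aws:kms:us-east-1:111122223333:key/example-key-id \
  my_job.py
```

Spark forwards any `spark.hadoop.*` property into the Hadoop configuration, which is how the s3a client picks these up.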

Failed PUT DEBUG logs here. It may be important to note that I’m doing an rdd.saveAsTextFile(path), but the PUT request says it’s trying to write to /my-bucket/tmp/carlos/testWrite/4/_temporary/0/, which I thought only happened with Parquet. Not sure if that detail is relevant, but thought I would mention it.


s3a is the actively maintained S3 client in Apache Hadoop. AWS forked their own client off the Apache s3n:// client many years ago and (presumably) have massively reworked theirs since.

They can read and write the same data, but some parts of EMR expect extra methods in the filesystem client which only EMR’s s3 client supports; you cannot safely use s3a there.

There’s also the original ASF s3:// client, which is incompatible with everything else but was the first code used to connect Hadoop to S3, long before EMR was an Amazon product.

Which is better? As of Aug 2017, S3A is probably faster on aggressive read IO of columnar formats like ORC and Parquet, while EMR’s S3 client, with EMRFS, probably has the edge in resilience and consistency. The open source ASF S3A client is moving to address those gaps, though.
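Part of that columnar-read advantage comes from s3a’s random-access IO mode. A minimal sketch of enabling it, assuming Apache Hadoop 2.8+ (the property is marked experimental and may not exist on older builds; the job file name is a placeholder):

```shell
# Sketch: switch s3a to random-access IO, which avoids re-opening the
# whole object on every backward seek -- useful for ORC/Parquet reads.
spark-submit \
  --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random \
  my_job.py
```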
