Reading from multiple S3 Buckets in Spark

Question:

I have a Spark application running on a YARN cluster that needs to read files from multiple buckets on an S3-compatible object store, where each bucket has its own set of credentials.

According to the Hadoop documentation, it should be possible to specify credentials for multiple buckets by setting configuration entries of the form spark.hadoop.fs.s3a.bucket.<bucket-name>.access.key=<access-key> on the active SparkSession, but that has not worked for me in practice.

An example of what I tried, which according to the documentation I believe should work, is below.
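Something along these lines — the bucket names, endpoint, and keys are placeholders for my real values:

    import org.apache.spark.sql.SparkSession

    // Bucket names, endpoint, and keys are placeholders for the real values.
    val spark = SparkSession.builder()
      .appName("multi-bucket-read")
      // per-bucket credentials for the first bucket
      .config("spark.hadoop.fs.s3a.bucket.bucket-a.access.key", "ACCESS_KEY_A")
      .config("spark.hadoop.fs.s3a.bucket.bucket-a.secret.key", "SECRET_KEY_A")
      // per-bucket credentials for the second bucket
      .config("spark.hadoop.fs.s3a.bucket.bucket-b.access.key", "ACCESS_KEY_B")
      .config("spark.hadoop.fs.s3a.bucket.bucket-b.secret.key", "SECRET_KEY_B")
      // endpoint of the S3-compatible object store
      .config("spark.hadoop.fs.s3a.endpoint", "https://objectstore.example.com")
      .getOrCreate()

    // Both reads should resolve their own bucket-specific keys in one session.
    val dfA = spark.read.parquet("s3a://bucket-a/path/to/data")
    val dfB = spark.read.parquet("s3a://bucket-b/path/to/data")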

However, when submitted, the job fails with the following error:

    Exception in thread "main"
    com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403,
    AWS Service: Amazon S3, AWS Request ID: null, AWS Error Code: null,
    AWS Error Message: Forbidden, S3 Extended Request ID: null

The job does succeed when I instead set the global fs.s3a.access.key and fs.s3a.secret.key settings for one bucket at a time, but that forces the reads/writes to run sequentially.
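Roughly what that working but serial version looks like — bucket names, keys, and staging paths are placeholders:

    // Point the global S3A keys at one bucket at a time; each read/write has to
    // finish before the keys can be switched, so nothing runs in parallel.
    val hadoopConf = spark.sparkContext.hadoopConfiguration

    hadoopConf.set("fs.s3a.access.key", "ACCESS_KEY_A")
    hadoopConf.set("fs.s3a.secret.key", "SECRET_KEY_A")
    spark.read.parquet("s3a://bucket-a/path/to/data")
      .write.parquet("hdfs:///staging/bucket-a")

    hadoopConf.set("fs.s3a.access.key", "ACCESS_KEY_B")
    hadoopConf.set("fs.s3a.secret.key", "SECRET_KEY_B")
    spark.read.parquet("s3a://bucket-b/path/to/data")
      .write.parquet("hdfs:///staging/bucket-b")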

Answer:

    Exception in thread "main"
    com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403,
    AWS Service: Amazon S3, AWS Request ID: null, AWS Error Code: null,
    AWS Error Message: Forbidden, S3 Extended Request ID: null

A 403 Forbidden means the server understood the request but refuses to serve it.

Your S3 account does not have access permission for one of your multiple buckets. Please check the permissions and credentials configured for each bucket again.
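One way to narrow that down is to try each bucket on its own with the exact key pair meant for it; here is a rough sketch using the Hadoop FileSystem API (bucket names, keys, endpoint, and the canListBucket helper are hypothetical):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Rough check: can this key pair list this bucket at all?
    def canListBucket(bucket: String, accessKey: String, secretKey: String): Boolean = {
      val conf = new Configuration()
      conf.set("fs.s3a.endpoint", "https://objectstore.example.com") // placeholder endpoint
      conf.set(s"fs.s3a.bucket.$bucket.access.key", accessKey)
      conf.set(s"fs.s3a.bucket.$bucket.secret.key", secretKey)
      val fs = FileSystem.newInstance(new URI(s"s3a://$bucket/"), conf)
      try {
        fs.listStatus(new Path(s"s3a://$bucket/")) // throws on 403
        true
      } catch {
        case _: Exception => false
      } finally {
        fs.close()
      }
    }

    println(canListBucket("bucket-a", "ACCESS_KEY_A", "SECRET_KEY_A"))
    println(canListBucket("bucket-b", "ACCESS_KEY_B", "SECRET_KEY_B"))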

Another possible reason is a proxy issue.

If your cluster connects to AWS (or to your S3-compatible endpoint) through an HTTP proxy, make sure those proxy settings are right: export the proxy variables in the shell script that launches the job and pass the corresponding S3A proxy options to spark-submit.
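A rough sketch of what that could look like — the proxy host and port, application class, and jar name are placeholders, and the S3A connector reads its proxy settings from the fs.s3a.proxy.* options rather than from the environment variables:

    # Generic proxy variables that other tools on the host may pick up
    # (placeholders -- replace with your real proxy).
    export http_proxy=http://myproxy.example.com:8080
    export https_proxy=http://myproxy.example.com:8080

    # The S3A connector itself is configured through fs.s3a.proxy.*:
    spark-submit \
      --master yarn \
      --conf spark.hadoop.fs.s3a.proxy.host=myproxy.example.com \
      --conf spark.hadoop.fs.s3a.proxy.port=8080 \
      --class com.example.MyApp \
      my-application.jar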

Note: AFAIK, if your job runs on AWS EMR and the cluster already has S3 access, there is no need to set access keys every time, since the credentials are supplied implicitly.
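For example, a read like the sketch below would typically work on EMR with no keys configured anywhere, because the cluster's role supplies the credentials (bucket and path are placeholders):

    // No access/secret keys set anywhere; the instance role on the EMR nodes
    // provides the credentials implicitly.
    val df = spark.read.parquet("s3://my-emr-bucket/path/to/data")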
