PySpark S3 error: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException

Question:

I think I’m running into a jar incompatibility. I’m using the following artifacts to build a Spark cluster:

  1. spark-2.4.7-bin-hadoop2.7.tgz
  2. aws-java-sdk-1.11.885.jar
  3. hadoop-aws-2.7.4.jar

The code fails as soon as it executes the spark.read.format call. It appears the class cannot be found: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException.
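The full read is not shown in the post; a minimal sketch of the kind of call that triggers the error, assuming a CSV read from an s3a:// path (the bucket, prefix, and options below are placeholders, not the original values), would be:

    from pyspark.sql import SparkSession

    # Minimal sketch; the bucket and key below are placeholders.
    spark = SparkSession.builder.appName("s3a-read-test").getOrCreate()

    df = (
        spark.read.format("csv")            # the NoClassDefFoundError surfaces here,
        .option("header", "true")           # when the s3a filesystem initialises the AWS SDK
        .load("s3a://some-bucket/some/prefix/data.csv")
    )
    df.show(5)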

My spark-defaults.conf is configured as follows:

I would appreciate it if someone could help me. Any ideas?

Answer:

hadoop-aws 2.7.4 is built against aws-java-sdk 1.7.4, which isn’t fully compatible with newer versions, so if you use a newer aws-java-sdk, Hadoop can’t find the classes it needs. You have the following choices:

  • remove the explicit dependency on aws-java-sdk if you don’t need the newer functionality (see the sketch after this list)
  • compile Spark 2.4 against Hadoop 3 using the hadoop-3.1 profile, as described in the documentation
  • switch to Spark 3.0.x, which already ships a build with Hadoop 3.2
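As a concrete sketch of the first option (the coordinates and session settings below are an assumption, not something confirmed in the original post): let Spark resolve hadoop-aws 2.7.4 through spark.jars.packages so that the matching aws-java-sdk 1.7.4 is pulled in transitively, instead of placing a newer aws-java-sdk jar on the classpath by hand.

    from pyspark.sql import SparkSession

    # Sketch: resolve hadoop-aws 2.7.4 and its matching aws-java-sdk 1.7.4
    # transitively, rather than adding aws-java-sdk-1.11.x to the classpath.
    spark = (
        SparkSession.builder
        .appName("s3a-matching-versions")
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
        .getOrCreate()
    )

    # Placeholder path; the read no longer fails once the SDK versions match.
    df = spark.read.format("csv").option("header", "true").load("s3a://some-bucket/some/prefix/data.csv")

The same principle applies to the Spark 3 / Hadoop 3.2 route: pair hadoop-aws 3.2.x with the aws-java-sdk-bundle version it was built against, rather than an arbitrary newer SDK.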
