How to connect to Amazon Redshift or other DBs in Apache Spark?

Question:

I’m trying to connect to Amazon Redshift via Spark, so I can join data we have on S3 with data on our RS cluster. I found some very spartan documentation here for connecting to JDBC databases:

https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases

The load command seems fairly straightforward (although I don’t know how I would enter AWS credentials here, maybe in the options?).
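For reference, the load call from that guide looks roughly like the sketch below (the cluster endpoint, database, table, and credentials are all hypothetical placeholders). With Redshift the credentials in question are the database user and password rather than AWS access keys, and one common approach is to embed them in the JDBC URL itself:

    # A minimal sketch of the Spark 1.3-era load API. sqlContext is
    # predefined in the pyspark shell; all connection details below
    # are placeholders. Note these are database credentials, not AWS
    # access keys; embedding them in the JDBC URL is one option the
    # Redshift driver supports.
    df = sqlContext.load(
        source="jdbc",
        url=("jdbc:redshift://examplecluster.abc123.us-west-2"
             ".redshift.amazonaws.com:5439/dev"
             "?user=masteruser&password=secret"),
        dbtable="my_table",
    )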

And I’m not entirely sure how to deal with the SPARK_CLASSPATH variable. I’m running Spark locally for now through an IPython notebook (as part of the Spark distribution). Where do I define that so that Spark loads it?
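For what it’s worth, a minimal sketch of one way this is commonly done for a local PySpark session (the jar path is a placeholder for whichever Redshift JDBC driver you downloaded):

    # Export the variable before starting the shell so the driver jar
    # lands on the classpath (jar path is a placeholder). Newer Spark
    # versions prefer --driver-class-path over SPARK_CLASSPATH.
    export SPARK_CLASSPATH=/path/to/RedshiftJDBC41-1.1.0.jar
    ./bin/pyspark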

Anyway, when I try running these commands, I get a bunch of indecipherable errors, so I’m stuck for now. Any help or pointers to detailed tutorials are appreciated.

Answer:

Although this seems to be a very old post, for anyone who is still looking for an answer, the steps below worked for me!

Start the shell, including the Redshift JDBC driver jar:
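A sketch of what that launch can look like; the driver jar name and path are assumptions, so substitute whichever driver version you downloaded from AWS:

    # Launch PySpark with the Redshift JDBC driver on both the driver
    # and executor classpaths (jar name/path are placeholders):
    pyspark --jars /path/to/RedshiftJDBC42-1.2.7.1003.jar \
            --driver-class-path /path/to/RedshiftJDBC42-1.2.7.1003.jar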

Create a DataFrame by giving the appropriate connection details:
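A minimal sketch, assuming a hypothetical cluster endpoint, database, table, and credentials (the driver class shown is the one bundled with the JDBC 4.2 Redshift driver):

    # Spark 2.x: spark is the SparkSession predefined in the shell.
    # All connection details below are placeholders.
    url = ("jdbc:redshift://examplecluster.abc123.us-west-2"
           ".redshift.amazonaws.com:5439/dev")
    df = (spark.read
          .format("jdbc")
          .option("url", url)
          .option("dbtable", "my_table")
          .option("user", "masteruser")
          .option("password", "secret")
          .option("driver", "com.amazon.redshift.jdbc42.Driver")
          .load())
    df.show()  # quick sanity check: fetch a few rows

Once it loads, the DataFrame can be joined with data read from S3 like any other DataFrame.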

Spark Version: 2.2
