Question:
I’m trying to connect to Amazon Redshift via Spark, so I can join data we have on S3 with data on our Redshift cluster. I found some very spartan documentation here on connecting to JDBC data sources:
https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases
The load command seems fairly straightforward (although I don’t know how I would enter AWS credentials here, maybe in the options?).
df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")
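One thought, in case it helps frame the question: the PostgreSQL JDBC driver accepts credentials as query parameters on the URL, so maybe something like the following would work (host, database, user, and password are all placeholders, and I haven’t verified this against Redshift):

df = sqlContext.load(
    source="jdbc",
    url="jdbc:postgresql://redshift-host:5439/db_name?user=my_user&password=my_password",
    dbtable="schema.tablename",
)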
And I’m not entirely sure how to deal with the SPARK_CLASSPATH variable. I’m running Spark locally for now through an IPython notebook (as part of the Spark distribution). Where do I define that variable so that Spark picks up the driver?
Anyway, when I try running these commands, I get a bunch of undecipherable errors, so I’m stuck for now. Any help or pointers to detailed tutorials are appreciated.
Answer:
Although this seems to be a very old post, for anyone who is still looking for an answer, the steps below worked for me!
Start the shell with the JDBC driver jar on the classpath:
bin/pyspark --driver-class-path /path_to_postgresql-42.1.4.jar --jars /path_to_postgresql-42.1.4.jar
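If you are launching Spark from an IPython/Jupyter notebook rather than from bin/pyspark (as in the original question), one way to pass the same jar is to set the PYSPARK_SUBMIT_ARGS environment variable before the session is created. A minimal sketch, with the jar path as a placeholder:

import os
from pyspark.sql import SparkSession

# Placeholder path to the JDBC driver jar; must be set before the JVM starts.
jar = "/path_to_postgresql-42.1.4.jar"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path {0} --jars {0} pyspark-shell".format(jar)
)

spark = SparkSession.builder.getOrCreate()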
Create a DataFrame by supplying the appropriate connection details:
myDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:redshift://host:port/db_name") \
    .option("dbtable", "table_name") \
    .option("user", "user_name") \
    .option("password", "password") \
    .load()
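Once myDF is loaded, it behaves like any other DataFrame, so joining it with data read from S3 is straightforward. A minimal sketch, assuming the S3 connector is configured and using a made-up bucket path and join column:

# Hypothetical S3 path and join column, only to illustrate joining S3 data
# with the Redshift table loaded above.
s3DF = spark.read.parquet("s3a://my-bucket/path/to/data/")

joined = s3DF.join(myDF, on="id", how="inner")
joined.show()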
Spark Version: 2.2