I’m trying to use Glue for ETL on data I’m moving from RDS to Redshift.
As far as I am aware, Glue bookmarks only look for new rows using the specified primary key and does not track updated rows.
However that data I am working with tends to have rows updated frequently and I am looking for a possible solution. I’m a bit new to pyspark, so if it is possible to do this in pyspark I’d highly appreciate some guidance or a point in the right direction. If there’s a possible solution outside of Spark, I’d love to hear it as well.
You can use the query to find the updated records by filtering data at source JDBC database as shown below example. I have passed date as an argument so that for each run I can fetch only latest values from mysql database in this example.
query= "(select ab.id,ab.name,ab.date1,bb.tStartDate from test.test12 ab join test.test34 bb on ab.id=bb.id where ab.date1>'" + args['start_date'] + "') as testresult"
datasource0 = spark.read.format("jdbc").option("url", "jdbc:mysql://host.test.us-east-2.rds.amazonaws.com:3306/test").option("driver", "com.mysql.jdbc.Driver").option("dbtable", query).option("user", "test").option("password", "Password1234").load()