Unable to run scripts properly in AWS Glue PySpark Dev Endpoint

Question:

I’ve configured an AWS Glue dev endpoint and can connect to it successfully in a PySpark REPL shell, following this tutorial: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html

Unlike the example given in the AWS documentation, I receive WARN messages when I begin the session, and later on various operations on AWS Glue DynamicFrame structures fail. Here’s the full log from starting the session; note the errors about spark.yarn.jars and PyGlue.zip:

Many operations work as I expect, but I also hit some unwelcome exceptions. For example, I can load data from my Glue catalog and inspect its structure and the data within, but I can’t apply a Map to it or convert it to a DataFrame. Here’s my full execution run log (apart from the longest error message). The first few commands and all the setup work well, but the final two operations fail:
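In outline, the session looks like this (the database and table names here are placeholders standing in for my real catalog entries):

    >>> from awsglue.context import GlueContext
    >>> glueContext = GlueContext(sc)
    >>> # Placeholder names - substitute a real database and table from the catalog.
    >>> dyf = glueContext.create_dynamic_frame.from_catalog(
    ...     database="my_database", table_name="my_table")
    >>> dyf.printSchema()                  # works
    >>> dyf.show()                         # works
    >>> dyf.map(lambda rec: rec).show()    # fails with an exception
    >>> dyf.toDF().show()                  # fails with an exception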

I think I’m following the instructions given by Amazon in their Dev Endpoint REPL tutorial, but with these fairly simple operations (applying a Map and calling DynamicFrame.toDF) failing, I’m working in the dark when it comes to running the job for real. The real job seems to succeed, but my DynamicFrame.printSchema() and DynamicFrame.show() output doesn’t show up in the CloudWatch logs for the run.

Does anyone know what I need to do to fix my REPL environment so that I can properly test PySpark AWS Glue scripts?

Answer:

AWS Support finally responded to my query on this issue. Here’s the response:

On researching further, I found that this is a known issue with the PySpark shell, and the Glue service team is already working on it. The fix should be deployed soon; however, currently there’s no ETA that I can share with you.

Meanwhile, here’s a workaround: before initializing the Glue context, you can do
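In the REPL that amounts to something like the following; the spark.sql.catalogImplementation setting reflects my understanding of the underlying conflict, so treat this as a sketch rather than the official fix:

    >>> # Rebuild the SparkContext before creating the GlueContext.
    >>> # Assumption: the clash involves the Hive-backed Spark SQL catalog,
    >>> # so the new context uses the in-memory catalog instead.
    >>> newconf = sc._conf.set("spark.sql.catalogImplementation", "in-memory")
    >>> sc.stop()
    >>> sc = sc.getOrCreate(newconf)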

and then instantiate glueContext from that sc.

I can confirm this works for me; here’s a script I was able to run:
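Roughly, with placeholder database and table names standing in for my real ones:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    # Apply the workaround before creating the GlueContext: rebuild the
    # SparkContext with the in-memory Spark SQL catalog (see above).
    sc = SparkContext.getOrCreate()
    newconf = sc._conf.set("spark.sql.catalogImplementation", "in-memory")
    sc.stop()
    sc = SparkContext.getOrCreate(newconf)

    glueContext = GlueContext(sc)

    # Placeholder catalog names - substitute your own database and table.
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_database", table_name="my_table")

    dyf.printSchema()
    dyf.show()

    # These are the calls that previously failed in the REPL.
    def add_flag(rec):
        rec["flag"] = True
        return rec

    mapped = dyf.map(add_flag)
    mapped.toDF().show()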

Previously the .map() and .toDF() calls would fail.

I’ve asked AWS Support to notify me when this issue has been resolved so that the workaround is no longer required.
