Question:
I am relatively new to AWS and this may be a bit less technical question, but at present AWS Glue notes a maximum of 25 jobs permitted to be created. We are loading in a series of tables that each have their own job that subsequently appends audit columns. Each job is very similar, but simply changes the connection string source and target.
Is there a way to parameterize these jobs to allow for reuse and simply pass the proper connection strings to them? Or even possibly loop through a set connection strings in a master job that would call a child job passing the varying connection strings through?
Any examples or documentation would be most appreciated
Answer:
In the below example I present how to use Glue job input parameters in the code. This code takes the input parameters and it writes them to the flat file.
- Setting the input parameters in the job configuration.
- The code of Glue job
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
import sys from awsglue.transforms import * from awsglue.utils import getResolvedOptions from pyspark.context import SparkContext from awsglue.context import GlueContext from awsglue.job import Job ## @params: [JOB_NAME] args = getResolvedOptions(sys.argv, ['JOB_NAME']) sc = SparkContext() glueContext = GlueContext(sc) spark = glueContext.spark_session job = Job(glueContext) args = getResolvedOptions(sys.argv, ['JOB_NAME','VAL1','VAL2','VAL3','DEST_FOLDER']) job.init(args['JOB_NAME'], args) v_list=[{"VAL1":args['VAL1'],"VAL2":args['VAL2'],"VAL3":args['VAL3']}] df=sc.parallelize(v_list).toDF() df.repartition(1).write.mode('overwrite').format('csv').options(header=True, delimiter = ';').save("s3://"+ args['DEST_FOLDER'] +"/") job.commit() |
- There is also possible to provide input parameters during using boto3, CloudFormation or StepFunctions. This example shows how to do it by using boto3.
1 2 3 4 5 6 7 8 9 10 11 12 |
import boto3 def lambda_handler(event, context): glue = boto3.client('glue') myJob = glue.create_job(Name='example_job2', Role='AWSGlueServiceDefaultRole', Command={'Name': 'glueetl','ScriptLocation': 's3://aws-glue-scripts/example_job'}, DefaultArguments={"VAL1":"value1","VAL2":"value2","VAL3":"value3"} ) glue.start_job_run(JobName=myJob['Name'], Arguments={"VAL1":"value11","VAL2":"value22","VAL3":"value33"}) |
Useful links:
- https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-get-resolved-options.html
- https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html
- https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.create_job
- https://docs.aws.amazon.com/step-functions/latest/dg/connectors-glue.html