How to include an AWS Glue crawler in a Step Function

Question:

This is my requirement:
I have a crawler and a PySpark job in AWS Glue. I have to set up the workflow using Step Functions.

Questions:

  1. How can I add the crawler as the first state? What parameters do I need to provide (Resource, Type, etc.)?
  2. How do I make sure that the next state, the PySpark job, starts only once the crawler has run successfully?
  3. Is there any way to schedule the Step Functions state machine to run at a particular time?

Answer:

A few months late to answer this, but it can be done from within the step function itself.
You can create the following states to achieve it:

  • TriggerCrawler: Task state: Invokes a Lambda function in which you start the AWS Glue crawler using any of the AWS SDKs.
  • PollCrawlerStatus: Task state: Invokes a Lambda function that polls the crawler status and returns it in its response.
  • IsCrawlerRunSuccessful: Choice state: Branches on the crawler status returned by the previous state. Once the crawler state is ‘READY’, it goes to the state that triggers your Glue job; otherwise it goes to the Wait state for a few seconds before the status is polled again.
  • RunGlueJob: Task state: Invokes a Lambda function that triggers the Glue job.
  • WaitForCrawler: Wait state: Waits for ‘n’ seconds before the status is polled again.
  • Finish: Succeed state.
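The three Lambda functions behind the Task states could look roughly like the sketch below. It uses the boto3 Glue client calls `start_crawler`, `get_crawler` and `start_job_run`. The crawler name, the job name, the environment variable names and the optional `glue` parameter (added only so a stub client can be passed in for local testing) are assumptions for illustration, not part of the original setup. Note that a crawler state of ‘READY’ only means the crawler is idle, so the sketch also returns the status of the last crawl, which you may want to check for ‘SUCCEEDED’ as well.

```python
import os

# Hypothetical names (assumptions): replace with your own crawler and job.
CRAWLER_NAME = os.environ.get("CRAWLER_NAME", "my-crawler")
JOB_NAME = os.environ.get("JOB_NAME", "my-pyspark-job")


def glue_client():
    # Imported lazily so this module can be loaded where boto3 is not installed.
    import boto3
    return boto3.client("glue")


def trigger_crawler(event, context, glue=None):
    """TriggerCrawler: start the Glue crawler."""
    if glue is None:
        glue = glue_client()
    glue.start_crawler(Name=CRAWLER_NAME)
    return {"crawler": CRAWLER_NAME}


def poll_crawler_status(event, context, glue=None):
    """PollCrawlerStatus: return the crawler state and the last crawl result."""
    if glue is None:
        glue = glue_client()
    crawler = glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]
    return {
        # One of READY, RUNNING, STOPPING; READY means the crawler is idle.
        "state": crawler["State"],
        # SUCCEEDED, CANCELLED or FAILED; None if the crawler has never run.
        "lastCrawlStatus": crawler.get("LastCrawl", {}).get("Status"),
    }


def run_glue_job(event, context, glue=None):
    """RunGlueJob: start the Glue job and return its run id."""
    if glue is None:
        glue = glue_client()
    run = glue.start_job_run(JobName=JOB_NAME)
    return {"jobRunId": run["JobRunId"]}
```

Each function would be deployed as its own Lambda handler; Lambda calls it with `(event, context)` only, so the real Glue client is used there.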

Here is what this Step Function looks like:

[Image: diagram of the state machine]
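In text form, the same flow can be sketched as an Amazon States Language definition, built here as a Python dictionary and printed as JSON. This is only a sketch: the Lambda ARNs are placeholders, the 30-second wait is an arbitrary choice, and the Choice state assumes the polling Lambda returns an object with a `state` field.

```python
import json

# Placeholder ARN prefix (assumption): substitute your own region, account and function names.
ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"

definition = {
    "Comment": "Run a Glue crawler, wait until it is READY, then run a Glue job",
    "StartAt": "TriggerCrawler",
    "States": {
        "TriggerCrawler": {
            "Type": "Task",
            "Resource": ARN + "trigger_crawler",
            "ResultPath": "$.trigger",
            "Next": "PollCrawlerStatus",
        },
        "PollCrawlerStatus": {
            "Type": "Task",
            "Resource": ARN + "poll_crawler_status",
            # Store the Lambda response under $.crawler for the Choice state.
            "ResultPath": "$.crawler",
            "Next": "IsCrawlerRunSuccessful",
        },
        "IsCrawlerRunSuccessful": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.crawler.state",
                    "StringEquals": "READY",
                    "Next": "RunGlueJob",
                }
            ],
            # Any other state (RUNNING, STOPPING): wait and poll again.
            "Default": "WaitForCrawler",
        },
        "WaitForCrawler": {
            "Type": "Wait",
            "Seconds": 30,
            "Next": "PollCrawlerStatus",
        },
        "RunGlueJob": {
            "Type": "Task",
            "Resource": ARN + "run_glue_job",
            "ResultPath": "$.job",
            "Next": "Finish",
        },
        "Finish": {"Type": "Succeed"},
    },
}

# The printed JSON can be pasted into the state machine definition editor.
print(json.dumps(definition, indent=2))
```

The loop between PollCrawlerStatus, IsCrawlerRunSuccessful and WaitForCrawler is what keeps RunGlueJob from starting until the crawler has finished.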
