I have an application on an AWS EC2 instance that runs once daily. The application fetches some files from a web service, parses the files line by line, updates a database, updates S3 files based on changes in the database, sends notification emails to customers as well as a few other tasks.
This is a series of logical tasks that must take place in sequence, although some of the tasks can be thought of as sub-tasks that can be executed in parallel. All tasks are a combination of Perl scripts and Java programs, with a single Perl script acting as the manager that executes each in turn. Some tasks can take as long as 45 minutes to complete, and the whole process can take up to 3 hours in total.
I’d like to make this whole process serverless. My initial idea was to use AWS Lambda, whereby each task would execute as a Lambda function, until I discovered Lambda functions impose a 5 minute execution timeout. It seems like the AWS Step Functions service is actually a better fit for my use case, but my understanding is that this service is backed by Lambda, so the tasks will still have the 5 min execution limitation.
(I’m also aware that I would have to re-write my Perl scripts to a language supported by Lambda).
I assume that I can work around the execution time limit by refactoring my code into smaller functions that will guarantee to complete in under 5 minutes. In my particular situation though, this seems inefficient.
Currently the database update task processes lines from a file one at a time. For this to work with Lambda, a Lambda function would need to handle only a single line from the file (or a very small number of lines) in order to guarantee not spilling over 5 minutes execution time. This would involve opening and closing a connection with the database on every invocation of the Lambda function. Also, each line processed should result in an entry written to a file, to be stored in S3. Right now, I just keep a file handle in memory and write the file to S3 when all lines are processed, but with Lambda I would need to keep reading the file, updating it and writing it back to S3.
What I’m asking is:
- Is my use case a bad fit for AWS Lambda and/or AWS Step Functions?
- Have I misunderstood how these services work?
- Is there another AWS service that would be a better fit for my use case?
After further research, I think AWS Batch might be a good idea.
What you want are called Activity Workers. Tl;dr: You register “activities” and each gets an ARN. Then you can put that ARN in the resource field of Task states and then you run some code (the “worker”) somewhere (in a Lambda, on EC2, in your basement, wherever) that polls for tasks identified by that ARN, then calls back to report success or failure. Activity Workers can run for up to a year.
Step-by-step details at the AWS docs
In response to RTF’s comment, here’s a deeper dive: Suppose you have code to color turtles in color_turtles.pl. So what you do is call the CreateActivity API – see http://docs.aws.amazon.com/step-functions/latest/apireference/API_CreateActivity.html – giving the name “ColorTurtles” and it’ll give you back an ARN, a string beginning arn:aws… Then in your state machine you make a Task state with that ARN as the value of the resource field. Then you add code to color_turtles.pl to poll the service with http://docs.aws.amazon.com/step-functions/latest/apireference/API_GetActivityTask.html – whenever a machine you’re running gets to that task, it’ll go look for activity workers polling. It’ll give your polling worker the input for the task, then you process the input and generate some output, and call SendTaskSuccess or SendTaskFailure. All these are just REST HTTP calls, so you can run them anywhere and I mean anywhere; in a Lambda, on an EC2 instance, or on some computer anywhere on the Internet.