Amazon SQS with aws-sdk receiveMessage Stall


I’m using the aws-sdk node module with the (as far as I can tell) approved way to poll for messages.

Which basically sums up to:

receiveMessage is my “do stuff with collected messages if I have enough collected messages” function

Occasionally my script stalls because I get no response from Amazon at all. For example, when there are no messages in the queue to consume, instead of hitting the WaitTimeSeconds limit and returning a “no messages” object, the callback is never called.

(I’m chalking this up to Amazon weirdness.)

What I’m asking is: what’s the best way to detect and deal with this, given that I have some code in place to stop concurrent calls to receiveMessage?

The suggested answer here: Nodejs sqs queue processor also has code that prevents concurrent message requests (granted, it’s only fetching one message at a time).

I do have the whole thing wrapped in

(With a running = false after the delete loop (not in its callback))

My solution would be

But surely this would leave a pile of floating sqs.receive calls lurking about, and thus eat up a lot of memory over time?

(This job runs all the time, and I left it running on Friday, it stalled Saturday morning and hung till I manually restarted the job this morning)

Edit: I have seen cases where it hangs for ~5 minutes and then suddenly gets messages, but with a wait time of 20 seconds it should return “no messages” after 20 seconds. So a watchdog of ~10 minutes might be more practical (depending on the rest of one’s business logic).
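The watchdog idea can be sketched roughly as follows. This is a minimal sketch, not the code from the question: withWatchdog and the 'WatchdogTimeout' error code are hypothetical names, and it assumes the wrapped function takes (params, callback) the way sqs.receiveMessage does. It guarantees that the callback fires exactly once, either with the real response or with a timeout error, so a "running" flag can always be reset.

```javascript
'use strict';

// Wrap a Node-style (params, callback) function so that the callback fires
// exactly once: with the real result, or with a timeout error if nothing
// comes back within timeoutMs.
function withWatchdog(fn, timeoutMs) {
  return function (params, callback) {
    let settled = false;

    const timer = setTimeout(function () {
      if (settled) return;
      settled = true;
      const err = new Error('No response within ' + timeoutMs + ' ms');
      err.code = 'WatchdogTimeout'; // hypothetical code, not an AWS error code
      callback(err);
    }, timeoutMs);

    fn(params, function (err, data) {
      // A response that arrives after the watchdog fired is ignored here;
      // any messages in it are not deleted, so they should become visible
      // again once their visibility timeout expires.
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      callback(err, data);
    });
  };
}

// Usage (assumes `sqs` is an existing AWS.SQS client and `params` holds
// the usual QueueUrl / WaitTimeSeconds values):
//   const receive = withWatchdog(sqs.receiveMessage.bind(sqs), 10 * 60 * 1000);
//   receive(params, function (err, data) { /* reset running flag, poll again */ });
```

Note that this does not cancel the underlying request. An abandoned call keeps its closure alive until the request eventually completes or fails, so the number of floating calls should stay small as long as at most one new call is started per watchdog period.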

Edit: Yes Long Polling is already configured Queue Side.

Edit: This is under (latest) v2.3.9 of aws-sdk and Node.js v4.4.4


I’ve been chasing this (or a similar) issue for a few days now and here’s what I’ve noticed:

  • The receiveMessage call does eventually return, although only after 120 seconds.
  • Concurrent calls to receiveMessage are serialised by the AWS SDK library, so making multiple calls in parallel has no effect.
  • The receiveMessage callback does not error; in fact, after the 120 seconds have passed, it may contain messages.
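One thing that may be worth checking: the 120-second figure matches the default socket timeout in aws-sdk v2 (httpOptions.timeout, 120000 ms). Whether that is actually what is happening here is an assumption. If it is, setting the timeout a little above the long-poll wait might make a silent connection give up sooner. A minimal sketch follows; buildSqsOptions and the 5-second margin are illustrative, not part of the SDK.

```javascript
'use strict';

// Build SQS client options whose socket timeout sits just above the
// long-poll wait (WaitTimeSeconds), instead of the 120 s default.
// The margin is an arbitrary illustrative value.
function buildSqsOptions(waitTimeSeconds, marginMs) {
  return {
    httpOptions: {
      timeout: waitTimeSeconds * 1000 + marginMs
    }
  };
}

const sqsOptions = buildSqsOptions(20, 5000);

// Usage (assumes aws-sdk v2 is installed and region/credentials are
// configured elsewhere):
//   const AWS = require('aws-sdk');
//   const sqs = new AWS.SQS(sqsOptions);
```

With these values a socket that stays silent should time out after roughly 25 seconds rather than 120. The SDK may then retry the request or pass an error to the callback.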

What can be done about this? This sort of thing can happen for a number of reasons, and some (perhaps many) of them can’t necessarily be fixed. The answer is to run multiple services, each calling receiveMessage and processing the messages as they come; SQS supports this. At any time, one of these services may hit this 120-second lag, but the other services should be able to continue as normal.

My particular problem is that I have some critical singleton services that can’t afford 120 seconds of downtime. For these I will look into either 1) using HTTP instead of SQS to push messages into my service, or 2) spawning slave processes around each of the singletons to fetch the messages from SQS and push them into the service.
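Option 2 might look something like the sketch below. It is only an outline under stated assumptions: startFetchers, spawnFetcher and the ./sqs-fetcher.js script are hypothetical names, and each fetcher process is assumed to poll SQS on its own and hand every message to the parent with process.send. The parent listens for those messages and feeds them all into one handler, so a fetcher stuck in the 120-second lag does not block the others.

```javascript
'use strict';

// Start `count` fetcher processes and route everything they send into a
// single handler inside the singleton service.
// `spawnFetcher` is injected so the wiring is independent of how the
// children are created; it must return an object that emits 'message'
// (as the ChildProcess returned by child_process.fork does).
function startFetchers(count, spawnFetcher, onMessage) {
  const children = [];
  for (let i = 0; i < count; i++) {
    const child = spawnFetcher(i);
    child.on('message', onMessage);
    children.push(child);
  }
  return children;
}

// Usage (assumes a hypothetical ./sqs-fetcher.js that calls
// sqs.receiveMessage in a loop and forwards each message via process.send):
//   const childProcess = require('child_process');
//   startFetchers(3, () => childProcess.fork('./sqs-fetcher.js'), function (message) {
//     // push the message into the singleton service here
//   });
```

The sketch leaves out deleting messages after processing and restarting fetchers that exit; both would need handling in a real service.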
