I was trying to evaluate SNS for a realtime application i am building and needed really fast turn around time < 2 seconds in delivering the message.
Since i am located in APAC region, i have an SNS in Singapore which has a subscriber in Lambda in Us-east-1 location.
Given this setup i ran a code to try to figure out the latency in invoking lambda and do zero processing and just log the time. One might argue you have lambda invocation latency also accounted for in this instance. Which is true. I need Lambda to be invoked and executed and replied to within < 2 seconds.
I sent 23914 messages of which i have an average of 653.520 ms for transport + lambda invocation.
with peaks around 600995 ms (~ 10 minutes ) which is terrible latency for a technology like pubsub.
About 20117 messages got sent and received by lambda in < 653 ms, which means 3797 packets or 15% took more than the average time.
2958 messages or 12.36% took over 1 second to be executed.
379 messages or 1.59% took over 2 seconds to be invoked and executed ( which means 1.6% of my messages cannot be considered realtime and have to be ignored)
82 messages over 10 seconds
64 over 20 seconds
it goes on till ~ 45 seconds, after which the delay is 10 minutes. I have 3 packets with 10 minutes delay.
what bothers me is that about 2% ( if you include the processing time as well )of my messages cannot be processed in realtime for a tiny scale of ~24K messages.
In the scale calculation i am trying to present, requires me to process about 216 billion messages per month. At this scale i am worried that i will not be able to process 4.3 billion messages in realtime.
Given this experiement i am not sure how well SNS would scale. would the #of less than real time messages (read > 2 second delay) be more ? or would it decrease?
Now there might be a tendency to question my internet connection reliability, i re-did this experiment on EC2 and have got very similar results.
Infact the delays in time kind of matched around the same time.
- What are the SLA to SNS performance?
- Indirectly : how does these SLA translate to that of AWS Lambda services?
- Any reasons as to where these delays might be happening?
Most likely what happened here was throttling on the Lambda function. The default limit for concurrent Lambda invocations is 100. If you sent 20K messages, you likely exceeded that limit, despite the short runtime of the lambda. When your lambda functions are throttled when executing an SNS request, the request goes onto a retry queue and is re-executed up to 3 times, which often occur over a long period of time (up to an hour).
You can see the number of throttles in the CloudWatch metrics for the function (unfortunately, you ran your test before 6 months CloudWatch retention was released).