What are the correct CloudWatch/Auto Scaling settings for extremely short traffic spikes on Amazon Web Services?

Question:

I have a site running on Amazon Elastic Beanstalk with the following traffic pattern:

  • ~50 concurrent users normally.
  • ~2000 concurrent users for 1-2 minutes when a post is made to the Facebook page.

Amazon Web Services claims to be able to scale rapidly to meet challenges like this, but the “Greater than x for more than 1 minute” setup of CloudWatch doesn’t appear to be fast enough for this traffic pattern.

Usually within seconds all the EC2 instances crash, killing all CloudWatch metrics, and the whole site is down for 4-6 minutes. So far I have yet to find a configuration that works for this scenario.

Here is the graph of a smaller event that also killed the site:
[image: graph of the event, not preserved]

Answer:

The suggestion from AWS was as follows:

We are always working to make our systems more responsive, but it is
challenging to provision virtual servers automatically with a response
time of a few seconds as your use case appears to require. Perhaps
there is a workaround that responds more quickly or that is more
resilient when requests begin to increase.

Have you observed whether the site performs better if you use a larger
instance type or a larger number of instances in the steady state?
That may be one method to be resilient to rapid increases in inbound
requests. Although I recognize it may not be the most cost-effective,
you may find this to be a quick fix.

Another approach may be to adjust your alarm to use a threshold or a
metric that would reflect (or predict) your demand increase sooner.
For example, you might see better performance if you set your alarm to
add instances after you exceed 75 or 100 users. You may already be
doing this. Aside from that, your use case may have another indicator
that predicts a demand increase, for example a posting on your
Facebook page may precede a significant request increase by several
seconds or even a minute. Using CloudWatch custom metrics to monitor
that value and then setting an alarm to Auto Scale on it may also be a
potential solution.
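
As a rough sketch of the custom-metric idea in the reply above, the Python below publishes a count of recent posts linking to the site as a CloudWatch custom metric using boto3's `put_metric_data`. The metric name, namespace and dimension are hypothetical, and how the post count is obtained (for example by polling the Facebook page) is left to the caller, so treat this as an illustration rather than a tested setup.

```python
from datetime import datetime, timezone


def build_metric_datum(recent_posts, source="facebook"):
    """Build one CloudWatch datum counting recent posts that link to the site."""
    return {
        "MetricName": "RecentPostsLinkingToSite",  # hypothetical metric name
        "Dimensions": [{"Name": "Source", "Value": source}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(recent_posts),
        "Unit": "Count",
    }


def publish(recent_posts, source="facebook", namespace="MySite/Social"):
    """Send the datum to CloudWatch; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without it

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace=namespace,  # hypothetical namespace
        MetricData=[build_metric_datum(recent_posts, source)],
    )
```

Calling `publish(1)` as soon as a new post is detected would record the event; an alarm on this metric could then trigger a scaling policy earlier than one based on load alone.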

So I think the best answer is to run more instances in the steady state, even at low traffic, and to use custom metrics to predict traffic from an external source. For example, I am going to try monitoring Facebook and Twitter for posts with links to the site and scaling up straight away.
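
One way "scaling up straight away" might look is to skip the alarm and raise the Auto Scaling group's desired capacity directly when a post is detected. The sketch below sizes the group for the expected spike and calls boto3's `set_desired_capacity`. The per-instance capacity and the 25% headroom are assumptions for illustration only; real numbers would have to come from load testing, and the group name is whichever group the Elastic Beanstalk environment created.

```python
import math


def instances_needed(expected_users, users_per_instance, headroom=1.25):
    """Round up the instance count needed for a spike, with spare headroom."""
    return math.ceil(expected_users * headroom / users_per_instance)


def prescale(group_name, expected_users, users_per_instance):
    """Raise the Auto Scaling group's desired capacity ahead of the spike."""
    import boto3  # imported lazily; needs AWS credentials when called

    autoscaling = boto3.client("autoscaling")
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=instances_needed(expected_users, users_per_instance),
        HonorCooldown=False,  # act immediately rather than waiting out a cooldown
    )
```

If one instance could handle about 250 concurrent users (a made-up figure), `instances_needed(2000, 250)` gives 10 instances for the spike described in the question. The desired capacity has to fall within the group's minimum and maximum size, and new instances still take time to boot, so this only helps if the post is detected some time before the traffic arrives.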
