AWS Lambda and serverless applications are often relatively cheap to run. However, costs occasionally get out of hand. This blog post, which I co-wrote with Dan Pudwell, details one such case, in which costs kept rising to over $900 a month for a Lambda that was intended to run once every hour!
We started working with a new customer to improve their use of AWS and reduce costs. Given how little of their application was serverless, the Lambda costs really stood out when we analysed their AWS bill. The Cost Explorer snapshot for the Lambda service alone is shown below.
Examining the Costs
When we broke down Lambda costs per AWS account using Cost Explorer, it became clear that a dev account was responsible for a disproportionately large share of them. We also broke down costs by region to confirm that they came from eu-west-2 (the London region).
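The same breakdown can be scripted against the Cost Explorer API. The following is a minimal sketch, assuming a boto3 Cost Explorer client (`boto3.client("ce")`) is passed in. The function name `lambda_cost_by` and its arguments are illustrative and were not part of the original investigation. It ignores pagination for brevity.

```python
from collections import defaultdict


def lambda_cost_by(ce_client, dimension, start, end):
    """Sum AWS Lambda cost per value of `dimension` (e.g. LINKED_ACCOUNT or REGION).

    `start` and `end` are ISO dates (YYYY-MM-DD); `end` is exclusive.
    """
    response = ce_client.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        # Restrict the report to the Lambda service only
        Filter={"Dimensions": {"Key": "SERVICE", "Values": ["AWS Lambda"]}},
        GroupBy=[{"Type": "DIMENSION", "Key": dimension}],
    )
    totals = defaultdict(float)
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            # Cost Explorer returns amounts as strings
            amount = group["Metrics"]["UnblendedCost"]["Amount"]
            totals[group["Keys"][0]] += float(amount)
    # Largest spender first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Calling it once with `"LINKED_ACCOUNT"` and once with `"REGION"` gives the two views described above.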
We used CloudWatch metrics to investigate which Lambdas were executing in that dev account. For this purpose, the most useful metrics were “Invocations” and “Duration” per function (summed over one hour). This showed that one Lambda was primarily responsible for the costs. The graphs over 24 hours are shown below.
To summarise the facts, this one Lambda function was invoked around 15,000 times per hour and ran for a total of 635,000,000 milliseconds every hour, which is roughly 176 hours of execution per hour.
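For reference, the hourly sums can also be pulled programmatically. This is a rough sketch, assuming a boto3 CloudWatch client (`boto3.client("cloudwatch")`) is passed in. The helper name `hourly_sums` is hypothetical.

```python
def hourly_sums(cloudwatch_client, function_name, metric_name, start, end):
    """Return (timestamp, sum) pairs per hour for one Lambda metric.

    `metric_name` is "Invocations" or "Duration"; `start` and `end` are datetimes.
    """
    response = cloudwatch_client.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric_name,
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=start,
        EndTime=end,
        Period=3600,  # one-hour buckets, as in the graphs
        Statistics=["Sum"],
    )
    # Datapoints are not guaranteed to arrive in time order, so sort them
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Sum"]) for p in points]
```

Running this for each function over a 24-hour window makes an outlier like the one described here easy to spot.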
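The arithmetic behind those figures is simple to check. The short sketch below uses only the two numbers quoted above. The function name `summarise` is just for illustration.

```python
def summarise(total_ms_per_hour, invocations_per_hour):
    """Convert the hourly CloudWatch sums into more readable figures."""
    total_seconds = total_ms_per_hour / 1000
    return {
        "execution_hours_per_hour": total_seconds / 3600,
        "mean_seconds_per_invocation": total_seconds / invocations_per_hour,
    }


stats = summarise(635_000_000, 15_000)
print(round(stats["execution_hours_per_hour"], 1))     # 176.4
print(round(stats["mean_seconds_per_invocation"], 1))  # 42.3
```

The mean of about 42 seconds per invocation is worth noting, because it suggests that a minority of very slow invocations accounted for much of the total.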
Bravo, Lambda, for your incredible scaling skills. It would have been fantastic if the function had been doing anything worthwhile, but sadly, this Lambda wasn’t.
We now fully understood the costs and how they added up to a significant sum. The next step was to determine why the Lambda ran so often and for so long.
A CloudWatch Event rule was configured to invoke this function once per hour. So how was it being invoked so frequently?
As it turns out, the problem was in the code itself. The code’s goal is to delete empty log streams. This Lambda is needed because, when a retention period is configured on a log group, CloudWatch deletes the expired log events but leaves the empty log streams behind.
The function invoked itself once per Log Group in the account, distributing the work in a fan-out style. Every month, more log groups were created (including each time the CI/CD process ran), resulting in thousands of Lambda invocations.
Why does this add up to 176 hours of execution each hour? Most invocations were relatively quick: p10 (the 10th percentile) took 0.9 seconds and p50 took 3.5 seconds. However, the slowest invocations took far longer, with p99 at 617 seconds. The concurrent Lambda invocations were calling the CloudWatch API far too often and were being rate-limited as a result. Because the Python boto3 library automatically backs off and retries, a throttled invocation can take a long time to finish.
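To see how retries inflate run time, consider a capped exponential backoff. The sketch below is purely illustrative. The `base` and `cap` values are assumptions and not boto3's actual retry schedule, which also adds random jitter.

```python
def total_backoff_seconds(retries, base=1.0, cap=20.0):
    """Total time spent sleeping across `retries` capped exponential delays.

    The delay before retry n (starting at 0) is min(cap, base * 2**n).
    """
    return sum(min(cap, base * 2 ** attempt) for attempt in range(retries))


print(total_backoff_seconds(4))  # 15.0 (1 + 2 + 4 + 8)
print(total_backoff_seconds(6))  # 51.0 (1 + 2 + 4 + 8 + 16 + 20)
```

A single invocation makes many API calls, so if a large share of them are throttled and retried, the waiting time accumulates and is billed as Lambda duration.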
Beyond resolving this particular case, there are broader measures that help prevent it from happening again.
The immediate problem was resolved by deleting the empty log groups and adding a Lambda to automate that deletion, so that the problem should not recur. Rewriting the Lambda so that it is not recursive would be a longer-term improvement. Set CloudWatch Logs retention periods so that old data is removed from the log streams.
As you can see from the image above, a Log Group consists of several Log Streams, which in turn contain the Log Events. Retention periods are always a good idea, but they can leave empty log streams behind, which clutter up our Log Groups. Creating Log Groups with CloudFormation means they are removed along with the rest of the stack when no longer required.
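A non-recursive version of the cleanup could walk every log group and stream sequentially within a single invocation. The following is a sketch and not the customer's code. It assumes a boto3 CloudWatch Logs client (`boto3.client("logs")`) is passed in. It also assumes, as a heuristic, that a stream is empty once its last event is older than the group's retention period. The function names are hypothetical.

```python
import time

MS_PER_DAY = 24 * 60 * 60 * 1000


def is_expired(stream, retention_days, now_ms):
    """Heuristic (an assumption): the newest event is older than the retention window."""
    last_seen = stream.get("lastEventTimestamp", stream.get("creationTime", now_ms))
    return last_seen < now_ms - retention_days * MS_PER_DAY


def delete_empty_streams(logs_client, now_ms=None):
    """Delete expired streams group by group; returns (group, stream) pairs deleted."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    deleted = []
    groups = logs_client.get_paginator("describe_log_groups")
    for group_page in groups.paginate():
        for group in group_page["logGroups"]:
            retention = group.get("retentionInDays")
            if retention is None:
                continue  # no retention period set, so events never expire
            name = group["logGroupName"]
            streams = logs_client.get_paginator("describe_log_streams")
            # Collect first, then delete, to avoid mutating while paginating
            expired = [
                stream["logStreamName"]
                for page in streams.paginate(logGroupName=name)
                for stream in page["logStreams"]
                if is_expired(stream, retention, now_ms)
            ]
            for stream_name in expired:
                logs_client.delete_log_stream(logGroupName=name, logStreamName=stream_name)
                deleted.append((name, stream_name))
    return deleted
```

Working sequentially keeps the number of concurrent CloudWatch API callers at one, which trades speed for far less throttling. With a very large number of log groups, a single invocation could still hit the Lambda timeout, so the work might need to be split into batches.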
This customer had not configured AWS Budgets for billing alerts. We created a budget based on anticipated spending and set alerts for actual and forecast costs at 90%, 100%, 125%, 150%, and 200%.
Cost Anomaly Detection was also turned on. This reduces the likelihood of nasty surprises at the end of the month. Even though it might not always help, such as when costs increase gradually, it is still a good idea to turn it on.
Setting appropriate timeouts for Lambda functions is crucial. There is no need to give a function a 15-minute timeout if you expect it to finish in about 1 second.
Last but not least, all AWS accounts are configured to support 1,000 concurrent invocations by default. We asked AWS Support to lower this limit on the development account. However, AWS does not offer that (please raise this with AWS Support to add your support for this feature request!).
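As a small sketch of tightening a timeout, the helper below wraps the Lambda `update_function_configuration` call. It assumes a boto3 Lambda client (`boto3.client("lambda")`) is passed in. The helper name and the function name used with it are hypothetical.

```python
MAX_LAMBDA_TIMEOUT_SECONDS = 900  # Lambda's 15-minute ceiling


def set_timeout(lambda_client, function_name, timeout_seconds):
    """Set a function's timeout, rejecting values Lambda would not accept."""
    if not 1 <= timeout_seconds <= MAX_LAMBDA_TIMEOUT_SECONDS:
        raise ValueError(
            f"timeout must be between 1 and {MAX_LAMBDA_TIMEOUT_SECONDS} seconds"
        )
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        Timeout=timeout_seconds,
    )
```

A reasonable starting point is a timeout a few times larger than the slowest duration you expect, so that a runaway invocation is cut off early instead of running up the bill.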
The irony in this situation is that this Lambda was added to save money by removing empty log streams, yet it cost the client thousands of dollars.
Lambda has incredible scalability and offers pay-as-you-go pricing. Costs can therefore be unpredictable. By putting the proper monitoring and alerting in place, you can prevent bill shock at the end of the month.
Moreover, exercise caution when implementing and testing Lambda functions. Rate-limit handling is a good example: an approach that works at small scale may stop working when things scale up. Check your downstream dependencies because not everything can scale as well as Lambda (even in AWS).
Concerned about the expense of the cloud?
Not sure that you understand all of your cloud spending? Let our experts help you reduce your costs. As an AWS Advanced Consulting Partner and FinOps Certified Service Provider, we’ll quickly raise your cloud cost maturity.