Lambda Scales Automatically, but Within Limits

Serverless is sold as a system that scales itself: you write the function, and as traffic grows AWS Lambda provisions whatever capacity it takes to keep up. That is true for the majority of workloads, and it is one of the better reasons to reach for Lambda. But the scaling is not unbounded. Lambda’s autoscaling is limited on two independent axes, and both are easy to miss until real traffic hits them:

How many invocations can run at the same time.
How fast Lambda is allowed to add capacity when traffic jumps.

Both limits are scoped per region, one to your account and one to each function. This post covers what concurrency actually means in Lambda, the two ceilings, the requests-per-second trap that catches short functions, and the levers you have to mitigate all of it.

What concurrency actually means

AWS defines concurrency as the number of in-flight requests that your function is handling at the same time. For each concurrent request, Lambda provisions a separate instance of the execution environment, and that instance is busy for the whole request: the Init phase where it sets up, then the Invoke phase where your handler runs. Shutdown, when Lambda later tears the environment down, is not counted against your concurrency, because by then the environment is no longer handling a request.

The practical way to estimate concurrency is the formula AWS uses:

Concurrency = (average requests per second) * (average request duration in seconds)

A function taking 500 ms per request under 100 requests per second sits at a concurrency of 50. The same throughput against a 50 ms function needs a concurrency of 5. Duration, not just request rate, decides how much of your budget a workload consumes, and that distinction matters for both ceilings below.
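As a minimal sketch, the formula can be written as a small helper and applied to the two cases above. The function name `estimated_concurrency` is illustrative, not part of any AWS SDK.

```python
def estimated_concurrency(requests_per_second: float, avg_duration_seconds: float) -> float:
    """Concurrency = average requests per second * average request duration in seconds."""
    return requests_per_second * avg_duration_seconds


# The two cases from the text: same request rate, different durations.
slow = estimated_concurrency(100, 0.5)    # 500 ms function at 100 requests per second
fast = estimated_concurrency(100, 0.05)   # 50 ms function at 100 requests per second
print(f"500 ms function: {slow:.0f} concurrent, 50 ms function: {fast:.0f} concurrent")
```

The inputs are averages, so the result is an estimate of steady-state concurrency rather than a peak.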

Ceiling one: how many can run at once

The first limit is the account-level concurrent executions quota. By default Lambda gives an account 1,000 concurrent executions, and that pool is shared across every function in the region. It is not per function unless you carve it up deliberately. The default is also not guaranteed: newer accounts are sometimes provisioned with a lower starting quota, so it is worth checking yours rather than assuming 1,000.
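One way to check the quota is the Lambda `GetAccountSettings` API. The sketch below assumes a boto3 Lambda client is passed in; the helper name `account_concurrency` is made up for illustration, and the region in the usage comment is only an example.

```python
def account_concurrency(lambda_client):
    """Return (total quota, unreserved remainder) for the client's region.

    `lambda_client` is assumed to be a boto3 Lambda client, or any object
    exposing a compatible get_account_settings() method.
    """
    limits = lambda_client.get_account_settings()["AccountLimit"]
    return limits["ConcurrentExecutions"], limits["UnreservedConcurrentExecutions"]


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   total, unreserved = account_concurrency(boto3.client("lambda", region_name="us-east-1"))
#   print(f"quota={total}, unreserved={unreserved}")
```

The unreserved figure is what remains of the shared pool after any per-function reservations, so it is the number that matters for functions without a reservation of their own.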

When demand exceeds the available concurrency, Lambda throttles. Synchronous callers get an HTTP 429 with a TooManyRequestsException, and the request is rejected rather than queued. The function is not broken and nothing is misconfigured; you have simply asked for more simultaneous execution environments than your quota allows.

Ceiling two: how fast it can scale

The less obvious limit is the rate at which Lambda will add capacity. To protect against runaway scaling on a sudden burst, Lambda caps how quickly new execution environments come online. In each region, for each function, the concurrency scaling rate is 1,000 execution environment instances every 10 seconds, or, put in request terms, 10,000 requests per second every 10 seconds. Lambda refills this allowance continuously rather than in one block every 10 seconds, and it does not accrue: an idle 10-second window does not bank extra capacity for the next one. At any instant your headroom to scale a single function is at most those 1,000 units.

This is the limit that surprises people, because it bites even when you are nowhere near your concurrency quota. If traffic to a function jumps faster than 1,000 additional environments per 10 seconds can absorb, the requests that arrive ahead of the ramp are throttled, even though you have plenty of unused concurrency sitting in your account quota the whole time.
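To get a feel for the scale of the ramp, here is a deliberately coarse sketch. It treats the allowance as discrete blocks of 1,000 environments per 10-second window, which is a simplifying assumption: Lambda refills continuously, so real ramp timing will differ. The helper name and constants are illustrative only.

```python
import math

SCALING_STEP = 1_000   # environments a single function may add per window
WINDOW_SECONDS = 10    # length of one window, in seconds


def scaling_windows_needed(current_concurrency: int, target_concurrency: int) -> int:
    """Number of 10-second windows' worth of scaling allowance a jump consumes.

    Coarse estimate only: ignores continuous refill and the account quota.
    """
    extra = max(0, target_concurrency - current_concurrency)
    return math.ceil(extra / SCALING_STEP)


# A function sitting at 200 concurrent requests that suddenly needs 3,200.
windows = scaling_windows_needed(200, 3_200)
print(f"{windows} windows of {WINDOW_SECONDS} s worth of scaling allowance")
```

The point of the estimate is the order of magnitude: a jump of several thousand environments takes tens of seconds to absorb, however large the account quota is.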

The requests-per-second trap

A further limit travels with the concurrency quota: Lambda enforces a requests-per-second ceiling equal to 10 times the concurrency quota, at both the account and function level. With the default 1,000 concurrency, that is 10,000 requests per second.

For most functions this never binds, because duration keeps concurrency the active constraint. It binds for short functions. Take AWS’s own example: a function averaging 50 ms, receiving 20,000 requests per second. By the formula, that is a concurrency of 20,000 * 0.05 = 1,000, which fits inside the default quota. You would expect it to run clean. It does not, because the 10,000 requests-per-second limit throttles half the traffic even though the concurrency math fits. The fix is to raise the account concurrency quota to 2,000: the function does not need more simultaneous environments, but the higher quota lifts the requests-per-second ceiling to 20,000.

It is tempting to read the 10x rule as “each environment can do 10 requests per second,” and AWS explicitly calls that out as wrong. The limit is on overall concurrency and overall requests per second, not on what any single environment can push. A 50 ms environment can serve roughly 20 requests per second on its own; the cap is an account and function quota, not a per-instance speed limit.
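The interaction between the two limits can be sketched as taking the smaller of two ceilings: the throughput the concurrency quota allows at a given duration, and the 10x requests-per-second rule. The helper name `max_sustained_rps` is hypothetical, and the model ignores the scaling rate and any per-function reservations.

```python
def max_sustained_rps(concurrency_quota: int, avg_duration_seconds: float) -> float:
    """Highest steady request rate that stays under both ceilings."""
    by_concurrency = concurrency_quota / avg_duration_seconds  # from the concurrency formula
    by_rps_rule = 10 * concurrency_quota                       # the 10x requests-per-second limit
    return min(by_concurrency, by_rps_rule)


# 50 ms function: the 10x rule binds before concurrency does.
print(max_sustained_rps(1_000, 0.05))
# Same function after raising the quota to 2,000.
print(max_sustained_rps(2_000, 0.05))
# 500 ms function: concurrency binds long before the 10x rule.
print(max_sustained_rps(1_000, 0.5))
```

In this model the two ceilings cross at a duration of 100 ms, which is why the requests-per-second rule only shows up for functions shorter than that.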

Mitigating the limits

None of these ceilings is a wall you cannot move. The levers fall into making each environment do more, reserving capacity deliberately, and raising the quotas before you need them.

Make init and execution as fast as possible. A handler that finishes sooner frees its environment sooner, so the same physical capacity covers more requests and fewer cold starts are forced under load. This is where binary size, init work, and runtime choice pay off directly, which I measured in Part 2 and broke down by phase in Part 3 of the Rust on Lambda series.
Reserve concurrency for critical functions. Reserved concurrency sets both a floor and a hard cap for one function. The floor guarantees that other functions cannot starve it of the shared pool; the cap stops a single function from consuming the whole account quota and protects you from runaway cost on a misbehaving trigger. It counts against the account quota and cannot exceed it, so reserving for one function reduces what is left for the rest.
Request a quota increase ahead of known spikes. Raising the account concurrency quota also raises the 10x requests-per-second ceiling, so it addresses both ceiling one and the requests-per-second trap. Quota increases are not instant to apply, so the time to ask is before the launch or sale, not while the function is already throttling.
Retry with backoff on the trigger or consumer. Asynchronous and event-source invocations retry throttled requests automatically; synchronous callers in front of API Gateway need to handle the 429 with their own backoff. The same fail-fast reasoning from Part 4 applies: a throttle is a signal to slow down and retry, not to hold a slot waiting.
Keep production in its own AWS account. The concurrency quota is account-and-region wide, so a non-production load test sharing the account can eat the concurrency production needs. Separating production into its own account isolates its quota from everything else.
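For the retry lever, a synchronous caller might wrap its invocation in exponential backoff with jitter. This is a minimal sketch, not a prescribed implementation: `Throttled` is a hypothetical stand-in for whatever error your client raises on a 429, `invoke` is any zero-argument callable you supply, and the delay values are arbitrary defaults.

```python
import random
import time


class Throttled(Exception):
    """Hypothetical stand-in for a 429 / TooManyRequestsException."""


def call_with_backoff(invoke, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call invoke(), retrying on Throttled with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return invoke()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttle to the caller
            # Full jitter: wait a random time up to the exponential cap,
            # so throttled callers do not all retry at the same moment.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

The `sleep` parameter exists only so the delay can be swapped out, for example in a unit test. Bounding the number of attempts keeps the caller from holding on indefinitely while the function is throttled.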

Closing

Lambda’s autoscaling is genuinely good, and for steady workloads well under the quota you can ignore all of this. The ceilings show up at the edges: sustained high load runs into the account concurrency quota, a sharp spike runs into the 1,000-per-10-seconds scaling rate before the quota is even close, and short sub-100 ms functions run into the 10x requests-per-second limit while the concurrency math says they should fit. The work is matching the limit that binds to the traffic shape you actually have, and raising the relevant quota before the traffic arrives rather than during the incident.
