Load Shedding to Avoid Overload
Patterns for keeping services stable when offered load exceeds capacity
↗ AWS Builders' Library — Using load shedding to avoid overload
Without load shedding               With load shedding
─────────────────────────────────────────────────────────────────
Throughput ↑                        Throughput ↑
     │                                   │
     ▼                                   ▼
Latency climbs                      Latency climbs → shedding kicks in
     │                                   │
     ▼                                   ▼
Retries amplify load                Excess requests rejected cheaply
     │                                   │
     ▼                                   ▼
Cascading timeouts                  Accepted requests served quickly
     │                                   │
     ▼                                   ▼
Availability collapses to 0         Goodput plateaus near full capacity
Concept Goodput vs Throughput

Throughput is the total number of requests a server processes per second. Goodput is the subset of those requests that complete successfully and within a latency that is useful to the caller.

These two metrics diverge sharply during overload. A server can remain "busy" at high throughput while its goodput collapses toward zero — every request times out or errors, but the server is still churning through work.

Offered throughput →→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→
                         ↑ inflection point (queue fills, latency spikes)
Goodput (without shedding): ████████████▓▓▓░░░░░░ → 0
Goodput (with shedding):    ██████████████████████ → plateau

The key metric is client-perceived goodput
Server-side availability metrics look fine right up until the moment everything falls over. Measure what the client actually receives: successful responses within an acceptable latency window.
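To make the distinction concrete, here is a minimal sketch that computes both numbers from a batch of client-side request records. The `RequestRecord` type, the `throughput_and_goodput` helper, and the 500 ms client timeout are illustrative assumptions, not part of any particular metrics library.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    ok: bool            # the server returned a successful response
    latency_ms: float   # latency as observed by the client


def throughput_and_goodput(records, window_s, client_timeout_ms):
    """Return (throughput, goodput) in requests per second for one window."""
    throughput = len(records) / window_s
    # Only responses that succeeded AND arrived in time are useful to the caller.
    useful = sum(1 for r in records if r.ok and r.latency_ms <= client_timeout_ms)
    return throughput, useful / window_s


# Illustrative one-second window during overload (made-up numbers).
records = (
    [RequestRecord(ok=True, latency_ms=40.0)] * 20        # fast successes
    + [RequestRecord(ok=True, latency_ms=900.0)] * 50     # succeeded, but too late
    + [RequestRecord(ok=False, latency_ms=1000.0)] * 30   # errors and timeouts
)
throughput, goodput = throughput_and_goodput(
    records, window_s=1.0, client_timeout_ms=500.0
)
print(throughput, goodput)  # prints: 100.0 20.0
```

The server in this example looks busy at 100 requests per second, yet only 20 of them are of any use to a client.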
Concept Positive Feedback Loops

Overload is self-reinforcing. When a request times out, all the server's work on it is wasted. The client retries, adding more load. In a multi-service architecture this compounds at every hop.

Client A ──▶ Service 1 ──▶ Service 2 ──▶ Service 3
                 │              │              │
             times out      times out      times out
                 │              │              │
             retries ×3     retries ×3     retries ×3
                 │              │              │
              3× load        9× load        27× load   ← exponential amplification

This is why load shedding must happen early — at the edge, before work is started — rather than deep in the call chain after resources have already been consumed.

Retry amplification
In a three-hop service with each layer retrying three times on failure, a single overloaded downstream can generate 27× the original offered load. Retry budgets and exponential backoff are necessary but not sufficient; load shedding is the backstop.
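The 27× figure is simple compounding. The sketch below assumes each hop makes three attempts in total when its downstream keeps failing; if "retries ×3" is read as three retries on top of the first attempt, the per-hop factor would be 4 instead. The function name is hypothetical.

```python
def worst_case_amplification(attempts_per_hop: int, hops: int) -> int:
    """Worst-case load multiplier at the deepest service when every hop
    exhausts all of its attempts against a failing downstream."""
    return attempts_per_hop ** hops


for hop in range(1, 4):
    # Load multiplier seen by the service at this depth in the call chain.
    print(hop, worst_case_amplification(3, hop))
```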
Technique Prioritization

When shedding is necessary, which requests to drop matters as much as how many. Prioritize by business criticality and operational necessity.

1. Health checks
   Load balancer pings must always be answered. Dropping them removes the host from rotation, making overload worse.
2. End / close operations
   For paired APIs (start/end, open/close), ending work frees resources, so prioritize it over starting new work.
3. Continuing paginated requests
   A client mid-pagination holds a cursor and expects the next page. It is cheaper to finish than to restart.
4. Human / interactive traffic
   A person waiting on a response has a hard latency expectation; automated clients can retry.
5. Batch / crawler traffic
   Background jobs and crawlers tolerate delay, so they are safe to shed first.

Finish what you start
Abandoning in-progress work mid-flight is expensive: it consumes resources without producing value. Where possible, complete requests that are already being processed before shedding new ones.
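One way to turn the priority order above into an admission rule is to give each class its own shedding threshold, so the lowest-priority traffic is dropped first as utilisation rises. This is a rough sketch: the `Priority` enum, the `SHED_ABOVE` thresholds, and the idea of a single utilisation figure between 0 and 1 are all assumptions chosen for illustration, and real values would need tuning per service.

```python
from enum import IntEnum


class Priority(IntEnum):
    HEALTH_CHECK = 1
    END_OPERATION = 2
    PAGINATION_CONTINUATION = 3
    INTERACTIVE = 4
    BATCH = 5


# Hypothetical thresholds: the utilisation (0.0 to 1.0) at or above which
# each class starts being shed. Lower-priority classes shed earlier.
SHED_ABOVE = {
    Priority.BATCH: 0.70,
    Priority.INTERACTIVE: 0.85,
    Priority.PAGINATION_CONTINUATION: 0.90,
    Priority.END_OPERATION: 0.95,
}


def admit(priority: Priority, utilisation: float) -> bool:
    """Decide whether to accept a new request at the current utilisation."""
    if priority is Priority.HEALTH_CHECK:
        return True  # never shed health checks
    return utilisation < SHED_ABOVE[priority]
```

At 75% utilisation this rule sheds batch traffic while still admitting interactive requests; only the decision for new arrivals is modelled, so work already in flight is left to finish.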
Technique Timeout Propagation

Clients can include a timeout hint (or deadline) in each request, telling the server how long they are willing to wait. If a request has already waited in the queue longer than the client's deadline, the server can drop it immediately instead of doing work whose result the client will never use.

Client sets deadline: "I need a response within 500ms"
  │
  ▼
Service A receives request at T=0
  Queued for 120ms → remaining = 380ms
  │ propagates remaining deadline downstream
  ▼
Service B receives request at T=120ms
  Remaining = 380ms → subtract own budget → propagate 250ms downstream
  │
  ▼
Service C: remaining = 50ms → if it can't respond in 50ms, drop it now rather than do the work

Clock skew matters
Propagating wall-clock deadlines requires synchronized clocks. Propagating remaining time (elapsed subtracted at each hop) is safer and avoids clock drift issues across hosts.
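The remaining-time approach can be sketched as follows. Each hop measures only its own elapsed time with a local monotonic clock, so no cross-host clock synchronization is needed. The function names, the `min_useful_ms` cutoff, and the `call_downstream` callback are hypothetical; real RPC frameworks carry the budget in their own request metadata.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when too little of the caller's budget is left to be useful."""


def remaining_budget_ms(budget_ms: float, received_at: float, now: float) -> float:
    """Budget left after subtracting local elapsed time (monotonic seconds)."""
    return budget_ms - (now - received_at) * 1000.0


def handle(budget_ms, received_at, call_downstream, min_useful_ms=50.0, now=None):
    """Drop the request if its budget is nearly spent; otherwise pass the
    remaining budget (not a wall-clock deadline) to the next hop."""
    now = time.monotonic() if now is None else now
    remaining = remaining_budget_ms(budget_ms, received_at, now)
    if remaining < min_useful_ms:
        raise DeadlineExceeded(f"only {remaining:.0f}ms left")
    return call_downstream(remaining)
```

With a 500 ms budget and 120 ms spent queued locally, the downstream call is handed roughly 380 ms, matching the first hop in the diagram above.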

Requests sitting in TCP buffers have already been waiting before the application sees them. Measure queue wait time as a baseline metric and discard requests whose age exceeds a configured threshold.

Technique Queue Management

Bound queue depth

Cap the number of requests waiting in any queue. When full, reject new arrivals immediately — a fast rejection is always better than a slow timeout.

Age-based eviction

Set a maximum age for queued requests. If a request has been waiting longer than the client's likely timeout, discard it — the result would be useless anyway.

LIFO ordering

Where the protocol allows, process the most recently arrived requests first. Fresh requests are more likely to still be within their client's timeout window.

Disable surge queues

Load balancer surge queues silently absorb excess traffic — servers see a steady rate but clients time out waiting. Configure spillover / fast-fail instead.

Load balancer queues hide overload
When a load balancer queues excess requests, servers report normal utilisation while clients experience growing latency. This delays the signal that load shedding or scaling is needed. Prefer fast-fail (immediate 503) over queueing at the load balancer layer.
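The first three techniques in this section (bounded depth, age-based eviction, LIFO ordering) can be combined in one small in-process queue. The sketch below is an illustration under assumptions: `SheddingQueue`, its parameters, and the injectable `clock` are invented names, and it is not thread-safe as written.

```python
import time
from collections import deque


class SheddingQueue:
    """Bounded request queue with age-based eviction and LIFO service order."""

    def __init__(self, max_depth: int, max_age_s: float, clock=time.monotonic):
        self._items = deque()        # (enqueued_at, request), oldest on the left
        self._max_depth = max_depth
        self._max_age_s = max_age_s
        self._clock = clock

    def offer(self, request) -> bool:
        """Enqueue the request, or return False at once if the queue is full."""
        if len(self._items) >= self._max_depth:
            return False             # fast rejection instead of a slow timeout
        self._items.append((self._clock(), request))
        return True

    def take(self):
        """Return the newest request that is still fresh, or None."""
        now = self._clock()
        # Age-based eviction: discard stale entries from the old end.
        while self._items and now - self._items[0][0] > self._max_age_s:
            self._items.popleft()
        if not self._items:
            return None
        return self._items.pop()[1]  # LIFO: most recent arrival first
```

A caller would translate a `False` from `offer` into an immediate rejection (for example an HTTP 503), and would normally count evicted entries as shed requests so they remain visible in metrics.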
Technique Layered Protection

Shedding should happen as early — and as cheaply — as possible. Each layer can reject excess traffic before it reaches the next, more expensive layer.

☁️ Cloud / Edge
   AWS WAF, API Gateway rate limiting, CloudFront: reject before traffic reaches your VPC
⚖️ Load Balancer
   Max connections (max_conns in NGINX), spillover fast-fail instead of surge queuing
🐧 Operating System
   iptables connection limits: reject at the kernel before any process code runs
🔧 Service Framework
   Application-level admission control with full context: priority, deadline, caller identity

Early rejection is cheapest, but least visible
Dropping at the OS or load balancer costs almost nothing but produces no useful instrumentation. Dropping at the service framework is slightly more expensive but gives you caller identity, operation name, and context to tune rules and alert on trends.
Operations Visibility & Instrumentation

Latency Metric Pollution

Shed requests fail fast and complete in microseconds. If they are included in latency percentiles, they drag the median down, making the service appear healthy while the latency of successful requests is terrible. Exclude shed requests from latency metrics, or track them in a separate histogram.

Without separation:
  p50 latency = 2ms   ← looks great! (shed requests are ~0ms, pulling median down)
  p99 latency = 8s    ← actual successful requests are suffering

With separation:
  Shed requests:       count=50k/s, p50=0.1ms
  Successful requests: count=10k/s, p50=45ms, p99=120ms   ← real picture
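The effect is easy to reproduce with a handful of samples. The sketch below uses made-up latencies and a hypothetical `split_latencies` helper; a production system would feed two separate histograms in its metrics library rather than sorting raw lists.

```python
from statistics import median


def split_latencies(samples):
    """Split (outcome, latency_ms) samples into shed and served latencies."""
    shed = [ms for outcome, ms in samples if outcome == "shed"]
    served = [ms for outcome, ms in samples if outcome != "shed"]
    return shed, served


# Illustrative samples: five shed requests for every served one.
samples = [("shed", 0.1)] * 50 + [("ok", 45.0)] * 9 + [("ok", 120.0)]

shed, served = split_latencies(samples)
mixed_p50 = median([ms for _, ms in samples])   # polluted by fast failures
served_p50 = median(served)                     # what served callers actually see
print(mixed_p50, served_p50)                    # prints: 0.1 45.0
```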

False Positive Rate

A non-zero false positive rate — legitimate requests being shed when the service is not actually overloaded — means either the shedding threshold is tuned too aggressively, or individual hosts are chronically overloaded. Strive for zero false positives; any false positive is a bug to investigate.

Instrumentation Checklist

Signal | Why it matters
Shed request count (by caller, by operation) | Identify who is being shed and tune priority rules
Queue wait time distribution | Baseline for setting age-based eviction thresholds
Successful request latency (shed excluded) | True health signal, not polluted by fast-fail completions
Goodput rate | Requests that completed successfully within client timeout
Shed reason breakdown | Queue full vs age expired vs priority vs capacity; helps tune each lever independently
Operations Auto-scaling Interactions

Load shedding and reactive auto-scaling can conflict in subtle ways.

Scaling signal suppression

If shedding kicks in at the same CPU threshold that triggers scaling, the scaler never sees elevated utilisation — it never launches new instances. Set the shedding threshold above the scaling trigger so scaling fires first.

AZ failure headroom

Services must be sized to absorb an entire Availability Zone going offline. If the fleet normally runs near its limit and relies on shedding to stay there, losing an AZ pushes it over. Shedding healthy traffic during a failure is the opposite of what you want.

Design for N+1 (or N+AZ) capacity
The fleet should handle peak load with one AZ removed, before shedding is engaged. Shedding is an emergency backstop, not a routine operating mode.
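As a rough sizing aid, the headroom requirement can be written as a small calculation. It assumes hosts are spread evenly across AZs and that each host has a known sustainable request rate; `hosts_per_az` and the example numbers are hypothetical.

```python
import math


def hosts_per_az(peak_rps: float, host_capacity_rps: float, az_count: int) -> int:
    """Hosts needed in each AZ so the fleet serves peak load with one AZ lost."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive losing one")
    surviving_azs = az_count - 1
    return math.ceil(peak_rps / (host_capacity_rps * surviving_azs))


# Example: 9000 rps peak, 500 rps per host, 3 AZs.
per_az = hosts_per_az(peak_rps=9000, host_capacity_rps=500, az_count=3)
print(per_az, per_az * 3)  # prints: 9 27
```

Without AZ headroom the same peak would need only 18 hosts, so in this example the extra 9 hosts are the price of keeping shedding out of the normal failure path.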
Operations Testing

Load shedding behaviour must be tested explicitly — it only activates under conditions you won't see in normal operation.

Test Type | What to Verify
Overload test | Drive far beyond capacity; confirm goodput plateaus rather than collapsing to zero
Fixed-fleet test | Remove auto-scaling; confirm a single host sheds gracefully at its individual capacity limit
AZ failure test | Remove one AZ's capacity; confirm the remaining fleet handles load without shedding healthy traffic
Retry amplification test | Inject upstream retries; confirm downstream shedding prevents cascade
Priority test | Send mixed traffic at overload; confirm high-priority requests succeed while low-priority are shed

Test beyond the break point
A test that never pushes the system to the point of shedding tells you nothing about shedding behaviour. Generate enough load to trigger shedding and verify the shape of the goodput curve: it should plateau, not drop.
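A load test can assert on the shape of the curve rather than eyeballing a graph. The check below is one possible sketch: `goodput_plateaus`, the 10% tolerance, and the sample curves are assumptions made up for illustration, not measured results.

```python
def goodput_plateaus(curve, tolerance=0.10):
    """curve: (offered_rps, goodput_rps) pairs sorted by offered load.

    Returns True if goodput never falls more than `tolerance` below the
    highest goodput seen at any lower offered load."""
    peak = 0.0
    for _, goodput in curve:
        peak = max(peak, goodput)
        if goodput < peak * (1.0 - tolerance):
            return False
    return True


# Made-up curves for illustration only.
with_shedding = [(500, 500), (1000, 990), (2000, 960), (4000, 950)]
without_shedding = [(500, 500), (1000, 980), (2000, 400), (4000, 20)]

print(goodput_plateaus(with_shedding), goodput_plateaus(without_shedding))
```

The check is only meaningful if the highest offered load in the curve is well past the point where shedding starts.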