Throughput is the total number of requests a server processes per second. Goodput is the subset of those requests that complete successfully and within a latency that is useful to the caller.
These two metrics diverge sharply during overload. A server can remain "busy" at high throughput while its goodput collapses toward zero — every request times out or errors, but the server is still churning through work.
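The distinction can be made concrete with a small sketch. The `Completed` record and the `useful_latency_s` cutoff below are illustrative assumptions, not part of any particular monitoring library.

```python
from dataclasses import dataclass

@dataclass
class Completed:
    ok: bool           # finished without an error
    latency_s: float   # seconds from arrival to response

def throughput_and_goodput(completions, window_s, useful_latency_s):
    """Return (throughput, goodput) in requests per second for one window."""
    total = len(completions)
    # Goodput counts only requests that succeeded AND were fast enough to be useful.
    good = sum(1 for c in completions if c.ok and c.latency_s <= useful_latency_s)
    return total / window_s, good / window_s

sample = [
    Completed(ok=True, latency_s=0.2),
    Completed(ok=True, latency_s=3.5),   # succeeded, but too late to be useful
    Completed(ok=False, latency_s=0.1),  # errored
    Completed(ok=True, latency_s=0.4),
]
throughput, goodput = throughput_and_goodput(sample, window_s=1.0, useful_latency_s=1.0)
```

In this sample the server processed four requests in the window, but only two of them count toward goodput.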
Overload is self-reinforcing. When a request times out, all the server's work on it is wasted. The client retries, adding more load. In a multi-service architecture this compounds at every hop.
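As a rough illustration of how retries compound across hops, the sketch below computes a worst-case multiplier. It assumes every hop retries independently and every attempt fails, which is a pessimistic bound rather than a prediction; the function name is hypothetical.

```python
def worst_case_amplification(attempts_per_hop: int, hops: int) -> int:
    """Upper bound on attempts reaching the deepest service per user request.

    attempts_per_hop includes the original try (1 original + N retries).
    Assumes each hop retries independently and all attempts fail.
    """
    return attempts_per_hop ** hops

# Three hops, each making up to three attempts.
multiplier = worst_case_amplification(attempts_per_hop=3, hops=3)
```

Under these assumptions a single user request can turn into up to 27 attempts against the deepest service.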
This is why load shedding must happen early — at the edge, before work is started — rather than deep in the call chain after resources have already been consumed.
When shedding is necessary, which requests to drop matters as much as how many. Prioritize by business criticality and operational necessity.
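One possible way to express priority-based shedding is a per-tier utilisation threshold, so that lower tiers are dropped first as load rises. The tier names and threshold values below are hypothetical and would need to reflect a service's own criticality rules.

```python
from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0     # e.g. operationally necessary traffic
    NORMAL = 1
    BEST_EFFORT = 2

# Hypothetical thresholds: the fraction of capacity in use at which a tier is shed.
SHED_AT = {
    Priority.BEST_EFFORT: 0.70,
    Priority.NORMAL: 0.85,
    Priority.CRITICAL: 1.00,
}

def should_shed(priority: Priority, utilisation: float) -> bool:
    """Return True if a request of this priority should be rejected right now."""
    return utilisation >= SHED_AT[priority]
```

With these values, at 75% utilisation only best-effort traffic is rejected, while critical traffic is served until the host is fully saturated.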
Clients can include a timeout hint (or deadline) in each request, telling the server how long they are willing to wait. If the request has already been queued longer than the client's deadline, the server can discard it immediately rather than do work whose result will be thrown away.
Requests sitting in TCP buffers have already been waiting before the application sees them. Measure queue wait time as a baseline metric and discard requests whose age exceeds a configured threshold.
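To establish that baseline, queue wait time can be recorded at the moment a request is dequeued. The recorder below is a hypothetical in-memory sketch using a nearest-rank percentile; a real service would more likely feed a metrics library, and time spent in kernel buffers before the application accepts the request is not visible to it.

```python
import math

class QueueWaitRecorder:
    """Collects per-request queue wait times and reports percentiles."""

    def __init__(self):
        self.waits_s = []

    def record(self, enqueued_at: float, dequeued_at: float) -> float:
        wait = dequeued_at - enqueued_at
        self.waits_s.append(wait)
        return wait

    def percentile(self, p: int) -> float:
        """Nearest-rank percentile (p in 1..100) of recorded waits."""
        ordered = sorted(self.waits_s)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

recorder = QueueWaitRecorder()
for wait in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    recorder.record(enqueued_at=0.0, dequeued_at=wait)
```

A high percentile of this distribution (for example the 90th) is one reasonable starting point when choosing a maximum-age threshold.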
Bound queue depth
Cap the number of requests waiting in any queue. When full, reject new arrivals immediately — a fast rejection is always better than a slow timeout.
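A bounded queue with immediate rejection might look like the sketch below, built on the standard library's `queue.Queue`. The `BoundedQueue` wrapper and its `offer` method are illustrative names.

```python
import queue

class BoundedQueue:
    """Queue that rejects new arrivals immediately once max_depth is reached."""

    def __init__(self, max_depth: int):
        self._q = queue.Queue(maxsize=max_depth)
        self.rejected = 0

    def offer(self, item) -> bool:
        """Enqueue without blocking; return False (fast rejection) when full."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def depth(self) -> int:
        return self._q.qsize()

q = BoundedQueue(max_depth=2)
results = [q.offer("a"), q.offer("b"), q.offer("c")]
```

The third arrival is turned away at once instead of waiting behind the first two.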
Age-based eviction
Set a maximum age for queued requests. If a request has been waiting longer than the client's likely timeout, discard it — the result would be useless anyway.
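Age-based eviction can be sketched as a sweep over the head of a FIFO queue. The `(enqueued_at, request)` tuple layout and the `max_age_s` parameter are assumptions for illustration.

```python
from collections import deque

def evict_stale(pending: deque, now: float, max_age_s: float) -> list:
    """Remove and return requests older than max_age_s.

    `pending` holds (enqueued_at, request) tuples, oldest first, so the sweep
    can stop at the first entry that is still fresh.
    """
    evicted = []
    while pending and (now - pending[0][0]) > max_age_s:
        evicted.append(pending.popleft()[1])
    return evicted

pending = deque([(0.0, "a"), (4.0, "b"), (9.0, "c")])
dropped = evict_stale(pending, now=10.0, max_age_s=5.0)
```

Here requests "a" and "b" have waited 10 and 6 seconds and are dropped, while "c" has waited 1 second and stays queued.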
LIFO ordering
Where the protocol allows, process the most recently arrived requests first. Fresh requests are more likely to still be within their client's timeout window.
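Python's standard library provides `queue.LifoQueue`, which is enough to illustrate the ordering. The request names and the `maxsize` value are placeholders.

```python
import queue

# Newest-first queue; maxsize keeps the depth bounded as well.
pending = queue.LifoQueue(maxsize=100)

for name in ("oldest", "middle", "newest"):
    pending.put_nowait(name)

# Requests come back in reverse arrival order.
order = [pending.get_nowait() for _ in range(3)]
```

The most recent arrival is served first, so under overload it is the oldest requests that wait longest and are the natural candidates for age-based eviction.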
Disable surge queues
Load balancer surge queues silently absorb excess traffic — servers see a steady rate but clients time out waiting. Configure spillover / fast-fail instead.
Shedding should happen as early — and as cheaply — as possible. Each layer can reject excess traffic before it reaches the next, more expensive layer.
max_conns in NGINX), spillover fast-fail instead of surge queuing
iptables connection limits: reject at the kernel before any process code runs
Latency Metric Pollution
A shed request that is fast-failed completes in microseconds. If these completions are included in latency percentiles, they drag the median down, making the service appear healthy while the latency of successful requests is terrible. Exclude shed requests from latency metrics, or track them in a separate histogram.
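The effect is easy to reproduce with made-up numbers. The latencies below are purely illustrative.

```python
import statistics

# Illustrative values: three slow successes and seven fast-failed (shed) requests.
served_latencies_s = [0.8, 0.9, 1.2]
shed_latencies_s = [0.0001] * 7

# Mixing both populations makes the median look excellent.
polluted_median = statistics.median(served_latencies_s + shed_latencies_s)

# Tracking served requests separately shows the real experience.
true_median = statistics.median(served_latencies_s)
```

The combined median is 0.1 ms, while the median for requests that were actually served is 0.9 s.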
False Positive Rate
A non-zero false positive rate — legitimate requests being shed when the service is not actually overloaded — means either the shedding threshold is tuned too aggressively, or individual hosts are chronically overloaded. Strive for zero false positives; any false positive is a bug to investigate.
Instrumentation Checklist
| Signal | Why it matters |
|---|---|
| Shed request count (by caller, by operation) | Identify who is being shed and tune priority rules |
| Queue wait time distribution | Baseline for setting age-based eviction thresholds |
| Successful request latency (shed excluded) | True health signal — not polluted by fast-fail completions |
| Goodput rate | Requests that completed successfully within client timeout |
| Shed reason breakdown | Queue full vs age expired vs priority vs capacity — helps tune each lever independently |
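A minimal way to capture the shed-count and shed-reason signals is a counter keyed by reason, caller, and operation. The reason strings, caller names, and function names below are hypothetical; a production service would typically emit these as labelled metrics instead.

```python
from collections import Counter

shed_counts = Counter()

def record_shed(reason: str, caller: str, operation: str) -> None:
    """Count one shed request under its (reason, caller, operation) key."""
    shed_counts[(reason, caller, operation)] += 1

def totals_by_reason() -> Counter:
    """Aggregate shed counts across callers and operations."""
    totals = Counter()
    for (reason, _caller, _operation), n in shed_counts.items():
        totals[reason] += n
    return totals

record_shed("queue_full", "checkout", "GetCart")
record_shed("queue_full", "search", "Query")
record_shed("age_expired", "search", "Query")
```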
Load shedding and reactive auto-scaling can conflict in subtle ways.
Scaling signal suppression
If shedding kicks in at the same CPU threshold that triggers scaling, shedding holds utilisation at or below that threshold, so the scaler never sees elevated utilisation and never launches new instances. Set the shedding threshold above the scaling trigger so scaling fires first.
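One way to guard against this misconfiguration is a startup check on the two thresholds. The function and parameter names are hypothetical, and CPU utilisation is assumed to be expressed as a fraction between 0 and 1.

```python
def validate_thresholds(scale_out_cpu: float, shed_cpu: float) -> None:
    """Raise ValueError unless scaling triggers strictly before shedding."""
    if not (0.0 < scale_out_cpu < shed_cpu <= 1.0):
        raise ValueError(
            f"shedding threshold ({shed_cpu}) must be above the "
            f"scale-out trigger ({scale_out_cpu}) and at most 1.0"
        )

# Acceptable: the scaler reacts at 60% CPU, shedding starts at 85%.
validate_thresholds(scale_out_cpu=0.60, shed_cpu=0.85)
```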
AZ failure headroom
Services must be sized to absorb an entire Availability Zone going offline. If the fleet normally runs close to its shedding threshold, losing an AZ pushes the surviving hosts over it, and they start shedding legitimate traffic in the middle of a failure, which is the opposite of what you want.
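The headroom requirement reduces to simple arithmetic if load is assumed to be spread evenly across equally sized AZs, which is a simplification. The function names below are illustrative.

```python
def utilisation_after_az_loss(normal_utilisation: float, az_count: int) -> float:
    """Per-host utilisation once one of az_count equal AZs is lost.

    The same load lands on (az_count - 1) / az_count of the original capacity.
    """
    if az_count < 2:
        raise ValueError("need at least two AZs to survive the loss of one")
    return normal_utilisation * az_count / (az_count - 1)

def survives_az_loss(normal_utilisation: float, az_count: int, shed_threshold: float) -> bool:
    """True if the surviving fleet stays below the shedding threshold."""
    return utilisation_after_az_loss(normal_utilisation, az_count) < shed_threshold
```

For example, a three-AZ fleet running at 50% utilisation would run at 75% after losing one AZ, which stays under an 85% shedding threshold; the same fleet at 60% would not.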
Load shedding behaviour must be tested explicitly — it only activates under conditions you won't see in normal operation.
| Test Type | What to Verify |
|---|---|
| Overload test | Drive far beyond capacity — confirm goodput plateaus rather than collapsing to zero |
| Fixed-fleet test | Remove auto-scaling; confirm a single host sheds gracefully at its individual capacity limit |
| AZ failure test | Remove one AZ's capacity; confirm the remaining fleet handles load without shedding healthy traffic |
| Retry amplification test | Inject upstream retries; confirm downstream shedding prevents cascade |
| Priority test | Send mixed traffic at overload; confirm high-priority requests succeed while low-priority are shed |
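The overload test's pass criterion can be expressed as a small check over measured results. The sketch below is one possible formulation: the 80% tolerance is an arbitrary assumption, and the sample figures are invented for illustration rather than measured.

```python
def goodput_plateaus(samples, min_fraction_of_peak: float = 0.8) -> bool:
    """Check that goodput holds up at the highest offered load.

    `samples` is a list of (offered_rps, goodput_rps) pairs ordered by
    increasing offered load. Returns True if goodput at the final (highest)
    load is at least min_fraction_of_peak of the best goodput observed.
    """
    peak = max(goodput for _, goodput in samples)
    _, final = samples[-1]
    return peak > 0 and final >= min_fraction_of_peak * peak

# Invented figures: goodput levels off with shedding, collapses without it.
with_shedding = [(100, 100), (200, 195), (400, 190), (800, 185)]
without_shedding = [(100, 100), (200, 195), (400, 60), (800, 5)]
```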