Streaming LLM Tokens Through AWS API Gateway

The previous four posts (Part 1, Part 2, Part 3, Part 4) covered Rust on Lambda from cold starts to architectural fit. This one is a sibling rather than a sequel, focused on a specific AWS gotcha that bites anyone wiring an LLM behind API Gateway: HTTP API v2 does not support response streaming. The modern, recommended Gateway flavor is the wrong tool for streaming LLM tokens. The older REST API got streaming support in November 2025. If you reach for HTTP API v2 by reflex (and most “use the modern one” guides will tell you to), your token stream silently collapses into a single buffered response.

Why streaming matters for LLMs

LLM endpoints like OpenRouter expose a stream: true flag for a reason. A typical chat completion takes seconds to tens of seconds to finish. Without streaming, the user stares at a spinner the whole time and has no signal that anything is happening. With streaming, tokens arrive as they are produced and the experience feels live.

Streaming also unlocks things that are awkward or impossible without it:

Cancellation. The client can close the connection mid-generation and stop paying for tokens it no longer wants.
Mid-stream interception. A proxy in the middle (your Lambda, in this case) can inspect tokens as they flow, apply moderation or PII filters, run early-stop logic on partial output, or fork the stream into analytics.
Lower time-to-first-token. A perceived latency win even when total generation time is unchanged.

The transport that makes all of this work is Server-Sent Events.

SSE in one paragraph

Server-Sent Events is a thin convention on top of plain HTTP: the server returns Content-Type: text/event-stream, holds the response open, and writes framed data: ...\n\n lines as events become available. The client reads them as they arrive. There is no second protocol like WebSockets, no bidirectional channel. Just a long-lived response. That simplicity is the point. It also means SSE only works if every hop between Lambda and client forwards the body incrementally (chunked transfer encoding on HTTP/1.1, DATA frames on HTTP/2) and does not buffer it. That assumption is exactly where API Gateway HTTP API v2 fails.
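To make the framing concrete, here is a minimal sketch of what "framed data: lines" means on the wire. The function name sse_frame is made up for illustration, and it assumes payloads use \n line breaks; in the real handler Axum does this framing for you.

```rust
/// Frame one payload as a Server-Sent Event (illustrative helper, not part of Axum).
/// Each line of the payload gets its own `data:` field; a blank line ends the event.
fn sse_frame(payload: &str) -> String {
    let mut out = String::new();
    // `split('\n')` rather than `lines()` so an empty payload still yields one `data:` field.
    for line in payload.split('\n') {
        out.push_str("data: ");
        out.push_str(line);
        out.push('\n');
    }
    out.push('\n'); // the blank line that terminates the event
    out
}
```

A single-line token becomes one data: line plus the blank terminator, and a payload containing a newline becomes two data: lines inside the same event.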

An Axum SSE handler

The Rust side is straightforward. axum::response::sse ships the helpers, and the handler signature is just Sse<impl Stream<Item = ...>>:

use axum::{
    response::sse::{Event, KeepAlive, Sse},
    routing::get,
    Router,
};
use futures::stream::{Stream, StreamExt};
use std::convert::Infallible;

async fn stream_completion() -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    let tokens = openrouter_token_stream().await;

    let events = tokens.map(|token| {
        // here goes any filtering, analytics, billing, early-stop, etc.
        let token = transform(token);
        Ok(Event::default().data(token))
    });

    Sse::new(events).keep_alive(KeepAlive::default())
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/chat", get(stream_completion));
    // On Lambda, run with lambda_http::run_with_streaming_response(app);
    // plain lambda_http::run(app) buffers the whole body. Locally, use axum::serve(...).
}

The shape is the same whether stream_completion is fed by OpenRouter, a local model, or a fake. Here openrouter_token_stream() and transform() are placeholders for your own code: as long as the former returns a Stream of token strings, Axum frames each one as an SSE data: event. The interesting part is the .map closure: that is the choke point where every token passes through your code on its way to the client. Filtering, accounting, and policy enforcement all hang off of it, and none of it works if the stream is buffered downstream. The KeepAlive sends periodic comment lines so that intermediate proxies do not kill the connection as idle during long pauses between tokens.
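As a rough illustration of what can hang off that choke point, the sketch below combines redaction, token accounting, and early stop in one per-token policy. Everything here is hypothetical (TokenGate, gate_tokens, the "[redacted]" marker), and a plain iterator stands in for the async stream so the example needs nothing beyond the standard library.

```rust
/// Hypothetical per-token policy: redact blocked words and stop after a token budget.
struct TokenGate {
    blocked: Vec<String>,
    budget: usize,
    seen: usize,
}

impl TokenGate {
    fn new(blocked: &[&str], budget: usize) -> Self {
        Self {
            blocked: blocked.iter().map(|s| s.to_lowercase()).collect(),
            budget,
            seen: 0,
        }
    }

    /// Returns `None` once the budget is spent, which ends the stream early.
    fn process(&mut self, token: &str) -> Option<String> {
        if self.seen >= self.budget {
            return None;
        }
        self.seen += 1; // accounting: count every token that is forwarded
        let is_blocked = self.blocked.contains(&token.trim().to_lowercase());
        Some(if is_blocked {
            "[redacted]".to_string()
        } else {
            token.to_string()
        })
    }
}

/// Run the gate over a token source; the iterator is a stand-in for the async stream.
fn gate_tokens<'a>(tokens: impl Iterator<Item = &'a str>, gate: &mut TokenGate) -> Vec<String> {
    tokens.map_while(|t| gate.process(t)).collect()
}
```

With a budget of three and "secret" on the block list, the tokens "the", " secret", " is", " out" come out as "the", "[redacted]", " is", and the fourth token is never forwarded. The point of the design is that the decision is made per token, which only has an effect if tokens actually reach the client one at a time.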

This handler streams cleanly when nothing buffers it on the way out. Put HTTP API v2 in front of it and the behavior changes.

Limitations of API Gateway HTTP API: response buffering

Figure: SSE token flow. HTTP API v2 buffers, REST API streams.

The failure mode is silent. The Lambda emits tokens one by one, the runtime hands them to the gateway, and API Gateway holds them. From the gateway’s perspective the response is not “ready” until the handler returns, so the client sees nothing for the full duration of the generation. When the handler finally finishes, the entire concatenated body lands in one chunk.

Nothing throws. The text/event-stream content type passes through. The client’s EventSource opens fine. It just receives no events until the very end, at which point it gets all of them at once. The streaming UX collapses into the non-streaming one and you only notice when a user complains the spinner sat for fifteen seconds before the answer appeared.
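The reason nothing throws is visible in how an SSE reader works: it only cares about blank-line boundaries, not about how the bytes were chunked. The sketch below is a deliberately minimal reader (SseReader is a made-up name; it assumes \n line endings and ignores every field except data:), not a full implementation of the EventSource parsing rules.

```rust
/// Minimal incremental SSE reader: feed it body chunks, get back completed `data` payloads.
#[derive(Default)]
struct SseReader {
    buf: String,
}

impl SseReader {
    fn feed(&mut self, chunk: &str) -> Vec<String> {
        self.buf.push_str(chunk);
        let mut events = Vec::new();
        // An event is complete once its terminating blank line has arrived.
        while let Some(end) = self.buf.find("\n\n") {
            let raw: String = self.buf.drain(..end + 2).collect();
            let data: Vec<&str> = raw
                .lines()
                .filter_map(|l| l.strip_prefix("data:"))
                .map(|v| v.strip_prefix(' ').unwrap_or(v))
                .collect();
            // Comment-only frames (such as keep-alives) carry no data and are skipped.
            if !data.is_empty() {
                events.push(data.join("\n"));
            }
        }
        events
    }
}
```

Feeding the reader "data: Hel" and then "lo\n\ndata: world\n\n" yields the same two events as feeding it the whole body in one call. The parsed result is identical either way; only the timing differs, which is why a buffered stream looks like a perfectly valid one to the client.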

The cause is straightforward: HTTP API v2 buffers Lambda integration responses. There is no setting to turn it off. If you need streaming, you cannot use HTTP API v2 today.
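Because the failure is only visible in timing, one way to catch it in a smoke test is to record when each event arrives and check whether they all landed in a single burst. This is a heuristic sketch: classify_delivery, the Delivery enum, and the burst_window threshold are assumptions for illustration, and a very short generation could be misclassified.

```rust
use std::time::Duration;

/// Outcome of the timing check (illustrative type).
#[derive(Debug, PartialEq)]
enum Delivery {
    Streamed,
    Buffered,
    Inconclusive,
}

/// Classify a response from event arrival times, each measured from the moment
/// the request was sent and given in arrival order.
fn classify_delivery(arrivals: &[Duration], burst_window: Duration) -> Delivery {
    // With fewer than two events there is no spacing to measure.
    let (first, last) = match (arrivals.first(), arrivals.last()) {
        (Some(f), Some(l)) if arrivals.len() >= 2 => (*f, *l),
        _ => return Delivery::Inconclusive,
    };
    // A buffered response delivers every event in one burst at the end.
    if last.saturating_sub(first) <= burst_window {
        Delivery::Buffered
    } else {
        Delivery::Streamed
    }
}
```

Three events arriving at 15.000 s, 15.001 s, and 15.002 s with a 50 ms window classify as Buffered; events at 0.4 s, 0.9 s, and 15 s classify as Streamed.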

What changed in November 2025

In November 2025 AWS announced response streaming for REST APIs:

Amazon API Gateway now supports response streaming for REST APIs.

The integration types covered are Lambda, HTTP proxy, and private integrations. The practical caveats from the announcement:

Timeout extends up to 15 minutes when streaming, matching Lambda’s hard cap.
Available on REST APIs only. HTTP APIs (v2) are not included.

The inversion is what makes this notable. REST API is the older flavor, the one the AWS docs steer you away from in favor of HTTP API v2 for most new work. It is also the one that just got the feature LLM apps need. If you went with the “modern” recommendation a year ago, you are now on the wrong side of the streaming line.

What to actually use

Three options, ranked by simplicity:

Lambda Function URLs. No API Gateway at all. Streaming has worked since April 2023. Simplest path to a streaming endpoint and the cheapest, but you give up the things API Gateway gives you: usage plans and API keys, request validation, custom authorizers, throttling tied to a Gateway stage, integrated WAF, custom domains attached to a Gateway. For an internal service or a thin public endpoint where you handle auth in the Lambda, this is the obvious choice.
REST API + Lambda. Now viable for streaming as of November 2025. Heavier to set up than HTTP API v2 and pricier per request, but you keep the full Gateway feature set. The right pick when you need streaming and Gateway features (custom authorizers, usage plans, WAF, the works).
HTTP API v2 + Lambda. Use only when you do not need streaming. It remains the cheapest and simplest Gateway flavor for non-streaming JSON APIs. The moment SSE enters the picture, it stops being a candidate.
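One way to read the ranking above is as a two-question decision: do you need streaming, and do you need Gateway features? The helper below is a hypothetical encoding of that reading (Frontend and pick_frontend are made-up names). It deliberately ignores cost and other factors, and it defaults non-streaming workloads to HTTP API v2 even though a Function URL can serve those too.

```rust
/// The three front ends discussed in this post (illustrative enum).
#[derive(Debug, PartialEq)]
enum Frontend {
    FunctionUrl,
    RestApi,
    HttpApiV2,
}

/// Hypothetical helper encoding one reading of the ranking.
fn pick_frontend(needs_streaming: bool, needs_gateway_features: bool) -> Frontend {
    match (needs_streaming, needs_gateway_features) {
        // Streaming without Gateway features: skip the Gateway entirely.
        (true, false) => Frontend::FunctionUrl,
        // Streaming plus authorizers, usage plans, WAF and so on: REST API.
        (true, true) => Frontend::RestApi,
        // No streaming: HTTP API v2 stays the simplest Gateway flavor.
        (false, _) => Frontend::HttpApiV2,
    }
}
```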

A fourth option worth flagging: CloudFront in front of a Lambda Function URL. CloudFront supports response streaming and could give you edge caching, custom domains, and WAF on top of a Function URL. I will cover this combination in a separate post.
