Key Takeaway
By the end of this blueprint you will have an end-to-end streaming architecture that delivers LLM tokens to the client via Server-Sent Events, handles structured output parsing mid-stream, recovers gracefully from connection drops, and progressively renders markdown with code blocks on the client side.
Prerequisites
- A Next.js or FastAPI backend capable of streaming responses
- Familiarity with Server-Sent Events (SSE) or WebSocket protocols
- An LLM provider API that supports streaming (Anthropic, OpenAI)
- React or equivalent frontend framework for client-side rendering
SSE vs WebSocket for LLM Streaming
Server-Sent Events (SSE) is the right choice for most LLM streaming use cases. LLM generation is unidirectional — the server sends tokens to the client, and the client does not need to send data back during generation. SSE works over standard HTTP, passes through CDNs and load balancers without special configuration, supports automatic reconnection with event IDs, and is simpler to implement than WebSocket. Use WebSocket only when you need bidirectional communication during generation, such as mid-stream cancellation or real-time collaborative editing.
| Feature | SSE | WebSocket |
|---|---|---|
| Direction | Server to client only | Bidirectional |
| Protocol | HTTP/1.1 or HTTP/2 | Custom upgrade from HTTP |
| Load balancer support | Native | Requires sticky sessions or upgrade support |
| Auto-reconnection | Built-in with Last-Event-ID | Manual implementation |
| Complexity | Low | Medium-High |
| Best for | LLM token streaming | Interactive collaboration, gaming |
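For reference, the SSE wire format behind this comparison is plain text: each event is a block of `field: value` lines terminated by a blank line, and the optional `id:` field is what powers `Last-Event-ID` resumption. A minimal, illustrative parser for a single event block — it only handles `id:` and `data:`; the full specification also covers `event:`, `retry:`, and comment lines:

```typescript
interface SseEvent {
  id: string | null; // value of the "id:" field, used for Last-Event-ID
  data: string;      // concatenated "data:" lines
}

// Parse one raw SSE event block (the text between two blank lines).
function parseSseEvent(raw: string): SseEvent {
  let id: string | null = null;
  const data: string[] = [];
  for (const line of raw.split("\n")) {
    if (line.startsWith("id: ")) id = line.slice(4);
    else if (line.startsWith("data: ")) data.push(line.slice(6));
  }
  // Per the spec, multiple data lines join with a newline.
  return { id, data: data.join("\n") };
}
```

On reconnect, the browser's `EventSource` automatically sends the last seen `id` back as the `Last-Event-ID` request header, which is exactly the hook the sequence numbers below exploit.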
Server-Side Streaming Implementation
The server acts as a transformer between the LLM provider's stream format and a unified event protocol that your client SDK understands. Each event carries a type (token, tool_call_start, tool_call_delta, error, done), an incrementing sequence number for replay, and the payload. The sequence number is essential for connection resumption — when the client reconnects, it sends the last received sequence number, and the server replays missed events from a bounded buffer.
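Replay from a bounded buffer can be sketched as follows. This is a hypothetical `ReplayBuffer` helper, not wired into the route handler below; production code would push each encoded event as it is sent and consult `replayAfter` when a reconnecting client presents its last sequence number:

```typescript
// Bounded replay buffer: keeps the most recent `capacity` events so a
// reconnecting client can resume from its last received sequence number.
class ReplayBuffer {
  private events: Array<{ seq: number; payload: string }> = [];
  constructor(private capacity = 1024) {}

  push(seq: number, payload: string): void {
    this.events.push({ seq, payload });
    if (this.events.length > this.capacity) this.events.shift();
  }

  // Events after `lastSeq`, or null if the gap is unrecoverable (the
  // oldest buffered event is already past lastSeq + 1), in which case
  // the client must restart the request from scratch.
  replayAfter(lastSeq: number): string[] | null {
    if (this.events.length > 0 && this.events[0].seq > lastSeq + 1) return null;
    return this.events.filter((e) => e.seq > lastSeq).map((e) => e.payload);
  }
}
```

The null return is deliberate: silently replaying with a gap would corrupt the client's accumulated text, so an unrecoverable gap must surface as a full restart.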
```typescript
/** Next.js route handler for streaming LLM responses via SSE. */
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function POST(request: Request) {
  const { messages, model = "claude-sonnet-4-20250514" } = await request.json();
  const encoder = new TextEncoder();
  let sequence = 0;

  const stream = new ReadableStream({
    async start(controller) {
      try {
        const response = await client.messages.stream({
          model,
          max_tokens: 4096,
          messages,
        });

        for await (const event of response) {
          if (event.type === "content_block_delta") {
            const delta = event.delta;
            if (delta.type === "text_delta") {
              controller.enqueue(
                encoder.encode(
                  `data: ${JSON.stringify({
                    type: "token",
                    seq: sequence++,
                    content: delta.text,
                  })}\n\n`
                )
              );
            }
          }
        }

        // Final message with usage stats
        const finalMessage = await response.finalMessage();
        controller.enqueue(
          encoder.encode(
            `data: ${JSON.stringify({
              type: "done",
              seq: sequence++,
              usage: {
                inputTokens: finalMessage.usage.input_tokens,
                outputTokens: finalMessage.usage.output_tokens,
              },
            })}\n\n`
          )
        );
      } catch (error) {
        controller.enqueue(
          encoder.encode(
            `data: ${JSON.stringify({
              type: "error",
              seq: sequence++,
              message: error instanceof Error ? error.message : "Unknown error",
            })}\n\n`
          )
        );
      } finally {
        controller.close();
      }
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  });
}
```

Client-Side Progressive Rendering
The client must render tokens as they arrive while handling partial markdown, incomplete code blocks, and mid-stream formatting. The key insight is to accumulate the full text and re-render the markdown on each token rather than trying to incrementally parse markdown deltas. This is simpler and avoids the bugs that come from trying to determine whether a backtick starts an inline code span or a code block before seeing the full context.
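One wrinkle with the accumulate-and-re-render approach: the stream can pause inside an unclosed code fence, at which point the renderer treats everything after the opening backticks as code. A small guard that closes a dangling fence before each render keeps the preview stable (an illustrative helper, not part of any markdown library's API):

```typescript
// If the accumulated markdown currently ends inside an open ``` fence
// (odd number of fence markers), append a closing fence before rendering
// so the markdown parser doesn't swallow the rest of the message.
function closeDanglingFence(markdown: string): string {
  const fenceCount = (markdown.match(/^```/gm) || []).length;
  return fenceCount % 2 === 1 ? markdown + "\n```" : markdown;
}
```

Apply it to the display copy only; the accumulated source string must stay untouched so the next token appends inside the still-open fence.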
```typescript
/** React hook for consuming SSE streams with progressive rendering. */
import { useCallback, useRef, useState } from "react";

interface StreamState {
  content: string;
  isStreaming: boolean;
  error: string | null;
  usage: { inputTokens: number; outputTokens: number } | null;
}

export function useStreamingChat() {
  const [state, setState] = useState<StreamState>({
    content: "",
    isStreaming: false,
    error: null,
    usage: null,
  });
  const abortRef = useRef<AbortController | null>(null);

  const send = useCallback(async (messages: Array<{ role: string; content: string }>) => {
    // Cancel any in-flight request
    abortRef.current?.abort();
    const controller = new AbortController();
    abortRef.current = controller;
    setState({ content: "", isStreaming: true, error: null, usage: null });

    try {
      const response = await fetch("/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ messages }),
        signal: controller.signal,
      });
      if (!response.ok || !response.body) {
        throw new Error(`HTTP ${response.status}`);
      }

      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      let buffer = "";
      let accumulated = "";

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });

        // SSE events are separated by blank lines; keep any trailing
        // partial event in the buffer until it completes.
        const events = buffer.split("\n\n");
        buffer = events.pop() || "";

        for (const event of events) {
          if (!event.startsWith("data: ")) continue;
          const data = JSON.parse(event.slice(6));
          switch (data.type) {
            case "token":
              accumulated += data.content;
              setState((s) => ({ ...s, content: accumulated }));
              break;
            case "done":
              setState((s) => ({ ...s, isStreaming: false, usage: data.usage }));
              break;
            case "error":
              setState((s) => ({ ...s, isStreaming: false, error: data.message }));
              break;
          }
        }
      }
    } catch (err) {
      if (err instanceof DOMException && err.name === "AbortError") return;
      setState((s) => ({
        ...s,
        isStreaming: false,
        error: err instanceof Error ? err.message : "Stream failed",
      }));
    }
  }, []);

  const cancel = useCallback(() => {
    abortRef.current?.abort();
    setState((s) => ({ ...s, isStreaming: false }));
  }, []);

  return { ...state, send, cancel };
}
```

Handling Backpressure
When the LLM generates tokens faster than the client consumes them (common on slow mobile connections), unbuffered streaming can exhaust server memory. Place a bounded buffer between the LLM stream and the SSE output; if the buffer fills, pause reading from the LLM stream until the client drains it. Most web frameworks handle this automatically through stream backpressure, but verify that your middleware and reverse proxies do not buffer entire responses.
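A sketch of what framework-level backpressure looks like with the Web Streams API: moving production into `pull()` means the runtime only requests the next chunk when the consumer has capacity, so a slow client naturally pauses reads from the LLM stream. Here `llmEvents` stands in for any provider event iterator, and the `highWaterMark` value is an assumption to tune:

```typescript
// Backpressure-aware SSE stream: pull() is invoked only while the
// internal queue holds fewer than highWaterMark chunks, so a slow
// consumer throttles how fast we read from the upstream iterator.
function sseStream(llmEvents: AsyncIterator<string>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream<Uint8Array>(
    {
      async pull(controller) {
        const { done, value } = await llmEvents.next();
        if (done) {
          controller.close();
          return;
        }
        controller.enqueue(encoder.encode(`data: ${value}\n\n`));
      },
    },
    // Bounds how many chunks queue up before pull() stops being called.
    new CountQueuingStrategy({ highWaterMark: 64 })
  );
}
```

Contrast this with the `start()`-based handler earlier, which enqueues as fast as the provider emits; for most responses the difference is invisible, but under a stalled client the `pull()` form bounds memory.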
Set `proxy_buffering off` in Nginx and disable response buffering in your CDN for SSE endpoints. Without this, the CDN or reverse proxy will buffer the entire response and deliver it all at once, defeating the purpose of streaming.
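An illustrative Nginx location block for an SSE endpoint — the timeout value and the `app_upstream` name are placeholders to adapt:

```nginx
location /api/chat {
    proxy_pass http://app_upstream;  # placeholder upstream
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;             # deliver SSE events as they arrive
    proxy_cache off;
    proxy_read_timeout 300s;         # allow long-lived generation streams
}
```

Alternatively, the application can send an `X-Accel-Buffering: no` response header to disable Nginx buffering per-response without touching the config.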
Always implement client-side abort. When a user navigates away or starts a new message, the previous stream must be cancelled. Without this, orphaned LLM calls continue generating tokens (and burning API credits) for requests nobody will read.
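The mirror image belongs on the server: forward the incoming request's abort signal to the LLM call so a client disconnect actually stops generation. A minimal sketch, with `generate` standing in for any abortable provider call (most LLM SDKs accept an `AbortSignal` in their request options, but check your SDK's documentation):

```typescript
// Tie upstream work to the request's lifetime: when the client
// disconnects, request.signal aborts and generation stops.
async function streamWithDisconnectHandling(
  request: Request,
  generate: (signal: AbortSignal) => Promise<string>
): Promise<string | null> {
  try {
    return await generate(request.signal);
  } catch (err) {
    // A disconnect is expected, not an error worth surfacing.
    if (err instanceof DOMException && err.name === "AbortError") return null;
    throw err;
  }
}
```

In a Next.js route handler, `request.signal` fires when the underlying connection drops, so no extra plumbing is needed beyond passing it through.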
Version History
1.0.0 · 2026-03-01
- Initial publication with SSE streaming architecture for Next.js
- Client-side React hook for progressive rendering
- SSE vs WebSocket comparison and selection guidance
- Backpressure handling and infrastructure configuration