Key Takeaway
By the end of this blueprint you will have an end-to-end streaming architecture that delivers LLM tokens to the client via Server-Sent Events, handles structured output parsing mid-stream, recovers gracefully from connection drops, and progressively renders markdown with code blocks on the client side.
Prerequisites
- A Next.js or FastAPI backend capable of streaming responses
- Familiarity with Server-Sent Events (SSE) or WebSocket protocols
- An LLM provider API that supports streaming (Anthropic, OpenAI)
- React or equivalent frontend framework for client-side rendering
SSE vs WebSocket for LLM Streaming
Server-Sent Events (SSE) is the right choice for most LLM streaming use cases. LLM generation is unidirectional — the server sends tokens to the client, and the client does not need to send data back during generation. SSE works over standard HTTP, passes through CDNs and load balancers without special configuration, supports automatic reconnection with event IDs, and is simpler to implement than WebSocket. Use WebSocket only when you need bidirectional communication during generation, such as mid-stream cancellation or real-time collaborative editing.
| Feature | SSE | WebSocket |
|---|---|---|
| Direction | Server to client only | Bidirectional |
| Protocol | HTTP/1.1 or HTTP/2 | Custom upgrade from HTTP |
| Load balancer support | Native | Requires sticky sessions or upgrade support |
| Auto-reconnection | Built-in with Last-Event-ID | Manual implementation |
| Complexity | Low | Medium-High |
| Best for | LLM token streaming | Interactive collaboration, gaming |
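For reference, the SSE wire format behind this comparison is plain text: each event is a block of `field: value` lines terminated by a blank line, and the optional `id:` field is what powers `Last-Event-ID` resumption. A minimal, illustrative parser for a single event block — it only handles `id:` and `data:`; the full specification also covers `event:`, `retry:`, and comment lines:

```typescript
interface SseEvent {
  id: string | null; // value of the "id:" field, used for Last-Event-ID
  data: string;      // concatenated "data:" lines
}

// Parse one raw SSE event block (the text between two blank lines).
function parseSseEvent(raw: string): SseEvent {
  let id: string | null = null;
  const data: string[] = [];
  for (const line of raw.split("\n")) {
    if (line.startsWith("id: ")) id = line.slice(4);
    else if (line.startsWith("data: ")) data.push(line.slice(6));
  }
  // Per the spec, multiple data lines join with a newline.
  return { id, data: data.join("\n") };
}
```

On reconnect, the browser's `EventSource` automatically sends the last seen `id` back as the `Last-Event-ID` request header, which is exactly the hook the sequence numbers below exploit.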
Server-Side Streaming Implementation
The server acts as a transformer between the LLM provider's stream format and a unified event protocol that your client SDK understands. Each event carries a type (token, tool_call_start, tool_call_delta, error, done), an incrementing sequence number for replay, and the payload. The sequence number is essential for connection resumption — when the client reconnects, it sends the last received sequence number, and the server replays missed events from a bounded buffer.
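Replay from a bounded buffer can be sketched as follows. This is a hypothetical `ReplayBuffer` helper, not wired into the route handler below; production code would push each encoded event as it is sent and consult `replayAfter` when a reconnecting client presents its last sequence number:

```typescript
// Bounded replay buffer: keeps the most recent `capacity` events so a
// reconnecting client can resume from its last received sequence number.
class ReplayBuffer {
  private events: Array<{ seq: number; payload: string }> = [];
  constructor(private capacity = 1024) {}

  push(seq: number, payload: string): void {
    this.events.push({ seq, payload });
    if (this.events.length > this.capacity) this.events.shift();
  }

  // Events after `lastSeq`, or null if the gap is unrecoverable (the
  // oldest buffered event is already past lastSeq + 1), in which case
  // the client must restart the request from scratch.
  replayAfter(lastSeq: number): string[] | null {
    if (this.events.length > 0 && this.events[0].seq > lastSeq + 1) return null;
    return this.events.filter((e) => e.seq > lastSeq).map((e) => e.payload);
  }
}
```

The null return is deliberate: silently replaying with a gap would corrupt the client's accumulated text, so an unrecoverable gap must surface as a full restart.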
```typescript
/** Next.js route handler for streaming LLM responses via SSE. */
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function POST(request: Request) {
  const { messages, model = "claude-sonnet-4-20250514" } = await request.json();
  const encoder = new TextEncoder();
  let sequence = 0;

  const stream = new ReadableStream({
    async start(controller) {
      try {
        const response = await client.messages.stream({
          model,
          max_tokens: 4096,
          messages,
        });

        for await (const event of response) {
          if (event.type === "content_block_delta") {
            const delta = event.delta;
            if (delta.type === "text_delta") {
              controller.enqueue(
                encoder.encode(
                  `data: ${JSON.stringify({
                    type: "token",
                    seq: sequence++,
                    content: delta.text,
                  })}\n\n`
                )
              );
            }
          }
        }

        // Final message with usage stats
        const finalMessage = await response.finalMessage();
        controller.enqueue(
          encoder.encode(
            `data: ${JSON.stringify({
              type: "done",
              seq: sequence++,
              usage: {
                inputTokens: finalMessage.usage.input_tokens,
                outputTokens: finalMessage.usage.output_tokens,
              },
            })}\n\n`
          )
        );
      } catch (error) {
        controller.enqueue(
          encoder.encode(
            `data: ${JSON.stringify({
              type: "error",
              seq: sequence++,
              message: error instanceof Error ? error.message : "Unknown error",
            })}\n\n`
          )
        );
      } finally {
        controller.close();
      }
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  });
}
```

Client-Side Progressive Rendering
The client must render tokens as they arrive while handling partial markdown, incomplete code blocks, and mid-stream formatting. The key insight is to accumulate the full text and re-render the markdown on each token rather than trying to incrementally parse markdown deltas. This is simpler and avoids the bugs that come from trying to determine whether a backtick starts an inline code span or a code block before seeing the full context.
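One wrinkle with the accumulate-and-re-render approach: the stream can pause inside an unclosed code fence, at which point the renderer treats everything after the opening backticks as code. A small guard that closes a dangling fence before each render keeps the preview stable (an illustrative helper, not part of any markdown library's API):

```typescript
// If the accumulated markdown currently ends inside an open ``` fence
// (odd number of fence markers), append a closing fence before rendering
// so the markdown parser doesn't swallow the rest of the message.
function closeDanglingFence(markdown: string): string {
  const fenceCount = (markdown.match(/^```/gm) || []).length;
  return fenceCount % 2 === 1 ? markdown + "\n```" : markdown;
}
```

Apply it to the display copy only; the accumulated source string must stay untouched so the next token appends inside the still-open fence.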
```typescript
/** React hook for consuming SSE streams with progressive rendering. */
import { useCallback, useRef, useState } from "react";

interface StreamState {
  content: string;
  isStreaming: boolean;
  error: string | null;
  usage: { inputTokens: number; outputTokens: number } | null;
}

export function useStreamingChat() {
  const [state, setState] = useState<StreamState>({
    content: "",
    isStreaming: false,
    error: null,
    usage: null,
  });
  const abortRef = useRef<AbortController | null>(null);

  const send = useCallback(async (messages: Array<{ role: string; content: string }>) => {
    // Cancel any in-flight request
    abortRef.current?.abort();
    const controller = new AbortController();
    abortRef.current = controller;
    setState({ content: "", isStreaming: true, error: null, usage: null });

    try {
      const response = await fetch("/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ messages }),
        signal: controller.signal,
      });
      if (!response.ok || !response.body) {
        throw new Error(`HTTP ${response.status}`);
      }

      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      let buffer = "";
      let accumulated = "";

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });

        // SSE events are separated by blank lines; keep any trailing
        // partial event in the buffer until it completes.
        const events = buffer.split("\n\n");
        buffer = events.pop() || "";

        for (const event of events) {
          if (!event.startsWith("data: ")) continue;
          const data = JSON.parse(event.slice(6));
          switch (data.type) {
            case "token":
              accumulated += data.content;
              setState((s) => ({ ...s, content: accumulated }));
              break;
            case "done":
              setState((s) => ({ ...s, isStreaming: false, usage: data.usage }));
              break;
            case "error":
              setState((s) => ({ ...s, isStreaming: false, error: data.message }));
              break;
          }
        }
      }
    } catch (err) {
      if (err instanceof DOMException && err.name === "AbortError") return;
      setState((s) => ({
        ...s,
        isStreaming: false,
        error: err instanceof Error ? err.message : "Stream failed",
      }));
    }
  }, []);

  const cancel = useCallback(() => {
    abortRef.current?.abort();
    setState((s) => ({ ...s, isStreaming: false }));
  }, []);

  return { ...state, send, cancel };
}
```

Handling Backpressure
When the LLM generates tokens faster than the client consumes them (common on slow mobile connections), unbuffered streaming can exhaust server memory. Place a bounded buffer between the LLM stream and the SSE output; if the buffer fills, pause reading from the LLM stream until the client drains it. Most web frameworks handle this automatically through stream backpressure, but verify that your middleware and reverse proxies do not buffer entire responses.
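A sketch of what framework-level backpressure looks like with the Web Streams API: moving production into `pull()` means the runtime only requests the next chunk when the consumer has capacity, so a slow client naturally pauses reads from the LLM stream. Here `llmEvents` stands in for any provider event iterator, and the `highWaterMark` value is an assumption to tune:

```typescript
// Backpressure-aware SSE stream: pull() is invoked only while the
// internal queue holds fewer than highWaterMark chunks, so a slow
// consumer throttles how fast we read from the upstream iterator.
function sseStream(llmEvents: AsyncIterator<string>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream<Uint8Array>(
    {
      async pull(controller) {
        const { done, value } = await llmEvents.next();
        if (done) {
          controller.close();
          return;
        }
        controller.enqueue(encoder.encode(`data: ${value}\n\n`));
      },
    },
    // Bounds how many chunks queue up before pull() stops being called.
    new CountQueuingStrategy({ highWaterMark: 64 })
  );
}
```

Contrast this with the `start()`-based handler earlier, which enqueues as fast as the provider emits; for most responses the difference is invisible, but under a stalled client the `pull()` form bounds memory.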
Set `proxy_buffering off` in Nginx and disable response buffering in your CDN for SSE endpoints. Without this, the CDN or reverse proxy will buffer the entire response and deliver it all at once, defeating the purpose of streaming.
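An illustrative Nginx location block for an SSE endpoint — the timeout value and the `app_upstream` name are placeholders to adapt:

```nginx
location /api/chat {
    proxy_pass http://app_upstream;  # placeholder upstream
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;             # deliver SSE events as they arrive
    proxy_cache off;
    proxy_read_timeout 300s;         # allow long-lived generation streams
}
```

Alternatively, the application can send an `X-Accel-Buffering: no` response header to disable Nginx buffering per-response without touching the config.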
Always implement client-side abort. When a user navigates away or starts a new message, the previous stream must be cancelled. Without this, orphaned LLM calls continue generating tokens (and burning API credits) for requests nobody will read.
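The mirror image belongs on the server: forward the incoming request's abort signal to the LLM call so a client disconnect actually stops generation. A minimal sketch, with `generate` standing in for any abortable provider call (most LLM SDKs accept an `AbortSignal` in their request options, but check your SDK's documentation):

```typescript
// Tie upstream work to the request's lifetime: when the client
// disconnects, request.signal aborts and generation stops.
async function streamWithDisconnectHandling(
  request: Request,
  generate: (signal: AbortSignal) => Promise<string>
): Promise<string | null> {
  try {
    return await generate(request.signal);
  } catch (err) {
    // A disconnect is expected, not an error worth surfacing.
    if (err instanceof DOMException && err.name === "AbortError") return null;
    throw err;
  }
}
```

In a Next.js route handler, `request.signal` fires when the underlying connection drops, so no extra plumbing is needed beyond passing it through.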
Version History
1.0.0 · 2026-03-01
- Initial publication with SSE streaming architecture for Next.js
- Client-side React hook for progressive rendering
- SSE vs WebSocket comparison and selection guidance
- Backpressure handling and infrastructure configuration