Key Takeaway
By the end of this blueprint you will have a multi-tenant conversational AI platform architecture: persistent memory using rolling summarization, a tool registry for API integration, input/output guardrails for safety, and conversation analytics that measure quality and business outcomes across all deployed assistants.
Prerequisites
- An LLM API with streaming support (Anthropic or OpenAI)
- PostgreSQL for conversation persistence and assistant configuration
- Redis for session state and rate limiting
- Familiarity with the Streaming Architecture blueprint for real-time delivery
- A task queue for async operations (tool execution, guardrail checks)
Platform vs Single-Bot Architecture
A single-bot architecture hardcodes the system prompt, tools, and memory strategy in application code. This works for one assistant but becomes unmanageable when you need customer support bots, internal IT helpers, onboarding guides, and sales assistants — each with different prompts, tools, and safety requirements. A platform architecture separates the conversation runtime (how messages flow) from the assistant configuration (what the assistant does), letting you deploy new assistants by writing a config file rather than building a new application.
Architecture Overview
The platform consists of a conversation runtime that manages message flow and memory, a tool registry that exposes organizational APIs as callable tools, a guardrails engine that screens both inputs and outputs for policy violations, and an analytics pipeline that captures conversation-level and turn-level metrics. Each assistant is defined by a configuration that specifies its system prompt, available tools, memory strategy, and guardrail policies.
Assistant Configuration
```typescript
/** Assistant configuration — the declarative definition of a conversational agent. */
export interface AssistantConfig {
  /** Unique identifier */
  id: string;
  /** Human-readable name */
  name: string;
  /** System prompt (supports {{variable}} templates) */
  systemPrompt: string;
  /** LLM model to use */
  model: string;
  /** Available tools from the tool registry */
  tools: string[];
  /** Memory strategy */
  memory: {
    strategy: "full" | "sliding-window" | "summarize";
    /** Max messages before summarization (for the summarize strategy) */
    maxMessages?: number;
    /** Max tokens budgeted for history in the context window */
    maxHistoryTokens?: number;
  };
  /** Guardrail policies to enforce */
  guardrails: {
    /** Blocked topics (e.g., ["competitor-pricing", "legal-advice"]) */
    blockedTopics: string[];
    /** PII detection and redaction */
    piiHandling: "block" | "redact" | "allow";
    /** Maximum response length in tokens */
    maxResponseTokens: number;
    /** Custom safety prompt appended to the system prompt */
    safetyInstructions?: string;
  };
  /** Metadata for analytics */
  metadata: {
    team: string;
    environment: "staging" | "production";
    version: string;
  };
}
```
Multi-Turn Memory Management
Long conversations exceed context window limits. The platform offers three memory strategies: full history (send all messages, simple but limited), sliding window (send the last N messages, loses early context), and rolling summarization (periodically summarize older messages into a compact summary that preserves key facts). Rolling summarization is the production-grade choice — it maintains conversation coherence over hundreds of turns while keeping the context window within budget.
"""Multi-turn memory management with rolling summarization."""
from __future__ import annotations
from dataclasses import dataclass
from anthropic import Anthropic
client = Anthropic()
@dataclass
class ConversationMemory:
"""Manages conversation history with summarization."""
summary: str = ""
recent_messages: list[dict] = None
max_recent: int = 10
total_turns: int = 0
def __post_init__(self):
if self.recent_messages is None:
self.recent_messages = []
def add_message(self, role: str, content: str):
"""Add a message and trigger summarization if needed."""
self.recent_messages.append({"role": role, "content": content})
self.total_turns += 1
if len(self.recent_messages) > self.max_recent:
self._summarize_oldest()
def _summarize_oldest(self):
"""Summarize the oldest half of messages into the running summary."""
cutoff = len(self.recent_messages) // 2
to_summarize = self.recent_messages[:cutoff]
self.recent_messages = self.recent_messages[cutoff:]
messages_text = "\n".join(
f"{m['role']}: {m['content']}" for m in to_summarize
)
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=512,
messages=[{
"role": "user",
"content": (
"Summarize this conversation segment concisely, "
"preserving key facts, decisions, and user preferences. "
"Previous summary: "
f"{self.summary or 'None'}\n\n"
f"New messages:\n{messages_text}"
),
}],
)
self.summary = response.content[0].text
def get_context(self) -> list[dict]:
"""Build the message list for the LLM call."""
context = []
if self.summary:
context.append({
"role": "user",
"content": (
f"[Conversation summary so far: {self.summary}]"
),
})
context.append({
"role": "assistant",
"content": "I understand the conversation context. How can I help?",
})
context.extend(self.recent_messages)
return contextTool Registry and Integration
The tool registry is a centralized catalog of capabilities that assistants can invoke. Each tool has a name, description (used in the LLM's tool schema), an input schema, and an executor function. Tools are registered once and can be assigned to any assistant by adding the tool name to the assistant's configuration. This decouples tool implementation from assistant logic — the platform team builds and maintains tools, and the prompt engineering team assigns them to assistants without writing code.
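A minimal registry sketch of this decoupling, assuming decorator-based registration — the `ToolSpec` record, `register_tool`, and the `lookup_order` tool are illustrative names, not part of the blueprint's API:

```python
"""Minimal tool registry sketch: name, description, input schema, executor."""
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolSpec:
    name: str
    description: str            # Surfaced in the LLM's tool schema
    input_schema: dict          # JSON Schema for the tool's arguments
    executor: Callable[[dict], Any]


_REGISTRY: dict[str, ToolSpec] = {}


def register_tool(name: str, description: str, input_schema: dict):
    """Decorator that registers an executor function under a tool name."""
    def wrap(fn: Callable[[dict], Any]) -> Callable[[dict], Any]:
        _REGISTRY[name] = ToolSpec(name, description, input_schema, fn)
        return fn
    return wrap


def tools_for_assistant(tool_names: list[str]) -> list[dict]:
    """Build the LLM tool schema for the tools named in an assistant config."""
    return [
        {
            "name": spec.name,
            "description": spec.description,
            "input_schema": spec.input_schema,
        }
        for name in tool_names
        if (spec := _REGISTRY.get(name)) is not None
    ]


@register_tool(
    "lookup_order",
    "Look up an order's status by order ID.",
    {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
)
def lookup_order(args: dict) -> dict:
    # A real implementation would call the order service here
    return {"order_id": args["order_id"], "status": "shipped"}
```

When the model requests a tool call, the runtime dispatches through `_REGISTRY[name].executor(args)`; assigning a tool to an assistant is just adding its name to the config's `tools` list.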
Input and Output Guardrails
Guardrails run as middleware in the conversation runtime. Input guardrails screen user messages before they reach the LLM — checking for prompt injection attempts, PII that should be redacted, and blocked topics. Output guardrails screen the LLM's response before it reaches the user — checking for off-topic responses, policy violations, and hallucinated information. Each guardrail returns a pass/fail/warn result with an explanation. Failed inputs return a canned response; failed outputs trigger a retry with additional safety instructions in the prompt.
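The output-side retry path described above can be sketched as follows — a minimal illustration with a toy length guardrail; `call_llm` and the fallback wording are placeholders, not platform APIs:

```python
"""Output guardrail sketch: retry with extra safety instructions on failure."""
from __future__ import annotations

from typing import Callable


def check_response_length(text: str, max_words: int = 200) -> bool:
    """Toy output guardrail: pass when the response fits the length budget."""
    return len(text.split()) <= max_words


def respond_with_guardrails(
    call_llm: Callable[[str, str], str],
    user_message: str,
    max_retries: int = 2,
) -> str:
    """Call the LLM, re-prompting with safety instructions until output passes."""
    extra_instructions = ""
    for _ in range(max_retries + 1):
        response = call_llm(user_message, extra_instructions)
        if check_response_length(response):
            return response
        # Failed output: retry with additional safety instructions in the prompt
        extra_instructions = (
            "Keep the response under 200 words and strictly on topic."
        )
    # Retries exhausted: fall back to a canned response
    return "I'm sorry, I can't provide a reliable answer to that right now."
```

The input-side checks that feed this loop are shown in the middleware module below.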
"""Input and output guardrail middleware."""
from __future__ import annotations
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable
class GuardrailResult(str, Enum):
PASS = "pass"
WARN = "warn"
BLOCK = "block"
@dataclass
class GuardrailCheck:
result: GuardrailResult
reason: str
modified_content: str | None = None # For redaction
# ---- Input Guardrails ----
def check_pii(text: str, handling: str) -> GuardrailCheck:
"""Detect and handle PII in user input."""
# Simple regex patterns for common PII
patterns = {
"email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
"phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
"ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}
found = {}
for pii_type, pattern in patterns.items():
matches = re.findall(pattern, text)
if matches:
found[pii_type] = matches
if not found:
return GuardrailCheck(GuardrailResult.PASS, "No PII detected")
if handling == "block":
return GuardrailCheck(
GuardrailResult.BLOCK,
f"PII detected: {', '.join(found.keys())}. Please remove personal information.",
)
if handling == "redact":
redacted = text
for pattern in patterns.values():
redacted = re.sub(pattern, "[REDACTED]", redacted)
return GuardrailCheck(
GuardrailResult.WARN,
"PII redacted from input",
modified_content=redacted,
)
return GuardrailCheck(GuardrailResult.PASS, "PII handling set to allow")
def check_blocked_topics(
text: str, blocked: list[str]
) -> GuardrailCheck:
"""Check if input references blocked topics."""
text_lower = text.lower()
for topic in blocked:
if topic.lower() in text_lower:
return GuardrailCheck(
GuardrailResult.BLOCK,
f"This topic ({topic}) is outside my scope. "
"Please contact the appropriate team.",
)
return GuardrailCheck(GuardrailResult.PASS, "No blocked topics")Conversation Analytics
Every conversation generates analytics at two levels: turn-level metrics (latency, token usage, tool invocations, guardrail triggers) and conversation-level metrics (total turns, resolution status, user satisfaction, escalation rate). These metrics are tagged with the assistant ID, allowing you to compare performance across assistants and identify which ones need prompt improvements or additional tools. The analytics pipeline feeds into the same observability stack described in the AI Observability blueprint.
| Metric | Level | Source | Purpose |
|---|---|---|---|
| Response latency | Turn | Runtime timer | Performance monitoring |
| Token usage | Turn | LLM API response | Cost attribution |
| Guardrail triggers | Turn | Guardrail middleware | Safety monitoring |
| Tool invocation rate | Turn | Tool registry | Feature usage tracking |
| Conversation length | Conversation | Message count | Efficiency analysis |
| Resolution rate | Conversation | User feedback or classifier | Quality measurement |
| Escalation rate | Conversation | Handoff events | Automation coverage |
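One way to capture the turn-level rows of this table is a small event record emitted by the runtime after each turn, tagged with the assistant ID for cross-assistant comparison. A sketch under assumed names — `TurnMetrics` and the list-backed sink stand in for a real event schema and queue producer:

```python
"""Turn-level analytics event sketch, tagged with the assistant ID."""
from __future__ import annotations

import time
from dataclasses import asdict, dataclass, field


@dataclass
class TurnMetrics:
    assistant_id: str
    conversation_id: str
    latency_ms: float              # Runtime timer
    input_tokens: int              # From the LLM API response
    output_tokens: int
    tools_invoked: list[str] = field(default_factory=list)
    guardrail_triggers: list[str] = field(default_factory=list)
    recorded_at: float = field(default_factory=time.time)


def emit_turn_metrics(metrics: TurnMetrics, sink: list) -> None:
    """Serialize the event and hand it to the analytics sink (e.g. a queue)."""
    sink.append(asdict(metrics))  # Stand-in for a real producer client
```

Downstream, these events roll up into the conversation-level metrics and feed the observability stack keyed by `assistant_id`.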
Add a lightweight feedback mechanism at the end of conversations — a simple thumbs up/down or a 1-5 rating. This costs almost nothing to implement but provides the ground-truth signal you need to evaluate assistant quality. Without it, you are flying blind on user satisfaction.
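The feedback mechanism really is this small — a sketch using a plain dict as the store (a real deployment would persist votes in PostgreSQL alongside the conversation):

```python
"""End-of-conversation feedback sketch: one thumbs vote per conversation."""
from __future__ import annotations


def record_feedback(
    store: dict[str, bool], conversation_id: str, thumbs_up: bool
) -> None:
    """Persist a thumbs up/down vote for a finished conversation."""
    store[conversation_id] = thumbs_up


def satisfaction_rate(store: dict[str, bool]) -> float:
    """Fraction of rated conversations with a thumbs-up; 0.0 when none rated."""
    if not store:
        return 0.0
    return sum(store.values()) / len(store)
```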
Memory summarization introduces a quality risk: important details from early in the conversation can be lost if the summarizer judges them irrelevant. Mitigate this by including 'preserve user preferences and explicit requests' in the summarization prompt, and by keeping the most recent unsummarized window large enough (10+ messages) to cover multi-step tasks.
Version History
1.0.0 · 2026-03-01
- Initial publication with multi-tenant conversational AI platform architecture
- Assistant configuration schema with declarative tool and guardrail assignment
- Rolling summarization memory management implementation
- Input/output guardrail middleware with PII handling
- Conversation analytics metrics and dashboard patterns