Key Takeaway
By the end of this blueprint you will have a multi-tenant conversational AI platform architecture: persistent memory using rolling summarization, a tool registry for API integration, input/output guardrails for safety, and conversation analytics that measure quality and business outcomes across all deployed assistants.
Prerequisites
- An LLM API with streaming support (Anthropic or OpenAI)
- PostgreSQL for conversation persistence and assistant configuration
- Redis for session state and rate limiting
- Familiarity with the Streaming Architecture blueprint for real-time delivery
- A task queue for async operations (tool execution, guardrail checks)
Platform vs Single-Bot Architecture
A single-bot architecture hardcodes the system prompt, tools, and memory strategy in application code. This works for one assistant but becomes unmanageable when you need customer support bots, internal IT helpers, onboarding guides, and sales assistants — each with different prompts, tools, and safety requirements. A platform architecture separates the conversation runtime (how messages flow) from the assistant configuration (what the assistant does), letting you deploy new assistants by writing a config file rather than building a new application.
Architecture Overview
The platform consists of a conversation runtime that manages message flow and memory, a tool registry that exposes organizational APIs as callable tools, a guardrails engine that screens both inputs and outputs for policy violations, and an analytics pipeline that captures conversation-level and turn-level metrics. Each assistant is defined by a configuration that specifies its system prompt, available tools, memory strategy, and guardrail policies.
Assistant Configuration
```typescript
/** Assistant configuration — the declarative definition of a conversational agent. */
export interface AssistantConfig {
  /** Unique identifier */
  id: string;
  /** Human-readable name */
  name: string;
  /** System prompt (supports {{variable}} templates) */
  systemPrompt: string;
  /** LLM model to use */
  model: string;
  /** Available tools from the tool registry */
  tools: string[];
  /** Memory strategy */
  memory: {
    strategy: "full" | "sliding-window" | "summarize";
    /** Max messages before summarization (for the summarize strategy) */
    maxMessages?: number;
    /** Max tokens budgeted for history in the context window */
    maxHistoryTokens?: number;
  };
  /** Guardrail policies to enforce */
  guardrails: {
    /** Blocked topics (e.g., ["competitor-pricing", "legal-advice"]) */
    blockedTopics: string[];
    /** PII detection and redaction */
    piiHandling: "block" | "redact" | "allow";
    /** Maximum response length in tokens */
    maxResponseTokens: number;
    /** Custom safety prompt appended to the system prompt */
    safetyInstructions?: string;
  };
  /** Metadata for analytics */
  metadata: {
    team: string;
    environment: "staging" | "production";
    version: string;
  };
}
```
Multi-Turn Memory Management
Long conversations exceed context window limits. The platform offers three memory strategies: full history (send all messages, simple but limited), sliding window (send the last N messages, loses early context), and rolling summarization (periodically summarize older messages into a compact summary that preserves key facts). Rolling summarization is the production-grade choice — it maintains conversation coherence over hundreds of turns while keeping the context window within budget.
"""Multi-turn memory management with rolling summarization."""
from __future__ import annotations
from dataclasses import dataclass
from anthropic import Anthropic
client = Anthropic()
@dataclass
class ConversationMemory:
"""Manages conversation history with summarization."""
summary: str = ""
recent_messages: list[dict] = None
max_recent: int = 10
total_turns: int = 0
def __post_init__(self):
if self.recent_messages is None:
self.recent_messages = []
def add_message(self, role: str, content: str):
"""Add a message and trigger summarization if needed."""
self.recent_messages.append({"role": role, "content": content})
self.total_turns += 1
if len(self.recent_messages) > self.max_recent:
self._summarize_oldest()
def _summarize_oldest(self):
"""Summarize the oldest half of messages into the running summary."""
cutoff = len(self.recent_messages) // 2
to_summarize = self.recent_messages[:cutoff]
self.recent_messages = self.recent_messages[cutoff:]
messages_text = "\n".join(
f"{m['role']}: {m['content']}" for m in to_summarize
)
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=512,
messages=[{
"role": "user",
"content": (
"Summarize this conversation segment concisely, "
"preserving key facts, decisions, and user preferences. "
"Previous summary: "
f"{self.summary or 'None'}\n\n"
f"New messages:\n{messages_text}"
),
}],
)
self.summary = response.content[0].text
def get_context(self) -> list[dict]:
"""Build the message list for the LLM call."""
context = []
if self.summary:
context.append({
"role": "user",
"content": (
f"[Conversation summary so far: {self.summary}]"
),
})
context.append({
"role": "assistant",
"content": "I understand the conversation context. How can I help?",
})
context.extend(self.recent_messages)
return contextTool Registry and Integration
The tool registry is a centralized catalog of capabilities that assistants can invoke. Each tool has a name, description (used in the LLM's tool schema), an input schema, and an executor function. Tools are registered once and can be assigned to any assistant by adding the tool name to the assistant's configuration. This decouples tool implementation from assistant logic — the platform team builds and maintains tools, and the prompt engineering team assigns them to assistants without writing code.
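A minimal registry sketch of this decoupling, assuming decorator-based registration — the `ToolSpec` record, `register_tool`, and the `lookup_order` tool are illustrative names, not part of the blueprint's API:

```python
"""Minimal tool registry sketch: name, description, input schema, executor."""
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolSpec:
    name: str
    description: str            # Surfaced in the LLM's tool schema
    input_schema: dict          # JSON Schema for the tool's arguments
    executor: Callable[[dict], Any]


_REGISTRY: dict[str, ToolSpec] = {}


def register_tool(name: str, description: str, input_schema: dict):
    """Decorator that registers an executor function under a tool name."""
    def wrap(fn: Callable[[dict], Any]) -> Callable[[dict], Any]:
        _REGISTRY[name] = ToolSpec(name, description, input_schema, fn)
        return fn
    return wrap


def tools_for_assistant(tool_names: list[str]) -> list[dict]:
    """Build the LLM tool schema for the tools named in an assistant config."""
    return [
        {
            "name": spec.name,
            "description": spec.description,
            "input_schema": spec.input_schema,
        }
        for name in tool_names
        if (spec := _REGISTRY.get(name)) is not None
    ]


@register_tool(
    "lookup_order",
    "Look up an order's status by order ID.",
    {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
)
def lookup_order(args: dict) -> dict:
    # A real implementation would call the order service here
    return {"order_id": args["order_id"], "status": "shipped"}
```

When the model requests a tool call, the runtime dispatches through `_REGISTRY[name].executor(args)`; assigning a tool to an assistant is just adding its name to the config's `tools` list.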
Input and Output Guardrails
Guardrails run as middleware in the conversation runtime. Input guardrails screen user messages before they reach the LLM — checking for prompt injection attempts, PII that should be redacted, and blocked topics. Output guardrails screen the LLM's response before it reaches the user — checking for off-topic responses, policy violations, and hallucinated information. Each guardrail returns a pass/fail/warn result with an explanation. Failed inputs return a canned response; failed outputs trigger a retry with additional safety instructions in the prompt.
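The output-side retry path described above can be sketched as follows — a minimal illustration with a toy length guardrail; `call_llm` and the fallback wording are placeholders, not platform APIs:

```python
"""Output guardrail sketch: retry with extra safety instructions on failure."""
from __future__ import annotations

from typing import Callable


def check_response_length(text: str, max_words: int = 200) -> bool:
    """Toy output guardrail: pass when the response fits the length budget."""
    return len(text.split()) <= max_words


def respond_with_guardrails(
    call_llm: Callable[[str, str], str],
    user_message: str,
    max_retries: int = 2,
) -> str:
    """Call the LLM, re-prompting with safety instructions until output passes."""
    extra_instructions = ""
    for _ in range(max_retries + 1):
        response = call_llm(user_message, extra_instructions)
        if check_response_length(response):
            return response
        # Failed output: retry with additional safety instructions in the prompt
        extra_instructions = (
            "Keep the response under 200 words and strictly on topic."
        )
    # Retries exhausted: fall back to a canned response
    return "I'm sorry, I can't provide a reliable answer to that right now."
```

The input-side checks that feed this loop are shown in the middleware module below.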
"""Input and output guardrail middleware."""
from __future__ import annotations
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable
class GuardrailResult(str, Enum):
PASS = "pass"
WARN = "warn"
BLOCK = "block"
@dataclass
class GuardrailCheck:
result: GuardrailResult
reason: str
modified_content: str | None = None # For redaction
# ---- Input Guardrails ----
def check_pii(text: str, handling: str) -> GuardrailCheck:
"""Detect and handle PII in user input."""
# Simple regex patterns for common PII
patterns = {
"email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
"phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
"ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}
found = {}
for pii_type, pattern in patterns.items():
matches = re.findall(pattern, text)
if matches:
found[pii_type] = matches
if not found:
return GuardrailCheck(GuardrailResult.PASS, "No PII detected")
if handling == "block":
return GuardrailCheck(
GuardrailResult.BLOCK,
f"PII detected: {', '.join(found.keys())}. Please remove personal information.",
)
if handling == "redact":
redacted = text
for pattern in patterns.values():
redacted = re.sub(pattern, "[REDACTED]", redacted)
return GuardrailCheck(
GuardrailResult.WARN,
"PII redacted from input",
modified_content=redacted,
)
return GuardrailCheck(GuardrailResult.PASS, "PII handling set to allow")
def check_blocked_topics(
text: str, blocked: list[str]
) -> GuardrailCheck:
"""Check if input references blocked topics."""
text_lower = text.lower()
for topic in blocked:
if topic.lower() in text_lower:
return GuardrailCheck(
GuardrailResult.BLOCK,
f"This topic ({topic}) is outside my scope. "
"Please contact the appropriate team.",
)
return GuardrailCheck(GuardrailResult.PASS, "No blocked topics")Conversation Analytics
Every conversation generates analytics at two levels: turn-level metrics (latency, token usage, tool invocations, guardrail triggers) and conversation-level metrics (total turns, resolution status, user satisfaction, escalation rate). These metrics are tagged with the assistant ID, allowing you to compare performance across assistants and identify which ones need prompt improvements or additional tools. The analytics pipeline feeds into the same observability stack described in the AI Observability blueprint.
| Metric | Level | Source | Purpose |
|---|---|---|---|
| Response latency | Turn | Runtime timer | Performance monitoring |
| Token usage | Turn | LLM API response | Cost attribution |
| Guardrail triggers | Turn | Guardrail middleware | Safety monitoring |
| Tool invocation rate | Turn | Tool registry | Feature usage tracking |
| Conversation length | Conversation | Message count | Efficiency analysis |
| Resolution rate | Conversation | User feedback or classifier | Quality measurement |
| Escalation rate | Conversation | Handoff events | Automation coverage |
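One way to capture the turn-level rows of this table is a small event record emitted by the runtime after each turn, tagged with the assistant ID for cross-assistant comparison. A sketch under assumed names — `TurnMetrics` and the list-backed sink stand in for a real event schema and queue producer:

```python
"""Turn-level analytics event sketch, tagged with the assistant ID."""
from __future__ import annotations

import time
from dataclasses import asdict, dataclass, field


@dataclass
class TurnMetrics:
    assistant_id: str
    conversation_id: str
    latency_ms: float              # Runtime timer
    input_tokens: int              # From the LLM API response
    output_tokens: int
    tools_invoked: list[str] = field(default_factory=list)
    guardrail_triggers: list[str] = field(default_factory=list)
    recorded_at: float = field(default_factory=time.time)


def emit_turn_metrics(metrics: TurnMetrics, sink: list) -> None:
    """Serialize the event and hand it to the analytics sink (e.g. a queue)."""
    sink.append(asdict(metrics))  # Stand-in for a real producer client
```

Downstream, these events roll up into the conversation-level metrics and feed the observability stack keyed by `assistant_id`.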
Add a lightweight feedback mechanism at the end of conversations — a simple thumbs up/down or a 1-5 rating. This costs almost nothing to implement but provides the ground-truth signal you need to evaluate assistant quality. Without it, you are flying blind on user satisfaction.
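The feedback mechanism really is this small — a sketch using a plain dict as the store (a real deployment would persist votes in PostgreSQL alongside the conversation):

```python
"""End-of-conversation feedback sketch: one thumbs vote per conversation."""
from __future__ import annotations


def record_feedback(
    store: dict[str, bool], conversation_id: str, thumbs_up: bool
) -> None:
    """Persist a thumbs up/down vote for a finished conversation."""
    store[conversation_id] = thumbs_up


def satisfaction_rate(store: dict[str, bool]) -> float:
    """Fraction of rated conversations with a thumbs-up; 0.0 when none rated."""
    if not store:
        return 0.0
    return sum(store.values()) / len(store)
```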
Memory summarization introduces a quality risk: important details from early in the conversation can be lost if the summarizer judges them irrelevant. Mitigate this by including 'preserve user preferences and explicit requests' in the summarization prompt, and by keeping the most recent unsummarized window large enough (10+ messages) to cover multi-step tasks.
Version History
1.0.0 · 2026-03-01
- Initial publication with multi-tenant conversational AI platform architecture
- Assistant configuration schema with declarative tool and guardrail assignment
- Rolling summarization memory management implementation
- Input/output guardrail middleware with PII handling
- Conversation analytics metrics and dashboard patterns