Key Takeaway
By the end of this blueprint you will have an automated document processing pipeline that ingests PDFs and images, extracts text via OCR with layout preservation, classifies documents by type, pulls structured fields using LLM-based extraction with Pydantic schemas, and routes low-confidence results to a human review queue.
Prerequisites
- Python 3.11+ with PyMuPDF or pdf2image for PDF handling
- Tesseract OCR installed, or access to a cloud OCR API (Google Document AI, AWS Textract)
- An LLM API key for classification and extraction (Anthropic or OpenAI)
- PostgreSQL for document metadata and extraction results
- A task queue (Celery, Temporal, or similar) for async processing
Pipeline Architecture
The pipeline follows an ingest-classify-extract-validate pattern. Documents enter through a file watcher or API endpoint, pass through a preprocessing stage for format normalization and OCR, get classified by document type using a fast LLM call, and then flow into type-specific extraction templates powered by structured LLM output. A human-in-the-loop review queue handles low-confidence extractions before data reaches downstream systems.
1. Ingest: Accept documents from API upload, email attachment, S3 bucket, or file system watcher. Normalize to a common internal format.
2. Preprocess: Convert PDFs to images, run OCR on scanned pages, extract native text from digital PDFs, and detect tables and layout structure.
3. Classify: Determine the document type (invoice, contract, report, form) using a fast LLM call or fine-tuned classifier.
4. Extract: Apply type-specific extraction schemas using structured LLM output. Each field gets a confidence score.
5. Validate: Run business rules (date format, numeric ranges, required fields). Route low-confidence results to human review.
6. Output: Write validated extractions to the database, trigger downstream workflows, and archive the source document.
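The six stages above can be chained into a single orchestration function. The sketch below uses trivial placeholder implementations for each stage (the stage functions and the `ProcessingResult` shape are illustrative, not a fixed API); only the flow and the review-routing decision are the point.

```python
"""Sketch of the ingest -> classify -> extract -> validate flow.

The stage functions are placeholders standing in for the real
components described in the steps above.
"""
from dataclasses import dataclass, field

REVIEW_THRESHOLD = 0.8  # route extractions below this to human review


@dataclass
class ProcessingResult:
    doc_type: str
    fields: dict
    needs_review: bool
    errors: list = field(default_factory=list)


# --- placeholder stages (swap in the real implementations) ---
def preprocess(raw: bytes) -> str:
    return raw.decode("utf-8", errors="ignore")

def classify(preview: str) -> tuple[str, float]:
    return ("invoice", 0.95) if "invoice" in preview.lower() else ("unknown", 0.3)

def extract(text: str, doc_type: str) -> dict:
    return {"total_amount": {"value": "100.00", "confidence": 0.9}}

def validate(fields: dict, doc_type: str) -> list:
    return [] if fields.get("total_amount", {}).get("value") else ["missing total"]


def process_document(raw: bytes) -> ProcessingResult:
    text = preprocess(raw)                      # OCR / native text
    doc_type, type_conf = classify(text[:2000])  # cheap type decision
    fields = extract(text, doc_type)             # schema-driven extraction
    errors = validate(fields, doc_type)          # business rules
    min_conf = min((f["confidence"] for f in fields.values()), default=0.0)
    needs_review = (
        bool(errors)
        or type_conf < REVIEW_THRESHOLD
        or min_conf < REVIEW_THRESHOLD
    )
    return ProcessingResult(doc_type, fields, needs_review, errors)
```

A failed business rule or a single low-confidence field is enough to flag the whole document, which keeps the routing logic conservative by default.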
OCR and Text Extraction
Not all PDFs are the same. Digital-native PDFs contain selectable text that can be extracted directly without OCR. Scanned PDFs are effectively images and require OCR. Your pipeline must detect which type it is and apply the right strategy. For digital PDFs, use a library like PyMuPDF to extract text with layout preservation. For scanned PDFs, convert each page to an image and run OCR with Tesseract or a cloud OCR service. The cloud services (Google Document AI, AWS Textract) produce significantly better results on low-quality scans and handwriting.
"""Document preprocessing: PDF text extraction and OCR."""
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
import fitz # PyMuPDF
@dataclass
class PageResult:
page_number: int
text: str
is_scanned: bool
confidence: float # OCR confidence, 1.0 for native text
def extract_text(pdf_path: Path) -> list[PageResult]:
"""Extract text from a PDF, using OCR for scanned pages.
Strategy:
1. Try native text extraction first (PyMuPDF).
2. If a page has fewer than 50 characters of native text,
treat it as scanned and apply OCR.
"""
doc = fitz.open(str(pdf_path))
results = []
for page_num, page in enumerate(doc):
native_text = page.get_text("text").strip()
if len(native_text) > 50:
# Digital page — use native text
results.append(PageResult(
page_number=page_num + 1,
text=native_text,
is_scanned=False,
confidence=1.0,
))
else:
# Scanned page — apply OCR
ocr_text, confidence = _ocr_page(page)
results.append(PageResult(
page_number=page_num + 1,
text=ocr_text,
is_scanned=True,
confidence=confidence,
))
doc.close()
return results
def _ocr_page(page) -> tuple[str, float]:
"""OCR a single page using Tesseract via PyMuPDF."""
# PyMuPDF can perform OCR directly with Tesseract
tp = page.get_textpage_ocr(flags=fitz.TEXT_PRESERVE_WHITESPACE)
text = page.get_text("text", textpage=tp).strip()
# Estimate confidence from character count vs page area
confidence = min(len(text) / 500, 1.0) # Rough heuristic
return text, confidenceDocument Classification
Classification determines which extraction schema to apply. A fast LLM call reads the first 500 tokens of the document and returns the document type. For high-volume pipelines, train a lightweight classifier (e.g., a fine-tuned DistilBERT) that runs locally without API calls. The LLM-based approach is more flexible and handles new document types without retraining, while the local classifier is faster and cheaper at scale.
"""Document type classification using LLM."""
from __future__ import annotations
from enum import Enum
from anthropic import Anthropic
client = Anthropic()
class DocumentType(str, Enum):
INVOICE = "invoice"
CONTRACT = "contract"
REPORT = "report"
FORM = "form"
LETTER = "letter"
UNKNOWN = "unknown"
async def classify_document(text_preview: str) -> tuple[DocumentType, float]:
"""Classify a document by type using the first 500 tokens.
Returns:
Tuple of (document_type, confidence_score).
"""
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=128,
messages=[{
"role": "user",
"content": (
"Classify this document. Return JSON with 'type' "
"(invoice, contract, report, form, letter, unknown) "
"and 'confidence' (0.0-1.0).\n\n"
f"Document preview:\n{text_preview[:2000]}"
),
}],
)
import json
data = json.loads(response.content[0].text)
return DocumentType(data["type"]), float(data["confidence"])LLM-Based Structured Extraction
The extraction stage is where LLMs shine. Each document type has a Pydantic schema defining the fields to extract. The LLM receives the full document text and the schema, and returns structured JSON matching the schema. Using Pydantic for schema definition gives you automatic validation, type coercion, and clear error messages when the LLM's output does not match the expected format. Add a confidence score field to every extracted value so downstream systems can decide whether to trust the extraction or flag it for review.
"""Structured extraction using LLM with Pydantic schemas."""
from __future__ import annotations
from datetime import date
from typing import Optional
from anthropic import Anthropic
from pydantic import BaseModel, Field
client = Anthropic()
class ExtractedField(BaseModel):
"""A single extracted field with confidence."""
value: str | float | None
confidence: float = Field(ge=0, le=1)
class InvoiceExtraction(BaseModel):
"""Extraction schema for invoices."""
vendor_name: ExtractedField
invoice_number: ExtractedField
invoice_date: ExtractedField
due_date: ExtractedField
total_amount: ExtractedField
currency: ExtractedField
line_items: list[dict] = Field(default_factory=list)
tax_amount: Optional[ExtractedField] = None
payment_terms: Optional[ExtractedField] = None
EXTRACTION_PROMPT = """Extract structured data from this {doc_type}.
Return JSON matching this exact schema. For each field, provide
the extracted value and a confidence score (0.0-1.0).
If a field is not present in the document, set value to null
and confidence to 0.0.
Schema fields: {schema_fields}
Document text:
{document_text}"""
async def extract_fields(
document_text: str,
doc_type: str,
schema_class: type[BaseModel],
) -> BaseModel:
"""Extract structured fields from a document.
Args:
document_text: Full text of the document.
doc_type: The classified document type.
schema_class: Pydantic model defining the extraction schema.
Returns:
Populated Pydantic model with extracted fields.
"""
schema_fields = ", ".join(schema_class.model_fields.keys())
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
messages=[{
"role": "user",
"content": EXTRACTION_PROMPT.format(
doc_type=doc_type,
schema_fields=schema_fields,
document_text=document_text[:8000],
),
}],
)
import json
data = json.loads(response.content[0].text)
return schema_class(**data)Human-in-the-Loop Review Queue
Not every extraction is trustworthy. Set a confidence threshold (e.g., 0.8) below which extractions are routed to a human review queue. The review UI shows the original document alongside the extracted fields, letting reviewers correct values and confirm or reject the extraction. Corrected extractions flow back into your training dataset for the next classifier fine-tuning cycle, creating a virtuous loop where the system improves over time.
Start with a high confidence threshold (0.9) and lower it as you validate the system's accuracy on your document types. This ensures high precision early on when trust in the system is being established. Track the human correction rate — if reviewers rarely change the extracted values, you can safely lower the threshold.
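The routing decision and the correction-rate signal can be sketched in a few lines. This is a minimal illustration, assuming each extracted field carries the per-field confidence score from the extraction stage; the in-memory list stands in for a real queue backed by a database table or message broker.

```python
"""Route low-confidence extractions to a review queue (sketch)."""

CONFIDENCE_THRESHOLD = 0.9  # start high, lower as accuracy is validated

review_queue: list[dict] = []  # stand-in for a DB table or message queue


def route_extraction(doc_id: str, fields: dict[str, dict]) -> str:
    """Send the extraction to review if any field falls below threshold."""
    low = [
        name for name, f in fields.items()
        if f["confidence"] < CONFIDENCE_THRESHOLD
    ]
    if low:
        review_queue.append({"doc_id": doc_id, "flagged_fields": low})
        return "review"
    return "auto_approve"


def correction_rate(reviewed: list[dict]) -> float:
    """Fraction of reviewed docs where a human changed at least one value.

    A consistently low rate is the signal that the threshold can be
    lowered safely.
    """
    if not reviewed:
        return 0.0
    changed = sum(1 for r in reviewed if r.get("corrected_fields"))
    return changed / len(reviewed)
```

Storing the flagged field names alongside the document ID lets the review UI highlight exactly which values need human attention rather than forcing a full re-check.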
Table extraction remains the hardest part of document processing. LLMs can extract table data when the table structure is clear, but complex multi-level headers, merged cells, and spanning rows often produce incorrect results. For mission-critical table extraction, use a dedicated table extraction model (like Microsoft's Table Transformer) as a preprocessing step before LLM extraction.
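One pragmatic bridge between a table detector and the LLM, sketched below, is to serialize the recovered cell grid into a markdown table before it enters the extraction prompt, since explicit row and column markup is easier for the model to follow than raw OCR text. The function assumes an upstream detector has already produced the header and rows as lists of strings; merged cells would need to be expanded beforehand.

```python
def table_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Serialize detected table cells into a markdown table for an LLM prompt.

    Assumes an upstream table detector has already recovered the cell
    grid; this only handles rendering, not structure recovery.
    """
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        # Pad short rows so every line has the same column count
        padded = row + [""] * (len(header) - len(row))
        lines.append("| " + " | ".join(padded[: len(header)]) + " |")
    return "\n".join(lines)
```

Padding short rows keeps the column count consistent, which matters because a ragged markdown table tends to confuse the model about which value belongs to which header.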
Version History
1.0.0 · 2026-03-01
- Initial publication with OCR, classification, and LLM extraction pipeline
- PyMuPDF-based text extraction with scanned page detection
- Pydantic schema-driven structured extraction with confidence scores
- Human-in-the-loop review queue pattern
- Table extraction guidance and limitations