Key Takeaway
By the end of this blueprint you will have an automated document processing pipeline that ingests PDFs and images, extracts text via OCR with layout preservation, classifies documents by type, pulls structured fields using LLM-based extraction with Pydantic schemas, and routes low-confidence results to a human review queue.
Prerequisites
- Python 3.11+ with PyMuPDF or pdf2image for PDF handling
- Tesseract OCR installed, or access to a cloud OCR API (Google Document AI, AWS Textract)
- An LLM API key for classification and extraction (Anthropic or OpenAI)
- PostgreSQL for document metadata and extraction results
- A task queue (Celery, Temporal, or similar) for async processing
Pipeline Architecture
The pipeline follows an ingest-classify-extract-validate pattern. Documents enter through a file watcher or API endpoint, pass through a preprocessing stage for format normalization and OCR, get classified by document type using a fast LLM call, and then flow into type-specific extraction templates powered by structured LLM output. A human-in-the-loop review queue handles low-confidence extractions before data reaches downstream systems.
1. Ingest: Accept documents from API upload, email attachment, S3 bucket, or file system watcher. Normalize to a common internal format.
2. Preprocess: Convert PDFs to images, run OCR on scanned pages, extract native text from digital PDFs, and detect tables and layout structure.
3. Classify: Determine the document type (invoice, contract, report, form) using a fast LLM call or fine-tuned classifier.
4. Extract: Apply type-specific extraction schemas using structured LLM output. Each field gets a confidence score.
5. Validate: Run business rules (date format, numeric ranges, required fields). Route low-confidence results to human review.
6. Output: Write validated extractions to the database, trigger downstream workflows, and archive the source document.
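The six stages above can be chained into a single orchestration function. The sketch below uses trivial placeholder implementations for each stage (the stage functions and the `ProcessingResult` shape are illustrative, not a fixed API); only the flow and the review-routing decision are the point.

```python
"""Sketch of the ingest -> classify -> extract -> validate flow.

The stage functions are placeholders standing in for the real
components described in the steps above.
"""
from dataclasses import dataclass, field

REVIEW_THRESHOLD = 0.8  # route extractions below this to human review


@dataclass
class ProcessingResult:
    doc_type: str
    fields: dict
    needs_review: bool
    errors: list = field(default_factory=list)


# --- placeholder stages (swap in the real implementations) ---
def preprocess(raw: bytes) -> str:
    return raw.decode("utf-8", errors="ignore")

def classify(preview: str) -> tuple[str, float]:
    return ("invoice", 0.95) if "invoice" in preview.lower() else ("unknown", 0.3)

def extract(text: str, doc_type: str) -> dict:
    return {"total_amount": {"value": "100.00", "confidence": 0.9}}

def validate(fields: dict, doc_type: str) -> list:
    return [] if fields.get("total_amount", {}).get("value") else ["missing total"]


def process_document(raw: bytes) -> ProcessingResult:
    text = preprocess(raw)                      # OCR / native text
    doc_type, type_conf = classify(text[:2000])  # cheap type decision
    fields = extract(text, doc_type)             # schema-driven extraction
    errors = validate(fields, doc_type)          # business rules
    min_conf = min((f["confidence"] for f in fields.values()), default=0.0)
    needs_review = (
        bool(errors)
        or type_conf < REVIEW_THRESHOLD
        or min_conf < REVIEW_THRESHOLD
    )
    return ProcessingResult(doc_type, fields, needs_review, errors)
```

A failed business rule or a single low-confidence field is enough to flag the whole document, which keeps the routing logic conservative by default.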
OCR and Text Extraction
Not all PDFs are the same. Digital-native PDFs contain selectable text that can be extracted directly without OCR. Scanned PDFs are effectively images and require OCR. Your pipeline must detect which type it is and apply the right strategy. For digital PDFs, use a library like PyMuPDF to extract text with layout preservation. For scanned PDFs, convert each page to an image and run OCR with Tesseract or a cloud OCR service. The cloud services (Google Document AI, AWS Textract) produce significantly better results on low-quality scans and handwriting.
"""Document preprocessing: PDF text extraction and OCR."""
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
import fitz # PyMuPDF
@dataclass
class PageResult:
page_number: int
text: str
is_scanned: bool
confidence: float # OCR confidence, 1.0 for native text
def extract_text(pdf_path: Path) -> list[PageResult]:
"""Extract text from a PDF, using OCR for scanned pages.
Strategy:
1. Try native text extraction first (PyMuPDF).
2. If a page has fewer than 50 characters of native text,
treat it as scanned and apply OCR.
"""
doc = fitz.open(str(pdf_path))
results = []
for page_num, page in enumerate(doc):
native_text = page.get_text("text").strip()
if len(native_text) > 50:
# Digital page — use native text
results.append(PageResult(
page_number=page_num + 1,
text=native_text,
is_scanned=False,
confidence=1.0,
))
else:
# Scanned page — apply OCR
ocr_text, confidence = _ocr_page(page)
results.append(PageResult(
page_number=page_num + 1,
text=ocr_text,
is_scanned=True,
confidence=confidence,
))
doc.close()
return results
def _ocr_page(page) -> tuple[str, float]:
"""OCR a single page using Tesseract via PyMuPDF."""
# PyMuPDF can perform OCR directly with Tesseract
tp = page.get_textpage_ocr(flags=fitz.TEXT_PRESERVE_WHITESPACE)
text = page.get_text("text", textpage=tp).strip()
# Estimate confidence from character count vs page area
confidence = min(len(text) / 500, 1.0) # Rough heuristic
return text, confidenceDocument Classification
Classification determines which extraction schema to apply. A fast LLM call reads the first 500 tokens of the document and returns the document type. For high-volume pipelines, train a lightweight classifier (e.g., a fine-tuned DistilBERT) that runs locally without API calls. The LLM-based approach is more flexible and handles new document types without retraining, while the local classifier is faster and cheaper at scale.
"""Document type classification using LLM."""
from __future__ import annotations
from enum import Enum
from anthropic import Anthropic
client = Anthropic()
class DocumentType(str, Enum):
INVOICE = "invoice"
CONTRACT = "contract"
REPORT = "report"
FORM = "form"
LETTER = "letter"
UNKNOWN = "unknown"
async def classify_document(text_preview: str) -> tuple[DocumentType, float]:
"""Classify a document by type using the first 500 tokens.
Returns:
Tuple of (document_type, confidence_score).
"""
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=128,
messages=[{
"role": "user",
"content": (
"Classify this document. Return JSON with 'type' "
"(invoice, contract, report, form, letter, unknown) "
"and 'confidence' (0.0-1.0).\n\n"
f"Document preview:\n{text_preview[:2000]}"
),
}],
)
import json
data = json.loads(response.content[0].text)
return DocumentType(data["type"]), float(data["confidence"])LLM-Based Structured Extraction
The extraction stage is where LLMs shine. Each document type has a Pydantic schema defining the fields to extract. The LLM receives the full document text and the schema, and returns structured JSON matching the schema. Using Pydantic for schema definition gives you automatic validation, type coercion, and clear error messages when the LLM's output does not match the expected format. Add a confidence score field to every extracted value so downstream systems can decide whether to trust the extraction or flag it for review.
"""Structured extraction using LLM with Pydantic schemas."""
from __future__ import annotations
from datetime import date
from typing import Optional
from anthropic import Anthropic
from pydantic import BaseModel, Field
client = Anthropic()
class ExtractedField(BaseModel):
"""A single extracted field with confidence."""
value: str | float | None
confidence: float = Field(ge=0, le=1)
class InvoiceExtraction(BaseModel):
"""Extraction schema for invoices."""
vendor_name: ExtractedField
invoice_number: ExtractedField
invoice_date: ExtractedField
due_date: ExtractedField
total_amount: ExtractedField
currency: ExtractedField
line_items: list[dict] = Field(default_factory=list)
tax_amount: Optional[ExtractedField] = None
payment_terms: Optional[ExtractedField] = None
EXTRACTION_PROMPT = """Extract structured data from this {doc_type}.
Return JSON matching this exact schema. For each field, provide
the extracted value and a confidence score (0.0-1.0).
If a field is not present in the document, set value to null
and confidence to 0.0.
Schema fields: {schema_fields}
Document text:
{document_text}"""
async def extract_fields(
document_text: str,
doc_type: str,
schema_class: type[BaseModel],
) -> BaseModel:
"""Extract structured fields from a document.
Args:
document_text: Full text of the document.
doc_type: The classified document type.
schema_class: Pydantic model defining the extraction schema.
Returns:
Populated Pydantic model with extracted fields.
"""
schema_fields = ", ".join(schema_class.model_fields.keys())
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
messages=[{
"role": "user",
"content": EXTRACTION_PROMPT.format(
doc_type=doc_type,
schema_fields=schema_fields,
document_text=document_text[:8000],
),
}],
)
import json
data = json.loads(response.content[0].text)
return schema_class(**data)Human-in-the-Loop Review Queue
Not every extraction is trustworthy. Set a confidence threshold (e.g., 0.8) below which extractions are routed to a human review queue. The review UI shows the original document alongside the extracted fields, letting reviewers correct values and confirm or reject the extraction. Corrected extractions flow back into your training dataset for the next classifier fine-tuning cycle, creating a virtuous loop where the system improves over time.
Start with a high confidence threshold (0.9) and lower it as you validate the system's accuracy on your document types. This ensures high precision early on when trust in the system is being established. Track the human correction rate — if reviewers rarely change the extracted values, you can safely lower the threshold.
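The routing decision and the correction-rate signal can be sketched in a few lines. This is a minimal illustration, assuming each extracted field carries the per-field confidence score from the extraction stage; the in-memory list stands in for a real queue backed by a database table or message broker.

```python
"""Route low-confidence extractions to a review queue (sketch)."""

CONFIDENCE_THRESHOLD = 0.9  # start high, lower as accuracy is validated

review_queue: list[dict] = []  # stand-in for a DB table or message queue


def route_extraction(doc_id: str, fields: dict[str, dict]) -> str:
    """Send the extraction to review if any field falls below threshold."""
    low = [
        name for name, f in fields.items()
        if f["confidence"] < CONFIDENCE_THRESHOLD
    ]
    if low:
        review_queue.append({"doc_id": doc_id, "flagged_fields": low})
        return "review"
    return "auto_approve"


def correction_rate(reviewed: list[dict]) -> float:
    """Fraction of reviewed docs where a human changed at least one value.

    A consistently low rate is the signal that the threshold can be
    lowered safely.
    """
    if not reviewed:
        return 0.0
    changed = sum(1 for r in reviewed if r.get("corrected_fields"))
    return changed / len(reviewed)
```

Storing the flagged field names alongside the document ID lets the review UI highlight exactly which values need human attention rather than forcing a full re-check.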
Table extraction remains the hardest part of document processing. LLMs can extract table data when the table structure is clear, but complex multi-level headers, merged cells, and spanning rows often produce incorrect results. For mission-critical table extraction, use a dedicated table extraction model (like Microsoft's Table Transformer) as a preprocessing step before LLM extraction.
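One pragmatic bridge between a table detector and the LLM, sketched below, is to serialize the recovered cell grid into a markdown table before it enters the extraction prompt, since explicit row and column markup is easier for the model to follow than raw OCR text. The function assumes an upstream detector has already produced the header and rows as lists of strings; merged cells would need to be expanded beforehand.

```python
def table_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Serialize detected table cells into a markdown table for an LLM prompt.

    Assumes an upstream table detector has already recovered the cell
    grid; this only handles rendering, not structure recovery.
    """
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        # Pad short rows so every line has the same column count
        padded = row + [""] * (len(header) - len(row))
        lines.append("| " + " | ".join(padded[: len(header)]) + " |")
    return "\n".join(lines)
```

Padding short rows keeps the column count consistent, which matters because a ragged markdown table tends to confuse the model about which value belongs to which header.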
Version History
1.0.0 · 2026-03-01
- Initial publication with OCR, classification, and LLM extraction pipeline
- PyMuPDF-based text extraction with scanned page detection
- Pydantic schema-driven structured extraction with confidence scores
- Human-in-the-loop review queue pattern
- Table extraction guidance and limitations