Key Takeaway
By the end of this blueprint you will have a production-grade vector search system built on pgvector with HNSW indexing: sub-50ms queries at millions of vectors, pre-filtered metadata search, a result caching layer, and a re-ranking stage that maximizes precision for RAG and semantic search applications.
Prerequisites
- PostgreSQL 16+ with pgvector 0.7+ extension
- Understanding of embedding models and vector similarity concepts
- Python 3.11+ for the query and ingestion code
- Redis for query result caching
- Familiarity with PostgreSQL EXPLAIN ANALYZE for query tuning
Index Algorithm Selection
The choice of index algorithm determines the tradeoff between query speed, recall accuracy, memory usage, and build time. HNSW (Hierarchical Navigable Small World) is the best general-purpose choice: it provides excellent recall with logarithmic query time and supports incremental inserts without full index rebuilds. IVF (Inverted File Index) uses less memory but requires periodic rebuilds and has lower recall at the same latency. For very large corpora that do not fit in RAM, DiskANN style indices enable disk-resident search with acceptable latency.
| Algorithm | Query Speed | Recall@10 | Memory | Incremental Insert | Best For |
|---|---|---|---|---|---|
| HNSW | 1-10ms | 95-99% | High (2-4x vectors) | Yes | General purpose, <50M vectors |
| IVF-Flat | 5-20ms | 90-95% | Low (1x vectors) | Requires rebuild | Cost-sensitive, batch updates |
| IVF-PQ | 2-10ms | 85-92% | Very low (0.1x) | Requires rebuild | Billions of vectors, memory-constrained |
| DiskANN | 10-50ms | 95-98% | Minimal RAM | Limited | Disk-resident, very large corpora |
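For comparison with the HNSW setup below, an IVF-Flat index in pgvector trades recall and rebuild convenience for memory. A sketch, assuming the same `document_chunks` table; the `lists` value is a starting-point heuristic, not a universal constant:

```sql
-- IVF-Flat alternative: ~1x vector memory, but rebuild after bulk updates.
-- lists ≈ sqrt(row_count) is a common starting point (e.g. 1000 for ~1M rows).
CREATE INDEX idx_embeddings_ivf ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);

-- probes trades latency for recall at query time (default 1).
SET ivfflat.probes = 10;
```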
pgvector HNSW Tuning
pgvector's HNSW index has two critical parameters: m (the number of connections per node, controlling graph density) and ef_construction (the beam width during index building, controlling build-time recall). Higher values produce better recall but consume more memory and build time. At query time, ef_search controls the beam width and directly trades latency for recall. The defaults are conservative — tuning these parameters for your specific dataset and latency budget can improve recall by 5-10 percentage points.
```sql
-- Drop and recreate HNSW index with tuned parameters
DROP INDEX IF EXISTS idx_embeddings_hnsw;

-- m = 24: more connections per node (default 16).
--   Higher recall, more memory, slower builds.
-- ef_construction = 400: wider beam during build (default 64).
--   Much better recall, significantly slower build.
CREATE INDEX idx_embeddings_hnsw ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 400);

-- Set query-time search beam width.
-- Higher ef_search = better recall but slower queries.
-- Test values: 100 (fast), 200 (balanced), 400 (high recall)
SET hnsw.ef_search = 200;

-- Verify the index is being used
EXPLAIN ANALYZE
SELECT id, content, embedding <=> $1::vector AS distance
FROM document_chunks
ORDER BY embedding <=> $1::vector
LIMIT 10;
```

Pre-Filtered Metadata Search
Most real-world vector searches require metadata filters: search within a specific tenant, document type, date range, or permission scope. The naive approach — search first, then filter results — is wasteful because you retrieve and score irrelevant vectors. Pre-filtered search applies metadata predicates before the vector scan, dramatically reducing the search space. In pgvector, this means combining a GIN index on JSONB metadata with the HNSW vector index using compound WHERE clauses.
"""Pre-filtered vector search with metadata constraints."""
from __future__ import annotations
from typing import Any, List
import asyncpg
async def filtered_vector_search(
pool: asyncpg.Pool,
query_embedding: List[float],
filters: dict[str, Any],
top_k: int = 10,
ef_search: int = 200,
) -> list[dict]:
"""Vector search with pre-applied metadata filters.
Args:
pool: Database connection pool.
query_embedding: Query vector.
filters: Metadata key-value pairs to filter on.
top_k: Number of results to return.
ef_search: HNSW beam width for this query.
Returns:
List of matching documents with scores.
"""
# Build dynamic WHERE clause from filters
conditions = []
params: list[Any] = [str(query_embedding), top_k]
param_idx = 3
for key, value in filters.items():
if isinstance(value, list):
# Array containment: metadata @> '{"tags": ["python"]}'
conditions.append(
f"metadata @> $" + str(param_idx) + "::jsonb"
)
import json
params.append(json.dumps({key: value}))
else:
conditions.append(
f"metadata->>'{key}' = $" + str(param_idx)
)
params.append(str(value))
param_idx += 1
where = " AND ".join(conditions) if conditions else "TRUE"
async with pool.acquire() as conn:
# Set ef_search for this session
await conn.execute(f"SET hnsw.ef_search = {ef_search}")
rows = await conn.fetch(
f"""
SELECT id, doc_id, content, metadata,
embedding <=> $1::vector AS distance
FROM document_chunks
WHERE {where}
ORDER BY embedding <=> $1::vector
LIMIT $2
""",
*params,
)
return [
{
"id": r["id"],
"doc_id": r["doc_id"],
"content": r["content"],
"metadata": dict(r["metadata"]),
"distance": float(r["distance"]),
}
for r in rows
]Result Caching
Vector search queries are expensive — each one involves scanning a graph index and computing distance functions. For applications with repeated or similar queries (customer support, FAQ lookup), caching results dramatically reduces database load and query latency. Use a semantic cache that hashes the query embedding to a fixed-width key and stores the top-k results with a TTL. For exact query matches, the cache returns sub-millisecond results. For near-matches, consider a secondary lookup that retrieves cached results for the nearest cached query.
Set cache TTL based on your corpus update frequency. If your corpus updates hourly, set a 30-minute TTL so cached results are never more than one cycle stale. For static corpora, TTLs of 24 hours or longer are appropriate and dramatically reduce database load.
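One way to implement the exact-match layer of such a cache is to round the embedding, hash it to a fixed-width key alongside the filters, and store serialized results under a TTL. A minimal sketch; the client is injected so any object with Redis-style `get`/`setex` methods works, and the key prefix and precision are illustrative choices:

```python
"""Semantic result cache keyed on a hash of the query embedding."""
from __future__ import annotations

import hashlib
import json
from typing import Any, Protocol


class CacheClient(Protocol):
    """Subset of the redis-py interface this sketch relies on."""

    def get(self, key: str) -> bytes | None: ...
    def setex(self, key: str, ttl: int, value: str) -> None: ...


def cache_key(
    embedding: list[float],
    filters: dict[str, Any] | None = None,
    precision: int = 6,
) -> str:
    """Hash the rounded embedding plus filters to a fixed-width key.

    Rounding makes the key stable against low-order float noise from
    re-encoding the same query text.
    """
    payload = json.dumps(
        {"v": [round(x, precision) for x in embedding], "f": filters or {}},
        sort_keys=True,
    )
    return "vsearch:" + hashlib.sha256(payload.encode()).hexdigest()


def cached_results(cache: CacheClient, key: str) -> list[dict] | None:
    """Return cached top-k results for an exact key match, if any."""
    raw = cache.get(key)
    return json.loads(raw) if raw is not None else None


def store_results(
    cache: CacheClient, key: str, results: list[dict], ttl_s: int = 1800
) -> None:
    """Cache results with a TTL matched to corpus update frequency."""
    cache.setex(key, ttl_s, json.dumps(results))
```

In the query path, compute the key, try `cached_results` first, and only fall through to the database search on a miss, calling `store_results` afterwards.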
Re-Ranking for Precision
Bi-encoder vector search optimizes for recall — finding all potentially relevant documents. Cross-encoder re-ranking optimizes for precision — scoring each candidate against the query with full attention. The two-stage approach (retrieve 20-50 candidates with the vector index, re-rank to top 5 with a cross-encoder) consistently outperforms either stage alone. The cross-encoder is computationally expensive but only runs on the small candidate set, keeping total latency under budget.
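The second stage can be kept model-agnostic by injecting the scorer. A sketch of that shape; `score_fn` stands in for any batch scorer over (query, passage) pairs, for example the `predict` method of a sentence-transformers `CrossEncoder` (an assumption about your model stack, not a requirement):

```python
"""Two-stage retrieval: wide vector recall, then precise re-ranking."""
from __future__ import annotations

from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[dict],
    score_fn: Callable[[Sequence[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[dict]:
    """Re-score candidates against the query and keep the best top_n.

    score_fn takes (query, passage) pairs and returns one relevance
    score per pair -- the batch interface a cross-encoder exposes.
    """
    pairs = [(query, c["content"]) for c in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    return [
        {**cand, "rerank_score": float(score)}
        for cand, score in ranked[:top_n]
    ]
```

A typical flow retrieves 30-50 candidates with the filtered vector search above, then calls `rerank(...)` to cut them to the final 5 sent to the LLM or the user.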
When you change your embedding model, all existing vectors become incompatible. Plan for a full re-embedding migration: generate new embeddings in a shadow column, build a new index, switch the query path atomically, then drop the old column. Never mix embeddings from different models in the same index.
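The shadow-column migration might look like the following in pgvector; the column and index names are illustrative, and `vector(1024)` assumes the replacement model emits 1024-dimensional embeddings:

```sql
-- 1. Add a shadow column for the new model's embeddings.
ALTER TABLE document_chunks ADD COLUMN embedding_v2 vector(1024);

-- 2. Backfill embedding_v2 from the re-embedding pipeline, then index it
--    without blocking writes.
CREATE INDEX CONCURRENTLY idx_embeddings_v2_hnsw ON document_chunks
USING hnsw (embedding_v2 vector_cosine_ops)
WITH (m = 24, ef_construction = 400);

-- 3. Switch the query path to embedding_v2, then retire the old column.
DROP INDEX idx_embeddings_hnsw;
ALTER TABLE document_chunks DROP COLUMN embedding;
```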
Scaling Beyond a Single Node
For corpora exceeding tens of millions of vectors, a single PostgreSQL instance becomes a bottleneck. Two scaling strategies work: vertical scaling (larger instance with more RAM to keep the index resident) and horizontal sharding (partition vectors across multiple instances by tenant ID or content hash). For multi-tenant SaaS applications, tenant-based sharding is natural — each tenant's vectors live on a dedicated shard, and the query coordinator routes to the correct shard based on the request's tenant context. This also provides data isolation for free.
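The routing step can be as simple as a stable hash over the tenant ID. A sketch, assuming the coordinator holds one connection pool per shard (pool setup elided); the function name is illustrative:

```python
"""Route multi-tenant vector queries to a per-tenant shard."""
from __future__ import annotations

import hashlib


def shard_for_tenant(tenant_id: str, num_shards: int) -> int:
    """Map a tenant to a shard with a stable, process-independent hash.

    Python's built-in hash() is salted per process, so use sha256 to
    keep routing consistent across coordinator restarts.
    """
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

The coordinator indexes into its list of shard pools with this value. Because resharding means physically moving a tenant's vectors, it helps to pair the hash with a small tenant-to-shard override table consulted first during migrations.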
Version History
1.0.0 · 2026-03-01
- Initial publication with pgvector HNSW tuning guide
- Index algorithm comparison table (HNSW, IVF, DiskANN)
- Pre-filtered metadata search implementation
- Result caching and cross-encoder re-ranking patterns
- Scaling strategies for multi-tenant deployments