Key Takeaway
By the end of this blueprint you will have a production-grade vector search system built on pgvector with HNSW indexing: sub-50ms queries at millions of vectors, pre-filtered metadata search, a result caching layer, and a re-ranking stage that maximizes precision for RAG and semantic search applications.
Prerequisites
- PostgreSQL 16+ with pgvector 0.7+ extension
- Understanding of embedding models and vector similarity concepts
- Python 3.11+ for the query and ingestion code
- Redis for query result caching
- Familiarity with PostgreSQL EXPLAIN ANALYZE for query tuning
Index Algorithm Selection
The choice of index algorithm determines the tradeoff between query speed, recall accuracy, memory usage, and build time. HNSW (Hierarchical Navigable Small World) is the best general-purpose choice: it provides excellent recall with logarithmic query time and supports incremental inserts without full index rebuilds. IVF (Inverted File Index) uses less memory but requires periodic rebuilds and has lower recall at the same latency. For very large corpora that do not fit in RAM, DiskANN style indices enable disk-resident search with acceptable latency.
| Algorithm | Query Speed | Recall@10 | Memory | Incremental Insert | Best For |
|---|---|---|---|---|---|
| HNSW | 1-10ms | 95-99% | High (2-4x vectors) | Yes | General purpose, <50M vectors |
| IVF-Flat | 5-20ms | 90-95% | Low (1x vectors) | Requires rebuild | Cost-sensitive, batch updates |
| IVF-PQ | 2-10ms | 85-92% | Very low (0.1x) | Requires rebuild | Billions of vectors, memory-constrained |
| DiskANN | 10-50ms | 95-98% | Minimal RAM | Limited | Disk-resident, very large corpora |
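For comparison with the HNSW setup below, an IVF-Flat index in pgvector trades recall and rebuild convenience for memory. A sketch, assuming the same `document_chunks` table; the `lists` value is a starting-point heuristic, not a universal constant:

```sql
-- IVF-Flat alternative: ~1x vector memory, but rebuild after bulk updates.
-- lists ≈ sqrt(row_count) is a common starting point (e.g. 1000 for ~1M rows).
CREATE INDEX idx_embeddings_ivf ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);

-- probes trades latency for recall at query time (default 1).
SET ivfflat.probes = 10;
```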
pgvector HNSW Tuning
pgvector's HNSW index has two critical parameters: m (the number of connections per node, controlling graph density) and ef_construction (the beam width during index building, controlling build-time recall). Higher values produce better recall but consume more memory and build time. At query time, ef_search controls the beam width and directly trades latency for recall. The defaults are conservative — tuning these parameters for your specific dataset and latency budget can improve recall by 5-10 percentage points.
```sql
-- Drop and recreate HNSW index with tuned parameters
DROP INDEX IF EXISTS idx_embeddings_hnsw;

-- m = 24: more connections per node (default 16).
--   Higher recall, more memory, slower builds.
-- ef_construction = 400: wider beam during build (default 64).
--   Much better recall, significantly slower build.
CREATE INDEX idx_embeddings_hnsw ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 400);

-- Set query-time search beam width.
-- Higher ef_search = better recall but slower queries.
-- Test values: 100 (fast), 200 (balanced), 400 (high recall)
SET hnsw.ef_search = 200;

-- Verify the index is being used
EXPLAIN ANALYZE
SELECT id, content, embedding <=> $1::vector AS distance
FROM document_chunks
ORDER BY embedding <=> $1::vector
LIMIT 10;
```

Pre-Filtered Metadata Search
Most real-world vector searches require metadata filters: search within a specific tenant, document type, date range, or permission scope. The naive approach — search first, then filter results — is wasteful because you retrieve and score irrelevant vectors. Pre-filtered search applies metadata predicates before the vector scan, dramatically reducing the search space. In pgvector, this means combining a GIN index on JSONB metadata with the HNSW vector index using compound WHERE clauses.
"""Pre-filtered vector search with metadata constraints."""
from __future__ import annotations
from typing import Any, List
import asyncpg
async def filtered_vector_search(
pool: asyncpg.Pool,
query_embedding: List[float],
filters: dict[str, Any],
top_k: int = 10,
ef_search: int = 200,
) -> list[dict]:
"""Vector search with pre-applied metadata filters.
Args:
pool: Database connection pool.
query_embedding: Query vector.
filters: Metadata key-value pairs to filter on.
top_k: Number of results to return.
ef_search: HNSW beam width for this query.
Returns:
List of matching documents with scores.
"""
# Build dynamic WHERE clause from filters
conditions = []
params: list[Any] = [str(query_embedding), top_k]
param_idx = 3
for key, value in filters.items():
if isinstance(value, list):
# Array containment: metadata @> '{"tags": ["python"]}'
conditions.append(
f"metadata @> $" + str(param_idx) + "::jsonb"
)
import json
params.append(json.dumps({key: value}))
else:
conditions.append(
f"metadata->>'{key}' = $" + str(param_idx)
)
params.append(str(value))
param_idx += 1
where = " AND ".join(conditions) if conditions else "TRUE"
async with pool.acquire() as conn:
# Set ef_search for this session
await conn.execute(f"SET hnsw.ef_search = {ef_search}")
rows = await conn.fetch(
f"""
SELECT id, doc_id, content, metadata,
embedding <=> $1::vector AS distance
FROM document_chunks
WHERE {where}
ORDER BY embedding <=> $1::vector
LIMIT $2
""",
*params,
)
return [
{
"id": r["id"],
"doc_id": r["doc_id"],
"content": r["content"],
"metadata": dict(r["metadata"]),
"distance": float(r["distance"]),
}
for r in rows
]Result Caching
Vector search queries are expensive — each one involves scanning a graph index and computing distance functions. For applications with repeated or similar queries (customer support, FAQ lookup), caching results dramatically reduces database load and query latency. Use a semantic cache that hashes the query embedding to a fixed-width key and stores the top-k results with a TTL. For exact query matches, the cache returns sub-millisecond results. For near-matches, consider a secondary lookup that retrieves cached results for the nearest cached query.
Set cache TTL based on your corpus update frequency. If your corpus updates hourly, set a 30-minute TTL so cached results are never more than one cycle stale. For static corpora, TTLs of 24 hours or longer are appropriate and dramatically reduce database load.
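One way to implement the exact-match layer of such a cache is to round the embedding, hash it to a fixed-width key alongside the filters, and store serialized results under a TTL. A minimal sketch; the client is injected so any object with Redis-style `get`/`setex` methods works, and the key prefix and precision are illustrative choices:

```python
"""Semantic result cache keyed on a hash of the query embedding."""
from __future__ import annotations

import hashlib
import json
from typing import Any, Protocol


class CacheClient(Protocol):
    """Subset of the redis-py interface this sketch relies on."""

    def get(self, key: str) -> bytes | None: ...
    def setex(self, key: str, ttl: int, value: str) -> None: ...


def cache_key(
    embedding: list[float],
    filters: dict[str, Any] | None = None,
    precision: int = 6,
) -> str:
    """Hash the rounded embedding plus filters to a fixed-width key.

    Rounding makes the key stable against low-order float noise from
    re-encoding the same query text.
    """
    payload = json.dumps(
        {"v": [round(x, precision) for x in embedding], "f": filters or {}},
        sort_keys=True,
    )
    return "vsearch:" + hashlib.sha256(payload.encode()).hexdigest()


def cached_results(cache: CacheClient, key: str) -> list[dict] | None:
    """Return cached top-k results for an exact key match, if any."""
    raw = cache.get(key)
    return json.loads(raw) if raw is not None else None


def store_results(
    cache: CacheClient, key: str, results: list[dict], ttl_s: int = 1800
) -> None:
    """Cache results with a TTL matched to corpus update frequency."""
    cache.setex(key, ttl_s, json.dumps(results))
```

In the query path, compute the key, try `cached_results` first, and only fall through to the database search on a miss, calling `store_results` afterwards.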
Re-Ranking for Precision
Bi-encoder vector search optimizes for recall — finding all potentially relevant documents. Cross-encoder re-ranking optimizes for precision — scoring each candidate against the query with full attention. The two-stage approach (retrieve 20-50 candidates with the vector index, re-rank to top 5 with a cross-encoder) consistently outperforms either stage alone. The cross-encoder is computationally expensive but only runs on the small candidate set, keeping total latency under budget.
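The second stage can be kept model-agnostic by injecting the scorer. A sketch of that shape; `score_fn` stands in for any batch scorer over (query, passage) pairs, for example the `predict` method of a sentence-transformers `CrossEncoder` (an assumption about your model stack, not a requirement):

```python
"""Two-stage retrieval: wide vector recall, then precise re-ranking."""
from __future__ import annotations

from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[dict],
    score_fn: Callable[[Sequence[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[dict]:
    """Re-score candidates against the query and keep the best top_n.

    score_fn takes (query, passage) pairs and returns one relevance
    score per pair -- the batch interface a cross-encoder exposes.
    """
    pairs = [(query, c["content"]) for c in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    return [
        {**cand, "rerank_score": float(score)}
        for cand, score in ranked[:top_n]
    ]
```

A typical flow retrieves 30-50 candidates with the filtered vector search above, then calls `rerank(...)` to cut them to the final 5 sent to the LLM or the user.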
When you change your embedding model, all existing vectors become incompatible. Plan for a full re-embedding migration: generate new embeddings in a shadow column, build a new index, switch the query path atomically, then drop the old column. Never mix embeddings from different models in the same index.
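The shadow-column migration might look like the following in pgvector; the column and index names are illustrative, and `vector(1024)` assumes the replacement model emits 1024-dimensional embeddings:

```sql
-- 1. Add a shadow column for the new model's embeddings.
ALTER TABLE document_chunks ADD COLUMN embedding_v2 vector(1024);

-- 2. Backfill embedding_v2 from the re-embedding pipeline, then index it
--    without blocking writes.
CREATE INDEX CONCURRENTLY idx_embeddings_v2_hnsw ON document_chunks
USING hnsw (embedding_v2 vector_cosine_ops)
WITH (m = 24, ef_construction = 400);

-- 3. Switch the query path to embedding_v2, then retire the old column.
DROP INDEX idx_embeddings_hnsw;
ALTER TABLE document_chunks DROP COLUMN embedding;
```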
Scaling Beyond a Single Node
For corpora exceeding tens of millions of vectors, a single PostgreSQL instance becomes a bottleneck. Two scaling strategies work: vertical scaling (larger instance with more RAM to keep the index resident) and horizontal sharding (partition vectors across multiple instances by tenant ID or content hash). For multi-tenant SaaS applications, tenant-based sharding is natural — each tenant's vectors live on a dedicated shard, and the query coordinator routes to the correct shard based on the request's tenant context. This also provides data isolation for free.
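The routing step can be as simple as a stable hash over the tenant ID. A sketch, assuming the coordinator holds one connection pool per shard (pool setup elided); the function name is illustrative:

```python
"""Route multi-tenant vector queries to a per-tenant shard."""
from __future__ import annotations

import hashlib


def shard_for_tenant(tenant_id: str, num_shards: int) -> int:
    """Map a tenant to a shard with a stable, process-independent hash.

    Python's built-in hash() is salted per process, so use sha256 to
    keep routing consistent across coordinator restarts.
    """
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

The coordinator indexes into its list of shard pools with this value. Because resharding means physically moving a tenant's vectors, it helps to pair the hash with a small tenant-to-shard override table consulted first during migrations.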
Version History
1.0.0 · 2026-03-01
- Initial publication with pgvector HNSW tuning guide
- Index algorithm comparison table (HNSW, IVF, DiskANN)
- Pre-filtered metadata search implementation
- Result caching and cross-encoder re-ranking patterns
- Scaling strategies for multi-tenant deployments