Template · Foundational · 1.0.0

ADR: Data Pipeline Architecture

Architecture Decision Record template for AI data pipeline design, evaluating batch vs. streaming, orchestration tools, and data quality gates.

10 min read · Updated Mar 2026 · Koundinya Lanka

template · adr · data-pipeline · etl · orchestration

Key Takeaway

Data pipeline ADRs should explicitly document where each data quality gate sits and how its failures are handled, because silent data quality issues are a common and hard-to-detect source of model degradation.

When to Use This Template

Use this ADR when building new data pipelines for ML training or inference, adding data quality validation to existing pipelines, or migrating between orchestration tools. Data pipeline architecture decisions affect model quality, training reproducibility, and operational reliability. This template forces the team to document data dependencies, quality gates, and failure handling before implementation.

ADR Template

adr-data-pipeline.md

# ADR: Data Pipeline Architecture

## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-XXX]

## Date
YYYY-MM-DD

## Decision Makers
- [Name, Role]

## Context

### Data Sources
| Source | Type | Volume | Freshness | Owner |
|--------|------|--------|-----------|-------|
| [Source 1] | [API/DB/File/Stream] | [e.g., "10GB/day"] | [e.g., "real-time"] | [team] |
| [Source 2] | | | | |

### Requirements
- Processing latency: [e.g., "batch: within 4 hours of source update"]
- Data quality SLA: [e.g., "< 0.1% null rate in required fields"]
- Retention policy: [e.g., "raw data 90 days, processed 1 year"]
- Compliance: [e.g., "PII must be masked before ML consumption"]

## Options Considered

| Criterion | Batch ETL | Streaming | Lambda | Kappa |
|-----------|-----------|-----------|--------|-------|
| Latency | Hours | Seconds | Both | Seconds |
| Complexity | Low | High | Very High | High |
| Cost | Low | Medium | High | Medium |
| Debugging | Easy | Hard | Very Hard | Hard |
| Reprocessing | Simple | Complex | Complex | Simple |

### Orchestration Options
| Tool | Managed | Learning Curve | Ecosystem | Cost |
|------|---------|---------------|-----------|------|
| [Tool A] | | | | |
| [Tool B] | | | | |

## Decision
We will use [architecture] with [orchestration tool] because [rationale].

### Data Quality Gates
| Gate | Location | Rules | Failure Action |
|------|----------|-------|----------------|
| Ingestion validation | After source extraction | Schema, nulls, range | Block pipeline |
| Transformation check | After transform stage | Referential integrity | Alert + continue |
| Pre-serving validation | Before feature store write | Distribution drift | Alert + block |

## Consequences
- Infrastructure: [compute, storage, networking requirements]
- Team skills: [training needed for selected tools]
- Operational burden: [on-call, monitoring, maintenance]
- Cost projection: [monthly compute and storage costs]

## Review Trigger
- [ ] Daily data volume exceeds [threshold]
- [ ] Quality gate failure rate exceeds [threshold]%
- [ ] Pipeline latency SLA breach rate exceeds [threshold]%
- [ ] New data source integration required

Section-by-Section Guidance

Data Quality Gates

The data quality gates section is the most impactful part of this ADR. Define exactly where in the pipeline data quality is validated, what rules are checked, and what happens when a check fails. The failure action is critical: some gates should block the pipeline entirely (ingestion validation), while others should alert but continue processing (non-critical field completeness). Be explicit about which gates are blocking vs. non-blocking.
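
The blocking vs. non-blocking distinction can be made concrete with a small gate runner. The following is a minimal sketch, not a prescribed implementation: `Gate`, `GateBlocked`, `run_gates`, `required_not_null`, and the field names `user_id` and `referrer` are all hypothetical names chosen for illustration, and rows are assumed to be plain dictionaries.

```python
from dataclasses import dataclass
from typing import Callable


class GateBlocked(Exception):
    """Raised when a blocking gate fails, which stops the pipeline run."""


@dataclass
class Gate:
    name: str
    check: Callable[[list], list]  # returns a list of violation messages
    blocking: bool                 # True: stop the pipeline; False: alert and continue


def run_gates(rows, gates, alert=print):
    """Run gates in order and return the names of non-blocking gates that failed."""
    warned = []
    for gate in gates:
        violations = gate.check(rows)
        if not violations:
            continue
        # Every failure is alerted, so no gate can fail silently.
        alert(f"[{gate.name}] {len(violations)} violation(s), e.g. {violations[:3]}")
        if gate.blocking:
            raise GateBlocked(gate.name)
        warned.append(gate.name)
    return warned


def required_not_null(field):
    """Build a check that reports every row where `field` is missing or None."""
    def check(rows):
        return [f"row {i}: {field} is null"
                for i, row in enumerate(rows) if row.get(field) is None]
    return check


# Hypothetical gate configuration mirroring the two failure actions described above.
gates = [
    Gate("ingestion_validation", required_not_null("user_id"), blocking=True),
    Gate("field_completeness", required_not_null("referrer"), blocking=False),
]
```

Calling `run_gates(rows, gates)` on data with a missing `referrer` alerts and returns `["field_completeness"]`, while a missing `user_id` raises `GateBlocked`. Keeping the `blocking` flag in the gate definition means the failure action is declared in one place, which is the property the ADR table is meant to capture.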

Batch vs. Streaming

Default to batch processing unless you have a specific latency requirement that demands streaming. Batch pipelines are simpler to debug, test, and reprocess. Many teams adopt streaming prematurely, adding operational complexity without a corresponding business need. If your data consumers can tolerate hourly or daily freshness, batch is almost always the right choice. Reserve streaming for use cases where sub-minute freshness is a hard requirement.
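
The rule of thumb above can be written down as a tiny decision helper, which is sometimes useful to paste into the ADR's rationale. This is a sketch under stated assumptions: the function name `recommend_processing_mode` and its parameters are hypothetical, and the one-minute cut-off is simply a literal reading of "sub-minute freshness" that a team may want to adjust.

```python
from datetime import timedelta

# Assumed cut-off: streaming is only considered when consumers need
# data fresher than one minute.
STREAMING_THRESHOLD = timedelta(minutes=1)


def recommend_processing_mode(required_freshness: timedelta,
                              hard_requirement: bool) -> str:
    """Return "streaming" only for a hard sub-minute freshness requirement."""
    if hard_requirement and required_freshness < STREAMING_THRESHOLD:
        return "streaming"
    # Default to batch: hourly or daily freshness, or a soft preference
    # for low latency, does not justify the extra operational complexity.
    return "batch"
```

For example, `recommend_processing_mode(timedelta(hours=1), hard_requirement=True)` yields `"batch"`, and a five-second requirement yields `"streaming"` only when it is marked as hard.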

Include a data lineage diagram in your ADR showing the flow from source to consumption, with quality gates marked. This single diagram communicates the pipeline architecture more effectively than paragraphs of text and becomes the primary reference for debugging data issues.
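
One lightweight way to produce such a diagram is to generate Mermaid flowchart text from an ordered list of stages. The sketch below assumes a strictly linear pipeline and that the ADR is rendered somewhere Mermaid is supported; the function name `lineage_mermaid` and the stage labels are hypothetical, with the gate positions taken from the example gate table in the template.

```python
def lineage_mermaid(stages):
    """Render (name, is_gate) pairs as a left-to-right Mermaid flowchart."""
    lines = ["flowchart LR"]
    for i, (name, is_gate) in enumerate(stages):
        # Quality gates are drawn as hexagons so they stand out from
        # ordinary processing stages, which are drawn as rectangles.
        shape = f'{{{{"{name}"}}}}' if is_gate else f'["{name}"]'
        lines.append(f"    n{i}{shape}")
    for i in range(len(stages) - 1):
        lines.append(f"    n{i} --> n{i + 1}")
    return "\n".join(lines)


# Hypothetical linear pipeline with the three gates from the template.
stages = [
    ("Source extraction", False),
    ("Ingestion validation", True),
    ("Transform", False),
    ("Transformation check", True),
    ("Pre-serving validation", True),
    ("Feature store", False),
]

diagram = lineage_mermaid(stages)
```

The resulting `diagram` string can be pasted into a `mermaid` block in the ADR. Pipelines with branches or multiple sources would need an explicit edge list instead of the implied linear order used here.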

Never skip the failure handling specification. Undocumented failure handling defaults to silent failure, which means bad data reaches your models without anyone noticing until model quality degrades in production.
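
A simple review aid is to check the ADR's gate table for rows that leave the failure action blank. The following is a minimal sketch that assumes the table uses the `Gate` and `Failure Action` column headers from the template and standard pipe-delimited Markdown rows; the function name `gates_missing_failure_action` is hypothetical.

```python
def gates_missing_failure_action(table_md: str) -> list[str]:
    """Return the names of gates whose 'Failure Action' cell is empty."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table_md.strip().splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator row
    gate_col = header.index("Gate")
    action_col = header.index("Failure Action")
    # A row is flagged if the cell is blank or the row is too short to have it.
    return [row[gate_col] for row in body
            if len(row) <= action_col or not row[action_col]]
```

Run against a draft ADR, this returns the gates that still need a documented failure action, which could be wired into a pre-merge check so an incomplete table is caught during review rather than in production.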

Version History

1.0.0 · 2026-03-01

• Initial ADR template for data pipeline architecture
