Key Takeaway
Investing in data lineage and quality scoring early prevents costly model retraining cycles and simplifies regulatory compliance audits. Data governance for AI extends traditional data management with ML-specific concerns: training data provenance, consent tracking for model training use, feature store management, and retention policies that balance retraining needs with deletion obligations.
Prerequisites
- An existing data catalog or inventory of data assets used across the organization
- Understanding of which datasets feed into ML training, evaluation, and inference pipelines
- Familiarity with applicable data protection regulations (GDPR, CCPA, sector-specific rules)
- Access to data pipeline orchestration tools (Airflow, Dagster, Prefect, or similar)
- A data classification scheme or willingness to implement one
Why AI Changes Data Governance
Traditional data governance focuses on data at rest and data in transit: who can access what data, how long it is retained, and where it is stored. AI introduces a third dimension: data in training. When data is used to train a model, information from that data becomes encoded in model weights in ways that are difficult to audit, impossible to surgically remove, and potentially subject to memorization and regurgitation. This means that data governance for AI must extend its scope to cover the entire lifecycle from raw data collection through model training, evaluation, deployment, and eventual model retirement.
The regulatory implications are significant. GDPR's right to erasure requires the ability to delete personal data, but deleting the original training record does not remove its influence from a trained model. The EU AI Act requires documentation of training data sources, quality measures, and potential biases. CCPA grants consumers the right to know what data is collected and how it is used, including for AI training purposes. Meeting these requirements without a systematic data governance approach is effectively impossible at scale.
Data Classification for AI
AI data classification extends standard sensitivity tiers with training-specific metadata. Every dataset must be tagged not only with its sensitivity level but also with its suitability for AI training, consent status for ML use, known biases or limitations, and temporal validity window. This metadata enables automated policy enforcement: a pipeline cannot use a dataset for training if its consent status does not include ML training authorization.
| Classification | Description | AI Training Allowed | Consent Required | Retention Rules |
|---|---|---|---|---|
| Public | Publicly available data, open datasets, published research | Yes, with license compliance check | Attribution per license terms | Standard retention, archive after model sunset |
| Internal | Business data, operational logs, product telemetry | Yes, with purpose limitation review | Employee/user consent for AI training use | Retain while model is active, delete on model retirement |
| Confidential | Customer data, financial records, HR data | Only with explicit consent and DPO approval | Explicit opt-in consent required | Strict retention periods, deletion verification required |
| Restricted | PII, health data, biometric data, children's data | Only with legal review and enhanced safeguards | Explicit consent + legal basis documentation | Minimum retention, encrypted at rest, audit all access |
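The classification metadata above can drive automated policy enforcement in a pipeline gate. The sketch below is a minimal illustration: the tier names mirror the table, but the specific approval flags (`purpose_review`, `dpo_approval`, `legal_review`) and the `can_train` helper are assumptions for demonstration, not a prescribed schema.

```python
"""Illustrative policy gate driven by the classification tiers above."""

# Per-tier training rules mirroring the classification table.
# The approval names are illustrative assumptions.
POLICY = {
    "public": {"training": True, "requires": []},
    "internal": {"training": True, "requires": ["purpose_review"]},
    "confidential": {"training": True, "requires": ["explicit_consent", "dpo_approval"]},
    "restricted": {"training": True, "requires": ["explicit_consent", "legal_review"]},
}


def can_train(classification: str, approvals: set[str]) -> bool:
    """Return True if a dataset with this classification and the
    recorded approvals may be used for model training."""
    rule = POLICY.get(classification)
    if rule is None or not rule["training"]:
        # Unknown classification: fail closed.
        return False
    return set(rule["requires"]) <= approvals
```

A confidential dataset with explicit consent but no DPO approval is blocked until the approval is recorded; an unrecognized classification fails closed rather than open.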
Data Lineage for ML Pipelines
Data lineage in ML pipelines must trace the complete path from raw data source to model prediction. This means tracking which datasets were used for training, how they were preprocessed, which features were derived, which model version was trained on which data version, and which predictions were made with which model. This end-to-end traceability is required for regulatory compliance, incident investigation, and reproducibility.
"""Data lineage tracking for ML pipelines.
Captures the provenance chain from raw data through
feature engineering, training, and inference. Designed
for integration with pipeline orchestrators like
Airflow, Dagster, or Prefect.
"""
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from datetime import datetime
import hashlib
import json
@dataclass
class DatasetVersion:
"""A versioned snapshot of a dataset."""
dataset_id: str
version: str
row_count: int
schema_hash: str
source: str
created_at: str
consent_status: str # "authorized", "pending", "revoked"
classification: str # "public", "internal", "confidential", "restricted"
quality_score: float # 0.0 - 1.0
@dataclass
class LineageRecord:
"""A single step in the data lineage chain."""
step_id: str
step_type: str # "ingest", "transform", "feature", "train", "predict"
input_datasets: List[str] # dataset_id:version references
output_dataset: Optional[str]
transformation: str # Human-readable description
code_version: str # Git commit hash
timestamp: str
parameters: Dict[str, str] = field(default_factory=dict)
class LineageTracker:
"""Track data lineage across ML pipeline stages."""
def __init__(self):
self.records: List[LineageRecord] = []
self.datasets: Dict[str, DatasetVersion] = {}
def register_dataset(self, dataset: DatasetVersion) -> str:
"""Register a dataset version in the lineage graph."""
key = f"{dataset.dataset_id}:{dataset.version}"
self.datasets[key] = dataset
return key
def record_step(
self,
step_type: str,
input_refs: List[str],
output_ref: Optional[str],
transformation: str,
code_version: str,
parameters: Optional[Dict[str, str]] = None,
) -> LineageRecord:
"""Record a lineage step in the pipeline."""
# Validate consent status for training steps
if step_type == "train":
for ref in input_refs:
ds = self.datasets.get(ref)
if ds and ds.consent_status != "authorized":
raise ValueError(
f"Dataset {ref} consent status is "
f"'{ds.consent_status}', not authorized "
f"for training use."
)
record = LineageRecord(
step_id=hashlib.sha256(
f"{step_type}:{datetime.utcnow().isoformat()}".encode()
).hexdigest()[:16],
step_type=step_type,
input_datasets=input_refs,
output_dataset=output_ref,
transformation=transformation,
code_version=code_version,
timestamp=datetime.utcnow().isoformat(),
parameters=parameters or {},
)
self.records.append(record)
return record
def get_model_lineage(self, model_id: str) -> List[LineageRecord]:
"""Trace the full lineage chain for a trained model."""
# Find training step for this model
train_steps = [
r for r in self.records
if r.step_type == "train" and r.output_dataset == model_id
]
if not train_steps:
return []
# Walk backwards through the lineage graph
lineage = []
to_visit = list(train_steps)
visited = set()
while to_visit:
current = to_visit.pop(0)
if current.step_id in visited:
continue
visited.add(current.step_id)
lineage.append(current)
# Find upstream steps that produced this step's inputs
for input_ref in current.input_datasets:
upstream = [
r for r in self.records
if r.output_dataset == input_ref
and r.step_id not in visited
]
to_visit.extend(upstream)
return sorted(lineage, key=lambda r: r.timestamp)Data Quality Scoring
Data quality scoring for AI goes beyond traditional data quality metrics. In addition to completeness, consistency, and accuracy checks, AI data quality must evaluate representativeness (does the data reflect the population the model will serve?), temporal relevance (is the data from the right time period?), label quality (are the ground truth labels accurate and consistent?), and feature informativeness (do the features actually carry signal for the prediction task?). A composite quality score informs whether a dataset is suitable for training, evaluation, or production inference.
Automate data quality scoring as a pipeline gate. Every dataset version that enters your training pipeline should pass a minimum quality score threshold. This catches issues like corrupted uploads, incomplete joins, and label leakage before they waste training compute and introduce model quality regressions.
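The four AI-specific dimensions described above can be combined into a composite score that serves as the pipeline gate. The sketch below is one way to structure this, assuming a weighted average; the dimension weights and the 0.8 threshold are illustrative assumptions to tune per prediction task, not recommended values.

```python
"""Illustrative composite quality score and training gate."""
from dataclasses import dataclass


@dataclass
class QualityDimensions:
    """AI-specific quality dimensions, each normalized to 0.0 - 1.0."""
    completeness: float         # fraction of non-null required fields
    representativeness: float   # overlap with the target population
    temporal_relevance: float   # decays with data age
    label_quality: float        # e.g. inter-annotator agreement

# Illustrative weights; tune per prediction task.
WEIGHTS = {
    "completeness": 0.2,
    "representativeness": 0.3,
    "temporal_relevance": 0.2,
    "label_quality": 0.3,
}


def composite_score(q: QualityDimensions) -> float:
    """Weighted average of quality dimensions, 0.0 - 1.0."""
    return (
        WEIGHTS["completeness"] * q.completeness
        + WEIGHTS["representativeness"] * q.representativeness
        + WEIGHTS["temporal_relevance"] * q.temporal_relevance
        + WEIGHTS["label_quality"] * q.label_quality
    )


def training_gate(q: QualityDimensions, threshold: float = 0.8) -> bool:
    """Pipeline gate: block training when the composite score is too low."""
    return composite_score(q) >= threshold
```

Running the gate on every dataset version before training catches the corrupted-upload and incomplete-join cases cheaply, before any compute is spent.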
Consent Management for Training Data
Consent management for AI training data requires tracking consent at the record level, not just the dataset level. Individual users may grant or revoke consent at any time, and you must be able to identify which records in which datasets are affected, which models were trained on those records, and what retraining is necessary when consent is revoked. This is operationally complex but legally required under GDPR and increasingly expected under other privacy regimes.
The practical approach is a consent registry that maps user identifiers to consent grants, with each grant specifying the scope (what data), purpose (what use, including AI training), and status (active, revoked, expired). When consent is revoked, the system identifies affected datasets, flags affected models for retraining, and logs the action for compliance records. Full model retraining on every revocation is often impractical, so organizations should define batch retraining schedules that balance privacy obligations with operational feasibility.
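The consent registry described above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the field names (`scope`, `purpose`, `status`) follow the text, while the membership and training-set indexes are hypothetical structures a real system would back with a database.

```python
"""Minimal consent registry sketch: grants, revocation, and
identification of affected datasets and models."""
from dataclasses import dataclass


@dataclass
class ConsentGrant:
    user_id: str
    scope: str    # what data, e.g. "product_telemetry"
    purpose: str  # what use, e.g. "ai_training"
    status: str   # "active", "revoked", "expired"


class ConsentRegistry:
    def __init__(self):
        self.grants: list[ConsentGrant] = []
        # dataset_id -> user_ids whose records it contains
        self.dataset_members: dict[str, set[str]] = {}
        # model_id -> dataset_ids the model was trained on
        self.model_training_sets: dict[str, set[str]] = {}

    def revoke(self, user_id: str, purpose: str) -> dict:
        """Mark matching grants revoked and return the affected
        datasets and the models flagged for retraining."""
        for g in self.grants:
            if g.user_id == user_id and g.purpose == purpose:
                g.status = "revoked"
        affected = {
            ds for ds, users in self.dataset_members.items()
            if user_id in users
        }
        models = {
            m for m, ds_ids in self.model_training_sets.items()
            if ds_ids & affected
        }
        return {"affected_datasets": affected, "models_to_retrain": models}
```

The returned mapping is what feeds the batch retraining schedule: revocations accumulate against each model until its next scheduled retrain, with every revocation logged for compliance records.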
Retention Policies for AI Data
AI data retention policies must balance competing pressures. Privacy regulations demand minimal retention periods. Model quality demands access to historical training data for retraining and evaluation. Audit requirements demand preservation of the data that was used to train models that made consequential decisions. The resolution is a tiered retention framework that distinguishes between active training data, archived training data, evaluation reference data, and audit preservation data, each with its own retention period and access controls.
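The tiered framework can be expressed as policy-as-code so expiry checks run automatically. The tier names below follow the text; the retention periods and access-group labels are illustrative assumptions to be replaced with values derived from your regulatory and audit requirements.

```python
"""Sketch of a tiered retention policy with automated expiry checks."""
from datetime import date, timedelta

# Tier -> (retention period, access control group).
# Periods and group names are illustrative assumptions.
RETENTION_TIERS = {
    "active_training": (timedelta(days=365), "ml-engineers"),
    "archived_training": (timedelta(days=3 * 365), "data-governance"),
    "evaluation_reference": (timedelta(days=2 * 365), "ml-engineers"),
    "audit_preservation": (timedelta(days=7 * 365), "compliance-only"),
}


def is_expired(tier: str, created: date, today: date) -> bool:
    """True when a dataset in this tier has exceeded its retention
    period and should be deleted (with deletion verification)."""
    period, _access_group = RETENTION_TIERS[tier]
    return today - created > period
```

For example, active training data created two years ago is past a one-year retention window and flagged for deletion, while the same data promoted to audit preservation would be retained for the full audit horizon.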
Version History
1.0.0 · 2026-03-01
- Initial release covering data classification, lineage, quality, consent, and retention for AI
- Data lineage tracker code example with consent validation
- Four-tier data classification framework with AI training authorization
- Production checklist covering 12 governance controls across four categories
- Consent management and retention policy guidance aligned with GDPR and EU AI Act