Key Takeaway
Investing in data lineage and quality scoring early prevents costly model retraining cycles and simplifies regulatory compliance audits. Data governance for AI extends traditional data management with ML-specific concerns: training data provenance, consent tracking for model training use, feature store management, and retention policies that balance retraining needs with deletion obligations.
Prerequisites
- An existing data catalog or inventory of data assets used across the organization
- Understanding of which datasets feed into ML training, evaluation, and inference pipelines
- Familiarity with applicable data protection regulations (GDPR, CCPA, sector-specific rules)
- Access to data pipeline orchestration tools (Airflow, Dagster, Prefect, or similar)
- A data classification scheme or willingness to implement one
Why AI Changes Data Governance
Traditional data governance focuses on data at rest and data in transit: who can access what data, how long it is retained, and where it is stored. AI introduces a third dimension: data in training. When data is used to train a model, information from that data becomes encoded in model weights in ways that are difficult to audit, impossible to surgically remove, and potentially subject to memorization and regurgitation. This means that data governance for AI must extend its scope to cover the entire lifecycle from raw data collection through model training, evaluation, deployment, and eventual model retirement.
The regulatory implications are significant. GDPR's right to erasure requires the ability to delete personal data, but deleting the original training record does not remove its influence from a trained model. The EU AI Act requires documentation of training data sources, quality measures, and potential biases. CCPA grants consumers the right to know what data is collected and how it is used, including for AI training purposes. Meeting these requirements without a systematic data governance approach is effectively impossible at scale.
Data Classification for AI
AI data classification extends standard sensitivity tiers with training-specific metadata. Every dataset must be tagged not only with its sensitivity level but also with its suitability for AI training, consent status for ML use, known biases or limitations, and temporal validity window. This metadata enables automated policy enforcement: a pipeline cannot use a dataset for training if its consent status does not include ML training authorization.
| Classification | Description | AI Training Allowed | Consent Required | Retention Rules |
|---|---|---|---|---|
| Public | Publicly available data, open datasets, published research | Yes, with license compliance check | Attribution per license terms | Standard retention, archive after model sunset |
| Internal | Business data, operational logs, product telemetry | Yes, with purpose limitation review | Employee/user consent for AI training use | Retain while model is active, delete on model retirement |
| Confidential | Customer data, financial records, HR data | Only with explicit consent and DPO approval | Explicit opt-in consent required | Strict retention periods, deletion verification required |
| Restricted | PII, health data, biometric data, children's data | Only with legal review and enhanced safeguards | Explicit consent + legal basis documentation | Minimum retention, encrypted at rest, audit all access |
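The classification metadata above can drive automated policy enforcement in a pipeline gate. The sketch below is a minimal illustration: the tier names mirror the table, but the specific approval flags (`purpose_review`, `dpo_approval`, `legal_review`) and the `can_train` helper are assumptions for demonstration, not a prescribed schema.

```python
"""Illustrative policy gate driven by the classification tiers above."""

# Per-tier training rules mirroring the classification table.
# The approval names are illustrative assumptions.
POLICY = {
    "public": {"training": True, "requires": []},
    "internal": {"training": True, "requires": ["purpose_review"]},
    "confidential": {"training": True, "requires": ["explicit_consent", "dpo_approval"]},
    "restricted": {"training": True, "requires": ["explicit_consent", "legal_review"]},
}


def can_train(classification: str, approvals: set[str]) -> bool:
    """Return True if a dataset with this classification and the
    recorded approvals may be used for model training."""
    rule = POLICY.get(classification)
    if rule is None or not rule["training"]:
        # Unknown classification: fail closed.
        return False
    return set(rule["requires"]) <= approvals
```

A confidential dataset with explicit consent but no DPO approval is blocked until the approval is recorded; an unrecognized classification fails closed rather than open.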
Data Lineage for ML Pipelines
Data lineage in ML pipelines must trace the complete path from raw data source to model prediction. This means tracking which datasets were used for training, how they were preprocessed, which features were derived, which model version was trained on which data version, and which predictions were made with which model. This end-to-end traceability is required for regulatory compliance, incident investigation, and reproducibility.
"""Data lineage tracking for ML pipelines.
Captures the provenance chain from raw data through
feature engineering, training, and inference. Designed
for integration with pipeline orchestrators like
Airflow, Dagster, or Prefect.
"""
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from datetime import datetime
import hashlib
import json
@dataclass
class DatasetVersion:
"""A versioned snapshot of a dataset."""
dataset_id: str
version: str
row_count: int
schema_hash: str
source: str
created_at: str
consent_status: str # "authorized", "pending", "revoked"
classification: str # "public", "internal", "confidential", "restricted"
quality_score: float # 0.0 - 1.0
@dataclass
class LineageRecord:
"""A single step in the data lineage chain."""
step_id: str
step_type: str # "ingest", "transform", "feature", "train", "predict"
input_datasets: List[str] # dataset_id:version references
output_dataset: Optional[str]
transformation: str # Human-readable description
code_version: str # Git commit hash
timestamp: str
parameters: Dict[str, str] = field(default_factory=dict)
class LineageTracker:
"""Track data lineage across ML pipeline stages."""
def __init__(self):
self.records: List[LineageRecord] = []
self.datasets: Dict[str, DatasetVersion] = {}
def register_dataset(self, dataset: DatasetVersion) -> str:
"""Register a dataset version in the lineage graph."""
key = f"{dataset.dataset_id}:{dataset.version}"
self.datasets[key] = dataset
return key
def record_step(
self,
step_type: str,
input_refs: List[str],
output_ref: Optional[str],
transformation: str,
code_version: str,
parameters: Optional[Dict[str, str]] = None,
) -> LineageRecord:
"""Record a lineage step in the pipeline."""
# Validate consent status for training steps
if step_type == "train":
for ref in input_refs:
ds = self.datasets.get(ref)
if ds and ds.consent_status != "authorized":
raise ValueError(
f"Dataset {ref} consent status is "
f"'{ds.consent_status}', not authorized "
f"for training use."
)
record = LineageRecord(
step_id=hashlib.sha256(
f"{step_type}:{datetime.utcnow().isoformat()}".encode()
).hexdigest()[:16],
step_type=step_type,
input_datasets=input_refs,
output_dataset=output_ref,
transformation=transformation,
code_version=code_version,
timestamp=datetime.utcnow().isoformat(),
parameters=parameters or {},
)
self.records.append(record)
return record
def get_model_lineage(self, model_id: str) -> List[LineageRecord]:
"""Trace the full lineage chain for a trained model."""
# Find training step for this model
train_steps = [
r for r in self.records
if r.step_type == "train" and r.output_dataset == model_id
]
if not train_steps:
return []
# Walk backwards through the lineage graph
lineage = []
to_visit = list(train_steps)
visited = set()
while to_visit:
current = to_visit.pop(0)
if current.step_id in visited:
continue
visited.add(current.step_id)
lineage.append(current)
# Find upstream steps that produced this step's inputs
for input_ref in current.input_datasets:
upstream = [
r for r in self.records
if r.output_dataset == input_ref
and r.step_id not in visited
]
to_visit.extend(upstream)
return sorted(lineage, key=lambda r: r.timestamp)Data Quality Scoring
Data quality scoring for AI goes beyond traditional data quality metrics. In addition to completeness, consistency, and accuracy checks, AI data quality must evaluate representativeness (does the data reflect the population the model will serve?), temporal relevance (is the data from the right time period?), label quality (are the ground truth labels accurate and consistent?), and feature informativeness (do the features actually carry signal for the prediction task?). A composite quality score informs whether a dataset is suitable for training, evaluation, or production inference.
Automate data quality scoring as a pipeline gate. Every dataset version that enters your training pipeline should pass a minimum quality score threshold. This catches issues like corrupted uploads, incomplete joins, and label leakage before they waste training compute and introduce model quality regressions.
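The four AI-specific dimensions described above can be combined into a composite score that serves as the pipeline gate. The sketch below is one way to structure this, assuming a weighted average; the dimension weights and the 0.8 threshold are illustrative assumptions to tune per prediction task, not recommended values.

```python
"""Illustrative composite quality score and training gate."""
from dataclasses import dataclass


@dataclass
class QualityDimensions:
    """AI-specific quality dimensions, each normalized to 0.0 - 1.0."""
    completeness: float         # fraction of non-null required fields
    representativeness: float   # overlap with the target population
    temporal_relevance: float   # decays with data age
    label_quality: float        # e.g. inter-annotator agreement

# Illustrative weights; tune per prediction task.
WEIGHTS = {
    "completeness": 0.2,
    "representativeness": 0.3,
    "temporal_relevance": 0.2,
    "label_quality": 0.3,
}


def composite_score(q: QualityDimensions) -> float:
    """Weighted average of quality dimensions, 0.0 - 1.0."""
    return (
        WEIGHTS["completeness"] * q.completeness
        + WEIGHTS["representativeness"] * q.representativeness
        + WEIGHTS["temporal_relevance"] * q.temporal_relevance
        + WEIGHTS["label_quality"] * q.label_quality
    )


def training_gate(q: QualityDimensions, threshold: float = 0.8) -> bool:
    """Pipeline gate: block training when the composite score is too low."""
    return composite_score(q) >= threshold
```

Running the gate on every dataset version before training catches the corrupted-upload and incomplete-join cases cheaply, before any compute is spent.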
Consent Management for Training Data
Consent management for AI training data requires tracking consent at the record level, not just the dataset level. Individual users may grant or revoke consent at any time, and you must be able to identify which records in which datasets are affected, which models were trained on those records, and what retraining is necessary when consent is revoked. This is operationally complex but legally required under GDPR and increasingly expected under other privacy regimes.
The practical approach is a consent registry that maps user identifiers to consent grants, with each grant specifying the scope (what data), purpose (what use, including AI training), and status (active, revoked, expired). When consent is revoked, the system identifies affected datasets, flags affected models for retraining, and logs the action for compliance records. Full model retraining on every revocation is often impractical, so organizations should define batch retraining schedules that balance privacy obligations with operational feasibility.
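The consent registry described above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the field names (`scope`, `purpose`, `status`) follow the text, while the membership and training-set indexes are hypothetical structures a real system would back with a database.

```python
"""Minimal consent registry sketch: grants, revocation, and
identification of affected datasets and models."""
from dataclasses import dataclass


@dataclass
class ConsentGrant:
    user_id: str
    scope: str    # what data, e.g. "product_telemetry"
    purpose: str  # what use, e.g. "ai_training"
    status: str   # "active", "revoked", "expired"


class ConsentRegistry:
    def __init__(self):
        self.grants: list[ConsentGrant] = []
        # dataset_id -> user_ids whose records it contains
        self.dataset_members: dict[str, set[str]] = {}
        # model_id -> dataset_ids the model was trained on
        self.model_training_sets: dict[str, set[str]] = {}

    def revoke(self, user_id: str, purpose: str) -> dict:
        """Mark matching grants revoked and return the affected
        datasets and the models flagged for retraining."""
        for g in self.grants:
            if g.user_id == user_id and g.purpose == purpose:
                g.status = "revoked"
        affected = {
            ds for ds, users in self.dataset_members.items()
            if user_id in users
        }
        models = {
            m for m, ds_ids in self.model_training_sets.items()
            if ds_ids & affected
        }
        return {"affected_datasets": affected, "models_to_retrain": models}
```

The returned mapping is what feeds the batch retraining schedule: revocations accumulate against each model until its next scheduled retrain, with every revocation logged for compliance records.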
Retention Policies for AI Data
AI data retention policies must balance competing pressures. Privacy regulations demand minimal retention periods. Model quality demands access to historical training data for retraining and evaluation. Audit requirements demand preservation of the data that was used to train models that made consequential decisions. The resolution is a tiered retention framework that distinguishes between active training data, archived training data, evaluation reference data, and audit preservation data, each with its own retention period and access controls.
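The tiered framework can be expressed as policy-as-code so expiry checks run automatically. The tier names below follow the text; the retention periods and access-group labels are illustrative assumptions to be replaced with values derived from your regulatory and audit requirements.

```python
"""Sketch of a tiered retention policy with automated expiry checks."""
from datetime import date, timedelta

# Tier -> (retention period, access control group).
# Periods and group names are illustrative assumptions.
RETENTION_TIERS = {
    "active_training": (timedelta(days=365), "ml-engineers"),
    "archived_training": (timedelta(days=3 * 365), "data-governance"),
    "evaluation_reference": (timedelta(days=2 * 365), "ml-engineers"),
    "audit_preservation": (timedelta(days=7 * 365), "compliance-only"),
}


def is_expired(tier: str, created: date, today: date) -> bool:
    """True when a dataset in this tier has exceeded its retention
    period and should be deleted (with deletion verification)."""
    period, _access_group = RETENTION_TIERS[tier]
    return today - created > period
```

For example, active training data created two years ago is past a one-year retention window and flagged for deletion, while the same data promoted to audit preservation would be retained for the full audit horizon.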
Version History
1.0.0 · 2026-03-01
- Initial release covering data classification, lineage, quality, consent, and retention for AI
- Data lineage tracker code example with consent validation
- Four-tier data classification framework with AI training authorization
- Production checklist covering 12 governance controls across four categories
- Consent management and retention policy guidance aligned with GDPR and EU AI Act