
Ingestion lessons

This lesson walks through the smallest dlt pipeline that lands a fixture into DuckDB, anchored in tutorial/serving/src/serving/dlt_pipeline.py. It covers pipeline.run, automatic schema inference, write disposition, and the pipeline.dataset accessor.

The runnable command:

uv run --package tutorial-serving python -m serving.dlt_pipeline --source fixture

What you learn: dlt is not a magic ingest box. It is a deterministic transformation from (source iterator) → (typed dataset) with config-driven sinks. The configuration lives in pyproject.toml and the matching .dlt/config.toml (generated; not checked in).
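The "source iterator to typed dataset" idea can be illustrated without dlt at all. The following is a toy sketch in plain Python, not dlt's actual inference or loading logic: fixture_rows, infer_schema, and land are hypothetical names, and only the "append" and "replace" write dispositions are modelled.

```python
from typing import Any, Iterable, Iterator

Row = dict[str, Any]


def fixture_rows() -> Iterator[Row]:
    """Hypothetical fixture source: yields plain dict rows."""
    yield {"impression_id": 1, "article_id": "a1", "clicked": True}
    yield {"impression_id": 2, "article_id": "a2", "clicked": False,
           "sentiment_score": 0.7}


def infer_schema(rows: Iterable[Row]) -> dict[str, str]:
    """Map each column to the type name of its first non-None value."""
    schema: dict[str, str] = {}
    for row in rows:
        for column, value in row.items():
            if value is not None and column not in schema:
                schema[column] = type(value).__name__
    return schema


def land(table: list[Row], rows: Iterable[Row],
         write_disposition: str = "append") -> None:
    """Toy sink: 'replace' empties the table first, 'append' adds to it."""
    if write_disposition not in {"append", "replace"}:
        raise ValueError(f"unsupported write disposition: {write_disposition}")
    if write_disposition == "replace":
        table.clear()
    table.extend(rows)


table: list[Row] = []
land(table, fixture_rows(), "replace")
land(table, fixture_rows(), "replace")  # same rows again, table is rebuilt
replaced_count = len(table)
land(table, fixture_rows(), "append")   # same rows again, table grows
appended_count = len(table)
schema = infer_schema(table)
```

The point of the sketch is that nothing here is hidden state: the same iterator and the same write disposition always produce the same table and the same inferred column types.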

The same dlt module loads three datasets:

| Publisher | Run command | Notable schema quirks |
|---|---|---|
| EB-NeRD | ... --source ebnerd | Carries Danish article text + per-article sentiment_score |
| Adressa | ... --source adressa | Norwegian, no sentiment scores, shorter session shape |
| MIND | ... --source mind | English, broader category taxonomy, longer impression slates |

Each publisher’s pipeline writes to data/raw/<publisher>/ as partitioned Parquet (git-ignored per ADR-0014). Repeated runs are idempotent: a re-run skips files that have already been downloaded unless it is explicitly forced.

The corresponding pipeline tests confirm landed-row shape and idempotency:
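A minimal sketch of that kind of check, assuming a skip-if-present download step. This is not the project's test code: download_once, fake_fetch, the force flag, and the CSV file name are all hypothetical, and a CSV payload stands in for Parquet so the example needs only the standard library.

```python
import tempfile
from pathlib import Path
from typing import Callable


def download_once(dest: Path, fetch: Callable[[], bytes],
                  force: bool = False) -> bool:
    """Write fetch() to dest unless dest already exists.

    Returns True when a write happened, False when the file was skipped.
    """
    if dest.exists() and not force:
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(fetch())
    return True


calls: list[int] = []


def fake_fetch() -> bytes:
    """Hypothetical stand-in for a real download; records each invocation."""
    calls.append(1)
    return b"impression_id,article_id\n1,a1\n2,a2\n"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "raw" / "fixture" / "part-0000.csv"
    first = download_once(target, fake_fetch)               # file is written
    second = download_once(target, fake_fetch)              # file is skipped
    forced = download_once(target, fake_fetch, force=True)  # file is rewritten
    landed_lines = target.read_text().splitlines()
```

The second call returning False without invoking the fetch function is the idempotency property; the header and row count of the landed file are the landed-row shape.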

The whole module’s payoff is one downstream view. Once dlt has landed three publishers’ worth of raw data, the transformation module builds stg_unified_impressions.sql, which is the first place every publisher meets on equal terms. The ingestion lesson ends by reading from that view and confirming that the row counts add up across publishers, with the publisher column populated correctly.

The staging-side tests anchor the cross-publisher behaviour:
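A minimal sketch of such a check, using an in-memory SQLite database as a stand-in for the real warehouse. The staging table names, columns, and rows below are hypothetical, and only two publishers are modelled; the view simply unions the per-publisher tables, tags each row with its publisher, and leaves sentiment_score as NULL where the publisher has none.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical per-publisher staging tables with made-up rows.
    CREATE TABLE stg_ebnerd_impressions (
        impression_id TEXT, article_id TEXT, sentiment_score REAL
    );
    CREATE TABLE stg_adressa_impressions (
        impression_id TEXT, article_id TEXT
    );
    INSERT INTO stg_ebnerd_impressions VALUES ('e1', 'a1', 0.8), ('e2', 'a2', 0.1);
    INSERT INTO stg_adressa_impressions VALUES ('n1', 'b1');

    -- Unified view: one branch per publisher, NULL instead of a fake score.
    CREATE VIEW stg_unified_impressions AS
    SELECT 'ebnerd' AS publisher, impression_id, article_id, sentiment_score
    FROM stg_ebnerd_impressions
    UNION ALL
    SELECT 'adressa', impression_id, article_id, NULL
    FROM stg_adressa_impressions;
    """
)

# Row counts per publisher, and the total across the view.
per_publisher = dict(
    conn.execute(
        "SELECT publisher, COUNT(*) FROM stg_unified_impressions GROUP BY publisher"
    )
)
total = conn.execute("SELECT COUNT(*) FROM stg_unified_impressions").fetchone()[0]

# COUNT(column) ignores NULLs, so this counts Adressa rows that carry a score.
adressa_scores = conn.execute(
    "SELECT COUNT(sentiment_score) FROM stg_unified_impressions "
    "WHERE publisher = 'adressa'"
).fetchone()[0]
conn.close()
```

Checking that the per-publisher counts sum to the total, and that no Adressa row carries a score, covers both the row-count and the publisher-column expectations in a few lines.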

What you should be able to answer after this module

  • Why are EB-NeRD’s sentiment scores not propagated as a fake column on Adressa? (Graceful degradation is an accountability feature. The ranker and evaluation already know how to handle missing scores.)
  • What does each stg_<publisher>_* model look like, and what does stg_unified_impressions add on top?
  • Where do raw downloads live, and why are they .gitignored?

Continue to transformation to build the marts that turn raw events into model-ready features and editorial config rows.