Ingestion lessons
L01 — dlt fundamentals
The smallest dlt pipeline that lands a fixture into DuckDB. Anchored in
tutorial/serving/src/serving/dlt_pipeline.py.
The lesson covers pipeline.run, automatic schema inference, write
disposition, and the pipeline.dataset accessor.
The runnable command:
uv run --package tutorial-serving python -m serving.dlt_pipeline --source fixture

What you learn: dlt is not a magic ingest box. It is a deterministic
transformation from (source iterator) → (typed dataset) with config-driven
sinks. The configuration lives in pyproject.toml and the matching
.dlt/config.toml (generated; not checked in).
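To make the "source iterator → typed dataset" idea concrete, here is a toy sketch that uses only the Python standard library. It does not call dlt and is not how dlt is implemented; `infer_schema`, `land`, `fixture_rows`, and the sample columns are hypothetical names made up for illustration. It mimics two ideas the lesson names: column types are inferred from the rows themselves, and a write disposition decides whether a load replaces or extends what has already landed.

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def infer_schema(rows: List[Row]) -> Dict[str, str]:
    """Map each column to the type name of the first non-null value seen."""
    schema: Dict[str, str] = {}
    for row in rows:
        for column, value in row.items():
            if value is not None and column not in schema:
                schema[column] = type(value).__name__
    return schema


def land(table: List[Row], source: Iterable[Row], write_disposition: str = "append") -> Dict[str, str]:
    """Load rows into `table`, an in-memory stand-in for a real sink."""
    if write_disposition not in ("append", "replace"):
        raise ValueError(f"unsupported write disposition: {write_disposition}")
    rows = list(source)
    if write_disposition == "replace":
        table.clear()  # drop whatever an earlier load left behind
    table.extend(rows)
    return infer_schema(table)


def fixture_rows() -> Iterator[Row]:
    # Hypothetical fixture rows; the real fixture's columns may differ.
    yield {"impression_id": 1, "article_id": "a-100", "clicked": True}
    yield {"impression_id": 2, "article_id": "a-101", "clicked": False}


sink: List[Row] = []
schema = land(sink, fixture_rows(), write_disposition="replace")
rows_after_replace = len(sink)

land(sink, fixture_rows(), write_disposition="append")
rows_after_append = len(sink)

print(schema)
print(rows_after_replace, rows_after_append)
```

The same input always yields the same schema and row counts, which is the sense in which the load is deterministic. Re-running with "replace" keeps the table at fixture size, while "append" grows it on every run.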
L02–L04 — Three publishers
The same dlt module loads three datasets:
| Publisher | Run command | Notable schema quirks |
|---|---|---|
| EB-NeRD | ... --source ebnerd | Carries Danish article text + per-article sentiment_score |
| Adressa | ... --source adressa | Norwegian, no sentiment scores, shorter session shape |
| MIND | ... --source mind | English, broader category taxonomy, longer impression slates |
Each publisher’s pipeline writes to data/raw/<publisher>/ as partitioned
Parquet (git-ignored per ADR-0014). Repeated runs are idempotent: re-running
skips already-downloaded files unless explicitly forced.
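The skip-unless-forced behaviour can be sketched roughly as follows. This is an illustration under stated assumptions, not the tutorial's actual download code: `download_missing`, `fake_fetch`, the `force` parameter, and the file names are all hypothetical, and the bytes written are placeholders rather than real Parquet.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def download_missing(
    names: Iterable[str],
    target_dir: Path,
    fetch: Callable[[str], bytes],
    force: bool = False,
) -> List[str]:
    """Fetch files into target_dir, skipping existing ones unless force is set.

    Returns the names that were actually fetched on this run.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    fetched: List[str] = []
    for name in names:
        destination = target_dir / name
        if destination.exists() and not force:
            continue  # already landed by an earlier run
        # Write to a temporary name first, then rename, so an interrupted
        # run does not leave a half-written file that later runs would skip.
        partial = destination.with_suffix(destination.suffix + ".part")
        partial.write_bytes(fetch(name))
        partial.replace(destination)
        fetched.append(name)
    return fetched


def fake_fetch(name: str) -> bytes:
    # Stand-in for a real network download.
    return f"contents of {name}".encode()


with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "data" / "raw" / "ebnerd"
    files = ["part-0.parquet", "part-1.parquet"]
    first_run = download_missing(files, raw_dir, fake_fetch)
    second_run = download_missing(files, raw_dir, fake_fetch)
    forced_run = download_missing(files, raw_dir, fake_fetch, force=True)

print(first_run, second_run, forced_run)
```

The second run fetches nothing because every destination already exists; only the forced run downloads again.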
The corresponding pipeline tests confirm landed-row shape and idempotency.
L05 — The unified staging contract
The whole module’s payoff is one downstream view. Once dlt has landed three
publishers’ worth of raw data, the transformation module builds
stg_unified_impressions.sql, which is the first place every publisher meets
on equal terms. The ingestion lesson ends by reading from that view and
confirming the row counts add up across publishers, with the publisher
column populated correctly.
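The closing check can be sketched as below. Python's built-in sqlite3 module stands in for DuckDB so that the snippet runs without extra dependencies. The per-publisher table names, the columns other than publisher and sentiment_score, and the sample rows are hypothetical; the real stg_unified_impressions.sql may be shaped differently. The sketch also assumes that only EB-NeRD carries sentiment scores, so the other publishers get NULL rather than an invented value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical per-publisher staging tables with made-up rows.
    CREATE TABLE stg_ebnerd_impressions (impression_id TEXT, article_id TEXT, sentiment_score REAL);
    CREATE TABLE stg_adressa_impressions (impression_id TEXT, article_id TEXT);
    CREATE TABLE stg_mind_impressions (impression_id TEXT, article_id TEXT);

    INSERT INTO stg_ebnerd_impressions VALUES ('e1', 'a1', 0.8), ('e2', 'a2', 0.1);
    INSERT INTO stg_adressa_impressions VALUES ('n1', 'b1');
    INSERT INTO stg_mind_impressions VALUES ('m1', 'c1'), ('m2', 'c2'), ('m3', 'c3');

    -- One view where every publisher shares the same columns.
    CREATE VIEW stg_unified_impressions AS
    SELECT 'ebnerd' AS publisher, impression_id, article_id, sentiment_score
    FROM stg_ebnerd_impressions
    UNION ALL
    SELECT 'adressa', impression_id, article_id, NULL FROM stg_adressa_impressions
    UNION ALL
    SELECT 'mind', impression_id, article_id, NULL FROM stg_mind_impressions;
    """
)


def count(sql: str) -> int:
    """Run a single-value COUNT query and return the integer."""
    return conn.execute(sql).fetchone()[0]


source_total = sum(
    count(f"SELECT COUNT(*) FROM stg_{name}_impressions")
    for name in ("ebnerd", "adressa", "mind")
)
unified_total = count("SELECT COUNT(*) FROM stg_unified_impressions")
per_publisher = dict(
    conn.execute(
        "SELECT publisher, COUNT(*) FROM stg_unified_impressions GROUP BY publisher"
    )
)
invented_scores = count(
    "SELECT COUNT(*) FROM stg_unified_impressions "
    "WHERE publisher <> 'ebnerd' AND sentiment_score IS NOT NULL"
)

print(source_total, unified_total, per_publisher, invented_scores)
```

UNION ALL keeps every row, so the unified count equals the sum of the source counts. The NULL placeholders keep the missing scores visible to downstream consumers instead of hiding them behind a default.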
The staging-side tests anchor the cross-publisher behaviour.
What you should be able to answer after this module
- Why are EB-NeRD’s sentiment scores not propagated as a fake column on Adressa? (Graceful degradation is an accountability feature. The ranker and evaluation already know how to handle missing scores.)
- What does each stg_<publisher>_* model look like, and what does stg_unified_impressions add on top?
- Where do raw downloads live, and why are they .gitignored?
After this module
Continue to transformation to build the marts that turn raw events into model-ready features and editorial config rows.