Ingestion lessons

L01 — Two publishers, one shape, via dlt

The file: tutorial/ingestion/lesson-01-dlt-news-events.py.

Run it

make -C tutorial/ingestion run

The lesson defines two dlt resources, runs a pipeline that lands each as Parquet under a work directory (/tmp/fos-workbench-ingestion-lesson-01 by default), then reads the Parquet back with DuckDB and prints five labelled sections. It is the real dlt → Parquet → DuckDB path, just on tiny made-up feeds instead of multi-gigabyte downloads.

What runs

Reading top to bottom — the SEED → LAND → UNIFY arc:

Seed. Two in-file feeds with different raw shapes. Ekstra Bladet: (reader_id, story_id, behaviour, happened_at, section), where behaviour is view or open. MIND: (uid, news_id, interaction, ts, category), where interaction is impression or click.
@dlt.resource. Each feed is wrapped in a dlt resource — a generator that yields the rows verbatim. A columns hint pins the timestamp fields to text so “raw shape untouched” is literally true (otherwise dlt would parse and tz-shift them, and the raw view would no longer match what you ingested).
Land. dlt.pipeline(destination=filesystem(...), ...) writes both resources to …/raw/news/<table>/*.parquet with write_disposition="replace". The pipeline’s own state is kept inside the work dir, so the lesson never touches global dlt state.
Unify. DuckDB reads the two Parquet sets into raw_ekstrabladet and raw_mind views (raw columns preserved), then a news_events view UNION ALLs them onto the canonical shape — renaming columns and mapping open → click / view → impression in a visible CASE.
Sections 1–4 print the raw feeds, the unified view, and clicks per publisher; section 5 asks the checkpoint.
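The Unify step can be sketched without any dependencies. The sketch below is a stand-in, not the lesson's code: it swaps DuckDB-over-Parquet for Python's built-in sqlite3 with in-memory tables, and the seed rows, the publisher labels, and the canonical column names other than event_type and publisher (user_id, article_id, event_ts, topic) are assumptions made for illustration. What it shows is the pattern itself: raw tables keep their original columns, and a single view applies the renames and the CASE mapping.

```python
import sqlite3

# Stand-in for DuckDB reading Parquet: in-memory SQLite tables with the raw column names.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE raw_ekstrabladet (reader_id TEXT, story_id TEXT, behaviour TEXT, happened_at TEXT, section TEXT);
CREATE TABLE raw_mind (uid TEXT, news_id TEXT, interaction TEXT, ts TEXT, category TEXT);

-- Hypothetical rows, not the lesson's actual seed data.
INSERT INTO raw_ekstrabladet VALUES
  ('r1', 's1', 'view', '2024-01-01T08:00:00', 'sport'),
  ('r1', 's1', 'open', '2024-01-01T08:00:05', 'sport');
INSERT INTO raw_mind VALUES
  ('u1', 'n1', 'impression', '2024-01-01T09:00:00', 'finance'),
  ('u1', 'n1', 'click', '2024-01-01T09:00:03', 'finance');

-- The single visible boundary: rename columns and normalise the vocabulary.
CREATE VIEW news_events AS
SELECT 'ekstrabladet' AS publisher,
       reader_id      AS user_id,      -- assumed canonical name
       story_id       AS article_id,   -- assumed canonical name
       CASE behaviour
         WHEN 'open' THEN 'click'
         WHEN 'view' THEN 'impression'
       END            AS event_type,
       happened_at    AS event_ts,     -- assumed canonical name
       section        AS topic         -- assumed canonical name
FROM raw_ekstrabladet
UNION ALL
SELECT 'mind', uid, news_id, interaction, ts, category
FROM raw_mind;
""")

event_types = {row[0] for row in con.execute("SELECT DISTINCT event_type FROM news_events")}
total = con.execute("SELECT COUNT(*) FROM news_events").fetchone()[0]
raw_vocab = {row[0] for row in con.execute("SELECT DISTINCT behaviour FROM raw_ekstrabladet")}
print(sorted(event_types), total, sorted(raw_vocab))
```

The raw table still reports open and view, while the view reports only impression and click. Keeping the mapping in one view is what makes the publisher quirk easy to find later.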

Predict, then prove

Before reading section 4, predict:

After unification, how many click events does each publisher contribute, and how many unified events are there in total?

The worked answer on the module page: Ekstra Bladet 2 clicks, MIND 1 click, 8 unified events total. The trap is the vocabulary — Ekstra Bladet’s clicks are logged as open, so they only count once the CASE rename runs.
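The vocabulary trap can be made concrete in a few lines of plain Python. This is a sketch under assumptions: the event list is hypothetical (it is not the lesson's seed data, only chosen so the totals agree with the worked answer), and the helper name clicks_per_publisher is made up for illustration.

```python
from collections import Counter

# Hypothetical (publisher, raw event) pairs; the 4/4 split per publisher is an assumption.
raw_events = [
    ("ekstrabladet", "view"), ("ekstrabladet", "open"),
    ("ekstrabladet", "view"), ("ekstrabladet", "open"),
    ("mind", "impression"), ("mind", "click"),
    ("mind", "impression"), ("mind", "impression"),
]

# Same mapping the CASE expression applies to Ekstra Bladet's vocabulary.
TO_CANONICAL = {"open": "click", "view": "impression"}


def clicks_per_publisher(events, normalise):
    """Count 'click' events per publisher, optionally mapping raw names first."""
    counts = Counter()
    for publisher, event in events:
        if normalise:
            event = TO_CANONICAL.get(event, event)
        if event == "click":
            counts[publisher] += 1
    return dict(counts)


naive = clicks_per_publisher(raw_events, normalise=False)
unified = clicks_per_publisher(raw_events, normalise=True)
print("naive:", naive)
print("unified:", unified)
print("total events:", len(raw_events))
```

Counting the literal string click on the raw feeds misses Ekstra Bladet entirely; its clicks only appear once open has been mapped.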

Then make it mechanical:

make -C tutorial/ingestion check

tests/test_lesson_01.py loads the lesson, runs the real pipeline into a temp directory, and asserts that the raw shapes are preserved, that the unified view has the canonical columns, that the event_type vocabulary is normalised to {impression, click}, and that the full checkpoint answer matches.

What the lesson teaches that later modules rely on

A dlt resource is just a generator. The same shape scales to the real EB-NeRD/Adressa/MIND sources in serving/dlt_pipeline.py.
Land raw, unify downstream. Publisher quirks are preserved in Parquet and resolved in SQL at a single, visible boundary — the production version calls that boundary stg_unified_impressions.
The canonical news-event shape is stable. It is the same shape the foundations module established; only the source and the publisher column are new.
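To illustrate the first point, here is a minimal sketch of the generator shape on its own. The @dlt.resource decorator is deliberately left off so the snippet runs with no dependencies; in the lesson the same kind of function is wrapped by it. The feed rows and the function name ekstrabladet_events are hypothetical.

```python
from typing import Iterator

# Hypothetical in-file feed using Ekstra Bladet's raw column names.
EKSTRA_BLADET_FEED = [
    {"reader_id": "r1", "story_id": "s1", "behaviour": "view",
     "happened_at": "2024-01-01T08:00:00", "section": "sport"},
    {"reader_id": "r1", "story_id": "s1", "behaviour": "open",
     "happened_at": "2024-01-01T08:00:05", "section": "sport"},
]


def ekstrabladet_events() -> Iterator[dict]:
    """Yield raw rows verbatim: no renaming, no timestamp parsing."""
    yield from EKSTRA_BLADET_FEED


rows = list(ekstrabladet_events())
print(len(rows), type(rows[0]["happened_at"]).__name__)
```

Because the function only yields, swapping the in-file list for a reader over a real download changes the body of the generator but not its shape.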

What the lesson deliberately does not do

No real downloads — two tiny in-file feeds keep it hermetic and fast.
No DuckDB CLI required — make run uses the duckdb Python library via uv.
No persistence to clean up by hand — make clean removes the work dir.

After this lesson

Transformation reshapes the unified events into mart-level features; orchestration turns the dlt sources into scheduled Dagster assets.