Transformation lessons

L01 — Tested marts with dbt

The project: tutorial/transformation/dbt/.

Run it

make -C tutorial/transformation run

That target runs a single dbt build, which seeds the CSV, builds the models, and runs the data tests in dependency order. dbt derives that order from the {{ ref(...) }} calls.
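
As a rough illustration of how a build order can fall out of ref() calls, the sketch below scans model SQL for {{ ref('...') }} and topologically sorts the result with Python's graphlib. Only the model names come from this lesson. The SQL bodies are invented, and a regex is a simplification: dbt's own parser is considerably more involved.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical model bodies: only the names come from this lesson,
# the SQL itself is invented to show the idea.
MODELS = {
    "news_events": "",  # the seed has no upstream dependencies
    "stg_news_events": "select * from {{ ref('news_events') }}",
    "article_metrics": "select article_id from {{ ref('stg_news_events') }}",
    "user_topic_affinity": "select user_id from {{ ref('stg_news_events') }}",
}

# Matches {{ ref('name') }} or {{ ref("name") }} and captures the name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def build_order(models: dict[str, str]) -> list[str]:
    """Return node names so every node appears after the nodes it refs."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(MODELS)
print(order)
```

The two marts both depend only on the staging view, so their relative order is not fixed; the only guarantee is that each node comes after everything it refs.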

What runs

Seed. seeds/news_events.csv — the canonical (user_id, article_id, event_type, event_at, topic, publisher) shape from foundations and ingestion — loads into DuckDB.
stg_news_events (view). Casts event_at to a timestamp; light reshaping only. A view because it is cheap and always fresh.
article_metrics and user_topic_affinity (tables). The same CTR metrics that foundations computed ad hoc, now materialised as models, with a schema.yml declaring their columns and not_null/unique tests.
Data tests. The singular test assert_clicks_not_exceed_impressions and the schema tests all run as part of dbt build.
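
The difference between the staging view and the mart tables can be seen in a small sketch. It uses Python's built-in sqlite3 as a stand-in for DuckDB (purely so it runs with no install), and the rows and the events column are invented for illustration.

```python
import sqlite3

# sqlite3 stands in for DuckDB here only because it ships with Python.
con = sqlite3.connect(":memory:")
con.execute("create table news_events (article_id text, event_type text)")
con.execute("insert into news_events values ('x1', 'impression')")

# A view stores the query, so it is re-evaluated on every read.
con.execute("create view stg_news_events as select * from news_events")

# A table stores the result as of the moment it was built.
con.execute(
    "create table article_metrics as "
    "select article_id, count(*) as events "
    "from stg_news_events group by article_id"
)

# New data arrives after the build.
con.execute("insert into news_events values ('x1', 'click')")

view_rows = con.execute("select count(*) from stg_news_events").fetchone()[0]
table_events = con.execute(
    "select events from article_metrics where article_id = 'x1'"
).fetchone()[0]

print(view_rows)     # 2: the view sees the new row immediately
print(table_events)  # 1: the table is stale until the next build
```

That trade-off is why a cheap, light reshaping step suits a view, while aggregated marts that are read often are worth storing as tables.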

Predict, then prove

Before reading the build output, predict:

How many articles have a perfect CTR of 1.0, and why can clicks never exceed impressions?

The worked answer: a004 and a006 (two articles, each shown once and clicked once), and clicks ≤ impressions because a click always follows an impression of the same article.
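
The same reasoning can be checked by hand on a few rows. The sketch below uses invented events, not the lesson's seed, and assumes event_type takes the values 'impression' and 'click'.

```python
from collections import Counter

# Invented rows, NOT the lesson's seed: (article_id, event_type).
# The 'impression' / 'click' labels are an assumption about event_type.
events = [
    ("x1", "impression"), ("x1", "click"),
    ("x2", "impression"), ("x2", "impression"), ("x2", "click"),
    ("x3", "impression"),
]

impressions = Counter(a for a, kind in events if kind == "impression")
clicks = Counter(a for a, kind in events if kind == "click")

# CTR = clicks / impressions, defined only for articles that were shown.
ctr = {a: clicks[a] / n for a, n in impressions.items()}
perfect = sorted(a for a, value in ctr.items() if value == 1.0)

# The invariant the singular test guards:
# no article has more clicks than impressions.
violations = [a for a in clicks if clicks[a] > impressions[a]]

print(ctr)         # {'x1': 1.0, 'x2': 0.5, 'x3': 0.0}
print(perfect)     # ['x1']
print(violations)  # []
```

An article reaches a CTR of 1.0 only when every impression is followed by a click, which is why the perfect-CTR articles tend to be ones shown very few times.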

Then make it mechanical:

make -C tutorial/transformation check

tests/test_lesson_01.py runs a real dbt build into a temp warehouse and asserts that it succeeds, which means every dbt data test passed. It then checks the marts’ facts: the perfect-CTR pair, the CTR values, and u003’s sports affinity of 1.0, carried over from the foundations checkpoint.
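
A harness of that kind might look roughly like the sketch below. It is not the contents of tests/test_lesson_01.py. The environment variable name is hypothetical, and the sketch assumes that profiles.yml lives in the project directory and reads the DuckDB path through env_var(). Only dbt build, --project-dir, and --profiles-dir are real dbt CLI features.

```python
import os
import subprocess
import tempfile
from pathlib import Path

# Project location from this lesson; everything else here is an assumption.
PROJECT_DIR = Path("tutorial/transformation/dbt")

# Hypothetical variable name: assumes profiles.yml reads the DuckDB path
# through env_var(), which this lesson does not confirm.
WAREHOUSE_ENV = "LESSON_WAREHOUSE_PATH"


def build_command(project_dir: Path) -> list[str]:
    """dbt build, assuming profiles.yml sits next to dbt_project.yml."""
    return [
        "dbt", "build",
        "--project-dir", str(project_dir),
        "--profiles-dir", str(project_dir),
    ]


def run_build(project_dir: Path = PROJECT_DIR) -> Path:
    """Build into a throwaway warehouse and return its path."""
    warehouse = Path(tempfile.mkdtemp()) / "lesson.duckdb"
    env = {**os.environ, WAREHOUSE_ENV: str(warehouse)}
    # dbt exits non-zero when a model or data test fails, so check=True
    # turns any failing test into a CalledProcessError.
    subprocess.run(build_command(project_dir), env=env, check=True)
    return warehouse


# Printing the command is safe anywhere; run_build() needs dbt installed.
print(build_command(PROJECT_DIR))
```

A test would then call run_build(), open the returned warehouse file, and query the marts for the expected facts.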

What the lesson teaches that later modules rely on

A dbt model is a SELECT plus a materialisation. ref() builds the lineage graph; +materialized: table vs view is a storage choice.
Data tests are first-class. Schema tests and singular tests turn invariants into build failures — the dbt-native checkpoint.
The analytical contract is governed. The serving project scales this to staging-per-publisher, the constraint_configurations table, sensitivity, and embeddings — same discipline, more models.
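
The idea that a data test is a query for violating rows, and that any returned row fails the build, can be sketched as follows. sqlite3 again stands in for DuckDB, the rows are invented and deliberately broken, and the queries only approximate the shape of dbt's not_null and unique tests.

```python
import sqlite3

# sqlite3 is a stand-in for DuckDB; rows are invented and deliberately broken.
con = sqlite3.connect(":memory:")
con.execute(
    "create table article_metrics (article_id text, impressions int, clicks int)"
)
con.executemany(
    "insert into article_metrics values (?, ?, ?)",
    [("x1", 3, 1), ("x1", 2, 2), (None, 1, 0), ("x2", 1, 4)],
)

# Each check selects the rows that violate an invariant; zero rows means pass.
CHECKS = {
    "not_null_article_id": (
        "select * from article_metrics where article_id is null"
    ),
    "unique_article_id": (
        "select article_id from article_metrics where article_id is not null "
        "group by article_id having count(*) > 1"
    ),
    "clicks_not_exceed_impressions": (
        "select * from article_metrics where clicks > impressions"
    ),
}


def failing_checks(connection: sqlite3.Connection) -> list[str]:
    """Names of the checks whose query returned at least one row."""
    return [
        name for name, sql in CHECKS.items()
        if connection.execute(sql).fetchall()
    ]


failed = failing_checks(con)
print(failed)
# ['not_null_article_id', 'unique_article_id', 'clicks_not_exceed_impressions']
```

The first two checks play the role of schema tests declared against a column, while the third is the singular kind: a standalone query written for one specific invariant.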

What the lesson deliberately does not do

No real publishers — one seeded CSV keeps it hermetic and fast.
No dbt deps/packages — only built-in and singular tests, so there is nothing to install.
No committed warehouse — dbt/target/, dbt/logs/, and the /tmp warehouse are git-ignored and removed by make clean.

After this lesson

Modeling consumes the article features to compute embeddings and candidate sets; orchestration wraps the dbt project as Dagster assets.