Transformation

Transformation is where raw landed data becomes the analytical contract: documented, tested, lineage-tracked tables that both the FastAPI app and notebook-driven analysts consume. Foundations computed metrics with an ad-hoc query; ingestion landed raw events as Parquet; transformation promotes those metrics into governed dbt models with data tests, so a broken assumption fails the build instead of slipping through.

The runnable rep

Rep #3 is a self-contained dbt-duckdb project: tutorial/transformation/dbt/. It seeds the same news events you already know and builds two marts, guarded by dbt tests.

make -C tutorial/transformation run     # dbt build: seed -> models -> data tests
make -C tutorial/transformation check   # assert the marts + that the tests pass
make -C tutorial/transformation clean   # drop the warehouse + dbt artefacts

make run is a single dbt build, which does three things in dependency order:

Seed. seeds/news_events.csv (the canonical shape from foundations and ingestion) is loaded into DuckDB as a table.
Model. stg_news_events (a view — cheap reshaping) feeds two marts materialised as tables: article_metrics (per-article CTR) and user_topic_affinity (per-(user, topic) CTR). Each model is just a SELECT with a {{ ref(...) }} pointing at its upstream — that ref is how dbt builds the lineage graph for you.
Test. Generic data tests (not_null, unique) declared alongside the models in YAML, plus a singular data test, assert_clicks_not_exceed_impressions, run automatically as part of the build.
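To make the "dependency order" idea concrete, here is a minimal sketch of how ref calls imply a build order. It is a toy, not dbt's implementation: dbt renders Jinja properly, whereas this only pattern-matches on the text. The model bodies below are hypothetical stand-ins; only the model and seed names come from the project described above.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical model bodies: the real SQL in the project may differ.
MODELS = {
    "stg_news_events": "select * from {{ ref('news_events') }}",
    "article_metrics": "select article_id from {{ ref('stg_news_events') }}",
    "user_topic_affinity": "select user_id, topic from {{ ref('stg_news_events') }}",
}

# Matches {{ ref('name') }} or {{ ref("name") }} and captures the name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def upstreams(sql: str) -> set[str]:
    """Return the names of every node the SQL refers to via ref()."""
    return set(REF_PATTERN.findall(sql))


def build_order(models: dict[str, str]) -> list[str]:
    """Order nodes so that each one comes after everything it refs."""
    graph = {name: upstreams(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(MODELS)
print(order)
```

The seed comes first, the staging view second, and the two marts last (their relative order is arbitrary, since neither depends on the other). That is the same shape as seed -> models -> data tests in a single dbt build.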

The shift from foundations is the point: there, article_metrics was a view you eyeballed; here it is a model with declared columns and tests, so the build itself refuses to produce a contract that violates its own invariants.

The checkpoint, worked

The lesson’s checkpoint is a data invariant and a reading question:

The data test assert_clicks_not_exceed_impressions passes — why must clicks never exceed impressions? And how many articles have a perfect CTR of 1.0?

Clicks ≤ impressions because a click is only ever logged after the same article was shown (an impression) — so per article, clicks can structurally never outnumber impressions. The singular test encodes exactly that: it selects rows where clicks > impressions and dbt fails if any come back.
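A singular test is simply a query that returns the offending rows, so "passing" means "returns nothing". The sketch below mimics that mechanism with Python's built-in sqlite3 as a stand-in for DuckDB. The SQL text and the impressions/clicks column names are assumptions, since the project's actual test file is not reproduced here.

```python
import sqlite3

# Assumed shape of the singular test: select the rows that break the invariant.
SINGULAR_TEST_SQL = """
    SELECT article_id, impressions, clicks
    FROM article_metrics
    WHERE clicks > impressions
"""


def violations(conn: sqlite3.Connection) -> list[tuple]:
    """Rows that violate clicks <= impressions; an empty list means the test passes."""
    return conn.execute(SINGULAR_TEST_SQL).fetchall()


conn = sqlite3.connect(":memory:")  # sqlite3 stands in for the DuckDB warehouse
conn.execute(
    "CREATE TABLE article_metrics (article_id TEXT, impressions INTEGER, clicks INTEGER)"
)
conn.executemany(
    "INSERT INTO article_metrics VALUES (?, ?, ?)",
    [("a004", 1, 1), ("a005", 1, 0)],  # counts as described in this lesson
)
healthy = violations(conn)
print(healthy)  # [] -> the test would pass

# Deliberately corrupt the table with a made-up row to see the test fail.
conn.execute("INSERT INTO article_metrics VALUES ('broken', 1, 2)")
broken = violations(conn)
print(broken)  # [('broken', 1, 2)] -> the test would fail the build
```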

Two articles have a perfect CTR of 1.0: a004 and a006 — each was shown once and clicked once. Everything else sits at 0.5 (shown twice, clicked once) or 0.0 (a005, shown once, never clicked). The data test is the mechanical version of foundations’ “predict, then check”: the invariant is asserted by the build, not by your eye.
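The arithmetic behind those numbers can be checked with a small aggregation. This is a sketch under assumptions: the two-column event shape is simplified, sqlite3 stands in for DuckDB, and the article id a_twice is a hypothetical placeholder for any article shown twice and clicked once. The rows for a004, a005 and a006 follow the counts stated above.

```python
import sqlite3

# (article_id, event_type) pairs; a_twice is a hypothetical placeholder id.
EVENTS = [
    ("a004", "impression"), ("a004", "click"),
    ("a005", "impression"),
    ("a006", "impression"), ("a006", "click"),
    ("a_twice", "impression"), ("a_twice", "impression"), ("a_twice", "click"),
]

CTR_SQL = """
    SELECT article_id,
           1.0 * SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END)
               / SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS ctr
    FROM events
    GROUP BY article_id
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (article_id TEXT, event_type TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", EVENTS)

# Map each article to its click-through rate (clicks / impressions).
ctr_by_article = dict(conn.execute(CTR_SQL).fetchall())
perfect = sorted(a for a, ctr in ctr_by_article.items() if ctr == 1.0)

print(ctr_by_article["a_twice"], ctr_by_article["a005"])  # 0.5 0.0
print(perfect)  # ['a004', 'a006']
```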

The grown-up version: the serving dbt project

The toy project scales to the real analytical contract in tutorial/serving/dbt/, materialised against the same DuckDB+Parquet substrate the FastAPI service queries — the operational shape of the two-contracts argument (ADR-0006). It has three model groups:

Staging — one model per publisher per source table (stg_ebnerd_*, stg_adressa_*, stg_mind_*) plus the unified stg_unified_impressions view. Renames + casts only; semantics stay downstream.
Editorial — the constraint_configurations table and the article_sensitivity Python model that turns NER hits + EB-NeRD sentiment into a queryable is_sensitive flag. Editorial concerns enter the contract as rows, not buried code.
Embeddings — article_embeddings.py, a dbt Python model that runs a sentence-transformer once per article and writes the vector plus the modeling-lifecycle model id/name/revision.

The tests and the output of dbt docs generate are deliverables, not side effects: the generated lineage and column docs are embedded under Data reference so an analyst can browse the contract without reading SQL or Python.

What makes this a good rep

Following the foundations template (ADR-0033):

Standalone and hermetic. The project seeds its own data and builds into a throwaway /tmp warehouse; make clean removes it and dbt’s target/.
run / check / clean. dbt build is the run; the check re-builds into a temp warehouse and asserts both the marts’ facts and that dbt’s data tests passed.
The test is the gate. dbt data tests are the dbt-native form of a checkpoint — the build is green only if the contract holds.
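One way a check step could confirm "dbt's data tests passed" is to read the run_results.json artifact that dbt writes into target/ after a build. This is a sketch of that idea and not necessarily how this repository's check is implemented. The project name tutorial in the sample ids is hypothetical, and treating a warn status as a failure is a design choice made here for strictness.

```python
import json
from pathlib import Path

# Statuses dbt reports for nodes that completed cleanly
# ("success" for seeds and models, "pass" for data tests).
OK_STATUSES = {"success", "pass"}


def failed_nodes(run_results: dict) -> list[str]:
    """Return the unique_id of every node whose status is not clean."""
    return [
        result["unique_id"]
        for result in run_results["results"]
        if result["status"] not in OK_STATUSES
    ]


def check_build(target_dir: str) -> None:
    """Exit with a non-zero status if any node in the last build did not pass."""
    run_results = json.loads((Path(target_dir) / "run_results.json").read_text())
    failures = failed_nodes(run_results)
    if failures:
        raise SystemExit(f"contract violated by: {', '.join(failures)}")


# Hand-written sample in the artifact's shape; the ids are illustrative.
sample = {
    "results": [
        {"unique_id": "seed.tutorial.news_events", "status": "success"},
        {"unique_id": "model.tutorial.article_metrics", "status": "success"},
        {"unique_id": "test.tutorial.assert_clicks_not_exceed_impressions", "status": "fail"},
    ]
}
print(failed_nodes(sample))  # ['test.tutorial.assert_clicks_not_exceed_impressions']
```

Reading the artifact keeps the gate independent of dbt's console output, which is harder to parse reliably.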

After this module

The transformation outputs feed modeling (features → embeddings → candidate sets), editorial (constraint_configurations + article_sensitivity → ranking), and evaluation (configuration sweeps). Orchestration wraps the whole dbt project as Dagster assets, so a single materialisation traces the path from raw landing to evaluated outputs.