Transformation lessons

The transformation lessons walk through the dbt project at tutorial/serving/dbt/. Each command below runs dbt from the tutorial-serving package:

uv run --package tutorial-serving dbt run --project-dir tutorial/serving/dbt --profiles-dir tutorial/serving/dbt
uv run --package tutorial-serving dbt test --project-dir tutorial/serving/dbt --profiles-dir tutorial/serving/dbt
uv run --package tutorial-serving dbt docs generate --project-dir tutorial/serving/dbt --profiles-dir tutorial/serving/dbt

dbt/models/staging/ has one SQL file per publisher per source: stg_ebnerd_*, stg_adressa_*, stg_mind_*. Each model is a rename + cast pass over the corresponding raw dlt table. No semantic transformation happens here on purpose — semantics belong downstream so the staging boundary stays cheap to test.
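As a rough illustration of what a rename + cast pass looks like, the sketch below runs one against an in-memory SQLite table from Python. The raw table name, column names, and types are all hypothetical. The project's real staging models are dbt SQL files over dlt tables, and the warehouse engine may differ from SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw table, standing in for a table loaded by dlt.
conn.execute(
    "CREATE TABLE raw_impressions (ImpressionId TEXT, UserId TEXT, ReadTime TEXT)"
)
conn.execute("INSERT INTO raw_impressions VALUES ('101', 'u1', '12.5')")

# Staging pass: rename to snake_case and cast to the intended types.
# No filtering, joining, or derived columns happen at this layer.
STAGING_SQL = """
SELECT
    CAST(ImpressionId AS INTEGER) AS impression_id,
    UserId                        AS user_id,
    CAST(ReadTime AS REAL)        AS read_time_seconds
FROM raw_impressions
"""

cursor = conn.execute(STAGING_SQL)
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
```

Because the model only renames and casts, a test at this boundary needs to check little more than column names and types.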

The unified view, stg_unified_impressions.sql, UNIONs the three publishers and tags each row with a publisher column. Everything cross-publisher in the rest of the platform reads from here.
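The shape of such a unified view can be sketched with SQLite from Python. The per-publisher table names, the columns, and the choice of UNION ALL over UNION are assumptions made for illustration; the actual stg_unified_impressions.sql may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical per-publisher staging tables that already share one cleaned shape.
for table in (
    "stg_ebnerd_impressions",
    "stg_adressa_impressions",
    "stg_mind_impressions",
):
    conn.execute(f"CREATE TABLE {table} (impression_id INTEGER, user_id TEXT)")

conn.execute("INSERT INTO stg_ebnerd_impressions VALUES (1, 'u1')")
conn.execute("INSERT INTO stg_adressa_impressions VALUES (1, 'u7')")
conn.execute("INSERT INTO stg_mind_impressions VALUES (2, 'u9')")

# Each branch adds a literal publisher value so rows stay attributable
# after the union, even when ids collide across publishers.
UNIFIED_SQL = """
SELECT 'ebnerd'  AS publisher, impression_id, user_id FROM stg_ebnerd_impressions
UNION ALL
SELECT 'adressa' AS publisher, impression_id, user_id FROM stg_adressa_impressions
UNION ALL
SELECT 'mind'    AS publisher, impression_id, user_id FROM stg_mind_impressions
ORDER BY publisher, impression_id
"""

unified = conn.execute(UNIFIED_SQL).fetchall()
```

Note that impression_id 1 appears for two publishers in this toy data; the publisher column is what keeps those rows distinguishable downstream.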

dbt/models/editorial/ materialises two things that bridge editorial intent and platform data:

  • constraint_configurations.sql — the configuration table written to by editors via the editor interface and read by the ranker on every recommendation request. Columns match ADR-0015 exactly.
  • article_sensitivity.py — a dbt Python model combining EB-NeRD’s sentiment scores and a small NER-driven keyword pass into a boolean is_sensitive per article. This is the boundary where “editorial guard” becomes a queryable platform fact.
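To make the sensitivity flag concrete, here is a minimal sketch of how a sentiment score and a keyword pass might combine into one boolean. The threshold, the keyword list, the field names, and the OR combination are all hypothetical; the lesson does not specify how article_sensitivity.py weighs the two signals.

```python
from dataclasses import dataclass

# Hypothetical values: the real threshold and keyword list are not given here.
NEGATIVE_SENTIMENT_THRESHOLD = 0.8
SENSITIVE_KEYWORDS = frozenset({"death", "assault", "suicide"})


@dataclass(frozen=True)
class Article:
    article_id: str
    sentiment_label: str       # assumed to be e.g. "Negative", "Neutral", "Positive"
    sentiment_score: float     # assumed confidence of the label, in [0, 1]
    entities: tuple[str, ...]  # assumed output of the NER-driven keyword pass


def is_sensitive(article: Article) -> bool:
    """Flag an article if it is strongly negative OR mentions a sensitive keyword."""
    strongly_negative = (
        article.sentiment_label == "Negative"
        and article.sentiment_score >= NEGATIVE_SENTIMENT_THRESHOLD
    )
    keyword_hit = any(
        entity.lower() in SENSITIVE_KEYWORDS for entity in article.entities
    )
    return strongly_negative or keyword_hit


benign = Article("a1", "Positive", 0.9, ("football",))
negative = Article("a2", "Negative", 0.95, ())
keyword = Article("a3", "Neutral", 0.6, ("Assault",))
flags = {a.article_id: is_sensitive(a) for a in (benign, negative, keyword)}
```

Keeping the rule as a pure function like this makes it easy to pin down with seeded cases, whatever the real heuristic turns out to be.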

dbt/models/staging/article_embeddings.py runs the sentence-transformer once per article and writes the embedding column into Parquet. Because it lives as a dbt Python model rather than an ad-hoc script, the embeddings get the same lineage tracking, tests, and docs treatment as every other column in the analytical contract.
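The "once per article" property can be sketched independently of dbt and of the embedding library. In the sketch below, embed_articles and fake_encode are hypothetical names, and fake_encode is a stand-in for the sentence-transformer so the example runs without a model download; it is not how the real model computes vectors.

```python
from typing import Callable, Iterable

Vector = list[float]


def embed_articles(
    articles: Iterable[tuple[str, str]],
    encode: Callable[[str], Vector],
) -> dict[str, Vector]:
    """Encode each article_id once, even if it appears repeatedly in the input."""
    embeddings: dict[str, Vector] = {}
    for article_id, text in articles:
        if article_id not in embeddings:
            embeddings[article_id] = encode(text)
    return embeddings


# Stand-in encoder that records its calls; a real model would return
# a dense vector with a few hundred dimensions.
calls: list[str] = []


def fake_encode(text: str) -> Vector:
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


articles = [
    ("a1", "hello world"),
    ("a2", "breaking news today"),
    ("a1", "hello world"),  # duplicate row, e.g. from repeated impressions
]
embeddings = embed_articles(articles, fake_encode)
```

Only two encoder calls happen for the three input rows, which is the cost property that matters when the encoder is an expensive model.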

dbt test runs every test in schema.yml plus the custom assertion at dbt/tests/assert_article_sensitivity_seeded_cases.sql. The custom test pins known-sensitive seed articles to is_sensitive = true and known-benign ones to is_sensitive = false, so a regression in the sensitivity heuristic makes dbt test fail loudly.
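A custom (singular) dbt test is a SELECT that passes when it returns zero rows. The sketch below mimics that contract with SQLite from Python. The seeded_cases table, its columns, and the query itself are hypothetical stand-ins, not the contents of assert_article_sensitivity_seeded_cases.sql.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article_sensitivity (article_id TEXT, is_sensitive INTEGER)")
conn.execute("CREATE TABLE seeded_cases (article_id TEXT, expected_is_sensitive INTEGER)")
conn.executemany(
    "INSERT INTO article_sensitivity VALUES (?, ?)", [("a1", 1), ("a2", 0)]
)
conn.executemany(
    "INSERT INTO seeded_cases VALUES (?, ?)", [("a1", 1), ("a2", 0)]
)

# Returns one row per seeded article whose flag is wrong or missing.
# IS NOT is SQLite's NULL-safe inequality, so an article absent from the
# model output is reported instead of silently skipped.
ASSERTION_SQL = """
SELECT s.article_id, s.expected_is_sensitive, m.is_sensitive
FROM seeded_cases AS s
LEFT JOIN article_sensitivity AS m ON m.article_id = s.article_id
WHERE m.is_sensitive IS NOT s.expected_is_sensitive
"""

failures = conn.execute(ASSERTION_SQL).fetchall()  # empty list means "pass"

# Simulate a heuristic regression that stops flagging a known-sensitive article.
conn.execute("UPDATE article_sensitivity SET is_sensitive = 0 WHERE article_id = 'a1'")
failures_after_regression = conn.execute(ASSERTION_SQL).fetchall()
```

The failing rows carry both the expected and the actual flag, so the test output points directly at the seed article that regressed.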

L05 — Generated docs as the analyst’s reference

dbt docs generate produces the lineage graph and column-level documentation that the docs site embeds at Data reference. This makes the “analyst” half of the two-contracts argument (ADR-0006) operational: analysts don’t need to read Python to understand the schema.

Embeddings flow into modeling, the constraint configurations and sensitivity flag flow into editorial, and orchestration wraps the whole dbt project as Dagster assets so a single materialisation pulls upstream ingest in automatically.