Transformation

Transformation is where raw landed data becomes the analytical contract that both the FastAPI app and notebook-driven analysts consume. This module makes the platform’s data model (articles, impressions, clicks, embeddings, editorial config, sensitivity flags) first-class, documented, and tested.

Every model lives in tutorial/serving/dbt/. The dbt project is materialised against the same DuckDB+Parquet substrate the FastAPI service queries, which is the operational shape of the two-contracts argument (ADR-0006).

Three model groups live under dbt/models/:

  • Staging (models/staging/): one staging model per publisher per source table (stg_ebnerd_*, stg_adressa_*, stg_mind_*) plus the unified view stg_unified_impressions.sql. These models only rename columns and cast them to canonical types; semantic work happens downstream.
  • Editorial (models/editorial/): the constraint configuration table (constraint_configurations.sql) and the article_sensitivity Python model (article_sensitivity.py) that combines NER hits and EB-NeRD sentiment scores into a boolean is_sensitive flag. This is where editorial concerns enter the analytical contract as queryable rows, not buried code.
  • Embeddings (models/staging/article_embeddings.py): a dbt Python model that runs an off-the-shelf sentence-transformer once per article and writes the vector column into Parquet for the recommender to pick up.
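As a minimal sketch of how NER hits and a sentiment score could collapse into a single is_sensitive flag, consider the pure function below. The entity labels, the threshold, the field names, and the "sensitive entity AND strongly negative sentiment" rule are all assumptions made for illustration; the actual logic in article_sensitivity.py may differ.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical values: the real label set and threshold are not given in this page.
SENSITIVE_ENTITY_LABELS = frozenset({"CRIME", "HEALTH", "MINOR"})
NEGATIVE_SENTIMENT_THRESHOLD = 0.8


@dataclass(frozen=True)
class ArticleSignals:
    """Per-article inputs; field names are illustrative, not the real schema."""

    article_id: str
    ner_labels: tuple[str, ...]   # entity labels found by NER
    negative_sentiment: float     # assumed to lie in [0, 1]


def is_sensitive(signals: ArticleSignals) -> bool:
    """Assumed rule: a sensitive entity co-occurring with strongly negative sentiment."""
    has_sensitive_entity = any(
        label in SENSITIVE_ENTITY_LABELS for label in signals.ner_labels
    )
    return (
        has_sensitive_entity
        and signals.negative_sentiment >= NEGATIVE_SENTIMENT_THRESHOLD
    )


def to_rows(articles: list[ArticleSignals]) -> list[dict]:
    """Shape the output as one queryable row per article."""
    return [
        {"article_id": a.article_id, "is_sensitive": is_sensitive(a)}
        for a in articles
    ]
```

In a dbt Python model, logic like this would sit inside the `model(dbt, session)` function that dbt calls, with the resulting rows returned as a DataFrame. Keeping the rule in a small pure function makes it easy to exercise against seeded cases.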

Tests + docs are the deliverable, not a side effect

The whole module’s value is that the analytical contract is falsifiable. That means:

  • Every model has a schema.yml declaring expected columns and at least one test (non-null primary keys, referential checks across publishers).
  • A custom data test lives at dbt/tests/assert_article_sensitivity_seeded_cases.sql — it asserts that known-sensitive seed articles get flagged and known-benign ones don’t.
  • dbt docs generate runs in CI and the output is embedded in this docs site under Data reference so an analyst can browse lineage, column descriptions, and tests without reading Python or SQL.
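The seeded-case check follows the general shape of a dbt singular test: a query that selects the rows violating an expectation, so the test passes when no rows come back. The sketch below reproduces that shape in Python, using the standard-library sqlite3 module as a stand-in for DuckDB. The seeded_cases table, its expected_sensitive column, and the article IDs are hypothetical and not taken from the real test file.

```python
import sqlite3

# Shaped like a dbt singular test: return only the rows that violate the expectation.
# Table and column names other than article_sensitivity / is_sensitive are assumptions.
FAILING_ROWS_SQL = """
SELECT s.article_id, s.expected_sensitive, a.is_sensitive
FROM seeded_cases AS s
LEFT JOIN article_sensitivity AS a
  ON a.article_id = s.article_id
WHERE a.is_sensitive IS NULL
   OR a.is_sensitive <> s.expected_sensitive
ORDER BY s.article_id
"""


def failing_rows(flags, seeds):
    """Load model output and seeded expectations, then return the violations.

    flags: iterable of (article_id, is_sensitive) with 1/0 flags
    seeds: iterable of (article_id, expected_sensitive) with 1/0 flags
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE article_sensitivity "
            "(article_id TEXT PRIMARY KEY, is_sensitive INTEGER)"
        )
        conn.execute(
            "CREATE TABLE seeded_cases "
            "(article_id TEXT PRIMARY KEY, expected_sensitive INTEGER)"
        )
        conn.executemany("INSERT INTO article_sensitivity VALUES (?, ?)", flags)
        conn.executemany("INSERT INTO seeded_cases VALUES (?, ?)", seeds)
        return conn.execute(FAILING_ROWS_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    seeds = [("seed-sensitive", 1), ("seed-benign", 0)]
    model_output = [("seed-sensitive", 1), ("seed-benign", 0)]
    violations = failing_rows(model_output, seeds)
    print("PASS" if not violations else f"FAIL: {violations}")
```

The LEFT JOIN is a deliberate choice in this sketch: a seeded article that is missing from the model output counts as a failure instead of silently dropping out of the comparison.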

The thesis (ADR-0004) says the platform is the product. That only works if the analytical contract is documented, tested, and lineage-tracked at the same fidelity as the HTTP contract. Raw SQL scripts hide the dependency graph; dbt forces it into the open. The cost, one more tool, is repaid many times over by the auto-generated docs, the per-column lineage, and a test discipline that forces schema changes to flow through the platform deliberately.
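To make the dependency-graph point concrete, here is a minimal sketch of what an explicit lineage graph buys you: a valid build order, and the set of models affected by a change to one of them. It uses Python's standard graphlib module, not dbt itself. The stg_publisher_* names and every edge in the mapping are hypothetical placeholders, not the project's real lineage.

```python
from graphlib import TopologicalSorter

# Maps each model to the upstream models it depends on.
# All edges and the stg_publisher_* names are invented for illustration.
LINEAGE = {
    "stg_publisher_a_impressions": set(),
    "stg_publisher_b_impressions": set(),
    "stg_publisher_a_articles": set(),
    "stg_unified_impressions": {
        "stg_publisher_a_impressions",
        "stg_publisher_b_impressions",
    },
    "article_sensitivity": {"stg_publisher_a_articles"},
}


def build_order(lineage):
    """Return one valid build order in which upstream models come first."""
    return list(TopologicalSorter(lineage).static_order())


def downstream_of(model, lineage):
    """Return every model that transitively depends on `model`."""
    affected = set()
    frontier = [model]
    while frontier:
        current = frontier.pop()
        for child, parents in lineage.items():
            if current in parents and child not in affected:
                affected.add(child)
                frontier.append(child)
    return affected


if __name__ == "__main__":
    print(build_order(LINEAGE))
    print(downstream_of("stg_publisher_a_articles", LINEAGE))
```

With a set of standalone SQL scripts, the equivalent of LINEAGE exists only in the order someone remembers to run them in. Once the graph is declared, the question "what breaks if this staging model changes?" becomes a lookup.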

The transformation outputs feed three downstream consumers:

  • Modeling reads article and user features to compute embeddings + candidate sets.
  • Editorial reads constraint_configurations and article_sensitivity to apply editorial constraints during ranking.
  • Evaluation sweeps configurations and writes results back into the same DuckDB substrate as the analytical contract.

Orchestration wraps the whole dbt project as Dagster assets so a single Dagster materialization run covers the full path from raw landing to evaluated outputs.