Ingestion
The ingestion module is where the platform stops being a toy. Lesson 1 of
foundations builds tables from VALUES; ingestion replaces
that with dlt pipelines that pull three real Scandinavian and international
news-recommendation datasets into local Parquet files queryable by DuckDB.
The point of the module is not “we can call an HTTP endpoint.” The point is that each publisher has its own raw shape, and the ingestion layer has to preserve that fidelity while still letting downstream models work against a canonical unified view.
The three publishers and why each is here
- EB-NeRD is the primary substrate. Released by Ekstra Bladet (inside the JP/Politikens media group), it carries Danish article text, per-article sentiment scores, and impression logs at real volume.
- Adressa adds a Norwegian local-news comparison point. Different session shape, no sentiment scores, more cold-start sparsity.
- MIND is the Microsoft News dataset — English, broader category taxonomy, longer impression slates. It tests whether the platform survives a publisher whose semantics differ at the schema level.
The cross-publisher work is the test of the architectural claim. If the ranker, evaluation, and editor only worked on EB-NeRD, the platform would not actually be a platform — it would be a single-publisher application with a docs site stapled on.
What the code looks like
The ingestion code lives in
tutorial/serving/src/serving/dlt_pipeline.py.
Three dlt sources land into local Parquet under data/raw/<publisher>/,
git-ignored by design (ADR-0014:
data is regenerated or re-downloaded, never committed). The Dagster wrapping
that exposes each source as a materialisable asset is in
tutorial/serving/src/serving/definitions.py
— see the orchestration module for how schedules and
sensors are wired on top.
Tests anchor the publisher-specific behaviour:
- test_adressa_pipeline.py and test_mind_pipeline.py verify each pipeline lands the expected rows.
- Staging tests (test_adressa_staging.py, test_mind_staging.py) confirm the unified stg_unified_impressions view includes each publisher.
The unified staging contract
The whole module’s payoff is one downstream view:
stg_unified_impressions
(source).
It is the first place every publisher meets on equal terms: user_id,
article_id, event_type, event_at, publisher. Everything
publisher-specific stays in stg_<publisher>_* models; everything
cross-publisher reads from the unified view.
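
To make the contract concrete, here is a minimal sketch of what such a unified view could look like. It is not the project's actual model. The per-publisher table names (stg_ebnerd_impressions and so on), the extra columns, and the sample rows are hypothetical, and Python's built-in sqlite3 stands in for DuckDB so the example runs without extra dependencies.

```python
import sqlite3

# In-memory database; sqlite3 is a stand-in for DuckDB in this sketch.
conn = sqlite3.connect(":memory:")

conn.executescript(
    """
    -- Hypothetical publisher-specific staging tables. Each keeps its own
    -- extra columns (sentiment_score, slate_size) out of the shared contract.
    CREATE TABLE stg_ebnerd_impressions (
        user_id TEXT, article_id TEXT, event_type TEXT, event_at TEXT,
        sentiment_score REAL
    );
    CREATE TABLE stg_adressa_impressions (
        user_id TEXT, article_id TEXT, event_type TEXT, event_at TEXT
    );
    CREATE TABLE stg_mind_impressions (
        user_id TEXT, article_id TEXT, event_type TEXT, event_at TEXT,
        slate_size INTEGER
    );

    -- Made-up sample rows, one per publisher.
    INSERT INTO stg_ebnerd_impressions
        VALUES ('u1', 'a1', 'click', '2023-05-01T08:00:00', 0.91);
    INSERT INTO stg_adressa_impressions
        VALUES ('u2', 'a2', 'view', '2017-01-03T09:30:00');
    INSERT INTO stg_mind_impressions
        VALUES ('u3', 'a3', 'click', '2019-11-12T17:45:00', 40);

    -- The unified view exposes only the five canonical columns and tags
    -- every row with the publisher it came from.
    CREATE VIEW stg_unified_impressions AS
    SELECT user_id, article_id, event_type, event_at, 'ebnerd' AS publisher
    FROM stg_ebnerd_impressions
    UNION ALL
    SELECT user_id, article_id, event_type, event_at, 'adressa' AS publisher
    FROM stg_adressa_impressions
    UNION ALL
    SELECT user_id, article_id, event_type, event_at, 'mind' AS publisher
    FROM stg_mind_impressions;
    """
)

# Column names of the view, read from the cursor metadata.
cursor = conn.execute("SELECT * FROM stg_unified_impressions")
columns = [description[0] for description in cursor.description]

# Publishers that actually appear in the view.
publishers = sorted(
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT publisher FROM stg_unified_impressions"
    )
)

print(columns)
print(publishers)
```

The final check mirrors what the staging tests are described as doing: confirming that each publisher shows up in the unified view, while sentiment_score and slate_size never leak into it.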
That boundary is not just plumbing. It is how the platform keeps comparison
honest. A sentiment metric cannot pretend Adressa has EB-NeRD’s
sentiment_score. The ranker degrades gracefully (the sentiment soft-term
contributes zero when no score exists), and the evaluation harness reports
metric availability per publisher rather than inventing values. Graceful
degradation is an accountability feature.
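
The degradation rule can be sketched in a few lines. This is an illustration under stated assumptions, not the ranker's real code: the function names, the additive scoring form, and the SENTIMENT_WEIGHT value are all hypothetical. Only the behaviour is taken from the text above: a missing score contributes zero, and availability is reported per publisher instead of being filled in.

```python
from typing import Optional

# Hypothetical weight for the sentiment soft-term; the real value is not
# stated in this document.
SENTIMENT_WEIGHT = 0.2


def rank_score(relevance: float, sentiment_score: Optional[float]) -> float:
    """Combine relevance with a sentiment soft-term.

    When the publisher provides no sentiment score, the term contributes
    exactly zero, so the article is ranked on relevance alone.
    """
    if sentiment_score is None:
        sentiment_term = 0.0
    else:
        sentiment_term = SENTIMENT_WEIGHT * sentiment_score
    return relevance + sentiment_term


def sentiment_metric_availability(
    rows_by_publisher: dict[str, list[dict]],
) -> dict[str, bool]:
    """Report, per publisher, whether any row carries a sentiment score.

    A publisher with no scores is reported as unavailable; no value is
    imputed on its behalf.
    """
    return {
        publisher: any(row.get("sentiment_score") is not None for row in rows)
        for publisher, rows in rows_by_publisher.items()
    }


# Made-up sample rows: only the EB-NeRD row has a sentiment score.
sample = {
    "ebnerd": [{"article_id": "a1", "sentiment_score": 0.9}],
    "adressa": [{"article_id": "a2"}],
    "mind": [{"article_id": "a3", "sentiment_score": None}],
}

availability = sentiment_metric_availability(sample)
print(availability)
print(rank_score(0.5, None))
```

Keeping the missing-score branch explicit, instead of defaulting to a neutral placeholder score, is what lets the evaluation side say "not available" for a publisher without implying a measured value.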
After this module
Once data is landed and unified, transformation reshapes it into mart-level features: article metadata, user histories, embeddings. Orchestration is where the dlt sources become Dagster assets with checks, schedules, and sensors.