Modeling lessons
The modeling lessons cover the recommender model — deliberately modest per
ADR-0007.
Everything lives in the serving package:
tutorial/serving/src/serving/embeddings.py
and
tutorial/serving/src/serving/recommendations.py.
L01 — Sentence embeddings as a dbt model
The article embedding pass runs as a dbt Python model so it gets the same
lineage tracking and documentation as every other column in the analytical
contract:
dbt/models/staging/article_embeddings.py.

    uv run --package tutorial-serving dbt run --select article_embeddings \
      --project-dir tutorial/serving/dbt --profiles-dir tutorial/serving/dbt

The model uses an off-the-shelf multilingual sentence-transformer that handles Danish, Norwegian, and English article text. Encoding is deterministic for a given model + input, so the output Parquet is reproducible across re-runs.
L02 — User embeddings from recent reads
User vectors are computed in
recommendations.py:
take the user’s last-K read articles, look up their embeddings, average
them. That mean vector is the user’s representation.
It is the simplest possible user model. It does not learn user preferences over time. It does not weight recent reads more heavily. It does not penalise topics the user has unsubscribed from. All of those would be recommender-research moves, and per ADR-0007 the model is not where the tutorial competes.
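The averaging step described above can be sketched in a few lines. This is a minimal illustration, not the code in recommendations.py: the function name `user_embedding`, the default `k=20`, the oldest-to-newest ordering of the history, and the choice to return `None` for a user with no usable reads are all assumptions made for the example.

```python
from typing import Mapping, Optional, Sequence


def user_embedding(
    read_history: Sequence[str],
    article_embeddings: Mapping[str, Sequence[float]],
    k: int = 20,
) -> Optional[list[float]]:
    """Return the mean embedding of the user's last-k read articles.

    Assumes read_history is ordered oldest to newest (hypothetical convention).
    Returns None when none of the recent reads has an embedding, which a
    caller could treat as the cold-start signal.
    """
    if k <= 0:
        raise ValueError("k must be positive")

    # Keep only the last k reads that actually have an embedding.
    recent = [a for a in read_history[-k:] if a in article_embeddings]
    if not recent:
        return None

    dim = len(article_embeddings[recent[0]])
    sums = [0.0] * dim
    for article_id in recent:
        for i, x in enumerate(article_embeddings[article_id]):
            sums[i] += x

    # Unweighted mean: every read counts the same, regardless of recency.
    return [s / len(recent) for s in sums]
```

The unweighted mean is the point: there is no recency weighting and no learned parameter, so the user vector is fully determined by the last-K reads and the article embeddings.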
L03 — Candidate generation
For a known user: take the user vector, compute cosine similarity against every article embedding, exclude already-read articles, return the top-N nearest. That’s the candidate set.
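As a rough sketch of the known-user path, the brute-force scan below scores every unread article by cosine similarity and keeps the top N. The names `cosine` and `top_n_candidates`, the default `n=10`, and the tie-break on article id are assumptions for illustration; the real module may well use a vectorised implementation.

```python
import math
from typing import Collection, Mapping, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_candidates(
    user_vector: Sequence[float],
    article_embeddings: Mapping[str, Sequence[float]],
    already_read: Collection[str],
    n: int = 10,
) -> list[str]:
    """Return ids of the n unread articles most similar to the user vector."""
    scored = [
        (cosine(user_vector, vec), article_id)
        for article_id, vec in article_embeddings.items()
        if article_id not in already_read  # never recommend something already read
    ]
    # Highest similarity first; ties broken by article id so output is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [article_id for _, article_id in scored[:n]]
```

A full scan is linear in the number of articles per request, which is acceptable for a tutorial-sized catalogue and keeps the logic easy to test.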
For the cold-start path (no read history): fall back to popularity-by-category
over a recent time window. Articles are scored by category-level CTR plus
recency, then sampled to ensure a mix of categories. The cold-start fallback
is in the same file so the routing logic (if user has embedding...) sits next
to both paths.
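One way to picture the cold-start scoring is the sketch below. Everything specific in it is an assumption rather than the tutorial's actual formula: the `Article` record, the 48-hour window, the linear recency term, the `recency_weight` of 0.1, and in particular the deterministic round-robin across categories, which stands in here for the sampling step described above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Article:
    article_id: str
    category: str
    age_hours: float  # time since publication


def cold_start_candidates(
    articles: Sequence[Article],
    category_ctr: Mapping[str, float],
    n: int = 10,
    window_hours: float = 48.0,
    recency_weight: float = 0.1,
) -> list[str]:
    """Pick up to n recent articles, interleaving categories for variety."""
    by_category: dict[str, list[tuple[float, str]]] = defaultdict(list)
    for article in articles:
        if article.age_hours > window_hours:
            continue  # outside the recent time window
        # Recency is 1.0 for a just-published article, 0.0 at the window edge.
        recency = 1.0 - article.age_hours / window_hours
        score = category_ctr.get(article.category, 0.0) + recency_weight * recency
        by_category[article.category].append((score, article.article_id))

    for scored in by_category.values():
        scored.sort(key=lambda pair: (-pair[0], pair[1]))

    # Visit categories in order of their best article's score.
    order = sorted(by_category, key=lambda c: (-by_category[c][0][0], c))

    picked: list[str] = []
    rank = 0
    while len(picked) < n:
        added = False
        for category in order:
            if rank < len(by_category[category]):
                picked.append(by_category[category][rank][1])
                added = True
                if len(picked) == n:
                    break
        if not added:
            break  # every category is exhausted
        rank += 1
    return picked
```

Round-robin guarantees that no single high-CTR category fills the whole list, which is the property the category mix is meant to provide.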
L04 — Tests cover both paths
test_embeddings.py
covers the encoder behaviour. The deeper test is
test_recommendations.py,
which exercises candidate generation with seeded user histories and
embedding fixtures. The cold-start path has its own test — confirming
that a user with no reads returns popular-by-category articles, not an
empty list.
All tests run in milliseconds against in-memory fixtures, which is the test discipline ADR-0007 promises.
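The shape of such tests might look like the following sketch. It is not the content of test_recommendations.py: `recommend` is a hypothetical stand-in for the routing in recommendations.py, and the fixture names and values are invented for the example. The fixture vectors are unit length, so ranking by dot product matches ranking by cosine similarity.

```python
from typing import Mapping, Sequence

# Hypothetical in-memory fixtures: unit-length article embeddings and a
# precomputed popular-by-category list for the cold-start path.
EMBEDDINGS = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.8, 0.6], "d": [-1.0, 0.0]}
POPULAR = ["b", "d", "a", "c"]


def recommend(
    read_history: Sequence[str],
    article_embeddings: Mapping[str, Sequence[float]],
    popular_fallback: Sequence[str],
    n: int = 3,
) -> list[str]:
    """Stand-in router: embedding path for known users, fallback otherwise."""
    known = [a for a in read_history if a in article_embeddings]
    if not known:
        return list(popular_fallback[:n])  # cold-start path
    dim = len(article_embeddings[known[0]])
    user = [sum(article_embeddings[a][i] for a in known) / len(known) for i in range(dim)]
    unread = [a for a in article_embeddings if a not in read_history]
    # Dot product against unit-length vectors ranks the same as cosine.
    unread.sort(key=lambda a: -sum(x * y for x, y in zip(user, article_embeddings[a])))
    return unread[:n]


def test_known_user_gets_nearest_unread_articles():
    result = recommend(["a"], EMBEDDINGS, POPULAR)
    assert result[0] == "c"      # closest unread article to [1.0, 0.0]
    assert "a" not in result     # already-read articles are excluded


def test_cold_start_returns_popular_not_empty():
    result = recommend([], EMBEDDINGS, POPULAR)
    assert result                # never an empty list
    assert result == POPULAR[:3]
```

Because the fixtures are plain dictionaries and lists, no encoder is loaded and no warehouse is touched, which is what keeps runs fast.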
What this module deliberately does not do
- No training. No fine-tuning. No domain adaptation.
- No two-stage architecture (collaborative recall + content rerank).
- No personalised content models, no sequence models, no learned re-rankers.
Each of those would put the model in the spotlight, which is exactly where the platform-as-leverage thesis says the model should not be.
After this module
The candidate sets produced here flow into the editorial module, where the ranker applies the five editorial constraints. The evaluation module sweeps constraint configurations against these candidates.