A tour of the tutorial

This page is a 10-minute walk through the whole tutorial. Read it second, right after the thesis essay, or read it first if you would rather see what was built before you read why.

The tutorial is a working platform, not a write-up. The lessons across the modules build up that platform from zero. This tour zooms out so you can see the parts in one place.

flowchart LR
    subgraph platform["Data platform (Python)"]
        ingest[("dlt sources<br/>EB-NeRD · Adressa · MIND")]
        parquet["Partitioned Parquet"]
        duckdb["DuckDB + dbt models"]
        dagster["Dagster orchestration"]
        ingest --> parquet --> duckdb --> dagster
    end

    appContract{{"App contract<br/>FastAPI + OpenAPI"}}
    analyticalContract{{"Analytical contract<br/>DuckDB tables + dbt docs"}}
    editor["TypeScript editor<br/>Express + HTMX"]
    swap["Streamlit swap-demo"]
    notebook["Analyst notebook / SQL"]

    duckdb --> appContract
    duckdb --> analyticalContract
    appContract --> editor
    appContract --> swap
    analyticalContract --> notebook

Two contracts, two audiences. Editors talk to the app contract through a small editorial UI. Analysts talk to the analytical contract in SQL or a notebook. Both contracts read the same Parquet-backed DuckDB platform. That is the architectural claim from ADR-0006.

What happens when an editor moves a slider

The most concrete way to see the platform working is to follow a single slider change through every layer:

sequenceDiagram
    autonumber
    participant Editor as Editor (browser)
    participant TSApp as TS editor (Express)
    participant API as FastAPI app contract
    participant Ranker as Ranker (pure fn)
    participant DuckDB as DuckDB + Parquet

    Editor->>TSApp: drag diversity slider → HTMX hx-post
    TSApp->>API: POST /preview (config inline)
    API->>DuckDB: read candidate articles + user embedding
    DuckDB-->>API: candidate set (~100 articles)
    API->>Ranker: rank(candidate_set, config)
    Ranker-->>API: ranked list (≤10)
    API-->>TSApp: ranked recommendations
    TSApp-->>Editor: HTMX swaps the preview pane

No page reload. No JavaScript framework. The slider update produces a fresh recommendation list that reflects the new editorial configuration — visible to the editor before they commit the change. This is the editorial accountability promise made operational.
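
As a rough illustration of step 2 in the sequence above, the sketch below builds (but does not send) a `POST /preview` request with the configuration inline. Only the route and the idea of an inline config come from this tour. The base URL, the `user_id` field, and the `diversity_weight` field are hypothetical placeholders; the real request schema is whatever the OpenAPI contract defines. The reference editor is TypeScript, so Python here is only a stand-in client.

```python
import json
from urllib.request import Request


def build_preview_request(base_url: str, user_id: str, config: dict) -> Request:
    """Build a POST /preview request carrying the editorial config inline.

    The payload shape is an assumption for illustration, not the real schema.
    """
    body = json.dumps({"user_id": user_id, "config": config}).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/preview",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


req = build_preview_request(
    "http://localhost:8000",      # assumed local address of the FastAPI app
    "user-123",                   # hypothetical user id
    {"diversity_weight": 0.7},    # hypothetical config field set by the slider
)
print(req.get_method(), req.full_url)
```

Nothing is persisted by a request like this: the configuration travels with the call, which is what lets the editor see the effect before committing the change.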

The ranker is the heart of the platform. It is a pure function from (candidate_set, constraint_configuration) to a ranked list. The combination model is mixed enforcement (ADR-0010) — soft weights for tunable preferences, hard rules for editorial commitments:

flowchart TB
    candidates["Candidate set<br/>~100 articles"] --> score

    subgraph soft["Soft constraints (weighted score)"]
        diversity["Topical diversity<br/>weight 0..1"]
        recency["Recency / freshness<br/>weight + half-life"]
        sentiment["Sentiment balance<br/>weight + target"]
    end

    soft --> score["score = relevance<br/>+ Σ wᵢ · soft_termᵢ"]
    score --> sortStep["Sort descending"]
    sortStep --> hardStep

    subgraph hard["Hard rules (applied after sort)"]
        promotion["Editorial promotion<br/>insert at fixed positions"]
        sensitive["Sensitive-topic guard<br/>hard cap"]
    end

    hardStep[hard rules] --> final["Ranked list ≤10"]

The formulas live in ADR-0015. The ranker is implemented as a pure function precisely because that boundary makes the editorial transformation testable without a database, a web server, or a UI.
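
A minimal sketch of that shape follows, assuming the structure in the diagram: a weighted soft score, a descending sort, then hard rules. It is not the formulas from ADR-0015. The class names, field names, the order in which the two hard rules run, and the choice to exempt promoted articles from the sensitive-topic cap are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    article_id: str
    relevance: float
    soft_terms: dict[str, float]  # precomputed soft-term values, e.g. {"diversity": 0.4}
    topic: str = "general"


@dataclass(frozen=True)
class Config:
    weights: dict[str, float]                                  # soft weights per term
    promotions: dict[int, str] = field(default_factory=dict)   # position -> article_id
    sensitive_topic: Optional[str] = None
    sensitive_cap: int = 1
    limit: int = 10


def rank(candidates: list[Candidate], config: Config) -> list[str]:
    """Pure function: same inputs give the same ranked ids, with no I/O."""
    known_ids = {c.article_id for c in candidates}
    promoted_ids = set(config.promotions.values())

    def score(c: Candidate) -> float:
        # score = relevance + sum of weight_i * soft_term_i
        return c.relevance + sum(
            w * c.soft_terms.get(name, 0.0) for name, w in config.weights.items()
        )

    ordered = sorted(candidates, key=score, reverse=True)

    # Hard rule (assumed to run first): cap articles on the sensitive topic.
    kept: list[str] = []
    sensitive_seen = 0
    for c in ordered:
        if c.article_id in promoted_ids:
            continue  # promoted articles are placed separately below
        if c.topic == config.sensitive_topic:
            if sensitive_seen >= config.sensitive_cap:
                continue
            sensitive_seen += 1
        kept.append(c.article_id)

    # Hard rule: insert editorial promotions at their fixed positions.
    for position in sorted(config.promotions):
        article_id = config.promotions[position]
        if article_id in known_ids:
            kept.insert(min(position, len(kept)), article_id)

    return kept[: config.limit]


candidates = [
    Candidate("a", 0.9, {"diversity": 0.0}, "sport"),
    Candidate("b", 0.5, {"diversity": 1.0}, "culture"),
    Candidate("c", 0.6, {"diversity": 0.2}, "crime"),
    Candidate("d", 0.55, {"diversity": 0.1}, "crime"),
    Candidate("e", 0.1, {"diversity": 0.0}, "politics"),
]
config = Config(weights={"diversity": 0.5}, promotions={0: "e"}, sensitive_topic="crime")

print(rank(candidates, config))                                      # ['e', 'b', 'a', 'c']
print(rank(candidates, replace(config, weights={"diversity": 0.0})))  # ['e', 'a', 'c', 'b']
```

Because the function touches no database, server, or UI, the two calls at the bottom double as test cases: changing only the diversity weight reorders `a` and `b`, while the promotion and the cap hold in both runs.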

The tutorial is grouped by role, not by tool. A module is one role on the platform; a lesson inside it is one step of that role.

flowchart LR
    foundations[Foundations] --> ingestion[Ingestion]
    ingestion --> transformation[Transformation]
    transformation --> orchestration[Orchestration]
    transformation --> lakehouse[Lakehouse / Storage]
    transformation --> modeling[Modeling]
    modeling --> editorial[Editorial]
    editorial --> serving[Serving]
    serving --> editor[Editor]
    serving --> assistants[Assistants]
    editorial --> evaluation[Evaluation]
    modeling --> evaluation

Reading order is left-to-right, but lessons are runnable independently. Each module page in the Modules section is the README for that role; its Lesson sub-page is where the prose walkthrough lives.
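
One way to turn the diagram into a concrete reading order is a topological sort over its edges. The sketch below copies the edges from the flowchart above and uses Python's standard `graphlib`. Treating the arrows as prerequisites is an interpretation, since the lessons are runnable independently.

```python
from graphlib import TopologicalSorter

# Each module maps to the modules that point at it in the flowchart.
prerequisites = {
    "Foundations": set(),
    "Ingestion": {"Foundations"},
    "Transformation": {"Ingestion"},
    "Orchestration": {"Transformation"},
    "Lakehouse / Storage": {"Transformation"},
    "Modeling": {"Transformation"},
    "Editorial": {"Modeling"},
    "Serving": {"Editorial"},
    "Editor": {"Serving"},
    "Assistants": {"Serving"},
    "Evaluation": {"Editorial", "Modeling"},
}

# static_order() yields every module after all of its prerequisites.
reading_order = list(TopologicalSorter(prerequisites).static_order())

for step, module in enumerate(reading_order, start=1):
    print(f"{step:2d}. {module}")
```

Several orders are valid, because branches such as Orchestration and Modeling do not depend on each other. Any of them respects the left-to-right flow.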

Module roles in plain English:

| Module | What this role is doing |
| --- | --- |
| Foundations | Establish the local DuckDB shape and the news-event vocabulary. Read this if you have never touched DuckDB. |
| Ingestion | Pull EB-NeRD, Adressa, and MIND into the platform with dlt. Three publishers, one canonical staging shape. |
| Transformation | dbt models: staging, marts, tests, and column-level docs. |
| Orchestration | Wrap dlt + dbt as Dagster assets. Daily schedule + new-file sensor. |
| Lakehouse / Storage | Partitioned Parquet layout, DuckDB scans, scale measurements, cloud-migration pathway (no actual deploy). |
| Modeling | Sentence-embedding candidate generation. Modest model with cold-start fallback. |
| Editorial | The five constraints + the mixed-enforcement ranker. Pure-function ranker is the headline TDD module. |
| Serving | FastAPI app contract: /articles/{id}, /recommendations/{user_id}, /preview, /constraint-configurations. Plus the Streamlit swap-demo. |
| Editor | TypeScript Express + HTMX. The reference editor interface. |
| Assistants | Editorial AI assistants behind editor endpoints. Phase 1 is change-explainer, evaluated with Evalite. |
| Evaluation | Constraint-configuration sweeps, NDCG@10 vs diversity Pareto charts across publishers. |

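
For readers unfamiliar with the metrics named in the Evaluation row, here is a small sketch of NDCG@10 and of picking the Pareto-optimal points from a sweep. It uses the common linear-gain, log2-discount form of NDCG, which may differ from the variant the evaluation module actually uses. The sweep numbers are made up for illustration and are not results from the tutorial.

```python
import math


def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top k positions (linear gain)."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG divided by the DCG of the ideal ordering; 0.0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def pareto_front(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep (ndcg, diversity) points that no other point beats on both axes."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Relevance of the ranked items, in ranked order (1 = clicked, 0 = not).
print(round(ndcg_at_k([1, 0, 1]), 4))

# Hypothetical sweep results: (NDCG@10, diversity) per configuration.
sweep = [(0.9, 0.2), (0.8, 0.5), (0.7, 0.4), (0.6, 0.9)]
print(pareto_front(sweep))
```

In the sweep above, `(0.7, 0.4)` drops out because `(0.8, 0.5)` is better on both axes; the remaining points are the trade-offs an editor can actually choose between.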
  • An interviewer at JP/Politikens Hus — read the thesis essay, glance at this tour, then look at the evaluation module for the Pareto chart and the serving module for what the editor surface actually does.
  • A data engineer kicking the tyres — start at foundations, then walk modules left-to-right.
  • An analyst — go straight to the data reference for dbt-generated lineage and column docs, then dip into evaluation for the metric definitions.
  • A developer integrating a different editor client — open the API reference. Any client that speaks the OpenAPI contract is interchangeable; the Streamlit swap-demo proves it.
  • A future maintainer — read how this was built for the lab, Sandcastle, and ADR-driven grilling workflow that produced the artefact.

| Path | What’s in it |
| --- | --- |
| tutorial/foundations/ | First SQL lesson: local DuckDB foundations. |
| tutorial/serving/ | FastAPI + ranker + dbt project + Streamlit swap-demo. |
| tutorial/storage/ | Partitioned-Parquet substrate + scale test + cloud-migration prose. |
| editor/ | TypeScript Express + HTMX editor interface. |
| docs-site/ | This site (Astro Starlight). |
| docs/adr/ | Architecture Decision Records 0001–0016. Authoritative. |
| CONTEXT.md | Domain glossary. The terminology contract for the whole repo. |
| .sandcastle/ | The Sandcastle parallel-planner runner that built most of this. |

If you take one thing from this tour, take this: every layer above is in service of a single architectural claim — the editorial questions live in the platform, not in the model. The recommender model is deliberately simple. The contracts are deliberately separated. The constraints are deliberately mixed-enforcement. The editor interface is deliberately ordinary. Each “deliberately” is what gives the platform leverage.

The thesis essay makes that argument in full prose. This tour makes it in pictures. The rest of the modules show it in code.