Foundations lessons

L01 — News events in DuckDB

The file: tutorial/foundations/lesson-01-news-events.sql.

Run it

make -C tutorial/foundations run     # no DuckDB CLI required (uses uv)
duckdb < tutorial/foundations/lesson-01-news-events.sql   # if you have the CLI

make run reads the SQL file, runs each statement on a throwaway in-memory database, and prints the labelled sections. The SQL file is the lesson; the runner (run_lesson.py) is just the projector for machines without the CLI.
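The read-run-print loop described above can be sketched in a few lines. This is not the contents of run_lesson.py: it uses Python's built-in sqlite3 module as a stand-in for DuckDB, the names split_statements, run_script, and DEMO_SQL are hypothetical, and the demo script is made up rather than taken from the lesson. It also assumes each statement ends at the end of a line.

```python
import sqlite3


def split_statements(sql_text):
    """Yield one complete SQL statement at a time from a script."""
    buffer = ""
    for line in sql_text.splitlines(keepends=True):
        buffer += line
        # complete_statement() is True once the buffer ends a full statement.
        if sqlite3.complete_statement(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()


def run_script(sql_text):
    """Run each statement on a fresh in-memory database; return result sets."""
    conn = sqlite3.connect(":memory:")  # throwaway: nothing is written to disk
    results = []
    try:
        for statement in split_statements(sql_text):
            cursor = conn.execute(statement)
            if cursor.description is not None:  # the statement returned rows
                columns = [col[0] for col in cursor.description]
                results.append((columns, cursor.fetchall()))
    finally:
        conn.close()
    return results


# Hypothetical demo script, not the lesson's seed data.
DEMO_SQL = """
DROP TABLE IF EXISTS demo_events;
CREATE TABLE demo_events (user_id TEXT, event_type TEXT);
INSERT INTO demo_events VALUES ('u1', 'impression'), ('u1', 'click');
SELECT event_type, count(*) AS n FROM demo_events
GROUP BY event_type ORDER BY event_type;
"""

for columns, rows in run_script(DEMO_SQL):
    print(columns)
    for row in rows:
        print(row)
```

Because the database is created inside run_script and closed before it returns, calling it twice on the same script yields the same result sets, which is the property the lesson relies on.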

What runs

The lesson is one SQL file in three phases. Reading top to bottom:

Teardown + seed. DROP ... IF EXISTS, then build news_events from a literal VALUES block: three users, six articles, fourteen events, two event types (impression, click), five topics. The drops are what make the lesson a repeatable rep — run it ten times, get the same result.
Shape — article_metrics. A view computing impressions, clicks, and CTR per article. Note the nullif(...,0) guard against zero-impression divides.
Shape — user_topic_affinity. A view computing impressions, clicks, and topic-level CTR per (user, topic). The simplest cold-start signal in the tutorial.
Score — starter_candidate_scores. A view that assigns a blended score to every (user, article) pair where the user has not yet seen the article:
```
starter_score = 0.7 · user_topic_ctr + 0.3 · article_ctr
```
Already-seen articles are excluded inside the view via LEFT JOIN ... WHERE seen.article_id IS NULL — the candidate set is built before scoring, not filtered afterwards.
Sections 1–4 of the output print those tables for inspection, and section 5 asks the checkpoint question.
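To make the three ideas above concrete (the nullif guard, the anti-join that builds the candidate set, and the 0.7/0.3 blend), here is a small sketch. It is an illustration, not the lesson file: it runs on Python's built-in sqlite3 rather than DuckDB, every demo_ table and view name, every column name, and all rows are hypothetical, and treating a missing CTR as 0 via coalesce is an assumption about how NULLs are handled. The 1.0 * factor is there because SQLite divides integers as integers.

```python
import sqlite3

SETUP_SQL = """
CREATE TABLE demo_articles (article_id TEXT, topic TEXT);
CREATE TABLE demo_events (user_id TEXT, article_id TEXT, event_type TEXT);

-- Hypothetical rows, not the lesson's seed data.
INSERT INTO demo_articles VALUES
  ('a1', 'sports'), ('a2', 'sports'), ('a3', 'politics'), ('a4', 'tech');
INSERT INTO demo_events VALUES
  ('u1', 'a1', 'impression'), ('u1', 'a1', 'click'),
  ('u2', 'a1', 'impression'),
  ('u2', 'a3', 'impression'), ('u2', 'a3', 'click');

-- Per-article CTR; nullif makes the zero-impression case an explicit NULL.
CREATE VIEW demo_article_metrics AS
SELECT a.article_id, a.topic,
       count(CASE WHEN e.event_type = 'impression' THEN 1 END) AS impressions,
       count(CASE WHEN e.event_type = 'click' THEN 1 END) AS clicks,
       1.0 * count(CASE WHEN e.event_type = 'click' THEN 1 END)
           / nullif(count(CASE WHEN e.event_type = 'impression' THEN 1 END), 0)
           AS ctr
FROM demo_articles a
LEFT JOIN demo_events e ON e.article_id = a.article_id
GROUP BY a.article_id, a.topic;

-- Topic-level CTR per (user, topic), with the same guard.
CREATE VIEW demo_user_topic_affinity AS
SELECT e.user_id, a.topic,
       1.0 * count(CASE WHEN e.event_type = 'click' THEN 1 END)
           / nullif(count(CASE WHEN e.event_type = 'impression' THEN 1 END), 0)
           AS topic_ctr
FROM demo_events e
JOIN demo_articles a ON a.article_id = e.article_id
GROUP BY e.user_id, a.topic;

-- Candidate set first (anti-join drops seen pairs), then the blended score.
-- coalesce(..., 0) for a missing CTR is an assumption of this sketch.
CREATE VIEW demo_candidate_scores AS
SELECT u.user_id, m.article_id,
       0.7 * coalesce(t.topic_ctr, 0) + 0.3 * coalesce(m.ctr, 0) AS starter_score
FROM (SELECT DISTINCT user_id FROM demo_events) u
CROSS JOIN demo_article_metrics m
LEFT JOIN (SELECT DISTINCT user_id, article_id FROM demo_events) seen
       ON seen.user_id = u.user_id AND seen.article_id = m.article_id
LEFT JOIN demo_user_topic_affinity t
       ON t.user_id = u.user_id AND t.topic = m.topic
WHERE seen.article_id IS NULL;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SETUP_SQL)
metrics = conn.execute(
    "SELECT article_id, ctr FROM demo_article_metrics ORDER BY article_id"
).fetchall()
scores = conn.execute(
    "SELECT user_id, article_id, starter_score FROM demo_candidate_scores "
    "ORDER BY user_id, starter_score DESC, article_id"
).fetchall()
conn.close()

for row in metrics:
    print(row)  # articles with no impressions show a CTR of None
for row in scores:
    print(row)  # seen (user, article) pairs never appear here
```

Putting the seen-pair exclusion inside the view means every consumer of the scores gets the same candidate set, with no chance of forgetting a downstream filter.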

Predict, then prove

The point of a rep is recall, so before you read section 4’s table, predict:

Which user has the strongest sports signal, and which unseen article would the starter score rank highest for that user?

Then check yourself against the worked answer on the module page. The short version: u003 has the strongest sports signal, but that user's top unseen article is a001 (politics, 0.85), not the sports article a005 (0.70). Because a005 has zero clicks, its article CTR adds nothing to the blend, so its score comes from the topic term alone. Surprising answers are the ones that stick.
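The arithmetic behind those two scores can be checked by hand. In the sketch below, the function mirrors the lesson's blend formula, but the inputs are assumptions: a topic CTR of 1.0 for a005 follows from its 0.70 score and zero article CTR, while the (1.0, 0.5) pair for a001 is just one combination that produces 0.85, not values read from the lesson's tables.

```python
def starter_score(user_topic_ctr, article_ctr):
    """Blend from the lesson: 0.7 * user_topic_ctr + 0.3 * article_ctr."""
    return 0.7 * user_topic_ctr + 0.3 * article_ctr


# a005: zero clicks means article_ctr is 0, so only the topic term counts.
a005 = starter_score(user_topic_ctr=1.0, article_ctr=0.0)

# a001: assumed inputs; one combination that is consistent with 0.85.
a001 = starter_score(user_topic_ctr=1.0, article_ctr=0.5)

print(round(a005, 2), round(a001, 2))
```

The gap between the two is entirely the 0.3 * article_ctr term: a strong topic signal caps out at 0.70 when the article itself has never been clicked.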

When you are done predicting, make it mechanical:

make -C tutorial/foundations check

That runs tests/test_lesson_01.py, which asserts the seed size, the CTR guard, the strongest-sports-signal user, and the full checkpoint answer (including the a001-over-a005 twist). Green means the rep passed.

What the lesson teaches that later modules rely on

The two-event vocabulary (impression and click). Every downstream metric in evaluation reduces to these two atoms.
Per-article CTR with a divide-by-zero guard. That guard pattern reappears every time a metric divides by impression counts.
The candidate-set shape. Already-seen articles are excluded before ranking, not as a separate filter step. The modeling module keeps the same boundary when it swaps the starter score for cosine similarity over sentence embeddings.
Checkpoint questions as the correctness gate. A lesson is not “it printed something.” A lesson is “it printed something, and the output is what I predicted.” make check enforces that discipline, and every later module inherits it.

What the lesson deliberately does not do

No real Python logic — run_lesson.py only executes the SQL you can read.
No real data — the VALUES block is small enough to read row by row.
No persistence — everything is in-memory and rebuilt on each run, so there is nothing to clean up (make clean says exactly that).

After this lesson

Run ingestion when you are ready to replace the fourteen rows with millions. The data shape and column names stay the same; only the source changes.