Skip to content

Foundations lessons

The file: tutorial/foundations/lesson-01-news-events.sql.

Run it directly:

Terminal window
duckdb < tutorial/foundations/lesson-01-news-events.sql

The lesson is one SQL file. Reading top to bottom:

  1. Drop + create the news_events table from a literal VALUES block: three users, six articles, fourteen events, two event types (impression, click), five topics.
  2. Create article_metrics — a view computing impressions, clicks, and CTR per article. Note the nullif(...,0) guard against zero-impression divides.
  3. Create user_topic_affinity — a view computing impressions, clicks, and topic-level CTR per (user, topic). This is the simplest cold-start signal in the tutorial.
  4. Print three labelled sections of the underlying tables for visual inspection: raw events, article metrics, user-topic affinity.
  5. Compute a starter candidate score for every (user, article) pair the user has not already seen:
    starter_score = 0.7 · user_topic_ctr + 0.3 · article_ctr
    Articles already shown to the user are excluded via LEFT JOIN ... WHERE seen.article_id IS NULL.
  6. End with a checkpoint question — answer it from the output: Which user has the strongest sports signal, and which unseen article would the starter score rank highest for that user?

What the lesson teaches that later modules rely on

Section titled “What the lesson teaches that later modules rely on”
  • The two-event vocabulary (impression and click). Every downstream metric in evaluation reduces to these two atoms.
  • Per-article CTR with a divide-by-zero guard. That guard pattern reappears every time a metric divides by impression counts.
  • The candidate-set shape. Already-seen articles are excluded before ranking, not as a separate filter step. The modeling module keeps the same boundary when it swaps the starter score for cosine similarity over sentence embeddings.
  • Checkpoint questions as the lesson’s correctness gate. A lesson is not “run; it printed something.” A lesson is “run; the printed output answers the checkpoint.” That discipline scales to every later module.
  • No Python. No dependencies beyond DuckDB.
  • No real data — the VALUES block is small enough that you can read it.
  • No persistence beyond the views (CREATE VIEW, not CREATE TABLE AS). Reset by re-running.

Run ingestion when you are ready to replace the fourteen rows with millions. The data shape stays exactly the same; the column names stay the same; only the source changes.