Evaluation README
The evaluation module makes recommendation tradeoffs visible through metrics, fixtures, and Pareto-frontier artefacts.
Cross-publisher Pareto chart
The evaluation harness sweeps the soft editorial-control grid as a 5 by 5 by 5 configuration matrix:
- diversity_weight: 0.00, 0.15, 0.30, 0.60, 1.00
- recency_weight: 0.00, 0.20, 0.40, 0.70, 1.00
- sentiment_weight: 0.00, 0.10, 0.20, 0.50, 1.00
That grid includes the click-only baseline (0, 0, 0) and the platform
default (0.30, 0.40, 0.20). The Dagster asset now runs the same sweep over
the EB-NeRD, Adressa, and MIND holdouts, then materialises
eval_sweep_results with one row per (config, dataset, metric), including
the full configuration columns so analysts can query the same contract from
SQL or notebooks.
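As a rough illustration of the sweep shape, the grid can be enumerated in a few lines of Python. This is a minimal sketch, not the Dagster asset itself: the weight values and dataset names come from the description above, while `sweep_configs`, `expected_row_count`, and the metric identifiers in `METRICS` are hypothetical names chosen for the example.

```python
from itertools import product

# Weight values copied from the grid described above.
DIVERSITY_WEIGHTS = (0.00, 0.15, 0.30, 0.60, 1.00)
RECENCY_WEIGHTS = (0.00, 0.20, 0.40, 0.70, 1.00)
SENTIMENT_WEIGHTS = (0.00, 0.10, 0.20, 0.50, 1.00)

DATASETS = ("EB-NeRD", "Adressa", "MIND")

# Hypothetical identifiers for the eight metrics; the real column
# values in eval_sweep_results may be spelled differently.
METRICS = (
    "ndcg_at_10",
    "mrr",
    "hit_rate_at_10",
    "intra_list_diversity",
    "catalog_coverage",
    "median_recency",
    "sentiment_distribution_divergence",
    "sensitive_topic_exposure",
)


def sweep_configs():
    """Return every (diversity, recency, sentiment) combination as a dict."""
    return [
        {"diversity_weight": d, "recency_weight": r, "sentiment_weight": s}
        for d, r, s in product(DIVERSITY_WEIGHTS, RECENCY_WEIGHTS, SENTIMENT_WEIGHTS)
    ]


def expected_row_count():
    """Rows expected if every metric is emitted for every config and dataset."""
    return len(sweep_configs()) * len(DATASETS) * len(METRICS)
```

Under that assumption the sweep has 125 configurations, and a complete materialisation would hold 125 × 3 × 8 = 3000 rows, which is a convenient sanity check when querying the table.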
The metric set is NDCG@10, MRR, hit-rate@10, intra-list diversity, catalog coverage, median recency, sentiment-distribution divergence, and sensitive-topic exposure.
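For readers unfamiliar with the two metrics plotted in the chart, the sketch below shows common textbook definitions. It is an assumption that the harness uses these exact forms: here NDCG@10 treats a click as binary relevance, and intra-list diversity is the share of item pairs in a list whose topic labels differ. The function names are hypothetical.

```python
import math
from itertools import combinations


def ndcg_at_k(ranked_ids, clicked_ids, k=10):
    """Binary-relevance NDCG@k; assumes ranked_ids contains no duplicates."""
    clicked = set(clicked_ids)
    # Discounted gain of clicked items within the top k positions.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_ids[:k])
        if item in clicked
    )
    # Ideal ordering places every clicked item at the top of the list.
    ideal_hits = min(len(clicked), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def intra_list_diversity(topics):
    """Fraction of item pairs in one list whose topic labels differ."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a != b) / len(pairs)
```

A production implementation might use graded relevance or embedding distances instead of topic labels; the point here is only the shape of the two quantities being traded off.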
The chart below plots the smaller, labelled diversity slice (NDCG@10 against intra-list diversity) with one Pareto frontier per publisher. The same ranking code produces different tradeoff curves because the publisher context changes the candidate shape:
- EB-NeRD is the primary Danish-news substrate. In this holdout it behaves like a deep single-topic session: click accuracy is easiest when the list stays narrow, and diversity has a visible cost.
- Adressa’s local-news fixture has shorter, more mixed local sessions. The curve moves toward higher diversity earlier, which is the cold-start risk in miniature: fewer repeated signals make broad local coverage easier to justify but harder to rank confidently.
- MIND exposes a broader category taxonomy and longer impression slates. Its session-length shape starts more diverse, so the diversity frontier is flatter than EB-NeRD’s.
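The per-publisher frontiers described above can be derived from sweep rows with a small non-dominated filter. This is a sketch under stated assumptions rather than the harness code: it assumes both NDCG@10 and intra-list diversity are maximised, and the row keys `dataset`, `intra_list_diversity`, and `ndcg_at_10` are hypothetical names for a pivoted view of the results.

```python
from collections import defaultdict


def pareto_frontier(points):
    """Return the non-dominated (diversity, ndcg) points, sorted by diversity.

    A point is dominated when another point is at least as good on both
    axes and strictly better on at least one.
    """
    # Walk from the most diverse point down; a point stays on the frontier
    # only if its NDCG beats every point that is at least as diverse.
    ordered = sorted(points, key=lambda p: (-p[0], -p[1]))
    frontier = []
    best_ndcg = float("-inf")
    for diversity, ndcg in ordered:
        if ndcg > best_ndcg:
            frontier.append((diversity, ndcg))
            best_ndcg = ndcg
    return sorted(frontier)


def frontier_per_publisher(rows):
    """Group rows by dataset and compute one frontier for each."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["dataset"]].append(
            (row["intra_list_diversity"], row["ndcg_at_10"])
        )
    return {dataset: pareto_frontier(pts) for dataset, pts in grouped.items()}
```

Computing the frontier separately for each dataset keeps the curves comparable in shape without implying that absolute metric values are comparable across publishers.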
Deterministic metrics stay in dbt because they are reproducible checks over materialised platform outputs: diversity, recency, source mix, sensitivity exposure, and similar table-shaped facts. Evalite is reserved for editorial assistant behaviour where model text has to be scored: faithfulness, reason validity, changed-constraint coverage, register, and length. See ADR-0020 for the split.