How this was built

This tutorial was built in a lab, not in a vacuum, and the lab exists for one reason: so that its owner builds durable fluency in the technical stack and the accountable-recommender domain by re-running the tutorial many times. ADR-0033 records that as the single objective. Two things were built along the way: a recommender-system tutorial for accountable news recommendations, and an AI-native way of using agents to build it. This essay is about the second. The recommender is the material to learn against; the workflow is how that material got made. The lab is this clean-room repository. A skill is one of the Matt Pocock or Sandcastle working modes available inside it. An exercise is what happens when a skill is run against a real problem in the lab. The tutorial is the thing those exercises are building: a lesson-based platform over Danish and Scandinavian news-recommendation data. Each lesson adds a self-contained slice. Each module groups lessons by the role the data engineer is playing, not by the tool being used.

That vocabulary is not decoration. It is the control system for the work. If this were called a sandbox, the result could drift toward experiments with no obligation to cohere. If the skills were called commands, the workflow would sound like automation rather than collaboration. If the tutorial were called a demo, it would invite shallow spectacle. The language in CONTEXT.md forced a different standard: the lab had to produce exercises, the exercises had to move the tutorial, and the tutorial had to become a working platform whose lessons, modules, contracts, tests, and essays all used the same domain model.

The workflow started with a pivot. ADR-0003 retired the earlier todo-app probe and made the recommender tutorial the lab’s substrate. That was the first important AI-native move: the team did not ask agents to polish a toy forever. It picked a substrate with enough domain pressure to reveal whether the skills were useful. The EB-NeRD, Adressa, and MIND news-recommendation datasets gave the work a media context. The prior probe had been useful for learning how skills fit together, but it offered neither that pressure nor the data-platform depth the owner needed to practice against. A news recommender did. Any JP/Politikens Hus relevance is a byproduct, not the reason the lab exists (ADR-0033).

The next move was not implementation. It was grilling. The project used grilling sessions to produce ADRs 0003-0016 before the bulk of the code was written. That sequence is the spine of the artifact. ADR-0004 stated the thesis: editorial accountability is a platform-layer concern, not a model-layer concern. ADR-0005 chose a TypeScript Express editor interface as a swappable client. ADR-0006 split the platform into an app contract and an analytical contract. ADR-0007 kept the recommender model deliberately simple. ADR-0008 made the docs site the front-door deliverable. ADR-0009 made constraint-configuration comparison the evaluation story. ADR-0010 chose mixed constraint enforcement. ADR-0011 kept the editor interface server-rendered with HTMX. ADR-0012 made the lessons themselves the platform. ADR-0013 grouped the modules by role. ADR-0014 kept the platform local-first while preserving a documented object-storage path. ADR-0015 fixed the editorial constraint math. ADR-0016 framed DuckDB over Parquet as the self-hosted analytical platform.

Those ADRs were not a committee ritual. They were the way the human and the agents stopped the project from becoming a pile of plausible choices. Each grilling session forced a fork: which claim is load-bearing, which term is too loose, which tradeoff would not survive a sceptical reviewer, which option is more impressive but less honest. The result was a decision trail that an AFK agent could follow later without guessing at the project’s politics. When an issue said “build the sensitive-topic guard”, the agent did not need to rediscover whether sensitivity was a soft weight, a hard rule, a classifier project, or a UI warning. ADR-0010 and ADR-0015 had already made that explicit.

Sandcastle issue slicing turned those decisions into work. The important phrase is Sandcastle issue slicing, not “make a task list.” A good slice here was a lesson-sized unit with a parent PRD, acceptance criteria, and blockers. It had to fit the tutorial-as-platform shape from ADR-0012 and the module language from ADR-0013. That is why the backlog did not say “implement recommendations” as one vague assignment. It split the platform across foundations, ingestion, transformation, orchestration, lakehouse, modeling, editorial, serving, editor, assistants, and evaluation. The issue tracker became an execution surface for the architecture.
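The shape of a slice can be sketched as a small data structure. This is an illustration only, not the lab's actual issue schema: the `IssueSlice` class, its field names, the `ready_slices` helper, and the backlog entries are all hypothetical, chosen to mirror the parent PRD, acceptance criteria, and blockers described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IssueSlice:
    """One lesson-sized unit of work. All field names are hypothetical."""
    key: str
    module: str                            # role-based module, e.g. "editorial"
    parent_prd: str                        # identifier of the parent PRD
    acceptance_criteria: tuple[str, ...]   # what "done" means for this slice
    blocked_by: tuple[str, ...] = ()       # keys of slices that must land first


def ready_slices(slices, done):
    """Return keys of slices an agent could pick up right now.

    A slice is ready when it is not finished, has at least one acceptance
    criterion, and every blocker is already done.
    """
    done = set(done)
    return [
        s.key
        for s in slices
        if s.key not in done
        and s.acceptance_criteria
        and all(b in done for b in s.blocked_by)
    ]


# Illustrative backlog; keys, PRD name, and criteria are invented for this sketch.
backlog = [
    IssueSlice("candidates", "modeling", "prd-example", ("candidate set is materialised",)),
    IssueSlice("ranker-cap", "editorial", "prd-example", ("cap is enforced",),
               blocked_by=("candidates",)),
    IssueSlice("vague", "serving", "prd-example", ()),  # no criteria, so never ready
]

print(ready_slices(backlog, done=[]))              # ['candidates']
print(ready_slices(backlog, done=["candidates"]))  # ['ranker-cap']
```

The point of the sketch is the third entry: a slice without acceptance criteria never becomes ready, which is the difference between a slice and an item on a task list.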

AFK execution then became possible because the repo had already narrowed the meaning of success. An agent working alone could pull one issue, read the parent PRD, inspect the ADRs, write tests, implement the slice, run the feedback loops, and commit a concise record of what changed. The recent history shows that pattern in miniature. The sentiment-balance issue carried ADR-0015 into the ranker, the app contract, the editor controls, generated OpenAPI types, and the docs. The editorial-promotion issue added the always-include hard rule and its editor surface. The sensitive-topic issue added the detector, dbt model, ranker cap, preview contract, and tests. The evaluation issue then swept the soft-weight configuration space and materialised the metrics table. None of those commits is just an isolated feature. Each is a lesson-shaped platform slice moving along the same decision trail.
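The evaluation sweep can be pictured with a short sketch. The tutorial's docs present the comparison as a Pareto chart, but the metric names (`relevance`, `balance`), the `soft_weight` column, and every number below are assumptions made up for illustration. They are not rows from the lab's metrics table.

```python
def pareto_frontier(rows):
    """Return the rows that no other row dominates.

    Assumes both metrics are higher-is-better. Row `a` dominates row `b`
    when it is at least as good on both metrics and strictly better on one.
    """
    def dominates(a, b):
        return (
            a["relevance"] >= b["relevance"]
            and a["balance"] >= b["balance"]
            and (a["relevance"] > b["relevance"] or a["balance"] > b["balance"])
        )

    return [r for r in rows if not any(dominates(other, r) for other in rows)]


# Illustrative numbers only: one row per swept soft-weight configuration.
metrics = [
    {"soft_weight": 0.00, "relevance": 0.90, "balance": 0.20},
    {"soft_weight": 0.50, "relevance": 0.80, "balance": 0.60},
    {"soft_weight": 0.75, "relevance": 0.55, "balance": 0.65},
    {"soft_weight": 1.00, "relevance": 0.60, "balance": 0.70},
]

frontier = pareto_frontier(metrics)
print([row["soft_weight"] for row in frontier])  # [0.0, 0.5, 1.0]
```

In this made-up table the 0.75 configuration drops out because the 1.0 configuration beats it on both metrics. The remaining rows are the ones where an editor has a genuine tradeoff to choose between.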

Red-green-refactor was the local feedback loop inside that larger system. The ranker is a pure function because ADR-0010 and ADR-0015 made it the deep module where editorial judgment becomes executable. That made it easy for agents to write behaviour tests at the public boundary: given a candidate set and a constraint configuration, what ranked list comes back? The same pattern appears in the evaluation metrics, the sensitive-topic detector, editor request handlers, generated client contracts, and docs-site tests. The tests are not there to make the repository look disciplined. They are what makes AFK execution safe enough to use. When an agent changes the OpenAPI shape, TypeScript stops compiling if the editor drifted. When a docs page is missing a link, a Node test fails. When a ranker rule changes, the behavioural test names the lost property.
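A minimal sketch of what such a boundary test might look like follows. It is not the lab's ranker. The `Candidate` and `Constraints` types, the `rank` signature, and the choice that promotion outranks the sensitive-topic cap are assumptions for illustration, and the sketch leaves out soft weights and the ADR-0015 constraint math entirely, since this essay does not reproduce them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    article_id: str
    similarity: float        # content-similarity score from the simple model
    sensitive: bool = False  # flag a detector would set upstream


@dataclass(frozen=True)
class Constraints:
    always_include: frozenset = frozenset()  # hard rule: editorial promotion
    max_sensitive: int = 1                   # hard cap on sensitive articles


def rank(candidates, constraints, k):
    """Pure function: no I/O, clock, or randomness; same inputs, same list."""
    by_score = sorted(candidates, key=lambda c: (-c.similarity, c.article_id))
    promoted = [c for c in by_score if c.article_id in constraints.always_include]
    rest = [c for c in by_score if c.article_id not in constraints.always_include]

    ranked, sensitive_seen = [], 0
    for c in promoted + rest:
        if len(ranked) == k:
            break
        is_promoted = c.article_id in constraints.always_include
        # Assumption in this sketch: promotion outranks the cap, but promoted
        # sensitive articles still count toward it.
        if c.sensitive and not is_promoted and sensitive_seen >= constraints.max_sensitive:
            continue
        if c.sensitive:
            sensitive_seen += 1
        ranked.append(c.article_id)
    return ranked


def test_promotion_first_and_sensitive_cap():
    candidates = [
        Candidate("a", 0.9, sensitive=True),
        Candidate("b", 0.8, sensitive=True),
        Candidate("c", 0.7),
        Candidate("d", 0.1),
    ]
    constraints = Constraints(always_include=frozenset({"d"}), max_sensitive=1)
    # "d" is promoted despite its low score; "b" is dropped by the cap.
    assert rank(candidates, constraints, k=3) == ["d", "a", "c"]


test_promotion_first_and_sensitive_cap()
```

Because `rank` takes everything it needs as arguments, the test needs no database, server, or fixture files, and its name states the property that would be lost if a rule changed.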

The docs site is part of the same workflow, not a wrapper added at the end. ADR-0008 made the site the artifact linked from the application narrative, so written material had to be treated like code. The thesis essay explains why the platform matters. This meta essay explains how the platform was produced. The front page, reference docs, architecture diagram, generated OpenAPI reference, dbt documentation, Pareto chart, and module pages are not marketing pages floating above the repository. They are rendered from the same source tree that holds the tests and lesson code. If the tutorial teaches one thing and the platform does another, that is a defect, not an editorial issue.

The AI-native part is therefore not that AI wrote text or code quickly. Speed was useful, but it was not the differentiator. The differentiator is the operating model: human judgment captured as ADRs, Sandcastle issue slicing turning those ADRs into reviewable units, skills selecting the right working mode for each exercise, and AFK execution producing commits that can be tested against the same contracts. That is not a gimmick. A gimmick would be a screenshot of an agent doing something surprising. This workflow is more boring and more valuable: it makes the boundary between judgment and execution explicit enough that agents can help without erasing accountability.

That same boundary is why nao belongs in this essay rather than in the platform modules. In the lab’s AI-native workflow, Sandcastle gives issue slicing and AFK execution, the Matt Pocock skills give reusable working modes, and nao is the analyst’s and engineer’s schema-aware AI code editor for the SQL, dbt, and Python work around the data substrate. Its value is not that the recommender tutorial exposes a nao button. Its value is that the person building or interrogating the platform can ask questions against table shape, lineage, and local code while keeping edits in the same professional editor loop as migrations, models, notebooks, and tests. ADR-0020 spells out why the asymmetry matters: nao is the worker’s editor; the assistants module is the editor’s assistant. One serves the analyst or engineer producing the platform. The other serves the newsroom editor operating editorial constraints after the platform exists. The platform does not host nao as a feature, because doing so would confuse internal leverage with product surface. Keeping nao at the workflow layer preserves the tutorial’s honest claim: AI can sharpen the people building and operating the system without turning every useful tool into a reader-facing product promise.

The system also stayed honest about expertise. ADR-0007 deliberately avoided turning the tutorial into a recommender-research claim. The model creates a candidate set through content similarity; the platform applies editorial constraints during ranking. That let the project demonstrate media-specific learning without pretending to have production newsroom recommender experience. The same honesty shaped the local-first scale story in ADR-0014 and ADR-0016. The data platform should prove that DuckDB over Parquet can support large analytical work on a laptop, while the object-storage pathway stays documented rather than theatrically deployed. The agents could then build toward a concrete platform story instead of chasing impressive but unfalsifiable claims.

The skills mattered because they kept changing roles. grill-with-docs crystallised terminology and decisions into CONTEXT.md and ADRs. to-issues turned the PRD and backlog into Sandcastle-ready slices. triage put issues into states that told agents whether they were ready. tdd made the inner loop concrete for implementation work. Architecture review skills helped spot where a module needed a smaller public interface and a deeper implementation. No single skill is the story. ADR-0001 calls this emergent mode: the next skill is chosen by the work at hand. That is why the handoffs are visible. The lab was not proving that one prompt can do everything. It was proving that a network of skills can preserve context across different kinds of engineering work.

The lesson and module structure made this more than backlog management. A lesson is small enough for one agent to build and one human to review, but it is not disposable. It must run on its own, teach a concept, and leave the cumulative platform stronger. A module gives related lessons a role-shaped home: modeling for candidate sets, editorial for constraints and ranking, serving for the app contract, editor for the HTMX interface, evaluation for the Pareto frontier. Because the modules are role-based, a future tool swap does not require the whole tutorial to be renamed. That is the kind of detail AI agents are bad at inventing after the fact but good at following when the decision is written down.

The commit history is the evidence trail. It shows agents adding features, resolving conflicts, regenerating contracts, expanding tests, and documenting limitations when a local environment lacked pytest or dbt. Those notes matter. An AI-native workflow that hides blockers is just automation theatre. This lab’s commits usually say what passed, what could not run, which files moved, and which decisions were made. That makes the work reviewable by a human who was not present during execution. It also makes the next agent’s job simpler: the repo records not only the final code, but the constraints under which that code was produced.

The result is a tutorial whose construction method matches its thesis. The technical thesis says that accountability belongs in platform surfaces: contracts, configuration tables, editor controls, tests, evaluations, and docs. The workflow thesis says the same thing about AI-assisted engineering. Accountability does not live in a claim that an agent is smart. It lives in the surfaces around the agent: the PRD, the ADRs, the issue slices, the test suite, the generated contracts, the commit messages, and the docs site. That is why the method belongs inside the artifact. For this tutorial, how it was built is not behind-the-scenes trivia. It is the second proof of the same argument.