Lakehouse / Storage

The lakehouse module is where the platform proves it scales without abandoning local-first ergonomics. It is not a separate database; it is the substrate the rest of the platform sits on — Parquet files in a partition layout DuckDB can scan efficiently, with a runnable scale harness that loads and queries at billion-row volume on a developer laptop.

This module’s role is captured by two ADRs:

  • ADR-0016: DuckDB over Parquet is the analytical platform. Not a development database; the platform.
  • ADR-0014: Local-first scale story. Every lesson runs on a laptop. The cloud-migration pathway is documented but not deployed.

The module is rooted under tutorial/storage/.

The scale harness publishes concrete numbers, not aspirations. The scale-results.md file is checked into the repo so a reader can see, for the exact laptop the test ran on, how many rows per second landed, how much RAM the worst case used, and how long the representative queries took. That makes the claim “DuckDB+Parquet is the analytical platform” falsifiable rather than ceremonial.
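As a rough illustration of the kind of measurement described above, the sketch below times a load step and derives rows per second. It is not the harness itself: `Measurement`, `measure`, `fake_load`, and the Markdown row layout are hypothetical names and formats chosen for this example, and RAM tracking is left out.

```python
import time
from dataclasses import dataclass


@dataclass
class Measurement:
    """One timed step: a label, the rows it handled, and elapsed seconds."""
    label: str
    rows: int
    seconds: float

    @property
    def rows_per_second(self) -> float:
        # Guard against a zero-length interval on very fast steps.
        return self.rows / self.seconds if self.seconds > 0 else float("inf")


def measure(label, fn):
    """Run fn(), which must return a row count, and time it."""
    start = time.perf_counter()
    rows = fn()
    return Measurement(label, rows, time.perf_counter() - start)


def fake_load():
    # Stand-in for a real load step (hypothetical); returns rows "loaded".
    return sum(1 for _ in range(100_000))


m = measure("load", fake_load)
# One table row in a hypothetical results-file layout.
line = f"| {m.label} | {m.rows:,} | {m.seconds:.3f} | {m.rows_per_second:,.0f} |"
print(line)
```

Recording the row count together with the elapsed time, rather than only the derived rate, keeps a checked-in result easy to recompute and compare across machines.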

Partitioning is the only “architecture” you really need

The most important file in the module is the partitioning layout itself. The harness writes Parquet files partitioned by publisher and event date, which is what lets DuckDB skip irrelevant files at scan time without an external query planner or metadata service. If you read one thing from this module, read how the partitioning is chosen and how a single SQL query benefits from it.
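To make the file-skipping idea concrete, here is a minimal sketch of Hive-style partition paths and path-level pruning in plain Python. It only mimics the logic; DuckDB does this internally when it reads a Hive-partitioned directory. The root `data/events`, the key names `publisher` and `event_date`, and the helper functions are assumptions made for illustration, not the harness's actual layout.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(root, publisher, event_date, part=0):
    """Build root/publisher=<p>/event_date=<d>/part-<n>.parquet."""
    return (
        PurePosixPath(root)
        / f"publisher={publisher}"
        / f"event_date={event_date.isoformat()}"
        / f"part-{part}.parquet"
    )


def partition_values(path):
    """Parse the key=value directory segments of a path into a dict."""
    return dict(
        seg.split("=", 1) for seg in PurePosixPath(path).parts if "=" in seg
    )


def prune(paths, publisher, start, end):
    """Keep only files whose partition values can satisfy the predicate."""
    kept = []
    for p in paths:
        v = partition_values(p)
        day = date.fromisoformat(v["event_date"])
        if v["publisher"] == publisher and start <= day <= end:
            kept.append(p)
    return kept


# Six files: two hypothetical publishers times three days.
files = [
    partition_path("data/events", pub, date(2024, 1, d))
    for pub in ("acme", "globex")
    for d in range(1, 4)
]

# A filter on publisher and date range needs only two of the six files.
selected = prune(files, "acme", date(2024, 1, 2), date(2024, 1, 3))
for p in selected:
    print(p)
```

The decision is made from directory names alone, without opening any file, which is why no separate metadata service is needed for this kind of skipping.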

The storage substrate is what every other module relies on, but only this module owns it explicitly. Transformation materialises dbt models against the same Parquet directory the harness wrote. Modeling precomputes embeddings into the same substrate. Serving reads from it directly through DuckDB connections.

The cloud-migration lesson does not deploy. It demonstrates that swapping the local Parquet directory for an S3 URL is a dlt config diff plus a DuckDB connection string — not a rewrite.
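A minimal sketch of why the swap is a small diff: if queries are built from a single storage root, moving from a local directory to S3 changes one string. The helpers `parquet_glob` and `scan_sql`, the two-level partition depth, and the bucket name `example-bucket` are hypothetical. The lesson's actual dlt configuration and connection setup are not shown here, and reading `s3://` paths in DuckDB also requires its httpfs extension and credentials.

```python
def parquet_glob(storage_root: str) -> str:
    """Glob for Parquet files two partition levels below the root."""
    root = storage_root.rstrip("/")
    return f"{root}/*/*/*.parquet"


def scan_sql(storage_root: str) -> str:
    """Build a DuckDB query over the partitioned dataset at storage_root."""
    return (
        "SELECT count(*) FROM "
        f"read_parquet('{parquet_glob(storage_root)}', hive_partitioning = true)"
    )


# Local development: a directory on disk (hypothetical path).
local_sql = scan_sql("data/events")

# Cloud pathway: same query shape, only the root differs (hypothetical bucket).
cloud_sql = scan_sql("s3://example-bucket/events/")

print(local_sql)
print(cloud_sql)
```

Keeping the root in configuration rather than in each query is what keeps the migration to a config change; the SQL text itself is otherwise identical.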