Skip to content

Pattern: Lakehouse

Quick facts

  • Category: Data & Analytics
  • Maturity: Trial
  • Typical team size: 3-6 engineers
  • Typical timeline to MVP: 8-16 weeks
  • Last reviewed: 2026-05-03 by Architecture Team

1. Context

Use this pattern when:

  • Workloads mix SQL analytics (BI dashboards) and ML training (bulk reads, feature engineering) on the same underlying data, and copying data between a warehouse and a lake is too expensive or too slow
  • Raw data volumes are in the hundreds of GB to PB range, making managed warehouse storage costs prohibitive
  • Multiple compute engines need to read the same data: Spark for ML, Trino for BI, DuckDB for analyst notebooks
  • Schema evolution, time-travel (auditing past states), and data versioning are first-class requirements

Do NOT use this pattern when:

  • Your data is under 100 GB and the team is small — the Modern Data Stack with a managed warehouse is simpler, cheaper, and operationally lighter
  • All workloads are BI/SQL with no ML training — a managed warehouse delivers better query performance and operator experience for pure analytics
  • The team has no distributed compute experience — the operational cost of learning Spark and Flink is high; start with a managed warehouse and migrate when you hit genuine limits

2. Problem it solves

A data warehouse optimises for SQL analytics but charges for compute every time you scan data for ML training, and stores data in a proprietary format that locks you in. A data lake stores data cheaply in open formats but can't serve low-latency SQL queries reliably. The Lakehouse stores data in open, ACID-compliant table formats (Delta Lake, Iceberg) on commodity object storage, letting different engines — Spark for training, Trino for dashboards — share the same physical files without copying, at warehouse-quality reliability.

3. Solution overview

System context (C4 Level 1)

flowchart LR
    Sources[Source Systems\nDBs / APIs / Streams] --> Ingest[Ingestion\nSpark / Airbyte / Debezium]
    Ingest --> Lake[(Object Store\nS3 — Delta Lake tables)]
    Lake --> SparkEngine[Spark\nML + batch transforms]
    Lake --> TrinoEngine[Trino\ninteractive SQL]
    SparkEngine --> BI[BI Tool\nSuperset]
    TrinoEngine --> BI
    SparkEngine --> ML[ML Training\nMLflow]
    Catalog[Metadata Catalog\nAWS Glue] -.->|table definitions| Lake

Container view (C4 Level 2)

flowchart TB
    subgraph Ingestion
        CDC[Debezium CDC\nDB change stream]
        BatchIngest[Batch Ingest\nAirbyte / custom Spark]
    end
    subgraph Object Storage - Delta Lake zones
        Bronze[(Bronze zone\nraw landed data)]
        Silver[(Silver zone\ncleaned + typed)]
        Gold[(Gold zone\nbusiness aggregates)]
    end
    subgraph Catalog
        GlueCatalog[AWS Glue Data Catalog\ntable definitions + schemas]
    end
    subgraph Compute
        SparkEMR[Apache Spark\nEMR / Databricks]
        Trino[Trino\ninteractive queries]
        DuckDB[DuckDB\nnotebook exploration]
    end
    subgraph Serving
        Superset[Apache Superset\nBI dashboards]
        MLPipeline[ML Training Pipeline\nMLflow + SageMaker]
        Notebooks[Jupyter / Databricks\nnotebooks]
    end
    subgraph Orchestration
        Airflow[Apache Airflow\nDAG scheduler]
    end

    CDC --> Bronze
    BatchIngest --> Bronze
    Bronze --> SparkEMR
    SparkEMR --> Silver
    Silver --> SparkEMR
    SparkEMR --> Gold
    GlueCatalog -.-> Bronze
    GlueCatalog -.-> Silver
    GlueCatalog -.-> Gold
    Gold --> Trino --> Superset
    Gold --> DuckDB --> Notebooks
    Silver --> MLPipeline
    Gold --> MLPipeline
    Airflow --> SparkEMR

4. Technology stack

Layer Primary choice Alternatives Notes
Object storage AWS S3 Azure Data Lake Storage Gen2, Google Cloud Storage S3 is most mature with Delta Lake and Iceberg; all three work equivalently
Table format Delta Lake Apache Iceberg See ADR-0007; Delta Lake for Spark-primary workloads; Iceberg for multi-engine or cloud-agnostic deployments
Batch compute Apache Spark on EMR Databricks (managed Spark), Dask, Ray Databricks wraps Spark with managed infra and built-in MLflow; EMR for teams preferring open-source and AWS-native IAM
Interactive SQL Trino Apache Spark SQL, DuckDB, Presto Trino for sub-minute latency SQL over the lakehouse; DuckDB for local notebook queries via duckdb.read_parquet()
Orchestration Apache Airflow Prefect, Dagster Airflow's SparkSubmitOperator is the most mature; Dagster has first-class Delta Lake integration via software-defined assets
Metadata catalog AWS Glue Data Catalog Apache Hive Metastore, Project Nessie Glue for AWS deployments; Nessie adds Git-like catalog branching useful for data CI/CD and isolated dev environments
BI Apache Superset Metabase, Looker Superset connects directly to Trino; self-hostable and has good Delta Lake support via the Trino connector
ML tracking MLflow Weights & Biases MLflow integrates natively with Databricks and reads Delta tables directly; see the ML Training Pipeline pattern

5. Non-functional characteristics

Concern Profile
Scalability S3 scales to any volume at constant cost-per-GB. Spark and Trino each scale horizontally by adding nodes. Design partition keys (typically year/month/day) upfront — repartitioning an existing table is a full rewrite.
Availability target S3 durability: 99.999999999%. Compute clusters (Spark, Trino) are ephemeral; design all Spark jobs to be re-runnable from the last checkpoint. A cluster failure means delayed results, not lost data.
Latency target Batch transforms: hours (SLA-driven). Trino interactive queries on Gold zone data: p95 < 30 s for pre-partitioned queries. Not suitable for sub-second queries — add ClickHouse as a serving layer for those.
Security posture IAM roles for all compute-to-S3 access; never long-lived credentials. Column-level security in the Glue catalog via Lake Formation. Encrypt S3 with SSE-KMS. Ranger or Lake Formation for fine-grained row/column access policies.
Data residency All data in one cloud account and region. Cross-region replication adds S3 transfer costs; document requirements explicitly.
Compliance fit GDPR ✓ — Delta Lake DELETE + VACUUM implement right-to-erasure; bound VACUUM retention to no more than 30 days to avoid indefinite PII persistence. HIPAA ✓ with encrypted S3 + AWS BAA. SOC 2 ✓ with Glue catalog access logs and Airflow run history.

6. Cost ballpark

Indicative monthly USD cost. Spark/Databricks compute is the largest variable; S3 storage costs are low.

Scale Data volume Monthly cost Cost drivers
Small < 500 GB $300 - $1,000 S3 storage, small EMR cluster (3 × m5.xlarge), Glue crawlers
Medium 500 GB - 20 TB $2,000 - $10,000 EMR or Databricks compute (dominant), S3 request costs, Trino cluster
Large 20 TB+ $10,000 - $50,000 Spark fleet, Databricks licences if managed, cross-AZ data transfer, dedicated ops time

7. LLM-assisted development fit

Aspect Rating Notes
PySpark transform scaffolding ★★★★ Good — PySpark patterns are well-represented; verify partition logic, broadcast hints, and join strategies by hand.
Delta Lake DDL and MERGE statements ★★★★ Generates correct Delta SQL for upserts and schema evolution; validate MERGE conditions for idempotency.
Trino SQL query writing ★★★ Generates syntactically correct Trino SQL; query plan optimisation (partition pruning, predicate pushdown) requires hands-on profiling.
Airflow DAG for Spark jobs ★★★★ Standard SparkSubmitOperator and EmrAddStepsOperator patterns generate cleanly.
Architecture decisions Don't outsource. Use ADRs.

Recommended workflow: Start with a single Bronze → Silver Spark job before building out the full medallion architecture. Validate partition strategy on a representative data sample. Add Trino only after the Gold zone is stable and analysts need interactive queries.

8. Reference implementations

  • Public reference: delta-io/delta — Delta Lake source; examples/ contains Python and Scala notebooks covering ACID transactions, schema evolution, time-travel, and OPTIMIZE (200 OK ✓)
  • Public reference: apache/iceberg — Iceberg table format; reference for the alternative format's Java and Python API, catalog integration, and hidden partitioning (200 OK ✓)
  • Public reference: trinodb/trino — Trino distributed SQL engine; plugin/trino-delta-lake/ shows the Delta Lake connector implementation (200 OK ✓)
  • Internal case study: Add your anonymised internal example here

10. Known risks & gotchas

  • Small files degrade query performance — Streaming writes or fine-grained appends produce thousands of tiny Parquet files; Trino opens a file handle per file, making queries slow. Mitigation: run OPTIMIZE (Delta Lake) or rewrite_data_files (Iceberg) on a daily schedule to compact small files; target 128–256 MB per file.
  • VACUUM deletes time-travel history needed for audits — Running VACUUM with a 7-day retention removes the old file versions that prove a record was deleted (GDPR compliance). Mitigation: set VACUUM retention to at least 30 days; document the tension between storage cost and audit window in your data retention policy.
  • Partition skew stalls Spark executors — A partition key with uneven distribution (one large tenant, one busy day) causes a single executor to process 80% of the data. Mitigation: choose partition keys with roughly equal cardinality; use salting or repartitioning for heavily skewed dimensions.
  • Metadata catalog becomes a silent single point of failure — If Glue is unreachable, Spark and Trino cannot resolve table locations, even though S3 data is intact. Mitigation: enable Glue HA across AZs; consider the Iceberg REST catalog or Project Nessie as a more resilient alternative.
  • Schema evolution breaks downstream Spark jobs silently — Renaming a column or changing a type in a Delta table breaks every Spark job that references it by name, often without a clear error. Mitigation: enable Delta Lake schema enforcement (delta.columnMapping.mode = name) for rename safety; enforce a column deprecation policy (add new column → dual-write → migrate all consumers → remove old column across one release cycle).