Pattern: Modern Data Stack (Warehouse + dbt + BI)¶

Quick facts

Category: Data & Analytics
Maturity: Adopt
Typical team size: 1-3 engineers
Typical timeline to MVP: 4-8 weeks
Last reviewed: 2026-05-03 by Architecture Team

1. Context¶

Use this pattern when:

Multiple source systems (CRM, product database, billing, marketing APIs) need to be unified for analysis
Business analysts need to query data without touching production databases
The primary deliverable is dashboards, reports, or self-serve analytics — not ML training
The team wants a governed, version-controlled, tested transformation layer rather than a tangle of hand-rolled SQL scripts

Do NOT use this pattern when:

Data freshness under 15 minutes is required — use the Real-time Streaming Analytics pattern instead
Your entire dataset fits in a single Postgres instance and the team is small — direct Postgres + Metabase is simpler and costs nothing extra
ML training dominates the workload and data volumes exceed multi-TB — the Lakehouse pattern gives better price-performance for bulk feature computation
You need to query raw semi-structured logs at scale — a purpose-built log analytics store (ClickHouse, Loki) is more appropriate

2. Problem it solves¶

Business teams ask questions like "What is our MRR by region this quarter?" but the data lives across five different systems. Each cross-system query consumes engineering time and produces inconsistent answers because different analysts apply different business logic. The Modern Data Stack solves this by centralising all source data into a warehouse, defining business logic once in version-controlled SQL, and serving consistent metrics to any BI tool — so every analyst sees the same number.

3. Solution overview¶

System context (C4 Level 1)¶

flowchart LR
    AppDB[(Application DB)]
    CRM[CRM\nSalesforce]
    Billing[Billing\nStripe]
    MktgAPI[Marketing APIs]

    AppDB --> Ingestion[Ingestion\nFivetran / Airbyte]
    CRM --> Ingestion
    Billing --> Ingestion
    MktgAPI --> Ingestion

    Ingestion --> DW[(Data Warehouse\nSnowflake)]
    DW --> dbt[dbt Core\ntransformations]
    dbt --> DW
    DW --> BI[BI Tool\nMetabase]
    Orch[Orchestrator\ndbt Cloud / Prefect] -->|schedules| dbt

Container view (C4 Level 2)¶

flowchart TB
    subgraph Sources
        AppDB[(Application DB)]
        SaaS[SaaS APIs\nStripe / Salesforce]
        Files[Flat Files\nS3 / SFTP]
    end
    subgraph Ingestion
        Connector[Fivetran / Airbyte\nmanaged connectors]
        RawSchema[(Raw schema\nin warehouse)]
    end
    subgraph Warehouse
        DW[(Snowflake / BigQuery\n/ Redshift)]
        Staging[Staging models\nlight type-casting]
        Marts[Mart models\nbusiness-logic views]
        dbtTests[dbt tests\nunique, not_null, refs]
        Catalog[dbt docs\ndata catalog]
    end
    subgraph Orchestration
        Scheduler[dbt Cloud / Prefect\nschedule + CI]
    end
    subgraph Serving
        BI[Metabase\ndashboards]
        Notebook[Jupyter / Hex\nanalyst notebooks]
    end

    AppDB --> Connector
    SaaS --> Connector
    Files --> Connector
    Connector --> RawSchema --> DW
    DW --> Staging --> Marts
    Marts --> dbtTests
    Marts --> Catalog
    Scheduler --> Staging
    Scheduler --> dbtTests
    Marts --> BI
    Marts --> Notebook

4. Technology stack¶

Layer	Primary choice	Alternatives	Notes
Ingestion	Fivetran	Airbyte (open source), Stitch, Meltano	Fivetran for simplicity and reliability; Airbyte if custom connectors are needed or vendor lock-in is a concern
Warehouse	Snowflake	BigQuery, Redshift, DuckDB	Snowflake separates compute from storage cleanly and scales with no cluster management; DuckDB for teams under 50 GB with no budget for a managed warehouse
Transformation	dbt Core	SQLMesh, Coalesce	dbt is the de-facto standard with the largest ecosystem; SQLMesh is a compatible alternative with better incremental model semantics
Orchestration	dbt Cloud scheduler	Prefect, Dagster, Apache Airflow	dbt Cloud is simplest when dbt is the only workload; Prefect for complex dependency graphs that span dbt and Python jobs
BI	Metabase	Apache Superset, Looker, Tableau	Metabase is open-source, self-hostable, and accessible to non-technical users; Looker if a governed semantic layer is critical
Data quality	dbt built-in tests + Elementary	Great Expectations, Soda	dbt's `unique`, `not_null`, and `accepted_values` tests cover 80% of cases; Elementary adds Slack alerting on top
Data catalog	dbt docs (auto-generated)	DataHub, Atlan, Collibra	dbt docs auto-generate from model YAML; add a dedicated catalog only at larger team sizes (> 5 data engineers)
CI/CD	GitHub Actions + `dbt build --select state:modified+`	dbt Cloud CI	Run only changed models and their dependents on every PR; full rebuild nightly

5. Non-functional characteristics¶

Concern	Profile
Scalability	Snowflake and BigQuery auto-scale warehouse compute on demand. dbt parallelises model runs automatically using the DAG. Comfortable to hundreds of GB without tuning; TB-scale needs explicit partitioning strategy and clustering keys.
Availability target	99.9% (warehouse SLA). Transformation failures affect data freshness, not user-facing systems. Alert on any failed dbt run within 30 minutes; analytics is not real-time, so a delayed alert is acceptable.
Latency target	Data freshness SLA, not per-query response time. Target: data in BI tool within 4 hours of source write for hourly schedules. BI query p95 < 5 s on pre-aggregated mart tables.
Security posture	Role-based access in the warehouse: raw schema read-only for the ETL role; mart schema read-only for analysts. Column-level masking for PII applied in dbt staging models. Fivetran/Airbyte source credentials in AWS Secrets Manager.
Data residency	Data lands in the warehouse region you choose at provisioning time. Document source-to-warehouse region policy; cross-region ingestion has cost and may have compliance implications.
Compliance fit	GDPR ✓ with PII masking in staging models and right-to-erasure automated via a dbt deletion model. SOC 2 ✓ with dbt run history and warehouse access logs as audit trail. HIPAA: avoid loading PHI into the warehouse unless BAA is in place with the warehouse vendor.

6. Cost ballpark¶

Indicative monthly USD cost. Warehouse compute (Snowflake credits, BigQuery slots) is the dominant variable.

Scale	Data volume	Monthly cost	Cost drivers
Small	< 100 GB	$100 - $500	Snowflake XS warehouse on-demand, Fivetran Starter, self-hosted Metabase
Medium	100 GB - 5 TB	$1,000 - $5,000	Snowflake compute credits, Fivetran seat volume, dbt Cloud Team plan
Large	5 TB+	$5,000 - $30,000	Warehouse compute (dominant), Fivetran volume pricing, Looker or Tableau licences

7. LLM-assisted development fit¶

Aspect	Rating	Notes
dbt model and test scaffolding	★★★★★	Excellent — SQL SELECT transforms and dbt YAML schema tests generate cleanly from a source schema description.
Ingestion connector configuration	★★★★	Good for standard Fivetran/Airbyte connectors; custom connector logic requires review.
Business logic SQL (metrics, joins)	★★★	Generates syntactically correct SQL; grain and join correctness require domain knowledge the LLM does not have — always validate with a domain expert.
Data quality test design	★★★	Knows common patterns (uniqueness, referential integrity); threshold values and `accepted_values` lists require business knowledge.
Architecture decisions	★	Don't outsource. Use ADRs.

Recommended workflow: Define your metrics dictionary (what "MRR" means, how churn is calculated) before writing any dbt models. Generate model scaffolds from source table schemas, then have a domain expert validate the business logic SQL. Run dbt test in CI on every PR.

8. Reference implementations¶

Public reference: dbt-labs/dbt-core — the dbt framework itself; examples/ and the official jaffle_shop demo show full project structure with models, tests, and documentation (200 OK ✓)
Public reference: airbytehq/airbyte — open-source EL tool with 300+ connectors; reference for self-hosted ingestion alternative to Fivetran (200 OK ✓)
Public reference: metabase/metabase — open-source BI tool; shows the embedding and multi-database connection architecture (200 OK ✓)
Internal case study: Add your anonymised internal example here

ADR-0003: Prefect as the default ETL/ELT orchestrator
Candidate ADR: Snowflake vs BigQuery vs Redshift warehouse choice — record when your organisation makes a committed vendor decision.

10. Known risks & gotchas¶

"Which number is right?" metric inconsistency — two analysts define MRR slightly differently in separate dbt models; the CEO sees two different figures in two dashboards. Mitigation: define all business metrics in a single dbt metrics layer; establish a "one model per concept" convention enforced in code review.
Compute cost explosion from ad-hoc full-table scans — an analyst runs SELECT * FROM large_table with no filter in Snowflake; the bill for that one query is $50. Mitigation: set per-user resource monitors in Snowflake (cost alerts + auto-suspend); add default date-range filters to all BI dashboards; build pre-aggregated mart tables for common query patterns.
Slow dbt full-builds block analyst iteration — a full dbt build taking 45+ minutes means any hotfix takes an hour to reach analysts. Mitigation: use --select state:modified+ in CI (builds only changed models and their dependents); separate large historical backfill models into their own scheduled job.
PII persists in the raw schema beyond GDPR retention windows — Fivetran or Airbyte writes raw source rows including PII into the warehouse and they sit there indefinitely. Mitigation: document retention policy per raw table; implement masking macros in dbt staging models; automate deletion via a dbt model or warehouse ALTER TABLE ... DROP COLUMN after the retention period.
Schema drift from source silently breaks downstream models — a CRM field is renamed, Fivetran syncs the new schema, and 12 downstream dbt models start returning NULL. Mitigation: add dbt source freshness and schema contract tests; alert on information_schema diffs in the raw layer before each dbt run.