Pattern: Database Engine / Storage Layer¶
Quick facts
- Category: Systems & Infrastructure
- Maturity: Assess
- Typical team size: 3-8 engineers (requires deep systems and database internals expertise)
- Typical timeline to MVP: 6-18 months
- Last reviewed: 2026-05-03 by Architecture Team
1. Context¶
Use this pattern when:
- An existing database engine does not meet the specific performance, query model, or data model requirements of a high-volume specialised workload (time-series, vector search, columnar analytics, graph)
- You are building infrastructure that other software uses as a storage primitive — a caching layer, an append-only log, an embedded key-value store
- The deployment environment prohibits external database dependencies (embedded device, offline, air-gapped) and a custom storage layer is required
Do NOT use this pattern when:
- An existing database (Postgres, RocksDB, DuckDB, ClickHouse, SQLite) meets your requirements — use it; building a database engine is one of the hardest software engineering problems and should be a last resort
- The team lacks deep experience with B-trees, LSM-trees, MVCC, WAL, and storage engine internals — the investment to build this expertise is substantial
- This is an application-level "storage problem" — most application storage challenges are solved by schema design, indexing, and query optimisation, not by building a new engine
2. Problem it solves¶
General-purpose databases make trade-offs that favour the average workload. A time-series database with billions of write events per day needs a storage engine optimised for append-heavy sequential writes and range queries over time — not the random read/write balance Postgres optimises for. When the specific access pattern is known and extreme, a purpose-built storage layer can be 10–100× more efficient than a general-purpose alternative.
3. Solution overview¶
System context (C4 Level 1)¶
flowchart LR
Application[Application / Client] -->|query / write API| Engine[Storage Engine]
Engine --> WAL[(Write-Ahead Log\ndurability)]
Engine --> DataFiles[(Data Files\nSSTable / B-tree pages)]
Engine --> IndexFiles[(Index Files)]
Admin((Admin)) --> CompactionMgr[Compaction / Vacuum\nmaintenance]
Container view (C4 Level 2)¶
flowchart TB
subgraph Engine
QueryLayer[Query / API Layer\nSQL parser, query planner, or custom API]
TxMgr[Transaction Manager\nMVCC / optimistic / serialisable]
WAL[Write-Ahead Log\ndurability, crash recovery]
BufferPool[Buffer Pool Manager\npage cache in memory]
StorageLayout[Storage Layout\nB-tree pages or LSM-tree SSTables]
Compaction[Compaction / Merge\nLSM background process]
IndexMgr[Index Manager\nB-tree, hash, bloom filter]
end
subgraph Disk
WALFiles[(WAL files\nsequential write)]
SSTables[(SSTables or pages\ndata files)]
end
QueryLayer --> TxMgr
TxMgr --> WAL
TxMgr --> BufferPool
BufferPool --> StorageLayout
StorageLayout --> Compaction
StorageLayout --> IndexMgr
WAL --> WALFiles
StorageLayout --> SSTables
4. Technology stack¶
| Layer | Primary choice | Alternatives | Notes |
|---|---|---|---|
| Implementation language | Rust | C++, Go, Java (RocksDB-JNI) | Rust for memory safety without GC — critical for a storage engine where use-after-free and data corruption are catastrophic; C++ for teams with existing storage engine codebases |
| Storage structure | LSM-tree (for write-heavy) | B-tree (for read-balanced), Copy-on-Write B-tree | LSM for append-heavy time-series or log workloads; B-tree for random read/write balance; measure your read:write ratio before choosing |
| Existing embeddable engine (if not building from scratch) | RocksDB | LevelDB, WiredTiger, LMDB | RocksDB is the most feature-complete embeddable LSM engine; consider embedding it rather than building a new storage layer — this is almost always the right choice |
| WAL format | Custom append-only log | SQLite WAL, Kafka-style segment files | Always separate the WAL from data files; WAL writes must be sequential and durable (fsync or O_DSYNC) |
| Memory management | Custom arena allocator | jemalloc, mimalloc | Arena allocators eliminate fragmentation in long-running processes; jemalloc as a drop-in replacement for the system allocator improves RocksDB throughput by 10–15% |
| Concurrency | MVCC (multi-version concurrency control) | Optimistic locking, OCC | MVCC for reader-writer concurrency without blocking; requires careful version GC to avoid unbounded version chain growth |
| Testing | Property-based testing (proptest Rust) + KLEE symbolic execution |
Jepsen (distributed), SQLite's own test harness | Storage engines require adversarial testing: crash injection, fsync failure simulation, bit-flip testing |
5. Non-functional characteristics¶
| Concern | Profile |
|---|---|
| Scalability | Vertical scaling (more RAM for buffer pool, faster NVMe) is the primary lever. Horizontal scaling via sharding requires partitioning logic above the storage layer. Design the storage layer to be single-node correct first. |
| Availability target | 99.999% — storage engines are the foundation; their availability is the ceiling for everything above them. WAL + checkpointing must enable crash recovery with zero data loss; measure MTTR from crash to operational with automated tests. |
| Latency target | Write path (WAL append): < 100 μs per transaction (fsync dominates; consider group commit to amortise). Read path (buffer pool hit): < 10 μs. Read path (disk read): 100 μs–1 ms depending on NVMe vs HDD. |
| Security posture | Validate all deserialized data from disk — a corrupted page must not cause undefined behaviour or a crash. Implement checksums on every page; verify on read. Encryption at rest: implement page-level encryption in the buffer pool, not at the filesystem layer. |
| Data residency | The storage layer controls where bytes are written; data residency is a deployment-time concern. Implement transparent encryption to support regulated deployments without changing the engine. |
| Compliance fit | Storage engines used in regulated contexts (medical devices, financial systems) must demonstrate crash-safe durability (provable WAL recovery) and integrity verification (per-page checksums). Formal verification of critical invariants (using TLA+, Coq, or Lean) is expected at this level. |
6. Cost ballpark¶
Building a database engine is primarily an engineering investment, not infrastructure cost.
| Scale | Storage volume | Monthly infra cost | Cost drivers |
|---|---|---|---|
| Small / embedded | < 10 GB | $0 - $100 | Embedded in application, no separate infrastructure |
| Medium / server | 10 GB - 1 TB | $200 - $2,000 | NVMe-attached servers or cloud block storage |
| Large / distributed | 1 TB+ | $2,000 - $20,000 | High-IOPS NVMe fleet, replication bandwidth |
7. LLM-assisted development fit¶
| Aspect | Rating | Notes |
|---|---|---|
| RocksDB integration and configuration | ★★★★ | Good — RocksDB API usage is well-documented; tuning parameters require benchmarking for your workload. |
| WAL append and recovery scaffolding | ★★★ | Understands the pattern; durability correctness (fsync, group commit, crash recovery) requires expert review and extensive crash testing. |
| B-tree page format and split/merge logic | ★★ | Knows the algorithms; implementation correctness for edge cases (concurrent splits, page overflow) requires deep expertise. |
| Property-based test scaffolding (proptest) | ★★★★ | Good at generating property strategies; defining the invariants to test requires deep domain knowledge. |
| Architecture decisions | ★ | Never outsource storage engine design decisions. |
Recommended workflow: Use RocksDB as an embedded storage engine rather than building from scratch unless you have a specific, measured reason that RocksDB cannot meet. If building from scratch, implement and thoroughly test crash recovery before any other feature.
8. Reference implementations¶
- Public reference: facebook/rocksdb — RocksDB: the most widely deployed embeddable LSM storage engine; study the architecture docs in
docs/before building anything (200 OK ✓) - Internal case study: Add your anonymised internal example here
9. Related decisions (ADRs)¶
10. Known risks & gotchas¶
- Silent data corruption from missing fsync — writing without fsync leaves data in the OS page cache; a power loss loses it; the WAL incorrectly indicates it was persisted. Mitigation: always fsync the WAL after each transaction (or use group commit); test durability by simulating power loss with a test harness that kills the process mid-write.
- Write amplification in LSM-tree destroys NVMe lifetime — heavy compaction writes data 10–30× before it settles; an NVMe drive with 600 TBW is exhausted in months. Mitigation: monitor write amplification factor (WAF) as a first-class metric; tune compaction strategy for your write rate.
- Buffer pool thrash under mixed read/write — a large scan evicts hot pages from the buffer pool; subsequent random reads miss and go to disk. Mitigation: implement LRU-K or ARC eviction policy rather than simple LRU; pin frequently-accessed index pages.
- Version chain unbounded growth under long-running transactions — an MVCC engine accumulates old versions; a long-running read transaction prevents GC; memory grows until OOM. Mitigation: track the oldest active transaction; enforce a maximum transaction age; abort transactions that hold snapshot references beyond the threshold.
- Testing a storage engine is harder than building it — a bug in the WAL recovery path surfaces only in specific crash scenarios at specific points in the write sequence. Mitigation: build a crash-injection test harness that kills the process at every possible fsync boundary; verify that re-opening the engine after each crash returns a consistent state.