Pattern: Real-time Streaming Analytics¶

Quick facts

Category: Data & Analytics
Maturity: Trial
Typical team size: 2-4 engineers
Typical timeline to MVP: 6-10 weeks
Last reviewed: 2026-05-03 by Architecture Team

1. Context¶

Use this pattern when:

Business decisions require data with sub-minute latency: fraud detection, live operational dashboards, real-time alerting, anomaly detection
Event volumes are high enough (thousands of events per second) that polling a transactional database is not viable
The data model is event-oriented — page views, transactions, sensor readings, log lines — rather than entity-oriented

Do NOT use this pattern when:

15-minute data freshness is acceptable — use the Modern Data Stack with an hourly dbt schedule; the operational complexity of streaming is significant and unwarranted at that latency target
The workload requires complex multi-table joins over large historical datasets — batch Spark is more cost-effective for that access pattern
The engineering team has no prior experience with distributed messaging systems; the learning curve is steep and operational failures are hard to diagnose

2. Problem it solves¶

Batch ETL pipelines update dashboards in hours. For fraud detection, a 4-hour-old view of transaction patterns means fraudulent accounts have been operating unchecked. For operational dashboards, yesterday's numbers drive today's decisions. This pattern ingests events as they are generated, aggregates them in a stateful stream processor, and writes results to a columnar OLAP store — making the latest data queryable within seconds rather than hours.

3. Solution overview¶

System context (C4 Level 1)¶

flowchart LR
    AppEvents[Application Events] --> Broker[Message Broker\nKafka / Redpanda]
    DBStream[DB Change Stream\nDebezium CDC] --> Broker
    Broker --> Processor[Stream Processor\nApache Flink]
    Processor --> OLAP[(OLAP Store\nClickHouse)]
    OLAP --> BI[Dashboards\nGrafana / Superset]
    OLAP --> Alerts[Alerting\nGrafana alerts]

Container view (C4 Level 2)¶

flowchart TB
    subgraph Producers
        App[Application\nKafka producer SDK]
        Debezium[Debezium\nDB CDC connector]
        IoT[IoT / sensors]
    end
    subgraph Message Broker
        Kafka[Apache Kafka\nor Redpanda]
        SchemaReg[Schema Registry\nAvro / Protobuf]
    end
    subgraph Stream Processing
        Flink[Apache Flink\nstateful processing]
        RocksDB[(RocksDB\nembedded state store)]
        Checkpoint[(S3 Checkpoint Store\nrecovery state)]
    end
    subgraph OLAP Sink
        ClickHouse[(ClickHouse\ncolumnar store)]
    end
    subgraph Serving
        Grafana[Grafana\nreal-time dashboards + alerts]
        Superset[Apache Superset\nanalyst exploration]
        QueryAPI[HTTP Query API\nClickHouse native interface]
    end
    subgraph Ops
        KafkaMetrics[Kafka metrics\nconsumer lag]
        FlinkMetrics[Flink metrics\nthroughput + checkpoints]
        DLQ[(Dead-letter topic\nfailed events)]
    end

    App --> Kafka
    Debezium --> Kafka
    IoT --> Kafka
    Kafka --> SchemaReg
    Kafka --> Flink
    Flink --> RocksDB
    Flink --> Checkpoint
    Flink --> ClickHouse
    Flink -->|on error| DLQ
    ClickHouse --> Grafana
    ClickHouse --> Superset
    ClickHouse --> QueryAPI
    KafkaMetrics -.-> Kafka
    FlinkMetrics -.-> Flink

4. Technology stack¶

Layer	Primary choice	Alternatives	Notes
Message broker	Apache Kafka	Redpanda, AWS Kinesis, Google Pub/Sub	Kafka is the industry standard with the richest connector ecosystem; Redpanda is drop-in API-compatible but simpler to operate (no ZooKeeper/KRaft complexity); Kinesis for AWS-native deployments
Stream processor	Apache Flink	Spark Structured Streaming, Kafka Streams	Flink for complex stateful operations (windows, joins, CEP); Kafka Streams for simpler topologies that live entirely in the Kafka ecosystem; Spark Streaming if the team already uses Spark
Schema registry	Confluent Schema Registry	AWS Glue Schema Registry, Apicurio	Enforces Avro/Protobuf schemas on every topic; prevents malformed events from propagating silently into consumers
Real-time OLAP	ClickHouse	Apache Druid, Apache Pinot	ClickHouse delivers sub-second SQL on billions of rows with a simple operational model; Druid for pre-aggregated time-series at extreme scale; Pinot for user-facing (externally-served) real-time queries
Visualisation	Grafana	Apache Superset, Metabase	Grafana for operational dashboards with built-in alert routing; Superset for richer ad-hoc analytical exploration
Serialisation	Apache Avro	Protocol Buffers, JSON	Avro with Schema Registry gives compact binary encoding and schema evolution guarantees; JSON only in low-volume dev environments
CDC source	Debezium	AWS DMS, Maxwell	Debezium captures row-level changes from Postgres/MySQL into Kafka topics without application code changes
Deployment	Kubernetes + Flink Operator	AWS MSK + Amazon Managed Flink	Kubernetes for full control over sizing and tuning; managed services trade control for simpler operations

5. Non-functional characteristics¶

Concern	Profile
Scalability	Kafka scales by adding partitions and consumer replicas. Flink scales by increasing parallelism (task slots). ClickHouse scales by adding shards. Each layer scales independently — identify the bottleneck before adding capacity.
Availability target	99.9%+ with Kafka replication factor ≥ 3 and Flink checkpointing to S3. A Flink restart recovers from the last checkpoint; with exactly-once semantics (Kafka source + ClickHouse idempotent sink), no data is lost or duplicated.
Latency target	Event-to-ClickHouse: p95 < 5 s. Dashboard refresh: p95 < 30 s end-to-end. ClickHouse query on pre-indexed data: p95 < 500 ms. Tune Flink checkpoint interval to balance recovery time vs throughput.
Security posture	Kafka mTLS between producers and consumers; ACLs per topic. Flink runs in a VPC with no public exposure. ClickHouse behind an internal load balancer with IP allowlisting. Encrypt data in transit (TLS everywhere) and at rest.
Data residency	All data processed within your cloud region. Define Kafka retention policies (7-day default) to prevent unbounded accumulation of sensitive event data.
Compliance fit	GDPR — event streams often contain PII; apply field-level pseudonymisation in Flink before writing to ClickHouse; implement a "forget event" pattern for right-to-erasure in streaming systems (true deletion from an append-only log is non-trivial). HIPAA ✓ with encryption at every layer and AWS/GCP BAA.

6. Cost ballpark¶

Indicative monthly USD cost. Kafka and Flink compute dominate at medium and large scale.

Scale	Events / second	Monthly cost	Cost drivers
Small	< 1,000	$400 - $1,200	3-node Kafka cluster (m5.large), 2-node Flink, small ClickHouse instance
Medium	1,000 - 50,000	$2,000 - $10,000	Larger Kafka and Flink clusters, ClickHouse with SSD-backed storage
Large	50,000+	$10,000 - $50,000	Multi-broker Kafka with high replication, Flink cluster with many task slots, ClickHouse cluster with cross-shard replication

7. LLM-assisted development fit¶

Aspect	Rating	Notes
Kafka producer and consumer boilerplate	★★★★★	Excellent — Kafka client patterns for Python, Java, and Go are extremely well-represented.
Flink job scaffolding (DataStream / Table API)	★★★	Generates structurally correct Flink code; windowing semantics, watermarks, and state TTL have subtle correctness issues that require careful manual review.
ClickHouse schema and MergeTree design	★★★★	Good — ClickHouse SQL is close to standard SQL; ORDER BY and PARTITION BY key selection for a specific query pattern needs review.
Exactly-once end-to-end configuration	★★	Knows the concepts; the specific configuration required across Kafka + Flink + ClickHouse for true end-to-end exactly-once needs explicit integration testing.
Architecture decisions	★	Don't outsource. Use ADRs.

Recommended workflow: Start with a single Kafka topic and a minimal Flink job that writes to ClickHouse before adding stateful operations. Test the restart/recovery path with fault injection before going to production — exactly-once semantics are only as good as your last recovery test.

8. Reference implementations¶

Public reference: apache/flink — Flink source; flink-examples/ contains DataStream, Table API, and windowing examples in Java and Python (200 OK ✓)
Public reference: apache/kafka — Kafka source; examples/ covers producer, consumer, Streams API, and Connect patterns (200 OK ✓)
Public reference: ClickHouse/ClickHouse — ClickHouse source; docs/ covers MergeTree engine family, replication, and Kafka table engine integration (200 OK ✓)
Public reference: redpanda-data/redpanda — Redpanda, the Kafka-compatible alternative; useful reference for operational simplification at smaller cluster sizes (200 OK ✓)
Internal case study: Add your anonymised internal example here

No ADRs recorded yet. Candidates: Kafka vs Redpanda broker choice, Flink vs Spark Streaming processor choice, ClickHouse vs Druid OLAP store — record when your organisation makes a committed decision.

10. Known risks & gotchas¶

Consumer lag accumulates past Kafka retention — a slow Flink consumer falls behind; if it falls past Kafka's retention window, messages are permanently lost before processing. Mitigation: alert on consumer group lag exceeding 10 minutes; set Kafka retention to at least 7 days; provision Flink to process at 2× steady-state throughput to absorb traffic spikes.
Late-arriving events corrupt windowed aggregates — an event timestamped 5 minutes ago arrives after a 5-minute tumbling window has already closed and been emitted. Mitigation: use Flink event-time processing with watermarks that allow configurable lateness (e.g., 10-minute allowed lateness); emit corrected results when late events arrive rather than silently dropping them.
Schema Registry unavailability halts all pipelines — if the Schema Registry is unreachable, producers and consumers cannot serialise or deserialise; all pipelines stop. Mitigation: run Schema Registry in HA mode (multiple replicas); configure producers and consumers to cache schemas locally (auto.register.schemas=false, cached fallback); test registry failure explicitly.
ClickHouse INSERT amplification from small Flink batches — thousands of individual row inserts create many small data parts in ClickHouse, causing write amplification and query slowdown. Mitigation: configure the Flink ClickHouse sink to batch at least 10,000 rows or 1 second of data per flush; use async_insert=1 in ClickHouse for high-concurrency write scenarios.
Exactly-once is harder than it looks — Kafka, Flink, and ClickHouse each have their own delivery semantics; a misconfigured combination silently produces duplicates on Flink restarts. Mitigation: test the full exactly-once path by deliberately killing the Flink job mid-run and verifying ClickHouse row counts match expected event counts after recovery.