Skip to content

Pattern: Real-time Streaming Analytics

Quick facts

  • Category: Data & Analytics
  • Maturity: Trial
  • Typical team size: 2-4 engineers
  • Typical timeline to MVP: 6-10 weeks
  • Last reviewed: 2026-05-03 by Architecture Team

1. Context

Use this pattern when:

  • Business decisions require data with sub-minute latency: fraud detection, live operational dashboards, real-time alerting, anomaly detection
  • Event volumes are high enough (thousands of events per second) that polling a transactional database is not viable
  • The data model is event-oriented — page views, transactions, sensor readings, log lines — rather than entity-oriented

Do NOT use this pattern when:

  • 15-minute data freshness is acceptable — use the Modern Data Stack with an hourly dbt schedule; the operational complexity of streaming is significant and unwarranted at that latency target
  • The workload requires complex multi-table joins over large historical datasets — batch Spark is more cost-effective for that access pattern
  • The engineering team has no prior experience with distributed messaging systems; the learning curve is steep and operational failures are hard to diagnose

2. Problem it solves

Batch ETL pipelines update dashboards in hours. For fraud detection, a 4-hour-old view of transaction patterns means fraudulent accounts have been operating unchecked. For operational dashboards, yesterday's numbers drive today's decisions. This pattern ingests events as they are generated, aggregates them in a stateful stream processor, and writes results to a columnar OLAP store — making the latest data queryable within seconds rather than hours.

3. Solution overview

System context (C4 Level 1)

flowchart LR
    AppEvents[Application Events] --> Broker[Message Broker\nKafka / Redpanda]
    DBStream[DB Change Stream\nDebezium CDC] --> Broker
    Broker --> Processor[Stream Processor\nApache Flink]
    Processor --> OLAP[(OLAP Store\nClickHouse)]
    OLAP --> BI[Dashboards\nGrafana / Superset]
    OLAP --> Alerts[Alerting\nGrafana alerts]

Container view (C4 Level 2)

flowchart TB
    subgraph Producers
        App[Application\nKafka producer SDK]
        Debezium[Debezium\nDB CDC connector]
        IoT[IoT / sensors]
    end
    subgraph Message Broker
        Kafka[Apache Kafka\nor Redpanda]
        SchemaReg[Schema Registry\nAvro / Protobuf]
    end
    subgraph Stream Processing
        Flink[Apache Flink\nstateful processing]
        RocksDB[(RocksDB\nembedded state store)]
        Checkpoint[(S3 Checkpoint Store\nrecovery state)]
    end
    subgraph OLAP Sink
        ClickHouse[(ClickHouse\ncolumnar store)]
    end
    subgraph Serving
        Grafana[Grafana\nreal-time dashboards + alerts]
        Superset[Apache Superset\nanalyst exploration]
        QueryAPI[HTTP Query API\nClickHouse native interface]
    end
    subgraph Ops
        KafkaMetrics[Kafka metrics\nconsumer lag]
        FlinkMetrics[Flink metrics\nthroughput + checkpoints]
        DLQ[(Dead-letter topic\nfailed events)]
    end

    App --> Kafka
    Debezium --> Kafka
    IoT --> Kafka
    Kafka --> SchemaReg
    Kafka --> Flink
    Flink --> RocksDB
    Flink --> Checkpoint
    Flink --> ClickHouse
    Flink -->|on error| DLQ
    ClickHouse --> Grafana
    ClickHouse --> Superset
    ClickHouse --> QueryAPI
    KafkaMetrics -.-> Kafka
    FlinkMetrics -.-> Flink

4. Technology stack

Layer Primary choice Alternatives Notes
Message broker Apache Kafka Redpanda, AWS Kinesis, Google Pub/Sub Kafka is the industry standard with the richest connector ecosystem; Redpanda is drop-in API-compatible but simpler to operate (no ZooKeeper/KRaft complexity); Kinesis for AWS-native deployments
Stream processor Apache Flink Spark Structured Streaming, Kafka Streams Flink for complex stateful operations (windows, joins, CEP); Kafka Streams for simpler topologies that live entirely in the Kafka ecosystem; Spark Streaming if the team already uses Spark
Schema registry Confluent Schema Registry AWS Glue Schema Registry, Apicurio Enforces Avro/Protobuf schemas on every topic; prevents malformed events from propagating silently into consumers
Real-time OLAP ClickHouse Apache Druid, Apache Pinot ClickHouse delivers sub-second SQL on billions of rows with a simple operational model; Druid for pre-aggregated time-series at extreme scale; Pinot for user-facing (externally-served) real-time queries
Visualisation Grafana Apache Superset, Metabase Grafana for operational dashboards with built-in alert routing; Superset for richer ad-hoc analytical exploration
Serialisation Apache Avro Protocol Buffers, JSON Avro with Schema Registry gives compact binary encoding and schema evolution guarantees; JSON only in low-volume dev environments
CDC source Debezium AWS DMS, Maxwell Debezium captures row-level changes from Postgres/MySQL into Kafka topics without application code changes
Deployment Kubernetes + Flink Operator AWS MSK + Amazon Managed Flink Kubernetes for full control over sizing and tuning; managed services trade control for simpler operations

5. Non-functional characteristics

Concern Profile
Scalability Kafka scales by adding partitions and consumer replicas. Flink scales by increasing parallelism (task slots). ClickHouse scales by adding shards. Each layer scales independently — identify the bottleneck before adding capacity.
Availability target 99.9%+ with Kafka replication factor ≥ 3 and Flink checkpointing to S3. A Flink restart recovers from the last checkpoint; with exactly-once semantics (Kafka source + ClickHouse idempotent sink), no data is lost or duplicated.
Latency target Event-to-ClickHouse: p95 < 5 s. Dashboard refresh: p95 < 30 s end-to-end. ClickHouse query on pre-indexed data: p95 < 500 ms. Tune Flink checkpoint interval to balance recovery time vs throughput.
Security posture Kafka mTLS between producers and consumers; ACLs per topic. Flink runs in a VPC with no public exposure. ClickHouse behind an internal load balancer with IP allowlisting. Encrypt data in transit (TLS everywhere) and at rest.
Data residency All data processed within your cloud region. Define Kafka retention policies (7-day default) to prevent unbounded accumulation of sensitive event data.
Compliance fit GDPR — event streams often contain PII; apply field-level pseudonymisation in Flink before writing to ClickHouse; implement a "forget event" pattern for right-to-erasure in streaming systems (true deletion from an append-only log is non-trivial). HIPAA ✓ with encryption at every layer and AWS/GCP BAA.

6. Cost ballpark

Indicative monthly USD cost. Kafka and Flink compute dominate at medium and large scale.

Scale Events / second Monthly cost Cost drivers
Small < 1,000 $400 - $1,200 3-node Kafka cluster (m5.large), 2-node Flink, small ClickHouse instance
Medium 1,000 - 50,000 $2,000 - $10,000 Larger Kafka and Flink clusters, ClickHouse with SSD-backed storage
Large 50,000+ $10,000 - $50,000 Multi-broker Kafka with high replication, Flink cluster with many task slots, ClickHouse cluster with cross-shard replication

7. LLM-assisted development fit

Aspect Rating Notes
Kafka producer and consumer boilerplate ★★★★★ Excellent — Kafka client patterns for Python, Java, and Go are extremely well-represented.
Flink job scaffolding (DataStream / Table API) ★★★ Generates structurally correct Flink code; windowing semantics, watermarks, and state TTL have subtle correctness issues that require careful manual review.
ClickHouse schema and MergeTree design ★★★★ Good — ClickHouse SQL is close to standard SQL; ORDER BY and PARTITION BY key selection for a specific query pattern needs review.
Exactly-once end-to-end configuration ★★ Knows the concepts; the specific configuration required across Kafka + Flink + ClickHouse for true end-to-end exactly-once needs explicit integration testing.
Architecture decisions Don't outsource. Use ADRs.

Recommended workflow: Start with a single Kafka topic and a minimal Flink job that writes to ClickHouse before adding stateful operations. Test the restart/recovery path with fault injection before going to production — exactly-once semantics are only as good as your last recovery test.

8. Reference implementations

  • Public reference: apache/flink — Flink source; flink-examples/ contains DataStream, Table API, and windowing examples in Java and Python (200 OK ✓)
  • Public reference: apache/kafka — Kafka source; examples/ covers producer, consumer, Streams API, and Connect patterns (200 OK ✓)
  • Public reference: ClickHouse/ClickHouse — ClickHouse source; docs/ covers MergeTree engine family, replication, and Kafka table engine integration (200 OK ✓)
  • Public reference: redpanda-data/redpanda — Redpanda, the Kafka-compatible alternative; useful reference for operational simplification at smaller cluster sizes (200 OK ✓)
  • Internal case study: Add your anonymised internal example here
  • No ADRs recorded yet. Candidates: Kafka vs Redpanda broker choice, Flink vs Spark Streaming processor choice, ClickHouse vs Druid OLAP store — record when your organisation makes a committed decision.

10. Known risks & gotchas

  • Consumer lag accumulates past Kafka retention — a slow Flink consumer falls behind; if it falls past Kafka's retention window, messages are permanently lost before processing. Mitigation: alert on consumer group lag exceeding 10 minutes; set Kafka retention to at least 7 days; provision Flink to process at 2× steady-state throughput to absorb traffic spikes.
  • Late-arriving events corrupt windowed aggregates — an event timestamped 5 minutes ago arrives after a 5-minute tumbling window has already closed and been emitted. Mitigation: use Flink event-time processing with watermarks that allow configurable lateness (e.g., 10-minute allowed lateness); emit corrected results when late events arrive rather than silently dropping them.
  • Schema Registry unavailability halts all pipelines — if the Schema Registry is unreachable, producers and consumers cannot serialise or deserialise; all pipelines stop. Mitigation: run Schema Registry in HA mode (multiple replicas); configure producers and consumers to cache schemas locally (auto.register.schemas=false, cached fallback); test registry failure explicitly.
  • ClickHouse INSERT amplification from small Flink batches — thousands of individual row inserts create many small data parts in ClickHouse, causing write amplification and query slowdown. Mitigation: configure the Flink ClickHouse sink to batch at least 10,000 rows or 1 second of data per flush; use async_insert=1 in ClickHouse for high-concurrency write scenarios.
  • Exactly-once is harder than it looks — Kafka, Flink, and ClickHouse each have their own delivery semantics; a misconfigured combination silently produces duplicates on Flink restarts. Mitigation: test the full exactly-once path by deliberately killing the Flink job mid-run and verifying ClickHouse row counts match expected event counts after recovery.