Pattern: Microservices
Quick facts
- Category: Backend & Distributed Systems
- Maturity: Trial
- Typical team size: 8+ engineers across multiple teams
- Typical timeline to MVP: 12-20 weeks for the first set of services
- Last reviewed: 2026-05-03 by Architecture Team
1. Context
Use this pattern when:
- The organisation has 8+ independent product teams who genuinely need to deploy to production without coordinating with each other
- Domain boundaries are well-understood — ideally because you have already run a modular monolith for a year and the natural service cuts have revealed themselves
- Specific services have measured, proven scaling requirements that differ materially from the rest of the system
- Different parts of the system need different release cadences, technology choices, or compliance boundaries
Do NOT use this pattern when:
- Starting a new system — begin with a Modular Monolith and extract services only when a genuine operational need is proven (see ADR-0008)
- The team is fewer than 8 engineers — the operational overhead of a microservices platform (K8s, service mesh, distributed tracing, multiple CI pipelines) will consume the majority of the team's capacity
- Domain boundaries are still unclear — premature decomposition means rewriting services and all their consumers when you discover the wrong boundary
2. Problem it solves
At large engineering scale, a single deployable unit becomes a bottleneck: the payments team cannot deploy without coordinating with the billing team; the notifications service must scale up with the order service even though it handles a tenth of the traffic; a single slow test suite blocks all releases. Microservices give each team a deployable unit it owns end-to-end, letting it release, scale, and evolve independently, at the cost of distributed-systems complexity that requires mature platform engineering to manage.
3. Solution overview
System context (C4 Level 1)
flowchart LR
WebClient((Web Client)) --> GW[API Gateway\nKong]
MobileClient((Mobile Client)) --> GW
Partners((Partners)) --> GW
GW --> OrderSvc[Order Service]
GW --> UserSvc[User Service]
GW --> BillingSvc[Billing Service]
OrderSvc -->|event| Kafka[Kafka\nEvent Bus]
Kafka --> BillingSvc
Kafka --> NotifSvc[Notification Service]
Container view (C4 Level 2)
flowchart TB
subgraph Edge
Kong[Kong API Gateway\nauth, rate limiting, routing]
end
subgraph Services
OrderSvc[Order Service\nGo — owns orders DB]
UserSvc[User Service\nGo — owns users DB]
BillingSvc[Billing Service\nGo — owns billing DB]
NotifSvc[Notification Service\nNode.js]
end
subgraph Databases
OrderDB[(Orders DB\nPostgres)]
UserDB[(Users DB\nPostgres)]
BillingDB[(Billing DB\nPostgres)]
end
subgraph Messaging
Kafka[Apache Kafka\nasync events]
end
subgraph Platform
K8s[Kubernetes\ncontainer orchestration]
OTel[OpenTelemetry\n→ Datadog]
Istio[Istio service mesh\nwhen > 20 services]
end
Kong --> OrderSvc
Kong --> UserSvc
Kong --> BillingSvc
OrderSvc --> OrderDB
UserSvc --> UserDB
BillingSvc --> BillingDB
OrderSvc -->|order.created| Kafka
Kafka --> BillingSvc
Kafka --> NotifSvc
K8s -.-> OrderSvc
K8s -.-> UserSvc
K8s -.-> BillingSvc
OTel -.-> OrderSvc
OTel -.-> BillingSvc
4. Technology stack
| Layer | Primary choice | Alternatives | Notes |
|---|---|---|---|
| Language | Go | Java (Spring Boot), Node.js (NestJS), .NET | Go for new services: small binaries, low memory, excellent concurrency, fast cold starts; Java if the team has deep Spring expertise |
| Sync communication | gRPC (internal) + REST (external) | REST only, GraphQL Federation | gRPC for internal service-to-service calls (typed, efficient); REST + OpenAPI for external-facing endpoints |
| Async communication | Apache Kafka | AWS SQS/SNS, NATS | Kafka for durable event streaming and replay; SQS for simpler task queues with no replay requirement |
| API gateway | Kong | AWS API Gateway, nginx | Kong for cloud-agnostic; AWS API Gateway for Lambda-heavy AWS-native deployments |
| Container orchestration | Kubernetes (EKS / GKE) | AWS ECS, Nomad | K8s is the standard; ECS for teams wanting managed control plane without K8s operational depth |
| Service mesh | None (< 20 services); Istio (> 20 services) | Linkerd | Avoid the complexity until the service count justifies it; Linkerd is simpler than Istio for mTLS + observability |
| Observability | OpenTelemetry → Datadog | Grafana + Tempo + Prometheus | OpenTelemetry as the vendor-neutral instrumentation standard; every service must emit traces from day one |
| Container registry | AWS ECR | GHCR, Docker Hub | ECR integrates with EKS IAM; Docker Hub rate-limits unauthenticated pulls |
5. Non-functional characteristics
| Concern | Profile |
|---|---|
| Scalability | Each service scales independently based on its own metrics. Design: measure before scaling; never pre-emptively over-provision based on assumptions about which service is the bottleneck. |
| Availability target | 99.9% per service; 99.5% for the system as a whole, because availability multiplies along a serial call path (for example, five services at 99.9% each give roughly 99.5%, assuming independent failures). Implement circuit breakers at service boundaries; a downstream outage must not cascade into an upstream outage. |
| Latency target | p95 < 50 ms for internal gRPC calls; p95 < 300 ms for external API responses including gateway overhead. Set explicit timeout budgets on every outbound call; never rely on a downstream service's default timeout. |
| Security posture | mTLS between all services (enforced by Istio when deployed). Zero-trust network policy — services communicate only with explicitly allowed peers. Separate IAM role and Kubernetes ServiceAccount per service. |
| Data residency | Each service owns its own database; cross-service data is accessible only via the owning service's API — never via direct DB access from another service. |
| Compliance fit | SOC 2 ✓ — distributed tracing provides full request lineage. GDPR: right-to-erasure requires coordinated deletion across every service that stores user data; design a user.deletion_requested event and implement a subscriber in each service before going live with personal data. |
6. Cost ballpark
Indicative monthly cost in USD. Expect costs significantly higher than a modular monolith for the same functional scope, owing to platform overhead.
| Scale | Number of services | Monthly cost | Cost drivers |
|---|---|---|---|
| Small | 5 - 10 | $600 - $3,000 | K8s cluster, per-service Postgres, API gateway, Datadog |
| Medium | 10 - 30 | $3,000 - $15,000 | Larger K8s fleet, Datadog ($1,000–3,000/month), Kafka, Istio, dedicated platform SRE time |
| Large | 30+ | $15,000 - $60,000+ | Full platform engineering team, multi-region K8s, security scanning tooling, self-service developer portal |
7. LLM-assisted development fit
| Aspect | Rating | Notes |
|---|---|---|
| Individual service CRUD boilerplate (Go, NestJS) | ★★★★★ | Excellent — per-service code is self-contained and well-represented in training data. |
| gRPC .proto definition and stub generation | ★★★★ | Good; verify field numbering, backwards-compatibility rules, and API versioning strategy. |
| Kubernetes manifests and Helm charts | ★★★★ | Good; always review RBAC permissions and resource limits — generated limits are often wrong for production. |
| Distributed saga and compensating transaction design | ★★ | Understands the concept; the correctness of compensating transactions requires careful human design and explicit testing. |
| Architecture decisions | ★ | Don't outsource — particularly the "should we use microservices at all?" question. Use ADRs. |
Recommended workflow: Extract one service at a time from a working modular monolith using the Strangler Fig pattern. Never start with microservices. Validate that the extracted service can operate, deploy, and recover independently before extracting the next one.
8. Reference implementations
- Public reference: GoogleCloudPlatform/microservices-demo (Online Boutique), an 11-service e-commerce application in Go, Python, and Node; shows gRPC internal communication, Kubernetes deployment, and distributed tracing
- Public reference: dotnet/eShop, Microsoft's reference microservices architecture in .NET; covers event-driven patterns, API gateway, and service-to-service communication
- Public reference: open-telemetry/opentelemetry-collector, the OTel Collector used to ship traces and metrics from every service to any backend
- Internal case study: Add your anonymised internal example here
9. Related decisions (ADRs)
10. Known risks & gotchas
- "Distributed monolith" anti-pattern — services are technically separate deployables but share a database and make synchronous call chains 5 services deep. You get all the operational cost with none of the team autonomy benefits. Mitigation: enforce "database per service" as a hard rule; use asynchronous events for non-critical coupling; keep synchronous call depth to ≤ 2.
- GDPR right-to-erasure is harder than it looks: user data is scattered across 15 services; a deletion request requires coordinating 15 independent delete operations. Mitigation: design a user.deletion_requested event on day one; every service implements a deletion subscriber; test the full flow annually.
- Observability is non-negotiable: a 500 error in the gateway originates in the 4th service in a call chain; without distributed tracing you cannot debug it. Mitigation: OpenTelemetry instrumentation is a launch prerequisite, not a nice-to-have; never deploy a new service to production without verifying traces propagate correctly.
- Service boundary drawn too fine — a "nano-service" does one thing and requires 5 synchronous calls to serve a single user request; latency adds up and the call graph becomes unmaintainable. Mitigation: a service should be ownable by a single team end-to-end; if it is too small for one team, it is too small to exist as a service.
- Kubernetes operational overhead is substantial — pods get evicted, nodes run out of memory, ingress controllers need upgrades, certificates expire. Mitigation: use a managed K8s offering (EKS, GKE); invest in a dedicated platform SRE before reaching 10 services; if the team does not have this capacity, stay on ECS or a managed PaaS.
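For the right-to-erasure item above, one way to make "test the full flow" checkable is to track which services have confirmed deletion for each user.deletion_requested event. The sketch below is an assumption-laden illustration: `ErasureTracker` and its methods are hypothetical, state is held in memory, and the service names are examples rather than a list of which services actually store personal data.

```go
package main

import (
	"fmt"
	"sort"
)

// ErasureTracker records which services have confirmed deletion for each
// user.deletion_requested event. The type and its methods are hypothetical.
type ErasureTracker struct {
	required []string                   // services that store personal data
	acked    map[string]map[string]bool // request ID -> service -> confirmed
}

// NewErasureTracker returns a tracker that expects a confirmation from every
// service in required.
func NewErasureTracker(required ...string) *ErasureTracker {
	return &ErasureTracker{required: required, acked: make(map[string]map[string]bool)}
}

// Ack marks service as having completed deletion for requestID.
func (t *ErasureTracker) Ack(requestID, service string) {
	if t.acked[requestID] == nil {
		t.acked[requestID] = make(map[string]bool)
	}
	t.acked[requestID][service] = true
}

// Pending returns, in sorted order, the required services that have not yet
// confirmed deletion for requestID.
func (t *ErasureTracker) Pending(requestID string) []string {
	var pending []string
	for _, svc := range t.required {
		if !t.acked[requestID][svc] {
			pending = append(pending, svc)
		}
	}
	sort.Strings(pending)
	return pending
}

func main() {
	tracker := NewErasureTracker("order-service", "user-service", "billing-service")

	tracker.Ack("req-1", "user-service")
	tracker.Ack("req-1", "billing-service")
	fmt.Println("still pending:", tracker.Pending("req-1"))

	tracker.Ack("req-1", "order-service")
	fmt.Println("erasure complete:", len(tracker.Pending("req-1")) == 0)
}
```

A request with a non-empty pending list after an agreed deadline is the signal to alert on; a production version would need durable storage rather than a map.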