Skip to content

Pattern: Microservices

Quick facts

  • Category: Backend & Distributed Systems
  • Maturity: Trial
  • Typical team size: 8+ engineers across multiple teams
  • Typical timeline to MVP: 12-20 weeks for the first set of services
  • Last reviewed: 2026-05-03 by Architecture Team

1. Context

Use this pattern when:

  • The organisation has 8+ independent product teams who genuinely need to deploy to production without coordinating with each other
  • Domain boundaries are well-understood — ideally because you have already run a modular monolith for a year and the natural service cuts have revealed themselves
  • Specific services have measured, proven scaling requirements that differ materially from the rest of the system
  • Different parts of the system need different release cadences, technology choices, or compliance boundaries

Do NOT use this pattern when:

  • Starting a new system — begin with a Modular Monolith and extract services only when a genuine operational need is proven (see ADR-0008)
  • The team is fewer than 8 engineers — the operational overhead of a microservices platform (K8s, service mesh, distributed tracing, multiple CI pipelines) will consume the majority of the team's capacity
  • Domain boundaries are still unclear — premature decomposition means rewriting services and all their consumers when you discover the wrong boundary

2. Problem it solves

At large engineering scale, a single deployable unit becomes a bottleneck: the payments team cannot deploy without coordinating with the billing team; the notifications service must scale up with the order service even though it handles 10× less traffic; a single slow test suite blocks all releases. Microservices give each team a deployable unit they own end-to-end, letting them release, scale, and evolve independently — at the cost of distributed systems complexity that requires mature platform engineering to manage.

3. Solution overview

System context (C4 Level 1)

flowchart LR
    WebClient((Web Client)) --> GW[API Gateway\nKong]
    MobileClient((Mobile Client)) --> GW
    Partners((Partners)) --> GW
    GW --> OrderSvc[Order Service]
    GW --> UserSvc[User Service]
    GW --> BillingSvc[Billing Service]
    OrderSvc -->|event| Kafka[Kafka\nEvent Bus]
    Kafka --> BillingSvc
    Kafka --> NotifSvc[Notification Service]

Container view (C4 Level 2)

flowchart TB
    subgraph Edge
        Kong[Kong API Gateway\nauth, rate limiting, routing]
    end
    subgraph Services
        OrderSvc[Order Service\nGo — owns orders DB]
        UserSvc[User Service\nGo — owns users DB]
        BillingSvc[Billing Service\nGo — owns billing DB]
        NotifSvc[Notification Service\nNode.js]
    end
    subgraph Databases
        OrderDB[(Orders DB\nPostgres)]
        UserDB[(Users DB\nPostgres)]
        BillingDB[(Billing DB\nPostgres)]
    end
    subgraph Messaging
        Kafka[Apache Kafka\nasync events]
    end
    subgraph Platform
        K8s[Kubernetes\ncontainer orchestration]
        OTel[OpenTelemetry\n→ Datadog]
        Istio[Istio service mesh\nwhen > 20 services]
    end

    Kong --> OrderSvc
    Kong --> UserSvc
    Kong --> BillingSvc
    OrderSvc --> OrderDB
    UserSvc --> UserDB
    BillingSvc --> BillingDB
    OrderSvc -->|order.created| Kafka
    Kafka --> BillingSvc
    Kafka --> NotifSvc
    K8s -.-> OrderSvc
    K8s -.-> UserSvc
    K8s -.-> BillingSvc
    OTel -.-> OrderSvc
    OTel -.-> BillingSvc

4. Technology stack

Layer Primary choice Alternatives Notes
Language Go Java (Spring Boot), Node.js (NestJS), .NET Go for new services: small binaries, low memory, excellent concurrency, fast cold starts; Java if the team has deep Spring expertise
Sync communication gRPC (internal) + REST (external) REST only, GraphQL Federation gRPC for internal service-to-service calls (typed, efficient); REST + OpenAPI for external-facing endpoints
Async communication Apache Kafka AWS SQS/SNS, NATS Kafka for durable event streaming and replay; SQS for simpler task queues with no replay requirement
API gateway Kong AWS API Gateway, nginx Kong for cloud-agnostic; AWS API Gateway for Lambda-heavy AWS-native deployments
Container orchestration Kubernetes (EKS / GKE) AWS ECS, Nomad K8s is the standard; ECS for teams wanting managed control plane without K8s operational depth
Service mesh None (< 20 services); Istio (> 20 services) Linkerd Avoid the complexity until the service count justifies it; Linkerd is simpler than Istio for mTLS + observability
Observability OpenTelemetry → Datadog Grafana + Tempo + Prometheus OpenTelemetry as the vendor-neutral instrumentation standard; every service must emit traces from day one
Container registry AWS ECR GHCR, Docker Hub ECR integrates with EKS IAM; Docker Hub rate-limits unauthenticated pulls

5. Non-functional characteristics

Concern Profile
Scalability Each service scales independently based on its own metrics. Design: measure before scaling; never pre-emptively over-provision based on assumptions about which service is the bottleneck.
Availability target 99.9% per service; 99.5% for the system as a whole (failure modes multiply across services). Implement circuit breakers at service boundaries; a downstream outage must not cascade into an upstream outage.
Latency target p95 < 50 ms for internal gRPC calls; p95 < 300 ms for external API responses including gateway overhead. Set explicit timeout budgets on every outbound call; never rely on a downstream service's default timeout.
Security posture mTLS between all services (enforced by Istio when deployed). Zero-trust network policy — services communicate only with explicitly allowed peers. Separate IAM role and Kubernetes ServiceAccount per service.
Data residency Each service owns its own database; cross-service data is accessible only via the owning service's API — never via direct DB access from another service.
Compliance fit SOC 2 ✓ — distributed tracing provides full request lineage. GDPR: right-to-erasure requires coordinated deletion across every service that stores user data; design a user.deletion_requested event and implement a subscriber in each service before going live with personal data.

6. Cost ballpark

Indicative monthly USD cost. Significantly higher than a modular monolith for the same functional scope due to platform overhead.

Scale Number of services Monthly cost Cost drivers
Small 5 - 10 $600 - $3,000 K8s cluster, per-service Postgres, API gateway, Datadog
Medium 10 - 30 $3,000 - $15,000 Larger K8s fleet, Datadog ($1,000–3,000/month), Kafka, Istio, dedicated platform SRE time
Large 30+ $15,000 - $60,000+ Full platform engineering team, multi-region K8s, security scanning tooling, self-service developer portal

7. LLM-assisted development fit

Aspect Rating Notes
Individual service CRUD boilerplate (Go, NestJS) ★★★★★ Excellent — per-service code is self-contained and well-represented in training data.
gRPC .proto definition and stub generation ★★★★ Good; verify field numbering, backwards-compatibility rules, and API versioning strategy.
Kubernetes manifests and Helm charts ★★★★ Good; always review RBAC permissions and resource limits — generated limits are often wrong for production.
Distributed saga and compensating transaction design ★★ Understands the concept; the correctness of compensating transactions requires careful human design and explicit testing.
Architecture decisions Don't outsource — particularly the "should we use microservices at all?" question. Use ADRs.

Recommended workflow: Extract one service at a time from a working modular monolith using the Strangler Fig pattern. Never start with microservices. Validate that the extracted service can operate, deploy, and recover independently before extracting the next one.

8. Reference implementations

  • Public reference: GoogleCloudPlatform/microservices-demo — Online Boutique: 11-service e-commerce application in Go, Python, and Node; shows gRPC internal communication, Kubernetes deployment, and distributed tracing (200 OK ✓)
  • Public reference: dotnet/eShop — Microsoft's reference microservices architecture in .NET; covers event-driven patterns, API gateway, and service-to-service communication (200 OK ✓)
  • Public reference: open-telemetry/opentelemetry-collector — the OTel Collector used to ship traces and metrics from every service to any backend (200 OK ✓)
  • Internal case study: Add your anonymised internal example here

10. Known risks & gotchas

  • "Distributed monolith" anti-pattern — services are technically separate deployables but share a database and make synchronous call chains 5 services deep. You get all the operational cost with none of the team autonomy benefits. Mitigation: enforce "database per service" as a hard rule; use asynchronous events for non-critical coupling; keep synchronous call depth to ≤ 2.
  • GDPR right-to-erasure is harder than it looks — user data is scattered across 15 services; a deletion request requires coordinating 15 independent delete operations. Mitigation: design a user.deletion_requested event on day one; every service implements a deletion subscriber; test the full flow annually.
  • Observability is non-negotiable — a 500 error in the gateway originates in the 4th service in a call chain; without distributed tracing you cannot debug it. Mitigation: OpenTelemetry instrumentation is a launch prerequisite, not a nice-to-have; never deploy a new service to production without verifying traces propagate correctly.
  • Service boundary drawn too fine — a "nano-service" does one thing and requires 5 synchronous calls to serve a single user request; latency adds up and the call graph becomes unmaintainable. Mitigation: a service should be ownable by a single team end-to-end; if it is too small for one team, it is too small to exist as a service.
  • Kubernetes operational overhead is substantial — pods get evicted, nodes run out of memory, ingress controllers need upgrades, certificates expire. Mitigation: use a managed K8s offering (EKS, GKE); invest in a dedicated platform SRE before reaching 10 services; if the team does not have this capacity, stay on ECS or a managed PaaS.