Scaling & Reliability Patterns

DIA is designed for horizontal scaling and high availability using several proven patterns.

Kafka-Based Scaling

  • Partitioning Strategy: Each Kafka topic is partitioned to enable parallel processing:

    • decision.simulation.request: Partitioned by tenant_id for tenant isolation
    • train.results: Partitioned by model_id for model-specific processing
    • model.metrics: Partitioned by tenant_id for tenant-scoped metrics
  • Consumer Groups: Multiple consumer instances in the same group process different partitions:

    • dia-simulator-group: 3 instances process 3 partitions in parallel
    • dia-group: 2 instances handle general events
    • Throughput scales roughly linearly as consumers are added, up to the number of partitions; consumers beyond that count sit idle
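The partitioning and consumer-group behavior above can be sketched in a few lines. This is a simplified model, not DIA's code: `partition_for` and `assign_partitions` are hypothetical helpers, `zlib.crc32` stands in for the client's real key hash, and the round-robin assignment is only one of several strategies a Kafka group can use.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition; equal keys always land together."""
    # crc32 is a stand-in for illustration. Real Kafka clients ship their own
    # partitioner (the Java client hashes keyed records with murmur2).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions round-robin over the consumers of one group."""
    assignment: Dict[str, List[int]] = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Every event for one tenant goes to the same partition, so one consumer
# sees that tenant's events in order.
tenant_partition = partition_for("tenant-a", 3)

# Three consumers on three partitions: one partition each.
balanced = assign_partitions(3, ["sim-0", "sim-1", "sim-2"])

# A fourth consumer gets nothing: parallelism is capped by partition count.
oversized = assign_partitions(3, ["sim-0", "sim-1", "sim-2", "sim-3"])
```

Because a partition is owned by at most one consumer in a group, raising throughput past the current consumer count also requires adding partitions.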

Worker Pool Pattern

Heavy simulations are processed off the FastAPI request thread:

  • Async Workers: Custom async worker pool processes simulation jobs
  • Celery Alternative: Celery can replace the custom pool when more complex task scheduling is needed
  • Resource Isolation: Workers run in separate processes/containers
  • Priority Queue: High-priority simulations can be processed first
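A minimal sketch of a prioritized async worker pool, using only the standard library. It is an illustration under stated assumptions: `run_pool` is a hypothetical name, lower numbers mean higher priority, the simulation itself is replaced by a placeholder, and everything runs in one process for brevity, whereas the workers described above run in separate processes or containers.

```python
import asyncio
import itertools
from typing import Iterable, List, Tuple


async def run_pool(jobs: Iterable[Tuple[int, str]], num_workers: int = 2) -> List[str]:
    """Run (priority, job_name) pairs; lower priority numbers run first."""
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    sequence = itertools.count()  # tie-breaker: FIFO order within one priority
    for priority, name in jobs:
        queue.put_nowait((priority, next(sequence), name))

    completed: List[str] = []

    async def worker() -> None:
        while True:
            _priority, _seq, name = await queue.get()
            try:
                await asyncio.sleep(0)  # placeholder for the real simulation
                completed.append(name)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(num_workers)]
    await queue.join()  # wait until every queued job has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return completed


# With a single worker the completion order is exactly the priority order.
order = asyncio.run(
    run_pool([(5, "batch-report"), (0, "urgent-sim"), (5, "nightly-sim")], num_workers=1)
)
```

In a FastAPI service the request handler would only enqueue the job and return a job identifier, leaving the heavy work to the pool.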

Caching Strategy (Redis)

Redis is used for multiple caching layers:

  • Overview Cache: /overview endpoint results cached for 30 seconds
  • Analysis Cache: Similar analysis queries return cached results
  • Policy Cache: GA policy validations cached to reduce API calls
  • Model Metadata Cache: Frequently accessed model information

Cache invalidation:

  • Time-based TTL for most caches
  • Event-based invalidation on policy changes
  • Manual invalidation via admin API
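The TTL and invalidation rules above can be sketched with an in-memory stand-in for Redis. `TTLCache`, its method names, and the `overview:` and `policy:` key prefixes are assumptions made for this example; with Redis itself the expiry would typically be set when the value is written (for instance `SET` with the `EX` option).

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_compute(self, key: str, ttl_seconds: float, compute: Callable[[], Any]) -> Any:
        """Return the cached value, or compute and cache it for ttl_seconds."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh
        value = compute()
        self._store[key] = (now + ttl_seconds, value)
        return value

    def invalidate_prefix(self, prefix: str) -> int:
        """Drop every key starting with prefix; returns how many were removed."""
        doomed = [key for key in self._store if key.startswith(prefix)]
        for key in doomed:
            del self._store[key]
        return len(doomed)


cache = TTLCache()

# Time-based: the overview for a tenant is recomputed at most every 30 seconds.
overview = cache.get_or_compute("overview:tenant-a", 30, lambda: {"models": 4})

# Event-based: a policy change drops all cached policy validations at once.
cache.get_or_compute("policy:tenant-a:rule-1", 300, lambda: True)
removed = cache.invalidate_prefix("policy:")
```

The same `invalidate_prefix` call could back a manual admin endpoint; the 300-second policy TTL is illustrative only.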

Idempotency

All simulation requests support idempotency keys:

  • Client provides idempotency_key in request
  • Server checks if key exists before processing
  • Returns existing result if key found
  • Prevents duplicate processing on retries
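The four steps above can be sketched as follows. `IdempotentSimulator` is a hypothetical wrapper and the dictionary stands in for whatever shared store the service uses; the sketch also ignores two concurrent requests carrying the same key, which a real deployment would have to guard against with an atomic set-if-absent in that store.

```python
from typing import Any, Callable, Dict


class IdempotentSimulator:
    """Run each simulation at most once per idempotency key."""

    def __init__(self, run_simulation: Callable[[dict], Any]) -> None:
        self._run = run_simulation
        self._results: Dict[str, Any] = {}  # idempotency_key -> stored result

    def submit(self, idempotency_key: str, payload: dict) -> Any:
        # Check for the key before doing any work.
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # replay the earlier result
        result = self._run(payload)
        self._results[idempotency_key] = result
        return result


executions = []


def fake_simulation(payload: dict) -> dict:
    """Placeholder for the real simulation; records each actual run."""
    executions.append(payload)
    return {"status": "done", "scenario": payload["scenario"]}


simulator = IdempotentSimulator(fake_simulation)
first = simulator.submit("key-123", {"scenario": "baseline"})
retry = simulator.submit("key-123", {"scenario": "baseline"})  # client retry
```

Here `retry` is the stored result of the first call, and `fake_simulation` ran only once.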

Circuit Breaker Pattern

Calls to external dependencies are protected by circuit breakers and related fail-fast safeguards:

  • GA validation calls: Opens after 5 consecutive failures
  • Database connections: Fails fast when pool exhausted
  • Kafka producers: Buffers messages when broker unavailable
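A minimal circuit breaker that opens after 5 consecutive failures, as described for the GA validation calls. The class name, the 30-second reset timeout, and the half-open trial behavior are assumptions for illustration; the source only states the failure threshold.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,      # from the text: 5 consecutive failures
        reset_timeout: float = 30.0,     # assumed value, not from the source
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Any success resets the count and closes the circuit.
        self._failures = 0
        self._opened_at = None
        return result


breaker = CircuitBreaker()
validated = breaker.call(lambda policy: policy == "allow", "allow")
```

While the circuit is open, callers get `CircuitOpenError` immediately instead of waiting on a service that is known to be failing.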

Retry Strategy

  • Exponential Backoff: The delay doubles before each retry (1s, 2s, 4s, 8s, ...); the retry limit below determines how far the schedule runs
  • Max Retries: 3 attempts for transient failures
  • Dead Letter Queue: Failed messages after max retries sent to DLQ
  • Idempotent Operations: Retries are safe due to idempotency keys
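The retry rules above can be sketched as one function. This is an illustration under assumptions: `process_with_retry` is a hypothetical name, "3 attempts" is read as one initial attempt plus up to 3 retries (so the waits are 1s, 2s, 4s), every exception is treated as transient, and a plain list stands in for the dead letter queue.

```python
import time
from typing import Any, Callable, List


def process_with_retry(
    message: Any,
    handler: Callable[[Any], None],
    dead_letters: List[Any],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the handler succeeded, False if the message was dead-lettered."""
    for attempt in range(max_retries + 1):  # initial attempt + max_retries retries
        try:
            handler(message)
            return True
        except Exception:
            if attempt == max_retries:
                break  # retries exhausted
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    dead_letters.append(message)  # stand-in for publishing to the DLQ
    return False


# Example: a handler that fails twice, then succeeds on the third attempt.
waits: List[float] = []
attempts = {"count": 0}


def flaky_handler(message: Any) -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")


dlq: List[Any] = []
ok = process_with_retry({"id": 1}, flaky_handler, dlq, sleep=waits.append)
```

Passing `sleep=waits.append` records the delays instead of waiting, which keeps the example fast. The handler must be idempotent, since a retried message may already have been partly processed.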