Scaling & Reliability Patterns
DIA is designed for horizontal scaling and high availability using several proven patterns.
Kafka-Based Scaling
- Partitioning Strategy: Each Kafka topic is partitioned to enable parallel processing:
  - `decision.simulation.request`: Partitioned by `tenant_id` for tenant isolation
  - `train.results`: Partitioned by `model_id` for model-specific processing
  - `model.metrics`: Partitioned by `tenant_id` for tenant-scoped metrics
- Consumer Groups: Multiple consumer instances in the same group process different partitions:
  - `dia-simulator-group`: 3 instances process 3 partitions in parallel
  - `dia-group`: 2 instances handle general events
  - Enables near-linear scaling: adding consumers increases throughput, up to one consumer per partition
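The interaction between key-based partitioning and consumer groups can be illustrated with a small, broker-free sketch. It is not DIA's code: `partition_for` and `assign_partitions` are hypothetical helpers, `zlib.crc32` stands in for Kafka's own key hash, and the round-robin assignment is only one of several strategies a real consumer group may use.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition using a stable hash.

    Assumption: crc32 is a stand-in; Kafka's default partitioner hashes
    keys with murmur2, so real partition numbers will differ.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions round-robin across the consumers of one group."""
    assignment: dict[str, list[int]] = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Every message keyed by the same tenant_id lands on the same partition,
# so one consumer sees that tenant's events in order.
p1 = partition_for("tenant-42", 3)
p2 = partition_for("tenant-42", 3)

# Three consumers and three partitions: one partition each.
group = assign_partitions(3, ["sim-0", "sim-1", "sim-2"])

# A fourth consumer receives nothing: parallelism is capped by partition count.
oversized = assign_partitions(3, ["sim-0", "sim-1", "sim-2", "sim-3"])
```

The last call shows why adding consumers only helps while there are unassigned partitions left; beyond that point the topic's partition count has to grow as well.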
Worker Pool Pattern
Heavy simulations are processed off the FastAPI request thread:
- Async Workers: Custom async worker pool processes simulation jobs
- Celery Alternative: Celery can be used instead when more complex task scheduling is needed
- Resource Isolation: Workers run in separate processes/containers
- Priority Queue: High-priority simulations can be processed first
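A minimal sketch of an async worker pool with a priority queue, using only the standard library. The names `worker` and `run_pool` are hypothetical, the simulation itself is replaced by a placeholder `asyncio.sleep(0)`, and the workers here share one process, whereas the list above describes workers in separate processes or containers.

```python
import asyncio
import itertools


async def worker(queue: asyncio.PriorityQueue, results: list) -> None:
    """Pull jobs forever; the lowest priority number is served first."""
    while True:
        _priority, _seq, job_id = await queue.get()
        try:
            await asyncio.sleep(0)  # placeholder for the real simulation work
            results.append(job_id)
        finally:
            queue.task_done()


async def run_pool(jobs: list[tuple[int, str]], num_workers: int = 2) -> list[str]:
    """Process (priority, job_id) pairs and return job ids in completion order."""
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    seq = itertools.count()  # tie-breaker keeps FIFO order within one priority
    for priority, job_id in jobs:
        queue.put_nowait((priority, next(seq), job_id))

    results: list[str] = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(num_workers)]
    await queue.join()  # wait until every queued job has been marked done
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


jobs = [(5, "batch-a"), (1, "urgent"), (5, "batch-b")]
completed = asyncio.run(run_pool(jobs, num_workers=1))
```

With a single worker the completion order follows priority exactly; with several workers, jobs are still dequeued by priority but may finish in a different order.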
Caching Strategy (Redis)
Redis is used for multiple caching layers:
- Overview Cache: `/overview` endpoint results cached for 30 seconds
- Analysis Cache: Similar analysis queries return cached results
- Policy Cache: GA policy validations cached to reduce API calls
- Model Metadata Cache: Frequently accessed model information
Cache invalidation:
- Time-based TTL for most caches
- Event-based invalidation on policy changes
- Manual invalidation via admin API
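The combination of TTL expiry and event-based invalidation can be sketched with an in-memory stand-in for Redis. `TTLCache`, its key naming (`overview:`, `policy:`), and the 300-second policy TTL are assumptions made for illustration; only the 30-second overview TTL comes from the list above. In Redis itself, the expiry would typically be set on the key when it is written.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """In-memory stand-in for a Redis cache with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._entries[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # time-based invalidation
            return None
        return value

    def invalidate_prefix(self, prefix: str) -> int:
        """Event-based invalidation: drop every key under a prefix."""
        stale = [key for key in self._entries if key.startswith(prefix)]
        for key in stale:
            del self._entries[key]
        return len(stale)


now = [0.0]  # fake clock so the example does not have to wait
cache = TTLCache(clock=lambda: now[0])
cache.set("overview:tenant-42", {"models": 3}, ttl_seconds=30)
cache.set("policy:p-1", "valid", ttl_seconds=300)

fresh = cache.get("overview:tenant-42")    # served from cache
now[0] = 31.0
expired = cache.get("overview:tenant-42")  # TTL elapsed, caller must recompute
removed = cache.invalidate_prefix("policy:")  # e.g. on a policy change event
```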
Idempotency
All simulation requests support idempotency keys:
- Client provides `idempotency_key` in request
- Server checks if key exists before processing
- Returns existing result if key found
- Prevents duplicate processing on retries
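The check-then-return flow above can be sketched as follows. `IdempotentExecutor` and `simulate` are hypothetical names, and the result store is a plain dictionary; a shared deployment would need a common store with an atomic set-if-absent operation so that two concurrent requests with the same key cannot both run.

```python
from typing import Any, Callable


class IdempotentExecutor:
    """Run an operation at most once per idempotency key."""

    def __init__(self) -> None:
        # Assumption: in-memory storage; not shared across processes.
        self._results: dict[str, Any] = {}

    def run(self, idempotency_key: str, operation: Callable[[], Any]) -> Any:
        if idempotency_key in self._results:
            # Key already seen: return the stored result without re-running.
            return self._results[idempotency_key]
        result = operation()
        self._results[idempotency_key] = result
        return result


calls = []


def simulate() -> dict:
    """Stand-in for an expensive simulation; records each real execution."""
    calls.append(1)
    return {"simulation_id": len(calls)}


executor = IdempotentExecutor()
first = executor.run("key-abc", simulate)
retry = executor.run("key-abc", simulate)   # client retry with the same key
other = executor.run("key-xyz", simulate)   # a different key runs again
```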
Circuit Breaker Pattern
External service calls use circuit breakers:
- GA validation calls: Opens after 5 consecutive failures
- Database connections: Fails fast when pool exhausted
- Kafka producers: Buffers messages when broker unavailable
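A minimal circuit breaker sketch that opens after 5 consecutive failures, as described for GA validation calls. The class and exception names are hypothetical, and the 30-second `reset_timeout` before a trial call is an assumption; the source does not state how or when the circuit closes again.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,  # assumption: not specified in the source
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0       # any success resets the consecutive-failure count
        self._opened_at = None
        return result


def flaky_validation() -> None:
    raise ConnectionError("GA validation unavailable")


now = [0.0]  # fake clock for a deterministic example
breaker = CircuitBreaker(failure_threshold=5, reset_timeout=30.0, clock=lambda: now[0])

outcomes = []
for _ in range(6):
    try:
        breaker.call(flaky_validation)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 31.0
recovered = breaker.call(lambda: "ok")  # trial call succeeds and closes the circuit
```

The sixth call never reaches the failing service, which is the point of the pattern: once the circuit is open, callers get an immediate error instead of waiting on a dependency that is known to be down.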
Retry Strategy
- Exponential Backoff: Retries with increasing delays (1s, 2s, 4s, 8s)
- Max Retries: 3 attempts for transient failures
- Dead Letter Queue: Failed messages after max retries sent to DLQ
- Idempotent Operations: Retries are safe due to idempotency keys
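The retry flow can be sketched as below. `backoff_delays` and `process_with_retry` are hypothetical helpers, the dead letter queue is a plain list, and the sketch assumes "max retries" means one initial attempt followed by up to three retries (delays of 1s, 2s, 4s); a fourth retry would add the 8s delay.

```python
import time
from typing import Any, Callable


def backoff_delays(max_retries: int, base: float = 1.0) -> list[float]:
    """Delays that double on each retry: base, 2*base, 4*base, ..."""
    return [base * 2 ** attempt for attempt in range(max_retries)]


def process_with_retry(
    message: Any,
    handler: Callable[[Any], None],
    dead_letters: list,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the handler, retrying with backoff; dead-letter the message on exhaustion."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):  # one initial attempt plus the retries
        try:
            handler(message)
            return True
        except Exception:
            # Assumption: every exception is treated as transient here; real
            # code would retry only on known transient error types.
            if attempt == max_retries:
                break
            sleep(delays[attempt])
    dead_letters.append(message)  # stand-in for publishing to a DLQ topic
    return False


def always_fails(message: Any) -> None:
    raise TimeoutError("broker timeout")


slept: list[float] = []  # record delays instead of actually sleeping
dlq: list = []
ok = process_with_retry({"id": 1}, always_fails, dlq, sleep=slept.append)
```

Because the handler may run several times for the same message, this pattern is only safe when the handler is idempotent, which is what the idempotency keys provide.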
Related Documentation
- Implementation Overview - Back to implementation index
- Kafka Topics - Event-driven architecture
- Observability & Alerts - Performance monitoring