Observability & Alerts
DIA is instrumented to support monitoring, debugging, and performance optimization. The observability stack consists of distributed tracing, Prometheus metrics, and audit logging.
Distributed Tracing (OpenTelemetry)
DIA emits OpenTelemetry spans for all major operations:
- HTTP Requests: Each API endpoint creates a span with request/response details
- Background Jobs: Simulation and analysis jobs create spans for async operations
- External Calls: Spans for GA validation, database queries, Kafka operations
- Cross-Service Tracing: Spans are correlated across services using trace IDs
Example span structure:
dia.simulate (root span)
├── ga.validate_decision
├── db.create_decision
├── kafka.produce
└── background.simulation_processing
    ├── simulator.select_engine
    ├── simulator.run_scenario
    └── db.update_simulation
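The cross-service correlation by trace ID can be illustrated with the W3C Trace Context `traceparent` header, which is the default propagation format in OpenTelemetry. The sketch below is a standard-library approximation, assuming DIA uses that default; the helper names `inject` and `extract` are hypothetical, and a real service would rely on the OpenTelemetry SDK's propagators rather than hand-rolled parsing.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# traceparent layout: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_trace_id() -> str:
    """Random 16-byte trace ID, hex encoded (32 characters)."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Random 8-byte span ID, hex encoded (16 characters)."""
    return secrets.token_hex(8)


def inject(headers: Dict[str, str], trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (trace_id, parent_span_id) from incoming headers, or None if absent or invalid."""
    match = _TRACEPARENT_RE.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, _flags = match.groups()
    # All-zero IDs are invalid according to the Trace Context specification.
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    return trace_id, parent_span_id
```

A downstream service that calls `extract` on the incoming headers and starts its own span with the returned trace ID ends up in the same trace as the caller, which is what allows a span such as `kafka.produce` to be linked to the work done by the consumer.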
Prometheus Metrics
DIA exposes the following Prometheus metrics for monitoring:
- dia_simulations_total{status, scenario_type}: Counter of simulation requests by status and type
- dia_simulation_latency_seconds{scenario_type, quantile}: Histogram of simulation processing times
- dia_analysis_requests_total{status}: Counter of analysis requests
- dia_analysis_latency_seconds{quantile}: Histogram of analysis processing times
- dia_kafka_messages_consumed_total{topic}: Counter of Kafka messages consumed
- dia_kafka_consumer_lag{topic, partition}: Gauge of consumer lag
- dia_db_query_duration_seconds{operation}: Histogram of database query durations
- dia_active_simulations: Gauge of currently running simulations
- dia_governance_validations_total{result}: Counter of GA validation requests
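To make the shape of a scraped metric concrete, here is a minimal sketch of a labeled counter rendered in the Prometheus text exposition format. It uses only the standard library and is illustrative: the `LabeledCounter` class and the `baseline` scenario type are hypothetical, and a production service would normally use an official Prometheus client library instead.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


class LabeledCounter:
    """Toy counter with a fixed set of label names (hypothetical helper)."""

    def __init__(self, name: str, help_text: str, label_names: Sequence[str]) -> None:
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values: Dict[Tuple[str, ...], float] = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        """Increase the series identified by `labels`; counters never decrease."""
        if amount < 0:
            raise ValueError("counters can only increase")
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def render(self) -> str:
        """Render all series in the text exposition format.

        Escaping of quotes, backslashes, and newlines in label values is omitted for brevity.
        """
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            labels = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{labels}}} {value:g}")
        return "\n".join(lines) + "\n"


simulations = LabeledCounter(
    "dia_simulations_total",
    "Simulation requests by status and type",
    ("status", "scenario_type"),
)
simulations.inc(status="success", scenario_type="baseline")
simulations.inc(status="success", scenario_type="baseline")
simulations.inc(status="failed", scenario_type="baseline")
print(simulations.render(), end="")
```

Each distinct combination of label values becomes its own time series, so labels such as `status` and `scenario_type` are best kept to a small, bounded set of values.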
Audit Logging (S3)
All decision reasoning and raw logs are pushed to S3 for compliance and replay:
- Decision Logs: Complete reasoning chains, inputs, and outputs
- Policy Validation Logs: Full policy evaluation details
- Error Logs: Detailed error information for debugging
- Access Logs: All API access with user context
Logs are structured as JSON and organized by date/tenant for efficient querying:
s3://audit-logs/dia/
├── 2025/11/04/
│ ├── tenant-t101/
│ │ ├── decisions-20251104-0001.json
│ │ ├── simulations-20251104-0001.json
│ │ └── validations-20251104-0001.json
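The date/tenant layout above can be sketched as a small key builder. This is an illustration that assumes the naming pattern visible in the example tree (zero-padded four-digit sequence, `tenant-` prefix); the function name `audit_log_uri` and its parameters are hypothetical rather than part of DIA.

```python
from datetime import date


def audit_log_uri(
    log_type: str,
    day: date,
    tenant_id: str,
    sequence: int,
    bucket: str = "audit-logs",
    prefix: str = "dia",
) -> str:
    """Build the S3 URI for one audit log file.

    Layout assumed from the example tree:
    s3://<bucket>/<prefix>/YYYY/MM/DD/tenant-<id>/<type>-YYYYMMDD-NNNN.json
    """
    if sequence < 0:
        raise ValueError("sequence must be non-negative")
    directory = f"{prefix}/{day:%Y/%m/%d}/tenant-{tenant_id}"
    filename = f"{log_type}-{day:%Y%m%d}-{sequence:04d}.json"
    return f"s3://{bucket}/{directory}/{filename}"


print(audit_log_uri("decisions", date(2025, 11, 4), "t101", 1))
```

Putting the date ahead of the tenant in the key means a single prefix listing returns everything for a given day, while a per-tenant query for that day only needs one additional path segment.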
Alerting Rules
Prometheus alerting rules are configured for:
- High simulation failure rate (>5% failures in 5 minutes)
- Slow simulation processing (p95 latency >30 seconds)
- Kafka consumer lag (lag >1000 messages)
- Database connection pool exhaustion
- Governance validation failures
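The three rules with explicit thresholds can be sketched as plain predicates. In Prometheus these would be written as PromQL expressions in a rules file; the Python below only illustrates the arithmetic, and the alert names, function names, and input shapes are assumptions. The quantile estimate interpolates linearly inside a bucket, which is a simplified version of what Prometheus does for histograms.

```python
import math
from typing import List, Sequence, Tuple

FAILURE_RATE_THRESHOLD = 0.05    # more than 5% failures in the window
P95_LATENCY_THRESHOLD_S = 30.0   # p95 latency above 30 seconds
CONSUMER_LAG_THRESHOLD = 1000    # lag above 1000 messages


def failure_rate(failed: int, total: int) -> float:
    """Fraction of failed simulations in the window; 0.0 when there was no traffic."""
    return failed / total if total > 0 else 0.0


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, int]]) -> float:
    """Estimate quantile `q` from (upper_bound, cumulative_count) buckets.

    Buckets must be sorted by bound and end with an infinite bound.
    Observations are assumed to be spread evenly within each bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # No finite upper bound to interpolate towards.
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


def firing_alerts(
    failed: int,
    total: int,
    latency_buckets: Sequence[Tuple[float, int]],
    max_lag: int,
) -> List[str]:
    """Return the (hypothetical) names of alerts whose condition holds."""
    alerts = []
    if failure_rate(failed, total) > FAILURE_RATE_THRESHOLD:
        alerts.append("HighSimulationFailureRate")
    p95 = estimate_quantile(0.95, latency_buckets)
    if not math.isnan(p95) and p95 > P95_LATENCY_THRESHOLD_S:
        alerts.append("SlowSimulationProcessing")
    if max_lag > CONSUMER_LAG_THRESHOLD:
        alerts.append("KafkaConsumerLagHigh")
    return alerts
```

For example, with 8 failures out of 100 simulations, latency buckets `[(10.0, 50), (30.0, 90), (60.0, 100), (math.inf, 100)]`, and a maximum lag of 500, the failure-rate and latency conditions hold while the lag condition does not.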
Related Documentation
- Implementation Overview - Back to implementation index
- Scaling & Reliability Patterns - Performance optimization
- Security Considerations - Audit logging security