Observability & Alerts

DIA is instrumented for monitoring, debugging, and performance optimization. Its observability stack combines distributed tracing, Prometheus metrics, and audit logging, with Prometheus alerting rules evaluated against the metrics.

Distributed Tracing (OpenTelemetry)

DIA emits OpenTelemetry spans for all major operations:

  • HTTP Requests: Each API endpoint creates a span with request/response details
  • Background Jobs: Simulation and analysis jobs create spans for async operations
  • External Calls: GA (governance) validation, database queries, and Kafka operations each create a span
  • Cross-Service Tracing: Spans are correlated across services using trace IDs

Example span structure:

dia.simulate (root span)
├── ga.validate_decision
├── db.create_decision
├── kafka.produce
└── background.simulation_processing
    ├── simulator.select_engine
    ├── simulator.run_scenario
    └── db.update_simulation

Prometheus Metrics

DIA exposes the following Prometheus metrics for monitoring:

  • dia_simulations_total{status, scenario_type}: Counter of simulation requests by status and type
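  • dia_simulation_latency_seconds{scenario_type, quantile}: Distribution of simulation processing times in seconds, reported per quantile (in Prometheus, a quantile label is exposed by the summary type; histograms expose le buckets)
  • dia_analysis_requests_total{status}: Counter of analysis requests
  • dia_analysis_latency_seconds{quantile}: Distribution of analysis processing times in seconds, reported per quantile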
  • dia_kafka_messages_consumed_total{topic}: Counter of Kafka messages consumed
  • dia_kafka_consumer_lag{topic, partition}: Gauge of consumer lag
  • dia_db_query_duration_seconds{operation}: Histogram of database query durations
  • dia_active_simulations: Gauge of currently running simulations
  • dia_governance_validations_total{result}: Counter of GA validation requests

Audit Logging (S3)

All decision reasoning and raw logs are pushed to S3 for compliance and replay:

  • Decision Logs: Complete reasoning chains, inputs, and outputs
  • Policy Validation Logs: Full policy evaluation details
  • Error Logs: Detailed error information for debugging
  • Access Logs: All API access with user context

Logs are structured as JSON and organized by date/tenant for efficient querying:

s3://audit-logs/dia/
├── 2025/11/04/
│   ├── tenant-t101/
│   │   ├── decisions-20251104-0001.json
│   │   ├── simulations-20251104-0001.json
│   │   └── validations-20251104-0001.json
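
The layout above can be expressed as a small key-building helper. This Python sketch infers the naming scheme from the single example shown (date path, tenant- prefix, log type, date stamp, four-digit sequence number), so treat the scheme as an assumption. Both function names are hypothetical.

```python
from datetime import date


def tenant_day_prefix(day: date, tenant_id: str) -> str:
    """Prefix under which one tenant's logs for one day are stored."""
    return f"dia/{day:%Y/%m/%d}/tenant-{tenant_id}/"


def audit_log_key(day: date, tenant_id: str, log_type: str, sequence: int) -> str:
    """Build an object key such as
    dia/2025/11/04/tenant-t101/decisions-20251104-0001.json
    """
    if sequence < 1:
        raise ValueError("sequence numbers start at 1")
    # Assumed pattern: <log_type>-<YYYYMMDD>-<4-digit sequence>.json
    filename = f"{log_type}-{day:%Y%m%d}-{sequence:04d}.json"
    return tenant_day_prefix(day, tenant_id) + filename


if __name__ == "__main__":
    print(audit_log_key(date(2025, 11, 4), "t101", "decisions", 1))
```

Because the date and tenant come first in the key, listing objects under the prefix returned by tenant_day_prefix retrieves one tenant's logs for one day without scanning the rest of the bucket.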

Alerting Rules

Prometheus alerting rules are configured for:

  • High simulation failure rate (>5% failures in 5 minutes)
  • Slow simulation processing (p95 latency >30 seconds)
  • Kafka consumer lag (lag >1000 messages)
  • Database connection pool exhaustion
  • Governance validation failures
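
The three rules with numeric thresholds can be restated as plain comparisons. The Python sketch below does only that; actual Prometheus alerting rules are written as PromQL expressions in rule files, which this page does not show. The WindowSnapshot type and the alert names are hypothetical, and the two rules without a stated threshold are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowSnapshot:
    """Hypothetical summary of the metrics over one 5-minute window."""

    simulations_total: int
    simulations_failed: int
    p95_latency_seconds: float
    max_consumer_lag: int


def firing_alerts(snapshot: WindowSnapshot) -> list:
    """Return the names of the alerts whose thresholds are exceeded."""
    alerts = []
    # More than 5% failures; integer arithmetic avoids float rounding,
    # and an empty window (0 simulations) never fires.
    if snapshot.simulations_failed * 100 > snapshot.simulations_total * 5:
        alerts.append("HighSimulationFailureRate")
    # p95 latency above 30 seconds.
    if snapshot.p95_latency_seconds > 30:
        alerts.append("SlowSimulationProcessing")
    # Consumer lag above 1000 messages.
    if snapshot.max_consumer_lag > 1000:
        alerts.append("KafkaConsumerLag")
    return alerts


if __name__ == "__main__":
    window = WindowSnapshot(
        simulations_total=200,
        simulations_failed=11,
        p95_latency_seconds=12.0,
        max_consumer_lag=1000,
    )
    print(firing_alerts(window))
```

All three comparisons are strict, matching the ">" in the rule descriptions: a failure rate of exactly 5% or a lag of exactly 1000 messages does not fire.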