Core Services Specification
River Agents are implemented across 9 backend services. Services 1, 2, 3, 5, 7, 8, and 9 run as internal modules within Backend :8005. Service 4 (Reasoning) is a separate stateless process at river-agent :8007. Service 6 (Tool Invocation) describes the tool dispatch path that routes through TLO Gateway :8001. This document specifies the API contracts, internal components, owned tables, and inter-service dependencies for each.
Quick Navigation
- Service Dependency Map
- Service 1: Agent Management
- Service 2: Trigger Ingestion
- Service 3: Agent Execution Runner
- Service 4: Reasoning Service
- Service 5: Data Access and Schema
- Service 6: Tool and Workflow Invocation
- Service 7: Governance and Approval
- Service 8: Monitoring and Telemetry
- Service 9: Execution Logging and Audit
Service Dependency Map

Ownership boundary: Backend :8005 owns the river_agents PostgreSQL schema and is the only process that writes to it. river-agent :8007 is stateless -- it receives all necessary state in the request body and has no direct database connection.
Service 1: Agent Management
Owns all persistent agent state. Every other service reads agent configuration from records this service writes.
Owns: river_agents.agents, river_agents.agent_versions, river_agents.agent_version_tools, river_agents.agent_version_data_sources, river_agents.agent_templates
Internal Components
| Component | Responsibility |
|---|---|
| AgentService | CRUD business logic for the agents table. Enforces org/workspace scoping on all reads and writes. |
| LifecycleStateMachine | The single code path for all agents.status writes. Validates pre-conditions before any transition. No other component may update agents.status directly. |
| AgentValidator | Runs the validation job triggered by POST /api/v1/agents/{id}/validate. Tests data source connectivity, checks that all bound tools exist in the registry, resolves governance policy references, and validates trigger config syntax. |
| VersionManager | Creates an immutable agent_versions snapshot on every post-deployment configuration change. Updates agents.current_version_id on deploy and rollback. Enforces the partial unique index that allows only one active version per agent. |
| TemplateService | Manages agent_templates records and the conversion of a template into a pre-filled agent configuration for the creation wizard. |
API Endpoints
| Method | Endpoint | Permission | Description |
|---|---|---|---|
| GET | /api/v1/agents | agent:read | List agents in workspace with search, status filter, and pagination |
| POST | /api/v1/agents | agent:create | Create agent in draft status |
| GET | /api/v1/agents/{id} | agent:read | Full agent detail including current version, health status, and recent run summary |
| PUT | /api/v1/agents/{id} | agent:update | Update configuration -- creates a new version if agent is deployed or active |
| DELETE | /api/v1/agents/{id} | agent:delete | Soft delete -- transitions to archived |
| POST | /api/v1/agents/{id}/validate | agent:update | Run async validation checks; returns validation result with field-level errors |
| POST | /api/v1/agents/{id}/deploy | agent:deploy | Deploy validated agent; registers triggers and transitions to active |
| POST | /api/v1/agents/{id}/pause | agent:deploy | Suspend all triggers; transition to paused |
| POST | /api/v1/agents/{id}/resume | agent:deploy | Re-activate triggers; transition back to active |
| POST | /api/v1/agents/{id}/archive | agent:delete | Retire agent; preserve for audit; block all future runs |
| GET | /api/v1/agents/{id}/versions | agent:read | List all version records for an agent |
| POST | /api/v1/agents/{id}/versions/{vid}/rollback | agent:deploy | Update current_version_id to a previous version; emits audit event |
| GET | /api/v1/agent-templates | agent:read | List available agent templates |
| GET | /api/v1/agent-templates/{id} | agent:read | Full template detail including configuration defaults |
| POST | /api/v1/agents/from-template/{template_id} | agent:create | Create pre-filled agent from template |
| POST | /api/v1/agents/generate | agent:create | AI-powered agent generation from natural language via river-agent :8007 |
Implementation Notes
Validation is asynchronous. The POST /validate endpoint queues a background job and returns a 202. The frontend polls the agent detail endpoint until validation_status is passed or failed. Validation can take up to 30 seconds for agents with multiple data sources.
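The polling behavior described above can be sketched as follows. This is a minimal illustration, not the frontend's actual code: wait_for_validation, fetch_agent, and the timeout and interval defaults are hypothetical, and any status other than passed or failed is assumed to mean the job is still in progress.

```python
import time
from typing import Callable, Dict

# Terminal values of validation_status named in the spec.
TERMINAL_STATUSES = {"passed", "failed"}


def wait_for_validation(
    fetch_agent: Callable[[], Dict],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll the agent detail payload until validation reaches a terminal status.

    fetch_agent is a hypothetical callable that returns the parsed body of
    GET /api/v1/agents/{id}. The timeout and interval defaults are assumptions.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_agent().get("validation_status")
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError("validation did not finish before the timeout")
        sleep(interval_s)
```

Injecting sleep and clock keeps the helper testable without real delays; a caller would normally leave both at their defaults.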
PUT on a deployed agent does not modify the active version. It creates a new version with status = 'draft'. The agent continues running on the existing active version until the operator explicitly deploys the new one. This prevents configuration changes from affecting in-flight runs.
Rollback does not create a new version. It updates current_version_id and emits an agent_version_rolled_back audit event. The version history is preserved intact.
Service 2: Trigger Ingestion
Normalizes all trigger types into a uniform ExecutionRequest and routes to the Agent Execution Runner. This service has no knowledge of what happens during a run -- its only job is to ingest, validate, and enqueue.
Owns: river_agents.agent_triggers, river_agents.trigger_rejection_log
Internal Components
| Component | Handles | Ingestion Mechanism |
|---|---|---|
| ManualTriggerHandler | manual | HTTP POST from frontend or external API call |
| ScheduledTriggerHandler | scheduled | Internal scheduler polls agent_triggers for due expressions every 10 seconds; handles timezone normalization |
| EventTriggerHandler | event | Kafka consumer group river-agents-events; evaluates event_type and optional payload_conditions filter per agent |
| APITriggerHandler | api | Authenticated endpoint; validates API key, enforces per-agent rate limits, optionally validates payload against trigger_config.payload_schema |
| ThresholdTriggerHandler | threshold | Subscribes to metric stream from Service 8; evaluates threshold_expression; enforces cooldown_seconds to prevent re-fire storms |
| WorkflowTriggerHandler | workflow | Listens for Temporal workflow signals matching trigger_config.workflow_id and signal_name |
| TriggerDispatcher | All | Normalizes all handler outputs into ExecutionRequest; validates agent is active; enforces concurrency policy; enqueues to Execution Queue |
API Endpoints
| Method | Endpoint | Permission | Description |
|---|---|---|---|
| POST | /api/v1/agents/{id}/run | agent:execute | Manual trigger -- immediately enqueues execution |
| POST | /api/v1/agents/{id}/triggers | agent:update | Register a new trigger configuration for an agent |
| GET | /api/v1/agents/{id}/triggers | agent:read | List all trigger configurations for an agent |
| PUT | /api/v1/agents/{id}/triggers/{tid} | agent:update | Update trigger configuration |
| DELETE | /api/v1/agents/{id}/triggers/{tid} | agent:update | Remove trigger |
| POST | /api/v1/agent-webhooks/{agent_id} | API key auth | Inbound webhook endpoint for event-based triggers |
| POST | /api/v1/agent-api/{agent_id}/execute | API key auth | Authenticated execution endpoint for external system triggers |
ExecutionRequest Schema
Every trigger normalizes to this structure before the Dispatcher enqueues it.
| Field | Type | Description |
|---|---|---|
| request_id | UUID | Idempotency key; duplicate request_id values are rejected by the Runner |
| agent_id | UUID | Target agent |
| agent_version_id | integer | Pinned at ingestion time from agents.current_version_id; immutable for the run |
| trigger_type | enum | manual, scheduled, event, api, threshold, workflow |
| trigger_source | string | Audit label (e.g., "cron:daily-8am", "webhook:zendesk-ticket") |
| trigger_payload | JSONB | Contextual data from the trigger source; passed to river-agent as part of the context bundle |
| requested_by | UUID | User ID for manual/API triggers; service identifier for automated triggers |
| requested_at | timestamptz | Ingestion timestamp; used for SLA and latency tracking |
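As a rough illustration of the schema above, the structure could be modeled as an immutable record. This sketch uses a standard-library dataclass so it runs without dependencies; the class names, the defaults for request_id and requested_at, and the choice of dataclass over any other modeling library are assumptions, not the documented implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict
from uuid import UUID, uuid4


class TriggerType(str, Enum):
    MANUAL = "manual"
    SCHEDULED = "scheduled"
    EVENT = "event"
    API = "api"
    THRESHOLD = "threshold"
    WORKFLOW = "workflow"


@dataclass(frozen=True)  # frozen: the request is not mutated after ingestion
class ExecutionRequest:
    agent_id: UUID
    agent_version_id: int  # pinned from agents.current_version_id at ingestion
    trigger_type: TriggerType
    trigger_source: str  # audit label, e.g. "cron:daily-8am"
    requested_by: UUID  # user ID or service identifier
    trigger_payload: Dict[str, Any] = field(default_factory=dict)
    # Assumption: the idempotency key is generated at ingestion when not supplied.
    request_id: UUID = field(default_factory=uuid4)
    requested_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```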
Implementation Notes
Concurrency is enforced at dispatch time. If agent_versions.runtime_config.allow_concurrent_runs is false and a run is already in running status, the TriggerDispatcher applies the on_concurrent_trigger policy: queue holds the request until the current run finishes, drop discards it with a rejection log entry, and replace cancels the running execution and starts fresh.
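The dispatch-time decision can be sketched as a pure function. The policy values (queue, drop, replace) come from the paragraph above; the DispatchAction names and the decide_dispatch signature are hypothetical and exist only for this illustration.

```python
from enum import Enum


class ConcurrencyPolicy(str, Enum):
    QUEUE = "queue"
    DROP = "drop"
    REPLACE = "replace"


class DispatchAction(str, Enum):
    ENQUEUE = "enqueue"  # no conflict: enqueue immediately
    HOLD = "hold"  # queue: wait until the current run finishes
    REJECT = "reject"  # drop: discard and write a rejection log entry
    CANCEL_AND_ENQUEUE = "cancel_and_enqueue"  # replace: cancel, then start fresh


def decide_dispatch(
    allow_concurrent_runs: bool,
    has_running_execution: bool,
    policy: ConcurrencyPolicy,
) -> DispatchAction:
    """Map the on_concurrent_trigger policy to a dispatch action."""
    if allow_concurrent_runs or not has_running_execution:
        return DispatchAction.ENQUEUE
    if policy is ConcurrencyPolicy.QUEUE:
        return DispatchAction.HOLD
    if policy is ConcurrencyPolicy.DROP:
        return DispatchAction.REJECT
    return DispatchAction.CANCEL_AND_ENQUEUE
```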
Event triggers use payload condition matching. The EventTriggerHandler evaluates trigger_config.payload_conditions as a JSON path expression against the incoming webhook payload. Agents with overlapping event type subscriptions are each evaluated independently.
Service 3: Agent Execution Runner
The central coordinator for every agent run. Bridges the normalized trigger from Service 2 to the reasoning engine in Service 4, manages approval gate hibernation, enforces execution limits, and finalizes the run record.
Owns: river_agents.agent_executions, river_agents.agent_memory
Internal Components
| Component | Responsibility |
|---|---|
| AgentRunWorkflow | The root Temporal workflow. Owns the full execution lifecycle from running to a terminal state (completed, failed, cancelled, budget_exhausted). All execution state lives in this workflow and the linked DB records. |
| ContextBuilder | Assembles the AgentContext bundle before the first reasoning call and after each approval gate resume. Loads agent version config, long_term_context, trigger payload, data source schema metadata, and the active tool registry. |
| StateSerializer | On gate: writes agent_executions.serialized_state (JSONB containing conversation history, pending tool call, observation buffer, memory snapshot) and transitions execution to paused. On resume: deserializes and reconstructs AgentContext from the stored snapshot. |
| RetryManager | Applies Temporal retry policy per activity: 3 retries with exponential backoff for transient failures. Marks tool calls as permanently failed after retry exhaustion. |
| TimeoutManager | Enforces per-tool timeout (default 30 seconds) and per-run turn limit (default 15 turns). A run that exhausts turns without reaching finalization terminates with budget_exhausted status. |
API Endpoints
| Method | Endpoint | Permission | Description |
|---|---|---|---|
| GET | /api/v1/agent-runs | agent:read | List runs with filters: agent, status, trigger type, date range, pagination |
| GET | /api/v1/agent-runs/{id} | agent:read | Full run detail: status, timing, token usage, final output |
| POST | /api/v1/agent-runs/{id}/stop | agent:execute | Emergency cancel a running execution; sends cancellation signal to Temporal workflow |
| POST | /api/v1/agent-runs/{id}/retry | agent:execute | Re-enqueue a failed run using the same trigger payload and version |
| GET | /api/v1/agent-runs/{id}/logs | agent:read | Paginated turn-level log entries for a run |
| GET | /api/v1/agent-runs/{id}/stream | agent:read | WebSocket upgrade -- real-time run progress events |
Implementation Notes
The Temporal workflow ID is derived from execution_id. The pattern is agent-run-{execution_id}. This means the workflow can be looked up by execution ID directly from Temporal's API without a separate index table. It also means the execution record must be created before the Temporal workflow is started -- the agent_executions.id is the source of truth.
Approval signals go directly to Temporal, not through the REST layer. When an operator calls PATCH /api/v1/approvals/{id}, Service 7 updates the approval_requests record and sends a signal via the Temporal client SDK: temporal.signal_workflow(workflow_id="agent-run-{execution_id}", signal="approval_resolved", payload={...}). The REST endpoint does not block waiting for the workflow to resume.
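A minimal sketch of the ID derivation and the fire-and-forget signal follows. It deliberately avoids depending on a Temporal SDK: WorkflowSignaler is a hypothetical protocol standing in for whatever client wrapper Service 7 uses, RecordingSignaler is an in-memory stand-in for demonstration only, and the payload fields are assumptions.

```python
import asyncio
from typing import Any, Dict, List, Protocol, Tuple


def workflow_id_for(execution_id: str) -> str:
    """Derive the Temporal workflow ID from agent_executions.id."""
    return f"agent-run-{execution_id}"


class WorkflowSignaler(Protocol):
    # Hypothetical abstraction over the Temporal client used by Service 7.
    async def signal(
        self, workflow_id: str, signal_name: str, payload: Dict[str, Any]
    ) -> None: ...


class RecordingSignaler:
    """In-memory stand-in that records signals instead of sending them."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str, Dict[str, Any]]] = []

    async def signal(
        self, workflow_id: str, signal_name: str, payload: Dict[str, Any]
    ) -> None:
        self.sent.append((workflow_id, signal_name, payload))


async def notify_approval_resolved(
    signaler: WorkflowSignaler, execution_id: str, approval_id: str, resolution: str
) -> str:
    """Send approval_resolved and return without waiting for the resume."""
    workflow_id = workflow_id_for(execution_id)
    payload = {"approval_id": approval_id, "resolution": resolution}
    await signaler.signal(workflow_id, "approval_resolved", payload)
    return workflow_id


if __name__ == "__main__":
    demo = RecordingSignaler()
    asyncio.run(notify_approval_resolved(demo, "exec-1", "appr-1", "approve"))
    print(demo.sent)
```

Because the function only awaits delivery of the signal, the REST handler that calls it returns as soon as Temporal accepts the signal, which matches the non-blocking behavior described above.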
Memory write-back is non-blocking. After finalization, the ContextBuilder writes the memory_updates delta from the reasoning result back to river_agents.agent_memory. This write happens asynchronously after the run record is marked completed -- a write failure does not change the run's terminal status.
Service 4: Reasoning Service
The LLM reasoning engine. This service is implemented as a separate FastAPI process at river-agent :8007. It is entirely stateless -- all state arrives in the request body and all outputs are returned in the response. It is called once per reasoning turn by Service 3.
Owns: No database tables. Stateless.
Internal Components
| Component | Responsibility |
|---|---|
| AgentLoop | Entry point. Receives the AgentContext bundle, runs one Reason -> Act -> Observe iteration, and returns a structured ReasoningResult containing the next tool call or a finalization signal. |
| SystemPromptBuilder | Dynamically composes the system prompt per turn by assembling sections from the AgentContext. The prompt is rebuilt on every turn to reflect the current observation state. |
| RiverCore | Multi-provider LLM router. Classifies the turn complexity, selects the appropriate model tier and provider, executes the inference call, and handles failover to the next provider in the chain on error. |
| ToolCallParser | Parses the raw LLM output into a structured ToolCall object (tool name + validated arguments). Rejects malformed tool selections before returning to Service 3. |
System Prompt Composition
The SystemPromptBuilder assembles the system prompt from these sections in order:
| Section | Content | Source |
|---|---|---|
| Role Definition | "You are {agent.name}, a {agent.business_function} agent..." | agent_versions.name, business_function |
| Goal and Instructions | The agent's natural language instruction set | agent_versions.instruction_set |
| Available Tools | JSON schema definitions for all tools in the active tool registry | Tool registry filtered by agent_versions.selected_tools |
| Data Context | Schema metadata for all connected data sources | Assembled by ContextBuilder from Service 5 |
| Governance Constraints | Action level and applicable policy constraints in natural language | agent_versions.action_level, resolved policies |
| Long-Term Memory | Structured summary of learnings from past runs | agent_memory.context_snapshot |
| Trigger Context | The trigger type and payload for this run | ExecutionRequest.trigger_payload |
| Conversation History | All prior turns in this execution (reasoning + observations) | Accumulated in AgentContext across turns |
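The ordered assembly in the table above might look like the sketch below. The section keys, the heading format, and the choice to skip empty sections are assumptions made for illustration; only the section names and their order come from the table.

```python
from typing import Dict, List, Tuple

# (key, title) pairs in the order given by the composition table.
SECTION_ORDER: List[Tuple[str, str]] = [
    ("role_definition", "Role Definition"),
    ("goal_and_instructions", "Goal and Instructions"),
    ("available_tools", "Available Tools"),
    ("data_context", "Data Context"),
    ("governance_constraints", "Governance Constraints"),
    ("long_term_memory", "Long-Term Memory"),
    ("trigger_context", "Trigger Context"),
    ("conversation_history", "Conversation History"),
]


def build_system_prompt(sections: Dict[str, str]) -> str:
    """Join the available sections in the fixed order, one block per section."""
    parts = []
    for key, title in SECTION_ORDER:
        body = sections.get(key, "").strip()
        if body:  # assumption: empty sections are omitted entirely
            parts.append(f"## {title}\n{body}")
    return "\n\n".join(parts)
```

Because the function is pure, rebuilding the prompt on every turn only requires passing in the updated conversation history.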
Implementation Notes
Service 3 calls Service 4 once per turn, not once per run. The request body contains the full AgentContext at the current turn's state. This is intentional -- it keeps river-agent stateless and allows Service 3 (via Temporal) to be the durable state holder. The tradeoff is larger request payloads on longer runs.
river-agent does not know its action level. The tool registry it receives in the context bundle is already filtered by Service 3 based on action level. If a tool requires approval, that check happens in Service 7 after river-agent returns its tool selection. From river-agent's perspective, it selects from the tools it was given -- governance enforcement is downstream.
Finalization is signaled by a special tool call. When river-agent determines the goal is reached, it returns a ToolCall with tool_name = "finalize" and a structured final_output in the arguments. Service 3 detects this and begins the run finalization sequence without invoking Service 6.
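The per-turn contract between Service 3 and Service 4 can be sketched as a simple loop. This is an in-process illustration only: in the real system the loop runs inside a Temporal workflow, and reason and dispatch_tool are hypothetical callables standing in for the HTTP call to river-agent and the Service 6 dispatch path. The default turn limit of 15 and the budget_exhausted status are taken from the Service 3 component table.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToolCall:
    tool_name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


@dataclass
class RunResult:
    status: str
    final_output: Any = None
    turns_used: int = 0


def run_agent(
    reason: Callable[[List[Any]], ToolCall],
    dispatch_tool: Callable[[ToolCall], Any],
    max_turns: int = 15,
) -> RunResult:
    """Call the reasoning step once per turn until finalize or turn exhaustion."""
    observations: List[Any] = []
    for turn in range(1, max_turns + 1):
        call = reason(observations)  # one stateless reasoning call per turn
        if call.tool_name == "finalize":
            # Finalization bypasses tool dispatch entirely.
            return RunResult("completed", call.arguments.get("final_output"), turn)
        observations.append(dispatch_tool(call))
    return RunResult("budget_exhausted", None, max_turns)
```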
Service 5: Data Access and Schema
Provides the reasoning engine with schema-aware, ACL-governed access to all connected data sources. All reads pass through TLO Gateway -- this service never connects directly to external data sources.
Owns: No dedicated tables. Reads from platform.data_sources (cross-schema). Schema metadata is cached in Redis with per-source TTL.
Internal Components
| Component | Responsibility |
|---|---|
| SchemaDiscoveryService | Retrieves and caches table, column, and relationship metadata for connected data sources. Cache TTL is configurable per data source (default: 3600 seconds). Force-refresh is triggered by POST /data-sources/{id}/discover-schema. |
| QueryExecutionEngine | Translates a validated query spec into a Data Orchestration Service request. Handles result pagination, type normalization, and error classification (user error vs. connectivity error vs. timeout). |
| SemanticCatalog | Vector search over data source metadata in Qdrant. Used by the reasoning engine to resolve table/column references by semantic meaning when exact names are unknown. |
| DataConnectorProxy | Routes all data source interactions through TLO Gateway to the Data Orchestration Service. Injects the X-Agent-ID and X-Execution-ID headers for audit trail correlation in the downstream service. |
API Endpoints
| Method | Endpoint | Permission | Description |
|---|---|---|---|
| GET | /api/v1/data-sources/{id}/schema | data_source:view | Retrieve cached schema metadata for a connected data source |
| POST | /api/v1/query/execute | data_source:query | Execute a query against a connected data source via Data Orchestration |
| GET | /api/v1/catalog/search | data_source:view | Semantic search over data source metadata using Qdrant vectors |
| POST | /api/v1/data-sources/{id}/test | data_source:view | Test connectivity to a data source |
Supported Data Source Types
| Category | Sources |
|---|---|
| SQL Databases | PostgreSQL, MySQL, SQL Server, Snowflake, BigQuery, Redshift |
| NoSQL Databases | MongoDB, DynamoDB, Elasticsearch |
| SaaS APIs | Salesforce, HubSpot, Zendesk, Stripe, Shopify |
| File Storage | CSV and Excel files in MinIO, Google Sheets |
| Custom APIs | Any REST API with a registered OpenAPI specification |
Implementation Notes
Schema metadata is cached, not live. The reasoning engine sees a schema snapshot, not the live database state. Stale schema causes query generation errors at execution time. The SchemaDiscoveryService detects schema-related query errors and triggers a background refresh for the affected data source.
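The cache-then-refresh behavior can be illustrated with a small TTL cache. This sketch is deliberately simplified: it keeps entries in process memory rather than Redis and refreshes synchronously rather than in a background job, and the class and method names are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class SchemaCache:
    """TTL cache for schema snapshots, keyed by data source ID."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float = 3600.0,  # default TTL stated in the spec
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, source_id: str) -> Any:
        """Return the cached snapshot, reloading it if missing or expired."""
        entry = self._entries.get(source_id)
        if entry is not None and self._clock() - entry[0] < self._ttl:
            return entry[1]
        return self.refresh(source_id)

    def refresh(self, source_id: str) -> Any:
        """Reload the schema and reset its TTL."""
        schema = self._loader(source_id)
        self._entries[source_id] = (self._clock(), schema)
        return schema

    def on_query_error(self, source_id: str, is_schema_error: bool) -> bool:
        """Refresh when a query failed for a schema-related reason."""
        if is_schema_error:
            self.refresh(source_id)
        return is_schema_error
```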
ACL is checked per call at TLO Gateway. A data_source:view permission is required for schema discovery; data_source:query is required for query execution. These checks happen at TLO on every call. A data source permission revoked mid-run takes effect on the next tool invocation within that run.
Service 6: Tool and Workflow Invocation
Describes the path a tool call takes from the reasoning result to execution on a target service. This is not a standalone process -- it is the pattern implemented by Service 3 when dispatching an approved tool call through TLO Gateway.
Owns: No tables. Operates as a dispatch path within Service 3.
Tool Execution Steps
| Step | Action | Owner |
|---|---|---|
| 1 | river-agent selects tool and returns structured ToolCall | Service 4 |
| 2 | Service 3 validates arguments against tool's Pydantic input schema | Service 3 / Tool Registry |
| 3 | Service 3 calls Service 7 for action level and policy check | Service 7 |
| 4 | If gated: approval flow; if blocked: turn error; if allowed: proceed | Service 7 |
| 5 | Service 3 sends tool dispatch request to TLO Gateway with governance token | Service 3 -> TLO |
| 6 | TLO validates JWT and per-tool ACL permission | TLO Gateway |
| 7 | TLO routes to target service (Backend, Data Orchestration, external API) | TLO Gateway |
| 8 | Target service executes and returns result | Target service |
| 9 | Service 3 receives result; passes to ResultValidator | Service 3 |
| 10 | Validated result is formatted as an observation and injected into the next AgentContext | Service 3 |
Tool Registry
All tools available to River Agents are registered in the Tool Registry. Engineers adding new tools must define:
- A unique tool_name identifier
- A Pydantic input schema (used for argument validation at step 2 and for prompt injection at system prompt build time)
- A Pydantic output schema (used for result validation at step 9)
- A write_classified boolean (determines whether the tool triggers action level checks in Service 7)
- A required TLO ACL permission string (e.g., "agent:execute", "data_source:query")
- A target service route (Backend, Data Orchestration, or external service URL)
Custom tools registered via OpenAPI spec upload are validated against this schema at registration time. A spec that cannot be mapped to a valid tool definition is rejected.
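A tool definition carrying the fields listed above could be sketched as follows. To stay dependency-free, the schema fields are typed as plain Python classes where the real registry would hold Pydantic model classes; ToolDefinition, ToolRegistry, and their methods are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ToolDefinition:
    tool_name: str
    input_schema: type  # stand-in for a Pydantic input model class
    output_schema: type  # stand-in for a Pydantic output model class
    write_classified: bool  # True triggers action level checks in Service 7
    required_permission: str  # TLO ACL permission, e.g. "data_source:query"
    target_route: str  # Backend, Data Orchestration, or an external URL


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, ToolDefinition] = {}

    def register(self, tool: ToolDefinition) -> None:
        """Reject empty or duplicate names so tool_name stays unique."""
        if not tool.tool_name:
            raise ValueError("tool_name must be non-empty")
        if tool.tool_name in self._tools:
            raise ValueError(f"duplicate tool_name: {tool.tool_name}")
        self._tools[tool.tool_name] = tool

    def for_agent(self, selected_tools: Iterable[str]) -> List[ToolDefinition]:
        """Return only the registered tools an agent version has selected."""
        return [self._tools[n] for n in selected_tools if n in self._tools]
```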
Workflow Invocation
For operations that span multiple steps or services, Service 3 can invoke a Temporal sub-workflow rather than a single tool call. The WorkflowInvoker starts a child workflow using the workflow_id specified in agent_versions.selected_workflows and passes the reasoning engine's arguments as the workflow input. The parent execution records the child workflow's execution ID and awaits the result only when the parent workflow is configured to wait.
Service 7: Governance and Approval
The enforcement boundary between what an agent reasons it should do and what it is permitted to do. Every write-capable tool call passes through this service. No tool dispatch occurs without this service's sign-off.
Owns: river_agents.approval_requests, river_agents.governance_policies, river_agents.agent_policy_bindings
Internal Components
| Component | Responsibility |
|---|---|
| ActionLevelChecker | Evaluates a proposed tool call against the agent's action_level. Returns execute, block, stage_only, or gate as the enforcement outcome. |
| PolicyEngine | Evaluates the current execution context (agent identity, workspace, tool name, tool arguments, data classification) against all bound governance policies. Returns allow, block, gate, or alert per matching policy. |
| ApprovalGateService | Creates approval_requests records, assigns approvers from approval_rules, and sends the approval signal to the Temporal workflow on resolution. |
| ApprovalNotifier | Dispatches approval request notifications via Novu to configured channels: Slack, email, in-app, or PagerDuty. Handles escalation if no response within approval_rules.timeout_hours. |
API Endpoints
| Method | Endpoint | Permission | Description |
|---|---|---|---|
| GET | /api/v1/approvals | agent:approve | List approval requests with status, agent, and date filters |
| GET | /api/v1/approvals/{id} | agent:approve | Full approval detail: proposed action, reasoning context, risk assessment |
| PATCH | /api/v1/approvals/{id} | agent:approve | Resolve: approve, reject, or edit_and_approve with modified arguments |
| GET | /api/v1/approvals/pending | agent:approve | Count of pending approvals (used for badge display) |
| GET | /api/v1/agents/{id}/policies | agent:read | List governance policies bound to an agent |
| POST | /api/v1/agents/{id}/policies | agent:update | Bind a governance policy to an agent |
| DELETE | /api/v1/agents/{id}/policies/{pid} | agent:update | Remove a policy binding |
Action Level Enforcement Matrix
| Agent action_level | Tool write_classified | Outcome |
|---|---|---|
| read_only | false (read tool) | Execute |
| read_only | true (write tool) | Block -- write operations not permitted at this level |
| recommend | false or true | Stage as proposal -- no execution; returned to user as recommendation |
| act_with_approval | false (read tool) | Execute |
| act_with_approval | true -- tool in approval_rules | Gate -- create ApprovalRequest, pause execution, notify approver |
| act_with_approval | true -- tool not in approval_rules | Execute |
| automated | false or true | Execute -- all configured tools run without gates |
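The matrix above translates directly into a decision function. The sketch below is illustrative: the Outcome enum, the function signature, and representing approval_rules as a collection of tool names are assumptions.

```python
from enum import Enum
from typing import Collection


class Outcome(str, Enum):
    EXECUTE = "execute"
    BLOCK = "block"
    STAGE_ONLY = "stage_only"
    GATE = "gate"


def check_action_level(
    action_level: str,
    tool_name: str,
    write_classified: bool,
    approval_rule_tools: Collection[str],
) -> Outcome:
    """Apply the action level enforcement matrix to one proposed tool call."""
    if action_level == "read_only":
        return Outcome.BLOCK if write_classified else Outcome.EXECUTE
    if action_level == "recommend":
        return Outcome.STAGE_ONLY  # staged as a proposal, never executed
    if action_level == "act_with_approval":
        if write_classified and tool_name in approval_rule_tools:
            return Outcome.GATE
        return Outcome.EXECUTE
    if action_level == "automated":
        return Outcome.EXECUTE
    raise ValueError(f"unknown action_level: {action_level}")
```

Raising on an unknown action level, rather than defaulting to execute, keeps a misconfigured agent from bypassing enforcement.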
Approval Request Lifecycle

Implementation Notes
Policy evaluation happens after action level check. A tool call blocked at the action level never reaches the PolicyEngine. The policy engine evaluates only calls that pass the action level check. This ordering means action level is always the outer constraint.
edit_and_approve substitutes arguments entirely. When an approver uses edit-and-approve, the original ToolCall arguments are discarded and the approver's revised arguments are used. The approval_requests record stores both the original and modified arguments for audit. Before dispatch, Service 3 re-validates the modified arguments against the tool's input schema, repeating step 2 of the Service 6 dispatch path.
Every governance decision -- including execute outcomes -- emits an audit event. Service 7 calls Service 9 on every enforcement outcome, not only on blocks and gates. This ensures the audit trail is complete for compliance purposes.
Service 8: Monitoring and Telemetry
Collects and aggregates runtime telemetry across all agent executions. Powers the system-wide monitoring dashboard, per-agent metrics, real-time execution views, and the alert engine.
Owns: river_agents.agent_metrics_hourly, river_agents.agent_alerts
Internal Components
| Component | Responsibility |
|---|---|
| TelemetryEmitter | Writes structured turn-level events to the WebSocket channel during active runs. Called by Service 3 at each turn boundary, tool dispatch, and approval gate event. |
| MetricAggregator | Aggregates execution outcomes into rolling windows (1h, 24h, 7d, 30d) and writes to agent_metrics_hourly. Runs as a background job after each run completion. |
| HealthEvaluator | Evaluates agent_metrics_hourly on a 5-minute cycle to derive agents.health_status (healthy, degraded, critical, unknown). Degraded threshold: success rate below 80% over 24 hours. Critical threshold: below 50%. |
| AlertEngine | Monitors per-agent and system-wide metrics against configured alert rules. Dispatches to Novu on threshold breach. Implements a cooldown period to prevent alert floods for sustained degradation. |
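The HealthEvaluator thresholds can be expressed as a short classification function. This is a sketch under two assumptions: the inputs are the completed and total run counts for the trailing 24 hours, and an agent with no runs in that window maps to unknown (the spec lists unknown as a status but does not say when it applies).

```python
def evaluate_health(completed_runs: int, total_runs: int) -> str:
    """Derive health_status from the 24-hour success rate."""
    if total_runs == 0:
        return "unknown"  # assumption: no data in the window
    success_rate = completed_runs / total_runs * 100
    if success_rate < 50:
        return "critical"
    if success_rate < 80:
        return "degraded"
    return "healthy"
```

Both thresholds are strict: a success rate of exactly 80% is healthy and exactly 50% is degraded, following the "below" wording in the table.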
API Endpoints
| Method | Endpoint | Permission | Description |
|---|---|---|---|
| GET | /api/v1/agents/{id}/metrics | agent:read | Per-agent metrics for a configurable time window |
| GET | /api/v1/monitoring/overview | agent:monitor | System-wide: total agents, active runs, throughput, health distribution |
| GET | /api/v1/monitoring/throughput | agent:monitor | Time-series throughput data for the system chart |
| GET | /api/v1/monitoring/alerts | agent:monitor | Recent alert stream with severity and agent reference |
| GET | /api/v1/monitoring/cluster | agent:monitor | Runtime instance health table |
| WS | /ws/agent-runs/{id} | agent:read | Real-time run progress events for a specific execution |
| WS | /ws/monitoring | agent:monitor | System-wide real-time monitoring event stream |
Metric Definitions
| Metric | Aggregation | Scope | Storage |
|---|---|---|---|
| Total Runs | Count | Per-agent, System-wide | agent_metrics_hourly.run_count |
| Success Rate | completed / total as percentage | Per-agent | agent_metrics_hourly.success_rate |
| Average Latency | P50, P90, P99 in milliseconds | Per-agent, Per-tool | agent_metrics_hourly.latency_p50/p90/p99 |
| Failure Count | Count with categorized reasons | Per-agent, System-wide | agent_metrics_hourly.failure_count + failure_reasons JSONB |
| Throughput | Runs per hour | System-wide | Derived from agent_metrics_hourly at query time |
| Pending Approvals | Count of approval_requests where status = 'pending' | Per-agent, System-wide | Queried live from approval_requests |
| Token Cost | Total tokens multiplied by model unit pricing | Per-agent, Per-run | agent_executions.token_usage JSONB |
| Actions Taken | Count grouped by tool_name | Per-agent | Aggregated from agent_logs at query time |
WebSocket Event Types Emitted
| Event | Payload Fields | Emitted When |
|---|---|---|
| run_started | execution_id, agent_id, started_at | Execution begins |
| turn_reasoning | turn, content, model_used, tokens | Reasoning turn complete |
| tool_called | turn, tool_name, inputs | Tool dispatch initiated |
| tool_result | turn, tool_name, output, duration_ms | Tool result received |
| approval_requested | approval_id, tool_name, pending_since | Approval gate triggered |
| approval_resolved | approval_id, resolution, resolved_by | Approval gate resolved |
| run_completed | execution_id, status, final_output, duration_ms | Execution finalized |
| run_failed | execution_id, error_code, error_message | Execution terminated with error |
Service 9: Execution Logging and Audit
The write-once source of truth for all historical agent behavior. Provides queryable, filterable, and exportable audit trails for debugging, compliance, and analysis.
Owns: river_agents.audit_logs, river_agents.agent_logs
Internal Components
| Component | Responsibility |
|---|---|
| AuditWriter | Single entry point for all writes to audit_logs. Enforces write-once semantics at the application layer -- no UPDATE or DELETE is permitted on this table through this component. |
| ExecutionLogWriter | Writes turn-level records to agent_logs during active runs. Called by Service 3 after every reasoning turn, tool call, and approval gate event. |
| AuditQueryService | Provides the read interface for both audit_logs and agent_logs. Handles filter compilation, pagination, and export formatting. |
| RetentionManager | Applies retention policies: archives or deletes audit records older than the configured retention period per organization. Retention periods are stored in iam.organizations.audit_retention_days. |
API Endpoints
| Method | Endpoint | Permission | Description |
|---|---|---|---|
| GET | /api/v1/audit/logs | agent:audit | Search audit logs with filters: agent, user, event type, date range, outcome, pagination |
| GET | /api/v1/audit/logs/{id} | agent:audit | Full audit entry with complete payload detail |
| GET | /api/v1/audit/stats | agent:audit | Aggregate counts: total events, by type, by agent, by user |
| GET | /api/v1/audit/export | agent:audit | Export filtered audit logs as CSV or JSON for compliance reporting |
| GET | /api/v1/agent-runs/{id}/trace | agent:read | Complete turn-by-turn execution trace for a single run |
Audit Log Entry Schema
| Field | Type | Description |
|---|---|---|
| log_id | UUID | Immutable identifier; set at write time and never changed |
| timestamp | timestamptz | Event time; indexed for range queries |
| event_type | enum | agent.created, agent.deployed, agent.archived, run.started, run.completed, run.failed, tool.executed, tool.blocked, approval.requested, approval.resolved, policy.violated, permission.denied |
| agent_id | UUID | Related agent (nullable for non-agent events) |
| execution_id | UUID | Related run (null for non-execution events) |
| user_id | UUID | Actor who initiated the event; system service ID for automated events |
| action | string | Specific action description (e.g., "deploy_agent", "execute_tool:update_crm") |
| payload | JSONB | Full request context and response (scrubbed of secrets) |
| outcome | enum | success, failure, blocked, pending |
| ip_address | inet | Source IP from the originating request; populated from TLO Gateway propagation header |
| organization_id | UUID | Tenant isolation key; all queries must filter by this |
Implementation Notes
audit_logs has no UPDATE or DELETE paths in the application code. The write-once constraint is enforced by AuditWriter -- there is no update_audit_log or delete_audit_log method. At the database layer, this is reinforced by a row-level security policy that grants INSERT only (no UPDATE, no DELETE) to the Backend service's database role.
Turn-level logs in agent_logs are separate from audit logs. agent_logs stores the detailed reasoning trace (turn content, tool arguments, observations) and is queried for debugging and run detail views. audit_logs stores governance events, lifecycle transitions, and compliance-relevant actions. The two tables serve different consumers: agent_logs is for operators debugging a run; audit_logs is for compliance and security review.
Export uses streaming for large result sets. The GET /api/v1/audit/export endpoint streams the response rather than buffering the full result set in memory. Callers should handle chunked transfer encoding. Result sets over 100,000 rows are automatically split into multiple files in a ZIP archive.
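The streaming approach can be sketched with a generator that emits CSV text in batches instead of building the whole file in memory. The function name, the batch size, and the idea of handing the generator to a streaming HTTP response are assumptions; the ZIP splitting for very large result sets is not shown.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List


def stream_csv(
    rows: Iterable[Dict[str, object]],
    fieldnames: List[str],
    batch_size: int = 1000,
) -> Iterator[str]:
    """Yield CSV text in chunks of at most batch_size rows.

    The header is emitted as part of the first chunk. Only one batch of rows
    is held in memory at a time.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    pending = 0
    for row in rows:
        writer.writerow(row)
        pending += 1
        if pending >= batch_size:
            yield buffer.getvalue()
            # Reset the buffer so the next chunk starts empty.
            buffer.seek(0)
            buffer.truncate(0)
            pending = 0
    tail = buffer.getvalue()
    if tail:
        yield tail
```

On the client side, each yielded chunk corresponds to a piece of the chunked response body, so callers can write chunks to disk as they arrive rather than waiting for the full export.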