Architecture Layers
River Agents are organized into 8 functional layers, each with a distinct responsibility boundary. This document specifies the components, contracts, constraints, and interaction patterns for each layer. It is intended as the primary reference for engineers implementing or maintaining the system.
Quick Navigation
- Layer Reference
- Master Architecture
- Layer 1: Agent Definition and Lifecycle
- Layer 2: Trigger Ingestion and Dispatch
- Layer 3: Runtime Orchestration
- Layer 4: Reasoning and Planning
- Layer 5: Data Access and Connectors
- Layer 6: Tool and Workflow Execution
- Layer 7: Governance and Safety
- Layer 8: Monitoring and Observability
- Cross-Layer Interaction Map
Layer Reference
| Layer | Name | Primary Owner Service | Core Responsibility |
|---|---|---|---|
| 1 | Agent Definition and Lifecycle | Backend :8005 | Persistent identity, versioned configuration, lifecycle state machine |
| 2 | Trigger Ingestion and Dispatch | Backend :8005 | Normalize all trigger types into a uniform execution request |
| 3 | Runtime Orchestration | Backend :8005 + Temporal | Coordinate a single agent run from start to finalization |
| 4 | Reasoning and Planning | river-agent :8007 | Agentic loop, LLM routing, context construction, tool selection |
| 5 | Data Access and Connectors | Data Orchestration :8002 via TLO | Schema-aware, governed data retrieval from connected sources |
| 6 | Tool and Workflow Execution | TLO Gateway :8001 | ACL-validated tool dispatch and result delivery to the reasoning loop |
| 7 | Governance and Safety | Backend :8005 | Action level enforcement, approval gates, policy evaluation, audit |
| 8 | Monitoring and Observability | Backend :8005 | Real-time telemetry, metric aggregation, alerting, health tracking |
Master Architecture
The stack is ordered bottom-up: Layer 1 is the foundation (stored agent state), and Layer 8 is the operational envelope (observability). Layer 3 (Runtime Orchestration) is the central coordinator -- it drives Layer 4, writes to Layer 8, and gates through Layer 7 on every write-capable action.

Solid arrows indicate the primary execution flow. Dotted arrows from L3 indicate that Runtime Orchestration directly interfaces with Governance (for gate checks) and Monitoring (for telemetry) throughout the run, not only at the end.
Layer 1: Agent Definition and Lifecycle
Layer 1 is the persistence foundation. Every agent run traces back to a configuration version record in this layer. Nothing executes unless a valid, deployed version exists here.
Components
| Component | Responsibility |
|---|---|
| Agent Management Service | CRUD operations for agent configurations. Owns all writes to river_agents.agents and related tables. Validates field completeness before state transitions. |
| Version Control | Creates an immutable agent_versions snapshot on every post-deployment configuration change. The active version is the source of truth for all execution context. |
| Lifecycle State Machine | Enforces valid state transitions with pre-condition checks. No direct database write to agents.status is permitted from outside this component. |
Versioning Rules
Every configuration change on a deployed or active agent creates a new version record rather than updating in place. The rules are:
- agents.current_version_id always points to the version the next execution will use.
- agent_executions.agent_version_id is set at run creation time and never updated. A run is always auditable against the exact configuration that produced it.
- When an operator edits a live agent, a new version is written with status = 'draft'; on deploy, it becomes active and the previous version transitions to archived.
- Rollback is implemented by updating current_version_id to an older version's ID. This does not create a new version record but does emit an audit event of type agent_version_rolled_back.
- Only one version per agent can hold status = 'active' at a time. This is enforced by a partial unique index.
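The rules above can be illustrated with a small in-memory model. This is a minimal sketch, not the service implementation: the `Agent` and `AgentVersion` classes and their method names are hypothetical, and the status swap on rollback is an assumption made so the single-active rule still holds (the text only specifies the pointer update and the audit event).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AgentVersion:
    version_id: int
    config: dict
    status: str = "draft"  # 'draft', 'active', or 'archived'


@dataclass
class Agent:
    versions: Dict[int, AgentVersion] = field(default_factory=dict)
    current_version_id: Optional[int] = None
    audit_events: List[str] = field(default_factory=list)

    def edit(self, config: dict) -> AgentVersion:
        """Write a new draft version; existing versions are never updated in place."""
        version = AgentVersion(version_id=len(self.versions) + 1, config=dict(config))
        self.versions[version.version_id] = version
        return version

    def _activate(self, version_id: int) -> None:
        # Archive whichever version is active so that only one holds 'active'.
        for version in self.versions.values():
            if version.status == "active":
                version.status = "archived"
        self.versions[version_id].status = "active"
        self.current_version_id = version_id

    def deploy(self, version_id: int) -> None:
        """Promote a draft to active and archive the previously active version."""
        if self.versions[version_id].status != "draft":
            raise ValueError("only a draft version can be deployed")
        self._activate(version_id)

    def rollback(self, version_id: int) -> None:
        """Repoint current_version_id at an older version; no new record is created."""
        if version_id not in self.versions:
            raise KeyError(version_id)
        # Assumption: statuses are swapped here; the source does not say so explicitly.
        self._activate(version_id)
        self.audit_events.append("agent_version_rolled_back")
```

In the real system the single-active rule is enforced by the partial unique index rather than by application code; the loop in `_activate` only mimics that guarantee.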

State Transition Pre-Conditions
| Transition | Pre-Condition Check | Failure Behavior |
|---|---|---|
| Configured -> Validated | Data source connectivity test, tool registry availability check, policy resolution, trigger config syntax validation | Transition blocked; validation errors returned as field-level detail |
| Validated -> Deployed | Temporal workflow registration succeeds, trigger listeners activated | Transition blocked; deployment error logged |
| Active -> Running | Agent in active state, concurrency check passes, rate limit not exceeded | Trigger queued, dropped, or used to replace the current run per on_concurrent_trigger policy |
| Running -> Awaiting Approval | Action level gate triggered during execution | State serialized to DB; Temporal workflow suspended |
| Any -> Archived | No active running executions (or caller accepts forced cancellation), caller has agent:delete permission | Blocked if running executions exist without force=true |
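A guarded transition function is one way to picture the table above. The sketch below is hypothetical: the state names are lowercased versions of the table entries, `transition` and `archive_check` are illustrative helpers, and only the transitions listed in the table are modeled.

```python
from typing import Callable, List, Optional

# Transitions listed in the pre-condition table; "Any -> Archived" is handled separately.
ALLOWED = {
    ("configured", "validated"),
    ("validated", "deployed"),
    ("active", "running"),
    ("running", "awaiting_approval"),
}

# A check returns None on success or an error message on failure.
Check = Callable[[], Optional[str]]


class TransitionBlocked(Exception):
    def __init__(self, errors: List[str]):
        super().__init__("; ".join(errors))
        self.errors = errors


def transition(current: str, target: str, checks: List[Check]) -> str:
    """Return the new state, or raise TransitionBlocked with every failed check."""
    if target != "archived" and (current, target) not in ALLOWED:
        raise TransitionBlocked([f"{current} -> {target} is not a listed transition"])
    errors = [msg for check in checks if (msg := check()) is not None]
    if errors:
        raise TransitionBlocked(errors)
    return target


def archive_check(running_executions: int, force: bool, has_delete_permission: bool) -> Check:
    """Pre-condition for Any -> Archived, following the last table row."""
    def check() -> Optional[str]:
        if not has_delete_permission:
            return "caller lacks agent:delete permission"
        if running_executions > 0 and not force:
            return "running executions exist and force=true was not supplied"
        return None
    return check
```

Collecting all failed checks before raising mirrors the "validation errors returned as field-level detail" behavior, rather than stopping at the first failure.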

Layer 2: Trigger Ingestion and Dispatch
Layer 2 is the entry surface for all agent executions. Its sole job is to receive heterogeneous signals, normalize them into a uniform ExecutionRequest, validate the target agent's readiness, and hand off to the runtime layer.
Components
| Component | Trigger Type | Ingestion Mechanism |
|---|---|---|
| Manual Trigger Handler | manual | Direct HTTP POST to /api/v1/agents/:id/runs |
| Scheduled Trigger Handler | scheduled | Internal cron scheduler reads trigger_config.cron_expression and timezone; polling interval is 10 seconds |
| Event Trigger Handler | event | Kafka consumer group river-agents-events; evaluates trigger_config.event_type and optional payload_conditions filter |
| API Trigger Handler | api | Authenticated POST to agent's dedicated endpoint; optional payload_schema validated on arrival |
| Threshold Trigger Handler | threshold | Metric monitor evaluates trigger_config.threshold_expression against configured metric source; cooldown_seconds prevents re-fire |
| Workflow Trigger Handler | workflow | Temporal signal listener bound to trigger_config.workflow_id and signal_name |
| Dispatcher | All | Normalizes trigger into ExecutionRequest, validates agent state and concurrency, enqueues to Execution Queue |
ExecutionRequest Schema
Every trigger, regardless of source, is normalized into this structure before the Dispatcher submits it to the runtime layer.
| Field | Type | Description |
|---|---|---|
| request_id | UUID | Unique identifier for this trigger event; used for deduplication and idempotency |
| agent_id | UUID | Target agent |
| trigger_type | enum | manual, scheduled, event, api, threshold, workflow |
| trigger_source | string | Origin label (e.g., "cron:daily-8am", "webhook:zendesk", "api:ci-pipeline") |
| trigger_payload | JSONB | Optional context from the trigger source (ticket data, metric values, API body) |
| requested_by | UUID | User ID for manual/API triggers; system service ID for automated triggers |
| requested_at | timestamptz | When the trigger was received by the ingestion handler |
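As a rough illustration, the schema above maps onto a frozen dataclass. The field names and enum values come from the table; the `normalize` helper and the shape of its `raw` input are assumptions, since the handlers' internal formats are not specified here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, Optional
from uuid import UUID, uuid4


class TriggerType(str, Enum):
    MANUAL = "manual"
    SCHEDULED = "scheduled"
    EVENT = "event"
    API = "api"
    THRESHOLD = "threshold"
    WORKFLOW = "workflow"


@dataclass(frozen=True)
class ExecutionRequest:
    agent_id: UUID
    trigger_type: TriggerType
    trigger_source: str
    requested_by: UUID
    trigger_payload: Optional[Dict[str, Any]] = None
    # Generated at ingestion time when the handler does not supply them.
    request_id: UUID = field(default_factory=uuid4)
    requested_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def normalize(raw: Dict[str, Any]) -> ExecutionRequest:
    """Hypothetical helper: build an ExecutionRequest from a handler's raw dict."""
    return ExecutionRequest(
        agent_id=UUID(str(raw["agent_id"])),
        trigger_type=TriggerType(raw["trigger_type"]),  # raises ValueError if unknown
        trigger_source=raw["trigger_source"],
        requested_by=UUID(str(raw["requested_by"])),
        trigger_payload=raw.get("trigger_payload"),
    )
```

Freezing the dataclass reflects the intent that a request is not mutated after the Dispatcher builds it.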
Trigger Processing Flow
Concurrency policy: if allow_concurrent_runs is false (the default) and a run is already in running status, the incoming trigger is handled per on_concurrent_trigger: queue (hold in queue until current run finishes), drop (discard with log), or replace (cancel the current run and start fresh).
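The three on_concurrent_trigger behaviors can be sketched with an in-memory slot. `RunSlot` and its methods are hypothetical names; the real Dispatcher works against the Execution Queue and persisted run state rather than Python lists.

```python
from collections import deque
from typing import Deque, List, Optional


class RunSlot:
    """Tracks running and queued request IDs for one agent."""

    def __init__(self, allow_concurrent_runs: bool = False,
                 on_concurrent_trigger: str = "queue") -> None:
        self.allow_concurrent_runs = allow_concurrent_runs
        self.on_concurrent_trigger = on_concurrent_trigger
        self.running: List[str] = []
        self.queued: Deque[str] = deque()
        self.log: List[str] = []

    def submit(self, request_id: str) -> str:
        # No conflict: concurrency is allowed or nothing is running.
        if self.allow_concurrent_runs or not self.running:
            self.running.append(request_id)
            return "started"
        policy = self.on_concurrent_trigger
        if policy == "queue":
            self.queued.append(request_id)  # held until the current run finishes
            return "queued"
        if policy == "drop":
            self.log.append(f"dropped {request_id}")  # discard with log
            return "dropped"
        if policy == "replace":
            cancelled = self.running.pop()  # cancel the current run and start fresh
            self.log.append(f"cancelled {cancelled}")
            self.running.append(request_id)
            return "replaced"
        raise ValueError(f"unknown on_concurrent_trigger: {policy}")

    def finish(self, request_id: str) -> Optional[str]:
        """Mark a run finished; start and return the next queued request, if any."""
        self.running.remove(request_id)
        if self.queued and not self.running:
            next_id = self.queued.popleft()
            self.running.append(next_id)
            return next_id
        return None
```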
Layer 3: Runtime Orchestration
Layer 3 is the conductor of every agent execution. It bridges the static configuration in Layer 1 with the dynamic reasoning in Layer 4, manages approval gate hibernation, and ensures every execution is durable, retriable, and fully traced.
Components
| Component | Responsibility |
|---|---|
| Agent Runtime Runner | The root Temporal workflow for a single agent run. Owns the full execution lifecycle from context initialization through finalization and memory write-back. |
| Context Builder | Assembles the AgentContext bundle: agent version config, long_term_context snapshot, trigger payload, connected data source metadata, and active tool registry. Passed to river-agent on every reasoning invocation. |
| State Serializer | On approval gate: serializes conversation history, pending tool call, current observations, and memory snapshot to agent_executions.serialized_state. On approval resolution: deserializes and restores the full AgentContext to resume from exactly where the run paused. |
| Retry and Timeout Manager | Enforces per-tool timeouts (default: 30 seconds), per-run turn limits (default: 15 turns), and Temporal retry policies for transient failures (3 retries, exponential backoff). |
Execution Constraints
| Constraint | Default | Configuration Field |
|---|---|---|
| Max turns per run | 15 | agent_versions.runtime_config.max_turns |
| Per-tool timeout | 30 seconds | agent_versions.runtime_config.tool_timeout_seconds |
| Approval gate timeout | 72 hours | agent_versions.approval_rules.timeout_hours |
| Max retries on transient failure | 3 | agent_versions.runtime_config.max_retries |
| Context bundle max size | 500 KB | Hard limit; enforced by Context Builder |
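The retry default can be made concrete with a small helper. In the system itself, retries are handled by Temporal retry policies; the sketch below only illustrates "3 retries, exponential backoff" in plain Python. `TransientError`, `call_with_retries`, and the 1-second base delay are assumptions, not documented values.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a failure that is safe to retry."""


@dataclass(frozen=True)
class RuntimeConfig:
    # Defaults taken from the Execution Constraints table.
    max_turns: int = 15
    tool_timeout_seconds: int = 30
    max_retries: int = 3


def call_with_retries(fn: Callable[[], T], config: RuntimeConfig,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn once, then retry up to max_retries times with doubling delays."""
    attempt = 0
    while True:
        try:
            return fn()
        except TransientError:
            if attempt >= config.max_retries:
                raise  # retries exhausted; surface the last failure
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s with the default base
            attempt += 1
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.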
Runtime Orchestration Flow
Approval Gate State Serialization
When a gate is triggered, the run must be fully serialized so that it holds no active compute while it waits. The State Serializer writes the conversation history, pending tool call, current observations, and memory snapshot to agent_executions.serialized_state (JSONB).
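The exact payload layout is not reproduced in this document. As a hypothetical sketch, the four items named in the State Serializer description could round-trip through JSON as follows; the field names and the `serialize`/`deserialize` helpers are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class SerializedState:
    # Field names are illustrative; they follow the State Serializer description.
    conversation_history: List[Dict[str, Any]]
    pending_tool_call: Dict[str, Any]
    observations: List[str]
    memory_snapshot: Dict[str, Any]


def serialize(state: SerializedState) -> str:
    """Produce the JSON text that would be stored in the JSONB column."""
    return json.dumps(asdict(state), sort_keys=True)


def deserialize(blob: str) -> SerializedState:
    """Rebuild the state so the run can resume where it paused."""
    return SerializedState(**json.loads(blob))
```

Everything in the payload must be JSON-representable, which is why the sketch restricts itself to dicts, lists, and strings.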

Layer 4: Reasoning and Planning
Layer 4 is the intelligence core. It receives a fully assembled AgentContext from Layer 3, runs the agentic loop, and returns a structured result. This layer is implemented by the stateless river-agent microservice at :8007.
Components
| Component | Responsibility |
|---|---|
| River-Agent Agentic Loop | The Reason -> Act -> Observe cycle. Each turn: classify the task, call the appropriate LLM via RiverCore, parse the response for a tool call or finalization signal, and feed the observation back into context for the next turn. |
| RiverCore (Multi-LLM Router) | Selects the optimal LLM provider and model for each turn based on classified task complexity. Handles provider failover transparently within a turn. |
| System Prompt Builder | Dynamically constructs the system prompt per run by composing: the agent's instruction_set, the active tool registry with Pydantic schemas, connected data source metadata, current governance constraints, and the long_term_context snapshot. |
| Long-Term Context Memory | The accumulated knowledge from past runs, injected into the system prompt. Updated by Backend after each run via the memory_updates delta returned in the reasoning result. |
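The Reason -> Act -> Observe cycle can be sketched as a bounded loop. This is a simplified illustration, not the river-agent implementation: `LLMResponse`, `run_loop`, and the result dict are hypothetical, task classification and RiverCore routing are collapsed into a single `llm` callable, and returning a max_turns_exceeded status when the turn limit is hit is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class LLMResponse:
    reasoning: str
    tool_name: Optional[str] = None
    tool_arguments: Optional[Dict[str, Any]] = None
    final_answer: Optional[str] = None  # set when the model signals finalization


def run_loop(llm: Callable[[List[dict]], LLMResponse],
             tools: Dict[str, Callable[..., Any]],
             context: List[dict],
             max_turns: int = 15) -> dict:
    for turn in range(1, max_turns + 1):
        response = llm(context)  # Reason: ask the model what to do next
        if response.final_answer is not None:
            return {"status": "completed", "answer": response.final_answer, "turns": turn}
        tool = tools.get(response.tool_name)
        if tool is None:
            observation = f"unknown tool: {response.tool_name}"
        else:
            observation = tool(**(response.tool_arguments or {}))  # Act
        # Observe: feed the result back into context for the next turn.
        context.append({"turn": turn, "tool": response.tool_name, "observation": observation})
    return {"status": "max_turns_exceeded", "answer": None, "turns": max_turns}
```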
Multi-LLM Routing Strategy
RiverCore routes each turn to one of four model tiers based on task complexity classification. The classification is a fast internal heuristic step run before the main reasoning call.

Reasoning Turn Structure
Every turn in the agentic loop produces a structured record written to river_agents.agent_logs. Engineers implementing or querying the log store should treat these fields as the canonical schema for turn-level data.
| Field | Type | Description |
|---|---|---|
| turn_number | integer | Sequential index starting at 1 |
| turn_type | enum | reasoning, action, observation, interaction |
| reasoning | text | LLM chain-of-thought for this turn; stored for audit and debugging |
| tool_name | string | Name of the tool selected (null for reasoning-only and finalization turns) |
| tool_arguments | JSONB | Structured arguments for the tool call (null if no tool selected) |
| observation | text | Result returned by the tool or injected from the approval resolution |
| model_used | string | Exact model identifier (e.g., claude-sonnet-4-6, gpt-4o) |
| tokens_used | integer | Total input + output token count for this turn |
| latency_ms | integer | Wall-clock time for the turn including LLM call and tool dispatch |
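For code that reads or writes turn records, the fields above could be mirrored by a dataclass with a few consistency checks. This is a sketch under assumptions: the class name `TurnRecord` is hypothetical, and treating `observation` as nullable is a guess, since the table does not state its nullability.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

TURN_TYPES = {"reasoning", "action", "observation", "interaction"}


@dataclass
class TurnRecord:
    turn_number: int
    turn_type: str
    reasoning: str
    model_used: str
    tokens_used: int
    latency_ms: int
    tool_name: Optional[str] = None
    tool_arguments: Optional[Dict[str, Any]] = None
    observation: Optional[str] = None  # assumption: nullable

    def __post_init__(self) -> None:
        if self.turn_number < 1:
            raise ValueError("turn_number starts at 1")
        if self.turn_type not in TURN_TYPES:
            raise ValueError(f"unknown turn_type: {self.turn_type}")
        # tool_arguments is null whenever no tool was selected.
        if self.tool_name is None and self.tool_arguments is not None:
            raise ValueError("tool_arguments set without tool_name")
```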
Layer 5: Data Access and Connectors
Layer 5 is the agent's governed interface to enterprise data. All data retrieval -- schema lookup, query execution, and catalog search -- routes through this layer. No direct database connections are made from the reasoning layer.
Components
| Component | Responsibility |
|---|---|
| Schema Discovery Service | Provides the reasoning engine with accurate table, column, and relationship metadata for all connected data sources. Auto-discovers on connection; cached with configurable TTL per data source. |
| Query Execution Engine | Accepts structured SQL, NoSQL, or API query specs from the reasoning layer. Executes via Data Orchestration Service (:8002) through TLO Gateway. Returns normalized result sets. |
| Semantic Catalog (Qdrant) | Vector-based search over data source metadata. Enables the reasoning engine to resolve table and column references by semantic meaning when exact names are unavailable or ambiguous. |
| Data Connectors | Type-specific adapters for each supported source: PostgreSQL, MySQL, Snowflake, BigQuery, Redshift, MongoDB, Elasticsearch, REST APIs, Salesforce, HubSpot, Stripe, and others. Managed by Data Orchestration Service. |
Data Access Flow
ACL note: data_source:view is required for schema discovery; data_source:query is required for query execution. Both are checked per call at TLO Gateway, not cached from the start of the run. A data source ACL change takes effect immediately on the next tool invocation, even within an active run.
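The per-call behavior in the ACL note can be illustrated with a toy permission store. `AclStore` and `invoke` are hypothetical stand-ins for the check that TLO Gateway performs; the mapping of discover_schema to data_source:view and execute_query to data_source:query follows the note, with the tool names taken from the Execution tool category.

```python
from typing import Set, Tuple

# Permission required by each data tool, per the ACL note.
REQUIRED = {
    "discover_schema": "data_source:view",
    "execute_query": "data_source:query",
}


class AclStore:
    def __init__(self) -> None:
        self._grants: Set[Tuple[str, str, str]] = set()

    def grant(self, principal: str, data_source: str, permission: str) -> None:
        self._grants.add((principal, data_source, permission))

    def revoke(self, principal: str, data_source: str, permission: str) -> None:
        self._grants.discard((principal, data_source, permission))

    def allowed(self, principal: str, data_source: str, permission: str) -> bool:
        return (principal, data_source, permission) in self._grants


def invoke(acl: AclStore, principal: str, tool: str, data_source: str) -> str:
    """Check the ACL on every call; nothing is cached from the start of the run."""
    permission = REQUIRED[tool]
    if not acl.allowed(principal, data_source, permission):
        raise PermissionError(f"{principal} lacks {permission} on {data_source}")
    return f"{tool} ok on {data_source}"
```

Because `invoke` consults the store each time, a revocation between two tool calls in the same run blocks the second call.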
Layer 6: Tool and Workflow Execution
Layer 6 is the dispatch and validation surface for every non-reasoning action the agent takes. All tool calls route through TLO Gateway, which re-validates ACL permissions before forwarding to the target service.
Components
| Component | Responsibility |
|---|---|
| Tool Registry | Maintains the catalog of all available tools with their input and output Pydantic schemas. Includes platform tools and dynamically registered custom tools added via OpenAPI spec upload. |
| Tool Executor | Routes tool call requests from the reasoning layer through TLO Gateway. Parses the response, formats it as a structured observation, and returns it to Layer 3 for injection into the next reasoning turn. |
| Workflow Invoker | Triggers Temporal sub-workflows for operations that span multiple steps or services (e.g., a data pipeline run, a multi-system onboarding sequence). Returns a workflow execution ID as the observation. |
| Result Validator | Validates tool execution results against the expected output schema for that tool. Malformed results are caught here before being injected into the reasoning context; the turn is marked as a tool error. |
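The Result Validator's role can be sketched without any third-party dependency. The registry uses Pydantic schemas; the example below swaps in a plain field-to-type map to stay self-contained, so `validate_result`, the schema format, and the returned dict shape are all assumptions.

```python
from typing import Any, Dict, List


def validate_result(tool_name: str, result: Dict[str, Any],
                    schema: Dict[str, type]) -> Dict[str, Any]:
    """Check a tool result against its expected fields before it reaches the context."""
    errors: List[str] = []
    for field_name, expected_type in schema.items():
        if field_name not in result:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(result[field_name], expected_type):
            errors.append(
                f"{field_name}: expected {expected_type.__name__}, "
                f"got {type(result[field_name]).__name__}"
            )
    if errors:
        # Malformed results are not injected; the turn is marked as a tool error.
        return {"tool_name": tool_name, "tool_error": True, "errors": errors}
    return {"tool_name": tool_name, "tool_error": False, "observation": result}
```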
Tool Categories
| Category | Count | ACL Check | LLM Required | Examples |
|---|---|---|---|---|
| AI Reasoning | 6 | None -- internal processing only | Yes (per-turn routing) | classify_intent, check_governance, generate_query, search_catalog, recommend_visualization, explain_results |
| Execution | 10+ | Per-call check by TLO | No -- direct service call | execute_query, discover_schema, send_email, send_slack, create_ticket, update_crm, write_back, restart_workflow, scale_infrastructure |
| Interaction | 1 | None | No | request_approval -- sends approval request via WebSocket and signals Temporal to suspend |
| Custom | N (dynamic) | Per-call check by TLO | No | Registered from OpenAPI specs; schema validated at registration time |
Implementation note on request_approval: this tool does not return an observation to the reasoning loop. Calling it is a terminal action for the current turn -- it triggers the State Serializer in Layer 3, creates the approval record, and suspends the Temporal workflow. The reasoning loop resumes from the next turn only after the approval signal is received.
Layer 7: Governance and Safety
Layer 7 is the enforcement boundary between what an agent reasons it should do and what it is permitted to do. Every write-capable action passes through this layer before dispatch. Governance checks are not advisory -- they block, gate, or permit, with no silent pass-through.
Components
| Component | Responsibility |
|---|---|
| Action Level Checker | Evaluates each tool call against the agent's action_level. Determines whether the call proceeds, is returned as a read-only proposal, or requires an approval gate. |
| Approval Gate Service | Creates approval_requests records, assigns approvers from approval_rules, dispatches notifications via Novu, and processes approval signals back to the Temporal workflow. |
| Policy Engine | Evaluates the current execution context against organization-level and workspace-level governance policies (e.g., row count limits on data exports, time-of-day restrictions on production writes). |
| Approval Notification | Delivers approval requests to configured channels per notification_config: Slack, email, in-app, or PagerDuty. Handles escalation if no response is received before timeout_hours. |
Governance Decision Flow

Policy Engine behavior: policies are evaluated against the execution_context object, which includes the agent identity, the workspace, the tool name, the tool arguments, and the data classification of involved data sources. Policies can match on any of these fields. A policy match always produces an audit event regardless of the enforcement action (block, gate, log, or alert).
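A minimal sketch of this behavior follows. `Policy`, `evaluate`, and the precedence order among enforcement actions (block over gate over alert over log) are assumptions; the text only states that any field of execution_context can be matched and that every match produces an audit event.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

ExecutionContext = Dict[str, Any]

# Assumed precedence when several policies match; not specified in the source.
_RANK = {"permit": 0, "log": 1, "alert": 2, "gate": 3, "block": 4}


@dataclass
class Policy:
    name: str
    matches: Callable[[ExecutionContext], bool]
    action: str  # 'block', 'gate', 'log', or 'alert'


def evaluate(context: ExecutionContext,
             policies: List[Policy]) -> Tuple[str, List[dict]]:
    """Return the strongest matched action and one audit event per match."""
    decision = "permit"
    audit: List[dict] = []
    for policy in policies:
        if policy.matches(context):
            # Every match is audited, whatever its enforcement action.
            audit.append({"policy": policy.name, "action": policy.action,
                          "tool": context.get("tool_name")})
            if _RANK[policy.action] > _RANK[decision]:
                decision = policy.action
    return decision, audit
```

A caller would treat "log" and "alert" as non-blocking outcomes and let the tool call proceed, while "gate" routes to the Approval Gate Service and "block" stops the call.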
Layer 8: Monitoring and Observability
Layer 8 collects structured telemetry from across the agent ecosystem and makes it available via the real-time WebSocket feed, the metrics API, the audit log query interface, and the alert engine. It has no authority over execution -- it observes and reports.
Components
| Component | Responsibility |
|---|---|
| Real-Time Telemetry | Emits structured events over WebSocket to the frontend Run Detail view. Events are emitted at each turn boundary, tool call, and state transition. |
| Metric Aggregation | Calculates rolling metrics at per-agent and system-wide scopes across 1h, 24h, 7d, and 30d windows. Stores aggregates in river_agents.agent_metrics_hourly. |
| Execution Logging | Persists the complete trace of every execution to river_agents.agent_logs: every reasoning turn, tool call, observation, approval decision, and finalization event. |
| Anomaly Detection and Alert Engine | Monitors per-agent metrics for threshold breaches (e.g., failure rate above 20%, latency P99 above 10 seconds). Dispatches alerts to configured channels via Novu. |
| Health Monitor | Derives operational health status for each agent from recent execution outcomes. Updates agents.health_status on a 5-minute evaluation cycle. |
Key Metrics
| Metric | Scope | Aggregation | Used In |
|---|---|---|---|
| Total Runs | Per-agent, System-wide | Count | Agent Detail KPI, System Monitoring |
| Success Rate | Per-agent | completed / total as percentage | Agent Detail KPI, Health Score |
| Average Latency | Per-agent, Per-tool | P50, P90, P99 in milliseconds | Agent Detail KPI, System Monitoring |
| Failure Count | Per-agent, System-wide | Count with categorized reasons | Agent Detail KPI, Alert Engine |
| Throughput | System-wide | Runs per hour and per day | System Monitoring |
| Pending Approvals | Per-agent, System-wide | Count | Agent Detail badge, Runs page |
| Token Cost | Per-agent, Per-run | Total tokens multiplied by model unit pricing | Cost Tracking |
| Actions Taken | Per-agent | Count grouped by tool name | Agent Overview tab |
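Three of the aggregations above reduce to short formulas. The helpers below are an illustrative sketch: the function names are hypothetical, the nearest-rank percentile method is an assumption (the document does not say which method is used), and per-token unit prices are placeholder inputs rather than real pricing.

```python
import math
from typing import Dict, List


def success_rate(completed: int, total: int) -> float:
    """completed / total as a percentage; 0.0 when there are no runs."""
    return 100.0 * completed / total if total else 0.0


def percentile(latencies_ms: List[int], p: int) -> int:
    """Nearest-rank percentile, e.g. p=50, 90, or 99."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def token_cost(tokens_by_model: Dict[str, int],
               unit_price_by_model: Dict[str, float]) -> float:
    """Total tokens multiplied by each model's unit price, summed over models."""
    return sum(tokens * unit_price_by_model[model]
               for model, tokens in tokens_by_model.items())
```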
Telemetry Data Flow

Cross-Layer Interaction Map
The full system maps to five infrastructure tiers. The layer groupings above are functional; the tiers below reflect the actual deployment topology.

Layer-to-tier mapping:
| Layers | Tier |
|---|---|
| L1, L2, L7, L8 | Service Layer (Backend :8005) |
| L3 | Orchestration Layer (Backend :8005 + Temporal :7233) |
| L4 | Orchestration Layer (river-agent :8007) |
| L5 | Service Layer (Data Orchestration :8002) |
| L6 | API Layer (TLO Gateway :8001 as dispatch boundary) |