Architecture Layers

River Agents are organized into eight functional layers, each with a distinct responsibility boundary. This document specifies the components, contracts, constraints, and interaction patterns for each layer. It is intended as the primary reference for engineers implementing or maintaining the system.

Quick Navigation

Layer Reference
Master Architecture
Layer 1: Agent Definition and Lifecycle
Layer 2: Trigger Ingestion and Dispatch
Layer 3: Runtime Orchestration
Layer 4: Reasoning and Planning
Layer 5: Data Access and Connectors
Layer 6: Tool and Workflow Execution
Layer 7: Governance and Safety
Layer 8: Monitoring and Observability
Cross-Layer Interaction Map

Layer Reference

Layer	Name	Primary Owner Service	Core Responsibility
1	Agent Definition and Lifecycle	Backend :8005	Persistent identity, versioned configuration, lifecycle state machine
2	Trigger Ingestion and Dispatch	Backend :8005	Normalize all trigger types into a uniform execution request
3	Runtime Orchestration	Backend :8005 + Temporal	Coordinate a single agent run from start to finalization
4	Reasoning and Planning	river-agent :8007	Agentic loop, LLM routing, context construction, tool selection
5	Data Access and Connectors	Data Orchestration :8002 via TLO	Schema-aware, governed data retrieval from connected sources
6	Tool and Workflow Execution	TLO Gateway :8001	ACL-validated tool dispatch and result delivery to the reasoning loop
7	Governance and Safety	Backend :8005	Action level enforcement, approval gates, policy evaluation, audit
8	Monitoring and Observability	Backend :8005	Real-time telemetry, metric aggregation, alerting, health tracking

Master Architecture

The stack is ordered bottom-up: Layer 1 is the foundation (stored agent state), and Layer 8 is the operational envelope (observability). Layer 3 (Runtime Orchestration) is the central coordinator -- it drives Layer 4, writes to Layer 8, and gates through Layer 7 on every write-capable action.

Figure: Master Architecture

Solid arrows indicate the primary execution flow. Dotted arrows from L3 indicate that Runtime Orchestration directly interfaces with Governance (for gate checks) and Monitoring (for telemetry) throughout the run, not only at the end.

Layer 1: Agent Definition and Lifecycle

Layer 1 is the persistence foundation. Every agent run traces back to a configuration version record in this layer. Nothing executes unless a valid, deployed version exists here.

Components

Component	Responsibility
Agent Management Service	CRUD operations for agent configurations. Owns all writes to `river_agents.agents` and related tables. Validates field completeness before state transitions.
Version Control	Creates an immutable `agent_versions` snapshot on every post-deployment configuration change. The active version is the source of truth for all execution context.
Lifecycle State Machine	Enforces valid state transitions with pre-condition checks. No direct database write to `agents.status` is permitted from outside this component.

Versioning Rules

Every configuration change on a deployed or active agent creates a new version record rather than updating in place. The rules are:

`agents.current_version_id` always points to the version the next execution will use.
`agent_executions.agent_version_id` is set at run creation time and never updated. A run is always auditable against the exact configuration that produced it.
When an operator edits a live agent, a new version is written with `status = 'draft'`; on deploy, it becomes active and the previous version transitions to archived.
Rollback is implemented by updating `current_version_id` to an older version's ID. This does not create a new version record but does emit an audit event of type `agent_version_rolled_back`.
Only one version per agent can hold `status = 'active'` at a time. This is enforced by a partial unique index.
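
The versioning rules can be illustrated with a minimal in-memory sketch. The class and method names (`Agent`, `AgentVersion`, `edit`, `deploy`, `rollback`) are hypothetical and stand in for the real service and tables. The text does not say how version statuses change on rollback, so the sketch only moves the pointer and emits the audit event.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AgentVersion:
    version_id: str
    config: dict
    status: str = "draft"  # draft -> active -> archived


@dataclass
class Agent:
    agent_id: str
    versions: Dict[str, AgentVersion] = field(default_factory=dict)
    current_version_id: Optional[str] = None
    audit_events: List[dict] = field(default_factory=list)

    def edit(self, version_id: str, config: dict) -> None:
        """Editing writes a new draft version; nothing is updated in place."""
        self.versions[version_id] = AgentVersion(version_id, dict(config))

    def deploy(self, version_id: str) -> None:
        """Promote a draft to active and archive the previously active version."""
        for version in self.versions.values():
            if version.status == "active":
                version.status = "archived"
        self.versions[version_id].status = "active"
        self.current_version_id = version_id

    def rollback(self, version_id: str) -> None:
        """Repoint current_version_id; no new version record is created."""
        if version_id not in self.versions:
            raise KeyError(version_id)
        previous = self.current_version_id
        self.current_version_id = version_id
        # Status handling on rollback is unspecified in the text, so it is
        # deliberately left untouched here.
        self.audit_events.append(
            {"type": "agent_version_rolled_back", "from": previous, "to": version_id}
        )
```

In the real system the single-active-version rule is enforced by the database index rather than by application code as in `deploy` above.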

Figure: Versioning Model

State Transition Pre-Conditions

Transition	Pre-Condition Check	Failure Behavior
Configured -> Validated	Data source connectivity test, tool registry availability check, policy resolution, trigger config syntax validation	Transition blocked; validation errors returned as field-level detail
Validated -> Deployed	Temporal workflow registration succeeds, trigger listeners activated	Transition blocked; deployment error logged
Active -> Running	Agent in `active` state, concurrency check passes, rate limit not exceeded	Trigger enqueued or dropped per `on_concurrent_trigger` policy
Running -> Awaiting Approval	Action level gate triggered during execution	State serialized to DB; Temporal workflow suspended
Any -> Archived	No active running executions (or caller accepts forced cancellation), caller has `agent:delete` permission	Blocked if running executions exist without `force=true`
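
A rough sketch of how pre-condition checks might gate transitions follows. It covers only the Configured -> Validated, Validated -> Deployed, and Any -> Archived rows; the context keys (`data_sources_reachable`, `workflow_registered`, and so on) and the error messages are hypothetical stand-ins for the real checks.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[dict], List[str]]  # a check returns a list of error messages


def require(key: str, message: str) -> Check:
    """Build a check that fails with `message` unless ctx[key] is truthy."""
    return lambda ctx: [] if ctx.get(key) else [message]


def archive_check(ctx: dict) -> List[str]:
    errors = []
    if "agent:delete" not in ctx.get("permissions", ()):
        errors.append("caller lacks agent:delete permission")
    if ctx.get("running_executions", 0) > 0 and not ctx.get("force", False):
        errors.append("running executions exist; retry with force=true")
    return errors


# Context keys below are hypothetical stand-ins for the real checks.
TRANSITIONS: Dict[Tuple[str, str], List[Check]] = {
    ("configured", "validated"): [
        require("data_sources_reachable", "data source connectivity test failed"),
        require("tools_available", "tool registry unavailable"),
        require("policies_resolved", "policy resolution failed"),
        require("trigger_config_valid", "trigger config syntax is invalid"),
    ],
    ("validated", "deployed"): [
        require("workflow_registered", "Temporal workflow registration failed"),
        require("listeners_active", "trigger listeners not activated"),
    ],
}


class LifecycleStateMachine:
    def __init__(self, status: str = "configured") -> None:
        self.status = status

    def transition(self, target: str, ctx: dict) -> List[str]:
        """Move to `target` if every pre-condition passes; else return errors."""
        if target == "archived":  # "Any -> Archived" applies from every state
            checks = [archive_check]
        else:
            checks = TRANSITIONS.get((self.status, target))
            if checks is None:
                return [f"transition {self.status} -> {target} is not defined"]
        errors = [msg for check in checks for msg in check(ctx)]
        if not errors:
            self.status = target
        return errors
```

Returning the full error list rather than stopping at the first failure mirrors the "field-level detail" behavior described for validation failures.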

Figure: Lifecycle State Machine Detail

Layer 2: Trigger Ingestion and Dispatch

Layer 2 is the entry surface for all agent executions. Its sole job is to receive heterogeneous signals, normalize them into a uniform `ExecutionRequest`, validate the target agent's readiness, and hand off to the runtime layer.

Components

Component	Trigger Type	Ingestion Mechanism
Manual Trigger Handler	`manual`	Direct HTTP POST to `/api/v1/agents/:id/runs`
Scheduled Trigger Handler	`scheduled`	Internal cron scheduler reads `trigger_config.cron_expression` and `timezone`; polling interval is 10 seconds
Event Trigger Handler	`event`	Kafka consumer group `river-agents-events`; evaluates `trigger_config.event_type` and optional `payload_conditions` filter
API Trigger Handler	`api`	Authenticated POST to agent's dedicated endpoint; optional `payload_schema` validated on arrival
Threshold Trigger Handler	`threshold`	Metric monitor evaluates `trigger_config.threshold_expression` against configured metric source; `cooldown_seconds` prevents re-fire
Workflow Trigger Handler	`workflow`	Temporal signal listener bound to `trigger_config.workflow_id` and `signal_name`
Dispatcher	All	Normalizes trigger into `ExecutionRequest`, validates agent state and concurrency, enqueues to Execution Queue

ExecutionRequest Schema

Every trigger, regardless of source, is normalized into this structure before the Dispatcher submits it to the runtime layer.

Field	Type	Description
`request_id`	UUID	Unique identifier for this trigger event; used for deduplication and idempotency
`agent_id`	UUID	Target agent
`trigger_type`	enum	`manual`, `scheduled`, `event`, `api`, `threshold`, `workflow`
`trigger_source`	string	Origin label (e.g., `"cron:daily-8am"`, `"webhook:zendesk"`, `"api:ci-pipeline"`)
`trigger_payload`	JSONB	Optional context from the trigger source (ticket data, metric values, API body)
`requested_by`	UUID	User ID for manual/API triggers; system service ID for automated triggers
`requested_at`	timestamptz	When the trigger was received by the ingestion handler
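
The schema above could be modeled as follows. This is a sketch using standard-library dataclasses; the `normalize_trigger` helper and the shape of its raw input are assumptions for illustration, not the Dispatcher's actual interface.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, Optional


class TriggerType(str, Enum):
    MANUAL = "manual"
    SCHEDULED = "scheduled"
    EVENT = "event"
    API = "api"
    THRESHOLD = "threshold"
    WORKFLOW = "workflow"


@dataclass(frozen=True)
class ExecutionRequest:
    agent_id: uuid.UUID
    trigger_type: TriggerType
    trigger_source: str
    requested_by: uuid.UUID
    trigger_payload: Optional[Dict[str, Any]] = None
    # request_id doubles as the deduplication / idempotency key.
    request_id: uuid.UUID = field(default_factory=uuid.uuid4)
    requested_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def normalize_trigger(raw: Dict[str, Any]) -> ExecutionRequest:
    """Hypothetical helper: turn a handler-specific dict into a uniform request."""
    return ExecutionRequest(
        agent_id=uuid.UUID(raw["agent_id"]),
        trigger_type=TriggerType(raw["trigger_type"]),  # ValueError if unknown
        trigger_source=raw["trigger_source"],
        requested_by=uuid.UUID(raw["requested_by"]),
        trigger_payload=raw.get("trigger_payload"),
    )
```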

Trigger Processing Flow

Concurrency policy: if `allow_concurrent_runs` is `false` (the default) and a run is already in `running` status, the incoming trigger is handled per `on_concurrent_trigger`: `queue` (hold in queue until the current run finishes), `drop` (discard with log), or `replace` (cancel the current run and start fresh).
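
The concurrency decision can be sketched as a small pure function. The `Decision` enum and the function name are hypothetical; only the three policy values and the `allow_concurrent_runs` default come from the text above.

```python
from enum import Enum


class Decision(str, Enum):
    START = "start"      # no conflict: run immediately
    QUEUE = "queue"      # hold until the current run finishes
    DROP = "drop"        # discard with a log entry
    REPLACE = "replace"  # cancel the current run and start fresh


def resolve_trigger(
    has_running_run: bool,
    on_concurrent_trigger: str,
    allow_concurrent_runs: bool = False,
) -> Decision:
    """Decide what the Dispatcher does with an incoming trigger."""
    if allow_concurrent_runs or not has_running_run:
        return Decision.START
    if on_concurrent_trigger not in ("queue", "drop", "replace"):
        raise ValueError(f"unknown policy: {on_concurrent_trigger!r}")
    return Decision(on_concurrent_trigger)
```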

Layer 3: Runtime Orchestration

Layer 3 is the conductor of every agent execution. It bridges the static configuration in Layer 1 with the dynamic reasoning in Layer 4, manages approval gate hibernation, and ensures every execution is durable, retriable, and fully traced.

Components

Component	Responsibility
Agent Runtime Runner	The root Temporal workflow for a single agent run. Owns the full execution lifecycle from context initialization through finalization and memory write-back.
Context Builder	Assembles the `AgentContext` bundle: agent version config, `long_term_context` snapshot, trigger payload, connected data source metadata, and active tool registry. Passed to river-agent on every reasoning invocation.
State Serializer	On approval gate: serializes conversation history, pending tool call, current observations, and memory snapshot to `agent_executions.serialized_state`. On approval resolution: deserializes and restores the full `AgentContext` to resume from exactly where the run paused.
Retry and Timeout Manager	Enforces per-tool timeouts (default: 30 seconds), per-run turn limits (default: 15 turns), and Temporal retry policies for transient failures (3 retries, exponential backoff).

Execution Constraints

Constraint	Default	Configuration Field
Max turns per run	15	`agent_versions.runtime_config.max_turns`
Per-tool timeout	30 seconds	`agent_versions.runtime_config.tool_timeout_seconds`
Approval gate timeout	72 hours	`agent_versions.approval_rules.timeout_hours`
Max retries on transient failure	3	`agent_versions.runtime_config.max_retries`
Context bundle max size	500 KB	Hard limit; enforced by Context Builder
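
As a sketch, the defaults in this table might be carried in a small config object. The backoff base delay and factor are assumptions (the text only states three retries with exponential backoff), as are measuring the context bundle as UTF-8 JSON and treating 1 KB as 1024 bytes.

```python
import json
from dataclasses import dataclass
from typing import List

CONTEXT_MAX_BYTES = 500 * 1024  # 500 KB hard limit; 1 KB = 1024 bytes assumed


@dataclass(frozen=True)
class RuntimeConfig:
    max_turns: int = 15
    tool_timeout_seconds: int = 30
    max_retries: int = 3


def backoff_delays(
    max_retries: int, base_seconds: float = 1.0, factor: float = 2.0
) -> List[float]:
    """Delay before each retry; base_seconds and factor are assumed values."""
    return [base_seconds * factor**attempt for attempt in range(max_retries)]


def check_context_size(context: dict) -> int:
    """Return the serialized size in bytes, or raise if over the hard limit."""
    size = len(json.dumps(context).encode("utf-8"))
    if size > CONTEXT_MAX_BYTES:
        raise ValueError(f"context bundle is {size} bytes; limit is {CONTEXT_MAX_BYTES}")
    return size
```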

Runtime Orchestration Flow

Approval Gate State Serialization

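When a gate is triggered, the execution must be fully serializable so that it can be suspended with zero active compute. The State Serializer writes the conversation history, the pending tool call, the current observations, and the memory snapshot to `agent_executions.serialized_state` (JSONB).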

Figure: State Serialization for Approval Gates
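
A minimal round-trip sketch of the serialized state follows. The four fields correspond to the items the State Serializer is described as persisting; the exact key names and the `SerializedState` class are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class SerializedState:
    conversation_history: List[Dict[str, Any]]
    pending_tool_call: Optional[Dict[str, Any]]
    observations: List[str]
    memory_snapshot: Dict[str, Any]


def serialize_state(state: SerializedState) -> str:
    """Produce the JSON document that would be stored in the JSONB column."""
    return json.dumps(asdict(state), sort_keys=True)


def deserialize_state(raw: str) -> SerializedState:
    """Restore the state so the run can resume from where it paused."""
    return SerializedState(**json.loads(raw))
```

The property that matters is that `deserialize_state(serialize_state(s)) == s`, which only holds if everything in the state is JSON-representable.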

Layer 4: Reasoning and Planning

Layer 4 is the intelligence core. It receives a fully assembled `AgentContext` from Layer 3, runs the agentic loop, and returns a structured result. This layer is implemented by the stateless river-agent microservice at :8007.

Components

Component	Responsibility
River-Agent Agentic Loop	The Reason -> Act -> Observe cycle. Each turn: classify the task, call the appropriate LLM via RiverCore, parse the response for a tool call or finalization signal, and feed the observation back into context for the next turn.
RiverCore (Multi-LLM Router)	Selects the optimal LLM provider and model for each turn based on classified task complexity. Handles provider failover transparently within a turn.
System Prompt Builder	Dynamically constructs the system prompt per run by composing: the agent's `instruction_set`, the active tool registry with Pydantic schemas, connected data source metadata, current governance constraints, and the `long_term_context` snapshot.
Long-Term Context Memory	The accumulated knowledge from past runs, injected into the system prompt. Updated by Backend after each run via the `memory_updates` delta returned in the reasoning result.
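
The Reason -> Act -> Observe cycle can be sketched as below. This is an illustration only: the LLM call is replaced by a caller-supplied `model_step` function, task classification and RiverCore routing are omitted, and the result keys and status strings are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# model_step returns (reasoning, tool_name, tool_arguments);
# tool_name is None when the model signals finalization.
ModelStep = Callable[[List[dict]], Tuple[str, Optional[str], Dict[str, Any]]]


def run_agentic_loop(
    model_step: ModelStep,
    tools: Dict[str, Callable[..., Any]],
    max_turns: int = 15,
) -> dict:
    context: List[dict] = []
    for turn in range(1, max_turns + 1):
        # Reason: ask the model what to do next given prior observations.
        reasoning, tool_name, arguments = model_step(context)
        if tool_name is None:
            return {"status": "completed", "answer": reasoning, "turns": turn}
        # Act: dispatch the selected tool.
        observation = tools[tool_name](**arguments)
        # Observe: feed the result back into context for the next turn.
        context.append(
            {
                "turn_number": turn,
                "reasoning": reasoning,
                "tool_name": tool_name,
                "observation": observation,
            }
        )
    return {"status": "max_turns_exceeded", "answer": None, "turns": max_turns}
```

The turn limit bounds the loop even if the model never emits a finalization signal.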

Multi-LLM Routing Strategy

RiverCore routes each turn to one of four model tiers based on task complexity classification. The classification is a fast internal heuristic step run before the main reasoning call.

Figure: Multi-LLM Routing Strategy

Reasoning Turn Structure

Every turn in the agentic loop produces a structured record written to `river_agents.agent_logs`. Engineers implementing or querying the log store should treat these fields as the canonical schema for turn-level data.

Field	Type	Description
`turn_number`	integer	Sequential index starting at 1
`turn_type`	enum	`reasoning`, `action`, `observation`, `interaction`
`reasoning`	text	LLM chain-of-thought for this turn; stored for audit and debugging
`tool_name`	string	Name of the tool selected (null for reasoning-only and finalization turns)
`tool_arguments`	JSONB	Structured arguments for the tool call (null if no tool selected)
`observation`	text	Result returned by the tool or injected from the approval resolution
`model_used`	string	Exact model identifier (e.g., `claude-sonnet-4-6`, `gpt-4o`)
`tokens_used`	integer	Total input + output token count for this turn
`latency_ms`	integer	Wall-clock time for the turn including LLM call and tool dispatch
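
For code that reads or writes turn records, a typed representation along these lines may help. It is a sketch: which fields other than `tool_name` and `tool_arguments` may be null is an assumption, and `summarize_run` is a hypothetical helper rather than part of the log store.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

TURN_TYPES = {"reasoning", "action", "observation", "interaction"}


@dataclass
class TurnRecord:
    turn_number: int
    turn_type: str
    reasoning: Optional[str] = None
    tool_name: Optional[str] = None
    tool_arguments: Optional[Dict[str, Any]] = None
    observation: Optional[str] = None
    model_used: Optional[str] = None
    tokens_used: int = 0
    latency_ms: int = 0

    def __post_init__(self) -> None:
        if self.turn_number < 1:
            raise ValueError("turn_number starts at 1")
        if self.turn_type not in TURN_TYPES:
            raise ValueError(f"unknown turn_type: {self.turn_type!r}")
        if self.tool_name is None and self.tool_arguments is not None:
            raise ValueError("tool_arguments must be null when no tool is selected")


def summarize_run(turns: List[TurnRecord]) -> dict:
    """Roll turn-level records up into run-level totals."""
    return {
        "turns": len(turns),
        "tokens_used": sum(t.tokens_used for t in turns),
        "latency_ms": sum(t.latency_ms for t in turns),
        "tools_called": sorted({t.tool_name for t in turns if t.tool_name}),
    }
```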

Layer 5: Data Access and Connectors

Layer 5 is the agent's governed interface to enterprise data. All data retrieval -- schema lookup, query execution, and catalog search -- routes through this layer. No direct database connections are made from the reasoning layer.

Components

Component	Responsibility
Schema Discovery Service	Provides the reasoning engine with accurate table, column, and relationship metadata for all connected data sources. Auto-discovers on connection; cached with configurable TTL per data source.
Query Execution Engine	Accepts structured SQL, NoSQL, or API query specs from the reasoning layer. Executes via Data Orchestration Service (:8002) through TLO Gateway. Returns normalized result sets.
Semantic Catalog (Qdrant)	Vector-based search over data source metadata. Enables the reasoning engine to resolve table and column references by semantic meaning when exact names are unavailable or ambiguous.
Data Connectors	Type-specific adapters for each supported source: PostgreSQL, MySQL, Snowflake, BigQuery, Redshift, MongoDB, Elasticsearch, REST APIs, Salesforce, HubSpot, Stripe, and others. Managed by Data Orchestration Service.

Data Access Flow

ACL note: `data_source:view` is required for schema discovery; `data_source:query` is required for query execution. Both are checked per call at TLO Gateway, not cached from the start of the run. A data source ACL change takes effect immediately on the next tool invocation, even within an active run.
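
The per-call check can be illustrated with a toy gateway. `AclStore` and `dispatch_tool` are hypothetical names; the point of the sketch is that the permission is looked up at dispatch time, so a revocation made mid-run affects the very next call.

```python
from typing import Set, Tuple

REQUIRED_PERMISSION = {
    "discover_schema": "data_source:view",
    "execute_query": "data_source:query",
}


class AclStore:
    """In-memory stand-in for the permission source consulted by the gateway."""

    def __init__(self) -> None:
        self._grants: Set[Tuple[str, str, str]] = set()

    def grant(self, principal: str, data_source_id: str, permission: str) -> None:
        self._grants.add((principal, data_source_id, permission))

    def revoke(self, principal: str, data_source_id: str, permission: str) -> None:
        self._grants.discard((principal, data_source_id, permission))

    def allows(self, principal: str, data_source_id: str, permission: str) -> bool:
        return (principal, data_source_id, permission) in self._grants


def dispatch_tool(
    acl: AclStore, principal: str, tool_name: str, data_source_id: str
) -> str:
    """Check the ACL on every call; nothing is cached from the start of the run."""
    permission = REQUIRED_PERMISSION[tool_name]
    if not acl.allows(principal, data_source_id, permission):
        raise PermissionError(f"{principal} lacks {permission} on {data_source_id}")
    return f"{tool_name} dispatched to {data_source_id}"
```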

Layer 6: Tool and Workflow Execution

Layer 6 is the dispatch and validation surface for every non-reasoning action the agent takes. All tool calls route through TLO Gateway, which re-validates ACL permissions before forwarding to the target service.

Components

Component	Responsibility
Tool Registry	Maintains the catalog of all available tools with their input and output Pydantic schemas. Includes platform tools and dynamically registered custom tools added via OpenAPI spec upload.
Tool Executor	Routes tool call requests from the reasoning layer through TLO Gateway. Parses the response, formats it as a structured observation, and returns it to Layer 3 for injection into the next reasoning turn.
Workflow Invoker	Triggers Temporal sub-workflows for operations that span multiple steps or services (e.g., a data pipeline run, a multi-system onboarding sequence). Returns a workflow execution ID as the observation.
Result Validator	Validates tool execution results against the expected output schema for that tool. Malformed results are caught here before being injected into the reasoning context; the turn is marked as a tool error.
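
The Result Validator's role can be sketched with a deliberately simple schema format. The real registry is described as using Pydantic schemas; the field-to-type mapping, the `validate_result` function, and the `turn_status` values below are simplified, hypothetical stand-ins.

```python
from typing import Any, Dict

# Hypothetical output schema: field name -> expected Python type.
OUTPUT_SCHEMAS: Dict[str, Dict[str, type]] = {
    "execute_query": {"rows": list, "row_count": int},
}


def validate_result(tool_name: str, result: Any) -> dict:
    """Check a tool result before it is injected into the reasoning context."""
    schema = OUTPUT_SCHEMAS[tool_name]
    errors = []
    if not isinstance(result, dict):
        errors.append("result is not an object")
    else:
        for field_name, expected in schema.items():
            if field_name not in result:
                errors.append(f"missing field: {field_name}")
            elif not isinstance(result[field_name], expected):
                errors.append(f"wrong type for field: {field_name}")
    if errors:
        # Malformed results never reach the model; the turn is a tool error.
        return {"turn_status": "tool_error", "errors": errors}
    return {"turn_status": "ok", "observation": result}
```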

Tool Categories

Category	Count	ACL Check	LLM Required	Examples
AI Reasoning	6	None -- internal processing only	Yes (per-turn routing)	`classify_intent`, `check_governance`, `generate_query`, `search_catalog`, `recommend_visualization`, `explain_results`
Execution	10+	Per-call check by TLO	No -- direct service call	`execute_query`, `discover_schema`, `send_email`, `send_slack`, `create_ticket`, `update_crm`, `write_back`, `restart_workflow`, `scale_infrastructure`
Interaction	1	None	No	`request_approval` -- sends approval request via WebSocket and signals Temporal to suspend
Custom	N (dynamic)	Per-call check by TLO	No	Registered from OpenAPI specs; schema validated at registration time

Implementation note on `request_approval`: this tool does not return an observation to the reasoning loop. Calling it is a terminal action for the current turn -- it triggers the State Serializer in Layer 3, creates the approval record, and suspends the Temporal workflow. The reasoning loop resumes from the next turn only after the approval signal is received.

Layer 7: Governance and Safety

Layer 7 is the enforcement boundary between what an agent reasons it should do and what it is permitted to do. Every write-capable action passes through this layer before dispatch. Governance checks are not advisory -- they block, gate, or permit, with no silent pass-through.

Components

Component	Responsibility
Action Level Checker	Evaluates each tool call against the agent's `action_level`. Determines whether the call proceeds, is returned as a read-only proposal, or requires an approval gate.
Approval Gate Service	Creates `approval_requests` records, assigns approvers from `approval_rules`, dispatches notifications via Novu, and processes approval signals back to the Temporal workflow.
Policy Engine	Evaluates the current execution context against organization-level and workspace-level governance policies (e.g., row count limits on data exports, time-of-day restrictions on production writes).
Approval Notification	Delivers approval requests to configured channels per `notification_config`: Slack, email, in-app, or PagerDuty. Handles escalation if no response is received before `timeout_hours`.

Governance Decision Flow

Policy Engine behavior: policies are evaluated against the `execution_context` object, which includes the agent identity, the workspace, the tool name, the tool arguments, and the data classification of involved data sources. Policies can match on any of these fields. A policy match always produces an audit event regardless of the enforcement action (block, gate, log, or alert).
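
The matching behavior can be sketched as follows. Several details are assumptions made for illustration: the context key names, exact-equality matching, the `policy_matched` event name, the `permit` result when nothing matches, and the rule that the strictest action wins when several policies match.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Assumed precedence when several policies match: strictest action wins.
SEVERITY = {"log": 0, "alert": 1, "gate": 2, "block": 3}


@dataclass(frozen=True)
class Policy:
    name: str
    match: Dict[str, Any]  # execution_context field -> required value
    action: str            # one of: block, gate, log, alert


def evaluate_policies(
    policies: List[Policy],
    execution_context: Dict[str, Any],
    audit_log: List[dict],
) -> str:
    decision = "permit"
    for policy in policies:
        if all(execution_context.get(k) == v for k, v in policy.match.items()):
            # Every match is audited, whatever the enforcement action is.
            audit_log.append(
                {"event": "policy_matched", "policy": policy.name, "action": policy.action}
            )
            if decision == "permit" or SEVERITY[policy.action] > SEVERITY[decision]:
                decision = policy.action
    return decision
```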

Layer 8: Monitoring and Observability

Layer 8 collects structured telemetry from across the agent ecosystem and makes it available via the real-time WebSocket feed, the metrics API, the audit log query interface, and the alert engine. It has no authority over execution -- it observes and reports.

Components

Component	Responsibility
Real-Time Telemetry	Emits structured events over WebSocket to the frontend Run Detail view. Events are emitted at each turn boundary, tool call, and state transition.
Metric Aggregation	Calculates rolling metrics at per-agent and system-wide scopes across 1h, 24h, 7d, and 30d windows. Stores aggregates in `river_agents.agent_metrics_hourly`.
Execution Logging	Persists the complete trace of every execution to `river_agents.agent_logs`: every reasoning turn, tool call, observation, approval decision, and finalization event.
Anomaly Detection and Alert Engine	Monitors per-agent metrics for threshold breaches (e.g., failure rate above 20%, latency P99 above 10 seconds). Dispatches alerts to configured channels via Novu.
Health Monitor	Derives operational health status for each agent from recent execution outcomes. Updates `agents.health_status` on a 5-minute evaluation cycle.

Key Metrics

Metric	Scope	Aggregation	Used In
Total Runs	Per-agent, System-wide	Count	Agent Detail KPI, System Monitoring
Success Rate	Per-agent	`completed / total` as percentage	Agent Detail KPI, Health Score
Average Latency	Per-agent, Per-tool	P50, P90, P99 in milliseconds	Agent Detail KPI, System Monitoring
Failure Count	Per-agent, System-wide	Count with categorized reasons	Agent Detail KPI, Alert Engine
Throughput	System-wide	Runs per hour and per day	System Monitoring
Pending Approvals	Per-agent, System-wide	Count	Agent Detail badge, Runs page
Token Cost	Per-agent, Per-run	Total tokens multiplied by model unit pricing	Cost Tracking
Actions Taken	Per-agent	Count grouped by tool name	Agent Overview tab
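
A few of these aggregations are sketched below. The nearest-rank percentile method and the per-1,000-token pricing unit are assumptions; the table does not specify either, and the model names and prices in any call are placeholders.

```python
import math
from typing import Dict, List


def success_rate(completed: int, total: int) -> float:
    """completed / total as a percentage; 0.0 when there are no runs."""
    return 0.0 if total == 0 else 100.0 * completed / total


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (assumed method) over a non-empty list."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def token_cost(tokens_by_model: Dict[str, int], price_per_1k: Dict[str, float]) -> float:
    """Total tokens multiplied by each model's unit price (per 1,000 tokens assumed)."""
    return sum(
        tokens / 1000 * price_per_1k[model]
        for model, tokens in tokens_by_model.items()
    )
```

For example, `percentile(latencies_ms, 99)` yields the P99 latency that the alert engine would compare against its threshold.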

Telemetry Data Flow

Cross-Layer Interaction Map

The full system maps to five infrastructure tiers. The layer groupings above are functional; the tiers below reflect the actual deployment topology.

Figure: Cross-Layer Interaction Summary

Layer-to-tier mapping:

Layers	Tier
L1, L2, L7, L8	Service Layer (Backend :8005)
L3	Orchestration Layer (Backend :8005 + Temporal :7233)
L4	Orchestration Layer (river-agent :8007)
L5	Service Layer (Data Orchestration :8002)
L6	API Layer (TLO Gateway :8001 as dispatch boundary)
