
Architecture Layers

River Agents are organized into eight functional layers, each with a distinct responsibility boundary. This document specifies the components, contracts, constraints, and interaction patterns for each layer. It is intended as the primary reference for engineers implementing or maintaining the system.


Layer Reference

| Layer | Name | Primary Owner Service | Core Responsibility |
|---|---|---|---|
| 1 | Agent Definition and Lifecycle | Backend :8005 | Persistent identity, versioned configuration, lifecycle state machine |
| 2 | Trigger Ingestion and Dispatch | Backend :8005 | Normalize all trigger types into a uniform execution request |
| 3 | Runtime Orchestration | Backend :8005 + Temporal | Coordinate a single agent run from start to finalization |
| 4 | Reasoning and Planning | river-agent :8007 | Agentic loop, LLM routing, context construction, tool selection |
| 5 | Data Access and Connectors | Data Orchestration :8002 via TLO | Schema-aware, governed data retrieval from connected sources |
| 6 | Tool and Workflow Execution | TLO Gateway :8001 | ACL-validated tool dispatch and result delivery to the reasoning loop |
| 7 | Governance and Safety | Backend :8005 | Action level enforcement, approval gates, policy evaluation, audit |
| 8 | Monitoring and Observability | Backend :8005 | Real-time telemetry, metric aggregation, alerting, health tracking |

Master Architecture

The stack is ordered bottom-up: Layer 1 is the foundation (stored agent state), and Layer 8 is the operational envelope (observability). Layer 3 (Runtime Orchestration) is the central coordinator -- it drives Layer 4, writes to Layer 8, and gates through Layer 7 on every write-capable action.

[Diagram: Master Architecture]

Solid arrows indicate the primary execution flow. Dotted arrows from L3 indicate that Runtime Orchestration directly interfaces with Governance (for gate checks) and Monitoring (for telemetry) throughout the run, not only at the end.


Layer 1: Agent Definition and Lifecycle

Layer 1 is the persistence foundation. Every agent run traces back to a configuration version record in this layer. Nothing executes unless a valid, deployed version exists here.

Components

| Component | Responsibility |
|---|---|
| Agent Management Service | CRUD operations for agent configurations. Owns all writes to river_agents.agents and related tables. Validates field completeness before state transitions. |
| Version Control | Creates an immutable agent_versions snapshot on every post-deployment configuration change. The active version is the source of truth for all execution context. |
| Lifecycle State Machine | Enforces valid state transitions with pre-condition checks. No direct database write to agents.status is permitted from outside this component. |

Versioning Rules

Every configuration change on a deployed or active agent creates a new version record rather than updating in place. The rules are:

  1. agents.current_version_id always points to the version the next execution will use.
  2. agent_executions.agent_version_id is set at run creation time and never updated. A run is always auditable against the exact configuration that produced it.
  3. When an operator edits a live agent, a new version is written with status = 'draft'; on deploy, it becomes active and the previous version transitions to archived.
  4. Rollback is implemented by updating current_version_id to an older version's ID. This does not create a new version record but does emit an audit event of type agent_version_rolled_back.
  5. Only one version per agent can hold status = 'active' at a time. This is enforced by a partial unique index.
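The rules above can be sketched with a small in-memory model. This is a minimal illustration, not the service's implementation: the `Agent` and `AgentVersion` classes and their method names are hypothetical, and the sketch leaves version statuses untouched on rollback because the text does not say how they change.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AgentVersion:
    version_id: str
    status: str = "draft"  # 'draft', 'active', or 'archived'


@dataclass
class Agent:
    versions: Dict[str, AgentVersion] = field(default_factory=dict)
    current_version_id: Optional[str] = None
    audit_events: List[str] = field(default_factory=list)

    def edit(self, version_id: str) -> AgentVersion:
        # Rule 3: an edit writes a new draft; nothing is updated in place.
        version = AgentVersion(version_id)
        self.versions[version_id] = version
        return version

    def deploy(self, version_id: str) -> None:
        # Rules 3 and 5: archive the previous active version so that at most
        # one version holds status 'active' at a time.
        for version in self.versions.values():
            if version.status == "active":
                version.status = "archived"
        self.versions[version_id].status = "active"
        self.current_version_id = version_id  # Rule 1

    def rollback(self, version_id: str) -> None:
        # Rule 4: only the pointer moves; no new version record is created.
        # Assumption: statuses are left as they are (not specified in the text).
        if version_id not in self.versions:
            raise KeyError(version_id)
        self.current_version_id = version_id
        self.audit_events.append("agent_version_rolled_back")
```

In a relational store, rule 5 would be enforced by the partial unique index rather than by the loop in `deploy`; the loop only mimics that guarantee here.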

Versioning Model

State Transition Pre-Conditions

| Transition | Pre-Condition Check | Failure Behavior |
|---|---|---|
| Configured -> Validated | Data source connectivity test, tool registry availability check, policy resolution, trigger config syntax validation | Transition blocked; validation errors returned as field-level detail |
| Validated -> Deployed | Temporal workflow registration succeeds, trigger listeners activated | Transition blocked; deployment error logged |
| Active -> Running | Agent in active state, concurrency check passes, rate limit not exceeded | Trigger enqueued or dropped per on_concurrent_trigger policy |
| Running -> Awaiting Approval | Action level gate triggered during execution | State serialized to DB; Temporal workflow suspended |
| Any -> Archived | No active running executions (or caller accepts forced cancellation), caller has agent:delete permission | Blocked if running executions exist without force=true |
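A guarded transition function is one way to picture how the state machine applies these checks. The sketch below encodes only the transitions listed in the table; the `State` enum values, the `TransitionBlocked` exception, and the `precondition` callback are hypothetical names chosen for illustration, and transitions not shown in the table (for example, how an agent reaches Active) are deliberately left out.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    CONFIGURED = "configured"
    VALIDATED = "validated"
    DEPLOYED = "deployed"
    ACTIVE = "active"
    RUNNING = "running"
    AWAITING_APPROVAL = "awaiting_approval"
    ARCHIVED = "archived"


class TransitionBlocked(Exception):
    """Raised when a transition is unlisted or its pre-condition fails."""


# Only the pairs from the table; "Any -> Archived" is handled in transition().
ALLOWED = {
    (State.CONFIGURED, State.VALIDATED),
    (State.VALIDATED, State.DEPLOYED),
    (State.ACTIVE, State.RUNNING),
    (State.RUNNING, State.AWAITING_APPROVAL),
}


def transition(current: State, target: State,
               precondition: Callable[[], bool]) -> State:
    listed = target is State.ARCHIVED or (current, target) in ALLOWED
    if not listed:
        raise TransitionBlocked(
            f"{current.value} -> {target.value} is not a listed transition")
    # The pre-condition stands in for the checks in the second table column.
    if not precondition():
        raise TransitionBlocked(
            f"pre-condition failed for {current.value} -> {target.value}")
    return target
```

Routing every status change through one function like this mirrors the rule that nothing outside the Lifecycle State Machine writes agents.status directly.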

Lifecycle State Machine Detail


Layer 2: Trigger Ingestion and Dispatch

Layer 2 is the entry surface for all agent executions. Its sole job is to receive heterogeneous signals, normalize them into a uniform ExecutionRequest, validate the target agent's readiness, and hand off to the runtime layer.

Components

| Component | Trigger Type | Ingestion Mechanism |
|---|---|---|
| Manual Trigger Handler | manual | Direct HTTP POST to /api/v1/agents/:id/runs |
| Scheduled Trigger Handler | scheduled | Internal cron scheduler reads trigger_config.cron_expression and timezone; polling interval is 10 seconds |
| Event Trigger Handler | event | Kafka consumer group river-agents-events; evaluates trigger_config.event_type and optional payload_conditions filter |
| API Trigger Handler | api | Authenticated POST to agent's dedicated endpoint; optional payload_schema validated on arrival |
| Threshold Trigger Handler | threshold | Metric monitor evaluates trigger_config.threshold_expression against configured metric source; cooldown_seconds prevents re-fire |
| Workflow Trigger Handler | workflow | Temporal signal listener bound to trigger_config.workflow_id and signal_name |
| Dispatcher | All | Normalizes trigger into ExecutionRequest, validates agent state and concurrency, enqueues to Execution Queue |

ExecutionRequest Schema

Every trigger, regardless of source, is normalized into this structure before the Dispatcher submits it to the runtime layer.

| Field | Type | Description |
|---|---|---|
| request_id | UUID | Unique identifier for this trigger event; used for deduplication and idempotency |
| agent_id | UUID | Target agent |
| trigger_type | enum | manual, scheduled, event, api, threshold, workflow |
| trigger_source | string | Origin label (e.g., "cron:daily-8am", "webhook:zendesk", "api:ci-pipeline") |
| trigger_payload | JSONB | Optional context from the trigger source (ticket data, metric values, API body) |
| requested_by | UUID | User ID for manual/API triggers; system service ID for automated triggers |
| requested_at | timestamptz | When the trigger was received by the ingestion handler |
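As a rough illustration, the schema could be modeled as an immutable record. The field names and enum values come from the table; the `ExecutionRequest` dataclass layout, the `TriggerType` enum class, and the `normalize` helper are assumptions made for this sketch, not the Dispatcher's actual code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, Optional
from uuid import UUID, uuid4


class TriggerType(str, Enum):
    MANUAL = "manual"
    SCHEDULED = "scheduled"
    EVENT = "event"
    API = "api"
    THRESHOLD = "threshold"
    WORKFLOW = "workflow"


@dataclass(frozen=True)
class ExecutionRequest:
    agent_id: UUID
    trigger_type: TriggerType
    trigger_source: str
    requested_by: UUID
    trigger_payload: Optional[Dict[str, Any]] = None
    # request_id doubles as the deduplication / idempotency key.
    request_id: UUID = field(default_factory=uuid4)
    # Timezone-aware timestamp, matching the timestamptz column type.
    requested_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


def normalize(agent_id: UUID, trigger_type: str, trigger_source: str,
              requested_by: UUID,
              payload: Optional[Dict[str, Any]] = None) -> ExecutionRequest:
    # TriggerType(...) raises ValueError for an unknown trigger type.
    return ExecutionRequest(
        agent_id=agent_id,
        trigger_type=TriggerType(trigger_type),
        trigger_source=trigger_source,
        requested_by=requested_by,
        trigger_payload=payload,
    )
```

Freezing the dataclass reflects the idea that a request is a record of what was received and is not edited after normalization.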

Trigger Processing Flow

Concurrency policy: if allow_concurrent_runs is false (the default) and a run is already in running status, the incoming trigger is handled per on_concurrent_trigger: queue (hold in queue until current run finishes), drop (discard with log), or replace (cancel the current run and start fresh).
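The concurrency policy reduces to a small decision function. The sketch below is illustrative only: the function name `resolve_trigger` and the returned label `"start"` are hypothetical, while `queue`, `drop`, and `replace` are the policy values named above.

```python
def resolve_trigger(allow_concurrent_runs: bool, run_in_progress: bool,
                    on_concurrent_trigger: str) -> str:
    """Return 'start', 'queue', 'drop', or 'replace' for an incoming trigger."""
    # No conflict: concurrency is allowed or nothing is currently running.
    if allow_concurrent_runs or not run_in_progress:
        return "start"
    if on_concurrent_trigger not in ("queue", "drop", "replace"):
        raise ValueError(
            f"unknown on_concurrent_trigger: {on_concurrent_trigger!r}")
    # queue: hold until the current run finishes
    # drop: discard with a log entry
    # replace: cancel the current run and start fresh
    return on_concurrent_trigger
```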


Layer 3: Runtime Orchestration

Layer 3 is the conductor of every agent execution. It bridges the static configuration in Layer 1 with the dynamic reasoning in Layer 4, manages approval gate hibernation, and ensures every execution is durable, retriable, and fully traced.

Components

| Component | Responsibility |
|---|---|
| Agent Runtime Runner | The root Temporal workflow for a single agent run. Owns the full execution lifecycle from context initialization through finalization and memory write-back. |
| Context Builder | Assembles the AgentContext bundle: agent version config, long_term_context snapshot, trigger payload, connected data source metadata, and active tool registry. Passed to river-agent on every reasoning invocation. |
| State Serializer | On approval gate: serializes conversation history, pending tool call, current observations, and memory snapshot to agent_executions.serialized_state. On approval resolution: deserializes and restores the full AgentContext to resume from exactly where the run paused. |
| Retry and Timeout Manager | Enforces per-tool timeouts (default: 30 seconds), per-run turn limits (default: 15 turns), and Temporal retry policies for transient failures (3 retries, exponential backoff). |

Execution Constraints

| Constraint | Default | Configuration Field |
|---|---|---|
| Max turns per run | 15 | agent_versions.runtime_config.max_turns |
| Per-tool timeout | 30 seconds | agent_versions.runtime_config.tool_timeout_seconds |
| Approval gate timeout | 72 hours | agent_versions.approval_rules.timeout_hours |
| Max retries on transient failure | 3 | agent_versions.runtime_config.max_retries |
| Context bundle max size | 500 KB | Hard limit; enforced by Context Builder |
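To make the retry defaults concrete, here is a plain-Python sketch of retrying with exponential backoff. It is not Temporal's retry policy API. The 1-second base delay and factor of 2 are assumptions (the text only says "exponential backoff"), "3 retries" is read as three attempts after the initial one, and `RuntimeConfig`, `TransientError`, and `call_with_retries` are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a failure that is worth retrying."""


@dataclass(frozen=True)
class RuntimeConfig:
    max_turns: int = 15
    tool_timeout_seconds: int = 30
    max_retries: int = 3


def backoff_delays(max_retries: int, base_seconds: float = 1.0,
                   factor: float = 2.0) -> List[float]:
    # Assumed schedule: base, base*factor, base*factor^2, ...
    return [base_seconds * factor ** attempt for attempt in range(max_retries)]


def call_with_retries(fn: Callable[[], T], config: RuntimeConfig,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    delays = backoff_delays(config.max_retries)
    for attempt in range(config.max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == config.max_retries:
                raise  # retries exhausted; surface the failure
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting.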

Runtime Orchestration Flow

Approval Gate State Serialization

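When a gate is triggered, the run must be able to pause with zero active compute. The State Serializer writes the conversation history, the pending tool call, the current observations, and the memory snapshot to agent_executions.serialized_state (JSONB), so the run can later resume from exactly where it paused.

[Diagram: State Serialization for Approval Gates]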
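A round trip through JSON shows the idea behind pausing a run with no live process. The four fields below follow the items the State Serializer is described as persisting; the `SerializedState` class, its key names, and the two helper functions are hypothetical and do not reflect the real JSONB layout.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class SerializedState:
    conversation_history: List[Dict[str, Any]]
    pending_tool_call: Optional[Dict[str, Any]]
    observations: List[str]
    memory_snapshot: Dict[str, Any]


def serialize(state: SerializedState) -> str:
    # The resulting string is what would be stored in the JSONB column.
    return json.dumps(asdict(state))


def deserialize(raw: str) -> SerializedState:
    # Rebuilds the state so the run can resume where it paused.
    return SerializedState(**json.loads(raw))
```

Everything in the state must be JSON-representable for this to work, which is why the sketch uses only dicts, lists, and strings.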


Layer 4: Reasoning and Planning

Layer 4 is the intelligence core. It receives a fully assembled AgentContext from Layer 3, runs the agentic loop, and returns a structured result. This layer is implemented by the stateless river-agent microservice at :8007.

Components

| Component | Responsibility |
|---|---|
| River-Agent Agentic Loop | The Reason -> Act -> Observe cycle. Each turn: classify the task, call the appropriate LLM via RiverCore, parse the response for a tool call or finalization signal, and feed the observation back into context for the next turn. |
| RiverCore (Multi-LLM Router) | Selects the optimal LLM provider and model for each turn based on classified task complexity. Handles provider failover transparently within a turn. |
| System Prompt Builder | Dynamically constructs the system prompt per run by composing: the agent's instruction_set, the active tool registry with Pydantic schemas, connected data source metadata, current governance constraints, and the long_term_context snapshot. |
| Long-Term Context Memory | The accumulated knowledge from past runs, injected into the system prompt. Updated by Backend after each run via the memory_updates delta returned in the reasoning result. |
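The Reason -> Act -> Observe cycle can be sketched as a bounded loop. This is a simplified illustration under several assumptions: the LLM is a plain callable, a decision is a dict with either a `"final"` key or `"tool"` and `"args"` keys, task classification and RiverCore routing are omitted, and raising an error at the turn limit is a placeholder for behavior the text does not describe.

```python
from typing import Any, Callable, Dict, List

# Hypothetical decision shape:
#   {"tool": "<name>", "args": {...}}  -> call a tool
#   {"final": "<text>"}                -> finalization signal
Decision = Dict[str, Any]


def run_loop(llm: Callable[[List[str]], Decision],
             tools: Dict[str, Callable[..., str]],
             task: str, max_turns: int = 15) -> str:
    context: List[str] = [task]
    for _ in range(max_turns):
        decision = llm(context)                  # Reason
        if "final" in decision:
            return decision["final"]
        tool = tools[decision["tool"]]
        observation = tool(**decision["args"])   # Act
        context.append(observation)              # Observe
    raise RuntimeError("max_turns reached without a finalization signal")
```

A scripted stand-in for the model is enough to exercise the loop:

```python
def fake_llm(context):
    if len(context) == 1:
        return {"tool": "echo", "args": {"text": "hi"}}
    return {"final": "saw " + context[-1]}

result = run_loop(fake_llm, {"echo": lambda text: text.upper()}, "task")
```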

Multi-LLM Routing Strategy

RiverCore routes each turn to one of four model tiers based on task complexity classification. The classification is a fast internal heuristic step run before the main reasoning call.

[Diagram: Multi-LLM Routing Strategy]

Reasoning Turn Structure

Every turn in the agentic loop produces a structured record written to river_agents.agent_logs. Engineers implementing or querying the log store should treat these fields as the canonical schema for turn-level data.

| Field | Type | Description |
|---|---|---|
| turn_number | integer | Sequential index starting at 1 |
| turn_type | enum | reasoning, action, observation, interaction |
| reasoning | text | LLM chain-of-thought for this turn; stored for audit and debugging |
| tool_name | string | Name of the tool selected (null for reasoning-only and finalization turns) |
| tool_arguments | JSONB | Structured arguments for the tool call (null if no tool selected) |
| observation | text | Result returned by the tool or injected from the approval resolution |
| model_used | string | Exact model identifier (e.g., claude-sonnet-4-6, gpt-4o) |
| tokens_used | integer | Total input + output token count for this turn |
| latency_ms | integer | Wall-clock time for the turn including LLM call and tool dispatch |

Layer 5: Data Access and Connectors

Layer 5 is the agent's governed interface to enterprise data. All data retrieval -- schema lookup, query execution, and catalog search -- routes through this layer. No direct database connections are made from the reasoning layer.

Components

| Component | Responsibility |
|---|---|
| Schema Discovery Service | Provides the reasoning engine with accurate table, column, and relationship metadata for all connected data sources. Auto-discovers on connection; cached with configurable TTL per data source. |
| Query Execution Engine | Accepts structured SQL, NoSQL, or API query specs from the reasoning layer. Executes via Data Orchestration Service (:8002) through TLO Gateway. Returns normalized result sets. |
| Semantic Catalog (Qdrant) | Vector-based search over data source metadata. Enables the reasoning engine to resolve table and column references by semantic meaning when exact names are unavailable or ambiguous. |
| Data Connectors | Type-specific adapters for each supported source: PostgreSQL, MySQL, Snowflake, BigQuery, Redshift, MongoDB, Elasticsearch, REST APIs, Salesforce, HubSpot, Stripe, and others. Managed by Data Orchestration Service. |

Data Access Flow

ACL note: data_source:view is required for schema discovery; data_source:query is required for query execution. Both are checked per call at TLO Gateway, not cached from the start of the run. A data source ACL change takes effect immediately on the next tool invocation, even within an active run.
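The effect of checking permissions on every call, rather than once at the start of a run, can be shown with a toy gateway. The permission strings and tool names come from this document; the `Gateway` class, the `PermissionDenied` exception, and the in-memory ACL dictionary are hypothetical stand-ins for TLO Gateway and its ACL store.

```python
from typing import Dict, Set


class PermissionDenied(Exception):
    pass


# Permission each data tool needs, per the ACL note.
REQUIRED = {
    "discover_schema": "data_source:view",
    "execute_query": "data_source:query",
}


class Gateway:
    def __init__(self, acl: Dict[str, Set[str]]) -> None:
        # Holds a live reference, so later ACL edits are seen immediately.
        self.acl = acl

    def invoke(self, principal: str, tool: str) -> str:
        needed = REQUIRED[tool]
        # Looked up on every call; nothing is cached from run start.
        if needed not in self.acl.get(principal, set()):
            raise PermissionDenied(f"{principal} lacks {needed}")
        return f"{tool} ok"
```

If `data_source:query` is revoked while a run is in progress, the next `execute_query` call fails even though earlier calls in the same run succeeded.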


Layer 6: Tool and Workflow Execution

Layer 6 is the dispatch and validation surface for every non-reasoning action the agent takes. All tool calls route through TLO Gateway, which re-validates ACL permissions before forwarding to the target service.

Components

| Component | Responsibility |
|---|---|
| Tool Registry | Maintains the catalog of all available tools with their input and output Pydantic schemas. Includes platform tools and dynamically registered custom tools added via OpenAPI spec upload. |
| Tool Executor | Routes tool call requests from the reasoning layer through TLO Gateway. Parses the response, formats it as a structured observation, and returns it to Layer 3 for injection into the next reasoning turn. |
| Workflow Invoker | Triggers Temporal sub-workflows for operations that span multiple steps or services (e.g., a data pipeline run, a multi-system onboarding sequence). Returns a workflow execution ID as the observation. |
| Result Validator | Validates tool execution results against the expected output schema for that tool. Malformed results are caught here before being injected into the reasoning context; the turn is marked as a tool error. |
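The Result Validator's role can be approximated with a small type check. The registry is described as using Pydantic schemas; to stay dependency-free, this sketch substitutes a plain mapping of field name to expected Python type, so it is a simplified stand-in rather than the actual validation mechanism. The function name and the example schema are hypothetical.

```python
from typing import Any, Dict, Tuple


def validate_result(result: Any, schema: Dict[str, type]) -> Tuple[bool, str]:
    """Return (ok, error). A failed check would mark the turn as a tool error."""
    if not isinstance(result, dict):
        return False, "result is not an object"
    for name, expected in schema.items():
        if name not in result:
            return False, f"missing field: {name}"
        if not isinstance(result[name], expected):
            return False, f"wrong type for field: {name}"
    return True, ""


# Hypothetical output schema for a query tool.
QUERY_RESULT_SCHEMA = {"rows": list, "row_count": int}
```

Validating before the result reaches the reasoning context keeps a malformed payload from being treated as a trustworthy observation.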

Tool Categories

| Category | Count | ACL Check | LLM Required | Examples |
|---|---|---|---|---|
| AI Reasoning | 6 | None -- internal processing only | Yes (per-turn routing) | classify_intent, check_governance, generate_query, search_catalog, recommend_visualization, explain_results |
| Execution | 10+ | Per-call check by TLO | No -- direct service call | execute_query, discover_schema, send_email, send_slack, create_ticket, update_crm, write_back, restart_workflow, scale_infrastructure |
| Interaction | 1 | None | No | request_approval -- sends approval request via WebSocket and signals Temporal to suspend |
| Custom | N (dynamic) | Per-call check by TLO | No | Registered from OpenAPI specs; schema validated at registration time |

Implementation note on request_approval: this tool does not return an observation to the reasoning loop. Calling it is a terminal action for the current turn -- it triggers the State Serializer in Layer 3, creates the approval record, and suspends the Temporal workflow. The reasoning loop resumes from the next turn only after the approval signal is received.


Layer 7: Governance and Safety

Layer 7 is the enforcement boundary between what an agent reasons it should do and what it is permitted to do. Every write-capable action passes through this layer before dispatch. Governance checks are not advisory -- they block, gate, or permit, with no silent pass-through.

Components

| Component | Responsibility |
|---|---|
| Action Level Checker | Evaluates each tool call against the agent's action_level. Determines whether the call proceeds, is returned as a read-only proposal, or requires an approval gate. |
| Approval Gate Service | Creates approval_requests records, assigns approvers from approval_rules, dispatches notifications via Novu, and processes approval signals back to the Temporal workflow. |
| Policy Engine | Evaluates the current execution context against organization-level and workspace-level governance policies (e.g., row count limits on data exports, time-of-day restrictions on production writes). |
| Approval Notification | Delivers approval requests to configured channels per notification_config: Slack, email, in-app, or PagerDuty. Handles escalation if no response is received before timeout_hours. |

Governance Decision Flow

[Diagram: Governance Decision Flow]

Policy Engine behavior: policies are evaluated against the execution_context object, which includes the agent identity, the workspace, the tool name, the tool arguments, and the data classification of involved data sources. Policies can match on any of these fields. A policy match always produces an audit event regardless of the enforcement action (block, gate, log, or alert).
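The evaluation described above might look roughly like the following. The four enforcement actions and the rule that every match is audited come from the text; the `Policy` class, the context key names, the `"permit"` result for no match, and the "most restrictive action wins" ordering are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Policy:
    name: str
    matches: Callable[[Dict[str, Any]], bool]
    action: str  # one of: block, gate, log, alert


# Assumed precedence when several policies match: most restrictive wins.
SEVERITY = {"permit": 0, "log": 1, "alert": 2, "gate": 3, "block": 4}


def evaluate(policies: List[Policy], context: Dict[str, Any],
             audit_log: List[Dict[str, str]]) -> str:
    decision = "permit"
    for policy in policies:
        if policy.matches(context):
            # Every match is audited, whatever the enforcement action.
            audit_log.append({"policy": policy.name, "action": policy.action})
            decision = max(decision, policy.action, key=SEVERITY.__getitem__)
    return decision
```

For example, an export row limit that gates large queries and a catch-all logging policy would both write audit events for the same call, and the gate would decide the outcome.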


Layer 8: Monitoring and Observability

Layer 8 collects structured telemetry from across the agent ecosystem and makes it available via the real-time WebSocket feed, the metrics API, the audit log query interface, and the alert engine. It has no authority over execution -- it observes and reports.

Components

| Component | Responsibility |
|---|---|
| Real-Time Telemetry | Emits structured events over WebSocket to the frontend Run Detail view. Events are emitted at each turn boundary, tool call, and state transition. |
| Metric Aggregation | Calculates rolling metrics at per-agent and system-wide scopes across 1h, 24h, 7d, and 30d windows. Stores aggregates in river_agents.agent_metrics_hourly. |
| Execution Logging | Persists the complete trace of every execution to river_agents.agent_logs: every reasoning turn, tool call, observation, approval decision, and finalization event. |
| Anomaly Detection and Alert Engine | Monitors per-agent metrics for threshold breaches (e.g., failure rate above 20%, latency P99 above 10 seconds). Dispatches alerts to configured channels via Novu. |
| Health Monitor | Derives operational health status for each agent from recent execution outcomes. Updates agents.health_status on a 5-minute evaluation cycle. |

Key Metrics

| Metric | Scope | Aggregation | Used In |
|---|---|---|---|
| Total Runs | Per-agent, System-wide | Count | Agent Detail KPI, System Monitoring |
| Success Rate | Per-agent | completed / total as percentage | Agent Detail KPI, Health Score |
| Average Latency | Per-agent, Per-tool | P50, P90, P99 in milliseconds | Agent Detail KPI, System Monitoring |
| Failure Count | Per-agent, System-wide | Count with categorized reasons | Agent Detail KPI, Alert Engine |
| Throughput | System-wide | Runs per hour and per day | System Monitoring |
| Pending Approvals | Per-agent, System-wide | Count | Agent Detail badge, Runs page |
| Token Cost | Per-agent, Per-run | Total tokens multiplied by model unit pricing | Cost Tracking |
| Actions Taken | Per-agent | Count grouped by tool name | Agent Overview tab |
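Three of the aggregations in the table are simple enough to sketch directly. The formulas follow the Aggregation column; the function names are hypothetical, the nearest-rank percentile method is an assumption (the text does not say which method is used), and the model name and price in the usage example are made up.

```python
import math
from typing import Dict, Sequence


def success_rate(completed: int, total: int) -> float:
    # completed / total as a percentage; 0.0 when there are no runs.
    return 0.0 if total == 0 else 100.0 * completed / total


def percentile(values: Sequence[float], pct: float) -> float:
    # Nearest-rank percentile over the sorted samples.
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def token_cost(tokens_by_model: Dict[str, int],
               price_per_token: Dict[str, float]) -> float:
    # Total tokens multiplied by each model's unit price.
    return sum(tokens * price_per_token[model]
               for model, tokens in tokens_by_model.items())
```

For instance, with latencies of 100 ms through 1000 ms in steps of 100, `percentile(latencies, 90)` returns 900 under the nearest-rank method, and `token_cost({"model-a": 1000}, {"model-a": 0.002})` prices a thousand tokens at the assumed rate.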

Telemetry Data Flow

[Diagram: Telemetry Data Flow]


Cross-Layer Interaction Map

The full system maps to five infrastructure tiers. The layer groupings above are functional; the tiers below reflect the actual deployment topology.

[Diagram: Cross-Layer Interaction Summary]

Layer-to-tier mapping:

| Layers | Tier |
|---|---|
| L1, L2, L7, L8 | Service Layer (Backend :8005) |
| L3 | Orchestration Layer (Backend :8005 + Temporal :7233) |
| L4 | Orchestration Layer (river-agent :8007) |
| L5 | Service Layer (Data Orchestration :8002) |
| L6 | API Layer (TLO Gateway :8001 as dispatch boundary) |