Hybrid Computing Layer

The Hybrid Computing Layer enables RiverGen's Model Studio to orchestrate ML training across heterogeneous infrastructure—on-premise clusters, cloud providers (AWS, Azure, GCP), and edge devices—while optimizing for cost, latency, and resource availability.

Key Capability

Model Studio automatically selects the optimal compute resource based on task requirements, data locality, cost constraints, and current resource availability.


Architecture Overview

Resource Selection Flow

  1. Task Analysis: Model Studio analyzes the training task (data size, model complexity, time constraints).
  2. Resource Matching: The scheduler queries the registry for available resources that match the task requirements.
  3. Cost Optimization: The scheduler weighs cost against performance according to user preferences.
  4. Allocation: The scheduler reserves the selected resource and sets its status to busy.
  5. Monitoring: Resource utilization and job progress are tracked continuously while the job runs.
  6. Release: The resource returns to available status when the job completes.
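The matching, allocation, and release steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not Model Studio's scheduler: the `Resource` class and the `match`, `allocate`, and `release` functions are hypothetical names, "cheapest candidate wins" stands in for the cost optimization step, and each resource is treated as running a single job at a time (ignoring max_parallel_jobs).

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    # Hypothetical in-memory stand-in for a registry entry.
    resource_id: str
    status: str
    gpu_count: int
    cost_per_hour_usd: float
    supported_frameworks: list = field(default_factory=list)


def match(resources, framework, min_gpus):
    """Step 2: keep available resources that satisfy the task requirements."""
    return [
        r for r in resources
        if r.status == "available"
        and r.gpu_count >= min_gpus
        and framework in r.supported_frameworks
    ]


def allocate(resources, framework, min_gpus):
    """Steps 2-4: match, pick the cheapest candidate, and mark it busy."""
    candidates = match(resources, framework, min_gpus)
    if not candidates:
        return None
    chosen = min(candidates, key=lambda r: r.cost_per_hour_usd)
    chosen.status = "busy"
    return chosen


def release(resource):
    """Step 6: return the resource to the pool once the job completes."""
    resource.status = "available"


registry = [
    Resource("rivergen-gpu-cluster-01", "available", 8, 0.0, ["pytorch", "tensorflow"]),
    Resource("aws-us-east-1-p4d", "available", 8, 32.77, ["pytorch", "mxnet"]),
]
picked = allocate(registry, framework="pytorch", min_gpus=8)
print(picked.resource_id, picked.status)
release(picked)
```

With both resources available, the zero-cost on-premise cluster is chosen first; the cloud instance is selected only once the cluster is busy.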

Resource Types

1. On-Premise Resources

Use Cases:

  • Sensitive data that cannot leave the organization
  • High-performance GPU clusters for large-scale training
  • Cost-effective for sustained workloads

Example Configuration:

{
  "resource_id": "rivergen-gpu-cluster-01",
  "resource_type": "on-premise",
  "location": "RiverGen Data Center - Building A",
  "capabilities": {
    "gpu": true,
    "gpu_count": 8,
    "gpu_type": "NVIDIA A100",
    "cpu_cores": 128,
    "memory_gb": 1024,
    "storage_tb": 50
  },
  "supported_frameworks": ["pytorch", "tensorflow", "sklearn", "xgboost"],
  "max_parallel_jobs": 4,
  "cost_per_hour_usd": 0.0
}

2. Cloud Resources

Use Cases:

  • Elastic scaling for variable workloads
  • Access to specialized hardware (TPUs, Inferentia)
  • Geographic distribution for global deployments

Example Configuration:

{
  "resource_id": "aws-us-east-1-p4d",
  "resource_type": "cloud",
  "provider": "AWS",
  "region": "us-east-1",
  "capabilities": {
    "gpu": true,
    "gpu_count": 8,
    "gpu_type": "NVIDIA A100",
    "cpu_cores": 96,
    "memory_gb": 1152
  },
  "supported_frameworks": ["pytorch", "tensorflow", "mxnet"],
  "max_parallel_jobs": 10,
  "cost_per_hour_usd": 32.77,
  "network_bandwidth_gbps": 400.0
}

3. Edge Resources

Use Cases:

  • Federated learning on edge devices
  • Privacy-preserving local model training
  • Deployment of models for low-latency inference

Example Configuration:

{
  "resource_id": "edge-device-fleet-01",
  "resource_type": "edge",
  "location": "Distributed IoT Network",
  "capabilities": {
    "gpu": false,
    "cpu_cores": 4,
    "memory_gb": 8,
    "storage_gb": 64
  },
  "supported_frameworks": ["tensorflow-lite", "onnx"],
  "max_parallel_jobs": 1,
  "latency_ms": 2
}

API Reference

Register Compute Resource

Register a new compute resource in the hybrid computing registry.

Endpoint: POST /api/model-studio/compute-resources
Status Code: 201 Created

Example Request Body:

{
  "resource_id": "rivergen-gpu-cluster-02",
  "resource_type": "on-premise",
  "location": "RiverGen Data Center - Building B",
  "capabilities": {
    "gpu": true,
    "gpu_count": 4,
    "gpu_type": "NVIDIA V100",
    "cpu_cores": 64,
    "memory_gb": 512
  },
  "supported_frameworks": ["pytorch", "tensorflow", "sklearn"],
  "status": "available",
  "max_parallel_jobs": 2,
  "network_bandwidth_gbps": 10.0,
  "cost_per_hour_usd": 0.0,
  "auth_method": "ssh_key",
  "latency_ms": 5
}

Request Schema:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| resource_id | string | Yes | Unique identifier (e.g., rivergen-gpu-cluster-02) |
| resource_type | string | Yes | Type: on-premise, cloud, edge |
| provider | string | No | Cloud provider: AWS, Azure, GCP (required for cloud) |
| location | string | No | Physical location or data center name |
| region | string | No | Cloud region (e.g., us-east-1) |
| capabilities | object | Yes | Hardware capabilities (GPU, CPU, memory) |
| supported_frameworks | array | No | ML frameworks: pytorch, tensorflow, sklearn, etc. |
| status | string | No | Initial status (default: available) |
| max_parallel_jobs | integer | No | Maximum concurrent training jobs |
| network_bandwidth_gbps | float | No | Network bandwidth in Gbps |
| cost_per_hour_usd | float | No | Cost per hour in USD (0.0 for on-premise) |
| auth_method | string | No | Authentication method: ssh_key, api_token, iam_role |
| latency_ms | integer | No | Network latency to Model Studio in milliseconds |

Unique Resource IDs

Each resource_id must be globally unique across all registered resources. Registration will fail with 400 Bad Request if a duplicate ID is detected.
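A client can catch the most common registration mistakes before sending the request. The sketch below checks the required fields from the request schema, the rule that provider is required for cloud resources, and the uniqueness of resource_id. The `validate_registration` function and the `existing_ids` set are hypothetical helpers for illustration; the server's own validation may differ.

```python
VALID_TYPES = {"on-premise", "cloud", "edge"}
REQUIRED_FIELDS = ("resource_id", "resource_type", "capabilities")


def validate_registration(payload, existing_ids):
    """Return a list of problems; an empty list means the payload looks valid."""
    # Required fields, per the request schema.
    errors = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in payload
    ]
    rtype = payload.get("resource_type")
    if rtype is not None and rtype not in VALID_TYPES:
        errors.append(f"unknown resource_type: {rtype}")
    # The schema marks provider as optional, but required for cloud resources.
    if rtype == "cloud" and not payload.get("provider"):
        errors.append("provider is required for cloud resources")
    # Duplicate IDs are rejected by the API with 400 Bad Request.
    if payload.get("resource_id") in existing_ids:
        errors.append("duplicate resource_id")
    return errors


payload = {
    "resource_id": "rivergen-gpu-cluster-02",
    "resource_type": "on-premise",
    "capabilities": {"gpu": True, "gpu_count": 4},
}
print(validate_registration(payload, existing_ids={"rivergen-gpu-cluster-01"}))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.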


List Compute Resources

Retrieve all registered compute resources with optional filtering.

Endpoint: GET /api/model-studio/compute-resources
Status Code: 200 OK

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| status_filter | string | Filter by status: available, busy, maintenance, unavailable |
| resource_type | string | Filter by type: on-premise, cloud, edge |
| provider | string | Filter by cloud provider: AWS, Azure, GCP |

Example Request:

curl "http://localhost:8000/api/model-studio/compute-resources?status_filter=available&resource_type=on-premise"

Response:

[
  {
    "id": 1,
    "resource_id": "rivergen-gpu-cluster-01",
    "resource_type": "on-premise",
    "status": "available",
    "capabilities": {
      "gpu": true,
      "gpu_count": 8,
      "cpu_cores": 128,
      "memory_gb": 1024
    },
    "max_parallel_jobs": 4,
    "cost_per_hour_usd": 0.0
  },
  {
    "id": 5,
    "resource_id": "rivergen-gpu-cluster-02",
    "resource_type": "on-premise",
    "status": "available",
    "capabilities": {
      "gpu": true,
      "gpu_count": 4,
      "cpu_cores": 64,
      "memory_gb": 512
    },
    "max_parallel_jobs": 2,
    "cost_per_hour_usd": 0.0
  }
]
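The same query can be assembled programmatically, and the response is an ordinary JSON array. The sketch below builds the filter URL with the standard library and totals the GPUs in a response body. The `build_list_url` and `total_gpus` helpers are hypothetical, the base URL is the localhost address from the curl example, and the sample body is a trimmed copy of the response above. No request is actually sent.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000/api/model-studio/compute-resources"


def build_list_url(**filters):
    """Append only the filters that were actually provided."""
    params = {key: value for key, value in filters.items() if value is not None}
    return f"{BASE_URL}?{urlencode(params)}" if params else BASE_URL


def total_gpus(response_body):
    """Sum gpu_count over every resource in a list response."""
    resources = json.loads(response_body)
    return sum(r["capabilities"].get("gpu_count", 0) for r in resources)


url = build_list_url(status_filter="available", resource_type="on-premise")
print(url)

# Trimmed copy of the example response, used here in place of a live call.
sample = json.dumps([
    {"resource_id": "rivergen-gpu-cluster-01", "capabilities": {"gpu_count": 8}},
    {"resource_id": "rivergen-gpu-cluster-02", "capabilities": {"gpu_count": 4}},
])
print(total_gpus(sample))
```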

Update Compute Resource

Update the status or configuration of an existing compute resource.

Endpoint: PUT /api/model-studio/compute-resources/{resource_id}
Status Code: 200 OK

Example - Mark Resource for Maintenance:

curl -X PUT "http://localhost:8000/api/model-studio/compute-resources/rivergen-gpu-cluster-02" \
  -H "Content-Type: application/json" \
  -d '{
    "status": "maintenance"
  }'

Resource Status Management

  • available: Ready to accept new training jobs
  • busy: Currently executing jobs (auto-managed by scheduler)
  • maintenance: Temporarily unavailable for scheduled maintenance
  • unavailable: Permanently offline or decommissioned
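A small client-side guard can keep manual updates to the statuses an operator is expected to set. The sketch below builds the PUT request with the standard library without sending it. The `build_status_update` helper is hypothetical, and excluding busy from manual updates is an assumption based on the note that the scheduler manages that status; the API itself may or may not reject it.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000/api/model-studio/compute-resources"
# Assumption: "busy" is left out because the scheduler manages it automatically.
MANUAL_STATUSES = {"available", "maintenance", "unavailable"}


def build_status_update(resource_id, status):
    """Build (but do not send) a PUT request that changes a resource's status."""
    if status not in MANUAL_STATUSES:
        raise ValueError(f"status must be one of {sorted(MANUAL_STATUSES)}")
    body = json.dumps({"status": status}).encode("utf-8")
    return Request(
        f"{BASE_URL}/{resource_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = build_status_update("rivergen-gpu-cluster-02", "maintenance")
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the request against a running Model Studio instance.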

Resource Allocation Strategy

Automatic Selection (compute_preference: "auto")

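When a task intent specifies "compute_preference": "auto", Model Studio selects a resource through a decision tree that weighs task requirements, data locality, cost constraints, and current resource availability.

[Diagram: automatic selection decision tree]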

Manual Selection

Users can explicitly specify compute preferences:

{
  "task_name": "High-Priority Training",
  "ml_task": "classification",
  "compute_preference": "cloud",
  "constraints": {
    "max_cost_per_hour_usd": 50.0,
    "preferred_provider": "AWS",
    "required_gpu_count": 8
  }
}
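To make the effect of these constraints concrete, the sketch below checks a registered resource against a task intent. It is illustrative only: the `satisfies` function is a hypothetical helper, and treating preferred_provider as a hard filter rather than a soft preference is an assumption, since the behavior is not specified here.

```python
def satisfies(resource, intent):
    """Check one registered resource against a task intent's preferences."""
    constraints = intent.get("constraints", {})
    preference = intent.get("compute_preference", "auto")
    # An explicit preference restricts the resource type; "auto" allows any.
    if preference != "auto" and resource["resource_type"] != preference:
        return False
    max_cost = constraints.get("max_cost_per_hour_usd")
    if max_cost is not None and resource.get("cost_per_hour_usd", 0.0) > max_cost:
        return False
    # Assumption: the preferred provider is enforced strictly.
    provider = constraints.get("preferred_provider")
    if provider is not None and resource.get("provider") != provider:
        return False
    needed = constraints.get("required_gpu_count", 0)
    return resource["capabilities"].get("gpu_count", 0) >= needed


intent = {
    "compute_preference": "cloud",
    "constraints": {
        "max_cost_per_hour_usd": 50.0,
        "preferred_provider": "AWS",
        "required_gpu_count": 8,
    },
}
aws = {
    "resource_id": "aws-us-east-1-p4d",
    "resource_type": "cloud",
    "provider": "AWS",
    "capabilities": {"gpu_count": 8},
    "cost_per_hour_usd": 32.77,
}
on_prem = {
    "resource_id": "rivergen-gpu-cluster-01",
    "resource_type": "on-premise",
    "capabilities": {"gpu_count": 8},
    "cost_per_hour_usd": 0.0,
}
print(satisfies(aws, intent), satisfies(on_prem, intent))
```

Here the AWS instance passes every check, while the on-premise cluster is excluded by the cloud preference even though it is cheaper.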

Best Practices

1. Resource Registration

  • ✅ Register all available compute resources during initial setup
  • ✅ Use descriptive resource_id values (e.g., aws-us-east-1-p4d-01)
  • ✅ Keep capabilities metadata accurate and up-to-date
  • ✅ Set realistic max_parallel_jobs based on resource capacity

2. Cost Optimization

  • 💰 Use on-premise resources for sustained, predictable workloads
  • 💰 Reserve cloud resources for burst capacity and experimentation
  • 💰 Set max_cost_per_hour_usd constraints in task intents
  • 💰 Monitor actual vs. estimated training times to refine cost models
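The comparison of actual and estimated training times comes down to simple arithmetic. The sketch below assumes cost is hours multiplied by cost_per_hour_usd, which is an assumption about how a cost model might work rather than documented billing behavior. Both function names are hypothetical, and the hourly rate is taken from the cloud example configuration.

```python
def estimated_cost(hours, cost_per_hour_usd):
    """Projected spend for a job, rounded to cents."""
    return round(hours * cost_per_hour_usd, 2)


def estimate_error(estimated_hours, actual_hours):
    """Relative error of the time estimate; positive means the job ran long."""
    return (actual_hours - estimated_hours) / estimated_hours


rate = 32.77  # cost_per_hour_usd from the aws-us-east-1-p4d example
planned = estimated_cost(10, rate)
actual = estimated_cost(12, rate)
overrun = estimate_error(10, 12)
print(planned, actual, f"{overrun:.0%}")
```

A consistently positive error suggests raising the time estimates used when setting max_cost_per_hour_usd and related budgets.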

3. Monitoring & Maintenance

  • 🔧 Regularly update resource status (e.g., mark for maintenance)
  • 🔧 Monitor resource utilization via the Resource Monitor
  • 🔧 Remove decommissioned resources or mark them as unavailable
  • 🔧 Validate framework compatibility before major upgrades