Custom Model Training
While Model Studio's AutoML engine handles the majority of use cases, advanced users may require complete control over the training logic, framework versions, and hardware acceleration. Custom Model Training allows you to submit your own training scripts and dependencies to be executed on RiverGen's compute infrastructure.
Workflow Overview
The custom training workflow differs from the standard AutoML pipeline by requiring a Code Artifact and a Compute Configuration.
1. Preparing Your Code
Your training code must be packaged as a compressed archive (.tar.gz or .zip) and accessible via a public or pre-signed URL.
File Structure
The archive should contain your main entry point script and a requirements.txt file.
custom_train.tar.gz
├── train.py # Main entry point
├── utils/ # Helper modules
│ ├── data_loader.py
│ └── metrics.py
└── requirements.txt # Python dependencies
Script Requirements
Your entry point script should accept parameters via command-line arguments. Model Studio injects environment variables for data paths:
- RG_DATA_PATH: Local path to the training dataset.
- RG_OUTPUT_PATH: Path where the model should be saved for registry ingestion.
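The script requirements above can be sketched as a minimal entry point. This is an illustration only: it assumes that the values in training_args reach the script as `--learning_rate` and `--batch_size` flags, which this page does not specify, and the `parse_config` and `main` helpers and the `model.txt` placeholder are hypothetical.

```python
import argparse
import os
from pathlib import Path


def parse_config(argv=None, env=None):
    """Collect hyperparameters from the CLI and data paths from the environment."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Custom training entry point")
    # Assumed flag names, mirroring the keys used in training_args.
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--batch_size", type=int, default=128)
    args = parser.parse_args(argv)

    # Injected by Model Studio; a missing variable raises KeyError early.
    data_path = Path(env["RG_DATA_PATH"])
    output_path = Path(env["RG_OUTPUT_PATH"])
    return args, data_path, output_path


def main(argv=None, env=None):
    args, data_path, output_path = parse_config(argv, env)
    output_path.mkdir(parents=True, exist_ok=True)
    # ... load data from data_path and train the model here ...
    # Placeholder artifact so the sketch runs end to end.
    model_file = output_path / "model.txt"
    model_file.write_text(
        f"lr={args.learning_rate} batch_size={args.batch_size}\n"
    )
    return model_file

# In a real train.py, call main() under an `if __name__ == "__main__":` guard.
```

Reading the paths from the environment rather than hard-coding them keeps the same script usable locally (by exporting both variables) and on the platform.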
2. Compute Configuration
Model Studio supports various instance types tailored for different ML tasks.
| Instance Type | CPU Cores | RAM (GB) | GPU | Use Case |
|---|---|---|---|---|
| cpu.standard | 4 | 16 | None | Light preprocessing, Scikit-learn |
| cpu.highmem | 16 | 64 | None | Large dataset joins, complex feature engineering |
| gpu.xlarge | 8 | 32 | 1x NVIDIA T4 | Deep Learning, PyTorch, TensorFlow |
| gpu.2xlarge | 16 | 128 | 1x NVIDIA A100 | Large Model Training, LLM Finetuning |
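As a rough aid for choosing among these instance types, the table can be encoded as data and filtered by requirement. The sketch below is a hypothetical client-side helper, not part of Model Studio; `InstanceSpec` and `pick_instance` are illustrative names, and the selection ignores credit cost.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceSpec:
    name: str
    cpu_cores: int
    ram_gb: int
    gpu: Optional[str]  # None for CPU-only instances


# Values copied from the instance table, in table order.
INSTANCE_TYPES = [
    InstanceSpec("cpu.standard", 4, 16, None),
    InstanceSpec("cpu.highmem", 16, 64, None),
    InstanceSpec("gpu.xlarge", 8, 32, "NVIDIA T4"),
    InstanceSpec("gpu.2xlarge", 16, 128, "NVIDIA A100"),
]


def pick_instance(min_ram_gb, needs_gpu=False):
    """Return the first listed instance that satisfies the requirements."""
    for spec in INSTANCE_TYPES:
        if needs_gpu and spec.gpu is None:
            continue
        if spec.ram_gb >= min_ram_gb:
            return spec.name
    raise ValueError("no instance type satisfies the requested resources")
```

For example, `pick_instance(24, needs_gpu=True)` returns `"gpu.xlarge"`, while a request for 256 GB of RAM raises `ValueError` because no listed instance is large enough.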
3. Submitting a Job
Use the Submit Custom Training Job endpoint to trigger execution.
curl -X POST "https://api.rivergen.ai/api/model-studio/custom-training-job" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "task_name": "ResNet-50-Finetune",
    "code_artifact_url": "s3://my-bucket/models/v1/code.tar.gz",
    "entry_point": "train.py",
    "compute_config": {
      "instance_type": "gpu.xlarge"
    },
    "training_args": {
      "learning_rate": 0.0001,
      "batch_size": 128
    }
  }'
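The same request can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; `build_job_request` and `submit_job` are hypothetical helper names, and because the response schema is not documented here, the body is returned as raw text.

```python
import json
import urllib.request

API_URL = "https://api.rivergen.ai/api/model-studio/custom-training-job"

# Same payload as the curl example.
PAYLOAD = {
    "task_name": "ResNet-50-Finetune",
    "code_artifact_url": "s3://my-bucket/models/v1/code.tar.gz",
    "entry_point": "train.py",
    "compute_config": {"instance_type": "gpu.xlarge"},
    "training_args": {"learning_rate": 0.0001, "batch_size": 128},
}


def build_job_request(token, payload):
    """Build (but do not send) the POST request for a custom training job."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_job(token, payload):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_job_request(token, payload)) as resp:
        return resp.read().decode("utf-8")
```

Separating request construction from sending makes it possible to inspect or log the exact payload before any credits are spent.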
4. Monitoring & Quotas
Compute Credits
Custom jobs consume Compute Credits proportional to the instance type and duration.
- Check Balance: GET /organization/billing/compute-credits
- Quotas: Organization-level concurrency limits apply (e.g., maximum 2 simultaneous gpu.2xlarge jobs).
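A client can avoid submissions that would exceed a concurrency limit by checking its own running jobs first. The sketch below is purely illustrative: `can_submit` is a hypothetical helper, the limits table holds only the example value mentioned above, and your organization's actual quotas may differ.

```python
from collections import Counter

# Illustrative limits only; the gpu.2xlarge value is the example from the text.
CONCURRENCY_LIMITS = {"gpu.2xlarge": 2}


def can_submit(instance_type, running_instance_types, limits=CONCURRENCY_LIMITS):
    """Return True if one more job of instance_type stays within its limit."""
    limit = limits.get(instance_type)
    if limit is None:
        return True  # no known limit for this type in the local table
    return Counter(running_instance_types)[instance_type] < limit
```

With one gpu.2xlarge job running, `can_submit("gpu.2xlarge", ["gpu.2xlarge"])` is true; with two running it is false, so the caller can wait before submitting.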
Debugging Logs
Logs generated by your script are streamed to the Model Studio dashboard in real time. If a job fails, the response from the Get Training Details endpoint contains the stdout/stderr trace.
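Since the trace comes from stdout/stderr, it helps for the training script to log there with timestamps. A minimal sketch using Python's standard logging module follows; `make_logger` is a hypothetical helper, and the message format is only a suggestion.

```python
import logging
import sys


def make_logger(name="train", stream=None):
    """Return a logger that writes timestamped lines to stdout by default."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if not logger.handlers:
        # StreamHandler flushes after every record, so lines appear promptly.
        handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Inside the training loop, a call such as `make_logger().info("epoch=%d loss=%.4f", epoch, loss)` emits one line per epoch.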
Best Practices
- Always reference a specific dataset_id in your metadata to ensure reproducibility.
- RG_OUTPUT_PATH has a maximum capacity of 50 GB. If your model exceeds this, consider using tiered storage checkpoints.
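To catch an oversized output before the job ends, the script can measure the output directory itself. The following is a sketch under assumptions: it treats the limit as 50 GiB (the exact byte count is not stated here), and `directory_size_bytes` and `check_output_size` are hypothetical helpers.

```python
import os

MAX_OUTPUT_BYTES = 50 * 1024**3  # assumption: the 50 GB limit read as 50 GiB


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def check_output_size(path, limit=MAX_OUTPUT_BYTES):
    """Return the bytes used under path, or raise if the limit is exceeded."""
    used = directory_size_bytes(path)
    if used > limit:
        raise RuntimeError(
            f"{path} holds {used} bytes, above the {limit}-byte limit"
        )
    return used
```

Calling `check_output_size(os.environ["RG_OUTPUT_PATH"])` after each checkpoint write gives an early warning while there is still time to prune older checkpoints.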