
Custom Model Training

While Model Studio's AutoML engine handles the majority of use cases, advanced users may require complete control over the training logic, framework versions, and hardware acceleration. Custom Model Training allows you to submit your own training scripts and dependencies to be executed on RiverGen's compute infrastructure.


Workflow Overview

The custom training workflow differs from the standard AutoML pipeline by requiring a Code Artifact and a Compute Configuration.


1. Preparing Your Code

Your training code must be packaged as a compressed archive (.tar.gz or .zip) and accessible via a public or pre-signed URL.

File Structure

The archive should contain your main entry point script and a requirements.txt file.

custom_train.tar.gz
├── train.py              # Main entry point
├── utils/                # Helper modules
│   ├── data_loader.py
│   └── metrics.py
└── requirements.txt      # Python dependencies
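As a rough sketch of how such an archive could be produced, the helper below packs a local source directory into a .tar.gz with the entry point and requirements.txt at the top level. The function name `build_archive` and its parameters are hypothetical and not part of Model Studio; only the Python standard library is used.

```python
import tarfile
from pathlib import Path


def build_archive(source_dir: str, archive_path: str, entry_point: str = "train.py") -> list[str]:
    """Pack source_dir into a .tar.gz and return the sorted member names.

    Write the archive outside source_dir so it does not end up inside itself.
    """
    root = Path(source_dir)
    # Fail early if the two files the archive is expected to contain are missing.
    for required in (entry_point, "requirements.txt"):
        if not (root / required).is_file():
            raise FileNotFoundError(f"{required} not found in {root}")

    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(root.rglob("*")):
            if path.is_file() and "__pycache__" not in path.parts:
                # Store paths relative to the source directory so the entry
                # point sits at the top level of the archive.
                tar.add(path, arcname=path.relative_to(root).as_posix())

    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())
```

Calling `build_archive("my_project", "custom_train.tar.gz")` returns the list of packed files, which is a quick way to confirm the layout before uploading the archive.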

Script Requirements

Your entry point script should accept parameters via command-line arguments. Model Studio injects environment variables for data paths:

  • RG_DATA_PATH: Local path to the training dataset.
  • RG_OUTPUT_PATH: Path where the model should be saved for registry ingestion.
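A minimal entry point skeleton might look like the following. It assumes that hyperparameters arrive as `--name value` flags (the exact flag format is not specified here) and that RG_OUTPUT_PATH is a directory. The flag names, the `run_config.json` file, and the function names are illustrative only.

```python
import argparse
import json
import os
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Custom training entry point")
    # Hypothetical flags; name them after the hyperparameters you submit.
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--batch_size", type=int, default=128)
    return parser.parse_args(argv)


def resolve_paths(env=None):
    """Read the data and output paths injected through environment variables."""
    env = os.environ if env is None else env
    missing = [key for key in ("RG_DATA_PATH", "RG_OUTPUT_PATH") if key not in env]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Path(env["RG_DATA_PATH"]), Path(env["RG_OUTPUT_PATH"])


def main(argv=None, env=None):
    args = parse_args(argv)
    data_path, output_path = resolve_paths(env)
    # Assumption: RG_OUTPUT_PATH is a directory that may not exist yet.
    output_path.mkdir(parents=True, exist_ok=True)

    # Placeholder for the real training loop: record the run configuration
    # next to where the model files would be written.
    run_info = {
        "data_path": str(data_path),
        "learning_rate": args.learning_rate,
        "batch_size": args.batch_size,
    }
    (output_path / "run_config.json").write_text(json.dumps(run_info, indent=2))
    return run_info
```

In a real train.py, `main()` would be called at the bottom of the file. Passing `argv` and `env` explicitly keeps the functions easy to exercise locally before submitting a job.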

2. Compute Configuration

Model Studio supports various instance types tailored for different ML tasks.

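Instance Type | CPU Cores | RAM (GB) | GPU            | Use Case
--------------+-----------+----------+----------------+-------------------------------------------------
cpu.standard  | 4         | 16       | None           | Light preprocessing, Scikit-learn
cpu.highmem   | 16        | 64       | None           | Large dataset joins, complex feature engineering
gpu.xlarge    | 8         | 32       | 1x NVIDIA T4   | Deep Learning, PyTorch, TensorFlow
gpu.2xlarge   | 16        | 128      | 1x NVIDIA A100 | Large Model Training, LLM Finetuning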
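One way to keep this choice explicit in client code is to encode the table and pick the first listed instance type that satisfies a job's needs. The `InstanceType` record and `pick_instance` helper below are hypothetical conveniences, not a Model Studio API; the values are copied from the table above.

```python
from typing import NamedTuple, Optional


class InstanceType(NamedTuple):
    name: str
    cpu_cores: int
    ram_gb: int
    gpu: Optional[str]


# Values copied from the instance table, in the order listed there.
INSTANCE_TYPES = [
    InstanceType("cpu.standard", 4, 16, None),
    InstanceType("cpu.highmem", 16, 64, None),
    InstanceType("gpu.xlarge", 8, 32, "NVIDIA T4"),
    InstanceType("gpu.2xlarge", 16, 128, "NVIDIA A100"),
]


def pick_instance(min_ram_gb: int, needs_gpu: bool) -> str:
    """Return the first listed instance type with enough RAM and the right GPU class."""
    for inst in INSTANCE_TYPES:
        if inst.ram_gb >= min_ram_gb and (inst.gpu is not None) == needs_gpu:
            return inst.name
    raise ValueError(f"No instance type offers {min_ram_gb} GB RAM (GPU: {needs_gpu})")
```

For example, a CPU-only job that needs 48 GB of RAM resolves to cpu.highmem, while a GPU job that needs 64 GB resolves to gpu.2xlarge.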

3. Submitting a Job

Use the Submit Custom Training Job endpoint to trigger execution.

curl -X POST "https://api.rivergen.ai/api/model-studio/custom-training-job" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "task_name": "ResNet-50-Finetune",
    "code_artifact_url": "s3://my-bucket/models/v1/code.tar.gz",
    "entry_point": "train.py",
    "compute_config": {
      "instance_type": "gpu.xlarge"
    },
    "training_args": {
      "learning_rate": 0.0001,
      "batch_size": 128
    }
  }'
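The same request can be sketched in Python with the standard library. The function names are hypothetical, and because the response schema is not described here, `submit_job` simply assumes the endpoint returns JSON and hands back the decoded body.

```python
import json
import urllib.request

API_URL = "https://api.rivergen.ai/api/model-studio/custom-training-job"


def build_submit_request(token, task_name, code_artifact_url, entry_point,
                         instance_type, training_args):
    """Build (but do not send) the POST request for a custom training job."""
    payload = {
        "task_name": task_name,
        "code_artifact_url": code_artifact_url,
        "entry_point": entry_point,
        "compute_config": {"instance_type": instance_type},
        "training_args": training_args,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_job(request, timeout=30):
    """Send the request. Assumes a JSON response body (not confirmed by the docs)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending makes it possible to inspect or log the payload before any credits are spent.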

4. Monitoring & Quotas

Compute Credits

Custom jobs consume Compute Credits proportional to the instance type and duration.

  • Check Balance: GET /organization/billing/compute-credits
  • Quotas: Organization-level concurrency limits apply (e.g., a maximum of 2 simultaneous gpu.2xlarge jobs).
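A balance check before submitting a long job could be sketched as follows. The base URL is an assumption inferred from the job submission endpoint, and the response format is not documented here, so the sketch returns whatever JSON the server sends.

```python
import json
import urllib.request

# Assumption: the billing path shares the base URL of the submission endpoint.
ASSUMED_BASE_URL = "https://api.rivergen.ai/api"


def build_balance_request(token, base_url=ASSUMED_BASE_URL):
    """Build the GET request for the organization's compute credit balance."""
    return urllib.request.Request(
        f"{base_url}/organization/billing/compute-credits",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def fetch_balance(token, timeout=30):
    """Send the request. Assumes a JSON response body with an undocumented schema."""
    with urllib.request.urlopen(build_balance_request(token), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```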

Debugging Logs

Logs generated by your script are streamed back to the Model Studio dashboard in real time. If a job fails, the response from the Get Training Details endpoint contains the stdout/stderr trace.
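Since the platform captures stdout and stderr, a training script benefits from writing progress and failures there promptly. The sketch below uses Python's standard `logging` module; the logger name `train` and the `run_step` helper are illustrative choices, not platform requirements.

```python
import logging
import sys


def configure_logging(level=logging.INFO):
    """Return a logger that writes timestamped lines to stdout."""
    logger = logging.getLogger("train")
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate lines if called more than once
    handler = logging.StreamHandler(sys.stdout)  # StreamHandler flushes after each record
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def run_step(logger, step_name, fn):
    """Run one stage of training, logging the full traceback if it fails."""
    try:
        return fn()
    except Exception:
        logger.exception("step %s failed", step_name)
        raise  # re-raise so the job exits with an error and the trace reaches stderr
```

Wrapping stages such as data loading or checkpointing in `run_step` leaves a labeled traceback in the captured output, which makes a failed job easier to diagnose.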


Best Practices

Use Data Versioning

Always reference a specific dataset_id in your metadata to ensure reproducibility.

Artifact Limits

RG_OUTPUT_PATH has a maximum capacity of 50 GB. If your model artifacts exceed this limit, consider using tiered storage checkpoints.
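A script can guard against this limit by measuring the output directory before it exits. The sketch below assumes that RG_OUTPUT_PATH is a directory and that the limit is 50 decimal gigabytes; whether the platform counts GB or GiB is not stated, so adjust the constant if needed. The function names are hypothetical.

```python
import os

# Assumption: 50 GB means 50 * 10**9 bytes.
MAX_OUTPUT_BYTES = 50 * 1000**3


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            if not os.path.islink(full_path):
                total += os.path.getsize(full_path)
    return total


def check_output_size(path, limit_bytes=MAX_OUTPUT_BYTES):
    """Return the bytes used under path, or raise if the limit is exceeded."""
    used = directory_size_bytes(path)
    if used > limit_bytes:
        raise RuntimeError(f"Output uses {used} bytes, above the limit of {limit_bytes} bytes")
    return used
```

Calling `check_output_size(os.environ["RG_OUTPUT_PATH"])` after saving the model surfaces an oversized output inside the job's own logs.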