Custom Model Training

While Model Studio's AutoML engine handles the majority of use cases, advanced users may require complete control over the training logic, framework versions, and hardware acceleration. Custom Model Training allows you to submit your own training scripts and dependencies to be executed on RiverGen's compute infrastructure.

Workflow Overview

The custom training workflow differs from the standard AutoML pipeline by requiring a Code Artifact and a Compute Configuration.

1. Preparing Your Code

Your training code must be packaged as a compressed archive (.tar.gz or .zip) and accessible via a public or pre-signed URL.

File Structure

The archive should contain your main entry point script and a requirements.txt file.

custom_train.tar.gz
├── train.py           # Main entry point
├── utils/             # Helper modules
│   ├── data_loader.py
│   └── metrics.py
└── requirements.txt   # Python dependencies
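The layout above can be produced with Python's standard `tarfile` module. The following is a minimal sketch, not an official packaging tool: the helper name `package_code` is hypothetical, and it assumes that `train.py` and `requirements.txt` must sit at the top level of the archive, as the tree suggests.

```python
import tarfile
from pathlib import Path


def package_code(source_dir: str, archive_path: str) -> list[str]:
    """Pack source_dir into a .tar.gz with its files at the archive root."""
    root = Path(source_dir)
    # Check for the two files the layout calls for before archiving.
    for required in ("train.py", "requirements.txt"):
        if not (root / required).is_file():
            raise FileNotFoundError(f"missing {required} in {root}")

    members = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(root.rglob("*")):
            relative = path.relative_to(root)
            # Skip directories (added implicitly) and bytecode caches.
            if path.is_file() and "__pycache__" not in relative.parts:
                # arcname is relative to source_dir, so train.py lands at the top level.
                archive.add(path, arcname=relative.as_posix())
                members.append(relative.as_posix())
    return members
```

Write the archive to a location outside `source_dir`; otherwise the partially written archive could be picked up as one of its own members.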

Script Requirements

Your entry point script should accept its parameters as command-line arguments. Model Studio also injects two environment variables that hold the data paths:

RG_DATA_PATH: Local path to the training dataset.
RG_OUTPUT_PATH: Path where the model should be saved for registry ingestion.
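An entry point that follows these requirements might start out like the sketch below. The flag names `--learning_rate` and `--batch_size` and their defaults are assumptions made for illustration; this page does not specify how `training_args` are mapped onto command-line flags. Only `RG_DATA_PATH` and `RG_OUTPUT_PATH` come from the list above.

```python
import argparse
import os
from dataclasses import dataclass


@dataclass
class TrainConfig:
    data_path: str
    output_path: str
    learning_rate: float
    batch_size: int


def load_config(argv=None, env=None) -> TrainConfig:
    """Combine command-line parameters with the injected data paths."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Custom training entry point")
    # Assumed flag names; adjust them to match how your job passes training_args.
    parser.add_argument("--learning_rate", type=float, default=1e-3)
    parser.add_argument("--batch_size", type=int, default=32)
    args = parser.parse_args(argv)

    # Fail early with a clear message if the injected variables are absent.
    missing = [name for name in ("RG_DATA_PATH", "RG_OUTPUT_PATH") if name not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return TrainConfig(
        data_path=env["RG_DATA_PATH"],
        output_path=env["RG_OUTPUT_PATH"],
        learning_rate=args.learning_rate,
        batch_size=args.batch_size,
    )
```

Inside `train.py`, calling `load_config()` with no arguments reads `sys.argv` and `os.environ`; passing explicit values, as in `load_config(["--batch_size", "64"], {"RG_DATA_PATH": "/tmp/data", "RG_OUTPUT_PATH": "/tmp/out"})`, makes the script easy to test locally.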

2. Compute Configuration

Model Studio supports various instance types tailored for different ML tasks.

Instance Type	CPU Cores	RAM (GB)	GPU	Use Case
`cpu.standard`	4	16	None	Light preprocessing, Scikit-learn
`cpu.highmem`	16	64	None	Large dataset joins, complex feature engineering
`gpu.xlarge`	8	32	1x NVIDIA T4	Deep Learning, PyTorch, TensorFlow
`gpu.2xlarge`	16	128	1x NVIDIA A100	Large Model Training, LLM Finetuning
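If you choose the instance type programmatically, the table can be encoded as data. The sketch below copies the specifications from the table; the helper `pick_instance` and its selection rule (first row, in table order, that has enough RAM and matches the GPU requirement) are illustrative assumptions, not Model Studio features.

```python
from typing import NamedTuple, Optional


class InstanceSpec(NamedTuple):
    name: str
    cpu_cores: int
    ram_gb: int
    gpu: Optional[str]


# Values copied from the table above, in the same order.
INSTANCES = [
    InstanceSpec("cpu.standard", 4, 16, None),
    InstanceSpec("cpu.highmem", 16, 64, None),
    InstanceSpec("gpu.xlarge", 8, 32, "NVIDIA T4"),
    InstanceSpec("gpu.2xlarge", 16, 128, "NVIDIA A100"),
]


def pick_instance(min_ram_gb: int, needs_gpu: bool) -> str:
    """Return the first listed instance that satisfies the requirements."""
    for spec in INSTANCES:
        has_gpu = spec.gpu is not None
        if spec.ram_gb >= min_ram_gb and has_gpu == needs_gpu:
            return spec.name
    raise ValueError("no instance type satisfies the request")
```

For example, `pick_instance(24, needs_gpu=True)` returns `"gpu.xlarge"`, while a GPU job that needs 64 GB of RAM resolves to `"gpu.2xlarge"`.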

3. Submitting a Job

Use the Submit Custom Training Job endpoint to trigger execution.

curl -X POST "https://api.rivergen.ai/api/model-studio/custom-training-job" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "task_name": "ResNet-50-Finetune",
  "code_artifact_url": "s3://my-bucket/models/v1/code.tar.gz",
  "entry_point": "train.py",
  "compute_config": {
    "instance_type": "gpu.xlarge"
  },
  "training_args": {
    "learning_rate": 0.0001,
    "batch_size": 128
  }
}'
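The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above: the URL, headers, and payload fields are taken from that example, while the function names are hypothetical and the assumption that the endpoint returns a JSON body is not confirmed by this page.

```python
import json
import urllib.request

API_URL = "https://api.rivergen.ai/api/model-studio/custom-training-job"


def build_submit_request(token: str, payload: dict) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def submit_job(token: str, payload: dict) -> dict:
    """Send the request; assumes the response body is JSON."""
    request = build_submit_request(token, payload)
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Keeping request construction separate from the network call lets you inspect or unit-test the payload before spending any Compute Credits.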

4. Monitoring & Quotas

Compute Credits

Custom jobs consume Compute Credits proportional to the instance type and duration.

Check Balance: GET /organization/billing/compute-credits
Quotas: Organization-level concurrency limits apply (e.g., maximum 2 simultaneous gpu.2xlarge jobs).
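A client-side guard can avoid submitting a job that would exceed a known concurrency limit. The sketch below is an assumption-laden illustration: the helper `can_submit` is hypothetical, the caller must supply the list of currently running instance types (this page does not describe how to query it), and the only limit shown is the example figure quoted above.

```python
from collections import Counter


def can_submit(instance_type: str, running: list[str], limits: dict[str, int]) -> bool:
    """Return True if one more job of instance_type stays within its limit."""
    limit = limits.get(instance_type)
    if limit is None:
        # No limit is known for this type, so nothing to enforce client-side.
        return True
    return Counter(running)[instance_type] < limit


# Example limit taken from the text above; real limits are set per organization.
EXAMPLE_LIMITS = {"gpu.2xlarge": 2}
```

With one `gpu.2xlarge` job running, `can_submit("gpu.2xlarge", ["gpu.2xlarge"], EXAMPLE_LIMITS)` is true; with two running, it is false.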

Debugging Logs

Logs generated by your script are streamed back to the Model Studio dashboard in real time. If a job fails, the response from the Get Training Details endpoint includes the stdout/stderr trace.
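Because the stdout/stderr output is what gets captured, it helps to make the script log failures explicitly. The following sketch uses Python's standard `logging` module; the logger name, the message format, and the assumption that a non-zero exit code marks the job as failed are illustrative choices, not documented Model Studio behavior.

```python
import logging
import sys


def configure_logging() -> logging.Logger:
    """Send timestamped log lines to stdout so they appear in the job logs."""
    logger = logging.getLogger("train")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def run(train_fn, logger: logging.Logger) -> int:
    """Run train_fn and return a process exit code (0 on success, 1 on failure)."""
    try:
        train_fn()
    except Exception:
        # logger.exception appends the full traceback to the log record.
        logger.exception("training failed")
        return 1
    logger.info("training finished")
    return 0
```

In `train.py`, this could be wired up as `sys.exit(run(train, configure_logging()))`, where `train` is your own training function.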

Best Practices

Use Data Versioning

Always reference a specific dataset_id in your metadata to ensure reproducibility.
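One way to enforce this is to refuse to build a job payload without a dataset version. The sketch below is hypothetical throughout: this page does not say where `dataset_id` belongs in the request, so the top-level `metadata` key and the helper name are assumptions to adapt to the actual schema.

```python
import copy


def with_dataset_version(payload: dict, dataset_id: str) -> dict:
    """Return a copy of payload with dataset_id recorded under 'metadata'."""
    if not dataset_id or not dataset_id.strip():
        raise ValueError("dataset_id must be a specific, non-empty identifier")
    pinned = copy.deepcopy(payload)  # leave the caller's payload untouched
    # 'metadata' is an assumed field name, not confirmed by this page.
    pinned.setdefault("metadata", {})["dataset_id"] = dataset_id
    return pinned
```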

Artifact Limits

The directory at RG_OUTPUT_PATH has a maximum capacity of 50 GB. If your model artifacts exceed this limit, consider using tiered storage checkpoints.
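A script can check the size of its output directory before it exits, so that an oversized model fails with a clear message. This is a minimal sketch with hypothetical helper names; it assumes the limit is 50 decimal gigabytes, which is the more conservative reading of the figure above.

```python
import os

# Assumption: the limit is 50 decimal gigabytes (the stricter interpretation).
MAX_OUTPUT_BYTES = 50 * 10**9


def directory_size(path: str) -> int:
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def check_output_size(path: str, limit: int = MAX_OUTPUT_BYTES) -> int:
    """Return the directory size in bytes, or raise if it exceeds the limit."""
    size = directory_size(path)
    if size > limit:
        raise RuntimeError(f"{path} holds {size} bytes, over the {limit}-byte limit")
    return size
```

Calling `check_output_size(os.environ["RG_OUTPUT_PATH"])` as the last step of training surfaces the problem in the job logs.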
