Staging Service

The Staging Service handles data staging for Model Studio: it extracts data from data sources and converts it to Parquet format for machine learning consumption.

Overview

The Staging Service is designed to:

  • Extract data from data sources
  • Convert data to Parquet format
  • Stage data in chunks for Model Studio consumption
  • Manage staging jobs with Temporal workflows
  • Provide presigned URLs for chunk access

Note: This service is currently under development. All endpoints return 501 Not Implemented.

Service Architecture Flow Diagram

[Figure: Staging Service Flow Diagram]

Staging Service Flow Overview:

The flow diagram traces a staging job from creation through data extraction, chunk management, and cleanup: Model Studio creates the job, the service extracts the data and converts it to Parquet, chunks are exposed through presigned URLs, and Temporal workflows manage the job lifecycle.

Key Flow Components:

  1. Job Creation: Model Studio creates staging jobs with data source parameters
  2. Temporal Workflow: Starts a Temporal workflow for reliable execution
  3. Data Extraction: Connects to data source and extracts data in chunks
  4. Parquet Conversion: Converts data to Parquet format with compression
  5. Chunk Management: Uploads chunks to MinIO/S3 and generates presigned URLs
  6. Job Control: Supports pause, resume, and cancel operations
  7. Chunk Consumption: Model Studio consumes chunks via presigned URLs
  8. Cleanup: Removes chunks automatically when their TTL expires or the job completes
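
Steps 3 to 5 all operate on one chunk at a time. As a minimal sketch of the batching idea only, assuming rows arrive as a plain Python iterable, extraction could hand off fixed-size chunks like this. The function name `iter_chunks` and the `chunk_size` parameter are hypothetical. Parquet conversion and the MinIO/S3 upload are left out because this page does not specify the libraries involved.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(rows: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `chunk_size` rows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(rows)
    while True:
        # islice pulls up to chunk_size items without loading the whole source.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    for index, chunk in enumerate(iter_chunks(range(5), 2)):
        print(index, chunk)
```

Because the generator is lazy, only one chunk is held in memory at a time, which matches the chunk-by-chunk staging described above.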

Base Path

All endpoints are prefixed with /api/v1/staging.

Base URL: http://staging-service:8003 (internal) or http://localhost:8003 (development)

Authentication

The service uses header-based authentication:

X-Org-ID: <organization_id> (required; defaults to 1 if the header is omitted)
X-User-ID: <user_id> (optional)
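
A minimal client-side sketch of how these headers and the base path might be assembled. The helper names `auth_headers` and `staging_url` are hypothetical, and passing IDs through `str()` is an assumption, since this page does not define the ID types.

```python
from typing import Dict, Optional, Union

BASE_URL = "http://localhost:8003"  # development base URL from this page
API_PREFIX = "/api/v1/staging"      # prefix shared by all endpoints

Identifier = Union[int, str]


def auth_headers(org_id: Identifier, user_id: Optional[Identifier] = None) -> Dict[str, str]:
    """Build the authentication headers (hypothetical helper)."""
    headers = {"X-Org-ID": str(org_id)}
    if user_id is not None:
        # X-User-ID is optional, so it is only sent when provided.
        headers["X-User-ID"] = str(user_id)
    return headers


def staging_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL, the API prefix, and an endpoint path."""
    return base_url.rstrip("/") + API_PREFIX + "/" + path.lstrip("/")


if __name__ == "__main__":
    print(staging_url("jobs"), auth_headers(42, "alice"))
```

The resulting dictionary and URL can be passed to any HTTP client.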

Endpoints

Job Management

| Method | Endpoint | Description | Status |
|--------|----------|-------------|--------|
| POST | /api/v1/staging/jobs | Create a new data staging job | Not Implemented |
| GET | /api/v1/staging/jobs | List staging jobs for the organization | Not Implemented |
| GET | /api/v1/staging/jobs/{job_id} | Get status and progress of a staging job | Not Implemented |
| DELETE | /api/v1/staging/jobs/{job_id} | Cancel a staging job and clean up resources | Not Implemented |

Job Control

| Method | Endpoint | Description | Status |
|--------|----------|-------------|--------|
| POST | /api/v1/staging/jobs/{job_id}/pause | Pause a running staging job | Not Implemented |
| POST | /api/v1/staging/jobs/{job_id}/resume | Resume a paused staging job | Not Implemented |
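
The pause, resume, and cancel operations imply a small state machine. The sketch below is one possible reading, not the service's actual model: the state names and the `apply_action` helper are hypothetical. It assumes that pause applies only to running jobs, that resume applies only to paused jobs, and that cancel applies to both.

```python
from enum import Enum


class JobState(Enum):
    # Hypothetical state names; the service does not publish its job states yet.
    RUNNING = "running"
    PAUSED = "paused"
    CANCELLED = "cancelled"
    COMPLETED = "completed"


# (action, current state) -> next state. Anything not listed is rejected.
TRANSITIONS = {
    ("pause", JobState.RUNNING): JobState.PAUSED,
    ("resume", JobState.PAUSED): JobState.RUNNING,
    ("cancel", JobState.RUNNING): JobState.CANCELLED,
    ("cancel", JobState.PAUSED): JobState.CANCELLED,
}


def apply_action(state: JobState, action: str) -> JobState:
    """Return the next state, or raise ValueError for an invalid transition."""
    try:
        return TRANSITIONS[(action, state)]
    except KeyError:
        raise ValueError(f"cannot {action!r} a job in state {state.value!r}") from None


if __name__ == "__main__":
    print(apply_action(JobState.RUNNING, "pause"))
```

Keeping the transitions in a single table makes invalid requests, such as resuming a job that is already running, easy to reject in one place.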

Chunk Management

| Method | Endpoint | Description | Status |
|--------|----------|-------------|--------|
| POST | /api/v1/staging/jobs/{job_id}/chunks/{chunk_id}/ack | Acknowledge consumption of a chunk | Not Implemented |
| GET | /api/v1/staging/jobs/{job_id}/chunks/{chunk_id}/refresh | Refresh the presigned URL for a chunk | Not Implemented |
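
As an illustration, a consumer could derive the method and path for both chunk operations from the table above. The helper `chunk_endpoint` is hypothetical, and treating the IDs as strings that need percent-encoding is an assumption, since the ID format is not documented here.

```python
from typing import Tuple
from urllib.parse import quote

# HTTP method for each chunk action, taken from the endpoint table.
CHUNK_ACTIONS = {"ack": "POST", "refresh": "GET"}


def chunk_endpoint(job_id: str, chunk_id: str, action: str) -> Tuple[str, str]:
    """Return (HTTP method, path) for a chunk action (hypothetical helper)."""
    if action not in CHUNK_ACTIONS:
        raise ValueError(f"unknown chunk action: {action!r}")
    # Percent-encode the IDs so unusual characters cannot alter the path.
    job = quote(job_id, safe="")
    chunk = quote(chunk_id, safe="")
    return CHUNK_ACTIONS[action], f"/api/v1/staging/jobs/{job}/chunks/{chunk}/{action}"


if __name__ == "__main__":
    print(chunk_endpoint("job-1", "chunk-7", "ack"))
```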

Total: 8 endpoints (all not implemented)

Status

Current Status: All endpoints return 501 Not Implemented
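
Until the endpoints are implemented, callers may want to separate "not built yet" from real failures. A minimal sketch, assuming the client has the numeric HTTP status code available. The exception class `StagingNotImplemented` and the function `check_status` are hypothetical names.

```python
from http import HTTPStatus


class StagingNotImplemented(Exception):
    """Raised when the Staging Service answers 501 Not Implemented."""


def check_status(status_code: int) -> None:
    """Raise StagingNotImplemented on 501; accept any other status code."""
    if status_code == HTTPStatus.NOT_IMPLEMENTED:  # HTTPStatus.NOT_IMPLEMENTED == 501
        raise StagingNotImplemented("staging endpoint is still a stub (501)")


if __name__ == "__main__":
    try:
        check_status(501)
    except StagingNotImplemented as exc:
        print(exc)
```

A dedicated exception lets integration code skip or stub staging calls for now without masking other error responses.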

Planned Implementation:

  • Temporal workflow integration for job orchestration
  • Parquet conversion pipeline
  • Chunk-based data streaming
  • Presigned URL generation for chunk access
  • Job status tracking and progress monitoring

Internal Notes

  • Service will use Temporal for workflow orchestration
  • Data will be converted to Parquet format for ML consumption
  • Chunks will be streamed to Model Studio with acknowledgment mechanism
  • Presigned URLs will be generated for secure chunk access
  • Jobs will support pause/resume functionality
  • All endpoints are currently stubs returning 501 Not Implemented