Data Orchestration Service

The Data Orchestration Service handles query execution, data catalog management, schema discovery, and data lineage tracking across multiple data sources.

Overview

The Data Orchestration Service is responsible for:

  • Executing queries and data plans across multiple data sources
  • Managing the data catalog with schema discovery and metadata
  • Tracking data lineage (upstream and downstream dependencies)
  • Providing AI context for query generation
  • Managing connectors for 30+ data source types

Service Architecture Flow Diagram

[Figure: Data Orchestration Flow Diagram]

Data Orchestration Flow Overview:

This flow diagram shows the data orchestration workflow end to end, from connector management through query execution to lineage tracking: data sources are connected, schemas are discovered, queries are executed, and data lineage is recorded.

Key Flow Components:

  1. Connector Management: Routes requests to appropriate connector types (SQL, NoSQL, Cloud, File, API)
  2. Connection Pooling: Establishes and manages connections to data sources
  3. Catalog Service: Discovers schemas, extracts metadata, generates embeddings
  4. Execution Router: Routes queries to DuckDB, Trino, or native execution engines
  5. Query Execution: Executes queries with caching and result aggregation
  6. Lineage Engine: Tracks upstream and downstream data relationships
  7. AI Context Service: Provides semantic search via embeddings for query generation

Base Path

API endpoints are prefixed with /api/v1. Health-check endpoints are served from the root path /.

Base URL: http://data-orchestration:8000 (internal) or http://localhost:8000 (development)
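As a rough illustration of the path rules above, the following sketch builds full endpoint URLs. The helper name endpoint_url and the ROOT_PATHS set are hypothetical, and treating /metrics as a root-level path alongside /health is an assumption based on its placement under Health & Status.

```python
BASE_URL = "http://localhost:8000"  # development base URL from this page
API_PREFIX = "/api/v1"

# Assumption: both Health & Status endpoints are served from the root path.
ROOT_PATHS = {"/health", "/metrics"}


def endpoint_url(path: str, base_url: str = BASE_URL) -> str:
    """Return the full URL for an endpoint path such as '/query'."""
    if not path.startswith("/"):
        path = "/" + path
    # Root-level paths skip the /api/v1 prefix; everything else gets it.
    prefix = "" if path in ROOT_PATHS else API_PREFIX
    return base_url.rstrip("/") + prefix + path


print(endpoint_url("/health"))
print(endpoint_url("/catalog/tables", "http://data-orchestration:8000"))
```

If the deployment also exposes /metrics under /api/v1, removing it from ROOT_PATHS is the only change needed.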

Authentication

The service uses header-based authentication:

X-Org-ID: <organization_id>
X-User-ID: <user_id> (optional)

Note: The service accepts user_context in request bodies for stateless execution, allowing the main API to pass decrypted credentials at runtime.
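A minimal sketch of assembling the authentication headers described above. The function name auth_headers is hypothetical, and treating X-Org-ID as mandatory is an inference from the fact that only X-User-ID is marked optional.

```python
from typing import Dict, Optional


def auth_headers(org_id: str, user_id: Optional[str] = None) -> Dict[str, str]:
    """Build the header-based authentication fields for a request."""
    if not org_id:
        # Assumption: the service rejects requests without an organization ID.
        raise ValueError("X-Org-ID is required")
    headers = {"X-Org-ID": org_id}
    if user_id:
        # X-User-ID is documented as optional, so it is only sent when given.
        headers["X-User-ID"] = user_id
    return headers


print(auth_headers("org-123"))
print(auth_headers("org-123", "user-456"))
```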

Endpoints

Health & Status

Method  Endpoint  Description
GET     /health   Check service health
GET     /metrics  Get system metrics

Execution

Method  Endpoint  Description
POST    /execute  Execute a data plan
POST    /query    Execute a simple SQL query
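The request schema for these endpoints is not documented on this page, so the following is only a sketch of how a POST /query call might be prepared. The "sql" field name and the function build_query_request are assumptions. The user_context field and the X-Org-ID header come from the Authentication section, and the /api/v1 prefix from the Base Path section. The request is constructed but not sent.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_query_request(
    sql: str,
    org_id: str,
    user_context: Optional[Dict[str, Any]] = None,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the /query endpoint."""
    payload: Dict[str, Any] = {"sql": sql}  # hypothetical field name
    if user_context is not None:
        # user_context is mentioned in the docs; its inner shape is not.
        payload["user_context"] = user_context
    return urllib.request.Request(
        url=f"{base_url}/api/v1/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Org-ID": org_id},
        method="POST",
    )


req = build_query_request("SELECT 1", org_id="org-123")
print(req.get_method(), req.full_url)
```

Passing the prepared request to urllib.request.urlopen would send it. The response format is not specified here, so parsing it is left out.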

Connectors

Method  Endpoint                    Description
GET     /connectors                 List available connectors
GET     /connectors/{connector_id}  Get connector details
POST    /connections/test           Test connection to a data source

Catalog

Method  Endpoint                    Description
POST    /catalog                    Catalog a data source
GET     /catalog/tables             List tables in catalog
GET     /catalog/schemas            Get hierarchical catalog structure
POST    /catalog/search             Search catalog by text
GET     /catalog/tables/{table_id}  Get table details including columns
POST    /catalog/ai-context         Get catalog context for AI query generation

Lineage

Method  Endpoint                        Description
GET     /lineage/upstream/{table_id}    Get upstream lineage for a table
GET     /lineage/downstream/{table_id}  Get downstream lineage for a table
GET     /lineage/graph                  Get complete lineage graph
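To make the upstream/downstream distinction concrete, here is a small sketch of how lineage can be derived from a set of dependency edges. It does not reflect the service's actual response format or storage model. The edge list, table names, and function names are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source_table, target_table): target reads from source


def build_adjacency(edges: Iterable[Edge]):
    """Index edges in both directions for upstream and downstream lookups."""
    upstream: Dict[str, Set[str]] = defaultdict(set)
    downstream: Dict[str, Set[str]] = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return upstream, downstream


def reachable(start: str, neighbors: Dict[str, Set[str]]) -> Set[str]:
    """Breadth-first search; the seen set keeps cycles from looping forever."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # a table is not its own dependency
    return seen


edges = [
    ("raw.orders", "staging.orders"),
    ("staging.orders", "mart.revenue"),
    ("raw.customers", "mart.revenue"),
]
upstream, downstream = build_adjacency(edges)
print(sorted(reachable("mart.revenue", upstream)))
print(sorted(reachable("raw.orders", downstream)))
```

Upstream lineage walks the edges backwards (what a table depends on), while downstream lineage walks them forwards (what depends on the table).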

Schema Discovery

Method  Endpoint                  Description
GET     /schema/{data_source_id}  Discover schema for a data source

Total: 17 endpoints

Internal Notes

  • Service accepts stateless execution with data_source_configs in request body
  • Supports multiple compute engines (DuckDB, Trino)
  • Uses a vector database (Qdrant/Pinecone) for semantic catalog search
  • Catalog supports embeddings for AI-powered table discovery
  • Lineage tracking provides data dependency graphs
  • All endpoints are fully implemented
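
The embedding-based catalog search mentioned in the notes above can be illustrated with a toy ranking by cosine similarity. This is only a sketch of the general idea. In the service itself the search is delegated to the vector database, and the vectors, table names, and function names below are invented for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_tables(
    query_vec: Sequence[float],
    table_vecs: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k tables whose embeddings are closest to the query."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in table_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
table_vecs = {
    "orders": [1.0, 0.0],
    "customers": [0.0, 1.0],
    "order_items": [0.8, 0.6],
}
for name, score in rank_tables([1.0, 0.0], table_vecs, top_k=2):
    print(name, round(score, 3))
```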