Data Science Pipelines
Data Science Pipelines in Red Hat OpenShift AI provide infrastructure for automating and orchestrating machine learning workflows. By deploying a DataSciencePipelinesApplication (DSPA) in your namespace, you enable the use of Kubeflow Pipelines for creating reproducible, scalable ML workflows.
Overview
The Data Science Pipelines infrastructure enables you to:
- Automate ML workflows: Run data preprocessing, model training, and deployment steps as repeatable pipelines (a minimal pipeline sketch follows this list)
- Track experiments: Record run history, parameters, and output artifacts so results can be compared and reproduced
- Scale operations: Support distributed training and batch inference across cluster resources
- Collaborate effectively: Enable teams to share pipeline definitions, runs, and results
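To make the automation point concrete, here is a minimal sketch of a two-step pipeline written with the KFP v2 SDK. The component names and their toy logic (`preprocess`, `train`) are placeholders for illustration only, not part of OpenShift AI.

```python
from kfp import dsl


@dsl.component(base_image="python:3.11")
def preprocess(rows: int) -> int:
    """Toy preprocessing step: pretend to drop bad rows and report the count."""
    cleaned = max(rows - 10, 0)
    print(f"cleaned rows: {cleaned}")
    return cleaned


@dsl.component(base_image="python:3.11")
def train(rows: int) -> float:
    """Toy training step: pretend to train a model and return a score."""
    score = min(0.5 + rows / 10_000, 0.99)
    print(f"model score: {score}")
    return score


@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(rows: int = 1000):
    # Each step runs in its own container; a step's return value is passed to
    # downstream steps through its .output attribute.
    prep = preprocess(rows=rows)
    train(rows=prep.output)
```

Each `@dsl.component` function runs in its own container built from the given base image, and the single return value of a step is available to downstream steps through its `.output` attribute.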
Key Components
When you deploy a DSPA, it sets up the following components (a creation sketch follows this list):
- API Server: Core service for managing pipeline definitions and runs
- Persistence Agent: Tracks pipeline execution state
- Scheduled Workflow Controller: Manages pipeline scheduling
- Optional Components: Object storage (MinIO), database (MariaDB), pipeline UI, and metadata services
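As an illustration, the following sketch creates a DSPA with the Kubernetes Python client and enables the optional in-cluster MariaDB and MinIO components. The API version and the spec field names are assumptions based on common DSPA examples, not guaranteed values; check the CRD installed on your cluster (for example with `oc explain datasciencepipelinesapplication.spec`) before relying on them.

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a pod

namespace = "my-data-science-project"  # hypothetical project namespace

dspa = {
    "apiVersion": "datasciencepipelinesapplications.opendatahub.io/v1alpha1",  # assumed group/version
    "kind": "DataSciencePipelinesApplication",
    "metadata": {"name": "dspa", "namespace": namespace},
    "spec": {
        # Assumed field names: an in-cluster MariaDB and MinIO are convenient for
        # experimentation; production setups usually point at external services instead.
        "database": {"mariaDB": {"deploy": True}},
        "objectStorage": {"minio": {"deploy": True}},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="datasciencepipelinesapplications.opendatahub.io",
    version="v1alpha1",
    namespace=namespace,
    plural="datasciencepipelinesapplications",
    body=dspa,
)
```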
Getting Started
To begin using Data Science Pipelines:
- Enable pipeline infrastructure - Deploy a DataSciencePipelinesApplication in your namespace
- Create pipeline definitions using the KFP SDK or visual tools
- Submit and execute pipeline runs (see the sketch after this list)
- Monitor results through the dashboard or API
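The last two steps can be scripted against the DSPA's API server with the KFP SDK client. The sketch below reuses the `training_pipeline` function defined earlier; the route URL is a placeholder, and the token handling reflects the assumption that the API server sits behind an OpenShift OAuth proxy, so adjust both for your cluster.

```python
from kfp import Client

# Hypothetical route to the DSPA API server; find yours with `oc get routes`.
kfp_client = Client(
    host="https://ds-pipeline-dspa-my-data-science-project.apps.example.com",
    existing_token="sha256~REPLACE_ME",  # e.g. the output of `oc whoami --show-token`
)

# Compile and submit the pipeline defined in the earlier sketch.
run = kfp_client.create_run_from_pipeline_func(
    training_pipeline,
    arguments={"rows": 5000},
    experiment_name="demo-experiment",
)
print(f"started run {run.run_id}")

# Block until the run finishes, then print its final details.
result = kfp_client.wait_for_run_completion(run.run_id, timeout=1800)
print(result)
```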
In This Section
- Pipeline Setup - Enable pipeline capabilities in your namespace by deploying DataSciencePipelinesApplication resources
Prerequisites
Before working with pipelines, ensure you have:
- A Data Science Project with the required labels (see the sketch after this list)
- Appropriate permissions to create resources
- S3-compatible storage configured for pipeline artifacts
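For illustration, the first and last prerequisites can be satisfied with the Kubernetes Python client as sketched below. The dashboard label, the connection-type annotation, and the secret key names are assumptions drawn from common OpenShift AI data connection examples; confirm the exact values against your cluster's documentation.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
namespace = "my-data-science-project"  # hypothetical project namespace

# Label the namespace so the dashboard treats it as a data science project
# (assumed label).
core.patch_namespace(
    namespace,
    {"metadata": {"labels": {"opendatahub.io/dashboard": "true"}}},
)

# S3 credentials for pipeline artifacts, shaped like a dashboard data connection
# (assumed annotation and key names).
secret = client.V1Secret(
    metadata=client.V1ObjectMeta(
        name="aws-connection-pipelines",
        labels={"opendatahub.io/dashboard": "true"},
        annotations={"opendatahub.io/connection-type": "s3"},
    ),
    string_data={
        "AWS_ACCESS_KEY_ID": "REPLACE_ME",
        "AWS_SECRET_ACCESS_KEY": "REPLACE_ME",
        "AWS_S3_ENDPOINT": "https://s3.example.com",
        "AWS_S3_BUCKET": "pipeline-artifacts",
    },
    type="Opaque",
)
core.create_namespaced_secret(namespace, secret)
```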
Related Resources
- Projects - Create data science projects
- Data Connections - Configure S3 storage for artifacts
- Model Serving - Deploy trained models