Data Science Pipelines
Data Science Pipelines in Red Hat OpenShift AI provide infrastructure for automating and orchestrating machine learning workflows. By deploying a DataSciencePipelinesApplication (DSPA) in your namespace, you enable the use of Kubeflow Pipelines for creating reproducible, scalable ML workflows.
Overview
The Data Science Pipelines infrastructure enables you to:
- Automate ML workflows: Run data preprocessing, model training, and deployment steps as repeatable pipelines (a minimal pipeline sketch follows this list)
- Track experiments: Record run history, parameters, and output artifacts so results can be compared and reproduced
- Scale operations: Support distributed training and batch inference across cluster resources
- Collaborate effectively: Enable teams to share pipeline definitions, runs, and results
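To make the automation point concrete, here is a minimal sketch of a two-step pipeline written with the KFP v2 SDK. The component names and their toy logic (`preprocess`, `train`) are placeholders for illustration only, not part of OpenShift AI.

```python
from kfp import dsl


@dsl.component(base_image="python:3.11")
def preprocess(rows: int) -> int:
    """Toy preprocessing step: pretend to drop bad rows and report the count."""
    cleaned = max(rows - 10, 0)
    print(f"cleaned rows: {cleaned}")
    return cleaned


@dsl.component(base_image="python:3.11")
def train(rows: int) -> float:
    """Toy training step: pretend to train a model and return a score."""
    score = min(0.5 + rows / 10_000, 0.99)
    print(f"model score: {score}")
    return score


@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(rows: int = 1000):
    # Each step runs in its own container; a step's return value is passed to
    # downstream steps through its .output attribute.
    prep = preprocess(rows=rows)
    train(rows=prep.output)
```

Each `@dsl.component` function runs in its own container built from the given base image, and the single return value of a step is available to downstream steps through its `.output` attribute.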
Key Components
When you deploy a DSPA, it sets up the following components (a creation sketch follows this list):
- API Server: Core service for managing pipeline definitions and runs
- Persistence Agent: Tracks pipeline execution state
- Scheduled Workflow Controller: Manages pipeline scheduling
- Optional Components: Object storage (MinIO), database (MariaDB), pipeline UI, and metadata services
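As an illustration, the following sketch creates a DSPA with the Kubernetes Python client and enables the optional in-cluster MariaDB and MinIO components. The API version and the spec field names are assumptions based on common DSPA examples, not guaranteed values; check the CRD installed on your cluster (for example with `oc explain datasciencepipelinesapplication.spec`) before relying on them.

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a pod

namespace = "my-data-science-project"  # hypothetical project namespace

dspa = {
    "apiVersion": "datasciencepipelinesapplications.opendatahub.io/v1alpha1",  # assumed group/version
    "kind": "DataSciencePipelinesApplication",
    "metadata": {"name": "dspa", "namespace": namespace},
    "spec": {
        # Assumed field names: an in-cluster MariaDB and MinIO are convenient for
        # experimentation; production setups usually point at external services instead.
        "database": {"mariaDB": {"deploy": True}},
        "objectStorage": {"minio": {"deploy": True}},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="datasciencepipelinesapplications.opendatahub.io",
    version="v1alpha1",
    namespace=namespace,
    plural="datasciencepipelinesapplications",
    body=dspa,
)
```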
Getting Started
To begin using Data Science Pipelines:
- Enable pipeline infrastructure - Deploy a DataSciencePipelinesApplication in your namespace
- Create pipeline definitions using the KFP SDK or visual tools
- Submit and execute pipeline runs (see the sketch after this list)
- Monitor results through the dashboard or API
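The last two steps can be scripted against the DSPA's API server with the KFP SDK client. The sketch below reuses the `training_pipeline` function defined earlier; the route URL is a placeholder, and the token handling reflects the assumption that the API server sits behind an OpenShift OAuth proxy, so adjust both for your cluster.

```python
from kfp import Client

# Hypothetical route to the DSPA API server; find yours with `oc get routes`.
kfp_client = Client(
    host="https://ds-pipeline-dspa-my-data-science-project.apps.example.com",
    existing_token="sha256~REPLACE_ME",  # e.g. the output of `oc whoami --show-token`
)

# Compile and submit the pipeline defined in the earlier sketch.
run = kfp_client.create_run_from_pipeline_func(
    training_pipeline,
    arguments={"rows": 5000},
    experiment_name="demo-experiment",
)
print(f"started run {run.run_id}")

# Block until the run finishes, then print its final details.
result = kfp_client.wait_for_run_completion(run.run_id, timeout=1800)
print(result)
```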
In This Section
- Pipeline Setup - Enable pipeline capabilities in your namespace by deploying DataSciencePipelinesApplication resources
Prerequisites
Before working with pipelines, ensure you have:
- A Data Science Project with the required labels (see the sketch after this list)
- Appropriate permissions to create resources
- S3-compatible storage configured for pipeline artifacts
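For illustration, the first and last prerequisites can be satisfied with the Kubernetes Python client as sketched below. The dashboard label, the connection-type annotation, and the secret key names are assumptions drawn from common OpenShift AI data connection examples; confirm the exact values against your cluster's documentation.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
namespace = "my-data-science-project"  # hypothetical project namespace

# Label the namespace so the dashboard treats it as a data science project
# (assumed label).
core.patch_namespace(
    namespace,
    {"metadata": {"labels": {"opendatahub.io/dashboard": "true"}}},
)

# S3 credentials for pipeline artifacts, shaped like a dashboard data connection
# (assumed annotation and key names).
secret = client.V1Secret(
    metadata=client.V1ObjectMeta(
        name="aws-connection-pipelines",
        labels={"opendatahub.io/dashboard": "true"},
        annotations={"opendatahub.io/connection-type": "s3"},
    ),
    string_data={
        "AWS_ACCESS_KEY_ID": "REPLACE_ME",
        "AWS_SECRET_ACCESS_KEY": "REPLACE_ME",
        "AWS_S3_ENDPOINT": "https://s3.example.com",
        "AWS_S3_BUCKET": "pipeline-artifacts",
    },
    type="Opaque",
)
core.create_namespaced_secret(namespace, secret)
```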
Related Resources
- Projects - Create data science projects
- Data Connections - Configure S3 storage for artifacts
- Model Serving - Deploy trained models