Red Hat OpenShift AI Self-Managed
2.25
Working with data science pipelines
Work with data science pipelines from Red Hat OpenShift AI Self-Managed
Last Updated: 2025-10-31

Legal Notice

Copyright © 2025 Red Hat, Inc.
The text of and illustrations in this document are licensed by Red Hat under a Creative Commons
Attribution–Share Alike 3.0 Unported license ("CC-BY-SA"). An explanation of CC-BY-SA is
available at http://creativecommons.org/licenses/by-sa/3.0/. In accordance with CC-BY-SA, if you
distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert,
Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift,
Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States
and other countries.
Linux ® is the registered trademark of Linus Torvalds in the United States and other countries.
Java ® is a registered trademark of Oracle and/or its affiliates.
XFS ® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States
and/or other countries.
MySQL ® is a registered trademark of MySQL AB in the United States, the European Union and
other countries.
Node.js ® is an official trademark of Joyent. Red Hat is not formally related to or endorsed by the
official Joyent Node.js open source or commercial project.
The OpenStack ® Word Mark and OpenStack logo are either registered trademarks/service marks
or trademarks/service marks of the OpenStack Foundation, in the United States and other
countries and are used with the OpenStack Foundation's permission. We are not affiliated with,
endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.
All other trademarks are the property of their respective owners.

Abstract

Enhance your data science projects on OpenShift AI by building portable machine learning (ML)
workflows with data science pipelines.


Table of Contents

PREFACE
CHAPTER 1. MANAGING DATA SCIENCE PIPELINES
  1.1. CONFIGURING A PIPELINE SERVER
    1.1.1. Configuring a pipeline server with an external Amazon RDS database
  1.2. DEFINING A PIPELINE
    1.2.1. Compiling the pipeline YAML with the Kubeflow Pipelines SDK
    1.2.2. Compiling Kubernetes-native manifests with the Kubeflow Pipelines SDK
    1.2.3. Defining a pipeline by using the Kubernetes API
    1.2.4. Migrating pipelines from database to Kubernetes API storage
  1.3. IMPORTING A DATA SCIENCE PIPELINE
  1.4. DELETING A DATA SCIENCE PIPELINE
  1.5. DELETING A PIPELINE SERVER
  1.6. VIEWING THE DETAILS OF A PIPELINE SERVER
  1.7. VIEWING EXISTING PIPELINES
  1.8. OVERVIEW OF PIPELINE VERSIONS
  1.9. UPLOADING A PIPELINE VERSION
  1.10. DELETING A PIPELINE VERSION
  1.11. VIEWING THE DETAILS OF A PIPELINE VERSION
  1.12. DOWNLOADING A DATA SCIENCE PIPELINE VERSION
  1.13. OVERVIEW OF DATA SCIENCE PIPELINES CACHING
    1.13.1. Caching criteria
    1.13.2. Viewing cached steps in the OpenShift AI user interface
    1.13.3. Controlling caching in data science pipelines
      1.13.3.1. Disabling caching for individual tasks
      1.13.3.2. Disabling caching for a pipeline at submit time
      1.13.3.3. Disabling caching for a pipeline at compile time
      1.13.3.4. Disabling caching for all pipelines (pipeline server)
CHAPTER 2. MANAGING PIPELINE EXPERIMENTS
  2.1. OVERVIEW OF PIPELINE EXPERIMENTS
  2.2. CREATING A PIPELINE EXPERIMENT
  2.3. ARCHIVING A PIPELINE EXPERIMENT
  2.4. DELETING AN ARCHIVED PIPELINE EXPERIMENT
  2.5. RESTORING AN ARCHIVED PIPELINE EXPERIMENT
  2.6. VIEWING PIPELINE TASK EXECUTIONS
  2.7. VIEWING PIPELINE ARTIFACTS
  2.8. COMPARING RUNS IN AN EXPERIMENT
  2.9. COMPARING RUNS IN DIFFERENT EXPERIMENTS
CHAPTER 3. MANAGING PIPELINE RUNS
  3.1. OVERVIEW OF PIPELINE RUNS
  3.2. STORING DATA WITH DATA SCIENCE PIPELINES
  3.3. VIEWING ACTIVE PIPELINE RUNS
  3.4. EXECUTING A PIPELINE RUN
  3.5. STOPPING AN ACTIVE PIPELINE RUN
  3.6. DUPLICATING AN ACTIVE PIPELINE RUN
  3.7. VIEWING SCHEDULED PIPELINE RUNS
  3.8. SCHEDULING A PIPELINE RUN USING A CRON JOB
  3.9. SCHEDULING A PIPELINE RUN
  3.10. DUPLICATING A SCHEDULED PIPELINE RUN
  3.11. DELETING A SCHEDULED PIPELINE RUN

PREFACE

As a data scientist, you can enhance your data science projects on OpenShift AI by building portable machine learning (ML) workflows with data science pipelines, using Docker containers. This enables you to standardize and automate machine learning workflows so that you can develop and deploy your data science models.

For example, the steps in a machine learning workflow might include items such as data extraction, data processing, feature extraction, model training, model validation, and model serving. Automating these activities enables your organization to develop a continuous process of retraining and updating a model based on newly received data. This can help address challenges related to building an integrated machine learning deployment and continuously operating it in production.

You can also use the Elyra JupyterLab extension to create and run data science pipelines within JupyterLab. For more information, see Working with pipelines in JupyterLab.

To use a data science pipeline in OpenShift AI, you need the following components:

  - Pipeline server: A server that is attached to your data science project and hosts your data science pipeline.
  - Pipeline: A pipeline defines the configuration of your machine learning workflow and the relationship between each component in the workflow.
    - Pipeline code: A definition of your pipeline in a YAML file.
    - Pipeline graph: A graphical illustration of the steps executed in a pipeline run and the relationship between them.
  - Pipeline experiment: A workspace where you can try different configurations of your pipelines. You can use experiments to organize your runs into logical groups.
    - Archived pipeline experiment: An archived pipeline experiment.
    - Pipeline artifact: An output artifact produced by a pipeline component.
    - Pipeline execution: The execution of a task in a pipeline.
  - Pipeline run: An execution of your pipeline.
    - Active run: A pipeline run that is executing, or stopped.
    - Scheduled run: A pipeline run that is scheduled to execute at least once.
    - Archived run: An archived pipeline run.

This feature is based on Kubeflow Pipelines 2.0. Use the latest Kubeflow Pipelines 2.0 SDK to build your data science pipeline in Python code. After you have built your pipeline, use the SDK to compile it into an Intermediate Representation (IR) YAML file. The OpenShift AI user interface enables you to track and manage pipelines, experiments, and pipeline runs. To view a record of previously executed, scheduled, and archived runs, go to Data science pipelines → Runs, or select an experiment from the Experiments → Experiments and runs page to access all of its pipeline runs.

You can manage incremental changes to pipelines in OpenShift AI by using versioning. This allows you to develop and deploy pipelines iteratively, preserving a record of your changes.

You can store your pipeline artifacts in an S3-compatible object storage bucket so that you do not

CHAPTER 1. MANAGING DATA SCIENCE PIPELINES

1.1. CONFIGURING A PIPELINE SERVER

Before you can successfully create a pipeline in OpenShift AI, you must configure a pipeline server. This task includes configuring where your pipeline artifacts and data are stored.

NOTE

You are not required to specify any storage directories when configuring a connection for your pipeline server. When you import a pipeline, the /pipelines folder is created in the root folder of the bucket, containing a YAML file for the pipeline. If you upload a new version of the same pipeline, a new YAML file with a different ID is added to the /pipelines folder. When you run a pipeline, the artifacts are stored in the /pipeline-name folder in the root folder of the bucket.

Prerequisites

  - You have logged in to Red Hat OpenShift AI.
  - You have created a data science project that you can add a pipeline server to.
  - You have an existing S3-compatible object storage bucket and you have configured write access to your S3 bucket on your storage account.
  - If you are configuring a pipeline server for production pipeline workloads, you have an existing external MySQL or MariaDB database.
  - If you are configuring a pipeline server with an external MySQL database, your database must use at least MySQL version 5.x. However, Red Hat recommends that you use MySQL version 8.x.

NOTE

The mysql_native_password authentication plugin is required for the ML Metadata component to successfully connect to your database. mysql_native_password is disabled by default in MySQL 8.4 and later. If your database uses MySQL 8.4 or later, you must update your MySQL deployment to enable the mysql_native_password plugin. For more information about enabling the mysql_native_password plugin, see Native Pluggable Authentication in the MySQL documentation.

  - If you are configuring a pipeline server with a MariaDB database, your database must use MariaDB version 10.3 or later. However, Red Hat recommends that you use at least MariaDB version 10.5.

Procedure

  1. From the OpenShift AI dashboard, click Data science projects. The Data science projects page opens.
  2. Click the name of the project that you want to configure a pipeline server for. A project details page opens.
  3. Click the Pipelines tab.
  4. Click Configure pipeline server. The Configure pipeline server dialog opens.
  5. In the Object storage connection section, provide values for the mandatory fields:
     a. In the Access key field, enter the access key ID for the S3-compatible object storage provider.
     b. In the Secret key field, enter the secret access key for the S3-compatible object storage account that you specified.
     c. In the Endpoint field, enter the endpoint of your S3-compatible object storage bucket.
     d. In the Region field, enter the default region of your S3-compatible object storage account.
     e. In the Bucket field, enter the name of your S3-compatible object storage bucket.

IMPORTANT

If you specify incorrect connection settings, you cannot update these settings on the same pipeline server. Therefore, you must delete the pipeline server and configure another one.

If you want to use an existing artifact that was not generated by a task in a pipeline, you can use the kfp.dsl.importer component to import the artifact from its URI. You can only import these artifacts to the S3-compatible object storage bucket that you define in the Bucket field in your pipeline server configuration. For more information about the kfp.dsl.importer component, see Special Case: Importer Components.

  6. Click Advanced settings to display the Database, Pipeline definition storage, and Pipeline caching sections.
  7. In the Database section, choose one of the following options to specify where to store your pipeline metadata and run information:
     - Select Default database on the cluster to deploy a MariaDB database in your project.

IMPORTANT

The Default database on the cluster option is intended for development and testing purposes only. For production pipeline workloads, select the External MySQL database option to use an external MySQL or MariaDB database.

     - Select External MySQL database to add a new connection to an external MySQL or MariaDB database that your pipeline server can access.
       i. In the Host field, enter the database hostname.
       ii. In the Port field, enter the database port.

For example, if the database was created in the us-east-1 region, download us-east-1-bundle.pem.

  1. In a terminal window, log in to the OpenShift cluster where OpenShift AI is deployed:
     oc login api.<cluster_name>.<cluster_domain>:6443 --web
  2. Run the following command to fetch the current OpenShift AI trusted CA configuration and store it in a new file:
     oc get dscinitializations.dscinitialization.opendatahub.io default-dsci -o json | jq -r '.spec.trustedCABundle.customCABundle' > /tmp/my-custom-ca-bundles.crt
  3. Run the following command to append the PEM certificate bundle that you downloaded to the new custom CA configuration file:
     cat us-east-1-bundle.pem >> /tmp/my-custom-ca-bundles.crt
  4. Run the following command to update the OpenShift AI trusted CA configuration to trust certificates issued by the CAs included in the new custom CA configuration file:
     oc patch dscinitialization default-dsci --type='json' -p='[{"op":"replace","path":"/spec/trustedCABundle/customCABundle","value":"'"$(awk '{printf "%s\\n", $0}' /tmp/my-custom-ca-bundles.crt)"'"}]'
  5. Configure a pipeline server, as described in Configuring a pipeline server.

Verification

  - The pipeline server starts successfully.
  - You can import and run data science pipelines.

1.2. DEFINING A PIPELINE

The Kubeflow Pipelines SDK enables you to define end-to-end machine learning and data pipelines. Use the latest Kubeflow Pipelines 2.0 SDK to build your data science pipeline in Python code. After you have built your pipeline, use the SDK to compile it into an Intermediate Representation (IR) YAML file. For more information about compiling pipelines, see Compiling the pipeline YAML with the Kubeflow Pipelines SDK and Compiling Kubernetes-native manifests with the Kubeflow Pipelines SDK. Compiling to Kubernetes-native manifests is optional and applies only when your pipeline server is configured to use Kubernetes API storage.

After you define the pipeline, you can import the YAML file to the OpenShift AI dashboard to configure its execution settings.
IMPORTANT

If you are using OpenShift AI on a cluster running in FIPS mode, any custom container images for data science pipelines must be based on UBI 9 or RHEL 9. This ensures compatibility with FIPS-approved pipeline components and prevents errors related to mismatched OpenSSL or GNU C Library (glibc) versions.

You can also use the Elyra JupyterLab extension to create and run data science pipelines within JupyterLab. For more information about creating pipelines in JupyterLab, see Working with pipelines in JupyterLab. For more information about the Elyra JupyterLab extension, see Elyra Documentation.

Additional resources

  - Kubeflow Pipelines 2.0 Documentation
  - Elyra Documentation

1.2.1. Compiling the pipeline YAML with the Kubeflow Pipelines SDK

Before you can define your pipeline in the cluster, you must convert your Python-defined pipeline into YAML format. You can use the Kubeflow Pipelines (KFP) Software Development Kit (SDK) to compile your pipeline code into a deployable YAML file for declarative GitOps deployment.

Prerequisites

  - You have installed Python 3.11 or later in your local environment.
  - You have installed the Kubeflow Pipelines SDK package (kfp) version 2.14.3 or later.
  - You have a valid Python pipeline definition file.

Procedure

Compile your pipeline by using the KFP SDK to generate the pipeline YAML file. In the following example, replace <pipeline_file>.py with the name of your Python pipeline file and specify an output file for the compiled YAML:

$ kfp dsl compile \
    --py <pipeline_file>.py \
    --output <compiled_pipeline_file>.yaml

NOTE

The generated <compiled_pipeline_file>.yaml file contains the compiled pipeline specification in YAML format. You can use this content as the value of the pipelineSpec field when you create a PipelineVersion custom resource (CR). You can also store the file in Git for declarative or GitOps-based deployment.

Verification

Verify that the generated file includes a pipelineSpec key followed by the compiled pipeline definition:

$ head -n 10 <compiled_pipeline_file>.yaml

Additional resources

  - Compiling a pipeline with the Kubeflow Pipelines SDK
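
The note in this section says that the compiled content can be used as the value of the pipelineSpec field in a PipelineVersion custom resource. The following sketch shows one way to embed compiled YAML text under spec.pipelineSpec by using only the Python standard library. The function name, the sample input, and the resource names are hypothetical. The apiVersion value and the pipelineName and displayName fields are assumptions based on the upstream Kubeflow Pipelines custom resources, so confirm them against your cluster (for example, with oc explain pipelineversion.spec) before relying on them.

```python
import textwrap


def render_pipeline_version(
    compiled_yaml: str, version_name: str, pipeline_name: str, namespace: str
) -> str:
    """Wrap compiled pipeline YAML text in a PipelineVersion manifest.

    Field names other than pipelineSpec are assumptions (see the lead-in).
    """
    # This sketch handles a single YAML document only; compiled files that
    # contain several documents separated by "---" need separate handling.
    lines = compiled_yaml.splitlines()
    if any(line.strip() == "---" for line in lines):
        raise ValueError("multi-document YAML is not handled by this sketch")

    # Indent the compiled content so that it nests under spec.pipelineSpec.
    nested_spec = textwrap.indent(compiled_yaml.strip("\n"), " " * 4)
    header = "\n".join(
        [
            "apiVersion: pipelines.kubeflow.org/v2beta1",
            "kind: PipelineVersion",
            "metadata:",
            f"  name: {version_name}",
            f"  namespace: {namespace}",
            "spec:",
            f"  displayName: {version_name}",
            f"  pipelineName: {pipeline_name}",
            "  pipelineSpec:",
        ]
    )
    return f"{header}\n{nested_spec}\n"


# Stand-in for the content of <compiled_pipeline_file>.yaml.
sample_compiled_yaml = "pipelineInfo:\n  name: example\nschemaVersion: 2.1.0\n"
manifest = render_pipeline_version(
    sample_compiled_yaml, "example-v1", "example", "my-project"
)
print(manifest)
```

Generating the manifest as text keeps the compiled content byte-for-byte intact, which is convenient when the file is stored in Git. A YAML library would be a more robust choice for anything beyond a single-document file.
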

1.2.2. Compiling Kubernetes-native manifests with the Kubeflow Pipelines SDK


Additional resources

  - Compiling for Kubernetes native API mode

1.2.3. Defining a pipeline by using the Kubernetes API

You can define data science pipelines and pipeline versions by using the Kubernetes API, which stores them as custom resources in the cluster instead of the internal database. This approach makes it easier to use OpenShift GitOps (Argo CD) or similar tools to manage pipelines and pipeline versions, while still allowing you to manage them through the OpenShift AI dashboard, API, and the Kubeflow Pipelines (KFP) Software Development Kit (SDK). You can generate the required manifests by using the Kubeflow Pipelines SDK; see Compiling the pipeline YAML with the Kubeflow Pipelines SDK or Compiling Kubernetes-native manifests with the Kubeflow Pipelines SDK.

NOTE

If your pipeline server is already configured to use Kubernetes API storage, you can still use the OpenShift AI dashboard and REST API to view pipeline details, run pipelines, and create schedules. In this mode, the Kubernetes API acts as the storage backend, so your existing tools continue to work as expected.

Prerequisites

  - You have OpenShift AI administrator privileges or you are the project owner.
  - You have a data science project with a running pipeline server.
  - You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
    - Installing the OpenShift CLI for OpenShift Container Platform
    - Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
  - If you plan to create a PipelineVersion custom resource, you have either:
    - Compiled your Python pipeline to IR YAML by using the KFP SDK. See Compiling the pipeline YAML with the Kubeflow Pipelines SDK.
    - Compiled Kubernetes-native manifests by using the KFP SDK. See Compiling Kubernetes-native manifests with the Kubeflow Pipelines SDK.

Procedure

  1. In a terminal window, log in to your OpenShift cluster by using the OpenShift CLI (oc):
     $ oc login -u <user_name>
     When prompted, enter the OpenShift server URL, connection type, and your password.
spec:
  pipelineSpec: ...
  platformSpec: ... # present when Kubernetes resource configuration is used
  2. To configure the pipeline server to use Kubernetes API storage instead of the default database option, set the spec.apiServer.pipelineStore field to kubernetes in your project’s DataSciencePipelinesApplication (DSPA) custom resource. In the following command, replace <dspa_name> with the name of your DSPA custom resource, and replace <project_name> with the name of your project:
     $ oc patch dspa <dspa_name> -n <project_name> \
         --type=merge \
         -p '{"spec": {"apiServer": {"pipelineStore": "kubernetes"}}}'

WARNING

When you switch the pipeline server from database storage to Kubernetes API storage, existing pipelines that were stored in the internal database are no longer visible in the OpenShift AI dashboard or REST API. To view or manage those pipelines again, change the spec.apiServer.pipelineStore field back to database.

  3. Define a Pipeline custom resource in a YAML file with the following contents:

     Example pipeline definition

     apiVersion: pipelines.kubeflow.org/v2beta1
     kind: Pipeline
     metadata:
       name: <pipeline_name>
       namespace: <project_name>
     spec:
       displayName: <pipeline_display_name>

     - name: The immutable Kubernetes resource name of your pipeline.
     - namespace: The name of your project.
     - displayName: The user-friendly display name of your pipeline, which is shown in the dashboard and REST API.

  4. Apply the pipeline definition to create the Pipeline custom resource in your cluster. In the following command, replace <pipeline_yaml_file> with the name of your YAML file:

     Example command

     $ oc apply -f <pipeline_yaml_file>.yaml

  5. Alternatively, if you compiled Kubernetes-native manifests with the KFP SDK, you can apply the generated file directly without manually creating separate YAML files:
     $ oc apply -f <output_file>.yaml

  1. Check that the PipelineVersion custom resource was successfully created. In the following command, replace <project_name> with the name of your project:
     $ oc get pipelineversion <pipeline_version_name> -n <project_name>
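
The switch between database and Kubernetes API storage described in this section is a merge patch on the spec.apiServer.pipelineStore field. When you script that change, building the JSON payload programmatically avoids shell quoting mistakes. The following sketch uses only the Python standard library; the function name and the sample DSPA and project names are hypothetical, and running the resulting command still requires a logged-in oc session.

```python
import json
import shlex


def build_pipeline_store_patch_command(
    dspa_name: str, namespace: str, store: str = "kubernetes"
) -> list[str]:
    """Return the argument list for an oc merge patch of pipelineStore."""
    # The two values named in this section are "kubernetes" and "database".
    if store not in {"kubernetes", "database"}:
        raise ValueError(f"unsupported pipelineStore value: {store!r}")

    patch = {"spec": {"apiServer": {"pipelineStore": store}}}
    return [
        "oc", "patch", "dspa", dspa_name,
        "-n", namespace,
        "--type=merge",
        "-p", json.dumps(patch),
    ]


# Hypothetical resource names, used only for illustration.
command = build_pipeline_store_patch_command("sample-dspa", "my-project")

# shlex.join quotes the JSON payload so that the printed line can be pasted
# into a POSIX shell as-is.
print(shlex.join(command))
```

Passing the argument list to subprocess.run(command, check=True) runs the patch without involving a shell at all, which sidesteps quoting entirely. Calling the function with store="database" produces the command that reverts the change described in the warning.
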
1.2.4. Migrating pipelines from database to Kubernetes API storage

You can migrate existing pipelines and pipeline versions from the internal database to Kubernetes custom resources. This makes it easier to use OpenShift GitOps (Argo CD) or similar tools to manage pipelines and pipeline versions, while still allowing you to manage them through the OpenShift AI dashboard, API, and the Kubeflow Pipelines (KFP) Software Development Kit (SDK). This procedure uses a community-supported Kubeflow Pipelines migration script to export pipelines from the Data Science Pipelines API and generate corresponding Pipeline and PipelineVersion custom resources for import into your cluster.

IMPORTANT

The migration script in this procedure is maintained by the Kubeflow Pipelines community and is not supported by Red Hat. Before you use the script, review the repository and validate it in a non-production environment.

WARNING

The pipeline and pipeline version IDs change during migration, so existing pipeline runs do not map to the migrated pipeline version. The original ID is stored in the pipelines.kubeflow.org/original-id label.

Prerequisites

  - You have OpenShift AI administrator privileges or you are the project owner.
  - You have a data science project with a running pipeline server.
  - The pipeline server is configured with spec.apiServer.pipelineStore: database.
  - You have Python 3.11 installed in your local environment.
  - You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
    - Installing the OpenShift CLI for OpenShift Container Platform
    - Installing the OpenShift CLI for Red Hat OpenShift Service on AWS

Procedure

  1. In a terminal window, log in to your OpenShift cluster by using the OpenShift CLI (oc):
     $ oc login -u <user_name>
     When prompted, enter the OpenShift server URL, connection type, and your password.

  2. Set environment variables for your data science project and get the pipeline API route. In the export command, replace <project_name> with the name of your project:
     echo "Setting the prerequisite variables"
     export NAMESPACE=<project_name>
     export DSPA_NAME=$(oc -n $NAMESPACE get dspa -o jsonpath='{.items[0].metadata.name}')
     export API_URL="https://$(oc -n $NAMESPACE get route "ds-pipeline-$DSPA_NAME" -o jsonpath='{.spec.host}')"
  3. Create a Python virtual environment and install the required dependencies:
     echo "Set up the Python prerequisites"
     python3.11 -m venv .venv
     ./.venv/bin/pip install kfp requests PyYAML
  4. Download and run the Kubeflow Pipelines community migration script. The script connects to the Data Science Pipelines API, exports all pipelines and versions from the specified data science project, and generates one YAML file per pipeline in a local kfp-exported-pipelines/ directory. Each file includes a Pipeline resource followed by all associated PipelineVersion resources.
     a. Run the following commands:
        curl -L https://raw.githubusercontent.com/kubeflow/pipelines/refs/heads/master/tools/k8s-native/migration.py -o migration.py
        ./.venv/bin/python migration.py --skip-tls-verify --kfp-server-host $API_URL --namespace $NAMESPACE --token "$(oc whoami --show-token)"

NOTE

The --skip-tls-verify option disables certificate validation and should be used only in development environments or when connecting to a server with a self-signed certificate. In production environments, provide a valid certificate bundle instead.

Additionally, passing the access token directly on the command line might expose it in shell history or process lists. To reduce this risk, store the token in an environment variable and reference it in your command:

        export KFP_TOKEN=$(oc whoami --show-token)
        ./.venv/bin/python migration.py --kfp-server-host $API_URL --namespace $NAMESPACE --token "$KFP_TOKEN"

Alternatively, use a prompt with read -s to input the token securely at runtime.

     b. Optional: For more information about the script, run the following command:
        ./.venv/bin/python migration.py --help
     c. If you plan to create new or updated PipelineVersion custom resources after migration,