Red Hat OpenShift AI Self-Managed
2.25
Deploying models
Deploy models in Red Hat OpenShift AI Self-Managed
Last Updated: 2025-10-28

Legal Notice

Copyright © 2025 Red Hat, Inc.

The text of and illustrations in this document are licensed by Red Hat under a Creative Commons Attribution–Share Alike 3.0 Unported license ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/3.0/. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.

Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.

Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries.

Linux® is the registered trademark of Linus Torvalds in the United States and other countries.

Java® is a registered trademark of Oracle and/or its affiliates.

XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries.

MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries.

Node.js® is an official trademark of Joyent. Red Hat is not formally related to or endorsed by the official Joyent Node.js open source or commercial project.

The OpenStack® Word Mark and OpenStack logo are either registered trademarks/service marks or trademarks/service marks of the OpenStack Foundation, in the United States and other countries and are used with the OpenStack Foundation's permission. We are not affiliated with, endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.

All other trademarks are the property of their respective owners.

Abstract

As a Red Hat OpenShift AI user, you can deploy your machine-learning models in Red Hat OpenShift AI Self-Managed.


Table of Contents

CHAPTER 1. STORING MODELS
  1.1. USING OCI CONTAINERS FOR MODEL STORAGE
  1.2. STORING A MODEL IN AN OCI IMAGE
  1.3. UPLOADING MODEL FILES TO A PERSISTENT VOLUME CLAIM (PVC)

CHAPTER 2. DEPLOYING MODELS ON THE SINGLE-MODEL SERVING PLATFORM
  2.1. ABOUT KSERVE DEPLOYMENT MODES
  2.2. DEPLOYING MODELS ON THE SINGLE-MODEL SERVING PLATFORM
  2.3. DEPLOYING A MODEL STORED IN AN OCI IMAGE BY USING THE CLI
  2.4. DEPLOYING MODELS BY USING DISTRIBUTED INFERENCE WITH LLM-D
    2.4.1. Example usage for Distributed Inference with llm-d
      2.4.1.1. Single-node GPU deployment
      2.4.1.2. Multi-node deployment
      2.4.1.3. Intelligent inference scheduler with KV cache routing
  2.5. MONITORING MODELS ON THE SINGLE-MODEL SERVING PLATFORM
    2.5.1. Viewing performance metrics for a deployed model
    2.5.2. Viewing model-serving runtime metrics for the single-model serving platform

CHAPTER 3. DEPLOYING MODELS ON THE NVIDIA NIM MODEL SERVING PLATFORM
  3.1. DEPLOYING MODELS ON THE NVIDIA NIM MODEL SERVING PLATFORM
  3.2. VIEWING NVIDIA NIM METRICS FOR A NIM MODEL
  3.3. VIEWING PERFORMANCE METRICS FOR A NIM MODEL

CHAPTER 4. DEPLOYING MODELS ON THE MULTI-MODEL SERVING PLATFORM
  4.1. ADDING A MODEL SERVER FOR THE MULTI-MODEL SERVING PLATFORM
  4.2. DELETING A MODEL SERVER
  4.3. DEPLOYING A MODEL BY USING THE MULTI-MODEL SERVING PLATFORM
  4.4. VIEWING A DEPLOYED MODEL
  4.5. UPDATING THE DEPLOYMENT PROPERTIES OF A DEPLOYED MODEL
  4.6. DELETING A DEPLOYED MODEL
  4.7. CONFIGURING MONITORING FOR THE MULTI-MODEL SERVING PLATFORM
  4.8. VIEWING MODEL-SERVING RUNTIME METRICS FOR THE MULTI-MODEL SERVING PLATFORM
  4.9. VIEWING PERFORMANCE METRICS FOR ALL MODELS ON A MODEL SERVER
  4.10. VIEWING HTTP REQUEST METRICS FOR A DEPLOYED MODEL

CHAPTER 5. MAKING INFERENCE REQUESTS TO DEPLOYED MODELS
  5.1. ACCESSING THE AUTHENTICATION TOKEN FOR A DEPLOYED MODEL
  5.2. ACCESSING THE INFERENCE ENDPOINT FOR A DEPLOYED MODEL
  5.3. MAKING INFERENCE REQUESTS TO MODELS DEPLOYED ON THE SINGLE-MODEL SERVING PLATFORM
  5.4. INFERENCE ENDPOINTS
    5.4.1. Caikit TGIS ServingRuntime for KServe
    5.4.2. Caikit Standalone ServingRuntime for KServe
    5.4.3. TGIS Standalone ServingRuntime for KServe
    5.4.4. OpenVINO Model Server
    5.4.5. vLLM NVIDIA GPU ServingRuntime for KServe
    5.4.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServe
    5.4.7. vLLM AMD GPU ServingRuntime for KServe
    5.4.8. vLLM Spyre AI Accelerator ServingRuntime for KServe
    5.4.9. NVIDIA Triton Inference Server
    5.4.10. Seldon MLServer
    5.4.11. Additional resources

CHAPTER 1. STORING MODELS

You must store your model before you can deploy it. You can store a model in an S3 bucket, at a URI, or in an Open Container Initiative (OCI) container.

1.1. USING OCI CONTAINERS FOR MODEL STORAGE

As an alternative to storing a model in an S3 bucket or URI, you can upload models to Open Container Initiative (OCI) containers. Deploying models from OCI containers is also known as modelcars in KServe.

Using OCI containers for model storage can help you:

  - Reduce startup times by avoiding downloading the same model multiple times.
  - Reduce disk space usage by reducing the number of models downloaded locally.
  - Improve model performance by allowing pre-fetched images.

Using OCI containers for model storage involves the following tasks:

  - Storing a model in an OCI image.
  - Deploying a model from an OCI image by using either the user interface or the command line interface. To deploy a model by using:
      - The user interface, see Deploying models on the single-model serving platform.
      - The command line interface, see Deploying a model stored in an OCI image by using the CLI.

1.2. STORING A MODEL IN AN OCI IMAGE

You can store a model in an OCI image. The following procedure uses the example of storing a MobileNet v2-7 model in ONNX format.

Prerequisites

  - You have a model in the ONNX format. The example in this procedure uses the MobileNet v2-7 model in ONNX format.
  - You have installed the Podman tool.

Procedure

  1. In a terminal window on your local machine, create a temporary directory for storing both the model and the support files that you need to create the OCI image:
     cd $(mktemp -d)
  2. Create a models folder inside the temporary directory:
     mkdir -p models/1

NOTE

This example command specifies the subdirectory 1 because OpenVINO requires numbered subdirectories for model versioning. If you are not using OpenVINO, you do not need to create the 1 subdirectory to use OCI container images.

  3. Download the model and support files:
     DOWNLOAD_URL=https://github.com/onnx/models/raw/main/validated/vision/classification/mobilenet/model/mobilenetv2-7.onnx
     curl -L $DOWNLOAD_URL -O --output-dir models/1/
  4. Use the tree command to confirm that the model files are located in the directory structure as expected:
     tree
     The tree command should return a directory structure similar to the following example:
     .
     ├── Containerfile
     └── models
         └── 1
             └── mobilenetv2-7.onnx
  5. Create a Docker file named Containerfile:

NOTE

Specify a base image that provides a shell. In the following example, ubi9-micro is the base container image. You cannot specify an empty image that does not provide a shell, such as scratch, because KServe uses the shell to ensure the model files are accessible to the model server.

Change the ownership of the copied model files and grant read permissions to the root group to ensure that the model server can access the files. OpenShift runs containers with a random user ID and the root group ID.

     FROM registry.access.redhat.com/ubi9/ubi-micro:latest
     COPY --chown=0:0 models /models
     RUN chmod -R a=rX /models

     # nobody user
     USER 65534

  6. Use podman build commands to create the OCI container image and upload it to a registry. The following commands use Quay as the registry.

NOTE

If your repository is private, ensure that you are authenticated to the registry before uploading your container image.
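The build context that steps 1 through 5 describe can be sketched as a single script. This is an illustration only: it assumes a Bash shell, replaces the model download with an empty placeholder file so that it runs without network access, and uses find in place of tree because tree is not installed everywhere.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory, as in step 1 of the procedure.
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"

# OpenVINO expects a numbered version subdirectory (here: 1).
mkdir -p models/1

# Placeholder standing in for the downloaded model (assumption: the real
# procedure fetches mobilenetv2-7.onnx with curl instead).
: > models/1/mobilenetv2-7.onnx

# Write the Containerfile shown in the procedure.
cat > Containerfile <<'EOF'
FROM registry.access.redhat.com/ubi9/ubi-micro:latest
COPY --chown=0:0 models /models
RUN chmod -R a=rX /models

# nobody user
USER 65534
EOF

# List the files that make up the build context.
find . -type f | sort
```

With a real model file in place of the placeholder, this directory is the context from which the image is built and pushed in step 6.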

Using JupyterLab:
     a. Click the Upload Files icon in the file browser toolbar above the folder listing.
     b. In the file selection dialog, navigate to and select the model files from your local computer. Click Open.
     c. Wait for the upload progress bars next to the filenames to complete.

Using code-server:
     a. Drag the model files directly from your local file explorer and drop them into the file browser pane in the target folder within code-server.

  1. Wait for the upload process to complete.

Verification

Confirm that your files appear in the file browser at the path where you uploaded them.

Next steps

When you follow the procedure to deploy a model, you can access the model files from the specified path within your PVC:

  1. In the Deploy model dialog, select Existing cluster storage under the Source model location section.
  2. From the Cluster storage list, select the PVC associated with your workbench.
  3. In the Model path field, enter the path to your model or the folder containing your model.

CHAPTER 2. DEPLOYING MODELS ON THE SINGLE-MODEL SERVING PLATFORM

The single-model serving platform deploys each model from its own dedicated model server. This architecture is ideal for deploying, monitoring, scaling, and maintaining large models that require more resources, such as large language models (LLMs).

The platform is based on the KServe component and offers two deployment modes:

  - KServe RawDeployment: Uses a standard deployment method that does not require serverless dependencies.
  - Knative Serverless: Uses Red Hat OpenShift Serverless for deployments that can automatically scale based on demand.

2.1. ABOUT KSERVE DEPLOYMENT MODES

KServe offers two deployment modes for serving models. The default mode, Knative Serverless, is based on the open-source Knative project and provides powerful autoscaling capabilities. It integrates with Red Hat OpenShift Serverless and Red Hat OpenShift Service Mesh. Alternatively, the KServe RawDeployment mode offers a more traditional deployment method with fewer dependencies.

Before you choose an option, understand how your initial configuration affects future deployments:

  - If you configure for Knative Serverless: You can use both Knative Serverless and KServe RawDeployment modes.
  - If you configure for KServe RawDeployment only: You can only use the KServe RawDeployment mode.

Use the following comparison to choose the option that best fits your requirements.

Table 2.1. Comparison of deployment modes

| Criterion | Knative Serverless | KServe RawDeployment |
|---|---|---|
| Default mode | Yes | No |
| Recommended use case | Most workloads. | Custom serving setups or models that must remain active. |
| Autoscaling | Scales up automatically based on request volume. Supports scaling down to zero when idle to save costs. | No built-in autoscaling; you can configure Kubernetes Event-Driven Autoscaling (KEDA) or Horizontal Pod Autoscaler (HPA) on your deployment. Does not support scaling to zero by default, which might result in higher costs during periods of low traffic. |

registry or a persistent volume claim (PVC) and have added a connection to your data science project. For more information about adding a connection, see Adding a connection to your data science project. If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.

Runtime-specific prerequisites

Meet the requirements for the specific runtime you intend to use.

Caikit-TGIS runtime

To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.

vLLM NVIDIA GPU ServingRuntime for KServe

To use the vLLM NVIDIA GPU ServingRuntime for KServe runtime, you have enabled GPU support in OpenShift AI and have installed and configured the Node Feature Discovery Operator on your cluster. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.

vLLM CPU ServingRuntime for KServe

To use the vLLM runtime on IBM Z and IBM Power, use the vLLM CPU ServingRuntime for KServe. You cannot use GPU accelerators with IBM Z and IBM Power architectures. For more information, see Red Hat OpenShift Multi Architecture Component Availability Matrix.

vLLM Intel Gaudi Accelerator ServingRuntime for KServe

To use the vLLM Intel Gaudi Accelerator ServingRuntime for KServe runtime, you have enabled support for hybrid processing units (HPUs) in OpenShift AI. This includes installing the Intel Gaudi Base Operator and configuring a hardware profile. For more information, see Intel Gaudi Base Operator OpenShift installation in the Intel Gaudi documentation and Working with hardware profiles.

vLLM AMD GPU ServingRuntime for KServe

To use the vLLM AMD GPU ServingRuntime for KServe runtime, you have enabled support for AMD graphics processing units (GPUs) in OpenShift AI. This includes installing the AMD GPU operator and configuring a hardware profile. For more information, see Deploying the AMD GPU operator on OpenShift and Working with hardware profiles.

vLLM Spyre AI Accelerator ServingRuntime for KServe

IMPORTANT

Support for IBM Spyre AI Accelerators on x86 is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

To use the vLLM Spyre AI Accelerator ServingRuntime for KServe runtime on x86, you have installed the Spyre Operator and configured a hardware profile. For more information, see Spyre operator image and Working with hardware profiles.

Procedure

  1. In the left menu, click Data science projects.
  2. Click the name of the project that you want to deploy a model in. A project details page opens.
  3. Click the Models tab.
  4. Click Select single-model to deploy your model using single-model serving.
  5. Click the Deploy model button. The Deploy model dialog opens.
  6. In the Model deployment name field, enter a unique name for the model that you are deploying.
  7. In the Serving runtime field, select an enabled runtime. If project-scoped runtimes exist, the Serving runtime list includes subheadings to distinguish between global runtimes and project-scoped runtimes.
  8. From the Model framework (name - version) list, select a value if applicable.
  9. From the Deployment mode list, select KServe RawDeployment or Knative Serverless. For more information about deployment modes, see About KServe deployment modes.
  10. In the Number of model server replicas to deploy field, specify a value.
  11. The following options are only available if you have created a hardware profile:
     a. From the Hardware profile list, select a hardware profile. If project-scoped hardware profiles exist, the Hardware profile list includes subheadings to distinguish between global hardware profiles and project-scoped hardware profiles.

IMPORTANT


NOTE

If you are deploying a registered model version with an existing S3, URI, or OCI data connection, some of your connection details might be autofilled. This depends on the type of data connection and the number of matching connections available in your data science project. For example, if only one matching connection exists, fields like the path, URI, endpoint, model URI, bucket, and region might populate automatically. Matching connections will be labeled as Recommended.

     c. Complete the connection detail fields.
     d. Optional: If you have uploaded model files to a persistent volume claim (PVC) and the PVC is attached to your workbench, use the Existing cluster storage option to select the PVC and specify the path to the model file.

IMPORTANT

If your connection type is an S3-compatible object storage, you must provide the folder path that contains your data file. The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.

  1. (Optional) Customize the runtime parameters in the Configuration parameters section:
     a. Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
     b. Modify the values in Additional environment variables to define variables in the model's environment.
     The Configuration parameters section shows predefined serving runtime parameters, if any are available.

NOTE

Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.

  1. Click Deploy.

Verification

Confirm that the deployed model is shown on the Models tab for the project, and on the Model deployments page of the dashboard with a checkmark in the Status column.

Additional resources

Model-serving runtimes for accelerators

2.3. DEPLOYING A MODEL STORED IN AN OCI IMAGE BY USING THE CLI

You can deploy a model that is stored in an OCI image from the command line interface. The following procedure uses the example of deploying a MobileNet v2-7 model in ONNX format, stored in an OCI image on an OpenVINO model server.

NOTE

By default in KServe, models are exposed outside the cluster and not protected with authentication.

Prerequisites

  - You have stored a model in an OCI image as described in Storing a model in an OCI image.
  - If you want to deploy a model that is stored in a private OCI repository, you must configure an image pull secret. For more information about creating an image pull secret, see Using image pull secrets.
  - You are logged in to your OpenShift cluster.

Procedure

  1. Create a project to deploy the model:
     oc new-project oci-model-example
  2. Use the OpenShift AI Applications project kserve-ovms template to create a ServingRuntime resource and configure the OpenVINO model server in the new project:
     oc process -n redhat-ods-applications -o yaml kserve-ovms | oc apply -f -
  3. Verify that the ServingRuntime named kserve-ovms is created:
     oc get servingruntimes
     The command should return output similar to the following:
     NAME          DISABLED   MODELTYPE     CONTAINERS         AGE
     kserve-ovms              openvino_ir   kserve-container   1m
  4. Create an InferenceService YAML resource, depending on whether the model is stored in a private or a public OCI repository:
     For a model stored in a public OCI repository, create an InferenceService YAML file with the following values, replacing the placeholder values with values specific to your environment:
     apiVersion: serving.kserve.io/v1beta1
     kind: InferenceService
     metadata:
       name: sample-isvc-using-oci
     spec:
       predictor:
         model:
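The manifest in this excerpt breaks off after the model key. As a rough sketch only, the following script writes out what a complete resource for the public-repository case might look like. The runtime, modelFormat, and storageUri values are assumptions inferred from the earlier steps (the kserve-ovms runtime and the ONNX example model), and the image reference is a hypothetical placeholder; none of these values are recovered from the original text.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical image reference; replace with your own registry path.
IMAGE="quay.io/example-user/example-repo:example-tag"
MANIFEST="$(mktemp)"

# Render the InferenceService manifest to a temporary file.
cat > "$MANIFEST" <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-oci
spec:
  predictor:
    model:
      runtime: kserve-ovms        # ServingRuntime created in step 2
      modelFormat:
        name: onnx                # assumption: matches the ONNX example model
      storageUri: oci://${IMAGE}  # assumption: OCI image that holds /models
EOF

cat "$MANIFEST"

# To create the resource on a cluster that you are logged in to:
#   oc apply -f "$MANIFEST"
```

Rendering the file first and applying it as a separate step makes it easy to review the image reference before anything is created in the project.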

The command should return output that includes information, such as the URL of the deployed model and its readiness state.

2.4. DEPLOYING MODELS BY USING DISTRIBUTED INFERENCE WITH LLM-D

IMPORTANT

Distributed Inference with llm-d is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Distributed Inference with llm-d is a Kubernetes-native, open-source framework designed for serving large language models (LLMs) at scale. You can use Distributed Inference with llm-d to simplify the deployment of generative AI, focusing on high performance and cost-effectiveness across various hardware accelerators.

Key features of Distributed Inference with llm-d include:

  - Efficiently handles large models using optimizations such as prefix-cache aware routing and disaggregated serving.
  - Integrates into a standard Kubernetes environment, where it leverages specialized components like the Envoy proxy to handle networking and routing, and high-performance libraries such as vLLM and NVIDIA Inference Transfer Library (NIXL).
  - Tested recipes and well-known presets reduce the complexity of deploying inference at scale, so users can focus on building applications rather than managing infrastructure.

Serving models using Distributed Inference with llm-d on Red Hat OpenShift AI consists of the following steps:

  1. Installing OpenShift AI.

NOTE

Because KServe Serverless conflicts with the Gateway API used for Distributed Inference with llm-d, KServe Serverless is not supported on the same cluster. Instead, use KServe RawDeployment.

  2. Enabling the single-model serving platform.
  3. Enabling Distributed Inference with llm-d on a Kubernetes cluster.
  4. Creating an LLMInferenceService Custom Resource (CR).
  5. Deploying a model.

This procedure describes how to create a custom resource (CR) for an LLMInferenceService resource. You replace the default InferenceService with the LLMInferenceService.

Prerequisites

  - You have enabled the single-model serving platform.
  - You have access to an OpenShift cluster running version 4.19.9 or later.
  - OpenShift Service Mesh v2 is not installed in the cluster.
  - You have created a GatewayClass and a Gateway named openshift-ai-inference in the openshift-ingress namespace as described in Gateway API with OpenShift Container Platform Networking.
  - You have installed the LeaderWorkerSet Operator in OpenShift. For more information, see the OpenShift documentation.

Procedure

  1. Log in to the OpenShift console as a cluster administrator.
  2. Create a data science cluster initialization (DSCI) and set the serviceMesh.managementState to Removed, as shown in the following example:
     serviceMesh:
       ...
       managementState: Removed
  3. Create a data science cluster (DSC) with the following information set in kserve and serving:
     kserve:
       defaultDeploymentMode: RawDeployment
       managementState: Managed
       ...
       serving:
         ...
         managementState: Removed
         ...
  4. Create the LLMInferenceService CR with the following information:
     apiVersion: serving.kserve.io/v1alpha1
     kind: LLMInferenceService
     metadata:
       name: sample-llm-inference-service
     spec:
       replicas: 2
       model:
         uri: hf://RedHatAI/Qwen3-8B-FP8-dynamic
         name: RedHatAI/Qwen3-8B-FP8-dynamic
       router:
         route: {}
         gateway: {}
         scheduler: {}
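One way to carry out step 4 from a terminal is to render the CR to a file and apply it with oc. The sketch below is illustrative: the NAME, MODEL, and REPLICAS shell variables are hypothetical helpers added here so that the model and replica count can be changed in one place, and their values are taken from the example CR in the procedure.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Values taken from the example CR; change them for your own model.
NAME="sample-llm-inference-service"
MODEL="RedHatAI/Qwen3-8B-FP8-dynamic"
REPLICAS=2
MANIFEST="$(mktemp)"

# Render the LLMInferenceService manifest to a temporary file.
cat > "$MANIFEST" <<EOF
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: ${NAME}
spec:
  replicas: ${REPLICAS}
  model:
    uri: hf://${MODEL}
    name: ${MODEL}
  router:
    route: {}
    gateway: {}
    scheduler: {}
EOF

cat "$MANIFEST"

# To create the resource in the current project of a cluster that you
# are logged in to:
#   oc apply -f "$MANIFEST"
```

The script only prints the manifest; the oc apply line is left as a comment so that nothing is created on a cluster until the rendered file has been reviewed.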