



















































































Table of Contents

CHAPTER 1. STORING MODELS
  1.1. USING OCI CONTAINERS FOR MODEL STORAGE
  1.2. STORING A MODEL IN AN OCI IMAGE
  1.3. UPLOADING MODEL FILES TO A PERSISTENT VOLUME CLAIM (PVC)
CHAPTER 2. DEPLOYING MODELS ON THE SINGLE-MODEL SERVING PLATFORM
  2.1. ABOUT KSERVE DEPLOYMENT MODES
  2.2. DEPLOYING MODELS ON THE SINGLE-MODEL SERVING PLATFORM
  2.3. DEPLOYING A MODEL STORED IN AN OCI IMAGE BY USING THE CLI
  2.4. DEPLOYING MODELS BY USING DISTRIBUTED INFERENCE WITH LLM-D
    2.4.1. Example usage for Distributed Inference with llm-d
      2.4.1.1. Single-node GPU deployment
      2.4.1.2. Multi-node deployment
      2.4.1.3. Intelligent inference scheduler with KV cache routing
  2.5. MONITORING MODELS ON THE SINGLE-MODEL SERVING PLATFORM
    2.5.1. Viewing performance metrics for a deployed model
    2.5.2. Viewing model-serving runtime metrics for the single-model serving platform
CHAPTER 3. DEPLOYING MODELS ON THE NVIDIA NIM MODEL SERVING PLATFORM
  3.1. DEPLOYING MODELS ON THE NVIDIA NIM MODEL SERVING PLATFORM
  3.2. VIEWING NVIDIA NIM METRICS FOR A NIM MODEL
  3.3. VIEWING PERFORMANCE METRICS FOR A NIM MODEL
CHAPTER 4. DEPLOYING MODELS ON THE MULTI-MODEL SERVING PLATFORM
  4.1. ADDING A MODEL SERVER FOR THE MULTI-MODEL SERVING PLATFORM
  4.2. DELETING A MODEL SERVER
  4.3. DEPLOYING A MODEL BY USING THE MULTI-MODEL SERVING PLATFORM
  4.4. VIEWING A DEPLOYED MODEL
  4.5. UPDATING THE DEPLOYMENT PROPERTIES OF A DEPLOYED MODEL
  4.6. DELETING A DEPLOYED MODEL
  4.7. CONFIGURING MONITORING FOR THE MULTI-MODEL SERVING PLATFORM
  4.8. VIEWING MODEL-SERVING RUNTIME METRICS FOR THE MULTI-MODEL SERVING PLATFORM
  4.9. VIEWING PERFORMANCE METRICS FOR ALL MODELS ON A MODEL SERVER
  4.10. VIEWING HTTP REQUEST METRICS FOR A DEPLOYED MODEL
CHAPTER 5. MAKING INFERENCE REQUESTS TO DEPLOYED MODELS
  5.1. ACCESSING THE AUTHENTICATION TOKEN FOR A DEPLOYED MODEL
  5.2. ACCESSING THE INFERENCE ENDPOINT FOR A DEPLOYED MODEL
  5.3. MAKING INFERENCE REQUESTS TO MODELS DEPLOYED ON THE SINGLE-MODEL SERVING PLATFORM
  5.4. INFERENCE ENDPOINTS
    5.4.1. Caikit TGIS ServingRuntime for KServe
    5.4.2. Caikit Standalone ServingRuntime for KServe
    5.4.3. TGIS Standalone ServingRuntime for KServe
    5.4.4. OpenVINO Model Server
    5.4.5. vLLM NVIDIA GPU ServingRuntime for KServe
    5.4.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServe
    5.4.7. vLLM AMD GPU ServingRuntime for KServe
    5.4.8. vLLM Spyre AI Accelerator ServingRuntime for KServe
    5.4.9. NVIDIA Triton Inference Server
    5.4.10. Seldon MLServer
    5.4.11. Additional resources
CHAPTER 1. STORING MODELS

You must store your model before you can deploy it. You can store a model in an S3 bucket, at a URI, or in Open Container Initiative (OCI) containers.

1.1. USING OCI CONTAINERS FOR MODEL STORAGE

As an alternative to storing a model in an S3 bucket or URI, you can upload models to Open Container Initiative (OCI) containers. Deploying models from OCI containers is also known as modelcars in KServe.

Using OCI containers for model storage can help you:
- Reduce startup times by avoiding downloading the same model multiple times.
- Reduce disk space usage by reducing the number of models downloaded locally.
- Improve model performance by allowing pre-fetched images.

Using OCI containers for model storage involves the following tasks:
- Storing a model in an OCI image.
- Deploying a model from an OCI image by using either the user interface or the command line interface. To deploy a model by using:
  - The user interface, see Deploying models on the single-model serving platform.
  - The command line interface, see Deploying a model stored in an OCI image by using the CLI.

1.2. STORING A MODEL IN AN OCI IMAGE

You can store a model in an OCI image. The following procedure uses the example of storing a MobileNet v2-7 model in ONNX format.

Prerequisites
- You have a model in the ONNX format. The example in this procedure uses the MobileNet v2-7 model in ONNX format.
- You have installed the Podman tool.

Procedure
This example command specifies the subdirectory 1 because OpenVINO requires numbered subdirectories for model versioning. If you are not using OpenVINO, you do not need to create the 1 subdirectory to use OCI container images.
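The numbered-subdirectory layout described above can be sketched in Python. This is a minimal illustration, not part of the documented procedure: the `stage_model` helper, the `download` staging directory, and the `mobilenetv2-7.onnx` file name are all hypothetical, and a placeholder file stands in for a real model.

```python
import shutil
import tempfile
from pathlib import Path


def stage_model(model_file: Path, staging_dir: Path, version: int = 1) -> Path:
    """Copy a model file into <staging_dir>/models/<version>/ and return the new path."""
    if version < 1:
        raise ValueError("version must be a positive integer")
    # OpenVINO expects each model version in its own numbered subdirectory.
    target_dir = staging_dir / "models" / str(version)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(model_file, target_dir))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        # Placeholder file standing in for a real ONNX model (hypothetical name).
        source = work / "mobilenetv2-7.onnx"
        source.write_bytes(b"placeholder")
        staged = stage_model(source, work / "download")
        print(staged.relative_to(work))
```

For a runtime that does not need version directories, the same files could be copied directly under `models/` instead.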
Specify a base image that provides a shell. In the following example, ubi9-micro is the base container image. You cannot specify an empty image that does not provide a shell, such as scratch, because KServe uses the shell to ensure the model files are accessible to the model server.

Change the ownership of the copied model files and grant read permissions to the root group to ensure that the model server can access the files. OpenShift runs containers with a random user ID and the root group ID.

    FROM registry.access.redhat.com/ubi9/ubi-micro:latest
    COPY --chown=0:0 models /models
    RUN chmod -R a=rX /models
    USER 65534
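To make the effect of `chmod -R a=rX` concrete, the following Python sketch models which permission bits the symbolic mode leaves on a file or directory. It is a simplified illustration: the function name `apply_a_equals_rX` is made up for this example, and only the nine rwx bits are modelled (setuid, setgid, and sticky bits are ignored).

```python
import stat


def apply_a_equals_rX(mode: int, is_dir: bool) -> int:
    """Return the rwx bits that `chmod a=rX` would leave on an entry.

    Only the nine rwx bits are modelled; special bits are ignored.
    """
    # "a=r": read for user, group, and other; all other bits are cleared.
    perms = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
    any_exec = mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    if is_dir or any_exec:
        # Capital X: add execute/search only for directories
        # or for files that already had an execute bit set.
        perms |= stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    return perms


if __name__ == "__main__":
    for label, mode, is_dir in [
        ("model file, 0o640", 0o640, False),
        ("script, 0o750", 0o750, False),
        ("directory, 0o700", 0o700, True),
    ]:
        print(f"{label}: {oct(apply_a_equals_rX(mode, is_dir))}")
```

Because every entry ends up readable by all users and every directory searchable, a container process running with a random user ID and the root group ID can still traverse `/models` and read the model files.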
If your repository is private, ensure that you are authenticated to the registry before uploading your container image.

Red Hat OpenShift AI Self-Managed 2.25 Deploying models
Using JupyterLab:
a. Click the Upload Files icon in the file browser toolbar above the folder listing.
b. In the file selection dialog, navigate to and select the model files from your local computer. Click Open.
c. Wait for the upload progress bars next to the filenames to complete.

Using code-server:
a. Drag the model files directly from your local file explorer and drop them into the file browser pane in the target folder within code-server.
When you follow the procedure to deploy a model, you can access the model files from the specified path within your PVC:
CHAPTER 2. DEPLOYING MODELS ON THE SINGLE-MODEL SERVING PLATFORM

The single-model serving platform deploys each model from its own dedicated model server. This architecture is ideal for deploying, monitoring, scaling, and maintaining large models that require more resources, such as large language models (LLMs).

The platform is based on the KServe component and offers two deployment modes:
- KServe RawDeployment: Uses a standard deployment method that does not require serverless dependencies.
- Knative Serverless: Uses Red Hat OpenShift Serverless for deployments that can automatically scale based on demand.

2.1. ABOUT KSERVE DEPLOYMENT MODES

KServe offers two deployment modes for serving models. The default mode, Knative Serverless, is based on the open-source Knative project and provides powerful autoscaling capabilities. It integrates with Red Hat OpenShift Serverless and Red Hat OpenShift Service Mesh. Alternatively, the KServe RawDeployment mode offers a more traditional deployment method with fewer dependencies.

Before you choose an option, understand how your initial configuration affects future deployments:
- If you configure for Knative Serverless: You can use both Knative Serverless and KServe RawDeployment modes.
- If you configure for KServe RawDeployment only: You can only use the KServe RawDeployment mode.

Use the following comparison to choose the option that best fits your requirements.

Table 2.1. Comparison of deployment modes

Criterion: Default mode
    Knative Serverless: Yes
    KServe RawDeployment: No

Criterion: Recommended use case
    Knative Serverless: Most workloads.
    KServe RawDeployment: Custom serving setups or models that must remain active.

Criterion: Autoscaling
    Knative Serverless: Scales up automatically based on request volume. Supports scaling down to zero when idle to save costs.
    KServe RawDeployment: No built-in autoscaling; you can configure Kubernetes Event-Driven Autoscaling (KEDA) or Horizontal Pod Autoscaler (HPA) on your deployment. Does not support scaling to zero by default, which might result in higher costs during periods of low traffic.
registry or a persistent volume claim (PVC) and have added a connection to your data science project. For more information about adding a connection, see Adding a connection to your data science project. If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.
Meet the requirements for the specific runtime you intend to use.

Caikit-TGIS runtime
To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see Converting Hugging Face Hub models to Caikit format in the caikit-tgis-serving repository.

vLLM NVIDIA GPU ServingRuntime for KServe
To use the vLLM NVIDIA GPU ServingRuntime for KServe runtime, you have enabled GPU support in OpenShift AI and have installed and configured the Node Feature Discovery Operator on your cluster. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.

vLLM CPU ServingRuntime for KServe
To use the vLLM runtime on IBM Z and IBM Power, use the vLLM CPU ServingRuntime for KServe. You cannot use GPU accelerators with IBM Z and IBM Power architectures. For more information, see Red Hat OpenShift Multi Architecture Component Availability Matrix.

vLLM Intel Gaudi Accelerator ServingRuntime for KServe
To use the vLLM Intel Gaudi Accelerator ServingRuntime for KServe runtime, you have enabled support for hybrid processing units (HPUs) in OpenShift AI. This includes installing the Intel Gaudi Base Operator and configuring a hardware profile. For more information, see Intel Gaudi Base Operator OpenShift installation in the Intel Gaudi documentation and Working with hardware profiles.

vLLM AMD GPU ServingRuntime for KServe
To use the vLLM AMD GPU ServingRuntime for KServe runtime, you have enabled support for AMD graphics processing units (GPUs) in OpenShift AI. This includes installing the AMD GPU operator and configuring a hardware profile. For more information, see Deploying the AMD GPU operator on OpenShift and Working with hardware profiles.

vLLM Spyre AI Accelerator ServingRuntime for KServe
Support for IBM Spyre AI Accelerators on x86 is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

To use the vLLM Spyre AI Accelerator ServingRuntime for KServe runtime on x86, you have installed the Spyre Operator and configured a hardware profile. For more information, see Spyre operator image and Working with hardware profiles.

Procedure
If you are deploying a registered model version with an existing S3, URI, or OCI data connection, some of your connection details might be autofilled. This depends on the type of data connection and the number of matching connections available in your data science project. For example, if only one matching connection exists, fields like the path, URI, endpoint, model URI, bucket, and region might populate automatically. Matching connections are labeled as Recommended.

c. Complete the connection detail fields.
d. Optional: If you have uploaded model files to a persistent volume claim (PVC) and the PVC is attached to your workbench, use the Existing cluster storage option to select the PVC and specify the path to the model file.
If your connection type is an S3-compatible object storage, you must provide the folder path that contains your data file. The OpenVINO Model Server runtime has specific requirements for how you specify the model path. For more information, see known issue RHOAIENG-3025 in the OpenShift AI release notes.
Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.
You can deploy a model that is stored in an OCI image from the command line interface. The following procedure uses the example of deploying a MobileNet v2-7 model in ONNX format, stored in an OCI image on an OpenVINO model server.
By default in KServe, models are exposed outside the cluster and not protected with authentication.

Prerequisites
- You have stored a model in an OCI image as described in Storing a model in an OCI image.
- If you want to deploy a model that is stored in a private OCI repository, you must configure an image pull secret. For more information about creating an image pull secret, see Using image pull secrets.
- You are logged in to your OpenShift cluster.

Procedure
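As a rough sketch of the kind of resource this deployment relies on, the following Python snippet builds a KServe InferenceService manifest whose `storageUri` uses the `oci://` scheme and prints it as JSON. The resource name, image reference, and the `kserve-ovms` runtime name are placeholder assumptions for illustration, and the optional `serving.kserve.io/deploymentMode` annotation is included only to show where a deployment mode could be selected. Adapt all values to your cluster.

```python
import json
from typing import Optional


def build_inference_service(
    name: str,
    image_ref: str,
    runtime: str,
    model_format: str,
    deployment_mode: Optional[str] = None,
) -> dict:
    """Build a KServe InferenceService manifest whose model is pulled from an OCI image."""
    metadata: dict = {"name": name}
    if deployment_mode is not None:
        # KServe reads the deployment mode from this annotation.
        metadata["annotations"] = {"serving.kserve.io/deploymentMode": deployment_mode}
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": metadata,
        "spec": {
            "predictor": {
                "model": {
                    "runtime": runtime,
                    "modelFormat": {"name": model_format},
                    # The oci:// scheme points KServe at a container image.
                    "storageUri": f"oci://{image_ref}",
                }
            }
        },
    }


if __name__ == "__main__":
    manifest = build_inference_service(
        name="sample-isvc-using-oci",             # hypothetical resource name
        image_ref="quay.io/example/mobilenet:1",  # hypothetical image reference
        runtime="kserve-ovms",                    # assumed ServingRuntime name
        model_format="onnx",
        deployment_mode="RawDeployment",
    )
    print(json.dumps(manifest, indent=2))
```

Because `oc apply` accepts JSON as well as YAML, the printed manifest could be saved to a file and applied with `oc apply -f <file>` in the target project.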
The command should return output that includes information, such as the URL of the deployed model and its readiness state.

2.4. DEPLOYING MODELS BY USING DISTRIBUTED INFERENCE WITH LLM-D
Distributed Inference with llm-d is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Distributed Inference with llm-d is a Kubernetes-native, open-source framework designed for serving large language models (LLMs) at scale. You can use Distributed Inference with llm-d to simplify the deployment of generative AI, focusing on high performance and cost-effectiveness across various hardware accelerators.

Key features of Distributed Inference with llm-d include:
- Efficiently handles large models using optimizations such as prefix-cache aware routing and disaggregated serving.
- Integrates into a standard Kubernetes environment, where it leverages specialized components like the Envoy proxy to handle networking and routing, and high-performance libraries such as vLLM and NVIDIA Inference Transfer Library (NIXL).
- Tested recipes and well-known presets reduce the complexity of deploying inference at scale, so users can focus on building applications rather than managing infrastructure.

Serving models using Distributed Inference with llm-d on Red Hat OpenShift AI consists of the following steps:
Because KServe Serverless conflicts with the Gateway API used for Distributed Inference with llm-d, KServe Serverless is not supported on the same cluster. Instead, use KServe RawDeployment.
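The idea behind the prefix-cache aware routing listed among the key features can be illustrated with a toy Python sketch: send each request to the replica whose cached prompts share the longest token prefix with the new prompt, so that more of the existing cache can be reused. This is a conceptual illustration only, not the llm-d scheduler. The function names and the replica names are hypothetical, and a real scheduler would also weigh factors such as load.

```python
from typing import Dict, List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading tokens two token sequences have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def pick_replica(prompt: Sequence[int], cached: Dict[str, List[Sequence[int]]]) -> str:
    """Return the replica whose cached prompts share the longest prefix with `prompt`.

    Ties go to the replica name that sorts first, so the result is deterministic.
    Raises ValueError if `cached` contains no replicas.
    """
    def best_match(replica: str) -> int:
        return max((shared_prefix_len(prompt, c) for c in cached[replica]), default=0)

    return min(cached, key=lambda replica: (-best_match(replica), replica))


if __name__ == "__main__":
    # Hypothetical replicas and the token sequences each one has already served.
    cache_state = {
        "pod-a": [[1, 2, 3, 4]],
        "pod-b": [[1, 2, 9]],
        "pod-c": [],
    }
    print(pick_replica([1, 2, 3, 7], cache_state))
```

In this toy setup the prompt `[1, 2, 3, 7]` shares three leading tokens with a sequence cached on `pod-a` and only two with `pod-b`, so `pod-a` is the preferred target.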
This procedure describes how to create a custom resource (CR) for an LLMInferenceService resource. You replace the default InferenceService with the LLMInferenceService.

Prerequisites
- You have enabled the single-model serving platform.
- You have access to an OpenShift cluster running version 4.19.9 or later.
- OpenShift Service Mesh v2 is not installed in the cluster.
- You have created a GatewayClass and a Gateway named openshift-ai-inference in the openshift-ingress namespace as described in Gateway API with OpenShift Container Platform Networking.
- You have installed the LeaderWorkerSet Operator in OpenShift. For more information, see the OpenShift documentation.

Procedure