Serving Large Language Models (LLMs)
Red Hat OpenShift AI provides a comprehensive solution for deploying and serving Large Language Models (LLMs) at scale. Using the KServe infrastructure, you can deploy models from various storage sources including pre-built ModelCars (OCI registries), S3-compatible storage, and Persistent Volume Claims (PVCs).
Quick Start: This guide begins with a complete working example that deploys a Granite 3.1 8B model on a single GPU with external access. After showing this full example, we explain each component in detail and provide additional deployment patterns.
Overview
A complete model serving deployment in OpenShift AI consists of two key resources working together:
- ServingRuntime: Defines HOW models are served
- Specifies the serving container (e.g., vLLM, Triton, MLServer)
- Configures runtime parameters and environment
- Sets hardware acceleration annotations and model serving arguments
- InferenceService: Defines WHAT model to serve
- References the ServingRuntime to use
- Specifies the model location (storageUri or storage)
- Controls deployment parameters (replicas, resources, tolerations)
- Manages the model lifecycle
- When labeled with `networking.kserve.io/visibility: exposed`, automatically creates an OpenShift Route for external access
These two resources must be created together to deploy a functional model serving endpoint; together they create the pods, services, routes, and other Kubernetes resources needed to serve your model. Important: the ServingRuntime and InferenceService names MUST match exactly so the OpenShift AI dashboard can display and manage the deployment (see the naming convention below).
Important: Naming Convention
The ServingRuntime and InferenceService names MUST be identical. This is a critical requirement for the OpenShift AI dashboard to properly:
- Display the model deployment
- Show serving runtime details
- Enable management features
- Link the resources correctly
For example, if your InferenceService is named `granite-model`, then your ServingRuntime must also be named `granite-model`. This naming convention is followed throughout all examples in this guide.
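As a quick sanity check before deploying, you can confirm that both resources resolve to the same name. A minimal sketch, assuming both resources already exist in your current namespace and are named granite-model:
# Look up both resources by the shared name; a NotFound error indicates a mismatch
kubectl get servingruntime,inferenceservice granite-model -o name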
Template Variables in ServingRuntime
OpenShift AI provides template variables that get replaced at runtime with values from the InferenceService:
- `{{.Name}}` - The name of the InferenceService
- Common usage: `--served-model-name={{.Name}}` in vLLM args
This ensures the model server uses the same name as the InferenceService, maintaining consistency across the deployment.
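To confirm the substitution after deployment, you can inspect the rendered arguments on the running predictor pod. A small sketch, assuming an InferenceService named granite-model:
# The template variable should appear as the literal InferenceService name in the pod args
kubectl get pod -l serving.kserve.io/inferenceservice=granite-model \
  -o jsonpath='{.items[0].spec.containers[0].args}'
# Expected to include: --served-model-name=granite-model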
Prerequisites
Before deploying models with ModelCars, ensure you have:
- OpenShift AI cluster with appropriate permissions
- GPU nodes (if using GPU-accelerated models)
- CLI tools: `kubectl` or `oc` installed and configured
- Namespace access: Create or use an existing namespace for deployments
- NVIDIA GPU operator installed (for GPU workloads)
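A quick way to verify these prerequisites from the command line (a sketch; the namespace names are examples and may differ in your cluster):
# CLI tools available
kubectl version --client
oc version --client 2>/dev/null || true   # optional, if you use oc
# Target namespace exists (create it if needed)
kubectl get namespace my-namespace || kubectl create namespace my-namespace
# NVIDIA GPU operator pods are running (GPU workloads only)
kubectl get pods -n nvidia-gpu-operator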
Checking Available AcceleratorProfiles
AcceleratorProfiles define GPU configurations available in your cluster. While optional, understanding available profiles helps configure proper GPU resource requests and node tolerations.
List AcceleratorProfiles
# List all accelerator profiles
kubectl get acceleratorprofiles -n redhat-ods-applications
# Example output:
NAME           DISPLAY NAME   ENABLED   IDENTIFIER       TOLERATIONS
migrated-gpu   NVIDIA GPU     true      nvidia.com/gpu   1 toleration(s)
nvidia-a100    NVIDIA A100    true      nvidia.com/gpu   1 toleration(s)
View Specific AcceleratorProfile
# Get detailed profile information
kubectl get acceleratorprofile migrated-gpu -n redhat-ods-applications -o yaml
# Or use describe for a summary
kubectl describe acceleratorprofile migrated-gpu -n redhat-ods-applications
Example AcceleratorProfile:
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
name: migrated-gpu
namespace: redhat-ods-applications
spec:
displayName: "NVIDIA GPU"
enabled: true
identifier: nvidia.com/gpu # GPU resource identifier
tolerations: # Node scheduling tolerations
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
Extract Configuration for Model Deployment
From the AcceleratorProfile, note:
- identifier: Use in `resources.requests` and `resources.limits` (e.g., `nvidia.com/gpu: '1'`)
- tolerations: Copy to your InferenceService spec for proper node scheduling
If no AcceleratorProfiles exist, you can still deploy models by manually specifying GPU resources and tolerations based on your cluster configuration.
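The identifier and tolerations can be pulled directly from the profile and copied into your InferenceService. A minimal sketch, assuming the migrated-gpu profile shown above:
# GPU resource identifier to use in resources.requests/limits
kubectl get acceleratorprofile migrated-gpu -n redhat-ods-applications \
  -o jsonpath='{.spec.identifier}'
# Tolerations to copy into spec.predictor.tolerations
kubectl get acceleratorprofile migrated-gpu -n redhat-ods-applications \
  -o jsonpath='{.spec.tolerations}'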
Storage Options for LLMs
OpenShift AI supports three primary storage methods for serving LLMs. Each method has specific use cases and configuration requirements:
1. ModelCars (OCI Registry)
Pre-built, optimized container images from Red Hat’s registry or other OCI-compliant registries.
When to use:
- Quick deployment of Red Hat validated models
- Models that are pre-optimized and containerized
- When you don’t need to modify model files
Configuration:
spec:
predictor:
model:
storageUri: 'oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5'
See examples:
- Complete Working Example - Full deployment with GPU and external access
- Examples 1-4 - All use ModelCar storage
2. S3-Compatible Storage
Models stored in S3-compatible object storage (AWS S3, MinIO, OpenShift Data Foundation).
When to use:
- Models stored in object storage
- When you need to share models across multiple deployments
- Integration with existing S3 infrastructure
Configuration:
spec:
predictor:
model:
storage:
key: aws-connection-my-storage # Name of your S3 data connection
path: models/granite-7b-instruct/ # Path within the S3 bucket
Prerequisites:
- Create an S3 data connection in your namespace (see Data Connections)
- Ensure the model files are uploaded to the specified S3 path
See example: Example 5: S3 Storage Deployment
3. Persistent Volume Claims (PVC)
Models stored on Kubernetes persistent volumes.
When to use:
- Models already available on shared storage
- When you need ReadWriteMany (RWX) access
- Custom model files or modified versions
Configuration:
spec:
predictor:
model:
storageUri: 'pvc://model-pvc/llama-3.1-8b-instruct' # pvc://<pvc-name>/<model-path>
Prerequisites:
- PVC must exist in the same namespace
- Model files must be copied to the PVC (see Setting Up PVCs for Model Storage)
See example: Example 6: PVC Storage Deployment
Choosing a Storage Method
Storage Type | Best For | Pros | Cons |
---|---|---|---|
ModelCars (OCI) | Production deployments | Pre-optimized, versioned, easy to deploy | Limited to available images |
S3 Storage | Shared models, cloud environments | Centralized storage, easy updates | Requires S3 setup, potential latency |
PVC Storage | Custom models, on-premise | Full control, no external dependencies | Requires PVC management, manual uploads |
Creating Model Deployments
Model deployments require creating both a ServingRuntime and InferenceService. The resources should be created in order: ServingRuntime → InferenceService.
Important for External Access: To expose your model externally, you MUST add the label `networking.kserve.io/visibility: exposed` to your InferenceService. When this label is present, OpenShift AI automatically creates an OpenShift Route for external HTTPS access. Without this label, the service will only be accessible internally within the cluster.
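If you already deployed an InferenceService without the label, you can add it afterwards; the controller should then create the Route. A sketch, assuming an InferenceService named granite-model in my-namespace:
# Add the visibility label to an existing InferenceService
kubectl label inferenceservice granite-model networking.kserve.io/visibility=exposed -n my-namespace
# Confirm the Route appears (may take a moment)
kubectl get route granite-model -n my-namespace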
Complete Working Example
Here’s a complete example that deploys a Granite 3.1 8B model on a single GPU with external access. This example includes all the essential configurations for a production-ready deployment:
1. ServingRuntime (`granite-model-runtime.yaml`):
# granite-model-runtime.yaml - defines HOW to serve the model
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: granite-model # Must match InferenceService name
annotations:
opendatahub.io/accelerator-name: migrated-gpu # GPU accelerator profile
opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]' # Compatible GPU types
labels:
opendatahub.io/dashboard: 'true' # Show in OpenShift AI dashboard
spec:
containers:
- name: kserve-container
image: 'quay.io/modh/vllm:rhoai-2.20-cuda' # vLLM serving image with CUDA support
args:
- '--port=8080' # Server port
- '--model=/mnt/models' # Model mount path
- '--served-model-name={{.Name}}' # Use InferenceService name
- '--max-model-len=4096' # Maximum context length
volumeMounts:
- mountPath: /dev/shm # Shared memory mount for GPU performance
name: shm
volumes:
- name: shm # Shared memory volume for GPU operations
emptyDir:
medium: Memory
sizeLimit: 2Gi
supportedModelFormats:
- name: vLLM
autoSelect: true # Automatically select this runtime for vLLM models
2. InferenceService (`granite-model-service.yaml`):
# granite-model-service.yaml - defines WHAT model to serve
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: granite-model
annotations:
openshift.io/display-name: 'Granite 3.1 8B Instruct' # Human-readable name
labels:
opendatahub.io/dashboard: 'true' # Show in dashboard
networking.kserve.io/visibility: exposed # REQUIRED for external Route
spec:
predictor:
model:
modelFormat:
name: vLLM # Model format (must match runtime)
runtime: granite-model # Reference to ServingRuntime
storageUri: 'oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5' # Model location
resources:
requests:
cpu: '2' # CPU request
memory: 16Gi # Memory request
nvidia.com/gpu: '1' # GPU request
limits:
cpu: '4' # CPU limit
memory: 24Gi # Memory limit
nvidia.com/gpu: '1' # GPU limit (must match runtime)
tolerations: # Node scheduling tolerations
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists # Schedule on GPU nodes
Deploy the complete example:
# 1. Create ServingRuntime first
kubectl apply -f granite-model-runtime.yaml -n my-namespace
# 2. Create InferenceService (route will be created automatically)
kubectl apply -f granite-model-service.yaml -n my-namespace
# 3. Wait for the model to be ready (this may take several minutes)
kubectl wait --for=condition=Ready inferenceservice/granite-model --timeout=600s -n my-namespace
# 4. Get the automatically created route URL
ROUTE_URL=$(kubectl get route granite-model -n my-namespace -o jsonpath='{.spec.host}')
echo "Model endpoint available at: https://$ROUTE_URL/v2/models/granite-model"
# 5. Test the deployment
curl -X POST "https://$ROUTE_URL/v2/models/granite-model/infer" \
-H "Content-Type: application/json" \
-d '{
"inputs": [{"name": "text_input", "shape": [1], "datatype": "BYTES", "data": ["What is machine learning?"]}],
"parameters": {"temperature": 0.7, "max_tokens": 100}
}'
Understanding the Example
This complete example demonstrates:
- GPU Configuration: The model is configured to run on a single NVIDIA GPU with proper resource allocation and node tolerations
- Memory Optimization: Includes shared memory volume for better GPU performance
- Model Parameters: Sets context length and GPU memory utilization for optimal performance
- External Access: The Route provides secure HTTPS access to the model endpoint
- Dashboard Integration: Both resources are visible in the OpenShift AI dashboard
The following sections explain each component and configuration option in detail.
Key Components Explained
The complete example above contains all the essential elements for a production deployment. Let’s break down each component:
GPU and Hardware Acceleration
GPU support requires three key configurations:
- Accelerator Annotations (in ServingRuntime):
annotations:
  opendatahub.io/accelerator-name: migrated-gpu
  opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
- Resource Requests/Limits (in InferenceService only):
resources:
  requests:
    nvidia.com/gpu: '1'
  limits:
    nvidia.com/gpu: '1'
- Node Tolerations (in InferenceService):
tolerations:
- effect: NoSchedule
  key: nvidia.com/gpu
  operator: Exists
See also:
- Example 4: Single Node, Multi GPU - Multi-GPU with tensor parallelism
Memory and Performance Optimization
Shared memory volumes improve GPU performance by allowing efficient inter-process communication:
# In ServingRuntime spec:
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: 2Gi # Increase for larger models
# In ServingRuntime container:
volumeMounts:
- mountPath: /dev/shm
name: shm
Why shared memory matters:
- Enables efficient GPU kernel operations
- Required for tensor parallelism
- Improves model loading and inference speed
See also:
- Example 4: Single Node, Multi GPU - Uses 32Gi shared memory for large models
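To confirm the shared memory volume is mounted with the expected size, you can check /dev/shm inside the running predictor pod. A sketch, assuming the granite-model deployment:
POD=$(kubectl get pod -l serving.kserve.io/inferenceservice=granite-model -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -c kserve-container -- df -h /dev/shm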
Model-Specific Arguments
vLLM arguments are specified in the ServingRuntime container args. Common configurations:
# In ServingRuntime container args:
args:
- '--port=8080' # Required: serving port
- '--model=/mnt/models' # Required: model mount path
- '--served-model-name={{.Name}}' # Use InferenceService name
- '--max-model-len=4096' # Context window size
Common argument patterns:
- Small models (7-8B): `--max-model-len=4096`
- Large models (70B+): `--max-model-len=8192`, `--tensor-parallel-size=4`, `--distributed-executor-backend=mp`
- Production: Consider resource limits and autoscaling
Important: When using `--tensor-parallel-size` greater than 1, you must also include `--distributed-executor-backend=mp`.
See also:
- Example 1: Custom vLLM Arguments - Various configurations
- Example 4: Single Node, Multi GPU - Multi-GPU specific arguments
Monitoring and Observability
Enable Prometheus metrics collection for monitoring:
# In ServingRuntime metadata.annotations:
annotations:
prometheus.io/scrape: 'true'
prometheus.io/path: /metrics
prometheus.io/port: '8080'
# In InferenceService metadata.annotations:
annotations:
serving.kserve.io/enable-prometheus-scraping: 'true'
Metrics available:
- Request latency and throughput
- GPU utilization
- Model loading time
- Token generation speed
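You can verify that metrics are being exported by port-forwarding to the predictor pod and querying the metrics endpoint. A sketch; exact metric names depend on the vLLM version:
POD=$(kubectl get pod -l serving.kserve.io/inferenceservice=granite-model -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward "$POD" 8080:8080 >/dev/null &
sleep 2
curl -s http://localhost:8080/metrics | grep -i vllm | head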
Display Names and Dashboard Integration
Make your deployments easily identifiable:
# In metadata.annotations:
annotations:
openshift.io/display-name: 'Granite 3.1 8B Instruct'
# In metadata.labels:
labels:
opendatahub.io/dashboard: 'true' # Required for dashboard visibility
Dashboard features enabled:
- Model deployment status
- Endpoint URLs
- Resource usage monitoring
- Log access
Minimal Configuration
If you need a minimal deployment with basic GPU support:
Minimal ServingRuntime:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: minimal-model
annotations:
opendatahub.io/accelerator-name: migrated-gpu
labels:
opendatahub.io/dashboard: 'true'
spec:
containers:
- name: kserve-container
image: 'quay.io/modh/vllm:rhoai-2.20-cuda'
args:
- '--port=8080'
- '--model=/mnt/models'
- '--served-model-name={{.Name}}'
supportedModelFormats:
- name: vLLM
autoSelect: true
Minimal InferenceService:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: minimal-model
labels:
opendatahub.io/dashboard: 'true'
spec:
predictor:
model:
modelFormat:
name: vLLM
runtime: minimal-model
storageUri: 'oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5'
resources:
requests:
memory: 16Gi
nvidia.com/gpu: '1'
limits:
memory: 24Gi
nvidia.com/gpu: '1'
tolerations:
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
Note: This minimal configuration includes basic GPU support. For a complete example with external access and optimizations, see the Complete Working Example above.
Runtime Images and Hardware Support
The examples above use `quay.io/modh/vllm:rhoai-2.20-cuda` for NVIDIA GPUs. OpenShift AI provides different runtime images for various architectures (a node-capacity check sketch follows the list):
- NVIDIA GPUs: `quay.io/modh/vllm:rhoai-2.20-cuda`
- AMD GPUs: `quay.io/modh/vllm:rhoai-2.20-rocm`
- Intel Habana: `quay.io/modh/vllm:rhoai-2.20-gaudi`
- CPU only: `quay.io/modh/vllm:rhoai-2.20-cpu` (Note: CPU-only runtimes are primarily provided for ppc64le and s390x architectures)
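To decide which image to use, check which accelerator resources your nodes actually expose. A sketch; the Gaudi resource name (habana.ai/gaudi) is an assumption and columns are blank for nodes without that resource:
kubectl get nodes -o custom-columns='NAME:.metadata.name,NVIDIA:.status.allocatable.nvidia\.com/gpu,AMD:.status.allocatable.amd\.com/gpu,GAUDI:.status.allocatable.habana\.ai/gaudi'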
Example deployment for AMD GPUs:
ServingRuntime (`amd-model-runtime.yaml`):
# amd-model-runtime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: amd-model # Must match InferenceService name
annotations:
# AMD GPU-specific accelerator type
opendatahub.io/recommended-accelerators: '["amd.com/gpu"]' # AMD instead of NVIDIA
labels:
opendatahub.io/dashboard: 'true'
spec:
containers:
- name: kserve-container
# ROCm-specific image for AMD GPU support
image: 'quay.io/modh/vllm:rhoai-2.20-rocm' # ROCm instead of CUDA
args:
- '--port=8080'
- '--model=/mnt/models'
- '--served-model-name={{.Name}}'
command:
- python
- '-m'
- vllm.entrypoints.openai.api_server
supportedModelFormats:
- name: vLLM
autoSelect: true
InferenceService (`amd-model-service.yaml`):
# amd-model-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: amd-model
labels:
opendatahub.io/dashboard: 'true'
spec:
predictor:
model:
modelFormat:
name: vLLM
runtime: amd-model
storageUri: 'oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5'
resources:
requests:
memory: 16Gi
amd.com/gpu: '1' # AMD GPU resource identifier
limits:
memory: 24Gi
amd.com/gpu: '1' # AMD GPU resource identifier
Deploy AMD GPU model:
kubectl apply -f amd-model-runtime.yaml
kubectl apply -f amd-model-service.yaml
Listing and Viewing Model Deployments
List All Model Deployments
List all InferenceServices (the primary resource representing model deployments):
# List all InferenceServices in current namespace
kubectl get inferenceservices
# List with more details
kubectl get inferenceservices -o wide
# List across all namespaces
kubectl get inferenceservices --all-namespaces
# Custom output showing key fields
kubectl get inferenceservices -o custom-columns=\
NAME:.metadata.name,\
RUNTIME:.spec.predictor.model.runtime,\
MODEL:.spec.predictor.model.storageUri,\
READY:.status.conditions[?(@.type=='Ready')].status,\
URL:.status.url
List Associated Resources
# List all ServingRuntimes
kubectl get servingruntimes
# List Routes for external access (automatically created)
kubectl get routes -l serving.kserve.io/inferenceservice
# List all resources for a specific model deployment
MODEL_NAME="granite-model"
kubectl get servingruntime,inferenceservice,route,service,deployment,pod \
-l serving.kserve.io/inferenceservice=$MODEL_NAME
View Specific Model Deployment
# View InferenceService details
kubectl describe inferenceservice granite-model
# Get InferenceService in YAML format
kubectl get inferenceservice granite-model -o yaml
# View ServingRuntime details
kubectl describe servingruntime granite-model
# Check deployment status
kubectl get inferenceservice granite-model -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
Filter Model Deployments
# Filter by label
kubectl get inferenceservices -l environment=production
# List deployments using specific runtime
RUNTIME="granite-runtime"
kubectl get inferenceservices -o json | \
jq --arg runtime "$RUNTIME" '.items[] | select(.spec.predictor.model.runtime == $runtime) | .metadata.name'
# List GPU-enabled deployments
kubectl get inferenceservices -o json | \
jq '.items[] | select(.spec.predictor.model.resources.requests["nvidia.com/gpu"] != null) | .metadata.name'
Updating Model Deployments
Model deployments can be updated to change models, adjust resources, or modify configurations. Updates should be done carefully to minimize downtime.
Update Model Version
# Update to a new model version
kubectl patch inferenceservice granite-model --type='json' -p='[
{"op": "replace", "path": "/spec/predictor/model/storageUri",
"value": "oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.6"}
]'
# Or edit the resource interactively
kubectl edit inferenceservice granite-model
Scale Model Deployment
# Update replica count
kubectl patch inferenceservice granite-model --type='merge' -p='
{
"spec": {
"predictor": {
"minReplicas": 2,
"maxReplicas": 5
}
}
}'
# Enable autoscaling
kubectl patch inferenceservice granite-model --type='merge' -p='
{
"spec": {
"predictor": {
"minReplicas": 1,
"maxReplicas": 10,
"scaleTarget": 80,
"scaleMetric": "cpu"
}
}
}'
Update Resource Allocations
# Increase memory and CPU
kubectl patch inferenceservice granite-model --type='merge' -p='
{
"spec": {
"predictor": {
"model": {
"resources": {
"requests": {
"cpu": "4",
"memory": "32Gi"
},
"limits": {
"cpu": "8",
"memory": "48Gi"
}
}
}
}
}
}'
Update ServingRuntime Configuration
# Update runtime arguments
kubectl patch servingruntime granite-model --type='json' -p='[
{"op": "replace", "path": "/spec/containers/0/args",
"value": ["--port=8080", "--model=/mnt/models", "--tensor-parallel-size=2", "--distributed-executor-backend=mp", "--max-model-len=8192"]}
]'
# Update container image
kubectl patch servingruntime granite-model --type='merge' -p='
{
"spec": {
"containers": [{
"name": "kserve-container",
"image": "quay.io/modh/vllm:rhoai-2.21-cuda"
}]
}
}'
Deleting Model Deployments
When deleting model deployments, remove resources in the reverse order of creation to ensure clean cleanup.
Basic Deletion
# Delete InferenceService (routes are automatically cleaned up)
kubectl delete inferenceservice granite-model
kubectl delete servingruntime granite-model
Delete All Resources for a Model
# Delete all resources with a specific label
MODEL_NAME="granite-model"
kubectl delete inferenceservice,servingruntime \
-l serving.kserve.io/inferenceservice=$MODEL_NAME
# Force deletion if stuck
kubectl delete inferenceservice granite-model --force --grace-period=0
Important Notes on Deletion
⚠️ Warning: Deleting an InferenceService will:
- Remove all associated pods and services
- Delete any in-memory model cache
- Terminate active inference requests
- Remove autoscaling configurations
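If you may need to recreate the deployment later, save the manifests before deleting. A sketch, assuming the granite-model deployment:
kubectl get servingruntime granite-model -o yaml > granite-model-runtime-backup.yaml
kubectl get inferenceservice granite-model -o yaml > granite-model-service-backup.yaml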
Cleanup Verification
# Verify all resources are deleted
kubectl get all -l serving.kserve.io/inferenceservice=granite-model
# Check for lingering PVCs
kubectl get pvc -l serving.kserve.io/inferenceservice=granite-model
# Check for finalizers preventing deletion
kubectl get inferenceservice granite-model -o jsonpath='{.metadata.finalizers}'
Practical Examples
This section provides additional deployment examples for specific use cases, building on the complete working example shown earlier.
Example 1: Custom vLLM Arguments
This example demonstrates various vLLM arguments for different use cases:
ServingRuntime (`custom-args-runtime.yaml`):
# custom-args-runtime.yaml - custom vLLM configurations
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: custom-args-model
annotations:
opendatahub.io/accelerator-name: migrated-gpu
labels:
opendatahub.io/dashboard: 'true'
spec:
containers:
- name: kserve-container
image: 'quay.io/modh/vllm:rhoai-2.20-cuda'
args:
- '--port=8080'
- '--model=/mnt/models'
- '--served-model-name={{.Name}}'
# Custom vLLM arguments for specific use cases
- '--max-model-len=8192' # Larger context window
- '--download-dir=/tmp/models' # Custom download directory
- '--trust-remote-code' # Allow custom model code
- '--disable-log-requests' # Reduce logging overhead
supportedModelFormats:
- name: vLLM
autoSelect: true
Common vLLM argument use cases:
- Larger contexts: `--max-model-len=8192` or higher for long documents
- Custom models: `--trust-remote-code` for models with custom implementations
- Performance tuning: `--disable-log-requests` to reduce overhead
- Storage management: `--download-dir` to specify model cache location (you can verify the applied args with the sketch below)
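After deploying, you can confirm which arguments the model server actually received by inspecting the predictor pod. A sketch, assuming the custom-args-model deployment:
kubectl get pod -l serving.kserve.io/inferenceservice=custom-args-model \
  -o jsonpath='{.items[0].spec.containers[0].args}'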
Example 2: Internal Only (No Route)
Deploy a model that is only accessible within the cluster, without external route exposure:
ServingRuntime (`internal-model-runtime.yaml`):
# internal-model-runtime.yaml - for internal cluster access only
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: internal-model
annotations:
opendatahub.io/accelerator-name: migrated-gpu
labels:
opendatahub.io/dashboard: 'true'
spec:
containers:
- name: kserve-container
image: 'quay.io/modh/vllm:rhoai-2.20-cuda'
args:
- '--port=8080'
- '--model=/mnt/models'
- '--served-model-name={{.Name}}'
- '--max-model-len=4096'
supportedModelFormats:
- name: vLLM
autoSelect: true
InferenceService (`internal-model-service.yaml`):
# internal-model-service.yaml - NO external visibility label
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: internal-model
labels:
opendatahub.io/dashboard: 'true'
# NOTE: No networking.kserve.io/visibility label = internal only
spec:
predictor:
model:
modelFormat:
name: vLLM
runtime: internal-model
storageUri: 'oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5'
resources:
requests:
cpu: '2'
memory: 16Gi
nvidia.com/gpu: '1'
limits:
cpu: '4'
memory: 24Gi
nvidia.com/gpu: '1'
tolerations:
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
Key differences for internal-only deployment:
- No visibility label: Omit `networking.kserve.io/visibility: exposed`
- No Route creation: OpenShift won’t create an external route
- Cluster access only: Model endpoint only accessible from within the cluster
Accessing the internal model:
# From another pod in the cluster
kubectl run test-client --image=curlimages/curl --rm -it -- sh
# Inside the test pod, access the model using the service URL
curl -X POST http://internal-model-predictor.my-namespace.svc.cluster.local/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internal-model",
    "prompt": "What is OpenShift AI?",
    "max_tokens": 100
  }'
Use cases for internal-only models:
- Backend services that don’t need external access
- Models used by other cluster workloads
- Development/testing environments
- Security-sensitive deployments
Example 3: Alternate GPU (AMD)
Deploy models on AMD GPUs using ROCm-enabled images:
ServingRuntime (`amd-gpu-runtime.yaml`):
# amd-gpu-runtime.yaml - for AMD GPU deployment
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: amd-gpu-model
annotations:
opendatahub.io/accelerator-name: amd.com/gpu # AcceleratorProfile reference for AMD GPUs (adjust to a profile name in your cluster)
labels:
opendatahub.io/dashboard: 'true'
spec:
containers:
- name: kserve-container
image: 'quay.io/modh/vllm:rhoai-2.20-rocm' # ROCm-enabled image
args:
- '--port=8080'
- '--model=/mnt/models'
- '--served-model-name={{.Name}}'
- '--max-model-len=4096'
supportedModelFormats:
- name: vLLM
autoSelect: true
InferenceService (`amd-gpu-service.yaml`):
# amd-gpu-service.yaml - using AMD GPU resources
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: amd-gpu-model
labels:
opendatahub.io/dashboard: 'true'
networking.kserve.io/visibility: exposed
spec:
predictor:
model:
modelFormat:
name: vLLM
runtime: amd-gpu-model
storageUri: 'oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5'
resources:
requests:
cpu: '2'
memory: 16Gi
amd.com/gpu: '1' # Request AMD GPU
limits:
cpu: '4'
memory: 24Gi
amd.com/gpu: '1' # Limit AMD GPU
tolerations:
- effect: NoSchedule
key: amd.com/gpu # AMD GPU toleration
operator: Exists
AMD GPU deployment notes:
- Image selection: Use ROCm-enabled vLLM images (`-rocm` suffix)
- Resource type: Replace `nvidia.com/gpu` with `amd.com/gpu`
- Accelerator annotation: Use `amd.com/gpu` in annotations
- Tolerations: Update to match AMD GPU node taints
- Compatibility: Ensure models are tested with ROCm/AMD GPUs
Verify AMD GPU availability:
# Check for AMD GPU nodes
kubectl get nodes -L amd.com/gpu
# Verify AMD GPU resources
kubectl describe node <node-name> | grep amd.com/gpu
Example 4: Single Node, Multi GPU
Deploy large models across multiple GPUs on a single node using tensor parallelism:
ServingRuntime (`multi-gpu-model-runtime.yaml`):
# multi-gpu-model-runtime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: multi-gpu-model # Must match InferenceService name
annotations:
opendatahub.io/accelerator-name: nvidia-a100 # Target specific GPU type
labels:
opendatahub.io/dashboard: 'true'
spec:
containers:
- name: kserve-container
image: 'quay.io/modh/vllm:rhoai-2.20-cuda'
args:
- '--port=8080'
- '--model=/mnt/models'
# Multi-GPU specific configurations
- '--tensor-parallel-size=4' # Split model across 4 GPUs
- '--distributed-executor-backend=mp' # Required for multi-GPU
- '--max-model-len=8192' # Larger context enabled by more memory
env:
- name: VLLM_ATTENTION_BACKEND
value: FLASHINFER # Performance optimization for large models
volumeMounts:
- mountPath: /dev/shm # Shared memory critical for multi-GPU
name: shm
supportedModelFormats:
- name: vLLM
autoSelect: true
volumes:
- name: shm # Larger shared memory for multi-GPU communication
emptyDir:
medium: Memory
sizeLimit: 32Gi
InferenceService (`multi-gpu-model-service.yaml`):
# multi-gpu-model-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: multi-gpu-model
annotations:
serving.kserve.io/enable-prometheus-scraping: 'true' # Monitor multi-GPU usage
labels:
networking.kserve.io/visibility: exposed
opendatahub.io/dashboard: 'true'
spec:
predictor:
# Autoscaling configuration for production
minReplicas: 1
maxReplicas: 3 # Scale horizontally across nodes
scaleTarget: 80 # CPU utilization target
scaleMetric: cpu
model:
modelFormat:
name: vLLM
runtime: multi-gpu-model # Must match ServingRuntime name
# Large model requiring multiple GPUs
storageUri: 'oci://registry.redhat.io/rhelai1/modelcar-llama-3-3-70b-instruct:1.5'
resources:
requests:
cpu: '16'
memory: 160Gi
nvidia.com/gpu: '4' # Must match runtime GPU count
limits:
cpu: '32'
memory: 192Gi
nvidia.com/gpu: '4' # Ensure same GPU count as runtime
tolerations:
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
- effect: NoSchedule # A100-specific toleration
key: nvidia.com/gpu-model
operator: Equal
value: A100
Deploy multi-GPU model:
kubectl apply -f multi-gpu-model-runtime.yaml
kubectl apply -f multi-gpu-model-service.yaml
kubectl wait --for=condition=Ready inferenceservice/multi-gpu-model --timeout=300s
Example 5: S3 Storage Deployment
Deploy a model from S3-compatible storage with included data connection:
S3 Data Connection Secret (`s3-model-secret.yaml`):
# s3-model-secret.yaml - S3 data connection
apiVersion: v1
kind: Secret
metadata:
name: model-s3-storage
labels:
opendatahub.io/dashboard: 'true'
opendatahub.io/managed: 'true'
annotations:
opendatahub.io/connection-type: s3
openshift.io/display-name: Model S3 Storage
type: Opaque
stringData:
# Replace with your S3 credentials
AWS_ACCESS_KEY_ID: your-access-key
AWS_SECRET_ACCESS_KEY: your-secret-key
AWS_S3_ENDPOINT: https://s3.amazonaws.com
AWS_S3_BUCKET: your-model-bucket
AWS_DEFAULT_REGION: us-east-1
ServingRuntime (`s3-model-runtime.yaml`):
# s3-model-runtime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: s3-model
labels:
opendatahub.io/dashboard: 'true'
spec:
containers:
- name: kserve-container
image: 'quay.io/modh/vllm:rhoai-2.20-cuda'
args:
- '--port=8080'
- '--model=/mnt/models'
- '--served-model-name={{.Name}}'
supportedModelFormats:
- name: vLLM
autoSelect: true
InferenceService (`s3-model-service.yaml`):
# s3-model-service.yaml - using S3 storage
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: s3-model
labels:
opendatahub.io/dashboard: 'true'
spec:
predictor:
model:
modelFormat:
name: vLLM
runtime: s3-model
# Use storage field for S3 connections
storage:
key: model-s3-storage # Reference to the Secret
path: models/llama-3.1-8b-instruct/ # Path within the S3 bucket
args:
- '--max-model-len=4096'
resources:
requests:
nvidia.com/gpu: '1'
limits:
nvidia.com/gpu: '1'
tolerations:
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
Deploy from S3 storage:
# Create the S3 data connection first
kubectl apply -f s3-model-secret.yaml
# Deploy the runtime and service
kubectl apply -f s3-model-runtime.yaml
kubectl apply -f s3-model-service.yaml
Example 6: PVC Storage Deployment
Deploy a model from a Persistent Volume Claim:
ServingRuntime (`pvc-model-runtime.yaml`):
# pvc-model-runtime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: pvc-model
labels:
opendatahub.io/dashboard: 'true'
spec:
containers:
- name: kserve-container
image: 'quay.io/modh/vllm:rhoai-2.20-cuda'
args:
- '--port=8080'
- '--model=/mnt/models'
- '--served-model-name={{.Name}}'
supportedModelFormats:
- name: vLLM
autoSelect: true
InferenceService (`pvc-model-service.yaml`):
# pvc-model-service.yaml - using PVC storage
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: pvc-model
labels:
opendatahub.io/dashboard: 'true'
spec:
predictor:
model:
modelFormat:
name: vLLM
runtime: pvc-model
# Use storageUri with pvc:// scheme
storageUri: 'pvc://model-pvc/llama-3.1-8b-instruct' # pvc://<pvc-name>/<model-path>
args:
- '--max-model-len=4096'
resources:
requests:
nvidia.com/gpu: '1'
limits:
nvidia.com/gpu: '1'
tolerations:
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
Deploy from PVC storage:
# Note: PVC must already exist with model files
# See "Setting Up PVCs for Model Storage" guide
kubectl apply -f pvc-model-runtime.yaml
kubectl apply -f pvc-model-service.yaml
Prerequisites for PVC storage:
- Create a PVC named `model-pvc` with RWX access mode
- Copy model files to the PVC at path `llama-3.1-8b-instruct/` (see the copy sketch below)
- See Setting Up PVCs for Model Storage for detailed instructions
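A minimal copy sketch, assuming a helper pod named model-copy-pod that mounts model-pvc at /mnt/models (see the PVC setup guide for creating such a pod):
# Copy local model files into the PVC via the helper pod
kubectl cp ./llama-3.1-8b-instruct model-copy-pod:/mnt/models/llama-3.1-8b-instruct
# Or, with oc, sync the directory
oc rsync ./llama-3.1-8b-instruct/ model-copy-pod:/mnt/models/llama-3.1-8b-instruct/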
Deployment Commands for All Examples
# Complete Working Example (from earlier section)
kubectl apply -f granite-model-runtime.yaml
kubectl apply -f granite-model-service.yaml
kubectl wait --for=condition=Ready inferenceservice/granite-model --timeout=600s
# Example 1: Custom vLLM Arguments
kubectl apply -f custom-args-runtime.yaml
kubectl apply -f custom-args-service.yaml
# Example 2: Internal Only (No Route)
kubectl apply -f internal-model-runtime.yaml
kubectl apply -f internal-model-service.yaml
# Example 3: Alternate GPU (AMD)
kubectl apply -f amd-gpu-runtime.yaml
kubectl apply -f amd-gpu-service.yaml
# Example 4: Multi-GPU
kubectl apply -f multi-gpu-model-runtime.yaml
kubectl apply -f multi-gpu-model-service.yaml
kubectl wait --for=condition=Ready inferenceservice/multi-gpu-model --timeout=300s
# Example 5: S3 Storage
kubectl apply -f s3-model-secret.yaml
kubectl apply -f s3-model-runtime.yaml
kubectl apply -f s3-model-service.yaml
# Example 6: PVC Storage
kubectl apply -f pvc-model-runtime.yaml
kubectl apply -f pvc-model-service.yaml
# Monitor deployment status
kubectl get inferenceservice -w
# Test endpoints (internal)
kubectl port-forward service/<model-name>-predictor 8080:80
curl http://localhost:8080/v1/models
# Test external routes (automatically created for exposed services)
ROUTE_URL=$(kubectl get route <model-name> -o jsonpath='{.spec.host}')
curl https://$ROUTE_URL/v1/models
Verification and Troubleshooting
Verify Deployment Status
# Check InferenceService readiness
kubectl get inferenceservice <model-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
# Watch deployment progress
kubectl get inferenceservice <model-name> -w
# Check all components
kubectl get pods,services,deployments -l serving.kserve.io/inferenceservice=<model-name>
View Logs
# View predictor pod logs
kubectl logs -l serving.kserve.io/inferenceservice=<model-name> -c kserve-container
# Stream logs
kubectl logs -f -l serving.kserve.io/inferenceservice=<model-name>
# View previous container logs (if crashed)
kubectl logs -l serving.kserve.io/inferenceservice=<model-name> -c kserve-container --previous
Common Issues and Solutions
Model Not Loading
# Check pod events
kubectl describe pod -l serving.kserve.io/inferenceservice=<model-name>
# Common causes:
# - Insufficient memory: Increase memory limits
# - Wrong model format: Verify storageUri and model compatibility
# - GPU not available: Check node labels and tolerations
Route Not Accessible
# Check route status (automatically created when networking.kserve.io/visibility: exposed is set)
kubectl get route <model-name> -o jsonpath='{.status.ingress[0].conditions[?(@.type=="Admitted")]}'
# Verify service exists
kubectl get service <model-name>-predictor
# Test internal connectivity
kubectl run test-curl --image=curlimages/curl:latest --rm -it -- \
curl http://<model-name>-predictor.<namespace>.svc.cluster.local:80/v1/models
GPU Allocation Issues
# Check node GPU availability
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.nvidia\\.com/gpu
# Verify pod GPU requests
kubectl get pod -l serving.kserve.io/inferenceservice=<model-name> -o jsonpath='{.items[0].spec.containers[0].resources}'
# Check GPU operator status
kubectl get pods -n nvidia-gpu-operator
Performance Troubleshooting
# Check resource usage
kubectl top pod -l serving.kserve.io/inferenceservice=<model-name>
# View HPA status (if autoscaling enabled)
kubectl get hpa
# Check response times
kubectl exec -it <pod-name> -- curl -w "@curl-format.txt" -o /dev/null -s http://localhost:8080/v1/models
Best Practices
Naming Conventions
- Use consistent names for ServingRuntime and InferenceService (e.g., both named `granite-model`)
- Include model version in names for easy identification (e.g., `llama-70b-v1`)
- Use descriptive prefixes for environment separation (e.g., `prod-`, `dev-`, `test-`)
Resource Allocation
- Development: Start with minimal resources and scale up as needed
- Production:
- Set resource requests to guarantee minimum performance
- Set resource limits 20-30% above requests for burst capacity
- Use GPU only when necessary (some models run efficiently on CPU)
Security Considerations
- Use TLS termination at the route level
- Implement network policies to restrict access
- Regularly update serving runtime images
- Consider implementing authentication for production deployments
Model Selection Guidelines
- Choose quantized models (FP8, W8A8) for better GPU memory efficiency
- Use tensor parallelism for models larger than single GPU memory
- Consider model size vs. accuracy trade-offs
Production Deployment Checklist
- Enable authentication
- Configure autoscaling
- Set appropriate resource requests/limits
- Add health checks and readiness probes
- Configure monitoring and alerts
- Document model version and parameters
- Test rollback procedures
- Verify GPU node affinity and tolerations
Field Reference
ServingRuntime Fields
Field | Type | Required | Description | Example |
---|---|---|---|---|
metadata.name | string | Yes | Unique runtime identifier | vllm-runtime |
metadata.annotations.opendatahub.io/accelerator-name | string | No | Accelerator profile reference | nvidia-a100 |
metadata.annotations.opendatahub.io/recommended-accelerators | JSON array | No | List of compatible GPU types | ["nvidia.com/gpu"] |
metadata.labels.opendatahub.io/dashboard | string | No | Show in OpenShift AI dashboard | 'true' |
spec.containers[].name | string | Yes | Container name | kserve-container |
spec.containers[].image | string | Yes | Serving container image | quay.io/modh/vllm:rhoai-2.20-cuda |
spec.containers[].args | array | No | Container arguments | ["--port=8080"] |
spec.containers[].env | array | No | Environment variables | See examples |
spec.supportedModelFormats[].name | string | Yes | Model format name | vLLM |
spec.supportedModelFormats[].autoSelect | boolean | No | Auto-select this format | true |
spec.volumes | array | No | Volume definitions | See examples |
InferenceService Fields
Field | Type | Required | Description | Example |
---|---|---|---|---|
metadata.name | string | Yes | Unique model identifier | granite-model |
metadata.annotations.serving.kserve.io/deploymentMode | string | No | Deployment mode | RawDeployment |
metadata.labels.networking.kserve.io/visibility | string | No | Network visibility (REQUIRED for external Route access) | exposed |
spec.predictor.minReplicas | integer | No | Minimum pod replicas | 1 |
spec.predictor.maxReplicas | integer | No | Maximum pod replicas | 5 |
spec.predictor.scaleTarget | integer | No | Autoscaling target percentage | 80 |
spec.predictor.scaleMetric | string | No | Autoscaling metric | cpu or memory |
spec.predictor.model.modelFormat.name | string | Yes | Model format | vLLM |
spec.predictor.model.runtime | string | Yes | ServingRuntime reference | vllm-runtime |
spec.predictor.model.storageUri | string | Yes | Model location | oci://registry.redhat.io/... |
spec.predictor.model.args | array | No | Model server arguments | ["--max-model-len=4096"] |
spec.predictor.model.resources | object | No | Resource requirements | See examples |
spec.predictor.tolerations | array | No | Node scheduling tolerations | See examples |
Common Annotations
Annotation | Resource | Description | Values |
---|---|---|---|
opendatahub.io/dashboard | Both | Display in OpenShift AI dashboard | 'true' |
openshift.io/display-name | Both | Human-readable name | Any string |
serving.kserve.io/deploymentMode | InferenceService | Deployment strategy | RawDeployment , Serverless |
serving.kserve.io/enable-prometheus-scraping | InferenceService | Enable metrics | 'true' |
prometheus.io/scrape | ServingRuntime | Enable Prometheus scraping | 'true' |
prometheus.io/port | ServingRuntime | Metrics port | '8080' |
prometheus.io/path | ServingRuntime | Metrics endpoint | /metrics |
Using with Kubernetes MCP Server
The Kubernetes MCP server provides programmatic access to manage model serving resources. Below are the tool mappings and usage patterns.
MCP Tool Mapping
Operation | MCP Tool | Description |
---|---|---|
Create resources | mcp__Kubernetes__resources_create_or_update | Create ServingRuntime, InferenceService, Route |
List resources | mcp__Kubernetes__resources_list | List model deployments |
Get specific resource | mcp__Kubernetes__resources_get | View details of specific resources |
Update resources | mcp__Kubernetes__resources_create_or_update | Update existing resources |
Delete resources | mcp__Kubernetes__resources_delete | Remove model deployments |
View logs | mcp__Kubernetes__pods_log | Check model server logs |
Execute commands | mcp__Kubernetes__pods_exec | Run commands in pods |
Creating Model Deployments with MCP
// Create ServingRuntime
await mcp__Kubernetes__resources_create_or_update({
resource: JSON.stringify({
apiVersion: "serving.kserve.io/v1alpha1",
kind: "ServingRuntime",
metadata: {
name: "mcp-model-runtime",
namespace: "default"
},
spec: {
containers: [{
name: "kserve-container",
image: "quay.io/modh/vllm:rhoai-2.20-cuda",
args: ["--port=8080", "--model=/mnt/models"]
}],
supportedModelFormats: [{
name: "vLLM",
autoSelect: true
}]
}
})
});
// Create InferenceService
await mcp__Kubernetes__resources_create_or_update({
resource: JSON.stringify({
apiVersion: "serving.kserve.io/v1beta1",
kind: "InferenceService",
metadata: {
name: "mcp-model",
namespace: "default"
},
spec: {
predictor: {
model: {
modelFormat: { name: "vLLM" },
runtime: "mcp-model-runtime",
storageUri: "oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5"
}
}
}
})
});
Listing Model Deployments with MCP
// List all InferenceServices
const inferenceServices = await mcp__Kubernetes__resources_list({
apiVersion: "serving.kserve.io/v1beta1",
kind: "InferenceService",
namespace: "default"
});
// List all ServingRuntimes
const servingRuntimes = await mcp__Kubernetes__resources_list({
apiVersion: "serving.kserve.io/v1alpha1",
kind: "ServingRuntime",
namespace: "default"
});
Monitoring with MCP
// Get model deployment status
const model = await mcp__Kubernetes__resources_get({
apiVersion: "serving.kserve.io/v1beta1",
kind: "InferenceService",
name: "mcp-model",
namespace: "default"
});
// View model server logs
const pods = await mcp__Kubernetes__pods_list_in_namespace({
namespace: "default",
labelSelector: "serving.kserve.io/inferenceservice=mcp-model"
});
if (pods.items.length > 0) {
const logs = await mcp__Kubernetes__pods_log({
name: pods.items[0].metadata.name,
namespace: "default",
container: "kserve-container"
});
}
MCP Limitations
- Cannot use `kubectl apply` with file paths - must provide full resource JSON
- No support for `kubectl patch` - use full resource updates
- Cannot use shell redirections or pipes
- Limited support for complex label selectors
Best Practices for MCP
- Always specify namespace explicitly
- Use JSON.stringify() for resource specifications
- Check resource existence before updates
- Handle API version compatibility
- Implement proper error handling for failed operations
Appendix: Pre-built ModelCars
The following table lists all available pre-built ModelCars from Red Hat’s validated models collection. These models are optimized for deployment on OpenShift AI and can be used with the OCI registry storage method by specifying the URI in the storageUri
field of your InferenceService configuration.
Model ID | ModelCar URI |
---|---|
RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic | oci://registry.redhat.io/rhelai1/modelcar-llama-4-scout-17b-16e-instruct-fp8-dynamic:1.5 |
RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 | oci://registry.redhat.io/rhelai1/modelcar-llama-4-scout-17b-16e-instruct-quantized-w4a16:1.5 |
RedHatAI/Llama-4-Scout-17B-16E-Instruct | oci://registry.redhat.io/rhelai1/modelcar-llama-4-scout-17b-16e-instruct:1.5 |
RedHatAI/Llama-4-Maverick-17B-128E-Instruct | oci://registry.redhat.io/rhelai1/modelcar-llama-4-maverick-17b-128e-instruct:1.5 |
RedHatAI/Llama-4-Maverick-17B-128E-Instruct-FP8 | oci://registry.redhat.io/rhelai1/modelcar-llama-4-maverick-17b-128e-instruct-fp8:1.5 |
RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic | oci://registry.redhat.io/rhelai1/modelcar-mistral-small-3-1-24b-instruct-2503-fp8-dynamic:1.5 |
RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8 | oci://registry.redhat.io/rhelai1/modelcar-mistral-small-3-1-24b-instruct-2503-quantized-w8a8:1.5 |
RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16 | oci://registry.redhat.io/rhelai1/modelcar-mistral-small-3-1-24b-instruct-2503-quantized-w4a16:1.5 |
RedHatAI/Mistral-Small-3.1-24B-Instruct-2503 | oci://registry.redhat.io/rhelai1/modelcar-mistral-small-3-1-24b-instruct-2503:1.5 |
RedHatAI/Mistral-Small-24B-Instruct-2501-FP8-dynamic | oci://registry.redhat.io/rhelai1/modelcar-mistral-small-24b-instruct-2501-fp8-dynamic:1.5 |
RedHatAI/Mistral-Small-24B-Instruct-2501-quantized.w8a8 | oci://registry.redhat.io/rhelai1/modelcar-mistral-small-24b-instruct-2501-quantized-w8a8:1.5 |
RedHatAI/Mistral-Small-24B-Instruct-2501 | oci://registry.redhat.io/rhelai1/modelcar-mistral-small-24b-instruct-2501:1.5 |
RedHatAI/phi-4 | oci://registry.redhat.io/rhelai1/modelcar-phi-4:1.5 |
RedHatAI/phi-4-quantized.w4a16 | oci://registry.redhat.io/rhelai1/modelcar-phi-4-quantized-w4a16:1.5 |
RedHatAI/phi-4-quantized.w8a8 | oci://registry.redhat.io/rhelai1/modelcar-phi-4-quantized-w8a8:1.5 |
RedHatAI/phi-4-FP8-dynamic | oci://registry.redhat.io/rhelai1/modelcar-phi-4-fp8-dynamic:1.5 |
RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic | oci://registry.redhat.io/rhelai1/modelcar-llama-3-3-70b-instruct-fp8-dynamic:1.5 |
RedHatAI/Llama-3.3-70B-Instruct | oci://registry.redhat.io/rhelai1/modelcar-llama-3-3-70b-instruct:1.5 |
RedHatAI/granite-3.1-8b-instruct-FP8-dynamic | oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct-fp8-dynamic:1.5 |
RedHatAI/granite-3.1-8b-instruct | oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5 |
RedHatAI/Qwen2.5-7B-Instruct-FP8-dynamic | oci://registry.redhat.io/rhelai1/modelcar-qwen2-5-7b-instruct-fp8-dynamic:1.5 |
RedHatAI/Qwen2.5-7B-Instruct | oci://registry.redhat.io/rhelai1/modelcar-qwen2-5-7b-instruct:1.5 |
RedHatAI/Llama-3.1-8B-Instruct | oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-8b-instruct:1.5 |
RedHatAI/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic | oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-nemotron-70b-instruct-hf-fp8-dynamic:1.5 |
RedHatAI/Mixtral-8x7B-Instruct-v0.1 | oci://registry.redhat.io/rhelai1/modelcar-mixtral-8x7b-instruct-v0-1:1.4 |
RedHatAI/Llama-3.1-Nemotron-70B-Instruct-HF | oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-nemotron-70b-instruct-hf:1.5 |
RedHatAI/Llama-3.3-70B-Instruct-quantized.w8a8 | oci://registry.redhat.io/rhelai1/modelcar-llama-3-3-70b-instruct-quantized-w8a8:1.5 |
RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 | oci://registry.redhat.io/rhelai1/modelcar-llama-3-3-70b-instruct-quantized-w4a16:1.5 |
RedHatAI/granite-3.1-8b-instruct-quantized.w8a8 | oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct-quantized-w8a8:1.5 |
RedHatAI/granite-3.1-8b-instruct-quantized.w4a16 | oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct-quantized-w4a16:1.5 |
RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8 | oci://registry.redhat.io/rhelai1/modelcar-qwen2-5-7b-instruct-quantized-w8a8:1.5 |
RedHatAI/Qwen2.5-7B-Instruct-quantized.w4a16 | oci://registry.redhat.io/rhelai1/modelcar-qwen2-5-7b-instruct-quantized-w4a16:1.5 |
RedHatAI/granite-3.1-8b-base-quantized.w4a16 | oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-base-quantized-w4a16:1.5 |
RedHatAI/Mistral-Small-24B-Instruct-2501-quantized.w4a16 | oci://registry.redhat.io/rhelai1/modelcar-mistral-small-24b-instruct-2501-quantized-w4a16:1.5 |
RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 | oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-8b-instruct-quantized-w8a8:1.5 |
RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic | oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-8b-instruct-fp8-dynamic:1.5 |
RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 | oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-8b-instruct-quantized-w4a16:1.5 |
RedHatAI/gemma-2-9b-it-FP8 | oci://registry.redhat.io/rhelai1/modelcar-gemma-2-9b-it-FP8:1.5 |
RedHatAI/gemma-2-9b-it | oci://registry.redhat.io/rhelai1/modelcar-gemma-2-9b-it:1.5 |
Usage Notes
- Model Selection: Choose models based on your hardware constraints and performance requirements
- Quantization: Models with `FP8-dynamic`, `w4a16`, or `w8a8` quantization require less GPU memory
- GPU Requirements: Larger models (70B parameters) require more GPU memory and compute resources
- Version: Most models use version `1.5`, except `Mixtral-8x7B-Instruct-v0.1` which uses `1.4`
- Registry: All models are hosted in the Red Hat registry at `registry.redhat.io/rhelai1/` (a pull-access check sketch follows this list)
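Before deploying, you can confirm that your credentials can access a given ModelCar image. A sketch using skopeo; it assumes you have already authenticated to registry.redhat.io (for example with skopeo login):
# Inspect a ModelCar image without pulling it
skopeo inspect docker://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5 | head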
Example Usage
To use any of these models in your InferenceService, simply replace the `storageUri` field:
spec:
predictor:
model:
storageUri: 'oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5'
For more information about each model, visit the Red Hat AI validated models collection on Hugging Face.