Setting Up PVCs for Model Storage
This guide explains how to create and configure Persistent Volume Claims (PVCs) for storing large language models that can be served by OpenShift AI. Using PVCs allows you to store models on shared storage and serve them across multiple deployments.
Overview
PVC-based model storage is useful when you:
- Have models stored on shared file systems
- Need to modify or customize model files
- Want to avoid downloading models from external sources repeatedly
- Require ReadWriteMany (RWX) access for multiple pods
Prerequisites
- Access to a Kubernetes/OpenShift cluster
- Storage class that supports RWX access mode (e.g., NFS, CephFS, or OpenShift Data Foundation)
- Sufficient storage quota for your models (LLMs can be 10-100+ GB)
- kubectl or oc CLI tool configured
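If your namespace has a ResourceQuota, oversized claims are rejected at creation time; a quick pre-check (prints nothing if no quota is set):
# Look for any quota capping requests.storage
kubectl get resourcequota -n my-namespace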
Creating a PVC for Model Storage
Step 1: Determine Storage Requirements
First, check available storage classes that support RWX:
# List storage classes
kubectl get storageclass
# Inspect a storage class; RWX support depends on the provisioner and is
# not listed directly in the output, so confirm it in your storage
# provider's documentation
kubectl describe storageclass <storage-class-name>
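If you are unsure whether a class supports RWX, one practical check is a small throwaway claim; a minimal sketch (the test-rwx name is illustrative):
# test-rwx-pvc.yaml: tiny claim just to confirm RWX provisions
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-rwx
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: <storage-class-name>
Apply it, confirm it reaches Bound, then delete it:
kubectl apply -f test-rwx-pvc.yaml -n my-namespace
kubectl get pvc test-rwx -n my-namespace
kubectl delete pvc test-rwx -n my-namespace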
Step 2: Create the PVC
Create a PVC with sufficient capacity for your models:
# model-storage-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
  namespace: my-namespace
  labels:
    app: model-storage
    purpose: llm-serving
spec:
  accessModes:
    - ReadWriteMany  # Required for multiple pod access
  resources:
    requests:
      storage: 100Gi  # Adjust based on model size
  storageClassName: nfs-storage  # Replace with your RWX storage class
Apply the PVC:
kubectl apply -f model-storage-pvc.yaml -n my-namespace
Verify PVC is bound:
kubectl get pvc model-pvc -n my-namespace
# Expected output:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
model-pvc Bound pvc-12345678-abcd-efgh-ijkl-123456789012 100Gi RWX nfs-storage 1m
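Instead of polling, you can block until the claim binds; note that classes with volumeBindingMode: WaitForFirstConsumer stay Pending until a pod actually mounts the PVC:
# Wait up to 2 minutes for the claim to report Bound
kubectl wait --for=jsonpath='{.status.phase}'=Bound pvc/model-pvc -n my-namespace --timeout=120s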
Downloading Models to PVC
Method 1: Using a Job to Download Models
Create a Kubernetes Job to download models directly to the PVC.
Recommended Model Sources:
- Red Hat AI Validated Models - Pre-validated and optimized models for enterprise use
- These models are tested and supported for use with OpenShift AI
Example using a Red Hat validated model:
# download-model-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-llama-model
  namespace: my-namespace
spec:
  template:
    spec:
      containers:
        - name: downloader
          image: python:3.11-slim
          command: ["/bin/bash", "-c"]
          args:
            - |
              # Install required tools
              pip install huggingface-hub

              # Download model to PVC
              python -c "
              from huggingface_hub import snapshot_download

              # Download Red Hat validated Llama model
              snapshot_download(
                  repo_id='RedHatAI/Llama-3.1-8B-Instruct',
                  local_dir='/models/llama-3.1-8b-instruct',
                  local_dir_use_symlinks=False,
                  # Some models require authentication; uncomment and add a token if needed:
                  # token='YOUR_HF_TOKEN',
              )
              "

              echo "Model download complete!"
              ls -la /models/llama-3.1-8b-instruct/
          volumeMounts:
            - name: model-storage
              mountPath: /models
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
      restartPolicy: Never
  backoffLimit: 2
Run the job:
kubectl apply -f download-model-job.yaml -n my-namespace
# Monitor progress
kubectl logs -f job/download-llama-model -n my-namespace
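Gated models on Hugging Face need a token. Rather than hard-coding it in the job script, you can inject it through a Secret; a sketch, where hf-token is a hypothetical Secret name and HF_TOKEN is the environment variable that huggingface-hub reads automatically:
# Create the secret once
kubectl create secret generic hf-token --from-literal=token='YOUR_HF_TOKEN' -n my-namespace
Then add the variable to the downloader container in download-model-job.yaml:
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token   # hypothetical Secret created above
        key: token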
Method 2: Using a Temporary Pod
Create a temporary pod to manually download or copy models:
# model-setup-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-setup
  namespace: my-namespace
spec:
  containers:
    - name: setup
      image: python:3.11
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: model-storage
          mountPath: /models
      resources:
        requests:
          memory: "4Gi"
          cpu: "2"
  volumes:
    - name: model-storage
      persistentVolumeClaim:
        claimName: model-pvc
Deploy and access the pod:
# Create the pod
kubectl apply -f model-setup-pod.yaml -n my-namespace
# Wait for pod to be ready
kubectl wait --for=condition=ready pod/model-setup -n my-namespace --timeout=300s
# Access the pod
kubectl exec -it model-setup -n my-namespace -- bash
# Inside the pod, download models
pip install huggingface-hub
# Download a Red Hat validated model
python -c "from huggingface_hub import snapshot_download; snapshot_download('RedHatAI/Llama-3.1-8B-Instruct', local_dir='/models/llama-3.1-8b-instruct', local_dir_use_symlinks=False)"
# Or download other Red Hat AI models:
# RedHatAI/granite-3-8b-instruct
# RedHatAI/Mistral-7B-Instruct-v0.3
# See full collection: https://huggingface.co/collections/RedHatAI
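# Alternatively, download with the CLI bundled with huggingface-hub
# (a sketch; same result as the Python one-liner above)
huggingface-cli download RedHatAI/Llama-3.1-8B-Instruct --local-dir /models/llama-3.1-8b-instruct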
# Exit and delete the pod when done
exit
kubectl delete pod model-setup -n my-namespace
Method 3: Copying from Local Machine
If you have models locally, copy them to the PVC:
# Create a temporary pod with the PVC mounted
kubectl run model-copy --image=busybox --restart=Never --rm -i --tty \
--overrides='{"spec":{"volumes":[{"name":"model-storage","persistentVolumeClaim":{"claimName":"model-pvc"}}],"containers":[{"name":"model-copy","volumeMounts":[{"name":"model-storage","mountPath":"/models"}]}]}}' \
-n my-namespace -- sh
# In another terminal, copy files to the pod
kubectl cp /local/path/to/model my-namespace/model-copy:/models/my-model
# The pod will automatically be deleted when you exit
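kubectl cp streams one tar archive and starts over if the connection drops; on OpenShift, oc rsync can resume and is often the more robust choice for multi-gigabyte models:
# OpenShift alternative: resumable copy into the running pod
oc rsync /local/path/to/model/ model-copy:/models/my-model -n my-namespace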
Verifying Model Files
Check that models are correctly stored on the PVC:
# Create a debug pod to inspect the PVC
kubectl run pvc-inspector --image=busybox --restart=Never --rm -i --tty \
--overrides='{"spec":{"volumes":[{"name":"model-storage","persistentVolumeClaim":{"claimName":"model-pvc"}}],"containers":[{"name":"pvc-inspector","volumeMounts":[{"name":"model-storage","mountPath":"/models"}]}]}}' \
-n my-namespace -- sh
# Inside the pod
ls -la /models/
du -sh /models/*
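# Optionally spot-check integrity against the per-file checksums shown on
# the model's Hugging Face page (path is a placeholder)
sha256sum /models/llama-3.1-8b-instruct/*.safetensors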
Using PVC Storage in InferenceService
Once models are stored on the PVC, reference them in your InferenceService:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-model
  namespace: my-namespace
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: llama-runtime
      # Reference the PVC and model path
      storageUri: 'pvc://model-pvc/llama-3.1-8b-instruct'
      resources:
        requests:
          nvidia.com/gpu: '1'
        limits:
          nvidia.com/gpu: '1'
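After applying the manifest, confirm the service becomes ready and picks the model up from the PVC:
kubectl get inferenceservice llama-model -n my-namespace   # READY should become True
kubectl describe inferenceservice llama-model -n my-namespace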
Best Practices
Storage Sizing
- Small models (< 10GB): 50Gi PVC
- Medium models (10-50GB): 100Gi PVC
- Large models (> 50GB): 200Gi+ PVC
- Add 20% overhead for temporary files and caching; a worked example follows
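For example, an 8B-parameter model stored in FP16 uses 2 bytes per parameter, so the weights alone come to roughly 16 GB; adding tokenizer files, temporary download space, and the 20% margin, a 50Gi claim leaves comfortable headroom.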
Access Modes
- Use ReadWriteMany (RWX) for:
- Serving models from multiple pods
- Updating models without downtime
- Shared model repositories
- Use ReadWriteOnce (RWO) only if:
- Single pod deployment
- Cost is a major concern
- RWX is not available
Organization
Structure your models on the PVC:
/models/
├── llama-3.1-8b-instruct/
│   ├── config.json
│   ├── model.safetensors
│   └── tokenizer.json
├── granite-3-1-8b/
│   └── ...
└── mistral-7b/
    └── ...
Performance Considerations
- Storage Class: Choose high-performance storage for production
- Caching: Model weights are read from storage once at startup and then held in memory, so storage throughput mainly affects initial load time
- Network: Ensure good network connectivity between nodes and storage
Troubleshooting
PVC Won’t Bind
# Check PVC events
kubectl describe pvc model-pvc -n my-namespace
# Common issues:
# - No storage class supports RWX
# - Insufficient quota
# - Storage class doesn't exist
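Namespace events usually state the exact reason (no matching provisioner, quota exceeded, and so on):
# Surface provisioner and quota events for this claim
kubectl get events -n my-namespace --field-selector involvedObject.name=model-pvc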
Model Loading Errors
# Check InferenceService logs
kubectl logs -l serving.kserve.io/inferenceservice=your-model -n my-namespace
# Common issues:
# - Wrong path in storageUri
# - Missing model files
# - Incorrect permissions
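KServe copies model files out of the PVC in an init container named storage-initializer before the server starts; if the predictor pod hangs in Init, that container's logs usually name the bad path (the pod name is a placeholder):
# Inspect the KServe storage init container on the predictor pod
kubectl logs <predictor-pod-name> -c storage-initializer -n my-namespace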
Slow Model Loading
- Check storage read throughput by reading an existing large file on the PVC (such as a weights file):
kubectl exec -it <pod> -- dd if=/models/llama-3.1-8b-instruct/model.safetensors of=/dev/null bs=1M count=1000
- Consider a higher-performance storage class
- Ensure nodes have good network connectivity to storage