GKE GPU timesharing and resource quotas experiment
Fri 26 August 2022 | Last updated on Tue 06 December 2022
Do you only have a few GPUs but want your end-users to think you have many? Then GKE GPU timesharing might just be the feature for you to save costs on underutilized GPUs. In this blog post you will learn:
- How to create a GKE nodepool with timesharing enabled
- How to configure GPU-based ResourceQuotas on multiple namespaces
- How pods in different namespaces can request 4 GPUs even though there is only a single physical GPU
Creating the cluster and nodepool
Create a cluster that meets the requirements for timesharing. At the time of writing this requires using the rapid release channel. The command below creates a cluster with a default CPU-only nodepool using e2-medium machines; system services will run on this default nodepool.
gcloud container clusters create gke-gpu-timesharing \
--region us-central1 --node-locations=us-central1-a \
--machine-type e2-medium --max-nodes=3 --min-nodes=1 \
--enable-autoscaling --release-channel rapid \
--shielded-integrity-monitoring --shielded-secure-boot
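Before creating the GPU nodepool you can optionally verify the cluster is up. This is just a quick sanity check and not part of the original steps; it assumes your gcloud project and credentials are already configured.
# Should print RUNNING once the control plane and default nodepool are ready
gcloud container clusters describe gke-gpu-timesharing --region us-central1 --format="value(status)"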
Create the T4 GPU nodepool with timesharing enabled
gcloud container node-pools create gpu-time-sharing \
--cluster=gke-gpu-timesharing \
--machine-type=n1-standard-4 \
--num-nodes=1 \
--region=us-central1 \
--node-locations=us-central1-a \
--accelerator=type=nvidia-tesla-t4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=8 \
--shielded-secure-boot --shielded-integrity-monitoring
Install the nvidia GPU device drivers:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
You should now have a working GKE cluster with two nodepools: a CPU-only nodepool and a GPU nodepool with one node that has a single T4. This T4 GPU can be used by up to 8 clients at the same time, so it could serve 8 pods each requesting 1 GPU, or 2 pods each requesting 4 GPUs.
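If you want to double-check the setup before moving on, a couple of optional commands (not part of the original steps) confirm that both nodepools exist and that the GPU node registered with the expected accelerator label:
# List the nodepools of the cluster; you should see the default pool and gpu-time-sharing
gcloud container node-pools list --cluster=gke-gpu-timesharing --region=us-central1
# List the GPU node(s) via the accelerator label that GKE applies to GPU nodes
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4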
Creating the namespaces and resource quotas
The example will use 2 fictional tenants: tenant-a and tenant-b.
Create the namespaces for both tenants:
kubectl create ns tenant-a
kubectl create ns tenant-b
Create a quota for tenant-a
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: test-gpu-quota-1
  namespace: tenant-a
spec:
  hard:
    requests.nvidia.com/gpu: 1
    limits.nvidia.com/gpu: 1
    requests.cpu: 1
    limits.cpu: 1
EOF
Create a quota for tenant-b
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: test-gpu-quota-1
  namespace: tenant-b
spec:
  hard:
    requests.nvidia.com/gpu: 10
    limits.nvidia.com/gpu: 10
    requests.cpu: 2
    limits.cpu: 2
EOF
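As an optional sanity check (assuming both quotas applied cleanly), you can list the quotas and their current usage in each namespace:
# Shows the hard limits and how much of each resource is currently used
kubectl get resourcequota -n tenant-a
kubectl get resourcequota -n tenant-b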
Creating GPU pods in tenant-a and tenant-b
Let's create 4 pods in tenant-a and 4 pods in tenant-b where each pod is requesting a single GPU.
The deployment that we will use in both tenants:
cat > gpu-deployment.yml << EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-simple
spec:
  replicas: 4
  selector:
    matchLabels:
      app: cuda-simple
  template:
    metadata:
      labels:
        app: cuda-simple
    spec:
      containers:
      - name: cuda-simple
        image: nvidia/cuda:11.0.3-base-ubi7
        command:
        - bash
        - -c
        - |
          /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
Launch it in tenant-a and tenant-b:
kubectl apply -f gpu-deployment.yml -n tenant-a
kubectl apply -f gpu-deployment.yml -n tenant-b
You should see 1 pod running in tenant-a:
kubectl get pods -n tenant-a
This demonstrates that ResourceQuotas are effective and can limit GPUs that are timeshared.
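If you are curious why the other 3 replicas in tenant-a never started, the quota usage and the ReplicaSet's FailedCreate events tell the story (the exact event wording may differ slightly):
# Shows Used vs Hard for the quota; nvidia.com/gpu should be at 1/1
kubectl describe resourcequota test-gpu-quota-1 -n tenant-a
# Pods blocked by a quota are rejected at creation time, so they show up as FailedCreate events
kubectl get events -n tenant-a --field-selector reason=FailedCreate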
You should see 4 pods running in tenant-b:
kubectl get pods -n tenant-b
Verify that pods in tenant-a see the whole GPU as usable:
kubectl exec -n tenant-a deploy/cuda-simple -ti -- sh
/usr/local/nvidia/bin/nvidia-smi
exit
Verify that pods in tenant-b see the whole GPU as usable:
kubectl exec -n tenant-b deploy/cuda-simple -ti -- sh
/usr/local/nvidia/bin/nvidia-smi
exit
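Because the container command already runs nvidia-smi -L before sleeping, the same information is also available in the pod logs without exec'ing in:
# Each pod prints the GPU it sees at startup
kubectl logs -n tenant-a deploy/cuda-simple
kubectl logs -n tenant-b deploy/cuda-simple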
You can also take a look at the GPU node to see that it reports having 8 GPU devices even though it only has a single physical GPU attached to the VM. This is because timesharing is enabled with max-shared-clients-per-gpu=8.
Run the following to get the node details of the GPU nodes with Tesla T4s:
kubectl get node -l cloud.google.com/gke-accelerator=nvidia-tesla-t4 -o yaml
You should see something like this in the output:
allocatable:
  attachable-volumes-gce-pd: "127"
  cpu: 3920m
  ephemeral-storage: "47093746742"
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 12663300Ki
  nvidia.com/gpu: "8"
  pods: "110"
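If you only care about the advertised GPU count, a jsonpath query (a shorter alternative to dumping the full node YAML; note the escaped dots in the resource name) should pull out just the allocatable nvidia.com/gpu value:
# Print only the allocatable GPU count of the time-shared node(s)
kubectl get node -l cloud.google.com/gke-accelerator=nvidia-tesla-t4 \
  -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'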
Summary
We demonstrated how to use GPU timesharing in GKE and verified that each tenant thinks they're getting the whole GPU even though it is shared. We were also able to limit how many GPUs a single tenant can request by applying ResourceQuotas. GPU timesharing is a great way to reduce GPU costs when the GPU utilization of a single tenant/user is low.