Using GPU workers
Charmed Kubernetes supports GPU-enabled instances for applications that
can use them. The kubernetes-worker
application will automatically detect
NVIDIA hardware and enable the appropriate support. This page describes
recommended deployment and verification steps when using GPU workers with
Charmed Kubernetes.
Deploying Charmed Kubernetes with GPU workers
When deploying the Charmed Kubernetes bundle, you can use a YAML overlay file with constraints to ensure worker units are deployed on GPU-enabled machines. Because GPU support varies depending on the underlying cloud, this requires specifying a particular instance type.
For example, when deploying to AWS, you may decide to use a p3.2xlarge
instance from the available AWS GPU-enabled instance types.
Similarly, you could choose Azure’s Standard_NC6s_v3
instance from the
available Azure GPU-enabled instance types.
NVIDIA updates its list of supported GPUs frequently, so be sure to cross-reference the GPU included in a specific cloud instance type against the Supported NVIDIA GPUs and Systems documentation.
Example overlay files that set GPU worker constraints:
# AWS gpu-overlay.yaml
applications:
  kubernetes-worker:
    constraints: instance-type=p3.2xlarge
# Azure gpu-overlay.yaml
applications:
  kubernetes-worker:
    constraints: instance-type=Standard_NC6s_v3
Deploy Charmed Kubernetes with an overlay like this:
juju deploy charmed-kubernetes --overlay ~/path/my-overlay.yaml --overlay ~/path/gpu-overlay.yaml
As demonstrated here, you can use multiple overlay files when deploying, so you can combine GPU support with an integrator charm or other custom configuration.
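For reference, a cloud integrator overlay for AWS might look something like the sketch below (the exact charm options and relation endpoints can differ between releases, so check the aws-integrator charm documentation for the current recommended overlay):
# aws-integrator-overlay.yaml (illustrative sketch)
applications:
  aws-integrator:
    charm: aws-integrator
    num_units: 1
    trust: true
relations:
  - ['aws-integrator', 'kubernetes-control-plane']
  - ['aws-integrator', 'kubernetes-worker']
Integrator charms generally need access to cloud credentials, so include --trust when deploying with such an overlay:
juju deploy charmed-kubernetes --overlay ~/path/aws-integrator-overlay.yaml --overlay ~/path/gpu-overlay.yaml --trust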
You may then want to test a GPU workload.
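Before running a full workload, a quick way to confirm that the cluster is advertising the GPUs is to look for the nvidia.com/gpu resource on the nodes (this resource name assumes the standard NVIDIA device plugin):
kubectl describe nodes | grep 'nvidia.com/gpu'
GPU-enabled workers should report a non-zero count in their Capacity and Allocatable sections.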
Adding GPU workers post deployment
It isn’t necessary for all worker units to have GPU support. You can add
GPU-enabled workers to an existing cluster. The recommended way to do this is
to first set a new constraint for the kubernetes-worker
application:
juju set-constraints kubernetes-worker instance-type=p3.2xlarge
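If you want to confirm the constraint before adding units, you can read it back; the command name depends on your Juju version:
juju constraints kubernetes-worker       # Juju 3.x
juju get-constraints kubernetes-worker   # Juju 2.9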
Then add as many new GPU worker units as required. For example, to add two new units:
juju add-unit kubernetes-worker -n2
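You can watch the new machines provision and the units come up with:
juju status kubernetes-worker
The new units should appear on machines using the GPU instance type set above.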
Adding GPU workers with GCP
Google handles GPUs slightly differently from most clouds. GPUs are not included in the general instance templates, so they have to be attached to a machine manually.
To begin, add a new machine with Juju. Include any desired constraints for CPU cores, memory, etc.:
juju add-machine --constraints 'cores=4 mem=16G'
The command will return the number of the machine that was created - take note of this number.
Next you will need to use the gcloud tool or the GCP console to stop the newly created instance, edit its configuration to attach the desired GPU, and then restart the machine.
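As a rough sketch with the gcloud tool, assuming the instance is named juju-abc123-10 in zone us-east1-b (both are placeholders for your own machine and zone), the steps might look like this; the GPU itself can be attached from the GCP console while the instance is stopped:
# Stop the instance so its configuration can be edited
gcloud compute instances stop juju-abc123-10 --zone us-east1-b
# GPU instances must be set to terminate on host maintenance
gcloud compute instances set-scheduling juju-abc123-10 --maintenance-policy TERMINATE --zone us-east1-b
# Attach the desired GPU in the console, then start the instance again
gcloud compute instances start juju-abc123-10 --zone us-east1-b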
Once it is up and running, add the kubernetes-worker
application to it:
juju add-unit kubernetes-worker --to 10
…replacing 10 in the above with the previously noted machine number. As the charm installs, the GPU will be detected and the relevant support enabled.
Testing
As GPU instances can be costly, it is useful to test that they can actually be used. A simple test job can be created to run NVIDIA’s hardware reporting tool. Please note that you may need to replace the image tag in the following YAML with the latest supported one.
apiVersion: batch/v1
kind: Job
metadata:
  name: nvidia-smi
spec:
  template:
    metadata:
      name: nvidia-smi
    spec:
      restartPolicy: Never
      containers:
      - image: nvidia/cuda:12.1.0-base-ubuntu22.04
        name: nvidia-smi
        args:
        - nvidia-smi
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /usr/bin/
          name: binaries
        - mountPath: /usr/lib/x86_64-linux-gnu
          name: libraries
      volumes:
      - name: binaries
        hostPath:
          path: /usr/bin/
      - name: libraries
        hostPath:
          path: /usr/lib/x86_64-linux-gnu
Save this manifest as nvidia-test.yaml and run it with:
kubectl create -f nvidia-test.yaml
You can inspect the logs to find the hardware report.
kubectl logs job.batch/nvidia-smi
Tue Apr 11 22:46:04 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB            On | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P0               23W / 300W|      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
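Once you have confirmed the GPU is usable, the test job can be removed:
kubectl delete job nvidia-smi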