Environment: CentOS 7
Kubernetes version: 1.25
Preface: I've spent the past couple of days working on Kubeflow, and I'd never dealt with GPU-related services before. Kubeflow is a platform for doing machine learning on top of Kubernetes; you can think of it as the ML toolkit for Kubernetes.
Hitting a dead end
Kubeflow was already deployed, but when we created a Notebook it came back with this error:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 45m (x23 over 151m) default-scheduler 0/3 nodes are available: 3 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
The key part is Insufficient nvidia.com/gpu: it's telling me there are no GPUs available. The first time I hit this I was a bit lost, because the driver was already installed on the machines and the GPU status could be viewed just fine.
After some digging it turns out that for Kubernetes to schedule GPUs, you also need to install the corresponding vendor's device plugin.
:zap: Off I scurried to install it.
:first_quarter_moon: Such is the life of a working stiff.
Light at the end of the tunnel
The GPU container creation flow
containerd --> containerd-shim --> nvidia-container-runtime --> nvidia-container-runtime-hook --> libnvidia-container --> runc --> container-process
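Each arrow in that chain corresponds to a concrete binary or library on the node. Once the components from the steps below are in place, a rough way to confirm they all exist is a sketch like this (paths and package layouts are typical defaults and may differ on your system):

which containerd containerd-shim-runc-v2 runc        # shipped with your container runtime / distro
which nvidia-container-runtime nvidia-container-runtime-hook
nvidia-container-cli --version                        # CLI front-end of libnvidia-container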
Deploying nvidia-container-runtime
Environment: CentOS 7 (for other systems, check the official site)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
Install it with yum:
[root@ycloud ~]# yum -y install nvidia-container-runtime
[root@ycloud ~]# which nvidia-container-runtime
/bin/nvidia-container-runtime
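As a quick sanity check (a sketch, assuming the install pulled in libnvidia-container-tools as a dependency), you can ask nvidia-container-cli to enumerate the driver and GPUs it can see:

nvidia-container-cli info    # should list the driver version and each GPU on the node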
If your machine can't reach this repo, you can install on a cloud host first, then bundle the package with its dependencies and move the bundle onto the target machine:
# yum deplist shows all of a package's dependencies
yum deplist nvidia-container-runtime
# download the package and its dependencies into the current directory
## it's best to create the directory beforehand
mkdir nvidia-container-runtime; cd nvidia-container-runtime
repotrack nvidia-container-runtime
# package it up (from the parent directory)
cd ..
tar -zcvf nvidia-container-runtime.tar.gz nvidia-container-runtime
Once it's packaged, copy it to the target machine and extract it:
tar -zxvf nvidia-container-runtime.tar.gz -C ~/
cd ~/nvidia-container-runtime
# offline install
$ rpm -Uvh --force --nodeps *.rpm
准备您的 GPU 节点
需要在所有 GPU 节点上执行以下步骤。本 README 假定 NVIDIA 驱动程序和nvidia-container-toolkit
已预安装。它还假定您已将 设置nvidia-container-runtime
为要使用的默认低级运行时。
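A quick way to confirm those prerequisites on each GPU node before touching containerd (a sketch, for CentOS/RHEL):

nvidia-smi                        # NVIDIA driver loaded and GPUs visible on the host
rpm -q nvidia-container-toolkit   # toolkit package present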
Configure containerd
When running Kubernetes with containerd, edit the config file, which normally lives at /etc/containerd/config.toml, to set nvidia-container-runtime as the default low-level runtime:
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
Then restart containerd:
$ sudo systemctl restart containerd
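With containerd restarted, one way to sanity-check the runtime wiring before involving Kubernetes is to run a CUDA container directly through ctr (a sketch adapted from NVIDIA's containerd instructions; the image tag is just an example and must be pullable from your node):

sudo ctr image pull docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04
sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04 \
    gpu-smoke-test nvidia-smi

If this prints the same table as nvidia-smi on the host, the node-level plumbing is working.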
Enabling GPU support in Kubernetes
Once you have configured the options above on all GPU nodes in your cluster, you can enable GPU support by deploying the following DaemonSet:
[root@ycloud ~]# cat nvidia-device-plugin.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: ycloudhub.com/middleware/nvidia-gpu-device-plugin:v0.12.3
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
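For reference, this manifest is essentially the upstream one from the NVIDIA/k8s-device-plugin project, with the image mirrored into a private registry (ycloudhub.com). If your nodes can pull from the internet, the matching upstream release manifest can usually be applied directly instead (URL shown for v0.12.3; check the project's README for your version):

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml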
Deploy it with kubectl apply and check the pods:
[root@ycloud ~]# kubectl apply -f nvidia-device-plugin.yaml
daemonset.apps/nvidia-device-plugin-daemonset unchanged
[root@ycloud ~]# kubectl get po -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-4hf89 1/1 Running 0 81m
nvidia-device-plugin-daemonset-6v4k2 1/1 Running 0 81m
nvidia-device-plugin-daemonset-lvgmd 1/1 Running 0 81m
Verification
[root@ycloud ~]# kubectl describe nodes ycloud
......
Capacity:
  cpu:                32
  ephemeral-storage:  458291312Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131661096Ki
  nvidia.com/gpu:     2
  pods:               110
Allocatable:
  cpu:                32
  ephemeral-storage:  422361272440
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131558696Ki
  nvidia.com/gpu:     2
  pods:               110
......
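With nvidia.com/gpu now showing up in the node's capacity, a minimal test pod that requests one GPU can confirm scheduling end to end (a sketch; the image and names are just examples):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.6.2-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # the extended resource advertised by the device plugin

After kubectl apply, kubectl logs gpu-test should print the familiar nvidia-smi table; if the pod stays Pending with Insufficient nvidia.com/gpu, the plugin isn't registering on that node.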
:zap: And that's it. With the plugin installed, the Notebook we created in Kubeflow went back to a normal state. Happy days!