NVIDIA Device Plugin for Kubernetes

Date: 2022-11-26 10:58:01

Environment: CentOS 7

Kubernetes version: 1.25

Preface: I've spent the last couple of days working on Kubeflow. I had never dealt with GPU-related services before. Kubeflow is a machine-learning platform that runs on top of Kubernetes; you can think of it as the ML toolkit for Kubernetes.

Hitting a Dead End

Kubeflow was already deployed, but when we created a Notebook, it returned the following error:

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  45m (x23 over 151m)  default-scheduler  0/3 nodes are available: 3 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.

The key part is Insufficient nvidia.com/gpu., which says no GPU is available. The first time I hit this I was rather confused, because the NVIDIA driver was already installed on the machine and the GPU status could be queried normally.


After some research, it turned out that for Kubernetes to schedule GPUs, the corresponding vendor's device plugin must also be installed.

:zap: So off I scurried to install it.

:first_quarter_moon: Such is the life of a long-suffering engineer.

Light at the End of the Tunnel

The GPU Container Creation Flow

containerd --> containerd-shim --> nvidia-container-runtime --> nvidia-container-runtime-hook --> libnvidia-container --> runc --> container-process
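
The chain above can be sanity-checked on a node by confirming that each component's binary is on the PATH. A minimal sketch (the binary names are assumptions for a typical install, e.g. libnvidia-container is fronted by nvidia-container-cli and recent containerd ships containerd-shim-runc-v2; this only checks presence, not that containerd is actually wired to them):

```shell
# Check each component of the GPU container chain for PATH presence.
# Binary names assume a typical install: libnvidia-container's CLI is
# nvidia-container-cli, and recent containerd uses containerd-shim-runc-v2.
found=0; missing=0
for c in containerd containerd-shim-runc-v2 nvidia-container-runtime \
         nvidia-container-runtime-hook nvidia-container-cli runc; do
  if command -v "$c" >/dev/null 2>&1; then
    echo "found:   $c"; found=$((found+1))
  else
    echo "missing: $c"; missing=$((missing+1))
  fi
done
echo "$found found, $missing missing"
```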

Deploying nvidia-container-runtime

Environment: CentOS 7 (for other distributions, check the official site)

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo

Install it with yum:

[root@ycloud ~]# yum -y install nvidia-container-runtime
[root@ycloud ~]# which nvidia-container-runtime
/bin/nvidia-container-runtime

If your machine cannot reach the repository, you can install on an internet-connected cloud host first, then package the dependencies and move them to the target machine.

#yum deplist lists all dependencies of a package
yum deplist nvidia-container-runtime
#Download the package and its full dependency tree into the current directory
##best to create a dedicated directory first
mkdir nvidia-container-runtime;cd nvidia-container-runtime
repotrack nvidia-container-runtime
#Package it up (from the parent directory)
cd ..
tar -zcvf nvidia-container-runtime.tar.gz nvidia-container-runtime
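
A self-contained variant of the packaging step, which also records a checksum so the copy to the air-gapped machine can be verified on arrival (a sketch; the directory name follows the commands above, and mkdir -p makes it safe to rerun):

```shell
# Bundle the RPM directory produced by repotrack and record a sha256
# checksum for verifying the transfer. Directory name assumed from above.
DIR="nvidia-container-runtime"
mkdir -p "$DIR"
COUNT=$(ls "$DIR"/*.rpm 2>/dev/null | wc -l)
echo "packaging $COUNT rpm(s) from $DIR"
tar -zcf "$DIR.tar.gz" "$DIR"
sha256sum "$DIR.tar.gz" > "$DIR.tar.gz.sha256"
echo "verify on the target with: sha256sum -c $DIR.tar.gz.sha256"
```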

Once packaged, copy the tarball to the target machine and extract it:

tar -zxvf nvidia-container-runtime.tar.gz -C ~/
cd ~/nvidia-container-runtime
# Offline install
rpm -Uvh --force --nodeps *.rpm

Preparing Your GPU Nodes

The following steps must be performed on every GPU node. This section assumes that the NVIDIA driver and nvidia-container-toolkit are already installed, and that nvidia-container-runtime has been set as the default low-level runtime.

Configuring containerd

When running Kubernetes with containerd, edit the configuration file, normally located at /etc/containerd/config.toml, to set nvidia-container-runtime as the default low-level runtime:

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
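
Before restarting containerd, a quick grep can confirm the file really selects the nvidia runtime. A hedged sketch (plain grep, not an official containerd tool), so a pass only means the two key lines are present:

```shell
# Rough config check: returns 0 if the file names nvidia as the default
# runtime and points at the nvidia-container-runtime binary. grep-based,
# so it checks for the two key lines only, not full TOML validity.
check_nvidia_cfg() {
  grep -q 'default_runtime_name = "nvidia"' "$1" &&
  grep -q 'BinaryName = "/usr/bin/nvidia-container-runtime"' "$1"
}

if check_nvidia_cfg /etc/containerd/config.toml 2>/dev/null; then
  echo "containerd config looks OK"
else
  echo "nvidia runtime not (yet) configured"
fi
```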

Then restart containerd:

$ sudo systemctl restart containerd

Enabling GPU Support in Kubernetes

Once the options above have been configured on all GPU nodes in the cluster, you can enable GPU support by deploying the following DaemonSet:

[root@ycloud ~]# cat nvidia-device-plugin.yaml 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: ycloudhub.com/middleware/nvidia-gpu-device-plugin:v0.12.3
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

Deploy it with kubectl apply and check the pods:

[root@ycloud ~]# kubectl apply -f nvidia-device-plugin.yaml 
daemonset.apps/nvidia-device-plugin-daemonset unchanged
[root@ycloud ~]# kubectl get po -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-4hf89                      1/1     Running   0             81m
nvidia-device-plugin-daemonset-6v4k2                      1/1     Running   0             81m
nvidia-device-plugin-daemonset-lvgmd                      1/1     Running   0             81m

Verification

[root@ycloud ~]#  kubectl describe nodes ycloud
......
Capacity:
  cpu:                32
  ephemeral-storage:  458291312Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131661096Ki
  nvidia.com/gpu:     2
  pods:               110
Allocatable:
  cpu:                32
  ephemeral-storage:  422361272440
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131558696Ki
  nvidia.com/gpu:     2
  pods:               110
......
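
With the capacity showing up, a final end-to-end check is to schedule a pod that actually requests a GPU. The sketch below only writes the manifest (the pod name and CUDA image tag are illustrative assumptions); apply it and read the logs to see nvidia-smi output from inside the container:

```shell
# Write a minimal test pod that requests one GPU and just runs nvidia-smi.
# Pod name and image tag are illustrative; any CUDA base image works.
cat > gpu-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
echo "next: kubectl apply -f gpu-test.yaml && kubectl logs -f gpu-test"
```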

:zap: With the plugin in place, the Notebook we created in Kubeflow came up healthy as well. A happy ending!
