Date: 2024-02-27 17:51:58

Pre-installation preparation:
Check the graphics card, OS release, and kernel information
cat /etc/centos-release
lshw -numeric -C display
yum install pciutils
lspci | grep -i vga
lspci | grep -i nvidia
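
Tesla-class compute cards are often listed by lspci as a "3D controller" rather than a VGA device, so the vga filter alone can miss them. A minimal sketch of a check that looks only for the vendor name; has_nvidia_gpu is a hypothetical helper name, not part of pciutils:

```shell
# has_nvidia_gpu (hypothetical helper): reads lspci-style text on stdin and
# succeeds if any line mentions NVIDIA, whatever the device class is.
has_nvidia_gpu() {
    grep -qi 'nvidia'
}
```

Typical use would be: lspci | has_nvidia_gpu && echo "NVIDIA device present"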

1. Install the build environment: gcc, kernel-devel, kernel-headers (the "kernel-devel-uname-r == $(uname -r)" spec ensures that the installed kernel-devel package matches the running kernel version)
yum -y install gcc kernel-devel "kernel-devel-uname-r == $(uname -r)" dkms

2. Check that the kernel version and the kernel source (kernel-devel) version match (if they do not, upgrade with yum until they do)

ls /boot | grep vmlinu

rpm -aq | grep kernel-devel
The two versions should match.
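
The comparison above is done by eye. A small sketch of the same check as an exit code, assuming rpm prints package names in the form kernel-devel-<release> where <release> is what uname -r reports; kernel_devel_matches is a hypothetical helper:

```shell
# kernel_devel_matches (hypothetical helper): reads installed package names on
# stdin (one per line) and succeeds only if kernel-devel for release $1 is present.
kernel_devel_matches() {
    grep -Fxq "kernel-devel-$1"
}
```

Typical use would be: rpm -qa | kernel_devel_matches "$(uname -r)" || echo "kernel-devel does not match the running kernel"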
Remove the other kernel versions and rebuild the boot configuration
grub2-set-default 0
grub2-mkconfig -o /boot/grub2/grub.cfg
Reboot
reboot
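
The kernel-removal step above does not list a command. One way to find the candidates, sketched under the assumption that rpm -q kernel prints names of the form kernel-<release>; old_kernels is a hypothetical helper, and its output should be reviewed before anything is removed:

```shell
# old_kernels (hypothetical helper): reads `rpm -q kernel` output on stdin and
# prints every kernel package except the one for release $1 (the running kernel).
old_kernels() {
    grep -Fxv "kernel-$1"
}
```

Typical use would be: rpm -q kernel | old_kernels "$(uname -r)", after which each listed package can be passed to yum remove.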
Check whether the nouveau driver is loaded (if the lsmod command is missing, install it with yum)
lsmod | grep  nouveau
Blacklist the stock nouveau driver

Edit dist-blacklist.conf:
vim /lib/modprobe.d/dist-blacklist.conf

Comment out the nvidiafb entry:
#blacklist nvidiafb

Then add the following lines:
blacklist nouveau
options nouveau modeset=0
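
As an alternative to editing dist-blacklist.conf by hand, the same two directives can be written to a separate drop-in file. This is a sketch of a different approach from the one above, not the method these notes follow; the function name and the default file name blacklist-nouveau.conf are assumptions (modprobe reads .conf files under /etc/modprobe.d).

```shell
# write_nouveau_blacklist (hypothetical helper): writes the two nouveau
# directives to the file given as $1, or to an assumed drop-in path by default.
write_nouveau_blacklist() {
    conf="${1:-/etc/modprobe.d/blacklist-nouveau.conf}"
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$conf"
}
```

Called without an argument (as root) it writes the drop-in file; the initramfs still has to be rebuilt afterwards.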

3. Rebuild the initramfs image (this generates a new initramfs, not a new kernel; the new image does not load the nouveau driver at boot)

mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak

dracut /boot/initramfs-$(uname -r).img $(uname -r)
Set the default target (runlevel) to text mode

systemctl set-default multi-user.target
Reboot
reboot
Run lsmod | grep nouveau; if it prints nothing, nouveau is not loaded
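
For use in a script, the same check can be expressed as an exit code. A minimal sketch that matches the module name exactly in the first column of lsmod output; nouveau_loaded is a hypothetical helper:

```shell
# nouveau_loaded (hypothetical helper): reads lsmod output on stdin and succeeds
# only if a module named exactly "nouveau" appears in the first column.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}
```

Typical use would be: lsmod | nouveau_loaded && echo "nouveau is still loaded"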



I. Install the NVIDIA graphics driver
Download the driver:
https://www.nvidia.cn/drivers/unix/
Make the installer executable (+x)
chmod +x NVIDIA-Linux-x86_64-455.45.01.run
Run it
./NVIDIA-Linux-x86_64-455.45.01.run --kernel-source-path=/usr/src/kernels/3.10.0-1160.15.2.el7.x86_64/ --no-drm
 (Note: --no-drm must be included, otherwise the installation reports: ERROR: The nvidia-drm kernel module failed to load. This kernel
 module is required for the proper operation of DRM-KMS. If you do not need to use DRM-KMS, you can try to install
 this driver package again with the '--no-drm' option.)
Answer yes to the prompts. When the installation completes, reboot
reboot
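
The --kernel-source-path above is hard-coded for one kernel release. A sketch of deriving it from the running kernel instead, assuming kernel-devel installs its tree under /usr/src/kernels/<release> as the path above suggests; kernel_source_path is a hypothetical helper:

```shell
# kernel_source_path (hypothetical helper): prints the kernel-devel source
# directory for release $1, or for the running kernel if no argument is given.
kernel_source_path() {
    printf '/usr/src/kernels/%s\n' "${1:-$(uname -r)}"
}
```

The installer could then be invoked as: ./NVIDIA-Linux-x86_64-455.45.01.run --kernel-source-path="$(kernel_source_path)" --no-drm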

Run nvidia-smi; if it prints the GPU information, the NVIDIA driver was installed successfully
Sat Feb 27 15:39:09 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   38C    P0    66W / 250W |      0MiB / 22945MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
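
The table is awkward to parse in scripts. nvidia-smi also has a CSV query mode (--query-gpu together with --format=csv,noheader); the sketch below pulls the driver version out of one such line. The helper name is hypothetical and the field order is whatever the query lists.

```shell
# driver_version_from_csv (hypothetical helper): reads lines produced by
#   nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
# on stdin and prints the second field (the driver version) of each line.
driver_version_from_csv() {
    awk -F', ' '{ print $2 }'
}
```

Typical use would be: nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader | driver_version_from_csv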

Install the Docker service
Install dependencies:
  yum install -y yum-utils device-mapper-persistent-data lvm2
Add the repo file
   yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
List the available versions:
  yum list docker-ce --showduplicates | sort -r
Install a specific Docker version
    yum install docker-ce-18.09.6-3.el7 docker-ce-cli-18.09.6 containerd.io
Start Docker
 systemctl start docker
 systemctl status docker
 systemctl enable docker
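
The Docker version chosen here matters for the verification step later in these notes: the --gpus flag of docker run was introduced in Docker 19.03, so an 18.09 daemon will reject it. A minimal sketch of a version check; docker_supports_gpus_flag is a hypothetical helper that only compares the major and minor numbers:

```shell
# docker_supports_gpus_flag (hypothetical helper): succeeds if the version
# string in $1 (e.g. 19.03.5) is 19.03 or newer.
docker_supports_gpus_flag() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    # Drop one leading zero so that "03" and "09" compare as plain integers.
    minor=${minor#0}
    [ "$major" -gt 19 ] || { [ "$major" -eq 19 ] && [ "${minor:-0}" -ge 3 ]; }
}
```

Typical use would be: docker_supports_gpus_flag "$(docker version --format '{{.Server.Version}}')" || echo "use --runtime=nvidia instead of --gpus"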
 

Install nvidia-docker
 References:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html   Official installation guide
https://nvidia.github.io/libnvidia-container/ (may require a proxy to reach)

Set up the key and import the repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
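
The repo URL depends on the value of $distribution, which is built from the ID and VERSION_ID fields of /etc/os-release. A sketch for inspecting that value before running curl; distribution_id is a hypothetical helper, and the optional file argument exists only so that it can be tried against a sample file:

```shell
# distribution_id (hypothetical helper): sources an os-release style file in a
# subshell and prints ID followed by VERSION_ID, the value used in the repo URL.
distribution_id() {
    ( . "${1:-/etc/os-release}" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}
```

If the printed value has no matching directory on nvidia.github.io, the curl command above will write an error page instead of a repo file, so it is worth checking first.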

Expire the yum cache
yum clean expire-cache
Rebuild the cache
yum makecache
List the installable nvidia-docker versions:
yum search --showduplicates nvidia-docker
Install nvidia-docker (a specific version can be given; by default the latest stable version is installed)
yum install -y nvidia-docker2
Edit daemon.json
root@slash:/home/slash# cat  /etc/docker/daemon.json
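# Note: default-runtime must be set, otherwise Docker containers started by k8s cannot find nvidia-smi (this comment line is a note, not part of the file; JSON does not allow comments)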
{
   "registry-mirrors": ["https://5twf62k1.mirror.aliyuncs.com"],
   "default-runtime": "nvidia",
   "runtimes": {
       "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
       }
   }
}

Pay particular attention to the path value above: it must point to the installed nvidia-container-runtime binary
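
A quick text-level check that the default-runtime entry is present can catch a missing key before the daemon is restarted. This is only a sketch: it uses a regular expression rather than a JSON parser, so it does not validate the file, and has_nvidia_default_runtime is a hypothetical helper.

```shell
# has_nvidia_default_runtime (hypothetical helper): succeeds if the file given
# as $1 (default /etc/docker/daemon.json) sets "default-runtime" to "nvidia".
has_nvidia_default_runtime() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "${1:-/etc/docker/daemon.json}"
}
```

Once the daemon has been restarted, docker info --format '{{.DefaultRuntime}}' should print nvidia if the setting took effect.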
Restart the Docker daemon
 systemctl daemon-reload && systemctl restart docker
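Verify nvidia-docker2
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
(The --gpus flag requires Docker 19.03 or later. With the docker-ce 18.09.6 installed above, use --runtime=nvidia in place of --gpus all, or the nvidia-docker wrapper.) Output like the listing below indicates a successful installation.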

Run    nvidia-docker run --rm nvidia/cuda nvidia-smi
Mon Mar  1 02:47:16 2021    
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   40C    P0    66W / 250W |      0MiB / 22945MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Use nvidia-docker to view GPU information:
nvidia-docker run --rm nvidia/cuda nvidia-smi
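
Because the container uses the host's driver, the driver version reported inside the container should equal the one reported on the host. A sketch that extracts the version from the nvidia-smi banner so the two can be compared; driver_version_from_banner is a hypothetical helper and relies on the "Driver Version:" label shown in the output above:

```shell
# driver_version_from_banner (hypothetical helper): reads nvidia-smi table
# output on stdin and prints the value that follows "Driver Version:".
driver_version_from_banner() {
    sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p'
}
```

A comparison could look like: [ "$(nvidia-smi | driver_version_from_banner)" = "$(nvidia-docker run --rm nvidia/cuda nvidia-smi | driver_version_from_banner)" ] && echo "host and container agree"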