Pre-installation preparation:
Check the graphics card, OS release, and kernel information
cat /etc/centos-release
lshw -numeric -C display
yum install pciutils
lspci | grep -i vga
lspci | grep -i nvidia
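Compute cards such as the Tesla series are usually listed by lspci as a "3D controller" rather than a "VGA compatible controller", so filtering on vga alone may print nothing. The following is a minimal sketch of a vendor-based check; the helper name has_nvidia_gpu is hypothetical and the sample line in the usage comment is illustrative, not captured from the original machine.

```shell
# Hypothetical helper: succeeds if lspci-style text on stdin lists an NVIDIA display device.
has_nvidia_gpu() {
    # Keep both display device classes, then filter by vendor name (case-insensitive).
    grep -Ei 'vga compatible controller|3d controller' | grep -qi 'nvidia'
}

# Usage on a real host (requires pciutils):
#   lspci | has_nvidia_gpu && echo "NVIDIA GPU detected"
# Illustrative input line that would match:
#   0b:00.0 3D controller: NVIDIA Corporation GM200GL [Tesla M40] (rev a1)
```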
1. Install the build environment: gcc, kernel-devel, kernel-headers (the "kernel-devel-uname-r == $(uname -r)" argument makes yum install the kernel-devel package that matches the currently running kernel)
yum -y install gcc kernel-devel "kernel-devel-uname-r == $(uname -r)" dkms
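2. Check that the kernel version and the kernel source version match (if they do not, upgrade with yum until they do). The version shown by
ls /boot | grep vmlinu
should match the version shown by
rpm -aq | grep kernel-devel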
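The comparison above can also be scripted. This is a minimal sketch, assuming the package name has the usual form kernel-devel-<release>, where <release> is the output of uname -r; the helper name kernel_devel_matches is hypothetical.

```shell
# Hypothetical helper: succeeds if a kernel-devel package for the given kernel
# release appears in the rpm package list read from stdin.
kernel_devel_matches() {
    running_kernel="$1"
    # -F: fixed string, -x: match the whole line, -q: no output, exit status only.
    grep -Fxq "kernel-devel-${running_kernel}"
}

# Usage on a real host:
#   rpm -qa | kernel_devel_matches "$(uname -r)" || echo "kernel-devel does not match the running kernel"
```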
Remove other kernel versions and regenerate the GRUB boot configuration
grub2-set-default 0
grub2-mkconfig -o /boot/grub2/grub.cfg
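Reboot:
reboot
Check whether the nouveau driver is loaded (if the lsmod command is missing, install it with yum; on CentOS 7 it is provided by the kmod package)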
lsmod | grep nouveau
Disable the nouveau driver that ships with the system
Edit the dist-blacklist.conf file:
vim /lib/modprobe.d/dist-blacklist.conf
Comment out nvidiafb:
#blacklist nvidiafb
Then add the following lines:
blacklist nouveau
options nouveau modeset=0
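If this step is scripted instead of done by hand in vim, the two lines should be added only once. A minimal sketch follows, assuming the target file ends with a newline; the helper name ensure_line is hypothetical.

```shell
# Hypothetical helper: append a line to a file only if that exact line is not already present.
ensure_line() {
    file="$1"
    line="$2"
    # Create the file if it does not exist yet.
    touch "$file"
    # -F: fixed string, -x: whole-line match; append only when no match is found.
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage as root, with the file and lines from the step above:
#   ensure_line /lib/modprobe.d/dist-blacklist.conf 'blacklist nouveau'
#   ensure_line /lib/modprobe.d/dist-blacklist.conf 'options nouveau modeset=0'
```

Running the helper twice leaves the file unchanged the second time, so the step is safe to repeat.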
3. Rebuild the initramfs image (the new image will not load the nouveau driver at boot)
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)
Set the default run level to text mode (multi-user target)
systemctl set-default multi-user.target
Reboot
reboot
Run lsmod | grep nouveau; no output confirms that nouveau is not loaded
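A plain grep also matches modules that merely mention nouveau in their "Used by" column. As a sketch of a stricter check, the helper below (the name nouveau_loaded is hypothetical) looks only at the module name in the first column.

```shell
# Hypothetical helper: succeeds if lsmod-style text on stdin shows nouveau as a loaded module.
nouveau_loaded() {
    # The module name is the first column of lsmod output.
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Usage on a real host:
#   if lsmod | nouveau_loaded; then echo "nouveau is still loaded"; else echo "nouveau is not loaded"; fi
```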
I. Install the NVIDIA graphics driver
Download the driver:
https://www.nvidia.cn/drivers/unix/
Make the installer executable:
chmod +x NVIDIA-Linux-x86_64-455.45.01.run
Run the installer:
./NVIDIA-Linux-x86_64-455.45.01.run --kernel-source-path=/usr/src/kernels/3.10.0-1160.15.2.el7.x86_64/ --no-drm
(Note: --no-drm must be included, otherwise the installer fails with: ERROR: The nvidia-drm kernel module failed to load. This kernel module is required for the proper operation of DRM-KMS. If you do not need to use DRM-KMS, you can try to install this driver package again with the '--no-drm' option.)
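The installer command above hardcodes one kernel release in --kernel-source-path. A small sketch of deriving that path from the running kernel follows, assuming kernel-devel installs its tree under /usr/src/kernels/<release>/ as in the path above; the helper name kernel_source_path is hypothetical.

```shell
# Hypothetical helper: build the kernel source path for a given kernel release string.
kernel_source_path() {
    printf '/usr/src/kernels/%s/\n' "$1"
}

# Usage on a real host, with the installer file name from the step above:
#   ./NVIDIA-Linux-x86_64-455.45.01.run --kernel-source-path="$(kernel_source_path "$(uname -r)")" --no-drm
```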
Answer yes at the prompts; once the installation completes, reboot
reboot
Run nvidia-smi; if the GPU information appears, the NVIDIA driver was installed successfully
Sat Feb 27 15:39:09 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   38C    P0    66W / 250W |      0MiB / 22945MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
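The table above is meant for people to read. For a scripted check, nvidia-smi can print selected fields as CSV through its --query-gpu and --format options. The sketch below extracts the driver version from such a line; the helper name driver_version_field is hypothetical, and the field order is an assumption that must match the query in the usage comment.

```shell
# Hypothetical helper: print the second comma-separated field (the driver version
# when the query order is name,driver_version,memory.total) for each line on stdin.
driver_version_field() {
    awk -F', ' '{ print $2 }'
}

# Usage on a real host:
#   nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader | driver_version_field
```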
Install the Docker service
Install dependencies:
yum install -y yum-utils device-mapper-persistent-data lvm2
Add the repo file
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
List the available versions:
yum list docker-ce --showduplicates | sort -r
Install a specific Docker version
yum install docker-ce-18.09.6-3.el7 docker-ce-cli-18.09.6 containerd.io
Start Docker
systemctl start docker
systemctl status docker
systemctl enable docker
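The docker run --gpus flag used in the verification step further down was introduced in Docker 19.03, so it can help to check the installed version before relying on it. A minimal sketch follows, assuming GNU sort with the -V option is available (it is part of coreutils on CentOS); the helper names version_ge and docker_supports_gpus_flag are hypothetical.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    # sort -V orders version strings; if the smaller of the two is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: succeeds when the given Docker version supports "docker run --gpus".
docker_supports_gpus_flag() {
    version_ge "$1" "19.03"
}

# Usage on a real host:
#   docker_supports_gpus_flag "$(docker version --format '{{.Server.Version}}')" \
#       || echo "use the nvidia-docker wrapper instead of --gpus"
```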
Install nvidia-docker
References:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html (official installation guide)
https://nvidia.github.io/libnvidia-container/ (may require a proxy to reach)
Set up the key and import the repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
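The distribution variable above is built by sourcing /etc/os-release and joining its ID and VERSION_ID fields, which yields a value such as centos7. The sketch below shows the same derivation against an arbitrary file so the value can be inspected before it is placed in the URL; the helper name distribution_of is hypothetical.

```shell
# Hypothetical helper: print ID followed by VERSION_ID from an os-release style file.
distribution_of() {
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "$1"; printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Usage on a real host:
#   distribution_of /etc/os-release
```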
Expire the yum cache
yum clean expire-cache
Rebuild the cache
yum makecache
Find the installable nvidia-docker versions:
yum search --showduplicates nvidia-docker
Install nvidia-docker (a version can be specified; by default the latest stable version is installed)
yum install -y nvidia-docker2
Edit the daemon.json file. default-runtime must be set; otherwise Docker containers started by k8s cannot find nvidia-smi:
root@slash:/home/slash# cat /etc/docker/daemon.json
{
  "registry-mirrors": ["https://5twf62k1.mirror.aliyuncs.com"],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
Pay particular attention to the path value above; it must point to the installed nvidia-container-runtime binary
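Because a missing default-runtime entry only shows up later as a missing nvidia-smi inside containers, a quick check of the file before restarting Docker can save time. The sketch below is a rough text match and not a JSON parser, so it does not validate the file as a whole; the helper name has_nvidia_default_runtime is hypothetical.

```shell
# Hypothetical helper: succeeds if the given file sets "default-runtime" to "nvidia".
has_nvidia_default_runtime() {
    # Allow optional whitespace around the colon; this is a text match only.
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Usage on a real host:
#   has_nvidia_default_runtime /etc/docker/daemon.json || echo "default-runtime is not set to nvidia"
# After Docker has been restarted, the active value can be inspected with:
#   docker info | grep -i 'default runtime'
```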
Restart the Docker daemon
systemctl daemon-reload && systemctl restart docker
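Verify nvidia-docker2
Either of the following commands should print the GPU table shown below, which indicates a successful installation. The --gpus flag requires Docker 19.03 or later, so with the 18.09.6 release installed above use the nvidia-docker form:
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
nvidia-docker run --rm nvidia/cuda nvidia-smi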
Mon Mar  1 02:47:16 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   40C    P0    66W / 250W |      0MiB / 22945MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Use nvidia-docker to view GPU information:
nvidia-docker run --rm nvidia/cuda nvidia-smi
