I am trying to run MPI and CUDA code on a cluster. The code works fine on a single machine, but when I try to run it on the cluster I get this error:
error while loading shared libraries: libcudart.so.4: cannot open shared object file: No such file or directory
I checked my PATH and LD_LIBRARY_PATH and they look OK. My .bashrc file contains the following entries:
export PATH=$PATH:/usr/local/lib/:/usr/local/lib/openmpi:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/lib/openmpi/:/usr/local/cuda/lib
All the machines have the same installation of CUDA and OpenMPI.
I also have /usr/local/cuda/lib in /etc/ld.so.conf.
Can anyone help me with this? This problem is really annoying.
Thanks.
1 Solution
#1
5
If you are submitting a batch job on a cluster, add commands like
echo $LD_LIBRARY_PATH
ldd ./your_app
to your batch script. This should help to debug the problem.
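A small sketch of how the ldd output could be filtered in the batch script so that only the unresolved libraries end up in the job log. The helper name missing_libs is made up for illustration, and it assumes the usual glibc ldd format, where an unresolved library is printed as "name => not found".

```shell
#!/bin/sh
# Print the name of every shared library that ldd reports as unresolved.
# Reads ldd output on stdin, so it also works on output saved from a job log.
# missing_libs is a hypothetical helper name, not part of any tool.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

In the batch script this would be called as `ldd ./your_app | missing_libs`. If it prints libcudart.so.4, the node running the job cannot find the CUDA runtime with the environment the job actually receives, whatever the login shell shows.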
Also make sure that you export environment variables in mpirun. For instance, in OpenMPI you would run your code with
mpirun -x LD_LIBRARY_PATH ...
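To make the forwarding hard to forget, the launch could be wrapped in a small function. This is only a sketch: run_with_env is a hypothetical name, and it assumes Open MPI's mpirun, where -x exports an existing environment variable to the launched processes and -np sets the number of processes.

```shell
#!/bin/sh
# Launch an MPI program with PATH and LD_LIBRARY_PATH forwarded to every rank.
# run_with_env is a hypothetical wrapper; -x and -np are Open MPI mpirun options.
# Usage: run_with_env <number of processes> <program> [program arguments...]
run_with_env() {
    nprocs=$1
    shift
    mpirun -x PATH -x LD_LIBRARY_PATH -np "$nprocs" "$@"
}
```

For example, `run_with_env 4 ./your_app` starts four ranks. Running `run_with_env 2 sh -c 'echo "$(hostname): $LD_LIBRARY_PATH"'` first is a cheap way to see what each node really gets.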