Why am I getting "cuMemAlloc failed: not initialized" even though I am initializing correctly?

I am having some trouble with my Django/Celery/PyCuda setup. I am using PyCuda for some image processing on an Amazon EC2 G2 instance. Here is the deviceQuery output for my CUDA-capable GRID K520 card:

Detected 1 CUDA Capable device(s)

Device 0: "GRID K520"
CUDA Driver Version / Runtime Version          6.0 / 6.0
CUDA Capability Major/Minor version number:    3.0
Total amount of global memory:                 4096 MBytes (4294770688 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
GPU Clock rate:                                797 MHz (0.80 GHz)
Memory Clock rate:                             2500 Mhz
Memory Bus Width:                              256-bit
L2 Cache Size:                                 524288 bytes
Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  2048
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             512 bytes
Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
Run time limit on kernels:                     No
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       Yes
Alignment requirement for Surfaces:            Yes
Device has ECC support:                        Disabled
Device supports Unified Addressing (UVA):      Yes
Device PCI Bus ID / PCI location ID:           0 / 3
Compute Mode:
 < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0,   NumDevs = 1, Device0 = GRID K520
Result = PASS

I am using a pretty out-of-the-box Celery config. I have a set of tasks defined in utils/tasks.py, which were tested and worked before I attempted to use PyCuda. I installed PyCuda via pip.

At the top of the file that I am having trouble with, I do my standard imports:

from celery import task
# other imports
import os
try:
    import Image
except Exception:
    from PIL import Image
import time

#Cuda imports
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy

A remote server initiates a task, which follows this basic workflow:

@task()
def photo_function(photo_id,...):
    print 'Got photo...'
    ... Do some stuff ...
    result = do_photo_manipulation(photo_id)
    return result

def do_photo_manipulation(photo_id):
    im = Image.open(inPath)
    px = numpy.array(im)
    px = px.astype(numpy.float32)
    d_px = cuda.mem_alloc(px.nbytes)
    ... (Do stuff with the pixel array) ...
    return new_image

This works if I run it in shell_plus (i.e., ./manage.py shell_plus) and if I run it as a standalone process outside of Django and Celery. It fails only when it runs as a Celery task, with the error: cuMemAlloc failed: not initialized

I have looked at other solutions for a while, and tried moving the import statement that does the initialization into the function itself. I have also added a wait() statement, to rule out the GPU simply not being ready to do work.

Here is an answer that suggests the error comes from not importing pycuda.autoinit, which I have done: http://comments.gmane.org/gmane.comp.python.cuda/1975

Any help here would be appreciated!

If I need to provide any more information, just let me know!

EDIT: Here is the test code:

def CudaImageShift(imageIn, mode = "luminosity" , log = 0):

    if log == 1 :
        print ("----------> CUDA CONVERSION")

#    print "ENVIRON: "
#    import os
#    print os.environ

    print 'AUTOINIT'
    print pycuda.autoinit

    print 'Making context...'
    context = make_default_context()
    print 'Context created.'
    totalT0 = time.time()

    print 'Doing test run...'
    a = numpy.random.randn(4,4)
    a = a.astype(numpy.float32)
    print 'Test mem alloc'
    a_gpu = cuda.mem_alloc(a.nbytes)
    print 'MemAlloc complete, test mem copy'
    cuda.memcpy_htod(a_gpu, a)
    print 'memcopy complete'


[2014-07-15 14:52:20,469: WARNING/Worker-1] cuDeviceGetCount failed: not initialized

1 Solution

#1

I believe the problem you are experiencing is related to CUDA contexts. As of CUDA 4.0, a CUDA context is required per process and per device.

Behind the scenes, Celery spawns separate processes for its task workers. When such a process starts, it does not have a context available. In PyCUDA, context creation happens in the autoinit module. That is why your code works when you run it standalone (no extra process is created, so the context is valid), or when you put the import of autoinit inside the CUDA task (the worker process then gets its own context; I believe you tried that already).

If you want to avoid the import, you may be able to use make_default_context from pycuda.tools, although I'm not very familiar with PyCUDA and how it handles context management.

from pycuda.tools import make_default_context

@task()
def photo_function(photo_id,...):
  ctx = make_default_context()
  print 'Got photo...'
  ... Do some stuff ...
  result = do_photo_manipulation(photo_id)
  return result
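
The task above creates a context but never releases it. A minimal sketch of pairing creation with release is shown below. The names scoped_context and FakeContext are hypothetical and exist only for illustration; FakeContext stands in for a real context so the sketch runs without a GPU. With PyCUDA, the assumption is that make_context would be make_default_context, called after cuda.init() has run in the worker process (to my knowledge pycuda.autoinit performs both steps, which would also explain the "cuDeviceGetCount failed: not initialized" message in the question's log).

```python
from contextlib import contextmanager


@contextmanager
def scoped_context(make_context):
    """Create a context, hand it to the caller, and always release it."""
    ctx = make_context()
    try:
        yield ctx
    finally:
        # Runs even if the task body raises, so the context is not leaked.
        ctx.pop()


class FakeContext(object):
    """Stand-in for a GPU context (assumption: the real one offers pop())."""

    def __init__(self):
        self.popped = False

    def pop(self):
        self.popped = True


# Example with the stand-in; a real task would pass a PyCUDA factory instead.
with scoped_context(FakeContext) as ctx:
    pass  # GPU work would go here
```

Inside the Celery task, the body of the with block would be the call to do_photo_manipulation(photo_id).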

Beware that context creation is expensive. CUDA deliberately front-loads a lot of work into context creation in order to avoid unexpected delays later on. That is why there is a stack of contexts that you can push and pop between host threads (but not between processes). If your kernel code is very fast, the context create/destroy overhead may dominate and show up as delays.
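
One way to pay the creation cost once per worker process instead of once per task is to cache the context lazily and rebuild it only when the process ID changes. The following is a sketch under stated assumptions, not a definitive implementation: PerProcessResource is a hypothetical helper, and the PyCUDA factory in the comment is untested here because it needs a GPU.

```python
import os


class PerProcessResource(object):
    """Build a resource lazily, once per process; rebuild after a fork."""

    def __init__(self, factory):
        self._factory = factory    # callable that creates the resource
        self._owner_pid = None     # PID of the process that created it
        self._resource = None

    def get(self):
        pid = os.getpid()
        if self._owner_pid != pid:
            # First use in this process (or we are in a forked child whose
            # inherited resource belongs to the parent): create a fresh one.
            self._resource = self._factory()
            self._owner_pid = pid
        return self._resource


# Hypothetical PyCUDA factory (assumption, not executed in this sketch):
# def make_gpu_context():
#     import pycuda.driver as cuda
#     from pycuda.tools import make_default_context
#     cuda.init()
#     return make_default_context()
#
# gpu_context = PerProcessResource(make_gpu_context)
```

A task would then call gpu_context.get() at its start. Repeated tasks in the same worker reuse the context, while a newly forked worker creates its own on first use.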
