如何测量OpenGL中的峰值内存带宽?

时间:2021-09-11 17:00:23

Just to get an idea of what kind of speeds I should be expecting I have been trying to benchmark transfer between global memory and shaders, rather than relying on GPU spec sheets. However I can't get close to the theoretical maximum. In fact I'm out by a factor of 50!.

只是为了了解我应该期待什么样的速度,我一直在尝试在全局内存和着色器之间进行基准测试,而不是依赖于GPU规格表。但是我无法接近理论上的最大值。事实上,我出了50倍!

I'm using a GTX Titan X, which is said to have 336.5GB/s. Linux x64 driver 352.21.

我正在使用GTX Titan X,据说它有336.5GB / s。 Linux x64驱动程序352.21。

I found a CUDA benchmark here which gives me ~240–250GB/s (this is more what I expect).

我在这里找到了一个CUDA基准测试,它给了我~240-250GB / s(这更符合我的预期)。

I'm trying to match exactly what they do with shaders. I've tried vertex shaders, compute shaders, accessing buffer objects via image_load_store and NV_shader_buffer_store, with floats, vec4s, loops inside the shader (with coalesced addressing within the work group) and various methods of timing. I'm stuck at ~7GB/s (see the update below).

我正在尝试将它们与着色器完全匹配。我尝试过顶点着色器,计算着色器,通过image_load_store和NV_shader_buffer_store访问缓冲区对象,在着色器内部使用浮点数,vec4s,循环(在工作组内具有合并的寻址)和各种计时方法。我卡在~7GB / s(参见下面的更新)。

Why is GL so much slower? Am I doing something wrong and if so, how should it be done?

为什么GL这么慢?我做错了什么,如果是的话,应该怎么做?

Here's my MWE with three methods (1. vertex shader with image_load_store, 2. vertex shader with bindless graphics, 3. compute shader with bindless graphics):

这是我的MWE有三种方法(1.顶点着色器与image_load_store,2。顶点着色器与无绑定图形,3。计算着色器与无绑定图形):

//#include <windows.h>
#include <assert.h>
#include <stdio.h>
#include <memory.h>
#include <GL/glew.h>
#include <GL/glut.h>

const char* imageSource =
    "#version 440\n"
    "uniform layout(r32f) imageBuffer data;\n"
    "uniform float val;\n"
    "void main() {\n"
    "   imageStore(data, gl_VertexID, vec4(val, 0.0, 0.0, 0.0));\n"
    "   gl_Position = vec4(0.0);\n"
    "}\n";

const char* bindlessSource =
    "#version 440\n"
    "#extension GL_NV_gpu_shader5 : enable\n"
    "#extension GL_NV_shader_buffer_load : enable\n"
    "uniform float* data;\n"
    "uniform float val;\n"
    "void main() {\n"
    "   data[gl_VertexID] = val;\n"
    "   gl_Position = vec4(0.0);\n"
    "}\n";

const char* bindlessComputeSource =
    "#version 440\n"
    "#extension GL_NV_gpu_shader5 : enable\n"
    "#extension GL_NV_shader_buffer_load : enable\n"
    "layout(local_size_x = 256) in;\n"
    "uniform float* data;\n"
    "uniform float val;\n"
    "void main() {\n"
    "   data[gl_GlobalInvocationID.x] = val;\n"
    "}\n";

GLuint compile(GLenum type, const char* shaderSrc)
{
    GLuint shader = glCreateShader(type);
    glShaderSource(shader, 1, (const GLchar**)&shaderSrc, NULL);
    glCompileShader(shader);
    int success = 0;
    int loglen = 0;
    glGetShaderiv(shader, GL_COMPILE_STATUS, &success);
    glGetShaderiv(shader, GL_INFO_LOG_LENGTH, &loglen);
    GLchar* log = new GLchar[loglen];
    glGetShaderInfoLog(shader, loglen, &loglen, log);
    if (!success)
    {
        printf("%s\n", log);
        exit(0);
    }
    GLuint program = glCreateProgram();
    glAttachShader(program, shader);
    glLinkProgram(program);
    return program;
}

GLuint timerQueries[2];
void start()
{
    glGenQueries(2, timerQueries);
    glQueryCounter(timerQueries[0], GL_TIMESTAMP);
}

float stop()
{
    glMemoryBarrier(GL_ALL_BARRIER_BITS);
    GLsync sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glWaitSync(sync, 0, GL_TIMEOUT_IGNORED);
    glQueryCounter(timerQueries[1], GL_TIMESTAMP);
    GLint available = 0;
    while (!available) //sometimes gets stuck here for whatever reason
        glGetQueryObjectiv(timerQueries[1], GL_QUERY_RESULT_AVAILABLE, &available);
    GLuint64 a, b;
    glGetQueryObjectui64v(timerQueries[0], GL_QUERY_RESULT, &a);
    glGetQueryObjectui64v(timerQueries[1], GL_QUERY_RESULT, &b);
    glDeleteQueries(2, timerQueries);
    return b - a;
}

int main(int argc, char** argv)
{
    float* check;
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB | GLUT_DEPTH);
    glutCreateWindow("test");
    glewInit();

    int bufferSize = 64 * 1024 * 1024; //64MB
    int loops = 500;

    glEnable(GL_RASTERIZER_DISCARD);

    float* dat = new float[bufferSize/sizeof(float)];
    memset(dat, 0, bufferSize);

    //create a buffer with data
    GLuint buffer;
    glGenBuffers(1, &buffer);
    glBindBuffer(GL_TEXTURE_BUFFER, buffer);
    glBufferData(GL_TEXTURE_BUFFER, bufferSize, NULL, GL_STATIC_DRAW);

    //get a bindless address
    GLuint64 address;
    glMakeBufferResidentNV(GL_TEXTURE_BUFFER, GL_READ_WRITE);
    glGetBufferParameterui64vNV(GL_TEXTURE_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &address);

    //make a texture alias for it
    GLuint bufferTexture;
    glGenTextures(1, &bufferTexture);
    glBindTexture(GL_TEXTURE_BUFFER, bufferTexture);
    glTexBuffer(GL_TEXTURE_BUFFER, GL_R32F, buffer);
    glBindImageTextureEXT(0, bufferTexture, 0, GL_FALSE, 0, GL_READ_WRITE, GL_R32F);

    //compile the shaders
    GLuint imageShader = compile(GL_VERTEX_SHADER, imageSource);
    GLuint bindlessShader = compile(GL_VERTEX_SHADER, bindlessSource);
    GLuint bindlessComputeShader = compile(GL_COMPUTE_SHADER, bindlessComputeSource);

    //warm-up and check values
    glBufferData(GL_TEXTURE_BUFFER, bufferSize, dat, GL_STATIC_DRAW);
    glUseProgram(imageShader);
    glUniform1i(glGetUniformLocation(imageShader, "data"), 0);
    glUniform1f(glGetUniformLocation(imageShader, "val"), 1.0f);
    glDrawArrays(GL_POINTS, 0, bufferSize/sizeof(float));
    glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
    //check = (float*)glMapBuffer(GL_TEXTURE_BUFFER, GL_READ_ONLY);
    //for (int i = 0; i < bufferSize/sizeof(float); ++i)
    //  assert(check[i] == 1.0f);
    //glUnmapBuffer(GL_TEXTURE_BUFFER);

    glBufferData(GL_TEXTURE_BUFFER, bufferSize, dat, GL_STATIC_DRAW);
    glUseProgram(bindlessShader);
    glProgramUniformui64NV(bindlessShader, glGetUniformLocation(bindlessShader, "data"), address);
    glUniform1f(glGetUniformLocation(bindlessShader, "val"), 1.0f);
    glDrawArrays(GL_POINTS, 0, bufferSize/sizeof(float));
    //glMemoryBarrier(GL_ALL_BARRIER_BITS); //this causes glDispatchCompute to segfault later, so don't uncomment
    //check = (float*)glMapBuffer(GL_TEXTURE_BUFFER, GL_READ_ONLY);
    //for (int i = 0; i < bufferSize/sizeof(float); ++i)
    //  assert(check[i] == 1.0f);
    //glUnmapBuffer(GL_TEXTURE_BUFFER);

    glBufferData(GL_TEXTURE_BUFFER, bufferSize, dat, GL_STATIC_DRAW);
    glUseProgram(bindlessComputeShader);
    glProgramUniformui64NV(bindlessComputeShader, glGetUniformLocation(bindlessComputeShader, "data"), address);
    glUniform1f(glGetUniformLocation(bindlessComputeShader, "val"), 1.0f);
    glDispatchCompute(bufferSize/(sizeof(float) * 256), 1, 1);
    glMemoryBarrier(GL_ALL_BARRIER_BITS);
    //check = (float*)glMapBuffer(GL_TEXTURE_BUFFER, GL_READ_ONLY);
    //for (int i = 0; i < bufferSize/sizeof(float); ++i)
    //  assert(check[i] == 1.0f); //glDispatchCompute doesn't actually write anything with bindless graphics
    //glUnmapBuffer(GL_TEXTURE_BUFFER);
    glFinish();

    //time image_load_store
    glUseProgram(imageShader);
    glUniform1i(glGetUniformLocation(imageShader, "data"), 0);
    glUniform1f(glGetUniformLocation(imageShader, "val"), 1.0f);
    start();
    for (int i = 0; i < loops; ++i)
        glDrawArrays(GL_POINTS, 0, bufferSize/sizeof(float));
    GLuint64 imageTime = stop();
    printf("image_load_store: %.2fGB/s\n", (float)((bufferSize * (double)loops) / imageTime));

    //time bindless
    glUseProgram(bindlessShader);
    glProgramUniformui64NV(bindlessShader, glGetUniformLocation(bindlessShader, "data"), address);
    glUniform1f(glGetUniformLocation(bindlessShader, "val"), 1.0f);
    start();
    for (int i = 0; i < loops; ++i)
        glDrawArrays(GL_POINTS, 0, bufferSize/sizeof(float));
    GLuint64 bindlessTime = stop();
    printf("bindless: %.2fGB/s\n", (float)((bufferSize * (double)loops) / bindlessTime));

    //time bindless in a compute shader
    glUseProgram(bindlessComputeShader);
    glProgramUniformui64NV(bindlessComputeShader, glGetUniformLocation(bindlessComputeShader, "data"), address);
    glUniform1f(glGetUniformLocation(bindlessComputeShader, "val"), 1.0f);
    start();
    for (int i = 0; i < loops; ++i)
        glDispatchCompute(bufferSize/(sizeof(float) * 256), 1, 1);
    GLuint64 bindlessComputeTime = stop();
    printf("bindless compute: %.2fGB/s\n", (float)((bufferSize * (double)loops) / bindlessComputeTime));
    assert(glGetError() == GL_NO_ERROR);
    return 0;
}

My output:

image_load_store: 6.66GB/s
bindless: 6.68GB/s
bindless compute: 6.65GB/s

Some notes:

  1. Compute shaders with bindless graphics don't appear to write anything (the commented out assert fails), or at least the data isn't retrieved with glMapBuffer even though the speed matches the other methods. Using image_load_store in the compute shader works and gives the same speed the vertex shaders (though I thought that'd be one too many permutations to post).
  2. 使用无绑定图形的计算着色器似乎没有写任何东西(注释掉的断言失败),或者至少使用glMapBuffer检索数据,即使速度与其他方法匹配。在计算着色器中使用image_load_store可以工作并给出顶点着色器的相同速度(尽管我认为这是一个过多的排列发布)。

  3. Calling glMemoryBarrier(GL_ALL_BARRIER_BITS) before glDispatchCompute causes a crash in the driver.
  4. 在glDispatchCompute之前调用glMemoryBarrier(GL_ALL_BARRIER_BITS)会导致驱动程序崩溃。

  5. Commenting out the three glBufferData(GL_TEXTURE_BUFFER, bufferSize, dat, GL_STATIC_DRAW);, which are used to check the output, raises the speed of the first two tests to 17GB/s and the compute shader skyrockets to 292GB/s which is much closer to what I'd like but this can't be trusted because of point 1.
  6. 注释掉用于检查输出的三个glBufferData(GL_TEXTURE_BUFFER,bufferSize,dat,GL_STATIC_DRAW);,将前两个测试的速度提高到17GB / s,计算着色器突然增加到292GB / s,这更接近于我想要什么,但由于第1点,这不可信。

  7. Sometimes while (!available) hangs for ages (ctrl-c when I get tired of waiting shows its still in the loop).
  8. 有时虽然(!可用)挂起很久(ctrl-c,当我厌倦了等待显示它仍然在循环中)。

For reference, here's the CUDA code:

供参考,这是CUDA代码:

//http://www.ks.uiuc.edu/Research/vmd/doxygen/CUDABench_8cu-source.html

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda.h>

#define CUERR { cudaError_t err; \
    if ((err = cudaGetLastError()) != cudaSuccess) { \
    printf("CUDA error: %s, %s line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
    return -1; }}

//
// GPU device global memory bandwidth benchmark
//
template <class T>
__global__ void gpuglobmemcpybw(T *dest, const T *src) {
    const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    dest[idx] = src[idx];
}

template <class T>
__global__ void gpuglobmemsetbw(T *dest, const T val) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    dest[idx] = val;
}

typedef float4 datatype;

static int cudaglobmembw(int cudadev, double *gpumemsetgbsec, double *gpumemcpygbsec) {
    int i;
    int len = 1 << 22; // one thread per data element
    int loops = 500;
    datatype *src, *dest;
    datatype val=make_float4(1.0f, 1.0f, 1.0f, 1.0f);

    // initialize to zero for starters
    float memsettime = 0.0f;
    float memcpytime = 0.0f;
    *gpumemsetgbsec = 0.0;
    *gpumemcpygbsec = 0.0;

    // attach to the selected device
    cudaError_t rc;
    rc = cudaSetDevice(cudadev);
    if (rc != cudaSuccess) {
        #if CUDART_VERSION >= 2010
        rc = cudaGetLastError(); // query last error and reset error state
        if (rc != cudaErrorSetOnActiveProcess)
        return -1; // abort and return an error
        #else
        cudaGetLastError(); // just ignore and reset error state, since older CUDA
        // revs don't have a cudaErrorSetOnActiveProcess enum
        #endif
    }

    cudaMalloc((void **) &src, sizeof(datatype)*len);
    CUERR
    cudaMalloc((void **) &dest, sizeof(datatype)*len);
    CUERR

    dim3 BSz(256, 1, 1);
    dim3 GSz(len / (BSz.x * BSz.y * BSz.z), 1, 1); 

    // do a warm-up pass
    gpuglobmemsetbw<datatype><<< GSz, BSz >>>(src, val);
    CUERR
    gpuglobmemsetbw<datatype><<< GSz, BSz >>>(dest, val);
    CUERR
    gpuglobmemcpybw<datatype><<< GSz, BSz >>>(dest, src);
    CUERR

    cudaEvent_t start, end;
    cudaEventCreate(&start);
    cudaEventCreate(&end);

    // execute the memset kernel
    cudaEventRecord(start, 0);
    for (i=0; i<loops; i++) {
    gpuglobmemsetbw<datatype><<< GSz, BSz >>>(dest, val);
    }
    CUERR
    cudaEventRecord(end, 0);
    CUERR
    cudaEventSynchronize(start);
    CUERR
    cudaEventSynchronize(end);
    CUERR
    cudaEventElapsedTime(&memsettime, start, end);
    CUERR

    // execute the memcpy kernel
    cudaEventRecord(start, 0);
    for (i=0; i<loops; i++) {
    gpuglobmemcpybw<datatype><<< GSz, BSz >>>(dest, src);
    }
    cudaEventRecord(end, 0);
    CUERR
    cudaEventSynchronize(start);
    CUERR
    cudaEventSynchronize(end);
    CUERR
    cudaEventElapsedTime(&memcpytime, start, end);
    CUERR

    cudaEventDestroy(start);
    CUERR
    cudaEventDestroy(end);
    CUERR

    *gpumemsetgbsec = (len * sizeof(datatype) / (1024.0 * 1024.0)) / (memsettime / loops);
    *gpumemcpygbsec = (2 * len * sizeof(datatype) / (1024.0 * 1024.0)) / (memcpytime / loops);
    cudaFree(dest);
    cudaFree(src);
    CUERR

    return 0;
}

int main()
{
    double a, b;
    cudaglobmembw(0, &a, &b);
    printf("%f %f\n", (float)a, (float)b);
    return 0;
}

Update:

It seems that the buffer gets made non-resident on my glBufferData calls which were there to check output was being written. As per the extension:

似乎缓冲区在我的glBufferData调用中被非驻留,这些调用正在检查输出是否被写入。根据扩展名:

A buffer is also made non-resident implicitly as a result of being respecified via BufferData or being deleted.
...
BufferData is specified to "delete the existing data store", so the GPU address of that data should become invalid. The buffer is therefore made non-resident in the current context.

由于通过BufferData重新指定或被删除,缓冲区也被隐式地非驻留。 ... BufferData被指定为“删除现有数据存储”,因此该数据的GPU地址应该变为无效。因此,缓冲区在当前上下文中非驻留。

At a guess, OpenGL then streams in the buffer object data each frame and doesn't cache it in video memory. This explains why the compute shader failed the assert, however there's a slight anomaly that bindless graphics in the vertex shader still worked when not resident, but I'll ignore that for now. I have no idea why a 64MB buffer object wouldn't default to being resident (though perhaps after first use) when there's 12GB available.

猜测一下,OpenGL然后每帧流入缓冲区对象数据,并不将其缓存在视频内存中。这解释了为什么计算着色器使断言失败的原因,但是有一个轻微的异常,顶点着色器中的无绑定图形在不驻留时仍然有效,但我现在将忽略它。我不知道为什么64MB缓冲区对象不会默认为12GB可用时驻留(尽管可能在第一次使用后)。

So after each call to glBufferData I make it resident again and get the address in case its changed:

因此,在每次调用glBufferData之后,我再次将其驻留并获取地址以防其更改:

glBufferData(GL_TEXTURE_BUFFER, bufferSize, dat, GL_STATIC_DRAW);
glMakeBufferResidentNV(GL_TEXTURE_BUFFER, GL_READ_WRITE);
glGetBufferParameterui64vNV(GL_TEXTURE_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &address);
assert(glIsBufferResidentNV(GL_TEXTURE_BUFFER)); //sanity check

I'm now getting 270–290GB/s with the compute shader using either image_load_store or bindless graphics. Now my question includes:

我现在使用image_load_store或无绑定图形使用计算着色器获得270-290GB / s的速度。现在我的问题包括:

  • Given the buffer seems to be resident for each test and the compute shader is nice and fast, why are the vertex shader versions still so slow?
  • 鉴于缓冲区似乎是每个测试的驻留,并且计算着色器很好而且快,为什么顶点着色器版本仍然如此慢?

  • Without the bindless graphics extension, how should regular OpenGL users put data into video memory (actually put and not idly suggest that the driver might just like to)?

    如果没有无绑定的图形扩展,常规的OpenGL用户应该如何将数据放入视频内存(实际放置而不是空闲地建议驱动程序可能只是喜欢)?

    I'm pretty sure I would have noticed this problem in real world situations, and it's this contrived benchmark that hits a slow path, so how could I trick the driver into making a buffer object resident? Running a compute shader first doesn't change anything.

    我很确定在实际情况下我会注意到这个问题,并且这个设计的基准测试会遇到一条缓慢的路径,所以我怎么能欺骗驱动程序使缓冲区对象驻留?首先运行计算着色器不会改变任何内容。

1 个解决方案

#1


1  

You are asking the driver to read from your process memory, dat. This causes extensive cache coherency traffic. When the GPU reads that memory, it can't be sure that it is up to date, it might be in the CPU cache, modified, and not written back to RAM yet. This causes the GPU to actually have to read from the CPU cache, which is far more expensive than bypassing the CPU and reading the RAM. The RAM is often idle during normal operation, because a modern CPU's hit rate is typically 95% to 99%. The cache is used continuously.

您要求驱动程序读取您的进程内存dat。这会导致广泛的缓存一致性流量。当GPU读取该内存时,它无法确定它是否是最新的,它可能位于CPU缓存中,已修改,并且尚未写回RAM。这导致GPU实际上必须从CPU缓存读取,这比绕过CPU和读取RAM要昂贵得多。 RAM在正常操作期间通常处于空闲状态,因为现代CPU的命中率通常为95%至99%。缓存连续使用。

To achieve maximum performance, you need to let the driver allocate the memory. Normal memory your program uses, like global variables and the heap are allocated in writeback memory. Driver allocated memory will usually be allocated as write combining or uncacheable, which eliminates the coherency traffic.

要获得最大性能,您需要让驱动程序分配内存。程序使用的正常内存,如全局变量,堆在回写内存中分配。驱动程序分配的内存通常被分配为写入组合或不可缓存,这消除了一致性流量。

Peak advertised bandwidth numbers will be achieved only without cache coherency overhead.

仅在没有高速缓存一致性开销的情况下才能实现峰值通告带宽数。

To let the driver allocate it, use glBufferData with a nullptr for the data.

要让驱动程序分配它,请将glBufferData与nullptr一起用于数据。

It isn't all rosy though, if you manage to coerce the driver into using a system memory write combining buffer. CPU reads to such addresses will be very slow. Sequential writes are optimized by the CPU, but random writes will cause the write combining buffer to flush frequently, hurting performance.

但是,如果您设法强制驱动程序使用系统内存写入组合缓冲区,那么它并不完美。 CPU读取到这样的地址将非常慢。顺序写入由CPU优化,但随机写入将导致写入组合缓冲区频繁刷新,从而损害性能。

#1


1  

You are asking the driver to read from your process memory, dat. This causes extensive cache coherency traffic. When the GPU reads that memory, it can't be sure that it is up to date, it might be in the CPU cache, modified, and not written back to RAM yet. This causes the GPU to actually have to read from the CPU cache, which is far more expensive than bypassing the CPU and reading the RAM. The RAM is often idle during normal operation, because a modern CPU's hit rate is typically 95% to 99%. The cache is used continuously.

您要求驱动程序读取您的进程内存dat。这会导致广泛的缓存一致性流量。当GPU读取该内存时,它无法确定它是否是最新的,它可能位于CPU缓存中,已修改,并且尚未写回RAM。这导致GPU实际上必须从CPU缓存读取,这比绕过CPU和读取RAM要昂贵得多。 RAM在正常操作期间通常处于空闲状态,因为现代CPU的命中率通常为95%至99%。缓存连续使用。

To achieve maximum performance, you need to let the driver allocate the memory. Normal memory your program uses, like global variables and the heap are allocated in writeback memory. Driver allocated memory will usually be allocated as write combining or uncacheable, which eliminates the coherency traffic.

要获得最大性能,您需要让驱动程序分配内存。程序使用的正常内存,如全局变量,堆在回写内存中分配。驱动程序分配的内存通常被分配为写入组合或不可缓存,这消除了一致性流量。

Peak advertised bandwidth numbers will be achieved only without cache coherency overhead.

仅在没有高速缓存一致性开销的情况下才能实现峰值通告带宽数。

To let the driver allocate it, use glBufferData with a nullptr for the data.

要让驱动程序分配它,请将glBufferData与nullptr一起用于数据。

It isn't all rosy though, if you manage to coerce the driver into using a system memory write combining buffer. CPU reads to such addresses will be very slow. Sequential writes are optimized by the CPU, but random writes will cause the write combining buffer to flush frequently, hurting performance.

但是,如果您设法强制驱动程序使用系统内存写入组合缓冲区,那么它并不完美。 CPU读取到这样的地址将非常慢。顺序写入由CPU优化,但随机写入将导致写入组合缓冲区频繁刷新,从而损害性能。