我可以/应该在GPU上运行这段代码吗?

I'm working on a statistical application containing approximately 10 - 30 million floating point values in an array.

我正在开发一个统计应用程序，它包含一个数组中大约1000 - 3000万个浮点值。

Several methods performing different, but independent, calculations on the array in nested loops, for example:

几种方法对嵌套循环中的数组执行不同但独立的计算，例如:

Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();

for (float x = 0f; x < 100f; x += 0.0001f) {
    int noOfOccurrences = 0;

    foreach (float y in largeFloatingPointArray) {
        if (x == y) {
            noOfOccurrences++;
        }
    }

    noOfNumbers.Add(x, noOfOccurrences);
}

The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:

当前的应用程序是用c#编写的，运行在Intel CPU上，需要几个小时才能完成。我对GPU编程概念和api一无所知，所以我的问题是:

Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
利用GPU加速这种计算是否可能(并且有意义)?
If yes: Does anyone know any tutorial or got any sample code (programming language doesn't matter)?
如果知道:是否有人知道任何教程或任何示例代码(编程语言无关紧要)?

Any help would be highly appreciated.

如有任何帮助，我们将不胜感激。

5 个解决方案

#1

UPDATE GPU Version

GPU更新版本

__global__ void hash (float *largeFloatingPointArray,int largeFloatingPointArraySize, int *dictionary, int size, int num_blocks)
{
    int x = (threadIdx.x + blockIdx.x * blockDim.x); // Each thread of each block will
    float y;                                         // compute one (or more) floats
    int noOfOccurrences = 0;
    int a;

    while( x < size )            // While there is work to do each thread will:
    {
        dictionary[x] = 0;       // Initialize the position in each it will work
        noOfOccurrences = 0;    

        for(int j = 0 ;j < largeFloatingPointArraySize; j ++) // Search for floats
        {                                                     // that are equal 
                                                             // to it assign float
           y = largeFloatingPointArray[j];  // Take a candidate from the floats array 
           y *= 10000;                      // e.g if y = 0.0001f;
           a = y + 0.5;                     // a = 1 + 0.5 = 1;
           if (a == x) noOfOccurrences++;    
        }                                      

        dictionary[x] += noOfOccurrences; // Update in the dictionary 
                                          // the number of times that the float appears 

    x += blockDim.x * gridDim.x;  // Update the position here the thread will work
    }
}

This one I just tested for smaller inputs, because I am testing I my laptop. Nevertheless, it did work. However, it necessary to do furthers testes.

我刚刚测试的是小输入，因为我正在测试我的笔记本电脑。然而,它确实工作。然而，有必要做进一步的测试。

UPDATE Sequential Version

顺序更新版本

I just did this naive version that perform your algorithm for 30,000,000 in less than 20 seconds (already counting function to generate data).

我刚刚做了这个简单的版本，它在不到20秒内执行了30,000,000(已经在计算生成数据的函数)。

Basically, it sort your array of floats. It will travel over the sorted array, analyzing the number of times a value consecutively appears in the array and then put this value in a dictionary along with the number of times it appear.

基本上，它对浮动数组进行排序。它将遍历排序数组，分析一个值连续出现在数组中的次数，然后将该值连同出现次数一起放入字典。

You can use sorted map, instead of the unordered_map that I used.

您可以使用排序映射，而不是我使用的unordered_map。

Heres the code:

这是代码:

#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>


typedef std::tr1::unordered_map<float, int> Mymap;


void generator(float *data, long int size)
{
    float LO = 0.0;
    float HI = 100.0;

    for(long int i = 0; i < size; i++)
        data[i] = LO + (float)rand()/((float)RAND_MAX/(HI-LO));
}

void print_array(float *data, long int size)
{

    for(long int i = 2; i < size; i++)
        printf("%f\n",data[i]);

}

std::tr1::unordered_map<float, int> fill_dict(float *data, int size)
{
    float previous = data[0];
    int count = 1;
    std::tr1::unordered_map<float, int> dict;

    for(long int i = 1; i < size; i++)
    {
        if(previous == data[i])
            count++;
        else
        {
          dict.insert(Mymap::value_type(previous,count));
          previous = data[i];
          count = 1;         
        }

    }
    dict.insert(Mymap::value_type(previous,count)); // add the last member
    return dict;

}

void printMAP(std::tr1::unordered_map<float, int> dict)
{
   for(std::tr1::unordered_map<float, int>::iterator i = dict.begin(); i != dict.end(); i++)
  {
     std::cout << "key(string): " << i->first << ", value(int): " << i->second << std::endl;
   }
}


int main(int argc, char** argv)
{
  int size = 1000000; 
  if(argc > 1) size = atoi(argv[1]);
  printf("Size = %d",size);

  float data[size];
  using namespace __gnu_cxx;

  std::tr1::unordered_map<float, int> dict;

  generator(data,size);

  sort(data, data + size);
  dict = fill_dict(data,size);

  return 0;
}

If you have the library thrust installed in you machine you should use this:

如果你的电脑安装了库推力，你应该这样做:

#include <thrust/sort.h>
thrust::sort(data, data + size);

instead of this

而不是这个

sort(data, data + size);

For sure it will be faster.

肯定会更快。

Original Post

最初的发布

"I'm working on a statistical application which has a large array containin 10 - 30 millions of floating point values".

“我正在开发一个统计应用程序，它有一个包含一千万到三千万浮点值的大数组”。

"Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?"

“利用GPU加速这种计算是否可能(而且有意义)?”

Yes, it is. A month ago I put a Molecular Dynamic simulation entirely on the GPU. One of the kernels, that calculates the force between pairs of particles, receive 6 array each one with 500,000 doubles, a total of 3 Millions doubles (22 MB).

是的,它是。一个月前，我在GPU上做了一个分子动力学模拟。其中一个内核，计算对粒子对之间的作用力，接收6个数组，每一个有50万个双打，总数为3百万倍(22 MB)。

So you are planing to put 30 Millions of float points this is about 114 MB of global Memory, so this is not a problem, even my laptop have 250MB.

你打算放3000万浮点数这是114兆全局内存，所以这不是问题，即使我的笔记本也有250MB。

The number of calculation can be a issue in your case? Based on my experience with the Molecular Dynamic (MD) I say no. The sequential MD version takes about 25 hours to complete while in GPU took 45 Minutes. You said your application took a couple hours, also based in your code example it looks softer than the Molecular Dynamic.

计算的数量可能是你的问题吗?根据我对分子动力学的经验，我说不。连续的MD版本大约需要25个小时完成，而在GPU中则需要45分钟。您说您的应用程序花了几个小时，同样基于您的代码示例，它看起来比分子动力学更柔软。

Here's the force calculation example:

这里有一个力的计算例子:

__global__ void add(double *fx, double *fy, double *fz,
                    double *x, double *y, double *z,...){

     int pos = (threadIdx.x + blockIdx.x * blockDim.x); 

     ...

     while(pos < particles)
     {

      for (i = 0; i < particles; i++)
      {
              if(//inside of the same radius)
                {
                 // calculate force
                } 
       }
     pos += blockDim.x * gridDim.x;  
     }        
  }

A simple example of a code in Cuda could be the sum of two 2D arrays:

Cuda中代码的一个简单示例可以是两个二维数组的和:

In c:

在c:

for(int i = 0; i < N; i++)
    c[i] = a[i] + b[i];

In Cuda:

在Cuda:

__global__ add(int *c, int *a, int*b, int N)
{
  int pos = (threadIdx.x + blockIdx.x)
  for(; i < N; pos +=blockDim.x)
      c[pos] = a[pos] + b[pos];
}

In Cuda you basically took each for iteration and divide by each thread,

在Cuda中，你基本上每次迭代然后除以每个线程，

1) threadIdx.x + blockIdx.x*blockDim.x;

Each block have a Id from 0 to N-1 (N the number maximum of blocks) and each block have a X number of threads with an id from 0 to X-1.

每个块的Id从0到N-1 (N是块的最大数量)，每个块有一个Id从0到X-1的线程数。

1) Gives you the for iteration that each thread will compute based on it id and the block id where the thread is in, the blockDim.x is the number of thread that a block have.

1)给出每个线程将基于它的id和线程所在的块id进行计算的for迭代，即blockDim。x是一个块的线程数。

So if you have 2 blocks each one with 10 threads and a N = 40, the:

因此，如果你有两个块，每个有10个线程和一个N = 40，

Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39

Looking to your code I made this draft of what could be it in cuda:

看看你的代码，我写了这个库达的草稿:

__global__ hash (float *largeFloatingPointArray, int *dictionary)
    // You can turn the dictionary in one array of int
    // here each position will represent the float
    // Since  x = 0f; x < 100f; x += 0.0001f
    // you can associate each x to different position
    // in the dictionary:

    // pos 0 have the same meaning as 0f;
    // pos 1 means float 0.0001f
    // pos 2 means float 0.0002f ect.
    // Then you use the int of each position 
    // to count how many times that "float" had appeared 


   int x = blockIdx.x;  // Each block will take a different x to work
    float y;

while( x < 1000000) // x < 100f (for incremental step of 0.0001f)
{
    int noOfOccurrences = 0;
    float z = converting_int_to_float(x); // This function will convert the x to the
                                          // float like you use (x / 0.0001)

    // each thread of each block
    // will takes the y from the array of largeFloatingPointArray

    for(j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
    {
        y = largeFloatingPointArray[j];
        if (z == y)
        {
            noOfOccurrences++;
        }
    }
    if(threadIdx.x == 0) // Thread master will update the values
      atomicAdd(&dictionary[x], noOfOccurrences);
    __syncthreads();
}

You have to use atomicAdd because different threads from different blocks may write/read noOfOccurrences at the same time, so you have to unsure mutual exclusion.

您必须使用atomicAdd，因为来自不同块的不同线程可能同时写/读noofoccurrence，因此您必须不确定互斥。

This is only one approach you can even give the iterations of the outer loop to the threads instead of the blocks.

这只是一种方法，您甚至可以将外部循环的迭代交给线程而不是块。

Tutorials

教程

The Dr Dobbs Journal series CUDA: Supercomputing for the masses by Rob Farmer is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.

多布斯博士的期刊系列CUDA:由罗布·法默所著的超级计算机非常棒，在它的十四个部分中几乎涵盖了所有内容。它也开始相当温和，因此是相当初学者友好的。

and anothers:

和别人:

Developing With CUDA - Introduction
开发与CUDA -介绍
Volume I: Introduction to CUDA Programming
第一卷:CUDA编程简介
Getting started with CUDA
开始使用CUDA
CUDA Resources List
CUDA资源列表

Take a look on the last item, you will find many link to learn CUDA.

看看最后一项，你会发现很多学习CUDA的链接。

OpenCL: OpenCL Tutorials | MacResearch

OpenCL: OpenCL教程| MacResearch。

#2

I don't know much of anything about parallel processing or GPGPU, but for this specific example, you could save a lot of time by making a single pass over the input array rather than looping over it a million times. With large data sets you will usually want to do things in a single pass if possible. Even if you're doing multiple independent computations, if it's over the same data set you might get better speed doing them all in the same pass, as you'll get better locality of reference that way. But it may not be worth it for the increased complexity in your code.

我不太了解并行处理或GPGPU，但是对于这个特定的示例，您可以通过对输入数组进行单次传递来节省大量时间，而不是对其进行百万次循环。有了大数据集，你通常会想要在可能的情况下，在单次传递中做一些事情。即使你在做多个独立的计算，如果它在相同的数据集上你可能会得到更好的速度在相同的过程中做这些，因为你会得到更好的引用局部性。但是，在代码中增加的复杂性可能并不值得。

In addition, you really don't want to add a small amount to a floating point number repetitively like that, the rounding error will add up and you won't get what you intended. I've added an if statement to my below sample to check if inputs match your pattern of iteration, but omit it if you don't actually need that.

另外，你真的不想在浮点数上增加一个小数目，像这样重复，舍入误差会增加，你不会得到你想要的结果。我在下面的示例中添加了一个if语句，以检查输入是否与迭代的模式匹配，但是如果您实际上不需要它，就忽略它。

I don't know any C#, but a single pass implementation of your sample would look something like this:

我不知道任何c#，但是您的示例的单次实现应该是这样的:

Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();

foreach (float x in largeFloatingPointArray)
{
    if (math.Truncate(x/0.0001f)*0.0001f == x)
    {
        if (noOfNumbers.ContainsKey(x))
            noOfNumbers.Add(x, noOfNumbers[x]+1);
        else
            noOfNumbers.Add(x, 1);
    }
}

Hope this helps.

希望这个有帮助。

#3

Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?

利用GPU加速这种计算是否可能(并且有意义)?

Definitely YES, this kind of algorithm is typically the ideal candidate for massive data-parallelism processing, the thing GPUs are so good at.
当然，这种算法通常是大规模数据并行处理的理想候选，gpu非常擅长。

If yes: Does anyone know any tutorial or got any sample code (programming language doesn't matter)?

如果知道:是否有人知道任何教程或任何示例代码(编程语言无关紧要)?

When you want to go the GPGPU way you have two alternatives : CUDA or OpenCL.

当你想采用GPGPU的方式时，你有两个选择:CUDA或OpenCL。

CUDA is mature with a lot of tools but is NVidia GPUs centric.

CUDA是成熟的，有很多工具，但是是NVidia gpu中心。

OpenCL is a standard running on NVidia and AMD GPUs, and CPUs too. So you should really favour it.

OpenCL是运行在NVidia、AMD gpu和cpu上的标准。所以你真的应该喜欢它。
For tutorial you have an excellent series on CodeProject by Rob Farber : http://www.codeproject.com/Articles/Rob-Farber#Articles

对于教程，您有一个关于CodeProject的优秀系列，作者是Rob Farber: http://www.codeproject.com/Articles/Rob-Farber#Articles
For your specific use-case there is a lot of samples for histograms buiding with OpenCL (note that many are image histograms but the principles are the same).

对于您的特定用例，有许多与OpenCL结合的直方图示例(注意，许多是图像直方图，但原理是相同的)。
As you use C# you can use bindings like OpenCL.Net or Cloo.

当您使用c#时，您可以使用OpenCL之类的绑定。净或Cloo。
If your array is too big to be stored in the GPU memory, you can block-partition it and rerun your OpenCL kernel for each part easily.

如果您的数组太大而不能存储在GPU内存中，那么您可以对它进行块分区，并且可以轻松地重新运行OpenCL内核。

#4

In addition to the suggestion by the above poster use the TPL (task parallel library) when appropriate to run in parallel on multiple cores.

除了上述海报的建议之外，在适当的时候使用TPL(任务并行库)在多个核心上并行运行。

The example above could use Parallel.Foreach and ConcurrentDictionary, but a more complex map-reduce setup where the array is split into chunks each generating an dictionary which would then be reduced to a single dictionary would give you better results.

上面的示例可以使用并行。对于每个和ConcurrentDictionary，但是一个更复杂的map-reduce设置，其中数组被分割成块，每个块生成一个字典，然后将它简化为一个字典，这会给您带来更好的结果。

I don't know whether all your computations map correctly to the GPU capabilities, but you'll have to use a map-reduce algorithm anyway to map the calculations to the GPU cores and then reduce the partial results to a single result, so you might as well do that on the CPU before moving on to a less familiar platform.

我不知道是否你所有的计算正确映射到GPU功能,但无论如何你必须使用使用映射-规约模式算法计算映射到GPU核心,然后减少部分结果到一个结果,你不妨这样做在CPU上之前在一个不太熟悉的平台。

#5

I am not sure whether using GPUs would be a good match given that 'largerFloatingPointArray' values need to be retrieved from memory. My understanding is that GPUs are better suited for self contained calculations.

我不确定是否使用gpu是一种很好的匹配，因为需要从内存中检索“largerFloatingPointArray”的值。我的理解是gpu更适合自包含计算。

I think turning this single process application into a distributed application running on many systems and tweaking the algorithm should speed things up considerably, depending how many systems are available.

我认为将这个单一的流程应用程序转换为运行在许多系统上的分布式应用程序，并对算法进行调整，应该会大大加快速度，这取决于有多少系统可用。

You can use the classic 'divide and conquer' approach. The general approach I would take is as follows.

你可以使用经典的“分而治之”方法。我将采取的一般方法如下。

Use one system to preprocess 'largeFloatingPointArray' into a hash table or a database. This would be done in a single pass. It would use floating point value as the key, and the number of occurrences in the array as the value. Worst case scenario is that each value only occurs once, but that is unlikely. If largeFloatingPointArray keeps changing each time the application is run then in-memory hash table makes sense. If it is static, then the table could be saved in a key-value database such as Berkeley DB. Let's call this a 'lookup' system.

使用一个系统将“largeFloatingPointArray”预处理到散列表或数据库。这可以一次完成。它将使用浮点值作为键，使用数组中出现的次数作为值。最坏的情况是，每个值只发生一次，但这是不可能的。如果每次运行应用程序时，largeFloatingPointArray都保持更改，那么内存中的哈希表是有意义的。如果是静态的，那么可以将表保存在键值数据库(如Berkeley DB)中。我们称之为“查找”系统。

On another system, let's call it 'main', create chunks of work and 'scatter' the work items across N systems, and 'gather' the results as they become available. E.g a work item could be as simple as two numbers indicating the range that a system should work on. When a system completes the work, it sends back array of occurrences and it's ready to work on another chunk of work.

在另一个系统中，我们将其称为“main”，创建工作块，并在N个系统中“分散”工作项，并在结果可用时“收集”结果。E。工作项可以简单到两个数字，表示系统应该处理的范围。当系统完成工作时，它将返回出现的数组，并准备处理另一大块工作。

The performance is improved because we do not keep iterating over largeFloatingPointArray. If lookup system becomes a bottleneck, then it could be replicated on as many systems as needed.

性能得到了改进，因为我们不不断地遍历大浮点数。如果查找系统成为瓶颈，那么它就可以在需要的多个系统上进行复制。

With large enough number of systems working in parallel, it should be possible to reduce the processing time down to minutes.

由于有足够多的系统并行工作，应该可以将处理时间缩短为分钟。

I am working on a compiler for parallel programming in C targeted for many-core based systems, often referred to as microservers, that are/or will be built using multiple 'system-on-a-chip' modules within a system. ARM module vendors include Calxeda, AMD, AMCC, etc. Intel will probably also have a similar offering.

我正在为C语言的并行编程编写一个编译器，目标是多核系统(通常称为微服务器)，这些系统是/或将使用系统中的多个“系统芯片”模块构建的。ARM模块供应商包括Calxeda、AMD、AMCC等。英特尔可能也会推出类似的产品。

I have a version of the compiler working, which could be used for such an application. The compiler, based on C function prototypes, generates C networking code that implements inter-process communication code (IPC) across systems. One of the IPC mechanism available is socket/tcp/ip.

我有一个正在运行的编译器版本，它可以用于这样的应用程序。编译器基于C函数原型，生成跨系统实现进程间通信代码(IPC)的C网络代码。IPC机制之一是套接字/tcp/ip。

If you need help in implementing a distributed solution, I'd be happy to discuss it with you.

如果您在实现分布式解决方案时需要帮助，我很乐意与您讨论。

Added Nov 16, 2012.

增加了11月16日,2012年。

I thought a little bit more about the algorithm and I think this should do it in a single pass. It's written in C and it should be very fast compared with what you have.

我对算法有了更多的思考，我认为这应该一次完成。它是用C写的，和你现在的相比应该很快。

/*
 * Convert the X range from 0f to 100f in steps of 0.0001f
 * into a range of integers 0 to 1 + (100 * 10000) to use as an
 * index into an array.
 */

#define X_MAX           (1 + (100 * 10000))

/*
 * Number of floats in largeFloatingPointArray needs to be defined
 * below to be whatever your value is.
 */

#define LARGE_ARRAY_MAX (1000)

main()
{
    int j, y, *noOfOccurances;
    float *largeFloatingPointArray;

    /*
     * Allocate memory for largeFloatingPointArray and populate it.
     */

    largeFloatingPointArray = (float *)malloc(LARGE_ARRAY_MAX * sizeof(float));    
    if (largeFloatingPointArray == 0) {
        printf("out of memory\n");
        exit(1);
    }

    /*
     * Allocate memory to hold noOfOccurances. The index/10000 is the
     * the floating point number.  The contents is the count.
     *
     * E.g. noOfOccurances[12345] = 20, means 1.2345f occurs 20 times
     * in largeFloatingPointArray.
     */

    noOfOccurances = (int *)calloc(X_MAX, sizeof(int));
    if (noOfOccurances == 0) {  
        printf("out of memory\n");
        exit(1);
    }

    for (j = 0; j < LARGE_ARRAY_MAX; j++) {
        y = (int)(largeFloatingPointArray[j] * 10000);
        if (y >= 0 && y <= X_MAX) {
            noOfOccurances[y]++;
        }   
    }
}

#1