Parallelizing a for loop yields no performance gain

Date: 2022-08-06 13:55:04

I have an algorithm which converts a Bayer image channel to RGB. In my implementation I have a single nested for loop which iterates over the Bayer channel, calculates the RGB index from the Bayer index and then sets that pixel's value from the Bayer channel. The main thing to notice here is that each pixel can be calculated independently of other pixels (it doesn't rely on previous calculations), so the algorithm is a natural candidate for parallelization. The calculation does, however, rely on some preset arrays which all threads access at the same time but do not change.

However, when I tried parallelizing the main for loop with MS's concurrency::parallel_for, I gained no boost in performance. In fact, for an input of size 3264x2540 running on a 4-core CPU, the non-parallelized version ran in ~34ms and the parallelized version ran in ~69ms (averaged over 10 runs). I confirmed that the operation was indeed parallelized (3 new threads were created for the task).

Using Intel's compiler with tbb::parallel_for gave nearly identical results. For comparison, I started out with this algorithm implemented in C#, in which I also used parallel_for loops, and there I saw close to 4x performance gains (I opted for C++ because for this particular task C++ was faster even with a single core).

Any ideas what is preventing my code from parallelizing well?

My code:

template<typename T>
void static ConvertBayerToRgbImageAsIs(T* BayerChannel, T* RgbChannel, int Width, int Height, ColorSpace ColorSpace)
{
    // Translates index offset in Bayer image to channel offset in RGB image
    int offsets[4];
    // calculate offsets according to color space
    switch (ColorSpace)
    {
    case ColorSpace::BGGR:
        offsets[0] = 2;
        offsets[1] = 1;
        offsets[2] = 1;
        offsets[3] = 0;
        break;
    // ...other color spaces
    }
    memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
    parallel_for(0, Height, [&] (int row)
    {
        for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
        {
            auto offset = (row % 2) * 2 + (col % 2); // 0...3
            auto rgbIndex = bayerIndex * 3 + offsets[offset];
            RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
        }
    });
}

8 Answers

#1


21  

First of all, your algorithm is memory bandwidth bound. That is, memory loads/stores outweigh any index calculations you do.

Vector operations like SSE/AVX would not help either - you are not doing any intensive calculations.

Increasing the amount of work per iteration is also useless - both PPL and TBB are smart enough not to create a thread per iteration; they use a sensible partitioning scheme which additionally tries to preserve locality. For instance, here is a quote from the TBB parallel_for documentation:

When worker threads are available, parallel_for executes iterations in non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values.

What really matters is reducing memory operations. Any superfluous traversal over the input or output buffer is poison for performance, so you should try to remove your memset or do it in parallel too.

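For illustration, here is a minimal sketch of what "remove the memset or do it in parallel" could look like: instead of a separate full-buffer memset, the RGB components that the copy never touches are zeroed inside the same per-row loop, so the output buffer is traversed only once. This is my own sketch, not code from the question; it assumes the same Width, Height, offsets[] table and parallel_for as in the question.

parallel_for(0, Height, [&] (int row)
{
    for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
    {
        auto rgbBase = bayerIndex * 3;
        // zero the whole pixel, then overwrite the one channel we actually have
        RgbChannel[rgbBase + 0] = 0;
        RgbChannel[rgbBase + 1] = 0;
        RgbChannel[rgbBase + 2] = 0;
        RgbChannel[rgbBase + offsets[(row % 2) * 2 + (col % 2)]] = BayerChannel[bayerIndex];
    }
});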

You are fully traversing the input and output data. Even if you skip something in the output, that doesn't matter, because memory operations happen in 64-byte chunks on modern hardware. So, calculate the size of your input and output, measure the time of the algorithm, divide size by time and compare the result with the maximal characteristics of your system (for instance, measured with a benchmark).

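As a rough back-of-envelope check for the question's numbers (my own estimate, assuming 4-byte float pixels): the 3264x2540 Bayer input is about 3264 * 2540 * 4 B ≈ 33 MB, the RGB output is three times that, about 100 MB, and the memset writes those 100 MB once more. Moving roughly 230 MB in ~34 ms is already around 7 GB/s, which is in the neighborhood of what a typical dual-channel desktop memory system can sustain - so adding threads cannot raise that ceiling much.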

I have made a test for Microsoft PPL, OpenMP and a native for loop; the results are (I used 8x your height):

Native_For       0.21 s
OpenMP_For       0.15 s
Intel_TBB_For    0.15 s
MS_PPL_For       0.15 s

If the memset is removed, then:

Native_For       0.15 s
OpenMP_For       0.09 s
Intel_TBB_For    0.09 s
MS_PPL_For       0.09 s

As you can see, the memset (which is highly optimized) is responsible for a significant amount of the execution time, which shows how memory bound your algorithm is.

FULL SOURCE CODE:

#include <boost/exception/detail/type_info.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/progress.hpp>
#include <tbb/tbb.h>
#include <iostream>
#include <ostream>
#include <vector>
#include <string>
#include <omp.h>
#include <ppl.h>

using namespace boost;
using namespace std;

const auto Width = 3264;
const auto Height = 2540*8;

struct MS_PPL_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        concurrency::parallel_for(first,last,f);
    }
};

struct Intel_TBB_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        tbb::parallel_for(first,last,f);
    }
};

struct Native_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        for(; first!=last; ++first) f(first);
    }
};

struct OpenMP_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        #pragma omp parallel for
        for(auto i=first; i<last; ++i) f(i);
    }
};

template<typename T>
struct ConvertBayerToRgbImageAsIs
{
    const T* BayerChannel;
    T* RgbChannel;
    template<typename For>
    void operator()(For for_)
    {
        cout << type_name<For>() << "\t";
        progress_timer t;
        int offsets[] = {2,1,1,0};
        //memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
        for_(0, Height, [&] (int row)
        {
            for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
            {
                auto offset = (row % 2)*2 + (col % 2); //0...3
                auto rgbIndex = bayerIndex * 3 + offsets[offset];
                RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
            }
        });
    }
};

int main()
{
    vector<float> bayer(Width*Height);
    vector<float> rgb(Width*Height*3);
    ConvertBayerToRgbImageAsIs<float> work = {&bayer[0],&rgb[0]};
    for(auto i=0;i!=4;++i)
    {
        mpl::for_each<mpl::vector<Native_For, OpenMP_For,Intel_TBB_For,MS_PPL_For>>(work);
        cout << string(16,'_') << endl;
    }
}

#2


4  

Synchronization overhead

I would guess that the amount of work done per iteration of the loop is too small. Had you split the image into four parts and run the computation in parallel, you would have noticed a large gain. Try to design the loop in a way that causes fewer iterations and more work per iteration. The reasoning behind this is that there is too much synchronization being done; the sketch below shows one way to reduce it.

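As an illustration of "fewer iterations, more work per iteration", here is my own sketch (not the poster's code), assuming the same buffers and offsets[] table as in the question and #include <tbb/tbb.h>: with tbb::blocked_range, each task receives a contiguous band of rows, so there are only a handful of tasks and each one does plenty of work.

tbb::parallel_for(tbb::blocked_range<int>(0, Height, /*grainsize*/ 64),
    [&] (const tbb::blocked_range<int>& rows)
{
    for (int row = rows.begin(); row != rows.end(); ++row)
        for (int col = 0, bayerIndex = row * Width; col < Width; ++col, ++bayerIndex)
            RgbChannel[bayerIndex * 3 + offsets[(row % 2) * 2 + (col % 2)]] = BayerChannel[bayerIndex];
});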

Cache usage

An important factor may be how the data is split (partitioned) for processing. If the processed rows are separated as in the bad case below, then more rows will cause cache misses. This effect becomes more important with each additional thread, because the distance between rows will be greater. If you are certain that the parallelizing function performs reasonable partitioning, then manual work-splitting will not give any additional gains.

 bad       good
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t2
****** t2 ****** t2
****** t1 ****** t2
****** t2 ****** t2

Also make sure that you access your data in the same way it is aligned; it is possible that each access to offsets[] and BayerChannel[] is a cache miss. Your algorithm is very memory intensive. Almost all operations are either accessing a memory segment or writing to it. Preventing cache misses and minimizing memory access is crucial.

Code optimizations

The optimizations shown below may already be done by the compiler and may not give better results, but it is worth knowing that they can be done.

    // is the memset really necessary?
    //memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
    parallel_for(0, Height, [&] (int row)
    {
        int rowMod = (row & 1) << 1;
        for (auto col = 0, bayerIndex = row * Width, tripleBayerIndex=row*Width*3; col < Width; col+=2, bayerIndex+=2, tripleBayerIndex+=6)
        {
            auto rgbIndex = tripleBayerIndex + offsets[rowMod];
            RgbChannel[rgbIndex] = BayerChannel[bayerIndex];

            //unrolled the loop to save col & 1 operation
            rgbIndex = tripleBayerIndex + 3 + offsets[rowMod+1];
            RgbChannel[rgbIndex] = BayerChannel[bayerIndex+1];
        }
    });

#3


2  

Here are my suggestions:

  1. Compute larger chunks in parallel
  2. Get rid of the modulo/multiplication
  3. Unroll the inner loop to compute one full pixel (simplifies the code)

    template<typename T> void static ConvertBayerToRgbImageAsIsNew(T* BayerChannel, T* RgbChannel, int Width, int Height)
    {
        // convert BGGR->RGB
        // have as many threads as the hardware concurrency is
        parallel_for(0, Height, static_cast<int>(Height/(thread::hardware_concurrency())), [&] (int stride)
        {
            for (auto row = stride; row<2*stride; row++)
            {
                for (auto col = row*Width, rgbCol =row*Width; col < row*Width+Width; rgbCol +=3, col+=4)
                {
                    RgbChannel[rgbCol+0]  = BayerChannel[col+3];
                    RgbChannel[rgbCol+1]  = BayerChannel[col+1];
                    // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
                    RgbChannel[rgbCol+2]  = BayerChannel[col+0];
                }
            }
        });
    }
    

This code is 60% faster than the original version but still only half as fast as the non-parallelized version on my laptop. This seemed to be due to the memory-boundedness of the algorithm, as others have pointed out already.

Edit: But I was not happy with that. I could greatly improve the parallel performance by going from parallel_for to std::async:

int hc = thread::hardware_concurrency();
future<void>* res = new future<void>[hc];
for (int i = 0; i<hc; ++i)
{
    res[i] = async(Converter<char>(bayerChannel, rgbChannel, rows, cols, rows/hc*i, rows/hc*(i+1)));
}
for (int i = 0; i<hc; ++i)
{
    res[i].wait();
}
delete [] res;

with Converter being a simple class:

template <class T> class Converter
{
public:
    Converter(T* BayerChannel, T* RgbChannel, int Width, int Height, int startRow, int endRow) :
        BayerChannel(BayerChannel), RgbChannel(RgbChannel), Width(Width), Height(Height), startRow(startRow), endRow(endRow)
    {
    }
    void operator()()
    {
        // convert BGGR->RGB
        for (int row = startRow; row < endRow; row++)
        {
            for (auto col = row*Width, rgbCol = row*Width; col < row*Width+Width; rgbCol += 3, col += 4)
            {
                RgbChannel[rgbCol+0]  = BayerChannel[col+3];
                RgbChannel[rgbCol+1]  = BayerChannel[col+1];
                // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
                RgbChannel[rgbCol+2]  = BayerChannel[col+0];
            }
        }
    }
private:
    T* BayerChannel;
    T* RgbChannel;
    int Width;
    int Height;
    int startRow;
    int endRow;
};

This is now 3.5 times faster than the non parallelized version. From what I have seen in the profiler so far, I assume that the work stealing approach of parallel_for incurs a lot of waiting and synchronization overhead.

#4


1  

I have not used tbb::parallel_for nor concurrency::parallel_for, but if your numbers are correct they seem to carry too much overhead. However, I strongly advise you to run more than 10 iterations when testing, and also be sure to do some warmup iterations before timing.

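To make that concrete, here is a minimal timing-harness sketch (a hypothetical helper of my own, not from the question; it needs <chrono>): warm up a few times first so caches and lazily-created thread pools are initialized, then average over many timed runs.

template<class F>
double time_ms(F&& f, int warmup = 3, int runs = 100)
{
    for (int i = 0; i < warmup; ++i) f();   // warm caches, thread pools, etc.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / runs;
}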

I tested your exact code using three different methods, averaged over 1000 tries.

Serial:      14.6 ± 1.0  ms
std::async:  13.6 ± 1.6  ms
workers:     11.8 ± 1.2  ms

The first is serial calculation. The second is done using four calls to std::async. The last is done by sending four jobs to four already started (but sleeping) background threads.

The gains aren't big, but at least they are gains. I did the test on a 2012 MacBook Pro, with dual hyper threaded cores = 4 logical cores.

For reference, here's my std::async parallel for:

template<typename Int=int, class Fun>
void std_par_for(Int beg, Int end, const Fun& fun)
{
    auto N = std::thread::hardware_concurrency();
    std::vector<std::future<void>> futures;

    for (Int ti=0; ti<N; ++ti) {
        Int b = ti * (end - beg) / N;
        Int e = (ti+1) * (end - beg) / N;
        if (ti == N-1) { e = end; }

        futures.emplace_back( std::async([&,b,e]() {
            for (Int ix=b; ix<e; ++ix) {
                fun( ix );
            }
        }));
    }

    for (auto&& f : futures) {
        f.wait();
    }
}

#5


1  

Things to check or do

  • Are you using a Core 2 or older processor? They have a very narrow memory bus that's easy to saturate with code like this. In contrast, 4-channel Sandy Bridge-E processors require multiple threads to saturate the memory bus (it's not possible for a single memory-bound thread to fully saturate it). A rough way to check what your own machine can sustain is sketched right after this list.
  • Have you populated all of your memory channels? E.g. if you have a dual-channel CPU but have just one RAM card installed or two that are on the same channel, you're getting half the available bandwidth.
  • How are you timing your code?
    • The timing should be done inside the application like Evgeny Panasyuk suggests.
    • You should do multiple runs within the same application. Otherwise, you may be timing one-time startup code to launch the thread pools, etc.

  • Remove the superfluous memset, as others have explained.
  • As ogni42 and others have suggested, unroll your inner loop (I didn't bother checking the correctness of that solution, but if it's wrong, you should be able to fix it). This is orthogonal to the main question of parallelization, but it's a good idea anyway.
  • Make sure your machine is otherwise idle when doing performance testing.
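
As a rough way to check the first bullet (whether one thread already saturates your memory bus), here is a small bandwidth probe - my own sketch, not part of the original answer: it copies a large buffer with an increasing number of OpenMP threads and prints the achieved GB/s.

#include <omp.h>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

// Copy `bytes` using `threads` OpenMP threads and print the achieved bandwidth.
static void probe(int threads, char* dst, const char* src, size_t bytes)
{
    const size_t CHUNK = 1 << 20;                          // copy in 1 MB blocks
    const int nChunks = (int)((bytes + CHUNK - 1) / CHUNK);
    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for num_threads(threads) schedule(static)
    for (int c = 0; c < nChunks; ++c)
    {
        size_t off = (size_t)c * CHUNK;
        std::memcpy(dst + off, src + off, std::min(CHUNK, bytes - off));
    }
    auto t1 = std::chrono::steady_clock::now();
    double s = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%d thread(s): ~%.1f GB/s (read+write)\n", threads, 2.0 * (double)bytes / s / 1e9);
}

int main()
{
    const size_t bytes = 256ull << 20;                     // 256 MB, far larger than any cache
    std::vector<char> src(bytes, 1), dst(bytes, 0);
    for (int t = 1; t <= omp_get_max_threads(); ++t)
        probe(t, &dst[0], &src[0], bytes);
    return 0;
}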

Additional timings

I've merged the suggestions of Evgeny Panasyuk and ogni42 in a bare-bones C++03 Win32 implementation:

#include "stdafx.h"

#include <windows.h> // QueryPerformanceCounter / LARGE_INTEGER (likely also pulled in via stdafx.h)
#include <omp.h>
#include <vector>
#include <iostream>
#include <string>
#include <stdio.h>

using namespace std;

const int Width = 3264;
const int Height = 2540*8;

class Timer {
private:
    string name;
    LARGE_INTEGER start;
    LARGE_INTEGER stop;
    LARGE_INTEGER frequency;
public:
    Timer(const char *name) : name(name) {
        QueryPerformanceFrequency(&frequency);
        QueryPerformanceCounter(&start);
    }

    ~Timer() {
        QueryPerformanceCounter(&stop);
        LARGE_INTEGER time;
        time.QuadPart = stop.QuadPart - start.QuadPart;
        double elapsed = ((double)time.QuadPart /(double)frequency.QuadPart);
        printf("%-20s : %5.2f\n", name.c_str(), elapsed);
    }
};

static const int offsets[] = {2,1,1,0};

template <typename T>
void Inner_Orig(const T* BayerChannel, T* RgbChannel, int row)
{
    for (int col = 0, bayerIndex = row * Width;
         col < Width; col++, bayerIndex++)
    {
        int offset = (row % 2)*2 + (col % 2); //0...3
        int rgbIndex = bayerIndex * 3 + offsets[offset];
        RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
    }
}

// adapted from ogni42's answer
template <typename T>
void Inner_Unrolled(const T* BayerChannel, T* RgbChannel, int row)
{
    for (int col = row*Width, rgbCol =row*Width;
         col < row*Width+Width; rgbCol +=3, col+=4)
    {
        RgbChannel[rgbCol+0]  = BayerChannel[col+3];
        RgbChannel[rgbCol+1]  = BayerChannel[col+1];
        // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
        RgbChannel[rgbCol+2]  = BayerChannel[col+0];
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    vector<float> bayer(Width*Height);
    vector<float> rgb(Width*Height*3);
    for(int i = 0; i < 4; ++i)
    {
        {
            Timer t("serial_orig");
            for(int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_dynamic_orig");
            #pragma omp parallel for
            for(int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_static_orig");
            #pragma omp parallel for schedule(static)
            for(int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }

        {
            Timer t("serial_unrolled");
            for(int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_dynamic_unrolled");
            #pragma omp parallel for
            for(int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_static_unrolled");
            #pragma omp parallel for schedule(static)
            for(int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        printf("-----------------------------\n");
    }
    return 0;
}

Here are the timings I see on a triple-channel 8-way hyperthreaded Core i7-950 box:

serial_orig          :  0.13
omp_dynamic_orig     :  0.10
omp_static_orig      :  0.10
serial_unrolled      :  0.06
omp_dynamic_unrolled :  0.04
omp_static_unrolled  :  0.04

The "static" versions tell the compiler to evenly divide up the work between threads at loop entry. This avoids the overhead of attempting to do work stealing or other dynamic load balancing. For this code snippet, it doesn't seem to make a difference, even though the workload is very uniform across threads.

#6


0  

The performance reduction might be happening because you are trying to distribute the for loop over as many workers as there are rows, far more than the cores available, so it again becomes like a sequential execution with the overhead of parallelism added on top.

#7


0  

I'm not very familiar with parallel for loops, but it seems to me the contention is in the memory access. It appears your threads are overlapping access to the same pages.

Can you break up your array access into 4k chunks somewhat aligned with the page boundary?

#8


0  

There is no point talking about parallel performance before having optimized the for loop for serial code. Here is my attempt at that (some good compilers may be able to produce similarly optimized versions, but I'd rather not rely on that):

    parallel_for(0, Height, [=] (int row) noexcept
    {
        for (auto col=0, bayerindex=row*Width,
                  rgb0=3*bayerindex+offset[(row%2)*2],
                  rgb1=3*bayerindex+offset[(row%2)*2+1];
             col < Width; col+=2, bayerindex+=2, rgb0+=6, rgb1+=6 )
        {
            RgbChannel[rgb0] = BayerChannel[bayerindex  ];
            RgbChannel[rgb1] = BayerChannel[bayerindex+1];
        }
    }); 
