How can I optimize copying blocks of an array in C#?

I am writing a live-video imaging application and need to speed up this method. It's currently taking about 10ms to execute and I'd like to get it down to 2-3ms.

I've tried both Array.Copy and Buffer.BlockCopy and they both take ~30ms which is 3x longer than the manual copy.

One thought was to somehow copy 4 bytes as an integer and then paste them as an integer, thereby reducing 4 lines of code to one line of code. However, I'm not sure how to do that.

Another thought was to somehow use pointers and unsafe code to do this, but I'm not sure how to do that either.

All help is much appreciated. Thank you!

EDIT: Array sizes are: inputBuffer[327680], lookupTable[16384], outputBuffer[1310720]

public byte[] ApplyLookupTableToBuffer(byte[] lookupTable, ushort[] inputBuffer)
{
    System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
    sw.Start();

    // Precalculate and initialize the variables
    int lookupTableLength = lookupTable.Length;
    int bufferLength = inputBuffer.Length;
    byte[] outputBuffer = new byte[bufferLength * 4];
    int outIndex = 0;
    int curPixelValue = 0;

    // For each pixel in the input buffer...
    for (int curPixel = 0; curPixel < bufferLength; curPixel++)
    {
        outIndex = curPixel * 4;                    // Calculate the corresponding index in the output buffer
        curPixelValue = inputBuffer[curPixel] * 4;  // Retrieve the pixel value and multiply by 4 since the lookup table has 4 values (blue/green/red/alpha) for each pixel value

        // If the multiplied pixel value falls within the lookup table...
        if ((curPixelValue + 3) < lookupTableLength)
        {
            // Copy the lookup table value associated with the value of the current input buffer location to the output buffer
            outputBuffer[outIndex + 0] = lookupTable[curPixelValue + 0];
            outputBuffer[outIndex + 1] = lookupTable[curPixelValue + 1];
            outputBuffer[outIndex + 2] = lookupTable[curPixelValue + 2];
            outputBuffer[outIndex + 3] = lookupTable[curPixelValue + 3];

            //System.Buffer.BlockCopy(lookupTable, curPixelValue, outputBuffer, outIndex, 4);   // Takes 2-10x longer than just copying the values manually
            //Array.Copy(lookupTable, curPixelValue, outputBuffer, outIndex, 4);                // Takes 2-10x longer than just copying the values manually
        }
    }

    Debug.WriteLine("ApplyLookupTableToBuffer(ms): " + sw.Elapsed.TotalMilliseconds.ToString("N2"));
    return outputBuffer;
}

EDIT: I've updated the method keeping the same variable names so others can see how the code would translate based on HABJAN's solution below.

    public byte[] ApplyLookupTableToBufferV2(byte[] lookupTable, ushort[] inputBuffer)
    {
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();

        // Precalculate and initialize the variables
        int lookupTableLength = lookupTable.Length;
        int bufferLength = inputBuffer.Length;
        byte[] outputBuffer = new byte[bufferLength * 4];
        //int outIndex = 0;
        int curPixelValue = 0;

        unsafe
        {
            fixed (byte* pointerToOutputBuffer = &outputBuffer[0])
            fixed (byte* pointerToLookupTable = &lookupTable[0])
            {
                // Cast to integer pointers since groups of 4 bytes get copied at once
                uint* lookupTablePointer = (uint*)pointerToLookupTable;
                uint* outputBufferPointer = (uint*)pointerToOutputBuffer;

                // For each pixel in the input buffer...
                for (int curPixel = 0; curPixel < bufferLength; curPixel++)
                {
                    // No need to multiply by 4 on the following 2 lines since the pointers are for integers, not bytes
                    // outIndex = curPixel;  // This line is commented since we can use curPixel instead of outIndex
                    curPixelValue = inputBuffer[curPixel];  // Retrieve the pixel value 

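                    // The pointer indexes 4-byte entries, so scale the index back to bytes before comparing it with the table's byte length
                    if ((curPixelValue * 4 + 3) < lookupTableLength)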
                    {
                        outputBufferPointer[curPixel] = lookupTablePointer[curPixelValue];
                    }
                }
            }
        }

        Debug.WriteLine("2 ApplyLookupTableToBuffer(ms): " + sw.Elapsed.TotalMilliseconds.ToString("N2"));
        return outputBuffer;
    }

1 Answer

#1

I ran some tests and got the best speed by making my code unsafe and calling the RtlMoveMemory API. In my tests, Buffer.BlockCopy and Array.Copy were much slower than calling RtlMoveMemory directly.

So you will end up with something like this:

// ptrInput is assumed to be a pinned pointer to the source bytes, obtained the same way
fixed (byte* ptrOutput = &outputBuffer[0])
{
    MoveMemory(ptrOutput, ptrInput, 4);
}

[DllImport("Kernel32.dll", EntryPoint = "RtlMoveMemory", SetLastError = false)]
private static unsafe extern void MoveMemory(void* dest, void* src, int size);

EDIT:

OK, now that I have figured out your logic and run some tests, I managed to speed up your method by almost 50%. Since you only need to copy small blocks of data (always 4 bytes), you were right: RtlMoveMemory won't help here, and it's better to copy the data as an integer. Here is the final solution I came up with:

public static byte[] ApplyLookupTableToBufferV2(byte[] lookupTable, ushort[] inputBuffer)
{
    int lookupTableLength = lookupTable.Length;
    int bufferLength = inputBuffer.Length;
    byte[] outputBuffer = new byte[bufferLength * 4];
    int outIndex = 0, curPixelValue = 0;

    unsafe
    {
        fixed (byte* ptrOutput = &outputBuffer[0])
        fixed (byte* ptrLookup = &lookupTable[0])
        {
            uint* lkp = (uint*)ptrLookup;
            uint* opt = (uint*)ptrOutput;

            for (int index = 0; index < bufferLength; index++)
            {
                outIndex = index;
                curPixelValue = inputBuffer[index];

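                // lkp indexes 4-byte entries, so scale the index back to bytes before comparing it with the table's byte length
                if ((curPixelValue * 4 + 3) < lookupTableLength)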
                {
                    opt[outIndex] = lkp[curPixelValue];
                }
            }
        }
    }

    return outputBuffer;
}
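
For anyone who cannot enable unsafe code, the same "copy 4 bytes as one integer" idea can probably be expressed with spans. This is only a sketch, not something I benchmarked: it assumes a runtime where MemoryMarshal.Cast and AsSpan are available (.NET Core 2.1 or later, or the System.Memory package), and the class and method names are made up for illustration.

```csharp
using System;
using System.Runtime.InteropServices;

public static class LookupTableSketch
{
    // Hypothetical name; same inputs and output layout as ApplyLookupTableToBufferV2.
    public static byte[] ApplyLookupTableToBufferSpan(byte[] lookupTable, ushort[] inputBuffer)
    {
        byte[] outputBuffer = new byte[inputBuffer.Length * 4];

        // Reinterpret both byte arrays as sequences of 4-byte entries (no data is copied here).
        ReadOnlySpan<uint> lkp = MemoryMarshal.Cast<byte, uint>(lookupTable.AsSpan());
        Span<uint> opt = MemoryMarshal.Cast<byte, uint>(outputBuffer.AsSpan());

        for (int index = 0; index < inputBuffer.Length; index++)
        {
            int curPixelValue = inputBuffer[index];

            // lkp.Length is the number of whole 4-byte entries in the table,
            // so out-of-range pixel values leave the output bytes at zero.
            if (curPixelValue < lkp.Length)
            {
                opt[index] = lkp[curPixelValue];
            }
        }

        return outputBuffer;
    }
}
```

Span indexing is bounds-checked, so this version may be somewhat slower than the pointer version; measure it on your own data before relying on it.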

I renamed your method to ApplyLookupTableToBufferV1.

And here are my test results:

int tc1 = Environment.TickCount;

for (int i = 0; i < 200; i++)
{
    byte[] a = ApplyLookupTableToBufferV1(lt, ib);
}

tc1 = Environment.TickCount - tc1;

Console.WriteLine("V1: " + tc1.ToString() + "ms");

Result - V1: 998 ms

int tc2 = Environment.TickCount;

for (int i = 0; i < 200; i++)
{
    byte[] a = ApplyLookupTableToBufferV2(lt, ib);
}

tc2 = Environment.TickCount - tc2;

Console.WriteLine("V2: " + tc2.ToString() + "ms");

Result - V2: 473 ms
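
One caveat about the timings above: Environment.TickCount is limited to the resolution of the system timer, typically around 10 to 16 ms, so it is only meaningful over many iterations. If you want per-call numbers, a Stopwatch-based helper along these lines might be more precise. It is a rough sketch; TimingSketch and AverageMilliseconds are names I made up, not part of any library.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Hypothetical helper: returns the average milliseconds per call over 'iterations' timed runs.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        // One untimed warm-up call so JIT compilation is not counted.
        action();

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds / iterations;
    }
}
```

It could be called as TimingSketch.AverageMilliseconds(() => ApplyLookupTableToBufferV2(lt, ib), 200), with lt and ib being the same test arrays used above.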
