The fastest way to split a 32-bit number into bytes in C++

Time: 2021-02-12 12:04:59

I am writing a piece of code designed to do some data compression on CLSID structures. I'm storing them as a compressed stream of 128 bit integers. However, the code in question has to be able to place invalid CLSIDs into the stream. In order to do this, I have left them as one big string. On disk, it would look something like this:

+--------------------------+-----------------+------------------------+
|                          |                 |                        |
| Length of Invalid String | Invalid String  | Compressed Data Stream |
|                          |                 |                        |
+--------------------------+-----------------+------------------------+

To encode the length of the string, I need to output the 32 bit integer that is the length of the string one byte at a time. Here's my current code:

std::vector<BYTE> compressedBytes;
DWORD invalidLength = (DWORD) invalidClsids.length();
compressedBytes.push_back((BYTE) ( invalidLength        & 0x000000FF));
compressedBytes.push_back((BYTE) ((invalidLength >>= 8) & 0x000000FF));
compressedBytes.push_back((BYTE) ((invalidLength >>= 8) & 0x000000FF));
compressedBytes.push_back((BYTE)  (invalidLength >>= 8));

This code won't be called often, but there will need to be a similar structure in the decoding stage called many thousands of times. I'm curious if this is the most efficient method or if someone can come up with one better?
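For the decoding stage, the inverse of the shift-based encoder above might look like the following sketch. It is only an illustration: the name decodeLength is hypothetical, and std::uint32_t / std::uint8_t stand in for DWORD / BYTE so that it compiles without <windows.h>. It assumes the four bytes were written least significant byte first, as in the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: rebuilds a 32-bit length from four bytes stored
// least significant byte first, starting at `offset` in `bytes`.
std::uint32_t decodeLength(const std::vector<std::uint8_t>& bytes,
                           std::size_t offset = 0)
{
    assert(offset + 4 <= bytes.size()); // caller must supply four bytes
    return  static_cast<std::uint32_t>(bytes[offset])
         | (static_cast<std::uint32_t>(bytes[offset + 1]) << 8)
         | (static_cast<std::uint32_t>(bytes[offset + 2]) << 16)
         | (static_cast<std::uint32_t>(bytes[offset + 3]) << 24);
}
```

Because it only uses shifts and ORs, the result should not depend on the byte order of the machine doing the decoding.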

Thanks all!

Billy3

EDIT: After looking over some of the answers, I created this mini test program to see which was the fastest:

// temp.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <windows.h>
#include <cstdio>   // std::printf
#include <cstdlib>  // system
#include <ctime>
#include <iostream>
#include <string>   // std::string
#include <vector>

void testAssignedShifts();
void testRawShifts();
void testUnion();

int _tmain(int argc, _TCHAR* argv[])
{
    std::clock_t startTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testAssignedShifts();
    }
    std::clock_t assignedShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testRawShifts();
    }
    std::clock_t rawShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testUnion();
    }
    std::clock_t unionFinishedTime = std::clock();
    std::printf(
        "Execution time for assigned shifts: %08u clocks\n"
        "Execution time for raw shifts:      %08u clocks\n"
        "Execution time for union:           %08u clocks\n\n",
        assignedShiftsFinishedTime - startTime,
        rawShiftsFinishedTime - assignedShiftsFinishedTime,
        unionFinishedTime - rawShiftsFinishedTime);
    startTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testAssignedShifts();
    }
    assignedShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testRawShifts();
    }
    rawShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testUnion();
    }
    unionFinishedTime = std::clock();
    std::printf(
        "Execution time for assigned shifts: %08u clocks\n"
        "Execution time for raw shifts:      %08u clocks\n"
        "Execution time for union:           %08u clocks\n\n"
        "Finished. Terminate!\n\n",
        assignedShiftsFinishedTime - startTime,
        rawShiftsFinishedTime - assignedShiftsFinishedTime,
        unionFinishedTime - rawShiftsFinishedTime);

    system("pause");
    return 0;
}

void testAssignedShifts()
{
    std::string invalidClsids("This is a test string");
    std::vector<BYTE> compressedBytes;
    DWORD invalidLength = (DWORD) invalidClsids.length();
    compressedBytes.push_back((BYTE)  invalidLength);
    compressedBytes.push_back((BYTE) (invalidLength >>= 8));
    compressedBytes.push_back((BYTE) (invalidLength >>= 8));
    compressedBytes.push_back((BYTE) (invalidLength >>= 8));
}
void testRawShifts()
{
    std::string invalidClsids("This is a test string");
    std::vector<BYTE> compressedBytes;
    DWORD invalidLength = (DWORD) invalidClsids.length();
    compressedBytes.push_back((BYTE) invalidLength);
    compressedBytes.push_back((BYTE) (invalidLength >>  8));
    compressedBytes.push_back((BYTE) (invalidLength >>  16));
    compressedBytes.push_back((BYTE) (invalidLength >>  24));
}

typedef union _choice
{
    DWORD dwordVal;
    BYTE bytes[4];
} choice;

void testUnion()
{
    std::string invalidClsids("This is a test string");
    std::vector<BYTE> compressedBytes;
    choice invalidLength;
    invalidLength.dwordVal = (DWORD) invalidClsids.length();
    compressedBytes.push_back(invalidLength.bytes[0]);
    compressedBytes.push_back(invalidLength.bytes[1]);
    compressedBytes.push_back(invalidLength.bytes[2]);
    compressedBytes.push_back(invalidLength.bytes[3]);
}

Running this a few times results in:

Execution time for assigned shifts: 00012484 clocks
Execution time for raw shifts:      00012578 clocks
Execution time for union:           00013172 clocks

Execution time for assigned shifts: 00012594 clocks
Execution time for raw shifts:      00013140 clocks
Execution time for union:           00012782 clocks

Execution time for assigned shifts: 00012500 clocks
Execution time for raw shifts:      00012515 clocks
Execution time for union:           00012531 clocks

Execution time for assigned shifts: 00012391 clocks
Execution time for raw shifts:      00012469 clocks
Execution time for union:           00012500 clocks

Execution time for assigned shifts: 00012500 clocks
Execution time for raw shifts:      00012562 clocks
Execution time for union:           00012422 clocks

Execution time for assigned shifts: 00012484 clocks
Execution time for raw shifts:      00012407 clocks
Execution time for union:           00012468 clocks

Looks to be about a tie between assigned shifts and union. Since I'm going to need the value later, union it is! Thanks!

Billy3

7 answers

#1


Just use a union:

assert(sizeof (DWORD) == sizeof (BYTE[4]));   // Sanity check

union either {
    DWORD dw;
    struct {
         BYTE b[4];
    } bytes;
};

either invalidLength;
invalidLength.dw = (DWORD) invalidClsids.length();
// Access the members through the variable (invalidLength), not the type name (either).
compressedBytes.push_back(invalidLength.bytes.b[0]);
compressedBytes.push_back(invalidLength.bytes.b[1]);
compressedBytes.push_back(invalidLength.bytes.b[2]);
compressedBytes.push_back(invalidLength.bytes.b[3]);

NOTE: Unlike the bit-shifting approach in the original question, this code produces endian-dependent output. This matters only if output from a program running on one computer will be read on a computer with different endianness -- but as there seems to be no measurable speed increase from using this method, you might as well use the more portable bit-shifting approach, just in case.

#2


This is probably as optimized as you'll get. Bit-twiddling operations are some of the fastest available on the processor.

It may be faster to shift by >> 16 and >> 24 instead of repeating >>= 8, since that cuts out an assignment.

Also, I don't think you need the & mask: since you're casting to a BYTE (which should be an 8-bit unsigned char), the value gets truncated appropriately anyway. (Does it? Correct me if I'm wrong.)

All in all, though, these are really minor changes. Profile it to see if it actually makes a difference :P
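On the question of whether the mask is needed: converting an unsigned value to a narrower unsigned type keeps only the low-order bits, so the cast alone should be enough. A small sketch to check that, using std::uint32_t / std::uint8_t as stand-ins for DWORD / BYTE (the function names are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Byte `index` (0 = least significant) of `value`, using only a shift and
// a narrowing cast, with no explicit & 0xFF mask.
std::uint8_t byteByCast(std::uint32_t value, unsigned index)
{
    return static_cast<std::uint8_t>(value >> (8 * index));
}

// The same byte, extracted with an explicit mask before the cast.
std::uint8_t byteByMask(std::uint32_t value, unsigned index)
{
    return static_cast<std::uint8_t>((value >> (8 * index)) & 0xFFu);
}

// Asserts that both variants agree for all four bytes of `value`.
void checkTruncation(std::uint32_t value)
{
    for (unsigned i = 0; i < 4; ++i)
        assert(byteByCast(value, i) == byteByMask(value, i));
}
```

Whether dropping the mask changes the generated code at all is something to confirm with a profiler or by looking at the compiler output.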

#3


You should measure rather than guess at any potential improvement but my first thought is that it may be faster to do a union as follows:

typedef union {
    DWORD d;
    struct {
        BYTE b0;
        BYTE b1;
        BYTE b2;
        BYTE b3;
    } b;
} DWB;

std::vector<BYTE> compBytes;
DWB invLen;
invLen.d = (DWORD) invalidClsids.length();
// Use the declared variable name (invLen) when reading the bytes back.
compBytes.push_back(invLen.b.b3);
compBytes.push_back(invLen.b.b2);
compBytes.push_back(invLen.b.b1);
compBytes.push_back(invLen.b.b0);

That may be the right order for the pushbacks but check just in case - it depends on the endian-ness of the CPU.
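One way to do that check at run time is to look at which byte of a known value sits at the lowest address. The sketch below is an assumption-laden illustration, not part of the answer's code: the names hostIsLittleEndian and appendLsbFirst are hypothetical, and std::uint32_t / std::uint8_t stand in for DWORD / BYTE.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// True when the least significant byte of a 32-bit integer is stored at the
// lowest address (little-endian), as on x86.
bool hostIsLittleEndian()
{
    const std::uint32_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1); // read the byte at the lowest address
    return first == 1;
}

// Appends the in-memory bytes of `value`, reversed on big-endian hosts so
// that the least significant byte always comes first in `out`.
void appendLsbFirst(std::vector<std::uint8_t>& out, std::uint32_t value)
{
    std::uint8_t raw[4];
    std::memcpy(raw, &value, sizeof raw);
    if (!hostIsLittleEndian())
        std::reverse(raw, raw + 4);
    out.insert(out.end(), raw, raw + 4);
}
```

This only distinguishes plain little-endian from plain big-endian layouts; mixed-endian machines are ignored here.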

#4


A really quick way is to treat a DWORD* (a single-element array) as a BYTE* (a four-element array). The code is also a lot more readable.

Warning: I haven't compiled this

Warning: This makes your code dependent on byte ordering

std::vector<BYTE> compressedBytes;
DWORD invalidLength = (DWORD) invalidClsids.length();
// A DWORD* does not convert to BYTE* implicitly, so an explicit cast is needed.
BYTE* lengthParts = reinterpret_cast<BYTE*>(&invalidLength);
static const int kLengthPartsLength = sizeof(DWORD) / sizeof(BYTE);
for(int i = 0; i < kLengthPartsLength; ++i)
    compressedBytes.push_back(lengthParts[i]);

#5


compressedBytes.push_back(invalidLength.bytes.b[0]);
compressedBytes.push_back(invalidLength.bytes.b[1]);
compressedBytes.push_back(invalidLength.bytes.b[2]);
compressedBytes.push_back(invalidLength.bytes.b[3]);

There is an even smarter and faster way! Let's see what this code is doing and how we can improve it.

This code is serializing the integer, one byte at a time. For each byte it's calling push_back, which is checking the free space in the internal vector buffer. If we have no room for another byte, memory reallocation will happen (hint, slow!). Granted, the reallocation will not happen frequently (reallocations typically happen by doubling the existing buffer). Then, the new byte is copied and the internal size is increased by one.

The standard requires vector<> to keep its internal buffer contiguous. vector<> also has operator[] (), which returns a reference to the element, so taking the address of the first element (&v[0]) gives a pointer into that contiguous buffer.

So, here is the best code you can come up with:

std::string invalidClsids("This is a test string");
std::vector<BYTE> compressedBytes;
DWORD invalidLength = (DWORD) invalidClsids.length();
compressedBytes.resize(sizeof(DWORD)); // You probably want to make this much larger, to avoid resizing later.
// compressedBytes is as large as the length we want to serialize.
BYTE* p = &compressedBytes[0]; // This is valid code and designed by the standard for such cases. p points to a buffer that is at least as large as a DWORD.
*((DWORD*)p) = invalidLength;  // Copy all bytes in one go!

The above cast can be done in one go with the &compressedBytes[0] statement, but it won't be faster. This is more readable.

NOTE! Serializing this way (or even with the union method) is endian-dependent. That is, on an Intel/AMD processor the least significant byte will come first, while on a big-endian machine (PowerPC, Motorola...) the most significant byte will come first. If you want to be neutral, you must use a math method (shifts).
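If the per-byte push_back overhead described above is the concern but endian-neutral output is still wanted, the two ideas can be combined: compute the four bytes with shifts, then append them with a single range insert. This is a sketch under assumptions, with a hypothetical function name (appendLength) and std::uint32_t / std::uint8_t standing in for DWORD / BYTE.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Appends `value` least significant byte first. The bytes come from shifts,
// so the output does not depend on host byte order, and the single range
// insert lets the vector grow at most once instead of checking its capacity
// on each of four push_back calls.
void appendLength(std::vector<std::uint8_t>& out, std::uint32_t value)
{
    const std::uint8_t bytes[4] = {
        static_cast<std::uint8_t>(value),
        static_cast<std::uint8_t>(value >> 8),
        static_cast<std::uint8_t>(value >> 16),
        static_cast<std::uint8_t>(value >> 24)
    };
    out.insert(out.end(), bytes, bytes + 4);
}
```

Calling out.reserve() with a rough estimate of the final stream size before the first append should avoid most reallocations later on. Whether any of this is measurably faster than four push_back calls would need profiling.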

#6


Do you have to do it one byte at a time? Is there a way you could just memcpy() the whole 32 bits into the stream in one fell swoop? If you have the address of the buffer you're writing to the stream, can you just copy into that?
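A rough sketch of what that could look like, assuming the stream is backed by a std::vector of bytes. The names writeRaw32 and readRaw32 are hypothetical, and std::uint32_t / std::uint8_t stand in for DWORD / BYTE. Unlike writing through a cast pointer, memcpy has no alignment requirement on the destination, but the bytes land in the host's byte order, so the reader must run on a machine with the same endianness.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies the four bytes of `value` into `out` starting at `offset`, growing
// the vector first if it is too small. Byte order is the host's byte order.
void writeRaw32(std::vector<std::uint8_t>& out, std::size_t offset,
                std::uint32_t value)
{
    if (out.size() < offset + sizeof value)
        out.resize(offset + sizeof value);
    std::memcpy(&out[offset], &value, sizeof value);
}

// Reads the value back, with the same host-byte-order assumption.
std::uint32_t readRaw32(const std::vector<std::uint8_t>& in, std::size_t offset)
{
    assert(offset + sizeof(std::uint32_t) <= in.size());
    std::uint32_t value = 0;
    std::memcpy(&value, &in[offset], sizeof value);
    return value;
}
```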

#7


Perhaps it's possible to take a pointer to the 32-bit variable, convert it to a char pointer and read a char, then add 1 to the pointer and read the next char... Just a theory :) I don't know if it works.
