memcpy takes the same time as memset

Date: 2022-09-01 08:24:30

I want to measure memory bandwidth using memcpy. I modified the code from this answer: why vectorizing the loop does not have performance improvement, which used memset to measure the bandwidth. The problem is that memcpy is only slightly slower than memset, when I expect it to be about two times slower since it operates on twice the memory.

More specifically, I run over 1 GB arrays a and b (allocated with calloc) 100 times with the following operations.

operation             time(s)
-----------------------------
memset(a,0xff,LEN)    3.7
memcpy(a,b,LEN)       3.9
a[j] += b[j]          9.4
memcpy(a,b,LEN)       3.8

Notice that memcpy is only slightly slower than memset. The operations a[j] += b[j] (where j goes over [0,LEN)) should take three times longer than memset because they operate on three times as much data. However, they are only about 2.5 times as slow as memset.
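
For reference, here is a quick back-of-the-envelope sketch of the bandwidths the table implies, counting only the bytes the C code explicitly touches and plugging in the timings above (the numbers are assumptions taken from the table, not a new measurement):

#include <stdio.h>

int main(void) {
    /* Bandwidth implied by the timings above: 100 passes over 1 GiB,
       counting only the bytes the C code explicitly reads/writes. */
    const double GiB = 1024.0 * 1024.0 * 1024.0;
    const struct { const char *op; double bytes_per_pass, seconds; } t[] = {
        { "memset      ", 1 * GiB, 3.7 },
        { "memcpy      ", 2 * GiB, 3.9 },
        { "a[j] += b[j]", 3 * GiB, 9.4 },
    };
    for (int i = 0; i < 3; i++)
        printf("%s : %5.1f GB/s\n", t[i].op,
               100.0 * t[i].bytes_per_pass / t[i].seconds / 1e9);
    /* Prints roughly 29, 55 and 34 GB/s.  The memcpy figure is the
       suspicious one: it is well above what dual-channel DDR4 can
       sustain, which already hints that b is not really being read
       from DRAM. */
    return 0;
}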

Then I initialized b to zero with memset(b,0,LEN) and tested again:

operation             time(s)
-----------------------------
memcpy(a,b,LEN)       8.2
a[j] += b[j]          11.5

Now we see that memcpy is about twice as slow as memset and a[j] += b[j] is about three times as slow as memset, as I expect.

At the very least I would have expected that, before memset(b,0,LEN), memcpy would be slower because of lazy allocation (first touch) on the first of the 100 iterations.

Why do I only get the time I expect after memset(b,0,LEN)?

test.c

#include <time.h>
#include <string.h>
#include <stdio.h>

void tests(char *a, char *b, const int LEN){
    clock_t time0, time1;
    time0 = clock();
    for (int i = 0; i < 100; i++) memset(a,0xff,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    memset(b,0,LEN);
    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
}

main.c

#include <stdlib.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30;    //  1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    tests(a, b, LEN);
}

Compile with gcc -O3 test.c main.c (GCC 6.2). Clang 3.8 gives essentially the same result.

Test system: i7-6700HQ@2.60GHz (Skylake), 32 GB DDR4, Ubuntu 16.10. On my Haswell system the bandwidths make sense before memset(b,0,LEN) i.e. I only see a problem on my Skylake system.

I first discovered this issue from the a[j] += b[k] operations in this answer which was overestimating the bandwidth.

I came up with a simpler test

#include <time.h>
#include <string.h>
#include <stdio.h>

void __attribute__ ((noinline))  foo(char *a, char *b, const int LEN) {
  for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
}

void tests(char *a, char *b, const int LEN) {
    foo(a, b, LEN);
    memset(b,0,LEN);
    foo(a, b, LEN);
}

This outputs:

9.472976
12.728426

However, if I do memset(b,1,LEN) in main after calloc (see below) then it outputs

12.5
12.5

This leads me to think this is an OS allocation issue and not a compiler issue.

#include <stdlib.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30;    //  1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    //GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
    memset(b,1,LEN);
    tests(a, b, LEN);
}

2 Answers

#1


The point is that malloc and calloc on most platforms don't allocate memory; they allocate address space.

malloc etc work by:

  • if the request can be fulfilled by the freelist, carve a chunk out of it
    • in case of calloc: the equivalent of memset(ptr, 0, size) is issued
  • if not: ask the OS to extend the address space.

For systems with demand paging (COW) (an MMU could help here), the second option winds down to:

  • create enough page table entries for the request, and fill them with a (COW) reference to /dev/zero
  • add these PTEs to the address space of the process

This will consume no physical memory, except for the page tables.

  • Once the new memory is referenced for read, the read will come from /dev/zero. The /dev/zero device is a very special device, in this case mapped to every page of the new memory.
  • but, if the new page is written, the COW logic kicks in (via a page fault):
    • physical memory is allocated
    • the /dev/zero page is copied to the new page
    • the new page is detached from the mother page
    • and the calling process can finally do the update which started all this
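
The lazy-allocation behaviour described above is easy to observe directly. Below is a minimal, Linux-only sketch (not from the question) that creates an anonymous zero-filled mapping, which is roughly what a 1 GiB calloc boils down to, and uses mincore to count resident pages before and after the first write:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t len = 1UL << 30;                 /* 1 GiB, as in the question */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = len / page;
    unsigned char *vec = malloc(npages);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED || vec == NULL) return 1;

    size_t resident = 0;
    mincore(p, len, vec);                   /* nothing has been touched yet */
    for (size_t i = 0; i < npages; i++) resident += vec[i] & 1;
    printf("resident pages before writing: %zu\n", resident);  /* ~0 */

    memset(p, 1, len);                      /* first touch: faults in every page */

    resident = 0;
    mincore(p, len, vec);
    for (size_t i = 0; i < npages; i++) resident += vec[i] & 1;
    printf("resident pages after writing:  %zu\n", resident);  /* ~npages */

    munmap(p, len);
    free(vec);
    return 0;
}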

#2


Your b array probably was not written after mmap-ing (huge allocation requests with malloc/calloc are usually converted into mmap), and the whole array was mmapped to a single read-only "zero page" (part of the COW mechanism). Reading zeroes from a single page is faster than reading from many pages, as that single page will stay in the cache and in the TLB. This explains why the test before memset(0) was faster:

This outputs: 9.472976 12.728426

However, if I do memset(b,1,LEN) in main after calloc (see below) then it outputs: 12.5 12.5
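
As an aside, a full memset(b,1,LEN) is not strictly needed to get honest numbers: dirtying one byte per page is enough to break the COW mapping and give every page of b its own physical frame. A hypothetical helper (fault_in is not part of the original code) might look like this:

#include <stddef.h>
#include <unistd.h>

/* Touch one byte per page so the buffer is backed by real, distinct
   physical pages before the timed loops run.  A write is required; a
   read would still be satisfied by the shared zero page. */
static void fault_in(char *buf, size_t len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < len; off += page)
        buf[off] = 1;
    if (len) buf[len - 1] = 1;              /* make sure the last page is hit */
}

Calling fault_in(b, LEN) before tests should give essentially the same ~12.5 s figures as the explicit memset(b,1,LEN).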

And more about gcc's malloc+memset / calloc+memset optimization into calloc (expanded from my comment)

//GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.

This optimization was proposed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57742 (tree-optimization PR57742) on 2013-06-27 by Marc Glisse (https://*.com/users/1918193?), planned for GCC 4.9/5.0:

memset(malloc(n),0,n) -> calloc(n,1)

calloc can sometimes be significantly faster than malloc+bzero because it has special knowledge that some memory is already zero. When other optimizations simplify some code to malloc+memset(0), it would thus be nice to replace it with calloc. Sadly, I don't think there is a way to do a similar optimization in C++ with new, which is where such code most easily appears (creating std::vector(10000) for instance). And there would also be the complication there that the size of the memset would be a bit smaller than that of the malloc (using calloc would still be fine, but it gets harder to know if it is an improvement).

Implemented on 2014-06-24 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57742#c15) - https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=211956 (also https://patchwork.ozlabs.org/patch/325357/)

  • tree-ssa-strlen.c ... (handle_builtin_malloc, handle_builtin_memset): New functions.

The current code is in gcc/tree-ssa-strlen.c (https://github.com/gcc-mirror/gcc/blob/7a31ada4c400351a35ab65f8dc0357e7c88805d5/gcc/tree-ssa-strlen.c#L1889): if the memset(0) gets its pointer from malloc or calloc, it will convert the malloc into calloc, and the memset(0) will then be removed:

/* Handle a call to memset.
   After a call to calloc, memset(,0,) is unnecessary.
   memset(malloc(n),0,n) is calloc(n,1).  */
static bool
handle_builtin_memset (gimple_stmt_iterator *gsi)
 ...
  if (code1 == BUILT_IN_CALLOC)
    /* Not touching stmt1 */ ;
  else if (code1 == BUILT_IN_MALLOC
       && operand_equal_p (gimple_call_arg (stmt1, 0), size, 0))
    {
      gimple_stmt_iterator gsi1 = gsi_for_stmt (stmt1);
      update_gimple_call (&gsi1, builtin_decl_implicit (BUILT_IN_CALLOC), 2,
              size, build_one_cst (size_type_node));
      si1->length = build_int_cst (size_type_node, 0);
      si1->stmt = gsi_stmt (gsi1);
    }
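
As a self-contained illustration of that transformation (a sketch, not from the question; whether it fires depends on the GCC version and optimization level):

#include <stdlib.h>
#include <string.h>

/* With gcc -O3 (4.9 or later) the pair below is typically folded into a
   single call to calloc, so the memset -- and the page faults it would
   cause -- disappear.  Error handling is omitted to keep the pattern
   memset(malloc(n),0,n) recognizable. */
char *zeroed_buffer(size_t n) {
    char *p = malloc(n);
    memset(p, 0, n);
    return p;
}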

This was discussed on the gcc-patches mailing list from Mar 1, 2014 to Jul 15, 2014 with the subject "calloc = malloc + memset",

with a notable comment from Andi Kleen (http://halobates.de/blog/, https://github.com/andikleen): https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01818.html

FWIW i believe the transformation will break a large variety of micro benchmarks.

calloc internally knows that memory fresh from the OS is zeroed. But the memory may not be faulted in yet.

memset always faults in the memory.

So if you have some test like

   buf = malloc(...)
   memset(buf, ...) 
   start = get_time();
   ... do something with buf
   end = get_time()

Now the times will be completely off because the measured times includes the page faults.

Marc replied "Good point. I guess working around compiler optimizations is part of the game for micro benchmarks, and their authors would be disappointed if the compiler didn't mess it up regularly in new and entertaining ways ;-)" and Andi asked: "I would prefer to not do it. I'm not sure it has a lot of benefit. If you want to keep it please make sure there is an easy way to turn it off."

Marc shows how to turn this optimization off: https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01834.html

Any of these flags works:

  • -fdisable-tree-strlen
  • -fno-builtin-malloc
  • -fno-builtin-memset (assuming you wrote 'memset' explicitly in your code)
  • -fno-builtin
  • -ffreestanding
  • -O1
  • -Os

In the code, you can hide that the pointer passed to memset is the one returned by malloc by storing it in a volatile variable, or any other trick to hide from the compiler that we are doing memset(malloc(n),0,n).
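
For example, here is a sketch of that volatile trick (a hypothetical helper, not from the thread; it defeats the pattern match in practice, though nothing in the standard guarantees it):

#include <stdlib.h>
#include <string.h>

/* Storing the malloc result in a volatile-qualified pointer object hides
   the malloc -> memset data flow from the strlen pass, so the memset
   (and the page faults it triggers) are kept. */
char *zeroed_buffer_kept(size_t n) {
    char *volatile p = malloc(n);
    memset(p, 0, n);
    return p;
}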
