Fastest way to zero pages in Linux

I need to clear large address ranges (on the order of 200 pages at a time) in Linux. I tried two approaches:

1. Use memset. This is the simplest way to clear the address range. It performed a bit slower than method 2.

2. Use munmap/mmap. I called munmap on the address range, then mmap'd the same address again with the same permissions. Since MAP_ANONYMOUS is passed, the pages are cleared.

The second method makes a benchmark run 5-10% faster. The benchmark, of course, does a lot more than just clear the pages. If I understand correctly, this is because the operating system has a pool of zeroed pages which it maps to the address range.

But I don't like this approach, because the munmap and mmap pair is not atomic: another mmap (with NULL as its first argument) done simultaneously could be handed my address range between the two calls, rendering it unusable for me.

So my question is: does Linux provide a system call that can swap out the physical pages for an address range with zero pages?

I tried to look at the source of glibc (specifically memset) to see if they use any technique to do this efficiently. But I couldn't find anything.

3 Answers

#1

memset() appears to be about an order of magnitude faster than mmap() to get a new zero-filled page, at least on the Solaris 11 server I have access to right now. I strongly suspect that Linux will produce similar results.

I wrote a small benchmark program:

#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <strings.h>

#include <sys/time.h>

#define NUM_BLOCKS ( 512 * 1024 )
#define BLOCKSIZE ( 4 * 1024 )

int main( int argc, char **argv )
{
    int ii;

    char *blocks[ NUM_BLOCKS ];

    hrtime_t start = gethrtime();

    for ( ii = 0; ii < NUM_BLOCKS; ii++ )
    {
        blocks[ ii ] = mmap( NULL, BLOCKSIZE,
            PROT_READ | PROT_WRITE,
            MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
        // force the creation of the mapping
        blocks[ ii ][ ii % BLOCKSIZE ] = ii;
    }

    printf( "setup time:    %lf sec\n",
        ( gethrtime() - start ) / 1000000000.0 );

    for ( int jj = 0; jj < 4; jj++ )
    {
        start = gethrtime();

        for ( ii = 0; ii < NUM_BLOCKS; ii++ )
        {
            blocks[ ii ] = mmap( blocks[ ii ],
                BLOCKSIZE, PROT_READ | PROT_WRITE,
                MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
            blocks[ ii ][ ii % BLOCKSIZE ] = 0;
        }

        printf( "mmap() time:   %lf sec\n",
            ( gethrtime() - start ) / 1000000000.0 );
        start = gethrtime();

        for ( ii = 0; ii < NUM_BLOCKS; ii++ )
        {
            memset( blocks[ ii ], 0, BLOCKSIZE );
        }

        printf( "memset() time: %lf sec\n",
            ( gethrtime() - start ) / 1000000000.0 );
    }

    return( 0 );
}

Note that writing a single byte anywhere in the page is all that's needed to force the creation of the physical page.

I ran it on my Solaris 11 file server (the only POSIX-style system I have running on bare metal right now). I didn't test madvise() on my Solaris system because Solaris, unlike Linux, doesn't guarantee that the mapping will be repopulated with zero-filled pages, only that "the system starts to free the resources".

The results:

setup time:    11.144852 sec
mmap() time:   15.159650 sec
memset() time: 1.817739 sec
mmap() time:   15.029283 sec
memset() time: 1.788925 sec
mmap() time:   15.083473 sec
memset() time: 1.780283 sec
mmap() time:   15.201085 sec
memset() time: 1.771827 sec

memset() is almost an order of magnitude faster. When I get a chance, I'll rerun that benchmark on Linux, but it'll likely have to be on a VM (AWS, etc.).

That's not surprising - mmap() is expensive, and the kernel still needs to zero the pages at some time.

Interestingly, commenting out one line

        for ( ii = 0; ii < NUM_BLOCKS; ii++ )
        {
            blocks[ ii ] = mmap( blocks[ ii ],
                BLOCKSIZE, PROT_READ | PROT_WRITE,
                MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
            //blocks[ ii ][ ii % BLOCKSIZE ] = 0;
        }

produces these results:

setup time:    10.962788 sec
mmap() time:   7.524939 sec
memset() time: 10.418480 sec
mmap() time:   7.512086 sec
memset() time: 10.406675 sec
mmap() time:   7.457512 sec
memset() time: 10.296231 sec
mmap() time:   7.420942 sec
memset() time: 10.414861 sec

The burden of forcing the creation of the physical mapping has shifted to the memset() call, leaving only the implicit munmap() in the test loops, where the mappings are destroyed when the MAP_FIXED mmap() call replaces them. Note that just the munmap() takes about 3-4 times longer than keeping the pages in the address space and memset()'ing them to zeros.

The cost of mmap() isn't really the mmap()/munmap() system call itself; it's that the new page requires a lot of behind-the-scenes CPU cycles to create the actual physical mapping. That doesn't happen in the mmap() system call itself. It happens afterwards, when the process accesses the memory page.

If you doubt the results, note this LKML post from Linus Torvalds himself:

...

HOWEVER, playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvment.

Downsides to mmap:

quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.

...

Profiling the code using Solaris Studio's collect and analyzer tools produced the following output:

Source File: mm.c

Inclusive        Inclusive        Inclusive         
Total CPU Time   Sync Wait Time   Sync Wait Count   Name
sec.             sec.                               
                                                      1. #include <stdio.h>
                                                      2. #include <sys/mman.h>
                                                      3. #include <string.h>
                                                      4. #include <strings.h>
                                                      5. 
                                                      6. #include <sys/time.h>
                                                      7. 
                                                      8. #define NUM_BLOCKS ( 512 * 1024 )
                                                      9. #define BLOCKSIZE ( 4 * 1024 )
                                                     10. 
                                                     11. int main( int argc, char **argv )
                                                         <Function: main>
 0.011           0.               0                  12. {
                                                     13.     int ii;
                                                     14. 
                                                     15.     char *blocks[ NUM_BLOCKS ];
                                                     16. 
 0.              0.               0                  17.     hrtime_t start = gethrtime();
                                                     18. 
 0.129           0.               0                  19.     for ( ii = 0; ii < NUM_BLOCKS; ii++ )
                                                     20.     {
                                                     21.         blocks[ ii ] = mmap( NULL, BLOCKSIZE,
                                                     22.             PROT_READ | PROT_WRITE,
 3.874           0.               0                  23.             MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
                                                     24.         // force the creation of the mapping
 7.928           0.               0                  25.         blocks[ ii ][ ii % BLOCKSIZE ] = ii;
                                                     26.     }
                                                     27. 
                                                     28.     printf( "setup time:    %lf sec\n",
 0.              0.               0                  29.         ( gethrtime() - start ) / 1000000000.0 );
                                                     30. 
 0.              0.               0                  31.     for ( int jj = 0; jj < 4; jj++ )
                                                     32.     {
 0.              0.               0                  33.         start = gethrtime();
                                                     34. 
 0.560           0.               0                  35.         for ( ii = 0; ii < NUM_BLOCKS; ii++ )
                                                     36.         {
                                                     37.             blocks[ ii ] = mmap( blocks[ ii ],
                                                     38.                 BLOCKSIZE, PROT_READ | PROT_WRITE,
33.432           0.               0                  39.                 MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
29.535           0.               0                  40.             blocks[ ii ][ ii % BLOCKSIZE ] = 0;
                                                     41.         }
                                                     42. 
                                                     43.         printf( "mmap() time:   %lf sec\n",
 0.              0.               0                  44.             ( gethrtime() - start ) / 1000000000.0 );
 0.              0.               0                  45.         start = gethrtime();
                                                     46. 
 0.101           0.               0                  47.         for ( ii = 0; ii < NUM_BLOCKS; ii++ )
                                                     48.         {
 7.362           0.               0                  49.             memset( blocks[ ii ], 0, BLOCKSIZE );
                                                     50.         }
                                                     51. 
                                                     52.         printf( "memset() time: %lf sec\n",
 0.              0.               0                  53.             ( gethrtime() - start ) / 1000000000.0 );
                                                     54.     }
                                                     55. 
 0.              0.               0                  56.     return( 0 );
 0.              0.               0                  57. }

                                                    Compile flags:  /opt/SUNWspro/bin/cc -g -m64  mm.c -W0,-xp.XAAjaAFbs71a00k.

Note the large amount of time spent in mmap(), and also in the setting of a single byte in each newly-mapped page.

This is an overview from the analyzer tool. Note the large amount of system time:

The large amount of system time consumed is the time taken to map and unmap the physical pages.

This timeline shows when all that time was consumed:

The light green is system time - that's all in the mmap() loops. You can see that switch over to dark-green user time when the memset() loops run. I've highlighted one of those instances so you can see what's going on at that time.

Updated results from a Linux VM:

setup time:    2.567396 sec
mmap() time:   2.971756 sec
memset() time: 0.654947 sec
mmap() time:   3.149629 sec
memset() time: 0.658858 sec
mmap() time:   2.800389 sec
memset() time: 0.647367 sec
mmap() time:   2.915774 sec
memset() time: 0.646539 sec

This tracks exactly with what I stated in my comment yesterday: "FWIW, a quick test I ran showed that a simple, single-threaded call to memset() is somewhere between five and ten times faster than redoing mmap()."

I simply do not understand this fascination with mmap(). mmap() is one hugely expensive call, and it's a forced single-threaded operation - there's only one set of physical memory on the machine. mmap() is not only S-L-O-W, it impacts both the entire process address space and the VM system on the entire host.

Using any form of mmap() just to zero out memory pages is counterproductive. First, the pages don't get zeroed for free - something has to memset() them to clear them. It just doesn't make any sense to add tearing down and recreating a memory mapping to that memset() just to clear a page of RAM.

memset() also has the advantage that more than one thread can be clearing memory at any one time. Making changes to memory mappings is a single-threaded process.

#2

madvise(..., MADV_DONTNEED) should be equivalent to munmap/mmap on anonymous mappings on Linux. It's a bit weird, because that's not how I understand the semantics of "don't need", but it does throw away the page(s) on Linux.

$ cat > foo.c
#include <sys/types.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int *foo = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    *foo = 42;
    printf("%d\n", *foo);
    madvise(foo, getpagesize(), MADV_DONTNEED);
    printf("%d\n", *foo);
    return 0;
}
$ cc -o foo foo.c && ./foo
42
0
$ uname -sr
Linux 3.10.0-693.11.6.el7.x86_64

MADV_DONTNEED does not do that on other operating systems so this is definitely not portable. For example:

$ cc -o foo foo.c && ./foo
42
42
$ uname -sr
Darwin 17.5.0

But, you don't need to unmap, you can just overwrite the mapping. As a bonus this is much more portable:

$ cat foo.c
#include <sys/types.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int *foo = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    *foo = 42;
    printf("%d\n", *foo);
    mmap(foo, getpagesize(), PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
    printf("%d\n", *foo);
    return 0;
}
$ cc -o foo foo.c && ./foo
42
0
$

Also, I'm not actually sure you benchmarked things properly. Creating and dropping mappings can be quite expensive, and I don't think idle zeroing would help that much. Newly mmap'ed pages are not actually mapped until they are used for the first time, and on Linux this means written, not read, because Linux does silly things with copy-on-write zero pages if the first access to a page is a read instead of a write. So unless you benchmark writes to the newly mmap'ed pages, I suspect that neither your previous solution nor the ones I suggested here will actually be faster than just a dumb memset.

#3

Note: this is not an answer; I just needed the formatting feature.

BTW: it is possible that the /dev/zero all-zeros page doesn't even exist, and that the .read() method is implemented as follows (a similar thing happens for /dev/null, where write just returns the length argument):

struct file_operations {
        ...
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ...
};

/* Sketch only: real kernel code would use clear_user() rather than
 * memset() on a __user pointer. */
static ssize_t zero_read (struct file *not_needed_here, char __user * buff, size_t len, loff_t * ignored)
{
        memset (buff, 0, len);
        return len;
}
