Overlapping pages with mmap (MAP_FIXED)

For some obscure reasons that are not relevant to this question, I need to resort to using MAP_FIXED in order to obtain a page close to where the text section of libc lives in memory.

Before reading mmap(2) (which I should have done in the first place), I was expecting to get an error if I called mmap with MAP_FIXED and a base address overlapping an already-mapped area.

However, that is not the case. For instance, here is part of /proc/<pid>/maps for a certain process:

7ffff7299000-7ffff744c000 r-xp 00000000 08:05 654098                     /lib/x86_64-linux-gnu/libc-2.15.so

Which, after making the following mmap call ...

  mmap((void *)0x7ffff731b000,
       getpagesize(),
       PROT_READ | PROT_WRITE | PROT_EXEC,
       MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED,
       -1,
       0);

... turns into:

7ffff7299000-7ffff731b000 r-xp 00000000 08:05 654098                     /lib/x86_64-linux-gnu/libc-2.15.so
7ffff731b000-7ffff731c000 rwxp 00000000 00:00 0 
7ffff731c000-7ffff744c000 r-xp 00083000 08:05 654098                     /lib/x86_64-linux-gnu/libc-2.15.so

Which means I have overwritten part of the virtual address space dedicated to libc with my own page. Clearly not what I want ...

In the MAP_FIXED part of the mmap(2) manual, it clearly states:

If the memory region specified by addr and len overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded.
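
The discard behaviour quoted above can be observed directly. The following is a minimal sketch, assuming Linux or another system that provides MAP_ANONYMOUS; the function name fixed_replaces_existing is made up for illustration. It maps a page, writes to it, then maps a fresh page at the same address with MAP_FIXED and checks whether the old contents are gone.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a MAP_FIXED mapping silently replaced an existing page,
 * 0 if the old contents survived, -1 if mmap itself failed.
 * (Hypothetical helper, for illustration only.) */
static int fixed_replaces_existing(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* Let the kernel pick an address for the first page. */
    unsigned char *first = mmap(NULL, page, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (first == MAP_FAILED)
        return -1;
    first[0] = 42;

    /* Map a fresh anonymous page at the same address with MAP_FIXED. */
    unsigned char *second = mmap(first, page, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
                                 -1, 0);
    if (second == MAP_FAILED) {
        munmap(first, page);
        return -1;
    }
    assert(second == first); /* MAP_FIXED returns the requested address */

    /* A new anonymous page is zero-filled, so the 42 is gone if replaced. */
    int replaced = (second[0] == 0);
    munmap(second, page);
    return replaced;
}
```

No error is reported at any point: the second mmap succeeds and the earlier data is simply lost.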

Which explains what I am seeing, but I have a couple of questions:

Is there a way to detect whether something is already mapped at a certain address, without accessing /proc/<pid>/maps?

Is there a way to force mmap to fail when it finds overlapping pages?

3 Answers

#1

Use page = sysconf(_SC_PAGE_SIZE) to find the page size, then probe each page-sized block you want to check with msync(addr, page, 0), where addr is page aligned (that is, (unsigned long)addr % page == 0). If it returns -1 with errno == ENOMEM, that page is not mapped.
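
A single-page probe along these lines might look like the sketch below. The helper name page_is_mapped_msync is hypothetical, and the sketch passes MS_ASYNC rather than the flag value 0 used above, on the assumption that naming one of the standard flags is the more portable choice; everything else follows the description.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page starting at addr is mapped, 0 if it is not,
 * and -1 (with errno set by msync) for any other failure.
 * (Hypothetical helper, for illustration only.) */
static int page_is_mapped_msync(void *addr)
{
    long page = sysconf(_SC_PAGESIZE);

    /* msync() requires a page-aligned address. */
    assert((uintptr_t)addr % (uintptr_t)page == 0);

    if (msync(addr, (size_t)page, MS_ASYNC) == 0)
        return 1;
    return (errno == ENOMEM) ? 0 : -1;
}
```

The result is only a snapshot: another thread may map or unmap the page right after the call returns.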

Edited: As fons commented below, mincore(addr, page, &dummy) is superior to msync(). (The syscall is implemented in mm/mincore.c in the Linux kernel sources; C libraries usually provide a wrapper that sets errno.) Because the syscall checks the mapping immediately after verifying that addr is page aligned, it is optimal in the not-mapped case (ENOMEM). It does some extra work if the page is mapped, so if performance is paramount, avoid checking pages you already know are mapped.
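
As a rough illustration of the mincore() approach, the sketch below walks a range one page at a time and counts the pages that are mapped. The function name count_mapped_pages is an assumption made for this example, and the code relies on the Linux/glibc prototype of mincore() with an unsigned char status vector.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Counts the mapped pages in [addr, addr + length).
 * Returns -1 if mincore() fails for a reason other than ENOMEM.
 * (Hypothetical helper, for illustration only.) */
static long count_mapped_pages(void *addr, size_t length)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* mincore() requires a page-aligned start address. */
    assert((uintptr_t)addr % page == 0);

    long mapped = 0;
    for (size_t off = 0; off < length; off += page) {
        unsigned char residency; /* one status byte per probed page */
        if (mincore((char *)addr + off, page, &residency) == 0)
            mapped++;            /* page is mapped (resident or not) */
        else if (errno != ENOMEM)
            return -1;           /* unexpected error */
    }
    return mapped;
}
```

A return value of 0 means the whole range looked free at the time of the scan; any positive value means a MAP_FIXED mapping over that range would discard something.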

You must check each page separately, because for regions larger than a single page, ENOMEM only means that the region is not fully mapped; it might still be partially mapped. Mappings are always granular to page-sized units.

As far as I can tell, there is no way to tell mmap() to fail if the region is already mapped, or contains already mapped pages. (The same applies to mremap(), so you cannot create a mapping and then move it to the desired region.)
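
For readers on newer systems: the mmap(2) manual page documents a MAP_FIXED_NOREPLACE flag (since Linux 4.17) that fails with EEXIST instead of discarding an existing mapping. The sketch below assumes such a kernel; map_noreplace is a made-up name, and the fallback value 0x100000 for the flag is the x86 definition, supplied only in case the libc headers predate it. Older kernels ignore the flag and treat the address as a hint, so the returned address is checked as well.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000 /* assumption: x86 value, for old headers */
#endif

/* Maps an anonymous region exactly at addr, or returns NULL without
 * touching any existing mapping (errno is EEXIST if the range was taken).
 * (Hypothetical helper, for illustration only.) */
static void *map_noreplace(void *addr, size_t length)
{
    assert(length > 0);

    void *p = mmap(addr, length, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (p != addr) {
        /* Pre-4.17 kernels ignore the flag and place the mapping elsewhere. */
        munmap(p, length);
        errno = EEXIST;
        return NULL;
    }
    return p;
}
```

Because the check and the mapping happen in one syscall, this avoids the window between probing an address and mapping it.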

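This means you run the risk of a race condition: between your check and the MAP_FIXED call, something else in the process (another thread, or the C library itself) may map the address. It would be best to execute the actual syscalls yourself, instead of the C library wrappers, just in case they do memory allocation or change memory mappings internally: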

```
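#define _GNU_SOURCE
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

/* Cached page size; queried once on first use. */
static size_t page = 0;
static inline size_t page_size(void)
{
    if (!page)
        page = (size_t)sysconf(_SC_PAGESIZE);
    return page;
}

/* msync() via the raw syscall, bypassing the C library wrapper. */
static inline int raw_msync(void *addr, size_t length, int flags)
{
    return (int)syscall(SYS_msync, addr, length, flags);
}

/* Anonymous mmap() via the raw syscall; returns (void *)-1 on error.
 * Note: SYS_mmap takes these arguments on x86-64; some 32-bit
 * architectures use SYS_mmap2 instead. */
static inline void *raw_mmap(void *addr, size_t length, int prot, int flags)
{
    return (void *)syscall(SYS_mmap, addr, length, prot, flags, -1, (off_t)0);
}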
```

However, I suspect that whatever it is you are trying to do, you eventually need to parse /proc/self/maps anyway.

I recommend avoiding standard I/O (stdio.h) altogether, as the various operations allocate memory dynamically and thus change the mappings. Instead, use the lower-level unistd.h interfaces, which are much less likely to affect the mappings. Here is a set of simple, crude functions that you can use to find each mapped region and the protections enabled in that region (discarding the other information). In practice, it uses about a kilobyte of code and less than that of stack, so it is very useful even on limited architectures (say, embedded devices).

#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>

#ifndef   INPUT_BUFFER
#define   INPUT_BUFFER   512
#endif /* INPUT_BUFFER */

#ifndef   INPUT_EOF
#define   INPUT_EOF     -256
#endif /* INPUT_EOF */

#define   PERM_PRIVATE  16
#define   PERM_SHARED    8
#define   PERM_READ      4
#define   PERM_WRITE     2
#define   PERM_EXEC      1

typedef struct {
    int            descriptor;
    int            status;
    unsigned char *next;
    unsigned char *ends;
    unsigned char  buffer[INPUT_BUFFER + 16];
} input_buffer;

/* Refill input buffer. Returns the number of new bytes.
 * Sets status to ENODATA at EOF.
*/
static size_t input_refill(input_buffer *const input)
{
    ssize_t n;

    if (input->status)
        return (size_t)0;

    if (input->next > input->buffer) {
        if (input->ends > input->next) {
            memmove(input->buffer, input->next,
                    (size_t)(input->ends - input->next));
            input->ends = input->buffer + (size_t)(input->ends - input->next);
            input->next = input->buffer;
        } else {
            input->ends = input->buffer;
            input->next = input->buffer;
        }
    }

    do {
        n = read(input->descriptor, input->ends,
                 INPUT_BUFFER - (size_t)(input->ends - input->buffer));
    } while (n == (ssize_t)-1 && errno == EINTR);
    if (n > (ssize_t)0) {
        input->ends += n;
        return (size_t)n;

    } else
    if (n == (ssize_t)0) {
        input->status = ENODATA;
        return (size_t)0;
    }

    if (n == (ssize_t)-1)
        input->status = errno;
    else
        input->status = EIO;

    return (size_t)0;
}

/* Low-level getchar() equivalent.
*/
static inline int input_next(input_buffer *const input)
{
    if (input->next < input->ends)
        return *(input->next++);
    else
    if (input_refill(input) > 0)
        return *(input->next++);
    else
        return INPUT_EOF;
}

/* Low-level ungetc() equivalent.
*/
static inline int input_back(input_buffer *const input, const int c)
{
    if (c < 0 || c > 255)
        return INPUT_EOF;
    else
    if (input->next > input->buffer)
        return *(--input->next) = c;
    else
    if (input->ends >= input->buffer + sizeof input->buffer)
        return INPUT_EOF;

    memmove(input->next + 1, input->next, (size_t)(input->ends - input->next));
    input->ends++;
    return *(input->next) = c;
}

/* Low-level fopen() equivalent.
*/
static int input_open(input_buffer *const input, const char *const filename)
{
    if (!input)
        return errno = EINVAL;

    input->descriptor = -1;
    input->status = 0;
    input->next = input->buffer;
    input->ends = input->buffer;

    if (!filename || !*filename)
        return errno = input->status = EINVAL;

    do {
        input->descriptor = open(filename, O_RDONLY | O_NOCTTY);
    } while (input->descriptor == -1 && errno == EINTR);
    if (input->descriptor == -1)
        return input->status = errno;

    return 0;
}

/* Low-level fclose() equivalent.
*/
static int input_close(input_buffer *const input)
{
    int result;

    if (!input)
        return errno = EINVAL;

    /* EOF is not an error; we use ENODATA for that. */
    if (input->status == ENODATA)
        input->status = 0;

    if (input->descriptor != -1) {
        do {
            result = close(input->descriptor);
        } while (result == -1 && errno == EINTR);
        if (result == -1 && !input->status)
            input->status = errno;
    }

    input->descriptor = -1;
    input->next = input->buffer;
    input->ends = input->buffer;

    return errno = input->status;
}

/* Read /proc/self/maps, and fill in the arrays corresponding to the fields.
 * The function will return the number of mappings, even if not all are saved.
*/
size_t read_maps(size_t const n,
                 void **const ptr, size_t *const len,
                 unsigned char *const mode)
{
    input_buffer    input;
    size_t          i = 0;
    unsigned long   curr_start, curr_end;
    unsigned char   curr_mode;
    int             c;

    errno = 0;

    if (input_open(&input, "/proc/self/maps"))
        return (size_t)0; /* errno already set. */

    c = input_next(&input);
    while (c >= 0) {

        /* Skip leading controls and whitespace */
        while (c >= 0 && c <= 32)
            c = input_next(&input);

        /* EOF? */
        if (c < 0)
            break;

        curr_start = 0UL;
        curr_end = 0UL;
        curr_mode = 0U;

        /* Start of address range. */
        while (1)
            if (c >= '0' && c <= '9') {
                curr_start = (16UL * curr_start) + c - '0';
                c = input_next(&input);
            } else
            if (c >= 'A' && c <= 'F') {
                curr_start = (16UL * curr_start) + c - 'A' + 10;
                c = input_next(&input);
            } else
            if (c >= 'a' && c <= 'f') {
                curr_start = (16UL * curr_start) + c - 'a' + 10;
                c = input_next(&input);
            } else
                break;
        if (c == '-')
            c = input_next(&input);
        else {
            input_close(&input); /* do not leak the descriptor */
            errno = EIO;
            return (size_t)0;
        }

        /* End of address range. */
        while (1)
            if (c >= '0' && c <= '9') {
                curr_end = (16UL * curr_end) + c - '0';
                c = input_next(&input);
            } else
            if (c >= 'A' && c <= 'F') {
                curr_end = (16UL * curr_end) + c - 'A' + 10;
                c = input_next(&input);
            } else
            if (c >= 'a' && c <= 'f') {
                curr_end = (16UL * curr_end) + c - 'a' + 10;
                c = input_next(&input);
            } else
                break;
        if (c == ' ')
            c = input_next(&input);
        else {
            input_close(&input); /* do not leak the descriptor */
            errno = EIO;
            return (size_t)0;
        }

        /* Permissions. */
        while (1)
            if (c == 'r') {
                curr_mode |= PERM_READ;
                c = input_next(&input);
            } else
            if (c == 'w') {
                curr_mode |= PERM_WRITE;
                c = input_next(&input);
            } else
            if (c == 'x') {
                curr_mode |= PERM_EXEC;
                c = input_next(&input);
            } else
            if (c == 's') {
                curr_mode |= PERM_SHARED;
                c = input_next(&input);
            } else
            if (c == 'p') {
                curr_mode |= PERM_PRIVATE;
                c = input_next(&input);
            } else
            if (c == '-') {
                c = input_next(&input);
            } else
                break;
        if (c == ' ')
            c = input_next(&input);
        else {
            input_close(&input); /* do not leak the descriptor */
            errno = EIO;
            return (size_t)0;
        }

        /* Skip the rest of the line. */
        while (c >= 0 && c != '\n')
            c = input_next(&input);

        /* Add to arrays, if possible. */
        if (i < n) {
            if (ptr) ptr[i] = (void *)curr_start;
            if (len) len[i] = (size_t)(curr_end - curr_start);
            if (mode) mode[i] = curr_mode;
        }
        i++;
    }

    if (input_close(&input))
        return (size_t)0; /* errno already set. */

    errno = 0;
    return i;
}

The read_maps() function reads up to n regions, storing the start addresses as void * in the ptr array, the lengths in the len array, and the permissions in the mode array. It returns the total number of mappings (which may be greater than n), or zero with errno set if an error occurs.
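
Once the arrays are filled, checking a candidate range against them is a plain interval test. The sketch below is one way to do it; find_overlap is a hypothetical helper that is not part of the code above, and it assumes the ptr and len arrays have the layout read_maps() produces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first recorded mapping that overlaps
 * [addr, addr + size), or -1 if none does.
 * (Hypothetical helper, for illustration only.) */
static long find_overlap(void *const *ptr, const size_t *len, size_t count,
                         const void *addr, size_t size)
{
    uintptr_t lo = (uintptr_t)addr;
    uintptr_t hi = lo + size;
    assert(hi >= lo); /* the candidate range must not wrap around */

    for (size_t i = 0; i < count; i++) {
        uintptr_t start = (uintptr_t)ptr[i];
        uintptr_t end = start + len[i];
        /* Two half-open intervals overlap if each starts before the other ends. */
        if (lo < end && start < hi)
            return (long)i;
    }
    return -1;
}
```

Since read_maps() may report more mappings than fit in the arrays, pass the smaller of its return value and n as count, and treat a truncated list as inconclusive.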

It is quite possible to use syscalls for the low-level I/O above, so that you don't use any C library features, but I don't think it is at all necessary. (The C libraries, as far as I can tell, use very simple wrappers around the actual syscalls for these.)

I hope you find this useful.

#2

"Which explains what I am seeing, but I have a couple of questions:"

"Is there a way to detect if something was already mapped to certain address? without accessing /proc/maps?"

Yes, use mmap without MAP_FIXED.

"Is there a way to force mmap to fail in the case of finding overlapping pages?"

Apparently not, but simply use munmap after the mmap if mmap returns a mapping at other than the requested address.

When used without MAP_FIXED, mmap on both Linux and Mac OS X (and, I suspect, elsewhere too) obeys the address parameter if and only if no existing mapping overlaps the range [address, address + length). So if mmap returns a mapping at an address other than the one you supplied, you can infer that a mapping already exists in that range and that you need to use a different range. Since mmap will typically return a mapping at a very high address when it ignores the address parameter, simply unmap the region using munmap and try again at a different address.
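
Reduced to a single attempt, the idea can be sketched as follows. The name map_at_hint is made up for this example. Note that the manual page describes the address only as a hint, so the kernel may in principle relocate the mapping for reasons other than an overlap; the sketch treats any relocation as "not available".

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Tries to map length bytes exactly at addr without MAP_FIXED.
 * Returns addr on success. Returns NULL if mmap failed or if the kernel
 * placed the mapping elsewhere (that mapping is unmapped again).
 * (Hypothetical helper, for illustration only.) */
static void *map_at_hint(void *addr, size_t length)
{
    assert(addr != NULL); /* a NULL hint means "anywhere" */

    void *p = mmap(addr, length, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (p != addr) {
        munmap(p, length); /* wrong place: discard and report failure */
        return NULL;
    }
    return p;
}
```

Existing mappings are never disturbed by this approach, because without MAP_FIXED the kernel does not replace them.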

Using mincore to check whether an address range is in use is not only a waste of time (one has to probe one page at a time), it may not work. Older Linux kernels only fail mincore appropriately for file mappings; they do not report anything at all for MAP_ANON mappings. But as I've pointed out, all you need is mmap and munmap.

I've just been through this exercise while implementing a memory manager for a Smalltalk VM. I use sbrk(0) to find the first address at which I can map the first segment, and then use mmap with an increment of 1 MB to search for room for subsequent segments:

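#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* usqInt, sqInt and max() are assumed to come from the VM's own headers. */
static long          pageSize = 0;
static unsigned long pageMask = 0;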

#define roundDownToPage(v) ((v)&pageMask)
#define roundUpToPage(v) (((v)+pageSize-1)&pageMask)

void *
sqAllocateMemory(usqInt minHeapSize, usqInt desiredHeapSize)
{
    char *hint, *address, *alloc;
    unsigned long alignment;
    sqInt allocBytes; /* matches the sqInt * parameter it is passed to */

    if (pageSize) {
        fprintf(stderr, "sqAllocateMemory: already called\n");
        exit(1);
    }
    pageSize = getpagesize();
    pageMask = ~(pageSize - 1);

    hint = sbrk(0); /* the first unmapped address above existing data */

    alignment = max(pageSize,1024*1024);
    address = (char *)(((usqInt)hint + alignment - 1) & ~(alignment - 1));

    alloc = sqAllocateMemorySegmentOfSizeAboveAllocatedSizeInto
                (roundUpToPage(desiredHeapSize), address, &allocBytes);
    if (!alloc) {
        fprintf(stderr, "sqAllocateMemory: initial alloc failed!\n");
        exit(errno);
    }
    return alloc;
}

/* Allocate a region of memory of at least size bytes, at or above minAddress.
 *  If the attempt fails, answer null.  If the attempt succeeds, answer the
 * start of the region and assign its size through allocatedSizePointer.
 */
void *
sqAllocateMemorySegmentOfSizeAboveAllocatedSizeInto(sqInt size, void *minAddress, sqInt *allocatedSizePointer)
{
    char *address, *alloc;
    long bytes, delta;

    address = (char *)roundUpToPage((unsigned long)minAddress);
    bytes = roundUpToPage(size);
    delta = max(pageSize,1024*1024);

    while ((unsigned long)(address + bytes) > (unsigned long)address) {
        alloc = mmap(address, bytes, PROT_READ | PROT_WRITE,
                     MAP_ANON | MAP_PRIVATE, -1, 0);
        if (alloc == MAP_FAILED) {
            perror("sqAllocateMemorySegmentOfSizeAboveAllocatedSizeInto mmap");
            return 0;
        }
        /* is the mapping both at or above address and not too far above address? */
        if (alloc >= address && alloc <= address + delta) {
            *allocatedSizePointer = bytes;
            return alloc;
        }
        /* mmap answered a mapping well away from where Spur prefers.  Discard
         * the mapping and try again delta higher.
         */
        if (munmap(alloc, bytes) != 0)
            perror("sqAllocateMemorySegment... munmap");
        address += delta;
    }
    return 0;
}

This appears to work well, allocating memory at ascending addresses while skipping over any existing mappings.

HTH

#3

It seems that posix_mem_offset() is what I was looking for.

Not only does it tell you whether an address is mapped, but if it is mapped, it also implicitly gives you the boundaries of the mapped area to which it belongs (by passing SIZE_MAX as the len argument).

So, before enforcing MAP_FIXED, I can use posix_mem_offset() to verify that the address I am using is not mapped yet.

I could use msync() or mincore() too (an ENOMEM error tells you that an address is not mapped), but then I would be blinder (no information about the area in which the address is mapped). Also, msync() has side effects which may have a performance impact, and mincore() is not part of POSIX (it comes from BSD, although Linux provides it too).
