Linux per-CPU Variable Allocation and Management: Source Code Analysis (Unfinished)

What is a per-CPU variable?

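Per-CPU variables are used mainly on multiprocessor systems: the kernel creates one copy of the variable for every CPU, and each processor's copy is independent of the others. Per-CPU variables are allocated either statically or dynamically. Static per-CPU variables are laid out when the kernel is compiled; dynamic ones are obtained at run time by calling the per-CPU memory allocator.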

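Linux manages per-CPU allocations with a chunk data structure. When a per-CPU variable of some size is allocated, the copies for all CPUs are allocated from the same chunk. When a chunk is full, a new chunk is added.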

To make allocation and lookup easier, Linux links each chunk into one of several lists according to the size of its free space. Each allocation is served from a chunk on the list with the least free space that still satisfies the requested size. (There are usually many small percpu allocations many of them being as small as 4 bytes. The allocator organizes chunks into lists according to free size and tries to allocate from the fullest one.)
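The bucketing idea can be illustrated with a small user-space sketch. It maps a free size to a list index by the position of its highest set bit, so chunks with more free space land in higher-numbered lists. The names `highest_bit`, `size_to_slot` and the constant `SLOT_BASE_SHIFT` are hypothetical and chosen for this example; the kernel's own helper may differ in detail.

```c
#include <assert.h>

/* Hypothetical constant for this sketch: free sizes below 16 bytes share slot 1. */
#define SLOT_BASE_SHIFT 5

/* 1-based position of the highest set bit; 0 when x == 0. */
static int highest_bit(unsigned int x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/* Map a free size in bytes to a list index: more free space, higher slot. */
static int size_to_slot(unsigned int free_size)
{
	int slot = highest_bit(free_size) - SLOT_BASE_SHIFT + 2;

	return slot > 1 ? slot : 1;
}
```

With this mapping, a request only needs to look at the slot for its own size and the slots above it, starting with the fullest candidates.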

The source code analysis follows.

Reference source: Linux 3.10

The per-CPU implementation lives mainly in include/linux/percpu.h and mm/percpu.c.

The chunk structure is defined as follows:

struct pcpu_chunk {
	struct list_head list;		/* linked to pcpu_slot lists */
	int free_size;			/* free bytes in the chunk */
	int contig_hint;		/* max contiguous size hint */
	void *base_addr;		/* base address of this chunk */
	int map_used;			/* # of map entries used */
	int map_alloc;			/* # of map entries allocated */
	int *map;			/* allocation map */
	void *data;			/* chunk data */
	bool immutable;			/* no [de]population allowed */
	unsigned long populated[];	/* populated bitmap */
};

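list: links the chunk into the pcpu_slot lists. pcpu_slot is an array of list_head; Linux links every chunk into the list in pcpu_slot that corresponds to the size of its free space.

free_size: the number of free bytes in this chunk.

contig_hint: the size of the largest contiguous free area.

base_addr: the start address (a virtual address) of the memory managed by this chunk.

map_used: the number of entries of the map array that are in use.

map_alloc: the number of entries allocated for the map array.

map: the array that records the allocation state.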

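A chunk uses the map array to carve out allocations of a requested size. Each entry is an int that records the size of one region: a positive value is a free region, a negative value is a region that has been allocated. The map array initially has PCPU_DFL_MAP_ALLOC entries and map_used is initialized to 1, so after initialization only map[0] is valid. It holds the whole allocatable size of the chunk as a positive number, which means the chunk is empty. At run time, each allocation request makes the chunk search the map for a free region that is large enough. The chunk then carves the requested size out of that region, records it in the map, and reduces the free size by the allocated amount; where necessary it also merges adjacent free regions. As allocations of various sizes accumulate, map_used grows, and when the initial map is too small the map array itself is enlarged. (Allocation state in each chunk is kept using an array of integers on chunk->map. A positive value in the map represents a free region and negative allocated. Allocation inside a chunk is done by scanning this map sequentially and serving the first matching entry.)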
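The signed-integer map can be tried out with a toy user-space model. The sketch below is a simplification under stated assumptions: it ignores alignment, uses a fixed-capacity array instead of a growable one, and all names (`toy_chunk`, `toy_alloc`, `toy_free`, `TOY_MAP_CAP`) are hypothetical rather than taken from the kernel.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAP_CAP 16	/* hypothetical fixed capacity; the kernel grows its map instead */

struct toy_chunk {
	int map[TOY_MAP_CAP];	/* > 0: free bytes, < 0: allocated bytes */
	int map_used;		/* number of valid entries in map[] */
};

static void toy_init(struct toy_chunk *c, int total)
{
	c->map[0] = total;	/* a single entry: the whole area is free */
	c->map_used = 1;
}

/* Merge entry i + 1 into entry i and close the gap. */
static void toy_merge_next(struct toy_chunk *c, int i)
{
	c->map[i] += c->map[i + 1];
	memmove(&c->map[i + 1], &c->map[i + 2],
		(size_t)(c->map_used - i - 2) * sizeof(int));
	c->map_used--;
}

/* First-fit allocation; returns the byte offset, or -1 on failure. */
static int toy_alloc(struct toy_chunk *c, int size)
{
	int off = 0;

	if (size <= 0)
		return -1;
	for (int i = 0; i < c->map_used; off += abs(c->map[i]), i++) {
		if (c->map[i] < size)
			continue;		/* allocated, or free but too small */
		if (c->map[i] > size) {		/* split: the tail stays free */
			if (c->map_used == TOY_MAP_CAP)
				return -1;
			memmove(&c->map[i + 2], &c->map[i + 1],
				(size_t)(c->map_used - i - 1) * sizeof(int));
			c->map[i + 1] = c->map[i] - size;
			c->map_used++;
		}
		c->map[i] = -size;		/* negative marks the region as allocated */
		return off;
	}
	return -1;
}

/* Free the region that starts at byte offset off and merge free neighbours. */
static void toy_free(struct toy_chunk *c, int off)
{
	int i = 0, cur = 0;

	while (i < c->map_used && cur != off) {
		cur += abs(c->map[i]);
		i++;
	}
	if (i == c->map_used || c->map[i] >= 0)
		return;				/* no allocated region starts here */
	c->map[i] = -c->map[i];
	if (i + 1 < c->map_used && c->map[i + 1] > 0)
		toy_merge_next(c, i);
	if (i > 0 && c->map[i - 1] > 0)
		toy_merge_next(c, i - 1);
}
```

Allocating 8 and then 4 bytes from a 100-byte area leaves the map as {-8, -4, 88}; freeing both regions merges everything back into the single entry {100}.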

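data: a pointer to the vms structure allocated for the chunk. The memory a chunk manages is ultimately still allocated in units of pages (the first chunk excepted).

immutable: a boolean; when set, the chunk's pages can no longer be populated or depopulated (allocated and mapped, or unmapped and freed).

populated[]: an unsigned long bitmap that records which pages have been mapped successfully.

As mentioned above, Linux uses the chunk data structure to manage per-CPU allocations. If the system has Nr CPUs, a call into the per-CPU allocator lets a chunk allocate the variable for all Nr CPUs at once. In practice it is a bit more complicated, because Linux also divides the Nr CPUs into groups; this is discussed with the code later.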

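Linux records information about the CPUs and about how much memory the per-CPU allocator can hand out in a set of global variables:

static int pcpu_unit_pages __read_mostly;

The amount of memory in each chunk that a single CPU can use for per-CPU variables, in pages. (The value is obviously the same for every CPU, because a per-CPU variable is allocated for all CPUs at the same time.)

static int pcpu_unit_size __read_mostly;

The amount of memory in each chunk that a single CPU can use for per-CPU variables, in bytes. The memory available to a single CPU is called a unit.

static int pcpu_nr_units __read_mostly;

The number of units in each chunk, which corresponds to the number of CPUs in the system.

static int pcpu_atom_size __read_mostly;

The allocation atom size, used for alignment.

static struct list_head *pcpu_slot __read_mostly;

pcpu_slot is an array of list_head; each chunk is linked into one of its lists according to the size of its free space.

static int pcpu_nr_slots __read_mostly;

The number of entries in the pcpu_slot array.

static size_t pcpu_chunk_struct_size __read_mostly;

The size of the chunk structure, used when a new chunk is allocated.

void *pcpu_base_addr __read_mostly;
EXPORT_SYMBOL_GPL(pcpu_base_addr);

The base address of the memory managed by the first chunk (the address of the first chunk which starts with the kernel static area).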

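As described above, the global variable pcpu_unit_size is the amount of memory available to a single CPU, and the copies of a per-CPU variable for all CPUs are allocated from the same chunk. Each chunk therefore has to manage Nr x pcpu_unit_size bytes of memory.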
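The arithmetic can be made concrete with a short sketch. It assumes the simplest possible layout, in which the units of CPU 0, 1, 2, ... follow each other directly inside the chunk; the function names are hypothetical, and the real allocator keeps a per-CPU offset table rather than assuming this ordering.

```c
#include <assert.h>
#include <stddef.h>

/* Total bytes managed by one chunk: one unit per CPU. */
static size_t chunk_bytes(size_t nr_units, size_t unit_size)
{
	return nr_units * unit_size;
}

/*
 * Address of one CPU's copy of a variable that sits off_in_unit bytes
 * into every unit, assuming units are laid out back to back by CPU number.
 */
static char *cpu_copy(char *chunk_base, size_t unit_size,
		      unsigned int cpu, size_t off_in_unit)
{
	return chunk_base + (size_t)cpu * unit_size + off_in_unit;
}
```

For example, with 4 CPUs and a 32 KiB unit the chunk covers 128 KiB, and the same offset inside each unit addresses the four independent copies of one variable.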

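Per-CPU variables are allocated either statically (at compile time) or dynamically (at run time). Static allocation comes first.

Static per-CPU allocation works by defining the variable in a special data section (include/linux/percpu-defs.h):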

#define DECLARE_PER_CPU(type, name) \
	DECLARE_PER_CPU_SECTION(type, name, "")

#define DEFINE_PER_CPU(type, name) \
	DEFINE_PER_CPU_SECTION(type, name, "")

These two macros statically declare and define a per-CPU variable of type type. DECLARE_PER_CPU_SECTION and DEFINE_PER_CPU_SECTION are in turn defined as:

#define DECLARE_PER_CPU_SECTION(type, name, sec) \
	extern __PCPU_ATTRS(sec) __typeof__(type) name

#define DEFINE_PER_CPU_SECTION(type, name, sec) \
	__PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES \
	__typeof__(type) name

where __PCPU_ATTRS is defined as:

#define __PCPU_ATTRS(sec) \
	__percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \
	PER_CPU_ATTRIBUTES

__percpu is an annotation for the sparse static checker; in an ordinary build, include/linux/compiler.h defines it as empty. The sec argument passed in is an empty string, PER_CPU_ATTRIBUTES is empty, and PER_CPU_DEF_ATTRIBUTES is empty as well, so DEFINE_PER_CPU(type, name) expands to:

__attribute__((section(PER_CPU_BASE_SECTION ""))) __typeof__(type) name

PER_CPU_BASE_SECTION is defined in include/asm-generic/percpu.h:

#define PER_CPU_BASE_SECTION ".data..percpu"

so DEFINE_PER_CPU(type, name) finally expands to:

__attribute__((section(".data..percpu"))) __typeof__(type) name

Likewise, DECLARE_PER_CPU(type, name) expands to:

extern __attribute__((section(".data..percpu"))) __typeof__(type) name

A static per-CPU definition therefore simply uses this macro to place a variable in the .data..percpu section. The section is built into the kernel image at compile time. It starts at __per_cpu_start and ends at __per_cpu_end; both symbols are defined in the kernel linker script (arch/arm/kernel/vmlinux.lds).

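Recall that a per-CPU variable has one copy for every CPU, yet only a single variable has been placed in the .data..percpu section here. How does that work?

After Linux boots, start_kernel calls setup_per_cpu_areas to initialize the system's first chunk. That function copies the variable data in the .data..percpu section into the memory that belongs to each CPU inside this chunk.

setup_per_cpu_areas is implemented in mm/percpu.c: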

void __init setup_per_cpu_areas(void)
{
	unsigned long delta;
	unsigned int cpu;
	int rc;

	/*
	 * Always reserve area for module percpu variables.  That's
	 * what the legacy allocator did.
	 */
	rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
				    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE, NULL,
				    pcpu_dfl_fc_alloc, pcpu_dfl_fc_free);
	if (rc < 0)
		panic("Failed to initialize percpu areas.");

	delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
	for_each_possible_cpu(cpu)
		__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
}

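The function first calls pcpu_embed_first_chunk to create the system's first chunk and then initializes the __per_cpu_offset[] array. pcpu_embed_first_chunk is examined first; it is fairly long, so it is taken piece by piece: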
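The role of the offset array can be simulated in user space. In the sketch below a small static array stands in for the .data..percpu template, a second array stands in for the first chunk, and a variable's per-CPU copy is found by adding that CPU's offset to the variable's template address. All `demo_` names and sizes are hypothetical; the layout (units back to back by CPU number) is an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NR_CPUS	2
#define DEMO_UNIT_SIZE	64				/* bytes per unit */
#define DEMO_UNIT_INTS	(DEMO_UNIT_SIZE / sizeof(int))

/* Stand-in for the section template: the "variable" is demo_template[2]. */
static int demo_template[4] = { 0, 0, 42, 0 };
/* Stand-in for the memory of the first chunk: one unit per CPU. */
static int demo_first_chunk[DEMO_NR_CPUS * DEMO_UNIT_INTS];
static uintptr_t demo_per_cpu_offset[DEMO_NR_CPUS];

static void demo_setup(void)
{
	/* Distance from the template to the chunk base; may wrap, which is fine. */
	uintptr_t delta = (uintptr_t)demo_first_chunk - (uintptr_t)demo_template;

	for (unsigned int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		uintptr_t unit_off = (uintptr_t)cpu * DEMO_UNIT_SIZE;

		/* Each unit starts out as a copy of the template. */
		memcpy((char *)demo_first_chunk + unit_off, demo_template,
		       sizeof(demo_template));
		demo_per_cpu_offset[cpu] = delta + unit_off;
	}
}

/* Translate the template address of a variable into one CPU's copy. */
static int *demo_per_cpu_ptr(int *template_addr, unsigned int cpu)
{
	return (int *)((uintptr_t)template_addr + demo_per_cpu_offset[cpu]);
}
```

After `demo_setup()`, every CPU's copy starts with the template's initial value, and writing through one CPU's pointer leaves the other copies untouched.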

int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
				  size_t atom_size,
				  pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
				  pcpu_fc_alloc_fn_t alloc_fn,
				  pcpu_fc_free_fn_t free_fn)
{
	void *base = (void *)ULONG_MAX;
	void **areas = NULL;
	struct pcpu_alloc_info *ai;
	size_t size_sum, areas_size, max_distance;
	int group, i, rc;

	ai = pcpu_build_alloc_info(reserved_size, dyn_size, atom_size,
				   cpu_distance_fn);
	if (IS_ERR(ai))
		return PTR_ERR(ai);
            ... ...
}

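The parameters reserved_size and dyn_size give the sizes of the areas in this chunk set aside for reserved allocations and for dynamic allocations; setup_per_cpu_areas passes both in directly. atom_size is used for alignment; PAGE_SIZE is passed here, which means page alignment. cpu_distance_fn is optional and computes the distance between CPUs; NULL is passed here (cpu_distance_fn: callback to determine distance between cpus, optional). The last two parameters are also function pointers, used to allocate and free memory. Here they are pcpu_dfl_fc_alloc and pcpu_dfl_fc_free, implemented as follows: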

static void * __init pcpu_dfl_fc_alloc(unsigned int cpu, size_t size,
				       size_t align)
{
	return __alloc_bootmem_nopanic(size, align, __pa(MAX_DMA_ADDRESS));
}

static void __init pcpu_dfl_fc_free(void *ptr, size_t size)
{
	free_bootmem(__pa(ptr), size);
}

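These two functions allocate and free memory during early boot, when the system has only just started initializing.

pcpu_embed_first_chunk first calls pcpu_build_alloc_info to gather the allocation information. This function is also fairly long, so it is taken piece by piece: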

static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
				size_t reserved_size, size_t dyn_size,
				size_t atom_size,
				pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
{
	static int group_map[NR_CPUS] __initdata;
	static int group_cnt[NR_CPUS] __initdata;
	const size_t static_size = __per_cpu_end - __per_cpu_start;
	int nr_groups = 1, nr_units = 0;
	size_t size_sum, min_unit_size, alloc_size;
	int upa, max_upa, uninitialized_var(best_upa);	/* units_per_alloc */
	int last_allocs, group, unit;
	unsigned int cpu, tcpu;
	struct pcpu_alloc_info *ai;
	unsigned int *cpu_map;

	/* this function may be called multiple times */
	memset(group_map, 0, sizeof(group_map));
	memset(group_cnt, 0, sizeof(group_cnt));

	/* calculate size_sum and ensure dyn_size is enough for early alloc */
	size_sum = PFN_ALIGN(static_size + reserved_size +
			    max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE));
	dyn_size = size_sum - static_size - reserved_size;

	/*
	 * Determine min_unit_size, alloc_size and max_upa such that
	 * alloc_size is multiple of atom_size and is the smallest
	 * which can accommodate 4k aligned segments which are equal to
	 * or larger than min_unit_size.
	 */
	min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);

	alloc_size = roundup(min_unit_size, atom_size);
            ... ...

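The two arrays static int group_map[NR_CPUS] and static int group_cnt[NR_CPUS] are used to compute the CPU grouping: group_map[] records which group each CPU maps to, and group_cnt[] records how many CPUs each group contains.

size_sum = PFN_ALIGN(static_size + reserved_size + max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE)); Here the function first computes the space each CPU needs in the chunk (static + reserved + dynamic) and aligns it to a page boundary. min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE) takes the larger of that result and PCPU_MIN_UNIT_SIZE. Finally, alloc_size is min_unit_size rounded up to a multiple of the atom_size parameter.

(Determine min_unit_size, alloc_size and max_upa such that alloc_size is multiple of atom_size and is the smallest which can accommodate 4k aligned segments which are equal to or larger than min_unit_size.) alloc_size is the size of one allocation; when each allocation holds a single unit, as it does on this path, it is also the space finally allocated for each CPU.

#define PCPU_MIN_UNIT_SIZE PFN_ALIGN(32 << 10)

PCPU_MIN_UNIT_SIZE is 32 KiB, that is, 8 pages when the page size is 4 KiB.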
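The size calculation can be replayed with plain arithmetic. The sketch below assumes 4 KiB pages and takes the early dynamic reserve as a parameter; the helper names and every sample value used with it are hypothetical, not figures read from a real kernel build.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE	4096UL		/* assumption: 4 KiB pages */
#define DEMO_MIN_UNIT	(32UL << 10)	/* 32 KiB minimum unit size */

/* Round x up to the next multiple of a (a > 0). */
static size_t round_up_to(size_t x, size_t a)
{
	return (x + a - 1) / a * a;
}

/* Follow the three steps: page-align the sum, apply the minimum, round to atom_size. */
static size_t demo_alloc_size(size_t static_size, size_t reserved_size,
			      size_t dyn_size, size_t early_dyn,
			      size_t atom_size)
{
	size_t dyn = dyn_size > early_dyn ? dyn_size : early_dyn;
	size_t size_sum = round_up_to(static_size + reserved_size + dyn,
				      DEMO_PAGE_SIZE);
	size_t min_unit = size_sum > DEMO_MIN_UNIT ? size_sum : DEMO_MIN_UNIT;

	return round_up_to(min_unit, atom_size);
}
```

With a 10000-byte static area, 8192 reserved bytes and 20480 dynamic bytes, the sum of 38672 bytes is page-aligned to 40960, which already exceeds the 32 KiB minimum. A tiny configuration is instead lifted to 32768 bytes, and a 2 MiB atom size would round either case up to 2 MiB.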

...
	for_each_possible_cpu(cpu) {
		group = 0;
	next_group:
		for_each_possible_cpu(tcpu) {
			if (cpu == tcpu)
				break;
			if (group_map[tcpu] == group && cpu_distance_fn &&
			    (cpu_distance_fn(cpu, tcpu) > LOCAL_DISTANCE ||
			     cpu_distance_fn(tcpu, cpu) > LOCAL_DISTANCE)) {
				group++;
				nr_groups = max(nr_groups, group + 1);
				goto next_group;
			}
		}
		group_map[cpu] = group;
		group_cnt[group]++;
	}
        ... ...

This code assigns the CPUs to groups. Because the cpu_distance_fn argument passed in is NULL, all CPUs in fact end up in group 0.
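To see what the loop does when a distance callback is present, here is a user-space re-creation of the same control flow. The four-CPU, two-node topology in `demo_two_node_distance`, the constant `DEMO_LOCAL_DISTANCE` and all other `demo_` names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS		4
#define DEMO_LOCAL_DISTANCE	10

typedef int (*demo_distance_fn)(unsigned int from, unsigned int to);

/* Hypothetical topology: CPUs 0-1 share one node, CPUs 2-3 share another. */
static int demo_two_node_distance(unsigned int from, unsigned int to)
{
	return (from / 2 == to / 2) ? 10 : 20;
}

/* Fill group_map[] and group_cnt[]; returns the number of groups. */
static int demo_group_cpus(demo_distance_fn dist,
			   int group_map[DEMO_NR_CPUS],
			   int group_cnt[DEMO_NR_CPUS])
{
	int nr_groups = 1;

	for (int i = 0; i < DEMO_NR_CPUS; i++)
		group_map[i] = group_cnt[i] = 0;

	for (unsigned int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		int group = 0;
		unsigned int tcpu = 0;

		/* Compare against every CPU that has already been placed. */
		while (tcpu < cpu) {
			if (group_map[tcpu] == group && dist &&
			    (dist(cpu, tcpu) > DEMO_LOCAL_DISTANCE ||
			     dist(tcpu, cpu) > DEMO_LOCAL_DISTANCE)) {
				group++;	/* too far from a member: try the next group */
				if (group + 1 > nr_groups)
					nr_groups = group + 1;
				tcpu = 0;	/* rescan from the first CPU */
				continue;
			}
			tcpu++;
		}
		group_map[cpu] = group;
		group_cnt[group]++;
	}
	return nr_groups;
}
```

With a NULL callback every CPU stays in group 0; with the two-node callback the CPUs split into two groups of two.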

    ...
	/* allocate and fill alloc_info */
	for (group = 0; group < nr_groups; group++)
		nr_units += roundup(group_cnt[group], upa);

	ai = pcpu_alloc_alloc_info(nr_groups, nr_units);
	if (!ai)
		return ERR_PTR(-ENOMEM);
        ... ...

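Next the function computes the number of units (the CPU count of each group rounded up to a multiple of upa, summed over all groups) and stores it in nr_units. It then calls pcpu_alloc_alloc_info to allocate a pcpu_alloc_info structure.

The pcpu_alloc_info structure is defined as follows: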

struct pcpu_group_info {
	int nr_units;			/* aligned # of units */
	unsigned long base_offset;	/* base address offset */
	unsigned int *cpu_map;		/* unit->cpu map, empty
					 * entries contain NR_CPUS */
};

struct pcpu_alloc_info {
	size_t static_size;
	size_t reserved_size;
	size_t dyn_size;
	size_t unit_size;
	size_t atom_size;
	size_t alloc_size;
	size_t __ai_size;		/* internal, don't use */
	int nr_groups;			/* 0 if grouping unnecessary */
	struct pcpu_group_info groups[];
};

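The CPU grouping information is stored in the groups[] array, and each pcpu_group_info structure describes one group. nr_units is the number of units in the group (its CPU count after alignment), base_offset is the offset from the start of the memory managed by the chunk to the start of this group's memory, and cpu_map records which CPUs the group contains.

Now for the pcpu_alloc_alloc_info function: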

struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
						      int nr_units)
{
	struct pcpu_alloc_info *ai;
	size_t base_size, ai_size;
	void *ptr;
	int unit;

	base_size = ALIGN(sizeof(*ai) + nr_groups * sizeof(ai->groups[0]),
			  __alignof__(ai->groups[0].cpu_map[0]));
	ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]);

	ptr = alloc_bootmem_nopanic(PFN_ALIGN(ai_size));
	if (!ptr)
		return NULL;
	ai = ptr;
	ptr += base_size;

	ai->groups[0].cpu_map = ptr;

	for (unit = 0; unit < nr_units; unit++)
		ai->groups[0].cpu_map[unit] = NR_CPUS;

	ai->nr_groups = nr_groups;
	ai->__ai_size = PFN_ALIGN(ai_size);

	return ai;
}

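base_size is the size without the cpu_map array, and ai_size is the total size including the cpu_map array. (All groups share a single cpu_map array; the cpu_map pointer of each group simply points at a different offset inside it.)

The function initializes the cpu_map pointer of groups[0] to the start of the cpu_map array; the cpu_map pointers of the other groups are left for the caller to initialize. It then sets every entry of the cpu_map array to NR_CPUS. Finally it sets nr_groups and __ai_size in the pcpu_alloc_info structure.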
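The single-allocation layout (header, then the group array, then one shared cpu_map array) can be reproduced in user space. This is only a sketch under assumptions: `calloc` replaces the boot-time allocator, the structures are trimmed down, and the `demo_` names and the `DEMO_NO_CPU` marker (standing in for NR_CPUS) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_NO_CPU 0xffffffffu	/* stand-in for NR_CPUS as the "empty entry" marker */

struct demo_group {
	int nr_units;
	unsigned int *cpu_map;		/* points into the shared array */
};

struct demo_info {
	int nr_groups;
	struct demo_group groups[];	/* flexible array, followed by the cpu_map area */
};

static size_t align_up(size_t x, size_t a)
{
	return (x + a - 1) / a * a;
}

/* One allocation holds the header, the groups and the shared cpu_map array. */
static struct demo_info *demo_info_alloc(int nr_groups, int nr_units)
{
	size_t base_size = align_up(sizeof(struct demo_info) +
				    (size_t)nr_groups * sizeof(struct demo_group),
				    _Alignof(unsigned int));
	size_t total = base_size + (size_t)nr_units * sizeof(unsigned int);
	char *ptr = calloc(1, total);
	struct demo_info *ai;

	if (!ptr)
		return NULL;
	ai = (struct demo_info *)ptr;
	ai->groups[0].cpu_map = (unsigned int *)(ptr + base_size);
	for (int unit = 0; unit < nr_units; unit++)
		ai->groups[0].cpu_map[unit] = DEMO_NO_CPU;
	ai->nr_groups = nr_groups;
	return ai;
}

/* The caller's step: give each group its own slice of the shared array. */
static void demo_assign_maps(struct demo_info *ai, const int *units_per_group)
{
	unsigned int *cpu_map = ai->groups[0].cpu_map;

	for (int g = 0; g < ai->nr_groups; g++) {
		ai->groups[g].cpu_map = cpu_map;
		cpu_map += units_per_group[g];
	}
}
```

Keeping everything in one block means a single free releases the header, the groups and the map together, which is convenient for a structure that only lives during early boot.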

Back in pcpu_build_alloc_info:

...
	cpu_map = ai->groups[0].cpu_map;

	for (group = 0; group < nr_groups; group++) {
		ai->groups[group].cpu_map = cpu_map;
		cpu_map += roundup(group_cnt[group], upa);
	}

	ai->static_size = static_size;
	ai->reserved_size = reserved_size;
	ai->dyn_size = dyn_size;
	ai->unit_size = alloc_size / upa;
	ai->atom_size = atom_size;
	ai->alloc_size = alloc_size;
        ... ...

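The for loop initializes the cpu_map pointer of every group; as the code shows, each group's cpu_map pointer is an offset into the shared cpu_map array. The function then fills in static_size, reserved_size, dyn_size, unit_size, atom_size and alloc_size in pcpu_alloc_info. upa is 1 here, so unit_size is equal to alloc_size.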

...
	for (group = 0, unit = 0; group_cnt[group]; group++) {
		struct pcpu_group_info *gi = &ai->groups[group];

		/*
		 * Initialize base_offset as if all groups are located
		 * back-to-back.  The caller should update this to
		 * reflect actual allocation.
		 */
		gi->base_offset = unit * ai->unit_size;

		for_each_possible_cpu(cpu)
			if (group_map[cpu] == group)
				gi->cpu_map[gi->nr_units++] = cpu;
		gi->nr_units = roundup(gi->nr_units, upa);
		unit += gi->nr_units;
	}
	BUG_ON(unit != nr_units);

	return ai;
}

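This for loop initializes the base_offset of each group. As mentioned, base_offset is the offset from the start of the memory managed by the chunk to the start of this group's memory. ai->unit_size is the space to allocate for each CPU, and unit in the loop is the number of units that belong to the groups before the current one. The offsets are therefore computed as if the groups were laid out back to back; as the code comment notes, the caller is expected to update them to reflect the actual allocation. Within a group, the space allocated for its CPUs is contiguous.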
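The base_offset computation is easy to check by hand. The sketch below repeats it for an arbitrary list of group sizes; the function names and the sample numbers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of upa (upa > 0). */
static size_t round_up_units(size_t n, size_t upa)
{
	return (n + upa - 1) / upa * upa;
}

/*
 * Fill base_offset[] as if all groups were placed back to back.
 * Returns the total number of units, which should match nr_units.
 */
static size_t demo_base_offsets(const size_t *group_cnt, size_t nr_groups,
				size_t upa, size_t unit_size,
				size_t *base_offset)
{
	size_t unit = 0;

	for (size_t g = 0; g < nr_groups; g++) {
		base_offset[g] = unit * unit_size;	/* units before this group */
		unit += round_up_units(group_cnt[g], upa);
	}
	return unit;
}
```

For two groups with 3 and 2 CPUs, upa = 2 and a 32 KiB unit, the first group is padded to 4 units, so the second group starts at 4 x 32768 = 131072 bytes and the total is 6 units.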

Then back in pcpu_embed_first_chunk:

...
	size_sum = ai->static_size + ai->reserved_size + ai->dyn_size;
	areas_size = PFN_ALIGN(ai->nr_groups * sizeof(void *));

	areas = alloc_bootmem_nopanic(areas_size);
	if (!areas) {
		rc = -ENOMEM;
		goto out_free;
	}

	/* allocate, copy and determine base address */
	for (group = 0; group < ai->nr_groups; group++) {
		struct pcpu_group_info *gi = &ai->groups[group];
		unsigned int cpu = NR_CPUS;
		void *ptr;

		for (i = 0; i < gi->nr_units && cpu == NR_CPUS; i++)
			cpu = gi->cpu_map[i];
		BUG_ON(cpu == NR_CPUS);

		/* allocate space for the whole group */
		ptr = alloc_fn(cpu, gi->nr_units * ai->unit_size, atom_size);
		if (!ptr) {
			rc = -ENOMEM;
			goto out_free_areas;
		}
		/* kmemleak tracks the percpu allocations separately */
		kmemleak_free(ptr);
		areas[group] = ptr;

		base = min(ptr, base);
	}
        ... ...
