Series: Linux Memory Management

Keywords: OOM, oom_adj, oom_score, badness.

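To use memory more efficiently, the Linux kernel over-commits memory. When physical memory becomes too scarce as a result, the OOM (out-of-memory) mechanism is triggered to kill some processes and reclaim their memory.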

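The mechanism targets processes that occupy too much memory, especially those that consume a large amount in a very short time, and kills them to keep the system from running out of memory.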

1. About OOM

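When the kernel detects that system memory is insufficient, the memory allocation path triggers out_of_memory, which calls select_bad_process() to choose a 'bad' process to kill. How a 'bad' process is judged and selected is decided by oom_badness().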

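Every process on Linux has its own OOM weight in /proc/<pid>/oom_adj, ranging from -17 to +15. The higher the value, the more likely the process is to be killed, and -17 exempts it from the OOM killer entirely. Newer kernels deprecate oom_adj in favor of /proc/<pid>/oom_score_adj, which ranges from -1000 to +1000 and is the value the code analyzed below actually reads.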
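To make the relationship between the legacy oom_adj range and the oom_score_adj value used in the code below more concrete, here is a minimal user-space sketch. It assumes the linear scaling that kernels supporting both files apply when oom_adj is written (multiply by 1000, divide by 17, with +15 pinned to the maximum). The helper name and the locally defined constants are illustrative, not kernel API.

```c
#include <assert.h>

/* Local copies of the limits described in the text; not taken from kernel headers. */
#define OOM_ADJ_MIN        (-17)  /* legacy oom_adj value that disables OOM kills */
#define OOM_ADJ_MAX        15
#define OOM_SCORE_ADJ_MAX  1000

/* Hypothetical helper: map a legacy oom_adj value onto the oom_score_adj scale. */
int oom_adj_to_score_adj(int oom_adj)
{
    assert(oom_adj >= OOM_ADJ_MIN && oom_adj <= OOM_ADJ_MAX);

    /* Pin the top of the range so that the maximum is always reachable. */
    if (oom_adj == OOM_ADJ_MAX)
        return OOM_SCORE_ADJ_MAX;

    /* Linear scaling: -17 maps to -1000 and 0 maps to 0. */
    return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_ADJ_MIN;
}
```

Under these assumptions, oom_adj_to_score_adj(-17) gives -1000, oom_adj_to_score_adj(8) gives 470, and oom_adj_to_score_adj(15) gives 1000.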

2. OOM trigger path

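On the memory allocation path, a shortage of memory first wakes kswapd or triggers memory compaction. In extreme cases OOM is triggered to obtain more memory.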

After memory reclaim has failed, __alloc_pages_may_oom is the entry point into OOM, but the main work is done in out_of_memory.

Because Linux manages memory in units of pages, every allocation has to pass through __alloc_pages_nodemask.

alloc_pages
  ->__alloc_pages
    ->__alloc_pages_nodemask
      ->__alloc_pages_slowpath-------------------------Memory is already short at this point; reclaim and compaction are triggered, and in extreme cases OOM.
        ->__alloc_pages_may_oom -----------------------Start of OOM handling, including some checks.
          ->out_of_memory------------------------------The core of OOM
            ->select_bad_process-----------------------Select the most 'bad' process
              ->oom_scan_process_thread
              ->oom_badness----------------------------Compute how 'bad' the current process is
            ->oom_kill_process-------------------------Kill the selected process

3. Kernel parameters that affect OOM

See the OOM section of "Linux Memory Management (23): Memory sysfs Nodes and Tools".

4. OOM code analysis

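The main OOM code is in mm/oom_kill.c, with struct oom_control declared in include/linux/oom.h. The entry point on the allocation path, __alloc_pages_may_oom, is in mm/page_alloc.c.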

/*
 * Details of the page allocation that triggered the oom killer that are used to
 * determine what should be killed.
 */
struct oom_control {
    /* Used to determine cpuset */
    struct zonelist *zonelist;

    /* Used to determine mempolicy */
    nodemask_t *nodemask;

    /* Used to determine cpuset and node locality requirement */
    const gfp_t gfp_mask;

    /*
     * order == -1 means the oom kill is required by sysrq, otherwise only
     * for display purposes.
     */
    const int order;
};

__alloc_pages_may_oom is the OOM entry point on the memory allocation path. Before entering OOM it checks a number of special cases.

static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
    const struct alloc_context *ac, unsigned long *did_some_progress)
{
    struct oom_control oc = {-------------------------------------------------------OOM control parameters
        .zonelist = ac->zonelist,
        .nodemask = ac->nodemask,
        .gfp_mask = gfp_mask,
        .order = order,
    };
    struct page *page;

    *did_some_progress = 0;

    /*
     * Acquire the oom lock.  If that fails, somebody else is
     * making progress for us.
     */
    if (!mutex_trylock(&oom_lock)) {
        *did_some_progress = 1;
        schedule_timeout_uninterruptible(1);
        return NULL;
    }

    /*
     * Go through the zonelist yet one more time, keep very high watermark
     * here, this is only to catch a parallel oom killing, we must fail if
     * we're still under heavy pressure.
     */
    page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,---------------Try once more against the high watermark to decide whether the OOM flow is really needed.
                    ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
    if (page)
        goto out;

    if (!(gfp_mask & __GFP_NOFAIL)) {---------------------------------------------Special cases in which OOM is skipped
        /* Coredumps can quickly deplete all memory reserves */
        if (current->flags & PF_DUMPCORE)
            goto out;
        /* The OOM killer will not help higher order allocs */
        if (order > PAGE_ALLOC_COSTLY_ORDER)
            goto out;
        /* The OOM killer does not needlessly kill tasks for lowmem */
        if (ac->high_zoneidx < ZONE_NORMAL)
            goto out;
        /* The OOM killer does not compensate for IO-less reclaim */
        if (!(gfp_mask & __GFP_FS)) {
            /*
             * XXX: Page reclaim didn't yield anything,
             * and the OOM killer can't be invoked, but
             * keep looping as per tradition.
             */
            *did_some_progress = 1;
            goto out;
        }
        if (pm_suspended_storage())
            goto out;
        /* The OOM killer may not free memory on a specific node */
        if (gfp_mask & __GFP_THISNODE)
            goto out;
    }
    /* Exhausted what can be done so it's blamo time */
    if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))---------------Enter the main OOM flow
        *did_some_progress = 1;
out:
    mutex_unlock(&oom_lock);
    return page;
}
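The special cases checked above can be condensed into a single predicate. The following is a simplified user-space sketch, not kernel code: the struct, its field names, and the function name are hypothetical stand-ins for the gfp flags and task state that the real function inspects, and the costly-order threshold of 3 is assumed to match PAGE_ALLOC_COSTLY_ORDER.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define COSTLY_ORDER 3  /* assumed to mirror PAGE_ALLOC_COSTLY_ORDER */

/* Hypothetical, simplified description of one allocation request. */
struct alloc_request {
    bool nofail;            /* __GFP_NOFAIL is set */
    bool dumping_core;      /* current has PF_DUMPCORE */
    unsigned int order;     /* allocation order */
    bool lowmem_only;       /* high_zoneidx < ZONE_NORMAL */
    bool may_use_fs;        /* __GFP_FS is set */
    bool storage_suspended; /* pm_suspended_storage() is true */
    bool this_node;         /* __GFP_THISNODE is set */
};

/* Returns true when the request would go on to call out_of_memory(). */
bool may_invoke_oom_killer(const struct alloc_request *req)
{
    assert(req != NULL);

    /* __GFP_NOFAIL bypasses every exemption below. */
    if (req->nofail)
        return true;

    if (req->dumping_core)          /* coredumps can drain the reserves */
        return false;
    if (req->order > COSTLY_ORDER)  /* killing will not help high orders */
        return false;
    if (req->lowmem_only)           /* do not kill tasks for lowmem */
        return false;
    if (!req->may_use_fs)           /* reclaim could not do I/O */
        return false;
    if (req->storage_suspended)     /* storage is suspended */
        return false;
    if (req->this_node)             /* node-specific request */
        return false;

    return true;
}
```

For example, an order-4 request without __GFP_NOFAIL never reaches the OOM killer in this model, while the same request with nofail set does.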

out_of_memory is the core of the OOM mechanism. It can be divided into two parts: first selecting the most 'bad' process, then killing it.

/**
 * out_of_memory - kill the "best" process when we run out of memory
 * @oc: pointer to struct oom_control
 *
 * If we run out of memory, we have the choice between either
 * killing a random task (bad), letting the system crash (worse)
 * OR try to be smart about which process to kill. Note that we
 * don't have to be perfect here, we just have to be good.
 */
bool out_of_memory(struct oom_control *oc)
{
    struct task_struct *p;
    unsigned long totalpages;
    unsigned long freed = 0;
    unsigned int uninitialized_var(points);
    enum oom_constraint constraint = CONSTRAINT_NONE;

    if (oom_killer_disabled)----------------------------------------------------Set in freeze_processes (OOM disabled) and cleared in thaw_processes (OOM enabled), so OOM is not allowed while processes are being frozen.
        return false;

    blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
    if (freed > 0)
        /* Got some memory back in the last second. */
        return true;

    /*
     * If current has a pending SIGKILL or is exiting, then automatically
     * select it.  The goal is to allow it to allocate so that it may
     * quickly exit and free its memory.
     *
     * But don't select if current has already released its mm and cleared
     * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
     */
    if (current->mm &&
        (fatal_signal_pending(current) || task_will_free_mem(current))) {
        mark_oom_victim(current);
        return true;
    }

    /*
     * Check if there were limitations on the allocation (only relevant for
     * NUMA) that may require different handling.
     */
    constraint = constrained_alloc(oc, &totalpages);-----------------------------Returns CONSTRAINT_NONE when CONFIG_NUMA is not defined
    if (constraint != CONSTRAINT_MEMORY_POLICY)
        oc->nodemask = NULL;
    check_panic_on_oom(oc, constraint, NULL);----------------------------------Checks the sysctl_panic_on_oom setting and whether OOM was triggered by sysrq to decide whether to panic.

    if (sysctl_oom_kill_allocating_task && current->mm &&----------------------If sysctl_oom_kill_allocating_task is set, the process currently requesting memory is the one killed when memory runs out.
        !oom_unkillable_task(current, NULL, oc->nodemask) &&
        current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
        get_task_struct(current);
        oom_kill_process(oc, current, 0, totalpages, NULL,
                 "Out of memory (oom_kill_allocating_task)");
        return true;
    }

    p = select_bad_process(oc, &points, totalpages);---------------------------Walk all processes and their threads to find a suitable candidate.
    /* Found nothing?!?! Either we hang forever, or we panic. */
    if (!p && !is_sysrq_oom(oc)) {---------------------------------------------If there is no suitable candidate and the OOM was not triggered by sysrq, panic.
        dump_header(oc, NULL, NULL);
        panic("Out of memory and no killable processes...\n");
    }
    if (p && p != (void *)-1UL) {
        oom_kill_process(oc, p, points, totalpages, NULL,
                 "Out of memory");---------------------------------------------Kill the selected process.
        /*
         * Give the killed process a good chance to exit before trying
         * to allocate memory again.
         */
        schedule_timeout_killable(1);
    }
    return true;
}

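select_bad_process uses oom_scan_process_thread to check various attributes of each task. The returned oom_scan_t value decides how the for loop proceeds.

oom_badness computes each task's score, and the task with the highest score is chosen. The function returns the task_struct of the selected process and passes its score back through ppoints.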

/*
 * Simple selection loop. We chose the process with the highest
 * number of 'points'.  Returns -1 on scan abort.
 */
static struct task_struct *select_bad_process(struct oom_control *oc,
        unsigned int *ppoints, unsigned long totalpages)
{
    struct task_struct *g, *p;
    struct task_struct *chosen = NULL;
    unsigned long chosen_points = 0;

    rcu_read_lock();
    for_each_process_thread(g, p) {-----------------------------------Iterate over every thread of every process
        unsigned int points;

        switch (oom_scan_process_thread(oc, p, totalpages)) {---------Decide how this scan proceeds based on oc and p; returns an oom_scan_t value.
        case OOM_SCAN_SELECT:-----------------------------------------This task can be selected
            chosen = p;
            chosen_points = ULONG_MAX;
            /* fall through */
        case OOM_SCAN_CONTINUE:---------------------------------------Skip the rest of this loop iteration
            continue;
        case OOM_SCAN_ABORT:------------------------------------------Leave the for loop entirely and return immediately.
            rcu_read_unlock();
            return (struct task_struct *)(-1UL);
        case OOM_SCAN_OK:
            break;
        };
        points = oom_badness(p, NULL, oc->nodemask, totalpages);------Score each task.
        if (!points || points < chosen_points)------------------------Only a task that beats the current best survives, so the highest scorer is selected.
            continue;
        /* Prefer thread group leaders for display purposes */
        if (points == chosen_points && thread_group_leader(chosen))
            continue;

        chosen = p;
        chosen_points = points;
    }
    if (chosen)
        get_task_struct(chosen);
    rcu_read_unlock();

    *ppoints = chosen_points * 1000 / totalpages;
    return chosen;
}
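The selection loop and the final normalization to a 0..1000 scale can be illustrated outside the kernel. The sketch below is a simplified model under stated assumptions: candidates are a plain array with precomputed scores, the struct and function names are hypothetical, and on a tie the earlier candidate is kept (the kernel instead prefers thread group leaders).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate record: a pid and its raw badness score in pages. */
struct candidate {
    int pid;
    unsigned long points;
};

/*
 * Returns the index of the highest-scoring candidate, or -1 if none is
 * eligible. The normalized score (0..1000 of totalpages) goes to *ppoints.
 */
int pick_victim(const struct candidate *c, size_t n,
                unsigned long totalpages, unsigned int *ppoints)
{
    int chosen = -1;
    unsigned long chosen_points = 0;

    assert(totalpages > 0);

    for (size_t i = 0; i < n; i++) {
        /* A score of 0 means "unkillable"; lower or equal scores lose. */
        if (c[i].points == 0 || c[i].points <= chosen_points)
            continue;
        chosen = (int)i;
        chosen_points = c[i].points;
    }

    /* Same normalization as the last step of select_bad_process. */
    *ppoints = (unsigned int)(chosen_points * 1000 / totalpages);
    return chosen;
}
```

With totalpages = 10000 and raw scores of 0, 500, 2000 and 2000, this model picks the third candidate and reports a normalized score of 200.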

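oom_badness is the function that scores a process, and is arguably the core of the core. The final result depends on both oom_score_adj and the process's current memory usage.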

/**
 * oom_badness - heuristic function to determine which candidate task to kill
 * @p: task struct of which task we should calculate
 * @totalpages: total present RAM allowed for page allocation
 *
 * The heuristic for determining which task to kill is made to be as simple and
 * predictable as possible.  The goal is to return the highest value for the
 * task consuming the most memory to avoid subsequent oom failures.
 */
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
              const nodemask_t *nodemask, unsigned long totalpages)
{
    long points;
    long adj;

    if (oom_unkillable_task(p, memcg, nodemask))
        return 0;

    p = find_lock_task_mm(p);
    if (!p)
        return 0;

    adj = (long)p->signal->oom_score_adj;--------------------------------------Read the task's oom_score_adj.
    if (adj == OOM_SCORE_ADJ_MIN) {
        task_unlock(p);
        return 0;--------------------------------------------------------------If oom_score_adj is OOM_SCORE_ADJ_MIN, return 0, which tells OOM that this process does not take part in the 'bad' ranking.
    }

    /*
     * The baseline for the badness score is the proportion of RAM that each
     * task's rss, pagetable and swap space use.
     */
    points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
        atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);-----------------points combines the task's memory usage: RSS, swap entries (on a swap file or swap device), and the memory used by page tables.
    task_unlock(p);

    /*
     * Root processes get 3% bonus, just like the __vm_enough_memory()
     * implementation used by LSMs.
     */
    if (has_capability_noaudit(p, CAP_SYS_ADMIN))------------------------------A root process (one with CAP_SYS_ADMIN) gets a 3% reduction of its score.
        points -= (points * 3) / 100;

    /* Normalize to oom_score_adj units */
    adj *= totalpages / 1000;--------------------------------------------------This shows how oom_score_adj affects the final score: if oom_score_adj is negative, points shrinks and the process is less likely to be selected.
    points += adj;-------------------------------------------------------------Add the normalized adj to points to get the task's score.

    /*
     * Never return 0 for an eligible task regardless of the root bonus and
     * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
     */
    return points > 0 ? points : 1;
}
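To see how the terms interact, the arithmetic of oom_badness can be replayed in user space. This is a sketch under stated assumptions: the struct and function names are hypothetical, the memory counters are passed in as plain numbers instead of being read from an mm_struct, and -1000 stands in for OOM_SCORE_ADJ_MIN.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the inputs that oom_badness reads. */
struct task_usage {
    unsigned long rss_pages;        /* get_mm_rss() */
    unsigned long swap_pages;       /* MM_SWAPENTS counter */
    unsigned long pagetable_pages;  /* nr_ptes + nr_pmds */
    bool is_root;                   /* has CAP_SYS_ADMIN */
    int oom_score_adj;              /* -1000 .. 1000 */
};

unsigned long badness(const struct task_usage *t, unsigned long totalpages)
{
    long points;
    long adj = t->oom_score_adj;

    assert(adj >= -1000 && adj <= 1000);

    /* -1000 (assumed OOM_SCORE_ADJ_MIN) opts the task out entirely. */
    if (adj == -1000)
        return 0;

    /* Baseline: pages of RSS, swap and page tables. */
    points = (long)(t->rss_pages + t->swap_pages + t->pagetable_pages);

    /* Root processes get a 3% discount. */
    if (t->is_root)
        points -= (points * 3) / 100;

    /* Each oom_score_adj unit is worth totalpages / 1000 pages. */
    adj *= (long)(totalpages / 1000);
    points += adj;

    /* An eligible task never scores 0. */
    return points > 0 ? (unsigned long)points : 1;
}
```

With totalpages = 1000000 and a task using 200000 RSS pages, 10000 swap pages and 1000 page-table pages, the score is 211000 at oom_score_adj = 0, 204670 for a root process, 511000 at oom_score_adj = 300, and 1 at oom_score_adj = -500.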

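oom_kill_process kills the highest-scoring process, including the threads under it. If that process has children with a different mm, the highest-scoring eligible child is killed in its place.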

/*
 * Must be called while holding a reference to p, which will be released upon
 * returning.
 */
void oom_kill_process(struct oom_control *oc, struct task_struct *p,
              unsigned int points, unsigned long totalpages,
              struct mem_cgroup *memcg, const char *message)
{
    struct task_struct *victim = p;
    struct task_struct *child;
    struct task_struct *t;
    struct mm_struct *mm;
    unsigned int victim_points = 0;
    static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
                          DEFAULT_RATELIMIT_BURST);

    /*
     * If the task is already exiting, don't alarm the sysadmin or kill
     * its children or threads, just set TIF_MEMDIE so it can die quickly
     */
    task_lock(p);
    if (p->mm && task_will_free_mem(p)) {---------------------------------For a task that is already exiting and is not dumping core, set TIF_MEMDIE and return.
        mark_oom_victim(p);
        task_unlock(p);
        put_task_struct(p);
        return;
    }
    task_unlock(p);

    if (__ratelimit(&oom_rs))
        dump_header(oc, p, memcg);

    pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
        message, task_pid_nr(p), p->comm, points);

    /*
     * If any of p's children has a different mm and is eligible for kill,
     * the one with the highest oom_badness() score is sacrificed for its
     * parent.  This attempts to lose the minimal amount of work done while
     * still freeing memory.
     */
    read_lock(&tasklist_lock);
    for_each_thread(p, t) {-----------------------------------------------Iterate over the threads of the process
        list_for_each_entry(child, &t->children, sibling) {
            unsigned int child_points;

            if (process_shares_mm(child, p->mm))
                continue;
            /*
             * oom_badness() returns 0 if the thread is unkillable
             */
            child_points = oom_badness(child, memcg, oc->nodemask,
                                totalpages);------------------------------Compute the child process's score
            if (child_points > victim_points) {---------------------------Record the highest scorer as victim, with its score in victim_points.
                put_task_struct(victim);
                victim = child;
                victim_points = child_points;
                get_task_struct(victim);
            }
        }
    }
    read_unlock(&tasklist_lock);

    p = find_lock_task_mm(victim);
    if (!p) {
        put_task_struct(victim);
        return;
    } else if (victim != p) {
        get_task_struct(p);
        put_task_struct(victim);
        victim = p;
    }

    /* Get a reference to safely compare mm after task_unlock(victim) */
    mm = victim->mm;
    atomic_inc(&mm->mm_count);
    /*
     * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
     * the OOM victim from depleting the memory reserves from the user
     * space under its control.
     */
    do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);--------------Send SIGKILL to the victim process.
    mark_oom_victim(victim);-----------------------------------------------Set TIF_MEMDIE to mark the task as being killed by OOM
    pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
        task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
        K(get_mm_counter(victim->mm, MM_ANONPAGES)),
        K(get_mm_counter(victim->mm, MM_FILEPAGES)));
    task_unlock(victim);

    /*
     * Kill all user processes sharing victim->mm in other thread groups, if
     * any.  They don't get access to memory reserves, though, to avoid
     * depletion of all memory.  This prevents mm->mmap_sem livelock when an
     * oom killed thread cannot exit because it requires the semaphore and
     * its contended by another thread trying to allocate memory itself.
     * That thread will now get access to memory reserves since it has a
     * pending fatal signal.
     */
    rcu_read_lock();
    for_each_process(p) {--------------------------------------------------Then handle the other processes that share the victim's mm
        if (!process_shares_mm(p, mm))
            continue;
        if (same_thread_group(p, victim))
            continue;
        if (unlikely(p->flags & PF_KTHREAD))
            continue;
        if (is_global_init(p))
            continue;
        if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
            continue;

        do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
    }
    rcu_read_unlock();

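    mmdrop(mm);------------------------------------------------------------Drop the mm_count reference taken above; the mm_struct itself is freed only once the count reaches zero.
    put_task_struct(victim);-----------------------------------------------Drop the reference on the victim's task_struct; the structure is freed once the last reference is gone.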
}
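The "sacrifice child" step is easy to misread, so here is a small user-space model of it. It is a sketch only: the struct and function names are hypothetical, and each child's score is passed in directly rather than being computed by oom_badness. Note that victim_points starts at 0, so any eligible child with a non-zero score replaces the parent, regardless of the parent's own score.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one child of the selected process. */
struct child_info {
    int pid;
    bool shares_parent_mm;  /* process_shares_mm(child, p->mm) */
    unsigned long points;   /* oom_badness() result; 0 = unkillable */
};

/* Returns the pid that would receive SIGKILL: the parent or one child. */
int choose_victim_pid(int parent_pid,
                      const struct child_info *children, size_t n)
{
    int victim = parent_pid;
    unsigned long victim_points = 0;

    assert(children != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* Children sharing the parent's mm would free nothing extra. */
        if (children[i].shares_parent_mm)
            continue;
        if (children[i].points > victim_points) {
            victim = children[i].pid;
            victim_points = children[i].points;
        }
    }
    return victim;
}
```

In this model, a parent with pid 10 whose children score 900 (sharing its mm), 300 and 700 loses the child that scores 700, and a parent with no eligible children is killed itself.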

5. OOM testing

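Related reading: "Introduction to the Linux OOM Mechanism", "A Detailed Analysis of the Linux Kernel OOM Mechanism", "An Analysis of the Linux Kernel OOM Mechanism".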
