From patchwork Fri Jul 13 06:20:44 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Tetsuo Handa
X-Patchwork-Id: 10522591
Message-Id: <201807130620.w6D6KiAJ093010@www262.sakura.ne.jp>
Subject: Re: [patch -mm] mm, oom: remove oom_lock from exit_mmap
From: Tetsuo Handa
To: David Rientjes
Cc: Andrew Morton, Michal Hocko, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Date: Fri, 13 Jul 2018 15:20:44 +0900

What a simplified description of oom_lock...

Positive effects

(1) Serialize "setting TIF_MEMDIE and calling __thaw_task()/atomic_inc() from
    mark_oom_victim()" and "setting oom_killer_disabled = true from
    oom_killer_disable()".

(2) Serialize all printk() messages from out_of_memory().

(3) Prevent selecting a new OOM victim when there is an !MMF_OOM_SKIP mm
    which the current thread should wait for.

(4) Serialize calls to blocking_notifier_call_chain() from out_of_memory(),
    because some of the callbacks might not be thread-safe and/or serialized
    call might release more memory than needed.

Negative effects

(A) Threads which called mutex_lock(&oom_lock) before calling out_of_memory()
    are blocked waiting for "__oom_reap_task_mm() from exit_mmap()" and/or
    "__oom_reap_task_mm() from oom_reap_task_mm()".

(B) Threads which do not call out_of_memory() because
    mutex_trylock(&oom_lock) failed continue consuming CPU resources
    pointlessly.

Regarding (A), we can reduce the range oom_lock serializes from
"__oom_reap_task_mm()" to "setting MMF_OOM_SKIP", for oom_lock is useful for (3).
Therefore, we can apply the change below on top of your patch. But I don't like
sharing MMF_UNSTABLE for two purposes (the reason is explained below).

Regarding (B), we can do direct OOM reaping (like my proposal does).

---
 kernel/fork.c |  5 +++++
 mm/mmap.c     | 21 +++++++++------------
 mm/oom_kill.c | 57 ++++++++++++++++++++++-----------------------------------
 3 files changed, 36 insertions(+), 47 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 6747298..f37d481 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -984,6 +984,11 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	if (unlikely(mm_is_oom_victim(mm))) {
+		mutex_lock(&oom_lock);
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+		mutex_unlock(&oom_lock);
+	}
 	mmdrop(mm);
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 7f918eb..203061f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3075,19 +3075,17 @@ void exit_mmap(struct mm_struct *mm)
 		__oom_reap_task_mm(mm);
 
 		/*
-		 * Now, set MMF_UNSTABLE to avoid racing with the oom reaper.
-		 * This needs to be done before calling munlock_vma_pages_all(),
-		 * which clears VM_LOCKED, otherwise the oom reaper cannot
-		 * reliably test for it. If the oom reaper races with
-		 * munlock_vma_pages_all(), this can result in a kernel oops if
-		 * a pmd is zapped, for example, after follow_page_mask() has
-		 * checked pmd_none().
+		 * Wait for the oom reaper to complete. This needs to be done
+		 * before calling munlock_vma_pages_all(), which clears
+		 * VM_LOCKED, otherwise the oom reaper cannot reliably test for
+		 * it. If the oom reaper races with munlock_vma_pages_all(),
+		 * this can result in a kernel oops if a pmd is zapped, for
+		 * example, after follow_page_mask() has checked pmd_none().
 		 *
-		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
-		 * guarantee that the oom reaper will not run on this mm again
-		 * after mmap_sem is dropped.
+		 * Taking mm->mmap_sem for write will guarantee that the oom
+		 * reaper will not run on this mm again after mmap_sem is
+		 * dropped.
 		 */
-		set_bit(MMF_UNSTABLE, &mm->flags);
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
@@ -3115,7 +3113,6 @@ void exit_mmap(struct mm_struct *mm)
 	unmap_vmas(&tlb, vma, 0, -1);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
-	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e6328ce..7ed4ed0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -488,11 +488,9 @@ void __oom_reap_task_mm(struct mm_struct *mm)
 	 * Tell all users of get_user/copy_from_user etc... that the content
 	 * is no longer stable. No barriers really needed because unmapping
 	 * should imply barriers already and the reader would hit a page fault
-	 * if it stumbled over a reaped memory. If MMF_UNSTABLE is already set,
-	 * reaping as already occurred so nothing left to do.
+	 * if it stumbled over a reaped memory.
 	 */
-	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
-		return;
+	set_bit(MMF_UNSTABLE, &mm->flags);
 
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
@@ -524,25 +522,9 @@ void __oom_reap_task_mm(struct mm_struct *mm)
 
 static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
-
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		trace_skip_task_reaping(tsk->pid);
-		goto out_oom;
+		return;
 	}
 
 	/*
@@ -555,10 +537,18 @@ static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto out_mm;
 
 	/*
-	 * MMF_UNSTABLE is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_UNSTABLE must run
-	 * under mmap_sem for reading because it serializes against the
-	 * down_write();up_write() cycle in exit_mmap().
+	 * MMF_UNSTABLE is set by the time exit_mmap() calls
+	 * munlock_vma_pages_all() in order to avoid race condition. The check
+	 * for MMF_UNSTABLE must run under mmap_sem for reading because it
+	 * serializes against the down_write();up_write() cycle in exit_mmap().
+	 *
+	 * However, since MMF_UNSTABLE is set by __oom_reap_task_mm() from
+	 * exit_mmap() before start reaping (because the purpose of
+	 * MMF_UNSTABLE is to "tell all users of get_user/copy_from_user etc...
+	 * that the content is no longer stable"), it cannot be used for a flag
+	 * for indicating that the OOM reaper can't work on the mm anymore.
+	 * The OOM reaper will give up after (by default) 1 second even if
+	 * exit_mmap() is doing __oom_reap_task_mm().
 	 */
 	if (test_bit(MMF_UNSTABLE, &mm->flags)) {
 		trace_skip_task_reaping(tsk->pid);
@@ -576,8 +566,6 @@ static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 out_mm:
 	up_read(&mm->mmap_sem);
-out_oom:
-	mutex_unlock(&oom_lock);
 }
 
 static void oom_reap_task(struct task_struct *tsk)
@@ -591,12 +579,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	if (test_bit(MMF_OOM_SKIP, &mm->flags))
 		goto drop;
 
-	/*
-	 * If this mm has already been reaped, doing so again will not likely
-	 * free additional memory.
-	 */
-	if (!test_bit(MMF_UNSTABLE, &mm->flags))
-		oom_reap_task_mm(tsk, mm);
+	oom_reap_task_mm(tsk, mm);
 
 	if (time_after_eq(jiffies, mm->oom_free_expire)) {
 		if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
@@ -658,12 +641,16 @@ static int oom_reaper(void *unused)
 static u64 oom_free_timeout_ms = 1000;
 static void wake_oom_reaper(struct task_struct *tsk)
 {
+	unsigned long expire = jiffies + msecs_to_jiffies(oom_free_timeout_ms);
+
+	/* expire must not be 0 in order to avoid double list_add(). */
+	if (!expire)
+		expire++;
 	/*
 	 * Set the reap timeout; if it's already set, the mm is enqueued and
 	 * this tsk can be ignored.
 	 */
-	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL,
-			jiffies + msecs_to_jiffies(oom_free_timeout_ms)))
+	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL, expire))
 		return;
 
 	get_task_struct(tsk);
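
A note on the wake_oom_reaper() hunk above: the cmpxchg() there treats an
oom_free_expire of 0 as "not queued yet", which is why expire must never be 0.
The following is only a userspace sketch of that idea using C11 atomics, not
kernel code. struct demo_mm, demo_deadline() and demo_try_enqueue() are
hypothetical names made up for illustration, and
atomic_compare_exchange_strong() stands in for the kernel's cmpxchg().

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for mm->oom_free_expire: 0 means "not queued". */
struct demo_mm {
	_Atomic unsigned long free_expire;
};

/* Compute a deadline that is never 0, because 0 is the sentinel value. */
static unsigned long demo_deadline(unsigned long now, unsigned long timeout)
{
	unsigned long expire = now + timeout;	/* unsigned wraparound is defined */

	if (!expire)
		expire++;
	return expire;
}

/* Returns true only for the one caller that wins and should enqueue. */
static bool demo_try_enqueue(struct demo_mm *mm, unsigned long now,
			     unsigned long timeout)
{
	unsigned long expected = 0;

	return atomic_compare_exchange_strong(&mm->free_expire, &expected,
					      demo_deadline(now, timeout));
}

#ifdef DEMO_MAIN
int main(void)
{
	struct demo_mm mm = { 0 };

	/* now + timeout wraps to exactly 0 here, so the deadline becomes 1. */
	assert(demo_try_enqueue(&mm, ULONG_MAX - 999UL, 1000UL));
	/* A second caller sees a non-zero value and must not enqueue again. */
	assert(!demo_try_enqueue(&mm, 0UL, 1000UL));
	return 0;
}
#endif
```

If the deadline were allowed to wrap to 0, the compare-and-swap would store 0
over 0, the next caller would again see "not queued", and the same mm would be
added to the list twice. Building with -DDEMO_MAIN runs the small self-check.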