From patchwork Fri Jun 24 15:06:37 2016
X-Patchwork-Submitter: Michal Hocko
X-Patchwork-Id: 9197745
Date: Fri, 24 Jun 2016 17:06:37 +0200
From: Michal Hocko
To: Oleg Nesterov
Cc: Linus Torvalds, Andy Lutomirski, the arch/x86 maintainers,
    Linux Kernel Mailing List, linux-arch@vger.kernel.org,
    Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
    kernel-hardening@lists.openwall.com, Josh Poimboeuf, Jann Horn,
    Heiko Carstens
Message-ID: <20160624150637.GD20203@dhcp22.suse.cz>
References: <20160623143126.GA16664@redhat.com>
    <20160623170352.GA17372@redhat.com>
    <20160623185221.GA17983@redhat.com>
    <20160624140558.GA20208@dhcp22.suse.cz>
In-Reply-To: <20160624140558.GA20208@dhcp22.suse.cz>
User-Agent: Mutt/1.6.0 (2016-04-01)
Subject: [kernel-hardening] Re: [PATCH v3 00/13] Virtually mapped stacks with
    guard pages (x86, core)

On Fri 24-06-16 16:05:58, Michal Hocko wrote:
> On Thu 23-06-16 20:52:21, Oleg Nesterov wrote:
> > On 06/23, Linus Torvalds wrote:
> > >
> > > On Thu, Jun 23, 2016 at 10:03 AM, Oleg Nesterov wrote:
> > > >
> > > > Let me quote my previous email ;)
> > > >
> > > > And we can't free/nullify it when the parent/debugger reaps a zombie,
> > > > say, mark_oom_victim() expects that get_task_struct() protects
> > > > thread_info as well.
> > > >
> > > > probably we can fix all such users though...
> > >
> > > TIF_MEMDIE is indeed a potential problem, but I don't think
> > > mark_oom_victim() is actually problematic.
> > >
> > > mark_oom_victim() is called with either "current",
> >
> > This is no longer true in the -mm tree.
> >
> > But I agree, this is fixable (and in fact I still hope TIF_MEMDIE
> > will die, at least in its current form).
>
> We can move the flag to the task_struct. There are still some bits
> left there. This would be trivial, so the oom usage wouldn't stand in
> the way.

Here is the patch. I've found two bugs where TIF_MEMDIE was checked on
current rather than on the given task; I will separate them into their
own patches (I was just too lazy to do that now). If the approach looks
reasonable I will repost next week.
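For reference, atomic_add_unless(v, 1, 1) only moves the counter from 0
to 1 and reports whether it did, so it acts as an atomic test-and-set;
the (-1, 0) variant is the matching test-and-clear. That is what the
mark_oom_victim()/exit_oom_victim() changes below rely on. A minimal
userspace sketch of the semantic (C11 atomics standing in for the
kernel's atomic_t; the driver code is made up for illustration):

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	/* Model of the kernel's atomic_add_unless(): add a to *v unless
	 * *v == u; return true iff the add was performed. */
	static bool atomic_add_unless(atomic_int *v, int a, int u)
	{
		int c = atomic_load(v);

		while (c != u) {
			/* on CAS failure, c is reloaded with the current value */
			if (atomic_compare_exchange_weak(v, &c, c + a))
				return true;
		}
		return false;
	}

	int main(void)
	{
		atomic_int memdie = 0;

		/* mark_oom_victim(): only the first marker proceeds */
		if (!atomic_add_unless(&memdie, 1, 1))
			puts("already an oom victim");

		/* exit_oom_victim(): only a marked victim clears the flag */
		if (!atomic_add_unless(&memdie, -1, 0))
			puts("was not an oom victim");
		return 0;
	}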
---
From 1baaa1f8f9568f95d8feccb28cf1994f8ca0df9f Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Fri, 24 Jun 2016 16:46:18 +0200
Subject: [PATCH] mm, oom: move TIF_MEMDIE to the task_struct

There is interest in dropping the thread_info->flags usage for further
clean ups. TIF_MEMDIE stands in the way, so let's move it out of
thread_info and into the task_struct. We cannot use task_struct::flags
because the oom killer sets the flag on a !current task without any
locking, so a new task_struct::memdie is added instead. It has to be an
atomic_t so that such a remote update cannot race with other updates.

Signed-off-by: Michal Hocko
---
 arch/alpha/include/asm/thread_info.h      |  1 -
 arch/arc/include/asm/thread_info.h        |  2 --
 arch/arm/include/asm/thread_info.h        |  1 -
 arch/arm64/include/asm/thread_info.h      |  1 -
 arch/avr32/include/asm/thread_info.h      |  2 --
 arch/blackfin/include/asm/thread_info.h   |  1 -
 arch/c6x/include/asm/thread_info.h        |  1 -
 arch/cris/include/asm/thread_info.h       |  1 -
 arch/frv/include/asm/thread_info.h        |  1 -
 arch/h8300/include/asm/thread_info.h      |  1 -
 arch/hexagon/include/asm/thread_info.h    |  1 -
 arch/ia64/include/asm/thread_info.h       |  1 -
 arch/m32r/include/asm/thread_info.h       |  1 -
 arch/m68k/include/asm/thread_info.h       |  1 -
 arch/metag/include/asm/thread_info.h      |  1 -
 arch/microblaze/include/asm/thread_info.h |  1 -
 arch/mips/include/asm/thread_info.h       |  1 -
 arch/mn10300/include/asm/thread_info.h    |  1 -
 arch/nios2/include/asm/thread_info.h      |  1 -
 arch/openrisc/include/asm/thread_info.h   |  1 -
 arch/parisc/include/asm/thread_info.h     |  1 -
 arch/powerpc/include/asm/thread_info.h    |  1 -
 arch/s390/include/asm/thread_info.h       |  1 -
 arch/score/include/asm/thread_info.h      |  1 -
 arch/sh/include/asm/thread_info.h         |  1 -
 arch/sparc/include/asm/thread_info_32.h   |  1 -
 arch/sparc/include/asm/thread_info_64.h   |  1 -
 arch/tile/include/asm/thread_info.h       |  2 --
 arch/um/include/asm/thread_info.h         |  2 --
 arch/unicore32/include/asm/thread_info.h  |  1 -
 arch/x86/include/asm/thread_info.h        |  1 -
 arch/xtensa/include/asm/thread_info.h     |  1 -
 drivers/staging/android/lowmemorykiller.c |  2 +-
 fs/ext4/mballoc.c                         |  2 +-
 include/linux/sched.h                     |  2 ++
 kernel/cpuset.c                           | 12 ++++++------
 kernel/exit.c                             |  2 +-
 kernel/freezer.c                          |  2 +-
 mm/ksm.c                                  |  4 ++--
 mm/memcontrol.c                           |  2 +-
 mm/oom_kill.c                             | 20 ++++++++++----------
 mm/page_alloc.c                           |  6 +++---
 42 files changed, 28 insertions(+), 62 deletions(-)

diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index 32e920a83ae5..126eaaf6559d 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -65,7 +65,6 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define TIF_NEED_RESCHED    3   /* rescheduling necessary */
 #define TIF_SYSCALL_AUDIT   4   /* syscall audit active */
 #define TIF_DIE_IF_KERNEL   9   /* dik recursion lock */
-#define TIF_MEMDIE          13  /* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG  14  /* idle is polling for TIF_NEED_RESCHED */
 
 #define _TIF_SYSCALL_TRACE  (1<<TIF_SYSCALL_TRACE)

[the hunks removing TIF_MEMDIE from the remaining thread_info.h headers
and the head of the drivers/staging/android/lowmemorykiller.c hunk were
garbled in this copy of the mail; only the tail of the latter survives]

+        if (atomic_read(&p->memdie) &&
             time_before_eq(jiffies, lowmem_deathpending_timeout)) {
             task_unlock(p);
             rcu_read_unlock();
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c1ab3ec30423..ddc12f571c50 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4815,7 +4815,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 #endif
     trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters);
 
-    /* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
+    /* __GFP_NOFAIL: retry infinitely, ignore memdie tasks and memcg limit. */
     err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b,
                      GFP_NOFS|__GFP_NOFAIL);
     if (err)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d81a1eb974a..4c91fc0c2e8e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1856,6 +1856,8 @@ struct task_struct {
     unsigned long task_state_change;
 #endif
     int pagefault_disabled;
+    /* oom victim - give it access to memory reserves */
+    atomic_t memdie;
 #ifdef CONFIG_MMU
     struct task_struct *oom_reaper_list;
 #endif
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 73e93e53884d..857fac0b973d 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1038,9 +1038,9 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
      * Allow tasks that have access to memory reserves because they have
      * been OOM killed to get memory anywhere.
      */
-    if (unlikely(test_thread_flag(TIF_MEMDIE)))
+    if (unlikely(atomic_read(&tsk->memdie)))
         return;
-    if (current->flags & PF_EXITING) /* Let dying task have memory */
+    if (tsk->flags & PF_EXITING) /* Let dying task have memory */
         return;
 
     task_lock(tsk);
@@ -2496,12 +2496,12 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * If we're in interrupt, yes, we can always allocate.  If @node is set in
  * current's mems_allowed, yes.  If it's not a __GFP_HARDWALL request and this
  * node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
- * yes.  If current has access to memory reserves due to TIF_MEMDIE, yes.
+ * yes.  If current has access to memory reserves due to memdie, yes.
  * Otherwise, no.
  *
  * GFP_USER allocations are marked with the __GFP_HARDWALL bit,
  * and do not allow allocations outside the current tasks cpuset
- * unless the task has been OOM killed as is marked TIF_MEMDIE.
+ * unless the task has been OOM killed as is marked memdie.
  * GFP_KERNEL allocations are not so marked, so can escape to the
  * nearest enclosing hardwalled ancestor cpuset.
  *
@@ -2524,7 +2524,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * affect that:
  *  in_interrupt - any node ok (current task context irrelevant)
  *  GFP_ATOMIC   - any node ok
- *  TIF_MEMDIE   - any node ok
+ *  memdie       - any node ok
  *  GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
  *  GFP_USER     - only nodes in current tasks mems allowed ok.
  */
@@ -2542,7 +2542,7 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
      * Allow tasks that have access to memory reserves because they have
      * been OOM killed to get memory anywhere.
      */
-    if (unlikely(test_thread_flag(TIF_MEMDIE)))
+    if (unlikely(atomic_read(&current->memdie)))
         return true;
     if (gfp_mask & __GFP_HARDWALL)  /* If hardwall request, stop here */
         return false;
diff --git a/kernel/exit.c b/kernel/exit.c
index 9e6e1356e6bb..8bfdda9bc99a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -434,7 +434,7 @@ static void exit_mm(struct task_struct *tsk)
     task_unlock(tsk);
     mm_update_next_owner(mm);
     mmput(mm);
-    if (test_thread_flag(TIF_MEMDIE))
+    if (atomic_read(&current->memdie))
         exit_oom_victim(tsk);
 }
 
diff --git a/kernel/freezer.c b/kernel/freezer.c
index a8900a3bc27a..e1bd9f2780fe 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -42,7 +42,7 @@ bool freezing_slow_path(struct task_struct *p)
     if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK))
         return false;
 
-    if (test_thread_flag(TIF_MEMDIE))
+    if (atomic_read(&p->memdie))
         return false;
 
     if (pm_nosig_freezing || cgroup_freezing(p))
diff --git a/mm/ksm.c b/mm/ksm.c
index 73d43bafd9fb..8d5a295fb955 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -396,11 +396,11 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
      *
      * VM_FAULT_OOM: at the time of writing (late July 2009), setting
      * aside mem_cgroup limits, VM_FAULT_OOM would only be set if the
-     * current task has TIF_MEMDIE set, and will be OOM killed on return
+     * current task has memdie set, and will be OOM killed on return
      * to user; and ksmd, having no mm, would never be chosen for that.
      *
      * But if the mm is in a limited mem_cgroup, then the fault may fail
-     * with VM_FAULT_OOM even if the current task is not TIF_MEMDIE; and
+     * with VM_FAULT_OOM even if the current task is not memdie; and
      * even ksmd can fail in this way - though it's usually breaking ksm
      * just to undo a merge it made a moment before, so unlikely to oom.
      *
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e8f9e5e9291..df411de17a75 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1987,7 +1987,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
      * bypass the last charges so that they can exit quickly and
      * free their memory.
      */
-    if (unlikely(test_thread_flag(TIF_MEMDIE) ||
+    if (unlikely(atomic_read(&current->memdie) ||
              fatal_signal_pending(current) ||
              current->flags & PF_EXITING))
         goto force;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f744daa6..9d24007cdb82 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -473,7 +473,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
      *              [...]
      *              out_of_memory
      *                select_bad_process
-     *                  # no TIF_MEMDIE task selects new victim
+     *                  # no memdie task selects new victim
      *  unmap_page_range        # frees some memory
      */
     mutex_lock(&oom_lock);
@@ -593,7 +593,7 @@ static void oom_reap_task(struct task_struct *tsk)
     }
 
     /*
-     * Clear TIF_MEMDIE because the task shouldn't be sitting on a
+     * Clear memdie because the task shouldn't be sitting on a
      * reasonably reclaimable memory anymore or it is not a good candidate
      * for the oom victim right now because it cannot release its memory
      * itself nor by the oom reaper.
@@ -669,14 +669,14 @@ void mark_oom_victim(struct task_struct *tsk)
 {
     WARN_ON(oom_killer_disabled);
     /* OOM killer might race with memcg OOM */
-    if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+    if (!atomic_add_unless(&tsk->memdie, 1, 1))
         return;
     atomic_inc(&tsk->signal->oom_victims);
     /*
      * Make sure that the task is woken up from uninterruptible sleep
      * if it is frozen because OOM killer wouldn't be able to free
      * any memory and livelock. freezing_slow_path will tell the freezer
-     * that TIF_MEMDIE tasks should be ignored.
+     * that memdie tasks should be ignored.
      */
     __thaw_task(tsk);
     atomic_inc(&oom_victims);
@@ -687,7 +687,7 @@ void mark_oom_victim(struct task_struct *tsk)
  */
 void exit_oom_victim(struct task_struct *tsk)
 {
-    if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+    if (!atomic_add_unless(&tsk->memdie, -1, 0))
         return;
 
     atomic_dec(&tsk->signal->oom_victims);
@@ -771,7 +771,7 @@ bool task_will_free_mem(struct task_struct *task)
      * If the process has passed exit_mm we have to skip it because
      * we have lost a link to other tasks sharing this mm, we do not
      * have anything to reap and the task might then get stuck waiting
-     * for parent as zombie and we do not want it to hold TIF_MEMDIE
+     * for parent as zombie and we do not want it to hold memdie
      */
     p = find_lock_task_mm(task);
     if (!p)
@@ -836,7 +836,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 
     /*
      * If the task is already exiting, don't alarm the sysadmin or kill
-     * its children or threads, just set TIF_MEMDIE so it can die quickly
+     * its children or threads, just set memdie so it can die quickly
      */
     if (task_will_free_mem(p)) {
         mark_oom_victim(p);
@@ -893,7 +893,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
     mm = victim->mm;
     atomic_inc(&mm->mm_count);
     /*
-     * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
+     * We should send SIGKILL before setting memdie in order to prevent
      * the OOM victim from depleting the memory reserves from the user
      * space under its control.
      */
@@ -1016,7 +1016,7 @@ bool out_of_memory(struct oom_control *oc)
      * quickly exit and free its memory.
      *
      * But don't select if current has already released its mm and cleared
-     * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
+     * memdie flag at exit_mm(), otherwise an OOM livelock may occur.
      */
     if (current->mm && task_will_free_mem(current)) {
         mark_oom_victim(current);
@@ -1096,7 +1096,7 @@ void pagefault_out_of_memory(void)
          * be a racing OOM victim for which oom_killer_disable()
          * is waiting for.
          */
-        WARN_ON(test_thread_flag(TIF_MEMDIE));
+        WARN_ON(atomic_read(&current->memdie));
     }
 
     mutex_unlock(&oom_lock);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89128d64d662..6c550afde6a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3050,7 +3050,7 @@ void warn_alloc_failed(gfp_t gfp_mask, unsigned int order, const char *fmt, ...)
      * of allowed nodes.
      */
     if (!(gfp_mask & __GFP_NOMEMALLOC))
-        if (test_thread_flag(TIF_MEMDIE) ||
+        if (atomic_read(&current->memdie) ||
             (current->flags & (PF_MEMALLOC | PF_EXITING)))
             filter &= ~SHOW_MEM_FILTER_NODES;
     if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -3428,7 +3428,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
             alloc_flags |= ALLOC_NO_WATERMARKS;
         else if (!in_interrupt() &&
              ((current->flags & PF_MEMALLOC) ||
-              unlikely(test_thread_flag(TIF_MEMDIE))))
+              unlikely(atomic_read(&current->memdie))))
             alloc_flags |= ALLOC_NO_WATERMARKS;
     }
 #ifdef CONFIG_CMA
@@ -3637,7 +3637,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
     }
 
     /* Avoid allocations with no watermarks from looping endlessly */
-    if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+    if (atomic_read(&current->memdie) && !(gfp_mask & __GFP_NOFAIL))
         goto nopage;
 
     /*
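P.S. The "checked on current rather than the given task" bugs mentioned
above all share one shape: the old interface leaves the task implicit.
A standalone sketch (everything below is an illustrative stand-in, not
kernel code):

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	struct task { atomic_int memdie; };

	static struct task init_task, victim;
	static struct task *current = &init_task;  /* models the kernel's current */

	/* Old shape: the task is implicit, so this silently tests current
	 * even when the caller meant some other task. */
	static bool memdie_check_implicit(void)
	{
		return atomic_load(&current->memdie);
	}

	/* New shape: the task is an explicit argument, so the mistake is
	 * much harder to write. */
	static bool memdie_check_explicit(struct task *p)
	{
		return atomic_load(&p->memdie);
	}

	int main(void)
	{
		atomic_store(&victim.memdie, 1);
		printf("implicit: %d\n", memdie_check_implicit());        /* 0: wrong task */
		printf("explicit: %d\n", memdie_check_explicit(&victim)); /* 1: intended task */
		return 0;
	}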