From patchwork Tue Nov 13 05:51:48 2018
X-Patchwork-Submitter: Sasha Levin
X-Patchwork-Id: 10679615
From: Sasha Levin
To: stable@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Andrea Arcangeli, Andrew Morton, Linus Torvalds, Sasha Levin, linux-mm@kvack.org
Subject: [PATCH AUTOSEL 4.14 24/26] userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaults
Date: Tue, 13 Nov 2018 00:51:48 -0500
Message-Id: <20181113055150.78773-24-sashal@kernel.org>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20181113055150.78773-1-sashal@kernel.org>
References: <20181113055150.78773-1-sashal@kernel.org>

From: Andrea Arcangeli

[ Upstream commit 3b9aadf7278d16d7bed4d5d808501065f70898d8 ]

get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called get_user_pages() in a way
that would not wait for userfaults before failing, so the caller would
hit SIGBUS instead.  Using get_user_pages_locked/unlocked instead allows
get_mempolicy() to let userfaults resolve the fault and fill the hole,
before grabbing the node id of the page.
If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an
address inside an area managed by uffd and there is no page at that
address, the page allocation from within get_mempolicy() will fail
because get_user_pages() does not allow the page fault retry required
by uffd; the user will get SIGBUS.  With this patch, the page fault
will be resolved by the uffd and get_mempolicy() will continue
normally.

Background: from code review, previously the syscall would have
returned -EFAULT (vm_fault_to_errno); now it will block and wait for a
userfault (if it's woken before the fault is resolved it will still
return -EFAULT).  This way get_mempolicy() gives an "unaware" app a
chance to be compliant with userfaults.

The reason this change is visible is that becoming "userfault
compliant" cannot regress anything: all other syscalls, including
read(2)/write(2), had to become "userfault compliant" a long time ago
(that's one of the things userfaultfd can do that PROT_NONE and
trapping segfaults can't).  So this is just one more syscall becoming
"userfault compliant", like all the other major ones already are.

This has been happening with a virtio-bridge dpdk process which called
get_mempolicy() on the guest space post live migration, but before the
memory had a chance to be migrated to the destination.  I didn't run an
strace to show the -EFAULT going away, but I have confirmation that the
debug aid output below (only visible with CONFIG_DEBUG_VM=y) goes away
with the patch:

[20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
[20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
[20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
[20116.371466] Call Trace:
[20116.371473]  dump_stack+0x5c/0x80
[20116.371476]  handle_userfault.cold.37+0x1b/0x22
[20116.371479]  ? remove_wait_queue+0x20/0x60
[20116.371481]  ? poll_freewait+0x45/0xa0
[20116.371483]  ? do_sys_poll+0x31c/0x520
[20116.371485]  ? radix_tree_lookup_slot+0x1e/0x50
[20116.371488]  shmem_getpage_gfp+0xce7/0xe50
[20116.371491]  ? page_add_file_rmap+0x1a/0x2c0
[20116.371493]  shmem_fault+0x78/0x1e0
[20116.371495]  ? filemap_map_pages+0x3a1/0x450
[20116.371498]  __do_fault+0x1f/0xc0
[20116.371500]  __handle_mm_fault+0xe2e/0x12f0
[20116.371502]  handle_mm_fault+0xda/0x200
[20116.371504]  __get_user_pages+0x238/0x790
[20116.371506]  get_user_pages+0x3e/0x50
[20116.371510]  kernel_get_mempolicy+0x40b/0x700
[20116.371512]  ? vfs_write+0x170/0x1a0
[20116.371515]  __x64_sys_get_mempolicy+0x21/0x30
[20116.371517]  do_syscall_64+0x5b/0x160
[20116.371520]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The above harmless debug message (not a kernel crash, just a
dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify
and improve kernel spots that may have to become "userfaultfd
compliant" like this one (without having to run an strace and search
for syscall misbehavior).  Spots like the above are closer to a kernel
bug for the non-cooperative usages that Mike focuses on than for the
dpdk qemu-cooperative usages that reproduced it, but it's still nicer
to get this fixed for dpdk too.

The only part of the patch that required some thought is the mpol_get()
handling, but it looks like it is safe no matter what kind of mempolicy
structure is involved (the default static policy also starts at a
refcount of 1, so it will go to 2 and back to 1 without everything
crashing at 0).
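The original reproducer was a dpdk vhost process, but as an
illustration only, a hypothetical minimal userspace sketch of the same
call pattern could look like the program below (it is not part of the
patch; it assumes libnuma's <numaif.h> get_mempolicy() wrapper, so link
with -lnuma, and omits error handling).  With the patch applied the
get_mempolicy() call blocks until another thread resolves the fault
with UFFDIO_COPY (not shown); without it, the in-kernel
get_user_pages() fails and the caller gets SIGBUS/-EFAULT:

/*
 * Hypothetical reproducer sketch: query the node of a
 * userfaultfd-managed address that has no page behind it yet.
 */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <numaif.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *area = mmap(NULL, page, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Register the mapping with userfaultfd for missing-page faults. */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = page },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/*
	 * No page exists at "area" yet, so the fault is taken from inside
	 * the kernel (get_user_pages) rather than from userspace access.
	 * This call blocks here until a uffd handler thread fills the page.
	 */
	int node = -1;
	long err = get_mempolicy(&node, NULL, 0, area,
				 MPOL_F_NODE | MPOL_F_ADDR);
	printf("get_mempolicy: %ld, node %d\n", err, node);
	return 0;
}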
[rppt@linux.vnet.ibm.com: changelog addition]
  Link: http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Reported-by: Maxime Coquelin
Tested-by: Dr. David Alan Gilbert
Reviewed-by: Mike Rapoport
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
---
 mm/mempolicy.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ecbda7f5d494..c19864283a8e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -821,16 +821,19 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 	}
 }
 
-static int lookup_node(unsigned long addr)
+static int lookup_node(struct mm_struct *mm, unsigned long addr)
 {
 	struct page *p;
 	int err;
 
-	err = get_user_pages(addr & PAGE_MASK, 1, 0, &p, NULL);
+	int locked = 1;
+	err = get_user_pages_locked(addr & PAGE_MASK, 1, 0, &p, &locked);
 	if (err >= 0) {
 		err = page_to_nid(p);
 		put_page(p);
 	}
+	if (locked)
+		up_read(&mm->mmap_sem);
 	return err;
 }
 
@@ -841,7 +844,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 	int err;
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = current->mempolicy, *pol_refcount = NULL;
 
 	if (flags &
 		~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
@@ -881,7 +884,16 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 
 	if (flags & MPOL_F_NODE) {
 		if (flags & MPOL_F_ADDR) {
-			err = lookup_node(addr);
+			/*
+			 * Take a refcount on the mpol, lookup_node()
+			 * will drop the mmap_sem, so after calling
+			 * lookup_node() only "pol" remains valid, "vma"
+			 * is stale.
+			 */
+			pol_refcount = pol;
+			vma = NULL;
+			mpol_get(pol);
+			err = lookup_node(mm, addr);
 			if (err < 0)
 				goto out;
 			*policy = err;
@@ -916,7 +928,9 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
  out:
 	mpol_cond_put(pol);
 	if (vma)
-		up_read(&current->mm->mmap_sem);
+		up_read(&mm->mmap_sem);
+	if (pol_refcount)
+		mpol_put(pol_refcount);
 	return err;
 }
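For background on the calling convention the fix relies on:
get_user_pages_locked() may drop mmap_sem internally while servicing
the fault (which is what allows handle_userfault() to sleep), and it
reports through the "locked" out-parameter whether the lock is still
held on return.  The snippet below is only an illustrative sketch of
that convention under the pre-4.20 mmap_sem naming used in this
backport, not code from the patch; the function name is made up.

#include <linux/mm.h>	/* get_user_pages_locked(), put_page() */

/*
 * Illustrative sketch (hypothetical helper): pin one page with
 * get_user_pages_locked() while honouring its mmap_sem convention.
 * The caller must enter with down_read(&mm->mmap_sem) held.
 */
static int example_pin_one_page(struct mm_struct *mm, unsigned long addr,
				struct page **page)
{
	int locked = 1;		/* mmap_sem held for read on entry */
	long ret;

	ret = get_user_pages_locked(addr & PAGE_MASK, 1, 0, page, &locked);

	/*
	 * If the fault had to sleep (e.g. waiting for a userfault to be
	 * resolved), get_user_pages_locked() dropped mmap_sem and cleared
	 * "locked".  Any vma pointer obtained before the call is stale,
	 * and the lock must only be released if it is still held.
	 */
	if (locked)
		up_read(&mm->mmap_sem);

	if (ret < 0)
		return ret;		/* fault failed */
	return ret == 1 ? 0 : -EFAULT;	/* caller must put_page(*page) on success */
}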