From patchwork Mon Aug 10 14:57:01 2020
X-Patchwork-Submitter: Peter Xu
X-Patchwork-Id: 11707539
From: Peter Xu
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, peterx@redhat.com, Marty Mcfadden, Andrea Arcangeli,
    Linus Torvalds, Jann Horn, Christoph Hellwig, Oleg Nesterov,
    Kirill Shutemov, Jan Kara
Subject: [PATCH v2] mm/gup: Allow real explicit breaking of COW
Date: Mon, 10 Aug 2020 10:57:01 -0400
Message-Id: <20200810145701.129228-1-peterx@redhat.com>
X-Mailer: git-send-email 2.26.2

Starting from commit 17839856fd58 ("gup: document and work around "COW can
break either way" issue", 2020-06-02), explicit copy-on-write behavior is
enforced for private gup pages even for read-only accesses.  It is achieved
by always passing FOLL_WRITE to emulate a write.

That should fix the COW issue that we were facing, however the above commit
could also break userfaultfd-wp and applications like umapsort [1,2].  The
general routine of an umap-like program is: the userspace library manages
page allocations, and it evicts the least recently used pages from memory to
external storage (e.g., file systems).  Below are the general steps to evict
an in-memory page in the uffd service thread when the page pool is full:

  (1) UFFDIO_WRITEPROTECT with mode=WP on some to-be-evicted page P, so that
      further writes to page P will block (keep page P clean)
  (2) Copy page P to external storage (e.g. file system)
  (3) MADV_DONTNEED to evict page P

Here step (1) makes sure that the page to dump will always be up-to-date, so
that the page snapshot in the file system is consistent with the one that
was in memory.
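For reference, here is a minimal userspace sketch of steps (1)-(3).  It is
illustrative only (not taken from umap): it assumes the region was already
registered on the uffd descriptor with UFFDIO_REGISTER_MODE_WP, and the
evict_page() helper name is made up:

  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <linux/userfaultfd.h>

  /* Evict one resident page to the backing store (illustrative helper). */
  static int evict_page(int uffd, int store_fd, void *page, size_t psize,
                        off_t offset)
  {
          /* (1) write-protect the page so concurrent writers block in the
           *     uffd-wp fault handler and the page stays clean meanwhile */
          struct uffdio_writeprotect wp = {
                  .range = { .start = (unsigned long)page, .len = psize },
                  .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
          };
          if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
                  return -1;

          /* (2) dump the now-stable page content to external storage; this
           *     write is what becomes a read gup inside the kernel */
          if (pwrite(store_fd, page, psize, offset) != (ssize_t)psize)
                  return -1;

          /* (3) drop the in-memory copy */
          return madvise(page, psize, MADV_DONTNEED);
  }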
However with commit 17839856fd58, step (2) can potentially hang itself: if
we use write() on a file system fd to dump the page data, that becomes a
read gup request in the file system driver to read the page content, and the
read gup is then turned into a write gup due to the newly enforced COW
behavior.  This write gup will further trigger handle_userfault() and hang
the uffd service thread itself.

I think the problem would also go away if we replaced the write() to the
file system with a memory write to an mmaped region in the userspace
library, because normal page faults do not enforce COW; only gup is
affected.  However we cannot forbid users from using write() or any other
form of kernel-level read gup.

One solution is actually already mentioned in commit 17839856fd58, which is
to provide explicit BREAK_COW semantics for enforced COW.  Then we can still
use FAULT_FLAG_WRITE to identify whether this is a "real write request" or
an "enforced COW (read) request".

[1] https://github.com/LLNL/umap-apps/blob/develop/src/umapsort/umapsort.cpp
[2] https://github.com/LLNL/umap

CC: Marty Mcfadden
CC: Andrea Arcangeli
CC: Linus Torvalds
CC: Andrew Morton
CC: Jann Horn
CC: Christoph Hellwig
CC: Oleg Nesterov
CC: Kirill Shutemov
CC: Jan Kara
Fixes: 17839856fd588f4ab6b789f482ed3ffd7c403e1f
Signed-off-by: Peter Xu
---
v2:
- apply FAULT_FLAG_BREAK_COW correctly when FOLL_BREAK_COW is set [Christoph]
- removed the comment above do_wp_page which seemed redundant
---
 include/linux/mm.h | 3 +++
 mm/gup.c           | 6 ++++--
 mm/memory.c        | 7 ++++---
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6a82f9bccd7..dacba5c7942f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -409,6 +409,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read)
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -439,6 +440,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE               0x80
 #define FAULT_FLAG_INSTRUCTION          0x100
 #define FAULT_FLAG_INTERRUPTIBLE        0x200
+#define FAULT_FLAG_BREAK_COW            0x400
 
 /*
  * The default fault flags that should be used by most of the
@@ -2756,6 +2758,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD  0x20000 /* split huge pmd before returning */
 #define FOLL_PIN        0x40000 /* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY  0x80000 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_BREAK_COW  0x100000 /* request for explicit COW (even for read) */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index d8a33dd1430d..c33e84ab9c36 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -870,6 +870,8 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
                 return -ENOENT;
         if (*flags & FOLL_WRITE)
                 fault_flags |= FAULT_FLAG_WRITE;
+        if (*flags & FOLL_BREAK_COW)
+                fault_flags |= FAULT_FLAG_BREAK_COW;
         if (*flags & FOLL_REMOTE)
                 fault_flags |= FAULT_FLAG_REMOTE;
         if (locked)
@@ -1076,7 +1078,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
                 }
                 if (is_vm_hugetlb_page(vma)) {
                         if (should_force_cow_break(vma, foll_flags))
-                                foll_flags |= FOLL_WRITE;
+                                foll_flags |= FOLL_BREAK_COW;
                         i = follow_hugetlb_page(mm, vma, pages, vmas,
                                         &start, &nr_pages, i,
                                         foll_flags, locked);
@@ -1095,7 +1097,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
         }
 
         if (should_force_cow_break(vma, foll_flags))
-                foll_flags |= FOLL_WRITE;
+                foll_flags |= FOLL_BREAK_COW;
 
 retry:
         /*
diff --git a/mm/memory.c b/mm/memory.c
index c39a13b09602..7659b0e27a98 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2900,7 +2900,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
         struct vm_area_struct *vma = vmf->vma;
 
-        if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+        if ((vmf->flags & FAULT_FLAG_WRITE) &&
+            userfaultfd_pte_wp(vma, *vmf->pte)) {
                 pte_unmap_unlock(vmf->pte, vmf->ptl);
                 return handle_userfault(vmf, VM_UFFD_WP);
         }
@@ -3290,7 +3291,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                 put_page(swapcache);
         }
 
-        if (vmf->flags & FAULT_FLAG_WRITE) {
+        if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
                 ret |= do_wp_page(vmf);
                 if (ret & VM_FAULT_ERROR)
                         ret &= VM_FAULT_ERROR;
@@ -4241,7 +4242,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
                 update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
                 goto unlock;
         }
-        if (vmf->flags & FAULT_FLAG_WRITE) {
+        if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
                 if (!pte_write(entry))
                         return do_wp_page(vmf);
                 entry = pte_mkdirty(entry);
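
(Not part of the patch; an illustrative check of the intended semantics,
reusing the names from the eviction sketch above.)  With this change applied,
the read gup triggered by dumping a write-protected page should complete
without waking the uffd handler, while a real CPU write to the same page is
still trapped:

  /* the write() path (read gup) must no longer hang on a wp'ed page */
  if (pwrite(store_fd, page, psize, 0) != (ssize_t)psize)
          perror("pwrite");

  /* a real write is still trapped: this store blocks until the uffd handler
   * thread sees a pagefault message with UFFD_PAGEFAULT_FLAG_WP set and
   * resolves it via UFFDIO_WRITEPROTECT with the _MODE_WP bit cleared */
  ((volatile char *)page)[0] = 1;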