From patchwork Tue Apr 7 12:57:32 2015
From: Boaz Harrosh
Date: Tue, 07 Apr 2015 15:57:32 +0300
Message-ID: <5523D43C.1060708@plexistor.com>
In-Reply-To: <552397E6.5030506@plexistor.com>
References: <55239645.9000507@plexistor.com> <552397E6.5030506@plexistor.com>
To: Dave Chinner, Matthew Wilcox, Andrew Morton, "Kirill A. Shutemov",
    Jan Kara, Hugh Dickins, Mel Gorman, linux-mm@kvack.org, linux-nvdimm,
    linux-fsdevel, Eryu Guan, Christoph Hellwig
CC: Stable Tree
Subject: [PATCH 1/3 v6] mm(v4.1): New pfn_mkwrite same as page_mkwrite for VM_PFNMAP

[v4] Kirill's comments about splitting out a new wp_pfn_shared().
     Add Documentation/filesystems/Locking text about pfn_mkwrite.
[v3] Kirill's comments about the use of linear_page_index().
[v2] Based on linux-next/akpm [3dc4623]. For the v4.1 merge window.
     Incorporated comments from Andrew and Kirill.
[v1]
This will allow a filesystem that uses VM_PFNMAP | VM_MIXEDMAP (no page
structs) to get notified when an access is a write to a read-only PFN.
This can happen if we mmap() a file, then first mmap-read from it to
page-in a read-only PFN, and then mmap-write to the same page.

We need this functionality to fix a DAX bug, where in the scenario above
we fail to set ctime/mtime even though we modified the file. An xfstest
attached to this patchset shows the failure and the fix. (A DAX patch
will follow.)

This functionality is extra important for us, because upon dirtying of a
pmem page we also want to RDMA the page to a remote cluster node.

We define a new pfn_mkwrite and do not reuse page_mkwrite because:

 1 - The name ;-)
 2 - Mainly because reusing page_mkwrite would require a very long and
     tedious audit of all page_mkwrite implementations of
     VM_MIXEDMAP/VM_PFNMAP users, to make sure they do not crash when
     called with no page. For example, the current DAX code (which this
     is for) would crash. To reuse page_mkwrite we would first need to
     patch all users so they do not crash on a page-less fault, and only
     then enable this patch. Even then, I would not sleep well at night.
     Adding a new vector is the safest thing to do, and it is not that
     expensive: one extra pointer in a static vm_operations_struct per
     driver. The new vector is also better for performance, because
     otherwise every existing page_mkwrite vector would be called just
     to see there is no page, do nothing, and return.

There is no need to call it from do_shared_fault, because do_wp_page is
called to change the pte permissions anyway.

CC: Kirill A. Shutemov
CC: Matthew Wilcox
CC: Jan Kara
CC: Andrew Morton
CC: Hugh Dickins
CC: Mel Gorman
CC: linux-mm@kvack.org

Signed-off-by: Yigal Korman
Signed-off-by: Boaz Harrosh
---
 Documentation/filesystems/Locking |  8 ++++++++
 include/linux/mm.h                |  3 +++
 mm/memory.c                       | 40 +++++++++++++++++++++++++++++++++++----
 3 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index f91926f..25f36e6 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -525,6 +525,7 @@ prototypes:
 	void (*close)(struct vm_area_struct*);
 	int (*fault)(struct vm_area_struct*, struct vm_fault *);
 	int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
+	int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
 	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
 
 locking rules:
@@ -534,6 +535,7 @@ close:		yes
 fault:		yes		can return with page locked
 map_pages:	yes
 page_mkwrite:	yes		can return with page locked
+pfn_mkwrite:	yes
 access:		yes
 
 	->fault() is called when a previously not present pte is about
@@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page
 like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
 will cause the VM to retry the fault.
 
+	->pfn_mkwrite() is the same as page_mkwrite but is called when the
+pte is VM_PFNMAP or VM_MIXEDMAP with a page-less entry. The expected
+return is VM_FAULT_NOPAGE or one of the VM_FAULT_ERROR types. The default
+behavior after this call is to make the pte read-write, unless
+pfn_mkwrite() already touched the pte, in which case it is left untouched.
+
 	->access() is called when get_user_pages() fails in
 access_process_vm(), typically used to debug a process through
 /proc/pid/mem or ptrace.  This function is needed only for
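For illustration, this is roughly what a user of the new vector could look
like under the contract documented above. This sketch is not part of the
patch; my_fs_pfn_mkwrite and my_fs_vm_ops are made-up names, and the other
vm_operations are elided:

#include <linux/fs.h>
#include <linux/mm.h>

/* Hypothetical handler, not part of this patch. Called by wp_pfn_shared()
 * on a write fault over a read-only, page-less pte in a shared mapping.
 */
static int my_fs_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	/* Update ctime/mtime before the VM makes the pte writable */
	file_update_time(vma->vm_file);

	/* We did not touch the pte; returning VM_FAULT_NOPAGE (a non-error
	 * value) lets do_wp_page() mark it read-write via wp_page_reuse().
	 */
	return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct my_fs_vm_ops = {
	/* .fault, .page_mkwrite, ... elided */
	.pfn_mkwrite	= my_fs_pfn_mkwrite,
};

The real DAX hook-up follows in the DAX patch of this series.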
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d584b95..70c47f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -251,6 +251,9 @@ struct vm_operations_struct {
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
+	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
+	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
+
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
 	 */
diff --git a/mm/memory.c b/mm/memory.c
index 59f6268..67b1ee8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2181,6 +2181,39 @@ oom:
 	return VM_FAULT_OOM;
 }
 
+/*
+ * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
+ * mapping
+ */
+static int wp_pfn_shared(struct mm_struct *mm,
+			struct vm_area_struct *vma, unsigned long address,
+			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
+			pmd_t *pmd)
+{
+	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
+		struct vm_fault vmf = {
+			.page = NULL,
+			.pgoff = linear_page_index(vma, address),
+			.virtual_address = (void __user *)(address & PAGE_MASK),
+			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
+		};
+		int ret;
+
+		pte_unmap_unlock(page_table, ptl);
+		ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
+		if (ret & VM_FAULT_ERROR)
+			return ret;
+		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+		/* Did pfn_mkwrite already fix up the pte? */
+		if (!pte_same(*page_table, orig_pte)) {
+			pte_unmap_unlock(page_table, ptl);
+			return ret;
+		}
+	}
+	return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
+			     NULL, 0, 0);
+}
+
 static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pte_t *page_table, pmd_t *pmd,
 			spinlock_t *ptl, pte_t orig_pte,
@@ -2259,13 +2292,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * VM_PFNMAP VMA.
 		 *
 		 * We should not cow pages in a shared writeable mapping.
-		 * Just mark the pages writable as we can't do any dirty
-		 * accounting on raw pfn maps.
+		 * Just mark the pages writable and/or call ops->pfn_mkwrite.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 				     (VM_WRITE|VM_SHARED))
-			return wp_page_reuse(mm, vma, address, page_table, ptl,
-					     orig_pte, old_page, 0, 0);
+			return wp_pfn_shared(mm, vma, address, page_table, ptl,
+					     orig_pte, pmd);
 
 		pte_unmap_unlock(page_table, ptl);
 		return wp_page_copy(mm, vma, address, page_table, pmd,
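For completeness, the write-after-read scenario described above can be
reproduced from userspace with a sketch like the one below. This is an
illustration only, not the attached xfstest, and it assumes the file sits
on a mapping that installs page-less PFNs (e.g. a DAX mount):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int fd;
	char *p, c;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDWR);
	if (fd < 0)
		return 1;
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	c = p[0];	/* 1st touch is a read: pte installed read-only */
	p[0] = c + 1;	/* 2nd touch is a write: wp fault -> ->pfn_mkwrite() */

	munmap(p, 4096);
	close(fd);
	return 0;
}

Without this patch the second touch goes straight through wp_page_reuse()
and the filesystem is never notified; with it, do_wp_page() routes the
fault through wp_pfn_shared() and the new vector, so ctime/mtime can be
updated.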