[v8,15/23] mm/hugetlb: Handle pte markers in page faults

Message ID	20220405014909.14761-1-peterx@redhat.com (mailing list archive)
State	New
Headers
Return-Path: <owner-linux-mm@kvack.org>
From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Cc: Mike Kravetz <mike.kravetz@oracle.com>,
	Nadav Amit <nadav.amit@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	David Hildenbrand <david@redhat.com>,
	Hugh Dickins <hughd@google.com>,
	Jerome Glisse <jglisse@redhat.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Alistair Popple <apopple@nvidia.com>,
	peterx@redhat.com
Subject: [PATCH v8 15/23] mm/hugetlb: Handle pte markers in page faults
Date: Mon, 4 Apr 2022 21:49:09 -0400
Message-Id: <20220405014909.14761-1-peterx@redhat.com>
In-Reply-To: <20220405014646.13522-1-peterx@redhat.com>
References: <20220405014646.13522-1-peterx@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="US-ASCII"; x-default=true
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series	userfaultfd-wp: Support shmem and hugetlbfs
[v8,00/23] userfaultfd-wp: Support shmem and hugetlbfs
[v8,01/23] mm: Introduce PTE_MARKER swap entry
[v8,02/23] mm: Teach core mm about pte markers
[v8,03/23] mm: Check against orig_pte for finish_fault()
[v8,04/23] mm/uffd: PTE_MARKER_UFFD_WP
[v8,05/23] mm/shmem: Take care of UFFDIO_COPY_MODE_WP
[v8,06/23] mm/shmem: Handle uffd-wp special pte in page fault handler
[v8,07/23] mm/shmem: Persist uffd-wp bit across zapping for file-backed
[v8,08/23] mm/shmem: Allow uffd wr-protect none pte for file-backed mem
[v8,09/23] mm/shmem: Allows file-back mem to be uffd wr-protected on thps
[v8,10/23] mm/shmem: Handle uffd-wp during fork()
[v8,11/23] mm/hugetlb: Introduce huge pte version of uffd-wp helpers
[v8,12/23] mm/hugetlb: Hook page faults for uffd write protection
[v8,13/23] mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
[v8,14/23] mm/hugetlb: Handle UFFDIO_WRITEPROTECT
[v8,15/23] mm/hugetlb: Handle pte markers in page faults
[v8,16/23] mm/hugetlb: Allow uffd wr-protect none ptes
[v8,17/23] mm/hugetlb: Only drop uffd-wp special pte if required
[v8,18/23] mm/hugetlb: Handle uffd-wp during fork()
[v8,19/23] mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
[v8,20/23] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
[v8,21/23] mm/uffd: Enable write protection for shmem & hugetlbfs
[v8,22/23] mm: Enable PTE markers by default
[v8,23/23] selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

Commit Message

Peter Xu April 5, 2022, 1:49 a.m. UTC

Allow hugetlb code to handle pte markers just like none ptes.  It's mostly
there; we just need to make sure we don't assume hugetlb_no_page() only handles
none ptes, so when detecting a pte change we should use pte_same() rather than
huge_pte_none().  We need to pass in the old_pte to do the comparison.

Check the original pte to see whether it's a pte marker; if it is, we should
recover the uffd-wp bit on the new pte to be installed, so that the next write
will be trapped by uffd.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
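
As an illustration of the two points in the commit message, here is a minimal
userspace sketch, not kernel code.  Everything prefixed with toy_ or TOY_ is a
hypothetical stand-in invented for this example (the real pte_t layout and
helpers are architecture-specific).  It shows why an "is the slot still none?"
recheck cannot work once the original entry may be a marker, and how the
uffd-wp state could be carried over to the newly installed entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for pte_t; real layouts are arch-specific. */
typedef struct { uint64_t val; } toy_pte_t;

#define TOY_PRESENT (1ULL << 0)
#define TOY_WRITE   (1ULL << 1)
#define TOY_UFFD_WP (1ULL << 2)
#define TOY_MARKER  (1ULL << 3)	/* non-present marker entry */

static inline bool toy_pte_none(toy_pte_t p)
{
	return p.val == 0;
}

static inline bool toy_pte_same(toy_pte_t a, toy_pte_t b)
{
	return a.val == b.val;
}

/* True for a non-present marker that carries the uffd-wp bit. */
static inline bool toy_pte_marker_uffd_wp(toy_pte_t p)
{
	return !(p.val & TOY_PRESENT) && (p.val & TOY_MARKER) &&
	       (p.val & TOY_UFFD_WP);
}

/* "None or marker": both take the no-page path. */
static inline bool toy_pte_none_mostly(toy_pte_t p)
{
	return toy_pte_none(p) ||
	       (!(p.val & TOY_PRESENT) && (p.val & TOY_MARKER));
}

/*
 * Install a present entry into *slot.  'old' is the value sampled before
 * the (imaginary) lock was taken.  Returns false if the slot changed in
 * the meantime, in which case the caller is expected to retry.
 */
static inline bool toy_install(toy_pte_t *slot, toy_pte_t old, bool writable)
{
	toy_pte_t new_pte = { TOY_PRESENT | (writable ? TOY_WRITE : 0) };

	/* Compare against the sampled value, not against "none". */
	if (!toy_pte_same(*slot, old))
		return false;

	/* A wr-protected marker stays wr-protected once populated. */
	if (toy_pte_marker_uffd_wp(old)) {
		new_pte.val |= TOY_UFFD_WP;
		new_pte.val &= ~TOY_WRITE;
	}

	*slot = new_pte;
	return true;
}
```

With a marker as the sampled entry, toy_pte_none() is false from the start, so
a none-based recheck would always look like a race; comparing with the sampled
value handles both the none and the marker case.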

Comments

kernel test robot April 6, 2022, 1:37 p.m. UTC | #1

Hi Peter,

Thank you for the patch! There is still something to improve:

[auto build test ERROR on hnaz-mm/master]
[cannot apply to arnd-asm-generic/master linus/master linux/master v5.18-rc1 next-20220406]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
base:   https://github.com/hnaz/linux-mm master
config: s390-randconfig-r044-20220406 (https://download.01.org/0day-ci/archive/20220406/202204062154.2txNJyaf-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/e7e7aaec811e2817cd169f0cc1d8f81bdf1f05c3
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
        git checkout e7e7aaec811e2817cd169f0cc1d8f81bdf1f05c3
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=s390 SHELL=/bin/bash

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/hugetlb.c: In function 'hugetlb_fault':
>> mm/hugetlb.c:5678:13: error: implicit declaration of function 'huge_pte_none_mostly'; did you mean 'pte_none_mostly'? [-Werror=implicit-function-declaration]
    5678 |         if (huge_pte_none_mostly(entry)) {
         |             ^~~~~~~~~~~~~~~~~~~~
         |             pte_none_mostly
   cc1: some warnings being treated as errors


vim +5678 mm/hugetlb.c

  5616	
  5617	vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
  5618				unsigned long address, unsigned int flags)
  5619	{
  5620		pte_t *ptep, entry;
  5621		spinlock_t *ptl;
  5622		vm_fault_t ret;
  5623		u32 hash;
  5624		pgoff_t idx;
  5625		struct page *page = NULL;
  5626		struct page *pagecache_page = NULL;
  5627		struct hstate *h = hstate_vma(vma);
  5628		struct address_space *mapping;
  5629		int need_wait_lock = 0;
  5630		unsigned long haddr = address & huge_page_mask(h);
  5631	
  5632		ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
  5633		if (ptep) {
  5634			/*
  5635			 * Since we hold no locks, ptep could be stale.  That is
  5636			 * OK as we are only making decisions based on content and
  5637			 * not actually modifying content here.
  5638			 */
  5639			entry = huge_ptep_get(ptep);
  5640			if (unlikely(is_hugetlb_entry_migration(entry))) {
  5641				migration_entry_wait_huge(vma, mm, ptep);
  5642				return 0;
  5643			} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
  5644				return VM_FAULT_HWPOISON_LARGE |
  5645					VM_FAULT_SET_HINDEX(hstate_index(h));
  5646		}
  5647	
  5648		/*
  5649		 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
  5650		 * until finished with ptep.  This serves two purposes:
  5651		 * 1) It prevents huge_pmd_unshare from being called elsewhere
  5652		 *    and making the ptep no longer valid.
  5653		 * 2) It synchronizes us with i_size modifications during truncation.
  5654		 *
  5655		 * ptep could have already been assigned via huge_pte_offset.  That
  5656		 * is OK, as huge_pte_alloc will return the same value unless
  5657		 * something has changed.
  5658		 */
  5659		mapping = vma->vm_file->f_mapping;
  5660		i_mmap_lock_read(mapping);
  5661		ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
  5662		if (!ptep) {
  5663			i_mmap_unlock_read(mapping);
  5664			return VM_FAULT_OOM;
  5665		}
  5666	
  5667		/*
  5668		 * Serialize hugepage allocation and instantiation, so that we don't
  5669		 * get spurious allocation failures if two CPUs race to instantiate
  5670		 * the same page in the page cache.
  5671		 */
  5672		idx = vma_hugecache_offset(h, vma, haddr);
  5673		hash = hugetlb_fault_mutex_hash(mapping, idx);
  5674		mutex_lock(&hugetlb_fault_mutex_table[hash]);
  5675	
  5676		entry = huge_ptep_get(ptep);
  5677		/* PTE markers should be handled the same way as none pte */
> 5678		if (huge_pte_none_mostly(entry)) {
  5679			ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
  5680					      entry, flags);
  5681			goto out_mutex;
  5682		}
  5683	
  5684		ret = 0;
  5685	
  5686		/*
  5687		 * entry could be a migration/hwpoison entry at this point, so this
  5688		 * check prevents the kernel from going below assuming that we have
  5689		 * an active hugepage in pagecache. This goto expects the 2nd page
  5690		 * fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
  5691		 * properly handle it.
  5692		 */
  5693		if (!pte_present(entry))
  5694			goto out_mutex;
  5695	
  5696		/*
  5697		 * If we are going to COW/unshare the mapping later, we examine the
  5698		 * pending reservations for this page now. This will ensure that any
  5699		 * allocations necessary to record that reservation occur outside the
  5700		 * spinlock. For private mappings, we also lookup the pagecache
  5701		 * page now as it is used to determine if a reservation has been
  5702		 * consumed.
  5703		 */
  5704		if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
  5705		    !huge_pte_write(entry)) {
  5706			if (vma_needs_reservation(h, vma, haddr) < 0) {
  5707				ret = VM_FAULT_OOM;
  5708				goto out_mutex;
  5709			}
  5710			/* Just decrements count, does not deallocate */
  5711			vma_end_reservation(h, vma, haddr);
  5712	
  5713			if (!(vma->vm_flags & VM_MAYSHARE))
  5714				pagecache_page = hugetlbfs_pagecache_page(h,
  5715									vma, haddr);
  5716		}
  5717	
  5718		ptl = huge_pte_lock(h, mm, ptep);
  5719	
  5720		/* Check for a racing update before calling hugetlb_wp() */
  5721		if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
  5722			goto out_ptl;
  5723	
  5724		/* Handle userfault-wp first, before trying to lock more pages */
  5725		if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
  5726		    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
  5727			struct vm_fault vmf = {
  5728				.vma = vma,
  5729				.address = haddr,
  5730				.real_address = address,
  5731				.flags = flags,
  5732			};
  5733	
  5734			spin_unlock(ptl);
  5735			if (pagecache_page) {
  5736				unlock_page(pagecache_page);
  5737				put_page(pagecache_page);
  5738			}
  5739			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
  5740			i_mmap_unlock_read(mapping);
  5741			return handle_userfault(&vmf, VM_UFFD_WP);
  5742		}
  5743	
  5744		/*
  5745		 * hugetlb_wp() requires page locks of pte_page(entry) and
  5746		 * pagecache_page, so here we need to take the former one
  5747		 * when page != pagecache_page or !pagecache_page.
  5748		 */
  5749		page = pte_page(entry);
  5750		if (page != pagecache_page)
  5751			if (!trylock_page(page)) {
  5752				need_wait_lock = 1;
  5753				goto out_ptl;
  5754			}
  5755	
  5756		get_page(page);
  5757	
  5758		if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
  5759			if (!huge_pte_write(entry)) {
  5760				ret = hugetlb_wp(mm, vma, address, ptep, flags,
  5761						 pagecache_page, ptl);
  5762				goto out_put_page;
  5763			} else if (likely(flags & FAULT_FLAG_WRITE)) {
  5764				entry = huge_pte_mkdirty(entry);
  5765			}
  5766		}
  5767		entry = pte_mkyoung(entry);
  5768		if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
  5769							flags & FAULT_FLAG_WRITE))
  5770			update_mmu_cache(vma, haddr, ptep);
  5771	out_put_page:
  5772		if (page != pagecache_page)
  5773			unlock_page(page);
  5774		put_page(page);
  5775	out_ptl:
  5776		spin_unlock(ptl);
  5777	
  5778		if (pagecache_page) {
  5779			unlock_page(pagecache_page);
  5780			put_page(pagecache_page);
  5781		}
  5782	out_mutex:
  5783		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
  5784		i_mmap_unlock_read(mapping);
  5785		/*
  5786		 * Generally it's safe to hold refcount during waiting page lock. But
  5787		 * here we just wait to defer the next page fault to avoid busy loop and
  5788		 * the page is not used after unlocked before returning from the current
  5789		 * page fault. So we are safe from accessing freed page, even if we wait
  5790		 * here without taking refcount.
  5791		 */
  5792		if (need_wait_lock)
  5793			wait_on_page_locked(page);
  5794		return ret;
  5795	}
  5796

Peter Xu April 6, 2022, 3:02 p.m. UTC | #2

On Wed, Apr 06, 2022 at 09:37:00PM +0800, kernel test robot wrote:
> Hi Peter,
> 
> Thank you for the patch! There is still something to improve:
> 
> [auto build test ERROR on hnaz-mm/master]
> [cannot apply to arnd-asm-generic/master linus/master linux/master v5.18-rc1 next-20220406]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting a patch, we suggest using '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
> base:   https://github.com/hnaz/linux-mm master
> config: s390-randconfig-r044-20220406 (https://download.01.org/0day-ci/archive/20220406/202204062154.2txNJyaf-lkp@intel.com/config)
> compiler: s390-linux-gcc (GCC) 11.2.0
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/intel-lab-lkp/linux/commit/e7e7aaec811e2817cd169f0cc1d8f81bdf1f05c3
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
>         git checkout e7e7aaec811e2817cd169f0cc1d8f81bdf1f05c3
>         # save the config file to linux build tree
>         mkdir build_dir
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=s390 SHELL=/bin/bash
> 
> If you fix the issue, kindly add the following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
>    mm/hugetlb.c: In function 'hugetlb_fault':
> >> mm/hugetlb.c:5678:13: error: implicit declaration of function 'huge_pte_none_mostly'; did you mean 'pte_none_mostly'? [-Werror=implicit-function-declaration]
>     5678 |         if (huge_pte_none_mostly(entry)) {
>          |             ^~~~~~~~~~~~~~~~~~~~
>          |             pte_none_mostly
>    cc1: some warnings being treated as errors

Ah, the s390 stub was forgotten again, sorry.  I hope someday s390 will
start to include asm-generic/hugetlb.h like all the other archs, because
intuitively that's how it should work... or the dir should be renamed to
asm-generic-without-s390/. :(

The expected fix patch is attached (to be squashed into patch "mm: Introduce
PTE_MARKER swap entry").

Thanks,
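
The fix patch itself does not appear on this page.  As a rough illustration
only, the userspace sketch below shows the pattern being discussed: an
architecture that does not pull in the generic header has to supply its own
helper, otherwise callers hit an implicit declaration.  All toy_ and TOY_
names are hypothetical, and the behaviour of the arch-side stub (recognizing
only none entries) is an assumption made for this example, not a statement
about the actual s390 change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for pte_t. */
typedef struct { uint64_t val; } toy_pte_t;

#define TOY_MARKER (1ULL << 3)	/* hypothetical marker encoding */

/* What a generic header might provide to most architectures. */
static inline bool toy_generic_huge_pte_none(toy_pte_t pte)
{
	return pte.val == 0;
}

static inline bool toy_generic_huge_pte_none_mostly(toy_pte_t pte)
{
	return toy_generic_huge_pte_none(pte) ||
	       (pte.val & TOY_MARKER) != 0;
}

/*
 * An arch that does not include the generic header must define the helper
 * itself.  Assumption for this sketch: a conservative stub that only
 * recognizes none entries, which is enough to make callers compile.
 */
static inline bool toy_arch_huge_pte_none_mostly(toy_pte_t pte)
{
	return toy_generic_huge_pte_none(pte);
}
```

The point of the sketch is the build dependency, not the semantics: every
architecture needs some definition of the helper before hugetlb_fault() can
call it.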

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2401dd5997b7..9317b790161d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5412,7 +5412,8 @@  static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			struct address_space *mapping, pgoff_t idx,
-			unsigned long address, pte_t *ptep, unsigned int flags)
+			unsigned long address, pte_t *ptep,
+			pte_t old_pte, unsigned int flags)
 {
 	struct hstate *h = hstate_vma(vma);
 	vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -5539,7 +5540,8 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 	ptl = huge_pte_lock(h, mm, ptep);
 	ret = 0;
-	if (!huge_pte_none(huge_ptep_get(ptep)))
+	/* If pte changed from under us, retry */
+	if (!pte_same(huge_ptep_get(ptep), old_pte))
 		goto backout;
 
 	if (anon_rmap) {
@@ -5549,6 +5551,12 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		page_dup_file_rmap(page, true);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
+	/*
+	 * If this pte was previously wr-protected, keep it wr-protected even
+	 * if populated.
+	 */
+	if (unlikely(pte_marker_uffd_wp(old_pte)))
+		new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
 	set_huge_pte_at(mm, haddr, ptep, new_pte);
 
 	hugetlb_count_add(pages_per_huge_page(h), mm);
@@ -5666,8 +5674,10 @@  vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 	entry = huge_ptep_get(ptep);
-	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
+	/* PTE markers should be handled the same way as none pte */
+	if (huge_pte_none_mostly(entry)) {
+		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+				      entry, flags);
 		goto out_mutex;
 	}
