diff mbox series

[v8,1/4] userfaultfd: Add UFFD WP Async support

Message ID 20230124084323.1363825-2-usama.anjum@collabora.com (mailing list archive)
State New
Headers show
Series Implement IOCTL to get and/or the clear info about PTEs | expand

Commit Message

Muhammad Usama Anjum Jan. 24, 2023, 8:43 a.m. UTC
Add new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page
faults on its own. It can be used to track that which pages have been
written-to from the time the pages were write-protected. It is very
efficient way to track the changes as uffd is by nature pte/pmd based.

UFFD synchronous WP sends the page faults to the userspace where the
pages which have been written-to can be tracked. But it is not efficient.
This is why this asynchronous version is being added. After setting the
WP Async, the pages which have been written to can be found in the pagemap
file or information can be obtained from the PAGEMAP_IOCTL.

Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v7:
- Remove UFFDIO_WRITEPROTECT_MODE_ASYNC_WP and add UFFD_FEATURE_WP_ASYNC
- Handle automatic page fault resolution in better way (thanks to Peter)
---
 fs/userfaultfd.c                 | 11 +++++++++++
 include/linux/userfaultfd_k.h    |  6 ++++++
 include/uapi/linux/userfaultfd.h |  8 +++++++-
 mm/memory.c                      | 29 +++++++++++++++++++++++++++--
 4 files changed, 51 insertions(+), 3 deletions(-)

Comments

kernel test robot Jan. 24, 2023, 10:43 a.m. UTC | #1
Hi Muhammad,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on shuah-kselftest/next]
[also build test ERROR on shuah-kselftest/fixes linus/master v6.2-rc5 next-20230124]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/userfaultfd-Add-UFFD-WP-Async-support/20230124-164601
base:   https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link:    https://lore.kernel.org/r/20230124084323.1363825-2-usama.anjum%40collabora.com
patch subject: [PATCH v8 1/4] userfaultfd: Add UFFD WP Async support
config: i386-tinyconfig (https://download.01.org/0day-ci/archive/20230124/202301241804.zHrFxA0L-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/59e98aec663b7ca8fd5f3b3d2a0f17f777f425c4
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Muhammad-Usama-Anjum/userfaultfd-Add-UFFD-WP-Async-support/20230124-164601
        git checkout 59e98aec663b7ca8fd5f3b3d2a0f17f777f425c4
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=i386 olddefconfig
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: arch/x86/mm/init_32.o: in function `userfaultfd_wp_async':
>> init_32.c:(.text+0x0): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: arch/x86/mm/fault.o: in function `userfaultfd_wp_async':
   fault.c:(.text+0x8cd): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: arch/x86/mm/pgtable.o: in function `userfaultfd_wp_async':
   pgtable.c:(.text+0x0): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: kernel/fork.o: in function `userfaultfd_wp_async':
   fork.c:(.text+0x5bb): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: kernel/sysctl.o: in function `userfaultfd_wp_async':
   sysctl.c:(.text+0x0): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: kernel/sys.o: in function `userfaultfd_wp_async':
   sys.c:(.text+0xb5e): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: kernel/events/core.o: in function `userfaultfd_wp_async':
   core.c:(.text+0x404c): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/filemap.o: in function `userfaultfd_wp_async':
   filemap.c:(.text+0x7c5): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/page-writeback.o: in function `userfaultfd_wp_async':
   page-writeback.c:(.text+0xb69): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/folio-compat.o: in function `userfaultfd_wp_async':
   folio-compat.c:(.text+0xc): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/readahead.o: in function `userfaultfd_wp_async':
   readahead.c:(.text+0x128): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/swap.o: in function `userfaultfd_wp_async':
   swap.c:(.text+0x6e5): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/vmscan.o: in function `userfaultfd_wp_async':
   vmscan.c:(.text+0xf96): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/shmem.o: in function `userfaultfd_wp_async':
   shmem.c:(.text+0x91): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/util.o: in function `userfaultfd_wp_async':
   util.c:(.text+0x2b): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/vmstat.o: in function `userfaultfd_wp_async':
   vmstat.c:(.text+0x0): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/compaction.o: in function `userfaultfd_wp_async':
   compaction.c:(.text+0x0): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/workingset.o: in function `userfaultfd_wp_async':
   workingset.c:(.text+0x181): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/debug.o: in function `userfaultfd_wp_async':
   debug.c:(.text+0xb9): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/gup.o: in function `userfaultfd_wp_async':
   gup.c:(.text+0x2ae): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/memory.o: in function `userfaultfd_wp_async':
   memory.c:(.text+0x737): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/mincore.o: in function `userfaultfd_wp_async':
   mincore.c:(.text+0x149): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/mlock.o: in function `userfaultfd_wp_async':
   mlock.c:(.text+0x90e): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/mmap.o: in function `userfaultfd_wp_async':
   mmap.c:(.text+0x4d8): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/mmu_gather.o: in function `userfaultfd_wp_async':
   mmu_gather.c:(.text+0x29): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/mprotect.o: in function `userfaultfd_wp_async':
   mprotect.c:(.text+0x45): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/mremap.o: in function `userfaultfd_wp_async':
   mremap.c:(.text+0x36b): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/page_vma_mapped.o: in function `userfaultfd_wp_async':
   page_vma_mapped.c:(.text+0x28): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/pagewalk.o: in function `userfaultfd_wp_async':
   pagewalk.c:(.text+0x30d): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/pgtable-generic.o: in function `userfaultfd_wp_async':
   pgtable-generic.c:(.text+0x0): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/rmap.o: in function `userfaultfd_wp_async':
   rmap.c:(.text+0x655): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/vmalloc.o: in function `userfaultfd_wp_async':
   vmalloc.c:(.text+0x1546): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: mm/page_alloc.o: in function `userfaultfd_wp_async':
   page_alloc.c:(.text+0xdb1): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: fs/splice.o: in function `userfaultfd_wp_async':
   splice.c:(.text+0x6f8): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
   ld: security/commoncap.o: in function `userfaultfd_wp_async':
   commoncap.c:(.text+0x8b): multiple definition of `userfaultfd_wp_async'; arch/x86/kernel/setup.o:setup.c:(.text+0x3): first defined here
kernel test robot Jan. 24, 2023, 11:04 a.m. UTC | #2
Hi Muhammad,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on shuah-kselftest/next]
[also build test WARNING on shuah-kselftest/fixes linus/master v6.2-rc5 next-20230124]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/userfaultfd-Add-UFFD-WP-Async-support/20230124-164601
base:   https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link:    https://lore.kernel.org/r/20230124084323.1363825-2-usama.anjum%40collabora.com
patch subject: [PATCH v8 1/4] userfaultfd: Add UFFD WP Async support
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20230124/202301241800.pdnow11d-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/59e98aec663b7ca8fd5f3b3d2a0f17f777f425c4
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Muhammad-Usama-Anjum/userfaultfd-Add-UFFD-WP-Async-support/20230124-164601
        git checkout 59e98aec663b7ca8fd5f3b3d2a0f17f777f425c4
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=um SUBARCH=i386 olddefconfig
        make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from include/linux/mm_inline.h:9,
                    from kernel/fork.c:46:
>> include/linux/userfaultfd_k.h:278:5: warning: no previous prototype for 'userfaultfd_wp_async' [-Wmissing-prototypes]
     278 | int userfaultfd_wp_async(struct vm_area_struct *vma)
         |     ^~~~~~~~~~~~~~~~~~~~
   kernel/fork.c:162:13: warning: no previous prototype for 'arch_release_task_struct' [-Wmissing-prototypes]
     162 | void __weak arch_release_task_struct(struct task_struct *tsk)
         |             ^~~~~~~~~~~~~~~~~~~~~~~~
   kernel/fork.c:862:20: warning: no previous prototype for 'arch_task_cache_init' [-Wmissing-prototypes]
     862 | void __init __weak arch_task_cache_init(void) { }
         |                    ^~~~~~~~~~~~~~~~~~~~
   kernel/fork.c:957:12: warning: no previous prototype for 'arch_dup_task_struct' [-Wmissing-prototypes]
     957 | int __weak arch_dup_task_struct(struct task_struct *dst,
         |            ^~~~~~~~~~~~~~~~~~~~
--
   In file included from include/linux/hugetlb.h:14,
                    from kernel/sysctl.c:46:
>> include/linux/userfaultfd_k.h:278:5: warning: no previous prototype for 'userfaultfd_wp_async' [-Wmissing-prototypes]
     278 | int userfaultfd_wp_async(struct vm_area_struct *vma)
         |     ^~~~~~~~~~~~~~~~~~~~
--
   In file included from include/linux/mm_inline.h:9,
                    from mm/memory.c:44:
>> include/linux/userfaultfd_k.h:278:5: warning: no previous prototype for 'userfaultfd_wp_async' [-Wmissing-prototypes]
     278 | int userfaultfd_wp_async(struct vm_area_struct *vma)
         |     ^~~~~~~~~~~~~~~~~~~~
   mm/memory.c: In function 'wp_huge_pmd':
   mm/memory.c:4824:33: error: implicit declaration of function 'set_pmd_at'; did you mean 'set_pte_at'? [-Werror=implicit-function-declaration]
    4824 |                                 set_pmd_at(vmf->vma->vm_mm, vmf->address, vmf->pmd,
         |                                 ^~~~~~~~~~
         |                                 set_pte_at
   cc1: some warnings being treated as errors
--
   In file included from include/linux/hugetlb.h:14,
                    from fs/proc/meminfo.c:6:
>> include/linux/userfaultfd_k.h:278:5: warning: no previous prototype for 'userfaultfd_wp_async' [-Wmissing-prototypes]
     278 | int userfaultfd_wp_async(struct vm_area_struct *vma)
         |     ^~~~~~~~~~~~~~~~~~~~
   fs/proc/meminfo.c:22:28: warning: no previous prototype for 'arch_report_meminfo' [-Wmissing-prototypes]
      22 | void __attribute__((weak)) arch_report_meminfo(struct seq_file *m)
         |                            ^~~~~~~~~~~~~~~~~~~


vim +/userfaultfd_wp_async +278 include/linux/userfaultfd_k.h

   277	
 > 278	int userfaultfd_wp_async(struct vm_area_struct *vma)
   279	{
   280		return false;
   281	}
   282
kernel test robot Jan. 24, 2023, 11:15 a.m. UTC | #3
Hi Muhammad,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on shuah-kselftest/next]
[also build test ERROR on shuah-kselftest/fixes linus/master v6.2-rc5 next-20230124]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/userfaultfd-Add-UFFD-WP-Async-support/20230124-164601
base:   https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link:    https://lore.kernel.org/r/20230124084323.1363825-2-usama.anjum%40collabora.com
patch subject: [PATCH v8 1/4] userfaultfd: Add UFFD WP Async support
config: powerpc-allnoconfig (https://download.01.org/0day-ci/archive/20230124/202301241830.KlT5Xcjy-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/59e98aec663b7ca8fd5f3b3d2a0f17f777f425c4
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Muhammad-Usama-Anjum/userfaultfd-Add-UFFD-WP-Async-support/20230124-164601
        git checkout 59e98aec663b7ca8fd5f3b3d2a0f17f777f425c4
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=powerpc olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=powerpc SHELL=/bin/bash arch/powerpc/kernel/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from include/linux/hugetlb.h:14,
                    from arch/powerpc/kernel/setup-common.c:37:
>> include/linux/userfaultfd_k.h:278:5: error: no previous prototype for 'userfaultfd_wp_async' [-Werror=missing-prototypes]
     278 | int userfaultfd_wp_async(struct vm_area_struct *vma)
         |     ^~~~~~~~~~~~~~~~~~~~
   cc1: all warnings being treated as errors


vim +/userfaultfd_wp_async +278 include/linux/userfaultfd_k.h

   277	
 > 278	int userfaultfd_wp_async(struct vm_area_struct *vma)
   279	{
   280		return false;
   281	}
   282
Peter Xu Jan. 26, 2023, 11:05 p.m. UTC | #4
On Tue, Jan 24, 2023 at 01:43:20PM +0500, Muhammad Usama Anjum wrote:
> Add new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page
> faults on its own. It can be used to track that which pages have been
> written-to from the time the pages were write-protected. It is very
> efficient way to track the changes as uffd is by nature pte/pmd based.
> 
> UFFD synchronous WP sends the page faults to the userspace where the
> pages which have been written-to can be tracked. But it is not efficient.
> This is why this asynchronous version is being added. After setting the
> WP Async, the pages which have been written to can be found in the pagemap
> file or information can be obtained from the PAGEMAP_IOCTL.
> 
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> ---
> Changes in v7:
> - Remove UFFDIO_WRITEPROTECT_MODE_ASYNC_WP and add UFFD_FEATURE_WP_ASYNC
> - Handle automatic page fault resolution in better way (thanks to Peter)
> ---
>  fs/userfaultfd.c                 | 11 +++++++++++
>  include/linux/userfaultfd_k.h    |  6 ++++++
>  include/uapi/linux/userfaultfd.h |  8 +++++++-
>  mm/memory.c                      | 29 +++++++++++++++++++++++++++--
>  4 files changed, 51 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 15a5bf765d43..b82af02092ce 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1867,6 +1867,10 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
>  	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
>  
> +	/* Write protection cannot be disabled in case of aync WP */

s/aync/async/

A slight reworded version:

        /* Unprotection is not supported if in async WP mode */

> +	if (!mode_wp && (ctx->features & UFFD_FEATURE_WP_ASYNC))
> +		return -EINVAL;
> +
>  	if (mode_wp && mode_dontwake)
>  		return -EINVAL;
>  
> @@ -1950,6 +1954,13 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>  	return ret;
>  }
>  
> +int userfaultfd_wp_async(struct vm_area_struct *vma)
> +{
> +	struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
> +
> +	return (ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC));
> +}
> +
>  static inline unsigned int uffd_ctx_features(__u64 user_features)
>  {
>  	/*
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 9df0b9a762cc..5db51fccae1d 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -179,6 +179,7 @@ extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
>  				  unsigned long end, struct list_head *uf);
>  extern void userfaultfd_unmap_complete(struct mm_struct *mm,
>  				       struct list_head *uf);
> +extern int userfaultfd_wp_async(struct vm_area_struct *vma);
>  
>  #else /* CONFIG_USERFAULTFD */
>  
> @@ -274,6 +275,11 @@ static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
>  	return false;
>  }
>  
> +int userfaultfd_wp_async(struct vm_area_struct *vma)
> +{
> +	return false;
> +}
> +
>  #endif /* CONFIG_USERFAULTFD */
>  
>  static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 005e5e306266..f4252ef40071 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -38,7 +38,8 @@
>  			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
>  			   UFFD_FEATURE_MINOR_SHMEM |		\
>  			   UFFD_FEATURE_EXACT_ADDRESS |		\
> -			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
> +			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
> +			   UFFD_FEATURE_WP_ASYNC)
>  #define UFFD_API_IOCTLS				\
>  	((__u64)1 << _UFFDIO_REGISTER |		\
>  	 (__u64)1 << _UFFDIO_UNREGISTER |	\
> @@ -203,6 +204,10 @@ struct uffdio_api {
>  	 *
>  	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
>  	 * write-protection mode is supported on both shmem and hugetlbfs.
> +	 *
> +	 * UFFD_FEATURE_WP_ASYNC indicates that userfaultfd write-protection
> +	 * asynchronous mode is supported in which the write fault is automatically
> +	 * resolved and write-protection is un-set.
>  	 */
>  #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
>  #define UFFD_FEATURE_EVENT_FORK			(1<<1)
> @@ -217,6 +222,7 @@ struct uffdio_api {
>  #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
>  #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
>  #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
> +#define UFFD_FEATURE_WP_ASYNC			(1<<13)
>  	__u64 features;
>  
>  	__u64 ioctls;
> diff --git a/mm/memory.c b/mm/memory.c
> index 4000e9f017e0..8c03b133d483 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3351,6 +3351,18 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  
>  	if (likely(!unshare)) {
>  		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> +			if (userfaultfd_wp_async(vma)) {
> +				/*
> +				 * Nothing needed (cache flush, TLB invalidations,
> +				 * etc.) because we're only removing the uffd-wp bit,
> +				 * which is completely invisible to the user. This
> +				 * falls through to possible CoW.

Here it says it falls through to CoW, but..

> +				 */
> +				pte_unmap_unlock(vmf->pte, vmf->ptl);
> +				set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
> +					   pte_clear_uffd_wp(*vmf->pte));
> +				return 0;

... it's not doing so.  The original lines should do:

https://lore.kernel.org/all/Y8qq0dKIJBshua+X@x1n/

Side note: you cannot modify pgtable after releasing the pgtable lock.
It's racy.

> +			}
>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>  			return handle_userfault(vmf, VM_UFFD_WP);
>  		}
> @@ -4812,8 +4824,21 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>  
>  	if (vma_is_anonymous(vmf->vma)) {
>  		if (likely(!unshare) &&
> -		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
> -			return handle_userfault(vmf, VM_UFFD_WP);
> +		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
> +			if (userfaultfd_wp_async(vmf->vma)) {
> +				/*
> +				 * Nothing needed (cache flush, TLB invalidations,
> +				 * etc.) because we're only removing the uffd-wp bit,
> +				 * which is completely invisible to the user. This
> +				 * falls through to possible CoW.
> +				 */
> +				set_pmd_at(vmf->vma->vm_mm, vmf->address, vmf->pmd,
> +					   pmd_clear_uffd_wp(*vmf->pmd));

This is for THP, not hugetlb.

Clearing uffd-wp bit here for the whole pmd is wrong to me, because we
track writes in small page sizes only.  We should just split.

The relevant code for hugetlb resides in hugetlb_fault().

> +				return 0;
> +			} else {
> +				return handle_userfault(vmf, VM_UFFD_WP);
> +			}
> +		}
>  		return do_huge_pmd_wp_page(vmf);
>  	}
>  
> -- 
> 2.30.2
>
Muhammad Usama Anjum Jan. 27, 2023, 6:47 a.m. UTC | #5
On 1/27/23 4:05 AM, Peter Xu wrote:
> On Tue, Jan 24, 2023 at 01:43:20PM +0500, Muhammad Usama Anjum wrote:
>> Add new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page
>> faults on its own. It can be used to track that which pages have been
>> written-to from the time the pages were write-protected. It is very
>> efficient way to track the changes as uffd is by nature pte/pmd based.
>>
>> UFFD synchronous WP sends the page faults to the userspace where the
>> pages which have been written-to can be tracked. But it is not efficient.
>> This is why this asynchronous version is being added. After setting the
>> WP Async, the pages which have been written to can be found in the pagemap
>> file or information can be obtained from the PAGEMAP_IOCTL.
>>
>> Suggested-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
>> ---
>> Changes in v7:
>> - Remove UFFDIO_WRITEPROTECT_MODE_ASYNC_WP and add UFFD_FEATURE_WP_ASYNC
>> - Handle automatic page fault resolution in better way (thanks to Peter)
>> ---
>>  fs/userfaultfd.c                 | 11 +++++++++++
>>  include/linux/userfaultfd_k.h    |  6 ++++++
>>  include/uapi/linux/userfaultfd.h |  8 +++++++-
>>  mm/memory.c                      | 29 +++++++++++++++++++++++++++--
>>  4 files changed, 51 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index 15a5bf765d43..b82af02092ce 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -1867,6 +1867,10 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>>  	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
>>  	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
>>  
>> +	/* Write protection cannot be disabled in case of aync WP */
> 
> s/aync/async/
> 
> A slight reworded version:
> 
>         /* Unprotection is not supported if in async WP mode */
> 
Will fix in next version.

>> +	if (!mode_wp && (ctx->features & UFFD_FEATURE_WP_ASYNC))
>> +		return -EINVAL;
>> +
>>  	if (mode_wp && mode_dontwake)
>>  		return -EINVAL;
>>  
>> @@ -1950,6 +1954,13 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>>  	return ret;
>>  }
>>  
>> +int userfaultfd_wp_async(struct vm_area_struct *vma)
>> +{
>> +	struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
>> +
>> +	return (ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC));
>> +}
>> +
>>  static inline unsigned int uffd_ctx_features(__u64 user_features)
>>  {
>>  	/*
>> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
>> index 9df0b9a762cc..5db51fccae1d 100644
>> --- a/include/linux/userfaultfd_k.h
>> +++ b/include/linux/userfaultfd_k.h
>> @@ -179,6 +179,7 @@ extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
>>  				  unsigned long end, struct list_head *uf);
>>  extern void userfaultfd_unmap_complete(struct mm_struct *mm,
>>  				       struct list_head *uf);
>> +extern int userfaultfd_wp_async(struct vm_area_struct *vma);
>>  
>>  #else /* CONFIG_USERFAULTFD */
>>  
>> @@ -274,6 +275,11 @@ static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
>>  	return false;
>>  }
>>  
>> +int userfaultfd_wp_async(struct vm_area_struct *vma)
>> +{
>> +	return false;
>> +}
>> +
>>  #endif /* CONFIG_USERFAULTFD */
>>  
>>  static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
>> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
>> index 005e5e306266..f4252ef40071 100644
>> --- a/include/uapi/linux/userfaultfd.h
>> +++ b/include/uapi/linux/userfaultfd.h
>> @@ -38,7 +38,8 @@
>>  			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
>>  			   UFFD_FEATURE_MINOR_SHMEM |		\
>>  			   UFFD_FEATURE_EXACT_ADDRESS |		\
>> -			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
>> +			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
>> +			   UFFD_FEATURE_WP_ASYNC)
>>  #define UFFD_API_IOCTLS				\
>>  	((__u64)1 << _UFFDIO_REGISTER |		\
>>  	 (__u64)1 << _UFFDIO_UNREGISTER |	\
>> @@ -203,6 +204,10 @@ struct uffdio_api {
>>  	 *
>>  	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
>>  	 * write-protection mode is supported on both shmem and hugetlbfs.
>> +	 *
>> +	 * UFFD_FEATURE_WP_ASYNC indicates that userfaultfd write-protection
>> +	 * asynchronous mode is supported in which the write fault is automatically
>> +	 * resolved and write-protection is un-set.
>>  	 */
>>  #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
>>  #define UFFD_FEATURE_EVENT_FORK			(1<<1)
>> @@ -217,6 +222,7 @@ struct uffdio_api {
>>  #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
>>  #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
>>  #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
>> +#define UFFD_FEATURE_WP_ASYNC			(1<<13)
>>  	__u64 features;
>>  
>>  	__u64 ioctls;
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 4000e9f017e0..8c03b133d483 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3351,6 +3351,18 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>  
>>  	if (likely(!unshare)) {
>>  		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
>> +			if (userfaultfd_wp_async(vma)) {
>> +				/*
>> +				 * Nothing needed (cache flush, TLB invalidations,
>> +				 * etc.) because we're only removing the uffd-wp bit,
>> +				 * which is completely invisible to the user. This
>> +				 * falls through to possible CoW.
> 
> Here it says it falls through to CoW, but..
> 
>> +				 */
>> +				pte_unmap_unlock(vmf->pte, vmf->ptl);
>> +				set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
>> +					   pte_clear_uffd_wp(*vmf->pte));
>> +				return 0;
> 
> ... it's not doing so.  The original lines should do:
> 
> https://lore.kernel.org/all/Y8qq0dKIJBshua+X@x1n/
> 
> Side note: you cannot modify pgtable after releasing the pgtable lock.
> It's racy.
If I don't unlock and return after removing the UFFD_WP flag in case of
async wp, the target just gets stuck. Maybe the pte lock is not unlocked in
some path.

If I unlock and don't return, the crash happens.

So I'd put unlock and return from here. Please comment on the below patch
and what do you think should be done. I've missed something.

> 
>> +			}
>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  			return handle_userfault(vmf, VM_UFFD_WP);
>>  		}
>> @@ -4812,8 +4824,21 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>>  
>>  	if (vma_is_anonymous(vmf->vma)) {
>>  		if (likely(!unshare) &&
>> -		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
>> -			return handle_userfault(vmf, VM_UFFD_WP);
>> +		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
>> +			if (userfaultfd_wp_async(vmf->vma)) {
>> +				/*
>> +				 * Nothing needed (cache flush, TLB invalidations,
>> +				 * etc.) because we're only removing the uffd-wp bit,
>> +				 * which is completely invisible to the user. This
>> +				 * falls through to possible CoW.
>> +				 */
>> +				set_pmd_at(vmf->vma->vm_mm, vmf->address, vmf->pmd,
>> +					   pmd_clear_uffd_wp(*vmf->pmd));
> 
> This is for THP, not hugetlb.
> 
> Clearing uffd-wp bit here for the whole pmd is wrong to me, because we
> track writes in small page sizes only.  We should just split.
By detecting if the fault is async wp, just splitting the PMD doesn't work.
The below given snippit is working right now. But definately, the fault of
the whole PMD is being resolved which if we can bypass by correctly
splitting would be highly desirable. Can you please take a look on UFFD
side and suggest the changes? It would be much appreciated. I'm attaching
WIP v9 patches for you to apply on next(next-20230105) and pagemap_ioctl
selftest can be ran to test things after making changes.

> 
> The relevant code for hugetlb resides in hugetlb_fault().
> 
>> +				return 0;
>> +			} else {
>> +				return handle_userfault(vmf, VM_UFFD_WP);
>> +			}
>> +		}
>>  		return do_huge_pmd_wp_page(vmf);
>>  	}
>>  
>> -- 
>> 2.30.2
>>
> 



--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3351,6 +3351,18 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)

        if (likely(!unshare)) {
                if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+                       if (userfaultfd_wp_async(vma)) {
+                               set_pte_at(vma->vm_mm, vmf->address,
+                                          vmf->pte,
+                                          pte_clear_uffd_wp(*vmf->pte));
+                               pte_unmap_unlock(vmf->pte, vmf->ptl);
+                               return 0;
+                       }
                        pte_unmap_unlock(vmf->pte, vmf->ptl);
                        return handle_userfault(vmf, VM_UFFD_WP);
                }
@@ -4812,8 +4824,21 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault
*vmf)

        if (vma_is_anonymous(vmf->vma)) {
                if (likely(!unshare) &&
-                   userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
-                       return handle_userfault(vmf, VM_UFFD_WP);
+                   userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
+                       if (userfaultfd_wp_async(vmf->vma)) {
+                               set_pmd_at(vmf->vma->vm_mm, vmf->address,
+                                          vmf->pmd,
+                                          pmd_clear_uffd_wp(*vmf->pmd));
+                               return 0;
+                       } else {
+                               return handle_userfault(vmf, VM_UFFD_WP);
+                       }
+               }
                return do_huge_pmd_wp_page(vmf);
        }
Peter Xu Jan. 27, 2023, 3:32 p.m. UTC | #6
On Fri, Jan 27, 2023 at 11:47:14AM +0500, Muhammad Usama Anjum wrote:
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 4000e9f017e0..8c03b133d483 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3351,6 +3351,18 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> >>  
> >>  	if (likely(!unshare)) {
> >>  		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> >> +			if (userfaultfd_wp_async(vma)) {
> >> +				/*
> >> +				 * Nothing needed (cache flush, TLB invalidations,
> >> +				 * etc.) because we're only removing the uffd-wp bit,
> >> +				 * which is completely invisible to the user. This
> >> +				 * falls through to possible CoW.
> > 
> > Here it says it falls through to CoW, but..
> > 
> >> +				 */
> >> +				pte_unmap_unlock(vmf->pte, vmf->ptl);
> >> +				set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
> >> +					   pte_clear_uffd_wp(*vmf->pte));
> >> +				return 0;
> > 
> > ... it's not doing so.  The original lines should do:
> > 
> > https://lore.kernel.org/all/Y8qq0dKIJBshua+X@x1n/

[1]

> > 
> > Side note: you cannot modify pgtable after releasing the pgtable lock.
> > It's racy.
> If I don't unlock and return after removing the UFFD_WP flag in case of
> async wp, the target just gets stuck. Maybe the pte lock is not unlocked in
> some path.
> 
> If I unlock and don't return, the crash happens.
> 
> So I'd put unlock and return from here. Please comment on the below patch
> and what do you think should be done. I've missed something.

Have you tried to just use exactly what I suggested in [1]?  I'll paste
again:

---8<---
diff --git a/mm/memory.c b/mm/memory.c
index 4000e9f017e0..09aab434654c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3351,8 +3351,20 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)

        if (likely(!unshare)) {
                if (userfaultfd_pte_wp(vma, *vmf->pte)) {
-                       pte_unmap_unlock(vmf->pte, vmf->ptl);
-                       return handle_userfault(vmf, VM_UFFD_WP);
+                       if (userfaultfd_uffd_wp_async(vma)) {
+                               /*
+                                * Nothing needed (cache flush, TLB
+                                * invalidations, etc.) because we're only
+                                * removing the uffd-wp bit, which is
+                                * completely invisible to the user.
+                                * This falls through to possible CoW.
+                                */
+                               set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
+                                          pte_clear_uffd_wp(*vmf->pte));
+                       } else {
+                               pte_unmap_unlock(vmf->pte, vmf->ptl);
+                               return handle_userfault(vmf, VM_UFFD_WP);
+                       }
                }
---8<---

Note that there's no "return", neither the unlock.  The lock is used in the
follow up write fault resolution and it's released later.

Meanwhile please fully digest how pgtable lock is used in this path before
moving forward on any of such changes.

> 
> > 
> >> +			}
> >>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>  			return handle_userfault(vmf, VM_UFFD_WP);
> >>  		}
> >> @@ -4812,8 +4824,21 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
> >>  
> >>  	if (vma_is_anonymous(vmf->vma)) {
> >>  		if (likely(!unshare) &&
> >> -		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
> >> -			return handle_userfault(vmf, VM_UFFD_WP);
> >> +		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
> >> +			if (userfaultfd_wp_async(vmf->vma)) {
> >> +				/*
> >> +				 * Nothing needed (cache flush, TLB invalidations,
> >> +				 * etc.) because we're only removing the uffd-wp bit,
> >> +				 * which is completely invisible to the user. This
> >> +				 * falls through to possible CoW.
> >> +				 */
> >> +				set_pmd_at(vmf->vma->vm_mm, vmf->address, vmf->pmd,
> >> +					   pmd_clear_uffd_wp(*vmf->pmd));
> > 
> > This is for THP, not hugetlb.
> > 
> > Clearing uffd-wp bit here for the whole pmd is wrong to me, because we
> > track writes in small page sizes only.  We should just split.
> By detecting if the fault is async wp, just splitting the PMD doesn't work.
> The below given snippit is working right now. But definately, the fault of
> the whole PMD is being resolved which if we can bypass by correctly
> splitting would be highly desirable. Can you please take a look on UFFD
> side and suggest the changes? It would be much appreciated. I'm attaching
> WIP v9 patches for you to apply on next(next-20230105) and pagemap_ioctl
> selftest can be ran to test things after making changes.

Can you elaborate why thp split didn't work?  Or if you want, I can look
into this and provide the patch to enable uffd async mode.

Thanks,
Muhammad Usama Anjum Jan. 30, 2023, 8:38 a.m. UTC | #7
On 1/27/23 8:32 PM, Peter Xu wrote:
> On Fri, Jan 27, 2023 at 11:47:14AM +0500, Muhammad Usama Anjum wrote:
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 4000e9f017e0..8c03b133d483 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -3351,6 +3351,18 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>>>  
>>>>  	if (likely(!unshare)) {
>>>>  		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
>>>> +			if (userfaultfd_wp_async(vma)) {
>>>> +				/*
>>>> +				 * Nothing needed (cache flush, TLB invalidations,
>>>> +				 * etc.) because we're only removing the uffd-wp bit,
>>>> +				 * which is completely invisible to the user. This
>>>> +				 * falls through to possible CoW.
>>>
>>> Here it says it falls through to CoW, but..
>>>
>>>> +				 */
>>>> +				pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>> +				set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
>>>> +					   pte_clear_uffd_wp(*vmf->pte));
>>>> +				return 0;
>>>
>>> ... it's not doing so.  The original lines should do:
>>>
>>> https://lore.kernel.org/all/Y8qq0dKIJBshua+X@x1n/
> 
> [1]
> 
>>>
>>> Side note: you cannot modify pgtable after releasing the pgtable lock.
>>> It's racy.
>> If I don't unlock and return after removing the UFFD_WP flag in case of
>> async wp, the target just gets stuck. Maybe the pte lock is not unlocked in
>> some path.
>>
>> If I unlock and don't return, the crash happens.
>>
>> So I'd put unlock and return from here. Please comment on the below patch
>> and what do you think should be done. I've missed something.
> 
> Have you tried to just use exactly what I suggested in [1]?  I'll paste
> again:
> 
> ---8<---
> diff --git a/mm/memory.c b/mm/memory.c
> index 4000e9f017e0..09aab434654c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3351,8 +3351,20 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> 
>         if (likely(!unshare)) {
>                 if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> -                       pte_unmap_unlock(vmf->pte, vmf->ptl);
> -                       return handle_userfault(vmf, VM_UFFD_WP);
> +                       if (userfaultfd_uffd_wp_async(vma)) {
> +                               /*
> +                                * Nothing needed (cache flush, TLB
> +                                * invalidations, etc.) because we're only
> +                                * removing the uffd-wp bit, which is
> +                                * completely invisible to the user.
> +                                * This falls through to possible CoW.
> +                                */
> +                               set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
> +                                          pte_clear_uffd_wp(*vmf->pte));
> +                       } else {
> +                               pte_unmap_unlock(vmf->pte, vmf->ptl);
> +                               return handle_userfault(vmf, VM_UFFD_WP);
> +                       }
>                 }
> ---8<---
> 
> Note that there's no "return", neither the unlock.  The lock is used in the
> follow up write fault resolution and it's released later.
I've tried out the exact patch above. This doesn't work. The pages keep
their WP flag even after being resolved in do_wp_page() while is written on
the page.

So I'd added pte_unmap_unlock() and return 0 from here. This makes the
patch to work. Maybe you can try this on your end to see what I'm seeing here?

> 
> Meanwhile please fully digest how pgtable lock is used in this path before
> moving forward on any of such changes.
> 
>>
>>>
>>>> +			}
>>>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>  			return handle_userfault(vmf, VM_UFFD_WP);
>>>>  		}
>>>> @@ -4812,8 +4824,21 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>>>>  
>>>>  	if (vma_is_anonymous(vmf->vma)) {
>>>>  		if (likely(!unshare) &&
>>>> -		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
>>>> -			return handle_userfault(vmf, VM_UFFD_WP);
>>>> +		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
>>>> +			if (userfaultfd_wp_async(vmf->vma)) {
>>>> +				/*
>>>> +				 * Nothing needed (cache flush, TLB invalidations,
>>>> +				 * etc.) because we're only removing the uffd-wp bit,
>>>> +				 * which is completely invisible to the user. This
>>>> +				 * falls through to possible CoW.
>>>> +				 */
>>>> +				set_pmd_at(vmf->vma->vm_mm, vmf->address, vmf->pmd,
>>>> +					   pmd_clear_uffd_wp(*vmf->pmd));
>>>
>>> This is for THP, not hugetlb.
>>>
>>> Clearing uffd-wp bit here for the whole pmd is wrong to me, because we
>>> track writes in small page sizes only.  We should just split.
>> By detecting if the fault is async wp, just splitting the PMD doesn't work.
>> The below given snippit is working right now. But definately, the fault of
>> the whole PMD is being resolved which if we can bypass by correctly
>> splitting would be highly desirable. Can you please take a look on UFFD
>> side and suggest the changes? It would be much appreciated. I'm attaching
>> WIP v9 patches for you to apply on next(next-20230105) and pagemap_ioctl
>> selftest can be ran to test things after making changes.
> 
> Can you elaborate why thp split didn't work?  Or if you want, I can look
> into this and provide the patch to enable uffd async mode.
Sorry, I was doing the wrong way. Splitting the page does work. What do you
think about the following:

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3351,6 +3351,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)

 	if (likely(!unshare)) {
 		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+			if (userfaultfd_wp_async(vma)) {
+				/*
+				 * Nothing needed (cache flush, TLB invalidations,
+				 * etc.) because we're only removing the uffd-wp bit,
+				 * which is completely invisible to the user.
+				 */
+				set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
+					   pte_clear_uffd_wp(*vmf->pte));
+				pte_unmap_unlock(vmf->pte, vmf->ptl);
+				return 0;
+			}
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return handle_userfault(vmf, VM_UFFD_WP);
 		}
@@ -4812,8 +4823,13 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault
*vmf)

 	if (vma_is_anonymous(vmf->vma)) {
 		if (likely(!unshare) &&
-		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
+		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
+			if (userfaultfd_wp_async(vmf->vma)) {
+				__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
+				return 0;
+			}
 			return handle_userfault(vmf, VM_UFFD_WP);
+		}
 		return do_huge_pmd_wp_page(vmf);
 	}



> 
> Thanks,
>
Peter Xu Jan. 30, 2023, 9:27 p.m. UTC | #8
On Mon, Jan 30, 2023 at 01:38:16PM +0500, Muhammad Usama Anjum wrote:
> On 1/27/23 8:32 PM, Peter Xu wrote:
> > On Fri, Jan 27, 2023 at 11:47:14AM +0500, Muhammad Usama Anjum wrote:
> >>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>> index 4000e9f017e0..8c03b133d483 100644
> >>>> --- a/mm/memory.c
> >>>> +++ b/mm/memory.c
> >>>> @@ -3351,6 +3351,18 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> >>>>  
> >>>>  	if (likely(!unshare)) {
> >>>>  		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> >>>> +			if (userfaultfd_wp_async(vma)) {
> >>>> +				/*
> >>>> +				 * Nothing needed (cache flush, TLB invalidations,
> >>>> +				 * etc.) because we're only removing the uffd-wp bit,
> >>>> +				 * which is completely invisible to the user. This
> >>>> +				 * falls through to possible CoW.
> >>>
> >>> Here it says it falls through to CoW, but..
> >>>
> >>>> +				 */
> >>>> +				pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>>> +				set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
> >>>> +					   pte_clear_uffd_wp(*vmf->pte));
> >>>> +				return 0;
> >>>
> >>> ... it's not doing so.  The original lines should do:
> >>>
> >>> https://lore.kernel.org/all/Y8qq0dKIJBshua+X@x1n/
> > 
> > [1]
> > 
> >>>
> >>> Side note: you cannot modify pgtable after releasing the pgtable lock.
> >>> It's racy.
> >> If I don't unlock and return after removing the UFFD_WP flag in case of
> >> async wp, the target just gets stuck. Maybe the pte lock is not unlocked in
> >> some path.
> >>
> >> If I unlock and don't return, the crash happens.
> >>
> >> So I'd put unlock and return from here. Please comment on the below patch
> >> and what do you think should be done. I've missed something.
> > 
> > Have you tried to just use exactly what I suggested in [1]?  I'll paste
> > again:
> > 
> > ---8<---
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 4000e9f017e0..09aab434654c 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3351,8 +3351,20 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> > 
> >         if (likely(!unshare)) {
> >                 if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> > -                       pte_unmap_unlock(vmf->pte, vmf->ptl);
> > -                       return handle_userfault(vmf, VM_UFFD_WP);
> > +                       if (userfaultfd_uffd_wp_async(vma)) {
> > +                               /*
> > +                                * Nothing needed (cache flush, TLB
> > +                                * invalidations, etc.) because we're only
> > +                                * removing the uffd-wp bit, which is
> > +                                * completely invisible to the user.
> > +                                * This falls through to possible CoW.
> > +                                */
> > +                               set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
> > +                                          pte_clear_uffd_wp(*vmf->pte));
> > +                       } else {
> > +                               pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +                               return handle_userfault(vmf, VM_UFFD_WP);
> > +                       }
> >                 }
> > ---8<---
> > 
> > Note that there's no "return", neither the unlock.  The lock is used in the
> > follow up write fault resolution and it's released later.
> I've tried out the exact patch above. This doesn't work. The pages keep
> their WP flag even after being resolved in do_wp_page() while is written on
> the page.
> 
> So I'd added pte_unmap_unlock() and return 0 from here. This makes the
> patch to work. Maybe you can try this on your end to see what I'm seeing here?

Oh maybe it's because it didn't update orig_pte.  If you want, you can try
again with doing so by changing:

  set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
             pte_clear_uffd_wp(*vmf->pte));

into:

  pte_t pte = pte_clear_uffd_wp(*vmf->pte);
  set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
  /* Update this to be prepared for following up CoW handling */
  vmf->orig_pte = pte;

> 
> > 
> > Meanwhile please fully digest how pgtable lock is used in this path before
> > moving forward on any of such changes.
> > 
> >>
> >>>
> >>>> +			}
> >>>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>>>  			return handle_userfault(vmf, VM_UFFD_WP);
> >>>>  		}
> >>>> @@ -4812,8 +4824,21 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
> >>>>  
> >>>>  	if (vma_is_anonymous(vmf->vma)) {
> >>>>  		if (likely(!unshare) &&
> >>>> -		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
> >>>> -			return handle_userfault(vmf, VM_UFFD_WP);
> >>>> +		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
> >>>> +			if (userfaultfd_wp_async(vmf->vma)) {
> >>>> +				/*
> >>>> +				 * Nothing needed (cache flush, TLB invalidations,
> >>>> +				 * etc.) because we're only removing the uffd-wp bit,
> >>>> +				 * which is completely invisible to the user. This
> >>>> +				 * falls through to possible CoW.
> >>>> +				 */
> >>>> +				set_pmd_at(vmf->vma->vm_mm, vmf->address, vmf->pmd,
> >>>> +					   pmd_clear_uffd_wp(*vmf->pmd));
> >>>
> >>> This is for THP, not hugetlb.
> >>>
> >>> Clearing uffd-wp bit here for the whole pmd is wrong to me, because we
> >>> track writes in small page sizes only.  We should just split.
> >> By detecting if the fault is async wp, just splitting the PMD doesn't work.
> >> The below given snippit is working right now. But definately, the fault of
> >> the whole PMD is being resolved which if we can bypass by correctly
> >> splitting would be highly desirable. Can you please take a look on UFFD
> >> side and suggest the changes? It would be much appreciated. I'm attaching
> >> WIP v9 patches for you to apply on next(next-20230105) and pagemap_ioctl
> >> selftest can be ran to test things after making changes.
> > 
> > Can you elaborate why thp split didn't work?  Or if you want, I can look
> > into this and provide the patch to enable uffd async mode.
> Sorry, I was doing the wrong way. Splitting the page does work. What do you
> think about the following:
> 
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3351,6 +3351,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> 
>  	if (likely(!unshare)) {
>  		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> +			if (userfaultfd_wp_async(vma)) {
> +				/*
> +				 * Nothing needed (cache flush, TLB invalidations,
> +				 * etc.) because we're only removing the uffd-wp bit,
> +				 * which is completely invisible to the user.
> +				 */
> +				set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
> +					   pte_clear_uffd_wp(*vmf->pte));
> +				pte_unmap_unlock(vmf->pte, vmf->ptl);
> +				return 0;

Please give it a shot with above to see whether we can avoid the "return 0"
here.

> +			}
>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>  			return handle_userfault(vmf, VM_UFFD_WP);
>  		}
> @@ -4812,8 +4823,13 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault
> *vmf)
> 
>  	if (vma_is_anonymous(vmf->vma)) {
>  		if (likely(!unshare) &&
> -		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
> +		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
> +			if (userfaultfd_wp_async(vmf->vma)) {
> +				__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
> +				return 0;

Same here, I hope it'll work for you if you just goto __split_huge_pmd()
right below and return with VM_FAULT_FALLBACK.  It avoids one more round of
fault just like the pte case above.

> +			}
>  			return handle_userfault(vmf, VM_UFFD_WP);
> +		}
>  		return do_huge_pmd_wp_page(vmf);
>  	}
Muhammad Usama Anjum Jan. 31, 2023, 8:40 a.m. UTC | #9
On 1/31/23 2:27 AM, Peter Xu wrote:
> On Mon, Jan 30, 2023 at 01:38:16PM +0500, Muhammad Usama Anjum wrote:
>> On 1/27/23 8:32 PM, Peter Xu wrote:
>>> On Fri, Jan 27, 2023 at 11:47:14AM +0500, Muhammad Usama Anjum wrote:
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 4000e9f017e0..8c03b133d483 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -3351,6 +3351,18 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>>>>>  
>>>>>>  	if (likely(!unshare)) {
>>>>>>  		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
>>>>>> +			if (userfaultfd_wp_async(vma)) {
>>>>>> +				/*
>>>>>> +				 * Nothing needed (cache flush, TLB invalidations,
>>>>>> +				 * etc.) because we're only removing the uffd-wp bit,
>>>>>> +				 * which is completely invisible to the user. This
>>>>>> +				 * falls through to possible CoW.
>>>>>
>>>>> Here it says it falls through to CoW, but..
>>>>>
>>>>>> +				 */
>>>>>> +				pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>>> +				set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
>>>>>> +					   pte_clear_uffd_wp(*vmf->pte));
>>>>>> +				return 0;
>>>>>
>>>>> ... it's not doing so.  The original lines should do:
>>>>>
>>>>> https://lore.kernel.org/all/Y8qq0dKIJBshua+X@x1n/
>>>
>>> [1]
>>>
>>>>>
>>>>> Side note: you cannot modify pgtable after releasing the pgtable lock.
>>>>> It's racy.
>>>> If I don't unlock and return after removing the UFFD_WP flag in case of
>>>> async wp, the target just gets stuck. Maybe the pte lock is not unlocked in
>>>> some path.
>>>>
>>>> If I unlock and don't return, the crash happens.
>>>>
>>>> So I'd put unlock and return from here. Please comment on the below patch
>>>> and what do you think should be done. I've missed something.
>>>
>>> Have you tried to just use exactly what I suggested in [1]?  I'll paste
>>> again:
>>>
>>> ---8<---
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 4000e9f017e0..09aab434654c 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3351,8 +3351,20 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>>
>>>         if (likely(!unshare)) {
>>>                 if (userfaultfd_pte_wp(vma, *vmf->pte)) {
>>> -                       pte_unmap_unlock(vmf->pte, vmf->ptl);
>>> -                       return handle_userfault(vmf, VM_UFFD_WP);
>>> +                       if (userfaultfd_uffd_wp_async(vma)) {
>>> +                               /*
>>> +                                * Nothing needed (cache flush, TLB
>>> +                                * invalidations, etc.) because we're only
>>> +                                * removing the uffd-wp bit, which is
>>> +                                * completely invisible to the user.
>>> +                                * This falls through to possible CoW.
>>> +                                */
>>> +                               set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
>>> +                                          pte_clear_uffd_wp(*vmf->pte));
>>> +                       } else {
>>> +                               pte_unmap_unlock(vmf->pte, vmf->ptl);
>>> +                               return handle_userfault(vmf, VM_UFFD_WP);
>>> +                       }
>>>                 }
>>> ---8<---
>>>
>>> Note that there's no "return", neither the unlock.  The lock is used in the
>>> follow up write fault resolution and it's released later.
>> I've tried out the exact patch above. This doesn't work. The pages keep
>> their WP flag even after being resolved in do_wp_page() while is written on
>> the page.
>>
>> So I'd added pte_unmap_unlock() and return 0 from here. This makes the
>> patch to work. Maybe you can try this on your end to see what I'm seeing here?
> 
> Oh maybe it's because it didn't update orig_pte.  If you want, you can try
> again with doing so by changing:
> 
>   set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
>              pte_clear_uffd_wp(*vmf->pte));
> 
> into:
> 
>   pte_t pte = pte_clear_uffd_wp(*vmf->pte);
>   set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>   /* Update this to be prepared for following up CoW handling */
>   vmf->orig_pte = pte;
> 
It works.

>>
>>>
>>> Meanwhile please fully digest how pgtable lock is used in this path before
>>> moving forward on any of such changes.
>>>
>>>>
>>>>>
>>>>>> +			}
>>>>>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>>>  			return handle_userfault(vmf, VM_UFFD_WP);
>>>>>>  		}
>>>>>> @@ -4812,8 +4824,21 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>>>>>>  
>>>>>>  	if (vma_is_anonymous(vmf->vma)) {
>>>>>>  		if (likely(!unshare) &&
>>>>>> -		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
>>>>>> -			return handle_userfault(vmf, VM_UFFD_WP);
>>>>>> +		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
>>>>>> +			if (userfaultfd_wp_async(vmf->vma)) {
>>>>>> +				/*
>>>>>> +				 * Nothing needed (cache flush, TLB invalidations,
>>>>>> +				 * etc.) because we're only removing the uffd-wp bit,
>>>>>> +				 * which is completely invisible to the user. This
>>>>>> +				 * falls through to possible CoW.
>>>>>> +				 */
>>>>>> +				set_pmd_at(vmf->vma->vm_mm, vmf->address, vmf->pmd,
>>>>>> +					   pmd_clear_uffd_wp(*vmf->pmd));
>>>>>
>>>>> This is for THP, not hugetlb.
>>>>>
>>>>> Clearing uffd-wp bit here for the whole pmd is wrong to me, because we
>>>>> track writes in small page sizes only.  We should just split.
>>>> By detecting if the fault is async wp, just splitting the PMD doesn't work.
>>>> The below given snippit is working right now. But definately, the fault of
>>>> the whole PMD is being resolved which if we can bypass by correctly
>>>> splitting would be highly desirable. Can you please take a look on UFFD
>>>> side and suggest the changes? It would be much appreciated. I'm attaching
>>>> WIP v9 patches for you to apply on next(next-20230105) and pagemap_ioctl
>>>> selftest can be ran to test things after making changes.
>>>
>>> Can you elaborate why thp split didn't work?  Or if you want, I can look
>>> into this and provide the patch to enable uffd async mode.
>> Sorry, I was doing the wrong way. Splitting the page does work. What do you
>> think about the following:
>>
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3351,6 +3351,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>
>>  	if (likely(!unshare)) {
>>  		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
>> +			if (userfaultfd_wp_async(vma)) {
>> +				/*
>> +				 * Nothing needed (cache flush, TLB invalidations,
>> +				 * etc.) because we're only removing the uffd-wp bit,
>> +				 * which is completely invisible to the user.
>> +				 */
>> +				set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
>> +					   pte_clear_uffd_wp(*vmf->pte));
>> +				pte_unmap_unlock(vmf->pte, vmf->ptl);
>> +				return 0;
> 
> Please give it a shot with above to see whether we can avoid the "return 0"
> here.
> 
>> +			}
>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  			return handle_userfault(vmf, VM_UFFD_WP);
>>  		}
>> @@ -4812,8 +4823,13 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault
>> *vmf)
>>
>>  	if (vma_is_anonymous(vmf->vma)) {
>>  		if (likely(!unshare) &&
>> -		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
>> +		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
>> +			if (userfaultfd_wp_async(vmf->vma)) {
>> +				__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>> +				return 0;
> 
> Same here, I hope it'll work for you if you just goto __split_huge_pmd()
> right below and return with VM_FAULT_FALLBACK.  It avoids one more round of
> fault just like the pte case above.
> 
It works as well.

>> +			}
>>  			return handle_userfault(vmf, VM_UFFD_WP);
>> +		}
>>  		return do_huge_pmd_wp_page(vmf);
>>  	}
>
diff mbox series

Patch

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 15a5bf765d43..b82af02092ce 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1867,6 +1867,10 @@  static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
 	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
 
+	/* Write protection cannot be disabled in case of aync WP */
+	if (!mode_wp && (ctx->features & UFFD_FEATURE_WP_ASYNC))
+		return -EINVAL;
+
 	if (mode_wp && mode_dontwake)
 		return -EINVAL;
 
@@ -1950,6 +1954,13 @@  static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	return ret;
 }
 
+int userfaultfd_wp_async(struct vm_area_struct *vma)
+{
+	struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
+
+	return (ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC));
+}
+
 static inline unsigned int uffd_ctx_features(__u64 user_features)
 {
 	/*
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 9df0b9a762cc..5db51fccae1d 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -179,6 +179,7 @@  extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
 				  unsigned long end, struct list_head *uf);
 extern void userfaultfd_unmap_complete(struct mm_struct *mm,
 				       struct list_head *uf);
+extern int userfaultfd_wp_async(struct vm_area_struct *vma);
 
 #else /* CONFIG_USERFAULTFD */
 
@@ -274,6 +275,11 @@  static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
 	return false;
 }
 
+int userfaultfd_wp_async(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 005e5e306266..f4252ef40071 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -38,7 +38,8 @@ 
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
-			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
+			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
+			   UFFD_FEATURE_WP_ASYNC)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -203,6 +204,10 @@  struct uffdio_api {
 	 *
 	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
 	 * write-protection mode is supported on both shmem and hugetlbfs.
+	 *
+	 * UFFD_FEATURE_WP_ASYNC indicates that userfaultfd write-protection
+	 * asynchronous mode is supported in which the write fault is automatically
+	 * resolved and write-protection is un-set.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -217,6 +222,7 @@  struct uffdio_api {
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
+#define UFFD_FEATURE_WP_ASYNC			(1<<13)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/mm/memory.c b/mm/memory.c
index 4000e9f017e0..8c03b133d483 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3351,6 +3351,18 @@  static vm_fault_t do_wp_page(struct vm_fault *vmf)
 
 	if (likely(!unshare)) {
 		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+			if (userfaultfd_wp_async(vma)) {
+				/*
+				 * Nothing needed (cache flush, TLB invalidations,
+				 * etc.) because we're only removing the uffd-wp bit,
+				 * which is completely invisible to the user. This
+				 * falls through to possible CoW.
+				 */
+				pte_unmap_unlock(vmf->pte, vmf->ptl);
+				set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
+					   pte_clear_uffd_wp(*vmf->pte));
+				return 0;
+			}
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return handle_userfault(vmf, VM_UFFD_WP);
 		}
@@ -4812,8 +4824,21 @@  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 
 	if (vma_is_anonymous(vmf->vma)) {
 		if (likely(!unshare) &&
-		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
-			return handle_userfault(vmf, VM_UFFD_WP);
+		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
+			if (userfaultfd_wp_async(vmf->vma)) {
+				/*
+				 * Nothing needed (cache flush, TLB invalidations,
+				 * etc.) because we're only removing the uffd-wp bit,
+				 * which is completely invisible to the user. This
+				 * falls through to possible CoW.
+				 */
+				set_pmd_at(vmf->vma->vm_mm, vmf->address, vmf->pmd,
+					   pmd_clear_uffd_wp(*vmf->pmd));
+				return 0;
+			} else {
+				return handle_userfault(vmf, VM_UFFD_WP);
+			}
+		}
 		return do_huge_pmd_wp_page(vmf);
 	}