diff mbox series

[v5] mm/gup: disallow GUP writing to file-backed mappings by default

Message ID 6b73e692c2929dc4613af711bdf92e2ec1956a66.1682638385.git.lstoakes@gmail.com (mailing list archive)
State Mainlined, archived
Headers show
Series [v5] mm/gup: disallow GUP writing to file-backed mappings by default | expand

Commit Message

Lorenzo Stoakes April 27, 2023, 11:42 p.m. UTC
Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings do not adhere to the semantics expected by a file system.

A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.

The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.

As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done so unexpectedly.

For example, consider the following scenario:-

1. A folio is written to via GUP which write-faults the memory, notifying
   the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
   the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
   direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
   (though it does not have to).

This results in both data being written to a folio without writenotify, and
the folio being dirtied unexpectedly (if the caller decides to do so).

This issue was first reported by Jan Kara [1] in 2018, where the problem
resulted in file system crashes.

This is only relevant when the mappings are file-backed and the underlying
file system requires folio dirty tracking. File systems which do not, such
as shmem or hugetlb, are not at risk and therefore can be written to
without issue.

Unfortunately this limitation of GUP has been present for some time and
requires future rework of the GUP API in order to provide correct write
access to such mappings.

However, for the time being we introduce this check to prevent the most
egregious case of this occurring, use of the FOLL_LONGTERM pin.

These mappings are considerably more likely to be written to after
folios are cleaned and thus simply must not be permitted to do so.

As part of this change we separate out vma_needs_dirty_tracking() as a
helper function to determine this which is distinct from
vma_wants_writenotify() which is specific to determining which PTE flags to
set.

[1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 include/linux/mm.h |  1 +
 mm/gup.c           | 41 ++++++++++++++++++++++++++++++++++++++++-
 mm/mmap.c          | 36 +++++++++++++++++++++++++++---------
 3 files changed, 68 insertions(+), 10 deletions(-)

--
2.40.0

Comments

John Hubbard April 28, 2023, 2:06 a.m. UTC | #1
On 4/27/23 16:42, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
> 
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.
> 
> The problem arises when, after an initial write to the folio, writeback
> results in the folio being cleaned and then the caller, via the GUP
> interface, writes to the folio again.
> 
> As a result of the use of this secondary, direct, mapping to the folio no
> write notify will occur, and if the caller does mark the folio dirty, this
> will be done so unexpectedly.
> 
> For example, consider the following scenario:-
> 
> 1. A folio is written to via GUP which write-faults the memory, notifying
>     the file system and dirtying the folio.
> 2. Later, writeback is triggered, resulting in the folio being cleaned and
>     the PTE being marked read-only.
> 3. The GUP caller writes to the folio, as it is mapped read/write via the
>     direct mapping.
> 4. The GUP caller, now done with the page, unpins it and sets it dirty
>     (though it does not have to).
> 
> This results in both data being written to a folio without writenotify, and
> the folio being dirtied unexpectedly (if the caller decides to do so).
> 
> This issue was first reported by Jan Kara [1] in 2018, where the problem
> resulted in file system crashes.
> 
> This is only relevant when the mappings are file-backed and the underlying
> file system requires folio dirty tracking. File systems which do not, such
> as shmem or hugetlb, are not at risk and therefore can be written to
> without issue.
> 
> Unfortunately this limitation of GUP has been present for some time and
> requires future rework of the GUP API in order to provide correct write
> access to such mappings.
> 
> However, for the time being we introduce this check to prevent the most
> egregious case of this occurring, use of the FOLL_LONGTERM pin.
> 
> These mappings are considerably more likely to be written to after
> folios are cleaned and thus simply must not be permitted to do so.
> 
> As part of this change we separate out vma_needs_dirty_tracking() as a
> helper function to determine this which is distinct from
> vma_wants_writenotify() which is specific to determining which PTE flags to
> set.
> 
> [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> ---
>   include/linux/mm.h |  1 +
>   mm/gup.c           | 41 ++++++++++++++++++++++++++++++++++++++++-
>   mm/mmap.c          | 36 +++++++++++++++++++++++++++---------
>   3 files changed, 68 insertions(+), 10 deletions(-)
> 

Looks good.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
Mika Penttilä April 28, 2023, 4:21 a.m. UTC | #2
On 28.4.2023 2.42, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
> 
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.
> 
> The problem arises when, after an initial write to the folio, writeback
> results in the folio being cleaned and then the caller, via the GUP
> interface, writes to the folio again.
> 
> As a result of the use of this secondary, direct, mapping to the folio no
> write notify will occur, and if the caller does mark the folio dirty, this
> will be done so unexpectedly.
> 
> For example, consider the following scenario:-
> 
> 1. A folio is written to via GUP which write-faults the memory, notifying
>     the file system and dirtying the folio.
> 2. Later, writeback is triggered, resulting in the folio being cleaned and
>     the PTE being marked read-only.
> 3. The GUP caller writes to the folio, as it is mapped read/write via the
>     direct mapping.
> 4. The GUP caller, now done with the page, unpins it and sets it dirty
>     (though it does not have to).
> 
> This results in both data being written to a folio without writenotify, and
> the folio being dirtied unexpectedly (if the caller decides to do so).
> 
> This issue was first reported by Jan Kara [1] in 2018, where the problem
> resulted in file system crashes.
> 
> This is only relevant when the mappings are file-backed and the underlying
> file system requires folio dirty tracking. File systems which do not, such
> as shmem or hugetlb, are not at risk and therefore can be written to
> without issue.
> 
> Unfortunately this limitation of GUP has been present for some time and
> requires future rework of the GUP API in order to provide correct write
> access to such mappings.
> 
> However, for the time being we introduce this check to prevent the most
> egregious case of this occurring, use of the FOLL_LONGTERM pin.
> 
> These mappings are considerably more likely to be written to after
> folios are cleaned and thus simply must not be permitted to do so.
> 
> As part of this change we separate out vma_needs_dirty_tracking() as a
> helper function to determine this which is distinct from
> vma_wants_writenotify() which is specific to determining which PTE flags to
> set.
> 
> [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> ---
>   include/linux/mm.h |  1 +
>   mm/gup.c           | 41 ++++++++++++++++++++++++++++++++++++++++-
>   mm/mmap.c          | 36 +++++++++++++++++++++++++++---------
>   3 files changed, 68 insertions(+), 10 deletions(-)
> 
Thanks Lorenzo for the added patch descriptions!

Reviewed-by: Mika Penttilä <mpenttil@redhat.com>



> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 37554b08bb28..f7da02fc89c6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>   #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>   					    MM_CP_UFFD_WP_RESOLVE)
> 
> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
>   int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
>   static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
>   {
> diff --git a/mm/gup.c b/mm/gup.c
> index 1f72a717232b..d36a5db9feb1 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma,
>   	return 0;
>   }
> 
> +/*
> + * Writing to file-backed mappings which require folio dirty tracking using GUP
> + * is a fundamentally broken operation, as kernel write access to GUP mappings
> + * do not adhere to the semantics expected by a file system.
> + *
> + * Consider the following scenario:-
> + *
> + * 1. A folio is written to via GUP which write-faults the memory, notifying
> + *    the file system and dirtying the folio.
> + * 2. Later, writeback is triggered, resulting in the folio being cleaned and
> + *    the PTE being marked read-only.
> + * 3. The GUP caller writes to the folio, as it is mapped read/write via the
> + *    direct mapping.
> + * 4. The GUP caller, now done with the page, unpins it and sets it dirty
> + *    (though it does not have to).
> + *
> + * This results in both data being written to a folio without writenotify, and
> + * the folio being dirtied unexpectedly (if the caller decides to do so).
> + */
> +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
> +					   unsigned long gup_flags)
> +{
> +	/* If we aren't pinning then no problematic write can occur. */
> +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> +		return true;
> +
> +	/* We limit this check to the most egregious case - a long term pin. */
> +	if (!(gup_flags & FOLL_LONGTERM))
> +		return true;
> +
> +	/* If the VMA requires dirty tracking then GUP will be problematic. */
> +	return vma_needs_dirty_tracking(vma);
> +}
> +
>   static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>   {
>   	vm_flags_t vm_flags = vma->vm_flags;
>   	int write = (gup_flags & FOLL_WRITE);
>   	int foreign = (gup_flags & FOLL_REMOTE);
> +	bool vma_anon = vma_is_anonymous(vma);
> 
>   	if (vm_flags & (VM_IO | VM_PFNMAP))
>   		return -EFAULT;
> 
> -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
> +	if ((gup_flags & FOLL_ANON) && !vma_anon)
>   		return -EFAULT;
> 
>   	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>   		return -EFAULT;
> 
>   	if (write) {
> +		if (!vma_anon &&
> +		    !writeable_file_mapping_allowed(vma, gup_flags))
> +			return -EFAULT;
> +
>   		if (!(vm_flags & VM_WRITE)) {
>   			if (!(gup_flags & FOLL_FORCE))
>   				return -EFAULT;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 536bbb8fa0ae..7b6344d1832a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1475,6 +1475,31 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
>   }
>   #endif /* __ARCH_WANT_SYS_OLD_MMAP */
> 
> +/* Do VMA operations imply write notify is required? */
> +static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops)
> +{
> +	return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite);
> +}
> +
> +/*
> + * Does this VMA require the underlying folios to have their dirty state
> + * tracked?
> + */
> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
> +{
> +	/* Does the filesystem need to be notified? */
> +	if (vm_ops_needs_writenotify(vma->vm_ops))
> +		return true;
> +
> +	/* Specialty mapping? */
> +	if (vma->vm_flags & VM_PFNMAP)
> +		return false;
> +
> +	/* Can the mapping track the dirty pages? */
> +	return vma->vm_file && vma->vm_file->f_mapping &&
> +		mapping_can_writeback(vma->vm_file->f_mapping);
> +}
> +
>   /*
>    * Some shared mappings will want the pages marked read-only
>    * to track write events. If so, we'll downgrade vm_page_prot
> @@ -1484,14 +1509,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
>   int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
>   {
>   	vm_flags_t vm_flags = vma->vm_flags;
> -	const struct vm_operations_struct *vm_ops = vma->vm_ops;
> 
>   	/* If it was private or non-writable, the write bit is already clear */
>   	if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
>   		return 0;
> 
>   	/* The backer wishes to know when pages are first written to? */
> -	if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
> +	if (vm_ops_needs_writenotify(vma->vm_ops))
>   		return 1;
> 
>   	/* The open routine did something to the protections that pgprot_modify
> @@ -1511,13 +1535,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
>   	if (userfaultfd_wp(vma))
>   		return 1;
> 
> -	/* Specialty mapping? */
> -	if (vm_flags & VM_PFNMAP)
> -		return 0;
> -
> -	/* Can the mapping track the dirty pages? */
> -	return vma->vm_file && vma->vm_file->f_mapping &&
> -		mapping_can_writeback(vma->vm_file->f_mapping);
> +	return vma_needs_dirty_tracking(vma);
>   }
> 
>   /*
> --
> 2.40.0
>
Jan Kara April 28, 2023, 11:51 a.m. UTC | #3
On Fri 28-04-23 00:42:32, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
> 
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.
> 
> The problem arises when, after an initial write to the folio, writeback
> results in the folio being cleaned and then the caller, via the GUP
> interface, writes to the folio again.
> 
> As a result of the use of this secondary, direct, mapping to the folio no
> write notify will occur, and if the caller does mark the folio dirty, this
> will be done so unexpectedly.
> 
> For example, consider the following scenario:-
> 
> 1. A folio is written to via GUP which write-faults the memory, notifying
>    the file system and dirtying the folio.
> 2. Later, writeback is triggered, resulting in the folio being cleaned and
>    the PTE being marked read-only.
> 3. The GUP caller writes to the folio, as it is mapped read/write via the
>    direct mapping.
> 4. The GUP caller, now done with the page, unpins it and sets it dirty
>    (though it does not have to).
> 
> This results in both data being written to a folio without writenotify, and
> the folio being dirtied unexpectedly (if the caller decides to do so).
> 
> This issue was first reported by Jan Kara [1] in 2018, where the problem
> resulted in file system crashes.
> 
> This is only relevant when the mappings are file-backed and the underlying
> file system requires folio dirty tracking. File systems which do not, such
> as shmem or hugetlb, are not at risk and therefore can be written to
> without issue.
> 
> Unfortunately this limitation of GUP has been present for some time and
> requires future rework of the GUP API in order to provide correct write
> access to such mappings.
> 
> However, for the time being we introduce this check to prevent the most
> egregious case of this occurring, use of the FOLL_LONGTERM pin.
> 
> These mappings are considerably more likely to be written to after
> folios are cleaned and thus simply must not be permitted to do so.
> 
> As part of this change we separate out vma_needs_dirty_tracking() as a
> helper function to determine this which is distinct from
> vma_wants_writenotify() which is specific to determining which PTE flags to
> set.
> 
> [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>

I'm for trying this out and let's see who complains ;) The patch looks good
to me from the implementation point of view. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/mm.h |  1 +
>  mm/gup.c           | 41 ++++++++++++++++++++++++++++++++++++++++-
>  mm/mmap.c          | 36 +++++++++++++++++++++++++++---------
>  3 files changed, 68 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 37554b08bb28..f7da02fc89c6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>  					    MM_CP_UFFD_WP_RESOLVE)
> 
> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
>  int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
>  static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
>  {
> diff --git a/mm/gup.c b/mm/gup.c
> index 1f72a717232b..d36a5db9feb1 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma,
>  	return 0;
>  }
> 
> +/*
> + * Writing to file-backed mappings which require folio dirty tracking using GUP
> + * is a fundamentally broken operation, as kernel write access to GUP mappings
> + * do not adhere to the semantics expected by a file system.
> + *
> + * Consider the following scenario:-
> + *
> + * 1. A folio is written to via GUP which write-faults the memory, notifying
> + *    the file system and dirtying the folio.
> + * 2. Later, writeback is triggered, resulting in the folio being cleaned and
> + *    the PTE being marked read-only.
> + * 3. The GUP caller writes to the folio, as it is mapped read/write via the
> + *    direct mapping.
> + * 4. The GUP caller, now done with the page, unpins it and sets it dirty
> + *    (though it does not have to).
> + *
> + * This results in both data being written to a folio without writenotify, and
> + * the folio being dirtied unexpectedly (if the caller decides to do so).
> + */
> +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
> +					   unsigned long gup_flags)
> +{
> +	/* If we aren't pinning then no problematic write can occur. */
> +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> +		return true;
> +
> +	/* We limit this check to the most egregious case - a long term pin. */
> +	if (!(gup_flags & FOLL_LONGTERM))
> +		return true;
> +
> +	/* If the VMA requires dirty tracking then GUP will be problematic. */
> +	return vma_needs_dirty_tracking(vma);
> +}
> +
>  static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>  {
>  	vm_flags_t vm_flags = vma->vm_flags;
>  	int write = (gup_flags & FOLL_WRITE);
>  	int foreign = (gup_flags & FOLL_REMOTE);
> +	bool vma_anon = vma_is_anonymous(vma);
> 
>  	if (vm_flags & (VM_IO | VM_PFNMAP))
>  		return -EFAULT;
> 
> -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
> +	if ((gup_flags & FOLL_ANON) && !vma_anon)
>  		return -EFAULT;
> 
>  	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>  		return -EFAULT;
> 
>  	if (write) {
> +		if (!vma_anon &&
> +		    !writeable_file_mapping_allowed(vma, gup_flags))
> +			return -EFAULT;
> +
>  		if (!(vm_flags & VM_WRITE)) {
>  			if (!(gup_flags & FOLL_FORCE))
>  				return -EFAULT;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 536bbb8fa0ae..7b6344d1832a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1475,6 +1475,31 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
>  }
>  #endif /* __ARCH_WANT_SYS_OLD_MMAP */
> 
> +/* Do VMA operations imply write notify is required? */
> +static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops)
> +{
> +	return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite);
> +}
> +
> +/*
> + * Does this VMA require the underlying folios to have their dirty state
> + * tracked?
> + */
> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
> +{
> +	/* Does the filesystem need to be notified? */
> +	if (vm_ops_needs_writenotify(vma->vm_ops))
> +		return true;
> +
> +	/* Specialty mapping? */
> +	if (vma->vm_flags & VM_PFNMAP)
> +		return false;
> +
> +	/* Can the mapping track the dirty pages? */
> +	return vma->vm_file && vma->vm_file->f_mapping &&
> +		mapping_can_writeback(vma->vm_file->f_mapping);
> +}
> +
>  /*
>   * Some shared mappings will want the pages marked read-only
>   * to track write events. If so, we'll downgrade vm_page_prot
> @@ -1484,14 +1509,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
>  int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
>  {
>  	vm_flags_t vm_flags = vma->vm_flags;
> -	const struct vm_operations_struct *vm_ops = vma->vm_ops;
> 
>  	/* If it was private or non-writable, the write bit is already clear */
>  	if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
>  		return 0;
> 
>  	/* The backer wishes to know when pages are first written to? */
> -	if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
> +	if (vm_ops_needs_writenotify(vma->vm_ops))
>  		return 1;
> 
>  	/* The open routine did something to the protections that pgprot_modify
> @@ -1511,13 +1535,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
>  	if (userfaultfd_wp(vma))
>  		return 1;
> 
> -	/* Specialty mapping? */
> -	if (vm_flags & VM_PFNMAP)
> -		return 0;
> -
> -	/* Can the mapping track the dirty pages? */
> -	return vma->vm_file && vma->vm_file->f_mapping &&
> -		mapping_can_writeback(vma->vm_file->f_mapping);
> +	return vma_needs_dirty_tracking(vma);
>  }
> 
>  /*
> --
> 2.40.0
Lorenzo Stoakes April 28, 2023, 11:59 a.m. UTC | #4
On Fri, Apr 28, 2023 at 12:42:32AM +0100, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
>
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.
>
> The problem arises when, after an initial write to the folio, writeback
> results in the folio being cleaned and then the caller, via the GUP
> interface, writes to the folio again.
>
> As a result of the use of this secondary, direct, mapping to the folio no
> write notify will occur, and if the caller does mark the folio dirty, this
> will be done so unexpectedly.
>
> For example, consider the following scenario:-
>
> 1. A folio is written to via GUP which write-faults the memory, notifying
>    the file system and dirtying the folio.
> 2. Later, writeback is triggered, resulting in the folio being cleaned and
>    the PTE being marked read-only.
> 3. The GUP caller writes to the folio, as it is mapped read/write via the
>    direct mapping.
> 4. The GUP caller, now done with the page, unpins it and sets it dirty
>    (though it does not have to).
>
> This results in both data being written to a folio without writenotify, and
> the folio being dirtied unexpectedly (if the caller decides to do so).
>
> This issue was first reported by Jan Kara [1] in 2018, where the problem
> resulted in file system crashes.
>
> This is only relevant when the mappings are file-backed and the underlying
> file system requires folio dirty tracking. File systems which do not, such
> as shmem or hugetlb, are not at risk and therefore can be written to
> without issue.
>
> Unfortunately this limitation of GUP has been present for some time and
> requires future rework of the GUP API in order to provide correct write
> access to such mappings.
>
> However, for the time being we introduce this check to prevent the most
> egregious case of this occurring, use of the FOLL_LONGTERM pin.
>
> These mappings are considerably more likely to be written to after
> folios are cleaned and thus simply must not be permitted to do so.
>
> As part of this change we separate out vma_needs_dirty_tracking() as a
> helper function to determine this which is distinct from
> vma_wants_writenotify() which is specific to determining which PTE flags to
> set.
>
> [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> ---
>  include/linux/mm.h |  1 +
>  mm/gup.c           | 41 ++++++++++++++++++++++++++++++++++++++++-
>  mm/mmap.c          | 36 +++++++++++++++++++++++++++---------
>  3 files changed, 68 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 37554b08bb28..f7da02fc89c6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>  					    MM_CP_UFFD_WP_RESOLVE)
>
> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
>  int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
>  static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
>  {
> diff --git a/mm/gup.c b/mm/gup.c
> index 1f72a717232b..d36a5db9feb1 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma,
>  	return 0;
>  }
>
> +/*
> + * Writing to file-backed mappings which require folio dirty tracking using GUP
> + * is a fundamentally broken operation, as kernel write access to GUP mappings
> + * do not adhere to the semantics expected by a file system.
> + *
> + * Consider the following scenario:-
> + *
> + * 1. A folio is written to via GUP which write-faults the memory, notifying
> + *    the file system and dirtying the folio.
> + * 2. Later, writeback is triggered, resulting in the folio being cleaned and
> + *    the PTE being marked read-only.
> + * 3. The GUP caller writes to the folio, as it is mapped read/write via the
> + *    direct mapping.
> + * 4. The GUP caller, now done with the page, unpins it and sets it dirty
> + *    (though it does not have to).
> + *
> + * This results in both data being written to a folio without writenotify, and
> + * the folio being dirtied unexpectedly (if the caller decides to do so).
> + */
> +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
> +					   unsigned long gup_flags)
> +{
> +	/* If we aren't pinning then no problematic write can occur. */
> +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> +		return true;
> +
> +	/* We limit this check to the most egregious case - a long term pin. */
> +	if (!(gup_flags & FOLL_LONGTERM))
> +		return true;
> +
> +	/* If the VMA requires dirty tracking then GUP will be problematic. */
> +	return vma_needs_dirty_tracking(vma);
> +}
> +
>  static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>  {
>  	vm_flags_t vm_flags = vma->vm_flags;
>  	int write = (gup_flags & FOLL_WRITE);
>  	int foreign = (gup_flags & FOLL_REMOTE);
> +	bool vma_anon = vma_is_anonymous(vma);
>
>  	if (vm_flags & (VM_IO | VM_PFNMAP))
>  		return -EFAULT;
>
> -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
> +	if ((gup_flags & FOLL_ANON) && !vma_anon)
>  		return -EFAULT;
>
>  	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>  		return -EFAULT;
>
>  	if (write) {
> +		if (!vma_anon &&
> +		    !writeable_file_mapping_allowed(vma, gup_flags))
> +			return -EFAULT;
> +
>  		if (!(vm_flags & VM_WRITE)) {
>  			if (!(gup_flags & FOLL_FORCE))
>  				return -EFAULT;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 536bbb8fa0ae..7b6344d1832a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1475,6 +1475,31 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
>  }
>  #endif /* __ARCH_WANT_SYS_OLD_MMAP */
>
> +/* Do VMA operations imply write notify is required? */
> +static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops)
> +{
> +	return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite);
> +}
> +
> +/*
> + * Does this VMA require the underlying folios to have their dirty state
> + * tracked?
> + */
> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
> +{
> +	/* Does the filesystem need to be notified? */
> +	if (vm_ops_needs_writenotify(vma->vm_ops))
> +		return true;
> +
> +	/* Specialty mapping? */
> +	if (vma->vm_flags & VM_PFNMAP)
> +		return false;
> +
> +	/* Can the mapping track the dirty pages? */
> +	return vma->vm_file && vma->vm_file->f_mapping &&
> +		mapping_can_writeback(vma->vm_file->f_mapping);
> +}
> +
>  /*
>   * Some shared mappings will want the pages marked read-only
>   * to track write events. If so, we'll downgrade vm_page_prot
> @@ -1484,14 +1509,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
>  int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
>  {
>  	vm_flags_t vm_flags = vma->vm_flags;
> -	const struct vm_operations_struct *vm_ops = vma->vm_ops;
>
>  	/* If it was private or non-writable, the write bit is already clear */
>  	if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
>  		return 0;
>
>  	/* The backer wishes to know when pages are first written to? */
> -	if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
> +	if (vm_ops_needs_writenotify(vma->vm_ops))
>  		return 1;
>
>  	/* The open routine did something to the protections that pgprot_modify
> @@ -1511,13 +1535,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
>  	if (userfaultfd_wp(vma))
>  		return 1;
>
> -	/* Specialty mapping? */
> -	if (vm_flags & VM_PFNMAP)
> -		return 0;
> -
> -	/* Can the mapping track the dirty pages? */
> -	return vma->vm_file && vma->vm_file->f_mapping &&
> -		mapping_can_writeback(vma->vm_file->f_mapping);
> +	return vma_needs_dirty_tracking(vma);
>  }
>
>  /*
> --
> 2.40.0

Apologies, for some reason I forgot to include the revision list in that
patch, enclosed below:-

v5:
- Rebased on latest mm-unstable as of 25th April 2023.
- Some small refactorings suggested by John.
- Added an extended description of the problem in the comment around
  writeable_file_mapping_allowed() for clarity.
- Updated commit message as suggested by Mika and John.

v4:
- Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to
  reduce duplication and update to use this in the GUP check. Note that
  both separately check vm_ops_needs_writenotify() as the latter needs to
  test this before the vm_pgprot_modify() test, resulting in
  vma_wants_writenotify() checking this twice, however it is such a small
  check this should not be egregious.
https://lore.kernel.org/all/3b92d56f55671a0389252379237703df6e86ea48.1682464032.git.lstoakes@gmail.com/

v3:
- Rebased on latest mm-unstable as of 24th April 2023.
- Explicitly check whether file system requires folio dirtying. Note that
  vma_wants_writenotify() could not be used directly as it is very much focused
  on determining if the PTE r/w should be set (e.g. assuming private mapping
  does not require it as already set, soft dirty considerations).
- Tested code against shmem and hugetlb mappings - confirmed that these are not
  disallowed by the check.
- Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only
  for FOLL_LONGTERM pins.
- As a result, limit check to internal GUP code.
 https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/

v2:
- Add accidentally excluded ptrace_access_vm() use of
  FOLL_ALLOW_BROKEN_FILE_MAPPING.
- Tweak commit message.
https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/

v1:
https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/
Jason Gunthorpe April 28, 2023, 1:17 p.m. UTC | #5
On Fri, Apr 28, 2023 at 12:42:32AM +0100, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
> 
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.
> 
> The problem arises when, after an initial write to the folio, writeback
> results in the folio being cleaned and then the caller, via the GUP
> interface, writes to the folio again.
> 
> As a result of the use of this secondary, direct, mapping to the folio no
> write notify will occur, and if the caller does mark the folio dirty, this
> will be done so unexpectedly.
> 
> For example, consider the following scenario:-
> 
> 1. A folio is written to via GUP which write-faults the memory, notifying
>    the file system and dirtying the folio.
> 2. Later, writeback is triggered, resulting in the folio being cleaned and
>    the PTE being marked read-only.
> 3. The GUP caller writes to the folio, as it is mapped read/write via the
>    direct mapping.
> 4. The GUP caller, now done with the page, unpins it and sets it dirty
>    (though it does not have to).
> 
> This results in both data being written to a folio without writenotify, and
> the folio being dirtied unexpectedly (if the caller decides to do so).
> 
> This issue was first reported by Jan Kara [1] in 2018, where the problem
> resulted in file system crashes.
> 
> This is only relevant when the mappings are file-backed and the underlying
> file system requires folio dirty tracking. File systems which do not, such
> as shmem or hugetlb, are not at risk and therefore can be written to
> without issue.
> 
> Unfortunately this limitation of GUP has been present for some time and
> requires future rework of the GUP API in order to provide correct write
> access to such mappings.
> 
> However, for the time being we introduce this check to prevent the most
> egregious case of this occurring, use of the FOLL_LONGTERM pin.
> 
> These mappings are considerably more likely to be written to after
> folios are cleaned and thus simply must not be permitted to do so.
> 
> As part of this change we separate out vma_needs_dirty_tracking() as a
> helper function to determine this which is distinct from
> vma_wants_writenotify() which is specific to determining which PTE flags to
> set.
> 
> [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> ---
>  include/linux/mm.h |  1 +
>  mm/gup.c           | 41 ++++++++++++++++++++++++++++++++++++++++-
>  mm/mmap.c          | 36 +++++++++++++++++++++++++++---------
>  3 files changed, 68 insertions(+), 10 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
David Hildenbrand April 28, 2023, 2:20 p.m. UTC | #6
Sorry for jumping in late, I'm on vacation :)

On 28.04.23 01:42, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
> 
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.

How should we enforce it? It would be a BUG in the GUP user.

> 
> The problem arises when, after an initial write to the folio, writeback
> results in the folio being cleaned and then the caller, via the GUP
> interface, writes to the folio again.
> 
> As a result of the use of this secondary, direct, mapping to the folio no
> write notify will occur, and if the caller does mark the folio dirty, this
> will be done so unexpectedly.

Right, in mprotect() code we only allow upgrading write permissions in 
this case if the pte is dirty, so we always go via the pagefault path.

> 
> For example, consider the following scenario:-
> 
> 1. A folio is written to via GUP which write-faults the memory, notifying
>     the file system and dirtying the folio.
> 2. Later, writeback is triggered, resulting in the folio being cleaned and
>     the PTE being marked read-only.


How would that be triggered? Would that writeback triggered by e.g., 
fsync that Jan tried to tackle recently?


> 3. The GUP caller writes to the folio, as it is mapped read/write via the
>     direct mapping.
> 4. The GUP caller, now done with the page, unpins it and sets it dirty
>     (though it does not have to).
> 
> This results in both data being written to a folio without writenotify, and
> the folio being dirtied unexpectedly (if the caller decides to do so).
> 
> This issue was first reported by Jan Kara [1] in 2018, where the problem
> resulted in file system crashes.
> 
> This is only relevant when the mappings are file-backed and the underlying
> file system requires folio dirty tracking. File systems which do not, such
> as shmem or hugetlb, are not at risk and therefore can be written to
> without issue.
> 
> Unfortunately this limitation of GUP has been present for some time and
> requires future rework of the GUP API in order to provide correct write
> access to such mappings.
> 
> However, for the time being we introduce this check to prevent the most
> egregious case of this occurring, use of the FOLL_LONGTERM pin.
> 
> These mappings are considerably more likely to be written to after
> folios are cleaned and thus simply must not be permitted to do so.
> 
> As part of this change we separate out vma_needs_dirty_tracking() as a
> helper function to determine this which is distinct from
> vma_wants_writenotify() which is specific to determining which PTE flags to
> set.
> 
> [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/
> 


This change has the potential to break existing setups. Simple example: 
libvirt domains configured for file-backed VM memory that also has a 
vfio device configured. It can easily be configured by users (evolving 
VM configuration, copy-paste etc.). And it works from a VM perspective, 
because the guest memory is essentially stale once the VM is shutdown 
and the pages were unpinned. At least we're not concerned about stale 
data on disk.

With your changes, such VMs would no longer start, breaking existing 
user setups with a kernel update.

I don't really see a lot of reasons to perform this change now. It's 
been known to be problematic for a long time. People are working on a 
fix (I see Jan is already CCed, CCing Dave and Christop). FOLL_LONGTERM 
check is only handling some of the problematic cases, so it's not even a 
complete blocker.

I know, Jason und John will disagree, but I don't think we want to be 
very careful with changing the default.

Sure, we could warn, or convert individual users using a flag 
(io_uring). But maybe we should invest more energy on a fix?




> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> ---
>   include/linux/mm.h |  1 +
>   mm/gup.c           | 41 ++++++++++++++++++++++++++++++++++++++++-
>   mm/mmap.c          | 36 +++++++++++++++++++++++++++---------
>   3 files changed, 68 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 37554b08bb28..f7da02fc89c6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>   #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>   					    MM_CP_UFFD_WP_RESOLVE)
> 
> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
>   int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
>   static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
>   {
> diff --git a/mm/gup.c b/mm/gup.c
> index 1f72a717232b..d36a5db9feb1 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma,
>   	return 0;
>   }
> 
> +/*
> + * Writing to file-backed mappings which require folio dirty tracking using GUP
> + * is a fundamentally broken operation, as kernel write access to GUP mappings
> + * do not adhere to the semantics expected by a file system.
> + *
> + * Consider the following scenario:-
> + *
> + * 1. A folio is written to via GUP which write-faults the memory, notifying
> + *    the file system and dirtying the folio.
> + * 2. Later, writeback is triggered, resulting in the folio being cleaned and
> + *    the PTE being marked read-only.
> + * 3. The GUP caller writes to the folio, as it is mapped read/write via the
> + *    direct mapping.
> + * 4. The GUP caller, now done with the page, unpins it and sets it dirty
> + *    (though it does not have to).
> + *
> + * This results in both data being written to a folio without writenotify, and
> + * the folio being dirtied unexpectedly (if the caller decides to do so).
> + */
> +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
> +					   unsigned long gup_flags)
> +{
> +	/* If we aren't pinning then no problematic write can occur. */
> +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> +		return true;

FOLL_LONGTERM only applies to FOLL_PIN. This check can be dropped.

> +
> +	/* We limit this check to the most egregious case - a long term pin. */
> +	if (!(gup_flags & FOLL_LONGTERM))
> +		return true;
> +
> +	/* If the VMA requires dirty tracking then GUP will be problematic. */
> +	return vma_needs_dirty_tracking(vma);
> +}
> +
>   static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>   {
>   	vm_flags_t vm_flags = vma->vm_flags;
>   	int write = (gup_flags & FOLL_WRITE);
>   	int foreign = (gup_flags & FOLL_REMOTE);
> +	bool vma_anon = vma_is_anonymous(vma);
> 
>   	if (vm_flags & (VM_IO | VM_PFNMAP))
>   		return -EFAULT;
> 
> -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
> +	if ((gup_flags & FOLL_ANON) && !vma_anon)
>   		return -EFAULT;
> 
>   	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>   		return -EFAULT;
> 
>   	if (write) {
> +		if (!vma_anon &&
> +		    !writeable_file_mapping_allowed(vma, gup_flags))
> +			return -EFAULT;
> +
>   		if (!(vm_flags & VM_WRITE)) {
>   			if (!(gup_flags & FOLL_FORCE))
>   				return -EFAULT;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 536bbb8fa0ae..7b6344d1832a 100644
> --- a/mm/mmap.c


I'm probably missing something, why don't we have to handle GUP-fast 
(having said that, it's hard to handle ;) )? The sequence you describe 
above should apply to GUP-fast as well, no?

1) Pin writable mapped page using GUP-fast
2) Trigger writeback
3) Write to page via pin
4) Unpin and set dirty
Jason Gunthorpe April 28, 2023, 2:35 p.m. UTC | #7
On Fri, Apr 28, 2023 at 04:20:46PM +0200, David Hildenbrand wrote:
> Sorry for jumping in late, I'm on vacation :)
> 
> On 28.04.23 01:42, Lorenzo Stoakes wrote:
> > Writing to file-backed mappings which require folio dirty tracking using
> > GUP is a fundamentally broken operation, as kernel write access to GUP
> > mappings do not adhere to the semantics expected by a file system.
> > 
> > A GUP caller uses the direct mapping to access the folio, which does not
> > cause write notify to trigger, nor does it enforce that the caller marks
> > the folio dirty.
> 
> How should we enforce it? It would be a BUG in the GUP user.

I hope we don't have these kinds of mistakes.. hard to enforce by
code.

> This change has the potential to break existing setups. Simple example:
> libvirt domains configured for file-backed VM memory that also has a vfio
> device configured. It can easily be configured by users (evolving VM
> configuration, copy-paste etc.). And it works from a VM perspective, because
> the guest memory is essentially stale once the VM is shutdown and the pages
> were unpinned. At least we're not concerned about stale data on
> disk.

I think this is broken today and we should block it. We know from
experiments with RDMA that doing exactly this triggers kernel oop's.

Run your qemu config once, all the pages in the file become dirty.

Run your qmeu config again and now all the dirty pages are longterm
pinned.

Something eventually does writeback and FS cleans the page.

Now close your VM and the page is dirtied without make write. FS is
inconsistent with the MM, kernel will eventually oops.

I'm skeptical that anyone can actually do this combination of things
successfully without getting kernel crashes or file data corruption -
ie there is no real user to break.

> With your changes, such VMs would no longer start, breaking existing user
> setups with a kernel update.

Yes, as a matter of security we should break it.

Earlier I suggested making this contingent on kernel lockdown >=
integrity, if actual users come and complain we should go to that
option.

> Sure, we could warn, or convert individual users using a flag (io_uring).
> But maybe we should invest more energy on a fix?

It has been years now, I think we need to admit a fix is still years
away. Blocking the security problem may even motivate more people to
work on a fix.

Security is the primary case where we have historically closed uAPI
items.

Jason
Lorenzo Stoakes April 28, 2023, 2:55 p.m. UTC | #8
On Fri, Apr 28, 2023 at 04:20:46PM +0200, David Hildenbrand wrote:
> Sorry for jumping in late, I'm on vacation :)
>
> On 28.04.23 01:42, Lorenzo Stoakes wrote:
> > Writing to file-backed mappings which require folio dirty tracking using
> > GUP is a fundamentally broken operation, as kernel write access to GUP
> > mappings do not adhere to the semantics expected by a file system.
> >
> > A GUP caller uses the direct mapping to access the folio, which does not
> > cause write notify to trigger, nor does it enforce that the caller marks
> > the folio dirty.
>
> How should we enforce it? It would be a BUG in the GUP user.

There was discussion about holding locks when passing back pages but it's a
sticky question no doubt, but a bit out of scope here. This is more
documenting that this is a thing.

>
> >
> > The problem arises when, after an initial write to the folio, writeback
> > results in the folio being cleaned and then the caller, via the GUP
> > interface, writes to the folio again.
> >
> > As a result of the use of this secondary, direct, mapping to the folio no
> > write notify will occur, and if the caller does mark the folio dirty, this
> > will be done so unexpectedly.
>
> Right, in mprotect() code we only allow upgrading write permissions in this
> case if the pte is dirty, so we always go via the pagefault path.
>

In my ideal world we'd somehow do remote accesses via a kthread_use_mm()
style approach and page fault in every time. That's again a bit out of
scope though...

> >
> > For example, consider the following scenario:-
> >
> > 1. A folio is written to via GUP which write-faults the memory, notifying
> >     the file system and dirtying the folio.
> > 2. Later, writeback is triggered, resulting in the folio being cleaned and
> >     the PTE being marked read-only.
>
>
> How would that be triggered? Would that writeback triggered by e.g., fsync
> that Jan tried to tackle recently?

Yes or just perioditic writeback threads. The folio is dirty and needs
writeback, so at some point that'll happen. Obviously fsync/msync could
cause it too (I may be missing something here :)

>
>
> > 3. The GUP caller writes to the folio, as it is mapped read/write via the
> >     direct mapping.
> > 4. The GUP caller, now done with the page, unpins it and sets it dirty
> >     (though it does not have to).
> >
> > This results in both data being written to a folio without writenotify, and
> > the folio being dirtied unexpectedly (if the caller decides to do so).
> >
> > This issue was first reported by Jan Kara [1] in 2018, where the problem
> > resulted in file system crashes.
> >
> > This is only relevant when the mappings are file-backed and the underlying
> > file system requires folio dirty tracking. File systems which do not, such
> > as shmem or hugetlb, are not at risk and therefore can be written to
> > without issue.
> >
> > Unfortunately this limitation of GUP has been present for some time and
> > requires future rework of the GUP API in order to provide correct write
> > access to such mappings.
> >
> > However, for the time being we introduce this check to prevent the most
> > egregious case of this occurring, use of the FOLL_LONGTERM pin.
> >
> > These mappings are considerably more likely to be written to after
> > folios are cleaned and thus simply must not be permitted to do so.
> >
> > As part of this change we separate out vma_needs_dirty_tracking() as a
> > helper function to determine this which is distinct from
> > vma_wants_writenotify() which is specific to determining which PTE flags to
> > set.
> >
> > [1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/
> >
>
>
> This change has the potential to break existing setups. Simple example:
> libvirt domains configured for file-backed VM memory that also has a vfio
> device configured. It can easily be configured by users (evolving VM
> configuration, copy-paste etc.). And it works from a VM perspective, because
> the guest memory is essentially stale once the VM is shutdown and the pages
> were unpinned. At least we're not concerned about stale data on disk.
>
> With your changes, such VMs would no longer start, breaking existing user
> setups with a kernel update.

Which vfio vm_ops are we talking about? vfio_pci_mmap_ops for example
doesn't specify page_mkwrite or pfn_mkwrite. Unless you mean some arbitrary
file system in the guest?

I may well be missing context on this so forgive me if I'm being a little
dumb here, but it'd be good to get a specific example.

>
> I don't really see a lot of reasons to perform this change now. It's been
> known to be problematic for a long time. People are working on a fix (I see
> Jan is already CCed, CCing Dave and Christop). FOLL_LONGTERM check is only
> handling some of the problematic cases, so it's not even a complete blocker.

I am not a huge fan of the commentary along the lines of 'I don't really
see a lot of reasons to make this change' when people have gone to lengths
to posit reasons as to why this change is being made, by all means disagree
but it's more helpful if it's of the form 'you say we should do this for
reasons X, Y, Z I disagree for reasons a, b, c' (which you've done
elsewhere).

Also those people working on a longer term fix are supporting this change
;) I think to think that I myself may also contribute to this fix longer
term, perhaps.

>
> I know, Jason und John will disagree, but I don't think we want to be very
> careful with changing the default.
>
> Sure, we could warn, or convert individual users using a flag (io_uring).
> But maybe we should invest more energy on a fix?

This is proactively blocking a cleanup (eliminating vmas) that I believe
will be useful in moving things forward. I am not against an opt-in option
(I have been responding to community feedback in adapting my approach),
which is the way I implemented it all the way back then :)

But given we know this is both entirely broken and a potential security
issue, and FOLL_LONGTERM is about as egregious as you can get (user
explicitly saying they'll hold write access indefinitely) I feel it is an
important improvement and makes clear that this is not an acceptable usage.

I see Jason has said more on this also :)

>
>
>
>
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> > ---
> >   include/linux/mm.h |  1 +
> >   mm/gup.c           | 41 ++++++++++++++++++++++++++++++++++++++++-
> >   mm/mmap.c          | 36 +++++++++++++++++++++++++++---------
> >   3 files changed, 68 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 37554b08bb28..f7da02fc89c6 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
> >   #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
> >   					    MM_CP_UFFD_WP_RESOLVE)
> >
> > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
> >   int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
> >   static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
> >   {
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 1f72a717232b..d36a5db9feb1 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma,
> >   	return 0;
> >   }
> >
> > +/*
> > + * Writing to file-backed mappings which require folio dirty tracking using GUP
> > + * is a fundamentally broken operation, as kernel write access to GUP mappings
> > + * do not adhere to the semantics expected by a file system.
> > + *
> > + * Consider the following scenario:-
> > + *
> > + * 1. A folio is written to via GUP which write-faults the memory, notifying
> > + *    the file system and dirtying the folio.
> > + * 2. Later, writeback is triggered, resulting in the folio being cleaned and
> > + *    the PTE being marked read-only.
> > + * 3. The GUP caller writes to the folio, as it is mapped read/write via the
> > + *    direct mapping.
> > + * 4. The GUP caller, now done with the page, unpins it and sets it dirty
> > + *    (though it does not have to).
> > + *
> > + * This results in both data being written to a folio without writenotify, and
> > + * the folio being dirtied unexpectedly (if the caller decides to do so).
> > + */
> > +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
> > +					   unsigned long gup_flags)
> > +{
> > +	/* If we aren't pinning then no problematic write can occur. */
> > +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> > +		return true;
>
> FOLL_LONGTERM only applies to FOLL_PIN. This check can be dropped.

I understand that of course (well maybe not of course, but I mean I do, I
have oodles of diagrams referencing this int he book :) This is intended to
document the fact that the check isn't relevant if we don't pin at all,
e.g. reading this you see:-

- (implicit) if not writing or anon we're good
- if not pin we're good
- ok we are only currently checking one especially egregious case
- finally, perform the dirty tracking check.

So this is intentional.

>
> > +
> > +	/* We limit this check to the most egregious case - a long term pin. */
> > +	if (!(gup_flags & FOLL_LONGTERM))
> > +		return true;
> > +
> > +	/* If the VMA requires dirty tracking then GUP will be problematic. */
> > +	return vma_needs_dirty_tracking(vma);
> > +}
> > +
> >   static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> >   {
> >   	vm_flags_t vm_flags = vma->vm_flags;
> >   	int write = (gup_flags & FOLL_WRITE);
> >   	int foreign = (gup_flags & FOLL_REMOTE);
> > +	bool vma_anon = vma_is_anonymous(vma);
> >
> >   	if (vm_flags & (VM_IO | VM_PFNMAP))
> >   		return -EFAULT;
> >
> > -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
> > +	if ((gup_flags & FOLL_ANON) && !vma_anon)
> >   		return -EFAULT;
> >
> >   	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> > @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> >   		return -EFAULT;
> >
> >   	if (write) {
> > +		if (!vma_anon &&
> > +		    !writeable_file_mapping_allowed(vma, gup_flags))
> > +			return -EFAULT;
> > +
> >   		if (!(vm_flags & VM_WRITE)) {
> >   			if (!(gup_flags & FOLL_FORCE))
> >   				return -EFAULT;
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 536bbb8fa0ae..7b6344d1832a 100644
> > --- a/mm/mmap.c
>
>
> I'm probably missing something, why don't we have to handle GUP-fast (having
> said that, it's hard to handle ;) )? The sequence you describe above should
> apply to GUP-fast as well, no?
>
> 1) Pin writable mapped page using GUP-fast
> 2) Trigger writeback
> 3) Write to page via pin
> 4) Unpin and set dirty

You're right, and this is an excellent point. I worry about other GUP use
cases too, but we're a bit out of luck there because we don't get to check
the VMA _at all_ (which opens yet another Pandora's box about how safe it
is to do unlocked pinning :)

But again, this comes down to the fact we're trying to make things
_incrementally__ better rather than throwing our hands up and saying one
day my ship will come in...

>
>
> --
> Thanks,
>
> David / dhildenb
>
>
David Hildenbrand April 28, 2023, 3:08 p.m. UTC | #9
On 28.04.23 16:35, Jason Gunthorpe wrote:
> On Fri, Apr 28, 2023 at 04:20:46PM +0200, David Hildenbrand wrote:
>> Sorry for jumping in late, I'm on vacation :)
>>
>> On 28.04.23 01:42, Lorenzo Stoakes wrote:
>>> Writing to file-backed mappings which require folio dirty tracking using
>>> GUP is a fundamentally broken operation, as kernel write access to GUP
>>> mappings do not adhere to the semantics expected by a file system.
>>>
>>> A GUP caller uses the direct mapping to access the folio, which does not
>>> cause write notify to trigger, nor does it enforce that the caller marks
>>> the folio dirty.
>>
>> How should we enforce it? It would be a BUG in the GUP user.
> 
> I hope we don't have these kinds of mistakes.. hard to enforce by
> code.
> 

I briefly played with the idea of only allowing to write-pin fs pages 
that are dirty (or the pte is dirty). If we adjust writeback code to 
leave such (maybe pinned) pages dirty, there would be no need to reset 
the pages dirty I guess.

Just an idea, getting the writebackcode fixed (and race-free with 
GUP-fast) is the harder problem.

>> This change has the potential to break existing setups. Simple example:
>> libvirt domains configured for file-backed VM memory that also has a vfio
>> device configured. It can easily be configured by users (evolving VM
>> configuration, copy-paste etc.). And it works from a VM perspective, because
>> the guest memory is essentially stale once the VM is shutdown and the pages
>> were unpinned. At least we're not concerned about stale data on
>> disk.
> 
> I think this is broken today and we should block it. We know from
> experiments with RDMA that doing exactly this triggers kernel oop's.
> 

I never saw similar reports in the wild (especially targeted at RHEL), 
so is this still a current issue that has not been mitigated? Or is it 
just so hard to actually trigger?

> Run your qemu config once, all the pages in the file become dirty.
> 
> Run your qmeu config again and now all the dirty pages are longterm
> pinned.
> 
> Something eventually does writeback and FS cleans the page.

At least vmscan does not touch pages that have additional references -- 
pageout() quits early. So that other thing doesn't happen that often I 
guess (manual fsync doesn't usually happen on VM memory if I am not 
wrong ...).

> 
> Now close your VM and the page is dirtied without make write. FS is
> inconsistent with the MM, kernel will eventually oops.
> 
> I'm skeptical that anyone can actually do this combination of things
> successfully without getting kernel crashes or file data corruption -
> ie there is no real user to break.

I am pretty sure that there are such VM users, because on the libvirt 
level it's completely unclear which features trigger what behavior :/

I remember (but did not check) that VM memory might usually get deleted 
whenever we usually shutdown a VM, another reason why we wouldn't see 
this in the wild. libvirt has the "discard" option exactly for that 
purpose, to be used with file-based guest memory. [1]

Users tend to copy-paste domain XMLs + edit because it's just so hard to 
get right. Changing the VM backing to be backed from a file can be done 
with a one-line change in the libvirt XML.

> 
>> With your changes, such VMs would no longer start, breaking existing user
>> setups with a kernel update.
> 
> Yes, as a matter of security we should break it.
> 
> Earlier I suggested making this contingent on kernel lockdown >=
> integrity, if actual users come and complain we should go to that
> option.
> 
>> Sure, we could warn, or convert individual users using a flag (io_uring).
>> But maybe we should invest more energy on a fix?
> 
> It has been years now, I think we need to admit a fix is still years
> away. Blocking the security problem may even motivate more people to
> work on a fix.

Maybe we should make this a topic this year at LSF/MM (again?). At least 
we learned a lot about GUP, what might work, what might not work, and 
got a depper understanding (+ motivation to fix? :) ) the issue at hand.

> 
> Security is the primary case where we have historically closed uAPI
> items.

As this patch

1) Does not tackle GUP-fast
2) Does not take care of !FOLL_LONGTERM

I am not convinced by the security argument in regard to this patch.


If we want to sells this as a security thing, we have to block it 
*completely* and then CC stable.

Everything else sounds like band-aids to me, is insufficient, and might 
cause more harm than actually help IMHO. Especially the gup-fast case is 
extremely easy to work-around in malicious user space.


[1] https://listman.redhat.com/archives/libvir-list/2018-May/msg00885.html
David Hildenbrand April 28, 2023, 3:13 p.m. UTC | #10
[...]

>> This change has the potential to break existing setups. Simple example:
>> libvirt domains configured for file-backed VM memory that also has a vfio
>> device configured. It can easily be configured by users (evolving VM
>> configuration, copy-paste etc.). And it works from a VM perspective, because
>> the guest memory is essentially stale once the VM is shutdown and the pages
>> were unpinned. At least we're not concerned about stale data on disk.
>>
>> With your changes, such VMs would no longer start, breaking existing user
>> setups with a kernel update.
> 
> Which vfio vm_ops are we talking about? vfio_pci_mmap_ops for example
> doesn't specify page_mkwrite or pfn_mkwrite. Unless you mean some arbitrary
> file system in the guest?

Sorry, you define a VM to have its memory backed by VM memory and, at 
the same time, define a vfio-pci device for your VM, which will end up 
long-term pinning the VM memory.

> 
> I may well be missing context on this so forgive me if I'm being a little
> dumb here, but it'd be good to get a specific example.

I was giving to little details ;)

[...]

>>
>> I know, Jason und John will disagree, but I don't think we want to be very
>> careful with changing the default.
>>
>> Sure, we could warn, or convert individual users using a flag (io_uring).
>> But maybe we should invest more energy on a fix?
> 
> This is proactively blocking a cleanup (eliminating vmas) that I believe
> will be useful in moving things forward. I am not against an opt-in option
> (I have been responding to community feedback in adapting my approach),
> which is the way I implemented it all the way back then :)

There are alternatives: just use a flag as Jason initially suggested and 
use that in io_uring code. Then, you can also bail out on the GUP-fast 
path as "cannot support it right now, never do GUP-fast".

IMHO, this patch is not a prereq.

> 
> But given we know this is both entirely broken and a potential security
> issue, and FOLL_LONGTERM is about as egregious as you can get (user
> explicitly saying they'll hold write access indefinitely) I feel it is an
> important improvement and makes clear that this is not an acceptable usage.
> 
> I see Jason has said more on this also :)
> 
>>
>>
>>
>>
>>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>>> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
>>> ---
>>>    include/linux/mm.h |  1 +
>>>    mm/gup.c           | 41 ++++++++++++++++++++++++++++++++++++++++-
>>>    mm/mmap.c          | 36 +++++++++++++++++++++++++++---------
>>>    3 files changed, 68 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 37554b08bb28..f7da02fc89c6 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>>    #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>>    					    MM_CP_UFFD_WP_RESOLVE)
>>>
>>> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
>>>    int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
>>>    static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
>>>    {
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 1f72a717232b..d36a5db9feb1 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma,
>>>    	return 0;
>>>    }
>>>
>>> +/*
>>> + * Writing to file-backed mappings which require folio dirty tracking using GUP
>>> + * is a fundamentally broken operation, as kernel write access to GUP mappings
>>> + * do not adhere to the semantics expected by a file system.
>>> + *
>>> + * Consider the following scenario:-
>>> + *
>>> + * 1. A folio is written to via GUP which write-faults the memory, notifying
>>> + *    the file system and dirtying the folio.
>>> + * 2. Later, writeback is triggered, resulting in the folio being cleaned and
>>> + *    the PTE being marked read-only.
>>> + * 3. The GUP caller writes to the folio, as it is mapped read/write via the
>>> + *    direct mapping.
>>> + * 4. The GUP caller, now done with the page, unpins it and sets it dirty
>>> + *    (though it does not have to).
>>> + *
>>> + * This results in both data being written to a folio without writenotify, and
>>> + * the folio being dirtied unexpectedly (if the caller decides to do so).
>>> + */
>>> +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
>>> +					   unsigned long gup_flags)
>>> +{
>>> +	/* If we aren't pinning then no problematic write can occur. */
>>> +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
>>> +		return true;
>>
>> FOLL_LONGTERM only applies to FOLL_PIN. This check can be dropped.
> 
> I understand that of course (well maybe not of course, but I mean I do, I
> have oodles of diagrams referencing this int he book :) This is intended to
> document the fact that the check isn't relevant if we don't pin at all,
> e.g. reading this you see:-
> 
> - (implicit) if not writing or anon we're good
> - if not pin we're good
> - ok we are only currently checking one especially egregious case
> - finally, perform the dirty tracking check.
> 
> So this is intentional.
> 
>>
>>> +
>>> +	/* We limit this check to the most egregious case - a long term pin. */
>>> +	if (!(gup_flags & FOLL_LONGTERM))
>>> +		return true;
>>> +
>>> +	/* If the VMA requires dirty tracking then GUP will be problematic. */
>>> +	return vma_needs_dirty_tracking(vma);
>>> +}
>>> +
>>>    static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>>>    {
>>>    	vm_flags_t vm_flags = vma->vm_flags;
>>>    	int write = (gup_flags & FOLL_WRITE);
>>>    	int foreign = (gup_flags & FOLL_REMOTE);
>>> +	bool vma_anon = vma_is_anonymous(vma);
>>>
>>>    	if (vm_flags & (VM_IO | VM_PFNMAP))
>>>    		return -EFAULT;
>>>
>>> -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
>>> +	if ((gup_flags & FOLL_ANON) && !vma_anon)
>>>    		return -EFAULT;
>>>
>>>    	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
>>> @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>>>    		return -EFAULT;
>>>
>>>    	if (write) {
>>> +		if (!vma_anon &&
>>> +		    !writeable_file_mapping_allowed(vma, gup_flags))
>>> +			return -EFAULT;
>>> +
>>>    		if (!(vm_flags & VM_WRITE)) {
>>>    			if (!(gup_flags & FOLL_FORCE))
>>>    				return -EFAULT;
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index 536bbb8fa0ae..7b6344d1832a 100644
>>> --- a/mm/mmap.c
>>
>>
>> I'm probably missing something, why don't we have to handle GUP-fast (having
>> said that, it's hard to handle ;) )? The sequence you describe above should
>> apply to GUP-fast as well, no?
>>
>> 1) Pin writable mapped page using GUP-fast
>> 2) Trigger writeback
>> 3) Write to page via pin
>> 4) Unpin and set dirty
> 
> You're right, and this is an excellent point. I worry about other GUP use
> cases too, but we're a bit out of luck there because we don't get to check
> the VMA _at all_ (which opens yet another Pandora's box about how safe it
> is to do unlocked pinning :)
> 
> But again, this comes down to the fact we're trying to make things
> _incrementally__ better rather than throwing our hands up and saying one
> day my ship will come in...

That's not how security fixes are supposed to work IMHO, sorry.
Jens Axboe April 28, 2023, 3:15 p.m. UTC | #11
On 4/28/23 9:13?AM, David Hildenbrand wrote:
>>> I know, Jason und John will disagree, but I don't think we want to be very
>>> careful with changing the default.
>>>
>>> Sure, we could warn, or convert individual users using a flag (io_uring).
>>> But maybe we should invest more energy on a fix?
>>
>> This is proactively blocking a cleanup (eliminating vmas) that I believe
>> will be useful in moving things forward. I am not against an opt-in option
>> (I have been responding to community feedback in adapting my approach),
>> which is the way I implemented it all the way back then :)
> 
> There are alternatives: just use a flag as Jason initially suggested
> and use that in io_uring code. Then, you can also bail out on the
> GUP-fast path as "cannot support it right now, never do GUP-fast".

Since I've seen this brougth up a few times, what's the issue on the
io_uring side? We already dropped the special vma checking, it's in -git
right. Hence I don't believe there are any special cases left for
io_uring at all, and we certainly don't allow real file backings either,
never have done.
David Hildenbrand April 28, 2023, 3:23 p.m. UTC | #12
>>
>> Security is the primary case where we have historically closed uAPI
>> items.
> 
> As this patch
> 
> 1) Does not tackle GUP-fast
> 2) Does not take care of !FOLL_LONGTERM
> 
> I am not convinced by the security argument in regard to this patch.
> 
> 
> If we want to sells this as a security thing, we have to block it
> *completely* and then CC stable.

Regarding GUP-fast, to fix the issue there as well, I guess we could do 
something similar as I did in gup_must_unshare():

If we're in GUP-fast (no VMA), and want to pin a !anon page writable, 
fallback to ordinary GUP. IOW, if we don't know, better be safe.

Of course, this would prevent hugetlb/shmem from getting pinned writable 
during gup-fast. Unless we're able to whitelist them somehow in there.


For FOLL_LONGTERM it might fairly uncontroversial. For everything else 
I'm not sure if there could be undesired side-effects.
Lorenzo Stoakes April 28, 2023, 3:24 p.m. UTC | #13
On Fri, Apr 28, 2023 at 05:13:07PM +0200, David Hildenbrand wrote:
> [...]
>
> > > This change has the potential to break existing setups. Simple example:
> > > libvirt domains configured for file-backed VM memory that also has a vfio
> > > device configured. It can easily be configured by users (evolving VM
> > > configuration, copy-paste etc.). And it works from a VM perspective, because
> > > the guest memory is essentially stale once the VM is shutdown and the pages
> > > were unpinned. At least we're not concerned about stale data on disk.
> > >
> > > With your changes, such VMs would no longer start, breaking existing user
> > > setups with a kernel update.
> >
> > Which vfio vm_ops are we talking about? vfio_pci_mmap_ops for example
> > doesn't specify page_mkwrite or pfn_mkwrite. Unless you mean some arbitrary
> > file system in the guest?
>
> Sorry, you define a VM to have its memory backed by VM memory and, at the
> same time, define a vfio-pci device for your VM, which will end up long-term
> pinning the VM memory.

Ah ack. Jason seemed concerned that this was already a broken case, I guess
that's one for you two to hash out...

>
> >
> > I may well be missing context on this so forgive me if I'm being a little
> > dumb here, but it'd be good to get a specific example.
>
> I was giving to little details ;)
>
> [...]
>
> > >
> > > I know, Jason und John will disagree, but I don't think we want to be very
> > > careful with changing the default.
> > >
> > > Sure, we could warn, or convert individual users using a flag (io_uring).
> > > But maybe we should invest more energy on a fix?
> >
> > This is proactively blocking a cleanup (eliminating vmas) that I believe
> > will be useful in moving things forward. I am not against an opt-in option
> > (I have been responding to community feedback in adapting my approach),
> > which is the way I implemented it all the way back then :)
>
> There are alternatives: just use a flag as Jason initially suggested and use
> that in io_uring code. Then, you can also bail out on the GUP-fast path as
> "cannot support it right now, never do GUP-fast".

I already implemented the alternatives (look back through revisions to see :)
and there were objections for various reasons.

Personally my preference is to provide a FOLL_SAFE_FILE_WRITE flag or such and
replace the FOLL_LONGTERM check with this (that'll automatically get rejected
for GUP-fast so the GUP-fast conundrum wouldn't be a thing either).

GUP-fast is a problem as you say,, but it feels like a fundamental issue with
GUP-fast as a whole since you don't get to look at a VMA since you can't take
the mmap_lock. You could just look at the folio->mapping once you've walked the
page tables and say 'I'm out' if FOLL_WRITE and it's non-anon if that's what
you're suggesting?

I'm not against that change but this being incremental, I think that would be a
further increment.

>
> IMHO, this patch is not a prereq.
>
> >
> > But given we know this is both entirely broken and a potential security
> > issue, and FOLL_LONGTERM is about as egregious as you can get (user
> > explicitly saying they'll hold write access indefinitely) I feel it is an
> > important improvement and makes clear that this is not an acceptable usage.
> >
> > I see Jason has said more on this also :)
> >
> > >
> > >
> > >
> > >
> > > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> > > > ---
> > > >    include/linux/mm.h |  1 +
> > > >    mm/gup.c           | 41 ++++++++++++++++++++++++++++++++++++++++-
> > > >    mm/mmap.c          | 36 +++++++++++++++++++++++++++---------
> > > >    3 files changed, 68 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 37554b08bb28..f7da02fc89c6 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
> > > >    #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
> > > >    					    MM_CP_UFFD_WP_RESOLVE)
> > > >
> > > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
> > > >    int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
> > > >    static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
> > > >    {
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index 1f72a717232b..d36a5db9feb1 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma,
> > > >    	return 0;
> > > >    }
> > > >
> > > > +/*
> > > > + * Writing to file-backed mappings which require folio dirty tracking using GUP
> > > > + * is a fundamentally broken operation, as kernel write access to GUP mappings
> > > > + * do not adhere to the semantics expected by a file system.
> > > > + *
> > > > + * Consider the following scenario:-
> > > > + *
> > > > + * 1. A folio is written to via GUP which write-faults the memory, notifying
> > > > + *    the file system and dirtying the folio.
> > > > + * 2. Later, writeback is triggered, resulting in the folio being cleaned and
> > > > + *    the PTE being marked read-only.
> > > > + * 3. The GUP caller writes to the folio, as it is mapped read/write via the
> > > > + *    direct mapping.
> > > > + * 4. The GUP caller, now done with the page, unpins it and sets it dirty
> > > > + *    (though it does not have to).
> > > > + *
> > > > + * This results in both data being written to a folio without writenotify, and
> > > > + * the folio being dirtied unexpectedly (if the caller decides to do so).
> > > > + */
> > > > +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
> > > > +					   unsigned long gup_flags)
> > > > +{
> > > > +	/* If we aren't pinning then no problematic write can occur. */
> > > > +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> > > > +		return true;
> > >
> > > FOLL_LONGTERM only applies to FOLL_PIN. This check can be dropped.
> >
> > I understand that of course (well maybe not of course, but I mean I do, I
> > have oodles of diagrams referencing this int he book :) This is intended to
> > document the fact that the check isn't relevant if we don't pin at all,
> > e.g. reading this you see:-
> >
> > - (implicit) if not writing or anon we're good
> > - if not pin we're good
> > - ok we are only currently checking one especially egregious case
> > - finally, perform the dirty tracking check.
> >
> > So this is intentional.
> >
> > >
> > > > +
> > > > +	/* We limit this check to the most egregious case - a long term pin. */
> > > > +	if (!(gup_flags & FOLL_LONGTERM))
> > > > +		return true;
> > > > +
> > > > +	/* If the VMA requires dirty tracking then GUP will be problematic. */
> > > > +	return vma_needs_dirty_tracking(vma);
> > > > +}
> > > > +
> > > >    static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> > > >    {
> > > >    	vm_flags_t vm_flags = vma->vm_flags;
> > > >    	int write = (gup_flags & FOLL_WRITE);
> > > >    	int foreign = (gup_flags & FOLL_REMOTE);
> > > > +	bool vma_anon = vma_is_anonymous(vma);
> > > >
> > > >    	if (vm_flags & (VM_IO | VM_PFNMAP))
> > > >    		return -EFAULT;
> > > >
> > > > -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
> > > > +	if ((gup_flags & FOLL_ANON) && !vma_anon)
> > > >    		return -EFAULT;
> > > >
> > > >    	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> > > > @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> > > >    		return -EFAULT;
> > > >
> > > >    	if (write) {
> > > > +		if (!vma_anon &&
> > > > +		    !writeable_file_mapping_allowed(vma, gup_flags))
> > > > +			return -EFAULT;
> > > > +
> > > >    		if (!(vm_flags & VM_WRITE)) {
> > > >    			if (!(gup_flags & FOLL_FORCE))
> > > >    				return -EFAULT;
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 536bbb8fa0ae..7b6344d1832a 100644
> > > > --- a/mm/mmap.c
> > >
> > >
> > > I'm probably missing something, why don't we have to handle GUP-fast (having
> > > said that, it's hard to handle ;) )? The sequence you describe above should
> > > apply to GUP-fast as well, no?
> > >
> > > 1) Pin writable mapped page using GUP-fast
> > > 2) Trigger writeback
> > > 3) Write to page via pin
> > > 4) Unpin and set dirty
> >
> > You're right, and this is an excellent point. I worry about other GUP use
> > cases too, but we're a bit out of luck there because we don't get to check
> > the VMA _at all_ (which opens yet another Pandora's box about how safe it
> > is to do unlocked pinning :)
> >
> > But again, this comes down to the fact we're trying to make things
> > _incrementally__ better rather than throwing our hands up and saying one
> > day my ship will come in...
>
> That's not how security fixes are supposed to work IMHO, sorry.

Sure, but I don't think the 'let things continue to be terribly broken for X
more years' is also a great approach.

Personally I come at this from the 'I just want my vmas patch series' unblocked
perspective :) and feel there's a functional aspect here too.

>
> --
> Thanks,
>
> David / dhildenb
>
>
Lorenzo Stoakes April 28, 2023, 3:27 p.m. UTC | #14
On Fri, Apr 28, 2023 at 09:15:08AM -0600, Jens Axboe wrote:
> On 4/28/23 9:13?AM, David Hildenbrand wrote:
> >>> I know, Jason und John will disagree, but I don't think we want to be very
> >>> careful with changing the default.
> >>>
> >>> Sure, we could warn, or convert individual users using a flag (io_uring).
> >>> But maybe we should invest more energy on a fix?
> >>
> >> This is proactively blocking a cleanup (eliminating vmas) that I believe
> >> will be useful in moving things forward. I am not against an opt-in option
> >> (I have been responding to community feedback in adapting my approach),
> >> which is the way I implemented it all the way back then :)
> >
> > There are alternatives: just use a flag as Jason initially suggested
> > and use that in io_uring code. Then, you can also bail out on the
> > GUP-fast path as "cannot support it right now, never do GUP-fast".
>
> Since I've seen this brougth up a few times, what's the issue on the
> io_uring side? We already dropped the special vma checking, it's in -git
> right. Hence I don't believe there are any special cases left for
> io_uring at all, and we certainly don't allow real file backings either,
> never have done.

The purpose from my perspective is being able to have GUP perform the 'is the
file-backed mapping sane to GUP' check rather than you having to open code
it. There is nothing special beyond that.

Personally I think the best solution is an opt-in FOLL_SAFE_WRITE_FILE flag or
such that you call and drop the vma check you have.

That way we don't risk breaking anything, the vmas patch series can unblock, and
you don't have to have raw mm code in your bit :)

>
> --
> Jens Axboe
>
Jason Gunthorpe April 28, 2023, 3:27 p.m. UTC | #15
On Fri, Apr 28, 2023 at 05:08:27PM +0200, David Hildenbrand wrote:

> > I think this is broken today and we should block it. We know from
> > experiments with RDMA that doing exactly this triggers kernel oop's.
> 
> I never saw similar reports in the wild (especially targeted at RHEL), so is
> this still a current issue that has not been mitigated? Or is it just so
> hard to actually trigger?

People send RDMA related bug reports to us, and we tell them not to do
this stuff :)

> > I'm skeptical that anyone can actually do this combination of things
> > successfully without getting kernel crashes or file data corruption -
> > ie there is no real user to break.
> 
> I am pretty sure that there are such VM users, because on the libvirt level
> it's completely unclear which features trigger what behavior :/

IDK, why on earth would anyone want to do this? Using VFIO forces all
the memory to become resident so what was the point of making it file
backed in the first place?

I'm skeptical there are real users even if it now requires special
steps to be crashy/corrupty.

> > > Sure, we could warn, or convert individual users using a flag (io_uring).
> > > But maybe we should invest more energy on a fix?
> > 
> > It has been years now, I think we need to admit a fix is still years
> > away. Blocking the security problem may even motivate more people to
> > work on a fix.
> 
> Maybe we should make this a topic this year at LSF/MM (again?). At least we
> learned a lot about GUP, what might work, what might not work, and got a
> depper understanding (+ motivation to fix? :) ) the issue at hand.

We keep having the topic.. This is the old argument that the FS people
say the MM isn't following its inode and dirty lifetime rules and the
MM people say the FS isn't following its refcounting rules <shrug>

> > Security is the primary case where we have historically closed uAPI
> > items.
> 
> As this patch
> 
> 1) Does not tackle GUP-fast
> 2) Does not take care of !FOLL_LONGTERM
> 
> I am not convinced by the security argument in regard to this patch.

It is incremental and a temperature check to see what kind of real
users exist. We have no idea right now, just speculation.

Like I said, if there is feedback we can weaken it even further.

> Everything else sounds like band-aids to me, is insufficient, and might
> cause more harm than actually help IMHO. Especially the gup-fast case is
> extremely easy to work-around in malicious user space.

It is true this patch should probably block gup_fast when using
FOLL_LONGTERM as well, just like we used to do for the DAX check.

Jason
Lorenzo Stoakes April 28, 2023, 3:33 p.m. UTC | #16
On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
> > >
> > > Security is the primary case where we have historically closed uAPI
> > > items.
> >
> > As this patch
> >
> > 1) Does not tackle GUP-fast
> > 2) Does not take care of !FOLL_LONGTERM
> >
> > I am not convinced by the security argument in regard to this patch.
> >
> >
> > If we want to sells this as a security thing, we have to block it
> > *completely* and then CC stable.
>
> Regarding GUP-fast, to fix the issue there as well, I guess we could do
> something similar as I did in gup_must_unshare():
>
> If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
> fallback to ordinary GUP. IOW, if we don't know, better be safe.

How do we determine it's non-anon in the first place? The check is on the
VMA. We could do it by following page tables down to folio and checking
folio->mapping for PAGE_MAPPING_ANON I suppose?

>
> Of course, this would prevent hugetlb/shmem from getting pinned writable
> during gup-fast. Unless we're able to whitelist them somehow in there.

We could degrade those to non-fast assuming not FOLL_FAST_ONLY. But it'd be
a pity.

>
>
> For FOLL_LONGTERM it might fairly uncontroversial. For everything else I'm
> not sure if there could be undesired side-effects.

Yeah this is why I pared the patch down to this alone :) there are
definitely concerns and issues with other cases, notably ptrace + friends
but obviously not only.

FOLL_LONGTERM is just the most egregious case.

>
> --
> Thanks,
>
> David / dhildenb
>
David Hildenbrand April 28, 2023, 3:33 p.m. UTC | #17
On 28.04.23 17:24, Lorenzo Stoakes wrote:
> On Fri, Apr 28, 2023 at 05:13:07PM +0200, David Hildenbrand wrote:
>> [...]
>>
>>>> This change has the potential to break existing setups. Simple example:
>>>> libvirt domains configured for file-backed VM memory that also has a vfio
>>>> device configured. It can easily be configured by users (evolving VM
>>>> configuration, copy-paste etc.). And it works from a VM perspective, because
>>>> the guest memory is essentially stale once the VM is shutdown and the pages
>>>> were unpinned. At least we're not concerned about stale data on disk.
>>>>
>>>> With your changes, such VMs would no longer start, breaking existing user
>>>> setups with a kernel update.
>>>
>>> Which vfio vm_ops are we talking about? vfio_pci_mmap_ops for example
>>> doesn't specify page_mkwrite or pfn_mkwrite. Unless you mean some arbitrary
>>> file system in the guest?
>>
>> Sorry, you define a VM to have its memory backed by VM memory and, at the
>> same time, define a vfio-pci device for your VM, which will end up long-term
>> pinning the VM memory.
> 

"memory backed by file memory", I guess you figured that out :)

> Ah ack. Jason seemed concerned that this was already a broken case, I guess
> that's one for you two to hash out...
> 
>>
>>>
>>> I may well be missing context on this so forgive me if I'm being a little
>>> dumb here, but it'd be good to get a specific example.
>>
>> I was giving to little details ;)
>>
>> [...]
>>
>>>>
>>>> I know, Jason und John will disagree, but I don't think we want to be very
>>>> careful with changing the default.
>>>>
>>>> Sure, we could warn, or convert individual users using a flag (io_uring).
>>>> But maybe we should invest more energy on a fix?
>>>
>>> This is proactively blocking a cleanup (eliminating vmas) that I believe
>>> will be useful in moving things forward. I am not against an opt-in option
>>> (I have been responding to community feedback in adapting my approach),
>>> which is the way I implemented it all the way back then :)
>>
>> There are alternatives: just use a flag as Jason initially suggested and use
>> that in io_uring code. Then, you can also bail out on the GUP-fast path as
>> "cannot support it right now, never do GUP-fast".
> 
> I already implemented the alternatives (look back through revisions to see :)
> and there were objections for various reasons.
> 
> Personally my preference is to provide a FOLL_SAFE_FILE_WRITE flag or such and
> replace the FOLL_LONGTERM check with this (that'll automatically get rejected
> for GUP-fast so the GUP-fast conundrum wouldn't be a thing either).
> 
> GUP-fast is a problem as you say,, but it feels like a fundamental issue with
> GUP-fast as a whole since you don't get to look at a VMA since you can't take
> the mmap_lock. You could just look at the folio->mapping once you've walked the
> page tables and say 'I'm out' if FOLL_WRITE and it's non-anon if that's what
> you're suggesting?

See my other reply, kind-of yes. Like we do with gup_must_unshare(). I'm 
only concerned about how to keep GUP-fast working on hugetlb and shmem.

> 
> I'm not against that change but this being incremental, I think that would be a
> further increment.

If we want to fix a security issue, as Jason said, incremental is IMHO 
the wrong approach.

It's often too tempting to ignore the hard part and fix the easy part, 
making the hard part an increment for the future that nobody will really 
implement ... because it's hard.
[...]

>>>
>>> But given we know this is both entirely broken and a potential security
>>> issue, and FOLL_LONGTERM is about as egregious as you can get (user
>>> explicitly saying they'll hold write access indefinitely) I feel it is an
>>> important improvement and makes clear that this is not an acceptable usage.
>>>
>>> I see Jason has said more on this also :)
>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>>>>> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
>>>>> ---
>>>>>     include/linux/mm.h |  1 +
>>>>>     mm/gup.c           | 41 ++++++++++++++++++++++++++++++++++++++++-
>>>>>     mm/mmap.c          | 36 +++++++++++++++++++++++++++---------
>>>>>     3 files changed, 68 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>>> index 37554b08bb28..f7da02fc89c6 100644
>>>>> --- a/include/linux/mm.h
>>>>> +++ b/include/linux/mm.h
>>>>> @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>>>>     #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>>>>     					    MM_CP_UFFD_WP_RESOLVE)
>>>>>
>>>>> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
>>>>>     int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
>>>>>     static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
>>>>>     {
>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>> index 1f72a717232b..d36a5db9feb1 100644
>>>>> --- a/mm/gup.c
>>>>> +++ b/mm/gup.c
>>>>> @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma,
>>>>>     	return 0;
>>>>>     }
>>>>>
>>>>> +/*
>>>>> + * Writing to file-backed mappings which require folio dirty tracking using GUP
>>>>> + * is a fundamentally broken operation, as kernel write access to GUP mappings
>>>>> + * do not adhere to the semantics expected by a file system.
>>>>> + *
>>>>> + * Consider the following scenario:-
>>>>> + *
>>>>> + * 1. A folio is written to via GUP which write-faults the memory, notifying
>>>>> + *    the file system and dirtying the folio.
>>>>> + * 2. Later, writeback is triggered, resulting in the folio being cleaned and
>>>>> + *    the PTE being marked read-only.
>>>>> + * 3. The GUP caller writes to the folio, as it is mapped read/write via the
>>>>> + *    direct mapping.
>>>>> + * 4. The GUP caller, now done with the page, unpins it and sets it dirty
>>>>> + *    (though it does not have to).
>>>>> + *
>>>>> + * This results in both data being written to a folio without writenotify, and
>>>>> + * the folio being dirtied unexpectedly (if the caller decides to do so).
>>>>> + */
>>>>> +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
>>>>> +					   unsigned long gup_flags)
>>>>> +{
>>>>> +	/* If we aren't pinning then no problematic write can occur. */
>>>>> +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
>>>>> +		return true;
>>>>
>>>> FOLL_LONGTERM only applies to FOLL_PIN. This check can be dropped.
>>>
>>> I understand that of course (well maybe not of course, but I mean I do, I
>>> have oodles of diagrams referencing this int he book :) This is intended to
>>> document the fact that the check isn't relevant if we don't pin at all,
>>> e.g. reading this you see:-
>>>
>>> - (implicit) if not writing or anon we're good
>>> - if not pin we're good
>>> - ok we are only currently checking one especially egregious case
>>> - finally, perform the dirty tracking check.
>>>
>>> So this is intentional.
>>>
>>>>
>>>>> +
>>>>> +	/* We limit this check to the most egregious case - a long term pin. */
>>>>> +	if (!(gup_flags & FOLL_LONGTERM))
>>>>> +		return true;
>>>>> +
>>>>> +	/* If the VMA requires dirty tracking then GUP will be problematic. */
>>>>> +	return vma_needs_dirty_tracking(vma);
>>>>> +}
>>>>> +
>>>>>     static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>>>>>     {
>>>>>     	vm_flags_t vm_flags = vma->vm_flags;
>>>>>     	int write = (gup_flags & FOLL_WRITE);
>>>>>     	int foreign = (gup_flags & FOLL_REMOTE);
>>>>> +	bool vma_anon = vma_is_anonymous(vma);
>>>>>
>>>>>     	if (vm_flags & (VM_IO | VM_PFNMAP))
>>>>>     		return -EFAULT;
>>>>>
>>>>> -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
>>>>> +	if ((gup_flags & FOLL_ANON) && !vma_anon)
>>>>>     		return -EFAULT;
>>>>>
>>>>>     	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
>>>>> @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>>>>>     		return -EFAULT;
>>>>>
>>>>>     	if (write) {
>>>>> +		if (!vma_anon &&
>>>>> +		    !writeable_file_mapping_allowed(vma, gup_flags))
>>>>> +			return -EFAULT;
>>>>> +
>>>>>     		if (!(vm_flags & VM_WRITE)) {
>>>>>     			if (!(gup_flags & FOLL_FORCE))
>>>>>     				return -EFAULT;
>>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>>> index 536bbb8fa0ae..7b6344d1832a 100644
>>>>> --- a/mm/mmap.c
>>>>
>>>>
>>>> I'm probably missing something, why don't we have to handle GUP-fast (having
>>>> said that, it's hard to handle ;) )? The sequence you describe above should
>>>> apply to GUP-fast as well, no?
>>>>
>>>> 1) Pin writable mapped page using GUP-fast
>>>> 2) Trigger writeback
>>>> 3) Write to page via pin
>>>> 4) Unpin and set dirty
>>>
>>> You're right, and this is an excellent point. I worry about other GUP use
>>> cases too, but we're a bit out of luck there because we don't get to check
>>> the VMA _at all_ (which opens yet another Pandora's box about how safe it
>>> is to do unlocked pinning :)
>>>
>>> But again, this comes down to the fact we're trying to make things
>>> _incrementally__ better rather than throwing our hands up and saying one
>>> day my ship will come in...
>>
>> That's not how security fixes are supposed to work IMHO, sorry.
> 
> Sure, but I don't think the 'let things continue to be terribly broken for X
> more years' is also a great approach.

Not at all, people (including me) were simply not aware that it is that 
much of an (security) issue because I never saw any real bug reports (or 
CVE numbers) and only herd John talk about possible fixes a year ago :)

So I'm saying we either try to block it completely or finally look into 
fixing it for good. I'm not a friend of anything in between.

You don't gain a lot of security by locking the front door but knowingly 
leaving the back door unlocked.

> 
> Personally I come at this from the 'I just want my vmas patch series' unblocked
> perspective :) and feel there's a functional aspect here too.

I know, it always gets messy when touching such sensible topics :P
David Hildenbrand April 28, 2023, 3:34 p.m. UTC | #18
On 28.04.23 17:33, Lorenzo Stoakes wrote:
> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
>>>>
>>>> Security is the primary case where we have historically closed uAPI
>>>> items.
>>>
>>> As this patch
>>>
>>> 1) Does not tackle GUP-fast
>>> 2) Does not take care of !FOLL_LONGTERM
>>>
>>> I am not convinced by the security argument in regard to this patch.
>>>
>>>
>>> If we want to sells this as a security thing, we have to block it
>>> *completely* and then CC stable.
>>
>> Regarding GUP-fast, to fix the issue there as well, I guess we could do
>> something similar as I did in gup_must_unshare():
>>
>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
>> fallback to ordinary GUP. IOW, if we don't know, better be safe.
> 
> How do we determine it's non-anon in the first place? The check is on the
> VMA. We could do it by following page tables down to folio and checking
> folio->mapping for PAGE_MAPPING_ANON I suppose?

PageAnon(page) can be called from GUP-fast after grabbing a reference. 
See gup_must_unshare().

> 
>>
>> Of course, this would prevent hugetlb/shmem from getting pinned writable
>> during gup-fast. Unless we're able to whitelist them somehow in there.
> 
> We could degrade those to non-fast assuming not FOLL_FAST_ONLY. But it'd be
> a pity.
David Hildenbrand April 28, 2023, 3:41 p.m. UTC | #19
On 28.04.23 17:27, Jason Gunthorpe wrote:
> On Fri, Apr 28, 2023 at 05:08:27PM +0200, David Hildenbrand wrote:
> 
>>> I think this is broken today and we should block it. We know from
>>> experiments with RDMA that doing exactly this triggers kernel oop's.
>>
>> I never saw similar reports in the wild (especially targeted at RHEL), so is
>> this still a current issue that has not been mitigated? Or is it just so
>> hard to actually trigger?
> 
> People send RDMA related bug reports to us, and we tell them not to do
> this stuff :)
> 
>>> I'm skeptical that anyone can actually do this combination of things
>>> successfully without getting kernel crashes or file data corruption -
>>> ie there is no real user to break.
>>
>> I am pretty sure that there are such VM users, because on the libvirt level
>> it's completely unclear which features trigger what behavior :/
> 
> IDK, why on earth would anyone want to do this? Using VFIO forces all
> the memory to become resident so what was the point of making it file
> backed in the first place?

As I said, copy-and paste, incremental changes to domain XMLs. I've seen 
some crazy domain XMLs in bug reports.

> 
> I'm skeptical there are real users even if it now requires special
> steps to be crashy/corrupty.

In any case, I think we should document the possible implications of 
this patch. I gave one use case that could be broken.

> 
>>>> Sure, we could warn, or convert individual users using a flag (io_uring).
>>>> But maybe we should invest more energy on a fix?
>>>
>>> It has been years now, I think we need to admit a fix is still years
>>> away. Blocking the security problem may even motivate more people to
>>> work on a fix.
>>
>> Maybe we should make this a topic this year at LSF/MM (again?). At least we
>> learned a lot about GUP, what might work, what might not work, and got a
>> depper understanding (+ motivation to fix? :) ) the issue at hand.
> 
> We keep having the topic.. This is the old argument that the FS people
> say the MM isn't following its inode and dirty lifetime rules and the
> MM people say the FS isn't following its refcounting rules <shrug>

:/ so we have to discuss it ... again I guess.

> 
>>> Security is the primary case where we have historically closed uAPI
>>> items.
>>
>> As this patch
>>
>> 1) Does not tackle GUP-fast
>> 2) Does not take care of !FOLL_LONGTERM
>>
>> I am not convinced by the security argument in regard to this patch.
> 
> It is incremental and a temperature check to see what kind of real
> users exist. We have no idea right now, just speculation.

Right, but again, if we start talking about security it's a different 
thing IMHO.

>> Everything else sounds like band-aids to me, is insufficient, and might
>> cause more harm than actually help IMHO. Especially the gup-fast case is
>> extremely easy to work-around in malicious user space.
> 
> It is true this patch should probably block gup_fast when using
> FOLL_LONGTERM as well, just like we used to do for the DAX check.

Then we'd at least fix the security issue for all FOLL_LONGTERM completely.
David Hildenbrand April 28, 2023, 3:43 p.m. UTC | #20
On 28.04.23 17:34, David Hildenbrand wrote:
> On 28.04.23 17:33, Lorenzo Stoakes wrote:
>> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
>>>>>
>>>>> Security is the primary case where we have historically closed uAPI
>>>>> items.
>>>>
>>>> As this patch
>>>>
>>>> 1) Does not tackle GUP-fast
>>>> 2) Does not take care of !FOLL_LONGTERM
>>>>
>>>> I am not convinced by the security argument in regard to this patch.
>>>>
>>>>
>>>> If we want to sells this as a security thing, we have to block it
>>>> *completely* and then CC stable.
>>>
>>> Regarding GUP-fast, to fix the issue there as well, I guess we could do
>>> something similar as I did in gup_must_unshare():
>>>
>>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
>>> fallback to ordinary GUP. IOW, if we don't know, better be safe.
>>
>> How do we determine it's non-anon in the first place? The check is on the
>> VMA. We could do it by following page tables down to folio and checking
>> folio->mapping for PAGE_MAPPING_ANON I suppose?
> 
> PageAnon(page) can be called from GUP-fast after grabbing a reference.
> See gup_must_unshare().

IIRC, PageHuge() can also be called from GUP-fast and could special-case 
hugetlb eventually, as it's table while we hold a (temporary) reference. 
Shmem might be not so easy ...
Lorenzo Stoakes April 28, 2023, 3:50 p.m. UTC | #21
On Fri, Apr 28, 2023 at 05:33:17PM +0200, David Hildenbrand wrote:
> On 28.04.23 17:24, Lorenzo Stoakes wrote:
> > On Fri, Apr 28, 2023 at 05:13:07PM +0200, David Hildenbrand wrote:
> > > [...]
> > >
> > > > > This change has the potential to break existing setups. Simple example:
> > > > > libvirt domains configured for file-backed VM memory that also has a vfio
> > > > > device configured. It can easily be configured by users (evolving VM
> > > > > configuration, copy-paste etc.). And it works from a VM perspective, because
> > > > > the guest memory is essentially stale once the VM is shutdown and the pages
> > > > > were unpinned. At least we're not concerned about stale data on disk.
> > > > >
> > > > > With your changes, such VMs would no longer start, breaking existing user
> > > > > setups with a kernel update.
> > > >
> > > > Which vfio vm_ops are we talking about? vfio_pci_mmap_ops for example
> > > > doesn't specify page_mkwrite or pfn_mkwrite. Unless you mean some arbitrary
> > > > file system in the guest?
> > >
> > > Sorry, you define a VM to have its memory backed by VM memory and, at the
> > > same time, define a vfio-pci device for your VM, which will end up long-term
> > > pinning the VM memory.
> >
>
> "memory backed by file memory", I guess you figured that out :)

Ack yeah I vaguely assumed this was what you meant :) as in a virtualised 'file'
system that ultimately actually is in reality memory backed but not from guest's
persective.

>
> > Ah ack. Jason seemed concerned that this was already a broken case, I guess
> > that's one for you two to hash out...
> >
> > >
> > > >
> > > > I may well be missing context on this so forgive me if I'm being a little
> > > > dumb here, but it'd be good to get a specific example.
> > >
> > > I was giving to little details ;)
> > >
> > > [...]
> > >
> > > > >
> > > > > I know, Jason und John will disagree, but I don't think we want to be very
> > > > > careful with changing the default.
> > > > >
> > > > > Sure, we could warn, or convert individual users using a flag (io_uring).
> > > > > But maybe we should invest more energy on a fix?
> > > >
> > > > This is proactively blocking a cleanup (eliminating vmas) that I believe
> > > > will be useful in moving things forward. I am not against an opt-in option
> > > > (I have been responding to community feedback in adapting my approach),
> > > > which is the way I implemented it all the way back then :)
> > >
> > > There are alternatives: just use a flag as Jason initially suggested and use
> > > that in io_uring code. Then, you can also bail out on the GUP-fast path as
> > > "cannot support it right now, never do GUP-fast".
> >
> > I already implemented the alternatives (look back through revisions to see :)
> > and there were objections for various reasons.
> >
> > Personally my preference is to provide a FOLL_SAFE_FILE_WRITE flag or such and
> > replace the FOLL_LONGTERM check with this (that'll automatically get rejected
> > for GUP-fast so the GUP-fast conundrum wouldn't be a thing either).
> >
> > GUP-fast is a problem as you say,, but it feels like a fundamental issue with
> > GUP-fast as a whole since you don't get to look at a VMA since you can't take
> > the mmap_lock. You could just look at the folio->mapping once you've walked the
> > page tables and say 'I'm out' if FOLL_WRITE and it's non-anon if that's what
> > you're suggesting?
>
> See my other reply, kind-of yes. Like we do with gup_must_unshare(). I'm
> only concerned about how to keep GUP-fast working on hugetlb and shmem.
>
> >
> > I'm not against that change but this being incremental, I think that would be a
> > further increment.
>
> If we want to fix a security issue, as Jason said, incremental is IMHO the
> wrong approach.
>
> It's often too tempting to ignore the hard part and fix the easy part,
> making the hard part an increment for the future that nobody will really
> implement ... because it's hard.
> [...]
>
> > > >
> > > > But given we know this is both entirely broken and a potential security
> > > > issue, and FOLL_LONGTERM is about as egregious as you can get (user
> > > > explicitly saying they'll hold write access indefinitely) I feel it is an
> > > > important improvement and makes clear that this is not an acceptable usage.
> > > >
> > > > I see Jason has said more on this also :)
> > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> > > > > > ---
> > > > > >     include/linux/mm.h |  1 +
> > > > > >     mm/gup.c           | 41 ++++++++++++++++++++++++++++++++++++++++-
> > > > > >     mm/mmap.c          | 36 +++++++++++++++++++++++++++---------
> > > > > >     3 files changed, 68 insertions(+), 10 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > > index 37554b08bb28..f7da02fc89c6 100644
> > > > > > --- a/include/linux/mm.h
> > > > > > +++ b/include/linux/mm.h
> > > > > > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
> > > > > >     #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
> > > > > >     					    MM_CP_UFFD_WP_RESOLVE)
> > > > > >
> > > > > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
> > > > > >     int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
> > > > > >     static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
> > > > > >     {
> > > > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > > > index 1f72a717232b..d36a5db9feb1 100644
> > > > > > --- a/mm/gup.c
> > > > > > +++ b/mm/gup.c
> > > > > > @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma,
> > > > > >     	return 0;
> > > > > >     }
> > > > > >
> > > > > > +/*
> > > > > > + * Writing to file-backed mappings which require folio dirty tracking using GUP
> > > > > > + * is a fundamentally broken operation, as kernel write access to GUP mappings
> > > > > > + * do not adhere to the semantics expected by a file system.
> > > > > > + *
> > > > > > + * Consider the following scenario:-
> > > > > > + *
> > > > > > + * 1. A folio is written to via GUP which write-faults the memory, notifying
> > > > > > + *    the file system and dirtying the folio.
> > > > > > + * 2. Later, writeback is triggered, resulting in the folio being cleaned and
> > > > > > + *    the PTE being marked read-only.
> > > > > > + * 3. The GUP caller writes to the folio, as it is mapped read/write via the
> > > > > > + *    direct mapping.
> > > > > > + * 4. The GUP caller, now done with the page, unpins it and sets it dirty
> > > > > > + *    (though it does not have to).
> > > > > > + *
> > > > > > + * This results in both data being written to a folio without writenotify, and
> > > > > > + * the folio being dirtied unexpectedly (if the caller decides to do so).
> > > > > > + */
> > > > > > +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
> > > > > > +					   unsigned long gup_flags)
> > > > > > +{
> > > > > > +	/* If we aren't pinning then no problematic write can occur. */
> > > > > > +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> > > > > > +		return true;
> > > > >
> > > > > FOLL_LONGTERM only applies to FOLL_PIN. This check can be dropped.
> > > >
> > > > I understand that of course (well maybe not of course, but I mean I do, I
> > > > have oodles of diagrams referencing this int he book :) This is intended to
> > > > document the fact that the check isn't relevant if we don't pin at all,
> > > > e.g. reading this you see:-
> > > >
> > > > - (implicit) if not writing or anon we're good
> > > > - if not pin we're good
> > > > - ok we are only currently checking one especially egregious case
> > > > - finally, perform the dirty tracking check.
> > > >
> > > > So this is intentional.
> > > >
> > > > >
> > > > > > +
> > > > > > +	/* We limit this check to the most egregious case - a long term pin. */
> > > > > > +	if (!(gup_flags & FOLL_LONGTERM))
> > > > > > +		return true;
> > > > > > +
> > > > > > +	/* If the VMA requires dirty tracking then GUP will be problematic. */
> > > > > > +	return vma_needs_dirty_tracking(vma);
> > > > > > +}
> > > > > > +
> > > > > >     static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> > > > > >     {
> > > > > >     	vm_flags_t vm_flags = vma->vm_flags;
> > > > > >     	int write = (gup_flags & FOLL_WRITE);
> > > > > >     	int foreign = (gup_flags & FOLL_REMOTE);
> > > > > > +	bool vma_anon = vma_is_anonymous(vma);
> > > > > >
> > > > > >     	if (vm_flags & (VM_IO | VM_PFNMAP))
> > > > > >     		return -EFAULT;
> > > > > >
> > > > > > -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
> > > > > > +	if ((gup_flags & FOLL_ANON) && !vma_anon)
> > > > > >     		return -EFAULT;
> > > > > >
> > > > > >     	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> > > > > > @@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> > > > > >     		return -EFAULT;
> > > > > >
> > > > > >     	if (write) {
> > > > > > +		if (!vma_anon &&
> > > > > > +		    !writeable_file_mapping_allowed(vma, gup_flags))
> > > > > > +			return -EFAULT;
> > > > > > +
> > > > > >     		if (!(vm_flags & VM_WRITE)) {
> > > > > >     			if (!(gup_flags & FOLL_FORCE))
> > > > > >     				return -EFAULT;
> > > > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > > > index 536bbb8fa0ae..7b6344d1832a 100644
> > > > > > --- a/mm/mmap.c
> > > > >
> > > > >
> > > > > I'm probably missing something, why don't we have to handle GUP-fast (having
> > > > > said that, it's hard to handle ;) )? The sequence you describe above should
> > > > > apply to GUP-fast as well, no?
> > > > >
> > > > > 1) Pin writable mapped page using GUP-fast
> > > > > 2) Trigger writeback
> > > > > 3) Write to page via pin
> > > > > 4) Unpin and set dirty
> > > >
> > > > You're right, and this is an excellent point. I worry about other GUP use
> > > > cases too, but we're a bit out of luck there because we don't get to check
> > > > the VMA _at all_ (which opens yet another Pandora's box about how safe it
> > > > is to do unlocked pinning :)
> > > >
> > > > But again, this comes down to the fact we're trying to make things
> > > > _incrementally__ better rather than throwing our hands up and saying one
> > > > day my ship will come in...
> > >
> > > That's not how security fixes are supposed to work IMHO, sorry.
> >
> > Sure, but I don't think the 'let things continue to be terribly broken for X
> > more years' is also a great approach.
>
> Not at all, people (including me) were simply not aware that it is that much
> of an (security) issue because I never saw any real bug reports (or CVE
> numbers) and only herd John talk about possible fixes a year ago :)
>
> So I'm saying we either try to block it completely or finally look into
> fixing it for good. I'm not a friend of anything in between.
>
> You don't gain a lot of security by locking the front door but knowingly
> leaving the back door unlocked.
>
> >
> > Personally I come at this from the 'I just want my vmas patch series' unblocked
> > perspective :) and feel there's a functional aspect here too.
>
> I know, it always gets messy when touching such sensible topics :P

I feel that several people owe me drinks at LSF/MM :P

To cut a long story short to your other points, I'm _really_ leaning
towards an opt-in variant of this change that we just hand to io_uring to
make everything simple with minimum risk (if Jens was also open to this
idea, it'd simply be deleting the open coded vma checks there and adding
FOLL_SAFE_FILE_WRITE).

That way we can save the delightful back and forth for another time while
adding a useful feature and documenting the issue.

Altneratively I could try to adapt this to also do the GUP-fast check,
hoping that no FOLL_FAST_ONLY users would get nixed (I'd have to check who
uses that). The others should just get degraded to a standard GUP right?

I feel these various series have really helped beat out some details about
GUP, so as to your point on another thread (trying to reduce noise here
:P), I think discussion at LSF/MM is also a sensible idea, also you know,
if beers were bought too it could all work out nicely :]

>
> --
> Thanks,
>
> David / dhildenb
>
Peter Xu April 28, 2023, 3:56 p.m. UTC | #22
On Fri, Apr 28, 2023 at 05:34:35PM +0200, David Hildenbrand wrote:
> On 28.04.23 17:33, Lorenzo Stoakes wrote:
> > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
> > > > > 
> > > > > Security is the primary case where we have historically closed uAPI
> > > > > items.
> > > > 
> > > > As this patch
> > > > 
> > > > 1) Does not tackle GUP-fast
> > > > 2) Does not take care of !FOLL_LONGTERM
> > > > 
> > > > I am not convinced by the security argument in regard to this patch.
> > > > 
> > > > 
> > > > If we want to sells this as a security thing, we have to block it
> > > > *completely* and then CC stable.
> > > 
> > > Regarding GUP-fast, to fix the issue there as well, I guess we could do
> > > something similar as I did in gup_must_unshare():
> > > 
> > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
> > > fallback to ordinary GUP. IOW, if we don't know, better be safe.
> > 
> > How do we determine it's non-anon in the first place? The check is on the
> > VMA. We could do it by following page tables down to folio and checking
> > folio->mapping for PAGE_MAPPING_ANON I suppose?
> 
> PageAnon(page) can be called from GUP-fast after grabbing a reference. See
> gup_must_unshare().

Hmm.. Is it a good idea at all to sacrifise all "!anon" fast-gups for this?
People will silently got degrade even on legal pins on shmem/hugetlb, I
think, which seems to be still a very major use case.
David Hildenbrand April 28, 2023, 4 p.m. UTC | #23
[...]

>>>
>>> Personally I come at this from the 'I just want my vmas patch series' unblocked
>>> perspective :) and feel there's a functional aspect here too.
>>
>> I know, it always gets messy when touching such sensible topics :P
> 
> I feel that several people owe me drinks at LSF/MM :P
> 
> To cut a long story short to your other points, I'm _really_ leaning
> towards an opt-in variant of this change that we just hand to io_uring to
> make everything simple with minimum risk (if Jens was also open to this
> idea, it'd simply be deleting the open coded vma checks there and adding
> FOLL_SAFE_FILE_WRITE).
> 
> That way we can save the delightful back and forth for another time while
> adding a useful feature and documenting the issue.

Just for the records: I'm not opposed to disabling it system-wide, 
especially once this is an actual security issue and can bring down the 
machine easily (thanks to Jason for raising the security aspect). I just 
wanted to raise awareness that there might be users affected ...

Sure, we could glue this to some system knob like Jason said, if we want 
to play safe.

> 
> Altneratively I could try to adapt this to also do the GUP-fast check,
> hoping that no FOLL_FAST_ONLY users would get nixed (I'd have to check who
> uses that). The others should just get degraded to a standard GUP right?

Yes. When you need the VMA to make a decision, fallback to standard GUP.

The only problematic part is something like get_user_pages_fast_only(), 
that would observe a change. But KVM never passes FOLL_LONGTERM, so at 
least in that context the change should be fine I guess.

The performance concern is the most problematic thing (how to identify 
shmem pages).

> 
> I feel these various series have really helped beat out some details about
> GUP, so as to your point on another thread (trying to reduce noise here
> :P), I think discussion at LSF/MM is also a sensible idea, also you know,
> if beers were bought too it could all work out nicely :]

The issue is, that GUP is so complicated, that each and every MM 
developer familiar with GUP has something to add :P

What stood out to me is that we disallow something for ordinary GUP but 
disallow it for GUP-fast, which looks very odd.

So sorry again for jumping in late ...
David Hildenbrand April 28, 2023, 4:02 p.m. UTC | #24
On 28.04.23 17:56, Peter Xu wrote:
> On Fri, Apr 28, 2023 at 05:34:35PM +0200, David Hildenbrand wrote:
>> On 28.04.23 17:33, Lorenzo Stoakes wrote:
>>> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
>>>>>>
>>>>>> Security is the primary case where we have historically closed uAPI
>>>>>> items.
>>>>>
>>>>> As this patch
>>>>>
>>>>> 1) Does not tackle GUP-fast
>>>>> 2) Does not take care of !FOLL_LONGTERM
>>>>>
>>>>> I am not convinced by the security argument in regard to this patch.
>>>>>
>>>>>
>>>>> If we want to sells this as a security thing, we have to block it
>>>>> *completely* and then CC stable.
>>>>
>>>> Regarding GUP-fast, to fix the issue there as well, I guess we could do
>>>> something similar as I did in gup_must_unshare():
>>>>
>>>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
>>>> fallback to ordinary GUP. IOW, if we don't know, better be safe.
>>>
>>> How do we determine it's non-anon in the first place? The check is on the
>>> VMA. We could do it by following page tables down to folio and checking
>>> folio->mapping for PAGE_MAPPING_ANON I suppose?
>>
>> PageAnon(page) can be called from GUP-fast after grabbing a reference. See
>> gup_must_unshare().
> 
> Hmm.. Is it a good idea at all to sacrifise all "!anon" fast-gups for this?
> People will silently got degrade even on legal pins on shmem/hugetlb, I
> think, which seems to be still a very major use case.
> 

Right. Optimizing for hugetlb should be easy. Shmem is problematic.

I once raised to John that PageAnonExclusive is essentially a "anon page 
is pinnable" flag. Too bad we don't have spare page flags ;)
Kirill A. Shutemov April 28, 2023, 4:09 p.m. UTC | #25
On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote:
> On 28.04.23 17:34, David Hildenbrand wrote:
> > On 28.04.23 17:33, Lorenzo Stoakes wrote:
> > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
> > > > > > 
> > > > > > Security is the primary case where we have historically closed uAPI
> > > > > > items.
> > > > > 
> > > > > As this patch
> > > > > 
> > > > > 1) Does not tackle GUP-fast
> > > > > 2) Does not take care of !FOLL_LONGTERM
> > > > > 
> > > > > I am not convinced by the security argument in regard to this patch.
> > > > > 
> > > > > 
> > > > > If we want to sells this as a security thing, we have to block it
> > > > > *completely* and then CC stable.
> > > > 
> > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do
> > > > something similar as I did in gup_must_unshare():
> > > > 
> > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
> > > > fallback to ordinary GUP. IOW, if we don't know, better be safe.
> > > 
> > > How do we determine it's non-anon in the first place? The check is on the
> > > VMA. We could do it by following page tables down to folio and checking
> > > folio->mapping for PAGE_MAPPING_ANON I suppose?
> > 
> > PageAnon(page) can be called from GUP-fast after grabbing a reference.
> > See gup_must_unshare().
> 
> IIRC, PageHuge() can also be called from GUP-fast and could special-case
> hugetlb eventually, as it's table while we hold a (temporary) reference.
> Shmem might be not so easy ...

page->mapping->a_ops should be enough to whitelist whatever fs you want.
David Hildenbrand April 28, 2023, 4:13 p.m. UTC | #26
On 28.04.23 18:09, Kirill A . Shutemov wrote:
> On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote:
>> On 28.04.23 17:34, David Hildenbrand wrote:
>>> On 28.04.23 17:33, Lorenzo Stoakes wrote:
>>>> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
>>>>>>>
>>>>>>> Security is the primary case where we have historically closed uAPI
>>>>>>> items.
>>>>>>
>>>>>> As this patch
>>>>>>
>>>>>> 1) Does not tackle GUP-fast
>>>>>> 2) Does not take care of !FOLL_LONGTERM
>>>>>>
>>>>>> I am not convinced by the security argument in regard to this patch.
>>>>>>
>>>>>>
>>>>>> If we want to sells this as a security thing, we have to block it
>>>>>> *completely* and then CC stable.
>>>>>
>>>>> Regarding GUP-fast, to fix the issue there as well, I guess we could do
>>>>> something similar as I did in gup_must_unshare():
>>>>>
>>>>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
>>>>> fallback to ordinary GUP. IOW, if we don't know, better be safe.
>>>>
>>>> How do we determine it's non-anon in the first place? The check is on the
>>>> VMA. We could do it by following page tables down to folio and checking
>>>> folio->mapping for PAGE_MAPPING_ANON I suppose?
>>>
>>> PageAnon(page) can be called from GUP-fast after grabbing a reference.
>>> See gup_must_unshare().
>>
>> IIRC, PageHuge() can also be called from GUP-fast and could special-case
>> hugetlb eventually, as it's table while we hold a (temporary) reference.
>> Shmem might be not so easy ...
> 
> page->mapping->a_ops should be enough to whitelist whatever fs you want.
> 

The issue is how to stabilize that from GUP-fast, such that we can 
safely dereference the mapping. Any idea?

At least for anon page I know that page->mapping only gets cleared when 
freeing the page, and we don't dereference the mapping but only check a 
single flag stored alongside the mapping. Therefore, PageAnon() is fine 
in GUP-fast context.
Kirill A. Shutemov April 28, 2023, 4:22 p.m. UTC | #27
On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote:
> On 28.04.23 18:09, Kirill A . Shutemov wrote:
> > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote:
> > > On 28.04.23 17:34, David Hildenbrand wrote:
> > > > On 28.04.23 17:33, Lorenzo Stoakes wrote:
> > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
> > > > > > > > 
> > > > > > > > Security is the primary case where we have historically closed uAPI
> > > > > > > > items.
> > > > > > > 
> > > > > > > As this patch
> > > > > > > 
> > > > > > > 1) Does not tackle GUP-fast
> > > > > > > 2) Does not take care of !FOLL_LONGTERM
> > > > > > > 
> > > > > > > I am not convinced by the security argument in regard to this patch.
> > > > > > > 
> > > > > > > 
> > > > > > > If we want to sells this as a security thing, we have to block it
> > > > > > > *completely* and then CC stable.
> > > > > > 
> > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do
> > > > > > something similar as I did in gup_must_unshare():
> > > > > > 
> > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
> > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe.
> > > > > 
> > > > > How do we determine it's non-anon in the first place? The check is on the
> > > > > VMA. We could do it by following page tables down to folio and checking
> > > > > folio->mapping for PAGE_MAPPING_ANON I suppose?
> > > > 
> > > > PageAnon(page) can be called from GUP-fast after grabbing a reference.
> > > > See gup_must_unshare().
> > > 
> > > IIRC, PageHuge() can also be called from GUP-fast and could special-case
> > > hugetlb eventually, as it's table while we hold a (temporary) reference.
> > > Shmem might be not so easy ...
> > 
> > page->mapping->a_ops should be enough to whitelist whatever fs you want.
> > 
> 
> The issue is how to stabilize that from GUP-fast, such that we can safely
> dereference the mapping. Any idea?
> 
> At least for anon page I know that page->mapping only gets cleared when
> freeing the page, and we don't dereference the mapping but only check a
> single flag stored alongside the mapping. Therefore, PageAnon() is fine in
> GUP-fast context.

What codepath you are worry about that clears ->mapping on pages with
non-zero refcount?

I can only think of truncate (and punch hole). READ_ONCE(page->mapping)
and fail GUP_fast if it is NULL should be fine, no?

I guess we should consider if the inode can be freed from under us and the
mapping pointer becomes dangling. But I think we should be fine here too:
VMA pins inode and VMA cannot go away from under GUP.

Hm?

(I didn't look close at GUP for a while and my reasoning might be off.)
Peter Xu April 28, 2023, 4:39 p.m. UTC | #28
On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote:
> On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote:
> > On 28.04.23 18:09, Kirill A . Shutemov wrote:
> > > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote:
> > > > On 28.04.23 17:34, David Hildenbrand wrote:
> > > > > On 28.04.23 17:33, Lorenzo Stoakes wrote:
> > > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
> > > > > > > > > 
> > > > > > > > > Security is the primary case where we have historically closed uAPI
> > > > > > > > > items.
> > > > > > > > 
> > > > > > > > As this patch
> > > > > > > > 
> > > > > > > > 1) Does not tackle GUP-fast
> > > > > > > > 2) Does not take care of !FOLL_LONGTERM
> > > > > > > > 
> > > > > > > > I am not convinced by the security argument in regard to this patch.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > If we want to sells this as a security thing, we have to block it
> > > > > > > > *completely* and then CC stable.
> > > > > > > 
> > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do
> > > > > > > something similar as I did in gup_must_unshare():
> > > > > > > 
> > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
> > > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe.
> > > > > > 
> > > > > > How do we determine it's non-anon in the first place? The check is on the
> > > > > > VMA. We could do it by following page tables down to folio and checking
> > > > > > folio->mapping for PAGE_MAPPING_ANON I suppose?
> > > > > 
> > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference.
> > > > > See gup_must_unshare().
> > > > 
> > > > IIRC, PageHuge() can also be called from GUP-fast and could special-case
> > > > hugetlb eventually, as it's table while we hold a (temporary) reference.
> > > > Shmem might be not so easy ...
> > > 
> > > page->mapping->a_ops should be enough to whitelist whatever fs you want.
> > > 
> > 
> > The issue is how to stabilize that from GUP-fast, such that we can safely
> > dereference the mapping. Any idea?
> > 
> > At least for anon page I know that page->mapping only gets cleared when
> > freeing the page, and we don't dereference the mapping but only check a
> > single flag stored alongside the mapping. Therefore, PageAnon() is fine in
> > GUP-fast context.
> 
> What codepath you are worry about that clears ->mapping on pages with
> non-zero refcount?
> 
> I can only think of truncate (and punch hole). READ_ONCE(page->mapping)
> and fail GUP_fast if it is NULL should be fine, no?
> 
> I guess we should consider if the inode can be freed from under us and the
> mapping pointer becomes dangling. But I think we should be fine here too:
> VMA pins inode and VMA cannot go away from under GUP.

Can vma still go away if during a fast-gup?

> 
> Hm?
> 
> (I didn't look close at GUP for a while and my reasoning might be off.)

Thanks,
David Hildenbrand April 28, 2023, 4:51 p.m. UTC | #29
On 28.04.23 18:39, Peter Xu wrote:
> On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote:
>> On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote:
>>> On 28.04.23 18:09, Kirill A . Shutemov wrote:
>>>> On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote:
>>>>> On 28.04.23 17:34, David Hildenbrand wrote:
>>>>>> On 28.04.23 17:33, Lorenzo Stoakes wrote:
>>>>>>> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
>>>>>>>>>>
>>>>>>>>>> Security is the primary case where we have historically closed uAPI
>>>>>>>>>> items.
>>>>>>>>>
>>>>>>>>> As this patch
>>>>>>>>>
>>>>>>>>> 1) Does not tackle GUP-fast
>>>>>>>>> 2) Does not take care of !FOLL_LONGTERM
>>>>>>>>>
>>>>>>>>> I am not convinced by the security argument in regard to this patch.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If we want to sells this as a security thing, we have to block it
>>>>>>>>> *completely* and then CC stable.
>>>>>>>>
>>>>>>>> Regarding GUP-fast, to fix the issue there as well, I guess we could do
>>>>>>>> something similar as I did in gup_must_unshare():
>>>>>>>>
>>>>>>>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
>>>>>>>> fallback to ordinary GUP. IOW, if we don't know, better be safe.
>>>>>>>
>>>>>>> How do we determine it's non-anon in the first place? The check is on the
>>>>>>> VMA. We could do it by following page tables down to folio and checking
>>>>>>> folio->mapping for PAGE_MAPPING_ANON I suppose?
>>>>>>
>>>>>> PageAnon(page) can be called from GUP-fast after grabbing a reference.
>>>>>> See gup_must_unshare().
>>>>>
>>>>> IIRC, PageHuge() can also be called from GUP-fast and could special-case
>>>>> hugetlb eventually, as it's table while we hold a (temporary) reference.
>>>>> Shmem might be not so easy ...
>>>>
>>>> page->mapping->a_ops should be enough to whitelist whatever fs you want.
>>>>
>>>
>>> The issue is how to stabilize that from GUP-fast, such that we can safely
>>> dereference the mapping. Any idea?
>>>
>>> At least for anon page I know that page->mapping only gets cleared when
>>> freeing the page, and we don't dereference the mapping but only check a
>>> single flag stored alongside the mapping. Therefore, PageAnon() is fine in
>>> GUP-fast context.
>>
>> What codepath you are worry about that clears ->mapping on pages with
>> non-zero refcount?
>>
>> I can only think of truncate (and punch hole). READ_ONCE(page->mapping)
>> and fail GUP_fast if it is NULL should be fine, no?
>>
>> I guess we should consider if the inode can be freed from under us and the
>> mapping pointer becomes dangling. But I think we should be fine here too:
>> VMA pins inode and VMA cannot go away from under GUP.
> 
> Can vma still go away if during a fast-gup?
> 

So, after we grabbed the page and made sure the the PTE didn't change 
(IOW, the PTE was stable while we processed it), the page can get 
unmapped (but not freed, because we hold a reference) and the VMA can 
theoretically go away (and as far as I understand, nothing stops the 
file from getting deleted, truncated etc).

So we might be looking at folio->mapping and the VMA is no longer there. 
Maybe even the file is no longer there.
Kirill A. Shutemov April 28, 2023, 4:56 p.m. UTC | #30
On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote:
> On 28.04.23 18:39, Peter Xu wrote:
> > On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote:
> > > On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote:
> > > > On 28.04.23 18:09, Kirill A . Shutemov wrote:
> > > > > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote:
> > > > > > On 28.04.23 17:34, David Hildenbrand wrote:
> > > > > > > On 28.04.23 17:33, Lorenzo Stoakes wrote:
> > > > > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
> > > > > > > > > > > 
> > > > > > > > > > > Security is the primary case where we have historically closed uAPI
> > > > > > > > > > > items.
> > > > > > > > > > 
> > > > > > > > > > As this patch
> > > > > > > > > > 
> > > > > > > > > > 1) Does not tackle GUP-fast
> > > > > > > > > > 2) Does not take care of !FOLL_LONGTERM
> > > > > > > > > > 
> > > > > > > > > > I am not convinced by the security argument in regard to this patch.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > If we want to sells this as a security thing, we have to block it
> > > > > > > > > > *completely* and then CC stable.
> > > > > > > > > 
> > > > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do
> > > > > > > > > something similar as I did in gup_must_unshare():
> > > > > > > > > 
> > > > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
> > > > > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe.
> > > > > > > > 
> > > > > > > > How do we determine it's non-anon in the first place? The check is on the
> > > > > > > > VMA. We could do it by following page tables down to folio and checking
> > > > > > > > folio->mapping for PAGE_MAPPING_ANON I suppose?
> > > > > > > 
> > > > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference.
> > > > > > > See gup_must_unshare().
> > > > > > 
> > > > > > IIRC, PageHuge() can also be called from GUP-fast and could special-case
> > > > > > hugetlb eventually, as it's table while we hold a (temporary) reference.
> > > > > > Shmem might be not so easy ...
> > > > > 
> > > > > page->mapping->a_ops should be enough to whitelist whatever fs you want.
> > > > > 
> > > > 
> > > > The issue is how to stabilize that from GUP-fast, such that we can safely
> > > > dereference the mapping. Any idea?
> > > > 
> > > > At least for anon page I know that page->mapping only gets cleared when
> > > > freeing the page, and we don't dereference the mapping but only check a
> > > > single flag stored alongside the mapping. Therefore, PageAnon() is fine in
> > > > GUP-fast context.
> > > 
> > > What codepath you are worry about that clears ->mapping on pages with
> > > non-zero refcount?
> > > 
> > > I can only think of truncate (and punch hole). READ_ONCE(page->mapping)
> > > and fail GUP_fast if it is NULL should be fine, no?
> > > 
> > > I guess we should consider if the inode can be freed from under us and the
> > > mapping pointer becomes dangling. But I think we should be fine here too:
> > > VMA pins inode and VMA cannot go away from under GUP.
> > 
> > Can vma still go away if during a fast-gup?
> > 
> 
> So, after we grabbed the page and made sure the the PTE didn't change (IOW,
> the PTE was stable while we processed it), the page can get unmapped (but
> not freed, because we hold a reference) and the VMA can theoretically go
> away (and as far as I understand, nothing stops the file from getting
> deleted, truncated etc).
> 
> So we might be looking at folio->mapping and the VMA is no longer there.
> Maybe even the file is no longer there.

No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And
TLB flushing is serialized against GUP_fast().
Lorenzo Stoakes April 28, 2023, 5:01 p.m. UTC | #31
On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote:
> On 28.04.23 18:39, Peter Xu wrote:
> > On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote:
> > > On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote:
> > > > On 28.04.23 18:09, Kirill A . Shutemov wrote:
> > > > > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote:
> > > > > > On 28.04.23 17:34, David Hildenbrand wrote:
> > > > > > > On 28.04.23 17:33, Lorenzo Stoakes wrote:
> > > > > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
> > > > > > > > > > >
> > > > > > > > > > > Security is the primary case where we have historically closed uAPI
> > > > > > > > > > > items.
> > > > > > > > > >
> > > > > > > > > > As this patch
> > > > > > > > > >
> > > > > > > > > > 1) Does not tackle GUP-fast
> > > > > > > > > > 2) Does not take care of !FOLL_LONGTERM
> > > > > > > > > >
> > > > > > > > > > I am not convinced by the security argument in regard to this patch.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > If we want to sells this as a security thing, we have to block it
> > > > > > > > > > *completely* and then CC stable.
> > > > > > > > >
> > > > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do
> > > > > > > > > something similar as I did in gup_must_unshare():
> > > > > > > > >
> > > > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
> > > > > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe.
> > > > > > > >
> > > > > > > > How do we determine it's non-anon in the first place? The check is on the
> > > > > > > > VMA. We could do it by following page tables down to folio and checking
> > > > > > > > folio->mapping for PAGE_MAPPING_ANON I suppose?
> > > > > > >
> > > > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference.
> > > > > > > See gup_must_unshare().
> > > > > >
> > > > > > IIRC, PageHuge() can also be called from GUP-fast and could special-case
> > > > > > hugetlb eventually, as it's table while we hold a (temporary) reference.
> > > > > > Shmem might be not so easy ...
> > > > >
> > > > > page->mapping->a_ops should be enough to whitelist whatever fs you want.
> > > > >
> > > >
> > > > The issue is how to stabilize that from GUP-fast, such that we can safely
> > > > dereference the mapping. Any idea?
> > > >
> > > > At least for anon page I know that page->mapping only gets cleared when
> > > > freeing the page, and we don't dereference the mapping but only check a
> > > > single flag stored alongside the mapping. Therefore, PageAnon() is fine in
> > > > GUP-fast context.
> > >
> > > What codepath you are worry about that clears ->mapping on pages with
> > > non-zero refcount?
> > >
> > > I can only think of truncate (and punch hole). READ_ONCE(page->mapping)
> > > and fail GUP_fast if it is NULL should be fine, no?
> > >
> > > I guess we should consider if the inode can be freed from under us and the
> > > mapping pointer becomes dangling. But I think we should be fine here too:
> > > VMA pins inode and VMA cannot go away from under GUP.
> >
> > Can vma still go away if during a fast-gup?
> >
>
> So, after we grabbed the page and made sure the the PTE didn't change (IOW,
> the PTE was stable while we processed it), the page can get unmapped (but
> not freed, because we hold a reference) and the VMA can theoretically go
> away (and as far as I understand, nothing stops the file from getting
> deleted, truncated etc).
>
> So we might be looking at folio->mapping and the VMA is no longer there.
> Maybe even the file is no longer there.
>

This shouldn't be an issue though right? Because after a pup call unlocks the
mmap_lock we're in the same situation anyway. GUP doesn't generally guarantee
the mapping remains valid, only pinning the underlying folio.

I'm thinking of respinning with a gup_fast component then, if a_ops is
sufficient to identify file systems. We'll just revert to slow path for
non-FOLL_FAST_ONLY cases.

This would at least cover both FOLL_LONGTERM angles and could provoke some
further interesting discussion :)

> --
> Thanks,
>
> David / dhildenb
>
Lorenzo Stoakes April 28, 2023, 5:01 p.m. UTC | #32
On Fri, Apr 28, 2023 at 07:56:23PM +0300, Kirill A . Shutemov wrote:
> On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote:
> > On 28.04.23 18:39, Peter Xu wrote:
> > > On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote:
> > > > On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote:
> > > > > On 28.04.23 18:09, Kirill A . Shutemov wrote:
> > > > > > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote:
> > > > > > > On 28.04.23 17:34, David Hildenbrand wrote:
> > > > > > > > On 28.04.23 17:33, Lorenzo Stoakes wrote:
> > > > > > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Security is the primary case where we have historically closed uAPI
> > > > > > > > > > > > items.
> > > > > > > > > > >
> > > > > > > > > > > As this patch
> > > > > > > > > > >
> > > > > > > > > > > 1) Does not tackle GUP-fast
> > > > > > > > > > > 2) Does not take care of !FOLL_LONGTERM
> > > > > > > > > > >
> > > > > > > > > > > I am not convinced by the security argument in regard to this patch.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > If we want to sells this as a security thing, we have to block it
> > > > > > > > > > > *completely* and then CC stable.
> > > > > > > > > >
> > > > > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do
> > > > > > > > > > something similar as I did in gup_must_unshare():
> > > > > > > > > >
> > > > > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
> > > > > > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe.
> > > > > > > > >
> > > > > > > > > How do we determine it's non-anon in the first place? The check is on the
> > > > > > > > > VMA. We could do it by following page tables down to folio and checking
> > > > > > > > > folio->mapping for PAGE_MAPPING_ANON I suppose?
> > > > > > > >
> > > > > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference.
> > > > > > > > See gup_must_unshare().
> > > > > > >
> > > > > > > IIRC, PageHuge() can also be called from GUP-fast and could special-case
> > > > > > > hugetlb eventually, as it's table while we hold a (temporary) reference.
> > > > > > > Shmem might be not so easy ...
> > > > > >
> > > > > > page->mapping->a_ops should be enough to whitelist whatever fs you want.
> > > > > >
> > > > >
> > > > > The issue is how to stabilize that from GUP-fast, such that we can safely
> > > > > dereference the mapping. Any idea?
> > > > >
> > > > > At least for anon page I know that page->mapping only gets cleared when
> > > > > freeing the page, and we don't dereference the mapping but only check a
> > > > > single flag stored alongside the mapping. Therefore, PageAnon() is fine in
> > > > > GUP-fast context.
> > > >
> > > > What codepath you are worry about that clears ->mapping on pages with
> > > > non-zero refcount?
> > > >
> > > > I can only think of truncate (and punch hole). READ_ONCE(page->mapping)
> > > > and fail GUP_fast if it is NULL should be fine, no?
> > > >
> > > > I guess we should consider if the inode can be freed from under us and the
> > > > mapping pointer becomes dangling. But I think we should be fine here too:
> > > > VMA pins inode and VMA cannot go away from under GUP.
> > >
> > > Can vma still go away if during a fast-gup?
> > >
> >
> > So, after we grabbed the page and made sure the the PTE didn't change (IOW,
> > the PTE was stable while we processed it), the page can get unmapped (but
> > not freed, because we hold a reference) and the VMA can theoretically go
> > away (and as far as I understand, nothing stops the file from getting
> > deleted, truncated etc).
> >
> > So we might be looking at folio->mapping and the VMA is no longer there.
> > Maybe even the file is no longer there.
>
> No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And
> TLB flushing is serialized against GUP_fast().
>

But the lockless path doesn't hold the PTL? So PTE can disappear at anytime,
since we use ptep_get_lockless().

But it won't matter because then the fast path will just fail anyway and can
revert back to slow one.

> --
>   Kiryl Shutsemau / Kirill A. Shutemov
David Hildenbrand April 28, 2023, 5:02 p.m. UTC | #33
On 28.04.23 18:56, Kirill A . Shutemov wrote:
> On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote:
>> On 28.04.23 18:39, Peter Xu wrote:
>>> On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote:
>>>> On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote:
>>>>> On 28.04.23 18:09, Kirill A . Shutemov wrote:
>>>>>> On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote:
>>>>>>> On 28.04.23 17:34, David Hildenbrand wrote:
>>>>>>>> On 28.04.23 17:33, Lorenzo Stoakes wrote:
>>>>>>>>> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Security is the primary case where we have historically closed uAPI
>>>>>>>>>>>> items.
>>>>>>>>>>>
>>>>>>>>>>> As this patch
>>>>>>>>>>>
>>>>>>>>>>> 1) Does not tackle GUP-fast
>>>>>>>>>>> 2) Does not take care of !FOLL_LONGTERM
>>>>>>>>>>>
>>>>>>>>>>> I am not convinced by the security argument in regard to this patch.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> If we want to sells this as a security thing, we have to block it
>>>>>>>>>>> *completely* and then CC stable.
>>>>>>>>>>
>>>>>>>>>> Regarding GUP-fast, to fix the issue there as well, I guess we could do
>>>>>>>>>> something similar as I did in gup_must_unshare():
>>>>>>>>>>
>>>>>>>>>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
>>>>>>>>>> fallback to ordinary GUP. IOW, if we don't know, better be safe.
>>>>>>>>>
>>>>>>>>> How do we determine it's non-anon in the first place? The check is on the
>>>>>>>>> VMA. We could do it by following page tables down to folio and checking
>>>>>>>>> folio->mapping for PAGE_MAPPING_ANON I suppose?
>>>>>>>>
>>>>>>>> PageAnon(page) can be called from GUP-fast after grabbing a reference.
>>>>>>>> See gup_must_unshare().
>>>>>>>
>>>>>>> IIRC, PageHuge() can also be called from GUP-fast and could special-case
>>>>>>> hugetlb eventually, as it's table while we hold a (temporary) reference.
>>>>>>> Shmem might be not so easy ...
>>>>>>
>>>>>> page->mapping->a_ops should be enough to whitelist whatever fs you want.
>>>>>>
>>>>>
>>>>> The issue is how to stabilize that from GUP-fast, such that we can safely
>>>>> dereference the mapping. Any idea?
>>>>>
>>>>> At least for anon page I know that page->mapping only gets cleared when
>>>>> freeing the page, and we don't dereference the mapping but only check a
>>>>> single flag stored alongside the mapping. Therefore, PageAnon() is fine in
>>>>> GUP-fast context.
>>>>
>>>> What codepath you are worry about that clears ->mapping on pages with
>>>> non-zero refcount?
>>>>
>>>> I can only think of truncate (and punch hole). READ_ONCE(page->mapping)
>>>> and fail GUP_fast if it is NULL should be fine, no?
>>>>
>>>> I guess we should consider if the inode can be freed from under us and the
>>>> mapping pointer becomes dangling. But I think we should be fine here too:
>>>> VMA pins inode and VMA cannot go away from under GUP.
>>>
>>> Can vma still go away if during a fast-gup?
>>>
>>
>> So, after we grabbed the page and made sure the the PTE didn't change (IOW,
>> the PTE was stable while we processed it), the page can get unmapped (but
>> not freed, because we hold a reference) and the VMA can theoretically go
>> away (and as far as I understand, nothing stops the file from getting
>> deleted, truncated etc).
>>
>> So we might be looking at folio->mapping and the VMA is no longer there.
>> Maybe even the file is no longer there.
> 
> No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And
> TLB flushing is serialized against GUP_fast().


The whole CONFIG_MMU_GATHER_RCU_TABLE_FREE handling makes the situation 
more complicated.
David Hildenbrand April 28, 2023, 5:05 p.m. UTC | #34
On 28.04.23 19:01, Lorenzo Stoakes wrote:
> On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote:
>> On 28.04.23 18:39, Peter Xu wrote:
>>> On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote:
>>>> On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote:
>>>>> On 28.04.23 18:09, Kirill A . Shutemov wrote:
>>>>>> On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote:
>>>>>>> On 28.04.23 17:34, David Hildenbrand wrote:
>>>>>>>> On 28.04.23 17:33, Lorenzo Stoakes wrote:
>>>>>>>>> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Security is the primary case where we have historically closed uAPI
>>>>>>>>>>>> items.
>>>>>>>>>>>
>>>>>>>>>>> As this patch
>>>>>>>>>>>
>>>>>>>>>>> 1) Does not tackle GUP-fast
>>>>>>>>>>> 2) Does not take care of !FOLL_LONGTERM
>>>>>>>>>>>
>>>>>>>>>>> I am not convinced by the security argument in regard to this patch.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> If we want to sells this as a security thing, we have to block it
>>>>>>>>>>> *completely* and then CC stable.
>>>>>>>>>>
>>>>>>>>>> Regarding GUP-fast, to fix the issue there as well, I guess we could do
>>>>>>>>>> something similar as I did in gup_must_unshare():
>>>>>>>>>>
>>>>>>>>>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
>>>>>>>>>> fallback to ordinary GUP. IOW, if we don't know, better be safe.
>>>>>>>>>
>>>>>>>>> How do we determine it's non-anon in the first place? The check is on the
>>>>>>>>> VMA. We could do it by following page tables down to folio and checking
>>>>>>>>> folio->mapping for PAGE_MAPPING_ANON I suppose?
>>>>>>>>
>>>>>>>> PageAnon(page) can be called from GUP-fast after grabbing a reference.
>>>>>>>> See gup_must_unshare().
>>>>>>>
>>>>>>> IIRC, PageHuge() can also be called from GUP-fast and could special-case
>>>>>>> hugetlb eventually, as it's table while we hold a (temporary) reference.
>>>>>>> Shmem might be not so easy ...
>>>>>>
>>>>>> page->mapping->a_ops should be enough to whitelist whatever fs you want.
>>>>>>
>>>>>
>>>>> The issue is how to stabilize that from GUP-fast, such that we can safely
>>>>> dereference the mapping. Any idea?
>>>>>
>>>>> At least for anon page I know that page->mapping only gets cleared when
>>>>> freeing the page, and we don't dereference the mapping but only check a
>>>>> single flag stored alongside the mapping. Therefore, PageAnon() is fine in
>>>>> GUP-fast context.
>>>>
>>>> What codepath you are worry about that clears ->mapping on pages with
>>>> non-zero refcount?
>>>>
>>>> I can only think of truncate (and punch hole). READ_ONCE(page->mapping)
>>>> and fail GUP_fast if it is NULL should be fine, no?
>>>>
>>>> I guess we should consider if the inode can be freed from under us and the
>>>> mapping pointer becomes dangling. But I think we should be fine here too:
>>>> VMA pins inode and VMA cannot go away from under GUP.
>>>
>>> Can vma still go away if during a fast-gup?
>>>
>>
>> So, after we grabbed the page and made sure the the PTE didn't change (IOW,
>> the PTE was stable while we processed it), the page can get unmapped (but
>> not freed, because we hold a reference) and the VMA can theoretically go
>> away (and as far as I understand, nothing stops the file from getting
>> deleted, truncated etc).
>>
>> So we might be looking at folio->mapping and the VMA is no longer there.
>> Maybe even the file is no longer there.
>>
> 
> This shouldn't be an issue though right? Because after a pup call unlocks the
> mmap_lock we're in the same situation anyway. GUP doesn't generally guarantee
> the mapping remains valid, only pinning the underlying folio.

Yes. But the issue here is rather dereferencing something that has 
already been freed, eventually leading to undefined behavior.

Maybe de-referencing folio->mapping is fine ... but yes, we could handle 
that optimization in a separate patch.
Lorenzo Stoakes April 28, 2023, 5:13 p.m. UTC | #35
On Fri, Apr 28, 2023 at 07:05:38PM +0200, David Hildenbrand wrote:
> On 28.04.23 19:01, Lorenzo Stoakes wrote:
> > On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote:
> > > On 28.04.23 18:39, Peter Xu wrote:
> > > > On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote:
> > > > > On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote:
> > > > > > On 28.04.23 18:09, Kirill A . Shutemov wrote:
> > > > > > > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote:
> > > > > > > > On 28.04.23 17:34, David Hildenbrand wrote:
> > > > > > > > > On 28.04.23 17:33, Lorenzo Stoakes wrote:
> > > > > > > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Security is the primary case where we have historically closed uAPI
> > > > > > > > > > > > > items.
> > > > > > > > > > > >
> > > > > > > > > > > > As this patch
> > > > > > > > > > > >
> > > > > > > > > > > > 1) Does not tackle GUP-fast
> > > > > > > > > > > > 2) Does not take care of !FOLL_LONGTERM
> > > > > > > > > > > >
> > > > > > > > > > > > I am not convinced by the security argument in regard to this patch.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > If we want to sells this as a security thing, we have to block it
> > > > > > > > > > > > *completely* and then CC stable.
> > > > > > > > > > >
> > > > > > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do
> > > > > > > > > > > something similar as I did in gup_must_unshare():
> > > > > > > > > > >
> > > > > > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
> > > > > > > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe.
> > > > > > > > > >
> > > > > > > > > > How do we determine it's non-anon in the first place? The check is on the
> > > > > > > > > > VMA. We could do it by following page tables down to folio and checking
> > > > > > > > > > folio->mapping for PAGE_MAPPING_ANON I suppose?
> > > > > > > > >
> > > > > > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference.
> > > > > > > > > See gup_must_unshare().
> > > > > > > >
> > > > > > > > IIRC, PageHuge() can also be called from GUP-fast and could special-case
> > > > > > > > hugetlb eventually, as it's table while we hold a (temporary) reference.
> > > > > > > > Shmem might be not so easy ...
> > > > > > >
> > > > > > > page->mapping->a_ops should be enough to whitelist whatever fs you want.
> > > > > > >
> > > > > >
> > > > > > The issue is how to stabilize that from GUP-fast, such that we can safely
> > > > > > dereference the mapping. Any idea?
> > > > > >
> > > > > > At least for anon page I know that page->mapping only gets cleared when
> > > > > > freeing the page, and we don't dereference the mapping but only check a
> > > > > > single flag stored alongside the mapping. Therefore, PageAnon() is fine in
> > > > > > GUP-fast context.
> > > > >
> > > > > What codepath you are worry about that clears ->mapping on pages with
> > > > > non-zero refcount?
> > > > >
> > > > > I can only think of truncate (and punch hole). READ_ONCE(page->mapping)
> > > > > and fail GUP_fast if it is NULL should be fine, no?
> > > > >
> > > > > I guess we should consider if the inode can be freed from under us and the
> > > > > mapping pointer becomes dangling. But I think we should be fine here too:
> > > > > VMA pins inode and VMA cannot go away from under GUP.
> > > >
> > > > Can vma still go away if during a fast-gup?
> > > >
> > >
> > > So, after we grabbed the page and made sure the the PTE didn't change (IOW,
> > > the PTE was stable while we processed it), the page can get unmapped (but
> > > not freed, because we hold a reference) and the VMA can theoretically go
> > > away (and as far as I understand, nothing stops the file from getting
> > > deleted, truncated etc).
> > >
> > > So we might be looking at folio->mapping and the VMA is no longer there.
> > > Maybe even the file is no longer there.
> > >
> >
> > This shouldn't be an issue though right? Because after a pup call unlocks the
> > mmap_lock we're in the same situation anyway. GUP doesn't generally guarantee
> > the mapping remains valid, only pinning the underlying folio.
>
> Yes. But the issue here is rather dereferencing something that has already
> been freed, eventually leading to undefined behavior.
>

Is that an issue with interrupts disabled though? Will block page tables being
removed and as Kirill says (sorry I maybe misinterpreted you) we should be ok.

> Maybe de-referencing folio->mapping is fine ... but yes, we could handle
> that optimization in a separate patch.
>
> --
> Thanks,
>
> David / dhildenb
>
David Hildenbrand April 28, 2023, 5:29 p.m. UTC | #36
On 28.04.23 19:13, Lorenzo Stoakes wrote:
> On Fri, Apr 28, 2023 at 07:05:38PM +0200, David Hildenbrand wrote:
>> On 28.04.23 19:01, Lorenzo Stoakes wrote:
>>> On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote:
>>>> On 28.04.23 18:39, Peter Xu wrote:
>>>>> On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote:
>>>>>> On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote:
>>>>>>> On 28.04.23 18:09, Kirill A . Shutemov wrote:
>>>>>>>> On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote:
>>>>>>>>> On 28.04.23 17:34, David Hildenbrand wrote:
>>>>>>>>>> On 28.04.23 17:33, Lorenzo Stoakes wrote:
>>>>>>>>>>> On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Security is the primary case where we have historically closed uAPI
>>>>>>>>>>>>>> items.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As this patch
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) Does not tackle GUP-fast
>>>>>>>>>>>>> 2) Does not take care of !FOLL_LONGTERM
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am not convinced by the security argument in regard to this patch.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we want to sells this as a security thing, we have to block it
>>>>>>>>>>>>> *completely* and then CC stable.
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding GUP-fast, to fix the issue there as well, I guess we could do
>>>>>>>>>>>> something similar as I did in gup_must_unshare():
>>>>>>>>>>>>
>>>>>>>>>>>> If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
>>>>>>>>>>>> fallback to ordinary GUP. IOW, if we don't know, better be safe.
>>>>>>>>>>>
>>>>>>>>>>> How do we determine it's non-anon in the first place? The check is on the
>>>>>>>>>>> VMA. We could do it by following page tables down to folio and checking
>>>>>>>>>>> folio->mapping for PAGE_MAPPING_ANON I suppose?
>>>>>>>>>>
>>>>>>>>>> PageAnon(page) can be called from GUP-fast after grabbing a reference.
>>>>>>>>>> See gup_must_unshare().
>>>>>>>>>
>>>>>>>>> IIRC, PageHuge() can also be called from GUP-fast and could special-case
>>>>>>>>> hugetlb eventually, as it's table while we hold a (temporary) reference.
>>>>>>>>> Shmem might be not so easy ...
>>>>>>>>
>>>>>>>> page->mapping->a_ops should be enough to whitelist whatever fs you want.
>>>>>>>>
>>>>>>>
>>>>>>> The issue is how to stabilize that from GUP-fast, such that we can safely
>>>>>>> dereference the mapping. Any idea?
>>>>>>>
>>>>>>> At least for anon page I know that page->mapping only gets cleared when
>>>>>>> freeing the page, and we don't dereference the mapping but only check a
>>>>>>> single flag stored alongside the mapping. Therefore, PageAnon() is fine in
>>>>>>> GUP-fast context.
>>>>>>
>>>>>> What codepath you are worry about that clears ->mapping on pages with
>>>>>> non-zero refcount?
>>>>>>
>>>>>> I can only think of truncate (and punch hole). READ_ONCE(page->mapping)
>>>>>> and fail GUP_fast if it is NULL should be fine, no?
>>>>>>
>>>>>> I guess we should consider if the inode can be freed from under us and the
>>>>>> mapping pointer becomes dangling. But I think we should be fine here too:
>>>>>> VMA pins inode and VMA cannot go away from under GUP.
>>>>>
>>>>> Can vma still go away if during a fast-gup?
>>>>>
>>>>
>>>> So, after we grabbed the page and made sure the the PTE didn't change (IOW,
>>>> the PTE was stable while we processed it), the page can get unmapped (but
>>>> not freed, because we hold a reference) and the VMA can theoretically go
>>>> away (and as far as I understand, nothing stops the file from getting
>>>> deleted, truncated etc).
>>>>
>>>> So we might be looking at folio->mapping and the VMA is no longer there.
>>>> Maybe even the file is no longer there.
>>>>
>>>
>>> This shouldn't be an issue though right? Because after a pup call unlocks the
>>> mmap_lock we're in the same situation anyway. GUP doesn't generally guarantee
>>> the mapping remains valid, only pinning the underlying folio.
>>
>> Yes. But the issue here is rather dereferencing something that has already
>> been freed, eventually leading to undefined behavior.
>>
> 
> Is that an issue with interrupts disabled though? Will block page tables being
> removed and as Kirill says (sorry I maybe misinterpreted you) we should be ok.

Let's rule out page table freeing. If our VMA only spans a single page 
and falls into the same PMD as another VMA, an munmap() would not even 
free a single page table.

However, if unmapping a page (flushing the TLB) would imply an IPI as 
Kirill said, we'd be fine. I recall that that's not the case for all 
architectures, but I might be just wrong.

... and now I'll stop reading mails until Tuesday :)
Jason Gunthorpe April 28, 2023, 5:31 p.m. UTC | #37
On Fri, Apr 28, 2023 at 07:02:22PM +0200, David Hildenbrand wrote:

> > No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And
> > TLB flushing is serialized against GUP_fast().
>
> The whole CONFIG_MMU_GATHER_RCU_TABLE_FREE handling makes the situation more
> complicated.

Yeah, you have to think of gup_fast as RCU with a hacky pre-RCU implementation
on most architectures.

We could make page->mapping safe under RCU, for instance.

Jason
Jason Gunthorpe April 28, 2023, 5:33 p.m. UTC | #38
On Fri, Apr 28, 2023 at 11:56:55AM -0400, Peter Xu wrote:

> > PageAnon(page) can be called from GUP-fast after grabbing a reference. See
> > gup_must_unshare().
> 
> Hmm.. Is it a good idea at all to sacrifise all "!anon" fast-gups for this?
> People will silently got degrade even on legal pins on shmem/hugetlb, I
> think, which seems to be still a very major use case.

Remember gup fast was like this until quite recently - DAX wrecked it.

I fixed it when I changed DAX to not post-scan the VMA list..

I'm not sure longterm and really fast need to go together.

Jason
Lorenzo Stoakes April 28, 2023, 5:42 p.m. UTC | #39
On Fri, Apr 28, 2023 at 02:31:38PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 28, 2023 at 07:02:22PM +0200, David Hildenbrand wrote:
>
> > > No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And
> > > TLB flushing is serialized against GUP_fast().
> >
> > The whole CONFIG_MMU_GATHER_RCU_TABLE_FREE handling makes the situation more
> > complicated.
>
> Yeah, you have to think of gup_fast as RCU with a hacky pre-RCU implementation
> on most architectures.
>
> We could make page->mapping safe under RCU, for instance.
>
> Jason

Does it really require a change though? I might be missing some details,
but afaict with interrupts disabled we should be ok to deref page->mapping
to check PageAnon and a_ops before handing back a page right?
Jason Gunthorpe April 28, 2023, 5:49 p.m. UTC | #40
On Fri, Apr 28, 2023 at 06:42:41PM +0100, Lorenzo Stoakes wrote:
> On Fri, Apr 28, 2023 at 02:31:38PM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 28, 2023 at 07:02:22PM +0200, David Hildenbrand wrote:
> >
> > > > No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And
> > > > TLB flushing is serialized against GUP_fast().
> > >
> > > The whole CONFIG_MMU_GATHER_RCU_TABLE_FREE handling makes the situation more
> > > complicated.
> >
> > Yeah, you have to think of gup_fast as RCU with a hacky pre-RCU implementation
> > on most architectures.
> >
> > We could make page->mapping safe under RCU, for instance.
> >
> > Jason
> 
> Does it really require a change though? I might be missing some details,
> but afaict with interrupts disabled we should be ok to deref page->mapping
> to check PageAnon and a_ops before handing back a page right?
 
AFAIK not on all architectures

Jason
Theodore Ts'o April 28, 2023, 6:25 p.m. UTC | #41
On Fri, Apr 28, 2023 at 11:35:32AM -0300, Jason Gunthorpe wrote:
> 
> It has been years now, I think we need to admit a fix is still years
> away. Blocking the security problem may even motivate more people to
> work on a fix.

Do we think we can still trigger a kernel crash, or maybe even some
more exciting like an arbitrary buffer overrun, via the
process_vm_writev(2) system call into a file-backed mmap'ed region?

Maybe if someone can come up with an easy-to-expliot security proof of
aconcept, that doesn't require special RDMA hardware or some special
libvirt setup, we could finally get motivation to get it fixed, or at
least blocked?  :-)

We've only been talking about it for years, after all...

       	       	    	      		- Ted

> Security is the primary case where we have historically closed uAPI
> items.
> 
> Jason
John Hubbard April 28, 2023, 6:26 p.m. UTC | #42
On 4/28/23 10:33, Jason Gunthorpe wrote:
> On Fri, Apr 28, 2023 at 11:56:55AM -0400, Peter Xu wrote:
> 
>>> PageAnon(page) can be called from GUP-fast after grabbing a reference. See
>>> gup_must_unshare().
>>
>> Hmm.. Is it a good idea at all to sacrifise all "!anon" fast-gups for this?
>> People will silently got degrade even on legal pins on shmem/hugetlb, I
>> think, which seems to be still a very major use case.
> 
> Remember gup fast was like this until quite recently - DAX wrecked it.
> 
> I fixed it when I changed DAX to not post-scan the VMA list..
> 
> I'm not sure longterm and really fast need to go together.
> 

Exactly: gup_fast + FOLL_LONGTERM is asking too much of this system,
it's really excessive.

I like the idea of simply: "if (FOLL_LONGTERM) jump straight to slow gup". 

thanks,
John Hubbard April 28, 2023, 6:26 p.m. UTC | #43
On 4/28/23 10:29, David Hildenbrand wrote:
...
>> Is that an issue with interrupts disabled though? Will block page tables being
>> removed and as Kirill says (sorry I maybe misinterpreted you) we should be ok.
> 
> Let's rule out page table freeing. If our VMA only spans a single page 
> and falls into the same PMD as another VMA, an munmap() would not even 
> free a single page table.
> 
> However, if unmapping a page (flushing the TLB) would imply an IPI as 
> Kirill said, we'd be fine. I recall that that's not the case for all 
> architectures, but I might be just wrong.
> 

Some architectures use IPIs, and others use an RCU-like approach instead.

thanks,
Jason Gunthorpe April 28, 2023, 6:50 p.m. UTC | #44
On Fri, Apr 28, 2023 at 02:25:53PM -0400, Theodore Ts'o wrote:
> On Fri, Apr 28, 2023 at 11:35:32AM -0300, Jason Gunthorpe wrote:
> > 
> > It has been years now, I think we need to admit a fix is still years
> > away. Blocking the security problem may even motivate more people to
> > work on a fix.
> 
> Do we think we can still trigger a kernel crash, or maybe even some
> more exciting like an arbitrary buffer overrun, via the
> process_vm_writev(2) system call into a file-backed mmap'ed region?

Jens? You blocked it from io_uring, did you have a specific attack in
mind?

Jason
Kirill A. Shutemov April 28, 2023, 11:43 p.m. UTC | #45
On Fri, Apr 28, 2023 at 07:02:22PM +0200, David Hildenbrand wrote:
> On 28.04.23 18:56, Kirill A . Shutemov wrote:
> > On Fri, Apr 28, 2023 at 06:51:46PM +0200, David Hildenbrand wrote:
> > > On 28.04.23 18:39, Peter Xu wrote:
> > > > On Fri, Apr 28, 2023 at 07:22:07PM +0300, Kirill A . Shutemov wrote:
> > > > > On Fri, Apr 28, 2023 at 06:13:03PM +0200, David Hildenbrand wrote:
> > > > > > On 28.04.23 18:09, Kirill A . Shutemov wrote:
> > > > > > > On Fri, Apr 28, 2023 at 05:43:52PM +0200, David Hildenbrand wrote:
> > > > > > > > On 28.04.23 17:34, David Hildenbrand wrote:
> > > > > > > > > On 28.04.23 17:33, Lorenzo Stoakes wrote:
> > > > > > > > > > On Fri, Apr 28, 2023 at 05:23:29PM +0200, David Hildenbrand wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Security is the primary case where we have historically closed uAPI
> > > > > > > > > > > > > items.
> > > > > > > > > > > > 
> > > > > > > > > > > > As this patch
> > > > > > > > > > > > 
> > > > > > > > > > > > 1) Does not tackle GUP-fast
> > > > > > > > > > > > 2) Does not take care of !FOLL_LONGTERM
> > > > > > > > > > > > 
> > > > > > > > > > > > I am not convinced by the security argument in regard to this patch.
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > If we want to sells this as a security thing, we have to block it
> > > > > > > > > > > > *completely* and then CC stable.
> > > > > > > > > > > 
> > > > > > > > > > > Regarding GUP-fast, to fix the issue there as well, I guess we could do
> > > > > > > > > > > something similar as I did in gup_must_unshare():
> > > > > > > > > > > 
> > > > > > > > > > > If we're in GUP-fast (no VMA), and want to pin a !anon page writable,
> > > > > > > > > > > fallback to ordinary GUP. IOW, if we don't know, better be safe.
> > > > > > > > > > 
> > > > > > > > > > How do we determine it's non-anon in the first place? The check is on the
> > > > > > > > > > VMA. We could do it by following page tables down to folio and checking
> > > > > > > > > > folio->mapping for PAGE_MAPPING_ANON I suppose?
> > > > > > > > > 
> > > > > > > > > PageAnon(page) can be called from GUP-fast after grabbing a reference.
> > > > > > > > > See gup_must_unshare().
> > > > > > > > 
> > > > > > > > IIRC, PageHuge() can also be called from GUP-fast and could special-case
> > > > > > > > hugetlb eventually, as it's table while we hold a (temporary) reference.
> > > > > > > > Shmem might be not so easy ...
> > > > > > > 
> > > > > > > page->mapping->a_ops should be enough to whitelist whatever fs you want.
> > > > > > > 
> > > > > > 
> > > > > > The issue is how to stabilize that from GUP-fast, such that we can safely
> > > > > > dereference the mapping. Any idea?
> > > > > > 
> > > > > > At least for anon page I know that page->mapping only gets cleared when
> > > > > > freeing the page, and we don't dereference the mapping but only check a
> > > > > > single flag stored alongside the mapping. Therefore, PageAnon() is fine in
> > > > > > GUP-fast context.
> > > > > 
> > > > > What codepath you are worry about that clears ->mapping on pages with
> > > > > non-zero refcount?
> > > > > 
> > > > > I can only think of truncate (and punch hole). READ_ONCE(page->mapping)
> > > > > and fail GUP_fast if it is NULL should be fine, no?
> > > > > 
> > > > > I guess we should consider if the inode can be freed from under us and the
> > > > > mapping pointer becomes dangling. But I think we should be fine here too:
> > > > > VMA pins inode and VMA cannot go away from under GUP.
> > > > 
> > > > Can vma still go away if during a fast-gup?
> > > > 
> > > 
> > > So, after we grabbed the page and made sure the the PTE didn't change (IOW,
> > > the PTE was stable while we processed it), the page can get unmapped (but
> > > not freed, because we hold a reference) and the VMA can theoretically go
> > > away (and as far as I understand, nothing stops the file from getting
> > > deleted, truncated etc).
> > > 
> > > So we might be looking at folio->mapping and the VMA is no longer there.
> > > Maybe even the file is no longer there.
> > 
> > No. VMA cannot get away before PTEs are unmapped and TLB is flushed. And
> > TLB flushing is serialized against GUP_fast().
> 
> 
> The whole CONFIG_MMU_GATHER_RCU_TABLE_FREE handling makes the situation more
> complicated.

I think I found relevant snippet of code that solves similar issue.
get_futex_key() uses RCU to stabilize page->mapping after GUP_fast:


		/*
		 * The associated futex object in this case is the inode and
		 * the page->mapping must be traversed. Ordinarily this should
		 * be stabilised under page lock but it's not strictly
		 * necessary in this case as we just want to pin the inode, not
		 * update the radix tree or anything like that.
		 *
		 * The RCU read lock is taken as the inode is finally freed
		 * under RCU. If the mapping still matches expectations then the
		 * mapping->host can be safely accessed as being a valid inode.
		 */
		rcu_read_lock();

		if (READ_ONCE(page->mapping) != mapping) {
			rcu_read_unlock();
			put_page(page);

			goto again;
		}

		inode = READ_ONCE(mapping->host);
		if (!inode) {
			rcu_read_unlock();
			put_page(page);

			goto again;
		}

I think something similar can be used inside GUP_fast too.
Theodore Ts'o April 29, 2023, 4:21 a.m. UTC | #46
On Fri, Apr 28, 2023 at 03:50:20PM -0300, Jason Gunthorpe wrote:
> > Do we think we can still trigger a kernel crash, or maybe even some
> > more exciting like an arbitrary buffer overrun, via the
> > process_vm_writev(2) system call into a file-backed mmap'ed region?

I paged back into my memory the details, and (un)fortunately(?) it
probably can't be turned into high severity security exploit; it's
"just" a silent case of data loss.  (Which is *so* much better.... :-)

There was a reliable reproducer which was found by Syzkaller, that
didn't require any kind of exotic hardware or setup[1], and we
ultimately kluged a workaround in commit cc5095747edf ("ext4: don't
BUG if someone dirty pages without asking ext4 first").

[1] https://lore.kernel.org/all/Yg0m6IjcNmfaSokM@google.com/

Commit cc5095747edf had the (un)fortunate(?) side effect that GUP
writes to ext4 file-backed mappings no longer would cause random
low-probability crashes on large installations using RDMA, which has
apparently removed some of the motivation of really fixing the problem
instead of papering over it.  The good news is that I'm no longer
getting complaints from syzbot for this issue, and *I* don't have to
support anyone trying to use RDMA into file-backed mappings.  :-)

In any case, the file system maintainers' position (mine and I doubt
Dave Chinner's position has changed) is that if you write to
file-backed mappings via GUP/RDMA/process_vm_writev, and it causes
silent data corruption, you get to keep both pieces, and don't go
looking for us for anything other than sympathy...

						- Ted
Jason Gunthorpe April 29, 2023, 11:01 p.m. UTC | #47
On Sat, Apr 29, 2023 at 12:21:09AM -0400, Theodore Ts'o wrote:

> In any case, the file system maintainers' position (mine and I doubt
> Dave Chinner's position has changed) is that if you write to
> file-backed mappings via GUP/RDMA/process_vm_writev, and it causes
> silent data corruption, you get to keep both pieces, and don't go
> looking for us for anything other than sympathy...

This alone is enough reason to block it. I'm tired of this round and
round and I think we should just say enough, the mm will work to
enforce this view point. Files can only be written through PTEs.

If this upsets people they can work on fixing it, but at least we
don't have these kernel problems and inconsistencies to deal with.

Jason
Lorenzo Stoakes April 29, 2023, 11:09 p.m. UTC | #48
On Sat, Apr 29, 2023 at 08:01:11PM -0300, Jason Gunthorpe wrote:
> On Sat, Apr 29, 2023 at 12:21:09AM -0400, Theodore Ts'o wrote:
>
> > In any case, the file system maintainers' position (mine and I doubt
> > Dave Chinner's position has changed) is that if you write to
> > file-backed mappings via GUP/RDMA/process_vm_writev, and it causes
> > silent data corruption, you get to keep both pieces, and don't go
> > looking for us for anything other than sympathy...
>
> This alone is enough reason to block it. I'm tired of this round and
> round and I think we should just say enough, the mm will work to
> enforce this view point. Files can only be written through PTEs.
>
> If this upsets people they can work on fixing it, but at least we
> don't have these kernel problems and inconsistencies to deal with.
>

Indeed, I think there is a broad consensus that FOLL_LONGTERM is such an
egregious case that this is the way forward.

I will respin with the discussed GUP-fast changes relatively soon and then
we can see where that takes us :)

> Jason
Dave Chinner May 1, 2023, 7:27 a.m. UTC | #49
On Sat, Apr 29, 2023 at 08:01:11PM -0300, Jason Gunthorpe wrote:
> On Sat, Apr 29, 2023 at 12:21:09AM -0400, Theodore Ts'o wrote:
> 
> > In any case, the file system maintainers' position (mine and I doubt
> > Dave Chinner's position has changed) is that if you write to
> > file-backed mappings via GUP/RDMA/process_vm_writev, and it causes
> > silent data corruption, you get to keep both pieces, and don't go
> > looking for us for anything other than sympathy...
> 
> This alone is enough reason to block it. I'm tired of this round and
> round and I think we should just say enough, the mm will work to
> enforce this view point. Files can only be written through PTEs.

It has to be at least 5 years ago now that we were told that the
next-gen RDMA hardware would be able to trigger hardware page faults
when remote systems dirtied local pages.  This would enable
->page-mkwrite to be run on file backed pages mapped pages just like
local CPU write faults and everything would be fine.

Whatever happened to that? Are we still waiting for hardware that
can trigger page faults from remote DMA transfers, or have hardware
vendors given up on this?

Cheers,

Dave.
Jason Gunthorpe May 1, 2023, 12:39 p.m. UTC | #50
On Mon, May 01, 2023 at 05:27:18PM +1000, Dave Chinner wrote:
> On Sat, Apr 29, 2023 at 08:01:11PM -0300, Jason Gunthorpe wrote:
> > On Sat, Apr 29, 2023 at 12:21:09AM -0400, Theodore Ts'o wrote:
> > 
> > > In any case, the file system maintainers' position (mine and I doubt
> > > Dave Chinner's position has changed) is that if you write to
> > > file-backed mappings via GUP/RDMA/process_vm_writev, and it causes
> > > silent data corruption, you get to keep both pieces, and don't go
> > > looking for us for anything other than sympathy...
> > 
> > This alone is enough reason to block it. I'm tired of this round and
> > round and I think we should just say enough, the mm will work to
> > enforce this view point. Files can only be written through PTEs.
> 
> It has to be at least 5 years ago now that we were told that the
> next-gen RDMA hardware would be able to trigger hardware page faults
> when remote systems dirtied local pages.  This would enable
> ->page-mkwrite to be run on file backed pages mapped pages just like
> local CPU write faults and everything would be fine.

Things are progressing, but I'm not as optimistic as I once was..

- Today mlx5 has ODP which allows this to work using hmm_range_fault()
  techniques. I know of at least one deployment using this with a DAX
  configuration. This is now at least 5 years old stuff. The downside
  is that HMM approaches yield poor wost case performance, and have
  weird caching corner cases. This is still only one vendor, in the
  past 5 years nobody else stepped up to implement it.

- Intel Sapphire Rapids chips have ATS/PRI support and we are doing
  research on integrating mlx5 with that. In Linux this is called
  "IOMMU SVA".

  However, performance is wonky - in the best case it is worse
  than ODP but it removes ODP's worst case corners. It also makes the
  entire MM notably slower for processes that turn it on. Who knows
  when or what this will turn out to be useful for.

- Full cache coherence with CXL. CXL has taken a long time to really
  reach the mainstream market - maybe next gen of server CPUs. I'm not
  aware of anyone doing work here in the RDMA space, it is difficult
  to see the benefit. This seems likely to be very popular in the GPU
  space, I already see some products announced. This is a big topic on
  its own for FSs..

So, basically, you can make it work on the most popular HW, but at the
cost of top performance. Which makes it unpopular.

I don't expect anything on the horizon to subtantially change this
calculus, the latency cost of doing ATS like things is an inherent
performance penalty that can't be overcome

Jason
Jan Kara May 2, 2023, 8 a.m. UTC | #51
On Sat 29-04-23 02:43:32, Kirill A . Shutemov wrote:
> I think I found relevant snippet of code that solves similar issue.
> get_futex_key() uses RCU to stabilize page->mapping after GUP_fast:
> 
> 
> 		/*
> 		 * The associated futex object in this case is the inode and
> 		 * the page->mapping must be traversed. Ordinarily this should
> 		 * be stabilised under page lock but it's not strictly
> 		 * necessary in this case as we just want to pin the inode, not
> 		 * update the radix tree or anything like that.
> 		 *
> 		 * The RCU read lock is taken as the inode is finally freed
> 		 * under RCU. If the mapping still matches expectations then the
> 		 * mapping->host can be safely accessed as being a valid inode.
> 		 */
> 		rcu_read_lock();
> 
> 		if (READ_ONCE(page->mapping) != mapping) {
> 			rcu_read_unlock();
> 			put_page(page);
> 
> 			goto again;
> 		}
> 
> 		inode = READ_ONCE(mapping->host);
> 		if (!inode) {
> 			rcu_read_unlock();
> 			put_page(page);
> 
> 			goto again;
> 		}
> 
> I think something similar can be used inside GUP_fast too.

Yeah, inodes (and thus struct address_space) is RCU protected these days so
grabbing RCU lock in gup_fast() will get you enough protection for checking
aops if you are careful (like the futex code is).

								Honza
Peter Zijlstra May 2, 2023, 8:39 a.m. UTC | #52
On Tue, May 02, 2023 at 10:00:16AM +0200, Jan Kara wrote:
> On Sat 29-04-23 02:43:32, Kirill A . Shutemov wrote:
> > I think I found relevant snippet of code that solves similar issue.
> > get_futex_key() uses RCU to stabilize page->mapping after GUP_fast:
> > 
> > 
> > 		/*
> > 		 * The associated futex object in this case is the inode and
> > 		 * the page->mapping must be traversed. Ordinarily this should
> > 		 * be stabilised under page lock but it's not strictly
> > 		 * necessary in this case as we just want to pin the inode, not
> > 		 * update the radix tree or anything like that.
> > 		 *
> > 		 * The RCU read lock is taken as the inode is finally freed
> > 		 * under RCU. If the mapping still matches expectations then the
> > 		 * mapping->host can be safely accessed as being a valid inode.
> > 		 */
> > 		rcu_read_lock();
> > 
> > 		if (READ_ONCE(page->mapping) != mapping) {
> > 			rcu_read_unlock();
> > 			put_page(page);
> > 
> > 			goto again;
> > 		}
> > 
> > 		inode = READ_ONCE(mapping->host);
> > 		if (!inode) {
> > 			rcu_read_unlock();
> > 			put_page(page);
> > 
> > 			goto again;
> > 		}
> > 
> > I think something similar can be used inside GUP_fast too.
> 
> Yeah, inodes (and thus struct address_space) is RCU protected these days so
> grabbing RCU lock in gup_fast() will get you enough protection for checking
> aops if you are careful (like the futex code is).

GUP_fast has IRQs disabled per definition, there is no need to then also
use rcu_read_lock().
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 37554b08bb28..f7da02fc89c6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2433,6 +2433,7 @@  extern unsigned long move_page_tables(struct vm_area_struct *vma,
 #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
 					    MM_CP_UFFD_WP_RESOLVE)

+bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
 int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
 static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
 {
diff --git a/mm/gup.c b/mm/gup.c
index 1f72a717232b..d36a5db9feb1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -959,16 +959,51 @@  static int faultin_page(struct vm_area_struct *vma,
 	return 0;
 }

+/*
+ * Writing to file-backed mappings which require folio dirty tracking using GUP
+ * is a fundamentally broken operation, as kernel write access to GUP mappings
+ * do not adhere to the semantics expected by a file system.
+ *
+ * Consider the following scenario:-
+ *
+ * 1. A folio is written to via GUP which write-faults the memory, notifying
+ *    the file system and dirtying the folio.
+ * 2. Later, writeback is triggered, resulting in the folio being cleaned and
+ *    the PTE being marked read-only.
+ * 3. The GUP caller writes to the folio, as it is mapped read/write via the
+ *    direct mapping.
+ * 4. The GUP caller, now done with the page, unpins it and sets it dirty
+ *    (though it does not have to).
+ *
+ * This results in both data being written to a folio without writenotify, and
+ * the folio being dirtied unexpectedly (if the caller decides to do so).
+ */
+static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
+					   unsigned long gup_flags)
+{
+	/* If we aren't pinning then no problematic write can occur. */
+	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
+		return true;
+
+	/* We limit this check to the most egregious case - a long term pin. */
+	if (!(gup_flags & FOLL_LONGTERM))
+		return true;
+
+	/* If the VMA requires dirty tracking then GUP will be problematic. */
+	return vma_needs_dirty_tracking(vma);
+}
+
 static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 {
 	vm_flags_t vm_flags = vma->vm_flags;
 	int write = (gup_flags & FOLL_WRITE);
 	int foreign = (gup_flags & FOLL_REMOTE);
+	bool vma_anon = vma_is_anonymous(vma);

 	if (vm_flags & (VM_IO | VM_PFNMAP))
 		return -EFAULT;

-	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
+	if ((gup_flags & FOLL_ANON) && !vma_anon)
 		return -EFAULT;

 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
@@ -978,6 +1013,10 @@  static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 		return -EFAULT;

 	if (write) {
+		if (!vma_anon &&
+		    !writeable_file_mapping_allowed(vma, gup_flags))
+			return -EFAULT;
+
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
 				return -EFAULT;
diff --git a/mm/mmap.c b/mm/mmap.c
index 536bbb8fa0ae..7b6344d1832a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1475,6 +1475,31 @@  SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
 }
 #endif /* __ARCH_WANT_SYS_OLD_MMAP */

+/* Do VMA operations imply write notify is required? */
+static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops)
+{
+	return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite);
+}
+
+/*
+ * Does this VMA require the underlying folios to have their dirty state
+ * tracked?
+ */
+bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
+{
+	/* Does the filesystem need to be notified? */
+	if (vm_ops_needs_writenotify(vma->vm_ops))
+		return true;
+
+	/* Specialty mapping? */
+	if (vma->vm_flags & VM_PFNMAP)
+		return false;
+
+	/* Can the mapping track the dirty pages? */
+	return vma->vm_file && vma->vm_file->f_mapping &&
+		mapping_can_writeback(vma->vm_file->f_mapping);
+}
+
 /*
  * Some shared mappings will want the pages marked read-only
  * to track write events. If so, we'll downgrade vm_page_prot
@@ -1484,14 +1509,13 @@  SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
 int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
 {
 	vm_flags_t vm_flags = vma->vm_flags;
-	const struct vm_operations_struct *vm_ops = vma->vm_ops;

 	/* If it was private or non-writable, the write bit is already clear */
 	if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
 		return 0;

 	/* The backer wishes to know when pages are first written to? */
-	if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
+	if (vm_ops_needs_writenotify(vma->vm_ops))
 		return 1;

 	/* The open routine did something to the protections that pgprot_modify
@@ -1511,13 +1535,7 @@  int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
 	if (userfaultfd_wp(vma))
 		return 1;

-	/* Specialty mapping? */
-	if (vm_flags & VM_PFNMAP)
-		return 0;
-
-	/* Can the mapping track the dirty pages? */
-	return vma->vm_file && vma->vm_file->f_mapping &&
-		mapping_can_writeback(vma->vm_file->f_mapping);
+	return vma_needs_dirty_tracking(vma);
 }

 /*