diff mbox series

[v6,3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

Message ID dee4f4ad6532b0f94d073da263526de334d5d7e0.1682981880.git.lstoakes@gmail.com (mailing list archive)
State Mainlined, archived
Headers show
Series mm/gup: disallow GUP writing to file-backed mappings by default | expand

Commit Message

Lorenzo Stoakes May 1, 2023, 11:11 p.m. UTC
Writing to file-backed dirty-tracked mappings via GUP is inherently broken
as we cannot rule out folios being cleaned and then a GUP user writing to
them again and possibly marking them dirty unexpectedly.

This is especially egregious for long-term mappings (as indicated by the
use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
we have already done in the slow path.

We have access to less information in the fast path as we cannot examine
the VMA containing the mapping, however we can determine whether the folio
is anonymous and then whitelist known-good mappings - specifically hugetlb
and shmem mappings.

While we obtain a stable folio for this check, the mapping might not be, as
a truncate could nullify it at any time. Since doing so requires mappings
to be zapped, we can synchronise against a TLB shootdown operation.

For some architectures TLB shootdown is synchronised by IPI, against which
we are protected as the GUP-fast operation is performed with interrupts
disabled. However, other architectures which specify
CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.

In these instances, we acquire an RCU lock while performing our checks. If
we cannot get a stable mapping, we fall back to the slow path, as otherwise
we'd have to walk the page tables again and it's simpler and more effective
to just fall back.

It's important to note that there are no APIs allowing users to specify
FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
always rely on the fact that if we fail to pin on the fast path, the code
will fall back to the slow path which can perform the more thorough check.

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Kirill A . Shutemov <kirill@shutemov.name>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 85 insertions(+), 2 deletions(-)

Comments

John Hubbard May 1, 2023, 11:42 p.m. UTC | #1
On 5/1/23 16:11, Lorenzo Stoakes wrote:
> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> as we cannot rule out folios being cleaned and then a GUP user writing to
> them again and possibly marking them dirty unexpectedly.
> 
> This is especially egregious for long-term mappings (as indicated by the
> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> we have already done in the slow path.
> 
> We have access to less information in the fast path as we cannot examine
> the VMA containing the mapping, however we can determine whether the folio
> is anonymous and then whitelist known-good mappings - specifically hugetlb
> and shmem mappings.
> 
> While we obtain a stable folio for this check, the mapping might not be, as
> a truncate could nullify it at any time. Since doing so requires mappings
> to be zapped, we can synchronise against a TLB shootdown operation.
> 
> For some architectures TLB shootdown is synchronised by IPI, against which
> we are protected as the GUP-fast operation is performed with interrupts
> disabled. However, other architectures which specify
> CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.
> 
> In these instances, we acquire an RCU lock while performing our checks. If
> we cannot get a stable mapping, we fall back to the slow path, as otherwise
> we'd have to walk the page tables again and it's simpler and more effective
> to just fall back.
> 
> It's important to note that there are no APIs allowing users to specify
> FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
> always rely on the fact that if we fail to pin on the fast path, the code
> will fall back to the slow path which can perform the more thorough check.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Kirill A . Shutemov <kirill@shutemov.name>
> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> ---
>   mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 85 insertions(+), 2 deletions(-)
> 

Hi Lorenzo,

I am unable to find anything wrong with this patch, despite poring
over it and fretting over IPI vs. RCU cases. :)

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
Peter Zijlstra May 2, 2023, 11:13 a.m. UTC | #2
On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
>  	return folio;
>  }
>  
> +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> +static bool stabilise_mapping_rcu(struct folio *folio)
> +{
> +	struct address_space *mapping = READ_ONCE(folio->mapping);
> +
> +	rcu_read_lock();
> +
> +	return mapping == READ_ONCE(folio->mapping);

This doesn't make sense; why bother reading the same thing twice?

Who cares if the thing changes from before; what you care about is that
the value you see has stable storage, this doesn't help with that.

> +}
> +
> +static void unlock_rcu(void)
> +{
> +	rcu_read_unlock();
> +}
> +#else
> +static bool stabilise_mapping_rcu(struct folio *)
> +{
> +	return true;
> +}
> +
> +static void unlock_rcu(void)
> +{
> +}
> +#endif

Anyway, this all can go away. RCU can't progress while you have
interrupts disabled anyway.

> +/*
> + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> + * FOLL_WRITE pin is permitted for a specific folio.
> + *
> + * This assumes the folio is stable and pinned.
> + *
> + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> + * (see comment describing the writeable_file_mapping_allowed() function). We
> + * therefore try to avoid the most egregious case of a long-term mapping doing
> + * so.
> + *
> + * This function cannot be as thorough as that one as the VMA is not available
> + * in the fast path, so instead we whitelist known good cases.
> + *
> + * The folio is stable, but the mapping might not be. When truncating for
> + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> + * so we are safe from an IPI, but some architectures use an RCU lock for this
> + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> + */
> +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> +{
> +	bool ret;
> +
> +	/* hugetlb mappings do not require dirty tracking. */
> +	if (folio_test_hugetlb(folio))
> +		return true;
> +

This:

> +	if (stabilise_mapping_rcu(folio)) {
> +		struct address_space *mapping = folio_mapping(folio);

And this is 3rd read of folio->mapping, just for giggles?

> +
> +		/*
> +		 * Neither anonymous nor shmem-backed folios require
> +		 * dirty tracking.
> +		 */
> +		ret = folio_test_anon(folio) ||
> +			(mapping && shmem_mapping(mapping));
> +	} else {
> +		/* If the mapping is unstable, fallback to the slow path. */
> +		ret = false;
> +	}
> +
> +	unlock_rcu();
> +
> +	return ret;

then becomes:


	if (folio_test_anon(folio))
		return true;

	/*
	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
	 * grace periods from making progress, IOW. they imply
	 * rcu_read_lock().
	 */
	lockdep_assert_irqs_disabled();

	/*
	 * Inodes and thus address_space are RCU freed and thus safe to
	 * access at this point.
	 */
	mapping = folio_mapping(folio);
	if (mapping && shmem_mapping(mapping))
		return true;

	return false;

> +}
Jan Kara May 2, 2023, 11:20 a.m. UTC | #3
On Tue 02-05-23 00:11:49, Lorenzo Stoakes wrote:
> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> as we cannot rule out folios being cleaned and then a GUP user writing to
> them again and possibly marking them dirty unexpectedly.
> 
> This is especially egregious for long-term mappings (as indicated by the
> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> we have already done in the slow path.
> 
> We have access to less information in the fast path as we cannot examine
> the VMA containing the mapping, however we can determine whether the folio
> is anonymous and then whitelist known-good mappings - specifically hugetlb
> and shmem mappings.
> 
> While we obtain a stable folio for this check, the mapping might not be, as
> a truncate could nullify it at any time. Since doing so requires mappings
> to be zapped, we can synchronise against a TLB shootdown operation.
> 
> For some architectures TLB shootdown is synchronised by IPI, against which
> we are protected as the GUP-fast operation is performed with interrupts
> disabled. However, other architectures which specify
> CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.
> 
> In these instances, we acquire an RCU lock while performing our checks. If
> we cannot get a stable mapping, we fall back to the slow path, as otherwise
> we'd have to walk the page tables again and it's simpler and more effective
> to just fall back.
> 
> It's important to note that there are no APIs allowing users to specify
> FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
> always rely on the fact that if we fail to pin on the fast path, the code
> will fall back to the slow path which can perform the more thorough check.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Kirill A . Shutemov <kirill@shutemov.name>
> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> ---
>  mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 85 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 0f09dec0906c..431618048a03 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -18,6 +18,7 @@
>  #include <linux/migrate.h>
>  #include <linux/mm_inline.h>
>  #include <linux/sched/mm.h>
> +#include <linux/shmem_fs.h>
>  
>  #include <asm/mmu_context.h>
>  #include <asm/tlbflush.h>
> @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
>  	return folio;
>  }
>  
> +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> +static bool stabilise_mapping_rcu(struct folio *folio)
> +{
> +	struct address_space *mapping = READ_ONCE(folio->mapping);
> +
> +	rcu_read_lock();
> +
> +	return mapping == READ_ONCE(folio->mapping);
> +}
> +
> +static void unlock_rcu(void)
> +{
> +	rcu_read_unlock();
> +}
> +#else
> +static bool stabilise_mapping_rcu(struct folio *)
> +{
> +	return true;
> +}
> +
> +static void unlock_rcu(void)
> +{
> +}
> +#endif

So I wonder is this complexity worth it over just using rcu_read_lock()
unconditionally? It is just a compilation barrier AFAIK...

Also stabilise_mapping_rcu() seems to be a bit of a misnomer since the
mapping is not stable after the function is called. Also the return value
seems a bit pointless to me since we have to check folio_mapping() for
being != NULL anyway. All in all I'd say that:

	struct address_space *mapping;

	/* Make sure mapping cannot be freed under our hands */
	rcu_read_lock();
	mapping = folio_mapping(folio);
	ret = folio_test_anon(folio) || (mapping && shmem_mapping(mapping));
	rcu_read_unlock();

looks more comprehensible...

								Honza

> +
> +/*
> + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> + * FOLL_WRITE pin is permitted for a specific folio.
> + *
> + * This assumes the folio is stable and pinned.
> + *
> + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> + * (see comment describing the writeable_file_mapping_allowed() function). We
> + * therefore try to avoid the most egregious case of a long-term mapping doing
> + * so.
> + *
> + * This function cannot be as thorough as that one as the VMA is not available
> + * in the fast path, so instead we whitelist known good cases.
> + *
> + * The folio is stable, but the mapping might not be. When truncating for
> + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> + * so we are safe from an IPI, but some architectures use an RCU lock for this
> + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> + */
> +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> +{
> +	bool ret;
> +
> +	/* hugetlb mappings do not require dirty tracking. */
> +	if (folio_test_hugetlb(folio))
> +		return true;
> +
> +	if (stabilise_mapping_rcu(folio)) {
> +		struct address_space *mapping = folio_mapping(folio);
> +
> +		/*
> +		 * Neither anonymous nor shmem-backed folios require
> +		 * dirty tracking.
> +		 */
> +		ret = folio_test_anon(folio) ||
> +			(mapping && shmem_mapping(mapping));
> +	} else {
> +		/* If the mapping is unstable, fallback to the slow path. */
> +		ret = false;
> +	}
> +
> +	unlock_rcu();
> +
> +	return ret;
> +}
> +
>  /**
>   * try_grab_folio() - Attempt to get or pin a folio.
>   * @page:  pointer to page to be grabbed
> @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
>   */
>  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  {
> +	bool is_longterm = flags & FOLL_LONGTERM;
> +
>  	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
>  		return NULL;
>  
> @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  		 * right zone, so fail and let the caller fall back to the slow
>  		 * path.
>  		 */
> -		if (unlikely((flags & FOLL_LONGTERM) &&
> -			     !is_longterm_pinnable_page(page)))
> +		if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
>  			return NULL;
>  
>  		/*
> @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  		if (!folio)
>  			return NULL;
>  
> +		/*
> +		 * Can this folio be safely pinned? We need to perform this
> +		 * check after the folio is stabilised.
> +		 */
> +		if ((flags & FOLL_WRITE) && is_longterm &&
> +		    !folio_longterm_write_pin_allowed(folio)) {
> +			folio_put_refs(folio, refs);
> +			return NULL;
> +		}
> +
>  		/*
>  		 * When pinning a large folio, use an exact count to track it.
>  		 *
> -- 
> 2.40.1
>
Jan Kara May 2, 2023, 11:23 a.m. UTC | #4
On Tue 02-05-23 13:13:34, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > +
> > +		/*
> > +		 * Neither anonymous nor shmem-backed folios require
> > +		 * dirty tracking.
> > +		 */
> > +		ret = folio_test_anon(folio) ||
> > +			(mapping && shmem_mapping(mapping));
> > +	} else {
> > +		/* If the mapping is unstable, fallback to the slow path. */
> > +		ret = false;
> > +	}
> > +
> > +	unlock_rcu();
> > +
> > +	return ret;
> 
> then becomes:
> 
> 
> 	if (folio_test_anon(folio))
> 		return true;
> 
> 	/*
> 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> 	 * grace periods from making progress, IOW. they imply
> 	 * rcu_read_lock().
> 	 */
> 	lockdep_assert_irqs_disabled();
> 
> 	/*
> 	 * Inodes and thus address_space are RCU freed and thus safe to
> 	 * access at this point.
> 	 */
> 	mapping = folio_mapping(folio);
> 	if (mapping && shmem_mapping(mapping))
> 		return true;
> 
> 	return false;

Yeah, that's even better.

								Honza
Lorenzo Stoakes May 2, 2023, 11:25 a.m. UTC | #5
On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> >  	return folio;
> >  }
> >
> > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > +static bool stabilise_mapping_rcu(struct folio *folio)
> > +{
> > +	struct address_space *mapping = READ_ONCE(folio->mapping);
> > +
> > +	rcu_read_lock();
> > +
> > +	return mapping == READ_ONCE(folio->mapping);
>
> This doesn't make sense; why bother reading the same thing twice?

The intent is to see whether the folio->mapping has been truncated from
underneath us, as per the futex code that Kirill referred to which does
something similar [1].

>
> Who cares if the thing changes from before; what you care about is that
> the value you see has stable storage, this doesn't help with that.
>
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +	rcu_read_unlock();
> > +}
> > +#else
> > +static bool stabilise_mapping_rcu(struct folio *)
> > +{
> > +	return true;
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +}
> > +#endif
>
> Anyway, this all can go away. RCU can't progress while you have
> interrupts disabled anyway.

There seems to be other code in the kernel that assumes that this is not
the case, i.e. the futex code, though not sure if that's being run with
IRQs disabled... if not and it's absolutely certain that we need no special
handling for the RCU case, then happy days and more than glad to remove
this bit.

I'm far from an expert on RCU (I need to gain a better understanding of it)
so I'm deferring how best to proceed on _this part_ to the community.

>
> > +/*
> > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > + * FOLL_WRITE pin is permitted for a specific folio.
> > + *
> > + * This assumes the folio is stable and pinned.
> > + *
> > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > + * so.
> > + *
> > + * This function cannot be as thorough as that one as the VMA is not available
> > + * in the fast path, so instead we whitelist known good cases.
> > + *
> > + * The folio is stable, but the mapping might not be. When truncating for
> > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > + */
> > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > +{
> > +	bool ret;
> > +
> > +	/* hugetlb mappings do not require dirty tracking. */
> > +	if (folio_test_hugetlb(folio))
> > +		return true;
> > +
>
> This:
>
> > +	if (stabilise_mapping_rcu(folio)) {
> > +		struct address_space *mapping = folio_mapping(folio);
>
> And this is 3rd read of folio->mapping, just for giggles?

I like to giggle :)

Actually this is to handle the various cases in which the mapping might not
be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to
have a helper exposed for a check. Given previous review about duplication
I felt best to reuse this even though it does access again... yes I felt
weird about doing that.

>
> > +
> > +		/*
> > +		 * Neither anonymous nor shmem-backed folios require
> > +		 * dirty tracking.
> > +		 */
> > +		ret = folio_test_anon(folio) ||
> > +			(mapping && shmem_mapping(mapping));
> > +	} else {
> > +		/* If the mapping is unstable, fallback to the slow path. */
> > +		ret = false;
> > +	}
> > +
> > +	unlock_rcu();
> > +
> > +	return ret;
>
> then becomes:
>
>
> 	if (folio_test_anon(folio))
> 		return true;

This relies on the mapping so belongs below the lockdep assert imo.

>
> 	/*
> 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> 	 * grace periods from making progress, IOW. they imply
> 	 * rcu_read_lock().
> 	 */
> 	lockdep_assert_irqs_disabled();
>
> 	/*
> 	 * Inodes and thus address_space are RCU freed and thus safe to
> 	 * access at this point.
> 	 */
> 	mapping = folio_mapping(folio);
> 	if (mapping && shmem_mapping(mapping))
> 		return true;
>
> 	return false;
>
> > +}

I'm more than happy to do this (I'd rather drop the RCU bits if possible)
but need to be sure it's safe.
Lorenzo Stoakes May 2, 2023, 11:28 a.m. UTC | #6
On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote:
> > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > >  	return folio;
> > >  }
> > >
> > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > +static bool stabilise_mapping_rcu(struct folio *folio)
> > > +{
> > > +	struct address_space *mapping = READ_ONCE(folio->mapping);
> > > +
> > > +	rcu_read_lock();
> > > +
> > > +	return mapping == READ_ONCE(folio->mapping);
> >
> > This doesn't make sense; why bother reading the same thing twice?
>
> The intent is to see whether the folio->mapping has been truncated from
> underneath us, as per the futex code that Kirill referred to which does
> something similar [1].
>
> >
> > Who cares if the thing changes from before; what you care about is that
> > the value you see has stable storage, this doesn't help with that.
> >
> > > +}
> > > +
> > > +static void unlock_rcu(void)
> > > +{
> > > +	rcu_read_unlock();
> > > +}
> > > +#else
> > > +static bool stabilise_mapping_rcu(struct folio *)
> > > +{
> > > +	return true;
> > > +}
> > > +
> > > +static void unlock_rcu(void)
> > > +{
> > > +}
> > > +#endif
> >
> > Anyway, this all can go away. RCU can't progress while you have
> > interrupts disabled anyway.
>
> There seems to be other code in the kernel that assumes that this is not
> the case, i.e. the futex code, though not sure if that's being run with
> IRQs disabled... if not and it's absolutely certain that we need no special
> handling for the RCU case, then happy days and more than glad to remove
> this bit.
>
> I'm far from an expert on RCU (I need to gain a better understanding of it)
> so I'm deferring how best to proceed on _this part_ to the community.
>
> >
> > > +/*
> > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > > + * FOLL_WRITE pin is permitted for a specific folio.
> > > + *
> > > + * This assumes the folio is stable and pinned.
> > > + *
> > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > > + * so.
> > > + *
> > > + * This function cannot be as thorough as that one as the VMA is not available
> > > + * in the fast path, so instead we whitelist known good cases.
> > > + *
> > > + * The folio is stable, but the mapping might not be. When truncating for
> > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > > + */
> > > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > > +{
> > > +	bool ret;
> > > +
> > > +	/* hugetlb mappings do not require dirty tracking. */
> > > +	if (folio_test_hugetlb(folio))
> > > +		return true;
> > > +
> >
> > This:
> >
> > > +	if (stabilise_mapping_rcu(folio)) {
> > > +		struct address_space *mapping = folio_mapping(folio);
> >
> > And this is 3rd read of folio->mapping, just for giggles?
>
> I like to giggle :)
>
> Actually this is to handle the various cases in which the mapping might not
> be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to
> have a helper exposed for a check. Given previous review about duplication
> I felt best to reuse this even though it does access again... yes I felt
> weird about doing that.
>
> >
> > > +
> > > +		/*
> > > +		 * Neither anonymous nor shmem-backed folios require
> > > +		 * dirty tracking.
> > > +		 */
> > > +		ret = folio_test_anon(folio) ||
> > > +			(mapping && shmem_mapping(mapping));
> > > +	} else {
> > > +		/* If the mapping is unstable, fallback to the slow path. */
> > > +		ret = false;
> > > +	}
> > > +
> > > +	unlock_rcu();
> > > +
> > > +	return ret;
> >
> > then becomes:
> >
> >
> > 	if (folio_test_anon(folio))
> > 		return true;
>
> This relies on the mapping so belongs below the lockdep assert imo.
>
> >
> > 	/*
> > 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > 	 * grace periods from making progress, IOW. they imply
> > 	 * rcu_read_lock().
> > 	 */
> > 	lockdep_assert_irqs_disabled();
> >
> > 	/*
> > 	 * Inodes and thus address_space are RCU freed and thus safe to
> > 	 * access at this point.
> > 	 */
> > 	mapping = folio_mapping(folio);
> > 	if (mapping && shmem_mapping(mapping))
> > 		return true;
> >
> > 	return false;
> >
> > > +}
>
> I'm more than happy to do this (I'd rather drop the RCU bits if possible)
> but need to be sure it's safe.

Sorry forgot to include the [1]

[1]:https://lore.kernel.org/all/20230428234332.2vhprztuotlqir4x@box.shutemov.name/
Peter Zijlstra May 2, 2023, 12:08 p.m. UTC | #7
On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote:
> > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > >  	return folio;
> > >  }
> > >
> > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > +static bool stabilise_mapping_rcu(struct folio *folio)
> > > +{
> > > +	struct address_space *mapping = READ_ONCE(folio->mapping);
> > > +
> > > +	rcu_read_lock();
> > > +
> > > +	return mapping == READ_ONCE(folio->mapping);
> >
> > This doesn't make sense; why bother reading the same thing twice?
> 
> The intent is to see whether the folio->mapping has been truncated from
> underneath us, as per the futex code that Kirill referred to which does
> something similar [1].

Yeah, but per that 3rd load you got nothing here. Also that futex code
did the early load to deal with the !mapping case, but you're not doing
that.

> > Who cares if the thing changes from before; what you care about is that
> > the value you see has stable storage, this doesn't help with that.
> >
> > > +}
> > > +
> > > +static void unlock_rcu(void)
> > > +{
> > > +	rcu_read_unlock();
> > > +}
> > > +#else
> > > +static bool stabilise_mapping_rcu(struct folio *)
> > > +{
> > > +	return true;
> > > +}
> > > +
> > > +static void unlock_rcu(void)
> > > +{
> > > +}
> > > +#endif
> >
> > Anyway, this all can go away. RCU can't progress while you have
> > interrupts disabled anyway.
> 
> There seems to be other code in the kernel that assumes that this is not
> the case,

Yeah, so Paul went back on forth on that a bit. It used to be true in
the good old days when everything was simple. Then Paul made things
complicated by separating out sched-RCU bh-RCU and 'regular' RCU
flavours.

At that point disabling IRQs would only (officially) inhibit sched and
bh RCU flavours, but not the regular RCU.

But then some years ago Linus convinced Paul that having all these
separate RCU flavours with separate QS rules was a big pain in the
backside and Paul munged them all together again.

So now, anything that inhibits any of the RCU flavours inhibits them
all. So disabling IRQs is sufficient.

> i.e. the futex code, though not sure if that's being run with
> IRQs disabled...

That futex code runs in preemptible context, per the lock_page() that
can sleep etc.. :-)

> > > +/*
> > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > > + * FOLL_WRITE pin is permitted for a specific folio.
> > > + *
> > > + * This assumes the folio is stable and pinned.
> > > + *
> > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > > + * so.
> > > + *
> > > + * This function cannot be as thorough as that one as the VMA is not available
> > > + * in the fast path, so instead we whitelist known good cases.
> > > + *
> > > + * The folio is stable, but the mapping might not be. When truncating for
> > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > > + */
> > > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > > +{
> > > +	bool ret;
> > > +
> > > +	/* hugetlb mappings do not require dirty tracking. */
> > > +	if (folio_test_hugetlb(folio))
> > > +		return true;
> > > +
> >
> > This:
> >
> > > +	if (stabilise_mapping_rcu(folio)) {
> > > +		struct address_space *mapping = folio_mapping(folio);
> >
> > And this is 3rd read of folio->mapping, just for giggles?
> 
> I like to giggle :)
> 
> Actually this is to handle the various cases in which the mapping might not
> be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to
> have a helper exposed for a check. Given previous review about duplication
> I felt best to reuse this even though it does access again... yes I felt
> weird about doing that.

Right, I had a peek inside folio_mapping(), but the point is that this
3rd load might see yet *another* value of mapping from the prior two
loads, rendering them somewhat worthless.

> > > +
> > > +		/*
> > > +		 * Neither anonymous nor shmem-backed folios require
> > > +		 * dirty tracking.
> > > +		 */
> > > +		ret = folio_test_anon(folio) ||
> > > +			(mapping && shmem_mapping(mapping));
> > > +	} else {
> > > +		/* If the mapping is unstable, fallback to the slow path. */
> > > +		ret = false;
> > > +	}
> > > +
> > > +	unlock_rcu();
> > > +
> > > +	return ret;
> >
> > then becomes:
> >
> >
> > 	if (folio_test_anon(folio))
> > 		return true;
> 
> This relies on the mapping so belongs below the lockdep assert imo.

Oh, right you are.

> >
> > 	/*
> > 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > 	 * grace periods from making progress, IOW. they imply
> > 	 * rcu_read_lock().
> > 	 */
> > 	lockdep_assert_irqs_disabled();
> >
> > 	/*
> > 	 * Inodes and thus address_space are RCU freed and thus safe to
> > 	 * access at this point.
> > 	 */
> > 	mapping = folio_mapping(folio);
> > 	if (mapping && shmem_mapping(mapping))
> > 		return true;
> >
> > 	return false;
> >
> > > +}
> 
> I'm more than happy to do this (I'd rather drop the RCU bits if possible)
> but need to be sure it's safe.

GUP-fast as a whole relies on it :-)
Lorenzo Stoakes May 2, 2023, 12:27 p.m. UTC | #8
On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > > >  	return folio;
> > > >  }
> > > >
> > > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > > +static bool stabilise_mapping_rcu(struct folio *folio)
> > > > +{
> > > > +	struct address_space *mapping = READ_ONCE(folio->mapping);
> > > > +
> > > > +	rcu_read_lock();
> > > > +
> > > > +	return mapping == READ_ONCE(folio->mapping);
> > >
> > > This doesn't make sense; why bother reading the same thing twice?
> >
> > The intent is to see whether the folio->mapping has been truncated from
> > underneath us, as per the futex code that Kirill referred to which does
> > something similar [1].
>
> Yeah, but per that 3rd load you got nothing here. Also that futex code
> did the early load to deal with the !mapping case, but you're not doing
> that.
>

OK I drafted a response three times then deleted which shows you how this
stuff messes with your mind :)

I realise now that literally it is checking whether the previous !mapping
case and lack of action taken on that was valid for futex, rendering this
pointless for the logic here.

We do check !mapping later but obviously with the 'stable' mapping whose
relation to pre-rcu lock is irrelevant.

Thanks for patiently explaining this :) RCU remains an area I need to take
a closer look at generally.

> > > Who cares if the thing changes from before; what you care about is that
> > > the value you see has stable storage, this doesn't help with that.
> > >
> > > > +}
> > > > +
> > > > +static void unlock_rcu(void)
> > > > +{
> > > > +	rcu_read_unlock();
> > > > +}
> > > > +#else
> > > > +static bool stabilise_mapping_rcu(struct folio *)
> > > > +{
> > > > +	return true;
> > > > +}
> > > > +
> > > > +static void unlock_rcu(void)
> > > > +{
> > > > +}
> > > > +#endif
> > >
> > > Anyway, this all can go away. RCU can't progress while you have
> > > interrupts disabled anyway.
> >
> > There seems to be other code in the kernel that assumes that this is not
> > the case,
>
> Yeah, so Paul went back on forth on that a bit. It used to be true in
> the good old days when everything was simple. Then Paul made things
> complicated by separating out sched-RCU bh-RCU and 'regular' RCU
> flavours.
>
> At that point disabling IRQs would only (officially) inhibit sched and
> bh RCU flavours, but not the regular RCU.
>
> But then some years ago Linus convinced Paul that having all these
> separate RCU flavours with separate QS rules was a big pain in the
> backside and Paul munged them all together again.
>
> So now, anything that inhibits any of the RCU flavours inhibits them
> all. So disabling IRQs is sufficient.
>
> > i.e. the futex code, though not sure if that's being run with
> > IRQs disabled...
>
> That futex code runs in preemptible context, per the lock_page() that
> can sleep etc.. :-)

OK I am actually really happy to hear this because this means I can go
simplify this code significantly!

>
> > > > +/*
> > > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > > > + * FOLL_WRITE pin is permitted for a specific folio.
> > > > + *
> > > > + * This assumes the folio is stable and pinned.
> > > > + *
> > > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > > > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > > > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > > > + * so.
> > > > + *
> > > > + * This function cannot be as thorough as that one as the VMA is not available
> > > > + * in the fast path, so instead we whitelist known good cases.
> > > > + *
> > > > + * The folio is stable, but the mapping might not be. When truncating for
> > > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > > > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > > > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > > > + */
> > > > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > > > +{
> > > > +	bool ret;
> > > > +
> > > > +	/* hugetlb mappings do not require dirty tracking. */
> > > > +	if (folio_test_hugetlb(folio))
> > > > +		return true;
> > > > +
> > >
> > > This:
> > >
> > > > +	if (stabilise_mapping_rcu(folio)) {
> > > > +		struct address_space *mapping = folio_mapping(folio);
> > >
> > > And this is 3rd read of folio->mapping, just for giggles?
> >
> > I like to giggle :)
> >
> > Actually this is to handle the various cases in which the mapping might not
> > be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to
> > have a helper exposed for a check. Given previous review about duplication
> > I felt best to reuse this even though it does access again... yes I felt
> > weird about doing that.
>
> Right, I had a peek inside folio_mapping(), but the point is that this
> 3rd load might see yet *another* value of mapping from the prior two
> loads, rendering them somewhat worthless.
>
> > > > +
> > > > +		/*
> > > > +		 * Neither anonymous nor shmem-backed folios require
> > > > +		 * dirty tracking.
> > > > +		 */
> > > > +		ret = folio_test_anon(folio) ||
> > > > +			(mapping && shmem_mapping(mapping));
> > > > +	} else {
> > > > +		/* If the mapping is unstable, fallback to the slow path. */
> > > > +		ret = false;
> > > > +	}
> > > > +
> > > > +	unlock_rcu();
> > > > +
> > > > +	return ret;
> > >
> > > then becomes:
> > >
> > >
> > > 	if (folio_test_anon(folio))
> > > 		return true;
> >
> > This relies on the mapping so belongs below the lockdep assert imo.
>
> Oh, right you are.
>
> > >
> > > 	/*
> > > 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > > 	 * grace periods from making progress, IOW. they imply
> > > 	 * rcu_read_lock().
> > > 	 */
> > > 	lockdep_assert_irqs_disabled();
> > >
> > > 	/*
> > > 	 * Inodes and thus address_space are RCU freed and thus safe to
> > > 	 * access at this point.
> > > 	 */
> > > 	mapping = folio_mapping(folio);
> > > 	if (mapping && shmem_mapping(mapping))
> > > 		return true;
> > >
> > > 	return false;
> > >
> > > > +}
> >
> > I'm more than happy to do this (I'd rather drop the RCU bits if possible)
> > but need to be sure it's safe.
>
> GUP-fast as a whole relies on it :-)

Indeed, the only question was what happened with
CONFIG_MMU_GATHER_RCU_TABLE_FREE arches which appeared to require special
handling, but I'm very happy to hear they don't!

Will respin along the lines of your suggestion.
Peter Zijlstra May 2, 2023, 12:40 p.m. UTC | #9
On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:

> > >
> > >
> > > 	if (folio_test_anon(folio))
> > > 		return true;
> > 
> > This relies on the mapping so belongs below the lockdep assert imo.
> 
> Oh, right you are.
> 
> > >
> > > 	/*
> > > 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > > 	 * grace periods from making progress, IOW. they imply
> > > 	 * rcu_read_lock().
> > > 	 */
> > > 	lockdep_assert_irqs_disabled();
> > >
> > > 	/*
> > > 	 * Inodes and thus address_space are RCU freed and thus safe to
> > > 	 * access at this point.
> > > 	 */
> > > 	mapping = folio_mapping(folio);
> > > 	if (mapping && shmem_mapping(mapping))
> > > 		return true;
> > >
> > > 	return false;
> > >
> > > > +}

So arguably you should do *one* READ_ONCE() load of mapping and
consistently use that, this means open-coding both folio_test_anon() and
folio_mapping().
Christian Borntraeger May 2, 2023, 12:46 p.m. UTC | #10
Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> as we cannot rule out folios being cleaned and then a GUP user writing to
> them again and possibly marking them dirty unexpectedly.
> 
> This is especially egregious for long-term mappings (as indicated by the
> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> we have already done in the slow path.

Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
It would no longer work with file backed memory, correct?

See
arch/s390/kvm/pci.c

kvm_s390_pci_aif_enable
which does have
FOLL_WRITE | FOLL_LONGTERM
to

> 
> We have access to less information in the fast path as we cannot examine
> the VMA containing the mapping, however we can determine whether the folio
> is anonymous and then whitelist known-good mappings - specifically hugetlb
> and shmem mappings.
> 
> While we obtain a stable folio for this check, the mapping might not be, as
> a truncate could nullify it at any time. Since doing so requires mappings
> to be zapped, we can synchronise against a TLB shootdown operation.
> 
> For some architectures TLB shootdown is synchronised by IPI, against which
> we are protected as the GUP-fast operation is performed with interrupts
> disabled. However, other architectures which specify
> CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.
> 
> In these instances, we acquire an RCU lock while performing our checks. If
> we cannot get a stable mapping, we fall back to the slow path, as otherwise
> we'd have to walk the page tables again and it's simpler and more effective
> to just fall back.
> 
> It's important to note that there are no APIs allowing users to specify
> FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
> always rely on the fact that if we fail to pin on the fast path, the code
> will fall back to the slow path which can perform the more thorough check.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Kirill A . Shutemov <kirill@shutemov.name>
> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> ---
>   mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 85 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 0f09dec0906c..431618048a03 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -18,6 +18,7 @@
>   #include <linux/migrate.h>
>   #include <linux/mm_inline.h>
>   #include <linux/sched/mm.h>
> +#include <linux/shmem_fs.h>
>   
>   #include <asm/mmu_context.h>
>   #include <asm/tlbflush.h>
> @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
>   	return folio;
>   }
>   
> +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> +static bool stabilise_mapping_rcu(struct folio *folio)
> +{
> +	struct address_space *mapping = READ_ONCE(folio->mapping);
> +
> +	rcu_read_lock();
> +
> +	return mapping == READ_ONCE(folio->mapping);
> +}
> +
> +static void unlock_rcu(void)
> +{
> +	rcu_read_unlock();
> +}
> +#else
> +static bool stabilise_mapping_rcu(struct folio *)
> +{
> +	return true;
> +}
> +
> +static void unlock_rcu(void)
> +{
> +}
> +#endif
> +
> +/*
> + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> + * FOLL_WRITE pin is permitted for a specific folio.
> + *
> + * This assumes the folio is stable and pinned.
> + *
> + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> + * (see comment describing the writeable_file_mapping_allowed() function). We
> + * therefore try to avoid the most egregious case of a long-term mapping doing
> + * so.
> + *
> + * This function cannot be as thorough as that one as the VMA is not available
> + * in the fast path, so instead we whitelist known good cases.
> + *
> + * The folio is stable, but the mapping might not be. When truncating for
> + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> + * so we are safe from an IPI, but some architectures use an RCU lock for this
> + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> + */
> +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> +{
> +	bool ret;
> +
> +	/* hugetlb mappings do not require dirty tracking. */
> +	if (folio_test_hugetlb(folio))
> +		return true;
> +
> +	if (stabilise_mapping_rcu(folio)) {
> +		struct address_space *mapping = folio_mapping(folio);
> +
> +		/*
> +		 * Neither anonymous nor shmem-backed folios require
> +		 * dirty tracking.
> +		 */
> +		ret = folio_test_anon(folio) ||
> +			(mapping && shmem_mapping(mapping));
> +	} else {
> +		/* If the mapping is unstable, fallback to the slow path. */
> +		ret = false;
> +	}
> +
> +	unlock_rcu();
> +
> +	return ret;
> +}
> +
>   /**
>    * try_grab_folio() - Attempt to get or pin a folio.
>    * @page:  pointer to page to be grabbed
> @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
>    */
>   struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>   {
> +	bool is_longterm = flags & FOLL_LONGTERM;
> +
>   	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
>   		return NULL;
>   
> @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>   		 * right zone, so fail and let the caller fall back to the slow
>   		 * path.
>   		 */
> -		if (unlikely((flags & FOLL_LONGTERM) &&
> -			     !is_longterm_pinnable_page(page)))
> +		if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
>   			return NULL;
>   
>   		/*
> @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>   		if (!folio)
>   			return NULL;
>   
> +		/*
> +		 * Can this folio be safely pinned? We need to perform this
> +		 * check after the folio is stabilised.
> +		 */
> +		if ((flags & FOLL_WRITE) && is_longterm &&
> +		    !folio_longterm_write_pin_allowed(folio)) {
> +			folio_put_refs(folio, refs);
> +			return NULL;
> +		}
> +
>   		/*
>   		 * When pinning a large folio, use an exact count to track it.
>   		 *
David Hildenbrand May 2, 2023, 12:47 p.m. UTC | #11
On 02.05.23 14:40, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:
> 
>>>>
>>>>
>>>> 	if (folio_test_anon(folio))
>>>> 		return true;
>>>
>>> This relies on the mapping so belongs below the lockdep assert imo.
>>
>> Oh, right you are.
>>
>>>>
>>>> 	/*
>>>> 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
>>>> 	 * grace periods from making progress, IOW. they imply
>>>> 	 * rcu_read_lock().
>>>> 	 */
>>>> 	lockdep_assert_irqs_disabled();
>>>>
>>>> 	/*
>>>> 	 * Inodes and thus address_space are RCU freed and thus safe to
>>>> 	 * access at this point.
>>>> 	 */
>>>> 	mapping = folio_mapping(folio);
>>>> 	if (mapping && shmem_mapping(mapping))
>>>> 		return true;
>>>>
>>>> 	return false;
>>>>
>>>>> +}
> 
> So arguably you should do *one* READ_ONCE() load of mapping and
> consistently use that, this means open-coding both folio_test_anon() and
> folio_mapping().

Open-coding folio_test_anon() should not be required. We only care about 
PAGE_MAPPING_FLAGS stored alongside folio->mapping, that will stick 
around until the anon page was freed.

@Lorenzo, you might also want to special-case hugetlb directly using 
folio_test_hugetlb().
Lorenzo Stoakes May 2, 2023, 12:52 p.m. UTC | #12
On Tue, May 02, 2023 at 02:47:22PM +0200, David Hildenbrand wrote:
> On 02.05.23 14:40, Peter Zijlstra wrote:
> > On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:
> >
> > > > >
> > > > >
> > > > > 	if (folio_test_anon(folio))
> > > > > 		return true;
> > > >
> > > > This relies on the mapping so belongs below the lockdep assert imo.
> > >
> > > Oh, right you are.
> > >
> > > > >
> > > > > 	/*
> > > > > 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > > > > 	 * grace periods from making progress, IOW. they imply
> > > > > 	 * rcu_read_lock().
> > > > > 	 */
> > > > > 	lockdep_assert_irqs_disabled();
> > > > >
> > > > > 	/*
> > > > > 	 * Inodes and thus address_space are RCU freed and thus safe to
> > > > > 	 * access at this point.
> > > > > 	 */
> > > > > 	mapping = folio_mapping(folio);
> > > > > 	if (mapping && shmem_mapping(mapping))
> > > > > 		return true;
> > > > >
> > > > > 	return false;
> > > > >
> > > > > > +}
> >
> > So arguably you should do *one* READ_ONCE() load of mapping and
> > consistently use that, this means open-coding both folio_test_anon() and
> > folio_mapping().
>
> Open-coding folio_test_anon() should not be required. We only care about
> PAGE_MAPPING_FLAGS stored alongside folio->mapping, that will stick around
> until the anon page was freed.
>

Ack, good point!

> @Lorenzo, you might also want to special-case hugetlb directly using
> folio_test_hugetlb().
>

I already am :) I guess you mean when I respin along these lines? Will port
that across to.

> --
> Thanks,
>
> David / dhildenb
>
David Hildenbrand May 2, 2023, 12:53 p.m. UTC | #13
On 02.05.23 14:52, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 02:47:22PM +0200, David Hildenbrand wrote:
>> On 02.05.23 14:40, Peter Zijlstra wrote:
>>> On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:
>>>
>>>>>>
>>>>>>
>>>>>> 	if (folio_test_anon(folio))
>>>>>> 		return true;
>>>>>
>>>>> This relies on the mapping so belongs below the lockdep assert imo.
>>>>
>>>> Oh, right you are.
>>>>
>>>>>>
>>>>>> 	/*
>>>>>> 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
>>>>>> 	 * grace periods from making progress, IOW. they imply
>>>>>> 	 * rcu_read_lock().
>>>>>> 	 */
>>>>>> 	lockdep_assert_irqs_disabled();
>>>>>>
>>>>>> 	/*
>>>>>> 	 * Inodes and thus address_space are RCU freed and thus safe to
>>>>>> 	 * access at this point.
>>>>>> 	 */
>>>>>> 	mapping = folio_mapping(folio);
>>>>>> 	if (mapping && shmem_mapping(mapping))
>>>>>> 		return true;
>>>>>>
>>>>>> 	return false;
>>>>>>
>>>>>>> +}
>>>
>>> So arguably you should do *one* READ_ONCE() load of mapping and
>>> consistently use that, this means open-coding both folio_test_anon() and
>>> folio_mapping().
>>
>> Open-coding folio_test_anon() should not be required. We only care about
>> PAGE_MAPPING_FLAGS stored alongside folio->mapping, that will stick around
>> until the anon page was freed.
>>
> 
> Ack, good point!
> 
>> @Lorenzo, you might also want to special-case hugetlb directly using
>> folio_test_hugetlb().
>>
> 
> I already am :) I guess you mean when I respin along these lines? Will port
> that across to.

Ha, I only stared at that reply, good! :)
Lorenzo Stoakes May 2, 2023, 12:54 p.m. UTC | #14
On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
> > Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> > as we cannot rule out folios being cleaned and then a GUP user writing to
> > them again and possibly marking them dirty unexpectedly.
> >
> > This is especially egregious for long-term mappings (as indicated by the
> > use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> > we have already done in the slow path.
>
> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
> It would no longer work with file backed memory, correct?
>
> See
> arch/s390/kvm/pci.c
>
> kvm_s390_pci_aif_enable
> which does have
> FOLL_WRITE | FOLL_LONGTERM
> to
>

Does this memory map a dirty-tracked file? It's kind of hard to dig into where
the address originates from without going through a ton of code. In worst case
if the fast code doesn't find a whitelist it'll fall back to slow path which
explicitly checks for dirty-tracked filesystem.

We can reintroduce a flag to permit exceptions if this is really broken, are you
able to test? I don't have an s390 sat around :)

> >
> > We have access to less information in the fast path as we cannot examine
> > the VMA containing the mapping, however we can determine whether the folio
> > is anonymous and then whitelist known-good mappings - specifically hugetlb
> > and shmem mappings.
> >
> > While we obtain a stable folio for this check, the mapping might not be, as
> > a truncate could nullify it at any time. Since doing so requires mappings
> > to be zapped, we can synchronise against a TLB shootdown operation.
> >
> > For some architectures TLB shootdown is synchronised by IPI, against which
> > we are protected as the GUP-fast operation is performed with interrupts
> > disabled. However, other architectures which specify
> > CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.
> >
> > In these instances, we acquire an RCU lock while performing our checks. If
> > we cannot get a stable mapping, we fall back to the slow path, as otherwise
> > we'd have to walk the page tables again and it's simpler and more effective
> > to just fall back.
> >
> > It's important to note that there are no APIs allowing users to specify
> > FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
> > always rely on the fact that if we fail to pin on the fast path, the code
> > will fall back to the slow path which can perform the more thorough check.
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Suggested-by: Kirill A . Shutemov <kirill@shutemov.name>
> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> > ---
> >   mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> >   1 file changed, 85 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 0f09dec0906c..431618048a03 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -18,6 +18,7 @@
> >   #include <linux/migrate.h>
> >   #include <linux/mm_inline.h>
> >   #include <linux/sched/mm.h>
> > +#include <linux/shmem_fs.h>
> >   #include <asm/mmu_context.h>
> >   #include <asm/tlbflush.h>
> > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> >   	return folio;
> >   }
> > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > +static bool stabilise_mapping_rcu(struct folio *folio)
> > +{
> > +	struct address_space *mapping = READ_ONCE(folio->mapping);
> > +
> > +	rcu_read_lock();
> > +
> > +	return mapping == READ_ONCE(folio->mapping);
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +	rcu_read_unlock();
> > +}
> > +#else
> > +static bool stabilise_mapping_rcu(struct folio *)
> > +{
> > +	return true;
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +}
> > +#endif
> > +
> > +/*
> > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > + * FOLL_WRITE pin is permitted for a specific folio.
> > + *
> > + * This assumes the folio is stable and pinned.
> > + *
> > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > + * so.
> > + *
> > + * This function cannot be as thorough as that one as the VMA is not available
> > + * in the fast path, so instead we whitelist known good cases.
> > + *
> > + * The folio is stable, but the mapping might not be. When truncating for
> > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > + */
> > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > +{
> > +	bool ret;
> > +
> > +	/* hugetlb mappings do not require dirty tracking. */
> > +	if (folio_test_hugetlb(folio))
> > +		return true;
> > +
> > +	if (stabilise_mapping_rcu(folio)) {
> > +		struct address_space *mapping = folio_mapping(folio);
> > +
> > +		/*
> > +		 * Neither anonymous nor shmem-backed folios require
> > +		 * dirty tracking.
> > +		 */
> > +		ret = folio_test_anon(folio) ||
> > +			(mapping && shmem_mapping(mapping));
> > +	} else {
> > +		/* If the mapping is unstable, fallback to the slow path. */
> > +		ret = false;
> > +	}
> > +
> > +	unlock_rcu();
> > +
> > +	return ret;
> > +}
> > +
> >   /**
> >    * try_grab_folio() - Attempt to get or pin a folio.
> >    * @page:  pointer to page to be grabbed
> > @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> >    */
> >   struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >   {
> > +	bool is_longterm = flags & FOLL_LONGTERM;
> > +
> >   	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> >   		return NULL;
> > @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >   		 * right zone, so fail and let the caller fall back to the slow
> >   		 * path.
> >   		 */
> > -		if (unlikely((flags & FOLL_LONGTERM) &&
> > -			     !is_longterm_pinnable_page(page)))
> > +		if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
> >   			return NULL;
> >   		/*
> > @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >   		if (!folio)
> >   			return NULL;
> > +		/*
> > +		 * Can this folio be safely pinned? We need to perform this
> > +		 * check after the folio is stabilised.
> > +		 */
> > +		if ((flags & FOLL_WRITE) && is_longterm &&
> > +		    !folio_longterm_write_pin_allowed(folio)) {
> > +			folio_put_refs(folio, refs);
> > +			return NULL;
> > +		}
> > +
> >   		/*
> >   		 * When pinning a large folio, use an exact count to track it.
> >   		 *
Jason Gunthorpe May 2, 2023, 1:02 p.m. UTC | #15
On Tue, May 02, 2023 at 01:54:41PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
> > Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
> > > Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> > > as we cannot rule out folios being cleaned and then a GUP user writing to
> > > them again and possibly marking them dirty unexpectedly.
> > >
> > > This is especially egregious for long-term mappings (as indicated by the
> > > use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> > > we have already done in the slow path.
> >
> > Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
> > It would no longer work with file backed memory, correct?
> >
> > See
> > arch/s390/kvm/pci.c
> >
> > kvm_s390_pci_aif_enable
> > which does have
> > FOLL_WRITE | FOLL_LONGTERM
> > to
> >
> 
> Does this memory map a dirty-tracked file? It's kind of hard to dig into where
> the address originates from without going through a ton of code. In worst case
> if the fast code doesn't find a whitelist it'll fall back to slow path which
> explicitly checks for dirty-tracked filesystem.

This looks like the same stuff David was talking about, a qemu guest
with VFIO backed by a filesystem file..

Broadly though, arch kvm code should not call pin_user_pages().

Either it is KVM focused and it should use the shadow table and it's
existing mmu_notifier synchronization scheme

Or it is VFIO focused so it should use mdev/etc and have an unmap call
back.

I'm not really sure what this is for though..

Jason
Christian Borntraeger May 2, 2023, 1:04 p.m. UTC | #16
Am 02.05.23 um 14:54 schrieb Lorenzo Stoakes:
> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
>> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
>>> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
>>> as we cannot rule out folios being cleaned and then a GUP user writing to
>>> them again and possibly marking them dirty unexpectedly.
>>>
>>> This is especially egregious for long-term mappings (as indicated by the
>>> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
>>> we have already done in the slow path.
>>
>> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
>> It would no longer work with file backed memory, correct?
>>
>> See
>> arch/s390/kvm/pci.c
>>
>> kvm_s390_pci_aif_enable
>> which does have
>> FOLL_WRITE | FOLL_LONGTERM
>> to
>>
> 
> Does this memory map a dirty-tracked file? It's kind of hard to dig into where
> the address originates from without going through a ton of code. In worst case
> if the fast code doesn't find a whitelist it'll fall back to slow path which
> explicitly checks for dirty-tracked filesystem.

It does pin from whatever QEMU uses as backing for the guest.
> 
> We can reintroduce a flag to permit exceptions if this is really broken, are you
> able to test? I don't have an s390 sat around :)

Matt (Rosato on cc) probably can. In the end, it would mean having
   <memoryBacking>
     <source type="file"/>
   </memoryBacking>

In libvirt I guess.
Jason Gunthorpe May 2, 2023, 1:10 p.m. UTC | #17
On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
\> > We can reintroduce a flag to permit exceptions if this is really broken, are you
> > able to test? I don't have an s390 sat around :)
> 
> Matt (Rosato on cc) probably can. In the end, it would mean having
>   <memoryBacking>
>     <source type="file"/>
>   </memoryBacking>

This s390 code is the least of the problems, after this series VFIO
won't startup at all with this configuration.

Jason
David Hildenbrand May 2, 2023, 1:28 p.m. UTC | #18
On 02.05.23 15:10, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
>>> able to test? I don't have an s390 sat around :)
>>
>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>    <memoryBacking>
>>      <source type="file"/>
>>    </memoryBacking>
> 
> This s390 code is the least of the problems, after this series VFIO
> won't startup at all with this configuration.

Good question if the domain would fail to start. I recall that IOMMUs 
for zPCI are special on s390x. [1]

Well, zPCI is special. I cannot immediately tell when we would trigger 
long-term pinning.

[1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg875728.html
Paul E. McKenney May 2, 2023, 1:30 p.m. UTC | #19
On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > > >  	return folio;
> > > >  }
> > > >
> > > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > > +static bool stabilise_mapping_rcu(struct folio *folio)
> > > > +{
> > > > +	struct address_space *mapping = READ_ONCE(folio->mapping);
> > > > +
> > > > +	rcu_read_lock();
> > > > +
> > > > +	return mapping == READ_ONCE(folio->mapping);
> > >
> > > This doesn't make sense; why bother reading the same thing twice?
> > 
> > The intent is to see whether the folio->mapping has been truncated from
> > underneath us, as per the futex code that Kirill referred to which does
> > something similar [1].
> 
> Yeah, but per that 3rd load you got nothing here. Also that futex code
> did the early load to deal with the !mapping case, but you're not doing
> that.
> 
> > > Who cares if the thing changes from before; what you care about is that
> > > the value you see has stable storage, this doesn't help with that.
> > >
> > > > +}
> > > > +
> > > > +static void unlock_rcu(void)
> > > > +{
> > > > +	rcu_read_unlock();
> > > > +}
> > > > +#else
> > > > +static bool stabilise_mapping_rcu(struct folio *)
> > > > +{
> > > > +	return true;
> > > > +}
> > > > +
> > > > +static void unlock_rcu(void)
> > > > +{
> > > > +}
> > > > +#endif
> > >
> > > Anyway, this all can go away. RCU can't progress while you have
> > > interrupts disabled anyway.
> > 
> > There seems to be other code in the kernel that assumes that this is not
> > the case,
> 
> Yeah, so Paul went back on forth on that a bit. It used to be true in
> the good old days when everything was simple. Then Paul made things
> complicated by separating out sched-RCU bh-RCU and 'regular' RCU
> flavours.

Almost.  ;-)

The way I made things complicated was instead by creating preemptible RCU
for the real-time effort.  The original non-preemptible RCU was still
required for a number of use cases (for example, waiting for hardware
interrupt handlers), so it had to stay.  Separately, network-based DoS
attacks necessitated adding RCU bh.

> At that point disabling IRQs would only (officially) inhibit sched and
> bh RCU flavours, but not the regular RCU.

Quite right.

> But then some years ago Linus convinced Paul that having all these
> separate RCU flavours with separate QS rules was a big pain in the
> backside and Paul munged them all together again.

What happened was that someone used one flavor of RCU reader and a
different flavor of RCU updater, creating an exploitable bug.  

http://www2.rdrop.com/~paulmck/RCU/cve.2019.01.23e.pdf
https://www.youtube.com/watch?v=hZX1aokdNiY

And Linus asked that this bug be ruled out, so...

> So now, anything that inhibits any of the RCU flavours inhibits them
> all. So disabling IRQs is sufficient.

...for v4.20 and later, exactly.

							Thanx, Paul

> > i.e. the futex code, though not sure if that's being run with
> > IRQs disabled...
> 
> That futex code runs in preemptible context, per the lock_page() that
> can sleep etc.. :-)
> 
> > > > +/*
> > > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > > > + * FOLL_WRITE pin is permitted for a specific folio.
> > > > + *
> > > > + * This assumes the folio is stable and pinned.
> > > > + *
> > > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > > > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > > > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > > > + * so.
> > > > + *
> > > > + * This function cannot be as thorough as that one as the VMA is not available
> > > > + * in the fast path, so instead we whitelist known good cases.
> > > > + *
> > > > + * The folio is stable, but the mapping might not be. When truncating for
> > > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > > > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > > > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > > > + */
> > > > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > > > +{
> > > > +	bool ret;
> > > > +
> > > > +	/* hugetlb mappings do not require dirty tracking. */
> > > > +	if (folio_test_hugetlb(folio))
> > > > +		return true;
> > > > +
> > >
> > > This:
> > >
> > > > +	if (stabilise_mapping_rcu(folio)) {
> > > > +		struct address_space *mapping = folio_mapping(folio);
> > >
> > > And this is 3rd read of folio->mapping, just for giggles?
> > 
> > I like to giggle :)
> > 
> > Actually this is to handle the various cases in which the mapping might not
> > be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to
> > have a helper exposed for a check. Given previous review about duplication
> > I felt best to reuse this even though it does access again... yes I felt
> > weird about doing that.
> 
> Right, I had a peek inside folio_mapping(), but the point is that this
> 3rd load might see yet *another* value of mapping from the prior two
> loads, rendering them somewhat worthless.
> 
> > > > +
> > > > +		/*
> > > > +		 * Neither anonymous nor shmem-backed folios require
> > > > +		 * dirty tracking.
> > > > +		 */
> > > > +		ret = folio_test_anon(folio) ||
> > > > +			(mapping && shmem_mapping(mapping));
> > > > +	} else {
> > > > +		/* If the mapping is unstable, fallback to the slow path. */
> > > > +		ret = false;
> > > > +	}
> > > > +
> > > > +	unlock_rcu();
> > > > +
> > > > +	return ret;
> > >
> > > then becomes:
> > >
> > >
> > > 	if (folio_test_anon(folio))
> > > 		return true;
> > 
> > This relies on the mapping so belongs below the lockdep assert imo.
> 
> Oh, right you are.
> 
> > >
> > > 	/*
> > > 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > > 	 * grace periods from making progress, IOW. they imply
> > > 	 * rcu_read_lock().
> > > 	 */
> > > 	lockdep_assert_irqs_disabled();
> > >
> > > 	/*
> > > 	 * Inodes and thus address_space are RCU freed and thus safe to
> > > 	 * access at this point.
> > > 	 */
> > > 	mapping = folio_mapping(folio);
> > > 	if (mapping && shmem_mapping(mapping))
> > > 		return true;
> > >
> > > 	return false;
> > >
> > > > +}
> > 
> > I'm more than happy to do this (I'd rather drop the RCU bits if possible)
> > but need to be sure it's safe.
> 
> GUP-fast as a whole relies on it :-)
Matthew Rosato May 2, 2023, 1:35 p.m. UTC | #20
On 5/2/23 9:04 AM, Christian Borntraeger wrote:
> 
> 
> Am 02.05.23 um 14:54 schrieb Lorenzo Stoakes:
>> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
>>> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
>>>> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
>>>> as we cannot rule out folios being cleaned and then a GUP user writing to
>>>> them again and possibly marking them dirty unexpectedly.
>>>>
>>>> This is especially egregious for long-term mappings (as indicated by the
>>>> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
>>>> we have already done in the slow path.
>>>
>>> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
>>> It would no longer work with file backed memory, correct?
>>>
>>> See
>>> arch/s390/kvm/pci.c
>>>
>>> kvm_s390_pci_aif_enable
>>> which does have
>>> FOLL_WRITE | FOLL_LONGTERM
>>> to
>>>
>>
>> Does this memory map a dirty-tracked file? It's kind of hard to dig into where
>> the address originates from without going through a ton of code. In worst case
>> if the fast code doesn't find a whitelist it'll fall back to slow path which
>> explicitly checks for dirty-tracked filesystem.
> 
> It does pin from whatever QEMU uses as backing for the guest.
>>
>> We can reintroduce a flag to permit exceptions if this is really broken, are you
>> able to test? I don't have an s390 sat around :)
> 
> Matt (Rosato on cc) probably can. In the end, it would mean having
>   <memoryBacking>
>     <source type="file"/>
>   </memoryBacking>
> 
> In libvirt I guess.

I am running with this series applied using a QEMU guest with memory-backend-file (using the above libvirt snippet) for a few different PCI device types and AEN forwarding (e.g. what is setup in kvm_s390_pci_aif_enable) is still working.
Jason Gunthorpe May 2, 2023, 1:36 p.m. UTC | #21
On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote:
> On 02.05.23 15:10, Jason Gunthorpe wrote:
> > On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
> > \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
> > > > able to test? I don't have an s390 sat around :)
> > > 
> > > Matt (Rosato on cc) probably can. In the end, it would mean having
> > >    <memoryBacking>
> > >      <source type="file"/>
> > >    </memoryBacking>
> > 
> > This s390 code is the least of the problems, after this series VFIO
> > won't startup at all with this configuration.
> 
> Good question if the domain would fail to start. I recall that IOMMUs for
> zPCI are special on s390x. [1]

Not upstream they aren't.

> Well, zPCI is special. I cannot immediately tell when we would trigger
> long-term pinning.

zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container
and the normal pin_user_pages() path.

> [1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg875728.html

AFIACT this is talking about the nested IOMMU translation stuff.
RPCIT is the hypercall to invalidate the nested IOMMU table. I expect s390 is
going to have another go at implementing this using iommufd.

The stuff in that series like KVM_S390_ZPCIOP_REG_IOAT didn't make it
upstream.

Jason
Matthew Rosato May 2, 2023, 1:38 p.m. UTC | #22
On 5/2/23 9:28 AM, David Hildenbrand wrote:
> On 02.05.23 15:10, Jason Gunthorpe wrote:
>> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
>> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
>>>> able to test? I don't have an s390 sat around :)
>>>
>>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>>    <memoryBacking>
>>>      <source type="file"/>
>>>    </memoryBacking>
>>
>> This s390 code is the least of the problems, after this series VFIO
>> won't startup at all with this configuration.
> 
> Good question if the domain would fail to start. I recall that IOMMUs for zPCI are special on s390x. [1]

Eh, well what you referenced there never merged.  We will eventually do something like this, but via iommufd.  But right now s390 is still using vfio type1 iommu

> 
> Well, zPCI is special. I cannot immediately tell when we would trigger long-term pinning.

Enabling a zPCI device in a QEMU guest when using zPCI interpretation facilities

> 
> [1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg875728.html
>
David Hildenbrand May 2, 2023, 1:39 p.m. UTC | #23
On 02.05.23 15:36, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote:
>> On 02.05.23 15:10, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
>>> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
>>>>> able to test? I don't have an s390 sat around :)
>>>>
>>>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>>>     <memoryBacking>
>>>>       <source type="file"/>
>>>>     </memoryBacking>
>>>
>>> This s390 code is the least of the problems, after this series VFIO
>>> won't startup at all with this configuration.
>>
>> Good question if the domain would fail to start. I recall that IOMMUs for
>> zPCI are special on s390x. [1]
> 
> Not upstream they aren't.
> 
>> Well, zPCI is special. I cannot immediately tell when we would trigger
>> long-term pinning.
> 
> zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container
> and the normal pin_user_pages() path.


@Christian, Matthew: would we pin all guest memory when starting the 
domain (IIRC, like on x86-64) and fail early, or only when the guest 
issues rpcit instructions to map individual pages?
Matthew Rosato May 2, 2023, 1:43 p.m. UTC | #24
On 5/2/23 9:39 AM, David Hildenbrand wrote:
> On 02.05.23 15:36, Jason Gunthorpe wrote:
>> On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote:
>>> On 02.05.23 15:10, Jason Gunthorpe wrote:
>>>> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
>>>> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
>>>>>> able to test? I don't have an s390 sat around :)
>>>>>
>>>>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>>>>     <memoryBacking>
>>>>>       <source type="file"/>
>>>>>     </memoryBacking>
>>>>
>>>> This s390 code is the least of the problems, after this series VFIO
>>>> won't startup at all with this configuration.
>>>
>>> Good question if the domain would fail to start. I recall that IOMMUs for
>>> zPCI are special on s390x. [1]
>>
>> Not upstream they aren't.
>>
>>> Well, zPCI is special. I cannot immediately tell when we would trigger
>>> long-term pinning.
>>
>> zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container
>> and the normal pin_user_pages() path.
> 
> 
> @Christian, Matthew: would we pin all guest memory when starting the domain (IIRC, like on x86-64) and fail early, or only when the guest issues rpcit instructions to map individual pages?
> 

Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.

However, per Jason's prior suggestion, the initial implementation for s390 nesting via iommufd will pin all of guest memory when starting the domain.  I have something already working via iommufd built on top of the nesting infrastructure patches and QEMU iommufd series that are floating around; needs some cleanup, hoping to send an RFC in the coming weeks.  I can CC you if you'd like.
David Hildenbrand May 2, 2023, 1:47 p.m. UTC | #25
On 02.05.23 15:43, Matthew Rosato wrote:
> On 5/2/23 9:39 AM, David Hildenbrand wrote:
>> On 02.05.23 15:36, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote:
>>>> On 02.05.23 15:10, Jason Gunthorpe wrote:
>>>>> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
>>>>> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
>>>>>>> able to test? I don't have an s390 sat around :)
>>>>>>
>>>>>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>>>>>      <memoryBacking>
>>>>>>        <source type="file"/>
>>>>>>      </memoryBacking>
>>>>>
>>>>> This s390 code is the least of the problems, after this series VFIO
>>>>> won't startup at all with this configuration.
>>>>
>>>> Good question if the domain would fail to start. I recall that IOMMUs for
>>>> zPCI are special on s390x. [1]
>>>
>>> Not upstream they aren't.
>>>
>>>> Well, zPCI is special. I cannot immediately tell when we would trigger
>>>> long-term pinning.
>>>
>>> zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container
>>> and the normal pin_user_pages() path.
>>
>>
>> @Christian, Matthew: would we pin all guest memory when starting the domain (IIRC, like on x86-64) and fail early, or only when the guest issues rpcit instructions to map individual pages?
>>
> 
> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
> 

Okay, so IIRC we'll fail starting the domain early, that's good. And if 
we pin all guest memory (instead of small pieces dynamically), there is 
little existing use for file-backed RAM in such zPCI configurations 
(because memory cannot be reclaimed either way if it's all pinned), so 
likely there are no real existing users.

> However, per Jason's prior suggestion, the initial implementation for s390 nesting via iommufd will pin all of guest memory when starting the domain.  I have something already working via iommufd built on top of the nesting infrastructure patches and QEMU iommufd series that are floating around; needs some cleanup, hoping to send an RFC in the coming weeks.  I can CC you if you'd like.

Yes, please.
Jason Gunthorpe May 2, 2023, 1:50 p.m. UTC | #26
On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
> > Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
> 
> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
> pin all guest memory (instead of small pieces dynamically), there is little
> existing use for file-backed RAM in such zPCI configurations (because memory
> cannot be reclaimed either way if it's all pinned), so likely there are no
> real existing users.

Right, this is VFIO, the physical HW can't tolerate not having pinned
memory, so something somewhere is always pinning it.

Which, again, makes it weird/wrong that this KVM code is pinning it
again :\

Jason
Matthew Rosato May 2, 2023, 1:56 p.m. UTC | #27
On 5/2/23 9:50 AM, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>>
>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
>> pin all guest memory (instead of small pieces dynamically), there is little
>> existing use for file-backed RAM in such zPCI configurations (because memory
>> cannot be reclaimed either way if it's all pinned), so likely there are no
>> real existing users.
> 
> Right, this is VFIO, the physical HW can't tolerate not having pinned
> memory, so something somewhere is always pinning it.

I might have mis-explained above.

With iommufd nesting, we will pin everything upfront as a starting point.  

The current usage of vfio type1 iommu for s390 does not pin the entirety of guest memory upfront, it happens as guest RPCITs occur / type1 mappings are made.

> 
> Which, again, makes it weird/wrong that this KVM code is pinning it
> again :\
> 
> Jason
David Hildenbrand May 2, 2023, 1:57 p.m. UTC | #28
On 02.05.23 15:50, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>>
>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
>> pin all guest memory (instead of small pieces dynamically), there is little
>> existing use for file-backed RAM in such zPCI configurations (because memory
>> cannot be reclaimed either way if it's all pinned), so likely there are no
>> real existing users.
> 
> Right, this is VFIO, the physical HW can't tolerate not having pinned
> memory, so something somewhere is always pinning it.
> 
> Which, again, makes it weird/wrong that this KVM code is pinning it
> again :\

IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's 
for passthrough of (adapter) interrupts.

I have to speculate, but I guess for hardware to forward interrupts to 
the VM, it has to pin the special guest memory page that will receive 
the indications, to then configure (interrupt) hardware to target the 
interrupt indications to that special guest page (using a host physical 
address).
Jason Gunthorpe May 2, 2023, 2:04 p.m. UTC | #29
On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote:
> On 02.05.23 15:50, Jason Gunthorpe wrote:
> > On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
> > > > Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
> > > 
> > > Okay, so IIRC we'll fail starting the domain early, that's good. And if we
> > > pin all guest memory (instead of small pieces dynamically), there is little
> > > existing use for file-backed RAM in such zPCI configurations (because memory
> > > cannot be reclaimed either way if it's all pinned), so likely there are no
> > > real existing users.
> > 
> > Right, this is VFIO, the physical HW can't tolerate not having pinned
> > memory, so something somewhere is always pinning it.
> > 
> > Which, again, makes it weird/wrong that this KVM code is pinning it
> > again :\
> 
> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for
> passthrough of (adapter) interrupts.
> 
> I have to speculate, but I guess for hardware to forward interrupts to the
> VM, it has to pin the special guest memory page that will receive the
> indications, to then configure (interrupt) hardware to target the interrupt
> indications to that special guest page (using a host physical address).

Either the emulated access is "CPU" based happening through the KVM
page table so it should use mmu_notifier locking.

Or it is "DMA" and should go through an IOVA through iommufd pinning
and locking.

There is no other ground, nothing in KVM should be inventing its own
access methodology.

Jason
David Hildenbrand May 2, 2023, 2:15 p.m. UTC | #30
On 02.05.23 16:04, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote:
>> On 02.05.23 15:50, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>>>>
>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
>>>> pin all guest memory (instead of small pieces dynamically), there is little
>>>> existing use for file-backed RAM in such zPCI configurations (because memory
>>>> cannot be reclaimed either way if it's all pinned), so likely there are no
>>>> real existing users.
>>>
>>> Right, this is VFIO, the physical HW can't tolerate not having pinned
>>> memory, so something somewhere is always pinning it.
>>>
>>> Which, again, makes it weird/wrong that this KVM code is pinning it
>>> again :\
>>
>> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for
>> passthrough of (adapter) interrupts.
>>
>> I have to speculate, but I guess for hardware to forward interrupts to the
>> VM, it has to pin the special guest memory page that will receive the
>> indications, to then configure (interrupt) hardware to target the interrupt
>> indications to that special guest page (using a host physical address).
> 
> Either the emulated access is "CPU" based happening through the KVM
> page table so it should use mmu_notifier locking.
> 
> Or it is "DMA" and should go through an IOVA through iommufd pinning
> and locking.
> 
> There is no other ground, nothing in KVM should be inventing its own
> access methodology.

I might be wrong, but this seems to be a bit different.

It cannot tolerate page faults (needs a host physical address), so 
memory notifiers don't really apply. (as a side note, KVM on s390x does 
not use mmu notifiers as we know them)

It's kind-of like DMA, but it's not really DMA.  It's the CPU delivering 
interrupts for a specific device. So we're configuring the interrupt 
controller I guess to target a guest memory page.

But I have way too little knowledge about zPCI and the code in question 
here. And if it could be converted to iommufd (and if that's really the 
right mechanism to use here).

Hopefully Matthew knows the details and if this really needs to be 
special :)
Matthew Rosato May 2, 2023, 2:54 p.m. UTC | #31
On 5/2/23 10:15 AM, David Hildenbrand wrote:
> On 02.05.23 16:04, Jason Gunthorpe wrote:
>> On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote:
>>> On 02.05.23 15:50, Jason Gunthorpe wrote:
>>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
>>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>>>>>
>>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
>>>>> pin all guest memory (instead of small pieces dynamically), there is little
>>>>> existing use for file-backed RAM in such zPCI configurations (because memory
>>>>> cannot be reclaimed either way if it's all pinned), so likely there are no
>>>>> real existing users.
>>>>
>>>> Right, this is VFIO, the physical HW can't tolerate not having pinned
>>>> memory, so something somewhere is always pinning it.
>>>>
>>>> Which, again, makes it weird/wrong that this KVM code is pinning it
>>>> again :\
>>>
>>> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for
>>> passthrough of (adapter) interrupts.
>>>
>>> I have to speculate, but I guess for hardware to forward interrupts to the
>>> VM, it has to pin the special guest memory page that will receive the
>>> indications, to then configure (interrupt) hardware to target the interrupt
>>> indications to that special guest page (using a host physical address).
>>
>> Either the emulated access is "CPU" based happening through the KVM
>> page table so it should use mmu_notifier locking.
>>
>> Or it is "DMA" and should go through an IOVA through iommufd pinning
>> and locking.
>>
>> There is no other ground, nothing in KVM should be inventing its own
>> access methodology.
> 
> I might be wrong, but this seems to be a bit different.
> 
> It cannot tolerate page faults (needs a host physical address), so memory notifiers don't really apply. (as a side note, KVM on s390x does not use mmu notifiers as we know them)

The host physical address is one shared between underlying firmware and the host kvm.  Either might make changes to the referenced page and then issue an alert to the guest via a mechanism called GISA, giving impetus to the guest to look at that page and process the event.  As you say, firmware can't tolerate the page being unavailable; it's expecting that once we feed it that location it's always available until we remove it (kvm_s390_pci_aif_disable).

> 
> It's kind-of like DMA, but it's not really DMA.  It's the CPU delivering interrupts for a specific device. So we're configuring the interrupt controller I guess to target a guest memory page.
> 
> But I have way too little knowledge about zPCI and the code in question here. And if it could be converted to iommufd (and if that's really the right mechanism to use here).
> 	
> Hopefully Matthew knows the details and if this really needs to be special :)

I think I need to have a look at mmu_notifiers to understand that better, but in the end firmware still needs a reliable page to deliver events to.
David Hildenbrand May 2, 2023, 2:57 p.m. UTC | #32
On 02.05.23 15:35, Matthew Rosato wrote:
> On 5/2/23 9:04 AM, Christian Borntraeger wrote:
>>
>>
>> Am 02.05.23 um 14:54 schrieb Lorenzo Stoakes:
>>> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
>>>> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
>>>>> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
>>>>> as we cannot rule out folios being cleaned and then a GUP user writing to
>>>>> them again and possibly marking them dirty unexpectedly.
>>>>>
>>>>> This is especially egregious for long-term mappings (as indicated by the
>>>>> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
>>>>> we have already done in the slow path.
>>>>
>>>> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
>>>> It would no longer work with file backed memory, correct?
>>>>
>>>> See
>>>> arch/s390/kvm/pci.c
>>>>
>>>> kvm_s390_pci_aif_enable
>>>> which does have
>>>> FOLL_WRITE | FOLL_LONGTERM
>>>> to
>>>>
>>>
>>> Does this memory map a dirty-tracked file? It's kind of hard to dig into where
>>> the address originates from without going through a ton of code. In worst case
>>> if the fast code doesn't find a whitelist it'll fall back to slow path which
>>> explicitly checks for dirty-tracked filesystem.
>>
>> It does pin from whatever QEMU uses as backing for the guest.
>>>
>>> We can reintroduce a flag to permit exceptions if this is really broken, are you
>>> able to test? I don't have an s390 sat around :)
>>
>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>    <memoryBacking>
>>      <source type="file"/>
>>    </memoryBacking>
>>
>> In libvirt I guess.
> 
> I am running with this series applied using a QEMU guest with memory-backend-file (using the above libvirt snippet) for a few different PCI device types and AEN forwarding (e.g. what is setup in kvm_s390_pci_aif_enable) is still working.
> 

That's ... unexpected. :)

Either this series doesn't work as expected or you end up using a 
filesystem that is still compatible. But I guess most applicable 
filesystems (ext4, btrfs, xfs) all have a page_mkwrite callback and 
should, therefore, disallow long-term pinning with this series.
David Hildenbrand May 2, 2023, 3:09 p.m. UTC | #33
On 02.05.23 15:56, Matthew Rosato wrote:
> On 5/2/23 9:50 AM, Jason Gunthorpe wrote:
>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>>>
>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
>>> pin all guest memory (instead of small pieces dynamically), there is little
>>> existing use for file-backed RAM in such zPCI configurations (because memory
>>> cannot be reclaimed either way if it's all pinned), so likely there are no
>>> real existing users.
>>
>> Right, this is VFIO, the physical HW can't tolerate not having pinned
>> memory, so something somewhere is always pinning it.
> 
> I might have mis-explained above.
> 
> With iommufd nesting, we will pin everything upfront as a starting point.
> 
> The current usage of vfio type1 iommu for s390 does not pin the entirety of guest memory upfront, it happens as guest RPCITs occur / type1 mappings are made.

... so, after the domain started successfully on the libvirt/QEMU side ? :/

It would be great to confirm that. There might be a BUG in patch #2 (see 
my reply to patch #2) that might not allow you to reproduce it right now.
Lorenzo Stoakes May 2, 2023, 3:19 p.m. UTC | #34
On Tue, May 02, 2023 at 05:09:06PM +0200, David Hildenbrand wrote:
> On 02.05.23 15:56, Matthew Rosato wrote:
> > On 5/2/23 9:50 AM, Jason Gunthorpe wrote:
> > > On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
> > > > > Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
> > > >
> > > > Okay, so IIRC we'll fail starting the domain early, that's good. And if we
> > > > pin all guest memory (instead of small pieces dynamically), there is little
> > > > existing use for file-backed RAM in such zPCI configurations (because memory
> > > > cannot be reclaimed either way if it's all pinned), so likely there are no
> > > > real existing users.
> > >
> > > Right, this is VFIO, the physical HW can't tolerate not having pinned
> > > memory, so something somewhere is always pinning it.
> >
> > I might have mis-explained above.
> >
> > With iommufd nesting, we will pin everything upfront as a starting point.
> >
> > The current usage of vfio type1 iommu for s390 does not pin the entirety of guest memory upfront, it happens as guest RPCITs occur / type1 mappings are made.
>
> ... so, after the domain started successfully on the libvirt/QEMU side ? :/
>
> It would be great to confirm that. There might be a BUG in patch #2 (see my
> reply to patch #2) that might not allow you to reproduce it right now.
>

Yes apologies - thank you VERY much for doing this Matthew, but apologies, I
made rather a clanger in patch 2 which would mean fast patch degrading to slow
path would pass even for file-backed.

Will respin a v7 + cc you on that, if you could be so kind as to test that?

> --
> Thanks,
>
> David / dhildenb
>
Matthew Rosato May 2, 2023, 3:19 p.m. UTC | #35
On 5/2/23 10:57 AM, David Hildenbrand wrote:
> On 02.05.23 15:35, Matthew Rosato wrote:
>> On 5/2/23 9:04 AM, Christian Borntraeger wrote:
>>>
>>>
>>> Am 02.05.23 um 14:54 schrieb Lorenzo Stoakes:
>>>> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
>>>>> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
>>>>>> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
>>>>>> as we cannot rule out folios being cleaned and then a GUP user writing to
>>>>>> them again and possibly marking them dirty unexpectedly.
>>>>>>
>>>>>> This is especially egregious for long-term mappings (as indicated by the
>>>>>> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
>>>>>> we have already done in the slow path.
>>>>>
>>>>> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
>>>>> It would no longer work with file backed memory, correct?
>>>>>
>>>>> See
>>>>> arch/s390/kvm/pci.c
>>>>>
>>>>> kvm_s390_pci_aif_enable
>>>>> which does have
>>>>> FOLL_WRITE | FOLL_LONGTERM
>>>>> to
>>>>>
>>>>
>>>> Does this memory map a dirty-tracked file? It's kind of hard to dig into where
>>>> the address originates from without going through a ton of code. In worst case
>>>> if the fast code doesn't find a whitelist it'll fall back to slow path which
>>>> explicitly checks for dirty-tracked filesystem.
>>>
>>> It does pin from whatever QEMU uses as backing for the guest.
>>>>
>>>> We can reintroduce a flag to permit exceptions if this is really broken, are you
>>>> able to test? I don't have an s390 sat around :)
>>>
>>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>>    <memoryBacking>
>>>      <source type="file"/>
>>>    </memoryBacking>
>>>
>>> In libvirt I guess.
>>
>> I am running with this series applied using a QEMU guest with memory-backend-file (using the above libvirt snippet) for a few different PCI device types and AEN forwarding (e.g. what is setup in kvm_s390_pci_aif_enable) is still working.
>>
> 
> That's ... unexpected. :)
> 
> Either this series doesn't work as expected or you end up using a filesystem that is still compatible. But I guess most applicable filesystems (ext4, btrfs, xfs) all have a page_mkwrite callback and should, therefore, disallow long-term pinning with this series.
> 

The memory backend file is on ext4 in my tests.  A quick trace shows that pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) in kvm_s390_pci_aif_enable is still returning positive implying pages are being pinned.
Matthew Rosato May 2, 2023, 3:20 p.m. UTC | #36
On 5/2/23 11:19 AM, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 05:09:06PM +0200, David Hildenbrand wrote:
>> On 02.05.23 15:56, Matthew Rosato wrote:
>>> On 5/2/23 9:50 AM, Jason Gunthorpe wrote:
>>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
>>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>>>>>
>>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
>>>>> pin all guest memory (instead of small pieces dynamically), there is little
>>>>> existing use for file-backed RAM in such zPCI configurations (because memory
>>>>> cannot be reclaimed either way if it's all pinned), so likely there are no
>>>>> real existing users.
>>>>
>>>> Right, this is VFIO, the physical HW can't tolerate not having pinned
>>>> memory, so something somewhere is always pinning it.
>>>
>>> I might have mis-explained above.
>>>
>>> With iommufd nesting, we will pin everything upfront as a starting point.
>>>
>>> The current usage of vfio type1 iommu for s390 does not pin the entirety of guest memory upfront, it happens as guest RPCITs occur / type1 mappings are made.
>>
>> ... so, after the domain started successfully on the libvirt/QEMU side ? :/
>>
>> It would be great to confirm that. There might be a BUG in patch #2 (see my
>> reply to patch #2) that might not allow you to reproduce it right now.
>>
> 
> Yes apologies - thank you VERY much for doing this Matthew, but apologies, I
> made rather a clanger in patch 2 which would mean fast patch degrading to slow
> path would pass even for file-backed.
> 
> Will respin a v7 + cc you on that, if you could be so kind as to test that?
> 

Sure, will do!
Jason Gunthorpe May 2, 2023, 3:20 p.m. UTC | #37
On Tue, May 02, 2023 at 10:54:35AM -0400, Matthew Rosato wrote:
> On 5/2/23 10:15 AM, David Hildenbrand wrote:
> > On 02.05.23 16:04, Jason Gunthorpe wrote:
> >> On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote:
> >>> On 02.05.23 15:50, Jason Gunthorpe wrote:
> >>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
> >>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
> >>>>>
> >>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
> >>>>> pin all guest memory (instead of small pieces dynamically), there is little
> >>>>> existing use for file-backed RAM in such zPCI configurations (because memory
> >>>>> cannot be reclaimed either way if it's all pinned), so likely there are no
> >>>>> real existing users.
> >>>>
> >>>> Right, this is VFIO, the physical HW can't tolerate not having pinned
> >>>> memory, so something somewhere is always pinning it.
> >>>>
> >>>> Which, again, makes it weird/wrong that this KVM code is pinning it
> >>>> again :\
> >>>
> >>> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for
> >>> passthrough of (adapter) interrupts.
> >>>
> >>> I have to speculate, but I guess for hardware to forward interrupts to the
> >>> VM, it has to pin the special guest memory page that will receive the
> >>> indications, to then configure (interrupt) hardware to target the interrupt
> >>> indications to that special guest page (using a host physical address).
> >>
> >> Either the emulated access is "CPU" based happening through the KVM
> >> page table so it should use mmu_notifier locking.
> >>
> >> Or it is "DMA" and should go through an IOVA through iommufd pinning
> >> and locking.
> >>
> >> There is no other ground, nothing in KVM should be inventing its own
> >> access methodology.
> > 
> > I might be wrong, but this seems to be a bit different.
> >
> > It cannot tolerate page faults (needs a host physical address), so
> > memory notifiers don't really apply. (as a side note, KVM on s390x
> > does not use mmu notifiers as we know them)
>
> The host physical address is one shared between underlying firmware
> and the host kvm.  Either might make changes to the referenced page
> and then issue an alert to the guest via a mechanism called GISA,
> giving impetus to the guest to look at that page and process the
> event.  As you say, firmware can't tolerate the page being
> unavailable; it's expecting that once we feed it that location it's
> always available until we remove it (kvm_s390_pci_aif_disable).

That is a CPU access delegated to the FW without any locking scheme to
make it safe with KVM :\

It would have been better if FW could inject it through the kvm page
tables so it has some coherency.

Otherwise you have to call this "DMA", I think.

How does s390 avoid mmu notifiers without having lots of problems?? It
is not really optional to hook the invalidations if you need to build
a shadow page table..

Jason
Peter Xu May 2, 2023, 3:32 p.m. UTC | #38
On Tue, May 02, 2023 at 12:20:46PM -0300, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 10:54:35AM -0400, Matthew Rosato wrote:
> > On 5/2/23 10:15 AM, David Hildenbrand wrote:
> > > On 02.05.23 16:04, Jason Gunthorpe wrote:
> > >> On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote:
> > >>> On 02.05.23 15:50, Jason Gunthorpe wrote:
> > >>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
> > >>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
> > >>>>>
> > >>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
> > >>>>> pin all guest memory (instead of small pieces dynamically), there is little
> > >>>>> existing use for file-backed RAM in such zPCI configurations (because memory
> > >>>>> cannot be reclaimed either way if it's all pinned), so likely there are no
> > >>>>> real existing users.
> > >>>>
> > >>>> Right, this is VFIO, the physical HW can't tolerate not having pinned
> > >>>> memory, so something somewhere is always pinning it.
> > >>>>
> > >>>> Which, again, makes it weird/wrong that this KVM code is pinning it
> > >>>> again :\
> > >>>
> > >>> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for
> > >>> passthrough of (adapter) interrupts.
> > >>>
> > >>> I have to speculate, but I guess for hardware to forward interrupts to the
> > >>> VM, it has to pin the special guest memory page that will receive the
> > >>> indications, to then configure (interrupt) hardware to target the interrupt
> > >>> indications to that special guest page (using a host physical address).
> > >>
> > >> Either the emulated access is "CPU" based happening through the KVM
> > >> page table so it should use mmu_notifier locking.
> > >>
> > >> Or it is "DMA" and should go through an IOVA through iommufd pinning
> > >> and locking.
> > >>
> > >> There is no other ground, nothing in KVM should be inventing its own
> > >> access methodology.
> > > 
> > > I might be wrong, but this seems to be a bit different.
> > >
> > > It cannot tolerate page faults (needs a host physical address), so
> > > memory notifiers don't really apply. (as a side note, KVM on s390x
> > > does not use mmu notifiers as we know them)
> >
> > The host physical address is one shared between underlying firmware
> > and the host kvm.  Either might make changes to the referenced page
> > and then issue an alert to the guest via a mechanism called GISA,
> > giving impetus to the guest to look at that page and process the
> > event.  As you say, firmware can't tolerate the page being
> > unavailable; it's expecting that once we feed it that location it's
> > always available until we remove it (kvm_s390_pci_aif_disable).
> 
> That is a CPU access delegated to the FW without any locking scheme to
> make it safe with KVM :\
> 
> It would have been better if FW could inject it through the kvm page
> tables so it has some coherency.
> 
> Otherwise you have to call this "DMA", I think.
> 
> How does s390 avoid mmu notifiers without having lots of problems?? It
> is not really optional to hook the invalidations if you need to build
> a shadow page table..

Totally no idea on s390 details, but.. per my read above, if the firmware
needs to make sure the page is always available (so no way to fault it in
on demand), which means a longterm pinning seems appropriate here.

Then if pinned a must, there's no need for mmu notifiers (as the page will
simply not be invalidated anyway)?

Thanks,
Jason Gunthorpe May 2, 2023, 3:36 p.m. UTC | #39
On Tue, May 02, 2023 at 11:32:57AM -0400, Peter Xu wrote:
> > How does s390 avoid mmu notifiers without having lots of problems?? It
> > is not really optional to hook the invalidations if you need to build
> > a shadow page table..
> 
> Totally no idea on s390 details, but.. per my read above, if the firmware
> needs to make sure the page is always available (so no way to fault it in
> on demand), which means a longterm pinning seems appropriate here.
> 
> Then if pinned a must, there's no need for mmu notifiers (as the page will
> simply not be invalidated anyway)?

And what if someone deliberately changes the mapping?  memory hotplug
in the VM, or whatever?

We have ways to deal with this class of stuff in kvm and in
iommufd/vfio - this does not.

Jason
David Hildenbrand May 2, 2023, 3:45 p.m. UTC | #40
On 02.05.23 17:36, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 11:32:57AM -0400, Peter Xu wrote:
>>> How does s390 avoid mmu notifiers without having lots of problems?? It
>>> is not really optional to hook the invalidations if you need to build
>>> a shadow page table..
>>
>> Totally no idea on s390 details, but.. per my read above, if the firmware
>> needs to make sure the page is always available (so no way to fault it in
>> on demand), which means a longterm pinning seems appropriate here.
>>
>> Then if pinned a must, there's no need for mmu notifiers (as the page will
>> simply not be invalidated anyway)?
> 
> And what if someone deliberately changes the mapping?  memory hotplug
> in the VM, or whatever?

Besides s390 not supporting memory hotplug in VMs (yet): if the guest 
wants a different guest physical address, I guess that's the problem of 
the guest, and it can update it:

KVM_S390_ZPCIOP_REG_AEN is triggered from QEMU via 
s390_pci_kvm_aif_enable(), triggered by the guest via a special instruction.

If the hypervisor changes the mapping, it's just the same thing as 
mixing e.g. MADV_DONTNEED with longterm pinning in vfio: don't do it. 
And if you do it, you get to keep the mess you created for your VM.

Linux will make sure to not change the mapping: for example, page 
migration of a pinned page will fail.

But maybe I am missing something important here.
Jason Gunthorpe May 2, 2023, 4:06 p.m. UTC | #41
On Tue, May 02, 2023 at 05:45:40PM +0200, David Hildenbrand wrote:
> On 02.05.23 17:36, Jason Gunthorpe wrote:
> > On Tue, May 02, 2023 at 11:32:57AM -0400, Peter Xu wrote:
> > > > How does s390 avoid mmu notifiers without having lots of problems?? It
> > > > is not really optional to hook the invalidations if you need to build
> > > > a shadow page table..
> > > 
> > > Totally no idea on s390 details, but.. per my read above, if the firmware
> > > needs to make sure the page is always available (so no way to fault it in
> > > on demand), which means a longterm pinning seems appropriate here.
> > > 
> > > Then if pinned a must, there's no need for mmu notifiers (as the page will
> > > simply not be invalidated anyway)?
> > 
> > And what if someone deliberately changes the mapping?  memory hotplug
> > in the VM, or whatever?
> 
> Besides s390 not supporting memory hotplug in VMs (yet): if the guest wants
> a different guest physical address, I guess that's the problem of the guest,
> and it can update it:
> 
> KVM_S390_ZPCIOP_REG_AEN is triggered from QEMU via
> s390_pci_kvm_aif_enable(), triggered by the guest via a special instruction.
> 
> If the hypervisor changes the mapping, it's just the same thing as mixing
> e.g. MADV_DONTNEED with longterm pinning in vfio: don't do it. And if you do
> it, you get to keep the mess you created for your VM.
> 
> Linux will make sure to not change the mapping: for example, page migration
> of a pinned page will fail.
>
> But maybe I am missing something important here.

It missses the general architectural point why we have all these
shootdown mechanims in other places - plares are not supposed to make
these kinds of assumptions. When the userspace unplugs the memory from
KVM or unmaps it from VFIO it is not still being accessed by the
kernel.

Functional bug or not, it is inconsistent with how this is designed to
work.

Jason
David Hildenbrand May 2, 2023, 4:12 p.m. UTC | #42
On 02.05.23 18:06, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 05:45:40PM +0200, David Hildenbrand wrote:
>> On 02.05.23 17:36, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 11:32:57AM -0400, Peter Xu wrote:
>>>>> How does s390 avoid mmu notifiers without having lots of problems?? It
>>>>> is not really optional to hook the invalidations if you need to build
>>>>> a shadow page table..
>>>>
>>>> Totally no idea on s390 details, but.. per my read above, if the firmware
>>>> needs to make sure the page is always available (so no way to fault it in
>>>> on demand), which means a longterm pinning seems appropriate here.
>>>>
>>>> Then if pinned a must, there's no need for mmu notifiers (as the page will
>>>> simply not be invalidated anyway)?
>>>
>>> And what if someone deliberately changes the mapping?  memory hotplug
>>> in the VM, or whatever?
>>
>> Besides s390 not supporting memory hotplug in VMs (yet): if the guest wants
>> a different guest physical address, I guess that's the problem of the guest,
>> and it can update it:
>>
>> KVM_S390_ZPCIOP_REG_AEN is triggered from QEMU via
>> s390_pci_kvm_aif_enable(), triggered by the guest via a special instruction.
>>
>> If the hypervisor changes the mapping, it's just the same thing as mixing
>> e.g. MADV_DONTNEED with longterm pinning in vfio: don't do it. And if you do
>> it, you get to keep the mess you created for your VM.
>>
>> Linux will make sure to not change the mapping: for example, page migration
>> of a pinned page will fail.
>>
>> But maybe I am missing something important here.
> 
> It missses the general architectural point why we have all these
> shootdown mechanims in other places - plares are not supposed to make
> these kinds of assumptions. When the userspace unplugs the memory from
> KVM or unmaps it from VFIO it is not still being accessed by the
> kernel.

Yes. Like having memory in a vfio iommu v1 and doing the same (mremap, 
munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED 
(e.g., virtio-balloon) in QEMU with vfio.

> 
> Functional bug or not, it is inconsistent with how this is designed to
> work.

Sorry to say, I *really* don't see how that is supposed to work with a 
page that *cannot* be faulted back in on demand.
Jason Gunthorpe May 2, 2023, 4:19 p.m. UTC | #43
On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote:

> > It missses the general architectural point why we have all these
> > shootdown mechanims in other places - plares are not supposed to make
> > these kinds of assumptions. When the userspace unplugs the memory from
> > KVM or unmaps it from VFIO it is not still being accessed by the
> > kernel.
> 
> Yes. Like having memory in a vfio iommu v1 and doing the same (mremap,
> munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g.,
> virtio-balloon) in QEMU with vfio.

That is different, VFIO has it's own contract how it consumes the
memory from the MM and VFIO breaks all this stuff.

But when you tell VFIO to unmap the memory it doesn't keep accessing
it in the background like this does.

Jason
David Hildenbrand May 2, 2023, 4:32 p.m. UTC | #44
On 02.05.23 18:19, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote:
> 
>>> It missses the general architectural point why we have all these
>>> shootdown mechanims in other places - plares are not supposed to make
>>> these kinds of assumptions. When the userspace unplugs the memory from
>>> KVM or unmaps it from VFIO it is not still being accessed by the
>>> kernel.
>>
>> Yes. Like having memory in a vfio iommu v1 and doing the same (mremap,
>> munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g.,
>> virtio-balloon) in QEMU with vfio.
> 
> That is different, VFIO has it's own contract how it consumes the
> memory from the MM and VFIO breaks all this stuff.
> 
> But when you tell VFIO to unmap the memory it doesn't keep accessing
> it in the background like this does.

To me, this is similar to when QEMU (user space) triggers 
KVM_S390_ZPCIOP_DEREG_AEN, to tell KVM to disable AIF and stop using the 
page (1) When triggered by the guest explicitly (2) when resetting the 
VM (3) when resetting the virtual PCI device / configuration.

Interrupt gets unregistered from HW (which stops using the page), the 
pages get unpinned. Pages get no longer used.

I guess I am still missing (a) how this is fundamentally different (b) 
how it could be done differently.

I'd really be happy to learn how a better approach would look like that 
does not use longterm pinnings.

I don't see an easy way to not use longterm pinnings. When using mmu 
notifiers and getting notified about unmapping of a page (for whatever 
reason ... migration, swapout, unmap), you'd have to disable aif. But 
when to reenable it (maybe there would be a way)? Also, I'm not sure if 
this could even be visible by the guest, if it's suddenly no longer enabled.

Something for the s390x people to explore ... if HW would be providing a 
way to deal with that somehow.
Jason Gunthorpe May 2, 2023, 5:46 p.m. UTC | #45
On Tue, May 02, 2023 at 06:32:23PM +0200, David Hildenbrand wrote:
> On 02.05.23 18:19, Jason Gunthorpe wrote:
> > On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote:
> > 
> > > > It missses the general architectural point why we have all these
> > > > shootdown mechanims in other places - plares are not supposed to make
> > > > these kinds of assumptions. When the userspace unplugs the memory from
> > > > KVM or unmaps it from VFIO it is not still being accessed by the
> > > > kernel.
> > > 
> > > Yes. Like having memory in a vfio iommu v1 and doing the same (mremap,
> > > munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g.,
> > > virtio-balloon) in QEMU with vfio.
> > 
> > That is different, VFIO has it's own contract how it consumes the
> > memory from the MM and VFIO breaks all this stuff.
> > 
> > But when you tell VFIO to unmap the memory it doesn't keep accessing
> > it in the background like this does.
> 
> To me, this is similar to when QEMU (user space) triggers
> KVM_S390_ZPCIOP_DEREG_AEN, to tell KVM to disable AIF and stop using the
> page (1) When triggered by the guest explicitly (2) when resetting the VM
> (3) when resetting the virtual PCI device / configuration.
> 
> Interrupt gets unregistered from HW (which stops using the page), the pages
> get unpinned. Pages get no longer used.
> 
> I guess I am still missing (a) how this is fundamentally different (b) how
> it could be done differently.

It uses an address that is already scoped within the KVM memory map
and uses KVM's gpa_to_gfn() to translate it to some pinnable page

It is not some independent thing like VFIO, it is explicitly scoped
within the existing KVM structure and it does not follow any mutations
that are done to the gpa map through the usual KVM APIs.

> I'd really be happy to learn how a better approach would look like that does
> not use longterm pinnings.

Sounds like the FW sadly needs pinnings. This is why I said it looks
like DMA. If possible it would be better to get the pinning through
VFIO, eg as a mdev

Otherwise, it would have been cleaner if this was divorced from KVM
and took in a direct user pointer, then maybe you could make the
argument is its own thing with its own lifetime rules. (then you are
kind of making your own mdev)

Or, perhaps, this is really part of some radical "irqfd" that we've
been on and off talking about specifically to get this area of
interrupt bypass uAPI'd properly..

Jason
Matthew Rosato May 2, 2023, 5:59 p.m. UTC | #46
On 5/2/23 1:46 PM, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 06:32:23PM +0200, David Hildenbrand wrote:
>> On 02.05.23 18:19, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote:
>>>
>>>>> It missses the general architectural point why we have all these
>>>>> shootdown mechanims in other places - plares are not supposed to make
>>>>> these kinds of assumptions. When the userspace unplugs the memory from
>>>>> KVM or unmaps it from VFIO it is not still being accessed by the
>>>>> kernel.
>>>>
>>>> Yes. Like having memory in a vfio iommu v1 and doing the same (mremap,
>>>> munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g.,
>>>> virtio-balloon) in QEMU with vfio.
>>>
>>> That is different, VFIO has it's own contract how it consumes the
>>> memory from the MM and VFIO breaks all this stuff.
>>>
>>> But when you tell VFIO to unmap the memory it doesn't keep accessing
>>> it in the background like this does.
>>
>> To me, this is similar to when QEMU (user space) triggers
>> KVM_S390_ZPCIOP_DEREG_AEN, to tell KVM to disable AIF and stop using the
>> page (1) When triggered by the guest explicitly (2) when resetting the VM
>> (3) when resetting the virtual PCI device / configuration.
>>
>> Interrupt gets unregistered from HW (which stops using the page), the pages
>> get unpinned. Pages get no longer used.
>>
>> I guess I am still missing (a) how this is fundamentally different (b) how
>> it could be done differently.
> 
> It uses an address that is already scoped within the KVM memory map
> and uses KVM's gpa_to_gfn() to translate it to some pinnable page
> 
> It is not some independent thing like VFIO, it is explicitly scoped
> within the existing KVM structure and it does not follow any mutations
> that are done to the gpa map through the usual KVM APIs.
> 
>> I'd really be happy to learn how a better approach would look like that does
>> not use longterm pinnings.
> 
> Sounds like the FW sadly needs pinnings. This is why I said it looks
> like DMA. If possible it would be better to get the pinning through
> VFIO, eg as a mdev

Hrm, these are today handled as a vfio-pci device (e.g. no mdev) so that would be a pretty significant change.

> 
> Otherwise, it would have been cleaner if this was divorced from KVM
> and took in a direct user pointer, then maybe you could make the
> argument is its own thing with its own lifetime rules. (then you are
> kind of making your own mdev)

Problem there is that firmware needs both the location (where to indicate the event) and the identity of the KVM instance (what guest to ship the GISA alert to) so I don't think we can completely divorce them.

> 
> Or, perhaps, this is really part of some radical "irqfd" that we've
> been on and off talking about specifically to get this area of
> interrupt bypass uAPI'd properly..

I investigated that at one point and could not seem to get it to fit; I'll have another look there.
Jason Gunthorpe May 2, 2023, 6:09 p.m. UTC | #47
On Tue, May 02, 2023 at 01:59:09PM -0400, Matthew Rosato wrote:

> >> I'd really be happy to learn how a better approach would look like that does
> >> not use longterm pinnings.
> > 
> > Sounds like the FW sadly needs pinnings. This is why I said it looks
> > like DMA. If possible it would be better to get the pinning through
> > VFIO, eg as a mdev
> 
> Hrm, these are today handled as a vfio-pci device (e.g. no mdev) so that would be a pretty significant change.

I was thinking more of having your vIRQ controller as a vfio-mdev, so
qemu would have to open the irq controller as well as the kvm and the
PCI device

> > Otherwise, it would have been cleaner if this was divorced from KVM
> > and took in a direct user pointer, then maybe you could make the
> > argument is its own thing with its own lifetime rules. (then you are
> > kind of making your own mdev)
> 
> Problem there is that firmware needs both the location (where to
> indicate the event) and the identity of the KVM instance (what guest
> to ship the GISA alert to) so I don't think we can completely
> divorce them.

I should have said "kvm page table", being in the KVM ioctls is sort
of OK - but I'm really skeptical that KVM is a good place to put HW
device wrappers.

> > Or, perhaps, this is really part of some radical "irqfd" that we've
> > been on and off talking about specifically to get this area of
> > interrupt bypass uAPI'd properly..
> 
> I investigated that at one point and could not seem to get it to
> fit; I'll have another look there.

It is a long journey, but other arches have problems here too

Jason
David Hildenbrand May 2, 2023, 7:23 p.m. UTC | #48
On 02.05.23 19:46, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 06:32:23PM +0200, David Hildenbrand wrote:
>> On 02.05.23 18:19, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote:
>>>
>>>>> It missses the general architectural point why we have all these
>>>>> shootdown mechanims in other places - plares are not supposed to make
>>>>> these kinds of assumptions. When the userspace unplugs the memory from
>>>>> KVM or unmaps it from VFIO it is not still being accessed by the
>>>>> kernel.
>>>>
>>>> Yes. Like having memory in a vfio iommu v1 and doing the same (mremap,
>>>> munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g.,
>>>> virtio-balloon) in QEMU with vfio.
>>>
>>> That is different, VFIO has it's own contract how it consumes the
>>> memory from the MM and VFIO breaks all this stuff.
>>>
>>> But when you tell VFIO to unmap the memory it doesn't keep accessing
>>> it in the background like this does.
>>
>> To me, this is similar to when QEMU (user space) triggers
>> KVM_S390_ZPCIOP_DEREG_AEN, to tell KVM to disable AIF and stop using the
>> page (1) When triggered by the guest explicitly (2) when resetting the VM
>> (3) when resetting the virtual PCI device / configuration.
>>
>> Interrupt gets unregistered from HW (which stops using the page), the pages
>> get unpinned. Pages get no longer used.
>>
>> I guess I am still missing (a) how this is fundamentally different (b) how
>> it could be done differently.
> 
> It uses an address that is already scoped within the KVM memory map
> and uses KVM's gpa_to_gfn() to translate it to some pinnable page
> 
> It is not some independent thing like VFIO, it is explicitly scoped
> within the existing KVM structure and it does not follow any mutations
> that are done to the gpa map through the usual KVM APIs.

Right, it consumes guest physical addresses that are translated via the KVM memslots.
Agreed that it does not (and possibly cannot easily) update the hardware when the KVM
mapping (memslots) would ever change.

I guess it's also not documented that this is not supported.

> 
>> I'd really be happy to learn how a better approach would look like that does
>> not use longterm pinnings.
> 
> Sounds like the FW sadly needs pinnings. This is why I said it looks
> like DMA. If possible it would be better to get the pinning through
> VFIO, eg as a mdev
> 
> Otherwise, it would have been cleaner if this was divorced from KVM
> and took in a direct user pointer, then maybe you could make the
> argument is its own thing with its own lifetime rules. (then you are
> kind of making your own mdev)

It would be cleaner if user space would translate the GPA to a HVA and provid
  that, agreed ...

> 
> Or, perhaps, this is really part of some radical "irqfd" that we've
> been on and off talking about specifically to get this area of
> interrupt bypass uAPI'd properly..

Most probably. It's one of these very special cases ... thankfully:

$ git grep -i longterm | grep kvm
arch/s390/kvm/pci.c:    npages = pin_user_pages_fast(hva, 1, FOLL_WRITE | FOLL_LONGTERM, pages);
arch/s390/kvm/pci.c:            npages = pin_user_pages_fast(hva, 1, FOLL_WRITE | FOLL_LONGTERM,
diff mbox series

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 0f09dec0906c..431618048a03 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -18,6 +18,7 @@ 
 #include <linux/migrate.h>
 #include <linux/mm_inline.h>
 #include <linux/sched/mm.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -95,6 +96,77 @@  static inline struct folio *try_get_folio(struct page *page, int refs)
 	return folio;
 }
 
+#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
+static bool stabilise_mapping_rcu(struct folio *folio)
+{
+	struct address_space *mapping = READ_ONCE(folio->mapping);
+
+	rcu_read_lock();
+
+	return mapping == READ_ONCE(folio->mapping);
+}
+
+static void unlock_rcu(void)
+{
+	rcu_read_unlock();
+}
+#else
+static bool stabilise_mapping_rcu(struct folio *)
+{
+	return true;
+}
+
+static void unlock_rcu(void)
+{
+}
+#endif
+
+/*
+ * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
+ * FOLL_WRITE pin is permitted for a specific folio.
+ *
+ * This assumes the folio is stable and pinned.
+ *
+ * Writing to pinned file-backed dirty tracked folios is inherently problematic
+ * (see comment describing the writeable_file_mapping_allowed() function). We
+ * therefore try to avoid the most egregious case of a long-term mapping doing
+ * so.
+ *
+ * This function cannot be as thorough as that one as the VMA is not available
+ * in the fast path, so instead we whitelist known good cases.
+ *
+ * The folio is stable, but the mapping might not be. When truncating for
+ * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
+ * so we are safe from an IPI, but some architectures use an RCU lock for this
+ * operation, so we acquire an RCU lock to ensure the mapping is stable.
+ */
+static bool folio_longterm_write_pin_allowed(struct folio *folio)
+{
+	bool ret;
+
+	/* hugetlb mappings do not require dirty tracking. */
+	if (folio_test_hugetlb(folio))
+		return true;
+
+	if (stabilise_mapping_rcu(folio)) {
+		struct address_space *mapping = folio_mapping(folio);
+
+		/*
+		 * Neither anonymous nor shmem-backed folios require
+		 * dirty tracking.
+		 */
+		ret = folio_test_anon(folio) ||
+			(mapping && shmem_mapping(mapping));
+	} else {
+		/* If the mapping is unstable, fallback to the slow path. */
+		ret = false;
+	}
+
+	unlock_rcu();
+
+	return ret;
+}
+
 /**
  * try_grab_folio() - Attempt to get or pin a folio.
  * @page:  pointer to page to be grabbed
@@ -123,6 +195,8 @@  static inline struct folio *try_get_folio(struct page *page, int refs)
  */
 struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 {
+	bool is_longterm = flags & FOLL_LONGTERM;
+
 	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
 		return NULL;
 
@@ -136,8 +210,7 @@  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 		 * right zone, so fail and let the caller fall back to the slow
 		 * path.
 		 */
-		if (unlikely((flags & FOLL_LONGTERM) &&
-			     !is_longterm_pinnable_page(page)))
+		if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
 			return NULL;
 
 		/*
@@ -148,6 +221,16 @@  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 		if (!folio)
 			return NULL;
 
+		/*
+		 * Can this folio be safely pinned? We need to perform this
+		 * check after the folio is stabilised.
+		 */
+		if ((flags & FOLL_WRITE) && is_longterm &&
+		    !folio_longterm_write_pin_allowed(folio)) {
+			folio_put_refs(folio, refs);
+			return NULL;
+		}
+
 		/*
 		 * When pinning a large folio, use an exact count to track it.
 		 *