
[v6,3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

Message ID dee4f4ad6532b0f94d073da263526de334d5d7e0.1682981880.git.lstoakes@gmail.com (mailing list archive)
State New
Series mm/gup: disallow GUP writing to file-backed mappings by default

Commit Message

Lorenzo Stoakes May 1, 2023, 11:11 p.m. UTC
Writing to file-backed dirty-tracked mappings via GUP is inherently broken
as we cannot rule out folios being cleaned and then a GUP user writing to
them again and possibly marking them dirty unexpectedly.

This is especially egregious for long-term mappings (as indicated by the
use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
we have already done in the slow path.

We have access to less information in the fast path as we cannot examine
the VMA containing the mapping, however we can determine whether the folio
is anonymous and then whitelist known-good mappings - specifically hugetlb
and shmem mappings.

While we obtain a stable folio for this check, the mapping might not be
stable, as a truncate could nullify it at any time. Since truncation
requires mappings to be zapped, we can synchronise against the TLB
shootdown operation that the zap triggers.

For some architectures TLB shootdown is synchronised by IPI, against which
we are protected as the GUP-fast operation is performed with interrupts
disabled. However, other architectures which specify
CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.

In these instances, we acquire an RCU lock while performing our checks. If
we cannot get a stable mapping, we fall back to the slow path, as otherwise
we'd have to walk the page tables again and it's simpler and more effective
to just fall back.

It's important to note that there are no APIs allowing users to specify
FOLL_FAST_ONLY for a PUP-fast, let alone in combination with FOLL_LONGTERM,
so we can always rely on the fact that, if we fail to pin on the fast path,
the code will fall back to the slow path, which can perform the more
thorough check.

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Kirill A . Shutemov <kirill@shutemov.name>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 85 insertions(+), 2 deletions(-)
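For reference, the whitelist logic the commit message describes can be modelled as a tiny userspace sketch. The `folio_model` struct and its fields are hypothetical stand-ins for the real `folio_test_hugetlb()`, `folio_test_anon()` and `shmem_mapping()` checks; this is illustrative only, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for struct folio: each field models the result
 * of the corresponding kernel-side test on a stable, pinned folio.
 */
struct folio_model {
	bool is_hugetlb;	/* folio_test_hugetlb() */
	bool is_anon;		/* folio_test_anon() */
	bool is_shmem;		/* mapping && shmem_mapping(mapping) */
};

/*
 * Hugetlb, anonymous and shmem folios need no dirty tracking, so a
 * FOLL_LONGTERM | FOLL_WRITE pin is permitted; any other file-backed
 * folio makes GUP-fast bail out to the slow path, which can examine
 * the VMA and perform the more thorough check.
 */
static bool longterm_write_pin_allowed(const struct folio_model *f)
{
	return f->is_hugetlb || f->is_anon || f->is_shmem;
}
```

A plain file-backed folio (all fields false) is the only case that fails the check, matching the "whitelist known-good cases" approach described above.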

Comments

Jan Kara May 2, 2023, 11:23 a.m. UTC | #1
On Tue 02-05-23 13:13:34, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > +
> > +		/*
> > +		 * Neither anonymous nor shmem-backed folios require
> > +		 * dirty tracking.
> > +		 */
> > +		ret = folio_test_anon(folio) ||
> > +			(mapping && shmem_mapping(mapping));
> > +	} else {
> > +		/* If the mapping is unstable, fallback to the slow path. */
> > +		ret = false;
> > +	}
> > +
> > +	unlock_rcu();
> > +
> > +	return ret;
> 
> then becomes:
> 
> 
> 	if (folio_test_anon(folio))
> 		return true;
> 
> 	/*
> 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> 	 * grace periods from making progress, IOW. they imply
> 	 * rcu_read_lock().
> 	 */
> 	lockdep_assert_irqs_disabled();
> 
> 	/*
> 	 * Inodes and thus address_space are RCU freed and thus safe to
> 	 * access at this point.
> 	 */
> 	mapping = folio_mapping(folio);
> 	if (mapping && shmem_mapping(mapping))
> 		return true;
> 
> 	return false;

Yeah, that's even better.

								Honza
Lorenzo Stoakes May 2, 2023, 11:25 a.m. UTC | #2
On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> >  	return folio;
> >  }
> >
> > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > +static bool stabilise_mapping_rcu(struct folio *folio)
> > +{
> > +	struct address_space *mapping = READ_ONCE(folio->mapping);
> > +
> > +	rcu_read_lock();
> > +
> > +	return mapping == READ_ONCE(folio->mapping);
>
> This doesn't make sense; why bother reading the same thing twice?

The intent is to see whether the folio->mapping has been truncated from
underneath us, as per the futex code that Kirill referred to which does
something similar [1].

>
> Who cares if the thing changes from before; what you care about is that
> the value you see has stable storage, this doesn't help with that.
>
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +	rcu_read_unlock();
> > +}
> > +#else
> > +static bool stabilise_mapping_rcu(struct folio *)
> > +{
> > +	return true;
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +}
> > +#endif
>
> Anyway, this all can go away. RCU can't progress while you have
> interrupts disabled anyway.

There seems to be other code in the kernel that assumes this is not the
case, i.e. the futex code, though I'm not sure whether that runs with
IRQs disabled... if not, and it's absolutely certain that we need no
special handling for the RCU case, then happy days, and I'd be more than
glad to remove this bit.

I'm far from an expert on RCU (I need to gain a better understanding of it)
so I'm deferring how best to proceed on _this part_ to the community.

>
> > +/*
> > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > + * FOLL_WRITE pin is permitted for a specific folio.
> > + *
> > + * This assumes the folio is stable and pinned.
> > + *
> > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > + * so.
> > + *
> > + * This function cannot be as thorough as that one as the VMA is not available
> > + * in the fast path, so instead we whitelist known good cases.
> > + *
> > + * The folio is stable, but the mapping might not be. When truncating for
> > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > + */
> > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > +{
> > +	bool ret;
> > +
> > +	/* hugetlb mappings do not require dirty tracking. */
> > +	if (folio_test_hugetlb(folio))
> > +		return true;
> > +
>
> This:
>
> > +	if (stabilise_mapping_rcu(folio)) {
> > +		struct address_space *mapping = folio_mapping(folio);
>
> And this is 3rd read of folio->mapping, just for giggles?

I like to giggle :)

Actually this is to handle the various cases in which the mapping might not
be what we want (i.e. have PAGE_MAPPING_FLAGS set), which doesn't appear to
have a helper exposed for a check. Given previous review about duplication
I felt it best to reuse this even though it does access the mapping
again... yes, I felt weird about doing that.

>
> > +
> > +		/*
> > +		 * Neither anonymous nor shmem-backed folios require
> > +		 * dirty tracking.
> > +		 */
> > +		ret = folio_test_anon(folio) ||
> > +			(mapping && shmem_mapping(mapping));
> > +	} else {
> > +		/* If the mapping is unstable, fallback to the slow path. */
> > +		ret = false;
> > +	}
> > +
> > +	unlock_rcu();
> > +
> > +	return ret;
>
> then becomes:
>
>
> 	if (folio_test_anon(folio))
> 		return true;

This relies on the mapping so belongs below the lockdep assert imo.

>
> 	/*
> 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> 	 * grace periods from making progress, IOW. they imply
> 	 * rcu_read_lock().
> 	 */
> 	lockdep_assert_irqs_disabled();
>
> 	/*
> 	 * Inodes and thus address_space are RCU freed and thus safe to
> 	 * access at this point.
> 	 */
> 	mapping = folio_mapping(folio);
> 	if (mapping && shmem_mapping(mapping))
> 		return true;
>
> 	return false;
>
> > +}

I'm more than happy to do this (I'd rather drop the RCU bits if possible)
but need to be sure it's safe.
Lorenzo Stoakes May 2, 2023, 12:27 p.m. UTC | #3
On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > > >  	return folio;
> > > >  }
> > > >
> > > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > > +static bool stabilise_mapping_rcu(struct folio *folio)
> > > > +{
> > > > +	struct address_space *mapping = READ_ONCE(folio->mapping);
> > > > +
> > > > +	rcu_read_lock();
> > > > +
> > > > +	return mapping == READ_ONCE(folio->mapping);
> > >
> > > This doesn't make sense; why bother reading the same thing twice?
> >
> > The intent is to see whether the folio->mapping has been truncated from
> > underneath us, as per the futex code that Kirill referred to which does
> > something similar [1].
>
> Yeah, but per that 3rd load you got nothing here. Also that futex code
> did the early load to deal with the !mapping case, but you're not doing
> that.
>

OK I drafted a response three times then deleted which shows you how this
stuff messes with your mind :)

I realise now that it is literally checking whether the earlier !mapping
read, and the lack of action taken on it, remained valid for the futex
case, which renders the check pointless for the logic here.

We do check !mapping later, but obviously against the 'stable' mapping,
whose relation to the pre-RCU-lock value is irrelevant.

Thanks for patiently explaining this :) RCU remains an area I need to take
a closer look at generally.
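To make the point concrete, the flawed pattern can be shown in a toy single-threaded model, where a hypothetical `simulated_truncate()` stands in for a concurrent truncation racing with GUP-fast (illustrative userspace code, not the kernel helper):

```c
#include <assert.h>
#include <stddef.h>

/* Simulated folio->mapping cell; a truncate nullifies it. */
static int the_mapping = 42;		/* token "mapping" value */
static int *mapping_cell = &the_mapping;

static int *load_mapping(void) { return mapping_cell; }
static void simulated_truncate(void) { mapping_cell = NULL; }

/*
 * The flawed pattern: load mapping, take the "lock", re-load and
 * compare. Even when the two loads match, a truncation landing after
 * the comparison means a third load (the value actually used later)
 * can still differ, so the comparison proved nothing.
 */
static int stabilise_check_is_useless(void)
{
	int *before = load_mapping();
	/* rcu_read_lock() would go here */
	int *after = load_mapping();
	int stable = (before == after);	/* the check passes... */

	simulated_truncate();		/* ...but the race window remains */

	/* the value used afterwards differs despite the "stable" check */
	return stable && load_mapping() == NULL;
}
```

The function returns true precisely because both conditions hold: the two early loads matched, yet the later load sees a different (NULL) mapping.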

> > > Who cares if the thing changes from before; what you care about is that
> > > the value you see has stable storage, this doesn't help with that.
> > >
> > > > +}
> > > > +
> > > > +static void unlock_rcu(void)
> > > > +{
> > > > +	rcu_read_unlock();
> > > > +}
> > > > +#else
> > > > +static bool stabilise_mapping_rcu(struct folio *)
> > > > +{
> > > > +	return true;
> > > > +}
> > > > +
> > > > +static void unlock_rcu(void)
> > > > +{
> > > > +}
> > > > +#endif
> > >
> > > Anyway, this all can go away. RCU can't progress while you have
> > > interrupts disabled anyway.
> >
> > There seems to be other code in the kernel that assumes that this is not
> > the case,
>
> Yeah, so Paul went back and forth on that a bit. It used to be true in
> the good old days when everything was simple. Then Paul made things
> complicated by separating out sched-RCU bh-RCU and 'regular' RCU
> flavours.
>
> At that point disabling IRQs would only (officially) inhibit sched and
> bh RCU flavours, but not the regular RCU.
>
> But then some years ago Linus convinced Paul that having all these
> separate RCU flavours with separate QS rules was a big pain in the
> backside and Paul munged them all together again.
>
> So now, anything that inhibits any of the RCU flavours inhibits them
> all. So disabling IRQs is sufficient.
>
> > i.e. the futex code, though not sure if that's being run with
> > IRQs disabled...
>
> That futex code runs in preemptible context, per the lock_page() that
> can sleep etc.. :-)

OK I am actually really happy to hear this because this means I can go
simplify this code significantly!

>
> > > > +/*
> > > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > > > + * FOLL_WRITE pin is permitted for a specific folio.
> > > > + *
> > > > + * This assumes the folio is stable and pinned.
> > > > + *
> > > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > > > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > > > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > > > + * so.
> > > > + *
> > > > + * This function cannot be as thorough as that one as the VMA is not available
> > > > + * in the fast path, so instead we whitelist known good cases.
> > > > + *
> > > > + * The folio is stable, but the mapping might not be. When truncating for
> > > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > > > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > > > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > > > + */
> > > > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > > > +{
> > > > +	bool ret;
> > > > +
> > > > +	/* hugetlb mappings do not require dirty tracking. */
> > > > +	if (folio_test_hugetlb(folio))
> > > > +		return true;
> > > > +
> > >
> > > This:
> > >
> > > > +	if (stabilise_mapping_rcu(folio)) {
> > > > +		struct address_space *mapping = folio_mapping(folio);
> > >
> > > And this is 3rd read of folio->mapping, just for giggles?
> >
> > I like to giggle :)
> >
> > Actually this is to handle the various cases in which the mapping might not
> > be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to
> > have a helper exposed for a check. Given previous review about duplication
> > I felt best to reuse this even though it does access again... yes I felt
> > weird about doing that.
>
> Right, I had a peek inside folio_mapping(), but the point is that this
> 3rd load might see yet *another* value of mapping from the prior two
> loads, rendering them somewhat worthless.
>
> > > > +
> > > > +		/*
> > > > +		 * Neither anonymous nor shmem-backed folios require
> > > > +		 * dirty tracking.
> > > > +		 */
> > > > +		ret = folio_test_anon(folio) ||
> > > > +			(mapping && shmem_mapping(mapping));
> > > > +	} else {
> > > > +		/* If the mapping is unstable, fallback to the slow path. */
> > > > +		ret = false;
> > > > +	}
> > > > +
> > > > +	unlock_rcu();
> > > > +
> > > > +	return ret;
> > >
> > > then becomes:
> > >
> > >
> > > 	if (folio_test_anon(folio))
> > > 		return true;
> >
> > This relies on the mapping so belongs below the lockdep assert imo.
>
> Oh, right you are.
>
> > >
> > > 	/*
> > > 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > > 	 * grace periods from making progress, IOW. they imply
> > > 	 * rcu_read_lock().
> > > 	 */
> > > 	lockdep_assert_irqs_disabled();
> > >
> > > 	/*
> > > 	 * Inodes and thus address_space are RCU freed and thus safe to
> > > 	 * access at this point.
> > > 	 */
> > > 	mapping = folio_mapping(folio);
> > > 	if (mapping && shmem_mapping(mapping))
> > > 		return true;
> > >
> > > 	return false;
> > >
> > > > +}
> >
> > I'm more than happy to do this (I'd rather drop the RCU bits if possible)
> > but need to be sure it's safe.
>
> GUP-fast as a whole relies on it :-)

Indeed, the only question was what happened with
CONFIG_MMU_GATHER_RCU_TABLE_FREE arches, which appeared to require special
handling, but I'm very happy to hear they don't!

Will respin along the lines of your suggestion.
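Assembled from the suggestions above, the respin might collapse the helper to something like the following. This is a sketch of the proposed direction, not necessarily the final committed code:

```c
static bool folio_longterm_write_pin_allowed(struct folio *folio)
{
	struct address_space *mapping;

	/* hugetlb mappings do not require dirty tracking. */
	if (folio_test_hugetlb(folio))
		return true;

	/*
	 * GUP-fast runs with IRQs disabled, which inhibits RCU grace
	 * periods, i.e. implies rcu_read_lock(); inodes and thus
	 * address_space are RCU-freed, so the mapping is safe to access.
	 */
	lockdep_assert_irqs_disabled();

	/* The anon check reads folio->mapping, so it belongs below the
	 * lockdep assert too. */
	if (folio_test_anon(folio))
		return true;

	mapping = folio_mapping(folio);
	return mapping && shmem_mapping(mapping);
}
```

This drops the `stabilise_mapping_rcu()`/`unlock_rcu()` pair entirely, relying on disabled IRQs for the RCU guarantee.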
Paul E. McKenney May 2, 2023, 1:30 p.m. UTC | #4
On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > > >  	return folio;
> > > >  }
> > > >
> > > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > > +static bool stabilise_mapping_rcu(struct folio *folio)
> > > > +{
> > > > +	struct address_space *mapping = READ_ONCE(folio->mapping);
> > > > +
> > > > +	rcu_read_lock();
> > > > +
> > > > +	return mapping == READ_ONCE(folio->mapping);
> > >
> > > This doesn't make sense; why bother reading the same thing twice?
> > 
> > The intent is to see whether the folio->mapping has been truncated from
> > underneath us, as per the futex code that Kirill referred to which does
> > something similar [1].
> 
> Yeah, but per that 3rd load you got nothing here. Also that futex code
> did the early load to deal with the !mapping case, but you're not doing
> that.
> 
> > > Who cares if the thing changes from before; what you care about is that
> > > the value you see has stable storage, this doesn't help with that.
> > >
> > > > +}
> > > > +
> > > > +static void unlock_rcu(void)
> > > > +{
> > > > +	rcu_read_unlock();
> > > > +}
> > > > +#else
> > > > +static bool stabilise_mapping_rcu(struct folio *)
> > > > +{
> > > > +	return true;
> > > > +}
> > > > +
> > > > +static void unlock_rcu(void)
> > > > +{
> > > > +}
> > > > +#endif
> > >
> > > Anyway, this all can go away. RCU can't progress while you have
> > > interrupts disabled anyway.
> > 
> > There seems to be other code in the kernel that assumes that this is not
> > the case,
> 
> Yeah, so Paul went back and forth on that a bit. It used to be true in
> the good old days when everything was simple. Then Paul made things
> complicated by separating out sched-RCU bh-RCU and 'regular' RCU
> flavours.

Almost.  ;-)

The way I made things complicated was instead by creating preemptible RCU
for the real-time effort.  The original non-preemptible RCU was still
required for a number of use cases (for example, waiting for hardware
interrupt handlers), so it had to stay.  Separately, network-based DoS
attacks necessitated adding RCU bh.

> At that point disabling IRQs would only (officially) inhibit sched and
> bh RCU flavours, but not the regular RCU.

Quite right.

> But then some years ago Linus convinced Paul that having all these
> separate RCU flavours with separate QS rules was a big pain in the
> backside and Paul munged them all together again.

What happened was that someone used one flavor of RCU reader and a
different flavor of RCU updater, creating an exploitable bug.  

http://www2.rdrop.com/~paulmck/RCU/cve.2019.01.23e.pdf
https://www.youtube.com/watch?v=hZX1aokdNiY

And Linus asked that this bug be ruled out, so...

> So now, anything that inhibits any of the RCU flavours inhibits them
> all. So disabling IRQs is sufficient.

...for v4.20 and later, exactly.

							Thanx, Paul

> > i.e. the futex code, though not sure if that's being run with
> > IRQs disabled...
> 
> That futex code runs in preemptible context, per the lock_page() that
> can sleep etc.. :-)
> 
> > > > +/*
> > > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > > > + * FOLL_WRITE pin is permitted for a specific folio.
> > > > + *
> > > > + * This assumes the folio is stable and pinned.
> > > > + *
> > > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > > > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > > > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > > > + * so.
> > > > + *
> > > > + * This function cannot be as thorough as that one as the VMA is not available
> > > > + * in the fast path, so instead we whitelist known good cases.
> > > > + *
> > > > + * The folio is stable, but the mapping might not be. When truncating for
> > > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > > > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > > > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > > > + */
> > > > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > > > +{
> > > > +	bool ret;
> > > > +
> > > > +	/* hugetlb mappings do not require dirty tracking. */
> > > > +	if (folio_test_hugetlb(folio))
> > > > +		return true;
> > > > +
> > >
> > > This:
> > >
> > > > +	if (stabilise_mapping_rcu(folio)) {
> > > > +		struct address_space *mapping = folio_mapping(folio);
> > >
> > > And this is 3rd read of folio->mapping, just for giggles?
> > 
> > I like to giggle :)
> > 
> > Actually this is to handle the various cases in which the mapping might not
> > be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to
> > have a helper exposed for a check. Given previous review about duplication
> > I felt best to reuse this even though it does access again... yes I felt
> > weird about doing that.
> 
> Right, I had a peek inside folio_mapping(), but the point is that this
> 3rd load might see yet *another* value of mapping from the prior two
> loads, rendering them somewhat worthless.
> 
> > > > +
> > > > +		/*
> > > > +		 * Neither anonymous nor shmem-backed folios require
> > > > +		 * dirty tracking.
> > > > +		 */
> > > > +		ret = folio_test_anon(folio) ||
> > > > +			(mapping && shmem_mapping(mapping));
> > > > +	} else {
> > > > +		/* If the mapping is unstable, fallback to the slow path. */
> > > > +		ret = false;
> > > > +	}
> > > > +
> > > > +	unlock_rcu();
> > > > +
> > > > +	return ret;
> > >
> > > then becomes:
> > >
> > >
> > > 	if (folio_test_anon(folio))
> > > 		return true;
> > 
> > This relies on the mapping so belongs below the lockdep assert imo.
> 
> Oh, right you are.
> 
> > >
> > > 	/*
> > > 	 * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > > 	 * grace periods from making progress, IOW. they imply
> > > 	 * rcu_read_lock().
> > > 	 */
> > > 	lockdep_assert_irqs_disabled();
> > >
> > > 	/*
> > > 	 * Inodes and thus address_space are RCU freed and thus safe to
> > > 	 * access at this point.
> > > 	 */
> > > 	mapping = folio_mapping(folio);
> > > 	if (mapping && shmem_mapping(mapping))
> > > 		return true;
> > >
> > > 	return false;
> > >
> > > > +}
> > 
> > I'm more than happy to do this (I'd rather drop the RCU bits if possible)
> > but need to be sure it's safe.
> 
> GUP-fast as a whole relies on it :-)
Matthew Rosato May 2, 2023, 1:35 p.m. UTC | #5
On 5/2/23 9:04 AM, Christian Borntraeger wrote:
> 
> 
> Am 02.05.23 um 14:54 schrieb Lorenzo Stoakes:
>> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
>>> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
>>>> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
>>>> as we cannot rule out folios being cleaned and then a GUP user writing to
>>>> them again and possibly marking them dirty unexpectedly.
>>>>
>>>> This is especially egregious for long-term mappings (as indicated by the
>>>> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
>>>> we have already done in the slow path.
>>>
>>> Hmm, does this interfere with KVM on s390 and PCI interpretation of interrupt delivery?
>>> It would no longer work with file backed memory, correct?
>>>
>>> See
>>> arch/s390/kvm/pci.c
>>>
>>> kvm_s390_pci_aif_enable
>>> which does have
>>> FOLL_WRITE | FOLL_LONGTERM
>>> to
>>>
>>
>> Does this memory map a dirty-tracked file? It's kind of hard to dig into where
>> the address originates from without going through a ton of code. In worst case
>> if the fast code doesn't find a whitelist it'll fall back to slow path which
>> explicitly checks for dirty-tracked filesystem.
> 
> It does pin from whatever QEMU uses as backing for the guest.
>>
>> We can reintroduce a flag to permit exceptions if this is really broken, are you
>> able to test? I don't have an s390 sat around :)
> 
> Matt (Rosato on cc) probably can. In the end, it would mean having
>   <memoryBacking>
>     <source type="file"/>
>   </memoryBacking>
> 
> In libvirt I guess.

I am running with this series applied using a QEMU guest with memory-backend-file (using the above libvirt snippet) for a few different PCI device types and AEN forwarding (e.g. what is setup in kvm_s390_pci_aif_enable) is still working.
David Hildenbrand May 2, 2023, 1:39 p.m. UTC | #6
On 02.05.23 15:36, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote:
>> On 02.05.23 15:10, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
>>> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
>>>>> able to test? I don't have an s390 sat around :)
>>>>
>>>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>>>     <memoryBacking>
>>>>       <source type="file"/>
>>>>     </memoryBacking>
>>>
>>> This s390 code is the least of the problems, after this series VFIO
>>> won't startup at all with this configuration.
>>
>> Good question if the domain would fail to start. I recall that IOMMUs for
>> zPCI are special on s390x. [1]
> 
> Not upstream they aren't.
> 
>> Well, zPCI is special. I cannot immediately tell when we would trigger
>> long-term pinning.
> 
> zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container
> and the normal pin_user_pages() path.


@Christian, Matthew: would we pin all guest memory when starting the 
domain (IIRC, like on x86-64) and fail early, or only when the guest 
issues rpcit instructions to map individual pages?
David Hildenbrand May 2, 2023, 1:47 p.m. UTC | #7
On 02.05.23 15:43, Matthew Rosato wrote:
> On 5/2/23 9:39 AM, David Hildenbrand wrote:
>> On 02.05.23 15:36, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote:
>>>> On 02.05.23 15:10, Jason Gunthorpe wrote:
>>>>> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
>>>>> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
>>>>>>> able to test? I don't have an s390 sat around :)
>>>>>>
>>>>>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>>>>>      <memoryBacking>
>>>>>>        <source type="file"/>
>>>>>>      </memoryBacking>
>>>>>
>>>>> This s390 code is the least of the problems, after this series VFIO
>>>>> won't startup at all with this configuration.
>>>>
>>>> Good question if the domain would fail to start. I recall that IOMMUs for
>>>> zPCI are special on s390x. [1]
>>>
>>> Not upstream they aren't.
>>>
>>>> Well, zPCI is special. I cannot immediately tell when we would trigger
>>>> long-term pinning.
>>>
>>> zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container
>>> and the normal pin_user_pages() path.
>>
>>
>> @Christian, Matthew: would we pin all guest memory when starting the domain (IIRC, like on x86-64) and fail early, or only when the guest issues rpcit instructions to map individual pages?
>>
> 
> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
> 

Okay, so IIRC we'll fail starting the domain early, that's good. And if 
we pin all guest memory (instead of small pieces dynamically), there is 
little existing use for file-backed RAM in such zPCI configurations 
(because memory cannot be reclaimed either way if it's all pinned), so 
likely there are no real existing users.

> However, per Jason's prior suggestion, the initial implementation for s390 nesting via iommufd will pin all of guest memory when starting the domain.  I have something already working via iommufd built on top of the nesting infrastructure patches and QEMU iommufd series that are floating around; needs some cleanup, hoping to send an RFC in the coming weeks.  I can CC you if you'd like.

Yes, please.

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 0f09dec0906c..431618048a03 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -18,6 +18,7 @@ 
 #include <linux/migrate.h>
 #include <linux/mm_inline.h>
 #include <linux/sched/mm.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -95,6 +96,77 @@  static inline struct folio *try_get_folio(struct page *page, int refs)
 	return folio;
 }
 
+#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
+static bool stabilise_mapping_rcu(struct folio *folio)
+{
+	struct address_space *mapping = READ_ONCE(folio->mapping);
+
+	rcu_read_lock();
+
+	return mapping == READ_ONCE(folio->mapping);
+}
+
+static void unlock_rcu(void)
+{
+	rcu_read_unlock();
+}
+#else
+static bool stabilise_mapping_rcu(struct folio *)
+{
+	return true;
+}
+
+static void unlock_rcu(void)
+{
+}
+#endif
+
+/*
+ * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
+ * FOLL_WRITE pin is permitted for a specific folio.
+ *
+ * This assumes the folio is stable and pinned.
+ *
+ * Writing to pinned file-backed dirty tracked folios is inherently problematic
+ * (see comment describing the writeable_file_mapping_allowed() function). We
+ * therefore try to avoid the most egregious case of a long-term mapping doing
+ * so.
+ *
+ * This function cannot be as thorough as that one as the VMA is not available
+ * in the fast path, so instead we whitelist known good cases.
+ *
+ * The folio is stable, but the mapping might not be. When truncating for
+ * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
+ * so we are safe from an IPI, but some architectures use an RCU lock for this
+ * operation, so we acquire an RCU lock to ensure the mapping is stable.
+ */
+static bool folio_longterm_write_pin_allowed(struct folio *folio)
+{
+	bool ret;
+
+	/* hugetlb mappings do not require dirty tracking. */
+	if (folio_test_hugetlb(folio))
+		return true;
+
+	if (stabilise_mapping_rcu(folio)) {
+		struct address_space *mapping = folio_mapping(folio);
+
+		/*
+		 * Neither anonymous nor shmem-backed folios require
+		 * dirty tracking.
+		 */
+		ret = folio_test_anon(folio) ||
+			(mapping && shmem_mapping(mapping));
+	} else {
+		/* If the mapping is unstable, fallback to the slow path. */
+		ret = false;
+	}
+
+	unlock_rcu();
+
+	return ret;
+}
+
 /**
  * try_grab_folio() - Attempt to get or pin a folio.
  * @page:  pointer to page to be grabbed
@@ -123,6 +195,8 @@  static inline struct folio *try_get_folio(struct page *page, int refs)
  */
 struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 {
+	bool is_longterm = flags & FOLL_LONGTERM;
+
 	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
 		return NULL;
 
@@ -136,8 +210,7 @@  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 		 * right zone, so fail and let the caller fall back to the slow
 		 * path.
 		 */
-		if (unlikely((flags & FOLL_LONGTERM) &&
-			     !is_longterm_pinnable_page(page)))
+		if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
 			return NULL;
 
 		/*
@@ -148,6 +221,16 @@  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 		if (!folio)
 			return NULL;
 
+		/*
+		 * Can this folio be safely pinned? We need to perform this
+		 * check after the folio is stabilised.
+		 */
+		if ((flags & FOLL_WRITE) && is_longterm &&
+		    !folio_longterm_write_pin_allowed(folio)) {
+			folio_put_refs(folio, refs);
+			return NULL;
+		}
+
 		/*
 		 * When pinning a large folio, use an exact count to track it.
 		 *