Message ID | dee4f4ad6532b0f94d073da263526de334d5d7e0.1682981880.git.lstoakes@gmail.com (mailing list archive) |
---|---|
State | Mainlined, archived |
Headers | show |
Series | mm/gup: disallow GUP writing to file-backed mappings by default | expand |
On 5/1/23 16:11, Lorenzo Stoakes wrote: > Writing to file-backed dirty-tracked mappings via GUP is inherently broken > as we cannot rule out folios being cleaned and then a GUP user writing to > them again and possibly marking them dirty unexpectedly. > > This is especially egregious for long-term mappings (as indicated by the > use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as > we have already done in the slow path. > > We have access to less information in the fast path as we cannot examine > the VMA containing the mapping, however we can determine whether the folio > is anonymous and then whitelist known-good mappings - specifically hugetlb > and shmem mappings. > > While we obtain a stable folio for this check, the mapping might not be, as > a truncate could nullify it at any time. Since doing so requires mappings > to be zapped, we can synchronise against a TLB shootdown operation. > > For some architectures TLB shootdown is synchronised by IPI, against which > we are protected as the GUP-fast operation is performed with interrupts > disabled. However, other architectures which specify > CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation. > > In these instances, we acquire an RCU lock while performing our checks. If > we cannot get a stable mapping, we fall back to the slow path, as otherwise > we'd have to walk the page tables again and it's simpler and more effective > to just fall back. > > It's important to note that there are no APIs allowing users to specify > FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can > always rely on the fact that if we fail to pin on the fast path, the code > will fall back to the slow path which can perform the more thorough check. > > Suggested-by: David Hildenbrand <david@redhat.com> > Suggested-by: Kirill A . Shutemov <kirill@shutemov.name> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > --- > mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 85 insertions(+), 2 deletions(-) > Hi Lorenzo, I am unable to find anything wrong with this patch, despite poring over it and fretting over IPI vs. RCU cases. :) Reviewed-by: John Hubbard <jhubbard@nvidia.com> thanks,
On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote: > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs) > return folio; > } > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE > +static bool stabilise_mapping_rcu(struct folio *folio) > +{ > + struct address_space *mapping = READ_ONCE(folio->mapping); > + > + rcu_read_lock(); > + > + return mapping == READ_ONCE(folio->mapping); This doesn't make sense; why bother reading the same thing twice? Who cares if the thing changes from before; what you care about is that the value you see has stable storage, this doesn't help with that. > +} > + > +static void unlock_rcu(void) > +{ > + rcu_read_unlock(); > +} > +#else > +static bool stabilise_mapping_rcu(struct folio *) > +{ > + return true; > +} > + > +static void unlock_rcu(void) > +{ > +} > +#endif Anyway, this all can go away. RCU can't progress while you have interrupts disabled anyway. > +/* > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM | > + * FOLL_WRITE pin is permitted for a specific folio. > + * > + * This assumes the folio is stable and pinned. > + * > + * Writing to pinned file-backed dirty tracked folios is inherently problematic > + * (see comment describing the writeable_file_mapping_allowed() function). We > + * therefore try to avoid the most egregious case of a long-term mapping doing > + * so. > + * > + * This function cannot be as thorough as that one as the VMA is not available > + * in the fast path, so instead we whitelist known good cases. > + * > + * The folio is stable, but the mapping might not be. When truncating for > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled > + * so we are safe from an IPI, but some architectures use an RCU lock for this > + * operation, so we acquire an RCU lock to ensure the mapping is stable. > + */ > +static bool folio_longterm_write_pin_allowed(struct folio *folio) > +{ > + bool ret; > + > + /* hugetlb mappings do not require dirty tracking. */ > + if (folio_test_hugetlb(folio)) > + return true; > + This: > + if (stabilise_mapping_rcu(folio)) { > + struct address_space *mapping = folio_mapping(folio); And this is 3rd read of folio->mapping, just for giggles? > + > + /* > + * Neither anonymous nor shmem-backed folios require > + * dirty tracking. > + */ > + ret = folio_test_anon(folio) || > + (mapping && shmem_mapping(mapping)); > + } else { > + /* If the mapping is unstable, fallback to the slow path. */ > + ret = false; > + } > + > + unlock_rcu(); > + > + return ret; then becomes: if (folio_test_anon(folio)) return true; /* * Having IRQs disabled (as per GUP-fast) also inhibits RCU * grace periods from making progress, IOW. they imply * rcu_read_lock(). */ lockdep_assert_irqs_disabled(); /* * Inodes and thus address_space are RCU freed and thus safe to * access at this point. */ mapping = folio_mapping(folio); if (mapping && shmem_mapping(mapping)) return true; return false; > +}
On Tue 02-05-23 00:11:49, Lorenzo Stoakes wrote: > Writing to file-backed dirty-tracked mappings via GUP is inherently broken > as we cannot rule out folios being cleaned and then a GUP user writing to > them again and possibly marking them dirty unexpectedly. > > This is especially egregious for long-term mappings (as indicated by the > use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as > we have already done in the slow path. > > We have access to less information in the fast path as we cannot examine > the VMA containing the mapping, however we can determine whether the folio > is anonymous and then whitelist known-good mappings - specifically hugetlb > and shmem mappings. > > While we obtain a stable folio for this check, the mapping might not be, as > a truncate could nullify it at any time. Since doing so requires mappings > to be zapped, we can synchronise against a TLB shootdown operation. > > For some architectures TLB shootdown is synchronised by IPI, against which > we are protected as the GUP-fast operation is performed with interrupts > disabled. However, other architectures which specify > CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation. > > In these instances, we acquire an RCU lock while performing our checks. If > we cannot get a stable mapping, we fall back to the slow path, as otherwise > we'd have to walk the page tables again and it's simpler and more effective > to just fall back. > > It's important to note that there are no APIs allowing users to specify > FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can > always rely on the fact that if we fail to pin on the fast path, the code > will fall back to the slow path which can perform the more thorough check. > > Suggested-by: David Hildenbrand <david@redhat.com> > Suggested-by: Kirill A . Shutemov <kirill@shutemov.name> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > --- > mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 85 insertions(+), 2 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index 0f09dec0906c..431618048a03 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -18,6 +18,7 @@ > #include <linux/migrate.h> > #include <linux/mm_inline.h> > #include <linux/sched/mm.h> > +#include <linux/shmem_fs.h> > > #include <asm/mmu_context.h> > #include <asm/tlbflush.h> > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs) > return folio; > } > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE > +static bool stabilise_mapping_rcu(struct folio *folio) > +{ > + struct address_space *mapping = READ_ONCE(folio->mapping); > + > + rcu_read_lock(); > + > + return mapping == READ_ONCE(folio->mapping); > +} > + > +static void unlock_rcu(void) > +{ > + rcu_read_unlock(); > +} > +#else > +static bool stabilise_mapping_rcu(struct folio *) > +{ > + return true; > +} > + > +static void unlock_rcu(void) > +{ > +} > +#endif So I wonder is this complexity worth it over just using rcu_read_lock() unconditionally? It is just a compilation barrier AFAIK... Also stabilise_mapping_rcu() seems to be a bit of a misnomer since the mapping is not stable after the function is called. Also the return value seems a bit pointless to me since we have to check folio_mapping() for being != NULL anyway. All in all I'd say that: struct address_space *mapping; /* Make sure mapping cannot be freed under our hands */ rcu_read_lock(); mapping = folio_mapping(folio); ret = folio_test_anon(folio) || (mapping && shmem_mapping(mapping)); rcu_read_unlock(); looks more comprehensible... Honza > + > +/* > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM | > + * FOLL_WRITE pin is permitted for a specific folio. > + * > + * This assumes the folio is stable and pinned. > + * > + * Writing to pinned file-backed dirty tracked folios is inherently problematic > + * (see comment describing the writeable_file_mapping_allowed() function). We > + * therefore try to avoid the most egregious case of a long-term mapping doing > + * so. > + * > + * This function cannot be as thorough as that one as the VMA is not available > + * in the fast path, so instead we whitelist known good cases. > + * > + * The folio is stable, but the mapping might not be. When truncating for > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled > + * so we are safe from an IPI, but some architectures use an RCU lock for this > + * operation, so we acquire an RCU lock to ensure the mapping is stable. > + */ > +static bool folio_longterm_write_pin_allowed(struct folio *folio) > +{ > + bool ret; > + > + /* hugetlb mappings do not require dirty tracking. */ > + if (folio_test_hugetlb(folio)) > + return true; > + > + if (stabilise_mapping_rcu(folio)) { > + struct address_space *mapping = folio_mapping(folio); > + > + /* > + * Neither anonymous nor shmem-backed folios require > + * dirty tracking. > + */ > + ret = folio_test_anon(folio) || > + (mapping && shmem_mapping(mapping)); > + } else { > + /* If the mapping is unstable, fallback to the slow path. */ > + ret = false; > + } > + > + unlock_rcu(); > + > + return ret; > +} > + > /** > * try_grab_folio() - Attempt to get or pin a folio. > * @page: pointer to page to be grabbed > @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs) > */ > struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) > { > + bool is_longterm = flags & FOLL_LONGTERM; > + > if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page))) > return NULL; > > @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) > * right zone, so fail and let the caller fall back to the slow > * path. > */ > - if (unlikely((flags & FOLL_LONGTERM) && > - !is_longterm_pinnable_page(page))) > + if (unlikely(is_longterm && !is_longterm_pinnable_page(page))) > return NULL; > > /* > @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) > if (!folio) > return NULL; > > + /* > + * Can this folio be safely pinned? We need to perform this > + * check after the folio is stabilised. > + */ > + if ((flags & FOLL_WRITE) && is_longterm && > + !folio_longterm_write_pin_allowed(folio)) { > + folio_put_refs(folio, refs); > + return NULL; > + } > + > /* > * When pinning a large folio, use an exact count to track it. > * > -- > 2.40.1 >
On Tue 02-05-23 13:13:34, Peter Zijlstra wrote: > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote: > > + > > + /* > > + * Neither anonymous nor shmem-backed folios require > > + * dirty tracking. > > + */ > > + ret = folio_test_anon(folio) || > > + (mapping && shmem_mapping(mapping)); > > + } else { > > + /* If the mapping is unstable, fallback to the slow path. */ > > + ret = false; > > + } > > + > > + unlock_rcu(); > > + > > + return ret; > > then becomes: > > > if (folio_test_anon(folio)) > return true; > > /* > * Having IRQs disabled (as per GUP-fast) also inhibits RCU > * grace periods from making progress, IOW. they imply > * rcu_read_lock(). > */ > lockdep_assert_irqs_disabled(); > > /* > * Inodes and thus address_space are RCU freed and thus safe to > * access at this point. > */ > mapping = folio_mapping(folio); > if (mapping && shmem_mapping(mapping)) > return true; > > return false; Yeah, that's even better. Honza
On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote: > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote: > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs) > > return folio; > > } > > > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE > > +static bool stabilise_mapping_rcu(struct folio *folio) > > +{ > > + struct address_space *mapping = READ_ONCE(folio->mapping); > > + > > + rcu_read_lock(); > > + > > + return mapping == READ_ONCE(folio->mapping); > > This doesn't make sense; why bother reading the same thing twice? The intent is to see whether the folio->mapping has been truncated from underneath us, as per the futex code that Kirill referred to which does something similar [1]. > > Who cares if the thing changes from before; what you care about is that > the value you see has stable storage, this doesn't help with that. > > > +} > > + > > +static void unlock_rcu(void) > > +{ > > + rcu_read_unlock(); > > +} > > +#else > > +static bool stabilise_mapping_rcu(struct folio *) > > +{ > > + return true; > > +} > > + > > +static void unlock_rcu(void) > > +{ > > +} > > +#endif > > Anyway, this all can go away. RCU can't progress while you have > interrupts disabled anyway. There seems to be other code in the kernel that assumes that this is not the case, i.e. the futex code, though not sure if that's being run with IRQs disabled... if not and it's absolutely certain that we need no special handling for the RCU case, then happy days and more than glad to remove this bit. I'm far from an expert on RCU (I need to gain a better understanding of it) so I'm deferring how best to proceed on _this part_ to the community. > > > +/* > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM | > > + * FOLL_WRITE pin is permitted for a specific folio. > > + * > > + * This assumes the folio is stable and pinned. > > + * > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic > > + * (see comment describing the writeable_file_mapping_allowed() function). We > > + * therefore try to avoid the most egregious case of a long-term mapping doing > > + * so. > > + * > > + * This function cannot be as thorough as that one as the VMA is not available > > + * in the fast path, so instead we whitelist known good cases. > > + * > > + * The folio is stable, but the mapping might not be. When truncating for > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled > > + * so we are safe from an IPI, but some architectures use an RCU lock for this > > + * operation, so we acquire an RCU lock to ensure the mapping is stable. > > + */ > > +static bool folio_longterm_write_pin_allowed(struct folio *folio) > > +{ > > + bool ret; > > + > > + /* hugetlb mappings do not require dirty tracking. */ > > + if (folio_test_hugetlb(folio)) > > + return true; > > + > > This: > > > + if (stabilise_mapping_rcu(folio)) { > > + struct address_space *mapping = folio_mapping(folio); > > And this is 3rd read of folio->mapping, just for giggles? I like to giggle :) Actually this is to handle the various cases in which the mapping might not be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to have a helper exposed for a check. Given previous review about duplication I felt best to reuse this even though it does access again... yes I felt weird about doing that. > > > + > > + /* > > + * Neither anonymous nor shmem-backed folios require > > + * dirty tracking. > > + */ > > + ret = folio_test_anon(folio) || > > + (mapping && shmem_mapping(mapping)); > > + } else { > > + /* If the mapping is unstable, fallback to the slow path. */ > > + ret = false; > > + } > > + > > + unlock_rcu(); > > + > > + return ret; > > then becomes: > > > if (folio_test_anon(folio)) > return true; This relies on the mapping so belongs below the lockdep assert imo. > > /* > * Having IRQs disabled (as per GUP-fast) also inhibits RCU > * grace periods from making progress, IOW. they imply > * rcu_read_lock(). > */ > lockdep_assert_irqs_disabled(); > > /* > * Inodes and thus address_space are RCU freed and thus safe to > * access at this point. > */ > mapping = folio_mapping(folio); > if (mapping && shmem_mapping(mapping)) > return true; > > return false; > > > +} I'm more than happy to do this (I'd rather drop the RCU bits if possible) but need to be sure it's safe.
On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote: > On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote: > > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote: > > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs) > > > return folio; > > > } > > > > > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE > > > +static bool stabilise_mapping_rcu(struct folio *folio) > > > +{ > > > + struct address_space *mapping = READ_ONCE(folio->mapping); > > > + > > > + rcu_read_lock(); > > > + > > > + return mapping == READ_ONCE(folio->mapping); > > > > This doesn't make sense; why bother reading the same thing twice? > > The intent is to see whether the folio->mapping has been truncated from > underneath us, as per the futex code that Kirill referred to which does > something similar [1]. > > > > > Who cares if the thing changes from before; what you care about is that > > the value you see has stable storage, this doesn't help with that. > > > > > +} > > > + > > > +static void unlock_rcu(void) > > > +{ > > > + rcu_read_unlock(); > > > +} > > > +#else > > > +static bool stabilise_mapping_rcu(struct folio *) > > > +{ > > > + return true; > > > +} > > > + > > > +static void unlock_rcu(void) > > > +{ > > > +} > > > +#endif > > > > Anyway, this all can go away. RCU can't progress while you have > > interrupts disabled anyway. > > There seems to be other code in the kernel that assumes that this is not > the case, i.e. the futex code, though not sure if that's being run with > IRQs disabled... if not and it's absolutely certain that we need no special > handling for the RCU case, then happy days and more than glad to remove > this bit. > > I'm far from an expert on RCU (I need to gain a better understanding of it) > so I'm deferring how best to proceed on _this part_ to the community. > > > > > > +/* > > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM | > > > + * FOLL_WRITE pin is permitted for a specific folio. > > > + * > > > + * This assumes the folio is stable and pinned. > > > + * > > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic > > > + * (see comment describing the writeable_file_mapping_allowed() function). We > > > + * therefore try to avoid the most egregious case of a long-term mapping doing > > > + * so. > > > + * > > > + * This function cannot be as thorough as that one as the VMA is not available > > > + * in the fast path, so instead we whitelist known good cases. > > > + * > > > + * The folio is stable, but the mapping might not be. When truncating for > > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled > > > + * so we are safe from an IPI, but some architectures use an RCU lock for this > > > + * operation, so we acquire an RCU lock to ensure the mapping is stable. > > > + */ > > > +static bool folio_longterm_write_pin_allowed(struct folio *folio) > > > +{ > > > + bool ret; > > > + > > > + /* hugetlb mappings do not require dirty tracking. */ > > > + if (folio_test_hugetlb(folio)) > > > + return true; > > > + > > > > This: > > > > > + if (stabilise_mapping_rcu(folio)) { > > > + struct address_space *mapping = folio_mapping(folio); > > > > And this is 3rd read of folio->mapping, just for giggles? > > I like to giggle :) > > Actually this is to handle the various cases in which the mapping might not > be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to > have a helper exposed for a check. Given previous review about duplication > I felt best to reuse this even though it does access again... yes I felt > weird about doing that. > > > > > > + > > > + /* > > > + * Neither anonymous nor shmem-backed folios require > > > + * dirty tracking. > > > + */ > > > + ret = folio_test_anon(folio) || > > > + (mapping && shmem_mapping(mapping)); > > > + } else { > > > + /* If the mapping is unstable, fallback to the slow path. */ > > > + ret = false; > > > + } > > > + > > > + unlock_rcu(); > > > + > > > + return ret; > > > > then becomes: > > > > > > if (folio_test_anon(folio)) > > return true; > > This relies on the mapping so belongs below the lockdep assert imo. > > > > > /* > > * Having IRQs disabled (as per GUP-fast) also inhibits RCU > > * grace periods from making progress, IOW. they imply > > * rcu_read_lock(). > > */ > > lockdep_assert_irqs_disabled(); > > > > /* > > * Inodes and thus address_space are RCU freed and thus safe to > > * access at this point. > > */ > > mapping = folio_mapping(folio); > > if (mapping && shmem_mapping(mapping)) > > return true; > > > > return false; > > > > > +} > > I'm more than happy to do this (I'd rather drop the RCU bits if possible) > but need to be sure it's safe. Sorry forgot to include the [1] [1]:https://lore.kernel.org/all/20230428234332.2vhprztuotlqir4x@box.shutemov.name/
On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote: > On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote: > > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote: > > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs) > > > return folio; > > > } > > > > > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE > > > +static bool stabilise_mapping_rcu(struct folio *folio) > > > +{ > > > + struct address_space *mapping = READ_ONCE(folio->mapping); > > > + > > > + rcu_read_lock(); > > > + > > > + return mapping == READ_ONCE(folio->mapping); > > > > This doesn't make sense; why bother reading the same thing twice? > > The intent is to see whether the folio->mapping has been truncated from > underneath us, as per the futex code that Kirill referred to which does > something similar [1]. Yeah, but per that 3rd load you got nothing here. Also that futex code did the early load to deal with the !mapping case, but you're not doing that. > > Who cares if the thing changes from before; what you care about is that > > the value you see has stable storage, this doesn't help with that. > > > > > +} > > > + > > > +static void unlock_rcu(void) > > > +{ > > > + rcu_read_unlock(); > > > +} > > > +#else > > > +static bool stabilise_mapping_rcu(struct folio *) > > > +{ > > > + return true; > > > +} > > > + > > > +static void unlock_rcu(void) > > > +{ > > > +} > > > +#endif > > > > Anyway, this all can go away. RCU can't progress while you have > > interrupts disabled anyway. > > There seems to be other code in the kernel that assumes that this is not > the case, Yeah, so Paul went back on forth on that a bit. It used to be true in the good old days when everything was simple. Then Paul made things complicated by separating out sched-RCU bh-RCU and 'regular' RCU flavours. At that point disabling IRQs would only (officially) inhibit sched and bh RCU flavours, but not the regular RCU. But then some years ago Linus convinced Paul that having all these separate RCU flavours with separate QS rules was a big pain in the backside and Paul munged them all together again. So now, anything that inhibits any of the RCU flavours inhibits them all. So disabling IRQs is sufficient. > i.e. the futex code, though not sure if that's being run with > IRQs disabled... That futex code runs in preemptible context, per the lock_page() that can sleep etc.. :-) > > > +/* > > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM | > > > + * FOLL_WRITE pin is permitted for a specific folio. > > > + * > > > + * This assumes the folio is stable and pinned. > > > + * > > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic > > > + * (see comment describing the writeable_file_mapping_allowed() function). We > > > + * therefore try to avoid the most egregious case of a long-term mapping doing > > > + * so. > > > + * > > > + * This function cannot be as thorough as that one as the VMA is not available > > > + * in the fast path, so instead we whitelist known good cases. > > > + * > > > + * The folio is stable, but the mapping might not be. When truncating for > > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled > > > + * so we are safe from an IPI, but some architectures use an RCU lock for this > > > + * operation, so we acquire an RCU lock to ensure the mapping is stable. > > > + */ > > > +static bool folio_longterm_write_pin_allowed(struct folio *folio) > > > +{ > > > + bool ret; > > > + > > > + /* hugetlb mappings do not require dirty tracking. */ > > > + if (folio_test_hugetlb(folio)) > > > + return true; > > > + > > > > This: > > > > > + if (stabilise_mapping_rcu(folio)) { > > > + struct address_space *mapping = folio_mapping(folio); > > > > And this is 3rd read of folio->mapping, just for giggles? > > I like to giggle :) > > Actually this is to handle the various cases in which the mapping might not > be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to > have a helper exposed for a check. Given previous review about duplication > I felt best to reuse this even though it does access again... yes I felt > weird about doing that. Right, I had a peek inside folio_mapping(), but the point is that this 3rd load might see yet *another* value of mapping from the prior two loads, rendering them somewhat worthless. > > > + > > > + /* > > > + * Neither anonymous nor shmem-backed folios require > > > + * dirty tracking. > > > + */ > > > + ret = folio_test_anon(folio) || > > > + (mapping && shmem_mapping(mapping)); > > > + } else { > > > + /* If the mapping is unstable, fallback to the slow path. */ > > > + ret = false; > > > + } > > > + > > > + unlock_rcu(); > > > + > > > + return ret; > > > > then becomes: > > > > > > if (folio_test_anon(folio)) > > return true; > > This relies on the mapping so belongs below the lockdep assert imo. Oh, right you are. > > > > /* > > * Having IRQs disabled (as per GUP-fast) also inhibits RCU > > * grace periods from making progress, IOW. they imply > > * rcu_read_lock(). > > */ > > lockdep_assert_irqs_disabled(); > > > > /* > > * Inodes and thus address_space are RCU freed and thus safe to > > * access at this point. > > */ > > mapping = folio_mapping(folio); > > if (mapping && shmem_mapping(mapping)) > > return true; > > > > return false; > > > > > +} > > I'm more than happy to do this (I'd rather drop the RCU bits if possible) > but need to be sure it's safe. GUP-fast as a whole relies on it :-)
On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote: > On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote: > > On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote: > > > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote: > > > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs) > > > > return folio; > > > > } > > > > > > > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE > > > > +static bool stabilise_mapping_rcu(struct folio *folio) > > > > +{ > > > > + struct address_space *mapping = READ_ONCE(folio->mapping); > > > > + > > > > + rcu_read_lock(); > > > > + > > > > + return mapping == READ_ONCE(folio->mapping); > > > > > > This doesn't make sense; why bother reading the same thing twice? > > > > The intent is to see whether the folio->mapping has been truncated from > > underneath us, as per the futex code that Kirill referred to which does > > something similar [1]. > > Yeah, but per that 3rd load you got nothing here. Also that futex code > did the early load to deal with the !mapping case, but you're not doing > that. > OK I drafted a response three times then deleted which shows you how this stuff messes with your mind :) I realise now that literally it is checking whether the previous !mapping case and lack of action taken on that was valid for futex, rendering this pointless for the logic here. We do check !mapping later but obviously with the 'stable' mapping whose relation to pre-rcu lock is irrelevant. Thanks for patiently explaining this :) RCU remains an area I need to take a closer look at generally. > > > Who cares if the thing changes from before; what you care about is that > > > the value you see has stable storage, this doesn't help with that. > > > > > > > +} > > > > + > > > > +static void unlock_rcu(void) > > > > +{ > > > > + rcu_read_unlock(); > > > > +} > > > > +#else > > > > +static bool stabilise_mapping_rcu(struct folio *) > > > > +{ > > > > + return true; > > > > +} > > > > + > > > > +static void unlock_rcu(void) > > > > +{ > > > > +} > > > > +#endif > > > > > > Anyway, this all can go away. RCU can't progress while you have > > > interrupts disabled anyway. > > > > There seems to be other code in the kernel that assumes that this is not > > the case, > > Yeah, so Paul went back on forth on that a bit. It used to be true in > the good old days when everything was simple. Then Paul made things > complicated by separating out sched-RCU bh-RCU and 'regular' RCU > flavours. > > At that point disabling IRQs would only (officially) inhibit sched and > bh RCU flavours, but not the regular RCU. > > But then some years ago Linus convinced Paul that having all these > separate RCU flavours with separate QS rules was a big pain in the > backside and Paul munged them all together again. > > So now, anything that inhibits any of the RCU flavours inhibits them > all. So disabling IRQs is sufficient. > > > i.e. the futex code, though not sure if that's being run with > > IRQs disabled... > > That futex code runs in preemptible context, per the lock_page() that > can sleep etc.. :-) OK I am actually really happy to hear this because this means I can go simplify this code significantly! > > > > > +/* > > > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM | > > > > + * FOLL_WRITE pin is permitted for a specific folio. > > > > + * > > > > + * This assumes the folio is stable and pinned. > > > > + * > > > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic > > > > + * (see comment describing the writeable_file_mapping_allowed() function). We > > > > + * therefore try to avoid the most egregious case of a long-term mapping doing > > > > + * so. > > > > + * > > > > + * This function cannot be as thorough as that one as the VMA is not available > > > > + * in the fast path, so instead we whitelist known good cases. > > > > + * > > > > + * The folio is stable, but the mapping might not be. When truncating for > > > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled > > > > + * so we are safe from an IPI, but some architectures use an RCU lock for this > > > > + * operation, so we acquire an RCU lock to ensure the mapping is stable. > > > > + */ > > > > +static bool folio_longterm_write_pin_allowed(struct folio *folio) > > > > +{ > > > > + bool ret; > > > > + > > > > + /* hugetlb mappings do not require dirty tracking. */ > > > > + if (folio_test_hugetlb(folio)) > > > > + return true; > > > > + > > > > > > This: > > > > > > > + if (stabilise_mapping_rcu(folio)) { > > > > + struct address_space *mapping = folio_mapping(folio); > > > > > > And this is 3rd read of folio->mapping, just for giggles? > > > > I like to giggle :) > > > > Actually this is to handle the various cases in which the mapping might not > > be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to > > have a helper exposed for a check. Given previous review about duplication > > I felt best to reuse this even though it does access again... yes I felt > > weird about doing that. > > Right, I had a peek inside folio_mapping(), but the point is that this > 3rd load might see yet *another* value of mapping from the prior two > loads, rendering them somewhat worthless. > > > > > + > > > > + /* > > > > + * Neither anonymous nor shmem-backed folios require > > > > + * dirty tracking. > > > > + */ > > > > + ret = folio_test_anon(folio) || > > > > + (mapping && shmem_mapping(mapping)); > > > > + } else { > > > > + /* If the mapping is unstable, fallback to the slow path. */ > > > > + ret = false; > > > > + } > > > > + > > > > + unlock_rcu(); > > > > + > > > > + return ret; > > > > > > then becomes: > > > > > > > > > if (folio_test_anon(folio)) > > > return true; > > > > This relies on the mapping so belongs below the lockdep assert imo. > > Oh, right you are. > > > > > > > /* > > > * Having IRQs disabled (as per GUP-fast) also inhibits RCU > > > * grace periods from making progress, IOW. they imply > > > * rcu_read_lock(). > > > */ > > > lockdep_assert_irqs_disabled(); > > > > > > /* > > > * Inodes and thus address_space are RCU freed and thus safe to > > > * access at this point. > > > */ > > > mapping = folio_mapping(folio); > > > if (mapping && shmem_mapping(mapping)) > > > return true; > > > > > > return false; > > > > > > > +} > > > > I'm more than happy to do this (I'd rather drop the RCU bits if possible) > > but need to be sure it's safe. > > GUP-fast as a whole relies on it :-) Indeed, the only question was what happened with CONFIG_MMU_GATHER_RCU_TABLE_FREE arches which appeared to require special handling, but I'm very happy to hear they don't! Will respin along the lines of your suggestion.
On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote: > > > > > > > > > if (folio_test_anon(folio)) > > > return true; > > > > This relies on the mapping so belongs below the lockdep assert imo. > > Oh, right you are. > > > > > > > /* > > > * Having IRQs disabled (as per GUP-fast) also inhibits RCU > > > * grace periods from making progress, IOW. they imply > > > * rcu_read_lock(). > > > */ > > > lockdep_assert_irqs_disabled(); > > > > > > /* > > > * Inodes and thus address_space are RCU freed and thus safe to > > > * access at this point. > > > */ > > > mapping = folio_mapping(folio); > > > if (mapping && shmem_mapping(mapping)) > > > return true; > > > > > > return false; > > > > > > > +} So arguably you should do *one* READ_ONCE() load of mapping and consistently use that, this means open-coding both folio_test_anon() and folio_mapping().
Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes: > Writing to file-backed dirty-tracked mappings via GUP is inherently broken > as we cannot rule out folios being cleaned and then a GUP user writing to > them again and possibly marking them dirty unexpectedly. > > This is especially egregious for long-term mappings (as indicated by the > use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as > we have already done in the slow path. Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery? It would no longer work with file backed memory, correct? See arch/s390/kvm/pci.c kvm_s390_pci_aif_enable which does have FOLL_WRITE | FOLL_LONGTERM to > > We have access to less information in the fast path as we cannot examine > the VMA containing the mapping, however we can determine whether the folio > is anonymous and then whitelist known-good mappings - specifically hugetlb > and shmem mappings. > > While we obtain a stable folio for this check, the mapping might not be, as > a truncate could nullify it at any time. Since doing so requires mappings > to be zapped, we can synchronise against a TLB shootdown operation. > > For some architectures TLB shootdown is synchronised by IPI, against which > we are protected as the GUP-fast operation is performed with interrupts > disabled. However, other architectures which specify > CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation. > > In these instances, we acquire an RCU lock while performing our checks. If > we cannot get a stable mapping, we fall back to the slow path, as otherwise > we'd have to walk the page tables again and it's simpler and more effective > to just fall back. > > It's important to note that there are no APIs allowing users to specify > FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can > always rely on the fact that if we fail to pin on the fast path, the code > will fall back to the slow path which can perform the more thorough check. > > Suggested-by: David Hildenbrand <david@redhat.com> > Suggested-by: Kirill A . Shutemov <kirill@shutemov.name> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > --- > mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 85 insertions(+), 2 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index 0f09dec0906c..431618048a03 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -18,6 +18,7 @@ > #include <linux/migrate.h> > #include <linux/mm_inline.h> > #include <linux/sched/mm.h> > +#include <linux/shmem_fs.h> > > #include <asm/mmu_context.h> > #include <asm/tlbflush.h> > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs) > return folio; > } > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE > +static bool stabilise_mapping_rcu(struct folio *folio) > +{ > + struct address_space *mapping = READ_ONCE(folio->mapping); > + > + rcu_read_lock(); > + > + return mapping == READ_ONCE(folio->mapping); > +} > + > +static void unlock_rcu(void) > +{ > + rcu_read_unlock(); > +} > +#else > +static bool stabilise_mapping_rcu(struct folio *) > +{ > + return true; > +} > + > +static void unlock_rcu(void) > +{ > +} > +#endif > + > +/* > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM | > + * FOLL_WRITE pin is permitted for a specific folio. > + * > + * This assumes the folio is stable and pinned. > + * > + * Writing to pinned file-backed dirty tracked folios is inherently problematic > + * (see comment describing the writeable_file_mapping_allowed() function). We > + * therefore try to avoid the most egregious case of a long-term mapping doing > + * so. > + * > + * This function cannot be as thorough as that one as the VMA is not available > + * in the fast path, so instead we whitelist known good cases. > + * > + * The folio is stable, but the mapping might not be. When truncating for > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled > + * so we are safe from an IPI, but some architectures use an RCU lock for this > + * operation, so we acquire an RCU lock to ensure the mapping is stable. > + */ > +static bool folio_longterm_write_pin_allowed(struct folio *folio) > +{ > + bool ret; > + > + /* hugetlb mappings do not require dirty tracking. */ > + if (folio_test_hugetlb(folio)) > + return true; > + > + if (stabilise_mapping_rcu(folio)) { > + struct address_space *mapping = folio_mapping(folio); > + > + /* > + * Neither anonymous nor shmem-backed folios require > + * dirty tracking. > + */ > + ret = folio_test_anon(folio) || > + (mapping && shmem_mapping(mapping)); > + } else { > + /* If the mapping is unstable, fallback to the slow path. */ > + ret = false; > + } > + > + unlock_rcu(); > + > + return ret; > +} > + > /** > * try_grab_folio() - Attempt to get or pin a folio. > * @page: pointer to page to be grabbed > @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs) > */ > struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) > { > + bool is_longterm = flags & FOLL_LONGTERM; > + > if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page))) > return NULL; > > @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) > * right zone, so fail and let the caller fall back to the slow > * path. > */ > - if (unlikely((flags & FOLL_LONGTERM) && > - !is_longterm_pinnable_page(page))) > + if (unlikely(is_longterm && !is_longterm_pinnable_page(page))) > return NULL; > > /* > @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) > if (!folio) > return NULL; > > + /* > + * Can this folio be safely pinned? We need to perform this > + * check after the folio is stabilised. > + */ > + if ((flags & FOLL_WRITE) && is_longterm && > + !folio_longterm_write_pin_allowed(folio)) { > + folio_put_refs(folio, refs); > + return NULL; > + } > + > /* > * When pinning a large folio, use an exact count to track it. > *
On 02.05.23 14:40, Peter Zijlstra wrote: > On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote: > >>>> >>>> >>>> if (folio_test_anon(folio)) >>>> return true; >>> >>> This relies on the mapping so belongs below the lockdep assert imo. >> >> Oh, right you are. >> >>>> >>>> /* >>>> * Having IRQs disabled (as per GUP-fast) also inhibits RCU >>>> * grace periods from making progress, IOW. they imply >>>> * rcu_read_lock(). >>>> */ >>>> lockdep_assert_irqs_disabled(); >>>> >>>> /* >>>> * Inodes and thus address_space are RCU freed and thus safe to >>>> * access at this point. >>>> */ >>>> mapping = folio_mapping(folio); >>>> if (mapping && shmem_mapping(mapping)) >>>> return true; >>>> >>>> return false; >>>> >>>>> +} > > So arguably you should do *one* READ_ONCE() load of mapping and > consistently use that, this means open-coding both folio_test_anon() and > folio_mapping(). Open-coding folio_test_anon() should not be required. We only care about PAGE_MAPPING_FLAGS stored alongside folio->mapping, that will stick around until the anon page was freed. @Lorenzo, you might also want to special-case hugetlb directly using folio_test_hugetlb().
On Tue, May 02, 2023 at 02:47:22PM +0200, David Hildenbrand wrote: > On 02.05.23 14:40, Peter Zijlstra wrote: > > On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote: > > > > > > > > > > > > > > > > > if (folio_test_anon(folio)) > > > > > return true; > > > > > > > > This relies on the mapping so belongs below the lockdep assert imo. > > > > > > Oh, right you are. > > > > > > > > > > > > > /* > > > > > * Having IRQs disabled (as per GUP-fast) also inhibits RCU > > > > > * grace periods from making progress, IOW. they imply > > > > > * rcu_read_lock(). > > > > > */ > > > > > lockdep_assert_irqs_disabled(); > > > > > > > > > > /* > > > > > * Inodes and thus address_space are RCU freed and thus safe to > > > > > * access at this point. > > > > > */ > > > > > mapping = folio_mapping(folio); > > > > > if (mapping && shmem_mapping(mapping)) > > > > > return true; > > > > > > > > > > return false; > > > > > > > > > > > +} > > > > So arguably you should do *one* READ_ONCE() load of mapping and > > consistently use that, this means open-coding both folio_test_anon() and > > folio_mapping(). > > Open-coding folio_test_anon() should not be required. We only care about > PAGE_MAPPING_FLAGS stored alongside folio->mapping, that will stick around > until the anon page was freed. > Ack, good point! > @Lorenzo, you might also want to special-case hugetlb directly using > folio_test_hugetlb(). > I already am :) I guess you mean when I respin along these lines? Will port that across to. > -- > Thanks, > > David / dhildenb >
On 02.05.23 14:52, Lorenzo Stoakes wrote: > On Tue, May 02, 2023 at 02:47:22PM +0200, David Hildenbrand wrote: >> On 02.05.23 14:40, Peter Zijlstra wrote: >>> On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote: >>> >>>>>> >>>>>> >>>>>> if (folio_test_anon(folio)) >>>>>> return true; >>>>> >>>>> This relies on the mapping so belongs below the lockdep assert imo. >>>> >>>> Oh, right you are. >>>> >>>>>> >>>>>> /* >>>>>> * Having IRQs disabled (as per GUP-fast) also inhibits RCU >>>>>> * grace periods from making progress, IOW. they imply >>>>>> * rcu_read_lock(). >>>>>> */ >>>>>> lockdep_assert_irqs_disabled(); >>>>>> >>>>>> /* >>>>>> * Inodes and thus address_space are RCU freed and thus safe to >>>>>> * access at this point. >>>>>> */ >>>>>> mapping = folio_mapping(folio); >>>>>> if (mapping && shmem_mapping(mapping)) >>>>>> return true; >>>>>> >>>>>> return false; >>>>>> >>>>>>> +} >>> >>> So arguably you should do *one* READ_ONCE() load of mapping and >>> consistently use that, this means open-coding both folio_test_anon() and >>> folio_mapping(). >> >> Open-coding folio_test_anon() should not be required. We only care about >> PAGE_MAPPING_FLAGS stored alongside folio->mapping, that will stick around >> until the anon page was freed. >> > > Ack, good point! > >> @Lorenzo, you might also want to special-case hugetlb directly using >> folio_test_hugetlb(). >> > > I already am :) I guess you mean when I respin along these lines? Will port > that across to. Ha, I only stared at that reply, good! :)
On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote: > Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes: > > Writing to file-backed dirty-tracked mappings via GUP is inherently broken > > as we cannot rule out folios being cleaned and then a GUP user writing to > > them again and possibly marking them dirty unexpectedly. > > > > This is especially egregious for long-term mappings (as indicated by the > > use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as > > we have already done in the slow path. > > Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery? > It would no longer work with file backed memory, correct? > > See > arch/s390/kvm/pci.c > > kvm_s390_pci_aif_enable > which does have > FOLL_WRITE | FOLL_LONGTERM > to > Does this memory map a dirty-tracked file? It's kind of hard to dig into where the address originates from without going through a ton of code. In worst case if the fast code doesn't find a whitelist it'll fall back to slow path which explicitly checks for dirty-tracked filesystem. We can reintroduce a flag to permit exceptions if this is really broken, are you able to test? I don't have an s390 sat around :) > > > > We have access to less information in the fast path as we cannot examine > > the VMA containing the mapping, however we can determine whether the folio > > is anonymous and then whitelist known-good mappings - specifically hugetlb > > and shmem mappings. > > > > While we obtain a stable folio for this check, the mapping might not be, as > > a truncate could nullify it at any time. Since doing so requires mappings > > to be zapped, we can synchronise against a TLB shootdown operation. > > > > For some architectures TLB shootdown is synchronised by IPI, against which > > we are protected as the GUP-fast operation is performed with interrupts > > disabled. However, other architectures which specify > > CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation. > > > > In these instances, we acquire an RCU lock while performing our checks. If > > we cannot get a stable mapping, we fall back to the slow path, as otherwise > > we'd have to walk the page tables again and it's simpler and more effective > > to just fall back. > > > > It's important to note that there are no APIs allowing users to specify > > FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can > > always rely on the fact that if we fail to pin on the fast path, the code > > will fall back to the slow path which can perform the more thorough check. > > > > Suggested-by: David Hildenbrand <david@redhat.com> > > Suggested-by: Kirill A . Shutemov <kirill@shutemov.name> > > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > > --- > > mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-- > > 1 file changed, 85 insertions(+), 2 deletions(-) > > > > diff --git a/mm/gup.c b/mm/gup.c > > index 0f09dec0906c..431618048a03 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -18,6 +18,7 @@ > > #include <linux/migrate.h> > > #include <linux/mm_inline.h> > > #include <linux/sched/mm.h> > > +#include <linux/shmem_fs.h> > > #include <asm/mmu_context.h> > > #include <asm/tlbflush.h> > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs) > > return folio; > > } > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE > > +static bool stabilise_mapping_rcu(struct folio *folio) > > +{ > > + struct address_space *mapping = READ_ONCE(folio->mapping); > > + > > + rcu_read_lock(); > > + > > + return mapping == READ_ONCE(folio->mapping); > > +} > > + > > +static void unlock_rcu(void) > > +{ > > + rcu_read_unlock(); > > +} > > +#else > > +static bool stabilise_mapping_rcu(struct folio *) > > +{ > > + return true; > > +} > > + > > +static void unlock_rcu(void) > > +{ > > +} > > +#endif > > + > > +/* > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM | > > + * FOLL_WRITE pin is permitted for a specific folio. > > + * > > + * This assumes the folio is stable and pinned. > > + * > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic > > + * (see comment describing the writeable_file_mapping_allowed() function). We > > + * therefore try to avoid the most egregious case of a long-term mapping doing > > + * so. > > + * > > + * This function cannot be as thorough as that one as the VMA is not available > > + * in the fast path, so instead we whitelist known good cases. > > + * > > + * The folio is stable, but the mapping might not be. When truncating for > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled > > + * so we are safe from an IPI, but some architectures use an RCU lock for this > > + * operation, so we acquire an RCU lock to ensure the mapping is stable. > > + */ > > +static bool folio_longterm_write_pin_allowed(struct folio *folio) > > +{ > > + bool ret; > > + > > + /* hugetlb mappings do not require dirty tracking. */ > > + if (folio_test_hugetlb(folio)) > > + return true; > > + > > + if (stabilise_mapping_rcu(folio)) { > > + struct address_space *mapping = folio_mapping(folio); > > + > > + /* > > + * Neither anonymous nor shmem-backed folios require > > + * dirty tracking. > > + */ > > + ret = folio_test_anon(folio) || > > + (mapping && shmem_mapping(mapping)); > > + } else { > > + /* If the mapping is unstable, fallback to the slow path. */ > > + ret = false; > > + } > > + > > + unlock_rcu(); > > + > > + return ret; > > +} > > + > > /** > > * try_grab_folio() - Attempt to get or pin a folio. > > * @page: pointer to page to be grabbed > > @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs) > > */ > > struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) > > { > > + bool is_longterm = flags & FOLL_LONGTERM; > > + > > if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page))) > > return NULL; > > @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) > > * right zone, so fail and let the caller fall back to the slow > > * path. > > */ > > - if (unlikely((flags & FOLL_LONGTERM) && > > - !is_longterm_pinnable_page(page))) > > + if (unlikely(is_longterm && !is_longterm_pinnable_page(page))) > > return NULL; > > /* > > @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) > > if (!folio) > > return NULL; > > + /* > > + * Can this folio be safely pinned? We need to perform this > > + * check after the folio is stabilised. > > + */ > > + if ((flags & FOLL_WRITE) && is_longterm && > > + !folio_longterm_write_pin_allowed(folio)) { > > + folio_put_refs(folio, refs); > > + return NULL; > > + } > > + > > /* > > * When pinning a large folio, use an exact count to track it. > > *
On Tue, May 02, 2023 at 01:54:41PM +0100, Lorenzo Stoakes wrote: > On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote: > > Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes: > > > Writing to file-backed dirty-tracked mappings via GUP is inherently broken > > > as we cannot rule out folios being cleaned and then a GUP user writing to > > > them again and possibly marking them dirty unexpectedly. > > > > > > This is especially egregious for long-term mappings (as indicated by the > > > use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as > > > we have already done in the slow path. > > > > Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery? > > It would no longer work with file backed memory, correct? > > > > See > > arch/s390/kvm/pci.c > > > > kvm_s390_pci_aif_enable > > which does have > > FOLL_WRITE | FOLL_LONGTERM > > to > > > > Does this memory map a dirty-tracked file? It's kind of hard to dig into where > the address originates from without going through a ton of code. In worst case > if the fast code doesn't find a whitelist it'll fall back to slow path which > explicitly checks for dirty-tracked filesystem. This looks like the same stuff David was talking about, a qemu guest with VFIO backed by a filesystem file.. Broadly though, arch kvm code should not call pin_user_pages(). Either it is KVM focused and it should use the shadow table and it's existing mmu_notifier synchronization scheme Or it is VFIO focused so it should use mdev/etc and have an unmap call back. I'm not really sure what this is for though.. Jason
Am 02.05.23 um 14:54 schrieb Lorenzo Stoakes: > On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote: >> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes: >>> Writing to file-backed dirty-tracked mappings via GUP is inherently broken >>> as we cannot rule out folios being cleaned and then a GUP user writing to >>> them again and possibly marking them dirty unexpectedly. >>> >>> This is especially egregious for long-term mappings (as indicated by the >>> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as >>> we have already done in the slow path. >> >> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery? >> It would no longer work with file backed memory, correct? >> >> See >> arch/s390/kvm/pci.c >> >> kvm_s390_pci_aif_enable >> which does have >> FOLL_WRITE | FOLL_LONGTERM >> to >> > > Does this memory map a dirty-tracked file? It's kind of hard to dig into where > the address originates from without going through a ton of code. In worst case > if the fast code doesn't find a whitelist it'll fall back to slow path which > explicitly checks for dirty-tracked filesystem. It does pin from whatever QEMU uses as backing for the guest. > > We can reintroduce a flag to permit exceptions if this is really broken, are you > able to test? I don't have an s390 sat around :) Matt (Rosato on cc) probably can. In the end, it would mean having <memoryBacking> <source type="file"/> </memoryBacking> In libvirt I guess.
On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote: \> > We can reintroduce a flag to permit exceptions if this is really broken, are you > > able to test? I don't have an s390 sat around :) > > Matt (Rosato on cc) probably can. In the end, it would mean having > <memoryBacking> > <source type="file"/> > </memoryBacking> This s390 code is the least of the problems, after this series VFIO won't startup at all with this configuration. Jason
On 02.05.23 15:10, Jason Gunthorpe wrote: > On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote: > \> > We can reintroduce a flag to permit exceptions if this is really broken, are you >>> able to test? I don't have an s390 sat around :) >> >> Matt (Rosato on cc) probably can. In the end, it would mean having >> <memoryBacking> >> <source type="file"/> >> </memoryBacking> > > This s390 code is the least of the problems, after this series VFIO > won't startup at all with this configuration. Good question if the domain would fail to start. I recall that IOMMUs for zPCI are special on s390x. [1] Well, zPCI is special. I cannot immediately tell when we would trigger long-term pinning. [1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg875728.html
On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote: > On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote: > > On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote: > > > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote: > > > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs) > > > > return folio; > > > > } > > > > > > > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE > > > > +static bool stabilise_mapping_rcu(struct folio *folio) > > > > +{ > > > > + struct address_space *mapping = READ_ONCE(folio->mapping); > > > > + > > > > + rcu_read_lock(); > > > > + > > > > + return mapping == READ_ONCE(folio->mapping); > > > > > > This doesn't make sense; why bother reading the same thing twice? > > > > The intent is to see whether the folio->mapping has been truncated from > > underneath us, as per the futex code that Kirill referred to which does > > something similar [1]. > > Yeah, but per that 3rd load you got nothing here. Also that futex code > did the early load to deal with the !mapping case, but you're not doing > that. > > > > Who cares if the thing changes from before; what you care about is that > > > the value you see has stable storage, this doesn't help with that. > > > > > > > +} > > > > + > > > > +static void unlock_rcu(void) > > > > +{ > > > > + rcu_read_unlock(); > > > > +} > > > > +#else > > > > +static bool stabilise_mapping_rcu(struct folio *) > > > > +{ > > > > + return true; > > > > +} > > > > + > > > > +static void unlock_rcu(void) > > > > +{ > > > > +} > > > > +#endif > > > > > > Anyway, this all can go away. RCU can't progress while you have > > > interrupts disabled anyway. > > > > There seems to be other code in the kernel that assumes that this is not > > the case, > > Yeah, so Paul went back on forth on that a bit. It used to be true in > the good old days when everything was simple. Then Paul made things > complicated by separating out sched-RCU bh-RCU and 'regular' RCU > flavours. Almost. ;-) The way I made things complicated was instead by creating preemptible RCU for the real-time effort. The original non-preemptible RCU was still required for a number of use cases (for example, waiting for hardware interrupt handlers), so it had to stay. Separately, network-based DoS attacks necessitated adding RCU bh. > At that point disabling IRQs would only (officially) inhibit sched and > bh RCU flavours, but not the regular RCU. Quite right. > But then some years ago Linus convinced Paul that having all these > separate RCU flavours with separate QS rules was a big pain in the > backside and Paul munged them all together again. What happened was that someone used one flavor of RCU reader and a different flavor of RCU updater, creating an exploitable bug. http://www2.rdrop.com/~paulmck/RCU/cve.2019.01.23e.pdf https://www.youtube.com/watch?v=hZX1aokdNiY And Linus asked that this bug be ruled out, so... > So now, anything that inhibits any of the RCU flavours inhibits them > all. So disabling IRQs is sufficient. ...for v4.20 and later, exactly. Thanx, Paul > > i.e. the futex code, though not sure if that's being run with > > IRQs disabled... > > That futex code runs in preemptible context, per the lock_page() that > can sleep etc.. :-) > > > > > +/* > > > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM | > > > > + * FOLL_WRITE pin is permitted for a specific folio. > > > > + * > > > > + * This assumes the folio is stable and pinned. > > > > + * > > > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic > > > > + * (see comment describing the writeable_file_mapping_allowed() function). We > > > > + * therefore try to avoid the most egregious case of a long-term mapping doing > > > > + * so. > > > > + * > > > > + * This function cannot be as thorough as that one as the VMA is not available > > > > + * in the fast path, so instead we whitelist known good cases. > > > > + * > > > > + * The folio is stable, but the mapping might not be. When truncating for > > > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled > > > > + * so we are safe from an IPI, but some architectures use an RCU lock for this > > > > + * operation, so we acquire an RCU lock to ensure the mapping is stable. > > > > + */ > > > > +static bool folio_longterm_write_pin_allowed(struct folio *folio) > > > > +{ > > > > + bool ret; > > > > + > > > > + /* hugetlb mappings do not require dirty tracking. */ > > > > + if (folio_test_hugetlb(folio)) > > > > + return true; > > > > + > > > > > > This: > > > > > > > + if (stabilise_mapping_rcu(folio)) { > > > > + struct address_space *mapping = folio_mapping(folio); > > > > > > And this is 3rd read of folio->mapping, just for giggles? > > > > I like to giggle :) > > > > Actually this is to handle the various cases in which the mapping might not > > be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to > > have a helper exposed for a check. Given previous review about duplication > > I felt best to reuse this even though it does access again... yes I felt > > weird about doing that. > > Right, I had a peek inside folio_mapping(), but the point is that this > 3rd load might see yet *another* value of mapping from the prior two > loads, rendering them somewhat worthless. > > > > > + > > > > + /* > > > > + * Neither anonymous nor shmem-backed folios require > > > > + * dirty tracking. > > > > + */ > > > > + ret = folio_test_anon(folio) || > > > > + (mapping && shmem_mapping(mapping)); > > > > + } else { > > > > + /* If the mapping is unstable, fallback to the slow path. */ > > > > + ret = false; > > > > + } > > > > + > > > > + unlock_rcu(); > > > > + > > > > + return ret; > > > > > > then becomes: > > > > > > > > > if (folio_test_anon(folio)) > > > return true; > > > > This relies on the mapping so belongs below the lockdep assert imo. > > Oh, right you are. > > > > > > > /* > > > * Having IRQs disabled (as per GUP-fast) also inhibits RCU > > > * grace periods from making progress, IOW. they imply > > > * rcu_read_lock(). > > > */ > > > lockdep_assert_irqs_disabled(); > > > > > > /* > > > * Inodes and thus address_space are RCU freed and thus safe to > > > * access at this point. > > > */ > > > mapping = folio_mapping(folio); > > > if (mapping && shmem_mapping(mapping)) > > > return true; > > > > > > return false; > > > > > > > +} > > > > I'm more than happy to do this (I'd rather drop the RCU bits if possible) > > but need to be sure it's safe. > > GUP-fast as a whole relies on it :-)
On 5/2/23 9:04 AM, Christian Borntraeger wrote: > > > Am 02.05.23 um 14:54 schrieb Lorenzo Stoakes: >> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote: >>> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes: >>>> Writing to file-backed dirty-tracked mappings via GUP is inherently broken >>>> as we cannot rule out folios being cleaned and then a GUP user writing to >>>> them again and possibly marking them dirty unexpectedly. >>>> >>>> This is especially egregious for long-term mappings (as indicated by the >>>> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as >>>> we have already done in the slow path. >>> >>> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery? >>> It would no longer work with file backed memory, correct? >>> >>> See >>> arch/s390/kvm/pci.c >>> >>> kvm_s390_pci_aif_enable >>> which does have >>> FOLL_WRITE | FOLL_LONGTERM >>> to >>> >> >> Does this memory map a dirty-tracked file? It's kind of hard to dig into where >> the address originates from without going through a ton of code. In worst case >> if the fast code doesn't find a whitelist it'll fall back to slow path which >> explicitly checks for dirty-tracked filesystem. > > It does pin from whatever QEMU uses as backing for the guest. >> >> We can reintroduce a flag to permit exceptions if this is really broken, are you >> able to test? I don't have an s390 sat around :) > > Matt (Rosato on cc) probably can. In the end, it would mean having > <memoryBacking> > <source type="file"/> > </memoryBacking> > > In libvirt I guess. I am running with this series applied using a QEMU guest with memory-backend-file (using the above libvirt snippet) for a few different PCI device types and AEN forwarding (e.g. what is setup in kvm_s390_pci_aif_enable) is still working.
On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote: > On 02.05.23 15:10, Jason Gunthorpe wrote: > > On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote: > > \> > We can reintroduce a flag to permit exceptions if this is really broken, are you > > > > able to test? I don't have an s390 sat around :) > > > > > > Matt (Rosato on cc) probably can. In the end, it would mean having > > > <memoryBacking> > > > <source type="file"/> > > > </memoryBacking> > > > > This s390 code is the least of the problems, after this series VFIO > > won't startup at all with this configuration. > > Good question if the domain would fail to start. I recall that IOMMUs for > zPCI are special on s390x. [1] Not upstream they aren't. > Well, zPCI is special. I cannot immediately tell when we would trigger > long-term pinning. zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container and the normal pin_user_pages() path. > [1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg875728.html AFIACT this is talking about the nested IOMMU translation stuff. RPCIT is the hypercall to invalidate the nested IOMMU table. I expect s390 is going to have another go at implementing this using iommufd. The stuff in that series like KVM_S390_ZPCIOP_REG_IOAT didn't make it upstream. Jason
On 5/2/23 9:28 AM, David Hildenbrand wrote: > On 02.05.23 15:10, Jason Gunthorpe wrote: >> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote: >> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you >>>> able to test? I don't have an s390 sat around :) >>> >>> Matt (Rosato on cc) probably can. In the end, it would mean having >>> <memoryBacking> >>> <source type="file"/> >>> </memoryBacking> >> >> This s390 code is the least of the problems, after this series VFIO >> won't startup at all with this configuration. > > Good question if the domain would fail to start. I recall that IOMMUs for zPCI are special on s390x. [1] Eh, well what you referenced there never merged. We will eventually do something like this, but via iommufd. But right now s390 is still using vfio type1 iommu > > Well, zPCI is special. I cannot immediately tell when we would trigger long-term pinning. Enabling a zPCI device in a QEMU guest when using zPCI interpretation facilities > > [1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg875728.html >
On 02.05.23 15:36, Jason Gunthorpe wrote: > On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote: >> On 02.05.23 15:10, Jason Gunthorpe wrote: >>> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote: >>> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you >>>>> able to test? I don't have an s390 sat around :) >>>> >>>> Matt (Rosato on cc) probably can. In the end, it would mean having >>>> <memoryBacking> >>>> <source type="file"/> >>>> </memoryBacking> >>> >>> This s390 code is the least of the problems, after this series VFIO >>> won't startup at all with this configuration. >> >> Good question if the domain would fail to start. I recall that IOMMUs for >> zPCI are special on s390x. [1] > > Not upstream they aren't. > >> Well, zPCI is special. I cannot immediately tell when we would trigger >> long-term pinning. > > zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container > and the normal pin_user_pages() path. @Christian, Matthew: would we pin all guest memory when starting the domain (IIRC, like on x86-64) and fail early, or only when the guest issues rpcit instructions to map individual pages?
On 5/2/23 9:39 AM, David Hildenbrand wrote: > On 02.05.23 15:36, Jason Gunthorpe wrote: >> On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote: >>> On 02.05.23 15:10, Jason Gunthorpe wrote: >>>> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote: >>>> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you >>>>>> able to test? I don't have an s390 sat around :) >>>>> >>>>> Matt (Rosato on cc) probably can. In the end, it would mean having >>>>> <memoryBacking> >>>>> <source type="file"/> >>>>> </memoryBacking> >>>> >>>> This s390 code is the least of the problems, after this series VFIO >>>> won't startup at all with this configuration. >>> >>> Good question if the domain would fail to start. I recall that IOMMUs for >>> zPCI are special on s390x. [1] >> >> Not upstream they aren't. >> >>> Well, zPCI is special. I cannot immediately tell when we would trigger >>> long-term pinning. >> >> zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container >> and the normal pin_user_pages() path. > > > @Christian, Matthew: would we pin all guest memory when starting the domain (IIRC, like on x86-64) and fail early, or only when the guest issues rpcit instructions to map individual pages? > Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT. However, per Jason's prior suggestion, the initial implementation for s390 nesting via iommufd will pin all of guest memory when starting the domain. I have something already working via iommufd built on top of the nesting infrastructure patches and QEMU iommufd series that are floating around; needs some cleanup, hoping to send an RFC in the coming weeks. I can CC you if you'd like.
On 02.05.23 15:43, Matthew Rosato wrote: > On 5/2/23 9:39 AM, David Hildenbrand wrote: >> On 02.05.23 15:36, Jason Gunthorpe wrote: >>> On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote: >>>> On 02.05.23 15:10, Jason Gunthorpe wrote: >>>>> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote: >>>>> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you >>>>>>> able to test? I don't have an s390 sat around :) >>>>>> >>>>>> Matt (Rosato on cc) probably can. In the end, it would mean having >>>>>> <memoryBacking> >>>>>> <source type="file"/> >>>>>> </memoryBacking> >>>>> >>>>> This s390 code is the least of the problems, after this series VFIO >>>>> won't startup at all with this configuration. >>>> >>>> Good question if the domain would fail to start. I recall that IOMMUs for >>>> zPCI are special on s390x. [1] >>> >>> Not upstream they aren't. >>> >>>> Well, zPCI is special. I cannot immediately tell when we would trigger >>>> long-term pinning. >>> >>> zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container >>> and the normal pin_user_pages() path. >> >> >> @Christian, Matthew: would we pin all guest memory when starting the domain (IIRC, like on x86-64) and fail early, or only when the guest issues rpcit instructions to map individual pages? >> > > Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT. > Okay, so IIRC we'll fail starting the domain early, that's good. And if we pin all guest memory (instead of small pieces dynamically), there is little existing use for file-backed RAM in such zPCI configurations (because memory cannot be reclaimed either way if it's all pinned), so likely there are no real existing users. > However, per Jason's prior suggestion, the initial implementation for s390 nesting via iommufd will pin all of guest memory when starting the domain. I have something already working via iommufd built on top of the nesting infrastructure patches and QEMU iommufd series that are floating around; needs some cleanup, hoping to send an RFC in the coming weeks. I can CC you if you'd like. Yes, please.
On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote: > > Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT. > > Okay, so IIRC we'll fail starting the domain early, that's good. And if we > pin all guest memory (instead of small pieces dynamically), there is little > existing use for file-backed RAM in such zPCI configurations (because memory > cannot be reclaimed either way if it's all pinned), so likely there are no > real existing users. Right, this is VFIO, the physical HW can't tolerate not having pinned memory, so something somewhere is always pinning it. Which, again, makes it weird/wrong that this KVM code is pinning it again :\ Jason
On 5/2/23 9:50 AM, Jason Gunthorpe wrote: > On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote: >>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT. >> >> Okay, so IIRC we'll fail starting the domain early, that's good. And if we >> pin all guest memory (instead of small pieces dynamically), there is little >> existing use for file-backed RAM in such zPCI configurations (because memory >> cannot be reclaimed either way if it's all pinned), so likely there are no >> real existing users. > > Right, this is VFIO, the physical HW can't tolerate not having pinned > memory, so something somewhere is always pinning it. I might have mis-explained above. With iommufd nesting, we will pin everything upfront as a starting point. The current usage of vfio type1 iommu for s390 does not pin the entirety of guest memory upfront, it happens as guest RPCITs occur / type1 mappings are made. > > Which, again, makes it weird/wrong that this KVM code is pinning it > again :\ > > Jason
On 02.05.23 15:50, Jason Gunthorpe wrote: > On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote: >>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT. >> >> Okay, so IIRC we'll fail starting the domain early, that's good. And if we >> pin all guest memory (instead of small pieces dynamically), there is little >> existing use for file-backed RAM in such zPCI configurations (because memory >> cannot be reclaimed either way if it's all pinned), so likely there are no >> real existing users. > > Right, this is VFIO, the physical HW can't tolerate not having pinned > memory, so something somewhere is always pinning it. > > Which, again, makes it weird/wrong that this KVM code is pinning it > again :\ IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for passthrough of (adapter) interrupts. I have to speculate, but I guess for hardware to forward interrupts to the VM, it has to pin the special guest memory page that will receive the indications, to then configure (interrupt) hardware to target the interrupt indications to that special guest page (using a host physical address).
On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote: > On 02.05.23 15:50, Jason Gunthorpe wrote: > > On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote: > > > > Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT. > > > > > > Okay, so IIRC we'll fail starting the domain early, that's good. And if we > > > pin all guest memory (instead of small pieces dynamically), there is little > > > existing use for file-backed RAM in such zPCI configurations (because memory > > > cannot be reclaimed either way if it's all pinned), so likely there are no > > > real existing users. > > > > Right, this is VFIO, the physical HW can't tolerate not having pinned > > memory, so something somewhere is always pinning it. > > > > Which, again, makes it weird/wrong that this KVM code is pinning it > > again :\ > > IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for > passthrough of (adapter) interrupts. > > I have to speculate, but I guess for hardware to forward interrupts to the > VM, it has to pin the special guest memory page that will receive the > indications, to then configure (interrupt) hardware to target the interrupt > indications to that special guest page (using a host physical address). Either the emulated access is "CPU" based happening through the KVM page table so it should use mmu_notifier locking. Or it is "DMA" and should go through an IOVA through iommufd pinning and locking. There is no other ground, nothing in KVM should be inventing its own access methodology. Jason
On 02.05.23 16:04, Jason Gunthorpe wrote: > On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote: >> On 02.05.23 15:50, Jason Gunthorpe wrote: >>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote: >>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT. >>>> >>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we >>>> pin all guest memory (instead of small pieces dynamically), there is little >>>> existing use for file-backed RAM in such zPCI configurations (because memory >>>> cannot be reclaimed either way if it's all pinned), so likely there are no >>>> real existing users. >>> >>> Right, this is VFIO, the physical HW can't tolerate not having pinned >>> memory, so something somewhere is always pinning it. >>> >>> Which, again, makes it weird/wrong that this KVM code is pinning it >>> again :\ >> >> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for >> passthrough of (adapter) interrupts. >> >> I have to speculate, but I guess for hardware to forward interrupts to the >> VM, it has to pin the special guest memory page that will receive the >> indications, to then configure (interrupt) hardware to target the interrupt >> indications to that special guest page (using a host physical address). > > Either the emulated access is "CPU" based happening through the KVM > page table so it should use mmu_notifier locking. > > Or it is "DMA" and should go through an IOVA through iommufd pinning > and locking. > > There is no other ground, nothing in KVM should be inventing its own > access methodology. I might be wrong, but this seems to be a bit different. It cannot tolerate page faults (needs a host physical address), so memory notifiers don't really apply. (as a side note, KVM on s390x does not use mmu notifiers as we know them) It's kind-of like DMA, but it's not really DMA. It's the CPU delivering interrupts for a specific device. So we're configuring the interrupt controller I guess to target a guest memory page. But I have way too little knowledge about zPCI and the code in question here. And if it could be converted to iommufd (and if that's really the right mechanism to use here). Hopefully Matthew knows the details and if this really needs to be special :)
On 5/2/23 10:15 AM, David Hildenbrand wrote: > On 02.05.23 16:04, Jason Gunthorpe wrote: >> On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote: >>> On 02.05.23 15:50, Jason Gunthorpe wrote: >>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote: >>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT. >>>>> >>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we >>>>> pin all guest memory (instead of small pieces dynamically), there is little >>>>> existing use for file-backed RAM in such zPCI configurations (because memory >>>>> cannot be reclaimed either way if it's all pinned), so likely there are no >>>>> real existing users. >>>> >>>> Right, this is VFIO, the physical HW can't tolerate not having pinned >>>> memory, so something somewhere is always pinning it. >>>> >>>> Which, again, makes it weird/wrong that this KVM code is pinning it >>>> again :\ >>> >>> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for >>> passthrough of (adapter) interrupts. >>> >>> I have to speculate, but I guess for hardware to forward interrupts to the >>> VM, it has to pin the special guest memory page that will receive the >>> indications, to then configure (interrupt) hardware to target the interrupt >>> indications to that special guest page (using a host physical address). >> >> Either the emulated access is "CPU" based happening through the KVM >> page table so it should use mmu_notifier locking. >> >> Or it is "DMA" and should go through an IOVA through iommufd pinning >> and locking. >> >> There is no other ground, nothing in KVM should be inventing its own >> access methodology. > > I might be wrong, but this seems to be a bit different. > > It cannot tolerate page faults (needs a host physical address), so memory notifiers don't really apply. (as a side note, KVM on s390x does not use mmu notifiers as we know them) The host physical address is one shared between underlying firmware and the host kvm. Either might make changes to the referenced page and then issue an alert to the guest via a mechanism called GISA, giving impetus to the guest to look at that page and process the event. As you say, firmware can't tolerate the page being unavailable; it's expecting that once we feed it that location it's always available until we remove it (kvm_s390_pci_aif_disable). > > It's kind-of like DMA, but it's not really DMA. It's the CPU delivering interrupts for a specific device. So we're configuring the interrupt controller I guess to target a guest memory page. > > But I have way too little knowledge about zPCI and the code in question here. And if it could be converted to iommufd (and if that's really the right mechanism to use here). > > Hopefully Matthew knows the details and if this really needs to be special :) I think I need to have a look at mmu_notifiers to understand that better, but in the end firmware still needs a reliable page to deliver events to.
On 02.05.23 15:35, Matthew Rosato wrote: > On 5/2/23 9:04 AM, Christian Borntraeger wrote: >> >> >> Am 02.05.23 um 14:54 schrieb Lorenzo Stoakes: >>> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote: >>>> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes: >>>>> Writing to file-backed dirty-tracked mappings via GUP is inherently broken >>>>> as we cannot rule out folios being cleaned and then a GUP user writing to >>>>> them again and possibly marking them dirty unexpectedly. >>>>> >>>>> This is especially egregious for long-term mappings (as indicated by the >>>>> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as >>>>> we have already done in the slow path. >>>> >>>> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery? >>>> It would no longer work with file backed memory, correct? >>>> >>>> See >>>> arch/s390/kvm/pci.c >>>> >>>> kvm_s390_pci_aif_enable >>>> which does have >>>> FOLL_WRITE | FOLL_LONGTERM >>>> to >>>> >>> >>> Does this memory map a dirty-tracked file? It's kind of hard to dig into where >>> the address originates from without going through a ton of code. In worst case >>> if the fast code doesn't find a whitelist it'll fall back to slow path which >>> explicitly checks for dirty-tracked filesystem. >> >> It does pin from whatever QEMU uses as backing for the guest. >>> >>> We can reintroduce a flag to permit exceptions if this is really broken, are you >>> able to test? I don't have an s390 sat around :) >> >> Matt (Rosato on cc) probably can. In the end, it would mean having >> <memoryBacking> >> <source type="file"/> >> </memoryBacking> >> >> In libvirt I guess. > > I am running with this series applied using a QEMU guest with memory-backend-file (using the above libvirt snippet) for a few different PCI device types and AEN forwarding (e.g. what is setup in kvm_s390_pci_aif_enable) is still working. > That's ... unexpected. :) Either this series doesn't work as expected or you end up using a filesystem that is still compatible. But I guess most applicable filesystems (ext4, btrfs, xfs) all have a page_mkwrite callback and should, therefore, disallow long-term pinning with this series.
On 02.05.23 15:56, Matthew Rosato wrote: > On 5/2/23 9:50 AM, Jason Gunthorpe wrote: >> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote: >>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT. >>> >>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we >>> pin all guest memory (instead of small pieces dynamically), there is little >>> existing use for file-backed RAM in such zPCI configurations (because memory >>> cannot be reclaimed either way if it's all pinned), so likely there are no >>> real existing users. >> >> Right, this is VFIO, the physical HW can't tolerate not having pinned >> memory, so something somewhere is always pinning it. > > I might have mis-explained above. > > With iommufd nesting, we will pin everything upfront as a starting point. > > The current usage of vfio type1 iommu for s390 does not pin the entirety of guest memory upfront, it happens as guest RPCITs occur / type1 mappings are made. ... so, after the domain started successfully on the libvirt/QEMU side ? :/ It would be great to confirm that. There might be a BUG in patch #2 (see my reply to patch #2) that might not allow you to reproduce it right now.
On Tue, May 02, 2023 at 05:09:06PM +0200, David Hildenbrand wrote: > On 02.05.23 15:56, Matthew Rosato wrote: > > On 5/2/23 9:50 AM, Jason Gunthorpe wrote: > > > On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote: > > > > > Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT. > > > > > > > > Okay, so IIRC we'll fail starting the domain early, that's good. And if we > > > > pin all guest memory (instead of small pieces dynamically), there is little > > > > existing use for file-backed RAM in such zPCI configurations (because memory > > > > cannot be reclaimed either way if it's all pinned), so likely there are no > > > > real existing users. > > > > > > Right, this is VFIO, the physical HW can't tolerate not having pinned > > > memory, so something somewhere is always pinning it. > > > > I might have mis-explained above. > > > > With iommufd nesting, we will pin everything upfront as a starting point. > > > > The current usage of vfio type1 iommu for s390 does not pin the entirety of guest memory upfront, it happens as guest RPCITs occur / type1 mappings are made. > > ... so, after the domain started successfully on the libvirt/QEMU side ? :/ > > It would be great to confirm that. There might be a BUG in patch #2 (see my > reply to patch #2) that might not allow you to reproduce it right now. > Yes apologies - thank you VERY much for doing this Matthew, but apologies, I made rather a clanger in patch 2 which would mean fast patch degrading to slow path would pass even for file-backed. Will respin a v7 + cc you on that, if you could be so kind as to test that? > -- > Thanks, > > David / dhildenb >
On 5/2/23 10:57 AM, David Hildenbrand wrote: > On 02.05.23 15:35, Matthew Rosato wrote: >> On 5/2/23 9:04 AM, Christian Borntraeger wrote: >>> >>> >>> Am 02.05.23 um 14:54 schrieb Lorenzo Stoakes: >>>> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote: >>>>> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes: >>>>>> Writing to file-backed dirty-tracked mappings via GUP is inherently broken >>>>>> as we cannot rule out folios being cleaned and then a GUP user writing to >>>>>> them again and possibly marking them dirty unexpectedly. >>>>>> >>>>>> This is especially egregious for long-term mappings (as indicated by the >>>>>> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as >>>>>> we have already done in the slow path. >>>>> >>>>> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery? >>>>> It would no longer work with file backed memory, correct? >>>>> >>>>> See >>>>> arch/s390/kvm/pci.c >>>>> >>>>> kvm_s390_pci_aif_enable >>>>> which does have >>>>> FOLL_WRITE | FOLL_LONGTERM >>>>> to >>>>> >>>> >>>> Does this memory map a dirty-tracked file? It's kind of hard to dig into where >>>> the address originates from without going through a ton of code. In worst case >>>> if the fast code doesn't find a whitelist it'll fall back to slow path which >>>> explicitly checks for dirty-tracked filesystem. >>> >>> It does pin from whatever QEMU uses as backing for the guest. >>>> >>>> We can reintroduce a flag to permit exceptions if this is really broken, are you >>>> able to test? I don't have an s390 sat around :) >>> >>> Matt (Rosato on cc) probably can. In the end, it would mean having >>> <memoryBacking> >>> <source type="file"/> >>> </memoryBacking> >>> >>> In libvirt I guess. >> >> I am running with this series applied using a QEMU guest with memory-backend-file (using the above libvirt snippet) for a few different PCI device types and AEN forwarding (e.g. what is setup in kvm_s390_pci_aif_enable) is still working. >> > > That's ... unexpected. :) > > Either this series doesn't work as expected or you end up using a filesystem that is still compatible. But I guess most applicable filesystems (ext4, btrfs, xfs) all have a page_mkwrite callback and should, therefore, disallow long-term pinning with this series. > The memory backend file is on ext4 in my tests. A quick trace shows that pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) in kvm_s390_pci_aif_enable is still returning positive implying pages are being pinned.
On 5/2/23 11:19 AM, Lorenzo Stoakes wrote: > On Tue, May 02, 2023 at 05:09:06PM +0200, David Hildenbrand wrote: >> On 02.05.23 15:56, Matthew Rosato wrote: >>> On 5/2/23 9:50 AM, Jason Gunthorpe wrote: >>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote: >>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT. >>>>> >>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we >>>>> pin all guest memory (instead of small pieces dynamically), there is little >>>>> existing use for file-backed RAM in such zPCI configurations (because memory >>>>> cannot be reclaimed either way if it's all pinned), so likely there are no >>>>> real existing users. >>>> >>>> Right, this is VFIO, the physical HW can't tolerate not having pinned >>>> memory, so something somewhere is always pinning it. >>> >>> I might have mis-explained above. >>> >>> With iommufd nesting, we will pin everything upfront as a starting point. >>> >>> The current usage of vfio type1 iommu for s390 does not pin the entirety of guest memory upfront, it happens as guest RPCITs occur / type1 mappings are made. >> >> ... so, after the domain started successfully on the libvirt/QEMU side ? :/ >> >> It would be great to confirm that. There might be a BUG in patch #2 (see my >> reply to patch #2) that might not allow you to reproduce it right now. >> > > Yes apologies - thank you VERY much for doing this Matthew, but apologies, I > made rather a clanger in patch 2 which would mean fast patch degrading to slow > path would pass even for file-backed. > > Will respin a v7 + cc you on that, if you could be so kind as to test that? > Sure, will do!
On Tue, May 02, 2023 at 10:54:35AM -0400, Matthew Rosato wrote: > On 5/2/23 10:15 AM, David Hildenbrand wrote: > > On 02.05.23 16:04, Jason Gunthorpe wrote: > >> On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote: > >>> On 02.05.23 15:50, Jason Gunthorpe wrote: > >>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote: > >>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT. > >>>>> > >>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we > >>>>> pin all guest memory (instead of small pieces dynamically), there is little > >>>>> existing use for file-backed RAM in such zPCI configurations (because memory > >>>>> cannot be reclaimed either way if it's all pinned), so likely there are no > >>>>> real existing users. > >>>> > >>>> Right, this is VFIO, the physical HW can't tolerate not having pinned > >>>> memory, so something somewhere is always pinning it. > >>>> > >>>> Which, again, makes it weird/wrong that this KVM code is pinning it > >>>> again :\ > >>> > >>> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for > >>> passthrough of (adapter) interrupts. > >>> > >>> I have to speculate, but I guess for hardware to forward interrupts to the > >>> VM, it has to pin the special guest memory page that will receive the > >>> indications, to then configure (interrupt) hardware to target the interrupt > >>> indications to that special guest page (using a host physical address). > >> > >> Either the emulated access is "CPU" based happening through the KVM > >> page table so it should use mmu_notifier locking. > >> > >> Or it is "DMA" and should go through an IOVA through iommufd pinning > >> and locking. > >> > >> There is no other ground, nothing in KVM should be inventing its own > >> access methodology. > > > > I might be wrong, but this seems to be a bit different. > > > > It cannot tolerate page faults (needs a host physical address), so > > memory notifiers don't really apply. (as a side note, KVM on s390x > > does not use mmu notifiers as we know them) > > The host physical address is one shared between underlying firmware > and the host kvm. Either might make changes to the referenced page > and then issue an alert to the guest via a mechanism called GISA, > giving impetus to the guest to look at that page and process the > event. As you say, firmware can't tolerate the page being > unavailable; it's expecting that once we feed it that location it's > always available until we remove it (kvm_s390_pci_aif_disable). That is a CPU access delegated to the FW without any locking scheme to make it safe with KVM :\ It would have been better if FW could inject it through the kvm page tables so it has some coherency. Otherwise you have to call this "DMA", I think. How does s390 avoid mmu notifiers without having lots of problems?? It is not really optional to hook the invalidations if you need to build a shadow page table.. Jason
On Tue, May 02, 2023 at 12:20:46PM -0300, Jason Gunthorpe wrote: > On Tue, May 02, 2023 at 10:54:35AM -0400, Matthew Rosato wrote: > > On 5/2/23 10:15 AM, David Hildenbrand wrote: > > > On 02.05.23 16:04, Jason Gunthorpe wrote: > > >> On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote: > > >>> On 02.05.23 15:50, Jason Gunthorpe wrote: > > >>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote: > > >>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT. > > >>>>> > > >>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we > > >>>>> pin all guest memory (instead of small pieces dynamically), there is little > > >>>>> existing use for file-backed RAM in such zPCI configurations (because memory > > >>>>> cannot be reclaimed either way if it's all pinned), so likely there are no > > >>>>> real existing users. > > >>>> > > >>>> Right, this is VFIO, the physical HW can't tolerate not having pinned > > >>>> memory, so something somewhere is always pinning it. > > >>>> > > >>>> Which, again, makes it weird/wrong that this KVM code is pinning it > > >>>> again :\ > > >>> > > >>> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for > > >>> passthrough of (adapter) interrupts. > > >>> > > >>> I have to speculate, but I guess for hardware to forward interrupts to the > > >>> VM, it has to pin the special guest memory page that will receive the > > >>> indications, to then configure (interrupt) hardware to target the interrupt > > >>> indications to that special guest page (using a host physical address). > > >> > > >> Either the emulated access is "CPU" based happening through the KVM > > >> page table so it should use mmu_notifier locking. > > >> > > >> Or it is "DMA" and should go through an IOVA through iommufd pinning > > >> and locking. > > >> > > >> There is no other ground, nothing in KVM should be inventing its own > > >> access methodology. > > > > > > I might be wrong, but this seems to be a bit different. > > > > > > It cannot tolerate page faults (needs a host physical address), so > > > memory notifiers don't really apply. (as a side note, KVM on s390x > > > does not use mmu notifiers as we know them) > > > > The host physical address is one shared between underlying firmware > > and the host kvm. Either might make changes to the referenced page > > and then issue an alert to the guest via a mechanism called GISA, > > giving impetus to the guest to look at that page and process the > > event. As you say, firmware can't tolerate the page being > > unavailable; it's expecting that once we feed it that location it's > > always available until we remove it (kvm_s390_pci_aif_disable). > > That is a CPU access delegated to the FW without any locking scheme to > make it safe with KVM :\ > > It would have been better if FW could inject it through the kvm page > tables so it has some coherency. > > Otherwise you have to call this "DMA", I think. > > How does s390 avoid mmu notifiers without having lots of problems?? It > is not really optional to hook the invalidations if you need to build > a shadow page table.. Totally no idea on s390 details, but.. per my read above, if the firmware needs to make sure the page is always available (so no way to fault it in on demand), which means a longterm pinning seems appropriate here. Then if pinned a must, there's no need for mmu notifiers (as the page will simply not be invalidated anyway)? Thanks,
On Tue, May 02, 2023 at 11:32:57AM -0400, Peter Xu wrote: > > How does s390 avoid mmu notifiers without having lots of problems?? It > > is not really optional to hook the invalidations if you need to build > > a shadow page table.. > > Totally no idea on s390 details, but.. per my read above, if the firmware > needs to make sure the page is always available (so no way to fault it in > on demand), which means a longterm pinning seems appropriate here. > > Then if pinned a must, there's no need for mmu notifiers (as the page will > simply not be invalidated anyway)? And what if someone deliberately changes the mapping? memory hotplug in the VM, or whatever? We have ways to deal with this class of stuff in kvm and in iommufd/vfio - this does not. Jason
On 02.05.23 17:36, Jason Gunthorpe wrote: > On Tue, May 02, 2023 at 11:32:57AM -0400, Peter Xu wrote: >>> How does s390 avoid mmu notifiers without having lots of problems?? It >>> is not really optional to hook the invalidations if you need to build >>> a shadow page table.. >> >> Totally no idea on s390 details, but.. per my read above, if the firmware >> needs to make sure the page is always available (so no way to fault it in >> on demand), which means a longterm pinning seems appropriate here. >> >> Then if pinned a must, there's no need for mmu notifiers (as the page will >> simply not be invalidated anyway)? > > And what if someone deliberately changes the mapping? memory hotplug > in the VM, or whatever? Besides s390 not supporting memory hotplug in VMs (yet): if the guest wants a different guest physical address, I guess that's the problem of the guest, and it can update it: KVM_S390_ZPCIOP_REG_AEN is triggered from QEMU via s390_pci_kvm_aif_enable(), triggered by the guest via a special instruction. If the hypervisor changes the mapping, it's just the same thing as mixing e.g. MADV_DONTNEED with longterm pinning in vfio: don't do it. And if you do it, you get to keep the mess you created for your VM. Linux will make sure to not change the mapping: for example, page migration of a pinned page will fail. But maybe I am missing something important here.
On Tue, May 02, 2023 at 05:45:40PM +0200, David Hildenbrand wrote: > On 02.05.23 17:36, Jason Gunthorpe wrote: > > On Tue, May 02, 2023 at 11:32:57AM -0400, Peter Xu wrote: > > > > How does s390 avoid mmu notifiers without having lots of problems?? It > > > > is not really optional to hook the invalidations if you need to build > > > > a shadow page table.. > > > > > > Totally no idea on s390 details, but.. per my read above, if the firmware > > > needs to make sure the page is always available (so no way to fault it in > > > on demand), which means a longterm pinning seems appropriate here. > > > > > > Then if pinned a must, there's no need for mmu notifiers (as the page will > > > simply not be invalidated anyway)? > > > > And what if someone deliberately changes the mapping? memory hotplug > > in the VM, or whatever? > > Besides s390 not supporting memory hotplug in VMs (yet): if the guest wants > a different guest physical address, I guess that's the problem of the guest, > and it can update it: > > KVM_S390_ZPCIOP_REG_AEN is triggered from QEMU via > s390_pci_kvm_aif_enable(), triggered by the guest via a special instruction. > > If the hypervisor changes the mapping, it's just the same thing as mixing > e.g. MADV_DONTNEED with longterm pinning in vfio: don't do it. And if you do > it, you get to keep the mess you created for your VM. > > Linux will make sure to not change the mapping: for example, page migration > of a pinned page will fail. > > But maybe I am missing something important here. It missses the general architectural point why we have all these shootdown mechanims in other places - plares are not supposed to make these kinds of assumptions. When the userspace unplugs the memory from KVM or unmaps it from VFIO it is not still being accessed by the kernel. Functional bug or not, it is inconsistent with how this is designed to work. Jason
On 02.05.23 18:06, Jason Gunthorpe wrote: > On Tue, May 02, 2023 at 05:45:40PM +0200, David Hildenbrand wrote: >> On 02.05.23 17:36, Jason Gunthorpe wrote: >>> On Tue, May 02, 2023 at 11:32:57AM -0400, Peter Xu wrote: >>>>> How does s390 avoid mmu notifiers without having lots of problems?? It >>>>> is not really optional to hook the invalidations if you need to build >>>>> a shadow page table.. >>>> >>>> Totally no idea on s390 details, but.. per my read above, if the firmware >>>> needs to make sure the page is always available (so no way to fault it in >>>> on demand), which means a longterm pinning seems appropriate here. >>>> >>>> Then if pinned a must, there's no need for mmu notifiers (as the page will >>>> simply not be invalidated anyway)? >>> >>> And what if someone deliberately changes the mapping? memory hotplug >>> in the VM, or whatever? >> >> Besides s390 not supporting memory hotplug in VMs (yet): if the guest wants >> a different guest physical address, I guess that's the problem of the guest, >> and it can update it: >> >> KVM_S390_ZPCIOP_REG_AEN is triggered from QEMU via >> s390_pci_kvm_aif_enable(), triggered by the guest via a special instruction. >> >> If the hypervisor changes the mapping, it's just the same thing as mixing >> e.g. MADV_DONTNEED with longterm pinning in vfio: don't do it. And if you do >> it, you get to keep the mess you created for your VM. >> >> Linux will make sure to not change the mapping: for example, page migration >> of a pinned page will fail. >> >> But maybe I am missing something important here. > > It missses the general architectural point why we have all these > shootdown mechanims in other places - plares are not supposed to make > these kinds of assumptions. When the userspace unplugs the memory from > KVM or unmaps it from VFIO it is not still being accessed by the > kernel. Yes. Like having memory in a vfio iommu v1 and doing the same (mremap, munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g., virtio-balloon) in QEMU with vfio. > > Functional bug or not, it is inconsistent with how this is designed to > work. Sorry to say, I *really* don't see how that is supposed to work with a page that *cannot* be faulted back in on demand.
On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote: > > It missses the general architectural point why we have all these > > shootdown mechanims in other places - plares are not supposed to make > > these kinds of assumptions. When the userspace unplugs the memory from > > KVM or unmaps it from VFIO it is not still being accessed by the > > kernel. > > Yes. Like having memory in a vfio iommu v1 and doing the same (mremap, > munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g., > virtio-balloon) in QEMU with vfio. That is different, VFIO has it's own contract how it consumes the memory from the MM and VFIO breaks all this stuff. But when you tell VFIO to unmap the memory it doesn't keep accessing it in the background like this does. Jason
On 02.05.23 18:19, Jason Gunthorpe wrote: > On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote: > >>> It missses the general architectural point why we have all these >>> shootdown mechanims in other places - plares are not supposed to make >>> these kinds of assumptions. When the userspace unplugs the memory from >>> KVM or unmaps it from VFIO it is not still being accessed by the >>> kernel. >> >> Yes. Like having memory in a vfio iommu v1 and doing the same (mremap, >> munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g., >> virtio-balloon) in QEMU with vfio. > > That is different, VFIO has it's own contract how it consumes the > memory from the MM and VFIO breaks all this stuff. > > But when you tell VFIO to unmap the memory it doesn't keep accessing > it in the background like this does. To me, this is similar to when QEMU (user space) triggers KVM_S390_ZPCIOP_DEREG_AEN, to tell KVM to disable AIF and stop using the page (1) When triggered by the guest explicitly (2) when resetting the VM (3) when resetting the virtual PCI device / configuration. Interrupt gets unregistered from HW (which stops using the page), the pages get unpinned. Pages get no longer used. I guess I am still missing (a) how this is fundamentally different (b) how it could be done differently. I'd really be happy to learn how a better approach would look like that does not use longterm pinnings. I don't see an easy way to not use longterm pinnings. When using mmu notifiers and getting notified about unmapping of a page (for whatever reason ... migration, swapout, unmap), you'd have to disable aif. But when to reenable it (maybe there would be a way)? Also, I'm not sure if this could even be visible by the guest, if it's suddenly no longer enabled. Something for the s390x people to explore ... if HW would be providing a way to deal with that somehow.
On Tue, May 02, 2023 at 06:32:23PM +0200, David Hildenbrand wrote: > On 02.05.23 18:19, Jason Gunthorpe wrote: > > On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote: > > > > > > It missses the general architectural point why we have all these > > > > shootdown mechanims in other places - plares are not supposed to make > > > > these kinds of assumptions. When the userspace unplugs the memory from > > > > KVM or unmaps it from VFIO it is not still being accessed by the > > > > kernel. > > > > > > Yes. Like having memory in a vfio iommu v1 and doing the same (mremap, > > > munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g., > > > virtio-balloon) in QEMU with vfio. > > > > That is different, VFIO has it's own contract how it consumes the > > memory from the MM and VFIO breaks all this stuff. > > > > But when you tell VFIO to unmap the memory it doesn't keep accessing > > it in the background like this does. > > To me, this is similar to when QEMU (user space) triggers > KVM_S390_ZPCIOP_DEREG_AEN, to tell KVM to disable AIF and stop using the > page (1) When triggered by the guest explicitly (2) when resetting the VM > (3) when resetting the virtual PCI device / configuration. > > Interrupt gets unregistered from HW (which stops using the page), the pages > get unpinned. Pages get no longer used. > > I guess I am still missing (a) how this is fundamentally different (b) how > it could be done differently. It uses an address that is already scoped within the KVM memory map and uses KVM's gpa_to_gfn() to translate it to some pinnable page It is not some independent thing like VFIO, it is explicitly scoped within the existing KVM structure and it does not follow any mutations that are done to the gpa map through the usual KVM APIs. > I'd really be happy to learn how a better approach would look like that does > not use longterm pinnings. Sounds like the FW sadly needs pinnings. This is why I said it looks like DMA. If possible it would be better to get the pinning through VFIO, eg as a mdev Otherwise, it would have been cleaner if this was divorced from KVM and took in a direct user pointer, then maybe you could make the argument is its own thing with its own lifetime rules. (then you are kind of making your own mdev) Or, perhaps, this is really part of some radical "irqfd" that we've been on and off talking about specifically to get this area of interrupt bypass uAPI'd properly.. Jason
On 5/2/23 1:46 PM, Jason Gunthorpe wrote: > On Tue, May 02, 2023 at 06:32:23PM +0200, David Hildenbrand wrote: >> On 02.05.23 18:19, Jason Gunthorpe wrote: >>> On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote: >>> >>>>> It missses the general architectural point why we have all these >>>>> shootdown mechanims in other places - plares are not supposed to make >>>>> these kinds of assumptions. When the userspace unplugs the memory from >>>>> KVM or unmaps it from VFIO it is not still being accessed by the >>>>> kernel. >>>> >>>> Yes. Like having memory in a vfio iommu v1 and doing the same (mremap, >>>> munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g., >>>> virtio-balloon) in QEMU with vfio. >>> >>> That is different, VFIO has it's own contract how it consumes the >>> memory from the MM and VFIO breaks all this stuff. >>> >>> But when you tell VFIO to unmap the memory it doesn't keep accessing >>> it in the background like this does. >> >> To me, this is similar to when QEMU (user space) triggers >> KVM_S390_ZPCIOP_DEREG_AEN, to tell KVM to disable AIF and stop using the >> page (1) When triggered by the guest explicitly (2) when resetting the VM >> (3) when resetting the virtual PCI device / configuration. >> >> Interrupt gets unregistered from HW (which stops using the page), the pages >> get unpinned. Pages get no longer used. >> >> I guess I am still missing (a) how this is fundamentally different (b) how >> it could be done differently. > > It uses an address that is already scoped within the KVM memory map > and uses KVM's gpa_to_gfn() to translate it to some pinnable page > > It is not some independent thing like VFIO, it is explicitly scoped > within the existing KVM structure and it does not follow any mutations > that are done to the gpa map through the usual KVM APIs. > >> I'd really be happy to learn how a better approach would look like that does >> not use longterm pinnings. > > Sounds like the FW sadly needs pinnings. This is why I said it looks > like DMA. If possible it would be better to get the pinning through > VFIO, eg as a mdev Hrm, these are today handled as a vfio-pci device (e.g. no mdev) so that would be a pretty significant change. > > Otherwise, it would have been cleaner if this was divorced from KVM > and took in a direct user pointer, then maybe you could make the > argument is its own thing with its own lifetime rules. (then you are > kind of making your own mdev) Problem there is that firmware needs both the location (where to indicate the event) and the identity of the KVM instance (what guest to ship the GISA alert to) so I don't think we can completely divorce them. > > Or, perhaps, this is really part of some radical "irqfd" that we've > been on and off talking about specifically to get this area of > interrupt bypass uAPI'd properly.. I investigated that at one point and could not seem to get it to fit; I'll have another look there.
On Tue, May 02, 2023 at 01:59:09PM -0400, Matthew Rosato wrote: > >> I'd really be happy to learn how a better approach would look like that does > >> not use longterm pinnings. > > > > Sounds like the FW sadly needs pinnings. This is why I said it looks > > like DMA. If possible it would be better to get the pinning through > > VFIO, eg as a mdev > > Hrm, these are today handled as a vfio-pci device (e.g. no mdev) so that would be a pretty significant change. I was thinking more of having your vIRQ controller as a vfio-mdev, so qemu would have to open the irq controller as well as the kvm and the PCI device > > Otherwise, it would have been cleaner if this was divorced from KVM > > and took in a direct user pointer, then maybe you could make the > > argument is its own thing with its own lifetime rules. (then you are > > kind of making your own mdev) > > Problem there is that firmware needs both the location (where to > indicate the event) and the identity of the KVM instance (what guest > to ship the GISA alert to) so I don't think we can completely > divorce them. I should have said "kvm page table", being in the KVM ioctls is sort of OK - but I'm really skeptical that KVM is a good place to put HW device wrappers. > > Or, perhaps, this is really part of some radical "irqfd" that we've > > been on and off talking about specifically to get this area of > > interrupt bypass uAPI'd properly.. > > I investigated that at one point and could not seem to get it to > fit; I'll have another look there. It is a long journey, but other arches have problems here too Jason
On 02.05.23 19:46, Jason Gunthorpe wrote: > On Tue, May 02, 2023 at 06:32:23PM +0200, David Hildenbrand wrote: >> On 02.05.23 18:19, Jason Gunthorpe wrote: >>> On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote: >>> >>>>> It missses the general architectural point why we have all these >>>>> shootdown mechanims in other places - plares are not supposed to make >>>>> these kinds of assumptions. When the userspace unplugs the memory from >>>>> KVM or unmaps it from VFIO it is not still being accessed by the >>>>> kernel. >>>> >>>> Yes. Like having memory in a vfio iommu v1 and doing the same (mremap, >>>> munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g., >>>> virtio-balloon) in QEMU with vfio. >>> >>> That is different, VFIO has it's own contract how it consumes the >>> memory from the MM and VFIO breaks all this stuff. >>> >>> But when you tell VFIO to unmap the memory it doesn't keep accessing >>> it in the background like this does. >> >> To me, this is similar to when QEMU (user space) triggers >> KVM_S390_ZPCIOP_DEREG_AEN, to tell KVM to disable AIF and stop using the >> page (1) When triggered by the guest explicitly (2) when resetting the VM >> (3) when resetting the virtual PCI device / configuration. >> >> Interrupt gets unregistered from HW (which stops using the page), the pages >> get unpinned. Pages get no longer used. >> >> I guess I am still missing (a) how this is fundamentally different (b) how >> it could be done differently. > > It uses an address that is already scoped within the KVM memory map > and uses KVM's gpa_to_gfn() to translate it to some pinnable page > > It is not some independent thing like VFIO, it is explicitly scoped > within the existing KVM structure and it does not follow any mutations > that are done to the gpa map through the usual KVM APIs. Right, it consumes guest physical addresses that are translated via the KVM memslots. Agreed that it does not (and possibly cannot easily) update the hardware when the KVM mapping (memslots) would ever change. I guess it's also not documented that this is not supported. > >> I'd really be happy to learn how a better approach would look like that does >> not use longterm pinnings. > > Sounds like the FW sadly needs pinnings. This is why I said it looks > like DMA. If possible it would be better to get the pinning through > VFIO, eg as a mdev > > Otherwise, it would have been cleaner if this was divorced from KVM > and took in a direct user pointer, then maybe you could make the > argument is its own thing with its own lifetime rules. (then you are > kind of making your own mdev) It would be cleaner if user space would translate the GPA to a HVA and provid that, agreed ... > > Or, perhaps, this is really part of some radical "irqfd" that we've > been on and off talking about specifically to get this area of > interrupt bypass uAPI'd properly.. Most probably. It's one of these very special cases ... thankfully: $ git grep -i longterm | grep kvm arch/s390/kvm/pci.c: npages = pin_user_pages_fast(hva, 1, FOLL_WRITE | FOLL_LONGTERM, pages); arch/s390/kvm/pci.c: npages = pin_user_pages_fast(hva, 1, FOLL_WRITE | FOLL_LONGTERM,
diff --git a/mm/gup.c b/mm/gup.c index 0f09dec0906c..431618048a03 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -18,6 +18,7 @@ #include <linux/migrate.h> #include <linux/mm_inline.h> #include <linux/sched/mm.h> +#include <linux/shmem_fs.h> #include <asm/mmu_context.h> #include <asm/tlbflush.h> @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs) return folio; } +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE +static bool stabilise_mapping_rcu(struct folio *folio) +{ + struct address_space *mapping = READ_ONCE(folio->mapping); + + rcu_read_lock(); + + return mapping == READ_ONCE(folio->mapping); +} + +static void unlock_rcu(void) +{ + rcu_read_unlock(); +} +#else +static bool stabilise_mapping_rcu(struct folio *) +{ + return true; +} + +static void unlock_rcu(void) +{ +} +#endif + +/* + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM | + * FOLL_WRITE pin is permitted for a specific folio. + * + * This assumes the folio is stable and pinned. + * + * Writing to pinned file-backed dirty tracked folios is inherently problematic + * (see comment describing the writeable_file_mapping_allowed() function). We + * therefore try to avoid the most egregious case of a long-term mapping doing + * so. + * + * This function cannot be as thorough as that one as the VMA is not available + * in the fast path, so instead we whitelist known good cases. + * + * The folio is stable, but the mapping might not be. When truncating for + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled + * so we are safe from an IPI, but some architectures use an RCU lock for this + * operation, so we acquire an RCU lock to ensure the mapping is stable. + */ +static bool folio_longterm_write_pin_allowed(struct folio *folio) +{ + bool ret; + + /* hugetlb mappings do not require dirty tracking. */ + if (folio_test_hugetlb(folio)) + return true; + + if (stabilise_mapping_rcu(folio)) { + struct address_space *mapping = folio_mapping(folio); + + /* + * Neither anonymous nor shmem-backed folios require + * dirty tracking. + */ + ret = folio_test_anon(folio) || + (mapping && shmem_mapping(mapping)); + } else { + /* If the mapping is unstable, fallback to the slow path. */ + ret = false; + } + + unlock_rcu(); + + return ret; +} + /** * try_grab_folio() - Attempt to get or pin a folio. * @page: pointer to page to be grabbed @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs) */ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) { + bool is_longterm = flags & FOLL_LONGTERM; + if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page))) return NULL; @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) * right zone, so fail and let the caller fall back to the slow * path. */ - if (unlikely((flags & FOLL_LONGTERM) && - !is_longterm_pinnable_page(page))) + if (unlikely(is_longterm && !is_longterm_pinnable_page(page))) return NULL; /* @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) if (!folio) return NULL; + /* + * Can this folio be safely pinned? We need to perform this + * check after the folio is stabilised. + */ + if ((flags & FOLL_WRITE) && is_longterm && + !folio_longterm_write_pin_allowed(folio)) { + folio_put_refs(folio, refs); + return NULL; + } + /* * When pinning a large folio, use an exact count to track it. *
Writing to file-backed dirty-tracked mappings via GUP is inherently broken as we cannot rule out folios being cleaned and then a GUP user writing to them again and possibly marking them dirty unexpectedly. This is especially egregious for long-term mappings (as indicated by the use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as we have already done in the slow path. We have access to less information in the fast path as we cannot examine the VMA containing the mapping, however we can determine whether the folio is anonymous and then whitelist known-good mappings - specifically hugetlb and shmem mappings. While we obtain a stable folio for this check, the mapping might not be, as a truncate could nullify it at any time. Since doing so requires mappings to be zapped, we can synchronise against a TLB shootdown operation. For some architectures TLB shootdown is synchronised by IPI, against which we are protected as the GUP-fast operation is performed with interrupts disabled. However, other architectures which specify CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation. In these instances, we acquire an RCU lock while performing our checks. If we cannot get a stable mapping, we fall back to the slow path, as otherwise we'd have to walk the page tables again and it's simpler and more effective to just fall back. It's important to note that there are no APIs allowing users to specify FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can always rely on the fact that if we fail to pin on the fast path, the code will fall back to the slow path which can perform the more thorough check. Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Kirill A . Shutemov <kirill@shutemov.name> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> --- mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 85 insertions(+), 2 deletions(-)