Message ID | 20250129115803.2084769-5-david@redhat.com (mailing list archive) |
---|---|
State | New |
Series | mm: cleanups for device-exclusive entries (hmm) |
On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
> Let's document how this function is to be used, and why the requirement
> for the folio lock might maybe be dropped in the future.

Sorry, only just catching up on your other thread. The folio lock was to ensure
the GPU got a chance to make forward progress by mapping the page. Without it
the CPU could immediately invalidate the entry before the GPU had a chance to
retry the fault.

Obviously performance wise having such thrashing is terrible, so should
really be avoided by userspace, but the lock at least allowed such programs
to complete.

> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 46956994aaff..caaae8df11a9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
>  }
>  #endif
>
> +/**
> + * restore_exclusive_pte - Restore a device-exclusive entry
> + * @vma: VMA covering @address
> + * @folio: the mapped folio
> + * @page: the mapped folio page
> + * @address: the virtual address
> + * @ptep: PTE pointer into the locked page table mapping the folio page
> + * @orig_pte: PTE value at @ptep
> + *
> + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
> + *
> + * The folio and the page table must be locked, and MMU notifiers must have
> + * been called to invalidate any (exclusive) device mappings. In case of
> + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
> + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
> + *
> + * Locking the folio makes sure that anybody who just converted the PTE to
> + * a device-private entry can map it into the device, before unlocking it; so
> + * the folio lock prevents concurrent conversion to device-exclusive.

I don't quite follow this - a concurrent conversion would already fail
because the GUP in make_device_exclusive_range() would most likely cause
an unexpected reference during the migration. And if a migration entry
has already been installed for the device private PTE conversion then
make_device_exclusive_range() will skip it as a non-present entry anyway.

However s/device-private/device-exclusive/ makes sense - the intent was to allow
the device to map it before a call to restore_exclusive_pte() (ie. a CPU fault)
could convert it back to a normal PTE.

> + * TODO: the folio lock does not protect against all cases of concurrent
> + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
> + * must already use MMU notifiers to sync against any concurrent changes

Right. It's expected drivers are using MMU notifiers to keep page tables in
sync, same as for hmm_range_fault().

> + * Maybe the requirement for the folio lock can be dropped in the future.
> + */
>  static void restore_exclusive_pte(struct vm_area_struct *vma,
>  		struct folio *folio, struct page *page, unsigned long address,
>  		pte_t *ptep, pte_t orig_pte)
> --
> 2.48.1
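The driver-side pattern Alistair is describing (convert the PTE to a device-exclusive entry, then map the page into the device while the folio is still locked, retrying if a notifier fired) looks roughly like the sketch below, modelled on nouveau's atomic fault handling. The my_gpu_* helpers and the my_gpu_fault_ctx structure are placeholders rather than real driver API, exact signatures may differ between kernel versions, and make_device_exclusive_range() is the interface as it existed at the time of this series:

#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/pagemap.h>
#include <linux/rmap.h>

/* Hypothetical per-device fault context; only these members matter here. */
struct my_gpu_fault_ctx {
	struct mmu_interval_notifier notifier;
	struct mutex lock;	/* also taken by the invalidate() callback */
	void *owner;		/* cookie passed to make_device_exclusive_range() */
};

/* Placeholder for the real device-side mapping routine. */
static int my_gpu_map_page(struct my_gpu_fault_ctx *ctx, unsigned long addr,
			   struct page *page);

static int my_gpu_atomic_fault(struct my_gpu_fault_ctx *ctx,
			       struct mm_struct *mm, unsigned long addr)
{
	unsigned long start = addr & PAGE_MASK;
	struct page *page = NULL;
	unsigned long seq;
	int ret;

	/* Snapshot the notifier sequence before touching the page tables. */
	seq = mmu_interval_read_begin(&ctx->notifier);

	mmap_read_lock(mm);
	/*
	 * Convert the CPU PTE to a device-exclusive entry; the page comes
	 * back referenced and locked (the folio lock discussed above).
	 */
	ret = make_device_exclusive_range(mm, start, start + PAGE_SIZE,
					  &page, ctx->owner);
	mmap_read_unlock(mm);
	if (ret <= 0 || !page)
		return -EBUSY;

	mutex_lock(&ctx->lock);
	if (mmu_interval_read_retry(&ctx->notifier, seq)) {
		/* An invalidation (e.g. a CPU fault) raced with us: retry. */
		ret = -EAGAIN;
	} else {
		/* Program the device page table for the atomic access. */
		ret = my_gpu_map_page(ctx, addr, page);
	}
	mutex_unlock(&ctx->lock);

	unlock_page(page);
	put_page(page);
	return ret;
}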
On 30.01.25 01:27, Alistair Popple wrote:
> On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
>> Let's document how this function is to be used, and why the requirement
>> for the folio lock might maybe be dropped in the future.
>
> Sorry, only just catching up on your other thread. The folio lock was to ensure
> the GPU got a chance to make forward progress by mapping the page. Without it
> the CPU could immediately invalidate the entry before the GPU had a chance to
> retry the fault.
>
> Obviously performance wise having such thrashing is terrible, so should
> really be avoided by userspace, but the lock at least allowed such programs
> to complete.

Thanks for the clarification. So it's relevant that the MMU notifier in
remove_device_exclusive_entry() is sent after taking the folio lock.

However, as soon as we drop the folio lock, remove_device_exclusive_entry()
will become active, lock the folio and trigger the MMU notifier.

So the time it is actually mapped into the device is rather

>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  mm/memory.c | 25 +++++++++++++++++++++++++
>>  1 file changed, 25 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 46956994aaff..caaae8df11a9 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
>>  }
>>  #endif
>>
>> +/**
>> + * restore_exclusive_pte - Restore a device-exclusive entry
>> + * @vma: VMA covering @address
>> + * @folio: the mapped folio
>> + * @page: the mapped folio page
>> + * @address: the virtual address
>> + * @ptep: PTE pointer into the locked page table mapping the folio page
>> + * @orig_pte: PTE value at @ptep
>> + *
>> + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
>> + *
>> + * The folio and the page table must be locked, and MMU notifiers must have
>> + * been called to invalidate any (exclusive) device mappings. In case of
>> + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
>> + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
>> + *
>> + * Locking the folio makes sure that anybody who just converted the PTE to
>> + * a device-private entry can map it into the device, before unlocking it; so
>> + * the folio lock prevents concurrent conversion to device-exclusive.
>
> I don't quite follow this - a concurrent conversion would already fail
> because the GUP in make_device_exclusive_range() would most likely cause
> an unexpected reference during the migration. And if a migration entry
> has already been installed for the device private PTE conversion then
> make_device_exclusive_range() will skip it as a non-present entry anyway.

Sorry, I meant "device-exclusive", so migration is not a concern.

>
> However s/device-private/device-exclusive/ makes sense - the intent was to allow
> the device to map it before a call to restore_exclusive_pte() (ie. a CPU fault)
> could convert it back to a normal PTE.
>
>> + * TODO: the folio lock does not protect against all cases of concurrent
>> + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
>> + * must already use MMU notifiers to sync against any concurrent changes
>
> Right. It's expected drivers are using MMU notifiers to keep page tables in
> sync, same as for hmm_range_fault().

Let me try to rephrase it given that the folio lock is purely to guarantee
forward-progress, not for correctness; that's what MMU notifiers must be used
for.
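For reference, the ordering David is pointing at on the CPU-fault side is roughly the following. This is a simplified paraphrase of remove_device_exclusive_entry() in mm/memory.c, not the verbatim kernel code, and exact signatures (e.g. of mmu_notifier_range_init_owner() and folio_lock_or_retry()) vary between kernel versions:

/* Simplified paraphrase of remove_device_exclusive_entry() (mm/memory.c). */
static vm_fault_t remove_device_exclusive_entry_sketch(struct vm_fault *vmf)
{
	struct folio *folio = page_folio(vmf->page);
	struct vm_area_struct *vma = vmf->vma;
	unsigned long start = vmf->address & PAGE_MASK;
	struct mmu_notifier_range range;
	vm_fault_t ret;

	/*
	 * 1) Take the folio lock first; a driver that just made the entry
	 *    device-exclusive and still holds the folio lock stalls us here.
	 */
	ret = folio_lock_or_retry(folio, vmf);
	if (ret)
		return ret;

	/* 2) Only then tell the driver its exclusive access is going away. */
	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
				      vma->vm_mm, start, start + PAGE_SIZE,
				      NULL);
	mmu_notifier_invalidate_range_start(&range);

	/* 3) Restore an ordinary present PTE under the PTE lock. */
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
				       &vmf->ptl);
	if (vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))
		restore_exclusive_pte(vma, folio, vmf->page, vmf->address,
				      vmf->pte, vmf->orig_pte);
	if (vmf->pte)
		pte_unmap_unlock(vmf->pte, vmf->ptl);

	/*
	 * 4) Drop the folio lock; a new make_device_exclusive_range() call
	 *    can succeed again from this point on.
	 */
	folio_unlock(folio);
	folio_put(folio);

	mmu_notifier_invalidate_range_end(&range);
	return 0;
}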
On Thu, Jan 30, 2025 at 11:27:37AM +1100, Alistair Popple wrote:
> On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
> > Let's document how this function is to be used, and why the requirement
> > for the folio lock might maybe be dropped in the future.
>
> Sorry, only just catching up on your other thread. The folio lock was to ensure
> the GPU got a chance to make forward progress by mapping the page. Without it
> the CPU could immediately invalidate the entry before the GPU had a chance to
> retry the fault.
>
> Obviously performance wise having such thrashing is terrible, so should
> really be avoided by userspace, but the lock at least allowed such programs
> to complete.

Imo this is not a legit use-case. If userspace concurrently (instead of
clearly alternating) uses the same 4k page for gpu atomics and on the cpu,
it just gets to keep the fallout. Plus there's no guarantee that we hold the
folio_lock long enough for the gpu to actually complete the atomic, so this
isn't even really helping with forward progress even if this somehow would be
a legit usecase.

But this is also why thp is potentially an issue, because if thp constantly
creates pmd entries that potentially causes false sharing and we do have
thrashing that shouldn't happen.
-Sima

> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> >  mm/memory.c | 25 +++++++++++++++++++++++++
> >  1 file changed, 25 insertions(+)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 46956994aaff..caaae8df11a9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
> >  }
> >  #endif
> >
> > +/**
> > + * restore_exclusive_pte - Restore a device-exclusive entry
> > + * @vma: VMA covering @address
> > + * @folio: the mapped folio
> > + * @page: the mapped folio page
> > + * @address: the virtual address
> > + * @ptep: PTE pointer into the locked page table mapping the folio page
> > + * @orig_pte: PTE value at @ptep
> > + *
> > + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
> > + *
> > + * The folio and the page table must be locked, and MMU notifiers must have
> > + * been called to invalidate any (exclusive) device mappings. In case of
> > + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
> > + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
> > + *
> > + * Locking the folio makes sure that anybody who just converted the PTE to
> > + * a device-private entry can map it into the device, before unlocking it; so
> > + * the folio lock prevents concurrent conversion to device-exclusive.
>
> I don't quite follow this - a concurrent conversion would already fail
> because the GUP in make_device_exclusive_range() would most likely cause
> an unexpected reference during the migration. And if a migration entry
> has already been installed for the device private PTE conversion then
> make_device_exclusive_range() will skip it as a non-present entry anyway.
>
> However s/device-private/device-exclusive/ makes sense - the intent was to allow
> the device to map it before a call to restore_exclusive_pte() (ie. a CPU fault)
> could convert it back to a normal PTE.
>
> > + * TODO: the folio lock does not protect against all cases of concurrent
> > + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
> > + * must already use MMU notifiers to sync against any concurrent changes
>
> Right. It's expected drivers are using MMU notifiers to keep page tables in
> sync, same as for hmm_range_fault().
>
> > + * Maybe the requirement for the folio lock can be dropped in the future.
> > + */
> >  static void restore_exclusive_pte(struct vm_area_struct *vma,
> >  		struct folio *folio, struct page *page, unsigned long address,
> >  		pte_t *ptep, pte_t orig_pte)
> > --
> > 2.48.1
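The "same as for hmm_range_fault()" comparison refers to the usual interval-notifier pattern: the driver's invalidate() callback marks its mapping stale and tears it down under the same lock the fault path uses for its retry check. A hedged sketch, reusing the placeholder my_gpu_fault_ctx (and includes) from the earlier sketch; my_gpu_unmap_range() is again a placeholder, and a real driver may split this across several notifiers the way nouveau does:

/* Placeholder for the real device-side unmap routine. */
static void my_gpu_unmap_range(struct my_gpu_fault_ctx *ctx,
			       unsigned long start, unsigned long end);

static bool my_gpu_notifier_invalidate(struct mmu_interval_notifier *mni,
				       const struct mmu_notifier_range *range,
				       unsigned long cur_seq)
{
	struct my_gpu_fault_ctx *ctx =
		container_of(mni, struct my_gpu_fault_ctx, notifier);

	/*
	 * The MMU_NOTIFY_EXCLUSIVE event carries the owner cookie passed to
	 * make_device_exclusive_range(); the driver that requested the
	 * exclusive entry can skip its own conversion so it does not tear
	 * down the mapping it is about to create.
	 */
	if (range->event == MMU_NOTIFY_EXCLUSIVE && range->owner == ctx->owner)
		return true;

	if (mmu_notifier_range_blockable(range))
		mutex_lock(&ctx->lock);
	else if (!mutex_trylock(&ctx->lock))
		return false;

	/* Bump the sequence so a concurrent fault retries its mapping... */
	mmu_interval_set_seq(mni, cur_seq);
	/*
	 * ...and invalidate the device mapping (MADV_DONTNEED, mprotect,
	 * fork(), CPU faults, etc. all end up here).
	 */
	my_gpu_unmap_range(ctx, range->start, range->end);
	mutex_unlock(&ctx->lock);

	return true;
}

static const struct mmu_interval_notifier_ops my_gpu_mni_ops = {
	.invalidate = my_gpu_notifier_invalidate,
};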
On Thu, Jan 30, 2025 at 10:37:06AM +0100, David Hildenbrand wrote:
> On 30.01.25 01:27, Alistair Popple wrote:
> > On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
> > > Let's document how this function is to be used, and why the requirement
> > > for the folio lock might maybe be dropped in the future.
> >
> > Sorry, only just catching up on your other thread. The folio lock was to ensure
> > the GPU got a chance to make forward progress by mapping the page. Without it
> > the CPU could immediately invalidate the entry before the GPU had a chance to
> > retry the fault.
> >
> > Obviously performance wise having such thrashing is terrible, so should
> > really be avoided by userspace, but the lock at least allowed such programs
> > to complete.
>
> Thanks for the clarification. So it's relevant that the MMU notifier in
> remove_device_exclusive_entry() is sent after taking the folio lock.
>
> However, as soon as we drop the folio lock, remove_device_exclusive_entry()
> will become active, lock the folio and trigger the MMU notifier.
>
> So the time it is actually mapped into the device is rather

Looks like you cut off a bit here (or mail transport did that somewhere),
but see my other reply I don't think this is a legit use-case. So we don't
have to worry. Well beyond documenting that if userspace concurrently thrashes
the same page with both device atomics and cpu access it will stall real
bad.
-Sima
On 30.01.25 14:31, Simona Vetter wrote:
> On Thu, Jan 30, 2025 at 10:37:06AM +0100, David Hildenbrand wrote:
>> On 30.01.25 01:27, Alistair Popple wrote:
>>> On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
>>>> Let's document how this function is to be used, and why the requirement
>>>> for the folio lock might maybe be dropped in the future.
>>>
>>> Sorry, only just catching up on your other thread. The folio lock was to ensure
>>> the GPU got a chance to make forward progress by mapping the page. Without it
>>> the CPU could immediately invalidate the entry before the GPU had a chance to
>>> retry the fault.
>>>
>>> Obviously performance wise having such thrashing is terrible, so should
>>> really be avoided by userspace, but the lock at least allowed such programs
>>> to complete.
>>
>> Thanks for the clarification. So it's relevant that the MMU notifier in
>> remove_device_exclusive_entry() is sent after taking the folio lock.
>>
>> However, as soon as we drop the folio lock, remove_device_exclusive_entry()
>> will become active, lock the folio and trigger the MMU notifier.
>>
>> So the time it is actually mapped into the device is rather

I meant to say "rather short." :)

>
> Looks like you cut off a bit here (or mail transport did that somewhere),
> but see my other reply I don't think this is a legit use-case. So we don't
> have to worry.

In that case, we would need the folio lock in the future.

> Well beyond documenting that if userspace concurrently thrashes
> the same page with both device atomics and cpu access it will stall real
> bad.

I'm curious, is locking between device-cpu or device-device something that
can happen frequently? In that case, you would get that thrashing naturally?
diff --git a/mm/memory.c b/mm/memory.c
index 46956994aaff..caaae8df11a9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
 }
 #endif
 
+/**
+ * restore_exclusive_pte - Restore a device-exclusive entry
+ * @vma: VMA covering @address
+ * @folio: the mapped folio
+ * @page: the mapped folio page
+ * @address: the virtual address
+ * @ptep: PTE pointer into the locked page table mapping the folio page
+ * @orig_pte: PTE value at @ptep
+ *
+ * Restore a device-exclusive non-swap entry to an ordinary present PTE.
+ *
+ * The folio and the page table must be locked, and MMU notifiers must have
+ * been called to invalidate any (exclusive) device mappings. In case of
+ * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
+ * fault MMU_NOTIFY_EXCLUSIVE is triggered.
+ *
+ * Locking the folio makes sure that anybody who just converted the PTE to
+ * a device-private entry can map it into the device, before unlocking it; so
+ * the folio lock prevents concurrent conversion to device-exclusive.
+ *
+ * TODO: the folio lock does not protect against all cases of concurrent
+ * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
+ * must already use MMU notifiers to sync against any concurrent changes
+ * Maybe the requirement for the folio lock can be dropped in the future.
+ */
 static void restore_exclusive_pte(struct vm_area_struct *vma,
 		struct folio *folio, struct page *page, unsigned long address,
 		pte_t *ptep, pte_t orig_pte)
Let's document how this function is to be used, and why the requirement
for the folio lock might maybe be dropped in the future.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)