Message ID | 20191015181242.8343-1-jgg@ziepe.ca (mailing list archive)
---|---
Series | Consolidate the mmu notifier interval_tree and locking
Am 15.10.19 um 20:12 schrieb Jason Gunthorpe:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> scif_dma, vhost, gntdev, hmm) are using a common pattern where
> they only use invalidate_range_start/end and immediately check the
> invalidating range against some driver data structure to tell if the
> driver is interested. Half of them use an interval_tree, the others are
> simple linear search lists.
>
> Of the ones I checked they largely seem to have various kinds of races,
> bugs and poor implementation. This is a result of the complexity in how
> the notifier interacts with get_user_pages(). It is extremely difficult to
> use it correctly.
>
> Consolidate all of this code together into the core mmu_notifier and
> provide a locking scheme similar to hmm_mirror that allows the user to
> safely use get_user_pages() and reliably know if the page list still
> matches the mm.

That sounds really good, but could you outline for a moment how that is
achieved?

Please keep in mind that the page reference get_user_pages() grabs is
*NOT* sufficient to guarantee coherency here.

Regards,
Christian.

> This new arrangement plays nicely with the !blockable mode for
> OOM. Scanning the interval tree is done such that the intersection test
> will always succeed, and since there is no invalidate_range_end exposed to
> drivers the scheme safely allows multiple drivers to be subscribed.
>
> Four places are converted as an example of how the new API is used.
> Four are left for future patches:
>  - i915_gem has complex locking around destruction of a registration,
>    needs more study
>  - hfi1 (2nd user) needs access to the rbtree
>  - scif_dma has a complicated logic flow
>  - vhost's mmu notifiers are already being rewritten
>
> This is still being tested, but I figured to send it to start getting help
> from the xen, amd and hfi drivers which I cannot test here.
>
> It would be intended for the hmm tree.
> > Jason Gunthorpe (15): > mm/mmu_notifier: define the header pre-processor parts even if > disabled > mm/mmu_notifier: add an interval tree notifier > mm/hmm: allow hmm_range to be used with a mmu_range_notifier or > hmm_mirror > mm/hmm: define the pre-processor related parts of hmm.h even if > disabled > RDMA/odp: Use mmu_range_notifier_insert() > RDMA/hfi1: Use mmu_range_notifier_inset for user_exp_rcv > drm/radeon: use mmu_range_notifier_insert > xen/gntdev: Use select for DMA_SHARED_BUFFER > xen/gntdev: use mmu_range_notifier_insert > nouveau: use mmu_notifier directly for invalidate_range_start > nouveau: use mmu_range_notifier instead of hmm_mirror > drm/amdgpu: Call find_vma under mmap_sem > drm/amdgpu: Use mmu_range_insert instead of hmm_mirror > drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror > mm/hmm: remove hmm_mirror and related > > Documentation/vm/hmm.rst | 105 +--- > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + > .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 14 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 + > drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 445 ++------------ > drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h | 53 -- > drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 13 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 111 ++-- > drivers/gpu/drm/nouveau/nouveau_svm.c | 229 +++++--- > drivers/gpu/drm/radeon/radeon.h | 9 +- > drivers/gpu/drm/radeon/radeon_mn.c | 218 ++----- > drivers/infiniband/core/device.c | 1 - > drivers/infiniband/core/umem_odp.c | 288 +--------- > drivers/infiniband/hw/hfi1/file_ops.c | 2 +- > drivers/infiniband/hw/hfi1/hfi.h | 2 +- > drivers/infiniband/hw/hfi1/user_exp_rcv.c | 144 ++--- > drivers/infiniband/hw/hfi1/user_exp_rcv.h | 3 +- > drivers/infiniband/hw/mlx5/mlx5_ib.h | 7 +- > drivers/infiniband/hw/mlx5/mr.c | 3 +- > drivers/infiniband/hw/mlx5/odp.c | 48 +- > drivers/xen/Kconfig | 3 +- > drivers/xen/gntdev-common.h | 8 +- > drivers/xen/gntdev.c | 179 ++---- > include/linux/hmm.h | 195 +------ > include/linux/mmu_notifier.h | 124 +++- > include/rdma/ib_umem_odp.h | 65 +-- > include/rdma/ib_verbs.h | 2 - > kernel/fork.c | 1 - > mm/Kconfig | 2 +- > mm/hmm.c | 275 +-------- > mm/mmu_notifier.c | 542 +++++++++++++++++- > 32 files changed, 1180 insertions(+), 1923 deletions(-) >
On Wed, Oct 16, 2019 at 10:58:02AM +0200, Christian König wrote:
> Am 15.10.19 um 20:12 schrieb Jason Gunthorpe:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> >
> > 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> > scif_dma, vhost, gntdev, hmm) are using a common pattern where
> > they only use invalidate_range_start/end and immediately check the
> > invalidating range against some driver data structure to tell if the
> > driver is interested. Half of them use an interval_tree, the others are
> > simple linear search lists.
> >
> > Of the ones I checked they largely seem to have various kinds of races,
> > bugs and poor implementation. This is a result of the complexity in how
> > the notifier interacts with get_user_pages(). It is extremely difficult to
> > use it correctly.
> >
> > Consolidate all of this code together into the core mmu_notifier and
> > provide a locking scheme similar to hmm_mirror that allows the user to
> > safely use get_user_pages() and reliably know if the page list still
> > matches the mm.
>
> That sounds really good, but could you outline for a moment how that is
> achieved?

It uses the same basic scheme as hmm and rdma odp, outlined in the
revisions to hmm.rst later on.

Basically,

    seq = mmu_range_read_begin(&mrn);

    // This is a speculative region
    .. get_user_pages()/hmm_range_fault() ..
    // Result cannot be dereferenced

    take_lock(driver->update);
    if (mmu_range_read_retry(&mrn, range.notifier_seq)) {
        // collision! The results are not correct
        goto again;
    }

    // no collision, and now under lock. Now we can de-reference the pages/etc
    // program HW
    // Now the invalidate callback is responsible to synchronize against changes
    unlock(driver->update);

Basically, anything that was using hmm_mirror correctly transitions
over fairly trivially, just with the modification to store a sequence
number to close that race described in the hmm commit.

For something like AMD gpu I expect it to transition to use dma_fence
from the notifier for coherency right before it unlocks driver->update.

Jason
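To make the shape of that loop concrete for readers skimming the thread, here
is a rough, self-contained user-space analogue of the collision-retry pattern
Jason sketches above. The names (fault_pages, program_hw, driver_lock, etc.)
are illustrative stand-ins rather than the series' actual API, and an atomic
counter plus a mutex stand in for the notifier sequence number and the driver
lock.

    #include <pthread.h>
    #include <stdatomic.h>

    /* The invalidate side (not shown) bumps invalidate_seq and takes
     * driver_lock before it changes any ptes. */
    static _Atomic unsigned long invalidate_seq;
    static pthread_mutex_t driver_lock = PTHREAD_MUTEX_INITIALIZER;

    static void fault_pages(void) { /* stand-in for get_user_pages()/hmm_range_fault() */ }
    static void program_hw(void)  { /* stand-in for programming the hardware */ }

    static void map_range(void)
    {
        unsigned long seq;
    again:
        seq = atomic_load(&invalidate_seq);        /* ~ mmu_range_read_begin() */

        fault_pages();                             /* speculative: result may already be stale */

        pthread_mutex_lock(&driver_lock);
        if (atomic_load(&invalidate_seq) != seq) { /* ~ mmu_range_read_retry() */
            pthread_mutex_unlock(&driver_lock);
            goto again;                            /* collision: discard and restart */
        }
        program_hw();  /* safe: an invalidation cannot progress while we hold the lock */
        pthread_mutex_unlock(&driver_lock);
    }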
Am 16.10.19 um 18:04 schrieb Jason Gunthorpe:
> On Wed, Oct 16, 2019 at 10:58:02AM +0200, Christian König wrote:
>> Am 15.10.19 um 20:12 schrieb Jason Gunthorpe:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>> [SNIP]
>>> Consolidate all of this code together into the core mmu_notifier and
>>> provide a locking scheme similar to hmm_mirror that allows the user to
>>> safely use get_user_pages() and reliably know if the page list still
>>> matches the mm.
>> That sounds really good, but could you outline for a moment how that is
>> achieved?
> It uses the same basic scheme as hmm and rdma odp, outlined in the
> revisions to hmm.rst later on.
>
> Basically,
>
>     seq = mmu_range_read_begin(&mrn);
>
>     // This is a speculative region
>     .. get_user_pages()/hmm_range_fault() ..

How do we enforce that this get_user_pages()/hmm_range_fault() doesn't
see outdated page table information?

In other words, how is the following race prevented:

    CPU A                                  CPU B
    invalidate_range_start()
                                           mmu_range_read_begin()
                                           get_user_pages()/hmm_range_fault()
    Updating the ptes
    invalidate_range_end()

I mean get_user_pages() tries to circumvent this issue by grabbing a
reference to the pages in question, but that isn't sufficient for the
SVM use case.

That's the reason why we had this horrible solution with a r/w lock and
a linked list of BOs in an interval tree.

Regards,
Christian.

>     // Result cannot be dereferenced
>
>     take_lock(driver->update);
>     if (mmu_range_read_retry(&mrn, range.notifier_seq)) {
>         // collision! The results are not correct
>         goto again;
>     }
>
>     // no collision, and now under lock. Now we can de-reference the pages/etc
>     // program HW
>     // Now the invalidate callback is responsible to synchronize against changes
>     unlock(driver->update);
>
> Basically, anything that was using hmm_mirror correctly transitions
> over fairly trivially, just with the modification to store a sequence
> number to close that race described in the hmm commit.
>
> For something like AMD gpu I expect it to transition to use dma_fence
> from the notifier for coherency right before it unlocks driver->update.
>
> Jason
On 2019-10-17 4:54 a.m., Christian König wrote: > Am 16.10.19 um 18:04 schrieb Jason Gunthorpe: >> On Wed, Oct 16, 2019 at 10:58:02AM +0200, Christian König wrote: >>> Am 15.10.19 um 20:12 schrieb Jason Gunthorpe: >>>> From: Jason Gunthorpe <jgg@mellanox.com> >>>> >>>> 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, >>>> hfi1, >>>> scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where >>>> they only use invalidate_range_start/end and immediately check the >>>> invalidating range against some driver data structure to tell if the >>>> driver is interested. Half of them use an interval_tree, the others are >>>> simple linear search lists. >>>> >>>> Of the ones I checked they largely seem to have various kinds of races, >>>> bugs and poor implementation. This is a result of the complexity in how >>>> the notifier interacts with get_user_pages(). It is extremely >>>> difficult to >>>> use it correctly. >>>> >>>> Consolidate all of this code together into the core mmu_notifier and >>>> provide a locking scheme similar to hmm_mirror that allows the user to >>>> safely use get_user_pages() and reliably know if the page list still >>>> matches the mm. >>> That sounds really good, but could you outline for a moment how that is >>> archived? >> It uses the same basic scheme as hmm and rdma odp, outlined in the >> revisions to hmm.rst later on. >> >> Basically, >> >> seq = mmu_range_read_begin(&mrn); >> >> // This is a speculative region >> .. get_user_pages()/hmm_range_fault() .. > > How do we enforce that this get_user_pages()/hmm_range_fault() doesn't > see outdated page table information? > > In other words how the the following race prevented: > > CPU A CPU B > invalidate_range_start() > mmu_range_read_begin() > get_user_pages()/hmm_range_fault() > Updating the ptes > invalidate_range_end() > > > I mean get_user_pages() tries to circumvent this issue by grabbing a > reference to the pages in question, but that isn't sufficient for the > SVM use case. > > That's the reason why we had this horrible solution with a r/w lock and > a linked list of BOs in an interval tree. > > Regards, > Christian. get_user_pages/hmm_range_fault() and invalidate_range_start() both are called while holding mm->map_sem, so they are always serialized. Philip > >> // Result cannot be derferenced >> >> take_lock(driver->update); >> if (mmu_range_read_retry(&mrn, range.notifier_seq) { >> // collision! The results are not correct >> goto again >> } >> >> // no collision, and now under lock. Now we can de-reference the >> pages/etc >> // program HW >> // Now the invalidate callback is responsible to synchronize against >> changes >> unlock(driver->update) >> >> Basically, anything that was using hmm_mirror correctly transisions >> over fairly trivially, just with the modification to store a sequence >> number to close that race described in the hmm commit. >> >> For something like AMD gpu I expect it to transition to use dma_fence >> from the notifier for coherency right before it unlocks driver->update. >> >> Jason >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > > _______________________________________________ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Sending once more as text. Am 17.10.19 um 18:26 schrieb Yang, Philip: > On 2019-10-17 4:54 a.m., Christian König wrote: >> Am 16.10.19 um 18:04 schrieb Jason Gunthorpe: >>> On Wed, Oct 16, 2019 at 10:58:02AM +0200, Christian König wrote: >>>> Am 15.10.19 um 20:12 schrieb Jason Gunthorpe: >>>>> From: Jason Gunthorpe <jgg@mellanox.com> >>>>> >>>>> 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, >>>>> hfi1, >>>>> scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where >>>>> they only use invalidate_range_start/end and immediately check the >>>>> invalidating range against some driver data structure to tell if the >>>>> driver is interested. Half of them use an interval_tree, the others are >>>>> simple linear search lists. >>>>> >>>>> Of the ones I checked they largely seem to have various kinds of races, >>>>> bugs and poor implementation. This is a result of the complexity in how >>>>> the notifier interacts with get_user_pages(). It is extremely >>>>> difficult to >>>>> use it correctly. >>>>> >>>>> Consolidate all of this code together into the core mmu_notifier and >>>>> provide a locking scheme similar to hmm_mirror that allows the user to >>>>> safely use get_user_pages() and reliably know if the page list still >>>>> matches the mm. >>>> That sounds really good, but could you outline for a moment how that is >>>> archived? >>> It uses the same basic scheme as hmm and rdma odp, outlined in the >>> revisions to hmm.rst later on. >>> >>> Basically, >>> >>> seq = mmu_range_read_begin(&mrn); >>> >>> // This is a speculative region >>> .. get_user_pages()/hmm_range_fault() .. >> How do we enforce that this get_user_pages()/hmm_range_fault() doesn't >> see outdated page table information? >> >> In other words how the the following race prevented: >> >> CPU A CPU B >> invalidate_range_start() >> mmu_range_read_begin() >> get_user_pages()/hmm_range_fault() >> Updating the ptes >> invalidate_range_end() >> >> >> I mean get_user_pages() tries to circumvent this issue by grabbing a >> reference to the pages in question, but that isn't sufficient for the >> SVM use case. >> >> That's the reason why we had this horrible solution with a r/w lock and >> a linked list of BOs in an interval tree. >> >> Regards, >> Christian. > get_user_pages/hmm_range_fault() and invalidate_range_start() both are > called while holding mm->map_sem, so they are always serialized. Not even remotely. For calling get_user_pages()/hmm_range_fault() you only need to hold the mmap_sem in read mode. And IIRC invalidate_range_start() is sometimes called without holding the mmap_sem at all. So again how are they serialized? Regards, Christian. > > Philip >>> // Result cannot be derferenced >>> >>> take_lock(driver->update); >>> if (mmu_range_read_retry(&mrn, range.notifier_seq) { >>> // collision! The results are not correct >>> goto again >>> } >>> >>> // no collision, and now under lock. Now we can de-reference the >>> pages/etc >>> // program HW >>> // Now the invalidate callback is responsible to synchronize against >>> changes >>> unlock(driver->update) >>> >>> Basically, anything that was using hmm_mirror correctly transisions >>> over fairly trivially, just with the modification to store a sequence >>> number to close that race described in the hmm commit. >>> >>> For something like AMD gpu I expect it to transition to use dma_fence >>> from the notifier for coherency right before it unlocks driver->update. 
>>> >>> Jason >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx@lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
On Thu, Oct 17, 2019 at 04:47:20PM +0000, Koenig, Christian wrote:
> > get_user_pages/hmm_range_fault() and invalidate_range_start() both are
> > called while holding mm->map_sem, so they are always serialized.
>
> Not even remotely.
>
> For calling get_user_pages()/hmm_range_fault() you only need to hold the
> mmap_sem in read mode.

Right

> And IIRC invalidate_range_start() is sometimes called without holding
> the mmap_sem at all.

Yep

> So again how are they serialized?

The 'driver lock' thing does it, read the hmm documentation, the hmm
approach is basically the only approach that was correct of all the
drivers..

So long as the 'driver lock' is held the range cannot become
invalidated as the 'driver lock' prevents progress of invalidation.

Holding the driver lock and using the seq based mmu_range_read_retry()
tells if the previous unlocked get_user_pages() is still valid or needs
to be discarded.

So it doesn't matter if get_user_pages() races or not, the result is not
to be used until the driver lock is held and mmu_range_read_retry()
called, which provides the coherence.

It is the usual seqlock pattern.

Jason
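For comparison, the "usual seqlock pattern" Jason refers to is the kernel's
ordinary seqcount read-retry loop from <linux/seqlock.h>, shown here only to
highlight the analogy. The twist in this series is that the retry check is
deferred until the driver lock is held, and the "write side" spans the whole
invalidate_range_start/end window rather than a short critical section.

    seqcount_t sc;      /* seqcount_init(&sc) somewhere during setup */

    /* writer (writers serialize against each other with their own lock) */
    write_seqcount_begin(&sc);
    /* ... modify the protected data ... */
    write_seqcount_end(&sc);

    /* reader: loop until an unmodified snapshot was observed */
    unsigned int seq;
    do {
        seq = read_seqcount_begin(&sc);
        /* ... read the protected data ... */
    } while (read_seqcount_retry(&sc, seq));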
Am 18.10.19 um 22:36 schrieb Jason Gunthorpe:
> On Thu, Oct 17, 2019 at 04:47:20PM +0000, Koenig, Christian wrote:
>
>>> get_user_pages/hmm_range_fault() and invalidate_range_start() both are
>>> called while holding mm->map_sem, so they are always serialized.
>> Not even remotely.
>>
>> For calling get_user_pages()/hmm_range_fault() you only need to hold the
>> mmap_sem in read mode.
> Right
>
>> And IIRC invalidate_range_start() is sometimes called without holding
>> the mmap_sem at all.
> Yep
>
>> So again how are they serialized?
> The 'driver lock' thing does it, read the hmm documentation, the hmm
> approach is basically the only approach that was correct of all the
> drivers..

Well that's what I did, but what HMM does still doesn't look correct
to me.

> So long as the 'driver lock' is held the range cannot become
> invalidated as the 'driver lock' prevents progress of invalidation.

Correct, but the problem is it doesn't wait for ongoing operations to
complete.

See I'm talking about the following case:

    Thread A                               Thread B
    invalidate_range_start()
                                           mmu_range_read_begin()
                                           get_user_pages()/hmm_range_fault()
                                           grab_driver_lock()
    Updating the ptes
    invalidate_range_end()

As far as I can see, in invalidate_range_start() the driver lock is taken
to make sure that we can't start any invalidation while the driver is
using the pages for a command submission.

But the pages we got from get_user_pages()/hmm_range_fault() might not be
up to date because the driver lock is also dropped again in
invalidate_range_start() and not in invalidate_range_end().

> Holding the driver lock and using the seq based mmu_range_read_retry()
> tells if the previous unlocked get_user_pages() is still valid or needs
> to be discarded.
>
> So it doesn't matter if get_user_pages() races or not, the result is not
> to be used until the driver lock is held and mmu_range_read_retry()
> called, which provides the coherence.
>
> It is the usual seqlock pattern.

Well we don't update the seqlock after the update to the protected data
structure (the page table) happened, but rather before that. That doesn't
look like the normal pattern for a seqlock to me and, as far as I can see,
that is quite a bug in the HMM design/logic.

Regards,
Christian.

> Jason
On Sun, Oct 20, 2019 at 02:21:42PM +0000, Koenig, Christian wrote:
> Am 18.10.19 um 22:36 schrieb Jason Gunthorpe:
> > On Thu, Oct 17, 2019 at 04:47:20PM +0000, Koenig, Christian wrote:
> > [SNIP]
> >> So again how are they serialized?
> > The 'driver lock' thing does it, read the hmm documentation, the hmm
> > approach is basically the only approach that was correct of all the
> > drivers..
>
> Well that's what I did, but what HMM does still doesn't look correct
> to me.

It has a bug, but the basic flow seems to work.

https://patchwork.kernel.org/patch/11191

> > So long as the 'driver lock' is held the range cannot become
> > invalidated as the 'driver lock' prevents progress of invalidation.
>
> Correct, but the problem is it doesn't wait for ongoing operations to
> complete.
>
> See I'm talking about the following case:
>
>     Thread A                               Thread B
>     invalidate_range_start()
>                                            mmu_range_read_begin()
>                                            get_user_pages()/hmm_range_fault()
>                                            grab_driver_lock()
>     Updating the ptes
>     invalidate_range_end()
>
> As far as I can see, in invalidate_range_start() the driver lock is taken
> to make sure that we can't start any invalidation while the driver is
> using the pages for a command submission.

Again, this uses the seqlock like scheme *and* the driver lock.

In this case after grab_driver_lock() mmu_range_read_retry() will
return false if Thread A has progressed to 'Updating the ptes'.

For instance here is how the concurrency resolves for retry:

    CPU1                                   CPU2
    seq = mmu_range_read_begin()
                                           invalidate_range_start()
                                             invalidate_seq++
    get_user_pages()/hmm_range_fault()
                                             grab_driver_lock()
                                             ungrab_driver_lock()
    grab_driver_lock()
    seq != invalidate_seq, so retry
                                           Updating the ptes
                                           invalidate_range_end()
                                             invalidate_seq++

And here is how it resolves for blocking:

    CPU1                                   CPU2
    seq = mmu_range_read_begin()
                                           invalidate_range_start()
    get_user_pages()/hmm_range_fault()
    grab_driver_lock()
    seq == invalidate_seq, so continue
                                             invalidate_seq++
    ungrab_driver_lock()
                                             grab_driver_lock()
                                             // Cannot progress to 'Updating the
                                             // ptes' while the driver_lock is held
                                             ungrab_driver_lock()
                                           Updating the ptes
                                           invalidate_range_end()
                                             invalidate_seq++

For the above I've simplified the mechanics of the invalidate_seq, you
need to look through the patch to see how it actually works.

> Well we don't update the seqlock after the update to the protected data
> structure (the page table) happened, but rather before that.

??? This is what mn_itree_inv_end() does, it is called by
invalidate_range_end

> That doesn't look like the normal pattern for a seqlock to me and, as far
> as I can see, that is quite a bug in the HMM design/logic.

Well, hmm has a bug because it doesn't use a seqlock pattern, see the
above URL.

One of the motivations for this work is to squash that bug by adding a
seqlock like pattern. But the basic hmm flow and collision-retry
approach seems sound.

Do you see a problem with this patch?

Jason
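A compressed user-space model of the sequence-number lifecycle behind those
two diagrams may help. A mutex and condition variable stand in for the
kernel's spinlock and wait queue, the per-range interval-tree lookup is
collapsed into one global counter, and all names are illustrative; the
begin() side additionally waits out a running invalidation, which is the
wait_event() Christian quotes in the next message.

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* ~ mmn_mm->lock */
    static pthread_cond_t  wq   = PTHREAD_COND_INITIALIZER;    /* ~ the wait queue */
    static unsigned long invalidate_seq = 2;   /* even: no invalidation running */
    static unsigned int  active_invalidate_ranges;

    static void invalidate_start(void)
    {
        pthread_mutex_lock(&lock);
        active_invalidate_ranges++;
        invalidate_seq |= 1;               /* odd: 'currently invalidating' */
        pthread_mutex_unlock(&lock);
        /* ... caller then cycles the driver lock and updates the ptes ... */
    }

    static void invalidate_end(void)
    {
        pthread_mutex_lock(&lock);
        if (--active_invalidate_ranges == 0) {
            invalidate_seq++;              /* back to even: nothing in flight */
            pthread_cond_broadcast(&wq);
        }
        pthread_mutex_unlock(&lock);
    }

    /* Never hands out a seq that is currently under invalidation. */
    static unsigned long range_read_begin(void)
    {
        unsigned long seq;

        pthread_mutex_lock(&lock);
        while (invalidate_seq & 1)         /* wait out any running invalidation */
            pthread_cond_wait(&wq, &lock);
        seq = invalidate_seq;
        pthread_mutex_unlock(&lock);
        return seq;
    }

    /* Call with the driver lock held: did any invalidation happen since begin()? */
    static bool range_read_retry(unsigned long seq)
    {
        pthread_mutex_lock(&lock);
        bool retry = (invalidate_seq != seq);
        pthread_mutex_unlock(&lock);
        return retry;
    }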
Am 21.10.19 um 15:57 schrieb Jason Gunthorpe: > On Sun, Oct 20, 2019 at 02:21:42PM +0000, Koenig, Christian wrote: >> Am 18.10.19 um 22:36 schrieb Jason Gunthorpe: >>> On Thu, Oct 17, 2019 at 04:47:20PM +0000, Koenig, Christian wrote: >>> [SNIP] >>> >>>> So again how are they serialized? >>> The 'driver lock' thing does it, read the hmm documentation, the hmm >>> approach is basically the only approach that was correct of all the >>> drivers.. >> Well that's what I've did, but what HMM does still doesn't looks correct >> to me. > It has a bug, but the basic flow seems to work. > > https://patchwork.kernel.org/patch/11191 Maybe wrong link? That link looks like an unrelated discussion on kernel image relocation. >>> So long as the 'driver lock' is held the range cannot become >>> invalidated as the 'driver lock' prevents progress of invalidation. >> Correct, but the problem is it doesn't wait for ongoing operations to >> complete. >> >> See I'm talking about the following case: >> >> Thread A Thread B >> invalidate_range_start() >> mmu_range_read_begin() >> get_user_pages()/hmm_range_fault() >> grab_driver_lock() >> Updating the ptes >> invalidate_range_end() >> >> As far as I can see in invalidate_range_start() the driver lock is taken >> to make sure that we can't start any invalidation while the driver is >> using the pages for a command submission. > Again, this uses the seqlock like scheme *and* the driver lock. > > In this case after grab_driver_lock() mmu_range_read_retry() will > return false if Thread A has progressed to 'updating the ptes. > > For instance here is how the concurrency resolves for retry: > > CPU1 CPU2 > seq = mmu_range_read_begin() > invalidate_range_start() > invalidate_seq++ How that was order was what confusing me. But I've read up on the code in mmu_range_read_begin() and found the lines I was looking for: + if (is_invalidating) + wait_event(mmn_mm->wq, + READ_ONCE(mmn_mm->invalidate_seq) != seq); [SNIP] > For the above I've simplified the mechanics of the invalidate_seq, you > need to look through the patch to see how it actually works. Yea, that you also allow multiple write sides is pretty neat. >> Well we don't update the seqlock after the update to the protected data >> structure (the page table) happened, but rather before that. > ??? This is what mn_itree_inv_end() does, it is called by > invalidate_range_end > >> That doesn't looks like the normal patter for a seqlock to me and as far >> as I can see that is quite a bug in the HMM design/logic. > Well, hmm has a bug because it doesn't use a seqlock pattern, see the > above URL. > > One of the motivations for this work is to squash that bug by adding a > seqlock like pattern. But the basic hmm flow and collision-retry > approach seems sound. > > Do you see a problem with this patch? No, not any more. Essentially you are doing the same thing I've tried to do with the original amdgpu implementation. The difference is that you don't try to use a per range sequence (which is a good idea, we never got that fully working) and you allow multiple writers at the same time. Feel free to stitch an Acked-by: Christian König <christian.koenig@amd.com> on patch #2, but you still doing a bunch of things in there which are way beyond my understanding (e.g. where are all the SMP barriers?). Cheers, Christian. > > Jason
On Mon, Oct 21, 2019 at 02:28:46PM +0000, Koenig, Christian wrote: > Am 21.10.19 um 15:57 schrieb Jason Gunthorpe: > > On Sun, Oct 20, 2019 at 02:21:42PM +0000, Koenig, Christian wrote: > >> Am 18.10.19 um 22:36 schrieb Jason Gunthorpe: > >>> On Thu, Oct 17, 2019 at 04:47:20PM +0000, Koenig, Christian wrote: > >>> [SNIP] > >>> > >>>> So again how are they serialized? > >>> The 'driver lock' thing does it, read the hmm documentation, the hmm > >>> approach is basically the only approach that was correct of all the > >>> drivers.. > >> Well that's what I've did, but what HMM does still doesn't looks correct > >> to me. > > It has a bug, but the basic flow seems to work. > > > > https://patchwork.kernel.org/patch/11191 > > Maybe wrong link? That link looks like an unrelated discussion on kernel > image relocation. Sorry, it got corrupted: https://patchwork.kernel.org/patch/11191397/ > >>> So long as the 'driver lock' is held the range cannot become > >>> invalidated as the 'driver lock' prevents progress of invalidation. > >> Correct, but the problem is it doesn't wait for ongoing operations to > >> complete. > >> > >> See I'm talking about the following case: > >> > >> Thread A Thread B > >> invalidate_range_start() > >> mmu_range_read_begin() > >> get_user_pages()/hmm_range_fault() > >> grab_driver_lock() > >> Updating the ptes > >> invalidate_range_end() > >> > >> As far as I can see in invalidate_range_start() the driver lock is taken > >> to make sure that we can't start any invalidation while the driver is > >> using the pages for a command submission. > > Again, this uses the seqlock like scheme *and* the driver lock. > > > > In this case after grab_driver_lock() mmu_range_read_retry() will > > return false if Thread A has progressed to 'updating the ptes. > > > > For instance here is how the concurrency resolves for retry: > > > > CPU1 CPU2 > > seq = mmu_range_read_begin() > > invalidate_range_start() > > invalidate_seq++ > > How that was order was what confusing me. But I've read up on the code > in mmu_range_read_begin() and found the lines I was looking for: > > + if (is_invalidating) > + wait_event(mmn_mm->wq, > + READ_ONCE(mmn_mm->invalidate_seq) != seq); > > [SNIP] Right, the basic design is that the 'seq' returned by mmu_range_read_begin() is never currently under invalidation. Thus if the starting seq is not invalidating, then if it doesn't change we are guaranteed the ptes haven't changed either. > > For the above I've simplified the mechanics of the invalidate_seq, you > > need to look through the patch to see how it actually works. > > Yea, that you also allow multiple write sides is pretty neat. Complicated, but necessary to make the non-blocking OOM stuff able to read the interval tree under all conditions :\ > > One of the motivations for this work is to squash that bug by adding a > > seqlock like pattern. But the basic hmm flow and collision-retry > > approach seems sound. > > > > Do you see a problem with this patch? > > No, not any more. Okay, great, thanks > Essentially you are doing the same thing I've tried to do with the > original amdgpu implementation. The difference is that you don't try to > use a per range sequence (which is a good idea, we never got that fully > working) and you allow multiple writers at the same time. Yes, ODP had the per-range sequence and it never worked right either. Keeping track of things during the invalidate_end was too complex > Feel free to stitch an Acked-by: Christian König > <christian.koenig@amd.com> on patch #2, Thanks! 
Can you also take some review and test for the AMD related
patches? These were quite hard to make, it is very likely I've made an
error..

> but you still doing a bunch of things in there which are way beyond
> my understanding (e.g. where are all the SMP barriers?).

The only non-locked data is 'struct mmu_range_notifier->invalidate_seq'

On the write side it uses a WRITE_ONCE. The 'seq' it writes is
generated under the mmn_mm->lock spinlock (held before and after the
WRITE_ONCE) such that all concurrent WRITE_ONCE's are storing the same
value.

Essentially the spinlock is providing the barrier to order write:

   invalidate_range_start():
     spin_lock(&mmn_mm->lock);
     mmn_mm->active_invalidate_ranges++;
     mmn_mm->invalidate_seq |= 1;
     *seq = mmn_mm->invalidate_seq;
     spin_unlock(&mmn_mm->lock);

     WRITE_ONCE(mrn->invalidate_seq, cur_seq);

   invalidate_range_end():
     spin_lock(&mmn_mm->lock);
     if (!--mmn_mm->active_invalidate_ranges)
         mmn_mm->invalidate_seq++;
     spin_unlock(&mmn_mm->lock);

ie when we do invalidate_seq++, due to the active_invalidate_ranges
counter and the spinlock, we know all other WRITE_ONCE's have passed
their spin_unlock and no concurrent ones are starting. The spinlock is
providing the barrier here.

On the read side.. It is a similar argument, except with the
driver_lock:

   mmu_range_read_begin()
     seq = READ_ONCE(mrn->invalidate_seq);

Here 'seq' may be the current value, or it may be an older
value. Ordering is eventually provided by the driver_lock:

   mn_tree_invalidate_start()
     [..]
     WRITE_ONCE(mrn->invalidate_seq, cur_seq);
     driver_lock()
     driver_unlock()

vs

   driver_lock()
     mmu_range_read_begin()
       return seq != READ_ONCE(mrn->invalidate_seq);
   driver_unlock()

Here if mn_tree_invalidate_start() has passed driver_unlock() then
because we do driver_lock() before mmu_range_read_begin() we must
observe the WRITE_ONCE. ie the driver_unlock()/driver_lock() provide
the paired barrier.

If mn_tree_invalidate_start() has not passed driver_lock() then the
mmu_range_read_begin() side prevents it from passing driver_lock()
while it holds the lock. In this case it is OK if we don't observe the
WRITE_ONCE() that was done just before, as the invalidate_start()
thread can't proceed to invalidation.

It is very unusual locking, it would be great if others could help
look at it!

The unusual bit in all of this is using a lock's critical region to
'protect' data for read, but updating that same data before the lock's
critical section. ie relying on the unlock barrier to 'release' program
ordered stores done before the lock's own critical region, and the
lock side barrier to 'acquire' those stores.

This approach is borrowed from the hmm mirror implementation..

If for some reason the scheme doesn't work, then the fallback is to
expand the mmn_mm->lock spinlock to protect the mrn->invalidate_seq at
some cost in performance.

Jason
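Since this ordering argument is the subtle part of the design, here is the
same write/read pairing rewritten as a small annotated sketch. Pthread
mutexes and C11 relaxed atomics stand in for the kernel locks and
WRITE_ONCE/READ_ONCE, the names are illustrative, and the comments spell out
which unlock/lock pair does the ordering work.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    static pthread_mutex_t driver_lock = PTHREAD_MUTEX_INITIALIZER;
    static _Atomic unsigned long mrn_invalidate_seq;   /* the only unlocked field */

    /* Invalidate side (simplified): publish the new seq, then cycle the
     * driver lock.  The unlock below is the 'release' for the relaxed store:
     * any reader that takes driver_lock afterwards must observe it. */
    static void invalidate_start(unsigned long cur_seq)
    {
        /* Like WRITE_ONCE(): concurrent invalidations all store the same
         * cur_seq (generated under a separate lock), so they need no
         * ordering among themselves. */
        atomic_store_explicit(&mrn_invalidate_seq, cur_seq,
                              memory_order_relaxed);

        pthread_mutex_lock(&driver_lock);
        /* If a reader already held driver_lock and validated against the old
         * seq, that is fine: we block here, the ptes are untouched so its
         * pages were still correct, and this invalidation cleans up whatever
         * it programmed once we get the lock. */
        pthread_mutex_unlock(&driver_lock);
        /* ... only now may the caller go on to update the ptes ... */
    }

    /* Read side: the caller must already hold driver_lock.  Taking that lock
     * pairs with the unlock in invalidate_start() (acquire/release), which is
     * why no explicit SMP barrier is needed around this load. */
    static bool read_retry(unsigned long seq)
    {
        return atomic_load_explicit(&mrn_invalidate_seq,
                                    memory_order_relaxed) != seq;
    }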
On 10/15/2019 2:12 PM, Jason Gunthorpe wrote:
> This is still being tested, but I figured to send it to start getting help
> from the xen, amd and hfi drivers which I cannot test here.

Sorry for the delay, I never saw this. I was not on the Cc list and it
didn't register to me that it impacted hfi. I'll take a look and run it
through some hfi1 tests.

-Denny
On Mon, Oct 21, 2019 at 11:55:51AM -0400, Dennis Dalessandro wrote: > On 10/15/2019 2:12 PM, Jason Gunthorpe wrote: > > This is still being tested, but I figured to send it to start getting help > > from the xen, amd and hfi drivers which I cannot test here. > > Sorry for the delay, I never seen this. Was not on Cc list and didn't > register to me it impacted hfi. I'll take a look and run it through some > hfi1 tests. Hm, you were cc'd on the hfi1 patch of the series: https://patchwork.kernel.org/patch/11191395/ So you saw that, right? But it seems that git send-email didn't pull all the cc's together? Jason
On Tue, Oct 15, 2019 at 03:12:27PM -0300, Jason Gunthorpe wrote: > From: Jason Gunthorpe <jgg@mellanox.com> > > 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1, > scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where > they only use invalidate_range_start/end and immediately check the > invalidating range against some driver data structure to tell if the > driver is interested. Half of them use an interval_tree, the others are > simple linear search lists. > > Of the ones I checked they largely seem to have various kinds of races, > bugs and poor implementation. This is a result of the complexity in how > the notifier interacts with get_user_pages(). It is extremely difficult to > use it correctly. > > Consolidate all of this code together into the core mmu_notifier and > provide a locking scheme similar to hmm_mirror that allows the user to > safely use get_user_pages() and reliably know if the page list still > matches the mm. > > This new arrangment plays nicely with the !blockable mode for > OOM. Scanning the interval tree is done such that the intersection test > will always succeed, and since there is no invalidate_range_end exposed to > drivers the scheme safely allows multiple drivers to be subscribed. > > Four places are converted as an example of how the new API is used. > Four are left for future patches: > - i915_gem has complex locking around destruction of a registration, > needs more study > - hfi1 (2nd user) needs access to the rbtree > - scif_dma has a complicated logic flow > - vhost's mmu notifiers are already being rewritten > > This is still being tested, but I figured to send it to start getting help > from the xen, amd and hfi drivers which I cannot test here. It might be a good oportunity to also switch those users to hmm_range_fault() instead of GUP as GUP is pointless for those users. In fact the GUP is an impediment to normal mm operations. I will test on nouveau. Cheers, Jérôme
On Mon, Oct 21, 2019 at 02:40:41PM -0400, Jerome Glisse wrote: > On Tue, Oct 15, 2019 at 03:12:27PM -0300, Jason Gunthorpe wrote: > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1, > > scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where > > they only use invalidate_range_start/end and immediately check the > > invalidating range against some driver data structure to tell if the > > driver is interested. Half of them use an interval_tree, the others are > > simple linear search lists. > > > > Of the ones I checked they largely seem to have various kinds of races, > > bugs and poor implementation. This is a result of the complexity in how > > the notifier interacts with get_user_pages(). It is extremely difficult to > > use it correctly. > > > > Consolidate all of this code together into the core mmu_notifier and > > provide a locking scheme similar to hmm_mirror that allows the user to > > safely use get_user_pages() and reliably know if the page list still > > matches the mm. > > > > This new arrangment plays nicely with the !blockable mode for > > OOM. Scanning the interval tree is done such that the intersection test > > will always succeed, and since there is no invalidate_range_end exposed to > > drivers the scheme safely allows multiple drivers to be subscribed. > > > > Four places are converted as an example of how the new API is used. > > Four are left for future patches: > > - i915_gem has complex locking around destruction of a registration, > > needs more study > > - hfi1 (2nd user) needs access to the rbtree > > - scif_dma has a complicated logic flow > > - vhost's mmu notifiers are already being rewritten > > > > This is still being tested, but I figured to send it to start getting help > > from the xen, amd and hfi drivers which I cannot test here. > > It might be a good oportunity to also switch those users to > hmm_range_fault() instead of GUP as GUP is pointless for those > users. In fact the GUP is an impediment to normal mm operations. I think vhost can use hmm_range_fault hfi1 does actually need to have the page pin, it doesn't fence DMA during invalidate. i915_gem feels alot like amdgpu, so probably it would benefit No idea about scif_dma > I will test on nouveau. Thanks, hopefully it still works, I think Ralph was able to do some basic checks. But it is a pretty complicated series, I probably made some mistakes. FWIW, I know that nouveau gets a lockdep splat now from Daniel Vetter's recent changes, it tries to do GFP_KERENEL allocations under a lock also held by the invalidate_range_start path. Thanks for looking at it! Regards, Jason
On Mon, Oct 21, 2019 at 03:12:26PM +0000, Jason Gunthorpe wrote: > On Mon, Oct 21, 2019 at 02:28:46PM +0000, Koenig, Christian wrote: > > Am 21.10.19 um 15:57 schrieb Jason Gunthorpe: > > > On Sun, Oct 20, 2019 at 02:21:42PM +0000, Koenig, Christian wrote: > > >> Am 18.10.19 um 22:36 schrieb Jason Gunthorpe: > > >>> On Thu, Oct 17, 2019 at 04:47:20PM +0000, Koenig, Christian wrote: > > >>> [SNIP] > > >>> > > >>>> So again how are they serialized? > > >>> The 'driver lock' thing does it, read the hmm documentation, the hmm > > >>> approach is basically the only approach that was correct of all the > > >>> drivers.. > > >> Well that's what I've did, but what HMM does still doesn't looks correct > > >> to me. > > > It has a bug, but the basic flow seems to work. > > > > > > https://patchwork.kernel.org/patch/11191 > > > > Maybe wrong link? That link looks like an unrelated discussion on kernel > > image relocation. > > Sorry, it got corrupted: > > https://patchwork.kernel.org/patch/11191397/ > > > >>> So long as the 'driver lock' is held the range cannot become > > >>> invalidated as the 'driver lock' prevents progress of invalidation. > > >> Correct, but the problem is it doesn't wait for ongoing operations to > > >> complete. > > >> > > >> See I'm talking about the following case: > > >> > > >> Thread A Thread B > > >> invalidate_range_start() > > >> mmu_range_read_begin() > > >> get_user_pages()/hmm_range_fault() > > >> grab_driver_lock() > > >> Updating the ptes > > >> invalidate_range_end() > > >> > > >> As far as I can see in invalidate_range_start() the driver lock is taken > > >> to make sure that we can't start any invalidation while the driver is > > >> using the pages for a command submission. > > > Again, this uses the seqlock like scheme *and* the driver lock. > > > > > > In this case after grab_driver_lock() mmu_range_read_retry() will > > > return false if Thread A has progressed to 'updating the ptes. > > > > > > For instance here is how the concurrency resolves for retry: > > > > > > CPU1 CPU2 > > > seq = mmu_range_read_begin() > > > invalidate_range_start() > > > invalidate_seq++ > > > > How that was order was what confusing me. But I've read up on the code > > in mmu_range_read_begin() and found the lines I was looking for: > > > > + if (is_invalidating) > > + wait_event(mmn_mm->wq, > > + READ_ONCE(mmn_mm->invalidate_seq) != seq); > > > > [SNIP] > > Right, the basic design is that the 'seq' returned by > mmu_range_read_begin() is never currently under invalidation. > > Thus if the starting seq is not invalidating, then if it doesn't > change we are guaranteed the ptes haven't changed either. > > > > For the above I've simplified the mechanics of the invalidate_seq, you > > > need to look through the patch to see how it actually works. > > > > Yea, that you also allow multiple write sides is pretty neat. > > Complicated, but necessary to make the non-blocking OOM stuff able to > read the interval tree under all conditions :\ > > > > One of the motivations for this work is to squash that bug by adding a > > > seqlock like pattern. But the basic hmm flow and collision-retry > > > approach seems sound. > > > > > > Do you see a problem with this patch? > > > > No, not any more. > > Okay, great, thanks > > > Essentially you are doing the same thing I've tried to do with the > > original amdgpu implementation. 
The difference is that you don't try to > > use a per range sequence (which is a good idea, we never got that fully > > working) and you allow multiple writers at the same time. > > Yes, ODP had the per-range sequence and it never worked right > either. Keeping track of things during the invalidate_end was too complex > > > Feel free to stitch an Acked-by: Christian König > > <christian.koenig@amd.com> on patch #2, > > Thanks! Can you also take some review and test for the AMD related > patches? These were quite hard to make, it is very likely I've made an > error.. > > > but you still doing a bunch of things in there which are way beyond > > my understanding (e.g. where are all the SMP barriers?). > > The only non-locked data is 'struct mmu_range_notifier->invalidate_seq' > > On the write side it uses a WRITE_ONCE. The 'seq' it writes is > generated under the mmn_mm->lock spinlock (held before and after the > WRITE_ONCE) such that all concurrent WRITE_ONCE's are storing the same > value. > > Essentially the spinlock is providing the barrier to order write: > > invalidate_range_start(): > spin_lock(&mmn_mm->lock); > mmn_mm->active_invalidate_ranges++; > mmn_mm->invalidate_seq |= 1; > *seq = mmn_mm->invalidate_seq; > spin_unlock(&mmn_mm->lock); > > WRITE_ONCE(mrn->invalidate_seq, cur_seq); > > invalidate_range_end() > spin_lock(&mmn_mm->lock); > if (--mmn_mm->active_invalidate_ranges) > mmn_mm->invalidate_seq++ > spin_unlock(&mmn_mm->lock); > > ie when we do invalidate_seq++, due to the active_invalidate_ranges > counter and the spinlock, we know all other WRITE_ONCE's have passed > their spin_unlock and no concurrent ones are starting. The spinlock is > providing the barrier here. > > On the read side.. It is a similar argument, except with the > driver_lock: > > mmu_range_read_begin() > seq = READ_ONCE(mrn->invalidate_seq); > > Here 'seq' may be the current value, or it may be an older > value. Ordering is eventually provided by the driver_lock: > > mn_tree_invalidate_start() > [..] > WRITE_ONCE(mrn->invalidate_seq, cur_seq); > driver_lock() > driver_unlock() > > vs > driver_lock() > mmu_range_read_begin() > return seq != READ_ONCE(mrn->invalidate_seq); > driver_unlock() > > Here if mn_tree_invalidate_start() has passed driver_unlock() then > because we do driver_lock() before mmu_range_read_begin() we must > observe the WRITE_ONCE. ie the driver_unlock()/driver_lock() provide > the pair'd barrier. > > If mn_tree_invalidate_start() has not passed driver_lock() then the > mmu_range_read_begin() side prevents it from passing driver_lock() > while it holds the lock. In this case it is OK if we don't observe the > WRITE_ONCE() that was done just before as the invalidate_start() > thread can't proceed to invalidation. > > It is very unusual locking, it would be great if others could help > look at it! > > The unusual bit in all of this is using a lock's critical region to > 'protect' data for read, but updating that same data before the lock's > critical secion. ie relying on the unlock barrier to 'release' program > ordered stores done before the lock's own critical region, and the > lock side barrier to 'acquire' those stores. I think this unusual use of locks as barriers for other unlocked accesses deserves comments even more than just normal barriers. Can you pls add them? I think the design seeems sound ... 
Also the comment on the driver's lock hopefully prevents driver maintainers from moving the driver_lock around in a way that would very subtle break the scheme, so I think having the acquire barrier commented in each place would be really good. Cheers, Daniel > > This approach is borrowed from the hmm mirror implementation.. > > If for some reason the scheme doesn't work, then the fallback is to > expand the mmn_mm->lock spinlock to protect the mrn->invalidate_seq at > some cost in performance. > > Jason > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel
On 10/21/2019 12:58 PM, Jason Gunthorpe wrote: > On Mon, Oct 21, 2019 at 11:55:51AM -0400, Dennis Dalessandro wrote: >> On 10/15/2019 2:12 PM, Jason Gunthorpe wrote: >>> This is still being tested, but I figured to send it to start getting help >>> from the xen, amd and hfi drivers which I cannot test here. >> >> Sorry for the delay, I never seen this. Was not on Cc list and didn't >> register to me it impacted hfi. I'll take a look and run it through some >> hfi1 tests. > > Hm, you were cc'd on the hfi1 patch of the series: > > https://patchwork.kernel.org/patch/11191395/ > > So you saw that, right? I do now. > But it seems that git send-email didn't pull all the cc's together? I don't know. I thought it did, at one time I recall trying to get it *not* to do that, when preparing some internal reviews. Haven't used it for a long time though, I've been using stgit. At any rate can you give me a SHA or branch this applies on top of? I have pulled rdma/hmm, rdma/wip/jgg, linus/master but seem to have conflicts. -Denny
On Tue, Oct 22, 2019 at 07:56:12AM -0400, Dennis Dalessandro wrote: > On 10/21/2019 12:58 PM, Jason Gunthorpe wrote: > > On Mon, Oct 21, 2019 at 11:55:51AM -0400, Dennis Dalessandro wrote: > > > On 10/15/2019 2:12 PM, Jason Gunthorpe wrote: > > > > This is still being tested, but I figured to send it to start getting help > > > > from the xen, amd and hfi drivers which I cannot test here. > > > > > > Sorry for the delay, I never seen this. Was not on Cc list and didn't > > > register to me it impacted hfi. I'll take a look and run it through some > > > hfi1 tests. > > > > Hm, you were cc'd on the hfi1 patch of the series: > > > > https://patchwork.kernel.org/patch/11191395/ > > > > So you saw that, right? > > I do now. > > > But it seems that git send-email didn't pull all the cc's together? > > I don't know. I thought it did, at one time I recall trying to get it *not* > to do that, when preparing some internal reviews. Haven't used it for a long > time though, I've been using stgit. > > At any rate can you give me a SHA or branch this applies on top of? I have > pulled rdma/hmm, rdma/wip/jgg, linus/master but seem to have conflicts. Here is a full tree: https://github.com/jgunthorpe/linux/commits/mmu_notifier It needs the ODP series on the mailing list to apply Thanks, Jason
On Tue, Oct 22, 2019 at 09:57:35AM +0200, Daniel Vetter wrote:

> > The unusual bit in all of this is using a lock's critical region to
> > 'protect' data for read, but updating that same data before the lock's
> > critical section. ie relying on the unlock barrier to 'release' program
> > ordered stores done before the lock's own critical region, and the
> > lock side barrier to 'acquire' those stores.
>
> I think this unusual use of locks as barriers for other unlocked accesses
> deserves comments even more than just normal barriers. Can you pls add
> them? I think the design seems sound ...
>
> Also the comment on the driver's lock hopefully prevents driver
> maintainers from moving the driver_lock around in a way that would very
> subtly break the scheme, so I think having the acquire barrier commented
> in each place would be really good.

There is already a lot of documentation, I think it would be helpful
if you could suggest some specific places where you think an addition
would help? I think the perspective of someone less familiar with this
design would really improve the documentation.

I've been tempted to force the driver to store the seq number directly
under the driver lock - this makes the scheme much clearer, ie
something like this:

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 712c99918551bc..738fa670dcfb19 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -488,7 +488,8 @@ struct svm_notifier {
 };
 
 static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
-                                         const struct mmu_notifier_range *range)
+                                         const struct mmu_notifier_range *range,
+                                         unsigned long seq)
 {
 	struct svm_notifier *sn =
 		container_of(mrn, struct svm_notifier, notifier);
@@ -504,6 +505,7 @@ static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
 		mutex_lock(&sn->svmm->mutex);
 	else if (!mutex_trylock(&sn->svmm->mutex))
 		return false;
+	mmu_range_notifier_update_seq(mrn, seq);
 	mutex_unlock(&sn->svmm->mutex);
 	return true;
 }

At the cost of making the driver a bit more complex, what do you
think?

Jason
On Tue, Oct 22, 2019 at 03:01:13PM +0000, Jason Gunthorpe wrote: > On Tue, Oct 22, 2019 at 09:57:35AM +0200, Daniel Vetter wrote: > > > > The unusual bit in all of this is using a lock's critical region to > > > 'protect' data for read, but updating that same data before the lock's > > > critical secion. ie relying on the unlock barrier to 'release' program > > > ordered stores done before the lock's own critical region, and the > > > lock side barrier to 'acquire' those stores. > > > > I think this unusual use of locks as barriers for other unlocked accesses > > deserves comments even more than just normal barriers. Can you pls add > > them? I think the design seeems sound ... > > > > Also the comment on the driver's lock hopefully prevents driver > > maintainers from moving the driver_lock around in a way that would very > > subtle break the scheme, so I think having the acquire barrier commented > > in each place would be really good. > > There is already a lot of documentation, I think it would be helpful > if you could suggest some specific places where you think an addition > would help? I think the perspective of someone less familiar with this > design would really improve the documentation Hm I just meant the usual recommendation that "barriers must have comments explaining what they order, and where the other side of the barrier is". Using unlock/lock as a barrier imo just makes that an even better idea. Usually what I do is something like "we need to order $this against $that below, and the other side of this barrier is in function()." With maybe a bit more if it's not obvious how things go wrong if the orderin is broken. Ofc seqlock.h itself skimps on that rule and doesn't bother explaining its barriers :-/ > I've been tempted to force the driver to store the seq number directly > under the driver lock - this makes the scheme much clearer, ie > something like this: > > diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c > index 712c99918551bc..738fa670dcfb19 100644 > --- a/drivers/gpu/drm/nouveau/nouveau_svm.c > +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c > @@ -488,7 +488,8 @@ struct svm_notifier { > }; > > static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn, > - const struct mmu_notifier_range *range) > + const struct mmu_notifier_range *range, > + unsigned long seq) > { > struct svm_notifier *sn = > container_of(mrn, struct svm_notifier, notifier); > @@ -504,6 +505,7 @@ static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn, > mutex_lock(&sn->svmm->mutex); > else if (!mutex_trylock(&sn->svmm->mutex)) > return false; > + mmu_range_notifier_update_seq(mrn, seq); > mutex_unlock(&sn->svmm->mutex); > return true; > } > > > At the cost of making the driver a bit more complex, what do you > think? Hm, spinning this further ... could we initialize the mmu range notifier with a pointer to the driver lock, so that we could put a lockdep_assert_held into mmu_range_notifier_update_seq? I think that would make this scheme substantially more driver-hacker proof :-) Cheers, Daniel
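To sketch what Daniel is suggesting here: the notifier could be given a
pointer to the driver's lock at insert time so the update helper can assert
it is held. This is not in the posted series; the names follow the proposed
API from Jason's diff above, and it assumes for simplicity that the driver
lock is a struct mutex (a spinlock-based driver would need a variant).

    struct mmu_range_notifier {
        /* ... fields from the proposed series ... */
        unsigned long invalidate_seq;
        struct mutex *driver_lock;    /* optional, debug aid only */
    };

    static inline void
    mmu_range_notifier_update_seq(struct mmu_range_notifier *mrn,
                                  unsigned long seq)
    {
        /* Catches a driver calling this without its serializing lock held,
         * e.g. after the lock was moved around during a refactor. */
        if (mrn->driver_lock)
            lockdep_assert_held(mrn->driver_lock);
        WRITE_ONCE(mrn->invalidate_seq, seq);
    }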
On 23.10.19 at 11:08, Daniel Vetter wrote:
> On Tue, Oct 22, 2019 at 03:01:13PM +0000, Jason Gunthorpe wrote:
>> On Tue, Oct 22, 2019 at 09:57:35AM +0200, Daniel Vetter wrote:
>>
>>>> The unusual bit in all of this is using a lock's critical region to
>>>> 'protect' data for read, but updating that same data before the lock's
>>>> critical section. ie relying on the unlock barrier to 'release' program
>>>> ordered stores done before the lock's own critical region, and the
>>>> lock side barrier to 'acquire' those stores.
>>> I think this unusual use of locks as barriers for other unlocked accesses
>>> deserves comments even more than just normal barriers. Can you pls add
>>> them? I think the design seems sound ...
>>>
>>> Also the comment on the driver's lock hopefully prevents driver
>>> maintainers from moving the driver_lock around in a way that would very
>>> subtly break the scheme, so I think having the acquire barrier commented
>>> in each place would be really good.
>> There is already a lot of documentation, I think it would be helpful
>> if you could suggest some specific places where you think an addition
>> would help? I think the perspective of someone less familiar with this
>> design would really improve the documentation
> Hm I just meant the usual recommendation that "barriers must have comments
> explaining what they order, and where the other side of the barrier is".
> Using unlock/lock as a barrier imo just makes that an even better idea.
> Usually what I do is something like "we need to order $this against $that
> below, and the other side of this barrier is in function()." With maybe a
> bit more if it's not obvious how things go wrong if the ordering is broken.
>
> Ofc seqlock.h itself skimps on that rule and doesn't bother explaining its
> barriers :-/
>
>> I've been tempted to force the driver to store the seq number directly
>> under the driver lock - this makes the scheme much clearer, ie
>> something like this:
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
>> index 712c99918551bc..738fa670dcfb19 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
>> @@ -488,7 +488,8 @@ struct svm_notifier {
>>  };
>>  
>>  static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
>> -					 const struct mmu_notifier_range *range)
>> +					 const struct mmu_notifier_range *range,
>> +					 unsigned long seq)
>>  {
>>  	struct svm_notifier *sn =
>>  		container_of(mrn, struct svm_notifier, notifier);
>> @@ -504,6 +505,7 @@ static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
>>  		mutex_lock(&sn->svmm->mutex);
>>  	else if (!mutex_trylock(&sn->svmm->mutex))
>>  		return false;
>> +	mmu_range_notifier_update_seq(mrn, seq);
>>  	mutex_unlock(&sn->svmm->mutex);
>>  	return true;
>>  }
>>
>>
>> At the cost of making the driver a bit more complex, what do you
>> think?
> Hm, spinning this further ... could we initialize the mmu range notifier
> with a pointer to the driver lock, so that we could put a
> lockdep_assert_held into mmu_range_notifier_update_seq? I think that would
> make this scheme substantially more driver-hacker proof :-)

Going another step further.... what keeps us from putting the lock into
the mmu range notifier itself and having _lock()/_unlock() helpers?

I mean having the lock in the driver only makes sense when the driver
would be using the same lock for multiple things, e.g. multiple MMU range
notifiers under the same lock. But I really don't see that use case here.

Regards,
Christian.

> 
> Cheers, Daniel
On Wed, Oct 23, 2019 at 11:32:16AM +0200, Christian König wrote:
> On 23.10.19 at 11:08, Daniel Vetter wrote:
> > On Tue, Oct 22, 2019 at 03:01:13PM +0000, Jason Gunthorpe wrote:
> > > On Tue, Oct 22, 2019 at 09:57:35AM +0200, Daniel Vetter wrote:
> > > 
> > > > > The unusual bit in all of this is using a lock's critical region to
> > > > > 'protect' data for read, but updating that same data before the lock's
> > > > > critical section. ie relying on the unlock barrier to 'release' program
> > > > > ordered stores done before the lock's own critical region, and the
> > > > > lock side barrier to 'acquire' those stores.
> > > > I think this unusual use of locks as barriers for other unlocked accesses
> > > > deserves comments even more than just normal barriers. Can you pls add
> > > > them? I think the design seems sound ...
> > > > 
> > > > Also the comment on the driver's lock hopefully prevents driver
> > > > maintainers from moving the driver_lock around in a way that would very
> > > > subtly break the scheme, so I think having the acquire barrier commented
> > > > in each place would be really good.
> > > There is already a lot of documentation, I think it would be helpful
> > > if you could suggest some specific places where you think an addition
> > > would help? I think the perspective of someone less familiar with this
> > > design would really improve the documentation
> > Hm I just meant the usual recommendation that "barriers must have comments
> > explaining what they order, and where the other side of the barrier is".
> > Using unlock/lock as a barrier imo just makes that an even better idea.
> > Usually what I do is something like "we need to order $this against $that
> > below, and the other side of this barrier is in function()." With maybe a
> > bit more if it's not obvious how things go wrong if the ordering is broken.
> > 
> > Ofc seqlock.h itself skimps on that rule and doesn't bother explaining its
> > barriers :-/
> > 
> > > I've been tempted to force the driver to store the seq number directly
> > > under the driver lock - this makes the scheme much clearer, ie
> > > something like this:
> > > 
> > > diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
> > > index 712c99918551bc..738fa670dcfb19 100644
> > > --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> > > +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> > > @@ -488,7 +488,8 @@ struct svm_notifier {
> > >  };
> > >  
> > >  static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
> > > -					 const struct mmu_notifier_range *range)
> > > +					 const struct mmu_notifier_range *range,
> > > +					 unsigned long seq)
> > >  {
> > >  	struct svm_notifier *sn =
> > >  		container_of(mrn, struct svm_notifier, notifier);
> > > @@ -504,6 +505,7 @@ static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
> > >  		mutex_lock(&sn->svmm->mutex);
> > >  	else if (!mutex_trylock(&sn->svmm->mutex))
> > >  		return false;
> > > +	mmu_range_notifier_update_seq(mrn, seq);
> > >  	mutex_unlock(&sn->svmm->mutex);
> > >  	return true;
> > >  }
> > > 
> > > 
> > > At the cost of making the driver a bit more complex, what do you
> > > think?
> > Hm, spinning this further ... could we initialize the mmu range notifier
> > with a pointer to the driver lock, so that we could put a
> > lockdep_assert_held into mmu_range_notifier_update_seq? I think that would
> > make this scheme substantially more driver-hacker proof :-)
> 
> Going another step further.... what keeps us from putting the lock into
> the mmu range notifier itself and having _lock()/_unlock() helpers?
> 
> I mean having the lock in the driver only makes sense when the driver
> would be using the same lock for multiple things, e.g. multiple MMU range
> notifiers under the same lock. But I really don't see that use case here.

I actually do: nouveau uses one lock to protect the page table, and that's
the lock that matters. You can have multiple ranges for a single page
table, the idea being that only a sub-set of the process address space is
ever accessed by the GPU, so it is better to focus on that sub-set and
track invalidation at a finer grain.

Cheers,
Jérôme
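As an aside, the arrangement Jérôme describes (several ranges, one page-table
lock) could look roughly like this in a driver. Every identifier below is made
up for illustration, and the invalidate callback signature is assumed from this
series:

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

/* One lock for the whole GPU page table ... */
struct gpu_vm {
	struct mutex pt_lock;		/* protects the HW page table */
};

/* ... but one notifier per actively used sub-range of the address space. */
struct gpu_vm_range {
	struct mmu_range_notifier notifier;
	struct gpu_vm *vm;
};

static bool gpu_vm_range_invalidate(struct mmu_range_notifier *mrn,
				    const struct mmu_notifier_range *range)
{
	struct gpu_vm_range *r =
		container_of(mrn, struct gpu_vm_range, notifier);

	/* All ranges funnel through the single page-table lock. */
	if (mmu_notifier_range_blockable(range))
		mutex_lock(&r->vm->pt_lock);
	else if (!mutex_trylock(&r->vm->pt_lock))
		return false;

	/* ... zap only the HW PTEs covering this sub-range ... */

	mutex_unlock(&r->vm->pt_lock);
	return true;
}

Each actively used sub-range gets its own notifier, but every invalidation
serializes on the one pt_lock that also orders the driver's page-table updates,
which is exactly the "driver lock" of the scheme being discussed.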
On Wed, Oct 23, 2019 at 12:52:23PM -0400, Jerome Glisse wrote:
> > Going another step further.... what keeps us from putting the lock into
> > the mmu range notifier itself and having _lock()/_unlock() helpers?
> > 
> > I mean having the lock in the driver only makes sense when the driver
> > would be using the same lock for multiple things, e.g. multiple MMU range
> > notifiers under the same lock. But I really don't see that use case here.
> 
> I actually do: nouveau uses one lock to protect the page table, and that's
> the lock that matters. You can have multiple ranges for a single page
> table, the idea being that only a sub-set of the process address space is
> ever accessed by the GPU, so it is better to focus on that sub-set and
> track invalidation at a finer grain.

mlx5 is similar, but not currently coded quite right, there is one
lock that protects the command queue for submitting invalidations to
the HW and it doesn't make a lot of sense to have additional fine
grained locking beyond that.

So I suppose the intent here is that most drivers would have a single
'page table lock' that protects the HW's page table update, and this
lock is the one that should be held while updating and checking the
sequence number.

dma_fence based drivers are possibly a little different, I think they
can just use a spinlock; their pattern should probably be something
like:

 fault:
    hmm_range_fault()
    spin_lock()
    if (mmu_range_read_retry())
          goto again
    dma_fence_init(mrn->fence)
    spin_unlock()

 invalidate:
    spin_lock()
    is_inited = 'dma fence init has been called'
    spin_unlock()
    if (is_inited)
         dma_fence_wait(fence)

I'm not sure, never used dma_fence before. The key thing is that the
dma_fence_wait() cannot block until after the mmu_range_read_retry() &
unlock completes. Otherwise it can deadlock with hmm_range_fault().

It would be nice to figure this out and add it to the hmm.rst as we do
have two drivers using the dma_fence scheme.

Also, the use of a spinlock here probably says we should keep the lock
external.

But, it sounds like the mmu_range_notifier_update_seq() is a good idea,
so let me add that in v2.

Jason
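Fleshing that pseudocode out a little, a hedged C sketch of the dma_fence
flavour might look like the following. The struct, field and function names on
the driver side are invented, the mmu_range_read_begin()/mmu_range_read_retry()
names are assumed from this series, and only the dma_fence calls are the stock
ones from linux/dma-fence.h:

#include <linux/dma-fence.h>
#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/spinlock.h>

struct fenced_range {			/* hypothetical driver state */
	struct mmu_range_notifier notifier;
	spinlock_t lock;		/* the 'driver lock' of the scheme */
	struct dma_fence *active_fence;	/* non-NULL while HW may use the pages */
};

/* Fault side: only publish the mapping if no invalidation raced with us. */
static int fenced_range_fault(struct fenced_range *r, struct hmm_range *range,
			      struct dma_fence *fence)
{
	unsigned long seq;

again:
	seq = mmu_range_read_begin(&r->notifier);
	/* ... hmm_range_fault(range, ...) and build the HW mapping here,
	 * sleeping is fine because the spinlock is not yet held ... */

	spin_lock(&r->lock);
	if (mmu_range_read_retry(&r->notifier, seq)) {
		spin_unlock(&r->lock);
		goto again;			/* an invalidation won the race */
	}
	r->active_fence = dma_fence_get(fence);	/* HW may now touch the pages */
	spin_unlock(&r->lock);
	return 0;
}

/*
 * Invalidate side: the wait happens strictly after the unlock, so the fault
 * side's retry-and-publish under the same lock cannot deadlock against it.
 */
static bool fenced_range_invalidate(struct mmu_range_notifier *mrn,
				    const struct mmu_notifier_range *range)
{
	struct fenced_range *r =
		container_of(mrn, struct fenced_range, notifier);
	struct dma_fence *fence;

	if (!mmu_notifier_range_blockable(range))
		return false;		/* we may have to wait on the fence */

	spin_lock(&r->lock);
	fence = r->active_fence;
	r->active_fence = NULL;
	spin_unlock(&r->lock);

	if (fence) {
		dma_fence_wait(fence, false);
		dma_fence_put(fence);
	}
	return true;
}

This is only a sketch of the ordering constraint described above, not a
reference implementation; a real driver would also have to signal and recycle
the fence when the HW is done with the pages.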
On Mon, Oct 21, 2019 at 07:06:00PM +0000, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2019 at 02:40:41PM -0400, Jerome Glisse wrote:
> > On Tue, Oct 15, 2019 at 03:12:27PM -0300, Jason Gunthorpe wrote:
> > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > 
> > > 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> > > scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where
> > > they only use invalidate_range_start/end and immediately check the
> > > invalidating range against some driver data structure to tell if the
> > > driver is interested. Half of them use an interval_tree, the others are
> > > simple linear search lists.
> > > 
> > > Of the ones I checked they largely seem to have various kinds of races,
> > > bugs and poor implementation. This is a result of the complexity in how
> > > the notifier interacts with get_user_pages(). It is extremely difficult to
> > > use it correctly.
> > > 
> > > Consolidate all of this code together into the core mmu_notifier and
> > > provide a locking scheme similar to hmm_mirror that allows the user to
> > > safely use get_user_pages() and reliably know if the page list still
> > > matches the mm.
> > > 
> > > This new arrangement plays nicely with the !blockable mode for
> > > OOM. Scanning the interval tree is done such that the intersection test
> > > will always succeed, and since there is no invalidate_range_end exposed to
> > > drivers the scheme safely allows multiple drivers to be subscribed.
> > > 
> > > Four places are converted as an example of how the new API is used.
> > > Four are left for future patches:
> > >  - i915_gem has complex locking around destruction of a registration,
> > >    needs more study
> > >  - hfi1 (2nd user) needs access to the rbtree
> > >  - scif_dma has a complicated logic flow
> > >  - vhost's mmu notifiers are already being rewritten
> > > 
> > > This is still being tested, but I figured to send it to start getting help
> > > from the xen, amd and hfi drivers which I cannot test here.
> > 
> > It might be a good opportunity to also switch those users to
> > hmm_range_fault() instead of GUP as GUP is pointless for those
> > users. In fact the GUP is an impediment to normal mm operations.
> 
> I think vhost can use hmm_range_fault
> 
> hfi1 does actually need to have the page pin, it doesn't fence DMA
> during invalidate.
> 
> i915_gem feels a lot like amdgpu, so probably it would benefit
> 
> No idea about scif_dma
> 
> > I will test on nouveau.
> 
> Thanks, hopefully it still works, I think Ralph was able to do some
> basic checks. But it is a pretty complicated series, I probably made
> some mistakes.

So it seems to work OK with nouveau; I will let tests run in a loop,
though the tests are not very advanced.

> 
> FWIW, I know that nouveau gets a lockdep splat now from Daniel
> Vetter's recent changes, it tries to do GFP_KERNEL allocations under
> a lock also held by the invalidate_range_start path.

I have not seen any splat so far, is it through some new kernel config?

Cheers,
Jérôme
On Wed, Oct 23, 2019 at 05:24:45PM +0000, Jason Gunthorpe wrote:
> mlx5 is similar, but not currently coded quite right, there is one
> lock that protects the command queue for submitting invalidations to
> the HW and it doesn't make a lot of sense to have additional fine
> grained locking beyond that.

IFF all drivers could agree on a lock type (rw_semaphore?) for this
protection you could add a pointer to the range which would clear
things up a lot.

I'm just not sure you could get everyone to agree.
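Just to make that suggestion concrete, if such an agreement existed the range
could carry the lock by pointer roughly like this. The wrapper struct, field
and helper below are purely hypothetical, and, as noted, it is far from clear
every driver could live with an rw_semaphore here:

#include <linux/lockdep.h>
#include <linux/rwsem.h>

struct mmu_range_notifier_locked {		/* hypothetical wrapper */
	struct mmu_range_notifier mrn;
	struct rw_semaphore *driver_sem;	/* supplied at insert time */
};

/* The core could then assert the locking convention in one place: */
static inline void
mmu_range_locked_update_seq(struct mmu_range_notifier_locked *mln,
			    unsigned long seq)
{
	lockdep_assert_held(mln->driver_sem);
	WRITE_ONCE(mln->mrn.invalidate_seq, seq);	/* field name assumed */
}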