Message ID: 20190523153436.19102-1-jgg@ziepe.ca (mailing list archive)
Series: mm/hmm: Various revisions from a locking/code review
On 5/23/19 8:34 AM, Jason Gunthorpe wrote: > From: Jason Gunthorpe <jgg@mellanox.com> > > This patch series arised out of discussions with Jerome when looking at the > ODP changes, particularly informed by use after free races we have already > found and fixed in the ODP code (thanks to syzkaller) working with mmu > notifiers, and the discussion with Ralph on how to resolve the lifetime model. > > Overall this brings in a simplified locking scheme and easy to explain > lifetime model: > > If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm > is allocated memory. > > If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc) > then the mmget must be obtained via mmget_not_zero(). > > Locking of mm->hmm is shifted to use the mmap_sem consistently for all > read/write and unlocked accesses are removed. > > The use unlocked reads on 'hmm->dead' are also eliminated in favour of using > standard mmget() locking to prevent the mm from being released. Many of the > debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison - > which is much clearer as to the lifetime intent. > > The trailing patches are just some random cleanups I noticed when reviewing > this code. > > I expect Jerome & Ralph will have some design notes so this is just RFC, and > it still needs a matching edit to nouveau. It is only compile tested. > Thanks so much for doing this. Jerome has already absorbed these into his hmm-5.3 branch, along with Ralph's other fixes, so we can start testing, as well as reviewing, the whole set. We'll have feedback soon. thanks,
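To make the quoted lifetime rule concrete, a minimal sketch of the mmget_not_zero()
pattern the cover letter describes is shown below. This is only an illustration, not
code from the series; example_walk_range is a made-up name.

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>

/*
 * Illustrative sketch only: struct hmm holds an mmgrab() reference, so
 * hmm->mm is always allocated memory, but the address space must still be
 * pinned with mmget_not_zero() before taking mmap_sem or looking up VMAs.
 */
static int example_walk_range(struct hmm *hmm, unsigned long addr)
{
	struct vm_area_struct *vma;
	int ret = 0;

	if (!mmget_not_zero(hmm->mm))
		return -EINVAL;		/* mm is exiting, bail out */

	down_read(&hmm->mm->mmap_sem);
	vma = find_vma(hmm->mm, addr);
	if (!vma || vma->vm_start > addr)
		ret = -EFAULT;
	/* ... otherwise walk the CPU page tables under mmap_sem ... */
	up_read(&hmm->mm->mmap_sem);

	mmput(hmm->mm);
	return ret;
}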
On Thu, May 23, 2019 at 12:04:16PM -0700, John Hubbard wrote: > On 5/23/19 8:34 AM, Jason Gunthorpe wrote: > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > This patch series arised out of discussions with Jerome when looking at the > > ODP changes, particularly informed by use after free races we have already > > found and fixed in the ODP code (thanks to syzkaller) working with mmu > > notifiers, and the discussion with Ralph on how to resolve the lifetime model. > > > > Overall this brings in a simplified locking scheme and easy to explain > > lifetime model: > > > > If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm > > is allocated memory. > > > > If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc) > > then the mmget must be obtained via mmget_not_zero(). > > > > Locking of mm->hmm is shifted to use the mmap_sem consistently for all > > read/write and unlocked accesses are removed. > > > > The use unlocked reads on 'hmm->dead' are also eliminated in favour of using > > standard mmget() locking to prevent the mm from being released. Many of the > > debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison - > > which is much clearer as to the lifetime intent. > > > > The trailing patches are just some random cleanups I noticed when reviewing > > this code. > > > > I expect Jerome & Ralph will have some design notes so this is just RFC, and > > it still needs a matching edit to nouveau. It is only compile tested. > > > > Thanks so much for doing this. Jerome has already absorbed these into his > hmm-5.3 branch, along with Ralph's other fixes, so we can start testing, > as well as reviewing, the whole set. We'll have feedback soon. Yes, I looked at Jerome's v2's and he found a few great fixups. My only dislike is re-introducing a READ_ONCE(mm->hmm) when a major point of this seris was to remove that use-after-free stuff. But Jerome says it is a temporary defect while he works out some cross tree API stuff. Jason
On Thu, May 23, 2019 at 12:04:16PM -0700, John Hubbard wrote: > On 5/23/19 8:34 AM, Jason Gunthorpe wrote: > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > This patch series arised out of discussions with Jerome when looking at the > > ODP changes, particularly informed by use after free races we have already > > found and fixed in the ODP code (thanks to syzkaller) working with mmu > > notifiers, and the discussion with Ralph on how to resolve the lifetime model. > > > > Overall this brings in a simplified locking scheme and easy to explain > > lifetime model: > > > > If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm > > is allocated memory. > > > > If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc) > > then the mmget must be obtained via mmget_not_zero(). > > > > Locking of mm->hmm is shifted to use the mmap_sem consistently for all > > read/write and unlocked accesses are removed. > > > > The use unlocked reads on 'hmm->dead' are also eliminated in favour of using > > standard mmget() locking to prevent the mm from being released. Many of the > > debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison - > > which is much clearer as to the lifetime intent. > > > > The trailing patches are just some random cleanups I noticed when reviewing > > this code. > > > > I expect Jerome & Ralph will have some design notes so this is just RFC, and > > it still needs a matching edit to nouveau. It is only compile tested. > > > > Thanks so much for doing this. Jerome has already absorbed these into his > hmm-5.3 branch, along with Ralph's other fixes, so we can start testing, > as well as reviewing, the whole set. We'll have feedback soon. > I force pushed an updated branch with couple fix https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.3 Seems to work ok so far, still doing testing. Cheers, Jérôme
On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> This patch series arised out of discussions with Jerome when looking at the
> ODP changes, particularly informed by use after free races we have already
> found and fixed in the ODP code (thanks to syzkaller) working with mmu
> notifiers, and the discussion with Ralph on how to resolve the lifetime model.
>
> Overall this brings in a simplified locking scheme and easy to explain
> lifetime model:
>
> If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
> is allocated memory.
>
> If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
> then the mmget must be obtained via mmget_not_zero().
>
> Locking of mm->hmm is shifted to use the mmap_sem consistently for all
> read/write and unlocked accesses are removed.
>
> The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
> standard mmget() locking to prevent the mm from being released. Many of the
> debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
> which is much clearer as to the lifetime intent.
>
> The trailing patches are just some random cleanups I noticed when reviewing
> this code.
>
> I expect Jerome & Ralph will have some design notes so this is just RFC, and
> it still needs a matching edit to nouveau. It is only compile tested.
>
> Regards,
> Jason
>
> Jason Gunthorpe (11):
>   mm/hmm: Fix use after free with struct hmm in the mmu notifiers
>   mm/hmm: Use hmm_mirror not mm as an argument for hmm_register_range
>   mm/hmm: Hold a mmgrab from hmm to mm
>   mm/hmm: Simplify hmm_get_or_create and make it reliable
>   mm/hmm: Improve locking around hmm->dead
>   mm/hmm: Remove duplicate condition test before wait_event_timeout
>   mm/hmm: Delete hmm_mirror_mm_is_alive()
>   mm/hmm: Use lockdep instead of comments
>   mm/hmm: Remove racy protection against double-unregistration
>   mm/hmm: Poison hmm_range during unregister
>   mm/hmm: Do not use list*_rcu() for hmm->ranges
>
>  include/linux/hmm.h |  50 ++----------
>  kernel/fork.c       |   1 -
>  mm/hmm.c            | 184 +++++++++++++++++++-------------------------
>  3 files changed, 88 insertions(+), 147 deletions(-)

Jerome, I was doing some more checking of this and noticed lockdep doesn't
compile-test if it is turned off. Since you took and revised the series, can
you please fold in these hunks to fix compile failures with lockdep on.

Thanks

commit f0653c4d4c1dadeaf58d49f1c949ab1d2fda05d3

diff --git a/mm/hmm.c b/mm/hmm.c
index 836adf613f81c8..2a08b78550b90d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -56,7 +56,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 {
 	struct hmm *hmm;
 
-	lockdep_assert_held_exclusive(mm->mmap_sem);
+	lockdep_assert_held_exclusive(&mm->mmap_sem);
 
 	if (mm->hmm) {
 		if (kref_get_unless_zero(&mm->hmm->kref))
@@ -262,7 +262,7 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
  */
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
 {
-	lockdep_assert_held_exclusive(mm->mmap_sem);
+	lockdep_assert_held_exclusive(&mm->mmap_sem);
 
 	/* Sanity check */
 	if (!mm || !mirror || !mirror->ops)
@@ -987,7 +987,7 @@ long hmm_range_snapshot(struct hmm_range *range)
 	struct mm_walk mm_walk;
 
 	/* Caller must hold the mmap_sem, and range hold a reference on mm. */
-	lockdep_assert_held(hmm->mm->mmap_sem);
+	lockdep_assert_held(&hmm->mm->mmap_sem);
 
 	if (WARN_ON(!atomic_read(&hmm->mm->mm_users)))
 		return -EINVAL;
@@ -1086,7 +1086,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 	int ret;
 
 	/* Caller must hold the mmap_sem, and range hold a reference on mm. */
-	lockdep_assert_held(hmm->mm->mmap_sem);
+	lockdep_assert_held(&hmm->mm->mmap_sem);
 
 	if (WARN_ON(!atomic_read(&hmm->mm->mm_users)))
 		return -EINVAL;
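For context on these hunks: lockdep_assert_held() and lockdep_assert_held_exclusive()
expect a pointer to the lock, and mmap_sem is embedded in struct mm_struct, hence the
added '&'. The mistake only shows up when CONFIG_LOCKDEP is enabled because the
assertion macros otherwise ignore their argument. A minimal illustration with a
made-up function name:

#include <linux/lockdep.h>
#include <linux/mm_types.h>

/* Made-up helper, only to show the calling convention. */
static void example_assert_mmap_sem_held(struct mm_struct *mm)
{
	lockdep_assert_held(&mm->mmap_sem);	/* correct: pass a pointer */

	/*
	 * lockdep_assert_held(mm->mmap_sem) would hand the macro the struct
	 * rw_semaphore itself; with CONFIG_LOCKDEP=y that fails to compile,
	 * and with it disabled the macro expands to nothing, so the error
	 * goes unnoticed.
	 */
}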
On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote: > From: Jason Gunthorpe <jgg@mellanox.com> > > This patch series arised out of discussions with Jerome when looking at the > ODP changes, particularly informed by use after free races we have already > found and fixed in the ODP code (thanks to syzkaller) working with mmu > notifiers, and the discussion with Ralph on how to resolve the lifetime model. So the last big difference with ODP's flow is how 'range->valid' works. In ODP this was done using the rwsem umem->umem_rwsem which is obtained for read in invalidate_start and released in invalidate_end. Then any other threads that wish to only work on a umem which is not undergoing invalidation will obtain the write side of the lock, and within that lock's critical section the virtual address range is known to not be invalidating. I cannot understand how hmm gets to the same approach. It has range->valid, but it is not locked by anything that I can see, so when we test it in places like hmm_range_fault it seems useless.. Jerome, how does this work? I have a feeling we should copy the approach from ODP and use an actual lock here. Jason
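As a rough illustration of the ODP scheme described above (the example_* names and
structure are placeholders, not the real ib_umem_odp code), the pattern looks like
this:

#include <linux/rwsem.h>

struct example_umem {
	struct rw_semaphore umem_rwsem;
	/* ... device mapping state ... */
};

/* Invalidation holds the read side for the whole start/end window. */
static void example_invalidate_start(struct example_umem *umem)
{
	down_read(&umem->umem_rwsem);
	/* ... zap device page table entries ... */
}

static void example_invalidate_end(struct example_umem *umem)
{
	up_read(&umem->umem_rwsem);
}

/*
 * A page fault, or any other path that establishes device mappings, takes
 * the write side. That makes it mutually exclusive with every
 * invalidate_start/end critical section: while the write lock is held the
 * virtual address range cannot be invalidating.
 */
static int example_fault_pages(struct example_umem *umem)
{
	down_write(&umem->umem_rwsem);
	/* ... read CPU page tables, establish device SPTEs ... */
	up_write(&umem->umem_rwsem);
	return 0;
}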
On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote: > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote: > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > This patch series arised out of discussions with Jerome when looking at the > > ODP changes, particularly informed by use after free races we have already > > found and fixed in the ODP code (thanks to syzkaller) working with mmu > > notifiers, and the discussion with Ralph on how to resolve the lifetime model. > > So the last big difference with ODP's flow is how 'range->valid' > works. > > In ODP this was done using the rwsem umem->umem_rwsem which is > obtained for read in invalidate_start and released in invalidate_end. > > Then any other threads that wish to only work on a umem which is not > undergoing invalidation will obtain the write side of the lock, and > within that lock's critical section the virtual address range is known > to not be invalidating. > > I cannot understand how hmm gets to the same approach. It has > range->valid, but it is not locked by anything that I can see, so when > we test it in places like hmm_range_fault it seems useless.. > > Jerome, how does this work? > > I have a feeling we should copy the approach from ODP and use an > actual lock here. range->valid is use as bail early if invalidation is happening in hmm_range_fault() to avoid doing useless work. The synchronization is explained in the documentation: Locking within the sync_cpu_device_pagetables() callback is the most important aspect the driver must respect in order to keep things properly synchronized. The usage pattern is:: int driver_populate_range(...) { struct hmm_range range; ... range.start = ...; range.end = ...; range.pfns = ...; range.flags = ...; range.values = ...; range.pfn_shift = ...; hmm_range_register(&range); /* * Just wait for range to be valid, safe to ignore return value as we * will use the return value of hmm_range_snapshot() below under the * mmap_sem to ascertain the validity of the range. */ hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC); again: down_read(&mm->mmap_sem); ret = hmm_range_snapshot(&range); if (ret) { up_read(&mm->mmap_sem); if (ret == -EAGAIN) { /* * No need to check hmm_range_wait_until_valid() return value * on retry we will get proper error with hmm_range_snapshot() */ hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC); goto again; } hmm_range_unregister(&range); return ret; } take_lock(driver->update); if (!hmm_range_valid(&range)) { release_lock(driver->update); up_read(&mm->mmap_sem); goto again; } // Use pfns array content to update device page table hmm_range_unregister(&range); release_lock(driver->update); up_read(&mm->mmap_sem); return 0; } The driver->update lock is the same lock that the driver takes inside its sync_cpu_device_pagetables() callback. That lock must be held before calling hmm_range_valid() to avoid any race with a concurrent CPU page table update. Cheers, Jérôme
On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote: > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote: > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote: > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > This patch series arised out of discussions with Jerome when looking at the > > > ODP changes, particularly informed by use after free races we have already > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model. > > > > So the last big difference with ODP's flow is how 'range->valid' > > works. > > > > In ODP this was done using the rwsem umem->umem_rwsem which is > > obtained for read in invalidate_start and released in invalidate_end. > > > > Then any other threads that wish to only work on a umem which is not > > undergoing invalidation will obtain the write side of the lock, and > > within that lock's critical section the virtual address range is known > > to not be invalidating. > > > > I cannot understand how hmm gets to the same approach. It has > > range->valid, but it is not locked by anything that I can see, so when > > we test it in places like hmm_range_fault it seems useless.. > > > > Jerome, how does this work? > > > > I have a feeling we should copy the approach from ODP and use an > > actual lock here. > > range->valid is use as bail early if invalidation is happening in > hmm_range_fault() to avoid doing useless work. The synchronization > is explained in the documentation: That just says the hmm APIs handle locking. I asked how the apis implement that locking internally. Are you trying to say that if I do this, hmm will still work completely correctly? diff --git a/mm/hmm.c b/mm/hmm.c index 8396a65710e304..42977744855d26 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -981,8 +981,8 @@ long hmm_range_snapshot(struct hmm_range *range) do { /* If range is no longer valid force retry. */ - if (!range->valid) - return -EAGAIN; +/* if (!range->valid) + return -EAGAIN;*/ vma = find_vma(hmm->mm, start); if (vma == NULL || (vma->vm_flags & device_vma)) @@ -1080,10 +1080,10 @@ long hmm_range_fault(struct hmm_range *range, bool block) do { /* If range is no longer valid force retry. */ - if (!range->valid) { +/* if (!range->valid) { up_read(&hmm->mm->mmap_sem); return -EAGAIN; - } + }*/ vma = find_vma(hmm->mm, start); if (vma == NULL || (vma->vm_flags & device_vma)) @@ -1134,7 +1134,7 @@ long hmm_range_fault(struct hmm_range *range, bool block) start = hmm_vma_walk.last; /* Keep trying while the range is valid. */ - } while (ret == -EBUSY && range->valid); + } while (ret == -EBUSY /*&& range->valid*/); if (ret) { unsigned long i;
On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote: > On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote: > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote: > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote: > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > This patch series arised out of discussions with Jerome when looking at the > > > > ODP changes, particularly informed by use after free races we have already > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model. > > > > > > So the last big difference with ODP's flow is how 'range->valid' > > > works. > > > > > > In ODP this was done using the rwsem umem->umem_rwsem which is > > > obtained for read in invalidate_start and released in invalidate_end. > > > > > > Then any other threads that wish to only work on a umem which is not > > > undergoing invalidation will obtain the write side of the lock, and > > > within that lock's critical section the virtual address range is known > > > to not be invalidating. > > > > > > I cannot understand how hmm gets to the same approach. It has > > > range->valid, but it is not locked by anything that I can see, so when > > > we test it in places like hmm_range_fault it seems useless.. > > > > > > Jerome, how does this work? > > > > > > I have a feeling we should copy the approach from ODP and use an > > > actual lock here. > > > > range->valid is use as bail early if invalidation is happening in > > hmm_range_fault() to avoid doing useless work. The synchronization > > is explained in the documentation: > > That just says the hmm APIs handle locking. I asked how the apis > implement that locking internally. > > Are you trying to say that if I do this, hmm will still work completely > correctly? Yes it will keep working correctly. You would just be doing potentialy useless work. > > diff --git a/mm/hmm.c b/mm/hmm.c > index 8396a65710e304..42977744855d26 100644 > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -981,8 +981,8 @@ long hmm_range_snapshot(struct hmm_range *range) > > do { > /* If range is no longer valid force retry. */ > - if (!range->valid) > - return -EAGAIN; > +/* if (!range->valid) > + return -EAGAIN;*/ > > vma = find_vma(hmm->mm, start); > if (vma == NULL || (vma->vm_flags & device_vma)) > @@ -1080,10 +1080,10 @@ long hmm_range_fault(struct hmm_range *range, bool block) > > do { > /* If range is no longer valid force retry. */ > - if (!range->valid) { > +/* if (!range->valid) { > up_read(&hmm->mm->mmap_sem); > return -EAGAIN; > - } > + }*/ > > vma = find_vma(hmm->mm, start); > if (vma == NULL || (vma->vm_flags & device_vma)) > @@ -1134,7 +1134,7 @@ long hmm_range_fault(struct hmm_range *range, bool block) > start = hmm_vma_walk.last; > > /* Keep trying while the range is valid. */ > - } while (ret == -EBUSY && range->valid); > + } while (ret == -EBUSY /*&& range->valid*/); > > if (ret) { > unsigned long i;
On 5/24/19 9:49 AM, Jerome Glisse wrote: > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote: >> On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote: >>> From: Jason Gunthorpe <jgg@mellanox.com> >>> >>> This patch series arised out of discussions with Jerome when looking at the >>> ODP changes, particularly informed by use after free races we have already >>> found and fixed in the ODP code (thanks to syzkaller) working with mmu >>> notifiers, and the discussion with Ralph on how to resolve the lifetime model. >> >> So the last big difference with ODP's flow is how 'range->valid' >> works. >> >> In ODP this was done using the rwsem umem->umem_rwsem which is >> obtained for read in invalidate_start and released in invalidate_end. >> >> Then any other threads that wish to only work on a umem which is not >> undergoing invalidation will obtain the write side of the lock, and >> within that lock's critical section the virtual address range is known >> to not be invalidating. >> >> I cannot understand how hmm gets to the same approach. It has >> range->valid, but it is not locked by anything that I can see, so when >> we test it in places like hmm_range_fault it seems useless.. >> >> Jerome, how does this work? >> >> I have a feeling we should copy the approach from ODP and use an >> actual lock here. > > range->valid is use as bail early if invalidation is happening in > hmm_range_fault() to avoid doing useless work. The synchronization > is explained in the documentation: > > > Locking within the sync_cpu_device_pagetables() callback is the most important > aspect the driver must respect in order to keep things properly synchronized. > The usage pattern is:: > > int driver_populate_range(...) > { > struct hmm_range range; > ... > > range.start = ...; > range.end = ...; > range.pfns = ...; > range.flags = ...; > range.values = ...; > range.pfn_shift = ...; > hmm_range_register(&range); > > /* > * Just wait for range to be valid, safe to ignore return value as we > * will use the return value of hmm_range_snapshot() below under the > * mmap_sem to ascertain the validity of the range. > */ > hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC); > > again: > down_read(&mm->mmap_sem); > ret = hmm_range_snapshot(&range); > if (ret) { > up_read(&mm->mmap_sem); > if (ret == -EAGAIN) { > /* > * No need to check hmm_range_wait_until_valid() return value > * on retry we will get proper error with hmm_range_snapshot() > */ > hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC); > goto again; > } > hmm_range_unregister(&range); > return ret; > } > take_lock(driver->update); > if (!hmm_range_valid(&range)) { > release_lock(driver->update); > up_read(&mm->mmap_sem); > goto again; > } > > // Use pfns array content to update device page table > > hmm_range_unregister(&range); > release_lock(driver->update); > up_read(&mm->mmap_sem); > return 0; > } > > The driver->update lock is the same lock that the driver takes inside its > sync_cpu_device_pagetables() callback. That lock must be held before calling > hmm_range_valid() to avoid any race with a concurrent CPU page table update. > > > Cheers, > Jérôme Given the above, the following patch looks necessary to me. Also, looking at drivers/gpu/drm/nouveau/nouveau_svm.c, it doesn't check the return value to avoid calling up_read(&mm->mmap_sem). Besides, it's better to keep the mmap_sem lock/unlock in the caller. 
diff --git a/mm/hmm.c b/mm/hmm.c
index 836adf613f81..8b6ef97a8d71 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -1092,10 +1092,8 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 
 	do {
 		/* If range is no longer valid force retry. */
-		if (!range->valid) {
-			up_read(&hmm->mm->mmap_sem);
+		if (!range->valid)
 			return -EAGAIN;
-		}
 
 		vma = find_vma(hmm->mm, start);
 		if (vma == NULL || (vma->vm_flags & device_vma))
On Fri, May 24, 2019 at 10:47:16AM -0700, Ralph Campbell wrote: > > On 5/24/19 9:49 AM, Jerome Glisse wrote: > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote: > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote: > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > This patch series arised out of discussions with Jerome when looking at the > > > > ODP changes, particularly informed by use after free races we have already > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model. > > > > > > So the last big difference with ODP's flow is how 'range->valid' > > > works. > > > > > > In ODP this was done using the rwsem umem->umem_rwsem which is > > > obtained for read in invalidate_start and released in invalidate_end. > > > > > > Then any other threads that wish to only work on a umem which is not > > > undergoing invalidation will obtain the write side of the lock, and > > > within that lock's critical section the virtual address range is known > > > to not be invalidating. > > > > > > I cannot understand how hmm gets to the same approach. It has > > > range->valid, but it is not locked by anything that I can see, so when > > > we test it in places like hmm_range_fault it seems useless.. > > > > > > Jerome, how does this work? > > > > > > I have a feeling we should copy the approach from ODP and use an > > > actual lock here. > > > > range->valid is use as bail early if invalidation is happening in > > hmm_range_fault() to avoid doing useless work. The synchronization > > is explained in the documentation: > > > > > > Locking within the sync_cpu_device_pagetables() callback is the most important > > aspect the driver must respect in order to keep things properly synchronized. > > The usage pattern is:: > > > > int driver_populate_range(...) > > { > > struct hmm_range range; > > ... > > > > range.start = ...; > > range.end = ...; > > range.pfns = ...; > > range.flags = ...; > > range.values = ...; > > range.pfn_shift = ...; > > hmm_range_register(&range); > > > > /* > > * Just wait for range to be valid, safe to ignore return value as we > > * will use the return value of hmm_range_snapshot() below under the > > * mmap_sem to ascertain the validity of the range. > > */ > > hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC); > > > > again: > > down_read(&mm->mmap_sem); > > ret = hmm_range_snapshot(&range); > > if (ret) { > > up_read(&mm->mmap_sem); > > if (ret == -EAGAIN) { > > /* > > * No need to check hmm_range_wait_until_valid() return value > > * on retry we will get proper error with hmm_range_snapshot() > > */ > > hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC); > > goto again; > > } > > hmm_range_unregister(&range); > > return ret; > > } > > take_lock(driver->update); > > if (!hmm_range_valid(&range)) { > > release_lock(driver->update); > > up_read(&mm->mmap_sem); > > goto again; > > } > > > > // Use pfns array content to update device page table > > > > hmm_range_unregister(&range); > > release_lock(driver->update); > > up_read(&mm->mmap_sem); > > return 0; > > } > > > > The driver->update lock is the same lock that the driver takes inside its > > sync_cpu_device_pagetables() callback. That lock must be held before calling > > hmm_range_valid() to avoid any race with a concurrent CPU page table update. > > > > > > Cheers, > > Jérôme > > > Given the above, the following patch looks necessary to me. 
> Also, looking at drivers/gpu/drm/nouveau/nouveau_svm.c, it
> doesn't check the return value to avoid calling up_read(&mm->mmap_sem).
> Besides, it's better to keep the mmap_sem lock/unlock in the caller.

No, nouveau uses the old API, so check hmm_vma_fault() within hmm.h; I have
a patch to convert it to the new API for 5.3.

>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 836adf613f81..8b6ef97a8d71 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -1092,10 +1092,8 @@ long hmm_range_fault(struct hmm_range *range, bool
> block)
>
>  	do {
>  		/* If range is no longer valid force retry. */
> -		if (!range->valid) {
> -			up_read(&hmm->mm->mmap_sem);
> +		if (!range->valid)
>  			return -EAGAIN;
> -		}
>
>  		vma = find_vma(hmm->mm, start);
>  		if (vma == NULL || (vma->vm_flags & device_vma))
On Fri, May 24, 2019 at 01:01:49PM -0400, Jerome Glisse wrote: > On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote: > > On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote: > > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote: > > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote: > > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > > > This patch series arised out of discussions with Jerome when looking at the > > > > > ODP changes, particularly informed by use after free races we have already > > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu > > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model. > > > > > > > > So the last big difference with ODP's flow is how 'range->valid' > > > > works. > > > > > > > > In ODP this was done using the rwsem umem->umem_rwsem which is > > > > obtained for read in invalidate_start and released in invalidate_end. > > > > > > > > Then any other threads that wish to only work on a umem which is not > > > > undergoing invalidation will obtain the write side of the lock, and > > > > within that lock's critical section the virtual address range is known > > > > to not be invalidating. > > > > > > > > I cannot understand how hmm gets to the same approach. It has > > > > range->valid, but it is not locked by anything that I can see, so when > > > > we test it in places like hmm_range_fault it seems useless.. > > > > > > > > Jerome, how does this work? > > > > > > > > I have a feeling we should copy the approach from ODP and use an > > > > actual lock here. > > > > > > range->valid is use as bail early if invalidation is happening in > > > hmm_range_fault() to avoid doing useless work. The synchronization > > > is explained in the documentation: > > > > That just says the hmm APIs handle locking. I asked how the apis > > implement that locking internally. > > > > Are you trying to say that if I do this, hmm will still work completely > > correctly? > > Yes it will keep working correctly. You would just be doing potentialy > useless work. I don't see how it works correctly. Apply the comment out patch I showed and this trivially happens: CPU0 CPU1 hmm_invalidate_start() ops->sync_cpu_device_pagetables() device_lock() // Wipe out page tables in device, enable faulting device_unlock() DEVICE PAGE FAULT device_lock() hmm_range_register() hmm_range_dma_map() device_unlock() hmm_invalidate_end() The mmu notifier spec says: * Invalidation of multiple concurrent ranges may be * optionally permitted by the driver. Either way the * establishment of sptes is forbidden in the range passed to * invalidate_range_begin/end for the whole duration of the * invalidate_range_begin/end critical section. And I understand "establishment of sptes is forbidden" means "hmm_range_dmap_map() must fail with EAGAIN". This is why ODP uses an actual lock held across the critical region which completely prohibits reading the CPU pages tables, or establishing new mappings. So, I still think we need a true lock, not a 'maybe valid' flag. Jason
On Fri, May 24, 2019 at 02:52:03PM -0300, Jason Gunthorpe wrote: > On Fri, May 24, 2019 at 01:01:49PM -0400, Jerome Glisse wrote: > > On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote: > > > On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote: > > > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote: > > > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote: > > > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > > > > > This patch series arised out of discussions with Jerome when looking at the > > > > > > ODP changes, particularly informed by use after free races we have already > > > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu > > > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model. > > > > > > > > > > So the last big difference with ODP's flow is how 'range->valid' > > > > > works. > > > > > > > > > > In ODP this was done using the rwsem umem->umem_rwsem which is > > > > > obtained for read in invalidate_start and released in invalidate_end. > > > > > > > > > > Then any other threads that wish to only work on a umem which is not > > > > > undergoing invalidation will obtain the write side of the lock, and > > > > > within that lock's critical section the virtual address range is known > > > > > to not be invalidating. > > > > > > > > > > I cannot understand how hmm gets to the same approach. It has > > > > > range->valid, but it is not locked by anything that I can see, so when > > > > > we test it in places like hmm_range_fault it seems useless.. > > > > > > > > > > Jerome, how does this work? > > > > > > > > > > I have a feeling we should copy the approach from ODP and use an > > > > > actual lock here. > > > > > > > > range->valid is use as bail early if invalidation is happening in > > > > hmm_range_fault() to avoid doing useless work. The synchronization > > > > is explained in the documentation: > > > > > > That just says the hmm APIs handle locking. I asked how the apis > > > implement that locking internally. > > > > > > Are you trying to say that if I do this, hmm will still work completely > > > correctly? > > > > Yes it will keep working correctly. You would just be doing potentialy > > useless work. > > I don't see how it works correctly. > > Apply the comment out patch I showed and this trivially happens: > > CPU0 CPU1 > hmm_invalidate_start() > ops->sync_cpu_device_pagetables() > device_lock() > // Wipe out page tables in device, enable faulting > device_unlock() > > DEVICE PAGE FAULT > device_lock() > hmm_range_register() > hmm_range_dma_map() > device_unlock() > hmm_invalidate_end() No in the above scenario hmm_range_register() will not mark the range as valid thus the driver will bailout after taking its lock and checking the range->valid value. > > The mmu notifier spec says: > > * Invalidation of multiple concurrent ranges may be > * optionally permitted by the driver. Either way the > * establishment of sptes is forbidden in the range passed to > * invalidate_range_begin/end for the whole duration of the > * invalidate_range_begin/end critical section. > > And I understand "establishment of sptes is forbidden" means > "hmm_range_dmap_map() must fail with EAGAIN". No it means that secondary page table entry (SPTE) must not materialize thus what hmm_range_dmap_map() is doing if fine and safe as long as the driver do not use the result to populate the device page table if there was an invalidation for the range. 
> > This is why ODP uses an actual lock held across the critical region > which completely prohibits reading the CPU pages tables, or > establishing new mappings. > > So, I still think we need a true lock, not a 'maybe valid' flag. The rational in HMM is to never block mm so that mm can always make progress as whatever mm is doing will take precedence and thus it would be useless to block mm while we do something if what we are doing is about to become invalid. Cheers, Jérôme
On Fri, May 24, 2019 at 02:03:22PM -0400, Jerome Glisse wrote: > On Fri, May 24, 2019 at 02:52:03PM -0300, Jason Gunthorpe wrote: > > On Fri, May 24, 2019 at 01:01:49PM -0400, Jerome Glisse wrote: > > > On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote: > > > > On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote: > > > > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote: > > > > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote: > > > > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > > > > > > > This patch series arised out of discussions with Jerome when looking at the > > > > > > > ODP changes, particularly informed by use after free races we have already > > > > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu > > > > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model. > > > > > > > > > > > > So the last big difference with ODP's flow is how 'range->valid' > > > > > > works. > > > > > > > > > > > > In ODP this was done using the rwsem umem->umem_rwsem which is > > > > > > obtained for read in invalidate_start and released in invalidate_end. > > > > > > > > > > > > Then any other threads that wish to only work on a umem which is not > > > > > > undergoing invalidation will obtain the write side of the lock, and > > > > > > within that lock's critical section the virtual address range is known > > > > > > to not be invalidating. > > > > > > > > > > > > I cannot understand how hmm gets to the same approach. It has > > > > > > range->valid, but it is not locked by anything that I can see, so when > > > > > > we test it in places like hmm_range_fault it seems useless.. > > > > > > > > > > > > Jerome, how does this work? > > > > > > > > > > > > I have a feeling we should copy the approach from ODP and use an > > > > > > actual lock here. > > > > > > > > > > range->valid is use as bail early if invalidation is happening in > > > > > hmm_range_fault() to avoid doing useless work. The synchronization > > > > > is explained in the documentation: > > > > > > > > That just says the hmm APIs handle locking. I asked how the apis > > > > implement that locking internally. > > > > > > > > Are you trying to say that if I do this, hmm will still work completely > > > > correctly? > > > > > > Yes it will keep working correctly. You would just be doing potentialy > > > useless work. > > > > I don't see how it works correctly. > > > > Apply the comment out patch I showed and this trivially happens: > > > > CPU0 CPU1 > > hmm_invalidate_start() > > ops->sync_cpu_device_pagetables() > > device_lock() > > // Wipe out page tables in device, enable faulting > > device_unlock() > > > > DEVICE PAGE FAULT > > device_lock() > > hmm_range_register() > > hmm_range_dma_map() > > device_unlock() > > hmm_invalidate_end() > > No in the above scenario hmm_range_register() will not mark the range > as valid thus the driver will bailout after taking its lock and checking > the range->valid value. I see your confusion, I only asked about removing valid from hmm.c, not the unlocked use of valid in your hmm.rst example. My mistake, sorry for being unclear. 
Here is the big 3 CPU ladder diagram that shows how 'valid' does not
work:

        CPU0                          CPU1                            CPU2
                               DEVICE PAGE FAULT
                               range = hmm_range_register()

// Overlaps with range
hmm_invalidate_start()
  range->valid = false
  ops->sync_cpu_device_pagetables()
    take_lock(driver->update);
    // Wipe out page tables in device, enable faulting
    release_lock(driver->update);
                                                         // Does not overlap with range
                                                         hmm_invalidate_start()
                                                         hmm_invalidate_end()
                                                           list_for_each
                                                                range->valid = true

                               device_lock()
                               // Note range->valid = true now
                               hmm_range_snapshot(&range);
                               take_lock(driver->update);
                               if (!hmm_range_valid(&range))
                                      goto again
                               ESTABLISH SPTES
                               device_unlock()
hmm_invalidate_end()

And I can make this more complicated (ie overlapping parallel
invalidates, etc) and show any 'bool' valid cannot work.

> > The mmu notifier spec says:
> >
> >  * Invalidation of multiple concurrent ranges may be
> >  * optionally permitted by the driver. Either way the
> >  * establishment of sptes is forbidden in the range passed to
> >  * invalidate_range_begin/end for the whole duration of the
> >  * invalidate_range_begin/end critical section.
> >
> > And I understand "establishment of sptes is forbidden" means
> > "hmm_range_dmap_map() must fail with EAGAIN".
>
> No it means that secondary page table entry (SPTE) must not
> materialize thus what hmm_range_dmap_map() is doing if fine and safe
> as long as the driver do not use the result to populate the device
> page table if there was an invalidation for the range.

Okay, so we agree, if there is an invalidate_start/end critical region
then it is OK to *call* hmm_range_dma_map(), however the driver must
not *use* the result, and you are expecting this bit:

  take_lock(driver->update);
  if (!hmm_range_valid(&range)) {
     goto again

in your hmm.rst to prevent the pfns from being used by the driver?

I think the above ladder shows that hmm_range_valid can return true
during an invalidate_start/end critical region, so this is a problem.

I still think the best solution is to move device_lock() into mirror
and have hmm manage it for the driver as ODP does. It is certainly the
simplest solution to understand.

Jason
On Fri, May 24, 2019 at 03:32:25PM -0300, Jason Gunthorpe wrote: > On Fri, May 24, 2019 at 02:03:22PM -0400, Jerome Glisse wrote: > > On Fri, May 24, 2019 at 02:52:03PM -0300, Jason Gunthorpe wrote: > > > On Fri, May 24, 2019 at 01:01:49PM -0400, Jerome Glisse wrote: > > > > On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote: > > > > > On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote: > > > > > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote: > > > > > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote: > > > > > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > > > > > > > > > This patch series arised out of discussions with Jerome when looking at the > > > > > > > > ODP changes, particularly informed by use after free races we have already > > > > > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu > > > > > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model. > > > > > > > > > > > > > > So the last big difference with ODP's flow is how 'range->valid' > > > > > > > works. > > > > > > > > > > > > > > In ODP this was done using the rwsem umem->umem_rwsem which is > > > > > > > obtained for read in invalidate_start and released in invalidate_end. > > > > > > > > > > > > > > Then any other threads that wish to only work on a umem which is not > > > > > > > undergoing invalidation will obtain the write side of the lock, and > > > > > > > within that lock's critical section the virtual address range is known > > > > > > > to not be invalidating. > > > > > > > > > > > > > > I cannot understand how hmm gets to the same approach. It has > > > > > > > range->valid, but it is not locked by anything that I can see, so when > > > > > > > we test it in places like hmm_range_fault it seems useless.. > > > > > > > > > > > > > > Jerome, how does this work? > > > > > > > > > > > > > > I have a feeling we should copy the approach from ODP and use an > > > > > > > actual lock here. > > > > > > > > > > > > range->valid is use as bail early if invalidation is happening in > > > > > > hmm_range_fault() to avoid doing useless work. The synchronization > > > > > > is explained in the documentation: > > > > > > > > > > That just says the hmm APIs handle locking. I asked how the apis > > > > > implement that locking internally. > > > > > > > > > > Are you trying to say that if I do this, hmm will still work completely > > > > > correctly? > > > > > > > > Yes it will keep working correctly. You would just be doing potentialy > > > > useless work. > > > > > > I don't see how it works correctly. > > > > > > Apply the comment out patch I showed and this trivially happens: > > > > > > CPU0 CPU1 > > > hmm_invalidate_start() > > > ops->sync_cpu_device_pagetables() > > > device_lock() > > > // Wipe out page tables in device, enable faulting > > > device_unlock() > > > > > > DEVICE PAGE FAULT > > > device_lock() > > > hmm_range_register() > > > hmm_range_dma_map() > > > device_unlock() > > > hmm_invalidate_end() > > > > No in the above scenario hmm_range_register() will not mark the range > > as valid thus the driver will bailout after taking its lock and checking > > the range->valid value. > > I see your confusion, I only asked about removing valid from hmm.c, > not the unlocked use of valid in your hmm.rst example. My mistake, > sorry for being unclear. 
No i did understand properly and it is fine to remove all the valid check within hmm_range_fault() or hmm_range_snapshot() nothing bad will come out of that. > > Here is the big 3 CPU ladder diagram that shows how 'valid' does not > work: > > CPU0 CPU1 CPU2 > DEVICE PAGE FAULT > range = hmm_range_register() > > // Overlaps with range > hmm_invalidate_start() > range->valid = false > ops->sync_cpu_device_pagetables() > take_lock(driver->update); > // Wipe out page tables in device, enable faulting > release_lock(driver->update); > // Does not overlap with range > hmm_invalidate_start() > hmm_invalidate_end() > list_for_each > range->valid = true ^ No this can not happen because CPU0 still has invalidate_range in progress and thus hmm->notifiers > 0 so the hmm_invalidate_range_end() will not set the range->valid as true. > > > device_lock() > // Note range->valid = true now > hmm_range_snapshot(&range); > take_lock(driver->update); > if (!hmm_range_valid(&range)) > goto again > ESTABLISHE SPTES > device_unlock() > hmm_invalidate_end() > > > And I can make this more complicated (ie overlapping parallel > invalidates, etc) and show any 'bool' valid cannot work. It does work. If you want i can remove the range->valid = true from the hmm_invalidate_range_end() and move it within hmm_range_wait_until_valid() ie modifying the hmm_range_wait_until_valid() logic, this might look cleaner. > > > The mmu notifier spec says: > > > > > > * Invalidation of multiple concurrent ranges may be > > > * optionally permitted by the driver. Either way the > > > * establishment of sptes is forbidden in the range passed to > > > * invalidate_range_begin/end for the whole duration of the > > > * invalidate_range_begin/end critical section. > > > > > > And I understand "establishment of sptes is forbidden" means > > > "hmm_range_dmap_map() must fail with EAGAIN". > > > > No it means that secondary page table entry (SPTE) must not > > materialize thus what hmm_range_dmap_map() is doing if fine and safe > > as long as the driver do not use the result to populate the device > > page table if there was an invalidation for the range. > > Okay, so we agree, if there is an invalidate_start/end critical region > then it is OK to *call* hmm_range_dmap_map(), however the driver must > not *use* the result, and you are expecting this bit: > > take_lock(driver->update); > if (!hmm_range_valid(&range)) { > goto again > > In your hmm.rst to prevent the pfns from being used by the driver? > > I think the above ladder shows that hmm_range_valid can return true > during a invalidate_start/end critical region, so this is a problem. > > I still think the best solution is to move device_lock() into mirror > and have hmm manage it for the driver as ODP does. It is certainly the > simplest solution to understand. It is un-efficient and would block further than needed forward progress by mm code. Cheers, Jérôme
On Fri, May 24, 2019 at 02:46:08PM -0400, Jerome Glisse wrote: > > Here is the big 3 CPU ladder diagram that shows how 'valid' does not > > work: > > > > CPU0 CPU1 CPU2 > > DEVICE PAGE FAULT > > range = hmm_range_register() > > > > // Overlaps with range > > hmm_invalidate_start() > > range->valid = false > > ops->sync_cpu_device_pagetables() > > take_lock(driver->update); > > // Wipe out page tables in device, enable faulting > > release_lock(driver->update); > > // Does not overlap with range > > hmm_invalidate_start() > > hmm_invalidate_end() > > list_for_each > > range->valid = true > > ^ > No this can not happen because CPU0 still has invalidate_range in progress and > thus hmm->notifiers > 0 so the hmm_invalidate_range_end() will not set the > range->valid as true. Oh, Okay, I now see how this all works, thank you > > And I can make this more complicated (ie overlapping parallel > > invalidates, etc) and show any 'bool' valid cannot work. > > It does work. Well, I ment the bool alone cannot work, but this is really bool + a counter. > If you want i can remove the range->valid = true from the > hmm_invalidate_range_end() and move it within hmm_range_wait_until_valid() > ie modifying the hmm_range_wait_until_valid() logic, this might look > cleaner. Let me reflect on it for a bit. I have to say I don't like the clarity here, and I don't like the valid=true loop in the invalidate_end, it is pretty clunky. I'm thinking a more obvious API for drivers, as something like: again: hmm_range_start(); [..] if (hmm_range_test_retry()) goto again driver_lock() if (hmm_range_end()) goto again driver_unlock(); Just because it makes it very clear to the driver author what to do and how this is working, and makes it clear that there is no such thing as 'valid' - what we *really* have is a locking collision forcing retry. ie this is really closer to a seq-lock scheme, not a valid/invalid scheme. Being able to explain the concept does matter for maintainability... And I'm thinking the above API design would comfortably support a more efficient seq-lock like approach without the loop in invalidate_end.. But I haven't quite thought it all through yet. Next week! > > I still think the best solution is to move device_lock() into mirror > > and have hmm manage it for the driver as ODP does. It is certainly the > > simplest solution to understand. > > It is un-efficient and would block further than needed forward progress > by mm code. I'm not sure how you get to that, we already have the device_lock() and it already blocks forward progress by mm code. Really the big unfortunate thing here is that valid is manipulated outside the device_lock, but really, logically, it is covered under the device_lock Jason
On Fri, May 24, 2019 at 07:09:23PM -0300, Jason Gunthorpe wrote: > On Fri, May 24, 2019 at 02:46:08PM -0400, Jerome Glisse wrote: > > > Here is the big 3 CPU ladder diagram that shows how 'valid' does not > > > work: > > > > > > CPU0 CPU1 CPU2 > > > DEVICE PAGE FAULT > > > range = hmm_range_register() > > > > > > // Overlaps with range > > > hmm_invalidate_start() > > > range->valid = false > > > ops->sync_cpu_device_pagetables() > > > take_lock(driver->update); > > > // Wipe out page tables in device, enable faulting > > > release_lock(driver->update); > > > // Does not overlap with range > > > hmm_invalidate_start() > > > hmm_invalidate_end() > > > list_for_each > > > range->valid = true > > > > ^ > > No this can not happen because CPU0 still has invalidate_range in progress and > > thus hmm->notifiers > 0 so the hmm_invalidate_range_end() will not set the > > range->valid as true. > > Oh, Okay, I now see how this all works, thank you > > > > And I can make this more complicated (ie overlapping parallel > > > invalidates, etc) and show any 'bool' valid cannot work. > > > > It does work. > > Well, I ment the bool alone cannot work, but this is really bool + a > counter. I couldn't shake this unease that bool shouldn't work for this type of locking, especially since odp also used a sequence lock as well as the rwsem... What about this situation: CPU0 CPU1 DEVICE PAGE FAULT range = hmm_range_register() hmm_range_snapshot(&range); // Overlaps with range hmm_invalidate_start() range->valid = false ops->sync_cpu_device_pagetables() take_lock(driver->update); // Wipe out page tables in device, enable faulting release_lock(driver->update); hmm_invalidate_end() range->valid = true take_lock(driver->update); if (!hmm_range_valid(&range)) goto again ESTABLISH SPTES release_lock(driver->update); The ODP patch appears to follow this pattern as the dma_map and the mlx5_ib_update_xlt are in different locking regions. We should dump the result of rmm_range_snapshot in this case, certainly the driver shouldn't be tasked with fixing this.. So, something like this is what I'm thinking about: From 41b6a6120e30978e7335ada04fb9168db4e5fd29 Mon Sep 17 00:00:00 2001 From: Jason Gunthorpe <jgg@mellanox.com> Date: Mon, 27 May 2019 16:48:53 -0300 Subject: [PATCH] RFC mm/hmm: Replace the range->valid with a seqcount Instead of trying to use a single valid to keep track of parallel invalidations use a traditional seqcount retry lock. Replace the range->valid with the concept of a 'update critical region' bounded by hmm_range_start_update() / hmm_range_end_update() which can fail and need retry if it is ever interrupted by a parallel invalidation. Updaters must create the critical section and can only finish their update while holding the device_lock. Continue to take a very loose approach to track invalidation, now with a single global seqcount for all ranges. This is done to minimize the overhead in the mmu notifier, and expects there will only be a small number of ranges active at once. It could be converted to a seqcount per range if necessary. 
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> --- Documentation/vm/hmm.rst | 22 +++++-------- include/linux/hmm.h | 60 ++++++++++++++++++++++++--------- mm/hmm.c | 71 ++++++++++++++++++++++------------------ 3 files changed, 93 insertions(+), 60 deletions(-) diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst index 7c1e929931a07f..7e827058964579 100644 --- a/Documentation/vm/hmm.rst +++ b/Documentation/vm/hmm.rst @@ -229,32 +229,27 @@ The usage pattern is:: * will use the return value of hmm_range_snapshot() below under the * mmap_sem to ascertain the validity of the range. */ - hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC); - again: + if (!hmm_range_start_update(&range, TIMEOUT_IN_MSEC)) + goto err + down_read(&mm->mmap_sem); ret = hmm_range_snapshot(&range); if (ret) { up_read(&mm->mmap_sem); - if (ret == -EAGAIN) { - /* - * No need to check hmm_range_wait_until_valid() return value - * on retry we will get proper error with hmm_range_snapshot() - */ - hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC); - goto again; - } + if (ret == -EAGAIN) + goto again; hmm_mirror_unregister(&range); return ret; } take_lock(driver->update); - if (!hmm_range_valid(&range)) { + if (!hmm_range_end_update(&range)) { release_lock(driver->update); up_read(&mm->mmap_sem); goto again; } - // Use pfns array content to update device page table + // Use pfns array content to update device page table, must hold driver->update hmm_mirror_unregister(&range); release_lock(driver->update); @@ -264,7 +259,8 @@ The usage pattern is:: The driver->update lock is the same lock that the driver takes inside its sync_cpu_device_pagetables() callback. That lock must be held before calling -hmm_range_valid() to avoid any race with a concurrent CPU page table update. +hmm_range_end_update() to avoid any race with a concurrent CPU page table +update. 
HMM implements all this on top of the mmu_notifier API because we wanted a simpler API and also to be able to perform optimizations latter on like doing diff --git a/include/linux/hmm.h b/include/linux/hmm.h index 26dfd9377b5094..9096113cfba8de 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -90,7 +90,9 @@ * @mmu_notifier: mmu notifier to track updates to CPU page table * @mirrors_sem: read/write semaphore protecting the mirrors list * @wq: wait queue for user waiting on a range invalidation - * @notifiers: count of active mmu notifiers + * @active_invalidates: count of active mmu notifier invalidations + * @range_invalidated: seqcount indicating that an active range was + * maybe invalidated */ struct hmm { struct mm_struct *mm; @@ -102,7 +104,8 @@ struct hmm { struct rw_semaphore mirrors_sem; wait_queue_head_t wq; struct rcu_head rcu; - long notifiers; + unsigned int active_invalidates; + seqcount_t range_invalidated; }; /* @@ -169,7 +172,7 @@ enum hmm_pfn_value_e { * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter * @page_shift: device virtual address shift value (should be >= PAGE_SHIFT) * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT) - * @valid: pfns array did not change since it has been fill by an HMM function + * @update_seq: sequence number for the seqcount lock read side */ struct hmm_range { struct hmm *hmm; @@ -184,7 +187,7 @@ struct hmm_range { uint64_t pfn_flags_mask; uint8_t page_shift; uint8_t pfn_shift; - bool valid; + unsigned int update_seq; }; /* @@ -208,27 +211,52 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range) } /* - * hmm_range_wait_until_valid() - wait for range to be valid + * hmm_range_start_update() - wait for range to be valid * @range: range affected by invalidation to wait on * @timeout: time out for wait in ms (ie abort wait after that period of time) * Return: true if the range is valid, false otherwise. */ -static inline bool hmm_range_wait_until_valid(struct hmm_range *range, - unsigned long timeout) +// FIXME: hmm should handle the timeout for the driver too. +static inline unsigned int hmm_range_start_update(struct hmm_range *range, + unsigned long timeout) { - wait_event_timeout(range->hmm->wq, range->valid, + wait_event_timeout(range->hmm->wq, + READ_ONCE(range->hmm->active_invalidates) == 0, msecs_to_jiffies(timeout)); - return READ_ONCE(range->valid); + + // FIXME: wants a non-raw seq helper + seqcount_lockdep_reader_access(&range->hmm->range_invalidated); + range->update_seq = raw_seqcount_begin(&range->hmm->range_invalidated); + return !read_seqcount_retry(&range->hmm->range_invalidated, + range->update_seq); } /* - * hmm_range_valid() - test if a range is valid or not + * hmm_range_needs_retry() - test if a range that has begun update + * via hmm_range_start_update() needs to be retried. * @range: range - * Return: true if the range is valid, false otherwise. + * Return: true if the update needs to be restarted from hmm_range_start_update(), + * false otherwise. + */ +static inline bool hmm_range_needs_retry(struct hmm_range *range) +{ + return read_seqcount_retry(&range->hmm->range_invalidated, + range->update_seq); +} + +/* + * hmm_range_end_update() - finish an update of a range + * @range: range + * + * This must only be called while holding the device lock as described in + * hmm.rst, and must be called before making any of the update visible. 
+ *
+ * Return: true if the update was not interrupted by an invalidation of the
+ * covered virtual memory range, false if the update needs to be retried.
  */
-static inline bool hmm_range_valid(struct hmm_range *range)
+static inline bool hmm_range_end_update(struct hmm_range *range)
 {
-	return range->valid;
+	return !hmm_range_needs_retry(range);
 }

 /*
@@ -511,7 +539,7 @@ static inline int hmm_mirror_range_register(struct hmm_range *range,
 /* This is a temporary helper to avoid merge conflict between trees. */
 static inline bool hmm_vma_range_done(struct hmm_range *range)
 {
-	bool ret = hmm_range_valid(range);
+	bool ret = !hmm_range_needs_retry(range);

 	hmm_range_unregister(range);
 	return ret;
@@ -537,7 +565,7 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 	if (ret)
 		return (int)ret;

-	if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
+	if (!hmm_range_start_update(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
 		hmm_range_unregister(range);
 		/*
 		 * The mmap_sem was taken by driver we release it here and
@@ -549,6 +577,8 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 	}

 	ret = hmm_range_fault(range, block);
+	if (!hmm_range_end_update(range))
+		ret = -EAGAIN;
 	if (ret <= 0) {
 		hmm_range_unregister(range);
 		if (ret == -EBUSY || !ret) {
diff --git a/mm/hmm.c b/mm/hmm.c
index 8396a65710e304..09725774ff6112 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -79,7 +79,8 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	INIT_LIST_HEAD(&hmm->ranges);
 	mutex_init(&hmm->lock);
 	kref_init(&hmm->kref);
-	hmm->notifiers = 0;
+	hmm->active_invalidates = 0;
+	seqcount_init(&hmm->range_invalidated);
 	hmm->mm = mm;
 	mmgrab(mm);

@@ -155,13 +156,22 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	hmm_put(hmm);
 }

+static bool any_range_overlaps(struct hmm *hmm, unsigned long start, unsigned long end)
+{
+	struct hmm_range *range;
+
+	list_for_each_entry(range, &hmm->ranges, list)
+		// FIXME: check me
+		if (range->start <= end && range->end < start)
+			return true;
+	return false;
+}
 static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *nrange)
 {
 	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 	struct hmm_mirror *mirror;
 	struct hmm_update update;
-	struct hmm_range *range;
 	int ret = 0;

 	/* hmm is in progress to free */
@@ -179,13 +189,22 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 		ret = -EAGAIN;
 		goto out;
 	}
-	hmm->notifiers++;
-	list_for_each_entry(range, &hmm->ranges, list) {
-		if (update.end < range->start || update.start >= range->end)
-			continue;
-
-		range->valid = false;
-	}
+	/*
+	 * The seqcount is used as a fast but inaccurate indication that
+	 * another CPU working with a range needs to retry due to invalidation
+	 * of the same virtual address space covered by the range by this
+	 * CPU.
+	 *
+	 * It is based on the assumption that the ranges will be short lived,
+	 * so there is no need to be aggressively accurate in the mmu notifier
+	 * fast path. Any notifier intersection will cause all registered
+	 * ranges to retry.
+	 */
+	hmm->active_invalidates++;
+	// FIXME: needs a seqcount helper
+	if (!(hmm->range_invalidated.sequence & 1) &&
+	    any_range_overlaps(hmm, update.start, update.end))
+		write_seqcount_begin(&hmm->range_invalidated);
 	mutex_unlock(&hmm->lock);

 	if (mmu_notifier_range_blockable(nrange))
@@ -218,15 +237,11 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 		return;

 	mutex_lock(&hmm->lock);
-	hmm->notifiers--;
-	if (!hmm->notifiers) {
-		struct hmm_range *range;
-
-		list_for_each_entry(range, &hmm->ranges, list) {
-			if (range->valid)
-				continue;
-			range->valid = true;
-		}
+	hmm->active_invalidates--;
+	if (hmm->active_invalidates == 0) {
+		// FIXME: needs a seqcount helper
+		if (hmm->range_invalidated.sequence & 1)
+			write_seqcount_end(&hmm->range_invalidated);
 		wake_up_all(&hmm->wq);
 	}
 	mutex_unlock(&hmm->lock);
@@ -882,7 +897,7 @@ int hmm_range_register(struct hmm_range *range,
 {
 	unsigned long mask = ((1UL << page_shift) - 1UL);

-	range->valid = false;
+	range->update_seq = 0;
 	range->hmm = NULL;

 	if ((start & mask) || (end & mask))
@@ -908,15 +923,7 @@ int hmm_range_register(struct hmm_range *range,

 	/* Initialize range to track CPU page table updates. */
 	mutex_lock(&range->hmm->lock);
-
 	list_add(&range->list, &range->hmm->ranges);
-
-	/*
-	 * If there are any concurrent notifiers we have to wait for them for
-	 * the range to be valid (see hmm_range_wait_until_valid()).
-	 */
-	if (!range->hmm->notifiers)
-		range->valid = true;
 	mutex_unlock(&range->hmm->lock);

 	return 0;
@@ -947,7 +954,6 @@ void hmm_range_unregister(struct hmm_range *range)
 	hmm_put(hmm);

 	/* The range is now invalid, leave it poisoned. */
-	range->valid = false;
 	memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
 }
 EXPORT_SYMBOL(hmm_range_unregister);
@@ -981,7 +987,7 @@ long hmm_range_snapshot(struct hmm_range *range)

 	do {
 		/* If range is no longer valid force retry. */
-		if (!range->valid)
+		if (hmm_range_needs_retry(range))
 			return -EAGAIN;

 		vma = find_vma(hmm->mm, start);
@@ -1080,7 +1086,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)

 	do {
 		/* If range is no longer valid force retry. */
-		if (!range->valid) {
+		if (hmm_range_needs_retry(range)) {
 			up_read(&hmm->mm->mmap_sem);
 			return -EAGAIN;
 		}
@@ -1134,7 +1140,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 		start = hmm_vma_walk.last;

 		/* Keep trying while the range is valid. */
-	} while (ret == -EBUSY && range->valid);
+	} while (ret == -EBUSY && !hmm_range_needs_retry(range));

 	if (ret) {
 		unsigned long i;
@@ -1195,7 +1201,8 @@ long hmm_range_dma_map(struct hmm_range *range,
 			continue;

 		/* Check if range is being invalidated */
-		if (!range->valid) {
+		if (hmm_range_needs_retry(range)) {
+			// ?? EAGAIN??
 			ret = -EBUSY;
 			goto unmap;
 		}
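The change above leans on the standard seqcount pattern: a writer makes the sequence odd around an invalidation, and a reader redoes its work if the sequence it sampled has moved. Below is a minimal, self-contained sketch of that pattern for reference only. The names example_seq, example_invalidate() and example_snapshot() are illustrative and do not appear in the patch; the patch also differs in that its writer is serialized by hmm->lock and its readers do not spin in read_seqcount_begin() but instead sleep on hmm->wq and then sample the counter with raw_seqcount_begin().

/* Illustrative sketch only -- not part of the patch. */
#include <linux/seqlock.h>
#include <linux/spinlock.h>

static seqcount_t example_seq = SEQCNT_ZERO(example_seq);
static DEFINE_SPINLOCK(example_lock);	/* serializes writers */

/* Invalidation (writer) side: the sequence is odd while the CPU page
 * tables are changing, and even again once the change is complete. */
static void example_invalidate(void)
{
	spin_lock(&example_lock);
	write_seqcount_begin(&example_seq);
	/* ... tear down the mapping here ... */
	write_seqcount_end(&example_seq);
	spin_unlock(&example_lock);
}

/* Snapshot (reader) side: redo the work whenever it overlapped a
 * writer, detected by the sequence having changed since it was read. */
static void example_snapshot(void)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&example_seq);
		/* ... fill the pfns array / device page table here ... */
	} while (read_seqcount_retry(&example_seq, seq));
}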
From: Jason Gunthorpe <jgg@mellanox.com>

Jason Gunthorpe (11):
  mm/hmm: Fix use after free with struct hmm in the mmu notifiers
  mm/hmm: Use hmm_mirror not mm as an argument for hmm_register_range
  mm/hmm: Hold a mmgrab from hmm to mm
  mm/hmm: Simplify hmm_get_or_create and make it reliable
  mm/hmm: Improve locking around hmm->dead
  mm/hmm: Remove duplicate condition test before wait_event_timeout
  mm/hmm: Delete hmm_mirror_mm_is_alive()
  mm/hmm: Use lockdep instead of comments
  mm/hmm: Remove racy protection against double-unregistration
  mm/hmm: Poison hmm_range during unregister
  mm/hmm: Do not use list*_rcu() for hmm->ranges

 include/linux/hmm.h |  50 ++----------
 kernel/fork.c       |   1 -
 mm/hmm.c            | 184 +++++++++++++++++++------------------------
 3 files changed, 88 insertions(+), 147 deletions(-)