
[RFC,00/11] mm/hmm: Various revisions from a locking/code review

Message ID 20190523153436.19102-1-jgg@ziepe.ca (mailing list archive)

Message

Jason Gunthorpe May 23, 2019, 3:34 p.m. UTC
From: Jason Gunthorpe <jgg@mellanox.com>

This patch series arose out of discussions with Jerome when looking at the
ODP changes, particularly informed by use-after-free races we have already
found and fixed in the ODP code (thanks to syzkaller) working with mmu
notifiers, and by the discussion with Ralph on how to resolve the lifetime model.

Overall this brings in a simplified locking scheme and an easy-to-explain
lifetime model:

 If a hmm_range is valid, then the hmm is valid; if a hmm is valid, then the mm
 is allocated memory.

 If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
 then an mmget reference must be obtained via mmget_not_zero().
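
A minimal sketch of how a driver would follow that rule (illustrative only,
not part of this series; the helper below is hypothetical, and it assumes the
mirror->hmm->mm chain as structured by this series):

 static int driver_walk_something(struct hmm_mirror *mirror)
 {
        struct mm_struct *mm = mirror->hmm->mm;

        /*
         * The hmm only guarantees the mm memory stays allocated (mmgrab),
         * so any path that needs a live mm must take an mm_users reference.
         */
        if (!mmget_not_zero(mm))
                return -ESRCH;          /* mm is already being torn down */

        down_read(&mm->mmap_sem);
        /* ... find_vma(), hmm_range_snapshot(), etc ... */
        up_read(&mm->mmap_sem);

        mmput(mm);
        return 0;
 }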

Locking of mm->hmm is shifted to use the mmap_sem consistently for all
reads and writes, and unlocked accesses are removed.

The unlocked reads of 'hmm->dead' are also eliminated in favour of using
standard mmget() locking to prevent the mm from being released. Many of the
debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison,
which is much clearer as to the lifetime intent.

The trailing patches are just some random cleanups I noticed when reviewing
this code.

I expect Jerome & Ralph will have some design notes, so this is just an RFC,
and it still needs a matching edit to nouveau. It is only compile tested.

Regards,
Jason

Jason Gunthorpe (11):
  mm/hmm: Fix use after free with struct hmm in the mmu notifiers
  mm/hmm: Use hmm_mirror not mm as an argument for hmm_register_range
  mm/hmm: Hold a mmgrab from hmm to mm
  mm/hmm: Simplify hmm_get_or_create and make it reliable
  mm/hmm: Improve locking around hmm->dead
  mm/hmm: Remove duplicate condition test before wait_event_timeout
  mm/hmm: Delete hmm_mirror_mm_is_alive()
  mm/hmm: Use lockdep instead of comments
  mm/hmm: Remove racy protection against double-unregistration
  mm/hmm: Poison hmm_range during unregister
  mm/hmm: Do not use list*_rcu() for hmm->ranges

 include/linux/hmm.h |  50 ++----------
 kernel/fork.c       |   1 -
 mm/hmm.c            | 184 +++++++++++++++++++-------------------------
 3 files changed, 88 insertions(+), 147 deletions(-)

Comments

John Hubbard May 23, 2019, 7:04 p.m. UTC | #1
On 5/23/19 8:34 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> This patch series arised out of discussions with Jerome when looking at the
> ODP changes, particularly informed by use after free races we have already
> found and fixed in the ODP code (thanks to syzkaller) working with mmu
> notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> 
> Overall this brings in a simplified locking scheme and easy to explain
> lifetime model:
> 
>   If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
>   is allocated memory.
> 
>   If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
>   then the mmget must be obtained via mmget_not_zero().
> 
> Locking of mm->hmm is shifted to use the mmap_sem consistently for all
> read/write and unlocked accesses are removed.
> 
> The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
> standard mmget() locking to prevent the mm from being released. Many of the
> debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
> which is much clearer as to the lifetime intent.
> 
> The trailing patches are just some random cleanups I noticed when reviewing
> this code.
> 
> I expect Jerome & Ralph will have some design notes so this is just RFC, and
> it still needs a matching edit to nouveau. It is only compile tested.
> 

Thanks so much for doing this. Jerome has already absorbed these into his
hmm-5.3 branch, along with Ralph's other fixes, so we can start testing,
as well as reviewing, the whole set. We'll have feedback soon.


thanks,
Jason Gunthorpe May 23, 2019, 7:37 p.m. UTC | #2
On Thu, May 23, 2019 at 12:04:16PM -0700, John Hubbard wrote:
> On 5/23/19 8:34 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > This patch series arised out of discussions with Jerome when looking at the
> > ODP changes, particularly informed by use after free races we have already
> > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > 
> > Overall this brings in a simplified locking scheme and easy to explain
> > lifetime model:
> > 
> >   If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
> >   is allocated memory.
> > 
> >   If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
> >   then the mmget must be obtained via mmget_not_zero().
> > 
> > Locking of mm->hmm is shifted to use the mmap_sem consistently for all
> > read/write and unlocked accesses are removed.
> > 
> > The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
> > standard mmget() locking to prevent the mm from being released. Many of the
> > debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
> > which is much clearer as to the lifetime intent.
> > 
> > The trailing patches are just some random cleanups I noticed when reviewing
> > this code.
> > 
> > I expect Jerome & Ralph will have some design notes so this is just RFC, and
> > it still needs a matching edit to nouveau. It is only compile tested.
> > 
> 
> Thanks so much for doing this. Jerome has already absorbed these into his
> hmm-5.3 branch, along with Ralph's other fixes, so we can start testing,
> as well as reviewing, the whole set. We'll have feedback soon.

Yes, I looked at Jerome's v2s and he found a few great fixups.

My only dislike is re-introducing a READ_ONCE(mm->hmm) when a major
point of this series was to remove that use-after-free stuff.

But Jerome says it is a temporary defect while he works out some cross
tree API stuff.

Jason
Jerome Glisse May 23, 2019, 8:59 p.m. UTC | #3
On Thu, May 23, 2019 at 12:04:16PM -0700, John Hubbard wrote:
> On 5/23/19 8:34 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > This patch series arised out of discussions with Jerome when looking at the
> > ODP changes, particularly informed by use after free races we have already
> > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > 
> > Overall this brings in a simplified locking scheme and easy to explain
> > lifetime model:
> > 
> >   If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
> >   is allocated memory.
> > 
> >   If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
> >   then the mmget must be obtained via mmget_not_zero().
> > 
> > Locking of mm->hmm is shifted to use the mmap_sem consistently for all
> > read/write and unlocked accesses are removed.
> > 
> > The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
> > standard mmget() locking to prevent the mm from being released. Many of the
> > debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
> > which is much clearer as to the lifetime intent.
> > 
> > The trailing patches are just some random cleanups I noticed when reviewing
> > this code.
> > 
> > I expect Jerome & Ralph will have some design notes so this is just RFC, and
> > it still needs a matching edit to nouveau. It is only compile tested.
> > 
> 
> Thanks so much for doing this. Jerome has already absorbed these into his
> hmm-5.3 branch, along with Ralph's other fixes, so we can start testing,
> as well as reviewing, the whole set. We'll have feedback soon.
> 

I force-pushed an updated branch with a couple of fixes:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.3

Seems to work ok so far, still doing testing.

Cheers,
Jérôme
Jason Gunthorpe May 24, 2019, 1:35 p.m. UTC | #4
On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> This patch series arised out of discussions with Jerome when looking at the
> ODP changes, particularly informed by use after free races we have already
> found and fixed in the ODP code (thanks to syzkaller) working with mmu
> notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> 
> Overall this brings in a simplified locking scheme and easy to explain
> lifetime model:
> 
>  If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
>  is allocated memory.
> 
>  If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
>  then the mmget must be obtained via mmget_not_zero().
> 
> Locking of mm->hmm is shifted to use the mmap_sem consistently for all
> read/write and unlocked accesses are removed.
> 
> The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
> standard mmget() locking to prevent the mm from being released. Many of the
> debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
> which is much clearer as to the lifetime intent.
> 
> The trailing patches are just some random cleanups I noticed when reviewing
> this code.
> 
> I expect Jerome & Ralph will have some design notes so this is just RFC, and
> it still needs a matching edit to nouveau. It is only compile tested.
> 
> Regards,
> Jason
> 
> Jason Gunthorpe (11):
>   mm/hmm: Fix use after free with struct hmm in the mmu notifiers
>   mm/hmm: Use hmm_mirror not mm as an argument for hmm_register_range
>   mm/hmm: Hold a mmgrab from hmm to mm
>   mm/hmm: Simplify hmm_get_or_create and make it reliable
>   mm/hmm: Improve locking around hmm->dead
>   mm/hmm: Remove duplicate condition test before wait_event_timeout
>   mm/hmm: Delete hmm_mirror_mm_is_alive()
>   mm/hmm: Use lockdep instead of comments
>   mm/hmm: Remove racy protection against double-unregistration
>   mm/hmm: Poison hmm_range during unregister
>   mm/hmm: Do not use list*_rcu() for hmm->ranges
> 
>  include/linux/hmm.h |  50 ++----------
>  kernel/fork.c       |   1 -
>  mm/hmm.c            | 184 +++++++++++++++++++-------------------------
>  3 files changed, 88 insertions(+), 147 deletions(-)

Jerome, I was doing some more checking of this and noticed the lockdep
asserts are not compile-tested when lockdep is turned off. Since you took
and revised the series, can you please fold in these hunks to fix the
compile failures seen with lockdep on? Thanks

commit f0653c4d4c1dadeaf58d49f1c949ab1d2fda05d3
diff --git a/mm/hmm.c b/mm/hmm.c
index 836adf613f81c8..2a08b78550b90d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -56,7 +56,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 {
 	struct hmm *hmm;
 
-	lockdep_assert_held_exclusive(mm->mmap_sem);
+	lockdep_assert_held_exclusive(&mm->mmap_sem);
 
 	if (mm->hmm) {
 		if (kref_get_unless_zero(&mm->hmm->kref))
@@ -262,7 +262,7 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
  */
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
 {
-	lockdep_assert_held_exclusive(mm->mmap_sem);
+	lockdep_assert_held_exclusive(&mm->mmap_sem);
 
 	/* Sanity check */
 	if (!mm || !mirror || !mirror->ops)
@@ -987,7 +987,7 @@ long hmm_range_snapshot(struct hmm_range *range)
 	struct mm_walk mm_walk;
 
 	/* Caller must hold the mmap_sem, and range hold a reference on mm. */
-	lockdep_assert_held(hmm->mm->mmap_sem);
+	lockdep_assert_held(&hmm->mm->mmap_sem);
 	if (WARN_ON(!atomic_read(&hmm->mm->mm_users)))
 		return -EINVAL;
 
@@ -1086,7 +1086,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 	int ret;
 
 	/* Caller must hold the mmap_sem, and range hold a reference on mm. */
-	lockdep_assert_held(hmm->mm->mmap_sem);
+	lockdep_assert_held(&hmm->mm->mmap_sem);
 	if (WARN_ON(!atomic_read(&hmm->mm->mm_users)))
 		return -EINVAL;
Jason Gunthorpe May 24, 2019, 2:36 p.m. UTC | #5
On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> This patch series arised out of discussions with Jerome when looking at the
> ODP changes, particularly informed by use after free races we have already
> found and fixed in the ODP code (thanks to syzkaller) working with mmu
> notifiers, and the discussion with Ralph on how to resolve the lifetime model.

So the last big difference with ODP's flow is how 'range->valid'
works.

In ODP this was done using the rwsem umem->umem_rwsem which is
obtained for read in invalidate_start and released in invalidate_end.

Then any other threads that wish to only work on a umem which is not
undergoing invalidation will obtain the write side of the lock, and
within that lock's critical section the virtual address range is known
to not be invalidating.
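
Roughly, the scheme described above looks like this (a simplified sketch, not
the actual ib_umem_odp code; the struct and function names here are made up):

 struct odp_like_umem {
        struct rw_semaphore umem_rwsem;
        /* interval tree / device page table state ... */
 };

 /* notifier side: the read side is held across the whole critical section */
 static void example_invalidate_start(struct odp_like_umem *umem,
                                      unsigned long start, unsigned long end)
 {
        down_read(&umem->umem_rwsem);
        /* zap the device SPTEs covering [start, end) */
 }

 static void example_invalidate_end(struct odp_like_umem *umem)
 {
        up_read(&umem->umem_rwsem);
 }

 /* fault/populate side: the write side excludes any running invalidation */
 static int example_populate(struct odp_like_umem *umem,
                             unsigned long start, unsigned long end)
 {
        down_write(&umem->umem_rwsem);
        /*
         * No invalidate_start/end critical section can be in progress here,
         * so it is safe to read the CPU page tables and establish device
         * SPTEs for [start, end).
         */
        up_write(&umem->umem_rwsem);
        return 0;
 }

Multiple invalidations can run concurrently on the read side, but the
fault/populate path is serialized against all of them on the write side.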

I cannot understand how hmm gets to the same approach. It has
range->valid, but it is not locked by anything that I can see, so when
we test it in places like hmm_range_fault it seems useless..

Jerome, how does this work?

I have a feeling we should copy the approach from ODP and use an
actual lock here.

Jason
Jerome Glisse May 24, 2019, 4:49 p.m. UTC | #6
On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > This patch series arised out of discussions with Jerome when looking at the
> > ODP changes, particularly informed by use after free races we have already
> > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> 
> So the last big difference with ODP's flow is how 'range->valid'
> works.
> 
> In ODP this was done using the rwsem umem->umem_rwsem which is
> obtained for read in invalidate_start and released in invalidate_end.
> 
> Then any other threads that wish to only work on a umem which is not
> undergoing invalidation will obtain the write side of the lock, and
> within that lock's critical section the virtual address range is known
> to not be invalidating.
> 
> I cannot understand how hmm gets to the same approach. It has
> range->valid, but it is not locked by anything that I can see, so when
> we test it in places like hmm_range_fault it seems useless..
> 
> Jerome, how does this work?
> 
> I have a feeling we should copy the approach from ODP and use an
> actual lock here.

range->valid is used as an early bail-out if an invalidation is happening
in hmm_range_fault(), to avoid doing useless work. The synchronization is
explained in the documentation:


Locking within the sync_cpu_device_pagetables() callback is the most important
aspect the driver must respect in order to keep things properly synchronized.
The usage pattern is::

 int driver_populate_range(...)
 {
      struct hmm_range range;
      ...

      range.start = ...;
      range.end = ...;
      range.pfns = ...;
      range.flags = ...;
      range.values = ...;
      range.pfn_shift = ...;
      hmm_range_register(&range);

      /*
       * Just wait for range to be valid, safe to ignore return value as we
       * will use the return value of hmm_range_snapshot() below under the
       * mmap_sem to ascertain the validity of the range.
       */
      hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);

 again:
      down_read(&mm->mmap_sem);
      ret = hmm_range_snapshot(&range);
      if (ret) {
          up_read(&mm->mmap_sem);
          if (ret == -EAGAIN) {
            /*
             * No need to check hmm_range_wait_until_valid() return value
             * on retry we will get proper error with hmm_range_snapshot()
             */
            hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
            goto again;
          }
          hmm_range_unregister(&range);
          return ret;
      }
      take_lock(driver->update);
      if (!hmm_range_valid(&range)) {
          release_lock(driver->update);
          up_read(&mm->mmap_sem);
          goto again;
      }

      // Use pfns array content to update device page table

      hmm_range_unregister(&range);
      release_lock(driver->update);
      up_read(&mm->mmap_sem);
      return 0;
 }

The driver->update lock is the same lock that the driver takes inside its
sync_cpu_device_pagetables() callback. That lock must be held before calling
hmm_range_valid() to avoid any race with a concurrent CPU page table update.
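
For completeness, the callback side of that pattern looks roughly like the
following (same pseudo style as the example above; take_lock/release_lock and
the driver structure are placeholders, and the callback signature is only
approximated):

 static int driver_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
                                              const struct hmm_update *update)
 {
      struct driver *driver = mirror_to_driver(mirror);

      take_lock(driver->update);
      /*
       * Wipe out the device page table entries covering
       * [update->start, update->end).  Any hmm_range_valid() check done
       * under driver->update after this point sees valid == false until
       * the invalidation fully ends.
       */
      release_lock(driver->update);
      return 0;
 }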


Cheers,
Jérôme
Jason Gunthorpe May 24, 2019, 4:59 p.m. UTC | #7
On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote:
> On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > 
> > > This patch series arised out of discussions with Jerome when looking at the
> > > ODP changes, particularly informed by use after free races we have already
> > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > 
> > So the last big difference with ODP's flow is how 'range->valid'
> > works.
> > 
> > In ODP this was done using the rwsem umem->umem_rwsem which is
> > obtained for read in invalidate_start and released in invalidate_end.
> > 
> > Then any other threads that wish to only work on a umem which is not
> > undergoing invalidation will obtain the write side of the lock, and
> > within that lock's critical section the virtual address range is known
> > to not be invalidating.
> > 
> > I cannot understand how hmm gets to the same approach. It has
> > range->valid, but it is not locked by anything that I can see, so when
> > we test it in places like hmm_range_fault it seems useless..
> > 
> > Jerome, how does this work?
> > 
> > I have a feeling we should copy the approach from ODP and use an
> > actual lock here.
> 
> range->valid is use as bail early if invalidation is happening in
> hmm_range_fault() to avoid doing useless work. The synchronization
> is explained in the documentation:

That just says the hmm APIs handle locking. I asked how the APIs
implement that locking internally.

Are you trying to say that if I do this, hmm will still work completely
correctly?

diff --git a/mm/hmm.c b/mm/hmm.c
index 8396a65710e304..42977744855d26 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -981,8 +981,8 @@ long hmm_range_snapshot(struct hmm_range *range)
 
 	do {
 		/* If range is no longer valid force retry. */
-		if (!range->valid)
-			return -EAGAIN;
+/*		if (!range->valid)
+			return -EAGAIN;*/
 
 		vma = find_vma(hmm->mm, start);
 		if (vma == NULL || (vma->vm_flags & device_vma))
@@ -1080,10 +1080,10 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 
 	do {
 		/* If range is no longer valid force retry. */
-		if (!range->valid) {
+/*		if (!range->valid) {
 			up_read(&hmm->mm->mmap_sem);
 			return -EAGAIN;
-		}
+		}*/
 
 		vma = find_vma(hmm->mm, start);
 		if (vma == NULL || (vma->vm_flags & device_vma))
@@ -1134,7 +1134,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 			start = hmm_vma_walk.last;
 
 			/* Keep trying while the range is valid. */
-		} while (ret == -EBUSY && range->valid);
+		} while (ret == -EBUSY /*&& range->valid*/);
 
 		if (ret) {
 			unsigned long i;
Jerome Glisse May 24, 2019, 5:01 p.m. UTC | #8
On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote:
> On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote:
> > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > > 
> > > > This patch series arised out of discussions with Jerome when looking at the
> > > > ODP changes, particularly informed by use after free races we have already
> > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > > 
> > > So the last big difference with ODP's flow is how 'range->valid'
> > > works.
> > > 
> > > In ODP this was done using the rwsem umem->umem_rwsem which is
> > > obtained for read in invalidate_start and released in invalidate_end.
> > > 
> > > Then any other threads that wish to only work on a umem which is not
> > > undergoing invalidation will obtain the write side of the lock, and
> > > within that lock's critical section the virtual address range is known
> > > to not be invalidating.
> > > 
> > > I cannot understand how hmm gets to the same approach. It has
> > > range->valid, but it is not locked by anything that I can see, so when
> > > we test it in places like hmm_range_fault it seems useless..
> > > 
> > > Jerome, how does this work?
> > > 
> > > I have a feeling we should copy the approach from ODP and use an
> > > actual lock here.
> > 
> > range->valid is use as bail early if invalidation is happening in
> > hmm_range_fault() to avoid doing useless work. The synchronization
> > is explained in the documentation:
> 
> That just says the hmm APIs handle locking. I asked how the apis
> implement that locking internally.
> 
> Are you trying to say that if I do this, hmm will still work completely
> correctly?

Yes, it will keep working correctly. You would just be doing potentially
useless work.

> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 8396a65710e304..42977744855d26 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -981,8 +981,8 @@ long hmm_range_snapshot(struct hmm_range *range)
>  
>  	do {
>  		/* If range is no longer valid force retry. */
> -		if (!range->valid)
> -			return -EAGAIN;
> +/*		if (!range->valid)
> +			return -EAGAIN;*/
>  
>  		vma = find_vma(hmm->mm, start);
>  		if (vma == NULL || (vma->vm_flags & device_vma))
> @@ -1080,10 +1080,10 @@ long hmm_range_fault(struct hmm_range *range, bool block)
>  
>  	do {
>  		/* If range is no longer valid force retry. */
> -		if (!range->valid) {
> +/*		if (!range->valid) {
>  			up_read(&hmm->mm->mmap_sem);
>  			return -EAGAIN;
> -		}
> +		}*/
>  
>  		vma = find_vma(hmm->mm, start);
>  		if (vma == NULL || (vma->vm_flags & device_vma))
> @@ -1134,7 +1134,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
>  			start = hmm_vma_walk.last;
>  
>  			/* Keep trying while the range is valid. */
> -		} while (ret == -EBUSY && range->valid);
> +		} while (ret == -EBUSY /*&& range->valid*/);
>  
>  		if (ret) {
>  			unsigned long i;
Ralph Campbell May 24, 2019, 5:47 p.m. UTC | #9
On 5/24/19 9:49 AM, Jerome Glisse wrote:
> On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
>> On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>>
>>> This patch series arised out of discussions with Jerome when looking at the
>>> ODP changes, particularly informed by use after free races we have already
>>> found and fixed in the ODP code (thanks to syzkaller) working with mmu
>>> notifiers, and the discussion with Ralph on how to resolve the lifetime model.
>>
>> So the last big difference with ODP's flow is how 'range->valid'
>> works.
>>
>> In ODP this was done using the rwsem umem->umem_rwsem which is
>> obtained for read in invalidate_start and released in invalidate_end.
>>
>> Then any other threads that wish to only work on a umem which is not
>> undergoing invalidation will obtain the write side of the lock, and
>> within that lock's critical section the virtual address range is known
>> to not be invalidating.
>>
>> I cannot understand how hmm gets to the same approach. It has
>> range->valid, but it is not locked by anything that I can see, so when
>> we test it in places like hmm_range_fault it seems useless..
>>
>> Jerome, how does this work?
>>
>> I have a feeling we should copy the approach from ODP and use an
>> actual lock here.
> 
> range->valid is use as bail early if invalidation is happening in
> hmm_range_fault() to avoid doing useless work. The synchronization
> is explained in the documentation:
> 
> 
> Locking within the sync_cpu_device_pagetables() callback is the most important
> aspect the driver must respect in order to keep things properly synchronized.
> The usage pattern is::
> 
>   int driver_populate_range(...)
>   {
>        struct hmm_range range;
>        ...
> 
>        range.start = ...;
>        range.end = ...;
>        range.pfns = ...;
>        range.flags = ...;
>        range.values = ...;
>        range.pfn_shift = ...;
>        hmm_range_register(&range);
> 
>        /*
>         * Just wait for range to be valid, safe to ignore return value as we
>         * will use the return value of hmm_range_snapshot() below under the
>         * mmap_sem to ascertain the validity of the range.
>         */
>        hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
> 
>   again:
>        down_read(&mm->mmap_sem);
>        ret = hmm_range_snapshot(&range);
>        if (ret) {
>            up_read(&mm->mmap_sem);
>            if (ret == -EAGAIN) {
>              /*
>               * No need to check hmm_range_wait_until_valid() return value
>               * on retry we will get proper error with hmm_range_snapshot()
>               */
>              hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
>              goto again;
>            }
>            hmm_range_unregister(&range);
>            return ret;
>        }
>        take_lock(driver->update);
>        if (!hmm_range_valid(&range)) {
>            release_lock(driver->update);
>            up_read(&mm->mmap_sem);
>            goto again;
>        }
> 
>        // Use pfns array content to update device page table
> 
>        hmm_range_unregister(&range);
>        release_lock(driver->update);
>        up_read(&mm->mmap_sem);
>        return 0;
>   }
> 
> The driver->update lock is the same lock that the driver takes inside its
> sync_cpu_device_pagetables() callback. That lock must be held before calling
> hmm_range_valid() to avoid any race with a concurrent CPU page table update.
> 
> 
> Cheers,
> Jérôme


Given the above, the following patch looks necessary to me.
Also, looking at drivers/gpu/drm/nouveau/nouveau_svm.c, it
doesn't check the return value to avoid calling up_read(&mm->mmap_sem).
Besides, it's better to keep the mmap_sem lock/unlock in the caller.

diff --git a/mm/hmm.c b/mm/hmm.c
index 836adf613f81..8b6ef97a8d71 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -1092,10 +1092,8 @@ long hmm_range_fault(struct hmm_range *range, bool block)

  	do {
  		/* If range is no longer valid force retry. */
-		if (!range->valid) {
-			up_read(&hmm->mm->mmap_sem);
+		if (!range->valid)
  			return -EAGAIN;
-		}

  		vma = find_vma(hmm->mm, start);
  		if (vma == NULL || (vma->vm_flags & device_vma))

Jerome Glisse May 24, 2019, 5:51 p.m. UTC | #10
On Fri, May 24, 2019 at 10:47:16AM -0700, Ralph Campbell wrote:
> 
> On 5/24/19 9:49 AM, Jerome Glisse wrote:
> > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > > 
> > > > This patch series arised out of discussions with Jerome when looking at the
> > > > ODP changes, particularly informed by use after free races we have already
> > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > > 
> > > So the last big difference with ODP's flow is how 'range->valid'
> > > works.
> > > 
> > > In ODP this was done using the rwsem umem->umem_rwsem which is
> > > obtained for read in invalidate_start and released in invalidate_end.
> > > 
> > > Then any other threads that wish to only work on a umem which is not
> > > undergoing invalidation will obtain the write side of the lock, and
> > > within that lock's critical section the virtual address range is known
> > > to not be invalidating.
> > > 
> > > I cannot understand how hmm gets to the same approach. It has
> > > range->valid, but it is not locked by anything that I can see, so when
> > > we test it in places like hmm_range_fault it seems useless..
> > > 
> > > Jerome, how does this work?
> > > 
> > > I have a feeling we should copy the approach from ODP and use an
> > > actual lock here.
> > 
> > range->valid is use as bail early if invalidation is happening in
> > hmm_range_fault() to avoid doing useless work. The synchronization
> > is explained in the documentation:
> > 
> > 
> > Locking within the sync_cpu_device_pagetables() callback is the most important
> > aspect the driver must respect in order to keep things properly synchronized.
> > The usage pattern is::
> > 
> >   int driver_populate_range(...)
> >   {
> >        struct hmm_range range;
> >        ...
> > 
> >        range.start = ...;
> >        range.end = ...;
> >        range.pfns = ...;
> >        range.flags = ...;
> >        range.values = ...;
> >        range.pfn_shift = ...;
> >        hmm_range_register(&range);
> > 
> >        /*
> >         * Just wait for range to be valid, safe to ignore return value as we
> >         * will use the return value of hmm_range_snapshot() below under the
> >         * mmap_sem to ascertain the validity of the range.
> >         */
> >        hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
> > 
> >   again:
> >        down_read(&mm->mmap_sem);
> >        ret = hmm_range_snapshot(&range);
> >        if (ret) {
> >            up_read(&mm->mmap_sem);
> >            if (ret == -EAGAIN) {
> >              /*
> >               * No need to check hmm_range_wait_until_valid() return value
> >               * on retry we will get proper error with hmm_range_snapshot()
> >               */
> >              hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
> >              goto again;
> >            }
> >            hmm_range_unregister(&range);
> >            return ret;
> >        }
> >        take_lock(driver->update);
> >        if (!hmm_range_valid(&range)) {
> >            release_lock(driver->update);
> >            up_read(&mm->mmap_sem);
> >            goto again;
> >        }
> > 
> >        // Use pfns array content to update device page table
> > 
> >        hmm_range_unregister(&range);
> >        release_lock(driver->update);
> >        up_read(&mm->mmap_sem);
> >        return 0;
> >   }
> > 
> > The driver->update lock is the same lock that the driver takes inside its
> > sync_cpu_device_pagetables() callback. That lock must be held before calling
> > hmm_range_valid() to avoid any race with a concurrent CPU page table update.
> > 
> > 
> > Cheers,
> > Jérôme
> 
> 
> Given the above, the following patch looks necessary to me.
> Also, looking at drivers/gpu/drm/nouveau/nouveau_svm.c, it
> doesn't check the return value to avoid calling up_read(&mm->mmap_sem).
> Besides, it's better to keep the mmap_sem lock/unlock in the caller.

No, nouveau uses the old API, so check hmm_vma_fault() within hmm.h; I have
a patch to convert it to the new API for 5.3.


> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 836adf613f81..8b6ef97a8d71 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -1092,10 +1092,8 @@ long hmm_range_fault(struct hmm_range *range, bool block)
> 
>  	do {
>  		/* If range is no longer valid force retry. */
> -		if (!range->valid) {
> -			up_read(&hmm->mm->mmap_sem);
> +		if (!range->valid)
>  			return -EAGAIN;
> -		}
> 
>  		vma = find_vma(hmm->mm, start);
>  		if (vma == NULL || (vma->vm_flags & device_vma))
> 
Jason Gunthorpe May 24, 2019, 5:52 p.m. UTC | #11
On Fri, May 24, 2019 at 01:01:49PM -0400, Jerome Glisse wrote:
> On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote:
> > On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote:
> > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > > > 
> > > > > This patch series arised out of discussions with Jerome when looking at the
> > > > > ODP changes, particularly informed by use after free races we have already
> > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > > > 
> > > > So the last big difference with ODP's flow is how 'range->valid'
> > > > works.
> > > > 
> > > > In ODP this was done using the rwsem umem->umem_rwsem which is
> > > > obtained for read in invalidate_start and released in invalidate_end.
> > > > 
> > > > Then any other threads that wish to only work on a umem which is not
> > > > undergoing invalidation will obtain the write side of the lock, and
> > > > within that lock's critical section the virtual address range is known
> > > > to not be invalidating.
> > > > 
> > > > I cannot understand how hmm gets to the same approach. It has
> > > > range->valid, but it is not locked by anything that I can see, so when
> > > > we test it in places like hmm_range_fault it seems useless..
> > > > 
> > > > Jerome, how does this work?
> > > > 
> > > > I have a feeling we should copy the approach from ODP and use an
> > > > actual lock here.
> > > 
> > > range->valid is use as bail early if invalidation is happening in
> > > hmm_range_fault() to avoid doing useless work. The synchronization
> > > is explained in the documentation:
> > 
> > That just says the hmm APIs handle locking. I asked how the apis
> > implement that locking internally.
> > 
> > Are you trying to say that if I do this, hmm will still work completely
> > correctly?
> 
> Yes it will keep working correctly. You would just be doing potentialy
> useless work.

I don't see how it works correctly.

Apply the comment-out patch I showed and this trivially happens:

      CPU0                                               CPU1
  hmm_invalidate_start()
    ops->sync_cpu_device_pagetables()
      device_lock()
       // Wipe out page tables in device, enable faulting
      device_unlock()

						     DEVICE PAGE FAULT
						       device_lock()
						       hmm_range_register()
                                                       hmm_range_dma_map()
						       device_unlock()
  hmm_invalidate_end()

The mmu notifier spec says:

 	 * Invalidation of multiple concurrent ranges may be
	 * optionally permitted by the driver. Either way the
	 * establishment of sptes is forbidden in the range passed to
	 * invalidate_range_begin/end for the whole duration of the
	 * invalidate_range_begin/end critical section.

And I understand "establishment of sptes is forbidden" means
"hmm_range_dma_map() must fail with EAGAIN".

This is why ODP uses an actual lock held across the critical region
which completely prohibits reading the CPU pages tables, or
establishing new mappings.

So, I still think we need a true lock, not a 'maybe valid' flag.

Jason
Jerome Glisse May 24, 2019, 6:03 p.m. UTC | #12
On Fri, May 24, 2019 at 02:52:03PM -0300, Jason Gunthorpe wrote:
> On Fri, May 24, 2019 at 01:01:49PM -0400, Jerome Glisse wrote:
> > On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote:
> > > On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote:
> > > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > > > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > > > > 
> > > > > > This patch series arised out of discussions with Jerome when looking at the
> > > > > > ODP changes, particularly informed by use after free races we have already
> > > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > > > > 
> > > > > So the last big difference with ODP's flow is how 'range->valid'
> > > > > works.
> > > > > 
> > > > > In ODP this was done using the rwsem umem->umem_rwsem which is
> > > > > obtained for read in invalidate_start and released in invalidate_end.
> > > > > 
> > > > > Then any other threads that wish to only work on a umem which is not
> > > > > undergoing invalidation will obtain the write side of the lock, and
> > > > > within that lock's critical section the virtual address range is known
> > > > > to not be invalidating.
> > > > > 
> > > > > I cannot understand how hmm gets to the same approach. It has
> > > > > range->valid, but it is not locked by anything that I can see, so when
> > > > > we test it in places like hmm_range_fault it seems useless..
> > > > > 
> > > > > Jerome, how does this work?
> > > > > 
> > > > > I have a feeling we should copy the approach from ODP and use an
> > > > > actual lock here.
> > > > 
> > > > range->valid is use as bail early if invalidation is happening in
> > > > hmm_range_fault() to avoid doing useless work. The synchronization
> > > > is explained in the documentation:
> > > 
> > > That just says the hmm APIs handle locking. I asked how the apis
> > > implement that locking internally.
> > > 
> > > Are you trying to say that if I do this, hmm will still work completely
> > > correctly?
> > 
> > Yes it will keep working correctly. You would just be doing potentialy
> > useless work.
> 
> I don't see how it works correctly.
> 
> Apply the comment out patch I showed and this trivially happens:
> 
>       CPU0                                               CPU1
>   hmm_invalidate_start()
>     ops->sync_cpu_device_pagetables()
>       device_lock()
>        // Wipe out page tables in device, enable faulting
>       device_unlock()
> 
>                                                        DEVICE PAGE FAULT
>                                                        device_lock()
>                                                        hmm_range_register()
>                                                        hmm_range_dma_map()
>                                                        device_unlock()
>   hmm_invalidate_end()

No, in the above scenario hmm_range_register() will not mark the range
as valid, thus the driver will bail out after taking its lock and checking
the range->valid value.

> 
> The mmu notifier spec says:
> 
>  	 * Invalidation of multiple concurrent ranges may be
> 	 * optionally permitted by the driver. Either way the
> 	 * establishment of sptes is forbidden in the range passed to
> 	 * invalidate_range_begin/end for the whole duration of the
> 	 * invalidate_range_begin/end critical section.
> 
> And I understand "establishment of sptes is forbidden" means
> "hmm_range_dmap_map() must fail with EAGAIN". 

No, it means that secondary page table entries (SPTEs) must not materialize;
thus what hmm_range_dma_map() is doing is fine and safe as long as the
driver does not use the result to populate the device page table if there
was an invalidation for the range.

> 
> This is why ODP uses an actual lock held across the critical region
> which completely prohibits reading the CPU pages tables, or
> establishing new mappings.
> 
> So, I still think we need a true lock, not a 'maybe valid' flag.

The rationale in HMM is to never block the mm so that the mm can always make
progress, as whatever the mm is doing takes precedence; thus it would be
useless to block the mm while we do something if what we are doing is about
to become invalid.

Cheers,
Jérôme
Jason Gunthorpe May 24, 2019, 6:32 p.m. UTC | #13
On Fri, May 24, 2019 at 02:03:22PM -0400, Jerome Glisse wrote:
> On Fri, May 24, 2019 at 02:52:03PM -0300, Jason Gunthorpe wrote:
> > On Fri, May 24, 2019 at 01:01:49PM -0400, Jerome Glisse wrote:
> > > On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote:
> > > > On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote:
> > > > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > > > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > > > > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > > > > > 
> > > > > > > This patch series arised out of discussions with Jerome when looking at the
> > > > > > > ODP changes, particularly informed by use after free races we have already
> > > > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > > > > > 
> > > > > > So the last big difference with ODP's flow is how 'range->valid'
> > > > > > works.
> > > > > > 
> > > > > > In ODP this was done using the rwsem umem->umem_rwsem which is
> > > > > > obtained for read in invalidate_start and released in invalidate_end.
> > > > > > 
> > > > > > Then any other threads that wish to only work on a umem which is not
> > > > > > undergoing invalidation will obtain the write side of the lock, and
> > > > > > within that lock's critical section the virtual address range is known
> > > > > > to not be invalidating.
> > > > > > 
> > > > > > I cannot understand how hmm gets to the same approach. It has
> > > > > > range->valid, but it is not locked by anything that I can see, so when
> > > > > > we test it in places like hmm_range_fault it seems useless..
> > > > > > 
> > > > > > Jerome, how does this work?
> > > > > > 
> > > > > > I have a feeling we should copy the approach from ODP and use an
> > > > > > actual lock here.
> > > > > 
> > > > > range->valid is use as bail early if invalidation is happening in
> > > > > hmm_range_fault() to avoid doing useless work. The synchronization
> > > > > is explained in the documentation:
> > > > 
> > > > That just says the hmm APIs handle locking. I asked how the apis
> > > > implement that locking internally.
> > > > 
> > > > Are you trying to say that if I do this, hmm will still work completely
> > > > correctly?
> > > 
> > > Yes it will keep working correctly. You would just be doing potentialy
> > > useless work.
> > 
> > I don't see how it works correctly.
> > 
> > Apply the comment out patch I showed and this trivially happens:
> > 
> >       CPU0                                               CPU1
> >   hmm_invalidate_start()
> >     ops->sync_cpu_device_pagetables()
> >       device_lock()
> >        // Wipe out page tables in device, enable faulting
> >       device_unlock()
> > 
> >                                                        DEVICE PAGE FAULT
> >                                                        device_lock()
> >                                                        hmm_range_register()
> >                                                        hmm_range_dma_map()
> >                                                        device_unlock()
> >   hmm_invalidate_end()
> 
> No in the above scenario hmm_range_register() will not mark the range
> as valid thus the driver will bailout after taking its lock and checking
> the range->valid value.

I see your confusion; I only asked about removing valid from hmm.c,
not the unlocked use of valid in your hmm.rst example. My mistake,
sorry for being unclear.

Here is the big 3 CPU ladder diagram that shows how 'valid' does not
work:

       CPU0                                               CPU1                                          CPU2
                                                        DEVICE PAGE FAULT
                                                        range = hmm_range_register()

   // Overlaps with range
   hmm_invalidate_start()
     range->valid = false
     ops->sync_cpu_device_pagetables()
       take_lock(driver->update);
        // Wipe out page tables in device, enable faulting
       release_lock(driver->update);
												    // Does not overlap with range
												    hmm_invalidate_start()
												    hmm_invalidate_end()
													list_for_each
													    range->valid =  true


                                                        device_lock()
							// Note range->valid = true now
							hmm_range_snapshot(&range);
							take_lock(driver->update);
							if (!hmm_range_valid(&range))
							    goto again
							ESTABLISH SPTES
                                                        device_unlock()
   hmm_invalidate_end()

And I can make this more complicated (ie overlapping parallel
invalidates, etc) and show that any 'bool' valid cannot work.

> > The mmu notifier spec says:
> > 
> >  	 * Invalidation of multiple concurrent ranges may be
> > 	 * optionally permitted by the driver. Either way the
> > 	 * establishment of sptes is forbidden in the range passed to
> > 	 * invalidate_range_begin/end for the whole duration of the
> > 	 * invalidate_range_begin/end critical section.
> > 
> > And I understand "establishment of sptes is forbidden" means
> > "hmm_range_dmap_map() must fail with EAGAIN". 
> 
> No it means that secondary page table entry (SPTE) must not
> materialize thus what hmm_range_dmap_map() is doing if fine and safe
> as long as the driver do not use the result to populate the device
> page table if there was an invalidation for the range.

Okay, so we agree: if there is an invalidate_start/end critical region
then it is OK to *call* hmm_range_dma_map(), however the driver must
not *use* the result, and you are expecting this bit:

      take_lock(driver->update);
      if (!hmm_range_valid(&range)) {
         goto again

in your hmm.rst to prevent the pfns from being used by the driver?

I think the above ladder shows that hmm_range_valid() can return true
during an invalidate_start/end critical region, so this is a problem.

I still think the best solution is to move device_lock() into mirror
and have hmm manage it for the driver as ODP does. It is certainly the
simplest solution to understand.

Jason
Jerome Glisse May 24, 2019, 6:46 p.m. UTC | #14
On Fri, May 24, 2019 at 03:32:25PM -0300, Jason Gunthorpe wrote:
> On Fri, May 24, 2019 at 02:03:22PM -0400, Jerome Glisse wrote:
> > On Fri, May 24, 2019 at 02:52:03PM -0300, Jason Gunthorpe wrote:
> > > On Fri, May 24, 2019 at 01:01:49PM -0400, Jerome Glisse wrote:
> > > > On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote:
> > > > > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > > > > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > > > > > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > > > > > > 
> > > > > > > > This patch series arised out of discussions with Jerome when looking at the
> > > > > > > > ODP changes, particularly informed by use after free races we have already
> > > > > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > > > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > > > > > > 
> > > > > > > So the last big difference with ODP's flow is how 'range->valid'
> > > > > > > works.
> > > > > > > 
> > > > > > > In ODP this was done using the rwsem umem->umem_rwsem which is
> > > > > > > obtained for read in invalidate_start and released in invalidate_end.
> > > > > > > 
> > > > > > > Then any other threads that wish to only work on a umem which is not
> > > > > > > undergoing invalidation will obtain the write side of the lock, and
> > > > > > > within that lock's critical section the virtual address range is known
> > > > > > > to not be invalidating.
> > > > > > > 
> > > > > > > I cannot understand how hmm gets to the same approach. It has
> > > > > > > range->valid, but it is not locked by anything that I can see, so when
> > > > > > > we test it in places like hmm_range_fault it seems useless..
> > > > > > > 
> > > > > > > Jerome, how does this work?
> > > > > > > 
> > > > > > > I have a feeling we should copy the approach from ODP and use an
> > > > > > > actual lock here.
> > > > > > 
> > > > > > range->valid is use as bail early if invalidation is happening in
> > > > > > hmm_range_fault() to avoid doing useless work. The synchronization
> > > > > > is explained in the documentation:
> > > > > 
> > > > > That just says the hmm APIs handle locking. I asked how the apis
> > > > > implement that locking internally.
> > > > > 
> > > > > Are you trying to say that if I do this, hmm will still work completely
> > > > > correctly?
> > > > 
> > > > Yes it will keep working correctly. You would just be doing potentialy
> > > > useless work.
> > > 
> > > I don't see how it works correctly.
> > > 
> > > Apply the comment out patch I showed and this trivially happens:
> > > 
> > >       CPU0                                               CPU1
> > >   hmm_invalidate_start()
> > >     ops->sync_cpu_device_pagetables()
> > >       device_lock()
> > >        // Wipe out page tables in device, enable faulting
> > >       device_unlock()
> > > 
> > >                                                        DEVICE PAGE FAULT
> > >                                                        device_lock()
> > >                                                        hmm_range_register()
> > >                                                        hmm_range_dma_map()
> > >                                                        device_unlock()
> > >   hmm_invalidate_end()
> > 
> > No in the above scenario hmm_range_register() will not mark the range
> > as valid thus the driver will bailout after taking its lock and checking
> > the range->valid value.
> 
> I see your confusion, I only asked about removing valid from hmm.c,
> not the unlocked use of valid in your hmm.rst example. My mistake,
> sorry for being unclear.

No, I did understand properly, and it is fine to remove all the valid
checks within hmm_range_fault() or hmm_range_snapshot(); nothing bad
will come of that.

> 
> Here is the big 3 CPU ladder diagram that shows how 'valid' does not
> work:
> 
>        CPU0                                               CPU1                                          CPU2
>                                                         DEVICE PAGE FAULT
>                                                         range = hmm_range_register()
>
>   // Overlaps with range
>   hmm_invalidate_start()
>     range->valid = false
>     ops->sync_cpu_device_pagetables()
>       take_lock(driver->update);
>        // Wipe out page tables in device, enable faulting
>       release_lock(driver->update);
>                                                                                                    // Does not overlap with range
>                                                                                                    hmm_invalidate_start()
>                                                                                                    hmm_invalidate_end()
>                                                                                                        list_for_each
>                                                                                                            range->valid =  true

                                                                                                             ^
No, this cannot happen, because CPU0 still has an invalidate_range in progress
and thus hmm->notifiers > 0, so hmm_invalidate_range_end() will not set
range->valid to true.

>
>
>                                                        device_lock()
>                                                        // Note range->valid = true now
>                                                        hmm_range_snapshot(&range);
>                                                        take_lock(driver->update);
>                                                        if (!hmm_range_valid(&range))
>                                                            goto again
>                                                        ESTABLISHE SPTES
>                                                        device_unlock()
>   hmm_invalidate_end()
> 
> 
> And I can make this more complicated (ie overlapping parallel
> invalidates, etc) and show any 'bool' valid cannot work.

It does work. If you want, I can remove the range->valid = true from
hmm_invalidate_range_end() and move it into hmm_range_wait_until_valid(),
i.e. modifying the hmm_range_wait_until_valid() logic; this might look
cleaner.
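
For reference, the logic being described is roughly the following (a
simplified paraphrase of what mm/hmm.c does in this series, not a verbatim
copy; the structs are stripped down to the fields that matter):

 struct sketch_hmm {
        struct mutex            lock;       /* stands in for hmm's internal lock */
        long                    notifiers;  /* invalidations currently in flight */
        struct list_head        ranges;
 };

 struct sketch_range {
        struct list_head        list;
        unsigned long           start, end;
        bool                    valid;
 };

 static void sketch_invalidate_range_start(struct sketch_hmm *hmm,
                                           unsigned long start, unsigned long end)
 {
        struct sketch_range *range;

        mutex_lock(&hmm->lock);
        hmm->notifiers++;
        list_for_each_entry(range, &hmm->ranges, list)
                if (range->start < end && range->end > start)
                        range->valid = false;
        mutex_unlock(&hmm->lock);
 }

 static void sketch_invalidate_range_end(struct sketch_hmm *hmm)
 {
        struct sketch_range *range;

        mutex_lock(&hmm->lock);
        /* only when *no* invalidation remains in flight anywhere in the mm */
        if (--hmm->notifiers == 0)
                list_for_each_entry(range, &hmm->ranges, list)
                        range->valid = true;
        mutex_unlock(&hmm->lock);
 }

So in the 3-CPU ladder above, CPU2's invalidate_range_end() still sees
notifiers != 0 (CPU0's invalidation remains open) and leaves range->valid
false.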

> > > The mmu notifier spec says:
> > > 
> > >  	 * Invalidation of multiple concurrent ranges may be
> > > 	 * optionally permitted by the driver. Either way the
> > > 	 * establishment of sptes is forbidden in the range passed to
> > > 	 * invalidate_range_begin/end for the whole duration of the
> > > 	 * invalidate_range_begin/end critical section.
> > > 
> > > And I understand "establishment of sptes is forbidden" means
> > > "hmm_range_dmap_map() must fail with EAGAIN". 
> > 
> > No it means that secondary page table entry (SPTE) must not
> > materialize thus what hmm_range_dmap_map() is doing if fine and safe
> > as long as the driver do not use the result to populate the device
> > page table if there was an invalidation for the range.
> 
> Okay, so we agree, if there is an invalidate_start/end critical region
> then it is OK to *call* hmm_range_dmap_map(), however the driver must
> not *use* the result, and you are expecting this bit:
> 
>       take_lock(driver->update);
>       if (!hmm_range_valid(&range)) {
>          goto again
> 
> In your hmm.rst to prevent the pfns from being used by the driver?
> 
> I think the above ladder shows that hmm_range_valid can return true
> during a invalidate_start/end critical region, so this is a problem.
>
> I still think the best solution is to move device_lock() into mirror
> and have hmm manage it for the driver as ODP does. It is certainly the
> simplest solution to understand.

It is inefficient and would block forward progress by mm code further
than needed.

Cheers,
Jérôme
Jason Gunthorpe May 24, 2019, 10:09 p.m. UTC | #15
On Fri, May 24, 2019 at 02:46:08PM -0400, Jerome Glisse wrote:
> > Here is the big 3 CPU ladder diagram that shows how 'valid' does not
> > work:
> > 
> >        CPU0                                               CPU1                                          CPU2
> >                                                         DEVICE PAGE FAULT
> >                                                         range = hmm_range_register()
> >
> >   // Overlaps with range
> >   hmm_invalidate_start()
> >     range->valid = false
> >     ops->sync_cpu_device_pagetables()
> >       take_lock(driver->update);
> >        // Wipe out page tables in device, enable faulting
> >       release_lock(driver->update);
> >                                                                                                    // Does not overlap with range
> >                                                                                                    hmm_invalidate_start()
> >                                                                                                    hmm_invalidate_end()
> >                                                                                                        list_for_each
> >                                                                                                            range->valid =  true
> 
>                                                                                                              ^
> No this can not happen because CPU0 still has invalidate_range in progress and
> thus hmm->notifiers > 0 so the hmm_invalidate_range_end() will not set the
> range->valid as true.

Oh, Okay, I now see how this all works, thank you

> > And I can make this more complicated (ie overlapping parallel
> > invalidates, etc) and show any 'bool' valid cannot work.
> 
> It does work. 

Well, I meant the bool alone cannot work, but this is really bool + a
counter.

> If you want I can remove the range->valid = true from the
> hmm_invalidate_range_end() and move it within hmm_range_wait_until_valid()
> ie modifying the hmm_range_wait_until_valid() logic, this might look
> cleaner.

Let me reflect on it for a bit. I have to say I don't like the lack of
clarity here, and I don't like the valid=true loop in invalidate_end;
it is pretty clunky.

I'm thinking a more obvious API for drivers, as something like:

again:
    hmm_range_start();
     [..]
    if (hmm_range_test_retry())
          goto again

    driver_lock()
      if (hmm_range_end())
           goto again
    driver_unlock();

Just because it makes it very clear to the driver author what to do
and how this is working, and makes it clear that there is no such
thing as 'valid' - what we *really* have is a locking collision
forcing retry. ie this is really closer to a seq-lock scheme, not a
valid/invalid scheme. Being able to explain the concept does matter
for maintainability...

And I'm thinking the above API design would comfortably support a more
efficient seq-lock like approach without the loop in invalidate_end..
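
As a very rough sketch (none of these helpers exist yet, and the
seqcount name below is purely illustrative), the flow could expand to
the usual seqcount read/retry pattern:

    unsigned int seq;

 again:
    seq = read_seqcount_begin(&hmm->invalidate_seq);      /* hmm_range_start() */

    /* ... hmm_range_snapshot()/hmm_range_fault() fill range->pfns ... */

    if (read_seqcount_retry(&hmm->invalidate_seq, seq))   /* hmm_range_test_retry() */
          goto again;

    driver_lock();
    if (read_seqcount_retry(&hmm->invalidate_seq, seq)) { /* hmm_range_end() */
          driver_unlock();
          goto again;
    }
    /* Safe to establish device PTEs from range->pfns */
    driver_unlock();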

But I haven't quite thought it all through yet. Next week!

> > I still think the best solution is to move device_lock() into mirror
> > and have hmm manage it for the driver as ODP does. It is certainly the
> > simplest solution to understand.
> 
> It is inefficient and would block forward progress by mm code for
> longer than needed.

I'm not sure how you get to that; we already have the device_lock()
and it already blocks forward progress by mm code.

Really the big unfortunate thing here is that 'valid' is manipulated
outside the device_lock, even though logically it is covered by the
device_lock.

Jason
Jason Gunthorpe May 27, 2019, 7:58 p.m. UTC | #16
On Fri, May 24, 2019 at 07:09:23PM -0300, Jason Gunthorpe wrote:
> On Fri, May 24, 2019 at 02:46:08PM -0400, Jerome Glisse wrote:
> > > Here is the big 3 CPU ladder diagram that shows how 'valid' does not
> > > work:
> > > 
> > >        CPU0                                               CPU1                                          CPU2
> > >                                                         DEVICE PAGE FAULT
> > >                                                         range = hmm_range_register()
> > >
> > >   // Overlaps with range
> > >   hmm_invalidate_start()
> > >     range->valid = false
> > >     ops->sync_cpu_device_pagetables()
> > >       take_lock(driver->update);
> > >        // Wipe out page tables in device, enable faulting
> > >       release_lock(driver->update);
> > >                                                                                                    // Does not overlap with range
> > >                                                                                                    hmm_invalidate_start()
> > >                                                                                                    hmm_invalidate_end()
> > >                                                                                                        list_for_each
> > >                                                                                                            range->valid =  true
> > 
> >                                                                                                              ^
> > No, this cannot happen because CPU0 still has an invalidate_range in progress
> > and thus hmm->notifiers > 0, so hmm_invalidate_range_end() will not set
> > range->valid to true.
> 
> Oh, Okay, I now see how this all works, thank you
> 
> > > And I can make this more complicated (ie overlapping parallel
> > > invalidates, etc) and show any 'bool' valid cannot work.
> > 
> > It does work. 
> 
> Well, I ment the bool alone cannot work, but this is really bool + a
> counter.

I couldn't shake the unease that a plain bool shouldn't be able to work
for this type of locking, especially since ODP also used a sequence lock
as well as the rwsem...

What about this situation:

       CPU0                                               CPU1
                                                        DEVICE PAGE FAULT
                                                        range = hmm_range_register()
							hmm_range_snapshot(&range);


   // Overlaps with range
   hmm_invalidate_start()
     range->valid = false
     ops->sync_cpu_device_pagetables()
       take_lock(driver->update);
        // Wipe out page tables in device, enable faulting
       release_lock(driver->update);
   hmm_invalidate_end()
      range->valid = true


							take_lock(driver->update);
							if (!hmm_range_valid(&range))
							    goto again
							ESTABLISH SPTES
                                                        release_lock(driver->update);


The ODP patch appears to follow this pattern, as the dma_map and the
mlx5_ib_update_xlt are in different locking regions. We should dump
the result of hmm_range_snapshot() in this case; certainly the driver
shouldn't be tasked with fixing this.
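
Spelled out as driver pseudocode (purely illustrative, following the
hmm.rst usage pattern), the hazard is:

    hmm_range_register(&range, ...);
    hmm_range_snapshot(&range);           /* pfns captured here */

    /* meanwhile the other CPU runs invalidate_start()/end(), wipes the
       device page tables and sets range.valid back to true */

    take_lock(driver->update);
    if (!hmm_range_valid(&range))         /* passes - valid is true again */
          goto again;
    /* device PTEs are now established from a stale pfns array */
    release_lock(driver->update);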

So, something like this is what I'm thinking about:

From 41b6a6120e30978e7335ada04fb9168db4e5fd29 Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg@mellanox.com>
Date: Mon, 27 May 2019 16:48:53 -0300
Subject: [PATCH] RFC mm/hmm: Replace the range->valid with a seqcount

Instead of trying to use a single 'valid' flag to keep track of parallel
invalidations, use a traditional seqcount retry lock.

Replace the range->valid with the concept of a 'update critical region'
bounded by hmm_range_start_update() / hmm_range_end_update() which can
fail and need retry if it is ever interrupted by a parallel
invalidation. Updaters must create the critical section and can only
finish their update while holding the device_lock.

Continue to take a very loose approach to track invalidation, now with a
single global seqcount for all ranges. This is done to minimize the
overhead in the mmu notifier, and expects there will only be a small
number of ranges active at once. It could be converted to a seqcount per
range if necessary.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 Documentation/vm/hmm.rst | 22 +++++--------
 include/linux/hmm.h      | 60 ++++++++++++++++++++++++---------
 mm/hmm.c                 | 71 ++++++++++++++++++++++------------------
 3 files changed, 93 insertions(+), 60 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 7c1e929931a07f..7e827058964579 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -229,32 +229,27 @@ The usage pattern is::
        * will use the return value of hmm_range_snapshot() below under the
        * mmap_sem to ascertain the validity of the range.
        */
-      hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
-
  again:
+       if (!hmm_range_start_update(&range, TIMEOUT_IN_MSEC))
+             goto err
+
       down_read(&mm->mmap_sem);
       ret = hmm_range_snapshot(&range);
       if (ret) {
           up_read(&mm->mmap_sem);
-          if (ret == -EAGAIN) {
-            /*
-             * No need to check hmm_range_wait_until_valid() return value
-             * on retry we will get proper error with hmm_range_snapshot()
-             */
-            hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
-            goto again;
-          }
+          if (ret == -EAGAIN)
+              goto again;
           hmm_mirror_unregister(&range);
           return ret;
       }
       take_lock(driver->update);
-      if (!hmm_range_valid(&range)) {
+      if (!hmm_range_end_update(&range)) {
           release_lock(driver->update);
           up_read(&mm->mmap_sem);
           goto again;
       }
 
-      // Use pfns array content to update device page table
+      // Use pfns array content to update device page table, must hold driver->update
 
       hmm_mirror_unregister(&range);
       release_lock(driver->update);
@@ -264,7 +259,8 @@ The usage pattern is::
 
 The driver->update lock is the same lock that the driver takes inside its
 sync_cpu_device_pagetables() callback. That lock must be held before calling
-hmm_range_valid() to avoid any race with a concurrent CPU page table update.
+hmm_range_end_update() to avoid any race with a concurrent CPU page table
+update.
 
 HMM implements all this on top of the mmu_notifier API because we wanted a
 simpler API and also to be able to perform optimizations latter on like doing
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 26dfd9377b5094..9096113cfba8de 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -90,7 +90,9 @@
  * @mmu_notifier: mmu notifier to track updates to CPU page table
  * @mirrors_sem: read/write semaphore protecting the mirrors list
  * @wq: wait queue for user waiting on a range invalidation
- * @notifiers: count of active mmu notifiers
+ * @active_invalidates: count of active mmu notifier invalidations
+ * @range_invalidated: seqcount indicating that an active range was
+ *                     maybe invalidated
  */
 struct hmm {
 	struct mm_struct	*mm;
@@ -102,7 +104,8 @@ struct hmm {
 	struct rw_semaphore	mirrors_sem;
 	wait_queue_head_t	wq;
 	struct rcu_head		rcu;
-	long			notifiers;
+	unsigned int		active_invalidates;
+	seqcount_t		range_invalidated;
 };
 
 /*
@@ -169,7 +172,7 @@ enum hmm_pfn_value_e {
  * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
  * @page_shift: device virtual address shift value (should be >= PAGE_SHIFT)
  * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
- * @valid: pfns array did not change since it has been fill by an HMM function
+ * @update_seq: sequence number for the seqcount lock read side
  */
 struct hmm_range {
 	struct hmm		*hmm;
@@ -184,7 +187,7 @@ struct hmm_range {
 	uint64_t		pfn_flags_mask;
 	uint8_t			page_shift;
 	uint8_t			pfn_shift;
-	bool			valid;
+	unsigned int		update_seq;
 };
 
 /*
@@ -208,27 +211,52 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
 }
 
 /*
- * hmm_range_wait_until_valid() - wait for range to be valid
+ * hmm_range_start_update() - wait for range to be valid
  * @range: range affected by invalidation to wait on
  * @timeout: time out for wait in ms (ie abort wait after that period of time)
  * Return: true if the range is valid, false otherwise.
  */
-static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
-					      unsigned long timeout)
+// FIXME: hmm should handle the timeout for the driver too.
+static inline unsigned int hmm_range_start_update(struct hmm_range *range,
+						  unsigned long timeout)
 {
-	wait_event_timeout(range->hmm->wq, range->valid,
+	wait_event_timeout(range->hmm->wq,
+			   READ_ONCE(range->hmm->active_invalidates) == 0,
 			   msecs_to_jiffies(timeout));
-	return READ_ONCE(range->valid);
+
+	// FIXME: wants a non-raw seq helper
+	seqcount_lockdep_reader_access(&range->hmm->range_invalidated);
+	range->update_seq = raw_seqcount_begin(&range->hmm->range_invalidated);
+	return !read_seqcount_retry(&range->hmm->range_invalidated,
+				    range->update_seq);
 }
 
 /*
- * hmm_range_valid() - test if a range is valid or not
+ * hmm_range_needs_retry() - test if a range that has begun update
+ *                 via hmm_range_start_update() needs to be retried.
  * @range: range
- * Return: true if the range is valid, false otherwise.
+ * Return: true if the update needs to be restarted from hmm_range_start_update(),
+ *         false otherwise.
+ */
+static inline bool hmm_range_needs_retry(struct hmm_range *range)
+{
+	return read_seqcount_retry(&range->hmm->range_invalidated,
+				   range->update_seq);
+}
+
+/*
+ * hmm_range_end_update() - finish an update of a range
+ * @range: range
+ *
+ * This must only be called while holding the device lock as described in
+ * hmm.rst, and must be called before making any of the update visible.
+ *
+ * Return: true if the update was not interrupted by an invalidation of the
+ * covered virtual memory range, false if the update needs to be retried.
  */
-static inline bool hmm_range_valid(struct hmm_range *range)
+static inline bool hmm_range_end_update(struct hmm_range *range)
 {
-	return range->valid;
+	return !hmm_range_needs_retry(range);
 }
 
 /*
@@ -511,7 +539,7 @@ static inline int hmm_mirror_range_register(struct hmm_range *range,
 /* This is a temporary helper to avoid merge conflict between trees. */
 static inline bool hmm_vma_range_done(struct hmm_range *range)
 {
-	bool ret = hmm_range_valid(range);
+	bool ret = !hmm_range_needs_retry(range);
 
 	hmm_range_unregister(range);
 	return ret;
@@ -537,7 +565,7 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 	if (ret)
 		return (int)ret;
 
-	if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
+	if (!hmm_range_start_update(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
 		hmm_range_unregister(range);
 		/*
 		 * The mmap_sem was taken by driver we release it here and
@@ -549,6 +577,8 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 	}
 
 	ret = hmm_range_fault(range, block);
+	if (!hmm_range_end_update(range))
+		ret = -EAGAIN;
 	if (ret <= 0) {
 		hmm_range_unregister(range);
 		if (ret == -EBUSY || !ret) {
diff --git a/mm/hmm.c b/mm/hmm.c
index 8396a65710e304..09725774ff6112 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -79,7 +79,8 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	INIT_LIST_HEAD(&hmm->ranges);
 	mutex_init(&hmm->lock);
 	kref_init(&hmm->kref);
-	hmm->notifiers = 0;
+	hmm->active_invalidates = 0;
+	seqcount_init(&hmm->range_invalidated);
 	hmm->mm = mm;
 	mmgrab(mm);
 
@@ -155,13 +156,22 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	hmm_put(hmm);
 }
 
+static bool any_range_overlaps(struct hmm *hmm, unsigned long start, unsigned long end)
+{
+	struct hmm_range *range;
+
+	list_for_each_entry(range, &hmm->ranges, list)
+		// FIXME: check me
+	if (range->start <= end && range->end > start)
+			return true;
+	return false;
+}
 static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *nrange)
 {
 	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 	struct hmm_mirror *mirror;
 	struct hmm_update update;
-	struct hmm_range *range;
 	int ret = 0;
 
 	/* hmm is in progress to free */
@@ -179,13 +189,22 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 		ret = -EAGAIN;
 		goto out;
 	}
-	hmm->notifiers++;
-	list_for_each_entry(range, &hmm->ranges, list) {
-		if (update.end < range->start || update.start >= range->end)
-			continue;
-
-		range->valid = false;
-	}
+	/*
+	 * The seqcount is used as a fast but inaccurate indication that
+	 * another CPU working with a range needs to retry due to invalidation
+	 * of the same virtual address space covered by the range by this
+	 * CPU.
+	 *
+	 * It is based on the assumption that the ranges will be short lived,
+	 * so there is no need to be aggressively accurate in the mmu notifier
+	 * fast path. Any notifier intersection will cause all registered
+	 * ranges to retry.
+	 */
+	hmm->active_invalidates++;
+	// FIXME: needs a seqcount helper
+	if (!(hmm->range_invalidated.sequence & 1) &&
+	    any_range_overlaps(hmm, update.start, update.end))
+		write_seqcount_begin(&hmm->range_invalidated);
 	mutex_unlock(&hmm->lock);
 
 	if (mmu_notifier_range_blockable(nrange))
@@ -218,15 +237,11 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 		return;
 
 	mutex_lock(&hmm->lock);
-	hmm->notifiers--;
-	if (!hmm->notifiers) {
-		struct hmm_range *range;
-
-		list_for_each_entry(range, &hmm->ranges, list) {
-			if (range->valid)
-				continue;
-			range->valid = true;
-		}
+	hmm->active_invalidates--;
+	if (hmm->active_invalidates == 0) {
+		// FIXME: needs a seqcount helper
+		if (hmm->range_invalidated.sequence & 1)
+			write_seqcount_end(&hmm->range_invalidated);
 		wake_up_all(&hmm->wq);
 	}
 	mutex_unlock(&hmm->lock);
@@ -882,7 +897,7 @@ int hmm_range_register(struct hmm_range *range,
 {
 	unsigned long mask = ((1UL << page_shift) - 1UL);
 
-	range->valid = false;
+	range->update_seq = 0;
 	range->hmm = NULL;
 
 	if ((start & mask) || (end & mask))
@@ -908,15 +923,7 @@ int hmm_range_register(struct hmm_range *range,
 
 	/* Initialize range to track CPU page table updates. */
 	mutex_lock(&range->hmm->lock);
-
 	list_add(&range->list, &range->hmm->ranges);
-
-	/*
-	 * If there are any concurrent notifiers we have to wait for them for
-	 * the range to be valid (see hmm_range_wait_until_valid()).
-	 */
-	if (!range->hmm->notifiers)
-		range->valid = true;
 	mutex_unlock(&range->hmm->lock);
 
 	return 0;
@@ -947,7 +954,6 @@ void hmm_range_unregister(struct hmm_range *range)
 	hmm_put(hmm);
 
 	/* The range is now invalid, leave it poisoned. */
-	range->valid = false;
 	memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
 }
 EXPORT_SYMBOL(hmm_range_unregister);
@@ -981,7 +987,7 @@ long hmm_range_snapshot(struct hmm_range *range)
 
 	do {
 		/* If range is no longer valid force retry. */
-		if (!range->valid)
+		if (hmm_range_needs_retry(range))
 			return -EAGAIN;
 
 		vma = find_vma(hmm->mm, start);
@@ -1080,7 +1086,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 
 	do {
 		/* If range is no longer valid force retry. */
-		if (!range->valid) {
+		if (hmm_range_needs_retry(range)) {
 			up_read(&hmm->mm->mmap_sem);
 			return -EAGAIN;
 		}
@@ -1134,7 +1140,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 			start = hmm_vma_walk.last;
 
 			/* Keep trying while the range is valid. */
-		} while (ret == -EBUSY && range->valid);
+		} while (ret == -EBUSY && !hmm_range_needs_retry(range));
 
 		if (ret) {
 			unsigned long i;
@@ -1195,7 +1201,8 @@ long hmm_range_dma_map(struct hmm_range *range,
 			continue;
 
 		/* Check if range is being invalidated */
-		if (!range->valid) {
+		if (hmm_range_needs_retry(range)) {
+			// ?? EAGAIN??
 			ret = -EBUSY;
 			goto unmap;
 		}