diff mbox series

[v2,hmm,05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout

Message ID 20190606184438.31646-6-jgg@ziepe.ca (mailing list archive)
State Superseded
Delegated to: Jason Gunthorpe
Headers show
Series Various revisions from a locking/code review

Commit Message

Jason Gunthorpe June 6, 2019, 6:44 p.m. UTC
From: Jason Gunthorpe <jgg@mellanox.com>

The wait_event_timeout macro already tests the condition as its first
action, so there is no reason to open code another version of this, all
that does is skip the might_sleep() debugging in common cases, which is
not helpful.

Further, based on prior patches, we can now simplify the required condition
test:
 - If range is valid memory then so is range->hmm
 - If hmm_release() has run then range->valid is set to false
   at the same time as dead, so no reason to check both.
 - A valid hmm has a valid hmm->mm.

Also, add the READ_ONCE for range->valid as there is no lock held here.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/hmm.h | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

Comments

John Hubbard June 7, 2019, 3:06 a.m. UTC | #1
On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> The wait_event_timeout macro already tests the condition as its first
> action, so there is no reason to open code another version of this, all
> that does is skip the might_sleep() debugging in common cases, which is
> not helpful.
> 
> Further, based on prior patches, we can no simplify the required condition

                                          "now simplify"

> test:
>  - If range is valid memory then so is range->hmm
>  - If hmm_release() has run then range->valid is set to false
>    at the same time as dead, so no reason to check both.
>  - A valid hmm has a valid hmm->mm.
> 
> Also, add the READ_ONCE for range->valid as there is no lock held here.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  include/linux/hmm.h | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 4ee3acabe5ed22..2ab35b40992b24 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
>  static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
>  					      unsigned long timeout)
>  {
> -	/* Check if mm is dead ? */
> -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> -		range->valid = false;
> -		return false;
> -	}
> -	if (range->valid)
> -		return true;
> -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> +	wait_event_timeout(range->hmm->wq, range->valid,
>  			   msecs_to_jiffies(timeout));
> -	/* Return current valid status just in case we get lucky */
> -	return range->valid;
> +	return READ_ONCE(range->valid);

Just to ensure that I actually understand the model: I'm assuming that the 
READ_ONCE is there solely to ensure that range->valid is read *after* the
wait_event_timeout() returns. Is that correct?


>  }
>  
>  /*
> 

In any case, it looks good, so:

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
Jason Gunthorpe June 7, 2019, 12:47 p.m. UTC | #2
On Thu, Jun 06, 2019 at 08:06:52PM -0700, John Hubbard wrote:
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > The wait_event_timeout macro already tests the condition as its first
> > action, so there is no reason to open code another version of this, all
> > that does is skip the might_sleep() debugging in common cases, which is
> > not helpful.
> > 
> > Further, based on prior patches, we can no simplify the required condition
> 
>                                           "now simplify"
> 
> > test:
> >  - If range is valid memory then so is range->hmm
> >  - If hmm_release() has run then range->valid is set to false
> >    at the same time as dead, so no reason to check both.
> >  - A valid hmm has a valid hmm->mm.
> > 
> > Also, add the READ_ONCE for range->valid as there is no lock held here.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> >  include/linux/hmm.h | 12 ++----------
> >  1 file changed, 2 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 4ee3acabe5ed22..2ab35b40992b24 100644
> > +++ b/include/linux/hmm.h
> > @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
> >  static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
> >  					      unsigned long timeout)
> >  {
> > -	/* Check if mm is dead ? */
> > -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> > -		range->valid = false;
> > -		return false;
> > -	}
> > -	if (range->valid)
> > -		return true;
> > -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> > +	wait_event_timeout(range->hmm->wq, range->valid,
> >  			   msecs_to_jiffies(timeout));
> > -	/* Return current valid status just in case we get lucky */
> > -	return range->valid;
> > +	return READ_ONCE(range->valid);
> 
> Just to ensure that I actually understand the model: I'm assuming that the 
> READ_ONCE is there solely to ensure that range->valid is read *after* the
> wait_event_timeout() returns. Is that correct?

No, wait_event_timeout already has internal barriers that make sure
things don't leak across it.

The READ_ONCE is required any time a thread is reading a value that
another thread can be concurrently changing - ie in this case there is
no lock protecting range->valid so the write side could be running.

Without the READ_ONCE the compiler is allowed to read the value twice
and assume it gets the same result, which may not be true with a
parallel writer, and thus may compromise the control flow in some
unknown way. 

It is also good documentation for the locking scheme in use as it
marks shared data that is not being locked.

However, now that dead is gone we can just write the above more simply
as:

static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
					      unsigned long timeout)
{
	return wait_event_timeout(range->hmm->wq, range->valid,
				  msecs_to_jiffies(timeout)) != 0;
}

Which relies on the internal barriers of wait_event_timeout, I'll fix
it up..

Thanks,
Jason
Ralph Campbell June 7, 2019, 7:01 p.m. UTC | #3
On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> The wait_event_timeout macro already tests the condition as its first
> action, so there is no reason to open code another version of this, all
> that does is skip the might_sleep() debugging in common cases, which is
> not helpful.
> 
> Further, based on prior patches, we can no simplify the required condition
> test:
>   - If range is valid memory then so is range->hmm
>   - If hmm_release() has run then range->valid is set to false
>     at the same time as dead, so no reason to check both.
>   - A valid hmm has a valid hmm->mm.
> 
> Also, add the READ_ONCE for range->valid as there is no lock held here.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>   include/linux/hmm.h | 12 ++----------
>   1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 4ee3acabe5ed22..2ab35b40992b24 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
>   static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
>   					      unsigned long timeout)
>   {
> -	/* Check if mm is dead ? */
> -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> -		range->valid = false;
> -		return false;
> -	}
> -	if (range->valid)
> -		return true;
> -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> +	wait_event_timeout(range->hmm->wq, range->valid,
>   			   msecs_to_jiffies(timeout));
> -	/* Return current valid status just in case we get lucky */
> -	return range->valid;
> +	return READ_ONCE(range->valid);
>   }
>   
>   /*
> 

Since we are simplifying things, perhaps we should consider merging
hmm_range_wait_until_valid() into hmm_range_register() and
removing hmm_range_wait_until_valid() since the pattern
is to always call the two together.

In any case, this looks OK to me so you can add
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Jason Gunthorpe June 7, 2019, 7:13 p.m. UTC | #4
On Fri, Jun 07, 2019 at 12:01:45PM -0700, Ralph Campbell wrote:
> 
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > The wait_event_timeout macro already tests the condition as its first
> > action, so there is no reason to open code another version of this, all
> > that does is skip the might_sleep() debugging in common cases, which is
> > not helpful.
> > 
> > Further, based on prior patches, we can no simplify the required condition
> > test:
> >   - If range is valid memory then so is range->hmm
> >   - If hmm_release() has run then range->valid is set to false
> >     at the same time as dead, so no reason to check both.
> >   - A valid hmm has a valid hmm->mm.
> > 
> > Also, add the READ_ONCE for range->valid as there is no lock held here.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> >   include/linux/hmm.h | 12 ++----------
> >   1 file changed, 2 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 4ee3acabe5ed22..2ab35b40992b24 100644
> > +++ b/include/linux/hmm.h
> > @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
> >   static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
> >   					      unsigned long timeout)
> >   {
> > -	/* Check if mm is dead ? */
> > -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> > -		range->valid = false;
> > -		return false;
> > -	}
> > -	if (range->valid)
> > -		return true;
> > -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> > +	wait_event_timeout(range->hmm->wq, range->valid,
> >   			   msecs_to_jiffies(timeout));
> > -	/* Return current valid status just in case we get lucky */
> > -	return range->valid;
> > +	return READ_ONCE(range->valid);
> >   }
> >   /*
> > 
> 
> Since we are simplifying things, perhaps we should consider merging
> hmm_range_wait_until_valid() info hmm_range_register() and
> removing hmm_range_wait_until_valid() since the pattern
> is to always call the two together.

? the hmm.rst shows the hmm_range_wait_until_valid being called in the
(ret == -EAGAIN) path. It is confusing because it should really just
have the again label moved up above hmm_range_wait_until_valid() as
even if we get the driver lock it could still be a long wait for the
colliding invalidation to clear.

What I want to get to is a pattern like this:

pagefault():

   hmm_range_register(&range);
again:
   /* On the slow path, if we appear to be live locked then we get
      the write side of mmap_sem which will break the live lock,
      otherwise this gets the read lock */
   if (hmm_range_start_and_lock(&range))
         goto err;

   lockdep_assert_held(range->mm->mmap_sem);

   // Optional: Avoid useless expensive work
   if (hmm_range_needs_retry(&range))
      goto again;
   hmm_range_(touch vmas)

   take_lock(driver->update);
   if (hmm_range_end(&range)) {
       release_lock(driver->update);
       goto again;
   }
   // Finish driver updates
   release_lock(driver->update);

   // Releases mmap_sem
   hmm_range_unregister_and_unlock(&range);

What do you think? 

Is it clear?

Jason
Ralph Campbell June 7, 2019, 8:21 p.m. UTC | #5
On 6/7/19 12:13 PM, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2019 at 12:01:45PM -0700, Ralph Campbell wrote:
>>
>> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>>
>>> The wait_event_timeout macro already tests the condition as its first
>>> action, so there is no reason to open code another version of this, all
>>> that does is skip the might_sleep() debugging in common cases, which is
>>> not helpful.
>>>
>>> Further, based on prior patches, we can no simplify the required condition
>>> test:
>>>    - If range is valid memory then so is range->hmm
>>>    - If hmm_release() has run then range->valid is set to false
>>>      at the same time as dead, so no reason to check both.
>>>    - A valid hmm has a valid hmm->mm.
>>>
>>> Also, add the READ_ONCE for range->valid as there is no lock held here.
>>>
>>> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
>>> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
>>>    include/linux/hmm.h | 12 ++----------
>>>    1 file changed, 2 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>>> index 4ee3acabe5ed22..2ab35b40992b24 100644
>>> +++ b/include/linux/hmm.h
>>> @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
>>>    static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
>>>    					      unsigned long timeout)
>>>    {
>>> -	/* Check if mm is dead ? */
>>> -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
>>> -		range->valid = false;
>>> -		return false;
>>> -	}
>>> -	if (range->valid)
>>> -		return true;
>>> -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
>>> +	wait_event_timeout(range->hmm->wq, range->valid,
>>>    			   msecs_to_jiffies(timeout));
>>> -	/* Return current valid status just in case we get lucky */
>>> -	return range->valid;
>>> +	return READ_ONCE(range->valid);
>>>    }
>>>    /*
>>>
>>
>> Since we are simplifying things, perhaps we should consider merging
>> hmm_range_wait_until_valid() info hmm_range_register() and
>> removing hmm_range_wait_until_valid() since the pattern
>> is to always call the two together.
> 
> ? the hmm.rst shows the hmm_range_wait_until_valid being called in the
> (ret == -EAGAIN) path. It is confusing because it should really just
> have the again label moved up above hmm_range_wait_until_valid() as
> even if we get the driver lock it could still be a long wait for the
> colliding invalidation to clear.
> 
> What I want to get to is a pattern like this:
> 
> pagefault():
> 
>     hmm_range_register(&range);
> again:
>     /* On the slow path, if we appear to be live locked then we get
>        the write side of mmap_sem which will break the live lock,
>        otherwise this gets the read lock */
>     if (hmm_range_start_and_lock(&range))
>           goto err;
> 
>     lockdep_assert_held(range->mm->mmap_sem);
> 
>     // Optional: Avoid useless expensive work
>     if (hmm_range_needs_retry(&range))
>        goto again;
>     hmm_range_(touch vmas)
> 
>     take_lock(driver->update);
>     if (hmm_range_end(&range) {
>         release_lock(driver->update);
>         goto again;
>     }
>     // Finish driver updates
>     release_lock(driver->update);
> 
>     // Releases mmap_sem
>     hmm_range_unregister_and_unlock(&range);
> 
> What do you think?
> 
> Is it clear?
> 
> Jason
> 

Are you talking about acquiring mmap_sem in hmm_range_start_and_lock()?
Usually, the fault code has to lock mmap_sem for read in order to
call find_vma() so it can set range.vma.
If HMM drops mmap_sem - which I don't think it should, just return an
error to tell the caller to drop mmap_sem and retry - the find_vma()
will need to be repeated as well.
I'm also not sure about acquiring the mmap_sem for write as a way to
mitigate thrashing. It seems to me that if a device and a CPU are
both faulting on the same page, some sort of backoff delay is needed
to let one side or the other make some progress.

Thrashing mitigation and how migrate_vma() plays in this is a
deep topic for thought.
Jason Gunthorpe June 7, 2019, 8:44 p.m. UTC | #6
On Fri, Jun 07, 2019 at 01:21:12PM -0700, Ralph Campbell wrote:

> > What I want to get to is a pattern like this:
> > 
> > pagefault():
> > 
> >     hmm_range_register(&range);
> > again:
> >     /* On the slow path, if we appear to be live locked then we get
> >        the write side of mmap_sem which will break the live lock,
> >        otherwise this gets the read lock */
> >     if (hmm_range_start_and_lock(&range))
> >           goto err;
> > 
> >     lockdep_assert_held(range->mm->mmap_sem);
> > 
> >     // Optional: Avoid useless expensive work
> >     if (hmm_range_needs_retry(&range))
> >        goto again;
> >     hmm_range_(touch vmas)
> > 
> >     take_lock(driver->update);
> >     if (hmm_range_end(&range) {
> >         release_lock(driver->update);
> >         goto again;
> >     }
> >     // Finish driver updates
> >     release_lock(driver->update);
> > 
> >     // Releases mmap_sem
> >     hmm_range_unregister_and_unlock(&range);
> > 
> > What do you think?
> > 
> > Is it clear?
> > 
> > Jason
> > 
> 
> Are you talking about acquiring mmap_sem in hmm_range_start_and_lock()?
> Usually, the fault code has to lock mmap_sem for read in order to
> call find_vma() so it can set range.vma.

> If HMM drops mmap_sem - which I don't think it should, just return an
> error to tell the caller to drop mmap_sem and retry - the find_vma()
> will need to be repeated as well.

Overall I don't think it makes a lot of sense to sleep for retry in
hmm_range_start_and_lock() while holding mmap_sem. It would be better
to drop that lock, sleep, then re-acquire it as part of the hmm logic.

The find_vma should be done inside the critical section created by
hmm_range_start_and_lock(), not before it. If we are retrying then we
already slept and the additional CPU cost to repeat the find_vma is
immaterial, IMHO?

Do you see a reason why the find_vma() ever needs to be before the
'again' in my above example? range.vma does not need to be set for
range_register.

> I'm also not sure about acquiring the mmap_sem for write as way to
> mitigate thrashing. It seems to me that if a device and a CPU are
> both faulting on the same page,

One of the reasons to prefer this approach is that it means we don't
need to keep track of which ranges we are faulting, and if there is a
lot of *unrelated* fault activity (unlikely?) we can resolve it using
mmap sem instead of this elaborate ranges scheme and related
locking. 

This would reduce the overall work in the page fault and
invalidate_start/end paths for the common uncontended cases.

> some sort of backoff delay is needed to let one side or the other
> make some progress.

What the write side of the mmap_sem would do is force the CPU and
device to cleanly take turns. Once the device pages are registered
under the write side the CPU will have to wait in invalidate_start for
the driver to complete a shootdown, then the whole thing starts all
over again. 

It is certainly imaginable something could have a 'min life' timer for
a device mapping and hold mm invalidate_start, and device pagefault
for that min time to promote better sharing.

But, if we don't use the mmap_sem then we can livelock and the device
will see an unrecoverable error from the timeout which means we have
risk that under load the system will simply obscurely fail. This seems
unacceptable to me..

Particularly since for the ODP use case the issue is not thrashing
migration as a GPU might have, but simple system stability under swap
load. We do not want the ODP pagefault to permanently fail due to
timeout if the VMA is still valid..

Jason
Ralph Campbell June 7, 2019, 10:13 p.m. UTC | #7
On 6/7/19 1:44 PM, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2019 at 01:21:12PM -0700, Ralph Campbell wrote:
> 
>>> What I want to get to is a pattern like this:
>>>
>>> pagefault():
>>>
>>>      hmm_range_register(&range);
>>> again:
>>>      /* On the slow path, if we appear to be live locked then we get
>>>         the write side of mmap_sem which will break the live lock,
>>>         otherwise this gets the read lock */
>>>      if (hmm_range_start_and_lock(&range))
>>>            goto err;
>>>
>>>      lockdep_assert_held(range->mm->mmap_sem);
>>>
>>>      // Optional: Avoid useless expensive work
>>>      if (hmm_range_needs_retry(&range))
>>>         goto again;
>>>      hmm_range_(touch vmas)
>>>
>>>      take_lock(driver->update);
>>>      if (hmm_range_end(&range) {
>>>          release_lock(driver->update);
>>>          goto again;
>>>      }
>>>      // Finish driver updates
>>>      release_lock(driver->update);
>>>
>>>      // Releases mmap_sem
>>>      hmm_range_unregister_and_unlock(&range);
>>>
>>> What do you think?
>>>
>>> Is it clear?
>>>
>>> Jason
>>>
>>
>> Are you talking about acquiring mmap_sem in hmm_range_start_and_lock()?
>> Usually, the fault code has to lock mmap_sem for read in order to
>> call find_vma() so it can set range.vma.
> 
>> If HMM drops mmap_sem - which I don't think it should, just return an
>> error to tell the caller to drop mmap_sem and retry - the find_vma()
>> will need to be repeated as well.
> 
> Overall I don't think it makes a lot of sense to sleep for retry in
> hmm_range_start_and_lock() while holding mmap_sem. It would be better
> to drop that lock, sleep, then re-acquire it as part of the hmm logic.
> 
> The find_vma should be done inside the critical section created by
> hmm_range_start_and_lock(), not before it. If we are retrying then we
> already slept and the additional CPU cost to repeat the find_vma is
> immaterial, IMHO?
> 
> Do you see a reason why the find_vma() ever needs to be before the
> 'again' in my above example? range.vma does not need to be set for
> range_register.

Yes, for the GPU case, there can be many faults in an event queue
and the goal is to try to handle more than one page at a time.
The vma is needed to limit the amount of coalescing and checking
for pages that could be speculatively migrated or mapped.

>> I'm also not sure about acquiring the mmap_sem for write as way to
>> mitigate thrashing. It seems to me that if a device and a CPU are
>> both faulting on the same page,
> 
> One of the reasons to prefer this approach is that it means we don't
> need to keep track of which ranges we are faulting, and if there is a
> lot of *unrelated* fault activity (unlikely?) we can resolve it using
> mmap sem instead of this elaborate ranges scheme and related
> locking.
> 
> This would reduce the overall work in the page fault and
> invalidate_start/end paths for the common uncontended cases.
> 
>> some sort of backoff delay is needed to let one side or the other
>> make some progress.
> 
> What the write side of the mmap_sem would do is force the CPU and
> device to cleanly take turns. Once the device pages are registered
> under the write side the CPU will have to wait in invalidate_start for
> the driver to complete a shootdown, then the whole thing starts all
> over again.
> 
> It is certainly imaginable something could have a 'min life' timer for
> a device mapping and hold mm invalidate_start, and device pagefault
> for that min time to promote better sharing.
> 
> But, if we don't use the mmap_sem then we can livelock and the device
> will see an unrecoverable error from the timeout which means we have
> risk that under load the system will simply obscurely fail. This seems
> unacceptable to me..
> 
> Particularly since for the ODP use case the issue is not trashing
> migration as a GPU might have, but simple system stability under swap
> load. We do not want the ODP pagefault to permanently fail due to
> timeout if the VMA is still valid..
> 
> Jason
> 

OK, I understand.
If you come up with a set of changes, I can try testing them.
Jason Gunthorpe June 8, 2019, 1:47 a.m. UTC | #8
On Fri, Jun 07, 2019 at 03:13:00PM -0700, Ralph Campbell wrote:
> > Do you see a reason why the find_vma() ever needs to be before the
> > 'again' in my above example? range.vma does not need to be set for
> > range_register.
> 
> Yes, for the GPU case, there can be many faults in an event queue
> and the goal is to try to handle more than one page at a time.
> The vma is needed to limit the amount of coalescing and checking
> for pages that could be speculatively migrated or mapped.

I'd need to see an algorithm sketch to see what you are thinking..

But, I guess a driver would have to figure out a list of what virtual
pages it wants to fault under the mmap_sem (maybe using find_vma, etc),
then drop mmap_sem, and start processing those pages for mirroring
under the hmm side.

i.e., they are two separate unrelated tasks.

I looked at the hmm.rst again, and that reference algorithm is already
showing that that upon retry the mmap_sem is released:

      take_lock(driver->update);
      if (!range.valid) {
          release_lock(driver->update);
          up_read(&mm->mmap_sem);
          goto again;

So a driver cannot assume that any VMA related work done under the
mmap before the hmm 'critical section' can carry into the hmm critical
section as the lock can be released upon retry and invalidate all that
data..

Forcing the hmm_range_start_and_lock() to acquire the mmap_sem is a
rough way to force the driver author to realize there are two locking
domains and lock protected information cannot cross between.

> OK, I understand.
> If you come up with a set of changes, I can try testing them.

Thanks, I made a sketch of the patch; I'll have to get back to it after
this series is done.

Jason

Patch

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 4ee3acabe5ed22..2ab35b40992b24 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -218,17 +218,9 @@  static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
 static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
 					      unsigned long timeout)
 {
-	/* Check if mm is dead ? */
-	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
-		range->valid = false;
-		return false;
-	}
-	if (range->valid)
-		return true;
-	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
+	wait_event_timeout(range->hmm->wq, range->valid,
 			   msecs_to_jiffies(timeout));
-	/* Return current valid status just in case we get lucky */
-	return range->valid;
+	return READ_ONCE(range->valid);
 }
 
 /*