| Message ID | 20190606184438.31646-6-jgg@ziepe.ca (mailing list archive) |
|---|---|
| State | Superseded |
| Delegated to | Jason Gunthorpe |
| Series | Various revisions from a locking/code review |
On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> The wait_event_timeout macro already tests the condition as its first
> action, so there is no reason to open code another version of this, all
> that does is skip the might_sleep() debugging in common cases, which is
> not helpful.
>
> Further, based on prior patches, we can no simplify the required condition

"now simplify"

> test:
> - If range is valid memory then so is range->hmm
> - If hmm_release() has run then range->valid is set to false
>   at the same time as dead, so no reason to check both.
> - A valid hmm has a valid hmm->mm.
>
> Also, add the READ_ONCE for range->valid as there is no lock held here.
>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  include/linux/hmm.h | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 4ee3acabe5ed22..2ab35b40992b24 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
>  static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
>  					      unsigned long timeout)
>  {
> -	/* Check if mm is dead ? */
> -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> -		range->valid = false;
> -		return false;
> -	}
> -	if (range->valid)
> -		return true;
> -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> +	wait_event_timeout(range->hmm->wq, range->valid,
> 			   msecs_to_jiffies(timeout));
> -	/* Return current valid status just in case we get lucky */
> -	return range->valid;
> +	return READ_ONCE(range->valid);

Just to ensure that I actually understand the model: I'm assuming that the
READ_ONCE is there solely to ensure that range->valid is read *after* the
wait_event_timeout() returns. Is that correct?

>  }
>
>  /*

In any case, it looks good, so:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
On Thu, Jun 06, 2019 at 08:06:52PM -0700, John Hubbard wrote:
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> >
> > The wait_event_timeout macro already tests the condition as its first
> > action, so there is no reason to open code another version of this, all
> > that does is skip the might_sleep() debugging in common cases, which is
> > not helpful.
> >
> > Further, based on prior patches, we can no simplify the required condition
>
> "now simplify"
>
> > test:
> > - If range is valid memory then so is range->hmm
> > - If hmm_release() has run then range->valid is set to false
> >   at the same time as dead, so no reason to check both.
> > - A valid hmm has a valid hmm->mm.
> >
> > Also, add the READ_ONCE for range->valid as there is no lock held here.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> >  include/linux/hmm.h | 12 ++----------
> >  1 file changed, 2 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 4ee3acabe5ed22..2ab35b40992b24 100644
> > +++ b/include/linux/hmm.h
> > @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
> >  static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
> >  					      unsigned long timeout)
> >  {
> > -	/* Check if mm is dead ? */
> > -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> > -		range->valid = false;
> > -		return false;
> > -	}
> > -	if (range->valid)
> > -		return true;
> > -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> > +	wait_event_timeout(range->hmm->wq, range->valid,
> > 			   msecs_to_jiffies(timeout));
> > -	/* Return current valid status just in case we get lucky */
> > -	return range->valid;
> > +	return READ_ONCE(range->valid);
>
> Just to ensure that I actually understand the model: I'm assuming that the
> READ_ONCE is there solely to ensure that range->valid is read *after* the
> wait_event_timeout() returns. Is that correct?

No, wait_event_timeout already has internal barriers that make sure
things don't leak across it.

The READ_ONCE is required any time a thread is reading a value that
another thread can be concurrently changing - ie in this case there is
no lock protecting range->valid so the write side could be running.

Without the READ_ONCE the compiler is allowed to read the value twice
and assume it gets the same result, which may not be true with a
parallel writer, and thus may compromise the control flow in some
unknown way.

It is also good documentation for the locking scheme in use as it
marks shared data that is not being locked.

However, now that dead is gone we can just write the above more simply
as:

static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
					      unsigned long timeout)
{
	return wait_event_timeout(range->hmm->wq, range->valid,
				  msecs_to_jiffies(timeout)) != 0;
}

Which relies on the internal barriers of wait_event_timeout, I'll fix
it up..

Thanks,
Jason
On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> The wait_event_timeout macro already tests the condition as its first
> action, so there is no reason to open code another version of this, all
> that does is skip the might_sleep() debugging in common cases, which is
> not helpful.
>
> Further, based on prior patches, we can now simplify the required condition
> test:
> - If range is valid memory then so is range->hmm
> - If hmm_release() has run then range->valid is set to false
>   at the same time as dead, so no reason to check both.
> - A valid hmm has a valid hmm->mm.
>
> Also, add the READ_ONCE for range->valid as there is no lock held here.
>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  include/linux/hmm.h | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 4ee3acabe5ed22..2ab35b40992b24 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
>  static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
>  					      unsigned long timeout)
>  {
> -	/* Check if mm is dead ? */
> -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> -		range->valid = false;
> -		return false;
> -	}
> -	if (range->valid)
> -		return true;
> -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> +	wait_event_timeout(range->hmm->wq, range->valid,
> 			   msecs_to_jiffies(timeout));
> -	/* Return current valid status just in case we get lucky */
> -	return range->valid;
> +	return READ_ONCE(range->valid);
>  }
>
>  /*

Since we are simplifying things, perhaps we should consider merging
hmm_range_wait_until_valid() into hmm_range_register() and removing
hmm_range_wait_until_valid() since the pattern is to always call the
two together.

In any case, this looks OK to me so you can add

Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
On Fri, Jun 07, 2019 at 12:01:45PM -0700, Ralph Campbell wrote:
>
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> >
> > The wait_event_timeout macro already tests the condition as its first
> > action, so there is no reason to open code another version of this, all
> > that does is skip the might_sleep() debugging in common cases, which is
> > not helpful.
> >
> > Further, based on prior patches, we can now simplify the required condition
> > test:
> > - If range is valid memory then so is range->hmm
> > - If hmm_release() has run then range->valid is set to false
> >   at the same time as dead, so no reason to check both.
> > - A valid hmm has a valid hmm->mm.
> >
> > Also, add the READ_ONCE for range->valid as there is no lock held here.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> >  include/linux/hmm.h | 12 ++----------
> >  1 file changed, 2 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 4ee3acabe5ed22..2ab35b40992b24 100644
> > +++ b/include/linux/hmm.h
> > @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
> >  static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
> >  					      unsigned long timeout)
> >  {
> > -	/* Check if mm is dead ? */
> > -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> > -		range->valid = false;
> > -		return false;
> > -	}
> > -	if (range->valid)
> > -		return true;
> > -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> > +	wait_event_timeout(range->hmm->wq, range->valid,
> > 			   msecs_to_jiffies(timeout));
> > -	/* Return current valid status just in case we get lucky */
> > -	return range->valid;
> > +	return READ_ONCE(range->valid);
> >  }
> >
> >  /*
>
> Since we are simplifying things, perhaps we should consider merging
> hmm_range_wait_until_valid() into hmm_range_register() and
> removing hmm_range_wait_until_valid() since the pattern
> is to always call the two together.

? the hmm.rst shows the hmm_range_wait_until_valid being called in the
(ret == -EAGAIN) path. It is confusing because it should really just
have the again label moved up above hmm_range_wait_until_valid() as
even if we get the driver lock it could still be a long wait for the
colliding invalidation to clear.

What I want to get to is a pattern like this:

pagefault():

	hmm_range_register(&range);
again:
	/* On the slow path, if we appear to be live locked then we get
	   the write side of mmap_sem which will break the live lock,
	   otherwise this gets the read lock */
	if (hmm_range_start_and_lock(&range))
		goto err;

	lockdep_assert_held(range->mm->mmap_sem);

	// Optional: Avoid useless expensive work
	if (hmm_range_needs_retry(&range))
		goto again;
	hmm_range_(touch vmas)

	take_lock(driver->update);
	if (hmm_range_end(&range)) {
		release_lock(driver->update);
		goto again;
	}
	// Finish driver updates
	release_lock(driver->update);

	// Releases mmap_sem
	hmm_range_unregister_and_unlock(&range);

What do you think?

Is it clear?

Jason
On 6/7/19 12:13 PM, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2019 at 12:01:45PM -0700, Ralph Campbell wrote:
>>
>> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>>
>>> The wait_event_timeout macro already tests the condition as its first
>>> action, so there is no reason to open code another version of this, all
>>> that does is skip the might_sleep() debugging in common cases, which is
>>> not helpful.
>>>
>>> Further, based on prior patches, we can now simplify the required condition
>>> test:
>>> - If range is valid memory then so is range->hmm
>>> - If hmm_release() has run then range->valid is set to false
>>>   at the same time as dead, so no reason to check both.
>>> - A valid hmm has a valid hmm->mm.
>>>
>>> Also, add the READ_ONCE for range->valid as there is no lock held here.
>>>
>>> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
>>> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
>>>  include/linux/hmm.h | 12 ++----------
>>>  1 file changed, 2 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>>> index 4ee3acabe5ed22..2ab35b40992b24 100644
>>> +++ b/include/linux/hmm.h
>>> @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
>>>  static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
>>>  					      unsigned long timeout)
>>>  {
>>> -	/* Check if mm is dead ? */
>>> -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
>>> -		range->valid = false;
>>> -		return false;
>>> -	}
>>> -	if (range->valid)
>>> -		return true;
>>> -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
>>> +	wait_event_timeout(range->hmm->wq, range->valid,
>>> 			   msecs_to_jiffies(timeout));
>>> -	/* Return current valid status just in case we get lucky */
>>> -	return range->valid;
>>> +	return READ_ONCE(range->valid);
>>>  }
>>>
>>>  /*
>>>
>>
>> Since we are simplifying things, perhaps we should consider merging
>> hmm_range_wait_until_valid() into hmm_range_register() and
>> removing hmm_range_wait_until_valid() since the pattern
>> is to always call the two together.
>
> ? the hmm.rst shows the hmm_range_wait_until_valid being called in the
> (ret == -EAGAIN) path. It is confusing because it should really just
> have the again label moved up above hmm_range_wait_until_valid() as
> even if we get the driver lock it could still be a long wait for the
> colliding invalidation to clear.
>
> What I want to get to is a pattern like this:
>
> pagefault():
>
> 	hmm_range_register(&range);
> again:
> 	/* On the slow path, if we appear to be live locked then we get
> 	   the write side of mmap_sem which will break the live lock,
> 	   otherwise this gets the read lock */
> 	if (hmm_range_start_and_lock(&range))
> 		goto err;
>
> 	lockdep_assert_held(range->mm->mmap_sem);
>
> 	// Optional: Avoid useless expensive work
> 	if (hmm_range_needs_retry(&range))
> 		goto again;
> 	hmm_range_(touch vmas)
>
> 	take_lock(driver->update);
> 	if (hmm_range_end(&range)) {
> 		release_lock(driver->update);
> 		goto again;
> 	}
> 	// Finish driver updates
> 	release_lock(driver->update);
>
> 	// Releases mmap_sem
> 	hmm_range_unregister_and_unlock(&range);
>
> What do you think?
>
> Is it clear?
>
> Jason

Are you talking about acquiring mmap_sem in hmm_range_start_and_lock()?
Usually, the fault code has to lock mmap_sem for read in order to
call find_vma() so it can set range.vma.

If HMM drops mmap_sem - which I don't think it should, just return an
error to tell the caller to drop mmap_sem and retry - the find_vma()
will need to be repeated as well.

I'm also not sure about acquiring the mmap_sem for write as a way to
mitigate thrashing. It seems to me that if a device and a CPU are
both faulting on the same page, some sort of backoff delay is needed
to let one side or the other make some progress. Thrashing mitigation
and how migrate_vma() plays in this is a deep topic for thought.
On Fri, Jun 07, 2019 at 01:21:12PM -0700, Ralph Campbell wrote:
> > What I want to get to is a pattern like this:
> >
> > pagefault():
> >
> > 	hmm_range_register(&range);
> > again:
> > 	/* On the slow path, if we appear to be live locked then we get
> > 	   the write side of mmap_sem which will break the live lock,
> > 	   otherwise this gets the read lock */
> > 	if (hmm_range_start_and_lock(&range))
> > 		goto err;
> >
> > 	lockdep_assert_held(range->mm->mmap_sem);
> >
> > 	// Optional: Avoid useless expensive work
> > 	if (hmm_range_needs_retry(&range))
> > 		goto again;
> > 	hmm_range_(touch vmas)
> >
> > 	take_lock(driver->update);
> > 	if (hmm_range_end(&range)) {
> > 		release_lock(driver->update);
> > 		goto again;
> > 	}
> > 	// Finish driver updates
> > 	release_lock(driver->update);
> >
> > 	// Releases mmap_sem
> > 	hmm_range_unregister_and_unlock(&range);
> >
> > What do you think?
> >
> > Is it clear?
> >
> > Jason
>
> Are you talking about acquiring mmap_sem in hmm_range_start_and_lock()?
> Usually, the fault code has to lock mmap_sem for read in order to
> call find_vma() so it can set range.vma.
>
> If HMM drops mmap_sem - which I don't think it should, just return an
> error to tell the caller to drop mmap_sem and retry - the find_vma()
> will need to be repeated as well.

Overall I don't think it makes a lot of sense to sleep for retry in
hmm_range_start_and_lock() while holding mmap_sem. It would be better
to drop that lock, sleep, then re-acquire it as part of the hmm logic.

The find_vma should be done inside the critical section created by
hmm_range_start_and_lock(), not before it. If we are retrying then we
already slept and the additional CPU cost to repeat the find_vma is
immaterial, IMHO?

Do you see a reason why the find_vma() ever needs to be before the
'again' in my above example? range.vma does not need to be set for
range_register.

> I'm also not sure about acquiring the mmap_sem for write as a way to
> mitigate thrashing. It seems to me that if a device and a CPU are
> both faulting on the same page,

One of the reasons to prefer this approach is that it means we don't
need to keep track of which ranges we are faulting, and if there is a
lot of *unrelated* fault activity (unlikely?) we can resolve it using
mmap_sem instead of this elaborate ranges scheme and related locking.

This would reduce the overall work in the page fault and
invalidate_start/end paths for the common uncontended cases.

> some sort of backoff delay is needed to let one side or the other
> make some progress.

What the write side of the mmap_sem would do is force the CPU and
device to cleanly take turns. Once the device pages are registered
under the write side the CPU will have to wait in invalidate_start for
the driver to complete a shootdown, then the whole thing starts all
over again.

It is certainly imaginable something could have a 'min life' timer for
a device mapping and hold mm invalidate_start, and device pagefault
for that min time to promote better sharing.

But, if we don't use the mmap_sem then we can livelock and the device
will see an unrecoverable error from the timeout, which means we risk
that under load the system will simply obscurely fail. This seems
unacceptable to me..

Particularly since for the ODP use case the issue is not thrashing
migration as a GPU might have, but simple system stability under swap
load. We do not want the ODP pagefault to permanently fail due to
timeout if the VMA is still valid..

Jason
On 6/7/19 1:44 PM, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2019 at 01:21:12PM -0700, Ralph Campbell wrote:
>
>>> What I want to get to is a pattern like this:
>>>
>>> pagefault():
>>>
>>> 	hmm_range_register(&range);
>>> again:
>>> 	/* On the slow path, if we appear to be live locked then we get
>>> 	   the write side of mmap_sem which will break the live lock,
>>> 	   otherwise this gets the read lock */
>>> 	if (hmm_range_start_and_lock(&range))
>>> 		goto err;
>>>
>>> 	lockdep_assert_held(range->mm->mmap_sem);
>>>
>>> 	// Optional: Avoid useless expensive work
>>> 	if (hmm_range_needs_retry(&range))
>>> 		goto again;
>>> 	hmm_range_(touch vmas)
>>>
>>> 	take_lock(driver->update);
>>> 	if (hmm_range_end(&range)) {
>>> 		release_lock(driver->update);
>>> 		goto again;
>>> 	}
>>> 	// Finish driver updates
>>> 	release_lock(driver->update);
>>>
>>> 	// Releases mmap_sem
>>> 	hmm_range_unregister_and_unlock(&range);
>>>
>>> What do you think?
>>>
>>> Is it clear?
>>>
>>> Jason
>>>
>>
>> Are you talking about acquiring mmap_sem in hmm_range_start_and_lock()?
>> Usually, the fault code has to lock mmap_sem for read in order to
>> call find_vma() so it can set range.vma.
>
>> If HMM drops mmap_sem - which I don't think it should, just return an
>> error to tell the caller to drop mmap_sem and retry - the find_vma()
>> will need to be repeated as well.
>
> Overall I don't think it makes a lot of sense to sleep for retry in
> hmm_range_start_and_lock() while holding mmap_sem. It would be better
> to drop that lock, sleep, then re-acquire it as part of the hmm logic.
>
> The find_vma should be done inside the critical section created by
> hmm_range_start_and_lock(), not before it. If we are retrying then we
> already slept and the additional CPU cost to repeat the find_vma is
> immaterial, IMHO?
>
> Do you see a reason why the find_vma() ever needs to be before the
> 'again' in my above example? range.vma does not need to be set for
> range_register.

Yes, for the GPU case, there can be many faults in an event queue
and the goal is to try to handle more than one page at a time.
The vma is needed to limit the amount of coalescing and checking
for pages that could be speculatively migrated or mapped.

>> I'm also not sure about acquiring the mmap_sem for write as a way to
>> mitigate thrashing. It seems to me that if a device and a CPU are
>> both faulting on the same page,
>
> One of the reasons to prefer this approach is that it means we don't
> need to keep track of which ranges we are faulting, and if there is a
> lot of *unrelated* fault activity (unlikely?) we can resolve it using
> mmap_sem instead of this elaborate ranges scheme and related locking.
>
> This would reduce the overall work in the page fault and
> invalidate_start/end paths for the common uncontended cases.
>
>> some sort of backoff delay is needed to let one side or the other
>> make some progress.
>
> What the write side of the mmap_sem would do is force the CPU and
> device to cleanly take turns. Once the device pages are registered
> under the write side the CPU will have to wait in invalidate_start for
> the driver to complete a shootdown, then the whole thing starts all
> over again.
>
> It is certainly imaginable something could have a 'min life' timer for
> a device mapping and hold mm invalidate_start, and device pagefault
> for that min time to promote better sharing.
>
> But, if we don't use the mmap_sem then we can livelock and the device
> will see an unrecoverable error from the timeout, which means we risk
> that under load the system will simply obscurely fail. This seems
> unacceptable to me..
>
> Particularly since for the ODP use case the issue is not thrashing
> migration as a GPU might have, but simple system stability under swap
> load. We do not want the ODP pagefault to permanently fail due to
> timeout if the VMA is still valid..
>
> Jason

OK, I understand.

If you come up with a set of changes, I can try testing them.
On Fri, Jun 07, 2019 at 03:13:00PM -0700, Ralph Campbell wrote:
> > Do you see a reason why the find_vma() ever needs to be before the
> > 'again' in my above example? range.vma does not need to be set for
> > range_register.
>
> Yes, for the GPU case, there can be many faults in an event queue
> and the goal is to try to handle more than one page at a time.
> The vma is needed to limit the amount of coalescing and checking
> for pages that could be speculatively migrated or mapped.

I'd need to see an algorithm sketch to see what you are thinking..

But, I guess a driver would have to figure out a list of what virtual
pages it wants to fault under the mmap_sem (maybe using find_vma,
etc), then drop mmap_sem, and start processing those pages for
mirroring under the hmm side. ie they are two separate unrelated
tasks.

I looked at the hmm.rst again, and that reference algorithm is already
showing that upon retry the mmap_sem is released:

	take_lock(driver->update);
	if (!range.valid) {
		release_lock(driver->update);
		up_read(&mm->mmap_sem);
		goto again;

So a driver cannot assume that any VMA related work done under the
mmap_sem before the hmm 'critical section' can carry into the hmm
critical section, as the lock can be released upon retry and
invalidate all that data..

Forcing hmm_range_start_and_lock() to acquire the mmap_sem is a rough
way to force the driver author to realize there are two locking
domains and lock protected information cannot cross between them.

> OK, I understand.
> If you come up with a set of changes, I can try testing them.

Thanks, I made a sketch of the patch, I have to get back to it after
this series is done.

Jason
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 4ee3acabe5ed22..2ab35b40992b24 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
 static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
 					      unsigned long timeout)
 {
-	/* Check if mm is dead ? */
-	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
-		range->valid = false;
-		return false;
-	}
-	if (range->valid)
-		return true;
-	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
+	wait_event_timeout(range->hmm->wq, range->valid,
 			   msecs_to_jiffies(timeout));
-	/* Return current valid status just in case we get lucky */
-	return range->valid;
+	return READ_ONCE(range->valid);
 }
 
 /*