Message ID | 20191216195733.28353-2-rcampbell@nvidia.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/hmm/test: add self tests for HMM | expand |
On Mon, Dec 16, 2019 at 11:57:32AM -0800, Ralph Campbell wrote: > mmu_interval_notifier_insert() and mmu_interval_notifier_remove() can't > be called safely from inside the invalidate() callback. This is fine for > devices with explicit memory region register and unregister calls but it > is desirable from a programming model standpoint to not require explicit > memory region registration. Regions can be registered based on device > address faults but without a mechanism for updating or removing the mmu > interval notifiers in response to munmap(), the invalidation callbacks > will be for regions that are stale or apply to different mmaped regions. What we do in RDMA is drive the removal from a work queue, as we need a synchronize_srcu anyhow to serialize everything to do with destroying a part of the address space mirror. Is it really necessary to have all this stuff just to save doing something like a work queue? Also, I think we are not taking core kernel APIs like this with out an in-kernel user?? > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > index 9e6caa8ecd19..55fbefcdc564 100644 > +++ b/include/linux/mmu_notifier.h > @@ -233,11 +233,18 @@ struct mmu_notifier { > * @invalidate: Upon return the caller must stop using any SPTEs within this > * range. This function can sleep. Return false only if sleeping > * was required but mmu_notifier_range_blockable(range) is false. > + * @release: This function will be called when the mmu_interval_notifier > + * is removed from the interval tree. Defining this function also > + * allows mmu_interval_notifier_remove() and > + * mmu_interval_notifier_update() to be called from the > + * invalidate() callback function (i.e., they won't block waiting > + * for invalidations to finish. Having a function called remove that doesn't block seems like very poor choice of language, we've tended to use put to describe that operation. The difference is meaningful as people often create use after free bugs in drivers when presented with interfaces named 'remove' or 'destroy' that don't actually guarentee there is not going to be continued accesses to the memory. > */ > struct mmu_interval_notifier_ops { > bool (*invalidate)(struct mmu_interval_notifier *mni, > const struct mmu_notifier_range *range, > unsigned long cur_seq); > + void (*release)(struct mmu_interval_notifier *mni); > }; > > struct mmu_interval_notifier { > @@ -246,6 +253,8 @@ struct mmu_interval_notifier { > struct mm_struct *mm; > struct hlist_node deferred_item; > unsigned long invalidate_seq; > + unsigned long deferred_start; > + unsigned long deferred_last; I couldn't quite understand how something like this can work, what is preventing parallel updates? > +/** > + * mmu_interval_notifier_update - Update interval notifier end > + * @mni: Interval notifier to update > + * @start: New starting virtual address to monitor > + * @length: New length of the range to monitor > + * > + * This function updates the range being monitored. > + * If there is no release() function defined, the call will wait for the > + * update to finish before returning. > + */ > +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni, > + unsigned long start, unsigned long length) > +{ Update should probably be its own patch Jason
On 12/17/19 12:51 PM, Jason Gunthorpe wrote: > On Mon, Dec 16, 2019 at 11:57:32AM -0800, Ralph Campbell wrote: >> mmu_interval_notifier_insert() and mmu_interval_notifier_remove() can't >> be called safely from inside the invalidate() callback. This is fine for >> devices with explicit memory region register and unregister calls but it >> is desirable from a programming model standpoint to not require explicit >> memory region registration. Regions can be registered based on device >> address faults but without a mechanism for updating or removing the mmu >> interval notifiers in response to munmap(), the invalidation callbacks >> will be for regions that are stale or apply to different mmaped regions. > > What we do in RDMA is drive the removal from a work queue, as we need > a synchronize_srcu anyhow to serialize everything to do with > destroying a part of the address space mirror. > > Is it really necessary to have all this stuff just to save doing > something like a work queue? Well, the invalidates already have to use the driver lock to synchronize so handling the range tracking updates semi-synchronously seems more straightforward to me. Do you feel strongly that adding a work queue is the right way to handle this? > Also, I think we are not taking core kernel APIs like this with out an > in-kernel user?? Right. I was looking for feedback before updating nouveau to use it. >> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h >> index 9e6caa8ecd19..55fbefcdc564 100644 >> +++ b/include/linux/mmu_notifier.h >> @@ -233,11 +233,18 @@ struct mmu_notifier { >> * @invalidate: Upon return the caller must stop using any SPTEs within this >> * range. This function can sleep. Return false only if sleeping >> * was required but mmu_notifier_range_blockable(range) is false. >> + * @release: This function will be called when the mmu_interval_notifier >> + * is removed from the interval tree. Defining this function also >> + * allows mmu_interval_notifier_remove() and >> + * mmu_interval_notifier_update() to be called from the >> + * invalidate() callback function (i.e., they won't block waiting >> + * for invalidations to finish. > > Having a function called remove that doesn't block seems like very > poor choice of language, we've tended to use put to describe that > operation. > > The difference is meaningful as people often create use after free > bugs in drivers when presented with interfaces named 'remove' or > 'destroy' that don't actually guarentee there is not going to be > continued accesses to the memory. OK. I can rename it put(). >> */ >> struct mmu_interval_notifier_ops { >> bool (*invalidate)(struct mmu_interval_notifier *mni, >> const struct mmu_notifier_range *range, >> unsigned long cur_seq); >> + void (*release)(struct mmu_interval_notifier *mni); >> }; >> >> struct mmu_interval_notifier { >> @@ -246,6 +253,8 @@ struct mmu_interval_notifier { >> struct mm_struct *mm; >> struct hlist_node deferred_item; >> unsigned long invalidate_seq; >> + unsigned long deferred_start; >> + unsigned long deferred_last; > > I couldn't quite understand how something like this can work, what is > preventing parallel updates? It is serialized by the struct mmu_notifier_mm lock. If there are no tasks walking the interval tree, the update happens synchronously under the lock. If there are walkers, the start/last values are stored under the lock and the last caller's values are used to update the interval tree when the last walker finishes (under the lock again). >> +/** >> + * mmu_interval_notifier_update - Update interval notifier end >> + * @mni: Interval notifier to update >> + * @start: New starting virtual address to monitor >> + * @length: New length of the range to monitor >> + * >> + * This function updates the range being monitored. >> + * If there is no release() function defined, the call will wait for the >> + * update to finish before returning. >> + */ >> +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni, >> + unsigned long start, unsigned long length) >> +{ > > Update should probably be its own patch > > Jason OK. Thanks for the review.
On Mon, Dec 16, 2019 at 11:57:32AM -0800, Ralph Campbell wrote: > mmu_interval_notifier_insert() and mmu_interval_notifier_remove() can't > be called safely from inside the invalidate() callback. This is fine for > devices with explicit memory region register and unregister calls but it > is desirable from a programming model standpoint to not require explicit > memory region registration. Regions can be registered based on device > address faults but without a mechanism for updating or removing the mmu > interval notifiers in response to munmap(), the invalidation callbacks > will be for regions that are stale or apply to different mmaped regions. > > The invalidate() callback provides the necessary information (i.e., > the event type MMU_NOTIFY_UNMAP) so add insert, remove, and update > functions that are safe to call from the invalidate() callback by > extending the work done in mn_itree_inv_end() to update the interval tree > when it is not being traversed. > > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> > include/linux/mmu_notifier.h | 15 +++ > mm/mmu_notifier.c | 196 ++++++++++++++++++++++++++++++----- > 2 files changed, 186 insertions(+), 25 deletions(-) > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > index 9e6caa8ecd19..55fbefcdc564 100644 > +++ b/include/linux/mmu_notifier.h > @@ -233,11 +233,18 @@ struct mmu_notifier { > * @invalidate: Upon return the caller must stop using any SPTEs within this > * range. This function can sleep. Return false only if sleeping > * was required but mmu_notifier_range_blockable(range) is false. > + * @release: This function will be called when the mmu_interval_notifier > + * is removed from the interval tree. Defining this function also > + * allows mmu_interval_notifier_remove() and > + * mmu_interval_notifier_update() to be called from the > + * invalidate() callback function (i.e., they won't block waiting > + * for invalidations to finish. > */ > struct mmu_interval_notifier_ops { > bool (*invalidate)(struct mmu_interval_notifier *mni, > const struct mmu_notifier_range *range, > unsigned long cur_seq); > + void (*release)(struct mmu_interval_notifier *mni); > }; > > struct mmu_interval_notifier { > @@ -246,6 +253,8 @@ struct mmu_interval_notifier { > struct mm_struct *mm; > struct hlist_node deferred_item; > unsigned long invalidate_seq; > + unsigned long deferred_start; > + unsigned long deferred_last; > }; > > #ifdef CONFIG_MMU_NOTIFIER > @@ -299,7 +308,13 @@ int mmu_interval_notifier_insert_locked( > struct mmu_interval_notifier *mni, struct mm_struct *mm, > unsigned long start, unsigned long length, > const struct mmu_interval_notifier_ops *ops); > +int mmu_interval_notifier_insert_safe( > + struct mmu_interval_notifier *mni, struct mm_struct *mm, > + unsigned long start, unsigned long length, > + const struct mmu_interval_notifier_ops *ops); > void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni); > +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni, > + unsigned long start, unsigned long length); > > /** > * mmu_interval_set_seq - Save the invalidation sequence > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c > index f76ea05b1cb0..c303285750b2 100644 > +++ b/mm/mmu_notifier.c > @@ -129,6 +129,7 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm) > { > struct mmu_interval_notifier *mni; > struct hlist_node *next; > + struct hlist_head removed_list; > > spin_lock(&mmn_mm->lock); > if (--mmn_mm->active_invalidate_ranges || > @@ -144,21 +145,47 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm) > * The inv_end incorporates a deferred mechanism like rtnl_unlock(). > * Adds and removes are queued until the final inv_end happens then > * they are progressed. This arrangement for tree updates is used to > - * avoid using a blocking lock during invalidate_range_start. > + * avoid using a blocking lock while walking the interval tree. > */ > + INIT_HLIST_HEAD(&removed_list); > hlist_for_each_entry_safe(mni, next, &mmn_mm->deferred_list, > deferred_item) { > + hlist_del(&mni->deferred_item); > if (RB_EMPTY_NODE(&mni->interval_tree.rb)) > interval_tree_insert(&mni->interval_tree, > &mmn_mm->itree); > - else > + else { > interval_tree_remove(&mni->interval_tree, > &mmn_mm->itree); > - hlist_del(&mni->deferred_item); > + if (mni->deferred_last) { > + mni->interval_tree.start = mni->deferred_start; > + mni->interval_tree.last = mni->deferred_last; > + mni->deferred_last = 0; Technicaly we can have an interval starting at zero. I'd write it more like if (mni->updated_start == mni->updated_end) insert else remove ie an empty interval can't get a notification so it should be removed from the tree. I also like the name 'updated' better than deferred, it is a bit clearer.. Adding release should it's own patch. > @@ -970,14 +999,52 @@ int mmu_interval_notifier_insert_locked( > EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked); > > /** > - * mmu_interval_notifier_remove - Remove a interval notifier > - * @mni: Interval notifier to unregister > + * mmu_interval_notifier_insert_safe - Insert an interval notifier > + * @mni: Interval notifier to register > + * @mm : mm_struct to attach to > + * @start: Starting virtual address to monitor > + * @length: Length of the range to monitor > + * @ops: Interval notifier callback operations > * > - * This function must be paired with mmu_interval_notifier_insert(). It cannot > - * be called from any ops callback. > + * Return: -EINVAL if @mm hasn't been initialized for interval notifiers > + * by calling mmu_notifier_register(NULL, mm) or > + * __mmu_notifier_register(NULL, mm). > * > - * Once this returns ops callbacks are no longer running on other CPUs and > - * will not be called in future. > + * This function subscribes the interval notifier for notifications from the > + * mm. Upon return, the ops related to mmu_interval_notifier will be called > + * whenever an event that intersects with the given range occurs. > + * > + * This function is safe to call from the ops->invalidate() function. > + * Upon return, the mmu_interval_notifier may not be present in the interval > + * tree yet. The caller must use the normal interval notifier read flow via > + * mmu_interval_read_begin() to establish SPTEs for this range. So why do we need this? You can't call hmm_range_fault from a notifier. You just can't. So there should be no reason to create an interval from the notifier, do it from where you call hmm_range_fault, and it must be safe to obtain the mmap_sem from that thread. > +/** > + * mmu_interval_notifier_update - Update interval notifier end > + * @mni: Interval notifier to update > + * @start: New starting virtual address to monitor > + * @length: New length of the range to monitor > + * > + * This function updates the range being monitored. > + * If there is no release() function defined, the call will wait for the > + * update to finish before returning. > + */ > +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni, > + unsigned long start, unsigned long length) > +{ > + struct mm_struct *mm = mni->mm; > + struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm; > + unsigned long seq = 0; > + unsigned long last; > + > + if (length == 0 || check_add_overflow(start, length - 1, &last)) > + return -EOVERFLOW; > + > + spin_lock(&mmn_mm->lock); > + if (mn_itree_is_invalidating(mmn_mm)) { > + /* > + * Update is being called after insert put this on the > + * deferred list, but before the deferred list was processed. > + */ > + if (RB_EMPTY_NODE(&mni->interval_tree.rb)) { > + mni->interval_tree.start = start; > + mni->interval_tree.last = last; > + } else { > + if (!mni->deferred_last) > + hlist_add_head(&mni->deferred_item, > + &mmn_mm->deferred_list); > + mni->deferred_start = start; > + mni->deferred_last = last; > + } > + seq = mmn_mm->invalidate_seq; > + } else { > + WARN_ON(RB_EMPTY_NODE(&mni->interval_tree.rb)); > + interval_tree_remove(&mni->interval_tree, &mmn_mm->itree); > + mni->interval_tree.start = start; > + mni->interval_tree.last = last; > + interval_tree_insert(&mni->interval_tree, &mmn_mm->itree); > + } > + spin_unlock(&mmn_mm->lock); > + > + if (!mni->ops->release && seq) { > + /* > + * The possible sleep on progress in the invalidation requires > + * the caller not hold any locks held by invalidation > + * callbacks. > + */ > + lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); > + lock_map_release(&__mmu_notifier_invalidate_range_start_map); > wait_event(mmn_mm->wq, > READ_ONCE(mmn_mm->invalidate_seq) != seq); > + } > > - /* pairs with mmgrab in mmu_interval_notifier_insert() */ > - mmdrop(mm); > + return 0; > } > -EXPORT_SYMBOL_GPL(mmu_interval_notifier_remove); > +EXPORT_SYMBOL_GPL(mmu_interval_notifier_update); A 'update' should probably be the same as insert, it doesn't necessarily take effect until mmu_interval_read_begin(), so nothing contingent on release. As before, I'm not sure what to do with this. We need an in-kernel user for new apis, and I don't see a reason to make this more complicated for a test program. The test program should match one of the existing driver flows, so use the page table like scheme from ODP or the fixed lifetime scheme from AMDGPU/ODP Jason
On 1/9/20 11:48 AM, Jason Gunthorpe wrote: > On Mon, Dec 16, 2019 at 11:57:32AM -0800, Ralph Campbell wrote: >> mmu_interval_notifier_insert() and mmu_interval_notifier_remove() can't >> be called safely from inside the invalidate() callback. This is fine for >> devices with explicit memory region register and unregister calls but it >> is desirable from a programming model standpoint to not require explicit >> memory region registration. Regions can be registered based on device >> address faults but without a mechanism for updating or removing the mmu >> interval notifiers in response to munmap(), the invalidation callbacks >> will be for regions that are stale or apply to different mmaped regions. >> >> The invalidate() callback provides the necessary information (i.e., >> the event type MMU_NOTIFY_UNMAP) so add insert, remove, and update >> functions that are safe to call from the invalidate() callback by >> extending the work done in mn_itree_inv_end() to update the interval tree >> when it is not being traversed. >> >> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> >> include/linux/mmu_notifier.h | 15 +++ >> mm/mmu_notifier.c | 196 ++++++++++++++++++++++++++++++----- >> 2 files changed, 186 insertions(+), 25 deletions(-) >> >> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h >> index 9e6caa8ecd19..55fbefcdc564 100644 >> +++ b/include/linux/mmu_notifier.h >> @@ -233,11 +233,18 @@ struct mmu_notifier { >> * @invalidate: Upon return the caller must stop using any SPTEs within this >> * range. This function can sleep. Return false only if sleeping >> * was required but mmu_notifier_range_blockable(range) is false. >> + * @release: This function will be called when the mmu_interval_notifier >> + * is removed from the interval tree. Defining this function also >> + * allows mmu_interval_notifier_remove() and >> + * mmu_interval_notifier_update() to be called from the >> + * invalidate() callback function (i.e., they won't block waiting >> + * for invalidations to finish. >> */ >> struct mmu_interval_notifier_ops { >> bool (*invalidate)(struct mmu_interval_notifier *mni, >> const struct mmu_notifier_range *range, >> unsigned long cur_seq); >> + void (*release)(struct mmu_interval_notifier *mni); >> }; >> >> struct mmu_interval_notifier { >> @@ -246,6 +253,8 @@ struct mmu_interval_notifier { >> struct mm_struct *mm; >> struct hlist_node deferred_item; >> unsigned long invalidate_seq; >> + unsigned long deferred_start; >> + unsigned long deferred_last; >> }; >> >> #ifdef CONFIG_MMU_NOTIFIER >> @@ -299,7 +308,13 @@ int mmu_interval_notifier_insert_locked( >> struct mmu_interval_notifier *mni, struct mm_struct *mm, >> unsigned long start, unsigned long length, >> const struct mmu_interval_notifier_ops *ops); >> +int mmu_interval_notifier_insert_safe( >> + struct mmu_interval_notifier *mni, struct mm_struct *mm, >> + unsigned long start, unsigned long length, >> + const struct mmu_interval_notifier_ops *ops); >> void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni); >> +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni, >> + unsigned long start, unsigned long length); >> >> /** >> * mmu_interval_set_seq - Save the invalidation sequence >> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c >> index f76ea05b1cb0..c303285750b2 100644 >> +++ b/mm/mmu_notifier.c >> @@ -129,6 +129,7 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm) >> { >> struct mmu_interval_notifier *mni; >> struct hlist_node *next; >> + struct hlist_head removed_list; >> >> spin_lock(&mmn_mm->lock); >> if (--mmn_mm->active_invalidate_ranges || >> @@ -144,21 +145,47 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm) >> * The inv_end incorporates a deferred mechanism like rtnl_unlock(). >> * Adds and removes are queued until the final inv_end happens then >> * they are progressed. This arrangement for tree updates is used to >> - * avoid using a blocking lock during invalidate_range_start. >> + * avoid using a blocking lock while walking the interval tree. >> */ >> + INIT_HLIST_HEAD(&removed_list); >> hlist_for_each_entry_safe(mni, next, &mmn_mm->deferred_list, >> deferred_item) { >> + hlist_del(&mni->deferred_item); >> if (RB_EMPTY_NODE(&mni->interval_tree.rb)) >> interval_tree_insert(&mni->interval_tree, >> &mmn_mm->itree); >> - else >> + else { >> interval_tree_remove(&mni->interval_tree, >> &mmn_mm->itree); >> - hlist_del(&mni->deferred_item); >> + if (mni->deferred_last) { >> + mni->interval_tree.start = mni->deferred_start; >> + mni->interval_tree.last = mni->deferred_last; >> + mni->deferred_last = 0; > > Technicaly we can have an interval starting at zero. > > I'd write it more like > > if (mni->updated_start == mni->updated_end) > insert > else > remove OK, but I'm using updated_end == 0, not updated_start, and the end can't be zero. > ie an empty interval can't get a notification so it should be removed > from the tree. > > I also like the name 'updated' better than deferred, it is a bit > clearer.. OK. > Adding release should it's own patch. The release callback is associated with mmu_interval_notifier_put() (i.e., async remove). Otherwise, there is no way to know when the interval can be freed. >> @@ -970,14 +999,52 @@ int mmu_interval_notifier_insert_locked( >> EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked); >> >> /** >> - * mmu_interval_notifier_remove - Remove a interval notifier >> - * @mni: Interval notifier to unregister >> + * mmu_interval_notifier_insert_safe - Insert an interval notifier >> + * @mni: Interval notifier to register >> + * @mm : mm_struct to attach to >> + * @start: Starting virtual address to monitor >> + * @length: Length of the range to monitor >> + * @ops: Interval notifier callback operations >> * >> - * This function must be paired with mmu_interval_notifier_insert(). It cannot >> - * be called from any ops callback. >> + * Return: -EINVAL if @mm hasn't been initialized for interval notifiers >> + * by calling mmu_notifier_register(NULL, mm) or >> + * __mmu_notifier_register(NULL, mm). >> * >> - * Once this returns ops callbacks are no longer running on other CPUs and >> - * will not be called in future. >> + * This function subscribes the interval notifier for notifications from the >> + * mm. Upon return, the ops related to mmu_interval_notifier will be called >> + * whenever an event that intersects with the given range occurs. >> + * >> + * This function is safe to call from the ops->invalidate() function. >> + * Upon return, the mmu_interval_notifier may not be present in the interval >> + * tree yet. The caller must use the normal interval notifier read flow via >> + * mmu_interval_read_begin() to establish SPTEs for this range. > > So why do we need this? You can't call hmm_range_fault from a > notifier. You just can't. > > So there should be no reason to create an interval from the notifier, > do it from where you call hmm_range_fault, and it must be safe to > obtain the mmap_sem from that thread. I was thinking of the case where munmap() creates a hole in the interval. The invalidate callback would need to update the interval to cover the left side of the remaining interval and an insert to cover the right side. Otherwise, the HW invalidation has to be extended to cover the right side and rely on a fault to re-establish the right side interval. >> +/** >> + * mmu_interval_notifier_update - Update interval notifier end >> + * @mni: Interval notifier to update >> + * @start: New starting virtual address to monitor >> + * @length: New length of the range to monitor >> + * >> + * This function updates the range being monitored. >> + * If there is no release() function defined, the call will wait for the >> + * update to finish before returning. >> + */ >> +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni, >> + unsigned long start, unsigned long length) >> +{ >> + struct mm_struct *mm = mni->mm; >> + struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm; >> + unsigned long seq = 0; >> + unsigned long last; >> + >> + if (length == 0 || check_add_overflow(start, length - 1, &last)) >> + return -EOVERFLOW; >> + >> + spin_lock(&mmn_mm->lock); >> + if (mn_itree_is_invalidating(mmn_mm)) { >> + /* >> + * Update is being called after insert put this on the >> + * deferred list, but before the deferred list was processed. >> + */ >> + if (RB_EMPTY_NODE(&mni->interval_tree.rb)) { >> + mni->interval_tree.start = start; >> + mni->interval_tree.last = last; >> + } else { >> + if (!mni->deferred_last) >> + hlist_add_head(&mni->deferred_item, >> + &mmn_mm->deferred_list); >> + mni->deferred_start = start; >> + mni->deferred_last = last; >> + } >> + seq = mmn_mm->invalidate_seq; >> + } else { >> + WARN_ON(RB_EMPTY_NODE(&mni->interval_tree.rb)); >> + interval_tree_remove(&mni->interval_tree, &mmn_mm->itree); >> + mni->interval_tree.start = start; >> + mni->interval_tree.last = last; >> + interval_tree_insert(&mni->interval_tree, &mmn_mm->itree); >> + } >> + spin_unlock(&mmn_mm->lock); >> + >> + if (!mni->ops->release && seq) { >> + /* >> + * The possible sleep on progress in the invalidation requires >> + * the caller not hold any locks held by invalidation >> + * callbacks. >> + */ >> + lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); >> + lock_map_release(&__mmu_notifier_invalidate_range_start_map); >> wait_event(mmn_mm->wq, >> READ_ONCE(mmn_mm->invalidate_seq) != seq); >> + } >> >> - /* pairs with mmgrab in mmu_interval_notifier_insert() */ >> - mmdrop(mm); >> + return 0; >> } >> -EXPORT_SYMBOL_GPL(mmu_interval_notifier_remove); >> +EXPORT_SYMBOL_GPL(mmu_interval_notifier_update); > > A 'update' should probably be the same as insert, it doesn't > necessarily take effect until mmu_interval_read_begin(), so nothing > contingent on release. Agreed. I was trying to keep mmu_interval_notifier_remove() unchanged and use whether ops->release() was defined to indicate if it should be synchronous or asynchronous. The test for release in the update case was intended to be similar. Now the plan for v6 is to leave mmu_interval_notifier_remove() unchanged, add mmu_interval_notifier_put() for async/safe removal and make 'update' be asynchronous only and, as you say, rely on mmu_interval_read_begin() to be sure all delayed add/remove/updates are complete. I'm also planning to add a mmu_interval_notifier_find() so that nouveau and the self tests don't need to create a duplicate interval range tree to track the intervals that they have registered. There isn't an existing structure that the struct mmu_interval_notifier can just be added to so it ends up being a separately allocated structure and would need to be stored in some sort of table so I thought why not just use the itree. So I guess the patch series will be insert_safe, put+release, update, find, nouveau, and tests. > As before, I'm not sure what to do with this. We need an in-kernel > user for new apis, and I don't see a reason to make this more > complicated for a test program. I agree. > The test program should match one of the existing driver flows, so use > the page table like scheme from ODP or the fixed lifetime scheme from > AMDGPU/ODP > > Jason This is all useful feedback. I am working on v6 which addresses your concerns and updates nouveau to use the new API. I'm somewhat sidetracked by the lockdep issue I posted about nouveau calling kmalloc(GFP_KERNEL) from the invalidation callback so it may take me awhile to sort that out. Since we are at -rc5, I'm guessing this won't have enough soak time to make 5.6.
On Thu, Jan 09, 2020 at 02:01:21PM -0800, Ralph Campbell wrote: > > I'd write it more like > > > > if (mni->updated_start == mni->updated_end) > > insert > > else > > remove > > OK, but I'm using updated_end == 0, not updated_start, and the end can't be zero. Tricky.. > > ie an empty interval can't get a notification so it should be removed > > from the tree. > > > > I also like the name 'updated' better than deferred, it is a bit > > clearer.. > > OK. > > > Adding release should it's own patch. > > The release callback is associated with mmu_interval_notifier_put() > (i.e., async remove). Otherwise, there is no way to know when the > interval can be freed. Okay, but this patch is just trying to add update? > > So why do we need this? You can't call hmm_range_fault from a > > notifier. You just can't. > > > > So there should be no reason to create an interval from the notifier, > > do it from where you call hmm_range_fault, and it must be safe to > > obtain the mmap_sem from that thread. > > I was thinking of the case where munmap() creates a hole in the interval. > The invalidate callback would need to update the interval to cover the > left side of the remaining interval and an insert to cover the right > side. Otherwise, the HW invalidation has to be extended to cover the > right side and rely on a fault to re-establish the right side interval. This is very tricky because this algorithm can only work correctly if done atomically as a batch entirely under the spinlock. Forcing it into the defered list while holding the lock is the only way to do something like that sensibly.. So 'update' is not some generic API you can call, it can only be done while the interval tree is locked for reading. Thus 'safe' is probably the wrong name, it is actually 'interval tree locked for read' At the minimum this needs to be comprehensively documented and we need a lockdep style assertion that we are locked when doing it.. And if we are defining things like that then it might as well be expressed as a remove/insert/insert batch operation rather than a somewhat confusing update. > Now the plan for v6 is to leave mmu_interval_notifier_remove() unchanged, > add mmu_interval_notifier_put() for async/safe removal and make 'update' > be asynchronous only and, as you say, rely on mmu_interval_read_begin() > to be sure all delayed add/remove/updates are complete. Hm, we can see what injecting reference counts would look like. > I'm also planning to add a mmu_interval_notifier_find() so that nouveau > and the self tests don't need to create a duplicate interval range tree > to track the intervals that they have registered. There isn't an existing > structure that the struct mmu_interval_notifier can just be added to so > it ends up being a separately allocated structure and would need to be > stored in some sort of table so I thought why not just use the itree. Okay, but for locking reasons find is also a little tricky. I suppose find can obtain the read side lock on the interval tree and then the caller would have to find_unlock once it has refcounted or finished accessing the object. Much like how the invalidate callback is locked. > This is all useful feedback. I am working on v6 which addresses your concerns > and updates nouveau to use the new API. I'm somewhat sidetracked by the lockdep > issue I posted about nouveau calling kmalloc(GFP_KERNEL) from the invalidation > callback so it may take me awhile to sort that out. > Since we are at -rc5, I'm guessing this won't have enough soak time to make 5.6. Yes, sorry for the delay, lots of travel and a mountain of emails. I am almost caught up now. But you can post it at least. Jason
On 1/9/20 3:25 PM, Jason Gunthorpe wrote: > On Thu, Jan 09, 2020 at 02:01:21PM -0800, Ralph Campbell wrote: > >>> I'd write it more like >>> >>> if (mni->updated_start == mni->updated_end) >>> insert >>> else >>> remove >> >> OK, but I'm using updated_end == 0, not updated_start, and the end can't be zero. > > Tricky.. > >>> ie an empty interval can't get a notification so it should be removed >>> from the tree. >>> >>> I also like the name 'updated' better than deferred, it is a bit >>> clearer.. >> >> OK. >> >>> Adding release should it's own patch. >> >> The release callback is associated with mmu_interval_notifier_put() >> (i.e., async remove). Otherwise, there is no way to know when the >> interval can be freed. > > Okay, but this patch is just trying to add update? > >>> So why do we need this? You can't call hmm_range_fault from a >>> notifier. You just can't. >>> >>> So there should be no reason to create an interval from the notifier, >>> do it from where you call hmm_range_fault, and it must be safe to >>> obtain the mmap_sem from that thread. >> >> I was thinking of the case where munmap() creates a hole in the interval. >> The invalidate callback would need to update the interval to cover the >> left side of the remaining interval and an insert to cover the right >> side. Otherwise, the HW invalidation has to be extended to cover the >> right side and rely on a fault to re-establish the right side interval. > > This is very tricky because this algorithm can only work correctly if > done atomically as a batch entirely under the spinlock. Forcing it > into the defered list while holding the lock is the only way to do > something like that sensibly.. > > So 'update' is not some generic API you can call, it can only be done > while the interval tree is locked for reading. Thus 'safe' is probably > the wrong name, it is actually 'interval tree locked for read' > > At the minimum this needs to be comprehensively documented and we need > a lockdep style assertion that we are locked when doing it.. > > And if we are defining things like that then it might as well be > expressed as a remove/insert/insert batch operation rather than > a somewhat confusing update. > >> Now the plan for v6 is to leave mmu_interval_notifier_remove() unchanged, >> add mmu_interval_notifier_put() for async/safe removal and make 'update' >> be asynchronous only and, as you say, rely on mmu_interval_read_begin() >> to be sure all delayed add/remove/updates are complete. > > Hm, we can see what injecting reference counts would look like. > >> I'm also planning to add a mmu_interval_notifier_find() so that nouveau >> and the self tests don't need to create a duplicate interval range tree >> to track the intervals that they have registered. There isn't an existing >> structure that the struct mmu_interval_notifier can just be added to so >> it ends up being a separately allocated structure and would need to be >> stored in some sort of table so I thought why not just use the itree. > > Okay, but for locking reasons find is also a little tricky. I suppose > find can obtain the read side lock on the interval tree and then the > caller would have to find_unlock once it has refcounted or finished > accessing the object. Much like how the invalidate callback is locked. > >> This is all useful feedback. I am working on v6 which addresses your concerns >> and updates nouveau to use the new API. I'm somewhat sidetracked by the lockdep >> issue I posted about nouveau calling kmalloc(GFP_KERNEL) from the invalidation >> callback so it may take me awhile to sort that out. >> Since we are at -rc5, I'm guessing this won't have enough soak time to make 5.6. > > Yes, sorry for the delay, lots of travel and a mountain of emails. I > am almost caught up now. But you can post it at least. > > Jason > I'm using the device driver lock to serialize find/insert/update/remove changes to the interval tree. The important thing is to have a registered interval covering any shadow PTEs in the hardware and the driver lock that protects the updates to the HW and sequence number also protects updates to the registered intervals. Hopefully, this will be easier to understand with v6 which I'm posting for review.
On Mon, Jan 13, 2020 at 02:44:52PM -0800, Ralph Campbell wrote: > I'm using the device driver lock to serialize find/insert/update/remove > changes to the interval tree. The device driver lock can't really fully serialize this as it doesn't effect how the intersection lookup runs. Jason
On 1/14/20 4:45 AM, Jason Gunthorpe wrote: > On Mon, Jan 13, 2020 at 02:44:52PM -0800, Ralph Campbell wrote: > >> I'm using the device driver lock to serialize find/insert/update/remove >> changes to the interval tree. > > The device driver lock can't really fully serialize this as it doesn't > effect how the intersection lookup runs. > > Jason > Single updates to the interval notifier are atomic due to the struct mmu_notifier_mm spinlock so the issue is with "punching a hole" in the interval. In that case, the existing interval is updated to cover one side of the hole and a new interval is inserted for the other side while holding the driver lock. Since this sequence is done from the invalidate() callback, those two operations will be deferred until the callback returns and any other parallel invalidates complete (which would be serialized on the driver lock). So none of these changes will be visible to the interval tree walks until the last invalidate task calls mn_itree_inv_end() and all the deferred changes are applied atomically while holding the spinlock. I'll make sure to add comments explaining this. But I see your point if this sequence is done outside of the invalidate callback. In that case, if the driver shrank the interval, an invalidate callback for the right hand side could be missed before the insertion of the new interval for the right hand side. I'll explain this in the comments for nouveau_svmm_do_unmap() and dmirror_do_unmap().
On Wed, Jan 15, 2020 at 02:04:47PM -0800, Ralph Campbell wrote: > But I see your point if this sequence is done outside of the invalidate > callback. In that case, if the driver shrank the interval, an invalidate > callback for the right hand side could be missed before the insertion of > the new interval for the right hand side. > I'll explain this in the comments for nouveau_svmm_do_unmap() and > dmirror_do_unmap(). Yes, this is why I'm not sure this is a good API for the core to expose. Batch manipulations is a resonable thing, but it should be forced to work under safe conditions, ie while holding the required 'inv_begin' on the interval tree, and the batching APIs should assert this requirement. Jason
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 9e6caa8ecd19..55fbefcdc564 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -233,11 +233,18 @@ struct mmu_notifier { * @invalidate: Upon return the caller must stop using any SPTEs within this * range. This function can sleep. Return false only if sleeping * was required but mmu_notifier_range_blockable(range) is false. + * @release: This function will be called when the mmu_interval_notifier + * is removed from the interval tree. Defining this function also + * allows mmu_interval_notifier_remove() and + * mmu_interval_notifier_update() to be called from the + * invalidate() callback function (i.e., they won't block waiting + * for invalidations to finish. */ struct mmu_interval_notifier_ops { bool (*invalidate)(struct mmu_interval_notifier *mni, const struct mmu_notifier_range *range, unsigned long cur_seq); + void (*release)(struct mmu_interval_notifier *mni); }; struct mmu_interval_notifier { @@ -246,6 +253,8 @@ struct mmu_interval_notifier { struct mm_struct *mm; struct hlist_node deferred_item; unsigned long invalidate_seq; + unsigned long deferred_start; + unsigned long deferred_last; }; #ifdef CONFIG_MMU_NOTIFIER @@ -299,7 +308,13 @@ int mmu_interval_notifier_insert_locked( struct mmu_interval_notifier *mni, struct mm_struct *mm, unsigned long start, unsigned long length, const struct mmu_interval_notifier_ops *ops); +int mmu_interval_notifier_insert_safe( + struct mmu_interval_notifier *mni, struct mm_struct *mm, + unsigned long start, unsigned long length, + const struct mmu_interval_notifier_ops *ops); void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni); +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni, + unsigned long start, unsigned long length); /** * mmu_interval_set_seq - Save the invalidation sequence diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index f76ea05b1cb0..c303285750b2 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -129,6 +129,7 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm) { struct mmu_interval_notifier *mni; struct hlist_node *next; + struct hlist_head removed_list; spin_lock(&mmn_mm->lock); if (--mmn_mm->active_invalidate_ranges || @@ -144,21 +145,47 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm) * The inv_end incorporates a deferred mechanism like rtnl_unlock(). * Adds and removes are queued until the final inv_end happens then * they are progressed. This arrangement for tree updates is used to - * avoid using a blocking lock during invalidate_range_start. + * avoid using a blocking lock while walking the interval tree. */ + INIT_HLIST_HEAD(&removed_list); hlist_for_each_entry_safe(mni, next, &mmn_mm->deferred_list, deferred_item) { + hlist_del(&mni->deferred_item); if (RB_EMPTY_NODE(&mni->interval_tree.rb)) interval_tree_insert(&mni->interval_tree, &mmn_mm->itree); - else + else { interval_tree_remove(&mni->interval_tree, &mmn_mm->itree); - hlist_del(&mni->deferred_item); + if (mni->deferred_last) { + mni->interval_tree.start = mni->deferred_start; + mni->interval_tree.last = mni->deferred_last; + mni->deferred_last = 0; + interval_tree_insert(&mni->interval_tree, + &mmn_mm->itree); + } else if (mni->ops->release) + hlist_add_head(&mni->deferred_item, + &removed_list); + } } spin_unlock(&mmn_mm->lock); wake_up_all(&mmn_mm->wq); + + /* + * Interval notifiers with a release() operation expect a callback + * instead of mmu_interval_notifier_remove() waiting for the + * wake up above. + */ + hlist_for_each_entry_safe(mni, next, &removed_list, deferred_item) { + struct mm_struct *mm = mni->mm; + + hlist_del(&mni->deferred_item); + mni->ops->release(mni); + + /* pairs with mmgrab() in mmu_interval_notifier_insert() */ + mmdrop(mm); + } } /** @@ -856,6 +883,7 @@ static int __mmu_interval_notifier_insert( mni->ops = ops; RB_CLEAR_NODE(&mni->interval_tree.rb); mni->interval_tree.start = start; + mni->deferred_last = 0; /* * Note that the representation of the intervals in the interval tree * considers the ending point as contained in the interval. @@ -913,16 +941,17 @@ static int __mmu_interval_notifier_insert( /** * mmu_interval_notifier_insert - Insert an interval notifier * @mni: Interval notifier to register + * @mm : mm_struct to attach to * @start: Starting virtual address to monitor * @length: Length of the range to monitor - * @mm : mm_struct to attach to + * @ops: Interval notifier callback operations * * This function subscribes the interval notifier for notifications from the - * mm. Upon return the ops related to mmu_interval_notifier will be called + * mm. Upon return, the ops related to mmu_interval_notifier will be called * whenever an event that intersects with the given range occurs. * - * Upon return the range_notifier may not be present in the interval tree yet. - * The caller must use the normal interval notifier read flow via + * Upon return, the mmu_interval_notifier may not be present in the interval + * tree yet. The caller must use the normal interval notifier read flow via * mmu_interval_read_begin() to establish SPTEs for this range. */ int mmu_interval_notifier_insert(struct mmu_interval_notifier *mni, @@ -970,14 +999,52 @@ int mmu_interval_notifier_insert_locked( EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked); /** - * mmu_interval_notifier_remove - Remove a interval notifier - * @mni: Interval notifier to unregister + * mmu_interval_notifier_insert_safe - Insert an interval notifier + * @mni: Interval notifier to register + * @mm : mm_struct to attach to + * @start: Starting virtual address to monitor + * @length: Length of the range to monitor + * @ops: Interval notifier callback operations * - * This function must be paired with mmu_interval_notifier_insert(). It cannot - * be called from any ops callback. + * Return: -EINVAL if @mm hasn't been initialized for interval notifiers + * by calling mmu_notifier_register(NULL, mm) or + * __mmu_notifier_register(NULL, mm). * - * Once this returns ops callbacks are no longer running on other CPUs and - * will not be called in future. + * This function subscribes the interval notifier for notifications from the + * mm. Upon return, the ops related to mmu_interval_notifier will be called + * whenever an event that intersects with the given range occurs. + * + * This function is safe to call from the ops->invalidate() function. + * Upon return, the mmu_interval_notifier may not be present in the interval + * tree yet. The caller must use the normal interval notifier read flow via + * mmu_interval_read_begin() to establish SPTEs for this range. + */ +int mmu_interval_notifier_insert_safe( + struct mmu_interval_notifier *mni, struct mm_struct *mm, + unsigned long start, unsigned long length, + const struct mmu_interval_notifier_ops *ops) +{ + struct mmu_notifier_mm *mmn_mm; + + mmn_mm = mm->mmu_notifier_mm; + if (!mmn_mm || !mmn_mm->has_itree) + return -EINVAL; + return __mmu_interval_notifier_insert(mni, mm, mmn_mm, start, length, + ops); +} +EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_safe); + +/** + * mmu_interval_notifier_remove - Remove an interval notifier + * @mni: Interval notifier to unregister + * + * This function must be paired with mmu_interval_notifier_insert(). + * If a mmu_interval_notifier_ops.release() function is defined, + * mmu_interval_notifier_remove() is safe to call from the invalidate() + * callback and the release() function will be called once all CPUs + * are finished using the notifier. Otherwise, mmu_interval_notifier_remove() + * cannot be called from any ops callback and will block waiting for + * invalidation callbacks to finish before returning. */ void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni) { @@ -996,8 +1063,11 @@ void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni) if (RB_EMPTY_NODE(&mni->interval_tree.rb)) { hlist_del(&mni->deferred_item); } else { - hlist_add_head(&mni->deferred_item, - &mmn_mm->deferred_list); + if (mni->deferred_last) + mni->deferred_last = 0; + else + hlist_add_head(&mni->deferred_item, + &mmn_mm->deferred_list); seq = mmn_mm->invalidate_seq; } } else { @@ -1006,20 +1076,96 @@ void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni) } spin_unlock(&mmn_mm->lock); - /* - * The possible sleep on progress in the invalidation requires the - * caller not hold any locks held by invalidation callbacks. - */ - lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); - lock_map_release(&__mmu_notifier_invalidate_range_start_map); - if (seq) + if (mni->ops->release) { + if (!seq) { + mni->ops->release(mni); + + /* + * Pairs with mmgrab() in + * mmu_interval_notifier_insert(). + */ + mmdrop(mm); + } + } else { + /* + * The possible sleep on progress in the invalidation requires + * the caller not hold any locks held by invalidation + * callbacks. + */ + lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); + lock_map_release(&__mmu_notifier_invalidate_range_start_map); + if (seq) + wait_event(mmn_mm->wq, + READ_ONCE(mmn_mm->invalidate_seq) != seq); + + /* pairs with mmgrab in mmu_interval_notifier_insert() */ + mmdrop(mm); + } +} +EXPORT_SYMBOL_GPL(mmu_interval_notifier_remove); + +/** + * mmu_interval_notifier_update - Update interval notifier end + * @mni: Interval notifier to update + * @start: New starting virtual address to monitor + * @length: New length of the range to monitor + * + * This function updates the range being monitored. + * If there is no release() function defined, the call will wait for the + * update to finish before returning. + */ +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni, + unsigned long start, unsigned long length) +{ + struct mm_struct *mm = mni->mm; + struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm; + unsigned long seq = 0; + unsigned long last; + + if (length == 0 || check_add_overflow(start, length - 1, &last)) + return -EOVERFLOW; + + spin_lock(&mmn_mm->lock); + if (mn_itree_is_invalidating(mmn_mm)) { + /* + * Update is being called after insert put this on the + * deferred list, but before the deferred list was processed. + */ + if (RB_EMPTY_NODE(&mni->interval_tree.rb)) { + mni->interval_tree.start = start; + mni->interval_tree.last = last; + } else { + if (!mni->deferred_last) + hlist_add_head(&mni->deferred_item, + &mmn_mm->deferred_list); + mni->deferred_start = start; + mni->deferred_last = last; + } + seq = mmn_mm->invalidate_seq; + } else { + WARN_ON(RB_EMPTY_NODE(&mni->interval_tree.rb)); + interval_tree_remove(&mni->interval_tree, &mmn_mm->itree); + mni->interval_tree.start = start; + mni->interval_tree.last = last; + interval_tree_insert(&mni->interval_tree, &mmn_mm->itree); + } + spin_unlock(&mmn_mm->lock); + + if (!mni->ops->release && seq) { + /* + * The possible sleep on progress in the invalidation requires + * the caller not hold any locks held by invalidation + * callbacks. + */ + lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); + lock_map_release(&__mmu_notifier_invalidate_range_start_map); wait_event(mmn_mm->wq, READ_ONCE(mmn_mm->invalidate_seq) != seq); + } - /* pairs with mmgrab in mmu_interval_notifier_insert() */ - mmdrop(mm); + return 0; } -EXPORT_SYMBOL_GPL(mmu_interval_notifier_remove); +EXPORT_SYMBOL_GPL(mmu_interval_notifier_update); /** * mmu_notifier_synchronize - Ensure all mmu_notifiers are freed
mmu_interval_notifier_insert() and mmu_interval_notifier_remove() can't be called safely from inside the invalidate() callback. This is fine for devices with explicit memory region register and unregister calls but it is desirable from a programming model standpoint to not require explicit memory region registration. Regions can be registered based on device address faults but without a mechanism for updating or removing the mmu interval notifiers in response to munmap(), the invalidation callbacks will be for regions that are stale or apply to different mmaped regions. The invalidate() callback provides the necessary information (i.e., the event type MMU_NOTIFY_UNMAP) so add insert, remove, and update functions that are safe to call from the invalidate() callback by extending the work done in mn_itree_inv_end() to update the interval tree when it is not being traversed. Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> --- include/linux/mmu_notifier.h | 15 +++ mm/mmu_notifier.c | 196 ++++++++++++++++++++++++++++++----- 2 files changed, 186 insertions(+), 25 deletions(-)