diff mbox series

[v2,02/15] mm/mmu_notifier: add an interval tree notifier

Message ID 20191028201032.6352-3-jgg@ziepe.ca (mailing list archive)
State New, archived
Headers show
Series Consolidate the mmu notifier interval_tree and locking | expand

Commit Message

Jason Gunthorpe Oct. 28, 2019, 8:10 p.m. UTC
From: Jason Gunthorpe <jgg@mellanox.com>

Of the 13 users of mmu_notifiers, 8 of them use only
invalidate_range_start/end() and immediately intersect the
mmu_notifier_range with some kind of internal list of VAs.  4 use an
interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
of some kind (scif_dma, vhost, gntdev, hmm)

And the remaining 5 either don't use invalidate_range_start() or do some
special thing with it.

It turns out that building a correct scheme with an interval tree is
pretty complicated, particularly if the use case is synchronizing against
another thread doing get_user_pages().  Many of these implementations have
various subtle and difficult to fix races.

This approach puts the interval tree as common code at the top of the mmu
notifier call tree and implements a shareable locking scheme.

It includes:
 - An interval tree tracking VA ranges, with per-range callbacks
 - A read/write locking scheme for the interval tree that avoids
   sleeping in the notifier path (for OOM killer)
 - A sequence counter based collision-retry locking scheme to tell
   device page fault that a VA range is being concurrently invalidated.

This is based on various ideas:
- hmm accumulates invalidated VA ranges and releases them when all
  invalidates are done, via active_invalidate_ranges count.
  This approach avoids having to intersect the interval tree twice (as
  umem_odp does) at the potential cost of a longer device page fault.

- kvm/umem_odp use a sequence counter to drive the collision retry,
  via invalidate_seq

- a deferred work todo list on unlock scheme like RTNL, via deferred_list.
  This makes adding/removing interval tree members more deterministic

- seqlock, except this version makes the seqlock idea multi-holder on the
  write side by protecting it with active_invalidate_ranges and a spinlock

To minimize MM overhead when only the interval tree is being used, the
entire SRCU and hlist overheads are dropped using some simple
branches. Similarly the interval tree overhead is dropped when in hlist
mode.

The overhead from the mandatory spinlock is broadly the same as most of
existing users which already had a lock (or two) of some sort on the
invalidation path.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/mmu_notifier.h |  98 +++++++
 mm/Kconfig                   |   1 +
 mm/mmu_notifier.c            | 533 +++++++++++++++++++++++++++++++++--
 3 files changed, 607 insertions(+), 25 deletions(-)

Comments

Felix Kuehling Oct. 29, 2019, 10:04 p.m. UTC | #1
I haven't had enough time to fully understand the deferred logic in this 
change. I spotted one problem, see comments inline.

On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> Of the 13 users of mmu_notifiers, 8 of them use only
> invalidate_range_start/end() and immediately intersect the
> mmu_notifier_range with some kind of internal list of VAs.  4 use an
> interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
> of some kind (scif_dma, vhost, gntdev, hmm)
>
> And the remaining 5 either don't use invalidate_range_start() or do some
> special thing with it.
>
> It turns out that building a correct scheme with an interval tree is
> pretty complicated, particularly if the use case is synchronizing against
> another thread doing get_user_pages().  Many of these implementations have
> various subtle and difficult to fix races.
>
> This approach puts the interval tree as common code at the top of the mmu
> notifier call tree and implements a shareable locking scheme.
>
> It includes:
>   - An interval tree tracking VA ranges, with per-range callbacks
>   - A read/write locking scheme for the interval tree that avoids
>     sleeping in the notifier path (for OOM killer)
>   - A sequence counter based collision-retry locking scheme to tell
>     device page fault that a VA range is being concurrently invalidated.
>
> This is based on various ideas:
> - hmm accumulates invalidated VA ranges and releases them when all
>    invalidates are done, via active_invalidate_ranges count.
>    This approach avoids having to intersect the interval tree twice (as
>    umem_odp does) at the potential cost of a longer device page fault.
>
> - kvm/umem_odp use a sequence counter to drive the collision retry,
>    via invalidate_seq
>
> - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
>    This makes adding/removing interval tree members more deterministic
>
> - seqlock, except this version makes the seqlock idea multi-holder on the
>    write side by protecting it with active_invalidate_ranges and a spinlock
>
> To minimize MM overhead when only the interval tree is being used, the
> entire SRCU and hlist overheads are dropped using some simple
> branches. Similarly the interval tree overhead is dropped when in hlist
> mode.
>
> The overhead from the mandatory spinlock is broadly the same as most of
> existing users which already had a lock (or two) of some sort on the
> invalidation path.
>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Acked-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>   include/linux/mmu_notifier.h |  98 +++++++
>   mm/Kconfig                   |   1 +
>   mm/mmu_notifier.c            | 533 +++++++++++++++++++++++++++++++++--
>   3 files changed, 607 insertions(+), 25 deletions(-)
>
[snip]
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 367670cfd02b7b..d02d3c8c223eb7 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
[snip]
>    * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> @@ -52,17 +286,24 @@ struct mmu_notifier_mm {
>    * can't go away from under us as exit_mmap holds an mm_count pin
>    * itself.
>    */
> -void __mmu_notifier_release(struct mm_struct *mm)
> +static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
> +			     struct mm_struct *mm)
>   {
>   	struct mmu_notifier *mn;
>   	int id;
>   
> +	if (mmn_mm->has_interval)
> +		mn_itree_release(mmn_mm, mm);
> +
> +	if (hlist_empty(&mmn_mm->list))
> +		return;

This seems to duplicate the conditions in __mmu_notifier_release. See my 
comments below, I think one of them is wrong. I suspect this one, 
because __mmu_notifier_release follows the same pattern as the other 
notifiers.


> +
>   	/*
>   	 * SRCU here will block mmu_notifier_unregister until
>   	 * ->release returns.
>   	 */
>   	id = srcu_read_lock(&srcu);
> -	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
> +	hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist)
>   		/*
>   		 * If ->release runs before mmu_notifier_unregister it must be
>   		 * handled, as it's the only way for the driver to flush all
> @@ -72,9 +313,9 @@ void __mmu_notifier_release(struct mm_struct *mm)
>   		if (mn->ops->release)
>   			mn->ops->release(mn, mm);
>   
> -	spin_lock(&mm->mmu_notifier_mm->lock);
> -	while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
> -		mn = hlist_entry(mm->mmu_notifier_mm->list.first,
> +	spin_lock(&mmn_mm->lock);
> +	while (unlikely(!hlist_empty(&mmn_mm->list))) {
> +		mn = hlist_entry(mmn_mm->list.first,
>   				 struct mmu_notifier,
>   				 hlist);
>   		/*
> @@ -85,7 +326,7 @@ void __mmu_notifier_release(struct mm_struct *mm)
>   		 */
>   		hlist_del_init_rcu(&mn->hlist);
>   	}
> -	spin_unlock(&mm->mmu_notifier_mm->lock);
> +	spin_unlock(&mmn_mm->lock);
>   	srcu_read_unlock(&srcu, id);
>   
>   	/*
> @@ -100,6 +341,17 @@ void __mmu_notifier_release(struct mm_struct *mm)
>   	synchronize_srcu(&srcu);
>   }
>   
> +void __mmu_notifier_release(struct mm_struct *mm)
> +{
> +	struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
> +
> +	if (mmn_mm->has_interval)
> +		mn_itree_release(mmn_mm, mm);

If mmn_mm->list is not empty, this will be done twice because 
mn_hlist_release duplicates this.


> +
> +	if (!hlist_empty(&mmn_mm->list))
> +		mn_hlist_release(mmn_mm, mm);

mn_hlist_release checks the same condition itself.


> +}
> +
>   /*
>    * If no young bitflag is supported by the hardware, ->clear_flush_young can
>    * unmap the address and return 1 or 0 depending if the mapping previously
[snip]
Jason Gunthorpe Oct. 29, 2019, 10:56 p.m. UTC | #2
On Tue, Oct 29, 2019 at 10:04:45PM +0000, Kuehling, Felix wrote:

> >    * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> > @@ -52,17 +286,24 @@ struct mmu_notifier_mm {
> >    * can't go away from under us as exit_mmap holds an mm_count pin
> >    * itself.
> >    */
> > -void __mmu_notifier_release(struct mm_struct *mm)
> > +static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
> > +			     struct mm_struct *mm)
> >   {
> >   	struct mmu_notifier *mn;
> >   	int id;
> >   
> > +	if (mmn_mm->has_interval)
> > +		mn_itree_release(mmn_mm, mm);
> > +
> > +	if (hlist_empty(&mmn_mm->list))
> > +		return;
> 
> This seems to duplicate the conditions in __mmu_notifier_release. See my 
> comments below, I think one of them is wrong. I suspect this one, 
> because __mmu_notifier_release follows the same pattern as the other 
> notifiers.

Yep, this is a rebasing error from a earlier version, the above two
lines should be deleted.

I think it is harmless so it should not impact any testing.

Thanks,
Jason
John Hubbard Nov. 7, 2019, 12:23 a.m. UTC | #3
On 10/28/19 1:10 PM, Jason Gunthorpe wrote:
...
>  include/linux/mmu_notifier.h |  98 +++++++
>  mm/Kconfig                   |   1 +
>  mm/mmu_notifier.c            | 533 +++++++++++++++++++++++++++++++++--
>  3 files changed, 607 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 12bd603d318ce7..51b92ba013ddce 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -6,10 +6,12 @@
>  #include <linux/spinlock.h>
>  #include <linux/mm_types.h>
>  #include <linux/srcu.h>
> +#include <linux/interval_tree.h>
>  
>  struct mmu_notifier_mm;
>  struct mmu_notifier;
>  struct mmu_notifier_range;
> +struct mmu_range_notifier;

Hi Jason,

Nice design, I love the seq foundation! So far, I'm not able to spot anything
actually wrong with the implementation, sorry about that. 

Generally my reaction is: given that the design is complex, try to mitigate 
that with documentation and naming. So the comments are in these areas:

1. There is a rather severe naming overlap (not technically a naming conflict,
but still) with existing mmn work, which already has, for example:

    struct mmu_notifier_range

...and you're adding:

    struct mmu_range_notifier

...so I'll try to help sort that out.

2. I'm also seeing a couple of things that are really hard for the reader
verify are correct (abuse and battery of the low bit in .invalidate_seq, 
for example, haha), so I have some recommendations there.

3. Documentation improvements, which easy to apply, with perhaps one exception.
(Here, because this a complicated area, documentation does make a difference,
so it's worth a little extra fuss.)

4. Other nits that don't matter too much, but just help polish things up
as usual.

>  
>  /**
>   * enum mmu_notifier_event - reason for the mmu notifier callback
> @@ -32,6 +34,9 @@ struct mmu_notifier_range;
>   * access flags). User should soft dirty the page in the end callback to make
>   * sure that anyone relying on soft dirtyness catch pages that might be written
>   * through non CPU mappings.
> + *
> + * @MMU_NOTIFY_RELEASE: used during mmu_range_notifier invalidate to signal that
> + * the mm refcount is zero and the range is no longer accessible.
>   */
>  enum mmu_notifier_event {
>  	MMU_NOTIFY_UNMAP = 0,
> @@ -39,6 +44,7 @@ enum mmu_notifier_event {
>  	MMU_NOTIFY_PROTECTION_VMA,
>  	MMU_NOTIFY_PROTECTION_PAGE,
>  	MMU_NOTIFY_SOFT_DIRTY,
> +	MMU_NOTIFY_RELEASE,
>  };


OK, let the naming debates begin! ha. Anyway, after careful study of the overall
patch, and some browsing of the larger patchset, it's clear that:

* The new "MMU range notifier" that you've created is, approximately, a new
object. It uses classic mmu notifiers inside, as an implementation detail, and
it does *similar* things (notifications) as mmn's. But it's certainly not the same
as mmn's, as shown later when you say the need to an entirely new ops struct, and 
data struct too.

Therefore, you need a separate events enum as well. This is important. MMN's
won't be issuing MMN_NOTIFY_RELEASE events, nor will MNR's be issuing the first
four prexisting MMU_NOTIFY_* items. So it would be a design mistake to glom them
together, unless you ultimately decided to merge these MMN and MNR objects (which
I don't really see any intention of, and that's fine).

So this should read:

enum mmu_range_notifier_event {
	MMU_NOTIFY_RELEASE,
};

...assuming that we stay with "mmu_range_notifier" as a core name for this 
whole thing.

Also, it is best moved down to be next to the new MNR structs, so that all the
MNR stuff is in one group.

Extra credit: IMHO, this clearly deserves to all be in a new mmu_range_notifier.h
header file, but I know that's extra work. Maybe later as a follow-up patch,
if anyone has the time.

>  
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> @@ -222,6 +228,26 @@ struct mmu_notifier {
>  	unsigned int users;
>  };
>  

That should also be moved down, next to the new structs.



A little bit above these next items, just above "struct mmu_notifier" (not shown here, 
it's outside the diff area), there is some documentation about classic MMNs. It would 
be nice if it were clearer that that documentation is not relevant to MNRs. Actually, 
this is another reason that a separate header file would be nice.

> +/**
> + * struct mmu_range_notifier_ops
> + * @invalidate: Upon return the caller must stop using any SPTEs within this
> + *              range, this function can sleep. Return false if blocking was
> + *              required but range is non-blocking
> + */

How about this (I'm not sure I fully understand the return value, though):

/**
 * struct mmu_range_notifier_ops
 * @invalidate: Upon return the caller must stop using any SPTEs within this
 * 		range.
 *
 * 		This function is permitted to sleep.
 *
 *      	@Return: false if blocking was required, but @range is
 *			non-blocking.
 *
 */


> +struct mmu_range_notifier_ops {
> +	bool (*invalidate)(struct mmu_range_notifier *mrn,
> +			   const struct mmu_notifier_range *range,
> +			   unsigned long cur_seq);
> +};
> +
> +struct mmu_range_notifier {
> +	struct interval_tree_node interval_tree;
> +	const struct mmu_range_notifier_ops *ops;
> +	struct hlist_node deferred_item;
> +	unsigned long invalidate_seq;
> +	struct mm_struct *mm;
> +};
> +

Again, now we have the new struct mmu_range_notifier, and the old 
struct mmu_notifier_range, and it's not good.

Ideas:

a) Live with it.

b) (Discarded, too many callers): rename old one. Nope.

c) Rename new one. Ideas:

    struct mmu_interval_notifier
    struct mmu_range_intersection
    ...other ideas?


>  #ifdef CONFIG_MMU_NOTIFIER
>  
>  #ifdef CONFIG_LOCKDEP
> @@ -263,6 +289,78 @@ extern int __mmu_notifier_register(struct mmu_notifier *mn,
>  				   struct mm_struct *mm);
>  extern void mmu_notifier_unregister(struct mmu_notifier *mn,
>  				    struct mm_struct *mm);
> +
> +unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn);
> +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> +			      unsigned long start, unsigned long length,
> +			      struct mm_struct *mm);
> +int mmu_range_notifier_insert_locked(struct mmu_range_notifier *mrn,
> +				     unsigned long start, unsigned long length,
> +				     struct mm_struct *mm);
> +void mmu_range_notifier_remove(struct mmu_range_notifier *mrn);
> +
> +/**
> + * mmu_range_set_seq - Save the invalidation sequence

How about:

 * mmu_range_set_seq - Set the .invalidate_seq to a new value.


> + * @mrn - The mrn passed to invalidate
> + * @cur_seq - The cur_seq passed to invalidate
> + *
> + * This must be called unconditionally from the invalidate callback of a
> + * struct mmu_range_notifier_ops under the same lock that is used to call
> + * mmu_range_read_retry(). It updates the sequence number for later use by
> + * mmu_range_read_retry().
> + *
> + * If the user does not call mmu_range_read_begin() or mmu_range_read_retry()

nit: "caller" is better than "user", when referring to...well, callers. "user" 
most often refers to user space, whereas a call stack and function calling is 
clearly what you're referring to here (and in other places, especially "user lock").

> + * then this call is not required.
> + */
> +static inline void mmu_range_set_seq(struct mmu_range_notifier *mrn,
> +				     unsigned long cur_seq)
> +{
> +	WRITE_ONCE(mrn->invalidate_seq, cur_seq);
> +}
> +
> +/**
> + * mmu_range_read_retry - End a read side critical section against a VA range
> + * mrn: The range under lock
> + * seq: The return of the paired mmu_range_read_begin()
> + *
> + * This MUST be called under a user provided lock that is also held
> + * unconditionally by op->invalidate() when it calls mmu_range_set_seq().
> + *
> + * Each call should be paired with a single mmu_range_read_begin() and
> + * should be used to conclude the read side.
> + *
> + * Returns true if an invalidation collided with this critical section, and
> + * the caller should retry.
> + */
> +static inline bool mmu_range_read_retry(struct mmu_range_notifier *mrn,
> +					unsigned long seq)
> +{
> +	return mrn->invalidate_seq != seq;
> +}
> +
> +/**
> + * mmu_range_check_retry - Test if a collision has occurred
> + * mrn: The range under lock
> + * seq: The return of the matching mmu_range_read_begin()
> + *
> + * This can be used in the critical section between mmu_range_read_begin() and
> + * mmu_range_read_retry().  A return of true indicates an invalidation has
> + * collided with this lock and a future mmu_range_read_retry() will return
> + * true.
> + *
> + * False is not reliable and only suggests a collision has not happened. It

let's say "suggests that a collision *may* not have occurred."  

...
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 367670cfd02b7b..d02d3c8c223eb7 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -12,6 +12,7 @@
>  #include <linux/export.h>
>  #include <linux/mm.h>
>  #include <linux/err.h>
> +#include <linux/interval_tree.h>
>  #include <linux/srcu.h>
>  #include <linux/rcupdate.h>
>  #include <linux/sched.h>
> @@ -36,10 +37,243 @@ struct lockdep_map __mmu_notifier_invalidate_range_start_map = {
>  struct mmu_notifier_mm {
>  	/* all mmu notifiers registered in this mm are queued in this list */
>  	struct hlist_head list;
> +	bool has_interval;
>  	/* to serialize the list modifications and hlist_unhashed */
>  	spinlock_t lock;
> +	unsigned long invalidate_seq;
> +	unsigned long active_invalidate_ranges;
> +	struct rb_root_cached itree;
> +	wait_queue_head_t wq;
> +	struct hlist_head deferred_list;
>  };
>  
> +/*
> + * This is a collision-retry read-side/write-side 'lock', a lot like a
> + * seqcount, however this allows multiple write-sides to hold it at
> + * once. Conceptually the write side is protecting the values of the PTEs in
> + * this mm, such that PTES cannot be read into SPTEs while any writer exists.

Just to be kind, can we say "SPTEs (shadow PTEs)", just this once? :)

> + *
> + * Note that the core mm creates nested invalidate_range_start()/end() regions
> + * within the same thread, and runs invalidate_range_start()/end() in parallel
> + * on multiple CPUs. This is designed to not reduce concurrency or block
> + * progress on the mm side.
> + *
> + * As a secondary function, holding the full write side also serves to prevent
> + * writers for the itree, this is an optimization to avoid extra locking
> + * during invalidate_range_start/end notifiers.
> + *
> + * The write side has two states, fully excluded:
> + *  - mm->active_invalidate_ranges != 0
> + *  - mnn->invalidate_seq & 1 == True
> + *  - some range on the mm_struct is being invalidated
> + *  - the itree is not allowed to change
> + *
> + * And partially excluded:
> + *  - mm->active_invalidate_ranges != 0

I assume this implies mnn->invalidate_seq & 1 == False in this case? If so,
let's say so. I'm probably getting that wrong, too.

> + *  - some range on the mm_struct is being invalidated
> + *  - the itree is allowed to change
> + *
> + * The later state avoids some expensive work on inv_end in the common case of
> + * no mrn monitoring the VA.
> + */
> +static bool mn_itree_is_invalidating(struct mmu_notifier_mm *mmn_mm)
> +{
> +	lockdep_assert_held(&mmn_mm->lock);
> +	return mmn_mm->invalidate_seq & 1;
> +}
> +
> +static struct mmu_range_notifier *
> +mn_itree_inv_start_range(struct mmu_notifier_mm *mmn_mm,
> +			 const struct mmu_notifier_range *range,
> +			 unsigned long *seq)
> +{
> +	struct interval_tree_node *node;
> +	struct mmu_range_notifier *res = NULL;
> +
> +	spin_lock(&mmn_mm->lock);
> +	mmn_mm->active_invalidate_ranges++;
> +	node = interval_tree_iter_first(&mmn_mm->itree, range->start,
> +					range->end - 1);
> +	if (node) {
> +		mmn_mm->invalidate_seq |= 1;


OK, this either needs more documentation and assertions, or a different
approach. Because I see addition, subtraction, AND, OR and booleans
all being applied to this field, and it's darn near hopeless to figure
out whether or not it really is even or odd at the right times.

Different approach: why not just add a mmn_mm->is_invalidating 
member variable? It's not like you're short of space in that struct.


> +		res = container_of(node, struct mmu_range_notifier,
> +				   interval_tree);
> +	}
> +
> +	*seq = mmn_mm->invalidate_seq;
> +	spin_unlock(&mmn_mm->lock);
> +	return res;
> +}
> +
> +static struct mmu_range_notifier *
> +mn_itree_inv_next(struct mmu_range_notifier *mrn,
> +		  const struct mmu_notifier_range *range)
> +{
> +	struct interval_tree_node *node;
> +
> +	node = interval_tree_iter_next(&mrn->interval_tree, range->start,
> +				       range->end - 1);
> +	if (!node)
> +		return NULL;
> +	return container_of(node, struct mmu_range_notifier, interval_tree);
> +}
> +
> +static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
> +{
> +	struct mmu_range_notifier *mrn;
> +	struct hlist_node *next;
> +	bool need_wake = false;
> +
> +	spin_lock(&mmn_mm->lock);
> +	if (--mmn_mm->active_invalidate_ranges ||
> +	    !mn_itree_is_invalidating(mmn_mm)) {
> +		spin_unlock(&mmn_mm->lock);
> +		return;
> +	}
> +
> +	mmn_mm->invalidate_seq++;

Is this the right place for an assertion that this is now an even value?

> +	need_wake = true;
> +
> +	/*
> +	 * The inv_end incorporates a deferred mechanism like
> +	 * rtnl_lock(). Adds and removes are queued until the final inv_end

Let me point out that rtnl_lock() itself is a one-liner that calls mutex_lock().
But I suppose if one studies that file closely there is more. :)

...

> +unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn)
> +{
> +	struct mmu_notifier_mm *mmn_mm = mrn->mm->mmu_notifier_mm;
> +	unsigned long seq;
> +	bool is_invalidating;
> +
> +	/*
> +	 * If the mrn has a different seq value under the user_lock than we
> +	 * started with then it has collided.
> +	 *
> +	 * If the mrn currently has the same seq value as the mmn_mm seq, then
> +	 * it is currently between invalidate_start/end and is colliding.
> +	 *
> +	 * The locking looks broadly like this:
> +	 *   mn_tree_invalidate_start():          mmu_range_read_begin():
> +	 *                                         spin_lock
> +	 *                                          seq = READ_ONCE(mrn->invalidate_seq);
> +	 *                                          seq == mmn_mm->invalidate_seq
> +	 *                                         spin_unlock
> +	 *    spin_lock
> +	 *     seq = ++mmn_mm->invalidate_seq
> +	 *    spin_unlock
> +	 *     op->invalidate_range():
> +	 *       user_lock
> +	 *        mmu_range_set_seq()
> +	 *         mrn->invalidate_seq = seq
> +	 *       user_unlock
> +	 *
> +	 *                          [Required: mmu_range_read_retry() == true]
> +	 *
> +	 *   mn_itree_inv_end():
> +	 *    spin_lock
> +	 *     seq = ++mmn_mm->invalidate_seq
> +	 *    spin_unlock
> +	 *
> +	 *                                        user_lock
> +	 *                                         mmu_range_read_retry():
> +	 *                                          mrn->invalidate_seq != seq
> +	 *                                        user_unlock
> +	 *
> +	 * Barriers are not needed here as any races here are closed by an
> +	 * eventual mmu_range_read_retry(), which provides a barrier via the
> +	 * user_lock.
> +	 */
> +	spin_lock(&mmn_mm->lock);
> +	/* Pairs with the WRITE_ONCE in mmu_range_set_seq() */
> +	seq = READ_ONCE(mrn->invalidate_seq);
> +	is_invalidating = seq == mmn_mm->invalidate_seq;
> +	spin_unlock(&mmn_mm->lock);
> +
> +	/*
> +	 * mrn->invalidate_seq is always set to an odd value. This ensures

This claim just looks wrong the first N times one reads the code, given that
there is mmu_range_set_seq() to set it to an arbitrary value!  Maybe you mean

"is always set to an odd value when invalidating"??

> +	 * that if seq does wrap we will always clear the below sleep in some
> +	 * reasonable time as mmn_mm->invalidate_seq is even in the idle
> +	 * state.
> +	 */

Let's move that comment higher up. The code that follows it has nothing to
do with it, so it's confusing here.

...
> @@ -529,6 +852,166 @@ void mmu_notifier_put(struct mmu_notifier *mn)
>  }
>  EXPORT_SYMBOL_GPL(mmu_notifier_put);
>  
> +static int __mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> +				       unsigned long start,
> +				       unsigned long length,
> +				       struct mmu_notifier_mm *mmn_mm,
> +				       struct mm_struct *mm)
> +{
> +	mrn->mm = mm;
> +	RB_CLEAR_NODE(&mrn->interval_tree.rb);
> +	mrn->interval_tree.start = start;
> +	/*
> +	 * Note that the representation of the intervals in the interval tree
> +	 * considers the ending point as contained in the interval.

Thanks for that comment!

> +	 */
> +	if (length == 0 ||
> +	    check_add_overflow(start, length - 1, &mrn->interval_tree.last))
> +		return -EOVERFLOW;
> +
> +	/* pairs with mmdrop in mmu_range_notifier_remove() */
> +	mmgrab(mm);
> +
> +	/*
> +	 * If some invalidate_range_start/end region is going on in parallel
> +	 * we don't know what VA ranges are affected, so we must assume this
> +	 * new range is included.
> +	 *
> +	 * If the itree is invalidating then we are not allowed to change
> +	 * it. Retrying until invalidation is done is tricky due to the
> +	 * possibility for live lock, instead defer the add to the unlock so
> +	 * this algorithm is deterministic.
> +	 *
> +	 * In all cases the value for the mrn->mr_invalidate_seq should be
> +	 * odd, see mmu_range_read_begin()
> +	 */
> +	spin_lock(&mmn_mm->lock);
> +	if (mmn_mm->active_invalidate_ranges) {
> +		if (mn_itree_is_invalidating(mmn_mm))
> +			hlist_add_head(&mrn->deferred_item,
> +				       &mmn_mm->deferred_list);
> +		else {
> +			mmn_mm->invalidate_seq |= 1;
> +			interval_tree_insert(&mrn->interval_tree,
> +					     &mmn_mm->itree);
> +		}
> +		mrn->invalidate_seq = mmn_mm->invalidate_seq;
> +	} else {
> +		WARN_ON(mn_itree_is_invalidating(mmn_mm));
> +		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;

Ohhh, checkmate. I lose. Why is *subtracting* the right thing to do
for seq numbers here?  I'm acutely unhappy trying to figure this out.
I suspect it's another unfortunate side effect of trying to use the
lower bit of the seq number (even/odd) for something else.

> +		interval_tree_insert(&mrn->interval_tree, &mmn_mm->itree);
> +	}
> +	spin_unlock(&mmn_mm->lock);
> +	return 0;
> +}
> +
> +/**
> + * mmu_range_notifier_insert - Insert a range notifier
> + * @mrn: Range notifier to register
> + * @start: Starting virtual address to monitor
> + * @length: Length of the range to monitor
> + * @mm : mm_struct to attach to
> + *
> + * This function subscribes the range notifier for notifications from the mm.
> + * Upon return the ops related to mmu_range_notifier will be called whenever
> + * an event that intersects with the given range occurs.
> + *
> + * Upon return the range_notifier may not be present in the interval tree yet.
> + * The caller must use the normal range notifier locking flow via
> + * mmu_range_read_begin() to establish SPTEs for this range.
> + */
> +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> +			      unsigned long start, unsigned long length,
> +			      struct mm_struct *mm)
> +{
> +	struct mmu_notifier_mm *mmn_mm;
> +	int ret;

Hmmm, I think a later patch improperly changes the above to "int ret = 0;".
I'll check on that. It's correct here, though.

> +
> +	might_lock(&mm->mmap_sem);
> +
> +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);

What does the above pair with? Should have a comment that specifies that.

 
thanks,

John Hubbard
NVIDIA
Jerome Glisse Nov. 7, 2019, 2:08 a.m. UTC | #4
On Wed, Nov 06, 2019 at 04:23:21PM -0800, John Hubbard wrote:
> On 10/28/19 1:10 PM, Jason Gunthorpe wrote:

[...]

> >  /**
> >   * enum mmu_notifier_event - reason for the mmu notifier callback
> > @@ -32,6 +34,9 @@ struct mmu_notifier_range;
> >   * access flags). User should soft dirty the page in the end callback to make
> >   * sure that anyone relying on soft dirtyness catch pages that might be written
> >   * through non CPU mappings.
> > + *
> > + * @MMU_NOTIFY_RELEASE: used during mmu_range_notifier invalidate to signal that
> > + * the mm refcount is zero and the range is no longer accessible.
> >   */
> >  enum mmu_notifier_event {
> >  	MMU_NOTIFY_UNMAP = 0,
> > @@ -39,6 +44,7 @@ enum mmu_notifier_event {
> >  	MMU_NOTIFY_PROTECTION_VMA,
> >  	MMU_NOTIFY_PROTECTION_PAGE,
> >  	MMU_NOTIFY_SOFT_DIRTY,
> > +	MMU_NOTIFY_RELEASE,
> >  };
> 
> 
> OK, let the naming debates begin! ha. Anyway, after careful study of the overall
> patch, and some browsing of the larger patchset, it's clear that:
> 
> * The new "MMU range notifier" that you've created is, approximately, a new
> object. It uses classic mmu notifiers inside, as an implementation detail, and
> it does *similar* things (notifications) as mmn's. But it's certainly not the same
> as mmn's, as shown later when you say the need to an entirely new ops struct, and 
> data struct too.
> 
> Therefore, you need a separate events enum as well. This is important. MMN's
> won't be issuing MMN_NOTIFY_RELEASE events, nor will MNR's be issuing the first
> four prexisting MMU_NOTIFY_* items. So it would be a design mistake to glom them
> together, unless you ultimately decided to merge these MMN and MNR objects (which
> I don't really see any intention of, and that's fine).
> 
> So this should read:
> 
> enum mmu_range_notifier_event {
> 	MMU_NOTIFY_RELEASE,
> };
> 
> ...assuming that we stay with "mmu_range_notifier" as a core name for this 
> whole thing.
> 
> Also, it is best moved down to be next to the new MNR structs, so that all the
> MNR stuff is in one group.
> 
> Extra credit: IMHO, this clearly deserves to all be in a new mmu_range_notifier.h
> header file, but I know that's extra work. Maybe later as a follow-up patch,
> if anyone has the time.

The range notifier should get the event too, it would be a waste, i think it is
an oversight here. The release event is fine so NAK to you separate event. Event
is really an helper for notifier i had a set of patch for nouveau to leverage
this i need to resucite them. So no need to split thing, i would just forward
the event ie add event to mmu_range_notifier_ops.invalidate() i failed to catch
that in v1 sorry.


[...]

> > +struct mmu_range_notifier_ops {
> > +	bool (*invalidate)(struct mmu_range_notifier *mrn,
> > +			   const struct mmu_notifier_range *range,
> > +			   unsigned long cur_seq);
> > +};
> > +
> > +struct mmu_range_notifier {
> > +	struct interval_tree_node interval_tree;
> > +	const struct mmu_range_notifier_ops *ops;
> > +	struct hlist_node deferred_item;
> > +	unsigned long invalidate_seq;
> > +	struct mm_struct *mm;
> > +};
> > +
> 
> Again, now we have the new struct mmu_range_notifier, and the old 
> struct mmu_notifier_range, and it's not good.
> 
> Ideas:
> 
> a) Live with it.
> 
> b) (Discarded, too many callers): rename old one. Nope.
> 
> c) Rename new one. Ideas:
> 
>     struct mmu_interval_notifier
>     struct mmu_range_intersection
>     ...other ideas?

I vote for interval_notifier we do want notifier in name but i am also
fine with current name.

[...]

> > + *
> > + * Note that the core mm creates nested invalidate_range_start()/end() regions
> > + * within the same thread, and runs invalidate_range_start()/end() in parallel
> > + * on multiple CPUs. This is designed to not reduce concurrency or block
> > + * progress on the mm side.
> > + *
> > + * As a secondary function, holding the full write side also serves to prevent
> > + * writers for the itree, this is an optimization to avoid extra locking
> > + * during invalidate_range_start/end notifiers.
> > + *
> > + * The write side has two states, fully excluded:
> > + *  - mm->active_invalidate_ranges != 0
> > + *  - mnn->invalidate_seq & 1 == True
> > + *  - some range on the mm_struct is being invalidated
> > + *  - the itree is not allowed to change
> > + *
> > + * And partially excluded:
> > + *  - mm->active_invalidate_ranges != 0
> 
> I assume this implies mnn->invalidate_seq & 1 == False in this case? If so,
> let's say so. I'm probably getting that wrong, too.

Yes (mnn->invalidate_seq & 1) == 0

> 
> > + *  - some range on the mm_struct is being invalidated
> > + *  - the itree is allowed to change
> > + *
> > + * The later state avoids some expensive work on inv_end in the common case of
> > + * no mrn monitoring the VA.
> > + */
> > +static bool mn_itree_is_invalidating(struct mmu_notifier_mm *mmn_mm)
> > +{
> > +	lockdep_assert_held(&mmn_mm->lock);
> > +	return mmn_mm->invalidate_seq & 1;
> > +}
> > +
> > +static struct mmu_range_notifier *
> > +mn_itree_inv_start_range(struct mmu_notifier_mm *mmn_mm,
> > +			 const struct mmu_notifier_range *range,
> > +			 unsigned long *seq)
> > +{
> > +	struct interval_tree_node *node;
> > +	struct mmu_range_notifier *res = NULL;
> > +
> > +	spin_lock(&mmn_mm->lock);
> > +	mmn_mm->active_invalidate_ranges++;
> > +	node = interval_tree_iter_first(&mmn_mm->itree, range->start,
> > +					range->end - 1);
> > +	if (node) {
> > +		mmn_mm->invalidate_seq |= 1;
> 
> 
> OK, this either needs more documentation and assertions, or a different
> approach. Because I see addition, subtraction, AND, OR and booleans
> all being applied to this field, and it's darn near hopeless to figure
> out whether or not it really is even or odd at the right times.
> 
> Different approach: why not just add a mmn_mm->is_invalidating 
> member variable? It's not like you're short of space in that struct.

The invalidate_seq scheme looks fine to me, maybe it can use more comments.


> 
> 
> > +		res = container_of(node, struct mmu_range_notifier,
> > +				   interval_tree);
> > +	}
> > +
> > +	*seq = mmn_mm->invalidate_seq;
> > +	spin_unlock(&mmn_mm->lock);
> > +	return res;
> > +}
> > +
> > +static struct mmu_range_notifier *
> > +mn_itree_inv_next(struct mmu_range_notifier *mrn,
> > +		  const struct mmu_notifier_range *range)
> > +{
> > +	struct interval_tree_node *node;
> > +
> > +	node = interval_tree_iter_next(&mrn->interval_tree, range->start,
> > +				       range->end - 1);
> > +	if (!node)
> > +		return NULL;
> > +	return container_of(node, struct mmu_range_notifier, interval_tree);
> > +}
> > +
> > +static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
> > +{
> > +	struct mmu_range_notifier *mrn;
> > +	struct hlist_node *next;
> > +	bool need_wake = false;
> > +
> > +	spin_lock(&mmn_mm->lock);
> > +	if (--mmn_mm->active_invalidate_ranges ||
> > +	    !mn_itree_is_invalidating(mmn_mm)) {
> > +		spin_unlock(&mmn_mm->lock);
> > +		return;
> > +	}
> > +
> > +	mmn_mm->invalidate_seq++;
> 
> Is this the right place for an assertion that this is now an even value?

Yes at that point it should be even ie mmn_mm->active_invalidate_ranges == 0
and we are holding the lock thus nothing can set the lower bit of invalidate_seq
and ++ should lead to even number.

> 
> > +	need_wake = true;
> > +
> > +	/*
> > +	 * The inv_end incorporates a deferred mechanism like
> > +	 * rtnl_lock(). Adds and removes are queued until the final inv_end
> 
> Let me point out that rtnl_lock() itself is a one-liner that calls mutex_lock().
> But I suppose if one studies that file closely there is more. :)

I think i commented in v1 about rtnl_lock() being something network people only
might be familiar, i think i saw it documented somewhere, maybe a lwn article.
But if you are familiar with network it is a think well understood ... for any
reasonable network scholar ;)

> ...
> 
> > +unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn)
> > +{
> > +	struct mmu_notifier_mm *mmn_mm = mrn->mm->mmu_notifier_mm;
> > +	unsigned long seq;
> > +	bool is_invalidating;
> > +
> > +	/*
> > +	 * If the mrn has a different seq value under the user_lock than we
> > +	 * started with then it has collided.
> > +	 *
> > +	 * If the mrn currently has the same seq value as the mmn_mm seq, then
> > +	 * it is currently between invalidate_start/end and is colliding.
> > +	 *
> > +	 * The locking looks broadly like this:
> > +	 *   mn_tree_invalidate_start():          mmu_range_read_begin():
> > +	 *                                         spin_lock
> > +	 *                                          seq = READ_ONCE(mrn->invalidate_seq);
> > +	 *                                          seq == mmn_mm->invalidate_seq
> > +	 *                                         spin_unlock
> > +	 *    spin_lock
> > +	 *     seq = ++mmn_mm->invalidate_seq
> > +	 *    spin_unlock
> > +	 *     op->invalidate_range():
> > +	 *       user_lock
> > +	 *        mmu_range_set_seq()
> > +	 *         mrn->invalidate_seq = seq
> > +	 *       user_unlock
> > +	 *
> > +	 *                          [Required: mmu_range_read_retry() == true]
> > +	 *
> > +	 *   mn_itree_inv_end():
> > +	 *    spin_lock
> > +	 *     seq = ++mmn_mm->invalidate_seq
> > +	 *    spin_unlock
> > +	 *
> > +	 *                                        user_lock
> > +	 *                                         mmu_range_read_retry():
> > +	 *                                          mrn->invalidate_seq != seq
> > +	 *                                        user_unlock
> > +	 *
> > +	 * Barriers are not needed here as any races here are closed by an
> > +	 * eventual mmu_range_read_retry(), which provides a barrier via the
> > +	 * user_lock.
> > +	 */
> > +	spin_lock(&mmn_mm->lock);
> > +	/* Pairs with the WRITE_ONCE in mmu_range_set_seq() */
> > +	seq = READ_ONCE(mrn->invalidate_seq);
> > +	is_invalidating = seq == mmn_mm->invalidate_seq;
> > +	spin_unlock(&mmn_mm->lock);
> > +
> > +	/*
> > +	 * mrn->invalidate_seq is always set to an odd value. This ensures
> 
> This claim just looks wrong the first N times one reads the code, given that
> there is mmu_range_set_seq() to set it to an arbitrary value!  Maybe you mean
> 
> "is always set to an odd value when invalidating"??

No it is always odd, you must call mmu_range_set_seq() only from the
op->invalidate_range() callback at which point the seq is odd. As well
when mrn is added and its seq first set it is set to an odd value
always. Maybe the comment, should read:

 * mrn->invalidate_seq is always, yes always, set to an odd value. This ensures

To stress that it is not an error.

> 
> > +	 * that if seq does wrap we will always clear the below sleep in some
> > +	 * reasonable time as mmn_mm->invalidate_seq is even in the idle
> > +	 * state.
> > +	 */
> 
> Let's move that comment higher up. The code that follows it has nothing to
> do with it, so it's confusing here.

No the comment is in the right place, the fact that it is odd and that
idle state is even explains why the wait() will never last forever.
Already had a discussion on this in v1.

[...]

> > +	/*
> > +	 * If some invalidate_range_start/end region is going on in parallel
> > +	 * we don't know what VA ranges are affected, so we must assume this
> > +	 * new range is included.
> > +	 *
> > +	 * If the itree is invalidating then we are not allowed to change
> > +	 * it. Retrying until invalidation is done is tricky due to the
> > +	 * possibility for live lock, instead defer the add to the unlock so
> > +	 * this algorithm is deterministic.
> > +	 *
> > +	 * In all cases the value for the mrn->mr_invalidate_seq should be
> > +	 * odd, see mmu_range_read_begin()
> > +	 */
> > +	spin_lock(&mmn_mm->lock);
> > +	if (mmn_mm->active_invalidate_ranges) {
> > +		if (mn_itree_is_invalidating(mmn_mm))
> > +			hlist_add_head(&mrn->deferred_item,
> > +				       &mmn_mm->deferred_list);
> > +		else {
> > +			mmn_mm->invalidate_seq |= 1;
> > +			interval_tree_insert(&mrn->interval_tree,
> > +					     &mmn_mm->itree);
> > +		}
> > +		mrn->invalidate_seq = mmn_mm->invalidate_seq;
> > +	} else {
> > +		WARN_ON(mn_itree_is_invalidating(mmn_mm));
> > +		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
> 
> Ohhh, checkmate. I lose. Why is *subtracting* the right thing to do
> for seq numbers here?  I'm acutely unhappy trying to figure this out.
> I suspect it's another unfortunate side effect of trying to use the
> lower bit of the seq number (even/odd) for something else.

If there is no mmn_mm->active_invalidate_ranges then it means that
mmn_mm->invalidate_seq is even and thus mmn_mm->invalidate_seq - 1
is an odd number which means that mrn->invalidate_seq is initialized
to odd value and if you follow the rule for calling mmu_range_set_seq()
then it will _always_ be an odd number and this close the loop with
the above comments :)

> 
> > +		interval_tree_insert(&mrn->interval_tree, &mmn_mm->itree);
> > +	}
> > +	spin_unlock(&mmn_mm->lock);
> > +	return 0;
> > +}
> > +
> > +/**
> > + * mmu_range_notifier_insert - Insert a range notifier
> > + * @mrn: Range notifier to register
> > + * @start: Starting virtual address to monitor
> > + * @length: Length of the range to monitor
> > + * @mm : mm_struct to attach to
> > + *
> > + * This function subscribes the range notifier for notifications from the mm.
> > + * Upon return the ops related to mmu_range_notifier will be called whenever
> > + * an event that intersects with the given range occurs.
> > + *
> > + * Upon return the range_notifier may not be present in the interval tree yet.
> > + * The caller must use the normal range notifier locking flow via
> > + * mmu_range_read_begin() to establish SPTEs for this range.
> > + */
> > +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> > +			      unsigned long start, unsigned long length,
> > +			      struct mm_struct *mm)
> > +{
> > +	struct mmu_notifier_mm *mmn_mm;
> > +	int ret;
> 
> Hmmm, I think a later patch improperly changes the above to "int ret = 0;".
> I'll check on that. It's correct here, though.
> 
> > +
> > +	might_lock(&mm->mmap_sem);
> > +
> > +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
> 
> What does the above pair with? Should have a comment that specifies that.

It was discussed in v1 but maybe a comment of what was said back then would
be helpful. Something like:

/*
 * We need to insure that all writes to mm->mmu_notifier_mm are visible before
 * any checks we do on mmn_mm below as otherwise CPU might re-order write done
 * by another CPU core to mm->mmu_notifier_mm structure fields after the read
 * belows.
 */

Cheers,
Jérôme
Jason Gunthorpe Nov. 7, 2019, 8:06 p.m. UTC | #5
On Wed, Nov 06, 2019 at 04:23:21PM -0800, John Hubbard wrote:
 
> Nice design, I love the seq foundation! So far, I'm not able to spot anything
> actually wrong with the implementation, sorry about that. 

Alas :( I feel there must be a bug in here still, but onwards!

One of the main sad points was it didn't make sense to use the
existing seqlock/seqcount primitives as they have both the wrong write
concurrancy model and extra barriers that are not needed when it is
always manipulated under a spinlock
 
> 1. There is a rather severe naming overlap (not technically a naming conflict,
> but still) with existing mmn work, which already has, for example:
> 
>     struct mmu_notifier_range
> 
> ...and you're adding:
> 
>     struct mmu_range_notifier
> 
> ...so I'll try to help sort that out.

Yes, I've been sad about this too.

> So this should read:
> 
> enum mmu_range_notifier_event {
> 	MMU_NOTIFY_RELEASE,
> };
> 
> ...assuming that we stay with "mmu_range_notifier" as a core name for this 
> whole thing.
> 
> Also, it is best moved down to be next to the new MNR structs, so that all the
> MNR stuff is in one group.

I agree with Jerome, this enum is part of the 'struct
mmu_notifier_range' (ie the description of the invalidation) and it
doesn't really matter that only these new notifiers can be called with
this type, it is still part of the mmu_notifier_range.

The comment already says it only applies to the mmu_range_notifier
scheme..

> >  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> > @@ -222,6 +228,26 @@ struct mmu_notifier {
> >  	unsigned int users;
> >  };
> >  
> 
> That should also be moved down, next to the new structs.

Which this?

> > +/**
> > + * struct mmu_range_notifier_ops
> > + * @invalidate: Upon return the caller must stop using any SPTEs within this
> > + *              range, this function can sleep. Return false if blocking was
> > + *              required but range is non-blocking
> > + */
> 
> How about this (I'm not sure I fully understand the return value, though):
> 
> /**
>  * struct mmu_range_notifier_ops
>  * @invalidate: Upon return the caller must stop using any SPTEs within this
>  * 		range.
>  *
>  * 		This function is permitted to sleep.
>  *
>  *      	@Return: false if blocking was required, but @range is
>  *			non-blocking.
>  *
>  */

Is this kdoc format for function pointers?
 
> 
> > +struct mmu_range_notifier_ops {
> > +	bool (*invalidate)(struct mmu_range_notifier *mrn,
> > +			   const struct mmu_notifier_range *range,
> > +			   unsigned long cur_seq);
> > +};
> > +
> > +struct mmu_range_notifier {
> > +	struct interval_tree_node interval_tree;
> > +	const struct mmu_range_notifier_ops *ops;
> > +	struct hlist_node deferred_item;
> > +	unsigned long invalidate_seq;
> > +	struct mm_struct *mm;
> > +};
> > +
> 
> Again, now we have the new struct mmu_range_notifier, and the old 
> struct mmu_notifier_range, and it's not good.
> 
> Ideas:
> 
> a) Live with it.
> 
> b) (Discarded, too many callers): rename old one. Nope.
> 
> c) Rename new one. Ideas:
> 
>     struct mmu_interval_notifier
>     struct mmu_range_intersection
>     ...other ideas?

This odd duality has already cause some confusion, but names here are
hard.  mmu_interval_notifier is the best alternative I've heard.

Changing this name is a lot of work - are we happy
'mmu_interval_notifier' is the right choice?

> > +/**
> > + * mmu_range_set_seq - Save the invalidation sequence
> 
> How about:
> 
>  * mmu_range_set_seq - Set the .invalidate_seq to a new value.

It is not a 'new value', it is a value that is provided to the
invalidate callback

> 
> > + * @mrn - The mrn passed to invalidate
> > + * @cur_seq - The cur_seq passed to invalidate
> > + *
> > + * This must be called unconditionally from the invalidate callback of a
> > + * struct mmu_range_notifier_ops under the same lock that is used to call
> > + * mmu_range_read_retry(). It updates the sequence number for later use by
> > + * mmu_range_read_retry().
> > + *
> > + * If the user does not call mmu_range_read_begin() or mmu_range_read_retry()
> 
> nit: "caller" is better than "user", when referring to...well, callers. "user" 
> most often refers to user space, whereas a call stack and function calling is 
> clearly what you're referring to here (and in other places, especially "user lock").

Done

> > +/**
> > + * mmu_range_check_retry - Test if a collision has occurred
> > + * mrn: The range under lock
> > + * seq: The return of the matching mmu_range_read_begin()
> > + *
> > + * This can be used in the critical section between mmu_range_read_begin() and
> > + * mmu_range_read_retry().  A return of true indicates an invalidation has
> > + * collided with this lock and a future mmu_range_read_retry() will return
> > + * true.
> > + *
> > + * False is not reliable and only suggests a collision has not happened. It
> 
> let's say "suggests that a collision *may* not have occurred."  

Sure

> > +/*
> > + * This is a collision-retry read-side/write-side 'lock', a lot like a
> > + * seqcount, however this allows multiple write-sides to hold it at
> > + * once. Conceptually the write side is protecting the values of the PTEs in
> > + * this mm, such that PTES cannot be read into SPTEs while any writer exists.
> 
> Just to be kind, can we say "SPTEs (shadow PTEs)", just this once? :)

Haha, sure, why not

> > + * The write side has two states, fully excluded:
> > + *  - mm->active_invalidate_ranges != 0
> > + *  - mnn->invalidate_seq & 1 == True
> > + *  - some range on the mm_struct is being invalidated
> > + *  - the itree is not allowed to change
> > + *
> > + * And partially excluded:
> > + *  - mm->active_invalidate_ranges != 0
> 
> I assume this implies mnn->invalidate_seq & 1 == False in this case? If so,
> let's say so. I'm probably getting that wrong, too.

Yes that is right, done

> 
> > + *  - some range on the mm_struct is being invalidated
> > + *  - the itree is allowed to change
> > + *
> > + * The later state avoids some expensive work on inv_end in the common case of
> > + * no mrn monitoring the VA.
> > + */
> > +static bool mn_itree_is_invalidating(struct mmu_notifier_mm *mmn_mm)
> > +{
> > +	lockdep_assert_held(&mmn_mm->lock);
> > +	return mmn_mm->invalidate_seq & 1;
> > +}
> > +
> > +static struct mmu_range_notifier *
> > +mn_itree_inv_start_range(struct mmu_notifier_mm *mmn_mm,
> > +			 const struct mmu_notifier_range *range,
> > +			 unsigned long *seq)
> > +{
> > +	struct interval_tree_node *node;
> > +	struct mmu_range_notifier *res = NULL;
> > +
> > +	spin_lock(&mmn_mm->lock);
> > +	mmn_mm->active_invalidate_ranges++;
> > +	node = interval_tree_iter_first(&mmn_mm->itree, range->start,
> > +					range->end - 1);
> > +	if (node) {
> > +		mmn_mm->invalidate_seq |= 1;
> 
> 
> OK, this either needs more documentation and assertions, or a different
> approach. Because I see addition, subtraction, AND, OR and booleans
> all being applied to this field, and it's darn near hopeless to figure
> out whether or not it really is even or odd at the right times.

This is a standard design for a seqlock scheme and follows the
existing design of the linux seq lock.

The lower bit indicates the lock'd state and the upper bits indicate
the generation of the lock

The operations on the lock itself are then:
   seq |= 1  # Take the lock
   seq++     # Release an acquired lock
   seq & 1   # True if locked

Which is how this is written

> Different approach: why not just add a mmn_mm->is_invalidating 
> member variable? It's not like you're short of space in that struct.

Splitting it makes alot of stuff more complex and unnatural.

The ops above could be put in inline wrappers, but they only occur
only in functions already called mn_itree_inv_start_range() and
mn_itree_inv_end() and mn_itree_is_invalidating().

There is the one 'take the lock' outlier in
__mmu_range_notifier_insert() though

> > +static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
> > +{
> > +	struct mmu_range_notifier *mrn;
> > +	struct hlist_node *next;
> > +	bool need_wake = false;
> > +
> > +	spin_lock(&mmn_mm->lock);
> > +	if (--mmn_mm->active_invalidate_ranges ||
> > +	    !mn_itree_is_invalidating(mmn_mm)) {
> > +		spin_unlock(&mmn_mm->lock);
> > +		return;
> > +	}
> > +
> > +	mmn_mm->invalidate_seq++;
> 
> Is this the right place for an assertion that this is now an even value?

Yes, but I'm reluctant to add such a runtime check on this fastish path..
How about a comment?

> > +	need_wake = true;
> > +
> > +	/*
> > +	 * The inv_end incorporates a deferred mechanism like
> > +	 * rtnl_lock(). Adds and removes are queued until the final inv_end
> 
> Let me point out that rtnl_lock() itself is a one-liner that calls mutex_lock().
> But I suppose if one studies that file closely there is more. :)

Lets change that to rtnl_unlock() then

> > +	spin_lock(&mmn_mm->lock);
> > +	/* Pairs with the WRITE_ONCE in mmu_range_set_seq() */
> > +	seq = READ_ONCE(mrn->invalidate_seq);
> > +	is_invalidating = seq == mmn_mm->invalidate_seq;
> > +	spin_unlock(&mmn_mm->lock);
> > +
> > +	/*
> > +	 * mrn->invalidate_seq is always set to an odd value. This ensures
> 
> This claim just looks wrong the first N times one reads the code, given that
> there is mmu_range_set_seq() to set it to an arbitrary value!  Maybe
> you mean

mmu_range_set_seq() is NOT to be used to set to an arbitary value, it
must only be used to set to the value provided in the invalidate()
callback and that value is always odd. Lets make this super clear:

	/*
	 * mrn->invalidate_seq must always be set to an odd value via
	 * mmu_range_set_seq() using the provided cur_seq from
	 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
	 * will always clear the below sleep in some reasonable time as
	 * mmn_mm->invalidate_seq is even in the idle state.
	 */

The invarient is that the 'struct mmu_range_notifier' always has an
odd 'seq'

> > +	 * that if seq does wrap we will always clear the below sleep in some
> > +	 * reasonable time as mmn_mm->invalidate_seq is even in the idle
> > +	 * state.
> > +	 */
> 
> Let's move that comment higher up. The code that follows it has nothing to
> do with it, so it's confusing here.

The comment is explaining why the wait_event is safe, even if we wrap
the sequence number, which is a significant and very subtle corner
case. This is really why we have the even/odd thing at all.

> > +	spin_lock(&mmn_mm->lock);
> > +	if (mmn_mm->active_invalidate_ranges) {
> > +		if (mn_itree_is_invalidating(mmn_mm))
> > +			hlist_add_head(&mrn->deferred_item,
> > +				       &mmn_mm->deferred_list);
> > +		else {
> > +			mmn_mm->invalidate_seq |= 1;
> > +			interval_tree_insert(&mrn->interval_tree,
> > +					     &mmn_mm->itree);
> > +		}
> > +		mrn->invalidate_seq = mmn_mm->invalidate_seq;
> > +	} else {
> > +		WARN_ON(mn_itree_is_invalidating(mmn_mm));
> > +		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
> 
> Ohhh, checkmate. I lose. Why is *subtracting* the right thing to do
> for seq numbers here?  I'm acutely unhappy trying to figure this out.
> I suspect it's another unfortunate side effect of trying to use the
> lower bit of the seq number (even/odd) for something else.

No, this is actually done for the seq number itself. We need to
generate a seq number that is != the current invalidate_seq as this
new mrn is not invalidating.

The best seq to use is one that the invalidate_seq will not reach for
a long time, ie 'invalidate_seq + MAX' which is expressed as -1

The even/odd thing just takes care of itself naturally here as
invalidate_seq is guarenteed even and -1 creates both an odd mrn value
and a good seq number.

The algorithm would actually work correctly if this was
'mrn->invalidate_seq = 1', but occasionally things would block when
they don't need to block.

Lets add a comment:

		/*
		 * The starting seq for a mrn not under invalidation should be
		 * odd, not equal to the current invalidate_seq and
		 * invalidate_seq should not 'wrap' to the new seq any time
		 * soon.
		 */

> > +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> > +			      unsigned long start, unsigned long length,
> > +			      struct mm_struct *mm)
> > +{
> > +	struct mmu_notifier_mm *mmn_mm;
> > +	int ret;
> 
> Hmmm, I think a later patch improperly changes the above to "int ret = 0;".
> I'll check on that. It's correct here, though.

Looks OK in my tree?

> > +	might_lock(&mm->mmap_sem);
> > +
> > +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
> 
> What does the above pair with? Should have a comment that specifies that.

smp_load_acquire() always pairs with smp_store_release() to the same
memory, there is only one store, is a comment really needed?

Below are the comment updates I made, thanks!

Jason

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 51b92ba013ddce..065c95002e9602 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -302,15 +302,15 @@ void mmu_range_notifier_remove(struct mmu_range_notifier *mrn);
 /**
  * mmu_range_set_seq - Save the invalidation sequence
  * @mrn - The mrn passed to invalidate
- * @cur_seq - The cur_seq passed to invalidate
+ * @cur_seq - The cur_seq passed to the invalidate() callback
  *
  * This must be called unconditionally from the invalidate callback of a
  * struct mmu_range_notifier_ops under the same lock that is used to call
  * mmu_range_read_retry(). It updates the sequence number for later use by
- * mmu_range_read_retry().
+ * mmu_range_read_retry(). The provided cur_seq will always be odd.
  *
- * If the user does not call mmu_range_read_begin() or mmu_range_read_retry()
- * then this call is not required.
+ * If the caller does not call mmu_range_read_begin() or
+ * mmu_range_read_retry() then this call is not required.
  */
 static inline void mmu_range_set_seq(struct mmu_range_notifier *mrn,
 				     unsigned long cur_seq)
@@ -348,8 +348,9 @@ static inline bool mmu_range_read_retry(struct mmu_range_notifier *mrn,
  * collided with this lock and a future mmu_range_read_retry() will return
  * true.
  *
- * False is not reliable and only suggests a collision has not happened. It
- * can be called many times and does not have to hold the user provided lock.
+ * False is not reliable and only suggests a collision may not have
+ * occured. It can be called many times and does not have to hold the user
+ * provided lock.
  *
  * This call can be used as part of loops and other expensive operations to
  * expedite a retry.
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 2b7485919ecfeb..afe1e2d94183f8 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -51,7 +51,8 @@ struct mmu_notifier_mm {
  * This is a collision-retry read-side/write-side 'lock', a lot like a
  * seqcount, however this allows multiple write-sides to hold it at
  * once. Conceptually the write side is protecting the values of the PTEs in
- * this mm, such that PTES cannot be read into SPTEs while any writer exists.
+ * this mm, such that PTES cannot be read into SPTEs (shadow PTEs) while any
+ * writer exists.
  *
  * Note that the core mm creates nested invalidate_range_start()/end() regions
  * within the same thread, and runs invalidate_range_start()/end() in parallel
@@ -64,12 +65,13 @@ struct mmu_notifier_mm {
  *
  * The write side has two states, fully excluded:
  *  - mm->active_invalidate_ranges != 0
- *  - mnn->invalidate_seq & 1 == True
+ *  - mnn->invalidate_seq & 1 == True (odd)
  *  - some range on the mm_struct is being invalidated
  *  - the itree is not allowed to change
  *
  * And partially excluded:
  *  - mm->active_invalidate_ranges != 0
+ *  - mnn->invalidate_seq & 1 == False (even)
  *  - some range on the mm_struct is being invalidated
  *  - the itree is allowed to change
  *
@@ -131,12 +133,13 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
 		return;
 	}
 
+	/* Make invalidate_seq even */
 	mmn_mm->invalidate_seq++;
 	need_wake = true;
 
 	/*
 	 * The inv_end incorporates a deferred mechanism like
-	 * rtnl_lock(). Adds and removes are queued until the final inv_end
+	 * rtnl_unlock(). Adds and removes are queued until the final inv_end
 	 * happens then they are progressed. This arrangement for tree updates
 	 * is used to avoid using a blocking lock during
 	 * invalidate_range_start.
@@ -230,10 +233,11 @@ unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn)
 	spin_unlock(&mmn_mm->lock);
 
 	/*
-	 * mrn->invalidate_seq is always set to an odd value. This ensures
-	 * that if seq does wrap we will always clear the below sleep in some
-	 * reasonable time as mmn_mm->invalidate_seq is even in the idle
-	 * state.
+	 * mrn->invalidate_seq must always be set to an odd value via
+	 * mmu_range_set_seq() using the provided cur_seq from
+	 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
+	 * will always clear the below sleep in some reasonable time as
+	 * mmn_mm->invalidate_seq is even in the idle state.
 	 */
 	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
@@ -892,6 +896,12 @@ static int __mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
 		mrn->invalidate_seq = mmn_mm->invalidate_seq;
 	} else {
 		WARN_ON(mn_itree_is_invalidating(mmn_mm));
+		/*
+		 * The starting seq for a mrn not under invalidation should be
+		 * odd, not equal to the current invalidate_seq and
+		 * invalidate_seq should not 'wrap' to the new seq any time
+		 * soon.
+		 */
 		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
 		interval_tree_insert(&mrn->interval_tree, &mmn_mm->itree);
 	}
Jason Gunthorpe Nov. 7, 2019, 8:11 p.m. UTC | #6
On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:

> > 
> > Extra credit: IMHO, this clearly deserves to all be in a new mmu_range_notifier.h
> > header file, but I know that's extra work. Maybe later as a follow-up patch,
> > if anyone has the time.
> 
> The range notifier should get the event too, it would be a waste, i think it is
> an oversight here. The release event is fine so NAK to you separate event. Event
> is really an helper for notifier i had a set of patch for nouveau to leverage
> this i need to resucite them. So no need to split thing, i would just forward
> the event ie add event to mmu_range_notifier_ops.invalidate() i failed to catch
> that in v1 sorry.

I think what you mean is already done?

struct mmu_range_notifier_ops {
	bool (*invalidate)(struct mmu_range_notifier *mrn,
			   const struct mmu_notifier_range *range,
			   unsigned long cur_seq);

> No it is always odd, you must call mmu_range_set_seq() only from the
> op->invalidate_range() callback at which point the seq is odd. As well
> when mrn is added and its seq first set it is set to an odd value
> always. Maybe the comment, should read:
> 
>  * mrn->invalidate_seq is always, yes always, set to an odd value. This ensures
> 
> To stress that it is not an error.

I went with this:

	/*
	 * mrn->invalidate_seq must always be set to an odd value via
	 * mmu_range_set_seq() using the provided cur_seq from
	 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
	 * will always clear the below sleep in some reasonable time as
	 * mmn_mm->invalidate_seq is even in the idle state.
	 */

> > > +	spin_lock(&mmn_mm->lock);
> > > +	if (mmn_mm->active_invalidate_ranges) {
> > > +		if (mn_itree_is_invalidating(mmn_mm))
> > > +			hlist_add_head(&mrn->deferred_item,
> > > +				       &mmn_mm->deferred_list);
> > > +		else {
> > > +			mmn_mm->invalidate_seq |= 1;
> > > +			interval_tree_insert(&mrn->interval_tree,
> > > +					     &mmn_mm->itree);
> > > +		}
> > > +		mrn->invalidate_seq = mmn_mm->invalidate_seq;
> > > +	} else {
> > > +		WARN_ON(mn_itree_is_invalidating(mmn_mm));
> > > +		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
> > 
> > Ohhh, checkmate. I lose. Why is *subtracting* the right thing to do
> > for seq numbers here?  I'm acutely unhappy trying to figure this out.
> > I suspect it's another unfortunate side effect of trying to use the
> > lower bit of the seq number (even/odd) for something else.
> 
> If there is no mmn_mm->active_invalidate_ranges then it means that
> mmn_mm->invalidate_seq is even and thus mmn_mm->invalidate_seq - 1
> is an odd number which means that mrn->invalidate_seq is initialized
> to odd value and if you follow the rule for calling mmu_range_set_seq()
> then it will _always_ be an odd number and this close the loop with
> the above comments :)

The key thing is that it is an odd value that will take a long time
before mmn_mm->invalidate seq reaches it

> > > +	might_lock(&mm->mmap_sem);
> > > +
> > > +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
> > 
> > What does the above pair with? Should have a comment that specifies that.
> 
> It was discussed in v1 but maybe a comment of what was said back then would
> be helpful. Something like:
> 
> /*
>  * We need to insure that all writes to mm->mmu_notifier_mm are visible before
>  * any checks we do on mmn_mm below as otherwise CPU might re-order write done
>  * by another CPU core to mm->mmu_notifier_mm structure fields after the read
>  * belows.
>  */

This comment made it, just at the store side:

	/*
	 * Serialize the update against mmu_notifier_unregister. A
	 * side note: mmu_notifier_release can't run concurrently with
	 * us because we hold the mm_users pin (either implicitly as
	 * current->mm or explicitly with get_task_mm() or similar).
	 * We can't race against any other mmu notifier method either
	 * thanks to mm_take_all_locks().
	 *
	 * release semantics on the initialization of the mmu_notifier_mm's
         * contents are provided for unlocked readers.  acquire can only be
         * used while holding the mmgrab or mmget, and is safe because once
         * created the mmu_notififer_mm is not freed until the mm is
         * destroyed.  As above, users holding the mmap_sem or one of the
         * mm_take_all_locks() do not need to use acquire semantics.
	 */
	if (mmu_notifier_mm)
		smp_store_release(&mm->mmu_notifier_mm, mmu_notifier_mm);

Which I think is really overly belaboring the typical smp
store/release pattern, but people do seem unfamiliar with them...

Thanks,
Jason
John Hubbard Nov. 7, 2019, 8:53 p.m. UTC | #7
On 11/7/19 12:06 PM, Jason Gunthorpe wrote:
...
>>
>> Also, it is best moved down to be next to the new MNR structs, so that all the
>> MNR stuff is in one group.
> 
> I agree with Jerome, this enum is part of the 'struct
> mmu_notifier_range' (ie the description of the invalidation) and it
> doesn't really matter that only these new notifiers can be called with
> this type, it is still part of the mmu_notifier_range.
> 

OK.

> The comment already says it only applies to the mmu_range_notifier
> scheme..
> 
>>>   #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>>> @@ -222,6 +228,26 @@ struct mmu_notifier {
>>>   	unsigned int users;
>>>   };
>>>   
>>
>> That should also be moved down, next to the new structs.
> 
> Which this?

I was referring to MMU_NOTIFIER_RANGE_BLOCKABLE, above. Trying
to put all the new range notifier stuff in one place. But maybe not,
if these are really not as separate as I thought.

> 
>>> +/**
>>> + * struct mmu_range_notifier_ops
>>> + * @invalidate: Upon return the caller must stop using any SPTEs within this
>>> + *              range, this function can sleep. Return false if blocking was
>>> + *              required but range is non-blocking
>>> + */
>>
>> How about this (I'm not sure I fully understand the return value, though):
>>
>> /**
>>   * struct mmu_range_notifier_ops
>>   * @invalidate: Upon return the caller must stop using any SPTEs within this
>>   * 		range.
>>   *
>>   * 		This function is permitted to sleep.
>>   *
>>   *      	@Return: false if blocking was required, but @range is
>>   *			non-blocking.
>>   *
>>   */
> 
> Is this kdoc format for function pointers?

heh, I'm sort of winging it, I'm not sure how function pointers are supposed
to be documented in kdoc. Actually the only key take-away here is to write

"This function can sleep"

as a separate sentence..

...
>> c) Rename new one. Ideas:
>>
>>      struct mmu_interval_notifier
>>      struct mmu_range_intersection
>>      ...other ideas?
> 
> This odd duality has already cause some confusion, but names here are
> hard.  mmu_interval_notifier is the best alternative I've heard.
> 
> Changing this name is a lot of work - are we happy
> 'mmu_interval_notifier' is the right choice?


Yes, it's my favorite too. I'd vote for going with that.

...
>>
>>
>> OK, this either needs more documentation and assertions, or a different
>> approach. Because I see addition, subtraction, AND, OR and booleans
>> all being applied to this field, and it's darn near hopeless to figure
>> out whether or not it really is even or odd at the right times.
> 
> This is a standard design for a seqlock scheme and follows the
> existing design of the linux seq lock.
> 
> The lower bit indicates the lock'd state and the upper bits indicate
> the generation of the lock
> 
> The operations on the lock itself are then:
>     seq |= 1  # Take the lock
>     seq++     # Release an acquired lock
>     seq & 1   # True if locked
> 
> Which is how this is written

Very nice, would you be open to putting that into (any) one of the comment
headers? That's an unusually clear and concise description:

/*
  * This is a standard design for a seqlock scheme and follows the
  * existing design of the linux seq lock.
  *
  * The lower bit indicates the lock'd state and the upper bits indicate
  * the generation of the lock
  *
  * The operations on the lock itself are then:
  *    seq |= 1  # Take the lock
  *    seq++     # Release an acquired lock
  *    seq & 1   # True if locked
  */


> 
>> Different approach: why not just add a mmn_mm->is_invalidating
>> member variable? It's not like you're short of space in that struct.
> 
> Splitting it makes alot of stuff more complex and unnatural.
> 

OK, agreed.

> The ops above could be put in inline wrappers, but they only occur
> only in functions already called mn_itree_inv_start_range() and
> mn_itree_inv_end() and mn_itree_is_invalidating().
> 
> There is the one 'take the lock' outlier in
> __mmu_range_notifier_insert() though
> 
>>> +static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
>>> +{
>>> +	struct mmu_range_notifier *mrn;
>>> +	struct hlist_node *next;
>>> +	bool need_wake = false;
>>> +
>>> +	spin_lock(&mmn_mm->lock);
>>> +	if (--mmn_mm->active_invalidate_ranges ||
>>> +	    !mn_itree_is_invalidating(mmn_mm)) {
>>> +		spin_unlock(&mmn_mm->lock);
>>> +		return;
>>> +	}
>>> +
>>> +	mmn_mm->invalidate_seq++;
>>
>> Is this the right place for an assertion that this is now an even value?
> 
> Yes, but I'm reluctant to add such a runtime check on this fastish path..
> How about a comment?

Sure.

> 
>>> +	need_wake = true;
>>> +
>>> +	/*
>>> +	 * The inv_end incorporates a deferred mechanism like
>>> +	 * rtnl_lock(). Adds and removes are queued until the final inv_end
>>
>> Let me point out that rtnl_lock() itself is a one-liner that calls mutex_lock().
>> But I suppose if one studies that file closely there is more. :)
> 
> Lets change that to rtnl_unlock() then


Thanks :)


...
>>> +	 * mrn->invalidate_seq is always set to an odd value. This ensures
>>
>> This claim just looks wrong the first N times one reads the code, given that
>> there is mmu_range_set_seq() to set it to an arbitrary value!  Maybe
>> you mean
> 
> mmu_range_set_seq() is NOT to be used to set to an arbitary value, it
> must only be used to set to the value provided in the invalidate()
> callback and that value is always odd. Lets make this super clear:
> 
> 	/*
> 	 * mrn->invalidate_seq must always be set to an odd value via
> 	 * mmu_range_set_seq() using the provided cur_seq from
> 	 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
> 	 * will always clear the below sleep in some reasonable time as
> 	 * mmn_mm->invalidate_seq is even in the idle state.
> 	 */
> 

OK, that helps a lot.

...
>>> +		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
>>
>> Ohhh, checkmate. I lose. Why is *subtracting* the right thing to do
>> for seq numbers here?  I'm acutely unhappy trying to figure this out.
>> I suspect it's another unfortunate side effect of trying to use the
>> lower bit of the seq number (even/odd) for something else.
> 
> No, this is actually done for the seq number itself. We need to
> generate a seq number that is != the current invalidate_seq as this
> new mrn is not invalidating.
> 
> The best seq to use is one that the invalidate_seq will not reach for
> a long time, ie 'invalidate_seq + MAX' which is expressed as -1
> 
> The even/odd thing just takes care of itself naturally here as
> invalidate_seq is guarenteed even and -1 creates both an odd mrn value
> and a good seq number.
> 
> The algorithm would actually work correctly if this was
> 'mrn->invalidate_seq = 1', but occasionally things would block when
> they don't need to block.
> 
> Lets add a comment:
> 
> 		/*
> 		 * The starting seq for a mrn not under invalidation should be
> 		 * odd, not equal to the current invalidate_seq and
> 		 * invalidate_seq should not 'wrap' to the new seq any time
> 		 * soon.
> 		 */

Very helpful. How about this additional tweak:

/*
  * The starting seq for a mrn not under invalidation should be
  * odd, not equal to the current invalidate_seq and
  * invalidate_seq should not 'wrap' to the new seq any time
  * soon. Subtracting 1 from the current (even) value achieves that.
  */


> 
>>> +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
>>> +			      unsigned long start, unsigned long length,
>>> +			      struct mm_struct *mm)
>>> +{
>>> +	struct mmu_notifier_mm *mmn_mm;
>>> +	int ret;
>>
>> Hmmm, I think a later patch improperly changes the above to "int ret = 0;".
>> I'll check on that. It's correct here, though.
> 
> Looks OK in my tree?

Nope, that's how I found it. The top of your mmu_notifier branch has this:

int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
{
         struct mmu_notifier_mm *mmn_mm = range->mm->mmu_notifier_mm;
         int ret = 0;

         if (mmn_mm->has_interval) {
                 ret = mn_itree_invalidate(mmn_mm, range);
                 if (ret)
                         return ret;
         }
         if (!hlist_empty(&mmn_mm->list))
                 return mn_hlist_invalidate_range_start(mmn_mm, range);
         return 0;
}


> 
>>> +	might_lock(&mm->mmap_sem);
>>> +
>>> +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
>>
>> What does the above pair with? Should have a comment that specifies that.
> 
> smp_load_acquire() always pairs with smp_store_release() to the same
> memory, there is only one store, is a comment really needed?
> 
> Below are the comment updates I made, thanks!
> 
> Jason
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 51b92ba013ddce..065c95002e9602 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -302,15 +302,15 @@ void mmu_range_notifier_remove(struct mmu_range_notifier *mrn);
>   /**
>    * mmu_range_set_seq - Save the invalidation sequence
>    * @mrn - The mrn passed to invalidate
> - * @cur_seq - The cur_seq passed to invalidate
> + * @cur_seq - The cur_seq passed to the invalidate() callback
>    *
>    * This must be called unconditionally from the invalidate callback of a
>    * struct mmu_range_notifier_ops under the same lock that is used to call
>    * mmu_range_read_retry(). It updates the sequence number for later use by
> - * mmu_range_read_retry().
> + * mmu_range_read_retry(). The provided cur_seq will always be odd.
>    *
> - * If the user does not call mmu_range_read_begin() or mmu_range_read_retry()
> - * then this call is not required.
> + * If the caller does not call mmu_range_read_begin() or
> + * mmu_range_read_retry() then this call is not required.
>    */
>   static inline void mmu_range_set_seq(struct mmu_range_notifier *mrn,
>   				     unsigned long cur_seq)
> @@ -348,8 +348,9 @@ static inline bool mmu_range_read_retry(struct mmu_range_notifier *mrn,
>    * collided with this lock and a future mmu_range_read_retry() will return
>    * true.
>    *
> - * False is not reliable and only suggests a collision has not happened. It
> - * can be called many times and does not have to hold the user provided lock.
> + * False is not reliable and only suggests a collision may not have
> + * occured. It can be called many times and does not have to hold the user
> + * provided lock.
>    *
>    * This call can be used as part of loops and other expensive operations to
>    * expedite a retry.
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 2b7485919ecfeb..afe1e2d94183f8 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -51,7 +51,8 @@ struct mmu_notifier_mm {
>    * This is a collision-retry read-side/write-side 'lock', a lot like a
>    * seqcount, however this allows multiple write-sides to hold it at
>    * once. Conceptually the write side is protecting the values of the PTEs in
> - * this mm, such that PTES cannot be read into SPTEs while any writer exists.
> + * this mm, such that PTES cannot be read into SPTEs (shadow PTEs) while any
> + * writer exists.
>    *
>    * Note that the core mm creates nested invalidate_range_start()/end() regions
>    * within the same thread, and runs invalidate_range_start()/end() in parallel
> @@ -64,12 +65,13 @@ struct mmu_notifier_mm {
>    *
>    * The write side has two states, fully excluded:
>    *  - mm->active_invalidate_ranges != 0
> - *  - mnn->invalidate_seq & 1 == True
> + *  - mnn->invalidate_seq & 1 == True (odd)
>    *  - some range on the mm_struct is being invalidated
>    *  - the itree is not allowed to change
>    *
>    * And partially excluded:
>    *  - mm->active_invalidate_ranges != 0
> + *  - mnn->invalidate_seq & 1 == False (even)
>    *  - some range on the mm_struct is being invalidated
>    *  - the itree is allowed to change
>    *
> @@ -131,12 +133,13 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
>   		return;
>   	}
>   
> +	/* Make invalidate_seq even */
>   	mmn_mm->invalidate_seq++;
>   	need_wake = true;
>   
>   	/*
>   	 * The inv_end incorporates a deferred mechanism like
> -	 * rtnl_lock(). Adds and removes are queued until the final inv_end
> +	 * rtnl_unlock(). Adds and removes are queued until the final inv_end
>   	 * happens then they are progressed. This arrangement for tree updates
>   	 * is used to avoid using a blocking lock during
>   	 * invalidate_range_start.
> @@ -230,10 +233,11 @@ unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn)
>   	spin_unlock(&mmn_mm->lock);
>   
>   	/*
> -	 * mrn->invalidate_seq is always set to an odd value. This ensures
> -	 * that if seq does wrap we will always clear the below sleep in some
> -	 * reasonable time as mmn_mm->invalidate_seq is even in the idle
> -	 * state.
> +	 * mrn->invalidate_seq must always be set to an odd value via
> +	 * mmu_range_set_seq() using the provided cur_seq from
> +	 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
> +	 * will always clear the below sleep in some reasonable time as
> +	 * mmn_mm->invalidate_seq is even in the idle state.
>   	 */
>   	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
>   	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
> @@ -892,6 +896,12 @@ static int __mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
>   		mrn->invalidate_seq = mmn_mm->invalidate_seq;
>   	} else {
>   		WARN_ON(mn_itree_is_invalidating(mmn_mm));
> +		/*
> +		 * The starting seq for a mrn not under invalidation should be
> +		 * odd, not equal to the current invalidate_seq and
> +		 * invalidate_seq should not 'wrap' to the new seq any time
> +		 * soon.
> +		 */
>   		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
>   		interval_tree_insert(&mrn->interval_tree, &mmn_mm->itree);
>   	}
> 

Looks good. We're just polishing up minor points now, so you can add:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>



thanks,
Jerome Glisse Nov. 7, 2019, 9:04 p.m. UTC | #8
On Thu, Nov 07, 2019 at 08:11:06PM +0000, Jason Gunthorpe wrote:
> On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:
> 
> > > 
> > > Extra credit: IMHO, this clearly deserves to all be in a new mmu_range_notifier.h
> > > header file, but I know that's extra work. Maybe later as a follow-up patch,
> > > if anyone has the time.
> > 
> > The range notifier should get the event too, it would be a waste, i think it is
> > an oversight here. The release event is fine so NAK to you separate event. Event
> > is really an helper for notifier i had a set of patch for nouveau to leverage
> > this i need to resucite them. So no need to split thing, i would just forward
> > the event ie add event to mmu_range_notifier_ops.invalidate() i failed to catch
> > that in v1 sorry.
> 
> I think what you mean is already done?
> 
> struct mmu_range_notifier_ops {
> 	bool (*invalidate)(struct mmu_range_notifier *mrn,
> 			   const struct mmu_notifier_range *range,
> 			   unsigned long cur_seq);

Yes it is sorry, i got confuse with mmu_range_notifier and mmu_notifier_range :)
It is almost a palyndrome structure ;)

> 
> > No it is always odd, you must call mmu_range_set_seq() only from the
> > op->invalidate_range() callback at which point the seq is odd. As well
> > when mrn is added and its seq first set it is set to an odd value
> > always. Maybe the comment, should read:
> > 
> >  * mrn->invalidate_seq is always, yes always, set to an odd value. This ensures
> > 
> > To stress that it is not an error.
> 
> I went with this:
> 
> 	/*
> 	 * mrn->invalidate_seq must always be set to an odd value via
> 	 * mmu_range_set_seq() using the provided cur_seq from
> 	 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
> 	 * will always clear the below sleep in some reasonable time as
> 	 * mmn_mm->invalidate_seq is even in the idle state.
> 	 */

Yes fine with me.

[...]

> > > > +	might_lock(&mm->mmap_sem);
> > > > +
> > > > +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
> > > 
> > > What does the above pair with? Should have a comment that specifies that.
> > 
> > It was discussed in v1 but maybe a comment of what was said back then would
> > be helpful. Something like:
> > 
> > /*
> >  * We need to insure that all writes to mm->mmu_notifier_mm are visible before
> >  * any checks we do on mmn_mm below as otherwise CPU might re-order write done
> >  * by another CPU core to mm->mmu_notifier_mm structure fields after the read
> >  * belows.
> >  */
> 
> This comment made it, just at the store side:
> 
> 	/*
> 	 * Serialize the update against mmu_notifier_unregister. A
> 	 * side note: mmu_notifier_release can't run concurrently with
> 	 * us because we hold the mm_users pin (either implicitly as
> 	 * current->mm or explicitly with get_task_mm() or similar).
> 	 * We can't race against any other mmu notifier method either
> 	 * thanks to mm_take_all_locks().
> 	 *
> 	 * release semantics on the initialization of the mmu_notifier_mm's
>          * contents are provided for unlocked readers.  acquire can only be
>          * used while holding the mmgrab or mmget, and is safe because once
>          * created the mmu_notififer_mm is not freed until the mm is
>          * destroyed.  As above, users holding the mmap_sem or one of the
>          * mm_take_all_locks() do not need to use acquire semantics.
> 	 */
> 	if (mmu_notifier_mm)
> 		smp_store_release(&mm->mmu_notifier_mm, mmu_notifier_mm);
> 
> Which I think is really overly belaboring the typical smp
> store/release pattern, but people do seem unfamiliar with them...

Perfect with me. I think also sometimes you forgot what memory model is
and thus store/release pattern do, i know i do and i need to refresh my
mind.

Cheers,
Jérôme
Jason Gunthorpe Nov. 8, 2019, 12:32 a.m. UTC | #9
On Thu, Nov 07, 2019 at 04:04:08PM -0500, Jerome Glisse wrote:
> On Thu, Nov 07, 2019 at 08:11:06PM +0000, Jason Gunthorpe wrote:
> > On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:
> > 
> > > > 
> > > > Extra credit: IMHO, this clearly deserves to all be in a new mmu_range_notifier.h
> > > > header file, but I know that's extra work. Maybe later as a follow-up patch,
> > > > if anyone has the time.
> > > 
> > > The range notifier should get the event too, it would be a waste, i think it is
> > > an oversight here. The release event is fine so NAK to you separate event. Event
> > > is really an helper for notifier i had a set of patch for nouveau to leverage
> > > this i need to resucite them. So no need to split thing, i would just forward
> > > the event ie add event to mmu_range_notifier_ops.invalidate() i failed to catch
> > > that in v1 sorry.
> > 
> > I think what you mean is already done?
> > 
> > struct mmu_range_notifier_ops {
> > 	bool (*invalidate)(struct mmu_range_notifier *mrn,
> > 			   const struct mmu_notifier_range *range,
> > 			   unsigned long cur_seq);
> 
> Yes it is sorry, i got confuse with mmu_range_notifier and mmu_notifier_range :)
> It is almost a palyndrome structure ;)

Lets change the name then, this is clearly not working. I'll reflow
everything tomorrow

Jason
Jerome Glisse Nov. 8, 2019, 2 a.m. UTC | #10
On Fri, Nov 08, 2019 at 12:32:25AM +0000, Jason Gunthorpe wrote:
> On Thu, Nov 07, 2019 at 04:04:08PM -0500, Jerome Glisse wrote:
> > On Thu, Nov 07, 2019 at 08:11:06PM +0000, Jason Gunthorpe wrote:
> > > On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:
> > > 
> > > > > 
> > > > > Extra credit: IMHO, this clearly deserves to all be in a new mmu_range_notifier.h
> > > > > header file, but I know that's extra work. Maybe later as a follow-up patch,
> > > > > if anyone has the time.
> > > > 
> > > > The range notifier should get the event too, it would be a waste, i think it is
> > > > an oversight here. The release event is fine so NAK to you separate event. Event
> > > > is really an helper for notifier i had a set of patch for nouveau to leverage
> > > > this i need to resucite them. So no need to split thing, i would just forward
> > > > the event ie add event to mmu_range_notifier_ops.invalidate() i failed to catch
> > > > that in v1 sorry.
> > > 
> > > I think what you mean is already done?
> > > 
> > > struct mmu_range_notifier_ops {
> > > 	bool (*invalidate)(struct mmu_range_notifier *mrn,
> > > 			   const struct mmu_notifier_range *range,
> > > 			   unsigned long cur_seq);
> > 
> > Yes it is sorry, i got confuse with mmu_range_notifier and mmu_notifier_range :)
> > It is almost a palyndrome structure ;)
> 
> Lets change the name then, this is clearly not working. I'll reflow
> everything tomorrow

Semantic patch to do that run from your linux kernel directory with your patch
applied (you can run it one patch after the other and the git commit -a --fixup HEAD)

spatch --sp-file name-of-the-file-below --dir . --all-includes --in-place

%< ------------------------------------------------------------------
@@
@@
struct
-mmu_range_notifier
+mmu_interval_notifier

@@
@@
struct
-mmu_range_notifier
+mmu_interval_notifier
{...};

// Change mrn name to mmu_in
@@
struct mmu_interval_notifier *mrn;
@@
-mrn
+mmu_in

@@
identifier fn;
@@
fn(..., 
-struct mmu_interval_notifier *mrn,
+struct mmu_interval_notifier *mmu_in,
...) {...}
------------------------------------------------------------------ >%

You need coccinelle (which provides spatch). It is untested but it should work
also i could not come up with a nice name to update mrn as min is way too
confusing. If you have better name feel free to use it.

Oh and coccinelle is pretty clever about code formating so it should do a good
jobs at keeping things nicely formated and align.

Cheers,
Jérôme
Jerome Glisse Nov. 8, 2019, 1:43 p.m. UTC | #11
On Thu, Nov 07, 2019 at 10:33:02PM -0800, Christoph Hellwig wrote:
> On Thu, Nov 07, 2019 at 08:06:08PM +0000, Jason Gunthorpe wrote:
> > > 
> > > enum mmu_range_notifier_event {
> > > 	MMU_NOTIFY_RELEASE,
> > > };
> > > 
> > > ...assuming that we stay with "mmu_range_notifier" as a core name for this 
> > > whole thing.
> > > 
> > > Also, it is best moved down to be next to the new MNR structs, so that all the
> > > MNR stuff is in one group.
> > 
> > I agree with Jerome, this enum is part of the 'struct
> > mmu_notifier_range' (ie the description of the invalidation) and it
> > doesn't really matter that only these new notifiers can be called with
> > this type, it is still part of the mmu_notifier_range.
> > 
> > The comment already says it only applies to the mmu_range_notifier
> > scheme..
> 
> In fact the enum is entirely unused.  We might as well just kill it off
> entirely.

I had patches to use it, i need to re-post them. I posted them long ago
and i droped the ball. I will re-spin after this.

Cheers,
Jérôme
diff mbox series

Patch

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 12bd603d318ce7..51b92ba013ddce 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -6,10 +6,12 @@ 
 #include <linux/spinlock.h>
 #include <linux/mm_types.h>
 #include <linux/srcu.h>
+#include <linux/interval_tree.h>
 
 struct mmu_notifier_mm;
 struct mmu_notifier;
 struct mmu_notifier_range;
+struct mmu_range_notifier;
 
 /**
  * enum mmu_notifier_event - reason for the mmu notifier callback
@@ -32,6 +34,9 @@  struct mmu_notifier_range;
  * access flags). User should soft dirty the page in the end callback to make
  * sure that anyone relying on soft dirtyness catch pages that might be written
  * through non CPU mappings.
+ *
+ * @MMU_NOTIFY_RELEASE: used during mmu_range_notifier invalidate to signal that
+ * the mm refcount is zero and the range is no longer accessible.
  */
 enum mmu_notifier_event {
 	MMU_NOTIFY_UNMAP = 0,
@@ -39,6 +44,7 @@  enum mmu_notifier_event {
 	MMU_NOTIFY_PROTECTION_VMA,
 	MMU_NOTIFY_PROTECTION_PAGE,
 	MMU_NOTIFY_SOFT_DIRTY,
+	MMU_NOTIFY_RELEASE,
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
@@ -222,6 +228,26 @@  struct mmu_notifier {
 	unsigned int users;
 };
 
+/**
+ * struct mmu_range_notifier_ops
+ * @invalidate: Upon return the caller must stop using any SPTEs within this
+ *              range, this function can sleep. Return false if blocking was
+ *              required but range is non-blocking
+ */
+struct mmu_range_notifier_ops {
+	bool (*invalidate)(struct mmu_range_notifier *mrn,
+			   const struct mmu_notifier_range *range,
+			   unsigned long cur_seq);
+};
+
+struct mmu_range_notifier {
+	struct interval_tree_node interval_tree;
+	const struct mmu_range_notifier_ops *ops;
+	struct hlist_node deferred_item;
+	unsigned long invalidate_seq;
+	struct mm_struct *mm;
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 #ifdef CONFIG_LOCKDEP
@@ -263,6 +289,78 @@  extern int __mmu_notifier_register(struct mmu_notifier *mn,
 				   struct mm_struct *mm);
 extern void mmu_notifier_unregister(struct mmu_notifier *mn,
 				    struct mm_struct *mm);
+
+unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn);
+int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
+			      unsigned long start, unsigned long length,
+			      struct mm_struct *mm);
+int mmu_range_notifier_insert_locked(struct mmu_range_notifier *mrn,
+				     unsigned long start, unsigned long length,
+				     struct mm_struct *mm);
+void mmu_range_notifier_remove(struct mmu_range_notifier *mrn);
+
+/**
+ * mmu_range_set_seq - Save the invalidation sequence
+ * @mrn - The mrn passed to invalidate
+ * @cur_seq - The cur_seq passed to invalidate
+ *
+ * This must be called unconditionally from the invalidate callback of a
+ * struct mmu_range_notifier_ops under the same lock that is used to call
+ * mmu_range_read_retry(). It updates the sequence number for later use by
+ * mmu_range_read_retry().
+ *
+ * If the user does not call mmu_range_read_begin() or mmu_range_read_retry()
+ * then this call is not required.
+ */
+static inline void mmu_range_set_seq(struct mmu_range_notifier *mrn,
+				     unsigned long cur_seq)
+{
+	WRITE_ONCE(mrn->invalidate_seq, cur_seq);
+}
+
+/**
+ * mmu_range_read_retry - End a read side critical section against a VA range
+ * mrn: The range under lock
+ * seq: The return of the paired mmu_range_read_begin()
+ *
+ * This MUST be called under a user provided lock that is also held
+ * unconditionally by op->invalidate() when it calls mmu_range_set_seq().
+ *
+ * Each call should be paired with a single mmu_range_read_begin() and
+ * should be used to conclude the read side.
+ *
+ * Returns true if an invalidation collided with this critical section, and
+ * the caller should retry.
+ */
+static inline bool mmu_range_read_retry(struct mmu_range_notifier *mrn,
+					unsigned long seq)
+{
+	return mrn->invalidate_seq != seq;
+}
+
+/**
+ * mmu_range_check_retry - Test if a collision has occurred
+ * mrn: The range under lock
+ * seq: The return of the matching mmu_range_read_begin()
+ *
+ * This can be used in the critical section between mmu_range_read_begin() and
+ * mmu_range_read_retry().  A return of true indicates an invalidation has
+ * collided with this lock and a future mmu_range_read_retry() will return
+ * true.
+ *
+ * False is not reliable and only suggests a collision has not happened. It
+ * can be called many times and does not have to hold the user provided lock.
+ *
+ * This call can be used as part of loops and other expensive operations to
+ * expedite a retry.
+ */
+static inline bool mmu_range_check_retry(struct mmu_range_notifier *mrn,
+					 unsigned long seq)
+{
+	/* Pairs with the WRITE_ONCE in mmu_range_set_seq() */
+	return READ_ONCE(mrn->invalidate_seq) != seq;
+}
+
 extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a7eb510a..d0b5046d9aeffd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -284,6 +284,7 @@  config VIRT_TO_BUS
 config MMU_NOTIFIER
 	bool
 	select SRCU
+	select INTERVAL_TREE
 
 config KSM
 	bool "Enable KSM for page merging"
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 367670cfd02b7b..d02d3c8c223eb7 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -12,6 +12,7 @@ 
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/err.h>
+#include <linux/interval_tree.h>
 #include <linux/srcu.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
@@ -36,10 +37,243 @@  struct lockdep_map __mmu_notifier_invalidate_range_start_map = {
 struct mmu_notifier_mm {
 	/* all mmu notifiers registered in this mm are queued in this list */
 	struct hlist_head list;
+	bool has_interval;
 	/* to serialize the list modifications and hlist_unhashed */
 	spinlock_t lock;
+	unsigned long invalidate_seq;
+	unsigned long active_invalidate_ranges;
+	struct rb_root_cached itree;
+	wait_queue_head_t wq;
+	struct hlist_head deferred_list;
 };
 
+/*
+ * This is a collision-retry read-side/write-side 'lock', a lot like a
+ * seqcount, however this allows multiple write-sides to hold it at
+ * once. Conceptually the write side is protecting the values of the PTEs in
+ * this mm, such that PTES cannot be read into SPTEs while any writer exists.
+ *
+ * Note that the core mm creates nested invalidate_range_start()/end() regions
+ * within the same thread, and runs invalidate_range_start()/end() in parallel
+ * on multiple CPUs. This is designed to not reduce concurrency or block
+ * progress on the mm side.
+ *
+ * As a secondary function, holding the full write side also serves to prevent
+ * writers for the itree, this is an optimization to avoid extra locking
+ * during invalidate_range_start/end notifiers.
+ *
+ * The write side has two states, fully excluded:
+ *  - mm->active_invalidate_ranges != 0
+ *  - mnn->invalidate_seq & 1 == True
+ *  - some range on the mm_struct is being invalidated
+ *  - the itree is not allowed to change
+ *
+ * And partially excluded:
+ *  - mm->active_invalidate_ranges != 0
+ *  - some range on the mm_struct is being invalidated
+ *  - the itree is allowed to change
+ *
+ * The later state avoids some expensive work on inv_end in the common case of
+ * no mrn monitoring the VA.
+ */
+static bool mn_itree_is_invalidating(struct mmu_notifier_mm *mmn_mm)
+{
+	lockdep_assert_held(&mmn_mm->lock);
+	return mmn_mm->invalidate_seq & 1;
+}
+
+static struct mmu_range_notifier *
+mn_itree_inv_start_range(struct mmu_notifier_mm *mmn_mm,
+			 const struct mmu_notifier_range *range,
+			 unsigned long *seq)
+{
+	struct interval_tree_node *node;
+	struct mmu_range_notifier *res = NULL;
+
+	spin_lock(&mmn_mm->lock);
+	mmn_mm->active_invalidate_ranges++;
+	node = interval_tree_iter_first(&mmn_mm->itree, range->start,
+					range->end - 1);
+	if (node) {
+		mmn_mm->invalidate_seq |= 1;
+		res = container_of(node, struct mmu_range_notifier,
+				   interval_tree);
+	}
+
+	*seq = mmn_mm->invalidate_seq;
+	spin_unlock(&mmn_mm->lock);
+	return res;
+}
+
+static struct mmu_range_notifier *
+mn_itree_inv_next(struct mmu_range_notifier *mrn,
+		  const struct mmu_notifier_range *range)
+{
+	struct interval_tree_node *node;
+
+	node = interval_tree_iter_next(&mrn->interval_tree, range->start,
+				       range->end - 1);
+	if (!node)
+		return NULL;
+	return container_of(node, struct mmu_range_notifier, interval_tree);
+}
+
+static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
+{
+	struct mmu_range_notifier *mrn;
+	struct hlist_node *next;
+	bool need_wake = false;
+
+	spin_lock(&mmn_mm->lock);
+	if (--mmn_mm->active_invalidate_ranges ||
+	    !mn_itree_is_invalidating(mmn_mm)) {
+		spin_unlock(&mmn_mm->lock);
+		return;
+	}
+
+	mmn_mm->invalidate_seq++;
+	need_wake = true;
+
+	/*
+	 * The inv_end incorporates a deferred mechanism like
+	 * rtnl_lock(). Adds and removes are queued until the final inv_end
+	 * happens then they are progressed. This arrangement for tree updates
+	 * is used to avoid using a blocking lock during
+	 * invalidate_range_start.
+	 */
+	hlist_for_each_entry_safe(mrn, next, &mmn_mm->deferred_list,
+				  deferred_item) {
+		if (RB_EMPTY_NODE(&mrn->interval_tree.rb))
+			interval_tree_insert(&mrn->interval_tree,
+					     &mmn_mm->itree);
+		else
+			interval_tree_remove(&mrn->interval_tree,
+					     &mmn_mm->itree);
+		hlist_del(&mrn->deferred_item);
+	}
+	spin_unlock(&mmn_mm->lock);
+
+	/*
+	 * TODO: Since we already have a spinlock above, this would be faster
+	 * as wake_up_q
+	 */
+	if (need_wake)
+		wake_up_all(&mmn_mm->wq);
+}
+
+/**
+ * mmu_range_read_begin - Begin a read side critical section against a VA range
+ * mrn: The range to lock
+ *
+ * mmu_range_read_begin()/mmu_range_read_retry() implement a collision-retry
+ * locking scheme similar to seqcount for the VA range under mrn. If the mm
+ * invokes invalidation during the critical section then
+ * mmu_range_read_retry() will return true.
+ *
+ * This is useful to obtain shadow PTEs where teardown or setup of the SPTEs
+ * require a blocking context.  The critical region formed by this lock can
+ * sleep, and the required 'user_lock' can also be a sleeping lock.
+ *
+ * The caller is required to provide a 'user_lock' to serialize both teardown
+ * and setup.
+ *
+ * The return value should be passed to mmu_range_read_retry().
+ */
+unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn)
+{
+	struct mmu_notifier_mm *mmn_mm = mrn->mm->mmu_notifier_mm;
+	unsigned long seq;
+	bool is_invalidating;
+
+	/*
+	 * If the mrn has a different seq value under the user_lock than we
+	 * started with then it has collided.
+	 *
+	 * If the mrn currently has the same seq value as the mmn_mm seq, then
+	 * it is currently between invalidate_start/end and is colliding.
+	 *
+	 * The locking looks broadly like this:
+	 *   mn_tree_invalidate_start():          mmu_range_read_begin():
+	 *                                         spin_lock
+	 *                                          seq = READ_ONCE(mrn->invalidate_seq);
+	 *                                          seq == mmn_mm->invalidate_seq
+	 *                                         spin_unlock
+	 *    spin_lock
+	 *     seq = ++mmn_mm->invalidate_seq
+	 *    spin_unlock
+	 *     op->invalidate_range():
+	 *       user_lock
+	 *        mmu_range_set_seq()
+	 *         mrn->invalidate_seq = seq
+	 *       user_unlock
+	 *
+	 *                          [Required: mmu_range_read_retry() == true]
+	 *
+	 *   mn_itree_inv_end():
+	 *    spin_lock
+	 *     seq = ++mmn_mm->invalidate_seq
+	 *    spin_unlock
+	 *
+	 *                                        user_lock
+	 *                                         mmu_range_read_retry():
+	 *                                          mrn->invalidate_seq != seq
+	 *                                        user_unlock
+	 *
+	 * Barriers are not needed here as any races here are closed by an
+	 * eventual mmu_range_read_retry(), which provides a barrier via the
+	 * user_lock.
+	 */
+	spin_lock(&mmn_mm->lock);
+	/* Pairs with the WRITE_ONCE in mmu_range_set_seq() */
+	seq = READ_ONCE(mrn->invalidate_seq);
+	is_invalidating = seq == mmn_mm->invalidate_seq;
+	spin_unlock(&mmn_mm->lock);
+
+	/*
+	 * mrn->invalidate_seq is always set to an odd value. This ensures
+	 * that if seq does wrap we will always clear the below sleep in some
+	 * reasonable time as mmn_mm->invalidate_seq is even in the idle
+	 * state.
+	 */
+	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+	if (is_invalidating)
+		wait_event(mmn_mm->wq,
+			   READ_ONCE(mmn_mm->invalidate_seq) != seq);
+
+	/*
+	 * Notice that mmu_range_read_retry() can already be true at this
+	 * point, avoiding loops here allows the user of this lock to provide
+	 * a global time bound.
+	 */
+
+	return seq;
+}
+EXPORT_SYMBOL_GPL(mmu_range_read_begin);
+
+static void mn_itree_release(struct mmu_notifier_mm *mmn_mm,
+			     struct mm_struct *mm)
+{
+	struct mmu_notifier_range range = {
+		.flags = MMU_NOTIFIER_RANGE_BLOCKABLE,
+		.event = MMU_NOTIFY_RELEASE,
+		.mm = mm,
+		.start = 0,
+		.end = ULONG_MAX,
+	};
+	struct mmu_range_notifier *mrn;
+	unsigned long cur_seq;
+	bool ret;
+
+	for (mrn = mn_itree_inv_start_range(mmn_mm, &range, &cur_seq); mrn;
+	     mrn = mn_itree_inv_next(mrn, &range)) {
+		ret = mrn->ops->invalidate(mrn, &range, cur_seq);
+		WARN_ON(!ret);
+	}
+
+	mn_itree_inv_end(mmn_mm);
+}
+
 /*
  * This function can't run concurrently against mmu_notifier_register
  * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
@@ -52,17 +286,24 @@  struct mmu_notifier_mm {
  * can't go away from under us as exit_mmap holds an mm_count pin
  * itself.
  */
-void __mmu_notifier_release(struct mm_struct *mm)
+static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
+			     struct mm_struct *mm)
 {
 	struct mmu_notifier *mn;
 	int id;
 
+	if (mmn_mm->has_interval)
+		mn_itree_release(mmn_mm, mm);
+
+	if (hlist_empty(&mmn_mm->list))
+		return;
+
 	/*
 	 * SRCU here will block mmu_notifier_unregister until
 	 * ->release returns.
 	 */
 	id = srcu_read_lock(&srcu);
-	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
+	hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist)
 		/*
 		 * If ->release runs before mmu_notifier_unregister it must be
 		 * handled, as it's the only way for the driver to flush all
@@ -72,9 +313,9 @@  void __mmu_notifier_release(struct mm_struct *mm)
 		if (mn->ops->release)
 			mn->ops->release(mn, mm);
 
-	spin_lock(&mm->mmu_notifier_mm->lock);
-	while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
-		mn = hlist_entry(mm->mmu_notifier_mm->list.first,
+	spin_lock(&mmn_mm->lock);
+	while (unlikely(!hlist_empty(&mmn_mm->list))) {
+		mn = hlist_entry(mmn_mm->list.first,
 				 struct mmu_notifier,
 				 hlist);
 		/*
@@ -85,7 +326,7 @@  void __mmu_notifier_release(struct mm_struct *mm)
 		 */
 		hlist_del_init_rcu(&mn->hlist);
 	}
-	spin_unlock(&mm->mmu_notifier_mm->lock);
+	spin_unlock(&mmn_mm->lock);
 	srcu_read_unlock(&srcu, id);
 
 	/*
@@ -100,6 +341,17 @@  void __mmu_notifier_release(struct mm_struct *mm)
 	synchronize_srcu(&srcu);
 }
 
+void __mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
+
+	if (mmn_mm->has_interval)
+		mn_itree_release(mmn_mm, mm);
+
+	if (!hlist_empty(&mmn_mm->list))
+		mn_hlist_release(mmn_mm, mm);
+}
+
 /*
  * If no young bitflag is supported by the hardware, ->clear_flush_young can
  * unmap the address and return 1 or 0 depending if the mapping previously
@@ -172,14 +424,43 @@  void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	srcu_read_unlock(&srcu, id);
 }
 
-int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
+static int mn_itree_invalidate(struct mmu_notifier_mm *mmn_mm,
+				     const struct mmu_notifier_range *range)
+{
+	struct mmu_range_notifier *mrn;
+	unsigned long cur_seq;
+
+	for (mrn = mn_itree_inv_start_range(mmn_mm, range, &cur_seq); mrn;
+	     mrn = mn_itree_inv_next(mrn, range)) {
+		bool ret;
+
+		ret = mrn->ops->invalidate(mrn, range, cur_seq);
+		if (!ret) {
+			if (WARN_ON(mmu_notifier_range_blockable(range)))
+				continue;
+			goto out_would_block;
+		}
+	}
+	return 0;
+
+out_would_block:
+	/*
+	 * On -EAGAIN the non-blocking caller is not allowed to call
+	 * invalidate_range_end()
+	 */
+	mn_itree_inv_end(mmn_mm);
+	return -EAGAIN;
+}
+
+static int mn_hlist_invalidate_range_start(struct mmu_notifier_mm *mmn_mm,
+					   struct mmu_notifier_range *range)
 {
 	struct mmu_notifier *mn;
 	int ret = 0;
 	int id;
 
 	id = srcu_read_lock(&srcu);
-	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
+	hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start) {
 			int _ret;
 
@@ -203,15 +484,30 @@  int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 	return ret;
 }
 
-void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
-					 bool only_end)
+int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
+{
+	struct mmu_notifier_mm *mmn_mm = range->mm->mmu_notifier_mm;
+	int ret = 0;
+
+	if (mmn_mm->has_interval) {
+		ret = mn_itree_invalidate(mmn_mm, range);
+		if (ret)
+			return ret;
+	}
+	if (!hlist_empty(&mmn_mm->list))
+		return mn_hlist_invalidate_range_start(mmn_mm, range);
+	return 0;
+}
+
+static void mn_hlist_invalidate_end(struct mmu_notifier_mm *mmn_mm,
+				    struct mmu_notifier_range *range,
+				    bool only_end)
 {
 	struct mmu_notifier *mn;
 	int id;
 
-	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	id = srcu_read_lock(&srcu);
-	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
+	hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist) {
 		/*
 		 * Call invalidate_range here too to avoid the need for the
 		 * subsystem of having to register an invalidate_range_end
@@ -238,6 +534,19 @@  void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
 		}
 	}
 	srcu_read_unlock(&srcu, id);
+}
+
+void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
+					 bool only_end)
+{
+	struct mmu_notifier_mm *mmn_mm = range->mm->mmu_notifier_mm;
+
+	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+	if (mmn_mm->has_interval)
+		mn_itree_inv_end(mmn_mm);
+
+	if (!hlist_empty(&mmn_mm->list))
+		mn_hlist_invalidate_end(mmn_mm, range, only_end);
 	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 }
 
@@ -256,8 +565,9 @@  void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 }
 
 /*
- * Same as mmu_notifier_register but here the caller must hold the
- * mmap_sem in write mode.
+ * Same as mmu_notifier_register but here the caller must hold the mmap_sem in
+ * write mode. A NULL mn signals the notifier is being registered for itree
+ * mode.
  */
 int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 {
@@ -274,9 +584,6 @@  int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 		fs_reclaim_release(GFP_KERNEL);
 	}
 
-	mn->mm = mm;
-	mn->users = 1;
-
 	if (!mm->mmu_notifier_mm) {
 		/*
 		 * kmalloc cannot be called under mm_take_all_locks(), but we
@@ -284,21 +591,22 @@  int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 		 * the write side of the mmap_sem.
 		 */
 		mmu_notifier_mm =
-			kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
+			kzalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
 		if (!mmu_notifier_mm)
 			return -ENOMEM;
 
 		INIT_HLIST_HEAD(&mmu_notifier_mm->list);
 		spin_lock_init(&mmu_notifier_mm->lock);
+		mmu_notifier_mm->invalidate_seq = 2;
+		mmu_notifier_mm->itree = RB_ROOT_CACHED;
+		init_waitqueue_head(&mmu_notifier_mm->wq);
+		INIT_HLIST_HEAD(&mmu_notifier_mm->deferred_list);
 	}
 
 	ret = mm_take_all_locks(mm);
 	if (unlikely(ret))
 		goto out_clean;
 
-	/* Pairs with the mmdrop in mmu_notifier_unregister_* */
-	mmgrab(mm);
-
 	/*
 	 * Serialize the update against mmu_notifier_unregister. A
 	 * side note: mmu_notifier_release can't run concurrently with
@@ -306,13 +614,28 @@  int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 	 * current->mm or explicitly with get_task_mm() or similar).
 	 * We can't race against any other mmu notifier method either
 	 * thanks to mm_take_all_locks().
+	 *
+	 * release semantics on the initialization of the mmu_notifier_mm's
+         * contents are provided for unlocked readers.  acquire can only be
+         * used while holding the mmgrab or mmget, and is safe because once
+         * created the mmu_notififer_mm is not freed until the mm is
+         * destroyed.  As above, users holding the mmap_sem or one of the
+         * mm_take_all_locks() do not need to use acquire semantics.
 	 */
 	if (mmu_notifier_mm)
-		mm->mmu_notifier_mm = mmu_notifier_mm;
+		smp_store_release(&mm->mmu_notifier_mm, mmu_notifier_mm);
 
-	spin_lock(&mm->mmu_notifier_mm->lock);
-	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);
-	spin_unlock(&mm->mmu_notifier_mm->lock);
+	if (mn) {
+		/* Pairs with the mmdrop in mmu_notifier_unregister_* */
+		mmgrab(mm);
+		mn->mm = mm;
+		mn->users = 1;
+
+		spin_lock(&mm->mmu_notifier_mm->lock);
+		hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);
+		spin_unlock(&mm->mmu_notifier_mm->lock);
+	} else
+		mm->mmu_notifier_mm->has_interval = true;
 
 	mm_drop_all_locks(mm);
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
@@ -529,6 +852,166 @@  void mmu_notifier_put(struct mmu_notifier *mn)
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_put);
 
+static int __mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
+				       unsigned long start,
+				       unsigned long length,
+				       struct mmu_notifier_mm *mmn_mm,
+				       struct mm_struct *mm)
+{
+	mrn->mm = mm;
+	RB_CLEAR_NODE(&mrn->interval_tree.rb);
+	mrn->interval_tree.start = start;
+	/*
+	 * Note that the representation of the intervals in the interval tree
+	 * considers the ending point as contained in the interval.
+	 */
+	if (length == 0 ||
+	    check_add_overflow(start, length - 1, &mrn->interval_tree.last))
+		return -EOVERFLOW;
+
+	/* pairs with mmdrop in mmu_range_notifier_remove() */
+	mmgrab(mm);
+
+	/*
+	 * If some invalidate_range_start/end region is going on in parallel
+	 * we don't know what VA ranges are affected, so we must assume this
+	 * new range is included.
+	 *
+	 * If the itree is invalidating then we are not allowed to change
+	 * it. Retrying until invalidation is done is tricky due to the
+	 * possibility for live lock, instead defer the add to the unlock so
+	 * this algorithm is deterministic.
+	 *
+	 * In all cases the value for the mrn->mr_invalidate_seq should be
+	 * odd, see mmu_range_read_begin()
+	 */
+	spin_lock(&mmn_mm->lock);
+	if (mmn_mm->active_invalidate_ranges) {
+		if (mn_itree_is_invalidating(mmn_mm))
+			hlist_add_head(&mrn->deferred_item,
+				       &mmn_mm->deferred_list);
+		else {
+			mmn_mm->invalidate_seq |= 1;
+			interval_tree_insert(&mrn->interval_tree,
+					     &mmn_mm->itree);
+		}
+		mrn->invalidate_seq = mmn_mm->invalidate_seq;
+	} else {
+		WARN_ON(mn_itree_is_invalidating(mmn_mm));
+		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
+		interval_tree_insert(&mrn->interval_tree, &mmn_mm->itree);
+	}
+	spin_unlock(&mmn_mm->lock);
+	return 0;
+}
+
+/**
+ * mmu_range_notifier_insert - Insert a range notifier
+ * @mrn: Range notifier to register
+ * @start: Starting virtual address to monitor
+ * @length: Length of the range to monitor
+ * @mm : mm_struct to attach to
+ *
+ * This function subscribes the range notifier for notifications from the mm.
+ * Upon return the ops related to mmu_range_notifier will be called whenever
+ * an event that intersects with the given range occurs.
+ *
+ * Upon return the range_notifier may not be present in the interval tree yet.
+ * The caller must use the normal range notifier locking flow via
+ * mmu_range_read_begin() to establish SPTEs for this range.
+ */
+int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
+			      unsigned long start, unsigned long length,
+			      struct mm_struct *mm)
+{
+	struct mmu_notifier_mm *mmn_mm;
+	int ret;
+
+	might_lock(&mm->mmap_sem);
+
+	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
+	if (!mmn_mm || !mmn_mm->has_interval) {
+		ret = mmu_notifier_register(NULL, mm);
+		if (ret)
+			return ret;
+		mmn_mm = mm->mmu_notifier_mm;
+	}
+	return __mmu_range_notifier_insert(mrn, start, length, mmn_mm, mm);
+}
+EXPORT_SYMBOL_GPL(mmu_range_notifier_insert);
+
+int mmu_range_notifier_insert_locked(struct mmu_range_notifier *mrn,
+				     unsigned long start, unsigned long length,
+				     struct mm_struct *mm)
+{
+	struct mmu_notifier_mm *mmn_mm;
+	int ret;
+
+	lockdep_assert_held_write(&mm->mmap_sem);
+
+	mmn_mm = mm->mmu_notifier_mm;
+	if (!mmn_mm || !mmn_mm->has_interval) {
+		ret = __mmu_notifier_register(NULL, mm);
+		if (ret)
+			return ret;
+		mmn_mm = mm->mmu_notifier_mm;
+	}
+	return __mmu_range_notifier_insert(mrn, start, length, mmn_mm, mm);
+}
+EXPORT_SYMBOL_GPL(mmu_range_notifier_insert_locked);
+
+/**
+ * mmu_range_notifier_remove - Remove a range notifier
+ * @mrn: Range notifier to unregister
+ *
+ * This function must be paired with mmu_range_notifier_insert(). It cannot be
+ * called from any ops callback.
+ *
+ * Once this returns ops callbacks are no longer running on other CPUs and
+ * will not be called in future.
+ */
+void mmu_range_notifier_remove(struct mmu_range_notifier *mrn)
+{
+	struct mm_struct *mm = mrn->mm;
+	struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
+	unsigned long seq = 0;
+
+	might_sleep();
+
+	spin_lock(&mmn_mm->lock);
+	if (mn_itree_is_invalidating(mmn_mm)) {
+		/*
+		 * remove is being called after insert put this on the
+		 * deferred list, but before the deferred list was processed.
+		 */
+		if (RB_EMPTY_NODE(&mrn->interval_tree.rb)) {
+			hlist_del(&mrn->deferred_item);
+		} else {
+			hlist_add_head(&mrn->deferred_item,
+				       &mmn_mm->deferred_list);
+			seq = mmn_mm->invalidate_seq;
+		}
+	} else {
+		WARN_ON(RB_EMPTY_NODE(&mrn->interval_tree.rb));
+		interval_tree_remove(&mrn->interval_tree, &mmn_mm->itree);
+	}
+	spin_unlock(&mmn_mm->lock);
+
+	/*
+	 * The possible sleep on progress in the invalidation requires the
+	 * caller not hold any locks held by invalidation callbacks.
+	 */
+	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+	if (seq)
+		wait_event(mmn_mm->wq,
+			   READ_ONCE(mmn_mm->invalidate_seq) != seq);
+
+	/* pairs with mmgrab in mmu_range_notifier_insert() */
+	mmdrop(mm);
+}
+EXPORT_SYMBOL_GPL(mmu_range_notifier_remove);
+
 /**
  * mmu_notifier_synchronize - Ensure all mmu_notifiers are freed
  *