
[v4,01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()

Message ID 20200224182401.353359-2-jean-philippe@linaro.org (mailing list archive)
State New, archived
Series iommu: Shared Virtual Addressing and SMMUv3 support

Commit Message

Jean-Philippe Brucker Feb. 24, 2020, 6:23 p.m. UTC
The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers:
add a get/put scheme for the registration") provides a convenient way
for users to attach notifier data to an mm. However, it would be even
better to create this notifier data atomically.

Since the alloc_notifier() callback only takes an mm argument at the
moment, some users have to perform the allocation in two steps.
alloc_notifier() initially creates an incomplete structure, which is
then finalized using more context once mmu_notifier_get() returns. This
second step requires carrying an initialization lock in the notifier
data and playing dirty tricks to order memory accesses against live
invalidation.

To simplify MMU notifier allocation, pass an allocation context to
mmu_notifier_get().

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/misc/sgi-gru/grutlbpurge.c |  4 ++--
 include/linux/mmu_notifier.h       | 10 ++++++----
 mm/mmu_notifier.c                  |  6 ++++--
 3 files changed, 12 insertions(+), 8 deletions(-)

Comments

Jason Gunthorpe Feb. 24, 2020, 7 p.m. UTC | #1
On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers:
> add a get/put scheme for the registration") provides a convenient way
> for users to attach notifier data to an mm. However, it would be even
> better to create this notifier data atomically.
> 
> Since the alloc_notifier() callback only takes an mm argument at the
> moment, some users have to perform the allocation in two steps.
> alloc_notifier() initially creates an incomplete structure, which is
> then finalized using more context once mmu_notifier_get() returns. This
> second step requires carrying an initialization lock in the notifier
> data and playing dirty tricks to order memory accesses against live
> invalidation.

This was the intended pattern. There shouldn't be a real issue as
there shouldn't be any data on which to invalidate, ie the later patch
does:

+       list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)

And that list is empty post-allocation, so no 'dirty tricks' required.

The other op callback is release, which also cannot be called as the
caller must hold a mmget to establish the notifier.

So just use the locking that already exists. There is one function
that calls io_mm_get() which immediately calls io_mm_attach, which
immediately grabs the global iommu_sva_lock.

Thus init the pasid for the first time under that lock and everything
is fine.
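
Roughly (the io_mm/bond structures and field names here are only
illustrative, not the actual series code):

        mutex_lock(&iommu_sva_lock);
        if (!io_mm->ctx) {                      /* first attach for this mm */
                io_mm->ctx = ctx;               /* finish the init that would */
                io_mm->ops = ops;               /* otherwise need privdata    */
        }
        list_add_rcu(&bond->mm_head, &io_mm->devices);
        mutex_unlock(&iommu_sva_lock);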

There is nothing inherently wrong with the approach in this patch, but
it seems unneeded in this case..

Jason
Jean-Philippe Brucker Feb. 25, 2020, 9:24 a.m. UTC | #2
On Mon, Feb 24, 2020 at 03:00:56PM -0400, Jason Gunthorpe wrote:
> On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> > The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers:
> > add a get/put scheme for the registration") provides a convenient way
> > for users to attach notifier data to an mm. However, it would be even
> > better to create this notifier data atomically.
> > 
> > Since the alloc_notifier() callback only takes an mm argument at the
> > moment, some users have to perform the allocation in two steps.
> > alloc_notifier() initially creates an incomplete structure, which is
> > then finalized using more context once mmu_notifier_get() returns. This
> > second step requires carrying an initialization lock in the notifier
> > data and playing dirty tricks to order memory accesses against live
> > invalidation.
> 
> This was the intended pattern. There shouldn't be a real issue as
> there shouldn't be any data on which to invalidate, ie the later patch
> does:
> 
> +       list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)
> 
> And that list is empty post-allocation, so no 'dirty tricks' required.

Before introducing this patch I had the following code:

+	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
+		/*
+		 * To ensure that we observe the initialization of io_mm fields
+		 * by io_mm_finalize() before the registration of this bond to
+		 * the list by io_mm_attach(), introduce an address dependency
+		 * between bond and io_mm. It pairs with the smp_store_release()
+		 * from list_add_rcu().
+		 */
+		io_mm = rcu_dereference(bond->io_mm);
+		io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx,
+				       start, end - start);
+	}

(1) io_mm_get() would obtain an empty io_mm from mmu_notifier_get().
(2) then io_mm_finalize() would initialize io_mm->ops, io_mm->ctx, etc.
(3) finally io_mm_attach() would add the bond to io_mm->devices.

Since the above code can run before (2) it needs to observe valid
io_mm->ctx, io_mm->ops initialized by (2) after obtaining the bond
initialized by (3). Which I believe requires the address dependency from
the rcu_dereference() above or some stronger barrier to pair with the
list_add_rcu(). If io_mm->ctx and io_mm->ops are already valid before the
mmu notifier is published, then we don't need that stuff.

That's the main reason I would have liked moving everything to
alloc_notifier(); the locking below isn't a big deal.
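
For comparison, with the privdata argument the allocation callback can do
all of this before the notifier is published, along these lines (just a
sketch, the io_mm fields and init structure are illustrative):

        static struct mmu_notifier *io_mm_alloc(struct mm_struct *mm,
                                                void *privdata)
        {
                struct io_mm_init *init = privdata;     /* hypothetical */
                struct io_mm *io_mm;

                io_mm = kzalloc(sizeof(*io_mm), GFP_KERNEL);
                if (!io_mm)
                        return ERR_PTR(-ENOMEM);

                INIT_LIST_HEAD(&io_mm->devices);
                io_mm->ops = init->ops;         /* valid before publication */
                io_mm->ctx = init->ctx;
                return &io_mm->notifier;        /* core publishes it after this */
        }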

> The other op callback is release, which also cannot be called as the
> caller must hold a mmget to establish the notifier.
> 
> So just use the locking that already exists. There is one function
> that calls io_mm_get() which immediately calls io_mm_attach, which
> immediately grabs the global iommu_sva_lock.
> 
> Thus init the pasid for the first time under that lock and everything
> is fine.

I agree with this, can't remember why I used a separate lock for
initialization rather than reusing iommu_sva_lock.

Thanks,
Jean

> 
> There is nothing inherently wrong with the approach in this patch, but
> it seems unneeded in this case..
> 
> Jason
Jason Gunthorpe Feb. 25, 2020, 2:08 p.m. UTC | #3
On Tue, Feb 25, 2020 at 10:24:39AM +0100, Jean-Philippe Brucker wrote:
> On Mon, Feb 24, 2020 at 03:00:56PM -0400, Jason Gunthorpe wrote:
> > On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> > > The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers:
> > > add a get/put scheme for the registration") provides a convenient way
> > > for users to attach notifier data to an mm. However, it would be even
> > > better to create this notifier data atomically.
> > > 
> > > Since the alloc_notifier() callback only takes an mm argument at the
> > > moment, some users have to perform the allocation in two steps.
> > > alloc_notifier() initially creates an incomplete structure, which is
> > > then finalized using more context once mmu_notifier_get() returns. This
> > > second step requires carrying an initialization lock in the notifier
> > > data and playing dirty tricks to order memory accesses against live
> > > invalidation.
> > 
> > This was the intended pattern. There shouldn't be a real issue as
> > there shouldn't be any data on which to invalidate, ie the later patch
> > does:
> > 
> > +       list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)
> > 
> > And that list is empty post-allocation, so no 'dirty tricks' required.
> 
> Before introducing this patch I had the following code:
> 
> +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> +		/*
> +		 * To ensure that we observe the initialization of io_mm fields
> +		 * by io_mm_finalize() before the registration of this bond to
> +		 * the list by io_mm_attach(), introduce an address dependency
> +		 * between bond and io_mm. It pairs with the smp_store_release()
> +		 * from list_add_rcu().
> +		 */
> +		io_mm = rcu_dereference(bond->io_mm);

An rcu_dereference isn't needed here, just a normal dereference is fine.

> +		io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx,
> +				       start, end - start);
> +	}
> 
> (1) io_mm_get() would obtain an empty io_mm from mmu_notifier_get().
> (2) then io_mm_finalize() would initialize io_mm->ops, io_mm->ctx, etc.
> (3) finally io_mm_attach() would add the bond to io_mm->devices.
> 
> Since the above code can run before (2) it needs to observe valid
> io_mm->ctx, io_mm->ops initialized by (2) after obtaining the bond
> initialized by (3). Which I believe requires the address dependency from
> the rcu_dereference() above or some stronger barrier to pair with the
> list_add_rcu().

The list_for_each_entry_rcu() is an acquire that already pairs with
the release in list_add_rcu(); all you need is a data dependency chain
starting on bond for the ordering to be correct.

But this is super tricky :\

> If io_mm->ctx and io_mm->ops are already valid before the
> mmu notifier is published, then we don't need that stuff.

So, this trickiness with RCU is not a bad reason to introduce the priv
scheme, maybe explain it in the commit message?

Jason
Jean-Philippe Brucker Feb. 28, 2020, 2:39 p.m. UTC | #4
On Tue, Feb 25, 2020 at 10:08:14AM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 25, 2020 at 10:24:39AM +0100, Jean-Philippe Brucker wrote:
> > On Mon, Feb 24, 2020 at 03:00:56PM -0400, Jason Gunthorpe wrote:
> > > On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> > > > The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers:
> > > > add a get/put scheme for the registration") provides a convenient way
> > > > for users to attach notifier data to an mm. However, it would be even
> > > > better to create this notifier data atomically.
> > > > 
> > > > Since the alloc_notifier() callback only takes an mm argument at the
> > > > moment, some users have to perform the allocation in two steps.
> > > > alloc_notifier() initially creates an incomplete structure, which is
> > > > then finalized using more context once mmu_notifier_get() returns. This
> > > > second step requires carrying an initialization lock in the notifier
> > > > data and playing dirty tricks to order memory accesses against live
> > > > invalidation.
> > > 
> > > This was the intended pattern. There shouldn't be a real issue as
> > > there shouldn't be any data on which to invalidate, ie the later patch
> > > does:
> > > 
> > > +       list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)
> > > 
> > > And that list is empty post-allocation, so no 'dirty tricks' required.
> > 
> > Before introducing this patch I had the following code:
> > 
> > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > +		/*
> > +		 * To ensure that we observe the initialization of io_mm fields
> > +		 * by io_mm_finalize() before the registration of this bond to
> > +		 * the list by io_mm_attach(), introduce an address dependency
> > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > +		 * from list_add_rcu().
> > +		 */
> > +		io_mm = rcu_dereference(bond->io_mm);
> 
> An rcu_dereference isn't needed here, just a normal dereference is fine.

bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
which does bond->io_mm under rcu_read_lock())

> 
> > +		io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx,
> > +				       start, end - start);
> > +	}
> > 
> > (1) io_mm_get() would obtain an empty io_mm from mmu_notifier_get().
> > (2) then io_mm_finalize() would initialize io_mm->ops, io_mm->ctx, etc.
> > (3) finally io_mm_attach() would add the bond to io_mm->devices.
> > 
> > Since the above code can run before (2) it needs to observe valid
> > io_mm->ctx, io_mm->ops initialized by (2) after obtaining the bond
> > initialized by (3). Which I believe requires the address dependency from
> > the rcu_dereference() above or some stronger barrier to pair with the
> > list_add_rcu().
> 
> The list_for_each_entry_rcu() is an acquire that already pairs with
> the release in list_add_rcu(); all you need is a data dependency chain
> starting on bond for the ordering to be correct.
> 
> But this is super tricky :\
> 
> > If io_mm->ctx and io_mm->ops are already valid before the
> > mmu notifier is published, then we don't need that stuff.
> 
> So, this trickiness with RCU is not a bad reason to introduce the priv
> scheme, maybe explain it in the commit message?

Ok, I've added this to the commit message:

    The IOMMU SVA module, which attaches an mm to multiple devices,
    exemplifies this situation. In essence it does:

            mmu_notifier_get()
              alloc_notifier()
                 A = kzalloc()
              /* MMU notifier is published */
            A->ctx = ctx;                           // (1)
            device->A = A;
            list_add_rcu(device, A->devices);       // (2)

    The invalidate notifier, which may start running before A is fully
    initialized at (1), does the following:

            io_mm_invalidate(A)
              list_for_each_entry_rcu(device, A->devices)
                A = device->A;                      // (3)
                device->invalidate(A->ctx)

    To ensure that an invalidate() thread observes write (1) before (2), it
    needs the address dependency (3). The resulting code is subtle and
    difficult to understand. If instead we fully initialize object A before
    publishing the MMU notifier, we don't need the complexity added by (3).


I'll try to improve the wording before sending next version.

Thanks,
Jean
Jason Gunthorpe Feb. 28, 2020, 2:48 p.m. UTC | #5
On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote:
> > > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > > +		/*
> > > +		 * To ensure that we observe the initialization of io_mm fields
> > > +		 * by io_mm_finalize() before the registration of this bond to
> > > +		 * the list by io_mm_attach(), introduce an address dependency
> > > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > > +		 * from list_add_rcu().
> > > +		 */
> > > +		io_mm = rcu_dereference(bond->io_mm);
> > 
> > An rcu_dereference isn't needed here, just a normal dereference is fine.
> 
> bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
> which does bond->io_mm under rcu_read_lock())

I'm surprised the bond->io_mm can change over the lifetime of the
bond memory..

> > > If io_mm->ctx and io_mm->ops are already valid before the
> > > mmu notifier is published, then we don't need that stuff.
> > 
> > So, this trickiness with RCU is not a bad reason to introduce the priv
> > scheme, maybe explain it in the commit message?
> 
> Ok, I've added this to the commit message:
> 
>     The IOMMU SVA module, which attaches an mm to multiple devices,
>     exemplifies this situation. In essence it does:
> 
>             mmu_notifier_get()
>               alloc_notifier()
>                  A = kzalloc()
>               /* MMU notifier is published */
>             A->ctx = ctx;                           // (1)
>             device->A = A;
>             list_add_rcu(device, A->devices);       // (2)
> 
>     The invalidate notifier, which may start running before A is fully
>     initialized at (1), does the following:
> 
>             io_mm_invalidate(A)
>               list_for_each_entry_rcu(device, A->devices)
>                 A = device->A;                      // (3)

I would drop the workaround from the description; it is enough to say
that the line below needs to observe (1) after (2) and this is
trivially achieved by moving (1) to before publishing the notifier so
the core MM locking can be used.

Regards,
Jason
Jean-Philippe Brucker Feb. 28, 2020, 3:04 p.m. UTC | #6
On Fri, Feb 28, 2020 at 10:48:44AM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote:
> > > > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > > > +		/*
> > > > +		 * To ensure that we observe the initialization of io_mm fields
> > > > +		 * by io_mm_finalize() before the registration of this bond to
> > > > +		 * the list by io_mm_attach(), introduce an address dependency
> > > > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > > > +		 * from list_add_rcu().
> > > > +		 */
> > > > +		io_mm = rcu_dereference(bond->io_mm);
> > > 
> > > An rcu_dereference isn't needed here, just a normal dereference is fine.
> > 
> > bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
> > which does bond->io_mm under rcu_read_lock())
> 
> I'm surprised the bond->io_mm can change over the lifetime of the
> bond memory..

The normal lifetime of the bond is between device driver calls to bind()
and unbind(). If the mm exits early, though, we clear bond->io_mm. The
bond is then stale but can only be freed when the device driver releases
it with unbind().

> 
> > > > If io_mm->ctx and io_mm->ops are already valid before the
> > > > mmu notifier is published, then we don't need that stuff.
> > > 
> > > So, this trickiness with RCU is not a bad reason to introduce the priv
> > > scheme, maybe explain it in the commit message?
> > 
> > Ok, I've added this to the commit message:
> > 
> >     The IOMMU SVA module, which attaches an mm to multiple devices,
> >     exemplifies this situation. In essence it does:
> > 
> >             mmu_notifier_get()
> >               alloc_notifier()
> >                  A = kzalloc()
> >               /* MMU notifier is published */
> >             A->ctx = ctx;                           // (1)
> >             device->A = A;
> >             list_add_rcu(device, A->devices);       // (2)
> > 
> >     The invalidate notifier, which may start running before A is fully
> >     initialized at (1), does the following:
> > 
> >             io_mm_invalidate(A)
> >               list_for_each_entry_rcu(device, A->devices)
> >                 A = device->A;                      // (3)
> 
> I would drop the workaround from the description; it is enough to say
> that the line below needs to observe (1) after (2) and this is
> trivially achieved by moving (1) to before publishing the notifier so
> the core MM locking can be used.

Ok, will do

Thanks,
Jean
Jason Gunthorpe Feb. 28, 2020, 3:13 p.m. UTC | #7
On Fri, Feb 28, 2020 at 04:04:27PM +0100, Jean-Philippe Brucker wrote:
> On Fri, Feb 28, 2020 at 10:48:44AM -0400, Jason Gunthorpe wrote:
> > On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote:
> > > > > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > > > > +		/*
> > > > > +		 * To ensure that we observe the initialization of io_mm fields
> > > > > +		 * by io_mm_finalize() before the registration of this bond to
> > > > > +		 * the list by io_mm_attach(), introduce an address dependency
> > > > > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > > > > +		 * from list_add_rcu().
> > > > > +		 */
> > > > > +		io_mm = rcu_dereference(bond->io_mm);
> > > > 
> > > > An rcu_dereference isn't needed here, just a normal dereference is fine.
> > > 
> > > bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
> > > which does bond->io_mm under rcu_read_lock())
> > 
> > I'm surprised the bond->io_mm can change over the lifetime of the
> > bond memory..
> 
> The normal lifetime of the bond is between device driver calls to bind()
> and unbind(). If the mm exits early, though, we clear bond->io_mm. The
> bond is then stale but can only be freed when the device driver releases
> it with unbind().

I usually advocate for simple use of these APIs. The mmu_notifier_get()
should happen in bind() and the matching put should happen in the
call_rcu callback that does the kfree. Then you can never get a stale
pointer. Don't worry about exit_mmap().
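
Something along these lines (a sketch of the pattern only; the bond
layout and helper names are hypothetical):

        static void iommu_bond_free_rcu(struct rcu_head *head)
        {
                struct iommu_bond *bond = container_of(head, struct iommu_bond,
                                                       rcu);

                /* matching put for the mmu_notifier_get() done in bind() */
                mmu_notifier_put(&bond->io_mm->notifier);
                kfree(bond);
        }

        /* unbind(): bond->io_mm stays valid until the bond itself is freed */
        list_del_rcu(&bond->mm_head);
        call_rcu(&bond->rcu, iommu_bond_free_rcu);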

release() is an unusual callback and I see a lot of places using it
wrong. The purpose of release is to invalidate_all, that is it.

Also, confusingly release may be called multiple times in some
situations, so it shouldn't disturb anything that might impact a 2nd
call.

Jason
Christoph Hellwig March 5, 2020, 4:36 p.m. UTC | #8
On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> -static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm)
> +static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm, void *privdata)

Please don't introduce any > 80 char lines. Not here, and not anywhere
else in this patch or the series.
Jean-Philippe Brucker March 6, 2020, 9:56 a.m. UTC | #9
On Fri, Feb 28, 2020 at 11:13:40AM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 28, 2020 at 04:04:27PM +0100, Jean-Philippe Brucker wrote:
> > On Fri, Feb 28, 2020 at 10:48:44AM -0400, Jason Gunthorpe wrote:
> > > On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote:
> > > > > > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > > > > > +		/*
> > > > > > +		 * To ensure that we observe the initialization of io_mm fields
> > > > > > +		 * by io_mm_finalize() before the registration of this bond to
> > > > > > +		 * the list by io_mm_attach(), introduce an address dependency
> > > > > > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > > > > > +		 * from list_add_rcu().
> > > > > > +		 */
> > > > > > +		io_mm = rcu_dereference(bond->io_mm);
> > > > > 
> > > > > An rcu_dereference isn't needed here, just a normal dereference is fine.
> > > > 
> > > > bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
> > > > which does bond->io_mm under rcu_read_lock())
> > > 
> > > I'm surprised the bond->io_mm can change over the lifetime of the
> > > bond memory..
> > 
> > The normal lifetime of the bond is between device driver calls to bind()
> > and unbind(). If the mm exits early, though, we clear bond->io_mm. The
> > bond is then stale but can only be freed when the device driver releases
> > it with unbind().
> 
> I usually advocate for simple use of these APIs. The mmu_notifier_get()
> should happen in bind() and the matching put should happen in the
> call_rcu callback that does the kfree.

I tried to keep it simple like that: normally mmu_notifier_get() is called
in bind(), and mmu_notifier_put() is called in unbind(). 

Multiple device drivers may call bind() with the same mm. Each bind()
calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
(a device<->mm link). Each bond is freed by calling unbind(), which calls
mmu_notifier_put().

That's the most common case. Now if the process is killed and the mm
disappears, we do need to avoid use-after-free caused by DMA of the
mappings and the page tables. So the release() callback, before doing
invalidate_all, stops DMA and clears the page table pointer on the IOMMU
side. It detaches all bonds from the io_mm, calling mmu_notifier_put() for
each of them. After release(), bond objects still exist and device
drivers still need to free them with unbind(), but they don't point to an
io_mm anymore.

> Then you can never get a stale
> pointer. Don't worry about exit_mmap().
> 
> release() is an unusual callback and I see a lot of places using it
> wrong. The purpose of release is to invalidate_all, that is it.
> 
> Also, confusingly release may be called multiple times in some
> situations, so it shouldn't disturb anything that might impact a 2nd
> call.

I hadn't realized that. The current implementation should be safe against
it, as release() is a nop if the io_mm doesn't have bonds anymore. Do you
have an example of such a situation?  I'm trying to write tests for this
kind of corner case.

Thanks,
Jean
Jason Gunthorpe March 6, 2020, 1:09 p.m. UTC | #10
On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> I tried to keep it simple like that: normally mmu_notifier_get() is called
> in bind(), and mmu_notifier_put() is called in unbind(). 
> 
> Multiple device drivers may call bind() with the same mm. Each bind()
> calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> (a device<->mm link). Each bond is freed by calling unbind(), which calls
> mmu_notifier_put().
> 
> That's the most common case. Now if the process is killed and the mm
> disappears, we do need to avoid use-after-free caused by DMA of the
> mappings and the page tables. 

This is why release must do invalidate all - but it doesn't need to do
any more - as no SPTE can be established without a mmget() - and
mmget() is no longer possible past release.

> So the release() callback, before doing invalidate_all, stops DMA
> and clears the page table pointer on the IOMMU side. It detaches all
> bonds from the io_mm, calling mmu_notifier_put() for each of
> them. After release(), bond objects still exist and device drivers
> still need to free them with unbind(), but they don't point to an
> io_mm anymore.

Why is so much work needed in release? It really should just be
invalidate all; usually, trying to sort out all the locking for the
more complicated stuff is not worthwhile.

If other stuff is implicitly relying on the mm being alive and release
to fence against that then it is already racy. If it doesn't, then why
bother doing complicated work in release?

> > Then you can never get a stale
> > pointer. Don't worry about exit_mmap().
> > 
> > release() is an unusual callback and I see a lot of places using it
> > wrong. The purpose of release is to invalidate_all, that is it.
> > 
> > Also, confusingly release may be called multiple times in some
> > situations, so it shouldn't disturb anything that might impact a 2nd
> > call.
> 
> I hadn't realized that. The current implementation should be safe against
> it, as release() is a nop if the io_mm doesn't have bonds anymore. Do you
> have an example of such a situation?  I'm trying to write tests for this
> kind of corner case.

Hmm, let me think. Ah, you have to be using mmu_notifier_unregister()
to get that race. This is one of the things that get/put don't suffer
from - but they conversely don't guarantee that release() will be
called, so it is up to the caller to ensure everything is fenced
before calling put.

Jason
Jean-Philippe Brucker March 6, 2020, 2:35 p.m. UTC | #11
On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > in bind(), and mmu_notifier_put() is called in unbind(). 
> > 
> > Multiple device drivers may call bind() with the same mm. Each bind()
> > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > mmu_notifier_put().
> > 
> > That's the most common case. Now if the process is killed and the mm
> > disappears, we do need to avoid use-after-free caused by DMA of the
> > mappings and the page tables. 
> 
> This is why release must do invalidate all - but it doesn't need to do
> any more - as no SPTE can be established without a mmget() - and
> mmget() is no longer possible past release.

In our case we don't have SPTEs, the whole pgd is shared between MMU and
IOMMU (isolated using PASID tables).

Taking the concrete example of the crypto accelerator:

1. A process opens a queue in the accelerator. That queue is bound to the
   address space: a PASID is allocated for the mm, and mm->pgd is written
   into the IOMMU PASID table.
2. The process queues some work and waits. In the background, the
   accelerators performs DMA on the process address space, by using the
   mm's PASID.
3. Now the process gets killed, and release() is called.

At this point no one told the device to stop working on this queue, it may
still be doing DMA on this address space. So the first thing we do is
notify the device driver that the bond is going away, and that it must
stop the queue and flush remaining DMA transactions for this PASID.

Then we also clear the pgd from the IOMMU PASID table. If we only did
invalidate-all and somehow the queue wasn't properly stopped, concurrent
DMA would immediately form new IOTLB entries since the page tables haven't
been wiped at this point. And later, it would use-after-free page tables
and mappings. Whereas with a clear pgd it would just generate IOMMU fault
events, which are undesirable but harmless.

Thanks,
Jean

> > So the release() callback, before doing invalidate_all, stops DMA
> > and clears the page table pointer on the IOMMU side. It detaches all
> > bonds from the io_mm, calling mmu_notifier_put() for each of
> > them. After release(), bond objects still exist and device drivers
> > still need to free them with unbind(), but they don't point to an
> > io_mm anymore.
> 
> Why is so much work needed in release? It really should just be
> invalidate all; usually, trying to sort out all the locking for the
> more complicated stuff is not worthwhile.
> 
> If other stuff is implicitly relying on the mm being alive and release
> to fence against that then it is already racy. If it doesn't, then why
> bother doing complicated work in release?
> 
> > > Then you can never get a stale
> > > pointer. Don't worry about exit_mmap().
> > > 
> > > release() is an unusual callback and I see a lot of places using it
> > > wrong. The purpose of release is to invalidate_all, that is it.
> > > 
> > > Also, confusingly release may be called multiple times in some
> > > situations, so it shouldn't disturb anything that might impact a 2nd
> > > call.
> > 
> > I hadn't realized that. The current implementation should be safe against
> > it, as release() is a nop if the io_mm doesn't have bonds anymore. Do you
> > have an example of such a situation?  I'm trying to write tests for this
> > kind of corner case.
> 
> Hmm, let me think. Ah, you have to be using mmu_notifier_unregister()
> to get that race. This is one of the things that get/put don't suffer
> from - but they conversely don't guarantee that release() will be
> called, so it is up to the caller to ensure everything is fenced
> before calling put.
> 
> Jason
Jason Gunthorpe March 6, 2020, 2:52 p.m. UTC | #12
On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote:
> On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > > in bind(), and mmu_notifier_put() is called in unbind(). 
> > > 
> > > Multiple device drivers may call bind() with the same mm. Each bind()
> > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > > mmu_notifier_put().
> > > 
> > > That's the most common case. Now if the process is killed and the mm
> > > disappears, we do need to avoid use-after-free caused by DMA of the
> > > mappings and the page tables. 
> > 
> > This is why release must do invalidate all - but it doesn't need to do
> > any more - as no SPTE can be established without a mmget() - and
> > mmget() is no longer possible past release.
> 
> In our case we don't have SPTEs, the whole pgd is shared between MMU and
> IOMMU (isolated using PASID tables).

Okay, but this just means that 'invalidate all' also requires
switching the PASID to use some pgd that is permanently 'all fail'.

> At this point no one told the device to stop working on this queue,
> it may still be doing DMA on this address space.

Sure, but there are lots of cases where a defective user space can
cause pages under active DMA to disappear, like munmap for
instance. Process exit is really no different, the PASID should take
errors and the device & driver should do whatever error flow it has.

Involving a complex driver flow in the exit_mmap path seems like
dangerous complexity to me.

Jason
Jean-Philippe Brucker March 6, 2020, 4:15 p.m. UTC | #13
On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote:
> > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > > > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > > > in bind(), and mmu_notifier_put() is called in unbind(). 
> > > > 
> > > > Multiple device drivers may call bind() with the same mm. Each bind()
> > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > > > mmu_notifier_put().
> > > > 
> > > > That's the most common case. Now if the process is killed and the mm
> > > > disappears, we do need to avoid use-after-free caused by DMA of the
> > > > mappings and the page tables. 
> > > 
> > > This is why release must do invalidate all - but it doesn't need to do
> > > any more - as no SPTE can be established without a mmget() - and
> > > mmget() is no longer possible past release.
> > 
> > In our case we don't have SPTEs, the whole pgd is shared between MMU and
> > IOMMU (isolated using PASID tables).
> 
> Okay, but this just means that 'invalidate all' also requires
> switching the PASID to use some pgd that is permanently 'all fail'.
> 
> > At this point no one told the device to stop working on this queue,
> > it may still be doing DMA on this address space.
> 
> Sure, but there are lots of cases where a defective user space can
> cause pages under active DMA to disappear, like munmap for
> instance. Process exit is really no different, the PASID should take
> errors and the device & driver should do whatever error flow it has.

We do have the possibility to shut things down in order, so to me this
feels like a band-aid. The idea has come up before though [1], and I'm not
strongly opposed to this model, but I'm still not convinced it's
necessary. It does add more complexity to IOMMU drivers, to avoid printing
out the errors that we wouldn't otherwise see, whereas device drivers need
in any case to implement the logic that forcibly stops DMA.

Thanks,
Jean

[1] https://lore.kernel.org/linux-iommu/4d68da96-0ad5-b412-5987-2f7a6aa796c3@amd.com/

> 
> Involving a complex driver flow in the exit_mmap path seems like
> dangerous complexity to me.
> 
> Jason
Jason Gunthorpe March 6, 2020, 5:42 p.m. UTC | #14
On Fri, Mar 06, 2020 at 05:15:19PM +0100, Jean-Philippe Brucker wrote:
> On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote:
> > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote:
> > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > > > > in bind(), and mmu_notifier_put() is called in unbind(). 
> > > > > 
> > > > > Multiple device drivers may call bind() with the same mm. Each bind()
> > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > > > > mmu_notifier_put().
> > > > > 
> > > > > That's the most common case. Now if the process is killed and the mm
> > > > > disappears, we do need to avoid use-after-free caused by DMA of the
> > > > > mappings and the page tables. 
> > > > 
> > > > This is why release must do invalidate all - but it doesn't need to do
> > > > any more - as no SPTE can be established without a mmget() - and
> > > > mmget() is no longer possible past release.
> > > 
> > > In our case we don't have SPTEs, the whole pgd is shared between MMU and
> > > IOMMU (isolated using PASID tables).
> > 
> > Okay, but this just means that 'invalidate all' also requires
> > switching the PASID to use some pgd that is permanently 'all fail'.
> > 
> > > At this point no one told the device to stop working on this queue,
> > > it may still be doing DMA on this address space.
> > 
> > Sure, but there are lots of cases where a defective user space can
> > cause pages under active DMA to disappear, like munmap for
> > instance. Process exit is really no different, the PASID should take
> > errors and the device & driver should do whatever error flow it has.
> 
> We do have the possibility to shut things down in order, so to me this
> feels like a band-aid. 

->release() is called by exit_mmap which is called by mmput. There are
over 100 call sites to mmput() and I'm not totally sure what the
rules are for release(). We've run into problems before with things
like this.

IMHO, due to this, it is best for release to be simple and have
conservative requirements on context like all the other notifier
callbacks. It is not a good place to put complex HW fencing driver
code.

In particular that link you referenced is suggesting the driver tear
down could take minutes - IMHO it is not OK to block mmput() for
minutes.

> The idea has come up before though [1], and I'm not strongly opposed
> to this model, but I'm still not convinced it's necessary. It does
> add more complexity to IOMMU drivers, to avoid printing out the
> errors that we wouldn't otherwise see, whereas device drivers need
> in any case to implement the logic that forcibly stops DMA.

Errors should not be printed to the kernel log for PASID cases
anyhow. PASID will be used by unpriv user, and unpriv user should not
be able to trigger kernel prints at will, eg by doing dma to nmap VA
or whatever. 

Process exit is just another case of this, and should not be treated
specially.

Jason
Jean-Philippe Brucker March 13, 2020, 6:49 p.m. UTC | #15
On Fri, Mar 06, 2020 at 01:42:39PM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 06, 2020 at 05:15:19PM +0100, Jean-Philippe Brucker wrote:
> > On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote:
> > > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote:
> > > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > > > > > in bind(), and mmu_notifier_put() is called in unbind(). 
> > > > > > 
> > > > > > Multiple device drivers may call bind() with the same mm. Each bind()
> > > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > > > > > mmu_notifier_put().
> > > > > > 
> > > > > > That's the most common case. Now if the process is killed and the mm
> > > > > > disappears, we do need to avoid use-after-free caused by DMA of the
> > > > > > mappings and the page tables. 
> > > > > 
> > > > > This is why release must do invalidate all - but it doesn't need to do
> > > > > any more - as no SPTE can be established without a mmget() - and
> > > > > mmget() is no longer possible past release.
> > > > 
> > > > In our case we don't have SPTEs, the whole pgd is shared between MMU and
> > > > IOMMU (isolated using PASID tables).
> > > 
> > > Okay, but this just means that 'invalidate all' also requires
> > > switching the PASID to use some pgd that is permanently 'all fail'.
> > > 
> > > > At this point no one told the device to stop working on this queue,
> > > > it may still be doing DMA on this address space.
> > > 
> > > Sure, but there are lots of cases where a defective user space can
> > > cause pages under active DMA to disappear, like munmap for
> > > instance. Process exit is really no different, the PASID should take
> > > errors and the device & driver should do whatever error flow it has.
> > 
> > We do have the possibility to shut things down in order, so to me this
> > feels like a band-aid. 
> 
> ->release() is called by exit_mmap which is called by mmput. There are
> over 100 call sites to mmput() and I'm not totally sure what the
> rules are for release(). We've run into problems before with things
> like this.

A concrete example of something that could go badly if mmput() takes too
long would greatly help. Otherwise I'll have a hard time justifying the
added complexity.

I wrote a prototype that removes the device driver callback from
release(). It works with SMMUv3, but complicates the PASID descriptor
code, which is already awful with my recent changes and this series.

> IMHO, due to this, it is best for release to be simple and have
> conservative requirements on context like all the other notifier
> callbacks. It is not a good place to put complex HW fencing driver
> code.
> 
> In particular that link you referenced is suggesting the driver tear
> down could take minutes - IMHO it is not OK to block mmput() for
> minutes.
> 
> > The idea has come up before though [1], and I'm not strongly opposed
> > to this model, but I'm still not convinced it's necessary. It does
> > add more complexity to IOMMU drivers, to avoid printing out the
> > errors that we wouldn't otherwise see, whereas device drivers need
> > in any case to implement the logic that forcibly stops DMA.
> 
> Errors should not be printed to the kernel log for PASID cases
> anyhow. PASID will be used by unprivileged users, and an unprivileged user
> should not be able to trigger kernel prints at will, e.g. by doing DMA to
> an unmapped VA
> or whatever. 

I agree. There is a difference, though, between invalid mappings and the
absence of a pgd. The former comes from userspace issuing DMA on unmapped
buffers, while the latter is typically a device/driver error which
normally needs to be reported.

On Arm SMMUv3 they are handled differently by the hardware. But instead of
disabling the whole PASID context on mm exit, we can quietly abort
incoming transactions while waiting for unbind().

And I think the other IOMMUs treat invalid PASID descriptor the same as
invalid translation table descriptor. At least VT-d quietly returns a
no-translation response to ATS TR and rejects PRI PR. I haven't found the
equivalent in the AMD IOMMU spec yet.

Thanks,
Jean
Jason Gunthorpe March 13, 2020, 7:13 p.m. UTC | #16
On Fri, Mar 13, 2020 at 07:49:29PM +0100, Jean-Philippe Brucker wrote:
> On Fri, Mar 06, 2020 at 01:42:39PM -0400, Jason Gunthorpe wrote:
> > On Fri, Mar 06, 2020 at 05:15:19PM +0100, Jean-Philippe Brucker wrote:
> > > On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote:
> > > > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > > > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > > > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > > > > > > in bind(), and mmu_notifier_put() is called in unbind(). 
> > > > > > > 
> > > > > > > Multiple device drivers may call bind() with the same mm. Each bind()
> > > > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > > > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > > > > > > mmu_notifier_put().
> > > > > > > 
> > > > > > > That's the most common case. Now if the process is killed and the mm
> > > > > > > disappears, we do need to avoid use-after-free caused by DMA of the
> > > > > > > mappings and the page tables. 
> > > > > > 
> > > > > > This is why release must do invalidate all - but it doesn't need to do
> > > > > > any more - as no SPTE can be established without a mmget() - and
> > > > > > mmget() is no longer possible past release.
> > > > > 
> > > > > In our case we don't have SPTEs, the whole pgd is shared between MMU and
> > > > > IOMMU (isolated using PASID tables).
> > > > 
> > > > Okay, but this just means that 'invalidate all' also requires
> > > > switching the PASID to use some pgd that is permanently 'all fail'.
> > > > 
> > > > > At this point no one told the device to stop working on this queue,
> > > > > it may still be doing DMA on this address space.
> > > > 
> > > > Sure, but there are lots of cases where a defective user space can
> > > > cause pages under active DMA to disappear, like munmap for
> > > > instance. Process exit is really no different, the PASID should take
> > > > errors and the device & driver should do whatever error flow it has.
> > > 
> > > We do have the possibility to shut things down in order, so to me this
> > > feels like a band-aid. 
> > 
> > ->release() is called by exit_mmap which is called by mmput. There are
> > over 100 call sites to mmput() and I'm not totally sure what the
> > rules are for release(). We've run into problems before with things
> > like this.
> 
> A concrete example of something that could go badly if mmput() takes too
> long would greatly help. Otherwise I'll have a hard time justifying the
> added complexity.

It is not just about taking too long, but also about accidentally causing
locking problems by putting very complex code in the release callback.
Unless you audit all the mmput() call sites to define the calling
conditions, I can't even say what the risk is here.

In particular, calling something with impossible-to-audit locking, like
the dma_fence stuff, from release() is probably impossible to prove
safe and then keep safe.

It is easy enough to see where taking too long can have a bad impact;
mmput() is called all over the place. Just in the RDMA code, slowing it
down would block ODP page faulting completely for all processes.
This is not acceptable.

For this reason release callbacks must be simple/fast and must have
trivial locking.

> > Errors should not be printed to the kernel log for PASID cases
> > anyhow. PASID will be used by unprivileged users, and an unprivileged user
> > should not be able to trigger kernel prints at will, e.g. by doing DMA to
> > an unmapped VA
> > or whatever. 
> 
> I agree. There is a difference, though, between invalid mappings and the
> absence of a pgd. The former comes from userspace issuing DMA on unmapped
> buffers, while the latter is typically a device/driver error which
> normally needs to be reported.

Why not make the pgd present as I suggested? Point it at a static
dummy pgd that always fails to page fault during release? Make the pgd
not present only once the PASID is fully destroyed.

That really is the only thing release is supposed to mean -> unmap all
VAs.
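
Concretely, something like this (the helpers here are entirely made up,
just to illustrate the idea):

        static void io_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
        {
                struct io_mm *io_mm = container_of(mn, struct io_mm, notifier);

                /*
                 * Keep the PASID entry present but pointing at an empty pgd,
                 * so stray DMA faults instead of walking freed page tables.
                 * The entry is torn down for real in unbind().
                 */
                io_mm_write_pasid_pgd(io_mm->pasid, io_mm_blocked_pgd);
        }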

Jason
Christoph Hellwig March 16, 2020, 3:46 p.m. UTC | #17
On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> This is why release must do invalidate all - but it doesn't need to do
> any more - as no SPTE can be established without a mmget() - and
> mmget() is no longer possible past release.

Maybe we should rename the release method to invalidate_all?
Jason Gunthorpe March 17, 2020, 6:40 p.m. UTC | #18
On Mon, Mar 16, 2020 at 08:46:59AM -0700, Christoph Hellwig wrote:
> On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > This is why release must do invalidate all - but it doesn't need to do
> > any more - as no SPTE can be established without a mmget() - and
> > mmget() is no longer possible past release.
> 
> Maybe we should rename the release method to invalidate_all?

It is a better name. The function must also fence future access if
the mirror is not using mmget(), and stop using the pgd/etc pointer if
the page tables are accessed directly.

Jason

Patch

diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 10921cd2608d..77610e1704f6 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -235,7 +235,7 @@  static void gru_invalidate_range_end(struct mmu_notifier *mn,
 		gms, range->start, range->end);
 }
 
-static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm)
+static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm, void *privdata)
 {
 	struct gru_mm_struct *gms;
 
@@ -266,7 +266,7 @@  struct gru_mm_struct *gru_register_mmu_notifier(void)
 {
 	struct mmu_notifier *mn;
 
-	mn = mmu_notifier_get_locked(&gru_mmuops, current->mm);
+	mn = mmu_notifier_get_locked(&gru_mmuops, current->mm, NULL);
 	if (IS_ERR(mn))
 		return ERR_CAST(mn);
 
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 736f6918335e..06e68fa2b019 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -207,7 +207,7 @@  struct mmu_notifier_ops {
 	 * callbacks are currently running. It is called from a SRCU callback
 	 * and cannot sleep.
 	 */
-	struct mmu_notifier *(*alloc_notifier)(struct mm_struct *mm);
+	struct mmu_notifier *(*alloc_notifier)(struct mm_struct *mm, void *privdata);
 	void (*free_notifier)(struct mmu_notifier *subscription);
 };
 
@@ -271,14 +271,16 @@  static inline int mm_has_notifiers(struct mm_struct *mm)
 }
 
 struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops,
-					     struct mm_struct *mm);
+					     struct mm_struct *mm,
+					     void *privdata);
 static inline struct mmu_notifier *
-mmu_notifier_get(const struct mmu_notifier_ops *ops, struct mm_struct *mm)
+mmu_notifier_get(const struct mmu_notifier_ops *ops, struct mm_struct *mm,
+		 void *privdata)
 {
 	struct mmu_notifier *ret;
 
 	down_write(&mm->mmap_sem);
-	ret = mmu_notifier_get_locked(ops, mm);
+	ret = mmu_notifier_get_locked(ops, mm, privdata);
 	up_write(&mm->mmap_sem);
 	return ret;
 }
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index ef3973a5d34a..8beb9dcbe0fd 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -734,6 +734,7 @@  find_get_mmu_notifier(struct mm_struct *mm, const struct mmu_notifier_ops *ops)
  *                           the mm & ops
  * @ops: The operations struct being subscribe with
  * @mm : The mm to attach notifiers too
+ * @privdata: Initialization data passed down to ops->alloc_notifier()
  *
  * This function either allocates a new mmu_notifier via
  * ops->alloc_notifier(), or returns an already existing notifier on the
@@ -747,7 +748,8 @@  find_get_mmu_notifier(struct mm_struct *mm, const struct mmu_notifier_ops *ops)
  * and can be converted to an active mm pointer via mmget_not_zero().
  */
 struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops,
-					     struct mm_struct *mm)
+					     struct mm_struct *mm,
+					     void *privdata)
 {
 	struct mmu_notifier *subscription;
 	int ret;
@@ -760,7 +762,7 @@  struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops,
 			return subscription;
 	}
 
-	subscription = ops->alloc_notifier(mm);
+	subscription = ops->alloc_notifier(mm, privdata);
 	if (IS_ERR(subscription))
 		return subscription;
 	subscription->ops = ops;