diff mbox series

[for-rc,v2] IB/hfi1: Move cached value of mm into handler

Message ID 20201112025837.24440.6767.stgit@awfm-01.aw.intel.com (mailing list archive)
State Superseded
Headers show
Series [for-rc,v2] IB/hfi1: Move cached value of mm into handler | expand

Commit Message

Dennis Dalessandro Nov. 12, 2020, 2:58 a.m. UTC
Two earlier bug fixes have created a security problem in the hfi1
driver. One fix aimed to solve an issue where current->mm was not valid
when closing the hfi1 cdev. It attempted to do this by saving a cached
value of the current->mm pointer at file open time. This is a problem if
another process with access to the FD calls in via write() or ioctl() to
pin pages via the hfi driver. The other fix tried to solve a use after
free by taking a reference on the mm. This was just wrong because its
possible for a race condition between one process with an mm that opened
the cdev if it was accessing via an IOCTL, and another process
attempting to close the cdev with a different current->mm.

To fix this correctly we move the cached value of the mm into the mmu
handler struct for the driver. Now we can check in the insert, evict,
etc. routines that current->mm matched what the handler was registered
for. If not, then don't allow access. The register of the mmu notifier
will save the mm pointer.

Note the check in the unregister is not needed in the event that
current->mm is empty. This means the tear down is happening due to a
SigKill or OOM Killer, something along those lines. If current->mm has a
value then it must be checked and only the task that did the register
can do the unregister.

Since in do_exit() the exit_mm() is called before exit_files(), which
would call our close routine a reference is needed on the mm. We rely on
the mmgrab done by the registration of the notifier, whereas before it
was explicit.

Also of note is we do not do any explicit work to protect the interval
tree notifier. It doesn't seem that this is going to be needed since we
aren't actually doing anything with current->mm. The interval tree
notifier stuff still has a FIXME noted from a previous commit that will
be addressed in a follow on patch.

Fixes: e0cf75deab81 ("IB/hfi1: Fix mm_struct use after free")
Fixes: 3faa3d9a308e ("IB/hfi1: Make use of mm consistent")
Reported-by: Jann Horn <jannh@google.com>
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>

---

Changes since v0:
----------------
Removed the checking of the pid and limitation that
whatever task opens the dev is the only one that can do write() or
ioctl(). While this limitation is OK it doesn't appear to be strictly
necessary.

Rebased on top of 5.10-rc1. Testing has been done on 5.9 due to a bug in
5.10 that is being worked (separate issue).

Changes since v1:
----------------
Remove explicit mmget/put to rely on the notifier register's mmgrab
instead.

Fixed missing check in rb_unregister to only check current->mm if its
actually valid.

Moved mm_from_tid_node to exp_rcv header and use it
---
 drivers/infiniband/hw/hfi1/file_ops.c     |    4 +--
 drivers/infiniband/hw/hfi1/hfi.h          |    2 +
 drivers/infiniband/hw/hfi1/mmu_rb.c       |   41 +++++++++++++++++------------
 drivers/infiniband/hw/hfi1/mmu_rb.h       |   17 +++++++++++-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |   12 ++++++--
 drivers/infiniband/hw/hfi1/user_exp_rcv.h |    6 ++++
 drivers/infiniband/hw/hfi1/user_sdma.c    |   13 +++++----
 drivers/infiniband/hw/hfi1/user_sdma.h    |    7 ++++-
 8 files changed, 69 insertions(+), 33 deletions(-)

Comments

Ira Weiny Nov. 12, 2020, 5:14 p.m. UTC | #1
On Wed, Nov 11, 2020 at 09:58:37PM -0500, Dennis Dalessandro wrote:
> Two earlier bug fixes have created a security problem in the hfi1
> driver. One fix aimed to solve an issue where current->mm was not valid
> when closing the hfi1 cdev. It attempted to do this by saving a cached
> value of the current->mm pointer at file open time. This is a problem if
> another process with access to the FD calls in via write() or ioctl() to
> pin pages via the hfi driver. The other fix tried to solve a use after
> free by taking a reference on the mm. This was just wrong because its
> possible for a race condition between one process with an mm that opened
> the cdev if it was accessing via an IOCTL, and another process
> attempting to close the cdev with a different current->mm.

Again I'm still not seeing the race here.  It is entirely possible that the fix
I was trying to do way back was mistaken too...  ;-)  I would just delete the
last 2 sentences...  and/or reference the commit of those fixes and help
explain this more.

> 
> To fix this correctly we move the cached value of the mm into the mmu
> handler struct for the driver.

Looking at this closer I don't think you need the mm member of mmu_rb_handler
any longer.  See below.

> Now we can check in the insert, evict,
> etc. routines that current->mm matched what the handler was registered
> for. If not, then don't allow access. The register of the mmu notifier
> will save the mm pointer.
> 
> Note the check in the unregister is not needed in the event that
> current->mm is empty. This means the tear down is happening due to a
> SigKill or OOM Killer, something along those lines. If current->mm has a
> value then it must be checked and only the task that did the register
> can do the unregister.
> 
> Since in do_exit() the exit_mm() is called before exit_files(), which
> would call our close routine a reference is needed on the mm. We rely on
> the mmgrab done by the registration of the notifier, whereas before it
> was explicit.
> 
> Also of note is we do not do any explicit work to protect the interval
> tree notifier. It doesn't seem that this is going to be needed since we
> aren't actually doing anything with current->mm. The interval tree
> notifier stuff still has a FIXME noted from a previous commit that will
> be addressed in a follow on patch.

This is a bit confusing ...

Is this the FIXME you are refering to?

hfi1/user_exp_rcv.c:
...
764                 /*
765                  * FIXME: This is in the wrong order, the notifier should be
766                  * established before the pages are pinned by pin_rcv_pages.
767                  */
...

> 
> Fixes: e0cf75deab81 ("IB/hfi1: Fix mm_struct use after free")
> Fixes: 3faa3d9a308e ("IB/hfi1: Make use of mm consistent")
> Reported-by: Jann Horn <jannh@google.com>
> Reported-by: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
> 
> ---
> 
> Changes since v0:
> ----------------
> Removed the checking of the pid and limitation that
> whatever task opens the dev is the only one that can do write() or
> ioctl(). While this limitation is OK it doesn't appear to be strictly
> necessary.
> 
> Rebased on top of 5.10-rc1. Testing has been done on 5.9 due to a bug in
> 5.10 that is being worked (separate issue).
> 
> Changes since v1:
> ----------------
> Remove explicit mmget/put to rely on the notifier register's mmgrab
> instead.
> 
> Fixed missing check in rb_unregister to only check current->mm if its
> actually valid.
> 
> Moved mm_from_tid_node to exp_rcv header and use it
> ---
>  drivers/infiniband/hw/hfi1/file_ops.c     |    4 +--
>  drivers/infiniband/hw/hfi1/hfi.h          |    2 +
>  drivers/infiniband/hw/hfi1/mmu_rb.c       |   41 +++++++++++++++++------------
>  drivers/infiniband/hw/hfi1/mmu_rb.h       |   17 +++++++++++-
>  drivers/infiniband/hw/hfi1/user_exp_rcv.c |   12 ++++++--
>  drivers/infiniband/hw/hfi1/user_exp_rcv.h |    6 ++++
>  drivers/infiniband/hw/hfi1/user_sdma.c    |   13 +++++----
>  drivers/infiniband/hw/hfi1/user_sdma.h    |    7 ++++-
>  8 files changed, 69 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
> index 8ca51e4..329ee4f 100644
> --- a/drivers/infiniband/hw/hfi1/file_ops.c
> +++ b/drivers/infiniband/hw/hfi1/file_ops.c
> @@ -1,4 +1,5 @@
>  /*
> + * Copyright(c) 2020 Cornelis Networks, Inc.
>   * Copyright(c) 2015-2020 Intel Corporation.
>   *
>   * This file is provided under a dual BSD/GPLv2 license.  When using or
> @@ -206,8 +207,6 @@ static int hfi1_file_open(struct inode *inode, struct file *fp)
>  	spin_lock_init(&fd->tid_lock);
>  	spin_lock_init(&fd->invalid_lock);
>  	fd->rec_cpu_num = -1; /* no cpu affinity by default */
> -	fd->mm = current->mm;
> -	mmgrab(fd->mm);
>  	fd->dd = dd;
>  	fp->private_data = fd;
>  	return 0;
> @@ -711,7 +710,6 @@ static int hfi1_file_close(struct inode *inode, struct file *fp)
>  
>  	deallocate_ctxt(uctxt);
>  done:
> -	mmdrop(fdata->mm);
>  
>  	if (atomic_dec_and_test(&dd->user_refcount))
>  		complete(&dd->user_comp);
> diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
> index b4c6bff..e09e824 100644
> --- a/drivers/infiniband/hw/hfi1/hfi.h
> +++ b/drivers/infiniband/hw/hfi1/hfi.h
> @@ -1,6 +1,7 @@
>  #ifndef _HFI1_KERNEL_H
>  #define _HFI1_KERNEL_H
>  /*
> + * Copyright(c) 2020 Cornelis Networks, Inc.
>   * Copyright(c) 2015-2020 Intel Corporation.
>   *
>   * This file is provided under a dual BSD/GPLv2 license.  When using or
> @@ -1451,7 +1452,6 @@ struct hfi1_filedata {
>  	u32 invalid_tid_idx;
>  	/* protect invalid_tids array and invalid_tid_idx */
>  	spinlock_t invalid_lock;
> -	struct mm_struct *mm;
>  };
>  
>  extern struct xarray hfi1_dev_table;
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
> index 24ca17b..6be4e79 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
> @@ -1,4 +1,5 @@
>  /*
> + * Copyright(c) 2020 Cornelis Networks, Inc.
>   * Copyright(c) 2016 - 2017 Intel Corporation.
>   *
>   * This file is provided under a dual BSD/GPLv2 license.  When using or
> @@ -48,23 +49,11 @@
>  #include <linux/rculist.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/interval_tree_generic.h>
> +#include <linux/sched/mm.h>
>  
>  #include "mmu_rb.h"
>  #include "trace.h"
>  
> -struct mmu_rb_handler {
> -	struct mmu_notifier mn;
> -	struct rb_root_cached root;
> -	void *ops_arg;
> -	spinlock_t lock;        /* protect the RB tree */
> -	struct mmu_rb_ops *ops;
> -	struct mm_struct *mm;
> -	struct list_head lru_list;
> -	struct work_struct del_work;
> -	struct list_head del_list;
> -	struct workqueue_struct *wq;
> -};
> -
>  static unsigned long mmu_node_start(struct mmu_rb_node *);
>  static unsigned long mmu_node_last(struct mmu_rb_node *);
>  static int mmu_notifier_range_start(struct mmu_notifier *,
> @@ -92,7 +81,7 @@ static unsigned long mmu_node_last(struct mmu_rb_node *node)
>  	return PAGE_ALIGN(node->addr + node->len) - 1;
>  }
>  
> -int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
> +int hfi1_mmu_rb_register(void *ops_arg,
>  			 struct mmu_rb_ops *ops,
>  			 struct workqueue_struct *wq,
>  			 struct mmu_rb_handler **handler)
> @@ -110,18 +99,19 @@ int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
>  	INIT_HLIST_NODE(&handlr->mn.hlist);
>  	spin_lock_init(&handlr->lock);
>  	handlr->mn.ops = &mn_opts;
> -	handlr->mm = mm;

NIT: I really think you should follow up with a spelling fix patch...  Sorry
just got frustrated greping for 'handler' and not finding this!  ;-)

>  	INIT_WORK(&handlr->del_work, handle_remove);
>  	INIT_LIST_HEAD(&handlr->del_list);
>  	INIT_LIST_HEAD(&handlr->lru_list);
>  	handlr->wq = wq;
>  
> -	ret = mmu_notifier_register(&handlr->mn, handlr->mm);
> +	ret = mmu_notifier_register(&handlr->mn, current->mm);
>  	if (ret) {
>  		kfree(handlr);
>  		return ret;
>  	}
>  
> +	handlr->mm = current->mm;

Sorry I did not catch this before but do you need to store this pointer?  Is it
not enough to check the ->mn.mm? ...

I think that would also make it clear you are relying on the mmget() within the
mmu_notifier_register()  Because that is the reference you are using rather
than having another reference here which could potentially be used wrongly in
the future.

> +
>  	*handler = handlr;
>  	return 0;
>  }
> @@ -133,8 +123,11 @@ void hfi1_mmu_rb_unregister(struct mmu_rb_handler *handler)
>  	unsigned long flags;
>  	struct list_head del_list;
>  
> +	if (current->mm && (handler->mm != current->mm))
                            ^^^^^^^^^^^
                            handler->mn.mm?
... Like this?

> +		return;
> +
>  	/* Unregister first so we don't get any more notifications. */
> -	mmu_notifier_unregister(&handler->mn, handler->mm);
> +	mmu_notifier_unregister(&handler->mn, handler->mn.mm);

Here you use the mn.mm.

It is the same right?

>  
>  	/*
>  	 * Make sure the wq delete handler is finished running.  It will not
> @@ -166,6 +159,10 @@ int hfi1_mmu_rb_insert(struct mmu_rb_handler *handler,
>  	int ret = 0;
>  
>  	trace_hfi1_mmu_rb_insert(mnode->addr, mnode->len);
> +
> +	if (current->mm != handler->mm)

Ditto.

> +		return -EPERM;
> +
>  	spin_lock_irqsave(&handler->lock, flags);
>  	node = __mmu_rb_search(handler, mnode->addr, mnode->len);
>  	if (node) {
> @@ -180,6 +177,7 @@ int hfi1_mmu_rb_insert(struct mmu_rb_handler *handler,
>  		__mmu_int_rb_remove(mnode, &handler->root);
>  		list_del(&mnode->list); /* remove from LRU list */
>  	}
> +	mnode->handler = handler;
>  unlock:
>  	spin_unlock_irqrestore(&handler->lock, flags);
>  	return ret;
> @@ -217,6 +215,9 @@ bool hfi1_mmu_rb_remove_unless_exact(struct mmu_rb_handler *handler,
>  	unsigned long flags;
>  	bool ret = false;
>  
> +	if (current->mm != handler->mm)

Ditto.

> +		return ret;
> +
>  	spin_lock_irqsave(&handler->lock, flags);
>  	node = __mmu_rb_search(handler, addr, len);
>  	if (node) {
> @@ -239,6 +240,9 @@ void hfi1_mmu_rb_evict(struct mmu_rb_handler *handler, void *evict_arg)
>  	unsigned long flags;
>  	bool stop = false;
>  
> +	if (current->mm != handler->mm)

Ditto.

> +		return;
> +
>  	INIT_LIST_HEAD(&del_list);
>  
>  	spin_lock_irqsave(&handler->lock, flags);
> @@ -272,6 +276,9 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>  {
>  	unsigned long flags;
>  
> +	if (current->mm != handler->mm)

Ditto.

> +		return;
> +
>  	/* Validity of handler and node pointers has been checked by caller. */
>  	trace_hfi1_mmu_rb_remove(node->addr, node->len);
>  	spin_lock_irqsave(&handler->lock, flags);
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.h b/drivers/infiniband/hw/hfi1/mmu_rb.h
> index f04cec1..e208618 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.h
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.h
> @@ -1,4 +1,5 @@
>  /*
> + * Copyright(c) 2020 Cornelis Networks, Inc.
>   * Copyright(c) 2016 Intel Corporation.
>   *
>   * This file is provided under a dual BSD/GPLv2 license.  When using or
> @@ -54,6 +55,7 @@ struct mmu_rb_node {
>  	unsigned long len;
>  	unsigned long __last;
>  	struct rb_node node;
> +	struct mmu_rb_handler *handler;
>  	struct list_head list;
>  };
>  
> @@ -71,7 +73,20 @@ struct mmu_rb_ops {
>  		     void *evict_arg, bool *stop);
>  };
>  
> -int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
> +struct mmu_rb_handler {
> +	struct mmu_notifier mn;
> +	struct rb_root_cached root;
> +	void *ops_arg;
> +	spinlock_t lock;        /* protect the RB tree */
> +	struct mmu_rb_ops *ops;
> +	struct list_head lru_list;
> +	struct work_struct del_work;
> +	struct list_head del_list;
> +	struct workqueue_struct *wq;
> +	struct mm_struct *mm;

And remove this?

Ira

> +};
> +
> +int hfi1_mmu_rb_register(void *ops_arg,
>  			 struct mmu_rb_ops *ops,
>  			 struct workqueue_struct *wq,
>  			 struct mmu_rb_handler **handler);
> diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
> index f81ca20..b94fc7f 100644
> --- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
> +++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
> @@ -1,4 +1,5 @@
>  /*
> + * Copyright(c) 2020 Cornelis Networks, Inc.
>   * Copyright(c) 2015-2018 Intel Corporation.
>   *
>   * This file is provided under a dual BSD/GPLv2 license.  When using or
> @@ -173,15 +174,18 @@ static void unpin_rcv_pages(struct hfi1_filedata *fd,
>  {
>  	struct page **pages;
>  	struct hfi1_devdata *dd = fd->uctxt->dd;
> +	struct mm_struct *mm;
>  
>  	if (mapped) {
>  		pci_unmap_single(dd->pcidev, node->dma_addr,
>  				 node->npages * PAGE_SIZE, PCI_DMA_FROMDEVICE);
>  		pages = &node->pages[idx];
> +		mm = mm_from_tid_node(node);
>  	} else {
>  		pages = &tidbuf->pages[idx];
> +		mm = current->mm;
>  	}
> -	hfi1_release_user_pages(fd->mm, pages, npages, mapped);
> +	hfi1_release_user_pages(mm, pages, npages, mapped);
>  	fd->tid_n_pinned -= npages;
>  }
>  
> @@ -216,12 +220,12 @@ static int pin_rcv_pages(struct hfi1_filedata *fd, struct tid_user_buf *tidbuf)
>  	 * pages, accept the amount pinned so far and program only that.
>  	 * User space knows how to deal with partially programmed buffers.
>  	 */
> -	if (!hfi1_can_pin_pages(dd, fd->mm, fd->tid_n_pinned, npages)) {
> +	if (!hfi1_can_pin_pages(dd, current->mm, fd->tid_n_pinned, npages)) {
>  		kfree(pages);
>  		return -ENOMEM;
>  	}
>  
> -	pinned = hfi1_acquire_user_pages(fd->mm, vaddr, npages, true, pages);
> +	pinned = hfi1_acquire_user_pages(current->mm, vaddr, npages, true, pages);
>  	if (pinned <= 0) {
>  		kfree(pages);
>  		return pinned;
> @@ -756,7 +760,7 @@ static int set_rcvarray_entry(struct hfi1_filedata *fd,
>  
>  	if (fd->use_mn) {
>  		ret = mmu_interval_notifier_insert(
> -			&node->notifier, fd->mm,
> +			&node->notifier, current->mm,
>  			tbuf->vaddr + (pageidx * PAGE_SIZE), npages * PAGE_SIZE,
>  			&tid_mn_ops);
>  		if (ret)
> diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.h b/drivers/infiniband/hw/hfi1/user_exp_rcv.h
> index 332abb4..d45c7b6 100644
> --- a/drivers/infiniband/hw/hfi1/user_exp_rcv.h
> +++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.h
> @@ -1,6 +1,7 @@
>  #ifndef _HFI1_USER_EXP_RCV_H
>  #define _HFI1_USER_EXP_RCV_H
>  /*
> + * Copyright(c) 2020 - Cornelis Networks, Inc.
>   * Copyright(c) 2015 - 2017 Intel Corporation.
>   *
>   * This file is provided under a dual BSD/GPLv2 license.  When using or
> @@ -95,4 +96,9 @@ int hfi1_user_exp_rcv_clear(struct hfi1_filedata *fd,
>  int hfi1_user_exp_rcv_invalid(struct hfi1_filedata *fd,
>  			      struct hfi1_tid_info *tinfo);
>  
> +static inline struct mm_struct *mm_from_tid_node(struct tid_rb_node *node)
> +{
> +	return node->notifier.mm;
> +}
> +
>  #endif /* _HFI1_USER_EXP_RCV_H */
> diff --git a/drivers/infiniband/hw/hfi1/user_sdma.c b/drivers/infiniband/hw/hfi1/user_sdma.c
> index a92346e..4a4956f9 100644
> --- a/drivers/infiniband/hw/hfi1/user_sdma.c
> +++ b/drivers/infiniband/hw/hfi1/user_sdma.c
> @@ -1,4 +1,5 @@
>  /*
> + * Copyright(c) 2020 - Cornelis Networks, Inc.
>   * Copyright(c) 2015 - 2018 Intel Corporation.
>   *
>   * This file is provided under a dual BSD/GPLv2 license.  When using or
> @@ -188,7 +189,6 @@ int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt,
>  	atomic_set(&pq->n_reqs, 0);
>  	init_waitqueue_head(&pq->wait);
>  	atomic_set(&pq->n_locked, 0);
> -	pq->mm = fd->mm;
>  
>  	iowait_init(&pq->busy, 0, NULL, NULL, defer_packet_queue,
>  		    activate_packet_queue, NULL, NULL);
> @@ -230,7 +230,7 @@ int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt,
>  
>  	cq->nentries = hfi1_sdma_comp_ring_size;
>  
> -	ret = hfi1_mmu_rb_register(pq, pq->mm, &sdma_rb_ops, dd->pport->hfi1_wq,
> +	ret = hfi1_mmu_rb_register(pq, &sdma_rb_ops, dd->pport->hfi1_wq,
>  				   &pq->handler);
>  	if (ret) {
>  		dd_dev_err(dd, "Failed to register with MMU %d", ret);
> @@ -980,13 +980,13 @@ static int pin_sdma_pages(struct user_sdma_request *req,
>  
>  	npages -= node->npages;
>  retry:
> -	if (!hfi1_can_pin_pages(pq->dd, pq->mm,
> +	if (!hfi1_can_pin_pages(pq->dd, current->mm,
>  				atomic_read(&pq->n_locked), npages)) {
>  		cleared = sdma_cache_evict(pq, npages);
>  		if (cleared >= npages)
>  			goto retry;
>  	}
> -	pinned = hfi1_acquire_user_pages(pq->mm,
> +	pinned = hfi1_acquire_user_pages(current->mm,
>  					 ((unsigned long)iovec->iov.iov_base +
>  					 (node->npages * PAGE_SIZE)), npages, 0,
>  					 pages + node->npages);
> @@ -995,7 +995,7 @@ static int pin_sdma_pages(struct user_sdma_request *req,
>  		return pinned;
>  	}
>  	if (pinned != npages) {
> -		unpin_vector_pages(pq->mm, pages, node->npages, pinned);
> +		unpin_vector_pages(current->mm, pages, node->npages, pinned);
>  		return -EFAULT;
>  	}
>  	kfree(node->pages);
> @@ -1008,7 +1008,8 @@ static int pin_sdma_pages(struct user_sdma_request *req,
>  static void unpin_sdma_pages(struct sdma_mmu_node *node)
>  {
>  	if (node->npages) {
> -		unpin_vector_pages(node->pq->mm, node->pages, 0, node->npages);
> +		unpin_vector_pages(mm_from_sdma_node(node), node->pages, 0,
> +				   node->npages);
>  		atomic_sub(node->npages, &node->pq->n_locked);
>  	}
>  }
> diff --git a/drivers/infiniband/hw/hfi1/user_sdma.h b/drivers/infiniband/hw/hfi1/user_sdma.h
> index 9972e0e..1e8c02f 100644
> --- a/drivers/infiniband/hw/hfi1/user_sdma.h
> +++ b/drivers/infiniband/hw/hfi1/user_sdma.h
> @@ -1,6 +1,7 @@
>  #ifndef _HFI1_USER_SDMA_H
>  #define _HFI1_USER_SDMA_H
>  /*
> + * Copyright(c) 2020 - Cornelis Networks, Inc.
>   * Copyright(c) 2015 - 2018 Intel Corporation.
>   *
>   * This file is provided under a dual BSD/GPLv2 license.  When using or
> @@ -133,7 +134,6 @@ struct hfi1_user_sdma_pkt_q {
>  	unsigned long unpinned;
>  	struct mmu_rb_handler *handler;
>  	atomic_t n_locked;
> -	struct mm_struct *mm;
>  };
>  
>  struct hfi1_user_sdma_comp_q {
> @@ -250,4 +250,9 @@ int hfi1_user_sdma_process_request(struct hfi1_filedata *fd,
>  				   struct iovec *iovec, unsigned long dim,
>  				   unsigned long *count);
>  
> +static inline struct mm_struct *mm_from_sdma_node(struct sdma_mmu_node *node)
> +{
> +	return node->rb.handler->mn.mm;
> +}
> +
>  #endif /* _HFI1_USER_SDMA_H */
>
Dennis Dalessandro Nov. 12, 2020, 10:06 p.m. UTC | #2
On 11/12/2020 12:14 PM, Ira Weiny wrote:
> On Wed, Nov 11, 2020 at 09:58:37PM -0500, Dennis Dalessandro wrote:
>> Two earlier bug fixes have created a security problem in the hfi1
>> driver. One fix aimed to solve an issue where current->mm was not valid
>> when closing the hfi1 cdev. It attempted to do this by saving a cached
>> value of the current->mm pointer at file open time. This is a problem if
>> another process with access to the FD calls in via write() or ioctl() to
>> pin pages via the hfi driver. The other fix tried to solve a use after
>> free by taking a reference on the mm. This was just wrong because its
>> possible for a race condition between one process with an mm that opened
>> the cdev if it was accessing via an IOCTL, and another process
>> attempting to close the cdev with a different current->mm.
> 
> Again I'm still not seeing the race here.  It is entirely possible that the fix
> I was trying to do way back was mistaken too...  ;-)  I would just delete the
> last 2 sentences...  and/or reference the commit of those fixes and help
> explain this more.

I was attempting to refer to [1], the email that started all of this.

>>
>> To fix this correctly we move the cached value of the mm into the mmu
>> handler struct for the driver.
> 
> Looking at this closer I don't think you need the mm member of mmu_rb_handler
> any longer.  See below.

We went back and forth on this as well. We thought it better to rely on 
our own pointer vs looking into the notifier to get the mm. Same 
reasoning for doing our own referecne counting. Question is what is the 
preferred way here. Functionally it makes no difference and I'm fine 
going either way.


> NIT: I really think you should follow up with a spelling fix patch...  Sorry
> just got frustrated greping for 'handler' and not finding this!  ;-)
> 
>>   	INIT_WORK(&handlr->del_work, handle_remove);
>>   	INIT_LIST_HEAD(&handlr->del_list);
>>   	INIT_LIST_HEAD(&handlr->lru_list);
>>   	handlr->wq = wq;
>>   
>> -	ret = mmu_notifier_register(&handlr->mn, handlr->mm);
>> +	ret = mmu_notifier_register(&handlr->mn, current->mm);
>>   	if (ret) {
>>   		kfree(handlr);
>>   		return ret;
>>   	}
>>   
>> +	handlr->mm = current->mm;
> 
> Sorry I did not catch this before but do you need to store this pointer?  Is it
> not enough to check the ->mn.mm? ...
> 
> I think that would also make it clear you are relying on the mmget() within the
> mmu_notifier_register()  Because that is the reference you are using rather
> than having another reference here which could potentially be used wrongly in
> the future.

That's the question. It does make sense to do that if we are sticking 
iwth the notifier's reference vs our own explicit one. I'm not 100% sold 
that we should not be doing the ref counting and keeping our own 
pointer. To me we shoudln't be looking inside the notifer struct and 
instead honestly there should probably be an API/helper call to get the 
mm from it. I'm open to either approach.


-Denny
Dennis Dalessandro Nov. 12, 2020, 10:08 p.m. UTC | #3
On 11/12/2020 5:06 PM, Dennis Dalessandro wrote:
> On 11/12/2020 12:14 PM, Ira Weiny wrote:
>> On Wed, Nov 11, 2020 at 09:58:37PM -0500, Dennis Dalessandro wrote:
>>> Two earlier bug fixes have created a security problem in the hfi1
>>> driver. One fix aimed to solve an issue where current->mm was not valid
>>> when closing the hfi1 cdev. It attempted to do this by saving a cached
>>> value of the current->mm pointer at file open time. This is a problem if
>>> another process with access to the FD calls in via write() or ioctl() to
>>> pin pages via the hfi driver. The other fix tried to solve a use after
>>> free by taking a reference on the mm. This was just wrong because its
>>> possible for a race condition between one process with an mm that opened
>>> the cdev if it was accessing via an IOCTL, and another process
>>> attempting to close the cdev with a different current->mm.
>>
>> Again I'm still not seeing the race here.  It is entirely possible 
>> that the fix
>> I was trying to do way back was mistaken too...  ;-)  I would just 
>> delete the
>> last 2 sentences...  and/or reference the commit of those fixes and help
>> explain this more.
> 
> I was attempting to refer to [1], the email that started all of this.

That link should be:
[1] https://marc.info/?l=linux-rdma&m=159891753806720&w=2

-Denny
Jason Gunthorpe Nov. 13, 2020, 12:01 a.m. UTC | #4
On Thu, Nov 12, 2020 at 05:06:30PM -0500, Dennis Dalessandro wrote:
> On 11/12/2020 12:14 PM, Ira Weiny wrote:
> > On Wed, Nov 11, 2020 at 09:58:37PM -0500, Dennis Dalessandro wrote:
> > > Two earlier bug fixes have created a security problem in the hfi1
> > > driver. One fix aimed to solve an issue where current->mm was not valid
> > > when closing the hfi1 cdev. It attempted to do this by saving a cached
> > > value of the current->mm pointer at file open time. This is a problem if
> > > another process with access to the FD calls in via write() or ioctl() to
> > > pin pages via the hfi driver. The other fix tried to solve a use after
> > > free by taking a reference on the mm. This was just wrong because its
> > > possible for a race condition between one process with an mm that opened
> > > the cdev if it was accessing via an IOCTL, and another process
> > > attempting to close the cdev with a different current->mm.
> > 
> > Again I'm still not seeing the race here.  It is entirely possible that the fix
> > I was trying to do way back was mistaken too...  ;-)  I would just delete the
> > last 2 sentences...  and/or reference the commit of those fixes and help
> > explain this more.
> 
> I was attempting to refer to [1], the email that started all of this.
> 
> > > 
> > > To fix this correctly we move the cached value of the mm into the mmu
> > > handler struct for the driver.
> > 
> > Looking at this closer I don't think you need the mm member of mmu_rb_handler
> > any longer.  See below.
> 
> We went back and forth on this as well. We thought it better to rely on our
> own pointer vs looking into the notifier to get the mm. Same reasoning for
> doing our own referecne counting. Question is what is the preferred way
> here. Functionally it makes no difference and I'm fine going either way.

Use the mm pointer in the notifier if you have a notifier registered,
it is clearer as to the lifetime and matches what other places do

> That's the question. It does make sense to do that if we are sticking iwth
> the notifier's reference vs our own explicit one. I'm not 100% sold that we
> should not be doing the ref counting and keeping our own pointer. To me we
> shoudln't be looking inside the notifer struct and instead honestly there
> should probably be an API/helper call to get the mm from it. I'm open to
> either approach.

The notifier is there to support users of the notifier, and nearly all
notifier users require the mm at various points.

Adding get accessors is a bit of a kernel anti-pattern, this isn't Java..

Jason
Jason Gunthorpe Nov. 13, 2020, 12:02 a.m. UTC | #5
On Thu, Nov 12, 2020 at 05:08:22PM -0500, Dennis Dalessandro wrote:
> On 11/12/2020 5:06 PM, Dennis Dalessandro wrote:
> > On 11/12/2020 12:14 PM, Ira Weiny wrote:
> > > On Wed, Nov 11, 2020 at 09:58:37PM -0500, Dennis Dalessandro wrote:
> > > > Two earlier bug fixes have created a security problem in the hfi1
> > > > driver. One fix aimed to solve an issue where current->mm was not valid
> > > > when closing the hfi1 cdev. It attempted to do this by saving a cached
> > > > value of the current->mm pointer at file open time. This is a problem if
> > > > another process with access to the FD calls in via write() or ioctl() to
> > > > pin pages via the hfi driver. The other fix tried to solve a use after
> > > > free by taking a reference on the mm. This was just wrong because its
> > > > possible for a race condition between one process with an mm that opened
> > > > the cdev if it was accessing via an IOCTL, and another process
> > > > attempting to close the cdev with a different current->mm.
> > > 
> > > Again I'm still not seeing the race here.  It is entirely possible
> > > that the fix
> > > I was trying to do way back was mistaken too...  ;-)  I would just
> > > delete the
> > > last 2 sentences...  and/or reference the commit of those fixes and help
> > > explain this more.
> > 
> > I was attempting to refer to [1], the email that started all of this.
> 
> That link should be:
> [1] https://marc.info/?l=linux-rdma&m=159891753806720&w=2

And consistently use lore.kernel.org links please :\

Jason
Ira Weiny Nov. 13, 2020, 12:33 a.m. UTC | #6
On Thu, Nov 12, 2020 at 05:08:22PM -0500, Dennis Dalessandro wrote:
> On 11/12/2020 5:06 PM, Dennis Dalessandro wrote:
> > On 11/12/2020 12:14 PM, Ira Weiny wrote:
> > > On Wed, Nov 11, 2020 at 09:58:37PM -0500, Dennis Dalessandro wrote:
> > > > Two earlier bug fixes have created a security problem in the hfi1
> > > > driver. One fix aimed to solve an issue where current->mm was not valid
> > > > when closing the hfi1 cdev. It attempted to do this by saving a cached
> > > > value of the current->mm pointer at file open time. This is a problem if
> > > > another process with access to the FD calls in via write() or ioctl() to
> > > > pin pages via the hfi driver. The other fix tried to solve a use after
> > > > free by taking a reference on the mm. This was just wrong because its
> > > > possible for a race condition between one process with an mm that opened
> > > > the cdev if it was accessing via an IOCTL, and another process
> > > > attempting to close the cdev with a different current->mm.
> > > 
> > > Again I'm still not seeing the race here.  It is entirely possible
> > > that the fix
> > > I was trying to do way back was mistaken too...  ;-)  I would just
> > > delete the
> > > last 2 sentences...  and/or reference the commit of those fixes and help
> > > explain this more.
> > 
> > I was attempting to refer to [1], the email that started all of this.
> 
> That link should be:
> [1] https://marc.info/?l=linux-rdma&m=159891753806720&w=2

Ah...  ok  That does not have anything to do with a close.  He is worried about
the mm structure going away because the other process exited.  That can't
happen, even with the old code, because the release will not be called until
the child process calls close.

But even if the mm is still around the get_user_pages_fast() in the child is
_going_ to use current->mm if it falls back to the locked version.  Thus it is
going to go off in the weeds when trying to pin user addresses in the child.

Basically there is no 'race', the code is just broken and going to do the wrong
thing regardless of timing!  :-(

The new code is keeping the mm_grab() reference in the mmu_notifier rather than
in the fd structure, an improvement for sure, but for many applications likely
to have almost the same lifetime as before, in the parent process.

Also Jann is 100% correct that the driver should not be operating on the wrong
mm and you are using his methodology #3.[1]

So I think the final point is the key to fixing the bug.  Keeping any
current->mm which is not the one we opened the file with...  (or more
specifically the one which first registered memory).  In some ways this may be
worse than before because technically the parent could open the fd and hand it
to the child and have the child register with it's mm.  But that is ok
really...  May just be odd behavior for some users depending on what operations
they do and in what order.

Ira

[1] Also, you probably should credit Jann for the idea with a suggested by tag.

> 
> -Denny
>
Ira Weiny Nov. 13, 2020, 12:42 a.m. UTC | #7
On Thu, Nov 12, 2020 at 08:01:36PM -0400, Jason Gunthorpe wrote:
> On Thu, Nov 12, 2020 at 05:06:30PM -0500, Dennis Dalessandro wrote:
> > On 11/12/2020 12:14 PM, Ira Weiny wrote:
> > > On Wed, Nov 11, 2020 at 09:58:37PM -0500, Dennis Dalessandro wrote:
> > > > Two earlier bug fixes have created a security problem in the hfi1
> > > > driver. One fix aimed to solve an issue where current->mm was not valid
> > > > when closing the hfi1 cdev. It attempted to do this by saving a cached
> > > > value of the current->mm pointer at file open time. This is a problem if
> > > > another process with access to the FD calls in via write() or ioctl() to
> > > > pin pages via the hfi driver. The other fix tried to solve a use after
> > > > free by taking a reference on the mm. This was just wrong because its
> > > > possible for a race condition between one process with an mm that opened
> > > > the cdev if it was accessing via an IOCTL, and another process
> > > > attempting to close the cdev with a different current->mm.
> > > 
> > > Again I'm still not seeing the race here.  It is entirely possible that the fix
> > > I was trying to do way back was mistaken too...  ;-)  I would just delete the
> > > last 2 sentences...  and/or reference the commit of those fixes and help
> > > explain this more.
> > 
> > I was attempting to refer to [1], the email that started all of this.
> > 
> > > > 
> > > > To fix this correctly we move the cached value of the mm into the mmu
> > > > handler struct for the driver.
> > > 
> > > Looking at this closer I don't think you need the mm member of mmu_rb_handler
> > > any longer.  See below.
> > 
> > We went back and forth on this as well. We thought it better to rely on our
> > own pointer vs looking into the notifier to get the mm. Same reasoning for
> > doing our own referecne counting. Question is what is the preferred way
> > here. Functionally it makes no difference and I'm fine going either way.
> 
> Use the mm pointer in the notifier if you have a notifier registered,
> it is clearer as to the lifetime and matches what other places do

I agree.  IOW, if you are storing a pointer to something you should take a
reference to it.  Here you store 2 pointers but only take 1 reference and the
user has to 'know' the lifetime is the same...

> 
> > That's the question. It does make sense to do that if we are sticking iwth
> > the notifier's reference vs our own explicit one. I'm not 100% sold that we
> > should not be doing the ref counting and keeping our own pointer. To me we
> > shoudln't be looking inside the notifer struct and instead honestly there
> > should probably be an API/helper call to get the mm from it. I'm open to
> > either approach.
> 
> The notifier is there to support users of the notifier, and nearly all
> notifier users require the mm at various points.
> 
> Adding get accessors is a bit of a kernel anti-pattern, this isn't Java..

I'm not 100% on board that a helper call is an anti-pattern, it can help to ID
that users are 'reaching into' the notifier 'object'...  That said, in this
case I would not bother.  Also you defined other helper calls which reach into
the notifier...  so FWIW you were not consistent in this patch.  That flagged
my attention in the first place...   ;-)

Ira
Dennis Dalessandro Nov. 13, 2020, 1:37 p.m. UTC | #8
On 11/12/2020 7:33 PM, Ira Weiny wrote:
> So I think the final point is the key to fixing the bug.  Keeping any
> current->mm which is not the one we opened the file with...  (or more
> specifically the one which first registered memory).  In some ways this may be
> worse than before because technically the parent could open the fd and hand it
> to the child and have the child register with it's mm.  But that is ok
> really...  May just be odd behavior for some users depending on what operations
> they do and in what order.

I don't think that's worse than before. Before we were letting it 
operate on the wrong mm. That's so much worse. Yes, parent could open fd 
and hand it off, which is OK. The "odd" behavior is up to whoever wrote 
the user space code to do that in the first place.

> [1] Also, you probably should credit Jann for the idea with a suggested by tag.

Will change reported-by to suggested-by.

-Denny
Ira Weiny Nov. 13, 2020, 6:31 p.m. UTC | #9
On Fri, Nov 13, 2020 at 08:37:39AM -0500, Dennis Dalessandro wrote:
> On 11/12/2020 7:33 PM, Ira Weiny wrote:
> > So I think the final point is the key to fixing the bug.  Keeping any
> > current->mm which is not the one we opened the file with...  (or more
> > specifically the one which first registered memory).  In some ways this may be
> > worse than before because technically the parent could open the fd and hand it
> > to the child and have the child register with it's mm.  But that is ok
> > really...  May just be odd behavior for some users depending on what operations
> > they do and in what order.
> 
> I don't think that's worse than before.

'worse' was a bad word...  sorry.  I just meant that the new code opens up the
opportunity to get it to work with the child when this is really not the way I
the API was intended...  But now I am just bike shedding...  ;-)

> Before we were letting it operate on
> the wrong mm. That's so much worse. Yes, parent could open fd and hand it
> off, which is OK. The "odd" behavior is up to whoever wrote the user space
> code to do that in the first place.
> 
> > [1] Also, you probably should credit Jann for the idea with a suggested by tag.
> 
> Will change reported-by to suggested-by.

Ah yea both are appropriate.  I'm not an expert on if both should be tagged or
not...  But to me it seems suggested-by is a bit stronger because a solution
was provided beyond just a bug reported.

Ira

> 
> -Denny
>
diff mbox series

Patch

diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index 8ca51e4..329ee4f 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -1,4 +1,5 @@ 
 /*
+ * Copyright(c) 2020 Cornelis Networks, Inc.
  * Copyright(c) 2015-2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
@@ -206,8 +207,6 @@  static int hfi1_file_open(struct inode *inode, struct file *fp)
 	spin_lock_init(&fd->tid_lock);
 	spin_lock_init(&fd->invalid_lock);
 	fd->rec_cpu_num = -1; /* no cpu affinity by default */
-	fd->mm = current->mm;
-	mmgrab(fd->mm);
 	fd->dd = dd;
 	fp->private_data = fd;
 	return 0;
@@ -711,7 +710,6 @@  static int hfi1_file_close(struct inode *inode, struct file *fp)
 
 	deallocate_ctxt(uctxt);
 done:
-	mmdrop(fdata->mm);
 
 	if (atomic_dec_and_test(&dd->user_refcount))
 		complete(&dd->user_comp);
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index b4c6bff..e09e824 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1,6 +1,7 @@ 
 #ifndef _HFI1_KERNEL_H
 #define _HFI1_KERNEL_H
 /*
+ * Copyright(c) 2020 Cornelis Networks, Inc.
  * Copyright(c) 2015-2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
@@ -1451,7 +1452,6 @@  struct hfi1_filedata {
 	u32 invalid_tid_idx;
 	/* protect invalid_tids array and invalid_tid_idx */
 	spinlock_t invalid_lock;
-	struct mm_struct *mm;
 };
 
 extern struct xarray hfi1_dev_table;
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 24ca17b..6be4e79 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -1,4 +1,5 @@ 
 /*
+ * Copyright(c) 2020 Cornelis Networks, Inc.
  * Copyright(c) 2016 - 2017 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
@@ -48,23 +49,11 @@ 
 #include <linux/rculist.h>
 #include <linux/mmu_notifier.h>
 #include <linux/interval_tree_generic.h>
+#include <linux/sched/mm.h>
 
 #include "mmu_rb.h"
 #include "trace.h"
 
-struct mmu_rb_handler {
-	struct mmu_notifier mn;
-	struct rb_root_cached root;
-	void *ops_arg;
-	spinlock_t lock;        /* protect the RB tree */
-	struct mmu_rb_ops *ops;
-	struct mm_struct *mm;
-	struct list_head lru_list;
-	struct work_struct del_work;
-	struct list_head del_list;
-	struct workqueue_struct *wq;
-};
-
 static unsigned long mmu_node_start(struct mmu_rb_node *);
 static unsigned long mmu_node_last(struct mmu_rb_node *);
 static int mmu_notifier_range_start(struct mmu_notifier *,
@@ -92,7 +81,7 @@  static unsigned long mmu_node_last(struct mmu_rb_node *node)
 	return PAGE_ALIGN(node->addr + node->len) - 1;
 }
 
-int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
+int hfi1_mmu_rb_register(void *ops_arg,
 			 struct mmu_rb_ops *ops,
 			 struct workqueue_struct *wq,
 			 struct mmu_rb_handler **handler)
@@ -110,18 +99,19 @@  int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
 	INIT_HLIST_NODE(&handlr->mn.hlist);
 	spin_lock_init(&handlr->lock);
 	handlr->mn.ops = &mn_opts;
-	handlr->mm = mm;
 	INIT_WORK(&handlr->del_work, handle_remove);
 	INIT_LIST_HEAD(&handlr->del_list);
 	INIT_LIST_HEAD(&handlr->lru_list);
 	handlr->wq = wq;
 
-	ret = mmu_notifier_register(&handlr->mn, handlr->mm);
+	ret = mmu_notifier_register(&handlr->mn, current->mm);
 	if (ret) {
 		kfree(handlr);
 		return ret;
 	}
 
+	handlr->mm = current->mm;
+
 	*handler = handlr;
 	return 0;
 }
@@ -133,8 +123,11 @@  void hfi1_mmu_rb_unregister(struct mmu_rb_handler *handler)
 	unsigned long flags;
 	struct list_head del_list;
 
+	if (current->mm && (handler->mm != current->mm))
+		return;
+
 	/* Unregister first so we don't get any more notifications. */
-	mmu_notifier_unregister(&handler->mn, handler->mm);
+	mmu_notifier_unregister(&handler->mn, handler->mn.mm);
 
 	/*
 	 * Make sure the wq delete handler is finished running.  It will not
@@ -166,6 +159,10 @@  int hfi1_mmu_rb_insert(struct mmu_rb_handler *handler,
 	int ret = 0;
 
 	trace_hfi1_mmu_rb_insert(mnode->addr, mnode->len);
+
+	if (current->mm != handler->mm)
+		return -EPERM;
+
 	spin_lock_irqsave(&handler->lock, flags);
 	node = __mmu_rb_search(handler, mnode->addr, mnode->len);
 	if (node) {
@@ -180,6 +177,7 @@  int hfi1_mmu_rb_insert(struct mmu_rb_handler *handler,
 		__mmu_int_rb_remove(mnode, &handler->root);
 		list_del(&mnode->list); /* remove from LRU list */
 	}
+	mnode->handler = handler;
 unlock:
 	spin_unlock_irqrestore(&handler->lock, flags);
 	return ret;
@@ -217,6 +215,9 @@  bool hfi1_mmu_rb_remove_unless_exact(struct mmu_rb_handler *handler,
 	unsigned long flags;
 	bool ret = false;
 
+	if (current->mm != handler->mm)
+		return ret;
+
 	spin_lock_irqsave(&handler->lock, flags);
 	node = __mmu_rb_search(handler, addr, len);
 	if (node) {
@@ -239,6 +240,9 @@  void hfi1_mmu_rb_evict(struct mmu_rb_handler *handler, void *evict_arg)
 	unsigned long flags;
 	bool stop = false;
 
+	if (current->mm != handler->mm)
+		return;
+
 	INIT_LIST_HEAD(&del_list);
 
 	spin_lock_irqsave(&handler->lock, flags);
@@ -272,6 +276,9 @@  void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 {
 	unsigned long flags;
 
+	if (current->mm != handler->mm)
+		return;
+
 	/* Validity of handler and node pointers has been checked by caller. */
 	trace_hfi1_mmu_rb_remove(node->addr, node->len);
 	spin_lock_irqsave(&handler->lock, flags);
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.h b/drivers/infiniband/hw/hfi1/mmu_rb.h
index f04cec1..e208618 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.h
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.h
@@ -1,4 +1,5 @@ 
 /*
+ * Copyright(c) 2020 Cornelis Networks, Inc.
  * Copyright(c) 2016 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
@@ -54,6 +55,7 @@  struct mmu_rb_node {
 	unsigned long len;
 	unsigned long __last;
 	struct rb_node node;
+	struct mmu_rb_handler *handler;
 	struct list_head list;
 };
 
@@ -71,7 +73,20 @@  struct mmu_rb_ops {
 		     void *evict_arg, bool *stop);
 };
 
-int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
+struct mmu_rb_handler {
+	struct mmu_notifier mn;
+	struct rb_root_cached root;
+	void *ops_arg;
+	spinlock_t lock;        /* protect the RB tree */
+	struct mmu_rb_ops *ops;
+	struct list_head lru_list;
+	struct work_struct del_work;
+	struct list_head del_list;
+	struct workqueue_struct *wq;
+	struct mm_struct *mm;
+};
+
+int hfi1_mmu_rb_register(void *ops_arg,
 			 struct mmu_rb_ops *ops,
 			 struct workqueue_struct *wq,
 			 struct mmu_rb_handler **handler);
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
index f81ca20..b94fc7f 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -1,4 +1,5 @@ 
 /*
+ * Copyright(c) 2020 Cornelis Networks, Inc.
  * Copyright(c) 2015-2018 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
@@ -173,15 +174,18 @@  static void unpin_rcv_pages(struct hfi1_filedata *fd,
 {
 	struct page **pages;
 	struct hfi1_devdata *dd = fd->uctxt->dd;
+	struct mm_struct *mm;
 
 	if (mapped) {
 		pci_unmap_single(dd->pcidev, node->dma_addr,
 				 node->npages * PAGE_SIZE, PCI_DMA_FROMDEVICE);
 		pages = &node->pages[idx];
+		mm = mm_from_tid_node(node);
 	} else {
 		pages = &tidbuf->pages[idx];
+		mm = current->mm;
 	}
-	hfi1_release_user_pages(fd->mm, pages, npages, mapped);
+	hfi1_release_user_pages(mm, pages, npages, mapped);
 	fd->tid_n_pinned -= npages;
 }
 
@@ -216,12 +220,12 @@  static int pin_rcv_pages(struct hfi1_filedata *fd, struct tid_user_buf *tidbuf)
 	 * pages, accept the amount pinned so far and program only that.
 	 * User space knows how to deal with partially programmed buffers.
 	 */
-	if (!hfi1_can_pin_pages(dd, fd->mm, fd->tid_n_pinned, npages)) {
+	if (!hfi1_can_pin_pages(dd, current->mm, fd->tid_n_pinned, npages)) {
 		kfree(pages);
 		return -ENOMEM;
 	}
 
-	pinned = hfi1_acquire_user_pages(fd->mm, vaddr, npages, true, pages);
+	pinned = hfi1_acquire_user_pages(current->mm, vaddr, npages, true, pages);
 	if (pinned <= 0) {
 		kfree(pages);
 		return pinned;
@@ -756,7 +760,7 @@  static int set_rcvarray_entry(struct hfi1_filedata *fd,
 
 	if (fd->use_mn) {
 		ret = mmu_interval_notifier_insert(
-			&node->notifier, fd->mm,
+			&node->notifier, current->mm,
 			tbuf->vaddr + (pageidx * PAGE_SIZE), npages * PAGE_SIZE,
 			&tid_mn_ops);
 		if (ret)
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.h b/drivers/infiniband/hw/hfi1/user_exp_rcv.h
index 332abb4..d45c7b6 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.h
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.h
@@ -1,6 +1,7 @@ 
 #ifndef _HFI1_USER_EXP_RCV_H
 #define _HFI1_USER_EXP_RCV_H
 /*
+ * Copyright(c) 2020 - Cornelis Networks, Inc.
  * Copyright(c) 2015 - 2017 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
@@ -95,4 +96,9 @@  int hfi1_user_exp_rcv_clear(struct hfi1_filedata *fd,
 int hfi1_user_exp_rcv_invalid(struct hfi1_filedata *fd,
 			      struct hfi1_tid_info *tinfo);
 
+static inline struct mm_struct *mm_from_tid_node(struct tid_rb_node *node)
+{
+	return node->notifier.mm;
+}
+
 #endif /* _HFI1_USER_EXP_RCV_H */
diff --git a/drivers/infiniband/hw/hfi1/user_sdma.c b/drivers/infiniband/hw/hfi1/user_sdma.c
index a92346e..4a4956f9 100644
--- a/drivers/infiniband/hw/hfi1/user_sdma.c
+++ b/drivers/infiniband/hw/hfi1/user_sdma.c
@@ -1,4 +1,5 @@ 
 /*
+ * Copyright(c) 2020 - Cornelis Networks, Inc.
  * Copyright(c) 2015 - 2018 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
@@ -188,7 +189,6 @@  int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt,
 	atomic_set(&pq->n_reqs, 0);
 	init_waitqueue_head(&pq->wait);
 	atomic_set(&pq->n_locked, 0);
-	pq->mm = fd->mm;
 
 	iowait_init(&pq->busy, 0, NULL, NULL, defer_packet_queue,
 		    activate_packet_queue, NULL, NULL);
@@ -230,7 +230,7 @@  int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt,
 
 	cq->nentries = hfi1_sdma_comp_ring_size;
 
-	ret = hfi1_mmu_rb_register(pq, pq->mm, &sdma_rb_ops, dd->pport->hfi1_wq,
+	ret = hfi1_mmu_rb_register(pq, &sdma_rb_ops, dd->pport->hfi1_wq,
 				   &pq->handler);
 	if (ret) {
 		dd_dev_err(dd, "Failed to register with MMU %d", ret);
@@ -980,13 +980,13 @@  static int pin_sdma_pages(struct user_sdma_request *req,
 
 	npages -= node->npages;
 retry:
-	if (!hfi1_can_pin_pages(pq->dd, pq->mm,
+	if (!hfi1_can_pin_pages(pq->dd, current->mm,
 				atomic_read(&pq->n_locked), npages)) {
 		cleared = sdma_cache_evict(pq, npages);
 		if (cleared >= npages)
 			goto retry;
 	}
-	pinned = hfi1_acquire_user_pages(pq->mm,
+	pinned = hfi1_acquire_user_pages(current->mm,
 					 ((unsigned long)iovec->iov.iov_base +
 					 (node->npages * PAGE_SIZE)), npages, 0,
 					 pages + node->npages);
@@ -995,7 +995,7 @@  static int pin_sdma_pages(struct user_sdma_request *req,
 		return pinned;
 	}
 	if (pinned != npages) {
-		unpin_vector_pages(pq->mm, pages, node->npages, pinned);
+		unpin_vector_pages(current->mm, pages, node->npages, pinned);
 		return -EFAULT;
 	}
 	kfree(node->pages);
@@ -1008,7 +1008,8 @@  static int pin_sdma_pages(struct user_sdma_request *req,
 static void unpin_sdma_pages(struct sdma_mmu_node *node)
 {
 	if (node->npages) {
-		unpin_vector_pages(node->pq->mm, node->pages, 0, node->npages);
+		unpin_vector_pages(mm_from_sdma_node(node), node->pages, 0,
+				   node->npages);
 		atomic_sub(node->npages, &node->pq->n_locked);
 	}
 }
diff --git a/drivers/infiniband/hw/hfi1/user_sdma.h b/drivers/infiniband/hw/hfi1/user_sdma.h
index 9972e0e..1e8c02f 100644
--- a/drivers/infiniband/hw/hfi1/user_sdma.h
+++ b/drivers/infiniband/hw/hfi1/user_sdma.h
@@ -1,6 +1,7 @@ 
 #ifndef _HFI1_USER_SDMA_H
 #define _HFI1_USER_SDMA_H
 /*
+ * Copyright(c) 2020 - Cornelis Networks, Inc.
  * Copyright(c) 2015 - 2018 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
@@ -133,7 +134,6 @@  struct hfi1_user_sdma_pkt_q {
 	unsigned long unpinned;
 	struct mmu_rb_handler *handler;
 	atomic_t n_locked;
-	struct mm_struct *mm;
 };
 
 struct hfi1_user_sdma_comp_q {
@@ -250,4 +250,9 @@  int hfi1_user_sdma_process_request(struct hfi1_filedata *fd,
 				   struct iovec *iovec, unsigned long dim,
 				   unsigned long *count);
 
+static inline struct mm_struct *mm_from_sdma_node(struct sdma_mmu_node *node)
+{
+	return node->rb.handler->mn.mm;
+}
+
 #endif /* _HFI1_USER_SDMA_H */