diff mbox series

[for-rc,v1] IB/hfi1: Move cached value of mm into handler

Message ID 20201029012243.115730.93867.stgit@awfm-01.aw.intel.com (mailing list archive)
State Superseded
Headers show
Series [for-rc,v1] IB/hfi1: Move cached value of mm into handler | expand

Commit Message

Dennis Dalessandro Oct. 29, 2020, 1:22 a.m. UTC
Two earlier bug fixes have created a security problem in the hfi1
driver. One fix aimed to solve an issue where current->mm was not valid
when closing the hfi1 cdev. It attempted to do this by saving a cached
value of the current->mm pointer at file open time. This is a problem if
another process with access to the FD calls in via write() or ioctl() to
pin pages via the hfi driver. The other fix tried to solve a use after
free by taking a reference on the mm. This was just wrong, because it is
possible to race: one process whose mm opened the cdev may be accessing
it via an ioctl while another process, with a different current->mm,
attempts to close the cdev.

To fix this correctly we move the cached value of the mm into the mmu
handler struct for the driver. Now we can check in the insert, evict,
etc. routines that current->mm matches what the handler was registered
for. If not, then don't allow access. Registering the mmu notifier
saves the mm pointer.

Note the check in the unregister is not needed in the event that
current->mm is empty. That means the teardown is happening due to
SIGKILL, the OOM killer, or something along those lines. If current->mm
has a value then it must be checked, and only the task that did the
register can do the unregister.

Since in do_exit() the exit_mm() is called before exit_files(), which
would call our close routine, a reference is needed on the mm. It is
taken when the notifier is registered and dropped in the file close
routine.

Also of note is that we do not do any explicit work to protect the
interval tree notifier. It doesn't seem that this is going to be needed,
since we aren't actually doing anything with current->mm there. The
interval tree notifier code still has a FIXME noted from a previous
commit that will be addressed in a follow-on patch.

Fixes: e0cf75deab81 ("IB/hfi1: Fix mm_struct use after free")
Fixes: 3faa3d9a308e ("IB/hfi1: Make use of mm consistent")
Reported-by: Jann Horn <jannh@google.com>
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>

---

Will add a Cc for stable once the patch is finalized. I'd like to get
some more feedback on this patch especially in the mmu_interval_notifier
stuff. I expect to be able to get a Reviewed-by from Ira as well but
I've made changes he hasn't yet seen so going with a Cc for this round.

This is sort of a v1 as the patch was sent to security folks at first
which included a few others as well. This is the first public posting.

Changes since v0: Removed the checking of the pid and the limitation
that whatever task opens the dev is the only one that can do write() or
ioctl(). While this limitation is OK, it doesn't appear to be strictly
necessary.

Rebased on top of 5.10-rc1. Testing has been done on 5.9 due to a bug in
5.10 that is being worked on (separate issue).
---
 drivers/infiniband/hw/hfi1/file_ops.c     |    3 --
 drivers/infiniband/hw/hfi1/hfi.h          |    1 -
 drivers/infiniband/hw/hfi1/mmu_rb.c       |   40 +++++++++++++++++------------
 drivers/infiniband/hw/hfi1/mmu_rb.h       |   16 +++++++++++-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |   12 ++++++---
 drivers/infiniband/hw/hfi1/user_sdma.c    |   12 ++++-----
 drivers/infiniband/hw/hfi1/user_sdma.h    |   11 +++++++-
 7 files changed, 62 insertions(+), 33 deletions(-)

Comments

Jason Gunthorpe Nov. 3, 2020, 12:22 a.m. UTC | #1
On Wed, Oct 28, 2020 at 09:22:43PM -0400, Dennis Dalessandro wrote:

> Will add a Cc for stable once the patch is finalized. I'd like to get
> some more feedback on this patch especially in the mmu_interval_notifier
> stuff. 

Which mmu_interval_notifier stuff in this patch?

Can you convert this last usage of mmu_notifier in mmu_rb_handler to
use an interval notifier? I seem to remember thinking that was the right
thing, but too complicated for me to attempt.

Jason
Dennis Dalessandro Nov. 3, 2020, 1:41 p.m. UTC | #2
On 11/2/2020 7:22 PM, Jason Gunthorpe wrote:
> On Wed, Oct 28, 2020 at 09:22:43PM -0400, Dennis Dalessandro wrote:
> 
>> Will add a Cc for stable once the patch is finalized. I'd like to get
>> some more feedback on this patch especially in the mmu_interval_notifier
>> stuff.
> 
> Which mmu_interval_notifier stuff in this patch?
> 
> Can you convert this last usage of mmu_notifier in mmu_rb_handler to
> use an interval notifier? I seem to remember thinking it was the right
> thing but too complicated for me to attempt

The call to mmu_interval_notifier_insert() in set_rcvarray_entry() 
specifically. Instead of passing the saved mm I changed it to use current.

Yes we may be able to convert to the interval notifier for the rest of 
it. I'd rather do that as a separate patch and get this fixed for 5.10 
and in the stable kernels and do that as a for-next sort of thing.

-Denny
Ira Weiny Nov. 5, 2020, 12:12 a.m. UTC | #3
On Wed, Oct 28, 2020 at 09:22:43PM -0400, Dennis Dalessandro wrote:
> Two earlier bug fixes have created a security problem in the hfi1
> driver. One fix aimed to solve an issue where current->mm was not valid
> when closing the hfi1 cdev. It attempted to do this by saving a cached
> value of the current->mm pointer at file open time. This is a problem if
> another process with access to the FD calls in via write() or ioctl() to
> pin pages via the hfi driver. The other fix tried to solve a use after
> free by taking a reference on the mm. This was just wrong because its
> possible for a race condition between one process with an mm that opened
> the cdev if it was accessing via an IOCTL, and another process
> attempting to close the cdev with a different current->mm.

I'm not clear on the issue in this last sentence. If process A is
accessing the FD via ioctl and process B closes the FD, release should
not be called until process A calls close as well, right?

I don't think it really matters much; the code is wrong for other
reasons. I'm just trying to understand what you are saying here.

> 
> To fix this correctly we move the cached value of the mm into the mmu
> handler struct for the driver. Now we can check in the insert, evict,
> etc. routines that current->mm matched what the handler was registered
> for. If not, then don't allow access. The register of the mmu notifier
> will save the mm pointer.
> 
> Note the check in the unregister is not needed in the event that
> current->mm is empty. This means the tear down is happening due to a
> SigKill or OOM Killer, something along those lines. If current->mm has a
> value then it must be checked and only the task that did the register
> can do the unregister.

I'm not seeing this bit of logic below...  Sorry.

> 
> Since in do_exit() the exit_mm() is called before exit_files(), which
> would call our close routine a reference is needed on the mm. This is
> taken when the notifier is registered and dropped in the file close
> routine.

This should be moved below as a comment for why you need the mmget/put()

> 
> Also of note is we do not do any explicit work to protect the interval
> tree notifier. It doesn't seem that this is going to be needed since we
> aren't actually doing anything with current->mm. The interval tree
> notifier stuff still has a FIXME noted from a previous commit that will
> be addressed in a follow on patch.
> 
> Fixes: e0cf75deab81 ("IB/hfi1: Fix mm_struct use after free")
> Fixes: 3faa3d9a308e ("IB/hfi1: Make use of mm consistent")
> Reported-by: Jann Horn <jannh@google.com>
> Reported-by: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
> 
> ---
> 
> Will add a Cc for stable once the patch is finalized. I'd like to get
> some more feedback on this patch especially in the mmu_interval_notifier
> stuff. I expect to be able to get a Reviewed-by from Ira as well but
> I've made changes he hasn't yet seen so going with a Cc for this round.
> 
> This is sort of a v1 as the patch was sent to security folks at first
> which included a few others as well. This is the first public posting.

Is this really a security issue or a correctness issue?  Is there really any
way that a user can corrupt another user's data?  Or is this just going to
scribble all over the wrong process or send the wrong data on the wire?

> 
> Changes since v0: Removed the checking of the pid and limitation that
> whatever task opens the dev is the only one that can do write() or
> ioctl(). While this limitation is OK it doesn't appear to be strictly
> necessary.
> 
> Rebased on top of 5.10-rc1. Testing has been done on 5.9 due to a bug in
> 5.10 that is being worked (separate issue).
> ---
>  drivers/infiniband/hw/hfi1/file_ops.c     |    3 --
>  drivers/infiniband/hw/hfi1/hfi.h          |    1 -
>  drivers/infiniband/hw/hfi1/mmu_rb.c       |   40 +++++++++++++++++------------
>  drivers/infiniband/hw/hfi1/mmu_rb.h       |   16 +++++++++++-
>  drivers/infiniband/hw/hfi1/user_exp_rcv.c |   12 ++++++---
>  drivers/infiniband/hw/hfi1/user_sdma.c    |   12 ++++-----
>  drivers/infiniband/hw/hfi1/user_sdma.h    |   11 +++++++-
>  7 files changed, 62 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
> index 8ca51e4..bc15498 100644
> --- a/drivers/infiniband/hw/hfi1/file_ops.c
> +++ b/drivers/infiniband/hw/hfi1/file_ops.c
> @@ -206,8 +206,6 @@ static int hfi1_file_open(struct inode *inode, struct file *fp)
>  	spin_lock_init(&fd->tid_lock);
>  	spin_lock_init(&fd->invalid_lock);
>  	fd->rec_cpu_num = -1; /* no cpu affinity by default */
> -	fd->mm = current->mm;
> -	mmgrab(fd->mm);
>  	fd->dd = dd;
>  	fp->private_data = fd;
>  	return 0;
> @@ -711,7 +709,6 @@ static int hfi1_file_close(struct inode *inode, struct file *fp)
>  
>  	deallocate_ctxt(uctxt);
>  done:
> -	mmdrop(fdata->mm);
>  
>  	if (atomic_dec_and_test(&dd->user_refcount))
>  		complete(&dd->user_comp);
> diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
> index b4c6bff..bc59e61 100644
> --- a/drivers/infiniband/hw/hfi1/hfi.h
> +++ b/drivers/infiniband/hw/hfi1/hfi.h
> @@ -1451,7 +1451,6 @@ struct hfi1_filedata {
>  	u32 invalid_tid_idx;
>  	/* protect invalid_tids array and invalid_tid_idx */
>  	spinlock_t invalid_lock;
> -	struct mm_struct *mm;
>  };
>  
>  extern struct xarray hfi1_dev_table;
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
> index 24ca17b..2b04942 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
> @@ -48,23 +48,11 @@
>  #include <linux/rculist.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/interval_tree_generic.h>
> +#include <linux/sched/mm.h>
>  
>  #include "mmu_rb.h"
>  #include "trace.h"
>  
> -struct mmu_rb_handler {
> -	struct mmu_notifier mn;
> -	struct rb_root_cached root;
> -	void *ops_arg;
> -	spinlock_t lock;        /* protect the RB tree */
> -	struct mmu_rb_ops *ops;
> -	struct mm_struct *mm;
> -	struct list_head lru_list;
> -	struct work_struct del_work;
> -	struct list_head del_list;
> -	struct workqueue_struct *wq;
> -};
> -
>  static unsigned long mmu_node_start(struct mmu_rb_node *);
>  static unsigned long mmu_node_last(struct mmu_rb_node *);
>  static int mmu_notifier_range_start(struct mmu_notifier *,
> @@ -92,7 +80,7 @@ static unsigned long mmu_node_last(struct mmu_rb_node *node)
>  	return PAGE_ALIGN(node->addr + node->len) - 1;
>  }
>  
> -int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
> +int hfi1_mmu_rb_register(void *ops_arg,
>  			 struct mmu_rb_ops *ops,
>  			 struct workqueue_struct *wq,
>  			 struct mmu_rb_handler **handler)
> @@ -110,18 +98,20 @@ int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
>  	INIT_HLIST_NODE(&handlr->mn.hlist);
>  	spin_lock_init(&handlr->lock);
>  	handlr->mn.ops = &mn_opts;
> -	handlr->mm = mm;
>  	INIT_WORK(&handlr->del_work, handle_remove);
>  	INIT_LIST_HEAD(&handlr->del_list);
>  	INIT_LIST_HEAD(&handlr->lru_list);
>  	handlr->wq = wq;
>  
> -	ret = mmu_notifier_register(&handlr->mn, handlr->mm);
> +	ret = mmu_notifier_register(&handlr->mn, current->mm);
>  	if (ret) {
>  		kfree(handlr);
>  		return ret;
>  	}
>  
> +	mmget(current->mm);

I flagged this initially but then reviewed the commit message for why you need
this reference.  I think it is worth a comment here as well as below.
Specifically mentioning the order of calls in do_exit().

> +	handlr->mm = current->mm;
> +
>  	*handler = handlr;
>  	return 0;
>  }
> @@ -134,7 +124,9 @@ void hfi1_mmu_rb_unregister(struct mmu_rb_handler *handler)
>  	struct list_head del_list;
>  
>  	/* Unregister first so we don't get any more notifications. */
> -	mmu_notifier_unregister(&handler->mn, handler->mm);
> +	mmu_notifier_unregister(&handler->mn, handler->mn.mm);
> +
> +	mmput(handler->mm); /* We don't need to touch the mm anymore it can go away */

Better to put this comment on the previous line...

>  
>  	/*
>  	 * Make sure the wq delete handler is finished running.  It will not
> @@ -166,6 +158,10 @@ int hfi1_mmu_rb_insert(struct mmu_rb_handler *handler,
>  	int ret = 0;
>  
>  	trace_hfi1_mmu_rb_insert(mnode->addr, mnode->len);
> +
> +	if (current->mm != handler->mm)
> +		return -EPERM;
> +
>  	spin_lock_irqsave(&handler->lock, flags);
>  	node = __mmu_rb_search(handler, mnode->addr, mnode->len);
>  	if (node) {
> @@ -180,6 +176,7 @@ int hfi1_mmu_rb_insert(struct mmu_rb_handler *handler,
>  		__mmu_int_rb_remove(mnode, &handler->root);
>  		list_del(&mnode->list); /* remove from LRU list */
>  	}
> +	mnode->handler = handler;
>  unlock:
>  	spin_unlock_irqrestore(&handler->lock, flags);
>  	return ret;
> @@ -217,6 +214,9 @@ bool hfi1_mmu_rb_remove_unless_exact(struct mmu_rb_handler *handler,
>  	unsigned long flags;
>  	bool ret = false;
>  
> +	if (current->mm != handler->mm)
> +		return ret;
> +
>  	spin_lock_irqsave(&handler->lock, flags);
>  	node = __mmu_rb_search(handler, addr, len);
>  	if (node) {
> @@ -239,6 +239,9 @@ void hfi1_mmu_rb_evict(struct mmu_rb_handler *handler, void *evict_arg)
>  	unsigned long flags;
>  	bool stop = false;
>  
> +	if (current->mm != handler->mm)
> +		return;
> +
>  	INIT_LIST_HEAD(&del_list);
>  
>  	spin_lock_irqsave(&handler->lock, flags);
> @@ -272,6 +275,9 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>  {
>  	unsigned long flags;
>  
> +	if (current->mm != handler->mm)
> +		return;
> +
>  	/* Validity of handler and node pointers has been checked by caller. */
>  	trace_hfi1_mmu_rb_remove(node->addr, node->len);
>  	spin_lock_irqsave(&handler->lock, flags);
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.h b/drivers/infiniband/hw/hfi1/mmu_rb.h
> index f04cec1..8493a2d 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.h
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.h
> @@ -54,6 +54,7 @@ struct mmu_rb_node {
>  	unsigned long len;
>  	unsigned long __last;
>  	struct rb_node node;
> +	struct mmu_rb_handler *handler;
>  	struct list_head list;
>  };
>  
> @@ -71,7 +72,20 @@ struct mmu_rb_ops {
>  		     void *evict_arg, bool *stop);
>  };
>  
> -int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
> +struct mmu_rb_handler {
> +	struct mmu_notifier mn;
> +	struct rb_root_cached root;
> +	void *ops_arg;
> +	spinlock_t lock;        /* protect the RB tree */
> +	struct mmu_rb_ops *ops;
> +	struct list_head lru_list;
> +	struct work_struct del_work;
> +	struct list_head del_list;
> +	struct workqueue_struct *wq;
> +	struct mm_struct *mm;
> +};
> +
> +int hfi1_mmu_rb_register(void *ops_arg,
>  			 struct mmu_rb_ops *ops,
>  			 struct workqueue_struct *wq,
>  			 struct mmu_rb_handler **handler);
> diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
> index f81ca20..fb72a3c 100644
> --- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
> +++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
> @@ -49,6 +49,7 @@
>  
>  #include "mmu_rb.h"
>  #include "user_exp_rcv.h"
> +

NIT: white space

>  #include "trace.h"
>  
>  static void unlock_exp_tids(struct hfi1_ctxtdata *uctxt,
> @@ -173,15 +174,18 @@ static void unpin_rcv_pages(struct hfi1_filedata *fd,
>  {
>  	struct page **pages;
>  	struct hfi1_devdata *dd = fd->uctxt->dd;
> +	struct mm_struct *mm;
>  
>  	if (mapped) {
>  		pci_unmap_single(dd->pcidev, node->dma_addr,
>  				 node->npages * PAGE_SIZE, PCI_DMA_FROMDEVICE);
>  		pages = &node->pages[idx];
> +		mm = node->notifier.mm;
>  	} else {
>  		pages = &tidbuf->pages[idx];
> +		mm = current->mm;
>  	}
> -	hfi1_release_user_pages(fd->mm, pages, npages, mapped);
> +	hfi1_release_user_pages(mm, pages, npages, mapped);
>  	fd->tid_n_pinned -= npages;
>  }
>  
> @@ -216,12 +220,12 @@ static int pin_rcv_pages(struct hfi1_filedata *fd, struct tid_user_buf *tidbuf)
>  	 * pages, accept the amount pinned so far and program only that.
>  	 * User space knows how to deal with partially programmed buffers.
>  	 */
> -	if (!hfi1_can_pin_pages(dd, fd->mm, fd->tid_n_pinned, npages)) {
> +	if (!hfi1_can_pin_pages(dd, current->mm, fd->tid_n_pinned, npages)) {
>  		kfree(pages);
>  		return -ENOMEM;
>  	}
>  
> -	pinned = hfi1_acquire_user_pages(fd->mm, vaddr, npages, true, pages);
> +	pinned = hfi1_acquire_user_pages(current->mm, vaddr, npages, true, pages);
>  	if (pinned <= 0) {
>  		kfree(pages);
>  		return pinned;
> @@ -756,7 +760,7 @@ static int set_rcvarray_entry(struct hfi1_filedata *fd,
>  
>  	if (fd->use_mn) {
>  		ret = mmu_interval_notifier_insert(
> -			&node->notifier, fd->mm,
> +			&node->notifier, current->mm,
>  			tbuf->vaddr + (pageidx * PAGE_SIZE), npages * PAGE_SIZE,
>  			&tid_mn_ops);
>  		if (ret)
> diff --git a/drivers/infiniband/hw/hfi1/user_sdma.c b/drivers/infiniband/hw/hfi1/user_sdma.c
> index a92346e..058c83c 100644
> --- a/drivers/infiniband/hw/hfi1/user_sdma.c
> +++ b/drivers/infiniband/hw/hfi1/user_sdma.c
> @@ -188,7 +188,6 @@ int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt,
>  	atomic_set(&pq->n_reqs, 0);
>  	init_waitqueue_head(&pq->wait);
>  	atomic_set(&pq->n_locked, 0);
> -	pq->mm = fd->mm;
>  
>  	iowait_init(&pq->busy, 0, NULL, NULL, defer_packet_queue,
>  		    activate_packet_queue, NULL, NULL);
> @@ -230,7 +229,7 @@ int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt,
>  
>  	cq->nentries = hfi1_sdma_comp_ring_size;
>  
> -	ret = hfi1_mmu_rb_register(pq, pq->mm, &sdma_rb_ops, dd->pport->hfi1_wq,
> +	ret = hfi1_mmu_rb_register(pq, &sdma_rb_ops, dd->pport->hfi1_wq,
>  				   &pq->handler);
>  	if (ret) {
>  		dd_dev_err(dd, "Failed to register with MMU %d", ret);
> @@ -980,13 +979,13 @@ static int pin_sdma_pages(struct user_sdma_request *req,
>  
>  	npages -= node->npages;
>  retry:
> -	if (!hfi1_can_pin_pages(pq->dd, pq->mm,
> +	if (!hfi1_can_pin_pages(pq->dd, current->mm,
>  				atomic_read(&pq->n_locked), npages)) {
>  		cleared = sdma_cache_evict(pq, npages);
>  		if (cleared >= npages)
>  			goto retry;
>  	}
> -	pinned = hfi1_acquire_user_pages(pq->mm,
> +	pinned = hfi1_acquire_user_pages(current->mm,
>  					 ((unsigned long)iovec->iov.iov_base +
>  					 (node->npages * PAGE_SIZE)), npages, 0,
>  					 pages + node->npages);
> @@ -995,7 +994,7 @@ static int pin_sdma_pages(struct user_sdma_request *req,
>  		return pinned;
>  	}
>  	if (pinned != npages) {
> -		unpin_vector_pages(pq->mm, pages, node->npages, pinned);
> +		unpin_vector_pages(current->mm, pages, node->npages, pinned);
>  		return -EFAULT;
>  	}
>  	kfree(node->pages);
> @@ -1008,7 +1007,8 @@ static int pin_sdma_pages(struct user_sdma_request *req,
>  static void unpin_sdma_pages(struct sdma_mmu_node *node)
>  {
>  	if (node->npages) {
> -		unpin_vector_pages(node->pq->mm, node->pages, 0, node->npages);
> +		unpin_vector_pages(mm_from_sdma_node(node), node->pages, 0,
> +				   node->npages);
>  		atomic_sub(node->npages, &node->pq->n_locked);
>  	}
>  }
> diff --git a/drivers/infiniband/hw/hfi1/user_sdma.h b/drivers/infiniband/hw/hfi1/user_sdma.h
> index 9972e0e..7126d61 100644
> --- a/drivers/infiniband/hw/hfi1/user_sdma.h
> +++ b/drivers/infiniband/hw/hfi1/user_sdma.h
> @@ -133,7 +133,6 @@ struct hfi1_user_sdma_pkt_q {
>  	unsigned long unpinned;
>  	struct mmu_rb_handler *handler;
>  	atomic_t n_locked;
> -	struct mm_struct *mm;
>  };
>  
>  struct hfi1_user_sdma_comp_q {
> @@ -250,4 +249,14 @@ int hfi1_user_sdma_process_request(struct hfi1_filedata *fd,
>  				   struct iovec *iovec, unsigned long dim,
>  				   unsigned long *count);
>  
> +static inline struct mm_struct *mm_from_tid_node(struct tid_rb_node *node)

Where is mm_from_tid_node() used?

Ira

> +{
> +	return node->notifier.mm;
> +}
> +
> +static inline struct mm_struct *mm_from_sdma_node(struct sdma_mmu_node *node)
> +{
> +	return node->rb.handler->mn.mm;
> +}
> +
>  #endif /* _HFI1_USER_SDMA_H */
>
Jason Gunthorpe Nov. 5, 2020, 3:17 p.m. UTC | #4
On Wed, Nov 04, 2020 at 04:12:45PM -0800, Ira Weiny wrote:
> > -int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
> > +int hfi1_mmu_rb_register(void *ops_arg,
> >  			 struct mmu_rb_ops *ops,
> >  			 struct workqueue_struct *wq,
> >  			 struct mmu_rb_handler **handler)
> > @@ -110,18 +98,20 @@ int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
> >  	INIT_HLIST_NODE(&handlr->mn.hlist);
> >  	spin_lock_init(&handlr->lock);
> >  	handlr->mn.ops = &mn_opts;
> > -	handlr->mm = mm;
> >  	INIT_WORK(&handlr->del_work, handle_remove);
> >  	INIT_LIST_HEAD(&handlr->del_list);
> >  	INIT_LIST_HEAD(&handlr->lru_list);
> >  	handlr->wq = wq;
> >  
> > -	ret = mmu_notifier_register(&handlr->mn, handlr->mm);
> > +	ret = mmu_notifier_register(&handlr->mn, current->mm);
> >  	if (ret) {
> >  		kfree(handlr);
> >  		return ret;
> >  	}
> >  
> > +	mmget(current->mm);
> 
> I flagged this initially but then reviewed the commit message for why you need
> this reference.  I think it is worth a comment here as well as below.
> Specifically mentioning the order of calls in do_exit().

Oh? An mmget should not be held for a long time, and a notifier already
holds a mmgrab while it is registered.

If hfi later needs the mmget then only the part that needs it
should convert the mmgrab to a mmget with mmget_not_zero.

Jason
Dennis Dalessandro Nov. 5, 2020, 5:13 p.m. UTC | #5
On 11/4/2020 7:12 PM, Ira Weiny wrote:
> On Wed, Oct 28, 2020 at 09:22:43PM -0400, Dennis Dalessandro wrote:
>> Two earlier bug fixes have created a security problem in the hfi1
>> driver. One fix aimed to solve an issue where current->mm was not valid
>> when closing the hfi1 cdev. It attempted to do this by saving a cached
>> value of the current->mm pointer at file open time. This is a problem if
>> another process with access to the FD calls in via write() or ioctl() to
>> pin pages via the hfi driver. The other fix tried to solve a use after
>> free by taking a reference on the mm. This was just wrong because its
>> possible for a race condition between one process with an mm that opened
>> the cdev if it was accessing via an IOCTL, and another process
>> attempting to close the cdev with a different current->mm.
> 
> I'm not clear on the issue in this last sentence.  If process A is accessing
> the FD via ioctl then process B closes the fd release should not be called
> until process A calls close as well?

Is that the case? If proc A opened the FD and then forked, won't a call
by B to close the fd end up doing the release?

> I don't think it really matters much the code is wrong for other reasons.  I'm
> just trying to understand what you are saying here.
> 
>>
>> To fix this correctly we move the cached value of the mm into the mmu
>> handler struct for the driver. Now we can check in the insert, evict,
>> etc. routines that current->mm matched what the handler was registered
>> for. If not, then don't allow access. The register of the mmu notifier
>> will save the mm pointer.
>>
>> Note the check in the unregister is not needed in the event that
>> current->mm is empty. This means the tear down is happening due to a
>> SigKill or OOM Killer, something along those lines. If current->mm has a
>> value then it must be checked and only the task that did the register
>> can do the unregister.
> 
> I'm not seeing this bit of logic below...  Sorry.

Hmm. Maybe I goofed on a merge or something. Will fix up for a v2.

>> Since in do_exit() the exit_mm() is called before exit_files(), which
>> would call our close routine a reference is needed on the mm. This is
>> taken when the notifier is registered and dropped in the file close
>> routine.
> 
> This should be moved below as a comment for why you need the mmget/put()

I can do that.

>> Also of note is we do not do any explicit work to protect the interval
>> tree notifier. It doesn't seem that this is going to be needed since we
>> aren't actually doing anything with current->mm. The interval tree
>> notifier stuff still has a FIXME noted from a previous commit that will
>> be addressed in a follow on patch.
>>
>> Fixes: e0cf75deab81 ("IB/hfi1: Fix mm_struct use after free")
>> Fixes: 3faa3d9a308e ("IB/hfi1: Make use of mm consistent")
>> Reported-by: Jann Horn <jannh@google.com>
>> Reported-by: Jason Gunthorpe <jgg@nvidia.com>
>> Cc: Ira Weiny <ira.weiny@intel.com>
>> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
>> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
>>
>> ---
>>
>> Will add a Cc for stable once the patch is finalized. I'd like to get
>> some more feedback on this patch especially in the mmu_interval_notifier
>> stuff. I expect to be able to get a Reviewed-by from Ira as well but
>> I've made changes he hasn't yet seen so going with a Cc for this round.
>>
>> This is sort of a v1 as the patch was sent to security folks at first
>> which included a few others as well. This is the first public posting.
> 
> Is this really a security issue or a correctness issue?  Is there really any
> way that a user can corrupt another users data?  Or is this just going to
> scribble all over the wrong process or send the wrong data on the wire?

I guess it depends on your viewpoint. It's a security issue in the sense
that we are currently letting one process operate while using a pointer
to another's mm. Could someone exploit it to do something bad? I don't
know. Due to the nature of the bug we thought it prudent not to take
chances here.

>>
>> Changes since v0: Removed the checking of the pid and limitation that
>> whatever task opens the dev is the only one that can do write() or
>> ioctl(). While this limitation is OK it doesn't appear to be strictly
>> necessary.
>>
>> Rebased on top of 5.10-rc1. Testing has been done on 5.9 due to a bug in
>> 5.10 that is being worked (separate issue).
>> ---
>>   drivers/infiniband/hw/hfi1/file_ops.c     |    3 --
>>   drivers/infiniband/hw/hfi1/hfi.h          |    1 -
>>   drivers/infiniband/hw/hfi1/mmu_rb.c       |   40 +++++++++++++++++------------
>>   drivers/infiniband/hw/hfi1/mmu_rb.h       |   16 +++++++++++-
>>   drivers/infiniband/hw/hfi1/user_exp_rcv.c |   12 ++++++---
>>   drivers/infiniband/hw/hfi1/user_sdma.c    |   12 ++++-----
>>   drivers/infiniband/hw/hfi1/user_sdma.h    |   11 +++++++-
>>   7 files changed, 62 insertions(+), 33 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
>> index 8ca51e4..bc15498 100644
>> --- a/drivers/infiniband/hw/hfi1/file_ops.c
>> +++ b/drivers/infiniband/hw/hfi1/file_ops.c
>> @@ -206,8 +206,6 @@ static int hfi1_file_open(struct inode *inode, struct file *fp)
>>   	spin_lock_init(&fd->tid_lock);
>>   	spin_lock_init(&fd->invalid_lock);
>>   	fd->rec_cpu_num = -1; /* no cpu affinity by default */
>> -	fd->mm = current->mm;
>> -	mmgrab(fd->mm);
>>   	fd->dd = dd;
>>   	fp->private_data = fd;
>>   	return 0;
>> @@ -711,7 +709,6 @@ static int hfi1_file_close(struct inode *inode, struct file *fp)
>>   
>>   	deallocate_ctxt(uctxt);
>>   done:
>> -	mmdrop(fdata->mm);
>>   
>>   	if (atomic_dec_and_test(&dd->user_refcount))
>>   		complete(&dd->user_comp);
>> diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
>> index b4c6bff..bc59e61 100644
>> --- a/drivers/infiniband/hw/hfi1/hfi.h
>> +++ b/drivers/infiniband/hw/hfi1/hfi.h
>> @@ -1451,7 +1451,6 @@ struct hfi1_filedata {
>>   	u32 invalid_tid_idx;
>>   	/* protect invalid_tids array and invalid_tid_idx */
>>   	spinlock_t invalid_lock;
>> -	struct mm_struct *mm;
>>   };
>>   
>>   extern struct xarray hfi1_dev_table;
>> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
>> index 24ca17b..2b04942 100644
>> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
>> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
>> @@ -48,23 +48,11 @@
>>   #include <linux/rculist.h>
>>   #include <linux/mmu_notifier.h>
>>   #include <linux/interval_tree_generic.h>
>> +#include <linux/sched/mm.h>
>>   
>>   #include "mmu_rb.h"
>>   #include "trace.h"
>>   
>> -struct mmu_rb_handler {
>> -	struct mmu_notifier mn;
>> -	struct rb_root_cached root;
>> -	void *ops_arg;
>> -	spinlock_t lock;        /* protect the RB tree */
>> -	struct mmu_rb_ops *ops;
>> -	struct mm_struct *mm;
>> -	struct list_head lru_list;
>> -	struct work_struct del_work;
>> -	struct list_head del_list;
>> -	struct workqueue_struct *wq;
>> -};
>> -
>>   static unsigned long mmu_node_start(struct mmu_rb_node *);
>>   static unsigned long mmu_node_last(struct mmu_rb_node *);
>>   static int mmu_notifier_range_start(struct mmu_notifier *,
>> @@ -92,7 +80,7 @@ static unsigned long mmu_node_last(struct mmu_rb_node *node)
>>   	return PAGE_ALIGN(node->addr + node->len) - 1;
>>   }
>>   
>> -int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
>> +int hfi1_mmu_rb_register(void *ops_arg,
>>   			 struct mmu_rb_ops *ops,
>>   			 struct workqueue_struct *wq,
>>   			 struct mmu_rb_handler **handler)
>> @@ -110,18 +98,20 @@ int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
>>   	INIT_HLIST_NODE(&handlr->mn.hlist);
>>   	spin_lock_init(&handlr->lock);
>>   	handlr->mn.ops = &mn_opts;
>> -	handlr->mm = mm;
>>   	INIT_WORK(&handlr->del_work, handle_remove);
>>   	INIT_LIST_HEAD(&handlr->del_list);
>>   	INIT_LIST_HEAD(&handlr->lru_list);
>>   	handlr->wq = wq;
>>   
>> -	ret = mmu_notifier_register(&handlr->mn, handlr->mm);
>> +	ret = mmu_notifier_register(&handlr->mn, current->mm);
>>   	if (ret) {
>>   		kfree(handlr);
>>   		return ret;
>>   	}
>>   
>> +	mmget(current->mm);
> 
> I flagged this initially but then reviewed the commit message for why you need
> this reference.  I think it is worth a comment here as well as below.
> Specifically mentioning the order of calls in do_exit().

Sure.

>> +	handlr->mm = current->mm;
>> +
>>   	*handler = handlr;
>>   	return 0;
>>   }
>> @@ -134,7 +124,9 @@ void hfi1_mmu_rb_unregister(struct mmu_rb_handler *handler)
>>   	struct list_head del_list;
>>   
>>   	/* Unregister first so we don't get any more notifications. */
>> -	mmu_notifier_unregister(&handler->mn, handler->mm);
>> +	mmu_notifier_unregister(&handler->mn, handler->mn.mm);
>> +
>> +	mmput(handler->mm); /* We don't need to touch the mm anymore it can go away */
> 
> Better to put this comment on the previous line...

Will do.

> 
>>   
>>   	/*
>>   	 * Make sure the wq delete handler is finished running.  It will not
>> @@ -166,6 +158,10 @@ int hfi1_mmu_rb_insert(struct mmu_rb_handler *handler,
>>   	int ret = 0;
>>   
>>   	trace_hfi1_mmu_rb_insert(mnode->addr, mnode->len);
>> +
>> +	if (current->mm != handler->mm)
>> +		return -EPERM;
>> +
>>   	spin_lock_irqsave(&handler->lock, flags);
>>   	node = __mmu_rb_search(handler, mnode->addr, mnode->len);
>>   	if (node) {
>> @@ -180,6 +176,7 @@ int hfi1_mmu_rb_insert(struct mmu_rb_handler *handler,
>>   		__mmu_int_rb_remove(mnode, &handler->root);
>>   		list_del(&mnode->list); /* remove from LRU list */
>>   	}
>> +	mnode->handler = handler;
>>   unlock:
>>   	spin_unlock_irqrestore(&handler->lock, flags);
>>   	return ret;
>> @@ -217,6 +214,9 @@ bool hfi1_mmu_rb_remove_unless_exact(struct mmu_rb_handler *handler,
>>   	unsigned long flags;
>>   	bool ret = false;
>>   
>> +	if (current->mm != handler->mm)
>> +		return ret;
>> +
>>   	spin_lock_irqsave(&handler->lock, flags);
>>   	node = __mmu_rb_search(handler, addr, len);
>>   	if (node) {
>> @@ -239,6 +239,9 @@ void hfi1_mmu_rb_evict(struct mmu_rb_handler *handler, void *evict_arg)
>>   	unsigned long flags;
>>   	bool stop = false;
>>   
>> +	if (current->mm != handler->mm)
>> +		return;
>> +
>>   	INIT_LIST_HEAD(&del_list);
>>   
>>   	spin_lock_irqsave(&handler->lock, flags);
>> @@ -272,6 +275,9 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>>   {
>>   	unsigned long flags;
>>   
>> +	if (current->mm != handler->mm)
>> +		return;
>> +
>>   	/* Validity of handler and node pointers has been checked by caller. */
>>   	trace_hfi1_mmu_rb_remove(node->addr, node->len);
>>   	spin_lock_irqsave(&handler->lock, flags);
>> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.h b/drivers/infiniband/hw/hfi1/mmu_rb.h
>> index f04cec1..8493a2d 100644
>> --- a/drivers/infiniband/hw/hfi1/mmu_rb.h
>> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.h
>> @@ -54,6 +54,7 @@ struct mmu_rb_node {
>>   	unsigned long len;
>>   	unsigned long __last;
>>   	struct rb_node node;
>> +	struct mmu_rb_handler *handler;
>>   	struct list_head list;
>>   };
>>   
>> @@ -71,7 +72,20 @@ struct mmu_rb_ops {
>>   		     void *evict_arg, bool *stop);
>>   };
>>   
>> -int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
>> +struct mmu_rb_handler {
>> +	struct mmu_notifier mn;
>> +	struct rb_root_cached root;
>> +	void *ops_arg;
>> +	spinlock_t lock;        /* protect the RB tree */
>> +	struct mmu_rb_ops *ops;
>> +	struct list_head lru_list;
>> +	struct work_struct del_work;
>> +	struct list_head del_list;
>> +	struct workqueue_struct *wq;
>> +	struct mm_struct *mm;
>> +};
>> +
>> +int hfi1_mmu_rb_register(void *ops_arg,
>>   			 struct mmu_rb_ops *ops,
>>   			 struct workqueue_struct *wq,
>>   			 struct mmu_rb_handler **handler);
>> diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
>> index f81ca20..fb72a3c 100644
>> --- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
>> +++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
>> @@ -49,6 +49,7 @@
>>   
>>   #include "mmu_rb.h"
>>   #include "user_exp_rcv.h"
>> +
> 
> NIT: white space

Sorry about that, got it.

> 
>>   #include "trace.h"
>>   
>>   static void unlock_exp_tids(struct hfi1_ctxtdata *uctxt,
>> @@ -173,15 +174,18 @@ static void unpin_rcv_pages(struct hfi1_filedata *fd,
>>   {
>>   	struct page **pages;
>>   	struct hfi1_devdata *dd = fd->uctxt->dd;
>> +	struct mm_struct *mm;
>>   
>>   	if (mapped) {
>>   		pci_unmap_single(dd->pcidev, node->dma_addr,
>>   				 node->npages * PAGE_SIZE, PCI_DMA_FROMDEVICE);
>>   		pages = &node->pages[idx];
>> +		mm = node->notifier.mm;
>>   	} else {
>>   		pages = &tidbuf->pages[idx];
>> +		mm = current->mm;
>>   	}
>> -	hfi1_release_user_pages(fd->mm, pages, npages, mapped);
>> +	hfi1_release_user_pages(mm, pages, npages, mapped);
>>   	fd->tid_n_pinned -= npages;
>>   }
>>   
>> @@ -216,12 +220,12 @@ static int pin_rcv_pages(struct hfi1_filedata *fd, struct tid_user_buf *tidbuf)
>>   	 * pages, accept the amount pinned so far and program only that.
>>   	 * User space knows how to deal with partially programmed buffers.
>>   	 */
>> -	if (!hfi1_can_pin_pages(dd, fd->mm, fd->tid_n_pinned, npages)) {
>> +	if (!hfi1_can_pin_pages(dd, current->mm, fd->tid_n_pinned, npages)) {
>>   		kfree(pages);
>>   		return -ENOMEM;
>>   	}
>>   
>> -	pinned = hfi1_acquire_user_pages(fd->mm, vaddr, npages, true, pages);
>> +	pinned = hfi1_acquire_user_pages(current->mm, vaddr, npages, true, pages);
>>   	if (pinned <= 0) {
>>   		kfree(pages);
>>   		return pinned;
>> @@ -756,7 +760,7 @@ static int set_rcvarray_entry(struct hfi1_filedata *fd,
>>   
>>   	if (fd->use_mn) {
>>   		ret = mmu_interval_notifier_insert(
>> -			&node->notifier, fd->mm,
>> +			&node->notifier, current->mm,
>>   			tbuf->vaddr + (pageidx * PAGE_SIZE), npages * PAGE_SIZE,
>>   			&tid_mn_ops);
>>   		if (ret)
>> diff --git a/drivers/infiniband/hw/hfi1/user_sdma.c b/drivers/infiniband/hw/hfi1/user_sdma.c
>> index a92346e..058c83c 100644
>> --- a/drivers/infiniband/hw/hfi1/user_sdma.c
>> +++ b/drivers/infiniband/hw/hfi1/user_sdma.c
>> @@ -188,7 +188,6 @@ int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt,
>>   	atomic_set(&pq->n_reqs, 0);
>>   	init_waitqueue_head(&pq->wait);
>>   	atomic_set(&pq->n_locked, 0);
>> -	pq->mm = fd->mm;
>>   
>>   	iowait_init(&pq->busy, 0, NULL, NULL, defer_packet_queue,
>>   		    activate_packet_queue, NULL, NULL);
>> @@ -230,7 +229,7 @@ int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt,
>>   
>>   	cq->nentries = hfi1_sdma_comp_ring_size;
>>   
>> -	ret = hfi1_mmu_rb_register(pq, pq->mm, &sdma_rb_ops, dd->pport->hfi1_wq,
>> +	ret = hfi1_mmu_rb_register(pq, &sdma_rb_ops, dd->pport->hfi1_wq,
>>   				   &pq->handler);
>>   	if (ret) {
>>   		dd_dev_err(dd, "Failed to register with MMU %d", ret);
>> @@ -980,13 +979,13 @@ static int pin_sdma_pages(struct user_sdma_request *req,
>>   
>>   	npages -= node->npages;
>>   retry:
>> -	if (!hfi1_can_pin_pages(pq->dd, pq->mm,
>> +	if (!hfi1_can_pin_pages(pq->dd, current->mm,
>>   				atomic_read(&pq->n_locked), npages)) {
>>   		cleared = sdma_cache_evict(pq, npages);
>>   		if (cleared >= npages)
>>   			goto retry;
>>   	}
>> -	pinned = hfi1_acquire_user_pages(pq->mm,
>> +	pinned = hfi1_acquire_user_pages(current->mm,
>>   					 ((unsigned long)iovec->iov.iov_base +
>>   					 (node->npages * PAGE_SIZE)), npages, 0,
>>   					 pages + node->npages);
>> @@ -995,7 +994,7 @@ static int pin_sdma_pages(struct user_sdma_request *req,
>>   		return pinned;
>>   	}
>>   	if (pinned != npages) {
>> -		unpin_vector_pages(pq->mm, pages, node->npages, pinned);
>> +		unpin_vector_pages(current->mm, pages, node->npages, pinned);
>>   		return -EFAULT;
>>   	}
>>   	kfree(node->pages);
>> @@ -1008,7 +1007,8 @@ static int pin_sdma_pages(struct user_sdma_request *req,
>>   static void unpin_sdma_pages(struct sdma_mmu_node *node)
>>   {
>>   	if (node->npages) {
>> -		unpin_vector_pages(node->pq->mm, node->pages, 0, node->npages);
>> +		unpin_vector_pages(mm_from_sdma_node(node), node->pages, 0,
>> +				   node->npages);
>>   		atomic_sub(node->npages, &node->pq->n_locked);
>>   	}
>>   }
>> diff --git a/drivers/infiniband/hw/hfi1/user_sdma.h b/drivers/infiniband/hw/hfi1/user_sdma.h
>> index 9972e0e..7126d61 100644
>> --- a/drivers/infiniband/hw/hfi1/user_sdma.h
>> +++ b/drivers/infiniband/hw/hfi1/user_sdma.h
>> @@ -133,7 +133,6 @@ struct hfi1_user_sdma_pkt_q {
>>   	unsigned long unpinned;
>>   	struct mmu_rb_handler *handler;
>>   	atomic_t n_locked;
>> -	struct mm_struct *mm;
>>   };
>>   
>>   struct hfi1_user_sdma_comp_q {
>> @@ -250,4 +249,14 @@ int hfi1_user_sdma_process_request(struct hfi1_filedata *fd,
>>   				   struct iovec *iovec, unsigned long dim,
>>   				   unsigned long *count);
>>   
>> +static inline struct mm_struct *mm_from_tid_node(struct tid_rb_node *node)
> 
> Where is mm_from_tid_node() used?

Will fix.

> 
> Ira
> 
>> +{
>> +	return node->notifier.mm;
>> +}
>> +
>> +static inline struct mm_struct *mm_from_sdma_node(struct sdma_mmu_node *node)
>> +{
>> +	return node->rb.handler->mn.mm;
>> +}
>> +
>>   #endif /* _HFI1_USER_SDMA_H */
>>

-Denny
Ira Weiny Nov. 10, 2020, 4:36 a.m. UTC | #6
On Thu, Nov 05, 2020 at 12:13:23PM -0500, Dennis Dalessandro wrote:
> On 11/4/2020 7:12 PM, Ira Weiny wrote:
> > On Wed, Oct 28, 2020 at 09:22:43PM -0400, Dennis Dalessandro wrote:
> > > Two earlier bug fixes have created a security problem in the hfi1
> > > driver. One fix aimed to solve an issue where current->mm was not valid
> > > when closing the hfi1 cdev. It attempted to do this by saving a cached
> > > value of the current->mm pointer at file open time. This is a problem if
> > > another process with access to the FD calls in via write() or ioctl() to
> > > pin pages via the hfi driver. The other fix tried to solve a use after
> > > free by taking a reference on the mm. This was just wrong because it's
> > > possible for a race condition between one process with an mm that opened
> > > the cdev if it was accessing via an IOCTL, and another process
> > > attempting to close the cdev with a different current->mm.
> > 
> > I'm not clear on the issue in this last sentence.  If process A is accessing
> > the FD via ioctl and process B closes the fd, release should not be called
> > until process A calls close as well?
> 
> Is that the case? If proc A opened the FD then forked, won't a call by B to
> close the fd end up doing the release?

Only if A closes the FD as well.  In which case A can't be in the middle of an
ioctl.

> 
> > I don't think it really matters much the code is wrong for other reasons.  I'm
> > just trying to understand what you are saying here.
> > 
> > > 
> > > To fix this correctly we move the cached value of the mm into the mmu
> > > handler struct for the driver. Now we can check in the insert, evict,
> > > etc. routines that current->mm matched what the handler was registered
> > > for. If not, then don't allow access. The register of the mmu notifier
> > > will save the mm pointer.
> > > 
> > > Note the check in the unregister is not needed in the event that
> > > current->mm is empty. This means the tear down is happening due to a
> > > SigKill or OOM Killer, something along those lines. If current->mm has a
> > > value then it must be checked and only the task that did the register
> > > can do the unregister.
> > 
> > I'm not seeing this bit of logic below...  Sorry.
> 
> Hmm. Maybe I goofed on a merge or something. Will fix up for a v2.
> 
> > > Since in do_exit() the exit_mm() is called before exit_files(), which
> > > would call our close routine a reference is needed on the mm. This is
> > > taken when the notifier is registered and dropped in the file close
> > > routine.
> > 
> > This should be moved below as a comment for why you need the mmget/put()
> 
> I can do that.
> 
> > > Also of note is we do not do any explicit work to protect the interval
> > > tree notifier. It doesn't seem that this is going to be needed since we
> > > aren't actually doing anything with current->mm. The interval tree
> > > notifier stuff still has a FIXME noted from a previous commit that will
> > > be addressed in a follow on patch.
> > > 
> > > Fixes: e0cf75deab81 ("IB/hfi1: Fix mm_struct use after free")
> > > Fixes: 3faa3d9a308e ("IB/hfi1: Make use of mm consistent")
> > > Reported-by: Jann Horn <jannh@google.com>
> > > Reported-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Cc: Ira Weiny <ira.weiny@intel.com>
> > > Reviewed-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
> > > Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
> > > 
> > > ---
> > > 
> > > Will add a Cc for stable once the patch is finalized. I'd like to get
> > > some more feedback on this patch especially in the mmu_interval_notifier
> > > stuff. I expect to be able to get a Reviewed-by from Ira as well but
> > > I've made changes he hasn't yet seen so going with a Cc for this round.
> > > 
> > > This is sort of a v1 as the patch was sent to security folks at first
> > > which included a few others as well. This is the first public posting.
> > 
> > Is this really a security issue or a correctness issue?  Is there really any
> > way that a user can corrupt another users data?  Or is this just going to
> > scribble all over the wrong process or send the wrong data on the wire?
> 
> I guess it depends on your viewpoint. It's a security issue in the sense
> that we are currently letting one process operate while using a pointer to
> another's mm pointer. Could someone exploit it to do something bad, I don't
> know. I'd rather not take the chance. Due to the nature of the bug we
> thought it prudent to not take chances here.

I agree this should be fixed, but I think it is low as a security issue.  I
have not really thought about what it might mean in an HPC sense of MPI jobs,
but I still think it would be contained to a single user's data.  So it still
seems like a bug which would only affect a single user, and I'm not sure how
it could be exploited to reach other users' data.

That said we should get it fixed...

[snip]

> > > @@ -92,7 +80,7 @@ static unsigned long mmu_node_last(struct mmu_rb_node *node)
> > >   	return PAGE_ALIGN(node->addr + node->len) - 1;
> > >   }
> > > -int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
> > > +int hfi1_mmu_rb_register(void *ops_arg,
> > >   			 struct mmu_rb_ops *ops,
> > >   			 struct workqueue_struct *wq,
> > >   			 struct mmu_rb_handler **handler)
> > > @@ -110,18 +98,20 @@ int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
> > >   	INIT_HLIST_NODE(&handlr->mn.hlist);
> > >   	spin_lock_init(&handlr->lock);
> > >   	handlr->mn.ops = &mn_opts;
> > > -	handlr->mm = mm;
> > >   	INIT_WORK(&handlr->del_work, handle_remove);
> > >   	INIT_LIST_HEAD(&handlr->del_list);
> > >   	INIT_LIST_HEAD(&handlr->lru_list);
> > >   	handlr->wq = wq;
> > > -	ret = mmu_notifier_register(&handlr->mn, handlr->mm);
> > > +	ret = mmu_notifier_register(&handlr->mn, current->mm);
> > >   	if (ret) {
> > >   		kfree(handlr);
> > >   		return ret;
> > >   	}
> > > +	mmget(current->mm);
> > 
> > I flagged this initially but then reviewed the commit message for why you need
> > this reference.  I think it is worth a comment here as well as below.
> > Specifically mentioning the order of calls in do_exit().
> 
> Sure.

I'm somewhat with Jason on this regarding the mmget().  It does say the
reference should not be held for a long time, which is why I flagged the code
above initially.

But then you mentioned the whole do_exit() ordering, which I did not take the
time to really dig into.  So perhaps it should be revisited if you need to use
the mmget_not_zero() call to catch this do_exit() scenario?

Ira
Dennis Dalessandro Nov. 10, 2020, 1:45 p.m. UTC | #7
On 11/9/2020 11:36 PM, Ira Weiny wrote:
> On Thu, Nov 05, 2020 at 12:13:23PM -0500, Dennis Dalessandro wrote:
>> On 11/4/2020 7:12 PM, Ira Weiny wrote:
>>> On Wed, Oct 28, 2020 at 09:22:43PM -0400, Dennis Dalessandro wrote:
>>>> Two earlier bug fixes have created a security problem in the hfi1
>>>> driver. One fix aimed to solve an issue where current->mm was not valid
>>>> when closing the hfi1 cdev. It attempted to do this by saving a cached
>>>> value of the current->mm pointer at file open time. This is a problem if
>>>> another process with access to the FD calls in via write() or ioctl() to
>>>> pin pages via the hfi driver. The other fix tried to solve a use after
>>>> free by taking a reference on the mm. This was just wrong because it's
>>>> possible for a race condition between one process with an mm that opened
>>>> the cdev if it was accessing via an IOCTL, and another process
>>>> attempting to close the cdev with a different current->mm.
>>>
>>> I'm not clear on the issue in this last sentence.  If process A is accessing
>>> the FD via ioctl and process B closes the fd, release should not be called
>>> until process A calls close as well?
>>
>> Is that the case? If proc A opened the FD then forked, won't a call by B to
>> close the fd end up doing the release?
> 
> Only if A closes the FD as well.  In which case A can't be in the middle of an
> ioctl.
> 
>>
>>> I don't think it really matters much the code is wrong for other reasons.  I'm
>>> just trying to understand what you are saying here.
>>>
>>>>
>>>> To fix this correctly we move the cached value of the mm into the mmu
>>>> handler struct for the driver. Now we can check in the insert, evict,
>>>> etc. routines that current->mm matched what the handler was registered
>>>> for. If not, then don't allow access. The register of the mmu notifier
>>>> will save the mm pointer.
>>>>
>>>> Note the check in the unregister is not needed in the event that
>>>> current->mm is empty. This means the tear down is happening due to a
>>>> SigKill or OOM Killer, something along those lines. If current->mm has a
>>>> value then it must be checked and only the task that did the register
>>>> can do the unregister.
>>>
>>> I'm not seeing this bit of logic below...  Sorry.
>>
>> Hmm. Maybe I goofed on a merge or something. Will fix up for a v2.
>>
>>>> Since in do_exit() the exit_mm() is called before exit_files(), which
>>>> would call our close routine a reference is needed on the mm. This is
>>>> taken when the notifier is registered and dropped in the file close
>>>> routine.
>>>
>>> This should be moved below as a comment for why you need the mmget/put()
>>
>> I can do that.
>>
>>>> Also of note is we do not do any explicit work to protect the interval
>>>> tree notifier. It doesn't seem that this is going to be needed since we
>>>> aren't actually doing anything with current->mm. The interval tree
>>>> notifier stuff still has a FIXME noted from a previous commit that will
>>>> be addressed in a follow on patch.
>>>>
>>>> Fixes: e0cf75deab81 ("IB/hfi1: Fix mm_struct use after free")
>>>> Fixes: 3faa3d9a308e ("IB/hfi1: Make use of mm consistent")
>>>> Reported-by: Jann Horn <jannh@google.com>
>>>> Reported-by: Jason Gunthorpe <jgg@nvidia.com>
>>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>>> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
>>>> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
>>>>
>>>> ---
>>>>
>>>> Will add a Cc for stable once the patch is finalized. I'd like to get
>>>> some more feedback on this patch especially in the mmu_interval_notifier
>>>> stuff. I expect to be able to get a Reviewed-by from Ira as well but
>>>> I've made changes he hasn't yet seen so going with a Cc for this round.
>>>>
>>>> This is sort of a v1 as the patch was sent to security folks at first
>>>> which included a few others as well. This is the first public posting.
>>>
>>> Is this really a security issue or a correctness issue?  Is there really any
>>> way that a user can corrupt another users data?  Or is this just going to
>>> scribble all over the wrong process or send the wrong data on the wire?
>>
>> I guess it depends on your viewpoint. It's a security issue in the sense
>> that we are currently letting one process operate while using a pointer to
>> another's mm pointer. Could someone exploit it to do something bad, I don't
>> know. I'd rather not take the chance. Due to the nature of the bug we
>> thought it prudent to not take chances here.
> 
> I agree this should be fixed, but I think it is low as a security issue.  I
> have not really thought about what it might mean in an HPC sense of MPI jobs,
> but I still think it would be contained to a single user's data.  So it still
> seems like a bug which would only affect a single user, and I'm not sure how
> it could be exploited to reach other users' data.
> 
> That said we should get it fixed...
> 
> [snip]

Mostly agree. It affects a single user from the OS point of view, though 
depending on the use case it could affect multiple jobs. Is it a major 
security hole? No, we aren't running around with our hair on fire, but we 
do need to get it fixed in this RC. I really don't want to wait until 5.11, 
and we'll want to send this to stable as well.

>>>> @@ -92,7 +80,7 @@ static unsigned long mmu_node_last(struct mmu_rb_node *node)
>>>>    	return PAGE_ALIGN(node->addr + node->len) - 1;
>>>>    }
>>>> -int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
>>>> +int hfi1_mmu_rb_register(void *ops_arg,
>>>>    			 struct mmu_rb_ops *ops,
>>>>    			 struct workqueue_struct *wq,
>>>>    			 struct mmu_rb_handler **handler)
>>>> @@ -110,18 +98,20 @@ int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
>>>>    	INIT_HLIST_NODE(&handlr->mn.hlist);
>>>>    	spin_lock_init(&handlr->lock);
>>>>    	handlr->mn.ops = &mn_opts;
>>>> -	handlr->mm = mm;
>>>>    	INIT_WORK(&handlr->del_work, handle_remove);
>>>>    	INIT_LIST_HEAD(&handlr->del_list);
>>>>    	INIT_LIST_HEAD(&handlr->lru_list);
>>>>    	handlr->wq = wq;
>>>> -	ret = mmu_notifier_register(&handlr->mn, handlr->mm);
>>>> +	ret = mmu_notifier_register(&handlr->mn, current->mm);
>>>>    	if (ret) {
>>>>    		kfree(handlr);
>>>>    		return ret;
>>>>    	}
>>>> +	mmget(current->mm);
>>>
>>> I flagged this initially but then reviewed the commit message for why you need
>>> this reference.  I think it is worth a comment here as well as below.
>>> Specifically mentioning the order of calls in do_exit().
>>
>> Sure.
> 
> I'm somewhat with Jason on this regarding the mmget().  It does say the
> reference should not be held for a long time, which is why I flagged the code
> above initially.
> 
> But then you mentioned the whole do_exit() ordering, which I did not take the
> time to really dig into.  So perhaps it should be revisited if you need to use
> the mmget_not_zero() call to catch this do_exit() scenario?

We went back and forth on whether to rely on the reference taken by the 
notifier or to take our own. In the end we thought it was better to be 
explicit, but relying on the reference the notifier register takes works 
just as well. I've made that change, and the others per my previous reply. 
Once it clears testing I will post a v2.

The original issue of the mm going away too early was most likely because a 
kernel thread was doing the close, and hence the unregister. I've added a 
check to prevent this as part of v2; it should have been there as part of 
the v1 patch.

-Denny

Patch

diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index 8ca51e4..bc15498 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -206,8 +206,6 @@  static int hfi1_file_open(struct inode *inode, struct file *fp)
 	spin_lock_init(&fd->tid_lock);
 	spin_lock_init(&fd->invalid_lock);
 	fd->rec_cpu_num = -1; /* no cpu affinity by default */
-	fd->mm = current->mm;
-	mmgrab(fd->mm);
 	fd->dd = dd;
 	fp->private_data = fd;
 	return 0;
@@ -711,7 +709,6 @@  static int hfi1_file_close(struct inode *inode, struct file *fp)
 
 	deallocate_ctxt(uctxt);
 done:
-	mmdrop(fdata->mm);
 
 	if (atomic_dec_and_test(&dd->user_refcount))
 		complete(&dd->user_comp);
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index b4c6bff..bc59e61 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1451,7 +1451,6 @@  struct hfi1_filedata {
 	u32 invalid_tid_idx;
 	/* protect invalid_tids array and invalid_tid_idx */
 	spinlock_t invalid_lock;
-	struct mm_struct *mm;
 };
 
 extern struct xarray hfi1_dev_table;
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 24ca17b..2b04942 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -48,23 +48,11 @@ 
 #include <linux/rculist.h>
 #include <linux/mmu_notifier.h>
 #include <linux/interval_tree_generic.h>
+#include <linux/sched/mm.h>
 
 #include "mmu_rb.h"
 #include "trace.h"
 
-struct mmu_rb_handler {
-	struct mmu_notifier mn;
-	struct rb_root_cached root;
-	void *ops_arg;
-	spinlock_t lock;        /* protect the RB tree */
-	struct mmu_rb_ops *ops;
-	struct mm_struct *mm;
-	struct list_head lru_list;
-	struct work_struct del_work;
-	struct list_head del_list;
-	struct workqueue_struct *wq;
-};
-
 static unsigned long mmu_node_start(struct mmu_rb_node *);
 static unsigned long mmu_node_last(struct mmu_rb_node *);
 static int mmu_notifier_range_start(struct mmu_notifier *,
@@ -92,7 +80,7 @@  static unsigned long mmu_node_last(struct mmu_rb_node *node)
 	return PAGE_ALIGN(node->addr + node->len) - 1;
 }
 
-int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
+int hfi1_mmu_rb_register(void *ops_arg,
 			 struct mmu_rb_ops *ops,
 			 struct workqueue_struct *wq,
 			 struct mmu_rb_handler **handler)
@@ -110,18 +98,20 @@  int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
 	INIT_HLIST_NODE(&handlr->mn.hlist);
 	spin_lock_init(&handlr->lock);
 	handlr->mn.ops = &mn_opts;
-	handlr->mm = mm;
 	INIT_WORK(&handlr->del_work, handle_remove);
 	INIT_LIST_HEAD(&handlr->del_list);
 	INIT_LIST_HEAD(&handlr->lru_list);
 	handlr->wq = wq;
 
-	ret = mmu_notifier_register(&handlr->mn, handlr->mm);
+	ret = mmu_notifier_register(&handlr->mn, current->mm);
 	if (ret) {
 		kfree(handlr);
 		return ret;
 	}
 
+	mmget(current->mm);
+	handlr->mm = current->mm;
+
 	*handler = handlr;
 	return 0;
 }
@@ -134,7 +124,9 @@  void hfi1_mmu_rb_unregister(struct mmu_rb_handler *handler)
 	struct list_head del_list;
 
 	/* Unregister first so we don't get any more notifications. */
-	mmu_notifier_unregister(&handler->mn, handler->mm);
+	mmu_notifier_unregister(&handler->mn, handler->mn.mm);
+
+	mmput(handler->mm); /* We don't need to touch the mm anymore it can go away */
 
 	/*
 	 * Make sure the wq delete handler is finished running.  It will not
@@ -166,6 +158,10 @@  int hfi1_mmu_rb_insert(struct mmu_rb_handler *handler,
 	int ret = 0;
 
 	trace_hfi1_mmu_rb_insert(mnode->addr, mnode->len);
+
+	if (current->mm != handler->mm)
+		return -EPERM;
+
 	spin_lock_irqsave(&handler->lock, flags);
 	node = __mmu_rb_search(handler, mnode->addr, mnode->len);
 	if (node) {
@@ -180,6 +176,7 @@  int hfi1_mmu_rb_insert(struct mmu_rb_handler *handler,
 		__mmu_int_rb_remove(mnode, &handler->root);
 		list_del(&mnode->list); /* remove from LRU list */
 	}
+	mnode->handler = handler;
 unlock:
 	spin_unlock_irqrestore(&handler->lock, flags);
 	return ret;
@@ -217,6 +214,9 @@  bool hfi1_mmu_rb_remove_unless_exact(struct mmu_rb_handler *handler,
 	unsigned long flags;
 	bool ret = false;
 
+	if (current->mm != handler->mm)
+		return ret;
+
 	spin_lock_irqsave(&handler->lock, flags);
 	node = __mmu_rb_search(handler, addr, len);
 	if (node) {
@@ -239,6 +239,9 @@  void hfi1_mmu_rb_evict(struct mmu_rb_handler *handler, void *evict_arg)
 	unsigned long flags;
 	bool stop = false;
 
+	if (current->mm != handler->mm)
+		return;
+
 	INIT_LIST_HEAD(&del_list);
 
 	spin_lock_irqsave(&handler->lock, flags);
@@ -272,6 +275,9 @@  void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 {
 	unsigned long flags;
 
+	if (current->mm != handler->mm)
+		return;
+
 	/* Validity of handler and node pointers has been checked by caller. */
 	trace_hfi1_mmu_rb_remove(node->addr, node->len);
 	spin_lock_irqsave(&handler->lock, flags);
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.h b/drivers/infiniband/hw/hfi1/mmu_rb.h
index f04cec1..8493a2d 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.h
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.h
@@ -54,6 +54,7 @@  struct mmu_rb_node {
 	unsigned long len;
 	unsigned long __last;
 	struct rb_node node;
+	struct mmu_rb_handler *handler;
 	struct list_head list;
 };
 
@@ -71,7 +72,20 @@  struct mmu_rb_ops {
 		     void *evict_arg, bool *stop);
 };
 
-int hfi1_mmu_rb_register(void *ops_arg, struct mm_struct *mm,
+struct mmu_rb_handler {
+	struct mmu_notifier mn;
+	struct rb_root_cached root;
+	void *ops_arg;
+	spinlock_t lock;        /* protect the RB tree */
+	struct mmu_rb_ops *ops;
+	struct list_head lru_list;
+	struct work_struct del_work;
+	struct list_head del_list;
+	struct workqueue_struct *wq;
+	struct mm_struct *mm;
+};
+
+int hfi1_mmu_rb_register(void *ops_arg,
 			 struct mmu_rb_ops *ops,
 			 struct workqueue_struct *wq,
 			 struct mmu_rb_handler **handler);
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
index f81ca20..fb72a3c 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -49,6 +49,7 @@ 
 
 #include "mmu_rb.h"
 #include "user_exp_rcv.h"
+
 #include "trace.h"
 
 static void unlock_exp_tids(struct hfi1_ctxtdata *uctxt,
@@ -173,15 +174,18 @@  static void unpin_rcv_pages(struct hfi1_filedata *fd,
 {
 	struct page **pages;
 	struct hfi1_devdata *dd = fd->uctxt->dd;
+	struct mm_struct *mm;
 
 	if (mapped) {
 		pci_unmap_single(dd->pcidev, node->dma_addr,
 				 node->npages * PAGE_SIZE, PCI_DMA_FROMDEVICE);
 		pages = &node->pages[idx];
+		mm = node->notifier.mm;
 	} else {
 		pages = &tidbuf->pages[idx];
+		mm = current->mm;
 	}
-	hfi1_release_user_pages(fd->mm, pages, npages, mapped);
+	hfi1_release_user_pages(mm, pages, npages, mapped);
 	fd->tid_n_pinned -= npages;
 }
 
@@ -216,12 +220,12 @@  static int pin_rcv_pages(struct hfi1_filedata *fd, struct tid_user_buf *tidbuf)
 	 * pages, accept the amount pinned so far and program only that.
 	 * User space knows how to deal with partially programmed buffers.
 	 */
-	if (!hfi1_can_pin_pages(dd, fd->mm, fd->tid_n_pinned, npages)) {
+	if (!hfi1_can_pin_pages(dd, current->mm, fd->tid_n_pinned, npages)) {
 		kfree(pages);
 		return -ENOMEM;
 	}
 
-	pinned = hfi1_acquire_user_pages(fd->mm, vaddr, npages, true, pages);
+	pinned = hfi1_acquire_user_pages(current->mm, vaddr, npages, true, pages);
 	if (pinned <= 0) {
 		kfree(pages);
 		return pinned;
@@ -756,7 +760,7 @@  static int set_rcvarray_entry(struct hfi1_filedata *fd,
 
 	if (fd->use_mn) {
 		ret = mmu_interval_notifier_insert(
-			&node->notifier, fd->mm,
+			&node->notifier, current->mm,
 			tbuf->vaddr + (pageidx * PAGE_SIZE), npages * PAGE_SIZE,
 			&tid_mn_ops);
 		if (ret)
diff --git a/drivers/infiniband/hw/hfi1/user_sdma.c b/drivers/infiniband/hw/hfi1/user_sdma.c
index a92346e..058c83c 100644
--- a/drivers/infiniband/hw/hfi1/user_sdma.c
+++ b/drivers/infiniband/hw/hfi1/user_sdma.c
@@ -188,7 +188,6 @@  int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt,
 	atomic_set(&pq->n_reqs, 0);
 	init_waitqueue_head(&pq->wait);
 	atomic_set(&pq->n_locked, 0);
-	pq->mm = fd->mm;
 
 	iowait_init(&pq->busy, 0, NULL, NULL, defer_packet_queue,
 		    activate_packet_queue, NULL, NULL);
@@ -230,7 +229,7 @@  int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt,
 
 	cq->nentries = hfi1_sdma_comp_ring_size;
 
-	ret = hfi1_mmu_rb_register(pq, pq->mm, &sdma_rb_ops, dd->pport->hfi1_wq,
+	ret = hfi1_mmu_rb_register(pq, &sdma_rb_ops, dd->pport->hfi1_wq,
 				   &pq->handler);
 	if (ret) {
 		dd_dev_err(dd, "Failed to register with MMU %d", ret);
@@ -980,13 +979,13 @@  static int pin_sdma_pages(struct user_sdma_request *req,
 
 	npages -= node->npages;
 retry:
-	if (!hfi1_can_pin_pages(pq->dd, pq->mm,
+	if (!hfi1_can_pin_pages(pq->dd, current->mm,
 				atomic_read(&pq->n_locked), npages)) {
 		cleared = sdma_cache_evict(pq, npages);
 		if (cleared >= npages)
 			goto retry;
 	}
-	pinned = hfi1_acquire_user_pages(pq->mm,
+	pinned = hfi1_acquire_user_pages(current->mm,
 					 ((unsigned long)iovec->iov.iov_base +
 					 (node->npages * PAGE_SIZE)), npages, 0,
 					 pages + node->npages);
@@ -995,7 +994,7 @@  static int pin_sdma_pages(struct user_sdma_request *req,
 		return pinned;
 	}
 	if (pinned != npages) {
-		unpin_vector_pages(pq->mm, pages, node->npages, pinned);
+		unpin_vector_pages(current->mm, pages, node->npages, pinned);
 		return -EFAULT;
 	}
 	kfree(node->pages);
@@ -1008,7 +1007,8 @@  static int pin_sdma_pages(struct user_sdma_request *req,
 static void unpin_sdma_pages(struct sdma_mmu_node *node)
 {
 	if (node->npages) {
-		unpin_vector_pages(node->pq->mm, node->pages, 0, node->npages);
+		unpin_vector_pages(mm_from_sdma_node(node), node->pages, 0,
+				   node->npages);
 		atomic_sub(node->npages, &node->pq->n_locked);
 	}
 }
diff --git a/drivers/infiniband/hw/hfi1/user_sdma.h b/drivers/infiniband/hw/hfi1/user_sdma.h
index 9972e0e..7126d61 100644
--- a/drivers/infiniband/hw/hfi1/user_sdma.h
+++ b/drivers/infiniband/hw/hfi1/user_sdma.h
@@ -133,7 +133,6 @@  struct hfi1_user_sdma_pkt_q {
 	unsigned long unpinned;
 	struct mmu_rb_handler *handler;
 	atomic_t n_locked;
-	struct mm_struct *mm;
 };
 
 struct hfi1_user_sdma_comp_q {
@@ -250,4 +249,14 @@  int hfi1_user_sdma_process_request(struct hfi1_filedata *fd,
 				   struct iovec *iovec, unsigned long dim,
 				   unsigned long *count);
 
+static inline struct mm_struct *mm_from_tid_node(struct tid_rb_node *node)
+{
+	return node->notifier.mm;
+}
+
+static inline struct mm_struct *mm_from_sdma_node(struct sdma_mmu_node *node)
+{
+	return node->rb.handler->mn.mm;
+}
+
 #endif /* _HFI1_USER_SDMA_H */