diff mbox series

[for-rc,6/6] IB/hfi1: Remove user expected buffer invalidate race

Message ID 167328549178.1472310.9867497376936699488.stgit@awfm-02.cornelisnetworks.com (mailing list archive)
State Accepted
Headers show
Series HFI fixups around expected recv | expand

Commit Message

Dennis Dalessandro Jan. 9, 2023, 5:31 p.m. UTC
From: Dean Luick <dean.luick@cornelisnetworks.com>

During setup, there is a possible race between a page invalidate
and hardware programming.  Add a covering invalidate over the user
target range during setup.  If anything within that range is
invalidated during setup, fail the setup.  Once set up, each
TID will have its own invalidate callback and invalidate.

Fixes: 3889551db212 ("RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv")
Signed-off-by: Dean Luick <dean.luick@cornelisnetworks.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
---
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |   58 +++++++++++++++++++++++++++--
 drivers/infiniband/hw/hfi1/user_exp_rcv.h |    2 +
 2 files changed, 55 insertions(+), 5 deletions(-)

Comments

Jason Gunthorpe Jan. 16, 2023, 3:41 p.m. UTC | #1
On Mon, Jan 09, 2023 at 12:31:31PM -0500, Dennis Dalessandro wrote:

> +	if (fd->use_mn) {
> +		ret = mmu_interval_notifier_insert(
> +			&tidbuf->notifier, current->mm,
> +			tidbuf->vaddr, tidbuf->npages * PAGE_SIZE,
> +			&tid_cover_ops);

This is still not the right way to use these notifiers, you should be
removing the calls to mmu_notifier_register()

Jason
Dean Luick Jan. 17, 2023, 7:19 p.m. UTC | #2
On 1/16/2023 9:41 AM, Jason Gunthorpe wrote:
> On Mon, Jan 09, 2023 at 12:31:31PM -0500, Dennis Dalessandro wrote:
>
>> +    if (fd->use_mn) {
>> +            ret = mmu_interval_notifier_insert(
>> +                    &tidbuf->notifier, current->mm,
>> +                    tidbuf->vaddr, tidbuf->npages * PAGE_SIZE,
>> +                    &tid_cover_ops);
>
> This is still not the right way to use these notifiers, you should be
> removing the calls to mmu_notifier_register()

I am confused by your comment.  This is the user expected receive code.  There are no calls to mmu_notifier_register() here.  You removed those calls when you added the FIXME.  The Send DMA side still has calls to mmu_notifier_register().  This series is all about user expected receive.

-Dean

>
> Jason

External recipient
Jason Gunthorpe Jan. 18, 2023, 12:42 p.m. UTC | #3
On Tue, Jan 17, 2023 at 01:19:14PM -0600, Dean Luick wrote:
> On 1/16/2023 9:41 AM, Jason Gunthorpe wrote:
> > On Mon, Jan 09, 2023 at 12:31:31PM -0500, Dennis Dalessandro wrote:
> >
> >> +    if (fd->use_mn) {
> >> +            ret = mmu_interval_notifier_insert(
> >> +                    &tidbuf->notifier, current->mm,
> >> +                    tidbuf->vaddr, tidbuf->npages * PAGE_SIZE,
> >> +                    &tid_cover_ops);
> >
> > This is still not the right way to use these notifiers, you should be
> > removing the calls to mmu_notifier_register()
> 
> I am confused by your comment.  This is the user expected receive
> code.  There are no calls to mmu_notifier_register() here.  You
> removed those calls when you added the FIXME.  The Send DMA side
> still has calls to mmu_notifier_register().  This series is all
> about user expected receive.

Then something else seems wrong because you shouldn't be removing the
notifiers in the same function you add them

Jason
Dean Luick Jan. 19, 2023, 4 p.m. UTC | #4
On 1/18/2023 6:42 AM, Jason Gunthorpe wrote:
> On Tue, Jan 17, 2023 at 01:19:14PM -0600, Dean Luick wrote:
>> On 1/16/2023 9:41 AM, Jason Gunthorpe wrote:
>>> On Mon, Jan 09, 2023 at 12:31:31PM -0500, Dennis Dalessandro wrote:
>>>
>>>> +    if (fd->use_mn) {
>>>> +            ret = mmu_interval_notifier_insert(
>>>> +                    &tidbuf->notifier, current->mm,
>>>> +                    tidbuf->vaddr, tidbuf->npages * PAGE_SIZE,
>>>> +                    &tid_cover_ops);
>>>
>>> This is still not the right way to use these notifiers, you should be
>>> removing the calls to mmu_notifier_register()
>>
>> I am confused by your comment.  This is the user expected receive
>> code.  There are no calls to mmu_notifier_register() here.  You
>> removed those calls when you added the FIXME.  The Send DMA side
>> still has calls to mmu_notifier_register().  This series is all
>> about user expected receive.
>
> Then something else seems wrong because you shouldn't be removing the
> notifiers in the same function you add them

The add-then-remove is intentional.  The purpose is to make sure there are no invalidates while we pin pages and set up the "permanent" notifiers that cover exact ranges based on sequential pages and how the DMA hardware is programmed.  Once the programmed hardware range notifiers are in place, the covering range serves no purpose and can be removed.


-Dean

External recipient
Jason Gunthorpe Jan. 19, 2023, 6:05 p.m. UTC | #5
On Thu, Jan 19, 2023 at 10:00:39AM -0600, Dean Luick wrote:
> On 1/18/2023 6:42 AM, Jason Gunthorpe wrote:
> > On Tue, Jan 17, 2023 at 01:19:14PM -0600, Dean Luick wrote:
> >> On 1/16/2023 9:41 AM, Jason Gunthorpe wrote:
> >>> On Mon, Jan 09, 2023 at 12:31:31PM -0500, Dennis Dalessandro wrote:
> >>>
> >>>> +    if (fd->use_mn) {
> >>>> +            ret = mmu_interval_notifier_insert(
> >>>> +                    &tidbuf->notifier, current->mm,
> >>>> +                    tidbuf->vaddr, tidbuf->npages * PAGE_SIZE,
> >>>> +                    &tid_cover_ops);
> >>>
> >>> This is still not the right way to use these notifiers, you should be
> >>> removing the calls to mmu_notifier_register()
> >>
> >> I am confused by your comment.  This is the user expected receive
> >> code.  There are no calls to mmu_notifier_register() here.  You
> >> removed those calls when you added the FIXME.  The Send DMA side
> >> still has calls to mmu_notifier_register().  This series is all
> >> about user expected receive.
> >
> > Then something else seems wrong because you shouldn't be removing the
> > notifiers in the same function you add them
> 
> The add-then-remove is intentional.  The purpose is to make sure
> there are no invalidates while we pin pages and set up the
> "permanent" notifiers that cover exact ranges based on sequential
> pages and how the DMA hardware is programmed.  Once the programmed
> hardware range notifiers are in place, the covering range serves no
> purpose and can be removed.

Uh, that doesn't sound very good, it is very expensive to create these
notifiers, they were not intended to be overly narrowed like this -
you are better to have a wider range and deal with the rare false hit
than to take the cost of register/unregister on every activity??

Also I'm not entirely sure this can be race free, having two notifiers
responsible for the same memory at the same time sounds like Danger
Danger sort of situation..

Jason
Dean Luick Jan. 20, 2023, 4:09 p.m. UTC | #6
On 1/19/2023 12:05 PM, Jason Gunthorpe wrote:
> On Thu, Jan 19, 2023 at 10:00:39AM -0600, Dean Luick wrote:
>> On 1/18/2023 6:42 AM, Jason Gunthorpe wrote:
>>> On Tue, Jan 17, 2023 at 01:19:14PM -0600, Dean Luick wrote:
>>>> On 1/16/2023 9:41 AM, Jason Gunthorpe wrote:
>>>>> On Mon, Jan 09, 2023 at 12:31:31PM -0500, Dennis Dalessandro wrote:
>>>>>
>>>>>> +    if (fd->use_mn) {
>>>>>> +            ret = mmu_interval_notifier_insert(
>>>>>> +                    &tidbuf->notifier, current->mm,
>>>>>> +                    tidbuf->vaddr, tidbuf->npages * PAGE_SIZE,
>>>>>> +                    &tid_cover_ops);
>>>>>
>>>>> This is still not the right way to use these notifiers, you should be
>>>>> removing the calls to mmu_notifier_register()
>>>>
>>>> I am confused by your comment.  This is the user expected receive
>>>> code.  There are no calls to mmu_notifier_register() here.  You
>>>> removed those calls when you added the FIXME.  The Send DMA side
>>>> still has calls to mmu_notifier_register().  This series is all
>>>> about user expected receive.
>>>
>>> Then something else seems wrong because you shouldn't be removing the
>>> notifiers in the same function you add them
>>
>> The add-then-remove is intentional.  The purpose is to make sure
>> there are no invalidates while we pin pages and set up the
>> "permanent" notifiers that cover exact ranges based on sequential
>> pages and how the DMA hardware is programmed.  Once the programmed
>> hardware range notifiers are in place, the covering range serves no
>> purpose and can be removed.
>
> Uh, that doesn't sound very good, it is very expensive to create these
> notifiers, they were not intended to be overly narrowed like this -
> you are better to have a wider range and deal with the rare false hit
> than to take the cost of register/unregister on every activity??

The temporary cover is a compromise effort to avoid the following situation.  Suppose, as you suggest, we keep a single range on the registration call.  Then:

User registers range A.  The hardware breaks it up into several separate pieces - "TIDs".  Due to pressure, software releases several, but not all the pieces of A.  Because not all of A is gone, the notify range is retained.

Later, software registers range B, which includes parts of the original A.  A new single range is registered.  Since this overlaps with A, we now have multiple range notifiers over the same memory range.  All would still be functionally correct, since there would only be one owner of each TID. But we now have overlapping permanent notify ranges.

Repeat this several times and you can have a mess of overlapping ranges, each with a little piece that is actually live.

With a unique notifier over each TID, the notifier can be dismissed when the TID is released.

We judged it would be simpler and better to keep the smaller, more specific ranges that matched the programmed hardware.  No searching or checking needed.  A 1-1 correspondence.

None of this would be needed if there was a correct, safe way to hold off all memory invalidates in a target range (or just in general) while we set up our per-TID notifiers.


> Also I'm not entirely sure this can be race free, having two notifiers
> responsible for the same memory at the same time sounds like Danger
> Danger sort of situation..

No danger at all.  The temporary "cover" range only registers that something has happened in the whole range.  The "permanent", aka TID, range is responsible for clearing the hardware.  Separate responsibilities.

-Dean

External recipient
diff mbox series

Patch

diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
index f402af1e2903..b02f2f0809c8 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -23,6 +23,9 @@  static void cacheless_tid_rb_remove(struct hfi1_filedata *fdata,
 static bool tid_rb_invalidate(struct mmu_interval_notifier *mni,
 			      const struct mmu_notifier_range *range,
 			      unsigned long cur_seq);
+static bool tid_cover_invalidate(struct mmu_interval_notifier *mni,
+			         const struct mmu_notifier_range *range,
+			         unsigned long cur_seq);
 static int program_rcvarray(struct hfi1_filedata *fd, struct tid_user_buf *,
 			    struct tid_group *grp,
 			    unsigned int start, u16 count,
@@ -36,6 +39,9 @@  static void clear_tid_node(struct hfi1_filedata *fd, struct tid_rb_node *node);
 static const struct mmu_interval_notifier_ops tid_mn_ops = {
 	.invalidate = tid_rb_invalidate,
 };
+static const struct mmu_interval_notifier_ops tid_cover_ops = {
+	.invalidate = tid_cover_invalidate,
+};
 
 /*
  * Initialize context and file private data needed for Expected
@@ -254,6 +260,7 @@  int hfi1_user_exp_rcv_setup(struct hfi1_filedata *fd,
 		tididx = 0, mapped, mapped_pages = 0;
 	u32 *tidlist = NULL;
 	struct tid_user_buf *tidbuf;
+	unsigned long mmu_seq = 0;
 
 	if (!PAGE_ALIGNED(tinfo->vaddr))
 		return -EINVAL;
@@ -264,6 +271,7 @@  int hfi1_user_exp_rcv_setup(struct hfi1_filedata *fd,
 	if (!tidbuf)
 		return -ENOMEM;
 
+	mutex_init(&tidbuf->cover_mutex);
 	tidbuf->vaddr = tinfo->vaddr;
 	tidbuf->length = tinfo->length;
 	tidbuf->psets = kcalloc(uctxt->expected_count, sizeof(*tidbuf->psets),
@@ -273,6 +281,16 @@  int hfi1_user_exp_rcv_setup(struct hfi1_filedata *fd,
 		goto fail_release_mem;
 	}
 
+	if (fd->use_mn) {
+		ret = mmu_interval_notifier_insert(
+			&tidbuf->notifier, current->mm,
+			tidbuf->vaddr, tidbuf->npages * PAGE_SIZE,
+			&tid_cover_ops);
+		if (ret)
+			goto fail_release_mem;
+		mmu_seq = mmu_interval_read_begin(&tidbuf->notifier);
+	}
+
 	pinned = pin_rcv_pages(fd, tidbuf);
 	if (pinned <= 0) {
 		ret = (pinned < 0) ? pinned : -ENOSPC;
@@ -415,6 +433,20 @@  int hfi1_user_exp_rcv_setup(struct hfi1_filedata *fd,
 	unpin_rcv_pages(fd, tidbuf, NULL, mapped_pages, pinned - mapped_pages,
 			false);
 
+	if (fd->use_mn) {
+		/* check for an invalidate during setup */
+		bool fail = false;
+
+		mutex_lock(&tidbuf->cover_mutex);
+		fail = mmu_interval_read_retry(&tidbuf->notifier, mmu_seq);
+		mutex_unlock(&tidbuf->cover_mutex);
+
+		if (fail) {
+			ret = -EBUSY;
+			goto fail_unprogram;
+		}
+	}
+
 	tinfo->tidcnt = tididx;
 	tinfo->length = mapped_pages * PAGE_SIZE;
 
@@ -424,6 +456,8 @@  int hfi1_user_exp_rcv_setup(struct hfi1_filedata *fd,
 		goto fail_unprogram;
 	}
 
+	if (fd->use_mn)
+		mmu_interval_notifier_remove(&tidbuf->notifier);
 	kfree(tidbuf->pages);
 	kfree(tidbuf->psets);
 	kfree(tidbuf);
@@ -442,6 +476,8 @@  int hfi1_user_exp_rcv_setup(struct hfi1_filedata *fd,
 	fd->tid_used -= pageset_count;
 	spin_unlock(&fd->tid_lock);
 fail_unpin:
+	if (fd->use_mn)
+		mmu_interval_notifier_remove(&tidbuf->notifier);
 	if (pinned > 0)
 		unpin_rcv_pages(fd, tidbuf, NULL, 0, pinned, false);
 fail_release_mem:
@@ -740,11 +776,6 @@  static int set_rcvarray_entry(struct hfi1_filedata *fd,
 			&tid_mn_ops);
 		if (ret)
 			goto out_unmap;
-		/*
-		 * FIXME: This is in the wrong order, the notifier should be
-		 * established before the pages are pinned by pin_rcv_pages.
-		 */
-		mmu_interval_read_begin(&node->notifier);
 	}
 	fd->entry_to_rb[node->rcventry - uctxt->expected_base] = node;
 
@@ -919,6 +950,23 @@  static bool tid_rb_invalidate(struct mmu_interval_notifier *mni,
 	return true;
 }
 
+static bool tid_cover_invalidate(struct mmu_interval_notifier *mni,
+			         const struct mmu_notifier_range *range,
+			         unsigned long cur_seq)
+{
+	struct tid_user_buf *tidbuf =
+		container_of(mni, struct tid_user_buf, notifier);
+
+	/* take action only if unmapping */
+	if (range->event == MMU_NOTIFY_UNMAP) {
+		mutex_lock(&tidbuf->cover_mutex);
+		mmu_interval_set_seq(mni, cur_seq);
+		mutex_unlock(&tidbuf->cover_mutex);
+	}
+
+	return true;
+}
+
 static void cacheless_tid_rb_remove(struct hfi1_filedata *fdata,
 				    struct tid_rb_node *tnode)
 {
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.h b/drivers/infiniband/hw/hfi1/user_exp_rcv.h
index 2ddb3dac7d91..f8ee997d0050 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.h
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.h
@@ -16,6 +16,8 @@  struct tid_pageset {
 };
 
 struct tid_user_buf {
+	struct mmu_interval_notifier notifier;
+	struct mutex cover_mutex;
 	unsigned long vaddr;
 	unsigned long length;
 	unsigned int npages;