mbox series

[GIT,PULL] Please pull RDMA subsystem changes

Message ID 20190428115207.GA11924@ziepe.ca (mailing list archive)
State Superseded
Headers show
Series [GIT,PULL] Please pull RDMA subsystem changes | expand

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git tags/for-linus

Message

Jason Gunthorpe April 28, 2019, 11:52 a.m. UTC
Hi Linus,

Third rc pull request

Nothing particularly special here. There is a small merge conflict
with Adrea's mm_still_valid patches which is resolved as below:

diff --cc drivers/infiniband/core/uverbs_main.c
index db20b6e0f253c9,f2e7ffe6fc5466..7843e89235c34b
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@@ -1039,7 -992,9 +1039,9 @@@ void uverbs_user_mmap_disassociate(stru
  		 * at a time to get the lock ordering right. Typically there
  		 * will only be one mm, so no big deal.
  		 */
 -		down_write(&mm->mmap_sem);
 +		down_read(&mm->mmap_sem);
+ 		if (!mmget_still_valid(mm))
+ 			goto skip_mm;
  		mutex_lock(&ufile->umap_lock);
  		list_for_each_entry_safe (priv, next_priv, &ufile->umaps,
  					  list) {
@@@ -1051,9 -1006,11 +1053,10 @@@
  
  			zap_vma_ptes(vma, vma->vm_start,
  				     vma->vm_end - vma->vm_start);
 -			vma->vm_flags &= ~(VM_SHARED | VM_MAYSHARE);
  		}
  		mutex_unlock(&ufile->umap_lock);
+ 	skip_mm:
 -		up_write(&mm->mmap_sem);
 +		up_read(&mm->mmap_sem);
  		mmput(mm);
  	}
  }

The tag for-linus-merged with my merge resolution to your tree is also available to pull.

Thanks,
Jason

The following changes since commit dc4060a5dc2557e6b5aa813bf5b73677299d62d2:

  Linux 5.1-rc5 (2019-04-14 15:17:41 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git tags/for-linus

for you to fetch changes up to 2557fabd6e29f349bfa0ac13f38ac98aa5eafc74:

  RDMA/hns: Bugfix for mapping user db (2019-04-25 10:40:04 -0300)

----------------------------------------------------------------
5.1 Third RC pull request

One core bug fix and a few driver ones

- FRWR memory registration for hfi1/qib didn't work with with some iovas
  causing a NFSoRDMA failure regression due to a fix in the NFS side

- A command flow error in mlx5 allowed user space to send a corrupt
  command (and also smash the kernel stack we've since learned)

- Fix a regression and some bugs with device hot unplug that was
  discovered while reviewing Andrea's patches

- hns has a failure if the user asks for certain QP configurations

----------------------------------------------------------------
Guy Levi (1):
      IB/mlx5: Fix scatter to CQE in DCT QP creation

Jason Gunthorpe (3):
      RDMA/mlx5: Do not allow the user to write to the clock page
      RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages
      RDMA/ucontext: Fix regression with disassociate

Josh Collier (1):
      IB/rdmavt: Fix frwr memory registration

Lijun Ou (1):
      RDMA/hns: Bugfix for mapping user db

 drivers/infiniband/core/uverbs.h        |  1 +
 drivers/infiniband/core/uverbs_main.c   | 52 +++++++++++++++++++++++++++++++--
 drivers/infiniband/hw/hns/hns_roce_qp.c |  2 +-
 drivers/infiniband/hw/mlx5/main.c       | 12 ++++----
 drivers/infiniband/hw/mlx5/qp.c         | 11 ++++---
 drivers/infiniband/sw/rdmavt/mr.c       | 17 ++++++-----
 include/uapi/rdma/mlx5-abi.h            |  1 +
 7 files changed, 76 insertions(+), 20 deletions(-)
(diffstat from tag for-linus-merged)

Comments

Linus Torvalds April 28, 2019, 4:59 p.m. UTC | #1
On Sun, Apr 28, 2019 at 4:52 AM Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> Nothing particularly special here. There is a small merge conflict
> with Adrea's mm_still_valid patches which is resolved as below:

I still don't understand *why* you play the crazy VM games to begin with.

What's wrong with just returning SIGBUS? Why does that
rdma_umap_fault() not just look like this one-liner:

        return VM_FAULT_SIGBUS;

without the crazy parts? Nobody ever explained why you'd want to have
that silly "let's turn it into a bogus anonymous mapping".

I've pulled this, but why do the rdma and SCSI people always do these
strange and pointless things?

                  Linus
pr-tracker-bot@kernel.org April 28, 2019, 6:05 p.m. UTC | #2
The pull request you sent on Sun, 28 Apr 2019 11:52:12 +0000:

> git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git tags/for-linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/14f974d7f0f1f93d8c35f496ae774ba0f1b3389a

Thank you!
Jason Gunthorpe April 28, 2019, 11:49 p.m. UTC | #3
On Sun, Apr 28, 2019 at 09:59:56AM -0700, Linus Torvalds wrote:
> On Sun, Apr 28, 2019 at 4:52 AM Jason Gunthorpe <jgg@mellanox.com> wrote:
> >
> > Nothing particularly special here. There is a small merge conflict
> > with Adrea's mm_still_valid patches which is resolved as below:
> 
> I still don't understand *why* you play the crazy VM games to begin with.
> 
> What's wrong with just returning SIGBUS? Why does that
> rdma_umap_fault() not just look like this one-liner:
> 
>         return VM_FAULT_SIGBUS;
> 
> without the crazy parts? Nobody ever explained why you'd want to have
> that silly "let's turn it into a bogus anonymous mapping".

There was a big thread where I went over the use case with Andrea, but
I guess that was private..

It is for high availability - we have situations where the hardware
can fault and needs some kind of destructive recovery. For instance a
firmware reboot, or a VM migration.

In these designs there may be multiple cards in the system and the
userspace application could be using both. Just because one card
crashed we can't send SIGBUS and kill the application, that breaks the
HA design.

So.. the kernel makes the BAR VMA into a 'dummy' and sends an async
notification to the application. The use of the BAR memory by
userspace is all 'write only' so it doesn't really care. When it sees
the async notification it safely cleans up the userspace side of
things.

An more modern VM example of where this gets used is on VM systems
using SRIO-V pass through of a raw RDMA device. When it is time to
migrate the VM then the hypervisor causes the SRIO-V instance to fault
and be removed from the guest kernel, then migrates and attaches a new
RDMA SRIO-V instance. The user space is expected to see the failure,
maintain state, then recover onto the new device.

The only alternative that has come up would be to delay the kernel
side until the application cleans up and deletes the VMA, but people
generally don't like this as it degrades the recovery time and has the
usual problems with blocking the kernel on userspace.

When this was created I'm not sure people explored more creative ideas
like trying to handle/ignore the SIGBUS in userspace - unfortunately
it has been so long now that we are probably stuck doing this as part
of the UAPI.

I've been trying to make it less crufty over the last year based on
remarks from yourself and Andrea, but I'm still stuck with this basic
requirement that the VMA shouldn't fault or touch the BAR after the
hardware is released by the kernel.

Thanks,
Jason
Linus Torvalds April 29, 2019, 12:09 a.m. UTC | #4
On Sun, Apr 28, 2019 at 4:49 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> It is for high availability - we have situations where the hardware
> can fault and needs some kind of destructive recovery. For instance a
> firmware reboot, or a VM migration.
>
> In these designs there may be multiple cards in the system and the
> userspace application could be using both. Just because one card
> crashed we can't send SIGBUS and kill the application, that breaks the
> HA design.

Why can't this magical application that is *so* special that it is HA
and does magic mmap's of special rdma areas just catch the SIGBUS?

Honestly, the whole "it's for HA" excuse stinks. It stinks because you
now silently just replace the mapping with *garbage*. That's not HA,
that's just random.

Wouldn't it be a lot better to just get the SIGBUS, and then that
magical application knows that "oh, it's gone", and it could - in its
SIGBUS handler - just do the dummy anonymous mmap() with /dev/zero it
if it wants to?

Whatever. It really sounds like this is yet another horrible back for
bad interfaces for all these "super-special high-end enterprise
people".

I hope that some day enterprise people will wake up and realize that
"enterprise" seems to be often a code name for "lots of random hacks".

                    Linus
Heiko Carstens April 29, 2019, 6:09 a.m. UTC | #5
On Sun, Apr 28, 2019 at 11:52:12AM +0000, Jason Gunthorpe wrote:
> Hi Linus,
> 
> Third rc pull request
> 
> Nothing particularly special here. There is a small merge conflict
> with Adrea's mm_still_valid patches which is resolved as below:
...
> Jason Gunthorpe (3):
>       RDMA/mlx5: Do not allow the user to write to the clock page
>       RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages
>       RDMA/ucontext: Fix regression with disassociate

This doesn't compile. The patch below would fix it, but not sure if
this is what is intended:

drivers/infiniband/core/uverbs_main.c: In function 'rdma_umap_fault':
drivers/infiniband/core/uverbs_main.c:898:28: error: 'struct vm_fault' has no member named 'vm_start'
   vmf->page = ZERO_PAGE(vmf->vm_start);
                            ^~
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 7843e89235c3..65fe89b3fa2d 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -895,7 +895,7 @@ static vm_fault_t rdma_umap_fault(struct vm_fault *vmf)
 
 	/* Read only pages can just use the system zero page. */
 	if (!(vmf->vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) {
-		vmf->page = ZERO_PAGE(vmf->vm_start);
+		vmf->page = ZERO_PAGE(vmf->vma->vm_start);
 		get_page(vmf->page);
 		return 0;
 	}
Jason Gunthorpe April 29, 2019, 8:40 a.m. UTC | #6
On Mon, Apr 29, 2019 at 08:09:47AM +0200, Heiko Carstens wrote:
> On Sun, Apr 28, 2019 at 11:52:12AM +0000, Jason Gunthorpe wrote:
> > Hi Linus,
> > 
> > Third rc pull request
> > 
> > Nothing particularly special here. There is a small merge conflict
> > with Adrea's mm_still_valid patches which is resolved as below:
> ...
> > Jason Gunthorpe (3):
> >       RDMA/mlx5: Do not allow the user to write to the clock page
> >       RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages
> >       RDMA/ucontext: Fix regression with disassociate
> 
> This doesn't compile. The patch below would fix it, but not sure if
> this is what is intended:
> 
> drivers/infiniband/core/uverbs_main.c: In function 'rdma_umap_fault':
> drivers/infiniband/core/uverbs_main.c:898:28: error: 'struct vm_fault' has no member named 'vm_start'
>    vmf->page = ZERO_PAGE(vmf->vm_start);
>                             ^~
> diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
> index 7843e89235c3..65fe89b3fa2d 100644
> +++ b/drivers/infiniband/core/uverbs_main.c
> @@ -895,7 +895,7 @@ static vm_fault_t rdma_umap_fault(struct vm_fault *vmf)
>  
>  	/* Read only pages can just use the system zero page. */
>  	if (!(vmf->vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) {
> -		vmf->page = ZERO_PAGE(vmf->vm_start);
> +		vmf->page = ZERO_PAGE(vmf->vma->vm_start);
>  		get_page(vmf->page);
>  		return 0;
>  	}
> 

Thanks Heiko, this looks right to me. 

I'm surprised to be seeing this at this point, these patches should
have been seen by 0 day for several days now, and they were in
linux-next already too..

Doug, can you send this to Linus today?

Thanks,
Jason
Michal Kubecek April 29, 2019, 9 a.m. UTC | #7
On Mon, Apr 29, 2019 at 08:40:37AM +0000, Jason Gunthorpe wrote:
> On Mon, Apr 29, 2019 at 08:09:47AM +0200, Heiko Carstens wrote:
> > On Sun, Apr 28, 2019 at 11:52:12AM +0000, Jason Gunthorpe wrote:
> > > Hi Linus,
> > > 
> > > Third rc pull request
> > > 
> > > Nothing particularly special here. There is a small merge conflict
> > > with Adrea's mm_still_valid patches which is resolved as below:
> > ...
> > > Jason Gunthorpe (3):
> > >       RDMA/mlx5: Do not allow the user to write to the clock page
> > >       RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages
> > >       RDMA/ucontext: Fix regression with disassociate
> > 
> > This doesn't compile. The patch below would fix it, but not sure if
> > this is what is intended:
> > 
> > drivers/infiniband/core/uverbs_main.c: In function 'rdma_umap_fault':
> > drivers/infiniband/core/uverbs_main.c:898:28: error: 'struct vm_fault' has no member named 'vm_start'
> >    vmf->page = ZERO_PAGE(vmf->vm_start);
> >                             ^~
> > diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
> > index 7843e89235c3..65fe89b3fa2d 100644
> > +++ b/drivers/infiniband/core/uverbs_main.c
> > @@ -895,7 +895,7 @@ static vm_fault_t rdma_umap_fault(struct vm_fault *vmf)
> >  
> >  	/* Read only pages can just use the system zero page. */
> >  	if (!(vmf->vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) {
> > -		vmf->page = ZERO_PAGE(vmf->vm_start);
> > +		vmf->page = ZERO_PAGE(vmf->vma->vm_start);
> >  		get_page(vmf->page);
> >  		return 0;
> >  	}
> > 
> 
> Thanks Heiko, this looks right to me. 
> 
> I'm surprised to be seeing this at this point, these patches should
> have been seen by 0 day for several days now, and they were in
> linux-next already too..

Most architectures have versions of ZERO_PAGE() which ignore the
argument so that the code builds anyway. I'm not sure if 0-day also
tests s390x builds (which is where I ran into this).

Michal Kubecek
Leon Romanovsky April 29, 2019, 9:19 a.m. UTC | #8
On Mon, Apr 29, 2019 at 11:00:40AM +0200, Michal Kubecek wrote:
> On Mon, Apr 29, 2019 at 08:40:37AM +0000, Jason Gunthorpe wrote:
> > On Mon, Apr 29, 2019 at 08:09:47AM +0200, Heiko Carstens wrote:
> > > On Sun, Apr 28, 2019 at 11:52:12AM +0000, Jason Gunthorpe wrote:
> > > > Hi Linus,
> > > >
> > > > Third rc pull request
> > > >
> > > > Nothing particularly special here. There is a small merge conflict
> > > > with Adrea's mm_still_valid patches which is resolved as below:
> > > ...
> > > > Jason Gunthorpe (3):
> > > >       RDMA/mlx5: Do not allow the user to write to the clock page
> > > >       RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages
> > > >       RDMA/ucontext: Fix regression with disassociate
> > >
> > > This doesn't compile. The patch below would fix it, but not sure if
> > > this is what is intended:
> > >
> > > drivers/infiniband/core/uverbs_main.c: In function 'rdma_umap_fault':
> > > drivers/infiniband/core/uverbs_main.c:898:28: error: 'struct vm_fault' has no member named 'vm_start'
> > >    vmf->page = ZERO_PAGE(vmf->vm_start);
> > >                             ^~
> > > diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
> > > index 7843e89235c3..65fe89b3fa2d 100644
> > > +++ b/drivers/infiniband/core/uverbs_main.c
> > > @@ -895,7 +895,7 @@ static vm_fault_t rdma_umap_fault(struct vm_fault *vmf)
> > >
> > >  	/* Read only pages can just use the system zero page. */
> > >  	if (!(vmf->vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) {
> > > -		vmf->page = ZERO_PAGE(vmf->vm_start);
> > > +		vmf->page = ZERO_PAGE(vmf->vma->vm_start);
> > >  		get_page(vmf->page);
> > >  		return 0;
> > >  	}
> > >
> >
> > Thanks Heiko, this looks right to me.
> >
> > I'm surprised to be seeing this at this point, these patches should
> > have been seen by 0 day for several days now, and they were in
> > linux-next already too..
>
> Most architectures have versions of ZERO_PAGE() which ignore the
> argument so that the code builds anyway. I'm not sure if 0-day also
> tests s390x builds (which is where I ran into this).

According to 0-build results for this patch, the answer is yes, it builds.
s390                        default_defconfig

And it compiles uverbs_main.c (CONFIG_INFINIBAND_USER_ACCESS)
  kernel git:(rdma-next) grep INFIN arch/s390/configs/debug_defconfig
CONFIG_INFINIBAND=m
CONFIG_INFINIBAND_USER_ACCESS=m
CONFIG_MLX4_INFINIBAND=m
CONFIG_MLX5_INFINIBAND=m

Thanks

>
> Michal Kubecek
Doug Ledford April 29, 2019, 3:42 p.m. UTC | #9
On Mon, 2019-04-29 at 08:40 +0000, Jason Gunthorpe wrote:
> On Mon, Apr 29, 2019 at 08:09:47AM +0200, Heiko Carstens wrote:
> > On Sun, Apr 28, 2019 at 11:52:12AM +0000, Jason Gunthorpe wrote:
> > > Hi Linus,
> > > 
> > > Third rc pull request
> > > 
> > > Nothing particularly special here. There is a small merge conflict
> > > with Adrea's mm_still_valid patches which is resolved as below:
> > ...
> > > Jason Gunthorpe (3):
> > >       RDMA/mlx5: Do not allow the user to write to the clock page
> > >       RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages
> > >       RDMA/ucontext: Fix regression with disassociate
> > 
> > This doesn't compile. The patch below would fix it, but not sure if
> > this is what is intended:
> > 
> > drivers/infiniband/core/uverbs_main.c: In function 'rdma_umap_fault':
> > drivers/infiniband/core/uverbs_main.c:898:28: error: 'struct vm_fault' has no member named 'vm_start'
> >    vmf->page = ZERO_PAGE(vmf->vm_start);
> >                             ^~
> > diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
> > index 7843e89235c3..65fe89b3fa2d 100644
> > +++ b/drivers/infiniband/core/uverbs_main.c
> > @@ -895,7 +895,7 @@ static vm_fault_t rdma_umap_fault(struct vm_fault *vmf)
> >  
> >  	/* Read only pages can just use the system zero page. */
> >  	if (!(vmf->vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) {
> > -		vmf->page = ZERO_PAGE(vmf->vm_start);
> > +		vmf->page = ZERO_PAGE(vmf->vma->vm_start);
> >  		get_page(vmf->page);
> >  		return 0;
> >  	}
> > 
> 
> Thanks Heiko, this looks right to me. 
> 
> I'm surprised to be seeing this at this point, these patches should
> have been seen by 0 day for several days now, and they were in
> linux-next already too..
> 
> Doug, can you send this to Linus today?

Yep.
Doug Ledford April 29, 2019, 4:29 p.m. UTC | #10
On Mon, 2019-04-29 at 11:42 -0400, Doug Ledford wrote:
> On Mon, 2019-04-29 at 08:40 +0000, Jason Gunthorpe wrote:
> > On Mon, Apr 29, 2019 at 08:09:47AM +0200, Heiko Carstens wrote:
> > > On Sun, Apr 28, 2019 at 11:52:12AM +0000, Jason Gunthorpe wrote:
> > > > Hi Linus,
> > > > 
> > > > Third rc pull request
> > > > 
> > > > Nothing particularly special here. There is a small merge conflict
> > > > with Adrea's mm_still_valid patches which is resolved as below:
> > > ...
> > > > Jason Gunthorpe (3):
> > > >       RDMA/mlx5: Do not allow the user to write to the clock page
> > > >       RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages
> > > >       RDMA/ucontext: Fix regression with disassociate
> > > 
> > > This doesn't compile. The patch below would fix it, but not sure if
> > > this is what is intended:
> > > 
> > > drivers/infiniband/core/uverbs_main.c: In function 'rdma_umap_fault':
> > > drivers/infiniband/core/uverbs_main.c:898:28: error: 'struct vm_fault' has no member named 'vm_start'
> > >    vmf->page = ZERO_PAGE(vmf->vm_start);
> > >                             ^~
> > > diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
> > > index 7843e89235c3..65fe89b3fa2d 100644
> > > +++ b/drivers/infiniband/core/uverbs_main.c
> > > @@ -895,7 +895,7 @@ static vm_fault_t rdma_umap_fault(struct vm_fault *vmf)
> > >  
> > >  	/* Read only pages can just use the system zero page. */
> > >  	if (!(vmf->vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) {
> > > -		vmf->page = ZERO_PAGE(vmf->vm_start);
> > > +		vmf->page = ZERO_PAGE(vmf->vma->vm_start);
> > >  		get_page(vmf->page);
> > >  		return 0;
> > >  	}
> > > 
> > 
> > Thanks Heiko, this looks right to me. 
> > 
> > I'm surprised to be seeing this at this point, these patches should
> > have been seen by 0 day for several days now, and they were in
> > linux-next already too..
> > 
> > Doug, can you send this to Linus today?
> 
> Yep.
> 

Done.
Jason Gunthorpe April 30, 2019, 12:53 p.m. UTC | #11
On Sun, Apr 28, 2019 at 05:09:08PM -0700, Linus Torvalds wrote:
> On Sun, Apr 28, 2019 at 4:49 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
> >
> > It is for high availability - we have situations where the hardware
> > can fault and needs some kind of destructive recovery. For instance a
> > firmware reboot, or a VM migration.
> >
> > In these designs there may be multiple cards in the system and the
> > userspace application could be using both. Just because one card
> > crashed we can't send SIGBUS and kill the application, that breaks the
> > HA design.
> 
> Why can't this magical application that is *so* special that it is HA
> and does magic mmap's of special rdma areas just catch the SIGBUS?
> 
> Honestly, the whole "it's for HA" excuse stinks. It stinks because you
> now silently just replace the mapping with *garbage*. That's not HA,
> that's just random.

This should only used in cases where user space only writes to the BAR
page (it is an interrupt to the device essentially), so it doesn't
care that the pages are now garbage, we just need to redirect the
writes away from the bar.

However I think someone later on added a readable counter BAR pages to
one of the devices :( So even that ideal wasn't respected.

> Wouldn't it be a lot better to just get the SIGBUS, and then that
> magical application knows that "oh, it's gone", and it could - in its
> SIGBUS handler - just do the dummy anonymous mmap() with /dev/zero it
> if it wants to?

This does sound more appealing, and probably should have been done
instead. All this VMA stuff has been a big pain in the long run

Thanks,
Jason