Message ID | 20190428115207.GA11924@ziepe.ca (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [GIT,PULL] Please pull RDMA subsystem changes | expand |
On Sun, Apr 28, 2019 at 4:52 AM Jason Gunthorpe <jgg@mellanox.com> wrote: > > Nothing particularly special here. There is a small merge conflict > with Adrea's mm_still_valid patches which is resolved as below: I still don't understand *why* you play the crazy VM games to begin with. What's wrong with just returning SIGBUS? Why does that rdma_umap_fault() not just look like this one-liner: return VM_FAULT_SIGBUS; without the crazy parts? Nobody ever explained why you'd want to have that silly "let's turn it into a bogus anonymous mapping". I've pulled this, but why do the rdma and SCSI people always do these strange and pointless things? Linus
The pull request you sent on Sun, 28 Apr 2019 11:52:12 +0000:
> git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git tags/for-linus
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/14f974d7f0f1f93d8c35f496ae774ba0f1b3389a
Thank you!
On Sun, Apr 28, 2019 at 09:59:56AM -0700, Linus Torvalds wrote: > On Sun, Apr 28, 2019 at 4:52 AM Jason Gunthorpe <jgg@mellanox.com> wrote: > > > > Nothing particularly special here. There is a small merge conflict > > with Adrea's mm_still_valid patches which is resolved as below: > > I still don't understand *why* you play the crazy VM games to begin with. > > What's wrong with just returning SIGBUS? Why does that > rdma_umap_fault() not just look like this one-liner: > > return VM_FAULT_SIGBUS; > > without the crazy parts? Nobody ever explained why you'd want to have > that silly "let's turn it into a bogus anonymous mapping". There was a big thread where I went over the use case with Andrea, but I guess that was private.. It is for high availability - we have situations where the hardware can fault and needs some kind of destructive recovery. For instance a firmware reboot, or a VM migration. In these designs there may be multiple cards in the system and the userspace application could be using both. Just because one card crashed we can't send SIGBUS and kill the application, that breaks the HA design. So.. the kernel makes the BAR VMA into a 'dummy' and sends an async notification to the application. The use of the BAR memory by userspace is all 'write only' so it doesn't really care. When it sees the async notification it safely cleans up the userspace side of things. An more modern VM example of where this gets used is on VM systems using SRIO-V pass through of a raw RDMA device. When it is time to migrate the VM then the hypervisor causes the SRIO-V instance to fault and be removed from the guest kernel, then migrates and attaches a new RDMA SRIO-V instance. The user space is expected to see the failure, maintain state, then recover onto the new device. The only alternative that has come up would be to delay the kernel side until the application cleans up and deletes the VMA, but people generally don't like this as it degrades the recovery time and has the usual problems with blocking the kernel on userspace. When this was created I'm not sure people explored more creative ideas like trying to handle/ignore the SIGBUS in userspace - unfortunately it has been so long now that we are probably stuck doing this as part of the UAPI. I've been trying to make it less crufty over the last year based on remarks from yourself and Andrea, but I'm still stuck with this basic requirement that the VMA shouldn't fault or touch the BAR after the hardware is released by the kernel. Thanks, Jason
On Sun, Apr 28, 2019 at 4:49 PM Jason Gunthorpe <jgg@mellanox.com> wrote: > > It is for high availability - we have situations where the hardware > can fault and needs some kind of destructive recovery. For instance a > firmware reboot, or a VM migration. > > In these designs there may be multiple cards in the system and the > userspace application could be using both. Just because one card > crashed we can't send SIGBUS and kill the application, that breaks the > HA design. Why can't this magical application that is *so* special that it is HA and does magic mmap's of special rdma areas just catch the SIGBUS? Honestly, the whole "it's for HA" excuse stinks. It stinks because you now silently just replace the mapping with *garbage*. That's not HA, that's just random. Wouldn't it be a lot better to just get the SIGBUS, and then that magical application knows that "oh, it's gone", and it could - in its SIGBUS handler - just do the dummy anonymous mmap() with /dev/zero it if it wants to? Whatever. It really sounds like this is yet another horrible back for bad interfaces for all these "super-special high-end enterprise people". I hope that some day enterprise people will wake up and realize that "enterprise" seems to be often a code name for "lots of random hacks". Linus
On Sun, Apr 28, 2019 at 11:52:12AM +0000, Jason Gunthorpe wrote: > Hi Linus, > > Third rc pull request > > Nothing particularly special here. There is a small merge conflict > with Adrea's mm_still_valid patches which is resolved as below: ... > Jason Gunthorpe (3): > RDMA/mlx5: Do not allow the user to write to the clock page > RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages > RDMA/ucontext: Fix regression with disassociate This doesn't compile. The patch below would fix it, but not sure if this is what is intended: drivers/infiniband/core/uverbs_main.c: In function 'rdma_umap_fault': drivers/infiniband/core/uverbs_main.c:898:28: error: 'struct vm_fault' has no member named 'vm_start' vmf->page = ZERO_PAGE(vmf->vm_start); ^~ diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index 7843e89235c3..65fe89b3fa2d 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -895,7 +895,7 @@ static vm_fault_t rdma_umap_fault(struct vm_fault *vmf) /* Read only pages can just use the system zero page. */ if (!(vmf->vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) { - vmf->page = ZERO_PAGE(vmf->vm_start); + vmf->page = ZERO_PAGE(vmf->vma->vm_start); get_page(vmf->page); return 0; }
On Mon, Apr 29, 2019 at 08:09:47AM +0200, Heiko Carstens wrote: > On Sun, Apr 28, 2019 at 11:52:12AM +0000, Jason Gunthorpe wrote: > > Hi Linus, > > > > Third rc pull request > > > > Nothing particularly special here. There is a small merge conflict > > with Adrea's mm_still_valid patches which is resolved as below: > ... > > Jason Gunthorpe (3): > > RDMA/mlx5: Do not allow the user to write to the clock page > > RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages > > RDMA/ucontext: Fix regression with disassociate > > This doesn't compile. The patch below would fix it, but not sure if > this is what is intended: > > drivers/infiniband/core/uverbs_main.c: In function 'rdma_umap_fault': > drivers/infiniband/core/uverbs_main.c:898:28: error: 'struct vm_fault' has no member named 'vm_start' > vmf->page = ZERO_PAGE(vmf->vm_start); > ^~ > diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c > index 7843e89235c3..65fe89b3fa2d 100644 > +++ b/drivers/infiniband/core/uverbs_main.c > @@ -895,7 +895,7 @@ static vm_fault_t rdma_umap_fault(struct vm_fault *vmf) > > /* Read only pages can just use the system zero page. */ > if (!(vmf->vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) { > - vmf->page = ZERO_PAGE(vmf->vm_start); > + vmf->page = ZERO_PAGE(vmf->vma->vm_start); > get_page(vmf->page); > return 0; > } > Thanks Heiko, this looks right to me. I'm surprised to be seeing this at this point, these patches should have been seen by 0 day for several days now, and they were in linux-next already too.. Doug, can you send this to Linus today? Thanks, Jason
On Mon, Apr 29, 2019 at 08:40:37AM +0000, Jason Gunthorpe wrote: > On Mon, Apr 29, 2019 at 08:09:47AM +0200, Heiko Carstens wrote: > > On Sun, Apr 28, 2019 at 11:52:12AM +0000, Jason Gunthorpe wrote: > > > Hi Linus, > > > > > > Third rc pull request > > > > > > Nothing particularly special here. There is a small merge conflict > > > with Adrea's mm_still_valid patches which is resolved as below: > > ... > > > Jason Gunthorpe (3): > > > RDMA/mlx5: Do not allow the user to write to the clock page > > > RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages > > > RDMA/ucontext: Fix regression with disassociate > > > > This doesn't compile. The patch below would fix it, but not sure if > > this is what is intended: > > > > drivers/infiniband/core/uverbs_main.c: In function 'rdma_umap_fault': > > drivers/infiniband/core/uverbs_main.c:898:28: error: 'struct vm_fault' has no member named 'vm_start' > > vmf->page = ZERO_PAGE(vmf->vm_start); > > ^~ > > diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c > > index 7843e89235c3..65fe89b3fa2d 100644 > > +++ b/drivers/infiniband/core/uverbs_main.c > > @@ -895,7 +895,7 @@ static vm_fault_t rdma_umap_fault(struct vm_fault *vmf) > > > > /* Read only pages can just use the system zero page. */ > > if (!(vmf->vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) { > > - vmf->page = ZERO_PAGE(vmf->vm_start); > > + vmf->page = ZERO_PAGE(vmf->vma->vm_start); > > get_page(vmf->page); > > return 0; > > } > > > > Thanks Heiko, this looks right to me. > > I'm surprised to be seeing this at this point, these patches should > have been seen by 0 day for several days now, and they were in > linux-next already too.. Most architectures have versions of ZERO_PAGE() which ignore the argument so that the code builds anyway. I'm not sure if 0-day also tests s390x builds (which is where I ran into this). Michal Kubecek
On Mon, Apr 29, 2019 at 11:00:40AM +0200, Michal Kubecek wrote: > On Mon, Apr 29, 2019 at 08:40:37AM +0000, Jason Gunthorpe wrote: > > On Mon, Apr 29, 2019 at 08:09:47AM +0200, Heiko Carstens wrote: > > > On Sun, Apr 28, 2019 at 11:52:12AM +0000, Jason Gunthorpe wrote: > > > > Hi Linus, > > > > > > > > Third rc pull request > > > > > > > > Nothing particularly special here. There is a small merge conflict > > > > with Adrea's mm_still_valid patches which is resolved as below: > > > ... > > > > Jason Gunthorpe (3): > > > > RDMA/mlx5: Do not allow the user to write to the clock page > > > > RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages > > > > RDMA/ucontext: Fix regression with disassociate > > > > > > This doesn't compile. The patch below would fix it, but not sure if > > > this is what is intended: > > > > > > drivers/infiniband/core/uverbs_main.c: In function 'rdma_umap_fault': > > > drivers/infiniband/core/uverbs_main.c:898:28: error: 'struct vm_fault' has no member named 'vm_start' > > > vmf->page = ZERO_PAGE(vmf->vm_start); > > > ^~ > > > diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c > > > index 7843e89235c3..65fe89b3fa2d 100644 > > > +++ b/drivers/infiniband/core/uverbs_main.c > > > @@ -895,7 +895,7 @@ static vm_fault_t rdma_umap_fault(struct vm_fault *vmf) > > > > > > /* Read only pages can just use the system zero page. */ > > > if (!(vmf->vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) { > > > - vmf->page = ZERO_PAGE(vmf->vm_start); > > > + vmf->page = ZERO_PAGE(vmf->vma->vm_start); > > > get_page(vmf->page); > > > return 0; > > > } > > > > > > > Thanks Heiko, this looks right to me. > > > > I'm surprised to be seeing this at this point, these patches should > > have been seen by 0 day for several days now, and they were in > > linux-next already too.. > > Most architectures have versions of ZERO_PAGE() which ignore the > argument so that the code builds anyway. I'm not sure if 0-day also > tests s390x builds (which is where I ran into this). According to 0-build results for this patch, the answer is yes, it builds. s390 default_defconfig And it compiles uverbs_main.c (CONFIG_INFINIBAND_USER_ACCESS) kernel git:(rdma-next) grep INFIN arch/s390/configs/debug_defconfig CONFIG_INFINIBAND=m CONFIG_INFINIBAND_USER_ACCESS=m CONFIG_MLX4_INFINIBAND=m CONFIG_MLX5_INFINIBAND=m Thanks > > Michal Kubecek
On Mon, 2019-04-29 at 08:40 +0000, Jason Gunthorpe wrote: > On Mon, Apr 29, 2019 at 08:09:47AM +0200, Heiko Carstens wrote: > > On Sun, Apr 28, 2019 at 11:52:12AM +0000, Jason Gunthorpe wrote: > > > Hi Linus, > > > > > > Third rc pull request > > > > > > Nothing particularly special here. There is a small merge conflict > > > with Adrea's mm_still_valid patches which is resolved as below: > > ... > > > Jason Gunthorpe (3): > > > RDMA/mlx5: Do not allow the user to write to the clock page > > > RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages > > > RDMA/ucontext: Fix regression with disassociate > > > > This doesn't compile. The patch below would fix it, but not sure if > > this is what is intended: > > > > drivers/infiniband/core/uverbs_main.c: In function 'rdma_umap_fault': > > drivers/infiniband/core/uverbs_main.c:898:28: error: 'struct vm_fault' has no member named 'vm_start' > > vmf->page = ZERO_PAGE(vmf->vm_start); > > ^~ > > diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c > > index 7843e89235c3..65fe89b3fa2d 100644 > > +++ b/drivers/infiniband/core/uverbs_main.c > > @@ -895,7 +895,7 @@ static vm_fault_t rdma_umap_fault(struct vm_fault *vmf) > > > > /* Read only pages can just use the system zero page. */ > > if (!(vmf->vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) { > > - vmf->page = ZERO_PAGE(vmf->vm_start); > > + vmf->page = ZERO_PAGE(vmf->vma->vm_start); > > get_page(vmf->page); > > return 0; > > } > > > > Thanks Heiko, this looks right to me. > > I'm surprised to be seeing this at this point, these patches should > have been seen by 0 day for several days now, and they were in > linux-next already too.. > > Doug, can you send this to Linus today? Yep.
On Mon, 2019-04-29 at 11:42 -0400, Doug Ledford wrote: > On Mon, 2019-04-29 at 08:40 +0000, Jason Gunthorpe wrote: > > On Mon, Apr 29, 2019 at 08:09:47AM +0200, Heiko Carstens wrote: > > > On Sun, Apr 28, 2019 at 11:52:12AM +0000, Jason Gunthorpe wrote: > > > > Hi Linus, > > > > > > > > Third rc pull request > > > > > > > > Nothing particularly special here. There is a small merge conflict > > > > with Adrea's mm_still_valid patches which is resolved as below: > > > ... > > > > Jason Gunthorpe (3): > > > > RDMA/mlx5: Do not allow the user to write to the clock page > > > > RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages > > > > RDMA/ucontext: Fix regression with disassociate > > > > > > This doesn't compile. The patch below would fix it, but not sure if > > > this is what is intended: > > > > > > drivers/infiniband/core/uverbs_main.c: In function 'rdma_umap_fault': > > > drivers/infiniband/core/uverbs_main.c:898:28: error: 'struct vm_fault' has no member named 'vm_start' > > > vmf->page = ZERO_PAGE(vmf->vm_start); > > > ^~ > > > diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c > > > index 7843e89235c3..65fe89b3fa2d 100644 > > > +++ b/drivers/infiniband/core/uverbs_main.c > > > @@ -895,7 +895,7 @@ static vm_fault_t rdma_umap_fault(struct vm_fault *vmf) > > > > > > /* Read only pages can just use the system zero page. */ > > > if (!(vmf->vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) { > > > - vmf->page = ZERO_PAGE(vmf->vm_start); > > > + vmf->page = ZERO_PAGE(vmf->vma->vm_start); > > > get_page(vmf->page); > > > return 0; > > > } > > > > > > > Thanks Heiko, this looks right to me. > > > > I'm surprised to be seeing this at this point, these patches should > > have been seen by 0 day for several days now, and they were in > > linux-next already too.. > > > > Doug, can you send this to Linus today? > > Yep. > Done.
On Sun, Apr 28, 2019 at 05:09:08PM -0700, Linus Torvalds wrote: > On Sun, Apr 28, 2019 at 4:49 PM Jason Gunthorpe <jgg@mellanox.com> wrote: > > > > It is for high availability - we have situations where the hardware > > can fault and needs some kind of destructive recovery. For instance a > > firmware reboot, or a VM migration. > > > > In these designs there may be multiple cards in the system and the > > userspace application could be using both. Just because one card > > crashed we can't send SIGBUS and kill the application, that breaks the > > HA design. > > Why can't this magical application that is *so* special that it is HA > and does magic mmap's of special rdma areas just catch the SIGBUS? > > Honestly, the whole "it's for HA" excuse stinks. It stinks because you > now silently just replace the mapping with *garbage*. That's not HA, > that's just random. This should only used in cases where user space only writes to the BAR page (it is an interrupt to the device essentially), so it doesn't care that the pages are now garbage, we just need to redirect the writes away from the bar. However I think someone later on added a readable counter BAR pages to one of the devices :( So even that ideal wasn't respected. > Wouldn't it be a lot better to just get the SIGBUS, and then that > magical application knows that "oh, it's gone", and it could - in its > SIGBUS handler - just do the dummy anonymous mmap() with /dev/zero it > if it wants to? This does sound more appealing, and probably should have been done instead. All this VMA stuff has been a big pain in the long run Thanks, Jason