[for-next,0/3] Updates for 6.5

Message ID: 168451505181.3702129.3429054159823295146.stgit@awfm-02.cornelisnetworks.com

Message

Dennis Dalessandro May 19, 2023, 4:54 p.m. UTC
Here are 3 more patches related to, and spawned out of, the work to scale
back the page-pinning rework. This series depends on [1], which was
recently submitted for an RC.

[1] https://patchwork.kernel.org/project/linux-rdma/patch/168451393605.3700681.13493776139032178861.stgit@awfm-02.cornelisnetworks.com/

---

Brendan Cunningham (2):
      IB/hfi1: Add mmu_rb_node refcount to hfi1_mmu_rb_template tracepoints
      IB/hfi1: Remove unused struct mmu_rb_ops fields .insert, .invalidate

Patrick Kelsey (1):
      IB/hfi1: Separate user SDMA page-pinning from memory type


 drivers/infiniband/hw/hfi1/Makefile     |   2 +
 drivers/infiniband/hw/hfi1/init.c       |   5 +
 drivers/infiniband/hw/hfi1/mmu_rb.c     |   7 +-
 drivers/infiniband/hw/hfi1/mmu_rb.h     |   7 +-
 drivers/infiniband/hw/hfi1/pin_system.c | 487 ++++++++++++++++++++++++
 drivers/infiniband/hw/hfi1/pinning.c    |  55 +++
 drivers/infiniband/hw/hfi1/pinning.h    |  75 ++++
 drivers/infiniband/hw/hfi1/trace_mmu.h  |  48 ++-
 drivers/infiniband/hw/hfi1/user_sdma.c  | 472 ++---------------------
 drivers/infiniband/hw/hfi1/user_sdma.h  |  15 +-
 include/uapi/rdma/hfi/hfi1_user.h       |  31 +-
 11 files changed, 743 insertions(+), 461 deletions(-)
 create mode 100644 drivers/infiniband/hw/hfi1/pin_system.c
 create mode 100644 drivers/infiniband/hw/hfi1/pinning.c
 create mode 100644 drivers/infiniband/hw/hfi1/pinning.h
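
For orientation, the shape of the split in this diffstat is roughly an ops
table keyed by memory type. A minimal sketch of the idea follows; every name
in it is invented for illustration and is not the actual hfi1 interface:

/*
 * Hypothetical sketch of dispatching SDMA page-pinning by memory type.
 * All names are illustrative; this is not the posted hfi1 code.
 */
#include <linux/types.h>
#include <linux/errno.h>

struct pin_request;			/* opaque per-I/O pinning state */

struct memtype_pin_ops {
	/* Pin the user pages backing one iovec and record them in req. */
	int  (*pin_pages)(struct pin_request *req);
	/* Unpin/release the pages once the SDMA descriptors complete. */
	void (*unpin_pages)(struct pin_request *req);
};

enum memtype { MEMTYPE_SYSTEM, MEMTYPE_MAX };

/* One entry per memory type; only system memory is populated today. */
static const struct memtype_pin_ops *pin_ops[MEMTYPE_MAX];

static int pin_for_sdma(enum memtype type, struct pin_request *req)
{
	if (type >= MEMTYPE_MAX || !pin_ops[type])
		return -EINVAL;		/* unknown or unsupported type */
	return pin_ops[type]->pin_pages(req);
}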

--
-Denny

Comments

Jason Gunthorpe June 1, 2023, 10:54 p.m. UTC | #1
On Fri, May 19, 2023 at 12:54:14PM -0400, Dennis Dalessandro wrote:
> Here are 3 more patches related to, and spawned out of, the work to scale
> back the page-pinning rework. This series depends on [1], which was
> recently submitted for an RC.
> 
> [1] https://patchwork.kernel.org/project/linux-rdma/patch/168451393605.3700681.13493776139032178861.stgit@awfm-02.cornelisnetworks.com/
> 
> ---
> 
> Brendan Cunningham (2):
>       IB/hfi1: Add mmu_rb_node refcount to hfi1_mmu_rb_template tracepoints
>       IB/hfi1: Remove unused struct mmu_rb_ops fields .insert, .invalidate

I took these two

> Patrick Kelsey (1):
>       IB/hfi1: Separate user SDMA page-pinning from memory type

Don't know what to do with this one

Maybe send a complete feature.

Jason
Dennis Dalessandro June 2, 2023, 3:41 a.m. UTC | #2
On 6/1/23 6:54 PM, Jason Gunthorpe wrote:
> On Fri, May 19, 2023 at 12:54:14PM -0400, Dennis Dalessandro wrote:
>> Here are 3 more patches related to, and spawned out of, the work to scale
>> back the page-pinning rework. This series depends on [1], which was
>> recently submitted for an RC.
>>
>> [1] https://patchwork.kernel.org/project/linux-rdma/patch/168451393605.3700681.13493776139032178861.stgit@awfm-02.cornelisnetworks.com/
>>
>> ---
>>
>> Brendan Cunningham (2):
>>       IB/hfi1: Add mmu_rb_node refcount to hfi1_mmu_rb_template tracepoints
>>       IB/hfi1: Remove unused struct mmu_rb_ops fields .insert, .invalidate
> 
> I took these two
> 
>> Patrick Kelsey (1):
>>       IB/hfi1: Separate user SDMA page-pinning from memory type
> 
> Don't know what to do with this one

I'm not seeing why it's a problem. It improves our existing page pinning by
making the code easier to follow, and it makes the code easier to maintain
by centralizing the pinning code.

> Maybe send a complete feature.

It is a complete feature; really it's just a refactoring of an existing
feature. It makes it more flexible to extend in the future and is the
current interface for psm2 and libfabric.

We are clearly not throwing some uAPI changes over the fence that have no
user backing. We have existing use cases already.

Now, all of this being said, this is starting to concern me. We plan to
start sending patches to support new HW soon, and we were going to do this
incrementally. Is that going to be considered an incomplete feature? Should we
wait until it's all done and ship it all at once? I was envisioning doing things
the way we did for rdmavt, showing our work so to speak, making incremental
changes over time as opposed to how we submitted the original hfi1 code in a
giant blob.

For instance, we are going to add a field to a data structure to hold
parameters that differ between multiple chips. The first patch was going to
add that infrastructure and move the WFR (the existing chip) values into it.
In my opinion, that's basically the same thing we are doing here with the
pinning interface.
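
As a purely hypothetical illustration of that first patch (field names and
values here are placeholders, not taken from any posted patch):

/*
 * Hypothetical per-chip parameter table. Fields and values are
 * placeholders for illustration only.
 */
#include <linux/types.h>

struct chip_params {
	u32 num_send_contexts;
	u32 num_sdma_engines;
	u32 max_mtu;
};

/* WFR (the existing chip) values move from scattered constants into one
 * table; a future chip adds its own entry instead of new special cases. */
static const struct chip_params wfr_params = {
	.num_send_contexts = 160,
	.num_sdma_engines  = 16,
	.max_mtu           = 10240,
};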

This is all in an effort to be open and involve the community as much as
possible along the way, so we avoid a fiasco like we had years ago with
hfi1's introduction.

-Denny
Jason Gunthorpe June 2, 2023, 1:42 p.m. UTC | #3
On Thu, Jun 01, 2023 at 11:41:24PM -0400, Dennis Dalessandro wrote:
> On 6/1/23 6:54 PM, Jason Gunthorpe wrote:
> > On Fri, May 19, 2023 at 12:54:14PM -0400, Dennis Dalessandro wrote:
> >> Here are 3 more patches related to, and spawned out of, the work to scale
> >> back the page-pinning rework. This series depends on [1], which was
> >> recently submitted for an RC.
> >>
> >> [1] https://patchwork.kernel.org/project/linux-rdma/patch/168451393605.3700681.13493776139032178861.stgit@awfm-02.cornelisnetworks.com/
> >>
> >> ---
> >>
> >> Brendan Cunningham (2):
> >>       IB/hfi1: Add mmu_rb_node refcount to hfi1_mmu_rb_template tracepoints
> >>       IB/hfi1: Remove unused struct mmu_rb_ops fields .insert, .invalidate
> > 
> > I took these two
> > 
> >> Patrick Kelsey (1):
> >>       IB/hfi1: Separate user SDMA page-pinning from memory type
> > 
> > Don't know what to do with this one
> 
> I'm not seeing why it's a problem. It improves our existing page pinning by
> making the code easier to follow, and it makes the code easier to maintain
> by centralizing the pinning code.
> 
> > Maybe send a complete feature.
> 
> It is a complete feature; really it's just a refactoring of an existing
> feature. It makes it more flexible to extend in the future and is the
> current interface for psm2 and libfabric.

If it were refactoring, it would not add new uAPI.

The commit message said this was done for 'other than system memory',
which the driver doesn't support.

So it is very confusing what this is all for.

> Now, all of this being said, this is starting to concern me. We plan to
> start sending patches to support new HW soon, and we were going to do this
> incrementally.

I think we said long ago that this was the last HW that can use the
hfi1 cdev.

So you will have to think carefully about what is needed. It is part of why
I don't want to take uAPI changes for incomplete features. I'm
wondering how you will fit dmabuf into hfi1, when I won't be happy if
this is done by adding dmabuf support to the cdev.

> Is that going to be considered an incomplete feature? Should we
> wait until it's all done and ship it all at once? I was envisioning doing things
> the way we did for rdmavt, showing our work so to speak, making incremental
> changes over time as opposed to how we submitted the original hfi1 code in a
> giant blob.

I think it depends; stay away from the uAPI and things are easier.

Jason
Dennis Dalessandro June 2, 2023, 5:15 p.m. UTC | #4
On 6/2/23 9:42 AM, Jason Gunthorpe wrote:
> I think we said long ago that this was the last HW that can use the
> hfi1 cdev.

Yeah. I don't want to conflate the two things though. This patch is specifically
for the older HW.

> So you will have to think carefully about what is needed. It is part of why
> I don't want to take uAPI changes for incomplete features. I'm
> wondering how you will fit dmabuf into hfi1, when I won't be happy if
> this is done by adding dmabuf support to the cdev.

I think at one point you said it was ok to have the fast datapath go through the
cdev. We want the core to own the configuration/etc. This is the fast datapath.

The high-level idea in my mind is that we do basically what is done with our
control IOCTLs through the core (setting up contexts, etc.) but still create
a cdev for sending in the lists of pages to pin down, since that is the hot
path. I assume that cdev could be created and owned by the core rather than
by the driver directly.
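
A rough sketch of that split, with every name and structure invented just to
make the proposal concrete:

/*
 * Hypothetical sketch of the proposed split: configuration flows through
 * the RDMA core; only per-I/O page-list submission stays on a cdev.
 * All names and layouts here are invented for illustration.
 */
#include <linux/fs.h>
#include <linux/types.h>
#include <linux/uaccess.h>

struct pin_page_list {
	__u64 vaddr;	/* start of the user buffer            */
	__u64 npages;	/* number of pages to pin for this I/O */
};

/* Hot path only: the cdev accepts page lists to pin and post to SDMA;
 * context setup and teardown would be core-owned ioctls instead. */
static long hotpath_ioctl(struct file *fp, unsigned int cmd,
			  unsigned long arg)
{
	struct pin_page_list req;

	if (copy_from_user(&req, (void __user *)arg, sizeof(req)))
		return -EFAULT;
	/* ... pin req.npages at req.vaddr and kick off SDMA here ... */
	return 0;
}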

>> Is that going to be considered an incomplete feature? Should we
>> wait until it's all done and ship it all at once? I was envisioning doing things
>> the way we did for rdmavt, showing our work so to speak, making incremental
>> changes over time as opposed to how we submitted the original hfi1 code in a
>> giant blob.
> 
> I think it depends; stay away from the uAPI and things are easier.

The new chip doesn't require any changes to the uAPI. We'll start sending
patches soon, then.

-Denny
Jason Gunthorpe June 2, 2023, 5:23 p.m. UTC | #5
On Fri, Jun 02, 2023 at 01:15:47PM -0400, Dennis Dalessandro wrote:
> > So you will have to think carefully about what is needed. It is part of why
> > I don't want to take uAPI changes for incomplete features. I'm
> > wondering how you will fit dmabuf into hfi1, when I won't be happy if
> > this is done by adding dmabuf support to the cdev.
> 
> I think at one point you said it was ok to have the fast datapath go through the
> cdev. We want the core to own the configuration/etc. This is the fast datapath.

Maybe, but dmabuf binding and page extraction really can't be the fast
datapath; it just isn't fast to start with.

You should be crystallizing the dmabuf into an MR and caching the DMA
addrs like everything else, and referencing the MR on your fast path.
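
Roughly, that setup-time step might look like the following sketch against
the in-kernel DMA-BUF import API; the my_mr structure is invented, and the
reservation-lock rules around the map call are elided:

/*
 * Sketch: map a dmabuf once at registration time and cache the DMA
 * addresses in an MR-like object; the fast path then only reads the
 * cached sg_table. "my_mr" is invented, not hfi1 code.
 */
#include <linux/dma-buf.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

struct my_mr {
	struct dma_buf *dmabuf;
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;		/* cached DMA addresses */
};

static int my_mr_crystallize(struct my_mr *mr, int fd, struct device *dev)
{
	int ret;

	mr->dmabuf = dma_buf_get(fd);	/* take a ref on the dmabuf fd */
	if (IS_ERR(mr->dmabuf))
		return PTR_ERR(mr->dmabuf);

	mr->attach = dma_buf_attach(mr->dmabuf, dev);
	if (IS_ERR(mr->attach)) {
		ret = PTR_ERR(mr->attach);
		goto put_buf;
	}

	/* Slow: done once at registration, never on the fast path. */
	mr->sgt = dma_buf_map_attachment(mr->attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(mr->sgt)) {
		ret = PTR_ERR(mr->sgt);
		goto detach;
	}
	return 0;

detach:
	dma_buf_detach(mr->dmabuf, mr->attach);
put_buf:
	dma_buf_put(mr->dmabuf);
	return ret;
}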

> The high-level idea in my mind is that we do basically what is done with our
> control IOCTLs through the core (setting up contexts, etc.) but still create
> a cdev for sending in the lists of pages to pin down, since that is the hot
> path. I assume that cdev could be created and owned by the core rather than
> by the driver directly.

I never really liked that HFI has effectively its own weird
implementation of MRs and ODP hidden in the cdev.

Jason
Dennis Dalessandro June 16, 2023, 4:33 p.m. UTC | #6
On 6/2/23 1:23 PM, Jason Gunthorpe wrote:
> On Fri, Jun 02, 2023 at 01:15:47PM -0400, Dennis Dalessandro wrote:
>>> So you will have to think carefully about what is needed. It is part of why
>>> I don't want to take uAPI changes for incomplete features. I'm
>>> wondering how you will fit dmabuf into hfi1, when I won't be happy if
>>> this is done by adding dmabuf support to the cdev.

The cdev just needs to know what type of memory it's dealing with. We expect
the dmabuf to be allocated and ready to use, just like we would a GPU buffer.
So would you still reject the patch if we sent support for AMD's GPUs instead
of dmabuf, if it's all in-tree and upstream and we have the user code to go
with it?

To me, enabling this support for our cdev shouldn't be that big of a deal;
it's in support of our existing HW. I'm still committed to fixing the cdev
once and for all, but I'm planning that as part of the next-gen HW release. I
just don't see why we need to tie the two things together, especially when
the only uAPI change is to have a way to pass in the type of buffer that is
backing the iovec (and I guess we should bump the version number as well). We
aren't adding any new IOCTLs or anything like that.
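
The kind of uAPI addition being described might look something like this
hypothetical illustration (not the header change that was actually posted):

/*
 * Hypothetical illustration of tagging an SDMA iovec with a memory
 * type. Names, values, and layout are invented, not the real
 * include/uapi/rdma/hfi/hfi1_user.h change.
 */
#include <linux/types.h>

#define HFI1_MEMTYPE_SYSTEM 0	/* ordinary pinned user pages       */
#define HFI1_MEMTYPE_DMABUF 1	/* buffer identified by a dmabuf fd */

struct sdma_iovec_meminfo {
	__u16 memtype;		/* one of HFI1_MEMTYPE_*            */
	__u16 reserved;		/* must be zero                     */
	__u32 fd;		/* dmabuf fd when memtype is DMABUF */
};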

Regardless, specific to this patch, Brendan has reworked it to not touch the
uAPI and just do the refactoring of splitting the page pinning out from the
rest of the SDMA stuff.

-Denny
Jason Gunthorpe June 16, 2023, 7:25 p.m. UTC | #7
On Fri, Jun 16, 2023 at 12:33:40PM -0400, Dennis Dalessandro wrote:
> On 6/2/23 1:23 PM, Jason Gunthorpe wrote:
> > On Fri, Jun 02, 2023 at 01:15:47PM -0400, Dennis Dalessandro wrote:
> >>> So you will have to think carefully about what is needed. It is part of why
> >>> I don't want to take uAPI changes for incomplete features. I'm
> >>> wondering how you will fit dmabuf into hfi1, when I won't be happy if
> >>> this is done by adding dmabuf support to the cdev.
> 
> The cdev just needs to know what type of memory it's dealing with. We expect
> the dmabuf to be allocated and ready to use, just like we would a GPU buffer.
> So would you still reject the patch if we sent support for AMD's GPUs instead
> of dmabuf, if it's all in-tree and upstream and we have the user code to go
> with it?

I don't understand what that even means; how can you support an AMD GPU
without also using DMABUF?

My big problem with this patch is that I can't understand what it really
even does, because it is somehow tied to the HW functionality. That is also
very confusing, because DMABUF is supposed to have general MR-based code
for processing MRs.

We are not able to support DMABUF in "ODP"-like situations, and AFAIK
this hfi stuff is basically a weird version of ODP / some kernel
support for registration caching.

So maybe if you explain it better and more carefully, IDK.

Jason