
[RFC,0/9] A rendezvous module

Message ID 20210319125635.34492-1-kaike.wan@intel.com (mailing list archive)

Message

Wan, Kaike March 19, 2021, 12:56 p.m. UTC
From: Kaike Wan <kaike.wan@intel.com>

RDMA transactions on RC QPs have a high demand for memory for HPC
applications such as MPI, especially on nodes with high CPU core
counts, where a process is normally dispatched to each core and an RC
QP is created for each remote process. For a 100-node fabric, about
10 GB - 100 GB of memory is required for WQEs/buffers/QP state on each
server node. This high demand imposes a severe restriction on the
scalability of HPC fabrics.

A number of solutions have been implemented over the years. UD-based
solutions solve the scalability problem by requiring only one UD QP
per process, which can send a message to or receive a message from any
other process. However, a UD QP does not have the performance of an
RC QP, and the application has to manage segmentation for large
messages.

SRQ reduces the memory demand by sharing a receive queue among multiple
QPs. However, it still requires an RC QP for each remote
process and each RC QP still requires a send queue. In addition, it is
an optional feature.

XRC further reduces the memory demand by allowing a single QP per
process to communicate with any process on a remote node. In this
mode, each process requires only one QP for each remote node. Again,
this is optional and not all vendors support it.

Dynamically Connected transport minimizes the memory usage, but
requires vendor-specific hardware and changes in applications.
Additionally, all of these mechanisms can induce latency jitter due to
the use of more QPs and more QP state, and hence additional PCIe
transactions at scale, where NIC QP/CQ caches overflow.

Based on this, we are proposing a generic, vendor-agnostic approach
that uses RDMA Write as part of a rendezvous protocol to address the
RC scalability problem and potentially improve RDMA performance for
large messages.

Here are the key points of the proposal:
- For each device used by a job on a given node, there are a fixed
  number of RC connections (default: 4) between this node and any
  other remote node, no matter how many processes are running on each
  node for this job. That is, for a given job and device, all the
  processes on a given node will communicate with all the processes of
  the same job on a remote node through the same fixed number of
  connections. This eliminates the increased memory demand caused by
  growing core counts and reduces the overall RC connection setup
  costs.
- A memory region (MR) cache is added to reduce the MR
  registration/deregistration costs, as the same buffers tend to be used
  repeatedly by applications. The MR cache is per process and can be
  accessed only by processes in the same job (see the sketch after this
  list).
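
To illustrate the caching behavior, here is a minimal sketch of the
kind of per-process MR cache lookup rv could perform. The structure
and function names (mrc_entry, mr_cache_acquire, register_mr_with_nic,
etc.) are hypothetical placeholders and are not taken from the
patches; a real implementation would use an interval tree and proper
locking.

/*
 * Hypothetical sketch of a per-process MR cache keyed by
 * (address, length, access flags).  Names are illustrative only.
 */
#include <stdint.h>
#include <stdlib.h>

struct mrc_entry {
	uint64_t addr;
	uint64_t len;
	uint32_t access;
	int refcount;
	void *mr;                /* opaque handle from the verbs layer */
	struct mrc_entry *next;
};

struct mr_cache {
	struct mrc_entry *head;  /* a real cache would use an interval tree */
};

/* Stand-in for the real verbs MR registration call. */
static void *register_mr_with_nic(uint64_t addr, uint64_t len,
				  uint32_t access)
{
	(void)addr; (void)len; (void)access;
	return NULL;
}

static struct mrc_entry *mr_cache_acquire(struct mr_cache *c, uint64_t addr,
					  uint64_t len, uint32_t access)
{
	struct mrc_entry *e;

	for (e = c->head; e; e = e->next) {
		if (e->addr == addr && e->len == len && e->access == access) {
			e->refcount++;   /* cache hit: reuse the cached MR */
			return e;
		}
	}

	e = calloc(1, sizeof(*e));       /* cache miss: register a new MR */
	if (!e)
		return NULL;
	e->addr = addr;
	e->len = len;
	e->access = access;
	e->refcount = 1;
	e->mr = register_mr_with_nic(addr, len, access);
	e->next = c->head;
	c->head = e;
	return e;
}

/*
 * From the caller's view, "deregistration" only drops a reference; the
 * MR stays cached until an MMU notifier reports the buffer went away.
 */
static void mr_cache_release(struct mrc_entry *e)
{
	if (e->refcount > 0)
		e->refcount--;
}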

Here is how we propose to implement this application-enabling
rendezvous module, which takes advantage of the various features of
the RDMA subsystem:
- Create a rendezvous module (rv) under drivers/infiniband/ulp/rv/. This
  can be changed if a better location is recommended.
- The rv module adds a char device file /dev/rv that user
  applications/middleware can open to communicate with the rv module
  (the PSM3 OFI provider in libfabric 1.12.0 being the primary use
  case). The file interface is crucial for associating the job/process
  with the MRs, the connection endpoints, and RDMA requests.
  rv_file_open - Open a handle to the rv module for the current user
                 process.
  rv_file_close - Close the handle and clean up.
  rv_file_ioctl - The main communication method: to attach to a device,
                  to register/deregister MRs, to create connection
                  endpoints, to connect, to query, to get stats, to do
                  RDMA transactions, etc.
  rv_file_mmap - Mmap an event ring into user space for the current user
                 process.
- Basic mode of operation (PSM3 is used as an example for user
  applications; a hedged userspace sketch of this flow follows the
  list):
  - A middleware (like MPI) has out-of-band communication channels
    between any two nodes, which are used to establish high performance
    communications for providers such as PSM3.
  - On each node, the PSM3 provider opens the rv module and issues an
    attach request to bind to a specific local device, specifying
    information specific to the job. This associates the user process
    with the RDMA device and the job.
  - On each node, the PSM3 provider will establish user space low
    latency eager and control message communications, typically via user
    space UD QPs.
  - On each node, the PSM3 provider will mmap a ring buffer for events
    from the rv module.
  - On each node, the PSM3 provider creates a set of connection
    endpoints by passing the destination info to the rv module. If the
    local node is chosen as the listener for the connection (based on
    address comparison between the two nodes), it will start to listen
    for any incoming IB CM connection requests and accept them when they
    arrive. This step associates the shared connection endpoints with
    the job and device.
  - On each node, the PSM3 provider will request connection
    establishment through the rv module. On the client node, rv sends an
    IB CM request to the remote node. On the listener node, nothing
    additional needs to be done for the connect request from the PSM3
    provider.
  - On each node, the PSM3 provider will wait until the RC connection is
    established. The PSM3 provider will query the rv module about the
    connection status periodically.
  - As large MPI messages are requested, the PSM3 provider will request
    rv to register MRs for the MPI application’s send and receive
    buffers.  The rv module makes use of its MR cache to limit how often
    such requests result in verbs MR calls to the NIC.
    The PSM3 provider's control message mechanisms are used to exchange
    IO virtual addresses and rkeys for such buffers and MRs.
  - For RDMA transactions, the RDMA WRITE WITH IMMEDIATE request will
    be used. The immediate data will contain information about the
    target process on the remote node and which outstanding rendezvous
    message this RDMA is for. For send and receive completions, an
    event will be posted into the event ring buffer, and the PSM3
    provider can poll for completion.
  - As RDMA transactions complete, the PSM3 provider will indicate
    completion of the corresponding MPI send/receive to the MPI
    middleware/application. The PSM3 provider will also inform the rv
    module that it may deregister the MR.  The deregistration allows rv
    to cache the MR for use in future requests by the PSM3 provider to
    register the same memory for another MPI message. The cached MR
    will be removed if the user buffer is freed and an MMU notifier
    event is received by the rv module.
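
For clarity, below is a minimal userspace sketch of the flow above.
The ioctl command names, structure layouts, device name and
immediate-data encoding are hypothetical placeholders (the real ABI is
defined in include/uapi/rdma/rv_user_ioctls.h); the sketch only
illustrates the sequence of open/attach, mmap of the event ring, MR
registration and an RDMA write with immediate data. Error handling is
omitted.

/* Hypothetical usage sketch; the RV_IOCTL_* names and structs below
 * are placeholders, NOT the real ABI from rv_user_ioctls.h. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/ioctl.h>        /* _IOW/_IOWR */

struct rv_attach_params { char dev_name[64]; uint64_t job_key; };
struct rv_mr_params     { uint64_t addr, length; uint32_t access, rkey; };
struct rv_post_write    { uint64_t loc_addr, rem_addr, length;
			  uint32_t rkey, imm_data; };

#define RV_IOCTL_ATTACH     _IOW('r', 1, struct rv_attach_params)  /* placeholder */
#define RV_IOCTL_REG_MR     _IOWR('r', 2, struct rv_mr_params)     /* placeholder */
#define RV_IOCTL_POST_WRITE _IOW('r', 3, struct rv_post_write)     /* placeholder */

/* Pack the target process index and the outstanding rendezvous message
 * id into the 32-bit immediate data (illustrative encoding only). */
static uint32_t rv_imm(uint16_t target_proc, uint16_t rndv_id)
{
	return ((uint32_t)target_proc << 16) | rndv_id;
}

int main(void)
{
	struct rv_attach_params attach = { .dev_name = "rdma0", .job_key = 42 };
	struct rv_mr_params mr = { .addr = 0x10000, .length = 1 << 20 };
	struct rv_post_write wr = { 0 };
	volatile uint64_t *events;
	int fd;

	fd = open("/dev/rv", O_RDWR);            /* rv_file_open */
	ioctl(fd, RV_IOCTL_ATTACH, &attach);     /* bind to device and job */

	/* Map the completion event ring (rv_file_mmap). */
	events = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

	ioctl(fd, RV_IOCTL_REG_MR, &mr);         /* register MR (or hit cache) */

	/* The remote IO virtual address and rkey would have been exchanged
	 * over the provider's eager/control channel (placeholder values). */
	wr.loc_addr = mr.addr;
	wr.rem_addr = 0x20000;
	wr.length = mr.length;
	wr.rkey = 0x1234;
	wr.imm_data = rv_imm(3 /* target proc */, 7 /* rendezvous msg */);
	ioctl(fd, RV_IOCTL_POST_WRITE, &wr);     /* RDMA WRITE WITH IMMEDIATE */

	/* ... poll 'events' for send/recv completion, then release the MR ... */

	munmap((void *)events, 4096);
	close(fd);
	return 0;
}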

Please comment,

Thank you,

Kaike

Kaike Wan (9):
  RDMA/rv: Public interface for the RDMA Rendezvous module
  RDMA/rv: Add the internal header files
  RDMA/rv: Add the rv module
  RDMA/rv: Add functions for memory region cache
  RDMA/rv: Add function to register/deregister memory region
  RDMA/rv: Add connection management functions
  RDMA/rv: Add functions for RDMA transactions
  RDMA/rv: Add functions for file operations
  RDMA/rv: Integrate the file operations into the rv module

 MAINTAINERS                                |    6 +
 drivers/infiniband/Kconfig                 |    1 +
 drivers/infiniband/core/core_priv.h        |    4 +
 drivers/infiniband/core/rdma_core.c        |  237 ++
 drivers/infiniband/core/uverbs_cmd.c       |   54 +-
 drivers/infiniband/ulp/Makefile            |    1 +
 drivers/infiniband/ulp/rv/Kconfig          |   11 +
 drivers/infiniband/ulp/rv/Makefile         |    9 +
 drivers/infiniband/ulp/rv/rv.h             |  892 ++++++
 drivers/infiniband/ulp/rv/rv_conn.c        | 3037 ++++++++++++++++++++
 drivers/infiniband/ulp/rv/rv_file.c        | 1196 ++++++++
 drivers/infiniband/ulp/rv/rv_main.c        |  271 ++
 drivers/infiniband/ulp/rv/rv_mr.c          |  393 +++
 drivers/infiniband/ulp/rv/rv_mr_cache.c    |  428 +++
 drivers/infiniband/ulp/rv/rv_mr_cache.h    |  152 +
 drivers/infiniband/ulp/rv/rv_rdma.c        |  416 +++
 drivers/infiniband/ulp/rv/trace.c          |    7 +
 drivers/infiniband/ulp/rv/trace.h          |   10 +
 drivers/infiniband/ulp/rv/trace_conn.h     |  547 ++++
 drivers/infiniband/ulp/rv/trace_dev.h      |   82 +
 drivers/infiniband/ulp/rv/trace_mr.h       |  109 +
 drivers/infiniband/ulp/rv/trace_mr_cache.h |  181 ++
 drivers/infiniband/ulp/rv/trace_rdma.h     |  254 ++
 drivers/infiniband/ulp/rv/trace_user.h     |  321 +++
 include/rdma/uverbs_types.h                |   10 +
 include/uapi/rdma/rv_user_ioctls.h         |  558 ++++
 26 files changed, 9139 insertions(+), 48 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rv/Kconfig
 create mode 100644 drivers/infiniband/ulp/rv/Makefile
 create mode 100644 drivers/infiniband/ulp/rv/rv.h
 create mode 100644 drivers/infiniband/ulp/rv/rv_conn.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_file.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_main.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_mr.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_mr_cache.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_mr_cache.h
 create mode 100644 drivers/infiniband/ulp/rv/rv_rdma.c
 create mode 100644 drivers/infiniband/ulp/rv/trace.c
 create mode 100644 drivers/infiniband/ulp/rv/trace.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_conn.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_dev.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_mr.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_mr_cache.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_rdma.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_user.h
 create mode 100644 include/uapi/rdma/rv_user_ioctls.h

Comments

Jason Gunthorpe March 19, 2021, 1:53 p.m. UTC | #1
On Fri, Mar 19, 2021 at 08:56:26AM -0400, kaike.wan@intel.com wrote:

> - Basic mode of operations (PSM3 is used as an example for user
>   applications):
>   - A middleware (like MPI) has out-of-band communication channels
>     between any two nodes, which are used to establish high performance
>     communications for providers such as PSM3.

Huh? Doesn't PSM3 already use its own special non-verbs char devices
that already have memory caches and other stuff? Now you want to throw
that all away and do yet another char dev just for HFI? Why?

I also don't know why you picked the name rv, this looks like it has
little to do with the usual MPI rendezvous protocol. This is all about
bulk transfers. It is actually a lot like RDS. Maybe you should be
using RDS?

Jason
Wan, Kaike March 19, 2021, 2:49 p.m. UTC | #2
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, March 19, 2021 9:53 AM
> To: Wan, Kaike <kaike.wan@intel.com>
> Cc: dledford@redhat.com; linux-rdma@vger.kernel.org; Rimmer, Todd
> <todd.rimmer@intel.com>
> Subject: Re: [PATCH RFC 0/9] A rendezvous module
> 
> On Fri, Mar 19, 2021 at 08:56:26AM -0400, kaike.wan@intel.com wrote:
> 
> > - Basic mode of operations (PSM3 is used as an example for user
> >   applications):
> >   - A middleware (like MPI) has out-of-band communication channels
> >     between any two nodes, which are used to establish high performance
> >     communications for providers such as PSM3.
> 
> Huh? Doesn't PSM3 already use it's own special non-verbs char devices that
> already have memory caches and other stuff? Now you want to throw that
> all away and do yet another char dev just for HFI? Why?
[Wan, Kaike] I think that you are referring to PSM2, which uses the OPA hfi1 driver that is specific to the OPA hardware.
PSM3 uses standard verbs drivers and supports standard RoCE.  A focus is the Intel RDMA Ethernet NICs. As such, it cannot use the hfi1 driver through the special PSM2 interface; rather, it works with the hfi1 driver through the standard verbs interface. The rv module was a new design to bring these concepts to standard transports and hardware.

> 
> I also don't know why you picked the name rv, this looks like it has little to do
> with the usual MPI rendezvous protocol. This is all about bulk transfers. It is
> actually a lot like RDS. Maybe you should be using RDS?
[Wan, Kaike] While there are similarities in concepts, the details are different.  Quite frankly, this could be viewed as an application accelerator, much like RDS served that purpose for Oracle, which continues to be its main use case. The rv module currently targets enabling the MPI/OFI/PSM3 application stack.

The name "rv" is chosen simply because this module is designed to enable the rendezvous protocol of the MPI/OFI/PSM3 application stack for large messages. Short messages are handled by eager transfer through UDP in PSM3.

FYI, there is an OFA presentation from Thursday reviewing PSM3 and RV and covering much of the architecture and rationale.
> 
> Jason
Jason Gunthorpe March 19, 2021, 3:48 p.m. UTC | #3
On Fri, Mar 19, 2021 at 02:49:29PM +0000, Wan, Kaike wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, March 19, 2021 9:53 AM
> > To: Wan, Kaike <kaike.wan@intel.com>
> > Cc: dledford@redhat.com; linux-rdma@vger.kernel.org; Rimmer, Todd
> > <todd.rimmer@intel.com>
> > Subject: Re: [PATCH RFC 0/9] A rendezvous module
> > 
> > On Fri, Mar 19, 2021 at 08:56:26AM -0400, kaike.wan@intel.com wrote:
> > 
> > > - Basic mode of operations (PSM3 is used as an example for user
> > >   applications):
> > >   - A middleware (like MPI) has out-of-band communication channels
> > >     between any two nodes, which are used to establish high performance
> > >     communications for providers such as PSM3.
> > 
> > Huh? Doesn't PSM3 already use it's own special non-verbs char devices that
> > already have memory caches and other stuff? Now you want to throw that
> > all away and do yet another char dev just for HFI? Why?

> [Wan, Kaike] I think that you are referring to PSM2, which uses the
> OPA hfi1 driver that is specific to the OPA hardware.  PSM3 uses
> standard verbs drivers and supports standard RoCE.  

Uhhh.. "PSM" has always been about the ipath special char device, and
if I recall properly the library was semi-discontinued and merged into
libfabric.

So here you are talking about a libfabric verbs provider that doesn't
use the ipath style char interface but uses verbs and this rv thing so
we call it a libfabric PSM3 provider because thats not confusing to
anyone at all..

> A focus is the Intel RDMA Ethernet NICs. As such it cannot use the
> hfi1 driver through the special PSM2 interface. 

These are the drivers that aren't merged yet, I see. So why are you
sending this now? I'm not interested to look at even more Intel code
when their driver saga is still ongoing for years.

> Rather it works with the hfi1 driver through standard verbs
> interface.

But nobody would do that, right? You'd get better results using the
hfi1 native interfaces instead of their slow fake verbs stuff.

> > I also don't know why you picked the name rv, this looks like it has little to do
> > with the usual MPI rendezvous protocol. This is all about bulk transfers. It is
> > actually a lot like RDS. Maybe you should be using RDS?

> [Wan, Kaike] While there are similarities in concepts, details are
> different.  

You should list these differences.

> Quite frankly this could be viewed as an application accelerator
> much like RDS served that purpose for Oracle, which continues to be
> its main use case.

Obviously, except it seems to be doing the same basic acceleration
technique as RDS.

> The name "rv" is chosen simply because this module is designed to
> enable the rendezvous protocol of the MPI/OFI/PSM3 application stack
> for large messages. Short messages are handled by eager transfer
> through UDP in PSM3.

A bad name seems like it will further limit potential re-use of this
code.

Jason
Dennis Dalessandro March 19, 2021, 7:22 p.m. UTC | #4
On 3/19/2021 11:48 AM, Jason Gunthorpe wrote:
> On Fri, Mar 19, 2021 at 02:49:29PM +0000, Wan, Kaike wrote:
>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>> Sent: Friday, March 19, 2021 9:53 AM
>>> To: Wan, Kaike <kaike.wan@intel.com>
>>> Cc: dledford@redhat.com; linux-rdma@vger.kernel.org; Rimmer, Todd
>>> <todd.rimmer@intel.com>
>>> Subject: Re: [PATCH RFC 0/9] A rendezvous module
>>>
>>> On Fri, Mar 19, 2021 at 08:56:26AM -0400, kaike.wan@intel.com wrote:
>>>
>>>> - Basic mode of operations (PSM3 is used as an example for user
>>>>    applications):
>>>>    - A middleware (like MPI) has out-of-band communication channels
>>>>      between any two nodes, which are used to establish high performance
>>>>      communications for providers such as PSM3.
>>>
>>> Huh? Doesn't PSM3 already use it's own special non-verbs char devices that
>>> already have memory caches and other stuff? Now you want to throw that
>>> all away and do yet another char dev just for HFI? Why?
> 
>> [Wan, Kaike] I think that you are referring to PSM2, which uses the
>> OPA hfi1 driver that is specific to the OPA hardware.  PSM3 uses
>> standard verbs drivers and supports standard RoCE.
> 
> Uhhh.. "PSM" has always been about the ipath special char device, and
> if I recall properly the library was semi-discontinued and merged into
> libfabric.

This driver is intended to work with a fork of the PSM2 library. The 
PSM2 library which is for Omni-Path is now maintained by Cornelis 
Networks on our GitHub. PSM3 is something from Intel for Ethernet. I 
know it's a bit confusing.

> So here you are talking about a libfabric verbs provider that doesn't
> use the ipath style char interface but uses verbs and this rv thing so
> we call it a libfabric PSM3 provider because thats not confusing to
> anyone at all..
> 
>> A focus is the Intel RDMA Ethernet NICs. As such it cannot use the
>> hfi1 driver through the special PSM2 interface.
> 
> These are the drivers that aren't merged yet, I see. So why are you
> sending this now? I'm not interested to look at even more Intel code
> when their driver saga is still ongoing for years.
> 
>> Rather it works with the hfi1 driver through standard verbs
>> interface.
> 
> But nobody would do that right? You'd get better results using the
> hif1 native interfaces instead of their slow fake verbs stuff.

I can't imagine why. I'm not sure what you mean by our slow fake verbs 
stuff? We support verbs just fine. It's certainly not fake.

>>> I also don't know why you picked the name rv, this looks like it has little to do
>>> with the usual MPI rendezvous protocol. This is all about bulk transfers. It is
>>> actually a lot like RDS. Maybe you should be using RDS?
.
.
>> The name "rv" is chosen simply because this module is designed to
>> enable the rendezvous protocol of the MPI/OFI/PSM3 application stack
>> for large messages. Short messages are handled by eager transfer
>> through UDP in PSM3.
> 
> A bad name seems like it will further limit potential re-use of this
> code.

As to the name, I started this driver when I was still at Intel before
handing the reins to Kaike when we spun out. I plucked the name rv out 
of the air without much consideration. While the code seems to have 
changed a lot, the name seems to have stuck. It was as Kaike mentions 
because it helps enable the rendezvous side of PSM3.

-Denny
Jason Gunthorpe March 19, 2021, 7:44 p.m. UTC | #5
On Fri, Mar 19, 2021 at 03:22:45PM -0400, Dennis Dalessandro wrote:

> > > [Wan, Kaike] I think that you are referring to PSM2, which uses the
> > > OPA hfi1 driver that is specific to the OPA hardware.  PSM3 uses
> > > standard verbs drivers and supports standard RoCE.
> > 
> > Uhhh.. "PSM" has always been about the ipath special char device, and
> > if I recall properly the library was semi-discontinued and merged into
> > libfabric.
> 
> This driver is intended to work with a fork of the PSM2 library. The PSM2
> library which is for Omni-Path is now maintained by Cornelis Networks on our
> GitHub. PSM3 is something from Intel for Ethernet. I know it's a bit
> confusing.

"a bit" huh?

> > So here you are talking about a libfabric verbs provider that doesn't
> > use the ipath style char interface but uses verbs and this rv thing so
> > we call it a libfabric PSM3 provider because thats not confusing to
> > anyone at all..
> > 
> > > A focus is the Intel RDMA Ethernet NICs. As such it cannot use the
> > > hfi1 driver through the special PSM2 interface.
> > 
> > These are the drivers that aren't merged yet, I see. So why are you
> > sending this now? I'm not interested to look at even more Intel code
> > when their driver saga is still ongoing for years.
> > 
> > > Rather it works with the hfi1 driver through standard verbs
> > > interface.
> > 
> > But nobody would do that right? You'd get better results using the
> > hif1 native interfaces instead of their slow fake verbs stuff.
> 
> I can't imagine why. I'm not sure what you mean by our slow fake verbs
> stuff? We support verbs just fine. It's certainly not fake.

hfi1 calls to the kernel for data path operations - that is "fake" in
my book. Verbs was always about avoiding that kernel transition, to
put it back in betrays the spirit. So a kernel call for rv, or the hfi
cdev, or the verbs post-send is really all a wash.

I didn't understand your answer, do you see using this with hfi1 or
not? 

It looks a lot copy&pasted from the hfi1 driver, so now we are on our
third copy of this code :(

And why did it suddenly become a ULP that somehow shares uverbs
resources?? I'm pretty skeptical that can be made to work correctly..

Jason
Rimmer, Todd March 19, 2021, 8:12 p.m. UTC | #6
> hfi1 calls to the kernel for data path operations - that is "fake" in my book. Verbs was always about avoiding that kernel transition, to put it back in betrays the spirit. So a kernel call for rv, or the hfi cdev, or the verbs post-send is really all a wash.

To be clear, different vendors have different priorities and hence different HW designs and approaches.  hfi1 approached the HPC latency needs with a uniquely scalable approach with very low latency at scale. Ironically, other vendors have since replicated some of those mechanisms with their own proprietary mechanisms, such as UD-X.  hfi1 approached storage needs and large message HPC transfers with a direct data placement mechanism (aka RDMA).  It fully supported the verbs API and met the performance needs of its customers with attractive price and power for its target markets.

Kaike cited hfi1 as just one example of an RDMA verbs device that rv can function with.  Various network and server vendors will each make their own recommendations for software configs, and various end users will each have their own preferences and requirements.

Todd
Dennis Dalessandro March 19, 2021, 8:18 p.m. UTC | #7
On 3/19/2021 3:44 PM, Jason Gunthorpe wrote:
> On Fri, Mar 19, 2021 at 03:22:45PM -0400, Dennis Dalessandro wrote:
> 
>>>> [Wan, Kaike] I think that you are referring to PSM2, which uses the
>>>> OPA hfi1 driver that is specific to the OPA hardware.  PSM3 uses
>>>> standard verbs drivers and supports standard RoCE.
>>>
>>> Uhhh.. "PSM" has always been about the ipath special char device, and
>>> if I recall properly the library was semi-discontinued and merged into
>>> libfabric.
>>
>> This driver is intended to work with a fork of the PSM2 library. The PSM2
>> library which is for Omni-Path is now maintained by Cornelis Networks on our
>> GitHub. PSM3 is something from Intel for Ethernet. I know it's a bit
>> confusing.
> 
> "a bit" huh?

Just a bit. :)

> 
>>> So here you are talking about a libfabric verbs provider that doesn't
>>> use the ipath style char interface but uses verbs and this rv thing so
>>> we call it a libfabric PSM3 provider because thats not confusing to
>>> anyone at all..
>>>
>>>> A focus is the Intel RDMA Ethernet NICs. As such it cannot use the
>>>> hfi1 driver through the special PSM2 interface.
>>>
>>> These are the drivers that aren't merged yet, I see. So why are you
>>> sending this now? I'm not interested to look at even more Intel code
>>> when their driver saga is still ongoing for years.
>>>
>>>> Rather it works with the hfi1 driver through standard verbs
>>>> interface.
>>>
>>> But nobody would do that right? You'd get better results using the
>>> hif1 native interfaces instead of their slow fake verbs stuff.
>>
>> I can't imagine why. I'm not sure what you mean by our slow fake verbs
>> stuff? We support verbs just fine. It's certainly not fake.
> 
> hfi1 calls to the kernel for data path operations - that is "fake" in
> my book. Verbs was always about avoiding that kernel transition, to
> put it back in betrays the spirit. So a kernel call for rv, or the hfi
> cdev, or the verbs post-send is really all a wash.

Probably better to argue that in another thread I guess.

> I didn't understand your answer, do you see using this with hfi1 or
> not?

I don't see how this could ever use hfi1. So no.

> It looks a lot copy&pasted from the hfi1 driver, so now we are on our
> third copy of this code :(

I haven't had a chance to look beyond the cover letter in depth at how 
things have changed. I really hope it's not that bad.

> And why did it suddenly become a ULP that somehow shares uverbs
> resources?? I'm pretty skeptical that can be made to work correctly..

I see your point.

I was just providing some background/clarification. Bottom line this has 
nothing to do with hfi1 or OPA or PSM2.

-Denny
Jason Gunthorpe March 19, 2021, 8:26 p.m. UTC | #8
On Fri, Mar 19, 2021 at 08:12:18PM +0000, Rimmer, Todd wrote:
> > hfi1 calls to the kernel for data path operations - that is "fake" in my book. Verbs was always about avoiding that kernel transition, to put it back in betrays the spirit. > So a kernel call for rv, or the hfi cdev, or the verbs post-send is really all a wash.
> 
> To be clear, different vendors have different priorities and hence
> different HW designs and approaches.  hfi1 approached the HPC
> latency needs with a uniquely scalable approach with very low
> latency @scale. 

Yes, we all know the marketing spin the vendors like to use here. A
kernel transition is a kernel transition, and the one in the HFI verbs
path through all the common code is particularly expensive.

I'm surprised to hear someone advocate that this is a good thing when
we were all told that the hfi1 cdev *must* exist because the kernel
transition through verbs was far too expensive.

> Ironically, other vendors have since replicated some
> of those mechanisms with their own proprietary mechanisms, such as
> UD-X.  

What is a UD-X?

> Kaike sited hfi1 as just one example of a RDMA verbs device which rv
> can function for.

rv seems to completely destroy a lot of the HPC performance offloads
that vendors are layering on RC QPs (see what hns and mlx5 are doing),
so I have doubts about this.

It would be better if you could get at least one other device to agree
this is reasonable.

Or work with something that already exists for bulk messaging like
RDS.

Jason
Jason Gunthorpe March 19, 2021, 8:30 p.m. UTC | #9
On Fri, Mar 19, 2021 at 04:18:56PM -0400, Dennis Dalessandro wrote:

> I was just providing some background/clarification. Bottom line this has
> nothing to do with hfi1 or OPA or PSM2.

Well, I hope your two companies can agree on naming if Cornelis
someday needs a PSM3 for it's HW.

Jason
Hefty, Sean March 19, 2021, 8:34 p.m. UTC | #10
> > > I also don't know why you picked the name rv, this looks like it has little to do
> > > with the usual MPI rendezvous protocol. This is all about bulk transfers. It is
> > > actually a lot like RDS. Maybe you should be using RDS?
> 
> > [Wan, Kaike] While there are similarities in concepts, details are
> > different.
> 
> You should list these differences.
> 
> > Quite frankly this could be viewed as an application accelerator
> > much like RDS served that purpose for Oracle, which continues to be
> > its main use case.
> 
> Obviously, except it seems to be doing the same basic acceleration
> technique as RDS.

A better name for this might be "scalable RDMA service", with RDMA meaning the transport operation.  My understanding is this is intended to be usable over any IB/RoCE device.

I'm not familiar enough with the details of RDS or this service to know the differences.

- Sean
Rimmer, Todd March 19, 2021, 8:46 p.m. UTC | #11
> I'm suprirsed to hear someone advocate that is a good thing when we were
> all told that the hfi1 cdev *must* exist because the kernel transition through
> verbs was far to expensive.
It depends on the goal vendors have with verbs vs other APIs such as libfabric.  hfi1's verbs goal was focused on storage bandwidth, and the cdev was focused on HPC latency and bandwidth for MPI via PSM2 and libfabric.  I'm unclear why we are debating hfi1 here; it seems it should be in another thread.

> What is a UD-X?
UD-X is a vendor-specific set of HW interfaces and wire protocols implemented in UCX for the NVIDIA ConnectX series of network devices.  Many of its concepts are very similar to those which ipath and hfi1 HW and software implemented.

> rv seems to completely destroy alot of the HPC performance offloads that
> vendors are layering on RC QPs
Different vendors have different approaches to performance and choose different design trade-offs.

Todd
Jason Gunthorpe March 19, 2021, 8:54 p.m. UTC | #12
On Fri, Mar 19, 2021 at 08:46:56PM +0000, Rimmer, Todd wrote:
> > I'm suprirsed to hear someone advocate that is a good thing when we were
> > all told that the hfi1 cdev *must* exist because the kernel transition through
> > verbs was far to expensive.
>
> It depends on the goal vendors have with verbs vs other APIs such as
> libfabric.  hfi1's verbs goal was focused on storage bandwidth and
> the cdev was focused on HPC latency and bandwidth for MPI via PSM2
> and libfabric.  I'm unclear why we are debating hfi1 here, seems it
> should be in another thread.

Because this is copied from hfi1?

> > What is a UD-X?
> UD-X is a vendor specific set of HW interfaces and wire protocols
> implemented in UCX for nVidia Connect-X series of network devices.
> Many of it's concepts are very similar to those which ipath and hfi1
> HW and software implemented.

Oh, there is lots of stuff in UCX; I'm not surprised you see
similarities to what PSM did, since psm/libfabric/ucx are all solving
the same problems.

> > rv seems to completely destroy alot of the HPC performance offloads that
> > vendors are layering on RC QPs

> Different vendors have different approaches to performance and chose
> different design trade-offs.

That isn't my point, by limiting the usability you also restrict the
drivers where this would meaningfully be useful.

So far we now know that it is not useful for mlx5 or hfi1, that leaves
only hns unknown and still in the HPC arena.

Jason
Wan, Kaike March 19, 2021, 8:59 p.m. UTC | #13
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, March 19, 2021 4:55 PM
> To: Rimmer, Todd <todd.rimmer@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>; Wan,
> Oh, there is lots of stuff in UCX, I'm not surprised you similarities to what
> PSM did since psm/libfabric/ucx are all solving the same problems.
> 
> > > rv seems to completely destroy alot of the HPC performance offloads
> > > that vendors are layering on RC QPs
> 
> > Different vendors have different approaches to performance and chose
> > different design trade-offs.
> 
> That isn't my point, by limiting the usability you also restrict the drivers where
> this would meaningfully be useful.
> 
> So far we now know that it is not useful for mlx5 or hfi1, that leaves only hns
> unknown and still in the HPC arena.
[Wan, Kaike] Incorrect. The rv module works with hfi1.
> 
> Jason
Dennis Dalessandro March 19, 2021, 9:28 p.m. UTC | #14
On 3/19/2021 4:59 PM, Wan, Kaike wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Friday, March 19, 2021 4:55 PM
>> To: Rimmer, Todd <todd.rimmer@intel.com>
>> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>; Wan,
>> Oh, there is lots of stuff in UCX, I'm not surprised you similarities to what
>> PSM did since psm/libfabric/ucx are all solving the same problems.
>>
>>>> rv seems to completely destroy alot of the HPC performance offloads
>>>> that vendors are layering on RC QPs
>>
>>> Different vendors have different approaches to performance and chose
>>> different design trade-offs.
>>
>> That isn't my point, by limiting the usability you also restrict the drivers where
>> this would meaningfully be useful.
>>
>> So far we now know that it is not useful for mlx5 or hfi1, that leaves only hns
>> unknown and still in the HPC arena.
> [Wan, Kaike] Incorrect. The rv module works with hfi1.

Interesting. I was thinking the opposite. So what's the benefit? When 
would someone want to do that?

-Denny
Wan, Kaike March 19, 2021, 9:58 p.m. UTC | #15
> From: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
> Sent: Friday, March 19, 2021 5:28 PM
> To: Wan, Kaike <kaike.wan@intel.com>; Jason Gunthorpe <jgg@nvidia.com>;
> Rimmer, Todd <todd.rimmer@intel.com>
> Cc: dledford@redhat.com; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH RFC 0/9] A rendezvous module
> 
> On 3/19/2021 4:59 PM, Wan, Kaike wrote:
> >> From: Jason Gunthorpe <jgg@nvidia.com>
> >> Sent: Friday, March 19, 2021 4:55 PM
> >> To: Rimmer, Todd <todd.rimmer@intel.com>
> >> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>;
> >> Wan, Oh, there is lots of stuff in UCX, I'm not surprised you
> >> similarities to what PSM did since psm/libfabric/ucx are all solving the
> same problems.
> >>
> >>>> rv seems to completely destroy alot of the HPC performance offloads
> >>>> that vendors are layering on RC QPs
> >>
> >>> Different vendors have different approaches to performance and chose
> >>> different design trade-offs.
> >>
> >> That isn't my point, by limiting the usability you also restrict the
> >> drivers where this would meaningfully be useful.
> >>
> >> So far we now know that it is not useful for mlx5 or hfi1, that
> >> leaves only hns unknown and still in the HPC arena.
> > [Wan, Kaike] Incorrect. The rv module works with hfi1.
> 
> Interesting. I was thinking the opposite. So what's the benefit? When would
> someone want to do that?
[Wan, Kaike] This is only because rv works with the generic Verbs interface and hfi1 happens to be one of the devices that implements the Verbs interface.
> 
> -Denny
Jason Gunthorpe March 19, 2021, 10:35 p.m. UTC | #16
On Fri, Mar 19, 2021 at 09:58:21PM +0000, Wan, Kaike wrote:

> > On 3/19/2021 4:59 PM, Wan, Kaike wrote:
> > >> From: Jason Gunthorpe <jgg@nvidia.com>
> > >> Sent: Friday, March 19, 2021 4:55 PM
> > >> To: Rimmer, Todd <todd.rimmer@intel.com>
> > >> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>;
> > >> Wan, Oh, there is lots of stuff in UCX, I'm not surprised you
> > >> similarities to what PSM did since psm/libfabric/ucx are all solving the
> > same problems.
> > >>
> > >>>> rv seems to completely destroy alot of the HPC performance offloads
> > >>>> that vendors are layering on RC QPs
> > >>
> > >>> Different vendors have different approaches to performance and chose
> > >>> different design trade-offs.
> > >>
> > >> That isn't my point, by limiting the usability you also restrict the
> > >> drivers where this would meaningfully be useful.
> > >>
> > >> So far we now know that it is not useful for mlx5 or hfi1, that
> > >> leaves only hns unknown and still in the HPC arena.
> > > [Wan, Kaike] Incorrect. The rv module works with hfi1.
> > 
> > Interesting. I was thinking the opposite. So what's the benefit? When would
> > someone want to do that?

> [Wan, Kaike] This is only because rv works with the generic Verbs
> interface and hfi1 happens to be one of the devices that implements
> the Verbs interface.

Well, it works with everything (except EFA), but that is not my point.

I said it is not useful.

Jason
Rimmer, Todd March 19, 2021, 10:57 p.m. UTC | #17
> > [Wan, Kaike] Incorrect. The rv module works with hfi1.
> 
> Interesting. I was thinking the opposite. So what's the benefit? When would
> someone want to do that?
The more interesting scenario is for customers who would like to run libfabric and other Open Fabrics Alliance software over various verbs-capable hardware.
Today PSM2 is a good choice for OPA hardware.  However, for some other devices without existing libfabric providers, rxm and rxd are the best choices.
As was presented at the Open Fabrics workshop today by James Erwin, PSM3 offers noticeable benefits over the existing libfabric rxm and rxd providers,
and the rv module offers noticeable performance benefits when using PSM3.

> This driver is intended to work with a fork of the PSM2 library. The
> PSM2 library which is for Omni-Path is now maintained by Cornelis
> Networks on our GitHub. PSM3 is something from Intel for Ethernet. I
> know it's a bit confusing.
Intel retains various IP and trademark rights.  Intel's marketing team analyzed and chose the name PSM3.  Obviously there are pluses and minuses to any name choice.

This is not unlike other industry software history, where new major revisions often add and remove support for various HW generations.
PSM(1) - supported InfiniPath IB adapters; was a standalone API (various forms).
PSM2 - dropped support for InfiniPath and IB and added support for Omni-Path, along with various features; also added libfabric support.
PSM3 - dropped support for Omni-Path, added support for RoCE and verbs-capable devices, along with other features;
	also dropped the PSM2 API and standardized on libfabric.
All three have similar strategies of onload protocols for eager messages and shared kernel/HW resources for large messages
and direct data placement (RDMA).  So the name Performance Scaled Messaging is meant to reflect the concept and approach
as opposed to reflecting a specific HW implementation or even API.

PSM3 is only available as a libfabric provider.

> I haven't had a chance to look beyond the cover letter in depth at how things
> have changed. I really hope it's not that bad.
A few stylistic elements got carried forward, as you noticed, but this is much different from hfi1, as it doesn't directly access hardware and is hence smaller.
We carefully looked at overlap with features in ib_core, and the patch set contains a couple of minor API additions to ib_core to simplify some operations
which others may find useful.

> I also don't know why you picked the name rv, this looks like it has little to do with the usual MPI rendezvous protocol.
The focus of the design was to support the bulk transfer part of the MPI rendezvous protocol, hence the name rv.
We'd welcome other name suggestions; we wanted to keep the name simple and brief.

> No pre-adding reserved stuff
> Lots of alignment holes, don't do that either.
We'd like advice on a challenging situation.  Some customers desire NICs to support nVidia GPUs in some environments.
Unfortunately the nVidia GPU drivers are not upstream, and have not been for years.  So we are forced to have both out-of-tree
and upstream versions of the code.  We need the same applications to be able to work over both, so we would like the
GPU-enabled versions of the code to have the same ABI as the upstream code, as this greatly simplifies things.
We have removed all GPU-specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
mechanisms to hold places for GPU-specific fields which can't be upstreamed.

Todd
Jason Gunthorpe March 19, 2021, 11:06 p.m. UTC | #18
On Fri, Mar 19, 2021 at 10:57:20PM +0000, Rimmer, Todd wrote:

> We'd like advise on a challenging situation.  Some customers desire
> NICs to support nVidia GPUs in some environments. Unfortunately the
> nVidia GPU drivers are not upstream, and have not been for years.

Yep.

> So we are forced to have both out of tree and upstream versions of
> the code.

Sure

>  We need the same applications to be able to work over
> both, so we would like the GPU enabled versions of the code to have
> the same ABI as the upstream code as this greatly simplifies things.

The only answer is to get it upstream past community review before you
ship anything to customers.

> We have removed all GPU specific code from the upstream submission,
> but used both the "alignment holes" and the "reserved" mechanisms to
> hold places for GPU specific fields which can't be upstreamed.

Sorry, nope, you have to do something else. We will *not* have garbage
in upstream or be restricted in any way based on out of tree code.

Jason
Dennis Dalessandro March 20, 2021, 4:39 p.m. UTC | #19
On 3/19/2021 6:57 PM, Rimmer, Todd wrote:
>>> [Wan, Kaike] Incorrect. The rv module works with hfi1.
>>
>> Interesting. I was thinking the opposite. So what's the benefit? When would
>> someone want to do that?
> The more interesting scenario is for customers who would like to run libfabric and other Open Fabrics Alliance software over various verbs capable hardware.

Ah ok that makes sense. Not that running it over hfi1 is the goal but 
being able to run over verbs devices. Makes sense to me now.

> Today PSM2 is a good choice for OPA hardware.  However for some other devices without existing libfabric providers, rxm and rxd are the best choices.
> As was presented in Open Fabrics workshop today by James Erwin, PSM3 offers noticeable benefits over existing libfabric rxm and rxd providers
> and the rv module offers noticeable performance benefits when using PSM3.

For those that haven't seen it, the talks will be posted to YouTube
and/or the OpenFabrics.org web page. There are actually two talks on
this stuff. The first, by Todd, is available now [1]; James' talk will
be up soon, I'm sure.

>> I haven't had a chance to look beyond the cover letter in depth at how things
>> have changed. I really hope it's not that bad.
> While a few stylistic elements got carried forward, as you noticed.  This is much different from hfi1 as it doesn't directly access hardware and is hence smaller.
> We carefully looked at overlap with features in ib_core and the patch set contains a couple minor API additions to ib_core to simplify some operations
> which others may find useful.

Right, so if there is common functionality between hfi1 and rv then it
might belong in the core, especially if it's something that's common
between a ULP and a HW driver.

>> I also don't know why you picked the name rv, this looks like it has little to do with the usual MPI rendezvous protocol.
> The focus of the design was to support the bulk transfer part of the MPI rendezvous protocol, hence the name rv.
> We'd welcome other name suggestions, wanted to keep the name simple and brief.

Like I said previously you can place the blame for the name on me. Kaike 
and Todd just carried it forward. I think Sean had an idea in one of the 
other replies. Let's hear some other suggestions too.

>> No pre-adding reserved stuff
>> Lots of alignment holes, don't do that either.
> We'd like advise on a challenging situation.  Some customers desire NICs to support nVidia GPUs in some environments.
> Unfortunately the nVidia GPU drivers are not upstream, and have not been for years.  So we are forced to have both out of tree
> and upstream versions of the code.  We need the same applications to be able to work over both, so we would like the
> GPU enabled versions of the code to have the same ABI as the upstream code as this greatly simplifies things.
> We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
> mechanisms to hold places for GPU specific fields which can't be upstreamed.

This problem extends to other drivers as well. I'm also interested in 
advice on the situation. I don't particularly like this either, but we 
need a way to accomplish the goal. We owe it to users to be flexible. 
Please offer suggestions.

[1] https://www.youtube.com/watch?v=iOvt_Iqz0uU

-Denny
Leon Romanovsky March 21, 2021, 8:56 a.m. UTC | #20
On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote:
> On 3/19/2021 6:57 PM, Rimmer, Todd wrote:
> > > > [Wan, Kaike] Incorrect. The rv module works with hfi1.

<...>

> > We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
> > mechanisms to hold places for GPU specific fields which can't be upstreamed.
> 
> This problem extends to other drivers as well. I'm also interested in advice
> on the situation. I don't particularly like this either, but we need a way
> to accomplish the goal. We owe it to users to be flexible. Please offer
> suggestions.

Sorry to interrupt, but it seems that the solution was already given here [1].
It wasn't said directly, but between the lines it means that you need
two things:
1. Upstream everything.
2. Find another vendor that jumps on this RV bandwagon.

[1] https://lore.kernel.org/linux-rdma/20210319230644.GI2356281@nvidia.com

Thanks
Jason Gunthorpe March 21, 2021, 12:08 p.m. UTC | #21
On Fri, Mar 19, 2021 at 08:34:17PM +0000, Hefty, Sean wrote:
> > > > I also don't know why you picked the name rv, this looks like it has little to do
> > > > with the usual MPI rendezvous protocol. This is all about bulk transfers. It is
> > > > actually a lot like RDS. Maybe you should be using RDS?
> > 
> > > [Wan, Kaike] While there are similarities in concepts, details are
> > > different.
> > 
> > You should list these differences.
> > 
> > > Quite frankly this could be viewed as an application accelerator
> > > much like RDS served that purpose for Oracle, which continues to be
> > > its main use case.
> > 
> > Obviously, except it seems to be doing the same basic acceleration
> > technique as RDS.
> 
> A better name for this might be "scalable RDMA service", with RDMA
> meaning the transport operation.  My understanding is this is
> intended to be usable over any IB/RoCE device.

It looks like it is really all about bulk zero-copy transfers with
kernel involvement.

Jason
Dennis Dalessandro March 21, 2021, 4:24 p.m. UTC | #22
On 3/21/2021 4:56 AM, Leon Romanovsky wrote:
> On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote:
>> On 3/19/2021 6:57 PM, Rimmer, Todd wrote:
>>>>> [Wan, Kaike] Incorrect. The rv module works with hfi1.
> 
> <...>
> 
>>> We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
>>> mechanisms to hold places for GPU specific fields which can't be upstreamed.
>>
>> This problem extends to other drivers as well. I'm also interested in advice
>> on the situation. I don't particularly like this either, but we need a way
>> to accomplish the goal. We owe it to users to be flexible. Please offer
>> suggestions.
> 
> Sorry to interrupt, but it seems that solution was said here [1].
> It wasn't said directly, but between the lines it means that you need
> two things:
> 1. Upstream everything.

Completely agree. However the GPU code from nvidia is not upstream. I 
don't see that issue getting resolved in this code review. Let's move on.

> 2. Find another vendor that jumps on this RV bandwagon.

That's not a valid argument. Clearly this works over multiple vendors' HW.

We should be trying to get things upstream, not putting up walls when 
people want to submit new drivers. Calling code "garbage" [1] is not 
productive.

 > [1] 
https://lore.kernel.org/linux-rdma/20210319230644.GI2356281@nvidia.com

-Denny
Jason Gunthorpe March 21, 2021, 4:45 p.m. UTC | #23
On Sun, Mar 21, 2021 at 12:24:39PM -0400, Dennis Dalessandro wrote:
> On 3/21/2021 4:56 AM, Leon Romanovsky wrote:
> > On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote:
> > > On 3/19/2021 6:57 PM, Rimmer, Todd wrote:
> > > > > > [Wan, Kaike] Incorrect. The rv module works with hfi1.
> > 
> > <...>
> > 
> > > > We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
> > > > mechanisms to hold places for GPU specific fields which can't be upstreamed.
> > > 
> > > This problem extends to other drivers as well. I'm also interested in advice
> > > on the situation. I don't particularly like this either, but we need a way
> > > to accomplish the goal. We owe it to users to be flexible. Please offer
> > > suggestions.
> > 
> > Sorry to interrupt, but it seems that solution was said here [1].
> > It wasn't said directly, but between the lines it means that you need
> > two things:
> > 1. Upstream everything.
> 
> Completely agree. However the GPU code from nvidia is not upstream. I don't
> see that issue getting resolved in this code review. Let's move on.
> 
> > 2. Find another vendor that jumps on this RV bandwagon.
> 
> That's not a valid argument. Clearly this works over multiple vendors HW.

At a certain point we have to decide if this is generic code of some
kind or a driver-specific thing like HFI has.

There are also obvious technical problems designing it as a ULP, so it
is a very important question to answer. If it will only ever be used
by one Intel ethernet chip then maybe it isn't really generic code.

On the other hand it really looks like it overlaps in various ways
with both RDS and the qib/hfi1 cdev, so why isn't there any effort to
have some commonality??

> We should be trying to get things upstream, not putting up walls when people
> want to submit new drivers. Calling code "garbage" [1] is not productive.

Putting a bunch of misaligned structures and random reserved fields
*is* garbage by the upstream standard and if I send that to Linus I'll
get yelled at.

And you certainly can't say "we are already shipping this ABI so we
won't change it" either.

You can't square this circle by compromising the upstream world in any
way, it is simply not accepted by the overall community.

All of you should know this, I shouldn't have to lecture on this!

Also no, we should not be "trying to get things upstream" as some goal
in itself. Upstream is not a trashcan to dump stuff into, someone has
to maintain all of this long after it stops being interesting, so
there better be good reasons to put it here in the first place.

If it isn't obvious, I'll repeat again: I'm highly annoyed that Intel
is sending something like this RV, in the state it is in, to support
their own out-of-tree driver, when they themselves have been dragging
their feet for *years* on responding to review comments so it can be
upstream.

Jason
Dennis Dalessandro March 21, 2021, 5:21 p.m. UTC | #24
On 3/21/2021 12:45 PM, Jason Gunthorpe wrote:
> On Sun, Mar 21, 2021 at 12:24:39PM -0400, Dennis Dalessandro wrote:
>> On 3/21/2021 4:56 AM, Leon Romanovsky wrote:
>>> On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote:
>>>> On 3/19/2021 6:57 PM, Rimmer, Todd wrote:
>>>>>>> [Wan, Kaike] Incorrect. The rv module works with hfi1.
>>>
>>> <...>
>>>
>>>>> We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
>>>>> mechanisms to hold places for GPU specific fields which can't be upstreamed.
>>>>
>>>> This problem extends to other drivers as well. I'm also interested in advice
>>>> on the situation. I don't particularly like this either, but we need a way
>>>> to accomplish the goal. We owe it to users to be flexible. Please offer
>>>> suggestions.
>>>
>>> Sorry to interrupt, but it seems that solution was said here [1].
>>> It wasn't said directly, but between the lines it means that you need
>>> two things:
>>> 1. Upstream everything.
>>
>> Completely agree. However the GPU code from nvidia is not upstream. I don't
>> see that issue getting resolved in this code review. Let's move on.
>>
>>> 2. Find another vendor that jumps on this RV bandwagon.
>>
>> That's not a valid argument. Clearly this works over multiple vendors HW.
> 
> At a certain point we have to decide if this is a generic code of some
> kind or a driver-specific thing like HFI has.
> 
> There are also obvious technial problems designing it as a ULP, so it
> is a very important question to answer. If it will only ever be used
> by one Intel ethernet chip then maybe it isn't really generic code.

Todd/Kaike, is there something in here that is specific to the Intel 
ethernet chip?

> On the other hand it really looks like it overlaps in various ways
> with both RDS and the qib/hfi1 cdev, so why isn't there any effort to
> have some commonality??

Maybe that's something that should be explored. Isn't this along the 
lines of stuff we talked about with the verbs 2.0 stuff, or whatever we 
ended up calling it.

>> We should be trying to get things upstream, not putting up walls when people
>> want to submit new drivers. Calling code "garbage" [1] is not productive.
> 
> Putting a bunch of misaligned structures and random reserved fields
> *is* garbage by the upstream standard and if I send that to Linus I'll
> get yelled at.

Not saying you should send this to Linus. I'm saying we should figure
out a way to make it better, and insulting people and their hard work
isn't helping. This is the kind of culture we are trying to get away
from in the kernel world.

> And you certainly can't say "we are already shipping this ABI so we
> won't change it" either.
> 
> You can't square this circle by compromising the upstream world in any
> way, it is simply not accepted by the overall community.
> 
> All of you should know this, I shouldn't have to lecture on this!

No one is suggesting compromising the upstream world. There is a bigger
picture here. The answer for this driver may just be to take out the
reserved stuff. That's pretty simple. The bigger question is how we
deal with non-upstream code. It can't be a problem unique to the RDMA
subsystem. How do others deal with it?

> Also no, we should not be "trying to get things upstream" as some goal
> in itself. Upstream is not a trashcan to dump stuff into, someone has
> to maintain all of this long after it stops being interesting, so
> there better be good reasons to put it here in the first place.

That is completely not what I meant at all. I mean we should be trying
to get rid of the proprietary and out-of-tree stuff. It doesn't at all
imply flinging crap against the wall and hoping it sticks. We should be
encouraging vendors to submit their code and work with them to get it
in shape. We clearly have a problem with vendor proprietary code not
being open. Let's not encourage that behavior. Vendors should say "I
want to submit my code to the Linux kernel", not "eh, it's too much of
a hassle and kernel people are jerks, so we'll just post it on our
website".

> If it isn't obvious, I'll repeat again: I'm highly annoyed that Intel
> is sending something like this RV, in the state it is in, to support
> their own out of tree driver, that they themselves have been dragging
> their feet on responding to review comments so it can be upstream for
> *years*.

To be fair it is sent as RFC. So to me that says they know it's a ways 
off from being ready to be included and are starting the process early.

-Denny
Jason Gunthorpe March 21, 2021, 6:08 p.m. UTC | #25
On Sun, Mar 21, 2021 at 01:21:14PM -0400, Dennis Dalessandro wrote:

> Maybe that's something that should be explored. Isn't this along the lines
> of stuff we talked about with the verbs 2.0 stuff, or whatever we ended up
> calling it.

I think verbs 2.0 turned into the ioctl uapi stuff, I don't remember anymore.

> > > We should be trying to get things upstream, not putting up walls when people
> > > want to submit new drivers. Calling code "garbage" [1] is not productive.
> > 
> > Putting a bunch of misaligned structures and random reserved fields
> > *is* garbage by the upstream standard and if I send that to Linus I'll
> > get yelled at.
> 
> Not saying you should send this to Linus. I'm saying we should figure out a
> way to make it better and insulting people and their hard work isn't
> helping.

No - you've missed what happened here. Todd responded very fast and
explained - Intel *knowingly* sent code that was sub-standard as a
calculated attempt to make Intel's life easier in maintaining their
out-of-tree drivers.

This absolutely needs strong language, as I can't catch everything and
people need to understand there are real consequences to violating the
community trust in this way!

> No one is suggesting to compromise the upstream world. 

I'm not sure what you mean - what could upstream do to change the
situation in any way other than compromising on what will be merged?

> There is a bigger picture here. The answer for this driver may just
> be to take out the reserved stuff. That's pretty simple. The bigger
> question is how do we deal with non-upstream code. It can't be a
> problem unique to the RDMA subsystem. How do others deal with it?

The kernel community consensus is that upstream code is on its own.

We don't change the kernel to accommodate out-of-tree code. If the kernel
changes and out-of-tree code breaks, nobody cares.

Over a long time it has been proven that this methodology is a good
way to effect business change to align with the community consensus
development model - eventually the costs of being out of tree have bad
ROI and companies align.

> That is completely not what I meant at all. I mean we should be
> trying to get rid of the proprietary, and out of tree stuff. It
> doesn't at all mean we should fling crap against the wall and hope it
> sticks. We should be encouraging vendors to submit their code and
> work with them to get it in shape.

Well, I am working on something like 4-5 Intel series right now, and
it sometimes does feel like flinging. Have you seen Greg KH's remarks
that he won't even look at Intel patches until they have internal
expert sign off?

> not encourage that behavior. Vendors should say I want to submit my code to
> the Linux kernel. Not eh, it's too much of a hassle and kernel people are
> jerks, so we'll just post it on our website.

There is a very, very fine line between "working with the community"
and "the community will provide free expert engineering work to our
staff".

> To be fair it is sent as RFC. So to me that says they know it's a ways off
> from being ready to be included and are starting the process early.

I'm not sure why it is RFC. It sounds like it is much more than this
if someone has already made a version with links to the NVIDIA GPU
driver and has reached a point where they care about ABI stability?

Jason
Wan, Kaike March 21, 2021, 7:19 p.m. UTC | #26
> From: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
> Sent: Sunday, March 21, 2021 1:21 PM
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Leon Romanovsky <leon@kernel.org>; Rimmer, Todd
> <todd.rimmer@intel.com>; Wan, Kaike <kaike.wan@intel.com>;
> dledford@redhat.com; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH RFC 0/9] A rendezvous module
> 
> On 3/21/2021 12:45 PM, Jason Gunthorpe wrote:
> > On Sun, Mar 21, 2021 at 12:24:39PM -0400, Dennis Dalessandro wrote:
> >> On 3/21/2021 4:56 AM, Leon Romanovsky wrote:
> >>> On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote:
> >>>> On 3/19/2021 6:57 PM, Rimmer, Todd wrote:
> >>>>>>> [Wan, Kaike] Incorrect. The rv module works with hfi1.
> >>>
> >>> <...>
> >>>
> >>
> >>> 2. Find another vendor that jumps on this RV bandwagon.
> >>
> >> That's not a valid argument. Clearly this works over multiple vendors HW.
> >
> > At a certain point we have to decide if this is a generic code of some
> > kind or a driver-specific thing like HFI has.
> >
> > There are also obvious technical problems designing it as a ULP, so it
> > is a very important question to answer. If it will only ever be used
> > by one Intel ethernet chip then maybe it isn't really generic code.
> 
> Todd/Kaike, is there something in here that is specific to the Intel ethernet
> chip?
[Wan, Kaike] No. The code is generic to Verbs/ROCE devices.
Rimmer, Todd March 22, 2021, 3:17 p.m. UTC | #27
> Over a long time it has been proven that this methodology is a good way to effect business change to align with the community consensus development model - eventually the costs of being out of tree have bad ROI and companies align.

Agree.  The key question is when nVidia will upstream its drivers so companies don't have to endure the resulting "bad ROI" of being forced to have unique out-of-tree solutions.

> > Putting a bunch of misaligned structures and random reserved fields
> > *is* garbage by the upstream standard and if I send that to Linus 
> > I'll get yelled at.
Let's not exaggerate this.  The fields were organized in a logical manner for end-user consumption and understanding.  A couple of resv fields were used for alignment.  I alluded to the fact that we may have some ideas on how those fields or gaps could be used for GPU support in the future, which might help deal with the lack of upstream nVidia code and reduce the ROI challenges it causes for other vendors.  The discussion seemed centered around the attach parameters; attach is executed once per process at job start.

Todd
Jason Gunthorpe March 22, 2021, 4:47 p.m. UTC | #28
On Mon, Mar 22, 2021 at 03:17:05PM +0000, Rimmer, Todd wrote:

> > Over a long time it has been proven that this methodology is a
> > good way to effect business change to align with the community
> > consensus development model - eventually the costs of being out of
> > tree have bad ROI and companies align.
> 
> Agree.  The key question is when nVidia will upstream its drivers
> so companies don't have to endure the resulting "bad ROI" of being
> forced to have unique out-of-tree solutions.

If you are working with the NVIDIA GPU and having inefficiencies then you
need to take that up through your business relationship, not here.

> > > Putting a bunch of misaligned structures and random reserved
> > > fields *is* garbage by the upstream standard and if I send that
> > > to Linus I'll get yelled at.

> Let's not exaggerate this.  The fields were organized in a
> logical manner for end-user consumption and understanding.  A couple
> of resv fields were used for alignment.  

No, a couple of resv fields were randomly added and *a lot* of other
misalignments were ignored. That is not our standard for ABI design.

Jason
Hefty, Sean March 22, 2021, 5:31 p.m. UTC | #29
> > To be fair it is sent as RFC. So to me that says they know it's a ways off
> > from being ready to be included and are starting the process early.
> 
> I'm not sure why it is RFC. It sounds like it is much more than this
> if someone has already made a version with links to the NVIDIA GPU
> driver and has reached a point where they care about ABI stability?

I can take some blame here.  A couple of us were asked to look at this module.  Because the functionality is intended to be device agnostic, we assumed there would be more scrutiny, and we weren't sure how acceptable some aspects would be (e.g. mr cache, ib cm data format).  Rather than debate this internally for months, rework the code, and still miss, we asked Kaike to post an RFC to get community feedback.  For *upstreaming* purposes it was intended as an RFC to gather feedback on the overall approach.  That should have been made clearer.

The direct feedback that Kaike is examining is the difference between this approach and RDS.

- Sean
Christoph Hellwig March 23, 2021, 3:30 p.m. UTC | #30
On Fri, Mar 19, 2021 at 10:57:20PM +0000, Rimmer, Todd wrote:
> We'd like advice on a challenging situation.  Some customers desire NICs to support nVidia GPUs in some environments.
> Unfortunately the nVidia GPU drivers are not upstream, and have not been for years.  So we are forced to have both out of tree
> and upstream versions of the code.  We need the same applications to be able to work over both, so we would like the
> GPU enabled versions of the code to have the same ABI as the upstream code as this greatly simplifies things.
> We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved"
> mechanisms to hold places for GPU specific fields which can't be upstreamed.

NVIDIA GPUs are supported by drivers/gpu/drm/nouveau/, and you are
encouraged to support them just like all the other in-tree GPU drivers.
Not sure what support a network protocol would need for a specific GPU.
You're probably trying to do something amazingly stupid here instead of
relying on proper kernel subsystem use.
Christoph Hellwig March 23, 2021, 3:33 p.m. UTC | #31
On Sun, Mar 21, 2021 at 12:24:39PM -0400, Dennis Dalessandro wrote:
> Completely agree. However the GPU code from nvidia is not upstream. I don't
> see that issue getting resolved in this code review. Let's move on.

In which case you need to stop wasting everyone's time with your piece of
crap code base, because it is completely and utterly irrelevant.
Christoph Hellwig March 23, 2021, 3:35 p.m. UTC | #32
On Sun, Mar 21, 2021 at 01:45:04PM -0300, Jason Gunthorpe wrote:
> > We should be trying to get things upstream, not putting up walls when people
> > want to submit new drivers. Calling code "garbage" [1] is not productive.
> 
> Putting a bunch of misaligned structures and random reserved fields
> *is* garbage by the upstream standard and if I send that to Linus I'll
> get yelled at.

And when this is for out of tree crap it is per the very definition of
the word garbage.  If I was in your position I would not waste any time
on such an utterly disrespectful submission that violates all the
community rules.
Christoph Hellwig March 23, 2021, 3:36 p.m. UTC | #33
On Sun, Mar 21, 2021 at 01:21:14PM -0400, Dennis Dalessandro wrote:
> No one is suggesting to compromise the upstream world. There is a bigger
> picture here. The answer for this driver may just be to take out the reserved
> stuff. That's pretty simple. The bigger question is how do we deal with
> non-upstream code. It can't be a problem unique to the RDMA subsystem. How
> do others deal with it?

By ignoring it, or if it gets too annoying by making life for it so
utterly painful that people give up on it.
Jason Gunthorpe March 23, 2021, 3:46 p.m. UTC | #34
On Tue, Mar 23, 2021 at 03:30:41PM +0000, Christoph Hellwig wrote:

> On Fri, Mar 19, 2021 at 10:57:20PM +0000, Rimmer, Todd wrote:

> > We'd like advice on a challenging situation.  Some customers
> > desire NICs to support nVidia GPUs in some environments.
> > Unfortunately the nVidia GPU drivers are not upstream, and have
> > not been for years.  So we are forced to have both out of tree and
> > upstream versions of the code.  We need the same applications to
> > be able to work over both, so we would like the GPU enabled
> > versions of the code to have the same ABI as the upstream code as
> > this greatly simplifies things.  We have removed all GPU specific
> > code from the upstream submission, but used both the "alignment
> > holes" and the "reserved" mechanisms to hold places for GPU
> > specific fields which can't be upstreamed.
> 
> NVIDIA GPUs are supported by drivers/gpu/drm/nouveau/, and you are
> encouraged to support them just like all the other in-tree GPU
> drivers.  Not sure what support a network protocol would need for a
> specific GPU.  You're probably trying to do something amazingly
> stupid here instead of relying on proper kernel subsystem use.

The kernel building block for what they are trying to do with the GPU
is the recently merged DMABUF MR support in the RDMA subsystem.
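
For reference, registering a dmabuf-backed buffer from user space is a
single call with that new API.  A minimal sketch follows; the helper name
is made up, only ibv_reg_dmabuf_mr() and the access flags are real:

/* Minimal sketch: register a dmabuf-backed MR with libibverbs.
 * Assumes dmabuf_fd was exported by an in-tree GPU/accelerator driver.
 */
#include <infiniband/verbs.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, int dmabuf_fd,
				   size_t len)
{
	/* offset 0 into the dmabuf, iova 0, local + remote write access */
	return ibv_reg_dmabuf_mr(pd, 0, len, 0, dmabuf_fd,
				 IBV_ACCESS_LOCAL_WRITE |
				 IBV_ACCESS_REMOTE_WRITE);
}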

I'd like to think that since Daniel's team at Intel got the DMABUF
stuff merged to support the applications Todd's Intel team is building
that this RV stuff is already fully ready for dmabuf... (hint hint)

What Todd is alluding to here is the hacky DMABUF alternative that is
in the NVIDIA GPU driver - which HPC networking companies must support
if they want to interwork with the NVIDIA GPU.

Every RDMA vendor playing in the HPC space has some out-of-tree driver
to enable this. :(

Jason
Christoph Hellwig March 23, 2021, 4:07 p.m. UTC | #35
On Tue, Mar 23, 2021 at 12:46:26PM -0300, Jason Gunthorpe wrote:
> What Todd is alluding to here is the hacky DMABUF alternative that is
> in the NVIDIA GPU driver - which HPC networking companies must support
> if they want to interwork with the NVIDIA GPU.
> 
> Every RDMA vendor playing in the HPC space has some out-of-tree driver
> to enable this. :(

I can only recommend everyone buy from a less fucked up GPU or
accelerator vendor.  We're certainly going to do everything we can to
discourage making life easier for people who insist on doing the
stupidest possible thing.
Rimmer, Todd March 23, 2021, 5:25 p.m. UTC | #36
> I can only recommend everyone buy from a less f***** up GPU or
> accelerator vendor.  
I would certainly love that.  This is not just a recent problem; it's been going on for at least 3-5 years with no end in sight.  And the nVidia driver itself is closed-source in the kernel :-( which makes tuning and debug even harder and continues to add costs for NIC vendors other than nVidia themselves to support this.

Back to the topic at hand: yes, there are a few misalignments in the ABI.  Most of the structures are carefully aligned.  Below I summarize the major structures and their alignment characteristics.  In a few places we chose readability for the application programmer by ordering fields logically, such as for statistics.

In one place a superfluous resv field was used (rv_query_params_out), and when I alluded that it might be taken advantage of in the future to enable a common ABI for GPUs, we went down this deep rat hole.

In studying all the related fields, in most cases if we shuffled everything for maximum packing, the structures would still end up being about the same size and this would all be for non-performance path ABIs.

Here is a summary:
13 structures all perfectly aligned with no gaps, plus 7 structures below.

rv_query_params_out - has one 4 byte reserved field to guarantee alignment for the u64 which follows (see the sketch after rv_event below).

rv_attach_params_in - organized logically, as early fields influence how later fields are interpreted.  A fair number of fields, two 1 byte gaps and one 7 byte gap.  Shuffling this might save about 4-8 bytes at most.

rv_cache_stats_params_out - ordered logically by statistics meanings.  Two 4 byte gaps could be solved by having a less logical order.  Of course, applications reporting these statistics will tend to do output in the original order, so packing this results in a harder to use ABI and more difficult code review for application writers wanting to make sure they report all stats but do so in a logical order.

rv_conn_get_stats_params_out - one 2 byte gap (so the 1 byte field mimicking the input request can be first), three 4 byte gaps.  Same explanation as rv_cache_stats_params_out.

rv_conn_create_params_in - one 4 byte gap, easy enough to swap

rv_post_write_params_out - one 3 byte gap.  Presented in logical order; shuffling would still yield the same size, as the compiler will round up the size.

rv_event - carefully packed and aligned.  Had to make this work on a wide range of compilers with a 1 byte common field defining which part of the union was relevant.  Could put the same field in all union members to get rid of the packed attribute if that is preferred.  We found other similar examples in an older 4.18 kernel; one is cited below.
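
For illustration only, here is a minimal sketch of the two patterns being
discussed: an explicit resv field in front of a u64, and repeating the
1 byte type field in every union member to avoid the packed attribute.
The field names are hypothetical, not the actual rv uAPI:

/* Illustrative only: field names are hypothetical, not the rv uAPI. */
#include <linux/types.h>

struct example_params_out {
	__u32 major_rev;
	__u32 minor_rev;
	__u32 capability;
	__u32 resv;		/* explicit pad so the following u64 stays 8-byte aligned */
	__aligned_u64 max_mr_size;
};

/*
 * One way to drop the packed attribute on an event union: repeat the
 * 1 byte type field at the start of every member so all layouts agree.
 */
union example_event {
	__u8 event_type;
	struct {
		__u8 event_type;
		__u8 resv[7];
		__aligned_u64 imm_data;
	} write_imm;
	struct {
		__u8 event_type;
		__u8 conn_index;
		__u8 resv[6];
	} disconnect;
};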

It should be noted, there are existing examples with small gaps or reserved fields in the existing kernel and RDMA stack.  A few examples in ib_user_verbs.h include:

ib_uverbs_send_wr - 4 byte gap after ex for the rdma field in union.

ib_uverbs_flow_attr - 2 byte reserved field declared

ib_flow_spec - 2 byte gap after size field

rdma_netdev - 7 byte gap after port_num

./hw/mthca/mthca_eq.c - a very similar use of packed for mthca_eqe to the one complained about in rv_event.

Todd
Jason Gunthorpe March 23, 2021, 5:44 p.m. UTC | #37
On Tue, Mar 23, 2021 at 05:25:43PM +0000, Rimmer, Todd wrote:

> In studying all the related fields, in most cases if we shuffled
> everything for maximum packing, the structures would still end up
> being about the same size and this would all be for non-performance
> path ABIs.

It is not about size, it is about uABI compatibility. Do not place
packing holes in uABI structures as it is *very complicated* to
guarantee those work across all the combinations of user/kernel we
have to support.

Every uABI structure should have explicit padding, it should be
organized to minimize padding, and "easy on the programmer" is not a
top concern.

You must ensure that all compilers generate the same structure byte
layout. On x86 this means a classic i386 compiler, an AMD64 x32 ABI,
and the normal AMD64 compiler.
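
A rough sketch of what that discipline looks like (names are illustrative,
not from rv; a kernel uapi header would use BUILD_BUG_ON rather than
static_assert):

/* Illustrative only: explicit padding plus compile-time layout checks. */
#include <stddef.h>
#include <assert.h>
#include <linux/types.h>

struct example_create_in {
	__aligned_u64 context;	/* offset 0 */
	__u32 index;		/* offset 8 */
	__u32 reserved;		/* offset 12: explicit, so it can be validated now and reused later */
};

/* Same size and offsets on i386, x32 and AMD64 builds. */
static_assert(sizeof(struct example_create_in) == 16, "ABI size changed");
static_assert(offsetof(struct example_create_in, index) == 8, "ABI offset changed");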

> It should be noted, there are existing examples with small gaps or
> reserved fields in the existing kernel and RDMA stack.  A few
> examples in ib_user_verbs.h include:

Past mistakes do not excuse future errors. Most of those are not uABI
structs.

Jason
Jason Gunthorpe March 23, 2021, 10:56 p.m. UTC | #38
On Mon, Mar 22, 2021 at 05:31:07PM +0000, Hefty, Sean wrote:
> > > To be fair it is sent as RFC. So to me that says they know it's a ways off
> > > from being ready to be included and are starting the process early.
> > 
> > I'm not sure why it is RFC. It sounds like it is much more than this
> > if someone has already made a version with links to the NVIDIA GPU
> > driver and has reached a point where they care about ABI stability?
> 
> I can take some blame here.  A couple of us were asked to look at
> this module.  Because the functionality is intended to be device
> agnostic, we assumed there would be more scrutiny, and we weren't
> sure how acceptable some aspects would be (e.g. mr cache, ib cm data
> format).  Rather than debate this internally for months, rework the
> code, and still miss, we asked Kaike to post an RFC to get community
> feedback.  For *upstreaming* purposes it was intended as an RFC to
> gather feedback on the overall approach.  That should have been made
> clearer.

Well, it is hard to even have that kind of conversation when all the
details are wrong. The way this interfaces with uverbs is *just
wrong*; it completely ignores the locking model for how disassociation
works.

Something like this, that is trying to closely integrate with uverbs,
really cannot exist without major surgery to uverbs. If you want to do
something along these lines the uverbs parts cannot be in a ULP.

The mr cache could be moved into some kind of new uverb, that could be
interesting if it is carefully designed.

The actual transport code... That is going to be really hard. RDS
doesn't integrate with uverbs for a reason: the kernel side owns the
QPs and PDs.

How you create a QP owned by the kernel but linked to a PD owned by
uverbs is going to need very delicate and careful work to be somehow
compatible with our disassociation model.

Are you *sure* this needs to be in the kernel? You can't take a
context switch to a user process and use the shared verbs FD stuff
to create the de-duplicated QPs instead? It is a much simpler design.
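
If "the shared verbs FD stuff" means the import verbs, a rough sketch of
that path would look like the following, assuming the exporting process
has already passed its verbs command FD (via SCM_RIGHTS) and the PD
handle to the importing process:

/* Sketch only: import another process's device context and PD. */
#include <infiniband/verbs.h>

struct ibv_pd *import_shared_pd(int peer_cmd_fd, uint32_t pd_handle)
{
	struct ibv_context *ctx = ibv_import_device(peer_cmd_fd);

	if (!ctx)
		return NULL;
	return ibv_import_pd(ctx, pd_handle);
}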

Have you thought about an actual *shared verbs QP* ? We have a lot of
other shared objects right now, it is not such a big step. It does
require inter-process locking though - and that is non-trivial.

A shared QP with a kernel owned send/recv to avoid the locking issue
could also be a very interesting solution.

Jason
Rimmer, Todd March 23, 2021, 11:29 p.m. UTC | #39
> How you create a QP owned by the kernel but linked to a PD owned by uverbs is going to need very delicate and careful work to be somehow compatible with our disassociation model.

The primary usage mode is as follows:
The RC QP, PD and MR are all in the kernel.  The buffer virtual address and length are supplied by the user process and then used to look up an MR in the cache; upon a miss, a kernel MR is created against the kernel PD.  There are separate MR caches per user process.

The IOs are initiated by the user, matched to an MR in the cache, and then an RDMA Write w/Immediate is posted on the kernel RC QP.

In concept, this is not unlike other kernel ULPs which perform direct data placement into user memory but use kernel QPs for connections to remote resources, such as various RDMA storage and filesystem ULPs.
The separation of the MR registration call from the IO allows the registration cost of a miss to be partially hidden behind the end-to-end RTS/CTS exchange, which occurs in user space.
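
Roughly, the kernel-side post for that flow looks like the sketch below.
The per-process cache lookup and completion handling are elided and the
helper is hypothetical; only the verbs structures and ib_post_send()
are real:

/* Sketch of posting the RDMA Write w/Immediate on the kernel RC QP. */
#include <rdma/ib_verbs.h>

static int rv_post_write_sketch(struct ib_qp *qp, struct ib_mr *mr,
				u64 laddr, u32 len,
				u64 raddr, u32 rkey, u32 imm)
{
	struct ib_sge sge = {
		.addr   = laddr,	/* user VA covered by the cached kernel MR */
		.length = len,
		.lkey   = mr->lkey,
	};
	struct ib_rdma_wr wr = {
		.wr = {
			.opcode      = IB_WR_RDMA_WRITE_WITH_IMM,
			.send_flags  = IB_SEND_SIGNALED,
			.sg_list     = &sge,
			.num_sge     = 1,
			.ex.imm_data = cpu_to_be32(imm),
		},
		.remote_addr = raddr,
		.rkey        = rkey,
	};
	const struct ib_send_wr *bad_wr;

	/* The MR came from the per-process cache; on a miss the module
	 * would first pin the pages and register a kernel MR against
	 * the kernel PD. */
	return ib_post_send(qp, &wr.wr, &bad_wr);
}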



There is a secondary usage mode where the MRs are cached, but created against a user PD and later used by the user process against QPs in user space.  We found that usage mode offered some slight latency advantages over the primary mode for tiny jobs, but suffered significant scalability issues.  Those latency advantages mainly manifested in microbenchmarks, but did help a few apps.


If it would simplify things, we could focus the discussion on the primary usage mode.  Conceptually, the secondary usage mode may be a good candidate for an extension to uverbs (some form of register MR w/caching API where register MR checks a cache and deregister MR merely decrements a reference count in the cache).

Todd