Message ID: 20210319125635.34492-1-kaike.wan@intel.com (mailing list archive)
Series: A rendezvous module
On Fri, Mar 19, 2021 at 08:56:26AM -0400, kaike.wan@intel.com wrote: > - Basic mode of operations (PSM3 is used as an example for user > applications): > - A middleware (like MPI) has out-of-band communication channels > between any two nodes, which are used to establish high performance > communications for providers such as PSM3. Huh? Doesn't PSM3 already use its own special non-verbs char devices that already have memory caches and other stuff? Now you want to throw that all away and do yet another char dev just for HFI? Why? I also don't know why you picked the name rv, this looks like it has little to do with the usual MPI rendezvous protocol. This is all about bulk transfers. It is actually a lot like RDS. Maybe you should be using RDS? Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, March 19, 2021 9:53 AM > To: Wan, Kaike <kaike.wan@intel.com> > Cc: dledford@redhat.com; linux-rdma@vger.kernel.org; Rimmer, Todd > <todd.rimmer@intel.com> > Subject: Re: [PATCH RFC 0/9] A rendezvous module > > On Fri, Mar 19, 2021 at 08:56:26AM -0400, kaike.wan@intel.com wrote: > > > - Basic mode of operations (PSM3 is used as an example for user > > applications): > > - A middleware (like MPI) has out-of-band communication channels > > between any two nodes, which are used to establish high performance > > communications for providers such as PSM3. > > Huh? Doesn't PSM3 already use it's own special non-verbs char devices that > already have memory caches and other stuff? Now you want to throw that > all away and do yet another char dev just for HFI? Why? [Wan, Kaike] I think that you are referring to PSM2, which uses the OPA hfi1 driver that is specific to the OPA hardware. PSM3 uses standard verbs drivers and supports standard RoCE. A focus is the Intel RDMA Ethernet NICs. As such it cannot use the hfi1 driver through the special PSM2 interface. Rather it works with the hfi1 driver through standard verbs interface. The rv module was a new design to bring these concepts to standard transports and hardware. > > I also don't know why you picked the name rv, this looks like it has little to do > with the usual MPI rendezvous protocol. This is all about bulk transfers. It is > actually a lot like RDS. Maybe you should be using RDS? [Wan, Kaike] While there are similarities in concepts, details are different. Quite frankly this could be viewed as an application accelerator much like RDS served that purpose for Oracle, which continues to be its main use case. The rv module is currently targeted at enabling the MPI/OFI/PSM3 application. The name "rv" is chosen simply because this module is designed to enable the rendezvous protocol of the MPI/OFI/PSM3 application stack for large messages. Short messages are handled by eager transfer through UDP in PSM3. FYI, there is an OFA presentation from Thursday reviewing PSM3 and RV and covering much of the architecture and rationale. > > Jason
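For context on the protocol being discussed: the MPI rendezvous path referenced above is the large-message case, where the sender first announces the message, the receiver exposes a destination buffer, and the bulk data then moves as a single zero-copy RDMA transfer; small messages take the eager (copy-and-send) path. A minimal illustrative sketch follows; the helper names and the threshold are hypothetical, not taken from the actual PSM3/rv code.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers standing in for the provider's eager (UDP) path
 * and for an rv-style kernel RDMA path; none of these names come from
 * the patch series itself. */
struct conn;
struct cts { uint64_t remote_addr; uint32_t rkey; };
void eager_send_udp(struct conn *c, const void *buf, size_t len, uint64_t tag);
void send_rts(struct conn *c, uint64_t tag, size_t len);
struct cts wait_for_cts(struct conn *c, uint64_t tag);
void rdma_write_bulk(struct conn *c, const void *buf, size_t len,
		     uint64_t remote_addr, uint32_t rkey);
void wait_rdma_done(struct conn *c);

#define RNDV_THRESHOLD (64 * 1024)	/* assumed eager/rendezvous cutover */

void mpi_style_send(struct conn *c, const void *buf, size_t len, uint64_t tag)
{
	if (len <= RNDV_THRESHOLD) {
		eager_send_udp(c, buf, len, tag);	/* small: copy and send */
		return;
	}
	send_rts(c, tag, len);			/* announce the large send (RTS) */
	struct cts cts = wait_for_cts(c, tag);	/* peer exposes a buffer (CTS) */
	rdma_write_bulk(c, buf, len, cts.remote_addr, cts.rkey); /* zero-copy bulk data */
	wait_rdma_done(c);			/* completion before telling the peer */
}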
On Fri, Mar 19, 2021 at 02:49:29PM +0000, Wan, Kaike wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Friday, March 19, 2021 9:53 AM > > To: Wan, Kaike <kaike.wan@intel.com> > > Cc: dledford@redhat.com; linux-rdma@vger.kernel.org; Rimmer, Todd > > <todd.rimmer@intel.com> > > Subject: Re: [PATCH RFC 0/9] A rendezvous module > > > > On Fri, Mar 19, 2021 at 08:56:26AM -0400, kaike.wan@intel.com wrote: > > > > > - Basic mode of operations (PSM3 is used as an example for user > > > applications): > > > - A middleware (like MPI) has out-of-band communication channels > > > between any two nodes, which are used to establish high performance > > > communications for providers such as PSM3. > > > > Huh? Doesn't PSM3 already use it's own special non-verbs char devices that > > already have memory caches and other stuff? Now you want to throw that > > all away and do yet another char dev just for HFI? Why? > [Wan, Kaike] I think that you are referring to PSM2, which uses the > OPA hfi1 driver that is specific to the OPA hardware. PSM3 uses > standard verbs drivers and supports standard RoCE. Uhhh.. "PSM" has always been about the ipath special char device, and if I recall properly the library was semi-discontinued and merged into libfabric. So here you are talking about a libfabric verbs provider that doesn't use the ipath style char interface but uses verbs and this rv thing so we call it a libfabric PSM3 provider because that's not confusing to anyone at all.. > A focus is the Intel RDMA Ethernet NICs. As such it cannot use the > hfi1 driver through the special PSM2 interface. These are the drivers that aren't merged yet, I see. So why are you sending this now? I'm not interested to look at even more Intel code when their driver saga is still ongoing for years. > Rather it works with the hfi1 driver through standard verbs > interface. But nobody would do that right? You'd get better results using the hfi1 native interfaces instead of their slow fake verbs stuff. > > I also don't know why you picked the name rv, this looks like it has little to do > > with the usual MPI rendezvous protocol. This is all about bulk transfers. It is > > actually a lot like RDS. Maybe you should be using RDS? > [Wan, Kaike] While there are similarities in concepts, details are > different. You should list these differences. > Quite frankly this could be viewed as an application accelerator > much like RDS served that purpose for Oracle, which continues to be > its main use case. Obviously, except it seems to be doing the same basic acceleration technique as RDS. > The name "rv" is chosen simply because this module is designed to > enable the rendezvous protocol of the MPI/OFI/PSM3 application stack > for large messages. Short messages are handled by eager transfer > through UDP in PSM3. A bad name seems like it will further limit potential re-use of this code. Jason
On 3/19/2021 11:48 AM, Jason Gunthorpe wrote: > On Fri, Mar 19, 2021 at 02:49:29PM +0000, Wan, Kaike wrote: >>> From: Jason Gunthorpe <jgg@nvidia.com> >>> Sent: Friday, March 19, 2021 9:53 AM >>> To: Wan, Kaike <kaike.wan@intel.com> >>> Cc: dledford@redhat.com; linux-rdma@vger.kernel.org; Rimmer, Todd >>> <todd.rimmer@intel.com> >>> Subject: Re: [PATCH RFC 0/9] A rendezvous module >>> >>> On Fri, Mar 19, 2021 at 08:56:26AM -0400, kaike.wan@intel.com wrote: >>> >>>> - Basic mode of operations (PSM3 is used as an example for user >>>> applications): >>>> - A middleware (like MPI) has out-of-band communication channels >>>> between any two nodes, which are used to establish high performance >>>> communications for providers such as PSM3. >>> >>> Huh? Doesn't PSM3 already use it's own special non-verbs char devices that >>> already have memory caches and other stuff? Now you want to throw that >>> all away and do yet another char dev just for HFI? Why? > >> [Wan, Kaike] I think that you are referring to PSM2, which uses the >> OPA hfi1 driver that is specific to the OPA hardware. PSM3 uses >> standard verbs drivers and supports standard RoCE. > > Uhhh.. "PSM" has always been about the ipath special char device, and > if I recall properly the library was semi-discontinued and merged into > libfabric. This driver is intended to work with a fork of the PSM2 library. The PSM2 library which is for Omni-Path is now maintained by Cornelis Networks on our GitHub. PSM3 is something from Intel for Ethernet. I know it's a bit confusing. > So here you are talking about a libfabric verbs provider that doesn't > use the ipath style char interface but uses verbs and this rv thing so > we call it a libfabric PSM3 provider because thats not confusing to > anyone at all.. > >> A focus is the Intel RDMA Ethernet NICs. As such it cannot use the >> hfi1 driver through the special PSM2 interface. > > These are the drivers that aren't merged yet, I see. So why are you > sending this now? I'm not interested to look at even more Intel code > when their driver saga is still ongoing for years. > >> Rather it works with the hfi1 driver through standard verbs >> interface. > > But nobody would do that right? You'd get better results using the > hif1 native interfaces instead of their slow fake verbs stuff. I can't imagine why. I'm not sure what you mean by our slow fake verbs stuff? We support verbs just fine. It's certainly not fake. >>> I also don't know why you picked the name rv, this looks like it has little to do >>> with the usual MPI rendezvous protocol. This is all about bulk transfers. It is >>> actually a lot like RDS. Maybe you should be using RDS? . . >> The name "rv" is chosen simply because this module is designed to >> enable the rendezvous protocol of the MPI/OFI/PSM3 application stack >> for large messages. Short messages are handled by eager transfer >> through UDP in PSM3. > > A bad name seems like it will further limit potential re-use of this > code. As to the name, I started this driver when I was still at Intel before handing the reins to Kaike when we spun out. I plucked the name rv out of the air without much consideration. While the code seems to have changed a lot, the name seems to have stuck. It was as Kaike mentions because it helps enable the rendezvous side of PSM3. -Denny
On Fri, Mar 19, 2021 at 03:22:45PM -0400, Dennis Dalessandro wrote: > > > [Wan, Kaike] I think that you are referring to PSM2, which uses the > > > OPA hfi1 driver that is specific to the OPA hardware. PSM3 uses > > > standard verbs drivers and supports standard RoCE. > > > > Uhhh.. "PSM" has always been about the ipath special char device, and > > if I recall properly the library was semi-discontinued and merged into > > libfabric. > > This driver is intended to work with a fork of the PSM2 library. The PSM2 > library which is for Omni-Path is now maintained by Cornelis Networks on our > GitHub. PSM3 is something from Intel for Ethernet. I know it's a bit > confusing. "a bit" huh? > > So here you are talking about a libfabric verbs provider that doesn't > > use the ipath style char interface but uses verbs and this rv thing so > > we call it a libfabric PSM3 provider because thats not confusing to > > anyone at all.. > > > > > A focus is the Intel RDMA Ethernet NICs. As such it cannot use the > > > hfi1 driver through the special PSM2 interface. > > > > These are the drivers that aren't merged yet, I see. So why are you > > sending this now? I'm not interested to look at even more Intel code > > when their driver saga is still ongoing for years. > > > > > Rather it works with the hfi1 driver through standard verbs > > > interface. > > > > But nobody would do that right? You'd get better results using the > > hif1 native interfaces instead of their slow fake verbs stuff. > > I can't imagine why. I'm not sure what you mean by our slow fake verbs > stuff? We support verbs just fine. It's certainly not fake. hfi1 calls to the kernel for data path operations - that is "fake" in my book. Verbs was always about avoiding that kernel transition, to put it back in betrays the spirit. So a kernel call for rv, or the hfi cdev, or the verbs post-send is really all a wash. I didn't understand your answer: do you see using this with hfi1 or not? It looks a lot like it was copy&pasted from the hfi1 driver, so now we are on our third copy of this code :( And why did it suddenly become a ULP that somehow shares uverbs resources?? I'm pretty skeptical that can be made to work correctly.. Jason
> hfi1 calls to the kernel for data path operations - that is "fake" in my book. Verbs was always about avoiding that kernel transition, to put it back in betrays the spirit. > So a kernel call for rv, or the hfi cdev, or the verbs post-send is really all a wash.
To be clear, different vendors have different priorities and hence different HW designs and approaches. hfi1 approached the HPC latency needs with a uniquely scalable approach with very low latency @scale. Ironically, other vendors have since replicated some of those mechanisms with their own proprietary mechanisms, such as UD-X. hfi1 approached storage needs and large message HPC transfers with a direct data placement mechanism (aka RDMA). It fully supported the verbs API and met the performance needs of its customers with attractive price and power for its target markets.
Kaike cited hfi1 as just one example of an RDMA verbs device which rv can work with. Various network and server vendors will each make their own recommendations for software configs and various end users will each have their own preferences and requirements.
Todd
On 3/19/2021 3:44 PM, Jason Gunthorpe wrote: > On Fri, Mar 19, 2021 at 03:22:45PM -0400, Dennis Dalessandro wrote: > >>>> [Wan, Kaike] I think that you are referring to PSM2, which uses the >>>> OPA hfi1 driver that is specific to the OPA hardware. PSM3 uses >>>> standard verbs drivers and supports standard RoCE. >>> >>> Uhhh.. "PSM" has always been about the ipath special char device, and >>> if I recall properly the library was semi-discontinued and merged into >>> libfabric. >> >> This driver is intended to work with a fork of the PSM2 library. The PSM2 >> library which is for Omni-Path is now maintained by Cornelis Networks on our >> GitHub. PSM3 is something from Intel for Ethernet. I know it's a bit >> confusing. > > "a bit" huh? Just a bit. :) > >>> So here you are talking about a libfabric verbs provider that doesn't >>> use the ipath style char interface but uses verbs and this rv thing so >>> we call it a libfabric PSM3 provider because thats not confusing to >>> anyone at all.. >>> >>>> A focus is the Intel RDMA Ethernet NICs. As such it cannot use the >>>> hfi1 driver through the special PSM2 interface. >>> >>> These are the drivers that aren't merged yet, I see. So why are you >>> sending this now? I'm not interested to look at even more Intel code >>> when their driver saga is still ongoing for years. >>> >>>> Rather it works with the hfi1 driver through standard verbs >>>> interface. >>> >>> But nobody would do that right? You'd get better results using the >>> hif1 native interfaces instead of their slow fake verbs stuff. >> >> I can't imagine why. I'm not sure what you mean by our slow fake verbs >> stuff? We support verbs just fine. It's certainly not fake. > > hfi1 calls to the kernel for data path operations - that is "fake" in > my book. Verbs was always about avoiding that kernel transition, to > put it back in betrays the spirit. So a kernel call for rv, or the hfi > cdev, or the verbs post-send is really all a wash. Probably better to argue that in another thread I guess. > I didn't understand your answer, do you see using this with hfi1 or > not? I don't see how this could ever use hfi1. So no. > It looks a lot copy&pasted from the hfi1 driver, so now we are on our > third copy of this code :( I haven't had a chance to look beyond the cover letter in depth at how things have changed. I really hope it's not that bad. > And why did it suddenly become a ULP that somehow shares uverbs > resources?? I'm pretty skeptical that can be made to work correctly.. I see your point. I was just providing some background/clarification. Bottom line this has nothing to do with hfi1 or OPA or PSM2. -Denny
On Fri, Mar 19, 2021 at 08:12:18PM +0000, Rimmer, Todd wrote: > > hfi1 calls to the kernel for data path operations - that is "fake" in my book. Verbs was always about avoiding that kernel transition, to put it back in betrays the spirit. > So a kernel call for rv, or the hfi cdev, or the verbs post-send is really all a wash. > > To be clear, different vendors have different priorities and hence > different HW designs and approaches. hfi1 approached the HPC > latency needs with a uniquely scalable approach with very low > latency @scale. Yes, we all know the marketing spin the vendors like to use here. A kernel transition is a kernel transition, and the one in the HFI verbs path through all the common code is particularly expensive. I'm surprised to hear someone advocate that is a good thing when we were all told that the hfi1 cdev *must* exist because the kernel transition through verbs was far too expensive. > Ironically, other vendors have since replicated some > of those mechanisms with their own proprietary mechanisms, such as > UD-X. What is a UD-X? > Kaike sited hfi1 as just one example of a RDMA verbs device which rv > can function for. rv seems to completely destroy a lot of the HPC performance offloads that vendors are layering on RC QPs (see what hns and mlx5 are doing), so I have doubts about this. It would be better if you could get at least one other device to agree this is reasonable. Or work with something that already exists for bulk messaging like RDS. Jason
On Fri, Mar 19, 2021 at 04:18:56PM -0400, Dennis Dalessandro wrote: > I was just providing some background/clarification. Bottom line this has > nothing to do with hfi1 or OPA or PSM2. Well, I hope your two companies can agree on naming if Cornelis someday needs a PSM3 for its HW. Jason
> > > I also don't know why you picked the name rv, this looks like it has little to do > > > with the usual MPI rendezvous protocol. This is all about bulk transfers. It is > > > actually a lot like RDS. Maybe you should be using RDS? > > > [Wan, Kaike] While there are similarities in concepts, details are > > different. > > You should list these differences. > > > Quite frankly this could be viewed as an application accelerator > > much like RDS served that purpose for Oracle, which continues to be > > its main use case. > > Obviously, except it seems to be doing the same basic acceleration > technique as RDS. A better name for this might be "scalable RDMA service", with RDMA meaning the transport operation. My understanding is this is intended to be usable over any IB/RoCE device. I'm not familiar enough with the details of RDS or this service to know the differences. - Sean
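Since the RDS comparison comes up repeatedly in this thread, it may help to recall that RDS is driven entirely through AF_RDS sockets, with bulk RDMA requested via control messages (e.g. RDS_CMSG_RDMA_ARGS, omitted here). A rough sketch of the basic datagram path, for context only:

#include <stdint.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#ifndef AF_RDS
#define AF_RDS 21	/* from <linux/socket.h>, for older libc headers */
#endif

int rds_send_example(const char *local_ip, const char *peer_ip, uint16_t port)
{
	int fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
	if (fd < 0)
		return -1;

	/* RDS requires the socket to be bound to a local address first. */
	struct sockaddr_in local = { .sin_family = AF_INET, .sin_port = htons(port) };
	inet_pton(AF_INET, local_ip, &local.sin_addr);
	if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
		goto err;

	struct sockaddr_in peer = { .sin_family = AF_INET, .sin_port = htons(port) };
	inet_pton(AF_INET, peer_ip, &peer.sin_addr);

	const char msg[] = "hello over rds";
	if (sendto(fd, msg, sizeof(msg), 0, (struct sockaddr *)&peer, sizeof(peer)) < 0)
		goto err;

	close(fd);
	return 0;
err:
	close(fd);
	return -1;
}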
> I'm suprirsed to hear someone advocate that is a good thing when we were > all told that the hfi1 cdev *must* exist because the kernel transition through > verbs was far to expensive. It depends on the goal vendors have with verbs vs other APIs such as libfabric. hfi1's verbs goal was focused on storage bandwidth and the cdev was focused on HPC latency and bandwidth for MPI via PSM2 and libfabric. I'm unclear why we are debating hfi1 here, seems it should be in another thread. > What is a UD-X? UD-X is a vendor specific set of HW interfaces and wire protocols implemented in UCX for nVidia Connect-X series of network devices. Many of its concepts are very similar to those which ipath and hfi1 HW and software implemented. > rv seems to completely destroy alot of the HPC performance offloads that > vendors are layering on RC QPs Different vendors have different approaches to performance and chose different design trade-offs. Todd
On Fri, Mar 19, 2021 at 08:46:56PM +0000, Rimmer, Todd wrote: > > I'm suprirsed to hear someone advocate that is a good thing when we were > > all told that the hfi1 cdev *must* exist because the kernel transition through > > verbs was far to expensive. > > It depends on the goal vendors have with verbs vs other APIs such as > libfabric. hfi1's verbs goal was focused on storage bandwidth and > the cdev was focused on HPC latency and bandwidth for MPI via PSM2 > and libfabric. I'm unclear why we are debating hfi1 here, seems it > should be in another thread. Because this is copied from hfi1? > > What is a UD-X? > UD-X is a vendor specific set of HW interfaces and wire protocols > implemented in UCX for nVidia Connect-X series of network devices. > Many of it's concepts are very similar to those which ipath and hfi1 > HW and software implemented. Oh, there is lots of stuff in UCX, I'm not surprised you see similarities to what PSM did since psm/libfabric/ucx are all solving the same problems. > > rv seems to completely destroy alot of the HPC performance offloads that > > vendors are layering on RC QPs > Different vendors have different approaches to performance and chose > different design trade-offs. That isn't my point, by limiting the usability you also restrict the drivers where this would meaningfully be useful. So far we now know that it is not useful for mlx5 or hfi1, that leaves only hns unknown and still in the HPC arena. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, March 19, 2021 4:55 PM > To: Rimmer, Todd <todd.rimmer@intel.com> > Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>; Wan, > Oh, there is lots of stuff in UCX, I'm not surprised you similarities to what > PSM did since psm/libfabric/ucx are all solving the same problems. > > > > rv seems to completely destroy alot of the HPC performance offloads > > > that vendors are layering on RC QPs > > > Different vendors have different approaches to performance and chose > > different design trade-offs. > > That isn't my point, by limiting the usability you also restrict the drivers where > this would meaningfully be useful. > > So far we now know that it is not useful for mlx5 or hfi1, that leaves only hns > unknown and still in the HPC arena. [Wan, Kaike] Incorrect. The rv module works with hfi1. > > Jason
On 3/19/2021 4:59 PM, Wan, Kaike wrote: >> From: Jason Gunthorpe <jgg@nvidia.com> >> Sent: Friday, March 19, 2021 4:55 PM >> To: Rimmer, Todd <todd.rimmer@intel.com> >> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>; Wan, >> Oh, there is lots of stuff in UCX, I'm not surprised you similarities to what >> PSM did since psm/libfabric/ucx are all solving the same problems. >> >>>> rv seems to completely destroy alot of the HPC performance offloads >>>> that vendors are layering on RC QPs >> >>> Different vendors have different approaches to performance and chose >>> different design trade-offs. >> >> That isn't my point, by limiting the usability you also restrict the drivers where >> this would meaningfully be useful. >> >> So far we now know that it is not useful for mlx5 or hfi1, that leaves only hns >> unknown and still in the HPC arena. > [Wan, Kaike] Incorrect. The rv module works with hfi1. Interesting. I was thinking the opposite. So what's the benefit? When would someone want to do that? -Denny
> From: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> > Sent: Friday, March 19, 2021 5:28 PM > To: Wan, Kaike <kaike.wan@intel.com>; Jason Gunthorpe <jgg@nvidia.com>; > Rimmer, Todd <todd.rimmer@intel.com> > Cc: dledford@redhat.com; linux-rdma@vger.kernel.org > Subject: Re: [PATCH RFC 0/9] A rendezvous module > > On 3/19/2021 4:59 PM, Wan, Kaike wrote: > >> From: Jason Gunthorpe <jgg@nvidia.com> > >> Sent: Friday, March 19, 2021 4:55 PM > >> To: Rimmer, Todd <todd.rimmer@intel.com> > >> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>; > >> Wan, Oh, there is lots of stuff in UCX, I'm not surprised you > >> similarities to what PSM did since psm/libfabric/ucx are all solving the > same problems. > >> > >>>> rv seems to completely destroy alot of the HPC performance offloads > >>>> that vendors are layering on RC QPs > >> > >>> Different vendors have different approaches to performance and chose > >>> different design trade-offs. > >> > >> That isn't my point, by limiting the usability you also restrict the > >> drivers where this would meaningfully be useful. > >> > >> So far we now know that it is not useful for mlx5 or hfi1, that > >> leaves only hns unknown and still in the HPC arena. > > [Wan, Kaike] Incorrect. The rv module works with hfi1. > > Interesting. I was thinking the opposite. So what's the benefit? When would > someone want to do that? [Wan, Kaike] This is only because rv works with the generic Verbs interface and hfi1 happens to be one of the devices that implements the Verbs interface. > > -Denny
On Fri, Mar 19, 2021 at 09:58:21PM +0000, Wan, Kaike wrote: > > On 3/19/2021 4:59 PM, Wan, Kaike wrote: > > >> From: Jason Gunthorpe <jgg@nvidia.com> > > >> Sent: Friday, March 19, 2021 4:55 PM > > >> To: Rimmer, Todd <todd.rimmer@intel.com> > > >> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>; > > >> Wan, Oh, there is lots of stuff in UCX, I'm not surprised you > > >> similarities to what PSM did since psm/libfabric/ucx are all solving the > > same problems. > > >> > > >>>> rv seems to completely destroy alot of the HPC performance offloads > > >>>> that vendors are layering on RC QPs > > >> > > >>> Different vendors have different approaches to performance and chose > > >>> different design trade-offs. > > >> > > >> That isn't my point, by limiting the usability you also restrict the > > >> drivers where this would meaningfully be useful. > > >> > > >> So far we now know that it is not useful for mlx5 or hfi1, that > > >> leaves only hns unknown and still in the HPC arena. > > > [Wan, Kaike] Incorrect. The rv module works with hfi1. > > > > Interesting. I was thinking the opposite. So what's the benefit? When would > > someone want to do that? > [Wan, Kaike] This is only because rv works with the generic Verbs > interface and hfi1 happens to be one of the devices that implements > the Verbs interface. Well, it works with everything (except EFA), but that is not my point. I said it is not useful. Jason
> > [Wan, Kaike] Incorrect. The rv module works with hfi1. > > Interesting. I was thinking the opposite. So what's the benefit? When would > someone want to do that? The more interesting scenario is for customers who would like to run libfabric and other Open Fabrics Alliance software over various verbs capable hardware. Today PSM2 is a good choice for OPA hardware. However for some other devices without existing libfabric providers, rxm and rxd are the best choices. As was presented in Open Fabrics workshop today by James Erwin, PSM3 offers noticeable benefits over existing libfabric rxm and rxd providers and the rv module offers noticeable performance benefits when using PSM3. > This driver is intended to work with a fork of the PSM2 library. The > PSM2 library which is for Omni-Path is now maintained by Cornelis > Networks on our GitHub. PSM3 is something from Intel for Ethernet. I > know it's a bit confusing. Intel retains various IP and trademark rights. Intel's marketing team analyzed and chose the name PSM3. Obviously there are plusses and minuses to any name choice. This is not unlike other industry software history where new major revisions often add and remove support for various HW generations. PSM(1) - supported infinipath IB adapters, was a standalone API (various forms). PSM2 - dropped support for infinipath and IB and added support for Omni-Path, along with various features, also added libfabric support. PSM3 - dropped support for Omni-Path, added support for RoCE and verbs capable devices, along with other features, also dropped PSM2 API and standardized on libfabric. All three have similar strategies of onload protocols for eager messages and shared kernel/HW resources for large messages and direct data placement (RDMA). So the name Performance Scaled Messaging is meant to reflect the concept and approach as opposed to reflecting a specific HW implementation or even API. PSM3 is only available as a libfabric provider. > I haven't had a chance to look beyond the cover letter in depth at how things > have changed. I really hope it's not that bad. While a few stylistic elements got carried forward, as you noticed, this is much different from hfi1 as it doesn't directly access hardware and is hence smaller. We carefully looked at overlap with features in ib_core and the patch set contains a couple minor API additions to ib_core to simplify some operations which others may find useful. > I also don't know why you picked the name rv, this looks like it has little to do with the usual MPI rendezvous protocol. The focus of the design was to support the bulk transfer part of the MPI rendezvous protocol, hence the name rv. We'd welcome other name suggestions, wanted to keep the name simple and brief. > No pre-adding reserved stuff > Lots of alignment holes, don't do that either. We'd like advice on a challenging situation. Some customers desire NICs to support nVidia GPUs in some environments. Unfortunately the nVidia GPU drivers are not upstream, and have not been for years. So we are forced to have both out of tree and upstream versions of the code. We need the same applications to be able to work over both, so we would like the GPU enabled versions of the code to have the same ABI as the upstream code as this greatly simplifies things. We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved" mechanisms to hold places for GPU specific fields which can't be upstreamed. Todd
On Fri, Mar 19, 2021 at 10:57:20PM +0000, Rimmer, Todd wrote: > We'd like advise on a challenging situation. Some customers desire > NICs to support nVidia GPUs in some environments. Unfortunately the > nVidia GPU drivers are not upstream, and have not been for years. Yep. > So we are forced to have both out of tree and upstream versions of > the code. Sure > We need the same applications to be able to work over > both, so we would like the GPU enabled versions of the code to have > the same ABI as the upstream code as this greatly simplifies things. The only answer is to get it upstream past community review before you ship anything to customers. > We have removed all GPU specific code from the upstream submission, > but used both the "alignment holes" and the "reserved" mechanisms to > hold places for GPU specific fields which can't be upstreamed. Sorry, nope, you have to do something else. We will *not* have garbage in upstream or be restricted in any way based on out of tree code. Jason
On 3/19/2021 6:57 PM, Rimmer, Todd wrote: >>> [Wan, Kaike] Incorrect. The rv module works with hfi1. >> >> Interesting. I was thinking the opposite. So what's the benefit? When would >> someone want to do that? > The more interesting scenario is for customers who would like to run libfabric and other Open Fabrics Alliance software over various verbs capable hardware. Ah ok that makes sense. Not that running it over hfi1 is the goal but being able to run over verbs devices. Makes sense to me now. > Today PSM2 is a good choice for OPA hardware. However for some other devices without existing libfabric providers, rxm and rxd are the best choices. > As was presented in Open Fabrics workshop today by James Erwin, PSM3 offers noticeable benefits over existing libfabric rxm and rxd providers > and the rv module offers noticeable performance benefits when using PSM3. For those that haven't seen it, the talks will be posted to YouTube and/or the OpenFabrics.org web page. There are actually two talks on this stuff. The first, by Todd, is available now [1]; James' talk will be up soon I'm sure. >> I haven't had a chance to look beyond the cover letter in depth at how things >> have changed. I really hope it's not that bad. > While a few stylistic elements got carried forward, as you noticed. This is much different from hfi1 as it doesn't directly access hardware and is hence smaller. > We carefully looked at overlap with features in ib_core and the patch set contains a couple minor API additions to ib_core to simplify some operations > which others may find useful. Right, so if there is common functionality between hfi1 and rv then it might belong in the core. Especially if it's something that's common between a ULP and a HW driver. >> I also don't know why you picked the name rv, this looks like it has little to do with the usual MPI rendezvous protocol. > The focus of the design was to support the bulk transfer part of the MPI rendezvous protocol, hence the name rv. > We'd welcome other name suggestions, wanted to keep the name simple and brief. Like I said previously, you can place the blame for the name on me. Kaike and Todd just carried it forward. I think Sean had an idea in one of the other replies. Let's hear some other suggestions too. >> No pre-adding reserved stuff >> Lots of alignment holes, don't do that either. > We'd like advise on a challenging situation. Some customers desire NICs to support nVidia GPUs in some environments. > Unfortunately the nVidia GPU drivers are not upstream, and have not been for years. So we are forced to have both out of tree > and upstream versions of the code. We need the same applications to be able to work over both, so we would like the > GPU enabled versions of the code to have the same ABI as the upstream code as this greatly simplifies things. > We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved" > mechanisms to hold places for GPU specific fields which can't be upstreamed. This problem extends to other drivers as well. I'm also interested in advice on the situation. I don't particularly like this either, but we need a way to accomplish the goal. We owe it to users to be flexible. Please offer suggestions. [1] https://www.youtube.com/watch?v=iOvt_Iqz0uU -Denny
On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote: > On 3/19/2021 6:57 PM, Rimmer, Todd wrote: > > > > [Wan, Kaike] Incorrect. The rv module works with hfi1. <...> > > We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved" > > mechanisms to hold places for GPU specific fields which can't be upstreamed. > > This problem extends to other drivers as well. I'm also interested in advice > on the situation. I don't particularly like this either, but we need a way > to accomplish the goal. We owe it to users to be flexible. Please offer > suggestions. Sorry to interrupt, but it seems that solution was said here [1]. It wasn't said directly, but between the lines it means that you need two things: 1. Upstream everything. 2. Find another vendor that jumps on this RV bandwagon. [1] https://lore.kernel.org/linux-rdma/20210319230644.GI2356281@nvidia.com Thanks
On Fri, Mar 19, 2021 at 08:34:17PM +0000, Hefty, Sean wrote: > > > > I also don't know why you picked the name rv, this looks like it has little to do > > > > with the usual MPI rendezvous protocol. This is all about bulk transfers. It is > > > > actually a lot like RDS. Maybe you should be using RDS? > > > > > [Wan, Kaike] While there are similarities in concepts, details are > > > different. > > > > You should list these differences. > > > > > Quite frankly this could be viewed as an application accelerator > > > much like RDS served that purpose for Oracle, which continues to be > > > its main use case. > > > > Obviously, except it seems to be doing the same basic acceleration > > technique as RDS. > > A better name for this might be "scalable RDMA service", with RDMA > meaning the transport operation. My understanding is this is > intended to be usable over any IB/RoCE device. It looks like it is really all about bulk zero-copy transfers with kernel involvement. Jason
On 3/21/2021 4:56 AM, Leon Romanovsky wrote: > On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote: >> On 3/19/2021 6:57 PM, Rimmer, Todd wrote: >>>>> [Wan, Kaike] Incorrect. The rv module works with hfi1. > > <...> > >>> We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved" >>> mechanisms to hold places for GPU specific fields which can't be upstreamed. >> >> This problem extends to other drivers as well. I'm also interested in advice >> on the situation. I don't particularly like this either, but we need a way >> to accomplish the goal. We owe it to users to be flexible. Please offer >> suggestions. > > Sorry to interrupt, but it seems that solution was said here [1]. > It wasn't said directly, but between the lines it means that you need > two things: > 1. Upstream everything. Completely agree. However the GPU code from nvidia is not upstream. I don't see that issue getting resolved in this code review. Let's move on. > 2. Find another vendor that jumps on this RV bandwagon. That's not a valid argument. Clearly this works over multiple vendors HW. We should be trying to get things upstream, not putting up walls when people want to submit new drivers. Calling code "garbage" [1] is not productive. > [1] https://lore.kernel.org/linux-rdma/20210319230644.GI2356281@nvidia.com -Denny
On Sun, Mar 21, 2021 at 12:24:39PM -0400, Dennis Dalessandro wrote: > On 3/21/2021 4:56 AM, Leon Romanovsky wrote: > > On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote: > > > On 3/19/2021 6:57 PM, Rimmer, Todd wrote: > > > > > > [Wan, Kaike] Incorrect. The rv module works with hfi1. > > > > <...> > > > > > > We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved" > > > > mechanisms to hold places for GPU specific fields which can't be upstreamed. > > > > > > This problem extends to other drivers as well. I'm also interested in advice > > > on the situation. I don't particularly like this either, but we need a way > > > to accomplish the goal. We owe it to users to be flexible. Please offer > > > suggestions. > > > > Sorry to interrupt, but it seems that solution was said here [1]. > > It wasn't said directly, but between the lines it means that you need > > two things: > > 1. Upstream everything. > > Completely agree. However the GPU code from nvidia is not upstream. I don't > see that issue getting resolved in this code review. Let's move on. > > > 2. Find another vendor that jumps on this RV bandwagon. > > That's not a valid argument. Clearly this works over multiple vendors HW. At a certain point we have to decide if this is a generic code of some kind or a driver-specific thing like HFI has. There are also obvious technical problems designing it as a ULP, so it is a very important question to answer. If it will only ever be used by one Intel ethernet chip then maybe it isn't really generic code. On the other hand it really looks like it overlaps in various ways with both RDS and the qib/hfi1 cdev, so why isn't there any effort to have some commonality?? > We should be trying to get things upstream, not putting up walls when people > want to submit new drivers. Calling code "garbage" [1] is not productive. Putting a bunch of misaligned structures and random reserved fields *is* garbage by the upstream standard and if I send that to Linus I'll get yelled at. And you certainly can't say "we are already shipping this ABI so we won't change it" either. You can't square this circle by compromising the upstream world in any way, it is simply not accepted by the overall community. All of you should know this, I shouldn't have to lecture on this! Also no, we should not be "trying to get things upstream" as some goal in itself. Upstream is not a trashcan to dump stuff into, someone has to maintain all of this long after it stops being interesting, so there better be good reasons to put it here in the first place. If it isn't obvious, I'll repeat again: I'm highly annoyed that Intel is sending something like this RV, in the state it is in, to support their own out of tree driver, that they themselves have been dragging their feet on responding to review comments so it can be upstream for *years*. Jason
On 3/21/2021 12:45 PM, Jason Gunthorpe wrote: > On Sun, Mar 21, 2021 at 12:24:39PM -0400, Dennis Dalessandro wrote: >> On 3/21/2021 4:56 AM, Leon Romanovsky wrote: >>> On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote: >>>> On 3/19/2021 6:57 PM, Rimmer, Todd wrote: >>>>>>> [Wan, Kaike] Incorrect. The rv module works with hfi1. >>> >>> <...> >>> >>>>> We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved" >>>>> mechanisms to hold places for GPU specific fields which can't be upstreamed. >>>> >>>> This problem extends to other drivers as well. I'm also interested in advice >>>> on the situation. I don't particularly like this either, but we need a way >>>> to accomplish the goal. We owe it to users to be flexible. Please offer >>>> suggestions. >>> >>> Sorry to interrupt, but it seems that solution was said here [1]. >>> It wasn't said directly, but between the lines it means that you need >>> two things: >>> 1. Upstream everything. >> >> Completely agree. However the GPU code from nvidia is not upstream. I don't >> see that issue getting resolved in this code review. Let's move on. >> >>> 2. Find another vendor that jumps on this RV bandwagon. >> >> That's not a valid argument. Clearly this works over multiple vendors HW. > > At a certain point we have to decide if this is a generic code of some > kind or a driver-specific thing like HFI has. > > There are also obvious technial problems designing it as a ULP, so it > is a very important question to answer. If it will only ever be used > by one Intel ethernet chip then maybe it isn't really generic code. Todd/Kaike, is there something in here that is specific to the Intel ethernet chip? > On the other hand it really looks like it overlaps in various ways > with both RDS and the qib/hfi1 cdev, so why isn't there any effort to > have some commonality?? Maybe that's something that should be explored. Isn't this along the lines of stuff we talked about with the verbs 2.0 stuff, or whatever we ended up calling it. >> We should be trying to get things upstream, not putting up walls when people >> want to submit new drivers. Calling code "garbage" [1] is not productive. > > Putting a bunch of misaligned structures and random reserved fields > *is* garbage by the upstream standard and if I send that to Linus I'll > get yelled at. Not saying you should send this to Linus. I'm saying we should figure out a way to make it better and insulting people and their hard work isn't helping. This is the kind of culture we are trying to get away from in the kernel world. > And you certainly can't say "we are already shipping this ABI so we > won't change it" either. > > You can't square this circle by compromising the upstream world in any > way, it is simply not accepted by the overall community. > > All of you should know this, I shouldn't have to lecture on this! No one is suggesting to compromise the upstream world. There is a bigger picture here. The answer for this driver may just be take out the reserved stuff. That's pretty simple. The bigger question is how do we deal with non-upstream code. It can't be a problem unique to the RDMA subsystem. How do others deal with it? > Also no, we should not be "trying to get things upstream" as some goal > in itself. Upstream is not a trashcan to dump stuff into, someone has > to maintain all of this long after it stops being interesting, so > there better be good reasons to put it here in the first place. 
That is completely not what I meant at all. I mean we should be trying to get rid of the proprietary, and out of tree stuff. It doesn't at all imply to fling crap against the wall and hope it sticks. We should be encouraging vendors to submit their code and work with them to get it in shape. We clearly have a problem with vendor proprietary code not being open. Let's not encourage that behavior. Vendors should say I want to submit my code to the Linux kernel. Not eh, it's too much of a hassle and kernel people are jerks, so we'll just post it on our website. > If it isn't obvious, I'll repeat again: I'm highly annoyed that Intel > is sending something like this RV, in the state it is in, to support > their own out of tree driver, that they themselves have been dragging > their feet on responding to review comments so it can be upstream for > *years*. To be fair it is sent as RFC. So to me that says they know it's a ways off from being ready to be included and are starting the process early. -Denny
On Sun, Mar 21, 2021 at 01:21:14PM -0400, Dennis Dalessandro wrote: > Maybe that's something that should be explored. Isn't this along the lines > of stuff we talked about with the verbs 2.0 stuff, or whatever we ended up > calling it. I think verbs 2.0 turned into the ioctl uapi stuff, I don't remember anymore. > > > We should be trying to get things upstream, not putting up walls when people > > > want to submit new drivers. Calling code "garbage" [1] is not productive. > > > > Putting a bunch of misaligned structures and random reserved fields > > *is* garbage by the upstream standard and if I send that to Linus I'll > > get yelled at. > > Not saying you should send this to Linus. I'm saying we should figure out a > way to make it better and insulting people and their hard work isn't > helping. No - you've missed what happened here. Todd responded very fast and explained - Intel *knowingly* sent code that was sub-standard as some calculated attempt to make Intel's life maintaining their out of tree drivers easier. This absolutely needs strong language as I can't catch everything and people need to understand there are real consequences to violating the community trust in this way! > No one is suggesting to compromise the upstream world. I'm not sure what you mean - what could upstream do to change the situation in any way other than compromising on what will be merged? > There is a bigger picture here. The answer for this driver may just > be take out the reserved stuff. That's pretty simple. The bigger > question is how do we deal with non-upstream code. It can't be a > problem unique to the RDMA subsystem. How do others deal with it? The kernel community consensus is that upstream code is on its own. We don't change the kernel to accommodate out-of-tree code. If the kernel changes and out of tree breaks nobody cares. Over a long time it has been proven that this methodology is a good way to effect business change to align with the community consensus development model - eventually the costs of being out of tree have bad ROI and companies align. > That is completely not what I meant at all. I mean we should be > trying to get rid of the proprietary, and out of tree stuff. It > doesn't at all imply to fling crap against the wall and hope it > sticks. We should be encouraging vendors to submit their code and > work with them to get it in shape. Well, I am working on something like 4-5 Intel series right now, and it sometimes does feel like flinging. Have you seen Greg KH's remarks that he won't even look at Intel patches until they have internal expert sign off? > not encourage that behavior. Vendors should say I want to submit my code to > the Linux kernel. Not eh, it's too much of a hassle and kernel people are > jerks, so we'll just post it on our website. There is a very, very fine line between "working with the community" and "the community will provide free expert engineering work to our staff". > To be fair it is sent as RFC. So to me that says they know it's a ways off > from being ready to be included and are starting the process early. I'm not sure why it is RFC. It sounds like it is much more than this if someone has already made a version with links to the NVIDIA GPU driver and has reached a point where they care about ABI stability? Jason
> From: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> > Sent: Sunday, March 21, 2021 1:21 PM > To: Jason Gunthorpe <jgg@nvidia.com> > Cc: Leon Romanovsky <leon@kernel.org>; Rimmer, Todd > <todd.rimmer@intel.com>; Wan, Kaike <kaike.wan@intel.com>; > dledford@redhat.com; linux-rdma@vger.kernel.org > Subject: Re: [PATCH RFC 0/9] A rendezvous module > > On 3/21/2021 12:45 PM, Jason Gunthorpe wrote: > > On Sun, Mar 21, 2021 at 12:24:39PM -0400, Dennis Dalessandro wrote: > >> On 3/21/2021 4:56 AM, Leon Romanovsky wrote: > >>> On Sat, Mar 20, 2021 at 12:39:46PM -0400, Dennis Dalessandro wrote: > >>>> On 3/19/2021 6:57 PM, Rimmer, Todd wrote: > >>>>>>> [Wan, Kaike] Incorrect. The rv module works with hfi1. > >>> > >>> <...> > >>> > >> > >>> 2. Find another vendor that jumps on this RV bandwagon. > >> > >> That's not a valid argument. Clearly this works over multiple vendors HW. > > > > At a certain point we have to decide if this is a generic code of some > > kind or a driver-specific thing like HFI has. > > > > There are also obvious technial problems designing it as a ULP, so it > > is a very important question to answer. If it will only ever be used > > by one Intel ethernet chip then maybe it isn't really generic code. > > Todd/Kaike, is there something in here that is specific to the Intel ethernet > chip? [Wan, Kaike] No. The code is generic to Verbs/ROCE devices.
> Over a long time it has been proven that this methodology is a good way to effect business change to align with the community consensus development model - eventually the costs of being out of tree have bad ROI and companies align. Agree. The key question is when will nVidia upstream its drivers so companies don't have to endure the resulting "bad ROI" of being forced to have unique out of tree solutions. > > Putting a bunch of misaligned structures and random reserved fields > > *is* garbage by the upstream standard and if I send that to Linus > > I'll get yelled at. Let's not overexaggerate this. The fields were organized in a logical manner for end user consumption and understanding. A couple resv fields were used for alignment. I alluded to the fact we may have some ideas on how those fields or gaps could be used for GPU support in the future and this might help deal with the lack of upstream nVidia code and reduce the ROI challenges for other vendors which that causes. The discussion seemed centered around the attach parameters; attach is executed once per process at job start. Todd
On Mon, Mar 22, 2021 at 03:17:05PM +0000, Rimmer, Todd wrote: > > Over a long time it has been proven that this methodology is a > > good way to effect business change to align with the community > > consensus development model - eventually the costs of being out of > > tree have bad ROI and companies align. > > Agree. The key question is when will nVidia upstream it's drivers > so companies don't have to endure the resulting "bad ROI" of being > forced to have unique out of tree solutions. If you are working with the NVIDIA GPU and having inefficiencies then you need to take it through your business relationship, not here. > > > Putting a bunch of misaligned structures and random reserved > > > fields *is* garbage by the upstream standard and if I send that > > > to Linus I'll get yelled at. > Let's not overexaggerate this. The fields we're organized in a > logical manner for end user consumption and understanding. A couple > resv fields were used for alignment. No, a couple of resv fields were randomly added and *a lot* of other misalignments were ignored. That is not our standard for ABI design. Jason
> > To be fair it is sent as RFC. So to me that says they know it's a ways off > > from being ready to be included and are starting the process early. > > I'm not sure why it is RFC. It sounds like it is much more than this > if someone has already made a version with links to the NVIDIA GPU > driver and has reached a point where they care about ABI stablility? I can take some blame here. A couple of us were asked to look at this module. Because the functionality is intended to be device agnostic, we assumed there would be more scrutiny, and we weren't sure how acceptable some aspects would be (e.g. mr cache, ib cm data format). Rather than debate this internally for months, rework the code, and still miss, we asked Kaike to post an RFC to get community feedback. For *upstreaming* purposes it was intended as an RFC to gather feedback on the overall approach. That should have been made clearer. The direct feedback that Kaike is examining is the difference between this approach and RDS. - Sean
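For readers unfamiliar with the term, the "mr cache" mentioned here is a registration cache: reuse an existing MR when an application re-sends from the same buffer instead of paying for ibv_reg_mr() on every transfer, and drop the entry when the underlying pages go away. A deliberately simplified, single-threaded sketch of the lookup side follows; it is not the rv implementation, which also has to handle invalidation and eviction.

#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

/* One cached registration; a real cache would use an interval tree and
 * an invalidation hook so stale VA ranges are never reused. */
struct mr_cache_entry {
	uint64_t addr, len;
	int access;
	struct ibv_mr *mr;
	struct mr_cache_entry *next;
};

static struct mr_cache_entry *cache_head;

struct ibv_mr *cached_reg_mr(struct ibv_pd *pd, void *buf, size_t len, int access)
{
	/* Hit: an existing registration already covers this range and access. */
	for (struct mr_cache_entry *e = cache_head; e; e = e->next)
		if ((uint64_t)buf >= e->addr &&
		    (uint64_t)buf + len <= e->addr + e->len &&
		    (e->access & access) == access)
			return e->mr;

	/* Miss: register and remember it for the next transfer. */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, access);
	if (!mr)
		return NULL;

	struct mr_cache_entry *e = calloc(1, sizeof(*e));
	if (e) {
		e->addr = (uint64_t)buf;
		e->len = len;
		e->access = access;
		e->mr = mr;
		e->next = cache_head;
		cache_head = e;
	}
	return mr;
}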
On Fri, Mar 19, 2021 at 10:57:20PM +0000, Rimmer, Todd wrote: > We'd like advise on a challenging situation. Some customers desire NICs to support nVidia GPUs in some environments. > Unfortunately the nVidia GPU drivers are not upstream, and have not been for years. So we are forced to have both out of tree > and upstream versions of the code. We need the same applications to be able to work over both, so we would like the > GPU enabled versions of the code to have the same ABI as the upstream code as this greatly simplifies things. > We have removed all GPU specific code from the upstream submission, but used both the "alignment holes" and the "reserved" > mechanisms to hold places for GPU specific fields which can't be upstreamed. NVIDIA GPUs are supported by drivers/gpu/drm/nouveau/, and you are encouraged to support them just like all the other in-tree GPU drivers. Not sure what support a network protocol would need for a specific GPU. You're probably trying to do something amazingly stupid here instead of relying on proper kernel subsystem use.
On Sun, Mar 21, 2021 at 12:24:39PM -0400, Dennis Dalessandro wrote: > Completely agree. However the GPU code from nvidia is not upstream. I don't > see that issue getting resolved in this code review. Let's move on. In which case you need to stop wasting everyone's time with your piece of crap code base, because it is completely and utterly irrelevant.
On Sun, Mar 21, 2021 at 01:45:04PM -0300, Jason Gunthorpe wrote: > > We should be trying to get things upstream, not putting up walls when people > > want to submit new drivers. Calling code "garbage" [1] is not productive. > > Putting a bunch of misaligned structures and random reserved fields > *is* garbage by the upstream standard and if I send that to Linus I'll > get yelled at. And when this is for out of tree crap it is per the very definition of the word garbage. If I was in your position I would not waste any time on such an utterly disrespectful submission that violates all the community rules.
On Sun, Mar 21, 2021 at 01:21:14PM -0400, Dennis Dalessandro wrote: > No one is suggesting to compromise the upstream world. There is a bigger > picture here. The answer for this driver may just be take out the reserved > stuff. That's pretty simple. The bigger question is how do we deal with > non-upstream code. It can't be a problem unique to the RDMA subsystem. How > do others deal with it? By ignoring it, or if it gets too annoying by making life for it so utterly painful that people give up on it.
On Tue, Mar 23, 2021 at 03:30:41PM +0000, Christoph Hellwig wrote: > On Fri, Mar 19, 2021 at 10:57:20PM +0000, Rimmer, Todd wrote: > > We'd like advise on a challenging situation. Some customers > > desire NICs to support nVidia GPUs in some environments. > > Unfortunately the nVidia GPU drivers are not upstream, and have > > not been for years. So we are forced to have both out of tree and > > upstream versions of the code. We need the same applications to > > be able to work over both, so we would like the GPU enabled > > versions of the code to have the same ABI as the upstream code as > > this greatly simplifies things. We have removed all GPU specific > > code from the upstream submission, but used both the "alignment > > holes" and the "reserved" mechanisms to hold places for GPU > > specific fields which can't be upstreamed. > > NVIDIA GPUs are supported by drivers/gpu/drm/nouveau/, and your are > encourage to support them just like all the other in-tree GPU > drivers. Not sure what support a network protocol would need for a > specific GPU. You're probably trying to do something amazingly > stupid here instead of relying on proper kernel subsystem use. The kernel building block for what they are trying to do with the GPU is the recently merged DMABUF MR support in the RDMA subsystem. I'd like to think that since Daniel's team at Intel got the DMABUF stuff merged to support the applications Todd's Intel team is building that this RV stuff is already fully ready for dmabuf... (hint hint) What Todd is alluding to here is the hacky DMABUF alternative that is in the NVIDIA GPU driver - which HPC networking companies must support if they want to interwork with the NVIDIA GPU. Every RDMA vendor playing in the HPC space has some out-of-tree driver to enable this. :( Jason
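For reference, the DMABUF MR support Jason mentions is exposed to userspace through rdma-core as ibv_reg_dmabuf_mr(); a minimal usage sketch (the dma-buf fd would come from whatever exporter, e.g. a GPU driver, which is exactly the out-of-tree piece being discussed here):

#include <stddef.h>
#include <infiniband/verbs.h>

/* Register an exporter's buffer that is exposed as a dma-buf.
 * 'dmabuf_fd' is assumed to have been obtained from the exporting driver. */
struct ibv_mr *register_dmabuf(struct ibv_pd *pd, int dmabuf_fd, size_t len)
{
	/* offset 0 into the dma-buf, iova 0, with local write plus remote
	 * read/write access so the region can be the target of RDMA. */
	return ibv_reg_dmabuf_mr(pd, 0 /* offset */, len, 0 /* iova */, dmabuf_fd,
				 IBV_ACCESS_LOCAL_WRITE |
				 IBV_ACCESS_REMOTE_READ |
				 IBV_ACCESS_REMOTE_WRITE);
}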
On Tue, Mar 23, 2021 at 12:46:26PM -0300, Jason Gunthorpe wrote: > What Todd is alluding to here is the hacky DMABUF alternative that is > in the NVIDIA GPU driver - which HPC networking companies must support > if they want to interwork with the NVIDIA GPU. > > Every RDMA vendor playing in the HPC space has some out-of-tree driver > to enable this. :( I can only recommend everyone buy from a less fucked up GPU or accelerator vendor. We're certainly going to do everything we can to discourage making life easier for people who insist on doing this stupidest possible thing.
> I can only recommende everone to buy from a less f***** up GPU or > accelerator vendor. I would certainly love that. This is not just a recent problem, it's been going on for at least 3-5 years with no end in sight. And the nvidia driver itself is closed-source in the kernel :-( which makes tuning and debug even harder and continues to add costs to NIC vendors other than nVidia themselves to support this. Back to the topic at hand, yes, there are a few misalignments in the ABI. Most of the structures are carefully aligned. Below I summarize the major structures and their alignment characteristics. In a few places we chose readability for the application programmer by ordering fields in a logical order, such as for statistics. In one place a superfluous resv field was used (rv_query_params_out) and when I alluded that might be able to be taken advantage of in the future to enable a common ABI for GPUs, we went down this deep rat hole. In studying all the related fields, in most cases if we shuffled everything for maximum packing, the structures would still end up being about the same size and this would all be for non-performance path ABIs. Here is a summary: 13 structures all perfectly aligned with no gaps, plus 7 structures below.

rv_query_params_out - has one 4 byte reserved field to guarantee alignment for a u64 which follows.

rv_attach_params_in - organized logically as early fields influence how later fields are interpreted. Fair number of fields, two 1 byte gaps and one 7 byte gap. Shuffling this might save about 4-8 bytes tops.

rv_cache_stats_params_out - ordered logically by statistics meanings. Two 4 byte gaps could be solved by having a less logical order. Of course, applications reporting these statistics will tend to do output in the original order, so packing this results in a harder to use ABI and more difficult code review for application writers wanting to make sure they report all stats but do so in a logical order.

rv_conn_get_stats_params_out - one 2 byte gap (so the 1 byte field mimicking the input request can be 1st), three 4 byte gaps. Same explanation as rv_cache_stats_params_out.

rv_conn_create_params_in - one 4 byte gap, easy enough to swap.

rv_post_write_params_out - one 3 byte gap. Presented in logical order; shuffling would still yield the same size as the compiler will round up the size.

rv_event - carefully packed and aligned. Had to make this work on a wide range of compilers with a 1 byte common field defining which part of the union was relevant. Could put the same field in all unions to get rid of the packed attribute if that is preferred. We found other similar examples like this in an older 4.18 kernel, cited one below.

It should be noted, there are existing examples with small gaps or reserved fields in the existing kernel and RDMA stack. A few examples in ib_user_verbs.h include:

ib_uverbs_send_wr - 4 byte gap after ex for the rdma field in union.
ib_uverbs_flow_attr - 2 byte reserved field declared.
ib_flow_spec - 2 byte gap after size field.
rdma_netdev - 7 byte gap after port_num.
./hw/mthca/mthca_eq.c - very similar use of packed for mthca_eqe to the one complained about in rv_event.

Todd
On Tue, Mar 23, 2021 at 05:25:43PM +0000, Rimmer, Todd wrote:
> In studying all the related fields, in most cases if we shuffled
> everything for maximum packing, the structures would still end up
> being about the same size and this would all be for non-performance
> path ABIs.

It is not about size, it is about uABI compatibility.  Do not place
packing holes in uABI structures, as it is *very complicated* to
guarantee those work across all the combinations of user/kernel we have
to support.

Every uABI structure should have explicit padding, it should be
organized to minimize padding, and "easy on the programmer" is not a
top concern.

You must ensure that all compilers generate the same structure byte
layout.  On x86 this means a classic i386 compiler, an AMD64 x32 ABI
compiler, and the normal AMD64 compiler.

> It should be noted, there are existing examples with small gaps or
> reserved fields in the existing kernel and RDMA stack.  A few
> examples in ib_user_verbs.h include:

Past mistakes do not excuse future errors.  Most of those are not uABI
structs.

Jason
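For reference, a minimal sketch of the convention described here
(hypothetical structure and field names, not an actual rv structure):
largest members first, every hole made explicit as a reserved field, so
that sizeof() and every offset are identical for i386, x32 and AMD64
builds.

#include <linux/types.h>

struct rv_example_params {		/* hypothetical uABI struct */
	__aligned_u64 buf_addr;		/* 64-bit members first */
	__aligned_u64 length;
	__u32 access_flags;
	__u16 index;
	__u8  mode;
	__u8  reserved;			/* explicit padding, must be zero */
};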
On Mon, Mar 22, 2021 at 05:31:07PM +0000, Hefty, Sean wrote:
> > > To be fair it is sent as RFC. So to me that says they know it's a ways off
> > > from being ready to be included and are starting the process early.
> >
> > I'm not sure why it is RFC. It sounds like it is much more than this
> > if someone has already made a version with links to the NVIDIA GPU
> > driver and has reached a point where they care about ABI stability?
>
> I can take some blame here.  A couple of us were asked to look at
> this module.  Because the functionality is intended to be device
> agnostic, we assumed there would be more scrutiny, and we weren't
> sure how acceptable some aspects would be (e.g. mr cache, ib cm data
> format).  Rather than debate this internally for months, rework the
> code, and still miss, we asked Kaike to post an RFC to get community
> feedback.  For *upstreaming* purposes it was intended as an RFC to
> gather feedback on the overall approach.  That should have been made
> clearer.

Well, it is hard to even have that kind of conversation when all the
details are wrong.

The way this interfaces with uverbs is *just wrong*; it completely
ignores the locking model for how disassociation works.  Something like
this, that is trying to closely integrate with uverbs, really cannot
exist without major surgery to uverbs.

If you want to do something along these lines, the uverbs parts cannot
be in a ULP.  The mr cache could be moved into some kind of new uverb;
that could be interesting if it is carefully designed.

The actual transport code.. that is going to be really hard.  RDS
doesn't integrate with uverbs for a reason: the kernel side owns the
QPs and PDs.  How you create a QP owned by the kernel but linked to a
PD owned by uverbs is going to need very delicate and careful work to
be somehow compatible with our disassociation model.

Are you *sure* this needs to be in the kernel?  You can't take a
context switch to a user process and use the shared verbs FD stuff to
create the de-duplicated QPs instead?  It is a much simpler design.

Have you thought about an actual *shared verbs QP*?  We have a lot of
other shared objects right now; it is not such a big step.  It does
require inter-process locking though, and that is non-trivial.

A shared QP with a kernel-owned send/recv to avoid the locking issue
could also be a very interesting solution.

Jason
> How you create a QP owned by the kernel but linked to a PD owned by uverbs is going to need very delicate and careful work to be somehow compatible with our disassociation model.
The primary usage mode is as follows:
The RC QP, PD and MR are all in the kernel.  The buffer virtual address and length are supplied by the user process and then used to look up an MR in the cache; on a miss, a kernel MR is created against the kernel PD.  There are separate MR caches per user process.
The IOs are initiated by the user, matched to an MR in the cache, and then an RDMA Write with Immediate is posted on the kernel RC QP.
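For context, a rough sketch of how a kernel ULP posts such a transfer
(ordinary kernel verbs usage, not the actual rv_rdma.c code; the QP, MR,
rkey, remote address and immediate value are assumed to come from the
cache lookup and the user's post request):

#include <rdma/ib_verbs.h>

static int post_write_imm(struct ib_qp *qp, struct ib_mr *mr,
			  u64 local_addr, u32 len,
			  u64 remote_iova, u32 rkey, u32 imm)
{
	struct ib_sge sge = {
		.addr   = local_addr,
		.length = len,
		.lkey   = mr->lkey,
	};
	struct ib_rdma_wr wr = {
		.wr = {
			.opcode      = IB_WR_RDMA_WRITE_WITH_IMM,
			.send_flags  = IB_SEND_SIGNALED,
			.sg_list     = &sge,
			.num_sge     = 1,
			.ex.imm_data = cpu_to_be32(imm),
		},
		.remote_addr = remote_iova,
		.rkey        = rkey,
	};
	const struct ib_send_wr *bad_wr;

	/* Completion is reported on the send CQ and then surfaced to the
	 * user through the mmap'ed event ring.
	 */
	return ib_post_send(qp, &wr.wr, &bad_wr);
}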
In concept, this is not unlike other kernel ULPs which perform direct data placement into user memory but use kernel QPs for connections to remote resources, such as various RDMA storage and filesystem ULPs.
Separating the MR registration call from the IO allows the registration cost of a miss to be partially hidden behind the end-to-end RTS/CTS exchange that occurs in user space.
There is a secondary usage mode where the MRs are cached but created against a user PD and later used by the user process against QPs in user space.  We found that this mode offered some slight latency advantages over the primary mode for tiny jobs, but suffered significant scalability issues.  Those latency advantages mainly manifested in microbenchmarks, but did help a few apps.
If it would simplify things, we could focus the discussion on the primary usage mode. Conceptually, the secondary usage mode may be a good candidate for an extension to uverbs (some form of register MR w/caching API where register MR checks a cache and deregister MR merely decrements a reference count in the cache).
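As a toy illustration of that caching semantic (register checks a cache
and takes a reference, deregister only drops the reference), and not the
actual rv_mr_cache.c implementation, which also handles eviction and MMU
notifier invalidation:

#include <stdint.h>
#include <stdlib.h>

struct mr_cache_entry {
	uint64_t addr;
	uint64_t len;
	uint32_t lkey, rkey;	/* whatever the verbs registration returned */
	int refcount;
	struct mr_cache_entry *next;
};

static struct mr_cache_entry *cache_head;

/* Look up (addr, len); on a hit just take another reference. */
struct mr_cache_entry *cache_register(uint64_t addr, uint64_t len)
{
	struct mr_cache_entry *e;

	for (e = cache_head; e; e = e->next) {
		if (e->addr <= addr && addr + len <= e->addr + e->len) {
			e->refcount++;
			return e;	/* hit: no verbs registration call */
		}
	}

	e = calloc(1, sizeof(*e));
	if (!e)
		return NULL;
	e->addr = addr;
	e->len = len;
	/* miss: a real implementation registers an MR here, fills lkey/rkey */
	e->refcount = 1;
	e->next = cache_head;
	cache_head = e;
	return e;
}

/* "Deregister" only drops the reference; eviction happens elsewhere. */
void cache_deregister(struct mr_cache_entry *e)
{
	if (e)
		e->refcount--;
}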
Todd
From: Kaike Wan <kaike.wan@intel.com>

RDMA transactions on RC QPs have a high demand for memory for HPC
applications such as MPI, especially on nodes with a high CPU core
count, where a process is normally dispatched to each core and an RC QP
is created for each remote process.  For a 100-node fabric, about
10 GB - 100 GB of memory is required for WQEs/buffers/QP state on each
server node.  This high demand imposes a severe restriction on the
scalability of an HPC fabric.

A number of solutions have been implemented over the years.

UD based solutions solve the scalability problem by requiring only one
UD QP per process, which can send a message to or receive a message
from any other process.  However, a UD QP does not have the performance
of an RC QP and the application has to manage the segmentation of large
messages.

SRQ reduces the memory demand by sharing a receive queue among multiple
QPs.  However, it still requires an RC QP for each remote process and
each RC QP still requires a send queue.  In addition, it is an optional
feature.

XRC further reduces the memory demand by allowing a single QP per
process to communicate with any process on a remote node.  In this
mode, each process requires only one QP for each remote node.  Again,
this is optional and not all vendors support it.

Dynamically Connected transport minimizes the memory usage, but
requires vendor-specific hardware and changes in applications.

Additionally, all of these mechanisms can induce latency jitter due to
the use of more QPs and more QP state, and hence additional PCIe
transactions at scale when NIC QP/CQ caches overflow.

Based on this, we are proposing a generic, vendor-agnostic approach to
address the RC scalability problem and potentially improve RDMA
performance for larger messages, which uses RDMA Write as part of a
rendezvous protocol.  Here are the key points of the proposal:

- For each device used by a job on a given node, there are a fixed
  number of RC connections (default: 4) between this node and any other
  remote node, no matter how many processes are running on each node
  for this job.  That is, for a given job and device, all the processes
  on a given node will communicate with all the processes of the same
  job on a remote node through the same fixed number of connections.
  This eliminates the increased memory demand caused by core count
  increases and reduces the overall RC connection setup costs.

- A memory region cache is added to reduce the MR
  registration/deregistration costs, as the same buffers tend to be
  used repeatedly by applications.  The MR cache is per process and can
  be accessed only by processes in the same job.

Here is how we are proposing to implement this application-enabling
rendezvous module that takes advantage of the various features of the
RDMA subsystem:

- Create a rendezvous module (rv) under drivers/infiniband/ulp/rv/.
  This can be changed if a better location is recommended.

- The rv module adds a char device file /dev/rv that user
  applications/middlewares can open to communicate with the rv module
  (the PSM3 OFI provider in libfabric 1.12.0 being the primary use
  case).  The file interface is crucial for associating the job/process
  with the MRs, the connection endpoints, and the RDMA requests.

  rv_file_open - Open a handle to the rv module for the current user
  process.

  rv_file_close - Close the handle and clean up.

  rv_file_ioctl - The main communication method: to attach to a device,
  to register/deregister an MR, to create connection endpoints, to
  connect, to query, to get stats, to do RDMA transactions, etc.
  rv_file_mmap - Mmap an event ring into user space for the current
  user process.

  (A rough user-space sketch of this open/attach/mmap flow is given
  after this description.)

- Basic mode of operations (PSM3 is used as an example for user
  applications):

  - A middleware (like MPI) has out-of-band communication channels
    between any two nodes, which are used to establish high performance
    communications for providers such as PSM3.

  - On each node, the PSM3 provider opens the rv module and issues an
    attach request to bind to a specific local device, specifying
    information specific to the job.  This associates the user process
    with the RDMA device and the job.

  - On each node, the PSM3 provider will establish user space low
    latency eager and control message communications, typically via
    user space UD QPs.

  - On each node, the PSM3 provider will mmap a ring buffer for events
    from the rv module.

  - On each node, the PSM3 provider creates a set of connection
    endpoints by passing the destination info to the rv module.  If the
    local node is chosen as the listener for the connection (based on
    an address comparison between the two nodes), it will start to
    listen for any incoming IB CM connection requests and accept them
    when they arrive.  This step associates the shared connection
    endpoints with the job and device.

  - On each node, the PSM3 provider will request connection
    establishment through the rv module.  On the client node, rv sends
    an IB CM request to the remote node.  On the listener node, nothing
    additional needs to be done for the connect request from the PSM3
    provider.

  - On each node, the PSM3 provider will wait until the RC connection
    is established.  The PSM3 provider will query the rv module about
    the connection status periodically.

  - As large MPI messages are requested, the PSM3 provider will request
    rv to register MRs for the MPI application's send and receive
    buffers.  The rv module makes use of its MR cache to limit when
    such requests need to interact with verbs MR calls for the NIC.
    The PSM3 provider's control message mechanisms are used to exchange
    IO virtual addresses and rkeys for such buffers and MRs.

  - For RDMA transactions, the RDMA WRITE WITH IMMED request will be
    used.  The immediate data will contain info about the target
    process on the remote node and which outstanding rendezvous message
    this RDMA is for.  For send completion and receive completion, an
    event will be posted into the event ring buffer and the PSM3
    provider can poll for completion.

  - As RDMA transactions complete, the PSM3 provider will indicate
    completion of the corresponding MPI send/receive to the MPI
    middleware/application.  The PSM3 provider will also inform the rv
    module that it may deregister the MR.  The deregistration allows rv
    to cache the MR for use in a future request by the PSM3 provider to
    register the same memory for another MPI message.  The cached MR
    will be removed if the user buffer is freed and an MMU notifier
    event is received by the rv module.
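For illustration, here is roughly what the open/attach/mmap flow above
looks like from the provider's side.  RV_IOCTL_ATTACH, struct
rv_attach_params and the ring size below are placeholders only; the
real request codes and structures are defined in
include/uapi/rdma/rv_user_ioctls.h from this series.

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int rv_example(void)
{
	struct rv_attach_params {	/* placeholder, see rv_user_ioctls.h */
		char     dev_name[64];
		uint64_t job_key;
	} attach = { .dev_name = "rocep8s0", .job_key = 0x1234 };
	void *events;
	int fd;

	fd = open("/dev/rv", O_RDWR);	/* one handle per user process */
	if (fd < 0)
		return -1;

	/* Bind this process to a local device and to the job. */
	if (ioctl(fd, /* RV_IOCTL_ATTACH */ 0, &attach) < 0)
		goto fail;

	/* Map the event ring; send/recv completions are polled from here. */
	events = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (events == MAP_FAILED)
		goto fail;

	/* ... create/connect endpoints, register MRs, post RDMA writes ... */

	munmap(events, 4096);
	close(fd);
	return 0;
fail:
	close(fd);
	return -1;
}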
Please comment.

Thank you,

Kaike

Kaike Wan (9):
  RDMA/rv: Public interface for the RDMA Rendezvous module
  RDMA/rv: Add the internal header files
  RDMA/rv: Add the rv module
  RDMA/rv: Add functions for memory region cache
  RDMA/rv: Add function to register/deregister memory region
  RDMA/rv: Add connection management functions
  RDMA/rv: Add functions for RDMA transactions
  RDMA/rv: Add functions for file operations
  RDMA/rv: Integrate the file operations into the rv module

 MAINTAINERS                                |    6 +
 drivers/infiniband/Kconfig                 |    1 +
 drivers/infiniband/core/core_priv.h        |    4 +
 drivers/infiniband/core/rdma_core.c        |  237 ++
 drivers/infiniband/core/uverbs_cmd.c       |   54 +-
 drivers/infiniband/ulp/Makefile            |    1 +
 drivers/infiniband/ulp/rv/Kconfig          |   11 +
 drivers/infiniband/ulp/rv/Makefile         |    9 +
 drivers/infiniband/ulp/rv/rv.h             |  892 ++++++
 drivers/infiniband/ulp/rv/rv_conn.c        | 3037 ++++++++++++++++++++
 drivers/infiniband/ulp/rv/rv_file.c        | 1196 ++++++++
 drivers/infiniband/ulp/rv/rv_main.c        |  271 ++
 drivers/infiniband/ulp/rv/rv_mr.c          |  393 +++
 drivers/infiniband/ulp/rv/rv_mr_cache.c    |  428 +++
 drivers/infiniband/ulp/rv/rv_mr_cache.h    |  152 +
 drivers/infiniband/ulp/rv/rv_rdma.c        |  416 +++
 drivers/infiniband/ulp/rv/trace.c          |    7 +
 drivers/infiniband/ulp/rv/trace.h          |   10 +
 drivers/infiniband/ulp/rv/trace_conn.h     |  547 ++++
 drivers/infiniband/ulp/rv/trace_dev.h      |   82 +
 drivers/infiniband/ulp/rv/trace_mr.h       |  109 +
 drivers/infiniband/ulp/rv/trace_mr_cache.h |  181 ++
 drivers/infiniband/ulp/rv/trace_rdma.h     |  254 ++
 drivers/infiniband/ulp/rv/trace_user.h     |  321 +++
 include/rdma/uverbs_types.h                |   10 +
 include/uapi/rdma/rv_user_ioctls.h         |  558 ++++
 26 files changed, 9139 insertions(+), 48 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rv/Kconfig
 create mode 100644 drivers/infiniband/ulp/rv/Makefile
 create mode 100644 drivers/infiniband/ulp/rv/rv.h
 create mode 100644 drivers/infiniband/ulp/rv/rv_conn.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_file.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_main.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_mr.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_mr_cache.c
 create mode 100644 drivers/infiniband/ulp/rv/rv_mr_cache.h
 create mode 100644 drivers/infiniband/ulp/rv/rv_rdma.c
 create mode 100644 drivers/infiniband/ulp/rv/trace.c
 create mode 100644 drivers/infiniband/ulp/rv/trace.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_conn.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_dev.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_mr.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_mr_cache.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_rdma.h
 create mode 100644 drivers/infiniband/ulp/rv/trace_user.h
 create mode 100644 include/uapi/rdma/rv_user_ioctls.h