Message ID | 1480258296-27032-2-git-send-email-leon@kernel.org (mailing list archive) |
---|---|
State | Deferred |
Headers | show |
On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote: > +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num) > +{ > + return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET; > +} Does the mlx drivers really register ports with different capabilities as the same ib_device? I'm not sure that should be allowed. I keep talking about how we need to get rid of the port_num in these sorts of places because it makes no sense... Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote: > > > +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 > port_num) > > +{ > > + return device->port_immutable[port_num].core_cap_flags & > RDMA_CORE_CAP_PROT_RAW_PACKET; > > +} > > Does the mlx drivers really register ports with different capabilities > as the same ib_device? I'm not sure that should be allowed. > > I keep talking about how we need to get rid of the port_num in these > sorts of places because it makes no sense... > I agree. Requiring the port number has implications that ripple up into the rdma-rw api as well... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Nov 28, 2016 at 7:00 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote: > >> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num) >> +{ >> + return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET; >> +} > > Does the mlx drivers really register ports with different capabilities > as the same ib_device? I'm not sure that should be allowed. mlx4 yeah (practically for the last ~10 years) for instance Eth ports don't support SMI -- this goes back to the fact that mlx4 devices are single PCI function with potentially two ports and each port can be set to different link layer. But this no more holds for mlx5, these devices are function-per-port and hence IB device per port. I guess we have to swallow that pill and move on as newer devices don't have this behavior, okay? Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Nov 28, 2016 at 10:57:00PM +0200, Or Gerlitz wrote: > On Mon, Nov 28, 2016 at 7:00 PM, Jason Gunthorpe > <jgunthorpe@obsidianresearch.com> wrote: > > On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote: > > > >> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num) > >> +{ > >> + return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET; > >> +} > > > > Does the mlx drivers really register ports with different capabilities > > as the same ib_device? I'm not sure that should be allowed. > > mlx4 yeah (practically for the last ~10 years) for instance Eth ports > don't support SMI -- this goes back to the fact that mlx4 devices are > single PCI function with potentially two ports and each port can be struct ib_device is not linked to a PCI device. AFAIK mlx4 created one ib_device for rocee ports and one for IB, or at least it should or things are already broken. > set to different link layer. But this no more holds for mlx5, these > devices are function-per-port and hence IB device per port. Since it has nothing to do with pci devices, please do this properly. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 29, 2016 at 12:25 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Mon, Nov 28, 2016 at 10:57:00PM +0200, Or Gerlitz wrote: >> On Mon, Nov 28, 2016 at 7:00 PM, Jason Gunthorpe >> <jgunthorpe@obsidianresearch.com> wrote: >> > Does the mlx drivers really register ports with different capabilities >> > as the same ib_device? I'm not sure that should be allowed. >> mlx4 yeah (practically for the last ~10 years) for instance Eth ports >> don't support SMI -- this goes back to the fact that mlx4 devices are >> single PCI function with potentially two ports and each port can be > struct ib_device is not linked to a PCI device. AFAIK mlx4 created one > ib_device for rocee ports and one for IB, no, it doesn't > or at least it should or things are already broken. Can you elaborate what it broken with mlx4? I suspect we even have down there some functionality which depends on that, but again, I 1st would like to hear if/what is broken - I copied the maintainer. >> set to different link layer. But this no more holds for mlx5, these >> devices are function-per-port and hence IB device per port. > Since it has nothing to do with pci devices, please do this properly. again, mlx5 does this already. Jason, patches 8-10 which carry the functional change I want to introduce (allow mlx5 IB devices to be created when RoCE is not supported) stand for themselves. As I wrote, the stack is fully functional (i.e no error in the IB core etc) when only these patches are put. E.g things behave in a similar manner to all the upstream iWARP drivers that refuse to create any QP which is not RC. I am okay to put aside patches 1-7 which I added per Doug request for user-space applications to be able and query what QPs are supports on a device, or to get in patches 1-7, whatever works better for Doug and ppl. I don't think it's fair to ask a re-write of the 10y old IB device-ing of things done by mlx4 just to be able and introduce this reduced functionally (raw packet qp only) of mlx5 devices. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 29, 2016 at 08:35:52AM +0200, Or Gerlitz wrote: > > or at least it should or things are already broken. > > Can you elaborate what it broken with mlx4? I suspect we > even have down there some functionality which depends on that, > but again, I 1st would like to hear if/what is broken - I copied the maintainer. The semantic we require is that everything under a struct ib_device is compatible, you can use an AH on any PD with any QP on any port. As an example wee absolutely do not allow iwarp and ib on the same ib_device, (rdma_cm will break horribly, at least). The fact rocee and ib sort of work together as-is represents a fluke not a design goal :( > Jason, patches 8-10 which carry the functional change I want to introduce > (allow mlx5 IB devices to be created when RoCE is not supported) stand > for themselves. I asked for the port_num to be removed, and the uapi for it to be device not port specific. Don't see why that is such a big deal.. > a re-write of the 10y old IB device-ing of things done by mlx4 just > to be able and introduce this reduced functionally (raw packet qp > only) of mlx5 devices. Seems totally fair to me for patches adding a new uapi. Why do you guys keep thinking uapis should be easy? Who else is going to fix this? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/28/2016 12:08 PM, Steve Wise wrote: >> >> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote: >> >>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, > u8 >> port_num) >>> +{ >>> + return device->port_immutable[port_num].core_cap_flags & >> RDMA_CORE_CAP_PROT_RAW_PACKET; >>> +} >> >> Does the mlx drivers really register ports with different capabilities >> as the same ib_device? I'm not sure that should be allowed. >> >> I keep talking about how we need to get rid of the port_num in these >> sorts of places because it makes no sense... >> > > I agree. Requiring the port number has implications that ripple up into the > rdma-rw api as well... > > In all fairness, there is no requirement that any two ports on the same device be the same link layer, or if the link layer is Ethernet, there is no requirement that they can't support both iWARP and RoCE. The idea that the parent device defined the supported protocols for all ports of a device became wrong with the first mlx4 device that could do both IB and Ethernet. And I think I've heard rumblings of a combined RoCE/iWARP device possibly in the future from someone else.
On 11/29/2016 9:07 PM, Doug Ledford wrote: > On 11/28/2016 12:08 PM, Steve Wise wrote: >>> >>> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote: >>> >>>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, >> u8 >>> port_num) >>>> +{ >>>> + return device->port_immutable[port_num].core_cap_flags & >>> RDMA_CORE_CAP_PROT_RAW_PACKET; >>>> +} >>> >>> Does the mlx drivers really register ports with different capabilities >>> as the same ib_device? I'm not sure that should be allowed. >>> >>> I keep talking about how we need to get rid of the port_num in these >>> sorts of places because it makes no sense... >>> >> >> I agree. Requiring the port number has implications that ripple up into the >> rdma-rw api as well... >> >> > > In all fairness, there is no requirement that any two ports on the same > device be the same link layer, or if the link layer is Ethernet, there > is no requirement that they can't support both iWARP and RoCE. The idea > that the parent device defined the supported protocols for all ports of > a device became wrong with the first mlx4 device that could do both IB > and Ethernet. And I think I've heard rumblings of a combined RoCE/iWARP > device possibly in the future from someone else. This one for instance? http://www.qlogic.com/Resources/Documents/DataSheets/Adapters/DataSheet_QLE45211HL_QLE45212HL.pdf I'd love to see any such device support protocol choice per connection, not just per port. That of course would have implications on the rdma commection manager api. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/29/2016 9:33 PM, Tom Talpey wrote: > On 11/29/2016 9:07 PM, Doug Ledford wrote: >> On 11/28/2016 12:08 PM, Steve Wise wrote: >>>> >>>> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote: >>>> >>>>> +static inline bool rdma_protocol_raw_packet(const struct ib_device >>>>> *device, >>> u8 >>>> port_num) >>>>> +{ >>>>> + return device->port_immutable[port_num].core_cap_flags & >>>> RDMA_CORE_CAP_PROT_RAW_PACKET; >>>>> +} >>>> >>>> Does the mlx drivers really register ports with different capabilities >>>> as the same ib_device? I'm not sure that should be allowed. >>>> >>>> I keep talking about how we need to get rid of the port_num in these >>>> sorts of places because it makes no sense... >>>> >>> >>> I agree. Requiring the port number has implications that ripple up >>> into the >>> rdma-rw api as well... >>> >>> >> >> In all fairness, there is no requirement that any two ports on the same >> device be the same link layer, or if the link layer is Ethernet, there >> is no requirement that they can't support both iWARP and RoCE. The idea >> that the parent device defined the supported protocols for all ports of >> a device became wrong with the first mlx4 device that could do both IB >> and Ethernet. And I think I've heard rumblings of a combined RoCE/iWARP >> device possibly in the future from someone else. > > This one for instance? > > http://www.qlogic.com/Resources/Documents/DataSheets/Adapters/DataSheet_QLE45211HL_QLE45212HL.pdf > > > I'd love to see any such device support protocol choice per > connection, not just per port. That of course would have > implications on the rdma commection manager api. > That's certainly a prime example, thanks Tom ;-)
> In all fairness, there is no requirement that any two ports on the same > device be the same link layer, or if the link layer is Ethernet, there > is no requirement that they can't support both iWARP and RoCE. The > idea > that the parent device defined the supported protocols for all ports of > a device became wrong with the first mlx4 device that could do both IB > and Ethernet. And I think I've heard rumblings of a combined > RoCE/iWARP > device possibly in the future from someone else. It would help if the community didn't continually redefine terms based on the latest set of patches or whims or random hardware feature. At one time an ib_device meant an actual IB device - go figure. Now it's not even a device, but some abstract weirdness collection ports that all support the same transport, or was it link layer, or ... I really have no idea now. The RDMA subsystem really needs to figure out what it wants to be, because even the term RDMA doesn't even apply to all of the devices that it supports. And now we're at the point of arguing over where drivers should go because no one even knows that anymore. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma- > owner@vger.kernel.org] On Behalf Of Tom Talpey > > > > In all fairness, there is no requirement that any two ports on the > > same device be the same link layer, or if the link layer is Ethernet, > > there is no requirement that they can't support both iWARP and RoCE. > > The idea that the parent device defined the supported protocols for > > all ports of a device became wrong with the first mlx4 device that > > could do both IB and Ethernet. And I think I've heard rumblings of a > > combined RoCE/iWARP device possibly in the future from someone else. > > This one for instance? > > http://www.qlogic.com/Resources/Documents/DataSheets/Adapters/DataShee > t_QLE45211HL_QLE45212HL.pdf > > I'd love to see any such device support protocol choice per connection, not just > per port. That of course would have implications on the rdma commection > manager api. > Exactly. If/when such devices appear, we would need to extend connection management to specify the protocol, rather than infer it from the port space. It would be perfectly sensible to use both RoCE and iWARP over the same physical Ethernet port and the same source IP address. Rethinking about the uAPI, maybe we should report a protocol bit-mask similar to the kernel's, instead of QP types? This would provide all the required information (e.g., any combination of RoCEv1/v2, iWARP, and Raw Ethernet for Ethernet links) for today's use-cases as well as tomorrow's combined RoCE/iWARP devices. --Liran -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 29, 2016 at 09:07:52PM -0500, Doug Ledford wrote: > On 11/28/2016 12:08 PM, Steve Wise wrote: > >> > >> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote: > >> > >>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, > > u8 > >> port_num) > >>> +{ > >>> + return device->port_immutable[port_num].core_cap_flags & > >> RDMA_CORE_CAP_PROT_RAW_PACKET; > >>> +} > >> > >> Does the mlx drivers really register ports with different capabilities > >> as the same ib_device? I'm not sure that should be allowed. > >> > >> I keep talking about how we need to get rid of the port_num in these > >> sorts of places because it makes no sense... > >> > > > > I agree. Requiring the port number has implications that ripple up into the > > rdma-rw api as well... > > > > > > In all fairness, there is no requirement that any two ports on the same > device be the same link layer, or if the link layer is Ethernet, there > is no requirement that they can't support both iWARP and RoCE. There actually is a requirement. The RDMA CM hard requires all ports be iWARP or !iWARP at least. I'm sure there are other subtle things floating around. There are also things that become very confusing for user space, and we don't have the infrastructure to support, if ports can switch configurations on the fly. The simplest, approach, most in line with how verbs was designed, is to require each ib_device to have a single kind of AH. > The idea that the parent device defined the supported protocols for > all ports of a device became wrong with the first mlx4 device that Arguably it was sort of OK for roceev1, is less OK for v2, but shouldn't have been done anyhow. The uapi question here is do we want to double down and try and make this work (and what does that even *mean*) or admit mlx4 was an error and stop doing that going forward.. Or do something else? eg Specifying the AH type when creating the PD could potentially solve some of the problems... > could do both IB and Ethernet. And I think I've heard rumblings of > a combined RoCE/iWARP device possibly in the future from someone > else. Two struct ib_devices for the same port then... Certainly ugly. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 30, 2016 at 04:30:09PM +0000, Liran Liss wrote: > > I'd love to see any such device support protocol choice per connection, not just > > per port. That of course would have implications on the rdma commection > > manager api. > Exactly. If/when such devices appear, we would need to extend > connection management to specify the protocol, rather than infer it > from the port space. We support that perfectly today as long as the port creates two 'struct ib_devices'. Anything else will require some kind of changes to libibverb's API to specify the AH style. > Rethinking about the uAPI, maybe we should report a protocol > bit-mask similar to the kernel's, instead of QP types? This would > provide all the required information (e.g., any combination of > RoCEv1/v2, iWARP, and Raw Ethernet for Ethernet links) for today's > use-cases as well as tomorrow's combined RoCE/iWARP devices. Maybe we should dump this uapi stuff until Matan's patches are done. The introspection possible with Matan's work is flexible enough to cope with more cases.. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/30/2016 6:39 PM, Jason Gunthorpe wrote: > Maybe we should dump this uapi stuff until Matan's patches are > done. The introspection possible with Matan's work is flexible enough > to cope with more cases.. Basically I am OKay with that approach too. Doug, if you are willing to take the mlx5 patches that enable the feature of mlx5 device over Eth port that doesn't support RoCE (8,9,10 - I will have to do some rebasing) and discuss the query when the new ABI code is getting closer to be upstream that's fine too. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com] > > > Exactly. If/when such devices appear, we would need to extend > > connection management to specify the protocol, rather than infer it > > from the port space. > > We support that perfectly today as long as the port creates two 'struct > ib_devices'. Anything else will require some kind of changes to libibverb's API to > specify the AH style. > rdmacm would still have to choose between these ib_devices somehow - this doesn't change anything. Also, AHs are already port-specific. So, I don't see any issue in this regard. In any case, we have millions of multi-port devices that can use different link types deployed. This is the specification, and more such devices could appear in the future. We cannot change the device model. > > Rethinking about the uAPI, maybe we should report a protocol bit-mask > > similar to the kernel's, instead of QP types? This would provide all > > the required information (e.g., any combination of RoCEv1/v2, iWARP, > > and Raw Ethernet for Ethernet links) for today's use-cases as well as > > tomorrow's combined RoCE/iWARP devices. > > Maybe we should dump this uapi stuff until Matan's patches are done. The > introspection possible with Matan's work is flexible enough to cope with more > cases.. No doubt that the new ABI would be a lot more flexible and self-describing. But it would take a while until we port everything to use it. So, generally, I don't see any problem using the current extensibility capabilities to support useful semantics. > > Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 30, 2016 at 05:01:32PM +0000, Liran Liss wrote: > > From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com] > > > > > > Exactly. If/when such devices appear, we would need to extend > > > connection management to specify the protocol, rather than infer it > > > from the port space. > > > > We support that perfectly today as long as the port creates two 'struct > > ib_devices'. Anything else will require some kind of changes to libibverb's API to > > specify the AH style. > > > > rdmacm would still have to choose between these ib_devices somehow Each ib_device is either iwarp or rocee, the rdma cm would route iwarp stuff to the iwarp one and rocee stuff to the rocee one. Not really a problem with today's architecture. > - this doesn't change anything. Also, AHs are already > port-specific. So, I don't see any issue in this regard. The current scheme infers the protocol of the AH from the current configuration of the port, which is a crazy API when the port's protocol can change on the fly. > In any case, we have millions of multi-port devices that can use > different link types deployed. This is the specification, and more > such devices could appear in the future. We cannot change the > device model. Of course we can change how they are modeled in Linux, it is just software. > No doubt that the new ABI would be a lot more flexible and > self-describing. But it would take a while until we port everything > to use it. So, generally, I don't see any problem using the current > extensibility capabilities to support useful semantics. Perhaps a moritorium on some changes to the current uAPI will encourage the new one to get finished :P Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > - this doesn't change anything. Also, AHs are already > > port-specific. So, I don't see any issue in this regard. > > The current scheme infers the protocol of the AH from the current > configuration of the port, which is a crazy API when the port's > protocol can change on the fly. Maybe the solution is to make the protocol selection explicit throughout the APIs and associate it with a QP, rather than attempting to list all transport protocols that a port can support. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > > - this doesn't change anything. Also, AHs are already > > > port-specific. So, I don't see any issue in this regard. > > > > The current scheme infers the protocol of the AH from the current > > configuration of the port, which is a crazy API when the port's > > protocol can change on the fly. > > Maybe the solution is to make the protocol selection explicit throughout the APIs > and associate it with a QP, rather than attempting to list all transport protocols that > a port can support. Do you mean requiring the application to pick the protocol? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > > - this doesn't change anything. Also, AHs are already > > > > port-specific. So, I don't see any issue in this regard. > > > > > > The current scheme infers the protocol of the AH from the current > > > configuration of the port, which is a crazy API when the port's > > > protocol can change on the fly. > > > > Maybe the solution is to make the protocol selection explicit > throughout the > APIs > > and associate it with a QP, rather than attempting to list all > transport > protocols that > > a port can support. > > Do you mean requiring the application to pick the protocol? Yes - it seems necessary to support devices with RoCE and iWarp running on the same port. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 30, 2016 at 05:25:18PM +0000, Hefty, Sean wrote: > > > - this doesn't change anything. Also, AHs are already > > > port-specific. So, I don't see any issue in this regard. > > > > The current scheme infers the protocol of the AH from the current > > configuration of the port, which is a crazy API when the port's > > protocol can change on the fly. > > Maybe the solution is to make the protocol selection explicit > throughout the APIs and associate it with a QP, rather than > attempting to list all transport protocols that a port can support. AH's are linked to a PD, not a QP.. If we had to do it again, a PD centric approach would be more sensible: // Create a PD on 'port' using ah format 'protocol' pd = ibv_pd_create(port, enum ah_type protocol); // Enable APM or resource sharing on the PD across two ports ibv_pd_add_port(pd, alt_port); And get rid of the multi-port ib_device concept entirely. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Maybe the solution is to make the protocol selection explicit > > throughout the APIs and associate it with a QP, rather than > > attempting to list all transport protocols that a port can support. > > AH's are linked to a PD, not a QP.. But protocols are associated with QPs. I'm not sure *why* an AH is linked to a PD. Conceptually the association seems unnecessary. > If we had to do it again, a PD centric approach would be more > sensible: > > // Create a PD on 'port' using ah format 'protocol' > pd = ibv_pd_create(port, enum ah_type protocol); > > // Enable APM or resource sharing on the PD across two ports > ibv_pd_add_port(pd, alt_port); > > And get rid of the multi-port ib_device concept entirely. I think no matter what option is chosen, some limitation may end up being placed on how a device is used that may not map to hardware limits. I'm personally fine with that, but we need to define the right level of abstraction (and pick the right terms to describe them). -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 30, 2016 at 06:17:34PM +0000, Hefty, Sean wrote: > > > Maybe the solution is to make the protocol selection explicit > > > throughout the APIs and associate it with a QP, rather than > > > attempting to list all transport protocols that a port can support. > > > > AH's are linked to a PD, not a QP.. > > But protocols are associated with QPs. I'm not sure *why* an AH is > linked to a PD. Conceptually the association seems unnecessary. The AH has to be linked to the PD because the PD specifies the hardware target and the AH is a hardware object. All objects must be traced back to a PD. It is linked to a PD not a QP because the AH can be shared across all QPs, which is useful for UD applications.. The AH is really similar to the ethernet layer in an IP stack, and the QP is akin the TCP layer. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > But protocols are associated with QPs. I'm not sure *why* an AH is > > linked to a PD. Conceptually the association seems unnecessary. > > The AH has to be linked to the PD because the PD specifies the > hardware target and the AH is a hardware object. All objects must be > traced back to a PD. It is linked to a PD not a QP because the AH can > be shared across all QPs, which is useful for UD applications.. > > The AH is really similar to the ethernet layer in an IP stack, and the > QP > is akin the TCP layer. I always viewed the AH as specifying the route to the destination port. Locally, I still don't see why it couldn't work across the device, like the CQ does. Associating it with a PD just seems to limit which QPs, and by association the memory buffers, it can be used with. So, I'm still not seeing the reason for the linkage... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 30, 2016 at 06:46:13PM +0000, Hefty, Sean wrote: > > > But protocols are associated with QPs. I'm not sure *why* an AH is > > > linked to a PD. Conceptually the association seems unnecessary. > > > > The AH has to be linked to the PD because the PD specifies the > > hardware target and the AH is a hardware object. All objects must be > > traced back to a PD. It is linked to a PD not a QP because the AH can > > be shared across all QPs, which is useful for UD applications.. > > > > The AH is really similar to the ethernet layer in an IP stack, and the > > QP > > is akin the TCP layer. > > I always viewed the AH as specifying the route to the destination > port. Locally, I still don't see why it couldn't work across the > device, like the CQ does. Associating it with a PD just seems to > limit which QPs, and by association the memory buffers, it can be > used with. So, I'm still not seeing the reason for the linkage... Oh I see what you mean, I forgot about that - yes, I don't see any reason why the AH is linked to the PD not the ibv_context.. Perhaps it is an artifact of the thinking when applying this to the kernel where each ULP would allocate resources to the PD and then destroy the PD when the connection is done? So per ulp resources were charged to the PD.. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/30/2016 11:36 AM, Jason Gunthorpe wrote: > On Tue, Nov 29, 2016 at 09:07:52PM -0500, Doug Ledford wrote: >> On 11/28/2016 12:08 PM, Steve Wise wrote: >>>> >>>> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote: >>>> >>>>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, >>> u8 >>>> port_num) >>>>> +{ >>>>> + return device->port_immutable[port_num].core_cap_flags & >>>> RDMA_CORE_CAP_PROT_RAW_PACKET; >>>>> +} >>>> >>>> Does the mlx drivers really register ports with different capabilities >>>> as the same ib_device? I'm not sure that should be allowed. >>>> >>>> I keep talking about how we need to get rid of the port_num in these >>>> sorts of places because it makes no sense... >>>> >>> >>> I agree. Requiring the port number has implications that ripple up into the >>> rdma-rw api as well... >>> >>> >> >> In all fairness, there is no requirement that any two ports on the same >> device be the same link layer, or if the link layer is Ethernet, there >> is no requirement that they can't support both iWARP and RoCE. > > There actually is a requirement. The RDMA CM hard requires all ports > be iWARP or !iWARP at least. I'm sure there are other subtle things > floating around. > > There are also things that become very confusing for user space, and > we don't have the infrastructure to support, if ports can switch > configurations on the fly. > > The simplest, approach, most in line with how verbs was designed, is > to require each ib_device to have a single kind of AH. Sorry, we're conflating two separate things here. I was merely referring to the hardware. Not our implementation of a stack. I was attempting to point out that our stack implementation is placing artificial restrictions on the hardware that the hardware does not share. >> The idea that the parent device defined the supported protocols for >> all ports of a device became wrong with the first mlx4 device that > > Arguably it was sort of OK for roceev1, is less OK for v2, but > shouldn't have been done anyhow. > > The uapi question here is do we want to double down and try and make > this work (and what does that even *mean*) or admit mlx4 was an error > and stop doing that going forward.. > > Or do something else? eg Specifying the AH type when creating the PD > could potentially solve some of the problems... > >> could do both IB and Ethernet. And I think I've heard rumblings of >> a combined RoCE/iWARP device possibly in the future from someone >> else. > > Two struct ib_devices for the same port then... Certainly ugly. I'm not entirely sure I agree....I would have to think about it more, but the underlying problem I'm concerned with is exactly what I point out above: what restrictions would we be placing on the hardware that are artificial and a result of our stack that the hardware itself does not share? Could the hardware support automatic migration from iWARP to RoCE or vice versa if both endpoints supported it? Would that work if we required two separate ib_devices? Things like that.
> I'm not entirely sure I agree....I would have to think about it more, > but the underlying problem I'm concerned with is exactly what I point > out above: what restrictions would we be placing on the hardware that > are artificial and a result of our stack that the hardware itself does > not share? Could the hardware support automatic migration from iWARP > to > RoCE or vice versa if both endpoints supported it? Would that work if > we required two separate ib_devices? Things like that. IMO, in order to create a usable software interface, we will end up defining limitations. The alternative seems to be an endless series of capability bits that expands to millions of combinations, and in the end still won't capture everything. Consider that the existing interfaces already limit what upstream drivers can expose. The primary reason for exposing devices is for apps to associate hardware resources (e.g. CQ and QP) with each other. With software based implementations of roce/iwarp, a device is basically the system. If you connect NICs directly to the CPU, then even separate hardware devices could share state, which could complicate things even more. I'm not sure about Jason's idea of re-defining the PD as a 'protocol domain' (my term), but I think we should seriously consider something similar as an addition to the <device, port> model. I'm still of the opinion that protocol is a property of a QP, not a port. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/30/2016 12:30 PM, Hefty, Sean wrote: >>>>> - this doesn't change anything. Also, AHs are already >>>>> port-specific. So, I don't see any issue in this regard. >>>> >>>> The current scheme infers the protocol of the AH from the current >>>> configuration of the port, which is a crazy API when the port's >>>> protocol can change on the fly. >>> >>> Maybe the solution is to make the protocol selection explicit >> throughout the >> APIs >>> and associate it with a QP, rather than attempting to list all >> transport >> protocols that >>> a port can support. >> >> Do you mean requiring the application to pick the protocol? > > Yes - it seems necessary to support devices with RoCE and iWarp running on the same port. Sockets require this, the API requires a protocol family (PF_xxx) and socket type (SOCK_xxx) that direct the endpoint to become TCP, UDP, etc. On the other hand, RDMA APIs to date have considered RDMA to be the transport, and therefore they have hidden the underlying protocol. You just request an "RDMA" endpoint from a specific adapter, and get what it returns. I'd agree that the application picks the protocol, but I'd also argue that the API should have a "wildcard" mode which allows the current behavior. It could be a simple extension to just add a selector parameter. The tricky part is what to do with passive endpoints. If a wildcard is specified, should a multiprotocol adapter create listens on all protocols? Also, what of protocols which auto-negotiate? It's my understanding that some RoCE adapters will attempt to autodetect whether their peer is RoCEv1 or RoCEv2 capable, and adjust their protocol to suit. Tom. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> I'd agree that the application picks the protocol, but I'd also > argue that the API should have a "wildcard" mode which allows the > current behavior. It could be a simple extension to just add a > selector parameter. I agree. The QP type can be viewed as the socket type, we only need to add a protocol field to go along with it. I.e. the protocol should be a qp attribute provided on input to qp create. > The tricky part is what to do with passive endpoints. If a wildcard > is specified, should a multiprotocol adapter create listens on all > protocols? Also, what of protocols which auto-negotiate? It's my > understanding that some RoCE adapters will attempt to autodetect > whether their peer is RoCEv1 or RoCEv2 capable, and adjust their > protocol to suit. We could include the protocol selection, also with wildcard support, here as well. The rdma cm may be able to handle this, so that drivers won't need to deal with a wildcard endpoint. Assuming that we don't end up with some devices that require they implement wildcard support, while others can't... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma- > owner@vger.kernel.org] On Behalf Of Hefty, Sean > > > I'd agree that the application picks the protocol, but I'd also argue > > that the API should have a "wildcard" mode which allows the current > > behavior. It could be a simple extension to just add a selector > > parameter. > > I agree. The QP type can be viewed as the socket type, we only need to add a > protocol field to go along with it. I.e. the protocol should be a qp attribute > provided on input to qp create. > Right. > > The tricky part is what to do with passive endpoints. If a wildcard is > > specified, should a multiprotocol adapter create listens on all > > protocols? Also, what of protocols which auto-negotiate? It's my > > understanding that some RoCE adapters will attempt to autodetect > > whether their peer is RoCEv1 or RoCEv2 capable, and adjust their > > protocol to suit. > > We could include the protocol selection, also with wildcard support, here as > well. The rdma cm may be able to handle this, so that drivers won't need to deal > with a wildcard endpoint. Assuming that we don't end up with some devices > that require they implement wildcard support, while others can't... Typically, the application will request the protocol that it wants or leave it unspecified. In that case, I think that the rdmacm would select the device default. Anyway, returning to the initial matter at hand: I would like to start with each port reporting a bit mask of the supported protocols on that link (RoCE v1/v2, Raw Ethernet, iWARP, etc.) It will be used for reporting device capabilities in general for tools, as well as by applications that don't use rdmacm. --Liran -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Dec 04, 2016 at 08:38:14PM +0000, Liran Liss wrote: > Anyway, returning to the initial matter at hand: I would like to > start with each port reporting a bit mask of the supported protocols > on that link (RoCE v1/v2, Raw Ethernet, iWARP, etc.) It will be > used for reporting device capabilities in general for tools, as well > as by applications that don't use rdmacm. Why don't we start by defining how it is supposed to even work and how to fix the RDMA CM before adding even more random capability bits? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com] > > > Anyway, returning to the initial matter at hand: I would like to start > > with each port reporting a bit mask of the supported protocols on that > > link (RoCE v1/v2, Raw Ethernet, iWARP, etc.) It will be used for > > reporting device capabilities in general for tools, as well as by > > applications that don't use rdmacm. > > Why don't we start by defining how it is supposed to even work and how to fix > the RDMA CM before adding even more random capability bits? > > Jason Extensions to RDMA CM is an important topic, which we can continue to discuss as devices that require it are introduced. Protocol capabilities are needed for information purposes as well as for applications that do not use RDMA CM. They are not "random", but depict what a device can do. --Liran -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Dec 5, 2016 at 10:14 PM, Liran Liss <liranl@mellanox.com> wrote: >> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com] >>> Anyway, returning to the initial matter at hand: I would like to start >>> with each port reporting a bit mask of the supported protocols on that >>> link (RoCE v1/v2, Raw Ethernet, iWARP, etc.) It will be used for >>> reporting device capabilities in general for tools, as well as by >>> applications that don't use rdmacm. >> Why don't we start by defining how it is supposed to even work and how to fix >> the RDMA CM before adding even more random capability bits? > Extensions to RDMA CM is an important topic, which we can continue to discuss as devices that require it are introduced. > Protocol capabilities are needed for information purposes as well as for applications that do not use RDMA CM. > They are not "random", but depict what a device can do. Doug, Can we get a maintainer say here? I didn't see complaints on the mlx5 IB driver patch that opens IB device also in the lack of RoCE support. We have a valid real life use-case which needs this driver patch. Per your request, I made also some core patch to expose QPTs to user-space and we went into this long discussion. Do you want the change along what was proposed by Liran (allow user-space to query the supported protocols per port) or you are okay with taking only the mlx5 patches for now and discuss the rest later or something else? Please let us know Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Dec 06, 2016 at 11:26:51PM +0200, Or Gerlitz wrote: > Do you want the change along what was proposed by Liran (allow > user-space to query the supported protocols per port) or you are > okay Why doesn't mellanox come up with a plan to actually make all of this work in some sensible way? So far only Mellanox needs this. I've already proposed disallowing multiprotocol struct ib_devices. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> I've already proposed disallowing multiprotocol struct ib_devices.
My preference is to discontinue attempts at associating a protocol with the device. A device could implement a dozen protocols in software. Transports belong to QPs or cm ids, not devices. Each rdma_cm_id should be associated with a specific cm/transport directly, rather than indirectly selecting one based on the bound <device, port>.
If an app wants a specific transport type for a QP, why doesn't it just try to open one and see if the call fails?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/6/2016 5:13 PM, Hefty, Sean wrote: >> I've already proposed disallowing multiprotocol struct ib_devices. > > My preference is to discontinue attempts at associating a protocol with the device. A device could implement a dozen protocols in software. Transports belong to QPs or cm ids, not devices. Each rdma_cm_id should be associated with a specific cm/transport directly, rather than indirectly selecting one based on the bound <device, port>. > > If an app wants a specific transport type for a QP, why doesn't it just try to open one and see if the call fails? I tend to agree with Sean on this. And to answer Or's question from a couple emails back, I'm inclined to just take the last three patches for now while we work out a better idea of how everything here should be done on the first seven patches.
On Wed, Dec 07, 2016 at 05:06:24PM -0500, Doug Ledford wrote: > On 12/6/2016 5:13 PM, Hefty, Sean wrote: > >> I've already proposed disallowing multiprotocol struct ib_devices. > > > > My preference is to discontinue attempts at associating a protocol with the device. A device could implement a dozen protocols in software. Transports belong to QPs or cm ids, not devices. Each rdma_cm_id should be associated with a specific cm/transport directly, rather than indirectly selecting one based on the bound <device, port>. > > > > If an app wants a specific transport type for a QP, why doesn't it just try to open one and see if the call fails? > > I tend to agree with Sean on this. And to answer Or's question from a > couple emails back, I'm inclined to just take the last three patches for > now while we work out a better idea of how everything here should be > done on the first seven patches. Well in that case, we should work toward getting rid of 'struct ib_device' - it is the cause of so much of this trouble. A port-focused model like the netstack uses is overall saner.. We can figure out some other way to model what ports can be shared within a PD, ideally something that works better for rxe.. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com] > Sent: Thursday, December 08, 2016 12:41 AM > > > > > > My preference is to discontinue attempts at associating a protocol with the > device. A device could implement a dozen protocols in software. Transports > belong to QPs or cm ids, not devices. Each rdma_cm_id should be associated > with a specific cm/transport directly, rather than indirectly selecting one based > on the bound <device, port>. > > > > > > If an app wants a specific transport type for a QP, why doesn't it just try to > open one and see if the call fails? > > > > I tend to agree with Sean on this. And to answer Or's question from a > > couple emails back, I'm inclined to just take the last three patches > > for now while we work out a better idea of how everything here should > > be done on the first seven patches. > Yep, it's a good start. > Well in that case, we should work toward getting rid of 'struct ib_device' - it is > the cause of so much of this trouble. A port-focused model like the netstack uses > is overall saner.. > We are discussing 2 separate issues: (1) Whether a device can support multiple protocols (2) Multi-port devices, optionally with different link types. I think that we all agree that (1) will be needed for future devices, and that different QPs might be running different protocols. This means that supported protocols need to be device capabilities, as we have today in the kernel. There are several devices that implement (2), which follows directly from the specification. The model for these devices is not going to change. There is a question on where to report protocol capabilities - at the device or port level. This is regardless of how the kernel CMA will resolve which device and port will service a connection - it's a matter of transparency and management. Doing so at the device level will require apps to know which protocols run on which link type. Doing so at the port level will be more obvious to apps, but won't be a common use case (assuming most apps will use rdmacm, which will do the protocol resolution job for them in the kernel). Doing so at the device level seems to be more aligned with how most future devices would be implemented. However, I won't rule out multi-port devices so soon. When the transport is offloaded to HW, it makes sense to conduct HA between ports within the domain of a single device. In any case, this has nothing to do with 'struct ib_device', which represents a device that handles packet processing internally and exposes transport endpoints to applications. --Liran -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 08, 2016 at 09:29:32AM +0000, Liran Liss wrote: > > Well in that case, we should work toward getting rid of 'struct ib_device' - it is > > the cause of so much of this trouble. A port-focused model like the netstack uses > > is overall saner.. > > > > We are discussing 2 separate issues: > (1) Whether a device can support multiple protocols > (2) Multi-port devices, optionally with different link types. > > I think that we all agree that (1) will be needed for future > devices, We all agree future hardware devices will support multiple protocols, but I don't think there is any agreement so far on how to model that, or what 'struct ib_device' means in such a world. As I said, I'd like to get rid of it if we use a QP centric model to solve this problem. > and that different QPs might be running different protocols. This > means that supported protocols need to be device capabilities, No, they are QP capabilities, add new APIs to work with and query QP related things and stop using the device for anything. > have today in the kernel. There are several devices that implement > (2), which follows directly from the specification. The model for > these devices is not going to change. Again, no, the specification for multiport devices does not contemplate different addressing formats, so the one driver that did this went off spec in a badly thought out manner and left us this mess. > kernel). Doing so at the device level seems to be more aligned with > how most future devices would be implemented. However, I won't rule > out multi-port devices so soon. When the transport is offloaded to > HW, it makes sense to conduct HA between ports within the domain of > a single device. It is a QP thing if we go down Sean's suggestion, so add QP related queries to get the information. I have no idea how to model HA in this world - presumably when you add an AH to a QP it will refuse to do it if the implied HA model isn't supported. But, I'm not sure how well the QP model works when talking about HA/APM.. > In any case, this has nothing to do with 'struct ib_device', which > represents a device that handles packet processing internally and > exposes transport endpoints to applications. We need to make sense of what 'struct ib_device' is to figure out how to solve these problems because we have multiple different definitions running around concurrnatly in the stack - that is why it doesn't work today to have more than one protocol on a struct ib_device. So, yes, it actually has everything to do with 'struct ib_device'.. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 5ad43a4..0cb3194 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -485,6 +485,7 @@ static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct( #define RDMA_CORE_CAP_PROT_ROCE 0x00200000 #define RDMA_CORE_CAP_PROT_IWARP 0x00400000 #define RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP 0x00800000 +#define RDMA_CORE_CAP_PROT_RAW_PACKET 0x01000000 #define RDMA_CORE_PORT_IBA_IB (RDMA_CORE_CAP_PROT_IB \ | RDMA_CORE_CAP_IB_MAD \ @@ -508,6 +509,8 @@ static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct( #define RDMA_CORE_PORT_INTEL_OPA (RDMA_CORE_PORT_IBA_IB \ | RDMA_CORE_CAP_OPA_MAD) +#define RDMA_CORE_PORT_RAW_PACKET (RDMA_CORE_CAP_PROT_RAW_PACKET) + struct ib_port_attr { u64 subnet_prefix; enum ib_port_state state; @@ -2286,6 +2289,11 @@ static inline bool rdma_ib_or_roce(const struct ib_device *device, u8 port_num) rdma_protocol_roce(device, port_num); } +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num) +{ + return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET; +} + /** * rdma_cap_ib_mad - Check if the port of a device supports Infiniband * Management Datagrams.