diff mbox

[rdma-next,01/10] IB/core: Add raw packet protocol

Message ID 1480258296-27032-2-git-send-email-leon@kernel.org (mailing list archive)
State Deferred
Headers show

Commit Message

Leon Romanovsky Nov. 27, 2016, 2:51 p.m. UTC
From: Or Gerlitz <ogerlitz@mellanox.com>

Define raw packet protocol which comes to denote this port supports
the RAW_PACKET QP type. To be used in downstream patches where the
IB core serves a query on the supported QP types for device/port.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
---
 include/rdma/ib_verbs.h | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Jason Gunthorpe Nov. 28, 2016, 5 p.m. UTC | #1
On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:

> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num)
> +{
> +	return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET;
> +}

Does the mlx drivers really register ports with different capabilities
as the same ib_device? I'm not sure that should be allowed.

I keep talking about how we need to get rid of the port_num in these
sorts of places because it makes no sense...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise Nov. 28, 2016, 5:08 p.m. UTC | #2
> 
> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
> 
> > +static inline bool rdma_protocol_raw_packet(const struct ib_device *device,
u8
> port_num)
> > +{
> > +	return device->port_immutable[port_num].core_cap_flags &
> RDMA_CORE_CAP_PROT_RAW_PACKET;
> > +}
> 
> Does the mlx drivers really register ports with different capabilities
> as the same ib_device? I'm not sure that should be allowed.
> 
> I keep talking about how we need to get rid of the port_num in these
> sorts of places because it makes no sense...
> 

I agree.   Requiring the port number has implications that ripple up into the
rdma-rw api as well...
 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz Nov. 28, 2016, 8:57 p.m. UTC | #3
On Mon, Nov 28, 2016 at 7:00 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
>
>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num)
>> +{
>> +     return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET;
>> +}
>
> Does the mlx drivers really register ports with different capabilities
> as the same ib_device? I'm not sure that should be allowed.

mlx4 yeah (practically for the last ~10 years) for instance Eth ports
don't support SMI -- this goes back to the fact that mlx4 devices are
single PCI function with potentially two ports and each port can be
set to different link layer. But this no more holds for mlx5, these
devices are function-per-port and hence IB device per port. I guess we
have to swallow that pill and move on as newer devices don't have this
behavior, okay?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Nov. 28, 2016, 10:25 p.m. UTC | #4
On Mon, Nov 28, 2016 at 10:57:00PM +0200, Or Gerlitz wrote:
> On Mon, Nov 28, 2016 at 7:00 PM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
> >
> >> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num)
> >> +{
> >> +     return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET;
> >> +}
> >
> > Does the mlx drivers really register ports with different capabilities
> > as the same ib_device? I'm not sure that should be allowed.
> 
> mlx4 yeah (practically for the last ~10 years) for instance Eth ports
> don't support SMI -- this goes back to the fact that mlx4 devices are
> single PCI function with potentially two ports and each port can be

struct ib_device is not linked to a PCI device. AFAIK mlx4 created one
ib_device for rocee ports and one for IB, or at least it should or
things are already broken.

> set to different link layer. But this no more holds for mlx5, these
> devices are function-per-port and hence IB device per port.

Since it has nothing to do with pci devices, please do this properly.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz Nov. 29, 2016, 6:35 a.m. UTC | #5
On Tue, Nov 29, 2016 at 12:25 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Nov 28, 2016 at 10:57:00PM +0200, Or Gerlitz wrote:
>> On Mon, Nov 28, 2016 at 7:00 PM, Jason Gunthorpe
>> <jgunthorpe@obsidianresearch.com> wrote:

>> > Does the mlx drivers really register ports with different capabilities
>> > as the same ib_device? I'm not sure that should be allowed.

>> mlx4 yeah (practically for the last ~10 years) for instance Eth ports
>> don't support SMI -- this goes back to the fact that mlx4 devices are
>> single PCI function with potentially two ports and each port can be

> struct ib_device is not linked to a PCI device. AFAIK mlx4 created one
> ib_device for rocee ports and one for IB,

no, it doesn't

> or at least it should or things are already broken.

Can you elaborate what it broken with mlx4? I suspect we
even have down there some functionality which depends on that,
but again, I 1st would like to hear if/what is broken - I copied the maintainer.

>> set to different link layer. But this no more holds for mlx5, these
>> devices are function-per-port and hence IB device per port.

> Since it has nothing to do with pci devices, please do this properly.

again, mlx5 does this already.

Jason, patches 8-10 which carry the functional change I want to introduce
(allow mlx5 IB devices to be created when RoCE is not supported) stand
for themselves.

As I wrote, the stack is fully functional  (i.e no error in the IB
core etc) when
only these patches are put. E.g things behave in a similar manner to all
the upstream iWARP drivers that refuse to create any QP which is not RC.

I am okay to put aside patches 1-7 which I added per Doug request for
user-space
applications to be able and query what QPs are supports on a device,
or to get in
patches 1-7, whatever works better for Doug and ppl. I don't think
it's fair to ask
a re-write of the 10y old IB device-ing of things done by mlx4 just to be able
and introduce this reduced functionally (raw packet qp only) of mlx5 devices.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Nov. 29, 2016, 4:19 p.m. UTC | #6
On Tue, Nov 29, 2016 at 08:35:52AM +0200, Or Gerlitz wrote:
> > or at least it should or things are already broken.
> 
> Can you elaborate what it broken with mlx4? I suspect we
> even have down there some functionality which depends on that,
> but again, I 1st would like to hear if/what is broken - I copied the maintainer.

The semantic we require is that everything under a struct ib_device is
compatible, you can use an AH on any PD with any QP on any port. As an
example wee absolutely do not allow iwarp and ib on the same
ib_device, (rdma_cm will break horribly, at least).

The fact rocee and ib sort of work together as-is represents a fluke
not a design goal :(

> Jason, patches 8-10 which carry the functional change I want to introduce
> (allow mlx5 IB devices to be created when RoCE is not supported) stand
> for themselves.

I asked for the port_num to be removed, and the uapi for it to be
device not port specific. Don't see why that is such a big deal..

> a re-write of the 10y old IB device-ing of things done by mlx4 just
> to be able and introduce this reduced functionally (raw packet qp
> only) of mlx5 devices.

Seems totally fair to me for patches adding a new uapi. Why do you
guys keep thinking uapis should be easy?

Who else is going to fix this?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford Nov. 30, 2016, 2:07 a.m. UTC | #7
On 11/28/2016 12:08 PM, Steve Wise wrote:
>>
>> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
>>
>>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device,
> u8
>> port_num)
>>> +{
>>> +	return device->port_immutable[port_num].core_cap_flags &
>> RDMA_CORE_CAP_PROT_RAW_PACKET;
>>> +}
>>
>> Does the mlx drivers really register ports with different capabilities
>> as the same ib_device? I'm not sure that should be allowed.
>>
>> I keep talking about how we need to get rid of the port_num in these
>> sorts of places because it makes no sense...
>>
> 
> I agree.   Requiring the port number has implications that ripple up into the
> rdma-rw api as well...
>  
> 

In all fairness, there is no requirement that any two ports on the same
device be the same link layer, or if the link layer is Ethernet, there
is no requirement that they can't support both iWARP and RoCE.  The idea
that the parent device defined the supported protocols for all ports of
a device became wrong with the first mlx4 device that could do both IB
and Ethernet.  And I think I've heard rumblings of a combined RoCE/iWARP
device possibly in the future from someone else.
Tom Talpey Nov. 30, 2016, 2:33 a.m. UTC | #8
On 11/29/2016 9:07 PM, Doug Ledford wrote:
> On 11/28/2016 12:08 PM, Steve Wise wrote:
>>>
>>> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
>>>
>>>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device,
>> u8
>>> port_num)
>>>> +{
>>>> +	return device->port_immutable[port_num].core_cap_flags &
>>> RDMA_CORE_CAP_PROT_RAW_PACKET;
>>>> +}
>>>
>>> Does the mlx drivers really register ports with different capabilities
>>> as the same ib_device? I'm not sure that should be allowed.
>>>
>>> I keep talking about how we need to get rid of the port_num in these
>>> sorts of places because it makes no sense...
>>>
>>
>> I agree.   Requiring the port number has implications that ripple up into the
>> rdma-rw api as well...
>>
>>
>
> In all fairness, there is no requirement that any two ports on the same
> device be the same link layer, or if the link layer is Ethernet, there
> is no requirement that they can't support both iWARP and RoCE.  The idea
> that the parent device defined the supported protocols for all ports of
> a device became wrong with the first mlx4 device that could do both IB
> and Ethernet.  And I think I've heard rumblings of a combined RoCE/iWARP
> device possibly in the future from someone else.

This one for instance?

http://www.qlogic.com/Resources/Documents/DataSheets/Adapters/DataSheet_QLE45211HL_QLE45212HL.pdf

I'd love to see any such device support protocol choice per
connection, not just per port. That of course would have
implications on the rdma commection manager api.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford Nov. 30, 2016, 4:18 p.m. UTC | #9
On 11/29/2016 9:33 PM, Tom Talpey wrote:
> On 11/29/2016 9:07 PM, Doug Ledford wrote:
>> On 11/28/2016 12:08 PM, Steve Wise wrote:
>>>>
>>>> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
>>>>
>>>>> +static inline bool rdma_protocol_raw_packet(const struct ib_device
>>>>> *device,
>>> u8
>>>> port_num)
>>>>> +{
>>>>> +    return device->port_immutable[port_num].core_cap_flags &
>>>> RDMA_CORE_CAP_PROT_RAW_PACKET;
>>>>> +}
>>>>
>>>> Does the mlx drivers really register ports with different capabilities
>>>> as the same ib_device? I'm not sure that should be allowed.
>>>>
>>>> I keep talking about how we need to get rid of the port_num in these
>>>> sorts of places because it makes no sense...
>>>>
>>>
>>> I agree.   Requiring the port number has implications that ripple up
>>> into the
>>> rdma-rw api as well...
>>>
>>>
>>
>> In all fairness, there is no requirement that any two ports on the same
>> device be the same link layer, or if the link layer is Ethernet, there
>> is no requirement that they can't support both iWARP and RoCE.  The idea
>> that the parent device defined the supported protocols for all ports of
>> a device became wrong with the first mlx4 device that could do both IB
>> and Ethernet.  And I think I've heard rumblings of a combined RoCE/iWARP
>> device possibly in the future from someone else.
> 
> This one for instance?
> 
> http://www.qlogic.com/Resources/Documents/DataSheets/Adapters/DataSheet_QLE45211HL_QLE45212HL.pdf
> 
> 
> I'd love to see any such device support protocol choice per
> connection, not just per port. That of course would have
> implications on the rdma commection manager api.
> 

That's certainly a prime example, thanks Tom ;-)
Hefty, Sean Nov. 30, 2016, 4:29 p.m. UTC | #10
> In all fairness, there is no requirement that any two ports on the same
> device be the same link layer, or if the link layer is Ethernet, there
> is no requirement that they can't support both iWARP and RoCE.  The
> idea
> that the parent device defined the supported protocols for all ports of
> a device became wrong with the first mlx4 device that could do both IB
> and Ethernet.  And I think I've heard rumblings of a combined
> RoCE/iWARP
> device possibly in the future from someone else.

It would help if the community didn't continually redefine terms based on the latest set of patches or whims or random hardware feature.  At one time an ib_device meant an actual IB device - go figure.  Now it's not even a device, but some abstract weirdness collection ports that all support the same transport, or was it link layer, or ... I really have no idea now.  The RDMA subsystem really needs to figure out what it wants to be, because even the term RDMA doesn't even apply to all of the devices that it supports.  And now we're at the point of arguing over where drivers should go because no one even knows that anymore.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liran Liss Nov. 30, 2016, 4:30 p.m. UTC | #11
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> owner@vger.kernel.org] On Behalf Of Tom Talpey

> >
> > In all fairness, there is no requirement that any two ports on the
> > same device be the same link layer, or if the link layer is Ethernet,
> > there is no requirement that they can't support both iWARP and RoCE.
> > The idea that the parent device defined the supported protocols for
> > all ports of a device became wrong with the first mlx4 device that
> > could do both IB and Ethernet.  And I think I've heard rumblings of a
> > combined RoCE/iWARP device possibly in the future from someone else.
> 
> This one for instance?
> 
> http://www.qlogic.com/Resources/Documents/DataSheets/Adapters/DataShee
> t_QLE45211HL_QLE45212HL.pdf
> 
> I'd love to see any such device support protocol choice per connection, not just
> per port. That of course would have implications on the rdma commection
> manager api.
> 

Exactly. If/when such devices appear, we would need to extend connection management to specify the protocol, rather than infer it from the port space.
It would be perfectly sensible to use both RoCE and iWARP over the same physical Ethernet port and the same source IP address.

Rethinking about the uAPI, maybe we should report a protocol bit-mask similar to the kernel's, instead of QP types?
This would provide all the required information (e.g., any combination of RoCEv1/v2, iWARP, and Raw Ethernet for Ethernet links) for today's use-cases as well as tomorrow's combined RoCE/iWARP devices.
--Liran

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Nov. 30, 2016, 4:36 p.m. UTC | #12
On Tue, Nov 29, 2016 at 09:07:52PM -0500, Doug Ledford wrote:
> On 11/28/2016 12:08 PM, Steve Wise wrote:
> >>
> >> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
> >>
> >>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device,
> > u8
> >> port_num)
> >>> +{
> >>> +	return device->port_immutable[port_num].core_cap_flags &
> >> RDMA_CORE_CAP_PROT_RAW_PACKET;
> >>> +}
> >>
> >> Does the mlx drivers really register ports with different capabilities
> >> as the same ib_device? I'm not sure that should be allowed.
> >>
> >> I keep talking about how we need to get rid of the port_num in these
> >> sorts of places because it makes no sense...
> >>
> > 
> > I agree.   Requiring the port number has implications that ripple up into the
> > rdma-rw api as well...
> >  
> > 
> 
> In all fairness, there is no requirement that any two ports on the same
> device be the same link layer, or if the link layer is Ethernet, there
> is no requirement that they can't support both iWARP and RoCE.

There actually is a requirement. The RDMA CM hard requires all ports
be iWARP or !iWARP at least. I'm sure there are other subtle things
floating around.

There are also things that become very confusing for user space, and
we don't have the infrastructure to support, if ports can switch
configurations on the fly.

The simplest, approach, most in line with how verbs was designed, is
to require each ib_device to have a single kind of AH.

> The idea that the parent device defined the supported protocols for
> all ports of a device became wrong with the first mlx4 device that

Arguably it was sort of OK for roceev1, is less OK for v2, but
shouldn't have been done anyhow.

The uapi question here is do we want to double down and try and make
this work (and what does that even *mean*) or admit mlx4 was an error
and stop doing that going forward..

Or do something else? eg Specifying the AH type when creating the PD
could potentially solve some of the problems...

> could do both IB and Ethernet.  And I think I've heard rumblings of
> a combined RoCE/iWARP device possibly in the future from someone
> else.

Two struct ib_devices for the same port then... Certainly ugly.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Nov. 30, 2016, 4:39 p.m. UTC | #13
On Wed, Nov 30, 2016 at 04:30:09PM +0000, Liran Liss wrote:
> > I'd love to see any such device support protocol choice per connection, not just
> > per port. That of course would have implications on the rdma commection
> > manager api.
 
> Exactly. If/when such devices appear, we would need to extend
> connection management to specify the protocol, rather than infer it
> from the port space.

We support that perfectly today as long as the port creates two 'struct
ib_devices'. Anything else will require some kind of changes to
libibverb's API to specify the AH style.

> Rethinking about the uAPI, maybe we should report a protocol
> bit-mask similar to the kernel's, instead of QP types?  This would
> provide all the required information (e.g., any combination of
> RoCEv1/v2, iWARP, and Raw Ethernet for Ethernet links) for today's
> use-cases as well as tomorrow's combined RoCE/iWARP devices.

Maybe we should dump this uapi stuff until Matan's patches are
done. The introspection possible with Matan's work is flexible enough
to cope with more cases..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz Nov. 30, 2016, 4:59 p.m. UTC | #14
On 11/30/2016 6:39 PM, Jason Gunthorpe wrote:
> Maybe we should dump this uapi stuff until Matan's patches are
> done. The introspection possible with Matan's work is flexible enough
> to cope with more cases..

Basically I am OKay with that approach too.

Doug, if you are willing to take the mlx5 patches that enable the 
feature of mlx5 device over Eth port that doesn't support RoCE (8,9,10 - 
I will have to do some rebasing) and discuss the query when the new ABI 
code is getting closer to be upstream that's fine too.

Or.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liran Liss Nov. 30, 2016, 5:01 p.m. UTC | #15
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com]

> 
> > Exactly. If/when such devices appear, we would need to extend
> > connection management to specify the protocol, rather than infer it
> > from the port space.
> 
> We support that perfectly today as long as the port creates two 'struct
> ib_devices'. Anything else will require some kind of changes to libibverb's API to
> specify the AH style.
> 

rdmacm would still have to choose between these ib_devices somehow - this doesn't change anything.
Also, AHs are already port-specific. So, I don't see any issue in this regard.

In any case, we have millions of multi-port devices that can use different link types deployed.
This is the specification, and more such devices could appear in the future.
We cannot change the device model.

> > Rethinking about the uAPI, maybe we should report a protocol bit-mask
> > similar to the kernel's, instead of QP types?  This would provide all
> > the required information (e.g., any combination of RoCEv1/v2, iWARP,
> > and Raw Ethernet for Ethernet links) for today's use-cases as well as
> > tomorrow's combined RoCE/iWARP devices.
> 
> Maybe we should dump this uapi stuff until Matan's patches are done. The
> introspection possible with Matan's work is flexible enough to cope with more
> cases..

No doubt that the new ABI would be a lot more flexible and self-describing.
But it would take a while until we port everything to use it.
So, generally, I don't see any problem using the current extensibility capabilities to support useful semantics.

> 
> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Nov. 30, 2016, 5:08 p.m. UTC | #16
On Wed, Nov 30, 2016 at 05:01:32PM +0000, Liran Liss wrote:
> > From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com]
> 
> > 
> > > Exactly. If/when such devices appear, we would need to extend
> > > connection management to specify the protocol, rather than infer it
> > > from the port space.
> > 
> > We support that perfectly today as long as the port creates two 'struct
> > ib_devices'. Anything else will require some kind of changes to libibverb's API to
> > specify the AH style.
> > 
> 
> rdmacm would still have to choose between these ib_devices somehow

Each ib_device is either iwarp or rocee, the rdma cm would route iwarp
stuff to the iwarp one and rocee stuff to the rocee one. Not really a
problem with today's architecture.

> - this doesn't change anything.  Also, AHs are already
> port-specific. So, I don't see any issue in this regard.

The current scheme infers the protocol of the AH from the current
configuration of the port, which is a crazy API when the port's
protocol can change on the fly.

> In any case, we have millions of multi-port devices that can use
> different link types deployed.  This is the specification, and more
> such devices could appear in the future.  We cannot change the
> device model.

Of course we can change how they are modeled in Linux, it is just
software.

> No doubt that the new ABI would be a lot more flexible and
> self-describing.  But it would take a while until we port everything
> to use it.  So, generally, I don't see any problem using the current
> extensibility capabilities to support useful semantics.

Perhaps a moritorium on some changes to the current uAPI will
encourage the new one to get finished :P

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean Nov. 30, 2016, 5:25 p.m. UTC | #17
> > - this doesn't change anything.  Also, AHs are already
> > port-specific. So, I don't see any issue in this regard.
> 
> The current scheme infers the protocol of the AH from the current
> configuration of the port, which is a crazy API when the port's
> protocol can change on the fly.

Maybe the solution is to make the protocol selection explicit throughout the APIs and associate it with a QP, rather than attempting to list all transport protocols that a port can support.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise Nov. 30, 2016, 5:27 p.m. UTC | #18
> 
> > > - this doesn't change anything.  Also, AHs are already
> > > port-specific. So, I don't see any issue in this regard.
> >
> > The current scheme infers the protocol of the AH from the current
> > configuration of the port, which is a crazy API when the port's
> > protocol can change on the fly.
> 
> Maybe the solution is to make the protocol selection explicit throughout the
APIs
> and associate it with a QP, rather than attempting to list all transport
protocols that
> a port can support.

Do you mean requiring the application to pick the protocol?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean Nov. 30, 2016, 5:30 p.m. UTC | #19
> > > > - this doesn't change anything.  Also, AHs are already
> > > > port-specific. So, I don't see any issue in this regard.
> > >
> > > The current scheme infers the protocol of the AH from the current
> > > configuration of the port, which is a crazy API when the port's
> > > protocol can change on the fly.
> >
> > Maybe the solution is to make the protocol selection explicit
> throughout the
> APIs
> > and associate it with a QP, rather than attempting to list all
> transport
> protocols that
> > a port can support.
> 
> Do you mean requiring the application to pick the protocol?

Yes - it seems necessary to support devices with RoCE and iWarp running on the same port.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Nov. 30, 2016, 5:32 p.m. UTC | #20
On Wed, Nov 30, 2016 at 05:25:18PM +0000, Hefty, Sean wrote:
> > > - this doesn't change anything.  Also, AHs are already
> > > port-specific. So, I don't see any issue in this regard.
> > 
> > The current scheme infers the protocol of the AH from the current
> > configuration of the port, which is a crazy API when the port's
> > protocol can change on the fly.
> 
> Maybe the solution is to make the protocol selection explicit
> throughout the APIs and associate it with a QP, rather than
> attempting to list all transport protocols that a port can support.

AH's are linked to a PD, not a QP..

If we had to do it again, a PD centric approach would be more
sensible:

  // Create a PD on 'port' using ah format 'protocol'
  pd = ibv_pd_create(port, enum ah_type protocol);

  // Enable APM or resource sharing on the PD across two ports
  ibv_pd_add_port(pd, alt_port);

And get rid of the multi-port ib_device concept entirely.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean Nov. 30, 2016, 6:17 p.m. UTC | #21
> > Maybe the solution is to make the protocol selection explicit
> > throughout the APIs and associate it with a QP, rather than
> > attempting to list all transport protocols that a port can support.
> 
> AH's are linked to a PD, not a QP..

But protocols are associated with QPs.  I'm not sure *why* an AH is linked to a PD.  Conceptually the association seems unnecessary.

> If we had to do it again, a PD centric approach would be more
> sensible:
> 
>   // Create a PD on 'port' using ah format 'protocol'
>   pd = ibv_pd_create(port, enum ah_type protocol);
> 
>   // Enable APM or resource sharing on the PD across two ports
>   ibv_pd_add_port(pd, alt_port);
> 
> And get rid of the multi-port ib_device concept entirely.

I think no matter what option is chosen, some limitation may end up being placed on how a device is used that may not map to hardware limits.  I'm personally fine with that, but we need to define the right level of abstraction (and pick the right terms to describe them).
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Nov. 30, 2016, 6:34 p.m. UTC | #22
On Wed, Nov 30, 2016 at 06:17:34PM +0000, Hefty, Sean wrote:
> > > Maybe the solution is to make the protocol selection explicit
> > > throughout the APIs and associate it with a QP, rather than
> > > attempting to list all transport protocols that a port can support.
> > 
> > AH's are linked to a PD, not a QP..
> 
> But protocols are associated with QPs.  I'm not sure *why* an AH is
> linked to a PD.  Conceptually the association seems unnecessary.

The AH has to be linked to the PD because the PD specifies the
hardware target and the AH is a hardware object. All objects must be
traced back to a PD. It is linked to a PD not a QP because the AH can
be shared across all QPs, which is useful for UD applications..

The AH is really similar to the ethernet layer in an IP stack, and the QP
is akin the TCP layer.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean Nov. 30, 2016, 6:46 p.m. UTC | #23
> > But protocols are associated with QPs.  I'm not sure *why* an AH is
> > linked to a PD.  Conceptually the association seems unnecessary.
> 
> The AH has to be linked to the PD because the PD specifies the
> hardware target and the AH is a hardware object. All objects must be
> traced back to a PD. It is linked to a PD not a QP because the AH can
> be shared across all QPs, which is useful for UD applications..
> 
> The AH is really similar to the ethernet layer in an IP stack, and the
> QP
> is akin the TCP layer.

I always viewed the AH as specifying the route to the destination port.  Locally, I still don't see why it couldn't work across the device, like the CQ does.  Associating it with a PD just seems to limit which QPs, and by association the memory buffers, it can be used with.  So, I'm still not seeing the reason for the linkage...
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Nov. 30, 2016, 6:58 p.m. UTC | #24
On Wed, Nov 30, 2016 at 06:46:13PM +0000, Hefty, Sean wrote:
> > > But protocols are associated with QPs.  I'm not sure *why* an AH is
> > > linked to a PD.  Conceptually the association seems unnecessary.
> > 
> > The AH has to be linked to the PD because the PD specifies the
> > hardware target and the AH is a hardware object. All objects must be
> > traced back to a PD. It is linked to a PD not a QP because the AH can
> > be shared across all QPs, which is useful for UD applications..
> > 
> > The AH is really similar to the ethernet layer in an IP stack, and the
> > QP
> > is akin the TCP layer.
> 
> I always viewed the AH as specifying the route to the destination
> port.  Locally, I still don't see why it couldn't work across the
> device, like the CQ does.  Associating it with a PD just seems to
> limit which QPs, and by association the memory buffers, it can be
> used with.  So, I'm still not seeing the reason for the linkage...

Oh I see what you mean, I forgot about that - yes, I don't see any
reason why the AH is linked to the PD not the ibv_context..

Perhaps it is an artifact of the thinking when applying this to the
kernel where each ULP would allocate resources to the PD and then
destroy the PD when the connection is done? So per ulp resources were
charged to the PD..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford Nov. 30, 2016, 7:35 p.m. UTC | #25
On 11/30/2016 11:36 AM, Jason Gunthorpe wrote:
> On Tue, Nov 29, 2016 at 09:07:52PM -0500, Doug Ledford wrote:
>> On 11/28/2016 12:08 PM, Steve Wise wrote:
>>>>
>>>> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
>>>>
>>>>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device,
>>> u8
>>>> port_num)
>>>>> +{
>>>>> +	return device->port_immutable[port_num].core_cap_flags &
>>>> RDMA_CORE_CAP_PROT_RAW_PACKET;
>>>>> +}
>>>>
>>>> Does the mlx drivers really register ports with different capabilities
>>>> as the same ib_device? I'm not sure that should be allowed.
>>>>
>>>> I keep talking about how we need to get rid of the port_num in these
>>>> sorts of places because it makes no sense...
>>>>
>>>
>>> I agree.   Requiring the port number has implications that ripple up into the
>>> rdma-rw api as well...
>>>  
>>>
>>
>> In all fairness, there is no requirement that any two ports on the same
>> device be the same link layer, or if the link layer is Ethernet, there
>> is no requirement that they can't support both iWARP and RoCE.
> 
> There actually is a requirement. The RDMA CM hard requires all ports
> be iWARP or !iWARP at least. I'm sure there are other subtle things
> floating around.
> 
> There are also things that become very confusing for user space, and
> we don't have the infrastructure to support, if ports can switch
> configurations on the fly.
> 
> The simplest, approach, most in line with how verbs was designed, is
> to require each ib_device to have a single kind of AH.

Sorry, we're conflating two separate things here.  I was merely
referring to the hardware.  Not our implementation of a stack.  I was
attempting to point out that our stack implementation is placing
artificial restrictions on the hardware that the hardware does not share.

>> The idea that the parent device defined the supported protocols for
>> all ports of a device became wrong with the first mlx4 device that
> 
> Arguably it was sort of OK for roceev1, is less OK for v2, but
> shouldn't have been done anyhow.
> 
> The uapi question here is do we want to double down and try and make
> this work (and what does that even *mean*) or admit mlx4 was an error
> and stop doing that going forward..
> 
> Or do something else? eg Specifying the AH type when creating the PD
> could potentially solve some of the problems...
> 
>> could do both IB and Ethernet.  And I think I've heard rumblings of
>> a combined RoCE/iWARP device possibly in the future from someone
>> else.
> 
> Two struct ib_devices for the same port then... Certainly ugly.

I'm not entirely sure I agree....I would have to think about it more,
but the underlying problem I'm concerned with is exactly what I point
out above: what restrictions would we be placing on the hardware that
are artificial and a result of our stack that the hardware itself does
not share?  Could the hardware support automatic migration from iWARP to
RoCE or vice versa if both endpoints supported it?  Would that work if
we required two separate ib_devices?  Things like that.
Hefty, Sean Nov. 30, 2016, 9:12 p.m. UTC | #26
> I'm not entirely sure I agree....I would have to think about it more,
> but the underlying problem I'm concerned with is exactly what I point
> out above: what restrictions would we be placing on the hardware that
> are artificial and a result of our stack that the hardware itself does
> not share?  Could the hardware support automatic migration from iWARP
> to
> RoCE or vice versa if both endpoints supported it?  Would that work if
> we required two separate ib_devices?  Things like that.

IMO, in order to create a usable software interface, we will end up defining limitations.  The alternative seems to be an endless series of capability bits that expands to millions of combinations, and in the end still won't capture everything.  Consider that the existing interfaces already limit what upstream drivers can expose.

The primary reason for exposing devices is for apps to associate hardware resources (e.g. CQ and QP) with each other.  With software based implementations of roce/iwarp, a device is basically the system.  If you connect NICs directly to the CPU, then even separate hardware devices could share state, which could complicate things even more.

I'm not sure about Jason's idea of re-defining the PD as a 'protocol domain' (my term), but I think we should seriously consider something similar as an addition to the <device, port> model.  I'm still of the opinion that protocol is a property of a QP, not a port.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey Dec. 1, 2016, 1:06 p.m. UTC | #27
On 11/30/2016 12:30 PM, Hefty, Sean wrote:
>>>>> - this doesn't change anything.  Also, AHs are already
>>>>> port-specific. So, I don't see any issue in this regard.
>>>>
>>>> The current scheme infers the protocol of the AH from the current
>>>> configuration of the port, which is a crazy API when the port's
>>>> protocol can change on the fly.
>>>
>>> Maybe the solution is to make the protocol selection explicit
>> throughout the
>> APIs
>>> and associate it with a QP, rather than attempting to list all
>> transport
>> protocols that
>>> a port can support.
>>
>> Do you mean requiring the application to pick the protocol?
>
> Yes - it seems necessary to support devices with RoCE and iWarp running on the same port.

Sockets require this, the API requires a protocol family (PF_xxx)
and socket type (SOCK_xxx) that direct the endpoint to become TCP,
UDP, etc.

On the other hand, RDMA APIs to date have considered RDMA to be the
transport, and therefore they have hidden the underlying protocol.
You just request an "RDMA" endpoint from a specific adapter, and get
what it returns.

I'd agree that the application picks the protocol, but I'd also
argue that the API should have a "wildcard" mode which allows the
current behavior. It could be a simple extension to just add a
selector parameter.

The tricky part is what to do with passive endpoints. If a wildcard
is specified, should a multiprotocol adapter create listens on all
protocols? Also, what of protocols which auto-negotiate? It's my
understanding that some RoCE adapters will attempt to autodetect
whether their peer is RoCEv1 or RoCEv2 capable, and adjust their
protocol to suit.

Tom.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean Dec. 1, 2016, 7:07 p.m. UTC | #28
> I'd agree that the application picks the protocol, but I'd also
> argue that the API should have a "wildcard" mode which allows the
> current behavior. It could be a simple extension to just add a
> selector parameter.

I agree.  The QP type can be viewed as the socket type, we only need to add a protocol field to go along with it.  I.e. the protocol should be a qp attribute provided on input to qp create.

> The tricky part is what to do with passive endpoints. If a wildcard
> is specified, should a multiprotocol adapter create listens on all
> protocols? Also, what of protocols which auto-negotiate? It's my
> understanding that some RoCE adapters will attempt to autodetect
> whether their peer is RoCEv1 or RoCEv2 capable, and adjust their
> protocol to suit.

We could include the protocol selection, also with wildcard support, here as well.  The rdma cm may be able to handle this, so that drivers won't need to deal with a wildcard endpoint.  Assuming that we don't end up with some devices that require they implement wildcard support, while others can't...
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liran Liss Dec. 4, 2016, 8:38 p.m. UTC | #29
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> owner@vger.kernel.org] On Behalf Of Hefty, Sean

> 
> > I'd agree that the application picks the protocol, but I'd also argue
> > that the API should have a "wildcard" mode which allows the current
> > behavior. It could be a simple extension to just add a selector
> > parameter.
> 
> I agree.  The QP type can be viewed as the socket type, we only need to add a
> protocol field to go along with it.  I.e. the protocol should be a qp attribute
> provided on input to qp create.
> 

Right.

> > The tricky part is what to do with passive endpoints. If a wildcard is
> > specified, should a multiprotocol adapter create listens on all
> > protocols? Also, what of protocols which auto-negotiate? It's my
> > understanding that some RoCE adapters will attempt to autodetect
> > whether their peer is RoCEv1 or RoCEv2 capable, and adjust their
> > protocol to suit.
> 
> We could include the protocol selection, also with wildcard support, here as
> well.  The rdma cm may be able to handle this, so that drivers won't need to deal
> with a wildcard endpoint.  Assuming that we don't end up with some devices
> that require they implement wildcard support, while others can't...

Typically, the application will request the protocol that it wants or leave it unspecified.
In that case, I think that the rdmacm would select the device default.

Anyway, returning to the initial matter at hand: I would like to start with each port reporting a bit mask of the supported protocols on that link (RoCE v1/v2, Raw Ethernet, iWARP, etc.)
It will be used for reporting device capabilities in general for tools, as well as by applications that don't use rdmacm.
--Liran

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Dec. 5, 2016, 5:10 p.m. UTC | #30
On Sun, Dec 04, 2016 at 08:38:14PM +0000, Liran Liss wrote:

> Anyway, returning to the initial matter at hand: I would like to
> start with each port reporting a bit mask of the supported protocols
> on that link (RoCE v1/v2, Raw Ethernet, iWARP, etc.)  It will be
> used for reporting device capabilities in general for tools, as well
> as by applications that don't use rdmacm.

Why don't we start by defining how it is supposed to even work and how
to fix the RDMA CM before adding even more random capability bits?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liran Liss Dec. 5, 2016, 8:14 p.m. UTC | #31
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com]

> 
> > Anyway, returning to the initial matter at hand: I would like to start
> > with each port reporting a bit mask of the supported protocols on that
> > link (RoCE v1/v2, Raw Ethernet, iWARP, etc.)  It will be used for
> > reporting device capabilities in general for tools, as well as by
> > applications that don't use rdmacm.
> 
> Why don't we start by defining how it is supposed to even work and how to fix
> the RDMA CM before adding even more random capability bits?
> 
> Jason

Extensions to RDMA CM is an important topic, which we can continue to discuss as devices that require it are introduced.

Protocol capabilities are needed for information purposes as well as for applications that do not use RDMA CM.
They are not "random", but depict what a device can do.
--Liran
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz Dec. 6, 2016, 9:26 p.m. UTC | #32
On Mon, Dec 5, 2016 at 10:14 PM, Liran Liss <liranl@mellanox.com> wrote:
>> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com]

>>> Anyway, returning to the initial matter at hand: I would like to start
>>> with each port reporting a bit mask of the supported protocols on that
>>> link (RoCE v1/v2, Raw Ethernet, iWARP, etc.)  It will be used for
>>> reporting device capabilities in general for tools, as well as by
>>> applications that don't use rdmacm.

>> Why don't we start by defining how it is supposed to even work and how to fix
>> the RDMA CM before adding even more random capability bits?

> Extensions to RDMA CM is an important topic, which we can continue to discuss as devices that require it are introduced.

> Protocol capabilities are needed for information purposes as well as for applications that do not use RDMA CM.
> They are not "random", but depict what a device can do.

Doug,

Can we get a maintainer say here? I didn't see complaints on the mlx5
IB driver patch that opens IB device also in the lack of RoCE support.
We have a valid real life use-case which needs this driver patch.

Per your request, I made also some core patch to expose QPTs to
user-space and we went into this long discussion.

Do you want the change along what was proposed by Liran (allow
user-space to query the supported protocols per port) or you are okay
with  taking only the mlx5 patches for now and discuss the rest later
or something else?

Please let us know

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Dec. 6, 2016, 9:39 p.m. UTC | #33
On Tue, Dec 06, 2016 at 11:26:51PM +0200, Or Gerlitz wrote:

> Do you want the change along what was proposed by Liran (allow
> user-space to query the supported protocols per port) or you are
> okay

Why doesn't mellanox come up with a plan to actually make all of this
work in some sensible way? So far only Mellanox needs this.

I've already proposed disallowing multiprotocol struct ib_devices.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean Dec. 6, 2016, 10:13 p.m. UTC | #34
> I've already proposed disallowing multiprotocol struct ib_devices.

My preference is to discontinue attempts at associating a protocol with the device.  A device could implement a dozen protocols in software.  Transports belong to QPs or cm ids, not devices.  Each rdma_cm_id should be associated with a specific cm/transport directly, rather than indirectly selecting one based on the bound <device, port>.

If an app wants a specific transport type for a QP, why doesn't it just try to open one and see if the call fails?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford Dec. 7, 2016, 10:06 p.m. UTC | #35
On 12/6/2016 5:13 PM, Hefty, Sean wrote:
>> I've already proposed disallowing multiprotocol struct ib_devices.
> 
> My preference is to discontinue attempts at associating a protocol with the device.  A device could implement a dozen protocols in software.  Transports belong to QPs or cm ids, not devices.  Each rdma_cm_id should be associated with a specific cm/transport directly, rather than indirectly selecting one based on the bound <device, port>.
> 
> If an app wants a specific transport type for a QP, why doesn't it just try to open one and see if the call fails?

I tend to agree with Sean on this.  And to answer Or's question from a
couple emails back, I'm inclined to just take the last three patches for
now while we work out a better idea of how everything here should be
done on the first seven patches.
Jason Gunthorpe Dec. 7, 2016, 10:40 p.m. UTC | #36
On Wed, Dec 07, 2016 at 05:06:24PM -0500, Doug Ledford wrote:
> On 12/6/2016 5:13 PM, Hefty, Sean wrote:
> >> I've already proposed disallowing multiprotocol struct ib_devices.
> > 
> > My preference is to discontinue attempts at associating a protocol with the device.  A device could implement a dozen protocols in software.  Transports belong to QPs or cm ids, not devices.  Each rdma_cm_id should be associated with a specific cm/transport directly, rather than indirectly selecting one based on the bound <device, port>.
> > 
> > If an app wants a specific transport type for a QP, why doesn't it just try to open one and see if the call fails?
> 
> I tend to agree with Sean on this.  And to answer Or's question from a
> couple emails back, I'm inclined to just take the last three patches for
> now while we work out a better idea of how everything here should be
> done on the first seven patches.

Well in that case, we should work toward getting rid of 'struct
ib_device' - it is the cause of so much of this trouble. A
port-focused model like the netstack uses is overall saner..

We can figure out some other way to model what ports can be shared
within a PD, ideally something that works better for rxe..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liran Liss Dec. 8, 2016, 9:29 a.m. UTC | #37
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com]
> Sent: Thursday, December 08, 2016 12:41 AM

> > >
> > > My preference is to discontinue attempts at associating a protocol with the
> device.  A device could implement a dozen protocols in software.  Transports
> belong to QPs or cm ids, not devices.  Each rdma_cm_id should be associated
> with a specific cm/transport directly, rather than indirectly selecting one based
> on the bound <device, port>.
> > >
> > > If an app wants a specific transport type for a QP, why doesn't it just try to
> open one and see if the call fails?
> >
> > I tend to agree with Sean on this.  And to answer Or's question from a
> > couple emails back, I'm inclined to just take the last three patches
> > for now while we work out a better idea of how everything here should
> > be done on the first seven patches.
> 

Yep, it's a good start.

> Well in that case, we should work toward getting rid of 'struct ib_device' - it is
> the cause of so much of this trouble. A port-focused model like the netstack uses
> is overall saner..
> 

We are discussing 2 separate issues:
(1) Whether a device can support multiple protocols
(2) Multi-port devices, optionally with different link types.

I think that we all agree that (1) will be needed for future devices, and that different QPs might be running different protocols.
This means that supported protocols need to be device capabilities, as we have today in the kernel.
There are several devices that implement (2), which follows directly from the specification. The model for these devices is not going to change.

There is a question on where to report protocol capabilities - at the device or port level. This is regardless of how the kernel CMA will resolve which device and port will service a connection - it's a matter of transparency and management.
Doing so at the device level will require apps to know which protocols run on which link type. Doing so at the port level will be more obvious to apps, but won't be a common use case (assuming most apps will use rdmacm, which will do the protocol resolution job for them in the kernel). Doing so at the device level seems to be more aligned with how most future devices would be implemented.
However, I won't rule out multi-port devices so soon. When the transport is offloaded to HW, it makes sense to conduct HA between ports within the domain of a single device.

In any case, this has nothing to do with 'struct ib_device', which represents a device that handles packet processing internally and exposes transport endpoints to applications.
--Liran
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Dec. 8, 2016, 5:41 p.m. UTC | #38
On Thu, Dec 08, 2016 at 09:29:32AM +0000, Liran Liss wrote:

> > Well in that case, we should work toward getting rid of 'struct ib_device' - it is
> > the cause of so much of this trouble. A port-focused model like the netstack uses
> > is overall saner..
> > 
> 
> We are discussing 2 separate issues:
> (1) Whether a device can support multiple protocols
> (2) Multi-port devices, optionally with different link types.
> 
> I think that we all agree that (1) will be needed for future
> devices,

We all agree future hardware devices will support multiple protocols,
but I don't think there is any agreement so far on how to model that,
or what 'struct ib_device' means in such a world. As I said, I'd like
to get rid of it if we use a QP centric model to solve this problem.

> and that different QPs might be running different protocols.  This
> means that supported protocols need to be device capabilities,

No, they are QP capabilities, add new APIs to work with and query QP
related things and stop using the device for anything.

> have today in the kernel.  There are several devices that implement
> (2), which follows directly from the specification. The model for
> these devices is not going to change.

Again, no, the specification for multiport devices does not
contemplate different addressing formats, so the one driver that did
this went off spec in a badly thought out manner and left us this
mess.

> kernel). Doing so at the device level seems to be more aligned with
> how most future devices would be implemented.  However, I won't rule
> out multi-port devices so soon. When the transport is offloaded to
> HW, it makes sense to conduct HA between ports within the domain of
> a single device.

It is a QP thing if we go down Sean's suggestion, so add QP related
queries to get the information.

I have no idea how to model HA in this world - presumably when you add
an AH to a QP it will refuse to do it if the implied HA model isn't
supported.

But, I'm not sure how well the QP model works when talking about
HA/APM..

> In any case, this has nothing to do with 'struct ib_device', which
> represents a device that handles packet processing internally and
> exposes transport endpoints to applications.

We need to make sense of what 'struct ib_device' is to figure out how
to solve these problems because we have multiple different definitions
running around concurrnatly in the stack - that is why it doesn't work
today to have more than one protocol on a struct ib_device.

So, yes, it actually has everything to do with 'struct ib_device'..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 5ad43a4..0cb3194 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -485,6 +485,7 @@  static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct(
 #define RDMA_CORE_CAP_PROT_ROCE         0x00200000
 #define RDMA_CORE_CAP_PROT_IWARP        0x00400000
 #define RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP 0x00800000
+#define RDMA_CORE_CAP_PROT_RAW_PACKET   0x01000000
 
 #define RDMA_CORE_PORT_IBA_IB          (RDMA_CORE_CAP_PROT_IB  \
 					| RDMA_CORE_CAP_IB_MAD \
@@ -508,6 +509,8 @@  static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct(
 #define RDMA_CORE_PORT_INTEL_OPA       (RDMA_CORE_PORT_IBA_IB  \
 					| RDMA_CORE_CAP_OPA_MAD)
 
+#define RDMA_CORE_PORT_RAW_PACKET	(RDMA_CORE_CAP_PROT_RAW_PACKET)
+
 struct ib_port_attr {
 	u64			subnet_prefix;
 	enum ib_port_state	state;
@@ -2286,6 +2289,11 @@  static inline bool rdma_ib_or_roce(const struct ib_device *device, u8 port_num)
 		rdma_protocol_roce(device, port_num);
 }
 
+static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num)
+{
+	return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET;
+}
+
 /**
  * rdma_cap_ib_mad - Check if the port of a device supports Infiniband
  * Management Datagrams.