diff mbox

[v6,01/26] IB/Verbs: Implement new callback query_transport()

Message ID 1429878230-11749-2-git-send-email-yun.wang@profitbricks.com (mailing list archive)
State Rejected
Headers show

Commit Message

Michael Wang April 24, 2015, 12:23 p.m. UTC
Add new callback query_transport() and implement for each HW.

Mapping List:
		node-type	link-layer	old-transport	new-transport
nes		RNIC		ETH		IWARP		IWARP
amso1100	RNIC		ETH		IWARP		IWARP
cxgb3   	RNIC		ETH		IWARP		IWARP
cxgb4   	RNIC		ETH		IWARP		IWARP
usnic   	USNIC_UDP	ETH		USNIC_UDP	USNIC_UDP
ocrdma  	IB_CA		ETH		IB		IBOE
mlx4    	IB_CA		IB/ETH		IB		IB/IBOE
mlx5    	IB_CA		IB		IB		IB
ehca    	IB_CA		IB		IB		IB
ipath   	IB_CA		IB		IB		IB
mthca   	IB_CA		IB		IB		IB
qib     	IB_CA		IB		IB		IB

Cc: Hal Rosenstock <hal@dev.mellanox.co.il>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
---
 drivers/infiniband/core/device.c             |  1 +
 drivers/infiniband/core/verbs.c              |  4 +++-
 drivers/infiniband/hw/amso1100/c2_provider.c |  7 +++++++
 drivers/infiniband/hw/cxgb3/iwch_provider.c  |  7 +++++++
 drivers/infiniband/hw/cxgb4/provider.c       |  7 +++++++
 drivers/infiniband/hw/ehca/ehca_hca.c        |  6 ++++++
 drivers/infiniband/hw/ehca/ehca_iverbs.h     |  3 +++
 drivers/infiniband/hw/ehca/ehca_main.c       |  1 +
 drivers/infiniband/hw/ipath/ipath_verbs.c    |  7 +++++++
 drivers/infiniband/hw/mlx4/main.c            | 10 ++++++++++
 drivers/infiniband/hw/mlx5/main.c            |  7 +++++++
 drivers/infiniband/hw/mthca/mthca_provider.c |  7 +++++++
 drivers/infiniband/hw/nes/nes_verbs.c        |  6 ++++++
 drivers/infiniband/hw/ocrdma/ocrdma_main.c   |  1 +
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |  6 ++++++
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h  |  3 +++
 drivers/infiniband/hw/qib/qib_verbs.c        |  7 +++++++
 drivers/infiniband/hw/usnic/usnic_ib_main.c  |  1 +
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c |  6 ++++++
 drivers/infiniband/hw/usnic/usnic_ib_verbs.h |  2 ++
 include/rdma/ib_verbs.h                      |  7 ++++++-
 21 files changed, 104 insertions(+), 2 deletions(-)

Comments

Tom Talpey April 24, 2015, 2:29 p.m. UTC | #1
On 4/24/2015 8:23 AM, Michael Wang wrote:
> Add new callback query_transport() and implement for each HW.
>
> Mapping List:
> 		node-type	link-layer	old-transport	new-transport
> ...
> mlx4    	IB_CA		IB/ETH		IB		IB/IBOE
> mlx5    	IB_CA		IB		IB		IB
> ...
> diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
> index 57c9809..b6f2f58 100644
> --- a/drivers/infiniband/hw/mlx5/main.c
> +++ b/drivers/infiniband/hw/mlx5/main.c
> @@ -262,6 +262,12 @@ out:
>   	return err;
>   }
>
> +static enum rdma_transport_type
> +mlx5_ib_query_transport(struct ib_device *device, u8 port_num)
> +{
> +	return RDMA_TRANSPORT_IB;
> +}
> +


Just noticed that mlx5 is not being coded as RoCE-capable like mlx4.
The mlx5 driver is for the new ConnectX-4, which implements all three
of IB, RoCE and RoCEv2, right? Are those last two not supported?


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Wang April 24, 2015, 2:35 p.m. UTC | #2
On 04/24/2015 04:29 PM, Tom Talpey wrote:
> On 4/24/2015 8:23 AM, Michael Wang wrote:
[snip]
>> +static enum rdma_transport_type
>> +mlx5_ib_query_transport(struct ib_device *device, u8 port_num)
>> +{
>> +    return RDMA_TRANSPORT_IB;
>> +}
>> +
> 
> 
> Just noticed that mlx5 is not being coded as RoCE-capable like mlx4.
> The mlx5 driver is for the new ConnectX-4, which implements all three
> of IB, RoCE and RoCEv2, right? Are those last two not supported?

I'm not sure about the details of mlx5, but according to the current
implementation, it's transport is IB without a link-layer callback,
which means it doesn't support IBoE...

And there is no method to change the port link-layer type as mlx4 did.

Regards,
Michael Wang

> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey April 24, 2015, 2:46 p.m. UTC | #3
On 4/24/2015 10:35 AM, Michael Wang wrote:
>
>
> On 04/24/2015 04:29 PM, Tom Talpey wrote:
>> On 4/24/2015 8:23 AM, Michael Wang wrote:
> [snip]
>>> +static enum rdma_transport_type
>>> +mlx5_ib_query_transport(struct ib_device *device, u8 port_num)
>>> +{
>>> +    return RDMA_TRANSPORT_IB;
>>> +}
>>> +
>>
>>
>> Just noticed that mlx5 is not being coded as RoCE-capable like mlx4.
>> The mlx5 driver is for the new ConnectX-4, which implements all three
>> of IB, RoCE and RoCEv2, right? Are those last two not supported?
>
> I'm not sure about the details of mlx5, but according to the current
> implementation, it's transport is IB without a link-layer callback,
> which means it doesn't support IBoE...
>
> And there is no method to change the port link-layer type as mlx4 did.

Hal, is that correct?

 From the Mellanox web:

http://www.mellanox.com/related-docs/products/IB_Adapter_card_brochure_c_2_3.pdf

"ConnectX-4
ConnectX-4 adapter cards with Virtual Protocol Interconnect (VPI),
supporting EDR 100Gb/s InfiniBand and 100Gb/s Ethernet connectivity,
provide the...

"Virtual Protocol Interconnect
VPIĀ® flexibility enables any standard networking, clustering, storage,
and management protocol to seamlessly operate over any converged network
leveraging a consolidated software stack. Each port can operate on
InfiniBand, Ethernet, or Data Center Bridging (DCB) fabrics, and
supports IP over InfiniBand (IPoIB), Ethernet over InfiniBand (EoIB)
and RDMA over Converged Ethernet (RoCE and RoCEv2).
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liran Liss April 24, 2015, 3:12 p.m. UTC | #4
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> 
[snip]
> a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index
> 65994a1..d54f91e 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -75,10 +75,13 @@ enum rdma_node_type {  };
> 
>  enum rdma_transport_type {
> +	/* legacy for users */
>  	RDMA_TRANSPORT_IB,
>  	RDMA_TRANSPORT_IWARP,
>  	RDMA_TRANSPORT_USNIC,
> -	RDMA_TRANSPORT_USNIC_UDP
> +	RDMA_TRANSPORT_USNIC_UDP,
> +	/* new transport */
> +	RDMA_TRANSPORT_IBOE,

Remove RDMA_TRANSPORT_IBOE - it is not a transport.
ROCE uses IBTA transport.

If any code should test for ROCE should invoke a specific helper, e.g., rdma_protocol_iboe().
This is  what you currently call "rdma_tech_iboe" is patch 02/26.

I think that pretty much everybody agrees that rdma_protocol_*() is a better name than rdma_tech_*(), right?
So, let's change this.

The semantics are: "check that a link supports a certain wire protocol, or a set of wire protocols", where 'certain'
refers to the specific helper...


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford April 24, 2015, 4 p.m. UTC | #5
On Fri, 2015-04-24 at 10:46 -0400, Tom Talpey wrote:
> On 4/24/2015 10:35 AM, Michael Wang wrote:
> >
> >
> > On 04/24/2015 04:29 PM, Tom Talpey wrote:
> >> On 4/24/2015 8:23 AM, Michael Wang wrote:
> > [snip]
> >>> +static enum rdma_transport_type
> >>> +mlx5_ib_query_transport(struct ib_device *device, u8 port_num)
> >>> +{
> >>> +    return RDMA_TRANSPORT_IB;
> >>> +}
> >>> +
> >>
> >>
> >> Just noticed that mlx5 is not being coded as RoCE-capable like mlx4.
> >> The mlx5 driver is for the new ConnectX-4, which implements all three
> >> of IB, RoCE and RoCEv2, right? Are those last two not supported?
> >
> > I'm not sure about the details of mlx5, but according to the current
> > implementation, it's transport is IB without a link-layer callback,
> > which means it doesn't support IBoE...
> >
> > And there is no method to change the port link-layer type as mlx4 did.
> 
> Hal, is that correct?

The mlx5 Ethernet driver has not yet been submitted to the linux kernel.
Until it lands, the drive is IB only.  When it does land, it will need
to update this area of code.
Michael Wang April 27, 2015, 7:39 a.m. UTC | #6
On 04/24/2015 05:12 PM, Liran Liss wrote:
>> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
>>
> [snip]
>> a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index
>> 65994a1..d54f91e 100644
>> --- a/include/rdma/ib_verbs.h
>> +++ b/include/rdma/ib_verbs.h
>> @@ -75,10 +75,13 @@ enum rdma_node_type {  };
>>
>>  enum rdma_transport_type {
>> +	/* legacy for users */
>>  	RDMA_TRANSPORT_IB,
>>  	RDMA_TRANSPORT_IWARP,
>>  	RDMA_TRANSPORT_USNIC,
>> -	RDMA_TRANSPORT_USNIC_UDP
>> +	RDMA_TRANSPORT_USNIC_UDP,
>> +	/* new transport */
>> +	RDMA_TRANSPORT_IBOE,
> 
> Remove RDMA_TRANSPORT_IBOE - it is not a transport.
> ROCE uses IBTA transport.
> 
> If any code should test for ROCE should invoke a specific helper, e.g., rdma_protocol_iboe().
> This is  what you currently call "rdma_tech_iboe" is patch 02/26.
> 
> I think that pretty much everybody agrees that rdma_protocol_*() is a better name than rdma_tech_*(), right?
> So, let's change this.

Sure, sounds reasonable now, about the IBOE, we still need it to
separate the port support IB/ETH without the check on link-layer,
So what about a new enum on protocol type?

Like:

enum rdma_protocol {
	RDMA_PROTOCOL_IB,
	RDMA_PROTOCOL_IBOE,
	RDMA_PROTOCOL_IWARP,
	RDMA_PROTOCOL_USNIC_UDP
};

So we could use query_protocol() to ask device provide the protocol
type, and there will be no mixing with the legacy transport type
anymore :-)

Regards,
Michael Wang

> 
> The semantics are: "check that a link supports a certain wire protocol, or a set of wire protocols", where 'certain'
> refers to the specific helper...
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ira Weiny April 27, 2015, 9:52 p.m. UTC | #7
On Mon, Apr 27, 2015 at 09:39:05AM +0200, Michael Wang wrote:
> 
> 
> On 04/24/2015 05:12 PM, Liran Liss wrote:
> >> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> >>
> > [snip]
> >> a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index
> >> 65994a1..d54f91e 100644
> >> --- a/include/rdma/ib_verbs.h
> >> +++ b/include/rdma/ib_verbs.h
> >> @@ -75,10 +75,13 @@ enum rdma_node_type {  };
> >>
> >>  enum rdma_transport_type {
> >> +	/* legacy for users */
> >>  	RDMA_TRANSPORT_IB,
> >>  	RDMA_TRANSPORT_IWARP,
> >>  	RDMA_TRANSPORT_USNIC,
> >> -	RDMA_TRANSPORT_USNIC_UDP
> >> +	RDMA_TRANSPORT_USNIC_UDP,
> >> +	/* new transport */
> >> +	RDMA_TRANSPORT_IBOE,
> > 
> > Remove RDMA_TRANSPORT_IBOE - it is not a transport.
> > ROCE uses IBTA transport.
> > 
> > If any code should test for ROCE should invoke a specific helper, e.g., rdma_protocol_iboe().
> > This is  what you currently call "rdma_tech_iboe" is patch 02/26.
> > 
> > I think that pretty much everybody agrees that rdma_protocol_*() is a better name than rdma_tech_*(), right?
> > So, let's change this.
> 
> Sure, sounds reasonable now, about the IBOE, we still need it to
> separate the port support IB/ETH without the check on link-layer,
> So what about a new enum on protocol type?
> 
> Like:
> 
> enum rdma_protocol {
> 	RDMA_PROTOCOL_IB,
> 	RDMA_PROTOCOL_IBOE,
> 	RDMA_PROTOCOL_IWARP,
> 	RDMA_PROTOCOL_USNIC_UDP
> };
> 
> So we could use query_protocol() to ask device provide the protocol
> type, and there will be no mixing with the legacy transport type
> anymore :-)

I'm ok with that.  I like introducing a unique namespace which is clearly
different from the previous "transport" one.

Ira

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey April 28, 2015, 12:16 a.m. UTC | #8
On 4/27/2015 2:52 PM, ira.weiny wrote:
> On Mon, Apr 27, 2015 at 09:39:05AM +0200, Michael Wang wrote:
>>
>>
>> On 04/24/2015 05:12 PM, Liran Liss wrote:
>>>> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
>>>>
>>> [snip]
>>>> a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index
>>>> 65994a1..d54f91e 100644
>>>> --- a/include/rdma/ib_verbs.h
>>>> +++ b/include/rdma/ib_verbs.h
>>>> @@ -75,10 +75,13 @@ enum rdma_node_type {  };
>>>>
>>>>   enum rdma_transport_type {
>>>> +	/* legacy for users */
>>>>   	RDMA_TRANSPORT_IB,
>>>>   	RDMA_TRANSPORT_IWARP,
>>>>   	RDMA_TRANSPORT_USNIC,
>>>> -	RDMA_TRANSPORT_USNIC_UDP
>>>> +	RDMA_TRANSPORT_USNIC_UDP,
>>>> +	/* new transport */
>>>> +	RDMA_TRANSPORT_IBOE,
>>>
>>> Remove RDMA_TRANSPORT_IBOE - it is not a transport.
>>> ROCE uses IBTA transport.
>>>
>>> If any code should test for ROCE should invoke a specific helper, e.g., rdma_protocol_iboe().
>>> This is  what you currently call "rdma_tech_iboe" is patch 02/26.
>>>
>>> I think that pretty much everybody agrees that rdma_protocol_*() is a better name than rdma_tech_*(), right?
>>> So, let's change this.
>>
>> Sure, sounds reasonable now, about the IBOE, we still need it to
>> separate the port support IB/ETH without the check on link-layer,
>> So what about a new enum on protocol type?
>>
>> Like:
>>
>> enum rdma_protocol {
>> 	RDMA_PROTOCOL_IB,
>> 	RDMA_PROTOCOL_IBOE,
>> 	RDMA_PROTOCOL_IWARP,
>> 	RDMA_PROTOCOL_USNIC_UDP
>> };
>>
>> So we could use query_protocol() to ask device provide the protocol
>> type, and there will be no mixing with the legacy transport type
>> anymore :-)
>
> I'm ok with that.  I like introducing a unique namespace which is clearly
> different from the previous "transport" one.

I agree the word "transport" takes things into the weeds.

But on the topic of naming protocols, I've been wondering, is there
some reason that "IBOE" is being used instead of "RoCE"? The IBOE
protocol used to exist and is not the same as the currently
standardized RoCE, right?

Also wondering, why add "UDP" to USNIC, is there a different USNIC?

Naming multiple layers together seems confusing and maybe in the end
will create more code to deal with the differences. For example, what
token will RoCEv2 take? RoCE_UDP, RoCE_v2 or ... ?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford April 28, 2015, 12:36 a.m. UTC | #9
On Mon, 2015-04-27 at 17:16 -0700, Tom Talpey wrote:
> On 4/27/2015 2:52 PM, ira.weiny wrote:
> > On Mon, Apr 27, 2015 at 09:39:05AM +0200, Michael Wang wrote:
> >>
> >>
> >> On 04/24/2015 05:12 PM, Liran Liss wrote:
> >>>> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> >>>>
> >>> [snip]
> >>>> a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index
> >>>> 65994a1..d54f91e 100644
> >>>> --- a/include/rdma/ib_verbs.h
> >>>> +++ b/include/rdma/ib_verbs.h
> >>>> @@ -75,10 +75,13 @@ enum rdma_node_type {  };
> >>>>
> >>>>   enum rdma_transport_type {
> >>>> +	/* legacy for users */
> >>>>   	RDMA_TRANSPORT_IB,
> >>>>   	RDMA_TRANSPORT_IWARP,
> >>>>   	RDMA_TRANSPORT_USNIC,
> >>>> -	RDMA_TRANSPORT_USNIC_UDP
> >>>> +	RDMA_TRANSPORT_USNIC_UDP,
> >>>> +	/* new transport */
> >>>> +	RDMA_TRANSPORT_IBOE,
> >>>
> >>> Remove RDMA_TRANSPORT_IBOE - it is not a transport.
> >>> ROCE uses IBTA transport.
> >>>
> >>> If any code should test for ROCE should invoke a specific helper, e.g., rdma_protocol_iboe().
> >>> This is  what you currently call "rdma_tech_iboe" is patch 02/26.
> >>>
> >>> I think that pretty much everybody agrees that rdma_protocol_*() is a better name than rdma_tech_*(), right?
> >>> So, let's change this.
> >>
> >> Sure, sounds reasonable now, about the IBOE, we still need it to
> >> separate the port support IB/ETH without the check on link-layer,
> >> So what about a new enum on protocol type?
> >>
> >> Like:
> >>
> >> enum rdma_protocol {
> >> 	RDMA_PROTOCOL_IB,
> >> 	RDMA_PROTOCOL_IBOE,
> >> 	RDMA_PROTOCOL_IWARP,
> >> 	RDMA_PROTOCOL_USNIC_UDP
> >> };
> >>
> >> So we could use query_protocol() to ask device provide the protocol
> >> type, and there will be no mixing with the legacy transport type
> >> anymore :-)
> >
> > I'm ok with that.  I like introducing a unique namespace which is clearly
> > different from the previous "transport" one.
> 
> I agree the word "transport" takes things into the weeds.
> 
> But on the topic of naming protocols, I've been wondering, is there
> some reason that "IBOE" is being used instead of "RoCE"?

Because back in the day, when RoCE was accepted into the kernel, I'm
pretty sure it was prior to the IBTA's final stamp of approval and
before the name was set on RoCE, so IBoE was chosen upstream as the more
"correct" name because it properly denoted what it was deemed to truly
be: IB Verbs over Ethernet.

>  The IBOE
> protocol used to exist and is not the same as the currently
> standardized RoCE, right?

I don't believe so.  To my knowledge, there was never an IBoE except in
linux upstream parlance.

> Also wondering, why add "UDP" to USNIC, is there a different USNIC?

Yes, there are two transports, one a distinct ethertype and one that
encapsulates USNIC in UDP.

> Naming multiple layers together seems confusing and maybe in the end
> will create more code to deal with the differences. For example, what
> token will RoCEv2 take? RoCE_UDP, RoCE_v2 or ... ?

Uncertain as of now.
Tom Talpey April 28, 2015, 12:53 a.m. UTC | #10
On 4/27/2015 5:36 PM, Doug Ledford wrote:
> On Mon, 2015-04-27 at 17:16 -0700, Tom Talpey wrote:
>> On 4/27/2015 2:52 PM, ira.weiny wrote:
>>> On Mon, Apr 27, 2015 at 09:39:05AM +0200, Michael Wang wrote:
>>>> On 04/24/2015 05:12 PM, Liran Liss wrote:
>>>>> [snip]
>>>>
>>>> Like:
>>>>
>>>> enum rdma_protocol {
>>>> 	RDMA_PROTOCOL_IB,
>>>> 	RDMA_PROTOCOL_IBOE,
>>>> 	RDMA_PROTOCOL_IWARP,
>>>> 	RDMA_PROTOCOL_USNIC_UDP
>>>> };
>>>>
>>>> So we could use query_protocol() to ask device provide the protocol
>>>> type, and there will be no mixing with the legacy transport type
>>>> anymore :-)
>>>
>>> I'm ok with that.  I like introducing a unique namespace which is clearly
>>> different from the previous "transport" one.
>>
>> I agree the word "transport" takes things into the weeds.
>>
>> But on the topic of naming protocols, I've been wondering, is there
>> some reason that "IBOE" is being used instead of "RoCE"?
>
> Because back in the day, when RoCE was accepted into the kernel, I'm
> pretty sure it was prior to the IBTA's final stamp of approval and
> before the name was set on RoCE, so IBoE was chosen upstream as the more
> "correct" name because it properly denoted what it was deemed to truly
> be: IB Verbs over Ethernet.

Well history is all well and good, but it seems weird to not use the
current, standard name in new code. It confuses me, anyway, because
it seems like IBOE could easily mean something else.

>> Also wondering, why add "UDP" to USNIC, is there a different USNIC?
>
> Yes, there are two transports, one a distinct ethertype and one that
> encapsulates USNIC in UDP.

But this new enum isn't about transport, it's about protocol. So is
there one USNIC protocol, with a raw layering and a separate one with
UDP? Or is it one USNIC protocol with two different framings? Seems
there should be at least the USNIC protocol, without the _UDP
decoration, and I don't see it in the enum.

>
>> Naming multiple layers together seems confusing and maybe in the end
>> will create more code to deal with the differences. For example, what
>> token will RoCEv2 take? RoCE_UDP, RoCE_v2 or ... ?
>
> Uncertain as of now.

Ok, but it's imminent, right? What's the preference/guidance?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford April 28, 2015, 1:24 a.m. UTC | #11
On Mon, 2015-04-27 at 17:53 -0700, Tom Talpey wrote:
> On 4/27/2015 5:36 PM, Doug Ledford wrote:
> > On Mon, 2015-04-27 at 17:16 -0700, Tom Talpey wrote:
> >> On 4/27/2015 2:52 PM, ira.weiny wrote:
> >>> On Mon, Apr 27, 2015 at 09:39:05AM +0200, Michael Wang wrote:
> >>>> On 04/24/2015 05:12 PM, Liran Liss wrote:
> >>>>> [snip]
> >>>>
> >>>> Like:
> >>>>
> >>>> enum rdma_protocol {
> >>>> 	RDMA_PROTOCOL_IB,
> >>>> 	RDMA_PROTOCOL_IBOE,
> >>>> 	RDMA_PROTOCOL_IWARP,
> >>>> 	RDMA_PROTOCOL_USNIC_UDP
> >>>> };
> >>>>
> >>>> So we could use query_protocol() to ask device provide the protocol
> >>>> type, and there will be no mixing with the legacy transport type
> >>>> anymore :-)
> >>>
> >>> I'm ok with that.  I like introducing a unique namespace which is clearly
> >>> different from the previous "transport" one.
> >>
> >> I agree the word "transport" takes things into the weeds.
> >>
> >> But on the topic of naming protocols, I've been wondering, is there
> >> some reason that "IBOE" is being used instead of "RoCE"?
> >
> > Because back in the day, when RoCE was accepted into the kernel, I'm
> > pretty sure it was prior to the IBTA's final stamp of approval and
> > before the name was set on RoCE, so IBoE was chosen upstream as the more
> > "correct" name because it properly denoted what it was deemed to truly
> > be: IB Verbs over Ethernet.
> 
> Well history is all well and good, but it seems weird to not use the
> current, standard name in new code. It confuses me, anyway, because
> it seems like IBOE could easily mean something else.

Having some of it refer to things as IBOE and some as ROCE would be
similarly confusing, and switching existing IBOE usage to ROCE would
cause pain to people with out of tree drivers (Lustre is the main one I
know of).  There's not a good answer here.  There's only less sucky
ones.

> >> Also wondering, why add "UDP" to USNIC, is there a different USNIC?
> >
> > Yes, there are two transports, one a distinct ethertype and one that
> > encapsulates USNIC in UDP.
> 
> But this new enum isn't about transport, it's about protocol. So is
> there one USNIC protocol, with a raw layering and a separate one with
> UDP? Or is it one USNIC protocol with two different framings? Seems
> there should be at least the USNIC protocol, without the _UDP
> decoration, and I don't see it in the enum.

Keep in mind that this enum was Liran's response to Michael's original
patch.  In the enum in Michael's patch, there was both USNIC and
USNIC_UDP.

> >
> >> Naming multiple layers together seems confusing and maybe in the end
> >> will create more code to deal with the differences. For example, what
> >> token will RoCEv2 take? RoCE_UDP, RoCE_v2 or ... ?
> >
> > Uncertain as of now.
> 
> Ok, but it's imminent, right? What's the preference/guidance?

There is a patchset from Devesh Sharma at Emulex.  It added the RoCEv2
capability.  As I recall, it used a new flag added to the existing port
capabilities bitmask and notably did not modify either the node type or
link layer that are currently used to differentiate between the
different protocols.  That's from memory though, so I could be mistaken.

But that patchset was not written with this patchset in mind, and
merging the two may well change that.  In any case, there is a proposed
spec to follow, so for now that's the preference/guidance (unless this
rework means that we need to depart from the spec on internals for
implementation reasons).
Tom Talpey April 28, 2015, 1:49 a.m. UTC | #12
On 4/27/2015 6:24 PM, Doug Ledford wrote:
> On Mon, 2015-04-27 at 17:53 -0700, Tom Talpey wrote:
>> On 4/27/2015 5:36 PM, Doug Ledford wrote:
>>> On Mon, 2015-04-27 at 17:16 -0700, Tom Talpey wrote:
>>>> On 4/27/2015 2:52 PM, ira.weiny wrote:
>>>>> On Mon, Apr 27, 2015 at 09:39:05AM +0200, Michael Wang wrote:
>>>>>> On 04/24/2015 05:12 PM, Liran Liss wrote:
>>>>>>> [snip]
>>>>>>
>>>>>> Like:
>>>>>>
>>>>>> enum rdma_protocol {
>>>>>> 	RDMA_PROTOCOL_IB,
>>>>>> 	RDMA_PROTOCOL_IBOE,
>>>>>> 	RDMA_PROTOCOL_IWARP,
>>>>>> 	RDMA_PROTOCOL_USNIC_UDP
>>>>>> };
>>>>>>
>>>>>> So we could use query_protocol() to ask device provide the protocol
>>>>>> type, and there will be no mixing with the legacy transport type
>>>>>> anymore :-)
>>>>>
>>>>> I'm ok with that.  I like introducing a unique namespace which is clearly
>>>>> different from the previous "transport" one.
>>>>
>>>> I agree the word "transport" takes things into the weeds.
>>>>
>>>> But on the topic of naming protocols, I've been wondering, is there
>>>> some reason that "IBOE" is being used instead of "RoCE"?
>>>
>>> Because back in the day, when RoCE was accepted into the kernel, I'm
>>> pretty sure it was prior to the IBTA's final stamp of approval and
>>> before the name was set on RoCE, so IBoE was chosen upstream as the more
>>> "correct" name because it properly denoted what it was deemed to truly
>>> be: IB Verbs over Ethernet.
>>
>> Well history is all well and good, but it seems weird to not use the
>> current, standard name in new code. It confuses me, anyway, because
>> it seems like IBOE could easily mean something else.
>
> Having some of it refer to things as IBOE and some as ROCE would be
> similarly confusing, and switching existing IBOE usage to ROCE would
> cause pain to people with out of tree drivers (Lustre is the main one I
> know of).  There's not a good answer here.  There's only less sucky
> ones.

Hrm. Well, avoiding churn is good but legacies can wear ya down.
MHO it is worth doing since these are new enums/new patches.


>
>>>> Also wondering, why add "UDP" to USNIC, is there a different USNIC?
>>>
>>> Yes, there are two transports, one a distinct ethertype and one that
>>> encapsulates USNIC in UDP.
>>
>> But this new enum isn't about transport, it's about protocol. So is
>> there one USNIC protocol, with a raw layering and a separate one with
>> UDP? Or is it one USNIC protocol with two different framings? Seems
>> there should be at least the USNIC protocol, without the _UDP
>> decoration, and I don't see it in the enum.
>
> Keep in mind that this enum was Liran's response to Michael's original
> patch.  In the enum in Michael's patch, there was both USNIC and
> USNIC_UDP.

Right! That's why I'm confused. Seems wrong to drop it, right?

>
>>>
>>>> Naming multiple layers together seems confusing and maybe in the end
>>>> will create more code to deal with the differences. For example, what
>>>> token will RoCEv2 take? RoCE_UDP, RoCE_v2 or ... ?
>>>
>>> Uncertain as of now.
>>
>> Ok, but it's imminent, right? What's the preference/guidance?
>
> There is a patchset from Devesh Sharma at Emulex.  It added the RoCEv2
> capability.  As I recall, it used a new flag added to the existing port
> capabilities bitmask and notably did not modify either the node type or
> link layer that are currently used to differentiate between the
> different protocols.  That's from memory though, so I could be mistaken.
>
> But that patchset was not written with this patchset in mind, and
> merging the two may well change that.  In any case, there is a proposed
> spec to follow, so for now that's the preference/guidance (unless this
> rework means that we need to depart from the spec on internals for
> implementation reasons).

Well, if RoCEv2 uses the same protocol enum, that may introduce new
confusion, for example there will be some new CM handling for UDP encap,
source port selection, and of course vlan/tag assignment, etc. But if
there is support under way, and everyone is clear, then, ok.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean April 28, 2015, 6:14 a.m. UTC | #13
PiA+IEtlZXAgaW4gbWluZCB0aGF0IHRoaXMgZW51bSB3YXMgTGlyYW4ncyByZXNwb25zZSB0byBN
aWNoYWVsJ3Mgb3JpZ2luYWwNCj4gPiBwYXRjaC4gIEluIHRoZSBlbnVtIGluIE1pY2hhZWwncyBw
YXRjaCwgdGhlcmUgd2FzIGJvdGggVVNOSUMgYW5kDQo+ID4gVVNOSUNfVURQLg0KPiANCj4gUmln
aHQhIFRoYXQncyB3aHkgSSdtIGNvbmZ1c2VkLiBTZWVtcyB3cm9uZyB0byBkcm9wIGl0LCByaWdo
dD8NCg0KSSB0aGluayB0aGUgb3JpZ2luYWwgVVNOSUMgcHJvdG9jb2wgaXMgbGF5ZXJlZCBkaXJl
Y3RseSBvdmVyIEV0aGVybmV0LiAgVGhlIHByb3RvY29sIGJhc2ljYWxseSBzdG9sZSBhbiBFdGhl
cnR5cGUgKHRoZSBvbmUgdXNlZCBmb3IgSUJvRS9Sb0NFKSBhbmQgaW1wbGVtZW50ZWQgYSBwcm9w
cmlldGFyeSBwcm90b2NvbCBpbnN0ZWFkLiAgSSBoYXZlIG5vIGlkZWEgaG93IHlvdSByZXNvbHZl
IHRoYXQsIGJ1dCBJIGFsc28gZG9uJ3QgdGhpbmsgaXQncyB1c2VkIGFueW1vcmUuICBVU05JQ19V
RFAgaXMganVzdCBVRFAuDQoNCj4gV2VsbCwgaWYgUm9DRXYyIHVzZXMgdGhlIHNhbWUgcHJvdG9j
b2wgZW51bSwgdGhhdCBtYXkgaW50cm9kdWNlIG5ldw0KPiBjb25mdXNpb24sIGZvciBleGFtcGxl
IHRoZXJlIHdpbGwgYmUgc29tZSBuZXcgQ00gaGFuZGxpbmcgZm9yIFVEUCBlbmNhcCwNCj4gc291
cmNlIHBvcnQgc2VsZWN0aW9uLCBhbmQgb2YgY291cnNlIHZsYW4vdGFnIGFzc2lnbm1lbnQsIGV0
Yy4gQnV0IGlmDQo+IHRoZXJlIGlzIHN1cHBvcnQgdW5kZXIgd2F5LCBhbmQgZXZlcnlvbmUgaXMg
Y2xlYXIsIHRoZW4sIG9rLg0KDQpSb0NFdjIvSUJvVURQIHNoYXJlcyB0aGUgc2FtZSBwb3J0IHNw
YWNlIGFzIFVEUC4gIEl0IGhhcyBhIHNpbWlsYXIgaXNzdWVzIGFzIGlXYXJwIGRvZXMgc2hhcmlu
ZyBzdGF0ZSB3aXRoIHRoZSBtYWluIG5ldHdvcmsgc3RhY2suICBJJ20gbm90IGF3YXJlIG9mIGFu
eSBwcm9wb3NhbCBmb3IgcmVzb2x2aW5nIHRoYXQuICBEb2VzIGl0IHJlcXVpcmUgdXNpbmcgYSBz
ZXBhcmF0ZSBJUCBhZGRyZXNzPyAgRG9lcyBpdCB1c2UgYSBwb3J0IG1hcHBlciBmdW5jdGlvbj8g
IERvZXMgbmV0ZGV2IGNhcmUgZm9yIFVEUD8gIEknbSBub3Qgc3VyZSB3aGF0IFVTTklDIGRvZXMg
Zm9yIHRoaXMgZWl0aGVyLCBidXQgYSBjb21tb24gc29sdXRpb24gYmV0d2VlbiBVU05JQyBhbmQg
SUJvVURQIHNlZW1zIHJlYXNvbmFibGUuDQoNCg0K
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Wang April 28, 2015, 2:29 p.m. UTC | #14
On 04/28/2015 03:24 AM, Doug Ledford wrote:
[snip]
>>>> Also wondering, why add "UDP" to USNIC, is there a different USNIC?
>>>
>>> Yes, there are two transports, one a distinct ethertype and one that
>>> encapsulates USNIC in UDP.
>>
>> But this new enum isn't about transport, it's about protocol. So is
>> there one USNIC protocol, with a raw layering and a separate one with
>> UDP? Or is it one USNIC protocol with two different framings? Seems
>> there should be at least the USNIC protocol, without the _UDP
>> decoration, and I don't see it in the enum.
> 
> Keep in mind that this enum was Liran's response to Michael's original
> patch.  In the enum in Michael's patch, there was both USNIC and
> USNIC_UDP.

Yeah, I've not enum PROTOCOL_USNIC since currently there is no place
need it...

The only three cases currently are:
1. trasnport IB, link layer IB		//PROTOCOL_IB
2. transport IB, link layer ETH		//PROTOCOL_IBOE
3. transport IWARP			//PROTOCOL_IWARP

Regards,
Michael Wang

> 
>>>
>>>> Naming multiple layers together seems confusing and maybe in the end
>>>> will create more code to deal with the differences. For example, what
>>>> token will RoCEv2 take? RoCE_UDP, RoCE_v2 or ... ?
>>>
>>> Uncertain as of now.
>>
>> Ok, but it's imminent, right? What's the preference/guidance?
> 
> There is a patchset from Devesh Sharma at Emulex.  It added the RoCEv2
> capability.  As I recall, it used a new flag added to the existing port
> capabilities bitmask and notably did not modify either the node type or
> link layer that are currently used to differentiate between the
> different protocols.  That's from memory though, so I could be mistaken.
> 
> But that patchset was not written with this patchset in mind, and
> merging the two may well change that.  In any case, there is a proposed
> spec to follow, so for now that's the preference/guidance (unless this
> rework means that we need to depart from the spec on internals for
> implementation reasons).
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe April 28, 2015, 6:56 p.m. UTC | #15
On Mon, Apr 27, 2015 at 09:24:35PM -0400, Doug Ledford wrote:
> On Mon, 2015-04-27 at 17:53 -0700, Tom Talpey wrote:

> Having some of it refer to things as IBOE and some as ROCE would be
> similarly confusing, and switching existing IBOE usage to ROCE would
> cause pain to people with out of tree drivers (Lustre is the main one I
> know of).  There's not a good answer here.  There's only less sucky
> ones.

The tide has already turned, we should ditch iboe:

$git grep -i roce_ drivers/infiniband/ | wc -l
91
$git grep -i iboe_ drivers/infiniband/ | wc -l
37

It isn't really mainline's role to be too concerned about out of tree
things like Lustre.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz April 28, 2015, 7:11 p.m. UTC | #16
On Tue, Apr 28, 2015 at 9:56 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Apr 27, 2015 at 09:24:35PM -0400, Doug Ledford wrote:
>> On Mon, 2015-04-27 at 17:53 -0700, Tom Talpey wrote:
>
>> Having some of it refer to things as IBOE and some as ROCE would be
>> similarly confusing, and switching existing IBOE usage to ROCE would
>> cause pain to people with out of tree drivers (Lustre is the main one I
>> know of).  There's not a good answer here.  There's only less sucky
>> ones.
>
> The tide has already turned, we should ditch iboe:
>
> $git grep -i roce_ drivers/infiniband/ | wc -l
> 91
> $git grep -i iboe_ drivers/infiniband/ | wc -l
> 37
>
> It isn't really mainline's role to be too concerned about out of tree
> things like Lustre.

FWIW, note that Lustre is under staging for a while, not sure how
close they are for actual acceptance.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Goodell (dgoodell) April 28, 2015, 7:50 p.m. UTC | #17
On Apr 28, 2015, at 1:14 AM, Hefty, Sean <sean.hefty@intel.com> wrote:

>>> Keep in mind that this enum was Liran's response to Michael's original
>>> patch.  In the enum in Michael's patch, there was both USNIC and
>>> USNIC_UDP.
>> 
>> Right! That's why I'm confused. Seems wrong to drop it, right?
> 
> I think the original USNIC protocol is layered directly over Ethernet.  The protocol basically stole an Ethertype (the one used for IBoE/RoCE) and implemented a proprietary protocol instead.  I have no idea how you resolve that, but I also don't think it's used anymore.  USNIC_UDP is just UDP.

Sean is correct. The legacy RDMA_TRANSPORT_USNIC code used a proprietary protocol over plain Ethernet frames.  The newer RDMA_TRANSPORT_USNIC_UDP code is just standard UDP/IP/Ethernet packets exposed to user space via the uverbs stack.  The current kernel module will support both formats, it just depends on which user space requests at create_qp time.  From the kernel point of view there is no common protocol between the two TRANSPORTs (other than sharing partially similar Ethernet frames at L2).

I posted last week to clarify some of this: http://marc.info/?l=linux-rdma&m=142972177830718&w=2

>> Well, if RoCEv2 uses the same protocol enum, that may introduce new
>> confusion, for example there will be some new CM handling for UDP encap,
>> source port selection, and of course vlan/tag assignment, etc. But if
>> there is support under way, and everyone is clear, then, ok.
> 
> RoCEv2/IBoUDP shares the same port space as UDP.  It has a similar issues as iWarp does sharing state with the main network stack.  I'm not aware of any proposal for resolving that.  Does it require using a separate IP address?  Does it use a port mapper function?  Does netdev care for UDP?  I'm not sure what USNIC does for this either, but a common solution between USNIC and IBoUDP seems reasonable.

Is the concern here about CM issues or the UDP ports used by the actual usNIC RQs?  CM is not used/supported for usNIC at this time.

-Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean April 28, 2015, 7:53 p.m. UTC | #18
> Is the concern here about CM issues or the UDP ports used by the actual
> usNIC RQs? 

UDP port space sharing
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford April 28, 2015, 8:02 p.m. UTC | #19
On Tue, 2015-04-28 at 22:11 +0300, Or Gerlitz wrote:
> On Tue, Apr 28, 2015 at 9:56 PM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > On Mon, Apr 27, 2015 at 09:24:35PM -0400, Doug Ledford wrote:
> >> On Mon, 2015-04-27 at 17:53 -0700, Tom Talpey wrote:
> >
> >> Having some of it refer to things as IBOE and some as ROCE would be
> >> similarly confusing, and switching existing IBOE usage to ROCE would
> >> cause pain to people with out of tree drivers (Lustre is the main one I
> >> know of).  There's not a good answer here.  There's only less sucky
> >> ones.
> >
> > The tide has already turned, we should ditch iboe:
> >
> > $git grep -i roce_ drivers/infiniband/ | wc -l
> > 91
> > $git grep -i iboe_ drivers/infiniband/ | wc -l
> > 37
> >
> > It isn't really mainline's role to be too concerned about out of tree
> > things like Lustre.
> 
> FWIW, note that Lustre is under staging for a while, not sure how
> close they are for actual acceptance.

I thought that was just the client and didn't include the server...
Doug Ledford April 28, 2015, 8:03 p.m. UTC | #20
On Tue, 2015-04-28 at 12:56 -0600, Jason Gunthorpe wrote:
> On Mon, Apr 27, 2015 at 09:24:35PM -0400, Doug Ledford wrote:
> > On Mon, 2015-04-27 at 17:53 -0700, Tom Talpey wrote:
> 
> > Having some of it refer to things as IBOE and some as ROCE would be
> > similarly confusing, and switching existing IBOE usage to ROCE would
> > cause pain to people with out of tree drivers (Lustre is the main one I
> > know of).  There's not a good answer here.  There's only less sucky
> > ones.
> 
> The tide has already turned, we should ditch iboe:
> 
> $git grep -i roce_ drivers/infiniband/ | wc -l
> 91
> $git grep -i iboe_ drivers/infiniband/ | wc -l
> 37
> 
> It isn't really mainline's role to be too concerned about out of tree
> things like Lustre.

While I generally agree, one need not be totally callous about out of
tree things either.
Dave Goodell (dgoodell) April 28, 2015, 8:17 p.m. UTC | #21
On Apr 28, 2015, at 2:53 PM, Hefty, Sean <sean.hefty@intel.com> wrote:

>> Is the concern here about CM issues or the UDP ports used by the actual
>> usNIC RQs? 
> 
> UDP port space sharing

For the UDP port used by the usNIC QP, the usnic_verbs kernel driver requires user space to pass a file descriptor of a regular UDP socket down at create_qp time.  The reference count on this socket is incremented to make sure that the socket can't disappear out from under us.  Then an RX filter is installed in the NIC which matches UDP/IP/Ethernet packets that are destined for the UDP port to which the given socket is already bound.  So there is a real UDP socket to make most of the usual things happen in the net stack, but the raw UDP/IP/Ethernet packets get delivered directly to the user space queues by the NIC.  E.g., "netstat" and "lsof" show you proper addressing information, though obviously any information related to data-path statistics will not be accurate.  At teardown we just reverse the steps.

However, I'm not sure if that's the sort of information you were looking for.

-Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean April 28, 2015, 8:25 p.m. UTC | #22
> For the UDP port used by the usNIC QP, the usnic_verbs kernel driver
> requires user space to pass a file descriptor of a regular UDP socket down
> at create_qp time.  The reference count on this socket is incremented to
> make sure that the socket can't disappear out from under us.  Then an RX
> filter is installed in the NIC which matches UDP/IP/Ethernet packets that
> are destined for the UDP port to which the given socket is already bound.
> So there is a real UDP socket to make most of the usual things happen in
> the net stack, but the raw UDP/IP/Ethernet packets get delivered directly
> to the user space queues by the NIC.  E.g., "netstat" and "lsof" show you
> proper addressing information, though obviously any information related to
> data-path statistics will not be accurate.  At teardown we just reverse
> the steps.
> 
> However, I'm not sure if that's the sort of information you were looking
> for.

This is more part of the RoCEv2 discussion than this thread.  But, yes, this is what I was looking for.  Conceptually, this is loosely similar to the port mapper functionality in iWarp, with a direct port mapping. Thanks.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liran Liss April 29, 2015, 8:49 a.m. UTC | #23
> From: Hefty, Sean [mailto:sean.hefty@intel.com]
> 
> > For the UDP port used by the usNIC QP, the usnic_verbs kernel driver
> > requires user space to pass a file descriptor of a regular UDP socket
> > down at create_qp time.  The reference count on this socket is
> > incremented to make sure that the socket can't disappear out from
> > under us.  Then an RX filter is installed in the NIC which matches
> > UDP/IP/Ethernet packets that are destined for the UDP port to which the
> given socket is already bound.
> > So there is a real UDP socket to make most of the usual things happen
> > in the net stack, but the raw UDP/IP/Ethernet packets get delivered
> > directly to the user space queues by the NIC.  E.g., "netstat" and
> > "lsof" show you proper addressing information, though obviously any
> > information related to data-path statistics will not be accurate.  At
> > teardown we just reverse the steps.
> >
> > However, I'm not sure if that's the sort of information you were
> > looking for.
> 
> This is more part of the RoCEv2 discussion than this thread.  But, yes, this is
> what I was looking for.  Conceptually, this is loosely similar to the port
> mapper functionality in iWarp, with a direct port mapping. Thanks.
> 

I would like to point out what seems to be a misconception regarding RoCEv2.
RoCEv2 is *not* like iWARP and USNIC in the sense of using the TCP/UDP transport, and does *not* use any kind of port-mapping.

RoCEv2 endpoints are IBTA QP numbers; the protocol uses UDP only for tunneling and *not* as a transport --  think VXLAN...
All RoCEv2 packets have the same dport (but possibly different dqpns).
It uses a well-known UDP port, assigned by IANA (port 4791).

So, in summary, I think that:
- The transport should refer to the transport semantics of the technology.
-- This could be determined by an organization or a vendor, for example:
--- IB, RoCEv1, and RoCEv2 use IBTA transport
--- iWARP would use the IETF iWARP transport
--- USNIC would use the USENIC transport (assuming that the semantics don't change for different wire protocols)

- The protocol should describe a unique, interoperable, wire-protocol stack, optionally crossing different layers.
-- For example:
--- RoCE ports would advertise 2 protocols:
---- RoCEv1
---- RoCEv2
--- IB ports would advertise the Infiniband protocol
--- iWARP RNIC ports would advertise the iWARP protocol stack
--- USNIC ports would advertise 2 protocols:
---- USENICv1 (Ethertype)
---- USENICv2 (UDP) protocols

- Choosing the protocol at the API level is another matter, and depends on the technology, e.g.,
-- RoCE selects the protocol (V1 or V2) according to the SGID index (according to the specification, each GID entry is associated with a "GID type", that determines the protocol)
-- USENIC does it using a modify QP attribute (if I understood correctly)

- As for IBOE-->ROCE name changes:
-- We should do it in a single patch that modifies all the current kernel symbols
-- I wouldn't worry about out-of-tree drivers, they will adjust as they always do:
--- From my experience, you usually have to do something in your backports when a new kernel is released anyway
--- Substituting symbol names is not a complicated task...

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 18c1ece..a9587c4 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -76,6 +76,7 @@  static int ib_device_check_mandatory(struct ib_device *device)
 	} mandatory_table[] = {
 		IB_MANDATORY_FUNC(query_device),
 		IB_MANDATORY_FUNC(query_port),
+		IB_MANDATORY_FUNC(query_transport),
 		IB_MANDATORY_FUNC(query_pkey),
 		IB_MANDATORY_FUNC(query_gid),
 		IB_MANDATORY_FUNC(alloc_pd),
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index f93eb8d..626c9cf 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -133,14 +133,16 @@  enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_
 	if (device->get_link_layer)
 		return device->get_link_layer(device, port_num);
 
-	switch (rdma_node_get_transport(device->node_type)) {
+	switch (device->query_transport(device, port_num)) {
 	case RDMA_TRANSPORT_IB:
 		return IB_LINK_LAYER_INFINIBAND;
+	case RDMA_TRANSPORT_IBOE:
 	case RDMA_TRANSPORT_IWARP:
 	case RDMA_TRANSPORT_USNIC:
 	case RDMA_TRANSPORT_USNIC_UDP:
 		return IB_LINK_LAYER_ETHERNET;
 	default:
+		BUG();
 		return IB_LINK_LAYER_UNSPECIFIED;
 	}
 }
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index bdf3507..d46bbb0 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -99,6 +99,12 @@  static int c2_query_port(struct ib_device *ibdev,
 	return 0;
 }
 
+static enum rdma_transport_type
+c2_query_transport(struct ib_device *device, u8 port_num)
+{
+	return RDMA_TRANSPORT_IWARP;
+}
+
 static int c2_query_pkey(struct ib_device *ibdev,
 			 u8 port, u16 index, u16 * pkey)
 {
@@ -801,6 +807,7 @@  int c2_register_device(struct c2_dev *dev)
 	dev->ibdev.dma_device = &dev->pcidev->dev;
 	dev->ibdev.query_device = c2_query_device;
 	dev->ibdev.query_port = c2_query_port;
+	dev->ibdev.query_transport = c2_query_transport;
 	dev->ibdev.query_pkey = c2_query_pkey;
 	dev->ibdev.query_gid = c2_query_gid;
 	dev->ibdev.alloc_ucontext = c2_alloc_ucontext;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 811b24a..09682e9e 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -1232,6 +1232,12 @@  static int iwch_query_port(struct ib_device *ibdev,
 	return 0;
 }
 
+static enum rdma_transport_type
+iwch_query_transport(struct ib_device *device, u8 port_num)
+{
+	return RDMA_TRANSPORT_IWARP;
+}
+
 static ssize_t show_rev(struct device *dev, struct device_attribute *attr,
 			char *buf)
 {
@@ -1385,6 +1391,7 @@  int iwch_register_device(struct iwch_dev *dev)
 	dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev);
 	dev->ibdev.query_device = iwch_query_device;
 	dev->ibdev.query_port = iwch_query_port;
+	dev->ibdev.query_transport = iwch_query_transport;
 	dev->ibdev.query_pkey = iwch_query_pkey;
 	dev->ibdev.query_gid = iwch_query_gid;
 	dev->ibdev.alloc_ucontext = iwch_alloc_ucontext;
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index 66bd6a2..a445e0d 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -390,6 +390,12 @@  static int c4iw_query_port(struct ib_device *ibdev, u8 port,
 	return 0;
 }
 
+static enum rdma_transport_type
+c4iw_query_transport(struct ib_device *device, u8 port_num)
+{
+	return RDMA_TRANSPORT_IWARP;
+}
+
 static ssize_t show_rev(struct device *dev, struct device_attribute *attr,
 			char *buf)
 {
@@ -506,6 +512,7 @@  int c4iw_register_device(struct c4iw_dev *dev)
 	dev->ibdev.dma_device = &(dev->rdev.lldi.pdev->dev);
 	dev->ibdev.query_device = c4iw_query_device;
 	dev->ibdev.query_port = c4iw_query_port;
+	dev->ibdev.query_transport = c4iw_query_transport;
 	dev->ibdev.query_pkey = c4iw_query_pkey;
 	dev->ibdev.query_gid = c4iw_query_gid;
 	dev->ibdev.alloc_ucontext = c4iw_alloc_ucontext;
diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
index 9ed4d25..d5a34a6 100644
--- a/drivers/infiniband/hw/ehca/ehca_hca.c
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -242,6 +242,12 @@  query_port1:
 	return ret;
 }
 
+enum rdma_transport_type
+ehca_query_transport(struct ib_device *device, u8 port_num)
+{
+	return RDMA_TRANSPORT_IB;
+}
+
 int ehca_query_sma_attr(struct ehca_shca *shca,
 			u8 port, struct ehca_sma_attr *attr)
 {
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 22f79af..cec945f 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -49,6 +49,9 @@  int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props);
 int ehca_query_port(struct ib_device *ibdev, u8 port,
 		    struct ib_port_attr *props);
 
+enum rdma_transport_type
+ehca_query_transport(struct ib_device *device, u8 port_num);
+
 int ehca_query_sma_attr(struct ehca_shca *shca, u8 port,
 			struct ehca_sma_attr *attr);
 
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index cd8d290..60e0a09 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -467,6 +467,7 @@  static int ehca_init_device(struct ehca_shca *shca)
 	shca->ib_device.dma_device          = &shca->ofdev->dev;
 	shca->ib_device.query_device        = ehca_query_device;
 	shca->ib_device.query_port          = ehca_query_port;
+	shca->ib_device.query_transport     = ehca_query_transport;
 	shca->ib_device.query_gid           = ehca_query_gid;
 	shca->ib_device.query_pkey          = ehca_query_pkey;
 	/* shca->in_device.modify_device    = ehca_modify_device    */
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
index 44ea939..58d36e3 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -1638,6 +1638,12 @@  static int ipath_query_port(struct ib_device *ibdev,
 	return 0;
 }
 
+static enum rdma_transport_type
+ipath_query_transport(struct ib_device *device, u8 port_num)
+{
+	return RDMA_TRANSPORT_IB;
+}
+
 static int ipath_modify_device(struct ib_device *device,
 			       int device_modify_mask,
 			       struct ib_device_modify *device_modify)
@@ -2140,6 +2146,7 @@  int ipath_register_ib_device(struct ipath_devdata *dd)
 	dev->query_device = ipath_query_device;
 	dev->modify_device = ipath_modify_device;
 	dev->query_port = ipath_query_port;
+	dev->query_transport = ipath_query_transport;
 	dev->modify_port = ipath_modify_port;
 	dev->query_pkey = ipath_query_pkey;
 	dev->query_gid = ipath_query_gid;
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 57070c5..1e13cf9 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -420,6 +420,15 @@  static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port,
 	return __mlx4_ib_query_port(ibdev, port, props, 0);
 }
 
+static enum rdma_transport_type
+mlx4_ib_query_transport(struct ib_device *device, u8 port_num)
+{
+	struct mlx4_dev *dev = to_mdev(device)->dev;
+
+	return dev->caps.port_mask[port_num] == MLX4_PORT_TYPE_IB ?
+		RDMA_TRANSPORT_IB : RDMA_TRANSPORT_IBOE;
+}
+
 int __mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
 			union ib_gid *gid, int netw_view)
 {
@@ -2202,6 +2211,7 @@  static void *mlx4_ib_add(struct mlx4_dev *dev)
 
 	ibdev->ib_dev.query_device	= mlx4_ib_query_device;
 	ibdev->ib_dev.query_port	= mlx4_ib_query_port;
+	ibdev->ib_dev.query_transport	= mlx4_ib_query_transport;
 	ibdev->ib_dev.get_link_layer	= mlx4_ib_port_link_layer;
 	ibdev->ib_dev.query_gid		= mlx4_ib_query_gid;
 	ibdev->ib_dev.query_pkey	= mlx4_ib_query_pkey;
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 57c9809..b6f2f58 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -262,6 +262,12 @@  out:
 	return err;
 }
 
+static enum rdma_transport_type
+mlx5_ib_query_transport(struct ib_device *device, u8 port_num)
+{
+	return RDMA_TRANSPORT_IB;
+}
+
 static int mlx5_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
 			     union ib_gid *gid)
 {
@@ -1244,6 +1250,7 @@  static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 
 	dev->ib_dev.query_device	= mlx5_ib_query_device;
 	dev->ib_dev.query_port		= mlx5_ib_query_port;
+	dev->ib_dev.query_transport	= mlx5_ib_query_transport;
 	dev->ib_dev.query_gid		= mlx5_ib_query_gid;
 	dev->ib_dev.query_pkey		= mlx5_ib_query_pkey;
 	dev->ib_dev.modify_device	= mlx5_ib_modify_device;
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 415f8e1..67ac6a4 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -179,6 +179,12 @@  static int mthca_query_port(struct ib_device *ibdev,
 	return err;
 }
 
+static enum rdma_transport_type
+mthca_query_transport(struct ib_device *device, u8 port_num)
+{
+	return RDMA_TRANSPORT_IB;
+}
+
 static int mthca_modify_device(struct ib_device *ibdev,
 			       int mask,
 			       struct ib_device_modify *props)
@@ -1281,6 +1287,7 @@  int mthca_register_device(struct mthca_dev *dev)
 	dev->ib_dev.dma_device           = &dev->pdev->dev;
 	dev->ib_dev.query_device         = mthca_query_device;
 	dev->ib_dev.query_port           = mthca_query_port;
+	dev->ib_dev.query_transport      = mthca_query_transport;
 	dev->ib_dev.modify_device        = mthca_modify_device;
 	dev->ib_dev.modify_port          = mthca_modify_port;
 	dev->ib_dev.query_pkey           = mthca_query_pkey;
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index c0d0296..8df5b61 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -606,6 +606,11 @@  static int nes_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr
 	return 0;
 }
 
+static enum rdma_transport_type
+nes_query_transport(struct ib_device *device, u8 port_num)
+{
+	return RDMA_TRANSPORT_IWARP;
+}
 
 /**
  * nes_query_pkey
@@ -3879,6 +3884,7 @@  struct nes_ib_device *nes_init_ofa_device(struct net_device *netdev)
 	nesibdev->ibdev.dev.parent = &nesdev->pcidev->dev;
 	nesibdev->ibdev.query_device = nes_query_device;
 	nesibdev->ibdev.query_port = nes_query_port;
+	nesibdev->ibdev.query_transport = nes_query_transport;
 	nesibdev->ibdev.query_pkey = nes_query_pkey;
 	nesibdev->ibdev.query_gid = nes_query_gid;
 	nesibdev->ibdev.alloc_ucontext = nes_alloc_ucontext;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index 7a2b59a..9f4d182 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -244,6 +244,7 @@  static int ocrdma_register_device(struct ocrdma_dev *dev)
 	/* mandatory verbs. */
 	dev->ibdev.query_device = ocrdma_query_device;
 	dev->ibdev.query_port = ocrdma_query_port;
+	dev->ibdev.query_transport = ocrdma_query_transport;
 	dev->ibdev.modify_port = ocrdma_modify_port;
 	dev->ibdev.query_gid = ocrdma_query_gid;
 	dev->ibdev.get_link_layer = ocrdma_link_layer;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 8771755..73bace4 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -187,6 +187,12 @@  int ocrdma_query_port(struct ib_device *ibdev,
 	return 0;
 }
 
+enum rdma_transport_type
+ocrdma_query_transport(struct ib_device *device, u8 port_num)
+{
+	return RDMA_TRANSPORT_IBOE;
+}
+
 int ocrdma_modify_port(struct ib_device *ibdev, u8 port, int mask,
 		       struct ib_port_modify *props)
 {
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
index b8f7853..4a81b63 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
@@ -41,6 +41,9 @@  int ocrdma_query_port(struct ib_device *, u8 port, struct ib_port_attr *props);
 int ocrdma_modify_port(struct ib_device *, u8 port, int mask,
 		       struct ib_port_modify *props);
 
+enum rdma_transport_type
+ocrdma_query_transport(struct ib_device *device, u8 port_num);
+
 void ocrdma_get_guid(struct ocrdma_dev *, u8 *guid);
 int ocrdma_query_gid(struct ib_device *, u8 port,
 		     int index, union ib_gid *gid);
diff --git a/drivers/infiniband/hw/qib/qib_verbs.c b/drivers/infiniband/hw/qib/qib_verbs.c
index 4a35998..caad665 100644
--- a/drivers/infiniband/hw/qib/qib_verbs.c
+++ b/drivers/infiniband/hw/qib/qib_verbs.c
@@ -1650,6 +1650,12 @@  static int qib_query_port(struct ib_device *ibdev, u8 port,
 	return 0;
 }
 
+static enum rdma_transport_type
+qib_query_transport(struct ib_device *device, u8 port_num)
+{
+	return RDMA_TRANSPORT_IB;
+}
+
 static int qib_modify_device(struct ib_device *device,
 			     int device_modify_mask,
 			     struct ib_device_modify *device_modify)
@@ -2184,6 +2190,7 @@  int qib_register_ib_device(struct qib_devdata *dd)
 	ibdev->query_device = qib_query_device;
 	ibdev->modify_device = qib_modify_device;
 	ibdev->query_port = qib_query_port;
+	ibdev->query_transport = qib_query_transport;
 	ibdev->modify_port = qib_modify_port;
 	ibdev->query_pkey = qib_query_pkey;
 	ibdev->query_gid = qib_query_gid;
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_main.c b/drivers/infiniband/hw/usnic/usnic_ib_main.c
index 0d0f986..03ea9f3 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_main.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_main.c
@@ -360,6 +360,7 @@  static void *usnic_ib_device_add(struct pci_dev *dev)
 
 	us_ibdev->ib_dev.query_device = usnic_ib_query_device;
 	us_ibdev->ib_dev.query_port = usnic_ib_query_port;
+	us_ibdev->ib_dev.query_transport = usnic_ib_query_transport;
 	us_ibdev->ib_dev.query_pkey = usnic_ib_query_pkey;
 	us_ibdev->ib_dev.query_gid = usnic_ib_query_gid;
 	us_ibdev->ib_dev.get_link_layer = usnic_ib_port_link_layer;
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
index 53bd6a2..ff9a5f7 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
@@ -348,6 +348,12 @@  int usnic_ib_query_port(struct ib_device *ibdev, u8 port,
 	return 0;
 }
 
+enum rdma_transport_type
+usnic_ib_query_transport(struct ib_device *device, u8 port_num)
+{
+	return RDMA_TRANSPORT_USNIC_UDP;
+}
+
 int usnic_ib_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
 				int qp_attr_mask,
 				struct ib_qp_init_attr *qp_init_attr)
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.h b/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
index bb864f5..0b1633b 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
@@ -27,6 +27,8 @@  int usnic_ib_query_device(struct ib_device *ibdev,
 				struct ib_device_attr *props);
 int usnic_ib_query_port(struct ib_device *ibdev, u8 port,
 				struct ib_port_attr *props);
+enum rdma_transport_type
+usnic_ib_query_transport(struct ib_device *device, u8 port_num);
 int usnic_ib_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
 				int qp_attr_mask,
 				struct ib_qp_init_attr *qp_init_attr);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 65994a1..d54f91e 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -75,10 +75,13 @@  enum rdma_node_type {
 };
 
 enum rdma_transport_type {
+	/* legacy for users */
 	RDMA_TRANSPORT_IB,
 	RDMA_TRANSPORT_IWARP,
 	RDMA_TRANSPORT_USNIC,
-	RDMA_TRANSPORT_USNIC_UDP
+	RDMA_TRANSPORT_USNIC_UDP,
+	/* new transport */
+	RDMA_TRANSPORT_IBOE,
 };
 
 __attribute_const__ enum rdma_transport_type
@@ -1501,6 +1504,8 @@  struct ib_device {
 	int		           (*query_port)(struct ib_device *device,
 						 u8 port_num,
 						 struct ib_port_attr *port_attr);
+	enum rdma_transport_type   (*query_transport)(struct ib_device *device,
+						      u8 port_num);
 	enum rdma_link_layer	   (*get_link_layer)(struct ib_device *device,
 						     u8 port_num);
 	int		           (*query_gid)(struct ib_device *device,