
[for-next,V2,05/11] IB/core: Add rdma_network_type to wc

Message ID 1449150450-13679-6-git-send-email-matanb@mellanox.com (mailing list archive)
State Not Applicable

Commit Message

Matan Barak Dec. 3, 2015, 1:47 p.m. UTC
From: Somnath Kotur <Somnath.Kotur@Avagotech.Com>

Providers should report the wc's network type to the IB core. This
is used to search for the proper GID in the GID table. When using
HCAs that can't provide this info, the IB core falls back to a deep
examination of the packet and extracts the GID type by itself.

In RDMA-CM, we choose the sgid_index and type from all the matching
entries based on a hint from the IP stack, and we set the hop_limit
for the IP packet based on the same hint.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Somnath Kotur <Somnath.Kotur@Avagotech.Com>
---
 drivers/infiniband/core/addr.c  |  14 +++++
 drivers/infiniband/core/cma.c   |  11 +++-
 drivers/infiniband/core/verbs.c | 123 ++++++++++++++++++++++++++++++++++++++--
 include/rdma/ib_addr.h          |   1 +
 include/rdma/ib_verbs.h         |  44 ++++++++++++++
 5 files changed, 187 insertions(+), 6 deletions(-)

Comments

Christoph Hellwig Dec. 3, 2015, 2:05 p.m. UTC | #1
Bloating the WC with a field that's not really useful for the ULPs
seems pretty sad..
Matan Barak Dec. 3, 2015, 4:19 p.m. UTC | #2
On Thu, Dec 3, 2015 at 4:05 PM, Christoph Hellwig <hch@infradead.org> wrote:
> Bloating the WC with a field that's not really useful for the ULPs
> seems pretty sad..

The network header type is mandatory in order to find the GID type and
extract the GIDs correctly from the header.
I realize ULPs might have preferred to get the GID itself, but
resolving the GID costs time, and most of the time you don't really
need it when you poll a CQ.
This could be refactored later to use wc_flags instead of a new field
when we approach the cache-line limit.
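As an illustration only (everything below uses names this patch adds,
nothing is a new API), a poller that does care about the type would do
something like:

	/* Sketch only: if the provider reported the type, mapping it to a
	 * GID type is a trivial lookup via the helper this patch adds;
	 * without the flag the core has to parse the scattered header
	 * instead (see ib_init_ah_from_wc() in the verbs.c hunk).
	 */
	enum ib_gid_type gid_type = IB_GID_TYPE_IB;

	if (wc->wc_flags & IB_WC_WITH_NETWORK_HDR_TYPE)
		gid_type = ib_network_to_gid_type(wc->network_hdr_type);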

Liran Liss Dec. 3, 2015, 4:20 p.m. UTC | #3
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-

> Subject: Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to
> wc
> 
> Bloating the WC with a field that's not really useful for the ULPs seems pretty
> sad..

You need a per-packet (read: per-WC) network type both for handling incoming connections over QP1 and in UD QPs.
It looks like this patch doesn't extend the structure size, due to alignment, so there is no real harm in any case...

Alternatively, we can consider specifying the network type as a completion flag instead.
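(Rough sketch of the layout argument, going by the struct ib_wc hunk
below; whether the extra u8 really lands in existing tail padding is of
course compiler/arch dependent and worth double-checking, e.g. with
pahole:)

	/* Hypothetical fragment for illustration -- the tail of struct
	 * ib_wc as it looks after this patch.  The existing odd-sized tail
	 * leaves padding that the new byte can typically be absorbed into,
	 * so sizeof(struct ib_wc) need not grow.
	 */
	struct ib_wc_tail_sketch {
		u8	port_num;	/* valid only for DR SMPs on switches */
		u8	smac[ETH_ALEN];
		u16	vlan_id;
		u8	network_hdr_type;	/* the field this patch adds */
	};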
--Liran

Jason Gunthorpe Dec. 3, 2015, 4:36 p.m. UTC | #4
On Thu, Dec 03, 2015 at 03:47:12PM +0200, Matan Barak wrote:
> From: Somnath Kotur <Somnath.Kotur@Avagotech.Com>
> 
> Providers should tell IB core the wc's network type.
> This is used in order to search for the proper GID in the
> GID table. When using HCAs that can't provide this info,
> IB core tries to deep examine the packet and extract
> the GID type by itself.

Eh? A wc has a sgid_index, and in this brave new world a gid has the
network type. Why do we need to specify it again?

>  	memset(ah_attr, 0, sizeof *ah_attr);
>  	if (rdma_cap_eth_ah(device, port_num)) {
> +		if (wc->wc_flags & IB_WC_WITH_NETWORK_HDR_TYPE)
> +			net_type = wc->network_hdr_type;
> +		else
> +			net_type = ib_get_net_type_by_grh(device, port_num, grh);
> +		gid_type = ib_network_to_gid_type(net_type);

Like here for instance.

... and I keep saying this is all wrong: once you get into IP land
this entire process needs a route/neighbour lookup.


> -			ret = rdma_addr_find_dmac_by_grh(&grh->dgid, &grh->sgid,
> +			ret = rdma_addr_find_dmac_by_grh(&dgid, &sgid,
>  							 ah_attr->dmac,
>  							 wc->wc_flags & IB_WC_WITH_VLAN ?
>  							 NULL : &vlan_id,

ie no to this.

> +			if (sgid_attr.gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP)
> +				/* TODO: get the hoplimit from the inet/inet6
> +				 * device
> +				 */

And no again, please fix this and all other missing route lookups
before sending another version.

> +	struct {
> +		/* The IB spec states that if it's IPv4, the header

rocev2 spec, surely

Jason
Christoph Hellwig Dec. 6, 2015, 2:03 p.m. UTC | #5
On Thu, Dec 03, 2015 at 04:20:50PM +0000, Liran Liss wrote:
> > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> 
> > Subject: Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to
> > wc
> > 
> > Bloating the WC with a field that's not really useful for the ULPs seems pretty
> > sad..
> 
> You need per packet (read per-WC) network type both for handling incoming connections over QP1 and in UD QPs.
> It looks like this patch doesn't extend the structure size due to alignment, so no real harm in any case...

The ULPs couldn't really care less about the network type.  The problem
is that one monolithic structure is used both for communication with
the protocol drivers and the RoCE implementation.
Moni Shoua Dec. 6, 2015, 2:20 p.m. UTC | #6
On Sun, Dec 6, 2015 at 4:03 PM, Christoph Hellwig <hch@infradead.org> wrote:
> The ULPs couldn't really care less about the network type.  The problem

In the WC, ULPs care about the network type as much as they care about
dgid or port_num. So network_type is indeed information that is
relevant only to RoCE, but it just follows that tradition.
Jason Gunthorpe Dec. 7, 2015, 6:02 a.m. UTC | #7
On Thu, Dec 03, 2015 at 04:20:50PM +0000, Liran Liss wrote:
> > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> 
> > Subject: Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to
> > wc
> > 
> > Bloating the WC with a field that's not really useful for the ULPs seems pretty
> > sad..
> 
> You need per packet (read per-WC) network type both for handling incoming connections over QP1 and in UD QPs.
> It looks like this patch doesn't extend the structure size due to alignment, so no real harm in any case...

Why? The sgid index must tell you the network type.

Jason
Moni Shoua Dec. 7, 2015, 6:15 a.m. UTC | #8
On Mon, Dec 7, 2015 at 8:02 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> Why? The sgid index must tell you the network type.

struct ib_wc doesn't have an sgid or sgid_index field.
What you do have is the sgid taken from the GRH that is scattered
into the first 40/20 bytes of the receive WQE.  This is not enough to
determine the network type.
Parav Pandit Dec. 7, 2015, 6:21 a.m. UTC | #9
On Mon, Dec 7, 2015 at 11:32 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Thu, Dec 03, 2015 at 04:20:50PM +0000, Liran Liss wrote:
>> > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
>>
>> > Subject: Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to
>> > wc
>> >
>> > Bloating the WC with a field that's not really useful for the ULPs seems pretty
>> > sad..
>>
>> You need per packet (read per-WC) network type both for handling incoming connections over QP1 and in UD QPs.
>> It looks like this patch doesn't extend the structure size due to alignment, so no real harm in any case...
>
> Why? The sgid index must tell you the network type.
>

User space might not have access to the internal properties of a GID
entry the way the kernel does. I'm not sure if more plumbing needs to
be added in user space to get such a property and asynchronous updates
of it. So Liran might be thinking of a unified way to report the
required fields via the wc?
Jason Gunthorpe Dec. 7, 2015, 6:34 a.m. UTC | #10
On Mon, Dec 07, 2015 at 08:15:40AM +0200, Moni Shoua wrote:

> What you have though is the sgid taken from the GRH that is scattered
> to the first 40/20 bytes of the receive WQE.  This is not enough to
> determine the network type.

It is enough to discover the sgid index which will tell you the type.

Jason
Moni Shoua Dec. 7, 2015, 6:37 a.m. UTC | #11
On Mon, Dec 7, 2015 at 8:34 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Dec 07, 2015 at 08:15:40AM +0200, Moni Shoua wrote:
>
>> What you have though is the sgid taken from the GRH that is scattered
>> to the first 40/20 bytes of the receive WQE.  This is not enough to
>> determine the network type.
>
> It is enough to discover the sgid index which will tell you the type.
But how? All you have in hand is the sgid, which can appear several
times in the GID table at different indices.
Jason Gunthorpe Dec. 7, 2015, 5:12 p.m. UTC | #12
On Mon, Dec 07, 2015 at 08:37:41AM +0200, Moni Shoua wrote:
> On Mon, Dec 7, 2015 at 8:34 AM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > On Mon, Dec 07, 2015 at 08:15:40AM +0200, Moni Shoua wrote:
> >
> >> What you have though is the sgid taken from the GRH that is scattered
> >> to the first 40/20 bytes of the receive WQE.  This is not enough to
> >> determine the network type.
> >
> > It is enough to discover the sgid index which will tell you the type.
> but how? all you have in hand is the sgid which can appear several
> times in the GID table in different indices.

Eh? Reliably recovering the gid index is not optional. The network
namespace stuff that is already in the kernel hard requires that
ability. (This is the same argument we already went around on about why
the pkey had to come from the wc, not from the payload.)

If your position is it cannot be done from a WC,QP as is, then
gid_index needs to be added to the wc, or something else to remove the
ambiguity.

In either case, network_type is absolutely the wrong thing to have in
the wc.

Jason
Moni Shoua Dec. 7, 2015, 6:34 p.m. UTC | #13
Well, just tell me how you want to discover the gid_index when you poll
the WC out of the CQ.
Like I said, the sgid itself is in the GRH that is scattered into the
buffers in the receive queue. When ib_poll_cq() is called, the pointer
to the GRH is not passed, so there is no way to determine the gid_index
inside the polling context.
BTW, why do you argue about this only now? This is not a RoCE-specific
issue. sgid_index was always resolved outside the polling context. The
only difference between then and now is that with InfiniBand there was
no conflict about the sgid_index once you had the sgid. Now, when the
same gid value can be present in multiple entries, you need an extra
parameter to distinguish between them - that would be the network_type.


On Mon, Dec 7, 2015 at 7:12 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Dec 07, 2015 at 08:37:41AM +0200, Moni Shoua wrote:
>> On Mon, Dec 7, 2015 at 8:34 AM, Jason Gunthorpe
>> <jgunthorpe@obsidianresearch.com> wrote:
>> > On Mon, Dec 07, 2015 at 08:15:40AM +0200, Moni Shoua wrote:
>> >
>> >> What you have though is the sgid taken from the GRH that is scattered
>> >> to the first 40/20 bytes of the receive WQE.  This is not enough to
>> >> determine the network type.
>> >
>> > It is enough to discover the sgid index which will tell you the type.
>> but how? all you have in hand is the sgid which can appear several
>> times in the GID table in different indices.
>
> Eh? Reliably recovering the gid index is not optional. The network
> namespace stuff that is already in the kernel hard requires that
> ability. (This is the same argument we went around on already why pkey
> had to come from the wc, not from the payload)
>
> If your position is it cannot be done from a WC,QP as is, then
> gid_index needs to be added to the wc, or something else to remove the
> ambiguity.
>
> In either case, network_type is absolutely the wrong thing to have in
> the wc.
>
> Jason
Jason Gunthorpe Dec. 7, 2015, 6:48 p.m. UTC | #14
On Mon, Dec 07, 2015 at 08:34:43PM +0200, Moni Shoua wrote:
> Well, just tell me how you want to discover gid_index when you poll
> the WC out of the CQ.

Hey, I'm not designing this rocev2 stuff; this is something the rocev2
community needs to sort out.

> Like I said, the sgid itself is in the GRH that is scattered to the
> buffers in the receive queue. When ib_poll_cq() is called the pointer
> to GRH is not passed so there is no way to determine the gid_index
> inside the polling context.

Then have an API to let the consumer recover it.

> BTW, why do you argue about this only now? This is not RoCE specific
> issue. sgid_index was always resolved outside the polling context. The
> only difference between then and now is that with InfiniBand there was
> no conflict about sgid_index once you had the sgid.

I think you answered your own question. In IB and rocev1 the sgid
index is not ambiguous; rocev2 breaks that requirement without
adequately fixing it.

It is probably also the case that rocev1 has similar problems,
depending on how vlan and macvlan ended up being implemented.

> Now, when same gid value can be present in multiple entries you need
> an extra parameter to get distinguish between them - that would be
> the netwrok_type

Eh?  You are only worried about duplicates between rocev1 mode, which
uses the GUID, and v2 mode, which uses the host's IP address?

What about duplicates entirely within v2 mode, where vlan and macvlan
can create duplicate host IP addresses?

You can also just not create duplicate entries in the first place. For
instance, there probably isn't really a good reason to use rocev2 for
IPv6 link-local addresses; that trivially eliminates the v1/v2
collision.

Jason
Moni Shoua Dec. 8, 2015, 7:23 a.m. UTC | #15
On Mon, Dec 7, 2015 at 8:48 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Dec 07, 2015 at 08:34:43PM +0200, Moni Shoua wrote:
>> Well, just tell me how you want to discover gid_index when you poll
>> the WC out of the CQ.
>
> Hey, I'm not desiging this rocev2 stuff, this is something the rocev2
> community needs to sort out.
>
>> Like I said, the sgid itself is in the GRH that is scattered to the
>> buffers in the receive queue. When ib_poll_cq() is called the pointer
>> to GRH is not passed so there is no way to determine the gid_index
>> inside the polling context.
>
> Then have an API to let the consumer recover it.
>
>> BTW, why do you argue about this only now? This is not RoCE specific
>> issue. sgid_index was always resolved outside the polling context. The
>> only difference between then and now is that with InfiniBand there was
>> no conflict about sgid_index once you had the sgid.
>
> I think you answered your own question. In IB and rocev1 the sgid
> index is not ambiguous, rocev2 breaks that requirement without
> adaquately fixing it.
>
> It is probably also the case that rocev1 has similar problems,
> depending on how vland and macvlan ended up being implemented.
>
>> Now, when same gid value can be present in multiple entries you need
>> an extra parameter to get distinguish between them - that would be
>> the netwrok_type
>
> Eh?  You are only worried about duplicates between rocev1 mode which
> uses the GUID and v2 mode which uses the host's IP address?
>
> What about duplicates entirely within v2 mode where vlan and macvlan
> can create duplicate host IP addresses.
In that case these entries will be distinguished by their eth devices
(the eth device is part of the gid attributes structure).

>
> You can also just not create duplicate entries in the first place. For
> instance there probably isn't a really a good reason to use rocev2 for
> IPv6 link local addreses, that trivially eliminates the v1/v2
> collision.
First, the spec says that I must.
Second, even if you have an example that makes it unnecessary for some,
it is still necessary for others, and therefore the network_type in the
WC is still needed.
Third, regarding the IPv6 link-local example: an IPv6 link-local packet
is almost identical to RoCEv1, but not entirely. The former is an IP
protocol packet while the latter isn't. Switches and network admins may
care about this.

>
> Jason
Jason Gunthorpe Dec. 8, 2015, 10:52 p.m. UTC | #16
On Tue, Dec 08, 2015 at 09:23:18AM +0200, Moni Shoua wrote:
> >> Now, when same gid value can be present in multiple entries you need
> >> an extra parameter to get distinguish between them - that would be
> >> the netwrok_type
> >
> > Eh?  You are only worried about duplicates between rocev1 mode which
> > uses the GUID and v2 mode which uses the host's IP address?
> >
> > What about duplicates entirely within v2 mode where vlan and macvlan
> > can create duplicate host IP addresses.

> In that case these entries will be distinguished  byt its eth devices
> (eth device is part of the gid attributes structure)

Eh? I think you've missed the point, there is no net device when
looking at a wc.

Look, here is a concrete direction:

Replace all the crap in
ib_init_ah_from_wc/get_sgid_index_from_eth/rdma_addr_find_dmac_by_grh

with a straightforward

   rdma_dgid_index_from_wc(
                        const struct ib_qp *qp,
                        const struct ib_wc *wc,
			const struct ib_grh *grh,
			u16 *gid_index)

The sort of function that reads the GRH and wc and returns the
unambiguous gid index that was used to receive that packet on the UD QP.

The gid cache is not allowed to create an ambiguity the driver cannot
resolve.
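
Roughly, such a helper is just glue around pieces this series already
has.  A sketch only (device/port_num are passed explicitly here for
simplicity; ib_get_net_type_by_grh() and get_gids_from_rdma_hdr() are
the static helpers from the verbs.c hunk below, and vlan/ndev
disambiguation is ignored):

	static int rdma_dgid_index_from_wc(struct ib_device *device,
					   u8 port_num,
					   const struct ib_wc *wc,
					   const struct ib_grh *grh,
					   u16 *gid_index)
	{
		enum rdma_network_type net_type;
		union ib_gid sgid, dgid;
		int ret;

		if (wc->wc_flags & IB_WC_WITH_NETWORK_HDR_TYPE)
			net_type = wc->network_hdr_type;
		else
			net_type = ib_get_net_type_by_grh(device, port_num, grh);

		/* Recover the GIDs; for RoCEv2/IPv4 the dgid has to be
		 * rebuilt from the IPv4 header in the last 20 bytes of the
		 * scatter area, not read from a real GRH.
		 */
		ret = get_gids_from_rdma_hdr((union rdma_network_hdr *)grh,
					     net_type, &sgid, &dgid);
		if (ret)
			return ret;

		return ib_find_cached_gid_by_port(device, &dgid,
						  ib_network_to_gid_type(net_type),
						  port_num, NULL, gid_index);
	}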

That said, I wouldn't object to vendor-specific bits in the wc. Ie if
mlx hardware needs a network_type bit to implement
rdma_find_dgid_index_from_wc, then fine - define a vendor-specific
place to put it. In this case rdma_find_dgid_index_from_wc would be a
driver callback, which is fine, and what Caitlin was talking about.

But, it is not part of our verbs API, and I'd *strongly* encourage
other vendors and future hardware to simply return the gid index that
the hardware matched instead of requiring the software to try and
guess after the fact.

This is the same issue/argument we went around and around on the
lladdr ipoib details when working on the namespace patches, about how
important it is to resolve the namespace from the hardware headers.

Of course once we have the gid index we now have the net device and
other information needed to make namespaces work.

.. and this is part of what I meant when I said the interface from the
gid cache code is not a sane API and needs to be changed.

Jason
Moni Shoua Dec. 9, 2015, 9:34 a.m. UTC | #17
> Eh? I think you've missed the point, there is no net device when
> looking at a wc.
>
> Look, here is a concrete direction:
>
> Replace all the crap in
> ib_init_ah_from_wc/get_sgid_index_from_eth/rdma_addr_find_dmac_by_grh
>
> with a straightforward
>
>    rdma_dgid_index_from_wc(
>                         const struct ib_qp *qp,
>                         const struct ib_wc *wc,
>                         const struct ib_grh *grh,
>                         u16 *gid_index)
>
> Sort of function that reads the GRH and wc and returns the unambiguous
> gid index that was used to receive that packet on the UD QP.
>

I already answered this, but I'll do it again.
The RoCEv2 spec says that the L3 header is scattered to the receive WQE
in the following way:
IPv6 and RoCEv1 - the 40 bytes of the L3 header (GRH or IPv6) go to the
first 40 bytes of the receive bufs.
IPv4 - the 20 bytes of the L3 header go to the second half of the first
40 bytes of the receive bufs. The first 20 bytes remain undefined.

Now, if you think about how you deduce the network_type from the GRH,
you'll see that it requires tools like checksum validation and other
checks, and you end up with a method that is not 100% error free. So,
to eliminate the need for heavy computation (relative to the other
option) and to be free from false deductions, you have the option of
getting the network_type from the hardware. So, if you do have hardware
that supports it, why give it up?
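
(For reference, this is roughly what the fallback in the verbs.c hunk
below -- ib_get_header_version() -- has to do; a condensed sketch just
to show the cost:)

	static int header_version_sketch(const union rdma_network_hdr *hdr)
	{
		const struct ipv6hdr *ip6h = (const struct ipv6hdr *)&hdr->ibgrh;
		const struct iphdr *ip4h = &hdr->roce4grh;
		struct iphdr ip4h_checked;

		/* The 40-byte scatter area is either a GRH/IPv6 header or 20
		 * undefined bytes followed by an IPv4 header, so version, ihl
		 * and checksum checks are needed to tell them apart -- and
		 * they are heuristics, not proofs.
		 */
		if (ip6h->version != 6)
			return (ip4h->version == 4) ? 4 : 0;
		if (ip4h->ihl != 5)	/* RoCEv2 requires no IPv4 options */
			return 6;
		ip4h_checked = *ip4h;	/* copy: we can't write the rx buffer */
		ip4h_checked.check = 0;
		ip4h_checked.check = ip_fast_csum((u8 *)&ip4h_checked, 5);
		return ip4h->check == ip4h_checked.check ? 4 : 6;
	}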



> The gid cache is not allowed to create an ambiguity the driver cannot
> resolve.
>
> That said, I wouldn't object to vendor-specific bits in the wc. Ie if
> mlx hardware needs a network_type bit to implement
> rdma_find_dgid_index_from_wc, then fine - define a vendor specific
> place to put it. In this case rdma_find_dgid_index_from_wc would be a
> driver call back, which is fine, and what Caitlin was talking about.
>

This is not a Mellanox-specific flag. See a quote from the spec:

A17.4.5.1 UD COMPLETION QUEUE ENTRIES (CQES)
For UD, the Completion Queue Entry (CQE) includes remote address
information (InfiniBand Specification Vol. 1 Rev 1.2.1 Section
11.4.2.1). For RoCEv2, the remote address information comprises the
source L2 Address and a flag that indicates if the received frame is
an IPv4, IPv6 or RoCE packet.



> But, it is not part of our verbs API, and I'd *strongly* encourage
> other vendors and future hardware to simply return the gid index that
> the hardware matched instead of requiring the software to try and
> guess after the fact.

This could be problematic for virtual machine architectures that give a
portion of the entire GID table to a VM, which indexes it 0..N.
>
> This is the same issue/argument we went around and around on the
> lladdr ipoib details when working on the namespace patches, about how
> important it is to resolve the namespace from the hardware headers.
>
> Of course once we have the gid index we now have the net device and
> other information needed to make namespaces work.
>
> .. and this is part of what I mean what I said the interface from the
> gid cache code is not a sane API and needs to be changed.
>
> Jason
Liran Liss Dec. 9, 2015, 9:38 a.m. UTC | #18
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com]

> Look, here is a concrete direction:
> 
> Replace all the crap in
> ib_init_ah_from_wc/get_sgid_index_from_eth/rdma_addr_find_dmac_by_
> grh
> 
> with a straightforward
> 
>    rdma_dgid_index_from_wc(
>                         const struct ib_qp *qp,
>                         const struct ib_wc *wc,
> 			const struct ib_grh *grh,
> 			u16 *gid_index)
> 
> Sort of function that reads the GRH and wc and returns the unambiguous gid
> index that was used to receive that packet on the UD QP.
> 
> The gid cache is not allowed to create an ambiguity the driver cannot resolve.
> 
> That said, I wouldn't object to vendor-specific bits in the wc. Ie if mlx
> hardware needs a network_type bit to implement
> rdma_find_dgid_index_from_wc, then fine - define a vendor specific place
> to put it. In this case rdma_find_dgid_index_from_wc would be a driver call
> back, which is fine, and what Caitlin was talking about.
> 
> But, it is not part of our verbs API, and I'd *strongly* encourage other
> vendors and future hardware to simply return the gid index that the
> hardware matched instead of requiring the software to try and guess after
> the fact.
> 

You are pushing abstraction into provider code instead of handling it in a generic way.

The Verbs are a low-level API that should report exactly what was received from the wire.
In the RoCEv2 case, that is the GID/IP addresses and the protocol type.
The addressing information is not intended to be used directly by applications; it is the raw bits that were accepted from the wire.
I don't want to replicate resolution code in every RoCE driver.

ib_init_ah_from_wc() and friends are exactly the place where you want to create an address handle based on completion and packet fields.
This is what applications will use (and only if they are interested in generating a response; otherwise this overhead can be skipped).

As I said earlier, we can make sure that protocol type information doesn't incur any additional overhead.
--Liran

Moni Shoua Dec. 9, 2015, 9:41 a.m. UTC | #19
On Mon, Dec 7, 2015 at 8:48 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Dec 07, 2015 at 08:34:43PM +0200, Moni Shoua wrote:
>> Well, just tell me how you want to discover gid_index when you poll
>> the WC out of the CQ.
>
> Hey, I'm not desiging this rocev2 stuff, this is something the rocev2
> community needs to sort out.
>
>> Like I said, the sgid itself is in the GRH that is scattered to the
>> buffers in the receive queue. When ib_poll_cq() is called the pointer
>> to GRH is not passed so there is no way to determine the gid_index
>> inside the polling context.
>
> Then have an API to let the consumer recover it.
We do, but it has its disadvantages. We'll use it when there is no
other choice (i.e. HW that doesn't support it).
>
>> BTW, why do you argue about this only now? This is not RoCE specific
>> issue. sgid_index was always resolved outside the polling context. The
>> only difference between then and now is that with InfiniBand there was
>> no conflict about sgid_index once you had the sgid.
>
> I think you answered your own question. In IB and rocev1 the sgid
> index is not ambiguous, rocev2 breaks that requirement without
> adaquately fixing it.

Ambiguity is not the issue here. You say that the sgid_index needs to
be known during poll, so why raise it now?
We want to add information to the CQE to help resolve the conflict, but
we leave the resolution of the sgid index to the caller, like for
RoCEv1 and for native InfiniBand.
>
> It is probably also the case that rocev1 has similar problems,
> depending on how vland and macvlan ended up being implemented.
>
>> Now, when same gid value can be present in multiple entries you need
>> an extra parameter to get distinguish between them - that would be
>> the netwrok_type
>
> Eh?  You are only worried about duplicates between rocev1 mode which
> uses the GUID and v2 mode which uses the host's IP address?
>
> What about duplicates entirely within v2 mode where vlan and macvlan
> can create duplicate host IP addresses.
>
> You can also just not create duplicate entries in the first place. For
> instance there probably isn't a really a good reason to use rocev2 for
> IPv6 link local addreses, that trivially eliminates the v1/v2
> collision.
>
> Jason
Jason Gunthorpe Dec. 9, 2015, 6:01 p.m. UTC | #20
On Wed, Dec 09, 2015 at 11:34:10AM +0200, Moni Shoua wrote:
> > Eh? I think you've missed the point, there is no net device when
> > looking at a wc.
> >
> > Look, here is a concrete direction:
> >
> > Replace all the crap in
> > ib_init_ah_from_wc/get_sgid_index_from_eth/rdma_addr_find_dmac_by_grh
> >
> > with a straightforward
> >
> >    rdma_dgid_index_from_wc(
> >                         const struct ib_qp *qp,
> >                         const struct ib_wc *wc,
> >                         const struct ib_grh *grh,
> >                         u16 *gid_index)
> >
> > Sort of function that reads the GRH and wc and returns the unambiguous
> > gid index that was used to receive that packet on the UD QP.
> >
> 
> I already answered this to but I'll do it again
> RoCEv2 spec says that L3 header will be scattered to receive WQE in
> the following way
> IPv6 and RoCEv1 - 40 bytes of the L3 header (GRH or IPv6) to the first
> 40 bytes of the receive bufs
> IPv4 - 20 bytes of the L3 header to the second half of the first 40
> bytes of the receive bufs. The first 20 bytes remain undefined.
> 
> Now, if you think how you deduce network_type from GRH you'll see that
> it requires tools like checksum validation and other validations and
> you end up with a method that is not 100% error free. So,to eliminate
> the need for heavy computation (with regards to the other option) and
> be free from false deductions you have the option of getting
> network_type from the hardware. So, if you do have hardware that
> supports it why give it up?

I understand how it works.

From an API perspective, I want to see something that can be used
*correctly* and that means more along the lines of
rdma_dgid_index_from_wc and not what is proposed in this series.

That means not exposing the callers to this awful mess.

> > That said, I wouldn't object to vendor-specific bits in the wc. Ie if
> > mlx hardware needs a network_type bit to implement
> > rdma_find_dgid_index_from_wc, then fine - define a vendor specific
> > place to put it. In this case rdma_find_dgid_index_from_wc would be a
> > driver call back, which is fine, and what Caitlin was talking about.
> 
> This is not a Mellanox specific flag. See a quote from the spec
>
> A17.4.5.1 UD COMPLETION QUEUE ENTRIES (CQES)
> For UD, the Completion Queue Entry (CQE) includes remote address
> information (InfiniBand Specification Vol. 1 Rev 1.2.1 Section
> 11.4.2.1). For RoCEv2, the remote address information comprises the
> source L2 Address and a flag that indicates if the received frame is
> an IPv4, IPv6 or RoCE packet.

rdma_dgid_index_from_wc satisfies the above spec requirements without
creating a horrible API, or requiring all vendors to implement a
network type flag.

> > But, it is not part of our verbs API, and I'd *strongly* encourage
> > other vendors and future hardware to simply return the gid index that
> > the hardware matched instead of requiring the software to try and
> > guess after the fact.
> 
> Could be problematic for virtual machine architectures that give a
> portion of the entire GID table to a VM that index it 0..N

I doubt it.

Jason
Jason Gunthorpe Dec. 9, 2015, 6:09 p.m. UTC | #21
On Wed, Dec 09, 2015 at 09:38:21AM +0000, Liran Liss wrote:

> > But, it is not part of our verbs API, and I'd *strongly* encourage other
> > vendors and future hardware to simply return the gid index that the
> > hardware matched instead of requiring the software to try and guess after
> > the fact.
 
> You are pushing abstraction into provider code instead of handling it in a generic way.

No, I am defining an API that *makes sense* and doesn't leak useless
details. Of course that doesn't force code duplication or anything
like that; just implement it smartly.

I think mlx made a big mistake returning network_type instead of gid
index, and I don't want to see that error enshrined in our API.

> The Verbs are a low-level API, that should report exactly what was
> received from the wire.  In the RoCEv2 case, it should be the GID/IP
> addresses and the protocol type.  The addressing information is not
> intended to be used directly by applications; it is the raw bits
> that were accepted from the wire.

Low-level details aren't what any in-kernel consumer needs. Everything
in the kernel needs the gid index to determine the namespace, routing
and other details. It is not optional. A common API is thus needed to
do this conversion.

API-wise, once you get the gid index then it is trivial to make easy
extractors for everything else. ie for example:
rdma_get_ud_src_sockaddr(gid_index,&addr,wc,grh)
rdma_get_ud_dst_sockaddr(gid_index,&addr,wc,grh)
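
A hypothetical sketch of what the src-address extractor could boil down
to once the gids have been pulled out of the scattered header (not an
existing API; it just reuses the v4-mapped convention this series
already uses in ib_gid_to_network_type()):

	static void sketch_ud_src_sockaddr(const union ib_gid *sgid,
					   struct sockaddr_storage *addr)
	{
		const struct in6_addr *a = (const struct in6_addr *)sgid;

		if (ipv6_addr_v4mapped(a)) {
			struct sockaddr_in *in4 = (struct sockaddr_in *)addr;

			in4->sin_family = AF_INET;
			in4->sin_addr.s_addr = a->s6_addr32[3];	/* low 32 bits */
		} else {
			struct sockaddr_in6 *in6 = (struct sockaddr_in6 *)addr;

			in6->sin6_family = AF_INET6;
			in6->sin6_addr = *a;
		}
	}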

> ib_init_ah_from_wc() and friends is exactly the place that you want
> to create an address handle based on completion and packet fields.

CMA needs exactly the same logic as well, the fact it doesn't have it
is a bug in this series.

Jason
Moni Shoua Dec. 10, 2015, 7:53 a.m. UTC | #22
> No, I am defining an API that *make sense* and doesn't leak useless
> details. Of course that doesn't force code duplication or anyhting
> like that, just implement it smartly.
>
> I think mlx made a big mistake returning network_type instead of gid
> index, and I don't want to see that error enshrined in our API.
>
Returning the gid_index is wrong because it forces CQ pollers to be
aware of the entire table. Like I already mentioned, the GID table is a
HW resource that can be divided and handed out to multiple VMs.

>> The Verbs are a low-level API, that should report exactly what was
>> received from the wire.  In the RoCEv2 case, it should be the GID/IP
>> addresses and the protocol type.  The addressing information is not
>> intended to be used directly by applications; it is the raw bits
>> that were accepted from the wire.
>
> Low level details isn't what any in kernel consumer needs. Everything
> in kernel needs the gid index to determine the namespace, routing and
> other details. It is not optional. A common API is thus needed to do
> this conversion.
>
> API-wise, once you get the gid index then it is trivial to make easy
> extractors for everything else. ie for example:
> rdma_get_ud_src_sockaddr(gid_index,&addr,wc,grh)
> rdma_get_ud_dst_sockaddr(gid_index,&addr,wc,grh)
>
>> ib_init_ah_from_wc() and friends is exactly the place that you want
>> to create an address handle based on completion and packet fields.
>
> CMA needs exactly the same logic as well, the fact it doesn't have it
> is a bug in this series.
>
> Jason
Moni Shoua Dec. 10, 2015, 7:58 a.m. UTC | #23
>>
>> A17.4.5.1 UD COMPLETION QUEUE ENTRIES (CQES)
>> For UD, the Completion Queue Entry (CQE) includes remote address
>> information (InfiniBand Specification Vol. 1 Rev 1.2.1 Section
>> 11.4.2.1). For RoCEv2, the remote address information comprises the
>> source L2 Address and a flag that indicates if the received frame is
>> an IPv4, IPv6 or RoCE packet.
>
> rdma_dgid_index_from_wc satisfies the above spec requirements without
How? Please explain.

> creating a horrible API, or requiring all vendors to implement a
> network type flag.
>
Actually, you haven't suggested anything yet besides a name for the function.
I already said that calculating the gid_index from the wc and grh
requires extra CPU cycles and is prone to mistakes. But I might be
wrong and you might have a better idea. Do you?
Jason Gunthorpe Dec. 10, 2015, 5:38 p.m. UTC | #24
On Thu, Dec 10, 2015 at 09:58:51AM +0200, Moni Shoua wrote:

> > creating a horrible API, or requiring all vendors to implement a
> > network type flag.
> >
> Actually you haven't suggested anything yet besides a name to the function.
> I already said that calculating gid_index from wc and grh requires
> extra CPU cycles and is prone to mistakes. But, I might be wrong and
> you have a better idea. Do you?

I told you, if a driver requires vendor-specific information then
include it as private data in the grh.

I keep saying this: computing the gid index is not optional. If a
driver can't do it efficiently then too bad, it has to go the long way
around.

>> I think mlx made a big mistake returning network_type instead of gid
>> index, and I don't want to see that error enshrined in our API.
>>
> returning gid_index is wrong because it forces CQ pollers to be aware
> of the entire table. Like I already mentioned, the GID table is a HW
> resource that can be divided and handed to multiple VMs,

Please. If HW virt stuff can't sort this out it is broken, there are
many trivial solutions.

Jason
Liran Liss Dec. 13, 2015, 1:56 p.m. UTC | #25
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-

> 
> > You are pushing abstraction into provider code instead of handling it in a
> generic way.
> 
> No, I am defining an API that *make sense* and doesn't leak useless details.
> Of course that doesn't force code duplication or anyhting like that, just
> implement it smartly.
> 
> I think mlx made a big mistake returning network_type instead of gid index,
> and I don't want to see that error enshrined in our API.
> 
> > The Verbs are a low-level API, that should report exactly what was
> > received from the wire.  In the RoCEv2 case, it should be the GID/IP
> > addresses and the protocol type.  The addressing information is not
> > intended to be used directly by applications; it is the raw bits that
> > were accepted from the wire.
> 
> Low level details isn't what any in kernel consumer needs. Everything in
> kernel needs the gid index to determine the namespace, routing and other
> details. It is not optional. A common API is thus needed to do this conversion.

The Verbs are not intended only for kernel consumers, but also for the ib_core, cma, etc.
For the ib_core, a provider needs to report *all* relevant information that is not visible in the packet payload.
The network type is part of this information.
The proposed changes are a straightforward extension to the code base, directly follow the specification, and adhere to the RDMA stack design in which IP addressing is handled by the cma.

Also, I don't want to do any route resolution on the Rx path.
A UD QP completion just reports the details of the packet it received.

Conceptually, an incoming packet may not even match an SGID index at all.
Maybe responses should be sent from another device. This should not be decided at the point that a packet was received.

> 
> API-wise, once you get the gid index then it is trivial to make easy extractors
> for everything else. ie for example:
> rdma_get_ud_src_sockaddr(gid_index,&addr,wc,grh)
> rdma_get_ud_dst_sockaddr(gid_index,&addr,wc,grh)
> 

Nice ideas, but not relevant to completions.
The resolved dst address could point to another port or device completely.
The proper way to handle remote UD addresses is by rdma_ids that encapsulate address handles and bound devices.

> > ib_init_ah_from_wc() and friends is exactly the place that you want to
> > create an address handle based on completion and packet fields.
> 
> CMA needs exactly the same logic as well, the fact it doesn't have it is a bug
> in this series.
> 

init_ah_from_wc() should include a route lookup, and this will be fixed.

> Jason
Matan Barak Dec. 13, 2015, 2:03 p.m. UTC | #26
On Sun, Dec 13, 2015 at 3:56 PM, Liran Liss <liranl@mellanox.com> wrote:
>> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
>
>>
>> > You are pushing abstraction into provider code instead of handling it in a
>> generic way.
>>
>> No, I am defining an API that *make sense* and doesn't leak useless details.
>> Of course that doesn't force code duplication or anyhting like that, just
>> implement it smartly.
>>
>> I think mlx made a big mistake returning network_type instead of gid index,
>> and I don't want to see that error enshrined in our API.
>>
>> > The Verbs are a low-level API, that should report exactly what was
>> > received from the wire.  In the RoCEv2 case, it should be the GID/IP
>> > addresses and the protocol type.  The addressing information is not
>> > intended to be used directly by applications; it is the raw bits that
>> > were accepted from the wire.
>>
>> Low level details isn't what any in kernel consumer needs. Everything in
>> kernel needs the gid index to determine the namespace, routing and other
>> details. It is not optional. A common API is thus needed to do this conversion.
>
> The Verbs are not intended only for kernel consumers, but also for the ib_core, cma, etc.
> For the ib_core, a provider needs to report *all* relevant information that is not visible in the packet payload.
> The network type is part of this information.
> The proposed changes are a straightforward extension to the code base, directly follow the specification, and adhere to the RDMA stack design in which IP addressing is handled by the cma.
>
> Also, I don't want to do any route resolution on the Rx path.
> A UD QP completion just reports the details of the packet it received.
>
> Conceptually, an incoming packet may not even match an SGID index at all.
> Maybe, responses should be sent from another device. This should not be decided that the point that a packet was received.
>
>>
>> API-wise, once you get the gid index then it is trivial to make easy extractors
>> for everything else. ie for example:
>> rdma_get_ud_src_sockaddr(gid_index,&addr,wc,grh)
>> rdma_get_ud_dst_sockaddr(gid_index,&addr,wc,grh)
>>
>
> Nice ideas, but not relevant to completions.
> The resolved dst address could point to another port or device completely.
> The proper way to handle remote UD addresses is by rdma_ids that encapsulate address handles and bound devices.
>
>> > ib_init_ah_from_wc() and friends is exactly the place that you want to
>> > create an address handle based on completion and packet fields.
>>
>> CMA needs exactly the same logic as well, the fact it doesn't have it is a bug
>> in this series.
>>
>
> init_ah_from_wc() should include a route lookup, and this will be fixed.
>

This was already fixed in v2.

>> Jason

Patch

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 34b1ada..6e35299 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -256,6 +256,12 @@  static int addr4_resolve(struct sockaddr_in *src_in,
 		goto put;
 	}
 
+	/* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't
+	 * routable) and we could set the network type accordingly.
+	 */
+	if (rt->rt_uses_gateway)
+		addr->network = RDMA_NETWORK_IPV4;
+
 	ret = dst_fetch_ha(&rt->dst, addr, &fl4.daddr);
 put:
 	ip_rt_put(rt);
@@ -270,6 +276,7 @@  static int addr6_resolve(struct sockaddr_in6 *src_in,
 {
 	struct flowi6 fl6;
 	struct dst_entry *dst;
+	struct rt6_info *rt;
 	int ret;
 
 	memset(&fl6, 0, sizeof fl6);
@@ -281,6 +288,7 @@  static int addr6_resolve(struct sockaddr_in6 *src_in,
 	if ((ret = dst->error))
 		goto put;
 
+	rt = (struct rt6_info *)dst;
 	if (ipv6_addr_any(&fl6.saddr)) {
 		ret = ipv6_dev_get_saddr(addr->net, ip6_dst_idev(dst)->dev,
 					 &fl6.daddr, 0, &fl6.saddr);
@@ -304,6 +312,12 @@  static int addr6_resolve(struct sockaddr_in6 *src_in,
 		goto put;
 	}
 
+	/* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't
+	 * routable) and we could set the network type accordingly.
+	 */
+	if (rt->rt6i_flags & RTF_GATEWAY)
+		addr->network = RDMA_NETWORK_IPV6;
+
 	ret = dst_fetch_ha(dst, addr, &fl6.daddr);
 put:
 	dst_release(dst);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 2914e08..5dc853c 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2302,6 +2302,7 @@  static int cma_resolve_iboe_route(struct rdma_id_private *id_priv)
 {
 	struct rdma_route *route = &id_priv->id.route;
 	struct rdma_addr *addr = &route->addr;
+	enum ib_gid_type network_gid_type;
 	struct cma_work *work;
 	int ret;
 	struct net_device *ndev = NULL;
@@ -2340,7 +2341,15 @@  static int cma_resolve_iboe_route(struct rdma_id_private *id_priv)
 	rdma_ip2gid((struct sockaddr *)&id_priv->id.route.addr.dst_addr,
 		    &route->path_rec->dgid);
 
-	route->path_rec->hop_limit = 1;
+	/* Use the hint from IP Stack to select GID Type */
+	network_gid_type = ib_network_to_gid_type(addr->dev_addr.network);
+	if (addr->dev_addr.network != RDMA_NETWORK_IB) {
+		route->path_rec->gid_type = network_gid_type;
+		/* TODO: get the hoplimit from the inet/inet6 device */
+		route->path_rec->hop_limit = IPV6_DEFAULT_HOPLIMIT;
+	} else {
+		route->path_rec->hop_limit = 1;
+	}
 	route->path_rec->reversible = 1;
 	route->path_rec->pkey = cpu_to_be16(0xffff);
 	route->path_rec->mtu_selector = IB_SA_EQ;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 4263c4c..c564131 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -311,8 +311,61 @@  struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
 }
 EXPORT_SYMBOL(ib_create_ah);
 
+static int ib_get_header_version(const union rdma_network_hdr *hdr)
+{
+	const struct iphdr *ip4h = (struct iphdr *)&hdr->roce4grh;
+	struct iphdr ip4h_checked;
+	const struct ipv6hdr *ip6h = (struct ipv6hdr *)&hdr->ibgrh;
+
+	/* If it's IPv6, the version must be 6, otherwise, the first
+	 * 20 bytes (before the IPv4 header) are garbled.
+	 */
+	if (ip6h->version != 6)
+		return (ip4h->version == 4) ? 4 : 0;
+	/* version may be 6 or 4 because the first 20 bytes could be garbled */
+
+	/* RoCE v2 requires no options, thus header length
+	 * must be 5 words
+	 */
+	if (ip4h->ihl != 5)
+		return 6;
+
+	/* Verify checksum.
+	 * We can't write on scattered buffers so we need to copy to
+	 * temp buffer.
+	 */
+	memcpy(&ip4h_checked, ip4h, sizeof(ip4h_checked));
+	ip4h_checked.check = 0;
+	ip4h_checked.check = ip_fast_csum((u8 *)&ip4h_checked, 5);
+	/* if IPv4 header checksum is OK, believe it */
+	if (ip4h->check == ip4h_checked.check)
+		return 4;
+	return 6;
+}
+
+static enum rdma_network_type ib_get_net_type_by_grh(struct ib_device *device,
+						     u8 port_num,
+						     const struct ib_grh *grh)
+{
+	int grh_version;
+
+	if (rdma_protocol_ib(device, port_num))
+		return RDMA_NETWORK_IB;
+
+	grh_version = ib_get_header_version((union rdma_network_hdr *)grh);
+
+	if (grh_version == 4)
+		return RDMA_NETWORK_IPV4;
+
+	if (grh->next_hdr == IPPROTO_UDP)
+		return RDMA_NETWORK_IPV6;
+
+	return RDMA_NETWORK_ROCE_V1;
+}
+
 struct find_gid_index_context {
 	u16 vlan_id;
+	enum ib_gid_type gid_type;
 };
 
 static bool find_gid_index(const union ib_gid *gid,
@@ -322,6 +375,9 @@  static bool find_gid_index(const union ib_gid *gid,
 	struct find_gid_index_context *ctx =
 		(struct find_gid_index_context *)context;
 
+	if (ctx->gid_type != gid_attr->gid_type)
+		return false;
+
 	if ((!!(ctx->vlan_id != 0xffff) == !is_vlan_dev(gid_attr->ndev)) ||
 	    (is_vlan_dev(gid_attr->ndev) &&
 	     vlan_dev_vlan_id(gid_attr->ndev) != ctx->vlan_id))
@@ -332,14 +388,49 @@  static bool find_gid_index(const union ib_gid *gid,
 
 static int get_sgid_index_from_eth(struct ib_device *device, u8 port_num,
 				   u16 vlan_id, const union ib_gid *sgid,
+				   enum ib_gid_type gid_type,
 				   u16 *gid_index)
 {
-	struct find_gid_index_context context = {.vlan_id = vlan_id};
+	struct find_gid_index_context context = {.vlan_id = vlan_id,
+						 .gid_type = gid_type};
 
 	return ib_find_gid_by_filter(device, sgid, port_num, find_gid_index,
 				     &context, gid_index);
 }
 
+static int get_gids_from_rdma_hdr(union rdma_network_hdr *hdr,
+				  enum rdma_network_type net_type,
+				  union ib_gid *sgid, union ib_gid *dgid)
+{
+	struct sockaddr_in  src_in;
+	struct sockaddr_in  dst_in;
+	__be32 src_saddr, dst_saddr;
+
+	if (!sgid || !dgid)
+		return -EINVAL;
+
+	if (net_type == RDMA_NETWORK_IPV4) {
+		memcpy(&src_in.sin_addr.s_addr,
+		       &hdr->roce4grh.saddr, 4);
+		memcpy(&dst_in.sin_addr.s_addr,
+		       &hdr->roce4grh.daddr, 4);
+		src_saddr = src_in.sin_addr.s_addr;
+		dst_saddr = dst_in.sin_addr.s_addr;
+		ipv6_addr_set_v4mapped(src_saddr,
+				       (struct in6_addr *)sgid);
+		ipv6_addr_set_v4mapped(dst_saddr,
+				       (struct in6_addr *)dgid);
+		return 0;
+	} else if (net_type == RDMA_NETWORK_IPV6 ||
+		   net_type == RDMA_NETWORK_IB) {
+		*dgid = hdr->ibgrh.dgid;
+		*sgid = hdr->ibgrh.sgid;
+		return 0;
+	} else {
+		return -EINVAL;
+	}
+}
+
 int ib_init_ah_from_wc(struct ib_device *device, u8 port_num,
 		       const struct ib_wc *wc, const struct ib_grh *grh,
 		       struct ib_ah_attr *ah_attr)
@@ -347,9 +438,25 @@  int ib_init_ah_from_wc(struct ib_device *device, u8 port_num,
 	u32 flow_class;
 	u16 gid_index;
 	int ret;
+	enum rdma_network_type net_type = RDMA_NETWORK_IB;
+	enum ib_gid_type gid_type = IB_GID_TYPE_IB;
+	union ib_gid dgid;
+	union ib_gid sgid;
 
 	memset(ah_attr, 0, sizeof *ah_attr);
 	if (rdma_cap_eth_ah(device, port_num)) {
+		if (wc->wc_flags & IB_WC_WITH_NETWORK_HDR_TYPE)
+			net_type = wc->network_hdr_type;
+		else
+			net_type = ib_get_net_type_by_grh(device, port_num, grh);
+		gid_type = ib_network_to_gid_type(net_type);
+	}
+	ret = get_gids_from_rdma_hdr((union rdma_network_hdr *)grh, net_type,
+				     &sgid, &dgid);
+	if (ret)
+		return ret;
+
+	if (rdma_protocol_roce(device, port_num)) {
 		u16 vlan_id = wc->wc_flags & IB_WC_WITH_VLAN ?
 				wc->vlan_id : 0xffff;
 
@@ -358,7 +465,7 @@  int ib_init_ah_from_wc(struct ib_device *device, u8 port_num,
 
 		if (!(wc->wc_flags & IB_WC_WITH_SMAC) ||
 		    !(wc->wc_flags & IB_WC_WITH_VLAN)) {
-			ret = rdma_addr_find_dmac_by_grh(&grh->dgid, &grh->sgid,
+			ret = rdma_addr_find_dmac_by_grh(&dgid, &sgid,
 							 ah_attr->dmac,
 							 wc->wc_flags & IB_WC_WITH_VLAN ?
 							 NULL : &vlan_id,
@@ -368,7 +475,7 @@  int ib_init_ah_from_wc(struct ib_device *device, u8 port_num,
 		}
 
 		ret = get_sgid_index_from_eth(device, port_num, vlan_id,
-					      &grh->dgid, &gid_index);
+					      &dgid, gid_type, &gid_index);
 		if (ret)
 			return ret;
 
@@ -383,10 +490,10 @@  int ib_init_ah_from_wc(struct ib_device *device, u8 port_num,
 
 	if (wc->wc_flags & IB_WC_GRH) {
 		ah_attr->ah_flags = IB_AH_GRH;
-		ah_attr->grh.dgid = grh->sgid;
+		ah_attr->grh.dgid = sgid;
 
 		if (!rdma_cap_eth_ah(device, port_num)) {
-			ret = ib_find_cached_gid_by_port(device, &grh->dgid,
+			ret = ib_find_cached_gid_by_port(device, &dgid,
 							 IB_GID_TYPE_IB,
 							 port_num, NULL,
 							 &gid_index);
@@ -1026,6 +1133,12 @@  int ib_resolve_eth_dmac(struct ib_qp *qp,
 					ret = -ENXIO;
 				goto out;
 			}
+			if (sgid_attr.gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP)
+				/* TODO: get the hoplimit from the inet/inet6
+				 * device
+				 */
+				qp_attr->ah_attr.grh.hop_limit =
+							IPV6_DEFAULT_HOPLIMIT;
 
 			ifindex = sgid_attr.ndev->ifindex;
 
diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
index 1152859..c799caa 100644
--- a/include/rdma/ib_addr.h
+++ b/include/rdma/ib_addr.h
@@ -83,6 +83,7 @@  struct rdma_dev_addr {
 	int bound_dev_if;
 	enum rdma_transport_type transport;
 	struct net *net;
+	enum rdma_network_type network;
 };
 
 /**
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 87df931..31fb409 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -50,6 +50,8 @@ 
 #include <linux/workqueue.h>
 #include <linux/socket.h>
 #include <uapi/linux/if_ether.h>
+#include <net/ipv6.h>
+#include <net/ip.h>
 
 #include <linux/atomic.h>
 #include <linux/mmu_notifier.h>
@@ -107,6 +109,35 @@  enum rdma_protocol_type {
 __attribute_const__ enum rdma_transport_type
 rdma_node_get_transport(enum rdma_node_type node_type);
 
+enum rdma_network_type {
+	RDMA_NETWORK_IB,
+	RDMA_NETWORK_ROCE_V1 = RDMA_NETWORK_IB,
+	RDMA_NETWORK_IPV4,
+	RDMA_NETWORK_IPV6
+};
+
+static inline enum ib_gid_type ib_network_to_gid_type(enum rdma_network_type network_type)
+{
+	if (network_type == RDMA_NETWORK_IPV4 ||
+	    network_type == RDMA_NETWORK_IPV6)
+		return IB_GID_TYPE_ROCE_UDP_ENCAP;
+
+	/* IB_GID_TYPE_IB same as RDMA_NETWORK_ROCE_V1 */
+	return IB_GID_TYPE_IB;
+}
+
+static inline enum rdma_network_type ib_gid_to_network_type(enum ib_gid_type gid_type,
+							    union ib_gid *gid)
+{
+	if (gid_type == IB_GID_TYPE_IB)
+		return RDMA_NETWORK_IB;
+
+	if (ipv6_addr_v4mapped((struct in6_addr *)gid))
+		return RDMA_NETWORK_IPV4;
+	else
+		return RDMA_NETWORK_IPV6;
+}
+
 enum rdma_link_layer {
 	IB_LINK_LAYER_UNSPECIFIED,
 	IB_LINK_LAYER_INFINIBAND,
@@ -535,6 +566,17 @@  struct ib_grh {
 	union ib_gid	dgid;
 };
 
+union rdma_network_hdr {
+	struct ib_grh ibgrh;
+	struct {
+		/* The IB spec states that if it's IPv4, the header
+		 * is located in the last 20 bytes of the header.
+		 */
+		u8		reserved[20];
+		struct iphdr	roce4grh;
+	};
+};
+
 enum {
 	IB_MULTICAST_QPN = 0xffffff
 };
@@ -771,6 +813,7 @@  enum ib_wc_flags {
 	IB_WC_IP_CSUM_OK	= (1<<3),
 	IB_WC_WITH_SMAC		= (1<<4),
 	IB_WC_WITH_VLAN		= (1<<5),
+	IB_WC_WITH_NETWORK_HDR_TYPE	= (1<<6),
 };
 
 struct ib_wc {
@@ -793,6 +836,7 @@  struct ib_wc {
 	u8			port_num;	/* valid only for DR SMPs on switches */
 	u8			smac[ETH_ALEN];
 	u16			vlan_id;
+	u8			network_hdr_type;
 };
 
 enum ib_cq_notify_flags {