Message ID | 1444925232-13598-6-git-send-email-matanb@mellanox.com (mailing list archive)
---|---
State | Superseded |
> + /* Use the hint from IP Stack to select GID Type */
> + network_gid_type = ib_network_to_gid_type(addr->dev_addr.network);
> + if (addr->dev_addr.network != RDMA_NETWORK_IB) {
> + route->path_rec->gid_type = network_gid_type;
> + /* TODO: get the hoplimit from the inet/inet6 device */
> + route->path_rec->hop_limit = IPV6_DEFAULT_HOPLIMIT;

Uh, that is more than a TODO, that is showing this is all messed up.

It isn't just the hop limit that has to come from the route entry, all
the source information of the path comes from there. I.e., the gid table
should accept the route entry directly and spit out the sgid_index.

The responder side is the same, it also needs to do a route lookup to
figure out what it is doing, and that may not match what the rx says
from the headers. This is important stuff.

I really don't like the API changes that went in with the last series
that added net_dev and gid_attr everywhere, that just seems to be
enabling mistakes like the above. You can't use rocev2 without doing
route lookups; providing APIs that don't force this to happen just
encourages broken flows like this.

Jason
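For illustration, taking the hop limit from the route entry that
addr4_resolve()/addr6_resolve() already hold (instead of falling back to
IPV6_DEFAULT_HOPLIMIT in the CMA) could look like the sketch below. The
hoplimit field on struct rdma_dev_addr and the helper name are
assumptions for illustration only, not part of the posted series:

#include <net/route.h>      /* ip4_dst_hoplimit() */
#include <net/ip6_route.h>  /* ip6_dst_hoplimit() */
#include <rdma/ib_addr.h>   /* struct rdma_dev_addr */

/* Record the hop limit the routing code would actually use for this
 * destination.  'hoplimit' is a hypothetical extra field in
 * struct rdma_dev_addr; cma_resolve_iboe_route() would then copy
 * addr->dev_addr.hoplimit into path_rec->hop_limit instead of the
 * IPV6_DEFAULT_HOPLIMIT constant.
 */
static void rdma_addr_set_hoplimit(struct rdma_dev_addr *dev_addr,
                                   struct dst_entry *dst, bool ipv4)
{
        dev_addr->hoplimit = ipv4 ? ip4_dst_hoplimit(dst) :
                                    ip6_dst_hoplimit(dst);
}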
On Mon, Nov 23, 2015 at 11:19 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
>> + /* Use the hint from IP Stack to select GID Type */
>> + network_gid_type = ib_network_to_gid_type(addr->dev_addr.network);
>> + if (addr->dev_addr.network != RDMA_NETWORK_IB) {
>> + route->path_rec->gid_type = network_gid_type;
>> + /* TODO: get the hoplimit from the inet/inet6 device */
>> + route->path_rec->hop_limit = IPV6_DEFAULT_HOPLIMIT;
>
> Uh, that is more than a TODO, that is showing this is all messed up.
>
> It isn't just the hop limit that has to come from the route entry, all
> the source information of the path comes from there. I.e., the gid table
> should accept the route entry directly and spit out the sgid_index.
>
> The responder side is the same, it also needs to do a route lookup to
> figure out what it is doing, and that may not match what the rx says
> from the headers. This is important stuff.
>

The only entity that translates between IPs and GIDs is the RDMACM.
The GID cache is like a database: it allows one to store, retrieve and
query GIDs and their attributes. roce_gid_mgmt is the part that
populates this "dumb" database. IMHO, adding such a "smart" layer to
the GID cache is wrong, as this should be part of the RDMACM, which
does the translation. No need to get the gid cache involved.

> I really don't like the API changes that went in with the last series
> that added net_dev and gid_attr everywhere, that just seems to be
> enabling mistakes like the above. You can't use rocev2 without doing
> route lookups; providing APIs that don't force this to happen just
> encourages broken flows like this.
>
> Jason
On Tue, Nov 24, 2015 at 03:47:51PM +0200, Matan Barak wrote:
> > It isn't just the hop limit that has to come from the route entry, all
> > the source information of the path comes from there. I.e., the gid table
> > should accept the route entry directly and spit out the sgid_index.
> >
> > The responder side is the same, it also needs to do a route lookup to
> > figure out what it is doing, and that may not match what the rx says
> > from the headers. This is important stuff.
> >
> The only entity that translates between IPs and GIDs is the RDMACM.

The rocev2 stuff is using IP, and the gid entry is now overloaded to
specify IP header fields.

Absolutely every determination of IP header fields needs to go through
the route table, so every single lookup that can return a rocev2 SGID
*MUST* use route data.

The places in this series where that isn't done are plainly and simply
wrong.

The abstraction at the gid cache is making it too easy to make this
mistake. It is enabling callers to do direct gid lookups without a
route lookup, which is unconditionally wrong. Every call site into the
gid cache I looked at appears to have this problem.

The simplest fix is to have a new gid cache API for rocev2 that
somehow forces/includes the necessary route lookup. The existing API
cannot simply be extended for rocev2.

> roce_gid_mgmt is the part that populates this "dumb" database.
> IMHO, adding such a "smart" layer to the GID cache is wrong, as this
> should be part of the RDMACM, which does the translation. No need to get
> the gid cache involved.

OK. Change the gid cache so only an RDMA CM private API can return
rocev2 gids.

Jason
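A minimal sketch of what such a route-coupled cache entry point might
look like, built on the ib_find_gid_by_filter() call this series adds;
the wrapper name and the choice to key on the route's egress netdevice
are illustrative assumptions, not part of the posted patches:

#include <net/dst.h>
#include <rdma/ib_cache.h>
#include <rdma/ib_verbs.h>

/* Only accept GID entries that live on the netdevice the route lookup
 * actually selected for this destination.
 */
static bool gid_matches_route(const union ib_gid *gid,
                              const struct ib_gid_attr *gid_attr,
                              void *context)
{
        const struct dst_entry *dst = context;

        return gid_attr->ndev == dst->dev;
}

/* Hypothetical helper: callers must hand in the dst entry from a route
 * lookup and get the sgid_index back, so a RoCE v2 SGID can never be
 * chosen without consulting the routing table.
 */
static int ib_find_roce_gid_by_route(struct ib_device *device, u8 port_num,
                                     const union ib_gid *sgid,
                                     struct dst_entry *dst, u16 *sgid_index)
{
        return ib_find_gid_by_filter(device, sgid, port_num,
                                     gid_matches_route, dst, sgid_index);
}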
On Tue, Nov 24, 2015 at 8:14 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Nov 24, 2015 at 03:47:51PM +0200, Matan Barak wrote:
>> > It isn't just the hop limit that has to come from the route entry, all
>> > the source information of the path comes from there. I.e., the gid table
>> > should accept the route entry directly and spit out the sgid_index.
>> >
>> > The responder side is the same, it also needs to do a route lookup to
>> > figure out what it is doing, and that may not match what the rx says
>> > from the headers. This is important stuff.
>> >
>> The only entity that translates between IPs and GIDs is the RDMACM.
>
> The rocev2 stuff is using IP, and the gid entry is now overloaded to
> specify IP header fields.
>

The GID entry is now overloaded to expose GID metadata, for example
ndev (for L2 Ethernet attributes) and GID type.

> Absolutely every determination of IP header fields needs to go through
> the route table, so every single lookup that can return a rocev2 SGID
> *MUST* use route data.
>
> The places in this series where that isn't done are plainly and simply
> wrong.
>

IMHO, the user is entitled to choose any valid sgid_index for the
interface. Anything he chooses is guaranteed to be valid (from a
security perspective), but isn't guaranteed to work if both sides
don't use IPs that can be routed successfully to the destination.

Why do we need to block users who use ibv_rc_pingpong and choose the
GID index correctly by hand?

> The abstraction at the gid cache is making it too easy to make this
> mistake. It is enabling callers to do direct gid lookups without a
> route lookup, which is unconditionally wrong. Every call site into the
> gid cache I looked at appears to have this problem.
>

We can and should guarantee rdma-cm users get the right GID every
time. I don't think we should block users from choosing either a
correct GID or an incorrect GID; that's up to them. We're only
providing a correct database that these users can query and a correct
rdma-cm model.

> The simplest fix is to have a new gid cache API for rocev2 that
> somehow forces/includes the necessary route lookup. The existing API
> cannot simply be extended for rocev2.
>
>> roce_gid_mgmt is the part that populates this "dumb" database.
>> IMHO, adding such a "smart" layer to the GID cache is wrong, as this
>> should be part of the RDMACM, which does the translation. No need to get
>> the gid cache involved.
>
> OK. Change the gid cache so only an RDMA CM private API can return
> rocev2 gids.
>

So you propose to block verbs applications from using RoCE v2 GIDs? Why?

> Jason

Matan
On Tue, Nov 24, 2015 at 09:07:41PM +0200, Matan Barak wrote:

> IMHO, the user is entitled to choose any valid sgid_index for the
> interface. Anything he chooses is guaranteed to be valid (from a
> security perspective)

No, the namespace patches will have to limit the sgid_indexes that can
be used with a QP to those that fall within the namespace. This is
another reason I don't like this approach for the kapi.

> Why do we need to block users who use ibv_rc_pingpong and choose the
> GID index correctly by hand?

I'm not really concerned with user space; we are stuck with exporting
the gid index there.

> > OK. Change the gid cache so only an RDMA CM private API can return
> > rocev2 gids.
>
> So you propose to block verbs applications from using RoCE v2 GIDs? Why?

Just the kernel consumers, so the in-kernel users are correct.

Jason
On Wed, Nov 25, 2015 at 8:55 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Nov 24, 2015 at 09:07:41PM +0200, Matan Barak wrote:
>
>> IMHO, the user is entitled to choose any valid sgid_index for the
>> interface. Anything he chooses is guaranteed to be valid (from a
>> security perspective)
>
> No, the namespace patches will have to limit the sgid_indexes that can
> be used with a QP to those that fall within the namespace. This is
> another reason I don't like this approach for the kapi.
>

By saying namespace, do you mean net namespaces? If so, the gid cache
allows searching by net device (and there's a "custom" search where
the user can define a filter function that can filter by net).
Anyway, I don't think this cache should be used as anything other than
a simple database.

>> Why do we need to block users who use ibv_rc_pingpong and choose the
>> GID index correctly by hand?
>
> I'm not really concerned with user space; we are stuck with exporting
> the gid index there.
>

So why do we need to block kernel applications from doing the same
things user-space applications can do?

>> > OK. Change the gid cache so only an RDMA CM private API can return
>> > rocev2 gids.
>>
>> So you propose to block verbs applications from using RoCE v2 GIDs? Why?
>
> Just the kernel consumers, so the in-kernel users are correct.
>

If there are kernel consumers that want to work with verbs directly,
they should use ib_init_ah_from_wc and ib_resolve_eth_dmac (or we can
rename that for other L2 attributes). The shared code shouldn't be in
the cache.

> Jason

Matan
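For reference, a minimal sketch of the kind of "filter by net" search
being referred to here, expressed against the ib_find_gid_by_filter()
API this series adds; the filter name and the idea of matching on the
GID entry's network namespace are illustrative assumptions:

#include <linux/netdevice.h>
#include <net/net_namespace.h>
#include <rdma/ib_cache.h>

/* Accept only GID entries whose underlying netdevice lives in the
 * caller's network namespace.
 */
static bool gid_in_netns(const union ib_gid *gid,
                         const struct ib_gid_attr *gid_attr,
                         void *context)
{
        struct net *net = context;

        return gid_attr->ndev && net_eq(dev_net(gid_attr->ndev), net);
}

/* Example use: restrict a GID lookup to one network namespace. */
static int find_gid_index_in_netns(struct ib_device *device, u8 port_num,
                                   const union ib_gid *gid, struct net *net,
                                   u16 *index)
{
        return ib_find_gid_by_filter(device, gid, port_num,
                                     gid_in_netns, net, index);
}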
On Wed, Nov 25, 2015 at 04:18:25PM +0200, Matan Barak wrote:
> On Wed, Nov 25, 2015 at 8:55 AM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > On Tue, Nov 24, 2015 at 09:07:41PM +0200, Matan Barak wrote:
> >
> >> IMHO, the user is entitled to choose any valid sgid_index for the
> >> interface. Anything he chooses is guaranteed to be valid (from a
> >> security perspective)
> >
> > No, the namespace patches will have to limit the sgid_indexes that can
> > be used with a QP to those that fall within the namespace. This is
> > another reason I don't like this approach for the kapi.
>
> By saying namespace, do you mean net namespaces?

Whatever it turns out to be. Haggai was talking about rdma namespaces
for some of this stuff too, but IMHO, rocev2 is pretty clearly covered
under net namespaces.

> If so, the gid cache allows searching by net device (and there's a
> "custom" search where the user can define a filter function that can
> filter by net).
> Anyway, I don't think this cache should be used as anything other than
> a simple database.

It has nothing to do with the cache, it is everywhere else: you can't
create a qp with a sgid_index that is not part of your namespace, for
instance, or receive a packet on a QP outside your namespace, etc.
Lots of details.

> >> Why do we need to block users who use ibv_rc_pingpong and choose the
> >> GID index correctly by hand?
> >
> > I'm not really concerned with user space; we are stuck with exporting
> > the gid index there.
>
> So why do we need to block kernel applications from doing the same
> things user-space applications can do?

As I explained, it is never correct to use a naked sgid_index with
rocev2. uverbs can't be fixed without a uapi change, but the kernel
can be.

> If there are kernel consumers that want to work with verbs directly,
> they should use ib_init_ah_from_wc and ib_resolve_eth_dmac (or we can

As I already said, these functions are wrong: they don't have the
routing lookup needed for rocev2. That is my whole point, the
functions that are using the gid cache for rocev2 are *not correct*.

I don't really care how you fix it, but every rocev2 sgid-index lookup
in the kernel must be accompanied by a route lookup.

I think the gid cache API design is wrong here because it doesn't
force the above, but whatever; if you choose a different API it
becomes your job to review every patch from now on to make sure other
people use your dangerous API properly.

Jason
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
>
> The abstraction at the gid cache is making it too easy to make this mistake. It
> is enabling callers to do direct gid lookups without a route lookup, which is
> unconditionally wrong. Every call site into the gid cache I looked at appears to
> have this problem.
>
> The simplest fix is to have a new gid cache API for rocev2 that somehow
> forces/includes the necessary route lookup. The existing API cannot simply
> be extended for rocev2.
>

I think that the GID cache should remain just that: a cache.
We shouldn't bloat it.
The CMA is the proper place to handle IP resolution.

> > roce_gid_mgmt is the part that populates this "dumb" database.
> > IMHO, adding such a "smart" layer to the GID cache is wrong, as this
> > should be part of the RDMACM, which does the translation. No need to get
> > the gid cache involved.
>
> OK. Change the gid cache so only an RDMA CM private API can return
> rocev2 gids.
>

The same cache is also used in IB and thus by other components, so it
cannot be a private CM API.
RoCE ULPs use the CMA for establishing connections, so route lookups
should be done from there.
--Liran
On Mon, Nov 30, 2015 at 10:56 PM, Liran Liss <liranl@mellanox.com> wrote:
>> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
>
>>
>> The abstraction at the gid cache is making it too easy to make this mistake. It
>> is enabling callers to do direct gid lookups without a route lookup, which is
>> unconditionally wrong. Every call site into the gid cache I looked at appears to
>> have this problem.
>>
>> The simplest fix is to have a new gid cache API for rocev2 that somehow
>> forces/includes the necessary route lookup. The existing API cannot simply
>> be extended for rocev2.
>>
>
> I think that the GID cache should remain just that: a cache.
> We shouldn't bloat it.
> The CMA is the proper place to handle IP resolution.
>

For the sake of validating that a proper route is chosen, we could add
route validation in ib_init_ah_from_wc() and ib_init_ah_from_path().

>> > roce_gid_mgmt is the part that populates this "dumb" database.
>> > IMHO, adding such a "smart" layer to the GID cache is wrong, as this
>> > should be part of the RDMACM, which does the translation. No need to get
>> > the gid cache involved.
>>
>> OK. Change the gid cache so only an RDMA CM private API can return
>> rocev2 gids.
>>
>
> The same cache is also used in IB and thus by other components, so it
> cannot be a private CM API.
> RoCE ULPs use the CMA for establishing connections, so route lookups
> should be done from there.
> --Liran
>
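A rough sketch of what such a validation inside ib_init_ah_from_wc()
might look like for the IPv4/RoCE v2 case: before trusting the sgid_index
derived from the received headers, check that the routing table would
send the reply out of the netdevice backing the chosen GID entry. The
helper is illustrative only (IPv6 handling and error reporting omitted),
not part of the posted series:

#include <linux/err.h>
#include <linux/netdevice.h>
#include <net/route.h>

/* Returns true if a reply from local_ip to peer_ip would leave through
 * the netdevice that owns the selected GID entry (gid_ndev).
 */
static bool reply_route_matches_gid(struct net *net, __be32 local_ip,
                                    __be32 peer_ip,
                                    struct net_device *gid_ndev)
{
        struct flowi4 fl4 = {
                .saddr = local_ip,
                .daddr = peer_ip,
        };
        struct rtable *rt;
        bool ok;

        rt = ip_route_output_key(net, &fl4);
        if (IS_ERR(rt))
                return false;

        ok = (rt->dst.dev == gid_ndev);
        ip_rt_put(rt);
        return ok;
}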
diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index d3c42b3..3e1f93c 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -257,6 +257,12 @@ static int addr4_resolve(struct sockaddr_in *src_in,
                 goto put;
         }
 
+        /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't
+         * routable) and we could set the network type accordingly.
+         */
+        if (rt->rt_uses_gateway)
+                addr->network = RDMA_NETWORK_IPV4;
+
         ret = dst_fetch_ha(&rt->dst, addr, &fl4.daddr);
 put:
         ip_rt_put(rt);
@@ -271,6 +277,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
 {
         struct flowi6 fl6;
         struct dst_entry *dst;
+        struct rt6_info *rt;
         int ret;
 
         memset(&fl6, 0, sizeof fl6);
@@ -282,6 +289,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
         if ((ret = dst->error))
                 goto put;
 
+        rt = (struct rt6_info *)dst;
         if (ipv6_addr_any(&fl6.saddr)) {
                 ret = ipv6_dev_get_saddr(&init_net, ip6_dst_idev(dst)->dev,
                                          &fl6.daddr, 0, &fl6.saddr);
@@ -305,6 +313,12 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
                 goto put;
         }
 
+        /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't
+         * routable) and we could set the network type accordingly.
+         */
+        if (rt->rt6i_flags & RTF_GATEWAY)
+                addr->network = RDMA_NETWORK_IPV6;
+
         ret = dst_fetch_ha(dst, addr, &fl6.daddr);
 put:
         dst_release(dst);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 2e592e6..c5d1685 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2253,6 +2253,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv)
 {
         struct rdma_route *route = &id_priv->id.route;
         struct rdma_addr *addr = &route->addr;
+        enum ib_gid_type network_gid_type;
         struct cma_work *work;
         int ret;
         struct net_device *ndev = NULL;
@@ -2291,7 +2292,15 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv)
         rdma_ip2gid((struct sockaddr *)&id_priv->id.route.addr.dst_addr,
                     &route->path_rec->dgid);
 
-        route->path_rec->hop_limit = 1;
+        /* Use the hint from IP Stack to select GID Type */
+        network_gid_type = ib_network_to_gid_type(addr->dev_addr.network);
+        if (addr->dev_addr.network != RDMA_NETWORK_IB) {
+                route->path_rec->gid_type = network_gid_type;
+                /* TODO: get the hoplimit from the inet/inet6 device */
+                route->path_rec->hop_limit = IPV6_DEFAULT_HOPLIMIT;
+        } else {
+                route->path_rec->hop_limit = 1;
+        }
         route->path_rec->reversible = 1;
         route->path_rec->pkey = cpu_to_be16(0xffff);
         route->path_rec->mtu_selector = IB_SA_EQ;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 8b4ade6..2f568ad 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -311,8 +311,61 @@ struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
 }
 EXPORT_SYMBOL(ib_create_ah);
 
+static int ib_get_header_version(const union rdma_network_hdr *hdr)
+{
+        const struct iphdr *ip4h = (struct iphdr *)&hdr->roce4grh;
+        struct iphdr ip4h_checked;
+        const struct ipv6hdr *ip6h = (struct ipv6hdr *)&hdr->ibgrh;
+
+        /* If it's IPv6, the version must be 6, otherwise, the first
+         * 20 bytes (before the IPv4 header) are garbled.
+         */
+        if (ip6h->version != 6)
+                return (ip4h->version == 4) ? 4 : 0;
+        /* version may be 6 or 4 because the first 20 bytes could be garbled */
+
+        /* RoCE v2 requires no options, thus header length
+           must be 5 words
+         */
+        if (ip4h->ihl != 5)
+                return 6;
+
+        /* Verify checksum.
+           We can't write on scattered buffers so we need to copy to
+           temp buffer.
+         */
+        memcpy(&ip4h_checked, ip4h, sizeof(ip4h_checked));
+        ip4h_checked.check = 0;
+        ip4h_checked.check = ip_fast_csum((u8 *)&ip4h_checked, 5);
+        /* if IPv4 header checksum is OK, believe it */
+        if (ip4h->check == ip4h_checked.check)
+                return 4;
+        return 6;
+}
+
+static enum rdma_network_type ib_get_net_type_by_grh(struct ib_device *device,
+                                                     u8 port_num,
+                                                     const struct ib_grh *grh)
+{
+        int grh_version;
+
+        if (rdma_protocol_ib(device, port_num))
+                return RDMA_NETWORK_IB;
+
+        grh_version = ib_get_header_version((union rdma_network_hdr *)grh);
+
+        if (grh_version == 4)
+                return RDMA_NETWORK_IPV4;
+
+        if (grh->next_hdr == IPPROTO_UDP)
+                return RDMA_NETWORK_IPV6;
+
+        return RDMA_NETWORK_ROCE_V1;
+}
+
 struct find_gid_index_context {
         u16 vlan_id;
+        enum ib_gid_type gid_type;
 };
 
 static bool find_gid_index(const union ib_gid *gid,
@@ -322,6 +375,9 @@ static bool find_gid_index(const union ib_gid *gid,
         struct find_gid_index_context *ctx =
                 (struct find_gid_index_context *)context;
 
+        if (ctx->gid_type != gid_attr->gid_type)
+                return false;
+
         if ((!!(ctx->vlan_id != 0xffff) == !is_vlan_dev(gid_attr->ndev)) ||
             (is_vlan_dev(gid_attr->ndev) &&
              vlan_dev_vlan_id(gid_attr->ndev) != ctx->vlan_id))
@@ -332,14 +388,49 @@ static bool find_gid_index(const union ib_gid *gid,
 
 static int get_sgid_index_from_eth(struct ib_device *device, u8 port_num,
                                    u16 vlan_id, const union ib_gid *sgid,
+                                   enum ib_gid_type gid_type,
                                    u16 *gid_index)
 {
-        struct find_gid_index_context context = {.vlan_id = vlan_id};
+        struct find_gid_index_context context = {.vlan_id = vlan_id,
+                                                 .gid_type = gid_type};
 
         return ib_find_gid_by_filter(device, sgid, port_num, find_gid_index,
                                      &context, gid_index);
 }
 
+static int get_gids_from_rdma_hdr(union rdma_network_hdr *hdr,
+                                  enum rdma_network_type net_type,
+                                  union ib_gid *sgid, union ib_gid *dgid)
+{
+        struct sockaddr_in src_in;
+        struct sockaddr_in dst_in;
+        __be32 src_saddr, dst_saddr;
+
+        if (!sgid || !dgid)
+                return -EINVAL;
+
+        if (net_type == RDMA_NETWORK_IPV4) {
+                memcpy(&src_in.sin_addr.s_addr,
+                       &hdr->roce4grh.saddr, 4);
+                memcpy(&dst_in.sin_addr.s_addr,
+                       &hdr->roce4grh.daddr, 4);
+                src_saddr = src_in.sin_addr.s_addr;
+                dst_saddr = dst_in.sin_addr.s_addr;
+                ipv6_addr_set_v4mapped(src_saddr,
+                                       (struct in6_addr *)sgid);
+                ipv6_addr_set_v4mapped(dst_saddr,
+                                       (struct in6_addr *)dgid);
+                return 0;
+        } else if (net_type == RDMA_NETWORK_IPV6 ||
+                   net_type == RDMA_NETWORK_IB) {
+                *dgid = hdr->ibgrh.dgid;
+                *sgid = hdr->ibgrh.sgid;
+                return 0;
+        } else {
+                return -EINVAL;
+        }
+}
+
 int ib_init_ah_from_wc(struct ib_device *device, u8 port_num,
                        const struct ib_wc *wc, const struct ib_grh *grh,
                        struct ib_ah_attr *ah_attr)
@@ -347,9 +438,25 @@ int ib_init_ah_from_wc(struct ib_device *device, u8 port_num,
         u32 flow_class;
         u16 gid_index;
         int ret;
+        enum rdma_network_type net_type = RDMA_NETWORK_IB;
+        enum ib_gid_type gid_type = IB_GID_TYPE_IB;
+        union ib_gid dgid;
+        union ib_gid sgid;
 
         memset(ah_attr, 0, sizeof *ah_attr);
         if (rdma_cap_eth_ah(device, port_num)) {
+                if (wc->wc_flags & IB_WC_WITH_NETWORK_HDR_TYPE)
+                        net_type = wc->network_hdr_type;
+                else
+                        net_type = ib_get_net_type_by_grh(device, port_num, grh);
+                gid_type = ib_network_to_gid_type(net_type);
+        }
+        ret = get_gids_from_rdma_hdr((union rdma_network_hdr *)grh, net_type,
+                                     &sgid, &dgid);
+        if (ret)
+                return ret;
+
+        if (rdma_protocol_roce(device, port_num)) {
                 u16 vlan_id = wc->wc_flags & IB_WC_WITH_VLAN ?
                                 wc->vlan_id : 0xffff;
 
@@ -358,7 +465,7 @@ int ib_init_ah_from_wc(struct ib_device *device, u8 port_num,
 
                 if (!(wc->wc_flags & IB_WC_WITH_SMAC) ||
                     !(wc->wc_flags & IB_WC_WITH_VLAN)) {
-                        ret = rdma_addr_find_dmac_by_grh(&grh->dgid, &grh->sgid,
+                        ret = rdma_addr_find_dmac_by_grh(&dgid, &sgid,
                                                          ah_attr->dmac,
                                                          wc->wc_flags & IB_WC_WITH_VLAN ?
                                                                  NULL : &vlan_id,
@@ -368,7 +475,7 @@ int ib_init_ah_from_wc(struct ib_device *device, u8 port_num,
                 }
 
                 ret = get_sgid_index_from_eth(device, port_num, vlan_id,
-                                              &grh->dgid, &gid_index);
+                                              &dgid, gid_type, &gid_index);
                 if (ret)
                         return ret;
 
@@ -383,10 +490,10 @@ int ib_init_ah_from_wc(struct ib_device *device, u8 port_num,
 
         if (wc->wc_flags & IB_WC_GRH) {
                 ah_attr->ah_flags = IB_AH_GRH;
-                ah_attr->grh.dgid = grh->sgid;
+                ah_attr->grh.dgid = sgid;
 
                 if (!rdma_cap_eth_ah(device, port_num)) {
-                        ret = ib_find_cached_gid_by_port(device, &grh->dgid,
+                        ret = ib_find_cached_gid_by_port(device, &dgid,
                                                          IB_GID_TYPE_IB,
                                                          port_num, NULL,
                                                          &gid_index);
@@ -1026,6 +1133,12 @@ int ib_resolve_eth_dmac(struct ib_qp *qp,
                         ret = -ENXIO;
                         goto out;
                 }
+                if (sgid_attr.gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP)
+                        /* TODO: get the hoplimit from the inet/inet6
+                         * device
+                         */
+                        qp_attr->ah_attr.grh.hop_limit =
+                                IPV6_DEFAULT_HOPLIMIT;
 
                 ifindex = sgid_attr.ndev->ifindex;
 
diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
index 17e4a8b..81e19d9 100644
--- a/include/rdma/ib_addr.h
+++ b/include/rdma/ib_addr.h
@@ -71,6 +71,7 @@ struct rdma_dev_addr {
         unsigned short dev_type;
         int bound_dev_if;
         enum rdma_transport_type transport;
+        enum rdma_network_type network;
 };
 
 /**
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 77906fe..dd1d901 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -50,6 +50,8 @@
 #include <linux/workqueue.h>
 #include <linux/socket.h>
 #include <uapi/linux/if_ether.h>
+#include <net/ipv6.h>
+#include <net/ip.h>
 #include <linux/atomic.h>
 #include <linux/mmu_notifier.h>
 
@@ -107,6 +109,35 @@ enum rdma_protocol_type {
 __attribute_const__ enum rdma_transport_type
 rdma_node_get_transport(enum rdma_node_type node_type);
 
+enum rdma_network_type {
+        RDMA_NETWORK_IB,
+        RDMA_NETWORK_ROCE_V1 = RDMA_NETWORK_IB,
+        RDMA_NETWORK_IPV4,
+        RDMA_NETWORK_IPV6
+};
+
+static inline enum ib_gid_type ib_network_to_gid_type(enum rdma_network_type network_type)
+{
+        if (network_type == RDMA_NETWORK_IPV4 ||
+            network_type == RDMA_NETWORK_IPV6)
+                return IB_GID_TYPE_ROCE_UDP_ENCAP;
+
+        /* IB_GID_TYPE_IB same as RDMA_NETWORK_ROCE_V1 */
+        return IB_GID_TYPE_IB;
+}
+
+static inline enum rdma_network_type ib_gid_to_network_type(enum ib_gid_type gid_type,
+                                                            union ib_gid *gid)
+{
+        if (gid_type == IB_GID_TYPE_IB)
+                return RDMA_NETWORK_IB;
+
+        if (ipv6_addr_v4mapped((struct in6_addr *)gid))
+                return RDMA_NETWORK_IPV4;
+        else
+                return RDMA_NETWORK_IPV6;
+}
+
 enum rdma_link_layer {
         IB_LINK_LAYER_UNSPECIFIED,
         IB_LINK_LAYER_INFINIBAND,
@@ -533,6 +564,17 @@ struct ib_grh {
         union ib_gid dgid;
 };
 
+union rdma_network_hdr {
+        struct ib_grh ibgrh;
+        struct {
+                /* The IB spec states that if it's IPv4, the header
+                 * is located in the last 20 bytes of the header.
+                 */
+                u8 reserved[20];
+                struct iphdr roce4grh;
+        };
+};
+
 enum {
         IB_MULTICAST_QPN = 0xffffff
 };
@@ -769,6 +811,7 @@ enum ib_wc_flags {
         IB_WC_IP_CSUM_OK        = (1<<3),
         IB_WC_WITH_SMAC         = (1<<4),
         IB_WC_WITH_VLAN         = (1<<5),
+        IB_WC_WITH_NETWORK_HDR_TYPE     = (1<<6),
 };
 
 struct ib_wc {
@@ -791,6 +834,7 @@ struct ib_wc {
         u8                      port_num;       /* valid only for DR SMPs on switches */
         u8                      smac[ETH_ALEN];
         u16                     vlan_id;
+        u8                      network_hdr_type;
 };
 
 enum ib_cq_notify_flags {
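The union rdma_network_hdr above leans on the GRH being 40 bytes long
with the IPv4 header occupying its last 20 bytes. A compile-time check
of that assumption could look like the following sketch; the helper is
illustrative only and not part of the patch:

#include <linux/bug.h>
#include <linux/stddef.h>
#include <rdma/ib_verbs.h>

/* The IPv4 "GRH" must start 20 bytes into the 40-byte GRH area so that
 * it ends exactly where the GRH ends.
 */
static inline void rdma_network_hdr_layout_check(void)
{
        BUILD_BUG_ON(sizeof(struct ib_grh) != 40);
        BUILD_BUG_ON(offsetof(union rdma_network_hdr, roce4grh) !=
                     sizeof(struct ib_grh) - sizeof(struct iphdr));
}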