Message ID: 44ab0dce-c7c9-400b-af24-10b8981358a7@CMEXHTCAS2.ad.emulex.com (mailing list archive)
State: Rejected
On 03/25/2015 02:19 PM, Somnath Kotur wrote:
> + if (cache->data_vec[ix].attr.ndev &&
> +     cache->data_vec[ix].attr.ndev != old_net_dev)

A few lines earlier the memory old_net_dev points at was freed. If two
instances of this function run concurrently, what prevents that the
old_net_dev memory has been reallocated and hence that attr.ndev ==
old_net_dev although both pointers refer(red) to different network
devices ?

> + ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq;

Invoking write_gid() is only safe if the caller serializes write_gid()
calls. Apparently the cache->lock mutex is used for that purpose. So why
is it necessary to use ACCESS_ONCE() here ? Why is it needed to prevent
that the compiler coalesces this write with another write into the same
structure ?

> + /* Make sure the sequence number we remeber was read

This looks like a typo - shouldn't the above read "remember" ?

BTW, the style of that comment is recommended only for networking code
and not for IB code. Have you verified this patch with checkpatch ?

> + mutex_lock(&cache->lock);
> +
> + for (ix = 0; ix < cache->sz; ix++)
> +     if (cache->data_vec[ix].attr.ndev == ndev)
> +         write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
> +
> + mutex_unlock(&cache->lock);
> + return 0;

The traditional Linux kernel coding style is one blank line before
mutex_lock() and after mutex_unlock() but not after mutex_lock() nor
before mutex_unlock().

> + orig_seq = ACCESS_ONCE(cache->data_vec[index].seq);
> + /* Make sure we read the sequence number before copying the
> +  * gid to local storage. */
> + smp_rmb();

Please use READ_ONCE() instead of ACCESS_ONCE() as recommended in
<linux/compiler.h>.
> +static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port)
> +{
> +     int i;
> +     struct ib_roce_gid_cache *cache =
> +         ib_dev->cache.roce_gid_cache[port - 1];
> +
> +     if (!cache)
> +         return;
> +
> +     for (i = 0; i < cache->sz; ++i) {
> +         if (memcmp(&cache->data_vec[i].gid, &zgid,
> +                    sizeof(cache->data_vec[i].gid)))
> +             write_gid(ib_dev, port, cache, i, &zgid, &zattr);
> +     }
> +     kfree(cache->data_vec);
> +     kfree(cache);
> +}

Overwriting data just before it is freed is not useful. Please use
CONFIG_SLUB_DEBUG=y to debug use-after-free issues instead of such code.

Bart.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Matan/Moni,
Could either of you please respond to both of Bart's queries?
Thanks
Somnath

> -----Original Message-----
> From: Bart Van Assche [mailto:bart.vanassche@sandisk.com]
> Sent: Thursday, March 26, 2015 5:13 AM
> To: Somnath Kotur; roland@kernel.org
> Cc: linux-rdma@vger.kernel.org; Matan Barak
> Subject: Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
>
> On 03/25/2015 02:19 PM, Somnath Kotur wrote:
> > + if (cache->data_vec[ix].attr.ndev &&
> > +     cache->data_vec[ix].attr.ndev != old_net_dev)
>
> A few lines earlier the memory old_net_dev points at was freed. If two
> instances of this function run concurrently, what prevents that the
> old_net_dev memory has been reallocated and hence that attr.ndev ==
> old_net_dev although both pointers refer(red) to different network
> devices ?
>
> > + ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq;
>
> Invoking write_gid() is only safe if the caller serializes write_gid()
> calls. Apparently the cache->lock mutex is used for that purpose. So why
> is it necessary to use ACCESS_ONCE() here ? Why is it needed to prevent
> that the compiler coalesces this write with another write into the same
> structure ?
>
> > + /* Make sure the sequence number we remeber was read
>
> This looks like a typo - shouldn't the above read "remember" ?
>
> BTW, the style of that comment is recommended only for networking code
> and not for IB code. Have you verified this patch with checkpatch ?
>
> > + mutex_lock(&cache->lock);
> > +
> > + for (ix = 0; ix < cache->sz; ix++)
> > +     if (cache->data_vec[ix].attr.ndev == ndev)
> > +         write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
> > +
> > + mutex_unlock(&cache->lock);
> > + return 0;
>
> The traditional Linux kernel coding style is one blank line before
> mutex_lock() and after mutex_unlock() but not after mutex_lock() nor
> before mutex_unlock().
>
> > + orig_seq = ACCESS_ONCE(cache->data_vec[index].seq);
> > + /* Make sure we read the sequence number before copying the
> > +  * gid to local storage. */
> > + smp_rmb();
>
> Please use READ_ONCE() instead of ACCESS_ONCE() as recommended in
> <linux/compiler.h>.
>
> > +static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port)
> > +{
> > +     int i;
> > +     struct ib_roce_gid_cache *cache =
> > +         ib_dev->cache.roce_gid_cache[port - 1];
> > +
> > +     if (!cache)
> > +         return;
> > +
> > +     for (i = 0; i < cache->sz; ++i) {
> > +         if (memcmp(&cache->data_vec[i].gid, &zgid,
> > +                    sizeof(cache->data_vec[i].gid)))
> > +             write_gid(ib_dev, port, cache, i, &zgid, &zattr);
> > +     }
> > +     kfree(cache->data_vec);
> > +     kfree(cache);
> > +}
>
> Overwriting data just before it is freed is not useful. Please use
> CONFIG_SLUB_DEBUG=y to debug use-after-free issues instead of such
> code.
>
> Bart.
> In order to manage multiple types, vlans and MACs per GID, we
> need to store them along the GID itself. We store the net device
> as well, as sometimes GIDs should be handled according to the
> net device they came from. Since populating the GID table should
> be identical for every RoCE provider, the GIDs table should be
> handled in ib_core.
>
> Adding a GID cache table that supports a lockless find, add and
> delete gids. The lockless nature comes from using a unique
> sequence number per table entry and detecting that while reading/
> writing this sequence wasn't changed.
>
> By using this RoCE GID cache table, providers must implement a
> modify_gid callback. The table is managed exclusively by
> this roce_gid_cache and the provider just need to write
> the data to the hardware.
>
> Signed-off-by: Matan Barak <matanb@mellanox.com>
> Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
> ---
>  drivers/infiniband/core/Makefile         |   3 +-
>  drivers/infiniband/core/core_priv.h      |  24 ++
>  drivers/infiniband/core/roce_gid_cache.c | 518

Why does RoCE need such a complex gid cache? If a gid cache is needed at
all, why should it be restricted to RoCE only? And why is such a complex
synchronization scheme needed? Seriously, how many times will GIDs
change and how many readers at once do you expect to have?

> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 65994a1..1866595 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -64,6 +64,36 @@ union ib_gid {
>         } global;
>  };
>
> +extern union ib_gid zgid;
> +
> +enum ib_gid_type {
> +        /* If link layer is Ethernet, this is RoCE V1 */

I don't understand this comment. Does RoCE v2 not run on Ethernet?

> +        IB_GID_TYPE_IB = 0,
> +        IB_GID_TYPE_ROCE_V2 = 1,
> +        IB_GID_TYPE_SIZE
> +};

Can you explain the purpose of defining a 'GID type'? A GID is just a
global address. Why does it matter to anyone using it how it was
constructed?

> +
> +struct ib_gid_attr {
> +        enum ib_gid_type gid_type;
> +        struct net_device *ndev;
> +};
> +
> +struct ib_roce_gid_cache_entry {
> +        /* seq number of 0 indicates entry being changed. */
> +        unsigned int seq;
> +        union ib_gid gid;
> +        struct ib_gid_attr attr;
> +        void *context;
> +};
> +
> +struct ib_roce_gid_cache {
> +        int active;
> +        int sz;
> +        /* locking against multiple writes in data_vec */
> +        struct mutex lock;
> +        struct ib_roce_gid_cache_entry *data_vec;
> +};
> +
>  enum rdma_node_type {
>         /* IB values map to NodeInfo:NodeType. */
>         RDMA_NODE_IB_CA = 1,
> @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
>         IB_PORT_BOOT_MGMT_SUP = 1 << 23,
>         IB_PORT_LINK_LATENCY_SUP = 1 << 24,
>         IB_PORT_CLIENT_REG_SUP = 1 << 25,
> -        IB_PORT_IP_BASED_GIDS = 1 << 26
> +        IB_PORT_IP_BASED_GIDS = 1 << 26,
> +        IB_PORT_ROCE = 1 << 27,
> +        IB_PORT_ROCE_V2 = 1 << 28,

Why does RoCE suddenly require a port capability bit? RoCE runs today
without setting any bit.
Hi Sean,

> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty@intel.com]
> Sent: Wednesday, April 08, 2015 6:00 AM
> To: Somnath Kotur; roland@kernel.org
> Cc: linux-rdma@vger.kernel.org; Matan Barak
> Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
>
> > In order to manage multiple types, vlans and MACs per GID, we need to
> > store them along the GID itself. We store the net device as well, as
> > sometimes GIDs should be handled according to the net device they came
> > from. Since populating the GID table should be identical for every
> > RoCE provider, the GIDs table should be handled in ib_core.
> >
> > Adding a GID cache table that supports a lockless find, add and delete
> > gids. The lockless nature comes from using a unique sequence number
> > per table entry and detecting that while reading/ writing this
> > sequence wasn't changed.
> >
> > By using this RoCE GID cache table, providers must implement a
> > modify_gid callback. The table is managed exclusively by this
> > roce_gid_cache and the provider just need to write the data to the
> > hardware.
> >
> > Signed-off-by: Matan Barak <matanb@mellanox.com>
> > Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
> > ---
> >  drivers/infiniband/core/Makefile         |   3 +-
> >  drivers/infiniband/core/core_priv.h      |  24 ++
> >  drivers/infiniband/core/roce_gid_cache.c | 518
>
> Why does RoCE need such a complex gid cache? If a gid cache is needed at
> all, why should it be restricted to RoCE only? And why is such a complex
> synchronization scheme needed? Seriously, how many times will GIDs
> change and how many readers at once do you expect to have?
>
> > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index
> > 65994a1..1866595 100644
> > --- a/include/rdma/ib_verbs.h
> > +++ b/include/rdma/ib_verbs.h
> > @@ -64,6 +64,36 @@ union ib_gid {
> >         } global;
> >  };
> >
> > +extern union ib_gid zgid;
> > +
> > +enum ib_gid_type {
> > +        /* If link layer is Ethernet, this is RoCE V1 */
>
> I don't understand this comment. Does RoCE v2 not run on Ethernet?
>
Yes, this comment probably could use a reword.

> > +        IB_GID_TYPE_IB = 0,
> > +        IB_GID_TYPE_ROCE_V2 = 1,
> > +        IB_GID_TYPE_SIZE
> > +};
>
> Can you explain the purpose of defining a 'GID type'. A GID is just a
> global address. Why does it matter to anyone using it how it was
> constructed?

This is part of the RoCE V2 specification. Please refer to Section A17.8.
The GID type determines the protocol used for outbound packet generation,
i.e. RoCE V1 (0x8915 Ethertype) or RoCE V2 (IPv4 or IPv6).

> > +
> > +struct ib_gid_attr {
> > +        enum ib_gid_type gid_type;
> > +        struct net_device *ndev;
> > +};
> > +
> > +struct ib_roce_gid_cache_entry {
> > +        /* seq number of 0 indicates entry being changed. */
> > +        unsigned int seq;
> > +        union ib_gid gid;
> > +        struct ib_gid_attr attr;
> > +        void *context;
> > +};
> > +
> > +struct ib_roce_gid_cache {
> > +        int active;
> > +        int sz;
> > +        /* locking against multiple writes in data_vec */
> > +        struct mutex lock;
> > +        struct ib_roce_gid_cache_entry *data_vec;
> > +};
> > +
> >  enum rdma_node_type {
> >         /* IB values map to NodeInfo:NodeType. */
> >         RDMA_NODE_IB_CA = 1,
> > @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
> >         IB_PORT_BOOT_MGMT_SUP = 1 << 23,
> >         IB_PORT_LINK_LATENCY_SUP = 1 << 24,
> >         IB_PORT_CLIENT_REG_SUP = 1 << 25,
> > -        IB_PORT_IP_BASED_GIDS = 1 << 26
> > +        IB_PORT_IP_BASED_GIDS = 1 << 26,
> > +        IB_PORT_ROCE = 1 << 27,
> > +        IB_PORT_ROCE_V2 = 1 << 28,
>
> Why does RoCE suddenly require a port capability bit? RoCE runs today
> without setting any bit.

Again, this is part of the RoCE V2 spec; please refer to Section A17.5.1,
Query HCA (pasting a snippet below):

    A new "RoCE Supported" capability bit shall be added to the Port
    Attributes list. This capability bit applies exclusively to ports of
    the new "RoCEv2" type.

Thanks
Som
On Wed, Apr 8, 2015 at 2:30 AM, Hefty, Sean <sean.hefty@intel.com> wrote:
>> In order to manage multiple types, vlans and MACs per GID, we
>> need to store them along the GID itself. We store the net device
>> as well, as sometimes GIDs should be handled according to the
>> net device they came from. Since populating the GID table should
>> be identical for every RoCE provider, the GIDs table should be
>> handled in ib_core.
>>
>> Adding a GID cache table that supports a lockless find, add and
>> delete gids. The lockless nature comes from using a unique
>> sequence number per table entry and detecting that while reading/
>> writing this sequence wasn't changed.
>>
>> By using this RoCE GID cache table, providers must implement a
>> modify_gid callback. The table is managed exclusively by
>> this roce_gid_cache and the provider just need to write
>> the data to the hardware.
>>
>> Signed-off-by: Matan Barak <matanb@mellanox.com>
>> Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
>> ---
>>  drivers/infiniband/core/Makefile         |   3 +-
>>  drivers/infiniband/core/core_priv.h      |  24 ++
>>  drivers/infiniband/core/roce_gid_cache.c | 518
>
> Why does RoCE need such a complex gid cache? If a gid cache is needed at
> all, why should it be restricted to RoCE only? And why is such a complex
> synchronization scheme needed? Seriously, how many times will GIDs
> change and how many readers at once do you expect to have?
>

A GID cache is also implemented for link layer IB. However, for RoCE the
GID cache is also the manager of the table. This means that adding or
removing entries from the GID table is under the responsibility of the
cache and not the HW/device driver. This is a new scheme that frees each
vendor's driver from dealing with net and inet events.
The content of the GID table is much more dynamic for RoCE than for IB,
and so is access to the table, so I guess that extra mechanism is
required. The fact that a GID entry is associated with net_device and
inet_addr objects that can be modified/deleted at any time is an example.

>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>> index 65994a1..1866595 100644
>> --- a/include/rdma/ib_verbs.h
>> +++ b/include/rdma/ib_verbs.h
>> @@ -64,6 +64,36 @@ union ib_gid {
>>         } global;
>>  };
>>
>> +extern union ib_gid zgid;
>> +
>> +enum ib_gid_type {
>> +        /* If link layer is Ethernet, this is RoCE V1 */
>
> I don't understand this comment. Does RoCE v2 not run on Ethernet?
>
>> +        IB_GID_TYPE_IB = 0,
>> +        IB_GID_TYPE_ROCE_V2 = 1,
>> +        IB_GID_TYPE_SIZE
>> +};
>
> Can you explain the purpose of defining a 'GID type'. A GID is just a
> global address. Why does it matter to anyone using it how it was
> constructed?
>
>> +
>> +struct ib_gid_attr {
>> +        enum ib_gid_type gid_type;
>> +        struct net_device *ndev;
>> +};
>> +
>> +struct ib_roce_gid_cache_entry {
>> +        /* seq number of 0 indicates entry being changed. */
>> +        unsigned int seq;
>> +        union ib_gid gid;
>> +        struct ib_gid_attr attr;
>> +        void *context;
>> +};
>> +
>> +struct ib_roce_gid_cache {
>> +        int active;
>> +        int sz;
>> +        /* locking against multiple writes in data_vec */
>> +        struct mutex lock;
>> +        struct ib_roce_gid_cache_entry *data_vec;
>> +};
>> +
>>  enum rdma_node_type {
>>         /* IB values map to NodeInfo:NodeType. */
>>         RDMA_NODE_IB_CA = 1,
>> @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
>>         IB_PORT_BOOT_MGMT_SUP = 1 << 23,
>>         IB_PORT_LINK_LATENCY_SUP = 1 << 24,
>>         IB_PORT_CLIENT_REG_SUP = 1 << 25,
>> -        IB_PORT_IP_BASED_GIDS = 1 << 26
>> +        IB_PORT_IP_BASED_GIDS = 1 << 26,
>> +        IB_PORT_ROCE = 1 << 27,
>> +        IB_PORT_ROCE_V2 = 1 << 28,
>
> Why does RoCE suddenly require a port capability bit? RoCE runs today
> without setting any bit.
> Yes, this comment probably could use a reword..
> > > +        IB_GID_TYPE_IB = 0,
> > > +        IB_GID_TYPE_ROCE_V2 = 1,
> > > +        IB_GID_TYPE_SIZE
> > > +};
> >
> > Can you explain the purpose of defining a 'GID type'. A GID is just a
> > global address. Why does it matter to anyone using it how it was
> > constructed?
>
> This is part of RoCE V2 Specification. Please refer to Section A 17.8 .
> The GID Type determines the protocol for outbound packet generation i.e
> RoCE V1 (0x8915 Ether Type) or RoCEV2 (IPv4 or IPv6)

This isn't an interface for the RoCE specification. Why does this need to
be added to the verbs interface? It hasn't been needed by apps yet, and I
don't see why the apps should be made to care now how the GID is
formatted.

> > > @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
> > >         IB_PORT_BOOT_MGMT_SUP = 1 << 23,
> > >         IB_PORT_LINK_LATENCY_SUP = 1 << 24,
> > >         IB_PORT_CLIENT_REG_SUP = 1 << 25,
> > > -        IB_PORT_IP_BASED_GIDS = 1 << 26
> > > +        IB_PORT_IP_BASED_GIDS = 1 << 26,
> > > +        IB_PORT_ROCE = 1 << 27,
> > > +        IB_PORT_ROCE_V2 = 1 << 28,
> >
> > Why does RoCE suddenly require a port capability bit? RoCE runs today
> > without setting any bit.
> Again, this is part of RoCE V2 SPEC, please refer to Section A17.5.1-
> Query HCA (pasting snippet below)
> A new "RoCE Supported" capability bit shall be added to the Port
> Attributes list. This capability bit applies exclusively to ports of
> the new "RoCEv2" type

Same comment as above.
On 4/14/2015 2:50 AM, Hefty, Sean wrote:
>> Yes, this comment probably could use a reword..
>>>> +        IB_GID_TYPE_IB = 0,
>>>> +        IB_GID_TYPE_ROCE_V2 = 1,
>>>> +        IB_GID_TYPE_SIZE
>>>> +};
>>>
>>> Can you explain the purpose of defining a 'GID type'. A GID is just a
>>> global address. Why does it matter to anyone using it how it was
>>> constructed?
>>
>> This is part of RoCE V2 Specification. Please refer to Section A 17.8 .
>> The GID Type determines the protocol for outbound packet generation i.e
>> RoCE V1 (0x8915 Ether Type) or RoCEV2 (IPv4 or IPv6)
>
> This isn't an interface for the RoCE specification. Why does this need
> to be added to the verbs interface? It hasn't been needed by apps yet,
> and I don't see why the apps should be made to care now how the GID is
> formatted.
>

This is a part of the GID meta info. The user should be able to choose
between RoCE V1 (which is represented here by IB_GID_TYPE_IB) and RoCE V2
- just as a user could choose between IPv6 and IPv4.

>>>> @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
>>>>         IB_PORT_BOOT_MGMT_SUP = 1 << 23,
>>>>         IB_PORT_LINK_LATENCY_SUP = 1 << 24,
>>>>         IB_PORT_CLIENT_REG_SUP = 1 << 25,
>>>> -        IB_PORT_IP_BASED_GIDS = 1 << 26
>>>> +        IB_PORT_IP_BASED_GIDS = 1 << 26,
>>>> +        IB_PORT_ROCE = 1 << 27,
>>>> +        IB_PORT_ROCE_V2 = 1 << 28,
>>>
>>> Why does RoCE suddenly require a port capability bit? RoCE runs today
>>> without setting any bit.
>> Again, this is part of RoCE V2 SPEC, please refer to Section A17.5.1-
>> Query HCA (pasting snippet below)
>> A new "RoCE Supported" capability bit shall be added to the Port
>> Attributes list. This capability bit applies exclusively to ports of
>> the new "RoCEv2" type
>
> Same comment as above.
On 3/26/2015 1:42 AM, Bart Van Assche wrote:
> On 03/25/2015 02:19 PM, Somnath Kotur wrote:
>> + if (cache->data_vec[ix].attr.ndev &&
>> +     cache->data_vec[ix].attr.ndev != old_net_dev)
>
> A few lines earlier the memory old_net_dev points at was freed. If two
> instances of this function run concurrently, what prevents that the
> old_net_dev memory has been reallocated and hence that attr.ndev ==
> old_net_dev although both pointers refer(red) to different network
> devices ?
>

write_gid is *almost* always called under a mutex. The only case where it
is not protected is in free_roce_gid_cache. free_roce_gid_cache is called
only in the error flow of roce_gid_cache_setup_one, when no concurrent
write_gid could be made (as the cache isn't set up yet), and in
roce_gid_cache_cleanup_one. roce_gid_cache_client_setup_one is called in
the error flow of roce_gid_cache_client_setup_one (where no other
write_gid calls are expected, for the same reason as above) and in
roce_gid_cache_client_cleanup_work_handler, where it's called after
flush_workqueue(roce_gid_mgmt_wq). Since all write_gid calls go through
roce_gid_mgmt_wq, and we set the cache to inactive mode before flushing
the wq and freeing the cache, I think we can conclude that no concurrent
write_gid calls on the same cache are possible.

>> + ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq;
>
> Invoking write_gid() is only safe if the caller serializes write_gid()
> calls. Apparently the cache->lock mutex is used for that purpose. So why
> is it necessary to use ACCESS_ONCE() here ? Why is it needed to prevent
> that the compiler coalesces this write with another write into the same
> structure ?
>

The mutex only serializes cache writes. Cache reads can be done
concurrently with writes and are protected by the ACCESS_ONCE.

>> + /* Make sure the sequence number we remeber was read
>
> This looks like a typo - shouldn't the above read "remember" ?
>

Will be fixed in V4, thanks.

> BTW, the style of that comment is recommended only for networking code
> and not for IB code. Have you verified this patch with checkpatch ?
>

Of course, and I've just re-run checkpatch on this patch. It doesn't
catch this.

>> + mutex_lock(&cache->lock);
>> +
>> + for (ix = 0; ix < cache->sz; ix++)
>> +     if (cache->data_vec[ix].attr.ndev == ndev)
>> +         write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
>> +
>> + mutex_unlock(&cache->lock);
>> + return 0;
>
> The traditional Linux kernel coding style is one blank line before
> mutex_lock() and after mutex_unlock() but not after mutex_lock() nor
> before mutex_unlock().
>

I didn't find this in the CodingStyle doc. Could you please quote or post
a link?

>> + orig_seq = ACCESS_ONCE(cache->data_vec[index].seq);
>> + /* Make sure we read the sequence number before copying the
>> +  * gid to local storage. */
>> + smp_rmb();
>
> Please use READ_ONCE() instead of ACCESS_ONCE() as recommended in
> <linux/compiler.h>.
>

Ok, I'll change that in V4. I see that READ_ONCE and WRITE_ONCE differ
from ACCESS_ONCE only for aggregated data types (which isn't our case),
but it won't hurt to change that.

>> +static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port)
>> +{
>> +     int i;
>> +     struct ib_roce_gid_cache *cache =
>> +         ib_dev->cache.roce_gid_cache[port - 1];
>> +
>> +     if (!cache)
>> +         return;
>> +
>> +     for (i = 0; i < cache->sz; ++i) {
>> +         if (memcmp(&cache->data_vec[i].gid, &zgid,
>> +                    sizeof(cache->data_vec[i].gid)))
>> +             write_gid(ib_dev, port, cache, i, &zgid, &zattr);
>> +     }
>> +     kfree(cache->data_vec);
>> +     kfree(cache);
>> +}
>
> Overwriting data just before it is freed is not useful. Please use
> CONFIG_SLUB_DEBUG=y to debug use-after-free issues instead of such code.
>

It's mandatory, as write_gid with zgid might cause the vendor driver to
free memory it allocated for this GID entry (as in the mlx4 case).

> Bart.

Thanks for the review :)

Matan
On 04/14/15 15:23, Matan Barak wrote:
>>> + mutex_lock(&cache->lock);
>>> +
>>> + for (ix = 0; ix < cache->sz; ix++)
>>> +     if (cache->data_vec[ix].attr.ndev == ndev)
>>> +         write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
>>> +
>>> + mutex_unlock(&cache->lock);
>>> + return 0;
>>
>> The traditional Linux kernel coding style is one blank line before
>> mutex_lock() and after mutex_unlock() but not after mutex_lock() nor
>> before mutex_unlock().
>>
>
> I didn't find this in the CodingStyle doc. Could you please quote or
> post a link?

Hello Matan,

I'm not aware of any formal documentation of this style guideline. But if
you look around in the Linux kernel tree you will see that most kernel
code follows this style.

Bart.
> This is a part of the GID meta info. The user should be able to choose
> between RoCE V1 (which is represented here by IB_GID_TYPE_IB) and RoCE
> V2 - just as a user could choose between IPv6 and IPv4.

IPv4 and IPv6 are different protocols, not different formats for the same
address. How does RoCE v2 not break every app? This isn't like asking the
user to choose between IPv4 versus IPv6; it's asking them to choose
between IPv4 assigned by DHCP versus IPv4 assigned statically.
> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty@intel.com]
> Sent: Tuesday, April 14, 2015 11:02 PM
> To: Matan Barak; Somnath Kotur; roland@kernel.org
> Cc: linux-rdma@vger.kernel.org
> Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
>
> > This is a part of the GID meta info. The user should be able to choose
> > between RoCE V1 (which is represented here by IB_GID_TYPE_IB) and RoCE
> > V2 - just as a user could choose between IPv6 and IPv4.
>
> IPv4 and IPv6 are different protocols, not different formats for the
> same address. How does RoCE v2 not break every app?

It does not break every app; the choice of which GID type to use is made
by the RDMA-CM based on a network topology hint obtained from the IP
stack. Please refer to patch 15/33: IB/Core: Changes to the IB Core
infrastructure for RoCEv2 support.
Of course, if the user does not want to go with this choice made by the
RDMA-CM, then there is the option of overriding it using the configfs
patch (PATCH 14/33).
Hope that clarifies?

Thanks
Som
> It does not break every app, the choice of which GID type to use is made
> by the RDMA-CM based on network topology hint obtained from the IP stack.
> Please refer to patch 15/33: IB/Core: Changes to the IB Core
> infrastructure for RoCEv2 support.
> Of course, if the user does not want to go with this choice made by the
> RDMA-CM, then there is the option of overriding it using the configfs
> patch (PATCH 14/33)
> Hope that clarifies?

RoCE v2 is really Infiniband over UDP over IP. Why don't we just call it
IBoUDP like it is?

IBoUDP changes the Ethertype, replaces the network header, adds a new
transport protocol header, and layers IB over that. This change should be
exposed properly and not as just a new GID type.
IMHO, it would be good to have a physical layer representation in the
naming convention.

-----Original Message-----
From: linux-rdma-owner@vger.kernel.org
[mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Hefty, Sean
Sent: Wednesday, April 15, 2015 12:08 PM
To: Somnath Kotur; Matan Barak; roland@kernel.org
Cc: linux-rdma@vger.kernel.org
Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

> It does not break every app, the choice of which GID type to use is
> made by the RDMA-CM based on network topology hint obtained from the
> IP stack.
> Please refer to patch 15/33: IB/Core: Changes to the IB Core
> infrastructure for RoCEv2 support.
> Of course, if the user does not want to go with this choice made by
> the RDMA-CM, then there is the option of overriding it using the
> configfs patch (PATCH 14/33) Hope that clarifies?

RoCE v2 is really Infiniband over UDP over IP. Why don't we just call it
IBoUDP like it is?

IBoUDP changes the Ethertype, replaces the network header, adds a new
transport protocol header, and layers IB over that. This change should be
exposed properly and not as just a new GID type.
AFAIK, RoCE v2 is the known and official name. Why would we like to come up with a customized name? These are indeed two different protocols, thus the comparison to DHCP assigned addresses and static addresses is (to say the least) a bit off. Even when comparing IPv4 and IPv6, the most significant user change is the sockaddr_in and sockaddr_in6 addresses. IMHO, since the GID format is identical, the changes could be encoded in a gid_type argument. The gid_type is an inherent part of the address, making identical gids with different gid_types as two different addresses, as expected. On 4/15/2015 7:21 PM, Suri Shelvapille wrote: > IMHO, it would be good to have a physical layer representation in the naming convention. > > -----Original Message----- > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Hefty, Sean > Sent: Wednesday, April 15, 2015 12:08 PM > To: Somnath Kotur; Matan Barak; roland@kernel.org > Cc: linux-rdma@vger.kernel.org > Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache > >> It does not break every app, the choice of which GID type to use is >> made by the RDMA-CM based on network topology hint obtained from the IP stack. >> Please refer to patch 15/33: IB/Core: Changes to the IB Core >> infrastructure for RoCEv2 support. >> Of course, if the user does not want to go with this choice made by >> the RDMA-CM, then there is the option of overriding it using the >> configfs patch (PATCH 14/33) Hope that clarifies? > > RoCE v2 is really Infiniband over UDP over IP. Why don't we just call it IBoUDP like it is? > > IBoUDP changes the Ethertype, replaces the network header, adds a new transport protocol header, and layers IB over that. This change should be exposed properly and not as just a new GID type. 
On Wed, Apr 15, 2015 at 7:08 PM, Hefty, Sean <sean.hefty@intel.com> wrote:
>> It does not break every app, the choice of which GID type to use is made
>> by the RDMA-CM based on network topology hint obtained from the IP stack.
>> Please refer to patch 15/33: IB/Core: Changes to the IB Core
>> infrastructure for RoCEv2 support.
>> Of course, if the user does not want to go with this choice made by the
>> RDMA-CM, then there is the option of overriding it using the configfs
>> patch (PATCH 14/33)
>> Hope that clarifies?
>
> RoCE v2 is really Infiniband over UDP over IP. Why don't we just call it IBoUDP like it is?

RoCEv2 is the name in the IBTA spec (Annex 17).

> IBoUDP changes the Ethertype, replaces the network header, adds a new transport protocol header, and layers IB over that. This change should be exposed properly and not as just a new GID type.

I don't understand what you are suggesting here. Can you give an example?
>> RoCE v2 is really Infiniband over UDP over IP. Why don't we just call
>> it IBoUDP like it is?
>
> RoCEv2 is the name in the IBTA spec (Annex 17)

We call RoCE IBoE in the kernel, because that's what it is. RoCE is an IBTA marketing name. Looking through the Annex, I don't see where Ethernet is even a requirement for this technology to work. The IB transport is layered over a standard UDP header. I do see where the spec calls out updating the IP header, but that's it.

Regardless of what it's called, it replaces the underlying network and transport protocols, versus IB-classic or IBoE/RoCE. That should be captured properly, not by saying there's a new GID type. RoCE v2 doesn't even use GIDs as part of its protocol. It uses UDP/IP addresses.

>> IBoUDP changes the Ethertype, replaces the network header, adds a new
>> transport protocol header, and layers IB over that. This change should be
>> exposed properly and not as just a new GID type.
>
> I don't understand what do you suggest here. Can you give an example?

I don't have a solution here. Please look at Michael Wang's patch series and see how this would fit into that model. The introduction of iWarp required defining a new 'transport' type. IBoE added a new link layer. Based on those changes, this would warrant introducing a new network layer, so that it can be distinguished properly from the other options. Maybe that's the right approach?

Cisco's NIC reports a transport layer of 'USNIC_UDP', which should really just be 'UDP'. That NIC supports UDP/IP/Ethernet, based on the rdma stack's model. RoCE v2 is also UDP/IP/Ethernet, it only layers IB over that. (This makes the use of the term 'transport' confusing. Maybe there should also be a 'session' protocol?) It seems completely reasonable for a device which does IB/UDP/IP/Ethernet to someday expose the UDP/IP/Ethernet portion (if it doesn't already), and from the same port at the same time.
Rather than continuing to try to make everything look like an IB-classic device because it's convenient, the stack needs to start exposing things properly. I don't know what the right solution should be, but trying to capture this level of detail as a different GID type definitely looks like a step in the wrong direction. - Sean
On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur <somnath.kotur@emulex.com> wrote:
> From: Matan Barak <matanb@mellanox.com>
>
> In order to manage multiple types, vlans and MACs per GID, we
> need to store them along the GID itself. We store the net device
> as well, as sometimes GIDs should be handled according to the
> net device they came from. Since populating the GID table should
> be identical for every RoCE provider, the GIDs table should be
> handled in ib_core.
>
> Adding a GID cache table that supports a lockless find, add and
> delete gids. The lockless nature comes from using a unique
> sequence number per table entry and detecting that while reading/
> writing this sequence wasn't changed.

Matan, please use an existing mechanism that fits the problem you are trying to solve; I guess one of RCU or seqlock should do the job.

Or.
On 4/26/2015 8:20 PM, Or Gerlitz wrote:
> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
> <somnath.kotur@emulex.com> wrote:
>> From: Matan Barak <matanb@mellanox.com>
>>
>> In order to manage multiple types, vlans and MACs per GID, we
>> need to store them along the GID itself. We store the net device
>> as well, as sometimes GIDs should be handled according to the
>> net device they came from. Since populating the GID table should
>> be identical for every RoCE provider, the GIDs table should be
>> handled in ib_core.
>>
>> Adding a GID cache table that supports a lockless find, add and
>> delete gids. The lockless nature comes from using a unique
>> sequence number per table entry and detecting that while reading/
>> writing this sequence wasn't changed.
>
> Matan, please use existing mechanism which fits the problem you are
> trying to solve, I guess one of RCU or seqlock should do the job.

seqcount fits this problem better: if a write and a read are done in parallel, there's a good chance we read an out-of-date entry and use a GID entry that's going to change at T+epsilon anyway, so RCU doesn't really have an advantage here.

The current implementation is a bit more efficient than seqcount, as it allows early termination of a read-while-write (because the write puts a known "currently updating" value that the read knows to ignore). AFAIK, this doesn't exist in the current seqcount implementation. However, since this isn't a crucial data path, I'll change that to seqcount.

seqcount is preferred over seqlock, as I don't need the spinlock in seqlock.

> Or.

Matan
On Mon, Apr 27, 2015 at 10:32 AM, Matan Barak <matanb@mellanox.com> wrote:
> On 4/26/2015 8:20 PM, Or Gerlitz wrote:
>> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
>> <somnath.kotur@emulex.com> wrote:
>>> From: Matan Barak <matanb@mellanox.com>
>>> In order to manage multiple types, vlans and MACs per GID, we
>>> need to store them along the GID itself. We store the net device
>>> as well, as sometimes GIDs should be handled according to the
>>> net device they came from. Since populating the GID table should
>>> be identical for every RoCE provider, the GIDs table should be
>>> handled in ib_core.
>>>
>>> Adding a GID cache table that supports a lockless find, add and
>>> delete gids. The lockless nature comes from using a unique
>>> sequence number per table entry and detecting that while reading/
>>> writing this sequence wasn't changed.
>> Matan, please use existing mechanism which fits the problem you are
>> trying to solve, I guess one of RCU or seqlock should do the job.
> seqcount fits this problem better. Since if a write and read are done in
> parallel, there's a good chance we read an out of date entry and we are
> going to use a GID entry that's going to change in T+epsilon, so RCU doesn't
> really have an advantage here.

So going back to the problem... we are talking about reads done by applications/drivers that attempt to establish new connections, and writes done on behalf of IP stack changes; both are very much not critical path. So this is kind of similar to the neighbour table maintained by the ND subsystem, which is used by all IP-based networking applications, and that code uses RCU. I don't see what's wrong with RCU for our sort of smaller-scale subsystem, or even with the simple rwlock which is the mechanism used today by the IB core GID cache; this is getting too complex, for no reason that I can think of.
> The current implementation is a bit more efficient than seqcount, as it
> allows early termination of read-while-write (because the write puts a known
> "currently updating" value that the read knows to ignore). AFAIK, this
> doesn't exist in the current seqcount implementation. However, since this
> isn't a crucial data-path, I'll change that to seqcount.
>
> seqcount is preferred over seqlock, as I don't need the spinlock in seqlock.
On 4/27/2015 9:22 PM, Or Gerlitz wrote: > On Mon, Apr 27, 2015 at 10:32 AM, Matan Barak <matanb@mellanox.com> wrote: >> On 4/26/2015 8:20 PM, Or Gerlitz wrote: >>> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur >>> <somnath.kotur@emulex.com> wrote: >>>> From: Matan Barak <matanb@mellanox.com> > >>>> In order to manage multiple types, vlans and MACs per GID, we >>>> need to store them along the GID itself. We store the net device >>>> as well, as sometimes GIDs should be handled according to the >>>> net device they came from. Since populating the GID table should >>>> be identical for every RoCE provider, the GIDs table should be >>>> handled in ib_core. >>>> >>>> Adding a GID cache table that supports a lockless find, add and >>>> delete gids. The lockless nature comes from using a unique >>>> sequence number per table entry and detecting that while reading/ >>>> writing this sequence wasn't changed. > >>> Matan, please use existing mechanism which fits the problem you are >>> trying to solve, I guess one of RCU or seqlock should do the job. > >> seqcount fits this problem better. Since if a write and read are done in >> parallel, there's a good chance we read an out of date entry and we are >> going to use a GID entry that's going to change in T+epsilon, so RCU doesn't >> really have an advantage here. > > So going back to the problem... we are talking on applications/drivers > that attempt to establish new connections doing reads and writes done > on behalf of IP stack changes, both are very much not critical path. > So this is kind of similar to the neighbour table maintained by ND > subsystem which is used by all IP based networking applications and > that code uses RCU. I don't see what's wrong with RCU for our sort > smaller scale subsystem and what is even wrong with simple rwlock > which is the mechanism used today by the IB core git cache, this goes > too complex and for no reason that I can think of. 
>

I think the real question is why we should deal with RCU, which would require re-allocation of entries when it's not necessary, or why we would want to use an rwlock when the kernel provides a mechanism (called seqcount) that fits this problem better. I disagree about seqcount being complex: if you look at its API, you'll find it's a lot simpler than RCU's.

>
>> The current implementation is a bit more efficient than seqcount, as it
>> allows early termination of read-while-write (because the write puts a known
>> "currently updating" value that the read knows to ignore). AFAIK, this
>> doesn't exist in the current seqcount implementation. However, since this
>> isn't a crucial data-path, I'll change that to seqcount.
>>
>> seqcount is preferred over seqlock, as I don't need the spinlock in seqlock.
On Tue, Apr 28, 2015 at 10:17 AM, Matan Barak <matanb@mellanox.com> wrote:
> On 4/27/2015 9:22 PM, Or Gerlitz wrote:
> I think the real question is why to deal with RCUs that will require
> re-allocation of entries when it's not necessary or why do we want to use
> rwlock if the kernel provides a mechanism (called seqcount) that fits this
> problem better?
> I disagree about seqcount being complex - if you look at its API you'll find
> it's a lot simpler than RCU.

I took a second look; seqcount is indeed way simpler than RCU, and by itself is simple to use. If you feel this provides a better solution vs. a simple rwlock, I'm good with that.

Or.
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index acf7367..9b63bdf 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -9,7 +9,8 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o \
 					$(user_access-y)
 
 ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
-				device.o fmr_pool.o cache.o netlink.o
+				device.o fmr_pool.o cache.o netlink.o \
+				roce_gid_cache.o
 
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 87d1936..a502daa 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -35,6 +35,7 @@
 
 #include <linux/list.h>
 #include <linux/spinlock.h>
+#include <net/net_namespace.h>
 
 #include <rdma/ib_verbs.h>
 
@@ -51,4 +52,27 @@ void ib_cache_cleanup(void);
 
 int ib_resolve_eth_l2_attrs(struct ib_qp *qp,
 			    struct ib_qp_attr *qp_attr, int *qp_attr_mask);
+
+int roce_gid_cache_get_gid(struct ib_device *ib_dev, u8 port, int index,
+			   union ib_gid *gid, struct ib_gid_attr *attr);
+
+int roce_gid_cache_find_gid(struct ib_device *ib_dev, union ib_gid *gid,
+			    enum ib_gid_type gid_type, struct net *net,
+			    int if_index, u8 *port, u16 *index);
+
+int roce_gid_cache_find_gid_by_port(struct ib_device *ib_dev, union ib_gid *gid,
+				    enum ib_gid_type gid_type, u8 port,
+				    struct net *net, int if_index, u16 *index);
+
+int roce_gid_cache_is_active(struct ib_device *ib_dev, u8 port);
+
+int roce_add_gid(struct ib_device *ib_dev, u8 port,
+		 union ib_gid *gid, struct ib_gid_attr *attr);
+
+int roce_del_gid(struct ib_device *ib_dev, u8 port,
+		 union ib_gid *gid, struct ib_gid_attr *attr);
+
+int roce_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
+			     struct net_device *ndev);
+
 #endif /* _CORE_PRIV_H */
diff --git a/drivers/infiniband/core/roce_gid_cache.c b/drivers/infiniband/core/roce_gid_cache.c
new file mode 100644
index 0000000..80f364a
--- /dev/null
+++ b/drivers/infiniband/core/roce_gid_cache.c
@@ -0,0 +1,518 @@
+/*
+ * Copyright (c) 2015, Mellanox Technologies inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/rtnetlink.h>
+#include <rdma/ib_cache.h>
+
+#include "core_priv.h"
+
+union ib_gid zgid;
+EXPORT_SYMBOL_GPL(zgid);
+
+static const struct ib_gid_attr zattr;
+
+enum gid_attr_find_mask {
+	GID_ATTR_FIND_MASK_GID_TYPE	= 1UL << 0,
+	GID_ATTR_FIND_MASK_NETDEV	= 1UL << 1,
+};
+
+static inline int start_port(struct ib_device *ib_dev)
+{
+	return (ib_dev->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : 1;
+}
+
+struct dev_put_rcu {
+	struct rcu_head		rcu;
+	struct net_device	*ndev;
+};
+
+static void put_ndev(struct rcu_head *rcu)
+{
+	struct dev_put_rcu *put_rcu =
+		container_of(rcu, struct dev_put_rcu, rcu);
+
+	dev_put(put_rcu->ndev);
+	kfree(put_rcu);
+}
+
+static int write_gid(struct ib_device *ib_dev, u8 port,
+		     struct ib_roce_gid_cache *cache, int ix,
+		     const union ib_gid *gid,
+		     const struct ib_gid_attr *attr)
+{
+	unsigned int orig_seq;
+	int ret;
+	struct dev_put_rcu *put_rcu;
+	struct net_device *old_net_dev;
+
+	orig_seq = cache->data_vec[ix].seq;
+	cache->data_vec[ix].seq = -1;
+	/* Ensure that all readers will see invalid sequence
+	 * identifier before starting the actual GID update.
+	 */
+	smp_wmb();
+
+	ret = ib_dev->modify_gid(ib_dev, port, ix, gid, attr,
+				 &cache->data_vec[ix].context);
+
+	old_net_dev = cache->data_vec[ix].attr.ndev;
+	if (old_net_dev && old_net_dev != attr->ndev) {
+		put_rcu = kmalloc(sizeof(*put_rcu), GFP_KERNEL);
+		if (put_rcu) {
+			put_rcu->ndev = old_net_dev;
+			call_rcu(&put_rcu->rcu, put_ndev);
+		} else {
+			pr_warn("roce_gid_cache: can't allocate rcu context, using synchronize\n");
+			synchronize_rcu();
+			dev_put(old_net_dev);
+		}
+	}
+	/* if modify_gid failed, just delete the old gid */
+	if (ret || !memcmp(gid, &zgid, sizeof(*gid))) {
+		gid = &zgid;
+		attr = &zattr;
+		cache->data_vec[ix].context = NULL;
+	}
+	memcpy(&cache->data_vec[ix].gid, gid, sizeof(*gid));
+	memcpy(&cache->data_vec[ix].attr, attr, sizeof(*attr));
+	if (cache->data_vec[ix].attr.ndev &&
+	    cache->data_vec[ix].attr.ndev != old_net_dev)
+		dev_hold(cache->data_vec[ix].attr.ndev);
+
+	/* Ensure that all cached gid data updating is finished before
+	 * marking the entry as available.
+	 */
+	smp_wmb();
+
+	if (++orig_seq == (unsigned int)-1)
+		orig_seq = 0;
+	ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq;
+
+	if (!ret) {
+		struct ib_event event;
+
+		event.device		= ib_dev;
+		event.element.port_num	= port;
+		event.event		= IB_EVENT_GID_CHANGE;
+
+		ib_dispatch_event(&event);
+	}
+	return ret;
+}
+
+static int find_gid(struct ib_roce_gid_cache *cache, union ib_gid *gid,
+		    const struct ib_gid_attr *val, unsigned long mask)
+{
+	int i;
+	unsigned int orig_seq;
+
+	for (i = 0; i < cache->sz; i++) {
+		struct ib_gid_attr *attr = &cache->data_vec[i].attr;
+
+		orig_seq = cache->data_vec[i].seq;
+		if (orig_seq == -1)
+			continue;
+		/* Make sure the sequence number we remember was read
+		 * before the gid cache entry content is read.
+		 */
+		smp_rmb();
+
+		if (mask & GID_ATTR_FIND_MASK_GID_TYPE &&
+		    attr->gid_type != val->gid_type)
+			continue;
+
+		if (memcmp(gid, &cache->data_vec[i].gid, sizeof(*gid)))
+			continue;
+
+		if (mask & GID_ATTR_FIND_MASK_NETDEV &&
+		    attr->ndev != val->ndev)
+			continue;
+
+		/* We have a match, verify that the data we
+		 * compared is valid. Make sure that the
+		 * sequence number we read is the last to be
+		 * read.
+		 */
+		smp_rmb();
+		if (orig_seq == ACCESS_ONCE(cache->data_vec[i].seq))
+			return i;
+		/* The sequence number changed under our feet,
+		 * the GID entry is invalid. Continue to the
+		 * next entry.
+		 */
+	}
+
+	return -1;
+}
+
+int roce_add_gid(struct ib_device *ib_dev, u8 port,
+		 union ib_gid *gid, struct ib_gid_attr *attr)
+{
+	struct ib_roce_gid_cache *cache;
+	int ix;
+	int ret = 0;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return -ENOSYS;
+
+	cache = ib_dev->cache.roce_gid_cache[port - start_port(ib_dev)];
+
+	if (!cache || !cache->active)
+		return -ENOSYS;
+
+	if (!memcmp(gid, &zgid, sizeof(*gid)))
+		return -EINVAL;
+
+	mutex_lock(&cache->lock);
+
+	ix = find_gid(cache, gid, attr, GID_ATTR_FIND_MASK_GID_TYPE |
+		      GID_ATTR_FIND_MASK_NETDEV);
+	if (ix >= 0)
+		goto out_unlock;
+
+	ix = find_gid(cache, &zgid, NULL, 0);
+	if (ix < 0) {
+		ret = -ENOSPC;
+		goto out_unlock;
+	}
+
+	write_gid(ib_dev, port, cache, ix, gid, attr);
+
+out_unlock:
+	mutex_unlock(&cache->lock);
+	return ret;
+}
+
+int roce_del_gid(struct ib_device *ib_dev, u8 port,
+		 union ib_gid *gid, struct ib_gid_attr *attr)
+{
+	struct ib_roce_gid_cache *cache;
+	int ix;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return 0;
+
+	cache = ib_dev->cache.roce_gid_cache[port - start_port(ib_dev)];
+
+	if (!cache || !cache->active)
+		return -ENOSYS;
+
+	mutex_lock(&cache->lock);
+
+	ix = find_gid(cache, gid, attr,
+		      GID_ATTR_FIND_MASK_GID_TYPE |
+		      GID_ATTR_FIND_MASK_NETDEV);
+	if (ix < 0)
+		goto out_unlock;
+
+	write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
+
+out_unlock:
+	mutex_unlock(&cache->lock);
+	return 0;
+}
+
+int roce_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
+			     struct net_device *ndev)
+{
+	struct ib_roce_gid_cache *cache;
+	int ix;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return 0;
+
+	cache = ib_dev->cache.roce_gid_cache[port - start_port(ib_dev)];
+
+	if (!cache || !cache->active)
+		return -ENOSYS;
+
+	mutex_lock(&cache->lock);
+
+	for (ix = 0; ix < cache->sz; ix++)
+		if (cache->data_vec[ix].attr.ndev == ndev)
+			write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
+
+	mutex_unlock(&cache->lock);
+	return 0;
+}
+
+int roce_gid_cache_get_gid(struct ib_device *ib_dev, u8 port, int index,
+			   union ib_gid *gid, struct ib_gid_attr *attr)
+{
+	struct ib_roce_gid_cache *cache;
+	union ib_gid local_gid;
+	struct ib_gid_attr local_attr;
+	unsigned int orig_seq;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return -EINVAL;
+
+	cache = ib_dev->cache.roce_gid_cache[port - start_port(ib_dev)];
+
+	if (!cache || !cache->active)
+		return -ENOSYS;
+
+	if (index < 0 || index >= cache->sz)
+		return -EINVAL;
+
+	orig_seq = ACCESS_ONCE(cache->data_vec[index].seq);
+	/* Make sure we read the sequence number before copying the
+	 * gid to local storage. */
+	smp_rmb();
+
+	memcpy(&local_gid, &cache->data_vec[index].gid, sizeof(local_gid));
+	memcpy(&local_attr, &cache->data_vec[index].attr, sizeof(local_attr));
+	/* Ensure the local copy completed reading before verifying
+	 * the new sequence number. */
+	smp_rmb();
+
+	if (orig_seq == -1 ||
+	    orig_seq != ACCESS_ONCE(cache->data_vec[index].seq))
+		return -EAGAIN;
+
+	memcpy(gid, &local_gid, sizeof(*gid));
+	if (attr)
+		memcpy(attr, &local_attr, sizeof(*attr));
+	return 0;
+}
+
+static int _roce_gid_cache_find_gid(struct ib_device *ib_dev, union ib_gid *gid,
+				    const struct ib_gid_attr *val,
+				    unsigned long mask,
+				    u8 *port, u16 *index)
+{
+	struct ib_roce_gid_cache *cache;
+	u8 p;
+	int local_index;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return -ENOENT;
+
+	for (p = 0; p < ib_dev->phys_port_cnt; p++) {
+		if (rdma_port_get_link_layer(ib_dev, p + start_port(ib_dev)) !=
+		    IB_LINK_LAYER_ETHERNET)
+			continue;
+		cache = ib_dev->cache.roce_gid_cache[p];
+		if (!cache || !cache->active)
+			continue;
+		local_index = find_gid(cache, gid, val, mask);
+		if (local_index >= 0) {
+			if (index)
+				*index = local_index;
+			if (port)
+				*port = p + start_port(ib_dev);
+			return 0;
+		}
+	}
+
+	return -ENOENT;
+}
+
+static int get_netdev_from_ifindex(struct net *net, int if_index,
+				   struct ib_gid_attr *gid_attr_val)
+{
+	if (if_index && net) {
+		rcu_read_lock();
+		gid_attr_val->ndev = dev_get_by_index_rcu(net, if_index);
+		rcu_read_unlock();
+		if (gid_attr_val->ndev)
+			return GID_ATTR_FIND_MASK_NETDEV;
+	}
+	return 0;
+}
+
+int roce_gid_cache_find_gid(struct ib_device *ib_dev, union ib_gid *gid,
+			    enum ib_gid_type gid_type, struct net *net,
+			    int if_index, u8 *port, u16 *index)
+{
+	unsigned long mask = GID_ATTR_FIND_MASK_GID |
+			     GID_ATTR_FIND_MASK_GID_TYPE;
+	struct ib_gid_attr gid_attr_val = {.gid_type = gid_type};
+
+	mask |= get_netdev_from_ifindex(net, if_index, &gid_attr_val);
+
+	return _roce_gid_cache_find_gid(ib_dev, gid, &gid_attr_val,
+					mask, port, index);
+}
+
+int roce_gid_cache_find_gid_by_port(struct ib_device *ib_dev, union ib_gid *gid,
+				    enum ib_gid_type gid_type, u8 port,
+				    struct net *net, int if_index, u16 *index)
+{
+	int local_index;
+	struct ib_roce_gid_cache *cache;
+	unsigned long mask = GID_ATTR_FIND_MASK_GID_TYPE;
+	struct ib_gid_attr val = {.gid_type = gid_type};
+
+	if (!ib_dev->cache.roce_gid_cache || port < start_port(ib_dev) ||
+	    port >= (start_port(ib_dev) + ib_dev->phys_port_cnt))
+		return -ENOENT;
+
+	cache = ib_dev->cache.roce_gid_cache[port - start_port(ib_dev)];
+	if (!cache || !cache->active)
+		return -ENOENT;
+
+	mask |= get_netdev_from_ifindex(net, if_index, &val);
+
+	local_index = find_gid(cache, gid, &val, mask);
+	if (local_index >= 0) {
+		if (index)
+			*index = local_index;
+		return 0;
+	}
+
+	return -ENOENT;
+}
+
+static struct ib_roce_gid_cache *alloc_roce_gid_cache(int sz)
+{
+	struct ib_roce_gid_cache *cache =
+		kzalloc(sizeof(struct ib_roce_gid_cache), GFP_KERNEL);
+	if (!cache)
+		return NULL;
+
+	cache->data_vec = kcalloc(sz, sizeof(*cache->data_vec), GFP_KERNEL);
+	if (!cache->data_vec)
+		goto err_free_cache;
+
+	mutex_init(&cache->lock);
+
+	cache->sz = sz;
+
+	return cache;
+
+err_free_cache:
+	kfree(cache);
+	return NULL;
+}
+
+static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port)
+{
+	int i;
+	struct ib_roce_gid_cache *cache =
+		ib_dev->cache.roce_gid_cache[port - 1];
+
+	if (!cache)
+		return;
+
+	for (i = 0; i < cache->sz; ++i) {
+		if (memcmp(&cache->data_vec[i].gid, &zgid,
+			   sizeof(cache->data_vec[i].gid)))
+			write_gid(ib_dev, port, cache, i, &zgid, &zattr);
+	}
+	kfree(cache->data_vec);
+	kfree(cache);
+}
+
+static void set_roce_gid_cache_active(struct ib_roce_gid_cache *cache,
+				      int active)
+{
+	if (!cache)
+		return;
+
+	cache->active = active;
+}
+
+static int roce_gid_cache_setup_one(struct ib_device *ib_dev)
+{
+	u8 port;
+	int err = 0;
+
+	if (!ib_dev->modify_gid)
+		return -ENOSYS;
+
+	ib_dev->cache.roce_gid_cache =
+		kcalloc(ib_dev->phys_port_cnt,
+			sizeof(*ib_dev->cache.roce_gid_cache), GFP_KERNEL);
+
+	if (!ib_dev->cache.roce_gid_cache) {
+		pr_warn("failed to allocate roce addr cache for %s\n",
+			ib_dev->name);
+		return -ENOMEM;
+	}
+
+	for (port = 0; port < ib_dev->phys_port_cnt; port++) {
+		if (rdma_port_get_link_layer(ib_dev, port + start_port(ib_dev))
+		    != IB_LINK_LAYER_ETHERNET)
+			continue;
+		ib_dev->cache.roce_gid_cache[port] =
+			alloc_roce_gid_cache(ib_dev->gid_tbl_len[port]);
+		if (!ib_dev->cache.roce_gid_cache[port]) {
+			err = -ENOMEM;
+			goto rollback_cache_setup;
+		}
+	}
+	return 0;
+
+rollback_cache_setup:
+	for (port = 1; port <= ib_dev->phys_port_cnt; port++)
+		free_roce_gid_cache(ib_dev, port);
+
+	kfree(ib_dev->cache.roce_gid_cache);
+	ib_dev->cache.roce_gid_cache = NULL;
+	return err;
+}
+
+static void roce_gid_cache_cleanup_one(struct ib_device *ib_dev)
+{
+	u8 port;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return;
+
+	for (port = 1; port <= ib_dev->phys_port_cnt; port++)
+		free_roce_gid_cache(ib_dev, port);
+
+	kfree(ib_dev->cache.roce_gid_cache);
+	ib_dev->cache.roce_gid_cache = NULL;
+}
+
+static void roce_gid_cache_set_active_state(struct ib_device *ib_dev,
+					    int active)
+{
+	u8 port;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return;
+
+	for (port = 0; port < ib_dev->phys_port_cnt; port++)
+		set_roce_gid_cache_active(ib_dev->cache.roce_gid_cache[port],
+					  active);
+}
+
+int roce_gid_cache_is_active(struct ib_device *ib_dev, u8 port)
+{
+	return ib_dev->cache.roce_gid_cache &&
+	       ib_dev->cache.roce_gid_cache[port - start_port(ib_dev)]->active;
+}
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 5261665..6fa5e49 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -93,8 +93,6 @@ static void init_query_mad(struct ib_smp *mad)
 	mad->method = IB_MGMT_METHOD_GET;
 }
 
-static union ib_gid zgid;
-
 static int check_flow_steering_support(struct mlx4_dev *dev)
 {
 	int eth_num_ports = 0;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 65994a1..1866595 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -64,6 +64,36 @@ union ib_gid {
 	} global;
 };
 
+extern union ib_gid zgid;
+
+enum ib_gid_type {
+	/* If link layer is Ethernet, this is RoCE V1 */
+	IB_GID_TYPE_IB		= 0,
+	IB_GID_TYPE_ROCE_V2	= 1,
+	IB_GID_TYPE_SIZE
+};
+
+struct ib_gid_attr {
+	enum ib_gid_type	gid_type;
+	struct net_device	*ndev;
+};
+
+struct ib_roce_gid_cache_entry {
+	/* seq number of 0 indicates entry being changed. */
+	unsigned int		seq;
+	union ib_gid		gid;
+	struct ib_gid_attr	attr;
+	void			*context;
+};
+
+struct ib_roce_gid_cache {
+	int			active;
+	int			sz;
+	/* locking against multiple writes in data_vec */
+	struct mutex		lock;
+	struct ib_roce_gid_cache_entry *data_vec;
+};
+
 enum rdma_node_type {
 	/* IB values map to NodeInfo:NodeType. */
 	RDMA_NODE_IB_CA		= 1,
@@ -265,7 +295,9 @@ enum ib_port_cap_flags {
 	IB_PORT_BOOT_MGMT_SUP			= 1 << 23,
 	IB_PORT_LINK_LATENCY_SUP		= 1 << 24,
 	IB_PORT_CLIENT_REG_SUP			= 1 << 25,
-	IB_PORT_IP_BASED_GIDS			= 1 << 26
+	IB_PORT_IP_BASED_GIDS			= 1 << 26,
+	IB_PORT_ROCE				= 1 << 27,
+	IB_PORT_ROCE_V2				= 1 << 28,
 };
 
 enum ib_port_width {
@@ -1431,6 +1463,7 @@ struct ib_cache {
 	struct ib_pkey_cache  **pkey_cache;
 	struct ib_gid_cache   **gid_cache;
 	u8                     *lmc_cache;
+	struct ib_roce_gid_cache **roce_gid_cache;
 };
 
 struct ib_dma_mapping_ops {
@@ -1506,6 +1539,26 @@ struct ib_device {
 	int		           (*query_gid)(struct ib_device *device,
 						u8 port_num, int index,
 						union ib_gid *gid);
+	/* When calling modify_gid, the HW vendor's driver should
+	 * modify the gid of device @device at gid index @index of
+	 * port @port to be @gid. Meta-info of that gid (for example,
+	 * the network device related to this gid) is available
+	 * at @attr. @context allows the HW vendor driver to store extra
+	 * information together with a GID entry. The HW vendor may allocate
+	 * memory to contain this information and store it in @context when a
+	 * new GID entry is written to. Upon the deletion of a GID entry,
+	 * the HW vendor must free any allocated memory. The caller will clear
+	 * @context afterwards. GID deletion is done by passing the zero gid.
+	 * Params are consistent until the next call of modify_gid.
+	 * The function should return 0 on success or error otherwise.
+	 * The function could be called concurrently for different ports.
+	 */
+	int		           (*modify_gid)(struct ib_device *device,
+						 u8 port_num,
+						 unsigned int index,
+						 const union ib_gid *gid,
+						 const struct ib_gid_attr *attr,
+						 void **context);
 	int		           (*query_pkey)(struct ib_device *device,
 						 u8 port_num, u16 index, u16 *pkey);
 	int		           (*modify_device)(struct ib_device *device,