
[v3,for-next,01/33] IB/core: Add RoCE GID cache

Message ID 44ab0dce-c7c9-400b-af24-10b8981358a7@CMEXHTCAS2.ad.emulex.com (mailing list archive)
State Rejected

Commit Message

Somnath Kotur March 25, 2015, 9:19 p.m. UTC
From: Matan Barak <matanb@mellanox.com>

In order to manage multiple types, vlans and MACs per GID, we
need to store them along with the GID itself. We store the net device
as well, as sometimes GIDs should be handled according to the
net device they came from. Since populating the GID table should
be identical for every RoCE provider, the GID table should be
handled in ib_core.

Add a GID cache table that supports lockless find, add and
delete of GIDs. The lockless nature comes from using a unique
sequence number per table entry and detecting, while reading/
writing, that this sequence hasn't changed.

To use this RoCE GID cache table, providers must implement a
modify_gid callback. The table is managed exclusively by
roce_gid_cache and the provider just needs to write
the data to the hardware.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
---
 drivers/infiniband/core/Makefile         |   3 +-
 drivers/infiniband/core/core_priv.h      |  24 ++
 drivers/infiniband/core/roce_gid_cache.c | 518 +++++++++++++++++++++++++++++++
 drivers/infiniband/hw/mlx4/main.c        |   2 -
 include/rdma/ib_verbs.h                  |  55 +++-
 5 files changed, 598 insertions(+), 4 deletions(-)
 create mode 100644 drivers/infiniband/core/roce_gid_cache.c
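
A minimal sketch of the per-entry sequence-number scheme described in the
commit message (the read side only; identifiers and the helper name are
illustrative, the real code is in roce_gid_cache.c below):

	/* Illustrative read side: seq == -1 marks an entry a writer is
	 * currently updating.  Readers re-check seq after copying.
	 */
	static int read_entry_sketch(struct ib_roce_gid_cache_entry *e,
				     union ib_gid *gid)
	{
		unsigned int seq = ACCESS_ONCE(e->seq);

		if (seq == (unsigned int)-1)
			return -EAGAIN;	/* writer in progress */
		smp_rmb();		/* read seq before the entry data */
		memcpy(gid, &e->gid, sizeof(*gid));
		smp_rmb();		/* read data before re-checking seq */
		if (seq != ACCESS_ONCE(e->seq))
			return -EAGAIN;	/* entry changed underneath, retry */
		return 0;
	}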

Comments

Bart Van Assche March 25, 2015, 11:42 p.m. UTC | #1
On 03/25/2015 02:19 PM, Somnath Kotur wrote:
> +	if (cache->data_vec[ix].attr.ndev &&
> +	    cache->data_vec[ix].attr.ndev != old_net_dev)

A few lines earlier the memory old_net_dev points at was freed. If two 
instances of this function run concurrently, what prevents that the 
old_net_dev memory has been reallocated and hence that attr.ndev == 
old_net_dev although both pointers refer(red) to different network devices ?

> +	ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq;

Invoking write_gid() is only safe if the caller serializes write_gid() 
calls. Apparently the cache->lock mutex is used for that purpose. So why 
is it necessary to use ACCESS_ONCE() here ? Why is it needed to prevent 
that the compiler coalesces this write with another write into the same 
structure ?

> +		/* Make sure the sequence number we remeber was read

This looks like a typo - shouldn't the above read "remember" ?

BTW, the style of that comment is recommended only for networking code 
and not for IB code. Have you verified this patch with checkpatch ?

> +	mutex_lock(&cache->lock);
> +
> +	for (ix = 0; ix < cache->sz; ix++)
> +		if (cache->data_vec[ix].attr.ndev == ndev)
> +			write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
> +
> +	mutex_unlock(&cache->lock);
> +	return 0;

The traditional Linux kernel coding style is one blank line before 
mutex_lock() and after mutex_unlock() but not after mutex_lock() nor 
before mutex_unlock().
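
Concretely, the suggested layout would look like this (illustrative only):

	/* ... preceding code ... */

	mutex_lock(&cache->lock);
	for (ix = 0; ix < cache->sz; ix++)
		if (cache->data_vec[ix].attr.ndev == ndev)
			write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
	mutex_unlock(&cache->lock);

	return 0;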

> +	orig_seq = ACCESS_ONCE(cache->data_vec[index].seq);
> +	/* Make sure we read the sequence number before copying the
> +	 * gid to local storage. */
> +	smp_rmb();

Please use READ_ONCE() instead of ACCESS_ONCE() as recommended in 
<linux/compiler.h>.
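
For example, the accesses quoted above would become (an illustrative sketch
of the suggested substitution, not the final patch code):

	/* reader side */
	orig_seq = READ_ONCE(cache->data_vec[index].seq);

	/* writer side: the matching accessor is WRITE_ONCE() */
	WRITE_ONCE(cache->data_vec[ix].seq, orig_seq);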

> +static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port)
> +{
> +	int i;
> +	struct ib_roce_gid_cache *cache =
> +		ib_dev->cache.roce_gid_cache[port - 1];
> +
> +	if (!cache)
> +		return;
> +
> +	for (i = 0; i < cache->sz; ++i) {
> +		if (memcmp(&cache->data_vec[i].gid, &zgid,
> +			   sizeof(cache->data_vec[i].gid)))
> +		    write_gid(ib_dev, port, cache, i, &zgid, &zattr);
> +	}
 > +	kfree(cache->data_vec);
 > +	kfree(cache);
 > +}

Overwriting data just before it is freed is not useful. Please use 
CONFIG_SLUB_DEBUG=y to debug use-after-free issues instead of such code.

Bart.
Somnath Kotur March 26, 2015, 2:05 p.m. UTC | #2
Hi Matan/Moni,
                    Could either of you please respond to both of Bart's queries?

Thanks
Somnath

> -----Original Message-----
> From: Bart Van Assche [mailto:bart.vanassche@sandisk.com]
> Sent: Thursday, March 26, 2015 5:13 AM
> To: Somnath Kotur; roland@kernel.org
> Cc: linux-rdma@vger.kernel.org; Matan Barak
> Subject: Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> 
> On 03/25/2015 02:19 PM, Somnath Kotur wrote:
> > +	if (cache->data_vec[ix].attr.ndev &&
> > +	    cache->data_vec[ix].attr.ndev != old_net_dev)
> 
> A few lines earlier the memory old_net_dev points at was freed. If two
> instances of this function run concurrently, what prevents that the
> old_net_dev memory has been reallocated and hence that attr.ndev ==
> old_net_dev although both pointers refer(red) to different network devices
> ?
> 
> > +	ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq;
> 
> Invoking write_gid() is only safe if the caller serializes write_gid() calls.
> Apparently the cache->lock mutex is used for that purpose. So why is it
> necessary to use ACCESS_ONCE() here ? Why is it needed to prevent that
> the compiler coalesces this write with another write into the same structure
> ?
> 
> > +		/* Make sure the sequence number we remeber was read
> 
> This looks like a typo - shouldn't the above read "remember" ?
> 
> BTW, the style of that comment is recommended only for networking code
> and not for IB code. Have you verified this patch with checkpatch ?
> 
> > +	mutex_lock(&cache->lock);
> > +
> > +	for (ix = 0; ix < cache->sz; ix++)
> > +		if (cache->data_vec[ix].attr.ndev == ndev)
> > +			write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
> > +
> > +	mutex_unlock(&cache->lock);
> > +	return 0;
> 
> The traditional Linux kernel coding style is one blank line before
> mutex_lock() and after mutex_unlock() but not after mutex_lock() nor
> before mutex_unlock().
> 
> > +	orig_seq = ACCESS_ONCE(cache->data_vec[index].seq);
> > +	/* Make sure we read the sequence number before copying the
> > +	 * gid to local storage. */
> > +	smp_rmb();
> 
> Please use READ_ONCE() instead of ACCESS_ONCE() as recommended in
> <linux/compiler.h>.
> 
> > +static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port) {
> > +	int i;
> > +	struct ib_roce_gid_cache *cache =
> > +		ib_dev->cache.roce_gid_cache[port - 1];
> > +
> > +	if (!cache)
> > +		return;
> > +
> > +	for (i = 0; i < cache->sz; ++i) {
> > +		if (memcmp(&cache->data_vec[i].gid, &zgid,
> > +			   sizeof(cache->data_vec[i].gid)))
> > +		    write_gid(ib_dev, port, cache, i, &zgid, &zattr);
> > +	}
>  > +	kfree(cache->data_vec);
>  > +	kfree(cache);
>  > +}
> 
> Overwriting data just before it is freed is not useful. Please use
> CONFIG_SLUB_DEBUG=y to debug use-after-free issues instead of such
> code.
> 
> Bart.
Hefty, Sean April 8, 2015, 12:30 a.m. UTC | #3
> In order to manage multiple types, vlans and MACs per GID, we
> need to store them along the GID itself. We store the net device
> as well, as sometimes GIDs should be handled according to the
> net device they came from. Since populating the GID table should
> be identical for every RoCE provider, the GIDs table should be
> handled in ib_core.
> 
> Adding a GID cache table that supports a lockless find, add and
> delete gids. The lockless nature comes from using a unique
> sequence number per table entry and detecting that while reading/
> writing this sequence wasn't changed.
> 
> By using this RoCE GID cache table, providers must implement a
> modify_gid callback. The table is managed exclusively by
> this roce_gid_cache and the provider just need to write
> the data to the hardware.
> 
> Signed-off-by: Matan Barak <matanb@mellanox.com>
> Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
> ---
>  drivers/infiniband/core/Makefile         |   3 +-
>  drivers/infiniband/core/core_priv.h      |  24 ++
>  drivers/infiniband/core/roce_gid_cache.c | 518

Why does RoCE need such a complex gid cache?  If a gid cache is needed at all, why should it be restricted to RoCE only?  And why is such a complex synchronization scheme needed?  Seriously, how many times will GIDs change and how many readers at once do you expect to have?


> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 65994a1..1866595 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -64,6 +64,36 @@ union ib_gid {
>  	} global;
>  };
> 
> +extern union ib_gid zgid;
> +
> +enum ib_gid_type {
> +	/* If link layer is Ethernet, this is RoCE V1 */

I don't understand this comment.  Does RoCE v2 not run on Ethernet?

> +	IB_GID_TYPE_IB        = 0,
> +	IB_GID_TYPE_ROCE_V2   = 1,
> +	IB_GID_TYPE_SIZE
> +};

Can you explain the purpose of defining a 'GID type'.  A GID is just a global address.  Why does it matter to anyone using it how it was constructed?

> +
> +struct ib_gid_attr {
> +	enum ib_gid_type	gid_type;
> +	struct net_device	*ndev;
> +};
> +
> +struct ib_roce_gid_cache_entry {
> +	/* seq number of 0 indicates entry being changed. */
> +	unsigned int        seq;
> +	union ib_gid        gid;
> +	struct ib_gid_attr  attr;
> +	void		   *context;
> +};
> +
> +struct ib_roce_gid_cache {
> +	int		     active;
> +	int                  sz;
> +	/* locking against multiple writes in data_vec */
> +	struct mutex         lock;
> +	struct ib_roce_gid_cache_entry *data_vec;
> +};
> +
>  enum rdma_node_type {
>  	/* IB values map to NodeInfo:NodeType. */
>  	RDMA_NODE_IB_CA 	= 1,
> @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
>  	IB_PORT_BOOT_MGMT_SUP			= 1 << 23,
>  	IB_PORT_LINK_LATENCY_SUP		= 1 << 24,
>  	IB_PORT_CLIENT_REG_SUP			= 1 << 25,
> -	IB_PORT_IP_BASED_GIDS			= 1 << 26
> +	IB_PORT_IP_BASED_GIDS			= 1 << 26,
> +	IB_PORT_ROCE				= 1 << 27,
> +	IB_PORT_ROCE_V2				= 1 << 28,

Why does RoCE suddenly require a port capability bit?  RoCE runs today without setting any bit.

Somnath Kotur April 8, 2015, 4:10 a.m. UTC | #4
Hi Sean,

> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty@intel.com]
> Sent: Wednesday, April 08, 2015 6:00 AM
> To: Somnath Kotur; roland@kernel.org
> Cc: linux-rdma@vger.kernel.org; Matan Barak
> Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> 
> > In order to manage multiple types, vlans and MACs per GID, we need to
> > store them along the GID itself. We store the net device as well, as
> > sometimes GIDs should be handled according to the net device they came
> > from. Since populating the GID table should be identical for every
> > RoCE provider, the GIDs table should be handled in ib_core.
> >
> > Adding a GID cache table that supports a lockless find, add and delete
> > gids. The lockless nature comes from using a unique sequence number
> > per table entry and detecting that while reading/ writing this
> > sequence wasn't changed.
> >
> > By using this RoCE GID cache table, providers must implement a
> > modify_gid callback. The table is managed exclusively by this
> > roce_gid_cache and the provider just need to write the data to the
> > hardware.
> >
> > Signed-off-by: Matan Barak <matanb@mellanox.com>
> > Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
> > ---
> >  drivers/infiniband/core/Makefile         |   3 +-
> >  drivers/infiniband/core/core_priv.h      |  24 ++
> >  drivers/infiniband/core/roce_gid_cache.c | 518
> 
> Why does RoCE need such a complex gid cache?  If a gid cache is needed at
> all, why should it be restricted to RoCE only?  And why is such a complex
> synchronization scheme needed?  Seriously, how many times will GIDs
> change and how many readers at once do you expect to have?
> 
> 
> > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index
> > 65994a1..1866595 100644
> > --- a/include/rdma/ib_verbs.h
> > +++ b/include/rdma/ib_verbs.h
> > @@ -64,6 +64,36 @@ union ib_gid {
> >  	} global;
> >  };
> >
> > +extern union ib_gid zgid;
> > +
> > +enum ib_gid_type {
> > +	/* If link layer is Ethernet, this is RoCE V1 */
> 
> I don't understand this comment.  Does RoCE v2 not run on Ethernet?
> 
Yes, this comment probably could use a reword..
> > +	IB_GID_TYPE_IB        = 0,
> > +	IB_GID_TYPE_ROCE_V2   = 1,
> > +	IB_GID_TYPE_SIZE
> > +};
> 
> Can you explain the purpose of defining a 'GID type'.  A GID is just a global
> address.  Why does it matter to anyone using it how it was constructed?

This is part of the RoCE V2 specification; please refer to Section A17.8.
The GID type determines the protocol used for outbound packet generation, i.e. RoCE V1 (Ethertype 0x8915) or RoCE V2 (over IPv4 or IPv6).
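
Roughly speaking, the gid_type selects the wire encapsulation used when
building outbound packets; an illustrative sketch (not code from this series):

	/* illustrative only */
	static u16 roce_ethertype_for(enum ib_gid_type gid_type)
	{
		switch (gid_type) {
		case IB_GID_TYPE_IB:	    /* RoCE v1: GRH directly on Ethernet */
			return 0x8915;	    /* RoCE Ethertype */
		case IB_GID_TYPE_ROCE_V2:   /* RoCE v2: BTH over UDP over IP */
		default:
			return 0x0800;	    /* ETH_P_IP; 0x86DD when over IPv6 */
		}
	}
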
> 
> > +
> > +struct ib_gid_attr {
> > +	enum ib_gid_type	gid_type;
> > +	struct net_device	*ndev;
> > +};
> > +
> > +struct ib_roce_gid_cache_entry {
> > +	/* seq number of 0 indicates entry being changed. */
> > +	unsigned int        seq;
> > +	union ib_gid        gid;
> > +	struct ib_gid_attr  attr;
> > +	void		   *context;
> > +};
> > +
> > +struct ib_roce_gid_cache {
> > +	int		     active;
> > +	int                  sz;
> > +	/* locking against multiple writes in data_vec */
> > +	struct mutex         lock;
> > +	struct ib_roce_gid_cache_entry *data_vec; };
> > +
> >  enum rdma_node_type {
> >  	/* IB values map to NodeInfo:NodeType. */
> >  	RDMA_NODE_IB_CA 	= 1,
> > @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
> >  	IB_PORT_BOOT_MGMT_SUP			= 1 << 23,
> >  	IB_PORT_LINK_LATENCY_SUP		= 1 << 24,
> >  	IB_PORT_CLIENT_REG_SUP			= 1 << 25,
> > -	IB_PORT_IP_BASED_GIDS			= 1 << 26
> > +	IB_PORT_IP_BASED_GIDS			= 1 << 26,
> > +	IB_PORT_ROCE				= 1 << 27,
> > +	IB_PORT_ROCE_V2				= 1 << 28,
> 
> Why does RoCE suddenly require a port capability bit?  RoCE runs today
> without setting any bit.
Again, this is part of the RoCE V2 spec; please refer to Section A17.5.1 - Query HCA (pasting a snippet below):
A new "RoCE Supported" capability bit shall be added to the Port Attributes
list. This capability bit applies exclusively to ports of the new
"RoCEv2" type


Thanks
Som
Moni Shoua April 8, 2015, 8:49 a.m. UTC | #5
On Wed, Apr 8, 2015 at 2:30 AM, Hefty, Sean <sean.hefty@intel.com> wrote:
>> In order to manage multiple types, vlans and MACs per GID, we
>> need to store them along the GID itself. We store the net device
>> as well, as sometimes GIDs should be handled according to the
>> net device they came from. Since populating the GID table should
>> be identical for every RoCE provider, the GIDs table should be
>> handled in ib_core.
>>
>> Adding a GID cache table that supports a lockless find, add and
>> delete gids. The lockless nature comes from using a unique
>> sequence number per table entry and detecting that while reading/
>> writing this sequence wasn't changed.
>>
>> By using this RoCE GID cache table, providers must implement a
>> modify_gid callback. The table is managed exclusively by
>> this roce_gid_cache and the provider just need to write
>> the data to the hardware.
>>
>> Signed-off-by: Matan Barak <matanb@mellanox.com>
>> Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
>> ---
>>  drivers/infiniband/core/Makefile         |   3 +-
>>  drivers/infiniband/core/core_priv.h      |  24 ++
>>  drivers/infiniband/core/roce_gid_cache.c | 518
>
> Why does RoCE need such a complex gid cache?  If a gid cache is needed at all, why should it be restricted to RoCE only?  And why is such a complex synchronization scheme needed?  Seriously, how many times will GIDs change and how many readers at once do you expect to have?
>
The GID cache is also implemented for link layer IB. However, for RoCE the
GID cache is also the manager of the table. This means that adding or
removing entries from the GID table is the responsibility of the
cache and not the HW/device driver. This is a new scheme that frees
each vendor's driver from having to deal with net and inet events.
The content of the GID table is much more dynamic for RoCE than for IB, and
so is access to the table, so I guess an extra mechanism is required.
The fact that a GID entry is associated with net_device and inet_addr
objects that can be modified/deleted at any time is one example.
>
>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>> index 65994a1..1866595 100644
>> --- a/include/rdma/ib_verbs.h
>> +++ b/include/rdma/ib_verbs.h
>> @@ -64,6 +64,36 @@ union ib_gid {
>>       } global;
>>  };
>>
>> +extern union ib_gid zgid;
>> +
>> +enum ib_gid_type {
>> +     /* If link layer is Ethernet, this is RoCE V1 */
>
> I don't understand this comment.  Does RoCE v2 not run on Ethernet?
>
>> +     IB_GID_TYPE_IB        = 0,
>> +     IB_GID_TYPE_ROCE_V2   = 1,
>> +     IB_GID_TYPE_SIZE
>> +};
>
> Can you explain the purpose of defining a 'GID type'.  A GID is just a global address.  Why does it matter to anyone using it how it was constructed?
>
>> +
>> +struct ib_gid_attr {
>> +     enum ib_gid_type        gid_type;
>> +     struct net_device       *ndev;
>> +};
>> +
>> +struct ib_roce_gid_cache_entry {
>> +     /* seq number of 0 indicates entry being changed. */
>> +     unsigned int        seq;
>> +     union ib_gid        gid;
>> +     struct ib_gid_attr  attr;
>> +     void               *context;
>> +};
>> +
>> +struct ib_roce_gid_cache {
>> +     int                  active;
>> +     int                  sz;
>> +     /* locking against multiple writes in data_vec */
>> +     struct mutex         lock;
>> +     struct ib_roce_gid_cache_entry *data_vec;
>> +};
>> +
>>  enum rdma_node_type {
>>       /* IB values map to NodeInfo:NodeType. */
>>       RDMA_NODE_IB_CA         = 1,
>> @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
>>       IB_PORT_BOOT_MGMT_SUP                   = 1 << 23,
>>       IB_PORT_LINK_LATENCY_SUP                = 1 << 24,
>>       IB_PORT_CLIENT_REG_SUP                  = 1 << 25,
>> -     IB_PORT_IP_BASED_GIDS                   = 1 << 26
>> +     IB_PORT_IP_BASED_GIDS                   = 1 << 26,
>> +     IB_PORT_ROCE                            = 1 << 27,
>> +     IB_PORT_ROCE_V2                         = 1 << 28,
>
> Why does RoCE suddenly require a port capability bit?  RoCE runs today without setting any bit.
>
Hefty, Sean April 13, 2015, 11:50 p.m. UTC | #6
> Yes, this comment probably could use a reword..
> > > +	IB_GID_TYPE_IB        = 0,
> > > +	IB_GID_TYPE_ROCE_V2   = 1,
> > > +	IB_GID_TYPE_SIZE
> > > +};
> >
> > Can you explain the purpose of defining a 'GID type'.  A GID is just a
> global
> > address.  Why does it matter to anyone using it how it was constructed?
> 
> This is part of RoCE V2 Specification.  Please refer to Section A 17.8 .
> The GID Type determines the protocol for outbound packet generation i.e
> RoCE V1 (0x8915 Ether Type) or RoCEV2 (IPv4 or IPv6)

This isn't an interface for the RoCE specification.  Why does this need to be added to the verbs interface?  It hasn't been needed by apps yet, and I don't see why the apps should be made to care now how the GID is formatted.

> > > @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
> > >  	IB_PORT_BOOT_MGMT_SUP			= 1 << 23,
> > >  	IB_PORT_LINK_LATENCY_SUP		= 1 << 24,
> > >  	IB_PORT_CLIENT_REG_SUP			= 1 << 25,
> > > -	IB_PORT_IP_BASED_GIDS			= 1 << 26
> > > +	IB_PORT_IP_BASED_GIDS			= 1 << 26,
> > > +	IB_PORT_ROCE				= 1 << 27,
> > > +	IB_PORT_ROCE_V2				= 1 << 28,
> >
> > Why does RoCE suddenly require a port capability bit?  RoCE runs today
> > without setting any bit.
> Again, this is part of RoCE V2 SPEC, please refer to Section A17.5.1-
> Query HCA(Pasting snippet below)
> A new "RoCE Supported" capability bit shall be added to the Port
> Attributes
> list. This capability bit applies exclusively to ports of the new
> "RoCEv2" type

Same comment as above.
Matan Barak April 14, 2015, 9:32 a.m. UTC | #7
On 4/14/2015 2:50 AM, Hefty, Sean wrote:
>> Yes, this comment probably could use a reword..
>>>> +	IB_GID_TYPE_IB        = 0,
>>>> +	IB_GID_TYPE_ROCE_V2   = 1,
>>>> +	IB_GID_TYPE_SIZE
>>>> +};
>>>
>>> Can you explain the purpose of defining a 'GID type'.  A GID is just a
>> global
>>> address.  Why does it matter to anyone using it how it was constructed?
>>
>> This is part of RoCE V2 Specification.  Please refer to Section A 17.8 .
>> The GID Type determines the protocol for outbound packet generation i.e
>> RoCE V1 (0x8915 Ether Type) or RoCEV2 (IPv4 or IPv6)
>
> This isn't an interface for the RoCE specification.  Why does this need to be added to the verbs interface?  It hasn't been needed by apps yet, and I don't see why the apps should be made to care now how the GID is formatted.
>

This is a part of the GID meta info. The user should be able to choose 
between RoCE V1 (which is represented here by IB_GID_TYPE_IB) and RoCE 
V2 - just as a user could choose between IPv6 and IPv4.

>>>> @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
>>>>   	IB_PORT_BOOT_MGMT_SUP			= 1 << 23,
>>>>   	IB_PORT_LINK_LATENCY_SUP		= 1 << 24,
>>>>   	IB_PORT_CLIENT_REG_SUP			= 1 << 25,
>>>> -	IB_PORT_IP_BASED_GIDS			= 1 << 26
>>>> +	IB_PORT_IP_BASED_GIDS			= 1 << 26,
>>>> +	IB_PORT_ROCE				= 1 << 27,
>>>> +	IB_PORT_ROCE_V2				= 1 << 28,
>>>
>>> Why does RoCE suddenly require a port capability bit?  RoCE runs today
>>> without setting any bit.
>> Again, this is part of RoCE V2 SPEC, please refer to Section A17.5.1-
>> Query HCA(Pasting snippet below)
>> A new "RoCE Supported" capability bit shall be added to the Port
>> Attributes
>> list. This capability bit applies exclusively to ports of the new
>> "RoCEv2" type
>
> Same comment as above.
>
Matan Barak April 14, 2015, 1:23 p.m. UTC | #8
On 3/26/2015 1:42 AM, Bart Van Assche wrote:
> On 03/25/2015 02:19 PM, Somnath Kotur wrote:
>> +    if (cache->data_vec[ix].attr.ndev &&
>> +        cache->data_vec[ix].attr.ndev != old_net_dev)
>
> A few lines earlier the memory old_net_dev points at was freed. If two
> instances of this function run concurrently, what prevents that the
> old_net_dev memory has been reallocated and hence that attr.ndev ==
> old_net_dev although both pointers refer(red) to different network
> devices ?
>

write_gid is *almost* always called under a mutex. The only case where it's 
not protected is free_roce_gid_cache. free_roce_gid_cache is called only 
in the error flow of roce_gid_cache_setup_one, when no concurrent 
write_gid can happen (as the cache isn't set up yet), and in 
roce_gid_cache_cleanup_one. roce_gid_cache_cleanup_one is called in 
the error flow of roce_gid_cache_client_setup_one (where no other 
write_gid calls are expected, for the same reason as above) and in 
roce_gid_cache_client_cleanup_work_handler, where it's called after 
flush_workqueue(roce_gid_mgmt_wq). Since all write_gid calls go 
through roce_gid_mgmt_wq and we set the cache to inactive mode before 
flushing the wq and freeing the cache, I think we can conclude that no 
concurrent write_gid calls on the same cache are possible.

>> +    ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq;
>
> Invoking write_gid() is only safe if the caller serializes write_gid()
> calls. Apparently the cache->lock mutex is used for that purpose. So why
> is it necessary to use ACCESS_ONCE() here ? Why is it needed to prevent
> that the compiler coalesces this write with another write into the same
> structure ?
>

The mutex only serializes cache writes. Cache reads can be done 
concurrently with writes and are protected by ACCESS_ONCE.

>> +        /* Make sure the sequence number we remeber was read
>
> This looks like a typo - shouldn't the above read "remember" ?
>

Will be fixed in V4, thanks.

> BTW, the style of that comment is recommended only for networking code
> and not for IB code. Have you verified this patch with checkpatch ?
>

Of course and I've just re-run checkpatch on this patch. It doesn't 
catch this.

>> +    mutex_lock(&cache->lock);
>> +
>> +    for (ix = 0; ix < cache->sz; ix++)
>> +        if (cache->data_vec[ix].attr.ndev == ndev)
>> +            write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
>> +
>> +    mutex_unlock(&cache->lock);
>> +    return 0;
>
> The traditional Linux kernel coding style is one blank line before
> mutex_lock() and after mutex_unlock() but not after mutex_lock() nor
> before mutex_unlock().
>

I didn't find this in the CodingStyle doc. Could you please quote or 
post a link?

>> +    orig_seq = ACCESS_ONCE(cache->data_vec[index].seq);
>> +    /* Make sure we read the sequence number before copying the
>> +     * gid to local storage. */
>> +    smp_rmb();
>
> Please use READ_ONCE() instead of ACCESS_ONCE() as recommended in
> <linux/compiler.h>.
>

Ok, I'll change that in V4. I see that READ_ONCE and WRITE_ONCE are 
different from ACCESS_ONCE only for aggregate data types (which isn't 
our case), but it won't hurt to change that.

>> +static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port)
>> +{
>> +    int i;
>> +    struct ib_roce_gid_cache *cache =
>> +        ib_dev->cache.roce_gid_cache[port - 1];
>> +
>> +    if (!cache)
>> +        return;
>> +
>> +    for (i = 0; i < cache->sz; ++i) {
>> +        if (memcmp(&cache->data_vec[i].gid, &zgid,
>> +               sizeof(cache->data_vec[i].gid)))
>> +            write_gid(ib_dev, port, cache, i, &zgid, &zattr);
>> +    }
>  > +    kfree(cache->data_vec);
>  > +    kfree(cache);
>  > +}
>
> Overwriting data just before it is freed is not useful. Please use
> CONFIG_SLUB_DEBUG=y to debug use-after-free issues instead of such code.
>

It's mandatory as write_gid with zgid might cause the vendor driver to 
free memory it allocated for this GID entry (like in the mlx4 case).

> Bart.

Thanks for the review :)

Matan
Bart Van Assche April 14, 2015, 3:31 p.m. UTC | #9
On 04/14/15 15:23, Matan Barak wrote:
>>> +    mutex_lock(&cache->lock);
>>> +
>>> +    for (ix = 0; ix < cache->sz; ix++)
>>> +        if (cache->data_vec[ix].attr.ndev == ndev)
>>> +            write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
>>> +
>>> +    mutex_unlock(&cache->lock);
>>> +    return 0;
>>
>> The traditional Linux kernel coding style is one blank line before
>> mutex_lock() and after mutex_unlock() but not after mutex_lock() nor
>> before mutex_unlock().
>>
>
> I didn't find this in the CodingStyle doc. Could you please quote or
> post a link?

Hello Matan,

I'm not aware of any formal documentation of this style guideline. But 
if you look around in the Linux kernel tree you will see that most 
kernel code follows this style.

Bart.


Hefty, Sean April 14, 2015, 5:32 p.m. UTC | #10
> This is a part of the GID meta info. The user should be able to choose
> between RoCE V1 (which is represented here by IB_GID_TYPE_IB) and RoCE
> V2 - just as a user could choose between IPv6 and IPv4.

IPv4 and IPv6 are different protocols, not different formats for the same address.  How does RoCE v2 not break every app?  This isn't like asking the user to choose between IPv4 versus IPv6, it's asking them to choose between IPv4 assigned by DHCP versus IPv4 assigned statically.
Somnath Kotur April 15, 2015, 5:35 a.m. UTC | #11
> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty@intel.com]
> Sent: Tuesday, April 14, 2015 11:02 PM
> To: Matan Barak; Somnath Kotur; roland@kernel.org
> Cc: linux-rdma@vger.kernel.org
> Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> 
> > This is a part of the GID meta info. The user should be able to choose
> > between RoCE V1 (which is represented here by IB_GID_TYPE_IB) and
> RoCE
> > V2 - just as a user could choose between IPv6 and IPv4.
> 
> IPv4 and IPv6 are different protocols, not different formats for the same
> address.  How does RoCE v2 not break every app? 
It does not break every app; the choice of which GID type to use is made by the RDMA-CM based on a network topology hint obtained from the IP stack.
Please refer to patch 15/33: IB/Core: Changes to the IB Core infrastructure for RoCEv2 support.
Of course, if the user does not want to go with the choice made by the RDMA-CM, there is the option of overriding it using the configfs patch (PATCH 14/33).
Hope that clarifies?

Thanks
Som
 
Hefty, Sean April 15, 2015, 4:08 p.m. UTC | #12
> It does not  break every app, the choice of which GID type to use is made
> by the RDMA-CM based on network topology hint obtained from the IP stack.
> Please refer to patch 15/33: IB/Core: Changes to the IB Core
> infrastructure for RoCEv2 support.
> Of course, if the user does not want to go with this choice made by the
> RDMA-CM, then there is the option of overriding it using the configfs
> patch (PATCH 14/33)
> Hope that clarifies?

RoCE v2 is really Infiniband over UDP over IP.  Why don't we just call it IBoUDP like it is?

IBoUDP changes the Ethertype, replaces the network header, adds a new transport protocol header, and layers IB over that.  This change should be exposed properly and not as just a new GID type.
Suresh Shelvapille April 15, 2015, 4:21 p.m. UTC | #13
IMHO, it would be good to have a physical layer representation in the naming convention.

-----Original Message-----
From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Hefty, Sean
Sent: Wednesday, April 15, 2015 12:08 PM
To: Somnath Kotur; Matan Barak; roland@kernel.org
Cc: linux-rdma@vger.kernel.org
Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

> It does not  break every app, the choice of which GID type to use is
> made by the RDMA-CM based on network topology hint obtained from the IP stack.
> Please refer to patch 15/33: IB/Core: Changes to the IB Core
> infrastructure for RoCEv2 support.
> Of course, if the user does not want to go with this choice made by
> the RDMA-CM, then there is the option of overriding it using the
> configfs patch (PATCH 14/33) Hope that clarifies?

RoCE v2 is really Infiniband over UDP over IP.  Why don't we just call it IBoUDP like it is?

IBoUDP changes the Ethertype, replaces the network header, adds a new transport protocol header, and layers IB over that.  This change should be exposed properly and not as just a new GID type.
Matan Barak April 16, 2015, 10:42 a.m. UTC | #14
AFAIK, RoCE v2 is the known and official name. Why would we want to come 
up with a custom name?

These are indeed two different protocols, so the comparison to DHCP-assigned 
addresses versus static addresses is (to say the least) a bit off.

Even when comparing IPv4 and IPv6, the most significant user-visible change is 
the sockaddr_in and sockaddr_in6 addresses. IMHO, since the GID format 
is identical, the changes can be encoded in a gid_type argument. The 
gid_type is an inherent part of the address, making identical GIDs with 
different gid_types two different addresses, as expected.
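
In code terms, an address match would have to compare both fields; an
illustrative sketch (not part of this series):

	static bool same_roce_address(const union ib_gid *gid_a,
				      enum ib_gid_type type_a,
				      const union ib_gid *gid_b,
				      enum ib_gid_type type_b)
	{
		/* identical GIDs with different gid_types are distinct addresses */
		return type_a == type_b && !memcmp(gid_a, gid_b, sizeof(*gid_a));
	}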

On 4/15/2015 7:21 PM, Suri Shelvapille wrote:
> IMHO, it would be good to have a physical layer representation in the naming convention.
>
> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Hefty, Sean
> Sent: Wednesday, April 15, 2015 12:08 PM
> To: Somnath Kotur; Matan Barak; roland@kernel.org
> Cc: linux-rdma@vger.kernel.org
> Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
>
>> It does not  break every app, the choice of which GID type to use is
>> made by the RDMA-CM based on network topology hint obtained from the IP stack.
>> Please refer to patch 15/33: IB/Core: Changes to the IB Core
>> infrastructure for RoCEv2 support.
>> Of course, if the user does not want to go with this choice made by
>> the RDMA-CM, then there is the option of overriding it using the
>> configfs patch (PATCH 14/33) Hope that clarifies?
>
> RoCE v2 is really Infiniband over UDP over IP.  Why don't we just call it IBoUDP like it is?
>
> IBoUDP changes the Ethertype, replaces the network header, adds a new transport protocol header, and layers IB over that.  This change should be exposed properly and not as just a new GID type.
Moni Shoua April 16, 2015, 10:43 a.m. UTC | #15
On Wed, Apr 15, 2015 at 7:08 PM, Hefty, Sean <sean.hefty@intel.com> wrote:
>> It does not  break every app, the choice of which GID type to use is made
>> by the RDMA-CM based on network topology hint obtained from the IP stack.
>> Please refer to patch 15/33: IB/Core: Changes to the IB Core
>> infrastructure for RoCEv2 support.
>> Of course, if the user does not want to go with this choice made by the
>> RDMA-CM, then there is the option of overriding it using the configfs
>> patch (PATCH 14/33)
>> Hope that clarifies?
>
> RoCE v2 is really Infiniband over UDP over IP.  Why don't we just call it IBoUDP like it is?
RoCEv2 is the name in the IBTA spec (Annex 17)
>
> IBoUDP changes the Ethertype, replaces the network header, adds a new transport protocol header, and layers IB over that.  This change should be exposed properly and not as just a new GID type.
I don't understand what you are suggesting here. Can you give an example?
Hefty, Sean April 16, 2015, 2:58 p.m. UTC | #16
> > RoCE v2 is really Infiniband over UDP over IP.  Why don't we just call
> > it IBoUDP like it is?
>
> RoCEv2 is the name in the IBTA spec (Annex 17)


We call RoCE IBoE in the kernel, because that's what it is.  RoCE is an IBTA marketing name.

Looking through the Annex, I don't see where Ethernet is even a requirement for this technology to work.  The IB transport is layered over a standard UDP header.  I do see where the spec calls out updating the IP header, but that's it.

Regardless of what it's called, it replaces the underlying network and transport protocols, versus IB-classic or IBoE/RoCE.  That should be captured properly, not by saying there's a new GID type.  RoCE v2 doesn't even use GIDs as part of its protocol.  It uses UDP/IP addresses.


> > IBoUDP changes the Ethertype, replaces the network header, adds a new
> > transport protocol header, and layers IB over that.  This change should be
> > exposed properly and not as just a new GID type.
>
> I don't understand what do you suggest here. Can you give an example?


I don't have a solution here.  Please look at Michael Wang's patch series and see how this would fit into that model.  The introduction of iWarp required defining a new 'transport' type.  IBoE added a new link layer.  Based on those changes, this would warrant introducing a new network layer, so that it can be distinguished properly from the other options.  Maybe that's the right approach?

Cisco's NIC reports a transport layer of 'USNIC_UDP', which should really just be 'UDP'.  That NIC supports UDP/IP/Ethernet, based on the rdma stack's model.  RoCE v2 is also UDP/IP/Ethernet, it only layers IB over that.  (This makes the use of the term 'transport' confusing.  Maybe there should also be a 'session' protocol?)  It seems completely reasonable that a device which does IB/UDP/IP/Ethernet to someday expose the UDP/IP/Ethernet portion (if it doesn't already), and from the same port at the same time.

Rather than continuing to try to make everything look like an IB-classic device because it's convenient, the stack needs to start exposing things properly.  I don't know what the right solution should be, but trying to capture this level of detail as a different GID type definitely looks like a step in the wrong direction.

- Sean
Or Gerlitz April 26, 2015, 5:20 p.m. UTC | #17
On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
<somnath.kotur@emulex.com> wrote:
> From: Matan Barak <matanb@mellanox.com>
>
> In order to manage multiple types, vlans and MACs per GID, we
> need to store them along the GID itself. We store the net device
> as well, as sometimes GIDs should be handled according to the
> net device they came from. Since populating the GID table should
> be identical for every RoCE provider, the GIDs table should be
> handled in ib_core.
>
> Adding a GID cache table that supports a lockless find, add and
> delete gids. The lockless nature comes from using a unique
> sequence number per table entry and detecting that while reading/
> writing this sequence wasn't changed.

Matan, please use an existing mechanism which fits the problem you are
trying to solve; I guess one of RCU or seqlock should do the job.

Or.
Matan Barak April 27, 2015, 7:32 a.m. UTC | #18
On 4/26/2015 8:20 PM, Or Gerlitz wrote:
> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
> <somnath.kotur@emulex.com> wrote:
>> From: Matan Barak <matanb@mellanox.com>
>>
>> In order to manage multiple types, vlans and MACs per GID, we
>> need to store them along the GID itself. We store the net device
>> as well, as sometimes GIDs should be handled according to the
>> net device they came from. Since populating the GID table should
>> be identical for every RoCE provider, the GIDs table should be
>> handled in ib_core.
>>
>> Adding a GID cache table that supports a lockless find, add and
>> delete gids. The lockless nature comes from using a unique
>> sequence number per table entry and detecting that while reading/
>> writing this sequence wasn't changed.
>
> Matan, please use existing mechanism which fits the problem you are
> trying to solve, I guess one of RCU or seqlock should do the job.
>

seqcount fits this problem better. If a write and a read are done in 
parallel, there's a good chance we read an out-of-date entry and end up 
using a GID entry that's going to change at T+epsilon, so RCU doesn't 
really have an advantage here.

The current implementation is a bit more efficient than seqcount, as it 
allows early termination of read-while-write (because the write puts a 
known "currently updating" value that the read knows to ignore). AFAIK, 
this doesn't exist in the current seqcount implementation. However, 
since this isn't a crucial data-path, I'll change that to seqcount.

seqcount is preferred over seqlock, as I don't need the spinlock in seqlock.
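
For reference, a minimal sketch of what the seqcount-based variant might
look like, assuming each cache entry grows a seqcount_t (illustrative only,
not the final V4 code):

	#include <linux/seqlock.h>

	/* assume the entry is extended with a seqcount_t (illustrative) */
	struct gid_entry {
		seqcount_t	seq;
		union ib_gid	gid;
	};

	/* writer side, still serialized by cache->lock */
	static void update_entry(struct gid_entry *e, const union ib_gid *gid)
	{
		write_seqcount_begin(&e->seq);
		memcpy(&e->gid, gid, sizeof(e->gid));
		write_seqcount_end(&e->seq);
	}

	/* lockless reader: retries if a writer raced with the copy */
	static void read_entry(struct gid_entry *e, union ib_gid *gid)
	{
		unsigned int start;

		do {
			start = read_seqcount_begin(&e->seq);
			memcpy(gid, &e->gid, sizeof(*gid));
		} while (read_seqcount_retry(&e->seq, start));
	}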

> Or.
>

Matan
Or Gerlitz April 27, 2015, 6:22 p.m. UTC | #19
On Mon, Apr 27, 2015 at 10:32 AM, Matan Barak <matanb@mellanox.com> wrote:
> On 4/26/2015 8:20 PM, Or Gerlitz wrote:
>> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
>> <somnath.kotur@emulex.com> wrote:
>>> From: Matan Barak <matanb@mellanox.com>

>>> In order to manage multiple types, vlans and MACs per GID, we
>>> need to store them along the GID itself. We store the net device
>>> as well, as sometimes GIDs should be handled according to the
>>> net device they came from. Since populating the GID table should
>>> be identical for every RoCE provider, the GIDs table should be
>>> handled in ib_core.
>>>
>>> Adding a GID cache table that supports a lockless find, add and
>>> delete gids. The lockless nature comes from using a unique
>>> sequence number per table entry and detecting that while reading/
>>> writing this sequence wasn't changed.

>> Matan, please use existing mechanism which fits the problem you are
>> trying to solve, I guess one of RCU or seqlock should do the job.

> seqcount fits this problem better. Since if a write and read are done in
> parallel, there's a good chance we read an out of date entry and we are
> going to use a GID entry that's going to change in T+epsilon, so RCU doesn't
> really have an advantage here.

So going back to the problem... we are talking about applications/drivers
that attempt to establish new connections doing reads, and writes done
on behalf of IP stack changes; both are very much not a critical path.
So this is kind of similar to the neighbour table maintained by the ND
subsystem, which is used by all IP-based networking applications, and
that code uses RCU. I don't see what's wrong with RCU for our sort of
smaller-scale subsystem, or even what is wrong with a simple rwlock,
which is the mechanism used today by the IB core GID cache; this goes
too complex and for no reason that I can think of.


> The current implementation is a bit more efficient than seqcount, as it
> allows early termination of read-while-write (because the write puts a known
> "currently updating" value that the read knows to ignore). AFAIK, this
> doesn't exist in the current seqcount implementation. However, since this
> isn't a crucial data-path, I'll change that to seqcount.
>
> seqcount is preferred over seqlock, as I don't need the spinlock in seqlock.
Matan Barak April 28, 2015, 7:17 a.m. UTC | #20
On 4/27/2015 9:22 PM, Or Gerlitz wrote:
> On Mon, Apr 27, 2015 at 10:32 AM, Matan Barak <matanb@mellanox.com> wrote:
>> On 4/26/2015 8:20 PM, Or Gerlitz wrote:
>>> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
>>> <somnath.kotur@emulex.com> wrote:
>>>> From: Matan Barak <matanb@mellanox.com>
>
>>>> In order to manage multiple types, vlans and MACs per GID, we
>>>> need to store them along the GID itself. We store the net device
>>>> as well, as sometimes GIDs should be handled according to the
>>>> net device they came from. Since populating the GID table should
>>>> be identical for every RoCE provider, the GIDs table should be
>>>> handled in ib_core.
>>>>
>>>> Adding a GID cache table that supports a lockless find, add and
>>>> delete gids. The lockless nature comes from using a unique
>>>> sequence number per table entry and detecting that while reading/
>>>> writing this sequence wasn't changed.
>
>>> Matan, please use existing mechanism which fits the problem you are
>>> trying to solve, I guess one of RCU or seqlock should do the job.
>
>> seqcount fits this problem better. Since if a write and read are done in
>> parallel, there's a good chance we read an out of date entry and we are
>> going to use a GID entry that's going to change in T+epsilon, so RCU doesn't
>> really have an advantage here.
>
> So going back to the problem... we are talking on applications/drivers
> that attempt to establish new connections doing reads and writes done
> on behalf of IP stack changes, both are very much not critical path.
> So this is kind of similar to the neighbour table maintained by ND
> subsystem which is used by all IP based networking applications and
> that code uses RCU. I don't see what's wrong with RCU for our sort
> smaller scale subsystem and what is even wrong with simple rwlock
> which is the mechanism used today by the IB core git cache, this goes
> too complex and for no reason that I can think of.
>

I think the real question is why deal with RCU, which would require 
re-allocation of entries when it's not necessary, or why use a rwlock, 
if the kernel provides a mechanism (called seqcount) that fits this 
problem better.
I disagree about seqcount being complex - if you look at its API you'll 
find it's a lot simpler than RCU.

>
>> The current implementation is a bit more efficient than seqcount, as it
>> allows early termination of read-while-write (because the write puts a known
>> "currently updating" value that the read knows to ignore). AFAIK, this
>> doesn't exist in the current seqcount implementation. However, since this
>> isn't a crucial data-path, I'll change that to seqcount.
>>
>> seqcount is preferred over seqlock, as I don't need the spinlock in seqlock.
Or Gerlitz April 28, 2015, 12:57 p.m. UTC | #21
On Tue, Apr 28, 2015 at 10:17 AM, Matan Barak <matanb@mellanox.com> wrote:
> On 4/27/2015 9:22 PM, Or Gerlitz wrote:
> I think the real question is why to deal with RCUs that will require
> re-allocation of entries when it's not necessary or why do we want to use
> rwlock if the kernel provides a mechanism (called seqcount) that fits this
> problem better?
> I disagree about seqcount being complex - if you look at its API you'll find
> it's a lot simpler than RCU.

I took a 2nd look; seqcount is indeed way simpler than RCU, and by
itself is simple to use. If you feel this provides a better solution
vs. a simple rwlock, I'm good with that.

Or.

Patch

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index acf7367..9b63bdf 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -9,7 +9,8 @@  obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=	ib_uverbs.o ib_ucm.o \
 					$(user_access-y)
 
 ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
-				device.o fmr_pool.o cache.o netlink.o
+				device.o fmr_pool.o cache.o netlink.o \
+				roce_gid_cache.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
 
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 87d1936..a502daa 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -35,6 +35,7 @@ 
 
 #include <linux/list.h>
 #include <linux/spinlock.h>
+#include <net/net_namespace.h>
 
 #include <rdma/ib_verbs.h>
 
@@ -51,4 +52,27 @@  void ib_cache_cleanup(void);
 
 int ib_resolve_eth_l2_attrs(struct ib_qp *qp,
 			    struct ib_qp_attr *qp_attr, int *qp_attr_mask);
+
+int roce_gid_cache_get_gid(struct ib_device *ib_dev, u8 port, int index,
+			   union ib_gid *gid, struct ib_gid_attr *attr);
+
+int roce_gid_cache_find_gid(struct ib_device *ib_dev, union ib_gid *gid,
+			    enum ib_gid_type gid_type, struct net *net,
+			    int if_index, u8 *port, u16 *index);
+
+int roce_gid_cache_find_gid_by_port(struct ib_device *ib_dev, union ib_gid *gid,
+				    enum ib_gid_type gid_type, u8 port,
+				    struct net *net, int if_index, u16 *index);
+
+int roce_gid_cache_is_active(struct ib_device *ib_dev, u8 port);
+
+int roce_add_gid(struct ib_device *ib_dev, u8 port,
+		 union ib_gid *gid, struct ib_gid_attr *attr);
+
+int roce_del_gid(struct ib_device *ib_dev, u8 port,
+		 union ib_gid *gid, struct ib_gid_attr *attr);
+
+int roce_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
+			     struct net_device *ndev);
+
 #endif /* _CORE_PRIV_H */
diff --git a/drivers/infiniband/core/roce_gid_cache.c b/drivers/infiniband/core/roce_gid_cache.c
new file mode 100644
index 0000000..80f364a
--- /dev/null
+++ b/drivers/infiniband/core/roce_gid_cache.c
@@ -0,0 +1,518 @@ 
+/*
+ * Copyright (c) 2015, Mellanox Technologies inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/rtnetlink.h>
+#include <rdma/ib_cache.h>
+
+#include "core_priv.h"
+
+union ib_gid zgid;
+EXPORT_SYMBOL_GPL(zgid);
+
+static const struct ib_gid_attr zattr;
+
+enum gid_attr_find_mask {
+	GID_ATTR_FIND_MASK_GID_TYPE	= 1UL << 0,
+	GID_ATTR_FIND_MASK_NETDEV	= 1UL << 1,
+};
+
+static inline int start_port(struct ib_device *ib_dev)
+{
+	return (ib_dev->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+struct dev_put_rcu {
+	struct rcu_head		rcu;
+	struct net_device	*ndev;
+};
+
+static void put_ndev(struct rcu_head *rcu)
+{
+	struct dev_put_rcu *put_rcu =
+		container_of(rcu, struct dev_put_rcu, rcu);
+
+	dev_put(put_rcu->ndev);
+	kfree(put_rcu);
+}
+
+static int write_gid(struct ib_device *ib_dev, u8 port,
+		     struct ib_roce_gid_cache *cache, int ix,
+		     const union ib_gid *gid,
+		     const struct ib_gid_attr *attr)
+{
+	unsigned int orig_seq;
+	int ret;
+	struct dev_put_rcu	*put_rcu;
+	struct net_device *old_net_dev;
+
+	orig_seq = cache->data_vec[ix].seq;
+	cache->data_vec[ix].seq = -1;
+	/* Ensure that all readers will see invalid sequence
+	 * identifier before starting the actual GID update.
+	 */
+	smp_wmb();
+
+	ret = ib_dev->modify_gid(ib_dev, port, ix, gid, attr,
+				 &cache->data_vec[ix].context);
+
+	old_net_dev = cache->data_vec[ix].attr.ndev;
+	if (old_net_dev && old_net_dev != attr->ndev) {
+		put_rcu = kmalloc(sizeof(*put_rcu), GFP_KERNEL);
+		if (put_rcu) {
+			put_rcu->ndev = old_net_dev;
+			call_rcu(&put_rcu->rcu, put_ndev);
+		} else {
+			pr_warn("roce_gid_cache: can't allocate rcu context, using synchronize\n");
+			synchronize_rcu();
+			dev_put(old_net_dev);
+		}
+	}
+	/* if modify_gid failed, just delete the old gid */
+	if (ret || !memcmp(gid, &zgid, sizeof(*gid))) {
+		gid = &zgid;
+		attr = &zattr;
+		cache->data_vec[ix].context = NULL;
+	}
+	memcpy(&cache->data_vec[ix].gid, gid, sizeof(*gid));
+	memcpy(&cache->data_vec[ix].attr, attr, sizeof(*attr));
+	if (cache->data_vec[ix].attr.ndev &&
+	    cache->data_vec[ix].attr.ndev != old_net_dev)
+		dev_hold(cache->data_vec[ix].attr.ndev);
+
+	/* Ensure that all cached gid data updating is finished before
+	 * marking the entry as available.
+	 */
+	smp_wmb();
+
+	if (++orig_seq == (unsigned int)-1)
+		orig_seq = 0;
+	ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq;
+
+	if (!ret) {
+		struct ib_event event;
+
+		event.device		= ib_dev;
+		event.element.port_num	= port;
+		event.event		= IB_EVENT_GID_CHANGE;
+
+		ib_dispatch_event(&event);
+	}
+	return ret;
+}
+
+static int find_gid(struct ib_roce_gid_cache *cache, union ib_gid *gid,
+		    const struct ib_gid_attr *val, unsigned long mask)
+{
+	int i;
+	unsigned int orig_seq;
+
+	for (i = 0; i < cache->sz; i++) {
+		struct ib_gid_attr *attr = &cache->data_vec[i].attr;
+
+		orig_seq = cache->data_vec[i].seq;
+		if (orig_seq == -1)
+			continue;
+		/* Make sure the sequence number we remeber was read
+		 * before the gid cache entry content is read.
+		 */
+		smp_rmb();
+
+		if (mask & GID_ATTR_FIND_MASK_GID_TYPE &&
+		    attr->gid_type != val->gid_type)
+			continue;
+
+		if (memcmp(gid, &cache->data_vec[i].gid, sizeof(*gid)))
+			continue;
+
+		if (mask & GID_ATTR_FIND_MASK_NETDEV &&
+		    attr->ndev != val->ndev)
+			continue;
+
+		/* We have a match, verify that the data we
+		 * compared is valid. Make sure that the
+		 * sequence number we read is the last to be
+		 * read.
+		 */
+		smp_rmb();
+		if (orig_seq == ACCESS_ONCE(cache->data_vec[i].seq))
+			return i;
+		/* The sequence number changed under our feet,
+		 * the GID entry is invalid. Continue to the
+		 * next entry.
+		 */
+	}
+
+	return -1;
+}
+
+int roce_add_gid(struct ib_device *ib_dev, u8 port,
+		 union ib_gid *gid, struct ib_gid_attr *attr)
+{
+	struct ib_roce_gid_cache *cache;
+	int ix;
+	int ret = 0;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return -ENOSYS;
+
+	cache = ib_dev->cache.roce_gid_cache[port - start_port(ib_dev)];
+
+	if (!cache || !cache->active)
+		return -ENOSYS;
+
+	if (!memcmp(gid, &zgid, sizeof(*gid)))
+		return -EINVAL;
+
+	mutex_lock(&cache->lock);
+
+	ix = find_gid(cache, gid, attr, GID_ATTR_FIND_MASK_GID_TYPE |
+		      GID_ATTR_FIND_MASK_NETDEV);
+	if (ix >= 0)
+		goto out_unlock;
+
+	ix = find_gid(cache, &zgid, NULL, 0);
+	if (ix < 0) {
+		ret = -ENOSPC;
+		goto out_unlock;
+	}
+
+	write_gid(ib_dev, port, cache, ix, gid, attr);
+
+out_unlock:
+	mutex_unlock(&cache->lock);
+	return ret;
+}
+
+int roce_del_gid(struct ib_device *ib_dev, u8 port,
+		 union ib_gid *gid, struct ib_gid_attr *attr)
+{
+	struct ib_roce_gid_cache *cache;
+	int ix;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return 0;
+
+	cache  = ib_dev->cache.roce_gid_cache[port - start_port(ib_dev)];
+
+	if (!cache || !cache->active)
+		return -ENOSYS;
+
+	mutex_lock(&cache->lock);
+
+	ix = find_gid(cache, gid, attr,
+		      GID_ATTR_FIND_MASK_GID_TYPE |
+		      GID_ATTR_FIND_MASK_NETDEV);
+	if (ix < 0)
+		goto out_unlock;
+
+	write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
+
+out_unlock:
+	mutex_unlock(&cache->lock);
+	return 0;
+}
+
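+/* Clear every cache entry of @port whose attributes reference @ndev. */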
+int roce_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
+			     struct net_device *ndev)
+{
+	struct ib_roce_gid_cache *cache;
+	int ix;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return 0;
+
+	cache  = ib_dev->cache.roce_gid_cache[port - start_port(ib_dev)];
+
+	if (!cache || !cache->active)
+		return -ENOSYS;
+
+	mutex_lock(&cache->lock);
+
+	for (ix = 0; ix < cache->sz; ix++)
+		if (cache->data_vec[ix].attr.ndev == ndev)
+			write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
+
+	mutex_unlock(&cache->lock);
+	return 0;
+}
+
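+/* Copy entry @index of @port's cache into @gid and @attr without
+ * locking.  Returns -EAGAIN if the entry changed while being copied,
+ * in which case the caller can simply retry.
+ */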
+int roce_gid_cache_get_gid(struct ib_device *ib_dev, u8 port, int index,
+			   union ib_gid *gid, struct ib_gid_attr *attr)
+{
+	struct ib_roce_gid_cache *cache;
+	union ib_gid local_gid;
+	struct ib_gid_attr local_attr;
+	unsigned int orig_seq;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return -EINVAL;
+
+	cache = ib_dev->cache.roce_gid_cache[port - start_port(ib_dev)];
+
+	if (!cache || !cache->active)
+		return -ENOSYS;
+
+	if (index < 0 || index >= cache->sz)
+		return -EINVAL;
+
+	orig_seq = READ_ONCE(cache->data_vec[index].seq);
+	/* Make sure we read the sequence number before copying the
+	 * gid to local storage.
+	 */
+	smp_rmb();
+
+	memcpy(&local_gid, &cache->data_vec[index].gid, sizeof(local_gid));
+	memcpy(&local_attr, &cache->data_vec[index].attr, sizeof(local_attr));
+	/* Ensure the local copy completed reading before verifying
+	 * the new sequence number.
+	 */
+	smp_rmb();
+
+	if (orig_seq == -1 ||
+	    orig_seq != READ_ONCE(cache->data_vec[index].seq))
+		return -EAGAIN;
+
+	memcpy(gid, &local_gid, sizeof(*gid));
+	if (attr)
+		memcpy(attr, &local_attr, sizeof(*attr));
+	return 0;
+}
+
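+/* Search all Ethernet ports of @ib_dev for a cache entry matching @gid
+ * and the attributes selected by @mask.  On success the port number
+ * and entry index are returned through @port and @index.
+ */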
+static int _roce_gid_cache_find_gid(struct ib_device *ib_dev, union ib_gid *gid,
+				    const struct ib_gid_attr *val,
+				    unsigned long mask,
+				    u8 *port, u16 *index)
+{
+	struct ib_roce_gid_cache *cache;
+	u8 p;
+	int local_index;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return -ENOENT;
+
+	for (p = 0; p < ib_dev->phys_port_cnt; p++) {
+		if (rdma_port_get_link_layer(ib_dev, p + start_port(ib_dev)) !=
+		    IB_LINK_LAYER_ETHERNET)
+			continue;
+		cache = ib_dev->cache.roce_gid_cache[p];
+		if (!cache || !cache->active)
+			continue;
+		local_index = find_gid(cache, gid, val, mask);
+		if (local_index >= 0) {
+			if (index)
+				*index = local_index;
+			if (port)
+				*port = p + start_port(ib_dev);
+			return 0;
+		}
+	}
+
+	return -ENOENT;
+}
+
+static int get_netdev_from_ifindex(struct net *net, int if_index,
+				   struct ib_gid_attr *gid_attr_val)
+{
+	if (if_index && net) {
+		rcu_read_lock();
+		gid_attr_val->ndev = dev_get_by_index_rcu(net, if_index);
+		rcu_read_unlock();
+		if (gid_attr_val->ndev)
+			return GID_ATTR_FIND_MASK_NETDEV;
+	}
+	return 0;
+}
+
+int roce_gid_cache_find_gid(struct ib_device *ib_dev, union ib_gid *gid,
+			    enum ib_gid_type gid_type, struct net *net,
+			    int if_index, u8 *port, u16 *index)
+{
+	unsigned long mask = GID_ATTR_FIND_MASK_GID |
+			     GID_ATTR_FIND_MASK_GID_TYPE;
+	struct ib_gid_attr gid_attr_val = {.gid_type = gid_type};
+
+	mask |= get_netdev_from_ifindex(net, if_index, &gid_attr_val);
+
+	return _roce_gid_cache_find_gid(ib_dev, gid, &gid_attr_val,
+					mask, port, index);
+}
+
+int roce_gid_cache_find_gid_by_port(struct ib_device *ib_dev, union ib_gid *gid,
+				    enum ib_gid_type gid_type, u8 port,
+				    struct net *net, int if_index, u16 *index)
+{
+	int local_index;
+	struct ib_roce_gid_cache *cache;
+	unsigned long mask = GID_ATTR_FIND_MASK_GID_TYPE;
+	struct ib_gid_attr val = {.gid_type = gid_type};
+
+	if (!ib_dev->cache.roce_gid_cache || port < start_port(ib_dev) ||
+	    port >= (start_port(ib_dev) + ib_dev->phys_port_cnt))
+		return -ENOENT;
+
+	cache = ib_dev->cache.roce_gid_cache[port - start_port(ib_dev)];
+	if (!cache || !cache->active)
+		return -ENOENT;
+
+	mask |= get_netdev_from_ifindex(net, if_index, &val);
+
+	local_index = find_gid(cache, gid, &val, mask);
+	if (local_index >= 0) {
+		if (index)
+			*index = local_index;
+		return 0;
+	}
+
+	return -ENOENT;
+}
+
+static struct ib_roce_gid_cache *alloc_roce_gid_cache(int sz)
+{
+	struct ib_roce_gid_cache *cache =
+		kzalloc(sizeof(*cache), GFP_KERNEL);
+
+	if (!cache)
+		return NULL;
+
+	cache->data_vec = kcalloc(sz, sizeof(*cache->data_vec), GFP_KERNEL);
+	if (!cache->data_vec)
+		goto err_free_cache;
+
+	mutex_init(&cache->lock);
+
+	cache->sz = sz;
+
+	return cache;
+
+err_free_cache:
+	kfree(cache);
+	return NULL;
+}
+
+static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port)
+{
+	int i;
+	struct ib_roce_gid_cache *cache =
+		ib_dev->cache.roce_gid_cache[port - 1];
+
+	if (!cache)
+		return;
+
+	for (i = 0; i < cache->sz; ++i) {
+		if (memcmp(&cache->data_vec[i].gid, &zgid,
+			   sizeof(cache->data_vec[i].gid)))
+			write_gid(ib_dev, port, cache, i, &zgid, &zattr);
+	}
+	kfree(cache->data_vec);
+	kfree(cache);
+}
+
+static void set_roce_gid_cache_active(struct ib_roce_gid_cache *cache,
+				      int active)
+{
+	if (!cache)
+		return;
+
+	cache->active = active;
+}
+
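+/* Allocate a per-port GID cache for every Ethernet port of @ib_dev.
+ * Providers must implement the modify_gid callback for the cache to be
+ * set up.
+ */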
+static int roce_gid_cache_setup_one(struct ib_device *ib_dev)
+{
+	u8 port;
+	int err = 0;
+
+	if (!ib_dev->modify_gid)
+		return -ENOSYS;
+
+	ib_dev->cache.roce_gid_cache =
+		kcalloc(ib_dev->phys_port_cnt,
+			sizeof(*ib_dev->cache.roce_gid_cache), GFP_KERNEL);
+
+	if (!ib_dev->cache.roce_gid_cache) {
+		pr_warn("failed to allocate roce addr cache for %s\n",
+			ib_dev->name);
+		return -ENOMEM;
+	}
+
+	for (port = 0; port < ib_dev->phys_port_cnt; port++) {
+		if (rdma_port_get_link_layer(ib_dev, port + start_port(ib_dev))
+		    != IB_LINK_LAYER_ETHERNET)
+			continue;
+		ib_dev->cache.roce_gid_cache[port] =
+			alloc_roce_gid_cache(ib_dev->gid_tbl_len[port]);
+		if (!ib_dev->cache.roce_gid_cache[port]) {
+			err = -ENOMEM;
+			goto rollback_cache_setup;
+		}
+	}
+	return 0;
+
+rollback_cache_setup:
+	for (port = 1; port <= ib_dev->phys_port_cnt; port++)
+		free_roce_gid_cache(ib_dev, port);
+
+	kfree(ib_dev->cache.roce_gid_cache);
+	ib_dev->cache.roce_gid_cache = NULL;
+	return err;
+}
+
+static void roce_gid_cache_cleanup_one(struct ib_device *ib_dev)
+{
+	u8 port;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return;
+
+	for (port = 1; port <= ib_dev->phys_port_cnt; port++)
+		free_roce_gid_cache(ib_dev, port);
+
+	kfree(ib_dev->cache.roce_gid_cache);
+	ib_dev->cache.roce_gid_cache = NULL;
+}
+
+static void roce_gid_cache_set_active_state(struct ib_device *ib_dev,
+					    int active)
+{
+	u8 port;
+
+	if (!ib_dev->cache.roce_gid_cache)
+		return;
+
+	for (port = 0; port < ib_dev->phys_port_cnt; port++)
+		set_roce_gid_cache_active(ib_dev->cache.roce_gid_cache[port],
+					  active);
+}
+
+int roce_gid_cache_is_active(struct ib_device *ib_dev, u8 port)
+{
+	struct ib_roce_gid_cache **cache = ib_dev->cache.roce_gid_cache;
+
+	return cache && cache[port - start_port(ib_dev)] &&
+	       cache[port - start_port(ib_dev)]->active;
+}
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 5261665..6fa5e49 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -93,8 +93,6 @@  static void init_query_mad(struct ib_smp *mad)
 	mad->method	   = IB_MGMT_METHOD_GET;
 }
 
-static union ib_gid zgid;
-
 static int check_flow_steering_support(struct mlx4_dev *dev)
 {
 	int eth_num_ports = 0;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 65994a1..1866595 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -64,6 +64,36 @@  union ib_gid {
 	} global;
 };
 
+extern union ib_gid zgid;
+
+enum ib_gid_type {
+	/* If link layer is Ethernet, this is RoCE V1 */
+	IB_GID_TYPE_IB        = 0,
+	IB_GID_TYPE_ROCE_V2   = 1,
+	IB_GID_TYPE_SIZE
+};
+
+struct ib_gid_attr {
+	enum ib_gid_type	gid_type;
+	struct net_device	*ndev;
+};
+
+struct ib_roce_gid_cache_entry {
+	/* A seq number of -1 indicates that the entry is being changed. */
+	unsigned int        seq;
+	union ib_gid        gid;
+	struct ib_gid_attr  attr;
+	void		   *context;
+};
+
+struct ib_roce_gid_cache {
+	int		     active;
+	int                  sz;
+	/* locking against multiple writes in data_vec */
+	struct mutex         lock;
+	struct ib_roce_gid_cache_entry *data_vec;
+};
+
 enum rdma_node_type {
 	/* IB values map to NodeInfo:NodeType. */
 	RDMA_NODE_IB_CA 	= 1,
@@ -265,7 +295,9 @@  enum ib_port_cap_flags {
 	IB_PORT_BOOT_MGMT_SUP			= 1 << 23,
 	IB_PORT_LINK_LATENCY_SUP		= 1 << 24,
 	IB_PORT_CLIENT_REG_SUP			= 1 << 25,
-	IB_PORT_IP_BASED_GIDS			= 1 << 26
+	IB_PORT_IP_BASED_GIDS			= 1 << 26,
+	IB_PORT_ROCE				= 1 << 27,
+	IB_PORT_ROCE_V2				= 1 << 28,
 };
 
 enum ib_port_width {
@@ -1431,6 +1463,7 @@  struct ib_cache {
 	struct ib_pkey_cache  **pkey_cache;
 	struct ib_gid_cache   **gid_cache;
 	u8                     *lmc_cache;
+	struct ib_roce_gid_cache **roce_gid_cache;
 };
 
 struct ib_dma_mapping_ops {
@@ -1506,6 +1539,26 @@  struct ib_device {
 	int		           (*query_gid)(struct ib_device *device,
 						u8 port_num, int index,
 						union ib_gid *gid);
+	/* When calling modify_gid, the HW vendor's driver should
+	 * modify the gid of device @device at gid index @index of
+	 * port @port to be @gid.  Meta-information about that gid
+	 * (for example, the network device related to this gid) is
+	 * available at @attr.  @context allows the HW vendor driver
+	 * to store extra information together with a GID entry.  The
+	 * HW vendor may allocate memory to hold this information and
+	 * store it in @context when a new GID entry is written.  Upon
+	 * the deletion of a GID entry, the HW vendor must free any
+	 * allocated memory.  The caller will clear @context afterwards.
+	 * GID deletion is done by passing the zero gid.  Params are
+	 * consistent until the next call of modify_gid.  The function
+	 * should return 0 on success or an error code otherwise.  The
+	 * function may be called concurrently for different ports.
+	 */
+	int		           (*modify_gid)(struct ib_device *device,
+						 u8 port_num,
+						 unsigned int index,
+						 const union ib_gid *gid,
+						 const struct ib_gid_attr *attr,
+						 void **context);
 	int		           (*query_pkey)(struct ib_device *device,
 						 u8 port_num, u16 index, u16 *pkey);
 	int		           (*modify_device)(struct ib_device *device,