diff mbox

IB/ipoib: CSUM support in connected mode

Message ID 1438256764-9077-1-git-send-email-yuval.shaia@oracle.com (mailing list archive)
State Changes Requested
Headers show

Commit Message

Yuval Shaia July 30, 2015, 11:46 a.m. UTC
This enhancement suggest the usage of IB CRC instead of CSUM in IPoIB CM.
IPoIB CM uses RC (Reliable Connection) which guarantees the corruption free
delivery of the packet.

InfiniBand uses 32b CRC which provides stronger data integrity protection
compare to 16b IP Checksum. So, there is no added value that IP/TCP
Checksum provides in the IB world.

The proposal is to tell network stack that IPoIB-CM supports IP Checksum
offload. This enables the kernel to save the time of checksum calculation
of IPoIB CM packets. Network sends the IP packet without adding the IP
Checksum to the header. On the receive side, IPoIB driver again tells the
network stack that IP Checksum is good for the incoming packets and network
stack avoids the IP Checksum calculations.

During connection establishment the driver determine if peer supports
IB CRC as checksum. This is done so driver will be able to calculate
checksum before transmiting the packet in case the peer does not support
this feature.

With this enhancement throughput is increased by 60%.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h      |   26 ++++++++++++++++
 drivers/infiniband/ulp/ipoib/ipoib_cm.c   |   46 +++++++++++++++++++++++++++++
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   20 ++++++++++--
 3 files changed, 89 insertions(+), 3 deletions(-)

Comments

Yann Droneaud July 30, 2015, 1:58 p.m. UTC | #1
Hi,

Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit :
> This enhancement suggest the usage of IB CRC instead of CSUM in IPoIB 
> CM. IPoIB CM uses RC (Reliable Connection) which guarantees the 
> corruption free delivery of the packet.
> 
> InfiniBand uses 32b CRC which provides stronger data integrity 
> protection compare to 16b IP Checksum.

InfiniBand 32b CRC <=> Ethernet 32b CRC, it's link layer, layer 2.

IPv4 checksum is at another level, it's internet layer, layer 3.

>  So, there is no added value that IP/TCP Checksum provides in the IB 
> world.
> 

Sure, IPv4 checksum is a thing of the past: checksum was dropped from
IP header in IPv6: it assumes the lower layer, such as Ethernet,
provides the required integrety check.

I think not checking the IPv4 checksum should be a choice, carefully
thought, for inside a fabric, as I understand your proposal, packet
with invalid checksum will be allowed to go in/out of the fabric.

It sound like it's a departure from the behavior one can expect from an
IPv4 network stack.

> The proposal is to tell network stack that IPoIB-CM supports IP 
> Checksum offload. This enables the kernel to save the time of 
> checksum calculation of IPoIB CM packets. Network sends the IP packet 
> without adding the IP Checksum to the header. On the receive side, 
> IPoIB driver again tells the network stack that IP Checksum is good 
> for the incoming packets and network stack avoids the IP Checksum 
> calculations.
> 
> During connection establishment the driver determine if peer supports
> IB CRC as checksum. This is done so driver will be able to calculate
> checksum before transmiting the packet in case the peer does not 
> support this feature.
> 

Two questions:

- What will see tool such as wireshark/tcpdump when sniffing checksum
-less IPv4 packets sent/received on IPoIB interface ?

- What might happen if such checksum-less IPv4 packet is later routed to a different IPv4 network ?

> With this enhancement throughput is increased by 60%.
> 

> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>

Regards.
Yuval Shaia July 30, 2015, 3:20 p.m. UTC | #2
On Thu, Jul 30, 2015 at 03:58:13PM +0200, Yann Droneaud wrote:
> Hi,
> 
> Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit :
> > This enhancement suggest the usage of IB CRC instead of CSUM in IPoIB 
> > CM. IPoIB CM uses RC (Reliable Connection) which guarantees the 
> > corruption free delivery of the packet.
> > 
> > InfiniBand uses 32b CRC which provides stronger data integrity 
> > protection compare to 16b IP Checksum.
> 
> InfiniBand 32b CRC <=> Ethernet 32b CRC, it's link layer, layer 2.
> 
> IPv4 checksum is at another level, it's internet layer, layer 3.
> 
> >  So, there is no added value that IP/TCP Checksum provides in the IB 
> > world.
> > 
> 
> Sure, IPv4 checksum is a thing of the past: checksum was dropped from
> IP header in IPv6: it assumes the lower layer, such as Ethernet,
> provides the required integrety check.
> 
> I think not checking the IPv4 checksum should be a choice, carefully
> thought, for inside a fabric, as I understand your proposal, packet
> with invalid checksum will be allowed to go in/out of the fabric.
Yes, this is why it is controlled by module parameter.
Maybe a better choice would be to default it to 0.
> 
> It sound like it's a departure from the behavior one can expect from an
> IPv4 network stack.
It should be considered as network-fine-tuning parameter so if admin knows his fabric he can use it.
> 
> > The proposal is to tell network stack that IPoIB-CM supports IP 
> > Checksum offload. This enables the kernel to save the time of 
> > checksum calculation of IPoIB CM packets. Network sends the IP packet 
> > without adding the IP Checksum to the header. On the receive side, 
> > IPoIB driver again tells the network stack that IP Checksum is good 
> > for the incoming packets and network stack avoids the IP Checksum 
> > calculations.
> > 
> > During connection establishment the driver determine if peer supports
> > IB CRC as checksum. This is done so driver will be able to calculate
> > checksum before transmiting the packet in case the peer does not 
> > support this feature.
> > 
> 
> Two questions:
Three :)
> 
> - What will see tool such as wireshark/tcpdump when sniffing checksum
Zero or what ever the networking layer puts in csum when H/W supports CSUM-offloading.
Please note that with this patch driver still supports backward computability (per connection).
This means that for connections with peer which does not support this functionality you expect to see this value filled with checksum.
> -less IPv4 packets sent/received on IPoIB interface ?
No
> 
> - What might happen if such checksum-less IPv4 packet is later routed to a different IPv4 network ?
As noted above, for network that is opened to outside world this feature should be blocked.
In general i would say that if a layer 2 terminator device (e.x router) exist in the fabric - this feature can't be used and must be blocked.
With this limitation it still worth use it because of the reason of increasing throughput
> 
> > With this enhancement throughput is increased by 60%.
> > 
> 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> 
> Regards.
> 
> -- 
> Yann Droneaud
> OPTEYA
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford July 30, 2015, 3:51 p.m. UTC | #3
On 07/30/2015 11:20 AM, Yuval Shaia wrote:
> On Thu, Jul 30, 2015 at 03:58:13PM +0200, Yann Droneaud wrote:
>> Hi,
>>
>> Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit :
>>> This enhancement suggest the usage of IB CRC instead of CSUM in IPoIB 
>>> CM. IPoIB CM uses RC (Reliable Connection) which guarantees the 
>>> corruption free delivery of the packet.
>>>
>>> InfiniBand uses 32b CRC which provides stronger data integrity 
>>> protection compare to 16b IP Checksum.
>>
>> InfiniBand 32b CRC <=> Ethernet 32b CRC, it's link layer, layer 2.
>>
>> IPv4 checksum is at another level, it's internet layer, layer 3.
>>
>>>  So, there is no added value that IP/TCP Checksum provides in the IB 
>>> world.
>>>
>>
>> Sure, IPv4 checksum is a thing of the past: checksum was dropped from
>> IP header in IPv6: it assumes the lower layer, such as Ethernet,
>> provides the required integrety check.
>>
>> I think not checking the IPv4 checksum should be a choice, carefully
>> thought, for inside a fabric, as I understand your proposal, packet
>> with invalid checksum will be allowed to go in/out of the fabric.
> Yes, this is why it is controlled by module parameter.
> Maybe a better choice would be to default it to 0.

In it's current form, yes, it should default to 0.

>>
>> It sound like it's a departure from the behavior one can expect from an
>> IPv4 network stack.
> It should be considered as network-fine-tuning parameter so if admin knows his fabric he can use it.
>>
>>> The proposal is to tell network stack that IPoIB-CM supports IP 
>>> Checksum offload. This enables the kernel to save the time of 
>>> checksum calculation of IPoIB CM packets. Network sends the IP packet 
>>> without adding the IP Checksum to the header. On the receive side, 
>>> IPoIB driver again tells the network stack that IP Checksum is good 
>>> for the incoming packets and network stack avoids the IP Checksum 
>>> calculations.
>>>
>>> During connection establishment the driver determine if peer supports
>>> IB CRC as checksum. This is done so driver will be able to calculate
>>> checksum before transmiting the packet in case the peer does not 
>>> support this feature.
>>>
>>
>> Two questions:
> Three :)

No, he really only had 2, the second one was a line split of the word
checksum-less done by his mailer ;-)

>>
>> - What will see tool such as wireshark/tcpdump when sniffing checksum
> Zero or what ever the networking layer puts in csum when H/W supports CSUM-offloading.
> Please note that with this patch driver still supports backward computability (per connection).
> This means that for connections with peer which does not support this functionality you expect to see this value filled with checksum.
>> -less IPv4 packets sent/received on IPoIB interface ?
> No
>>
>> - What might happen if such checksum-less IPv4 packet is later routed to a different IPv4 network ?
> As noted above, for network that is opened to outside world this feature should be blocked.
> In general i would say that if a layer 2 terminator device (e.x router) exist in the fabric - this feature can't be used and must be blocked.
> With this limitation it still worth use it because of the reason of increasing throughput

In its current state, I have my doubts about this patch.  However, it
seems to me that this should be relatively easy to fix in such a way
that you get 90%+ of the performance benefit, and can turn it on by
default, and we don't cause any problems.  Why not perform the checksum
operation on a per connection basis?  This is all IPoIB traffic anyway,
so every send will have a src ip and dst ip.  If the dst ip is link
local to our src ip device, and the connected mode partner is capable of
running without csum, then send that specific packet without doing a
checksum.  If the IP address is not link local, then do the checksum as
normal.  That way if our final destination is on the other side of a
router, we aren't leaking un-checksummed packets.  It means we would
miss out on being able to do checksum-less transfers from host A on
fabric 0 through host B as a router to host C on fabric 1, but I doubt
that's a very common situation to be in.  Or maybe a better way of
putting this is if our next hop IP address != our dest IP address, then
perform the checksum, otherwise if capable of checksum-less operation,
do so.  Can you rework the patch to operate in that manner?
Bart Van Assche July 30, 2015, 4:38 p.m. UTC | #4
On 07/30/2015 04:46 AM, Yuval Shaia wrote:
>   struct ipoib_cm_data {
>   	__be32 qpn; /* High byte MUST be ignored on receive */
>   	__be32 mtu;
> +	__be16 sig; /* must be IPOIB_CM_PROTO_SIG */
> +	__be16 caps; /* 4 bits proto ver and 12 bits capabilities */
>   };

This patch modifies the private login data format that has been 
standardized by the IETF in RFC 4755. Has this modification already been 
discussed with the IETF ?

See also https://tools.ietf.org/html/rfc4755#section-6.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 30, 2015, 5:15 p.m. UTC | #5
On Thu, Jul 30, 2015 at 11:51:12AM -0400, Doug Ledford wrote:

> In its current state, I have my doubts about this patch.  However, it
> seems to me that this should be relatively easy to fix in such a way
> that you get 90%+ of the performance benefit, and can turn it on by
> default, and we don't cause any problems.

The best way to implement this is to leverage all the checksum
offload work people did for virtualization.

Forward the checksum offload status through the RC connection and
recover it on the other side.

Then the far side stack will know it is dealing with a partial
checksum packet and will properly regenerate the checksum if it
re-transmits.

ie doing it this way doesn't totally break the netstack :)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yuval Shaia July 30, 2015, 8:09 p.m. UTC | #6
On Thu, Jul 30, 2015 at 09:38:54AM -0700, Bart Van Assche wrote:
> On 07/30/2015 04:46 AM, Yuval Shaia wrote:
> >  struct ipoib_cm_data {
> >  	__be32 qpn; /* High byte MUST be ignored on receive */
> >  	__be32 mtu;
> >+	__be16 sig; /* must be IPOIB_CM_PROTO_SIG */
> >+	__be16 caps; /* 4 bits proto ver and 12 bits capabilities */
> >  };
> 
> This patch modifies the private login data format that has been
> standardized by the IETF in RFC 4755. Has this modification already
> been discussed with the IETF ?
> 
> See also https://tools.ietf.org/html/rfc4755#section-6.
Yes.
I first want to check how linux community react to this proposal.

Please note that though the standard specify 64 bits of data, the actual
data the driver reads/writes is can be up to 196 bytes.
> 
> Bart.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yuval Shaia July 30, 2015, 8:37 p.m. UTC | #7
On Thu, Jul 30, 2015 at 11:51:12AM -0400, Doug Ledford wrote:
> On 07/30/2015 11:20 AM, Yuval Shaia wrote:
> > On Thu, Jul 30, 2015 at 03:58:13PM +0200, Yann Droneaud wrote:
> >> Hi,
> >>
> >> Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit :
> >>> This enhancement suggest the usage of IB CRC instead of CSUM in IPoIB 
> >>> CM. IPoIB CM uses RC (Reliable Connection) which guarantees the 
> >>> corruption free delivery of the packet.
> >>>
> >>> InfiniBand uses 32b CRC which provides stronger data integrity 
> >>> protection compare to 16b IP Checksum.
> >>
> >> InfiniBand 32b CRC <=> Ethernet 32b CRC, it's link layer, layer 2.
> >>
> >> IPv4 checksum is at another level, it's internet layer, layer 3.
> >>
> >>>  So, there is no added value that IP/TCP Checksum provides in the IB 
> >>> world.
> >>>
> >>
> >> Sure, IPv4 checksum is a thing of the past: checksum was dropped from
> >> IP header in IPv6: it assumes the lower layer, such as Ethernet,
> >> provides the required integrety check.
> >>
> >> I think not checking the IPv4 checksum should be a choice, carefully
> >> thought, for inside a fabric, as I understand your proposal, packet
> >> with invalid checksum will be allowed to go in/out of the fabric.
> > Yes, this is why it is controlled by module parameter.
> > Maybe a better choice would be to default it to 0.
> 
> In it's current form, yes, it should default to 0.
> 
> >>
> >> It sound like it's a departure from the behavior one can expect from an
> >> IPv4 network stack.
> > It should be considered as network-fine-tuning parameter so if admin knows his fabric he can use it.
> >>
> >>> The proposal is to tell network stack that IPoIB-CM supports IP 
> >>> Checksum offload. This enables the kernel to save the time of 
> >>> checksum calculation of IPoIB CM packets. Network sends the IP packet 
> >>> without adding the IP Checksum to the header. On the receive side, 
> >>> IPoIB driver again tells the network stack that IP Checksum is good 
> >>> for the incoming packets and network stack avoids the IP Checksum 
> >>> calculations.
> >>>
> >>> During connection establishment the driver determine if peer supports
> >>> IB CRC as checksum. This is done so driver will be able to calculate
> >>> checksum before transmiting the packet in case the peer does not 
> >>> support this feature.
> >>>
> >>
> >> Two questions:
> > Three :)
> 
> No, he really only had 2, the second one was a line split of the word
> checksum-less done by his mailer ;-)
> 
> >>
> >> - What will see tool such as wireshark/tcpdump when sniffing checksum
> > Zero or what ever the networking layer puts in csum when H/W supports CSUM-offloading.
> > Please note that with this patch driver still supports backward computability (per connection).
> > This means that for connections with peer which does not support this functionality you expect to see this value filled with checksum.
> >> -less IPv4 packets sent/received on IPoIB interface ?
> > No
> >>
> >> - What might happen if such checksum-less IPv4 packet is later routed to a different IPv4 network ?
> > As noted above, for network that is opened to outside world this feature should be blocked.
> > In general i would say that if a layer 2 terminator device (e.x router) exist in the fabric - this feature can't be used and must be blocked.
> > With this limitation it still worth use it because of the reason of increasing throughput
> 
> In its current state, I have my doubts about this patch.  However, it
> seems to me that this should be relatively easy to fix in such a way
> that you get 90%+ of the performance benefit, and can turn it on by
> default, and we don't cause any problems.  Why not perform the checksum
> operation on a per connection basis?  This is all IPoIB traffic anyway,
This part is already implemented.
Actually this is the main purpose of adding 'caps' field to ipoib_cm_tx.
The peer capabilities (currently only one option but design let us add
up to 12 capabilities in the future) is passed in IPoIB's private data and
saved in ipoib_cm_tx.caps per connection basis.
Then, on ipoib_cm_send, the decision is made based on that (and on some
other conditions) and if needed - the driver calculate the checksum just
before sending.
> so every send will have a src ip and dst ip.  If the dst ip is link
> local to our src ip device, and the connected mode partner is capable of
> running without csum, then send that specific packet without doing a
> checksum.  If the IP address is not link local, then do the checksum as
> normal.  That way if our final destination is on the other side of a
> router, we aren't leaking un-checksummed packets.  It means we would
> miss out on being able to do checksum-less transfers from host A on
> fabric 0 through host B as a router to host C on fabric 1, but I doubt
> that's a very common situation to be in.  Or maybe a better way of
> putting this is if our next hop IP address != our dest IP address, then
> perform the checksum, otherwise if capable of checksum-less operation,
> do so.  Can you rework the patch to operate in that manner?
I think that the concern with 'router' is that when packet goes into it
and then goes out from it - we cannot trust end-to-end IB-CRC as this is
layer 2 CRC.
So, if i understand you correctly, you suggest to tread every host beyond
a router as one that does not support this "fake" and to calculate csum
for it?
This make sense to me but does it cover all such cases (where we can't
trust end-to-end IB-CRC)?
If yes then sure, it is easy to implement.
This way we can default it to 1 and get rid of this module param.
> 
> 
> -- 
> Doug Ledford <dledford@redhat.com>
>               GPG KeyID: 0E572FDD
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yuval Shaia July 30, 2015, 8:46 p.m. UTC | #8
On Thu, Jul 30, 2015 at 11:15:38AM -0600, Jason Gunthorpe wrote:
> On Thu, Jul 30, 2015 at 11:51:12AM -0400, Doug Ledford wrote:
> 
> > In its current state, I have my doubts about this patch.  However, it
> > seems to me that this should be relatively easy to fix in such a way
> > that you get 90%+ of the performance benefit, and can turn it on by
> > default, and we don't cause any problems.
> 
> The best way to implement this is to leverage all the checksum
> offload work people did for virtualization.
> 
> Forward the checksum offload status through the RC connection and
> recover it on the other side.
The current approach is to utilize IPoIB's private-data to exchange this
information.
> 
> Then the far side stack will know it is dealing with a partial
> checksum packet and will properly regenerate the checksum if it
> re-transmits.
> 
> ie doing it this way doesn't totally break the netstack :)
> 
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 30, 2015, 9 p.m. UTC | #9
On Thu, Jul 30, 2015 at 11:46:36PM +0300, Yuval Shaia wrote:
> On Thu, Jul 30, 2015 at 11:15:38AM -0600, Jason Gunthorpe wrote:
> > On Thu, Jul 30, 2015 at 11:51:12AM -0400, Doug Ledford wrote:
> > 
> > > In its current state, I have my doubts about this patch.  However, it
> > > seems to me that this should be relatively easy to fix in such a way
> > > that you get 90%+ of the performance benefit, and can turn it on by
> > > default, and we don't cause any problems.
> > 
> > The best way to implement this is to leverage all the checksum
> > offload work people did for virtualization.
> > 
> > Forward the checksum offload status through the RC connection and
> > recover it on the other side.
> The current approach is to utilize IPoIB's private-data to exchange this
> information.

You need private-data exchange to negotiate the feature.

The feature should be a per-packet csum status header.

When sending a skb that is already fully csumed the receiver sets
CHECKSUM_UNNECESSARY.

When sending a skb that has CHECKSUM_PARTIAL then the
receiver needs to call skb_partial_csum_set.

Look at how something like VIRTIO_NET_HDR_F_NEEDS_CSUM works and copy
that scheme.

DO NOT EVER set CHECKSUM_UNNECESSARY on packets that do not have valid
csums - that breaks the net stack.

Yes, you need to add a header to all packets to support this scheme,
that is what the private-data negotiation is for.

While you are at it, I'd make room for something like
VIRTIO_NET_HDR_GSO_* in the RC protocol too. Implementing GSO
forwarding is probably another big performance win.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche July 31, 2015, 2:03 a.m. UTC | #10
On 07/30/15 13:09, Yuval Shaia wrote:
> On Thu, Jul 30, 2015 at 09:38:54AM -0700, Bart Van Assche wrote:
>> On 07/30/2015 04:46 AM, Yuval Shaia wrote:
>>>   struct ipoib_cm_data {
>>>   	__be32 qpn; /* High byte MUST be ignored on receive */
>>>   	__be32 mtu;
>>> +	__be16 sig; /* must be IPOIB_CM_PROTO_SIG */
>>> +	__be16 caps; /* 4 bits proto ver and 12 bits capabilities */
>>>   };
>>
>> This patch modifies the private login data format that has been
>> standardized by the IETF in RFC 4755. Has this modification already
>> been discussed with the IETF ?
>>
> Yes.

Hello Yuval,

It should have been mentioned in the patch description that this patch 
modifies the wire format, how it modifies the wire format, and also 
which feedback (if any) has been received so far from the IETF.

Thanks,

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford Sept. 3, 2015, 8:29 p.m. UTC | #11
On 07/30/2015 10:03 PM, Bart Van Assche wrote:
> On 07/30/15 13:09, Yuval Shaia wrote:
>> On Thu, Jul 30, 2015 at 09:38:54AM -0700, Bart Van Assche wrote:
>>> On 07/30/2015 04:46 AM, Yuval Shaia wrote:
>>>>   struct ipoib_cm_data {
>>>>       __be32 qpn; /* High byte MUST be ignored on receive */
>>>>       __be32 mtu;
>>>> +    __be16 sig; /* must be IPOIB_CM_PROTO_SIG */
>>>> +    __be16 caps; /* 4 bits proto ver and 12 bits capabilities */
>>>>   };
>>>
>>> This patch modifies the private login data format that has been
>>> standardized by the IETF in RFC 4755. Has this modification already
>>> been discussed with the IETF ?
>>>
>> Yes.
> 
> Hello Yuval,
> 
> It should have been mentioned in the patch description that this patch
> modifies the wire format, how it modifies the wire format, and also
> which feedback (if any) has been received so far from the IETF.

There hasn't been any follow up on this patch in the last month.  Yuval,
is there still progress being made?
Yuval Shaia Sept. 6, 2015, 11:27 a.m. UTC | #12
On Thu, Sep 03, 2015 at 04:29:45PM -0400, Doug Ledford wrote:
> On 07/30/2015 10:03 PM, Bart Van Assche wrote:
> > On 07/30/15 13:09, Yuval Shaia wrote:
> >> On Thu, Jul 30, 2015 at 09:38:54AM -0700, Bart Van Assche wrote:
> >>> On 07/30/2015 04:46 AM, Yuval Shaia wrote:
> >>>>   struct ipoib_cm_data {
> >>>>       __be32 qpn; /* High byte MUST be ignored on receive */
> >>>>       __be32 mtu;
> >>>> +    __be16 sig; /* must be IPOIB_CM_PROTO_SIG */
> >>>> +    __be16 caps; /* 4 bits proto ver and 12 bits capabilities */
> >>>>   };
> >>>
> >>> This patch modifies the private login data format that has been
> >>> standardized by the IETF in RFC 4755. Has this modification already
> >>> been discussed with the IETF ?
> >>>
> >> Yes.
> > 
> > Hello Yuval,
> > 
> > It should have been mentioned in the patch description that this patch
> > modifies the wire format, how it modifies the wire format, and also
> > which feedback (if any) has been received so far from the IETF.
> 
> There hasn't been any follow up on this patch in the last month.  Yuval,
> is there still progress being made?
Unfortunately no progress yet, too many P1-s.
I will resume work soon.
> 
> 
> -- 
> Doug Ledford <dledford@redhat.com>
>               GPG KeyID: 0E572FDD
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yuval Shaia Nov. 18, 2015, 10:27 a.m. UTC | #13
On Thu, Jul 30, 2015 at 07:03:22PM -0700, Bart Van Assche wrote:
> On 07/30/15 13:09, Yuval Shaia wrote:
> >On Thu, Jul 30, 2015 at 09:38:54AM -0700, Bart Van Assche wrote:
> >>On 07/30/2015 04:46 AM, Yuval Shaia wrote:
> >>>  struct ipoib_cm_data {
> >>>  	__be32 qpn; /* High byte MUST be ignored on receive */
> >>>  	__be32 mtu;
> >>>+	__be16 sig; /* must be IPOIB_CM_PROTO_SIG */
> >>>+	__be16 caps; /* 4 bits proto ver and 12 bits capabilities */
> >>>  };
> >>
> >>This patch modifies the private login data format that has been
> >>standardized by the IETF in RFC 4755. Has this modification already
> >>been discussed with the IETF ?
> >>
> >Yes.
> 
> Hello Yuval,
> 
> It should have been mentioned in the patch description that this
> patch modifies the wire format, how it modifies the wire format, and
Thanks,
Will add something like this:
"To support this, the Private-Data is extended (by 32 bits) with two new
attributes - protocol version and capabilities."
> also which feedback (if any) has been received so far from the IETF.
> 
> Thanks,
> 
> Bart.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yuval Shaia Nov. 18, 2015, 10:46 a.m. UTC | #14
On Thu, Jul 30, 2015 at 03:58:13PM +0200, Yann Droneaud wrote:
> Hi,
> 
> Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit :
> > This enhancement suggest the usage of IB CRC instead of CSUM in IPoIB 
> > CM. IPoIB CM uses RC (Reliable Connection) which guarantees the 
> > corruption free delivery of the packet.
> > 
> > InfiniBand uses 32b CRC which provides stronger data integrity 
> > protection compare to 16b IP Checksum.
> 
> InfiniBand 32b CRC <=> Ethernet 32b CRC, it's link layer, layer 2.
> 
> IPv4 checksum is at another level, it's internet layer, layer 3.
> 
> >  So, there is no added value that IP/TCP Checksum provides in the IB 
> > world.
> > 
> 
> Sure, IPv4 checksum is a thing of the past: checksum was dropped from
> IP header in IPv6: it assumes the lower layer, such as Ethernet,
> provides the required integrety check.
Right, will change description to something like this:
InfiniBand 32b CRC offers strong data integrity protection compare CSUM.
> 
> I think not checking the IPv4 checksum should be a choice, carefully
> thought, for inside a fabric, as I understand your proposal, packet
> with invalid checksum will be allowed to go in/out of the fabric.
If peer supports this feature do you see any reason why not?
> 
> It sound like it's a departure from the behavior one can expect from an
> IPv4 network stack.
> 
> > The proposal is to tell network stack that IPoIB-CM supports IP 
> > Checksum offload. This enables the kernel to save the time of 
> > checksum calculation of IPoIB CM packets. Network sends the IP packet 
> > without adding the IP Checksum to the header. On the receive side, 
> > IPoIB driver again tells the network stack that IP Checksum is good 
> > for the incoming packets and network stack avoids the IP Checksum 
> > calculations.
> > 
> > During connection establishment the driver determine if peer supports
> > IB CRC as checksum. This is done so driver will be able to calculate
> > checksum before transmiting the packet in case the peer does not 
> > support this feature.
> > 
> 
> Two questions:
> 
> - What will see tool such as wireshark/tcpdump when sniffing checksum
> -less IPv4 packets sent/received on IPoIB interface ?
Never checked it but i assume 0 or what ever the kernel put there when
driver reports NETIF_F_IP_CSUM.
I can check and zero if needed but any reason to believe it is needed?
> 
> - What might happen if such checksum-less IPv4 packet is later routed to a different IPv4 network ?
Same as my question above, if peer supports this feature do you see
anything wrong?
> 
> > With this enhancement throughput is increased by 60%.
> > 
> 
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> 
> Regards.
> 
> -- 
> Yann Droneaud
> OPTEYA
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yuval Shaia Nov. 18, 2015, 8:27 p.m. UTC | #15
On Thu, Jul 30, 2015 at 03:00:18PM -0600, Jason Gunthorpe wrote:
> On Thu, Jul 30, 2015 at 11:46:36PM +0300, Yuval Shaia wrote:
> > On Thu, Jul 30, 2015 at 11:15:38AM -0600, Jason Gunthorpe wrote:
> > > On Thu, Jul 30, 2015 at 11:51:12AM -0400, Doug Ledford wrote:
> > > 
> > > > In its current state, I have my doubts about this patch.  However, it
> > > > seems to me that this should be relatively easy to fix in such a way
> > > > that you get 90%+ of the performance benefit, and can turn it on by
> > > > default, and we don't cause any problems.
> > > 
> > > The best way to implement this is to leverage all the checksum
> > > offload work people did for virtualization.
> > > 
> > > Forward the checksum offload status through the RC connection and
> > > recover it on the other side.
> > The current approach is to utilize IPoIB's private-data to exchange this
> > information.
> 
> You need private-data exchange to negotiate the feature.
> 
> The feature should be a per-packet csum status header.
> 
> When sending a skb that is already fully csumed the receiver sets
> CHECKSUM_UNNECESSARY.
> 
> When sending a skb that has CHECKSUM_PARTIAL then the
> receiver needs to call skb_partial_csum_set.
> 
> Look at how something like VIRTIO_NET_HDR_F_NEEDS_CSUM works and copy
> that scheme.
Correct me if i'm wrong here but isn't this protocol assume both parties
are aware of this special header?
My case is a bit different, driver must support backward computability in a
way that peer maybe a driver that do not support this feature and expect
packet to be full checksummed.
> 
> DO NOT EVER set CHECKSUM_UNNECESSARY on packets that do not have valid
> csums - that breaks the net stack.
The entire idea here is to fake csum offload so how would i tell the stack
not to run csum on incoming packet?
> 
> Yes, you need to add a header to all packets to support this scheme,
> that is what the private-data negotiation is for.
> 
> While you are at it, I'd make room for something like
> VIRTIO_NET_HDR_GSO_* in the RC protocol too. Implementing GSO
> forwarding is probably another big performance win.
If i understood you correctly and you mean exchange of
"driver-capabilities", then yes, it is there with the extends of
ipoib_cm_data structure.
> 
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Nov. 23, 2015, 6:29 p.m. UTC | #16
On Wed, Nov 18, 2015 at 10:27:41PM +0200, Yuval Shaia wrote:
> > You need private-data exchange to negotiate the feature.
> > 
> > The feature should be a per-packet csum status header.
> > 
> > When sending a skb that is already fully csumed the receiver sets
> > CHECKSUM_UNNECESSARY.
> > 
> > When sending a skb that has CHECKSUM_PARTIAL then the
> > receiver needs to call skb_partial_csum_set.
> > 
> > Look at how something like VIRTIO_NET_HDR_F_NEEDS_CSUM works and copy
> > that scheme.

> Correct me if i'm wrong here but isn't this protocol assume both parties
> are aware of this special header?

Yes.

No matter what you do both sides must be aware of the change in
protocol, doing it correctly requires adding a small wire header to
flow through the checksum state. This would be enabled once
negotiation confirms both sides will support this.

> My case is a bit different, driver must support backward
> computability in a way that peer maybe a driver that do not support
> this feature and expect packet to be full checksummed.

This is why negotiation is mandatory.

> > DO NOT EVER set CHECKSUM_UNNECESSARY on packets that do not have valid
> > csums - that breaks the net stack.
> The entire idea here is to fake csum offload so how would i tell the stack
> not to run csum on incoming packet?

I already explained this, call skb_partial_csum_set and the stack will
avoid csum work.

> > Yes, you need to add a header to all packets to support this scheme,
> > that is what the private-data negotiation is for.
> > 
> > While you are at it, I'd make room for something like
> > VIRTIO_NET_HDR_GSO_* in the RC protocol too. Implementing GSO
> > forwarding is probably another big performance win.
> If i understood you correctly and you mean exchange of
> "driver-capabilities", then yes, it is there with the extends of
> ipoib_cm_data structure.

You need to dig into how the things I referenced above work, fully
understand them and then adapt them to IPoIB.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yann Droneaud Nov. 24, 2015, 9:45 a.m. UTC | #17
Hi,

Le mercredi 18 novembre 2015 à 12:46 +0200, Yuval Shaia a écrit :
> On Thu, Jul 30, 2015 at 03:58:13PM +0200, Yann Droneaud wrote:
> > Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit :

> > > This enhancement suggest the usage of IB CRC instead of CSUM in
> > > IPoIB CM. IPoIB CM uses RC (Reliable Connection) which guarantees
> > > the corruption free delivery of the packet.
> > > 
> > > InfiniBand uses 32b CRC which provides stronger data integrity 
> > > protection compare to 16b IP Checksum.
> > 
> > InfiniBand 32b CRC <=> Ethernet 32b CRC, it's link layer, layer 2.
> > 
> > IPv4 checksum is at another level, it's internet layer, layer 3.
> > 
> > > So, there is no added value that IP/TCP Checksum provides in the
> > > IB world.
> > > 
> > 
> > Sure, IPv4 checksum is a thing of the past: checksum was dropped
> > from IP header in IPv6: it assumes the lower layer, such as
> > Ethernet, provides the required integrety check.

> Right, will change description to something like this:
> InfiniBand 32b CRC offers strong data integrity protection compare
> CSUM.

This patch must clearly state that, without IPv4 header checksum, it's
no more IP protocol as described by RFC 791

> > 
> > I think not checking the IPv4 checksum should be a choice,
> > carefully thought, for inside a fabric, as I understand your
> > proposal, packet with invalid checksum will be allowed to go in/out
> > of the fabric.

> If peer supports this feature do you see any reason why not?

You're free to run you own protocol inside your private network, but
this protocol is not compatible with IPv4.

> > 
> > It sound like it's a departure from the behavior one can expect
> > from an IPv4 network stack.
> > 
> > > The proposal is to tell network stack that IPoIB-CM supports IP 
> > > Checksum offload. This enables the kernel to save the time of 
> > > checksum calculation of IPoIB CM packets. Network sends the IP
> > > packet without adding the IP Checksum to the header. On the
> > > receive side, IPoIB driver again tells the network stack that IP
> > > Checksum is good for the incoming packets and network stack
> > > avoids the IP Checksum calculations.
> > > 
> > > During connection establishment the driver determine if peer
> > > supports IB CRC as checksum. This is done so driver will be able
> > > to calculate checksum before transmiting the packet in case the
> > > peer does not support this feature.
> > > 
> > 
> > Two questions:
> > 
> > - What will see tool such as wireshark/tcpdump when sniffing
> > checksum-less IPv4 packets sent/received on IPoIB interface ?

> Never checked it but i assume 0 or what ever the kernel put there
> when driver reports NETIF_F_IP_CSUM. I can check and zero if needed
> but any reason to believe it is needed?

I don't think it's needed if you agree that this modification creates a
new protocol which is no more IPv4.

> > 
> > - What might happen if such checksum-less IPv4 packet is later
> > routed to a different IPv4 network ?

> Same as my question above, if peer supports this feature do you see
> anything wrong?

If peer is going to forward this packet to a different network, which
is not IPoIB based, the checksum will be checked and the packet will be
rejected.

Regards.

-- 
Yann Droneaud
OPTEYA

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Nov. 24, 2015, 6:59 p.m. UTC | #18
On Tue, Nov 24, 2015 at 10:45:14AM +0100, Yann Droneaud wrote:

> > Same as my question above, if peer supports this feature do you see
> > anything wrong?
> 
> If peer is going to forward this packet to a different network, which
> is not IPoIB based, the checksum will be checked and the packet will be
> rejected.

Exactly, this is why this approach has never been acceptable for
mainline. Arrange things so ipoib can use skb_partial_csum_set.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 79859c4..67e6f05 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -92,6 +92,7 @@  enum {
 	IPOIB_FLAG_UMCAST	  = 10,
 	IPOIB_STOP_NEIGH_GC	  = 11,
 	IPOIB_NEIGH_TBL_FLUSH	  = 12,
+	IPOIB_FLAG_CSUM		  = 15,
 
 	IPOIB_MAX_BACKOFF_SECONDS = 16,
 
@@ -183,9 +184,20 @@  struct ipoib_cm_tx_buf {
 
 struct ib_cm_id;
 
+/* Signature so driver can make sure ipoib_cm_data.caps is valid */
+#define IPOIB_CM_PROTO_SIG	0x2211
+/* Current driver ipoib_cm_data version */
+#define IPOIB_CM_PROTO_VER	(1UL << 12)
+
+enum ipoib_cm_data_caps {
+	IPOIB_CM_CAPS_IBCRC_AS_CSUM	= 1UL << 0,
+};
+
 struct ipoib_cm_data {
 	__be32 qpn; /* High byte MUST be ignored on receive */
 	__be32 mtu;
+	__be16 sig; /* must be IPOIB_CM_PROTO_SIG */
+	__be16 caps; /* 4 bits proto ver and 12 bits capabilities */
 };
 
 /*
@@ -230,6 +242,7 @@  struct ipoib_cm_rx {
 	unsigned long		jiffies;
 	enum ipoib_cm_state	state;
 	int			recv_count;
+	u16			caps;
 };
 
 struct ipoib_cm_tx {
@@ -244,6 +257,7 @@  struct ipoib_cm_tx {
 	unsigned	     tx_tail;
 	unsigned long	     flags;
 	u32		     mtu;
+	u16		     caps;
 };
 
 struct ipoib_cm_rx_buf {
@@ -452,8 +466,20 @@  void ipoib_del_neighs_by_gid(struct net_device *dev, u8 *gid);
 
 extern struct workqueue_struct *ipoib_workqueue;
 
+extern int cm_ibcrc_as_csum;
+
 /* functions */
 
+static inline int ipoib_cm_check_proto_sig(u16 proto_sig)
+{
+	return (proto_sig == IPOIB_CM_PROTO_SIG);
+}
+
+static inline int ipoib_cm_check_proto_ver(u16 caps)
+{
+	return ((caps & 0xF000) == IPOIB_CM_PROTO_VER);
+}
+
 int ipoib_poll(struct napi_struct *napi, int budget);
 void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr);
 void ipoib_send_comp_handler(struct ib_cq *cq, void *dev_ptr);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index ee39be6..5578c69 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -423,9 +423,16 @@  static int ipoib_cm_send_rep(struct net_device *dev, struct ib_cm_id *cm_id,
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_cm_data data = {};
 	struct ib_cm_rep_param rep = {};
+	u16 caps = 0;
+
+	caps |= IPOIB_CM_PROTO_VER;
+	if (cm_ibcrc_as_csum && test_bit(IPOIB_FLAG_CSUM, &priv->flags))
+		caps |= IPOIB_CM_CAPS_IBCRC_AS_CSUM;
 
 	data.qpn = cpu_to_be32(priv->qp->qp_num);
 	data.mtu = cpu_to_be32(IPOIB_CM_BUF_SIZE);
+	data.sig = cpu_to_be16(IPOIB_CM_PROTO_SIG);
+	data.caps = cpu_to_be16(caps);
 
 	rep.private_data = &data;
 	rep.private_data_len = sizeof data;
@@ -444,6 +451,7 @@  static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	struct ipoib_cm_rx *p;
 	unsigned psn;
 	int ret;
+	struct ipoib_cm_data *cm_data;
 
 	ipoib_dbg(priv, "REQ arrived\n");
 	p = kzalloc(sizeof *p, GFP_KERNEL);
@@ -462,6 +470,13 @@  static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 		goto err_qp;
 	}
 
+	cm_data = (struct ipoib_cm_data *)event->private_data;
+	ipoib_dbg(priv, "Otherend sig=0x%x\n", be16_to_cpu(cm_data->sig));
+	if (ipoib_cm_check_proto_sig(be16_to_cpu(cm_data->sig)) &&
+	    ipoib_cm_check_proto_ver(be16_to_cpu(cm_data->caps)))
+		p->caps = be16_to_cpu(cm_data->caps);
+	ipoib_dbg(priv, "Otherend caps=0x%x\n", p->caps);
+
 	psn = prandom_u32() & 0xffffff;
 	ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
 	if (ret)
@@ -672,6 +687,10 @@  copied:
 	skb->dev = dev;
 	/* XXX get correct PACKET_ type here */
 	skb->pkt_type = PACKET_HOST;
+
+	if (cm_ibcrc_as_csum && test_bit(IPOIB_FLAG_CSUM, &priv->flags))
+		skb->ip_summed = CHECKSUM_UNNECESSARY;
+
 	netif_receive_skb(skb);
 
 repost:
@@ -733,6 +752,18 @@  void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 	tx_req = &tx->tx_ring[tx->tx_head & (ipoib_sendq_size - 1)];
 	tx_req->skb = skb;
 
+	/* Calculate checksum if we support ibcrc_as_csum but peer is not */
+	if ((skb->ip_summed == CHECKSUM_PARTIAL) && cm_ibcrc_as_csum &&
+	    test_bit(IPOIB_FLAG_CSUM, &priv->flags) &&
+	    !(tx->caps & IPOIB_CM_CAPS_IBCRC_AS_CSUM)) {
+		if (skb_checksum_help(skb)) {
+			ipoib_warn(priv, "Fail to csum skb\n");
+			++dev->stats.tx_errors;
+			dev_kfree_skb_any(skb);
+			return;
+		}
+	}
+
 	if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) {
 		++dev->stats.tx_errors;
 		dev_kfree_skb_any(skb);
@@ -954,6 +985,7 @@  static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	struct ib_qp_attr qp_attr;
 	int qp_attr_mask, ret;
 	struct sk_buff *skb;
+	struct ipoib_cm_data *cm_data;
 
 	p->mtu = be32_to_cpu(data->mtu);
 
@@ -963,6 +995,13 @@  static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 		return -EINVAL;
 	}
 
+	cm_data = (struct ipoib_cm_data *)event->private_data;
+	ipoib_dbg(priv, "Otherend sig=0x%x\n", be16_to_cpu(cm_data->sig));
+	if (ipoib_cm_check_proto_sig(be16_to_cpu(cm_data->sig)) &&
+	    ipoib_cm_check_proto_ver(be16_to_cpu(cm_data->caps)))
+		p->caps = be16_to_cpu(cm_data->caps);
+	ipoib_dbg(priv, "Otherend caps=0x%x\n", p->caps);
+
 	qp_attr.qp_state = IB_QPS_RTR;
 	ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask);
 	if (ret) {
@@ -1051,9 +1090,16 @@  static int ipoib_cm_send_req(struct net_device *dev,
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_cm_data data = {};
 	struct ib_cm_req_param req = {};
+	u16 caps = 0;
+
+	caps |= IPOIB_CM_PROTO_VER;
+	if (cm_ibcrc_as_csum && test_bit(IPOIB_FLAG_CSUM, &priv->flags))
+		caps |= IPOIB_CM_CAPS_IBCRC_AS_CSUM;
 
 	data.qpn = cpu_to_be32(priv->qp->qp_num);
 	data.mtu = cpu_to_be32(IPOIB_CM_BUF_SIZE);
+	data.sig = cpu_to_be16(IPOIB_CM_PROTO_SIG);
+	data.caps = cpu_to_be16(caps);
 
 	req.primary_path		= pathrec;
 	req.alternate_path		= NULL;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index b2943c8..a1940e2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -73,6 +73,11 @@  module_param_named(debug_level, ipoib_debug_level, int, 0644);
 MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0");
 #endif
 
+int cm_ibcrc_as_csum = 1;
+module_param_named(cm_ibcrc_as_csum, cm_ibcrc_as_csum, int, 0444);
+MODULE_PARM_DESC(cm_ibcrc_as_csum,
+		 "Indicates whether to utilize IB-CRC as CSUM in connected mode,(default: 1)");
+
 struct ipoib_path_iter {
 	struct net_device *dev;
 	struct ipoib_path  path;
@@ -189,8 +194,12 @@  static netdev_features_t ipoib_fix_features(struct net_device *dev, netdev_featu
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	if (test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags))
-		features &= ~(NETIF_F_IP_CSUM | NETIF_F_TSO);
+	if (test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags)) {
+		features &= ~NETIF_F_TSO;
+		if (!(cm_ibcrc_as_csum && (test_bit(IPOIB_FLAG_CSUM,
+					   &priv->flags))))
+			features &= ~(NETIF_F_IP_CSUM | NETIF_F_TSO);
+	}
 
 	return features;
 }
@@ -234,7 +243,11 @@  int ipoib_set_mode(struct net_device *dev, const char *buf)
 		netdev_update_features(dev);
 		dev_set_mtu(dev, ipoib_cm_max_mtu(dev));
 		rtnl_unlock();
-		priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
+		if (cm_ibcrc_as_csum && (test_bit(IPOIB_FLAG_CSUM,
+						  &priv->flags)))
+			priv->tx_wr.send_flags |= IB_SEND_IP_CSUM;
+		else
+			priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
 
 		ipoib_flush_paths(dev);
 		rtnl_lock();
@@ -1552,6 +1565,7 @@  int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
 	kfree(device_attr);
 
 	if (priv->hca_caps & IB_DEVICE_UD_IP_CSUM) {
+		set_bit(IPOIB_FLAG_CSUM, &priv->flags);
 		priv->dev->hw_features = NETIF_F_SG |
 			NETIF_F_IP_CSUM | NETIF_F_RXCSUM;