Message ID | 1438256764-9077-1-git-send-email-yuval.shaia@oracle.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Hi, Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit : > This enhancement suggest the usage of IB CRC instead of CSUM in IPoIB > CM. IPoIB CM uses RC (Reliable Connection) which guarantees the > corruption free delivery of the packet. > > InfiniBand uses 32b CRC which provides stronger data integrity > protection compare to 16b IP Checksum. InfiniBand 32b CRC <=> Ethernet 32b CRC, it's link layer, layer 2. IPv4 checksum is at another level, it's internet layer, layer 3. > So, there is no added value that IP/TCP Checksum provides in the IB > world. > Sure, IPv4 checksum is a thing of the past: checksum was dropped from IP header in IPv6: it assumes the lower layer, such as Ethernet, provides the required integrety check. I think not checking the IPv4 checksum should be a choice, carefully thought, for inside a fabric, as I understand your proposal, packet with invalid checksum will be allowed to go in/out of the fabric. It sound like it's a departure from the behavior one can expect from an IPv4 network stack. > The proposal is to tell network stack that IPoIB-CM supports IP > Checksum offload. This enables the kernel to save the time of > checksum calculation of IPoIB CM packets. Network sends the IP packet > without adding the IP Checksum to the header. On the receive side, > IPoIB driver again tells the network stack that IP Checksum is good > for the incoming packets and network stack avoids the IP Checksum > calculations. > > During connection establishment the driver determine if peer supports > IB CRC as checksum. This is done so driver will be able to calculate > checksum before transmiting the packet in case the peer does not > support this feature. > Two questions: - What will see tool such as wireshark/tcpdump when sniffing checksum -less IPv4 packets sent/received on IPoIB interface ? - What might happen if such checksum-less IPv4 packet is later routed to a different IPv4 network ? > With this enhancement throughput is increased by 60%. > > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Regards.
On Thu, Jul 30, 2015 at 03:58:13PM +0200, Yann Droneaud wrote: > Hi, > > Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit : > > This enhancement suggest the usage of IB CRC instead of CSUM in IPoIB > > CM. IPoIB CM uses RC (Reliable Connection) which guarantees the > > corruption free delivery of the packet. > > > > InfiniBand uses 32b CRC which provides stronger data integrity > > protection compare to 16b IP Checksum. > > InfiniBand 32b CRC <=> Ethernet 32b CRC, it's link layer, layer 2. > > IPv4 checksum is at another level, it's internet layer, layer 3. > > > So, there is no added value that IP/TCP Checksum provides in the IB > > world. > > > > Sure, IPv4 checksum is a thing of the past: checksum was dropped from > IP header in IPv6: it assumes the lower layer, such as Ethernet, > provides the required integrety check. > > I think not checking the IPv4 checksum should be a choice, carefully > thought, for inside a fabric, as I understand your proposal, packet > with invalid checksum will be allowed to go in/out of the fabric. Yes, this is why it is controlled by module parameter. Maybe a better choice would be to default it to 0. > > It sound like it's a departure from the behavior one can expect from an > IPv4 network stack. It should be considered as network-fine-tuning parameter so if admin knows his fabric he can use it. > > > The proposal is to tell network stack that IPoIB-CM supports IP > > Checksum offload. This enables the kernel to save the time of > > checksum calculation of IPoIB CM packets. Network sends the IP packet > > without adding the IP Checksum to the header. On the receive side, > > IPoIB driver again tells the network stack that IP Checksum is good > > for the incoming packets and network stack avoids the IP Checksum > > calculations. > > > > During connection establishment the driver determine if peer supports > > IB CRC as checksum. This is done so driver will be able to calculate > > checksum before transmiting the packet in case the peer does not > > support this feature. > > > > Two questions: Three :) > > - What will see tool such as wireshark/tcpdump when sniffing checksum Zero or what ever the networking layer puts in csum when H/W supports CSUM-offloading. Please note that with this patch driver still supports backward computability (per connection). This means that for connections with peer which does not support this functionality you expect to see this value filled with checksum. > -less IPv4 packets sent/received on IPoIB interface ? No > > - What might happen if such checksum-less IPv4 packet is later routed to a different IPv4 network ? As noted above, for network that is opened to outside world this feature should be blocked. In general i would say that if a layer 2 terminator device (e.x router) exist in the fabric - this feature can't be used and must be blocked. With this limitation it still worth use it because of the reason of increasing throughput > > > With this enhancement throughput is increased by 60%. > > > > > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> > > Regards. > > -- > Yann Droneaud > OPTEYA > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/30/2015 11:20 AM, Yuval Shaia wrote: > On Thu, Jul 30, 2015 at 03:58:13PM +0200, Yann Droneaud wrote: >> Hi, >> >> Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit : >>> This enhancement suggest the usage of IB CRC instead of CSUM in IPoIB >>> CM. IPoIB CM uses RC (Reliable Connection) which guarantees the >>> corruption free delivery of the packet. >>> >>> InfiniBand uses 32b CRC which provides stronger data integrity >>> protection compare to 16b IP Checksum. >> >> InfiniBand 32b CRC <=> Ethernet 32b CRC, it's link layer, layer 2. >> >> IPv4 checksum is at another level, it's internet layer, layer 3. >> >>> So, there is no added value that IP/TCP Checksum provides in the IB >>> world. >>> >> >> Sure, IPv4 checksum is a thing of the past: checksum was dropped from >> IP header in IPv6: it assumes the lower layer, such as Ethernet, >> provides the required integrety check. >> >> I think not checking the IPv4 checksum should be a choice, carefully >> thought, for inside a fabric, as I understand your proposal, packet >> with invalid checksum will be allowed to go in/out of the fabric. > Yes, this is why it is controlled by module parameter. > Maybe a better choice would be to default it to 0. In it's current form, yes, it should default to 0. >> >> It sound like it's a departure from the behavior one can expect from an >> IPv4 network stack. > It should be considered as network-fine-tuning parameter so if admin knows his fabric he can use it. >> >>> The proposal is to tell network stack that IPoIB-CM supports IP >>> Checksum offload. This enables the kernel to save the time of >>> checksum calculation of IPoIB CM packets. Network sends the IP packet >>> without adding the IP Checksum to the header. On the receive side, >>> IPoIB driver again tells the network stack that IP Checksum is good >>> for the incoming packets and network stack avoids the IP Checksum >>> calculations. >>> >>> During connection establishment the driver determine if peer supports >>> IB CRC as checksum. This is done so driver will be able to calculate >>> checksum before transmiting the packet in case the peer does not >>> support this feature. >>> >> >> Two questions: > Three :) No, he really only had 2, the second one was a line split of the word checksum-less done by his mailer ;-) >> >> - What will see tool such as wireshark/tcpdump when sniffing checksum > Zero or what ever the networking layer puts in csum when H/W supports CSUM-offloading. > Please note that with this patch driver still supports backward computability (per connection). > This means that for connections with peer which does not support this functionality you expect to see this value filled with checksum. >> -less IPv4 packets sent/received on IPoIB interface ? > No >> >> - What might happen if such checksum-less IPv4 packet is later routed to a different IPv4 network ? > As noted above, for network that is opened to outside world this feature should be blocked. > In general i would say that if a layer 2 terminator device (e.x router) exist in the fabric - this feature can't be used and must be blocked. > With this limitation it still worth use it because of the reason of increasing throughput In its current state, I have my doubts about this patch. However, it seems to me that this should be relatively easy to fix in such a way that you get 90%+ of the performance benefit, and can turn it on by default, and we don't cause any problems. Why not perform the checksum operation on a per connection basis? This is all IPoIB traffic anyway, so every send will have a src ip and dst ip. If the dst ip is link local to our src ip device, and the connected mode partner is capable of running without csum, then send that specific packet without doing a checksum. If the IP address is not link local, then do the checksum as normal. That way if our final destination is on the other side of a router, we aren't leaking un-checksummed packets. It means we would miss out on being able to do checksum-less transfers from host A on fabric 0 through host B as a router to host C on fabric 1, but I doubt that's a very common situation to be in. Or maybe a better way of putting this is if our next hop IP address != our dest IP address, then perform the checksum, otherwise if capable of checksum-less operation, do so. Can you rework the patch to operate in that manner?
On 07/30/2015 04:46 AM, Yuval Shaia wrote: > struct ipoib_cm_data { > __be32 qpn; /* High byte MUST be ignored on receive */ > __be32 mtu; > + __be16 sig; /* must be IPOIB_CM_PROTO_SIG */ > + __be16 caps; /* 4 bits proto ver and 12 bits capabilities */ > }; This patch modifies the private login data format that has been standardized by the IETF in RFC 4755. Has this modification already been discussed with the IETF ? See also https://tools.ietf.org/html/rfc4755#section-6. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 30, 2015 at 11:51:12AM -0400, Doug Ledford wrote: > In its current state, I have my doubts about this patch. However, it > seems to me that this should be relatively easy to fix in such a way > that you get 90%+ of the performance benefit, and can turn it on by > default, and we don't cause any problems. The best way to implement this is to leverage all the checksum offload work people did for virtualization. Forward the checksum offload status through the RC connection and recover it on the other side. Then the far side stack will know it is dealing with a partial checksum packet and will properly regenerate the checksum if it re-transmits. ie doing it this way doesn't totally break the netstack :) Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 30, 2015 at 09:38:54AM -0700, Bart Van Assche wrote: > On 07/30/2015 04:46 AM, Yuval Shaia wrote: > > struct ipoib_cm_data { > > __be32 qpn; /* High byte MUST be ignored on receive */ > > __be32 mtu; > >+ __be16 sig; /* must be IPOIB_CM_PROTO_SIG */ > >+ __be16 caps; /* 4 bits proto ver and 12 bits capabilities */ > > }; > > This patch modifies the private login data format that has been > standardized by the IETF in RFC 4755. Has this modification already > been discussed with the IETF ? > > See also https://tools.ietf.org/html/rfc4755#section-6. Yes. I first want to check how linux community react to this proposal. Please note that though the standard specify 64 bits of data, the actual data the driver reads/writes is can be up to 196 bytes. > > Bart. > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 30, 2015 at 11:51:12AM -0400, Doug Ledford wrote: > On 07/30/2015 11:20 AM, Yuval Shaia wrote: > > On Thu, Jul 30, 2015 at 03:58:13PM +0200, Yann Droneaud wrote: > >> Hi, > >> > >> Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit : > >>> This enhancement suggest the usage of IB CRC instead of CSUM in IPoIB > >>> CM. IPoIB CM uses RC (Reliable Connection) which guarantees the > >>> corruption free delivery of the packet. > >>> > >>> InfiniBand uses 32b CRC which provides stronger data integrity > >>> protection compare to 16b IP Checksum. > >> > >> InfiniBand 32b CRC <=> Ethernet 32b CRC, it's link layer, layer 2. > >> > >> IPv4 checksum is at another level, it's internet layer, layer 3. > >> > >>> So, there is no added value that IP/TCP Checksum provides in the IB > >>> world. > >>> > >> > >> Sure, IPv4 checksum is a thing of the past: checksum was dropped from > >> IP header in IPv6: it assumes the lower layer, such as Ethernet, > >> provides the required integrety check. > >> > >> I think not checking the IPv4 checksum should be a choice, carefully > >> thought, for inside a fabric, as I understand your proposal, packet > >> with invalid checksum will be allowed to go in/out of the fabric. > > Yes, this is why it is controlled by module parameter. > > Maybe a better choice would be to default it to 0. > > In it's current form, yes, it should default to 0. > > >> > >> It sound like it's a departure from the behavior one can expect from an > >> IPv4 network stack. > > It should be considered as network-fine-tuning parameter so if admin knows his fabric he can use it. > >> > >>> The proposal is to tell network stack that IPoIB-CM supports IP > >>> Checksum offload. This enables the kernel to save the time of > >>> checksum calculation of IPoIB CM packets. Network sends the IP packet > >>> without adding the IP Checksum to the header. On the receive side, > >>> IPoIB driver again tells the network stack that IP Checksum is good > >>> for the incoming packets and network stack avoids the IP Checksum > >>> calculations. > >>> > >>> During connection establishment the driver determine if peer supports > >>> IB CRC as checksum. This is done so driver will be able to calculate > >>> checksum before transmiting the packet in case the peer does not > >>> support this feature. > >>> > >> > >> Two questions: > > Three :) > > No, he really only had 2, the second one was a line split of the word > checksum-less done by his mailer ;-) > > >> > >> - What will see tool such as wireshark/tcpdump when sniffing checksum > > Zero or what ever the networking layer puts in csum when H/W supports CSUM-offloading. > > Please note that with this patch driver still supports backward computability (per connection). > > This means that for connections with peer which does not support this functionality you expect to see this value filled with checksum. > >> -less IPv4 packets sent/received on IPoIB interface ? > > No > >> > >> - What might happen if such checksum-less IPv4 packet is later routed to a different IPv4 network ? > > As noted above, for network that is opened to outside world this feature should be blocked. > > In general i would say that if a layer 2 terminator device (e.x router) exist in the fabric - this feature can't be used and must be blocked. > > With this limitation it still worth use it because of the reason of increasing throughput > > In its current state, I have my doubts about this patch. However, it > seems to me that this should be relatively easy to fix in such a way > that you get 90%+ of the performance benefit, and can turn it on by > default, and we don't cause any problems. Why not perform the checksum > operation on a per connection basis? This is all IPoIB traffic anyway, This part is already implemented. Actually this is the main purpose of adding 'caps' field to ipoib_cm_tx. The peer capabilities (currently only one option but design let us add up to 12 capabilities in the future) is passed in IPoIB's private data and saved in ipoib_cm_tx.caps per connection basis. Then, on ipoib_cm_send, the decision is made based on that (and on some other conditions) and if needed - the driver calculate the checksum just before sending. > so every send will have a src ip and dst ip. If the dst ip is link > local to our src ip device, and the connected mode partner is capable of > running without csum, then send that specific packet without doing a > checksum. If the IP address is not link local, then do the checksum as > normal. That way if our final destination is on the other side of a > router, we aren't leaking un-checksummed packets. It means we would > miss out on being able to do checksum-less transfers from host A on > fabric 0 through host B as a router to host C on fabric 1, but I doubt > that's a very common situation to be in. Or maybe a better way of > putting this is if our next hop IP address != our dest IP address, then > perform the checksum, otherwise if capable of checksum-less operation, > do so. Can you rework the patch to operate in that manner? I think that the concern with 'router' is that when packet goes into it and then goes out from it - we cannot trust end-to-end IB-CRC as this is layer 2 CRC. So, if i understand you correctly, you suggest to tread every host beyond a router as one that does not support this "fake" and to calculate csum for it? This make sense to me but does it cover all such cases (where we can't trust end-to-end IB-CRC)? If yes then sure, it is easy to implement. This way we can default it to 1 and get rid of this module param. > > > -- > Doug Ledford <dledford@redhat.com> > GPG KeyID: 0E572FDD > > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 30, 2015 at 11:15:38AM -0600, Jason Gunthorpe wrote: > On Thu, Jul 30, 2015 at 11:51:12AM -0400, Doug Ledford wrote: > > > In its current state, I have my doubts about this patch. However, it > > seems to me that this should be relatively easy to fix in such a way > > that you get 90%+ of the performance benefit, and can turn it on by > > default, and we don't cause any problems. > > The best way to implement this is to leverage all the checksum > offload work people did for virtualization. > > Forward the checksum offload status through the RC connection and > recover it on the other side. The current approach is to utilize IPoIB's private-data to exchange this information. > > Then the far side stack will know it is dealing with a partial > checksum packet and will properly regenerate the checksum if it > re-transmits. > > ie doing it this way doesn't totally break the netstack :) > > Jason > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 30, 2015 at 11:46:36PM +0300, Yuval Shaia wrote: > On Thu, Jul 30, 2015 at 11:15:38AM -0600, Jason Gunthorpe wrote: > > On Thu, Jul 30, 2015 at 11:51:12AM -0400, Doug Ledford wrote: > > > > > In its current state, I have my doubts about this patch. However, it > > > seems to me that this should be relatively easy to fix in such a way > > > that you get 90%+ of the performance benefit, and can turn it on by > > > default, and we don't cause any problems. > > > > The best way to implement this is to leverage all the checksum > > offload work people did for virtualization. > > > > Forward the checksum offload status through the RC connection and > > recover it on the other side. > The current approach is to utilize IPoIB's private-data to exchange this > information. You need private-data exchange to negotiate the feature. The feature should be a per-packet csum status header. When sending a skb that is already fully csumed the receiver sets CHECKSUM_UNNECESSARY. When sending a skb that has CHECKSUM_PARTIAL then the receiver needs to call skb_partial_csum_set. Look at how something like VIRTIO_NET_HDR_F_NEEDS_CSUM works and copy that scheme. DO NOT EVER set CHECKSUM_UNNECESSARY on packets that do not have valid csums - that breaks the net stack. Yes, you need to add a header to all packets to support this scheme, that is what the private-data negotiation is for. While you are at it, I'd make room for something like VIRTIO_NET_HDR_GSO_* in the RC protocol too. Implementing GSO forwarding is probably another big performance win. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/30/15 13:09, Yuval Shaia wrote: > On Thu, Jul 30, 2015 at 09:38:54AM -0700, Bart Van Assche wrote: >> On 07/30/2015 04:46 AM, Yuval Shaia wrote: >>> struct ipoib_cm_data { >>> __be32 qpn; /* High byte MUST be ignored on receive */ >>> __be32 mtu; >>> + __be16 sig; /* must be IPOIB_CM_PROTO_SIG */ >>> + __be16 caps; /* 4 bits proto ver and 12 bits capabilities */ >>> }; >> >> This patch modifies the private login data format that has been >> standardized by the IETF in RFC 4755. Has this modification already >> been discussed with the IETF ? >> > Yes. Hello Yuval, It should have been mentioned in the patch description that this patch modifies the wire format, how it modifies the wire format, and also which feedback (if any) has been received so far from the IETF. Thanks, Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/30/2015 10:03 PM, Bart Van Assche wrote: > On 07/30/15 13:09, Yuval Shaia wrote: >> On Thu, Jul 30, 2015 at 09:38:54AM -0700, Bart Van Assche wrote: >>> On 07/30/2015 04:46 AM, Yuval Shaia wrote: >>>> struct ipoib_cm_data { >>>> __be32 qpn; /* High byte MUST be ignored on receive */ >>>> __be32 mtu; >>>> + __be16 sig; /* must be IPOIB_CM_PROTO_SIG */ >>>> + __be16 caps; /* 4 bits proto ver and 12 bits capabilities */ >>>> }; >>> >>> This patch modifies the private login data format that has been >>> standardized by the IETF in RFC 4755. Has this modification already >>> been discussed with the IETF ? >>> >> Yes. > > Hello Yuval, > > It should have been mentioned in the patch description that this patch > modifies the wire format, how it modifies the wire format, and also > which feedback (if any) has been received so far from the IETF. There hasn't been any follow up on this patch in the last month. Yuval, is there still progress being made?
On Thu, Sep 03, 2015 at 04:29:45PM -0400, Doug Ledford wrote: > On 07/30/2015 10:03 PM, Bart Van Assche wrote: > > On 07/30/15 13:09, Yuval Shaia wrote: > >> On Thu, Jul 30, 2015 at 09:38:54AM -0700, Bart Van Assche wrote: > >>> On 07/30/2015 04:46 AM, Yuval Shaia wrote: > >>>> struct ipoib_cm_data { > >>>> __be32 qpn; /* High byte MUST be ignored on receive */ > >>>> __be32 mtu; > >>>> + __be16 sig; /* must be IPOIB_CM_PROTO_SIG */ > >>>> + __be16 caps; /* 4 bits proto ver and 12 bits capabilities */ > >>>> }; > >>> > >>> This patch modifies the private login data format that has been > >>> standardized by the IETF in RFC 4755. Has this modification already > >>> been discussed with the IETF ? > >>> > >> Yes. > > > > Hello Yuval, > > > > It should have been mentioned in the patch description that this patch > > modifies the wire format, how it modifies the wire format, and also > > which feedback (if any) has been received so far from the IETF. > > There hasn't been any follow up on this patch in the last month. Yuval, > is there still progress being made? Unfortunately no progress yet, too many P1-s. I will resume work soon. > > > -- > Doug Ledford <dledford@redhat.com> > GPG KeyID: 0E572FDD > > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 30, 2015 at 07:03:22PM -0700, Bart Van Assche wrote: > On 07/30/15 13:09, Yuval Shaia wrote: > >On Thu, Jul 30, 2015 at 09:38:54AM -0700, Bart Van Assche wrote: > >>On 07/30/2015 04:46 AM, Yuval Shaia wrote: > >>> struct ipoib_cm_data { > >>> __be32 qpn; /* High byte MUST be ignored on receive */ > >>> __be32 mtu; > >>>+ __be16 sig; /* must be IPOIB_CM_PROTO_SIG */ > >>>+ __be16 caps; /* 4 bits proto ver and 12 bits capabilities */ > >>> }; > >> > >>This patch modifies the private login data format that has been > >>standardized by the IETF in RFC 4755. Has this modification already > >>been discussed with the IETF ? > >> > >Yes. > > Hello Yuval, > > It should have been mentioned in the patch description that this > patch modifies the wire format, how it modifies the wire format, and Thanks, Will add something like this: "To support this, the Private-Data is extended (by 32 bits) with two new attributes - protocol version and capabilities." > also which feedback (if any) has been received so far from the IETF. > > Thanks, > > Bart. > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 30, 2015 at 03:58:13PM +0200, Yann Droneaud wrote: > Hi, > > Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit : > > This enhancement suggest the usage of IB CRC instead of CSUM in IPoIB > > CM. IPoIB CM uses RC (Reliable Connection) which guarantees the > > corruption free delivery of the packet. > > > > InfiniBand uses 32b CRC which provides stronger data integrity > > protection compare to 16b IP Checksum. > > InfiniBand 32b CRC <=> Ethernet 32b CRC, it's link layer, layer 2. > > IPv4 checksum is at another level, it's internet layer, layer 3. > > > So, there is no added value that IP/TCP Checksum provides in the IB > > world. > > > > Sure, IPv4 checksum is a thing of the past: checksum was dropped from > IP header in IPv6: it assumes the lower layer, such as Ethernet, > provides the required integrety check. Right, will change description to something like this: InfiniBand 32b CRC offers strong data integrity protection compare CSUM. > > I think not checking the IPv4 checksum should be a choice, carefully > thought, for inside a fabric, as I understand your proposal, packet > with invalid checksum will be allowed to go in/out of the fabric. If peer supports this feature do you see any reason why not? > > It sound like it's a departure from the behavior one can expect from an > IPv4 network stack. > > > The proposal is to tell network stack that IPoIB-CM supports IP > > Checksum offload. This enables the kernel to save the time of > > checksum calculation of IPoIB CM packets. Network sends the IP packet > > without adding the IP Checksum to the header. On the receive side, > > IPoIB driver again tells the network stack that IP Checksum is good > > for the incoming packets and network stack avoids the IP Checksum > > calculations. > > > > During connection establishment the driver determine if peer supports > > IB CRC as checksum. This is done so driver will be able to calculate > > checksum before transmiting the packet in case the peer does not > > support this feature. > > > > Two questions: > > - What will see tool such as wireshark/tcpdump when sniffing checksum > -less IPv4 packets sent/received on IPoIB interface ? Never checked it but i assume 0 or what ever the kernel put there when driver reports NETIF_F_IP_CSUM. I can check and zero if needed but any reason to believe it is needed? > > - What might happen if such checksum-less IPv4 packet is later routed to a different IPv4 network ? Same as my question above, if peer supports this feature do you see anything wrong? > > > With this enhancement throughput is increased by 60%. > > > > > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> > > Regards. > > -- > Yann Droneaud > OPTEYA > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 30, 2015 at 03:00:18PM -0600, Jason Gunthorpe wrote: > On Thu, Jul 30, 2015 at 11:46:36PM +0300, Yuval Shaia wrote: > > On Thu, Jul 30, 2015 at 11:15:38AM -0600, Jason Gunthorpe wrote: > > > On Thu, Jul 30, 2015 at 11:51:12AM -0400, Doug Ledford wrote: > > > > > > > In its current state, I have my doubts about this patch. However, it > > > > seems to me that this should be relatively easy to fix in such a way > > > > that you get 90%+ of the performance benefit, and can turn it on by > > > > default, and we don't cause any problems. > > > > > > The best way to implement this is to leverage all the checksum > > > offload work people did for virtualization. > > > > > > Forward the checksum offload status through the RC connection and > > > recover it on the other side. > > The current approach is to utilize IPoIB's private-data to exchange this > > information. > > You need private-data exchange to negotiate the feature. > > The feature should be a per-packet csum status header. > > When sending a skb that is already fully csumed the receiver sets > CHECKSUM_UNNECESSARY. > > When sending a skb that has CHECKSUM_PARTIAL then the > receiver needs to call skb_partial_csum_set. > > Look at how something like VIRTIO_NET_HDR_F_NEEDS_CSUM works and copy > that scheme. Correct me if i'm wrong here but isn't this protocol assume both parties are aware of this special header? My case is a bit different, driver must support backward computability in a way that peer maybe a driver that do not support this feature and expect packet to be full checksummed. > > DO NOT EVER set CHECKSUM_UNNECESSARY on packets that do not have valid > csums - that breaks the net stack. The entire idea here is to fake csum offload so how would i tell the stack not to run csum on incoming packet? > > Yes, you need to add a header to all packets to support this scheme, > that is what the private-data negotiation is for. > > While you are at it, I'd make room for something like > VIRTIO_NET_HDR_GSO_* in the RC protocol too. Implementing GSO > forwarding is probably another big performance win. If i understood you correctly and you mean exchange of "driver-capabilities", then yes, it is there with the extends of ipoib_cm_data structure. > > Jason > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 18, 2015 at 10:27:41PM +0200, Yuval Shaia wrote: > > You need private-data exchange to negotiate the feature. > > > > The feature should be a per-packet csum status header. > > > > When sending a skb that is already fully csumed the receiver sets > > CHECKSUM_UNNECESSARY. > > > > When sending a skb that has CHECKSUM_PARTIAL then the > > receiver needs to call skb_partial_csum_set. > > > > Look at how something like VIRTIO_NET_HDR_F_NEEDS_CSUM works and copy > > that scheme. > Correct me if i'm wrong here but isn't this protocol assume both parties > are aware of this special header? Yes. No matter what you do both sides must be aware of the change in protocol, doing it correctly requires adding a small wire header to flow through the checksum state. This would be enabled once negotiation confirms both sides will support this. > My case is a bit different, driver must support backward > computability in a way that peer maybe a driver that do not support > this feature and expect packet to be full checksummed. This is why negotiation is mandatory. > > DO NOT EVER set CHECKSUM_UNNECESSARY on packets that do not have valid > > csums - that breaks the net stack. > The entire idea here is to fake csum offload so how would i tell the stack > not to run csum on incoming packet? I already explained this, call skb_partial_csum_set and the stack will avoid csum work. > > Yes, you need to add a header to all packets to support this scheme, > > that is what the private-data negotiation is for. > > > > While you are at it, I'd make room for something like > > VIRTIO_NET_HDR_GSO_* in the RC protocol too. Implementing GSO > > forwarding is probably another big performance win. > If i understood you correctly and you mean exchange of > "driver-capabilities", then yes, it is there with the extends of > ipoib_cm_data structure. You need to dig into how the things I referenced above work, fully understand them and then adapt them to IPoIB. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, Le mercredi 18 novembre 2015 à 12:46 +0200, Yuval Shaia a écrit : > On Thu, Jul 30, 2015 at 03:58:13PM +0200, Yann Droneaud wrote: > > Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit : > > > This enhancement suggest the usage of IB CRC instead of CSUM in > > > IPoIB CM. IPoIB CM uses RC (Reliable Connection) which guarantees > > > the corruption free delivery of the packet. > > > > > > InfiniBand uses 32b CRC which provides stronger data integrity > > > protection compare to 16b IP Checksum. > > > > InfiniBand 32b CRC <=> Ethernet 32b CRC, it's link layer, layer 2. > > > > IPv4 checksum is at another level, it's internet layer, layer 3. > > > > > So, there is no added value that IP/TCP Checksum provides in the > > > IB world. > > > > > > > Sure, IPv4 checksum is a thing of the past: checksum was dropped > > from IP header in IPv6: it assumes the lower layer, such as > > Ethernet, provides the required integrety check. > Right, will change description to something like this: > InfiniBand 32b CRC offers strong data integrity protection compare > CSUM. This patch must clearly state that, without IPv4 header checksum, it's no more IP protocol as described by RFC 791 > > > > I think not checking the IPv4 checksum should be a choice, > > carefully thought, for inside a fabric, as I understand your > > proposal, packet with invalid checksum will be allowed to go in/out > > of the fabric. > If peer supports this feature do you see any reason why not? You're free to run you own protocol inside your private network, but this protocol is not compatible with IPv4. > > > > It sound like it's a departure from the behavior one can expect > > from an IPv4 network stack. > > > > > The proposal is to tell network stack that IPoIB-CM supports IP > > > Checksum offload. This enables the kernel to save the time of > > > checksum calculation of IPoIB CM packets. Network sends the IP > > > packet without adding the IP Checksum to the header. On the > > > receive side, IPoIB driver again tells the network stack that IP > > > Checksum is good for the incoming packets and network stack > > > avoids the IP Checksum calculations. > > > > > > During connection establishment the driver determine if peer > > > supports IB CRC as checksum. This is done so driver will be able > > > to calculate checksum before transmiting the packet in case the > > > peer does not support this feature. > > > > > > > Two questions: > > > > - What will see tool such as wireshark/tcpdump when sniffing > > checksum-less IPv4 packets sent/received on IPoIB interface ? > Never checked it but i assume 0 or what ever the kernel put there > when driver reports NETIF_F_IP_CSUM. I can check and zero if needed > but any reason to believe it is needed? I don't think it's needed if you agree that this modification creates a new protocol which is no more IPv4. > > > > - What might happen if such checksum-less IPv4 packet is later > > routed to a different IPv4 network ? > Same as my question above, if peer supports this feature do you see > anything wrong? If peer is going to forward this packet to a different network, which is not IPoIB based, the checksum will be checked and the packet will be rejected. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 24, 2015 at 10:45:14AM +0100, Yann Droneaud wrote: > > Same as my question above, if peer supports this feature do you see > > anything wrong? > > If peer is going to forward this packet to a different network, which > is not IPoIB based, the checksum will be checked and the packet will be > rejected. Exactly, this is why this approach has never been acceptable for mainline. Arrange things so ipoib can use skb_partial_csum_set. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 79859c4..67e6f05 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -92,6 +92,7 @@ enum { IPOIB_FLAG_UMCAST = 10, IPOIB_STOP_NEIGH_GC = 11, IPOIB_NEIGH_TBL_FLUSH = 12, + IPOIB_FLAG_CSUM = 15, IPOIB_MAX_BACKOFF_SECONDS = 16, @@ -183,9 +184,20 @@ struct ipoib_cm_tx_buf { struct ib_cm_id; +/* Signature so driver can make sure ipoib_cm_data.caps is valid */ +#define IPOIB_CM_PROTO_SIG 0x2211 +/* Current driver ipoib_cm_data version */ +#define IPOIB_CM_PROTO_VER (1UL << 12) + +enum ipoib_cm_data_caps { + IPOIB_CM_CAPS_IBCRC_AS_CSUM = 1UL << 0, +}; + struct ipoib_cm_data { __be32 qpn; /* High byte MUST be ignored on receive */ __be32 mtu; + __be16 sig; /* must be IPOIB_CM_PROTO_SIG */ + __be16 caps; /* 4 bits proto ver and 12 bits capabilities */ }; /* @@ -230,6 +242,7 @@ struct ipoib_cm_rx { unsigned long jiffies; enum ipoib_cm_state state; int recv_count; + u16 caps; }; struct ipoib_cm_tx { @@ -244,6 +257,7 @@ struct ipoib_cm_tx { unsigned tx_tail; unsigned long flags; u32 mtu; + u16 caps; }; struct ipoib_cm_rx_buf { @@ -452,8 +466,20 @@ void ipoib_del_neighs_by_gid(struct net_device *dev, u8 *gid); extern struct workqueue_struct *ipoib_workqueue; +extern int cm_ibcrc_as_csum; + /* functions */ +static inline int ipoib_cm_check_proto_sig(u16 proto_sig) +{ + return (proto_sig == IPOIB_CM_PROTO_SIG); +} + +static inline int ipoib_cm_check_proto_ver(u16 caps) +{ + return ((caps & 0xF000) == IPOIB_CM_PROTO_VER); +} + int ipoib_poll(struct napi_struct *napi, int budget); void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); void ipoib_send_comp_handler(struct ib_cq *cq, void *dev_ptr); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index ee39be6..5578c69 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -423,9 +423,16 @@ static int ipoib_cm_send_rep(struct net_device *dev, struct ib_cm_id *cm_id, struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_cm_data data = {}; struct ib_cm_rep_param rep = {}; + u16 caps = 0; + + caps |= IPOIB_CM_PROTO_VER; + if (cm_ibcrc_as_csum && test_bit(IPOIB_FLAG_CSUM, &priv->flags)) + caps |= IPOIB_CM_CAPS_IBCRC_AS_CSUM; data.qpn = cpu_to_be32(priv->qp->qp_num); data.mtu = cpu_to_be32(IPOIB_CM_BUF_SIZE); + data.sig = cpu_to_be16(IPOIB_CM_PROTO_SIG); + data.caps = cpu_to_be16(caps); rep.private_data = &data; rep.private_data_len = sizeof data; @@ -444,6 +451,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even struct ipoib_cm_rx *p; unsigned psn; int ret; + struct ipoib_cm_data *cm_data; ipoib_dbg(priv, "REQ arrived\n"); p = kzalloc(sizeof *p, GFP_KERNEL); @@ -462,6 +470,13 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even goto err_qp; } + cm_data = (struct ipoib_cm_data *)event->private_data; + ipoib_dbg(priv, "Otherend sig=0x%x\n", be16_to_cpu(cm_data->sig)); + if (ipoib_cm_check_proto_sig(be16_to_cpu(cm_data->sig)) && + ipoib_cm_check_proto_ver(be16_to_cpu(cm_data->caps))) + p->caps = be16_to_cpu(cm_data->caps); + ipoib_dbg(priv, "Otherend caps=0x%x\n", p->caps); + psn = prandom_u32() & 0xffffff; ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); if (ret) @@ -672,6 +687,10 @@ copied: skb->dev = dev; /* XXX get correct PACKET_ type here */ skb->pkt_type = PACKET_HOST; + + if (cm_ibcrc_as_csum && test_bit(IPOIB_FLAG_CSUM, &priv->flags)) + skb->ip_summed = CHECKSUM_UNNECESSARY; + netif_receive_skb(skb); repost: @@ -733,6 +752,18 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ tx_req = &tx->tx_ring[tx->tx_head & (ipoib_sendq_size - 1)]; tx_req->skb = skb; + /* Calculate checksum if we support ibcrc_as_csum but peer is not */ + if ((skb->ip_summed == CHECKSUM_PARTIAL) && cm_ibcrc_as_csum && + test_bit(IPOIB_FLAG_CSUM, &priv->flags) && + !(tx->caps & IPOIB_CM_CAPS_IBCRC_AS_CSUM)) { + if (skb_checksum_help(skb)) { + ipoib_warn(priv, "Fail to csum skb\n"); + ++dev->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + } + if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) { ++dev->stats.tx_errors; dev_kfree_skb_any(skb); @@ -954,6 +985,7 @@ static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even struct ib_qp_attr qp_attr; int qp_attr_mask, ret; struct sk_buff *skb; + struct ipoib_cm_data *cm_data; p->mtu = be32_to_cpu(data->mtu); @@ -963,6 +995,13 @@ static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even return -EINVAL; } + cm_data = (struct ipoib_cm_data *)event->private_data; + ipoib_dbg(priv, "Otherend sig=0x%x\n", be16_to_cpu(cm_data->sig)); + if (ipoib_cm_check_proto_sig(be16_to_cpu(cm_data->sig)) && + ipoib_cm_check_proto_ver(be16_to_cpu(cm_data->caps))) + p->caps = be16_to_cpu(cm_data->caps); + ipoib_dbg(priv, "Otherend caps=0x%x\n", p->caps); + qp_attr.qp_state = IB_QPS_RTR; ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask); if (ret) { @@ -1051,9 +1090,16 @@ static int ipoib_cm_send_req(struct net_device *dev, struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_cm_data data = {}; struct ib_cm_req_param req = {}; + u16 caps = 0; + + caps |= IPOIB_CM_PROTO_VER; + if (cm_ibcrc_as_csum && test_bit(IPOIB_FLAG_CSUM, &priv->flags)) + caps |= IPOIB_CM_CAPS_IBCRC_AS_CSUM; data.qpn = cpu_to_be32(priv->qp->qp_num); data.mtu = cpu_to_be32(IPOIB_CM_BUF_SIZE); + data.sig = cpu_to_be16(IPOIB_CM_PROTO_SIG); + data.caps = cpu_to_be16(caps); req.primary_path = pathrec; req.alternate_path = NULL; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index b2943c8..a1940e2 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -73,6 +73,11 @@ module_param_named(debug_level, ipoib_debug_level, int, 0644); MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); #endif +int cm_ibcrc_as_csum = 1; +module_param_named(cm_ibcrc_as_csum, cm_ibcrc_as_csum, int, 0444); +MODULE_PARM_DESC(cm_ibcrc_as_csum, + "Indicates whether to utilize IB-CRC as CSUM in connected mode,(default: 1)"); + struct ipoib_path_iter { struct net_device *dev; struct ipoib_path path; @@ -189,8 +194,12 @@ static netdev_features_t ipoib_fix_features(struct net_device *dev, netdev_featu { struct ipoib_dev_priv *priv = netdev_priv(dev); - if (test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags)) - features &= ~(NETIF_F_IP_CSUM | NETIF_F_TSO); + if (test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags)) { + features &= ~NETIF_F_TSO; + if (!(cm_ibcrc_as_csum && (test_bit(IPOIB_FLAG_CSUM, + &priv->flags)))) + features &= ~(NETIF_F_IP_CSUM | NETIF_F_TSO); + } return features; } @@ -234,7 +243,11 @@ int ipoib_set_mode(struct net_device *dev, const char *buf) netdev_update_features(dev); dev_set_mtu(dev, ipoib_cm_max_mtu(dev)); rtnl_unlock(); - priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM; + if (cm_ibcrc_as_csum && (test_bit(IPOIB_FLAG_CSUM, + &priv->flags))) + priv->tx_wr.send_flags |= IB_SEND_IP_CSUM; + else + priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM; ipoib_flush_paths(dev); rtnl_lock(); @@ -1552,6 +1565,7 @@ int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca) kfree(device_attr); if (priv->hca_caps & IB_DEVICE_UD_IP_CSUM) { + set_bit(IPOIB_FLAG_CSUM, &priv->flags); priv->dev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_RXCSUM;
This enhancement suggest the usage of IB CRC instead of CSUM in IPoIB CM. IPoIB CM uses RC (Reliable Connection) which guarantees the corruption free delivery of the packet. InfiniBand uses 32b CRC which provides stronger data integrity protection compare to 16b IP Checksum. So, there is no added value that IP/TCP Checksum provides in the IB world. The proposal is to tell network stack that IPoIB-CM supports IP Checksum offload. This enables the kernel to save the time of checksum calculation of IPoIB CM packets. Network sends the IP packet without adding the IP Checksum to the header. On the receive side, IPoIB driver again tells the network stack that IP Checksum is good for the incoming packets and network stack avoids the IP Checksum calculations. During connection establishment the driver determine if peer supports IB CRC as checksum. This is done so driver will be able to calculate checksum before transmiting the packet in case the peer does not support this feature. With this enhancement throughput is increased by 60%. Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> --- drivers/infiniband/ulp/ipoib/ipoib.h | 26 ++++++++++++++++ drivers/infiniband/ulp/ipoib/ipoib_cm.c | 46 +++++++++++++++++++++++++++++ drivers/infiniband/ulp/ipoib/ipoib_main.c | 20 ++++++++++-- 3 files changed, 89 insertions(+), 3 deletions(-)