diff mbox

[libibverbs] Add support for TX/RX checksum offload

Message ID 1439826618-3015-1-git-send-email-bodong@mellanox.com (mailing list archive)
State Changes Requested
Headers show

Commit Message

Bodong Wang Aug. 17, 2015, 3:50 p.m. UTC
Add a device capability flag IBV_DEVICE_IP_CSUM to denote IPv4 checksum
offload support. Devices should set this flag if they support
insertion/verification of IPv4, TCP and UDP checksums on
outgoing/incoming IPv4 packets sent over IB UD or ETH RAW PACKET QPs.

Flags IBV_SEND_IP_CSUM and IBV_WC_IP_CSUM_OK are added for utilizing this
capability for send and receive separately.

Change-Id: Ie02d708dcbef07ca0d2eac1b156f12aafdba6a97
Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Bodong Wang <bodong@mellanox.com>
---
 include/infiniband/verbs.h | 11 +++++++++--
 man/ibv_poll_cq.3          |  3 +++
 man/ibv_post_send.3        |  4 ++++
 3 files changed, 16 insertions(+), 2 deletions(-)

Comments

Doug Ledford Sept. 4, 2015, 11:41 p.m. UTC | #1
On 08/17/2015 11:50 AM, Bodong Wang wrote:
> Add a device capability flag IBV_DEVICE_IP_CSUM to denote IPv4 checksum
> offload support. Devices should set this flag if they support
> insertion/verification of IPv4, TCP and UDP checksums on
> outgoing/incoming IPv4 packets sent over IB UD or ETH RAW PACKET QPs.

Correct me if I'm wrong, but the only reason this is only supported on
UD and RAW ETH QPs is a matter of current firmware.  There's no reason
it couldn't be supported on RC, right?

I know I'm probably not going to get an answer to that before I'm done
doing what I'm going to do, so I'm going to write a patch on the basis
that the above is true.  Because if it is, then I *really* don't like
putting the QP types supported into the API.  Instead, I think this
calls for a query_qp_ex verb to be added, and a new struct item,
qp_caps, and in that we can signal if this specific QP supports IP
checksum offload.

> 
> Flags IBV_SEND_IP_CSUM and IBV_WC_IP_CSUM_OK are added for utilizing this
> capability for send and receive separately.
> 
> Change-Id: Ie02d708dcbef07ca0d2eac1b156f12aafdba6a97
> Signed-off-by: Moshe Lazer <moshel@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> Signed-off-by: Bodong Wang <bodong@mellanox.com>
> ---
>  include/infiniband/verbs.h | 11 +++++++++--
>  man/ibv_poll_cq.3          |  3 +++
>  man/ibv_post_send.3        |  4 ++++
>  3 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
> index 28e1586..6ae7e6e 100644
> --- a/include/infiniband/verbs.h
> +++ b/include/infiniband/verbs.h
> @@ -115,6 +115,7 @@ enum ibv_device_cap_flags {
>  	IBV_DEVICE_RC_RNR_NAK_GEN	= 1 << 12,
>  	IBV_DEVICE_SRQ_RESIZE		= 1 << 13,
>  	IBV_DEVICE_N_NOTIFY_CQ		= 1 << 14,
> +	IBV_DEVICE_IP_CSUM		= 1 << 18,
>  	IBV_DEVICE_XRC			= 1 << 20,
>  	IBV_DEVICE_MANAGED_FLOW_STEERING = 1 << 29
>  };
> @@ -314,9 +315,14 @@ enum ibv_wc_opcode {
>  	IBV_WC_RECV_RDMA_WITH_IMM
>  };
>  
> +enum {
> +	IBV_WC_IP_CSUM_OK_SHIFT	= 2
> +};
> +
>  enum ibv_wc_flags {
>  	IBV_WC_GRH		= 1 << 0,
> -	IBV_WC_WITH_IMM		= 1 << 1
> +	IBV_WC_WITH_IMM		= 1 << 1,
> +	IBV_WC_IP_CSUM_OK	= 1 << IBV_WC_IP_CSUM_OK_SHIFT
>  };
>  
>  struct ibv_wc {
> @@ -653,7 +659,8 @@ enum ibv_send_flags {
>  	IBV_SEND_FENCE		= 1 << 0,
>  	IBV_SEND_SIGNALED	= 1 << 1,
>  	IBV_SEND_SOLICITED	= 1 << 2,
> -	IBV_SEND_INLINE		= 1 << 3
> +	IBV_SEND_INLINE		= 1 << 3,
> +	IBV_SEND_IP_CSUM	= 1 << 4
>  };
>  
>  struct ibv_sge {
> diff --git a/man/ibv_poll_cq.3 b/man/ibv_poll_cq.3
> index 57c6daa..539940d 100644
> --- a/man/ibv_poll_cq.3
> +++ b/man/ibv_poll_cq.3
> @@ -50,6 +50,9 @@ It is either 0 or the bitwise OR of one or more of the following flags:
>  .B IBV_WC_GRH \fR      GRH is present (valid only for UD QPs)
>  .TP
>  .B IBV_WC_WITH_IMM \fR Immediate data value is valid
> +.TP
> +.B IBV_WC_IP_CSUM_OK \fR TCP/UDP checksum over IPv4 and IPv4 header checksum are verified.
> +This feature is supported only when \fBIBV_DEVICE_IP_CSUM\fR flag is set in the device capability flags.
>  .PP
>  Not all
>  .I wc
> diff --git a/man/ibv_post_send.3 b/man/ibv_post_send.3
> index 33fbb50..3b07bcb 100644
> --- a/man/ibv_post_send.3
> +++ b/man/ibv_post_send.3
> @@ -98,6 +98,10 @@ The attribute send_flags describes the properties of the \s-1WR\s0. It is either
>  .TP
>  .B IBV_SEND_INLINE \fR Send data in given gather list as inline data
>  in a send WQE.  Valid only for Send and RDMA Write.  The L_Key will not be checked.
> ++.TP
> ++.B IBV_SEND_IP_CSUM \fR Offload the IPv4 and TCP/UDP checksum calculation.
> ++Valid only for QPs with Transport Service Type \fBIBV_QPT_UD\fR or \fBIBV_QPT_RAW_PACKET\fR.
> ++This feature supported only when \fBIBV_DEVICE_IP_CSUM\fR the flag is set in the device capability flags.
>  .SH "RETURN VALUE"
>  .B ibv_post_send()
>  returns 0 on success, or the value of errno on failure (which indicates the failure reason).
>
Or Gerlitz Sept. 5, 2015, 7:59 p.m. UTC | #2
On Sat, Sep 5, 2015 at 2:41 AM, Doug Ledford <dledford@redhat.com> wrote:
> On 08/17/2015 11:50 AM, Bodong Wang wrote:
>> Add a device capability flag IBV_DEVICE_IP_CSUM to denote IPv4 checksum
>> offload support. Devices should set this flag if they support
>> insertion/verification of IPv4, TCP and UDP checksums on
>> outgoing/incoming IPv4 packets sent over IB UD or ETH RAW PACKET QPs.
>
> Correct me if I'm wrong, but the only reason this is only supported on
> UD and RAW ETH QPs is a matter of current firmware.  There's no reason
> it couldn't be supported on RC, right?

Doug,

The context here is the ability of the user-space RDMA verbs
infrastructure to serve as the baseline for implementing user-space
TCP/IP offloads engines. Such engines would be production worthy in
open-systems environments mostly when they are inter-operable which
whatever stack runs on the other end --> they must not put any
additional bits on the wire --> RC isn't an option, so for IPoIB we
just need an IPoIB UD QP in user space, and for Ethernet RAW PACKET
QP. This device capability is there  ~10y for IB UD and we just
naturally extend it to Eth RAW. If/when a vendor comes up with
supporting csum for RC, we can add another dev cap and say the well
established API applies on them too, with just a slight modification
to man pages and such, makes sense?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford Sept. 9, 2015, 3:20 p.m. UTC | #3
On 09/05/2015 03:59 PM, Or Gerlitz wrote:
> On Sat, Sep 5, 2015 at 2:41 AM, Doug Ledford <dledford@redhat.com> wrote:
>> On 08/17/2015 11:50 AM, Bodong Wang wrote:
>>> Add a device capability flag IBV_DEVICE_IP_CSUM to denote IPv4 checksum
>>> offload support. Devices should set this flag if they support
>>> insertion/verification of IPv4, TCP and UDP checksums on
>>> outgoing/incoming IPv4 packets sent over IB UD or ETH RAW PACKET QPs.
>>
>> Correct me if I'm wrong, but the only reason this is only supported on
>> UD and RAW ETH QPs is a matter of current firmware.  There's no reason
>> it couldn't be supported on RC, right?
> 
> Doug,
> 
> The context here is the ability of the user-space RDMA verbs
> infrastructure to serve as the baseline for implementing user-space
> TCP/IP offloads engines.

Fine.

> Such engines would be production worthy in
> open-systems environments mostly when they are inter-operable which
> whatever stack runs on the other end

OK.

> --> they must not put any
> additional bits on the wire

Maybe.  For base level interop, sure, but for enhanced service in a
homogeneous environment, not necessarily true.

> --> RC isn't an option

Not following here Or.  The IPoIB connected mode packet is 4 byte IPoIB
Header followed by whatever header the TCP stack put on the packet
(likely either IPv4 or IPv6).  The same ipoib_hard_header function is
used for both CM and UD connections, the same ipoib_start_xmit() is used
for both CM and UD connections, and we just hand off to the appropriate
send routine when we have our neighbor lookup complete.  Why would you
need new bits on the wire to do a checksum on an RC send and not for a
UD send?

>, so for IPoIB we
> just need an IPoIB UD QP in user space, and for Ethernet RAW PACKET
> QP. This device capability is there  ~10y for IB UD and we just
> naturally extend it to Eth RAW. If/when a vendor comes up with
> supporting csum for RC, we can add another dev cap and say the well
> established API applies on them too, with just a slight modification
> to man pages and such, makes sense?

So if we ever support RC, then any actual users of this API will have
hardcoded which types of QPs are supported into their apps and they will
*all* have to go modify their source code to re-hardcode the types into
their app and recompile.

Alternatively, we write a reasonable API.  One where the types of QPs
are not set in stone, we tell users to query the API to determine if any
given QP supports IP CSUM offloads, and then if we add more QP types in
the future, the apps just automatically work even on the different QP
types (assuming they used those QP types at all) because it was written
to an API that allowed for it.

Face it Or, the API in this patchset is crap.  I totally get why you are
fighting for it so hard.  You already spelled it out: "This device
capability is there  ~10y for IB UD and we just naturally extend it to
Eth RAW".  I can read between the lines.  You have users of the API,
probably some internal and some external, and if I go demanding a proper
API, all of these people have to recode their apps to the new API, and
you'd like to avoid that if possible.  However, if I don't, and then RC
support is added, then they have to recode their apps anyway.  Wouldn't
it be best to just do it right and get it over with?

So, here's the API I'm proposing:

- Add ibv_query_qp_ex
- In new ibv_query_qp_ex struct, extend the ibv_qp_caps struct to add a
flags element
- Define a flag for IP_CSUM_SUPPORTED
- Define IP_CSUM flag for send operations
- Define the API so that IP_CSUM is ignored on all sends if the QP
doesn't support IP_CSUM and only checked on QPs that support it.  This
way other QP types don't suffer a penalty on send to check this and
return EINVAL if its set
- Define the return flags in the wc struct so we can signal that a CSUM
was performed and succeeded

A user app would then basically follow this flow:

ibv_create_qp()
ibv_query_qp()
  check for IP CSUM and cache result
ibv_post_send()
  set IP_CSUM if QP supports it

ibv_poll_cq()
  if qp supports IP_CSUM, check CSUM result in wc

That's what I would like to see for these changes.
Or Gerlitz Sept. 10, 2015, 7:10 a.m. UTC | #4
On 9/9/2015 6:20 PM, Doug Ledford wrote:
> On 09/05/2015 03:59 PM, Or Gerlitz wrote:
>> --> they must not put any
>> additional bits on the wire
> Maybe.  For base level interop, sure, but for enhanced service in a homogeneous environment, not necessarily true.
>
>> --> RC isn't an option
> Not following here Or.  The IPoIB connected mode packet is 4 byte IPoIB
> Header followed by whatever header the TCP stack put on the packet
> (likely either IPv4 or IPv6).  The same ipoib_hard_header function is
> used for both CM and UD connections, the same ipoib_start_xmit() is used
> for both CM and UD connections, and we just hand off to the appropriate
> send routine when we have our neighbor lookup complete.  Why would you
> need new bits on the wire to do a checksum on an RC send and not for a UD send?

I didn't say that, I said (again) the offload engines will use just 
plain UD along the IPoIB RFC, b/c each IPoIB implementation would 
support that. It's true that an offload can use connected mode (CM) and 
fall back to UD if the other end don't support CM.

To make things concrete, exactly nothing in the libibverbs patch is made 
such it would work just for UD and RAW PACKET qps, except few TEXT 
comments and man page line. In other words, if there is HW vendor that 
supports checksum generation/validation on RC QPs too - the same API 
would just work.

It's pretty easy to add one line of check for the vendor send function 
such that it returns error if checksum generation isn't supported for 
the given QP type, and in that respect, before going and saying this 
submission is crappy, you could have just provided this reviewer comment 
"remove the assumption on QP types, make sure in libmlx4 you return 
error for non UD/RAW qps" and (see next)



>
>> , so for IPoIB we
>> just need an IPoIB UD QP in user space, and for Ethernet RAW PACKET
>> QP. This device capability is there  ~10y for IB UD and we just
>> naturally extend it to Eth RAW. If/when a vendor comes up with
>> supporting csum for RC, we can add another dev cap and say the well
>> established API applies on them too, with just a slight modification
>> to man pages and such, makes sense?
> So if we ever support RC, then any actual users of this API will have
> hardcoded which types of QPs are supported into their apps and they will
> *all* have to go modify their source code to re-hardcode the types into
> their app and recompile.
>
> Alternatively, we write a reasonable API.  One where the types of QPs
> are not set in stone, we tell users to query the API to determine if any
> given QP supports IP CSUM offloads, and then if we add more QP types in
> the future, the apps just automatically work even on the different QP
> types (assuming they used those QP types at all) because it was written
> to an API that allowed for it.
>
> Face it Or, the API in this patchset is crap.  I totally get why you are
> fighting for it so hard.  You already spelled it out: "This device
> capability is there  ~10y for IB UD and we just naturally extend it to
> Eth RAW".  I can read between the lines.  You have users of the API,
> probably some internal and some external, and if I go demanding a proper
> API, all of these people have to recode their apps to the new API, and
> you'd like to avoid that if possible.  However, if I don't, and then RC
> support is added, then they have to recode their apps anyway.  Wouldn't
> it be best to just do it right and get it over with?

(continuing) "show me a plan, how applications can determine the support 
matrix for QPs types and checksum offloads" -- e.g one can enhance the 
extended query device verbs to report that and such.

Indeed there are bunch of UD and RAW based applications out there which 
make perfect use of this API and I don't think we need to gate the 
libibverbs support with that extended query kernel patch.

Why not move forward with the vendor library flow to act along your 
request, removal of assumption from man pages and a kernel patch, which 
on top of, we can add another libibverbs patch that allows apps to query?

I find this approach practical on the one hand, and with clear plan to 
make it fully robust on the other hand.



>
> So, here's the API I'm proposing:
>
> - Add ibv_query_qp_ex
> - In new ibv_query_qp_ex struct, extend the ibv_qp_caps struct to add a
> flags element
> - Define a flag for IP_CSUM_SUPPORTED
> - Define IP_CSUM flag for send operations
> - Define the API so that IP_CSUM is ignored on all sends if the QP
> doesn't support IP_CSUM and only checked on QPs that support it.  This
> way other QP types don't suffer a penalty on send to check this and
> return EINVAL if its set
> - Define the return flags in the wc struct so we can signal that a CSUM
> was performed and succeeded
>
> A user app would then basically follow this flow:
>
> ibv_create_qp()
> ibv_query_qp()
>    check for IP CSUM and cache result
> ibv_post_send()
>    set IP_CSUM if QP supports it
>
> ibv_poll_cq()
>    if qp supports IP_CSUM, check CSUM result in wc
>
> That's what I would like to see for these changes.
>
basically OK, expect that the support matrix for QP types is a property 
of the device, e.g in a similar manner we have for ODP, right? and hence 
the plan I suggested above.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 28e1586..6ae7e6e 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -115,6 +115,7 @@  enum ibv_device_cap_flags {
 	IBV_DEVICE_RC_RNR_NAK_GEN	= 1 << 12,
 	IBV_DEVICE_SRQ_RESIZE		= 1 << 13,
 	IBV_DEVICE_N_NOTIFY_CQ		= 1 << 14,
+	IBV_DEVICE_IP_CSUM		= 1 << 18,
 	IBV_DEVICE_XRC			= 1 << 20,
 	IBV_DEVICE_MANAGED_FLOW_STEERING = 1 << 29
 };
@@ -314,9 +315,14 @@  enum ibv_wc_opcode {
 	IBV_WC_RECV_RDMA_WITH_IMM
 };
 
+enum {
+	IBV_WC_IP_CSUM_OK_SHIFT	= 2
+};
+
 enum ibv_wc_flags {
 	IBV_WC_GRH		= 1 << 0,
-	IBV_WC_WITH_IMM		= 1 << 1
+	IBV_WC_WITH_IMM		= 1 << 1,
+	IBV_WC_IP_CSUM_OK	= 1 << IBV_WC_IP_CSUM_OK_SHIFT
 };
 
 struct ibv_wc {
@@ -653,7 +659,8 @@  enum ibv_send_flags {
 	IBV_SEND_FENCE		= 1 << 0,
 	IBV_SEND_SIGNALED	= 1 << 1,
 	IBV_SEND_SOLICITED	= 1 << 2,
-	IBV_SEND_INLINE		= 1 << 3
+	IBV_SEND_INLINE		= 1 << 3,
+	IBV_SEND_IP_CSUM	= 1 << 4
 };
 
 struct ibv_sge {
diff --git a/man/ibv_poll_cq.3 b/man/ibv_poll_cq.3
index 57c6daa..539940d 100644
--- a/man/ibv_poll_cq.3
+++ b/man/ibv_poll_cq.3
@@ -50,6 +50,9 @@  It is either 0 or the bitwise OR of one or more of the following flags:
 .B IBV_WC_GRH \fR      GRH is present (valid only for UD QPs)
 .TP
 .B IBV_WC_WITH_IMM \fR Immediate data value is valid
+.TP
+.B IBV_WC_IP_CSUM_OK \fR TCP/UDP checksum over IPv4 and IPv4 header checksum are verified.
+This feature is supported only when \fBIBV_DEVICE_IP_CSUM\fR flag is set in the device capability flags.
 .PP
 Not all
 .I wc
diff --git a/man/ibv_post_send.3 b/man/ibv_post_send.3
index 33fbb50..3b07bcb 100644
--- a/man/ibv_post_send.3
+++ b/man/ibv_post_send.3
@@ -98,6 +98,10 @@  The attribute send_flags describes the properties of the \s-1WR\s0. It is either
 .TP
 .B IBV_SEND_INLINE \fR Send data in given gather list as inline data
 in a send WQE.  Valid only for Send and RDMA Write.  The L_Key will not be checked.
++.TP
++.B IBV_SEND_IP_CSUM \fR Offload the IPv4 and TCP/UDP checksum calculation.
++Valid only for QPs with Transport Service Type \fBIBV_QPT_UD\fR or \fBIBV_QPT_RAW_PACKET\fR.
++This feature supported only when \fBIBV_DEVICE_IP_CSUM\fR the flag is set in the device capability flags.
 .SH "RETURN VALUE"
 .B ibv_post_send()
 returns 0 on success, or the value of errno on failure (which indicates the failure reason).