[bpf-next,v4,12/21] xdp: Add checksum hint

Message ID 20230728173923.1318596-13-larysa.zaremba@intel.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series [bpf-next,v4,01/21] ice: make RX hash reading code more reusable

Checks

Context                         Check    Description
bpf/vmtest-bpf-next-VM_Test-1   success  Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2   success  Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4   success  Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-5   success  Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-6   success  Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-3   success  Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7   success  Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9   success  Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10  success  Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-11  fail     Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-13  fail     Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14  fail     Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-15  fail     Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-17  fail     Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18  fail     Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19  success  Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20  success  Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21  success  Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22  success  Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23  success  Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24  success  Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-25  success  Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-26  success  Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-27  success  Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28  success  Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29  success  Logs for veristat
bpf/vmtest-bpf-next-VM_Test-16  fail     Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12  fail     Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-PR          fail     PR summary
bpf/vmtest-bpf-next-VM_Test-8   success  Logs for test_maps on s390x with gcc
netdev/series_format            fail     Series does not have a cover letter; Series longer than 15 patches (and no cover letter)
netdev/tree_selection           success  Clearly marked for bpf-next, async
netdev/fixes_present            success  Fixes tag not required for -next series
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              fail     Errors and warnings before: 5488 this patch: 5489
netdev/cc_maintainers           warning  7 maintainers not CCed: hawk@kernel.org yonghong.song@linux.dev corbet@lwn.net davem@davemloft.net pabeni@redhat.com edumazet@google.com linux-doc@vger.kernel.org
netdev/build_clang              success  Errors and warnings before: 2273 this patch: 2273
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  No Fixes tag
netdev/build_allmodconfig_warn  fail     Errors and warnings before: 5728 this patch: 5729
netdev/checkpatch               warning  WARNING: line length of 82 exceeds 80 columns
netdev/kdoc                     success  Errors and warnings before: 1 this patch: 1
netdev/source_inline            success  Was 0 now: 0

Commit Message

Larysa Zaremba July 28, 2023, 5:39 p.m. UTC
Implement functionality that enables drivers to expose to XDP code checksum
information that consists of:

- Checksum status - bitfield that consists of
  - number of consecutive validated checksums. This is almost the same as
    csum_level in skb, but starts with 1. Enum names for those bits still
    use checksum level concept, so it is less confusing for driver
    developers.
  - Is checksum partial? This bit cannot coexist with any other
  - Is there a complete checksum available?
- Additional checksum data, a union of:
  - checksum start and offset, if checksum is partial
  - complete checksum, if available

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 Documentation/networking/xdp-rx-metadata.rst |  3 ++
 include/linux/netdevice.h                    |  3 ++
 include/net/xdp.h                            | 46 ++++++++++++++++++++
 kernel/bpf/offload.c                         |  2 +
 net/core/xdp.c                               | 23 ++++++++++
 5 files changed, 77 insertions(+)
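
As a reading aid for the status layout described in the commit message, here
is a minimal user-space C sketch. Only the 3-bit count mask (GENMASK(2, 0))
is taken from the patch as quoted in this thread; the EX_/ex_ names and the
bit positions chosen for the partial and complete flags are assumptions made
for illustration and are not the patch's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; flag bit positions are assumed, not from the patch. */
enum ex_csum_status {
	EX_CSUM_VALID_NUM_MASK = 0x7,     /* bits 0..2: validated checksum count */
	EX_CSUM_PARTIAL        = 1 << 3,  /* assumed position */
	EX_CSUM_COMPLETE       = 1 << 4,  /* assumed position */
};

/* Number of consecutive validated checksums, outermost first; 0 means none. */
static inline unsigned int ex_csum_valid_count(uint32_t status)
{
	return status & EX_CSUM_VALID_NUM_MASK;
}

/* The count starts at 1, while the skb csum_level starts at 0. */
static inline unsigned int ex_csum_to_skb_level(uint32_t status)
{
	unsigned int count = ex_csum_valid_count(status);

	assert(count > 0); /* a level only exists if something was validated */
	return count - 1;
}

/* The partial bit must not coexist with any other bit. */
static inline bool ex_csum_status_is_sane(uint32_t status)
{
	if (status & EX_CSUM_PARTIAL)
		return status == EX_CSUM_PARTIAL;
	return true;
}
```

In this sketch a status of `2 | EX_CSUM_COMPLETE` means "two outermost
checksums validated, and a complete checksum is also available", which the
commit message allows; `EX_CSUM_PARTIAL | 1` is rejected as inconsistent.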

Comments

Alexei Starovoitov July 28, 2023, 9:53 p.m. UTC | #1
On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
>  
> +union xdp_csum_info {
> +	/* Checksum referred to by ``csum_start + csum_offset`` is considered
> +	 * valid, but was never calculated, TX device has to do this,
> +	 * starting from csum_start packet byte.
> +	 * Any preceding checksums are also considered valid.
> +	 * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> +	 */
> +	struct {
> +		u16 csum_start;
> +		u16 csum_offset;
> +	};
> +

CHECKSUM_PARTIAL makes sense on TX, but this is RX. I don't see it in the above.

> +	/* Checksum, calculated over the whole packet.
> +	 * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> +	 */
> +	u32 checksum;

imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
or XDP_CHECKSUM_UNNECESSARY.

> +};
> +
> +enum xdp_csum_status {
> +	/* HW had parsed several transport headers and validated their
> +	 * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> +	 * 3 least significant bits contain number of consecutive checksums,
> +	 * starting with the outermost, reported by hardware as valid.
> +	 * ``sk_buff`` checksum level (``csum_level``) notation is provided
> +	 * for driver developers.
> +	 */
> +	XDP_CHECKSUM_VALID_LVL0		= 1,	/* 1 outermost checksum */
> +	XDP_CHECKSUM_VALID_LVL1		= 2,	/* 2 outermost checksums */
> +	XDP_CHECKSUM_VALID_LVL2		= 3,	/* 3 outermost checksums */
> +	XDP_CHECKSUM_VALID_LVL3		= 4,	/* 4 outermost checksums */
> +	XDP_CHECKSUM_VALID_NUM_MASK	= GENMASK(2, 0),
> +	XDP_CHECKSUM_VALID		= XDP_CHECKSUM_VALID_NUM_MASK,

I don't see what a bpf prog is supposed to do with these levels.
The driver should pick between 3:
XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.

No levels and nothing partial, please.

Willem de Bruijn July 29, 2023, 4:15 p.m. UTC | #2
Alexei Starovoitov wrote:
> On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> >  
> > +union xdp_csum_info {
> > +	/* Checksum referred to by ``csum_start + csum_offset`` is considered
> > +	 * valid, but was never calculated, TX device has to do this,
> > +	 * starting from csum_start packet byte.
> > +	 * Any preceding checksums are also considered valid.
> > +	 * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > +	 */
> > +	struct {
> > +		u16 csum_start;
> > +		u16 csum_offset;
> > +	};
> > +
> 
> CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.

It can be observed on RX when packets are looped.

This may be observed even in XDP on veth.
 
> > +	/* Checksum, calculated over the whole packet.
> > +	 * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> > +	 */
> > +	u32 checksum;
> 
> imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
> or XDP_CHECKSUM_UNNECESSARY.
> 
> > +};
> > +
> > +enum xdp_csum_status {
> > +	/* HW had parsed several transport headers and validated their
> > +	 * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> > +	 * 3 least significant bits contain number of consecutive checksums,
> > +	 * starting with the outermost, reported by hardware as valid.
> > +	 * ``sk_buff`` checksum level (``csum_level``) notation is provided
> > +	 * for driver developers.
> > +	 */
> > +	XDP_CHECKSUM_VALID_LVL0		= 1,	/* 1 outermost checksum */
> > +	XDP_CHECKSUM_VALID_LVL1		= 2,	/* 2 outermost checksums */
> > +	XDP_CHECKSUM_VALID_LVL2		= 3,	/* 3 outermost checksums */
> > +	XDP_CHECKSUM_VALID_LVL3		= 4,	/* 4 outermost checksums */
> > +	XDP_CHECKSUM_VALID_NUM_MASK	= GENMASK(2, 0),
> > +	XDP_CHECKSUM_VALID		= XDP_CHECKSUM_VALID_NUM_MASK,
> 
> I don't see what bpf prog suppose to do with these levels.
> The driver should pick between 3:
> XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.
> 
> No levels and no anything partial. please.

This levels business is an unfortunate side effect of
CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
does the boolean actually mean? With these levels, at least that is
well defined: the first N checksum fields.

Alexei Starovoitov July 29, 2023, 6:04 p.m. UTC | #3
On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Alexei Starovoitov wrote:
> > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> > >
> > > +union xdp_csum_info {
> > > +   /* Checksum referred to by ``csum_start + csum_offset`` is considered
> > > +    * valid, but was never calculated, TX device has to do this,
> > > +    * starting from csum_start packet byte.
> > > +    * Any preceding checksums are also considered valid.
> > > +    * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > > +    */
> > > +   struct {
> > > +           u16 csum_start;
> > > +           u16 csum_offset;
> > > +   };
> > > +
> >
> > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.
>
> It can be observed on RX when packets are looped.
>
> This may be observed even in XDP on veth.

veth and XDP is a broken combination. GSO packets coming out of containers
cannot be parsed properly by XDP.
It was added mainly for testing. Just like "generic XDP".
bpf progs at the skb layer are a much better fit for veth.

> > > +   /* Checksum, calculated over the whole packet.
> > > +    * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> > > +    */
> > > +   u32 checksum;
> >
> > imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
> > or XDP_CHECKSUM_UNNECESSARY.
> >
> > > +};
> > > +
> > > +enum xdp_csum_status {
> > > +   /* HW had parsed several transport headers and validated their
> > > +    * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> > > +    * 3 least significant bits contain number of consecutive checksums,
> > > +    * starting with the outermost, reported by hardware as valid.
> > > +    * ``sk_buff`` checksum level (``csum_level``) notation is provided
> > > +    * for driver developers.
> > > +    */
> > > +   XDP_CHECKSUM_VALID_LVL0         = 1,    /* 1 outermost checksum */
> > > +   XDP_CHECKSUM_VALID_LVL1         = 2,    /* 2 outermost checksums */
> > > +   XDP_CHECKSUM_VALID_LVL2         = 3,    /* 3 outermost checksums */
> > > +   XDP_CHECKSUM_VALID_LVL3         = 4,    /* 4 outermost checksums */
> > > +   XDP_CHECKSUM_VALID_NUM_MASK     = GENMASK(2, 0),
> > > +   XDP_CHECKSUM_VALID              = XDP_CHECKSUM_VALID_NUM_MASK,
> >
> > I don't see what bpf prog suppose to do with these levels.
> > The driver should pick between 3:
> > XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.
> >
> > No levels and no anything partial. please.
>
> This levels business is an unfortunate side effect of
> CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> does the boolean actually mean? With these levels, at least that is
> well defined: the first N checksum fields.

If I understand this correctly, this is an Intel-specific feature that
other NICs don't have. The skb layer also doesn't have such a concept.
The driver should say CHECKSUM_UNNECESSARY when it's sure,
or not pretend that it checks the checksum and just say NONE.

Willem de Bruijn July 30, 2023, 1:13 p.m. UTC | #4
Alexei Starovoitov wrote:
> On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Alexei Starovoitov wrote:
> > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> > > >
> > > > +union xdp_csum_info {
> > > > +   /* Checksum referred to by ``csum_start + csum_offset`` is considered
> > > > +    * valid, but was never calculated, TX device has to do this,
> > > > +    * starting from csum_start packet byte.
> > > > +    * Any preceding checksums are also considered valid.
> > > > +    * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > > > +    */
> > > > +   struct {
> > > > +           u16 csum_start;
> > > > +           u16 csum_offset;
> > > > +   };
> > > > +
> > >
> > > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.
> >
> > It can be observed on RX when packets are looped.
> >
> > This may be observed even in XDP on veth.
> 
> veth and XDP is a broken combination. GSO packets coming out of containers
> cannot be parsed properly by XDP.
> It was added mainly for testing. Just like "generic XDP".
> bpf progs at skb layer is much better fit for veth.

Ok. Still, seems forward looking and little cost to define the
constant?
 
> > > > +   /* Checksum, calculated over the whole packet.
> > > > +    * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> > > > +    */
> > > > +   u32 checksum;
> > >
> > > imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
> > > or XDP_CHECKSUM_UNNECESSARY.
> > >
> > > > +};
> > > > +
> > > > +enum xdp_csum_status {
> > > > +   /* HW had parsed several transport headers and validated their
> > > > +    * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> > > > +    * 3 least significant bits contain number of consecutive checksums,
> > > > +    * starting with the outermost, reported by hardware as valid.
> > > > +    * ``sk_buff`` checksum level (``csum_level``) notation is provided
> > > > +    * for driver developers.
> > > > +    */
> > > > +   XDP_CHECKSUM_VALID_LVL0         = 1,    /* 1 outermost checksum */
> > > > +   XDP_CHECKSUM_VALID_LVL1         = 2,    /* 2 outermost checksums */
> > > > +   XDP_CHECKSUM_VALID_LVL2         = 3,    /* 3 outermost checksums */
> > > > +   XDP_CHECKSUM_VALID_LVL3         = 4,    /* 4 outermost checksums */
> > > > +   XDP_CHECKSUM_VALID_NUM_MASK     = GENMASK(2, 0),
> > > > +   XDP_CHECKSUM_VALID              = XDP_CHECKSUM_VALID_NUM_MASK,
> > >
> > > I don't see what bpf prog suppose to do with these levels.
> > > The driver should pick between 3:
> > > XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.
> > >
> > > No levels and no anything partial. please.
> >
> > This levels business is an unfortunate side effect of
> > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> > does the boolean actually mean? With these levels, at least that is
> > well defined: the first N checksum fields.
> 
> If I understand this correctly this is intel specific feature that
> other NICs don't have. skb layer also doesn't have such concept.
> The driver should say CHECKSUM_UNNECESSARY when it's sure
> or don't pretend that it checks the checksum and just say NONE.

I did not know how much this was used, but a quick grep for non-constant
csum_level shows devices from at least six vendors.

Larysa Zaremba July 31, 2023, 10:52 a.m. UTC | #5
On Sun, Jul 30, 2023 at 09:13:02AM -0400, Willem de Bruijn wrote:
> Alexei Starovoitov wrote:
> > On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > Alexei Starovoitov wrote:
> > > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> > > > >
> > > > > +union xdp_csum_info {
> > > > > +   /* Checksum referred to by ``csum_start + csum_offset`` is considered
> > > > > +    * valid, but was never calculated, TX device has to do this,
> > > > > +    * starting from csum_start packet byte.
> > > > > +    * Any preceding checksums are also considered valid.
> > > > > +    * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > > > > +    */
> > > > > +   struct {
> > > > > +           u16 csum_start;
> > > > > +           u16 csum_offset;
> > > > > +   };
> > > > > +
> > > >
> > > > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.
> > >
> > > It can be observed on RX when packets are looped.
> > >
> > > This may be observed even in XDP on veth.
> > 
> > veth and XDP is a broken combination. GSO packets coming out of containers
> > cannot be parsed properly by XDP.
> > It was added mainly for testing. Just like "generic XDP".
> > bpf progs at skb layer is much better fit for veth.
> 
> Ok. Still, seems forward looking and little cost to define the
> constant?
>

+1
CHECKSUM_PARTIAL is mostly for testing and removing/adding it doesn't change 
anything from the perspective of the user that does not use it, so I think it is 
worth having.

> > > > > +   /* Checksum, calculated over the whole packet.
> > > > > +    * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> > > > > +    */
> > > > > +   u32 checksum;
> > > >
> > > > imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
> > > > or XDP_CHECKSUM_UNNECESSARY.
> > > >
> > > > > +};
> > > > > +
> > > > > +enum xdp_csum_status {
> > > > > +   /* HW had parsed several transport headers and validated their
> > > > > +    * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> > > > > +    * 3 least significant bits contain number of consecutive checksums,
> > > > > +    * starting with the outermost, reported by hardware as valid.
> > > > > +    * ``sk_buff`` checksum level (``csum_level``) notation is provided
> > > > > +    * for driver developers.
> > > > > +    */
> > > > > +   XDP_CHECKSUM_VALID_LVL0         = 1,    /* 1 outermost checksum */
> > > > > +   XDP_CHECKSUM_VALID_LVL1         = 2,    /* 2 outermost checksums */
> > > > > +   XDP_CHECKSUM_VALID_LVL2         = 3,    /* 3 outermost checksums */
> > > > > +   XDP_CHECKSUM_VALID_LVL3         = 4,    /* 4 outermost checksums */
> > > > > +   XDP_CHECKSUM_VALID_NUM_MASK     = GENMASK(2, 0),
> > > > > +   XDP_CHECKSUM_VALID              = XDP_CHECKSUM_VALID_NUM_MASK,
> > > >
> > > > I don't see what bpf prog suppose to do with these levels.
> > > > The driver should pick between 3:
> > > > XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.
> > > >
> > > > No levels and no anything partial. please.
> > >
> > > This levels business is an unfortunate side effect of
> > > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> > > does the boolean actually mean? With these levels, at least that is
> > > well defined: the first N checksum fields.
> > 
> > If I understand this correctly this is intel specific feature that
> > other NICs don't have. skb layer also doesn't have such concept.

Please look into the csum_level field in sk_buff. It is not the most used
property in the kernel networking code, but it is certainly (1) used by the
networking stack and (2) set to a non-zero value by many vendors.

So you do not need to search yourself, I'll copy-paste the docs for 
CHECKSUM_UNNECESSARY here:

 *   %CHECKSUM_UNNECESSARY is applicable to following protocols:
 *
 *     - TCP: IPv6 and IPv4.
 *     - UDP: IPv4 and IPv6. A device may apply CHECKSUM_UNNECESSARY to a
 *       zero UDP checksum for either IPv4 or IPv6, the networking stack
 *       may perform further validation in this case.
 *     - GRE: only if the checksum is present in the header.
 *     - SCTP: indicates the CRC in SCTP header has been validated.
 *     - FCOE: indicates the CRC in FC frame has been validated.
 *

Please, look at this:

 *   &sk_buff.csum_level indicates the number of consecutive checksums found in
 *   the packet minus one that have been verified as %CHECKSUM_UNNECESSARY.
 *   For instance if a device receives an IPv6->UDP->GRE->IPv4->TCP packet
 *   and a device is able to verify the checksums for UDP (possibly zero),
 *   GRE (checksum flag is set) and TCP, &sk_buff.csum_level would be set to
 *   two. If the device were only able to verify the UDP checksum and not
 *   GRE, either because it doesn't support GRE checksum or because GRE
 *   checksum is bad, skb->csum_level would be set to zero (TCP checksum is
 *   not considered in this case).

From: 
https://elixir.bootlin.com/linux/v6.5-rc4/source/include/linux/skbuff.h#L115
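
The csum_level rule quoted above can be checked with a small worked example.
This is a user-space sketch, not kernel code: the ex_ helpers and the two
sample arrays are hypothetical, and each array lists only the
checksum-bearing headers of the documented IPv6->UDP->GRE->IPv4->TCP packet
in outer-to-inner order (UDP, GRE, TCP).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sample inputs for the documented packet, ordered UDP, GRE, TCP. */
static const bool ex_all_verified[3] = { true, true, true };
static const bool ex_gre_unverified[3] = { true, false, true };

/* Count checksums verified consecutively, starting from the outermost. */
static inline size_t ex_consecutive_verified(const bool *verified, size_t n)
{
	size_t i = 0;

	assert(verified != NULL || n == 0);
	while (i < n && verified[i])
		i++;
	return i;
}

/*
 * Return the skb-style csum_level (count minus one), or -1 when not even
 * the outermost checksum was verified, i.e. the CHECKSUM_NONE case.
 */
static inline int ex_csum_level(const bool *verified, size_t n)
{
	size_t count = ex_consecutive_verified(verified, n);

	return count ? (int)count - 1 : -1;
}
```

With all three checksums verified the sketch yields a csum_level of 2. With
GRE unverified it yields 0, and the TCP result is ignored because counting
stops at the first unverified checksum, as in the documentation.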

> > The driver should say CHECKSUM_UNNECESSARY when it's sure
> > or don't pretend that it checks the checksum and just say NONE.
> 

Well, in that case most of the NICs that use CHECKSUM_UNNECESSARY would have
to return CHECKSUM_NONE instead. Based on my quick search, they mostly return
a checksum level of 0 (no tunneling detected) or 1 (tunneling detected), so
they only parse headers up to a certain depth. That means they cannot tell
whether another CHECKSUM_UNNECESSARY-eligible header is hiding in the
payload, so those NICs cannot guarantee that ALL the checksums present in the
packet are correct. So, by your logic, we should make e.g. an AF_XDP user
re-check already verified checksums themselves, because HW "doesn't pretend
that it checks the checksum and just says NONE".
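
To make the cost concrete, the following sketch shows what a consumer could
skip if it were told how many outermost checksums the hardware validated.
The function names and parameters are hypothetical and do not correspond to
any existing kfunc or AF_XDP API.

```c
#include <stdbool.h>

/*
 * idx is the position of a checksum field in the packet, 0 = outermost.
 * hw_validated is the number of outermost checksums the NIC reported valid.
 * Anything at or beyond that depth still needs a software check.
 */
static inline bool ex_needs_sw_verify(unsigned int idx,
				      unsigned int hw_validated)
{
	return idx >= hw_validated;
}

/* How many of the packet's checksum fields are left for software. */
static inline unsigned int ex_sw_verify_count(unsigned int total,
					      unsigned int hw_validated)
{
	return hw_validated >= total ? 0 : total - hw_validated;
}
```

If the hardware result were reported as NONE instead, hw_validated would
effectively be 0 and every checksum field would be re-checked in software.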

> I did not know how much this was used, but quick grep for non constant
> csum_level shows devices from at least six vendors.

Yes, there are several vendors that set the csum_level, including Broadcom
(bnxt) and Mellanox (mlx4 and mlx5).

Also, CHECKSUM_UNNECESSARY is found in 100+ drivers/net/ethernet files,
while csum_level is in about 20, which means the overwhelming majority of
CHECKSUM_UNNECESSARY NICs actually stay with the default checksum level of '0'
(they check only the outermost checksum - anything else needs to be verified by
the networking stack).

Jakub Kicinski July 31, 2023, 4:43 p.m. UTC | #6
On Sun, 30 Jul 2023 09:13:02 -0400 Willem de Bruijn wrote:
> > > This levels business is an unfortunate side effect of
> > > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> > > does the boolean actually mean? With these levels, at least that is
> > > well defined: the first N checksum fields.  
> >
> > If I understand this correctly this is intel specific feature that
> > other NICs don't have. skb layer also doesn't have such concept.
> > The driver should say CHECKSUM_UNNECESSARY when it's sure
> > or don't pretend that it checks the checksum and just say NONE.  
> 
> I did not know how much this was used, but quick grep for non constant
> csum_level shows devices from at least six vendors.

I thought it was a legacy thing from early VxLAN days.
We used to leave outer UDP csum as 0 before LCO, and therefore couldn't
convert outer to COMPLETE, so inner could not be offloaded/validated.
Should not be all that relevant today.

Alexei Starovoitov Aug. 1, 2023, 1:03 a.m. UTC | #7
On Mon, Jul 31, 2023 at 3:56 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
>
> On Sun, Jul 30, 2023 at 09:13:02AM -0400, Willem de Bruijn wrote:
> > Alexei Starovoitov wrote:
> > > On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn
> > > <willemdebruijn.kernel@gmail.com> wrote:
> > > >
> > > > Alexei Starovoitov wrote:
> > > > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> > > > > >
> > > > > > +union xdp_csum_info {
> > > > > > +   /* Checksum referred to by ``csum_start + csum_offset`` is considered
> > > > > > +    * valid, but was never calculated, TX device has to do this,
> > > > > > +    * starting from csum_start packet byte.
> > > > > > +    * Any preceding checksums are also considered valid.
> > > > > > +    * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > > > > > +    */
> > > > > > +   struct {
> > > > > > +           u16 csum_start;
> > > > > > +           u16 csum_offset;
> > > > > > +   };
> > > > > > +
> > > > >
> > > > > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.
> > > >
> > > > It can be observed on RX when packets are looped.
> > > >
> > > > This may be observed even in XDP on veth.
> > >
> > > veth and XDP is a broken combination. GSO packets coming out of containers
> > > cannot be parsed properly by XDP.
> > > It was added mainly for testing. Just like "generic XDP".
> > > bpf progs at skb layer is much better fit for veth.
> >
> > Ok. Still, seems forward looking and little cost to define the
> > constant?
> >
>
> +1
> CHECKSUM_PARTIAL is mostly for testing and removing/adding it doesn't change
> anything from the perspective of the user that does not use it, so I think it is
> worth having.

"little cost to define the constant".
Not really. A constant in UAPI is a heavy burden.

> > > > > > +   /* Checksum, calculated over the whole packet.
> > > > > > +    * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> > > > > > +    */
> > > > > > +   u32 checksum;
> > > > >
> > > > > imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
> > > > > or XDP_CHECKSUM_UNNECESSARY.
> > > > >
> > > > > > +};
> > > > > > +
> > > > > > +enum xdp_csum_status {
> > > > > > +   /* HW had parsed several transport headers and validated their
> > > > > > +    * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> > > > > > +    * 3 least significant bits contain number of consecutive checksums,
> > > > > > +    * starting with the outermost, reported by hardware as valid.
> > > > > > +    * ``sk_buff`` checksum level (``csum_level``) notation is provided
> > > > > > +    * for driver developers.
> > > > > > +    */
> > > > > > +   XDP_CHECKSUM_VALID_LVL0         = 1,    /* 1 outermost checksum */
> > > > > > +   XDP_CHECKSUM_VALID_LVL1         = 2,    /* 2 outermost checksums */
> > > > > > +   XDP_CHECKSUM_VALID_LVL2         = 3,    /* 3 outermost checksums */
> > > > > > +   XDP_CHECKSUM_VALID_LVL3         = 4,    /* 4 outermost checksums */
> > > > > > +   XDP_CHECKSUM_VALID_NUM_MASK     = GENMASK(2, 0),
> > > > > > +   XDP_CHECKSUM_VALID              = XDP_CHECKSUM_VALID_NUM_MASK,
> > > > >
> > > > > I don't see what bpf prog suppose to do with these levels.
> > > > > The driver should pick between 3:
> > > > > XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.
> > > > >
> > > > > No levels and no anything partial. please.
> > > >
> > > > This levels business is an unfortunate side effect of
> > > > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> > > > does the boolean actually mean? With these levels, at least that is
> > > > well defined: the first N checksum fields.
> > >
> > > If I understand this correctly this is intel specific feature that
> > > other NICs don't have. skb layer also doesn't have such concept.
>
> Please look into csum_level field in sk_buff. It is not the most used property
> in the kernel networking code, but it is certainly 1. used by networking stack
> 2. set to non-zero value by many vendors.
>
> So you do not need to search yourself, I'll copy-paste the docs for
> CHECKSUM_UNNECESSARY here:
>
>  *   %CHECKSUM_UNNECESSARY is applicable to following protocols:
>  *
>  *     - TCP: IPv6 and IPv4.
>  *     - UDP: IPv4 and IPv6. A device may apply CHECKSUM_UNNECESSARY to a
>  *       zero UDP checksum for either IPv4 or IPv6, the networking stack
>  *       may perform further validation in this case.
>  *     - GRE: only if the checksum is present in the header.
>  *     - SCTP: indicates the CRC in SCTP header has been validated.
>  *     - FCOE: indicates the CRC in FC frame has been validated.
>  *
>
> Please, look at this:
>
>  *   &sk_buff.csum_level indicates the number of consecutive checksums found in
>  *   the packet minus one that have been verified as %CHECKSUM_UNNECESSARY.
>  *   For instance if a device receives an IPv6->UDP->GRE->IPv4->TCP packet
>  *   and a device is able to verify the checksums for UDP (possibly zero),
>  *   GRE (checksum flag is set) and TCP, &sk_buff.csum_level would be set to
>  *   two. If the device were only able to verify the UDP checksum and not
>  *   GRE, either because it doesn't support GRE checksum or because GRE
>  *   checksum is bad, skb->csum_level would be set to zero (TCP checksum is
>  *   not considered in this case).
>
> From:
> https://elixir.bootlin.com/linux/v6.5-rc4/source/include/linux/skbuff.h#L115
>
> > > The driver should say CHECKSUM_UNNECESSARY when it's sure
> > > or don't pretend that it checks the checksum and just say NONE.
> >
>
> Well, in such case, most of the NICs that use CHECKSUM_UNNECESSARY would have to
> return CHECKSUM_NONE instead, because based on my quick search, they mostly
> return checksum level of 0 (no tunneling detected) or 1 (tunneling detected),
> so they only parse headers up to a certain depth, meaning it's not possible
> to tell whether there isn't another CHECKSUM_UNNECESSARY-eligible header hiding
> in the payload, so those NIC cannot guarantee ALL the checksums present in the
> packet are correct. So, by your logic, we should make e.g. AF_XDP user re-check
> already verified checksums themselves, because HW "doesn't pretend that it
> checks the checksum and just says NONE".
>
> > I did not know how much this was used, but quick grep for non constant
> > csum_level shows devices from at least six vendors.
>
> Yes, there are several vendors that set the csum_level, including broadcom
> (bnxt) and mellanox (mlx4 and mlx5).
>
> Also, CHECKSUM_UNNECESSARY is found in 100+ drivers/net/ethernet files,
> while csum_level is in like 20, which means overwhelming majority of
> CHECKSUM_UNNECESSARY NICs actually stay with the default checksum level of '0'
> (they check only the outermost checksum - anything else needs to be verified by
> the networking stack).

No. What I'm saying is that XDP_CHECKSUM_UNNECESSARY should be
equivalent to skb's CHECKSUM_UNNECESSARY with csum_level = 0.
I'm well aware that some drivers are trying to be smart and put csum_level=1.
There is no use case for it in XDP.
"But our HW supports it so XDP prog should read it" is the reason NOT
to expose it to bpf in generic api.

Either we're doing per-driver kfuncs and no common infra, or a common kfunc
that covers 99% of the drivers, which is CHECKSUM_UNNECESSARY && csum_level = 0.

It's not acceptable to present a generic api to xdp prog with multi level
csum that only works on a specific HW. Next thing there will be new flags
and MAX_CSUM_LEVEL in XDP features.
Pretending to be generic while being HW specific is the worst interface.
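
The three-state model argued for here could look roughly like the sketch
below. All names are hypothetical (the thread only names
XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE and XDP_CHECKSUM_NONE), the
driver-side inputs are invented for illustration, and preferring COMPLETE
over UNNECESSARY when both are available is an assumption, not something
stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical three-state status: no levels, nothing partial. */
enum ex_xdp_csum_status {
	EX_XDP_CHECKSUM_NONE,
	EX_XDP_CHECKSUM_UNNECESSARY, /* outermost checksum validated by HW */
	EX_XDP_CHECKSUM_COMPLETE,    /* checksum over the whole packet */
};

struct ex_xdp_csum {
	enum ex_xdp_csum_status status;
	uint32_t checksum; /* meaningful only for EX_XDP_CHECKSUM_COMPLETE */
};

/*
 * Collapse what the hardware reported into the three states. Any validated
 * depth (1, 2, ...) becomes plain UNNECESSARY, i.e. csum_level = 0 semantics.
 */
static inline struct ex_xdp_csum
ex_xdp_report_csum(bool has_complete, uint32_t complete_csum,
		   unsigned int hw_validated)
{
	struct ex_xdp_csum r = { EX_XDP_CHECKSUM_NONE, 0 };

	if (has_complete) {
		r.status = EX_XDP_CHECKSUM_COMPLETE;
		r.checksum = complete_csum;
	} else if (hw_validated > 0) {
		r.status = EX_XDP_CHECKSUM_UNNECESSARY;
	}

	/* A checksum value is only ever reported together with COMPLETE. */
	assert(r.status == EX_XDP_CHECKSUM_COMPLETE || r.checksum == 0);
	return r;
}
```

Note that a NIC validating three nested checksums and one validating only
the outermost produce the same result in this model; that loss of depth
information is exactly the point under discussion.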

Willem de Bruijn Aug. 2, 2023, 1:27 p.m. UTC | #8
Alexei Starovoitov wrote:
> On Mon, Jul 31, 2023 at 3:56 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
> >
> > On Sun, Jul 30, 2023 at 09:13:02AM -0400, Willem de Bruijn wrote:
> > > Alexei Starovoitov wrote:
> > > > On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn
> > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > >
> > > > > Alexei Starovoitov wrote:
> > > > > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> > > > > > >
> > > > > > > +union xdp_csum_info {
> > > > > > > +   /* Checksum referred to by ``csum_start + csum_offset`` is considered
> > > > > > > +    * valid, but was never calculated, TX device has to do this,
> > > > > > > +    * starting from csum_start packet byte.
> > > > > > > +    * Any preceding checksums are also considered valid.
> > > > > > > +    * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > > > > > > +    */
> > > > > > > +   struct {
> > > > > > > +           u16 csum_start;
> > > > > > > +           u16 csum_offset;
> > > > > > > +   };
> > > > > > > +
> > > > > >
> > > > > > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.
> > > > >
> > > > > It can be observed on RX when packets are looped.
> > > > >
> > > > > This may be observed even in XDP on veth.
> > > >
> > > > veth and XDP is a broken combination. GSO packets coming out of containers
> > > > cannot be parsed properly by XDP.
> > > > It was added mainly for testing. Just like "generic XDP".
> > > > bpf progs at skb layer is much better fit for veth.
> > >
> > > Ok. Still, seems forward looking and little cost to define the
> > > constant?
> > >
> >
> > +1
> > CHECKSUM_PARTIAL is mostly for testing and removing/adding it doesn't change
> > anything from the perspective of the user that does not use it, so I think it is
> > worth having.
> 
> "little cost to define the constant".
> Not really. A constant in UAPI is a heavy burden.
> 
> > > > > > > +   /* Checksum, calculated over the whole packet.
> > > > > > > +    * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> > > > > > > +    */
> > > > > > > +   u32 checksum;
> > > > > >
> > > > > > imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
> > > > > > or XDP_CHECKSUM_UNNECESSARY.
> > > > > >
> > > > > > > +};
> > > > > > > +
> > > > > > > +enum xdp_csum_status {
> > > > > > > +   /* HW had parsed several transport headers and validated their
> > > > > > > +    * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> > > > > > > +    * 3 least significant bytes contain number of consecutive checksums,
> > > > > > > +    * starting with the outermost, reported by hardware as valid.
> > > > > > > +    * ``sk_buff`` checksum level (``csum_level``) notation is provided
> > > > > > > +    * for driver developers.
> > > > > > > +    */
> > > > > > > +   XDP_CHECKSUM_VALID_LVL0         = 1,    /* 1 outermost checksum */
> > > > > > > +   XDP_CHECKSUM_VALID_LVL1         = 2,    /* 2 outermost checksums */
> > > > > > > +   XDP_CHECKSUM_VALID_LVL2         = 3,    /* 3 outermost checksums */
> > > > > > > +   XDP_CHECKSUM_VALID_LVL3         = 4,    /* 4 outermost checksums */
> > > > > > > +   XDP_CHECKSUM_VALID_NUM_MASK     = GENMASK(2, 0),
> > > > > > > +   XDP_CHECKSUM_VALID              = XDP_CHECKSUM_VALID_NUM_MASK,
> > > > > >
> > > > > > I don't see what bpf prog suppose to do with these levels.
> > > > > > The driver should pick between 3:
> > > > > > XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.
> > > > > >
> > > > > > No levels and no anything partial. please.
> > > > >
> > > > > This levels business is an unfortunate side effect of
> > > > > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> > > > > does the boolean actually mean? With these levels, at least that is
> > > > > well defined: the first N checksum fields.
> > > >
> > > > If I understand this correctly this is intel specific feature that
> > > > other NICs don't have. skb layer also doesn't have such concept.
> >
> > Please look into csum_level field in sk_buff. It is not the most used property
> > in the kernel networking code, but it is certainly 1. used by networking stack
> > 2. set to non-zero value by many vendors.
> >
> > So you do not need to search yourself, I'll copy-paste the docs for
> > CHECKSUM_UNNECESSARY here:
> >
> >  *   %CHECKSUM_UNNECESSARY is applicable to following protocols:
> >  *
> >  *     - TCP: IPv6 and IPv4.
> >  *     - UDP: IPv4 and IPv6. A device may apply CHECKSUM_UNNECESSARY to a
> >  *       zero UDP checksum for either IPv4 or IPv6, the networking stack
> >  *       may perform further validation in this case.
> >  *     - GRE: only if the checksum is present in the header.
> >  *     - SCTP: indicates the CRC in SCTP header has been validated.
> >  *     - FCOE: indicates the CRC in FC frame has been validated.
> >  *
> >
> > Please, look at this:
> >
> >  *   &sk_buff.csum_level indicates the number of consecutive checksums found in
> >  *   the packet minus one that have been verified as %CHECKSUM_UNNECESSARY.
> >  *   For instance if a device receives an IPv6->UDP->GRE->IPv4->TCP packet
> >  *   and a device is able to verify the checksums for UDP (possibly zero),
> >  *   GRE (checksum flag is set) and TCP, &sk_buff.csum_level would be set to
> >  *   two. If the device were only able to verify the UDP checksum and not
> >  *   GRE, either because it doesn't support GRE checksum or because GRE
> >  *   checksum is bad, skb->csum_level would be set to zero (TCP checksum is
> >  *   not considered in this case).
> >
> > From:
> > https://elixir.bootlin.com/linux/v6.5-rc4/source/include/linux/skbuff.h#L115
> >
> > > > The driver should say CHECKSUM_UNNECESSARY when it's sure
> > > > or don't pretend that it checks the checksum and just say NONE.
> > >
> >
> > Well, in such case, most of the NICs that use CHECKSUM_UNNECESSARY would have to
> > return CHECKSUM_NONE instead, because based on my quick search, they mostly
> > return checksum level of 0 (no tunneling detected) or 1 (tunneling detected),
> > so they only parse headers up to a certain depth, meaning it's not possible
> > to tell whether there isn't another CHECKSUM_UNNECESSARY-eligible header hiding
> > in the payload, so those NIC cannot guarantee ALL the checksums present in the
> > packet are correct. So, by your logic, we should make e.g. AF_XDP user re-check
> > already verified checksums themselves, because HW "doesn't pretend that it
> > checks the checksum and just says NONE".
> >
> > > I did not know how much this was used, but quick grep for non constant
> > > csum_level shows devices from at least six vendors.
> >
> > Yes, there are several vendors that set the csum_level, including broadcom
> > (bnxt) and mellanox (mlx4 and mlx5).
> >
> > Also, CHECKSUM_UNNECESSARY is found in 100+ drivers/net/ethernet files,
> > while csum_level is in like 20, which means overwhelming majority of
> > CHECKSUM_UNNECESSARY NICs actually stay with the default checksum level of '0'
> > (they check only the outermost checksum - anything else needs to be verified by
> > the networking stack).
> 
> No. What I'm saying is that XDP_CHECKSUM_UNNECESSARY should be
> equivalent to skb's CHECKSUM_UNNECESSARY with csum_level = 0.
> I'm well aware that some drivers are trying to be smart and put csum_level=1.
> There is no use case for it in XDP.
> "But our HW supports it so XDP prog should read it" is the reason NOT
> to expose it to bpf in generic api.
> 
> Either we're doing per-driver kfuncs and no common infra or common kfunc
> that covers 99% of the drivers. Which is CHECKSUM_UNNECESSARY && csum_level = 0
> 
> It's not acceptable to present a generic api to xdp prog with multi level
> csum that only works on a specific HW. Next thing there will be new flags
> and MAX_CSUM_LEVEL in XDP features.
> Pretending to be generic while being HW specific is the worst interface.

Ok. Agreed that without it we still cover 99% of the use cases. Fine to drop.
Larysa Zaremba Aug. 7, 2023, 3:03 p.m. UTC | #9
On Wed, Aug 02, 2023 at 09:27:27AM -0400, Willem de Bruijn wrote:
> > No. What I'm saying is that XDP_CHECKSUM_UNNECESSARY should be
> > equivalent to skb's CHECKSUM_UNNECESSARY with csum_level = 0.
> > I'm well aware that some drivers are trying to be smart and put csum_level=1.
> > There is no use case for it in XDP.
> > "But our HW supports it so XDP prog should read it" is the reason NOT
> > to expose it to bpf in generic api.
> > 
> > Either we're doing per-driver kfuncs and no common infra or common kfunc
> > that covers 99% of the drivers. Which is CHECKSUM_UNNECESSARY && csum_level = 0
> > 
> > It's not acceptable to present a generic api to xdp prog with multi level
> > csum that only works on a specific HW. Next thing there will be new flags
> > and MAX_CSUM_LEVEL in XDP features.
> > Pretending to be generic while being HW specific is the worst interface.
> 
> Ok. Agreed that without it we still cover 99% of the use cases. Fine to drop.

Sorry for the late response.
Thanks everyone for the feedback, will drop the checksum level concept from the 
design.
Larysa Zaremba Aug. 7, 2023, 3:08 p.m. UTC | #10
On Mon, Jul 31, 2023 at 09:43:22AM -0700, Jakub Kicinski wrote:
> On Sun, 30 Jul 2023 09:13:02 -0400 Willem de Bruijn wrote:
> > > > This levels business is an unfortunate side effect of
> > > > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> > > > does the boolean actually mean? With these levels, at least that is
> > > > well defined: the first N checksum fields.  
> > >
> > > If I understand this correctly this is intel specific feature that
> > > other NICs don't have. skb layer also doesn't have such concept.
> > > The driver should say CHECKSUM_UNNECESSARY when it's sure
> > > or don't pretend that it checks the checksum and just say NONE.  
> > 
> > I did not know how much this was used, but quick grep for non constant
> > csum_level shows devices from at least six vendors.
> 
> I thought it was a legacy thing from early VxLAN days.
> We used to leave outer UDP csum as 0 before LCO, and therefore couldn't
> convert outer to COMPLETE, so inner could not be offloaded/validated.
> Should not be all that relevant today.

Sorry for the delayed response.
Thanks a lot for this feedback; it significantly deepened my understanding of
checksumming in the kernel.
Larysa Zaremba Aug. 7, 2023, 3:32 p.m. UTC | #11
On Mon, Jul 31, 2023 at 06:03:26PM -0700, Alexei Starovoitov wrote:
> On Mon, Jul 31, 2023 at 3:56 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
> >
> > On Sun, Jul 30, 2023 at 09:13:02AM -0400, Willem de Bruijn wrote:
> > > Alexei Starovoitov wrote:
> > > > On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn
> > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > >
> > > > > Alexei Starovoitov wrote:
> > > > > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> > > > > > >
> > > > > > > +union xdp_csum_info {
> > > > > > > +   /* Checksum referred to by ``csum_start + csum_offset`` is considered
> > > > > > > +    * valid, but was never calculated, TX device has to do this,
> > > > > > > +    * starting from csum_start packet byte.
> > > > > > > +    * Any preceding checksums are also considered valid.
> > > > > > > +    * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > > > > > > +    */
> > > > > > > +   struct {
> > > > > > > +           u16 csum_start;
> > > > > > > +           u16 csum_offset;
> > > > > > > +   };
> > > > > > > +
> > > > > >
> > > > > > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.
> > > > >
> > > > > It can be observed on RX when packets are looped.
> > > > >
> > > > > This may be observed even in XDP on veth.
> > > >
> > > > veth and XDP is a broken combination. GSO packets coming out of containers
> > > > cannot be parsed properly by XDP.
> > > > It was added mainly for testing. Just like "generic XDP".
> > > > bpf progs at skb layer is much better fit for veth.
> > >
> > > Ok. Still, seems forward looking and little cost to define the
> > > constant?
> > >
> >
> > +1
> > CHECKSUM_PARTIAL is mostly for testing and removing/adding it doesn't change
> > anything from the perspective of the user that does not use it, so I think it is
> > worth having.
> 
> "little cost to define the constant".
> Not really. A constant in UAPI is a heavy burden.

Sorry for the delayed response.

I still do not comprehend the problem fully for this particular case, 
considering it shouldn't block any future changes to the API by itself.

But, I personally have no reason to push hard the veth-supporting changes 
(aside from wanting the tests to look nicer).

Still, before removing this in v5, I would like to get some additional feedback 
on this, preferably from Jesper (who, if I remember correctly, takes an interest 
in XDP on veth) or Stanislav.

If instead of union xdp_csum_info we will have just checksum as a second 
argument, there will be no going back for this particular kfunc, so I want to be 
sure nobody will ever need such feature.

[...]
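
The trade-off described above can be sketched by putting the two candidate signatures side by side. This is a userspace approximation, not kernel code: the union is copied from this patch with u16/u32 spelled as stdint types, the status argument is simplified to unsigned int, and the two function-pointer type names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct xdp_md;	/* opaque stand-in for the XDP context */

/* Userspace copy of the v4 union from this patch. */
union xdp_csum_info {
	struct {
		uint16_t csum_start;
		uint16_t csum_offset;
	};
	uint32_t checksum;
};

/* Shape used in v4: the union leaves room for the PARTIAL offsets. */
typedef int (*rx_csum_union_fn)(const struct xdp_md *ctx,
				unsigned int *status,
				union xdp_csum_info *info);

/* Hypothetical reduced shape: only a complete checksum can be returned. */
typedef int (*rx_csum_plain_fn)(const struct xdp_md *ctx,
				unsigned int *status,
				uint32_t *checksum);

/* Both variants write four bytes, so the difference is in the API
 * contract rather than in the size of the output.
 */
static_assert(sizeof(union xdp_csum_info) == sizeof(uint32_t),
	      "union and plain checksum occupy the same space");
```

Since the storage is identical either way, the decision is purely about which meanings the kfunc commits to supporting.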
Stanislav Fomichev Aug. 7, 2023, 5:06 p.m. UTC | #12
On 08/07, Larysa Zaremba wrote:
> On Mon, Jul 31, 2023 at 06:03:26PM -0700, Alexei Starovoitov wrote:
> > On Mon, Jul 31, 2023 at 3:56 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
> > >
> > > On Sun, Jul 30, 2023 at 09:13:02AM -0400, Willem de Bruijn wrote:
> > > > Alexei Starovoitov wrote:
> > > > > On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn
> > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > >
> > > > > > Alexei Starovoitov wrote:
> > > > > > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> > > > > > > >
> > > > > > > > +union xdp_csum_info {
> > > > > > > > +   /* Checksum referred to by ``csum_start + csum_offset`` is considered
> > > > > > > > +    * valid, but was never calculated, TX device has to do this,
> > > > > > > > +    * starting from csum_start packet byte.
> > > > > > > > +    * Any preceding checksums are also considered valid.
> > > > > > > > +    * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > > > > > > > +    */
> > > > > > > > +   struct {
> > > > > > > > +           u16 csum_start;
> > > > > > > > +           u16 csum_offset;
> > > > > > > > +   };
> > > > > > > > +
> > > > > > >
> > > > > > > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.
> > > > > >
> > > > > > It can be observed on RX when packets are looped.
> > > > > >
> > > > > > This may be observed even in XDP on veth.
> > > > >
> > > > > veth and XDP is a broken combination. GSO packets coming out of containers
> > > > > cannot be parsed properly by XDP.
> > > > > It was added mainly for testing. Just like "generic XDP".
> > > > > bpf progs at skb layer is much better fit for veth.
> > > >
> > > > Ok. Still, seems forward looking and little cost to define the
> > > > constant?
> > > >
> > >
> > > +1
> > > CHECKSUM_PARTIAL is mostly for testing and removing/adding it doesn't change
> > > anything from the perspective of the user that does not use it, so I think it is
> > > worth having.
> > 
> > "little cost to define the constant".
> > Not really. A constant in UAPI is a heavy burden.
> 
> Sorry for the delayed response.
> 
> I still do not comprehend the problem fully for this particular case, 
> considering it shouldn't block any future changes to the API by itself.
> 
> But, I personally have no reason to push hard the veth-supporting changes 
> (aside from wanting the tests to look nicer).
> 
> Still, before removing this in v5, I would like to get some additional feedback 
> on this, preferably from Jesper (who, if I remember correctly, takes an interest 
> in XDP on veth) or Stanislav.
> 
> If instead of union xdp_csum_info we will have just checksum as a second 
> argument, there will be no going back for this particular kfunc, so I want to be 
> sure nobody will ever need such feature.
> 
> [...]

I'm interested in veth only from the testing POV, so if we lose
csum_partial on veth (and it becomes _none?), I don't see any issue
with that.
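
For readers following the thread, here is a rough userspace sketch of how a consumer might decode the status bits as defined in v4 of the patch (before the changes agreed on above). The enum and union are copied from the diff with GENMASK() and BIT() expanded by hand; csum_valid_count() and csum_describe() are hypothetical helpers written only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace copies of the v4 definitions from this patch. */
enum xdp_csum_status {
	XDP_CHECKSUM_VALID_LVL0		= 1,
	XDP_CHECKSUM_VALID_LVL1		= 2,
	XDP_CHECKSUM_VALID_LVL2		= 3,
	XDP_CHECKSUM_VALID_LVL3		= 4,
	XDP_CHECKSUM_VALID_NUM_MASK	= 0x7,		/* GENMASK(2, 0) */
	XDP_CHECKSUM_VALID		= XDP_CHECKSUM_VALID_NUM_MASK,
	XDP_CHECKSUM_PARTIAL		= 1 << 3,	/* BIT(3) */
	XDP_CHECKSUM_COMPLETE		= 1 << 4,	/* BIT(4) */
};

union xdp_csum_info {
	struct {
		uint16_t csum_start;
		uint16_t csum_offset;
	};
	uint32_t checksum;
};

/* Hypothetical helper: number of outermost checksums reported valid. */
static unsigned int csum_valid_count(unsigned int status)
{
	return status & XDP_CHECKSUM_VALID_NUM_MASK;
}

/* Hypothetical helper: write a one-line summary of the status into buf. */
static int csum_describe(unsigned int status, const union xdp_csum_info *info,
			 char *buf, size_t len)
{
	assert(info != NULL && buf != NULL);

	/* PARTIAL cannot coexist with any other status, hence the equality. */
	if (status == XDP_CHECKSUM_PARTIAL)
		return snprintf(buf, len, "partial: start=%u offset=%u",
				(unsigned int)info->csum_start,
				(unsigned int)info->csum_offset);
	/* COMPLETE may be combined with a valid-checksum count. */
	if (status & XDP_CHECKSUM_COMPLETE)
		return snprintf(buf, len, "complete: csum=0x%08x valid=%u",
				(unsigned int)info->checksum,
				csum_valid_count(status));
	if (status & XDP_CHECKSUM_VALID)
		return snprintf(buf, len, "valid: %u outermost checksum(s)",
				csum_valid_count(status));
	return snprintf(buf, len, "none");
}
```

Note that the low three bits form a counter rather than independent flags, which is why the sketch masks them instead of testing individual bits.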

Patch

diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
index ea6dd79a21d3..7f056a44f682 100644
--- a/Documentation/networking/xdp-rx-metadata.rst
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -26,6 +26,9 @@  metadata is supported, this set will grow:
 .. kernel-doc:: net/core/xdp.c
    :identifiers: bpf_xdp_metadata_rx_vlan_tag
 
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_metadata_rx_csum
+
 An XDP program can use these kfuncs to read the metadata into stack
 variables for its own consumption. Or, to pass the metadata on to other
 consumers, an XDP program can store it into the metadata area carried
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 028dcc4fd02d..a950cec76945 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1660,6 +1660,9 @@  struct xdp_metadata_ops {
 			       enum xdp_rss_hash_type *rss_type);
 	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tci,
 				   __be16 *vlan_proto);
+	int	(*xmo_rx_csum)(const struct xdp_md *ctx,
+			       enum xdp_csum_status *csum_status,
+			       union xdp_csum_info *csum_info);
 };
 
 /**
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 89c58f56ffc6..7e6163e5002a 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -391,6 +391,8 @@  void xdp_attachment_setup(struct xdp_attachment_info *info,
 			   bpf_xdp_metadata_rx_hash) \
 	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
 			   bpf_xdp_metadata_rx_vlan_tag) \
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_CSUM, \
+			   bpf_xdp_metadata_rx_csum) \
 
 enum {
 #define XDP_METADATA_KFUNC(name, _) name,
@@ -448,6 +450,50 @@  enum xdp_rss_hash_type {
 	XDP_RSS_TYPE_L4_IPV6_SCTP_EX = XDP_RSS_TYPE_L4_IPV6_SCTP | XDP_RSS_L3_DYNHDR,
 };
 
+union xdp_csum_info {
+	/* Checksum referred to by ``csum_start + csum_offset`` is considered
+	 * valid, but was never calculated, TX device has to do this,
+	 * starting from csum_start packet byte.
+	 * Any preceding checksums are also considered valid.
+	 * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
+	 */
+	struct {
+		u16 csum_start;
+		u16 csum_offset;
+	};
+
+	/* Checksum, calculated over the whole packet.
+	 * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
+	 */
+	u32 checksum;
+};
+
+enum xdp_csum_status {
+	/* HW had parsed several transport headers and validated their
+	 * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
+	 * 3 least significant bits contain number of consecutive checksums,
+	 * starting with the outermost, reported by hardware as valid.
+	 * ``sk_buff`` checksum level (``csum_level``) notation is provided
+	 * for driver developers.
+	 */
+	XDP_CHECKSUM_VALID_LVL0		= 1,	/* 1 outermost checksum */
+	XDP_CHECKSUM_VALID_LVL1		= 2,	/* 2 outermost checksums */
+	XDP_CHECKSUM_VALID_LVL2		= 3,	/* 3 outermost checksums */
+	XDP_CHECKSUM_VALID_LVL3		= 4,	/* 4 outermost checksums */
+	XDP_CHECKSUM_VALID_NUM_MASK	= GENMASK(2, 0),
+	XDP_CHECKSUM_VALID		= XDP_CHECKSUM_VALID_NUM_MASK,
+
+	/* Occurs if packet is sent virtually (between Linux VMs / containers)
+	 * This status cannot coexist with any other.
+	 * Refer to ``csum_start`` and ``csum_offset`` in ``xdp_csum_info``
+	 * for more information.
+	 */
+	XDP_CHECKSUM_PARTIAL	= BIT(3),
+
+	/* Checksum, calculated over the entire packet is provided */
+	XDP_CHECKSUM_COMPLETE	= BIT(4),
+};
+
 #ifdef CONFIG_NET
 u32 bpf_xdp_metadata_kfunc_id(int id);
 bool bpf_dev_bound_kfunc_id(u32 btf_id);
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 986e7becfd42..f60a6add5273 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -850,6 +850,8 @@  void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 		p = ops->xmo_rx_hash;
 	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
 		p = ops->xmo_rx_vlan_tag;
+	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_CSUM))
+		p = ops->xmo_rx_csum;
 out:
 	up_read(&bpf_devs_lock);
 
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 8b55419d332e..d4ea54046afc 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -772,6 +772,29 @@  __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
 	return -EOPNOTSUPP;
 }
 
+/**
+ * bpf_xdp_metadata_rx_csum - Get checksum status with additional info.
+ * @ctx: XDP context pointer.
+ * @csum_status: Destination for checksum status.
+ * @csum_info: Destination for complete checksum or partial checksum offset.
+ *
+ * Status (@csum_status) is a bitfield that indicates what checksum
+ * processing was performed. Additional results of such processing,
+ * such as complete checksum or partial checksum offsets,
+ * are passed as info (@csum_info).
+ *
+ * Return:
+ * * Returns 0 on success or ``-errno`` on error.
+ * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
+ * * ``-ENODATA``    : Checksum status is unknown
+ */
+__bpf_kfunc int bpf_xdp_metadata_rx_csum(const struct xdp_md *ctx,
+					 enum xdp_csum_status *csum_status,
+					 union xdp_csum_info *csum_info)
+{
+	return -EOPNOTSUPP;
+}
+
 __diag_pop();
 
 BTF_SET8_START(xdp_metadata_kfunc_ids)