Message ID | 20230728173923.1318596-13-larysa.zaremba@intel.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | BPF |
Series | [bpf-next,v4,01/21] ice: make RX hash reading code more reusable |
On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
>
> +union xdp_csum_info {
> +	/* Checksum referred to by ``csum_start + csum_offset`` is considered
> +	 * valid, but was never calculated, TX device has to do this,
> +	 * starting from csum_start packet byte.
> +	 * Any preceding checksums are also considered valid.
> +	 * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> +	 */
> +	struct {
> +		u16 csum_start;
> +		u16 csum_offset;
> +	};
> +

CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.

> +	/* Checksum, calculated over the whole packet.
> +	 * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> +	 */
> +	u32 checksum;

imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
or XDP_CHECKSUM_UNNECESSARY.

> +};
> +
> +enum xdp_csum_status {
> +	/* HW had parsed several transport headers and validated their
> +	 * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> +	 * 3 least significant bytes contain number of consecutive checksums,
> +	 * starting with the outermost, reported by hardware as valid.
> +	 * ``sk_buff`` checksum level (``csum_level``) notation is provided
> +	 * for driver developers.
> +	 */
> +	XDP_CHECKSUM_VALID_LVL0 = 1, /* 1 outermost checksum */
> +	XDP_CHECKSUM_VALID_LVL1 = 2, /* 2 outermost checksums */
> +	XDP_CHECKSUM_VALID_LVL2 = 3, /* 3 outermost checksums */
> +	XDP_CHECKSUM_VALID_LVL3 = 4, /* 4 outermost checksums */
> +	XDP_CHECKSUM_VALID_NUM_MASK = GENMASK(2, 0),
> +	XDP_CHECKSUM_VALID = XDP_CHECKSUM_VALID_NUM_MASK,

I don't see what bpf prog suppose to do with these levels.
The driver should pick between 3:
XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.

No levels and no anything partial. please.
Alexei Starovoitov wrote:
> On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> >
> > +union xdp_csum_info {
> > +	/* Checksum referred to by ``csum_start + csum_offset`` is considered
> > +	 * valid, but was never calculated, TX device has to do this,
> > +	 * starting from csum_start packet byte.
> > +	 * Any preceding checksums are also considered valid.
> > +	 * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > +	 */
> > +	struct {
> > +		u16 csum_start;
> > +		u16 csum_offset;
> > +	};
> > +
>
> CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.

It can be observed on RX when packets are looped.

This may be observed even in XDP on veth.

> > +	/* Checksum, calculated over the whole packet.
> > +	 * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> > +	 */
> > +	u32 checksum;
>
> imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
> or XDP_CHECKSUM_UNNECESSARY.
>
> > +};
> > +
> > +enum xdp_csum_status {
> > +	/* HW had parsed several transport headers and validated their
> > +	 * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> > +	 * 3 least significant bytes contain number of consecutive checksums,
> > +	 * starting with the outermost, reported by hardware as valid.
> > +	 * ``sk_buff`` checksum level (``csum_level``) notation is provided
> > +	 * for driver developers.
> > +	 */
> > +	XDP_CHECKSUM_VALID_LVL0 = 1, /* 1 outermost checksum */
> > +	XDP_CHECKSUM_VALID_LVL1 = 2, /* 2 outermost checksums */
> > +	XDP_CHECKSUM_VALID_LVL2 = 3, /* 3 outermost checksums */
> > +	XDP_CHECKSUM_VALID_LVL3 = 4, /* 4 outermost checksums */
> > +	XDP_CHECKSUM_VALID_NUM_MASK = GENMASK(2, 0),
> > +	XDP_CHECKSUM_VALID = XDP_CHECKSUM_VALID_NUM_MASK,
>
> I don't see what bpf prog suppose to do with these levels.
> The driver should pick between 3:
> XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.
>
> No levels and no anything partial. please.

This levels business is an unfortunate side effect of
CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
does the boolean actually mean? With these levels, at least that is
well defined: the first N checksum fields.
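The "first N checksum fields" reading can be illustrated with a minimal sketch. It assumes the enum values from the quoted patch, with GENMASK(2, 0) written out as 0x7; the two helper functions are hypothetical and are not part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the quoted patch; GENMASK(2, 0) expands to 0x7. */
enum xdp_csum_status {
	XDP_CHECKSUM_VALID_LVL0 = 1, /* 1 outermost checksum */
	XDP_CHECKSUM_VALID_LVL1 = 2, /* 2 outermost checksums */
	XDP_CHECKSUM_VALID_LVL2 = 3, /* 3 outermost checksums */
	XDP_CHECKSUM_VALID_LVL3 = 4, /* 4 outermost checksums */
	XDP_CHECKSUM_VALID_NUM_MASK = 0x7,
};

/* Hypothetical helper: how many outermost checksums the device reported valid. */
static inline unsigned int xdp_csum_valid_count(uint32_t status)
{
	unsigned int n = status & XDP_CHECKSUM_VALID_NUM_MASK;

	/* The quoted enum defines no value above LVL3 (four checksums). */
	assert(n <= XDP_CHECKSUM_VALID_LVL3);
	return n;
}

/* Hypothetical helper: is the checksum at 0-based nesting depth covered? */
static inline bool xdp_csum_covers(uint32_t status, unsigned int depth)
{
	return depth < xdp_csum_valid_count(status);
}
```

With these definitions, a status of XDP_CHECKSUM_VALID_LVL1 covers depths 0 and 1 (for example an outer UDP and an inner TCP checksum) and nothing deeper.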
On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> Alexei Starovoitov wrote:
> > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> > >
> > > +union xdp_csum_info {
> > > +	/* Checksum referred to by ``csum_start + csum_offset`` is considered
> > > +	 * valid, but was never calculated, TX device has to do this,
> > > +	 * starting from csum_start packet byte.
> > > +	 * Any preceding checksums are also considered valid.
> > > +	 * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > > +	 */
> > > +	struct {
> > > +		u16 csum_start;
> > > +		u16 csum_offset;
> > > +	};
> > > +
> >
> > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.
>
> It can be observed on RX when packets are looped.
>
> This may be observed even in XDP on veth.

veth and XDP is a broken combination. GSO packets coming out of containers
cannot be parsed properly by XDP.
It was added mainly for testing. Just like "generic XDP".
bpf progs at skb layer is much better fit for veth.

> > > +	/* Checksum, calculated over the whole packet.
> > > +	 * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> > > +	 */
> > > +	u32 checksum;
> >
> > imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
> > or XDP_CHECKSUM_UNNECESSARY.
> >
> > > +};
> > > +
> > > +enum xdp_csum_status {
> > > +	/* HW had parsed several transport headers and validated their
> > > +	 * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> > > +	 * 3 least significant bytes contain number of consecutive checksums,
> > > +	 * starting with the outermost, reported by hardware as valid.
> > > +	 * ``sk_buff`` checksum level (``csum_level``) notation is provided
> > > +	 * for driver developers.
> > > +	 */
> > > +	XDP_CHECKSUM_VALID_LVL0 = 1, /* 1 outermost checksum */
> > > +	XDP_CHECKSUM_VALID_LVL1 = 2, /* 2 outermost checksums */
> > > +	XDP_CHECKSUM_VALID_LVL2 = 3, /* 3 outermost checksums */
> > > +	XDP_CHECKSUM_VALID_LVL3 = 4, /* 4 outermost checksums */
> > > +	XDP_CHECKSUM_VALID_NUM_MASK = GENMASK(2, 0),
> > > +	XDP_CHECKSUM_VALID = XDP_CHECKSUM_VALID_NUM_MASK,
> >
> > I don't see what bpf prog suppose to do with these levels.
> > The driver should pick between 3:
> > XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.
> >
> > No levels and no anything partial. please.
>
> This levels business is an unfortunate side effect of
> CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> does the boolean actually mean? With these levels, at least that is
> well defined: the first N checksum fields.

If I understand this correctly this is intel specific feature that
other NICs don't have. skb layer also doesn't have such concept.
The driver should say CHECKSUM_UNNECESSARY when it's sure
or don't pretend that it checks the checksum and just say NONE.
Alexei Starovoitov wrote:
> On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Alexei Starovoitov wrote:
> > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> > > >
> > > > +union xdp_csum_info {
> > > > +	/* Checksum referred to by ``csum_start + csum_offset`` is considered
> > > > +	 * valid, but was never calculated, TX device has to do this,
> > > > +	 * starting from csum_start packet byte.
> > > > +	 * Any preceding checksums are also considered valid.
> > > > +	 * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > > > +	 */
> > > > +	struct {
> > > > +		u16 csum_start;
> > > > +		u16 csum_offset;
> > > > +	};
> > > > +
> > >
> > > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.
> >
> > It can be observed on RX when packets are looped.
> >
> > This may be observed even in XDP on veth.
>
> veth and XDP is a broken combination. GSO packets coming out of containers
> cannot be parsed properly by XDP.
> It was added mainly for testing. Just like "generic XDP".
> bpf progs at skb layer is much better fit for veth.

Ok. Still, seems forward looking and little cost to define the
constant?

> > > > +	/* Checksum, calculated over the whole packet.
> > > > +	 * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> > > > +	 */
> > > > +	u32 checksum;
> > >
> > > imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
> > > or XDP_CHECKSUM_UNNECESSARY.
> > >
> > > > +};
> > > > +
> > > > +enum xdp_csum_status {
> > > > +	/* HW had parsed several transport headers and validated their
> > > > +	 * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> > > > +	 * 3 least significant bytes contain number of consecutive checksums,
> > > > +	 * starting with the outermost, reported by hardware as valid.
> > > > +	 * ``sk_buff`` checksum level (``csum_level``) notation is provided
> > > > +	 * for driver developers.
> > > > +	 */
> > > > +	XDP_CHECKSUM_VALID_LVL0 = 1, /* 1 outermost checksum */
> > > > +	XDP_CHECKSUM_VALID_LVL1 = 2, /* 2 outermost checksums */
> > > > +	XDP_CHECKSUM_VALID_LVL2 = 3, /* 3 outermost checksums */
> > > > +	XDP_CHECKSUM_VALID_LVL3 = 4, /* 4 outermost checksums */
> > > > +	XDP_CHECKSUM_VALID_NUM_MASK = GENMASK(2, 0),
> > > > +	XDP_CHECKSUM_VALID = XDP_CHECKSUM_VALID_NUM_MASK,
> > >
> > > I don't see what bpf prog suppose to do with these levels.
> > > The driver should pick between 3:
> > > XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.
> > >
> > > No levels and no anything partial. please.
> >
> > This levels business is an unfortunate side effect of
> > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> > does the boolean actually mean? With these levels, at least that is
> > well defined: the first N checksum fields.
>
> If I understand this correctly this is intel specific feature that
> other NICs don't have. skb layer also doesn't have such concept.
> The driver should say CHECKSUM_UNNECESSARY when it's sure
> or don't pretend that it checks the checksum and just say NONE.

I did not know how much this was used, but quick grep for non constant
csum_level shows devices from at least six vendors.
On Sun, Jul 30, 2023 at 09:13:02AM -0400, Willem de Bruijn wrote:
> Alexei Starovoitov wrote:
> > On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > Alexei Starovoitov wrote:
> > > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> > > > >
> > > > > +union xdp_csum_info {
> > > > > +	/* Checksum referred to by ``csum_start + csum_offset`` is considered
> > > > > +	 * valid, but was never calculated, TX device has to do this,
> > > > > +	 * starting from csum_start packet byte.
> > > > > +	 * Any preceding checksums are also considered valid.
> > > > > +	 * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > > > > +	 */
> > > > > +	struct {
> > > > > +		u16 csum_start;
> > > > > +		u16 csum_offset;
> > > > > +	};
> > > > > +
> > > >
> > > > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.
> > >
> > > It can be observed on RX when packets are looped.
> > >
> > > This may be observed even in XDP on veth.
> >
> > veth and XDP is a broken combination. GSO packets coming out of containers
> > cannot be parsed properly by XDP.
> > It was added mainly for testing. Just like "generic XDP".
> > bpf progs at skb layer is much better fit for veth.
>
> Ok. Still, seems forward looking and little cost to define the
> constant?
>

+1
CHECKSUM_PARTIAL is mostly for testing and removing/adding it doesn't change
anything from the perspective of the user that does not use it, so I think it is
worth having.

> > > > > +	/* Checksum, calculated over the whole packet.
> > > > > +	 * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> > > > > +	 */
> > > > > +	u32 checksum;
> > > >
> > > > imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
> > > > or XDP_CHECKSUM_UNNECESSARY.
> > > >
> > > > > +};
> > > > > +
> > > > > +enum xdp_csum_status {
> > > > > +	/* HW had parsed several transport headers and validated their
> > > > > +	 * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> > > > > +	 * 3 least significant bytes contain number of consecutive checksums,
> > > > > +	 * starting with the outermost, reported by hardware as valid.
> > > > > +	 * ``sk_buff`` checksum level (``csum_level``) notation is provided
> > > > > +	 * for driver developers.
> > > > > +	 */
> > > > > +	XDP_CHECKSUM_VALID_LVL0 = 1, /* 1 outermost checksum */
> > > > > +	XDP_CHECKSUM_VALID_LVL1 = 2, /* 2 outermost checksums */
> > > > > +	XDP_CHECKSUM_VALID_LVL2 = 3, /* 3 outermost checksums */
> > > > > +	XDP_CHECKSUM_VALID_LVL3 = 4, /* 4 outermost checksums */
> > > > > +	XDP_CHECKSUM_VALID_NUM_MASK = GENMASK(2, 0),
> > > > > +	XDP_CHECKSUM_VALID = XDP_CHECKSUM_VALID_NUM_MASK,
> > > >
> > > > I don't see what bpf prog suppose to do with these levels.
> > > > The driver should pick between 3:
> > > > XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.
> > > >
> > > > No levels and no anything partial. please.
> > >
> > > This levels business is an unfortunate side effect of
> > > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> > > does the boolean actually mean? With these levels, at least that is
> > > well defined: the first N checksum fields.
> >
> > If I understand this correctly this is intel specific feature that
> > other NICs don't have. skb layer also doesn't have such concept.

Please look into csum_level field in sk_buff. It is not the most used property
in the kernel networking code, but it is certainly 1. used by networking stack
2. set to non-zero value by many vendors.

So you do not need to search yourself, I'll copy-paste the docs for
CHECKSUM_UNNECESSARY here:

* %CHECKSUM_UNNECESSARY is applicable to following protocols:
*
* - TCP: IPv6 and IPv4.
* - UDP: IPv4 and IPv6. A device may apply CHECKSUM_UNNECESSARY to a
*   zero UDP checksum for either IPv4 or IPv6, the networking stack
*   may perform further validation in this case.
* - GRE: only if the checksum is present in the header.
* - SCTP: indicates the CRC in SCTP header has been validated.
* - FCOE: indicates the CRC in FC frame has been validated.
*

Please, look at this:

* &sk_buff.csum_level indicates the number of consecutive checksums found in
* the packet minus one that have been verified as %CHECKSUM_UNNECESSARY.
* For instance if a device receives an IPv6->UDP->GRE->IPv4->TCP packet
* and a device is able to verify the checksums for UDP (possibly zero),
* GRE (checksum flag is set) and TCP, &sk_buff.csum_level would be set to
* two. If the device were only able to verify the UDP checksum and not
* GRE, either because it doesn't support GRE checksum or because GRE
* checksum is bad, skb->csum_level would be set to zero (TCP checksum is
* not considered in this case).

From:
https://elixir.bootlin.com/linux/v6.5-rc4/source/include/linux/skbuff.h#L115

> > The driver should say CHECKSUM_UNNECESSARY when it's sure
> > or don't pretend that it checks the checksum and just say NONE.
>

Well, in such case, most of the NICs that use CHECKSUM_UNNECESSARY would have to
return CHECKSUM_NONE instead, because based on my quick search, they mostly
return checksum level of 0 (no tunneling detected) or 1 (tunneling detected),
so they only parse headers up to a certain depth, meaning it's not possible
to tell whether there isn't another CHECKSUM_UNNECESSARY-eligible header hiding
in the payload, so those NIC cannot guarantee ALL the checksums present in the
packet are correct. So, by your logic, we should make e.g. AF_XDP user re-check
already verified checksums themselves, because HW "doesn't pretend that it
checks the checksum and just says NONE".

> I did not know how much this was used, but quick grep for non constant
> csum_level shows devices from at least six vendors.

Yes, there are several vendors that set the csum_level, including broadcom
(bnxt) and mellanox (mlx4 and mlx5).

Also, CHECKSUM_UNNECESSARY is found in 100+ drivers/net/ethernet files,
while csum_level is in like 20, which means overwhelming majority of
CHECKSUM_UNNECESSARY NICs actually stay with the default checksum level of '0'
(they check only the outermost checksum - anything else needs to be verified by
the networking stack).
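The csum_level rule quoted from skbuff.h above can be sketched as follows. This is an illustration only, not kernel code: the type and function names are hypothetical, and the cap of 3 is an assumption meant to mirror the small width of the real csum_level field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the skb ip_summed states discussed in the thread. */
enum rx_csum_state { RX_CSUM_NONE, RX_CSUM_UNNECESSARY };

struct rx_csum_result {
	enum rx_csum_state state;
	unsigned int csum_level; /* consecutive verified checksums minus one */
};

/* Assumed cap on the reported level (illustrative constant). */
#define MAX_CSUM_LEVEL 3

/*
 * verified[i] says whether the device validated the i-th checksum found in
 * the packet, outermost first. Only the leading run of verified checksums
 * counts; anything after the first unverified one is ignored.
 */
static struct rx_csum_result rx_csum_from_verified(const bool *verified, size_t n)
{
	struct rx_csum_result res = { RX_CSUM_NONE, 0 };
	size_t run = 0;

	assert(verified != NULL || n == 0);

	while (run < n && run <= MAX_CSUM_LEVEL && verified[run])
		run++;

	if (run > 0) {
		res.state = RX_CSUM_UNNECESSARY;
		res.csum_level = (unsigned int)(run - 1);
	}
	return res;
}
```

For the IPv6->UDP->GRE->IPv4->TCP example in the quoted documentation, the input {UDP ok, GRE ok, TCP ok} yields csum_level 2, while {UDP ok, GRE not verified, TCP ok} yields csum_level 0, because the TCP result is not considered once GRE breaks the run.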
On Sun, 30 Jul 2023 09:13:02 -0400 Willem de Bruijn wrote:
> > > This levels business is an unfortunate side effect of
> > > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> > > does the boolean actually mean? With these levels, at least that is
> > > well defined: the first N checksum fields.
> >
> > If I understand this correctly this is intel specific feature that
> > other NICs don't have. skb layer also doesn't have such concept.
> > The driver should say CHECKSUM_UNNECESSARY when it's sure
> > or don't pretend that it checks the checksum and just say NONE.
>
> I did not know how much this was used, but quick grep for non constant
> csum_level shows devices from at least six vendors.

I thought it was a legacy thing from early VxLAN days.
We used to leave outer UDP csum as 0 before LCO, and therefore couldn't
convert outer to COMPLETE, so inner could not be offloaded/validated.
Should not be all that relevant today.
On Mon, Jul 31, 2023 at 3:56 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
>
> On Sun, Jul 30, 2023 at 09:13:02AM -0400, Willem de Bruijn wrote:
> > Alexei Starovoitov wrote:
> > > On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn
> > > <willemdebruijn.kernel@gmail.com> wrote:
> > > >
> > > > Alexei Starovoitov wrote:
> > > > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> > > > > >
> > > > > > +union xdp_csum_info {
> > > > > > +	/* Checksum referred to by ``csum_start + csum_offset`` is considered
> > > > > > +	 * valid, but was never calculated, TX device has to do this,
> > > > > > +	 * starting from csum_start packet byte.
> > > > > > +	 * Any preceding checksums are also considered valid.
> > > > > > +	 * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > > > > > +	 */
> > > > > > +	struct {
> > > > > > +		u16 csum_start;
> > > > > > +		u16 csum_offset;
> > > > > > +	};
> > > > > > +
> > > > >
> > > > > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.
> > > >
> > > > It can be observed on RX when packets are looped.
> > > >
> > > > This may be observed even in XDP on veth.
> > >
> > > veth and XDP is a broken combination. GSO packets coming out of containers
> > > cannot be parsed properly by XDP.
> > > It was added mainly for testing. Just like "generic XDP".
> > > bpf progs at skb layer is much better fit for veth.
> >
> > Ok. Still, seems forward looking and little cost to define the
> > constant?
> >
>
> +1
> CHECKSUM_PARTIAL is mostly for testing and removing/adding it doesn't change
> anything from the perspective of the user that does not use it, so I think it is
> worth having.

"little cost to define the constant".
Not really. A constant in UAPI is a heavy burden.

> > > > > > +	/* Checksum, calculated over the whole packet.
> > > > > > +	 * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> > > > > > +	 */
> > > > > > +	u32 checksum;
> > > > >
> > > > > imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
> > > > > or XDP_CHECKSUM_UNNECESSARY.
> > > > >
> > > > > > +};
> > > > > > +
> > > > > > +enum xdp_csum_status {
> > > > > > +	/* HW had parsed several transport headers and validated their
> > > > > > +	 * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> > > > > > +	 * 3 least significant bytes contain number of consecutive checksums,
> > > > > > +	 * starting with the outermost, reported by hardware as valid.
> > > > > > +	 * ``sk_buff`` checksum level (``csum_level``) notation is provided
> > > > > > +	 * for driver developers.
> > > > > > +	 */
> > > > > > +	XDP_CHECKSUM_VALID_LVL0 = 1, /* 1 outermost checksum */
> > > > > > +	XDP_CHECKSUM_VALID_LVL1 = 2, /* 2 outermost checksums */
> > > > > > +	XDP_CHECKSUM_VALID_LVL2 = 3, /* 3 outermost checksums */
> > > > > > +	XDP_CHECKSUM_VALID_LVL3 = 4, /* 4 outermost checksums */
> > > > > > +	XDP_CHECKSUM_VALID_NUM_MASK = GENMASK(2, 0),
> > > > > > +	XDP_CHECKSUM_VALID = XDP_CHECKSUM_VALID_NUM_MASK,
> > > > >
> > > > > I don't see what bpf prog suppose to do with these levels.
> > > > > The driver should pick between 3:
> > > > > XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.
> > > > >
> > > > > No levels and no anything partial. please.
> > > >
> > > > This levels business is an unfortunate side effect of
> > > > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> > > > does the boolean actually mean? With these levels, at least that is
> > > > well defined: the first N checksum fields.
> > >
> > > If I understand this correctly this is intel specific feature that
> > > other NICs don't have. skb layer also doesn't have such concept.
>
> Please look into csum_level field in sk_buff. It is not the most used property
> in the kernel networking code, but it is certainly 1. used by networking stack
> 2. set to non-zero value by many vendors.
>
> So you do not need to search yourself, I'll copy-paste the docs for
> CHECKSUM_UNNECESSARY here:
>
> * %CHECKSUM_UNNECESSARY is applicable to following protocols:
> *
> * - TCP: IPv6 and IPv4.
> * - UDP: IPv4 and IPv6. A device may apply CHECKSUM_UNNECESSARY to a
> *   zero UDP checksum for either IPv4 or IPv6, the networking stack
> *   may perform further validation in this case.
> * - GRE: only if the checksum is present in the header.
> * - SCTP: indicates the CRC in SCTP header has been validated.
> * - FCOE: indicates the CRC in FC frame has been validated.
> *
>
> Please, look at this:
>
> * &sk_buff.csum_level indicates the number of consecutive checksums found in
> * the packet minus one that have been verified as %CHECKSUM_UNNECESSARY.
> * For instance if a device receives an IPv6->UDP->GRE->IPv4->TCP packet
> * and a device is able to verify the checksums for UDP (possibly zero),
> * GRE (checksum flag is set) and TCP, &sk_buff.csum_level would be set to
> * two. If the device were only able to verify the UDP checksum and not
> * GRE, either because it doesn't support GRE checksum or because GRE
> * checksum is bad, skb->csum_level would be set to zero (TCP checksum is
> * not considered in this case).
>
> From:
> https://elixir.bootlin.com/linux/v6.5-rc4/source/include/linux/skbuff.h#L115
>
> > > The driver should say CHECKSUM_UNNECESSARY when it's sure
> > > or don't pretend that it checks the checksum and just say NONE.
> >
>
> Well, in such case, most of the NICs that use CHECKSUM_UNNECESSARY would have to
> return CHECKSUM_NONE instead, because based on my quick search, they mostly
> return checksum level of 0 (no tunneling detected) or 1 (tunneling detected),
> so they only parse headers up to a certain depth, meaning it's not possible
> to tell whether there isn't another CHECKSUM_UNNECESSARY-eligible header hiding
> in the payload, so those NIC cannot guarantee ALL the checksums present in the
> packet are correct. So, by your logic, we should make e.g. AF_XDP user re-check
> already verified checksums themselves, because HW "doesn't pretend that it
> checks the checksum and just says NONE".
>
> > I did not know how much this was used, but quick grep for non constant
> > csum_level shows devices from at least six vendors.
>
> Yes, there are several vendors that set the csum_level, including broadcom
> (bnxt) and mellanox (mlx4 and mlx5).
>
> Also, CHECKSUM_UNNECESSARY is found in 100+ drivers/net/ethernet files,
> while csum_level is in like 20, which means overwhelming majority of
> CHECKSUM_UNNECESSARY NICs actually stay with the default checksum level of '0'
> (they check only the outermost checksum - anything else needs to be verified by
> the networking stack).

No. What I'm saying is that XDP_CHECKSUM_UNNECESSARY should be
equivalent to skb's CHECKSUM_UNNECESSARY with csum_level = 0.
I'm well aware that some drivers are trying to be smart and put csum_level=1.
There is no use case for it in XDP.
"But our HW supports it so XDP prog should read it" is the reason NOT
to expose it to bpf in generic api.

Either we're doing per-driver kfuncs and no common infra or common kfunc
that covers 99% of the drivers. Which is CHECKSUM_UNNECESSARY && csum_level = 0

It's not acceptable to present a generic api to xdp prog with multi level
csum that only works on a specific HW. Next thing there will be new flags
and MAX_CSUM_LEVEL in XDP features.
Pretending to be generic while being HW specific is the worst interface.
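The three-state interface argued for above can be sketched from the driver side. Everything below is hypothetical: the struct layouts, the enum values, and the choice to prefer COMPLETE when both kinds of information are available are assumptions for illustration, not the API that was merged or proposed in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The three states named in the review; numeric values are assumed. */
enum xdp_csum_status {
	XDP_CHECKSUM_NONE,
	XDP_CHECKSUM_UNNECESSARY, /* like CHECKSUM_UNNECESSARY with csum_level = 0 */
	XDP_CHECKSUM_COMPLETE,
};

/* Hypothetical summary of what a device reported for one RX packet. */
struct hw_rx_csum {
	bool has_complete;        /* device summed the whole packet */
	uint32_t complete;        /* that sum, valid if has_complete */
	unsigned int valid_count; /* outermost checksums validated by HW */
};

struct xdp_rx_csum {
	enum xdp_csum_status status;
	uint32_t checksum;        /* meaningful only for XDP_CHECKSUM_COMPLETE */
};

/*
 * Collapse the device report into one of three states. Any validation
 * beyond the outermost checksum is deliberately not exposed.
 */
static struct xdp_rx_csum xdp_rx_csum_collapse(const struct hw_rx_csum *hw)
{
	struct xdp_rx_csum out = { XDP_CHECKSUM_NONE, 0 };

	assert(hw != NULL);

	if (hw->has_complete) {
		/* Assumed priority: a full-packet sum wins when present. */
		out.status = XDP_CHECKSUM_COMPLETE;
		out.checksum = hw->complete;
	} else if (hw->valid_count >= 1) {
		out.status = XDP_CHECKSUM_UNNECESSARY;
	}
	return out;
}
```

Under this sketch a device that validated two nested checksums and one that validated only the outermost both report XDP_CHECKSUM_UNNECESSARY, which is the generic behaviour the review asks for.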
Alexei Starovoitov wrote: > On Mon, Jul 31, 2023 at 3:56 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote: > > > > On Sun, Jul 30, 2023 at 09:13:02AM -0400, Willem de Bruijn wrote: > > > Alexei Starovoitov wrote: > > > > On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn > > > > <willemdebruijn.kernel@gmail.com> wrote: > > > > > > > > > > Alexei Starovoitov wrote: > > > > > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote: > > > > > > > > > > > > > > +union xdp_csum_info { > > > > > > > + /* Checksum referred to by ``csum_start + csum_offset`` is considered > > > > > > > + * valid, but was never calculated, TX device has to do this, > > > > > > > + * starting from csum_start packet byte. > > > > > > > + * Any preceding checksums are also considered valid. > > > > > > > + * Available, if ``status == XDP_CHECKSUM_PARTIAL``. > > > > > > > + */ > > > > > > > + struct { > > > > > > > + u16 csum_start; > > > > > > > + u16 csum_offset; > > > > > > > + }; > > > > > > > + > > > > > > > > > > > > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above. > > > > > > > > > > It can be observed on RX when packets are looped. > > > > > > > > > > This may be observed even in XDP on veth. > > > > > > > > veth and XDP is a broken combination. GSO packets coming out of containers > > > > cannot be parsed properly by XDP. > > > > It was added mainly for testing. Just like "generic XDP". > > > > bpf progs at skb layer is much better fit for veth. > > > > > > Ok. Still, seems forward looking and little cost to define the > > > constant? > > > > > > > +1 > > CHECKSUM_PARTIAL is mostly for testing and removing/adding it doesn't change > > anything from the perspective of the user that does not use it, so I think it is > > worth having. > > "little cost to define the constant". > Not really. A constant in UAPI is a heavy burden. > > > > > > > > + /* Checksum, calculated over the whole packet. 
> > > > > > > + * Available, if ``status & XDP_CHECKSUM_COMPLETE``. > > > > > > > + */ > > > > > > > + u32 checksum; > > > > > > > > > > > > imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum > > > > > > or XDP_CHECKSUM_UNNECESSARY. > > > > > > > > > > > > > +}; > > > > > > > + > > > > > > > +enum xdp_csum_status { > > > > > > > + /* HW had parsed several transport headers and validated their > > > > > > > + * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``. > > > > > > > + * 3 least significant bytes contain number of consecutive checksums, > > > > > > > + * starting with the outermost, reported by hardware as valid. > > > > > > > + * ``sk_buff`` checksum level (``csum_level``) notation is provided > > > > > > > + * for driver developers. > > > > > > > + */ > > > > > > > + XDP_CHECKSUM_VALID_LVL0 = 1, /* 1 outermost checksum */ > > > > > > > + XDP_CHECKSUM_VALID_LVL1 = 2, /* 2 outermost checksums */ > > > > > > > + XDP_CHECKSUM_VALID_LVL2 = 3, /* 3 outermost checksums */ > > > > > > > + XDP_CHECKSUM_VALID_LVL3 = 4, /* 4 outermost checksums */ > > > > > > > + XDP_CHECKSUM_VALID_NUM_MASK = GENMASK(2, 0), > > > > > > > + XDP_CHECKSUM_VALID = XDP_CHECKSUM_VALID_NUM_MASK, > > > > > > > > > > > > I don't see what bpf prog suppose to do with these levels. > > > > > > The driver should pick between 3: > > > > > > XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE. > > > > > > > > > > > > No levels and no anything partial. please. > > > > > > > > > > This levels business is an unfortunate side effect of > > > > > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what > > > > > does the boolean actually mean? With these levels, at least that is > > > > > well defined: the first N checksum fields. > > > > > > > > If I understand this correctly this is intel specific feature that > > > > other NICs don't have. skb layer also doesn't have such concept. > > > > Please look into csum_level field in sk_buff. 
It is not the most used property > > in the kernel networking code, but it is certainly 1. used by networking stack > > 2. set to non-zero value by many vendors. > > > > So you do not need to search yourself, I'll copy-paste the docs for > > CHECKSUM_UNNECESSARY here: > > > > * %CHECKSUM_UNNECESSARY is applicable to following protocols: > > * > > * - TCP: IPv6 and IPv4. > > * - UDP: IPv4 and IPv6. A device may apply CHECKSUM_UNNECESSARY to a > > * zero UDP checksum for either IPv4 or IPv6, the networking stack > > * may perform further validation in this case. > > * - GRE: only if the checksum is present in the header. > > * - SCTP: indicates the CRC in SCTP header has been validated. > > * - FCOE: indicates the CRC in FC frame has been validated. > > * > > > > Please, look at this: > > > > * &sk_buff.csum_level indicates the number of consecutive checksums found in > > * the packet minus one that have been verified as %CHECKSUM_UNNECESSARY. > > * For instance if a device receives an IPv6->UDP->GRE->IPv4->TCP packet > > * and a device is able to verify the checksums for UDP (possibly zero), > > * GRE (checksum flag is set) and TCP, &sk_buff.csum_level would be set to > > * two. If the device were only able to verify the UDP checksum and not > > * GRE, either because it doesn't support GRE checksum or because GRE > > * checksum is bad, skb->csum_level would be set to zero (TCP checksum is > > * not considered in this case). > > > > From: > > https://elixir.bootlin.com/linux/v6.5-rc4/source/include/linux/skbuff.h#L115 > > > > > > The driver should say CHECKSUM_UNNECESSARY when it's sure > > > > or don't pretend that it checks the checksum and just say NONE. 
> > > > > > > Well, in such case, most of the NICs that use CHECKSUM_UNNECESSARY would have to > > return CHECKSUM_NONE instead, because based on my quick search, they mostly > > return checksum level of 0 (no tunneling detected) or 1 (tunneling detected), > > so they only parse headers up to a certain depth, meaning it's not possible > > to tell whether there isn't another CHECKSUM_UNNECESSARY-eligible header hiding > > in the payload, so those NIC cannot guarantee ALL the checksums present in the > > packet are correct. So, by your logic, we should make e.g. AF_XDP user re-check > > already verified checksums themselves, because HW "doesn't pretend that it > > checks the checksum and just says NONE". > > > > > I did not know how much this was used, but quick grep for non constant > > > csum_level shows devices from at least six vendors. > > > > Yes, there are several vendors that set the csum_level, including broadcom > > (bnxt) and mellanox (mlx4 and mlx5). > > > > Also, CHECKSUM_UNNECESSARY is found in 100+ drivers/net/ethernet files, > > while csum_level is in like 20, which means overwhelming majority of > > CHECKSUM_UNNECESSARY NICs actually stay with the default checksum level of '0' > > (they check only the outermost checksum - anything else needs to be verified by > > the networking stack). > > No. What I'm saying is that XDP_CHECKSUM_UNNECESSARY should be > equivalent to skb's CHECKSUM_UNNECESSARY with csum_level = 0. > I'm well aware that some drivers are trying to be smart and put csum_level=1. > There is no use case for it in XDP. > "But our HW supports it so XDP prog should read it" is the reason NOT > to expose it to bpf in generic api. > > Either we're doing per-driver kfuncs and no common infra or common kfunc > that covers 99% of the drivers. Which is CHECKSUM_UNNECESSARY && csum_level = 0 > > It's not acceptable to present a generic api to xdp prog with multi level > csum that only works on a specific HW. 
Next thing there will be new flags
> and MAX_CSUM_LEVEL in XDP features.
> Pretending to be generic while being HW specific is the worst interface.

Ok. Agreed that without it we still cover 99% of the use cases. Fine to drop.
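For reference, the csum_level rule from the skbuff.h comment quoted above can be sketched in plain C. This is only an illustration of the documented counting rule, not kernel code: the function name and the boolean-array input are hypothetical, and the array is assumed to list only the headers that carry a CHECKSUM_UNNECESSARY-eligible checksum (UDP, GRE, TCP in the documented example), outermost first.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not a kernel API.
 * verified[i] says whether the device validated the i-th checksum-bearing
 * header, outermost first. Returns the csum_level (number of consecutive
 * verified checksums minus one), or -1 when not even the outermost one was
 * verified, i.e. the packet would be reported as CHECKSUM_NONE.
 */
static int csum_level_from_verified(const bool *verified, size_t n)
{
	size_t consecutive = 0;

	/* Stop at the first checksum the device could not verify. */
	while (consecutive < n && verified[consecutive])
		consecutive++;

	return consecutive ? (int)consecutive - 1 : -1;
}
```

For the IPv6->UDP->GRE->IPv4->TCP packet from the documentation, {UDP ok, GRE ok, TCP ok} gives level two, while {UDP ok, GRE not verified, TCP ok} gives level zero, because the TCP checksum behind the unverified GRE one is not counted.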
On Wed, Aug 02, 2023 at 09:27:27AM -0400, Willem de Bruijn wrote:
> > No. What I'm saying is that XDP_CHECKSUM_UNNECESSARY should be
> > equivalent to skb's CHECKSUM_UNNECESSARY with csum_level = 0.
> > I'm well aware that some drivers are trying to be smart and put csum_level=1.
> > There is no use case for it in XDP.
> > "But our HW supports it so XDP prog should read it" is the reason NOT
> > to expose it to bpf in generic api.
> >
> > Either we're doing per-driver kfuncs and no common infra or common kfunc
> > that covers 99% of the drivers. Which is CHECKSUM_UNNECESSARY && csum_level = 0
> >
> > It's not acceptable to present a generic api to xdp prog with multi level
> > csum that only works on a specific HW. Next thing there will be new flags
> > and MAX_CSUM_LEVEL in XDP features.
> > Pretending to be generic while being HW specific is the worst interface.
>
> Ok. Agreed that without it we still cover 99% of the use cases. Fine to drop.

Sorry for the late response.

Thanks everyone for the feedback, will drop the checksum level concept from the design.
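To make the level-free variant concrete, here is a rough sketch of how a driver-side result might collapse into the three states named in the review (XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE). Everything except those three names is an assumption made for illustration: the enum values, the hw_rx_csum type, and the function itself are hypothetical and do not come from any posted revision. Reporting a looped partial checksum as NONE is likewise only one possible reading of the thread.

```c
#include <stdint.h>

/* Hypothetical description of what the hardware/driver knows. */
enum hw_rx_csum {
	HW_RX_CSUM_NONE,
	HW_RX_CSUM_UNNECESSARY,	/* one or more checksums validated */
	HW_RX_CSUM_COMPLETE,	/* full-packet checksum available */
	HW_RX_CSUM_PARTIAL,	/* looped packet, checksum not yet computed */
};

/* Three-state status as proposed in review; the values are made up here. */
enum xdp_csum_status3 {
	XDP_CHECKSUM_NONE,
	XDP_CHECKSUM_UNNECESSARY,
	XDP_CHECKSUM_COMPLETE,
};

static enum xdp_csum_status3 xdp_csum_from_hw(enum hw_rx_csum hw,
					      unsigned int hw_level,
					      uint32_t hw_csum,
					      uint32_t *checksum)
{
	/* The level is deliberately dropped: UNNECESSARY only promises
	 * the outermost checksum, like csum_level = 0 in sk_buff.
	 */
	(void)hw_level;

	switch (hw) {
	case HW_RX_CSUM_COMPLETE:
		*checksum = hw_csum;
		return XDP_CHECKSUM_COMPLETE;
	case HW_RX_CSUM_UNNECESSARY:
		return XDP_CHECKSUM_UNNECESSARY;
	case HW_RX_CSUM_PARTIAL:	/* assumption: reported as NONE */
	case HW_RX_CSUM_NONE:
	default:
		return XDP_CHECKSUM_NONE;
	}
}
```

With this shape the second kfunc argument could be a plain u32 checksum that is only written in the COMPLETE case.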
On Mon, Jul 31, 2023 at 09:43:22AM -0700, Jakub Kicinski wrote:
> On Sun, 30 Jul 2023 09:13:02 -0400 Willem de Bruijn wrote:
> > > > This levels business is an unfortunate side effect of
> > > > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> > > > does the boolean actually mean? With these levels, at least that is
> > > > well defined: the first N checksum fields.
> > >
> > > If I understand this correctly this is intel specific feature that
> > > other NICs don't have. skb layer also doesn't have such concept.
> > > The driver should say CHECKSUM_UNNECESSARY when it's sure
> > > or don't pretend that it checks the checksum and just say NONE.
> >
> > I did not know how much this was used, but quick grep for non constant
> > csum_level shows devices from at least six vendors.
>
> I thought it was a legacy thing from early VxLAN days.
> We used to leave outer UDP csum as 0 before LCO, and therefore couldn't
> convert outer to COMPLETE, so inner could not be offloaded/validated.
> Should not be all that relevant today.

Sorry for the delayed response.

Thanks a lot for this feedback, it became a gateway to deepen my understanding
of checksumming in kernel pretty significantly.
On Mon, Jul 31, 2023 at 06:03:26PM -0700, Alexei Starovoitov wrote: > On Mon, Jul 31, 2023 at 3:56 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote: > > > > On Sun, Jul 30, 2023 at 09:13:02AM -0400, Willem de Bruijn wrote: > > > Alexei Starovoitov wrote: > > > > On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn > > > > <willemdebruijn.kernel@gmail.com> wrote: > > > > > > > > > > Alexei Starovoitov wrote: > > > > > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote: > > > > > > > > > > > > > > +union xdp_csum_info { > > > > > > > + /* Checksum referred to by ``csum_start + csum_offset`` is considered > > > > > > > + * valid, but was never calculated, TX device has to do this, > > > > > > > + * starting from csum_start packet byte. > > > > > > > + * Any preceding checksums are also considered valid. > > > > > > > + * Available, if ``status == XDP_CHECKSUM_PARTIAL``. > > > > > > > + */ > > > > > > > + struct { > > > > > > > + u16 csum_start; > > > > > > > + u16 csum_offset; > > > > > > > + }; > > > > > > > + > > > > > > > > > > > > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above. > > > > > > > > > > It can be observed on RX when packets are looped. > > > > > > > > > > This may be observed even in XDP on veth. > > > > > > > > veth and XDP is a broken combination. GSO packets coming out of containers > > > > cannot be parsed properly by XDP. > > > > It was added mainly for testing. Just like "generic XDP". > > > > bpf progs at skb layer is much better fit for veth. > > > > > > Ok. Still, seems forward looking and little cost to define the > > > constant? > > > > > > > +1 > > CHECKSUM_PARTIAL is mostly for testing and removing/adding it doesn't change > > anything from the perspective of the user that does not use it, so I think it is > > worth having. > > "little cost to define the constant". > Not really. A constant in UAPI is a heavy burden. Sorry for the delayed response. 
I still do not comprehend the problem fully for this particular case,
considering it shouldn't block any future changes to the API by itself.

But, I personally have no reason to push hard the veth-supporting changes
(aside from wanting the tests to look nicer).

Still, before removing this in v5, I would like to get some additional feedback
on this, preferably from Jesper (who, if I remember correctly, takes an interest
in XDP on veth) or Stanislav.

If instead of union xdp_csum_info we will have just checksum as a second
argument, there will be no going back for this particular kfunc, so I want to be
sure nobody will ever need such feature.

[...]
On 08/07, Larysa Zaremba wrote: > On Mon, Jul 31, 2023 at 06:03:26PM -0700, Alexei Starovoitov wrote: > > On Mon, Jul 31, 2023 at 3:56 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote: > > > > > > On Sun, Jul 30, 2023 at 09:13:02AM -0400, Willem de Bruijn wrote: > > > > Alexei Starovoitov wrote: > > > > > On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn > > > > > <willemdebruijn.kernel@gmail.com> wrote: > > > > > > > > > > > > Alexei Starovoitov wrote: > > > > > > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote: > > > > > > > > > > > > > > > > +union xdp_csum_info { > > > > > > > > + /* Checksum referred to by ``csum_start + csum_offset`` is considered > > > > > > > > + * valid, but was never calculated, TX device has to do this, > > > > > > > > + * starting from csum_start packet byte. > > > > > > > > + * Any preceding checksums are also considered valid. > > > > > > > > + * Available, if ``status == XDP_CHECKSUM_PARTIAL``. > > > > > > > > + */ > > > > > > > > + struct { > > > > > > > > + u16 csum_start; > > > > > > > > + u16 csum_offset; > > > > > > > > + }; > > > > > > > > + > > > > > > > > > > > > > > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above. > > > > > > > > > > > > It can be observed on RX when packets are looped. > > > > > > > > > > > > This may be observed even in XDP on veth. > > > > > > > > > > veth and XDP is a broken combination. GSO packets coming out of containers > > > > > cannot be parsed properly by XDP. > > > > > It was added mainly for testing. Just like "generic XDP". > > > > > bpf progs at skb layer is much better fit for veth. > > > > > > > > Ok. Still, seems forward looking and little cost to define the > > > > constant? > > > > > > > > > > +1 > > > CHECKSUM_PARTIAL is mostly for testing and removing/adding it doesn't change > > > anything from the perspective of the user that does not use it, so I think it is > > > worth having. > > > > "little cost to define the constant". > > Not really. 
A constant in UAPI is a heavy burden.
>
> Sorry for the delayed response.
>
> I still do not comprehend the problem fully for this particular case,
> considering it shouldn't block any future changes to the API by itself.
>
> But, I personally have no reason to push hard the veth-supporting changes
> (aside from wanting the tests to look nicer).
>
> Still, before removing this in v5, I would like to get some additional feedback
> on this, preferably from Jesper (who, if I remember correctly, takes an interest
> in XDP on veth) or Stanislav.
>
> If instead of union xdp_csum_info we will have just checksum as a second
> argument, there will be no going back for this particular kfunc, so I want to be
> sure nobody will ever need such feature.
>
> [...]

I'm interested in veth only from the testing pov, so if we lose
csum_partial on veth (and it becomes _none?), I don't see any issue
with that.
diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
index ea6dd79a21d3..7f056a44f682 100644
--- a/Documentation/networking/xdp-rx-metadata.rst
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -26,6 +26,9 @@ metadata is supported, this set will grow:
 .. kernel-doc:: net/core/xdp.c
    :identifiers: bpf_xdp_metadata_rx_vlan_tag
 
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_metadata_rx_csum
+
 An XDP program can use these kfuncs to read the metadata into stack
 variables for its own consumption. Or, to pass the metadata on to other
 consumers, an XDP program can store it into the metadata area carried
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 028dcc4fd02d..a950cec76945 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1660,6 +1660,9 @@ struct xdp_metadata_ops {
 			       enum xdp_rss_hash_type *rss_type);
 	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tci,
 				   __be16 *vlan_proto);
+	int	(*xmo_rx_csum)(const struct xdp_md *ctx,
+			       enum xdp_csum_status *csum_status,
+			       union xdp_csum_info *csum_info);
 };
 
 /**
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 89c58f56ffc6..7e6163e5002a 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -391,6 +391,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 			   bpf_xdp_metadata_rx_hash) \
 	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
 			   bpf_xdp_metadata_rx_vlan_tag) \
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_CSUM, \
+			   bpf_xdp_metadata_rx_csum) \
 
 enum {
 #define XDP_METADATA_KFUNC(name, _) name,
@@ -448,6 +450,50 @@ enum xdp_rss_hash_type {
 	XDP_RSS_TYPE_L4_IPV6_SCTP_EX = XDP_RSS_TYPE_L4_IPV6_SCTP | XDP_RSS_L3_DYNHDR,
 };
 
+union xdp_csum_info {
+	/* Checksum referred to by ``csum_start + csum_offset`` is considered
+	 * valid, but was never calculated, TX device has to do this,
+	 * starting from csum_start packet byte.
+	 * Any preceding checksums are also considered valid.
+	 * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
+	 */
+	struct {
+		u16 csum_start;
+		u16 csum_offset;
+	};
+
+	/* Checksum, calculated over the whole packet.
+	 * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
+	 */
+	u32 checksum;
+};
+
+enum xdp_csum_status {
+	/* HW had parsed several transport headers and validated their
+	 * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
+	 * 3 least significant bytes contain number of consecutive checksums,
+	 * starting with the outermost, reported by hardware as valid.
+	 * ``sk_buff`` checksum level (``csum_level``) notation is provided
+	 * for driver developers.
+	 */
+	XDP_CHECKSUM_VALID_LVL0		= 1,	/* 1 outermost checksum */
+	XDP_CHECKSUM_VALID_LVL1		= 2,	/* 2 outermost checksums */
+	XDP_CHECKSUM_VALID_LVL2		= 3,	/* 3 outermost checksums */
+	XDP_CHECKSUM_VALID_LVL3		= 4,	/* 4 outermost checksums */
+	XDP_CHECKSUM_VALID_NUM_MASK	= GENMASK(2, 0),
+	XDP_CHECKSUM_VALID		= XDP_CHECKSUM_VALID_NUM_MASK,
+
+	/* Occurs if packet is sent virtually (between Linux VMs / containers)
+	 * This status cannot coexist with any other.
+	 * Refer to ``csum_start`` and ``csum_offset`` in ``xdp_csum_info``
+	 * for more information.
+	 */
+	XDP_CHECKSUM_PARTIAL	= BIT(3),
+
+	/* Checksum, calculated over the entire packet is provided */
+	XDP_CHECKSUM_COMPLETE	= BIT(4),
+};
+
 #ifdef CONFIG_NET
 u32 bpf_xdp_metadata_kfunc_id(int id);
 bool bpf_dev_bound_kfunc_id(u32 btf_id);
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 986e7becfd42..f60a6add5273 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -850,6 +850,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 		p = ops->xmo_rx_hash;
 	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
 		p = ops->xmo_rx_vlan_tag;
+	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_CSUM))
+		p = ops->xmo_rx_csum;
 out:
 	up_read(&bpf_devs_lock);
 
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 8b55419d332e..d4ea54046afc 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -772,6 +772,29 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
 	return -EOPNOTSUPP;
 }
 
+/**
+ * bpf_xdp_metadata_rx_csum - Get checksum status with additional info.
+ * @ctx: XDP context pointer.
+ * @csum_status: Destination for checksum status.
+ * @csum_info: Destination for complete checksum or partial checksum offset.
+ *
+ * Status (@csum_status) is a bitfield that informs, what checksum
+ * processing was performed. Additional results of such processing,
+ * such as complete checksum or partial checksum offsets,
+ * are passed as info (@csum_info).
+ *
+ * Return:
+ * * Returns 0 on success or ``-errno`` on error.
+ * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
+ * * ``-ENODATA`` : Checksum status is unknown
+ */
+__bpf_kfunc int bpf_xdp_metadata_rx_csum(const struct xdp_md *ctx,
+					 enum xdp_csum_status *csum_status,
+					 union xdp_csum_info *csum_info)
+{
+	return -EOPNOTSUPP;
+}
+
 __diag_pop();
 
 BTF_SET8_START(xdp_metadata_kfunc_ids)
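The comment on csum_start/csum_offset in the hunk above says the checksum "was never calculated, TX device has to do this". As a rough user-space illustration of that completion step (not the kernel's implementation), the sketch below sums the packet from csum_start to the end with ones' complement arithmetic and stores the complemented result at csum_start + csum_offset. The helper names are hypothetical, and it assumes a linear buffer, an even csum_offset, and that the checksum field already holds whatever seed (for example a pseudo-header sum) the sender put there.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum of len bytes, folded to 16 bits (RFC 1071 style). */
static uint16_t csum_fold_sum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)data[i] << 8 | data[i + 1];
	if (len & 1)			/* pad a trailing odd byte with zero */
		sum += (uint32_t)data[len - 1] << 8;
	while (sum >> 16)		/* fold carries back into 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}

/*
 * Hypothetical helper: finish a partial checksum in place.
 * Returns 0 on success, -1 if the offsets fall outside the packet.
 */
static int complete_partial_csum(uint8_t *pkt, size_t len,
				 uint16_t csum_start, uint16_t csum_offset)
{
	size_t field = (size_t)csum_start + csum_offset;
	uint16_t csum;

	if (csum_start >= len || field + 2 > len)
		return -1;

	/* The sum covers the field's current contents, i.e. the seed. */
	csum = (uint16_t)~csum_fold_sum(pkt + csum_start, len - csum_start);
	pkt[field] = csum >> 8;		/* store in network byte order */
	pkt[field + 1] = csum & 0xff;

	return 0;
}
```

If the field was seeded with zero, re-summing the same region after completion folds to 0xffff, which is the usual "checksum verifies" result.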
Implement functionality that enables drivers to expose to XDP code checksum
information that consists of:

- Checksum status - bitfield that consists of
  - number of consecutive validated checksums. This is almost the same as
    csum_level in skb, but starts with 1. Enum names for those bits still
    use checksum level concept, so it is less confusing for driver
    developers.
  - Is checksum partial? This bit cannot coexist with any other
  - Is there a complete checksum available?
- Additional checksum data, a union of:
  - checksum start and offset, if checksum is partial
  - complete checksum, if available

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 Documentation/networking/xdp-rx-metadata.rst |  3 ++
 include/linux/netdevice.h                    |  3 ++
 include/net/xdp.h                            | 46 ++++++++++++++++++++
 kernel/bpf/offload.c                         |  2 +
 net/core/xdp.c                               | 23 ++++++++++
 5 files changed, 77 insertions(+)
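As an illustration of the status bitfield described in this commit message, the following stand-alone C sketch decodes a v4-style status word. The enum values are copied from the patch with GENMASK(2, 0) and BIT(n) expanded by hand; the helper functions are hypothetical and exist only to show how a consumer might read the field. Note that this layout is the one under review and is expected to change in later revisions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as in the v4 patch, with kernel macros expanded manually. */
enum xdp_csum_status {
	XDP_CHECKSUM_VALID_LVL0		= 1,
	XDP_CHECKSUM_VALID_LVL1		= 2,
	XDP_CHECKSUM_VALID_LVL2		= 3,
	XDP_CHECKSUM_VALID_LVL3		= 4,
	XDP_CHECKSUM_VALID_NUM_MASK	= 0x7,		/* GENMASK(2, 0) */
	XDP_CHECKSUM_VALID		= XDP_CHECKSUM_VALID_NUM_MASK,
	XDP_CHECKSUM_PARTIAL		= 1 << 3,	/* BIT(3) */
	XDP_CHECKSUM_COMPLETE		= 1 << 4,	/* BIT(4) */
};

/* Number of consecutive outermost checksums reported as valid (0..4). */
static unsigned int xdp_csum_valid_count(uint32_t status)
{
	return status & XDP_CHECKSUM_VALID_NUM_MASK;
}

/* Partial cannot coexist with any other bit, hence the equality test. */
static bool xdp_csum_is_partial(uint32_t status)
{
	return status == XDP_CHECKSUM_PARTIAL;
}

static bool xdp_csum_has_complete(uint32_t status)
{
	return (status & XDP_CHECKSUM_COMPLETE) != 0;
}

/* Hypothetical sanity check of the combinations the patch allows. */
static bool xdp_csum_status_is_sane(uint32_t status)
{
	const uint32_t known = XDP_CHECKSUM_VALID_NUM_MASK |
			       XDP_CHECKSUM_PARTIAL | XDP_CHECKSUM_COMPLETE;

	if (status & ~known)
		return false;
	if ((status & XDP_CHECKSUM_PARTIAL) && status != XDP_CHECKSUM_PARTIAL)
		return false;

	return xdp_csum_valid_count(status) <= XDP_CHECKSUM_VALID_LVL3;
}
```

For example, XDP_CHECKSUM_VALID_LVL1 | XDP_CHECKSUM_COMPLETE decodes as two validated outermost checksums plus a complete checksum in the info union, whereas XDP_CHECKSUM_PARTIAL | XDP_CHECKSUM_COMPLETE is rejected.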