| Message ID | 20220628194812.1453059-33-alexandr.lobakin@intel.com (mailing list archive) |
|---|---|
| State | RFC |
| Delegated to | BPF |
| Series | bpf, xdp: introduce and use Generic Hints/metadata |
| Context | Check | Description |
|---|---|---|
| bpf/vmtest-bpf-next-PR | fail | PR summary |
| bpf/vmtest-bpf-next-VM_Test-1 | fail | Logs for Kernel LATEST on ubuntu-latest with gcc |
| bpf/vmtest-bpf-next-VM_Test-2 | fail | Logs for Kernel LATEST on ubuntu-latest with llvm-15 |
| bpf/vmtest-bpf-next-VM_Test-3 | fail | Logs for Kernel LATEST on z15 with gcc |
| netdev/tree_selection | success | Clearly marked for bpf-next, async |
| netdev/apply | fail | Patch does not apply to bpf-next |
Hi Alexander,

On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
> cpumap has its own BH context based on kthread. It has a sane batch
> size of 8 frames per one cycle.
> GRO can be used on its own, adjust cpumap calls to the
> upper stack to use GRO API instead of netif_receive_skb_list() which
> processes skbs by batches, but doesn't involve GRO layer at all.
> It is most beneficial when a NIC which frame come from is XDP
> generic metadata-enabled, but in plenty of tests GRO performs better
> than listed receiving even given that it has to calculate full frame
> checksums on CPU.
> As GRO passes the skbs to the upper stack in the batches of
> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the
> device where the frame comes from, it is enough to disable GRO
> netdev feature on it to completely restore the original behaviour:
> untouched frames will be being bulked and passed to the upper stack
> by 8, as it was with netif_receive_skb_list().
>
> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
> ---
>  kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 38 insertions(+), 5 deletions(-)

AFAICT the cpumap + GRO is a good standalone improvement. I think
cpumap is still missing this.

I have a production use case for this now. We want to do some intelligent
RX steering, and I think GRO would help over list-ified receive in some cases.
We would prefer to steer in HW (and thus get existing GRO support), but not
all our NICs support it. So we need a software fallback.

Are you still interested in merging the cpumap + GRO patches?

Thanks,
Daniel
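[Editor's note: the batching behaviour the quoted commit message relies on — GRO handing skbs to the upper stack in batches of @gro_normal_batch (8 by default) — can be modeled with a small user-space sketch. This is an illustrative mock, not kernel code; `gro_model`, `run_cycle` and the counters are invented names standing in for the kernel's gro_normal_one()/gro_normal_list() logic.]

```c
#include <assert.h>
#include <stddef.h>

#define GRO_NORMAL_BATCH 8	/* kernel default for @gro_normal_batch */

struct gro_model {
	int pending;	/* skbs queued but not yet passed up */
	int flushes;	/* batches handed to the upper stack */
	int delivered;	/* total skbs delivered upstream */
};

/* Models gro_normal_one(): queue one skb, flush once the batch fills. */
static void gro_normal_one(struct gro_model *g)
{
	if (++g->pending >= GRO_NORMAL_BATCH) {
		g->delivered += g->pending;
		g->pending = 0;
		g->flushes++;
	}
}

/* Models the end of one cpumap/NAPI cycle: flush any remainder,
 * the way gro_normal_list() would. */
static void gro_normal_flush(struct gro_model *g)
{
	if (g->pending) {
		g->delivered += g->pending;
		g->pending = 0;
		g->flushes++;
	}
}

/* Run one cycle of @n frames and return how many batches were flushed. */
static int run_cycle(struct gro_model *g, int n)
{
	for (int i = 0; i < n; i++)
		gro_normal_one(g);
	gro_normal_flush(g);
	return g->flushes;
}
```

With GRO disabled on the netdev, untouched frames still flow through this same path, which is why the commit message argues the original netif_receive_skb_list() bulk-by-8 behaviour is fully restored.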
> Hi Alexander,
>
> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
> > cpumap has its own BH context based on kthread. It has a sane batch
> > size of 8 frames per one cycle.

[...]

> AFAICT the cpumap + GRO is a good standalone improvement. I think
> cpumap is still missing this.
>
> I have a production use case for this now. We want to do some intelligent
> RX steering and I think GRO would help over list-ified receive in some cases.
> We would prefer steer in HW (and thus get existing GRO support) but not all
> our NICs support it. So we need a software fallback.
>
> Are you still interested in merging the cpumap + GRO patches?

Hi Daniel and Alex,

Recently I worked on a PoC to add GRO support to the cpumap codebase:
- https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
  Here I added GRO support to cpumap through gro-cells.
- https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
  Here I added GRO support to cpumap through the NAPI-threaded APIs (with some
  changes to them).

Please note I have not run any performance tests so far, just verified it does
not crash (I was planning to resume this work soon). Please let me know if it
works for you.

Regards,
Lorenzo
From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Date: Thu, 8 Aug 2024 06:54:06 +0200

>> Hi Alexander,
>>
>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:

[...]

>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>> cpumap is still missing this.

The only concern for having GRO in cpumap without metadata from the NIC
descriptor was that when the checksum status is missing, GRO calculates
the checksum on the CPU, which is not really fast.
But I remember sometimes GRO was faster despite that.

>> I have a production use case for this now. We want to do some intelligent
>> RX steering and I think GRO would help over list-ified receive in some cases.
>> We would prefer steer in HW (and thus get existing GRO support) but not all
>> our NICs support it. So we need a software fallback.
>>
>> Are you still interested in merging the cpumap + GRO patches?

For sure I can revive this part. I was planning to get back to this
branch, pick the patches which were not related to XDP hints, and send
them separately.

> Hi Daniel and Alex,
>
> Recently I worked on a PoC to add GRO support to the cpumap codebase:
> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>   Here I added GRO support to cpumap through gro-cells.
> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>   Here I added GRO support to cpumap through the NAPI-threaded APIs (with some
>   changes to them).

Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
overkill, that's why I separated the GRO structure from &napi_struct.

Let me maybe find some free time; I would then test all 3 solutions
(mine, gro_cells, threaded NAPI) and pick/send the best?

> Please note I have not run any performance tests so far, just verified it does
> not crash (I was planning to resume this work soon). Please let me know if it
> works for you.
>
> Regards,
> Lorenzo

Thanks,
Olek
> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> Date: Thu, 8 Aug 2024 06:54:06 +0200
>
> >> Hi Alexander,
> >>
> >> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:

[...]

> > Recently I worked on a PoC to add GRO support to the cpumap codebase:
> > - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
> >   Here I added GRO support to cpumap through gro-cells.
> > - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
> >   Here I added GRO support to cpumap through the NAPI-threaded APIs (with some
> >   changes to them).
>
> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
> overkill, that's why I separated the GRO structure from &napi_struct.

If we consider the NAPI-threaded implementation, we have the same
architecture we have in the current cpumap codebase, a thread for each
cpumap entry; the only difference is we can rely on the GRO APIs.

Regards,
Lorenzo

> Let me maybe find some free time; I would then test all 3 solutions
> (mine, gro_cells, threaded NAPI) and pick/send the best?
>
> Thanks,
> Olek
Hi Lorenzo,

On Thu, Aug 8, 2024, at 12:54 AM, Lorenzo Bianconi wrote:
>> Hi Alexander,
>>
>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:

[...]

>> Are you still interested in merging the cpumap + GRO patches?
>
> Hi Daniel and Alex,
>
> Recently I worked on a PoC to add GRO support to the cpumap codebase:
> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>   Here I added GRO support to cpumap through gro-cells.
> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>   Here I added GRO support to cpumap through the NAPI-threaded APIs (with some
>   changes to them).

Cool!

> Please note I have not run any performance tests so far, just verified it does
> not crash (I was planning to resume this work soon). Please let me know if it
> works for you.

I'll try to run an A/B test on your two approaches as well as Alex's. I've
still got some testbeds with production traffic going through them.

Thanks,
Daniel
Hi,

On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> Date: Thu, 8 Aug 2024 06:54:06 +0200

[...]

> The only concern for having GRO in cpumap without metadata from the NIC
> descriptor was that when the checksum status is missing, GRO calculates
> the checksum on the CPU, which is not really fast.
> But I remember sometimes GRO was faster despite that.

Good to know, thanks. IIUC some kind of XDP hint support landed already?

My use case could also use the HW RSS hash to avoid a rehash in the XDP
prog. And the HW RX timestamp to not break SO_TIMESTAMPING. These two
are on one of my TODO lists, but I can't get to them for at least a few
weeks, so feel free to take it if you'd like.

>>> Are you still interested in merging the cpumap + GRO patches?
>
> For sure I can revive this part. I was planning to get back to this
> branch, pick the patches which were not related to XDP hints, and send
> them separately.
>
>> Recently I worked on a PoC to add GRO support to the cpumap codebase:

[...]

> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
> overkill, that's why I separated the GRO structure from &napi_struct.
>
> Let me maybe find some free time; I would then test all 3 solutions
> (mine, gro_cells, threaded NAPI) and pick/send the best?

Sounds good. Would be good to compare results.

[…]

Thanks,
Daniel
On 08/08/2024 22.44, Daniel Xu wrote:
> Hi Lorenzo,
>
> On Thu, Aug 8, 2024, at 12:54 AM, Lorenzo Bianconi wrote:
>>> Hi Alexander,
>>>
>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
[...]
>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>> cpumap is still missing this.
>>>
>>> I have a production use case for this now. We want to do some intelligent
>>> RX steering and I think GRO would help over list-ified receive in some cases.
>>> We would prefer steer in HW (and thus get existing GRO support) but not all
>>> our NICs support it. So we need a software fallback.

I want to state that Cloudflare is also planning to use cpumap in
production, and (one) blocker is that CPUMAP doesn't support GRO.

>>> Are you still interested in merging the cpumap + GRO patches?
>>
>> Hi Daniel and Alex,
>>
>> Recently I worked on a PoC to add GRO support to the cpumap codebase:
>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>>   Here I added GRO support to cpumap through gro-cells.
>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>>   Here I added GRO support to cpumap through the NAPI-threaded APIs (with some
>>   changes to them).
>
> Cool!
>
>> Please note I have not run any performance tests so far, just verified it does
>> not crash (I was planning to resume this work soon). Please let me know if it
>> works for you.
>
> I'll try to run an A/B test on your two approaches as well as Alex's. I've
> still got some testbeds with production traffic going through them.

It is awesome that both Olek and you are stepping up to test this.
(I'm currently too busy on cgroup rstat lock related work, but more
people will be joining my team this month and I hope they will have an
interest in contributing to this effort.)

--Jesper
On 08/08/2024 22.52, Daniel Xu wrote:
>
> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
>> [...]
>> The only concern for having GRO in cpumap without metadata from the NIC
>> descriptor was that when the checksum status is missing, GRO calculates
>> the checksum on the CPU, which is not really fast.
>> But I remember sometimes GRO was faster despite that.
>
> Good to know, thanks. IIUC some kind of XDP hint support landed already?

The XDP-hints effort ended up being called 'XDP RX metadata' in the
kernel docs[1], which makes it difficult to talk about without talking
past each other. The TX side only got implemented for AF_XDP[2].

[1] https://www.kernel.org/doc/html/latest/networking/xdp-rx-metadata.html
[2] https://www.kernel.org/doc/html/latest/networking/xsk-tx-metadata.html

What landed as 'XDP RX metadata'[1] is that we (via kfunc calls) get
access to reading hardware RX offloads/hints directly from the
RX-descriptor. This implies a limitation: we only have access to this
data in the running XDP-program, as the RX-descriptor is short-lived.

Thus, we need to store the RX-descriptor information somewhere to make
it available to 'cpumap' on the remote CPU. After failing to
standardize the formatting of the XDP metadata area, my "new" opinion
is that we should simply extend struct xdp_frame with the fields needed
for SKB creation. Then we can create some kfunc helpers that allow the
XDP-prog to store this info.

> My use case could also use HW RSS hash to avoid a rehash in XDP prog.
> And HW RX timestamp to not break SO_TIMESTAMPING. These two
> are on one of my TODO lists. But I can't get to them for at least
> a few weeks. So feel free to take it if you'd like.

The kfuncs you need should be available:
  HW RSS hash     = bpf_xdp_metadata_rx_hash()
  HW RX timestamp = bpf_xdp_metadata_rx_timestamp()

We just need to implement storing the information, such that it is
available to CPUMAP, and make it generic such that it also works for
veth when getting an XDP-redirected xdp_frame.

Hoping someone can work on this soon,
--Jesper
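[Editor's note: Jesper's proposal — extend struct xdp_frame with the fields needed for SKB creation, filled by kfunc helpers — can be sketched as a user-space model. All names here (`xdp_frame_meta`, `xdp_store_rx_hash`, `build_skb_meta`) are hypothetical stand-ins, not existing kernel API; only bpf_xdp_metadata_rx_hash()/rx_timestamp(), mentioned above, are real kfuncs.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical extension of struct xdp_frame: fields the XDP prog
 * could fill from the RX descriptor (via the Rx hints kfuncs) so the
 * remote cpumap CPU, or veth, can use them when building the skb. */
struct xdp_frame_meta {
	uint32_t rx_hash;
	uint64_t rx_timestamp;
	bool has_rx_hash;
	bool has_rx_timestamp;
};

/* Stand-in for a kfunc helper the XDP program would call after
 * reading the hash with bpf_xdp_metadata_rx_hash(). */
static void xdp_store_rx_hash(struct xdp_frame_meta *m, uint32_t hash)
{
	m->rx_hash = hash;
	m->has_rx_hash = true;
}

/* Mock skb: just the fields the consumer side would set. */
struct mock_skb {
	uint32_t hash;
	bool l4_hash;
};

/* What the cpumap kthread could do at skb-build time: consume only
 * the fields the prog actually populated (models skb_set_hash()). */
static void build_skb_meta(struct mock_skb *skb,
			   const struct xdp_frame_meta *m)
{
	if (m->has_rx_hash) {
		skb->hash = m->rx_hash;
		skb->l4_hash = true;
	}
}
```

Because the fields live in the frame itself rather than the prog-visible metadata area, the layout would be kernel-internal and not subject to ABI concerns — the point Jesper raises about the short-lived RX descriptor.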
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Thu, 08 Aug 2024 16:52:51 -0400

> Hi,
>
> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>> Date: Thu, 8 Aug 2024 06:54:06 +0200

[...]

>> The only concern for having GRO in cpumap without metadata from the NIC
>> descriptor was that when the checksum status is missing, GRO calculates
>> the checksum on the CPU, which is not really fast.
>> But I remember sometimes GRO was faster despite that.
>
> Good to know, thanks. IIUC some kind of XDP hint support landed already?
>
> My use case could also use HW RSS hash to avoid a rehash in XDP prog.

Unfortunately, for now it's impossible to get HW metadata such as the RSS
hash and checksum status in cpumap. They're implemented via kfuncs
specific to a particular netdevice, and this info is available only when
running the XDP prog.

But I think one solution could be:

1. We create some generic structure for cpumap, like

	struct cpumap_meta {
		u32 magic;
		u32 hash;
	};

2. We add such a check in the cpumap code:

	if (xdpf->metalen == sizeof(struct cpumap_meta) &&
	    <here we check magic>)
		skb->hash = meta->hash;

3. In the XDP prog, you call the Rx hints kfuncs when they're available,
obtain the RSS hash, and then put it in the struct cpumap_meta as XDP
frame metadata.

> And HW RX timestamp to not break SO_TIMESTAMPING. These two
> are on one of my TODO lists. But I can't get to them for at least
> a few weeks. So feel free to take it if you'd like.

[...]

>> Let me maybe find some free time; I would then test all 3 solutions
>> (mine, gro_cells, threaded NAPI) and pick/send the best?
>
> Sounds good. Would be good to compare results.

Thanks,
Olek
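[Editor's note: Alexander's three steps can be put together as a round-trip user-space model — the producer (XDP prog side) writes struct cpumap_meta into the frame's metadata area, and the consumer (cpumap side) validates size and magic before trusting it. The magic value, the mock frame layout, and the helper names are all illustrative, matching the sketch above rather than any existing API.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CPUMAP_META_MAGIC 0xcafe2024u	/* illustrative value */

/* Step 1: the structure carried in the XDP metadata area. */
struct cpumap_meta {
	uint32_t magic;
	uint32_t hash;
};

/* Mock xdp_frame: metadata sits in a small buffer, metalen records
 * how much of it the XDP program actually wrote. */
struct mock_xdp_frame {
	uint8_t meta[16];
	uint32_t metalen;
};

/* Step 3 (XDP prog side): after obtaining the RSS hash via the Rx
 * hints kfunc, write magic + hash into the metadata area. */
static void prog_store_meta(struct mock_xdp_frame *xdpf, uint32_t hash)
{
	struct cpumap_meta m = {
		.magic = CPUMAP_META_MAGIC,
		.hash = hash,
	};

	memcpy(xdpf->meta, &m, sizeof(m));
	xdpf->metalen = sizeof(m);
}

/* Step 2 (cpumap side): check size and magic before consuming.
 * Returns 0 and fills @hash (what would become skb->hash), or -1. */
static int cpumap_fetch_hash(const struct mock_xdp_frame *xdpf,
			     uint32_t *hash)
{
	struct cpumap_meta m;

	if (xdpf->metalen != sizeof(m))
		return -1;
	memcpy(&m, xdpf->meta, sizeof(m));
	if (m.magic != CPUMAP_META_MAGIC)
		return -1;
	*hash = m.hash;
	return 0;
}
```

The magic check is what lets cpumap stay backwards compatible: frames from progs that never wrote the structure simply fail the size/magic test and are handled as before.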
Alexander Lobakin <aleksander.lobakin@intel.com> writes:

> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Thu, 08 Aug 2024 16:52:51 -0400

[...]

>> Good to know, thanks. IIUC some kind of XDP hint support landed already?
>>
>> My use case could also use HW RSS hash to avoid a rehash in XDP prog.
>
> Unfortunately, for now it's impossible to get HW metadata such as the RSS
> hash and checksum status in cpumap. They're implemented via kfuncs
> specific to a particular netdevice, and this info is available only when
> running the XDP prog.
>
> But I think one solution could be:
>
> 1. We create some generic structure for cpumap, like
>
> 	struct cpumap_meta {
> 		u32 magic;
> 		u32 hash;
> 	};
>
> 2. We add such a check in the cpumap code:
>
> 	if (xdpf->metalen == sizeof(struct cpumap_meta) &&
> 	    <here we check magic>)
> 		skb->hash = meta->hash;
>
> 3. In the XDP prog, you call the Rx hints kfuncs when they're available,
> obtain the RSS hash, and then put it in the struct cpumap_meta as XDP
> frame metadata.

Yes, except don't make this cpumap-specific; make it generic for kernel
consumption of the metadata. That way it doesn't even have to be stored
in the xdp metadata area, it can be anywhere we want (and hence not
subject to ABI issues), and we can use it for skb creation after
redirect in other places than cpumap as well (say, on veth devices).

So it'll be:

	struct kernel_meta {
		u32 hash;
		u32 timestamp;
		...etc
	};

and a kfunc:

	void store_xdp_kernel_meta(struct kernel_meta *meta);

which the XDP program can call to populate the metadata area.

-Toke
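[Editor's note: the generic variant Toke sketches could be modeled like this — an out-of-band per-frame struct with a validity mask, consumed by any skb-building path (cpumap, veth). Every name here (`kernel_meta` layout, `KMETA_*` bits, the helpers) is hypothetical, following the sketch above.]

```c
#include <assert.h>
#include <stdint.h>

/* Bits recording which fields the XDP program actually populated,
 * so consumers never read garbage. */
#define KMETA_HASH	(1u << 0)
#define KMETA_TSTAMP	(1u << 1)

/* Generic kernel-consumed metadata. Since it is not stored in the
 * prog-visible XDP metadata area, its layout is not ABI and can be
 * extended freely. */
struct kernel_meta {
	uint32_t valid;		/* KMETA_* bits */
	uint32_t hash;
	uint64_t timestamp;
};

/* Stand-in for the store_xdp_kernel_meta() kfunc from the sketch:
 * copies the prog-supplied values into the frame's kernel area. */
static void store_xdp_kernel_meta(struct kernel_meta *dst,
				  const struct kernel_meta *src)
{
	*dst = *src;
}

/* Any skb-building consumer (cpumap kthread, veth) picks out only
 * the fields marked valid, falling back otherwise. */
static uint32_t consume_hash(const struct kernel_meta *m,
			     uint32_t fallback)
{
	return (m->valid & KMETA_HASH) ? m->hash : fallback;
}
```

The validity mask is what makes the struct safely extensible: adding a field later does not change the meaning of frames written by older programs.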
From: Toke Høiland-Jørgensen <toke@redhat.com> Date: Fri, 09 Aug 2024 14:45:33 +0200 > Alexander Lobakin <aleksander.lobakin@intel.com> writes: > >> From: Daniel Xu <dxu@dxuuu.xyz> >> Date: Thu, 08 Aug 2024 16:52:51 -0400 >> >>> Hi, >>> >>> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote: >>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> >>>> Date: Thu, 8 Aug 2024 06:54:06 +0200 >>>> >>>>>> Hi Alexander, >>>>>> >>>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote: >>>>>>> cpumap has its own BH context based on kthread. It has a sane batch >>>>>>> size of 8 frames per one cycle. >>>>>>> GRO can be used on its own, adjust cpumap calls to the >>>>>>> upper stack to use GRO API instead of netif_receive_skb_list() which >>>>>>> processes skbs by batches, but doesn't involve GRO layer at all. >>>>>>> It is most beneficial when a NIC which frame come from is XDP >>>>>>> generic metadata-enabled, but in plenty of tests GRO performs better >>>>>>> than listed receiving even given that it has to calculate full frame >>>>>>> checksums on CPU. >>>>>>> As GRO passes the skbs to the upper stack in the batches of >>>>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the >>>>>>> device where the frame comes from, it is enough to disable GRO >>>>>>> netdev feature on it to completely restore the original behaviour: >>>>>>> untouched frames will be being bulked and passed to the upper stack >>>>>>> by 8, as it was with netif_receive_skb_list(). >>>>>>> >>>>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> >>>>>>> --- >>>>>>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++----- >>>>>>> 1 file changed, 38 insertions(+), 5 deletions(-) >>>>>>> >>>>>> >>>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think >>>>>> cpumap is still missing this. 
>>>> >>>> The only concern for having GRO in cpumap without metadata from the NIC >>>> descriptor was that when the checksum status is missing, GRO calculates >>>> the checksum on CPU, which is not really fast. >>>> But I remember sometimes GRO was faster despite that. >>> >>> Good to know, thanks. IIUC some kind of XDP hint support landed already? >>> >>> My use case could also use HW RSS hash to avoid a rehash in XDP prog. >> >> Unfortunately, for now it's impossible to get HW metadata such as RSS >> hash and checksum status in cpumap. They're implemented via kfuncs >> specific to a particular netdevice and this info is available only when >> running XDP prog. >> >> But I think one solution could be: >> >> 1. We create some generic structure for cpumap, like >> >> struct cpumap_meta { >> u32 magic; >> u32 hash; >> } >> >> 2. We add such check in the cpumap code >> >> if (xdpf->metalen == sizeof(struct cpumap_meta) && >> <here we check magic>) >> skb->hash = meta->hash; >> >> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain >> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata. > > Yes, except don't make this cpumap-specific, make it generic for kernel > consumption of the metadata. That way it doesn't even have to be stored > in the xdp metadata area, it can be anywhere we want (and hence not > subject to ABI issues), and we can use it for skb creation after > redirect in other places than cpumap as well (say, on veth devices). > > So it'll be: > > struct kernel_meta { > u32 hash; > u32 timestamp; > ...etc > } > > and a kfunc: > > void store_xdp_kernel_meta(struct kernel meta *meta); > > which the XDP program can call to populate the metadata area. Hmm, nice! But where to store this info in case of cpumap if not in xdp->data_meta? When you convert XDP frames to skbs in the cpumap code, you only have &xdp_frame and that's it. XDP prog was already run earlier from the driver code at that point. 
But yes, in general we still need some generic structure, so that in-kernel consumers like cpumap (and others) can make use of it, not only XDP programs. > > -Toke Thanks, Olek
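The check proposed in this exchange can be sketched in plain userspace C. Everything here is hypothetical illustration: the magic value, the field layout, and the helper names (`cpumap_meta_valid`, `extract_rx_hash`) are made up for the sketch and are not the in-kernel implementation.

```c
/* Userspace sketch of the cpumap_meta idea from this thread. The magic
 * value, field layout and helper names are hypothetical illustrations,
 * not kernel code.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CPUMAP_META_MAGIC 0xcafe2024u	/* made-up sentinel value */

struct cpumap_meta {
	uint32_t magic;
	uint32_t hash;
};

/* Mirrors the proposed check: trust the metadata area only when both its
 * length and the magic match, since any XDP program may write there.
 */
static bool cpumap_meta_valid(size_t metalen, const struct cpumap_meta *meta)
{
	return metalen == sizeof(*meta) && meta->magic == CPUMAP_META_MAGIC;
}

/* What cpumap would do while building the skb: take the RSS hash from
 * validated metadata, otherwise fall back (e.g. to a software hash).
 */
static uint32_t extract_rx_hash(size_t metalen,
				const struct cpumap_meta *meta,
				uint32_t fallback)
{
	return cpumap_meta_valid(metalen, meta) ? meta->hash : fallback;
}
```

The point of the magic field is exactly what the length check alone cannot give: `xdpf->metalen` happening to equal 8 does not prove the XDP program wrote this particular structure.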
Alexander Lobakin <aleksander.lobakin@intel.com> writes: > From: Toke Høiland-Jørgensen <toke@redhat.com> > Date: Fri, 09 Aug 2024 14:45:33 +0200 > >> Alexander Lobakin <aleksander.lobakin@intel.com> writes: >> >>> From: Daniel Xu <dxu@dxuuu.xyz> >>> Date: Thu, 08 Aug 2024 16:52:51 -0400 >>> >>>> Hi, >>>> >>>> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote: >>>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> >>>>> Date: Thu, 8 Aug 2024 06:54:06 +0200 >>>>> >>>>>>> Hi Alexander, >>>>>>> >>>>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote: >>>>>>>> cpumap has its own BH context based on kthread. It has a sane batch >>>>>>>> size of 8 frames per one cycle. >>>>>>>> GRO can be used on its own, adjust cpumap calls to the >>>>>>>> upper stack to use GRO API instead of netif_receive_skb_list() which >>>>>>>> processes skbs by batches, but doesn't involve GRO layer at all. >>>>>>>> It is most beneficial when a NIC which frame come from is XDP >>>>>>>> generic metadata-enabled, but in plenty of tests GRO performs better >>>>>>>> than listed receiving even given that it has to calculate full frame >>>>>>>> checksums on CPU. >>>>>>>> As GRO passes the skbs to the upper stack in the batches of >>>>>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the >>>>>>>> device where the frame comes from, it is enough to disable GRO >>>>>>>> netdev feature on it to completely restore the original behaviour: >>>>>>>> untouched frames will be being bulked and passed to the upper stack >>>>>>>> by 8, as it was with netif_receive_skb_list(). >>>>>>>> >>>>>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> >>>>>>>> --- >>>>>>>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++----- >>>>>>>> 1 file changed, 38 insertions(+), 5 deletions(-) >>>>>>>> >>>>>>> >>>>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think >>>>>>> cpumap is still missing this. 
>>>>> >>>>> The only concern for having GRO in cpumap without metadata from the NIC >>>>> descriptor was that when the checksum status is missing, GRO calculates >>>>> the checksum on CPU, which is not really fast. >>>>> But I remember sometimes GRO was faster despite that. >>>> >>>> Good to know, thanks. IIUC some kind of XDP hint support landed already? >>>> >>>> My use case could also use HW RSS hash to avoid a rehash in XDP prog. >>> >>> Unfortunately, for now it's impossible to get HW metadata such as RSS >>> hash and checksum status in cpumap. They're implemented via kfuncs >>> specific to a particular netdevice and this info is available only when >>> running XDP prog. >>> >>> But I think one solution could be: >>> >>> 1. We create some generic structure for cpumap, like >>> >>> struct cpumap_meta { >>> u32 magic; >>> u32 hash; >>> } >>> >>> 2. We add such check in the cpumap code >>> >>> if (xdpf->metalen == sizeof(struct cpumap_meta) && >>> <here we check magic>) >>> skb->hash = meta->hash; >>> >>> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain >>> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata. >> >> Yes, except don't make this cpumap-specific, make it generic for kernel >> consumption of the metadata. That way it doesn't even have to be stored >> in the xdp metadata area, it can be anywhere we want (and hence not >> subject to ABI issues), and we can use it for skb creation after >> redirect in other places than cpumap as well (say, on veth devices). >> >> So it'll be: >> >> struct kernel_meta { >> u32 hash; >> u32 timestamp; >> ...etc >> } >> >> and a kfunc: >> >> void store_xdp_kernel_meta(struct kernel meta *meta); >> >> which the XDP program can call to populate the metadata area. > > Hmm, nice! > > But where to store this info in case of cpumap if not in xdp->data_meta? > When you convert XDP frames to skbs in the cpumap code, you only have > &xdp_frame and that's it. 
XDP prog was already run earlier from the > driver code at that point. Well, we could put it in skb_shared_info? IIRC, some of the metadata (timestamps?) end up there when building an skb anyway, so we won't even have to copy it around... -Toke
On 8/9/24 6:42 AM, Toke Høiland-Jørgensen wrote: > Alexander Lobakin <aleksander.lobakin@intel.com> writes: > >> From: Toke Høiland-Jørgensen <toke@redhat.com> >> Date: Fri, 09 Aug 2024 14:45:33 +0200 >> >>> Alexander Lobakin <aleksander.lobakin@intel.com> writes: >>> >>>> From: Daniel Xu <dxu@dxuuu.xyz> >>>> Date: Thu, 08 Aug 2024 16:52:51 -0400 >>>> >>>>> Hi, >>>>> >>>>> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote: >>>>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> >>>>>> Date: Thu, 8 Aug 2024 06:54:06 +0200 >>>>>> >>>>>>>> Hi Alexander, >>>>>>>> >>>>>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote: >>>>>>>>> cpumap has its own BH context based on kthread. It has a sane batch >>>>>>>>> size of 8 frames per one cycle. >>>>>>>>> GRO can be used on its own, adjust cpumap calls to the >>>>>>>>> upper stack to use GRO API instead of netif_receive_skb_list() which >>>>>>>>> processes skbs by batches, but doesn't involve GRO layer at all. >>>>>>>>> It is most beneficial when a NIC which frame come from is XDP >>>>>>>>> generic metadata-enabled, but in plenty of tests GRO performs better >>>>>>>>> than listed receiving even given that it has to calculate full frame >>>>>>>>> checksums on CPU. >>>>>>>>> As GRO passes the skbs to the upper stack in the batches of >>>>>>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the >>>>>>>>> device where the frame comes from, it is enough to disable GRO >>>>>>>>> netdev feature on it to completely restore the original behaviour: >>>>>>>>> untouched frames will be being bulked and passed to the upper stack >>>>>>>>> by 8, as it was with netif_receive_skb_list(). >>>>>>>>> >>>>>>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> >>>>>>>>> --- >>>>>>>>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++----- >>>>>>>>> 1 file changed, 38 insertions(+), 5 deletions(-) >>>>>>>>> >>>>>>>> >>>>>>>> AFAICT the cpumap + GRO is a good standalone improvement. 
I think >>>>>>>> cpumap is still missing this. >>>>>> >>>>>> The only concern for having GRO in cpumap without metadata from the NIC >>>>>> descriptor was that when the checksum status is missing, GRO calculates >>>>>> the checksum on CPU, which is not really fast. >>>>>> But I remember sometimes GRO was faster despite that. >>>>> >>>>> Good to know, thanks. IIUC some kind of XDP hint support landed already? >>>>> >>>>> My use case could also use HW RSS hash to avoid a rehash in XDP prog. >>>> >>>> Unfortunately, for now it's impossible to get HW metadata such as RSS >>>> hash and checksum status in cpumap. They're implemented via kfuncs >>>> specific to a particular netdevice and this info is available only when >>>> running XDP prog. >>>> >>>> But I think one solution could be: >>>> >>>> 1. We create some generic structure for cpumap, like >>>> >>>> struct cpumap_meta { >>>> u32 magic; >>>> u32 hash; >>>> } >>>> >>>> 2. We add such check in the cpumap code >>>> >>>> if (xdpf->metalen == sizeof(struct cpumap_meta) && >>>> <here we check magic>) >>>> skb->hash = meta->hash; >>>> >>>> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain >>>> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata. >>> >>> Yes, except don't make this cpumap-specific, make it generic for kernel >>> consumption of the metadata. That way it doesn't even have to be stored >>> in the xdp metadata area, it can be anywhere we want (and hence not >>> subject to ABI issues), and we can use it for skb creation after >>> redirect in other places than cpumap as well (say, on veth devices). >>> >>> So it'll be: >>> >>> struct kernel_meta { >>> u32 hash; >>> u32 timestamp; >>> ...etc >>> } >>> >>> and a kfunc: >>> >>> void store_xdp_kernel_meta(struct kernel meta *meta); >>> >>> which the XDP program can call to populate the metadata area. >> >> Hmm, nice! >> >> But where to store this info in case of cpumap if not in xdp->data_meta? 
The cpumap has an XDP program. Instead of the kernel's cpumap code building the skb, the cpumap's XDP prog could build the skb itself and directly use the xdp->data_meta to init the skb. I recall there was discussion about doing GRO in a BPF prog (I may be mis-remembering though). If possible, then the cpumap's XDP prog can also do the GRO? >> When you convert XDP frames to skbs in the cpumap code, you only have >> &xdp_frame and that's it. XDP prog was already run earlier from the >> driver code at that point. > > Well, we could put it in skb_shared_info? IIRC, some of the metadata > (timestamps?) end up there when building an skb anyway, so we won't even > have to copy it around...
On Aug 08, Alexander Lobakin wrote: > From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> > Date: Thu, 8 Aug 2024 06:54:06 +0200 > > >> Hi Alexander, > >> > >> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote: > >>> cpumap has its own BH context based on kthread. It has a sane batch > >>> size of 8 frames per one cycle. > >>> GRO can be used on its own, adjust cpumap calls to the > >>> upper stack to use GRO API instead of netif_receive_skb_list() which > >>> processes skbs by batches, but doesn't involve GRO layer at all. > >>> It is most beneficial when a NIC which frame come from is XDP > >>> generic metadata-enabled, but in plenty of tests GRO performs better > >>> than listed receiving even given that it has to calculate full frame > >>> checksums on CPU. > >>> As GRO passes the skbs to the upper stack in the batches of > >>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the > >>> device where the frame comes from, it is enough to disable GRO > >>> netdev feature on it to completely restore the original behaviour: > >>> untouched frames will be being bulked and passed to the upper stack > >>> by 8, as it was with netif_receive_skb_list(). > >>> > >>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> > >>> --- > >>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++----- > >>> 1 file changed, 38 insertions(+), 5 deletions(-) > >>> > >> > >> AFAICT the cpumap + GRO is a good standalone improvement. I think > >> cpumap is still missing this. > > The only concern for having GRO in cpumap without metadata from the NIC > descriptor was that when the checksum status is missing, GRO calculates > the checksum on CPU, which is not really fast. > But I remember sometimes GRO was faster despite that. For the moment we could test it with UDP traffic with checksum disabled. Regards, Lorenzo > > >> > >> I have a production use case for this now. 
We want to do some intelligent > >> RX steering and I think GRO would help over list-ified receive in some cases. > >> We would prefer steer in HW (and thus get existing GRO support) but not all > >> our NICs support it. So we need a software fallback. > >> > >> Are you still interested in merging the cpumap + GRO patches? > > For sure I can revive this part. I was planning to get back to this > branch and pick patches which were not related to XDP hints and send > them separately. > > > > > Hi Daniel and Alex, > > > > Recently I worked on a PoC to add GRO support to cpumap codebase: > > - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e > > Here I added GRO support to cpumap through gro-cells. > > - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55 > > Here I added GRO support to cpumap trough napi-threaded APIs (with a some > > changes to them). > > Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta > overkill, that's why I separated GRO structure from &napi_struct. > > Let me maybe find some free time, I would then test all 3 solutions > (mine, gro_cells, threaded NAPI) and pick/send the best? > > > > > Please note I have not run any performance tests so far, just verified it does > > not crash (I was planning to resume this work soon). Please let me know if it > > works for you. > > > > Regards, > > Lorenzo > > > >> > >> Thanks, > >> Daniel > > Thanks, > Olek >
On Aug 09, Toke wrote: > Alexander Lobakin <aleksander.lobakin@intel.com> writes: > > > From: Toke Høiland-Jørgensen <toke@redhat.com> > > Date: Fri, 09 Aug 2024 14:45:33 +0200 > > > >> Alexander Lobakin <aleksander.lobakin@intel.com> writes: > >> > >>> From: Daniel Xu <dxu@dxuuu.xyz> > >>> Date: Thu, 08 Aug 2024 16:52:51 -0400 > >>> > >>>> Hi, > >>>> > >>>> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote: > >>>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> > >>>>> Date: Thu, 8 Aug 2024 06:54:06 +0200 > >>>>> > >>>>>>> Hi Alexander, > >>>>>>> > >>>>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote: > >>>>>>>> cpumap has its own BH context based on kthread. It has a sane batch > >>>>>>>> size of 8 frames per one cycle. > >>>>>>>> GRO can be used on its own, adjust cpumap calls to the > >>>>>>>> upper stack to use GRO API instead of netif_receive_skb_list() which > >>>>>>>> processes skbs by batches, but doesn't involve GRO layer at all. > >>>>>>>> It is most beneficial when a NIC which frame come from is XDP > >>>>>>>> generic metadata-enabled, but in plenty of tests GRO performs better > >>>>>>>> than listed receiving even given that it has to calculate full frame > >>>>>>>> checksums on CPU. > >>>>>>>> As GRO passes the skbs to the upper stack in the batches of > >>>>>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the > >>>>>>>> device where the frame comes from, it is enough to disable GRO > >>>>>>>> netdev feature on it to completely restore the original behaviour: > >>>>>>>> untouched frames will be being bulked and passed to the upper stack > >>>>>>>> by 8, as it was with netif_receive_skb_list(). 
> >>>>>>>> > >>>>>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> > >>>>>>>> --- > >>>>>>>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++----- > >>>>>>>> 1 file changed, 38 insertions(+), 5 deletions(-) > >>>>>>>> > >>>>>>> > >>>>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think > >>>>>>> cpumap is still missing this. > >>>>> > >>>>> The only concern for having GRO in cpumap without metadata from the NIC > >>>>> descriptor was that when the checksum status is missing, GRO calculates > >>>>> the checksum on CPU, which is not really fast. > >>>>> But I remember sometimes GRO was faster despite that. > >>>> > >>>> Good to know, thanks. IIUC some kind of XDP hint support landed already? > >>>> > >>>> My use case could also use HW RSS hash to avoid a rehash in XDP prog. > >>> > >>> Unfortunately, for now it's impossible to get HW metadata such as RSS > >>> hash and checksum status in cpumap. They're implemented via kfuncs > >>> specific to a particular netdevice and this info is available only when > >>> running XDP prog. > >>> > >>> But I think one solution could be: > >>> > >>> 1. We create some generic structure for cpumap, like > >>> > >>> struct cpumap_meta { > >>> u32 magic; > >>> u32 hash; > >>> } > >>> > >>> 2. We add such check in the cpumap code > >>> > >>> if (xdpf->metalen == sizeof(struct cpumap_meta) && > >>> <here we check magic>) > >>> skb->hash = meta->hash; > >>> > >>> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain > >>> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata. > >> > >> Yes, except don't make this cpumap-specific, make it generic for kernel > >> consumption of the metadata. That way it doesn't even have to be stored > >> in the xdp metadata area, it can be anywhere we want (and hence not > >> subject to ABI issues), and we can use it for skb creation after > >> redirect in other places than cpumap as well (say, on veth devices). 
> >> > >> So it'll be: > >> > >> struct kernel_meta { > >> u32 hash; > >> u32 timestamp; > >> ...etc > >> } > >> > >> and a kfunc: > >> > >> void store_xdp_kernel_meta(struct kernel meta *meta); > >> > >> which the XDP program can call to populate the metadata area. > > > > Hmm, nice! > > > > But where to store this info in case of cpumap if not in xdp->data_meta? > > When you convert XDP frames to skbs in the cpumap code, you only have > > &xdp_frame and that's it. XDP prog was already run earlier from the > > driver code at that point. > > Well, we could put it in skb_shared_info? IIRC, some of the metadata > (timestamps?) end up there when building an skb anyway, so we won't even > have to copy it around... Before vacation I started looking into it a bit; I will resume this work in a week or so. Regards, Lorenzo > > -Toke >
On Fri, 9 Aug 2024 14:20:25 +0200 Alexander Lobakin wrote: > But I think one solution could be: > > 1. We create some generic structure for cpumap, like > > struct cpumap_meta { > u32 magic; > u32 hash; > } > > 2. We add such check in the cpumap code > > if (xdpf->metalen == sizeof(struct cpumap_meta) && > <here we check magic>) > skb->hash = meta->hash; > > 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain > RSS hash and then put it in the struct cpumap_meta as XDP frame metadata. I wonder what the overhead of skb metadata allocation is in practice. With Eric's "return skb to the CPU of origin" we can feed the lockless skb cache on the right CPU, and also feed the lockless page pool cache. I wonder if batched RFS wouldn't be faster than the XDP thing that requires all the groundwork.
On 13/08/2024 03.33, Jakub Kicinski wrote: > On Fri, 9 Aug 2024 14:20:25 +0200 Alexander Lobakin wrote: >> But I think one solution could be: >> >> 1. We create some generic structure for cpumap, like >> >> struct cpumap_meta { >> u32 magic; >> u32 hash; >> } >> >> 2. We add such check in the cpumap code >> >> if (xdpf->metalen == sizeof(struct cpumap_meta) && >> <here we check magic>) >> skb->hash = meta->hash; >> >> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain >> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata. > > I wonder what the overhead of skb metadata allocation is in practice. > With Eric's "return skb to the CPU of origin" we can feed the lockless > skb cache one the right CPU, and also feed the lockless page pool > cache. I wonder if batched RFS wouldn't be faster than the XDP thing > that requires all the groundwork. I explicitly developed CPUMAP because I was benchmarking Receive Flow Steering (RFS) and Receive Packet Steering (RPS), which I observed were the bottleneck. The overhead on the RX-CPU was too large, and it became a bottleneck because RFS and RPS maintain data structures to avoid out-of-order packets. The Flow Dissector step was also a limiting factor. By bottleneck I mean it didn't scale, as the RX-CPU's packets-per-second processing speed was too low compared to the remote-CPU pps. Digging in my old notes, I can see that RPS was limited to around 4.8 Mpps (and I have a weird result where disabling part of it showed 7.5 Mpps). In [1] the remote-CPU could process (starting at) 2.7 Mpps when dropping UDP packets due to UdpNoPorts being configured (baseline 3.3 Mpps if not remote), thus it only scales up to 1.78 remote-CPUs. [1] shows how optimizations bring the remote-CPU to 3.2 Mpps (close to the 3.3 Mpps non-remote baseline). In [2] those optimizations bring the remote-CPU to 4 Mpps (for the UdpNoPorts case). XDP RX-redirect in [1]+[2] was around 19 Mpps (which might be lower today due to perf paper cuts). 
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap02-optimizations.org [2] https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap03-optimizations.org The benefits Eric's "return skb to the CPU of origin" should help improve the case for the remote-CPU, as I was seeing some bottlenecks in how we returned the memory. --Jesper
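The "scales up to 1.78 remote-CPUs" figure above is just the ratio of the two measured rates. A minimal sketch of that arithmetic (the helper name is made up; the numbers are the ones quoted in the mail):

```c
/* Sketch of the scaling arithmetic behind the "1.78 remote-CPUs" figure:
 * if the RX-CPU can distribute at most rx_limit_mpps and each remote CPU
 * consumes per_remote_mpps, the setup scales to rx_limit / per_remote
 * remote CPUs before the RX-CPU itself becomes the bottleneck.
 */
static double remote_cpu_scale(double rx_limit_mpps, double per_remote_mpps)
{
	return rx_limit_mpps / per_remote_mpps;
}
```

With the quoted numbers, remote_cpu_scale(4.8, 2.7) gives roughly 1.78, matching the figure in the mail.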
From: Alexander Lobakin <aleksander.lobakin@intel.com> Date: Thu, 8 Aug 2024 13:57:00 +0200 > From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> > Date: Thu, 8 Aug 2024 06:54:06 +0200 > >>> Hi Alexander, >>> >>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote: >>>> cpumap has its own BH context based on kthread. It has a sane batch >>>> size of 8 frames per one cycle. >>>> GRO can be used on its own, adjust cpumap calls to the >>>> upper stack to use GRO API instead of netif_receive_skb_list() which >>>> processes skbs by batches, but doesn't involve GRO layer at all. >>>> It is most beneficial when a NIC which frame come from is XDP >>>> generic metadata-enabled, but in plenty of tests GRO performs better >>>> than listed receiving even given that it has to calculate full frame >>>> checksums on CPU. >>>> As GRO passes the skbs to the upper stack in the batches of >>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the >>>> device where the frame comes from, it is enough to disable GRO >>>> netdev feature on it to completely restore the original behaviour: >>>> untouched frames will be being bulked and passed to the upper stack >>>> by 8, as it was with netif_receive_skb_list(). >>>> >>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> >>>> --- >>>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++----- >>>> 1 file changed, 38 insertions(+), 5 deletions(-) >>>> >>> >>> AFAICT the cpumap + GRO is a good standalone improvement. I think >>> cpumap is still missing this. > > The only concern for having GRO in cpumap without metadata from the NIC > descriptor was that when the checksum status is missing, GRO calculates > the checksum on CPU, which is not really fast. > But I remember sometimes GRO was faster despite that. > >>> >>> I have a production use case for this now. We want to do some intelligent >>> RX steering and I think GRO would help over list-ified receive in some cases. 
>>> We would prefer steer in HW (and thus get existing GRO support) but not all >>> our NICs support it. So we need a software fallback. >>> >>> Are you still interested in merging the cpumap + GRO patches? > > For sure I can revive this part. I was planning to get back to this > branch and pick patches which were not related to XDP hints and send > them separately. > >> >> Hi Daniel and Alex, >> >> Recently I worked on a PoC to add GRO support to cpumap codebase: >> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e >> Here I added GRO support to cpumap through gro-cells. >> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55 >> Here I added GRO support to cpumap trough napi-threaded APIs (with a some >> changes to them). > > Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta > overkill, that's why I separated GRO structure from &napi_struct. > > Let me maybe find some free time, I would then test all 3 solutions > (mine, gro_cells, threaded NAPI) and pick/send the best? > >> >> Please note I have not run any performance tests so far, just verified it does >> not crash (I was planning to resume this work soon). Please let me know if it >> works for you. I did tests on both threaded NAPI for cpumap and my old implementation with a traffic generator and I have the following (in Kpps):

           direct Rx   direct GRO   cpumap   cpumap GRO
baseline        2900         5800     2700   2700 (N/A)
threaded                              2300         4000
old GRO                               2300         4000

IOW,

1. There are no differences in perf between Lorenzo's threaded NAPI GRO implementation and my old implementation, but Lorenzo's is also a very nice cleanup as it switches cpumap to threaded NAPI completely and the final diffstat even removes more lines than adds, while mine adds a bunch of lines and refactors a couple hundred, so I'd go with his variant.

2.
After switching to NAPI, the performance without GRO decreases (2.3 Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums manually. Note that the code is not fully polished yet, and I also have a good improvement for allocating skb heads from the percpu NAPI cache in my old tree which I'm planning to add to the series, so the final improvement will be even bigger. Plus, after we find out how to pass the checksum hint to cpumap, it will be yet another big improvement for GRO (the current code won't benefit from this at all). To Lorenzo: would it be fine if I prepare a series containing your patch for threaded NAPI for cpumap (I'd polish it and break it into 2 or 3) + the skb allocation optimization and send it, or would you prefer to send this on your own? I'm fine with either: in the first case, everything would land within one series with the respective credits; in the latter case, I'd need to send a followup :) >> >> Regards, >> Lorenzo >> >>> >>> Thanks, >>> Daniel Thanks, Olek
Alexander Lobakin <aleksander.lobakin@intel.com> writes: > From: Alexander Lobakin <aleksander.lobakin@intel.com> > Date: Thu, 8 Aug 2024 13:57:00 +0200 > >> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> >> Date: Thu, 8 Aug 2024 06:54:06 +0200 >> >>>> Hi Alexander, >>>> >>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote: >>>>> cpumap has its own BH context based on kthread. It has a sane batch >>>>> size of 8 frames per one cycle. >>>>> GRO can be used on its own, adjust cpumap calls to the >>>>> upper stack to use GRO API instead of netif_receive_skb_list() which >>>>> processes skbs by batches, but doesn't involve GRO layer at all. >>>>> It is most beneficial when a NIC which frame come from is XDP >>>>> generic metadata-enabled, but in plenty of tests GRO performs better >>>>> than listed receiving even given that it has to calculate full frame >>>>> checksums on CPU. >>>>> As GRO passes the skbs to the upper stack in the batches of >>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the >>>>> device where the frame comes from, it is enough to disable GRO >>>>> netdev feature on it to completely restore the original behaviour: >>>>> untouched frames will be being bulked and passed to the upper stack >>>>> by 8, as it was with netif_receive_skb_list(). >>>>> >>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> >>>>> --- >>>>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++----- >>>>> 1 file changed, 38 insertions(+), 5 deletions(-) >>>>> >>>> >>>> AFAICT the cpumap + GRO is a good standalone improvement. I think >>>> cpumap is still missing this. >> >> The only concern for having GRO in cpumap without metadata from the NIC >> descriptor was that when the checksum status is missing, GRO calculates >> the checksum on CPU, which is not really fast. >> But I remember sometimes GRO was faster despite that. >> >>>> >>>> I have a production use case for this now. 
>>>> We want to do some intelligent >>>> RX steering and I think GRO would help over list-ified receive in some cases. >>>> We would prefer steer in HW (and thus get existing GRO support) but not all >>>> our NICs support it. So we need a software fallback. >>>> >>>> Are you still interested in merging the cpumap + GRO patches? >> >> For sure I can revive this part. I was planning to get back to this >> branch and pick patches which were not related to XDP hints and send >> them separately. >> >>> >>> Hi Daniel and Alex, >>> >>> Recently I worked on a PoC to add GRO support to cpumap codebase: >>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e >>> Here I added GRO support to cpumap through gro-cells. >>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55 >>> Here I added GRO support to cpumap trough napi-threaded APIs (with a some >>> changes to them). >> >> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta >> overkill, that's why I separated GRO structure from &napi_struct. >> >> Let me maybe find some free time, I would then test all 3 solutions >> (mine, gro_cells, threaded NAPI) and pick/send the best? >> >>> >>> Please note I have not run any performance tests so far, just verified it does >>> not crash (I was planning to resume this work soon). Please let me know if it >>> works for you. > > I did tests on both threaded NAPI for cpumap and my old implementation > with a traffic generator and I have the following (in Kpps):
>
>            direct Rx   direct GRO   cpumap   cpumap GRO
> baseline        2900         5800     2700   2700 (N/A)
> threaded                              2300         4000
> old GRO                               2300         4000
>
> IOW,
>
> 1.
There are no differences in perf between Lorenzo's threaded NAPI > GRO implementation and my old implementation, but Lorenzo's is also > a very nice cleanup as it switches cpumap to threaded NAPI completely > and the final diffstat even removes more lines than adds, while mine > adds a bunch of lines and refactors a couple hundred, so I'd go with > his variant. > > 2. After switching to NAPI, the performance without GRO decreases (2.3 > Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely > (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums > manually. One question for this: IIUC, the benefit of GRO varies with the traffic mix, depending on how much the GRO logic can actually aggregate. So did you test the pathological case as well (spraying packets over so many flows that there is basically no aggregation taking place)? Just to make sure we don't accidentally screw up performance in that case while optimising for the aggregating case :) -Toke
On 13/08/2024 16.54, Toke Høiland-Jørgensen wrote: > Alexander Lobakin <aleksander.lobakin@intel.com> writes: > >> From: Alexander Lobakin <aleksander.lobakin@intel.com> >> Date: Thu, 8 Aug 2024 13:57:00 +0200 >> >>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> >>> Date: Thu, 8 Aug 2024 06:54:06 +0200 >>> >>>>> Hi Alexander, >>>>> >>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote: >>>>>> cpumap has its own BH context based on kthread. It has a sane batch >>>>>> size of 8 frames per one cycle. >>>>>> GRO can be used on its own, adjust cpumap calls to the >>>>>> upper stack to use GRO API instead of netif_receive_skb_list() which >>>>>> processes skbs by batches, but doesn't involve GRO layer at all. >>>>>> It is most beneficial when a NIC which frame come from is XDP >>>>>> generic metadata-enabled, but in plenty of tests GRO performs better >>>>>> than listed receiving even given that it has to calculate full frame >>>>>> checksums on CPU. >>>>>> As GRO passes the skbs to the upper stack in the batches of >>>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the >>>>>> device where the frame comes from, it is enough to disable GRO >>>>>> netdev feature on it to completely restore the original behaviour: >>>>>> untouched frames will be being bulked and passed to the upper stack >>>>>> by 8, as it was with netif_receive_skb_list(). >>>>>> >>>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> >>>>>> --- >>>>>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++----- >>>>>> 1 file changed, 38 insertions(+), 5 deletions(-) >>>>>> >>>>> >>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think >>>>> cpumap is still missing this. >>> >>> The only concern for having GRO in cpumap without metadata from the NIC >>> descriptor was that when the checksum status is missing, GRO calculates >>> the checksum on CPU, which is not really fast. >>> But I remember sometimes GRO was faster despite that. 
>>> >>>>> >>>>> I have a production use case for this now. We want to do some intelligent >>>>> RX steering and I think GRO would help over list-ified receive in some cases. >>>>> We would prefer steer in HW (and thus get existing GRO support) but not all >>>>> our NICs support it. So we need a software fallback. >>>>> >>>>> Are you still interested in merging the cpumap + GRO patches? >>> >>> For sure I can revive this part. I was planning to get back to this >>> branch and pick patches which were not related to XDP hints and send >>> them separately. >>> >>>> >>>> Hi Daniel and Alex, >>>> >>>> Recently I worked on a PoC to add GRO support to cpumap codebase: >>>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e >>>> Here I added GRO support to cpumap through gro-cells. >>>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55 >>>> Here I added GRO support to cpumap trough napi-threaded APIs (with a some >>>> changes to them). >>> >>> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta >>> overkill, that's why I separated GRO structure from &napi_struct. >>> >>> Let me maybe find some free time, I would then test all 3 solutions >>> (mine, gro_cells, threaded NAPI) and pick/send the best? >>> >>>> >>>> Please note I have not run any performance tests so far, just verified it does >>>> not crash (I was planning to resume this work soon). Please let me know if it >>>> works for you. >> >> I did tests on both threaded NAPI for cpumap and my old implementation >> with a traffic generator and I have the following (in Kpps):

What kind of traffic is the traffic generator sending? E.g. is this a type of traffic that gets GRO aggregated?

>>            direct Rx   direct GRO   cpumap   cpumap GRO
>> baseline        2900         5800     2700   2700 (N/A)
>> threaded                              2300         4000
>> old GRO                               2300         4000

Nice results. Just to confirm, the units are in Kpps.

>> IOW, >> >> 1.
There are no differences in perf between Lorenzo's threaded NAPI >> GRO implementation and my old implementation, but Lorenzo's is also >> a very nice cleanup as it switches cpumap to threaded NAPI completely >> and the final diffstat even removes more lines than adds, while mine >> adds a bunch of lines and refactors a couple hundred, so I'd go with >> his variant. >> >> 2. After switching to NAPI, the performance without GRO decreases (2.3 >> Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely >> (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums >> manually. > > One question for this: IIUC, the benefit of GRO varies with the traffic > mix, depending on how much the GRO logic can actually aggregate. So did > you test the pathological case as well (spraying packets over so many > flows that there is basically no aggregation taking place)? Just to make > sure we don't accidentally screw up performance in that case while > optimising for the aggregating case :) > For the GRO use-case, I think a basic TCP stream throughput test (like netperf) should show a benefit once cpumap enable GRO, Can you confirm this? Or does the missing hardware RX-hash and RX-checksum cause TCP GRO not to fully work, yet? Thanks A LOT for doing this benchmarking! --Jesper
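The "checksum on CPU" cost discussed above is essentially the classic RFC 1071 Internet checksum folded over every byte of the payload. As a rough userspace illustration only (the kernel uses the much faster arch-optimized csum_partial(), not this naive loop):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive RFC 1071 Internet checksum, for illustration only -- the
 * kernel uses the far faster arch-optimized csum_partial(). Having
 * to walk every 16-bit word of the payload like this is what makes
 * software checksumming in GRO expensive when the NIC provides no
 * checksum status.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	while (len > 1) {
		sum += (uint32_t)data[0] << 8 | data[1];
		data += 2;
		len -= 2;
	}
	if (len)			/* odd trailing byte, zero-padded */
		sum += (uint32_t)data[0] << 8;

	while (sum >> 16)		/* fold carries into low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

This is why a checksum hint from the NIC descriptor (CHECKSUM_UNNECESSARY-style status) matters so much for the GRO path: it lets the stack skip this walk entirely.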
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Thu, 8 Aug 2024 13:57:00 +0200
>
> > From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> > Date: Thu, 8 Aug 2024 06:54:06 +0200
> >
[...]
>
> I did tests on both threaded NAPI for cpumap and my old implementation
> with a traffic generator and I have the following (in Kpps):
>
>            direct Rx    direct GRO    cpumap    cpumap GRO
> baseline   2900         5800          2700      2700 (N/A)
> threaded                              2300      4000
> old GRO                               2300      4000

cool, very nice improvement

> IOW,
>
> 1. There are no differences in perf between Lorenzo's threaded NAPI
>    GRO implementation and my old implementation, but Lorenzo's is also
>    a very nice cleanup as it switches cpumap to threaded NAPI completely
>    and the final diffstat even removes more lines than adds, while mine
>    adds a bunch of lines and refactors a couple hundred, so I'd go with
>    his variant.
>
> 2. After switching to NAPI, the performance without GRO decreases (2.3
>    Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
>    (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums
>    manually.
>
> Note that the code is not polished to the top and I also have a good
> improvement for allocating skb heads from the percpu NAPI cache in my
> old tree which I'm planning to add to the series, so the final
> improvement will be even bigger.
>
> + after we find how to pass checksum hint to cpumap, it will be yet
>   another big improvement for GRO (current code won't benefit from
>   this at all)
>
> To Lorenzo:
>
> Would it be fine if I prepare a series containing your patch for
> threaded NAPI for cpumap (I'd polish it and break into 2 or 3) +
> skb allocation optimization and send it OR you wanted to send this
> on your own? I'm fine with either, in the first case, everything
> would land within one series with the respective credits; in case
> of the latter, I'd need to send a followup :)

Sure, I am fine to send my codebase into a bigger series.

Thanks a lot for testing :)

Regards,
Lorenzo

> Thanks,
> Olek
On Aug 13, Alexander Lobakin wrote:
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Thu, 8 Aug 2024 13:57:00 +0200
>
[...]
>
> I did tests on both threaded NAPI for cpumap and my old implementation
> with a traffic generator and I have the following (in Kpps):
>
>            direct Rx    direct GRO    cpumap    cpumap GRO
> baseline   2900         5800          2700      2700 (N/A)
> threaded                              2300      4000
> old GRO                               2300      4000

out of my curiosity, have you tested even the gro_cells one?

Lorenzo

[...]
From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Date: Tue, 13 Aug 2024 18:27:41 +0200

> On Aug 13, Alexander Lobakin wrote:
>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>> Date: Thu, 8 Aug 2024 13:57:00 +0200
>>
[...]
>>
>> I did tests on both threaded NAPI for cpumap and my old implementation
>> with a traffic generator and I have the following (in Kpps):
>>
>>            direct Rx    direct GRO    cpumap    cpumap GRO
>> baseline   2900         5800          2700      2700 (N/A)
>> threaded                              2300      4000
>> old GRO                               2300      4000
>
> out of my curiosity, have you tested even the gro_cells one?

I haven't. I mean I could, but I don't feel like cpumap's kthread +
separate NAPI then could give better results than merged NAPI + kthread.

> Lorenzo

Thanks,
Olek
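For context, the gro-cells variant Lorenzo prototyped (the "kthread + separate NAPI" combination Olek refers to) follows the standard gro_cells pattern used by tunnel drivers: each CPU gets a cell with its own NAPI instance that performs the GRO. A non-runnable kernel-side sketch of that pattern — the cpumap hook points and names below are assumptions for illustration, not taken from the PoC branch:

```c
/* Kernel-side sketch only (won't build outside a kernel tree):
 * the generic gro_cells usage pattern. Function names and where
 * exactly this would hook into cpumap are hypothetical.
 */
#include <net/gro_cells.h>

static struct gro_cells cpumap_gcells;	/* hypothetical per-map state */

static int cpumap_gro_setup(struct net_device *dev)
{
	/* allocates one cell (each with its own NAPI) per possible CPU */
	return gro_cells_init(&cpumap_gcells, dev);
}

static int cpumap_gro_rx(struct sk_buff *skb)
{
	/* queues the skb to the local CPU's cell; its NAPI runs GRO */
	return gro_cells_receive(&cpumap_gcells, skb);
}

static void cpumap_gro_teardown(void)
{
	gro_cells_destroy(&cpumap_gcells);
}
```

This is exactly the "separate NAPI next to the kthread" shape Olek argues against above, versus merging cpumap's kthread into a threaded NAPI outright.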
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Tue, 13 Aug 2024 17:57:44 +0200

> On 13/08/2024 16.54, Toke Høiland-Jørgensen wrote:
>> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
>>
>>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>>> Date: Thu, 8 Aug 2024 13:57:00 +0200
>>>
>>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>>>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>>>
>>>>>> Hi Alexander,

[...]

>>> I did tests on both threaded NAPI for cpumap and my old implementation
>>> with a traffic generator and I have the following (in Kpps):
>>>
> What kind of traffic is the traffic generator sending?
>
> E.g. is this a type of traffic that gets GRO aggregated?

Yes. It's UDP, with the UDP GRO enabled on the receiver.

>>>            direct Rx    direct GRO    cpumap    cpumap GRO
>>> baseline   2900         5800          2700      2700 (N/A)
>>> threaded                              2300      4000
>>> old GRO                               2300      4000
>
> Nice results. Just to confirm, the units are in Kpps.

Yes. I.e. cpumap was giving 2.7 Mpps without GRO, then 4.0 Mpps with it.

>>> IOW,
>>>
>>> 1. There are no differences in perf between Lorenzo's threaded NAPI
>>>    GRO implementation and my old implementation, but Lorenzo's is also
>>>    a very nice cleanup as it switches cpumap to threaded NAPI completely
>>>    and the final diffstat even removes more lines than adds, while mine
>>>    adds a bunch of lines and refactors a couple hundred, so I'd go with
>>>    his variant.
>>>
>>> 2. After switching to NAPI, the performance without GRO decreases (2.3
>>>    Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
>>>    (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums
>>>    manually.
>>
>> One question for this: IIUC, the benefit of GRO varies with the traffic
>> mix, depending on how much the GRO logic can actually aggregate. So did
>> you test the pathological case as well (spraying packets over so many
>> flows that there is basically no aggregation taking place)? Just to make
>> sure we don't accidentally screw up performance in that case while
>> optimising for the aggregating case :)
>
> For the GRO use-case, I think a basic TCP stream throughput test (like
> netperf) should show a benefit once cpumap enables GRO. Can you confirm
> this?

Yes, TCP benefits as well.

> Or does the missing hardware RX-hash and RX-checksum cause TCP GRO not
> to fully work, yet?

GRO works well for both TCP and UDP. The main bottleneck is that GRO
calculates the checksum manually on the CPU now, since there's no
checksum status from the NIC.
Also, missing Rx hash means GRO will place packets from every flow into
the same bucket, but it's not a big deal (they get compared layer by
layer anyway).

> Thanks A LOT for doing this benchmarking!

I optimized the code a bit and picked my old patches for bulk NAPI skb
cache allocation and today I got 4.7 Mpps
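Olek's point about the missing Rx hash can be made concrete: GRO picks a per-NAPI flow list by masking the skb hash down to its bucket count (8 buckets in the kernel's GRO engine). With no hardware-provided hash, every packet reports hash 0 and all flows collapse into bucket 0. A tiny userspace model of that indexing (the constant mirrors the kernel's GRO_HASH_BUCKETS; this is an illustration, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define GRO_HASH_BUCKETS 8	/* same bucket count the kernel's GRO uses */

/* Userspace model of how GRO selects a flow list: the RX hash masked
 * to the bucket count. Without a HW-provided hash every skb reports
 * hash 0, so all flows land in bucket 0 and flow matching degenerates
 * into walking one long list (each candidate still compared layer by
 * layer, as noted above).
 */
static unsigned int gro_bucket(uint32_t rx_hash)
{
	return rx_hash & (GRO_HASH_BUCKETS - 1);
}
```

With real per-flow hashes the flows spread across all 8 lists; with hash forced to 0 they all share one, which only costs extra comparisons rather than breaking aggregation.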
Hi Olek,

On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
[..]
> > Thanks A LOT for doing this benchmarking!
>
> I optimized the code a bit and picked my old patches for bulk NAPI skb
> cache allocation and today I got 4.7 Mpps
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Tue, 20 Aug 2024 17:29:45 -0700

> Hi Olek,
>
> On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
> [..]
>>> Thanks A LOT for doing this benchmarking!
>>
>> I optimized the code a bit and picked my old patches for bulk NAPI skb
>> cache allocation and today I got 4.7 Mpps
On Wed, Aug 21, 2024 at 03:16:51PM GMT, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Tue, 20 Aug 2024 17:29:45 -0700
>
> > Hi Olek,
> >
> > On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
> > [..]
> >>> Thanks A LOT for doing this benchmarking!
> >>
> >> I optimized the code a bit and picked my old patches for bulk NAPI skb
> >> cache allocation and today I got 4.7 Mpps
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index f4860ac756cd..2d0edf8f6a05 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -29,8 +29,8 @@
 #include <trace/events/xdp.h>
 #include <linux/btf_ids.h>
 
-#include <linux/netdevice.h>	/* netif_receive_skb_list */
-#include <linux/etherdevice.h>	/* eth_type_trans */
+#include <linux/netdevice.h>
+#include <net/gro.h>
 
 /* General idea: XDP packets getting XDP redirected to another CPU,
  * will maximum be stored/queued for one driver ->poll() call. It is
@@ -67,6 +67,8 @@ struct bpf_cpu_map_entry {
 	struct bpf_cpumap_val value;
 	struct bpf_prog *prog;
 
+	struct gro_node gro;
+
 	atomic_t refcnt; /* Control when this struct can be free'ed */
 	struct rcu_head rcu;
@@ -162,6 +164,7 @@ static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
 	if (atomic_dec_and_test(&rcpu->refcnt)) {
 		if (rcpu->prog)
 			bpf_prog_put(rcpu->prog);
+		gro_cleanup(&rcpu->gro);
 		/* The queue should be empty at this point */
 		__cpu_map_ring_cleanup(rcpu->queue);
 		ptr_ring_cleanup(rcpu->queue, NULL);
@@ -295,6 +298,33 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 	return nframes;
 }
 
+static void cpu_map_gro_flush(struct bpf_cpu_map_entry *rcpu,
+			      struct list_head *list)
+{
+	bool new = !list_empty(list);
+
+	if (likely(new))
+		gro_receive_skb_list(&rcpu->gro, list);
+
+	if (rcpu->gro.bitmask) {
+		bool flush_old = HZ >= 1000;
+
+		/* If the ring is not empty, there'll be a new iteration
+		 * soon, and we only need to do a full flush if a tick is
+		 * long (> 1 ms).
+		 * If the ring is empty, to not hold GRO packets in the
+		 * stack for too long, do a full flush.
+		 * This is equivalent to how NAPI decides whether to perform
+		 * a full flush (by batches of up to 64 frames tho).
+		 */
+		if (__ptr_ring_empty(rcpu->queue))
+			flush_old = false;
+
+		__gro_flush(&rcpu->gro, flush_old);
+	}
+
+	gro_normal_list(&rcpu->gro);
+}
 
 static int cpu_map_kthread_run(void *data)
 {
@@ -384,7 +414,7 @@ static int cpu_map_kthread_run(void *data)
 		list_add_tail(&skb->list, &list);
 	}
 
-	netif_receive_skb_list(&list);
+	cpu_map_gro_flush(rcpu, &list);
 
 	/* Feedback loop via tracepoint */
 	trace_xdp_cpumap_kthread(rcpu->map_id, n, kmem_alloc_drops,
@@ -460,8 +490,10 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
 	rcpu->map_id = map->id;
 	rcpu->value.qsize = value->qsize;
 
+	gro_init(&rcpu->gro, NULL);
+
 	if (fd > 0 && __cpu_map_load_bpf_program(rcpu, map, fd))
-		goto free_ptr_ring;
+		goto free_gro;
 
 	/* Setup kthread */
 	rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa,
@@ -482,7 +514,8 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
 free_prog:
 	if (rcpu->prog)
 		bpf_prog_put(rcpu->prog);
-free_ptr_ring:
+free_gro:
+	gro_cleanup(&rcpu->gro);
 	ptr_ring_cleanup(rcpu->queue, NULL);
 free_queue:
 	kfree(rcpu->queue);
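The flush heuristic in cpu_map_gro_flush() above is compact but subtle. Extracted into a standalone userspace function (same logic as the diff; the name and parameters are just for illustration), it reads:

```c
#include <assert.h>
#include <stdbool.h>

/* Standalone model of the decision in cpu_map_gro_flush(): flush only
 * the "old" (previous-cycle) GRO flows when another kthread iteration
 * is imminent (ring not empty) AND the tick is short enough that held
 * packets won't sit for long (HZ >= 1000, i.e. a tick <= 1 ms).
 * Otherwise perform a full flush so nothing lingers in the GRO engine.
 */
static bool gro_flush_old_only(unsigned int hz, bool ring_empty)
{
	bool flush_old = hz >= 1000;	/* long tick -> must flush fully */

	if (ring_empty)			/* no imminent next iteration */
		flush_old = false;	/* -> full flush */

	return flush_old;
}
```

So held-back aggregation only happens in the one case where it is cheap: more frames are already queued and the tick-bound delay is at most a millisecond.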
cpumap has its own BH context based on a kthread. It has a sane batch
size of 8 frames per cycle.

GRO can be used on its own; adjust the cpumap calls to the upper stack
to use the GRO API instead of netif_receive_skb_list(), which processes
skbs in batches but doesn't involve the GRO layer at all.

It is most beneficial when the NIC the frames come from has XDP generic
metadata enabled, but in plenty of tests GRO performs better than
listified receiving even given that it has to calculate full frame
checksums on the CPU.

As GRO passes the skbs to the upper stack in batches of
@gro_normal_batch, i.e. 8 by default, and @skb->dev points to the
device the frame comes from, it is enough to disable the GRO netdev
feature on it to completely restore the original behaviour: untouched
frames will be bulked and passed to the upper stack by 8, as it was
with netif_receive_skb_list().

Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
 kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 5 deletions(-)
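The "bulked by 8" behaviour the commit message describes can be modelled in userspace: skbs that bypass GRO are queued and handed to the upper stack whenever the queue reaches @gro_normal_batch (8 by default), plus one final flush at the end of the poll cycle. A toy model counting the resulting upper-stack calls (names are illustrative; the kernel's gro_normal_one() actually batches on segment count):

```c
#include <assert.h>
#include <stddef.h>

#define GRO_NORMAL_BATCH 8	/* default net.core.gro_normal_batch */

/* Toy model of gro_normal_batch bulking: given N packets that bypass
 * GRO aggregation, count how many times a batch is passed up to the
 * stack -- once per full batch of 8, plus a final flush for any
 * remainder at the end of the cycle.
 */
static size_t count_batch_flushes(size_t npkts)
{
	size_t queued = 0, flushes = 0;

	for (size_t i = 0; i < npkts; i++) {
		if (++queued >= GRO_NORMAL_BATCH) {
			flushes++;	/* one netif_receive_skb_list()-style call */
			queued = 0;
		}
	}
	if (queued)
		flushes++;		/* final flush at end of the poll cycle */

	return flushes;
}
```

This is why disabling the GRO feature on the source netdev restores the old behaviour exactly: frames still reach the stack in the same groups of 8 that netif_receive_skb_list() produced.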