Message ID: 20240830162508.1009458-1-aleksander.lobakin@intel.com
Series: bpf: cpumap: enable GRO for XDP_PASS frames
On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
> * patch 4: switch cpumap from a custom kthread to a CPU-pinned
>   threaded NAPI;

Could you try to use the backlog NAPI? Allocating a fake netdev and using NAPI as a threading abstraction feels like an abuse. Maybe try to factor out the necessary bits? What we want is using the per-cpu caches, and feeding GRO. None of the IRQ related NAPI functionality fits in here.
> On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
> > * patch 4: switch cpumap from a custom kthread to a CPU-pinned
> >   threaded NAPI;
>
> Could you try to use the backlog NAPI? Allocating a fake netdev and using NAPI as a threading abstraction feels like an abuse. Maybe try to factor out the necessary bits? What we want is using the per-cpu caches, and feeding GRO. None of the IRQ related NAPI functionality fits in here.

I was thinking allocating a fake netdev to use NAPI APIs is quite a common approach, but sure, I will look into it.

Regards,
Lorenzo
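For readers outside the thread, the "fake netdev" pattern Lorenzo refers to usually looks roughly like the sketch below. This is illustrative only: the foo_* names are invented, foo_dequeue() is a placeholder for however the driver hands off received skbs, and only the NAPI/GRO calls are existing kernel APIs. The dummy net_device is never registered; it exists solely so a napi_struct can be attached and packets fed through napi_gro_receive().

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct foo_rx_ctx {
	struct net_device napi_dev;	/* dummy device, never registered */
	struct napi_struct napi;
};

/* Placeholder: however the driver actually queues up received skbs. */
static struct sk_buff *foo_dequeue(struct foo_rx_ctx *ctx);

static int foo_poll(struct napi_struct *napi, int budget)
{
	struct foo_rx_ctx *ctx = container_of(napi, struct foo_rx_ctx, napi);
	struct sk_buff *skb;
	int done = 0;

	while (done < budget && (skb = foo_dequeue(ctx))) {
		/* GRO coalescing instead of a plain netif_receive_skb() */
		napi_gro_receive(napi, skb);
		done++;
	}

	if (done < budget)
		napi_complete_done(napi, done);

	return done;
}

static void foo_rx_init(struct foo_rx_ctx *ctx)
{
	init_dummy_netdev(&ctx->napi_dev);
	netif_napi_add(&ctx->napi_dev, &ctx->napi, foo_poll);
	napi_enable(&ctx->napi);
}
```

Jakub's objection above is to the dummy device and to reusing the NAPI thread machinery as a generic threading abstraction, not to GRO itself.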
From: Jakub Kicinski <kuba@kernel.org>
Date: Tue, 3 Sep 2024 13:51:58 -0700

> On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
>> * patch 4: switch cpumap from a custom kthread to a CPU-pinned
>>   threaded NAPI;
>
> Could you try to use the backlog NAPI? Allocating a fake netdev and using NAPI as a threading abstraction feels like an abuse. Maybe try to factor out the necessary bits? What we want is using the per-cpu caches, and feeding GRO. None of the IRQ related NAPI functionality fits in here.

Lorenzo will try as he wrote. I can only add that in my old tree, I factored out the GRO bits and used them here just as you wrote. The perf was the same, but the diffstat was several hundred lines only to factor out stuff, while here the actual switch to NAPI removes more lines than it adds, the custom kthread logic is gone, etc. It just looks way more elegant and simple.

I could say that gro_cells also "abuses" NAPI the same way, don't you think? But nobody ever objected :>

Thanks,
Olek
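For reference, gro_cells (the precedent Olek cites) is the existing helper that tunnel-type drivers use to get per-CPU software NAPI plus GRO. A rough consumer-side sketch follows, with bar_* names invented for the example; the gro_cells_* calls are the real API from <net/gro_cells.h>:

```c
#include <linux/netdevice.h>
#include <net/gro_cells.h>

struct bar_priv {
	struct gro_cells gro_cells;
};

static int bar_dev_init(struct net_device *dev)
{
	struct bar_priv *priv = netdev_priv(dev);

	/* Allocates one NAPI context per CPU, tied to the real @dev */
	return gro_cells_init(&priv->gro_cells, dev);
}

static void bar_rx(struct net_device *dev, struct sk_buff *skb)
{
	struct bar_priv *priv = netdev_priv(dev);

	skb->dev = dev;
	/* Queue the skb to this CPU's cell; its NAPI poll later runs
	 * napi_gro_receive() from softirq context. */
	gro_cells_receive(&priv->gro_cells, skb);
}

static void bar_dev_uninit(struct net_device *dev)
{
	struct bar_priv *priv = netdev_priv(dev);

	gro_cells_destroy(&priv->gro_cells);
}
```

The distinction drawn in the follow-ups is that gro_cells hangs off a real, registered net_device and adds nothing to napi_struct, even though it is, as conceded below, a software-only NAPI user.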
On Wed, 4 Sep 2024 15:13:54 +0200 Alexander Lobakin wrote:
> > Could you try to use the backlog NAPI? Allocating a fake netdev and using NAPI as a threading abstraction feels like an abuse. Maybe try to factor out the necessary bits? What we want is using the per-cpu caches, and feeding GRO. None of the IRQ related NAPI functionality fits in here.
>
> Lorenzo will try as he wrote. I can only add that in my old tree, I factored out the GRO bits and used them here just as you wrote. The perf was the same, but the diffstat was several hundred lines only to factor out stuff, while here the actual switch to NAPI removes more lines than it adds, the custom kthread logic is gone, etc. It just looks way more elegant and simple.

Once again we seem to be arguing whether lower LoC is equivalent to better code? :) If we can use backlog NAPI it hopefully won't be as long. Maybe other, better approaches are within reach, too.

> I could say that gro_cells also "abuses" NAPI the same way, don't you think?

"same way"? :] Does it allocate a fake netdev, use NAPI as a threading abstraction or add extra fields to napi_struct?

If other maintainers disagree I won't be upset, but I'm worried that letting NAPI grow into some generic SW abstraction with broad use cases will hinder the ongoing queue config efforts.

> But nobody ever objected :>
From: Jakub Kicinski <kuba@kernel.org>
Date: Wed, 4 Sep 2024 07:50:41 -0700

> On Wed, 4 Sep 2024 15:13:54 +0200 Alexander Lobakin wrote:
>>> Could you try to use the backlog NAPI? Allocating a fake netdev and using NAPI as a threading abstraction feels like an abuse. Maybe try to factor out the necessary bits? What we want is using the per-cpu caches, and feeding GRO. None of the IRQ related NAPI functionality fits in here.
>>
>> Lorenzo will try as he wrote. I can only add that in my old tree, I factored out the GRO bits and used them here just as you wrote. The perf was the same, but the diffstat was several hundred lines only to factor out stuff, while here the actual switch to NAPI removes more lines than it adds, the custom kthread logic is gone, etc. It just looks way more elegant and simple.
>
> Once again we seem to be arguing whether lower LoC is equivalent to better code? :) If we can use backlog NAPI it hopefully won't be as long.

And once again I didn't say that explicitly :D When two patches work the same way, but one has a far shorter diffstat, we often prefer that one if it's correct. This one for cpumap looked correct to me and Lorenzo and we didn't have any other ideas, so I picked it. I didn't say "it's better than backlog NAPI because it's shorter"; the only thing I said re backlog NAPI is that we'll try it. I didn't think of it previously at all, I'm no backlog expert in general.

> Maybe other, better approaches are within reach, too.
>
>> I could say that gro_cells also "abuses" NAPI the same way, don't you think?
>
> "same way"? :] Does it allocate a fake netdev, use NAPI as a threading abstraction or add extra fields to napi_struct?

Wait wait wait, you said "NAPI IRQ related logic doesn't fit in here". I could say the same for gro_cells: IRQ related NAPI logic doesn't fit there either. gro_cells is an SW abstraction.

A fake netdev is used by multiple drivers to use GRO, you know that (popular for wireless drivers). They also conflict with the queue config effort.

> If other maintainers disagree I won't be upset, but I'm worried that letting NAPI grow into some generic SW abstraction with broad use cases will hinder the ongoing queue config efforts.
>
>> But nobody ever objected :>

Thanks,
Olek
On Wed, 4 Sep 2024 17:13:59 +0200 Alexander Lobakin wrote:
> >> I could say that gro_cells also "abuses" NAPI the same way, don't you think?
> >
> > "same way"? :] Does it allocate a fake netdev, use NAPI as a threading abstraction or add extra fields to napi_struct?
>
> Wait wait wait, you said "NAPI IRQ related logic doesn't fit in here". I could say the same for gro_cells: IRQ related NAPI logic doesn't fit there either. gro_cells is an SW abstraction.

Yes, that 1/4th of my complaint does indeed apply :)

> A fake netdev is used by multiple drivers to use GRO, you know that (popular for wireless drivers). They also conflict with the queue config effort.

And it did cause us some issues when changing netdev_priv() already.
On 03/09/2024 23.33, Lorenzo Bianconi wrote:
>> On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
>>> * patch 4: switch cpumap from a custom kthread to a CPU-pinned
>>>   threaded NAPI;
>>
>> Could you try to use the backlog NAPI? Allocating a fake netdev and using NAPI as a threading abstraction feels like an abuse. Maybe try to factor out the necessary bits? What we want is using the per-cpu caches, and feeding GRO. None of the IRQ related NAPI functionality fits in here.
>
> I was thinking allocating a fake netdev to use NAPI APIs is quite a common approach, but sure, I will look into it.

I have a use-case for cpumap where I adjust (increase) the kthread priority. Using backlog NAPI, will I still be able to change the scheduling priority?

--Jesper
> > On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
> > > * patch 4: switch cpumap from a custom kthread to a CPU-pinned
> > >   threaded NAPI;
> >
> > Could you try to use the backlog NAPI? Allocating a fake netdev and using NAPI as a threading abstraction feels like an abuse. Maybe try to factor out the necessary bits? What we want is using the per-cpu caches, and feeding GRO. None of the IRQ related NAPI functionality fits in here.
>
> I was thinking allocating a fake netdev to use NAPI APIs is quite a common approach, but sure, I will look into it.

From a first glance I think we could use the backlog NAPI APIs here in order to avoid allocating a dummy netdev. We could implement an approach similar to the one I used for cpumap + gro_cells here [0]. In particular, the cpumap kthread pinned on CPU 'n' can schedule the backlog NAPI associated with CPU 'n'. However, according to my understanding the backlog NAPI APIs (in process_backlog()) do not support GRO, right? Am I missing something?

Regards,
Lorenzo

[0] https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
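A minimal sketch of what scheduling the per-CPU backlog from the pinned kthread amounts to, and of the limitation Lorenzo points out; cpu_map_deliver_skb() is a hypothetical helper, not code from the series:

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Runs in the cpumap kthread, which is pinned to CPU n. */
static void cpu_map_deliver_skb(struct sk_buff *skb)
{
	/*
	 * netif_rx() queues the skb on the current CPU's backlog and
	 * raises NET_RX_SOFTIRQ, so with the kthread pinned this lands
	 * on CPU n's backlog NAPI. The catch: process_backlog() hands
	 * each packet to __netif_receive_skb() directly and never goes
	 * through the GRO engine, which matches Lorenzo's reading.
	 */
	netif_rx(skb);
}
```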
On Thu, 5 Sep 2024 19:01:42 +0200 Lorenzo Bianconi wrote:
> In particular, the cpumap kthread pinned on CPU 'n' can schedule the backlog NAPI associated with CPU 'n'. However, according to my understanding the backlog NAPI APIs (in process_backlog()) do not support GRO, right? Am I missing something?

I meant to use the struct directly, not to schedule it. All you need is GRO - feed it packets, flush it. But maybe you can avoid the netdev allocation and patch 3 in other ways. Using backlog NAPI was just the first thing that came to mind.
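Very roughly, "use the struct directly" could look like the sketch below: a napi_struct that is never scheduled, used only for its GRO state, fed a batch of packets and then flushed. cpu_map_gro_receive() is a made-up helper, not the posted code, and it assumes the napi_struct's GRO hash lists have already been initialized; today that initialization happens inside netif_napi_add(), which is exactly the bit that would need factoring out.

```c
#include <linux/netdevice.h>
#include <net/gro.h>

/* Push one batch of skbs from the cpumap kthread through GRO, then flush. */
static void cpu_map_gro_receive(struct napi_struct *napi,
				struct list_head *skbs)
{
	struct sk_buff *skb, *tmp;

	local_bh_disable();	/* GRO and the receive path expect BH context */

	list_for_each_entry_safe(skb, tmp, skbs, list) {
		skb_list_del_init(skb);
		napi_gro_receive(napi, skb);	/* coalesce or queue for delivery */
	}

	/* Flush whatever GRO is still holding so nothing waits for the
	 * next batch, then deliver the resulting skbs up the stack. */
	napi_gro_flush(napi, false);
	gro_normal_list(napi);

	local_bh_enable();
}
```

Lorenzo's rework linked later in the thread goes in this general direction, without the dummy netdev.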
> On Thu, 5 Sep 2024 19:01:42 +0200 Lorenzo Bianconi wrote:
> > In particular, the cpumap kthread pinned on CPU 'n' can schedule the backlog NAPI associated with CPU 'n'. However, according to my understanding the backlog NAPI APIs (in process_backlog()) do not support GRO, right? Am I missing something?
>
> I meant to use the struct directly, not to schedule it. All you need is GRO - feed it packets, flush it.

ack, thx for pointing this out.

> But maybe you can avoid the netdev allocation and patch 3 in other ways. Using backlog NAPI was just the first thing that came to mind.

ack, I will look into it.

Regards,
Lorenzo
> > On Thu, 5 Sep 2024 19:01:42 +0200 Lorenzo Bianconi wrote:
> > > In particular, the cpumap kthread pinned on CPU 'n' can schedule the backlog NAPI associated with CPU 'n'. However, according to my understanding the backlog NAPI APIs (in process_backlog()) do not support GRO, right? Am I missing something?
> >
> > I meant to use the struct directly, not to schedule it. All you need is GRO - feed it packets, flush it.
>
> ack, thx for pointing this out.
>
> > But maybe you can avoid the netdev allocation and patch 3 in other ways. Using backlog NAPI was just the first thing that came to mind.
>
> ack, I will look into it.

Hi all,

I reworked my previous implementation to add GRO support to the cpumap codebase, removing the dummy netdev dependency and keeping most of the other logic. You can find the codebase here:
- https://github.com/LorenzoBianconi/bpf-next/commit/e152cf8c212196fccece0b516190827430c0f5f8

I added the two patches below in order to reuse some generic NAPI code:
- https://github.com/LorenzoBianconi/bpf-next/commit/3c73e9c2f07486590749e9b3bfb8a4b3df4cb5e0
- https://github.com/LorenzoBianconi/bpf-next/commit/d435ce2e1b6a991a6264a5aad4a0374a3ca86a51

I have not run any performance tests yet, just functional ones.

Regards,
Lorenzo