Message ID | 1a4e901d6ecb9b091888c4d92256fa4a56cb83a4.1710715791.git.asml.silence@gmail.com (mailing list archive) |
---|---|
State | Deferred |
Delegated to: | Netdev Maintainers |
Series | [net-next,v2] net: cache for same cpu skb_attempt_defer_free |
On Mon, Mar 18, 2024 at 8:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> disable softirqs and put the buffer into cpu local caches.
>
> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,

I'm not sure we can stably gain this improvement. I ask because it
might just be a matter of run-to-run variance.

> I'd expect the win doubled with rx only benchmarks, as the optimisation
> is for the receive path, but the test spends >55% of CPU doing writes.

I wonder how you did this test? Could you tell us more, please.

[...]

> +	local_bh_disable();
> +	__napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);

__napi_kfree_skb() doesn't care much about why we drop in the rx path,
I think. How about replacing it with SKB_CONSUMED like napi_skb_finish()
does?

Thanks,
Jason

> +	local_bh_enable();
> +}
[...]
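[For reference, the change Jason suggests amounts to a one-line tweak of the
helper added by this patch; SKB_CONSUMED is the existing reason he points to
in napi_skb_finish(). This is only an illustration of the suggestion, not
something posted in the thread:

	local_bh_disable();
	__napi_kfree_skb(skb, SKB_CONSUMED);	/* "consumed", not "dropped" */
	local_bh_enable();
]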
On 3/18/24 02:44, Jason Xing wrote:
> On Mon, Mar 18, 2024 at 8:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
>> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>> disable softirqs and put the buffer into cpu local caches.
>>
>> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
>> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
>> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
>
> I'm not sure we can stably gain this improvement. I ask because it
> might just be a matter of run-to-run variance.

I guess it might be the kernel is even booting only by some factor of
chance, but no, the testing was quite stable :)

>> I'd expect the win doubled with rx only benchmarks, as the optimisation
>> is for the receive path, but the test spends >55% of CPU doing writes.
>
> I wonder how you did this test? Could you tell us more, please.

Well, the way I did it: choose a NIC, redirect all its IRQs to a single
CPU of your choice, and taskset your rx side to that CPU. You might want
to make sure there is not much traffic apart from the test program, but
I was lucky to have 2 NICs in the system. Instead of IRQs you can also
try to configure steering.

I used netbench [1] for both rx and tx, but that shouldn't be important
as long as you drive enough traffic; you can try iperf or something else.

[1] https://github.com/DylanZA/netbench

[...]

>> +	local_bh_disable();
>> +	__napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
>
> __napi_kfree_skb() doesn't care much about why we drop in the rx path,
> I think. How about replacing it with SKB_CONSUMED like napi_skb_finish()
> does?

I'm sticking with the current behaviour here: __kfree_skb(), which this
replaces, passes SKB_DROP_REASON_NOT_SPECIFIED. I can make the change if
maintainers ack it, but maybe it should better be done in a separate
patch with a proper explanation.
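[As a concrete illustration of the setup Pavel describes (not from the
thread): once the NIC's IRQs are steered to one CPU, the receive process is
pinned to that same CPU, so skbs end up freed on the CPU they were allocated
on and skb_attempt_defer_free() takes the same-CPU path this patch optimises.
The CPU number and doing it from C rather than via taskset are assumptions
made for the example:

/* Minimal sketch: pin the rx benchmark process to CPU 3, assuming the NIC's
 * IRQs were already redirected to CPU 3 via /proc/irq/<n>/smp_affinity.
 * Equivalent to launching the benchmark under "taskset -c 3". */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(3, &set);				/* assumed CPU handling NIC IRQs */
	if (sched_setaffinity(0, sizeof(set), &set))	/* 0 == this process */
		perror("sched_setaffinity");

	/* ... run or exec the rx side of netbench/iperf from here ... */
	return 0;
}
]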
On Mon, Mar 18, 2024 at 1:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> disable softirqs and put the buffer into cpu local caches.
>
> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
> I'd expect the win doubled with rx only benchmarks, as the optimisation
> is for the receive path, but the test spends >55% of CPU doing writes.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

[...]

1) net-next is currently closed.

2) No NUMA awareness. SLUB does not guarantee the sk_buff was on the
correct node.

3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(),
I wonder why trying to cache the sk_buff in this particular path is
needed.

Why not change __kfree_skb() instead ? All these helpers are becoming
a maze.
On 3/18/24 10:11, Eric Dumazet wrote:
> On Mon, Mar 18, 2024 at 1:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
>> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>> disable softirqs and put the buffer into cpu local caches.

[...]

> 1) net-next is currently closed.

Ok

> 2) No NUMA awareness. SLUB does not guarantee the sk_buff was on the
> correct node.

Let me see if I read you right. You're saying that SLUB can allocate an
skb from a different node, so skb->alloc_cpu might not be indicative of
the node, and so we might locally cache an skb of a foreign numa node?

If that's the case I don't see how it's different from the
cpu != raw_smp_processor_id() path, which will transfer the skb to
another cpu and still put it in the local cache in softirq.

> 3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(),
> I wonder why trying to cache the sk_buff in this particular path is
> needed.
>
> Why not change __kfree_skb() instead ?

IIRC kfree_skb() can be called from any context including irqoff, it's
convenient to have a function that just does the job without too much
of extra care. Theoretically it can have a separate path inside based
on irqs_disabled(), but that would be ugly.
On 3/18/24 11:41, Pavel Begunkov wrote:
> On 3/18/24 10:11, Eric Dumazet wrote:

[...]

>> 3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(),
>> I wonder why trying to cache the sk_buff in this particular path is
>> needed.
>>
>> Why not change __kfree_skb() instead ?
>
> IIRC kfree_skb() can be called from any context including irqoff,

On the other hand there is napi_pp_put_page() inside, which assumes we
are not in hardirq, so maybe not; I'd appreciate it if someone can
clarify. I was sure that drivers allocating in hardirq via e.g.
__netdev_alloc_skb() might want to use kfree_skb(), but maybe they're
consistently sticking to dev_kfree_skb*().

> it's convenient to have a function that just does the job without too
> much of extra care. Theoretically it can have a separate path inside
> based on irqs_disabled(), but that would be ugly.
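[To make the alternative under discussion concrete, here is a hypothetical
sketch of the "separate path based on irqs_disabled()" that Pavel calls ugly.
It is not part of the patch: kfree_skb_napi_cache() and __kfree_skb() are the
functions from the patch/kernel, the wrapper and its name are invented for
illustration, and it deliberately sidesteps the open question above about
whether the plain-free fallback is truly safe in every context:

/* Hypothetical, illustrative only -- not posted in the thread. */
static void kfree_skb_any_context(struct sk_buff *skb)
{
	if (in_hardirq() || irqs_disabled())
		/* can't play local_bh_disable() games here */
		__kfree_skb(skb);
	else
		/* same-CPU NAPI cache fast path from the patch */
		kfree_skb_napi_cache(skb);
}
]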
On Mon, Mar 18, 2024 at 12:43 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 3/18/24 10:11, Eric Dumazet wrote:

[...]

> > 2) No NUMA awareness. SLUB does not guarantee the sk_buff was on the
> > correct node.
>
> Let me see if I read you right. You're saying that SLUB can allocate an
> skb from a different node, so skb->alloc_cpu might not be indicative of
> the node, and so we might locally cache an skb of a foreign numa node?
>
> If that's the case I don't see how it's different from the
> cpu != raw_smp_processor_id() path, which will transfer the skb to
> another cpu and still put it in the local cache in softirq.

The big win for skb_attempt_defer_free() is for the many pages that are
attached to TCP incoming GRO packets.

A typical GRO packet can have 45 page fragments.

Pages are not recycled by a NIC if the NUMA node does not match.

If the win was only for sk_buff itself, I think we should have asked
SLUB maintainers why SLUB needs another cache to optimize SLUB fast
cache !

> > 3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(),
> > I wonder why trying to cache the sk_buff in this particular path is
> > needed.
> >
> > Why not change __kfree_skb() instead ?
>
> IIRC kfree_skb() can be called from any context including irqoff, it's
> convenient to have a function that just does the job without too much
> of extra care. Theoretically it can have a separate path inside based
> on irqs_disabled(), but that would be ugly.
Well, adding one case here, another here, and another here is also ugly.
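[Eric's NUMA point above concerns the page-recycling side: a page only gets
recycled back into a pool if it still lives on the node the pool/NIC expects.
A rough, simplified illustration of that kind of check — page_to_nid() and
numa_mem_id() are real kernel helpers, but this function is a stand-in, not
the actual page_pool code:

/* Simplified stand-in for the node check a recycling path can apply. */
static bool page_on_local_node(const struct page *page)
{
	/* refuse to recycle pages that ended up on a remote NUMA node */
	return page_to_nid(page) == numa_mem_id();
}
]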
On 3/18/24 14:33, Eric Dumazet wrote:
> On Mon, Mar 18, 2024 at 12:43 PM Pavel Begunkov <asml.silence@gmail.com> wrote:

[...]

>> Let me see if I read you right. You're saying that SLUB can allocate an
>> skb from a different node, so skb->alloc_cpu might not be indicative of
>> the node, and so we might locally cache an skb of a foreign numa node?
>>
>> If that's the case I don't see how it's different from the
>> cpu != raw_smp_processor_id() path, which will transfer the skb to
>> another cpu and still put it in the local cache in softirq.
>
> The big win for skb_attempt_defer_free() is for the many pages that are
> attached to TCP incoming GRO packets.
>
> A typical GRO packet can have 45 page fragments.

That's great and probably a dominating optimisation for GRO, however
it's a path that all TCP reads would go through, large GRO'ed skbs or
not. Even if there is a single frag, the patch at least enables the skb
cache and gives the frag a chance to hit the pp's lockless cache.

> Pages are not recycled by a NIC if the NUMA node does not match.

Great, looking it up it seems that it's only true for pages ending up
in the pp's ptr_ring but not recycled directly.

> If the win was only for sk_buff itself, I think we should have asked
> SLUB maintainers why SLUB needs another cache to optimize SLUB fast
> cache !
Well, it's just slow for hot paths; not awfully slow, but enough to be
noticeable and take 1-2% per allocation per request. There are caches
in io_uring because it was slow, there are caches in the block layer,
and there are skb and other caches in net, I assume all for the same
reason. It's not like I'm adding a new one. I remember someone was
poking into similar questions, but I haven't seen any drastic SLUB
performance changes.

Eric, I'm seriously getting lost in the arguments and don't really see
what you're ultimately against. Summing up the topics:

1) Is there a NUMA awareness problem in the patch that I missed? If so,
can you elaborate? I'll try to address it. Is it a regression compared
to the current state or something that would be nice to have?

2) Do you trust the numbers for the test described? I.e. the rx server
has multiple connections, each doing small packet (<100B) ping pong
style traffic. It's pinned to a CPU and all completions are ensured to
run on that CPU.

3) Do you agree that it's a valid use case? Or are you shrugging it off
as something that nobody cares about?

4) Maybe you're against the design?

>>> 3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(),
>>> I wonder why trying to cache the sk_buff in this particular path is
>>> needed.
>>>
>>> Why not change __kfree_skb() instead ?
>>
>> IIRC kfree_skb() can be called from any context including irqoff, it's
>> convenient to have a function that just does the job without too much
>> of extra care. Theoretically it can have a separate path inside based
>> on irqs_disabled(), but that would be ugly.
>
> Well, adding one case here, another here, and another here is also ugly.

And fwiw, irqs_disabled() is not great for hot paths. Not to the extent
of irq_disable(), but it still trashes the pipeline a bit, IIRC.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b99127712e67..35d37ae70a3d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
 EXPORT_SYMBOL(__skb_ext_put);
 #endif /* CONFIG_SKB_EXTENSIONS */
 
+static void kfree_skb_napi_cache(struct sk_buff *skb)
+{
+	/* if SKB is a clone, don't handle this case */
+	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	local_bh_disable();
+	__napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
+	local_bh_enable();
+}
+
 /**
  * skb_attempt_defer_free - queue skb for remote freeing
  * @skb: buffer
@@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
 	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
 	    !cpu_online(cpu) ||
 	    cpu == raw_smp_processor_id()) {
-nodefer:	__kfree_skb(skb);
+nodefer:	kfree_skb_napi_cache(skb);
 		return;
 	}
Optimise skb_attempt_defer_free() when run by the same CPU the skb was
allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
disable softirqs and put the buffer into cpu local caches.

CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
I'd expect the win doubled with rx only benchmarks, as the optimisation
is for the receive path, but the test spends >55% of CPU doing writes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---

v2: pass @napi_safe=true by using __napi_kfree_skb()

 net/core/skbuff.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)
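[To illustrate the shape of the optimisation in the commit message with a
self-contained analogy — this is not kernel code; the names, sizes and the
userspace setting are all invented: instead of handing a freed buffer
straight back to the allocator, the free path parks it in a small
CPU/thread-local cache so the next allocation can reuse it without touching
the allocator at all.

/* Userspace-flavoured sketch of a local free-list cache, for illustration
 * only. free() stands in for kmem_cache_free(); the per-thread array plays
 * the role of the per-CPU NAPI skb cache the patch taps into. */
#include <stdlib.h>

#define CACHE_SLOTS 64
#define BUF_SIZE    256		/* fixed object size, like a slab of sk_buffs */

static _Thread_local struct {
	void *slot[CACHE_SLOTS];
	unsigned int count;
} cache;

static void buf_free(void *buf)
{
	if (cache.count < CACHE_SLOTS)
		cache.slot[cache.count++] = buf;	/* cheap: stash locally */
	else
		free(buf);				/* cache full: real free */
}

static void *buf_alloc(void)
{
	if (cache.count)
		return cache.slot[--cache.count];	/* reuse a cached buffer */
	return malloc(BUF_SIZE);			/* slow path: allocator */
}
]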