Message ID | 167293336786.249536.14237439594457105125.stgit@firesoul (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | net: use kmem_cache_free_bulk in kfree_skb_list |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Clearly marked for net-next |
netdev/fixes_present | success | Fixes tag not required for -next series |
netdev/subject_prefix | success | Link |
netdev/cover_letter | success | Series has a cover letter |
netdev/patch_count | success | Link |
netdev/header_inline | success | No static functions without inline keyword in header files |
netdev/build_32bit | success | Errors and warnings before: 2 this patch: 2 |
netdev/cc_maintainers | success | CCed 5 of 5 maintainers |
netdev/build_clang | success | Errors and warnings before: 1 this patch: 1 |
netdev/module_param | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Signed-off-by tag matches author and committer |
netdev/check_selftest | success | No net selftest shell script |
netdev/verify_fixes | success | No Fixes tag |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 2 this patch: 2 |
netdev/checkpatch | warning | WARNING: Missing a blank line after declarations |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/source_inline | success | Was 0 now: 0 |

On 05 Jan 16:42, Jesper Dangaard Brouer wrote:
>The kfree_skb_list function walks SKB (via skb->next) and frees them
>individually to the SLUB/SLAB allocator (kmem_cache). It is more
>efficient to bulk free them via the kmem_cache_free_bulk API.
>
>This patch creates a stack local array with SKBs to bulk free while
>walking the list. Bulk array size is limited to 16 SKBs to trade off
>stack usage and efficiency. The SLUB kmem_cache "skbuff_head_cache"
>uses objsize 256 bytes usually in an order-1 page 8192 bytes that is
>32 objects per slab (can vary on archs and due to SLUB sharing). Thus,
>for SLUB the optimal bulk free case is 32 objects belonging to same
>slab, but at runtime this isn't likely to occur.
>
>Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

any performance numbers ?

LGTM,
Reviewed-by: Saeed Mahameed <saeed@kernel.org>

Hi!

Would it not be better to try to actually defer them (queue to
the deferred free list and try to ship back to the NAPI cache of
the allocating core)? Is the spin lock on the defer list problematic
for forwarding cases (which I'm assuming your target)?

Also the lack of perf numbers is a bit of a red flag.

On Thu, 05 Jan 2023 16:42:47 +0100 Jesper Dangaard Brouer wrote:
> +static void kfree_skb_defer_local(struct sk_buff *skb,
> +				  struct skb_free_array *sa,
> +				  enum skb_drop_reason reason)

If we wanna keep the implementation as is - I think we should rename
the thing to say "bulk" rather than "defer" to avoid confusion with
the TCP's "defer to allocating core" scheme..

kfree_skb_list_bulk() ?

On 06/01/2023 23.33, Jakub Kicinski wrote:
> Hi!
>
> Would it not be better to try to actually defer them (queue to
> the deferred free list and try to ship back to the NAPI cache of
> the allocating core)?
> Is the spin lock on the defer list problematic
> for forwarding cases (which I'm assuming your target)?

We might be talking past each other. As the NAPI cache for me is the
per CPU napi_alloc_cache (this_cpu_ptr(&napi_alloc_cache);)

This napi_alloc_cache doesn't use a spin_lock, but depends on being
protected by NAPI context. The code in this patch closely resembles how
the napi_alloc_cache works. See code: napi_consume_skb() and
__kfree_skb_defer().

> Also the lack of perf numbers is a bit of a red flag.
>

I have run performance tests, but as I tried to explain in the
cover letter, for the qdisc use-case this code path is only activated
when we have overflow at enqueue. Thus, this doesn't translate directly
into a performance numbers, as TX-qdisc is 100% full caused by hardware
device being backed up, and this patch makes us use less time on freeing
memory.

I have been using pktgen script ./pktgen_bench_xmit_mode_queue_xmit.sh
which can inject packets at the qdisc layer (invoking __dev_queue_xmit).
And then used perf-record to see overhead of SLUB (__slab_free is top#4)
is reduced.

> On Thu, 05 Jan 2023 16:42:47 +0100 Jesper Dangaard Brouer wrote:
>> +static void kfree_skb_defer_local(struct sk_buff *skb,
>> +				  struct skb_free_array *sa,
>> +				  enum skb_drop_reason reason)
>
> If we wanna keep the implementation as is - I think we should rename
> the thing to say "bulk" rather than "defer" to avoid confusion with
> the TCP's "defer to allocating core" scheme..

I named it "defer" because the NAPI cache uses "defer" specifically func
name __kfree_skb_defer() why I choose kfree_skb_defer_local(), as this
patch uses similar scheme.

I'm not sure what is meant by 'TCP's "defer to allocating core" scheme'.
Looking at code I guess you are referring to skb_attempt_defer_free()
and skb_defer_free_flush().

It would be too high cost calling skb_attempt_defer_free() for every SKB
because of the expensive spin_lock_irqsave() (+ restore).

I see the skb_defer_free_flush() can be improved to use spin_lock_irq()
(avoiding mangling CPU flags). And skb_defer_free_flush() (which gets
called from RX-NAPI/net_rx_action) ends up calling napi_consume_skb()
that ends up calling kmem_cache_free_bulk() (which I also do, just more
directly).

>
> kfree_skb_list_bulk() ?

Hmm, IMHO not really worth changing the function name. The
kfree_skb_list() is called in more places, (than qdisc enqueue-overflow
case), which automatically benefits if we keep the function name
kfree_skb_list().

--Jesper

On Mon, 9 Jan 2023 13:24:54 +0100 Jesper Dangaard Brouer wrote:
> > Also the lack of perf numbers is a bit of a red flag.
> >
>
> I have run performance tests, but as I tried to explain in the
> cover letter, for the qdisc use-case this code path is only activated
> when we have overflow at enqueue. Thus, this doesn't translate directly
> into a performance numbers, as TX-qdisc is 100% full caused by hardware
> device being backed up, and this patch makes us use less time on freeing
> memory.

I guess it's quite subjective, so it'd be good to get a third opinion.
To me that reads like a premature optimization. Saeed asked for perf
numbers, too.

Does anyone on the list want to cast the tie-break vote?

> I have been using pktgen script ./pktgen_bench_xmit_mode_queue_xmit.sh
> which can inject packets at the qdisc layer (invoking __dev_queue_xmit).
> And then used perf-record to see overhead of SLUB (__slab_free is top#4)
> is reduced.

Right, pktgen wasting time while still delivering line rate is not of
practical importance.

> > kfree_skb_list_bulk() ?
>
> Hmm, IMHO not really worth changing the function name. The
> kfree_skb_list() is called in more places, (than qdisc enqueue-overflow
> case), which automatically benefits if we keep the function name
> kfree_skb_list().

To be clear - I was suggesting a simple
  s/kfree_skb_defer_local/kfree_skb_list_bulk/
on the patch, just renaming the static helper.

IMO now that we have multiple freeing optimizations using "defer" for
the TCP scheme and "bulk" for your prior slab bulk optimizations would
improve clarity.

On Mon, 2023-01-09 at 11:34 -0800, Jakub Kicinski wrote:
> On Mon, 9 Jan 2023 13:24:54 +0100 Jesper Dangaard Brouer wrote:
> > > Also the lack of perf numbers is a bit of a red flag.
> > >
> >
> > I have run performance tests, but as I tried to explain in the
> > cover letter, for the qdisc use-case this code path is only activated
> > when we have overflow at enqueue. Thus, this doesn't translate directly
> > into a performance numbers, as TX-qdisc is 100% full caused by hardware
> > device being backed up, and this patch makes us use less time on freeing
> > memory.
>
> I guess it's quite subjective, so it'd be good to get a third opinion.
> To me that reads like a premature optimization. Saeed asked for perf
> numbers, too.
>
> Does anyone on the list want to cast the tie-break vote?

I'd say there is some value to be gained by this. Basically it means
less overhead for dropping packets if we picked a backed up Tx path.

> > I have been using pktgen script ./pktgen_bench_xmit_mode_queue_xmit.sh
> > which can inject packets at the qdisc layer (invoking __dev_queue_xmit).
> > And then used perf-record to see overhead of SLUB (__slab_free is top#4)
> > is reduced.
>
> Right, pktgen wasting time while still delivering line rate is not of
> practical importance.

I suspect there are probably more real world use cases out there.
Although to test it you would probably have to have a congested network
to really be able to show much of a benefit.

With the pktgen I would be interested in seeing the Qdisc dropped
numbers for with vs without this patch. I would consider something like
that comparable to us doing an XDP_DROP test since all we are talking
about is a synthetic benchmark.

>
> > > kfree_skb_list_bulk() ?
> >
> > Hmm, IMHO not really worth changing the function name. The
> > kfree_skb_list() is called in more places, (than qdisc enqueue-overflow
> > case), which automatically benefits if we keep the function name
> > kfree_skb_list().
>
> To be clear - I was suggesting a simple
>   s/kfree_skb_defer_local/kfree_skb_list_bulk/
> on the patch, just renaming the static helper.
>
> IMO now that we have multiple freeing optimizations using "defer" for
> the TCP scheme and "bulk" for your prior slab bulk optimizations would
> improve clarity.

Rather than defer_local would it maybe make more sense to look at
naming it something like "kfree_skb_add_bulk"? Basically we are
building onto the list of buffers to free so I figure something like an
"add" or "append" would make sense.

On 09/01/2023 23.10, Alexander H Duyck wrote:
> On Mon, 2023-01-09 at 11:34 -0800, Jakub Kicinski wrote:
>> On Mon, 9 Jan 2023 13:24:54 +0100 Jesper Dangaard Brouer wrote:
>>>> Also the lack of perf numbers is a bit of a red flag.
>>>>
>>>
>>> I have run performance tests, but as I tried to explain in the
>>> cover letter, for the qdisc use-case this code path is only activated
>>> when we have overflow at enqueue. Thus, this doesn't translate directly
>>> into a performance numbers, as TX-qdisc is 100% full caused by hardware
>>> device being backed up, and this patch makes us use less time on freeing
>>> memory.
>>
>> I guess it's quite subjective, so it'd be good to get a third opinion.
>> To me that reads like a premature optimization. Saeed asked for perf
>> numbers, too.
>>
>> Does anyone on the list want to cast the tie-break vote?
>
> I'd say there is some value to be gained by this. Basically it means
> less overhead for dropping packets if we picked a backed up Tx path.
>

Thanks.

I have microbenchmarks[1] of kmem_cache bulking, which I use to assess
what is the (best-case) expected gain of using the bulk APIs.

The module 'slab_bulk_test01' results at bulk 16 element:
 kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16)
 kmem-bulk    Per elem: 64 cycles(tsc) 17.905 ns (step:16)

Thus, best-case expected gain is: 45 cycles(tsc) 12.627 ns.
 - With usual microbenchmarks caveats
 - Notice this is both bulk alloc and free

[1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm

>>> I have been using pktgen script ./pktgen_bench_xmit_mode_queue_xmit.sh
>>> which can inject packets at the qdisc layer (invoking __dev_queue_xmit).
>>> And then used perf-record to see overhead of SLUB (__slab_free is top#4)
>>> is reduced.
>>
>> Right, pktgen wasting time while still delivering line rate is not of
>> practical importance.
>

I better explain how I cause the push-back without hitting 10Gbit/s line
rate (as we/Linux cannot allocate SKBs fast enough for this).

I'm testing this on a 10Gbit/s interface (driver ixgbe). The challenge is
that I need to overload the qdisc enqueue layer as that is triggering
the call to kfree_skb_list().

Linux with SKBs and qdisc injecting with pktgen is limited to producing
packets at (measured) 2,205,588 pps with a single TX-queue (and scaling
up 1,951,771 pps per queue or 512 ns per pkt). Reminder 10Gbit/s at 64
bytes packets is 14.8 Mpps (or 67.2 ns per pkt).

The trick to trigger the qdisc push-back way earlier is Ethernet
flow-control (which is on by default).

I was a bit surprised to see, but using
pktgen_bench_xmit_mode_queue_xmit.sh on my testlab the remote host was
pushing back a lot, resulting in only 256Kpps being actually sent on
wire. Monitored with ethtool stats script[2].

[2] https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

> I suspect there are probably more real world use cases out there.
> Although to test it you would probably have to have a congested network
> to really be able to show much of a benefit.
>
> With the pktgen I would be interested in seeing the Qdisc dropped
> numbers for with vs without this patch. I would consider something like
> that comparable to us doing an XDP_DROP test since all we are talking
> about is a synthetic benchmark.

The pktgen script outputs how many packets it has transmitted, but from
above we know that most of these packets are actually getting dropped
as only 256Kpps are reaching the wire.

Result line from pktgen script: count 100000000 (60byte,0frags)
 - Unpatched kernel: 2396594pps 1150Mb/sec (1150365120bps) errors: 1417469
 - Patched kernel  : 2479970pps 1190Mb/sec (1190385600bps) errors: 1422753

Difference:
 * +83376 pps faster (2479970-2396594)
 * -14 nanosec faster (1/2479970-1/2396594)*10^9

The patched kernel is faster. Around the expected gain from using the
kmem_cache bulking API (slightly more actually).

More raw data and notes for this email avail in [3]:

[3] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org

>>
>>>> kfree_skb_list_bulk() ?
>>>
>>> Hmm, IMHO not really worth changing the function name. The
>>> kfree_skb_list() is called in more places, (than qdisc enqueue-overflow
>>> case), which automatically benefits if we keep the function name
>>> kfree_skb_list().
>>
>> To be clear - I was suggesting a simple
>>   s/kfree_skb_defer_local/kfree_skb_list_bulk/
>> on the patch, just renaming the static helper.
>>

Okay, I get it now. But I disagree with same argument as Alex makes below.

>> IMO now that we have multiple freeing optimizations using "defer" for
>> the TCP scheme and "bulk" for your prior slab bulk optimizations would
>> improve clarity.
>
> Rather than defer_local would it maybe make more sense to look at
> naming it something like "kfree_skb_add_bulk"? Basically we are
> building onto the list of buffers to free so I figure something like an
> "add" or "append" would make sense.
>

I agree with Alex, that we are building up buffers to be freed *later*,
thus we should somehow reflect that in the naming.

--Jesper
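
The per-packet arithmetic quoted in the message above can be rechecked with a few lines of C. This is only a sketch for verifying the numbers: the helper names `ns_per_packet()` and `line_rate_pps()` are hypothetical and not part of the patch or of pktgen, and the 20 bytes of per-frame overhead (preamble, start-of-frame delimiter, inter-frame gap) is the standard Ethernet figure rather than something stated in the thread.

```c
#include <assert.h>

/* Hypothetical helper: nanoseconds spent per packet at a given rate. */
static double ns_per_packet(double pps)
{
	assert(pps > 0.0);
	return 1e9 / pps;
}

/*
 * Hypothetical helper: theoretical packet rate of an Ethernet link.
 * Assumption: every frame costs 20 extra bytes on the wire
 * (7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap).
 */
static double line_rate_pps(double link_bps, unsigned int frame_bytes)
{
	assert(link_bps > 0.0);
	return link_bps / ((frame_bytes + 20) * 8.0);
}

/* Time saved per packet when the rate rises from before_pps to after_pps. */
static double ns_saved_per_packet(double before_pps, double after_pps)
{
	return ns_per_packet(before_pps) - ns_per_packet(after_pps);
}
```

Called with the pktgen results from the message, `ns_saved_per_packet(2396594, 2479970)` comes out at roughly 14 ns, which is slightly above the 12.627 ns best-case gain from the microbenchmark. `line_rate_pps(10e9, 64)` gives about 14.88 Mpps, or 67.2 ns per packet.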

On Tue, 10 Jan 2023 15:52:48 +0100 Jesper Dangaard Brouer wrote:
> > Rather than defer_local would it maybe make more sense to look at
> > naming it something like "kfree_skb_add_bulk"? Basically we are
> > building onto the list of buffers to free so I figure something like an
> > "add" or "append" would make sense.
>
> I agree with Alex

Alex's suggestion (kfree_skb_add_bulk) sgtm.

On 10/01/2023 21.20, Jakub Kicinski wrote:
> On Tue, 10 Jan 2023 15:52:48 +0100 Jesper Dangaard Brouer wrote:
>>> Rather than defer_local would it maybe make more sense to look at
>>> naming it something like "kfree_skb_add_bulk"? Basically we are
>>> building onto the list of buffers to free so I figure something like an
>>> "add" or "append" would make sense.
>>
>> I agree with Alex
>
> Alex's suggestion (kfree_skb_add_bulk) sgtm.

Okay, great I'll use that and send a V2.

--Jesper

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 007a5fbe284b..e6fa667174d5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -964,16 +964,53 @@ kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
 }
 EXPORT_SYMBOL(kfree_skb_reason);
 
+#define KFREE_SKB_BULK_SIZE	16
+
+struct skb_free_array {
+	unsigned int skb_count;
+	void *skb_array[KFREE_SKB_BULK_SIZE];
+};
+
+static void kfree_skb_defer_local(struct sk_buff *skb,
+				  struct skb_free_array *sa,
+				  enum skb_drop_reason reason)
+{
+	/* if SKB is a clone, don't handle this case */
+	if (unlikely(skb->fclone != SKB_FCLONE_UNAVAILABLE)) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	skb_release_all(skb, reason);
+	sa->skb_array[sa->skb_count++] = skb;
+
+	if (unlikely(sa->skb_count == KFREE_SKB_BULK_SIZE)) {
+		kmem_cache_free_bulk(skbuff_head_cache, KFREE_SKB_BULK_SIZE,
+				     sa->skb_array);
+		sa->skb_count = 0;
+	}
+}
+
 void __fix_address
 kfree_skb_list_reason(struct sk_buff *segs, enum skb_drop_reason reason)
 {
+	struct skb_free_array sa;
+	sa.skb_count = 0;
+
 	while (segs) {
 		struct sk_buff *next = segs->next;
 
+		skb_mark_not_on_list(segs);
+
 		if (__kfree_skb_reason(segs, reason))
-			__kfree_skb(segs);
+			kfree_skb_defer_local(segs, &sa, reason);
+
 		segs = next;
 	}
+
+	if (sa.skb_count)
+		kmem_cache_free_bulk(skbuff_head_cache, sa.skb_count,
+				     sa.skb_array);
 }
 EXPORT_SYMBOL(kfree_skb_list_reason);
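
The batching pattern in the diff can be tried outside the kernel. What follows is a minimal user-space sketch, not the kernel code. `struct node`, `free_bulk()`, `add_bulk()`, `free_list()`, `make_list()` and the `bulk_calls` counter are hypothetical stand-ins for `struct sk_buff`, `kmem_cache_free_bulk()`, the static helper and `kfree_skb_list_reason()`, and plain `free()` takes the place of the slab allocator. It leaves out the clone check and the drop-reason handling and shows only the control flow: collect pointers in a stack-local array of 16, flush when the array is full, and flush the remainder after the walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BULK_SIZE 16	/* mirrors KFREE_SKB_BULK_SIZE in the patch */

/* Hypothetical stand-in for struct sk_buff: a singly linked node. */
struct node {
	struct node *next;
};

/* Stack-local batch, analogous to struct skb_free_array. */
struct free_array {
	unsigned int count;
	void *objs[BULK_SIZE];
};

/* Counts bulk calls; exists only so the sketch can be checked. */
static unsigned int bulk_calls;

/* Hypothetical stand-in for kmem_cache_free_bulk(): frees nr objects. */
static void free_bulk(size_t nr, void **objs)
{
	for (size_t i = 0; i < nr; i++)
		free(objs[i]);
	bulk_calls++;
}

/* Queue one object and flush the batch as soon as it is full. */
static void add_bulk(struct node *n, struct free_array *fa)
{
	assert(fa->count < BULK_SIZE);
	fa->objs[fa->count++] = n;

	if (fa->count == BULK_SIZE) {
		free_bulk(BULK_SIZE, fa->objs);
		fa->count = 0;
	}
}

/* Walk the list, batch every node, then flush whatever is left over. */
static void free_list(struct node *head)
{
	struct free_array fa = { .count = 0 };

	while (head) {
		struct node *next = head->next;

		head->next = NULL;	/* unlink, like skb_mark_not_on_list() */
		add_bulk(head, &fa);
		head = next;
	}

	if (fa.count)
		free_bulk(fa.count, fa.objs);
}

/* Build a list of n heap-allocated nodes for experimenting. */
static struct node *make_list(unsigned int n)
{
	struct node *head = NULL;

	while (n--) {
		struct node *node = malloc(sizeof(*node));

		if (!node)
			abort();
		node->next = head;
		head = node;
	}
	return head;
}
```

Freeing a 40-node list with `free_list(make_list(40))` results in three bulk calls (16 + 16 + 8) instead of 40 individual frees. Reducing the number of calls into the allocator in this way is what the patch aims for.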

The kfree_skb_list function walks SKB (via skb->next) and frees them
individually to the SLUB/SLAB allocator (kmem_cache). It is more
efficient to bulk free them via the kmem_cache_free_bulk API.

This patch creates a stack local array with SKBs to bulk free while
walking the list. Bulk array size is limited to 16 SKBs to trade off
stack usage and efficiency. The SLUB kmem_cache "skbuff_head_cache"
uses objsize 256 bytes usually in an order-1 page 8192 bytes that is
32 objects per slab (can vary on archs and due to SLUB sharing). Thus,
for SLUB the optimal bulk free case is 32 objects belonging to same
slab, but at runtime this isn't likely to occur.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/skbuff.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)
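
The sizing trade-off in the commit message comes down to two small calculations, sketched here in C. The helpers `objs_per_slab()` and `bulk_array_stack_bytes()` are hypothetical names used only for illustration. The 4096-byte page size is an assumption (the message itself notes this varies by architecture), and the slab calculation ignores any per-slab metadata.

```c
#include <assert.h>
#include <stddef.h>

#define KFREE_SKB_BULK_SIZE 16	/* value chosen in the patch */

/* Same layout as struct skb_free_array in the patch. */
struct skb_free_array {
	unsigned int skb_count;
	void *skb_array[KFREE_SKB_BULK_SIZE];
};

/*
 * Hypothetical helper: objects that fit in a slab of 2^order pages.
 * Assumption: no space is lost to slab metadata.
 */
static size_t objs_per_slab(unsigned int order, size_t page_size,
			    size_t obj_size)
{
	assert(obj_size > 0);
	return (page_size << order) / obj_size;
}

/* Hypothetical helper: stack bytes the on-stack batch occupies on this build. */
static size_t bulk_array_stack_bytes(void)
{
	return sizeof(struct skb_free_array);
}
```

With 4096-byte pages, `objs_per_slab(1, 4096, 256)` gives the 32 objects per slab mentioned in the message. On a typical 64-bit build the batch struct holds sixteen 8-byte pointers plus the counter, so it stays well under 200 bytes of stack.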