[net-next,2/2] net: kfree_skb_list use kmem_cache_free_bulk

Message ID 167293336786.249536.14237439594457105125.stgit@firesoul (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series net: use kmem_cache_free_bulk in kfree_skb_list

Checks

Context                         Check    Description
netdev/tree_selection           success  Clearly marked for net-next
netdev/fixes_present            success  Fixes tag not required for -next series
netdev/subject_prefix           success  Link
netdev/cover_letter             success  Series has a cover letter
netdev/patch_count              success  Link
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers           success  CCed 5 of 5 maintainers
netdev/build_clang              success  Errors and warnings before: 1 this patch: 1
netdev/module_param             success  Was 0 now: 0
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  No Fixes tag
netdev/build_allmodconfig_warn  success  Errors and warnings before: 2 this patch: 2
netdev/checkpatch               warning  WARNING: Missing a blank line after declarations
netdev/kdoc                     success  Errors and warnings before: 0 this patch: 0
netdev/source_inline            success  Was 0 now: 0

Commit Message

Jesper Dangaard Brouer Jan. 5, 2023, 3:42 p.m. UTC
The kfree_skb_list function walks the SKB list (via skb->next) and frees
each SKB individually to the SLUB/SLAB allocator (kmem_cache). It is more
efficient to bulk free them via the kmem_cache_free_bulk API.

This patch creates a stack-local array of SKBs to bulk free while
walking the list. The bulk array size is limited to 16 SKBs as a
trade-off between stack usage and efficiency. The SLUB kmem_cache
"skbuff_head_cache" uses an objsize of 256 bytes, usually in an order-1
page of 8192 bytes, which gives 32 objects per slab (this can vary
across archs and due to SLUB sharing). Thus, for SLUB the optimal bulk
free case is 32 objects belonging to the same slab, but at runtime this
is unlikely to occur.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/skbuff.c |   39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)
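
The sizing argument in the commit message can be checked with a few lines of arithmetic. This is only a sketch: the constants are the figures quoted in the commit message (256-byte objects, 8192-byte order-1 slab, 16-entry bulk array), and the helper names are hypothetical, not kernel APIs. Real slab geometry depends on architecture and kernel configuration.

```c
#include <stddef.h>

/* Figures quoted in the commit message; assumed here, not read from a kernel. */
enum {
	SKB_OBJ_SIZE   = 256,  /* objsize of skbuff_head_cache in bytes */
	SLAB_BYTES     = 8192, /* order-1 slab: two 4096-byte pages */
	BULK_ARRAY_LEN = 16    /* KFREE_SKB_BULK_SIZE in the patch */
};

/* Number of whole objects that fit in one slab (hypothetical helper). */
size_t objs_per_slab(size_t slab_bytes, size_t obj_size)
{
	return slab_bytes / obj_size;
}

/* Stack bytes taken by an array of nr_ptrs object pointers (hypothetical helper). */
size_t bulk_array_stack_bytes(size_t nr_ptrs)
{
	return nr_ptrs * sizeof(void *);
}
```

With these inputs, objs_per_slab(SLAB_BYTES, SKB_OBJ_SIZE) gives the 32 objects per slab mentioned above, and a 16-entry pointer array costs 128 bytes of stack on a 64-bit build, half of what a 32-entry array would need.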

Comments

Saeed Mahameed Jan. 6, 2023, 8:09 p.m. UTC | #1
On 05 Jan 16:42, Jesper Dangaard Brouer wrote:
>The kfree_skb_list function walks SKB (via skb->next) and frees them
>individually to the SLUB/SLAB allocator (kmem_cache). It is more
>efficient to bulk free them via the kmem_cache_free_bulk API.
>
>This patches create a stack local array with SKBs to bulk free while
>walking the list. Bulk array size is limited to 16 SKBs to trade off
>stack usage and efficiency. The SLUB kmem_cache "skbuff_head_cache"
>uses objsize 256 bytes usually in an order-1 page 8192 bytes that is
>32 objects per slab (can vary on archs and due to SLUB sharing). Thus,
>for SLUB the optimal bulk free case is 32 objects belonging to same
>slab, but runtime this isn't likely to occur.
>
>Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Any performance numbers?

LGTM,
Reviewed-by: Saeed Mahameed <saeed@kernel.org>

Jakub Kicinski Jan. 6, 2023, 10:33 p.m. UTC | #2
Hi!

Would it not be better to try to actually defer them (queue to
the deferred free list and try to ship back to the NAPI cache of
the allocating core)? Is the spin lock on the defer list problematic
for forwarding cases (which I'm assuming is your target)?

Also the lack of perf numbers is a bit of a red flag.

On Thu, 05 Jan 2023 16:42:47 +0100 Jesper Dangaard Brouer wrote:
> +static void kfree_skb_defer_local(struct sk_buff *skb,
> +				  struct skb_free_array *sa,
> +				  enum skb_drop_reason reason)

If we wanna keep the implementation as is - I think we should rename
the thing to say "bulk" rather than "defer" to avoid confusion with 
the TCP's "defer to allocating core" scheme..

kfree_skb_list_bulk() ?

Jesper Dangaard Brouer Jan. 9, 2023, 12:24 p.m. UTC | #3
On 06/01/2023 23.33, Jakub Kicinski wrote:
> Hi!
> 
> Would it not be better to try to actually defer them (queue to
> the deferred free list and try to ship back to the NAPI cache of
> the allocating core)? 
> Is the spin lock on the defer list problematic
> for fowarding cases (which I'm assuming your target)?

We might be talking past each other.  For me, the NAPI cache
is the per-CPU napi_alloc_cache (this_cpu_ptr(&napi_alloc_cache);)

This napi_alloc_cache doesn't use a spin_lock, but depends on being
protected by NAPI context.  The code in this patch closely resembles how
the napi_alloc_cache works.  See code: napi_consume_skb() and
__kfree_skb_defer().


> Also the lack of perf numbers is a bit of a red flag.
>

I have run performance tests, but as I tried to explain in the
cover letter, for the qdisc use-case this code path is only activated
when we have overflow at enqueue.  Thus, this doesn't translate directly
into performance numbers, as the TX-qdisc is 100% full because the
hardware device is backed up, and this patch makes us spend less time on
freeing memory.

I have been using the pktgen script ./pktgen_bench_xmit_mode_queue_xmit.sh
which can inject packets at the qdisc layer (invoking __dev_queue_xmit).
I then used perf-record to see that the overhead of SLUB (__slab_free is
top#4) is reduced.


> On Thu, 05 Jan 2023 16:42:47 +0100 Jesper Dangaard Brouer wrote:
>> +static void kfree_skb_defer_local(struct sk_buff *skb,
>> +				  struct skb_free_array *sa,
>> +				  enum skb_drop_reason reason)
> 
> If we wanna keep the implementation as is - I think we should rename
> the thing to say "bulk" rather than "defer" to avoid confusion with
> the TCP's "defer to allocating core" scheme..

I named it "defer" because the NAPI cache uses "defer", specifically in
the function name __kfree_skb_defer(). That is why I chose
kfree_skb_defer_local(), as this patch uses a similar scheme.

I'm not sure what is meant by 'TCP's "defer to allocating core" scheme'.
Looking at the code, I guess you are referring to skb_attempt_defer_free()
and skb_defer_free_flush().

Calling skb_attempt_defer_free() for every SKB would cost too much
because of the expensive spin_lock_irqsave() (+ restore).  I see that
skb_defer_free_flush() can be improved to use spin_lock_irq() (avoiding
mangling CPU flags).  And skb_defer_free_flush() (which gets called from
RX-NAPI/net_rx_action) ends up calling napi_consume_skb(), which ends up
calling kmem_cache_free_bulk() (which I also do, just more directly).

> 
> kfree_skb_list_bulk() ?

Hmm, IMHO not really worth changing the function name.
kfree_skb_list() is called in more places than the qdisc enqueue-overflow
case, and those callers automatically benefit if we keep the function
name kfree_skb_list().

--Jesper

Jakub Kicinski Jan. 9, 2023, 7:34 p.m. UTC | #4
On Mon, 9 Jan 2023 13:24:54 +0100 Jesper Dangaard Brouer wrote:
> > Also the lack of perf numbers is a bit of a red flag.
> >  
> 
> I have run performance tests, but as I tried to explain in the
> cover letter, for the qdisc use-case this code path is only activated
> when we have overflow at enqueue.  Thus, this doesn't translate directly
> into a performance numbers, as TX-qdisc is 100% full caused by hardware
> device being backed up, and this patch makes us use less time on freeing
> memory.

I guess it's quite subjective, so it'd be good to get a third opinion.
To me that reads like a premature optimization. Saeed asked for perf
numbers, too.

Does anyone on the list want to cast the tie-break vote?

> I have been using pktgen script ./pktgen_bench_xmit_mode_queue_xmit.sh
> which can inject packets at the qdisc layer (invoking __dev_queue_xmit).
> And then used perf-record to see overhead of SLUB (__slab_free is top#4)
> is reduced.

Right, pktgen wasting time while still delivering line rate is not of
practical importance.

> > kfree_skb_list_bulk() ?  
> 
> Hmm, IMHO not really worth changing the function name.  The
> kfree_skb_list() is called in more places, (than qdisc enqueue-overflow
> case), which automatically benefits if we keep the function name
> kfree_skb_list().

To be clear - I was suggesting a simple
  s/kfree_skb_defer_local/kfree_skb_list_bulk/
on the patch, just renaming the static helper.

IMO, now that we have multiple freeing optimizations, using "defer" for
the TCP scheme and "bulk" for your prior slab bulk optimizations would
improve clarity.

Alexander Duyck Jan. 9, 2023, 10:10 p.m. UTC | #5
On Mon, 2023-01-09 at 11:34 -0800, Jakub Kicinski wrote:
> On Mon, 9 Jan 2023 13:24:54 +0100 Jesper Dangaard Brouer wrote:
> > > Also the lack of perf numbers is a bit of a red flag.
> > >  
> > 
> > I have run performance tests, but as I tried to explain in the
> > cover letter, for the qdisc use-case this code path is only activated
> > when we have overflow at enqueue.  Thus, this doesn't translate directly
> > into a performance numbers, as TX-qdisc is 100% full caused by hardware
> > device being backed up, and this patch makes us use less time on freeing
> > memory.
> 
> I guess it's quite subjective, so it'd be good to get a third opinion.
> To me that reads like a premature optimization. Saeed asked for perf
> numbers, too.
> 
> Does anyone on the list want to cast the tie-break vote?

I'd say there is some value to be gained by this. Basically it means
less overhead for dropping packets if we picked a backed up Tx path.

> > I have been using pktgen script ./pktgen_bench_xmit_mode_queue_xmit.sh
> > which can inject packets at the qdisc layer (invoking __dev_queue_xmit).
> > And then used perf-record to see overhead of SLUB (__slab_free is top#4)
> > is reduced.
> 
> Right, pktgen wasting time while still delivering line rate is not of
> practical importance.

I suspect there are probably more real world use cases out there.
Although to test it you would probably have to have a congested network
to really be able to show much of a benefit.

With pktgen I would be interested in seeing the Qdisc dropped
numbers with vs. without this patch. I would consider something like
that comparable to us doing an XDP_DROP test since all we are talking
about is a synthetic benchmark.

> 
> > > kfree_skb_list_bulk() ?  
> > 
> > Hmm, IMHO not really worth changing the function name.  The
> > kfree_skb_list() is called in more places, (than qdisc enqueue-overflow
> > case), which automatically benefits if we keep the function name
> > kfree_skb_list().
> 
> To be clear - I was suggesting a simple
>   s/kfree_skb_defer_local/kfree_skb_list_bulk/
> on the patch, just renaming the static helper.
> 
> IMO now that we have multiple freeing optimizations using "defer" for
> the TCP scheme and "bulk" for your prior slab bulk optimizations would
> improve clarity.

Rather than defer_local would it maybe make more sense to look at
naming it something like "kfree_skb_add_bulk"? Basically we are
building onto the list of buffers to free so I figure something like an
"add" or "append" would make sense.

Jesper Dangaard Brouer Jan. 10, 2023, 2:52 p.m. UTC | #6
On 09/01/2023 23.10, Alexander H Duyck wrote:
> On Mon, 2023-01-09 at 11:34 -0800, Jakub Kicinski wrote:
>> On Mon, 9 Jan 2023 13:24:54 +0100 Jesper Dangaard Brouer wrote:
>>>> Also the lack of perf numbers is a bit of a red flag.
>>>>   
>>>
>>> I have run performance tests, but as I tried to explain in the
>>> cover letter, for the qdisc use-case this code path is only activated
>>> when we have overflow at enqueue.  Thus, this doesn't translate directly
>>> into a performance numbers, as TX-qdisc is 100% full caused by hardware
>>> device being backed up, and this patch makes us use less time on freeing
>>> memory.
>>
>> I guess it's quite subjective, so it'd be good to get a third opinion.
>> To me that reads like a premature optimization. Saeed asked for perf
>> numbers, too.
>>
>> Does anyone on the list want to cast the tie-break vote?
> 
> I'd say there is some value to be gained by this. Basically it means
> less overhead for dropping packets if we picked a backed up Tx path.
> 

Thanks.

I have microbenchmarks[1] of kmem_cache bulking, which I use to assess
the (best-case) expected gain of using the bulk APIs.

The module 'slab_bulk_test01' results at a bulk size of 16 elements:

  kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16)
  kmem-bulk    Per elem: 64 cycles(tsc) 17.905 ns (step:16)

Thus, best-case expected gain is: 45 cycles(tsc) 12.627 ns.
  - With usual microbenchmarks caveats
  - Notice this is both bulk alloc and free

[1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
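
The best-case gain quoted above is simply the per-element difference between the two microbenchmark lines. A small sketch of that subtraction follows; the struct and function names are hypothetical and the inputs are the figures reported in this email, not new measurements.

```c
/* Per-element cost as reported by the microbenchmark (hypothetical type). */
struct per_elem_cost {
	unsigned int cycles; /* TSC cycles per element */
	double ns;           /* nanoseconds per element */
};

/* Best-case saving per element when moving from a loop to the bulk API. */
struct per_elem_cost bulk_gain(struct per_elem_cost in_loop,
			       struct per_elem_cost bulk)
{
	struct per_elem_cost gain = {
		.cycles = in_loop.cycles - bulk.cycles,
		.ns = in_loop.ns - bulk.ns,
	};
	return gain;
}
```

Feeding in {109, 30.532} for the loop case and {64, 17.905} for the bulk case reproduces the 45 cycles and 12.627 ns stated above. Since the benchmark covers both bulk alloc and bulk free, the free-only share of that gain is presumably smaller.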

>>> I have been using pktgen script ./pktgen_bench_xmit_mode_queue_xmit.sh
>>> which can inject packets at the qdisc layer (invoking __dev_queue_xmit).
>>> And then used perf-record to see overhead of SLUB (__slab_free is top#4)
>>> is reduced.
>>
>> Right, pktgen wasting time while still delivering line rate is not of
>> practical importance.
> 

I had better explain how I cause the push-back without hitting 10Gbit/s
line rate (as we/Linux cannot allocate SKBs fast enough for this).

I'm testing this on a 10Gbit/s interface (driver ixgbe). The challenge 
is that I need to overload the qdisc enqueue layer as that is triggering 
the call to kfree_skb_list().

Linux with SKBs and qdisc injection via pktgen is limited to producing
packets at (measured) 2,205,588 pps with a single TX-queue (and scaling
up at 1,951,771 pps per queue, or 512 ns per pkt). As a reminder, 10Gbit/s
with 64-byte packets is 14.8 Mpps (or 67.2 ns per pkt).
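
For readers who want to reproduce the 14.8 Mpps and 67.2 ns figures: they follow from the Ethernet wire overhead of 20 bytes per frame (7 bytes preamble, 1 byte start-of-frame delimiter, 12 bytes inter-frame gap) added to the 64-byte minimum frame. The sketch below assumes exactly that overhead; the function names are hypothetical.

```c
/* Packets per second a link can carry for a given frame size.
 * Assumes 20 bytes of per-frame wire overhead:
 * 7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap. */
double wire_pps(double link_bits_per_sec, unsigned int frame_bytes)
{
	const unsigned int wire_overhead_bytes = 20;

	return link_bits_per_sec / ((frame_bytes + wire_overhead_bytes) * 8.0);
}

/* Time budget per packet, in nanoseconds, at a given packet rate. */
double ns_per_pkt(double pps)
{
	return 1e9 / pps;
}
```

wire_pps(10e9, 64) comes out at roughly 14.88 million packets per second, i.e. 67.2 ns per packet, while ns_per_pkt(1951771) gives the roughly 512 ns per-queue cost mentioned above.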

The trick to trigger the qdisc push-back way earlier is Ethernet
flow-control (which is on by default).

I was a bit surprised to see that, when using
pktgen_bench_xmit_mode_queue_xmit.sh on my testlab, the remote host was
pushing back a lot, resulting in only 256Kpps actually being sent on the
wire. Monitored with ethtool stats script[2].


[2] 
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

> I suspect there are probably more real world use cases out there.
> Although to test it you would probably have to have a congested network
> to really be able to show much of a benefit.
> 
> With the pktgen I would be interested in seeing the Qdisc dropped
> numbers for with vs without this patch. I would consider something like
> that comparable to us doing an XDP_DROP test since all we are talking
> about is a synthetic benchmark.

The pktgen script outputs how many packets it has transmitted, but from
the above we know that most of these packets are actually getting
dropped, as only 256Kpps are reaching the wire.

Result line from pktgen script: count 100000000 (60byte,0frags)
  - Unpatched kernel: 2396594pps 1150Mb/sec (1150365120bps) errors: 1417469
  - Patched kernel  : 2479970pps 1190Mb/sec (1190385600bps) errors: 1422753

Difference:
  * +83376 pps faster (2479970-2396594)
  * -14 nanosec per packet (1/2479970-1/2396594)*10^9

The patched kernel is faster, by around the expected gain from using the
kmem_cache bulking API (slightly more, actually).

More raw data and notes for this email are available in [3]:
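
The per-packet time difference can be recomputed from the two pktgen result lines. This is a minimal sketch with a hypothetical helper name; the inputs are the pps values reported in this email.

```c
/* Change in per-packet time (ns) when the rate moves from pps_before to
 * pps_after. A negative result means less time is spent per packet. */
double ns_per_pkt_delta(double pps_before, double pps_after)
{
	return 1e9 / pps_after - 1e9 / pps_before;
}
```

ns_per_pkt_delta(2396594, 2479970) evaluates to about -14 ns, which sits close to the 12.627 ns best case measured by the kmem_cache bulk microbenchmark.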

  [3] 
https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org


>>
>>>> kfree_skb_list_bulk() ?
>>>
>>> Hmm, IMHO not really worth changing the function name.  The
>>> kfree_skb_list() is called in more places, (than qdisc enqueue-overflow
>>> case), which automatically benefits if we keep the function name
>>> kfree_skb_list().
>>
>> To be clear - I was suggesting a simple
>>    s/kfree_skb_defer_local/kfree_skb_list_bulk/
>> on the patch, just renaming the static helper.
>>

Okay, I get it now. But I disagree, for the same reason Alex gives below.

>> IMO now that we have multiple freeing optimizations using "defer" for
>> the TCP scheme and "bulk" for your prior slab bulk optimizations would
>> improve clarity.
> 
> Rather than defer_local would it maybe make more sense to look at
> naming it something like "kfree_skb_add_bulk"? Basically we are
> building onto the list of buffers to free so I figure something like an
> "add" or "append" would make sense.
> 

I agree with Alex, that we are building up buffers to be freed *later*,
thus we should somehow reflect that in the naming.

--Jesper

Jakub Kicinski Jan. 10, 2023, 8:20 p.m. UTC | #7
On Tue, 10 Jan 2023 15:52:48 +0100 Jesper Dangaard Brouer wrote:
> > Rather than defer_local would it maybe make more sense to look at
> > naming it something like "kfree_skb_add_bulk"? Basically we are
> > building onto the list of buffers to free so I figure something like an
> > "add" or "append" would make sense.
> 
> I agree with Alex

Alex's suggestion (kfree_skb_add_bulk) sgtm.

Jesper Dangaard Brouer Jan. 13, 2023, 1:42 p.m. UTC | #8
On 10/01/2023 21.20, Jakub Kicinski wrote:
> On Tue, 10 Jan 2023 15:52:48 +0100 Jesper Dangaard Brouer wrote:
>>> Rather than defer_local would it maybe make more sense to look at
>>> naming it something like "kfree_skb_add_bulk"? Basically we are
>>> building onto the list of buffers to free so I figure something like an
>>> "add" or "append" would make sense.
>>
>> I agree with Alex
> 
> Alex's suggestion (kfree_skb_add_bulk) sgtm.

Okay, great I'll use that and send a V2.

--Jesper

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 007a5fbe284b..e6fa667174d5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -964,16 +964,53 @@  kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
 }
 EXPORT_SYMBOL(kfree_skb_reason);
 
+#define KFREE_SKB_BULK_SIZE	16
+
+struct skb_free_array {
+	unsigned int skb_count;
+	void *skb_array[KFREE_SKB_BULK_SIZE];
+};
+
+static void kfree_skb_defer_local(struct sk_buff *skb,
+				  struct skb_free_array *sa,
+				  enum skb_drop_reason reason)
+{
+	/* if SKB is a clone, don't handle this case */
+	if (unlikely(skb->fclone != SKB_FCLONE_UNAVAILABLE)) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	skb_release_all(skb, reason);
+	sa->skb_array[sa->skb_count++] = skb;
+
+	if (unlikely(sa->skb_count == KFREE_SKB_BULK_SIZE)) {
+		kmem_cache_free_bulk(skbuff_head_cache, KFREE_SKB_BULK_SIZE,
+				     sa->skb_array);
+		sa->skb_count = 0;
+	}
+}
+
 void __fix_address
 kfree_skb_list_reason(struct sk_buff *segs, enum skb_drop_reason reason)
 {
+	struct skb_free_array sa;
+	sa.skb_count = 0;
+
 	while (segs) {
 		struct sk_buff *next = segs->next;
 
+		skb_mark_not_on_list(segs);
+
 		if (__kfree_skb_reason(segs, reason))
-			__kfree_skb(segs);
+			kfree_skb_defer_local(segs, &sa, reason);
+
 		segs = next;
 	}
+
+	if (sa.skb_count)
+		kmem_cache_free_bulk(skbuff_head_cache, sa.skb_count,
+				     sa.skb_array);
 }
 EXPORT_SYMBOL(kfree_skb_list_reason);
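
The batching pattern in the patch can be tried outside the kernel. The following is a userspace sketch only, not the kernel code: malloc()/free() stand in for the slab allocator, free_bulk() is a hypothetical stand-in for kmem_cache_free_bulk(), and all type and function names are invented for illustration. It shows the same three steps as the patch: collect pointers in a stack-local array while walking the list, flush when the array is full, and flush the remainder at the end.

```c
#include <stdlib.h>

#define BULK_SIZE 16 /* mirrors KFREE_SKB_BULK_SIZE in the patch */

struct node {
	struct node *next;
};

struct free_array {
	unsigned int count;
	void *objs[BULK_SIZE];
};

/* Counts flushes so the batching can be observed; illustration only. */
static unsigned int flush_calls;

/* Hypothetical stand-in for kmem_cache_free_bulk(): frees nr objects in one call. */
void free_bulk(size_t nr, void **objs)
{
	for (size_t i = 0; i < nr; i++)
		free(objs[i]);
	flush_calls++;
}

/* Queue one object; flush the whole array once it is full. */
void free_add_bulk(struct node *n, struct free_array *fa)
{
	fa->objs[fa->count++] = n;
	if (fa->count == BULK_SIZE) {
		free_bulk(BULK_SIZE, fa->objs);
		fa->count = 0;
	}
}

/* Walk the list, batch the frees, and return how many flushes were needed. */
unsigned int free_list(struct node *head)
{
	struct free_array fa = { .count = 0 };
	unsigned int before = flush_calls;

	while (head) {
		struct node *next = head->next;

		head->next = NULL; /* detach, like skb_mark_not_on_list() */
		free_add_bulk(head, &fa);
		head = next;
	}
	if (fa.count) /* flush the partial batch left at the end */
		free_bulk(fa.count, fa.objs);
	return flush_calls - before;
}

/* Build a singly linked list of n nodes (shorter if malloc fails). */
struct node *make_list(unsigned int n)
{
	struct node *head = NULL;

	while (n--) {
		struct node *node = malloc(sizeof(*node));

		if (!node)
			break;
		node->next = head;
		head = node;
	}
	return head;
}
```

For example, free_list(make_list(40)) releases 40 nodes in three flushes (16 + 16 + 8) instead of 40 individual free calls. Unlike the patch, this sketch has no equivalent of the fclone check or skb_release_all(); it only illustrates the array-and-flush structure.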