[net-next,0/5] skbuff: introduce skbuff_heads bulking and reusing

Message ID: 20210111182655.12159-1-alobakin@pm.me

Message

Alexander Lobakin Jan. 11, 2021, 6:27 p.m. UTC
Inspired by the logic of cpu_map_kthread_run() and _kfree_skb_defer().

Currently, all skb allocation paths allocate skbuff_heads one by one
via kmem_cache_alloc().
On the other hand, we have the percpu napi_alloc_cache to store
skbuff_heads queued up for freeing and flush them in bulk.

We can use this struct not only to cache and bulk the freeing, but
also to bulk the allocation of new skbuff_heads, and to reuse
cached-to-free heads instead of allocating new ones.
As accessing napi_alloc_cache implies NAPI softirq context, do this
only for __napi_alloc_skb() and its derivatives (napi_alloc_skb()
and napi_get_frags()). They have roughly 69 call sites, which is
quite a number.
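
A minimal sketch of the allocation side (illustrative only; the helper
name and the refill constant here are assumptions, not necessarily what
the patches end up using):

static struct sk_buff *napi_skb_cache_get(void)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

	/* refill the percpu cache in bulk when it runs empty */
	if (unlikely(!nc->skb_count))
		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
						      GFP_ATOMIC,
						      NAPI_SKB_CACHE_BULK,
						      nc->skb_cache);
	if (unlikely(!nc->skb_count))
		return NULL;

	/* hand out the most recently cached, i.e. cache-hot, head */
	return nc->skb_cache[--nc->skb_count];
}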

iperf3 showed a nice bump from 910 to 935 Mbps while performing
UDP VLAN NAT on a 1.2 GHz MIPS board. The boost is likely to be
way bigger on more powerful hosts and on NICs handling tens of Mpps.

Patches 1-2 are preparation steps, while 3-5 do the real work.

Alexander Lobakin (5):
  skbuff: rename fields of struct napi_alloc_cache to be more intuitive
  skbuff: open-code __build_skb() inside __napi_alloc_skb()
  skbuff: reuse skbuff_heads from flush_skb_cache if available
  skbuff: allocate skbuff_heads by bulks instead of one by one
  skbuff: refill skb_cache early from deferred-to-consume entries

 net/core/skbuff.c | 62 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 48 insertions(+), 14 deletions(-)

Comments

Eric Dumazet Jan. 12, 2021, 8:20 a.m. UTC | #1
On Mon, Jan 11, 2021 at 7:27 PM Alexander Lobakin <alobakin@pm.me> wrote:
>
> Inspired by the logic of cpu_map_kthread_run() and _kfree_skb_defer().
>
> Currently, all skb allocation paths allocate skbuff_heads one by one
> via kmem_cache_alloc().
> On the other hand, we have the percpu napi_alloc_cache to store
> skbuff_heads queued up for freeing and flush them in bulk.
>
> We can use this struct not only to cache and bulk the freeing, but
> also to bulk the allocation of new skbuff_heads, and to reuse
> cached-to-free heads instead of allocating new ones.
> As accessing napi_alloc_cache implies NAPI softirq context, do this
> only for __napi_alloc_skb() and its derivatives (napi_alloc_skb()
> and napi_get_frags()). They have roughly 69 call sites, which is
> quite a number.
>
> iperf3 showed a nice bump from 910 to 935 Mbps while performing
> UDP VLAN NAT on a 1.2 GHz MIPS board. The boost is likely to be
> way bigger on more powerful hosts and on NICs handling tens of Mpps.

What is the latency cost of these bulk allocations, and for TCP traffic
on which GRO is the norm?

Adding caches increases cache footprint when the cache is populated.

I wonder if your iperf3 numbers are simply wrong because of lack of
GRO in this UDP VLAN NAT case.

We are adding a lot of additional code, and thus icache pressure, that
iperf3 tests cannot really measure.

Most Linux devices simply handle one packet at a time (one packet per interrupt).
Edward Cree Jan. 12, 2021, 9:54 a.m. UTC | #2
Without wishing to weigh in on whether this caching is a good idea...
Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
 caches, to have a single larger cache, such that whenever it becomes full
 we bulk flush the top half, and when it's empty we bulk alloc the bottom
 half?  That should mean fewer branches, fewer instructions etc. than
 having to decide which cache to act upon every time.
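
In code, the flush side of such a single cache could look roughly like
this (a sketch; NAPI_SKB_CACHE_SIZE/HALF and the helper name are
assumed), with the alloc side bulk-refilling the bottom half whenever
skb_count hits zero:

static void napi_skb_cache_put(struct sk_buff *skb)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

	/* stash the freed head; flush the top half once full */
	nc->skb_cache[nc->skb_count++] = skb;
	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_HALF,
				     nc->skb_cache + NAPI_SKB_CACHE_HALF);
		nc->skb_count = NAPI_SKB_CACHE_HALF;
	}
}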

-ed
Alexander Lobakin Jan. 12, 2021, 10:56 a.m. UTC | #3
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 12 Jan 2021 09:20:39 +0100

> On Mon, Jan 11, 2021 at 7:27 PM Alexander Lobakin <alobakin@pm.me> wrote:
>>
>> Inspired by the logic of cpu_map_kthread_run() and _kfree_skb_defer().
>>
>> Currently, all skb allocation paths allocate skbuff_heads one by one
>> via kmem_cache_alloc().
>> On the other hand, we have the percpu napi_alloc_cache to store
>> skbuff_heads queued up for freeing and flush them in bulk.
>>
>> We can use this struct not only to cache and bulk the freeing, but
>> also to bulk the allocation of new skbuff_heads, and to reuse
>> cached-to-free heads instead of allocating new ones.
>> As accessing napi_alloc_cache implies NAPI softirq context, do this
>> only for __napi_alloc_skb() and its derivatives (napi_alloc_skb()
>> and napi_get_frags()). They have roughly 69 call sites, which is
>> quite a number.
>>
>> iperf3 showed a nice bump from 910 to 935 Mbps while performing
>> UDP VLAN NAT on a 1.2 GHz MIPS board. The boost is likely to be
>> way bigger on more powerful hosts and on NICs handling tens of Mpps.
>
> What is the latency cost of these bulk allocations, and for TCP traffic
> on which GRO is the norm?
>
> Adding caches increases cache footprint when the cache is populated.
>
> I wonder if your iperf3 numbers are simply wrong because of lack of
> GRO in this UDP VLAN NAT case.

Ah, I should've mentioned that I use UDP GRO Fraglists, so these
numbers are for GRO.

My board has been giving the full 1 Gbps (link speed) for TCP for more
than a year, so I can't really rely on TCP passthrough to measure
gains or regressions.

> We are adding a lot of additional code, and thus icache pressure, that
> iperf3 tests cannot really measure.

Not sure if the MIPS arch can provide enough debug information to
measure icache pressure, but I'll try to catch this.

> Most Linux devices simply handle one packet at a time (one packet per interrupt).

I disagree here: most modern NICs usually handle thousands of packets
per interrupt thanks to NAPI, hardware interrupt moderation and so
on, at least under high traffic load.
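
For context, a typical NAPI poll loop looks roughly like this (a
generic sketch; foo_ring and foo_clean_one_rx() are made-up names,
not any specific driver):

static int foo_poll(struct napi_struct *napi, int budget)
{
	struct foo_ring *ring = container_of(napi, struct foo_ring, napi);
	int work_done = 0;

	/* each poll handles up to 'budget' (usually 64) packets; under
	 * load the softirq re-polls without raising a new interrupt,
	 * so one hardware IRQ is amortized over thousands of packets
	 */
	while (work_done < budget && foo_clean_one_rx(ring))
		work_done++;

	if (work_done < budget)
		napi_complete_done(napi, work_done);	/* re-arm the IRQ */

	return work_done;
}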

Al
Alexander Lobakin Jan. 12, 2021, 11:08 a.m. UTC | #4
From: Edward Cree <ecree.xilinx@gmail.com>
Date: Tue, 12 Jan 2021 09:54:04 +0000

> Without wishing to weigh in on whether this caching is a good idea...

Well, we already have a cache to bulk flush "consumed" skbs, although
kmem_cache_free() is generally lighter than kmem_cache_alloc(), and
a page frag cache to allocate skb->head that also bulks the
operations, since it holds a (compound) page of size
max(SZ_32K, PAGE_SIZE).
If they didn't give any visible boosts, I don't think they would
have hit mainline.
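
For reference, the existing flush side works roughly like this
(paraphrased from net/core/skbuff.c as of 5.11; details may differ):

static inline void _kfree_skb_defer(struct sk_buff *skb)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

	/* drop skb->head and run any destructors */
	skb_release_all(skb);

	/* record the head in the percpu cache */
	nc->skb_cache[nc->skb_count++] = skb;

	/* flush the whole cache to the slab once it is full */
	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_SIZE,
				     nc->skb_cache);
		nc->skb_count = 0;
	}
}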

> Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
>  caches, to have a single larger cache, such that whenever it becomes full
>  we bulk flush the top half, and when it's empty we bulk alloc the bottom
>  half?  That should mean fewer branches, fewer instructions etc. than
>  having to decide which cache to act upon every time.

I thought about a unified cache, but couldn't decide whether to flush
or to allocate heads, and how many to process. Your suggestion answers
these questions and generally seems great. I'll try that one, thanks!

> -ed

Al
Eric Dumazet Jan. 12, 2021, 12:23 p.m. UTC | #5
On Tue, Jan 12, 2021 at 12:08 PM Alexander Lobakin <alobakin@pm.me> wrote:
>
> From: Edward Cree <ecree.xilinx@gmail.com>
> Date: Tue, 12 Jan 2021 09:54:04 +0000
>
> > Without wishing to weigh in on whether this caching is a good idea...
>
> Well, we already have a cache to bulk flush "consumed" skbs, although
> kmem_cache_free() is generally lighter than kmem_cache_alloc(), and
> a page frag cache to allocate skb->head that also bulks the
> operations, since it holds a (compound) page of size
> max(SZ_32K, PAGE_SIZE).
> If they didn't give any visible boosts, I don't think they would
> have hit mainline.
>
> > Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
> >  caches, to have a single larger cache, such that whenever it becomes full
> >  we bulk flush the top half, and when it's empty we bulk alloc the bottom
> >  half?  That should mean fewer branches, fewer instructions etc. than
> >  having to decide which cache to act upon every time.
>
> I thought about a unified cache, but couldn't decide whether to flush
> or to allocate heads, and how many to process. Your suggestion answers
> these questions and generally seems great. I'll try that one, thanks!
>


The thing is: kmalloc() is supposed to have batches already, and nice
per-cpu caches.

This looks like an mm issue; are we sure we want to get over it?

I would like a full analysis of why SLAB/SLUB does not work well for
your test workload.

More details, more numbers... before we accept yet another
'networking optimization' adding more code to the 'fast' path.

More code means more latency when all that code needs to be brought
into CPU caches.
Eric Dumazet Jan. 12, 2021, 12:32 p.m. UTC | #6
On Tue, Jan 12, 2021 at 11:56 AM Alexander Lobakin <alobakin@pm.me> wrote:
>

>
> Ah, I should've mentioned that I use UDP GRO Fraglists, so these
> numbers are for GRO.
>

Right, this suggests UDP GRO fraglist is a pathological case of GRO,
not saving memory.

Real GRO (TCP in most cases) will consume one skb, and have page
fragments for each segment.

Having skbs linked together is not cache friendly.

So I would try first to make this case better, instead of trying to
work around the real issue.
Alexander Lobakin Jan. 12, 2021, 6:26 p.m. UTC | #7
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 12 Jan 2021 13:32:56 +0100

> On Tue, Jan 12, 2021 at 11:56 AM Alexander Lobakin <alobakin@pm.me> wrote:
>>
>
>>
>> Ah, I should've mentioned that I use UDP GRO Fraglists, so these
>> numbers are for GRO.
>>
>
> Right, this suggests UDP GRO fraglist is a pathological case of GRO,
> not saving memory.
>
> Real GRO (TCP in most cases) will consume one skb, and have page
> fragments for each segment.
>
> Having skbs linked together is not cache friendly.

OK, so I reworked the test setup a bit to clarify things.

I disabled fraglists and GRO/GSO fraglists support advertisement
in the driver to exclude any "pathological" cases, and switched
from napi_get_frags() + napi_gro_frags() to napi_alloc_skb() +
napi_gro_receive() to disable local skb reusing (napi_reuse_skb()).
I also enabled GSO UDP L4 (the "classic" one: one skbuff_head + frags)
for forwarded traffic, not only local, and disabled NF flow offload
to increase CPU load and drop performance below link speed so I
could see the changes.

So, the traffic flows looked like:
 - TCP GRO (one head + frags) -> NAT -> hardware TSO;
 - UDP GRO (one head + frags) -> NAT -> driver-side GSO.

Baseline 5.11-rc3:
 - 865 Mbps TCP, 866 Mbps UDP.

This patch (both separate caches and Edward's unified cache):
 - 899 Mbps TCP, 893 Mbps UDP.

So that's clearly *not* only a "pathological" UDP GRO Fraglists
"problem": TCP also gained ~35 Mbps from this, as did
non-fraglisted UDP.

Regarding latencies: I remember there were discussions about latencies
when Edward introduced batched GRO (using linked lists to pass skbs
from the GRO layer to the core stack instead of passing them one by
one), so I think it's a perennial question when it comes to
batching/caching.

Thanks for the feedback, will post v2 soon.
The question of whether this caching is reasonable isn't closed anyway,
but I don't see significant cons for now.

> So I would try first to make this case better, instead of trying to
> work around the real issue.

Al
Eric Dumazet Jan. 12, 2021, 7:19 p.m. UTC | #8
On Tue, Jan 12, 2021 at 7:26 PM Alexander Lobakin <alobakin@pm.me> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Tue, 12 Jan 2021 13:32:56 +0100
>
> > On Tue, Jan 12, 2021 at 11:56 AM Alexander Lobakin <alobakin@pm.me> wrote:
> >>
> >
> >>
> >> Ah, I should've mentioned that I use UDP GRO Fraglists, so these
> >> numbers are for GRO.
> >>
> >
> > Right, this suggests UDP GRO fraglist is a pathological case of GRO,
> > not saving memory.
> >
> > Real GRO (TCP in most cases) will consume one skb, and have page
> > fragments for each segment.
> >
> > Having skbs linked together is not cache friendly.
>
> OK, so I reworked the test setup a bit to clarify things.
>
> I disabled fraglists and GRO/GSO fraglists support advertisement
> in the driver to exclude any "pathological" cases, and switched
> from napi_get_frags() + napi_gro_frags() to napi_alloc_skb() +
> napi_gro_receive() to disable local skb reusing (napi_reuse_skb()).
> I also enabled GSO UDP L4 (the "classic" one: one skbuff_head + frags)
> for forwarded traffic, not only local, and disabled NF flow offload
> to increase CPU load and drop performance below link speed so I
> could see the changes.
>
> So, the traffic flows looked like:
>  - TCP GRO (one head + frags) -> NAT -> hardware TSO;
>  - UDP GRO (one head + frags) -> NAT -> driver-side GSO.
>
> Baseline 5.11-rc3:
>  - 865 Mbps TCP, 866 Mbps UDP.
>
> This patch (both separate caches and Edward's unified cache):
>  - 899 Mbps TCP, 893 Mbps UDP.
>
> So that's clearly *not* only a "pathological" UDP GRO Fraglists
> "problem": TCP also gained ~35 Mbps from this, as did
> non-fraglisted UDP.
>
> Regarding latencies: I remember there were discussions about latencies
> when Edward introduced batched GRO (using linked lists to pass skbs
> from the GRO layer to the core stack instead of passing them one by
> one), so I think it's a perennial question when it comes to
> batching/caching.
>
> Thanks for the feedback, will post v2 soon.
> The question of whether this caching is reasonable isn't closed anyway,
> but I don't see significant cons for now.
>

Also it would be nice to have KASAN support.

We do not want to recycle stuff unconditionally, since this might
hide use-after-free bugs.
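
One way to keep KASAN effective would be to poison the heads while they
sit in the cache and unpoison them on reuse, e.g. (a sketch against the
unified cache discussed above, not something from this series):

	/* caching a head for reuse: poison it so stray accesses trap */
	kasan_poison_object_data(skbuff_head_cache, skb);
	nc->skb_cache[nc->skb_count++] = skb;

	/* handing a cached head back out */
	skb = nc->skb_cache[--nc->skb_count];
	kasan_unpoison_object_data(skbuff_head_cache, skb);

(Entries would also need to be unpoisoned again before being passed to
kmem_cache_free_bulk(), so the slab layer can touch them.)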
Jakub Kicinski Jan. 13, 2021, 1:02 a.m. UTC | #9
On Tue, 12 Jan 2021 13:23:16 +0100 Eric Dumazet wrote:
> On Tue, Jan 12, 2021 at 12:08 PM Alexander Lobakin <alobakin@pm.me> wrote:
> >
> > From: Edward Cree <ecree.xilinx@gmail.com>
> > Date: Tue, 12 Jan 2021 09:54:04 +0000
> >  
> > > Without wishing to weigh in on whether this caching is a good idea...  
> >
> > Well, we already have a cache to bulk flush "consumed" skbs, although
> > kmem_cache_free() is generally lighter than kmem_cache_alloc(), and
> > a page frag cache to allocate skb->head that also bulks the
> > operations, since it holds a (compound) page of size
> > max(SZ_32K, PAGE_SIZE).
> > If they didn't give any visible boosts, I don't think they would
> > have hit mainline.
> >  
> > > Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
> > >  caches, to have a single larger cache, such that whenever it becomes full
> > >  we bulk flush the top half, and when it's empty we bulk alloc the bottom
> > >  half?  That should mean fewer branches, fewer instructions etc. than
> > >  having to decide which cache to act upon every time.  
> >
> > I thought about a unified cache, but couldn't decide whether to flush
> > or to allocate heads, and how many to process. Your suggestion answers
> > these questions and generally seems great. I'll try that one, thanks!
>  
> The thing is: kmalloc() is supposed to have batches already, and nice
> per-cpu caches.
> 
> This looks like an mm issue; are we sure we want to get over it?
> 
> I would like a full analysis of why SLAB/SLUB does not work well for
> your test workload.

+1, it does feel like we're getting into mm territory

> More details, more numbers... before we accept yet another
> 'networking optimization' adding more code to the 'fast' path.
> 
> More code means more latency when all that code needs to be brought
> into CPU caches.
Eric Dumazet Jan. 13, 2021, 4:46 a.m. UTC | #10
On Wed, Jan 13, 2021 at 2:02 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 12 Jan 2021 13:23:16 +0100 Eric Dumazet wrote:
> > On Tue, Jan 12, 2021 at 12:08 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > >
> > > From: Edward Cree <ecree.xilinx@gmail.com>
> > > Date: Tue, 12 Jan 2021 09:54:04 +0000
> > >
> > > > Without wishing to weigh in on whether this caching is a good idea...
> > >
> > > Well, we already have a cache to bulk flush "consumed" skbs, although
> > > kmem_cache_free() is generally lighter than kmem_cache_alloc(), and
> > > a page frag cache to allocate skb->head that also bulks the
> > > operations, since it holds a (compound) page of size
> > > max(SZ_32K, PAGE_SIZE).
> > > If they didn't give any visible boosts, I don't think they would
> > > have hit mainline.
> > >
> > > > Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
> > > >  caches, to have a single larger cache, such that whenever it becomes full
> > > >  we bulk flush the top half, and when it's empty we bulk alloc the bottom
> > > >  half?  That should mean fewer branches, fewer instructions etc. than
> > > >  having to decide which cache to act upon every time.
> > >
> > > I thought about a unified cache, but couldn't decide whether to flush
> > > or to allocate heads, and how many to process. Your suggestion answers
> > > these questions and generally seems great. I'll try that one, thanks!
> >
> > The thing is: kmalloc() is supposed to have batches already, and nice
> > per-cpu caches.
> >
> > This looks like an mm issue; are we sure we want to get over it?
> >
> > I would like a full analysis of why SLAB/SLUB does not work well for
> > your test workload.
>
> +1, it does feel like we're getting into mm territory

I read the existing code, and with Edward Cree's idea of reusing the
existing cache (storage of pointers), this now all makes sense, since
there will not be much added code (just new storage of 64 pointers).

The remaining issue is to make sure KASAN will still work; we need
this to detect old and new bugs.

Thanks !
Jakub Kicinski Jan. 13, 2021, 5:03 p.m. UTC | #11
On Wed, 13 Jan 2021 05:46:05 +0100 Eric Dumazet wrote:
> On Wed, Jan 13, 2021 at 2:02 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Tue, 12 Jan 2021 13:23:16 +0100 Eric Dumazet wrote:  
> > > On Tue, Jan 12, 2021 at 12:08 PM Alexander Lobakin <alobakin@pm.me> wrote:  
> > > >
> > > > From: Edward Cree <ecree.xilinx@gmail.com>
> > > > Date: Tue, 12 Jan 2021 09:54:04 +0000
> > > >  
> > > > > Without wishing to weigh in on whether this caching is a good idea...  
> > > >
> > > > Well, we already have a cache to bulk flush "consumed" skbs, although
> > > > kmem_cache_free() is generally lighter than kmem_cache_alloc(), and
> > > > a page frag cache to allocate skb->head that also bulks the
> > > > operations, since it holds a (compound) page of size
> > > > max(SZ_32K, PAGE_SIZE).
> > > > If they didn't give any visible boosts, I don't think they would
> > > > have hit mainline.
> > > >  
> > > > > Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
> > > > >  caches, to have a single larger cache, such that whenever it becomes full
> > > > >  we bulk flush the top half, and when it's empty we bulk alloc the bottom
> > > > >  half?  That should mean fewer branches, fewer instructions etc. than
> > > > >  having to decide which cache to act upon every time.  
> > > >
> > > > I thought about a unified cache, but couldn't decide whether to flush
> > > > or to allocate heads, and how many to process. Your suggestion answers
> > > > these questions and generally seems great. I'll try that one, thanks!
> > >
> > > The thing is: kmalloc() is supposed to have batches already, and nice
> > > per-cpu caches.
> > >
> > > This looks like an mm issue; are we sure we want to get over it?
> > >
> > > I would like a full analysis of why SLAB/SLUB does not work well for
> > > your test workload.  
> >
> > +1, it does feel like we're getting into mm territory  
> 
> I read the existing code, and with Edward Cree's idea of reusing the
> existing cache (storage of pointers), this now all makes sense, since
> there will not be much added code (just new storage of 64 pointers).
> 
> The remaining issue is to make sure KASAN will still work; we need
> this to detect old and new bugs.

IDK much about MM, but we already have a kmem_cache for skbs and now
we're building a cache on top of a cache.  Shouldn't MM take care of
providing a per-CPU BH-only lockless cache?
Eric Dumazet Jan. 13, 2021, 5:15 p.m. UTC | #12
On Wed, Jan 13, 2021 at 6:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 13 Jan 2021 05:46:05 +0100 Eric Dumazet wrote:
> > On Wed, Jan 13, 2021 at 2:02 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Tue, 12 Jan 2021 13:23:16 +0100 Eric Dumazet wrote:
> > > > On Tue, Jan 12, 2021 at 12:08 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > >
> > > > > From: Edward Cree <ecree.xilinx@gmail.com>
> > > > > Date: Tue, 12 Jan 2021 09:54:04 +0000
> > > > >
> > > > > > Without wishing to weigh in on whether this caching is a good idea...
> > > > >
> > > > > Well, we already have a cache to bulk flush "consumed" skbs, although
> > > > > kmem_cache_free() is generally lighter than kmem_cache_alloc(), and
> > > > > a page frag cache to allocate skb->head that also bulks the
> > > > > operations, since it holds a (compound) page of size
> > > > > max(SZ_32K, PAGE_SIZE).
> > > > > If they didn't give any visible boosts, I don't think they would
> > > > > have hit mainline.
> > > > >
> > > > > > Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
> > > > > >  caches, to have a single larger cache, such that whenever it becomes full
> > > > > >  we bulk flush the top half, and when it's empty we bulk alloc the bottom
> > > > > >  half?  That should mean fewer branches, fewer instructions etc. than
> > > > > >  having to decide which cache to act upon every time.
> > > > >
> > > > > I thought about a unified cache, but couldn't decide whether to flush
> > > > > or to allocate heads, and how many to process. Your suggestion answers
> > > > > these questions and generally seems great. I'll try that one, thanks!
> > > >
> > > > The thing is: kmalloc() is supposed to have batches already, and nice
> > > > per-cpu caches.
> > > >
> > > > This looks like an mm issue; are we sure we want to get over it?
> > > >
> > > > I would like a full analysis of why SLAB/SLUB does not work well for
> > > > your test workload.
> > >
> > > +1, it does feel like we're getting into mm territory
> >
> > I read the existing code, and with Edward Cree's idea of reusing the
> > existing cache (storage of pointers), this now all makes sense, since
> > there will not be much added code (just new storage of 64 pointers).
> >
> > The remaining issue is to make sure KASAN will still work; we need
> > this to detect old and new bugs.
>
> IDK much about MM, but we already have a kmem_cache for skbs and now
> we're building a cache on top of a cache.  Shouldn't MM take care of
> providing a per-CPU BH-only lockless cache?

I think part of the improvement comes from bulk operations, which are
provided by the mm layer.

I also note that Alexander made no provision for NUMA awareness.
Probably reusing an skb located on a remote node will not be ideal.
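
A hypothetical guard for that (not part of the series) could simply
refuse to cache heads whose memory sits on a remote node:

	/* don't cache skbuff_heads from a remote NUMA node */
	if (page_to_nid(virt_to_page(skb)) != numa_mem_id()) {
		kmem_cache_free(skbuff_head_cache, skb);
		return;
	}
	nc->skb_cache[nc->skb_count++] = skb;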
Jakub Kicinski Jan. 13, 2021, 6:12 p.m. UTC | #13
On Wed, 13 Jan 2021 18:15:20 +0100 Eric Dumazet wrote:
> > IDK much about MM, but we already have a kmem_cache for skbs and now
> > we're building a cache on top of a cache.  Shouldn't MM take care of
> > providing a per-CPU BH-only lockless cache?  
> 
> I think part of the improvement comes from bulk operations, which are
> provided by the mm layer.
> 
> I also note that Alexander made no provision for NUMA awareness.
> Probably reusing an skb located on a remote node will not be ideal.

I was wondering about that yesterday, but couldn't really think
of a legitimate reason not to have XPS set up right. Do you have a
particular config in mind, or are we talking "default config"?

Also can't the skb _itself_ be pfmemalloc?

My main point is that I'm wondering whether this sort of cache would be
useful when allocating skbs for sockets. Assuming that the network
stack is not isolated to its own cores, won't fronting alloc_skb()
with

	bh_disable()
	try the cache
	bh_enable()

potentially help? In that sense, fronting the kmem_cache would feel
cleaner than our own little ring buffer.
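
A sketch of that idea (hypothetical: napi_skb_cache_get() is the
cache-lookup helper assumed earlier in the thread, and none of this is
an existing API):

static struct sk_buff *skb_head_alloc_frontend(gfp_t gfp)
{
	struct sk_buff *skb;

	local_bh_disable();		/* the percpu cache is BH-only */
	skb = napi_skb_cache_get();	/* may return NULL when empty */
	local_bh_enable();

	if (!skb)			/* fall back to the slab */
		skb = kmem_cache_alloc(skbuff_head_cache, gfp);

	return skb;
}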