Message ID | 20210111182655.12159-1-alobakin@pm.me (mailing list archive)
---|---
Series | skbuff: introduce skbuff_heads bulking and reusing
On Mon, Jan 11, 2021 at 7:27 PM Alexander Lobakin <alobakin@pm.me> wrote:
>
> Inspired by cpu_map_kthread_run() and _kfree_skb_defer() logics.
>
> Currently, all sorts of skb allocation always do allocate
> skbuff_heads one by one via kmem_cache_alloc().
> On the other hand, we have percpu napi_alloc_cache to store
> skbuff_heads queued up for freeing and flush them by bulks.
>
> We can use this struct to cache and bulk not only freeing, but also
> allocation of new skbuff_heads, as well as to reuse cached-to-free
> heads instead of allocating the new ones.
> As accessing napi_alloc_cache implies NAPI softirq context, do this
> only for __napi_alloc_skb() and its derivatives (napi_alloc_skb()
> and napi_get_frags()). The rough amount of their call sites are 69,
> which is quite a number.
>
> iperf3 showed a nice bump from 910 to 935 Mbits while performing
> UDP VLAN NAT on 1.2 GHz MIPS board. The boost is likely to be
> way bigger on more powerful hosts and NICs with tens of Mpps.

What is the latency cost of these bulk allocations, and for TCP traffic
on which GRO is the norm?

Adding caches increases the cache footprint when the cache is populated.

I wonder if your iperf3 numbers are simply wrong because of lack of
GRO in this UDP VLAN NAT case.

We are adding a lot of additional code, thus icache pressure, that
iperf3 tests can not really measure.

Most linux devices simply handle one packet at a time (one packet per
interrupt).
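For context, this is roughly the shape of the existing flush-side cache the cover letter builds on, simplified from net/core/skbuff.c around the time of this thread (a sketch, not verbatim mainline code; details such as the SLUB prefetch are omitted):

```c
/* Simplified view of the per-CPU cache used today to batch freeing of
 * skbuff_heads on the NAPI path.
 */
#define NAPI_SKB_CACHE_SIZE	64

struct napi_alloc_cache {
	struct page_frag_cache page;		/* frag cache for skb->head */
	unsigned int skb_count;			/* heads currently stashed */
	void *skb_cache[NAPI_SKB_CACHE_SIZE];	/* heads queued for freeing */
};

static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);

static void _kfree_skb_defer(struct sk_buff *skb)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

	skb_release_all(skb);			/* drop skb->head, run destructors */
	nc->skb_cache[nc->skb_count++] = skb;	/* park the bare head */

	/* flush the whole cache back to the slab in one call once full */
	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_SIZE,
				     nc->skb_cache);
		nc->skb_count = 0;
	}
}
```

The series proposes to also serve allocations from this cache: reuse cached-to-free heads where possible and refill the cache with a bulk allocation otherwise, instead of calling kmem_cache_alloc() once per skb.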
Without wishing to weigh in on whether this caching is a good idea...

Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
caches, to have a single larger cache, such that whenever it becomes full
we bulk flush the top half, and when it's empty we bulk alloc the bottom
half? That should mean fewer branches, fewer instructions etc. than
having to decide which cache to act upon every time.

-ed
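A minimal sketch of what that unified scheme could look like, assuming the existing skbuff_head_cache and a NAPI_SKB_CACHE_SIZE-entry array; NAPI_SKB_CACHE_HALF and both helpers are illustrative names, not code from this series:

```c
#define NAPI_SKB_CACHE_HALF	(NAPI_SKB_CACHE_SIZE / 2)

/* Free path: stash the head; once the cache is full, bulk-free the
 * top half back to the slab allocator.
 */
static void napi_skb_cache_put(struct napi_alloc_cache *nc, struct sk_buff *skb)
{
	nc->skb_cache[nc->skb_count++] = skb;

	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_HALF,
				     nc->skb_cache + NAPI_SKB_CACHE_HALF);
		nc->skb_count = NAPI_SKB_CACHE_HALF;
	}
}

/* Alloc path: take a head from the same array; once it is empty,
 * bulk-allocate the bottom half in one go.
 */
static struct sk_buff *napi_skb_cache_get(struct napi_alloc_cache *nc)
{
	if (unlikely(!nc->skb_count)) {
		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
						      GFP_ATOMIC | __GFP_NOWARN,
						      NAPI_SKB_CACHE_HALF,
						      nc->skb_cache);
		if (unlikely(!nc->skb_count))
			return NULL;
	}

	return nc->skb_cache[--nc->skb_count];
}
```

The single skb_count then drives both directions: the free path only ever checks for "full" and the alloc path only for "empty", which is where the saved branches come from.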
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 12 Jan 2021 09:20:39 +0100

> On Mon, Jan 11, 2021 at 7:27 PM Alexander Lobakin <alobakin@pm.me> wrote:
>>
>> Inspired by cpu_map_kthread_run() and _kfree_skb_defer() logics.
>>
>> Currently, all sorts of skb allocation always do allocate
>> skbuff_heads one by one via kmem_cache_alloc().
>> On the other hand, we have percpu napi_alloc_cache to store
>> skbuff_heads queued up for freeing and flush them by bulks.
>>
>> We can use this struct to cache and bulk not only freeing, but also
>> allocation of new skbuff_heads, as well as to reuse cached-to-free
>> heads instead of allocating the new ones.
>> As accessing napi_alloc_cache implies NAPI softirq context, do this
>> only for __napi_alloc_skb() and its derivatives (napi_alloc_skb()
>> and napi_get_frags()). The rough amount of their call sites are 69,
>> which is quite a number.
>>
>> iperf3 showed a nice bump from 910 to 935 Mbits while performing
>> UDP VLAN NAT on 1.2 GHz MIPS board. The boost is likely to be
>> way bigger on more powerful hosts and NICs with tens of Mpps.
>
> What is the latency cost of these bulk allocations, and for TCP traffic
> on which GRO is the norm?
>
> Adding caches increases the cache footprint when the cache is populated.
>
> I wonder if your iperf3 numbers are simply wrong because of lack of
> GRO in this UDP VLAN NAT case.

Ah, I should've mentioned that I use UDP GRO Fraglists, so these
numbers are for GRO.

My board has been giving the full 1 Gbps (link speed) for TCP for more
than a year, so I can't really rely on TCP passthrough to measure the
gains or regressions.

> We are adding a lot of additional code, thus icache pressure, that
> iperf3 tests can not really measure.

Not sure if the MIPS arch can provide enough debug information to
measure icache pressure, but I'll try to catch this.

> Most linux devices simply handle one packet at a time (one packet per
> interrupt).

I disagree here: most modern NICs usually handle thousands of packets
per single interrupt due to NAPI, hardware interrupt moderation and so
on, at least in cases with high traffic load.

Al
From: Edward Cree <ecree.xilinx@gmail.com>
Date: Tue, 12 Jan 2021 09:54:04 +0000

> Without wishing to weigh in on whether this caching is a good idea...

Well, we already have a cache to bulk flush "consumed" skbs, although
kmem_cache_free() is generally lighter than kmem_cache_alloc(), and
a page frag cache to allocate skb->head that is also bulking the
operations, since it contains a (compound) page with the size of
min(SZ_32K, PAGE_SIZE).
If they didn't give any visible boosts, I don't think they would have
hit mainline.

> Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
> caches, to have a single larger cache, such that whenever it becomes full
> we bulk flush the top half, and when it's empty we bulk alloc the bottom
> half? That should mean fewer branches, fewer instructions etc. than
> having to decide which cache to act upon every time.

I thought about a unified cache, but couldn't decide whether to flush
or to allocate heads, and how much to process. Your suggestion answers
these questions and generally seems great. I'll try that one, thanks!

> -ed

Al
On Tue, Jan 12, 2021 at 12:08 PM Alexander Lobakin <alobakin@pm.me> wrote:
>
> From: Edward Cree <ecree.xilinx@gmail.com>
> Date: Tue, 12 Jan 2021 09:54:04 +0000
>
> > Without wishing to weigh in on whether this caching is a good idea...
>
> Well, we already have a cache to bulk flush "consumed" skbs, although
> kmem_cache_free() is generally lighter than kmem_cache_alloc(), and
> a page frag cache to allocate skb->head that is also bulking the
> operations, since it contains a (compound) page with the size of
> min(SZ_32K, PAGE_SIZE).
> If they didn't give any visible boosts, I don't think they would have
> hit mainline.
>
> > Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
> > caches, to have a single larger cache, such that whenever it becomes full
> > we bulk flush the top half, and when it's empty we bulk alloc the bottom
> > half? That should mean fewer branches, fewer instructions etc. than
> > having to decide which cache to act upon every time.
>
> I thought about a unified cache, but couldn't decide whether to flush
> or to allocate heads, and how much to process. Your suggestion answers
> these questions and generally seems great. I'll try that one, thanks!

The thing is: kmalloc() is supposed to have batches already, and nice
per-cpu caches.

This looks like an mm issue, are we sure we want to get over it?

I would like a full analysis of why SLAB/SLUB does not work well for
your test workload.

More details, more numbers... before we accept yet another
'networking optimization' adding more code to the 'fast' path.

More code means more latencies when all code needs to be brought up in
cpu caches.
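For reference, these are the batched slab calls being referred to, as declared in include/linux/slab.h; SLAB/SLUB maintain their own per-CPU object caches behind them:

```c
/* include/linux/slab.h -- batched alloc/free on a kmem_cache.
 * kmem_cache_alloc_bulk() returns the number of objects allocated
 * (0 on failure); kmem_cache_free_bulk() releases 'size' objects.
 */
int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
```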
On Tue, Jan 12, 2021 at 11:56 AM Alexander Lobakin <alobakin@pm.me> wrote:
>
> Ah, I should've mentioned that I use UDP GRO Fraglists, so these
> numbers are for GRO.

Right, this suggests UDP GRO fraglist is a pathological case of GRO,
not saving memory.

Real GRO (TCP in most cases) will consume one skb, and have page
fragments for each segment.

Having skbs linked together is not cache friendly.

So I would try first to make this case better, instead of trying to
work around the real issue.
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 12 Jan 2021 13:32:56 +0100

> On Tue, Jan 12, 2021 at 11:56 AM Alexander Lobakin <alobakin@pm.me> wrote:
>>
>> Ah, I should've mentioned that I use UDP GRO Fraglists, so these
>> numbers are for GRO.
>
> Right, this suggests UDP GRO fraglist is a pathological case of GRO,
> not saving memory.
>
> Real GRO (TCP in most cases) will consume one skb, and have page
> fragments for each segment.
>
> Having skbs linked together is not cache friendly.

OK, so I reworked the test setup a bit to clarify things.

I disabled fraglists and GRO/GSO fraglists support advertisement
in the driver to exclude any "pathological" cases, and switched it
from napi_get_frags() + napi_gro_frags() to napi_alloc_skb() +
napi_gro_receive() to disable local skb reusing (napi_reuse_skb()).
I also enabled GSO UDP L4 ("classic" one: one skbuff_head + frags)
for forwarding, not only local traffic, and disabled NF flow offload
to increase CPU load and drop performance below link speed so I
could see the changes.

So, the traffic flows looked like:
 - TCP GRO (one head + frags) -> NAT -> hardware TSO;
 - UDP GRO (one head + frags) -> NAT -> driver-side GSO.

Baseline 5.11-rc3:
 - 865 Mbps TCP, 866 Mbps UDP.

This patch (both separate caches and Edward's unified cache):
 - 899 Mbps TCP, 893 Mbps UDP.

So that's clearly *not* only a "pathological" UDP GRO Fraglists
"problem", as TCP also gained ~35 Mbps from this, as well as
non-fraglisted UDP.

Regarding latencies: I remember there were talks about latencies when
Edward introduced batched GRO (using linked lists to pass skbs from
the GRO layer to the core stack instead of passing them one by one),
so I think it's a perennial question when it comes to batching/caching.

Thanks for the feedback, will post v2 soon.
The question of whether this caching is reasonable isn't closed anyway,
but I don't see significant "cons" for now.

> So I would try first to make this case better, instead of trying to
> work around the real issue.

Al
On Tue, Jan 12, 2021 at 7:26 PM Alexander Lobakin <alobakin@pm.me> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Tue, 12 Jan 2021 13:32:56 +0100
>
> > On Tue, Jan 12, 2021 at 11:56 AM Alexander Lobakin <alobakin@pm.me> wrote:
> >>
> >> Ah, I should've mentioned that I use UDP GRO Fraglists, so these
> >> numbers are for GRO.
> >
> > Right, this suggests UDP GRO fraglist is a pathological case of GRO,
> > not saving memory.
> >
> > Real GRO (TCP in most cases) will consume one skb, and have page
> > fragments for each segment.
> >
> > Having skbs linked together is not cache friendly.
>
> OK, so I reworked the test setup a bit to clarify things.
>
> I disabled fraglists and GRO/GSO fraglists support advertisement
> in the driver to exclude any "pathological" cases, and switched it
> from napi_get_frags() + napi_gro_frags() to napi_alloc_skb() +
> napi_gro_receive() to disable local skb reusing (napi_reuse_skb()).
> I also enabled GSO UDP L4 ("classic" one: one skbuff_head + frags)
> for forwarding, not only local traffic, and disabled NF flow offload
> to increase CPU load and drop performance below link speed so I
> could see the changes.
>
> So, the traffic flows looked like:
>  - TCP GRO (one head + frags) -> NAT -> hardware TSO;
>  - UDP GRO (one head + frags) -> NAT -> driver-side GSO.
>
> Baseline 5.11-rc3:
>  - 865 Mbps TCP, 866 Mbps UDP.
>
> This patch (both separate caches and Edward's unified cache):
>  - 899 Mbps TCP, 893 Mbps UDP.
>
> So that's clearly *not* only a "pathological" UDP GRO Fraglists
> "problem", as TCP also gained ~35 Mbps from this, as well as
> non-fraglisted UDP.
>
> Regarding latencies: I remember there were talks about latencies when
> Edward introduced batched GRO (using linked lists to pass skbs from
> the GRO layer to the core stack instead of passing them one by one),
> so I think it's a perennial question when it comes to batching/caching.
>
> Thanks for the feedback, will post v2 soon.
> The question of whether this caching is reasonable isn't closed anyway,
> but I don't see significant "cons" for now.

Also it would be nice to have KASAN support. We do not want to
unconditionally recycle stuff, since this might hide use-after-free
bugs.
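As an illustration of the KASAN point (hypothetical helpers extending the earlier sketch, not code from the series): a head parked in the cache can be poisoned, so touching it before legitimate reuse is still reported as a use-after-free.

```c
/* Hypothetical recycling helpers, shown only to illustrate the KASAN
 * concern: a head sitting in the per-CPU cache stays poisoned, so any
 * access to it before it is handed out again trips KASAN.
 */
static void napi_skb_cache_put(struct napi_alloc_cache *nc, struct sk_buff *skb)
{
	kasan_poison_object_data(skbuff_head_cache, skb);
	nc->skb_cache[nc->skb_count++] = skb;
}

static struct sk_buff *napi_skb_cache_get(struct napi_alloc_cache *nc)
{
	struct sk_buff *skb;

	if (!nc->skb_count)
		return NULL;	/* caller falls back to kmem_cache_alloc() */

	skb = nc->skb_cache[--nc->skb_count];
	kasan_unpoison_object_data(skbuff_head_cache, skb);

	return skb;
}
```

A bulk flush out of the cache would likewise need to unpoison the entries before handing them to kmem_cache_free_bulk().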
On Tue, 12 Jan 2021 13:23:16 +0100 Eric Dumazet wrote:
> On Tue, Jan 12, 2021 at 12:08 PM Alexander Lobakin <alobakin@pm.me> wrote:
> >
> > From: Edward Cree <ecree.xilinx@gmail.com>
> > Date: Tue, 12 Jan 2021 09:54:04 +0000
> >
> > > Without wishing to weigh in on whether this caching is a good idea...
> >
> > Well, we already have a cache to bulk flush "consumed" skbs, although
> > kmem_cache_free() is generally lighter than kmem_cache_alloc(), and
> > a page frag cache to allocate skb->head that is also bulking the
> > operations, since it contains a (compound) page with the size of
> > min(SZ_32K, PAGE_SIZE).
> > If they didn't give any visible boosts, I don't think they would have
> > hit mainline.
> >
> > > Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
> > > caches, to have a single larger cache, such that whenever it becomes full
> > > we bulk flush the top half, and when it's empty we bulk alloc the bottom
> > > half? That should mean fewer branches, fewer instructions etc. than
> > > having to decide which cache to act upon every time.
> >
> > I thought about a unified cache, but couldn't decide whether to flush
> > or to allocate heads, and how much to process. Your suggestion answers
> > these questions and generally seems great. I'll try that one, thanks!
>
> The thing is: kmalloc() is supposed to have batches already, and nice
> per-cpu caches.
>
> This looks like an mm issue, are we sure we want to get over it?
>
> I would like a full analysis of why SLAB/SLUB does not work well for
> your test workload.

+1, it does feel like we're getting into mm territory

> More details, more numbers... before we accept yet another
> 'networking optimization' adding more code to the 'fast' path.
>
> More code means more latencies when all code needs to be brought up in
> cpu caches.
On Wed, Jan 13, 2021 at 2:02 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 12 Jan 2021 13:23:16 +0100 Eric Dumazet wrote:
> > On Tue, Jan 12, 2021 at 12:08 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > >
> > > From: Edward Cree <ecree.xilinx@gmail.com>
> > > Date: Tue, 12 Jan 2021 09:54:04 +0000
> > >
> > > > Without wishing to weigh in on whether this caching is a good idea...
> > >
> > > Well, we already have a cache to bulk flush "consumed" skbs, although
> > > kmem_cache_free() is generally lighter than kmem_cache_alloc(), and
> > > a page frag cache to allocate skb->head that is also bulking the
> > > operations, since it contains a (compound) page with the size of
> > > min(SZ_32K, PAGE_SIZE).
> > > If they didn't give any visible boosts, I don't think they would have
> > > hit mainline.
> > >
> > > > Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
> > > > caches, to have a single larger cache, such that whenever it becomes full
> > > > we bulk flush the top half, and when it's empty we bulk alloc the bottom
> > > > half? That should mean fewer branches, fewer instructions etc. than
> > > > having to decide which cache to act upon every time.
> > >
> > > I thought about a unified cache, but couldn't decide whether to flush
> > > or to allocate heads, and how much to process. Your suggestion answers
> > > these questions and generally seems great. I'll try that one, thanks!
> >
> > The thing is: kmalloc() is supposed to have batches already, and nice
> > per-cpu caches.
> >
> > This looks like an mm issue, are we sure we want to get over it?
> >
> > I would like a full analysis of why SLAB/SLUB does not work well for
> > your test workload.
>
> +1, it does feel like we're getting into mm territory

I read the existing code, and with Edward Cree's idea of reusing the
existing cache (storage of pointers), this now all makes sense, since
there will be not much added code (and new storage of 64 pointers).

The remaining issue is to make sure KASAN will still work, we need
this to detect old and new bugs.

Thanks!
On Wed, 13 Jan 2021 05:46:05 +0100 Eric Dumazet wrote:
> On Wed, Jan 13, 2021 at 2:02 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Tue, 12 Jan 2021 13:23:16 +0100 Eric Dumazet wrote:
> > > On Tue, Jan 12, 2021 at 12:08 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > >
> > > > From: Edward Cree <ecree.xilinx@gmail.com>
> > > > Date: Tue, 12 Jan 2021 09:54:04 +0000
> > > >
> > > > > Without wishing to weigh in on whether this caching is a good idea...
> > > >
> > > > Well, we already have a cache to bulk flush "consumed" skbs, although
> > > > kmem_cache_free() is generally lighter than kmem_cache_alloc(), and
> > > > a page frag cache to allocate skb->head that is also bulking the
> > > > operations, since it contains a (compound) page with the size of
> > > > min(SZ_32K, PAGE_SIZE).
> > > > If they didn't give any visible boosts, I don't think they would have
> > > > hit mainline.
> > > >
> > > > > Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
> > > > > caches, to have a single larger cache, such that whenever it becomes full
> > > > > we bulk flush the top half, and when it's empty we bulk alloc the bottom
> > > > > half? That should mean fewer branches, fewer instructions etc. than
> > > > > having to decide which cache to act upon every time.
> > > >
> > > > I thought about a unified cache, but couldn't decide whether to flush
> > > > or to allocate heads, and how much to process. Your suggestion answers
> > > > these questions and generally seems great. I'll try that one, thanks!
> > >
> > > The thing is: kmalloc() is supposed to have batches already, and nice
> > > per-cpu caches.
> > >
> > > This looks like an mm issue, are we sure we want to get over it?
> > >
> > > I would like a full analysis of why SLAB/SLUB does not work well for
> > > your test workload.
> >
> > +1, it does feel like we're getting into mm territory
>
> I read the existing code, and with Edward Cree's idea of reusing the
> existing cache (storage of pointers), this now all makes sense, since
> there will be not much added code (and new storage of 64 pointers).
>
> The remaining issue is to make sure KASAN will still work, we need
> this to detect old and new bugs.

IDK much about MM, but we already have a kmem_cache for skbs and now
we're building a cache on top of a cache. Shouldn't MM take care of
providing a per-CPU BH-only lockless cache?
On Wed, Jan 13, 2021 at 6:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 13 Jan 2021 05:46:05 +0100 Eric Dumazet wrote:
> > On Wed, Jan 13, 2021 at 2:02 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Tue, 12 Jan 2021 13:23:16 +0100 Eric Dumazet wrote:
> > > > On Tue, Jan 12, 2021 at 12:08 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > >
> > > > > From: Edward Cree <ecree.xilinx@gmail.com>
> > > > > Date: Tue, 12 Jan 2021 09:54:04 +0000
> > > > >
> > > > > > Without wishing to weigh in on whether this caching is a good idea...
> > > > >
> > > > > Well, we already have a cache to bulk flush "consumed" skbs, although
> > > > > kmem_cache_free() is generally lighter than kmem_cache_alloc(), and
> > > > > a page frag cache to allocate skb->head that is also bulking the
> > > > > operations, since it contains a (compound) page with the size of
> > > > > min(SZ_32K, PAGE_SIZE).
> > > > > If they didn't give any visible boosts, I don't think they would have
> > > > > hit mainline.
> > > > >
> > > > > > Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
> > > > > > caches, to have a single larger cache, such that whenever it becomes full
> > > > > > we bulk flush the top half, and when it's empty we bulk alloc the bottom
> > > > > > half? That should mean fewer branches, fewer instructions etc. than
> > > > > > having to decide which cache to act upon every time.
> > > > >
> > > > > I thought about a unified cache, but couldn't decide whether to flush
> > > > > or to allocate heads, and how much to process. Your suggestion answers
> > > > > these questions and generally seems great. I'll try that one, thanks!
> > > >
> > > > The thing is: kmalloc() is supposed to have batches already, and nice
> > > > per-cpu caches.
> > > >
> > > > This looks like an mm issue, are we sure we want to get over it?
> > > >
> > > > I would like a full analysis of why SLAB/SLUB does not work well for
> > > > your test workload.
> > >
> > > +1, it does feel like we're getting into mm territory
> >
> > I read the existing code, and with Edward Cree's idea of reusing the
> > existing cache (storage of pointers), this now all makes sense, since
> > there will be not much added code (and new storage of 64 pointers).
> >
> > The remaining issue is to make sure KASAN will still work, we need
> > this to detect old and new bugs.
>
> IDK much about MM, but we already have a kmem_cache for skbs and now
> we're building a cache on top of a cache. Shouldn't MM take care of
> providing a per-CPU BH-only lockless cache?

I think part of the improvement comes from bulk operations, which are
provided by the mm layer.

I also note Alexander made no provision for NUMA awareness.
Probably reusing an skb located on a remote node will not be ideal.
On Wed, 13 Jan 2021 18:15:20 +0100 Eric Dumazet wrote:
> > IDK much about MM, but we already have a kmem_cache for skbs and now
> > we're building a cache on top of a cache. Shouldn't MM take care of
> > providing a per-CPU BH-only lockless cache?
>
> I think part of the improvement comes from bulk operations, which are
> provided by the mm layer.
>
> I also note Alexander made no provision for NUMA awareness.
> Probably reusing an skb located on a remote node will not be ideal.

I was wondering about that yesterday, but couldn't really think of
a legitimate reason not to have XPS set up right. Do you have a
particular config in mind, or are we talking "default config"?

Also, can't the skb _itself_ be pfmemalloc?

My main point is that I'm wondering if this sort of cache would be
useful when allocating skbs for sockets? Assuming that the network
stack is not isolated to its own cores, won't fronting alloc_skb()
with bh_disable() / try the cache / bh_enable() potentially help?

In that sense fronting the kmem_cache would feel cleaner than our own
little ring buffer.
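To make the NUMA concern concrete, a purely hypothetical guard on the reuse path (not from the series; page_to_nid(), virt_to_page() and numa_mem_id() are generic kernel helpers): only hand back a cached head when it sits on the local memory node, otherwise fall back to a fresh allocation.

```c
/* Hypothetical reuse guard (illustration only): refuse to reuse a head
 * whose memory lives on a remote NUMA node. A similar check would be
 * needed if a cached head could come from a pfmemalloc emergency
 * reserve, which is the open question raised above.
 */
static struct sk_buff *napi_skb_cache_get_node(struct napi_alloc_cache *nc,
					       int node)
{
	struct sk_buff *skb;

	if (!nc->skb_count)
		return NULL;

	skb = nc->skb_cache[nc->skb_count - 1];
	if (page_to_nid(virt_to_page(skb)) != node)
		return NULL;	/* remote-node head: leave it cached */

	nc->skb_count--;
	return skb;
}
```

A caller would pass numa_mem_id() and fall back to kmem_cache_alloc_node() when NULL is returned.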