Message ID | 20230331043906.3015706-1-kuba@kernel.org (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | [RFC,net-next,1/2] page_pool: allow caching from safely localized NAPI |
On Thu, 30 Mar 2023 21:39:05 -0700 Jakub Kicinski wrote:
> +/* If caller didn't allow direct recycling check if we have other reasons
> + * to believe that the producer and consumer can't race.
> + *
> + * Result is only meaningful in softirq context.
> + */
> +static bool page_pool_safe_producer(struct page_pool *pool)
> +{
> +        struct napi_struct *napi = pool->p.napi;
> +
> +        return napi && READ_ONCE(napi->list_owner) == smp_processor_id();

Herm, this also needs && !in_hardirq()
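For readers following along, a minimal sketch of what the helper might look
like with that extra guard folded in (this only illustrates the fix-up Jakub
notes above, and assumes the in_hardirq() test from linux/preempt.h ends up in
the helper itself rather than in its caller):

    /* Sketch only: direct recycling is only considered safe when running
     * in softirq but not hard-IRQ context, and the pool's NAPI instance
     * is currently scheduled on this very CPU.
     */
    static bool page_pool_safe_producer(struct page_pool *pool)
    {
            struct napi_struct *napi = pool->p.napi;

            return !in_hardirq() && napi &&
                   READ_ONCE(napi->list_owner) == smp_processor_id();
    }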
On 31/03/2023 06.39, Jakub Kicinski wrote:
> Recent patches to mlx5 mentioned a regression when moving from
> driver local page pool to only using the generic page pool code.
> Page pool has two recycling paths (1) direct one, which runs in
> safe NAPI context (basically consumer context, so producing
> can be lockless); and (2) via a ptr_ring, which takes a spin
> lock because the freeing can happen from any CPU; producer
> and consumer may run concurrently.
>
> Since the page pool code was added, Eric introduced a revised version
> of deferred skb freeing. TCP skbs are now usually returned to the CPU
> which allocated them, and freed in softirq context. This places the
> freeing (producing of pages back to the pool) enticingly close to
> the allocation (consumer).
>
> If we can prove that we're freeing in the same softirq context in which
> the consumer NAPI will run - lockless use of the cache is perfectly fine,
> no need for the lock.

Super interesting! :-)

> Let drivers link the page pool to a NAPI instance. If the NAPI instance
> is scheduled on the same CPU on which we're freeing - place the pages
> in the direct cache.

Cool, using the direct cache this way.

In the cases where we cannot use the direct cache, there is also the
option of bulk freeing into the page_pool, which mitigates the ptr_ring
locking. See page_pool_put_page_bulk().

For XDP the page_pool_put_page_bulk() API is used by
xdp_return_frame_bulk() and xdp_flush_frame_bulk(), which together
build up a bulk.

(p.s. I see we could optimize xdp_return_frame_bulk() some more, and
avoid some of the rhashtable_lookup() calls, but that is unrelated to
your patch).

> With that and patched bnxt (XDP enabled to engage the page pool, sigh,
> bnxt really needs page pool work :() I see a 2.6% perf boost with
> a TCP stream test (app on a different physical core than softirq).
>
> The CPU use of relevant functions decreases as expected:
>
> page_pool_refill_alloc_cache 1.17% -> 0%
> _raw_spin_lock 2.41% -> 0.98%
>
> Only consider lockless path to be safe when NAPI is scheduled
> - in practice this should cover majority if not all of steady state
> workloads. It's usually the NAPI kicking in that causes the skb flush.

Makes sense, but do read the comment above struct pp_alloc_cache.
The sizing of pp_alloc_cache is important for this trick/heuristic to
work, meaning pp_alloc_cache must have enough room.
It is definitely on purpose that pp_alloc_cache has 128 elements and is
only refilled with 64 elements, which leaves room for this kind of
trick. But if Eric's deferred skb freeing has more than 64 pages to
free, then we will likely fall back to ptr_ring recycling.

Code wise, I suggest that you/we change page_pool_put_page_bulk() to
have a variant that takes 'allow_direct' (looking at the code below, you
might already do this, as this patch over-steers 'allow_direct'). Using
the bulk API would then bulk into the ptr_ring in the cases where we
cannot use the direct cache.

> The main case we'll miss out on is when application runs on the same
> CPU as NAPI. In that case we don't use the deferred skb free path.
> We could disable softirq on that path, too... maybe?
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: hawk@kernel.org
> CC: ilias.apalodimas@linaro.org
> ---
>  include/linux/netdevice.h |  3 +++
>  include/net/page_pool.h   |  1 +
>  net/core/dev.c            |  3 +++
>  net/core/page_pool.c      | 16 ++++++++++++++++
>  4 files changed, 23 insertions(+)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 62e093a6d6d1..b3c11353078b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -360,8 +360,11 @@ struct napi_struct {
>          unsigned long gro_bitmask;
>          int (*poll)(struct napi_struct *, int);
>  #ifdef CONFIG_NETPOLL
> +        /* CPU actively polling if netpoll is configured */
>          int poll_owner;
>  #endif
> +        /* CPU on which NAPI has been scheduled for processing */
> +        int list_owner;
>          struct net_device *dev;
>          struct gro_list gro_hash[GRO_HASH_BUCKETS];
>          struct sk_buff *skb;
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index ddfa0b328677..f86cdfb51585 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -77,6 +77,7 @@ struct page_pool_params {
>          unsigned int pool_size;
>          int nid; /* Numa node id to allocate from pages from */
>          struct device *dev; /* device, for DMA pre-mapping purposes */
> +        struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */
>          enum dma_data_direction dma_dir; /* DMA mapping direction */
>          unsigned int max_len; /* max DMA sync memory size */
>          unsigned int offset; /* DMA addr offset */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0c4b21291348..a6d6e5c89ce7 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4360,6 +4360,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>          }
>
>          list_add_tail(&napi->poll_list, &sd->poll_list);
> +        WRITE_ONCE(napi->list_owner, smp_processor_id());
>          /* If not called from net_rx_action()
>           * we have to raise NET_RX_SOFTIRQ.
>           */
> @@ -6070,6 +6071,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
>                  list_del_init(&n->poll_list);
>                  local_irq_restore(flags);
>          }
> +        WRITE_ONCE(n->list_owner, -1);
>
>          val = READ_ONCE(n->state);
>          do {
> @@ -6385,6 +6387,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
>  #ifdef CONFIG_NETPOLL
>          napi->poll_owner = -1;
>  #endif
> +        napi->list_owner = -1;
>          set_bit(NAPI_STATE_SCHED, &napi->state);
>          set_bit(NAPI_STATE_NPSVC, &napi->state);
>          list_add_rcu(&napi->dev_list, &dev->napi_list);
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 193c18799865..c3e2ab0c2684 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -19,6 +19,7 @@
>  #include <linux/mm.h> /* for put_page() */
>  #include <linux/poison.h>
>  #include <linux/ethtool.h>
> +#include <linux/netdevice.h>
>
>  #include <trace/events/page_pool.h>
>
> @@ -544,6 +545,18 @@ static bool page_pool_recycle_in_cache(struct page *page,
>          return true;
>  }
>
> +/* If caller didn't allow direct recycling check if we have other reasons
> + * to believe that the producer and consumer can't race.
> + *
> + * Result is only meaningful in softirq context.
> + */
> +static bool page_pool_safe_producer(struct page_pool *pool)
> +{
> +        struct napi_struct *napi = pool->p.napi;
> +
> +        return napi && READ_ONCE(napi->list_owner) == smp_processor_id();
> +}
> +
>  /* If the page refcnt == 1, this will try to recycle the page.
>   * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
>   * the configured size min(dma_sync_size, pool->max_len).
> @@ -570,6 +583,9 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>                          page_pool_dma_sync_for_device(pool, page,
>                                                        dma_sync_size);
>
> +                if (!allow_direct)
> +                        allow_direct = page_pool_safe_producer(pool);
> +

I remember some use case for veth that explicitly disables
"allow_direct". I cannot remember why exactly, but we need to make sure
that doesn't break something (as this code can undo the allow_direct).

>                  if (allow_direct && in_softirq() &&
>                      page_pool_recycle_in_cache(page, pool))
>                          return NULL;
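For reference, the bulk-return path Jesper mentions is the xdp_frame_bulk
machinery built on top of page_pool_put_page_bulk(). A minimal sketch of the
usage pattern (the function and the frames array below are illustrative only,
not something from the patch):

    /* Illustrative only: frames are accumulated into an xdp_frame_bulk and
     * flushed once, so the ptr_ring spinlock is taken per bulk rather than
     * per page when the direct cache cannot be used.
     */
    static void example_bulk_free(struct xdp_frame **frames, int n)
    {
            struct xdp_frame_bulk bq;
            int i;

            xdp_frame_bulk_init(&bq);

            rcu_read_lock();        /* protects the mem-model lookup inside */
            for (i = 0; i < n; i++)
                    xdp_return_frame_bulk(frames[i], &bq);
            xdp_flush_frame_bulk(&bq);
            rcu_read_unlock();
    }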
On Fri, 31 Mar 2023 11:31:31 +0200 Jesper Dangaard Brouer wrote:
> On 31/03/2023 06.39, Jakub Kicinski wrote:
> > With that and patched bnxt (XDP enabled to engage the page pool, sigh,
> > bnxt really needs page pool work :() I see a 2.6% perf boost with
> > a TCP stream test (app on a different physical core than softirq).
> >
> > The CPU use of relevant functions decreases as expected:
> >
> > page_pool_refill_alloc_cache 1.17% -> 0%
> > _raw_spin_lock 2.41% -> 0.98%
> >
> > Only consider lockless path to be safe when NAPI is scheduled
> > - in practice this should cover majority if not all of steady state
> > workloads. It's usually the NAPI kicking in that causes the skb flush.
>
> Makes sense, but do read the comment above struct pp_alloc_cache.
> The sizing of pp_alloc_cache is important for this trick/heuristic to
> work, meaning pp_alloc_cache must have enough room.
> It is definitely on purpose that pp_alloc_cache has 128 elements and is
> only refilled with 64 elements, which leaves room for this kind of
> trick. But if Eric's deferred skb freeing has more than 64 pages to
> free, then we will likely fall back to ptr_ring recycling.
>
> Code wise, I suggest that you/we change page_pool_put_page_bulk() to
> have a variant that takes 'allow_direct' (looking at the code below, you
> might already do this, as this patch over-steers 'allow_direct'). Using
> the bulk API would then bulk into the ptr_ring in the cases where we
> cannot use the direct cache.

Interesting point, let me re-run some tests with the statistics enabled.
For a simple stream test I think it may just be too steady to trigger
over/underflow. Each skb will carry at most 18 pages, and driver should
only produce 64 packets / consume 64 pages. Each NAPI cycle will start
by flushing the deferred free. So unless there is a hiccup either at the
app or NAPI side - the flows of pages in each direction should be steady
enough to do well with just 128 cache entries. Let me get the data and
report back.

> >  /* If the page refcnt == 1, this will try to recycle the page.
> >   * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
> >   * the configured size min(dma_sync_size, pool->max_len).
> > @@ -570,6 +583,9 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >                          page_pool_dma_sync_for_device(pool, page,
> >                                                        dma_sync_size);
> >
> > +                if (!allow_direct)
> > +                        allow_direct = page_pool_safe_producer(pool);
> > +
>
> I remember some use case for veth that explicitly disables
> "allow_direct". I cannot remember why exactly, but we need to make sure
> that doesn't break something (as this code can undo the allow_direct).

I can't find anything in veth :( Trying to grep drivers for
page_pool_put / page_pool_recycle problems, the best I could find is
commit e38553bdc377 ("net: fec: Use page_pool_put_full_page when
freeing rx buffers").
On Fri, 31 Mar 2023 12:06:43 -0700 Jakub Kicinski wrote:
> > Makes sense, but do read the comment above struct pp_alloc_cache.
> > The sizing of pp_alloc_cache is important for this trick/heuristic to
> > work, meaning pp_alloc_cache must have enough room.
> > It is definitely on purpose that pp_alloc_cache has 128 elements and is
> > only refilled with 64 elements, which leaves room for this kind of
> > trick. But if Eric's deferred skb freeing has more than 64 pages to
> > free, then we will likely fall back to ptr_ring recycling.
> >
> > Code wise, I suggest that you/we change page_pool_put_page_bulk() to
> > have a variant that takes 'allow_direct' (looking at the code below, you
> > might already do this, as this patch over-steers 'allow_direct'). Using
> > the bulk API would then bulk into the ptr_ring in the cases where we
> > cannot use the direct cache.
>
> Interesting point, let me re-run some tests with the statistics enabled.
> For a simple stream test I think it may just be too steady to trigger
> over/underflow. Each skb will carry at most 18 pages, and driver should
> only produce 64 packets / consume 64 pages. Each NAPI cycle will start
> by flushing the deferred free. So unless there is a hiccup either at the
> app or NAPI side - the flows of pages in each direction should be steady
> enough to do well with just 128 cache entries. Let me get the data and
> report back.

I patched the driver a bit to use page pool for HW-GRO.
Below are recycle stats with HW-GRO and with SW GRO + XDP_PASS
(packet per page).

                          HW-GRO                  page=page
                      before     after        before     after
recycle:
cached:                    0 138669686              0 150197505
cache_full:                0    223391              0     74582
ring:              138551933   9997191      149299454         0
ring_full:                 0       488           3154    127590
released_refcnt:           0         0              0         0

alloc:
fast:              136491361 148615710      146969587 150322859
slow:                   1772      1799            144       105
slow_high_order:           0         0              0         0
empty:                  1772      1799            144       105
refill:              2165245    156302        2332880      2128
waive:                     0         0              0         0

Which seems to confirm that for this trivial test the cache sizing is
good enough, and I won't see any benefit from batching (as the cache is
full only 0.16% and 0.05% of the time, respectively).
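A quick back-of-the-envelope check of the percentages quoted above (this
derivation is implied rather than spelled out in the thread): the "full"
ratio is cache_full events relative to all attempted cached recycles.

    HW-GRO:       223391 / (138669686 + 223391)  ~= 0.16%
    page=page:     74582 / (150197505 +  74582)  ~= 0.05%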
Hi Jakub

On Thu, Mar 30, 2023 at 09:39:05PM -0700, Jakub Kicinski wrote:
> Recent patches to mlx5 mentioned a regression when moving from
> driver local page pool to only using the generic page pool code.
> Page pool has two recycling paths (1) direct one, which runs in
> safe NAPI context (basically consumer context, so producing
> can be lockless); and (2) via a ptr_ring, which takes a spin
> lock because the freeing can happen from any CPU; producer
> and consumer may run concurrently.
>
> Since the page pool code was added, Eric introduced a revised version
> of deferred skb freeing. TCP skbs are now usually returned to the CPU
> which allocated them, and freed in softirq context. This places the
> freeing (producing of pages back to the pool) enticingly close to
> the allocation (consumer).
>
> If we can prove that we're freeing in the same softirq context in which
> the consumer NAPI will run - lockless use of the cache is perfectly fine,
> no need for the lock.
>
> Let drivers link the page pool to a NAPI instance. If the NAPI instance
> is scheduled on the same CPU on which we're freeing - place the pages
> in the direct cache.
>
> With that and patched bnxt (XDP enabled to engage the page pool, sigh,
> bnxt really needs page pool work :() I see a 2.6% perf boost with
> a TCP stream test (app on a different physical core than softirq).
>
> The CPU use of relevant functions decreases as expected:
>
> page_pool_refill_alloc_cache 1.17% -> 0%
> _raw_spin_lock 2.41% -> 0.98%
>
> Only consider lockless path to be safe when NAPI is scheduled
> - in practice this should cover majority if not all of steady state
> workloads. It's usually the NAPI kicking in that causes the skb flush.
>
> The main case we'll miss out on is when application runs on the same
> CPU as NAPI. In that case we don't use the deferred skb free path.
> We could disable softirq on that path, too... maybe?

This whole thing makes a lot of sense to me.

> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: hawk@kernel.org
> CC: ilias.apalodimas@linaro.org

[...]

>          return true;
>  }
>
> +/* If caller didn't allow direct recycling check if we have other reasons
> + * to believe that the producer and consumer can't race.
> + *
> + * Result is only meaningful in softirq context.
> + */
> +static bool page_pool_safe_producer(struct page_pool *pool)
> +{
> +        struct napi_struct *napi = pool->p.napi;
> +
> +        return napi && READ_ONCE(napi->list_owner) == smp_processor_id();
> +}
> +
>  /* If the page refcnt == 1, this will try to recycle the page.
>   * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
>   * the configured size min(dma_sync_size, pool->max_len).
> @@ -570,6 +583,9 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>                          page_pool_dma_sync_for_device(pool, page,
>                                                        dma_sync_size);
>
> +                if (!allow_direct)
> +                        allow_direct = page_pool_safe_producer(pool);
> +

Do we want to hide the decision in __page_pool_put_page()? IOW wouldn't it
be better for this function to honor whatever allow_direct dictates and
have the allow_direct = page_pool_safe_producer(pool); in callers?

Thanks
/Ilias

>                  if (allow_direct && in_softirq() &&
>                      page_pool_recycle_in_cache(page, pool))
>                          return NULL;
> --
> 2.39.2
>
On Mon, 3 Apr 2023 12:16:04 +0300 Ilias Apalodimas wrote:
> >  /* If the page refcnt == 1, this will try to recycle the page.
> >   * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
> >   * the configured size min(dma_sync_size, pool->max_len).
> > @@ -570,6 +583,9 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >                          page_pool_dma_sync_for_device(pool, page,
> >                                                        dma_sync_size);
> >
> > +                if (!allow_direct)
> > +                        allow_direct = page_pool_safe_producer(pool);
> > +
>
> Do we want to hide the decision in __page_pool_put_page()? IOW wouldn't it
> be better for this function to honor whatever allow_direct dictates and
> have the allow_direct = page_pool_safe_producer(pool); in callers?

Meaning in page_pool_return_skb_page() or all the way from
napi_consume_skb()? The former does indeed sound like a good idea!
On Mon, 3 Apr 2023 at 18:05, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 3 Apr 2023 12:16:04 +0300 Ilias Apalodimas wrote:
> > >  /* If the page refcnt == 1, this will try to recycle the page.
> > >   * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
> > >   * the configured size min(dma_sync_size, pool->max_len).
> > > @@ -570,6 +583,9 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> > >                          page_pool_dma_sync_for_device(pool, page,
> > >                                                        dma_sync_size);
> > >
> > > +                if (!allow_direct)
> > > +                        allow_direct = page_pool_safe_producer(pool);
> > > +
> >
> > Do we want to hide the decision in __page_pool_put_page()? IOW wouldn't it
> > be better for this function to honor whatever allow_direct dictates and
> > have the allow_direct = page_pool_safe_producer(pool); in callers?
>
> Meaning in page_pool_return_skb_page() or all the way from
> napi_consume_skb()? The former does indeed sound like a good idea!

page_pool_return_skb_page() (and maybe page_pool_put_full_page()).
FWIW we completely agree on napi_consume_skb(). We are trying to keep
page_pool and the net layer as disjoint as possible. The only point
where we 'pollute' networking code is the recycle bit checking, and we'd
prefer keeping it like that.

Regards
/Ilias
On Mon, 3 Apr 2023 18:30:55 +0300 Ilias Apalodimas wrote:
> > Meaning in page_pool_return_skb_page() or all the way from
> > napi_consume_skb()? The former does indeed sound like a good idea!
>
> page_pool_return_skb_page() (and maybe page_pool_put_full_page()).
> FWIW we completely agree on napi_consume_skb(). We are trying to keep
> page_pool and the net layer as disjoint as possible. The only point
> where we 'pollute' networking code is the recycle bit checking, and we'd
> prefer keeping it like that.

Ack, OTOH plumbing thru the budget argument within netdev code should
not be a major refactoring. So maybe I should do that after all.

Otherwise we have two different conditions - netdev only recycles skbs
based on the NAPI budget != 0, but page pool will assume that
in_softirq() && !in_hardirq() is always safe.

The latter is safe, I think, unless someone adds a print half way thru
the cache update... but then it's also safe in NAPI skb recycling,
so napi_consume_skb() should stop taking the budget and just look
at preempt flags...

To make the correctness obvious, for now, I think I will refactor
the netdev code to pass an "in NAPI poll" bool to
page_pool_return_skb_page(), and add a WARN_ON(!softirq || hardirq).

Let's see how the code ends up looking, I'll send it as RFCv2 rather
than PATCH to make it clear I'm not sure it's okay with you :)
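As a rough sketch of the direction described in the last two paragraphs (the
napi_safe parameter name and the example caller below are assumptions made for
illustration; the actual RFCv2 may plumb this differently):

    /* Sketch: the skb free path tells page_pool whether it runs from NAPI
     * poll (budget != 0) instead of page_pool guessing on its own.
     */
    bool page_pool_return_skb_page(struct page *page, bool napi_safe);

    /* Hypothetical caller on the skb-free side */
    static void example_free_skb_head(struct sk_buff *skb, bool napi_safe)
    {
            struct page *page = virt_to_head_page(skb->head);

            if (skb->pp_recycle && page_pool_return_skb_page(page, napi_safe))
                    return;         /* page went back to its page_pool */

            /* ... otherwise fall through to the regular page freeing ... */
    }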
On 2023/3/31 12:39, Jakub Kicinski wrote:
> Recent patches to mlx5 mentioned a regression when moving from
> driver local page pool to only using the generic page pool code.
> Page pool has two recycling paths (1) direct one, which runs in
> safe NAPI context (basically consumer context, so producing
> can be lockless); and (2) via a ptr_ring, which takes a spin
> lock because the freeing can happen from any CPU; producer
> and consumer may run concurrently.
>
> Since the page pool code was added, Eric introduced a revised version
> of deferred skb freeing. TCP skbs are now usually returned to the CPU
> which allocated them, and freed in softirq context. This places the
> freeing (producing of pages back to the pool) enticingly close to
> the allocation (consumer).
>
> If we can prove that we're freeing in the same softirq context in which
> the consumer NAPI will run - lockless use of the cache is perfectly fine,
> no need for the lock.
>
> Let drivers link the page pool to a NAPI instance. If the NAPI instance
> is scheduled on the same CPU on which we're freeing - place the pages
> in the direct cache.
>
> With that and patched bnxt (XDP enabled to engage the page pool, sigh,
> bnxt really needs page pool work :() I see a 2.6% perf boost with
> a TCP stream test (app on a different physical core than softirq).
>
> The CPU use of relevant functions decreases as expected:
>
> page_pool_refill_alloc_cache 1.17% -> 0%
> _raw_spin_lock 2.41% -> 0.98%
>
> Only consider lockless path to be safe when NAPI is scheduled
> - in practice this should cover majority if not all of steady state
> workloads. It's usually the NAPI kicking in that causes the skb flush.

Interesting.
I wonder if we can make this more generic by adding the skb to a per-napi
list instead of sd->defer_list, so that we can always use NAPI kicking to
flush skbs, as net_tx_action() does for sd->completion_queue, instead of
softirq kicking?

And it seems we know which napi binds to a specific socket through the
busypoll mechanism; can we reuse that to release a skb to the napi
bound to that socket?

> The main case we'll miss out on is when application runs on the same
> CPU as NAPI. In that case we don't use the deferred skb free path.
> We could disable softirq on that path, too... maybe?
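Purely to illustrate the idea being floated here (nothing like this exists in
the posted patch; every name below is hypothetical), a per-NAPI deferred-free
queue might look roughly like:

    /* Hypothetical sketch of the per-NAPI defer-list idea -- the posted
     * patch keeps using the per-CPU sd->defer_list.  Remote CPUs would
     * queue skbs here and kick the NAPI, whose poll loop would flush the
     * list, much like net_tx_action() drains sd->completion_queue.
     */
    struct example_napi_defer {
            struct napi_struct      *napi;
            spinlock_t              lock;
            struct sk_buff          *list;
    };

    static void example_napi_defer_free(struct example_napi_defer *nd,
                                        struct sk_buff *skb)
    {
            spin_lock_bh(&nd->lock);
            skb->next = nd->list;
            nd->list = skb;
            spin_unlock_bh(&nd->lock);

            napi_schedule(nd->napi);        /* let the owning NAPI flush it */
    }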
On Tue, 4 Apr 2023 08:53:36 +0800 Yunsheng Lin wrote:
> I wonder if we can make this more generic by adding the skb to a per-napi
> list instead of sd->defer_list, so that we can always use NAPI kicking to
> flush skbs, as net_tx_action() does for sd->completion_queue, instead of
> softirq kicking?
>
> And it seems we know which napi binds to a specific socket through the
> busypoll mechanism; can we reuse that to release a skb to the napi
> bound to that socket?

Seems doable. My thinking was to first see how well the simpler scheme
performs with production workloads because it should have no downsides.
Tracking real NAPI pointers per socket and extra RCU sync to manage
per-NAPI defer queues may have perf cost.
On 2023/4/4 9:45, Jakub Kicinski wrote:
> On Tue, 4 Apr 2023 08:53:36 +0800 Yunsheng Lin wrote:
>> I wonder if we can make this more generic by adding the skb to a per-napi
>> list instead of sd->defer_list, so that we can always use NAPI kicking to
>> flush skbs, as net_tx_action() does for sd->completion_queue, instead of
>> softirq kicking?
>>
>> And it seems we know which napi binds to a specific socket through the
>> busypoll mechanism; can we reuse that to release a skb to the napi
>> bound to that socket?
>
> Seems doable. My thinking was to first see how well the simpler scheme
> performs with production workloads because it should have no downsides.

Looking forward to some performance data with production workloads :)

> Tracking real NAPI pointers per socket and extra RCU sync to manage
> per-NAPI defer queues may have perf cost.

I suppose the extra RCU sync only happens on the napi add/del process,
not in the data path?

> .
>
On Tue, Apr 4, 2023 at 2:53 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> Interesting.
> I wonder if we can make this more generic by adding the skb to a per-napi
> list instead of sd->defer_list, so that we can always use NAPI kicking to
> flush skbs, as net_tx_action() does for sd->completion_queue, instead of
> softirq kicking?

We do not have direct skb -> napi association yet, but using an
expensive hash lookup.

I had the intent of adding per-cpu caches in this infrastructure,
to not acquire the remote-cpu defer_lock for one skb at a time.
(This is I think causing some regressions for small packets, with no frags)

> And it seems we know which napi binds to a specific socket through the
> busypoll mechanism; can we reuse that to release a skb to the napi
> bound to that socket?

busypoll is not often used, and we usually burn (spinning) cycles there,
not sure we want to optimize it?

> > The main case we'll miss out on is when application runs on the same
> > CPU as NAPI. In that case we don't use the deferred skb free path.
> > We could disable softirq on that path, too... maybe?
> >
On 2023/4/4 12:21, Eric Dumazet wrote:
> On Tue, Apr 4, 2023 at 2:53 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
>> Interesting.
>> I wonder if we can make this more generic by adding the skb to a per-napi
>> list instead of sd->defer_list, so that we can always use NAPI kicking to
>> flush skbs, as net_tx_action() does for sd->completion_queue, instead of
>> softirq kicking?
>
> We do not have direct skb -> napi association yet, but using an
> expensive hash lookup.
>
> I had the intent of adding per-cpu caches in this infrastructure,
> to not acquire the remote-cpu defer_lock for one skb at a time.
> (This is I think causing some regressions for small packets, with no frags)

Is there any reason not to introduce back the per-socket defer_list to
not acquire the defer_lock for one skb at a time instead of adding
per-cpu caches?

>> And it seems we know which napi binds to a specific socket through the
>> busypoll mechanism; can we reuse that to release a skb to the napi
>> bound to that socket?
>
> busypoll is not often used, and we usually burn (spinning) cycles there,
> not sure we want to optimize it?

How about only optimizing napi_by_id()? To be honest, I am not sure how
to optimize it exactly; maybe add a callback to notify the socket of the
deletion of the napi?

>>
>>> The main case we'll miss out on is when application runs on the same
>>> CPU as NAPI. In that case we don't use the deferred skb free path.
>>> We could disable softirq on that path, too... maybe?
>>>
> .
>
On Mon, 2023-04-03 at 10:12 -0700, Jakub Kicinski wrote:
> On Mon, 3 Apr 2023 18:30:55 +0300 Ilias Apalodimas wrote:
> > > Meaning in page_pool_return_skb_page() or all the way from
> > > napi_consume_skb()? The former does indeed sound like a good idea!
> >
> > page_pool_return_skb_page() (and maybe page_pool_put_full_page()).
> > FWIW we completely agree on napi_consume_skb(). We are trying to keep
> > page_pool and the net layer as disjoint as possible. The only point
> > where we 'pollute' networking code is the recycle bit checking, and we'd
> > prefer keeping it like that.
>
> Ack, OTOH plumbing thru the budget argument within netdev code should
> not be a major refactoring. So maybe I should do that after all.
>
> Otherwise we have two different conditions - netdev only recycles skbs
> based on the NAPI budget != 0, but page pool will assume that
> in_softirq() && !in_hardirq() is always safe.
>
> The latter is safe, I think, unless someone adds a print half way thru
> the cache update... but then it's also safe in NAPI skb recycling,
> so napi_consume_skb() should stop taking the budget and just look
> at preempt flags...
>
> To make the correctness obvious, for now, I think I will refactor
> the netdev code to pass an "in NAPI poll" bool to
> page_pool_return_skb_page(), and add a WARN_ON(!softirq || hardirq).
>
> Let's see how the code ends up looking, I'll send it as RFCv2 rather
> than PATCH to make it clear I'm not sure it's okay with you :)

Wow, thanks for picking this up so fast!

After enabling this in the mlx5 driver, there is already improved
page_pool cache usage for our test with the application running on the
same CPU as the receive queue NAPI (0 -> 98% cache usage).

Looking forward to the v2.
On Tue, Apr 4, 2023 at 12:50 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> Is there any reason not to introduce back the per-socket defer_list to
> not acquire the defer_lock for one skb at a time instead of adding
> per-cpu caches?

This is a disaster on hosts with millions of sockets. Just think how
many pages _one_ skb can hold, and multiply this number by 10,000,000.

Per-cpu is the more natural way to make sure this kind of cache will
not use an unbounded amount of memory.

We had the same issues with sk->sk_forward_alloc per-socket reserves.
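To put a rough number on that concern (reusing the "at most 18 pages per skb"
figure quoted earlier in the thread and assuming 4 KiB pages; this is only an
illustration, not a measurement from the thread):

    10,000,000 sockets x 1 deferred skb x 18 pages x 4 KiB
        = 180,000,000 pages ~= 687 GiB potentially pinned in per-socket
          lists, versus a per-CPU scheme whose worst case grows only
          with the number of CPUs.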
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 62e093a6d6d1..b3c11353078b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -360,8 +360,11 @@ struct napi_struct {
         unsigned long gro_bitmask;
         int (*poll)(struct napi_struct *, int);
 #ifdef CONFIG_NETPOLL
+        /* CPU actively polling if netpoll is configured */
         int poll_owner;
 #endif
+        /* CPU on which NAPI has been scheduled for processing */
+        int list_owner;
         struct net_device *dev;
         struct gro_list gro_hash[GRO_HASH_BUCKETS];
         struct sk_buff *skb;
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index ddfa0b328677..f86cdfb51585 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -77,6 +77,7 @@ struct page_pool_params {
         unsigned int pool_size;
         int nid; /* Numa node id to allocate from pages from */
         struct device *dev; /* device, for DMA pre-mapping purposes */
+        struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */
         enum dma_data_direction dma_dir; /* DMA mapping direction */
         unsigned int max_len; /* max DMA sync memory size */
         unsigned int offset; /* DMA addr offset */
diff --git a/net/core/dev.c b/net/core/dev.c
index 0c4b21291348..a6d6e5c89ce7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4360,6 +4360,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
         }

         list_add_tail(&napi->poll_list, &sd->poll_list);
+        WRITE_ONCE(napi->list_owner, smp_processor_id());
         /* If not called from net_rx_action()
          * we have to raise NET_RX_SOFTIRQ.
          */
@@ -6070,6 +6071,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
                 list_del_init(&n->poll_list);
                 local_irq_restore(flags);
         }
+        WRITE_ONCE(n->list_owner, -1);

         val = READ_ONCE(n->state);
         do {
@@ -6385,6 +6387,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 #ifdef CONFIG_NETPOLL
         napi->poll_owner = -1;
 #endif
+        napi->list_owner = -1;
         set_bit(NAPI_STATE_SCHED, &napi->state);
         set_bit(NAPI_STATE_NPSVC, &napi->state);
         list_add_rcu(&napi->dev_list, &dev->napi_list);
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 193c18799865..c3e2ab0c2684 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -19,6 +19,7 @@
 #include <linux/mm.h> /* for put_page() */
 #include <linux/poison.h>
 #include <linux/ethtool.h>
+#include <linux/netdevice.h>

 #include <trace/events/page_pool.h>

@@ -544,6 +545,18 @@ static bool page_pool_recycle_in_cache(struct page *page,
         return true;
 }

+/* If caller didn't allow direct recycling check if we have other reasons
+ * to believe that the producer and consumer can't race.
+ *
+ * Result is only meaningful in softirq context.
+ */
+static bool page_pool_safe_producer(struct page_pool *pool)
+{
+        struct napi_struct *napi = pool->p.napi;
+
+        return napi && READ_ONCE(napi->list_owner) == smp_processor_id();
+}
+
 /* If the page refcnt == 1, this will try to recycle the page.
  * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
  * the configured size min(dma_sync_size, pool->max_len).
@@ -570,6 +583,9 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
                         page_pool_dma_sync_for_device(pool, page,
                                                       dma_sync_size);

+                if (!allow_direct)
+                        allow_direct = page_pool_safe_producer(pool);
+
                 if (allow_direct && in_softirq() &&
                     page_pool_recycle_in_cache(page, pool))
                         return NULL;
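To show how a driver would opt in to the new behaviour, here is a minimal,
hypothetical RX-ring setup; everything named example_* is made up for
illustration, and the only piece added by this patch is the .napi member of
struct page_pool_params:

    #include <linux/err.h>
    #include <net/page_pool.h>

    struct example_rx_ring {                /* hypothetical driver state */
            struct device           *dev;
            struct napi_struct      napi;
            struct page_pool        *page_pool;
            unsigned int            size;
    };

    /* Link the page pool to the NAPI instance that is the sole consumer of
     * its pages, so frees that happen in the same softirq as that NAPI can
     * use the lockless direct cache instead of the ptr_ring.
     */
    static int example_rx_ring_create_pool(struct example_rx_ring *ring)
    {
            struct page_pool_params pp = {
                    .pool_size      = ring->size,
                    .nid            = NUMA_NO_NODE,
                    .dev            = ring->dev,
                    .napi           = &ring->napi,  /* new in this patch */
                    .dma_dir        = DMA_FROM_DEVICE,
            };

            ring->page_pool = page_pool_create(&pp);
            return PTR_ERR_OR_ZERO(ring->page_pool);
    }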
Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.

Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).

If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.

Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.

With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).

The CPU use of relevant functions decreases as expected:

page_pool_refill_alloc_cache 1.17% -> 0%
_raw_spin_lock 2.41% -> 0.98%

Only consider lockless path to be safe when NAPI is scheduled
- in practice this should cover majority if not all of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.

The main case we'll miss out on is when application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.
We could disable softirq on that path, too... maybe?

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: hawk@kernel.org
CC: ilias.apalodimas@linaro.org
---
 include/linux/netdevice.h |  3 +++
 include/net/page_pool.h   |  1 +
 net/core/dev.c            |  3 +++
 net/core/page_pool.c      | 16 ++++++++++++++++
 4 files changed, 23 insertions(+)