[RFC,net-next,1/2] page_pool: allow caching from safely localized NAPI

Message ID 20230331043906.3015706-1-kuba@kernel.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series [RFC,net-next,1/2] page_pool: allow caching from safely localized NAPI

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5279 this patch: 5279
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 1024 this patch: 1024
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5487 this patch: 5487
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 73 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jakub Kicinski March 31, 2023, 4:39 a.m. UTC
Recent patches to mlx5 mentioned a regression when moving from
a driver-local page pool to only using the generic page pool code.
Page pool has two recycling paths: (1) the direct one, which runs in
the safe NAPI context (basically the consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.

Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).

If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.

Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.
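
For illustration only, a driver opting in would point the new field at its
RX NAPI when creating the pool; this is just a sketch, the ring/device
names ("rxr", "pdev") are made up:

	struct page_pool_params pp_params = {
		.flags		= PP_FLAG_DMA_MAP,
		.order		= 0,
		.pool_size	= rxr->rx_ring_size,
		.nid		= dev_to_node(&pdev->dev),
		.dev		= &pdev->dev,
		.napi		= &rxr->napi,	/* new field from this patch */
		.dma_dir	= DMA_FROM_DEVICE,
	};

	rxr->page_pool = page_pool_create(&pp_params);
	if (IS_ERR(rxr->page_pool))
		return PTR_ERR(rxr->page_pool);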

With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).

The CPU use of relevant functions decreases as expected:

  page_pool_refill_alloc_cache   1.17% -> 0%
  _raw_spin_lock                 2.41% -> 0.98%

Only consider the lockless path to be safe when NAPI is scheduled
- in practice this should cover the majority, if not all, of steady
state workloads. It's usually the NAPI kicking in that causes the skb flush.

The main case we'll miss out on is when the application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.
We could disable softirqs on that path, too... maybe?

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: hawk@kernel.org
CC: ilias.apalodimas@linaro.org
---
 include/linux/netdevice.h |  3 +++
 include/net/page_pool.h   |  1 +
 net/core/dev.c            |  3 +++
 net/core/page_pool.c      | 16 ++++++++++++++++
 4 files changed, 23 insertions(+)

Comments

Jakub Kicinski March 31, 2023, 5:15 a.m. UTC | #1
On Thu, 30 Mar 2023 21:39:05 -0700 Jakub Kicinski wrote:
> +/* If caller didn't allow direct recycling check if we have other reasons
> + * to believe that the producer and consumer can't race.
> + *
> + * Result is only meaningful in softirq context.
> + */
> +static bool page_pool_safe_producer(struct page_pool *pool)
> +{
> +	struct napi_struct *napi = pool->p.napi;
> +
> +	return napi && READ_ONCE(napi->list_owner) == smp_processor_id();

Herm, this also needs && !in_hardirq()
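
i.e. the helper would become something like this (sketch only, not the
final version):

static bool page_pool_safe_producer(struct page_pool *pool)
{
	struct napi_struct *napi = pool->p.napi;

	return napi && READ_ONCE(napi->list_owner) == smp_processor_id() &&
	       !in_hardirq();
}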
Jesper Dangaard Brouer March 31, 2023, 9:31 a.m. UTC | #2
On 31/03/2023 06.39, Jakub Kicinski wrote:
> Recent patches to mlx5 mentioned a regression when moving from
> driver local page pool to only using the generic page pool code.
> Page pool has two recycling paths (1) direct one, which runs in
> safe NAPI context (basically consumer context, so producing
> can be lockless); and (2) via a ptr_ring, which takes a spin
> lock because the freeing can happen from any CPU; producer
> and consumer may run concurrently.
> 
> Since the page pool code was added, Eric introduced a revised version
> of deferred skb freeing. TCP skbs are now usually returned to the CPU
> which allocated them, and freed in softirq context. This places the
> freeing (producing of pages back to the pool) enticingly close to
> the allocation (consumer).
> 
> If we can prove that we're freeing in the same softirq context in which
> the consumer NAPI will run - lockless use of the cache is perfectly fine,
> no need for the lock.

Super interesting! :-)

> 
> Let drivers link the page pool to a NAPI instance. If the NAPI instance
> is scheduled on the same CPU on which we're freeing - place the pages
> in the direct cache.
> 

Cool, using the direct cache this way.

In the cases where we cannot use the direct cache, there is also the
option of bulk freeing into the page_pool, which mitigates the ptr_ring
locking. See page_pool_put_page_bulk().

For XDP the page_pool_put_page_bulk() API is used by
xdp_return_frame_bulk() and xdp_flush_frame_bulk(), which together
build up a bulk.
(p.s. I see we could optimize xdp_return_frame_bulk() some more and avoid
some of the rhashtable_lookup() calls, but that is unrelated to your patch).
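
(For reference, a minimal sketch of how a driver typically drives that bulk
API today; the frame array and count here are illustrative:)

	struct xdp_frame_bulk bq;
	int i;

	xdp_frame_bulk_init(&bq);
	rcu_read_lock();	/* protects the mem allocator lookup inside */
	for (i = 0; i < nframes; i++)
		xdp_return_frame_bulk(frames[i], &bq);
	xdp_flush_frame_bulk(&bq);
	rcu_read_unlock();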

> With that and patched bnxt (XDP enabled to engage the page pool, sigh,
> bnxt really needs page pool work :() I see a 2.6% perf boost with
> a TCP stream test (app on a different physical core than softirq).
> 
> The CPU use of relevant functions decreases as expected:
> 
>    page_pool_refill_alloc_cache   1.17% -> 0%
>    _raw_spin_lock                 2.41% -> 0.98%
> 
> Only consider lockless path to be safe when NAPI is scheduled
> - in practice this should cover majority if not all of steady state
> workloads. It's usually the NAPI kicking in that causes the skb flush.
>

Makes sense, but do read the comment above struct pp_alloc_cache.
The sizing of pp_alloc_cache is important for this trick/heuristic to
work, meaning pp_alloc_cache must have enough room.
It is definitely on purpose that pp_alloc_cache has 128 elements and is
only refilled with 64 elements, which leaves room for this kind of
trick.  But if Eric's deferred skb freeing has more than 64 pages to
free, then we will likely fall back to ptr_ring recycling.

Code-wise, I suggest that you/we change page_pool_put_page_bulk() to
have a variant that takes 'allow_direct' (looking at the code below, you
might already do this, as this patch over-steers 'allow_direct').  Using
the bulk API would then bulk into the ptr_ring in the cases where we
cannot use the direct cache.

> The main case we'll miss out on is when application runs on the same
> CPU as NAPI. In that case we don't use the deferred skb free path.
> We could disable softirq one that path, too... maybe?
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: hawk@kernel.org
> CC: ilias.apalodimas@linaro.org
> ---
>   include/linux/netdevice.h |  3 +++
>   include/net/page_pool.h   |  1 +
>   net/core/dev.c            |  3 +++
>   net/core/page_pool.c      | 16 ++++++++++++++++
>   4 files changed, 23 insertions(+)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 62e093a6d6d1..b3c11353078b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -360,8 +360,11 @@ struct napi_struct {
>   	unsigned long		gro_bitmask;
>   	int			(*poll)(struct napi_struct *, int);
>   #ifdef CONFIG_NETPOLL
> +	/* CPU actively polling if netpoll is configured */
>   	int			poll_owner;
>   #endif
> +	/* CPU on which NAPI has been scheduled for processing */
> +	int			list_owner;
>   	struct net_device	*dev;
>   	struct gro_list		gro_hash[GRO_HASH_BUCKETS];
>   	struct sk_buff		*skb;
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index ddfa0b328677..f86cdfb51585 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -77,6 +77,7 @@ struct page_pool_params {
>   	unsigned int	pool_size;
>   	int		nid;  /* Numa node id to allocate from pages from */
>   	struct device	*dev; /* device, for DMA pre-mapping purposes */
> +	struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */
>   	enum dma_data_direction dma_dir; /* DMA mapping direction */
>   	unsigned int	max_len; /* max DMA sync memory size */
>   	unsigned int	offset;  /* DMA addr offset */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0c4b21291348..a6d6e5c89ce7 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4360,6 +4360,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>   	}
>   
>   	list_add_tail(&napi->poll_list, &sd->poll_list);
> +	WRITE_ONCE(napi->list_owner, smp_processor_id());
>   	/* If not called from net_rx_action()
>   	 * we have to raise NET_RX_SOFTIRQ.
>   	 */
> @@ -6070,6 +6071,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
>   		list_del_init(&n->poll_list);
>   		local_irq_restore(flags);
>   	}
> +	WRITE_ONCE(n->list_owner, -1);
>   
>   	val = READ_ONCE(n->state);
>   	do {
> @@ -6385,6 +6387,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
>   #ifdef CONFIG_NETPOLL
>   	napi->poll_owner = -1;
>   #endif
> +	napi->list_owner = -1;
>   	set_bit(NAPI_STATE_SCHED, &napi->state);
>   	set_bit(NAPI_STATE_NPSVC, &napi->state);
>   	list_add_rcu(&napi->dev_list, &dev->napi_list);
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 193c18799865..c3e2ab0c2684 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -19,6 +19,7 @@
>   #include <linux/mm.h> /* for put_page() */
>   #include <linux/poison.h>
>   #include <linux/ethtool.h>
> +#include <linux/netdevice.h>
>   
>   #include <trace/events/page_pool.h>
>   
> @@ -544,6 +545,18 @@ static bool page_pool_recycle_in_cache(struct page *page,
>   	return true;
>   }
>   
> +/* If caller didn't allow direct recycling check if we have other reasons
> + * to believe that the producer and consumer can't race.
> + *
> + * Result is only meaningful in softirq context.
> + */
> +static bool page_pool_safe_producer(struct page_pool *pool)
> +{
> +	struct napi_struct *napi = pool->p.napi;
> +
> +	return napi && READ_ONCE(napi->list_owner) == smp_processor_id();
> +}
> +
>   /* If the page refcnt == 1, this will try to recycle the page.
>    * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
>    * the configured size min(dma_sync_size, pool->max_len).
> @@ -570,6 +583,9 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>   			page_pool_dma_sync_for_device(pool, page,
>   						      dma_sync_size);
>   
> +		if (!allow_direct)
> +			allow_direct = page_pool_safe_producer(pool);
> +

I remember some use case for veth that explicitly disables
"allow_direct".  I cannot remember why exactly, but we need to make sure
this doesn't break something (as this code can override the caller's
allow_direct).

>   		if (allow_direct && in_softirq() &&
>   		    page_pool_recycle_in_cache(page, pool))
>   			return NULL;
Jakub Kicinski March 31, 2023, 7:06 p.m. UTC | #3
On Fri, 31 Mar 2023 11:31:31 +0200 Jesper Dangaard Brouer wrote:
> On 31/03/2023 06.39, Jakub Kicinski wrote:
> > With that and patched bnxt (XDP enabled to engage the page pool, sigh,
> > bnxt really needs page pool work :() I see a 2.6% perf boost with
> > a TCP stream test (app on a different physical core than softirq).
> > 
> > The CPU use of relevant functions decreases as expected:
> > 
> >    page_pool_refill_alloc_cache   1.17% -> 0%
> >    _raw_spin_lock                 2.41% -> 0.98%
> > 
> > Only consider lockless path to be safe when NAPI is scheduled
> > - in practice this should cover majority if not all of steady state
> > workloads. It's usually the NAPI kicking in that causes the skb flush.
> 
> Make sense, but do read the comment above struct pp_alloc_cache.
> The sizing of pp_alloc_cache is important for this trick/heuristic to
> work, meaning the pp_alloc_cache have enough room.
> It is definitely on purpose that pp_alloc_cache have 128 elements and is
> only refill'ed with 64 elements, which leaves room for this kind of
> trick.  But if Eric's deferred skb freeing have more than 64 pages to
> free, then we will likely fallback to ptr_ring recycling.
> 
> Code wise, I suggest that you/we change page_pool_put_page_bulk() to
> have a variant that 'allow_direct' (looking at code below, you might
> already do this as this patch over-steer 'allow_direct').  Using the
> bulk API, would then bulk into ptr_ring in the cases we cannot use
> direct cache.

Interesting point, let me re-run some tests with the statistics enabled.
For a simple stream test I think it may just be too steady to trigger
over/underflow. Each skb will carry at most 18 pages, and driver should
only produce 64 packets / consume 64 pages. Each NAPI cycle will start
by flushing the deferred free. So unless there is a hiccup either at the
app or NAPI side - the flows of pages in each direction should be steady
enough to do well with just 128 cache entries. Let me get the data and
report back.

> >   /* If the page refcnt == 1, this will try to recycle the page.
> >    * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
> >    * the configured size min(dma_sync_size, pool->max_len).
> > @@ -570,6 +583,9 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >   			page_pool_dma_sync_for_device(pool, page,
> >   						      dma_sync_size);
> >   
> > +		if (!allow_direct)
> > +			allow_direct = page_pool_safe_producer(pool);
> > +  
> 
> I remember some use-case for veth, that explicitly disables
> "allow_direct".  I cannot remember why exactly, but we need to make sure
> that doesn't break something (as this code can undo the allow_direct).

I can't find anything in veth :( Trying to grep drivers for
page_pool_put / page_pool_recycle problems, the best I could find is commit
e38553bdc377 ("net: fec: Use page_pool_put_full_page when freeing rx buffers").
Jakub Kicinski March 31, 2023, 10:17 p.m. UTC | #4
On Fri, 31 Mar 2023 12:06:43 -0700 Jakub Kicinski wrote:
> > Make sense, but do read the comment above struct pp_alloc_cache.
> > The sizing of pp_alloc_cache is important for this trick/heuristic to
> > work, meaning the pp_alloc_cache have enough room.
> > It is definitely on purpose that pp_alloc_cache have 128 elements and is
> > only refill'ed with 64 elements, which leaves room for this kind of
> > trick.  But if Eric's deferred skb freeing have more than 64 pages to
> > free, then we will likely fallback to ptr_ring recycling.
> > 
> > Code wise, I suggest that you/we change page_pool_put_page_bulk() to
> > have a variant that 'allow_direct' (looking at code below, you might
> > already do this as this patch over-steer 'allow_direct').  Using the
> > bulk API, would then bulk into ptr_ring in the cases we cannot use
> > direct cache.  
> 
> Interesting point, let me re-run some tests with the statistics enabled.
> For a simple stream test I think it may just be too steady to trigger
> over/underflow. Each skb will carry at most 18 pages, and driver should
> only produce 64 packets / consume 64 pages. Each NAPI cycle will start
> by flushing the deferred free. So unless there is a hiccup either at the
> app or NAPI side - the flows of pages in each direction should be steady
> enough to do well with just 128 cache entries. Let me get the data and
> report back.

I patched the driver a bit to use page pool for HW-GRO.
Below are recycle stats with HW-GRO and with SW GRO + XDP_PASS (packet per page).

                                  HW-GRO                 page=page
                         before       after        before       after
recycle:
  cached:                     0   138669686             0   150197505
  cache_full:                 0      223391             0       74582
  ring:               138551933     9997191     149299454           0
  ring_full:                  0         488          3154      127590
  released_refcnt:            0           0             0           0

alloc:
  fast:               136491361   148615710     146969587   150322859
  slow:                    1772        1799           144         105
  slow_high_order:            0           0             0           0
  empty:                   1772        1799           144         105
  refill:               2165245      156302       2332880        2128
  waive:                      0           0             0           0

Which seems to confirm that for this trivial test the cache sizing is
good enough, and I won't see any benefit from batching (as the cache is
only full 0.16% and 0.05% of the time, respectively).
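
(These counters come from the page pool statistics API; with
CONFIG_PAGE_POOL_STATS a driver can read them roughly like this - just a
sketch, "rxr->page_pool" is a placeholder:)

#ifdef CONFIG_PAGE_POOL_STATS
	struct page_pool_stats stats = { };

	/* Fills stats.alloc.* and stats.recycle.*, i.e. the counters
	 * shown in the table above; drivers usually fold these into
	 * their ethtool stats.
	 */
	page_pool_get_stats(rxr->page_pool, &stats);
#endif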
Ilias Apalodimas April 3, 2023, 9:16 a.m. UTC | #5
Hi Jakub

On Thu, Mar 30, 2023 at 09:39:05PM -0700, Jakub Kicinski wrote:
> Recent patches to mlx5 mentioned a regression when moving from
> driver local page pool to only using the generic page pool code.
> Page pool has two recycling paths (1) direct one, which runs in
> safe NAPI context (basically consumer context, so producing
> can be lockless); and (2) via a ptr_ring, which takes a spin
> lock because the freeing can happen from any CPU; producer
> and consumer may run concurrently.
>
> Since the page pool code was added, Eric introduced a revised version
> of deferred skb freeing. TCP skbs are now usually returned to the CPU
> which allocated them, and freed in softirq context. This places the
> freeing (producing of pages back to the pool) enticingly close to
> the allocation (consumer).
>
> If we can prove that we're freeing in the same softirq context in which
> the consumer NAPI will run - lockless use of the cache is perfectly fine,
> no need for the lock.
>
> Let drivers link the page pool to a NAPI instance. If the NAPI instance
> is scheduled on the same CPU on which we're freeing - place the pages
> in the direct cache.
>
> With that and patched bnxt (XDP enabled to engage the page pool, sigh,
> bnxt really needs page pool work :() I see a 2.6% perf boost with
> a TCP stream test (app on a different physical core than softirq).
>
> The CPU use of relevant functions decreases as expected:
>
>   page_pool_refill_alloc_cache   1.17% -> 0%
>   _raw_spin_lock                 2.41% -> 0.98%
>
> Only consider lockless path to be safe when NAPI is scheduled
> - in practice this should cover majority if not all of steady state
> workloads. It's usually the NAPI kicking in that causes the skb flush.
>
> The main case we'll miss out on is when application runs on the same
> CPU as NAPI. In that case we don't use the deferred skb free path.
> We could disable softirq one that path, too... maybe?

This whole thing makes a lot of sense to me.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: hawk@kernel.org
> CC: ilias.apalodimas@linaro.org

[...]

>  	return true;
>  }
>
> +/* If caller didn't allow direct recycling check if we have other reasons
> + * to believe that the producer and consumer can't race.
> + *
> + * Result is only meaningful in softirq context.
> + */
> +static bool page_pool_safe_producer(struct page_pool *pool)
> +{
> +	struct napi_struct *napi = pool->p.napi;
> +
> +	return napi && READ_ONCE(napi->list_owner) == smp_processor_id();
> +}
> +
>  /* If the page refcnt == 1, this will try to recycle the page.
>   * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
>   * the configured size min(dma_sync_size, pool->max_len).
> @@ -570,6 +583,9 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>  			page_pool_dma_sync_for_device(pool, page,
>  						      dma_sync_size);
>
> +		if (!allow_direct)
> +			allow_direct = page_pool_safe_producer(pool);
> +

Do we want to hide the decision in __page_pool_put_page()?  IOW, wouldn't it
be better for this function to honor whatever allow_direct dictates and
have the allow_direct = page_pool_safe_producer(pool); in the callers?

Thanks
/Ilias
>  		if (allow_direct && in_softirq() &&
>  		    page_pool_recycle_in_cache(page, pool))
>  			return NULL;
> --
> 2.39.2
>
Jakub Kicinski April 3, 2023, 3:05 p.m. UTC | #6
On Mon, 3 Apr 2023 12:16:04 +0300 Ilias Apalodimas wrote:
> >  /* If the page refcnt == 1, this will try to recycle the page.
> >   * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
> >   * the configured size min(dma_sync_size, pool->max_len).
> > @@ -570,6 +583,9 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >  			page_pool_dma_sync_for_device(pool, page,
> >  						      dma_sync_size);
> >
> > +		if (!allow_direct)
> > +			allow_direct = page_pool_safe_producer(pool);
> > +  
> 
> Do we want to hide the decision in __page_pool_put_page().  IOW wouldn't it
> be better for this function to honor whatever allow_direct dictates and
> have the allow_direct = page_pool_safe_producer(pool); in callers?

Meaning in page_pool_return_skb_page() or all the way from
napi_consume_skb()? The former does indeed sound like a good idea!
Ilias Apalodimas April 3, 2023, 3:30 p.m. UTC | #7
On Mon, 3 Apr 2023 at 18:05, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 3 Apr 2023 12:16:04 +0300 Ilias Apalodimas wrote:
> > >  /* If the page refcnt == 1, this will try to recycle the page.
> > >   * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
> > >   * the configured size min(dma_sync_size, pool->max_len).
> > > @@ -570,6 +583,9 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> > >                     page_pool_dma_sync_for_device(pool, page,
> > >                                                   dma_sync_size);
> > >
> > > +           if (!allow_direct)
> > > +                   allow_direct = page_pool_safe_producer(pool);
> > > +
> >
> > Do we want to hide the decision in __page_pool_put_page().  IOW wouldn't it
> > be better for this function to honor whatever allow_direct dictates and
> > have the allow_direct = page_pool_safe_producer(pool); in callers?
>
> Meaning in page_pool_return_skb_page() or all the way from
> napi_consume_skb()? The former does indeed sounds like a good idea!

page_pool_return_skb_page() (and maybe page_pool_put_full_page()).
FWIW we completely agree on napi_consume_skb().  We are trying to keep
page_pool and the net layer as disjoint as possible.  The only point
where we 'pollute' networking code is the recycle bit checking, and we'd
prefer to keep it like that.

Regards
/Ilias
Jakub Kicinski April 3, 2023, 5:12 p.m. UTC | #8
On Mon, 3 Apr 2023 18:30:55 +0300 Ilias Apalodimas wrote:
> > Meaning in page_pool_return_skb_page() or all the way from
> > napi_consume_skb()? The former does indeed sounds like a good idea!  
> 
> page_pool_return_skb_page() (and maybe page_pool_put_full_page()).
> FWIW we completely agree on napi_consume_skb().  We are trying to keep
> page_pool and the net layer as disjoint as possible.  The only point
> we 'pollute' networking code is the recycle bit checking and we'd
> prefer keeping it like that

Ack, OTOH plumbing thru the budget argument within netdev code should
not be a major refactoring. So maybe I should do that after all.

Otherwise we have two different conditions - netdev only recycles skbs
based on the NAPI budget != 0, but page pool will assume that
in_softirq() && !in_hardirq() is always safe.

The latter is safe, I think, unless someone adds a print half way thru
the cache update... but then it's also safe in NAPI skb recycling,
so napi_consume_skb() should stop taking the budget and just look
at preempt flags...

To make the correctness obvious, for now, I think I will refactor 
the netdev code to pass a "in NAPI poll" bool to
page_pool_return_skb_page(), and add a WARN_ON(!softirq || hardirq).

Let's see how the code ends up looking, I'll send it as RFCv2 rather
than PATCH to make it clear I'm not sure it's okay with you :)
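
A rough sketch of that direction, just to make it concrete (the parameter
name and the exact WARN placement are guesses until the RFCv2 is out):

/* hypothetical caller-driven variant, not the actual patch */
bool page_pool_return_skb_page(struct page *page, bool napi_safe)
{
	struct napi_struct *napi;
	struct page_pool *pp;
	bool allow_direct = false;

	page = compound_head(page);
	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
		return false;

	pp = page->pp;
	if (napi_safe) {
		napi = READ_ONCE(pp->p.napi);
		allow_direct = napi &&
			READ_ONCE(napi->list_owner) == smp_processor_id();
		/* the "WARN_ON(!softirq || hardirq)" mentioned above */
		WARN_ON_ONCE(!in_softirq() || in_hardirq());
	}

	page_pool_put_full_page(pp, page, allow_direct);
	return true;
}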
Yunsheng Lin April 4, 2023, 12:53 a.m. UTC | #9
On 2023/3/31 12:39, Jakub Kicinski wrote:
> Recent patches to mlx5 mentioned a regression when moving from
> driver local page pool to only using the generic page pool code.
> Page pool has two recycling paths (1) direct one, which runs in
> safe NAPI context (basically consumer context, so producing
> can be lockless); and (2) via a ptr_ring, which takes a spin
> lock because the freeing can happen from any CPU; producer
> and consumer may run concurrently.
> 
> Since the page pool code was added, Eric introduced a revised version
> of deferred skb freeing. TCP skbs are now usually returned to the CPU
> which allocated them, and freed in softirq context. This places the
> freeing (producing of pages back to the pool) enticingly close to
> the allocation (consumer).
> 
> If we can prove that we're freeing in the same softirq context in which
> the consumer NAPI will run - lockless use of the cache is perfectly fine,
> no need for the lock.
> 
> Let drivers link the page pool to a NAPI instance. If the NAPI instance
> is scheduled on the same CPU on which we're freeing - place the pages
> in the direct cache.
> 
> With that and patched bnxt (XDP enabled to engage the page pool, sigh,
> bnxt really needs page pool work :() I see a 2.6% perf boost with
> a TCP stream test (app on a different physical core than softirq).
> 
> The CPU use of relevant functions decreases as expected:
> 
>   page_pool_refill_alloc_cache   1.17% -> 0%
>   _raw_spin_lock                 2.41% -> 0.98%
> 
> Only consider lockless path to be safe when NAPI is scheduled
> - in practice this should cover majority if not all of steady state
> workloads. It's usually the NAPI kicking in that causes the skb flush.

Interesting.
I wonder if we can make this more generic by adding the skb to a per-NAPI
list instead of sd->defer_list, so that we can always use NAPI kicking to
flush skbs, as net_tx_action() does for sd->completion_queue, instead of
softirq kicking?

And it seems we know which NAPI binds to a specific socket through the
busy-poll mechanism; could we reuse that to release an skb to the NAPI
bound to that socket?

> 
> The main case we'll miss out on is when application runs on the same
> CPU as NAPI. In that case we don't use the deferred skb free path.
> We could disable softirq one that path, too... maybe?
>
Jakub Kicinski April 4, 2023, 1:45 a.m. UTC | #10
On Tue, 4 Apr 2023 08:53:36 +0800 Yunsheng Lin wrote:
> I wonder if we can make this more generic by adding the skb to per napi
> list instead of sd->defer_list, so that we can always use NAPI kicking to
> flush skb as net_tx_action() done for sd->completion_queue instead of
> softirq kicking?
> 
> And it seems we know which napi binds to a specific socket through
> busypoll mechanism, we can reuse that to release a skb to the napi
> bound to that socket?

Seems doable. My thinking was to first see how well the simpler scheme
performs with production workloads because it should have no downsides.
Tracking real NAPI pointers per socket and the extra RCU sync to manage
per-NAPI defer queues may have a perf cost.
Yunsheng Lin April 4, 2023, 3:18 a.m. UTC | #11
On 2023/4/4 9:45, Jakub Kicinski wrote:
> On Tue, 4 Apr 2023 08:53:36 +0800 Yunsheng Lin wrote:
>> I wonder if we can make this more generic by adding the skb to per napi
>> list instead of sd->defer_list, so that we can always use NAPI kicking to
>> flush skb as net_tx_action() done for sd->completion_queue instead of
>> softirq kicking?
>>
>> And it seems we know which napi binds to a specific socket through
>> busypoll mechanism, we can reuse that to release a skb to the napi
>> bound to that socket?
> 
> Seems doable. My thinking was to first see how well the simpler scheme
> performs with production workloads because it should have no downsides.
Looking forward to some performance data with production workloads :)

> Tracking real NAPI pointers per socket and extra RCU sync to manage
> per-NAPI defer queues may have perf cost.

I suppose the extra RCU sync only happens in the NAPI add/del path,
not in the data path?

> .
>
Eric Dumazet April 4, 2023, 4:21 a.m. UTC | #12
On Tue, Apr 4, 2023 at 2:53 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:

> Interesting.
> I wonder if we can make this more generic by adding the skb to per napi
> list instead of sd->defer_list, so that we can always use NAPI kicking to
> flush skb as net_tx_action() done for sd->completion_queue instead of
> softirq kicking?

We do not have a direct skb -> napi association yet; we are using an
expensive hash lookup instead.

I had the intent of adding per-cpu caches in this infrastructure,
to avoid acquiring the remote-cpu defer_lock for one skb at a time.
(This is, I think, causing some regressions for small packets with no frags.)

>
> And it seems we know which napi binds to a specific socket through
> busypoll mechanism, we can reuse that to release a skb to the napi
> bound to that socket?

busypoll is not often used, and we usually burn (spinning) cycles there;
not sure we want to optimize it?

>
> >
> > The main case we'll miss out on is when application runs on the same
> > CPU as NAPI. In that case we don't use the deferred skb free path.
> > We could disable softirq one that path, too... maybe?
> >
Yunsheng Lin April 4, 2023, 10:50 a.m. UTC | #13
On 2023/4/4 12:21, Eric Dumazet wrote:
> On Tue, Apr 4, 2023 at 2:53 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> 
>> Interesting.
>> I wonder if we can make this more generic by adding the skb to per napi
>> list instead of sd->defer_list, so that we can always use NAPI kicking to
>> flush skb as net_tx_action() done for sd->completion_queue instead of
>> softirq kicking?
> 
> We do not have direct skb  -> napi association yet, but using an
> expensive hash lookup.
> 
> I had the intent of adding per-cpu caches in this infrastructure,
> to not acquire the remote-cpu defer_lock for one skb at a time.
> (This is I think causing some regressions for small packets, with no frags)

Is there any reason not to bring back the per-socket defer_list, to avoid
acquiring the defer_lock for one skb at a time, instead of adding
per-cpu caches?

> 
>>
>> And it seems we know which napi binds to a specific socket through
>> busypoll mechanism, we can reuse that to release a skb to the napi
>> bound to that socket?
> 
> busypoll is not often used, and we usually burn (spinning) cycles there,
> not sure we want to optimize it?

How about only optimizing napi_by_id()?
To be honest, I am not sure how to optimize it exactly; maybe add a callback
to notify the socket of the NAPI's deletion?

> 
>>
>>>
>>> The main case we'll miss out on is when application runs on the same
>>> CPU as NAPI. In that case we don't use the deferred skb free path.
>>> We could disable softirq one that path, too... maybe?
>>>
> .
>
Dragos Tatulea April 5, 2023, 5:04 p.m. UTC | #14
On Mon, 2023-04-03 at 10:12 -0700, Jakub Kicinski wrote:
> On Mon, 3 Apr 2023 18:30:55 +0300 Ilias Apalodimas wrote:
> > > Meaning in page_pool_return_skb_page() or all the way from
> > > napi_consume_skb()? The former does indeed sounds like a good
> > > idea!  
> > 
> > page_pool_return_skb_page() (and maybe page_pool_put_full_page()).
> > FWIW we completely agree on napi_consume_skb().  We are trying to
> > keep
> > page_pool and the net layer as disjoint as possible.  The only
> > point
> > we 'pollute' networking code is the recycle bit checking and we'd
> > prefer keeping it like that
> 
> Ack, OTOH plumbing thru the budget argument within netdev code should
> not be a major refactoring. So maybe I should do that after all.
> 
> Otherwise we have two different conditions - netdev only recycles
> skbs
> based on the NAPI budget != 0, but page pool will assume that
> in_softirq() && !in_hardirq() is always safe.
> 
> The latter is safe, I think, unless someone adds a print half way
> thru
> the cache update... but then it's also safe in NAPI skb recycling,
> so napi_consume_skb() should stop taking the budget and just look
> at preempt flags...
> 
> To make the correctness obvious, for now, I think I will refactor 
> the netdev code to pass a "in NAPI poll" bool to
> page_pool_return_skb_page(), and add a WARN_ON(!softirq || hardirq).
> 
> Let's see how the code ends up looking, I'll send it as RFCv2 rather
> than PATCH to make it clear I'm not sure it's okay with you :)

Wow, thanks for picking this up so fast!

After enabling this in the mlx5 driver, there is already improved
page_pool cache usage for our test with the application running on the
same CPU as the receive queue NAPI (0% -> 98% cache usage).

Looking forward to the v2.
Eric Dumazet April 5, 2023, 5:11 p.m. UTC | #15
On Tue, Apr 4, 2023 at 12:50 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:

> Is there any reason not to introduce back the per socket defer_list to
> not acquire the defer_lock for one skb at a time instead of adding
> per-cpu caches?

This is a disaster on hosts with millions of sockets.
Just think how many pages _one_ skb can hold, and multiply this number
by 10,000,000.

Per-cpu is the more natural way to make sure this kind of cache will
not use an unbound amount of memory.

We had the same issues with sk->sk_forward_alloc per-socket reserves.

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 62e093a6d6d1..b3c11353078b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -360,8 +360,11 @@  struct napi_struct {
 	unsigned long		gro_bitmask;
 	int			(*poll)(struct napi_struct *, int);
 #ifdef CONFIG_NETPOLL
+	/* CPU actively polling if netpoll is configured */
 	int			poll_owner;
 #endif
+	/* CPU on which NAPI has been scheduled for processing */
+	int			list_owner;
 	struct net_device	*dev;
 	struct gro_list		gro_hash[GRO_HASH_BUCKETS];
 	struct sk_buff		*skb;
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index ddfa0b328677..f86cdfb51585 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -77,6 +77,7 @@  struct page_pool_params {
 	unsigned int	pool_size;
 	int		nid;  /* Numa node id to allocate from pages from */
 	struct device	*dev; /* device, for DMA pre-mapping purposes */
+	struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */
 	enum dma_data_direction dma_dir; /* DMA mapping direction */
 	unsigned int	max_len; /* max DMA sync memory size */
 	unsigned int	offset;  /* DMA addr offset */
diff --git a/net/core/dev.c b/net/core/dev.c
index 0c4b21291348..a6d6e5c89ce7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4360,6 +4360,7 @@  static inline void ____napi_schedule(struct softnet_data *sd,
 	}
 
 	list_add_tail(&napi->poll_list, &sd->poll_list);
+	WRITE_ONCE(napi->list_owner, smp_processor_id());
 	/* If not called from net_rx_action()
 	 * we have to raise NET_RX_SOFTIRQ.
 	 */
@@ -6070,6 +6071,7 @@  bool napi_complete_done(struct napi_struct *n, int work_done)
 		list_del_init(&n->poll_list);
 		local_irq_restore(flags);
 	}
+	WRITE_ONCE(n->list_owner, -1);
 
 	val = READ_ONCE(n->state);
 	do {
@@ -6385,6 +6387,7 @@  void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 #ifdef CONFIG_NETPOLL
 	napi->poll_owner = -1;
 #endif
+	napi->list_owner = -1;
 	set_bit(NAPI_STATE_SCHED, &napi->state);
 	set_bit(NAPI_STATE_NPSVC, &napi->state);
 	list_add_rcu(&napi->dev_list, &dev->napi_list);
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 193c18799865..c3e2ab0c2684 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -19,6 +19,7 @@ 
 #include <linux/mm.h> /* for put_page() */
 #include <linux/poison.h>
 #include <linux/ethtool.h>
+#include <linux/netdevice.h>
 
 #include <trace/events/page_pool.h>
 
@@ -544,6 +545,18 @@  static bool page_pool_recycle_in_cache(struct page *page,
 	return true;
 }
 
+/* If caller didn't allow direct recycling check if we have other reasons
+ * to believe that the producer and consumer can't race.
+ *
+ * Result is only meaningful in softirq context.
+ */
+static bool page_pool_safe_producer(struct page_pool *pool)
+{
+	struct napi_struct *napi = pool->p.napi;
+
+	return napi && READ_ONCE(napi->list_owner) == smp_processor_id();
+}
+
 /* If the page refcnt == 1, this will try to recycle the page.
  * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
  * the configured size min(dma_sync_size, pool->max_len).
@@ -570,6 +583,9 @@  __page_pool_put_page(struct page_pool *pool, struct page *page,
 			page_pool_dma_sync_for_device(pool, page,
 						      dma_sync_size);
 
+		if (!allow_direct)
+			allow_direct = page_pool_safe_producer(pool);
+
 		if (allow_direct && in_softirq() &&
 		    page_pool_recycle_in_cache(page, pool))
 			return NULL;