| Message ID | 20230823094757.gxvCEOBi@linutronix.de (mailing list archive) |
|---|---|
| State | RFC |
| Delegated to: | Netdev Maintainers |
| Series | [BUG] Possible unsafe page_pool usage in octeontx2 |
Hi Sebastian,

Thanks for the report.

On Wed, 23 Aug 2023 at 12:48, Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> Hi,
>
> I've been looking at the page_pool locking.

Apologies for any traumas we caused with that code :)

> page_pool_alloc_frag() -> page_pool_alloc_pages() ->
> __page_pool_get_cached():
>
> The core of the allocation is:
> | /* Caller MUST guarantee safe non-concurrent access, e.g. softirq */
> | if (likely(pool->alloc.count)) {
> |         /* Fast-path */
> |         page = pool->alloc.cache[--pool->alloc.count];
>
> The access to the `cache' array and the `count' variable is not locked.
> This is fine as long as there is only one consumer per pool. In my
> understanding the intention is to have one page_pool per NAPI callback
> to ensure this.
>
> The pool can be filled in the same context (within allocation if the
> pool is empty). There is also page_pool_recycle_in_cache() which fills
> the pool from within skb free, for instance:
> napi_consume_skb() -> skb_release_all() -> skb_release_data() ->
> napi_frag_unref() -> page_pool_return_skb_page().
>
> The last one has the following check here:
> | napi = READ_ONCE(pp->p.napi);
> | allow_direct = napi_safe && napi &&
> |                READ_ONCE(napi->list_owner) == smp_processor_id();
>
> This eventually ends in page_pool_recycle_in_cache() where it adds the
> page to the cache buffer if the check above is true (and BH is disabled).
>
> napi->list_owner is set once NAPI is scheduled and remains set until the
> poll callback has completed. It is safe to add items to the list because
> only one of the two can run on a single CPU, and their completion is
> ensured by having BH disabled the whole time.
>
> This breaks in octeontx2 where a worker is used to fill the buffer:
> otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() ->
> otx2_alloc_pool_buf() -> page_pool_alloc_frag().
>
> BH is disabled but the add of a page can still happen while the NAPI
> callback runs on a remote CPU, corrupting the index/array.
>
> API wise I would suggest to
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 7ff80b80a6f9f..b50e219470a36 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>  			page_pool_dma_sync_for_device(pool, page,
>  						      dma_sync_size);
>
> -		if (allow_direct && in_softirq() &&
> +		if (allow_direct && in_serving_softirq() &&
>  		    page_pool_recycle_in_cache(page, pool))
>  			return NULL;

FWIW we used to have that check. Commit 542bcea4be866b ("net: page_pool:
use in_softirq() instead") changed that, so maybe we should revert that
overall?

> because the intention (as I understand it) is to be invoked from within
> the NAPI callback (while softirq is served) and not if BH is just
> disabled due to a lock or so.
>
> It would also make sense to add a WARN_ON_ONCE(!in_serving_softirq()) to
> page_pool_alloc_pages() to spot usage outside of softirq. But this will
> trigger in every driver since the same function is used in the open
> callback to initially set up the HW.

What about adding a check in the cached allocation path in order to
skip the initial page allocation?

Thanks
/Ilias

> Sebastian
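The lock-less fast path at the heart of this thread can be modeled in a few lines of userspace C. This is only an illustrative sketch, not the kernel code: the field names `cache` and `count` mirror `pool->alloc.cache` / `pool->alloc.count` from the quoted snippet, but the struct layout and function names are invented here. It shows why the single-consumer rule matters: the pop is a plain, unsynchronized `cache[--count]`, so two contexts touching it concurrently corrupt both the index and the array.

```c
#include <assert.h>
#include <stddef.h>

#define PP_ALLOC_CACHE_SIZE 128

/* Model of the per-pool lock-less cache: a plain array plus a count,
 * with no locking whatsoever. */
struct pp_alloc_cache {
	unsigned int count;
	void *cache[PP_ALLOC_CACHE_SIZE];
};

/* Fast-path pop, as in __page_pool_get_cached(): safe only if a single
 * consumer runs at a time (the pool's NAPI poll callback). */
static void *cache_pop(struct pp_alloc_cache *c)
{
	if (c->count)
		return c->cache[--c->count];
	return NULL;
}

/* Fast-path push, as in page_pool_recycle_in_cache(): the same
 * single-context requirement applies on the producer side, which is
 * exactly what the otx2 refill worker violates. */
static int cache_push(struct pp_alloc_cache *c, void *page)
{
	if (c->count == PP_ALLOC_CACHE_SIZE)
		return 0;
	c->cache[c->count++] = page;
	return 1;
}
```

Within one context the LIFO behavior is what makes the cache fast (the most recently freed page, likely still cache-hot, is handed out first); the thread is about what happens when that one-context assumption breaks.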
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> > Sent: Wednesday, August 23, 2023 3:18 PM > Subject: [EXT] [BUG] Possible unsafe page_pool usage in octeontx2 > > This breaks in octeontx2 where a worker is used to fill the buffer: > otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() -> > otx2_alloc_pool_buf() -> page_pool_alloc_frag(). > Thanks. I agree. This path is called in octeontx2 driver only if __otx2_alloc_rbuf() function returns error In below path. otx2_napi_handler() --> pfvf->hw_ops->refill_pool_ptrs() ---> cn10k_refill_pool_ptrs() --> otx2_alloc_buffer() --> __otx2_alloc_rbuf(). As I understand, the problem is due to workqueue may get scheduled on other CPU. If we use BOUND workqueue, do you think this problem can be solved ? > BH is disabled but the add of a page can still happen while NAPI callback runs > on a remote CPU and so corrupting the index/ array. > > API wise I would suggest to > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c index > 7ff80b80a6f9f..b50e219470a36 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, > struct page *page, > page_pool_dma_sync_for_device(pool, page, > dma_sync_size); > > - if (allow_direct && in_softirq() && > + if (allow_direct && in_serving_softirq() && > page_pool_recycle_in_cache(page, pool)) > return NULL; > > because the intention (as I understand it) is to be invoked from within the > NAPI callback (while softirq is served) and not if BH is just disabled due to a > lock or so. Could you help me understand where in_softirq() check will break ? If we TX a packet (dev_queue_xmit()) in Process context on same core, in_serving_softirq() check will prevent it from recycling ?
On 2023-08-23 12:28:58 [+0000], Ratheesh Kannoth wrote: > > From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> > > Sent: Wednesday, August 23, 2023 3:18 PM > > Subject: [EXT] [BUG] Possible unsafe page_pool usage in octeontx2 > > > > This breaks in octeontx2 where a worker is used to fill the buffer: > > otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() -> > > otx2_alloc_pool_buf() -> page_pool_alloc_frag(). > > > As I understand, the problem is due to workqueue may get scheduled on > other CPU. If we use BOUND workqueue, do you think this problem can be > solved ? It would but is still open to less obvious races for instance if the IRQ/ NAPI is assigned to another CPU while the workqueue is scheduled. You would have to additional synchronisation to ensure that bad can happen. This does not make it any simpler nor prettier or serves as a good example. I would suggest to stay away from the lock-less buffer if not in NAPI and feed the pool->ring instead. > > BH is disabled but the add of a page can still happen while NAPI callback runs > > on a remote CPU and so corrupting the index/ array. > > > > API wise I would suggest to > > > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c index > > 7ff80b80a6f9f..b50e219470a36 100644 > > --- a/net/core/page_pool.c > > +++ b/net/core/page_pool.c > > @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, > > struct page *page, > > page_pool_dma_sync_for_device(pool, page, > > dma_sync_size); > > > > - if (allow_direct && in_softirq() && > > + if (allow_direct && in_serving_softirq() && > > page_pool_recycle_in_cache(page, pool)) > > return NULL; > > > > because the intention (as I understand it) is to be invoked from within the > > NAPI callback (while softirq is served) and not if BH is just disabled due to a > > lock or so. > Could you help me understand where in_softirq() check will break ? 
> If we TX a packet (dev_queue_xmit()) in process context on the same
> core, the in_serving_softirq() check will prevent it from recycling?

If a check is added to page_pool_alloc_pages() then it will trigger if
you fill the buffer from your ->ndo_open() callback, and also if you
invoke dev_queue_xmit() from process context. But then the page will be
added to &pool->ring instead.

Sebastian
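The distinction Sebastian keeps drawing — BH merely disabled versus a softirq handler actually running — is the difference between `in_softirq()` and `in_serving_softirq()`. A small userspace model of the two predicates may help; the flags below are illustrative (the kernel encodes both in the per-CPU preempt count, not in a struct like this):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative per-"CPU" state; not the real preempt-count encoding. */
struct cpu_state {
	int bh_disable_depth;	/* raised by local_bh_disable() */
	bool serving_softirq;	/* true only inside a softirq handler */
};

/* in_softirq(): true whenever BH is disabled for any reason, including
 * a plain local_bh_disable() from process context, e.g. the otx2
 * refill worker. */
static bool model_in_softirq(const struct cpu_state *s)
{
	return s->bh_disable_depth > 0 || s->serving_softirq;
}

/* in_serving_softirq(): true only while a softirq handler (such as
 * NET_RX running the NAPI poll) is actually executing. */
static bool model_in_serving_softirq(const struct cpu_state *s)
{
	return s->serving_softirq;
}
```

This is why the otx2 worker slips through the current `in_softirq()` check: it wraps the allocation in `local_bh_disable()`, so the predicate is true even though no NAPI poll is running, and why the reverted `in_serving_softirq()` check would also reject legitimate process-context TX recycling, as Ratheesh asks above.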
On 2023-08-23 14:36:06 [+0300], Ilias Apalodimas wrote: > Hi Sebastian, Hi Ilias, > > dma_sync_size); > > > > - if (allow_direct && in_softirq() && > > + if (allow_direct && in_serving_softirq() && > > page_pool_recycle_in_cache(page, pool)) > > return NULL; > > > > FWIW we used to have that check. > commit 542bcea4be866b ("net: page_pool: use in_softirq() instead") > changed that, so maybe we should revert that overall? The other changes look okay, this in particular I am not sure about. It would end up in the pool->ring instead of the lock-less cache. It still depends how it got here. If it comes from page_pool_return_skb_page() then the list_owner check will fail because it is not set for the threaded-NAPI case. If it was a real concern, the entry point must have been page_pool_put_full_page() or later. This basically assumes the same context. > > because the intention (as I understand it) is to be invoked from within > > the NAPI callback (while softirq is served) and not if BH is just > > disabled due to a lock or so. > > > > It would also make sense to a add WARN_ON_ONCE(!in_serving_softirq()) to > > page_pool_alloc_pages() to spot usage outside of softirq. But this will > > trigger in every driver since the same function is used in the open > > callback to initially setup the HW. > > What about adding a check in the cached allocation path in order to > skip the initial page allocation? Maybe. I'm a bit worried that this lock-less has no lockdep or similar checks if everyone plays by the rules. > Thanks > /Ilias Sebastian
On Wed, 23 Aug 2023 11:47:57 +0200 Sebastian Andrzej Siewior wrote: > The pool can be filled in the same context (within allocation if the > pool is empty). There is also page_pool_recycle_in_cache() which fills > the pool from within skb free, for instance: > napi_consume_skb() -> skb_release_all() -> skb_release_data() -> > napi_frag_unref() -> page_pool_return_skb_page(). > > The last one has the following check here: > | napi = READ_ONCE(pp->p.napi); > | allow_direct = napi_safe && napi && > | READ_ONCE(napi->list_owner) == smp_processor_id(); > > This eventually ends in page_pool_recycle_in_cache() where it adds the > page to the cache buffer if the check above is true (and BH is disabled). > > napi->list_owner is set once NAPI is scheduled until the poll callback > completed. It is safe to add items to list because only one of the two > can run on a single CPU and the completion of them ensured by having BH > disabled the whole time. One clarification - .napi will be NULL for otx2, it's an opt-in for drivers which allocate from NAPI, and AFAICT otx2 does not opt in.
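Jakub's clarification is visible directly in the quoted check: with `pp->p.napi` left NULL (the opt-in not taken, as in otx2), `allow_direct` can never become true, so skb-free recycling falls back to the locked ptr_ring rather than the lock-less cache. A minimal userspace model — only the field names follow the quoted kernel code, the struct layouts are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct napi_struct {
	int list_owner;		/* polling CPU, or -1 outside the poll loop */
};

struct page_pool_params {
	struct napi_struct *napi;	/* opt-in; NULL for otx2 */
};

struct page_pool {
	struct page_pool_params p;
};

/* Simplified form of the quoted allow_direct computation from
 * page_pool_return_skb_page(). */
static bool allow_direct_recycle(const struct page_pool *pp,
				 bool napi_safe, int this_cpu)
{
	const struct napi_struct *napi = pp->p.napi;

	return napi_safe && napi && napi->list_owner == this_cpu;
}
```

So the recycle side is already fenced off for otx2; the remaining problem discussed in the rest of the thread is the alloc side, which has no equivalent guard.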
(Cc Olek as he have changes in this code path) On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote: > Hi, > > I've been looking at the page_pool locking. > > page_pool_alloc_frag() -> page_pool_alloc_pages() -> > __page_pool_get_cached(): > > There core of the allocation is: > | /* Caller MUST guarantee safe non-concurrent access, e.g. softirq */ > | if (likely(pool->alloc.count)) { > | /* Fast-path */ > | page = pool->alloc.cache[--pool->alloc.count]; > > The access to the `cache' array and the `count' variable is not locked. > This is fine as long as there only one consumer per pool. In my > understanding the intention is to have one page_pool per NAPI callback > to ensure this. > Yes, the intention is a single PP instance is "bound" to one RX-NAPI. > The pool can be filled in the same context (within allocation if the > pool is empty). There is also page_pool_recycle_in_cache() which fills > the pool from within skb free, for instance: > napi_consume_skb() -> skb_release_all() -> skb_release_data() -> > napi_frag_unref() -> page_pool_return_skb_page(). > > The last one has the following check here: > | napi = READ_ONCE(pp->p.napi); > | allow_direct = napi_safe && napi && > | READ_ONCE(napi->list_owner) == smp_processor_id(); > > This eventually ends in page_pool_recycle_in_cache() where it adds the > page to the cache buffer if the check above is true (and BH is disabled). > > napi->list_owner is set once NAPI is scheduled until the poll callback > completed. It is safe to add items to list because only one of the two > can run on a single CPU and the completion of them ensured by having BH > disabled the whole time. > > This breaks in octeontx2 where a worker is used to fill the buffer: > otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() -> > otx2_alloc_pool_buf() -> page_pool_alloc_frag(). > This seems problematic! - this is NOT allowed. But otx2_pool_refill_task() is a work-queue, and I though it runs in process-context. 
This WQ process is not allowed to use the lockless PP cache. This seems
to be a bug!

The problematic part is otx2_alloc_rbuf() that disables BH:

int otx2_alloc_rbuf(struct otx2_nic *pfvf, struct otx2_pool *pool,
		    dma_addr_t *dma)
{
	int ret;

	local_bh_disable();
	ret = __otx2_alloc_rbuf(pfvf, pool, dma);
	local_bh_enable();
	return ret;
}

The fix can be to not do this local_bh_disable() in this driver?

> BH is disabled but the add of a page can still happen while NAPI
> callback runs on a remote CPU and so corrupting the index/ array.
>
> API wise I would suggest to
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 7ff80b80a6f9f..b50e219470a36 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>  			page_pool_dma_sync_for_device(pool, page,
>  						      dma_sync_size);
>
> -		if (allow_direct && in_softirq() &&
> +		if (allow_direct && in_serving_softirq() &&

This is the "return/free/put" code path, where we have "allow_direct" as
a protection in the API. API users are supposed to use
page_pool_recycle_direct() to indicate this, but at some point we
allowed APIs to expose 'allow_direct'.

The PP-alloc side is more fragile, and maybe the in_serving_softirq()
belongs there.

>  		    page_pool_recycle_in_cache(page, pool))
>  			return NULL;
>
> because the intention (as I understand it) is to be invoked from within
> the NAPI callback (while softirq is served) and not if BH is just
> disabled due to a lock or so.

True, and it used to be like this (in_serving_softirq), but as Ilias
wrote it was changed recently. This was to support threaded NAPI (in
542bcea4be866b ("net: page_pool: use in_softirq() instead")), which
I understood was one of your (Sebastian's) use-cases.

> It would also make sense to add a WARN_ON_ONCE(!in_serving_softirq()) to
> page_pool_alloc_pages() to spot usage outside of softirq.
But this will > trigger in every driver since the same function is used in the open > callback to initially setup the HW. > I'm very open to ideas of detecting this. Since mentioned commit PP is open to these kind of miss-uses of the API. One idea would be to leverage that NAPI napi->list_owner will have been set to something else than -1, when this is NAPI context. Getting hold of napi object, could be done via pp->p.napi (but as Jakub wrote this is opt-in ATM). --Jesper
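Jesper's detection idea can be phrased as code; this is a purely hypothetical userspace sketch, not a proposed kernel patch (the function name is invented, and the struct layouts only echo the field names discussed in the thread). The rule it encodes: if the pool has opted in to a NAPI instance, the lock-less cache may only be touched while that NAPI's `list_owner` equals the current CPU; without the opt-in there is nothing to verify.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct napi_struct {
	int list_owner;		/* set to the polling CPU, -1 otherwise */
};

struct page_pool_params {
	struct napi_struct *napi;	/* opt-in; may be NULL */
};

struct page_pool {
	struct page_pool_params p;
};

/* Hypothetical sanity check for the alloc-side fast path: returns
 * false when the lock-less cache is being used from the wrong context,
 * i.e. the bound NAPI is not currently polling on this CPU. */
static bool pp_cache_access_ok(const struct page_pool *pp, int this_cpu)
{
	const struct napi_struct *napi = pp->p.napi;

	if (!napi)
		return true;	/* opt-in not taken: cannot verify */
	return napi->list_owner == this_cpu;
}
```

In the kernel, something like this would sit behind a WARN_ON_ONCE() in the cached allocation path; as the thread notes, the ->ndo_open() pre-fill (when list_owner is still -1) would also trip it, so that case would need to be carved out.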
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Sent: Wednesday, August 23, 2023 6:25 PM
> To: Ratheesh Kannoth <rkannoth@marvell.com>
> Subject: Re: RE: [EXT] [BUG] Possible unsafe page_pool usage in octeontx2
>
> I would suggest to stay away from the lock-less buffer if not in NAPI and feed
> the pool->ring instead.

As Jakub explained, allow_direct will be false as pp->p.napi is 0, so
there is no lockless addition. I think we don't have to fix the page
pool alloc() in workqueue issue.

-Ratheesh
[...] > > > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > > index 7ff80b80a6f9f..b50e219470a36 100644 > > --- a/net/core/page_pool.c > > +++ b/net/core/page_pool.c > > @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, > > page_pool_dma_sync_for_device(pool, page, > > dma_sync_size); > > > > - if (allow_direct && in_softirq() && > > + if (allow_direct && in_serving_softirq() && > > This is the "return/free/put" code path, where we have "allow_direct" as > a protection in the API. API users are suppose to use > page_pool_recycle_direct() to indicate this, but as some point we > allowed APIs to expose 'allow_direct'. > > The PP-alloc side is more fragile, and maybe the in_serving_softirq() > belongs there. > > > page_pool_recycle_in_cache(page, pool)) > > return NULL; > > > > because the intention (as I understand it) is to be invoked from within > > the NAPI callback (while softirq is served) and not if BH is just > > disabled due to a lock or so. > > > > True, and it used-to-be like this (in_serving_softirq), but as Ilias > wrote it was changed recently. This was to support threaded-NAPI (in > 542bcea4be866b ("net: page_pool: use in_softirq() instead")), which > I understood was one of your (Sebastian's) use-cases. > > > > It would also make sense to a add WARN_ON_ONCE(!in_serving_softirq()) to > > page_pool_alloc_pages() to spot usage outside of softirq. But this will > > trigger in every driver since the same function is used in the open > > callback to initially setup the HW. > > > > I'm very open to ideas of detecting this. Since mentioned commit PP is > open to these kind of miss-uses of the API. > > One idea would be to leverage that NAPI napi->list_owner will have been > set to something else than -1, when this is NAPI context. Getting hold > of napi object, could be done via pp->p.napi (but as Jakub wrote this is > opt-in ATM). 
I mentioned this earlier, but can't we add the softirq check in __page_pool_get_cached()? In theory, when a driver comes up and allocates pages to fill in its descriptors it will call page_pool_alloc_pages(). That will go through the slow allocation path, fill up the caches, and return the last page. After that, most of the allocations will be served by __page_pool_get_cached(), and this is supposed to be running during the driver Rx routine which runs under NAPI. So eventually we will hit that warning. Thanks /Ilias > > --Jesper
On Thu, 24 Aug 2023 at 10:21, Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote: > > [...] > > > > > > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > > > index 7ff80b80a6f9f..b50e219470a36 100644 > > > --- a/net/core/page_pool.c > > > +++ b/net/core/page_pool.c > > > @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, > > > page_pool_dma_sync_for_device(pool, page, > > > dma_sync_size); > > > > > > - if (allow_direct && in_softirq() && > > > + if (allow_direct && in_serving_softirq() && > > > > This is the "return/free/put" code path, where we have "allow_direct" as > > a protection in the API. API users are suppose to use > > page_pool_recycle_direct() to indicate this, but as some point we > > allowed APIs to expose 'allow_direct'. > > > > The PP-alloc side is more fragile, and maybe the in_serving_softirq() > > belongs there. > > > > > page_pool_recycle_in_cache(page, pool)) > > > return NULL; > > > > > > because the intention (as I understand it) is to be invoked from within > > > the NAPI callback (while softirq is served) and not if BH is just > > > disabled due to a lock or so. > > > > > > > True, and it used-to-be like this (in_serving_softirq), but as Ilias > > wrote it was changed recently. This was to support threaded-NAPI (in > > 542bcea4be866b ("net: page_pool: use in_softirq() instead")), which > > I understood was one of your (Sebastian's) use-cases. > > > > > > > It would also make sense to a add WARN_ON_ONCE(!in_serving_softirq()) to > > > page_pool_alloc_pages() to spot usage outside of softirq. But this will > > > trigger in every driver since the same function is used in the open > > > callback to initially setup the HW. > > > > > > > I'm very open to ideas of detecting this. Since mentioned commit PP is > > open to these kind of miss-uses of the API. > > > > One idea would be to leverage that NAPI napi->list_owner will have been > > set to something else than -1, when this is NAPI context. 
Getting hold > > of napi object, could be done via pp->p.napi (but as Jakub wrote this is > > opt-in ATM). > > I mentioned this earlier, but can't we add the softirq check in > __page_pool_get_cached()? > In theory, when a driver comes up and allocates pages to fill in its > descriptors it will call page_pool_alloc_pages(). That will go > through the slow allocation path, fill up the caches, and return the > last page. After that, most of the allocations will be served by > __page_pool_get_cached(), and this is supposed to be running during > the driver Rx routine which runs under NAPI. So eventually we will > hit that warning. Right... Scratch that, this will still warn on the initial allocation. The first descriptor will get a page of the slow path, but the rest will be filled via the caches. /Ilias > > Thanks > /Ilias > > > > > --Jesper
From: Jesper Dangaard Brouer <hawk@kernel.org> Date: Wed, 23 Aug 2023 21:45:04 +0200 > (Cc Olek as he have changes in this code path) Thanks! I was reading the thread a bit on LKML, but being in the CC list is more convenient :D > > On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote: >> Hi, >> >> I've been looking at the page_pool locking. >> >> page_pool_alloc_frag() -> page_pool_alloc_pages() -> >> __page_pool_get_cached(): >> >> There core of the allocation is: >> | /* Caller MUST guarantee safe non-concurrent access, e.g. >> softirq */ >> | if (likely(pool->alloc.count)) { >> | /* Fast-path */ >> | page = pool->alloc.cache[--pool->alloc.count]; >> >> The access to the `cache' array and the `count' variable is not locked. >> This is fine as long as there only one consumer per pool. In my >> understanding the intention is to have one page_pool per NAPI callback >> to ensure this. >> > > Yes, the intention is a single PP instance is "bound" to one RX-NAPI. Isn't that also a misuse of page_pool->p.napi? I thought it can be set only when page allocation and cache refill happen both inside the same NAPI polling function. Otx2 uses workqueues to refill the queues, meaning that consumer and producer can happen in different contexts or even threads and it shouldn't set p.napi. > > >> The pool can be filled in the same context (within allocation if the >> pool is empty). There is also page_pool_recycle_in_cache() which fills >> the pool from within skb free, for instance: >> napi_consume_skb() -> skb_release_all() -> skb_release_data() -> >> napi_frag_unref() -> page_pool_return_skb_page(). >> >> The last one has the following check here: >> | napi = READ_ONCE(pp->p.napi); >> | allow_direct = napi_safe && napi && >> | READ_ONCE(napi->list_owner) == smp_processor_id(); >> >> This eventually ends in page_pool_recycle_in_cache() where it adds the >> page to the cache buffer if the check above is true (and BH is disabled). 
>> >> napi->list_owner is set once NAPI is scheduled until the poll callback >> completed. It is safe to add items to list because only one of the two >> can run on a single CPU and the completion of them ensured by having BH >> disabled the whole time. >> >> This breaks in octeontx2 where a worker is used to fill the buffer: >> otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() -> >> otx2_alloc_pool_buf() -> page_pool_alloc_frag(). >> > > This seems problematic! - this is NOT allowed. > > But otx2_pool_refill_task() is a work-queue, and I though it runs in > process-context. This WQ process is not allowed to use the lockless PP > cache. This seems to be a bug! > > The problematic part is otx2_alloc_rbuf() that disables BH: > > int otx2_alloc_rbuf(struct otx2_nic *pfvf, struct otx2_pool *pool, > dma_addr_t *dma) > { > int ret; > > local_bh_disable(); > ret = __otx2_alloc_rbuf(pfvf, pool, dma); > local_bh_enable(); > return ret; > } > > The fix, can be to not do this local_bh_disable() in this driver? > >> BH is disabled but the add of a page can still happen while NAPI >> callback runs on a remote CPU and so corrupting the index/ array. >> >> API wise I would suggest to >> >> diff --git a/net/core/page_pool.c b/net/core/page_pool.c >> index 7ff80b80a6f9f..b50e219470a36 100644 >> --- a/net/core/page_pool.c >> +++ b/net/core/page_pool.c >> @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, >> struct page *page, >> page_pool_dma_sync_for_device(pool, page, >> dma_sync_size); >> - if (allow_direct && in_softirq() && >> + if (allow_direct && in_serving_softirq() && > > This is the "return/free/put" code path, where we have "allow_direct" as > a protection in the API. API users are suppose to use > page_pool_recycle_direct() to indicate this, but as some point we > allowed APIs to expose 'allow_direct'. > > The PP-alloc side is more fragile, and maybe the in_serving_softirq() > belongs there. 
> >> page_pool_recycle_in_cache(page, pool)) >> return NULL; >> because the intention (as I understand it) is to be invoked from within >> the NAPI callback (while softirq is served) and not if BH is just >> disabled due to a lock or so. >> > > True, and it used-to-be like this (in_serving_softirq), but as Ilias > wrote it was changed recently. This was to support threaded-NAPI (in > 542bcea4be866b ("net: page_pool: use in_softirq() instead")), which > I understood was one of your (Sebastian's) use-cases. > > >> It would also make sense to a add WARN_ON_ONCE(!in_serving_softirq()) to >> page_pool_alloc_pages() to spot usage outside of softirq. But this will >> trigger in every driver since the same function is used in the open >> callback to initially setup the HW. >> > > I'm very open to ideas of detecting this. Since mentioned commit PP is > open to these kind of miss-uses of the API. > > One idea would be to leverage that NAPI napi->list_owner will have been > set to something else than -1, when this is NAPI context. Getting hold > of napi object, could be done via pp->p.napi (but as Jakub wrote this is > opt-in ATM). > > --Jesper Thanks, Olek
On 23/08/2023 21.45, Jesper Dangaard Brouer wrote:
> On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote:
[...]
>>
>> This breaks in octeontx2 where a worker is used to fill the buffer:
>> otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() ->
>> otx2_alloc_pool_buf() -> page_pool_alloc_frag().
>>
>
> This seems problematic! - this is NOT allowed.
>
> But otx2_pool_refill_task() is a work-queue, and I thought it runs in
> process-context. This WQ process is not allowed to use the lockless PP
> cache. This seems to be a bug!
>
> The problematic part is otx2_alloc_rbuf() that disables BH:
>
> int otx2_alloc_rbuf(struct otx2_nic *pfvf, struct otx2_pool *pool,
> 		    dma_addr_t *dma)
> {
> 	int ret;
>
> 	local_bh_disable();
> 	ret = __otx2_alloc_rbuf(pfvf, pool, dma);
> 	local_bh_enable();
> 	return ret;
> }
>
> The fix, can be to not do this local_bh_disable() in this driver?

Correcting myself. It is not a fix to remove this local_bh_disable()
(which is obvious now that I read the code again).

This WQ process is not allowed to use the page_pool_alloc() API this
way (from a work-queue). The PP alloc-side API must only be used under
NAPI protection.

Thanks for spotting this, Sebastian!

Will/can any of the Cc'ed Marvell people work on a fix?

--Jesper
On 24/08/2023 17.26, Alexander Lobakin wrote: > From: Jesper Dangaard Brouer<hawk@kernel.org> > Date: Wed, 23 Aug 2023 21:45:04 +0200 > >> (Cc Olek as he have changes in this code path) > Thanks! I was reading the thread a bit on LKML, but being in the CC list > is more convenient :D > :D >> On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote: >>> Hi, >>> >>> I've been looking at the page_pool locking. >>> >>> page_pool_alloc_frag() -> page_pool_alloc_pages() -> >>> __page_pool_get_cached(): >>> >>> There core of the allocation is: >>> | /* Caller MUST guarantee safe non-concurrent access, e.g. >>> softirq */ >>> | if (likely(pool->alloc.count)) { >>> | /* Fast-path */ >>> | page = pool->alloc.cache[--pool->alloc.count]; >>> >>> The access to the `cache' array and the `count' variable is not locked. >>> This is fine as long as there only one consumer per pool. In my >>> understanding the intention is to have one page_pool per NAPI callback >>> to ensure this. >>> >> Yes, the intention is a single PP instance is "bound" to one RX-NAPI. > > Isn't that also a misuse of page_pool->p.napi? I thought it can be set > only when page allocation and cache refill happen both inside the same > NAPI polling function. Otx2 uses workqueues to refill the queues, > meaning that consumer and producer can happen in different contexts or > even threads and it shouldn't set p.napi. > As Jakub wrote this otx2 driver doesn't set p.napi (so that part of the problem cannot happen). That said using workqueues to refill the queues is not compatible with using page_pool_alloc APIs. This need to be fixed in driver! --Jesper
From: Jesper Dangaard Brouer <hawk@kernel.org> Date: Fri, 25 Aug 2023 15:22:05 +0200 > > > On 24/08/2023 17.26, Alexander Lobakin wrote: >> From: Jesper Dangaard Brouer<hawk@kernel.org> >> Date: Wed, 23 Aug 2023 21:45:04 +0200 >> >>> (Cc Olek as he have changes in this code path) >> Thanks! I was reading the thread a bit on LKML, but being in the CC list >> is more convenient :D >> > > :D > >>> On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote: >>>> Hi, >>>> >>>> I've been looking at the page_pool locking. >>>> >>>> page_pool_alloc_frag() -> page_pool_alloc_pages() -> >>>> __page_pool_get_cached(): >>>> >>>> There core of the allocation is: >>>> | /* Caller MUST guarantee safe non-concurrent access, e.g. >>>> softirq */ >>>> | if (likely(pool->alloc.count)) { >>>> | /* Fast-path */ >>>> | page = pool->alloc.cache[--pool->alloc.count]; >>>> >>>> The access to the `cache' array and the `count' variable is not locked. >>>> This is fine as long as there only one consumer per pool. In my >>>> understanding the intention is to have one page_pool per NAPI callback >>>> to ensure this. >>>> >>> Yes, the intention is a single PP instance is "bound" to one RX-NAPI. >> >> Isn't that also a misuse of page_pool->p.napi? I thought it can be set >> only when page allocation and cache refill happen both inside the same >> NAPI polling function. Otx2 uses workqueues to refill the queues, >> meaning that consumer and producer can happen in different contexts or >> even threads and it shouldn't set p.napi. >> > > As Jakub wrote this otx2 driver doesn't set p.napi (so that part of the > problem cannot happen). Oh I'm dumb :z > > That said using workqueues to refill the queues is not compatible with > using page_pool_alloc APIs. This need to be fixed in driver! Hmmm I'm wondering now, how should we use Page Pool if we want to run consume and alloc routines separately? Do you want to say we can't use Page Pool in that case at all? 
Quoting other your reply: > This WQ process is not allowed to use the page_pool_alloc() API this > way (from a work-queue). The PP alloc-side API must only be used > under NAPI protection. Who did say that? If I don't set p.napi, how is Page Pool then tied to NAPI? Moreover, when you initially do an ifup/.ndo_open, you have to fill your Rx queues. It's process context and it can happen on whichever CPU. Do you mean I can't allocate pages in .ndo_open? :D > > --Jesper Thanks, Olek
On 25/08/2023 15.38, Alexander Lobakin wrote: > From: Jesper Dangaard Brouer <hawk@kernel.org> > Date: Fri, 25 Aug 2023 15:22:05 +0200 > >> >> >> On 24/08/2023 17.26, Alexander Lobakin wrote: >>> From: Jesper Dangaard Brouer<hawk@kernel.org> >>> Date: Wed, 23 Aug 2023 21:45:04 +0200 >>> >>>> (Cc Olek as he have changes in this code path) >>> Thanks! I was reading the thread a bit on LKML, but being in the CC list >>> is more convenient :D >>> >> >> :D >> >>>> On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote: >>>>> Hi, >>>>> >>>>> I've been looking at the page_pool locking. >>>>> >>>>> page_pool_alloc_frag() -> page_pool_alloc_pages() -> >>>>> __page_pool_get_cached(): >>>>> >>>>> There core of the allocation is: >>>>> | /* Caller MUST guarantee safe non-concurrent access, e.g. >>>>> softirq */ >>>>> | if (likely(pool->alloc.count)) { >>>>> | /* Fast-path */ >>>>> | page = pool->alloc.cache[--pool->alloc.count]; >>>>> >>>>> The access to the `cache' array and the `count' variable is not locked. >>>>> This is fine as long as there only one consumer per pool. In my >>>>> understanding the intention is to have one page_pool per NAPI callback >>>>> to ensure this. >>>>> >>>> Yes, the intention is a single PP instance is "bound" to one RX-NAPI. >>> >>> Isn't that also a misuse of page_pool->p.napi? I thought it can be set >>> only when page allocation and cache refill happen both inside the same >>> NAPI polling function. Otx2 uses workqueues to refill the queues, >>> meaning that consumer and producer can happen in different contexts or >>> even threads and it shouldn't set p.napi. >>> >> >> As Jakub wrote this otx2 driver doesn't set p.napi (so that part of the >> problem cannot happen). > > Oh I'm dumb :z > >> >> That said using workqueues to refill the queues is not compatible with >> using page_pool_alloc APIs. This need to be fixed in driver! > > Hmmm I'm wondering now, how should we use Page Pool if we want to run > consume and alloc routines separately? 
> Do you want to say we can't use Page Pool in that case at all?
> Quoting other your reply:
>
>> This WQ process is not allowed to use the page_pool_alloc() API this
>> way (from a work-queue). The PP alloc-side API must only be used
>> under NAPI protection.
>
> Who did say that? If I don't set p.napi, how is Page Pool then tied to NAPI?

*I* say that (as the PP inventor): that was the design and intent. The
pool is tied to a NAPI instance and relies on the NAPI protection to
make lockless access to this cache array safe.

> Moreover, when you initially do an ifup/.ndo_open, you have to fill your
> Rx queues. It's process context and it can happen on whichever CPU.
> Do you mean I can't allocate pages in .ndo_open? :D

True, all drivers basically allocate from this *before* the RX-ring /
NAPI is activated. That is safe and "allowed" given the driver RX-ring
is not active yet. This use-case unfortunately also makes it harder to
add something to the PP API that detects misuse of the API.

Looking again at the driver: the otx2_txrx.c NAPI code path also calls
PP directly. otx2_napi_handler() calls refill_pool_ptrs() ->
otx2_refill_pool_ptrs() -> otx2_alloc_buffer() -> __otx2_alloc_rbuf() ->
if (pool->page_pool) otx2_alloc_pool_buf() -> page_pool_alloc_frag().

The function otx2_alloc_buffer() can also choose to start a WQ, which
also calls the PP alloc API, this time via otx2_alloc_rbuf() with
BH disabled. Like Sebastian, I don't think this is safe!
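[Editorial note] The single-consumer constraint being discussed can be sketched in plain C. This is a minimal userspace model, not the kernel code: the names (`alloc_cache`, `cache_get`, `cache_put`) are invented for illustration, and it only shows why `count` and the array must never be touched from two contexts at once.

```c
#include <stddef.h>

/* Userspace model of the page_pool lockless alloc cache; names are
 * illustrative, not the real kernel structures. */
#define PP_ALLOC_CACHE_SIZE 128

struct alloc_cache {
	unsigned int count;
	void *cache[PP_ALLOC_CACHE_SIZE];
};

/* Fast-path alloc: caller MUST guarantee non-concurrent access (in the
 * kernel, the pool's own NAPI poll with BH disabled). */
static void *cache_get(struct alloc_cache *c)
{
	if (c->count)
		return c->cache[--c->count];
	return NULL;	/* kernel would fall back to ptr_ring / page allocator */
}

/* Fast-path recycle: same single-context requirement. A refill worker
 * running this on another CPU does an unlocked read-modify-write of
 * `count` racing the NAPI poller, which is exactly the index/array
 * corruption reported for otx2. */
static int cache_put(struct alloc_cache *c, void *page)
{
	if (c->count < PP_ALLOC_CACHE_SIZE) {
		c->cache[c->count++] = page;
		return 0;
	}
	return -1;	/* cache full: kernel recycles into ptr_ring instead */
}
```

Both helpers touch `count` without any lock; that is only safe because, by design, one pool is consumed from exactly one NAPI instance.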
--Jesper

This can be a workaround fix:

$ git diff
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
index dce3cea00032..ab7ca146fddf 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
@@ -578,6 +578,10 @@ int otx2_alloc_buffer(struct otx2_nic *pfvf, struct otx2_cq_queue *cq,
 	struct refill_work *work;
 	struct delayed_work *dwork;
 
+	/* page_pool alloc API cannot be used from WQ */
+	if (cq->rbpool->page_pool)
+		return -ENOMEM;
+
 	work = &pfvf->refill_wrk[cq->cq_idx];
 	dwork = &work->pool_refill_work;
 	/* Schedule a task if no other task is running */
On Fri, 25 Aug 2023 19:25:42 +0200 Jesper Dangaard Brouer wrote:
> >> This WQ process is not allowed to use the page_pool_alloc() API this
> >> way (from a work-queue). The PP alloc-side API must only be used
> >> under NAPI protection.
> >
> > Who did say that? If I don't set p.napi, how is Page Pool then tied to NAPI?
>
> *I* say that (as the PP inventor) as that was the design and intent,
> that this is tied to a NAPI instance and rely on the NAPI protection to
> make it safe to do lockless access to this cache array.

Absolutely no objection to us making the NAPI / bh context a requirement
past the startup stage, but just to be sure I understand the code -
technically, if the driver never recycles direct, does not set the NAPI,
does not use xdp_return_frame_rx_napi() etc. - the cache is always empty,
so we good?

I wonder if we can add a check like "mark the pool as BH-only on first
BH use, and WARN() on process use afterwards". But I'm not sure what
CONFIG you'd accept that being under ;)
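[Editorial note] Jakub's proposed debug check could look roughly like the following userspace sketch. Everything here (`pp_dbg`, `pp_ctx_check`, the enum) is invented for illustration; a real kernel version would key off the actual context (e.g. in_serving_softirq()) and use WARN_ONCE(), likely under a debug CONFIG option.

```c
#include <stdio.h>

enum pp_ctx { PP_CTX_PROCESS, PP_CTX_BH };

struct pp_dbg {
	int bh_seen;	/* pool has been used from BH context at least once */
	int warned;	/* WARN_ONCE() equivalent */
};

/* Returns 1 when a BH-only pool is used from process context.  The
 * startup-stage fill from .ndo_open stays allowed because it happens
 * before any BH use has marked the pool. */
static int pp_ctx_check(struct pp_dbg *dbg, enum pp_ctx now)
{
	if (now == PP_CTX_BH) {
		dbg->bh_seen = 1;	/* first BH use marks the pool BH-only */
		return 0;
	}
	if (dbg->bh_seen) {
		if (!dbg->warned) {
			dbg->warned = 1;
			fprintf(stderr,
				"page_pool: process-context use of BH-only pool\n");
		}
		return 1;
	}
	return 0;
}
```

This also matches the thread's observation that drivers legitimately fill rings from process context before the first NAPI poll: only use *after* the first BH use is flagged.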
From: Jakub Kicinski <kuba@kernel.org>
Date: Fri, 25 Aug 2023 17:42:58 -0700

> On Fri, 25 Aug 2023 19:25:42 +0200 Jesper Dangaard Brouer wrote:
>>>> This WQ process is not allowed to use the page_pool_alloc() API this
>>>> way (from a work-queue). The PP alloc-side API must only be used
>>>> under NAPI protection.
>>>
>>> Who did say that? If I don't set p.napi, how is Page Pool then tied to NAPI?
>>
>> *I* say that (as the PP inventor) as that was the design and intent,
>> that this is tied to a NAPI instance and rely on the NAPI protection to
>> make it safe to do lockless access to this cache array.
>
> Absolutely no objection to us making the NAPI / bh context a requirement
> past the startup stage, but just to be sure I understand the code -
> technically if the driver never recycles direct, does not set the NAPI,
> does not use xdp_return_frame_rx_napi etc. - the cache is always empty
> so we good?

+1, I don't say Otx2 is correct, but I don't see any issues in having
consumer and producer running on different cores and in different
contexts, as long as p.napi is not set. The split queue model is
trending and I don't see reasons why PP should require serializing ->
effectively making it work the same way as the "single queue" mode
works. Especially given that we have the ptr_ring with locks, not only
the direct cache.

> I wonder if we can add a check like "mark the pool as BH-only on first
> BH use, and WARN() on process use afterwards". But I'm not sure what

Why do we use spin_lock_bh() and friends then, and check in_softirq(),
if PP can only work in one context?

> CONFIG you'd accept that being under ;)

Thanks,
Olek
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Fri, 25 Aug 2023 19:25:42 +0200

>
>
> On 25/08/2023 15.38, Alexander Lobakin wrote:
>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>> Date: Fri, 25 Aug 2023 15:22:05 +0200
>>
>>>
>>>
>>> On 24/08/2023 17.26, Alexander Lobakin wrote:
>>>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>>>> Date: Wed, 23 Aug 2023 21:45:04 +0200
>>>>
>>>>> (Cc Olek as he have changes in this code path)
>>>> Thanks! I was reading the thread a bit on LKML, but being in the CC list
>>>> is more convenient :D
>>>>
>>>
>>> :D
>>>
>>>>> On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've been looking at the page_pool locking.
>>>>>>
>>>>>> page_pool_alloc_frag() -> page_pool_alloc_pages() ->
>>>>>> __page_pool_get_cached():
>>>>>>
>>>>>> There core of the allocation is:
>>>>>> | /* Caller MUST guarantee safe non-concurrent access, e.g. softirq */
>>>>>> | if (likely(pool->alloc.count)) {
>>>>>> |         /* Fast-path */
>>>>>> |         page = pool->alloc.cache[--pool->alloc.count];
>>>>>>
>>>>>> The access to the `cache' array and the `count' variable is not
>>>>>> locked.
>>>>>> This is fine as long as there only one consumer per pool. In my
>>>>>> understanding the intention is to have one page_pool per NAPI
>>>>>> callback
>>>>>> to ensure this.
>>>>>>
>>>>> Yes, the intention is a single PP instance is "bound" to one RX-NAPI.
>>>>
>>>> Isn't that also a misuse of page_pool->p.napi? I thought it can be set
>>>> only when page allocation and cache refill happen both inside the same
>>>> NAPI polling function. Otx2 uses workqueues to refill the queues,
>>>> meaning that consumer and producer can happen in different contexts or
>>>> even threads and it shouldn't set p.napi.
>>>>
>>>
>>> As Jakub wrote this otx2 driver doesn't set p.napi (so that part of the
>>> problem cannot happen).
>>
>> Oh I'm dumb :z
>>
>>>
>>> That said using workqueues to refill the queues is not compatible with
>>> using page_pool_alloc APIs. This need to be fixed in driver!
>>
>> Hmmm I'm wondering now, how should we use Page Pool if we want to run
>> consume and alloc routines separately? Do you want to say we can't use
>> Page Pool in that case at all? Quoting other your reply:
>>
>>> This WQ process is not allowed to use the page_pool_alloc() API this
>>> way (from a work-queue). The PP alloc-side API must only be used
>>> under NAPI protection.
>>
>> Who did say that? If I don't set p.napi, how is Page Pool then tied to
>> NAPI?
>
> *I* say that (as the PP inventor) as that was the design and intent,

> *I*
> the PP inventor

Sure, I got that a couple of years ago, no need to keep reminding me of
that xD

> that this is tied to a NAPI instance and rely on the NAPI protection to
> make it safe to do lockless access to this cache array.
>
>> Moreover, when you initially do an ifup/.ndo_open, you have to fill your
>> Rx queues. It's process context and it can happen on whichever CPU.
>> Do you mean I can't allocate pages in .ndo_open? :D
>
> True, that all driver basically allocate from this *before* the RX-ring
> / NAPI is activated. That is safe and "allowed" given the driver
> RX-ring is not active yet. This use-case unfortunately also make it
> harder to add something to the PP API, that detect miss-usage of the API.
>
> Looking again at the driver otx2_txrx.c NAPI code path also calls PP
> directly in otx2_napi_handler() will call refill_pool_ptrs() ->
> otx2_refill_pool_ptrs() -> otx2_alloc_buffer() -> __otx2_alloc_rbuf() ->
> if (pool->page_pool) otx2_alloc_pool_buf() -> page_pool_alloc_frag().
>
> The function otx2_alloc_buffer() can also choose to start a WQ, that
> also called PP alloc API this time via otx2_alloc_rbuf() that have
> BH-disable. Like Sebastian, I don't think this is safe!
Disabling BH doesn't look correct to me, but I don't see issues in
having consumer and producer working on different cores, as long as
they use the ptr_ring with locks.

>
> --Jesper
>
> This can be a workaround fix:
>
> $ git diff
> diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> index dce3cea00032..ab7ca146fddf 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> @@ -578,6 +578,10 @@ int otx2_alloc_buffer(struct otx2_nic *pfvf, struct
> otx2_cq_queue *cq,
>          struct refill_work *work;
>          struct delayed_work *dwork;
>
> +        /* page_pool alloc API cannot be used from WQ */
> +        if (cq->rbpool->page_pool)
> +                return -ENOMEM;

I believe that breaks the driver?

> +
>          work = &pfvf->refill_wrk[cq->cq_idx];
>          dwork = &work->pool_refill_work;
>          /* Schedule a task if no other task is running */

Thanks,
Olek
On 26/08/2023 02.42, Jakub Kicinski wrote:
> On Fri, 25 Aug 2023 19:25:42 +0200 Jesper Dangaard Brouer wrote:
>>>> This WQ process is not allowed to use the page_pool_alloc() API this
>>>> way (from a work-queue). The PP alloc-side API must only be used
>>>> under NAPI protection.
>>>
>>> Who did say that? If I don't set p.napi, how is Page Pool then tied to NAPI?
>>
>> *I* say that (as the PP inventor) as that was the design and intent,
>> that this is tied to a NAPI instance and rely on the NAPI protection to
>> make it safe to do lockless access to this cache array.
>
> Absolutely no objection to us making the NAPI / bh context a requirement
> past the startup stage, but just to be sure I understand the code -
> technically if the driver never recycles direct, does not set the NAPI,
> does not use xdp_return_frame_rx_napi etc. - the cache is always empty
> so we good?
>

Nope, the cache is NOT always empty: the PP cache is refilled when it
runs empty. Thus, the PP alloc-side code will touch/use
pool->alloc.cache[]. See the two places in the code with the comment:
/* Return last page */. The PP cache is always refilled, either from
the ptr_ring or via the page allocator's bulking APIs.

> I wonder if we can add a check like "mark the pool as BH-only on first
> BH use, and WARN() on process use afterwards". But I'm not sure what
> CONFIG you'd accept that being under ;)
>

The PP alloc side is designed as a Single Consumer data-structure/model
(via a lockless cache/array) that, on an empty cache, falls back to
bulking from a Multi Consumer data-structure/model to amortize that
cost. This is where the PP speedup comes from.

--Jesper
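[Editorial note] The two-level design Jesper describes (lockless single-consumer cache, bulk-refilled from a locked structure) can be modeled like this. All names and sizes are invented for illustration, and a plain pthread mutex stands in for the ptr_ring's locking:

```c
#include <pthread.h>
#include <stddef.h>

#define CACHE_SZ 8
#define BULK     4	/* refill batch size; amortizes the lock cost */

struct ring {
	pthread_mutex_t lock;	/* stands in for ptr_ring locking */
	void *slot[64];
	unsigned int n;
};

struct pool {
	unsigned int count;	/* single consumer: no lock needed */
	void *cache[CACHE_SZ];
	struct ring ring;
};

/* Alloc fast path: lockless pop from the cache; on empty, bulk-refill
 * under the ring lock so the lock is taken once per BULK allocations
 * instead of once per allocation. */
static void *pool_get(struct pool *p)
{
	if (!p->count) {
		pthread_mutex_lock(&p->ring.lock);
		while (p->ring.n && p->count < BULK)
			p->cache[p->count++] = p->ring.slot[--p->ring.n];
		pthread_mutex_unlock(&p->ring.lock);
	}
	/* NULL means: go to the (slow) page allocator */
	return p->count ? p->cache[--p->count] : NULL;
}
```

The fast path never takes the lock, which is exactly why it must have a single consumer; the lock only protects the refill source.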
On 28/08/2023 13.07, Alexander Lobakin wrote:
>> This can be a workaround fix:
>>
>> $ git diff
>> diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
>> b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
>> index dce3cea00032..ab7ca146fddf 100644
>> --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
>> +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
>> @@ -578,6 +578,10 @@ int otx2_alloc_buffer(struct otx2_nic *pfvf, struct
>> otx2_cq_queue *cq,
>>          struct refill_work *work;
>>          struct delayed_work *dwork;
>>
>> +        /* page_pool alloc API cannot be used from WQ */
>> +        if (cq->rbpool->page_pool)
>> +                return -ENOMEM;
> I believe that breaks the driver?
>

Why would that break the driver? AFAIK returning 0 here will break the
driver. We need to return something non-zero, see otx2_refill_pool_ptrs()
copy-pasted below my signature.

>> +
>>          work = &pfvf->refill_wrk[cq->cq_idx];
>>          dwork = &work->pool_refill_work;
>>          /* Schedule a task if no other task is running */

--Jesper

void otx2_refill_pool_ptrs(void *dev, struct otx2_cq_queue *cq)
{
	struct otx2_nic *pfvf = dev;
	dma_addr_t bufptr;

	while (cq->pool_ptrs) {
		if (otx2_alloc_buffer(pfvf, cq, &bufptr))
			break;
		otx2_aura_freeptr(pfvf, cq->cq_idx,
				  bufptr + OTX2_HEAD_ROOM);
		cq->pool_ptrs--;
	}
}
On 2023-08-28 13:07:12 [+0200], Alexander Lobakin wrote:
> > Looking again at the driver otx2_txrx.c NAPI code path also calls PP
> > directly in otx2_napi_handler() will call refill_pool_ptrs() ->
> > otx2_refill_pool_ptrs() -> otx2_alloc_buffer() -> __otx2_alloc_rbuf() ->
> > if (pool->page_pool) otx2_alloc_pool_buf() -> page_pool_alloc_frag().
> >
> > The function otx2_alloc_buffer() can also choose to start a WQ, that
> > also called PP alloc API this time via otx2_alloc_rbuf() that have
> > BH-disable. Like Sebastian, I don't think this is safe!
>
> Disabling BH doesn't look correct to me, but I don't see issues in
> having consumer and producer working on different cores, as long as they
> use ptr_ring with locks.

After learning what p.napi is about, may I point out that there are also
users that don't check that and use page_pool_put_full_page() or later?
While I can't comment on the bpf/XDP users, browsing otx2, there is
this:

| otx2_stop()
| -> otx2_free_hw_resources()
|    -> otx2_cleanup_rx_cqes
|       -> otx2_free_bufs
|          -> page_pool_put_full_page(, true)
| -> cancel_delayed_work_sync()

otx2 is "safe" here due to the in_softirq() check in
__page_pool_put_page(). But: the worker could run and fill the lockless
buffer while otx2_free_bufs() is doing cleanup / removing pages and
filling this buffer on another CPU. The worker is only synchronised
after the free. The lack of BH-disable in otx2_stop()'s path saves the
day here.

(It looks somehow suspicious that there is first "free mem" followed by
"sync refill worker" and not the other way around.)

Sebastian
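[Editorial note] Sebastian's ordering concern can be made concrete with a tiny model. All names here are invented; the point is only that a teardown should cancel and sync the refill worker *before* freeing shared buffer state, which appears reversed in the otx2_stop() path above.

```c
/* Toy model of the teardown race: `worker_active` stands in for the
 * delayed refill work, `bufs` for the shared Rx buffer state. */
struct rxq {
	int worker_active;
	int bufs;
};

static void refill_worker(struct rxq *q)
{
	if (q->worker_active)
		q->bufs++;	/* refilling a queue that is being torn down */
}

/* otx2_stop()-like order: free first, sync the worker afterwards.
 * The worker can still run in between and repopulate freed state. */
static void stop_free_then_sync(struct rxq *q)
{
	q->bufs = 0;		/* free buffers */
	refill_worker(q);	/* worker races with the free */
	q->worker_active = 0;	/* cancel_delayed_work_sync() comes too late */
}

/* Safe order: sync the worker first, then free. */
static void stop_sync_then_free(struct rxq *q)
{
	q->worker_active = 0;	/* cancel + sync the worker first */
	refill_worker(q);	/* now a no-op */
	q->bufs = 0;		/* free buffers last */
}
```

In the model the unsafe order leaves a stray refilled buffer behind, which corresponds to the "worker fills the lockless buffer during free" scenario described above.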
> From: Jesper Dangaard Brouer <hawk@kernel.org>
> Subject: [EXT] Re: [BUG] Possible unsafe page_pool usage in octeontx2
>
> This WQ process is not allowed to use the page_pool_alloc() API this way
> (from a work-queue). The PP alloc-side API must only be used under NAPI
> protection. Thanks for spotting this Sebastian!
>
> Will/can any of the Cc'ed Marvell people work on a fix?

We are working on it.
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 7ff80b80a6f9f..b50e219470a36 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 		page_pool_dma_sync_for_device(pool, page, dma_sync_size);
 
-	if (allow_direct && in_softirq() &&
+	if (allow_direct && in_serving_softirq() &&
 	    page_pool_recycle_in_cache(page, pool))
 		return NULL;
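[Editorial note] For context on why this one-word change matters: in the kernel, in_softirq() is true whenever the softirq count is non-zero, which includes process context under local_bh_disable() (as in the otx2 refill worker), while in_serving_softirq() is true only while a softirq handler is actually executing. A rough userspace model of that distinction (the `_m` names are invented):

```c
/* Model of the preempt-count distinction: disabling BH and actually
 * serving a softirq both make in_softirq() true, but only the latter
 * counts as "serving". */
#define SOFTIRQ_DISABLE 0x1	/* e.g. local_bh_disable() in a worker */
#define SOFTIRQ_SERVING 0x2	/* executing a softirq handler (NAPI) */

static int in_softirq_m(int state)
{
	return state != 0;	/* any softirq activity, incl. BH-disable */
}

static int in_serving_softirq_m(int state)
{
	return state & SOFTIRQ_SERVING;	/* only genuine softirq context */
}
```

With in_softirq(), any BH-disabled context (like the refill worker path discussed in this thread) could pass the allow_direct check and recycle into the lockless cache concurrently with NAPI; in_serving_softirq() restricts direct recycling to genuine softirq context.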