diff mbox series

[net-next] page pool: not return page to alloc cache during pool destruction

Message ID 20230615013645.7297-1-liangchen.linux@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net-next] page pool: not return page to alloc cache during pool destruction | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 10 this patch: 10
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 10 this patch: 10
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Liang Chen June 15, 2023, 1:36 a.m. UTC
When destroying a page pool, the alloc cache and recycle ring are emptied.
If there are inflight pages, the retry process will periodically check the
recycle ring for recently returned pages, but not the alloc cache (alloc
cache is only emptied once). As a result, any pages returned to the alloc
cache after the page pool destruction will be stuck there and cause the
retry process to continuously look for inflight pages and report warnings.

To safeguard against this situation, pages should be prevented from returning
to the alloc cache after pool destruction.

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
---
 net/core/page_pool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
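
For context, here is a heavily simplified sketch of the destroy/retry flow described
above. Function and field names follow net/core/page_pool.c, but the bodies are an
illustrative reduction, not the exact mainline code:

	/* Illustrative reduction of the shutdown path, not the mainline code. */
	static void page_pool_scrub(struct page_pool *pool)
	{
		/* The alloc cache is drained only on the first scrub... */
		if (!pool->destroy_cnt)
			while (pool->alloc.count)
				page_pool_return_page(pool,
						      pool->alloc.cache[--pool->alloc.count]);
		pool->destroy_cnt++;

		/* ...while the recycle ring (ptr_ring) is drained on every pass. */
		page_pool_empty_ring(pool);
	}

	static void page_pool_release_retry(struct page_pool *pool)
	{
		page_pool_scrub(pool);

		/* A page parked in the alloc cache after the first scrub still
		 * counts as inflight, so this keeps rescheduling and warning.
		 */
		if (page_pool_inflight(pool))
			schedule_delayed_work(&pool->release_dw, DEFER_TIME);
	}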

Comments

Jakub Kicinski June 15, 2023, 4:20 a.m. UTC | #1
On Thu, 15 Jun 2023 09:36:45 +0800 Liang Chen wrote:
> When destroying a page pool, the alloc cache and recycle ring are emptied.
> If there are inflight pages, the retry process will periodically check the
> recycle ring for recently returned pages, but not the alloc cache (alloc
> cache is only emptied once). As a result, any pages returned to the alloc
> cache after the page pool destruction will be stuck there and cause the
> retry process to continuously look for inflight pages and report warnings.
> 
> To safeguard against this situation, pages should be prevented from returning
> to the alloc cache after pool destruction.

Let's hear from the page pool maintainers but I think the driver 
is supposed to prevent allocations while pool is getting destroyed.
Perhaps we can add DEBUG_NET_WARN_ON_ONCE() for this condition to
prevent wasting cycles in production builds?
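
One possible shape of that suggestion, applied to the hunk this patch touches (a
hedged sketch, not a patch that was posted in this thread):

	if (allow_direct && in_softirq()) {
		/* Catch the API misuse in debug builds only, keeping the
		 * production fast path unchanged.
		 */
		DEBUG_NET_WARN_ON_ONCE(pool->destroy_cnt);
		if (page_pool_recycle_in_cache(page, pool))
			return NULL;
	}

With CONFIG_DEBUG_NET disabled, DEBUG_NET_WARN_ON_ONCE() compiles away, so the
production fast path stays as it is today.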
Yunsheng Lin June 15, 2023, 9:01 a.m. UTC | #2
On 2023/6/15 9:36, Liang Chen wrote:
> When destroying a page pool, the alloc cache and recycle ring are emptied.
> If there are inflight pages, the retry process will periodically check the
> recycle ring for recently returned pages, but not the alloc cache (alloc
> cache is only emptied once). As a result, any pages returned to the alloc
> cache after the page pool destruction will be stuck there and cause the
> retry process to continuously look for inflight pages and report warnings.

It seems there is still page_pool_put[_full]_page() called with
allow_direct being true after page_pool_destroy() is called, which
is not allowed.

Normally the driver will call napi_disable() before
page_pool_destroy() to ensure there is no such page_pool_destroy()
calling with allow_direct being true after page_pool_destroy() is
called.

> 
> To safeguard against this situation, pages should be prevented from returning
> to the alloc cache after pool destruction.
> 
> Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
> ---
>  net/core/page_pool.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a3e12a61d456..76255313d349 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -595,7 +595,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>  			page_pool_dma_sync_for_device(pool, page,
>  						      dma_sync_size);
>  
> -		if (allow_direct && in_softirq() &&
> +		if (allow_direct && in_softirq() && !pool->destroy_cnt &&

The check seems racy when __page_pool_put_page() and
page_pool_destroy() are called concurrently.

>  		    page_pool_recycle_in_cache(page, pool))
>  			return NULL;
>  
>
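
An informal interleaving of the race being pointed out (not kernel code):

	/*
	 * CPU0: __page_pool_put_page()            CPU1: page_pool_destroy()
	 * -----------------------------           -------------------------
	 * reads pool->destroy_cnt == 0
	 *                                         drains the alloc cache (the
	 *                                         one and only time)
	 *                                         increments destroy_cnt
	 * page_pool_recycle_in_cache(page, pool)
	 *   page now parked in alloc.cache, invisible to the retry worker
	 */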
Yunsheng Lin June 15, 2023, 9:06 a.m. UTC | #3
On 2023/6/15 17:01, Yunsheng Lin wrote:
> On 2023/6/15 9:36, Liang Chen wrote:
>> When destroying a page pool, the alloc cache and recycle ring are emptied.
>> If there are inflight pages, the retry process will periodically check the
>> recycle ring for recently returned pages, but not the alloc cache (alloc
>> cache is only emptied once). As a result, any pages returned to the alloc
>> cache after the page pool destruction will be stuck there and cause the
>> retry process to continuously look for inflight pages and report warnings.
> 
> It seems there is still page_pool_put[_full]_page() called with
> allow_direct being true after page_pool_destroy() is call, which
> is not allowed.
> 
> Normally the driver will call napi_disable() before
> page_pool_destroy() to ensure there is no such page_pool_destroy()

no such page_pool_put[_full]_page()

> calling with allow_direct being true after page_pool_destroy() is
> called.
> 
>>
>> To safeguard against this situation, pages should be prevented from returning
>> to the alloc cache after pool destruction.
>>
>> Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
>> ---
>>  net/core/page_pool.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index a3e12a61d456..76255313d349 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -595,7 +595,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>>  			page_pool_dma_sync_for_device(pool, page,
>>  						      dma_sync_size);
>>  
>> -		if (allow_direct && in_softirq() &&
>> +		if (allow_direct && in_softirq() && !pool->destroy_cnt &&
> 
> The check seems racy when __page_pool_put_page() and
> page_pool_destroy() are called concurrently.
> 
>>  		    page_pool_recycle_in_cache(page, pool))
>>  			return NULL;
>>  
>>
Ilias Apalodimas June 15, 2023, 10:49 a.m. UTC | #4
Hi Jakub,

On Thu, 15 Jun 2023 at 07:20, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 15 Jun 2023 09:36:45 +0800 Liang Chen wrote:
> > When destroying a page pool, the alloc cache and recycle ring are emptied.
> > If there are inflight pages, the retry process will periodically check the
> > recycle ring for recently returned pages, but not the alloc cache (alloc
> > cache is only emptied once). As a result, any pages returned to the alloc
> > cache after the page pool destruction will be stuck there and cause the
> > retry process to continuously look for inflight pages and report warnings.
> >
> > To safeguard against this situation, pages should be prevented from returning
> > to the alloc cache after pool destruction.
>
> Let's hear from the page pool maintainers but I think the driver
> is supposed to prevent allocations while pool is getting destroyed.
> Perhaps we can add DEBUG_NET_WARN_ON_ONCE() for this condition to
> prevent wasting cycles in production builds?

Yes, the driver is supposed to do that, but OTOH I generally prefer
APIs that don't allow people to shoot themselves in the foot.  IIRC
this check runs in the fast path only in XDP mode, right?  If this
doesn't affect performance, I don't have any objections.  Jesper was
trying to refactor the destruction path, perhaps there's something in
it that affects his code?

Thanks
/Ilias
Jesper Dangaard Brouer June 15, 2023, 2 p.m. UTC | #5
On 15/06/2023 06.20, Jakub Kicinski wrote:
> On Thu, 15 Jun 2023 09:36:45 +0800 Liang Chen wrote:
>> When destroying a page pool, the alloc cache and recycle ring are emptied.
>> If there are inflight pages, the retry process will periodically check the
>> recycle ring for recently returned pages, but not the alloc cache (alloc
>> cache is only emptied once). As a result, any pages returned to the alloc
>> cache after the page pool destruction will be stuck there and cause the
>> retry process to continuously look for inflight pages and report warnings.
>>
>> To safeguard against this situation, pages should be prevented from returning
>> to the alloc cache after pool destruction.
> 
> Let's hear from the page pool maintainers but I think the driver
> is supposed to prevent allocations while pool is getting destroyed.

Yes, this is a driver API violation. Direct returns (allow_direct) can
only happen from the driver's RX path, e.g. while the driver is actively
processing packets (in NAPI).  When a driver is shutting down a page_pool,
it MUST have stopped the RX path and NAPI (napi_disable()) before calling
page_pool_destroy().  Thus, this situation cannot happen, and if it does
it is a driver bug.
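
A hedged sketch of the teardown ordering this contract implies (hypothetical
driver code; the mydrv_* names are invented for illustration):

	static void mydrv_close(struct mydrv_priv *priv)
	{
		/* Stop the RX path first: after this, no softirq context can
		 * call page_pool_put_*_page() with allow_direct set for this
		 * pool.
		 */
		napi_disable(&priv->napi);

		/* Return pages still sitting in the RX ring. */
		mydrv_free_rx_ring(priv);

		/* Only now tear down the pool; pages genuinely inflight
		 * elsewhere are handled by the deferred release work.
		 */
		page_pool_destroy(priv->page_pool);
	}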

> Perhaps we can add DEBUG_NET_WARN_ON_ONCE() for this condition to
> prevent wasting cycles in production builds?
> 

For this page_pool code path ("allow_direct") it is extremely important
we avoid wasting cycles in production, as this is used for XDP_DROP
use-cases on 100Gbit/s NICs.

At 100Gbit/s with 64-byte Ethernet frames (84 bytes on the wire), the
wirespeed is 148.8 Mpps, which gives the CPU 6.72 nanosec to process each
packet.  The microbench[1] (below signature) shows that page_pool_alloc_pages()
+ page_pool_recycle_direct() cost 4.041 ns (or 14 cycles(tsc)).
Thus, for this code fast-path every cycle counts.

In practice, PCIe transactions/sec seem to limit the total system to 108 Mpps
(with multiple RX-queues + descriptor compression), thus 9.26 nanosec to
process each packet. Individual hardware RX queues seem to be limited to
around 36 Mpps, thus 27.77 nanosec to process each packet.
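
For reference, the arithmetic behind those figures: 100 Gbit/s divided by
672 bits per 84-byte frame gives 100e9/672 ~= 148.8 Mpps, i.e. 1/148.8e6 s
~= 6.72 ns per packet; likewise 1/108e6 ~= 9.26 ns and 1/36e6 ~= 27.8 ns.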

Adding a DEBUG_NET_WARN_ON_ONCE will be annoying as I like to run my
testlab kernels with CONFIG_DEBUG_NET, which will change this extreme
fast-path slightly (adding some unlikely's affecting code layout to the
mix).

Question to Liang Chen: Did you hit this bug in practice?

--Jesper

CPU E5-1650 v4 @ 3.60GHz
  tasklet_page_pool01_fast_path Per elem:  14 cycles(tsc)  4.041 ns
  tasklet_page_pool02_ptr_ring  Per elem:  49 cycles(tsc) 13.622 ns
  tasklet_page_pool03_slow      Per elem: 162 cycles(tsc) 45.198 ns

[1] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
Jakub Kicinski June 16, 2023, 2:45 a.m. UTC | #6
On Thu, 15 Jun 2023 16:00:13 +0200 Jesper Dangaard Brouer wrote:
> Adding a DEBUG_NET_WARN_ON_ONCE will be annoying as I like to run my
> testlab kernels with CONFIG_DEBUG_NET, which will change this extreme
> fast-path slightly (adding some unlikely's affecting code layout to the
> mix).

That's counter-intuitive - the whole point of DEBUG_NET is to have 
a place where we can add checks which we don't want in production
builds. If we can't use it on the datapath we should just remove it
completely and have its checks always enabled..

I do feel your pain of having to test code both with debug enabled
and disabled, but we can't have a cookie and eat it too :(
Liang Chen June 16, 2023, 3:07 a.m. UTC | #7
On Thu, Jun 15, 2023 at 10:00 PM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 15/06/2023 06.20, Jakub Kicinski wrote:
> > On Thu, 15 Jun 2023 09:36:45 +0800 Liang Chen wrote:
> >> When destroying a page pool, the alloc cache and recycle ring are emptied.
> >> If there are inflight pages, the retry process will periodically check the
> >> recycle ring for recently returned pages, but not the alloc cache (alloc
> >> cache is only emptied once). As a result, any pages returned to the alloc
> >> cache after the page pool destruction will be stuck there and cause the
> >> retry process to continuously look for inflight pages and report warnings.
> >>
> >> To safeguard against this situation, pages should be prevented from returning
> >> to the alloc cache after pool destruction.
> >
> > Let's hear from the page pool maintainers but I think the driver
> > is supposed to prevent allocations while pool is getting destroyed.
>
> Yes, this is a driver API violation. Direct returns (allow_direct) can
> only happen from the driver's RX path, e.g. while the driver is actively
> processing packets (in NAPI).  When a driver is shutting down a page_pool,
> it MUST have stopped the RX path and NAPI (napi_disable()) before calling
> page_pool_destroy().  Thus, this situation cannot happen, and if it does
> it is a driver bug.
>
> > Perhaps we can add DEBUG_NET_WARN_ON_ONCE() for this condition to
> > prevent wasting cycles in production builds?
> >
>
> For this page_pool code path ("allow_direct") it is extremely important
> we avoid wasting cycles in production, as this is used for XDP_DROP
> use-cases on 100Gbit/s NICs.
>
> At 100Gbit/s with 64-byte Ethernet frames (84 bytes on the wire), the
> wirespeed is 148.8 Mpps, which gives the CPU 6.72 nanosec to process each
> packet.  The microbench[1] (below signature) shows that page_pool_alloc_pages()
> + page_pool_recycle_direct() cost 4.041 ns (or 14 cycles(tsc)).
> Thus, for this code fast-path every cycle counts.
>
> In practice, PCIe transactions/sec seem to limit the total system to 108 Mpps
> (with multiple RX-queues + descriptor compression), thus 9.26 nanosec to
> process each packet. Individual hardware RX queues seem to be limited to
> around 36 Mpps, thus 27.77 nanosec to process each packet.
>
> Adding a DEBUG_NET_WARN_ON_ONCE will be annoying as I like to run my
> testlab kernels with CONFIG_DEBUG_NET, which will change this extreme
> fast-path slightly (adding some unlikely's affecting code layout to the
> mix).
>
> Question to Liang Chen: Did you hit this bug in practice?
>
> --Jesper
>

Yeah, we hit this problem while implementing page pool support for the
virtio_net driver, where we only enable the page pool for the xdp path,
i.e. turning the page pool on/off when xdp is enabled/disabled. The
problem turns up when the xdp program is uninstalled and there are still
inflight page pool buffers. Then napi is enabled again and the driver
starts to process those inflight page pool buffers. So we need to be
aware of the state of the page pool (whether it is being destroyed)
while returning the pages. That's what motivated us to add this check
to __page_pool_put_page().
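
For what it's worth, a hedged sketch of how a driver could honour the API
contract in that scenario (hypothetical helpers, not actual virtio_net code):
drain the stale page-pool buffers without the direct path before tearing the
pool down.

	static void mydrv_xdp_teardown(struct mydrv_rq *rq)
	{
		void *buf;

		napi_disable(&rq->napi);

		/* Return leftover page-pool buffers with allow_direct = false
		 * so nothing lands in the alloc cache of a pool about to be
		 * destroyed.
		 */
		while ((buf = mydrv_detach_unused_buf(rq)) != NULL)
			page_pool_put_full_page(rq->page_pool,
						virt_to_head_page(buf), false);

		page_pool_destroy(rq->page_pool);

		/* Re-enable NAPI only after the old pool is gone and the RX
		 * ring has been refilled from the regular page allocator.
		 */
		napi_enable(&rq->napi);
	}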

Thanks,
Liang



> CPU E5-1650 v4 @ 3.60GHz
>   tasklet_page_pool01_fast_path Per elem:  14 cycles(tsc)  4.041 ns
>   tasklet_page_pool02_ptr_ring  Per elem:  49 cycles(tsc) 13.622 ns
>   tasklet_page_pool03_slow      Per elem: 162 cycles(tsc) 45.198 ns
>
> [1]
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>
diff mbox series

Patch

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a3e12a61d456..76255313d349 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -595,7 +595,7 @@  __page_pool_put_page(struct page_pool *pool, struct page *page,
 			page_pool_dma_sync_for_device(pool, page,
 						      dma_sync_size);
 
-		if (allow_direct && in_softirq() &&
+		if (allow_direct && in_softirq() && !pool->destroy_cnt &&
 		    page_pool_recycle_in_cache(page, pool))
 			return NULL;