[net-next,v3] page_pool: avoid infinite loop to schedule delayed worker

Message ID 20250214064250.85987-1-kerneljasonxing@gmail.com (mailing list archive)
State Accepted
Commit 43130d02baa137033c25297aaae95fd0edc41654
Delegated to: Netdev Maintainers
Series [net-next,v3] page_pool: avoid infinite loop to schedule delayed worker

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 1 this patch: 1
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: The commit message has 'Call Trace:', perhaps it also needs a 'Fixes:' tag?
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 3 this patch: 3
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2025-02-15--03-00 (tests: 891)

Commit Message

Jason Xing Feb. 14, 2025, 6:42 a.m. UTC
We noticed that the kworker in page_pool_release_retry() was woken
up repeatedly and endlessly in production because a buggy driver
drove the inflight count below 0, triggering the warning in
page_pool_inflight() [1].

Since the inflight value has gone negative, we should not expect
the whole page_pool to get back to a normal working state.

This patch mitigates the adverse effect by not rescheduling the
kworker when a negative inflight count is detected in
page_pool_release_retry().
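
For reference, the inflight figure is essentially the difference
between pages the pool has handed out and pages returned to it. Below
is a condensed sketch of that accounting (simplified from
page_pool_inflight() in net/core/page_pool.c; the exact helpers differ
in the upstream source):

	/* Sketch only: hold_cnt counts pages given out, release_cnt
	 * counts pages returned. A correct driver can never return
	 * more pages than it took, so a negative difference always
	 * indicates a refcounting bug such as a double release.
	 */
	hold_cnt    = READ_ONCE(pool->pages_state_hold_cnt);
	release_cnt = atomic_read(&pool->pages_state_release_cnt);
	inflight    = (s32)(hold_cnt - release_cnt);
	WARN(inflight < 0, "Negative(%d) inflight packet-pages", inflight);

During shutdown hold_cnt no longer grows, so a negative difference
cannot recover on its own.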

[1]
[Mon Feb 10 20:36:11 2025] ------------[ cut here ]------------
[Mon Feb 10 20:36:11 2025] Negative(-51446) inflight packet-pages
...
[Mon Feb 10 20:36:11 2025] Call Trace:
[Mon Feb 10 20:36:11 2025]  page_pool_release_retry+0x23/0x70
[Mon Feb 10 20:36:11 2025]  process_one_work+0x1b1/0x370
[Mon Feb 10 20:36:11 2025]  worker_thread+0x37/0x3a0
[Mon Feb 10 20:36:11 2025]  kthread+0x11a/0x140
[Mon Feb 10 20:36:11 2025]  ? process_one_work+0x370/0x370
[Mon Feb 10 20:36:11 2025]  ? __kthread_cancel_work+0x40/0x40
[Mon Feb 10 20:36:11 2025]  ret_from_fork+0x35/0x40
[Mon Feb 10 20:36:11 2025] ---[ end trace ebffe800f33e7e34 ]---
Note: before this patch, the above call trace would flood dmesg
due to the repeated rescheduling of the release_dw kworker.
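
The flood comes from the pre-patch retry path, condensed below as a
sketch (based on the hunk in the patch at the end of this page; the
periodic-warning logic in the middle of the function is elided):

	static void page_pool_release_retry(struct work_struct *wq)
	{
		...
		inflight = page_pool_release(pool);
		if (!inflight)		/* matches exactly 0 only, so a */
			return;		/* negative value falls through */
		...
		/* pages still look in flight: retry later */
		schedule_delayed_work(&pool->release_dw, DEFER_TIME);
	}

Since the early return never triggers for a negative count, the work
rearms itself every DEFER_TIME and the warning in page_pool_inflight()
fires again on each pass.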

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
v3
Link: https://lore.kernel.org/all/20250213052150.18392-1-kerneljasonxing@gmail.com/
1. do not print the warning when inflight is negative, since
triggering it repeatedly could otherwise cause a panic.

v2
Link: https://lore.kernel.org/all/20250210130953.26831-1-kerneljasonxing@gmail.com/
1. add more details to the commit message.
2. allow the warning to be printed once when inflight is negative.
3. correct the position at which to stop rescheduling.
Suggested by Mina and Jakub.
---
 net/core/page_pool.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Mina Almasry Feb. 14, 2025, 8:27 p.m. UTC | #1
On Thu, Feb 13, 2025 at 10:43 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> We noticed that the kworker in page_pool_release_retry() was woken
> up repeatedly and endlessly in production because a buggy driver
> drove the inflight count below 0, triggering the warning in
> page_pool_inflight() [1].
>
> Since the inflight value has gone negative, we should not expect
> the whole page_pool to get back to a normal working state.
>
> This patch mitigates the adverse effect by not rescheduling the
> kworker when a negative inflight count is detected in
> page_pool_release_retry().
>
> [1]
> [Mon Feb 10 20:36:11 2025] ------------[ cut here ]------------
> [Mon Feb 10 20:36:11 2025] Negative(-51446) inflight packet-pages
> ...
> [Mon Feb 10 20:36:11 2025] Call Trace:
> [Mon Feb 10 20:36:11 2025]  page_pool_release_retry+0x23/0x70
> [Mon Feb 10 20:36:11 2025]  process_one_work+0x1b1/0x370
> [Mon Feb 10 20:36:11 2025]  worker_thread+0x37/0x3a0
> [Mon Feb 10 20:36:11 2025]  kthread+0x11a/0x140
> [Mon Feb 10 20:36:11 2025]  ? process_one_work+0x370/0x370
> [Mon Feb 10 20:36:11 2025]  ? __kthread_cancel_work+0x40/0x40
> [Mon Feb 10 20:36:11 2025]  ret_from_fork+0x35/0x40
> [Mon Feb 10 20:36:11 2025] ---[ end trace ebffe800f33e7e34 ]---
> Note: before this patch, the above call trace would flood dmesg
> due to the repeated rescheduling of the release_dw kworker.
>
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks Jason,

Reviewed-by: Mina Almasry <almasrymina@google.com>

When you find the root cause of the driver bug, if you can think of
ways to catch it sooner in the page_pool or prevent drivers from
triggering it, please do consider sending improvements upstream.
Thanks!
Jason Xing Feb. 14, 2025, 11:14 p.m. UTC | #2
On Sat, Feb 15, 2025 at 4:27 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Thu, Feb 13, 2025 at 10:43 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > We noticed that the kworker in page_pool_release_retry() was woken
> > up repeatedly and endlessly in production because a buggy driver
> > drove the inflight count below 0, triggering the warning in
> > page_pool_inflight() [1].
> >
> > Since the inflight value has gone negative, we should not expect
> > the whole page_pool to get back to a normal working state.
> >
> > This patch mitigates the adverse effect by not rescheduling the
> > kworker when a negative inflight count is detected in
> > page_pool_release_retry().
> >
> > [1]
> > [Mon Feb 10 20:36:11 2025] ------------[ cut here ]------------
> > [Mon Feb 10 20:36:11 2025] Negative(-51446) inflight packet-pages
> > ...
> > [Mon Feb 10 20:36:11 2025] Call Trace:
> > [Mon Feb 10 20:36:11 2025]  page_pool_release_retry+0x23/0x70
> > [Mon Feb 10 20:36:11 2025]  process_one_work+0x1b1/0x370
> > [Mon Feb 10 20:36:11 2025]  worker_thread+0x37/0x3a0
> > [Mon Feb 10 20:36:11 2025]  kthread+0x11a/0x140
> > [Mon Feb 10 20:36:11 2025]  ? process_one_work+0x370/0x370
> > [Mon Feb 10 20:36:11 2025]  ? __kthread_cancel_work+0x40/0x40
> > [Mon Feb 10 20:36:11 2025]  ret_from_fork+0x35/0x40
> > [Mon Feb 10 20:36:11 2025] ---[ end trace ebffe800f33e7e34 ]---
> > Note: before this patch, the above call trace would flood dmesg
> > due to the repeated rescheduling of the release_dw kworker.
> >
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
>
> Thanks Jason,
>
> Reviewed-by: Mina Almasry <almasrymina@google.com>

Thanks for the review.

>
> When you find the root cause of the driver bug, if you can think of
> ways to catch it sooner in the page_pool or prevent drivers from
> triggering it, please do consider sending improvements upstream.
> Thanks!

Sure, it's exactly what I want to do :)

Thanks,
Jason

>
> --
> Thanks,
> Mina
Jesper Dangaard Brouer Feb. 17, 2025, 9:38 a.m. UTC | #3
On 15/02/2025 00.14, Jason Xing wrote:
> On Sat, Feb 15, 2025 at 4:27 AM Mina Almasry <almasrymina@google.com> wrote:
>>
>> On Thu, Feb 13, 2025 at 10:43 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>>>
>>> We noticed that the kworker in page_pool_release_retry() was woken
>>> up repeatedly and endlessly in production because a buggy driver
>>> drove the inflight count below 0, triggering the warning in
>>> page_pool_inflight() [1].
>>>
>>> Since the inflight value has gone negative, we should not expect
>>> the whole page_pool to get back to a normal working state.
>>>
>>> This patch mitigates the adverse effect by not rescheduling the
>>> kworker when a negative inflight count is detected in
>>> page_pool_release_retry().
>>>
>>> [1]
>>> [Mon Feb 10 20:36:11 2025] ------------[ cut here ]------------
>>> [Mon Feb 10 20:36:11 2025] Negative(-51446) inflight packet-pages
>>> ...
>>> [Mon Feb 10 20:36:11 2025] Call Trace:
>>> [Mon Feb 10 20:36:11 2025]  page_pool_release_retry+0x23/0x70
>>> [Mon Feb 10 20:36:11 2025]  process_one_work+0x1b1/0x370
>>> [Mon Feb 10 20:36:11 2025]  worker_thread+0x37/0x3a0
>>> [Mon Feb 10 20:36:11 2025]  kthread+0x11a/0x140
>>> [Mon Feb 10 20:36:11 2025]  ? process_one_work+0x370/0x370
>>> [Mon Feb 10 20:36:11 2025]  ? __kthread_cancel_work+0x40/0x40
>>> [Mon Feb 10 20:36:11 2025]  ret_from_fork+0x35/0x40
>>> [Mon Feb 10 20:36:11 2025] ---[ end trace ebffe800f33e7e34 ]---
>>> Note: before this patch, the above call trace would flood dmesg
>>> due to the repeated rescheduling of the release_dw kworker.
>>>
>>> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
>>
>> Thanks Jason,
>>
>> Reviewed-by: Mina Almasry <almasrymina@google.com>
> 
> Thanks for the review.
> 
>>
>> When you find the root cause of the driver bug, if you can think of
>> ways to catch it sooner in the page_pool or prevent drivers from
>> triggering it, please do consider sending improvements upstream.
>> Thanks!
> 
> Sure, it's exactly what I want to do :)

What driver is this happening in?

--Jesper
patchwork-bot+netdevbpf@kernel.org Feb. 18, 2025, noon UTC | #4
Hello:

This patch was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Fri, 14 Feb 2025 14:42:50 +0800 you wrote:
> We noticed that the kworker in page_pool_release_retry() was woken
> up repeatedly and endlessly in production because a buggy driver
> drove the inflight count below 0, triggering the warning in
> page_pool_inflight() [1].
> 
> Since the inflight value has gone negative, we should not expect
> the whole page_pool to get back to a normal working state.
> 
> [...]

Here is the summary with links:
  - [net-next,v3] page_pool: avoid infinite loop to schedule delayed worker
    https://git.kernel.org/netdev/net-next/c/43130d02baa1

You are awesome, thank you!

Patch

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 1c6fec08bc43..acef1fcd8ddc 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1112,7 +1112,13 @@  static void page_pool_release_retry(struct work_struct *wq)
 	int inflight;
 
 	inflight = page_pool_release(pool);
-	if (!inflight)
+	/* In rare cases, a driver bug may cause inflight to go negative.
+	 * Don't reschedule release if inflight is 0 or negative.
+	 * - If 0, the page_pool has been destroyed
+	 * - if negative, we will never recover
+	 * in both cases no reschedule is necessary.
+	 */
+	if (inflight <= 0)
 		return;
 
 	/* Periodic warning for page pools the user can't see */