Message ID | 20250214064250.85987-1-kerneljasonxing@gmail.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 43130d02baa137033c25297aaae95fd0edc41654 |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net-next,v3] page_pool: avoid infinite loop to schedule delayed worker | expand |
On Thu, Feb 13, 2025 at 10:43 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > We noticed the kworker in page_pool_release_retry() was waken > up repeatedly and infinitely in production because of the > buggy driver causing the inflight less than 0 and warning > us in page_pool_inflight()[1]. > > Since the inflight value goes negative, it means we should > not expect the whole page_pool to get back to work normally. > > This patch mitigates the adverse effect by not rescheduling > the kworker when detecting the inflight negative in > page_pool_release_retry(). > > [1] > [Mon Feb 10 20:36:11 2025] ------------[ cut here ]------------ > [Mon Feb 10 20:36:11 2025] Negative(-51446) inflight packet-pages > ... > [Mon Feb 10 20:36:11 2025] Call Trace: > [Mon Feb 10 20:36:11 2025] page_pool_release_retry+0x23/0x70 > [Mon Feb 10 20:36:11 2025] process_one_work+0x1b1/0x370 > [Mon Feb 10 20:36:11 2025] worker_thread+0x37/0x3a0 > [Mon Feb 10 20:36:11 2025] kthread+0x11a/0x140 > [Mon Feb 10 20:36:11 2025] ? process_one_work+0x370/0x370 > [Mon Feb 10 20:36:11 2025] ? __kthread_cancel_work+0x40/0x40 > [Mon Feb 10 20:36:11 2025] ret_from_fork+0x35/0x40 > [Mon Feb 10 20:36:11 2025] ---[ end trace ebffe800f33e7e34 ]--- > Note: before this patch, the above calltrace would flood the > dmesg due to repeated reschedule of release_dw kworker. > > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Thanks Jason, Reviewed-by: Mina Almasry <almasrymina@google.com> When you find the root cause of the driver bug, if you can think of ways to catch it sooner in the page_pool or prevent drivers from triggering it, please do consider sending improvements upstream. Thanks!
On Sat, Feb 15, 2025 at 4:27 AM Mina Almasry <almasrymina@google.com> wrote: > > On Thu, Feb 13, 2025 at 10:43 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > We noticed the kworker in page_pool_release_retry() was waken > > up repeatedly and infinitely in production because of the > > buggy driver causing the inflight less than 0 and warning > > us in page_pool_inflight()[1]. > > > > Since the inflight value goes negative, it means we should > > not expect the whole page_pool to get back to work normally. > > > > This patch mitigates the adverse effect by not rescheduling > > the kworker when detecting the inflight negative in > > page_pool_release_retry(). > > > > [1] > > [Mon Feb 10 20:36:11 2025] ------------[ cut here ]------------ > > [Mon Feb 10 20:36:11 2025] Negative(-51446) inflight packet-pages > > ... > > [Mon Feb 10 20:36:11 2025] Call Trace: > > [Mon Feb 10 20:36:11 2025] page_pool_release_retry+0x23/0x70 > > [Mon Feb 10 20:36:11 2025] process_one_work+0x1b1/0x370 > > [Mon Feb 10 20:36:11 2025] worker_thread+0x37/0x3a0 > > [Mon Feb 10 20:36:11 2025] kthread+0x11a/0x140 > > [Mon Feb 10 20:36:11 2025] ? process_one_work+0x370/0x370 > > [Mon Feb 10 20:36:11 2025] ? __kthread_cancel_work+0x40/0x40 > > [Mon Feb 10 20:36:11 2025] ret_from_fork+0x35/0x40 > > [Mon Feb 10 20:36:11 2025] ---[ end trace ebffe800f33e7e34 ]--- > > Note: before this patch, the above calltrace would flood the > > dmesg due to repeated reschedule of release_dw kworker. > > > > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> > > Thanks Jason, > > Reviewed-by: Mina Almasry <almasrymina@google.com> Thanks for the review. > > When you find the root cause of the driver bug, if you can think of > ways to catch it sooner in the page_pool or prevent drivers from > triggering it, please do consider sending improvements upstream. > Thanks! Sure, it's exactly what I want to do :) Thanks, Jason > > -- > Thanks, > Mina
On 15/02/2025 00.14, Jason Xing wrote: > On Sat, Feb 15, 2025 at 4:27 AM Mina Almasry <almasrymina@google.com> wrote: >> >> On Thu, Feb 13, 2025 at 10:43 PM Jason Xing <kerneljasonxing@gmail.com> wrote: >>> >>> We noticed the kworker in page_pool_release_retry() was waken >>> up repeatedly and infinitely in production because of the >>> buggy driver causing the inflight less than 0 and warning >>> us in page_pool_inflight()[1]. >>> >>> Since the inflight value goes negative, it means we should >>> not expect the whole page_pool to get back to work normally. >>> >>> This patch mitigates the adverse effect by not rescheduling >>> the kworker when detecting the inflight negative in >>> page_pool_release_retry(). >>> >>> [1] >>> [Mon Feb 10 20:36:11 2025] ------------[ cut here ]------------ >>> [Mon Feb 10 20:36:11 2025] Negative(-51446) inflight packet-pages >>> ... >>> [Mon Feb 10 20:36:11 2025] Call Trace: >>> [Mon Feb 10 20:36:11 2025] page_pool_release_retry+0x23/0x70 >>> [Mon Feb 10 20:36:11 2025] process_one_work+0x1b1/0x370 >>> [Mon Feb 10 20:36:11 2025] worker_thread+0x37/0x3a0 >>> [Mon Feb 10 20:36:11 2025] kthread+0x11a/0x140 >>> [Mon Feb 10 20:36:11 2025] ? process_one_work+0x370/0x370 >>> [Mon Feb 10 20:36:11 2025] ? __kthread_cancel_work+0x40/0x40 >>> [Mon Feb 10 20:36:11 2025] ret_from_fork+0x35/0x40 >>> [Mon Feb 10 20:36:11 2025] ---[ end trace ebffe800f33e7e34 ]--- >>> Note: before this patch, the above calltrace would flood the >>> dmesg due to repeated reschedule of release_dw kworker. >>> >>> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> >> >> Thanks Jason, >> >> Reviewed-by: Mina Almasry <almasrymina@google.com> > > Thanks for the review. > >> >> When you find the root cause of the driver bug, if you can think of >> ways to catch it sooner in the page_pool or prevent drivers from >> triggering it, please do consider sending improvements upstream. >> Thanks! > > Sure, it's exactly what I want to do :) What driver is this happening in? --Jesper
Hello: This patch was applied to netdev/net-next.git (main) by Paolo Abeni <pabeni@redhat.com>: On Fri, 14 Feb 2025 14:42:50 +0800 you wrote: > We noticed the kworker in page_pool_release_retry() was waken > up repeatedly and infinitely in production because of the > buggy driver causing the inflight less than 0 and warning > us in page_pool_inflight()[1]. > > Since the inflight value goes negative, it means we should > not expect the whole page_pool to get back to work normally. > > [...] Here is the summary with links: - [net-next,v3] page_pool: avoid infinite loop to schedule delayed worker https://git.kernel.org/netdev/net-next/c/43130d02baa1 You are awesome, thank you!
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 1c6fec08bc43..acef1fcd8ddc 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -1112,7 +1112,13 @@ static void page_pool_release_retry(struct work_struct *wq) int inflight; inflight = page_pool_release(pool); - if (!inflight) + /* In rare cases, a driver bug may cause inflight to go negative. + * Don't reschedule release if inflight is 0 or negative. + * - If 0, the page_pool has been destroyed + * - if negative, we will never recover + * in both cases no reschedule is necessary. + */ + if (inflight <= 0) return; /* Periodic warning for page pools the user can't see */
We noticed the kworker in page_pool_release_retry() was waken up repeatedly and infinitely in production because of the buggy driver causing the inflight less than 0 and warning us in page_pool_inflight()[1]. Since the inflight value goes negative, it means we should not expect the whole page_pool to get back to work normally. This patch mitigates the adverse effect by not rescheduling the kworker when detecting the inflight negative in page_pool_release_retry(). [1] [Mon Feb 10 20:36:11 2025] ------------[ cut here ]------------ [Mon Feb 10 20:36:11 2025] Negative(-51446) inflight packet-pages ... [Mon Feb 10 20:36:11 2025] Call Trace: [Mon Feb 10 20:36:11 2025] page_pool_release_retry+0x23/0x70 [Mon Feb 10 20:36:11 2025] process_one_work+0x1b1/0x370 [Mon Feb 10 20:36:11 2025] worker_thread+0x37/0x3a0 [Mon Feb 10 20:36:11 2025] kthread+0x11a/0x140 [Mon Feb 10 20:36:11 2025] ? process_one_work+0x370/0x370 [Mon Feb 10 20:36:11 2025] ? __kthread_cancel_work+0x40/0x40 [Mon Feb 10 20:36:11 2025] ret_from_fork+0x35/0x40 [Mon Feb 10 20:36:11 2025] ---[ end trace ebffe800f33e7e34 ]--- Note: before this patch, the above calltrace would flood the dmesg due to repeated reschedule of release_dw kworker. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> --- v3 Link: https://lore.kernel.org/all/20250213052150.18392-1-kerneljasonxing@gmail.com/ 1. remove printing warning even when inflight is negative, or else it will cause panic. v2 Link: https://lore.kernel.org/all/20250210130953.26831-1-kerneljasonxing@gmail.com/ 1. add more details in commit message. 2. allow printing once when the inflight is negative. 3. correct the position where to stop the reschedule. Suggested by Mina and Jakub. --- net/core/page_pool.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)