diff mbox series

[net-next,v3,12/13] net: page_pool: mute the periodic warning for visible page pools

Message ID 20231122034420.1158898-13-kuba@kernel.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series net: page_pool: add netlink-based introspection | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/codegen success Generated files up to date
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1129 this patch: 1129
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 1155 this patch: 1155
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1156 this patch: 1156
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 25 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jakub Kicinski Nov. 22, 2023, 3:44 a.m. UTC
Mute the periodic "stalled pool shutdown" warning if the page pool
is visible to user space. Rolling out a driver using page pools
to just a few hundred hosts at Meta surfaces applications which
fail to reap their broken sockets. Obviously it's best if the
applications are fixed, but we don't generally print warnings
for application resource leaks. Admins can now depend on the
netlink interface for getting page pool info to detect buggy
apps.

While at it throw in the ID of the pool into the message,
in rare cases (pools from destroyed netns) this will make
finding the pool with a debugger easier.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/core/page_pool.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Comments

Eric Dumazet Nov. 22, 2023, 1:53 p.m. UTC | #1
On Wed, Nov 22, 2023 at 4:44 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Mute the periodic "stalled pool shutdown" warning if the page pool
> is visible to user space. Rolling out a driver using page pools
> to just a few hundred hosts at Meta surfaces applications which
> fail to reap their broken sockets. Obviously it's best if the
> applications are fixed, but we don't generally print warnings
> for application resource leaks. Admins can now depend on the
> netlink interface for getting page pool info to detect buggy
> apps.
>
> While at it throw in the ID of the pool into the message,
> in rare cases (pools from destroyed netns) this will make
> finding the pool with a debugger easier.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  net/core/page_pool.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 3d0938a60646..c2e7c9a6efbe 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -897,18 +897,21 @@ static void page_pool_release_retry(struct work_struct *wq)
>  {
>         struct delayed_work *dwq = to_delayed_work(wq);
>         struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw);
> +       void *netdev;
>         int inflight;
>
>         inflight = page_pool_release(pool);
>         if (!inflight)
>                 return;
>
> -       /* Periodic warning */
> -       if (time_after_eq(jiffies, pool->defer_warn)) {
> +       /* Periodic warning for page pools the user can't see */
> +       netdev = READ_ONCE(pool->slow.netdev);
> +       if (time_after_eq(jiffies, pool->defer_warn) &&
> +           (!netdev || netdev == NET_PTR_POISON)) {
>                 int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;

Orthogonal to your patch, but this probably could avoid all these casts.

long sec = (jiffies - pool->defer_start) / HZ;


>
> -               pr_warn("%s() stalled pool shutdown %d inflight %d sec\n",
> -                       __func__, inflight, sec);
> +               pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
> +                       __func__, pool->user.id, inflight, sec);
>                 pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
>         }
>

Reviewed-by: Eric Dumazet <edumazet@google.com>
Jesper Dangaard Brouer Nov. 22, 2023, 4:06 p.m. UTC | #2
On 11/22/23 04:44, Jakub Kicinski wrote:
> Mute the periodic "stalled pool shutdown" warning if the page pool
> is visible to user space. Rolling out a driver using page pools
> to just a few hundred hosts at Meta surfaces applications which
> fail to reap their broken sockets. Obviously it's best if the
> applications are fixed, but we don't generally print warnings
> for application resource leaks. Admins can now depend on the
> netlink interface for getting page pool info to detect buggy
> apps.
> 

Looking at the production logs I can unfortunately see many occurrences
of these "page_pool_release_retry() stalled pool shutdown" warnings.

Digging into this it is likely a driver (bnxt_en) issue on these older
kernels. I'll check if there is a upstream fix.

My experience from the past is that this warning have helped catch
driver related bugs.
With SKB + sockets then it is obviously a lot easier to trigger this
warning, which is why it makes sense to "mute" this.


> While at it throw in the ID of the pool into the message,
> in rare cases (pools from destroyed netns) this will make
> finding the pool with a debugger easier.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>   net/core/page_pool.c | 11 +++++++----
>   1 file changed, 7 insertions(+), 4 deletions(-)

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>

> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 3d0938a60646..c2e7c9a6efbe 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -897,18 +897,21 @@ static void page_pool_release_retry(struct work_struct *wq)
>   {
>   	struct delayed_work *dwq = to_delayed_work(wq);
>   	struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw);
> +	void *netdev;
>   	int inflight;
>   
>   	inflight = page_pool_release(pool);
>   	if (!inflight)
>   		return;
>   
> -	/* Periodic warning */
> -	if (time_after_eq(jiffies, pool->defer_warn)) {
> +	/* Periodic warning for page pools the user can't see */
> +	netdev = READ_ONCE(pool->slow.netdev);
> +	if (time_after_eq(jiffies, pool->defer_warn) &&
> +	    (!netdev || netdev == NET_PTR_POISON)) {
>   		int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;
>   
> -		pr_warn("%s() stalled pool shutdown %d inflight %d sec\n",
> -			__func__, inflight, sec);
> +		pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
> +			__func__, pool->user.id, inflight, sec);
>   		pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
>   	}
>
diff mbox series

Patch

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 3d0938a60646..c2e7c9a6efbe 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -897,18 +897,21 @@  static void page_pool_release_retry(struct work_struct *wq)
 {
 	struct delayed_work *dwq = to_delayed_work(wq);
 	struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw);
+	void *netdev;
 	int inflight;
 
 	inflight = page_pool_release(pool);
 	if (!inflight)
 		return;
 
-	/* Periodic warning */
-	if (time_after_eq(jiffies, pool->defer_warn)) {
+	/* Periodic warning for page pools the user can't see */
+	netdev = READ_ONCE(pool->slow.netdev);
+	if (time_after_eq(jiffies, pool->defer_warn) &&
+	    (!netdev || netdev == NET_PTR_POISON)) {
 		int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;
 
-		pr_warn("%s() stalled pool shutdown %d inflight %d sec\n",
-			__func__, inflight, sec);
+		pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
+			__func__, pool->user.id, inflight, sec);
 		pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
 	}