Message ID | 20231122034420.1158898-13-kuba@kernel.org (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net: page_pool: add netlink-based introspection | expand |
On Wed, Nov 22, 2023 at 4:44 AM Jakub Kicinski <kuba@kernel.org> wrote: > > Mute the periodic "stalled pool shutdown" warning if the page pool > is visible to user space. Rolling out a driver using page pools > to just a few hundred hosts at Meta surfaces applications which > fail to reap their broken sockets. Obviously it's best if the > applications are fixed, but we don't generally print warnings > for application resource leaks. Admins can now depend on the > netlink interface for getting page pool info to detect buggy > apps. > > While at it throw in the ID of the pool into the message, > in rare cases (pools from destroyed netns) this will make > finding the pool with a debugger easier. > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > --- > net/core/page_pool.c | 11 +++++++---- > 1 file changed, 7 insertions(+), 4 deletions(-) > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > index 3d0938a60646..c2e7c9a6efbe 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -897,18 +897,21 @@ static void page_pool_release_retry(struct work_struct *wq) > { > struct delayed_work *dwq = to_delayed_work(wq); > struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw); > + void *netdev; > int inflight; > > inflight = page_pool_release(pool); > if (!inflight) > return; > > - /* Periodic warning */ > - if (time_after_eq(jiffies, pool->defer_warn)) { > + /* Periodic warning for page pools the user can't see */ > + netdev = READ_ONCE(pool->slow.netdev); > + if (time_after_eq(jiffies, pool->defer_warn) && > + (!netdev || netdev == NET_PTR_POISON)) { > int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ; Orthogonal to your patch, but this probably could avoid all these casts. long sec = (jiffies - pool->defer_start) / HZ; > > - pr_warn("%s() stalled pool shutdown %d inflight %d sec\n", > - __func__, inflight, sec); > + pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n", > + __func__, pool->user.id, inflight, sec); > pool->defer_warn = jiffies + DEFER_WARN_INTERVAL; > } > Reviewed-by: Eric Dumazet <edumazet@google.com>
On 11/22/23 04:44, Jakub Kicinski wrote: > Mute the periodic "stalled pool shutdown" warning if the page pool > is visible to user space. Rolling out a driver using page pools > to just a few hundred hosts at Meta surfaces applications which > fail to reap their broken sockets. Obviously it's best if the > applications are fixed, but we don't generally print warnings > for application resource leaks. Admins can now depend on the > netlink interface for getting page pool info to detect buggy > apps. > Looking at the production logs I can unfortunately see many occurrences of these "page_pool_release_retry() stalled pool shutdown" warnings. Digging into this it is likely a driver (bnxt_en) issue on these older kernels. I'll check if there is a upstream fix. My experience from the past is that this warning have helped catch driver related bugs. With SKB + sockets then it is obviously a lot easier to trigger this warning, which is why it makes sense to "mute" this. > While at it throw in the ID of the pool into the message, > in rare cases (pools from destroyed netns) this will make > finding the pool with a debugger easier. > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > --- > net/core/page_pool.c | 11 +++++++---- > 1 file changed, 7 insertions(+), 4 deletions(-) Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > index 3d0938a60646..c2e7c9a6efbe 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -897,18 +897,21 @@ static void page_pool_release_retry(struct work_struct *wq) > { > struct delayed_work *dwq = to_delayed_work(wq); > struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw); > + void *netdev; > int inflight; > > inflight = page_pool_release(pool); > if (!inflight) > return; > > - /* Periodic warning */ > - if (time_after_eq(jiffies, pool->defer_warn)) { > + /* Periodic warning for page pools the user can't see */ > + netdev = READ_ONCE(pool->slow.netdev); > + if (time_after_eq(jiffies, pool->defer_warn) && > + (!netdev || netdev == NET_PTR_POISON)) { > int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ; > > - pr_warn("%s() stalled pool shutdown %d inflight %d sec\n", > - __func__, inflight, sec); > + pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n", > + __func__, pool->user.id, inflight, sec); > pool->defer_warn = jiffies + DEFER_WARN_INTERVAL; > } >
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 3d0938a60646..c2e7c9a6efbe 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -897,18 +897,21 @@ static void page_pool_release_retry(struct work_struct *wq) { struct delayed_work *dwq = to_delayed_work(wq); struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw); + void *netdev; int inflight; inflight = page_pool_release(pool); if (!inflight) return; - /* Periodic warning */ - if (time_after_eq(jiffies, pool->defer_warn)) { + /* Periodic warning for page pools the user can't see */ + netdev = READ_ONCE(pool->slow.netdev); + if (time_after_eq(jiffies, pool->defer_warn) && + (!netdev || netdev == NET_PTR_POISON)) { int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ; - pr_warn("%s() stalled pool shutdown %d inflight %d sec\n", - __func__, inflight, sec); + pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n", + __func__, pool->user.id, inflight, sec); pool->defer_warn = jiffies + DEFER_WARN_INTERVAL; }
Mute the periodic "stalled pool shutdown" warning if the page pool is visible to user space. Rolling out a driver using page pools to just a few hundred hosts at Meta surfaces applications which fail to reap their broken sockets. Obviously it's best if the applications are fixed, but we don't generally print warnings for application resource leaks. Admins can now depend on the netlink interface for getting page pool info to detect buggy apps. While at it throw in the ID of the pool into the message, in rare cases (pools from destroyed netns) this will make finding the pool with a debugger easier. Signed-off-by: Jakub Kicinski <kuba@kernel.org> --- net/core/page_pool.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)