
[net] net/mlx5e: do as little as possible in napi poll when budget is 0

Message ID 20230512025740.1068965-1-kuba@kernel.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series [net] net/mlx5e: do as little as possible in napi poll when budget is 0

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8 this patch: 8
netdev/cc_maintainers warning 1 maintainers not CCed: linux-rdma@vger.kernel.org
netdev/build_clang success Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 32 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jakub Kicinski May 12, 2023, 2:57 a.m. UTC
NAPI gets called with budget of 0 from netpoll, which has interrupts
disabled. We should try to free some space on Tx rings and nothing
else.

Specifically do not try to handle XDP TX or try to refill Rx buffers -
we can't use the page pool from IRQ context. Don't check if IRQs moved,
either, that makes no sense in netpoll. Netpoll calls _all_ the rings
from whatever CPU it happens to be invoked on.

In general do as little as possible, the work quickly adds up when
there's tens of rings to poll.
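For context, a conceptual sketch of the netpoll side described above (hypothetical helper name, simplified, not the actual net/core/netpoll.c code): every NAPI instance registered on the device gets polled with a budget of 0, on whatever CPU netpoll happens to run on.

    #include <linux/netdevice.h>

    /* Conceptual sketch only, not net/core/netpoll.c: netpoll walks every
     * NAPI instance of the device and polls it with budget == 0.
     */
    static void hypothetical_netpoll_poll_all(struct net_device *dev)
    {
    	struct napi_struct *napi;

    	/* dev->napi_list links every NAPI instance registered on the device */
    	list_for_each_entry(napi, &dev->napi_list, dev_list)
    		napi->poll(napi, 0);	/* budget == 0: Tx housekeeping only */
    }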

The immediate stack trace I was seeing is:

    __do_softirq+0xd1/0x2c0
    __local_bh_enable_ip+0xc7/0x120
    </IRQ>
    <TASK>
    page_pool_put_defragged_page+0x267/0x320
    mlx5e_free_xdpsq_desc+0x99/0xd0
    mlx5e_poll_xdpsq_cq+0x138/0x3b0
    mlx5e_napi_poll+0xc3/0x8b0
    netpoll_poll_dev+0xce/0x150

AFAIU page pool takes a BH lock, releases it and since BH is now
enabled tries to run softirqs.
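To make the hazard concrete, here is a simplified sketch under the assumption that the path above ends in a spin_lock_bh()-style critical section (this is not the actual page_pool code):

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(hypothetical_pool_lock);	/* stand-in for the BH lock */

    /* Simplified sketch of the failure mode, not the page_pool internals. */
    static void hypothetical_page_release(void)
    {
    	spin_lock_bh(&hypothetical_pool_lock);		/* disables BH */
    	/* ... return the page to the pool ... */
    	spin_unlock_bh(&hypothetical_pool_lock);	/* re-enables BH; with
    							 * softirqs pending this
    							 * runs __do_softirq(),
    							 * matching the top of
    							 * the trace above
    							 */
    }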

Fixes: 60bbf7eeef10 ("mlx5: use page_pool for xdp_return_frame call")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
I'm pointing Fixes at where page_pool was added, although that's
probably not 100% fair.

CC: saeedm@nvidia.com
CC: leon@kernel.org
CC: brouer@redhat.com
CC: tariqt@mellanox.com
---
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

Comments

Tariq Toukan May 15, 2023, 6:10 a.m. UTC | #1
On 12/05/2023 5:57, Jakub Kicinski wrote:
> NAPI gets called with budget of 0 from netpoll, which has interrupts
> disabled. We should try to free some space on Tx rings and nothing
> else.
> 
> Specifically do not try to handle XDP TX or try to refill Rx buffers -
> we can't use the page pool from IRQ context. Don't check if IRQs moved,
> either, that makes no sense in netpoll. Netpoll calls _all_ the rings
> from whatever CPU it happens to be invoked on.
> 
> In general do as little as possible, the work quickly adds up when
> there's tens of rings to poll.
> 
> The immediate stack trace I was seeing is:
> 
>      __do_softirq+0xd1/0x2c0
>      __local_bh_enable_ip+0xc7/0x120
>      </IRQ>
>      <TASK>
>      page_pool_put_defragged_page+0x267/0x320
>      mlx5e_free_xdpsq_desc+0x99/0xd0
>      mlx5e_poll_xdpsq_cq+0x138/0x3b0
>      mlx5e_napi_poll+0xc3/0x8b0
>      netpoll_poll_dev+0xce/0x150
> 
> AFAIU page pool takes a BH lock, releases it and since BH is now
> enabled tries to run softirqs.
> 
> Fixes: 60bbf7eeef10 ("mlx5: use page_pool for xdp_return_frame call")
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> I'm pointing Fixes at where page_pool was added, although that's
> probably not 100% fair.
> 
> CC: saeedm@nvidia.com
> CC: leon@kernel.org
> CC: brouer@redhat.com
> CC: tariqt@mellanox.com
> ---

Thanks for your patch.
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Eric Dumazet May 15, 2023, 4:39 p.m. UTC | #2
On Fri, May 12, 2023 at 4:57 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> NAPI gets called with budget of 0 from netpoll, which has interrupts
> disabled. We should try to free some space on Tx rings and nothing
> else.
>
> Specifically do not try to handle XDP TX or try to refill Rx buffers -
> we can't use the page pool from IRQ context. Don't check if IRQs moved,
> either, that makes no sense in netpoll. Netpoll calls _all_ the rings
> from whatever CPU it happens to be invoked on.
>
> In general do as little as possible, the work quickly adds up when
> there's tens of rings to poll.
>
> The immediate stack trace I was seeing is:
>
>     __do_softirq+0xd1/0x2c0
>     __local_bh_enable_ip+0xc7/0x120
>     </IRQ>
>     <TASK>
>     page_pool_put_defragged_page+0x267/0x320
>     mlx5e_free_xdpsq_desc+0x99/0xd0
>     mlx5e_poll_xdpsq_cq+0x138/0x3b0
>     mlx5e_napi_poll+0xc3/0x8b0
>     netpoll_poll_dev+0xce/0x150
>
> AFAIU page pool takes a BH lock, releases it and since BH is now
> enabled tries to run softirqs.
>
> Fixes: 60bbf7eeef10 ("mlx5: use page_pool for xdp_return_frame call")
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> I'm pointing Fixes at where page_pool was added, although that's
> probably not 100% fair.
>
> CC: saeedm@nvidia.com
> CC: leon@kernel.org
> CC: brouer@redhat.com
> CC: tariqt@mellanox.com
> ---
>  .../net/ethernet/mellanox/mlx5/core/en_txrx.c | 19 ++++++++++++-------
>  1 file changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> index a50bfda18e96..bd4294dd72da 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> @@ -161,20 +161,25 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
>                 }
>         }
>
> +       /* budget=0 means we may be in IRQ context, do as little as possible */
> +       if (unlikely(!budget)) {
> +               /* no work done, can't be asked to re-enable IRQs */
> +               WARN_ON_ONCE(napi_complete_done(napi, work_done));

This is not clear why you call napi_complete_done() here ?

Note the fine doc  ( https://www.kernel.org/doc/html/next/networking/napi.html )
says:

<quote>If the budget is 0 napi_complete_done() should never be called.</quote>



> +               goto out;
> +       }
> +
>         busy |= mlx5e_poll_xdpsq_cq(&c->xdpsq.cq);
>
>         if (c->xdp)
>                 busy |= mlx5e_poll_xdpsq_cq(&c->rq_xdpsq.cq);
>
> -       if (likely(budget)) { /* budget=0 means: don't poll rx rings */
> -               if (xsk_open)
> -                       work_done = mlx5e_poll_rx_cq(&xskrq->cq, budget);
> +       if (xsk_open)
> +               work_done = mlx5e_poll_rx_cq(&xskrq->cq, budget);
>
> -               if (likely(budget - work_done))
> -                       work_done += mlx5e_poll_rx_cq(&rq->cq, budget - work_done);
> +       if (likely(budget - work_done))
> +               work_done += mlx5e_poll_rx_cq(&rq->cq, budget - work_done);
>
> -               busy |= work_done == budget;
> -       }
> +       busy |= work_done == budget;
>
>         mlx5e_poll_ico_cq(&c->icosq.cq);
>         if (mlx5e_poll_ico_cq(&c->async_icosq.cq))
> --
> 2.40.1
>
Jakub Kicinski May 17, 2023, 2:02 a.m. UTC | #3
On Mon, 15 May 2023 18:39:20 +0200 Eric Dumazet wrote:
> > +       /* budget=0 means we may be in IRQ context, do as little as possible */
> > +       if (unlikely(!budget)) {
> > +               /* no work done, can't be asked to re-enable IRQs */
> > +               WARN_ON_ONCE(napi_complete_done(napi, work_done));  
> 
> This is not clear why you call napi_complete_done() here ?
> 
> Note the fine doc  ( https://www.kernel.org/doc/html/next/networking/napi.html )
> says:
> 
> <quote>If the budget is 0 napi_complete_done() should never be called.</quote>

fair point, I was trying to be conservative because I think it would
have gotten called before
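
For reference, a minimal sketch of the convention the NAPI documentation describes (hypothetical foo_* names, not the mlx5e handler): with budget == 0, clean Tx only, report zero work, and never call napi_complete_done().

    #include <linux/netdevice.h>

    struct foo_channel {			/* hypothetical per-channel state */
    	struct napi_struct napi;
    	/* Tx/Rx ring state ... */
    };

    void foo_clean_tx(struct foo_channel *ch);		/* hypothetical helpers */
    int foo_clean_rx(struct foo_channel *ch, int budget);

    static int foo_napi_poll(struct napi_struct *napi, int budget)
    {
    	struct foo_channel *ch = container_of(napi, struct foo_channel, napi);
    	int work_done = 0;

    	foo_clean_tx(ch);		/* Tx completions: safe even from netpoll */

    	/* budget == 0 (netpoll): no Rx work, and napi_complete_done()
    	 * must not be called, per Documentation/networking/napi.rst
    	 */
    	if (unlikely(!budget))
    		return 0;

    	work_done = foo_clean_rx(ch, budget);
    	if (work_done < budget)
    		napi_complete_done(napi, work_done);

    	return work_done;
    }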

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index a50bfda18e96..bd4294dd72da 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -161,20 +161,25 @@  int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 		}
 	}
 
+	/* budget=0 means we may be in IRQ context, do as little as possible */
+	if (unlikely(!budget)) {
+		/* no work done, can't be asked to re-enable IRQs */
+		WARN_ON_ONCE(napi_complete_done(napi, work_done));
+		goto out;
+	}
+
 	busy |= mlx5e_poll_xdpsq_cq(&c->xdpsq.cq);
 
 	if (c->xdp)
 		busy |= mlx5e_poll_xdpsq_cq(&c->rq_xdpsq.cq);
 
-	if (likely(budget)) { /* budget=0 means: don't poll rx rings */
-		if (xsk_open)
-			work_done = mlx5e_poll_rx_cq(&xskrq->cq, budget);
+	if (xsk_open)
+		work_done = mlx5e_poll_rx_cq(&xskrq->cq, budget);
 
-		if (likely(budget - work_done))
-			work_done += mlx5e_poll_rx_cq(&rq->cq, budget - work_done);
+	if (likely(budget - work_done))
+		work_done += mlx5e_poll_rx_cq(&rq->cq, budget - work_done);
 
-		busy |= work_done == budget;
-	}
+	busy |= work_done == budget;
 
 	mlx5e_poll_ico_cq(&c->icosq.cq);
 	if (mlx5e_poll_ico_cq(&c->async_icosq.cq))