diff mbox series

[net] net: fix premature exit from NAPI state polling in napi_disable()

Message ID 20211110191126.1214-1-alexandr.lobakin@intel.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net] net: fix premature exit from NAPI state polling in napi_disable() | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 7 this patch: 7
netdev/cc_maintainers warning 10 maintainers not CCed: songliubraving@fb.com hawk@kernel.org daniel@iogearbox.net john.fastabend@gmail.com yhs@fb.com kpsingh@kernel.org ast@kernel.org andrii@kernel.org kafai@fb.com bpf@vger.kernel.org
netdev/build_clang success Errors and warnings before: 22 this patch: 22
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 11 this patch: 11
netdev/checkpatch warning WARNING: 'unitialized' may be misspelled - perhaps 'uninitialized'?
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Alexander Lobakin Nov. 10, 2021, 7:11 p.m. UTC
Commit 719c57197010 ("net: make napi_disable() symmetric with
enable") accidentally introduced a bug sometimes leading to a kernel
BUG when bringing an iface up/down under heavy traffic load.

Prior to this commit, napi_disable() was polling n->state until
none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then
always flip them. Now there's a possibility to get away with the
NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg()
call with an unitialized variable, rather than straight to
another round of the state check.

Error path looks like:

napi_disable():
unsigned long val, new; /* new is uninitialized */

do {
	val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
				      NAPIF_STATE_SCHED is set */
	if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
		usleep_range(20, 200);
		continue; /* go straight to the condition check */
	}
	new = val | <...>
} while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
						  writes garbage */

napi_enable():
do {
	val = READ_ONCE(n->state);
	BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
<...>

while the typical BUG splat is like:

[  172.652461] ------------[ cut here ]------------
[  172.652462] kernel BUG at net/core/dev.c:6937!
[  172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G          I       5.15.0 #42
[  172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
[  172.680646] RIP: 0010:napi_enable+0x5a/0xd0
[  172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48
[  172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246
[  172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0
< snip >
[  172.782403] Call Trace:
[  172.784857]  <TASK>
[  172.786963]  ice_up_complete+0x6f/0x210 [ice]
[  172.791349]  ice_xdp+0x136/0x320 [ice]
[  172.795108]  ? ice_change_mtu+0x180/0x180 [ice]
[  172.799648]  dev_xdp_install+0x61/0xe0
[  172.803401]  dev_xdp_attach+0x1e0/0x550
[  172.807240]  dev_change_xdp_fd+0x1e6/0x220
[  172.811338]  do_setlink+0xee8/0x1010
[  172.814917]  rtnl_setlink+0xe5/0x170
[  172.818499]  ? bpf_lsm_binder_set_context_mgr+0x10/0x10
[  172.823732]  ? security_capable+0x36/0x50
< snip >

Fix this by replacing this 'continue' with a goto to the beginning
of the loop body to restore the original behaviour.
This could be written without a goto, but would look uglier and
require one more indent level.

Fixes: 719c57197010 ("net: make napi_disable() symmetric with enable")
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
---
 net/core/dev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Eric Dumazet Nov. 10, 2021, 7:24 p.m. UTC | #1
On Wed, Nov 10, 2021 at 11:11 AM Alexander Lobakin
<alexandr.lobakin@intel.com> wrote:
>
> Commit 719c57197010 ("net: make napi_disable() symmetric with
> enable") accidentally introduced a bug sometimes leading to a kernel
> BUG when bringing an iface up/down under heavy traffic load.
>
> Prior to this commit, napi_disable() was polling n->state until
> none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then
> always flip them. Now there's a possibility to get away with the
> NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg()
> call with an unitialized variable, rather than straight to
> another round of the state check.
>
> Error path looks like:
>
> napi_disable():
> unsigned long val, new; /* new is uninitialized */
>
> do {
>         val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
>                                       NAPIF_STATE_SCHED is set */
>         if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
>                 usleep_range(20, 200);
>                 continue; /* go straight to the condition check */
>         }
>         new = val | <...>
> } while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
>                                                   writes garbage */
>
> napi_enable():
> do {
>         val = READ_ONCE(n->state);
>         BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
> <...>
>
> while the typical BUG splat is like:
>
> [
> Fix this by replacing this 'continue' with a goto to the beginning
> of the loop body to restore the original behaviour.
> This could be written without a goto, but would look uglier and
> require one more indent level.
>
> Fixes: 719c57197010 ("net: make napi_disable() symmetric with enable")
> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> ---
>  net/core/dev.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index edeb811c454e..5e101c53b9de 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6929,10 +6929,11 @@ void napi_disable(struct napi_struct *n)
>         set_bit(NAPI_STATE_DISABLE, &n->state);
>
>         do {
> +retry:
>                 val = READ_ONCE(n->state);
>                 if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
>                         usleep_range(20, 200);
> -                       continue;
> +                       goto retry;
>                 }
>
>                 new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
> --
> 2.33.1
>

Good catch !

What about replacing the error prone do {...} while (cmpxchg(..)) by
something less confusing ?

This way, no need for a label.

diff --git a/net/core/dev.c b/net/core/dev.c
index 5e37d6809317fb3c54686188a908bfcb0bfccdab..9327141892cdaaf0282e082e0c6746abae0f12a7
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6264,7 +6264,7 @@ void napi_disable(struct napi_struct *n)
        might_sleep();
        set_bit(NAPI_STATE_DISABLE, &n->state);

-       do {
+       for (;;) {
                val = READ_ONCE(n->state);
                if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
                        usleep_range(20, 200);
@@ -6273,7 +6273,9 @@ void napi_disable(struct napi_struct *n)

                new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
                new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
-       } while (cmpxchg(&n->state, val, new) != val);
+               if (cmpxchg(&n->state, val, new) == val)
+                       break;
+       }

        hrtimer_cancel(&n->timer);
Alexander Lobakin Nov. 10, 2021, 7:43 p.m. UTC | #2
From: Eric Dumazet <edumazet@google.com>
Date: Wed, 10 Nov 2021 11:24:39 -0800

> On Wed, Nov 10, 2021 at 11:11 AM Alexander Lobakin
> <alexandr.lobakin@intel.com> wrote:
> >
> > Commit 719c57197010 ("net: make napi_disable() symmetric with
> > enable") accidentally introduced a bug sometimes leading to a kernel
> > BUG when bringing an iface up/down under heavy traffic load.
> >
> > Prior to this commit, napi_disable() was polling n->state until
> > none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then
> > always flip them. Now there's a possibility to get away with the
> > NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg()
> > call with an unitialized variable, rather than straight to
> > another round of the state check.
> >
> > Error path looks like:
> >
> > napi_disable():
> > unsigned long val, new; /* new is uninitialized */
> >
> > do {
> >         val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
> >                                       NAPIF_STATE_SCHED is set */
> >         if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
> >                 usleep_range(20, 200);
> >                 continue; /* go straight to the condition check */
> >         }
> >         new = val | <...>
> > } while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
> >                                                   writes garbage */
> >
> > napi_enable():
> > do {
> >         val = READ_ONCE(n->state);
> >         BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
> > <...>
> >
> > while the typical BUG splat is like:
> >
> > [
> > Fix this by replacing this 'continue' with a goto to the beginning
> > of the loop body to restore the original behaviour.
> > This could be written without a goto, but would look uglier and
> > require one more indent level.
> >
> > Fixes: 719c57197010 ("net: make napi_disable() symmetric with enable")
> > Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> > ---
> >  net/core/dev.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index edeb811c454e..5e101c53b9de 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6929,10 +6929,11 @@ void napi_disable(struct napi_struct *n)
> >         set_bit(NAPI_STATE_DISABLE, &n->state);
> >
> >         do {
> > +retry:
> >                 val = READ_ONCE(n->state);
> >                 if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
> >                         usleep_range(20, 200);
> > -                       continue;
> > +                       goto retry;
> >                 }
> >
> >                 new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
> > --
> > 2.33.1
> >
> 
> Good catch !

Thanks!

> What about replacing the error prone do {...} while (cmpxchg(..)) by
> something less confusing ?
> 
> This way, no need for a label.
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 5e37d6809317fb3c54686188a908bfcb0bfccdab..9327141892cdaaf0282e082e0c6746abae0f12a7
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6264,7 +6264,7 @@ void napi_disable(struct napi_struct *n)
>         might_sleep();
>         set_bit(NAPI_STATE_DISABLE, &n->state);
> 
> -       do {
> +       for (;;) {
>                 val = READ_ONCE(n->state);
>                 if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
>                         usleep_range(20, 200);
> @@ -6273,7 +6273,9 @@ void napi_disable(struct napi_struct *n)
> 
>                 new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
>                 new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
> -       } while (cmpxchg(&n->state, val, new) != val);
> +               if (cmpxchg(&n->state, val, new) == val)
> +                       break;
> +       }
> 
>         hrtimer_cancel(&n->timer);

LFTM, I'l queue v2 in a moment with you in Suggested-by.

Thanks,
Al
Jakub Kicinski Nov. 11, 2021, 1:44 a.m. UTC | #3
On Wed, 10 Nov 2021 20:43:37 +0100 Alexander Lobakin wrote:
> > What about replacing the error prone do {...} while (cmpxchg(..)) by
> > something less confusing ?
> > 
> > This way, no need for a label.
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 5e37d6809317fb3c54686188a908bfcb0bfccdab..9327141892cdaaf0282e082e0c6746abae0f12a7
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6264,7 +6264,7 @@ void napi_disable(struct napi_struct *n)
> >         might_sleep();
> >         set_bit(NAPI_STATE_DISABLE, &n->state);
> > 
> > -       do {
> > +       for (;;) {
> >                 val = READ_ONCE(n->state);
> >                 if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
> >                         usleep_range(20, 200);
> > @@ -6273,7 +6273,9 @@ void napi_disable(struct napi_struct *n)
> > 
> >                 new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
> >                 new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
> > -       } while (cmpxchg(&n->state, val, new) != val);
> > +               if (cmpxchg(&n->state, val, new) == val)
> > +                       break;
> > +       }
> > 
> >         hrtimer_cancel(&n->timer);  
> 
> LFTM, I'l queue v2 in a moment with you in Suggested-by.

Ouch.

Feel free to put

Acked-by: Jakub Kicinski <kuba@kernel.org>

on the v2.
Jakub Kicinski Nov. 11, 2021, 1:44 a.m. UTC | #4
On Wed, 10 Nov 2021 17:44:07 -0800 Jakub Kicinski wrote:
> On Wed, 10 Nov 2021 20:43:37 +0100 Alexander Lobakin wrote:
> > > What about replacing the error prone do {...} while (cmpxchg(..)) by
> > > something less confusing ?
> > > 
> > > This way, no need for a label.
> > > 
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 5e37d6809317fb3c54686188a908bfcb0bfccdab..9327141892cdaaf0282e082e0c6746abae0f12a7
> > > 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -6264,7 +6264,7 @@ void napi_disable(struct napi_struct *n)
> > >         might_sleep();
> > >         set_bit(NAPI_STATE_DISABLE, &n->state);
> > > 
> > > -       do {
> > > +       for (;;) {
> > >                 val = READ_ONCE(n->state);
> > >                 if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
> > >                         usleep_range(20, 200);
> > > @@ -6273,7 +6273,9 @@ void napi_disable(struct napi_struct *n)
> > > 
> > >                 new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
> > >                 new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
> > > -       } while (cmpxchg(&n->state, val, new) != val);
> > > +               if (cmpxchg(&n->state, val, new) == val)
> > > +                       break;
> > > +       }
> > > 
> > >         hrtimer_cancel(&n->timer);    
> > 
> > LFTM, I'l queue v2 in a moment with you in Suggested-by.  
> 
> Ouch.
> 
> Feel free to put
> 
> Acked-by: Jakub Kicinski <kuba@kernel.org>

Or I'll do it myself since it's already out...
diff mbox series

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index edeb811c454e..5e101c53b9de 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6929,10 +6929,11 @@  void napi_disable(struct napi_struct *n)
 	set_bit(NAPI_STATE_DISABLE, &n->state);
 
 	do {
+retry:
 		val = READ_ONCE(n->state);
 		if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
 			usleep_range(20, 200);
-			continue;
+			goto retry;
 		}
 
 		new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;