Message ID | 20211110191126.1214-1-alexandr.lobakin@intel.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] net: fix premature exit from NAPI state polling in napi_disable() | expand |
On Wed, Nov 10, 2021 at 11:11 AM Alexander Lobakin <alexandr.lobakin@intel.com> wrote: > > Commit 719c57197010 ("net: make napi_disable() symmetric with > enable") accidentally introduced a bug sometimes leading to a kernel > BUG when bringing an iface up/down under heavy traffic load. > > Prior to this commit, napi_disable() was polling n->state until > none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then > always flip them. Now there's a possibility to get away with the > NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg() > call with an unitialized variable, rather than straight to > another round of the state check. > > Error path looks like: > > napi_disable(): > unsigned long val, new; /* new is uninitialized */ > > do { > val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or > NAPIF_STATE_SCHED is set */ > if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */ > usleep_range(20, 200); > continue; /* go straight to the condition check */ > } > new = val | <...> > } while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg() > writes garbage */ > > napi_enable(): > do { > val = READ_ONCE(n->state); > BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */ > <...> > > while the typical BUG splat is like: > > [ > Fix this by replacing this 'continue' with a goto to the beginning > of the loop body to restore the original behaviour. > This could be written without a goto, but would look uglier and > require one more indent level. > > Fixes: 719c57197010 ("net: make napi_disable() symmetric with enable") > Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> > --- > net/core/dev.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/net/core/dev.c b/net/core/dev.c > index edeb811c454e..5e101c53b9de 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -6929,10 +6929,11 @@ void napi_disable(struct napi_struct *n) > set_bit(NAPI_STATE_DISABLE, &n->state); > > do { > +retry: > val = READ_ONCE(n->state); > if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { > usleep_range(20, 200); > - continue; > + goto retry; > } > > new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC; > -- > 2.33.1 > Good catch ! What about replacing the error prone do {...} while (cmpxchg(..)) by something less confusing ? This way, no need for a label. diff --git a/net/core/dev.c b/net/core/dev.c index 5e37d6809317fb3c54686188a908bfcb0bfccdab..9327141892cdaaf0282e082e0c6746abae0f12a7 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6264,7 +6264,7 @@ void napi_disable(struct napi_struct *n) might_sleep(); set_bit(NAPI_STATE_DISABLE, &n->state); - do { + for (;;) { val = READ_ONCE(n->state); if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { usleep_range(20, 200); @@ -6273,7 +6273,9 @@ void napi_disable(struct napi_struct *n) new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC; new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL); - } while (cmpxchg(&n->state, val, new) != val); + if (cmpxchg(&n->state, val, new) == val) + break; + } hrtimer_cancel(&n->timer);
From: Eric Dumazet <edumazet@google.com> Date: Wed, 10 Nov 2021 11:24:39 -0800 > On Wed, Nov 10, 2021 at 11:11 AM Alexander Lobakin > <alexandr.lobakin@intel.com> wrote: > > > > Commit 719c57197010 ("net: make napi_disable() symmetric with > > enable") accidentally introduced a bug sometimes leading to a kernel > > BUG when bringing an iface up/down under heavy traffic load. > > > > Prior to this commit, napi_disable() was polling n->state until > > none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then > > always flip them. Now there's a possibility to get away with the > > NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg() > > call with an unitialized variable, rather than straight to > > another round of the state check. > > > > Error path looks like: > > > > napi_disable(): > > unsigned long val, new; /* new is uninitialized */ > > > > do { > > val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or > > NAPIF_STATE_SCHED is set */ > > if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */ > > usleep_range(20, 200); > > continue; /* go straight to the condition check */ > > } > > new = val | <...> > > } while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg() > > writes garbage */ > > > > napi_enable(): > > do { > > val = READ_ONCE(n->state); > > BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */ > > <...> > > > > while the typical BUG splat is like: > > > > [ > > Fix this by replacing this 'continue' with a goto to the beginning > > of the loop body to restore the original behaviour. > > This could be written without a goto, but would look uglier and > > require one more indent level. > > > > Fixes: 719c57197010 ("net: make napi_disable() symmetric with enable") > > Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> > > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> > > --- > > net/core/dev.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/net/core/dev.c b/net/core/dev.c > > index edeb811c454e..5e101c53b9de 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -6929,10 +6929,11 @@ void napi_disable(struct napi_struct *n) > > set_bit(NAPI_STATE_DISABLE, &n->state); > > > > do { > > +retry: > > val = READ_ONCE(n->state); > > if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { > > usleep_range(20, 200); > > - continue; > > + goto retry; > > } > > > > new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC; > > -- > > 2.33.1 > > > > Good catch ! Thanks! > What about replacing the error prone do {...} while (cmpxchg(..)) by > something less confusing ? > > This way, no need for a label. > > diff --git a/net/core/dev.c b/net/core/dev.c > index 5e37d6809317fb3c54686188a908bfcb0bfccdab..9327141892cdaaf0282e082e0c6746abae0f12a7 > 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -6264,7 +6264,7 @@ void napi_disable(struct napi_struct *n) > might_sleep(); > set_bit(NAPI_STATE_DISABLE, &n->state); > > - do { > + for (;;) { > val = READ_ONCE(n->state); > if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { > usleep_range(20, 200); > @@ -6273,7 +6273,9 @@ void napi_disable(struct napi_struct *n) > > new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC; > new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL); > - } while (cmpxchg(&n->state, val, new) != val); > + if (cmpxchg(&n->state, val, new) == val) > + break; > + } > > hrtimer_cancel(&n->timer); LFTM, I'l queue v2 in a moment with you in Suggested-by. Thanks, Al
On Wed, 10 Nov 2021 20:43:37 +0100 Alexander Lobakin wrote: > > What about replacing the error prone do {...} while (cmpxchg(..)) by > > something less confusing ? > > > > This way, no need for a label. > > > > diff --git a/net/core/dev.c b/net/core/dev.c > > index 5e37d6809317fb3c54686188a908bfcb0bfccdab..9327141892cdaaf0282e082e0c6746abae0f12a7 > > 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -6264,7 +6264,7 @@ void napi_disable(struct napi_struct *n) > > might_sleep(); > > set_bit(NAPI_STATE_DISABLE, &n->state); > > > > - do { > > + for (;;) { > > val = READ_ONCE(n->state); > > if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { > > usleep_range(20, 200); > > @@ -6273,7 +6273,9 @@ void napi_disable(struct napi_struct *n) > > > > new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC; > > new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL); > > - } while (cmpxchg(&n->state, val, new) != val); > > + if (cmpxchg(&n->state, val, new) == val) > > + break; > > + } > > > > hrtimer_cancel(&n->timer); > > LFTM, I'l queue v2 in a moment with you in Suggested-by. Ouch. Feel free to put Acked-by: Jakub Kicinski <kuba@kernel.org> on the v2.
On Wed, 10 Nov 2021 17:44:07 -0800 Jakub Kicinski wrote: > On Wed, 10 Nov 2021 20:43:37 +0100 Alexander Lobakin wrote: > > > What about replacing the error prone do {...} while (cmpxchg(..)) by > > > something less confusing ? > > > > > > This way, no need for a label. > > > > > > diff --git a/net/core/dev.c b/net/core/dev.c > > > index 5e37d6809317fb3c54686188a908bfcb0bfccdab..9327141892cdaaf0282e082e0c6746abae0f12a7 > > > 100644 > > > --- a/net/core/dev.c > > > +++ b/net/core/dev.c > > > @@ -6264,7 +6264,7 @@ void napi_disable(struct napi_struct *n) > > > might_sleep(); > > > set_bit(NAPI_STATE_DISABLE, &n->state); > > > > > > - do { > > > + for (;;) { > > > val = READ_ONCE(n->state); > > > if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { > > > usleep_range(20, 200); > > > @@ -6273,7 +6273,9 @@ void napi_disable(struct napi_struct *n) > > > > > > new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC; > > > new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL); > > > - } while (cmpxchg(&n->state, val, new) != val); > > > + if (cmpxchg(&n->state, val, new) == val) > > > + break; > > > + } > > > > > > hrtimer_cancel(&n->timer); > > > > LFTM, I'l queue v2 in a moment with you in Suggested-by. > > Ouch. > > Feel free to put > > Acked-by: Jakub Kicinski <kuba@kernel.org> Or I'll do it myself since it's already out...
diff --git a/net/core/dev.c b/net/core/dev.c index edeb811c454e..5e101c53b9de 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6929,10 +6929,11 @@ void napi_disable(struct napi_struct *n) set_bit(NAPI_STATE_DISABLE, &n->state); do { +retry: val = READ_ONCE(n->state); if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { usleep_range(20, 200); - continue; + goto retry; } new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;