Message ID | ef333a8c-1bb2-49a7-b721-68b28df19b0e@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | r8169: Fix GRO-related issue with not disabled device interrupts | expand |
On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote: > > Ken reported that RTL8125b can lock up if gro_flush_timeout has the > default value of 20000 and napi_defer_hard_irqs is set to 0. > In this scenario device interrupts aren't disabled, what seems to > trigger some silicon bug under heavy load. I was able to reproduce this > behavior on RTL8168h. > Disabling device interrupts if NAPI is scheduled from a place other than > the driver's interrupt handler is a necessity in r8169, for other > drivers it may still be a performance optimization. > > Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already") > Reported-by: Ken Milmore <ken.milmore@gmail.com> > Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> > --- > drivers/net/ethernet/realtek/r8169_main.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c > index e5ea827a2..01f0ca53d 100644 > --- a/drivers/net/ethernet/realtek/r8169_main.c > +++ b/drivers/net/ethernet/realtek/r8169_main.c > @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) > { > struct rtl8169_private *tp = dev_instance; > u32 status = rtl_get_events(tp); > + int ret; > > if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask)) > return IRQ_NONE; > @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) > rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING); > } > > - if (napi_schedule_prep(&tp->napi)) { > + ret = __napi_schedule_prep(&tp->napi); > + if (ret >= 0) > rtl_irq_disable(tp); > + if (ret > 0) > __napi_schedule(&tp->napi); > - } > out: > rtl_ack_events(tp, status); > I do not understand this patch. __napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set, but this should not happen under normal operations ? A simple revert would avoid adding yet another NAPI helper.
From: Eric Dumazet <edumazet@google.com> Date: Tue, 14 May 2024 11:45:05 +0200 > On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote: >> >> Ken reported that RTL8125b can lock up if gro_flush_timeout has the >> default value of 20000 and napi_defer_hard_irqs is set to 0. >> In this scenario device interrupts aren't disabled, what seems to >> trigger some silicon bug under heavy load. I was able to reproduce this >> behavior on RTL8168h. >> Disabling device interrupts if NAPI is scheduled from a place other than >> the driver's interrupt handler is a necessity in r8169, for other >> drivers it may still be a performance optimization. >> >> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already") >> Reported-by: Ken Milmore <ken.milmore@gmail.com> >> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> >> --- >> drivers/net/ethernet/realtek/r8169_main.c | 6 ++++-- >> 1 file changed, 4 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c >> index e5ea827a2..01f0ca53d 100644 >> --- a/drivers/net/ethernet/realtek/r8169_main.c >> +++ b/drivers/net/ethernet/realtek/r8169_main.c >> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) >> { >> struct rtl8169_private *tp = dev_instance; >> u32 status = rtl_get_events(tp); >> + int ret; >> >> if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask)) >> return IRQ_NONE; >> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) >> rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING); >> } >> >> - if (napi_schedule_prep(&tp->napi)) { >> + ret = __napi_schedule_prep(&tp->napi); >> + if (ret >= 0) >> rtl_irq_disable(tp); >> + if (ret > 0) >> __napi_schedule(&tp->napi); >> - } >> out: >> rtl_ack_events(tp, status); >> > > I do not understand this patch. > > __napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set, > but this should not happen under normal operations ? Without this patch, napi_schedule_prep() returns false if it's either scheduled already OR it's disabled. Drivers disable interrupts only if it returns true, which means they don't do that if it's already scheduled. With this patch, __napi_schedule_prep() returns -1 if it's disabled and 0 if it was already scheduled. Which means we can disable interrupts when the result is >= 0, i.e. regardless if it was scheduled before the call or within the call. IIUC, this addresses such situations: napi_schedule() // we disabled interrupts napi_poll() // we polled < budget frames napi_complete_done() // reenable the interrupts, no repoll hrtimer_start() // GRO flush is queued napi_schedule() napi_poll() // GRO flush, BUT interrupts are enabled On r8169, this seems to cause issues. On other drivers, it seems to be okay, but with this new helper, you can save some cycles. Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> > > A simple revert would avoid adding yet another NAPI helper. Thanks, Olek
On Tue, May 14, 2024 at 12:53 PM Alexander Lobakin <aleksander.lobakin@intel.com> wrote: > > From: Eric Dumazet <edumazet@google.com> > Date: Tue, 14 May 2024 11:45:05 +0200 > > > On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote: > >> > >> Ken reported that RTL8125b can lock up if gro_flush_timeout has the > >> default value of 20000 and napi_defer_hard_irqs is set to 0. > >> In this scenario device interrupts aren't disabled, what seems to > >> trigger some silicon bug under heavy load. I was able to reproduce this > >> behavior on RTL8168h. > >> Disabling device interrupts if NAPI is scheduled from a place other than > >> the driver's interrupt handler is a necessity in r8169, for other > >> drivers it may still be a performance optimization. > >> > >> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already") > >> Reported-by: Ken Milmore <ken.milmore@gmail.com> > >> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> > >> --- > >> drivers/net/ethernet/realtek/r8169_main.c | 6 ++++-- > >> 1 file changed, 4 insertions(+), 2 deletions(-) > >> > >> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c > >> index e5ea827a2..01f0ca53d 100644 > >> --- a/drivers/net/ethernet/realtek/r8169_main.c > >> +++ b/drivers/net/ethernet/realtek/r8169_main.c > >> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) > >> { > >> struct rtl8169_private *tp = dev_instance; > >> u32 status = rtl_get_events(tp); > >> + int ret; > >> > >> if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask)) > >> return IRQ_NONE; > >> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) > >> rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING); > >> } > >> > >> - if (napi_schedule_prep(&tp->napi)) { > >> + ret = __napi_schedule_prep(&tp->napi); > >> + if (ret >= 0) > >> rtl_irq_disable(tp); > >> + if (ret > 0) > >> __napi_schedule(&tp->napi); > >> - } > >> out: > >> rtl_ack_events(tp, status); > >> > > > > I do not understand this patch. > > > > __napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set, > > but this should not happen under normal operations ? > > Without this patch, napi_schedule_prep() returns false if it's either > scheduled already OR it's disabled. Drivers disable interrupts only if > it returns true, which means they don't do that if it's already scheduled. > With this patch, __napi_schedule_prep() returns -1 if it's disabled and > 0 if it was already scheduled. Which means we can disable interrupts > when the result is >= 0, i.e. regardless if it was scheduled before the > call or within the call. > > IIUC, this addresses such situations: > > napi_schedule() // we disabled interrupts > napi_poll() // we polled < budget frames > napi_complete_done() // reenable the interrupts, no repoll > hrtimer_start() // GRO flush is queued > napi_schedule() > napi_poll() // GRO flush, BUT interrupts are enabled > > On r8169, this seems to cause issues. On other drivers, it seems to be > okay, but with this new helper, you can save some cycles. > > Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Rephrasing the changelog is not really helping. Consider myself as a network maintainer, not as a casual patch reviewer. "This seems to cause issues" is rather weak. I would simply revert the faulty commit, because the interrupts are going to be disabled no matter what. Old logic was very simple and rock solid. A revert is a clear stable candidate. rtl_irq_disable(tp); napi_schedule(&tp->napi); If this is still broken, we might have similar issues in old/legacy drivers.
From: Eric Dumazet <edumazet@google.com> Date: Tue, 14 May 2024 13:05:55 +0200 > On Tue, May 14, 2024 at 12:53 PM Alexander Lobakin > <aleksander.lobakin@intel.com> wrote: >> >> From: Eric Dumazet <edumazet@google.com> >> Date: Tue, 14 May 2024 11:45:05 +0200 >> >>> On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote: >>>> >>>> Ken reported that RTL8125b can lock up if gro_flush_timeout has the >>>> default value of 20000 and napi_defer_hard_irqs is set to 0. >>>> In this scenario device interrupts aren't disabled, what seems to >>>> trigger some silicon bug under heavy load. I was able to reproduce this >>>> behavior on RTL8168h. >>>> Disabling device interrupts if NAPI is scheduled from a place other than >>>> the driver's interrupt handler is a necessity in r8169, for other >>>> drivers it may still be a performance optimization. >>>> >>>> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already") >>>> Reported-by: Ken Milmore <ken.milmore@gmail.com> >>>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> >>>> --- >>>> drivers/net/ethernet/realtek/r8169_main.c | 6 ++++-- >>>> 1 file changed, 4 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c >>>> index e5ea827a2..01f0ca53d 100644 >>>> --- a/drivers/net/ethernet/realtek/r8169_main.c >>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c >>>> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) >>>> { >>>> struct rtl8169_private *tp = dev_instance; >>>> u32 status = rtl_get_events(tp); >>>> + int ret; >>>> >>>> if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask)) >>>> return IRQ_NONE; >>>> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) >>>> rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING); >>>> } >>>> >>>> - if (napi_schedule_prep(&tp->napi)) { >>>> + ret = __napi_schedule_prep(&tp->napi); >>>> + if (ret >= 0) >>>> rtl_irq_disable(tp); >>>> + if (ret > 0) >>>> __napi_schedule(&tp->napi); >>>> - } >>>> out: >>>> rtl_ack_events(tp, status); >>>> >>> >>> I do not understand this patch. >>> >>> __napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set, >>> but this should not happen under normal operations ? >> >> Without this patch, napi_schedule_prep() returns false if it's either >> scheduled already OR it's disabled. Drivers disable interrupts only if >> it returns true, which means they don't do that if it's already scheduled. >> With this patch, __napi_schedule_prep() returns -1 if it's disabled and >> 0 if it was already scheduled. Which means we can disable interrupts >> when the result is >= 0, i.e. regardless if it was scheduled before the >> call or within the call. >> >> IIUC, this addresses such situations: >> >> napi_schedule() // we disabled interrupts >> napi_poll() // we polled < budget frames >> napi_complete_done() // reenable the interrupts, no repoll >> hrtimer_start() // GRO flush is queued >> napi_schedule() >> napi_poll() // GRO flush, BUT interrupts are enabled >> >> On r8169, this seems to cause issues. On other drivers, it seems to be >> okay, but with this new helper, you can save some cycles. >> >> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> > > Rephrasing the changelog is not really helping. > > Consider myself as a network maintainer, not as a casual patch reviewer. And? > > "This seems to cause issues" is rather weak. It has "Reported-by", so it really causes issues. > > I would simply revert the faulty commit, because the interrupts are > going to be disabled no matter what. > > Old logic was very simple and rock solid. A revert is a clear stable candidate. > > rtl_irq_disable(tp); > napi_schedule(&tp->napi); > > If this is still broken, we might have similar issues in old/legacy drivers. I might agree that we could just revert the mentioned commit for stable, but for the next net-next, avoid unnecessary scheduling/enabling/disabling interrupts makes sense, not only for "old/legacy" drivers. "Very simple and rock solid" is not an argument for avoiding improvements. Thanks, Olek
On 14.05.2024 13:05, Eric Dumazet wrote: > On Tue, May 14, 2024 at 12:53 PM Alexander Lobakin > <aleksander.lobakin@intel.com> wrote: >> >> From: Eric Dumazet <edumazet@google.com> >> Date: Tue, 14 May 2024 11:45:05 +0200 >> >>> On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote: >>>> >>>> Ken reported that RTL8125b can lock up if gro_flush_timeout has the >>>> default value of 20000 and napi_defer_hard_irqs is set to 0. >>>> In this scenario device interrupts aren't disabled, what seems to >>>> trigger some silicon bug under heavy load. I was able to reproduce this >>>> behavior on RTL8168h. >>>> Disabling device interrupts if NAPI is scheduled from a place other than >>>> the driver's interrupt handler is a necessity in r8169, for other >>>> drivers it may still be a performance optimization. >>>> >>>> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already") >>>> Reported-by: Ken Milmore <ken.milmore@gmail.com> >>>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> >>>> --- >>>> drivers/net/ethernet/realtek/r8169_main.c | 6 ++++-- >>>> 1 file changed, 4 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c >>>> index e5ea827a2..01f0ca53d 100644 >>>> --- a/drivers/net/ethernet/realtek/r8169_main.c >>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c >>>> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) >>>> { >>>> struct rtl8169_private *tp = dev_instance; >>>> u32 status = rtl_get_events(tp); >>>> + int ret; >>>> >>>> if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask)) >>>> return IRQ_NONE; >>>> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) >>>> rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING); >>>> } >>>> >>>> - if (napi_schedule_prep(&tp->napi)) { >>>> + ret = __napi_schedule_prep(&tp->napi); >>>> + if (ret >= 0) >>>> rtl_irq_disable(tp); >>>> + if (ret > 0) >>>> __napi_schedule(&tp->napi); >>>> - } >>>> out: >>>> rtl_ack_events(tp, status); >>>> >>> >>> I do not understand this patch. >>> >>> __napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set, >>> but this should not happen under normal operations ? >> >> Without this patch, napi_schedule_prep() returns false if it's either >> scheduled already OR it's disabled. Drivers disable interrupts only if >> it returns true, which means they don't do that if it's already scheduled. >> With this patch, __napi_schedule_prep() returns -1 if it's disabled and >> 0 if it was already scheduled. Which means we can disable interrupts >> when the result is >= 0, i.e. regardless if it was scheduled before the >> call or within the call. >> >> IIUC, this addresses such situations: >> >> napi_schedule() // we disabled interrupts >> napi_poll() // we polled < budget frames >> napi_complete_done() // reenable the interrupts, no repoll >> hrtimer_start() // GRO flush is queued >> napi_schedule() >> napi_poll() // GRO flush, BUT interrupts are enabled >> >> On r8169, this seems to cause issues. On other drivers, it seems to be >> okay, but with this new helper, you can save some cycles. >> >> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> > > Rephrasing the changelog is not really helping. > > Consider myself as a network maintainer, not as a casual patch reviewer. > > "This seems to cause issues" is rather weak. > > I would simply revert the faulty commit, because the interrupts are > going to be disabled no matter what. > > Old logic was very simple and rock solid. A revert is a clear stable candidate. > > rtl_irq_disable(tp); > napi_schedule(&tp->napi); > Proposed new solution differs from the revert wrt how NAPIF_STATE_DISABLE is handled. I'm not sure the old code would handle this corner case correctly. It would disable device interrupts, and following napi_schedule() is a no-op. Therefore device interrupts would remain disabled, and I think that's not what we want. When trying to solve this by adding an extra call to e.g. napi_disable_pending(), then I have concerns this might be racy. > If this is still broken, we might have similar issues in old/legacy drivers.
On Tue, 14 May 2024 13:05:55 +0200 Eric Dumazet wrote: > > napi_schedule() // we disabled interrupts > > napi_poll() // we polled < budget frames > > napi_complete_done() // reenable the interrupts, no repoll > > hrtimer_start() // GRO flush is queued > > napi_schedule() > > napi_poll() // GRO flush, BUT interrupts are enabled I thought the bug is because of a race with disable. But there's already a synchronize_net() after disable, so NAPI poll must fully exit before we mask in rtl8169_cleanup(). If the bug is double-enable you describe the fix is just making the race window smaller. But I don't think that's the bug. BTW why are events only acked in rtl8169_interrupt() and not rtl8169_poll()?
On Tue, May 14, 2024 at 1:18 PM Alexander Lobakin <aleksander.lobakin@intel.com> wrote: > > From: Eric Dumazet <edumazet@google.com> > Date: Tue, 14 May 2024 13:05:55 +0200 > > > On Tue, May 14, 2024 at 12:53 PM Alexander Lobakin > > <aleksander.lobakin@intel.com> wrote: > >> > >> From: Eric Dumazet <edumazet@google.com> > >> Date: Tue, 14 May 2024 11:45:05 +0200 > >> > >>> On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote: > >>>> > >>>> Ken reported that RTL8125b can lock up if gro_flush_timeout has the > >>>> default value of 20000 and napi_defer_hard_irqs is set to 0. > >>>> In this scenario device interrupts aren't disabled, what seems to > >>>> trigger some silicon bug under heavy load. I was able to reproduce this > >>>> behavior on RTL8168h. > >>>> Disabling device interrupts if NAPI is scheduled from a place other than > >>>> the driver's interrupt handler is a necessity in r8169, for other > >>>> drivers it may still be a performance optimization. > >>>> > >>>> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already") > >>>> Reported-by: Ken Milmore <ken.milmore@gmail.com> > >>>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> > >>>> --- > >>>> drivers/net/ethernet/realtek/r8169_main.c | 6 ++++-- > >>>> 1 file changed, 4 insertions(+), 2 deletions(-) > >>>> > >>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c > >>>> index e5ea827a2..01f0ca53d 100644 > >>>> --- a/drivers/net/ethernet/realtek/r8169_main.c > >>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c > >>>> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) > >>>> { > >>>> struct rtl8169_private *tp = dev_instance; > >>>> u32 status = rtl_get_events(tp); > >>>> + int ret; > >>>> > >>>> if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask)) > >>>> return IRQ_NONE; > >>>> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) > >>>> rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING); > >>>> } > >>>> > >>>> - if (napi_schedule_prep(&tp->napi)) { > >>>> + ret = __napi_schedule_prep(&tp->napi); > >>>> + if (ret >= 0) > >>>> rtl_irq_disable(tp); > >>>> + if (ret > 0) > >>>> __napi_schedule(&tp->napi); > >>>> - } > >>>> out: > >>>> rtl_ack_events(tp, status); > >>>> > >>> > >>> I do not understand this patch. > >>> > >>> __napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set, > >>> but this should not happen under normal operations ? > >> > >> Without this patch, napi_schedule_prep() returns false if it's either > >> scheduled already OR it's disabled. Drivers disable interrupts only if > >> it returns true, which means they don't do that if it's already scheduled. > >> With this patch, __napi_schedule_prep() returns -1 if it's disabled and > >> 0 if it was already scheduled. Which means we can disable interrupts > >> when the result is >= 0, i.e. regardless if it was scheduled before the > >> call or within the call. > >> > >> IIUC, this addresses such situations: > >> > >> napi_schedule() // we disabled interrupts > >> napi_poll() // we polled < budget frames > >> napi_complete_done() // reenable the interrupts, no repoll > >> hrtimer_start() // GRO flush is queued > >> napi_schedule() > >> napi_poll() // GRO flush, BUT interrupts are enabled > >> > >> On r8169, this seems to cause issues. On other drivers, it seems to be > >> okay, but with this new helper, you can save some cycles. > >> > >> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> > > > > Rephrasing the changelog is not really helping. > > > > Consider myself as a network maintainer, not as a casual patch reviewer. > > And? > > > > > "This seems to cause issues" is rather weak. > > It has "Reported-by", so it really causes issues. And ? Revert ? > > > > > I would simply revert the faulty commit, because the interrupts are > > going to be disabled no matter what. > > > > Old logic was very simple and rock solid. A revert is a clear stable candidate. > > > > rtl_irq_disable(tp); > > napi_schedule(&tp->napi); > > > > If this is still broken, we might have similar issues in old/legacy drivers. > > I might agree that we could just revert the mentioned commit for stable, > but for the next net-next, avoid unnecessary > scheduling/enabling/disabling interrupts makes sense, not only for > "old/legacy" drivers. > "Very simple and rock solid" is not an argument for avoiding improvements. I explained that I failed to see the 'so called' improvement there. You explained nothing really, just that you like some approach that I think is for net-next.
On 14.05.2024 16:11, Jakub Kicinski wrote: > On Tue, 14 May 2024 13:05:55 +0200 Eric Dumazet wrote: >>> napi_schedule() // we disabled interrupts >>> napi_poll() // we polled < budget frames >>> napi_complete_done() // reenable the interrupts, no repoll >>> hrtimer_start() // GRO flush is queued >>> napi_schedule() >>> napi_poll() // GRO flush, BUT interrupts are enabled > > I thought the bug is because of a race with disable. No, the second napi_poll() in this scenario is executed with device interrupts enabled, what triggers a (supposedly) hw bug under heavy load. So the fix is to disable device interrupts also in the case that NAPI is already scheduled when entering the interrupt handler. > But there's already a synchronize_net() after disable, so NAPI poll > must fully exit before we mask in rtl8169_cleanup(). > > If the bug is double-enable you describe the fix is just making > the race window smaller. But I don't think that's the bug. > > BTW why are events only acked in rtl8169_interrupt() and not > rtl8169_poll()? You mean clearing the rx/tx-related interrupt status bits only after napi_complete_done(), as an alternative to disabling device interrupts?
On Tue, 14 May 2024 18:35:46 +0200 Heiner Kallweit wrote: > > I thought the bug is because of a race with disable. > > No, the second napi_poll() in this scenario is executed with device > interrupts enabled, what triggers a (supposedly) hw bug under heavy > load. So the fix is to disable device interrupts also in the case > that NAPI is already scheduled when entering the interrupt handler. > > > But there's already a synchronize_net() after disable, so NAPI poll > > must fully exit before we mask in rtl8169_cleanup(). > > > > If the bug is double-enable you describe the fix is just making > > the race window smaller. But I don't think that's the bug. > > > > BTW why are events only acked in rtl8169_interrupt() and not > > rtl8169_poll()? > > You mean clearing the rx/tx-related interrupt status bits only > after napi_complete_done(), as an alternative to disabling > device interrupts? Before, basically ack them at the start of a poll function. If gro_timeout / IRQ suppression is not enabled it won't make much of a difference. Probably also won't make much difference with iperf. But normally traffic is bursty so with gro_timeout we can see something like: packets: x x x x x < no more packets > IRQ pending: xxx xxxxxxxxxxxxxxxxxxxxxx ISR: [] [] IRQ ack: x x NAPI: [=====] < timeout > [=] [=] < timeout > [=] Acking at the beginning of NAPI poll can't make us miss events but we'd clear the pending IRQ on the "deferred" NAPI run, avoiding an extra HW IRQ and 2 NAPI calls: packets: x x x x x < no more packets > IRQ pending: xxxx xxxxxxxxxxxxxxxxxxx ISR: [] IRQ ack: x x NAPI: [=====] < timeout > [=]
On 14.05.2024 18:49, Jakub Kicinski wrote: > On Tue, 14 May 2024 18:35:46 +0200 Heiner Kallweit wrote: >>> I thought the bug is because of a race with disable. >> >> No, the second napi_poll() in this scenario is executed with device >> interrupts enabled, what triggers a (supposedly) hw bug under heavy >> load. So the fix is to disable device interrupts also in the case >> that NAPI is already scheduled when entering the interrupt handler. >> >>> But there's already a synchronize_net() after disable, so NAPI poll >>> must fully exit before we mask in rtl8169_cleanup(). >>> >>> If the bug is double-enable you describe the fix is just making >>> the race window smaller. But I don't think that's the bug. >>> >>> BTW why are events only acked in rtl8169_interrupt() and not >>> rtl8169_poll()? >> >> You mean clearing the rx/tx-related interrupt status bits only >> after napi_complete_done(), as an alternative to disabling >> device interrupts? > > Before, basically ack them at the start of a poll function. > If gro_timeout / IRQ suppression is not enabled it won't make > much of a difference. Probably also won't make much difference > with iperf. > > But normally traffic is bursty so with gro_timeout we can see > something like: > > packets: x x x x x < no more packets > > IRQ pending: xxx xxxxxxxxxxxxxxxxxxxxxx > ISR: [] [] > IRQ ack: x x > NAPI: [=====] < timeout > [=] [=] < timeout > [=] > > Acking at the beginning of NAPI poll can't make us miss events > but we'd clear the pending IRQ on the "deferred" NAPI run, avoiding > an extra HW IRQ and 2 NAPI calls: > > packets: x x x x x < no more packets > > IRQ pending: xxxx xxxxxxxxxxxxxxxxxxx > ISR: [] > IRQ ack: x x > NAPI: [=====] < timeout > [=] Thanks for the explanation. What is the benefit of acking interrupts at the beginning of NAPI poll, compared to acking them after napi_complete_done()? If budget is exceeded and we know we're polled again, why ack the interrupts in between? I just tested with the defaults of gro_flush_timeout=20000 and napi_defer_hardirqs=1, and iperf3 --bidir. The difference is massive. When acking after napi_complete_done() I see only a few hundred interrupts. Acking at the beginning of NAPI poll it's few hundred thousand interrupts.
On Tue, 14 May 2024 19:09:21 +0200 Heiner Kallweit wrote: > Thanks for the explanation. What is the benefit of acking interrupts > at the beginning of NAPI poll, compared to acking them after > napi_complete_done()? > If budget is exceeded and we know we're polled again, why ack > the interrupts in between? That's a fair point, the main concern for acking after processing is that we will miss an event. If we ack before processing we can occasionally take an unnecessary IRQ, but we'll never let a packet rot on the ring because it arrived between processing packets and acking the IRQ. But you know the driver better, maybe there's a clean way of avoiding the missed IRQs (not sure it would be worth the complexity, tho TBH). > I just tested with the defaults of gro_flush_timeout=20000 and > napi_defer_hardirqs=1, and iperf3 --bidir. > The difference is massive. When acking after napi_complete_done() > I see only a few hundred interrupts. Acking at the beginning of > NAPI poll it's few hundred thousand interrupts. That's quite odd. Maybe because rtl_tx() doesn't contribute to work done? Maybe it'd be better to set work done to min(budget, !!tx, rx) ? Or maybe the disabling is not working somehow? napi_defer_hardirqs=1 should make us reschedule NAPI if there was _any_ work done. Meaning we'd enable NAPI only after a completely empty NAPI run. On an empty NAPI run it should not matter whether we acked before or after checking for packets, or so I'd naively think.
On Tue, 14 May 2024 10:47:39 -0700 Jakub Kicinski wrote:
> enable NAPI
enable IRQ
It seems to me that these chips are known for being badly-documented and quirky, so maybe an empirical approach is called for. I have briefly surveyed various driver sources available from the vendor and from the BSDs, and AFAICT they all follow the pattern of unconditionally masking interrupts, then clearing the status bits, then processing the rings and then re-enabling interrupts in that order. In this respect, the Linux driver may have become an outlier in that it doesn't *always* mask interrupts before acking them, and that it may process the rings while IRQs are unmasked. I'm not saying that these are necessarily problems, but... There are some differences in how the drivers work: the FreeBSD one masks interrupts straight away but defers writing the status register to the bottom half, the OpenBSD driver seems to do everything in the IRQ handler. These drivers also tend to flip between using "hard IRQs" and the built-in timer, which complicates things. But in terms of the "mask first" approach, I think they all look equivalent. https://sources.debian.org/src/r8168/8.053.00-1/src/r8168_n.c/ https://sources.debian.org/src/r8125/9.011.00-4/src/r8125_n.c/ https://github.com/openbsd/src/blob/master/sys/dev/ic/re.c https://github.com/openbsd/src/blob/master/sys/dev/pci/if_rge.c https://cgit.freebsd.org/src/tree/sys/dev/re/if_re.c I also found this ancient netdev thread which looks startlingly familiar to the behaviour in the present issue. It seems that people have been here before... https://lore.kernel.org/netdev/1242328457.32579.12.camel@lap75545.ornl.gov/ "I added some code to print the irq status when it hangs, and it shows 0x0085, which is RxOK | TxOK | TxDescUnavail, which makes me think we've lost an MSI-edge interrupt somehow." https://lore.kernel.org/netdev/1243042174.3580.23.camel@obelisk.thedillows.org/ "The 8169 chip only generates MSI interrupts when all enabled event sources are quiescent and one or more sources transition to active. If not all of the active events are acknowledged, or a new event becomes active while the existing ones are cleared in the handler, we will not see a new interrupt."
On 14.05.2024 11:45, Eric Dumazet wrote: > On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote: >> >> Ken reported that RTL8125b can lock up if gro_flush_timeout has the >> default value of 20000 and napi_defer_hard_irqs is set to 0. >> In this scenario device interrupts aren't disabled, what seems to >> trigger some silicon bug under heavy load. I was able to reproduce this >> behavior on RTL8168h. >> Disabling device interrupts if NAPI is scheduled from a place other than >> the driver's interrupt handler is a necessity in r8169, for other >> drivers it may still be a performance optimization. >> >> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already") >> Reported-by: Ken Milmore <ken.milmore@gmail.com> >> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> >> --- >> drivers/net/ethernet/realtek/r8169_main.c | 6 ++++-- >> 1 file changed, 4 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c >> index e5ea827a2..01f0ca53d 100644 >> --- a/drivers/net/ethernet/realtek/r8169_main.c >> +++ b/drivers/net/ethernet/realtek/r8169_main.c >> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) >> { >> struct rtl8169_private *tp = dev_instance; >> u32 status = rtl_get_events(tp); >> + int ret; >> >> if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask)) >> return IRQ_NONE; >> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) >> rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING); >> } >> >> - if (napi_schedule_prep(&tp->napi)) { >> + ret = __napi_schedule_prep(&tp->napi); >> + if (ret >= 0) >> rtl_irq_disable(tp); >> + if (ret > 0) >> __napi_schedule(&tp->napi); >> - } >> out: >> rtl_ack_events(tp, status); >> > > I do not understand this patch. > > __napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set, > but this should not happen under normal operations ? > > A simple revert would avoid adding yet another NAPI helper. Fine with me, I'll go with the revert for now.
diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c index e5ea827a2..01f0ca53d 100644 --- a/drivers/net/ethernet/realtek/r8169_main.c +++ b/drivers/net/ethernet/realtek/r8169_main.c @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) { struct rtl8169_private *tp = dev_instance; u32 status = rtl_get_events(tp); + int ret; if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask)) return IRQ_NONE; @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING); } - if (napi_schedule_prep(&tp->napi)) { + ret = __napi_schedule_prep(&tp->napi); + if (ret >= 0) rtl_irq_disable(tp); + if (ret > 0) __napi_schedule(&tp->napi); - } out: rtl_ack_events(tp, status);
Ken reported that RTL8125b can lock up if gro_flush_timeout has the default value of 20000 and napi_defer_hard_irqs is set to 0. In this scenario device interrupts aren't disabled, what seems to trigger some silicon bug under heavy load. I was able to reproduce this behavior on RTL8168h. Disabling device interrupts if NAPI is scheduled from a place other than the driver's interrupt handler is a necessity in r8169, for other drivers it may still be a performance optimization. Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already") Reported-by: Ken Milmore <ken.milmore@gmail.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> --- drivers/net/ethernet/realtek/r8169_main.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)