diff mbox series

[net,2/2] r8169: disable interrupts also for GRO-scheduled NAPI

Message ID ef333a8c-1bb2-49a7-b721-68b28df19b0e@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series r8169: Fix GRO-related issue with not disabled device interrupts | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 925 this patch: 925
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 936 this patch: 936
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 936 this patch: 936
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 20 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-05-15--12-00 (tests: 1021)

Commit Message

Heiner Kallweit May 14, 2024, 6:52 a.m. UTC
Ken reported that RTL8125b can lock up if gro_flush_timeout has the
default value of 20000 and napi_defer_hard_irqs is set to 0.
In this scenario device interrupts aren't disabled, what seems to
trigger some silicon bug under heavy load. I was able to reproduce this
behavior on RTL8168h.
Disabling device interrupts if NAPI is scheduled from a place other than
the driver's interrupt handler is a necessity in r8169, for other
drivers it may still be a performance optimization.

Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already")
Reported-by: Ken Milmore <ken.milmore@gmail.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169_main.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Eric Dumazet May 14, 2024, 9:45 a.m. UTC | #1
On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> Ken reported that RTL8125b can lock up if gro_flush_timeout has the
> default value of 20000 and napi_defer_hard_irqs is set to 0.
> In this scenario device interrupts aren't disabled, what seems to
> trigger some silicon bug under heavy load. I was able to reproduce this
> behavior on RTL8168h.
> Disabling device interrupts if NAPI is scheduled from a place other than
> the driver's interrupt handler is a necessity in r8169, for other
> drivers it may still be a performance optimization.
>
> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already")
> Reported-by: Ken Milmore <ken.milmore@gmail.com>
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> ---
>  drivers/net/ethernet/realtek/r8169_main.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
> index e5ea827a2..01f0ca53d 100644
> --- a/drivers/net/ethernet/realtek/r8169_main.c
> +++ b/drivers/net/ethernet/realtek/r8169_main.c
> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>  {
>         struct rtl8169_private *tp = dev_instance;
>         u32 status = rtl_get_events(tp);
> +       int ret;
>
>         if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
>                 return IRQ_NONE;
> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>                 rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
>         }
>
> -       if (napi_schedule_prep(&tp->napi)) {
> +       ret = __napi_schedule_prep(&tp->napi);
> +       if (ret >= 0)
>                 rtl_irq_disable(tp);
> +       if (ret > 0)
>                 __napi_schedule(&tp->napi);
> -       }
>  out:
>         rtl_ack_events(tp, status);
>

I do not understand this patch.

__napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set,
but this should not happen under normal operations ?

A simple revert would avoid adding yet another NAPI helper.
Alexander Lobakin May 14, 2024, 10:52 a.m. UTC | #2
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 14 May 2024 11:45:05 +0200

> On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>
>> Ken reported that RTL8125b can lock up if gro_flush_timeout has the
>> default value of 20000 and napi_defer_hard_irqs is set to 0.
>> In this scenario device interrupts aren't disabled, what seems to
>> trigger some silicon bug under heavy load. I was able to reproduce this
>> behavior on RTL8168h.
>> Disabling device interrupts if NAPI is scheduled from a place other than
>> the driver's interrupt handler is a necessity in r8169, for other
>> drivers it may still be a performance optimization.
>>
>> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already")
>> Reported-by: Ken Milmore <ken.milmore@gmail.com>
>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>> ---
>>  drivers/net/ethernet/realtek/r8169_main.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>> index e5ea827a2..01f0ca53d 100644
>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>>  {
>>         struct rtl8169_private *tp = dev_instance;
>>         u32 status = rtl_get_events(tp);
>> +       int ret;
>>
>>         if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
>>                 return IRQ_NONE;
>> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>>                 rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
>>         }
>>
>> -       if (napi_schedule_prep(&tp->napi)) {
>> +       ret = __napi_schedule_prep(&tp->napi);
>> +       if (ret >= 0)
>>                 rtl_irq_disable(tp);
>> +       if (ret > 0)
>>                 __napi_schedule(&tp->napi);
>> -       }
>>  out:
>>         rtl_ack_events(tp, status);
>>
> 
> I do not understand this patch.
> 
> __napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set,
> but this should not happen under normal operations ?

Without this patch, napi_schedule_prep() returns false if it's either
scheduled already OR it's disabled. Drivers disable interrupts only if
it returns true, which means they don't do that if it's already scheduled.
With this patch, __napi_schedule_prep() returns -1 if it's disabled and
0 if it was already scheduled. Which means we can disable interrupts
when the result is >= 0, i.e. regardless if it was scheduled before the
call or within the call.

IIUC, this addresses such situations:

napi_schedule()		// we disabled interrupts
napi_poll()		// we polled < budget frames
napi_complete_done()	// reenable the interrupts, no repoll
  hrtimer_start()	// GRO flush is queued
    napi_schedule()
      napi_poll()	// GRO flush, BUT interrupts are enabled

On r8169, this seems to cause issues. On other drivers, it seems to be
okay, but with this new helper, you can save some cycles.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>

> 
> A simple revert would avoid adding yet another NAPI helper.

Thanks,
Olek
Eric Dumazet May 14, 2024, 11:05 a.m. UTC | #3
On Tue, May 14, 2024 at 12:53 PM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Tue, 14 May 2024 11:45:05 +0200
>
> > On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>
> >> Ken reported that RTL8125b can lock up if gro_flush_timeout has the
> >> default value of 20000 and napi_defer_hard_irqs is set to 0.
> >> In this scenario device interrupts aren't disabled, what seems to
> >> trigger some silicon bug under heavy load. I was able to reproduce this
> >> behavior on RTL8168h.
> >> Disabling device interrupts if NAPI is scheduled from a place other than
> >> the driver's interrupt handler is a necessity in r8169, for other
> >> drivers it may still be a performance optimization.
> >>
> >> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already")
> >> Reported-by: Ken Milmore <ken.milmore@gmail.com>
> >> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> >> ---
> >>  drivers/net/ethernet/realtek/r8169_main.c | 6 ++++--
> >>  1 file changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
> >> index e5ea827a2..01f0ca53d 100644
> >> --- a/drivers/net/ethernet/realtek/r8169_main.c
> >> +++ b/drivers/net/ethernet/realtek/r8169_main.c
> >> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
> >>  {
> >>         struct rtl8169_private *tp = dev_instance;
> >>         u32 status = rtl_get_events(tp);
> >> +       int ret;
> >>
> >>         if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
> >>                 return IRQ_NONE;
> >> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
> >>                 rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
> >>         }
> >>
> >> -       if (napi_schedule_prep(&tp->napi)) {
> >> +       ret = __napi_schedule_prep(&tp->napi);
> >> +       if (ret >= 0)
> >>                 rtl_irq_disable(tp);
> >> +       if (ret > 0)
> >>                 __napi_schedule(&tp->napi);
> >> -       }
> >>  out:
> >>         rtl_ack_events(tp, status);
> >>
> >
> > I do not understand this patch.
> >
> > __napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set,
> > but this should not happen under normal operations ?
>
> Without this patch, napi_schedule_prep() returns false if it's either
> scheduled already OR it's disabled. Drivers disable interrupts only if
> it returns true, which means they don't do that if it's already scheduled.
> With this patch, __napi_schedule_prep() returns -1 if it's disabled and
> 0 if it was already scheduled. Which means we can disable interrupts
> when the result is >= 0, i.e. regardless if it was scheduled before the
> call or within the call.
>
> IIUC, this addresses such situations:
>
> napi_schedule()         // we disabled interrupts
> napi_poll()             // we polled < budget frames
> napi_complete_done()    // reenable the interrupts, no repoll
>   hrtimer_start()       // GRO flush is queued
>     napi_schedule()
>       napi_poll()       // GRO flush, BUT interrupts are enabled
>
> On r8169, this seems to cause issues. On other drivers, it seems to be
> okay, but with this new helper, you can save some cycles.
>
> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Rephrasing the changelog is not really helping.

Consider myself as a network maintainer, not as a casual patch reviewer.

"This seems to cause issues" is rather weak.

I would simply revert the faulty commit, because the interrupts are
going to be disabled no matter what.

Old logic was very simple and rock solid. A revert is a clear stable candidate.

rtl_irq_disable(tp);
napi_schedule(&tp->napi);

If this is still broken, we might have similar issues in old/legacy drivers.
Alexander Lobakin May 14, 2024, 11:17 a.m. UTC | #4
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 14 May 2024 13:05:55 +0200

> On Tue, May 14, 2024 at 12:53 PM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> From: Eric Dumazet <edumazet@google.com>
>> Date: Tue, 14 May 2024 11:45:05 +0200
>>
>>> On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>
>>>> Ken reported that RTL8125b can lock up if gro_flush_timeout has the
>>>> default value of 20000 and napi_defer_hard_irqs is set to 0.
>>>> In this scenario device interrupts aren't disabled, what seems to
>>>> trigger some silicon bug under heavy load. I was able to reproduce this
>>>> behavior on RTL8168h.
>>>> Disabling device interrupts if NAPI is scheduled from a place other than
>>>> the driver's interrupt handler is a necessity in r8169, for other
>>>> drivers it may still be a performance optimization.
>>>>
>>>> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already")
>>>> Reported-by: Ken Milmore <ken.milmore@gmail.com>
>>>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>>>> ---
>>>>  drivers/net/ethernet/realtek/r8169_main.c | 6 ++++--
>>>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>>>> index e5ea827a2..01f0ca53d 100644
>>>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>>>> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>>>>  {
>>>>         struct rtl8169_private *tp = dev_instance;
>>>>         u32 status = rtl_get_events(tp);
>>>> +       int ret;
>>>>
>>>>         if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
>>>>                 return IRQ_NONE;
>>>> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>>>>                 rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
>>>>         }
>>>>
>>>> -       if (napi_schedule_prep(&tp->napi)) {
>>>> +       ret = __napi_schedule_prep(&tp->napi);
>>>> +       if (ret >= 0)
>>>>                 rtl_irq_disable(tp);
>>>> +       if (ret > 0)
>>>>                 __napi_schedule(&tp->napi);
>>>> -       }
>>>>  out:
>>>>         rtl_ack_events(tp, status);
>>>>
>>>
>>> I do not understand this patch.
>>>
>>> __napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set,
>>> but this should not happen under normal operations ?
>>
>> Without this patch, napi_schedule_prep() returns false if it's either
>> scheduled already OR it's disabled. Drivers disable interrupts only if
>> it returns true, which means they don't do that if it's already scheduled.
>> With this patch, __napi_schedule_prep() returns -1 if it's disabled and
>> 0 if it was already scheduled. Which means we can disable interrupts
>> when the result is >= 0, i.e. regardless if it was scheduled before the
>> call or within the call.
>>
>> IIUC, this addresses such situations:
>>
>> napi_schedule()         // we disabled interrupts
>> napi_poll()             // we polled < budget frames
>> napi_complete_done()    // reenable the interrupts, no repoll
>>   hrtimer_start()       // GRO flush is queued
>>     napi_schedule()
>>       napi_poll()       // GRO flush, BUT interrupts are enabled
>>
>> On r8169, this seems to cause issues. On other drivers, it seems to be
>> okay, but with this new helper, you can save some cycles.
>>
>> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> 
> Rephrasing the changelog is not really helping.
> 
> Consider myself as a network maintainer, not as a casual patch reviewer.

And?

> 
> "This seems to cause issues" is rather weak.

It has "Reported-by", so it really causes issues.

> 
> I would simply revert the faulty commit, because the interrupts are
> going to be disabled no matter what.
> 
> Old logic was very simple and rock solid. A revert is a clear stable candidate.
> 
> rtl_irq_disable(tp);
> napi_schedule(&tp->napi);
> 
> If this is still broken, we might have similar issues in old/legacy drivers.

I might agree that we could just revert the mentioned commit for stable,
but for the next net-next, avoid unnecessary
scheduling/enabling/disabling interrupts makes sense, not only for
"old/legacy" drivers.
"Very simple and rock solid" is not an argument for avoiding improvements.

Thanks,
Olek
Heiner Kallweit May 14, 2024, 11:29 a.m. UTC | #5
On 14.05.2024 13:05, Eric Dumazet wrote:
> On Tue, May 14, 2024 at 12:53 PM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> From: Eric Dumazet <edumazet@google.com>
>> Date: Tue, 14 May 2024 11:45:05 +0200
>>
>>> On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>
>>>> Ken reported that RTL8125b can lock up if gro_flush_timeout has the
>>>> default value of 20000 and napi_defer_hard_irqs is set to 0.
>>>> In this scenario device interrupts aren't disabled, what seems to
>>>> trigger some silicon bug under heavy load. I was able to reproduce this
>>>> behavior on RTL8168h.
>>>> Disabling device interrupts if NAPI is scheduled from a place other than
>>>> the driver's interrupt handler is a necessity in r8169, for other
>>>> drivers it may still be a performance optimization.
>>>>
>>>> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already")
>>>> Reported-by: Ken Milmore <ken.milmore@gmail.com>
>>>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>>>> ---
>>>>  drivers/net/ethernet/realtek/r8169_main.c | 6 ++++--
>>>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>>>> index e5ea827a2..01f0ca53d 100644
>>>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>>>> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>>>>  {
>>>>         struct rtl8169_private *tp = dev_instance;
>>>>         u32 status = rtl_get_events(tp);
>>>> +       int ret;
>>>>
>>>>         if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
>>>>                 return IRQ_NONE;
>>>> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>>>>                 rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
>>>>         }
>>>>
>>>> -       if (napi_schedule_prep(&tp->napi)) {
>>>> +       ret = __napi_schedule_prep(&tp->napi);
>>>> +       if (ret >= 0)
>>>>                 rtl_irq_disable(tp);
>>>> +       if (ret > 0)
>>>>                 __napi_schedule(&tp->napi);
>>>> -       }
>>>>  out:
>>>>         rtl_ack_events(tp, status);
>>>>
>>>
>>> I do not understand this patch.
>>>
>>> __napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set,
>>> but this should not happen under normal operations ?
>>
>> Without this patch, napi_schedule_prep() returns false if it's either
>> scheduled already OR it's disabled. Drivers disable interrupts only if
>> it returns true, which means they don't do that if it's already scheduled.
>> With this patch, __napi_schedule_prep() returns -1 if it's disabled and
>> 0 if it was already scheduled. Which means we can disable interrupts
>> when the result is >= 0, i.e. regardless if it was scheduled before the
>> call or within the call.
>>
>> IIUC, this addresses such situations:
>>
>> napi_schedule()         // we disabled interrupts
>> napi_poll()             // we polled < budget frames
>> napi_complete_done()    // reenable the interrupts, no repoll
>>   hrtimer_start()       // GRO flush is queued
>>     napi_schedule()
>>       napi_poll()       // GRO flush, BUT interrupts are enabled
>>
>> On r8169, this seems to cause issues. On other drivers, it seems to be
>> okay, but with this new helper, you can save some cycles.
>>
>> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> 
> Rephrasing the changelog is not really helping.
> 
> Consider myself as a network maintainer, not as a casual patch reviewer.
> 
> "This seems to cause issues" is rather weak.
> 
> I would simply revert the faulty commit, because the interrupts are
> going to be disabled no matter what.
> 
> Old logic was very simple and rock solid. A revert is a clear stable candidate.
> 
> rtl_irq_disable(tp);
> napi_schedule(&tp->napi);
> 
Proposed new solution differs from the revert wrt how NAPIF_STATE_DISABLE
is handled. I'm not sure the old code would handle this corner case correctly.
It would disable device interrupts, and following napi_schedule() is a no-op.
Therefore device interrupts would remain disabled, and I think that's not
what we want.
When trying to solve this by adding an extra call to e.g. napi_disable_pending(),
then I have concerns this might be racy.

> If this is still broken, we might have similar issues in old/legacy drivers.
Jakub Kicinski May 14, 2024, 2:11 p.m. UTC | #6
On Tue, 14 May 2024 13:05:55 +0200 Eric Dumazet wrote:
> > napi_schedule()         // we disabled interrupts
> > napi_poll()             // we polled < budget frames
> > napi_complete_done()    // reenable the interrupts, no repoll
> >   hrtimer_start()       // GRO flush is queued
> >     napi_schedule()
> >       napi_poll()       // GRO flush, BUT interrupts are enabled

I thought the bug is because of a race with disable.
But there's already a synchronize_net() after disable, so NAPI poll
must fully exit before we mask in rtl8169_cleanup().

If the bug is double-enable you describe the fix is just making 
the race window smaller. But I don't think that's the bug.

BTW why are events only acked in rtl8169_interrupt() and not
rtl8169_poll()?
Eric Dumazet May 14, 2024, 2:27 p.m. UTC | #7
On Tue, May 14, 2024 at 1:18 PM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Tue, 14 May 2024 13:05:55 +0200
>
> > On Tue, May 14, 2024 at 12:53 PM Alexander Lobakin
> > <aleksander.lobakin@intel.com> wrote:
> >>
> >> From: Eric Dumazet <edumazet@google.com>
> >> Date: Tue, 14 May 2024 11:45:05 +0200
> >>
> >>> On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>
> >>>> Ken reported that RTL8125b can lock up if gro_flush_timeout has the
> >>>> default value of 20000 and napi_defer_hard_irqs is set to 0.
> >>>> In this scenario device interrupts aren't disabled, what seems to
> >>>> trigger some silicon bug under heavy load. I was able to reproduce this
> >>>> behavior on RTL8168h.
> >>>> Disabling device interrupts if NAPI is scheduled from a place other than
> >>>> the driver's interrupt handler is a necessity in r8169, for other
> >>>> drivers it may still be a performance optimization.
> >>>>
> >>>> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already")
> >>>> Reported-by: Ken Milmore <ken.milmore@gmail.com>
> >>>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> >>>> ---
> >>>>  drivers/net/ethernet/realtek/r8169_main.c | 6 ++++--
> >>>>  1 file changed, 4 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
> >>>> index e5ea827a2..01f0ca53d 100644
> >>>> --- a/drivers/net/ethernet/realtek/r8169_main.c
> >>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
> >>>> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
> >>>>  {
> >>>>         struct rtl8169_private *tp = dev_instance;
> >>>>         u32 status = rtl_get_events(tp);
> >>>> +       int ret;
> >>>>
> >>>>         if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
> >>>>                 return IRQ_NONE;
> >>>> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
> >>>>                 rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
> >>>>         }
> >>>>
> >>>> -       if (napi_schedule_prep(&tp->napi)) {
> >>>> +       ret = __napi_schedule_prep(&tp->napi);
> >>>> +       if (ret >= 0)
> >>>>                 rtl_irq_disable(tp);
> >>>> +       if (ret > 0)
> >>>>                 __napi_schedule(&tp->napi);
> >>>> -       }
> >>>>  out:
> >>>>         rtl_ack_events(tp, status);
> >>>>
> >>>
> >>> I do not understand this patch.
> >>>
> >>> __napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set,
> >>> but this should not happen under normal operations ?
> >>
> >> Without this patch, napi_schedule_prep() returns false if it's either
> >> scheduled already OR it's disabled. Drivers disable interrupts only if
> >> it returns true, which means they don't do that if it's already scheduled.
> >> With this patch, __napi_schedule_prep() returns -1 if it's disabled and
> >> 0 if it was already scheduled. Which means we can disable interrupts
> >> when the result is >= 0, i.e. regardless if it was scheduled before the
> >> call or within the call.
> >>
> >> IIUC, this addresses such situations:
> >>
> >> napi_schedule()         // we disabled interrupts
> >> napi_poll()             // we polled < budget frames
> >> napi_complete_done()    // reenable the interrupts, no repoll
> >>   hrtimer_start()       // GRO flush is queued
> >>     napi_schedule()
> >>       napi_poll()       // GRO flush, BUT interrupts are enabled
> >>
> >> On r8169, this seems to cause issues. On other drivers, it seems to be
> >> okay, but with this new helper, you can save some cycles.
> >>
> >> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> >
> > Rephrasing the changelog is not really helping.
> >
> > Consider myself as a network maintainer, not as a casual patch reviewer.
>
> And?
>
> >
> > "This seems to cause issues" is rather weak.
>
> It has "Reported-by", so it really causes issues.

And ?

Revert ?

>
> >
> > I would simply revert the faulty commit, because the interrupts are
> > going to be disabled no matter what.
> >
> > Old logic was very simple and rock solid. A revert is a clear stable candidate.
> >
> > rtl_irq_disable(tp);
> > napi_schedule(&tp->napi);
> >
> > If this is still broken, we might have similar issues in old/legacy drivers.
>
> I might agree that we could just revert the mentioned commit for stable,
> but for the next net-next, avoid unnecessary
> scheduling/enabling/disabling interrupts makes sense, not only for
> "old/legacy" drivers.
> "Very simple and rock solid" is not an argument for avoiding improvements.

I explained that I failed to see the 'so called' improvement there.

You explained nothing really, just that you like some approach that I
think is for net-next.
Heiner Kallweit May 14, 2024, 4:35 p.m. UTC | #8
On 14.05.2024 16:11, Jakub Kicinski wrote:
> On Tue, 14 May 2024 13:05:55 +0200 Eric Dumazet wrote:
>>> napi_schedule()         // we disabled interrupts
>>> napi_poll()             // we polled < budget frames
>>> napi_complete_done()    // reenable the interrupts, no repoll
>>>   hrtimer_start()       // GRO flush is queued
>>>     napi_schedule()
>>>       napi_poll()       // GRO flush, BUT interrupts are enabled
> 
> I thought the bug is because of a race with disable.

No, the second napi_poll() in this scenario is executed with device
interrupts enabled, what triggers a (supposedly) hw bug under heavy
load. So the fix is to disable device interrupts also in the case
that NAPI is already scheduled when entering the interrupt handler.

> But there's already a synchronize_net() after disable, so NAPI poll
> must fully exit before we mask in rtl8169_cleanup().
> 
> If the bug is double-enable you describe the fix is just making 
> the race window smaller. But I don't think that's the bug.
> 
> BTW why are events only acked in rtl8169_interrupt() and not
> rtl8169_poll()? 

You mean clearing the rx/tx-related interrupt status bits only
after napi_complete_done(), as an alternative to disabling
device interrupts?
Jakub Kicinski May 14, 2024, 4:49 p.m. UTC | #9
On Tue, 14 May 2024 18:35:46 +0200 Heiner Kallweit wrote:
> > I thought the bug is because of a race with disable.  
> 
> No, the second napi_poll() in this scenario is executed with device
> interrupts enabled, what triggers a (supposedly) hw bug under heavy
> load. So the fix is to disable device interrupts also in the case
> that NAPI is already scheduled when entering the interrupt handler.
> 
> > But there's already a synchronize_net() after disable, so NAPI poll
> > must fully exit before we mask in rtl8169_cleanup().
> > 
> > If the bug is double-enable you describe the fix is just making 
> > the race window smaller. But I don't think that's the bug.
> > 
> > BTW why are events only acked in rtl8169_interrupt() and not
> > rtl8169_poll()?   
> 
> You mean clearing the rx/tx-related interrupt status bits only
> after napi_complete_done(), as an alternative to disabling
> device interrupts?

Before, basically ack them at the start of a poll function.
If gro_timeout / IRQ suppression is not enabled it won't make 
much of a difference. Probably also won't make much difference
with iperf.

But normally traffic is bursty so with gro_timeout we can see 
something like:

    packets: x x  x  x x   <  no more packets  >
IRQ pending: xxx  xxxxxxxxxxxxxxxxxxxxxx
        ISR:    []                      []
    IRQ ack:    x                       x
       NAPI:     [=====] < timeout > [=] [=] < timeout > [=]

Acking at the beginning of NAPI poll can't make us miss events 
but we'd clear the pending IRQ on the "deferred" NAPI run, avoiding 
an extra HW IRQ and 2 NAPI calls:

    packets: x x  x  x x   <  no more packets  >
IRQ pending: xxxx xxxxxxxxxxxxxxxxxxx
        ISR:    []                   
    IRQ ack:     x                   x
       NAPI:     [=====] < timeout > [=]
Heiner Kallweit May 14, 2024, 5:09 p.m. UTC | #10
On 14.05.2024 18:49, Jakub Kicinski wrote:
> On Tue, 14 May 2024 18:35:46 +0200 Heiner Kallweit wrote:
>>> I thought the bug is because of a race with disable.  
>>
>> No, the second napi_poll() in this scenario is executed with device
>> interrupts enabled, what triggers a (supposedly) hw bug under heavy
>> load. So the fix is to disable device interrupts also in the case
>> that NAPI is already scheduled when entering the interrupt handler.
>>
>>> But there's already a synchronize_net() after disable, so NAPI poll
>>> must fully exit before we mask in rtl8169_cleanup().
>>>
>>> If the bug is double-enable you describe the fix is just making 
>>> the race window smaller. But I don't think that's the bug.
>>>
>>> BTW why are events only acked in rtl8169_interrupt() and not
>>> rtl8169_poll()?   
>>
>> You mean clearing the rx/tx-related interrupt status bits only
>> after napi_complete_done(), as an alternative to disabling
>> device interrupts?
> 
> Before, basically ack them at the start of a poll function.
> If gro_timeout / IRQ suppression is not enabled it won't make 
> much of a difference. Probably also won't make much difference
> with iperf.
> 
> But normally traffic is bursty so with gro_timeout we can see 
> something like:
> 
>     packets: x x  x  x x   <  no more packets  >
> IRQ pending: xxx  xxxxxxxxxxxxxxxxxxxxxx
>         ISR:    []                      []
>     IRQ ack:    x                       x
>        NAPI:     [=====] < timeout > [=] [=] < timeout > [=]
> 
> Acking at the beginning of NAPI poll can't make us miss events 
> but we'd clear the pending IRQ on the "deferred" NAPI run, avoiding 
> an extra HW IRQ and 2 NAPI calls:
> 
>     packets: x x  x  x x   <  no more packets  >
> IRQ pending: xxxx xxxxxxxxxxxxxxxxxxx
>         ISR:    []                   
>     IRQ ack:     x                   x
>        NAPI:     [=====] < timeout > [=]

Thanks for the explanation. What is the benefit of acking interrupts
at the beginning of NAPI poll, compared to acking them after
napi_complete_done()?
If budget is exceeded and we know we're polled again, why ack
the interrupts in between?
I just tested with the defaults of gro_flush_timeout=20000 and
napi_defer_hardirqs=1, and iperf3 --bidir.
The difference is massive. When acking after napi_complete_done()
I see only a few hundred interrupts. Acking at the beginning of
NAPI poll it's few hundred thousand interrupts.
Jakub Kicinski May 14, 2024, 5:47 p.m. UTC | #11
On Tue, 14 May 2024 19:09:21 +0200 Heiner Kallweit wrote:
> Thanks for the explanation. What is the benefit of acking interrupts
> at the beginning of NAPI poll, compared to acking them after
> napi_complete_done()?
> If budget is exceeded and we know we're polled again, why ack
> the interrupts in between?

That's a fair point, the main concern for acking after processing
is that we will miss an event. If we ack before processing we can
occasionally take an unnecessary IRQ, but we'll never let a packet
rot on the ring because it arrived between processing packets and
acking the IRQ.
But you know the driver better, maybe there's a clean way of avoiding
the missed IRQs (not sure it would be worth the complexity, tho TBH).

> I just tested with the defaults of gro_flush_timeout=20000 and
> napi_defer_hardirqs=1, and iperf3 --bidir.
> The difference is massive. When acking after napi_complete_done()
> I see only a few hundred interrupts. Acking at the beginning of
> NAPI poll it's few hundred thousand interrupts.

That's quite odd. Maybe because rtl_tx() doesn't contribute to work
done? Maybe it'd be better to set work done to min(budget, !!tx, rx) ?

Or maybe the disabling is not working somehow?

napi_defer_hardirqs=1 should make us reschedule NAPI if there was _any_
work done. Meaning we'd enable NAPI only after a completely empty NAPI
run. On an empty NAPI run it should not matter whether we acked before
or after checking for packets, or so I'd naively think.
Jakub Kicinski May 14, 2024, 5:49 p.m. UTC | #12
On Tue, 14 May 2024 10:47:39 -0700 Jakub Kicinski wrote:
> enable NAPI 

enable IRQ
Ken Milmore May 14, 2024, 8:47 p.m. UTC | #13
It seems to me that these chips are known for being badly-documented and quirky,
so maybe an empirical approach is called for.

I have briefly surveyed various driver sources available from the vendor and
from the BSDs, and AFAICT they all follow the pattern of unconditionally
masking interrupts, then clearing the status bits, then processing the rings
and then re-enabling interrupts in that order. In this respect, the Linux
driver may have become an outlier in that it doesn't *always* mask interrupts
before acking them, and that it may process the rings while IRQs are unmasked.
I'm not saying that these are necessarily problems, but...

There are some differences in how the drivers work: the FreeBSD one masks
interrupts straight away but defers writing the status register to the bottom
half, the OpenBSD driver seems to do everything in the IRQ handler. These
drivers also tend to flip between using "hard IRQs" and the built-in timer,
which complicates things. But in terms of the "mask first" approach, I think
they all look equivalent.

https://sources.debian.org/src/r8168/8.053.00-1/src/r8168_n.c/
https://sources.debian.org/src/r8125/9.011.00-4/src/r8125_n.c/
https://github.com/openbsd/src/blob/master/sys/dev/ic/re.c
https://github.com/openbsd/src/blob/master/sys/dev/pci/if_rge.c
https://cgit.freebsd.org/src/tree/sys/dev/re/if_re.c


I also found this ancient netdev thread which looks startlingly familiar to the
behaviour in the present issue. It seems that people have been here before...

https://lore.kernel.org/netdev/1242328457.32579.12.camel@lap75545.ornl.gov/
"I added some code to print the irq status when it hangs, and it shows
0x0085, which is RxOK | TxOK | TxDescUnavail, which makes me think we've
lost an MSI-edge interrupt somehow."

https://lore.kernel.org/netdev/1243042174.3580.23.camel@obelisk.thedillows.org/
"The 8169 chip only generates MSI interrupts when all enabled event
sources are quiescent and one or more sources transition to active. If
not all of the active events are acknowledged, or a new event becomes
active while the existing ones are cleared in the handler, we will not
see a new interrupt."
Heiner Kallweit May 15, 2024, 5:53 a.m. UTC | #14
On 14.05.2024 11:45, Eric Dumazet wrote:
> On Tue, May 14, 2024 at 8:52 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>
>> Ken reported that RTL8125b can lock up if gro_flush_timeout has the
>> default value of 20000 and napi_defer_hard_irqs is set to 0.
>> In this scenario device interrupts aren't disabled, what seems to
>> trigger some silicon bug under heavy load. I was able to reproduce this
>> behavior on RTL8168h.
>> Disabling device interrupts if NAPI is scheduled from a place other than
>> the driver's interrupt handler is a necessity in r8169, for other
>> drivers it may still be a performance optimization.
>>
>> Fixes: 7274c4147afb ("r8169: don't try to disable interrupts if NAPI is scheduled already")
>> Reported-by: Ken Milmore <ken.milmore@gmail.com>
>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>> ---
>>  drivers/net/ethernet/realtek/r8169_main.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>> index e5ea827a2..01f0ca53d 100644
>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>>  {
>>         struct rtl8169_private *tp = dev_instance;
>>         u32 status = rtl_get_events(tp);
>> +       int ret;
>>
>>         if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
>>                 return IRQ_NONE;
>> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>>                 rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
>>         }
>>
>> -       if (napi_schedule_prep(&tp->napi)) {
>> +       ret = __napi_schedule_prep(&tp->napi);
>> +       if (ret >= 0)
>>                 rtl_irq_disable(tp);
>> +       if (ret > 0)
>>                 __napi_schedule(&tp->napi);
>> -       }
>>  out:
>>         rtl_ack_events(tp, status);
>>
> 
> I do not understand this patch.
> 
> __napi_schedule_prep() would only return -1 if NAPIF_STATE_DISABLE was set,
> but this should not happen under normal operations ?
> 
> A simple revert would avoid adding yet another NAPI helper.

Fine with me, I'll go with the revert for now.
diff mbox series

Patch

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index e5ea827a2..01f0ca53d 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4639,6 +4639,7 @@  static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 {
 	struct rtl8169_private *tp = dev_instance;
 	u32 status = rtl_get_events(tp);
+	int ret;
 
 	if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
 		return IRQ_NONE;
@@ -4657,10 +4658,11 @@  static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 		rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
 	}
 
-	if (napi_schedule_prep(&tp->napi)) {
+	ret = __napi_schedule_prep(&tp->napi);
+	if (ret >= 0)
 		rtl_irq_disable(tp);
+	if (ret > 0)
 		__napi_schedule(&tp->napi);
-	}
 out:
 	rtl_ack_events(tp, status);