diff mbox series

[RFC,iwl-net] e1000: Hold RTNL when e1000_down can be called

Message ID 20241022172153.217890-1-jdamato@fastly.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series [RFC,iwl-net] e1000: Hold RTNL when e1000_down can be called | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5 this patch: 5
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 8 of 8 maintainers
netdev/build_clang success Errors and warnings before: 3 this patch: 3
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes fail Problems with Fixes tag: 1
netdev/build_allmodconfig_warn success Errors and warnings before: 4 this patch: 4
netdev/checkpatch warning WARNING: Unknown commit id '8f7ff18a5ec7', maybe rebased or not pulled?
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 30 this patch: 30
netdev/source_inline success Was 0 now: 0

Commit Message

Joe Damato Oct. 22, 2024, 5:21 p.m. UTC
e1000_down calls netif_queue_set_napi, which assumes that RTNL is held.

There are a few paths for e1000_down to be called in e1000 where RTNL is
not currently being held:
  - e1000_shutdown (pci shutdown)
  - e1000_suspend (power management)
  - e1000_reinit_locked (via e1000_reset_task delayed work)

Hold RTNL in two places to fix this issue:
  - e1000_reset_task
  - __e1000_shutdown (which is called from both e1000_shutdown and
    e1000_suspend).

The other paths which call e1000_down seemingly hold RTNL and are OK:
  - e1000_close (ndo_stop)
  - e1000_change_mtu (ndo_change_mtu)

I'm submitting this is as an RFC because:
  - the e1000_reinit_locked issue appears very similar to commit
    21f857f0321d ("e1000e: add rtnl_lock() to e1000_reset_task"), which
    fixes a similar issue in e1000e

however

  - adding rtnl to e1000_reinit_locked seemingly conflicts with an
    earlier e1000 commit b2f963bfaeba ("e1000: fix lockdep warning in
    e1000_reset_task").

Hopefully Intel can weigh in and shed some light on the correct way to
go.

Fixes: 8f7ff18a5ec7 ("e1000: Link NAPI instances to queues and IRQs")
Reported-by: Dmitry Antipov <dmantipov@yandex.ru>
Closes: https://lore.kernel.org/netdev/8cf62307-1965-46a0-a411-ff0080090ff9@yandex.ru/
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c | 4 ++++
 1 file changed, 4 insertions(+)


base-commit: d811ac148f0afd2f3f7e1cd7f54de8da973ec5e3

Comments

Joe Damato Oct. 22, 2024, 8 p.m. UTC | #1
On Tue, Oct 22, 2024 at 05:21:53PM +0000, Joe Damato wrote:
> e1000_down calls netif_queue_set_napi, which assumes that RTNL is held.
> 
> There are a few paths for e1000_down to be called in e1000 where RTNL is
> not currently being held:
>   - e1000_shutdown (pci shutdown)
>   - e1000_suspend (power management)
>   - e1000_reinit_locked (via e1000_reset_task delayed work)
> 
> Hold RTNL in two places to fix this issue:
>   - e1000_reset_task
>   - __e1000_shutdown (which is called from both e1000_shutdown and
>     e1000_suspend).

It looks like there's one other spot I missed:

e1000_io_error_detected (pci error handler) which should also hold
rtnl_lock:

+       if (netif_running(netdev)) {
+               rtnl_lock();
                e1000_down(adapter);
+               rtnl_unlock();
+       }

I can send that update in the v2, but I'll wait to see if Intel has suggestions
on the below.
 
> The other paths which call e1000_down seemingly hold RTNL and are OK:
>   - e1000_close (ndo_stop)
>   - e1000_change_mtu (ndo_change_mtu)
> 
> I'm submitting this is as an RFC because:
>   - the e1000_reinit_locked issue appears very similar to commit
>     21f857f0321d ("e1000e: add rtnl_lock() to e1000_reset_task"), which
>     fixes a similar issue in e1000e
> 
> however
> 
>   - adding rtnl to e1000_reinit_locked seemingly conflicts with an
>     earlier e1000 commit b2f963bfaeba ("e1000: fix lockdep warning in
>     e1000_reset_task").
> 
> Hopefully Intel can weigh in and shed some light on the correct way to
> go.
> 
> Fixes: 8f7ff18a5ec7 ("e1000: Link NAPI instances to queues and IRQs")
> Reported-by: Dmitry Antipov <dmantipov@yandex.ru>
> Closes: https://lore.kernel.org/netdev/8cf62307-1965-46a0-a411-ff0080090ff9@yandex.ru/
> Signed-off-by: Joe Damato <jdamato@fastly.com>
Joe Damato Oct. 22, 2024, 9:12 p.m. UTC | #2
On Tue, Oct 22, 2024 at 01:00:47PM -0700, Joe Damato wrote:
> On Tue, Oct 22, 2024 at 05:21:53PM +0000, Joe Damato wrote:
> > e1000_down calls netif_queue_set_napi, which assumes that RTNL is held.
> > 
> > There are a few paths for e1000_down to be called in e1000 where RTNL is
> > not currently being held:
> >   - e1000_shutdown (pci shutdown)
> >   - e1000_suspend (power management)
> >   - e1000_reinit_locked (via e1000_reset_task delayed work)
> > 
> > Hold RTNL in two places to fix this issue:
> >   - e1000_reset_task
> >   - __e1000_shutdown (which is called from both e1000_shutdown and
> >     e1000_suspend).
> 
> It looks like there's one other spot I missed:
> 
> e1000_io_error_detected (pci error handler) which should also hold
> rtnl_lock:
> 
> +       if (netif_running(netdev)) {
> +               rtnl_lock();
>                 e1000_down(adapter);
> +               rtnl_unlock();
> +       }
> 
> I can send that update in the v2, but I'll wait to see if Intel has suggestions
> on the below.
>  
> > The other paths which call e1000_down seemingly hold RTNL and are OK:
> >   - e1000_close (ndo_stop)
> >   - e1000_change_mtu (ndo_change_mtu)
> > 
> > I'm submitting this is as an RFC because:
> >   - the e1000_reinit_locked issue appears very similar to commit
> >     21f857f0321d ("e1000e: add rtnl_lock() to e1000_reset_task"), which
> >     fixes a similar issue in e1000e
> > 
> > however
> > 
> >   - adding rtnl to e1000_reinit_locked seemingly conflicts with an
> >     earlier e1000 commit b2f963bfaeba ("e1000: fix lockdep warning in
> >     e1000_reset_task").
> > 
> > Hopefully Intel can weigh in and shed some light on the correct way to
> > go.

Regarding the above locations where rtnl_lock may need to be held,
comparing to other intel drivers:

  - e1000_reset_task: it appears that igc, igb, and e100e all hold
    rtnl_lock in their reset_task functions, so I think adding an
    rtnl_lock / rtnl_unlock to e1000_reset_task should be OK,
    despite the existence of commit b2f963bfaeba ("e1000: fix
    lockdep warning in e1000_reset_task").

  - e1000_io_error_detected:
      - e1000e temporarily obtains and drops rtnl in
        e1000e_pm_freeze
      - ixgbe holds rtnl in the same path (toward the bottom of
        ixgbe_io_error_detected)
      - igb does NOT hold rtnl in this path (as far as I can tell)
      - it was suggested in another thread to hold rtnl in this path
        for igc [1].
       
     Given that it will be added to igc and is held in this same
     path in e1000e and ixgbe, I think it is safe to add it for
     e1000, as well.

 - e1000_shutdown: 
   - igb holds rtnl in the same path,
   - e1000e temporarily holds it in this path (via
     e1000e_pm_freeze)
   - ixgbe holds rtnl in the same path

So based on the recommendation for igc [1], and the precedent set in
the other Intel drivers in most cases (except igb and the io_error
path), I think adding rtnl to all 3 locations described above is
correct.

Please let me know if you all agree. Thanks for reviewing this.

[1]: https://lore.kernel.org/netdev/40242f59-139a-4b45-8949-1210039f881b@intel.com/
Jacob Keller Oct. 22, 2024, 9:14 p.m. UTC | #3
On 10/22/2024 1:00 PM, Joe Damato wrote:
> On Tue, Oct 22, 2024 at 05:21:53PM +0000, Joe Damato wrote:
>> e1000_down calls netif_queue_set_napi, which assumes that RTNL is held.
>>
>> There are a few paths for e1000_down to be called in e1000 where RTNL is
>> not currently being held:
>>   - e1000_shutdown (pci shutdown)
>>   - e1000_suspend (power management)
>>   - e1000_reinit_locked (via e1000_reset_task delayed work)
>>
>> Hold RTNL in two places to fix this issue:
>>   - e1000_reset_task
>>   - __e1000_shutdown (which is called from both e1000_shutdown and
>>     e1000_suspend).
> 
> It looks like there's one other spot I missed:
> 
> e1000_io_error_detected (pci error handler) which should also hold
> rtnl_lock:
> 
> +       if (netif_running(netdev)) {
> +               rtnl_lock();
>                 e1000_down(adapter);
> +               rtnl_unlock();
> +       }
> 
> I can send that update in the v2, but I'll wait to see if Intel has suggestions
> on the below.
>  
>> The other paths which call e1000_down seemingly hold RTNL and are OK:
>>   - e1000_close (ndo_stop)
>>   - e1000_change_mtu (ndo_change_mtu)
>>
>> I'm submitting this is as an RFC because:
>>   - the e1000_reinit_locked issue appears very similar to commit
>>     21f857f0321d ("e1000e: add rtnl_lock() to e1000_reset_task"), which
>>     fixes a similar issue in e1000e
>>
>> however
>>
>>   - adding rtnl to e1000_reinit_locked seemingly conflicts with an
>>     earlier e1000 commit b2f963bfaeba ("e1000: fix lockdep warning in
>>     e1000_reset_task").
>>
>> Hopefully Intel can weigh in and shed some light on the correct way to
>> go.
>>

From my review, I think we need the RTNL lock around this function. The
deadlocks mentions in the fix lockdep patch appear to be due to having
an *extra* lock which could then cause issues.

>> Fixes: 8f7ff18a5ec7 ("e1000: Link NAPI instances to queues and IRQs")
>> Reported-by: Dmitry Antipov <dmantipov@yandex.ru>
>> Closes: https://lore.kernel.org/netdev/8cf62307-1965-46a0-a411-ff0080090ff9@yandex.ru/
>> Signed-off-by: Joe Damato <jdamato@fastly.com>
Jacob Keller Oct. 22, 2024, 9:15 p.m. UTC | #4
On 10/22/2024 2:12 PM, Joe Damato wrote:
> On Tue, Oct 22, 2024 at 01:00:47PM -0700, Joe Damato wrote:
>> On Tue, Oct 22, 2024 at 05:21:53PM +0000, Joe Damato wrote:
>>> e1000_down calls netif_queue_set_napi, which assumes that RTNL is held.
>>>
>>> There are a few paths for e1000_down to be called in e1000 where RTNL is
>>> not currently being held:
>>>   - e1000_shutdown (pci shutdown)
>>>   - e1000_suspend (power management)
>>>   - e1000_reinit_locked (via e1000_reset_task delayed work)
>>>
>>> Hold RTNL in two places to fix this issue:
>>>   - e1000_reset_task
>>>   - __e1000_shutdown (which is called from both e1000_shutdown and
>>>     e1000_suspend).
>>
>> It looks like there's one other spot I missed:
>>
>> e1000_io_error_detected (pci error handler) which should also hold
>> rtnl_lock:
>>
>> +       if (netif_running(netdev)) {
>> +               rtnl_lock();
>>                 e1000_down(adapter);
>> +               rtnl_unlock();
>> +       }
>>
>> I can send that update in the v2, but I'll wait to see if Intel has suggestions
>> on the below.
>>  
>>> The other paths which call e1000_down seemingly hold RTNL and are OK:
>>>   - e1000_close (ndo_stop)
>>>   - e1000_change_mtu (ndo_change_mtu)
>>>
>>> I'm submitting this is as an RFC because:
>>>   - the e1000_reinit_locked issue appears very similar to commit
>>>     21f857f0321d ("e1000e: add rtnl_lock() to e1000_reset_task"), which
>>>     fixes a similar issue in e1000e
>>>
>>> however
>>>
>>>   - adding rtnl to e1000_reinit_locked seemingly conflicts with an
>>>     earlier e1000 commit b2f963bfaeba ("e1000: fix lockdep warning in
>>>     e1000_reset_task").
>>>
>>> Hopefully Intel can weigh in and shed some light on the correct way to
>>> go.
> 
> Regarding the above locations where rtnl_lock may need to be held,
> comparing to other intel drivers:
> 
>   - e1000_reset_task: it appears that igc, igb, and e100e all hold
>     rtnl_lock in their reset_task functions, so I think adding an
>     rtnl_lock / rtnl_unlock to e1000_reset_task should be OK,
>     despite the existence of commit b2f963bfaeba ("e1000: fix
>     lockdep warning in e1000_reset_task").
> 
>   - e1000_io_error_detected:
>       - e1000e temporarily obtains and drops rtnl in
>         e1000e_pm_freeze
>       - ixgbe holds rtnl in the same path (toward the bottom of
>         ixgbe_io_error_detected)
>       - igb does NOT hold rtnl in this path (as far as I can tell)
>       - it was suggested in another thread to hold rtnl in this path
>         for igc [1].
>        
>      Given that it will be added to igc and is held in this same
>      path in e1000e and ixgbe, I think it is safe to add it for
>      e1000, as well.
> 
>  - e1000_shutdown: 
>    - igb holds rtnl in the same path,
>    - e1000e temporarily holds it in this path (via
>      e1000e_pm_freeze)
>    - ixgbe holds rtnl in the same path
> 
> So based on the recommendation for igc [1], and the precedent set in
> the other Intel drivers in most cases (except igb and the io_error
> path), I think adding rtnl to all 3 locations described above is
> correct.
> 
> Please let me know if you all agree. Thanks for reviewing this.
> 
> 
[1]:
https://lore.kernel.org/netdev/40242f59-139a-4b45-8949-1210039f881b@intel.com/

I agree with this assessment.
Joe Damato Oct. 22, 2024, 9:17 p.m. UTC | #5
On Tue, Oct 22, 2024 at 02:15:27PM -0700, Jacob Keller wrote:
> 
> 
> On 10/22/2024 2:12 PM, Joe Damato wrote:
> > On Tue, Oct 22, 2024 at 01:00:47PM -0700, Joe Damato wrote:
> >> On Tue, Oct 22, 2024 at 05:21:53PM +0000, Joe Damato wrote:
> >>> e1000_down calls netif_queue_set_napi, which assumes that RTNL is held.
> >>>
> >>> There are a few paths for e1000_down to be called in e1000 where RTNL is
> >>> not currently being held:
> >>>   - e1000_shutdown (pci shutdown)
> >>>   - e1000_suspend (power management)
> >>>   - e1000_reinit_locked (via e1000_reset_task delayed work)
> >>>
> >>> Hold RTNL in two places to fix this issue:
> >>>   - e1000_reset_task
> >>>   - __e1000_shutdown (which is called from both e1000_shutdown and
> >>>     e1000_suspend).
> >>
> >> It looks like there's one other spot I missed:
> >>
> >> e1000_io_error_detected (pci error handler) which should also hold
> >> rtnl_lock:
> >>
> >> +       if (netif_running(netdev)) {
> >> +               rtnl_lock();
> >>                 e1000_down(adapter);
> >> +               rtnl_unlock();
> >> +       }
> >>
> >> I can send that update in the v2, but I'll wait to see if Intel has suggestions
> >> on the below.
> >>  
> >>> The other paths which call e1000_down seemingly hold RTNL and are OK:
> >>>   - e1000_close (ndo_stop)
> >>>   - e1000_change_mtu (ndo_change_mtu)
> >>>
> >>> I'm submitting this is as an RFC because:
> >>>   - the e1000_reinit_locked issue appears very similar to commit
> >>>     21f857f0321d ("e1000e: add rtnl_lock() to e1000_reset_task"), which
> >>>     fixes a similar issue in e1000e
> >>>
> >>> however
> >>>
> >>>   - adding rtnl to e1000_reinit_locked seemingly conflicts with an
> >>>     earlier e1000 commit b2f963bfaeba ("e1000: fix lockdep warning in
> >>>     e1000_reset_task").
> >>>
> >>> Hopefully Intel can weigh in and shed some light on the correct way to
> >>> go.
> > 
> > Regarding the above locations where rtnl_lock may need to be held,
> > comparing to other intel drivers:
> > 
> >   - e1000_reset_task: it appears that igc, igb, and e100e all hold
> >     rtnl_lock in their reset_task functions, so I think adding an
> >     rtnl_lock / rtnl_unlock to e1000_reset_task should be OK,
> >     despite the existence of commit b2f963bfaeba ("e1000: fix
> >     lockdep warning in e1000_reset_task").
> > 
> >   - e1000_io_error_detected:
> >       - e1000e temporarily obtains and drops rtnl in
> >         e1000e_pm_freeze
> >       - ixgbe holds rtnl in the same path (toward the bottom of
> >         ixgbe_io_error_detected)
> >       - igb does NOT hold rtnl in this path (as far as I can tell)
> >       - it was suggested in another thread to hold rtnl in this path
> >         for igc [1].
> >        
> >      Given that it will be added to igc and is held in this same
> >      path in e1000e and ixgbe, I think it is safe to add it for
> >      e1000, as well.
> > 
> >  - e1000_shutdown: 
> >    - igb holds rtnl in the same path,
> >    - e1000e temporarily holds it in this path (via
> >      e1000e_pm_freeze)
> >    - ixgbe holds rtnl in the same path
> > 
> > So based on the recommendation for igc [1], and the precedent set in
> > the other Intel drivers in most cases (except igb and the io_error
> > path), I think adding rtnl to all 3 locations described above is
> > correct.
> > 
> > Please let me know if you all agree. Thanks for reviewing this.
> > 
> > 
> [1]:
> https://lore.kernel.org/netdev/40242f59-139a-4b45-8949-1210039f881b@intel.com/
> 
> I agree with this assessment.

Thanks for taking a look. I will send an official iwl-net PATCH with
these changes once the 24 hour timer has expired.
diff mbox series

Patch

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 4de9b156b2be..9ed99c75d59e 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -3509,7 +3509,9 @@  static void e1000_reset_task(struct work_struct *work)
 		container_of(work, struct e1000_adapter, reset_task);
 
 	e_err(drv, "Reset adapter\n");
+	rtnl_lock();
 	e1000_reinit_locked(adapter);
+	rtnl_unlock();
 }
 
 /**
@@ -5074,7 +5076,9 @@  static int __e1000_shutdown(struct pci_dev *pdev, bool *enable_wake)
 			usleep_range(10000, 20000);
 
 		WARN_ON(test_bit(__E1000_RESETTING, &adapter->flags));
+		rtnl_lock();
 		e1000_down(adapter);
+		rtnl_unlock();
 	}
 
 	status = er32(STATUS);