diff mbox series

[net,v3] net: ethtool: do runtime PM outside RTNL

Message ID 20231206113934.8d7819857574.I2deb5804ef1739a2af307283d320ef7d82456494@changeid (mailing list archive)
State Deferred
Delegated to: Netdev Maintainers
Headers show
Series [net,v3] net: ethtool: do runtime PM outside RTNL | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success SINGLE THREAD; Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1117 this patch: 1117
netdev/cc_maintainers warning 4 maintainers not CCed: kuba@kernel.org edumazet@google.com d-tatianin@yandex-team.ru pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 1143 this patch: 1143
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1144 this patch: 1144
netdev/checkpatch warning WARNING: line length of 89 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Johannes Berg Dec. 6, 2023, 10:39 a.m. UTC
From: Johannes Berg <johannes.berg@intel.com>

As reported by Marc MERLIN, at least one driver (igc) wants or
needs to acquire the RTNL inside suspend/resume ops, which can
be called from here in ethtool if runtime PM is enabled.

Allow this by doing runtime PM transitions without the RTNL
held. For the ioctl to have the same operations order, this
required reworking the code to separately check validity and
do the operation. For the netlink code, this now has to do
the runtime_pm_put a bit later.

Reported-by: Marc MERLIN <marc@merlins.org>
Fixes: f32a21376573 ("ethtool: runtime-resume netdev parent before ethtool ioctl ops")
Fixes: d43c65b05b84 ("ethtool: runtime-resume netdev parent in ethnl_ops_begin")
Closes: https://lore.kernel.org/r/20231202221402.GA11155@merlins.org
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
---
v2:
 - add tags
 - use netdev_get_by_name()/netdev_put() in ioctl path
v3:
 - drop if(dev) before netdev_put() [Przemek Kitszel]
 - add Reviewed-by tag
---
 net/ethtool/ioctl.c   | 71 ++++++++++++++++++++++++++-----------------
 net/ethtool/netlink.c | 32 ++++++++-----------
 2 files changed, 56 insertions(+), 47 deletions(-)

Comments

Jakub Kicinski Dec. 6, 2023, 4:44 p.m. UTC | #1
On Wed,  6 Dec 2023 11:39:32 +0100 Johannes Berg wrote:
> As reported by Marc MERLIN, at least one driver (igc) wants or
> needs to acquire the RTNL inside suspend/resume ops, which can
> be called from here in ethtool if runtime PM is enabled.
> 
> Allow this by doing runtime PM transitions without the RTNL
> held. For the ioctl to have the same operations order, this
> required reworking the code to separately check validity and
> do the operation. For the netlink code, this now has to do
> the runtime_pm_put a bit later.

I was really, really hoping that this would serve as a motivation
for Intel to sort out the igb/igc implementation. The flow AFAICT
is ndo_open() starts the NIC, the calls pm_sus, which shuts the NIC
back down immediately (!?) then it schedules a link check from a work
queue, which opens it again (!?). It's a source of never ending bugs.

nit: please don't repost within 24h on netdev:
https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#tl-dr
Johannes Berg Dec. 6, 2023, 4:46 p.m. UTC | #2
On Wed, 2023-12-06 at 08:44 -0800, Jakub Kicinski wrote:
> On Wed,  6 Dec 2023 11:39:32 +0100 Johannes Berg wrote:
> > As reported by Marc MERLIN, at least one driver (igc) wants or
> > needs to acquire the RTNL inside suspend/resume ops, which can
> > be called from here in ethtool if runtime PM is enabled.
> > 
> > Allow this by doing runtime PM transitions without the RTNL
> > held. For the ioctl to have the same operations order, this
> > required reworking the code to separately check validity and
> > do the operation. For the netlink code, this now has to do
> > the runtime_pm_put a bit later.
> 
> I was really, really hoping that this would serve as a motivation
> for Intel to sort out the igb/igc implementation. The flow AFAICT
> is ndo_open() starts the NIC, the calls pm_sus, which shuts the NIC
> back down immediately (!?) then it schedules a link check from a work
> queue, which opens it again (!?). It's a source of never ending bugs.
> 

Well, I work there, but ... WiFi something else entirely. Marc just got
lucky I spotted an issue in the logs ;-)

I'll let you guys take it from here ...

johannes
Marc MERLIN Dec. 6, 2023, 9:39 p.m. UTC | #3
On Wed, Dec 06, 2023 at 05:46:07PM +0100, Johannes Berg wrote:
> On Wed, 2023-12-06 at 08:44 -0800, Jakub Kicinski wrote:
> > On Wed,  6 Dec 2023 11:39:32 +0100 Johannes Berg wrote:
> > > As reported by Marc MERLIN, at least one driver (igc) wants or
> > > needs to acquire the RTNL inside suspend/resume ops, which can
> > > be called from here in ethtool if runtime PM is enabled.
> > > 
> > > Allow this by doing runtime PM transitions without the RTNL
> > > held. For the ioctl to have the same operations order, this
> > > required reworking the code to separately check validity and
> > > do the operation. For the netlink code, this now has to do
> > > the runtime_pm_put a bit later.
> > 
> > I was really, really hoping that this would serve as a motivation
> > for Intel to sort out the igb/igc implementation. The flow AFAICT
> > is ndo_open() starts the NIC, the calls pm_sus, which shuts the NIC
> > back down immediately (!?) then it schedules a link check from a work
> > queue, which opens it again (!?). It's a source of never ending bugs.
> 
> Well, I work there, but ... WiFi something else entirely. Marc just got
> lucky I spotted an issue in the logs ;-)
 
and I am truly thankful for the time you took to help out. It made a
huge difference for me in being able to keep a laptop that I will
probably use for the next 4 years and otherwise had to return.
Thank you for going above and beyond.

> I'll let you guys take it from here ...

As this time the first patch I got still works for me even if it's not
the most ideal way to fix this.
As you all figure out what's the best/better patch, please let me know
if you'd like me to validate/retest.

Thanks,
Marc
Przemek Kitszel Dec. 7, 2023, 10:16 a.m. UTC | #4
On 12/6/23 17:46, Johannes Berg wrote:
> On Wed, 2023-12-06 at 08:44 -0800, Jakub Kicinski wrote:
>> On Wed,  6 Dec 2023 11:39:32 +0100 Johannes Berg wrote:
>>> As reported by Marc MERLIN, at least one driver (igc) wants or
>>> needs to acquire the RTNL inside suspend/resume ops, which can
>>> be called from here in ethtool if runtime PM is enabled.
>>>
>>> Allow this by doing runtime PM transitions without the RTNL
>>> held. For the ioctl to have the same operations order, this
>>> required reworking the code to separately check validity and
>>> do the operation. For the netlink code, this now has to do
>>> the runtime_pm_put a bit later.
>>
>> I was really, really hoping that this would serve as a motivation
>> for Intel to sort out the igb/igc implementation. The flow AFAICT
>> is ndo_open() starts the NIC, the calls pm_sus, which shuts the NIC
>> back down immediately (!?) then it schedules a link check from a work
>> queue, which opens it again (!?). It's a source of never ending bugs.
>>
> 
> Well, I work there, but ... WiFi something else entirely. Marc just got
> lucky I spotted an issue in the logs ;-)
> 
> I'll let you guys take it from here ...
> 
> johannes
> 

I have let know our igc TL, architect, and anybody that could be
interested via cc: IWL.
And I'm happy that this could be done at relaxed pace thanks to Johannes
Jakub Kicinski Dec. 7, 2023, 5:40 p.m. UTC | #5
On Thu, 7 Dec 2023 11:16:10 +0100 Przemek Kitszel wrote:
> I have let know our igc TL, architect, and anybody that could be
> interested via cc: IWL. And I'm happy that this could be done at
> relaxed pace thanks to Johannes

I think you may be expecting us to take Johannes's patch.
It's still on the table, but to make things clear -
upstream we prefer to wait for the "real fix", so if we agree
that fixing igb/igc is a better way (as Heiner pointed out on previous
version PM functions are called by the stack under rtnl elsewhere too,
just not while device is open) - we'll wait for that. Especially
that I'm 80% I complained about the PM in those drivers in
the past and nobody seemed to care. It's a constant source of rtnl
deadlocks.
Marc MERLIN Dec. 11, 2023, 4:52 a.m. UTC | #6
On Thu, Dec 07, 2023 at 09:40:21AM -0800, Jakub Kicinski wrote:
> On Thu, 7 Dec 2023 11:16:10 +0100 Przemek Kitszel wrote:
> > I have let know our igc TL, architect, and anybody that could be
> > interested via cc: IWL. And I'm happy that this could be done at
> > relaxed pace thanks to Johannes
> 
> I think you may be expecting us to take Johannes's patch.
> It's still on the table, but to make things clear -
> upstream we prefer to wait for the "real fix", so if we agree
> that fixing igb/igc is a better way (as Heiner pointed out on previous
> version PM functions are called by the stack under rtnl elsewhere too,
> just not while device is open) - we'll wait for that. Especially
> that I'm 80% I complained about the PM in those drivers in
> the past and nobody seemed to care. It's a constant source of rtnl
> deadlocks.

For whatever it's worth, I want to be clear that all stock kernels
are 100% unusable on lenovo P17gen2 because of this deadlock and that
without the temporary patch, my laptop would be usuable.
It was also a risk of data loss due to repeated deadlocks and unclean
shutdowns.

I cannot say what the correct fix is, but I am definitely hoping you
will accept some solution for the next stable kernel.

Thank you
Marc
Heiner Kallweit Dec. 15, 2023, 1:42 p.m. UTC | #7
On 11.12.2023 05:52, Marc MERLIN wrote:
> On Thu, Dec 07, 2023 at 09:40:21AM -0800, Jakub Kicinski wrote:
>> On Thu, 7 Dec 2023 11:16:10 +0100 Przemek Kitszel wrote:
>>> I have let know our igc TL, architect, and anybody that could be
>>> interested via cc: IWL. And I'm happy that this could be done at
>>> relaxed pace thanks to Johannes
>>
>> I think you may be expecting us to take Johannes's patch.
>> It's still on the table, but to make things clear -
>> upstream we prefer to wait for the "real fix", so if we agree
>> that fixing igb/igc is a better way (as Heiner pointed out on previous
>> version PM functions are called by the stack under rtnl elsewhere too,
>> just not while device is open) - we'll wait for that. Especially
>> that I'm 80% I complained about the PM in those drivers in
>> the past and nobody seemed to care. It's a constant source of rtnl
>> deadlocks.
> 
> For whatever it's worth, I want to be clear that all stock kernels
> are 100% unusable on lenovo P17gen2 because of this deadlock and that
> without the temporary patch, my laptop would be usuable.
> It was also a risk of data loss due to repeated deadlocks and unclean
> shutdowns.
> 
Why don't you simply disable runtime pm for the affected device as a
workaround? This can be done via sysfs.

> I cannot say what the correct fix is, but I am definitely hoping you
> will accept some solution for the next stable kernel.
> 
> Thank you
> Marc
Marc MERLIN Dec. 15, 2023, 5:46 p.m. UTC | #8
On Fri, Dec 15, 2023 at 02:42:01PM +0100, Heiner Kallweit wrote:
> Why don't you simply disable runtime pm for the affected device as a
> workaround? This can be done via sysfs.

1) because I didn't know what the exact bug was and how to work around it :)
2) without power management, the battery use is not good, but yes not
good is better than laptop crashing :)

That said, if it's only affecting wired ethernet, I can also unload the
igc module until I actually need wired ethernet, which is sometimes but
not often.

Thanks for your suggestion
Marc
Marc MERLIN Dec. 24, 2023, 4:30 p.m. UTC | #9
On Fri, Dec 15, 2023 at 09:46:34AM -0800, Marc MERLIN wrote:
> On Fri, Dec 15, 2023 at 02:42:01PM +0100, Heiner Kallweit wrote:
> > Why don't you simply disable runtime pm for the affected device as a
> > workaround? This can be done via sysfs.
> 
> 1) because I didn't know what the exact bug was and how to work around it :)

Mmmh, so I need to test an ubuntu kernel (6.5.0-14) because of sound
issues in mainline TOT, and I can't boot the kernel to completion
without hititng this hang bug. I'm not exactly sure which part of the
boot triggers it.

I can't patch that kernel easily. How exactly do I disable runtime PM
from the kernel command line for "that device" which I'm not even sure
which one it is.  If it's the eth device, I already removed the igc
module to prevent it from loading, and I also removed the ethtool
binary, but I'm still getting the hang.

On the plus side, with 6.6.8 and the old patch which I understand is not
the ideal solution, I can confirm that I've been running problem free
until now, so thanks again for that interim patch.

Thanks,
Marc
Heiner Kallweit Dec. 24, 2023, 11:12 p.m. UTC | #10
On 24.12.2023 17:30, Marc MERLIN wrote:
> On Fri, Dec 15, 2023 at 09:46:34AM -0800, Marc MERLIN wrote:
>> On Fri, Dec 15, 2023 at 02:42:01PM +0100, Heiner Kallweit wrote:
>>> Why don't you simply disable runtime pm for the affected device as a
>>> workaround? This can be done via sysfs.
>>
>> 1) because I didn't know what the exact bug was and how to work around it :)
> 
> Mmmh, so I need to test an ubuntu kernel (6.5.0-14) because of sound
> issues in mainline TOT, and I can't boot the kernel to completion
> without hititng this hang bug. I'm not exactly sure which part of the
> boot triggers it.
> 
> I can't patch that kernel easily. How exactly do I disable runtime PM
> from the kernel command line for "that device" which I'm not even sure

Change <device>/power/control from "auto" to "on".

> which one it is.  If it's the eth device, I already removed the igc
> module to prevent it from loading, and I also removed the ethtool
> binary, but I'm still getting the hang.
> 
> On the plus side, with 6.6.8 and the old patch which I understand is not
> the ideal solution, I can confirm that I've been running problem free
> until now, so thanks again for that interim patch.
> 
> Thanks,
> Marc
Sasha Neftin Dec. 25, 2023, 8:03 a.m. UTC | #11
On 25/12/2023 1:12, Heiner Kallweit wrote:
> On 24.12.2023 17:30, Marc MERLIN wrote:
>> On Fri, Dec 15, 2023 at 09:46:34AM -0800, Marc MERLIN wrote:
>>> On Fri, Dec 15, 2023 at 02:42:01PM +0100, Heiner Kallweit wrote:
>>>> Why don't you simply disable runtime pm for the affected device as a
>>>> workaround? This can be done via sysfs.
>>>
>>> 1) because I didn't know what the exact bug was and how to work around it :)
>>
>> Mmmh, so I need to test an ubuntu kernel (6.5.0-14) because of sound
>> issues in mainline TOT, and I can't boot the kernel to completion
>> without hititng this hang bug. I'm not exactly sure which part of the
>> boot triggers it.
>>
>> I can't patch that kernel easily. How exactly do I disable runtime PM
>> from the kernel command line for "that device" which I'm not even sure
> 
> Change <device>/power/control from "auto" to "on".

Need to figure out your controller location in a file system via 
lspci/lspci -t and then change to "on"
For example: echo on > 
/sys/devices/pci0000\:00/0000\:00\:1c.0/0000\:ae\:00.0/power/control

We are starting to look at this problem, but I can't reproduce the 
problem on my machines yet.

> 
>> which one it is.  If it's the eth device, I already removed the igc
>> module to prevent it from loading, and I also removed the ethtool
>> binary, but I'm still getting the hang.
>>
>> On the plus side, with 6.6.8 and the old patch which I understand is not
>> the ideal solution, I can confirm that I've been running problem free
>> until now, so thanks again for that interim patch.
>>
>> Thanks,
>> Marc
> 
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
Marc MERLIN Dec. 25, 2023, 11:21 a.m. UTC | #12
On Mon, Dec 25, 2023 at 10:03:23AM +0200, Sasha Neftin wrote:
> > > I can't patch that kernel easily. How exactly do I disable runtime PM
> > > from the kernel command line for "that device" which I'm not even sure
> > 
> > Change <device>/power/control from "auto" to "on".
> 
> Need to figure out your controller location in a file system via lspci/lspci
> -t and then change to "on"
> For example: echo on >
> /sys/devices/pci0000\:00/0000\:00\:1c.0/0000\:ae\:00.0/power/control
> 
> We are starting to look at this problem, but I can't reproduce the problem
> on my machines yet.

Thanks. I realized it was going to be hard either way if the boot hangs
before I get to a command prompt, which was what was happening
yesterday.
I had to boot ubuntu to debug a sound issue, and it was very tricky
since most of the time it hung before I got to a command prompt, but I
was finally able to get it to work long enough to debug the sound issue
and go back to my self built kernel to port over the sound config I
needed.

I wish I could tell you exactly how to reproduct this in a more useful
way, sorry about that.

Marc
Stanislaw Gruszka Jan. 3, 2024, 10:30 a.m. UTC | #13
On Wed, Dec 06, 2023 at 08:44:48AM -0800, Jakub Kicinski wrote:
> On Wed,  6 Dec 2023 11:39:32 +0100 Johannes Berg wrote:
> > As reported by Marc MERLIN, at least one driver (igc) wants or
> > needs to acquire the RTNL inside suspend/resume ops, which can
> > be called from here in ethtool if runtime PM is enabled.
> > 
> > Allow this by doing runtime PM transitions without the RTNL
> > held. For the ioctl to have the same operations order, this
> > required reworking the code to separately check validity and
> > do the operation. For the netlink code, this now has to do
> > the runtime_pm_put a bit later.
> 
> I was really, really hoping that this would serve as a motivation
> for Intel to sort out the igb/igc implementation. The flow AFAICT
> is ndo_open() starts the NIC, the calls pm_sus, which shuts the NIC
> back down immediately (!?) then it schedules a link check from a work

It's not like that. pm_runtime_put() in igc_open() does not disable device.
It calls runtime_idle callback which check if there is link and if is
not, schedule device suspend in 5 second, otherwise device stays running.

Work watchdog_task runs periodically and also check for link changes.

> queue, which opens it again (!?). It's a source of never ending bugs.

Maybe there are issues there and igc pm runtime implementation needs
improvements, with lockings or otherwise. Some folks are looking at this. 
But I think for this particular deadlock problem reverting of below commits
should be considered:

bd869245a3dc net: core: try to runtime-resume detached device in __dev_open
f32a21376573 ethtool: runtime-resume netdev parent before ethtool ioctl ops

First, the deadlock should be addressed also in older kernels and
refactoring is not really backportable fix.

Second, I don't think network stack should do any calls to pm_runtime* .
This should be fully device driver specific, as this depends on device
driver implementation of power saving. IMHO if it's desirable to 
resume disabled device on requests from user space, then
netif_device_detach() should not be used in runtime suspend.

Thoughts ?

Regards
Stanislaw
Heiner Kallweit Jan. 3, 2024, 11:24 a.m. UTC | #14
On 03.01.2024 11:30, Stanislaw Gruszka wrote:
> On Wed, Dec 06, 2023 at 08:44:48AM -0800, Jakub Kicinski wrote:
>> On Wed,  6 Dec 2023 11:39:32 +0100 Johannes Berg wrote:
>>> As reported by Marc MERLIN, at least one driver (igc) wants or
>>> needs to acquire the RTNL inside suspend/resume ops, which can
>>> be called from here in ethtool if runtime PM is enabled.
>>>
>>> Allow this by doing runtime PM transitions without the RTNL
>>> held. For the ioctl to have the same operations order, this
>>> required reworking the code to separately check validity and
>>> do the operation. For the netlink code, this now has to do
>>> the runtime_pm_put a bit later.
>>
>> I was really, really hoping that this would serve as a motivation
>> for Intel to sort out the igb/igc implementation. The flow AFAICT
>> is ndo_open() starts the NIC, the calls pm_sus, which shuts the NIC
>> back down immediately (!?) then it schedules a link check from a work
> 
> It's not like that. pm_runtime_put() in igc_open() does not disable device.
> It calls runtime_idle callback which check if there is link and if is
> not, schedule device suspend in 5 second, otherwise device stays running.
> 
> Work watchdog_task runs periodically and also check for link changes.
> 
>> queue, which opens it again (!?). It's a source of never ending bugs.
> 
> Maybe there are issues there and igc pm runtime implementation needs
> improvements, with lockings or otherwise. Some folks are looking at this. 
> But I think for this particular deadlock problem reverting of below commits
> should be considered:
> 
> bd869245a3dc net: core: try to runtime-resume detached device in __dev_open
> f32a21376573 ethtool: runtime-resume netdev parent before ethtool ioctl ops
> Reverting bd869245a3dc would break existing stuff.

> First, the deadlock should be addressed also in older kernels and
> refactoring is not really backportable fix.
> 
You could simply disable igc runtime pm on older kernel versions
if backporting a proper fix would be too cumbersome.

> Second, I don't think network stack should do any calls to pm_runtime* .

It's not unusual that subsystem core code deals with runtime pm.
E.g. see all the runtime pm calls in drivers/pci/pci.c
IMO it's exactly the purpose of the RPM API to encapsulate the
device-specific (r)pm features.

> This should be fully device driver specific, as this depends on device
> driver implementation of power saving. IMHO if it's desirable to 
> resume disabled device on requests from user space, then
> netif_device_detach() should not be used in runtime suspend.
> 
> Thoughts ?
> 
> Regards
> Stanislaw 
>
Stanislaw Gruszka Jan. 3, 2024, 12:15 p.m. UTC | #15
On Wed, Jan 03, 2024 at 12:24:18PM +0100, Heiner Kallweit wrote:
> On 03.01.2024 11:30, Stanislaw Gruszka wrote:
> > On Wed, Dec 06, 2023 at 08:44:48AM -0800, Jakub Kicinski wrote:
> >> On Wed,  6 Dec 2023 11:39:32 +0100 Johannes Berg wrote:
> >>> As reported by Marc MERLIN, at least one driver (igc) wants or
> >>> needs to acquire the RTNL inside suspend/resume ops, which can
> >>> be called from here in ethtool if runtime PM is enabled.
> >>>
> >>> Allow this by doing runtime PM transitions without the RTNL
> >>> held. For the ioctl to have the same operations order, this
> >>> required reworking the code to separately check validity and
> >>> do the operation. For the netlink code, this now has to do
> >>> the runtime_pm_put a bit later.
> >>
> >> I was really, really hoping that this would serve as a motivation
> >> for Intel to sort out the igb/igc implementation. The flow AFAICT
> >> is ndo_open() starts the NIC, the calls pm_sus, which shuts the NIC
> >> back down immediately (!?) then it schedules a link check from a work
> > 
> > It's not like that. pm_runtime_put() in igc_open() does not disable device.
> > It calls runtime_idle callback which check if there is link and if is
> > not, schedule device suspend in 5 second, otherwise device stays running.
> > 
> > Work watchdog_task runs periodically and also check for link changes.
> > 
> >> queue, which opens it again (!?). It's a source of never ending bugs.
> > 
> > Maybe there are issues there and igc pm runtime implementation needs
> > improvements, with lockings or otherwise. Some folks are looking at this. 
> > But I think for this particular deadlock problem reverting of below commits
> > should be considered:
> > 
> > bd869245a3dc net: core: try to runtime-resume detached device in __dev_open
> > f32a21376573 ethtool: runtime-resume netdev parent before ethtool ioctl ops
> > Reverting bd869245a3dc would break existing stuff.
> 
> > First, the deadlock should be addressed also in older kernels and
> > refactoring is not really backportable fix.
> > 
> You could simply disable igc runtime pm on older kernel versions
> if backporting a proper fix would be too cumbersome.

It would be better to have pm working on older kernels as it use to.

> > Second, I don't think network stack should do any calls to pm_runtime* .
> 
> It's not unusual that subsystem core code deals with runtime pm.
> E.g. see all the runtime pm calls in drivers/pci/pci.c
> IMO it's exactly the purpose of the RPM API to encapsulate the
> device-specific (r)pm features.

PCI is bus layer that control device probe/remove, suspend/resume, etc,
it has to do this. To make proper companion non-bus subsystem should be
used i.e. sound, drm,  bluetooth ...  all of those do not pm_runtime 
in core layer and leave that to drivers. One exception is block layer
with it's blk-pm.c , but that it's also more like library that is used
by the drivers.

Regards
Stanislaw
Jakub Kicinski Jan. 3, 2024, 11:34 p.m. UTC | #16
On Wed, 3 Jan 2024 11:30:17 +0100 Stanislaw Gruszka wrote:
> > I was really, really hoping that this would serve as a motivation
> > for Intel to sort out the igb/igc implementation. The flow AFAICT
> > is ndo_open() starts the NIC, the calls pm_sus, which shuts the NIC
> > back down immediately (!?) then it schedules a link check from a work  
> 
> It's not like that. pm_runtime_put() in igc_open() does not disable device.
> It calls runtime_idle callback which check if there is link and if is
> not, schedule device suspend in 5 second, otherwise device stays running.

Hm, I missed the 5 sec delay there. Next question for me is - how does
it not deadlock in the open?

igc_open()
  __igc_open(resuming=false)
    if (!resuming)
      pm_runtime_get_sync(&pdev->dev);

igc_resume()
  rtnl_lock()
Stanislaw Gruszka Jan. 4, 2024, 8:25 a.m. UTC | #17
On Wed, Jan 03, 2024 at 03:34:05PM -0800, Jakub Kicinski wrote:
> On Wed, 3 Jan 2024 11:30:17 +0100 Stanislaw Gruszka wrote:
> > > I was really, really hoping that this would serve as a motivation
> > > for Intel to sort out the igb/igc implementation. The flow AFAICT
> > > is ndo_open() starts the NIC, the calls pm_sus, which shuts the NIC
> > > back down immediately (!?) then it schedules a link check from a work  
> > 
> > It's not like that. pm_runtime_put() in igc_open() does not disable device.
> > It calls runtime_idle callback which check if there is link and if is
> > not, schedule device suspend in 5 second, otherwise device stays running.
> 
> Hm, I missed the 5 sec delay there. Next question for me is - how does
> it not deadlock in the open?
> 
> igc_open()
>   __igc_open(resuming=false)
>     if (!resuming)
>       pm_runtime_get_sync(&pdev->dev);
> 
> igc_resume()
>   rtnl_lock()

If device was not suspended, pm_runtime_get_sync() will increase
dev->power.usage_count counter and cancel pending rpm suspend
request if any. There is race condition though, more about that
below.

If device was suspended, we could not get to igc_open() since it
was marked as detached and fail netif_device_present() check in
__dev_open(). That was the behaviour before bd869245a3dc.

There is small race window between with igc_open() and scheduled
runtime suspend, if at the same time dev_open() is done and
dev->power.suspend_timer expire:

open:					pm_suspend_timer_fh:

rtnl_lock()
					rpm_suspend()
					  igc_runtime_suspend()
					   __igc_shutdown()
					     rtnl_lock()

__igc_open()
  pm_runtime_get_sync():
    waits for rpm suspend callback done

This needs to be addressed, but it's not that this can happen
all the time. To trigger this someone has to remove the
cable and exactly after 5 seconds do ip link set up. 

Regards
Stanislaw
Heiner Kallweit Jan. 4, 2024, 9:05 a.m. UTC | #18
On 04.01.2024 09:25, Stanislaw Gruszka wrote:
> On Wed, Jan 03, 2024 at 03:34:05PM -0800, Jakub Kicinski wrote:
>> On Wed, 3 Jan 2024 11:30:17 +0100 Stanislaw Gruszka wrote:
>>>> I was really, really hoping that this would serve as a motivation
>>>> for Intel to sort out the igb/igc implementation. The flow AFAICT
>>>> is ndo_open() starts the NIC, the calls pm_sus, which shuts the NIC
>>>> back down immediately (!?) then it schedules a link check from a work  
>>>
>>> It's not like that. pm_runtime_put() in igc_open() does not disable device.
>>> It calls runtime_idle callback which check if there is link and if is
>>> not, schedule device suspend in 5 second, otherwise device stays running.
>>
>> Hm, I missed the 5 sec delay there. Next question for me is - how does
>> it not deadlock in the open?
>>
>> igc_open()
>>   __igc_open(resuming=false)
>>     if (!resuming)
>>       pm_runtime_get_sync(&pdev->dev);
>>
>> igc_resume()
>>   rtnl_lock()
> 
> If device was not suspended, pm_runtime_get_sync() will increase
> dev->power.usage_count counter and cancel pending rpm suspend
> request if any. There is race condition though, more about that
> below.
> 
> If device was suspended, we could not get to igc_open() since it
> was marked as detached and fail netif_device_present() check in
> __dev_open(). That was the behaviour before bd869245a3dc.
> 
> There is small race window between with igc_open() and scheduled
> runtime suspend, if at the same time dev_open() is done and
> dev->power.suspend_timer expire:
> 
> open:					pm_suspend_timer_fh:
> 
> rtnl_lock()
> 					rpm_suspend()
> 					  igc_runtime_suspend()
> 					   __igc_shutdown()
> 					     rtnl_lock()
> 
> __igc_open()
>   pm_runtime_get_sync():
>     waits for rpm suspend callback done
> 
> This needs to be addressed, but it's not that this can happen
> all the time. To trigger this someone has to remove the
> cable and exactly after 5 seconds do ip link set up. 
> 
For me the main question is the following. In igc_resume() you have

	rtnl_lock();
	if (!err && netif_running(netdev))
		err = __igc_open(netdev, true);

	if (!err)
		netif_device_attach(netdev);
	rtnl_unlock();

Why is the global rtnl_lock() needed here? The netdev is in detached
state what protects from e.g. userspace activity, see all the
netif_device_present() checks in net core.

> Regards
> Stanislaw
Jakub Kicinski Jan. 4, 2024, 4:16 p.m. UTC | #19
On Thu, 4 Jan 2024 10:05:12 +0100 Heiner Kallweit wrote:
> > If device was not suspended, pm_runtime_get_sync() will increase
> > dev->power.usage_count counter and cancel pending rpm suspend
> > request if any. There is race condition though, more about that
> > below.
> > 
> > If device was suspended, we could not get to igc_open() since it
> > was marked as detached and fail netif_device_present() check in
> > __dev_open(). That was the behaviour before bd869245a3dc.

__dev_open() tries to resume as well, and is also under rtnl_lock.
So that resume call somehow must never happen or users would see
-ENODEV? Sorry for the basic questions, the flow is confusing :S

> > There is small race window between with igc_open() and scheduled
> > runtime suspend, if at the same time dev_open() is done and
> > dev->power.suspend_timer expire:
> > 
> > open:					pm_suspend_timer_fh:
> > 
> > rtnl_lock()
> > 					rpm_suspend()
> > 					  igc_runtime_suspend()
> > 					   __igc_shutdown()
> > 					     rtnl_lock()
> > 
> > __igc_open()
> >   pm_runtime_get_sync():
> >     waits for rpm suspend callback done
> > 
> > This needs to be addressed, but it's not that this can happen
> > all the time. To trigger this someone has to remove the
> > cable and exactly after 5 seconds do ip link set up. 

Or tries to up exactly 5 sec after probe?

> For me the main question is the following. In igc_resume() you have
> 
> 	rtnl_lock();
> 	if (!err && netif_running(netdev))
> 		err = __igc_open(netdev, true);
> 
> 	if (!err)
> 		netif_device_attach(netdev);
> 	rtnl_unlock();
> 
> Why is the global rtnl_lock() needed here? The netdev is in detached
> state what protects from e.g. userspace activity, see all the
> netif_device_present() checks in net core.

That'd assume there are no RPM calls outside networking in this driver.
Perhaps there aren't but that also sounds wobbly.
Stanislaw Gruszka Jan. 5, 2024, 10:34 a.m. UTC | #20
On Thu, Jan 04, 2024 at 10:05:12AM +0100, Heiner Kallweit wrote:
> On 04.01.2024 09:25, Stanislaw Gruszka wrote:
> For me the main question is the following. In igc_resume() you have
> 
> 	rtnl_lock();
> 	if (!err && netif_running(netdev))
> 		err = __igc_open(netdev, true);
> 
> 	if (!err)
> 		netif_device_attach(netdev);
> 	rtnl_unlock();
> 
> Why is the global rtnl_lock() needed here? The netdev is in detached
> state what protects from e.g. userspace activity, see all the
> netif_device_present() checks in net core.

Good question. Initially I thought that the lock can be removed
for the exact reason you wrote. I.e. the analogus change as you
did for igb could de done ( ac8c58f5b535 ).

But after more detailed examination I can see the need for lock.

__igc_open() calls at least one function that require rtnl_lock:
netif_set_real_num_rx_queues().

Despite using netif_device_attach() without the rtnl_lock at
various places it's not safe. After:

        if (!test_and_set_bit(__LINK_STATE_PRESENT, &dev->state) &&
            netif_running(dev)) {

the full link down can be performed without rtnl_lock, so we can do

                netif_tx_wake_all_queues(dev);
                __netdev_watchdog_up(dev);

during closing or after device is closed.

Just found those two. I think there could be more reasons.

Regards
Stanislaw
Stanislaw Gruszka Jan. 5, 2024, 11:53 a.m. UTC | #21
On Thu, Jan 04, 2024 at 08:16:56AM -0800, Jakub Kicinski wrote:
> On Thu, 4 Jan 2024 10:05:12 +0100 Heiner Kallweit wrote:
> > > If device was not suspended, pm_runtime_get_sync() will increase
> > > dev->power.usage_count counter and cancel pending rpm suspend
> > > request if any. There is race condition though, more about that
> > > below.
> > > 
> > > If device was suspended, we could not get to igc_open() since it
> > > was marked as detached and fail netif_device_present() check in
> > > __dev_open(). That was the behaviour before bd869245a3dc.
> 
> __dev_open() tries to resume as well, and is also under rtnl_lock.

This one is plain 100% deadlock for igc (and igb before ac8c58f5b535)
I'm opting for remove those rpm calls from __dev_open() and ethtool.

The only thing that prevent that deadlock to happen all the time,
is that rpm is disabled by default (for PCI devices). When pci driver
want to rpm be default enabled, it has to call pm_runtime_allow().
Otherwise user has to enable it by:

echo auto > /sys/bus/pci/devices/PCI_ID/power/control 

But this could be also done by some power saving user-space
software. This is most probable reason way Marc reported
that he can not boot his laptop due to this deadlock. 
Other unlikely possibility that for some reason rpm was enabled
by default, but it should not be for PCI since:
bb910a7040e9 ("PCI/PM Runtime: Make runtime PM of PCI devices
inactive by default")

> So that resume call somehow must never happen or users would see
> -ENODEV? Sorry for the basic questions, the flow is confusing :S

If we talk about situation before rpm calls were added to net core
(i.e. < 5.9) there was open/ethtool -ENODEV error when igc/igb
was runtime suspend due to netif_device_present() check fail.

That was by design, what for open the device and loose
energy if there is no cable and device can not be used anyway ?

> > > There is small race window between with igc_open() and scheduled
> > > runtime suspend, if at the same time dev_open() is done and
> > > dev->power.suspend_timer expire:
> > > 
> > > open:					pm_suspend_timer_fh:
> > > 
> > > rtnl_lock()
> > > 					rpm_suspend()
> > > 					  igc_runtime_suspend()
> > > 					   __igc_shutdown()
> > > 					     rtnl_lock()
> > > 
> > > __igc_open()
> > >   pm_runtime_get_sync():
> > >     waits for rpm suspend callback done
> > > 
> > > This needs to be addressed, but it's not that this can happen
> > > all the time. To trigger this someone has to remove the
> > > cable and exactly after 5 seconds do ip link set up. 
> 
> Or tries to up exactly 5 sec after probe?

Just after probe rpm is disabled, so 5 sec after enabling rpm
(with cable removed) or 5 sec after cable remove (with rpm enabled).

> > For me the main question is the following. In igc_resume() you have
> > 
> > 	rtnl_lock();
> > 	if (!err && netif_running(netdev))
> > 		err = __igc_open(netdev, true);
> > 
> > 	if (!err)
> > 		netif_device_attach(netdev);
> > 	rtnl_unlock();
> > 
> > Why is the global rtnl_lock() needed here? The netdev is in detached
> > state what protects from e.g. userspace activity, see all the
> > netif_device_present() checks in net core.
> 
> That'd assume there are no RPM calls outside networking in this driver.
> Perhaps there aren't but that also sounds wobbly.

They are in PCI layer. For example when disabling rpm (reverting auto in
power/control) by:

echo on > /sys/bus/pci/devices/PCI_ID/power/control 

Regards
Stanislaw
Jakub Kicinski Jan. 5, 2024, 3:30 p.m. UTC | #22
On Fri, 5 Jan 2024 12:53:42 +0100 Stanislaw Gruszka wrote:
> On Thu, Jan 04, 2024 at 08:16:56AM -0800, Jakub Kicinski wrote:
> > __dev_open() tries to resume as well, and is also under rtnl_lock.  
> 
> This one is plain 100% deadlock for igc (and igb before ac8c58f5b535)
> I'm opting for remove those rpm calls from __dev_open() and ethtool.

I don't know what gets powered down, exactly, in this device,
so I can't give you a concrete example. But usually there's
at least one ndo / ethtool callback which needs to resume
the device (and already holds rtnl_lock). Taking rtnl_lock
on the resume path is fundamentally broken. Removing the
rpm calls from the core is just going to lead to a whack-a-mole
of bugs in the drivers themselves.

IOW I look at the RPM calls in the core as a canary for people
doing the wrong thing :(

> > So that resume call somehow must never happen or users would see
> > -ENODEV? Sorry for the basic questions, the flow is confusing :S  
> 
> If we talk about situation before rpm calls were added to net core
> (i.e. < 5.9) there was open/ethtool -ENODEV error when igc/igb
> was runtime suspend due to netif_device_present() check fail.
> 
> That was by design, what for open the device and loose
> energy if there is no cable and device can not be used anyway ?

I think "link" means actual link up here, no? As opposed to no cable
plugged in. If I understand that right - the device would have to train
the link in DOWN state in order for the device to be opened?
That would be quite wasteful in terms of power.

Regardless, returning -ENODEV is really not how netdevs should behave.
That's what carrier reporting is for! :(
Stanislaw Gruszka Jan. 5, 2024, 4:29 p.m. UTC | #23
On Fri, Jan 05, 2024 at 07:30:01AM -0800, Jakub Kicinski wrote:
> On Fri, 5 Jan 2024 12:53:42 +0100 Stanislaw Gruszka wrote:
> > On Thu, Jan 04, 2024 at 08:16:56AM -0800, Jakub Kicinski wrote:
> > > __dev_open() tries to resume as well, and is also under rtnl_lock.  
> > 
> > This one is plain 100% deadlock for igc (and igb before ac8c58f5b535)
> > I'm opting for remove those rpm calls from __dev_open() and ethtool.
> 
> I don't know what gets powered down, exactly, in this device,
> so I can't give you a concrete example. But usually there's
> at least one ndo / ethtool callback which needs to resume
> the device (and already holds rtnl_lock). Taking rtnl_lock
> on the resume path is fundamentally broken.

I agree with that.

> Removing the
> rpm calls from the core is just going to lead to a whack-a-mole
> of bugs in the drivers themselves.
>
> IOW I look at the RPM calls in the core as a canary for people
> doing the wrong thing :(

Hmm, this one I don't understand, what other bugs could pop up
after reverting bd869245a3dcc and others that added rpm calls
for the net core?

> > > So that resume call somehow must never happen or users would see
> > > -ENODEV? Sorry for the basic questions, the flow is confusing :S  
> > 
> > If we talk about situation before rpm calls were added to net core
> > (i.e. < 5.9) there was open/ethtool -ENODEV error when igc/igb
> > was runtime suspend due to netif_device_present() check fail.
> > 
> > That was by design, what for open the device and loose
> > energy if there is no cable and device can not be used anyway ?
> 
> I think "link" means actual link up here, no? As opposed to no cable
> plugged in. If I understand that right - the device would have to train
> the link in DOWN state in order for the device to be opened?
> That would be quite wasteful in terms of power.

I ment no cable plugged. When igc device was runtime suspended, and
user connected the cable, user has to power device up via on > power/control
and then ip link set up.

> Regardless, returning -ENODEV is really not how netdevs should behave.
> That's what carrier reporting is for! :(

Ok, I can agrre with that. But I think this should be achived by not using
netif_device_detach() in rpm suspend, not by

        if (!netif_device_present(dev)) {
                /* may be detached because parent is runtime-suspended */
                if (dev->dev.parent)
                        pm_runtime_resume(dev->dev.parent);
                if (!netif_device_present(dev))
                        return -ENODEV;
        }

Regards
Stanislaw
Jakub Kicinski Jan. 6, 2024, 3:02 a.m. UTC | #24
On Fri, 5 Jan 2024 17:29:16 +0100 Stanislaw Gruszka wrote:
> > Removing the rpm calls from the core is just going to lead to a
> > whack-a-mole of bugs in the drivers themselves.
> >
> > IOW I look at the RPM calls in the core as a canary for people
> > doing the wrong thing :(  
> 
> Hmm, this one I don't understand, what other bugs could pop up
> after reverting bd869245a3dcc and others that added rpm calls
> for the net core?

IDK what igc powers down, but if there's any ndo or ethtool
callback which needs to access a register that requires power 
to be resumed - it will deadlock on rtnl exactly the same.

Looking at igc_ethtool I see:

static int igc_ethtool_begin(struct net_device *netdev)
{
	struct igc_adapter *adapter = netdev_priv(netdev);

	pm_runtime_get_sync(&adapter->pdev->dev);
	return 0;
}

static void igc_ethtool_complete(struct net_device *netdev)
{
	struct igc_adapter *adapter = netdev_priv(netdev);

	pm_runtime_put(&adapter->pdev->dev);
}

so unless we think that returning -ENODEV from all ethtool calls
when cable is not plugged in is okay - removing the PM resume
from the core doesn't buy us much :(
Stanislaw Gruszka Jan. 8, 2024, 11:18 a.m. UTC | #25
On Fri, Jan 05, 2024 at 07:02:18PM -0800, Jakub Kicinski wrote:
> On Fri, 5 Jan 2024 17:29:16 +0100 Stanislaw Gruszka wrote:
> > > Removing the rpm calls from the core is just going to lead to a
> > > whack-a-mole of bugs in the drivers themselves.
> > >
> > > IOW I look at the RPM calls in the core as a canary for people
> > > doing the wrong thing :(  
> > 
> > Hmm, this one I don't understand, what other bugs could pop up
> > after reverting bd869245a3dcc and others that added rpm calls
> > for the net core?
> 
> IDK what igc powers down,

From what I can tell basically everything, it's full shutdown.

> but if there's any ndo or ethtool
> callback which needs to access a register that requires power 
> to be resumed - it will deadlock on rtnl exactly the same.
> 
> Looking at igc_ethtool I see:
> 
> static int igc_ethtool_begin(struct net_device *netdev)
> {
> 	struct igc_adapter *adapter = netdev_priv(netdev);
> 
> 	pm_runtime_get_sync(&adapter->pdev->dev);
> 	return 0;
> }
> 
> static void igc_ethtool_complete(struct net_device *netdev)
> {
> 	struct igc_adapter *adapter = netdev_priv(netdev);
> 
> 	pm_runtime_put(&adapter->pdev->dev);
> }
> 
> so unless we think that returning -ENODEV from all ethtool calls
> when cable is not plugged in is okay - removing the PM resume
> from the core doesn't buy us much :(

It would address the regression in simple fix that can be send
to -stable. Event if -ENODEV for all ethtool ops and open is
not good, it's still better than deadlocking whole system.

I agree RPM for igc is not perfect and has issues that need
to be fix. People are working on it inspired by e1000e
implementation. Is should address the main requirement:
no rtnl_lock on resume path - waking up device when needed
on ndo/ethtool.

But that would not be simple fix AFICT, more likely it
will be reimplementation of the whole thing.

Additionally, in context of ethtool, previously each driver 
that implement RPM, woke up the device for actual HW access,
and don't when only memory was used. For example e1000e has
fine tuned ethtool ops. Some others like cadence/macb or 
renesas/sh_eth went event further and have their 
pm_runtime_resume_get_sync() in register access functions.

Now a hardware is powered up on every ethtool op regardless
of actual need.

So I think that the calls are only needed for some drivers, but
for others are detrimental. Would adding new netdev->priv_flags
for calling them be acceptable ?

Regards
Stanislaw
diff mbox series

Patch

diff --git a/net/ethtool/ioctl.c b/net/ethtool/ioctl.c
index a977f8903467..7c0e56985eb8 100644
--- a/net/ethtool/ioctl.c
+++ b/net/ethtool/ioctl.c
@@ -2768,26 +2768,18 @@  static int ethtool_set_fecparam(struct net_device *dev, void __user *useraddr)
 /* The main entry point in this file.  Called from net/core/dev_ioctl.c */
 
 static int
-__dev_ethtool(struct net *net, struct ifreq *ifr, void __user *useraddr,
-	      u32 ethcmd, struct ethtool_devlink_compat *devlink_state)
+__dev_ethtool_check(struct net_device *dev, void __user *useraddr,
+		    u32 ethcmd, u32 *sub_cmd)
 {
-	struct net_device *dev;
-	u32 sub_cmd;
-	int rc;
-	netdev_features_t old_features;
-
-	dev = __dev_get_by_name(net, ifr->ifr_name);
-	if (!dev)
-		return -ENODEV;
-
 	if (ethcmd == ETHTOOL_PERQUEUE) {
-		if (copy_from_user(&sub_cmd, useraddr + sizeof(ethcmd), sizeof(sub_cmd)))
+		if (copy_from_user(sub_cmd, useraddr + sizeof(ethcmd), sizeof(*sub_cmd)))
 			return -EFAULT;
 	} else {
-		sub_cmd = ethcmd;
+		*sub_cmd = ethcmd;
 	}
+
 	/* Allow some commands to be done by anyone */
-	switch (sub_cmd) {
+	switch (*sub_cmd) {
 	case ETHTOOL_GSET:
 	case ETHTOOL_GDRVINFO:
 	case ETHTOOL_GMSGLVL:
@@ -2826,22 +2818,28 @@  __dev_ethtool(struct net *net, struct ifreq *ifr, void __user *useraddr,
 	case ETHTOOL_GFECPARAM:
 		break;
 	default:
-		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 	}
 
-	if (dev->dev.parent)
-		pm_runtime_get_sync(dev->dev.parent);
+	return 0;
+}
 
-	if (!netif_device_present(dev)) {
-		rc = -ENODEV;
-		goto out;
-	}
+static int
+__dev_ethtool_do(struct net_device *dev, struct ifreq *ifr,
+		 void __user *useraddr, u32 ethcmd, u32 sub_cmd,
+		 struct ethtool_devlink_compat *devlink_state)
+{
+	netdev_features_t old_features;
+	int rc;
+
+	if (!netif_device_present(dev))
+		return -ENODEV;
 
 	if (dev->ethtool_ops->begin) {
 		rc = dev->ethtool_ops->begin(dev);
 		if (rc < 0)
-			goto out;
+			return rc;
 	}
 	old_features = dev->features;
 
@@ -3052,7 +3050,7 @@  __dev_ethtool(struct net *net, struct ifreq *ifr, void __user *useraddr,
 		rc = ethtool_set_fecparam(dev, useraddr);
 		break;
 	default:
-		rc = -EOPNOTSUPP;
+		return -EOPNOTSUPP;
 	}
 
 	if (dev->ethtool_ops->complete)
@@ -3060,9 +3058,6 @@  __dev_ethtool(struct net *net, struct ifreq *ifr, void __user *useraddr,
 
 	if (old_features != dev->features)
 		netdev_features_change(dev);
-out:
-	if (dev->dev.parent)
-		pm_runtime_put(dev->dev.parent);
 
 	return rc;
 }
@@ -3070,7 +3065,9 @@  __dev_ethtool(struct net *net, struct ifreq *ifr, void __user *useraddr,
 int dev_ethtool(struct net *net, struct ifreq *ifr, void __user *useraddr)
 {
 	struct ethtool_devlink_compat *state;
-	u32 ethcmd;
+	struct net_device *dev = NULL;
+	netdevice_tracker dev_tracker;
+	u32 ethcmd, subcmd;
 	int rc;
 
 	if (copy_from_user(&ethcmd, useraddr, sizeof(ethcmd)))
@@ -3090,9 +3087,26 @@  int dev_ethtool(struct net *net, struct ifreq *ifr, void __user *useraddr)
 		break;
 	}
 
+	dev = netdev_get_by_name(net, ifr->ifr_name, &dev_tracker, GFP_KERNEL);
+	if (!dev) {
+		rc = -ENODEV;
+		goto exit_free;
+	}
+
+	rc = __dev_ethtool_check(dev, useraddr, ethcmd, &subcmd);
+	if (rc)
+		goto exit_free;
+
+	if (dev->dev.parent)
+		pm_runtime_get_sync(dev->dev.parent);
+
 	rtnl_lock();
-	rc = __dev_ethtool(net, ifr, useraddr, ethcmd, state);
+	rc = __dev_ethtool_do(dev, ifr, useraddr, ethcmd, subcmd, state);
 	rtnl_unlock();
+
+	if (dev->dev.parent)
+		pm_runtime_put(dev->dev.parent);
+
 	if (rc)
 		goto exit_free;
 
@@ -3115,6 +3129,7 @@  int dev_ethtool(struct net *net, struct ifreq *ifr, void __user *useraddr)
 	}
 
 exit_free:
+	netdev_put(dev, &dev_tracker);
 	if (state->devlink)
 		devlink_put(state->devlink);
 	kfree(state);
diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c
index fe3553f60bf3..67e2dd893330 100644
--- a/net/ethtool/netlink.c
+++ b/net/ethtool/netlink.c
@@ -34,39 +34,23 @@  int ethnl_ops_begin(struct net_device *dev)
 {
 	int ret;
 
-	if (!dev)
-		return -ENODEV;
-
-	if (dev->dev.parent)
-		pm_runtime_get_sync(dev->dev.parent);
-
 	if (!netif_device_present(dev) ||
-	    dev->reg_state == NETREG_UNREGISTERING) {
-		ret = -ENODEV;
-		goto err;
-	}
+	    dev->reg_state == NETREG_UNREGISTERING)
+		return -ENODEV;
 
 	if (dev->ethtool_ops->begin) {
 		ret = dev->ethtool_ops->begin(dev);
 		if (ret)
-			goto err;
+			return ret;
 	}
 
 	return 0;
-err:
-	if (dev->dev.parent)
-		pm_runtime_put(dev->dev.parent);
-
-	return ret;
 }
 
 void ethnl_ops_complete(struct net_device *dev)
 {
 	if (dev->ethtool_ops->complete)
 		dev->ethtool_ops->complete(dev);
-
-	if (dev->dev.parent)
-		pm_runtime_put(dev->dev.parent);
 }
 
 /**
@@ -602,6 +586,14 @@  static int ethnl_default_set_doit(struct sk_buff *skb, struct genl_info *info)
 			goto out_dev;
 	}
 
+	if (!req_info.dev) {
+		ret = -ENODEV;
+		goto out_dev;
+	}
+
+	if (req_info.dev->dev.parent)
+		pm_runtime_get_sync(req_info.dev->dev.parent);
+
 	rtnl_lock();
 	ret = ethnl_ops_begin(req_info.dev);
 	if (ret < 0)
@@ -617,6 +609,8 @@  static int ethnl_default_set_doit(struct sk_buff *skb, struct genl_info *info)
 	ethnl_ops_complete(req_info.dev);
 out_rtnl:
 	rtnl_unlock();
+	if (req_info.dev->dev.parent)
+		pm_runtime_put(req_info.dev->dev.parent);
 out_dev:
 	ethnl_parse_header_dev_put(&req_info);
 	return ret;