diff mbox series

[net] net: phy: Manual remove LEDs to ensure correct ordering

Message ID 20230617155500.4005881-1-andrew@lunn.ch (mailing list archive)
State Accepted
Commit c938ab4da0eb1620ae3243b0b24c572ddfc318fc
Delegated to: Netdev Maintainers
Headers show
Series [net] net: phy: Manual remove LEDs to ensure correct ordering | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8 this patch: 8
netdev/cc_maintainers fail 1 blamed authors not CCed: davem@davemloft.net; 4 maintainers not CCed: kuba@kernel.org davem@davemloft.net pabeni@redhat.com edumazet@google.com
netdev/build_clang success Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 39 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Andrew Lunn June 17, 2023, 3:55 p.m. UTC
If the core is left to remove the LEDs via devm_, it is performed too
late, after the PHY driver is removed from the PHY. This results in
dereferencing a NULL pointer when the LED core tries to turn the LED
off before destroying the LED.

Manually unregister the LEDs at a safe point in phy_remove.

Cc: stable@vger.kernel.org
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/phy/phy_device.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

patchwork-bot+netdevbpf@kernel.org June 18, 2023, 4:40 p.m. UTC | #1
Hello:

This patch was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:

On Sat, 17 Jun 2023 17:55:00 +0200 you wrote:
> If the core is left to remove the LEDs via devm_, it is performed too
> late, after the PHY driver is removed from the PHY. This results in
> dereferencing a NULL pointer when the LED core tries to turn the LED
> off before destroying the LED.
> 
> Manually unregister the LEDs at a safe point in phy_remove.
> 
> [...]

Here is the summary with links:
  - [net] net: phy: Manual remove LEDs to ensure correct ordering
    https://git.kernel.org/netdev/net/c/c938ab4da0eb

You are awesome, thank you!
Florian Fainelli June 21, 2023, 2:04 p.m. UTC | #2
Hi Andrew,

On 6/17/2023 4:55 PM, Andrew Lunn wrote:
> If the core is left to remove the LEDs via devm_, it is performed too
> late, after the PHY driver is removed from the PHY. This results in
> dereferencing a NULL pointer when the LED core tries to turn the LED
> off before destroying the LED.
> 
> Manually unregister the LEDs at a safe point in phy_remove.
> 
> Cc: stable@vger.kernel.org
> Reported-by: Florian Fainelli <f.fainelli@gmail.com>
> Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
> Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs")
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Thanks for fixing this, this is an improvement, though I can still hit 
another sort of use after free whereby the GENET driver removes the 
mdio-bcm-unimac platform device and eventually cuts the clock to the 
MDIO block thus causing the following:

# reboot -f
[   18.162000] bcmgenet 8f00000.ethernet eth0: Link is Down
[   18.305163] SError Interrupt on CPU2, code 0x00000000bf000002 -- SError
[   18.305170] GISB: target abort at 0x8f00e14 [R ], core: cpu_0
[   18.305180] CPU: 2 PID: 41 Comm: kworker/2:1 Not tainted 
6.4.0-rc5-next-20230607-gc7a93fa22690 #98
[   18.305187] Hardware name: BCM972180HB_V20 (DT)
[   18.305191] Workqueue: events set_brightness_delayed
[   18.305214] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS 
BTYPE=--)
[   18.305220] pc : el1_abort+0x30/0x5c
[   18.305230] lr : el1_abort+0x24/0x5c
[   18.305235] sp : ffffffc082b73a90
[   18.305236] x29: ffffffc082b73a90 x28: ffffff8002fad780 x27: 
0000000000000000
[   18.305243] x26: 0000000000000000 x25: 0000000000000000 x24: 
ffffff807dbb340d
[   18.305250] x23: 0000000060000005 x22: ffffffc08066d9ac x21: 
0000000096000210
[   18.305256] x20: ffffffc082b55e14 x19: ffffffc082b73ad0 x18: 
0000000000000000
[   18.305263] x17: 74656e2f74656e72 x16: 656874652e303030 x15: 
303066382f626472
[   18.305269] x14: ffffff8004a84cd8 x13: 6e69622f7273752f x12: 
0000000000000000
[   18.305275] x11: ffffff8002d1c710 x10: 0000000000000870 x9 : 
ffffffc080667e34
[   18.305282] x8 : ffffff8003d44a80 x7 : fefefefefefefeff x6 : 
000073746e657665
[   18.305288] x5 : ffffff8003d44a80 x4 : ffffffc082b73ad0 x3 : 
0000000000000025
[   18.305294] x2 : 000000000000001c x1 : 0000000004208060 x0 : 
0000000000000000
[   18.305303] Kernel panic - not syncing: Asynchronous SError Interrupt
[   18.305306] CPU: 2 PID: 41 Comm: kworker/2:1 Not tainted 
6.4.0-rc5-next-20230607-gc7a93fa22690 #98
[   18.305311] Hardware name: BCM972180HB_V20 (DT)
[   18.305314] Workqueue: events set_brightness_delayed
[   18.305319] Call trace:
[   18.305321]  dump_backtrace+0xdc/0x114
[   18.305329]  show_stack+0x1c/0x28
[   18.305333]  dump_stack_lvl+0x44/0x58
[   18.305339]  dump_stack+0x14/0x1c
[   18.305342]  panic+0x128/0x2f8
[   18.305350]  nmi_panic+0x50/0x70
[   18.305356]  arm64_serror_panic+0x74/0x80
[   18.305361]  do_serror+0x2c/0x5c
[   18.305366]  el1h_64_error_handler+0x30/0x44
[   18.305372]  el1h_64_error+0x64/0x68
[   18.305378]  el1_abort+0x30/0x5c
[   18.305383]  el1h_64_sync_handler+0x64/0xc8
[   18.305389]  el1h_64_sync+0x64/0x68
[   18.305392]  readl_relaxed+0x0/0x8
[   18.305401]  __mdiobus_write+0x3c/0x94
[   18.305409]  mdiobus_write+0x4c/0x70
[   18.305415]  phy_write+0x1c/0x24
[   18.305419]  bcm_phy_read_shadow+0x24/0x40
[   18.305423]  bcm_phy_led_brightness_set+0x40/0x94
[   18.305428]  phy_led_set_brightness+0x48/0x68
[   18.305434]  set_brightness_delayed_set_brightness+0x44/0x7c
[   18.305443]  set_brightness_delayed+0xc4/0x1a4
[   18.305447]  process_one_work+0x1c0/0x284
[   18.305455]  process_scheduled_works+0x44/0x48
[   18.305461]  worker_thread+0x1e8/0x264
[   18.305467]  kthread+0xcc/0xdc
[   18.305474]  ret_from_fork+0x10/0x20
[   18.311812] Kernel Offset: disabled
[   18.311814] CPU features: 0x00000003,00010000,0000420b
[   18.311818] Memory Limit: none
[   18.566507] ---[ end Kernel panic - not syncing: Asynchronous SError 
Interrupt ]---

still not clear to me how the workqueue managed to execute and not 
finish before we unregistered the PHY device.
Andrew Lunn June 21, 2023, 3:01 p.m. UTC | #3
> Thanks for fixing this, this is an improvement, though I can still hit
> another sort of use after free whereby the GENET driver removes the
> mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO
> block thus causing the following:

Hi Florian

Yes, i was not expecting this patch to fix that. But i was getting the
NULL pointer dereference you pointed out with another setup, and this
change does fix that part of the problem.

> still not clear to me how the workqueue managed to execute and not finish
> before we unregistered the PHY device.

Me neither. I took a look at the MDIO bus driver and could not see
anything obvious. I think you are going to have to scatter printk() in
the code to get a clear understanding of the order things are done.
Maybe it is another devm_ timing issue.

      Andrew
Russell King (Oracle) June 21, 2023, 5:04 p.m. UTC | #4
On Wed, Jun 21, 2023 at 03:04:14PM +0100, Florian Fainelli wrote:
> Hi Andrew,
> 
> On 6/17/2023 4:55 PM, Andrew Lunn wrote:
> > If the core is left to remove the LEDs via devm_, it is performed too
> > late, after the PHY driver is removed from the PHY. This results in
> > dereferencing a NULL pointer when the LED core tries to turn the LED
> > off before destroying the LED.
> > 
> > Manually unregister the LEDs at a safe point in phy_remove.
> > 
> > Cc: stable@vger.kernel.org
> > Reported-by: Florian Fainelli <f.fainelli@gmail.com>
> > Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
> > Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs")
> > Signed-off-by: Andrew Lunn <andrew@lunn.ch>
> 
> Thanks for fixing this, this is an improvement, though I can still hit
> another sort of use after free whereby the GENET driver removes the
> mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO
> block thus causing the following:

Hi Florian,

Can you try setting trigger_data->led_cdev to NULL after the
cancel_delayed_work_sync() in netdev_trig_deactivate() and see
what the effect is?

Thanks.
Florian Fainelli June 21, 2023, 5:12 p.m. UTC | #5
Hi Russell,

On 6/21/2023 6:04 PM, Russell King (Oracle) wrote:
> On Wed, Jun 21, 2023 at 03:04:14PM +0100, Florian Fainelli wrote:
>> Hi Andrew,
>>
>> On 6/17/2023 4:55 PM, Andrew Lunn wrote:
>>> If the core is left to remove the LEDs via devm_, it is performed too
>>> late, after the PHY driver is removed from the PHY. This results in
>>> dereferencing a NULL pointer when the LED core tries to turn the LED
>>> off before destroying the LED.
>>>
>>> Manually unregister the LEDs at a safe point in phy_remove.
>>>
>>> Cc: stable@vger.kernel.org
>>> Reported-by: Florian Fainelli <f.fainelli@gmail.com>
>>> Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
>>> Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs")
>>> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
>>
>> Thanks for fixing this, this is an improvement, though I can still hit
>> another sort of use after free whereby the GENET driver removes the
>> mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO
>> block thus causing the following:
> 
> Hi Florian,
> 
> Can you try setting trigger_data->led_cdev to NULL after the
> cancel_delayed_work_sync() in netdev_trig_deactivate() and see
> what the effect is?

Thanks for the suggestion, getting an identical trace as before with 
that change.
Russell King (Oracle) June 21, 2023, 5:52 p.m. UTC | #6
On Wed, Jun 21, 2023 at 06:12:40PM +0100, Florian Fainelli wrote:
> Hi Russell,
> 
> On 6/21/2023 6:04 PM, Russell King (Oracle) wrote:
> > On Wed, Jun 21, 2023 at 03:04:14PM +0100, Florian Fainelli wrote:
> > > Hi Andrew,
> > > 
> > > On 6/17/2023 4:55 PM, Andrew Lunn wrote:
> > > > If the core is left to remove the LEDs via devm_, it is performed too
> > > > late, after the PHY driver is removed from the PHY. This results in
> > > > dereferencing a NULL pointer when the LED core tries to turn the LED
> > > > off before destroying the LED.
> > > > 
> > > > Manually unregister the LEDs at a safe point in phy_remove.
> > > > 
> > > > Cc: stable@vger.kernel.org
> > > > Reported-by: Florian Fainelli <f.fainelli@gmail.com>
> > > > Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
> > > > Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs")
> > > > Signed-off-by: Andrew Lunn <andrew@lunn.ch>
> > > 
> > > Thanks for fixing this, this is an improvement, though I can still hit
> > > another sort of use after free whereby the GENET driver removes the
> > > mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO
> > > block thus causing the following:
> > 
> > Hi Florian,
> > 
> > Can you try setting trigger_data->led_cdev to NULL after the
> > cancel_delayed_work_sync() in netdev_trig_deactivate() and see
> > what the effect is?
> 
> Thanks for the suggestion, getting an identical trace as before with that
> change.

Thanks for trying. I was wondering whether the work was being re-queued
after the flush_work(), but seemingly not.
diff mbox series

Patch

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 17d0d0555a79..53598210be6c 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -3021,6 +3021,15 @@  static int phy_led_blink_set(struct led_classdev *led_cdev,
 	return err;
 }
 
+static void phy_leds_unregister(struct phy_device *phydev)
+{
+	struct phy_led *phyled;
+
+	list_for_each_entry(phyled, &phydev->leds, list) {
+		led_classdev_unregister(&phyled->led_cdev);
+	}
+}
+
 static int of_phy_led(struct phy_device *phydev,
 		      struct device_node *led)
 {
@@ -3054,7 +3063,7 @@  static int of_phy_led(struct phy_device *phydev,
 	init_data.fwnode = of_fwnode_handle(led);
 	init_data.devname_mandatory = true;
 
-	err = devm_led_classdev_register_ext(dev, cdev, &init_data);
+	err = led_classdev_register_ext(dev, cdev, &init_data);
 	if (err)
 		return err;
 
@@ -3083,6 +3092,7 @@  static int of_phy_leds(struct phy_device *phydev)
 		err = of_phy_led(phydev, led);
 		if (err) {
 			of_node_put(led);
+			phy_leds_unregister(phydev);
 			return err;
 		}
 	}
@@ -3305,6 +3315,9 @@  static int phy_remove(struct device *dev)
 
 	cancel_delayed_work_sync(&phydev->state_queue);
 
+	if (IS_ENABLED(CONFIG_PHYLIB_LEDS))
+		phy_leds_unregister(phydev);
+
 	phydev->state = PHY_DOWN;
 
 	sfp_bus_del_upstream(phydev->sfp_bus);