Message ID | 20220723210825.564922-1-javierm@redhat.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | drm/msm: Make .remove and .shutdown HW shutdown consistent | expand |
On Sun, 24 Jul 2022 at 00:09, Javier Martinez Canillas <javierm@redhat.com> wrote: > > Drivers' .remove and .shutdown callbacks are executed on different code > paths. The former is called when a device is removed from the bus, while > the latter is called at system shutdown time to quiesce the device. > > This means that some overlap exists between the two, because both have to > take care of properly shutting down the hardware. But currently the logic > used in these two callbacks isn't consistent in msm drivers, which could > lead to kernel oops. > > For example, on .remove the component is deleted and its .unbind callback > leads to the hardware being shutdown but only if the DRM device has been > marked as registered. > > That check doesn't exist in the .shutdown logic and this can lead to the > driver calling drm_atomic_helper_shutdown() for a DRM device that hasn't > been properly initialized. > > A situation when this can happen is when a driver for an expected device > fails to probe, since the .bind callback will never be executed. > > This bug was attempted to be fixed in commit 623f279c7781 ("drm/msm: fix > shutdown hook in case GPU components failed to bind"), but unfortunately > it still happens in some cases. > > Rather than trying to keep fixing in both places, let's unify in a single > helper function that could be used for the two callbacks. > > Signed-off-by: Javier Martinez Canillas <javierm@redhat.com> > --- > > drivers/gpu/drm/msm/msm_drv.c | 31 +++++++++++++++++-------------- > 1 file changed, 17 insertions(+), 14 deletions(-) > > diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c > index 1ed4cd09dbf8..669891bd6f09 100644 > --- a/drivers/gpu/drm/msm/msm_drv.c > +++ b/drivers/gpu/drm/msm/msm_drv.c > @@ -190,14 +190,8 @@ static int vblank_ctrl_queue_work(struct msm_drm_private *priv, > return 0; > } > > -static int msm_drm_uninit(struct device *dev) > +static void msm_shutdown_hw(struct drm_device *dev) > { > - struct platform_device *pdev = to_platform_device(dev); > - struct msm_drm_private *priv = platform_get_drvdata(pdev); > - struct drm_device *ddev = priv->dev; > - struct msm_kms *kms = priv->kms; > - int i; > - > /* > * Shutdown the hw if we're far enough along where things might be on. > * If we run this too early, we'll end up panicking in any variety of > @@ -205,10 +199,21 @@ static int msm_drm_uninit(struct device *dev) > * msm_drm_init, drm_dev->registered is used as an indicator that the > * shutdown will be successful. > */ > - if (ddev->registered) { > + if (dev->registered) > + drm_atomic_helper_shutdown(dev); > +} > + > +static int msm_drm_uninit(struct device *dev) > +{ > + struct platform_device *pdev = to_platform_device(dev); > + struct msm_drm_private *priv = platform_get_drvdata(pdev); > + struct drm_device *ddev = priv->dev; > + struct msm_kms *kms = priv->kms; > + int i; > + > + if (ddev->registered) > drm_dev_unregister(ddev); No. The drm_dev_unregister() should come before drm_atomic_helper_shutdown(). Also drm_dev_unregister() should not be a part of .shutdown callback. See the documentation in the drm_drv.c > - drm_atomic_helper_shutdown(ddev); > - } > + msm_shutdown_hw(ddev); > > /* We must cancel and cleanup any pending vblank enable/disable > * work before msm_irq_uninstall() to avoid work re-enabling an > @@ -1242,10 +1247,8 @@ void msm_drv_shutdown(struct platform_device *pdev) > struct msm_drm_private *priv = platform_get_drvdata(pdev); > struct drm_device *drm = priv ? priv->dev : NULL; > > - if (!priv || !priv->kms) > - return; > - > - drm_atomic_helper_shutdown(drm); > + if (drm) > + msm_shutdown_hw(drm); > } > > static struct platform_driver msm_platform_driver = { > -- > 2.37.1 >
On 7/24/22 10:53, Dmitry Baryshkov wrote: > On Sun, 24 Jul 2022 at 00:09, Javier Martinez Canillas [...] >> - >> /* >> * Shutdown the hw if we're far enough along where things might be on. >> * If we run this too early, we'll end up panicking in any variety of >> @@ -205,10 +199,21 @@ static int msm_drm_uninit(struct device *dev) >> * msm_drm_init, drm_dev->registered is used as an indicator that the >> * shutdown will be successful. >> */ >> - if (ddev->registered) { >> + if (dev->registered) >> + drm_atomic_helper_shutdown(dev); >> +} >> + >> +static int msm_drm_uninit(struct device *dev) >> +{ >> + struct platform_device *pdev = to_platform_device(dev); >> + struct msm_drm_private *priv = platform_get_drvdata(pdev); >> + struct drm_device *ddev = priv->dev; >> + struct msm_kms *kms = priv->kms; >> + int i; >> + >> + if (ddev->registered) >> drm_dev_unregister(ddev); > > No. The drm_dev_unregister() should come before drm_atomic_helper_shutdown(). > I'm not sure to understand what you meant here, since drm_dev_unregister() is called before drm_atomic_helper_shutdown() that's called in msm_shutdown_hw(). > Also drm_dev_unregister() should not be a part of .shutdown callback. > See the documentation in the drm_drv.c > It is not right now, msm_shutdown_hw() only calls drm_atomic_helper_shutdown() but drm_dev_unregister() is still called from the msm_drm_uninit() function. Now, your comment made me realize that there's a bug in this patch since after the drm_dev_unregister(), dev->registered will be set to false and so in the .remove -> .unbind path drm_atomic_helper_shutdown() will never be executed. I guess one option is to do the if (dev->registered) check in the callers but then it won't really be worth it to have a helper and we could just add that check in msm_drv_shutdown() to conditionally call drm_atomic_helper_shutdown().
On 7/24/22 11:06, Javier Martinez Canillas wrote: [...] > > I guess one option is to do the if (dev->registered) check in the callers but > then it won't really be worth it to have a helper and we could just add that > check in msm_drv_shutdown() to conditionally call drm_atomic_helper_shutdown(). > By the way, the motivation of this patch is to fix a kernel oops during poweroff: [ 169.495897] systemd-shutdown[1]: Powering off. [ 169.500466] kvm: exiting hardware virtualization [ 169.554787] platform wifi-firmware.0: Removing from iommu group 12 [ 169.610238] platform video-firmware.0: Removing from iommu group 10 [ 169.682164] ------------[ cut here ]------------ [ 169.686909] WARNING: CPU: 6 PID: 1 at drivers/gpu/drm/drm_modeset_lock.c:317 drm_modeset_lock_all_ctx+0x3c4/0x3d0 [ 169.697450] Modules linked in: af_alg rtl8192cu rtl_usb cros_ec_sensors rtl8192c_common cros_ec_sensors_core rtlwifi industrialio_triggered_buffer venus_enc venus_dec sbs_battery kfifo_buf cros_ec_typec videobuf2_dma_contig hid_multito uch cros_usbpd_logger typec cros_ec_chardev cros_usbpd_charger videobuf2_memops ath10k_snoc ath10k_core hci_uart btqca venus_core btbcm ath mac80211 bluetooth v4l2_mem2mem videobuf2_v4l2 libarc4 videobuf2_common qcom_spmi_adc_tm5 qrtr cfg 80211 videodev qcom_spmi_adc5 qcom_q6v5_mss ecdh_generic qcom_pil_info ecc qcom_vadc_common mc crct10dif_ce qcom_spmi_temp_alarm rfkill qcom_q6v5 spi_geni_qcom qcom_sysmon qcom_common qmi_helpers snd_soc_max98357a socinfo ip6_tables ip_ta bles x_tables ipmi_devintf ipmi_msghandler fuse zram zsmalloc ipv6 [ 169.767126] CPU: 6 PID: 1 Comm: systemd-shutdow Tainted: G W 5.19.0-rc7+ #20 [ 169.775691] Hardware name: Google CoachZ (rev3+) (DT) [ 169.780874] pstate: a0400009 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 169.788021] pc : drm_modeset_lock_all_ctx+0x3c4/0x3d0 [ 169.793205] lr : drm_modeset_lock_all_ctx+0x48/0x3d0 [ 169.798299] sp : ffff80000805bb80 [ 169.801701] x29: ffff80000805bb80 x28: ffff327c00128000 x27: 0000000000000000 [ 169.809025] x26: 0000000000000000 x25: 0000000000000001 x24: ffffc95d820ec030 [ 169.816349] x23: ffff327c00bbd090 x22: ffffc95d8215eca0 x21: ffff327c039c5800 [ 169.823674] x20: ffff327c039c5988 x19: ffff80000805bbe8 x18: 0000000000000034 [ 169.830998] x17: 000000040044ffff x16: ffffc95d80cac920 x15: 0000000000000000 [ 169.838322] x14: 0000000000000315 x13: 0000000000000315 x12: 0000000000000000 [ 169.845646] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000 [ 169.852971] x8 : ffff80000805bc28 x7 : 0000000000000000 x6 : 0000000000000000 [ 169.860295] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 169.867619] x2 : ffff327c00128000 x1 : 0000000000000000 x0 : ffff327c039c59b0 [ 169.874944] Call trace: [ 169.877467] drm_modeset_lock_all_ctx+0x3c4/0x3d0 [ 169.882297] drm_atomic_helper_shutdown+0x70/0x134 [ 169.887217] msm_drv_shutdown+0x30/0x40 [ 169.891159] platform_shutdown+0x28/0x40 [ 169.895191] device_shutdown+0x148/0x350 [ 169.899221] kernel_power_off+0x38/0x80 [ 169.903163] __do_sys_reboot+0x288/0x2c0 [ 169.907192] __arm64_sys_reboot+0x28/0x34 [ 169.911309] invoke_syscall+0x48/0x114 [ 169.915162] el0_svc_common.constprop.0+0x44/0xec [ 169.919992] do_el0_svc+0x2c/0xc0 [ 169.923394] el0_svc+0x2c/0x84 [ 169.926535] el0t_64_sync_handler+0x11c/0x150 [ 169.931013] el0t_64_sync+0x18c/0x190 [ 169.934777] ---[ end trace 0000000000000000 ]--- [ 169.939557] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018 [ 169.948574] Mem abort info: [ 169.951452] ESR = 0x0000000096000004 [ 169.955307] EC = 0x25: DABT (current EL), IL = 32 bits [ 169.960765] SET = 0, FnV = 0 [ 169.963901] EA = 0, S1PTW = 0 [ 169.967127] FSC = 0x04: level 0 translation fault [ 169.972136] Data abort info: [ 169.975093] ISV = 0, ISS = 0x00000004 [ 169.979037] CM = 0, WnR = 0 [ 169.982083] user pgtable: 4k pages, 48-bit VAs, pgdp=000000010eab1000 [ 169.988697] [0000000000000018] pgd=0000000000000000, p4d=0000000000000000 [ 169.995669] Internal error: Oops: 96000004 [#1] PREEMPT SMP [ 170.001396] Modules linked in: af_alg rtl8192cu rtl_usb cros_ec_sensors rtl8192c_common cros_ec_sensors_core rtlwifi industrialio_triggered_buffer venus_enc venus_dec sbs_battery kfifo_buf cros_ec_typec videobuf2_dma_contig hid_multito uch cros_usbpd_logger typec cros_ec_chardev cros_usbpd_charger videobuf2_memops ath10k_snoc ath10k_core hci_uart btqca venus_core btbcm ath mac80211 bluetooth v4l2_mem2mem videobuf2_v4l2 libarc4 videobuf2_common qcom_spmi_adc_tm5 qrtr cfg 80211 videodev qcom_spmi_adc5 qcom_q6v5_mss ecdh_generic qcom_pil_info ecc qcom_vadc_common mc crct10dif_ce qcom_spmi_temp_alarm rfkill qcom_q6v5 spi_geni_qcom qcom_sysmon qcom_common qmi_helpers snd_soc_max98357a socinfo ip6_tables ip_ta bles x_tables ipmi_devintf ipmi_msghandler fuse zram zsmalloc ipv6 [ 170.071041] CPU: 6 PID: 1 Comm: systemd-shutdow Tainted: G W 5.19.0-rc7+ #20 [ 170.079614] Hardware name: Google CoachZ (rev3+) (DT) [ 170.084801] pstate: a0400009 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 170.091941] pc : ww_mutex_lock+0x28/0x32c [ 170.096064] lr : drm_modeset_lock_all_ctx+0x1b0/0x3d0 [ 170.101254] sp : ffff80000805bb50 [ 170.104658] x29: ffff80000805bb50 x28: ffff327c00128000 x27: 0000000000000000 [ 170.111977] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000018 [ 170.119296] x23: ffff80000805bc10 x22: ffff327c039c5ad8 x21: ffff327c039c5800 [ 170.126615] x20: ffff80000805bbe8 x19: 0000000000000018 x18: 0000000000000034 [ 170.133933] x17: 000000040044ffff x16: ffffc95d80cac920 x15: 0000000000000000 [ 170.141252] x14: 0000000000000315 x13: 0000000000000315 x12: 0000000000000000 [ 170.148571] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000 [ 170.155890] x8 : ffff80000805bc28 x7 : 0000000000000000 x6 : 0000000000000000 [ 170.163209] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 170.170528] x2 : ffff327c00128000 x1 : 0000000000000000 x0 : 0000000000000018 [ 170.177847] Call trace: [ 170.180364] ww_mutex_lock+0x28/0x32c [ 170.184127] drm_modeset_lock_all_ctx+0x1b0/0x3d0 [ 170.188957] drm_atomic_helper_shutdown+0x70/0x134 [ 170.193876] msm_drv_shutdown+0x30/0x40 [ 170.197820] platform_shutdown+0x28/0x40 [ 170.201854] device_shutdown+0x148/0x350 [ 170.205888] kernel_power_off+0x38/0x80 [ 170.209832] __do_sys_reboot+0x288/0x2c0 [ 170.213866] __arm64_sys_reboot+0x28/0x34 [ 170.217990] invoke_syscall+0x48/0x114 [ 170.221843] el0_svc_common.constprop.0+0x44/0xec [ 170.226672] do_el0_svc+0x2c/0xc0 [ 170.230079] el0_svc+0x2c/0x84 [ 170.233215] el0t_64_sync_handler+0x11c/0x150 [ 170.237686] el0t_64_sync+0x18c/0x190 [ 170.241451] Code: aa0103f4 d503201f d2800001 aa0103e3 (c8e37c02) [ 170.247704] ---[ end trace 0000000000000000 ]--- [ 170.252457] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [ 170.260654] Kernel Offset: 0x495d77c00000 from 0xffff800008000000 [ 170.266910] PHYS_OFFSET: 0xffffcd8500000000 [ 170.271212] CPU features: 0x800,00c2a015,19801c82 [ 170.276042] Memory Limit: none [ 170.279183] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c index 1ed4cd09dbf8..669891bd6f09 100644 --- a/drivers/gpu/drm/msm/msm_drv.c +++ b/drivers/gpu/drm/msm/msm_drv.c @@ -190,14 +190,8 @@ static int vblank_ctrl_queue_work(struct msm_drm_private *priv, return 0; } -static int msm_drm_uninit(struct device *dev) +static void msm_shutdown_hw(struct drm_device *dev) { - struct platform_device *pdev = to_platform_device(dev); - struct msm_drm_private *priv = platform_get_drvdata(pdev); - struct drm_device *ddev = priv->dev; - struct msm_kms *kms = priv->kms; - int i; - /* * Shutdown the hw if we're far enough along where things might be on. * If we run this too early, we'll end up panicking in any variety of @@ -205,10 +199,21 @@ static int msm_drm_uninit(struct device *dev) * msm_drm_init, drm_dev->registered is used as an indicator that the * shutdown will be successful. */ - if (ddev->registered) { + if (dev->registered) + drm_atomic_helper_shutdown(dev); +} + +static int msm_drm_uninit(struct device *dev) +{ + struct platform_device *pdev = to_platform_device(dev); + struct msm_drm_private *priv = platform_get_drvdata(pdev); + struct drm_device *ddev = priv->dev; + struct msm_kms *kms = priv->kms; + int i; + + if (ddev->registered) drm_dev_unregister(ddev); - drm_atomic_helper_shutdown(ddev); - } + msm_shutdown_hw(ddev); /* We must cancel and cleanup any pending vblank enable/disable * work before msm_irq_uninstall() to avoid work re-enabling an @@ -1242,10 +1247,8 @@ void msm_drv_shutdown(struct platform_device *pdev) struct msm_drm_private *priv = platform_get_drvdata(pdev); struct drm_device *drm = priv ? priv->dev : NULL; - if (!priv || !priv->kms) - return; - - drm_atomic_helper_shutdown(drm); + if (drm) + msm_shutdown_hw(drm); } static struct platform_driver msm_platform_driver = {
Drivers' .remove and .shutdown callbacks are executed on different code paths. The former is called when a device is removed from the bus, while the latter is called at system shutdown time to quiesce the device. This means that some overlap exists between the two, because both have to take care of properly shutting down the hardware. But currently the logic used in these two callbacks isn't consistent in msm drivers, which could lead to kernel oops. For example, on .remove the component is deleted and its .unbind callback leads to the hardware being shutdown but only if the DRM device has been marked as registered. That check doesn't exist in the .shutdown logic and this can lead to the driver calling drm_atomic_helper_shutdown() for a DRM device that hasn't been properly initialized. A situation when this can happen is when a driver for an expected device fails to probe, since the .bind callback will never be executed. This bug was attempted to be fixed in commit 623f279c7781 ("drm/msm: fix shutdown hook in case GPU components failed to bind"), but unfortunately it still happens in some cases. Rather than trying to keep fixing in both places, let's unify in a single helper function that could be used for the two callbacks. Signed-off-by: Javier Martinez Canillas <javierm@redhat.com> --- drivers/gpu/drm/msm/msm_drv.c | 31 +++++++++++++++++-------------- 1 file changed, 17 insertions(+), 14 deletions(-)