[1/1] drm/fb-helper: Don't schedule_work() to flush frame buffer during panic()

Message ID 20240531074521.30406-1-qiuxu.zhuo@intel.com (mailing list archive)
State New, archived

Commit Message

Zhuo, Qiuxu May 31, 2024, 7:45 a.m. UTC
Sometimes the system [1] hangs on x86 I/O machine checks. However, the
expected behavior is to reboot the system, as the machine check handler
ultimately triggers a panic(), initiating a reboot in the last step.

The root cause is that panic() sometimes blocks when
drm_fb_helper_damage() invokes schedule_work() to flush the frame buffer.
This occurs while all pending messages are being flushed to the frame
buffer driver, as shown in the following call trace:

  Machine check occurs [2]:
    panic()
      console_flush_on_panic()
        console_flush_all()
          console_emit_next_record()
            con->write()
              vt_console_print()
                hide_cursor()
                  vc->vc_sw->con_cursor()
                    fbcon_cursor()
                      ops->cursor()
                        bit_cursor()
                          soft_cursor()
                            info->fbops->fb_imageblit()
                              drm_fbdev_generic_defio_imageblit()
                                drm_fb_helper_damage_area()
                                  drm_fb_helper_damage()
                                    schedule_work() // <--- blocked here
    ...
    emergency_restart()  // wasn't invoked, so no reboot.

During panic(), all CPUs except the panic CPU are stopped.
In schedule_work(), the panic CPU needs the worker_pool lock to
queue the work on that pool, but the lock may already have been taken by
one of the stopped CPUs, which will never release it. So schedule_work()
blocks.
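
The lock scenario described above can be modelled in user space. The
following is only a sketch: struct pool_lock, pool_lock_try(),
pool_lock_release() and panic_cpu_can_queue() are hypothetical names, the
lock is a plain C11 atomic_flag rather than the kernel's spinlock, and the
spin is bounded so that the demonstration terminates (the real acquisition
would spin forever).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the worker_pool lock (not the kernel type). */
struct pool_lock {
	atomic_flag held;
};

/* Spin on the lock at most max_spins times; return true once acquired. */
static bool pool_lock_try(struct pool_lock *lock, unsigned long max_spins)
{
	for (unsigned long i = 0; i < max_spins; i++) {
		if (!atomic_flag_test_and_set(&lock->held))
			return true;
	}
	return false;
}

static void pool_lock_release(struct pool_lock *lock)
{
	atomic_flag_clear(&lock->held);
}

/*
 * Model: another CPU takes the lock. If that CPU is then stopped by
 * panic(), nothing ever releases the lock. Returns true if the panic
 * CPU manages to take the lock afterwards.
 */
static bool panic_cpu_can_queue(bool owner_was_stopped)
{
	struct pool_lock lock = { .held = ATOMIC_FLAG_INIT };

	(void)pool_lock_try(&lock, 1);		/* the other CPU takes the lock */
	if (!owner_was_stopped)
		pool_lock_release(&lock);	/* a running owner unlocks */

	/* Bounded only for the demo; a real spinlock has no such limit. */
	return pool_lock_try(&lock, 1000000UL);
}
```

In this model, panic_cpu_can_queue(false) succeeds, while
panic_cpu_can_queue(true) exhausts its spin budget, which corresponds to
the hang at schedule_work() in the call trace.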

Additionally, since no scheduled work gets a chance to run during a
panic(), it is safe to fix this issue by returning early from
drm_fb_helper_damage(), and thus skipping schedule_work(), when
'oops_in_progress' is set.
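
The effect of the guard can be illustrated with a small user-space mock.
This is a sketch under stated assumptions, not the kernel code: every
mock_* name is hypothetical, and mock_oops_in_progress merely stands in for
the kernel's 'oops_in_progress' flag.

```c
/* User-space stand-ins; the kernel provides the real flag and workqueue. */
static int mock_oops_in_progress;
static unsigned int mock_work_queued;

struct mock_fb_helper {
	unsigned int damage_clips;	/* number of recorded damage rectangles */
};

/* Stand-in for schedule_work(): only counts how often it was called. */
static void mock_schedule_work(void)
{
	mock_work_queued++;
}

/* Same shape as the patched function: bail out before touching anything. */
static void mock_fb_helper_damage(struct mock_fb_helper *helper)
{
	if (mock_oops_in_progress)
		return;

	helper->damage_clips++;		/* stands in for adding a damage clip */
	mock_schedule_work();
}
```

Because the check sits at the top of the function, the damage clip is not
recorded either while the flag is set; only the queueing path is affected,
and nothing is deferred for later.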

[1] Enable the kernel options CONFIG_FRAMEBUFFER_CONSOLE and
    CONFIG_DRM_FBDEV_EMULATION, and boot with the 'console=tty0'
    kernel command line parameter.

[2] Set 'panic_timeout' to a non-zero value before calling panic().

Reported-by: Yudong Wang <yudong.wang@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
 drivers/gpu/drm/drm_fb_helper.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Zhuo, Qiuxu June 25, 2024, 1:49 p.m. UTC | #1
Hello,

A gentle ping on this patch. 

Updated with recent error injection test results:
- Without the patch, we typically reproduced the issue [1] once in 100 cycles.
- With the patch, we tested it on 3 systems and passed a total of 1500 cycles.

[1] The system blocked in drm_fb_helper_damage() -> schedule_work() and did
    not reboot. For details, please see the commit message.
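
To put the two numbers above in relation, the following sketch computes how
likely 1500 consecutive passes would be if the patch had no effect. It
assumes, which the thread does not state, that cycles are independent and
that the failure rate without the patch is exactly 1 in 100;
prob_all_pass() is a hypothetical helper.

```c
/*
 * Probability that `cycles` consecutive test cycles all pass when each
 * cycle fails independently with probability p_fail.
 */
static double prob_all_pass(double p_fail, unsigned int cycles)
{
	double prob = 1.0;

	for (unsigned int i = 0; i < cycles; i++)
		prob *= 1.0 - p_fail;

	return prob;
}
```

Under those assumptions, prob_all_pass(0.01, 1500) is well below one in a
million, so 1500 clean cycles would be very unlikely by chance alone.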

Thanks!
-Qiuxu

Thomas Zimmermann June 25, 2024, 2:14 p.m. UTC | #2
Hi

On 31.05.24 at 09:45, Qiuxu Zhuo wrote:
> Sometimes the system [1] hangs on x86 I/O machine checks. However, the
> expected behavior is to reboot the system, as the machine check handler
> ultimately triggers a panic(), initiating a reboot in the last step.
>
> The root cause is that panic() sometimes blocks when
> drm_fb_helper_damage() invokes schedule_work() to flush the frame buffer.
> This occurs while all pending messages are being flushed to the frame
> buffer driver, as shown in the following call trace:
>
>    Machine check occurs [2]:
>      panic()
>        console_flush_on_panic()
>          console_flush_all()
>            console_emit_next_record()
>              con->write()
>                vt_console_print()
>                  hide_cursor()
>                    vc->vc_sw->con_cursor()
>                      fbcon_cursor()
>                        ops->cursor()
>                          bit_cursor()
>                            soft_cursor()
>                              info->fbops->fb_imageblit()
>                                drm_fbdev_generic_defio_imageblit()
>                                  drm_fb_helper_damage_area()
>                                    drm_fb_helper_damage()
>                                      schedule_work() // <--- blocked here
>      ...
>      emergency_restart()  // wasn't invoked, so no reboot.
>
> During panic(), all CPUs except the panic CPU are stopped.
> In schedule_work(), the panic CPU needs the worker_pool lock to
> queue the work on that pool, but the lock may already have been taken by
> one of the stopped CPUs, which will never release it. So schedule_work()
> blocks.
>
> Additionally, since no scheduled work gets a chance to run during a
> panic(), it is safe to fix this issue by returning early from
> drm_fb_helper_damage(), and thus skipping schedule_work(), when
> 'oops_in_progress' is set.
>
> [1] Enable the kernel option CONFIG_FRAMEBUFFER_CONSOLE,
>      CONFIG_DRM_FBDEV_EMULATION, and boot with the 'console=tty0'
>      kernel command line parameter.
>
> [2] Set 'panic_timeout' to a non-zero value before calling panic().
>
> Reported-by: Yudong Wang <yudong.wang@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>

Acked-by: Thomas Zimmermann <tzimmermann@suse.de>

> ---
>   drivers/gpu/drm/drm_fb_helper.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
> index d612133e2cf7..6d7b6f038821 100644
> --- a/drivers/gpu/drm/drm_fb_helper.c
> +++ b/drivers/gpu/drm/drm_fb_helper.c
> @@ -628,6 +628,9 @@ static void drm_fb_helper_add_damage_clip(struct drm_fb_helper *helper, u32 x, u
>   static void drm_fb_helper_damage(struct drm_fb_helper *helper, u32 x, u32 y,
>   				 u32 width, u32 height)
>   {
> +	if (oops_in_progress)
> +		return;
> +
>   	drm_fb_helper_add_damage_clip(helper, x, y, width, height);
>   
>   	schedule_work(&helper->damage_work);

Patch

diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
index d612133e2cf7..6d7b6f038821 100644
--- a/drivers/gpu/drm/drm_fb_helper.c
+++ b/drivers/gpu/drm/drm_fb_helper.c
@@ -628,6 +628,9 @@  static void drm_fb_helper_add_damage_clip(struct drm_fb_helper *helper, u32 x, u
 static void drm_fb_helper_damage(struct drm_fb_helper *helper, u32 x, u32 y,
 				 u32 width, u32 height)
 {
+	if (oops_in_progress)
+		return;
+
 	drm_fb_helper_add_damage_clip(helper, x, y, width, height);
 
 	schedule_work(&helper->damage_work);