Message ID | 20240531074521.30406-1-qiuxu.zhuo@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | [1/1] drm/fb-helper: Don't schedule_work() to flush frame buffer during panic() |
Hello,

> From: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>
> Sent: Friday, May 31, 2024 3:45 PM
> To: maarten.lankhorst@linux.intel.com; mripard@kernel.org;
> tzimmermann@suse.de; airlied@gmail.com; daniel@ffwll.ch
> Cc: dri-devel@lists.freedesktop.org; linux-kernel@vger.kernel.org; Luck, Tony
> <tony.luck@intel.com>; Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Wang, Yudong
> <yudong.wang@intel.com>
> Subject: [PATCH 1/1] drm/fb-helper: Don't schedule_work() to flush frame
> buffer during panic()

A gentle ping on this patch.

Updated with recent error injection test results:

- Without the patch, we typically reproduced the issue [1] once in 100 cycles.
- With the patch, we tested it on 3 systems and passed a total of 1500 cycles.

[1] the system got blocked at drm_fb_helper_damage() -> schedule_work() without
    reboot. For details, please see the commit message.

Thanks!
-Qiuxu
Hi

On 31.05.24 at 09:45, Qiuxu Zhuo wrote:
> Reported-by: Yudong Wang <yudong.wang@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>

Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
index d612133e2cf7..6d7b6f038821 100644
--- a/drivers/gpu/drm/drm_fb_helper.c
+++ b/drivers/gpu/drm/drm_fb_helper.c
@@ -628,6 +628,9 @@ static void drm_fb_helper_add_damage_clip(struct drm_fb_helper *helper, u32 x, u
 static void drm_fb_helper_damage(struct drm_fb_helper *helper, u32 x, u32 y,
 				 u32 width, u32 height)
 {
+	if (oops_in_progress)
+		return;
+
 	drm_fb_helper_add_damage_clip(helper, x, y, width, height);
 
 	schedule_work(&helper->damage_work);
Sometimes the system [1] hangs on x86 I/O machine checks. However, the
expected behavior is to reboot the system, as the machine check handler
ultimately triggers a panic(), initiating a reboot in the last step.

The root cause is that panic() is sometimes blocked when
drm_fb_helper_damage() invokes schedule_work() to flush the frame buffer.
This occurs while flushing all messages to the frame buffer driver, as
shown in the following call trace:

  Machine check occurs [2]:
    panic()
      console_flush_on_panic()
        console_flush_all()
          console_emit_next_record()
            con->write()
              vt_console_print()
                hide_cursor()
                  vc->vc_sw->con_cursor()
                    fbcon_cursor()
                      ops->cursor()
                        bit_cursor()
                          soft_cursor()
                            info->fbops->fb_imageblit()
                              drm_fbdev_generic_defio_imageblit()
                                drm_fb_helper_damage_area()
                                  drm_fb_helper_damage()
                                    schedule_work() // <--- blocked here
      ...
      emergency_restart() // wasn't invoked, so no reboot.

During panic(), all CPUs except the panic CPU are stopped. In
schedule_work(), the panic CPU needs the worker_pool lock to queue the
work on that pool, but the lock may have been taken by one of the stopped
CPUs, which can never release it. So schedule_work() is blocked.

Additionally, since there is no opportunity to execute any scheduled work
during a panic(), it's safe to fix this issue by skipping schedule_work()
when 'oops_in_progress' is set in drm_fb_helper_damage().

[1] Enable the kernel options CONFIG_FRAMEBUFFER_CONSOLE and
    CONFIG_DRM_FBDEV_EMULATION, and boot with the 'console=tty0'
    kernel command line parameter.

[2] Set 'panic_timeout' to a non-zero value before calling panic().

Reported-by: Yudong Wang <yudong.wang@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
 drivers/gpu/drm/drm_fb_helper.c | 3 +++
 1 file changed, 3 insertions(+)
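
For readers who want to see the blocking scenario in isolation, the following is a minimal userspace sketch of it, not kernel code. Every name in it (pool_lock_sim, oops_in_progress_sim, schedule_work_sim(), damage_sim()) is a hypothetical stand-in invented for illustration. The bounded spin is an assumption added only so the model terminates; per the description above, the real schedule_work() would simply stay blocked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
static atomic_flag pool_lock_sim = ATOMIC_FLAG_INIT; /* models the worker_pool lock */
static bool oops_in_progress_sim;                    /* models oops_in_progress */
static int queued_work_sim;                          /* number of work items queued */

/* Models a CPU that took the pool lock and was then stopped by panic(). */
static void stopped_cpu_takes_lock(void)
{
	atomic_flag_test_and_set(&pool_lock_sim);
}

/*
 * Models schedule_work(): queueing needs the pool lock. This model gives up
 * after max_spins attempts and returns false, which stands for "blocked".
 */
static bool schedule_work_sim(unsigned int max_spins)
{
	assert(max_spins > 0);

	for (unsigned int i = 0; i < max_spins; i++) {
		/* test_and_set returns the previous state: false means we got the lock. */
		if (!atomic_flag_test_and_set(&pool_lock_sim)) {
			queued_work_sim++;
			atomic_flag_clear(&pool_lock_sim);
			return true;
		}
	}
	return false;
}

/*
 * Models drm_fb_helper_damage(). With 'patched' set, it returns early while
 * an oops is in progress, as the patch does. Returns true if the caller
 * gets control back without blocking.
 */
static bool damage_sim(bool patched)
{
	if (patched && oops_in_progress_sim)
		return true; /* skipped: the lock is never touched */

	return schedule_work_sim(1000);
}
```

Calling stopped_cpu_takes_lock(), setting oops_in_progress_sim, and then comparing damage_sim(false) with damage_sim(true) reproduces the two cases in this model: the unpatched path never obtains the lock, while the patched path returns immediately and queues nothing.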