diff mbox series

[v2,1/1] drm/fb-helper: Don't schedule_work() to flush frame buffer during panic()

Message ID 20240703141737.75378-1-qiuxu.zhuo@intel.com (mailing list archive)
State New, archived
Headers show
Series [v2,1/1] drm/fb-helper: Don't schedule_work() to flush frame buffer during panic() | expand

Commit Message

Zhuo, Qiuxu July 3, 2024, 2:17 p.m. UTC
Sometimes the system [1] hangs on x86 I/O machine checks. However, the
expected behavior is to reboot the system, as the machine check handler
ultimately triggers a panic(), initiating a reboot in the last step.

The root cause is that sometimes the panic() is blocked when
drm_fb_helper_damage() invoking schedule_work() to flush the frame buffer.
This occurs during the process of flushing all messages to the frame
buffer driver as shown in the following call trace:

  Machine check occurs [2]:
    panic()
      console_flush_on_panic()
        console_flush_all()
          console_emit_next_record()
            con->write()
              vt_console_print()
                hide_cursor()
                  vc->vc_sw->con_cursor()
                    fbcon_cursor()
                      ops->cursor()
                        bit_cursor()
                          soft_cursor()
                            info->fbops->fb_imageblit()
                              drm_fbdev_generic_defio_imageblit()
                                drm_fb_helper_damage_area()
                                  drm_fb_helper_damage()
                                    schedule_work() // <--- blocked here
    ...
    emergency_restart()  // wasn't invoked, so no reboot.

During panic(), except the panic CPU, all the other CPUs are stopped.
In schedule_work(), the panic CPU requires the lock of worker_pool to
queue the work on that pool, while the lock may have been token by some
other stopped CPU. So schedule_work() is blocked.

Additionally, during a panic(), since there is no opportunity to execute
any scheduled work, it's safe to fix this issue by skipping schedule_work()
on 'oops_in_progress' in drm_fb_helper_damage().

[1] Enable the kernel option CONFIG_FRAMEBUFFER_CONSOLE,
    CONFIG_DRM_FBDEV_EMULATION, and boot with the 'console=tty0'
    kernel command line parameter.

[2] Set 'panic_timeout' to a non-zero value before calling panic().

Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Reported-by: Yudong Wang <yudong.wang@intel.com>
Tested-by: Yudong Wang <yudong.wang@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
v1 -> v2:
  - No function changes.

  - Add 'Acked-by' from Thomas Zimmermann.
    [ Thanks Thomas Zimmermann for reviewing this patch. ]

  - Add 'Tested-by' from Yudong Wang.

  - Add comments to drm_fb_helper_damage() about the early return on oops_in_progress.

 drivers/gpu/drm/drm_fb_helper.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Zhuo, Qiuxu Aug. 5, 2024, 7:13 a.m. UTC | #1
> From: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> To: maarten.lankhorst@linux.intel.com, mripard@kernel.org,
> 	tzimmermann@suse.de, airlied@gmail.com, daniel@ffwll.ch
> Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
> 	tony.luck@intel.com, qiuxu.zhuo@intel.com, yudong.wang@intel.com
> Subject: [PATCH v2 1/1] drm/fb-helper: Don't schedule_work() to flush frame buffer during panic()
> [...]
> Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
> Reported-by: Yudong Wang <yudong.wang@intel.com>
> Tested-by: Yudong Wang <yudong.wang@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>

Hi Maarten and maintainers, 

A gentle ping :-).

Could you please help push this v2 fix upstream?
If you have any concerns, please let me know.

Thanks!
-Qiuxu
Thomas Zimmermann Aug. 5, 2024, 7:31 a.m. UTC | #2
Hi

Am 05.08.24 um 09:13 schrieb Qiuxu Zhuo:
>> From: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
>> To: maarten.lankhorst@linux.intel.com, mripard@kernel.org,
>> 	tzimmermann@suse.de, airlied@gmail.com, daniel@ffwll.ch
>> Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
>> 	tony.luck@intel.com, qiuxu.zhuo@intel.com, yudong.wang@intel.com
>> Subject: [PATCH v2 1/1] drm/fb-helper: Don't schedule_work() to flush frame buffer during panic()
>> [...]
>> Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
>> Reported-by: Yudong Wang <yudong.wang@intel.com>
>> Tested-by: Yudong Wang <yudong.wang@intel.com>
>> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Hi Maarten and maintainers,
>
> A gentle ping :-).
>
> Could you please help push this v2 fix upstream?
> If you have any concerns, please let me know.

I already acked this patch, but I still have a question: during a panic, 
will fbcon still print a panic message? I think that would likely 
require scheduling that worker.

Best regards
Thomas

>
> Thanks!
> -Qiuxu
Zhuo, Qiuxu Aug. 5, 2024, 8:10 a.m. UTC | #3
Hi Thomas,

> From: Thomas Zimmermann <tzimmermann@suse.de>
> Sent: Monday, August 5, 2024 3:31 PM
> [...]
> > Hi Maarten and maintainers,
> >
> > A gentle ping :-).
> >
> > Could you please help push this v2 fix upstream?
> > If you have any concerns, please let me know.
> 
> I already acked this patch, but I still have a question: during a panic, will fbcon

Thanks for your kind review of this patch and ACK.

> still print a panic message? I think that would likely require scheduling that
> worker.

During the error injection testing:

1) Without this v2 fix:

   1.1) If panic() is not blocked on [1] (~99 times in 100 cycles),
        then the console/fbcon can print normal panic-related messages like [2],
        and the system can reboot successfully.

   1.2) If panic() is blocked on [1] (~1 time in 100 cycles),
        then the console/fbcon is silent and the system gets hung without reboot.
         This is not the expected behavior. The system is expected to reboot.

2) With this v2 fix:

   2.1) The console/fbcon can always print normal panic related messages like[2],
        and the system can reboot successfully. Same behavior to 1.1).
        [ we tested it ~1500 cycles. ]

[1] panic() -> ... drm_fb_helper_damage() -> schedule_work().
    For details, pls see the v2 commit message.

[2] Panic messages:
    [  133.900042] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 4: ba00000000000e0b
    [  133.900046] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8229ebec> {intel_idle_xstate+0x6c/0xc0}
    [  133.900055] mce: [Hardware Error]: TSC 9701dd289b MISC 29100000 PPIN 9000d7561bb0e340
    [  133.900057] mce: [Hardware Error]: PROCESSOR 0:a06d1 TIME 1715827713 SOCKET 0 APIC 0 microcode 810001d0
    [  133.900060] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
    [  134.053858] mce: [Hardware Error]: Machine check: Processor context corrupt
    [  134.053866] Kernel panic - not syncing: Fatal machine check
    [  134.075183] Kernel Offset: disabled
    [  134.111372] pstore: backend (erst) writing error (-28)

Thanks!
-Qiuxu
Thomas Zimmermann Aug. 5, 2024, 8:26 a.m. UTC | #4
Hi

Am 05.08.24 um 10:10 schrieb Zhuo, Qiuxu:
> Hi Thomas,
>
>> From: Thomas Zimmermann <tzimmermann@suse.de>
>> Sent: Monday, August 5, 2024 3:31 PM
>> [...]
>>> Hi Maarten and maintainers,
>>>
>>> A gentle ping :-).
>>>
>>> Could you please help push this v2 fix upstream?
>>> If you have any concerns, please let me know.
>> I already acked this patch, but I still have a question: during a panic, will fbcon
> Thanks for your kind review of this patch and ACK.
>
>> still print a panic message? I think that would likely require scheduling that
>> worker.
> During the error injection testing:
>
> 1) Without this v2 fix:
>
>     1.1) If panic() is not blocked on [1] (~99 times in 100 cycles),
>          then the console/fbcon can print normal panic-related messages like [2],
>          and the system can reboot successfully.
>
>     1.2) If panic() is blocked on [1] (~1 time in 100 cycles),
>          then the console/fbcon is silent and the system gets hung without reboot.
>           This is not the expected behavior. The system is expected to reboot.
>
> 2) With this v2 fix:
>
>     2.1) The console/fbcon can always print normal panic related messages like[2],
>          and the system can reboot successfully. Same behavior to 1.1).
>          [ we tested it ~1500 cycles. ]
>
> [1] panic() -> ... drm_fb_helper_damage() -> schedule_work().
>      For details, pls see the v2 commit message.
>
> [2] Panic messages:
>      [  133.900042] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 4: ba00000000000e0b
>      [  133.900046] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8229ebec> {intel_idle_xstate+0x6c/0xc0}
>      [  133.900055] mce: [Hardware Error]: TSC 9701dd289b MISC 29100000 PPIN 9000d7561bb0e340
>      [  133.900057] mce: [Hardware Error]: PROCESSOR 0:a06d1 TIME 1715827713 SOCKET 0 APIC 0 microcode 810001d0
>      [  133.900060] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>      [  134.053858] mce: [Hardware Error]: Machine check: Processor context corrupt
>      [  134.053866] Kernel panic - not syncing: Fatal machine check
>      [  134.075183] Kernel Offset: disabled
>      [  134.111372] pstore: backend (erst) writing error (-28)

Thanks for the detailed reply.

I've found that your patch has already been merged and should now be in 
v6.11-rc2. It'll probably be backported to older kernels as well.

 > dim cite 833cd3e9ad8360785b6c23c82dd3856df00732d9
833cd3e9ad83 ("drm/fb-helper: Don't schedule_work() to flush frame 
buffer during panic()")
 > git tag --contains 833cd3e9ad8360785b6c23c82dd3856df00732d9
drm-fixes-2024-08-02
drm-misc-fixes-2024-08-01
v6.11-rc2

Best regards
Thomas

>
> Thanks!
> -Qiuxu
>   
>
>
Zhuo, Qiuxu Aug. 5, 2024, 10:34 a.m. UTC | #5
Hi Thomas,

> From: Thomas Zimmermann <tzimmermann@suse.de>
> [...]
> 
> Thanks for the detailed reply.
> 
> I've found that your patch has already been merged and should now be in
> v6.11-rc2. It'll probably be backported to older kernels as well.
> 
>  > dim cite 833cd3e9ad8360785b6c23c82dd3856df00732d9
> 833cd3e9ad83 ("drm/fb-helper: Don't schedule_work() to flush frame buffer
> during panic()")  > git tag --contains
> 833cd3e9ad8360785b6c23c82dd3856df00732d9
> drm-fixes-2024-08-02
> drm-misc-fixes-2024-08-01
> v6.11-rc2

Thank you so much for your review and support on this 
diff mbox series

Patch

diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
index d612133e2cf7..6ad75034c966 100644
--- a/drivers/gpu/drm/drm_fb_helper.c
+++ b/drivers/gpu/drm/drm_fb_helper.c
@@ -628,6 +628,17 @@  static void drm_fb_helper_add_damage_clip(struct drm_fb_helper *helper, u32 x, u
 static void drm_fb_helper_damage(struct drm_fb_helper *helper, u32 x, u32 y,
 				 u32 width, u32 height)
 {
+	/*
+	 * This function may be invoked by panic() to flush the frame
+	 * buffer, where all CPUs except the panic CPU are stopped.
+	 * During the following schedule_work(), the panic CPU needs
+	 * the worker_pool lock, which might be held by a stopped CPU,
+	 * causing schedule_work() and panic() to block. Return early on
+	 * oops_in_progress to prevent this blocking.
+	 */
+	if (oops_in_progress)
+		return;
+
 	drm_fb_helper_add_damage_clip(helper, x, y, width, height);
 
 	schedule_work(&helper->damage_work);