
[v2,0/3] More machine check recovery fixes

Message ID 20210818002942.1607544-1-tony.luck@intel.com (mailing list archive)

Message

Tony Luck Aug. 18, 2021, 12:29 a.m. UTC
Fix a couple of issues in machine check handling

1) A repeated machine check inside the kernel, with no call to the task
   work function between the machine checks, goes into an infinite
   loop.
2) Machine checks in kernel functions copying data from user addresses
   send SIGBUS to the user as if the application had consumed the
   poison. But this is wrong. The user should see either an -EFAULT
   error return or a reduced byte count (in the case of write(2)).
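
For illustration only, here is a user-space sketch of how an application
might tell the two outcomes from item 2 apart after a single write(2). The
enum and the helpers classify_write() and write_once() are hypothetical
names made up for this example; they are not part of the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical outcome labels for illustration; not kernel or libc names. */
enum write_outcome {
	WRITE_COMPLETE,		/* all requested bytes were written */
	WRITE_SHORT,		/* reduced byte count */
	WRITE_EFAULT,		/* nothing written, errno was EFAULT */
	WRITE_OTHER_ERROR	/* any other failure */
};

/* Classify the return value of one write(2) call. */
static enum write_outcome classify_write(ssize_t ret, size_t requested, int err)
{
	if (ret < 0)
		return err == EFAULT ? WRITE_EFAULT : WRITE_OTHER_ERROR;
	if ((size_t)ret < requested)
		return WRITE_SHORT;
	return WRITE_COMPLETE;
}

/* Issue one write(2); *written receives the number of bytes accepted. */
static enum write_outcome write_once(int fd, const void *buf, size_t len,
				     size_t *written)
{
	assert(buf != NULL && written != NULL);

	ssize_t ret = write(fd, buf, len);
	int err = errno;	/* only meaningful when ret < 0 */

	*written = ret > 0 ? (size_t)ret : 0;
	return classify_write(ret, len, err);
}
```

A caller that gets WRITE_SHORT can decide whether to retry from
buf + *written or to treat the remainder of the buffer as lost.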

My latest tests have been on v5.14-rc6 with this patch (that's already
in -mm) applied:
https://lore.kernel.org/r/20210817053703.2267588-1-naoya.horiguchi@linux.dev

Changes since v1:
1) Fix bug in kill_me_never() that forgot to clear p->mce_count so
   repeated recovery in the same task would trigger the panic for
	"Machine checks to different user pages"
   [Note to Jue Wang ... this *might* be why your test that injects
    two errors into the same buffer passed to a write(2) syscall
    failed with this message]
2) Re-order patches so that "Avoid infinite loop" can be backported
   to stable.
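
To make item 1 above concrete, here is a simplified user-space model of
the counting logic. It is a sketch under assumptions, not the kernel code:
struct task_model, its mce_addr field, and both functions are hypothetical
stand-ins, and 4 KiB pages are assumed. Only the mce_count name and the
"different user pages" check come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the per-task machine check state. */
struct task_model {
	int mce_count;
	unsigned long mce_addr;
};

/*
 * Model one machine check at addr. Returns true if it would hit the
 * "Machine checks to different user pages" panic.
 */
static bool model_machine_check(struct task_model *p, unsigned long addr)
{
	assert(p != NULL);

	int count = ++p->mce_count;

	if (count == 1) {
		p->mce_addr = addr;	/* first error since last recovery */
		return false;
	}
	return (p->mce_addr >> MODEL_PAGE_SHIFT) != (addr >> MODEL_PAGE_SHIFT);
}

/* Model the recovery callback; clear_count == false models the v1 bug. */
static void model_recovery_done(struct task_model *p, bool clear_count)
{
	assert(p != NULL);

	if (clear_count)
		p->mce_count = 0;
}
```

With clear_count == false, a second, independent recovery on another page
looks like a consecutive machine check and trips the panic condition. With
the count cleared, each recovery starts from zero again.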

Note that the other two parts of this series depend upon Al Viro's
extensive re-work to lib/iov_iter.c ... so don't try to backport those
without also picking up Al's work.

Tony Luck (3):
  x86/mce: Avoid infinite loop for copy from user recovery
  x86/mce: Change to not send SIGBUS error during copy from user
  x86/mce: Drop copyin special case for #MC

 arch/x86/kernel/cpu/mce/core.c | 62 ++++++++++++++++++++++++----------
 arch/x86/lib/copy_user_64.S    | 13 -------
 include/linux/sched.h          |  1 +
 3 files changed, 45 insertions(+), 31 deletions(-)


base-commit: 7c60610d476766e128cc4284bb6349732cbd6606

Comments

Tony Luck Aug. 18, 2021, 4:14 p.m. UTC | #1
> Changes since v1:
> 1) Fix bug in kill_me_never() that forgot to clear p->mce_count so
>   repeated recovery in the same task would trigger the panic for
>	"Machine checks to different user pages"
>   [Note to Jue Wang ... this *might* be why your test that injects
>    two errors into the same buffer passed to a write(2) syscall
>    failed with this message]

I recreated Jue's specific test today with uncorrected errors in two
pages passed to a write(2) syscall.

	buf = alloc(2 pages);
	inject(buf + 0x440);
	inject(buf + 0x11c0);
	n = write(fd, buf, 8K);

Result was that the write returned 0x440 (i.e. bytes written up to the
first poison location).
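
The expectation behind that result can be written down as a small helper.
This is only a sketch: expected_write_return() is a hypothetical name, and
it assumes the copy stops exactly at the lowest poisoned offset, as it did
in this test.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given the buffer offsets where errors were
 * injected, predict the byte count write(2) returns, i.e. the offset
 * of the first poison, or len if no poison lies inside the buffer.
 */
static size_t expected_write_return(const size_t *poison, size_t npoison,
				    size_t len)
{
	size_t first = len;

	assert(poison != NULL || npoison == 0);

	for (size_t i = 0; i < npoison; i++)
		if (poison[i] < first)
			first = poison[i];
	return first;
}
```

For the offsets 0x440 and 0x11c0 in an 8K buffer this gives 0x440, which
matches the value the write returned.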

-Tony