[v4,0/3] Fix MCE handling on AMD hosts

Message ID 20230912211824.90952-1-john.allen@amd.com

John Allen Sept. 12, 2023, 9:18 p.m. UTC
In the event that a guest process attempts to access memory that has
been poisoned in response to a deferred uncorrected MCE, an AMD system
will currently generate a SIGBUS error, which will result in the entire
guest being shut down. Ideally, we only want to kill the guest process
that accessed poisoned memory in this case.

This support has been included in qemu for Intel hosts for a long time,
but there are a couple of changes needed for AMD hosts. First, we will
need to expose the SUCCOR cpuid bit to guests. Second, we need to modify
the MCE injection code to avoid Intel-specific behavior when we are
running on an AMD host.
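
For readers unfamiliar with the bit in question, here is a minimal sketch
of the check. SUCCOR is reported in CPUID Fn8000_0007 EBX, bit 1. The macro
and function names below are illustrative only and are not taken from the
patches in this series.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * CPUID Fn8000_0007 EBX bit 1: SUCCOR (software uncorrectable error
 * containment and recovery). The macro name is hypothetical.
 */
#define EXAMPLE_CPUID_8000_0007_EBX_SUCCOR (1u << 1)

/* Return true if the given EBX value of leaf 0x80000007 advertises SUCCOR. */
static bool example_has_succor(uint32_t ebx)
{
    return (ebx & EXAMPLE_CPUID_8000_0007_EBX_SUCCOR) != 0;
}
```

A guest that sees this bit set can expect uncorrected errors to be
contained to the consuming process rather than being fatal to the whole
system.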

v2:
  - Add "succor" feature word.
  - Add case to kvm_arch_get_supported_cpuid for the SUCCOR feature.

v3:
  - Reorder series. Only enable SUCCOR after bugs have been fixed.
  - Introduce new patch ignoring AO errors.

v4:
  - Remove redundant check for AO errors.

John Allen (2):
  i386: Fix MCE support for AMD hosts
  i386: Add support for SUCCOR feature

William Roche (1):
  i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest

 target/i386/cpu.c     | 18 +++++++++++++++++-
 target/i386/cpu.h     |  4 ++++
 target/i386/helper.c  |  4 ++++
 target/i386/kvm/kvm.c | 28 ++++++++++++++++++++--------
 4 files changed, 45 insertions(+), 9 deletions(-)
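
As a rough illustration of the policy named in the third patch title above
(ignoring action-optional errors for AMD guests), the decision could be
sketched as below. This is not the code from the series. The enum and
function names are hypothetical, the enum values stand in for the
BUS_MCEERR_AR and BUS_MCEERR_AO si_code values, and the non-AMD branch is
an assumption made only to keep the sketch complete.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for BUS_MCEERR_AR / BUS_MCEERR_AO. */
enum example_mce_code {
    EXAMPLE_MCEERR_AR, /* action required: poisoned memory was consumed */
    EXAMPLE_MCEERR_AO  /* action optional: poison found, not yet consumed */
};

/* Hypothetical guest vendor tag. */
enum example_vendor {
    EXAMPLE_VENDOR_INTEL,
    EXAMPLE_VENDOR_AMD
};

/*
 * Decide whether an MCE should be injected into the guest.
 * Assumption: AO errors are unsupported on AMD guests and are dropped;
 * everything else is forwarded.
 */
static bool example_should_inject(enum example_vendor guest_vendor,
                                  enum example_mce_code code)
{
    if (guest_vendor == EXAMPLE_VENDOR_AMD && code == EXAMPLE_MCEERR_AO) {
        return false;
    }
    return true;
}
```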

Comments

Joao Martins Feb. 7, 2024, 11:21 a.m. UTC | #1
On 12/09/2023 22:18, John Allen wrote:
> In the event that a guest process attempts to access memory that has
> been poisoned in response to a deferred uncorrected MCE, an AMD system
> will currently generate a SIGBUS error, which will result in the entire
> guest being shut down. Ideally, we only want to kill the guest process
> that accessed poisoned memory in this case.
> 
> This support has been included in qemu for Intel hosts for a long time,
> but there are a couple of changes needed for AMD hosts. First, we will
> need to expose the SUCCOR cpuid bit to guests. Second, we need to modify
> the MCE injection code to avoid Intel-specific behavior when we are
> running on an AMD host.
> 

Is there any update with respect to this series?

John's series should fix MCE injection on AMD; as of today, the guest just
crashes (sadly) when an MCE happens in the hypervisor.

William, Paolo, I think the sort-of-dependency(?) of this series, where we
block migration if there is a poisoned page, is already in Peter's migration
tree[1] (CC'ed). So perhaps this series just needs John to resend it, given
that it's been a couple of months since v4?

[1]
https://lore.kernel.org/qemu-devel/20240130190640.139364-2-william.roche@oracle.com/

John Allen Feb. 20, 2024, 5:27 p.m. UTC | #2
On Wed, Feb 07, 2024 at 11:21:05AM +0000, Joao Martins wrote:
> On 12/09/2023 22:18, John Allen wrote:
> > In the event that a guest process attempts to access memory that has
> > been poisoned in response to a deferred uncorrected MCE, an AMD system
> > will currently generate a SIGBUS error, which will result in the entire
> > guest being shut down. Ideally, we only want to kill the guest process
> > that accessed poisoned memory in this case.
> > 
> > This support has been included in qemu for Intel hosts for a long time,
> > but there are a couple of changes needed for AMD hosts. First, we will
> > need to expose the SUCCOR cpuid bit to guests. Second, we need to modify
> > the MCE injection code to avoid Intel-specific behavior when we are
> > running on an AMD host.
> > 
> 
> Is there any update with respect to this series?
> 
> John's series should fix MCE injection on AMD; as of today, the guest just
> crashes (sadly) when an MCE happens in the hypervisor.
> 
> William, Paolo, I think the sort-of-dependency(?) of this series, where we
> block migration if there is a poisoned page, is already in Peter's migration
> tree[1] (CC'ed). So perhaps this series just needs John to resend it, given
> that it's been a couple of months since v4?

It looks like this series still applies cleanly to latest qemu, but I
can resend if needed.

Thanks,
John

> 
> [1]
> https://lore.kernel.org/qemu-devel/20240130190640.139364-2-william.roche@oracle.com/
> 
Joao Martins Feb. 21, 2024, 11:42 a.m. UTC | #3
On 20/02/2024 17:27, John Allen wrote:
> On Wed, Feb 07, 2024 at 11:21:05AM +0000, Joao Martins wrote:
>> On 12/09/2023 22:18, John Allen wrote:
>>> In the event that a guest process attempts to access memory that has
>>> been poisoned in response to a deferred uncorrected MCE, an AMD system
>>> will currently generate a SIGBUS error, which will result in the entire
>>> guest being shut down. Ideally, we only want to kill the guest process
>>> that accessed poisoned memory in this case.
>>>
>>> This support has been included in qemu for Intel hosts for a long time,
>>> but there are a couple of changes needed for AMD hosts. First, we will
>>> need to expose the SUCCOR cpuid bit to guests. Second, we need to modify
>>> the MCE injection code to avoid Intel-specific behavior when we are
>>> running on an AMD host.
>>>
>>
>> Is there any update with respect to this series?
>>
>> John's series should fix MCE injection on AMD; as of today, the guest just
>> crashes (sadly) when an MCE happens in the hypervisor.
>>
>> William, Paolo, I think the sort-of-dependency(?) of this series, where we
>> block migration if there is a poisoned page, is already in Peter's migration
>> tree[1] (CC'ed). So perhaps this series just needs John to resend it, given
>> that it's been a couple of months since v4?
> 
> It looks like this series still applies cleanly to latest qemu, but I
> can resend if needed.
> 
That's great, I suppose.

I was hoping Paolo would respond, so that we understand the next steps.

There's also the other kernel patch that Paolo suggested[0], to declare the
SUCCOR bit in the KVM-supported CPUID. Maybe it's being held up because of that?

[0]
https://lore.kernel.org/qemu-devel/d4c1bb9b-8438-ed00-c79d-e8ad2a7e4eed@redhat.com/
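
To spell out why the kernel side could matter here: a sketch, assuming the
usual pattern in which userspace only exposes feature bits that the kernel
reports as supported. The function and parameter names are hypothetical and
this is not how the series itself is written.

```c
#include <stdint.h>

/*
 * Keep only the requested feature bits that the kernel also reports as
 * supported. Bits missing from kvm_supported are silently dropped, so a
 * feature such as SUCCOR never reaches the guest unless the kernel
 * declares it (or userspace special-cases it).
 */
static uint32_t example_filter_features(uint32_t requested,
                                        uint32_t kvm_supported)
{
    return requested & kvm_supported;
}
```

With SUCCOR at bit 1, requesting 0x2 against a supported mask of 0x0
yields 0x0, which is the situation a kernel-side declaration would avoid.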

> Thanks,
> John
> 
>>
>> [1]
>> https://lore.kernel.org/qemu-devel/20240130190640.139364-2-william.roche@oracle.com/
>>