Message ID | 20230912211824.90952-1-john.allen@amd.com (mailing list archive) |
---|---|
Series | Fix MCE handling on AMD hosts |
On 12/09/2023 22:18, John Allen wrote:
> In the event that a guest process attempts to access memory that has
> been poisoned in response to a deferred uncorrected MCE, an AMD system
> will currently generate a SIGBUS error which will result in the entire
> guest being shutdown. Ideally, we only want to kill the guest process
> that accessed poisoned memory in this case.
>
> This support has been included in qemu for Intel hosts for a long time,
> but there are a couple of changes needed for AMD hosts. First, we will
> need to expose the SUCCOR cpuid bit to guests. Second, we need to modify
> the MCE injection code to avoid Intel specific behavior when we are
> running on an AMD host.
>

Is there any update with respect to this series?

John's series should fix MCE injection on AMD; as today it is just crashing the
guest (sadly) when an MCE happens in the hypervisor.

William, Paolo, I think the sort-of-dependency(?) of this where we block
migration if there was a poisoned page on is already in Peter's migration
tree[1] (CC'ed). So perhaps this series just needs John to resend it given that
it's been a couple months since v4?

[1]
https://lore.kernel.org/qemu-devel/20240130190640.139364-2-william.roche@oracle.com/

> v2:
>   - Add "succor" feature word.
>   - Add case to kvm_arch_get_supported_cpuid for the SUCCOR feature.
>
> v3:
>   - Reorder series. Only enable SUCCOR after bugs have been fixed.
>   - Introduce new patch ignoring AO errors.
>
> v4:
>   - Remove redundant check for AO errors.
>
> John Allen (2):
>   i386: Fix MCE support for AMD hosts
>   i386: Add support for SUCCOR feature
>
> William Roche (1):
>   i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest
>
>  target/i386/cpu.c     | 18 +++++++++++++++++-
>  target/i386/cpu.h     |  4 ++++
>  target/i386/helper.c  |  4 ++++
>  target/i386/kvm/kvm.c | 28 ++++++++++++++++++++--------
>  4 files changed, 45 insertions(+), 9 deletions(-)
>
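
As a reading aid for the cover letter quoted above: Linux delivers memory-failure SIGBUS with an si_code of either BUS_MCEERR_AR (action required, the poisoned page was actually consumed) or BUS_MCEERR_AO (action optional, poison was found asynchronously). The sketch below models only the policy named in the patch titles, where an AO event is ignored when the host is AMD. The enum, the function name classify_sigbus and the host_is_amd parameter are hypothetical and are not taken from the QEMU patches; forwarding AO events on non-AMD hosts is an assumption about the existing behaviour.

```c
#include <signal.h>
#include <stdbool.h>

/* Fallbacks in case the libc headers do not expose these;
 * the values mirror the Linux siginfo definitions. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4   /* action required: poison was consumed */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5   /* action optional: poison found asynchronously */
#endif

/* Hypothetical result type, for illustration only. */
typedef enum {
    MCE_NOT_MCE,   /* SIGBUS unrelated to memory failure */
    MCE_IGNORE,    /* drop the event, do not inject into the guest */
    MCE_INJECT     /* forward the error to the guest as an MCE */
} mce_action;

/* Simplified model of the policy described in the series (assumption):
 * AR errors are always forwarded; AO errors are ignored on AMD hosts. */
static inline mce_action classify_sigbus(int si_code, bool host_is_amd)
{
    switch (si_code) {
    case BUS_MCEERR_AR:
        return MCE_INJECT;
    case BUS_MCEERR_AO:
        return host_is_amd ? MCE_IGNORE : MCE_INJECT;
    default:
        return MCE_NOT_MCE;
    }
}
```

A real SIGBUS handler would pass info->si_code from its siginfo_t argument; the actual handling in target/i386/kvm/kvm.c does considerably more (address translation, poison tracking) than this sketch shows.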
On Wed, Feb 07, 2024 at 11:21:05AM +0000, Joao Martins wrote:
> On 12/09/2023 22:18, John Allen wrote:
> > In the event that a guest process attempts to access memory that has
> > been poisoned in response to a deferred uncorrected MCE, an AMD system
> > will currently generate a SIGBUS error which will result in the entire
> > guest being shutdown. Ideally, we only want to kill the guest process
> > that accessed poisoned memory in this case.
> >
> > This support has been included in qemu for Intel hosts for a long time,
> > but there are a couple of changes needed for AMD hosts. First, we will
> > need to expose the SUCCOR cpuid bit to guests. Second, we need to modify
> > the MCE injection code to avoid Intel specific behavior when we are
> > running on an AMD host.
> >
>
> Is there any update with respect to this series?
>
> John's series should fix MCE injection on AMD; as today it is just crashing the
> guest (sadly) when an MCE happens in the hypervisor.
>
> William, Paolo, I think the sort-of-dependency(?) of this where we block
> migration if there was a poisoned page on is already in Peter's migration
> tree[1] (CC'ed). So perhaps this series just needs John to resend it given that
> it's been a couple months since v4?

It looks like this series still applies cleanly to latest qemu, but I
can resend if needed.

Thanks,
John

>
> [1]
> https://lore.kernel.org/qemu-devel/20240130190640.139364-2-william.roche@oracle.com/
>
> > v2:
> >   - Add "succor" feature word.
> >   - Add case to kvm_arch_get_supported_cpuid for the SUCCOR feature.
> >
> > v3:
> >   - Reorder series. Only enable SUCCOR after bugs have been fixed.
> >   - Introduce new patch ignoring AO errors.
> >
> > v4:
> >   - Remove redundant check for AO errors.
> >
> > John Allen (2):
> >   i386: Fix MCE support for AMD hosts
> >   i386: Add support for SUCCOR feature
> >
> > William Roche (1):
> >   i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest
> >
> >  target/i386/cpu.c     | 18 +++++++++++++++++-
> >  target/i386/cpu.h     |  4 ++++
> >  target/i386/helper.c  |  4 ++++
> >  target/i386/kvm/kvm.c | 28 ++++++++++++++++++++--------
> >  4 files changed, 45 insertions(+), 9 deletions(-)
> >
>
On 20/02/2024 17:27, John Allen wrote:
> On Wed, Feb 07, 2024 at 11:21:05AM +0000, Joao Martins wrote:
>> On 12/09/2023 22:18, John Allen wrote:
>>> In the event that a guest process attempts to access memory that has
>>> been poisoned in response to a deferred uncorrected MCE, an AMD system
>>> will currently generate a SIGBUS error which will result in the entire
>>> guest being shutdown. Ideally, we only want to kill the guest process
>>> that accessed poisoned memory in this case.
>>>
>>> This support has been included in qemu for Intel hosts for a long time,
>>> but there are a couple of changes needed for AMD hosts. First, we will
>>> need to expose the SUCCOR cpuid bit to guests. Second, we need to modify
>>> the MCE injection code to avoid Intel specific behavior when we are
>>> running on an AMD host.
>>>
>>
>> Is there any update with respect to this series?
>>
>> John's series should fix MCE injection on AMD; as today it is just crashing the
>> guest (sadly) when an MCE happens in the hypervisor.
>>
>> William, Paolo, I think the sort-of-dependency(?) of this where we block
>> migration if there was a poisoned page on is already in Peter's migration
>> tree[1] (CC'ed). So perhaps this series just needs John to resend it given that
>> it's been a couple months since v4?
>
> It looks like this series still applies cleanly to latest qemu, but I
> can resend if needed.
>

That's great I suppose. I was hoping Paolo responds, to understand next steps.

There's also the other kernel patch that Paolo suggested[0], to declare the
SUCCOR bit in the kvm supported CPUID? Maybe it's being held up because of that?

[0]
https://lore.kernel.org/qemu-devel/d4c1bb9b-8438-ed00-c79d-e8ad2a7e4eed@redhat.com/

> Thanks,
> John
>
>>
>> [1]
>> https://lore.kernel.org/qemu-devel/20240130190640.139364-2-william.roche@oracle.com/
>>
>>> v2:
>>>   - Add "succor" feature word.
>>>   - Add case to kvm_arch_get_supported_cpuid for the SUCCOR feature.
>>>
>>> v3:
>>>   - Reorder series. Only enable SUCCOR after bugs have been fixed.
>>>   - Introduce new patch ignoring AO errors.
>>>
>>> v4:
>>>   - Remove redundant check for AO errors.
>>>
>>> John Allen (2):
>>>   i386: Fix MCE support for AMD hosts
>>>   i386: Add support for SUCCOR feature
>>>
>>> William Roche (1):
>>>   i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest
>>>
>>>  target/i386/cpu.c     | 18 +++++++++++++++++-
>>>  target/i386/cpu.h     |  4 ++++
>>>  target/i386/helper.c  |  4 ++++
>>>  target/i386/kvm/kvm.c | 28 ++++++++++++++++++++--------
>>>  4 files changed, 45 insertions(+), 9 deletions(-)
>>>
>>
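
For context on the SUCCOR bit discussed in this thread: on AMD processors it is reported in CPUID leaf 0x80000007, EBX bit 1. The following is a minimal sketch of reading that bit, assuming GCC or Clang and their <cpuid.h> helper. The names AMD_RAS_LEAF, RAS_EBX_SUCCOR, ebx_has_succor and cpu_reports_succor are illustrative and do not come from QEMU or the kernel. Run inside a guest, it reports what the hypervisor chose to expose, which is the part the series and the suggested kernel change are concerned with.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang CPUID helpers */
#endif

/* Illustrative names (assumptions), not QEMU or kernel identifiers. */
#define AMD_RAS_LEAF   0x80000007u   /* RAS capabilities leaf */
#define RAS_EBX_SUCCOR (1u << 1)     /* EBX bit 1: SUCCOR */

/* Decode the SUCCOR bit from an EBX value read from the RAS leaf. */
static inline bool ebx_has_succor(uint32_t ebx)
{
    return (ebx & RAS_EBX_SUCCOR) != 0;
}

/* Query the CPU this code runs on; returns false when the leaf is
 * unavailable or when built for a non-x86 target. */
static inline bool cpu_reports_succor(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid() returns 0 if the requested leaf is not supported. */
    if (!__get_cpuid(AMD_RAS_LEAF, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    return ebx_has_succor(ebx);
#else
    return false;
#endif
}
```

Keeping the bit decode separate from the CPUID query lets the decode be checked on any machine, while cpu_reports_succor() depends on the host or guest it runs on.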