[PATCHv2,0/2] log guest name and memory error type AO, AR for MCEs

Message ID 20191009164459.8209-1-msmarduch@digitalocean.com (mailing list archive)

Message

Mario Smarduch Oct. 9, 2019, 4:44 p.m. UTC
In a large VPC environment we want to log memory error occurrences
together with the guest name and error type. There are a few use cases:

- if the VM crashes on an AR MCE, inform the user of the reason and resolve
  the case
- if the VM hangs, notify the user to reboot and resume processing
- if the VM continues to run, let the user know; they may be able to
  correlate the error with a VM-internal outage
- Rowhammer attacks: isolate/identify the attacker, possibly migrating it off
  the hypervisor
- in general, track memory errors on a hypervisor over time to determine trends

Monitoring our fleet, we have come across quite a few of these and have been
able to take action where before there were no clues to the causes.

When a memory error occurs we get an entry in the QEMU log:

Guest [Droplet-12345678] 2019-08-02T05:00:11.940270Z qemu-system-x86_64:
Guest MCE Memory Error at QEMU addr 0x7f3c7622f000 and GUEST 0x78e42f000
addr of type BUS_MCEERR_AR injected

With an enterprise logging environment we can then take further action.
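
As a rough illustration of what a downstream log consumer might do with such
an entry, here is a minimal Python sketch. It assumes the entry has exactly
the field order of the sample above and may arrive wrapped over several lines.
The names MCE_RE and parse_mce_entry are hypothetical and not part of this
series; the authoritative format is whatever the patches emit.

```python
import re

# Hypothetical pattern derived only from the sample entry quoted above.
# The real message format is defined by the patches and may differ.
MCE_RE = re.compile(
    r"Guest \[(?P<guest>[^\]]+)\]\s+"
    r"(?P<timestamp>\S+)\s+"
    r"(?P<binary>[^:\s]+):\s+"
    r"Guest MCE Memory Error at QEMU addr (?P<host_addr>0x[0-9a-fA-F]+)\s+"
    r"and GUEST (?P<guest_addr>0x[0-9a-fA-F]+)\s+"
    r"addr of type (?P<type>BUS_MCEERR_A[OR]) injected"
)


def parse_mce_entry(text):
    """Return a dict of fields from one MCE log entry, or None if no match."""
    flat = " ".join(text.split())  # undo any line wrapping
    match = MCE_RE.search(flat)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the hex strings to integers so they can be compared or sorted.
    fields["host_addr"] = int(fields["host_addr"], 16)
    fields["guest_addr"] = int(fields["guest_addr"], 16)
    return fields


sample = """Guest [Droplet-12345678] 2019-08-02T05:00:11.940270Z qemu-system-x86_64:
Guest MCE Memory Error at QEMU addr 0x7f3c7622f000 and GUEST 0x78e42f000
addr of type BUS_MCEERR_AR injected"""

entry = parse_mce_entry(sample)
print(entry["guest"], entry["type"], hex(entry["guest_addr"]))
```

Keying on the guest name and the AO/AR type is what would let a consumer
route an alert to the right user or count errors per hypervisor over time.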

v1 -> v2:
- split into two patches: one to get the guest name, a second to log MCEs
- addressed comments for MCE logging

Mario Smarduch (2):
  util/qemu-error: add guest name helper with -msg options
  target/i386: log MCE guest and host addresses

 include/qemu/error-report.h |  1 +
 qemu-options.hx             | 10 ++++++----
 target/i386/kvm.c           | 29 ++++++++++++++++++++++++-----
 util/qemu-error.c           | 31 +++++++++++++++++++++++++++++++
 vl.c                        |  5 +++++
 5 files changed, 67 insertions(+), 9 deletions(-)

Comments

Paolo Bonzini Oct. 9, 2019, 9:19 p.m. UTC | #1
Queued, thanks.

Paolo
no-reply@patchew.org Oct. 9, 2019, 10:55 p.m. UTC | #2
Patchew URL: https://patchew.org/QEMU/20191009164459.8209-1-msmarduch@digitalocean.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC      util/hbitmap.o
  CC      util/fifo8.o

Encoding error:
'utf-8' codec can't decode byte 0x95 in position 799: invalid start byte
The full traceback has been saved in /tmp/sphinx-err-qsfcd92y.log, if you want to report the issue to the developers.
  CC      util/cacheinfo.o
---
  CC      util/id.o
  CC      util/iov.o
  CC      util/qemu-config.o
make: *** [Makefile:994: docs/interop/index.html] Error 2
make: *** Waiting for unfinished jobs....
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 662, in <module>
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=c8274a8a922d4d2c81b14b6b5c0902ca', '-u', '1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-vdk7fez9/src/docker-src.2019-10-09-18.53.09.17540:/var/tmp/qemu:z,ro', 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=c8274a8a922d4d2c81b14b6b5c0902ca
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-vdk7fez9/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    2m36.643s
user    0m7.259s


The full log is available at
http://patchew.org/logs/20191009164459.8209-1-msmarduch@digitalocean.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Mario Smarduch Oct. 9, 2019, 11:01 p.m. UTC | #3
On 10/09/2019 02:19 PM, Paolo Bonzini wrote:
> Queued, thanks.
> 
> Paolo
> 

Great, thanks for the fixup and for your time.

- Mario