[00/33] migration: capture error reports into Error object

Message ID	20210204171907.901471-1-berrange@redhat.com (mailing list archive)

Daniel P. Berrangé Feb. 4, 2021, 5:18 p.m. UTC

Due to its long heritage, most of the migration code just invokes
'error_report' when problems hit. This was fine for HMP, since the
messages get redirected from stderr into the HMP console. It is not
OK for QMP because the errors will not be fed back to the QMP client.

This hasn't been a terrible real-world problem with QMP so far because
live migration happens in the background, so at least on the target side
there is no QMP command that needs to capture the incoming migration.
It is a problem on the source side, but it doesn't hit frequently as the
source side has fewer failure scenarios. Nonetheless, on both sides it
would be desirable if 'query-migrate' could report errors correctly.
With the introduction of the snapshot-load QMP command, the need for
error reporting becomes more pressing.

Wiring up good error reporting is a large and difficult job, which
this series does NOT complete. The focus here has been on converting
all methods in savevm.c which have an 'int' return value capable of
reporting errors. This covers most of the infrastructure for controlling
the migration state serialization / protocol.
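
As a rough illustration of the kind of conversion described above (a
minimal sketch, not code taken from the series): the Error type, the
sketch_error_setg() helper, SKETCH_MAGIC and both load_header functions
are simplified stand-ins invented for this example. They only mimic the
shape of the real error_setg() pattern.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for QEMU's Error object, for illustration only. */
typedef struct Error {
    char msg[256];
} Error;

/* Hypothetical stream magic value used by this sketch. */
#define SKETCH_MAGIC 0x1234abcdU

/* Stand-in for error_setg(): store a formatted message in *errp. */
static void sketch_error_setg(Error **errp, const char *fmt, ...)
{
    Error *err;
    va_list ap;

    if (errp == NULL) {
        return;                 /* caller chose to ignore the details */
    }
    assert(*errp == NULL);      /* an error must not be set twice */
    err = calloc(1, sizeof(*err));
    assert(err != NULL);
    va_start(ap, fmt);
    vsnprintf(err->msg, sizeof(err->msg), fmt, ap);
    va_end(ap);
    *errp = err;
}

/* Before: the detail goes to stderr, the caller only sees an errno. */
static int sketch_load_header_old(unsigned int magic)
{
    if (magic != SKETCH_MAGIC) {
        fprintf(stderr, "sketch: bad stream magic 0x%x\n", magic);
        return -22;
    }
    return 0;
}

/* After: the detail travels back to the caller inside the Error. */
static int sketch_load_header_new(unsigned int magic, Error **errp)
{
    if (magic != SKETCH_MAGIC) {
        sketch_error_setg(errp, "bad stream magic 0x%x", magic);
        return -1;
    }
    return 0;
}
```

With the second form, whoever sits at the top of the call chain (a QMP
command handler, for example) decides where the message ends up, rather
than the function that detected the problem.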

The remaining part that is missing error reporting is the set of
callbacks in the VMStateDescription struct, which can return failure
codes but have no "Error **errp" parameter. Thinking about how this
might be dealt with in future, a big bang conversion is likely
non-viable. We'll probably want to introduce a duplicate set of
callbacks with the "Error **errp" parameter and convert impls in
batches, eventually removing the original callbacks. I don't intend
to do that myself in the immediate future.
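
One way the duplicate-callback idea could look, purely as a sketch under
assumptions: SketchVMState, the post_load_errp field, and
sketch_post_load() are hypothetical names made up here, and the Error
type is a simplified stand-in. Only the legacy
"int post_load(void *opaque, int version_id)" shape matches what exists
today.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for QEMU's Error object, for illustration only. */
typedef struct Error {
    char msg[256];
} Error;

/* Stand-in for error_setg(): store a formatted message in *errp. */
static void sketch_error_setg(Error **errp, const char *fmt, ...)
{
    Error *err;
    va_list ap;

    if (errp == NULL) {
        return;
    }
    assert(*errp == NULL);
    err = calloc(1, sizeof(*err));
    assert(err != NULL);
    va_start(ap, fmt);
    vsnprintf(err->msg, sizeof(err->msg), fmt, ap);
    va_end(ap);
    *errp = err;
}

/* Hypothetical description struct carrying both callback flavours. */
typedef struct SketchVMState {
    const char *name;
    /* Legacy callback: can only return a failure code. */
    int (*post_load)(void *opaque, int version_id);
    /* Hypothetical new callback: can describe the failure. */
    int (*post_load_errp)(void *opaque, int version_id, Error **errp);
} SketchVMState;

/* Prefer the converted callback, fall back to the legacy one. */
static int sketch_post_load(const SketchVMState *vmsd, void *opaque,
                            int version_id, Error **errp)
{
    int ret;

    if (vmsd->post_load_errp) {
        return vmsd->post_load_errp(opaque, version_id, errp);
    }
    if (!vmsd->post_load) {
        return 0;
    }
    ret = vmsd->post_load(opaque, version_id);
    if (ret < 0) {
        /* Unconverted device: synthesize a generic message. */
        sketch_error_setg(errp, "post_load failed for '%s': %d",
                          vmsd->name, ret);
    }
    return ret;
}

/* An unconverted device and a converted one, both always failing. */
static int legacy_post_load(void *opaque, int version_id)
{
    (void)opaque;
    (void)version_id;
    return -22;
}

static int converted_post_load(void *opaque, int version_id, Error **errp)
{
    (void)opaque;
    sketch_error_setg(errp, "unsupported version %d", version_id);
    return -1;
}

static const SketchVMState sketch_legacy_dev = {
    .name = "legacy-dev",
    .post_load = legacy_post_load,
};

static const SketchVMState sketch_converted_dev = {
    .name = "converted-dev",
    .post_load_errp = converted_post_load,
};
```

The point of the dispatcher is that callers always get an Error filled
in, so devices can be converted in batches without touching the generic
code again.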

IOW, this patch series probably solves 50% of the problem, but we
still do need the rest to get ideal error reporting.

In doing this savevm conversion I noticed a bunch of places which
see and then ignore errors. I only fixed one or two of them which
were clearly dubious. Other places in savevm.c where it seemed it
was probably ok to ignore errors, I've left using error_report()
on the basis that those are really warnings. Perhaps they could
be changed to warn_report() instead.

There are a lot of patches here, but I felt it was easier to review
for correctness if I converted 1 function at a time. The series
does not necessarily have to be reviewed/applied in 1 go.

Daniel P. Berrangé (33):
  migration: push Error **errp into qemu_loadvm_state()
  migration: push Error **errp into qemu_loadvm_state_header()
  migration: push Error **errp into qemu_loadvm_state_setup()
  migration: push Error **errp into qemu_load_device_state()
  migration: push Error **errp into qemu_loadvm_state_main()
  migration: push Error **errp into qemu_loadvm_section_start_full()
  migration: push Error **errp into qemu_loadvm_section_part_end()
  migration: push Error **errp into loadvm_process_command()
  migration: push Error **errp into loadvm_handle_cmd_packaged()
  migration: push Error **errp into loadvm_postcopy_handle_advise()
  migration: push Error **errp into ram_postcopy_incoming_init()
  migration: push Error **errp into loadvm_postcopy_handle_listen()
  migration: push Error **errp into loadvm_postcopy_handle_run()
  migration: push Error **errp into loadvm_postcopy_ram_handle_discard()
  migration: make loadvm_postcopy_handle_resume() void
  migration: push Error **errp into loadvm_handle_recv_bitmap()
  migration: push Error **errp into loadvm_process_enable_colo()
  migration: push Error **errp into colo_init_ram_cache()
  migration: push Error **errp into check_section_footer()
  migration: push Error **errp into global_state_store()
  migration: remove error reporting from qemu_fopen_bdrv() callers
  migration: push Error **errp into qemu_savevm_state_iterate()
  migration: simplify some error reporting in save_snapshot()
  migration: push Error **errp into qemu_savevm_state_setup()
  migration: push Error **errp into qemu_savevm_state_complete_precopy()
  migration: push Error **errp into
    qemu_savevm_state_complete_precopy_non_iterable()
  migration: push Error **errp into qemu_savevm_state_complete_precopy()
  migration: push Error **errp into qemu_savevm_send_packaged()
  migration: push Error **errp into qemu_savevm_live_state()
  migration: push Error **errp into qemu_save_device_state()
  migration: push Error **errp into qemu_savevm_state_resume_prepare()
  migration: push Error **errp into postcopy_resume_handshake()
  migration: push Error **errp into postcopy_do_resume()

 include/migration/colo.h                      |   2 +-
 include/migration/global_state.h              |   2 +-
 migration/colo.c                              |  12 +-
 migration/global_state.c                      |   6 +-
 migration/migration.c                         |  80 ++-
 migration/postcopy-ram.c                      |   8 +-
 migration/postcopy-ram.h                      |   2 +-
 migration/ram.c                               |  17 +-
 migration/ram.h                               |   4 +-
 migration/savevm.c                            | 594 ++++++++++--------
 migration/savevm.h                            |  23 +-
 .../tests/internal-snapshots-qapi.out         |   3 +-
 12 files changed, 427 insertions(+), 326 deletions(-)

Dr. David Alan Gilbert Feb. 4, 2021, 6:22 p.m. UTC | #1

After this series, what do my errors look like, and where do they end
up?
Do I get my nice backtrace showing that device failed, then that was
part of that one...

Dave


Daniel P. Berrangé Feb. 4, 2021, 7:09 p.m. UTC | #2

On Thu, Feb 04, 2021 at 06:22:49PM +0000, Dr. David Alan Gilbert wrote:
> After this series, what do my errors look like, and where do they end
> up?
> Do I get my nice backtrace showing that device failed, then that was
> part of that one...

It hasn't modified any of the VMStateDescription callbacks so any
of the per-device logic that was printing errors will still be using
error_report to the console as before.

The errors that have changed (at this stage) are only the higher
level ones that are in the generic part of the code. Where those
errors mentioned a device name/ID they still do.

In some of the parts I've modified there will have been multiple
error_report() calls collapsed into one error_setg(), but the ones that
are eliminated are high level generic messages with no useful
info, so I don't think losing those is a problem per se.

The example that I tested was the case where we load a snapshot
under a different config than we saved it with. This is the scenario
that gave the non-deterministic ordering in the iotest you disabled
from my previous series.

In that case, we changed from:

  qemu-system-x86_64: Unknown savevm section or instance '0000:00:02.0/virtio-rng' 0. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices
  {"return": [{"current-progress": 1, "status": "concluded", "total-progress": 1, "type": "snapshot-load", "id": "load-err-stderr", "error": "Error -22 while loading VM state"}]}

To

  {"return": [{"current-progress": 1, "status": "concluded", "total-progress": 1, "type": "snapshot-load", "id": "load-err-stderr", "error": "Unknown savevm section or instance '0000:00:02.0/virtio-rng' 0. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices"}]}

From an HMP loadvm POV, this means instead of seeing

  (hmp)  loadvm foo
  Unknown savevm section or instance '0000:00:02.0/virtio-rng' 0. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices
  Error -22 while loading VM state

You will only see the detailed error message

  (hmp)  loadvm foo
  Unknown savevm section or instance '0000:00:02.0/virtio-rng' 0. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices

In this case I think losing the "Error -22 while loading VM state"
is fine, as it didn't add value IMHO.

If we get around to converting the VMStateDescription callbacks to
take an error object, then I think we'll possibly need to stack the
error message from the callback, with the higher level message.
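
To make the stacking idea concrete, here is a minimal sketch under
assumptions: the Error type and the sketch_* helpers are simplified
stand-ins written for this example (the prepend helper only imitates
the spirit of error_prepend()), and the way the two messages are joined
is made up, not something the series does.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for QEMU's Error object, for illustration only. */
typedef struct Error {
    char msg[256];
} Error;

/* Stand-in for error_setg(): store a formatted message in *errp. */
static void sketch_error_setg(Error **errp, const char *fmt, ...)
{
    Error *err;
    va_list ap;

    if (errp == NULL) {
        return;
    }
    assert(*errp == NULL);
    err = calloc(1, sizeof(*err));
    assert(err != NULL);
    va_start(ap, fmt);
    vsnprintf(err->msg, sizeof(err->msg), fmt, ap);
    va_end(ap);
    *errp = err;
}

/* Put a formatted prefix in front of an already-set message. */
static void sketch_error_prepend(Error **errp, const char *fmt, ...)
{
    char prefix[128];
    char joined[256];
    va_list ap;

    if (errp == NULL || *errp == NULL) {
        return;                 /* nothing to decorate */
    }
    va_start(ap, fmt);
    vsnprintf(prefix, sizeof(prefix), fmt, ap);
    va_end(ap);
    snprintf(joined, sizeof(joined), "%s%s", prefix, (*errp)->msg);
    memcpy((*errp)->msg, joined, sizeof(joined));
}

/* Hypothetical converted per-device callback that always fails. */
static int sketch_device_load(Error **errp)
{
    sketch_error_setg(errp, "Failed to load PCIDevice:config");
    return -1;
}

/* Generic layer: add the device context on top of the device message. */
static int sketch_load_section(const char *idstr, Error **errp)
{
    if (sketch_device_load(errp) < 0) {
        sketch_error_prepend(errp,
                             "error while loading state for device '%s': ",
                             idstr);
        return -1;
    }
    return 0;
}
```

The result is a single message carrying both layers, which is what a
QMP client would need since it cannot see a sequence of separate
stderr lines.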

Do you have any familiar/good examples of error message stacking I
can look at? I should be able to say whether they would be impacted
by this series or not. If they are, then I hopefully only threw away
the fairly useless high level messages, like the "Error -22" message
above.

Regards,
Daniel

Dr. David Alan Gilbert Feb. 8, 2021, 1:29 p.m. UTC | #3

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> Do you have any familiar/good examples of error message stacking I
> can look at ?  I should be able to say whether they would be impacted
> by this series or not - if they are, then I hopefully only threw away
> the fairly useless high level messages, like the "Error -22" message
> above.

Can you try migrating:
  ./x86_64-softmmu/qemu-system-x86_64 -M pc -nographic -device virtio-rng,disable-modern=true
to
  ./x86_64-softmmu/qemu-system-x86_64 -M pc -nographic -device virtio-rng

What I currently get is:
qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x6 read: 0 device: 10 cmask: 10 wmask: 0 w1cmask:0
qemu-system-x86_64: Failed to load PCIDevice:config
qemu-system-x86_64: Failed to load virtio-rng:virtio
qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-rng'
qemu-system-x86_64: load of migration failed: Invalid argument

Dave


Daniel P. Berrangé Feb. 8, 2021, 1:42 p.m. UTC | #4

On Mon, Feb 08, 2021 at 01:29:03PM +0000, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Thu, Feb 04, 2021 at 06:22:49PM +0000, Dr. David Alan Gilbert wrote:
> > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > Due to its long term heritage most of the migration code just invokes
> > > > 'error_report' when problems hit. This was fine for HMP, since the
> > > > messages get redirected from stderr, into the HMP console. It is not
> > > > OK for QMP because the errors will not be fed back to the QMP client.
> > > > 
> > > > This wasn't a terrible real world problem with QMP so far because
> > > > live migration happens in the background, so at least on the target side
> > > > there is not a QMP command that needs to capture the incoming migration.
> > > > It is a problem on the source side but it doesn't hit frequently as the
> > > > source side has fewer failure scenarios. None the less on both sides it
> > > > would be desirable if 'query-migrate' can report errors correctly.
> > > > With the introduction of the load-snapshot QMP commands, the need for
> > > > error reporting becomes more pressing.
> > > > 
> > > > Wiring up good error reporting is a large and difficult job, which
> > > > this series does NOT complete. The focus here has been on converting
> > > > all methods in savevm.c which have an 'int' return value capable of
> > > > reporting errors. This covers most of the infrastructure for controlling
> > > > the migration state serialization / protocol.
> > > > 
> > > > The remaining part that is missing error reporting are the callbacks in
> > > > the VMStateDescription struct which can return failure codes, but have
> > > > no "Error **errp" parameter. Thinking about how this might be dealt with
> > > > in future, a big bang conversion is likely non-viable. We'll probably
> > > > want to introduce a duplicate set of callbacks with the "Error **errp"
> > > > parameter and convert impls in batches, eventually removing the
> > > > original callbacks. I don't intend todo that myself in the immediate
> > > > future.
> > > > 
> > > > IOW, this patch series probably solves 50% of the problem, but we
> > > > still do need the rest to get ideal error reporting.
> > > > 
> > > > In doing this savevm conversion I noticed a bunch of places which
> > > > see and then ignore errors. I only fixed one or two of them which
> > > > were clearly dubious. Other places in savevm.c where it seemed it
> > > > was probably ok to ignore errors, I've left using error_report()
> > > > on the basis that those are really warnings. Perhaps they could
> > > > be changed to warn_report() instead.
> > > > 
> > > > There are alot of patches here, but I felt it was easier to review
> > > > for correctness if I converted 1 function at a time. The series
> > > > does not neccessarily have to be reviewed/appied in 1 go.
> > > 
> > > After this series, what do my errors look like, and where do they end
> > > up?
> > > Do I get my nice backtrace shwoing that device failed, then that was
> > > part of that one...
> > 
> > It hasn't modified any of the VMStateDescription callbacks so any
> > of the per-device logic that was printing errors will still be using
> > error_report to the console as before.
> > 
> > The errors that have changed (at this stage) are only the higher
> > level ones that are in the generic part of the code. Where those
> > errors mentioned a device name/ID they still do.
> > 
> > In some of the parts I've modified there will have been multiple
> > error_reports collapsed into one error_setg() but the ones that
> > are eliminated are high level generic messages with no useful
> > info, so I don't think loosing those is a problem per-se.
> > 
> > The example that I tested was the case where we load a snapshot
> > under a different config that we saved it with. This is the scenario
> > that gave the non-deterministic ordering in the iotest you disabled
> > from my previous series.
> > 
> > In that case, we changed from:
> > 
> >   qemu-system-x86_64: Unknown savevm section or instance '0000:00:02.0/virtio-rng' 0. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices
> >   {"return": [{"current-progress": 1, "status": "concluded", "total-progress": 1, "type": "snapshot-load", "id": "load-err-stderr", "error": "Error -22 while loading VM state"}]}
> > 
> > To
> > 
> >   {"return": [{"current-progress": 1, "status": "concluded", "total-progress": 1, "type": "snapshot-load", "id": "load-err-stderr", "error": "Unknown savevm section or instance '0000:00:02.0/virtio-rng' 0. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices"}]}
> > 
> > From a HMP loadvm POV, this means instead of seeing
> > 
> >   (hmp)  loadvm foo
> >   Unknown savevm section or instance '0000:00:02.0/virtio-rng' 0. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices
> >   Error -22 while loading VM state
> > 
> > You will only see the detailed error message
> > 
> >   (hmp)  loadvm foo
> >   Unknown savevm section or instance '0000:00:02.0/virtio-rng' 0. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices
> > 
> > In this case I think losing the "Error -22 while loading VM state"
> > is fine, as it didn't add value IMHO.
> > 
> > 
> > If we get around to converting the VMStateDescription callbacks to
> > take an error object, then I think we'll possibly need to stack the
> > error message from the callback, with the higher level message.
> > 
> > Do you have any familiar/good examples of error message stacking I
> > can look at ?  I should be able to say whether they would be impacted
> > by this series or not - if they are, then I hopefully only threw away
> > the fairly useless high level messages, like the "Error -22" message
> > above.
> 
> Can you try migrating:
>   ./x86_64-softmmu/qemu-system-x86_64 -M pc -nographic -device virtio-rng,disable-modern=true
> to
>   ./x86_64-softmmu/qemu-system-x86_64 -M pc -nographic -device virtio-rng
> 
> what I currently get is:
> qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x6 read: 0 device: 10 cmask: 10 wmask: 0 w1cmask:0
> qemu-system-x86_64: Failed to load PCIDevice:config
> qemu-system-x86_64: Failed to load virtio-rng:virtio
> qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-rng'
> qemu-system-x86_64: load of migration failed: Invalid argument

After my patches the very last line is gone.

So, the first 3 are still reported using error_report():

 qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x6 read: 0 device: 10 cmask: 10 wmask: 0 w1cmask:0
 qemu-system-x86_64: Failed to load PCIDevice:config
 qemu-system-x86_64: Failed to load virtio-rng:virtio

Then reported in process_incoming_migration_co() using the message
populated in the Error object, using error_report_err():

 qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-rng'

Finally, this is no longer reported:

 qemu-system-x86_64: load of migration failed: Invalid argument

So in this case we've not lost any useful information.
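
For readers less familiar with the pattern, here is a minimal sketch of
what the conversion does in the generic layer. It does not use the real
QEMU Error API: DemoError, demo_error_setg(), demo_load_section() and
demo_incoming() are hypothetical stand-ins invented for this example.
The point is only that the low-level function fills in one specific
message, and the top-level caller reports that message once instead of
adding a generic "load of migration failed" line.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an Error object; not the real QEMU type. */
typedef struct DemoError {
    char msg[256];
} DemoError;

/* Fill *errp with a formatted message, in the spirit of error_setg(). */
void demo_error_setg(DemoError **errp, const char *fmt, ...)
{
    va_list ap;
    DemoError *err;

    if (!errp) {
        return;                 /* caller does not want the details */
    }
    assert(*errp == NULL);      /* an error may only be set once */
    err = calloc(1, sizeof(*err));
    assert(err != NULL);
    va_start(ap, fmt);
    vsnprintf(err->msg, sizeof(err->msg), fmt, ap);
    va_end(ap);
    *errp = err;
}

/* Generic loader: returns -1 and sets *errp when the section is bad. */
int demo_load_section(const char *idstr, unsigned int instance,
                      int bad, DemoError **errp)
{
    if (bad) {
        demo_error_setg(errp,
                        "error while loading state for instance 0x%x "
                        "of device '%s'", instance, idstr);
        return -1;
    }
    return 0;
}

/*
 * Top-level caller: copies the single message it would report into
 * 'out' (empty on success). Returns 0 on success, -1 on failure.
 */
int demo_incoming(const char *idstr, int bad, char *out, size_t outlen)
{
    DemoError *err = NULL;

    out[0] = '\0';
    if (demo_load_section(idstr, 0, bad, &err) < 0) {
        assert(strlen(err->msg) > 0);   /* a set error carries text */
        snprintf(out, outlen, "%s", err->msg);
        free(err);
        return -1;
    }
    return 0;
}
```

Calling demo_incoming() with bad=1 leaves only the device-specific
message in 'out'. Whether it then goes to stderr or to a QMP reply is
up to the caller, which is the flexibility the series is after.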

Regards,
Daniel

Dr. David Alan Gilbert Feb. 8, 2021, 2:29 p.m. UTC | #5

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Mon, Feb 08, 2021 at 01:29:03PM +0000, Dr. David Alan Gilbert wrote:
> > 
> > Can you try migrating:
> >   ./x86_64-softmmu/qemu-system-x86_64 -M pc -nographic -device virtio-rng,disable-modern=true
> > to
> >   ./x86_64-softmmu/qemu-system-x86_64 -M pc -nographic -device virtio-rng
> > 
> > what I currently get is:
> > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x6 read: 0 device: 10 cmask: 10 wmask: 0 w1cmask:0
> > qemu-system-x86_64: Failed to load PCIDevice:config
> > qemu-system-x86_64: Failed to load virtio-rng:virtio
> > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-rng'
> > qemu-system-x86_64: load of migration failed: Invalid argument
> 
> After my patches the very last line is gone.
> 
> So, the first 3 are still reported using error_report():
> 
>  qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x6 read: 0 device: 10 cmask: 10 wmask: 0 w1cmask:0
>  qemu-system-x86_64: Failed to load PCIDevice:config
>  qemu-system-x86_64: Failed to load virtio-rng:virtio

So those are still ending up in the stderr/log ?

> Then reported in process_incoming_migration_co() using the message
> populated in the Error object, using error_report_err():
> 
>  qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-rng'

Does that mean we've not got that error associated with the others?  It
could be a pain where we've got multiple devices (e.g. NICs or storage)
and need to realise which one is failing.

> Finally, this is no longer reported:
> 
>  qemu-system-x86_64: load of migration failed: Invalid argument
> 
> So in this case we've not lost any useful information

You occasionally get things other than Invalid argument; in
particular you get EIO; it can help you determine if the source killed
the migration connection first.

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Daniel P. Berrangé Feb. 8, 2021, 2:36 p.m. UTC | #6

On Mon, Feb 08, 2021 at 02:29:41PM +0000, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Mon, Feb 08, 2021 at 01:29:03PM +0000, Dr. David Alan Gilbert wrote:
> > > 
> > > Can you try migrating:
> > >   ./x86_64-softmmu/qemu-system-x86_64 -M pc -nographic -device virtio-rng,disable-modern=true
> > > to
> > >   ./x86_64-softmmu/qemu-system-x86_64 -M pc -nographic -device virtio-rng
> > > 
> > > what I currently get is:
> > > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x6 read: 0 device: 10 cmask: 10 wmask: 0 w1cmask:0
> > > qemu-system-x86_64: Failed to load PCIDevice:config
> > > qemu-system-x86_64: Failed to load virtio-rng:virtio
> > > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-rng'
> > > qemu-system-x86_64: load of migration failed: Invalid argument
> > 
> > After my patches the very last line is gone.
> > 
> > So, the first 3 are still reported using error_report():
> > 
> >  qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x6 read: 0 device: 10 cmask: 10 wmask: 0 w1cmask:0
> >  qemu-system-x86_64: Failed to load PCIDevice:config
> >  qemu-system-x86_64: Failed to load virtio-rng:virtio
> 
> So those are still ending up in the stderr/log ?

yes.

> > Then reported in process_incoming_migration_co() using the message
> > populated in the Error object, using error_report_err():
> > 
> >  qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-rng'
> 
> Does that mean we've not got that error associated with the others?  It
> could be a pain where we've got multiple devices (e.g. NICs or storage)
> and need to realise which one is failing.

In the case of migration, this message will still get put into stderr
with the others.

In the case of HMP "loadvm", this message will also still get into
stderr with the others.

In the case of QMP "load-snapshot", this message will get reported
back to the app via the "query-jobs" error field, and not appear on
stderr.  Obviously long term it would be preferable if we can get
all the other messages chained up into the Error object too, so we
get the full set in one place.
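
As a rough illustration of what "chained up" could look like, here is a
small sketch in which each layer puts its own context in front of the
message it received from the layer below. DemoChain, demo_chain_set()
and demo_chain_prepend() are hypothetical names made up for this
example, and the resulting message format is an assumption, not
something this series produces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MSG_MAX 512

/* Hypothetical stand-in for an Error object holding one chained message. */
typedef struct DemoChain {
    char msg[DEMO_MSG_MAX];
    int is_set;
} DemoChain;

/* Record the innermost failure (what the device callback knows). */
void demo_chain_set(DemoChain *err, const char *msg)
{
    assert(!err->is_set);
    snprintf(err->msg, sizeof(err->msg), "%s", msg);
    err->is_set = 1;
}

/* Put higher-level context in front of the message already stored. */
void demo_chain_prepend(DemoChain *err, const char *context)
{
    char tmp[DEMO_MSG_MAX];

    assert(err->is_set);
    /* Build the combined text in a scratch buffer, then copy it back. */
    snprintf(tmp, sizeof(tmp), "%s: %s", context, err->msg);
    memcpy(err->msg, tmp, strlen(tmp) + 1);
}
```

With something like this, the three lines that currently go to stderr
separately would travel together in one object, so a QMP client would
see the whole chain rather than just the outermost message.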

> 
> > Finally, this is no longer reported:
> > 
> >  qemu-system-x86_64: load of migration failed: Invalid argument
> > 
> > So in this case we've not lost any useful information
> 
> You occasionally get things other than Invalid argument; in
> particular you get EIO; it can help you determine if the source killed
> the migration connection first.

All the places which checked qemu_file_get_error() and reported the
errno should still be turned into Error objects, so I believe we
should still get reports of the EIO scenario.
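
A minimal sketch of the idea, assuming a negative errno-style return
code from the file layer. DemoFileError and demo_file_error_from_ret()
are hypothetical names for this example only. The errno text is kept in
the message, so EIO stays distinguishable from EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an Error that keeps the text and the errno. */
typedef struct DemoFileError {
    int os_errno;       /* positive errno value, 0 when unset */
    char msg[256];
} DemoFileError;

/*
 * Turn a negative errno-style return code into an error with text.
 * Returns 0 (and clears 'err') when 'ret' is not an error, -1 otherwise.
 */
int demo_file_error_from_ret(DemoFileError *err, int ret, const char *what)
{
    assert(err != NULL);
    if (ret >= 0) {
        err->os_errno = 0;
        err->msg[0] = '\0';
        return 0;
    }
    err->os_errno = -ret;
    /* Append the errno description so the cause is not lost. */
    snprintf(err->msg, sizeof(err->msg), "%s: %s", what, strerror(-ret));
    return -1;
}
```

Passing -EIO here yields a message ending in the platform's description
of EIO, which is the piece of information Dave wants to keep.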

Regards,
Daniel

Dr. David Alan Gilbert Feb. 15, 2021, 6:38 p.m. UTC | #7

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Mon, Feb 08, 2021 at 01:29:03PM +0000, Dr. David Alan Gilbert wrote:
> > 
> > Can you try migrating:
> >   ./x86_64-softmmu/qemu-system-x86_64 -M pc -nographic -device virtio-rng,disable-modern=true
> > to
> >   ./x86_64-softmmu/qemu-system-x86_64 -M pc -nographic -device virtio-rng
> > 
> > what I currently get is:
> > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x6 read: 0 device: 10 cmask: 10 wmask: 0 w1cmask:0
> > qemu-system-x86_64: Failed to load PCIDevice:config
> > qemu-system-x86_64: Failed to load virtio-rng:virtio
> > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-rng'
> > qemu-system-x86_64: load of migration failed: Invalid argument
> 
> After my patches the very last line is gone.
> 
> So, the first 3 are still reported using error_report():
> 
>  qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x6 read: 0 device: 10 cmask: 10 wmask: 0 w1cmask:0
>  qemu-system-x86_64: Failed to load PCIDevice:config
>  qemu-system-x86_64: Failed to load virtio-rng:virtio
> 
> Then reported in process_incoming_migration_co() using the message
> populated in the Error object, using error_report_err():
> 
>  qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-rng'
> 
> Finally, this is no longer reported:
> 
>  qemu-system-x86_64: load of migration failed: Invalid argument
> 
> So in this case we've not lost any useful information

One thing to check, and I *think* you're OK, but we have one place where
we actually check the error number:

migration.c:
static MigThrError migration_detect_error(MigrationState *s)
...
    /* Try to detect any file errors */
    ret = qemu_file_get_error_obj(s->to_dst_file, &local_error);
    if (!ret) {
        /* Everything is fine */
        assert(!local_error);
        return MIG_THR_ERR_NONE;
    }

    if (local_error) {
        migrate_set_error(s, local_error);
        error_free(local_error);
    }

    if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret == -EIO) {
        /*
         * For postcopy, we allow the network to be down for a
         * while. After that, it can be continued by a
         * recovery phase.
         */
        return postcopy_pause(s);
    } else {

This is to go into postcopy pause if the network connection broke (but
not if, for example, a device moaned about being in an invalid state).
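
Reduced to its decision logic, the check looks roughly like the sketch
below. The enum names and demo_detect_error() are hypothetical
simplifications, and treating every other failure as fatal is an
assumption, since the else branch is not shown in the excerpt above.
What matters is that the numeric error code from the file layer has to
survive for this test to keep working.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, simplified stand-ins for the migration state and outcome. */
typedef enum {
    DEMO_STATUS_ACTIVE,
    DEMO_STATUS_POSTCOPY_ACTIVE
} DemoStatus;

typedef enum {
    DEMO_ERR_NONE,      /* no file error seen */
    DEMO_ERR_PAUSED,    /* postcopy paused, waiting for recovery */
    DEMO_ERR_FATAL      /* assumed outcome for everything else */
} DemoThrError;

/* Decide what to do with an errno-style result from the file layer. */
DemoThrError demo_detect_error(DemoStatus state, int file_ret)
{
    assert(file_ret <= 0);      /* errno-style: zero or negative */

    if (file_ret == 0) {
        return DEMO_ERR_NONE;
    }
    if (state == DEMO_STATUS_POSTCOPY_ACTIVE && file_ret == -EIO) {
        /* Broken connection during postcopy: pause instead of failing. */
        return DEMO_ERR_PAUSED;
    }
    return DEMO_ERR_FATAL;
}
```

A device complaining about invalid state would show up as something
like -EINVAL and take the last branch, while a dropped connection in
postcopy takes the pause branch.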

If I read this correctly, file errors are still being preserved - is
that correct?

Dave


Daniel P. Berrangé Feb. 15, 2021, 6:58 p.m. UTC | #8

On Mon, Feb 15, 2021 at 06:38:05PM +0000, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Mon, Feb 08, 2021 at 01:29:03PM +0000, Dr. David Alan Gilbert wrote:
> > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > On Thu, Feb 04, 2021 at 06:22:49PM +0000, Dr. David Alan Gilbert wrote:
> > > > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > > > Due to its long term heritage most of the migration code just invokes
> > > > > > 'error_report' when problems hit. This was fine for HMP, since the
> > > > > > messages get redirected from stderr, into the HMP console. It is not
> > > > > > OK for QMP because the errors will not be fed back to the QMP client.
> > > > > > 
> > > > > > This wasn't a terrible real world problem with QMP so far because
> > > > > > live migration happens in the background, so at least on the target side
> > > > > > there is not a QMP command that needs to capture the incoming migration.
> > > > > > It is a problem on the source side but it doesn't hit frequently as the
> > > > > > source side has fewer failure scenarios. None the less on both sides it
> > > > > > would be desirable if 'query-migrate' can report errors correctly.
> > > > > > With the introduction of the load-snapshot QMP commands, the need for
> > > > > > error reporting becomes more pressing.
> > > > > > 
> > > > > > Wiring up good error reporting is a large and difficult job, which
> > > > > > this series does NOT complete. The focus here has been on converting
> > > > > > all methods in savevm.c which have an 'int' return value capable of
> > > > > > reporting errors. This covers most of the infrastructure for controlling
> > > > > > the migration state serialization / protocol.
> > > > > > 
> > > > > > The remaining part that is missing error reporting are the callbacks in
> > > > > > the VMStateDescription struct which can return failure codes, but have
> > > > > > no "Error **errp" parameter. Thinking about how this might be dealt with
> > > > > > in future, a big bang conversion is likely non-viable. We'll probably
> > > > > > want to introduce a duplicate set of callbacks with the "Error **errp"
> > > > > > parameter and convert impls in batches, eventually removing the
> > > > > > original callbacks. I don't intend todo that myself in the immediate
> > > > > > future.
> > > > > > 
> > > > > > IOW, this patch series probably solves 50% of the problem, but we
> > > > > > still do need the rest to get ideal error reporting.
> > > > > > 
> > > > > > In doing this savevm conversion I noticed a bunch of places which
> > > > > > see and then ignore errors. I only fixed one or two of them which
> > > > > > were clearly dubious. Other places in savevm.c where it seemed it
> > > > > > was probably ok to ignore errors, I've left using error_report()
> > > > > > on the basis that those are really warnings. Perhaps they could
> > > > > > be changed to warn_report() instead.
> > > > > > 
> > > > > > There are a lot of patches here, but I felt it was easier to review
> > > > > > for correctness if I converted 1 function at a time. The series
> > > > > > does not necessarily have to be reviewed/applied in 1 go.
> > > > > 
> > > > > After this series, what do my errors look like, and where do they end
> > > > > up?
> > > > > Do I get my nice backtrace showing that device failed, then that was
> > > > > part of that one...
> > > > 
> > > > It hasn't modified any of the VMStateDescription callbacks so any
> > > > of the per-device logic that was printing errors will still be using
> > > > error_report to the console as before.
> > > > 
> > > > The errors that have changed (at this stage) are only the higher
> > > > level ones that are in the generic part of the code. Where those
> > > > errors mentioned a device name/ID they still do.
> > > > 
> > > > In some of the parts I've modified there will have been multiple
> > > > error_reports collapsed into one error_setg() but the ones that
> > > > are eliminated are high level generic messages with no useful
> > > > info, so I don't think losing those is a problem per-se.
> > > > 
> > > > The example that I tested was the case where we load a snapshot
> > > > under a different config than we saved it with. This is the scenario
> > > > that gave the non-deterministic ordering in the iotest you disabled
> > > > from my previous series.
> > > > 
> > > > In that case, we changed from:
> > > > 
> > > >   qemu-system-x86_64: Unknown savevm section or instance '0000:00:02.0/virtio-rng' 0. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices
> > > >   {"return": [{"current-progress": 1, "status": "concluded", "total-progress": 1, "type": "snapshot-load", "id": "load-err-stderr", "error": "Error -22 while loading VM state"}]}
> > > > 
> > > > To
> > > > 
> > > >   {"return": [{"current-progress": 1, "status": "concluded", "total-progress": 1, "type": "snapshot-load", "id": "load-err-stderr", "error": "Unknown savevm section or instance '0000:00:02.0/virtio-rng' 0. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices"}]}
> > > > 
> > > > From a HMP loadvm POV, this means instead of seeing
> > > > 
> > > >   (hmp)  loadvm foo
> > > >   Unknown savevm section or instance '0000:00:02.0/virtio-rng' 0. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices
> > > >   Error -22 while loading VM state
> > > > 
> > > > You will only see the detailed error message
> > > > 
> > > >   (hmp)  loadvm foo
> > > >   Unknown savevm section or instance '0000:00:02.0/virtio-rng' 0. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices
> > > > 
> > > > In this case I think losing the "Error -22 while loading VM state"
> > > > is fine, as it didn't add value IMHO.
> > > > 
> > > > 
> > > > If we get around to converting the VMStateDescription callbacks to
> > > > take an error object, then I think we'll possibly need to stack the
> > > > error message from the callback, with the higher level message.
> > > > 
> > > > Do you have any familiar/good examples of error message stacking I
> > > > can look at ?  I should be able to say whether they would be impacted
> > > > by this series or not - if they are, then I hopefully only threw away
> > > > the fairly useless high level messages, like the "Error -22" message
> > > > above.
> > > 
> > > Can you try migrating:
> > >   ./x86_64-softmmu/qemu-system-x86_64 -M pc -nographic -device virtio-rng,disable-modern=true
> > > to
> > >   ./x86_64-softmmu/qemu-system-x86_64 -M pc -nographic -device virtio-rng
> > > 
> > > what I currently get is:
> > > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x6 read: 0 device: 10 cmask: 10 wmask: 0 w1cmask:0
> > > qemu-system-x86_64: Failed to load PCIDevice:config
> > > qemu-system-x86_64: Failed to load virtio-rng:virtio
> > > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-rng'
> > > qemu-system-x86_64: load of migration failed: Invalid argument
> > 
> > After my patches the very last line is gone.
> > 
> > So, still reported using error_report() are the first 3:
> > 
> >  qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x6 read: 0 device: 10 cmask: 10 wmask: 0 w1cmask:0
> >  qemu-system-x86_64: Failed to load PCIDevice:config
> >  qemu-system-x86_64: Failed to load virtio-rng:virtio
> > 
> > Then reported in process_incoming_migration_co() using the message
> > populated in the Error object, using error_report_err():
> > 
> >  qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-rng'
> > 
> > Finally, this is no longer reported:
> > 
> >  qemu-system-x86_64: load of migration failed: Invalid argument
> > 
> > So in this case we've not lost any useful information
> 
> One thing to check, and I *think* you're OK, but we have one place where
> we actually check the error number:
> 
> migration.c:
> static MigThrError migration_detect_error(MigrationState *s)
> ...
>     /* Try to detect any file errors */
>     ret = qemu_file_get_error_obj(s->to_dst_file, &local_error);
>     if (!ret) {
>         /* Everything is fine */
>         assert(!local_error);
>         return MIG_THR_ERR_NONE;
>     }
> 
>     if (local_error) {
>         migrate_set_error(s, local_error);
>         error_free(local_error);
>     }
> 
>     if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret == -EIO) {
>         /*
>          * For postcopy, we allow the network to be down for a
>          * while. After that, it can be continued by a
>          * recovery phase.
>          */
>         return postcopy_pause(s);
>     } else {
> 
> This is to go into postcopy pause if the network connection broke (but
> not if for example a device moaned about being in an invalid state)
> 
> If I read this correctly, file errors are still being preserved - is
> that correct?

Yes, in places where QEMUFile is reporting an actual I/O error I've
tried to preserve that. I've only removed the setting of fake I/O errors.
So if anything, we ought to get more accurate at detecting the recoverable
scenarios once we fully clean up errors.
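
As a rough illustration of why dropping fake I/O errors helps the
postcopy-pause check, here is a toy model in C. It is only a sketch:
ToyFile, ToyError and every function below are hypothetical stand-ins,
not QEMU's QEMUFile or Error APIs, and the choice of -EIO as the "fake"
code is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a migration stream (not QEMU's QEMUFile). */
typedef struct {
    int last_error;   /* 0, or a negative errno recorded on the stream */
} ToyFile;

/* Hypothetical stand-in for an Error object: just a message buffer. */
typedef struct {
    char msg[128];
    bool set;
} ToyError;

static void toy_error_set(ToyError *err, const char *msg)
{
    assert(err != NULL && !err->set);   /* only the first error is stored */
    snprintf(err->msg, sizeof(err->msg), "%s", msg);
    err->set = true;
}

/* Old style (assumed): bad data is printed and flagged as an I/O error. */
static int load_section_old(ToyFile *f, bool data_ok)
{
    if (!data_ok) {
        fprintf(stderr, "bad section data\n");
        f->last_error = -EIO;           /* fake: the transport is fine */
        return -EIO;
    }
    return 0;
}

/* New style: bad data goes into the error object, the stream is untouched. */
static int load_section_new(ToyFile *f, bool data_ok, ToyError *err)
{
    (void)f;
    if (!data_ok) {
        toy_error_set(err, "bad section data");
        return -EINVAL;
    }
    return 0;
}

/* Same shape as the "ret == -EIO" test in the quoted excerpt. */
static bool looks_recoverable(const ToyFile *f)
{
    return f->last_error == -EIO;
}
```

With the old style, a device complaining about bad data is
indistinguishable from a dropped connection; with the new style only a
genuine stream failure would satisfy looks_recoverable().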


Regards,
Daniel

Dr. David Alan Gilbert Feb. 15, 2021, 7:01 p.m. UTC | #9

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Mon, Feb 15, 2021 at 06:38:05PM +0000, Dr. David Alan Gilbert wrote:
> > One thing to check, and I *think* you're OK, but we have one place where
> > we actually check the error number:
> > 
> > migration.c:
> > static MigThrError migration_detect_error(MigrationState *s)
> > ...
> >     /* Try to detect any file errors */
> >     ret = qemu_file_get_error_obj(s->to_dst_file, &local_error);
> >     if (!ret) {
> >         /* Everything is fine */
> >         assert(!local_error);
> >         return MIG_THR_ERR_NONE;
> >     }
> > 
> >     if (local_error) {
> >         migrate_set_error(s, local_error);
> >         error_free(local_error);
> >     }
> > 
> >     if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret == -EIO) {
> >         /*
> >          * For postcopy, we allow the network to be down for a
> >          * while. After that, it can be continued by a
> >          * recovery phase.
> >          */
> >         return postcopy_pause(s);
> >     } else {
> > 
> > This is to go into postcopy pause if the network connection broke (but
> > not if for example a device moaned about being in an invalid state)
> > 
> > If I read this correctly, file errors are still being preserved - is
> > that correct?
> 
> Yes, in places where QemuFile is reporting an actual I/O error I've
> tried to preserve that. Only removed setting of fake I/O errors. So
> if anything, we ought to get more accurate at detecting the recoverable
> scenarios once we fully cleanup errors.

OK, good.

Dave


Daniel P. Berrangé Feb. 16, 2021, 9:30 a.m. UTC | #10

On Mon, Feb 15, 2021 at 07:01:28PM +0000, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Mon, Feb 15, 2021 at 06:38:05PM +0000, Dr. David Alan Gilbert wrote:
> > > One thing to check, and I *think* you're OK, but we have one place where
> > > we actually check the error number:
> > > 
> > > migration.c:
> > > static MigThrError migration_detect_error(MigrationState *s)
> > > ...
> > >     /* Try to detect any file errors */
> > >     ret = qemu_file_get_error_obj(s->to_dst_file, &local_error);
> > >     if (!ret) {
> > >         /* Everything is fine */
> > >         assert(!local_error);
> > >         return MIG_THR_ERR_NONE;
> > >     }
> > > 
> > >     if (local_error) {
> > >         migrate_set_error(s, local_error);
> > >         error_free(local_error);
> > >     }
> > > 
> > >     if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret == -EIO) {
> > >         /*
> > >          * For postcopy, we allow the network to be down for a
> > >          * while. After that, it can be continued by a
> > >          * recovery phase.
> > >          */
> > >         return postcopy_pause(s);
> > >     } else {
> > > 
> > > This is to go into postcopy pause if the network connection broke (but
> > > not if for example a device moaned about being in an invalid state)
> > > 
> > > If I read this correctly, file errors are still being preserved - is
> > > that correct?
> > 
> > Yes, in places where QemuFile is reporting an actual I/O error I've
> > tried to preserve that. Only removed setting of fake I/O errors. So
> > if anything, we ought to get more accurate at detecting the recoverable
> > scenarios once we fully cleanup errors.
> 
> OK, good.

One scenario worth checking, though, is that in a few places we used
error_report_err() but didn't immediately return an error code back to
the caller, instead carrying on with other calls. It is possible that
we thus reported an error about bad data, and then later hit the EIO
check for QEMUFile.
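
To make that ordering concrete, here is a small toy sketch in C. All
names (ToyStream, LoadOutcome, toy_load) and the step encoding are
hypothetical and invented for the example; it does not model the real
savevm.c call paths.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stream with a single recorded error code. */
typedef struct {
    int last_error;
} ToyStream;

typedef struct {
    bool data_error_reported;   /* a bad-data message was printed */
    int file_error;             /* what a later stream check would see */
} LoadOutcome;

/*
 * Each character in 'steps' is one load step:
 *   'g' = good data, 'b' = bad data, 'x' = connection drops.
 * Bad data is only reported; the loop deliberately keeps going,
 * which is the pattern described in the text above.
 */
static LoadOutcome toy_load(ToyStream *s, const char *steps)
{
    LoadOutcome out = { false, 0 };

    assert(s != NULL && steps != NULL);
    for (size_t i = 0; steps[i] != '\0'; i++) {
        if (steps[i] == 'b') {
            out.data_error_reported = true;   /* no early return */
        } else if (steps[i] == 'x' && s->last_error == 0) {
            s->last_error = -EIO;             /* genuine transport error */
        }
    }
    out.file_error = s->last_error;
    return out;
}
```

For the step string "gbgx" the outcome has both data_error_reported set
and file_error equal to -EIO, so a caller that only looks at the stream
error would treat the failure as a recoverable network drop even though
bad data was seen first.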


Regards,
Daniel

Dr. David Alan Gilbert Feb. 16, 2021, 7:32 p.m. UTC | #11

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Mon, Feb 15, 2021 at 07:01:28PM +0000, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Mon, Feb 15, 2021 at 06:38:05PM +0000, Dr. David Alan Gilbert wrote:
> > > > One thing to check, and I *think* you're OK, but we have one place where
> > > > we actually check the error number:
> > > > 
> > > > migration.c:
> > > > static MigThrError migration_detect_error(MigrationState *s)
> > > > ...
> > > >     /* Try to detect any file errors */
> > > >     ret = qemu_file_get_error_obj(s->to_dst_file, &local_error);
> > > >     if (!ret) {
> > > >         /* Everything is fine */
> > > >         assert(!local_error);
> > > >         return MIG_THR_ERR_NONE;
> > > >     }
> > > > 
> > > >     if (local_error) {
> > > >         migrate_set_error(s, local_error);
> > > >         error_free(local_error);
> > > >     }
> > > > 
> > > >     if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret == -EIO) {
> > > >         /*
> > > >          * For postcopy, we allow the network to be down for a
> > > >          * while. After that, it can be continued by a
> > > >          * recovery phase.
> > > >          */
> > > >         return postcopy_pause(s);
> > > >     } else {
> > > > 
> > > > This is to go into postcopy pause if the network connection broke (but
> > > > not if for example a device moaned about being in an invalid state)
> > > > 
> > > > If I read this correctly, file errors are still being preserved - is
> > > > that correct?
> > > 
> > > Yes, in places where QemuFile is reporting an actual I/O error I've
> > > tried to preserve that. Only removed setting of fake I/O errors. So
> > > if anything, we ought to get more accurate at detecting the recoverable
> > > scenarios once we fully cleanup errors.
> > 
> > OK, good.
> 
> One scenario to possibly check though is that in a few places we used
> error_report_err() but didn't immediately return an error code back to
> the caller, instead carrying on doing other calls. It is possible that
> we thus reported an error about bad data, and then later hit the EIO
> check for QemuFile.

That's generally OK; it gets pretty painful to do the qemu file checks
after every read.
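
The usual way to avoid a check after every read is a "sticky" error on
the stream: once a read fails, later reads return a dummy value and the
caller checks once at the end. A minimal sketch in C, assuming that
pattern; ToyReader and its functions are hypothetical and are not the
QEMUFile API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory reader with a sticky error field. */
typedef struct {
    const uint8_t *buf;
    size_t len;
    size_t pos;
    int last_error;   /* 0 until the first failure, then never cleared */
} ToyReader;

/* Returns the next byte, or 0 once the reader is in the error state. */
static uint8_t toy_get_byte(ToyReader *r)
{
    assert(r != NULL);
    if (r->last_error != 0) {
        return 0;                 /* already failed: do nothing */
    }
    if (r->pos >= r->len) {
        r->last_error = -EIO;     /* short read: record it once */
        return 0;
    }
    return r->buf[r->pos++];
}

/* Reads three bytes with no per-read checks, then checks a single time. */
static int toy_load_header(ToyReader *r, uint8_t out[3])
{
    out[0] = toy_get_byte(r);
    out[1] = toy_get_byte(r);
    out[2] = toy_get_byte(r);
    return r->last_error;
}
```

The cost of this design is that values read after the failure are
meaningless, so any message printed about them before the final check
may describe garbage rather than the real cause.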

Dave

