Message ID | 165296995578.196133.16183155555450040914.stgit@buzz (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/4] qdev: add DEVICE_RUNTIME_ERROR event | expand |
First, cover letter is absent. Konstantin, could you please provide a description what the whole series does? Second, add maintainers to CC: +Micheal +Eric +Markus On 5/19/22 17:19, Konstantin Khlebnikov wrote: > This event represents device runtime errors to give time and > reason why device is broken. > > Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> > --- The patch itself seems good to me: Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
I almost missed this series. Please cc: maintainers. scripts/get_maintainer.pl can help you find them, but do use your judgement to maybe trim its output. Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > This event represents device runtime errors to give time and > reason why device is broken. Can you give an or more examples of the "device runtime errors" you have in mind? > > Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> > --- > hw/core/qdev.c | 7 +++++++ > include/hw/qdev-core.h | 1 + > qapi/qdev.json | 26 ++++++++++++++++++++++++++ > 3 files changed, 34 insertions(+) > > diff --git a/hw/core/qdev.c b/hw/core/qdev.c > index 84f3019440..e95ceb071b 100644 > --- a/hw/core/qdev.c > +++ b/hw/core/qdev.c > @@ -347,6 +347,13 @@ void qdev_unrealize(DeviceState *dev) > object_property_set_bool(OBJECT(dev), "realized", false, &error_abort); > } > > +void qdev_report_runtime_error(DeviceState *dev, const Error *err) > +{ > + qapi_event_send_device_runtime_error(!!dev->id, dev->id, > + dev->canonical_path, > + error_get_pretty(err)); > +} > + > static int qdev_assert_realized_properly_cb(Object *obj, void *opaque) > { > DeviceState *dev = DEVICE(object_dynamic_cast(obj, TYPE_DEVICE)); > diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h > index 92c3d65208..9ced2e0f09 100644 > --- a/include/hw/qdev-core.h > +++ b/include/hw/qdev-core.h > @@ -396,6 +396,7 @@ bool qdev_realize_and_unref(DeviceState *dev, BusState *bus, Error **errp); > * the life of the simulation and should not be unrealized and freed. > */ > void qdev_unrealize(DeviceState *dev); > +void qdev_report_runtime_error(DeviceState *dev, const Error *err); > void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id, > int required_for_version); > HotplugHandler *qdev_get_bus_hotplug_handler(DeviceState *dev); > diff --git a/qapi/qdev.json b/qapi/qdev.json > index 26cd10106b..89ef32eef7 100644 > --- a/qapi/qdev.json > +++ b/qapi/qdev.json > @@ -159,3 +159,29 @@ > ## > { 'event': 'DEVICE_UNPLUG_GUEST_ERROR', > 'data': { '*device': 'str', 'path': 'str' } } > + > +## > +# @DEVICE_RUNTIME_ERROR: > +# > +# Emitted when a device fails in runtime. at runtime I figure there are plenty of device state transitions in the tree that are failures. Better express that here. Unless you intend to hunt them all down so you can add this event :) > +# > +# @device: the device's ID if it has one > +# > +# @path: the device's QOM path > +# > +# @reason: human readable description > +# > +# Since: 7.1 > +# > +# Example: > +# > +# <- { "event": "DEVICE_RUNTIME_ERROR" > +# "data": { "device": "virtio-net-pci-0", > +# "path": "/machine/peripheral/virtio-net-pci-0", > +# "reason": "virtio-net header incorrect" }, > +# }, > +# "timestamp": { "seconds": 1615570772, "microseconds": 202844 } } > +# > +## > +{ 'event': 'DEVICE_RUNTIME_ERROR', > + 'data': { '*device': 'str', 'path': 'str', 'reason': 'str' } }
On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote: > Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > > > This event represents device runtime errors to give time and > > reason why device is broken. > > Can you give an or more examples of the "device runtime errors" you have > in mind? Initially we wanted to address a situation when a vhost device discovered an inconsistency during virtqueue processing and silently stopped the virtqueue. This resulted in device stall (partial for multiqueue devices) and we were the last to notice that. The solution appeared to be to employ errfd and, upon receiving a notification through it, to emit a QMP event which is actionable in the management layer or further up the stack. Then we observed that virtio (non-vhost) devices suffer from the same issue: they only log the error but don't signal it to the management layer. The case was very similar so we thought it would make sense to share the infrastructure and the QMP event between virtio and vhost. Then Konstantin went a bit further and generalized the concept into generic "device runtime error". I'm personally not completely convinced this generalization is appropriate here; we'd appreciate the opinions from the community on the matter. HTH, Roman.
Roman Kagan <rvkagan@yandex-team.ru> writes: > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote: >> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: >> >> > This event represents device runtime errors to give time and >> > reason why device is broken. >> >> Can you give an or more examples of the "device runtime errors" you have >> in mind? > > Initially we wanted to address a situation when a vhost device > discovered an inconsistency during virtqueue processing and silently > stopped the virtqueue. This resulted in device stall (partial for > multiqueue devices) and we were the last to notice that. > > The solution appeared to be to employ errfd and, upon receiving a > notification through it, to emit a QMP event which is actionable in the > management layer or further up the stack. > > Then we observed that virtio (non-vhost) devices suffer from the same > issue: they only log the error but don't signal it to the management > layer. The case was very similar so we thought it would make sense to > share the infrastructure and the QMP event between virtio and vhost. > > Then Konstantin went a bit further and generalized the concept into > generic "device runtime error". I'm personally not completely convinced > this generalization is appropriate here; we'd appreciate the opinions > from the community on the matter. "Device emulation sending an even on entering certain error states, so that a management application can do something about it" feels reasonable enough to me as a general concept. The key point is of course "can do something": the event needs to be actionable. Can you describe possible actions for the cases you implement? Once we all have a better idea of the event's purpose, usage, and limitations, we should revisit its documentation.
On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote: > Roman Kagan <rvkagan@yandex-team.ru> writes: > > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote: > >> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > >> > >> > This event represents device runtime errors to give time and > >> > reason why device is broken. > >> > >> Can you give an or more examples of the "device runtime errors" you have > >> in mind? > > > > Initially we wanted to address a situation when a vhost device > > discovered an inconsistency during virtqueue processing and silently > > stopped the virtqueue. This resulted in device stall (partial for > > multiqueue devices) and we were the last to notice that. > > > > The solution appeared to be to employ errfd and, upon receiving a > > notification through it, to emit a QMP event which is actionable in the > > management layer or further up the stack. > > > > Then we observed that virtio (non-vhost) devices suffer from the same > > issue: they only log the error but don't signal it to the management > > layer. The case was very similar so we thought it would make sense to > > share the infrastructure and the QMP event between virtio and vhost. > > > > Then Konstantin went a bit further and generalized the concept into > > generic "device runtime error". I'm personally not completely convinced > > this generalization is appropriate here; we'd appreciate the opinions > > from the community on the matter. > > "Device emulation sending an even on entering certain error states, so > that a management application can do something about it" feels > reasonable enough to me as a general concept. > > The key point is of course "can do something": the event needs to be > actionable. Can you describe possible actions for the cases you > implement? The first one that we had in mind was informational, like triggering an alert in the monitoring system and/or painting the VM as malfunctioning in the owner's UI. There can be more advanced scenarios like autorecovery by resetting the faulty VM, or fencing it if it's a cluster member. Thanks, Roman.
On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote: > On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote: > > Roman Kagan <rvkagan@yandex-team.ru> writes: > > > > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote: > > >> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > > >> > > >> > This event represents device runtime errors to give time and > > >> > reason why device is broken. > > >> > > >> Can you give an or more examples of the "device runtime errors" you have > > >> in mind? > > > > > > Initially we wanted to address a situation when a vhost device > > > discovered an inconsistency during virtqueue processing and silently > > > stopped the virtqueue. This resulted in device stall (partial for > > > multiqueue devices) and we were the last to notice that. > > > > > > The solution appeared to be to employ errfd and, upon receiving a > > > notification through it, to emit a QMP event which is actionable in the > > > management layer or further up the stack. > > > > > > Then we observed that virtio (non-vhost) devices suffer from the same > > > issue: they only log the error but don't signal it to the management > > > layer. The case was very similar so we thought it would make sense to > > > share the infrastructure and the QMP event between virtio and vhost. > > > > > > Then Konstantin went a bit further and generalized the concept into > > > generic "device runtime error". I'm personally not completely convinced > > > this generalization is appropriate here; we'd appreciate the opinions > > > from the community on the matter. > > > > "Device emulation sending an even on entering certain error states, so > > that a management application can do something about it" feels > > reasonable enough to me as a general concept. > > > > The key point is of course "can do something": the event needs to be > > actionable. Can you describe possible actions for the cases you > > implement? > > The first one that we had in mind was informational, like triggering an > alert in the monitoring system and/or painting the VM as malfunctioning > in the owner's UI. > > There can be more advanced scenarios like autorecovery by resetting the > faulty VM, or fencing it if it's a cluster member. The discussion kind of stalled here. Do you think the approach makes sense or not? Should we try and resubmit the series with a proper cover letter and possibly other improvements or is it a dead end? Thanks, Roman.
Roman Kagan <rvkagan@yandex-team.ru> writes: > On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote: >> On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote: >> > Roman Kagan <rvkagan@yandex-team.ru> writes: >> > >> > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote: >> > >> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: >> > >> >> > >> > This event represents device runtime errors to give time and >> > >> > reason why device is broken. >> > >> >> > >> Can you give an or more examples of the "device runtime errors" you have >> > >> in mind? >> > > >> > > Initially we wanted to address a situation when a vhost device >> > > discovered an inconsistency during virtqueue processing and silently >> > > stopped the virtqueue. This resulted in device stall (partial for >> > > multiqueue devices) and we were the last to notice that. >> > > >> > > The solution appeared to be to employ errfd and, upon receiving a >> > > notification through it, to emit a QMP event which is actionable in the >> > > management layer or further up the stack. >> > > >> > > Then we observed that virtio (non-vhost) devices suffer from the same >> > > issue: they only log the error but don't signal it to the management >> > > layer. The case was very similar so we thought it would make sense to >> > > share the infrastructure and the QMP event between virtio and vhost. >> > > >> > > Then Konstantin went a bit further and generalized the concept into >> > > generic "device runtime error". I'm personally not completely convinced >> > > this generalization is appropriate here; we'd appreciate the opinions >> > > from the community on the matter. >> > >> > "Device emulation sending an even on entering certain error states, so >> > that a management application can do something about it" feels >> > reasonable enough to me as a general concept. >> > >> > The key point is of course "can do something": the event needs to be >> > actionable. Can you describe possible actions for the cases you >> > implement? >> >> The first one that we had in mind was informational, like triggering an >> alert in the monitoring system and/or painting the VM as malfunctioning >> in the owner's UI. >> >> There can be more advanced scenarios like autorecovery by resetting the >> faulty VM, or fencing it if it's a cluster member. > > The discussion kind of stalled here. My apologies... > Do you think the approach makes > sense or not? Should we try and resubmit the series with a proper cover > letter and possibly other improvements or is it a dead end? As QAPI schema maintainer, my concern is interface design. To sell this interface to me (so to speak), you have to show it's useful and reasonably general. Reasonably general, because we don't want to accumulate one-offs, even if they have their uses. I think this is mostly a matter of commit message(s) and documentation here. Explain your intended use cases. Maybe hand-wave at other use cases you can think of. Document that you're implementing the event only for the specific errors you need, but that it could be implemented more widely as needed. "Complete" feels impractical, though. Makes sense?
On Tue, Jun 21, 2022 at 01:55:25PM +0200, Markus Armbruster wrote: > Roman Kagan <rvkagan@yandex-team.ru> writes: > > > On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote: > >> On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote: > >> > Roman Kagan <rvkagan@yandex-team.ru> writes: > >> > > >> > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote: > >> > >> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > >> > >> > >> > >> > This event represents device runtime errors to give time and > >> > >> > reason why device is broken. > >> > >> > >> > >> Can you give an or more examples of the "device runtime errors" you have > >> > >> in mind? > >> > > > >> > > Initially we wanted to address a situation when a vhost device > >> > > discovered an inconsistency during virtqueue processing and silently > >> > > stopped the virtqueue. This resulted in device stall (partial for > >> > > multiqueue devices) and we were the last to notice that. > >> > > > >> > > The solution appeared to be to employ errfd and, upon receiving a > >> > > notification through it, to emit a QMP event which is actionable in the > >> > > management layer or further up the stack. > >> > > > >> > > Then we observed that virtio (non-vhost) devices suffer from the same > >> > > issue: they only log the error but don't signal it to the management > >> > > layer. The case was very similar so we thought it would make sense to > >> > > share the infrastructure and the QMP event between virtio and vhost. > >> > > > >> > > Then Konstantin went a bit further and generalized the concept into > >> > > generic "device runtime error". I'm personally not completely convinced > >> > > this generalization is appropriate here; we'd appreciate the opinions > >> > > from the community on the matter. > >> > > >> > "Device emulation sending an even on entering certain error states, so > >> > that a management application can do something about it" feels > >> > reasonable enough to me as a general concept. > >> > > >> > The key point is of course "can do something": the event needs to be > >> > actionable. Can you describe possible actions for the cases you > >> > implement? > >> > >> The first one that we had in mind was informational, like triggering an > >> alert in the monitoring system and/or painting the VM as malfunctioning > >> in the owner's UI. > >> > >> There can be more advanced scenarios like autorecovery by resetting the > >> faulty VM, or fencing it if it's a cluster member. > > > > The discussion kind of stalled here. > > My apologies... > > > Do you think the approach makes > > sense or not? Should we try and resubmit the series with a proper cover > > letter and possibly other improvements or is it a dead end? > > As QAPI schema maintainer, my concern is interface design. To sell this > interface to me (so to speak), you have to show it's useful and > reasonably general. Reasonably general, because we don't want to > accumulate one-offs, even if they have their uses. > > I think this is mostly a matter of commit message(s) and documentation > here. Explain your intended use cases. Maybe hand-wave at other use > cases you can think of. Document that you're implementing the event > only for the specific errors you need, but that it could be implemented > more widely as needed. "Complete" feels impractical, though. > > Makes sense? Absolutely. We'll rework and resubmit the series addressing the issues you've noted, and we'll see how it goes. Thanks, Roman.
diff --git a/hw/core/qdev.c b/hw/core/qdev.c index 84f3019440..e95ceb071b 100644 --- a/hw/core/qdev.c +++ b/hw/core/qdev.c @@ -347,6 +347,13 @@ void qdev_unrealize(DeviceState *dev) object_property_set_bool(OBJECT(dev), "realized", false, &error_abort); } +void qdev_report_runtime_error(DeviceState *dev, const Error *err) +{ + qapi_event_send_device_runtime_error(!!dev->id, dev->id, + dev->canonical_path, + error_get_pretty(err)); +} + static int qdev_assert_realized_properly_cb(Object *obj, void *opaque) { DeviceState *dev = DEVICE(object_dynamic_cast(obj, TYPE_DEVICE)); diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h index 92c3d65208..9ced2e0f09 100644 --- a/include/hw/qdev-core.h +++ b/include/hw/qdev-core.h @@ -396,6 +396,7 @@ bool qdev_realize_and_unref(DeviceState *dev, BusState *bus, Error **errp); * the life of the simulation and should not be unrealized and freed. */ void qdev_unrealize(DeviceState *dev); +void qdev_report_runtime_error(DeviceState *dev, const Error *err); void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id, int required_for_version); HotplugHandler *qdev_get_bus_hotplug_handler(DeviceState *dev); diff --git a/qapi/qdev.json b/qapi/qdev.json index 26cd10106b..89ef32eef7 100644 --- a/qapi/qdev.json +++ b/qapi/qdev.json @@ -159,3 +159,29 @@ ## { 'event': 'DEVICE_UNPLUG_GUEST_ERROR', 'data': { '*device': 'str', 'path': 'str' } } + +## +# @DEVICE_RUNTIME_ERROR: +# +# Emitted when a device fails in runtime. +# +# @device: the device's ID if it has one +# +# @path: the device's QOM path +# +# @reason: human readable description +# +# Since: 7.1 +# +# Example: +# +# <- { "event": "DEVICE_RUNTIME_ERROR" +# "data": { "device": "virtio-net-pci-0", +# "path": "/machine/peripheral/virtio-net-pci-0", +# "reason": "virtio-net header incorrect" }, +# }, +# "timestamp": { "seconds": 1615570772, "microseconds": 202844 } } +# +## +{ 'event': 'DEVICE_RUNTIME_ERROR', + 'data': { '*device': 'str', 'path': 'str', 'reason': 'str' } }
This event represents device runtime errors to give time and reason why device is broken. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> --- hw/core/qdev.c | 7 +++++++ include/hw/qdev-core.h | 1 + qapi/qdev.json | 26 ++++++++++++++++++++++++++ 3 files changed, 34 insertions(+)