diff mbox series

[1/4] qdev: add DEVICE_RUNTIME_ERROR event

Message ID 165296995578.196133.16183155555450040914.stgit@buzz (mailing list archive)
State New, archived
Headers show
Series [1/4] qdev: add DEVICE_RUNTIME_ERROR event | expand

Commit Message

Konstantin Khlebnikov May 19, 2022, 2:19 p.m. UTC
This event represents device runtime errors to give time and
reason why device is broken.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 hw/core/qdev.c         |    7 +++++++
 include/hw/qdev-core.h |    1 +
 qapi/qdev.json         |   26 ++++++++++++++++++++++++++
 3 files changed, 34 insertions(+)

Comments

Vladimir Sementsov-Ogievskiy May 24, 2022, 7:04 p.m. UTC | #1
First, cover letter is absent. Konstantin, could you please provide a description what the whole series does?

Second, add maintainers to CC:
+Micheal
+Eric
+Markus

On 5/19/22 17:19, Konstantin Khlebnikov wrote:
> This event represents device runtime errors to give time and
> reason why device is broken.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---

The patch itself seems good to me:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Markus Armbruster May 25, 2022, 10:54 a.m. UTC | #2
I almost missed this series.  Please cc: maintainers.
scripts/get_maintainer.pl can help you find them, but do use your
judgement to maybe trim its output.

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:

> This event represents device runtime errors to give time and
> reason why device is broken.

Can you give an or more examples of the "device runtime errors" you have
in mind?

>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>  hw/core/qdev.c         |    7 +++++++
>  include/hw/qdev-core.h |    1 +
>  qapi/qdev.json         |   26 ++++++++++++++++++++++++++
>  3 files changed, 34 insertions(+)
>
> diff --git a/hw/core/qdev.c b/hw/core/qdev.c
> index 84f3019440..e95ceb071b 100644
> --- a/hw/core/qdev.c
> +++ b/hw/core/qdev.c
> @@ -347,6 +347,13 @@ void qdev_unrealize(DeviceState *dev)
>      object_property_set_bool(OBJECT(dev), "realized", false, &error_abort);
>  }
>  
> +void qdev_report_runtime_error(DeviceState *dev, const Error *err)
> +{
> +    qapi_event_send_device_runtime_error(!!dev->id, dev->id,
> +                                         dev->canonical_path,
> +                                         error_get_pretty(err));
> +}
> +
>  static int qdev_assert_realized_properly_cb(Object *obj, void *opaque)
>  {
>      DeviceState *dev = DEVICE(object_dynamic_cast(obj, TYPE_DEVICE));
> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index 92c3d65208..9ced2e0f09 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -396,6 +396,7 @@ bool qdev_realize_and_unref(DeviceState *dev, BusState *bus, Error **errp);
>   * the life of the simulation and should not be unrealized and freed.
>   */
>  void qdev_unrealize(DeviceState *dev);
> +void qdev_report_runtime_error(DeviceState *dev, const Error *err);
>  void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
>                                   int required_for_version);
>  HotplugHandler *qdev_get_bus_hotplug_handler(DeviceState *dev);
> diff --git a/qapi/qdev.json b/qapi/qdev.json
> index 26cd10106b..89ef32eef7 100644
> --- a/qapi/qdev.json
> +++ b/qapi/qdev.json
> @@ -159,3 +159,29 @@
>  ##
>  { 'event': 'DEVICE_UNPLUG_GUEST_ERROR',
>    'data': { '*device': 'str', 'path': 'str' } }
> +
> +##
> +# @DEVICE_RUNTIME_ERROR:
> +#
> +# Emitted when a device fails in runtime.

at runtime

I figure there are plenty of device state transitions in the tree that
are failures.  Better express that here.  Unless you intend to hunt them
all down so you can add this event :)

> +#
> +# @device: the device's ID if it has one
> +#
> +# @path: the device's QOM path
> +#
> +# @reason: human readable description
> +#
> +# Since: 7.1
> +#
> +# Example:
> +#
> +# <- { "event": "DEVICE_RUNTIME_ERROR"
> +#      "data": { "device": "virtio-net-pci-0",
> +#                "path": "/machine/peripheral/virtio-net-pci-0",
> +#                "reason": "virtio-net header incorrect" },
> +#      },
> +#      "timestamp": { "seconds": 1615570772, "microseconds": 202844 } }
> +#
> +##
> +{ 'event': 'DEVICE_RUNTIME_ERROR',
> +        'data': { '*device': 'str', 'path': 'str', 'reason': 'str' } }
Roman Kagan May 27, 2022, 12:49 p.m. UTC | #3
On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> 
> > This event represents device runtime errors to give time and
> > reason why device is broken.
> 
> Can you give an or more examples of the "device runtime errors" you have
> in mind?

Initially we wanted to address a situation when a vhost device
discovered an inconsistency during virtqueue processing and silently
stopped the virtqueue.  This resulted in device stall (partial for
multiqueue devices) and we were the last to notice that.

The solution appeared to be to employ errfd and, upon receiving a
notification through it, to emit a QMP event which is actionable in the
management layer or further up the stack.

Then we observed that virtio (non-vhost) devices suffer from the same
issue: they only log the error but don't signal it to the management
layer.  The case was very similar so we thought it would make sense to
share the infrastructure and the QMP event between virtio and vhost.

Then Konstantin went a bit further and generalized the concept into
generic "device runtime error".  I'm personally not completely convinced
this generalization is appropriate here; we'd appreciate the opinions
from the community on the matter.

HTH,
Roman.
Markus Armbruster May 30, 2022, 11:28 a.m. UTC | #4
Roman Kagan <rvkagan@yandex-team.ru> writes:

> On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
>> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
>> 
>> > This event represents device runtime errors to give time and
>> > reason why device is broken.
>> 
>> Can you give an or more examples of the "device runtime errors" you have
>> in mind?
>
> Initially we wanted to address a situation when a vhost device
> discovered an inconsistency during virtqueue processing and silently
> stopped the virtqueue.  This resulted in device stall (partial for
> multiqueue devices) and we were the last to notice that.
>
> The solution appeared to be to employ errfd and, upon receiving a
> notification through it, to emit a QMP event which is actionable in the
> management layer or further up the stack.
>
> Then we observed that virtio (non-vhost) devices suffer from the same
> issue: they only log the error but don't signal it to the management
> layer.  The case was very similar so we thought it would make sense to
> share the infrastructure and the QMP event between virtio and vhost.
>
> Then Konstantin went a bit further and generalized the concept into
> generic "device runtime error".  I'm personally not completely convinced
> this generalization is appropriate here; we'd appreciate the opinions
> from the community on the matter.

"Device emulation sending an even on entering certain error states, so
that a management application can do something about it" feels
reasonable enough to me as a general concept.

The key point is of course "can do something": the event needs to be
actionable.  Can you describe possible actions for the cases you
implement?

Once we all have a better idea of the event's purpose, usage, and
limitations, we should revisit its documentation.
Roman Kagan May 30, 2022, 3:04 p.m. UTC | #5
On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
> Roman Kagan <rvkagan@yandex-team.ru> writes:
> 
> > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> >> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> >> 
> >> > This event represents device runtime errors to give time and
> >> > reason why device is broken.
> >> 
> >> Can you give an or more examples of the "device runtime errors" you have
> >> in mind?
> >
> > Initially we wanted to address a situation when a vhost device
> > discovered an inconsistency during virtqueue processing and silently
> > stopped the virtqueue.  This resulted in device stall (partial for
> > multiqueue devices) and we were the last to notice that.
> >
> > The solution appeared to be to employ errfd and, upon receiving a
> > notification through it, to emit a QMP event which is actionable in the
> > management layer or further up the stack.
> >
> > Then we observed that virtio (non-vhost) devices suffer from the same
> > issue: they only log the error but don't signal it to the management
> > layer.  The case was very similar so we thought it would make sense to
> > share the infrastructure and the QMP event between virtio and vhost.
> >
> > Then Konstantin went a bit further and generalized the concept into
> > generic "device runtime error".  I'm personally not completely convinced
> > this generalization is appropriate here; we'd appreciate the opinions
> > from the community on the matter.
> 
> "Device emulation sending an even on entering certain error states, so
> that a management application can do something about it" feels
> reasonable enough to me as a general concept.
> 
> The key point is of course "can do something": the event needs to be
> actionable.  Can you describe possible actions for the cases you
> implement?

The first one that we had in mind was informational, like triggering an
alert in the monitoring system and/or painting the VM as malfunctioning
in the owner's UI.

There can be more advanced scenarios like autorecovery by resetting the
faulty VM, or fencing it if it's a cluster member.

Thanks,
Roman.
Roman Kagan June 20, 2022, 1:49 p.m. UTC | #6
On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote:
> On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
> > Roman Kagan <rvkagan@yandex-team.ru> writes:
> > 
> > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> > >> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> > >> 
> > >> > This event represents device runtime errors to give time and
> > >> > reason why device is broken.
> > >> 
> > >> Can you give an or more examples of the "device runtime errors" you have
> > >> in mind?
> > >
> > > Initially we wanted to address a situation when a vhost device
> > > discovered an inconsistency during virtqueue processing and silently
> > > stopped the virtqueue.  This resulted in device stall (partial for
> > > multiqueue devices) and we were the last to notice that.
> > >
> > > The solution appeared to be to employ errfd and, upon receiving a
> > > notification through it, to emit a QMP event which is actionable in the
> > > management layer or further up the stack.
> > >
> > > Then we observed that virtio (non-vhost) devices suffer from the same
> > > issue: they only log the error but don't signal it to the management
> > > layer.  The case was very similar so we thought it would make sense to
> > > share the infrastructure and the QMP event between virtio and vhost.
> > >
> > > Then Konstantin went a bit further and generalized the concept into
> > > generic "device runtime error".  I'm personally not completely convinced
> > > this generalization is appropriate here; we'd appreciate the opinions
> > > from the community on the matter.
> > 
> > "Device emulation sending an even on entering certain error states, so
> > that a management application can do something about it" feels
> > reasonable enough to me as a general concept.
> > 
> > The key point is of course "can do something": the event needs to be
> > actionable.  Can you describe possible actions for the cases you
> > implement?
> 
> The first one that we had in mind was informational, like triggering an
> alert in the monitoring system and/or painting the VM as malfunctioning
> in the owner's UI.
> 
> There can be more advanced scenarios like autorecovery by resetting the
> faulty VM, or fencing it if it's a cluster member.

The discussion kind of stalled here.  Do you think the approach makes
sense or not?  Should we try and resubmit the series with a proper cover
letter and possibly other improvements or is it a dead end?

Thanks,
Roman.
Markus Armbruster June 21, 2022, 11:55 a.m. UTC | #7
Roman Kagan <rvkagan@yandex-team.ru> writes:

> On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote:
>> On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
>> > Roman Kagan <rvkagan@yandex-team.ru> writes:
>> > 
>> > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
>> > >> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
>> > >> 
>> > >> > This event represents device runtime errors to give time and
>> > >> > reason why device is broken.
>> > >> 
>> > >> Can you give an or more examples of the "device runtime errors" you have
>> > >> in mind?
>> > >
>> > > Initially we wanted to address a situation when a vhost device
>> > > discovered an inconsistency during virtqueue processing and silently
>> > > stopped the virtqueue.  This resulted in device stall (partial for
>> > > multiqueue devices) and we were the last to notice that.
>> > >
>> > > The solution appeared to be to employ errfd and, upon receiving a
>> > > notification through it, to emit a QMP event which is actionable in the
>> > > management layer or further up the stack.
>> > >
>> > > Then we observed that virtio (non-vhost) devices suffer from the same
>> > > issue: they only log the error but don't signal it to the management
>> > > layer.  The case was very similar so we thought it would make sense to
>> > > share the infrastructure and the QMP event between virtio and vhost.
>> > >
>> > > Then Konstantin went a bit further and generalized the concept into
>> > > generic "device runtime error".  I'm personally not completely convinced
>> > > this generalization is appropriate here; we'd appreciate the opinions
>> > > from the community on the matter.
>> > 
>> > "Device emulation sending an even on entering certain error states, so
>> > that a management application can do something about it" feels
>> > reasonable enough to me as a general concept.
>> > 
>> > The key point is of course "can do something": the event needs to be
>> > actionable.  Can you describe possible actions for the cases you
>> > implement?
>> 
>> The first one that we had in mind was informational, like triggering an
>> alert in the monitoring system and/or painting the VM as malfunctioning
>> in the owner's UI.
>> 
>> There can be more advanced scenarios like autorecovery by resetting the
>> faulty VM, or fencing it if it's a cluster member.
>
> The discussion kind of stalled here.

My apologies...

>                                       Do you think the approach makes
> sense or not?  Should we try and resubmit the series with a proper cover
> letter and possibly other improvements or is it a dead end?

As QAPI schema maintainer, my concern is interface design.  To sell this
interface to me (so to speak), you have to show it's useful and
reasonably general.  Reasonably general, because we don't want to
accumulate one-offs, even if they have their uses.

I think this is mostly a matter of commit message(s) and documentation
here.  Explain your intended use cases.  Maybe hand-wave at other use
cases you can think of.  Document that you're implementing the event
only for the specific errors you need, but that it could be implemented
more widely as needed.  "Complete" feels impractical, though.

Makes sense?
Roman Kagan June 21, 2022, 12:02 p.m. UTC | #8
On Tue, Jun 21, 2022 at 01:55:25PM +0200, Markus Armbruster wrote:
> Roman Kagan <rvkagan@yandex-team.ru> writes:
> 
> > On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote:
> >> On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
> >> > Roman Kagan <rvkagan@yandex-team.ru> writes:
> >> > 
> >> > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> >> > >> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> >> > >> 
> >> > >> > This event represents device runtime errors to give time and
> >> > >> > reason why device is broken.
> >> > >> 
> >> > >> Can you give an or more examples of the "device runtime errors" you have
> >> > >> in mind?
> >> > >
> >> > > Initially we wanted to address a situation when a vhost device
> >> > > discovered an inconsistency during virtqueue processing and silently
> >> > > stopped the virtqueue.  This resulted in device stall (partial for
> >> > > multiqueue devices) and we were the last to notice that.
> >> > >
> >> > > The solution appeared to be to employ errfd and, upon receiving a
> >> > > notification through it, to emit a QMP event which is actionable in the
> >> > > management layer or further up the stack.
> >> > >
> >> > > Then we observed that virtio (non-vhost) devices suffer from the same
> >> > > issue: they only log the error but don't signal it to the management
> >> > > layer.  The case was very similar so we thought it would make sense to
> >> > > share the infrastructure and the QMP event between virtio and vhost.
> >> > >
> >> > > Then Konstantin went a bit further and generalized the concept into
> >> > > generic "device runtime error".  I'm personally not completely convinced
> >> > > this generalization is appropriate here; we'd appreciate the opinions
> >> > > from the community on the matter.
> >> > 
> >> > "Device emulation sending an even on entering certain error states, so
> >> > that a management application can do something about it" feels
> >> > reasonable enough to me as a general concept.
> >> > 
> >> > The key point is of course "can do something": the event needs to be
> >> > actionable.  Can you describe possible actions for the cases you
> >> > implement?
> >> 
> >> The first one that we had in mind was informational, like triggering an
> >> alert in the monitoring system and/or painting the VM as malfunctioning
> >> in the owner's UI.
> >> 
> >> There can be more advanced scenarios like autorecovery by resetting the
> >> faulty VM, or fencing it if it's a cluster member.
> >
> > The discussion kind of stalled here.
> 
> My apologies...
> 
> >                                       Do you think the approach makes
> > sense or not?  Should we try and resubmit the series with a proper cover
> > letter and possibly other improvements or is it a dead end?
> 
> As QAPI schema maintainer, my concern is interface design.  To sell this
> interface to me (so to speak), you have to show it's useful and
> reasonably general.  Reasonably general, because we don't want to
> accumulate one-offs, even if they have their uses.
> 
> I think this is mostly a matter of commit message(s) and documentation
> here.  Explain your intended use cases.  Maybe hand-wave at other use
> cases you can think of.  Document that you're implementing the event
> only for the specific errors you need, but that it could be implemented
> more widely as needed.  "Complete" feels impractical, though.
> 
> Makes sense?

Absolutely.  We'll rework and resubmit the series addressing the issues
you've noted, and we'll see how it goes.

Thanks,
Roman.
diff mbox series

Patch

diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 84f3019440..e95ceb071b 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -347,6 +347,13 @@  void qdev_unrealize(DeviceState *dev)
     object_property_set_bool(OBJECT(dev), "realized", false, &error_abort);
 }
 
+void qdev_report_runtime_error(DeviceState *dev, const Error *err)
+{
+    qapi_event_send_device_runtime_error(!!dev->id, dev->id,
+                                         dev->canonical_path,
+                                         error_get_pretty(err));
+}
+
 static int qdev_assert_realized_properly_cb(Object *obj, void *opaque)
 {
     DeviceState *dev = DEVICE(object_dynamic_cast(obj, TYPE_DEVICE));
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 92c3d65208..9ced2e0f09 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -396,6 +396,7 @@  bool qdev_realize_and_unref(DeviceState *dev, BusState *bus, Error **errp);
  * the life of the simulation and should not be unrealized and freed.
  */
 void qdev_unrealize(DeviceState *dev);
+void qdev_report_runtime_error(DeviceState *dev, const Error *err);
 void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
                                  int required_for_version);
 HotplugHandler *qdev_get_bus_hotplug_handler(DeviceState *dev);
diff --git a/qapi/qdev.json b/qapi/qdev.json
index 26cd10106b..89ef32eef7 100644
--- a/qapi/qdev.json
+++ b/qapi/qdev.json
@@ -159,3 +159,29 @@ 
 ##
 { 'event': 'DEVICE_UNPLUG_GUEST_ERROR',
   'data': { '*device': 'str', 'path': 'str' } }
+
+##
+# @DEVICE_RUNTIME_ERROR:
+#
+# Emitted when a device fails in runtime.
+#
+# @device: the device's ID if it has one
+#
+# @path: the device's QOM path
+#
+# @reason: human readable description
+#
+# Since: 7.1
+#
+# Example:
+#
+# <- { "event": "DEVICE_RUNTIME_ERROR"
+#      "data": { "device": "virtio-net-pci-0",
+#                "path": "/machine/peripheral/virtio-net-pci-0",
+#                "reason": "virtio-net header incorrect" },
+#      },
+#      "timestamp": { "seconds": 1615570772, "microseconds": 202844 } }
+#
+##
+{ 'event': 'DEVICE_RUNTIME_ERROR',
+        'data': { '*device': 'str', 'path': 'str', 'reason': 'str' } }