Message ID | 147326876478.8546.16045138068342092499.stgit@bahia.lan (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Greg Kurz <groug@kaod.org> writes: > Calling assert() really makes sense when hitting a genuine bug, which calls > for a fix in QEMU. However, when something goes wrong because the guest > sends a malformed message, it is better to write down a more meaningul > error message and exit. > > Signed-off-by: Greg Kurz <groug@kaod.org> > --- > hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- > 1 file changed, 18 insertions(+), 2 deletions(-) > > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c > index 009b43f6d045..67059182645a 100644 > --- a/hw/9pfs/virtio-9p-device.c > +++ b/hw/9pfs/virtio-9p-device.c > @@ -19,6 +19,7 @@ > #include "coth.h" > #include "hw/virtio/virtio-access.h" > #include "qemu/iov.h" > +#include "qemu/error-report.h" > > void virtio_9p_push_and_notify(V9fsPDU *pdu) > { > @@ -35,6 +36,11 @@ void virtio_9p_push_and_notify(V9fsPDU *pdu) > virtio_notify(VIRTIO_DEVICE(v), v->vq); > } > > +static void virtio_9p_error(const char *msg) > +{ > + error_report("The virtio-9p driver in the guest has an issue: %s", msg); > +} > + > static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq) > { > V9fsVirtioState *v = (V9fsVirtioState *)vdev; > @@ -56,13 +62,23 @@ static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq) > break; > } > > - BUG_ON(elem->out_num == 0 || elem->in_num == 0); > + if (elem->out_num == 0) { > + virtio_9p_error("missing VirtFS request's header"); > + exit(1); > + } Can the guest trigger this? > + if (elem->in_num == 0) { > + virtio_9p_error("missing VirtFS reply's header"); > + exit(1); > + } Same question. > QEMU_BUILD_BUG_ON(sizeof out != 7); > > v->elems[pdu->idx] = elem; > len = iov_to_buf(elem->out_sg, elem->out_num, 0, > &out, sizeof out); > - BUG_ON(len != sizeof out); > + if (len != sizeof out) { > + virtio_9p_error("malformed VirtFS request"); > + exit(1); > + } Same question. > > pdu->size = le32_to_cpu(out.size_le); >
On Wed, 07 Sep 2016 19:19:24 +0200 Greg Kurz <groug@kaod.org> wrote: > Calling assert() really makes sense when hitting a genuine bug, which calls > for a fix in QEMU. However, when something goes wrong because the guest > sends a malformed message, it is better to write down a more meaningul > error message and exit. > > Signed-off-by: Greg Kurz <groug@kaod.org> > --- > hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- > 1 file changed, 18 insertions(+), 2 deletions(-) While this is an improvement over the current state, I don't think the guest should be able to kill qemu just by doing something stupid. The right way to go is to mark the virtio device as broken and stop doing any processing until the guest resets it. I think Stefan had a patch series doing that for some base virtio errors, but I'd have to search for it.
On Thu, 08 Sep 2016 09:14:05 +0200 Markus Armbruster <armbru@redhat.com> wrote: > Greg Kurz <groug@kaod.org> writes: > > > Calling assert() really makes sense when hitting a genuine bug, which calls > > for a fix in QEMU. However, when something goes wrong because the guest > > sends a malformed message, it is better to write down a more meaningul > > error message and exit. > > > > Signed-off-by: Greg Kurz <groug@kaod.org> > > --- > > hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- > > 1 file changed, 18 insertions(+), 2 deletions(-) > > > > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c > > index 009b43f6d045..67059182645a 100644 > > --- a/hw/9pfs/virtio-9p-device.c > > +++ b/hw/9pfs/virtio-9p-device.c > > @@ -19,6 +19,7 @@ > > #include "coth.h" > > #include "hw/virtio/virtio-access.h" > > #include "qemu/iov.h" > > +#include "qemu/error-report.h" > > > > void virtio_9p_push_and_notify(V9fsPDU *pdu) > > { > > @@ -35,6 +36,11 @@ void virtio_9p_push_and_notify(V9fsPDU *pdu) > > virtio_notify(VIRTIO_DEVICE(v), v->vq); > > } > > > > +static void virtio_9p_error(const char *msg) > > +{ > > + error_report("The virtio-9p driver in the guest has an issue: %s", msg); > > +} > > + > > static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq) > > { > > V9fsVirtioState *v = (V9fsVirtioState *)vdev; > > @@ -56,13 +62,23 @@ static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq) > > break; > > } > > > > - BUG_ON(elem->out_num == 0 || elem->in_num == 0); > > + if (elem->out_num == 0) { > > + virtio_9p_error("missing VirtFS request's header"); > > + exit(1); > > + } > > Can the guest trigger this? > Yes it can in theory if it pushes an empty buffer... but this "recent" commit changed the outcome: commit 1e7aed70144b4673fc26e73062064b6724795e5f Author: Prasad J Pandit <pjp@fedoraproject.org> Date: Wed Jul 27 21:07:56 2016 +0530 virtio: check vring descriptor buffer length And now, the error is caught in virtqueue_map_desc(): if (!sz) { error_report("virtio: zero sized buffers are not allowed"); exit(1); } So I guess we should keep the BUG_ON() then. BTW, there are similar checks in virtio-blk and virtio-net leading to a QEMU exit... which seem to be obsoleted by the above commit. I'll have a closer look. > > + if (elem->in_num == 0) { > > + virtio_9p_error("missing VirtFS reply's header"); > > + exit(1); > > + } > > Same question. > Same answer. :) > > QEMU_BUILD_BUG_ON(sizeof out != 7); > > > > v->elems[pdu->idx] = elem; > > len = iov_to_buf(elem->out_sg, elem->out_num, 0, > > &out, sizeof out); > > - BUG_ON(len != sizeof out); > > + if (len != sizeof out) { > > + virtio_9p_error("malformed VirtFS request"); > > + exit(1); > > + } > > Same question. > Here this is different: the guest can put a bogus len in the vring_desc structure, and this doesn't get checked earlier. > > > > pdu->size = le32_to_cpu(out.size_le); > > Cheers. -- Greg
On Thu, 8 Sep 2016 10:59:26 +0200 Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > On Wed, 07 Sep 2016 19:19:24 +0200 > Greg Kurz <groug@kaod.org> wrote: > > > Calling assert() really makes sense when hitting a genuine bug, which calls > > for a fix in QEMU. However, when something goes wrong because the guest > > sends a malformed message, it is better to write down a more meaningul > > error message and exit. > > > > Signed-off-by: Greg Kurz <groug@kaod.org> > > --- > > hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- > > 1 file changed, 18 insertions(+), 2 deletions(-) > > While this is an improvement over the current state, I don't think the > guest should be able to kill qemu just by doing something stupid. > Hi Connie, I'm glad you're pointing this out... this was also my impression, but since there are a bunch of sanity checks in the virtio code that cause QEMU to exit (even recently added like 1e7aed70144b), I did not dare stand up :) > The right way to go is to mark the virtio device as broken and stop > doing any processing until the guest resets it. I think Stefan had a > patch series doing that for some base virtio errors, but I'd have to > search for it. > I'd be glad to have a look and try to address this issue. Thanks ! -- Greg
On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote: > On Thu, 8 Sep 2016 10:59:26 +0200 > Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > > > On Wed, 07 Sep 2016 19:19:24 +0200 > > Greg Kurz <groug@kaod.org> wrote: > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls > > > for a fix in QEMU. However, when something goes wrong because the guest > > > sends a malformed message, it is better to write down a more meaningul > > > error message and exit. > > > > > > Signed-off-by: Greg Kurz <groug@kaod.org> > > > --- > > > hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- > > > 1 file changed, 18 insertions(+), 2 deletions(-) > > > > While this is an improvement over the current state, I don't think the > > guest should be able to kill qemu just by doing something stupid. > > > > Hi Connie, > > I'm glad you're pointing this out... this was also my impression, but > since there are a bunch of sanity checks in the virtio code that cause > QEMU to exit (even recently added like 1e7aed70144b), I did not dare > stand up :) It's true that it's broken in many places but we should just fix them all. A separate question is how to log such hardware/guest bugs generally. People already complained about disk filling up because of us printing errors on each such bug. Maybe print each message only N times, and then set a flag to skip the log until management tells us to restart logging again. > > The right way to go is to mark the virtio device as broken and stop > > doing any processing until the guest resets it. I think Stefan had a > > patch series doing that for some base virtio errors, but I'd have to > > search for it. > > > > I'd be glad to have a look and try to address this issue. > > Thanks ! > > -- > Greg
On Thu, 8 Sep 2016 18:00:28 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote: > > On Thu, 8 Sep 2016 10:59:26 +0200 > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > > > > > On Wed, 07 Sep 2016 19:19:24 +0200 > > > Greg Kurz <groug@kaod.org> wrote: > > > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls > > > > for a fix in QEMU. However, when something goes wrong because the guest > > > > sends a malformed message, it is better to write down a more meaningul > > > > error message and exit. > > > > > > > > Signed-off-by: Greg Kurz <groug@kaod.org> > > > > --- > > > > hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- > > > > 1 file changed, 18 insertions(+), 2 deletions(-) > > > > > > While this is an improvement over the current state, I don't think the > > > guest should be able to kill qemu just by doing something stupid. > > > > > > > Hi Connie, > > > > I'm glad you're pointing this out... this was also my impression, but > > since there are a bunch of sanity checks in the virtio code that cause > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare > > stand up :) > > It's true that it's broken in many places but we should just > fix them all. > > > A separate question is how to log such hardware/guest bugs generally. > People already complained about disk filling up because of us printing > errors on each such bug. Maybe print each message only N times, and > then set a flag to skip the log until management tells us to restart > logging again. I'd expect to get the message just once per device if we set the device to broken (unless the guess continuously resets it again...) Do we have a generic print/log ratelimit infrastructure in qemu?
On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote: > On Thu, 8 Sep 2016 18:00:28 +0300 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote: > > > On Thu, 8 Sep 2016 10:59:26 +0200 > > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > > > > > > > On Wed, 07 Sep 2016 19:19:24 +0200 > > > > Greg Kurz <groug@kaod.org> wrote: > > > > > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls > > > > > for a fix in QEMU. However, when something goes wrong because the guest > > > > > sends a malformed message, it is better to write down a more meaningul > > > > > error message and exit. > > > > > > > > > > Signed-off-by: Greg Kurz <groug@kaod.org> > > > > > --- > > > > > hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- > > > > > 1 file changed, 18 insertions(+), 2 deletions(-) > > > > > > > > While this is an improvement over the current state, I don't think the > > > > guest should be able to kill qemu just by doing something stupid. > > > > > > > > > > Hi Connie, > > > > > > I'm glad you're pointing this out... this was also my impression, but > > > since there are a bunch of sanity checks in the virtio code that cause > > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare > > > stand up :) > > > > It's true that it's broken in many places but we should just > > fix them all. > > > > > > A separate question is how to log such hardware/guest bugs generally. > > People already complained about disk filling up because of us printing > > errors on each such bug. Maybe print each message only N times, and > > then set a flag to skip the log until management tells us to restart > > logging again. > > I'd expect to get the message just once per device if we set the device > to broken (unless the guess continuously resets it again...) Which it can do, so we should limit that anyway. > Do we have > a generic print/log ratelimit infrastructure in qemu? There are actually two kinds of errors host side ones and ones triggered by guests. We should distinguish between them API-wise, then we will be able to limit the logging of those that guest can trigger.
On Thu, 8 Sep 2016 18:19:27 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote: > > On Thu, 8 Sep 2016 18:00:28 +0300 > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote: > > > > On Thu, 8 Sep 2016 10:59:26 +0200 > > > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > > > > > > > > > On Wed, 07 Sep 2016 19:19:24 +0200 > > > > > Greg Kurz <groug@kaod.org> wrote: > > > > > > > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls > > > > > > for a fix in QEMU. However, when something goes wrong because the guest > > > > > > sends a malformed message, it is better to write down a more meaningul > > > > > > error message and exit. > > > > > > > > > > > > Signed-off-by: Greg Kurz <groug@kaod.org> > > > > > > --- > > > > > > hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- > > > > > > 1 file changed, 18 insertions(+), 2 deletions(-) > > > > > > > > > > While this is an improvement over the current state, I don't think the > > > > > guest should be able to kill qemu just by doing something stupid. > > > > > > > > > > > > > Hi Connie, > > > > > > > > I'm glad you're pointing this out... this was also my impression, but > > > > since there are a bunch of sanity checks in the virtio code that cause > > > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare > > > > stand up :) > > > > > > It's true that it's broken in many places but we should just > > > fix them all. > > > > > > > > > A separate question is how to log such hardware/guest bugs generally. > > > People already complained about disk filling up because of us printing > > > errors on each such bug. Maybe print each message only N times, and > > > then set a flag to skip the log until management tells us to restart > > > logging again. > > > > I'd expect to get the message just once per device if we set the device > > to broken (unless the guess continuously resets it again...) > > Which it can do, so we should limit that anyway. > > > Do we have > > a generic print/log ratelimit infrastructure in qemu? > > There are actually two kinds of errors > host side ones and ones triggered by guests. > > We should distinguish between them API-wise, then > we will be able to limit the logging of those > that guest can trigger. > FWIW it makes sense to use error_report() if QEMU exits. If it continues execution, this means we're expecting the guest or the host to do something to fix the error condition. This requires QEMU to emit an event of some sort, but not necessarily to log an error message in a file. I guess this depends if QEMU is run by some tooling, or by a human.
On Thu, Sep 08, 2016 at 06:26:52PM +0200, Greg Kurz wrote: > On Thu, 8 Sep 2016 18:19:27 +0300 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote: > > > On Thu, 8 Sep 2016 18:00:28 +0300 > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote: > > > > > On Thu, 8 Sep 2016 10:59:26 +0200 > > > > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > > > > > > > > > > > On Wed, 07 Sep 2016 19:19:24 +0200 > > > > > > Greg Kurz <groug@kaod.org> wrote: > > > > > > > > > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls > > > > > > > for a fix in QEMU. However, when something goes wrong because the guest > > > > > > > sends a malformed message, it is better to write down a more meaningul > > > > > > > error message and exit. > > > > > > > > > > > > > > Signed-off-by: Greg Kurz <groug@kaod.org> > > > > > > > --- > > > > > > > hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- > > > > > > > 1 file changed, 18 insertions(+), 2 deletions(-) > > > > > > > > > > > > While this is an improvement over the current state, I don't think the > > > > > > guest should be able to kill qemu just by doing something stupid. > > > > > > > > > > > > > > > > Hi Connie, > > > > > > > > > > I'm glad you're pointing this out... this was also my impression, but > > > > > since there are a bunch of sanity checks in the virtio code that cause > > > > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare > > > > > stand up :) > > > > > > > > It's true that it's broken in many places but we should just > > > > fix them all. > > > > > > > > > > > > A separate question is how to log such hardware/guest bugs generally. > > > > People already complained about disk filling up because of us printing > > > > errors on each such bug. Maybe print each message only N times, and > > > > then set a flag to skip the log until management tells us to restart > > > > logging again. > > > > > > I'd expect to get the message just once per device if we set the device > > > to broken (unless the guess continuously resets it again...) > > > > Which it can do, so we should limit that anyway. > > > > > Do we have > > > a generic print/log ratelimit infrastructure in qemu? > > > > There are actually two kinds of errors > > host side ones and ones triggered by guests. > > > > We should distinguish between them API-wise, then > > we will be able to limit the logging of those > > that guest can trigger. > > > > FWIW it makes sense to use error_report() if QEMU exits. Not necessarily e.g. hotplug errors trigger error_report too. Generally it should be for host misconfiguration or similar management errors. > If it continues > execution, this means we're expecting the guest or the host to do something > to fix the error condition. This requires QEMU to emit an event of some > sort, but not necessarily to log an error message in a file. I guess this > depends if QEMU is run by some tooling, or by a human. I'm not sure we need an event if tools are not expected to do anything with it. If we limit # of times error is printed, tools will need to reset this counter, so we will need an event on overflow.
Greg Kurz <groug@kaod.org> writes: > On Thu, 8 Sep 2016 18:19:27 +0300 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > >> On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote: >> > On Thu, 8 Sep 2016 18:00:28 +0300 >> > "Michael S. Tsirkin" <mst@redhat.com> wrote: >> > >> > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote: >> > > > On Thu, 8 Sep 2016 10:59:26 +0200 >> > > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote: >> > > > >> > > > > On Wed, 07 Sep 2016 19:19:24 +0200 >> > > > > Greg Kurz <groug@kaod.org> wrote: >> > > > > >> > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls >> > > > > > for a fix in QEMU. However, when something goes wrong because the guest >> > > > > > sends a malformed message, it is better to write down a more meaningul >> > > > > > error message and exit. >> > > > > > >> > > > > > Signed-off-by: Greg Kurz <groug@kaod.org> >> > > > > > --- >> > > > > > hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- >> > > > > > 1 file changed, 18 insertions(+), 2 deletions(-) >> > > > > >> > > > > While this is an improvement over the current state, I don't think the >> > > > > guest should be able to kill qemu just by doing something stupid. >> > > > > >> > > > >> > > > Hi Connie, >> > > > >> > > > I'm glad you're pointing this out... this was also my impression, but >> > > > since there are a bunch of sanity checks in the virtio code that cause >> > > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare >> > > > stand up :) >> > > >> > > It's true that it's broken in many places but we should just >> > > fix them all. >> > > >> > > >> > > A separate question is how to log such hardware/guest bugs generally. >> > > People already complained about disk filling up because of us printing >> > > errors on each such bug. Maybe print each message only N times, and >> > > then set a flag to skip the log until management tells us to restart >> > > logging again. >> > >> > I'd expect to get the message just once per device if we set the device >> > to broken (unless the guess continuously resets it again...) >> >> Which it can do, so we should limit that anyway. >> >> > Do we have >> > a generic print/log ratelimit infrastructure in qemu? >> >> There are actually two kinds of errors >> host side ones and ones triggered by guests. >> >> We should distinguish between them API-wise, then >> we will be able to limit the logging of those >> that guest can trigger. >> > > FWIW it makes sense to use error_report() if QEMU exits. exit(STATUS) with STATUS != 0 without printing a message is always wrong. > If it continues > execution, this means we're expecting the guest or the host to do something > to fix the error condition. This requires QEMU to emit an event of some > sort, but not necessarily to log an error message in a file. I guess this > depends if QEMU is run by some tooling, or by a human. error_report() normally goes to stderr. Tooling or humans can of course make it go to a file instead. error_report() is indeed a sub-par way to send an "attention" signal to the host, because recognizing such a signal reliably is unnecessary hard for management applications. QMP events are much easier. Both are useless when the signal needs to go to the guest. Signalling the guest is a device model job. error_report() without exit() has its uses. Error conditions in need of fixing aren't the only reason to call error_report(). But when you add a call, ask yourself whether management application or guest would like to respond to it.
On Fri, 09 Sep 2016 08:38:13 +0200 Markus Armbruster <armbru@redhat.com> wrote: > Greg Kurz <groug@kaod.org> writes: > > > On Thu, 8 Sep 2016 18:19:27 +0300 > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > >> On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote: > >> > On Thu, 8 Sep 2016 18:00:28 +0300 > >> > "Michael S. Tsirkin" <mst@redhat.com> wrote: > >> > > >> > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote: > >> > > > On Thu, 8 Sep 2016 10:59:26 +0200 > >> > > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > >> > > > > >> > > > > On Wed, 07 Sep 2016 19:19:24 +0200 > >> > > > > Greg Kurz <groug@kaod.org> wrote: > >> > > > > > >> > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls > >> > > > > > for a fix in QEMU. However, when something goes wrong because the guest > >> > > > > > sends a malformed message, it is better to write down a more meaningul > >> > > > > > error message and exit. > >> > > > > > > >> > > > > > Signed-off-by: Greg Kurz <groug@kaod.org> > >> > > > > > --- > >> > > > > > hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- > >> > > > > > 1 file changed, 18 insertions(+), 2 deletions(-) > >> > > > > > >> > > > > While this is an improvement over the current state, I don't think the > >> > > > > guest should be able to kill qemu just by doing something stupid. > >> > > > > > >> > > > > >> > > > Hi Connie, > >> > > > > >> > > > I'm glad you're pointing this out... this was also my impression, but > >> > > > since there are a bunch of sanity checks in the virtio code that cause > >> > > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare > >> > > > stand up :) > >> > > > >> > > It's true that it's broken in many places but we should just > >> > > fix them all. > >> > > > >> > > > >> > > A separate question is how to log such hardware/guest bugs generally. > >> > > People already complained about disk filling up because of us printing > >> > > errors on each such bug. Maybe print each message only N times, and > >> > > then set a flag to skip the log until management tells us to restart > >> > > logging again. > >> > > >> > I'd expect to get the message just once per device if we set the device > >> > to broken (unless the guess continuously resets it again...) > >> > >> Which it can do, so we should limit that anyway. > >> > >> > Do we have > >> > a generic print/log ratelimit infrastructure in qemu? > >> > >> There are actually two kinds of errors > >> host side ones and ones triggered by guests. > >> > >> We should distinguish between them API-wise, then > >> we will be able to limit the logging of those > >> that guest can trigger. > >> > > > > FWIW it makes sense to use error_report() if QEMU exits. > > exit(STATUS) with STATUS != 0 without printing a message is always > wrong. > I fully agree. > > If it continues > > execution, this means we're expecting the guest or the host to do something > > to fix the error condition. This requires QEMU to emit an event of some > > sort, but not necessarily to log an error message in a file. I guess this > > depends if QEMU is run by some tooling, or by a human. > > error_report() normally goes to stderr. Tooling or humans can of course > make it go to a file instead. > > error_report() is indeed a sub-par way to send an "attention" signal to > the host, because recognizing such a signal reliably is unnecessary hard > for management applications. QMP events are much easier. > My wording was poor but yes, that was my point. :) > Both are useless when the signal needs to go to the guest. Signalling > the guest is a device model job. > I also agree with that. In the case of virtio, this is explained in section 2.1.2 of the spec. > error_report() without exit() has its uses. Error conditions in need of > fixing aren't the only reason to call error_report(). But when you add > a call, ask yourself whether management application or guest would like > to respond to it. In the case of the present patch, we currently have BUG_ON() which generates a cryptic and unusable message. It turns out that the first one (elem->out_num == 0 || elem->in_num == 0) is correct since it is now [1] impossible to hit this according to the code (see virtqueue_pop() and virtqueue_map_desc()). The second one (len != sizeof out) though matches a potential guest originated error. If I do as suggested by Connie, then the error_report() isn't needed anymore. Cheers. -- Greg [1] sending an empty buffer was sufficient before commit 1e7aed70144b4 as said in my previous answer
On Thu, 8 Sep 2016 19:55:16 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Thu, Sep 08, 2016 at 06:26:52PM +0200, Greg Kurz wrote: > > On Thu, 8 Sep 2016 18:19:27 +0300 > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote: > > > > On Thu, 8 Sep 2016 18:00:28 +0300 > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote: > > If it continues > > execution, this means we're expecting the guest or the host to do something > > to fix the error condition. This requires QEMU to emit an event of some > > sort, but not necessarily to log an error message in a file. I guess this > > depends if QEMU is run by some tooling, or by a human. > > I'm not sure we need an event if tools are not expected to > do anything with it. If we limit # of times error > is printed, tools will need to reset this counter, > so we will need an event on overflow. If the device goes into a broken state, it should be discoverable from outside. I'm not sure we need an actual event signalling this if this happens due to the guest doing something wrong: That would be a task for tools monitoring _inside_ the guest. For tools monitoring the health of the machine (from the host perspective), the discovery interface would probably be enough?
On Fri, 9 Sep 2016 10:30:53 +0200 Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > On Thu, 8 Sep 2016 19:55:16 +0300 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Thu, Sep 08, 2016 at 06:26:52PM +0200, Greg Kurz wrote: > > > On Thu, 8 Sep 2016 18:19:27 +0300 > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote: > > > > > On Thu, 8 Sep 2016 18:00:28 +0300 > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote: > > > > If it continues > > > execution, this means we're expecting the guest or the host to do something > > > to fix the error condition. This requires QEMU to emit an event of some > > > sort, but not necessarily to log an error message in a file. I guess this > > > depends if QEMU is run by some tooling, or by a human. > > > > I'm not sure we need an event if tools are not expected to > > do anything with it. If we limit # of times error > > is printed, tools will need to reset this counter, > > so we will need an event on overflow. > > If the device goes into a broken state, it should be discoverable from > outside. I'm not sure we need an actual event signalling this if this > happens due to the guest doing something wrong: That would be a task > for tools monitoring _inside_ the guest. Well, in case of a virtio device being broken, section 2.1.2 in the spec suggests to set the status to DEVICE_NEEDS_RESET and to notify it to the guest (aka. event signalling). I'll send a patch shortly. > For tools monitoring the > health of the machine (from the host perspective), the discovery > interface would probably be enough? > Yeah, probably. Cheers. -- Greg
On Fri, 9 Sep 2016 10:46:25 +0200 Greg Kurz <groug@kaod.org> wrote: > On Fri, 9 Sep 2016 10:30:53 +0200 > Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > > > On Thu, 8 Sep 2016 19:55:16 +0300 > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > On Thu, Sep 08, 2016 at 06:26:52PM +0200, Greg Kurz wrote: > > > > On Thu, 8 Sep 2016 18:19:27 +0300 > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote: > > > > > > On Thu, 8 Sep 2016 18:00:28 +0300 > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > > > > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote: > > > > > > If it continues > > > > execution, this means we're expecting the guest or the host to do something > > > > to fix the error condition. This requires QEMU to emit an event of some > > > > sort, but not necessarily to log an error message in a file. I guess this > > > > depends if QEMU is run by some tooling, or by a human. > > > > > > I'm not sure we need an event if tools are not expected to > > > do anything with it. If we limit # of times error > > > is printed, tools will need to reset this counter, > > > so we will need an event on overflow. > > > > If the device goes into a broken state, it should be discoverable from > > outside. I'm not sure we need an actual event signalling this if this > > happens due to the guest doing something wrong: That would be a task > > for tools monitoring _inside_ the guest. > > Well, in case of a virtio device being broken, section 2.1.2 in the spec > suggests to set the status to DEVICE_NEEDS_RESET and to notify it to > the guest (aka. event signalling). I'll send a patch shortly. Stefan had already sent <1460467534-29147-4-git-send-email-stefanha@redhat.com> ages ago, but it has not yet made it anywhere... Anyhow, I was concerned with host signalling (sorry for being unclear), and I still do not think we need to alert host monitoring software to guest stupidity. > > > For tools monitoring the > > health of the machine (from the host perspective), the discovery > > interface would probably be enough? > > > > Yeah, probably. > > Cheers. > > -- > Greg >
Greg Kurz <groug@kaod.org> writes: > On Fri, 09 Sep 2016 08:38:13 +0200 > Markus Armbruster <armbru@redhat.com> wrote: > >> Greg Kurz <groug@kaod.org> writes: >> >> > On Thu, 8 Sep 2016 18:19:27 +0300 >> > "Michael S. Tsirkin" <mst@redhat.com> wrote: >> > >> >> On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote: >> >> > On Thu, 8 Sep 2016 18:00:28 +0300 >> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote: >> >> > >> >> > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote: >> >> > > > On Thu, 8 Sep 2016 10:59:26 +0200 >> >> > > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote: >> >> > > > >> >> > > > > On Wed, 07 Sep 2016 19:19:24 +0200 >> >> > > > > Greg Kurz <groug@kaod.org> wrote: >> >> > > > > >> >> > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls >> >> > > > > > for a fix in QEMU. However, when something goes wrong because the guest >> >> > > > > > sends a malformed message, it is better to write down a more meaningul >> >> > > > > > error message and exit. >> >> > > > > > >> >> > > > > > Signed-off-by: Greg Kurz <groug@kaod.org> >> >> > > > > > --- >> >> > > > > > hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- >> >> > > > > > 1 file changed, 18 insertions(+), 2 deletions(-) >> >> > > > > >> >> > > > > While this is an improvement over the current state, I don't think the >> >> > > > > guest should be able to kill qemu just by doing something stupid. >> >> > > > > >> >> > > > >> >> > > > Hi Connie, >> >> > > > >> >> > > > I'm glad you're pointing this out... this was also my impression, but >> >> > > > since there are a bunch of sanity checks in the virtio code that cause >> >> > > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare >> >> > > > stand up :) >> >> > > >> >> > > It's true that it's broken in many places but we should just >> >> > > fix them all. >> >> > > >> >> > > >> >> > > A separate question is how to log such hardware/guest bugs generally. >> >> > > People already complained about disk filling up because of us printing >> >> > > errors on each such bug. Maybe print each message only N times, and >> >> > > then set a flag to skip the log until management tells us to restart >> >> > > logging again. >> >> > >> >> > I'd expect to get the message just once per device if we set the device >> >> > to broken (unless the guess continuously resets it again...) >> >> >> >> Which it can do, so we should limit that anyway. >> >> >> >> > Do we have >> >> > a generic print/log ratelimit infrastructure in qemu? >> >> >> >> There are actually two kinds of errors >> >> host side ones and ones triggered by guests. >> >> >> >> We should distinguish between them API-wise, then >> >> we will be able to limit the logging of those >> >> that guest can trigger. >> >> >> > >> > FWIW it makes sense to use error_report() if QEMU exits. >> >> exit(STATUS) with STATUS != 0 without printing a message is always >> wrong. >> > > I fully agree. > >> > If it continues >> > execution, this means we're expecting the guest or the host to do something >> > to fix the error condition. This requires QEMU to emit an event of some >> > sort, but not necessarily to log an error message in a file. I guess this >> > depends if QEMU is run by some tooling, or by a human. >> >> error_report() normally goes to stderr. Tooling or humans can of course >> make it go to a file instead. >> >> error_report() is indeed a sub-par way to send an "attention" signal to >> the host, because recognizing such a signal reliably is unnecessary hard >> for management applications. QMP events are much easier. >> > > My wording was poor but yes, that was my point. :) > >> Both are useless when the signal needs to go to the guest. Signalling >> the guest is a device model job. >> > > I also agree with that. In the case of virtio, this is explained in section > 2.1.2 of the spec. > >> error_report() without exit() has its uses. Error conditions in need of >> fixing aren't the only reason to call error_report(). But when you add >> a call, ask yourself whether management application or guest would like >> to respond to it. > > In the case of the present patch, we currently have BUG_ON() which generates > a cryptic and unusable message. > > It turns out that the first one (elem->out_num == 0 || elem->in_num == 0) is > correct since it is now [1] impossible to hit this according to the code (see > virtqueue_pop() and virtqueue_map_desc()). > > The second one (len != sizeof out) though matches a potential guest originated > error. If I do as suggested by Connie, then the error_report() isn't needed > anymore. I dive into the details of your analysis right now, only make high-level recommendations: * Issues common to all virtio devices should be addressed in the virtio core. If that's not feasible, they should be addressed in all devices consistently. * Guest misbehavior should put the device in a guest-observable error state. It should not crash QEMU, it should not spam stderr. Code handling it in other ways should be marked FIXME. * Nobody expects you to get things perfectly right in one step. Just try to move towards the goal. > > Cheers. > > -- > Greg > > [1] sending an empty buffer was sufficient before commit 1e7aed70144b4 as said > in my previous answer
On Fri, 9 Sep 2016 10:53:05 +0200 Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > On Fri, 9 Sep 2016 10:46:25 +0200 > Greg Kurz <groug@kaod.org> wrote: > > > On Fri, 9 Sep 2016 10:30:53 +0200 > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > > > > > On Thu, 8 Sep 2016 19:55:16 +0300 > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > On Thu, Sep 08, 2016 at 06:26:52PM +0200, Greg Kurz wrote: > > > > > On Thu, 8 Sep 2016 18:19:27 +0300 > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > > > On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote: > > > > > > > On Thu, 8 Sep 2016 18:00:28 +0300 > > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > > > > > > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote: > > > > > > > > If it continues > > > > > execution, this means we're expecting the guest or the host to do something > > > > > to fix the error condition. This requires QEMU to emit an event of some > > > > > sort, but not necessarily to log an error message in a file. I guess this > > > > > depends if QEMU is run by some tooling, or by a human. > > > > > > > > I'm not sure we need an event if tools are not expected to > > > > do anything with it. If we limit # of times error > > > > is printed, tools will need to reset this counter, > > > > so we will need an event on overflow. > > > > > > If the device goes into a broken state, it should be discoverable from > > > outside. I'm not sure we need an actual event signalling this if this > > > happens due to the guest doing something wrong: That would be a task > > > for tools monitoring _inside_ the guest. > > > > Well, in case of a virtio device being broken, section 2.1.2 in the spec > > suggests to set the status to DEVICE_NEEDS_RESET and to notify it to > > the guest (aka. event signalling). I'll send a patch shortly. > > Stefan had already sent > <1460467534-29147-4-git-send-email-stefanha@redhat.com> ages ago, but > it has not yet made it anywhere... > I don't know what to do with this message-id :\ > Anyhow, I was concerned with host signalling (sorry for being unclear), > and I still do not think we need to alert host monitoring software to > guest stupidity. > I agree. Sorry if my poor wording made you (and others) think I was suggesting that :) My point was that if QEMU exits because of guest stupidity, you are forced to error_report() something to the host, but this is really suboptimal (even if BUG_ON is worse)... then there was that discussion about log files getting to big, but I don't even know how we came there, as it does not really make sense when QEMU exits. > > > > > For tools monitoring the > > > health of the machine (from the host perspective), the discovery > > > interface would probably be enough? > > > > > > > Yeah, probably. > > > > Cheers. > > > > -- > > Greg > > >
On Fri, 9 Sep 2016 11:26:17 +0200 Greg Kurz <groug@kaod.org> wrote: > > > > Stefan had already sent > > <1460467534-29147-4-git-send-email-stefanha@redhat.com> ages ago, but > > it has not yet made it anywhere... > > > > I don't know what to do with this message-id :\ I finally found :) +message-id:<1460467534-29147-4-git-send-email-stefanha@redhat.com> in the search box at https://lists.nongnu.org/archive/html/qemu-devel/ Cheers. -- Greg
On Fri, 09 Sep 2016 11:08:56 +0200 Markus Armbruster <armbru@redhat.com> wrote: > Greg Kurz <groug@kaod.org> writes: > > > On Fri, 09 Sep 2016 08:38:13 +0200 > > Markus Armbruster <armbru@redhat.com> wrote: > > > >> Greg Kurz <groug@kaod.org> writes: > >> > >> > On Thu, 8 Sep 2016 18:19:27 +0300 > >> > "Michael S. Tsirkin" <mst@redhat.com> wrote: > >> > > >> >> On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote: > >> >> > On Thu, 8 Sep 2016 18:00:28 +0300 > >> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote: > >> >> > > >> >> > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote: > >> >> > > > On Thu, 8 Sep 2016 10:59:26 +0200 > >> >> > > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > >> >> > > > > >> >> > > > > On Wed, 07 Sep 2016 19:19:24 +0200 > >> >> > > > > Greg Kurz <groug@kaod.org> wrote: > >> >> > > > > > >> >> > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls > >> >> > > > > > for a fix in QEMU. However, when something goes wrong because the guest > >> >> > > > > > sends a malformed message, it is better to write down a more meaningul > >> >> > > > > > error message and exit. > >> >> > > > > > > >> >> > > > > > Signed-off-by: Greg Kurz <groug@kaod.org> > >> >> > > > > > --- > >> >> > > > > > hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- > >> >> > > > > > 1 file changed, 18 insertions(+), 2 deletions(-) > >> >> > > > > > >> >> > > > > While this is an improvement over the current state, I don't think the > >> >> > > > > guest should be able to kill qemu just by doing something stupid. > >> >> > > > > > >> >> > > > > >> >> > > > Hi Connie, > >> >> > > > > >> >> > > > I'm glad you're pointing this out... this was also my impression, but > >> >> > > > since there are a bunch of sanity checks in the virtio code that cause > >> >> > > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare > >> >> > > > stand up :) > >> >> > > > >> >> > > It's true that it's broken in many places but we should just > >> >> > > fix them all. > >> >> > > > >> >> > > > >> >> > > A separate question is how to log such hardware/guest bugs generally. > >> >> > > People already complained about disk filling up because of us printing > >> >> > > errors on each such bug. Maybe print each message only N times, and > >> >> > > then set a flag to skip the log until management tells us to restart > >> >> > > logging again. > >> >> > > >> >> > I'd expect to get the message just once per device if we set the device > >> >> > to broken (unless the guess continuously resets it again...) > >> >> > >> >> Which it can do, so we should limit that anyway. > >> >> > >> >> > Do we have > >> >> > a generic print/log ratelimit infrastructure in qemu? > >> >> > >> >> There are actually two kinds of errors > >> >> host side ones and ones triggered by guests. > >> >> > >> >> We should distinguish between them API-wise, then > >> >> we will be able to limit the logging of those > >> >> that guest can trigger. > >> >> > >> > > >> > FWIW it makes sense to use error_report() if QEMU exits. > >> > >> exit(STATUS) with STATUS != 0 without printing a message is always > >> wrong. > >> > > > > I fully agree. > > > >> > If it continues > >> > execution, this means we're expecting the guest or the host to do something > >> > to fix the error condition. This requires QEMU to emit an event of some > >> > sort, but not necessarily to log an error message in a file. I guess this > >> > depends if QEMU is run by some tooling, or by a human. > >> > >> error_report() normally goes to stderr. Tooling or humans can of course > >> make it go to a file instead. > >> > >> error_report() is indeed a sub-par way to send an "attention" signal to > >> the host, because recognizing such a signal reliably is unnecessary hard > >> for management applications. QMP events are much easier. > >> > > > > My wording was poor but yes, that was my point. :) > > > >> Both are useless when the signal needs to go to the guest. Signalling > >> the guest is a device model job. > >> > > > > I also agree with that. In the case of virtio, this is explained in section > > 2.1.2 of the spec. > > > >> error_report() without exit() has its uses. Error conditions in need of > >> fixing aren't the only reason to call error_report(). But when you add > >> a call, ask yourself whether management application or guest would like > >> to respond to it. > > > > In the case of the present patch, we currently have BUG_ON() which generates > > a cryptic and unusable message. > > > > It turns out that the first one (elem->out_num == 0 || elem->in_num == 0) is > > correct since it is now [1] impossible to hit this according to the code (see > > virtqueue_pop() and virtqueue_map_desc()). > > > > The second one (len != sizeof out) though matches a potential guest originated > > error. If I do as suggested by Connie, then the error_report() isn't needed > > anymore. > > I dive into the details of your analysis right now, only make high-level > recommendations: > > * Issues common to all virtio devices should be addressed in the virtio > core. If that's not feasible, they should be addressed in all devices > consistently. > Agreed. > * Guest misbehavior should put the device in a guest-observable error > state. It should not crash QEMU, it should not spam stderr. Code > handling it in other ways should be marked FIXME. > Agreed. FWIW a bunch of FIXMEs are missing in the virtio code then :) > * Nobody expects you to get things perfectly right in one step. Just > try to move towards the goal. > Sure ! I'm now reading through Stefan's series to address the issue: https://lists.nongnu.org/archive/html/qemu-devel/2016-04/msg01978.html Cheers. -- Greg > > > > Cheers. > > > > -- > > Greg > > > > [1] sending an empty buffer was sufficient before commit 1e7aed70144b4 as said > > in my previous answer
diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c index 009b43f6d045..67059182645a 100644 --- a/hw/9pfs/virtio-9p-device.c +++ b/hw/9pfs/virtio-9p-device.c @@ -19,6 +19,7 @@ #include "coth.h" #include "hw/virtio/virtio-access.h" #include "qemu/iov.h" +#include "qemu/error-report.h" void virtio_9p_push_and_notify(V9fsPDU *pdu) { @@ -35,6 +36,11 @@ void virtio_9p_push_and_notify(V9fsPDU *pdu) virtio_notify(VIRTIO_DEVICE(v), v->vq); } +static void virtio_9p_error(const char *msg) +{ + error_report("The virtio-9p driver in the guest has an issue: %s", msg); +} + static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq) { V9fsVirtioState *v = (V9fsVirtioState *)vdev; @@ -56,13 +62,23 @@ static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq) break; } - BUG_ON(elem->out_num == 0 || elem->in_num == 0); + if (elem->out_num == 0) { + virtio_9p_error("missing VirtFS request's header"); + exit(1); + } + if (elem->in_num == 0) { + virtio_9p_error("missing VirtFS reply's header"); + exit(1); + } QEMU_BUILD_BUG_ON(sizeof out != 7); v->elems[pdu->idx] = elem; len = iov_to_buf(elem->out_sg, elem->out_num, 0, &out, sizeof out); - BUG_ON(len != sizeof out); + if (len != sizeof out) { + virtio_9p_error("malformed VirtFS request"); + exit(1); + } pdu->size = le32_to_cpu(out.size_le);
Calling assert() really makes sense when hitting a genuine bug, which calls for a fix in QEMU. However, when something goes wrong because the guest sends a malformed message, it is better to write down a more meaningul error message and exit. Signed-off-by: Greg Kurz <groug@kaod.org> --- hw/9pfs/virtio-9p-device.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-)