diff mbox

[1/2] virtio-9p: print error message and exit instead of BUG_ON()

Message ID 147326876478.8546.16045138068342092499.stgit@bahia.lan (mailing list archive)
State New, archived
Headers show

Commit Message

Greg Kurz Sept. 7, 2016, 5:19 p.m. UTC
Calling assert() really makes sense when hitting a genuine bug, which calls
for a fix in QEMU. However, when something goes wrong because the guest
sends a malformed message, it is better to write down a more meaningul
error message and exit.

Signed-off-by: Greg Kurz <groug@kaod.org>
---
 hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

Comments

Markus Armbruster Sept. 8, 2016, 7:14 a.m. UTC | #1
Greg Kurz <groug@kaod.org> writes:

> Calling assert() really makes sense when hitting a genuine bug, which calls
> for a fix in QEMU. However, when something goes wrong because the guest
> sends a malformed message, it is better to write down a more meaningul
> error message and exit.
>
> Signed-off-by: Greg Kurz <groug@kaod.org>
> ---
>  hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index 009b43f6d045..67059182645a 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -19,6 +19,7 @@
>  #include "coth.h"
>  #include "hw/virtio/virtio-access.h"
>  #include "qemu/iov.h"
> +#include "qemu/error-report.h"
>  
>  void virtio_9p_push_and_notify(V9fsPDU *pdu)
>  {
> @@ -35,6 +36,11 @@ void virtio_9p_push_and_notify(V9fsPDU *pdu)
>      virtio_notify(VIRTIO_DEVICE(v), v->vq);
>  }
>  
> +static void virtio_9p_error(const char *msg)
> +{
> +    error_report("The virtio-9p driver in the guest has an issue: %s", msg);
> +}
> +
>  static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
>  {
>      V9fsVirtioState *v = (V9fsVirtioState *)vdev;
> @@ -56,13 +62,23 @@ static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
>              break;
>          }
>  
> -        BUG_ON(elem->out_num == 0 || elem->in_num == 0);
> +        if (elem->out_num == 0) {
> +            virtio_9p_error("missing VirtFS request's header");
> +            exit(1);
> +        }

Can the guest trigger this?

> +        if (elem->in_num == 0) {
> +            virtio_9p_error("missing VirtFS reply's header");
> +            exit(1);
> +        }

Same question.

>          QEMU_BUILD_BUG_ON(sizeof out != 7);
>  
>          v->elems[pdu->idx] = elem;
>          len = iov_to_buf(elem->out_sg, elem->out_num, 0,
>                           &out, sizeof out);
> -        BUG_ON(len != sizeof out);
> +        if (len != sizeof out) {
> +            virtio_9p_error("malformed VirtFS request");
> +            exit(1);
> +        }

Same question.

>  
>          pdu->size = le32_to_cpu(out.size_le);
>
Cornelia Huck Sept. 8, 2016, 8:59 a.m. UTC | #2
On Wed, 07 Sep 2016 19:19:24 +0200
Greg Kurz <groug@kaod.org> wrote:

> Calling assert() really makes sense when hitting a genuine bug, which calls
> for a fix in QEMU. However, when something goes wrong because the guest
> sends a malformed message, it is better to write down a more meaningul
> error message and exit.
> 
> Signed-off-by: Greg Kurz <groug@kaod.org>
> ---
>  hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)

While this is an improvement over the current state, I don't think the
guest should be able to kill qemu just by doing something stupid.

The right way to go is to mark the virtio device as broken and stop
doing any processing until the guest resets it. I think Stefan had a
patch series doing that for some base virtio errors, but I'd have to
search for it.
Greg Kurz Sept. 8, 2016, 9:05 a.m. UTC | #3
On Thu, 08 Sep 2016 09:14:05 +0200
Markus Armbruster <armbru@redhat.com> wrote:

> Greg Kurz <groug@kaod.org> writes:
> 
> > Calling assert() really makes sense when hitting a genuine bug, which calls
> > for a fix in QEMU. However, when something goes wrong because the guest
> > sends a malformed message, it is better to write down a more meaningul
> > error message and exit.
> >
> > Signed-off-by: Greg Kurz <groug@kaod.org>
> > ---
> >  hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
> >  1 file changed, 18 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > index 009b43f6d045..67059182645a 100644
> > --- a/hw/9pfs/virtio-9p-device.c
> > +++ b/hw/9pfs/virtio-9p-device.c
> > @@ -19,6 +19,7 @@
> >  #include "coth.h"
> >  #include "hw/virtio/virtio-access.h"
> >  #include "qemu/iov.h"
> > +#include "qemu/error-report.h"
> >  
> >  void virtio_9p_push_and_notify(V9fsPDU *pdu)
> >  {
> > @@ -35,6 +36,11 @@ void virtio_9p_push_and_notify(V9fsPDU *pdu)
> >      virtio_notify(VIRTIO_DEVICE(v), v->vq);
> >  }
> >  
> > +static void virtio_9p_error(const char *msg)
> > +{
> > +    error_report("The virtio-9p driver in the guest has an issue: %s", msg);
> > +}
> > +
> >  static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
> >  {
> >      V9fsVirtioState *v = (V9fsVirtioState *)vdev;
> > @@ -56,13 +62,23 @@ static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
> >              break;
> >          }
> >  
> > -        BUG_ON(elem->out_num == 0 || elem->in_num == 0);
> > +        if (elem->out_num == 0) {
> > +            virtio_9p_error("missing VirtFS request's header");
> > +            exit(1);
> > +        }  
> 
> Can the guest trigger this?
> 

Yes it can in theory if it pushes an empty buffer... but this "recent"
commit changed the outcome:

commit 1e7aed70144b4673fc26e73062064b6724795e5f
Author: Prasad J Pandit <pjp@fedoraproject.org>
Date:   Wed Jul 27 21:07:56 2016 +0530

    virtio: check vring descriptor buffer length

And now, the error is caught in virtqueue_map_desc():

    if (!sz) {
        error_report("virtio: zero sized buffers are not allowed");
        exit(1);
    }

So I guess we should keep the BUG_ON() then.

BTW, there are similar checks in virtio-blk and virtio-net leading to a QEMU
exit... which seem to be obsoleted by the above commit. I'll have a closer
look.

> > +        if (elem->in_num == 0) {
> > +            virtio_9p_error("missing VirtFS reply's header");
> > +            exit(1);
> > +        }  
> 
> Same question.
> 

Same answer. :)

> >          QEMU_BUILD_BUG_ON(sizeof out != 7);
> >  
> >          v->elems[pdu->idx] = elem;
> >          len = iov_to_buf(elem->out_sg, elem->out_num, 0,
> >                           &out, sizeof out);
> > -        BUG_ON(len != sizeof out);
> > +        if (len != sizeof out) {
> > +            virtio_9p_error("malformed VirtFS request");
> > +            exit(1);
> > +        }  
> 
> Same question.
> 

Here this is different: the guest can put a bogus len in the vring_desc
structure, and this doesn't get checked earlier.

> >  
> >          pdu->size = le32_to_cpu(out.size_le);
> >    

Cheers.

--
Greg
Greg Kurz Sept. 8, 2016, 9:12 a.m. UTC | #4
On Thu, 8 Sep 2016 10:59:26 +0200
Cornelia Huck <cornelia.huck@de.ibm.com> wrote:

> On Wed, 07 Sep 2016 19:19:24 +0200
> Greg Kurz <groug@kaod.org> wrote:
> 
> > Calling assert() really makes sense when hitting a genuine bug, which calls
> > for a fix in QEMU. However, when something goes wrong because the guest
> > sends a malformed message, it is better to write down a more meaningul
> > error message and exit.
> > 
> > Signed-off-by: Greg Kurz <groug@kaod.org>
> > ---
> >  hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
> >  1 file changed, 18 insertions(+), 2 deletions(-)  
> 
> While this is an improvement over the current state, I don't think the
> guest should be able to kill qemu just by doing something stupid.
> 

Hi Connie,

I'm glad you're pointing this out... this was also my impression, but
since there are a bunch of sanity checks in the virtio code that cause
QEMU to exit (even recently added like 1e7aed70144b), I did not dare
stand up :)

> The right way to go is to mark the virtio device as broken and stop
> doing any processing until the guest resets it. I think Stefan had a
> patch series doing that for some base virtio errors, but I'd have to
> search for it.
> 

I'd be glad to have a look and try to address this issue.

Thanks !

--
Greg
Michael S. Tsirkin Sept. 8, 2016, 3 p.m. UTC | #5
On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote:
> On Thu, 8 Sep 2016 10:59:26 +0200
> Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
> 
> > On Wed, 07 Sep 2016 19:19:24 +0200
> > Greg Kurz <groug@kaod.org> wrote:
> > 
> > > Calling assert() really makes sense when hitting a genuine bug, which calls
> > > for a fix in QEMU. However, when something goes wrong because the guest
> > > sends a malformed message, it is better to write down a more meaningul
> > > error message and exit.
> > > 
> > > Signed-off-by: Greg Kurz <groug@kaod.org>
> > > ---
> > >  hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
> > >  1 file changed, 18 insertions(+), 2 deletions(-)  
> > 
> > While this is an improvement over the current state, I don't think the
> > guest should be able to kill qemu just by doing something stupid.
> > 
> 
> Hi Connie,
> 
> I'm glad you're pointing this out... this was also my impression, but
> since there are a bunch of sanity checks in the virtio code that cause
> QEMU to exit (even recently added like 1e7aed70144b), I did not dare
> stand up :)

It's true that it's broken in many places but we should just
fix them all.


A separate question is how to log such hardware/guest bugs generally.
People already complained about disk filling up because of us printing
errors on each such bug.  Maybe print each message only N times, and
then set a flag to skip the log until management tells us to restart
logging again.


> > The right way to go is to mark the virtio device as broken and stop
> > doing any processing until the guest resets it. I think Stefan had a
> > patch series doing that for some base virtio errors, but I'd have to
> > search for it.
> > 
> 
> I'd be glad to have a look and try to address this issue.
> 
> Thanks !
> 
> --
> Greg
Cornelia Huck Sept. 8, 2016, 3:04 p.m. UTC | #6
On Thu, 8 Sep 2016 18:00:28 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote:
> > On Thu, 8 Sep 2016 10:59:26 +0200
> > Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
> > 
> > > On Wed, 07 Sep 2016 19:19:24 +0200
> > > Greg Kurz <groug@kaod.org> wrote:
> > > 
> > > > Calling assert() really makes sense when hitting a genuine bug, which calls
> > > > for a fix in QEMU. However, when something goes wrong because the guest
> > > > sends a malformed message, it is better to write down a more meaningul
> > > > error message and exit.
> > > > 
> > > > Signed-off-by: Greg Kurz <groug@kaod.org>
> > > > ---
> > > >  hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
> > > >  1 file changed, 18 insertions(+), 2 deletions(-)  
> > > 
> > > While this is an improvement over the current state, I don't think the
> > > guest should be able to kill qemu just by doing something stupid.
> > > 
> > 
> > Hi Connie,
> > 
> > I'm glad you're pointing this out... this was also my impression, but
> > since there are a bunch of sanity checks in the virtio code that cause
> > QEMU to exit (even recently added like 1e7aed70144b), I did not dare
> > stand up :)
> 
> It's true that it's broken in many places but we should just
> fix them all.
> 
> 
> A separate question is how to log such hardware/guest bugs generally.
> People already complained about disk filling up because of us printing
> errors on each such bug.  Maybe print each message only N times, and
> then set a flag to skip the log until management tells us to restart
> logging again.

I'd expect to get the message just once per device if we set the device
to broken (unless the guess continuously resets it again...) Do we have
a generic print/log ratelimit infrastructure in qemu?
Michael S. Tsirkin Sept. 8, 2016, 3:19 p.m. UTC | #7
On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote:
> On Thu, 8 Sep 2016 18:00:28 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote:
> > > On Thu, 8 Sep 2016 10:59:26 +0200
> > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
> > > 
> > > > On Wed, 07 Sep 2016 19:19:24 +0200
> > > > Greg Kurz <groug@kaod.org> wrote:
> > > > 
> > > > > Calling assert() really makes sense when hitting a genuine bug, which calls
> > > > > for a fix in QEMU. However, when something goes wrong because the guest
> > > > > sends a malformed message, it is better to write down a more meaningul
> > > > > error message and exit.
> > > > > 
> > > > > Signed-off-by: Greg Kurz <groug@kaod.org>
> > > > > ---
> > > > >  hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
> > > > >  1 file changed, 18 insertions(+), 2 deletions(-)  
> > > > 
> > > > While this is an improvement over the current state, I don't think the
> > > > guest should be able to kill qemu just by doing something stupid.
> > > > 
> > > 
> > > Hi Connie,
> > > 
> > > I'm glad you're pointing this out... this was also my impression, but
> > > since there are a bunch of sanity checks in the virtio code that cause
> > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare
> > > stand up :)
> > 
> > It's true that it's broken in many places but we should just
> > fix them all.
> > 
> > 
> > A separate question is how to log such hardware/guest bugs generally.
> > People already complained about disk filling up because of us printing
> > errors on each such bug.  Maybe print each message only N times, and
> > then set a flag to skip the log until management tells us to restart
> > logging again.
> 
> I'd expect to get the message just once per device if we set the device
> to broken (unless the guess continuously resets it again...)

Which it can do, so we should limit that anyway.

> Do we have
> a generic print/log ratelimit infrastructure in qemu?

There are actually two kinds of errors
host side ones and ones triggered by guests.

We should distinguish between them API-wise, then
we will be able to limit the logging of those
that guest can trigger.
Greg Kurz Sept. 8, 2016, 4:26 p.m. UTC | #8
On Thu, 8 Sep 2016 18:19:27 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote:
> > On Thu, 8 Sep 2016 18:00:28 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote:  
> > > > On Thu, 8 Sep 2016 10:59:26 +0200
> > > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
> > > >   
> > > > > On Wed, 07 Sep 2016 19:19:24 +0200
> > > > > Greg Kurz <groug@kaod.org> wrote:
> > > > >   
> > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls
> > > > > > for a fix in QEMU. However, when something goes wrong because the guest
> > > > > > sends a malformed message, it is better to write down a more meaningul
> > > > > > error message and exit.
> > > > > > 
> > > > > > Signed-off-by: Greg Kurz <groug@kaod.org>
> > > > > > ---
> > > > > >  hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
> > > > > >  1 file changed, 18 insertions(+), 2 deletions(-)    
> > > > > 
> > > > > While this is an improvement over the current state, I don't think the
> > > > > guest should be able to kill qemu just by doing something stupid.
> > > > >   
> > > > 
> > > > Hi Connie,
> > > > 
> > > > I'm glad you're pointing this out... this was also my impression, but
> > > > since there are a bunch of sanity checks in the virtio code that cause
> > > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare
> > > > stand up :)  
> > > 
> > > It's true that it's broken in many places but we should just
> > > fix them all.
> > > 
> > > 
> > > A separate question is how to log such hardware/guest bugs generally.
> > > People already complained about disk filling up because of us printing
> > > errors on each such bug.  Maybe print each message only N times, and
> > > then set a flag to skip the log until management tells us to restart
> > > logging again.  
> > 
> > I'd expect to get the message just once per device if we set the device
> > to broken (unless the guess continuously resets it again...)  
> 
> Which it can do, so we should limit that anyway.
> 
> > Do we have
> > a generic print/log ratelimit infrastructure in qemu?  
> 
> There are actually two kinds of errors
> host side ones and ones triggered by guests.
> 
> We should distinguish between them API-wise, then
> we will be able to limit the logging of those
> that guest can trigger.
> 

FWIW it makes sense to use error_report() if QEMU exits. If it continues
execution, this means we're expecting the guest or the host to do something
to fix the error condition. This requires QEMU to emit an event of some
sort, but not necessarily to log an error message in a file. I guess this
depends if QEMU is run by some tooling, or by a human.
Michael S. Tsirkin Sept. 8, 2016, 4:55 p.m. UTC | #9
On Thu, Sep 08, 2016 at 06:26:52PM +0200, Greg Kurz wrote:
> On Thu, 8 Sep 2016 18:19:27 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote:
> > > On Thu, 8 Sep 2016 18:00:28 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote:  
> > > > > On Thu, 8 Sep 2016 10:59:26 +0200
> > > > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
> > > > >   
> > > > > > On Wed, 07 Sep 2016 19:19:24 +0200
> > > > > > Greg Kurz <groug@kaod.org> wrote:
> > > > > >   
> > > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls
> > > > > > > for a fix in QEMU. However, when something goes wrong because the guest
> > > > > > > sends a malformed message, it is better to write down a more meaningul
> > > > > > > error message and exit.
> > > > > > > 
> > > > > > > Signed-off-by: Greg Kurz <groug@kaod.org>
> > > > > > > ---
> > > > > > >  hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
> > > > > > >  1 file changed, 18 insertions(+), 2 deletions(-)    
> > > > > > 
> > > > > > While this is an improvement over the current state, I don't think the
> > > > > > guest should be able to kill qemu just by doing something stupid.
> > > > > >   
> > > > > 
> > > > > Hi Connie,
> > > > > 
> > > > > I'm glad you're pointing this out... this was also my impression, but
> > > > > since there are a bunch of sanity checks in the virtio code that cause
> > > > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare
> > > > > stand up :)  
> > > > 
> > > > It's true that it's broken in many places but we should just
> > > > fix them all.
> > > > 
> > > > 
> > > > A separate question is how to log such hardware/guest bugs generally.
> > > > People already complained about disk filling up because of us printing
> > > > errors on each such bug.  Maybe print each message only N times, and
> > > > then set a flag to skip the log until management tells us to restart
> > > > logging again.  
> > > 
> > > I'd expect to get the message just once per device if we set the device
> > > to broken (unless the guess continuously resets it again...)  
> > 
> > Which it can do, so we should limit that anyway.
> > 
> > > Do we have
> > > a generic print/log ratelimit infrastructure in qemu?  
> > 
> > There are actually two kinds of errors
> > host side ones and ones triggered by guests.
> > 
> > We should distinguish between them API-wise, then
> > we will be able to limit the logging of those
> > that guest can trigger.
> > 
> 
> FWIW it makes sense to use error_report() if QEMU exits.

Not necessarily e.g. hotplug errors trigger error_report too.
Generally it should be for host misconfiguration or similar
management errors.

> If it continues
> execution, this means we're expecting the guest or the host to do something
> to fix the error condition. This requires QEMU to emit an event of some
> sort, but not necessarily to log an error message in a file. I guess this
> depends if QEMU is run by some tooling, or by a human.

I'm not sure we need an event if tools are not expected to
do anything with it. If we limit # of times error
is printed, tools will need to reset this counter,
so we will need an event on overflow.
Markus Armbruster Sept. 9, 2016, 6:38 a.m. UTC | #10
Greg Kurz <groug@kaod.org> writes:

> On Thu, 8 Sep 2016 18:19:27 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>> On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote:
>> > On Thu, 8 Sep 2016 18:00:28 +0300
>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >   
>> > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote:  
>> > > > On Thu, 8 Sep 2016 10:59:26 +0200
>> > > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
>> > > >   
>> > > > > On Wed, 07 Sep 2016 19:19:24 +0200
>> > > > > Greg Kurz <groug@kaod.org> wrote:
>> > > > >   
>> > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls
>> > > > > > for a fix in QEMU. However, when something goes wrong because the guest
>> > > > > > sends a malformed message, it is better to write down a more meaningul
>> > > > > > error message and exit.
>> > > > > > 
>> > > > > > Signed-off-by: Greg Kurz <groug@kaod.org>
>> > > > > > ---
>> > > > > >  hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
>> > > > > >  1 file changed, 18 insertions(+), 2 deletions(-)    
>> > > > > 
>> > > > > While this is an improvement over the current state, I don't think the
>> > > > > guest should be able to kill qemu just by doing something stupid.
>> > > > >   
>> > > > 
>> > > > Hi Connie,
>> > > > 
>> > > > I'm glad you're pointing this out... this was also my impression, but
>> > > > since there are a bunch of sanity checks in the virtio code that cause
>> > > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare
>> > > > stand up :)  
>> > > 
>> > > It's true that it's broken in many places but we should just
>> > > fix them all.
>> > > 
>> > > 
>> > > A separate question is how to log such hardware/guest bugs generally.
>> > > People already complained about disk filling up because of us printing
>> > > errors on each such bug.  Maybe print each message only N times, and
>> > > then set a flag to skip the log until management tells us to restart
>> > > logging again.  
>> > 
>> > I'd expect to get the message just once per device if we set the device
>> > to broken (unless the guess continuously resets it again...)  
>> 
>> Which it can do, so we should limit that anyway.
>> 
>> > Do we have
>> > a generic print/log ratelimit infrastructure in qemu?  
>> 
>> There are actually two kinds of errors
>> host side ones and ones triggered by guests.
>> 
>> We should distinguish between them API-wise, then
>> we will be able to limit the logging of those
>> that guest can trigger.
>> 
>
> FWIW it makes sense to use error_report() if QEMU exits.

exit(STATUS) with STATUS != 0 without printing a message is always
wrong.

>                                                          If it continues
> execution, this means we're expecting the guest or the host to do something
> to fix the error condition. This requires QEMU to emit an event of some
> sort, but not necessarily to log an error message in a file. I guess this
> depends if QEMU is run by some tooling, or by a human.

error_report() normally goes to stderr.  Tooling or humans can of course
make it go to a file instead.

error_report() is indeed a sub-par way to send an "attention" signal to
the host, because recognizing such a signal reliably is unnecessary hard
for management applications.  QMP events are much easier.

Both are useless when the signal needs to go to the guest.  Signalling
the guest is a device model job.

error_report() without exit() has its uses.  Error conditions in need of
fixing aren't the only reason to call error_report().  But when you add
a call, ask yourself whether management application or guest would like
to respond to it.
Greg Kurz Sept. 9, 2016, 7:30 a.m. UTC | #11
On Fri, 09 Sep 2016 08:38:13 +0200
Markus Armbruster <armbru@redhat.com> wrote:

> Greg Kurz <groug@kaod.org> writes:
> 
> > On Thu, 8 Sep 2016 18:19:27 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >  
> >> On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote:  
> >> > On Thu, 8 Sep 2016 18:00:28 +0300
> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >     
> >> > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote:    
> >> > > > On Thu, 8 Sep 2016 10:59:26 +0200
> >> > > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
> >> > > >     
> >> > > > > On Wed, 07 Sep 2016 19:19:24 +0200
> >> > > > > Greg Kurz <groug@kaod.org> wrote:
> >> > > > >     
> >> > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls
> >> > > > > > for a fix in QEMU. However, when something goes wrong because the guest
> >> > > > > > sends a malformed message, it is better to write down a more meaningul
> >> > > > > > error message and exit.
> >> > > > > > 
> >> > > > > > Signed-off-by: Greg Kurz <groug@kaod.org>
> >> > > > > > ---
> >> > > > > >  hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
> >> > > > > >  1 file changed, 18 insertions(+), 2 deletions(-)      
> >> > > > > 
> >> > > > > While this is an improvement over the current state, I don't think the
> >> > > > > guest should be able to kill qemu just by doing something stupid.
> >> > > > >     
> >> > > > 
> >> > > > Hi Connie,
> >> > > > 
> >> > > > I'm glad you're pointing this out... this was also my impression, but
> >> > > > since there are a bunch of sanity checks in the virtio code that cause
> >> > > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare
> >> > > > stand up :)    
> >> > > 
> >> > > It's true that it's broken in many places but we should just
> >> > > fix them all.
> >> > > 
> >> > > 
> >> > > A separate question is how to log such hardware/guest bugs generally.
> >> > > People already complained about disk filling up because of us printing
> >> > > errors on each such bug.  Maybe print each message only N times, and
> >> > > then set a flag to skip the log until management tells us to restart
> >> > > logging again.    
> >> > 
> >> > I'd expect to get the message just once per device if we set the device
> >> > to broken (unless the guess continuously resets it again...)    
> >> 
> >> Which it can do, so we should limit that anyway.
> >>   
> >> > Do we have
> >> > a generic print/log ratelimit infrastructure in qemu?    
> >> 
> >> There are actually two kinds of errors
> >> host side ones and ones triggered by guests.
> >> 
> >> We should distinguish between them API-wise, then
> >> we will be able to limit the logging of those
> >> that guest can trigger.
> >>   
> >
> > FWIW it makes sense to use error_report() if QEMU exits.  
> 
> exit(STATUS) with STATUS != 0 without printing a message is always
> wrong.
> 

I fully agree.

> >                                                          If it continues
> > execution, this means we're expecting the guest or the host to do something
> > to fix the error condition. This requires QEMU to emit an event of some
> > sort, but not necessarily to log an error message in a file. I guess this
> > depends if QEMU is run by some tooling, or by a human.  
> 
> error_report() normally goes to stderr.  Tooling or humans can of course
> make it go to a file instead.
> 
> error_report() is indeed a sub-par way to send an "attention" signal to
> the host, because recognizing such a signal reliably is unnecessary hard
> for management applications.  QMP events are much easier.
> 

My wording was poor but yes, that was my point. :)

> Both are useless when the signal needs to go to the guest.  Signalling
> the guest is a device model job.
> 

I also agree with that. In the case of virtio, this is explained in section
2.1.2 of the spec.

> error_report() without exit() has its uses.  Error conditions in need of
> fixing aren't the only reason to call error_report().  But when you add
> a call, ask yourself whether management application or guest would like
> to respond to it.

In the case of the present patch, we currently have BUG_ON() which generates
a cryptic and unusable message.

It turns out that the first one (elem->out_num == 0 || elem->in_num == 0) is
correct since it is now [1] impossible to hit this according to the code (see
virtqueue_pop() and virtqueue_map_desc()).

The second one (len != sizeof out) though matches a potential guest originated
error. If I do as suggested by Connie, then the error_report() isn't needed
anymore.

Cheers.

--
Greg

[1] sending an empty buffer was sufficient before commit 1e7aed70144b4 as said
    in my previous answer
Cornelia Huck Sept. 9, 2016, 8:30 a.m. UTC | #12
On Thu, 8 Sep 2016 19:55:16 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Sep 08, 2016 at 06:26:52PM +0200, Greg Kurz wrote:
> > On Thu, 8 Sep 2016 18:19:27 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote:
> > > > On Thu, 8 Sep 2016 18:00:28 +0300
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >   
> > > > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote:  

> > If it continues
> > execution, this means we're expecting the guest or the host to do something
> > to fix the error condition. This requires QEMU to emit an event of some
> > sort, but not necessarily to log an error message in a file. I guess this
> > depends if QEMU is run by some tooling, or by a human.
> 
> I'm not sure we need an event if tools are not expected to
> do anything with it. If we limit # of times error
> is printed, tools will need to reset this counter,
> so we will need an event on overflow.

If the device goes into a broken state, it should be discoverable from
outside. I'm not sure we need an actual event signalling this if this
happens due to the guest doing something wrong: That would be a task
for tools monitoring _inside_ the guest. For tools monitoring the
health of the machine (from the host perspective), the discovery
interface would probably be enough?
Greg Kurz Sept. 9, 2016, 8:46 a.m. UTC | #13
On Fri, 9 Sep 2016 10:30:53 +0200
Cornelia Huck <cornelia.huck@de.ibm.com> wrote:

> On Thu, 8 Sep 2016 19:55:16 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Sep 08, 2016 at 06:26:52PM +0200, Greg Kurz wrote:  
> > > On Thu, 8 Sep 2016 18:19:27 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote:  
> > > > > On Thu, 8 Sep 2016 18:00:28 +0300
> > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > >     
> > > > > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote:    
> 
> > > If it continues
> > > execution, this means we're expecting the guest or the host to do something
> > > to fix the error condition. This requires QEMU to emit an event of some
> > > sort, but not necessarily to log an error message in a file. I guess this
> > > depends if QEMU is run by some tooling, or by a human.  
> > 
> > I'm not sure we need an event if tools are not expected to
> > do anything with it. If we limit # of times error
> > is printed, tools will need to reset this counter,
> > so we will need an event on overflow.  
> 
> If the device goes into a broken state, it should be discoverable from
> outside. I'm not sure we need an actual event signalling this if this
> happens due to the guest doing something wrong: That would be a task
> for tools monitoring _inside_ the guest. 

Well, in case of a virtio device being broken, section 2.1.2 in the spec
suggests to set the status to DEVICE_NEEDS_RESET and to notify it to
the guest (aka. event signalling). I'll send a patch shortly.

> For tools monitoring the
> health of the machine (from the host perspective), the discovery
> interface would probably be enough?
> 

Yeah, probably.

Cheers.

--
Greg
Cornelia Huck Sept. 9, 2016, 8:53 a.m. UTC | #14
On Fri, 9 Sep 2016 10:46:25 +0200
Greg Kurz <groug@kaod.org> wrote:

> On Fri, 9 Sep 2016 10:30:53 +0200
> Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
> 
> > On Thu, 8 Sep 2016 19:55:16 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Thu, Sep 08, 2016 at 06:26:52PM +0200, Greg Kurz wrote:  
> > > > On Thu, 8 Sep 2016 18:19:27 +0300
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >   
> > > > > On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote:  
> > > > > > On Thu, 8 Sep 2016 18:00:28 +0300
> > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > >     
> > > > > > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote:    
> > 
> > > > If it continues
> > > > execution, this means we're expecting the guest or the host to do something
> > > > to fix the error condition. This requires QEMU to emit an event of some
> > > > sort, but not necessarily to log an error message in a file. I guess this
> > > > depends if QEMU is run by some tooling, or by a human.  
> > > 
> > > I'm not sure we need an event if tools are not expected to
> > > do anything with it. If we limit # of times error
> > > is printed, tools will need to reset this counter,
> > > so we will need an event on overflow.  
> > 
> > If the device goes into a broken state, it should be discoverable from
> > outside. I'm not sure we need an actual event signalling this if this
> > happens due to the guest doing something wrong: That would be a task
> > for tools monitoring _inside_ the guest. 
> 
> Well, in case of a virtio device being broken, section 2.1.2 in the spec
> suggests to set the status to DEVICE_NEEDS_RESET and to notify it to
> the guest (aka. event signalling). I'll send a patch shortly.

Stefan had already sent
<1460467534-29147-4-git-send-email-stefanha@redhat.com> ages ago, but
it has not yet made it anywhere...

Anyhow, I was concerned with host signalling (sorry for being unclear),
and I still do not think we need to alert host monitoring software to
guest stupidity.

> 
> > For tools monitoring the
> > health of the machine (from the host perspective), the discovery
> > interface would probably be enough?
> > 
> 
> Yeah, probably.
> 
> Cheers.
> 
> --
> Greg
>
Markus Armbruster Sept. 9, 2016, 9:08 a.m. UTC | #15
Greg Kurz <groug@kaod.org> writes:

> On Fri, 09 Sep 2016 08:38:13 +0200
> Markus Armbruster <armbru@redhat.com> wrote:
>
>> Greg Kurz <groug@kaod.org> writes:
>> 
>> > On Thu, 8 Sep 2016 18:19:27 +0300
>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >  
>> >> On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote:  
>> >> > On Thu, 8 Sep 2016 18:00:28 +0300
>> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> >     
>> >> > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote:    
>> >> > > > On Thu, 8 Sep 2016 10:59:26 +0200
>> >> > > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
>> >> > > >     
>> >> > > > > On Wed, 07 Sep 2016 19:19:24 +0200
>> >> > > > > Greg Kurz <groug@kaod.org> wrote:
>> >> > > > >     
>> >> > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls
>> >> > > > > > for a fix in QEMU. However, when something goes wrong because the guest
>> >> > > > > > sends a malformed message, it is better to write down a more meaningul
>> >> > > > > > error message and exit.
>> >> > > > > > 
>> >> > > > > > Signed-off-by: Greg Kurz <groug@kaod.org>
>> >> > > > > > ---
>> >> > > > > >  hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
>> >> > > > > >  1 file changed, 18 insertions(+), 2 deletions(-)      
>> >> > > > > 
>> >> > > > > While this is an improvement over the current state, I don't think the
>> >> > > > > guest should be able to kill qemu just by doing something stupid.
>> >> > > > >     
>> >> > > > 
>> >> > > > Hi Connie,
>> >> > > > 
>> >> > > > I'm glad you're pointing this out... this was also my impression, but
>> >> > > > since there are a bunch of sanity checks in the virtio code that cause
>> >> > > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare
>> >> > > > stand up :)    
>> >> > > 
>> >> > > It's true that it's broken in many places but we should just
>> >> > > fix them all.
>> >> > > 
>> >> > > 
>> >> > > A separate question is how to log such hardware/guest bugs generally.
>> >> > > People already complained about disk filling up because of us printing
>> >> > > errors on each such bug.  Maybe print each message only N times, and
>> >> > > then set a flag to skip the log until management tells us to restart
>> >> > > logging again.    
>> >> > 
>> >> > I'd expect to get the message just once per device if we set the device
>> >> > to broken (unless the guess continuously resets it again...)    
>> >> 
>> >> Which it can do, so we should limit that anyway.
>> >>   
>> >> > Do we have
>> >> > a generic print/log ratelimit infrastructure in qemu?    
>> >> 
>> >> There are actually two kinds of errors
>> >> host side ones and ones triggered by guests.
>> >> 
>> >> We should distinguish between them API-wise, then
>> >> we will be able to limit the logging of those
>> >> that guest can trigger.
>> >>   
>> >
>> > FWIW it makes sense to use error_report() if QEMU exits.  
>> 
>> exit(STATUS) with STATUS != 0 without printing a message is always
>> wrong.
>> 
>
> I fully agree.
>
>> >                                                          If it continues
>> > execution, this means we're expecting the guest or the host to do something
>> > to fix the error condition. This requires QEMU to emit an event of some
>> > sort, but not necessarily to log an error message in a file. I guess this
>> > depends if QEMU is run by some tooling, or by a human.  
>> 
>> error_report() normally goes to stderr.  Tooling or humans can of course
>> make it go to a file instead.
>> 
>> error_report() is indeed a sub-par way to send an "attention" signal to
>> the host, because recognizing such a signal reliably is unnecessary hard
>> for management applications.  QMP events are much easier.
>> 
>
> My wording was poor but yes, that was my point. :)
>
>> Both are useless when the signal needs to go to the guest.  Signalling
>> the guest is a device model job.
>> 
>
> I also agree with that. In the case of virtio, this is explained in section
> 2.1.2 of the spec.
>
>> error_report() without exit() has its uses.  Error conditions in need of
>> fixing aren't the only reason to call error_report().  But when you add
>> a call, ask yourself whether management application or guest would like
>> to respond to it.
>
> In the case of the present patch, we currently have BUG_ON() which generates
> a cryptic and unusable message.
>
> It turns out that the first one (elem->out_num == 0 || elem->in_num == 0) is
> correct since it is now [1] impossible to hit this according to the code (see
> virtqueue_pop() and virtqueue_map_desc()).
>
> The second one (len != sizeof out) though matches a potential guest originated
> error. If I do as suggested by Connie, then the error_report() isn't needed
> anymore.

I dive into the details of your analysis right now, only make high-level
recommendations:

* Issues common to all virtio devices should be addressed in the virtio
  core.  If that's not feasible, they should be addressed in all devices
  consistently.

* Guest misbehavior should put the device in a guest-observable error
  state.  It should not crash QEMU, it should not spam stderr.  Code
  handling it in other ways should be marked FIXME.

* Nobody expects you to get things perfectly right in one step.  Just
  try to move towards the goal.

>
> Cheers.
>
> --
> Greg
>
> [1] sending an empty buffer was sufficient before commit 1e7aed70144b4 as said
>     in my previous answer
Greg Kurz Sept. 9, 2016, 9:26 a.m. UTC | #16
On Fri, 9 Sep 2016 10:53:05 +0200
Cornelia Huck <cornelia.huck@de.ibm.com> wrote:

> On Fri, 9 Sep 2016 10:46:25 +0200
> Greg Kurz <groug@kaod.org> wrote:
> 
> > On Fri, 9 Sep 2016 10:30:53 +0200
> > Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
> >   
> > > On Thu, 8 Sep 2016 19:55:16 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > On Thu, Sep 08, 2016 at 06:26:52PM +0200, Greg Kurz wrote:    
> > > > > On Thu, 8 Sep 2016 18:19:27 +0300
> > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > >     
> > > > > > On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote:    
> > > > > > > On Thu, 8 Sep 2016 18:00:28 +0300
> > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > >       
> > > > > > > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote:      
> > >   
> > > > > If it continues
> > > > > execution, this means we're expecting the guest or the host to do something
> > > > > to fix the error condition. This requires QEMU to emit an event of some
> > > > > sort, but not necessarily to log an error message in a file. I guess this
> > > > > depends if QEMU is run by some tooling, or by a human.    
> > > > 
> > > > I'm not sure we need an event if tools are not expected to
> > > > do anything with it. If we limit # of times error
> > > > is printed, tools will need to reset this counter,
> > > > so we will need an event on overflow.    
> > > 
> > > If the device goes into a broken state, it should be discoverable from
> > > outside. I'm not sure we need an actual event signalling this if this
> > > happens due to the guest doing something wrong: That would be a task
> > > for tools monitoring _inside_ the guest.   
> > 
> > Well, in case of a virtio device being broken, section 2.1.2 in the spec
> > suggests to set the status to DEVICE_NEEDS_RESET and to notify it to
> > the guest (aka. event signalling). I'll send a patch shortly.  
> 
> Stefan had already sent
> <1460467534-29147-4-git-send-email-stefanha@redhat.com> ages ago, but
> it has not yet made it anywhere...
> 

I don't know what to do with this message-id :\

> Anyhow, I was concerned with host signalling (sorry for being unclear),
> and I still do not think we need to alert host monitoring software to
> guest stupidity.
> 

I agree. Sorry if my poor wording made you (and others) think I was
suggesting that :) My point was that if QEMU exits because of guest
stupidity, you are forced to error_report() something to the host,
but this is really suboptimal (even if BUG_ON is worse)... then
there was that discussion about log files getting to big, but I don't
even know how we came there, as it does not really make sense when QEMU
exits.

> >   
> > > For tools monitoring the
> > > health of the machine (from the host perspective), the discovery
> > > interface would probably be enough?
> > >   
> > 
> > Yeah, probably.
> > 
> > Cheers.
> > 
> > --
> > Greg
> >   
>
Greg Kurz Sept. 9, 2016, 9:37 a.m. UTC | #17
On Fri, 9 Sep 2016 11:26:17 +0200
Greg Kurz <groug@kaod.org> wrote:

> > 
> > Stefan had already sent
> > <1460467534-29147-4-git-send-email-stefanha@redhat.com> ages ago, but
> > it has not yet made it anywhere...
> >   
> 
> I don't know what to do with this message-id :\

I finally found :)

+message-id:<1460467534-29147-4-git-send-email-stefanha@redhat.com> in the
search box at https://lists.nongnu.org/archive/html/qemu-devel/

Cheers.

--
Greg
Greg Kurz Sept. 9, 2016, 9:54 a.m. UTC | #18
On Fri, 09 Sep 2016 11:08:56 +0200
Markus Armbruster <armbru@redhat.com> wrote:

> Greg Kurz <groug@kaod.org> writes:
> 
> > On Fri, 09 Sep 2016 08:38:13 +0200
> > Markus Armbruster <armbru@redhat.com> wrote:
> >  
> >> Greg Kurz <groug@kaod.org> writes:
> >>   
> >> > On Thu, 8 Sep 2016 18:19:27 +0300
> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >    
> >> >> On Thu, Sep 08, 2016 at 05:04:47PM +0200, Cornelia Huck wrote:    
> >> >> > On Thu, 8 Sep 2016 18:00:28 +0300
> >> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >> >       
> >> >> > > On Thu, Sep 08, 2016 at 11:12:16AM +0200, Greg Kurz wrote:      
> >> >> > > > On Thu, 8 Sep 2016 10:59:26 +0200
> >> >> > > > Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
> >> >> > > >       
> >> >> > > > > On Wed, 07 Sep 2016 19:19:24 +0200
> >> >> > > > > Greg Kurz <groug@kaod.org> wrote:
> >> >> > > > >       
> >> >> > > > > > Calling assert() really makes sense when hitting a genuine bug, which calls
> >> >> > > > > > for a fix in QEMU. However, when something goes wrong because the guest
> >> >> > > > > > sends a malformed message, it is better to write down a more meaningul
> >> >> > > > > > error message and exit.
> >> >> > > > > > 
> >> >> > > > > > Signed-off-by: Greg Kurz <groug@kaod.org>
> >> >> > > > > > ---
> >> >> > > > > >  hw/9pfs/virtio-9p-device.c |   20 ++++++++++++++++++--
> >> >> > > > > >  1 file changed, 18 insertions(+), 2 deletions(-)        
> >> >> > > > > 
> >> >> > > > > While this is an improvement over the current state, I don't think the
> >> >> > > > > guest should be able to kill qemu just by doing something stupid.
> >> >> > > > >       
> >> >> > > > 
> >> >> > > > Hi Connie,
> >> >> > > > 
> >> >> > > > I'm glad you're pointing this out... this was also my impression, but
> >> >> > > > since there are a bunch of sanity checks in the virtio code that cause
> >> >> > > > QEMU to exit (even recently added like 1e7aed70144b), I did not dare
> >> >> > > > stand up :)      
> >> >> > > 
> >> >> > > It's true that it's broken in many places but we should just
> >> >> > > fix them all.
> >> >> > > 
> >> >> > > 
> >> >> > > A separate question is how to log such hardware/guest bugs generally.
> >> >> > > People already complained about disk filling up because of us printing
> >> >> > > errors on each such bug.  Maybe print each message only N times, and
> >> >> > > then set a flag to skip the log until management tells us to restart
> >> >> > > logging again.      
> >> >> > 
> >> >> > I'd expect to get the message just once per device if we set the device
> >> >> > to broken (unless the guess continuously resets it again...)      
> >> >> 
> >> >> Which it can do, so we should limit that anyway.
> >> >>     
> >> >> > Do we have
> >> >> > a generic print/log ratelimit infrastructure in qemu?      
> >> >> 
> >> >> There are actually two kinds of errors
> >> >> host side ones and ones triggered by guests.
> >> >> 
> >> >> We should distinguish between them API-wise, then
> >> >> we will be able to limit the logging of those
> >> >> that guest can trigger.
> >> >>     
> >> >
> >> > FWIW it makes sense to use error_report() if QEMU exits.    
> >> 
> >> exit(STATUS) with STATUS != 0 without printing a message is always
> >> wrong.
> >>   
> >
> > I fully agree.
> >  
> >> >                                                          If it continues
> >> > execution, this means we're expecting the guest or the host to do something
> >> > to fix the error condition. This requires QEMU to emit an event of some
> >> > sort, but not necessarily to log an error message in a file. I guess this
> >> > depends if QEMU is run by some tooling, or by a human.    
> >> 
> >> error_report() normally goes to stderr.  Tooling or humans can of course
> >> make it go to a file instead.
> >> 
> >> error_report() is indeed a sub-par way to send an "attention" signal to
> >> the host, because recognizing such a signal reliably is unnecessary hard
> >> for management applications.  QMP events are much easier.
> >>   
> >
> > My wording was poor but yes, that was my point. :)
> >  
> >> Both are useless when the signal needs to go to the guest.  Signalling
> >> the guest is a device model job.
> >>   
> >
> > I also agree with that. In the case of virtio, this is explained in section
> > 2.1.2 of the spec.
> >  
> >> error_report() without exit() has its uses.  Error conditions in need of
> >> fixing aren't the only reason to call error_report().  But when you add
> >> a call, ask yourself whether management application or guest would like
> >> to respond to it.  
> >
> > In the case of the present patch, we currently have BUG_ON() which generates
> > a cryptic and unusable message.
> >
> > It turns out that the first one (elem->out_num == 0 || elem->in_num == 0) is
> > correct since it is now [1] impossible to hit this according to the code (see
> > virtqueue_pop() and virtqueue_map_desc()).
> >
> > The second one (len != sizeof out) though matches a potential guest originated
> > error. If I do as suggested by Connie, then the error_report() isn't needed
> > anymore.  
> 
> I dive into the details of your analysis right now, only make high-level
> recommendations:
> 
> * Issues common to all virtio devices should be addressed in the virtio
>   core.  If that's not feasible, they should be addressed in all devices
>   consistently.
> 

Agreed.

> * Guest misbehavior should put the device in a guest-observable error
>   state.  It should not crash QEMU, it should not spam stderr.  Code
>   handling it in other ways should be marked FIXME.
> 

Agreed. FWIW a bunch of FIXMEs are missing in the virtio code then :)

> * Nobody expects you to get things perfectly right in one step.  Just
>   try to move towards the goal.
> 

Sure ! I'm now reading through Stefan's series to address the issue:

https://lists.nongnu.org/archive/html/qemu-devel/2016-04/msg01978.html

Cheers.

--
Greg

> >
> > Cheers.
> >
> > --
> > Greg
> >
> > [1] sending an empty buffer was sufficient before commit 1e7aed70144b4 as said
> >     in my previous answer
diff mbox

Patch

diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index 009b43f6d045..67059182645a 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -19,6 +19,7 @@ 
 #include "coth.h"
 #include "hw/virtio/virtio-access.h"
 #include "qemu/iov.h"
+#include "qemu/error-report.h"
 
 void virtio_9p_push_and_notify(V9fsPDU *pdu)
 {
@@ -35,6 +36,11 @@  void virtio_9p_push_and_notify(V9fsPDU *pdu)
     virtio_notify(VIRTIO_DEVICE(v), v->vq);
 }
 
+static void virtio_9p_error(const char *msg)
+{
+    error_report("The virtio-9p driver in the guest has an issue: %s", msg);
+}
+
 static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     V9fsVirtioState *v = (V9fsVirtioState *)vdev;
@@ -56,13 +62,23 @@  static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
             break;
         }
 
-        BUG_ON(elem->out_num == 0 || elem->in_num == 0);
+        if (elem->out_num == 0) {
+            virtio_9p_error("missing VirtFS request's header");
+            exit(1);
+        }
+        if (elem->in_num == 0) {
+            virtio_9p_error("missing VirtFS reply's header");
+            exit(1);
+        }
         QEMU_BUILD_BUG_ON(sizeof out != 7);
 
         v->elems[pdu->idx] = elem;
         len = iov_to_buf(elem->out_sg, elem->out_num, 0,
                          &out, sizeof out);
-        BUG_ON(len != sizeof out);
+        if (len != sizeof out) {
+            virtio_9p_error("malformed VirtFS request");
+            exit(1);
+        }
 
         pdu->size = le32_to_cpu(out.size_le);