diff mbox series

virtio-balloon: Fix page-poison subsection name

Message ID 20210914131716.102851-1-dgilbert@redhat.com (mailing list archive)
State New, archived
Headers show
Series virtio-balloon: Fix page-poison subsection name | expand

Commit Message

Dr. David Alan Gilbert Sept. 14, 2021, 1:17 p.m. UTC
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The subsection name for page-poison was typo'd as:

  vitio-balloon-device/page-poison

Note the missing 'r' in virtio.

When we have a machine type that enables page poison, and the guest
enables it (which needs a new kernel), things fail rather unpredictably.

The fallout from this is that most of the other subsections fail to
load, including things like the feature bits in the device, one
possible fallout is that the physical addresses of the queues
then get aligned differently and we fail with an error about
last_avail_idx being wrong.
It's not obvious to me why this doesn't produce a more obvious failure,
but virtio's vmstate loading is a bit open-coded.

Fixes: 7483cbbaf82 ("virtio-balloon: Implement support for page poison reporting feature")
bz: https://bugzilla.redhat.com/show_bug.cgi?id=1984401
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/virtio/virtio-balloon.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

David Hildenbrand Sept. 14, 2021, 1:21 p.m. UTC | #1
On 14.09.21 15:17, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The subsection name for page-poison was typo'd as:
> 
>    vitio-balloon-device/page-poison
> 
> Note the missing 'r' in virtio.
> 
> When we have a machine type that enables page poison, and the guest
> enables it (which needs a new kernel), things fail rather unpredictably.
> 
> The fallout from this is that most of the other subsections fail to
> load, including things like the feature bits in the device, one
> possible fallout is that the physical addresses of the queues
> then get aligned differently and we fail with an error about
> last_avail_idx being wrong.
> It's not obvious to me why this doesn't produce a more obvious failure,
> but virtio's vmstate loading is a bit open-coded.
> 
> Fixes: 7483cbbaf82 ("virtio-balloon: Implement support for page poison reporting feature")
> bz: https://bugzilla.redhat.com/show_bug.cgi?id=1984401
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>   hw/virtio/virtio-balloon.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 5a69dce35d..c6962fcbfe 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -852,7 +852,7 @@ static const VMStateDescription vmstate_virtio_balloon_free_page_hint = {
>   };
>   
>   static const VMStateDescription vmstate_virtio_balloon_page_poison = {
> -    .name = "vitio-balloon-device/page-poison",
> +    .name = "virtio-balloon-device/page-poison",
>       .version_id = 1,
>       .minimum_version_id = 1,
>       .needed = virtio_balloon_page_poison_support,
> 

Oh, that's very subtle. I wasn't even aware that the prefix really has 
to match the actual device ... I thought the whole idea of the prefix 
here was just to make the string unique ...

Reviewed-by: David Hildenbrand <david@redhat.com>

Thanks!
Dr. David Alan Gilbert Sept. 14, 2021, 1:30 p.m. UTC | #2
* David Hildenbrand (david@redhat.com) wrote:
> On 14.09.21 15:17, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > The subsection name for page-poison was typo'd as:
> > 
> >    vitio-balloon-device/page-poison
> > 
> > Note the missing 'r' in virtio.
> > 
> > When we have a machine type that enables page poison, and the guest
> > enables it (which needs a new kernel), things fail rather unpredictably.
> > 
> > The fallout from this is that most of the other subsections fail to
> > load, including things like the feature bits in the device, one
> > possible fallout is that the physical addresses of the queues
> > then get aligned differently and we fail with an error about
> > last_avail_idx being wrong.
> > It's not obvious to me why this doesn't produce a more obvious failure,
> > but virtio's vmstate loading is a bit open-coded.
> > 
> > Fixes: 7483cbbaf82 ("virtio-balloon: Implement support for page poison reporting feature")
> > bz: https://bugzilla.redhat.com/show_bug.cgi?id=1984401
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >   hw/virtio/virtio-balloon.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index 5a69dce35d..c6962fcbfe 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -852,7 +852,7 @@ static const VMStateDescription vmstate_virtio_balloon_free_page_hint = {
> >   };
> >   static const VMStateDescription vmstate_virtio_balloon_page_poison = {
> > -    .name = "vitio-balloon-device/page-poison",
> > +    .name = "virtio-balloon-device/page-poison",
> >       .version_id = 1,
> >       .minimum_version_id = 1,
> >       .needed = virtio_balloon_page_poison_support,
> > 
> 
> Oh, that's very subtle. I wasn't even aware that the prefix really has to
> match the actual device ... I thought the whole idea of the prefix here was
> just to make the string unique ...

Subsection naming is *very* critical; the logic is something like:
  'we're loading the X device'
a subsection arrives for 'N/P'
if 'X==N' then it looks in X for subsection P.
If 'X!=N' then it assumes we've finished loading X
and P is really for an outer device that X is part of.
This is horrible.

Dave

> Reviewed-by: David Hildenbrand <david@redhat.com>
> 
> Thanks!
> 
> -- 
> Thanks,
> 
> David / dhildenb
>
Philippe Mathieu-Daudé Sept. 14, 2021, 1:47 p.m. UTC | #3
On 9/14/21 3:17 PM, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The subsection name for page-poison was typo'd as:
> 
>   vitio-balloon-device/page-poison
> 
> Note the missing 'r' in virtio.
> 
> When we have a machine type that enables page poison, and the guest
> enables it (which needs a new kernel), things fail rather unpredictably.

IIUC since v5.1 guests have 'page-poison'=true but once migrated
they become 'page-poison'=unset=false?

> The fallout from this is that most of the other subsections fail to
> load, including things like the feature bits in the device, one
> possible fallout is that the physical addresses of the queues
> then get aligned differently and we fail with an error about
> last_avail_idx being wrong.
> It's not obvious to me why this doesn't produce a more obvious failure,
> but virtio's vmstate loading is a bit open-coded.

:/

> Fixes: 7483cbbaf82 ("virtio-balloon: Implement support for page poison reporting feature")
> bz: https://bugzilla.redhat.com/show_bug.cgi?id=1984401
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/virtio/virtio-balloon.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 5a69dce35d..c6962fcbfe 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -852,7 +852,7 @@ static const VMStateDescription vmstate_virtio_balloon_free_page_hint = {
>  };
>  
>  static const VMStateDescription vmstate_virtio_balloon_page_poison = {
> -    .name = "vitio-balloon-device/page-poison",
> +    .name = "virtio-balloon-device/page-poison",
>      .version_id = 1,
>      .minimum_version_id = 1,
>      .needed = virtio_balloon_page_poison_support,
>
David Hildenbrand Sept. 14, 2021, 1:51 p.m. UTC | #4
On 14.09.21 15:47, Philippe Mathieu-Daudé wrote:
> On 9/14/21 3:17 PM, Dr. David Alan Gilbert (git) wrote:
>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>
>> The subsection name for page-poison was typo'd as:
>>
>>    vitio-balloon-device/page-poison
>>
>> Note the missing 'r' in virtio.
>>
>> When we have a machine type that enables page poison, and the guest
>> enables it (which needs a new kernel), things fail rather unpredictably.
> 
> IIUC since v5.1 guests have 'page-poison'=true but once migrated
> they become 'page-poison'=unset=false?

We only migrate the subsection if the guest supports the feature (-> 
newer guest kernel).

If we migrate, it's a broken migration stream. If we don't migrate, 
everything is fine.
Daniel P. Berrangé Sept. 14, 2021, 1:58 p.m. UTC | #5
On Tue, Sep 14, 2021 at 02:30:06PM +0100, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
> > On 14.09.21 15:17, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > The subsection name for page-poison was typo'd as:
> > > 
> > >    vitio-balloon-device/page-poison
> > > 
> > > Note the missing 'r' in virtio.
> > > 
> > > When we have a machine type that enables page poison, and the guest
> > > enables it (which needs a new kernel), things fail rather unpredictably.
> > > 
> > > The fallout from this is that most of the other subsections fail to
> > > load, including things like the feature bits in the device, one
> > > possible fallout is that the physical addresses of the queues
> > > then get aligned differently and we fail with an error about
> > > last_avail_idx being wrong.
> > > It's not obvious to me why this doesn't produce a more obvious failure,
> > > but virtio's vmstate loading is a bit open-coded.
> > > 
> > > Fixes: 7483cbbaf82 ("virtio-balloon: Implement support for page poison reporting feature")
> > > bz: https://bugzilla.redhat.com/show_bug.cgi?id=1984401
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >   hw/virtio/virtio-balloon.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > > index 5a69dce35d..c6962fcbfe 100644
> > > --- a/hw/virtio/virtio-balloon.c
> > > +++ b/hw/virtio/virtio-balloon.c
> > > @@ -852,7 +852,7 @@ static const VMStateDescription vmstate_virtio_balloon_free_page_hint = {
> > >   };
> > >   static const VMStateDescription vmstate_virtio_balloon_page_poison = {
> > > -    .name = "vitio-balloon-device/page-poison",
> > > +    .name = "virtio-balloon-device/page-poison",
> > >       .version_id = 1,
> > >       .minimum_version_id = 1,
> > >       .needed = virtio_balloon_page_poison_support,
> > > 
> > 
> > Oh, that's very subtle. I wasn't even aware that the prefix really has to
> > match the actual device ... I thought the whole idea of the prefix here was
> > just to make the string unique ...
> 
> Subsection naming is *very* critical; the logic is something like:
>   'we're loading the X device'
> a subsection arrives for 'N/P'
> if 'X==N' then it looks in X for subsection P.
> If 'X!=N' then it assumes we've finished loading X
> and P is really for an outer device that X is part of.
> This is horrible.

Is there value in making this more explicit via a code convention
for .name field initializers. eg instead of

   .name = "virtio-balloon-device/page-poison",

Prefer

   .name = TYPE_VIRTIO_BALLOON "/page-poison"

?

Regards,
Daniel
Dr. David Alan Gilbert Sept. 14, 2021, 2:16 p.m. UTC | #6
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, Sep 14, 2021 at 02:30:06PM +0100, Dr. David Alan Gilbert wrote:
> > * David Hildenbrand (david@redhat.com) wrote:
> > > On 14.09.21 15:17, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > The subsection name for page-poison was typo'd as:
> > > > 
> > > >    vitio-balloon-device/page-poison
> > > > 
> > > > Note the missing 'r' in virtio.
> > > > 
> > > > When we have a machine type that enables page poison, and the guest
> > > > enables it (which needs a new kernel), things fail rather unpredictably.
> > > > 
> > > > The fallout from this is that most of the other subsections fail to
> > > > load, including things like the feature bits in the device, one
> > > > possible fallout is that the physical addresses of the queues
> > > > then get aligned differently and we fail with an error about
> > > > last_avail_idx being wrong.
> > > > It's not obvious to me why this doesn't produce a more obvious failure,
> > > > but virtio's vmstate loading is a bit open-coded.
> > > > 
> > > > Fixes: 7483cbbaf82 ("virtio-balloon: Implement support for page poison reporting feature")
> > > > bz: https://bugzilla.redhat.com/show_bug.cgi?id=1984401
> > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > ---
> > > >   hw/virtio/virtio-balloon.c | 2 +-
> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > > > index 5a69dce35d..c6962fcbfe 100644
> > > > --- a/hw/virtio/virtio-balloon.c
> > > > +++ b/hw/virtio/virtio-balloon.c
> > > > @@ -852,7 +852,7 @@ static const VMStateDescription vmstate_virtio_balloon_free_page_hint = {
> > > >   };
> > > >   static const VMStateDescription vmstate_virtio_balloon_page_poison = {
> > > > -    .name = "vitio-balloon-device/page-poison",
> > > > +    .name = "virtio-balloon-device/page-poison",
> > > >       .version_id = 1,
> > > >       .minimum_version_id = 1,
> > > >       .needed = virtio_balloon_page_poison_support,
> > > > 
> > > 
> > > Oh, that's very subtle. I wasn't even aware that the prefix really has to
> > > match the actual device ... I thought the whole idea of the prefix here was
> > > just to make the string unique ...
> > 
> > Subsection naming is *very* critical; the logic is something like:
> >   'we're loading the X device'
> > a subsection arrives for 'N/P'
> > if 'X==N' then it looks in X for subsection P.
> > If 'X!=N' then it assumes we've finished loading X
> > and P is really for an outer device that X is part of.
> > This is horrible.
> 
> Is there value in making this more explicit via a code convention
> for .name field initializers. eg instead of
> 
>    .name = "virtio-balloon-device/page-poison",
> 
> Prefer
> 
>    .name = TYPE_VIRTIO_BALLOON "/page-poison"
> 
> ?

I think it might be better to see if the vmstate code can check it and
error during saving; certainly this case feels detectable, but I need
to keep an eye open for all the other weird cases.

Dave

> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Philippe Mathieu-Daudé Sept. 14, 2021, 2:58 p.m. UTC | #7
On 9/14/21 3:58 PM, Daniel P. Berrangé wrote:
> On Tue, Sep 14, 2021 at 02:30:06PM +0100, Dr. David Alan Gilbert wrote:
>> * David Hildenbrand (david@redhat.com) wrote:
>>> On 14.09.21 15:17, Dr. David Alan Gilbert (git) wrote:
>>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>>>
>>>> The subsection name for page-poison was typo'd as:
>>>>
>>>>    vitio-balloon-device/page-poison
>>>>
>>>> Note the missing 'r' in virtio.
>>>>
>>>> When we have a machine type that enables page poison, and the guest
>>>> enables it (which needs a new kernel), things fail rather unpredictably.
>>>>
>>>> The fallout from this is that most of the other subsections fail to
>>>> load, including things like the feature bits in the device, one
>>>> possible fallout is that the physical addresses of the queues
>>>> then get aligned differently and we fail with an error about
>>>> last_avail_idx being wrong.
>>>> It's not obvious to me why this doesn't produce a more obvious failure,
>>>> but virtio's vmstate loading is a bit open-coded.
>>>>
>>>> Fixes: 7483cbbaf82 ("virtio-balloon: Implement support for page poison reporting feature")
>>>> bz: https://bugzilla.redhat.com/show_bug.cgi?id=1984401
>>>> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>> ---
>>>>   hw/virtio/virtio-balloon.c | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>>>> index 5a69dce35d..c6962fcbfe 100644
>>>> --- a/hw/virtio/virtio-balloon.c
>>>> +++ b/hw/virtio/virtio-balloon.c
>>>> @@ -852,7 +852,7 @@ static const VMStateDescription vmstate_virtio_balloon_free_page_hint = {
>>>>   };
>>>>   static const VMStateDescription vmstate_virtio_balloon_page_poison = {
>>>> -    .name = "vitio-balloon-device/page-poison",
>>>> +    .name = "virtio-balloon-device/page-poison",
>>>>       .version_id = 1,
>>>>       .minimum_version_id = 1,
>>>>       .needed = virtio_balloon_page_poison_support,
>>>>
>>>
>>> Oh, that's very subtle. I wasn't even aware that the prefix really has to
>>> match the actual device ... I thought the whole idea of the prefix here was
>>> just to make the string unique ...
>>
>> Subsection naming is *very* critical; the logic is something like:
>>   'we're loading the X device'
>> a subsection arrives for 'N/P'
>> if 'X==N' then it looks in X for subsection P.
>> If 'X!=N' then it assumes we've finished loading X
>> and P is really for an outer device that X is part of.
>> This is horrible.
> 
> Is there value in making this more explicit via a code convention
> for .name field initializers. eg instead of
> 
>    .name = "virtio-balloon-device/page-poison",
> 
> Prefer
> 
>    .name = TYPE_VIRTIO_BALLOON "/page-poison"
> 
> ?

IIUC so far only user-creatable devices are required to have
a stable name in the TYPE definition (because CLI must stay
stable). Which is why this type is not recommended for
migration section names.
Dr. David Alan Gilbert Sept. 15, 2021, 8:36 a.m. UTC | #8
Note this also correspounds to:

https://gitlab.com/qemu-project/qemu/-/issues/485
diff mbox series

Patch

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 5a69dce35d..c6962fcbfe 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -852,7 +852,7 @@  static const VMStateDescription vmstate_virtio_balloon_free_page_hint = {
 };
 
 static const VMStateDescription vmstate_virtio_balloon_page_poison = {
-    .name = "vitio-balloon-device/page-poison",
+    .name = "virtio-balloon-device/page-poison",
     .version_id = 1,
     .minimum_version_id = 1,
     .needed = virtio_balloon_page_poison_support,