diff mbox series

spapr: Migrate CAS reboot flag

Message ID 157911051688.345768.16136592081655557565.stgit@bahia.lan (mailing list archive)
State New, archived
Headers show
Series spapr: Migrate CAS reboot flag | expand

Commit Message

Greg Kurz Jan. 15, 2020, 5:48 p.m. UTC
Migration can potentially race with CAS reboot. If the migration thread
completes migration after CAS has set spapr->cas_reboot but before the
mainloop could pick up the reset request and reset the machine, the
guest is migrated unrebooted and the destination doesn't reboot it
either because it isn't aware a CAS reboot was needed (eg, because a
device was added before CAS). This likely result in a broken or hung
guest.

Even if it is small, the window between CAS and CAS reboot is enough to
re-qualify spapr->cas_reboot as state that we should migrate. Add a new
subsection for that and always send it when a CAS reboot is pending.
This may cause migration to older QEMUs to fail but it is still better
than end up with a broken guest.

The destination cannot honour the CAS reboot request from a post load
handler because this must be done after the guest is fully restored.
It is thus done from a VM change state handler.

Reported-by: Lukáš Doktor <ldoktor@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
---

This patch is supposed to fix the interrupt controller mode inconsistency
between QEMU and the guest reported in this BZ:

https://bugzilla.redhat.com/show_bug.cgi?id=1781315 (requires auth)

Even if interrupt controller selection doesn't involve CAS reboot anymore,
we still have other conditions that require CAS reboot.
---
 hw/ppc/spapr.c |   43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

Comments

Laurent Vivier Jan. 15, 2020, 6:10 p.m. UTC | #1
Hi,

On 15/01/2020 18:48, Greg Kurz wrote:
> Migration can potentially race with CAS reboot. If the migration thread
> completes migration after CAS has set spapr->cas_reboot but before the
> mainloop could pick up the reset request and reset the machine, the
> guest is migrated unrebooted and the destination doesn't reboot it
> either because it isn't aware a CAS reboot was needed (eg, because a
> device was added before CAS). This likely result in a broken or hung
> guest.
> 
> Even if it is small, the window between CAS and CAS reboot is enough to
> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> subsection for that and always send it when a CAS reboot is pending.
> This may cause migration to older QEMUs to fail but it is still better
> than end up with a broken guest.
> 
> The destination cannot honour the CAS reboot request from a post load
> handler because this must be done after the guest is fully restored.
> It is thus done from a VM change state handler.
> 
> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> Signed-off-by: Greg Kurz <groug@kaod.org>
> ---
> 

I'm wondering if the problem can be related with the fact that
main_loop_should_exit() could release qemu_global_mutex in
pause_all_vcpus() in the reset case?

1602 static bool main_loop_should_exit(void)
1603 {
...
1633     request = qemu_reset_requested();
1634     if (request) {
1635         pause_all_vcpus();
1636         qemu_system_reset(request);
1637         resume_all_vcpus();
1638         if (!runstate_check(RUN_STATE_RUNNING) &&
1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
1640             runstate_set(RUN_STATE_PRELAUNCH);
1641         }
1642     }
...

I already sent a patch for this kind of problem (in current Juan pull
request):

"runstate: ignore finishmigrate -> prelaunch transition"

but I don't know if it could fix this one.

Thanks,
Laurent
Cédric Le Goater Jan. 15, 2020, 6:10 p.m. UTC | #2
On 1/15/20 6:48 PM, Greg Kurz wrote:
> Migration can potentially race with CAS reboot. If the migration thread
> completes migration after CAS has set spapr->cas_reboot but before the
> mainloop could pick up the reset request and reset the machine, the
> guest is migrated unrebooted and the destination doesn't reboot it
> either because it isn't aware a CAS reboot was needed (eg, because a
> device was added before CAS). This likely result in a broken or hung
> guest.
> 
> Even if it is small, the window between CAS and CAS reboot is enough to
> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> subsection for that and always send it when a CAS reboot is pending.
> This may cause migration to older QEMUs to fail but it is still better
> than end up with a broken guest.
> 
> The destination cannot honour the CAS reboot request from a post load
> handler because this must be done after the guest is fully restored.
> It is thus done from a VM change state handler.
> 
> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> Signed-off-by: Greg Kurz <groug@kaod.org>

Cédric Le Goater <clg@kaod.org>

Nice work ! That was quite complex to catch !

Thanks,

C.

> ---
> 
> This patch is supposed to fix the interrupt controller mode inconsistency
> between QEMU and the guest reported in this BZ:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1781315 (requires auth)
> 
> Even if interrupt controller selection doesn't involve CAS reboot anymore,
> we still have other conditions that require CAS reboot.
> ---
>  hw/ppc/spapr.c |   43 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 30a5fbd3bea6..bf2763aa16e5 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1959,6 +1959,31 @@ static const VMStateDescription vmstate_spapr_dtb = {
>      },
>  };
>  
> +static bool spapr_cas_reboot_needed(void *opaque)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(opaque);
> +
> +    /*
> +     * This causes the "spapr_cas_reboot" subsection to always be
> +     * sent if migration raced with CAS. This causes older QEMUs
> +     * that don't know about the subsection to fail migration but
> +     * it is still better than end up with a broken guest on the
> +     * destination.
> +     */
> +    return spapr->cas_reboot;
> +}
> +
> +static const VMStateDescription vmstate_spapr_cas_reboot = {
> +    .name = "spapr_cas_reboot",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = spapr_cas_reboot_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_BOOL(cas_reboot, SpaprMachineState),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
>  static const VMStateDescription vmstate_spapr = {
>      .name = "spapr",
>      .version_id = 3,
> @@ -1992,6 +2017,7 @@ static const VMStateDescription vmstate_spapr = {
>          &vmstate_spapr_dtb,
>          &vmstate_spapr_cap_large_decr,
>          &vmstate_spapr_cap_ccf_assist,
> +        &vmstate_spapr_cas_reboot,
>          NULL
>      }
>  };
> @@ -2577,6 +2603,21 @@ static PCIHostState *spapr_create_default_phb(void)
>      return PCI_HOST_BRIDGE(dev);
>  }
>  
> +static void spapr_change_state_handler(void *opaque, int running,
> +                                       RunState state)
> +{
> +    SpaprMachineState *spapr = opaque;
> +
> +    if (running && spapr->cas_reboot) {
> +        /*
> +         * This happens when resuming from migration if the source
> +         * processed a CAS but didn't have time to trigger the CAS
> +         * reboot. Do it now.
> +         */
> +        qemu_system_reset_request(SHUTDOWN_CAUSE_SUBSYSTEM_RESET);
> +    }
> +}
> +
>  /* pSeries LPAR / sPAPR hardware init */
>  static void spapr_machine_init(MachineState *machine)
>  {
> @@ -2970,6 +3011,8 @@ static void spapr_machine_init(MachineState *machine)
>  
>          kvmppc_spapr_enable_inkernel_multitce();
>      }
> +
> +    qemu_add_vm_change_state_handler(spapr_change_state_handler, spapr);
>  }
>  
>  static int spapr_kvm_type(MachineState *machine, const char *vm_type)
>
Laurent Vivier Jan. 15, 2020, 6:26 p.m. UTC | #3
On 15/01/2020 19:10, Laurent Vivier wrote:
> Hi,
> 
> On 15/01/2020 18:48, Greg Kurz wrote:
>> Migration can potentially race with CAS reboot. If the migration thread
>> completes migration after CAS has set spapr->cas_reboot but before the
>> mainloop could pick up the reset request and reset the machine, the
>> guest is migrated unrebooted and the destination doesn't reboot it
>> either because it isn't aware a CAS reboot was needed (eg, because a
>> device was added before CAS). This likely result in a broken or hung
>> guest.
>>
>> Even if it is small, the window between CAS and CAS reboot is enough to
>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
>> subsection for that and always send it when a CAS reboot is pending.
>> This may cause migration to older QEMUs to fail but it is still better
>> than end up with a broken guest.
>>
>> The destination cannot honour the CAS reboot request from a post load
>> handler because this must be done after the guest is fully restored.
>> It is thus done from a VM change state handler.
>>
>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
>> Signed-off-by: Greg Kurz <groug@kaod.org>
>> ---
>>
> 
> I'm wondering if the problem can be related with the fact that
> main_loop_should_exit() could release qemu_global_mutex in
> pause_all_vcpus() in the reset case?
> 
> 1602 static bool main_loop_should_exit(void)
> 1603 {
> ...
> 1633     request = qemu_reset_requested();
> 1634     if (request) {
> 1635         pause_all_vcpus();
> 1636         qemu_system_reset(request);
> 1637         resume_all_vcpus();
> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> 1641         }
> 1642     }
> ...
> 
> I already sent a patch for this kind of problem (in current Juan pull
> request):
> 
> "runstate: ignore finishmigrate -> prelaunch transition"
> 
> but I don't know if it could fix this one.

I think it should be interesting to have the state transition on source
and destination when the problem occurs (with something like "-trace
runstate_set").

Thanks,
Laurent
Greg Kurz Jan. 16, 2020, 8:48 a.m. UTC | #4
On Wed, 15 Jan 2020 19:10:37 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> Hi,
> 
> On 15/01/2020 18:48, Greg Kurz wrote:
> > Migration can potentially race with CAS reboot. If the migration thread
> > completes migration after CAS has set spapr->cas_reboot but before the
> > mainloop could pick up the reset request and reset the machine, the
> > guest is migrated unrebooted and the destination doesn't reboot it
> > either because it isn't aware a CAS reboot was needed (eg, because a
> > device was added before CAS). This likely result in a broken or hung
> > guest.
> > 
> > Even if it is small, the window between CAS and CAS reboot is enough to
> > re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > subsection for that and always send it when a CAS reboot is pending.
> > This may cause migration to older QEMUs to fail but it is still better
> > than end up with a broken guest.
> > 
> > The destination cannot honour the CAS reboot request from a post load
> > handler because this must be done after the guest is fully restored.
> > It is thus done from a VM change state handler.
> > 
> > Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> > Signed-off-by: Greg Kurz <groug@kaod.org>
> > ---
> > 
> 
> I'm wondering if the problem can be related with the fact that
> main_loop_should_exit() could release qemu_global_mutex in
> pause_all_vcpus() in the reset case?
> 
> 1602 static bool main_loop_should_exit(void)
> 1603 {
> ...
> 1633     request = qemu_reset_requested();
> 1634     if (request) {
> 1635         pause_all_vcpus();
> 1636         qemu_system_reset(request);
> 1637         resume_all_vcpus();
> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> 1641         }
> 1642     }
> ...
> 
> I already sent a patch for this kind of problem (in current Juan pull
> request):
> 
> "runstate: ignore finishmigrate -> prelaunch transition"
> 

IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
transition that can happen if the migration thread sets the runstate to
'finishmigrate' when pause_all_vcpus() releases the main loop mutex.

ie. symptom of the problem is QEMU aborting, correct ? The issue I'm
trying to fix is a guest breakage caused by a discrepancy between
QEMU and the guest after migration has succeeded.

> but I don't know if it could fix this one.
> 

I don't think so and your patch kinda illustrates it. If the runstate
is 'finishmigrate' when returning from pause_all_vcpus(), this means
that state was sent to the destination before we could actually reset
the machine.

> Thanks,
> Laurent
>
Laurent Vivier Jan. 16, 2020, 10:37 a.m. UTC | #5
On 16/01/2020 09:48, Greg Kurz wrote:
> On Wed, 15 Jan 2020 19:10:37 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
>> Hi,
>>
>> On 15/01/2020 18:48, Greg Kurz wrote:
>>> Migration can potentially race with CAS reboot. If the migration thread
>>> completes migration after CAS has set spapr->cas_reboot but before the
>>> mainloop could pick up the reset request and reset the machine, the
>>> guest is migrated unrebooted and the destination doesn't reboot it
>>> either because it isn't aware a CAS reboot was needed (eg, because a
>>> device was added before CAS). This likely result in a broken or hung
>>> guest.
>>>
>>> Even if it is small, the window between CAS and CAS reboot is enough to
>>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
>>> subsection for that and always send it when a CAS reboot is pending.
>>> This may cause migration to older QEMUs to fail but it is still better
>>> than end up with a broken guest.
>>>
>>> The destination cannot honour the CAS reboot request from a post load
>>> handler because this must be done after the guest is fully restored.
>>> It is thus done from a VM change state handler.
>>>
>>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
>>> Signed-off-by: Greg Kurz <groug@kaod.org>
>>> ---
>>>
>>
>> I'm wondering if the problem can be related with the fact that
>> main_loop_should_exit() could release qemu_global_mutex in
>> pause_all_vcpus() in the reset case?
>>
>> 1602 static bool main_loop_should_exit(void)
>> 1603 {
>> ...
>> 1633     request = qemu_reset_requested();
>> 1634     if (request) {
>> 1635         pause_all_vcpus();
>> 1636         qemu_system_reset(request);
>> 1637         resume_all_vcpus();
>> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
>> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
>> 1640             runstate_set(RUN_STATE_PRELAUNCH);
>> 1641         }
>> 1642     }
>> ...
>>
>> I already sent a patch for this kind of problem (in current Juan pull
>> request):
>>
>> "runstate: ignore finishmigrate -> prelaunch transition"
>>
> 
> IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
> transition that can happen if the migration thread sets the runstate to
> 'finishmigrate' when pause_all_vcpus() releases the main loop mutex.
> 
> ie. symptom of the problem is QEMU aborting, correct ? The issue I'm
> trying to fix is a guest breakage caused by a discrepancy between
> QEMU and the guest after migration has succeeded.
> 
>> but I don't know if it could fix this one.
>>
> 
> I don't think so and your patch kinda illustrates it. If the runstate
> is 'finishmigrate' when returning from pause_all_vcpus(), this means
> that state was sent to the destination before we could actually reset
> the machine.

Yes, you're right.

But the question behind my comment was: is it expected to have a pending
reset while we are migrating?

Perhaps H_CAS can return H_BUSY and wait the end of the migration and
then be fully executed on destination?

Thanks,
Laurent
Greg Kurz Jan. 16, 2020, 12:14 p.m. UTC | #6
On Thu, 16 Jan 2020 11:37:24 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 16/01/2020 09:48, Greg Kurz wrote:
> > On Wed, 15 Jan 2020 19:10:37 +0100
> > Laurent Vivier <lvivier@redhat.com> wrote:
> > 
> >> Hi,
> >>
> >> On 15/01/2020 18:48, Greg Kurz wrote:
> >>> Migration can potentially race with CAS reboot. If the migration thread
> >>> completes migration after CAS has set spapr->cas_reboot but before the
> >>> mainloop could pick up the reset request and reset the machine, the
> >>> guest is migrated unrebooted and the destination doesn't reboot it
> >>> either because it isn't aware a CAS reboot was needed (eg, because a
> >>> device was added before CAS). This likely result in a broken or hung
> >>> guest.
> >>>
> >>> Even if it is small, the window between CAS and CAS reboot is enough to
> >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> >>> subsection for that and always send it when a CAS reboot is pending.
> >>> This may cause migration to older QEMUs to fail but it is still better
> >>> than end up with a broken guest.
> >>>
> >>> The destination cannot honour the CAS reboot request from a post load
> >>> handler because this must be done after the guest is fully restored.
> >>> It is thus done from a VM change state handler.
> >>>
> >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> >>> Signed-off-by: Greg Kurz <groug@kaod.org>
> >>> ---
> >>>
> >>
> >> I'm wondering if the problem can be related with the fact that
> >> main_loop_should_exit() could release qemu_global_mutex in
> >> pause_all_vcpus() in the reset case?
> >>
> >> 1602 static bool main_loop_should_exit(void)
> >> 1603 {
> >> ...
> >> 1633     request = qemu_reset_requested();
> >> 1634     if (request) {
> >> 1635         pause_all_vcpus();
> >> 1636         qemu_system_reset(request);
> >> 1637         resume_all_vcpus();
> >> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> >> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> >> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> >> 1641         }
> >> 1642     }
> >> ...
> >>
> >> I already sent a patch for this kind of problem (in current Juan pull
> >> request):
> >>
> >> "runstate: ignore finishmigrate -> prelaunch transition"
> >>
> > 
> > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
> > transition that can happen if the migration thread sets the runstate to
> > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex.
> > 
> > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm
> > trying to fix is a guest breakage caused by a discrepancy between
> > QEMU and the guest after migration has succeeded.
> > 
> >> but I don't know if it could fix this one.
> >>
> > 
> > I don't think so and your patch kinda illustrates it. If the runstate
> > is 'finishmigrate' when returning from pause_all_vcpus(), this means
> > that state was sent to the destination before we could actually reset
> > the machine.
> 
> Yes, you're right.
> 
> But the question behind my comment was: is it expected to have a pending
> reset while we are migrating?
> 

Nothing prevents qemu_system_reset_request() to be called when migration
is active. 

> Perhaps H_CAS can return H_BUSY and wait the end of the migration and
> then be fully executed on destination?
> 

And so we would need to teach SLOF to try H_CAS again until it stops
returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO.

> Thanks,
> Laurent
>
Greg Kurz Jan. 16, 2020, 6:29 p.m. UTC | #7
On Thu, 16 Jan 2020 13:14:35 +0100
Greg Kurz <groug@kaod.org> wrote:

> On Thu, 16 Jan 2020 11:37:24 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
> > On 16/01/2020 09:48, Greg Kurz wrote:
> > > On Wed, 15 Jan 2020 19:10:37 +0100
> > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > 
> > >> Hi,
> > >>
> > >> On 15/01/2020 18:48, Greg Kurz wrote:
> > >>> Migration can potentially race with CAS reboot. If the migration thread
> > >>> completes migration after CAS has set spapr->cas_reboot but before the
> > >>> mainloop could pick up the reset request and reset the machine, the
> > >>> guest is migrated unrebooted and the destination doesn't reboot it
> > >>> either because it isn't aware a CAS reboot was needed (eg, because a
> > >>> device was added before CAS). This likely result in a broken or hung
> > >>> guest.
> > >>>
> > >>> Even if it is small, the window between CAS and CAS reboot is enough to
> > >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > >>> subsection for that and always send it when a CAS reboot is pending.
> > >>> This may cause migration to older QEMUs to fail but it is still better
> > >>> than end up with a broken guest.
> > >>>
> > >>> The destination cannot honour the CAS reboot request from a post load
> > >>> handler because this must be done after the guest is fully restored.
> > >>> It is thus done from a VM change state handler.
> > >>>
> > >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> > >>> Signed-off-by: Greg Kurz <groug@kaod.org>
> > >>> ---
> > >>>
> > >>
> > >> I'm wondering if the problem can be related with the fact that
> > >> main_loop_should_exit() could release qemu_global_mutex in
> > >> pause_all_vcpus() in the reset case?
> > >>
> > >> 1602 static bool main_loop_should_exit(void)
> > >> 1603 {
> > >> ...
> > >> 1633     request = qemu_reset_requested();
> > >> 1634     if (request) {
> > >> 1635         pause_all_vcpus();
> > >> 1636         qemu_system_reset(request);
> > >> 1637         resume_all_vcpus();
> > >> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> > >> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> > >> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> > >> 1641         }
> > >> 1642     }
> > >> ...
> > >>
> > >> I already sent a patch for this kind of problem (in current Juan pull
> > >> request):
> > >>
> > >> "runstate: ignore finishmigrate -> prelaunch transition"
> > >>
> > > 
> > > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
> > > transition that can happen if the migration thread sets the runstate to
> > > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex.
> > > 
> > > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm
> > > trying to fix is a guest breakage caused by a discrepancy between
> > > QEMU and the guest after migration has succeeded.
> > > 
> > >> but I don't know if it could fix this one.
> > >>
> > > 
> > > I don't think so and your patch kinda illustrates it. If the runstate
> > > is 'finishmigrate' when returning from pause_all_vcpus(), this means
> > > that state was sent to the destination before we could actually reset
> > > the machine.
> > 
> > Yes, you're right.
> > 
> > But the question behind my comment was: is it expected to have a pending
> > reset while we are migrating?
> > 
> 
> Nothing prevents qemu_system_reset_request() to be called when migration
> is active. 
> 
> > Perhaps H_CAS can return H_BUSY and wait the end of the migration and
> > then be fully executed on destination?
> > 
> 
> And so we would need to teach SLOF to try H_CAS again until it stops
> returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO.
> 

Ok I've tried that with a patched SLOF that sleeps 500ms and tries CAS
again if H_BUSY was returned. It fixes the issue but it looks a bit
ugly because of the polling with an arbitrary timeout in SLOF... I'm
not very comfortable either with calling migration_is_active() from
the CAS code in QEMU.

David,

Any suggestion ?

> > Thanks,
> > Laurent
> > 
>
David Gibson Jan. 17, 2020, 9:16 a.m. UTC | #8
On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote:
> On Thu, 16 Jan 2020 13:14:35 +0100
> Greg Kurz <groug@kaod.org> wrote:
> 
> > On Thu, 16 Jan 2020 11:37:24 +0100
> > Laurent Vivier <lvivier@redhat.com> wrote:
> > 
> > > On 16/01/2020 09:48, Greg Kurz wrote:
> > > > On Wed, 15 Jan 2020 19:10:37 +0100
> > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > 
> > > >> Hi,
> > > >>
> > > >> On 15/01/2020 18:48, Greg Kurz wrote:
> > > >>> Migration can potentially race with CAS reboot. If the migration thread
> > > >>> completes migration after CAS has set spapr->cas_reboot but before the
> > > >>> mainloop could pick up the reset request and reset the machine, the
> > > >>> guest is migrated unrebooted and the destination doesn't reboot it
> > > >>> either because it isn't aware a CAS reboot was needed (eg, because a
> > > >>> device was added before CAS). This likely result in a broken or hung
> > > >>> guest.
> > > >>>
> > > >>> Even if it is small, the window between CAS and CAS reboot is enough to
> > > >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > > >>> subsection for that and always send it when a CAS reboot is pending.
> > > >>> This may cause migration to older QEMUs to fail but it is still better
> > > >>> than end up with a broken guest.
> > > >>>
> > > >>> The destination cannot honour the CAS reboot request from a post load
> > > >>> handler because this must be done after the guest is fully restored.
> > > >>> It is thus done from a VM change state handler.
> > > >>>
> > > >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> > > >>> Signed-off-by: Greg Kurz <groug@kaod.org>
> > > >>> ---
> > > >>>
> > > >>
> > > >> I'm wondering if the problem can be related with the fact that
> > > >> main_loop_should_exit() could release qemu_global_mutex in
> > > >> pause_all_vcpus() in the reset case?
> > > >>
> > > >> 1602 static bool main_loop_should_exit(void)
> > > >> 1603 {
> > > >> ...
> > > >> 1633     request = qemu_reset_requested();
> > > >> 1634     if (request) {
> > > >> 1635         pause_all_vcpus();
> > > >> 1636         qemu_system_reset(request);
> > > >> 1637         resume_all_vcpus();
> > > >> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> > > >> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> > > >> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> > > >> 1641         }
> > > >> 1642     }
> > > >> ...
> > > >>
> > > >> I already sent a patch for this kind of problem (in current Juan pull
> > > >> request):
> > > >>
> > > >> "runstate: ignore finishmigrate -> prelaunch transition"
> > > >>
> > > > 
> > > > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
> > > > transition that can happen if the migration thread sets the runstate to
> > > > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex.
> > > > 
> > > > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm
> > > > trying to fix is a guest breakage caused by a discrepancy between
> > > > QEMU and the guest after migration has succeeded.
> > > > 
> > > >> but I don't know if it could fix this one.
> > > >>
> > > > 
> > > > I don't think so and your patch kinda illustrates it. If the runstate
> > > > is 'finishmigrate' when returning from pause_all_vcpus(), this means
> > > > that state was sent to the destination before we could actually reset
> > > > the machine.
> > > 
> > > Yes, you're right.
> > > 
> > > But the question behind my comment was: is it expected to have a pending
> > > reset while we are migrating?
> > > 
> > 
> > Nothing prevents qemu_system_reset_request() to be called when migration
> > is active. 
> > 
> > > Perhaps H_CAS can return H_BUSY and wait the end of the migration and
> > > then be fully executed on destination?
> > > 
> > 
> > And so we would need to teach SLOF to try H_CAS again until it stops
> > returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO.
> > 
> 
> Ok I've tried that with a patched SLOF that sleeps 500ms and tries CAS
> again if H_BUSY was returned. It fixes the issue but it looks a bit
> ugly because of the polling with an arbitrary timeout in SLOF... I'm
> not very comfortable either with calling migration_is_active() from
> the CAS code in QEMU.
> 
> David,
> 
> Any suggestion ?

Yeah, I think looping in SLOF is a worse idea than migrating the
cas_reboot flag.

But.. a better solution still might be to just remove the remaining
causes for CAS reboot entirely.  CAS reboots pretty much suck when
they happen, anyway.

With the irq changeover condition removed, I think the remaining
causes are more theoretical than practical situations at this point.
Greg Kurz Jan. 17, 2020, 11:49 a.m. UTC | #9
On Wed, 15 Jan 2020 19:26:18 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 15/01/2020 19:10, Laurent Vivier wrote:
> > Hi,
> > 
> > On 15/01/2020 18:48, Greg Kurz wrote:
> >> Migration can potentially race with CAS reboot. If the migration thread
> >> completes migration after CAS has set spapr->cas_reboot but before the
> >> mainloop could pick up the reset request and reset the machine, the
> >> guest is migrated unrebooted and the destination doesn't reboot it
> >> either because it isn't aware a CAS reboot was needed (eg, because a
> >> device was added before CAS). This likely result in a broken or hung
> >> guest.
> >>
> >> Even if it is small, the window between CAS and CAS reboot is enough to
> >> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> >> subsection for that and always send it when a CAS reboot is pending.
> >> This may cause migration to older QEMUs to fail but it is still better
> >> than end up with a broken guest.
> >>
> >> The destination cannot honour the CAS reboot request from a post load
> >> handler because this must be done after the guest is fully restored.
> >> It is thus done from a VM change state handler.
> >>
> >> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> >> Signed-off-by: Greg Kurz <groug@kaod.org>
> >> ---
> >>
> > 
> > I'm wondering if the problem can be related with the fact that
> > main_loop_should_exit() could release qemu_global_mutex in
> > pause_all_vcpus() in the reset case?
> > 
> > 1602 static bool main_loop_should_exit(void)
> > 1603 {
> > ...
> > 1633     request = qemu_reset_requested();
> > 1634     if (request) {
> > 1635         pause_all_vcpus();
> > 1636         qemu_system_reset(request);
> > 1637         resume_all_vcpus();
> > 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> > 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> > 1640             runstate_set(RUN_STATE_PRELAUNCH);
> > 1641         }
> > 1642     }
> > ...
> > 
> > I already sent a patch for this kind of problem (in current Juan pull
> > request):
> > 
> > "runstate: ignore finishmigrate -> prelaunch transition"
> > 
> > but I don't know if it could fix this one.
> 
> I think it should be interesting to have the state transition on source
> and destination when the problem occurs (with something like "-trace
> runstate_set").
> 

With "-serial mon:stdio -trace runstate_set -trace -trace guest_cpu_reset" :

OF stdout device is: /vdevice/vty@71000000
Preparing to boot Linux version 4.18.0-80.el8.ppc64le (mockbuild@ppc-061.build.eng.bos.redhat.com) (gcc version 8.2.1 20180905 (Red Hat 8.2.1-3) (GCC)) #1 SMP Wed Mar 13 11:26:21 UTC 2019
Detected machine type: 0000000000000101
command line: BOOT_IMAGE=/boot/vmlinuz-4.18.0-80.el8.ppc64le root=UUID=012b83a5-2594-48ac-b936-12fec7cdbb9a ro console=ttyS0 console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=auto
Max number of cores passed to firmware: 2048 (NR_CPUS = 2048)
Calling ibm,client-architecture-support.

 Migration starts here.

..qemu-system-ppc64: warning: kernel_irqchip allowed but unavailable: IRQ_XIVE capability must be present for KVM
Falling back to kernel-irqchip=off

 This ^^ indicates that CAS was called and switched to XIVE, for which
 we lack proper KVM support on GA boston machines.

23348@1579260982.315795:runstate_set current_run_state 9 (running) new_state 7 (finish-migrate)
23348@1579260982.360821:runstate_set current_run_state 7 (finish-migrate) new_state 5 (postmigrate)

 The migration thread is holding the global QEMU mutex at this point. It
 has stopped all CPUs. It now streams the full state to the destination
 before releasing the mutex.

23340@1579260982.797279:guest_cpu_reset cpu=0xf9dbb48a5e0 
23340@1579260982.797319:guest_cpu_reset cpu=0xf9dbb4d56a0 

 The main loop regained control and could process the CAS reboot request
 but it is too late...

23340@1579260982.866071:runstate_set current_run_state 5 (postmigrate) new_state 6 (prelaunch)

> Thanks,
> Laurent
>
Laurent Vivier Jan. 17, 2020, 12:10 p.m. UTC | #10
On 17/01/2020 12:49, Greg Kurz wrote:
> On Wed, 15 Jan 2020 19:26:18 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
>> On 15/01/2020 19:10, Laurent Vivier wrote:
>>> Hi,
>>>
>>> On 15/01/2020 18:48, Greg Kurz wrote:
>>>> Migration can potentially race with CAS reboot. If the migration thread
>>>> completes migration after CAS has set spapr->cas_reboot but before the
>>>> mainloop could pick up the reset request and reset the machine, the
>>>> guest is migrated unrebooted and the destination doesn't reboot it
>>>> either because it isn't aware a CAS reboot was needed (eg, because a
>>>> device was added before CAS). This likely result in a broken or hung
>>>> guest.
>>>>
>>>> Even if it is small, the window between CAS and CAS reboot is enough to
>>>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
>>>> subsection for that and always send it when a CAS reboot is pending.
>>>> This may cause migration to older QEMUs to fail but it is still better
>>>> than end up with a broken guest.
>>>>
>>>> The destination cannot honour the CAS reboot request from a post load
>>>> handler because this must be done after the guest is fully restored.
>>>> It is thus done from a VM change state handler.
>>>>
>>>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
>>>> Signed-off-by: Greg Kurz <groug@kaod.org>
>>>> ---
>>>>
>>>
>>> I'm wondering if the problem can be related with the fact that
>>> main_loop_should_exit() could release qemu_global_mutex in
>>> pause_all_vcpus() in the reset case?
>>>
>>> 1602 static bool main_loop_should_exit(void)
>>> 1603 {
>>> ...
>>> 1633     request = qemu_reset_requested();
>>> 1634     if (request) {
>>> 1635         pause_all_vcpus();
>>> 1636         qemu_system_reset(request);
>>> 1637         resume_all_vcpus();
>>> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
>>> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
>>> 1640             runstate_set(RUN_STATE_PRELAUNCH);
>>> 1641         }
>>> 1642     }
>>> ...
>>>
>>> I already sent a patch for this kind of problem (in current Juan pull
>>> request):
>>>
>>> "runstate: ignore finishmigrate -> prelaunch transition"
>>>
>>> but I don't know if it could fix this one.
>>
>> I think it should be interesting to have the state transition on source
>> and destination when the problem occurs (with something like "-trace
>> runstate_set").
>>
> 
> With "-serial mon:stdio -trace runstate_set -trace -trace guest_cpu_reset" :
> 
> OF stdout device is: /vdevice/vty@71000000
> Preparing to boot Linux version 4.18.0-80.el8.ppc64le (mockbuild@ppc-061.build.eng.bos.redhat.com) (gcc version 8.2.1 20180905 (Red Hat 8.2.1-3) (GCC)) #1 SMP Wed Mar 13 11:26:21 UTC 2019
> Detected machine type: 0000000000000101
> command line: BOOT_IMAGE=/boot/vmlinuz-4.18.0-80.el8.ppc64le root=UUID=012b83a5-2594-48ac-b936-12fec7cdbb9a ro console=ttyS0 console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=auto
> Max number of cores passed to firmware: 2048 (NR_CPUS = 2048)
> Calling ibm,client-architecture-support.
> 
>  Migration starts here.
> 
> ..qemu-system-ppc64: warning: kernel_irqchip allowed but unavailable: IRQ_XIVE capability must be present for KVM
> Falling back to kernel-irqchip=off
> 
>  This ^^ indicates that CAS was called and switched to XIVE, for which
>  we lack proper KVM support on GA boston machines.
> 
> 23348@1579260982.315795:runstate_set current_run_state 9 (running) new_state 7 (finish-migrate)
> 23348@1579260982.360821:runstate_set current_run_state 7 (finish-migrate) new_state 5 (postmigrate)
> 
>  The migration thread is holding the global QEMU mutex at this point. It
>  has stopped all CPUs. It now streams the full state to the destination
>  before releasing the mutex.
> 
> 23340@1579260982.797279:guest_cpu_reset cpu=0xf9dbb48a5e0 
> 23340@1579260982.797319:guest_cpu_reset cpu=0xf9dbb4d56a0 
> 
>  The main loop regained control and could process the CAS reboot request
>  but it is too late...
> 
> 23340@1579260982.866071:runstate_set current_run_state 5 (postmigrate) new_state 6 (prelaunch)

Thank you Greg.

So I think the best we can do is to migrate cas_reboot.

To delay the H_CAS call would be cleaner but I don't know if H_BUSY is a
valid return state and this forces to update SLOF too.

Reviewed-by: Laurent Vivier <lvivier@redhat.com>

Thanks,
Laurent
Greg Kurz Jan. 17, 2020, 3:44 p.m. UTC | #11
On Fri, 17 Jan 2020 19:16:08 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote:
> > On Thu, 16 Jan 2020 13:14:35 +0100
> > Greg Kurz <groug@kaod.org> wrote:
> > 
> > > On Thu, 16 Jan 2020 11:37:24 +0100
> > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > 
> > > > On 16/01/2020 09:48, Greg Kurz wrote:
> > > > > On Wed, 15 Jan 2020 19:10:37 +0100
> > > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > > 
> > > > >> Hi,
> > > > >>
> > > > >> On 15/01/2020 18:48, Greg Kurz wrote:
> > > > >>> Migration can potentially race with CAS reboot. If the migration thread
> > > > >>> completes migration after CAS has set spapr->cas_reboot but before the
> > > > >>> mainloop could pick up the reset request and reset the machine, the
> > > > >>> guest is migrated unrebooted and the destination doesn't reboot it
> > > > >>> either because it isn't aware a CAS reboot was needed (eg, because a
> > > > >>> device was added before CAS). This likely result in a broken or hung
> > > > >>> guest.
> > > > >>>
> > > > >>> Even if it is small, the window between CAS and CAS reboot is enough to
> > > > >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > > > >>> subsection for that and always send it when a CAS reboot is pending.
> > > > >>> This may cause migration to older QEMUs to fail but it is still better
> > > > >>> than end up with a broken guest.
> > > > >>>
> > > > >>> The destination cannot honour the CAS reboot request from a post load
> > > > >>> handler because this must be done after the guest is fully restored.
> > > > >>> It is thus done from a VM change state handler.
> > > > >>>
> > > > >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> > > > >>> Signed-off-by: Greg Kurz <groug@kaod.org>
> > > > >>> ---
> > > > >>>
> > > > >>
> > > > >> I'm wondering if the problem can be related with the fact that
> > > > >> main_loop_should_exit() could release qemu_global_mutex in
> > > > >> pause_all_vcpus() in the reset case?
> > > > >>
> > > > >> 1602 static bool main_loop_should_exit(void)
> > > > >> 1603 {
> > > > >> ...
> > > > >> 1633     request = qemu_reset_requested();
> > > > >> 1634     if (request) {
> > > > >> 1635         pause_all_vcpus();
> > > > >> 1636         qemu_system_reset(request);
> > > > >> 1637         resume_all_vcpus();
> > > > >> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> > > > >> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> > > > >> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> > > > >> 1641         }
> > > > >> 1642     }
> > > > >> ...
> > > > >>
> > > > >> I already sent a patch for this kind of problem (in current Juan pull
> > > > >> request):
> > > > >>
> > > > >> "runstate: ignore finishmigrate -> prelaunch transition"
> > > > >>
> > > > > 
> > > > > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
> > > > > transition that can happen if the migration thread sets the runstate to
> > > > > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex.
> > > > > 
> > > > > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm
> > > > > trying to fix is a guest breakage caused by a discrepancy between
> > > > > QEMU and the guest after migration has succeeded.
> > > > > 
> > > > >> but I don't know if it could fix this one.
> > > > >>
> > > > > 
> > > > > I don't think so and your patch kinda illustrates it. If the runstate
> > > > > is 'finishmigrate' when returning from pause_all_vcpus(), this means
> > > > > that state was sent to the destination before we could actually reset
> > > > > the machine.
> > > > 
> > > > Yes, you're right.
> > > > 
> > > > But the question behind my comment was: is it expected to have a pending
> > > > reset while we are migrating?
> > > > 
> > > 
> > > Nothing prevents qemu_system_reset_request() to be called when migration
> > > is active. 
> > > 
> > > > Perhaps H_CAS can return H_BUSY and wait the end of the migration and
> > > > then be fully executed on destination?
> > > > 
> > > 
> > > And so we would need to teach SLOF to try H_CAS again until it stops
> > > returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO.
> > > 
> > 
> > Ok I've tried that with a patched SLOF that sleeps 500ms and tries CAS
> > again if H_BUSY was returned. It fixes the issue but it looks a bit
> > ugly because of the polling with an arbitrary timeout in SLOF... I'm
> > not very comfortable either with calling migration_is_active() from
> > the CAS code in QEMU.
> > 
> > David,
> > 
> > Any suggestion ?
> 
> Yeah, I think looping in SLOF is a worse idea than migrating the
> cas_reboot flag.
> 
> But.. a better solution still might be to just remove the remaining
> causes for CAS reboot entirely.  CAS reboots pretty much suck when
> they happen, anyway.
> 

I Agree.

> With the irq changeover condition removed, I think the remaining
> causes are more theoretical than practical situations at this point.
> 

FWIW, hotpluggging a PCI device before CAS result in a hung guest (not yet
investigated the details).
Greg Kurz Jan. 17, 2020, 3:49 p.m. UTC | #12
On Fri, 17 Jan 2020 13:10:35 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 17/01/2020 12:49, Greg Kurz wrote:
> > On Wed, 15 Jan 2020 19:26:18 +0100
> > Laurent Vivier <lvivier@redhat.com> wrote:
> > 
> >> On 15/01/2020 19:10, Laurent Vivier wrote:
> >>> Hi,
> >>>
> >>> On 15/01/2020 18:48, Greg Kurz wrote:
> >>>> Migration can potentially race with CAS reboot. If the migration thread
> >>>> completes migration after CAS has set spapr->cas_reboot but before the
> >>>> mainloop could pick up the reset request and reset the machine, the
> >>>> guest is migrated unrebooted and the destination doesn't reboot it
> >>>> either because it isn't aware a CAS reboot was needed (eg, because a
> >>>> device was added before CAS). This likely result in a broken or hung
> >>>> guest.
> >>>>
> >>>> Even if it is small, the window between CAS and CAS reboot is enough to
> >>>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> >>>> subsection for that and always send it when a CAS reboot is pending.
> >>>> This may cause migration to older QEMUs to fail but it is still better
> >>>> than end up with a broken guest.
> >>>>
> >>>> The destination cannot honour the CAS reboot request from a post load
> >>>> handler because this must be done after the guest is fully restored.
> >>>> It is thus done from a VM change state handler.
> >>>>
> >>>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> >>>> Signed-off-by: Greg Kurz <groug@kaod.org>
> >>>> ---
> >>>>
> >>>
> >>> I'm wondering if the problem can be related with the fact that
> >>> main_loop_should_exit() could release qemu_global_mutex in
> >>> pause_all_vcpus() in the reset case?
> >>>
> >>> 1602 static bool main_loop_should_exit(void)
> >>> 1603 {
> >>> ...
> >>> 1633     request = qemu_reset_requested();
> >>> 1634     if (request) {
> >>> 1635         pause_all_vcpus();
> >>> 1636         qemu_system_reset(request);
> >>> 1637         resume_all_vcpus();
> >>> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> >>> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> >>> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> >>> 1641         }
> >>> 1642     }
> >>> ...
> >>>
> >>> I already sent a patch for this kind of problem (in current Juan pull
> >>> request):
> >>>
> >>> "runstate: ignore finishmigrate -> prelaunch transition"
> >>>
> >>> but I don't know if it could fix this one.
> >>
> >> I think it should be interesting to have the state transition on source
> >> and destination when the problem occurs (with something like "-trace
> >> runstate_set").
> >>
> > 
> > With "-serial mon:stdio -trace runstate_set -trace -trace guest_cpu_reset" :
> > 
> > OF stdout device is: /vdevice/vty@71000000
> > Preparing to boot Linux version 4.18.0-80.el8.ppc64le (mockbuild@ppc-061.build.eng.bos.redhat.com) (gcc version 8.2.1 20180905 (Red Hat 8.2.1-3) (GCC)) #1 SMP Wed Mar 13 11:26:21 UTC 2019
> > Detected machine type: 0000000000000101
> > command line: BOOT_IMAGE=/boot/vmlinuz-4.18.0-80.el8.ppc64le root=UUID=012b83a5-2594-48ac-b936-12fec7cdbb9a ro console=ttyS0 console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=auto
> > Max number of cores passed to firmware: 2048 (NR_CPUS = 2048)
> > Calling ibm,client-architecture-support.
> > 
> >  Migration starts here.
> > 
> > ..qemu-system-ppc64: warning: kernel_irqchip allowed but unavailable: IRQ_XIVE capability must be present for KVM
> > Falling back to kernel-irqchip=off
> > 
> >  This ^^ indicates that CAS was called and switched to XIVE, for which
> >  we lack proper KVM support on GA boston machines.
> > 
> > 23348@1579260982.315795:runstate_set current_run_state 9 (running) new_state 7 (finish-migrate)
> > 23348@1579260982.360821:runstate_set current_run_state 7 (finish-migrate) new_state 5 (postmigrate)
> > 
> >  The migration thread is holding the global QEMU mutex at this point. It
> >  has stopped all CPUs. It now streams the full state to the destination
> >  before releasing the mutex.
> > 
> > 23340@1579260982.797279:guest_cpu_reset cpu=0xf9dbb48a5e0 
> > 23340@1579260982.797319:guest_cpu_reset cpu=0xf9dbb4d56a0 
> > 
> >  The main loop regained control and could process the CAS reboot request
> >  but it is too late...
> > 
> > 23340@1579260982.866071:runstate_set current_run_state 5 (postmigrate) new_state 6 (prelaunch)
> 
> Thank you Greg.
> 
> So I think the best we can do is to migrate cas_reboot.
> 
> To delay the H_CAS call would be cleaner but I don't know if H_BUSY is a
> valid return state and this forces to update SLOF too.
> 

Since SLOF is currently the only user of H_CAS, I guess we have some
flexibility with the valid return codes... but anyway, David doesn't
like the idea :)

> Reviewed-by: Laurent Vivier <lvivier@redhat.com>
> 
> Thanks,
> Laurent
>
Greg Kurz Jan. 20, 2020, 8:04 a.m. UTC | #13
On Fri, 17 Jan 2020 16:44:27 +0100
Greg Kurz <groug@kaod.org> wrote:

> On Fri, 17 Jan 2020 19:16:08 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote:
> > > On Thu, 16 Jan 2020 13:14:35 +0100
> > > Greg Kurz <groug@kaod.org> wrote:
> > > 
> > > > On Thu, 16 Jan 2020 11:37:24 +0100
> > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > 
> > > > > On 16/01/2020 09:48, Greg Kurz wrote:
> > > > > > On Wed, 15 Jan 2020 19:10:37 +0100
> > > > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > > > 
> > > > > >> Hi,
> > > > > >>
> > > > > >> On 15/01/2020 18:48, Greg Kurz wrote:
> > > > > >>> Migration can potentially race with CAS reboot. If the migration thread
> > > > > >>> completes migration after CAS has set spapr->cas_reboot but before the
> > > > > >>> mainloop could pick up the reset request and reset the machine, the
> > > > > >>> guest is migrated unrebooted and the destination doesn't reboot it
> > > > > >>> either because it isn't aware a CAS reboot was needed (eg, because a
> > > > > >>> device was added before CAS). This likely result in a broken or hung
> > > > > >>> guest.
> > > > > >>>
> > > > > >>> Even if it is small, the window between CAS and CAS reboot is enough to
> > > > > >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > > > > >>> subsection for that and always send it when a CAS reboot is pending.
> > > > > >>> This may cause migration to older QEMUs to fail but it is still better
> > > > > >>> than end up with a broken guest.
> > > > > >>>
> > > > > >>> The destination cannot honour the CAS reboot request from a post load
> > > > > >>> handler because this must be done after the guest is fully restored.
> > > > > >>> It is thus done from a VM change state handler.
> > > > > >>>
> > > > > >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> > > > > >>> Signed-off-by: Greg Kurz <groug@kaod.org>
> > > > > >>> ---
> > > > > >>>
> > > > > >>
> > > > > >> I'm wondering if the problem can be related with the fact that
> > > > > >> main_loop_should_exit() could release qemu_global_mutex in
> > > > > >> pause_all_vcpus() in the reset case?
> > > > > >>
> > > > > >> 1602 static bool main_loop_should_exit(void)
> > > > > >> 1603 {
> > > > > >> ...
> > > > > >> 1633     request = qemu_reset_requested();
> > > > > >> 1634     if (request) {
> > > > > >> 1635         pause_all_vcpus();
> > > > > >> 1636         qemu_system_reset(request);
> > > > > >> 1637         resume_all_vcpus();
> > > > > >> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> > > > > >> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> > > > > >> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> > > > > >> 1641         }
> > > > > >> 1642     }
> > > > > >> ...
> > > > > >>
> > > > > >> I already sent a patch for this kind of problem (in current Juan pull
> > > > > >> request):
> > > > > >>
> > > > > >> "runstate: ignore finishmigrate -> prelaunch transition"
> > > > > >>
> > > > > > 
> > > > > > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
> > > > > > transition that can happen if the migration thread sets the runstate to
> > > > > > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex.
> > > > > > 
> > > > > > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm
> > > > > > trying to fix is a guest breakage caused by a discrepancy between
> > > > > > QEMU and the guest after migration has succeeded.
> > > > > > 
> > > > > >> but I don't know if it could fix this one.
> > > > > >>
> > > > > > 
> > > > > > I don't think so and your patch kinda illustrates it. If the runstate
> > > > > > is 'finishmigrate' when returning from pause_all_vcpus(), this means
> > > > > > that state was sent to the destination before we could actually reset
> > > > > > the machine.
> > > > > 
> > > > > Yes, you're right.
> > > > > 
> > > > > But the question behind my comment was: is it expected to have a pending
> > > > > reset while we are migrating?
> > > > > 
> > > > 
> > > > Nothing prevents qemu_system_reset_request() to be called when migration
> > > > is active. 
> > > > 
> > > > > Perhaps H_CAS can return H_BUSY and wait the end of the migration and
> > > > > then be fully executed on destination?
> > > > > 
> > > > 
> > > > And so we would need to teach SLOF to try H_CAS again until it stops
> > > > returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO.
> > > > 
> > > 
> > > Ok I've tried that with a patched SLOF that sleeps 500ms and tries CAS
> > > again if H_BUSY was returned. It fixes the issue but it looks a bit
> > > ugly because of the polling with an arbitrary timeout in SLOF... I'm
> > > not very comfortable either with calling migration_is_active() from
> > > the CAS code in QEMU.
> > > 
> > > David,
> > > 
> > > Any suggestion ?
> > 
> > Yeah, I think looping in SLOF is a worse idea than migrating the
> > cas_reboot flag.
> > 
> > But.. a better solution still might be to just remove the remaining
> > causes for CAS reboot entirely.  CAS reboots pretty much suck when
> > they happen, anyway.
> > 
> 
> I Agree.
> 
> > With the irq changeover condition removed, I think the remaining
> > causes are more theoretical than practical situations at this point.
> > 
> 
> FWIW, hotpluggging a PCI device before CAS result in a hung guest (not yet
> investigated the details).

commit 10f12e6450407b18b4d5a6b50d3852dcfd7fff75
Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Date:   Wed Aug 30 15:21:41 2017 -0300

    hw/ppc: CAS reset on early device hotplug

I'll have a look to see what can be done here.

But I agree the other check is more theoretical:

    /* capabilities that have been added since CAS-generated guest reset.
     * if capabilities have since been removed, generate another reset
     */
    spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas);

Unless changing kernels or tempering with the kernel command line, I don't
see how some capabilities could change between the two CAS in practice.
David Gibson Jan. 21, 2020, 3:41 a.m. UTC | #14
On Wed, Jan 15, 2020 at 07:10:47PM +0100, Cédric Le Goater wrote:
> On 1/15/20 6:48 PM, Greg Kurz wrote:
> > Migration can potentially race with CAS reboot. If the migration thread
> > completes migration after CAS has set spapr->cas_reboot but before the
> > mainloop could pick up the reset request and reset the machine, the
> > guest is migrated unrebooted and the destination doesn't reboot it
> > either because it isn't aware a CAS reboot was needed (eg, because a
> > device was added before CAS). This likely result in a broken or hung
> > guest.
> > 
> > Even if it is small, the window between CAS and CAS reboot is enough to
> > re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > subsection for that and always send it when a CAS reboot is pending.
> > This may cause migration to older QEMUs to fail but it is still better
> > than end up with a broken guest.
> > 
> > The destination cannot honour the CAS reboot request from a post load
> > handler because this must be done after the guest is fully restored.
> > It is thus done from a VM change state handler.
> > 
> > Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> > Signed-off-by: Greg Kurz <groug@kaod.org>
> 
> Cédric Le Goater <clg@kaod.org>
> 
> Nice work ! That was quite complex to catch !

It is a very nice analysis.  However, I'm disinclined to merge this
for the time being.

My preferred approach would be to just eliminate CAS reboots
altogether, since that has other benefits.  I'm feeling like this
isn't super-urgent, since CAS reboots are extremely rare in practice,
now that we've eliminated the one for the irq switchover.

However, if it's not looking like we'll be ready to do that as the
qemu-5.0 release approaches, then I'll be more than willing to
reconsider this.
David Gibson Jan. 21, 2020, 3:43 a.m. UTC | #15
On Mon, Jan 20, 2020 at 09:04:38AM +0100, Greg Kurz wrote:
> On Fri, 17 Jan 2020 16:44:27 +0100
> Greg Kurz <groug@kaod.org> wrote:
> 
> > On Fri, 17 Jan 2020 19:16:08 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> > 
> > > On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote:
> > > > On Thu, 16 Jan 2020 13:14:35 +0100
> > > > Greg Kurz <groug@kaod.org> wrote:
> > > > 
> > > > > On Thu, 16 Jan 2020 11:37:24 +0100
> > > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > > 
> > > > > > On 16/01/2020 09:48, Greg Kurz wrote:
> > > > > > > On Wed, 15 Jan 2020 19:10:37 +0100
> > > > > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > > > > 
> > > > > > >> Hi,
> > > > > > >>
> > > > > > >> On 15/01/2020 18:48, Greg Kurz wrote:
> > > > > > >>> Migration can potentially race with CAS reboot. If the migration thread
> > > > > > >>> completes migration after CAS has set spapr->cas_reboot but before the
> > > > > > >>> mainloop could pick up the reset request and reset the machine, the
> > > > > > >>> guest is migrated unrebooted and the destination doesn't reboot it
> > > > > > >>> either because it isn't aware a CAS reboot was needed (eg, because a
> > > > > > >>> device was added before CAS). This likely result in a broken or hung
> > > > > > >>> guest.
> > > > > > >>>
> > > > > > >>> Even if it is small, the window between CAS and CAS reboot is enough to
> > > > > > >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > > > > > >>> subsection for that and always send it when a CAS reboot is pending.
> > > > > > >>> This may cause migration to older QEMUs to fail but it is still better
> > > > > > >>> than end up with a broken guest.
> > > > > > >>>
> > > > > > >>> The destination cannot honour the CAS reboot request from a post load
> > > > > > >>> handler because this must be done after the guest is fully restored.
> > > > > > >>> It is thus done from a VM change state handler.
> > > > > > >>>
> > > > > > >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> > > > > > >>> Signed-off-by: Greg Kurz <groug@kaod.org>
> > > > > > >>> ---
> > > > > > >>>
> > > > > > >>
> > > > > > >> I'm wondering if the problem can be related with the fact that
> > > > > > >> main_loop_should_exit() could release qemu_global_mutex in
> > > > > > >> pause_all_vcpus() in the reset case?
> > > > > > >>
> > > > > > >> 1602 static bool main_loop_should_exit(void)
> > > > > > >> 1603 {
> > > > > > >> ...
> > > > > > >> 1633     request = qemu_reset_requested();
> > > > > > >> 1634     if (request) {
> > > > > > >> 1635         pause_all_vcpus();
> > > > > > >> 1636         qemu_system_reset(request);
> > > > > > >> 1637         resume_all_vcpus();
> > > > > > >> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> > > > > > >> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> > > > > > >> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> > > > > > >> 1641         }
> > > > > > >> 1642     }
> > > > > > >> ...
> > > > > > >>
> > > > > > >> I already sent a patch for this kind of problem (in current Juan pull
> > > > > > >> request):
> > > > > > >>
> > > > > > >> "runstate: ignore finishmigrate -> prelaunch transition"
> > > > > > >>
> > > > > > > 
> > > > > > > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
> > > > > > > transition that can happen if the migration thread sets the runstate to
> > > > > > > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex.
> > > > > > > 
> > > > > > > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm
> > > > > > > trying to fix is a guest breakage caused by a discrepancy between
> > > > > > > QEMU and the guest after migration has succeeded.
> > > > > > > 
> > > > > > >> but I don't know if it could fix this one.
> > > > > > >>
> > > > > > > 
> > > > > > > I don't think so and your patch kinda illustrates it. If the runstate
> > > > > > > is 'finishmigrate' when returning from pause_all_vcpus(), this means
> > > > > > > that state was sent to the destination before we could actually reset
> > > > > > > the machine.
> > > > > > 
> > > > > > Yes, you're right.
> > > > > > 
> > > > > > But the question behind my comment was: is it expected to have a pending
> > > > > > reset while we are migrating?
> > > > > > 
> > > > > 
> > > > > Nothing prevents qemu_system_reset_request() to be called when migration
> > > > > is active. 
> > > > > 
> > > > > > Perhaps H_CAS can return H_BUSY and wait the end of the migration and
> > > > > > then be fully executed on destination?
> > > > > > 
> > > > > 
> > > > > And so we would need to teach SLOF to try H_CAS again until it stops
> > > > > returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO.
> > > > > 
> > > > 
> > > > Ok I've tried that with a patched SLOF that sleeps 500ms and tries CAS
> > > > again if H_BUSY was returned. It fixes the issue but it looks a bit
> > > > ugly because of the polling with an arbitrary timeout in SLOF... I'm
> > > > not very comfortable either with calling migration_is_active() from
> > > > the CAS code in QEMU.
> > > > 
> > > > David,
> > > > 
> > > > Any suggestion ?
> > > 
> > > Yeah, I think looping in SLOF is a worse idea than migrating the
> > > cas_reboot flag.
> > > 
> > > But.. a better solution still might be to just remove the remaining
> > > causes for CAS reboot entirely.  CAS reboots pretty much suck when
> > > they happen, anyway.
> > > 
> > 
> > I Agree.
> > 
> > > With the irq changeover condition removed, I think the remaining
> > > causes are more theoretical than practical situations at this point.
> > > 
> > 
> > FWIW, hotpluggging a PCI device before CAS result in a hung guest (not yet
> > investigated the details).
> 
> commit 10f12e6450407b18b4d5a6b50d3852dcfd7fff75
> Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
> Date:   Wed Aug 30 15:21:41 2017 -0300
> 
>     hw/ppc: CAS reset on early device hotplug
> 
> I'll have a look to see what can be done here.

Ah.. yes, that one might be a bit tricky.

> But I agree the other check is more theoretical:
> 
>     /* capabilities that have been added since CAS-generated guest reset.
>      * if capabilities have since been removed, generate another reset
>      */
>     spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas);
> 
> Unless changing kernels or tempering with the kernel command line, I don't
> see how some capabilities could change between the two CAS in practice.

Well, we want to be robust and it's at least theoretically possible
that the guest will request different things on subsequent reboots.
However I believe that the original rationale for this check was that
while we could add things to the device tree for added capabilities,
we didn't have a way to roll back the changes for removed
capabilities.

Now that we fully rebuild the device tree at CAS, I think this test
can probably just go, although there's some double checking to do.
Cédric Le Goater Jan. 21, 2020, 6:57 a.m. UTC | #16
On 1/21/20 4:41 AM, David Gibson wrote:
> On Wed, Jan 15, 2020 at 07:10:47PM +0100, Cédric Le Goater wrote:
>> On 1/15/20 6:48 PM, Greg Kurz wrote:
>>> Migration can potentially race with CAS reboot. If the migration thread
>>> completes migration after CAS has set spapr->cas_reboot but before the
>>> mainloop could pick up the reset request and reset the machine, the
>>> guest is migrated unrebooted and the destination doesn't reboot it
>>> either because it isn't aware a CAS reboot was needed (eg, because a
>>> device was added before CAS). This likely result in a broken or hung
>>> guest.
>>>
>>> Even if it is small, the window between CAS and CAS reboot is enough to
>>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
>>> subsection for that and always send it when a CAS reboot is pending.
>>> This may cause migration to older QEMUs to fail but it is still better
>>> than end up with a broken guest.
>>>
>>> The destination cannot honour the CAS reboot request from a post load
>>> handler because this must be done after the guest is fully restored.
>>> It is thus done from a VM change state handler.
>>>
>>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
>>> Signed-off-by: Greg Kurz <groug@kaod.org>
>>
>> Cédric Le Goater <clg@kaod.org>
>>
>> Nice work ! That was quite complex to catch !
> 
> It is a very nice analysis.  However, I'm disinclined to merge this
> for the time being.
> 
> My preferred approach would be to just eliminate CAS reboots
> altogether, since that has other benefits.  I'm feeling like this
> isn't super-urgent, since CAS reboots are extremely rare in practice,
> now that we've eliminated the one for the irq switchover.

Yes. The possibility of a migration in the window between CAS and 
CAS reboot must be even more rare.

C.

 
> However, if it's not looking like we'll be ready to do that as the
> qemu-5.0 release approaches, then I'll be more than willing to
> reconsider this.
Greg Kurz Jan. 21, 2020, 7:38 a.m. UTC | #17
On Tue, 21 Jan 2020 14:41:26 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Jan 15, 2020 at 07:10:47PM +0100, Cédric Le Goater wrote:
> > On 1/15/20 6:48 PM, Greg Kurz wrote:
> > > Migration can potentially race with CAS reboot. If the migration thread
> > > completes migration after CAS has set spapr->cas_reboot but before the
> > > mainloop could pick up the reset request and reset the machine, the
> > > guest is migrated unrebooted and the destination doesn't reboot it
> > > either because it isn't aware a CAS reboot was needed (eg, because a
> > > device was added before CAS). This likely result in a broken or hung
> > > guest.
> > > 
> > > Even if it is small, the window between CAS and CAS reboot is enough to
> > > re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > > subsection for that and always send it when a CAS reboot is pending.
> > > This may cause migration to older QEMUs to fail but it is still better
> > > than end up with a broken guest.
> > > 
> > > The destination cannot honour the CAS reboot request from a post load
> > > handler because this must be done after the guest is fully restored.
> > > It is thus done from a VM change state handler.
> > > 
> > > Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> > > Signed-off-by: Greg Kurz <groug@kaod.org>
> > 
> > Cédric Le Goater <clg@kaod.org>
> > 
> > Nice work ! That was quite complex to catch !
> 
> It is a very nice analysis.  However, I'm disinclined to merge this
> for the time being.
> 
> My preferred approach would be to just eliminate CAS reboots
> altogether, since that has other benefits.  I'm feeling like this
> isn't super-urgent, since CAS reboots are extremely rare in practice,
> now that we've eliminated the one for the irq switchover.
> 

Yeah. The only _true_ need for CAS rebooting now seems to be hotplug
before CAS, which is likely not something frequent.

> However, if it's not looking like we'll be ready to do that as the
> qemu-5.0 release approaches, then I'll be more than willing to
> reconsider this.
> 

I hope we can drop CAS reboot in time.
Greg Kurz Jan. 21, 2020, 9:32 a.m. UTC | #18
On Tue, 21 Jan 2020 14:43:32 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Mon, Jan 20, 2020 at 09:04:38AM +0100, Greg Kurz wrote:
> > On Fri, 17 Jan 2020 16:44:27 +0100
> > Greg Kurz <groug@kaod.org> wrote:
> > 
> > > On Fri, 17 Jan 2020 19:16:08 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > 
> > > > On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote:
> > > > > On Thu, 16 Jan 2020 13:14:35 +0100
> > > > > Greg Kurz <groug@kaod.org> wrote:
> > > > > 
> > > > > > On Thu, 16 Jan 2020 11:37:24 +0100
> > > > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > > > 
> > > > > > > On 16/01/2020 09:48, Greg Kurz wrote:
> > > > > > > > On Wed, 15 Jan 2020 19:10:37 +0100
> > > > > > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > > > > > 
> > > > > > > >> Hi,
> > > > > > > >>
> > > > > > > >> On 15/01/2020 18:48, Greg Kurz wrote:
> > > > > > > >>> Migration can potentially race with CAS reboot. If the migration thread
> > > > > > > >>> completes migration after CAS has set spapr->cas_reboot but before the
> > > > > > > >>> mainloop could pick up the reset request and reset the machine, the
> > > > > > > >>> guest is migrated unrebooted and the destination doesn't reboot it
> > > > > > > >>> either because it isn't aware a CAS reboot was needed (eg, because a
> > > > > > > >>> device was added before CAS). This likely result in a broken or hung
> > > > > > > >>> guest.
> > > > > > > >>>
> > > > > > > >>> Even if it is small, the window between CAS and CAS reboot is enough to
> > > > > > > >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > > > > > > >>> subsection for that and always send it when a CAS reboot is pending.
> > > > > > > >>> This may cause migration to older QEMUs to fail but it is still better
> > > > > > > >>> than end up with a broken guest.
> > > > > > > >>>
> > > > > > > >>> The destination cannot honour the CAS reboot request from a post load
> > > > > > > >>> handler because this must be done after the guest is fully restored.
> > > > > > > >>> It is thus done from a VM change state handler.
> > > > > > > >>>
> > > > > > > >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> > > > > > > >>> Signed-off-by: Greg Kurz <groug@kaod.org>
> > > > > > > >>> ---
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >> I'm wondering if the problem can be related with the fact that
> > > > > > > >> main_loop_should_exit() could release qemu_global_mutex in
> > > > > > > >> pause_all_vcpus() in the reset case?
> > > > > > > >>
> > > > > > > >> 1602 static bool main_loop_should_exit(void)
> > > > > > > >> 1603 {
> > > > > > > >> ...
> > > > > > > >> 1633     request = qemu_reset_requested();
> > > > > > > >> 1634     if (request) {
> > > > > > > >> 1635         pause_all_vcpus();
> > > > > > > >> 1636         qemu_system_reset(request);
> > > > > > > >> 1637         resume_all_vcpus();
> > > > > > > >> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> > > > > > > >> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> > > > > > > >> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> > > > > > > >> 1641         }
> > > > > > > >> 1642     }
> > > > > > > >> ...
> > > > > > > >>
> > > > > > > >> I already sent a patch for this kind of problem (in current Juan pull
> > > > > > > >> request):
> > > > > > > >>
> > > > > > > >> "runstate: ignore finishmigrate -> prelaunch transition"
> > > > > > > >>
> > > > > > > > 
> > > > > > > > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
> > > > > > > > transition that can happen if the migration thread sets the runstate to
> > > > > > > > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex.
> > > > > > > > 
> > > > > > > > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm
> > > > > > > > trying to fix is a guest breakage caused by a discrepancy between
> > > > > > > > QEMU and the guest after migration has succeeded.
> > > > > > > > 
> > > > > > > >> but I don't know if it could fix this one.
> > > > > > > >>
> > > > > > > > 
> > > > > > > > I don't think so and your patch kinda illustrates it. If the runstate
> > > > > > > > is 'finishmigrate' when returning from pause_all_vcpus(), this means
> > > > > > > > that state was sent to the destination before we could actually reset
> > > > > > > > the machine.
> > > > > > > 
> > > > > > > Yes, you're right.
> > > > > > > 
> > > > > > > But the question behind my comment was: is it expected to have a pending
> > > > > > > reset while we are migrating?
> > > > > > > 
> > > > > > 
> > > > > > Nothing prevents qemu_system_reset_request() to be called when migration
> > > > > > is active. 
> > > > > > 
> > > > > > > Perhaps H_CAS can return H_BUSY and wait the end of the migration and
> > > > > > > then be fully executed on destination?
> > > > > > > 
> > > > > > 
> > > > > > And so we would need to teach SLOF to try H_CAS again until it stops
> > > > > > returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO.
> > > > > > 
> > > > > 
> > > > > Ok I've tried that with a patched SLOF that sleeps 500ms and tries CAS
> > > > > again if H_BUSY was returned. It fixes the issue but it looks a bit
> > > > > ugly because of the polling with an arbitrary timeout in SLOF... I'm
> > > > > not very comfortable either with calling migration_is_active() from
> > > > > the CAS code in QEMU.
> > > > > 
> > > > > David,
> > > > > 
> > > > > Any suggestion ?
> > > > 
> > > > Yeah, I think looping in SLOF is a worse idea than migrating the
> > > > cas_reboot flag.
> > > > 
> > > > But.. a better solution still might be to just remove the remaining
> > > > causes for CAS reboot entirely.  CAS reboots pretty much suck when
> > > > they happen, anyway.
> > > > 
> > > 
> > > I Agree.
> > > 
> > > > With the irq changeover condition removed, I think the remaining
> > > > causes are more theoretical than practical situations at this point.
> > > > 
> > > 
> > > FWIW, hotpluggging a PCI device before CAS result in a hung guest (not yet
> > > investigated the details).
> > 
> > commit 10f12e6450407b18b4d5a6b50d3852dcfd7fff75
> > Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
> > Date:   Wed Aug 30 15:21:41 2017 -0300
> > 
> >     hw/ppc: CAS reset on early device hotplug
> > 
> > I'll have a look to see what can be done here.
> 
> Ah.. yes, that one might be a bit tricky.
> 

So far it seems to be related to SLOF not being able to create
new nodes in the DT when parsing the FDT returned by CAS. SLOF
stops the parsing and returns an error. The guest ends up with
a broken DT and eventually hangs later (in my case the kernel
believes it is going to do hash while radix was negotiated with
QEMU). I need to dig some more.

> > But I agree the other check is more theoretical:
> > 
> >     /* capabilities that have been added since CAS-generated guest reset.
> >      * if capabilities have since been removed, generate another reset
> >      */
> >     spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas);
> > 
> > Unless changing kernels or tempering with the kernel command line, I don't
> > see how some capabilities could change between the two CAS in practice.
> 
> Well, we want to be robust and it's at least theoretically possible
> that the guest will request different things on subsequent reboots.

Yes but in the latter case a full machine reset occurs and
spapr->ov5_cas gets cleared, ie. spapr_ovec_subset() returns
true in the check above no matter what.

> However I believe that the original rationale for this check was that
> while we could add things to the device tree for added capabilities,
> we didn't have a way to roll back the changes for removed
> capabilities.
> 

IIUC this is specifically for "removed capabilities since last
CAS". This can happen if:
1) we're already processing a CAS reboot or,
2) a freshly rebooted guest calls CAS twice without being rebooted
   in between.

Since a freshly booted or rebooted guest can only trigger a CAS
reboot because of a "hotplug-before-CAS", if we manage to get rid
of this limitation, 1) cannot happen anymore.

The linux kernel seems to be only calling "ibm,client-architecture-support"
once during early boot so 2) should _never_ happen. Do we care to support
this scenario anyway ?

> Now that we fully rebuild the device tree at CAS, I think this test
> can probably just go, although there's some double checking to do.
> 

I tend to agree.
David Gibson Jan. 22, 2020, 6:50 a.m. UTC | #19
On Tue, Jan 21, 2020 at 10:32:55AM +0100, Greg Kurz wrote:
> On Tue, 21 Jan 2020 14:43:32 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Mon, Jan 20, 2020 at 09:04:38AM +0100, Greg Kurz wrote:
> > > On Fri, 17 Jan 2020 16:44:27 +0100
> > > Greg Kurz <groug@kaod.org> wrote:
> > > 
> > > > On Fri, 17 Jan 2020 19:16:08 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > 
> > > > > On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote:
> > > > > > On Thu, 16 Jan 2020 13:14:35 +0100
> > > > > > Greg Kurz <groug@kaod.org> wrote:
> > > > > > 
> > > > > > > On Thu, 16 Jan 2020 11:37:24 +0100
> > > > > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > > > > 
> > > > > > > > On 16/01/2020 09:48, Greg Kurz wrote:
> > > > > > > > > On Wed, 15 Jan 2020 19:10:37 +0100
> > > > > > > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > > > > > > 
> > > > > > > > >> Hi,
> > > > > > > > >>
> > > > > > > > >> On 15/01/2020 18:48, Greg Kurz wrote:
> > > > > > > > >>> Migration can potentially race with CAS reboot. If the migration thread
> > > > > > > > >>> completes migration after CAS has set spapr->cas_reboot but before the
> > > > > > > > >>> mainloop could pick up the reset request and reset the machine, the
> > > > > > > > >>> guest is migrated unrebooted and the destination doesn't reboot it
> > > > > > > > >>> either because it isn't aware a CAS reboot was needed (eg, because a
> > > > > > > > >>> device was added before CAS). This likely result in a broken or hung
> > > > > > > > >>> guest.
> > > > > > > > >>>
> > > > > > > > >>> Even if it is small, the window between CAS and CAS reboot is enough to
> > > > > > > > >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > > > > > > > >>> subsection for that and always send it when a CAS reboot is pending.
> > > > > > > > >>> This may cause migration to older QEMUs to fail but it is still better
> > > > > > > > >>> than end up with a broken guest.
> > > > > > > > >>>
> > > > > > > > >>> The destination cannot honour the CAS reboot request from a post load
> > > > > > > > >>> handler because this must be done after the guest is fully restored.
> > > > > > > > >>> It is thus done from a VM change state handler.
> > > > > > > > >>>
> > > > > > > > >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> > > > > > > > >>> Signed-off-by: Greg Kurz <groug@kaod.org>
> > > > > > > > >>> ---
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > > >> I'm wondering if the problem can be related with the fact that
> > > > > > > > >> main_loop_should_exit() could release qemu_global_mutex in
> > > > > > > > >> pause_all_vcpus() in the reset case?
> > > > > > > > >>
> > > > > > > > >> 1602 static bool main_loop_should_exit(void)
> > > > > > > > >> 1603 {
> > > > > > > > >> ...
> > > > > > > > >> 1633     request = qemu_reset_requested();
> > > > > > > > >> 1634     if (request) {
> > > > > > > > >> 1635         pause_all_vcpus();
> > > > > > > > >> 1636         qemu_system_reset(request);
> > > > > > > > >> 1637         resume_all_vcpus();
> > > > > > > > >> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> > > > > > > > >> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> > > > > > > > >> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> > > > > > > > >> 1641         }
> > > > > > > > >> 1642     }
> > > > > > > > >> ...
> > > > > > > > >>
> > > > > > > > >> I already sent a patch for this kind of problem (in current Juan pull
> > > > > > > > >> request):
> > > > > > > > >>
> > > > > > > > >> "runstate: ignore finishmigrate -> prelaunch transition"
> > > > > > > > >>
> > > > > > > > > 
> > > > > > > > > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
> > > > > > > > > transition that can happen if the migration thread sets the runstate to
> > > > > > > > > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex.
> > > > > > > > > 
> > > > > > > > > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm
> > > > > > > > > trying to fix is a guest breakage caused by a discrepancy between
> > > > > > > > > QEMU and the guest after migration has succeeded.
> > > > > > > > > 
> > > > > > > > >> but I don't know if it could fix this one.
> > > > > > > > >>
> > > > > > > > > 
> > > > > > > > > I don't think so and your patch kinda illustrates it. If the runstate
> > > > > > > > > is 'finishmigrate' when returning from pause_all_vcpus(), this means
> > > > > > > > > that state was sent to the destination before we could actually reset
> > > > > > > > > the machine.
> > > > > > > > 
> > > > > > > > Yes, you're right.
> > > > > > > > 
> > > > > > > > But the question behind my comment was: is it expected to have a pending
> > > > > > > > reset while we are migrating?
> > > > > > > > 
> > > > > > > 
> > > > > > > Nothing prevents qemu_system_reset_request() to be called when migration
> > > > > > > is active. 
> > > > > > > 
> > > > > > > > Perhaps H_CAS can return H_BUSY and wait the end of the migration and
> > > > > > > > then be fully executed on destination?
> > > > > > > > 
> > > > > > > 
> > > > > > > And so we would need to teach SLOF to try H_CAS again until it stops
> > > > > > > returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO.
> > > > > > > 
> > > > > > 
> > > > > > Ok I've tried that with a patched SLOF that sleeps 500ms and tries CAS
> > > > > > again if H_BUSY was returned. It fixes the issue but it looks a bit
> > > > > > ugly because of the polling with an arbitrary timeout in SLOF... I'm
> > > > > > not very comfortable either with calling migration_is_active() from
> > > > > > the CAS code in QEMU.
> > > > > > 
> > > > > > David,
> > > > > > 
> > > > > > Any suggestion ?
> > > > > 
> > > > > Yeah, I think looping in SLOF is a worse idea than migrating the
> > > > > cas_reboot flag.
> > > > > 
> > > > > But.. a better solution still might be to just remove the remaining
> > > > > causes for CAS reboot entirely.  CAS reboots pretty much suck when
> > > > > they happen, anyway.
> > > > > 
> > > > 
> > > > I Agree.
> > > > 
> > > > > With the irq changeover condition removed, I think the remaining
> > > > > causes are more theoretical than practical situations at this point.
> > > > > 
> > > > 
> > > > FWIW, hotpluggging a PCI device before CAS result in a hung guest (not yet
> > > > investigated the details).
> > > 
> > > commit 10f12e6450407b18b4d5a6b50d3852dcfd7fff75
> > > Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
> > > Date:   Wed Aug 30 15:21:41 2017 -0300
> > > 
> > >     hw/ppc: CAS reset on early device hotplug
> > > 
> > > I'll have a look to see what can be done here.
> > 
> > Ah.. yes, that one might be a bit tricky.
> > 
> 
> So far it seems to be related to SLOF not being able to create
> new nodes in the DT when parsing the FDT returned by CAS. SLOF
> stops the parsing and returns an error. The guest ends up with
> a broken DT and eventually hangs later (in my case the kernel
> believes it is going to do hash while radix was negotiated with
> QEMU). I need to dig some more.

Uh... I don't think this is right.  I'm pretty sure SLOF *does* create
new nodes when parsing the CAS FDT, it will need to for
"ibm,dynamic-reconfiguration-memory" at least.

I've done some looking and I think the actual reasons here are a bit
more complicated (but also possibly easier to handle).

  1. We can't send hotplug events to the guest until after CAS,
     because before then we don't know if it can process them, or if
     it can, which interrupt it uses for them.

  2. Queueing hotplug events across CAS for delivery afterwards
     introduces other complications

  3. We need to make sure that each device appears exactly once in
     either the  initial device tree that the guest OS sees, *or* in a
     hotplug event, not both or neither.

Now that we rebuild the DT at CAS time, I think we mightd be able toy
handle this by converting such devices to "cold plugged" at CAS time
(similarly to what we already do at reset).  Since they're in the
CAS-time DT which is what the OS will consume, cold plugged is
effectively how the OS will see them.

A remaining problem is that new PCI devices won't get BARs assigned by
SLOF in this scenario.  We'll probably get away with it because of the
linux,pci-probe-only property, but I don't know we want to rely on
that.  PCI bridges hotplugged introduce further complications, because
they won't get enumerated.

> > > But I agree the other check is more theoretical:
> > > 
> > >     /* capabilities that have been added since CAS-generated guest reset.
> > >      * if capabilities have since been removed, generate another reset
> > >      */
> > >     spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas);
> > > 
> > > Unless changing kernels or tempering with the kernel command line, I don't
> > > see how some capabilities could change between the two CAS in practice.
> > 
> > Well, we want to be robust and it's at least theoretically possible
> > that the guest will request different things on subsequent reboots.
> 
> Yes but in the latter case a full machine reset occurs and

Not necessarily.  A guest could ask for something on one CAS cycle,
then reject it on another, without doing a full reboot.  It'd be a
pointless thing for the guest to do, but it's possible.

> spapr->ov5_cas gets cleared, ie. spapr_ovec_subset() returns
> true in the check above no matter what.

Well, also it could happen if the guest rejects something we put in
the initial value of ov5_cas (which is populated from spapr->ov5, so
it's not entirely empty).

> > However I believe that the original rationale for this check was that
> > while we could add things to the device tree for added capabilities,
> > we didn't have a way to roll back the changes for removed
> > capabilities.
> > 
> 
> IIUC this is specifically for "removed capabilities since last
> CAS". This can happen if:
> 1) we're already processing a CAS reboot or,
> 2) a freshly rebooted guest calls CAS twice without being rebooted
>    in between.
> 
> Since a freshly booted or rebooted guest can only trigger a CAS
> reboot because of a "hotplug-before-CAS", if we manage to get rid
> of this limitation, 1) cannot happen anymore.
> 
> The linux kernel seems to be only calling "ibm,client-architecture-support"
> once during early boot so 2) should _never_ happen. Do we care to support
> this scenario anyway ?

I think you've missed some things in your reasoning.  But it doesn't
really matter because the full dt rebuilt should handle it anyway.  I
have a draft patch which removes this cause for CAS reboots.

Still grappling with the hotplug-before-CAS case.

> > Now that we fully rebuild the device tree at CAS, I think this test
> > can probably just go, although there's some double checking to do.
> > 
> 
> I tend to agree.
Greg Kurz Jan. 22, 2020, 10:06 a.m. UTC | #20
On Wed, 22 Jan 2020 17:50:28 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Jan 21, 2020 at 10:32:55AM +0100, Greg Kurz wrote:
> > On Tue, 21 Jan 2020 14:43:32 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> > 
> > > On Mon, Jan 20, 2020 at 09:04:38AM +0100, Greg Kurz wrote:
> > > > On Fri, 17 Jan 2020 16:44:27 +0100
> > > > Greg Kurz <groug@kaod.org> wrote:
> > > > 
> > > > > On Fri, 17 Jan 2020 19:16:08 +1000
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > 
> > > > > > On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote:
> > > > > > > On Thu, 16 Jan 2020 13:14:35 +0100
> > > > > > > Greg Kurz <groug@kaod.org> wrote:
> > > > > > > 
> > > > > > > > On Thu, 16 Jan 2020 11:37:24 +0100
> > > > > > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > > > > > 
> > > > > > > > > On 16/01/2020 09:48, Greg Kurz wrote:
> > > > > > > > > > On Wed, 15 Jan 2020 19:10:37 +0100
> > > > > > > > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > > > > > > > 
> > > > > > > > > >> Hi,
> > > > > > > > > >>
> > > > > > > > > >> On 15/01/2020 18:48, Greg Kurz wrote:
> > > > > > > > > >>> Migration can potentially race with CAS reboot. If the migration thread
> > > > > > > > > >>> completes migration after CAS has set spapr->cas_reboot but before the
> > > > > > > > > >>> mainloop could pick up the reset request and reset the machine, the
> > > > > > > > > >>> guest is migrated unrebooted and the destination doesn't reboot it
> > > > > > > > > >>> either because it isn't aware a CAS reboot was needed (eg, because a
> > > > > > > > > >>> device was added before CAS). This likely result in a broken or hung
> > > > > > > > > >>> guest.
> > > > > > > > > >>>
> > > > > > > > > >>> Even if it is small, the window between CAS and CAS reboot is enough to
> > > > > > > > > >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > > > > > > > > >>> subsection for that and always send it when a CAS reboot is pending.
> > > > > > > > > >>> This may cause migration to older QEMUs to fail but it is still better
> > > > > > > > > >>> than end up with a broken guest.
> > > > > > > > > >>>
> > > > > > > > > >>> The destination cannot honour the CAS reboot request from a post load
> > > > > > > > > >>> handler because this must be done after the guest is fully restored.
> > > > > > > > > >>> It is thus done from a VM change state handler.
> > > > > > > > > >>>
> > > > > > > > > >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> > > > > > > > > >>> Signed-off-by: Greg Kurz <groug@kaod.org>
> > > > > > > > > >>> ---
> > > > > > > > > >>>
> > > > > > > > > >>
> > > > > > > > > >> I'm wondering if the problem can be related with the fact that
> > > > > > > > > >> main_loop_should_exit() could release qemu_global_mutex in
> > > > > > > > > >> pause_all_vcpus() in the reset case?
> > > > > > > > > >>
> > > > > > > > > >> 1602 static bool main_loop_should_exit(void)
> > > > > > > > > >> 1603 {
> > > > > > > > > >> ...
> > > > > > > > > >> 1633     request = qemu_reset_requested();
> > > > > > > > > >> 1634     if (request) {
> > > > > > > > > >> 1635         pause_all_vcpus();
> > > > > > > > > >> 1636         qemu_system_reset(request);
> > > > > > > > > >> 1637         resume_all_vcpus();
> > > > > > > > > >> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> > > > > > > > > >> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> > > > > > > > > >> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> > > > > > > > > >> 1641         }
> > > > > > > > > >> 1642     }
> > > > > > > > > >> ...
> > > > > > > > > >>
> > > > > > > > > >> I already sent a patch for this kind of problem (in current Juan pull
> > > > > > > > > >> request):
> > > > > > > > > >>
> > > > > > > > > >> "runstate: ignore finishmigrate -> prelaunch transition"
> > > > > > > > > >>
> > > > > > > > > > 
> > > > > > > > > > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
> > > > > > > > > > transition that can happen if the migration thread sets the runstate to
> > > > > > > > > > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex.
> > > > > > > > > > 
> > > > > > > > > > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm
> > > > > > > > > > trying to fix is a guest breakage caused by a discrepancy between
> > > > > > > > > > QEMU and the guest after migration has succeeded.
> > > > > > > > > > 
> > > > > > > > > >> but I don't know if it could fix this one.
> > > > > > > > > >>
> > > > > > > > > > 
> > > > > > > > > > I don't think so and your patch kinda illustrates it. If the runstate
> > > > > > > > > > is 'finishmigrate' when returning from pause_all_vcpus(), this means
> > > > > > > > > > that state was sent to the destination before we could actually reset
> > > > > > > > > > the machine.
> > > > > > > > > 
> > > > > > > > > Yes, you're right.
> > > > > > > > > 
> > > > > > > > > But the question behind my comment was: is it expected to have a pending
> > > > > > > > > reset while we are migrating?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Nothing prevents qemu_system_reset_request() to be called when migration
> > > > > > > > is active. 
> > > > > > > > 
> > > > > > > > > Perhaps H_CAS can return H_BUSY and wait the end of the migration and
> > > > > > > > > then be fully executed on destination?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > And so we would need to teach SLOF to try H_CAS again until it stops
> > > > > > > > returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO.
> > > > > > > > 
> > > > > > > 
> > > > > > > Ok I've tried that with a patched SLOF that sleeps 500ms and tries CAS
> > > > > > > again if H_BUSY was returned. It fixes the issue but it looks a bit
> > > > > > > ugly because of the polling with an arbitrary timeout in SLOF... I'm
> > > > > > > not very comfortable either with calling migration_is_active() from
> > > > > > > the CAS code in QEMU.
> > > > > > > 
> > > > > > > David,
> > > > > > > 
> > > > > > > Any suggestion ?
> > > > > > 
> > > > > > Yeah, I think looping in SLOF is a worse idea than migrating the
> > > > > > cas_reboot flag.
> > > > > > 
> > > > > > But.. a better solution still might be to just remove the remaining
> > > > > > causes for CAS reboot entirely.  CAS reboots pretty much suck when
> > > > > > they happen, anyway.
> > > > > > 
> > > > > 
> > > > > I Agree.
> > > > > 
> > > > > > With the irq changeover condition removed, I think the remaining
> > > > > > causes are more theoretical than practical situations at this point.
> > > > > > 
> > > > > 
> > > > > FWIW, hotpluggging a PCI device before CAS result in a hung guest (not yet
> > > > > investigated the details).
> > > > 
> > > > commit 10f12e6450407b18b4d5a6b50d3852dcfd7fff75
> > > > Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
> > > > Date:   Wed Aug 30 15:21:41 2017 -0300
> > > > 
> > > >     hw/ppc: CAS reset on early device hotplug
> > > > 
> > > > I'll have a look to see what can be done here.
> > > 
> > > Ah.. yes, that one might be a bit tricky.
> > > 
> > 
> > So far it seems to be related to SLOF not being able to create
> > new nodes in the DT when parsing the FDT returned by CAS. SLOF
> > stops the parsing and returns an error. The guest ends up with
> > a broken DT and eventually hangs later (in my case the kernel
> > believes it is going to do hash while radix was negotiated with
> > QEMU). I need to dig some more.
> 
> Uh... I don't think this is right.  I'm pretty sure SLOF *does* create
> new nodes when parsing the CAS FDT, it will need to for
> "ibm,dynamic-reconfiguration-memory" at least.
> 

It can create "memory@" or "ibm,dynamic-reconfiguration-memory" nodes but
explicitly rejects all others.

> I've done some looking and I think the actual reasons here are a bit
> more complicated (but also possibly easier to handle).
> 
>   1. We can't send hotplug events to the guest until after CAS,
>      because before then we don't know if it can process them, or if
>      it can, which interrupt it uses for them.
> 
>   2. Queueing hotplug events across CAS for delivery afterwards
>      introduces other complications
> 
>   3. We need to make sure that each device appears exactly once in
>      either the  initial device tree that the guest OS sees, *or* in a
>      hotplug event, not both or neither.
> 
> Now that we rebuild the DT at CAS time, I think we mightd be able toy
> handle this by converting such devices to "cold plugged" at CAS time
> (similarly to what we already do at reset).  Since they're in the
> CAS-time DT which is what the OS will consume, cold plugged is
> effectively how the OS will see them.
> 

I have tried hacking around to achieve that. Basically calling
spapr_drc_reset() for all DRCs for which spapr_drc_needed()
returns true.

> A remaining problem is that new PCI devices won't get BARs assigned by
> SLOF in this scenario.  We'll probably get away with it because of the
> linux,pci-probe-only property, but I don't know we want to rely on

We currently only expose this property for pseries-4.2 and newer
machine types... this could be a problem.

> that.  PCI bridges hotplugged introduce further complications, because
> they won't get enumerated.
> 
> > > > But I agree the other check is more theoretical:
> > > > 
> > > >     /* capabilities that have been added since CAS-generated guest reset.
> > > >      * if capabilities have since been removed, generate another reset
> > > >      */
> > > >     spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas);
> > > > 
> > > > Unless changing kernels or tempering with the kernel command line, I don't
> > > > see how some capabilities could change between the two CAS in practice.
> > > 
> > > Well, we want to be robust and it's at least theoretically possible
> > > that the guest will request different things on subsequent reboots.
> > 
> > Yes but in the latter case a full machine reset occurs and
> 
> Not necessarily.  A guest could ask for something on one CAS cycle,
> then reject it on another, without doing a full reboot.  It'd be a
> pointless thing for the guest to do, but it's possible.
> 

Ok, I was asking later on if we want to support the scenario of
multiple CAS without an intermediate full reboot. I now have the
answer :)

> > spapr->ov5_cas gets cleared, ie. spapr_ovec_subset() returns
> > true in the check above no matter what.
> 
> Well, also it could happen if the guest rejects something we put in
> the initial value of ov5_cas (which is populated from spapr->ov5, so
> it's not entirely empty).
> 

AFAICT the initial value of ov5_cas after a full machine reset is
all zeroes until CAS does:

    /* full range of negotiated ov5 capabilities */
    spapr_ovec_intersect(spapr->ov5_cas, spapr->ov5, ov5_guest);

which is done between:

    ov5_cas_old = spapr_ovec_clone(spapr->ov5_cas);

and

    spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas);

So I don't quite understand how ov5_cas_old, ie. spapr->ov5_cas at the
time the guest calls CAS, could have an "initial value not entirely
empty"... This can only happen if the guest calls CAS several times
without doing a full reboot in between. My initial thought was to
refuse this scenario and fail any subsequent CAS attempt made by
the guest before a full reboot.

> > > However I believe that the original rationale for this check was that
> > > while we could add things to the device tree for added capabilities,
> > > we didn't have a way to roll back the changes for removed
> > > capabilities.
> > > 
> > 
> > IIUC this is specifically for "removed capabilities since last
> > CAS". This can happen if:
> > 1) we're already processing a CAS reboot or,
> > 2) a freshly rebooted guest calls CAS twice without being rebooted
> >    in between.
> > 
> > Since a freshly booted or rebooted guest can only trigger a CAS
> > reboot because of a "hotplug-before-CAS", if we manage to get rid
> > of this limitation, 1) cannot happen anymore.
> > 
> > The linux kernel seems to be only calling "ibm,client-architecture-support"
> > once during early boot so 2) should _never_ happen. Do we care to support
> > this scenario anyway ?
> 
> I think you've missed some things in your reasoning.  But it doesn't
> really matter because the full dt rebuilt should handle it anyway.  I
> have a draft patch which removes this cause for CAS reboots.
> 
> Still grappling with the hotplug-before-CAS case.
> 

Same here actually. I was struggling with SLOF to have it create new nodes
for hotplugged-before-CAS devices without crashing :-\

I think I'll wait for your patches to arrive :) Please Cc: me.

> > > Now that we fully rebuild the device tree at CAS, I think this test
> > > can probably just go, although there's some double checking to do.
> > > 
> > 
> > I tend to agree.
>
Greg Kurz Jan. 22, 2020, 12:47 p.m. UTC | #21
On Wed, 15 Jan 2020 19:10:47 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> On 1/15/20 6:48 PM, Greg Kurz wrote:
> > Migration can potentially race with CAS reboot. If the migration thread
> > completes migration after CAS has set spapr->cas_reboot but before the
> > mainloop could pick up the reset request and reset the machine, the
> > guest is migrated unrebooted and the destination doesn't reboot it
> > either because it isn't aware a CAS reboot was needed (eg, because a
> > device was added before CAS). This likely result in a broken or hung
> > guest.
> > 
> > Even if it is small, the window between CAS and CAS reboot is enough to
> > re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > subsection for that and always send it when a CAS reboot is pending.
> > This may cause migration to older QEMUs to fail but it is still better
> > than end up with a broken guest.
> > 
> > The destination cannot honour the CAS reboot request from a post load
> > handler because this must be done after the guest is fully restored.
> > It is thus done from a VM change state handler.
> > 
> > Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> > Signed-off-by: Greg Kurz <groug@kaod.org>
> 
> Cédric Le Goater <clg@kaod.org>
> 

I guess you mean:

Reviewed-by: Cédric Le Goater <clg@kaod.org>

?

> Nice work ! That was quite complex to catch !
> 
> Thanks,
> 
> C.
> 
> > ---
> > 
> > This patch is supposed to fix the interrupt controller mode inconsistency
> > between QEMU and the guest reported in this BZ:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1781315 (requires auth)
> > 
> > Even if interrupt controller selection doesn't involve CAS reboot anymore,
> > we still have other conditions that require CAS reboot.
> > ---
> >  hw/ppc/spapr.c |   43 +++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 43 insertions(+)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 30a5fbd3bea6..bf2763aa16e5 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -1959,6 +1959,31 @@ static const VMStateDescription vmstate_spapr_dtb = {
> >      },
> >  };
> >  
> > +static bool spapr_cas_reboot_needed(void *opaque)
> > +{
> > +    SpaprMachineState *spapr = SPAPR_MACHINE(opaque);
> > +
> > +    /*
> > +     * This causes the "spapr_cas_reboot" subsection to always be
> > +     * sent if migration raced with CAS. This causes older QEMUs
> > +     * that don't know about the subsection to fail migration but
> > +     * it is still better than end up with a broken guest on the
> > +     * destination.
> > +     */
> > +    return spapr->cas_reboot;
> > +}
> > +
> > +static const VMStateDescription vmstate_spapr_cas_reboot = {
> > +    .name = "spapr_cas_reboot",
> > +    .version_id = 1,
> > +    .minimum_version_id = 1,
> > +    .needed = spapr_cas_reboot_needed,
> > +    .fields = (VMStateField[]) {
> > +        VMSTATE_BOOL(cas_reboot, SpaprMachineState),
> > +        VMSTATE_END_OF_LIST()
> > +    },
> > +};
> > +
> >  static const VMStateDescription vmstate_spapr = {
> >      .name = "spapr",
> >      .version_id = 3,
> > @@ -1992,6 +2017,7 @@ static const VMStateDescription vmstate_spapr = {
> >          &vmstate_spapr_dtb,
> >          &vmstate_spapr_cap_large_decr,
> >          &vmstate_spapr_cap_ccf_assist,
> > +        &vmstate_spapr_cas_reboot,
> >          NULL
> >      }
> >  };
> > @@ -2577,6 +2603,21 @@ static PCIHostState *spapr_create_default_phb(void)
> >      return PCI_HOST_BRIDGE(dev);
> >  }
> >  
> > +static void spapr_change_state_handler(void *opaque, int running,
> > +                                       RunState state)
> > +{
> > +    SpaprMachineState *spapr = opaque;
> > +
> > +    if (running && spapr->cas_reboot) {
> > +        /*
> > +         * This happens when resuming from migration if the source
> > +         * processed a CAS but didn't have time to trigger the CAS
> > +         * reboot. Do it now.
> > +         */
> > +        qemu_system_reset_request(SHUTDOWN_CAUSE_SUBSYSTEM_RESET);
> > +    }
> > +}
> > +
> >  /* pSeries LPAR / sPAPR hardware init */
> >  static void spapr_machine_init(MachineState *machine)
> >  {
> > @@ -2970,6 +3011,8 @@ static void spapr_machine_init(MachineState *machine)
> >  
> >          kvmppc_spapr_enable_inkernel_multitce();
> >      }
> > +
> > +    qemu_add_vm_change_state_handler(spapr_change_state_handler, spapr);
> >  }
> >  
> >  static int spapr_kvm_type(MachineState *machine, const char *vm_type)
> > 
>
Cédric Le Goater Jan. 22, 2020, 2:08 p.m. UTC | #22
On 1/22/20 1:47 PM, Greg Kurz wrote:
> On Wed, 15 Jan 2020 19:10:47 +0100
> Cédric Le Goater <clg@kaod.org> wrote:
> 
>> On 1/15/20 6:48 PM, Greg Kurz wrote:
>>> Migration can potentially race with CAS reboot. If the migration thread
>>> completes migration after CAS has set spapr->cas_reboot but before the
>>> mainloop could pick up the reset request and reset the machine, the
>>> guest is migrated unrebooted and the destination doesn't reboot it
>>> either because it isn't aware a CAS reboot was needed (eg, because a
>>> device was added before CAS). This likely result in a broken or hung
>>> guest.
>>>
>>> Even if it is small, the window between CAS and CAS reboot is enough to
>>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
>>> subsection for that and always send it when a CAS reboot is pending.
>>> This may cause migration to older QEMUs to fail but it is still better
>>> than end up with a broken guest.
>>>
>>> The destination cannot honour the CAS reboot request from a post load
>>> handler because this must be done after the guest is fully restored.
>>> It is thus done from a VM change state handler.
>>>
>>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
>>> Signed-off-by: Greg Kurz <groug@kaod.org>
>>
>> Cédric Le Goater <clg@kaod.org>
>>
> 
> I guess you mean:
>
> Reviewed-by: Cédric Le Goater <clg@kaod.org>

Yes.

   Reviewed-by: Cédric Le Goater <clg@kaod.org>

Since keyconfig was disabled in thunderbird, I have been adding
tags manually ...

C.
David Gibson Jan. 23, 2020, 5:08 a.m. UTC | #23
On Wed, Jan 22, 2020 at 11:06:18AM +0100, Greg Kurz wrote:
> On Wed, 22 Jan 2020 17:50:28 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Jan 21, 2020 at 10:32:55AM +0100, Greg Kurz wrote:
> > > On Tue, 21 Jan 2020 14:43:32 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > 
> > > > On Mon, Jan 20, 2020 at 09:04:38AM +0100, Greg Kurz wrote:
> > > > > On Fri, 17 Jan 2020 16:44:27 +0100
> > > > > Greg Kurz <groug@kaod.org> wrote:
> > > > > 
> > > > > > On Fri, 17 Jan 2020 19:16:08 +1000
> > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > 
> > > > > > > On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote:
> > > > > > > > On Thu, 16 Jan 2020 13:14:35 +0100
> > > > > > > > Greg Kurz <groug@kaod.org> wrote:
> > > > > > > > 
> > > > > > > > > On Thu, 16 Jan 2020 11:37:24 +0100
> > > > > > > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > > > > > > 
> > > > > > > > > > On 16/01/2020 09:48, Greg Kurz wrote:
> > > > > > > > > > > On Wed, 15 Jan 2020 19:10:37 +0100
> > > > > > > > > > > Laurent Vivier <lvivier@redhat.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > > >> Hi,
> > > > > > > > > > >>
> > > > > > > > > > >> On 15/01/2020 18:48, Greg Kurz wrote:
> > > > > > > > > > >>> Migration can potentially race with CAS reboot. If the migration thread
> > > > > > > > > > >>> completes migration after CAS has set spapr->cas_reboot but before the
> > > > > > > > > > >>> mainloop could pick up the reset request and reset the machine, the
> > > > > > > > > > >>> guest is migrated unrebooted and the destination doesn't reboot it
> > > > > > > > > > >>> either because it isn't aware a CAS reboot was needed (eg, because a
> > > > > > > > > > >>> device was added before CAS). This likely result in a broken or hung
> > > > > > > > > > >>> guest.
> > > > > > > > > > >>>
> > > > > > > > > > >>> Even if it is small, the window between CAS and CAS reboot is enough to
> > > > > > > > > > >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > > > > > > > > > >>> subsection for that and always send it when a CAS reboot is pending.
> > > > > > > > > > >>> This may cause migration to older QEMUs to fail but it is still better
> > > > > > > > > > >>> than end up with a broken guest.
> > > > > > > > > > >>>
> > > > > > > > > > >>> The destination cannot honour the CAS reboot request from a post load
> > > > > > > > > > >>> handler because this must be done after the guest is fully restored.
> > > > > > > > > > >>> It is thus done from a VM change state handler.
> > > > > > > > > > >>>
> > > > > > > > > > >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> > > > > > > > > > >>> Signed-off-by: Greg Kurz <groug@kaod.org>
> > > > > > > > > > >>> ---
> > > > > > > > > > >>>
> > > > > > > > > > >>
> > > > > > > > > > >> I'm wondering if the problem can be related with the fact that
> > > > > > > > > > >> main_loop_should_exit() could release qemu_global_mutex in
> > > > > > > > > > >> pause_all_vcpus() in the reset case?
> > > > > > > > > > >>
> > > > > > > > > > >> 1602 static bool main_loop_should_exit(void)
> > > > > > > > > > >> 1603 {
> > > > > > > > > > >> ...
> > > > > > > > > > >> 1633     request = qemu_reset_requested();
> > > > > > > > > > >> 1634     if (request) {
> > > > > > > > > > >> 1635         pause_all_vcpus();
> > > > > > > > > > >> 1636         qemu_system_reset(request);
> > > > > > > > > > >> 1637         resume_all_vcpus();
> > > > > > > > > > >> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> > > > > > > > > > >> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> > > > > > > > > > >> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> > > > > > > > > > >> 1641         }
> > > > > > > > > > >> 1642     }
> > > > > > > > > > >> ...
> > > > > > > > > > >>
> > > > > > > > > > >> I already sent a patch for this kind of problem (in current Juan pull
> > > > > > > > > > >> request):
> > > > > > > > > > >>
> > > > > > > > > > >> "runstate: ignore finishmigrate -> prelaunch transition"
> > > > > > > > > > >>
> > > > > > > > > > > 
> > > > > > > > > > > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
> > > > > > > > > > > transition that can happen if the migration thread sets the runstate to
> > > > > > > > > > > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex.
> > > > > > > > > > > 
> > > > > > > > > > > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm
> > > > > > > > > > > trying to fix is a guest breakage caused by a discrepancy between
> > > > > > > > > > > QEMU and the guest after migration has succeeded.
> > > > > > > > > > > 
> > > > > > > > > > >> but I don't know if it could fix this one.
> > > > > > > > > > >>
> > > > > > > > > > > 
> > > > > > > > > > > I don't think so and your patch kinda illustrates it. If the runstate
> > > > > > > > > > > is 'finishmigrate' when returning from pause_all_vcpus(), this means
> > > > > > > > > > > that state was sent to the destination before we could actually reset
> > > > > > > > > > > the machine.
> > > > > > > > > > 
> > > > > > > > > > Yes, you're right.
> > > > > > > > > > 
> > > > > > > > > > But the question behind my comment was: is it expected to have a pending
> > > > > > > > > > reset while we are migrating?
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Nothing prevents qemu_system_reset_request() to be called when migration
> > > > > > > > > is active. 
> > > > > > > > > 
> > > > > > > > > > Perhaps H_CAS can return H_BUSY and wait the end of the migration and
> > > > > > > > > > then be fully executed on destination?
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > And so we would need to teach SLOF to try H_CAS again until it stops
> > > > > > > > > returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Ok I've tried that with a patched SLOF that sleeps 500ms and tries CAS
> > > > > > > > again if H_BUSY was returned. It fixes the issue but it looks a bit
> > > > > > > > ugly because of the polling with an arbitrary timeout in SLOF... I'm
> > > > > > > > not very comfortable either with calling migration_is_active() from
> > > > > > > > the CAS code in QEMU.
> > > > > > > > 
> > > > > > > > David,
> > > > > > > > 
> > > > > > > > Any suggestion ?
> > > > > > > 
> > > > > > > Yeah, I think looping in SLOF is a worse idea than migrating the
> > > > > > > cas_reboot flag.
> > > > > > > 
> > > > > > > But.. a better solution still might be to just remove the remaining
> > > > > > > causes for CAS reboot entirely.  CAS reboots pretty much suck when
> > > > > > > they happen, anyway.
> > > > > > > 
> > > > > > 
> > > > > > I Agree.
> > > > > > 
> > > > > > > With the irq changeover condition removed, I think the remaining
> > > > > > > causes are more theoretical than practical situations at this point.
> > > > > > > 
> > > > > > 
> > > > > > FWIW, hotpluggging a PCI device before CAS result in a hung guest (not yet
> > > > > > investigated the details).
> > > > > 
> > > > > commit 10f12e6450407b18b4d5a6b50d3852dcfd7fff75
> > > > > Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
> > > > > Date:   Wed Aug 30 15:21:41 2017 -0300
> > > > > 
> > > > >     hw/ppc: CAS reset on early device hotplug
> > > > > 
> > > > > I'll have a look to see what can be done here.
> > > > 
> > > > Ah.. yes, that one might be a bit tricky.
> > > > 
> > > 
> > > So far it seems to be related to SLOF not being able to create
> > > new nodes in the DT when parsing the FDT returned by CAS. SLOF
> > > stops the parsing and returns an error. The guest ends up with
> > > a broken DT and eventually hangs later (in my case the kernel
> > > believes it is going to do hash while radix was negotiated with
> > > QEMU). I need to dig some more.
> > 
> > Uh... I don't think this is right.  I'm pretty sure SLOF *does* create
> > new nodes when parsing the CAS FDT, it will need to for
> > "ibm,dynamic-reconfiguration-memory" at least.
> > 
> 
> It can create "memory@" or "ibm,dynamic-reconfiguration-memory" nodes but
> explicitly rejects all others.

Huh.  Well that's certainly not correct now that we're doing a full
tree rebuild.

> > I've done some looking and I think the actual reasons here are a bit
> > more complicated (but also possibly easier to handle).
> > 
> >   1. We can't send hotplug events to the guest until after CAS,
> >      because before then we don't know if it can process them, or if
> >      it can, which interrupt it uses for them.
> > 
> >   2. Queueing hotplug events across CAS for delivery afterwards
> >      introduces other complications
> > 
> >   3. We need to make sure that each device appears exactly once in
> >      either the  initial device tree that the guest OS sees, *or* in a
> >      hotplug event, not both or neither.
> > 
> > Now that we rebuild the DT at CAS time, I think we mightd be able toy
> > handle this by converting such devices to "cold plugged" at CAS time
> > (similarly to what we already do at reset).  Since they're in the
> > CAS-time DT which is what the OS will consume, cold plugged is
> > effectively how the OS will see them.
> > 
> 
> I have tried hacking around to achieve that. Basically calling
> spapr_drc_reset() for all DRCs for which spapr_drc_needed()
> returns true.
> 
> > A remaining problem is that new PCI devices won't get BARs assigned by
> > SLOF in this scenario.  We'll probably get away with it because of the
> > linux,pci-probe-only property, but I don't know we want to rely on
> 
> We currently only expose this property for pseries-4.2 and newer
> machine types... this could be a problem.

Right.  I don't like relying on it even for the newer machines, since
it presumably wouldn't work for FreeBSD or AIX.

> > that.  PCI bridges hotplugged introduce further complications, because
> > they won't get enumerated.

Alexey has done some testing and looks like there are already a bunch
of problems with PCI bridges hotplugged during SLOF execution.

> > 
> > > > > But I agree the other check is more theoretical:
> > > > > 
> > > > >     /* capabilities that have been added since CAS-generated guest reset.
> > > > >      * if capabilities have since been removed, generate another reset
> > > > >      */
> > > > >     spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas);
> > > > > 
> > > > > Unless changing kernels or tempering with the kernel command line, I don't
> > > > > see how some capabilities could change between the two CAS in practice.
> > > > 
> > > > Well, we want to be robust and it's at least theoretically possible
> > > > that the guest will request different things on subsequent reboots.
> > > 
> > > Yes but in the latter case a full machine reset occurs and
> > 
> > Not necessarily.  A guest could ask for something on one CAS cycle,
> > then reject it on another, without doing a full reboot.  It'd be a
> > pointless thing for the guest to do, but it's possible.
> > 
> 
> Ok, I was asking later on if we want to support the scenario of
> multiple CAS without an intermediate full reboot. I now have the
> answer :)

Well, "want" might be a bit strong.  PAPR allows for the possibility
so we're trying to support it at this stage.  Obviously this could
happen relatively easily with the hotplug-before-CAS cause.

> > > spapr->ov5_cas gets cleared, ie. spapr_ovec_subset() returns
> > > true in the check above no matter what.
> > 
> > Well, also it could happen if the guest rejects something we put in
> > the initial value of ov5_cas (which is populated from spapr->ov5, so
> > it's not entirely empty).
> > 
> 
> AFAICT the initial value of ov5_cas after a full machine reset is
> all zeroes until CAS does:
> 
>     /* full range of negotiated ov5 capabilities */
>     spapr_ovec_intersect(spapr->ov5_cas, spapr->ov5, ov5_guest);
> 
> which is done between:
> 
>     ov5_cas_old = spapr_ovec_clone(spapr->ov5_cas);
> 
> and
> 
>     spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas);
> 
> So I don't quite understand how ov5_cas_old, ie. spapr->ov5_cas at the
> time the guest calls CAS, could have an "initial value not entirely
> empty"... This can only happen if the guest calls CAS several times
> without doing a full reboot in between. My initial thought was to
> refuse this scenario and fail any subsequent CAS attempt made by
> the guest before a full reboot.

Oh, sorry, my mistake.  I say the line:
      spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);

but only now realized that only happens in the qtest case.  Ok, yes,
looks like this could only happen in the case of multiple CAS reboots
in a row.

> 
> > > > However I believe that the original rationale for this check was that
> > > > while we could add things to the device tree for added capabilities,
> > > > we didn't have a way to roll back the changes for removed
> > > > capabilities.
> > > > 
> > > 
> > > IIUC this is specifically for "removed capabilities since last
> > > CAS". This can happen if:
> > > 1) we're already processing a CAS reboot or,
> > > 2) a freshly rebooted guest calls CAS twice without being rebooted
> > >    in between.
> > > 
> > > Since a freshly booted or rebooted guest can only trigger a CAS
> > > reboot because of a "hotplug-before-CAS", if we manage to get rid
> > > of this limitation, 1) cannot happen anymore.
> > > 
> > > The linux kernel seems to be only calling "ibm,client-architecture-support"
> > > once during early boot so 2) should _never_ happen. Do we care to support
> > > this scenario anyway ?
> > 
> > I think you've missed some things in your reasoning.  But it doesn't
> > really matter because the full dt rebuilt should handle it anyway.  I
> > have a draft patch which removes this cause for CAS reboots.
> > 
> > Still grappling with the hotplug-before-CAS case.
> > 
> 
> Same here actually. I was struggling with SLOF to have it create new nodes
> for hotplugged-before-CAS devices without crashing :-\
> 
> I think I'll wait for your patches to arrive :) Please Cc: me.
> 
> > > > Now that we fully rebuild the device tree at CAS, I think this test
> > > > can probably just go, although there's some double checking to do.
> > > > 
> > > 
> > > I tend to agree.
> > 
>
diff mbox series

Patch

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 30a5fbd3bea6..bf2763aa16e5 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1959,6 +1959,31 @@  static const VMStateDescription vmstate_spapr_dtb = {
     },
 };
 
+static bool spapr_cas_reboot_needed(void *opaque)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(opaque);
+
+    /*
+     * This causes the "spapr_cas_reboot" subsection to always be
+     * sent if migration raced with CAS. This causes older QEMUs
+     * that don't know about the subsection to fail migration but
+     * it is still better than end up with a broken guest on the
+     * destination.
+     */
+    return spapr->cas_reboot;
+}
+
+static const VMStateDescription vmstate_spapr_cas_reboot = {
+    .name = "spapr_cas_reboot",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = spapr_cas_reboot_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_BOOL(cas_reboot, SpaprMachineState),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
 static const VMStateDescription vmstate_spapr = {
     .name = "spapr",
     .version_id = 3,
@@ -1992,6 +2017,7 @@  static const VMStateDescription vmstate_spapr = {
         &vmstate_spapr_dtb,
         &vmstate_spapr_cap_large_decr,
         &vmstate_spapr_cap_ccf_assist,
+        &vmstate_spapr_cas_reboot,
         NULL
     }
 };
@@ -2577,6 +2603,21 @@  static PCIHostState *spapr_create_default_phb(void)
     return PCI_HOST_BRIDGE(dev);
 }
 
+static void spapr_change_state_handler(void *opaque, int running,
+                                       RunState state)
+{
+    SpaprMachineState *spapr = opaque;
+
+    if (running && spapr->cas_reboot) {
+        /*
+         * This happens when resuming from migration if the source
+         * processed a CAS but didn't have time to trigger the CAS
+         * reboot. Do it now.
+         */
+        qemu_system_reset_request(SHUTDOWN_CAUSE_SUBSYSTEM_RESET);
+    }
+}
+
 /* pSeries LPAR / sPAPR hardware init */
 static void spapr_machine_init(MachineState *machine)
 {
@@ -2970,6 +3011,8 @@  static void spapr_machine_init(MachineState *machine)
 
         kvmppc_spapr_enable_inkernel_multitce();
     }
+
+    qemu_add_vm_change_state_handler(spapr_change_state_handler, spapr);
 }
 
 static int spapr_kvm_type(MachineState *machine, const char *vm_type)