Message ID | 157911051688.345768.16136592081655557565.stgit@bahia.lan (mailing list archive)
---|---
State | New, archived
Series | spapr: Migrate CAS reboot flag
Hi,

On 15/01/2020 18:48, Greg Kurz wrote:
> Migration can potentially race with CAS reboot. If the migration thread
> completes migration after CAS has set spapr->cas_reboot but before the
> mainloop could pick up the reset request and reset the machine, the
> guest is migrated unrebooted and the destination doesn't reboot it
> either because it isn't aware a CAS reboot was needed (eg, because a
> device was added before CAS). This likely results in a broken or hung
> guest.
>
> Even if it is small, the window between CAS and CAS reboot is enough to
> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> subsection for that and always send it when a CAS reboot is pending.
> This may cause migration to older QEMUs to fail but it is still better
> than ending up with a broken guest.
>
> The destination cannot honour the CAS reboot request from a post load
> handler because this must be done after the guest is fully restored.
> It is thus done from a VM change state handler.
>
> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> Signed-off-by: Greg Kurz <groug@kaod.org>
> ---

I'm wondering if the problem can be related to the fact that
main_loop_should_exit() could release qemu_global_mutex in
pause_all_vcpus() in the reset case?

1602 static bool main_loop_should_exit(void)
1603 {
...
1633     request = qemu_reset_requested();
1634     if (request) {
1635         pause_all_vcpus();
1636         qemu_system_reset(request);
1637         resume_all_vcpus();
1638         if (!runstate_check(RUN_STATE_RUNNING) &&
1639             !runstate_check(RUN_STATE_INMIGRATE)) {
1640             runstate_set(RUN_STATE_PRELAUNCH);
1641         }
1642     }
...

I already sent a patch for this kind of problem (in Juan's current pull
request):

"runstate: ignore finishmigrate -> prelaunch transition"

but I don't know if it could fix this one.

Thanks,
Laurent
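For context, the fix Laurent refers to guards the runstate_set() call in the excerpt above against the migration thread having already moved the runstate to 'finish-migrate'. A minimal sketch of that idea, assuming the patch simply adds a third runstate_check() to the quoted condition (the exact merged form may differ):

    /* Sketch of "runstate: ignore finishmigrate -> prelaunch transition",
     * applied to the main_loop_should_exit() excerpt quoted above.
     * Assumption: the fix adds a RUN_STATE_FINISH_MIGRATE check so the
     * main loop doesn't clobber the runstate chosen by the migration
     * thread after it has stopped the vCPUs. */
    request = qemu_reset_requested();
    if (request) {
        pause_all_vcpus();
        qemu_system_reset(request);
        resume_all_vcpus();
        if (!runstate_check(RUN_STATE_RUNNING) &&
            !runstate_check(RUN_STATE_INMIGRATE) &&
            !runstate_check(RUN_STATE_FINISH_MIGRATE)) {   /* added guard */
            runstate_set(RUN_STATE_PRELAUNCH);
        }
    }

As the rest of the thread works out, this guard avoids an invalid runstate transition (and the resulting abort) but cannot help the CAS race itself, since the machine state has already been streamed out by the time the reset runs.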
On 1/15/20 6:48 PM, Greg Kurz wrote:
> Migration can potentially race with CAS reboot. If the migration thread
> completes migration after CAS has set spapr->cas_reboot but before the
> mainloop could pick up the reset request and reset the machine, the
> guest is migrated unrebooted and the destination doesn't reboot it
> either because it isn't aware a CAS reboot was needed (eg, because a
> device was added before CAS). [...]
>
> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
> Signed-off-by: Greg Kurz <groug@kaod.org>

Reviewed-by: Cédric Le Goater <clg@kaod.org>

Nice work ! That was quite complex to catch !

Thanks,

C.

> ---
>
> This patch is supposed to fix the interrupt controller mode inconsistency
> between QEMU and the guest reported in this BZ:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1781315 (requires auth)
>
> Even if interrupt controller selection doesn't involve CAS reboot anymore,
> we still have other conditions that require CAS reboot.
> ---
>  hw/ppc/spapr.c |   43 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 30a5fbd3bea6..bf2763aa16e5 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1959,6 +1959,31 @@ static const VMStateDescription vmstate_spapr_dtb = {
>      },
>  };
>  
> +static bool spapr_cas_reboot_needed(void *opaque)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(opaque);
> +
> +    /*
> +     * This causes the "spapr_cas_reboot" subsection to always be
> +     * sent if migration raced with CAS. This causes older QEMUs
> +     * that don't know about the subsection to fail migration but
> +     * it is still better than end up with a broken guest on the
> +     * destination.
> +     */
> +    return spapr->cas_reboot;
> +}
> +
> +static const VMStateDescription vmstate_spapr_cas_reboot = {
> +    .name = "spapr_cas_reboot",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = spapr_cas_reboot_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_BOOL(cas_reboot, SpaprMachineState),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
>  static const VMStateDescription vmstate_spapr = {
>      .name = "spapr",
>      .version_id = 3,
> @@ -1992,6 +2017,7 @@ static const VMStateDescription vmstate_spapr = {
>          &vmstate_spapr_dtb,
>          &vmstate_spapr_cap_large_decr,
>          &vmstate_spapr_cap_ccf_assist,
> +        &vmstate_spapr_cas_reboot,
>          NULL
>      }
>  };
> @@ -2577,6 +2603,21 @@ static PCIHostState *spapr_create_default_phb(void)
>      return PCI_HOST_BRIDGE(dev);
>  }
>  
> +static void spapr_change_state_handler(void *opaque, int running,
> +                                       RunState state)
> +{
> +    SpaprMachineState *spapr = opaque;
> +
> +    if (running && spapr->cas_reboot) {
> +        /*
> +         * This happens when resuming from migration if the source
> +         * processed a CAS but didn't have time to trigger the CAS
> +         * reboot. Do it now.
> +         */
> +        qemu_system_reset_request(SHUTDOWN_CAUSE_SUBSYSTEM_RESET);
> +    }
> +}
> +
>  /* pSeries LPAR / sPAPR hardware init */
>  static void spapr_machine_init(MachineState *machine)
>  {
> @@ -2970,6 +3011,8 @@ static void spapr_machine_init(MachineState *machine)
>  
>          kvmppc_spapr_enable_inkernel_multitce();
>      }
> +
> +    qemu_add_vm_change_state_handler(spapr_change_state_handler, spapr);
>  }
>  
>  static int spapr_kvm_type(MachineState *machine, const char *vm_type)
On 15/01/2020 19:10, Laurent Vivier wrote:
> Hi,
>
> On 15/01/2020 18:48, Greg Kurz wrote:
>> Migration can potentially race with CAS reboot. [...]
>
> I'm wondering if the problem can be related to the fact that
> main_loop_should_exit() could release qemu_global_mutex in
> pause_all_vcpus() in the reset case?
> [...]
> I already sent a patch for this kind of problem (in Juan's current pull
> request):
>
> "runstate: ignore finishmigrate -> prelaunch transition"
>
> but I don't know if it could fix this one.

I think it would be interesting to have the state transitions on the source
and the destination when the problem occurs (with something like "-trace
runstate_set").

Thanks,
Laurent
On Wed, 15 Jan 2020 19:10:37 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> Hi,
>
> On 15/01/2020 18:48, Greg Kurz wrote:
>> Migration can potentially race with CAS reboot. [...]
>
> I'm wondering if the problem can be related to the fact that
> main_loop_should_exit() could release qemu_global_mutex in
> pause_all_vcpus() in the reset case?
> [...]
> I already sent a patch for this kind of problem (in Juan's current pull
> request):
>
> "runstate: ignore finishmigrate -> prelaunch transition"

IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
transition that can happen if the migration thread sets the runstate to
'finishmigrate' when pause_all_vcpus() releases the main loop mutex,
i.e. the symptom of the problem is QEMU aborting, correct ? The issue I'm
trying to fix is a guest breakage caused by a discrepancy between
QEMU and the guest after migration has succeeded.

> but I don't know if it could fix this one.

I don't think so, and your patch kinda illustrates it. If the runstate
is 'finishmigrate' when returning from pause_all_vcpus(), this means
that state was sent to the destination before we could actually reset
the machine.

> Thanks,
> Laurent
On 16/01/2020 09:48, Greg Kurz wrote:
> On Wed, 15 Jan 2020 19:10:37 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> [...]
> IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate
> transition that can happen if the migration thread sets the runstate to
> 'finishmigrate' when pause_all_vcpus() releases the main loop mutex.
> [...]
> I don't think so, and your patch kinda illustrates it. If the runstate
> is 'finishmigrate' when returning from pause_all_vcpus(), this means
> that state was sent to the destination before we could actually reset
> the machine.

Yes, you're right.

But the question behind my comment was: is it expected to have a pending
reset while we are migrating?

Perhaps H_CAS can return H_BUSY and wait for the end of the migration, and
then be fully executed on the destination?

Thanks,
Laurent
On Thu, 16 Jan 2020 11:37:24 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 16/01/2020 09:48, Greg Kurz wrote:
> [...]
> Yes, you're right.
>
> But the question behind my comment was: is it expected to have a pending
> reset while we are migrating?

Nothing prevents qemu_system_reset_request() from being called when migration
is active.

> Perhaps H_CAS can return H_BUSY and wait for the end of the migration, and
> then be fully executed on the destination?

And so we would need to teach SLOF to try H_CAS again until it stops
returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO.

> Thanks,
> Laurent
On Thu, 16 Jan 2020 13:14:35 +0100
Greg Kurz <groug@kaod.org> wrote:

> On Thu, 16 Jan 2020 11:37:24 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> [...]
>> But the question behind my comment was: is it expected to have a pending
>> reset while we are migrating?
>
> Nothing prevents qemu_system_reset_request() from being called when
> migration is active.
>
>> Perhaps H_CAS can return H_BUSY and wait for the end of the migration, and
>> then be fully executed on the destination?
>
> And so we would need to teach SLOF to try H_CAS again until it stops
> returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO.

Ok, I've tried that with a patched SLOF that sleeps 500ms and tries CAS
again if H_BUSY was returned. It fixes the issue but it looks a bit
ugly because of the polling with an arbitrary timeout in SLOF... I'm
not very comfortable either with calling migration_is_active() from
the CAS code in QEMU.

David,

Any suggestion ?

> Thanks,
> Laurent
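For reference, the QEMU-side half of the experiment Greg describes would presumably look something like the sketch below. h_client_architecture_support(), migration_is_active() and the H_BUSY/H_SUCCESS codes are real QEMU symbols, but this early-return wiring is only an illustration of the rejected idea, not merged code, and the migration_is_active() call signature is an assumption:

    /* Sketch only: make H_CAS punt while migration is in flight, so SLOF
     * would have to sleep and retry until H_BUSY goes away. Not the
     * actual implementation. */
    static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
                                                      SpaprMachineState *spapr,
                                                      target_ulong opcode,
                                                      target_ulong *args)
    {
        /* Assumed wiring of migration_is_active() into the hcall path */
        if (migration_is_active(migrate_get_current())) {
            return H_BUSY;
        }

        /* ... normal CAS processing: parse ov5, rebuild the device tree,
         * possibly set spapr->cas_reboot ... */
        return H_SUCCESS;
    }

The thread's objection is visible here: the retry loop would live in guest firmware with an arbitrary timeout, and the hypercall path would grow a dependency on migration internals.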
On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote:
> On Thu, 16 Jan 2020 13:14:35 +0100
> Greg Kurz <groug@kaod.org> wrote:
> [...]
>> And so we would need to teach SLOF to try H_CAS again until it stops
>> returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO.
>
> Ok, I've tried that with a patched SLOF that sleeps 500ms and tries CAS
> again if H_BUSY was returned. It fixes the issue but it looks a bit
> ugly because of the polling with an arbitrary timeout in SLOF... I'm
> not very comfortable either with calling migration_is_active() from
> the CAS code in QEMU.
>
> David,
>
> Any suggestion ?

Yeah, I think looping in SLOF is a worse idea than migrating the
cas_reboot flag.

But.. a better solution still might be to just remove the remaining
causes for CAS reboot entirely. CAS reboots pretty much suck when
they happen, anyway.

With the irq changeover condition removed, I think the remaining
causes are more theoretical than practical situations at this point.
On Wed, 15 Jan 2020 19:26:18 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 15/01/2020 19:10, Laurent Vivier wrote:
> [...]
>> but I don't know if it could fix this one.
>
> I think it would be interesting to have the state transitions on the source
> and the destination when the problem occurs (with something like "-trace
> runstate_set").

With "-serial mon:stdio -trace runstate_set -trace guest_cpu_reset" :

OF stdout device is: /vdevice/vty@71000000
Preparing to boot Linux version 4.18.0-80.el8.ppc64le (mockbuild@ppc-061.build.eng.bos.redhat.com) (gcc version 8.2.1 20180905 (Red Hat 8.2.1-3) (GCC)) #1 SMP Wed Mar 13 11:26:21 UTC 2019
Detected machine type: 0000000000000101
command line: BOOT_IMAGE=/boot/vmlinuz-4.18.0-80.el8.ppc64le root=UUID=012b83a5-2594-48ac-b936-12fec7cdbb9a ro console=ttyS0 console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=auto
Max number of cores passed to firmware: 2048 (NR_CPUS = 2048)
Calling ibm,client-architecture-support.

Migration starts here.

..qemu-system-ppc64: warning: kernel_irqchip allowed but unavailable: IRQ_XIVE capability must be present for KVM
Falling back to kernel-irqchip=off

This ^^ indicates that CAS was called and switched to XIVE, for which
we lack proper KVM support on GA boston machines.

23348@1579260982.315795:runstate_set current_run_state 9 (running) new_state 7 (finish-migrate)
23348@1579260982.360821:runstate_set current_run_state 7 (finish-migrate) new_state 5 (postmigrate)

The migration thread is holding the global QEMU mutex at this point. It
has stopped all CPUs. It now streams the full state to the destination
before releasing the mutex.

23340@1579260982.797279:guest_cpu_reset cpu=0xf9dbb48a5e0
23340@1579260982.797319:guest_cpu_reset cpu=0xf9dbb4d56a0

The main loop regained control and could process the CAS reboot request,
but it is too late...

23340@1579260982.866071:runstate_set current_run_state 5 (postmigrate) new_state 6 (prelaunch)

> Thanks,
> Laurent
On 17/01/2020 12:49, Greg Kurz wrote:
> On Wed, 15 Jan 2020 19:26:18 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> [...]
>> I think it would be interesting to have the state transitions on the source
>> and the destination when the problem occurs (with something like "-trace
>> runstate_set").
>
> With "-serial mon:stdio -trace runstate_set -trace guest_cpu_reset" :
> [...]
> 23348@1579260982.315795:runstate_set current_run_state 9 (running) new_state 7 (finish-migrate)
> 23348@1579260982.360821:runstate_set current_run_state 7 (finish-migrate) new_state 5 (postmigrate)
>
> The migration thread is holding the global QEMU mutex at this point. It
> has stopped all CPUs. It now streams the full state to the destination
> before releasing the mutex.
>
> 23340@1579260982.797279:guest_cpu_reset cpu=0xf9dbb48a5e0
> 23340@1579260982.797319:guest_cpu_reset cpu=0xf9dbb4d56a0
>
> The main loop regained control and could process the CAS reboot request,
> but it is too late...
>
> 23340@1579260982.866071:runstate_set current_run_state 5 (postmigrate) new_state 6 (prelaunch)

Thank you Greg.

So I think the best we can do is to migrate cas_reboot.

Delaying the H_CAS call would be cleaner, but I don't know if H_BUSY is a
valid return code, and it would force us to update SLOF too.

Reviewed-by: Laurent Vivier <lvivier@redhat.com>

Thanks,
Laurent
On Fri, 17 Jan 2020 19:16:08 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote:
> [...]
>> Ok, I've tried that with a patched SLOF that sleeps 500ms and tries CAS
>> again if H_BUSY was returned. [...]
>>
>> Any suggestion ?
>
> Yeah, I think looping in SLOF is a worse idea than migrating the
> cas_reboot flag.
>
> But.. a better solution still might be to just remove the remaining
> causes for CAS reboot entirely. CAS reboots pretty much suck when
> they happen, anyway.

I agree.

> With the irq changeover condition removed, I think the remaining
> causes are more theoretical than practical situations at this point.

FWIW, hotplugging a PCI device before CAS results in a hung guest (not yet
investigated the details).
On Fri, 17 Jan 2020 13:10:35 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 17/01/2020 12:49, Greg Kurz wrote:
> [...]
> Thank you Greg.
>
> So I think the best we can do is to migrate cas_reboot.
>
> Delaying the H_CAS call would be cleaner, but I don't know if H_BUSY is a
> valid return code, and it would force us to update SLOF too.

Since SLOF is currently the only user of H_CAS, I guess we have some
flexibility with the valid return codes... but anyway, David doesn't like
the idea :)

> Reviewed-by: Laurent Vivier <lvivier@redhat.com>
>
> Thanks,
> Laurent
On Fri, 17 Jan 2020 16:44:27 +0100
Greg Kurz <groug@kaod.org> wrote:

> On Fri, 17 Jan 2020 19:16:08 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> [...]
>> But.. a better solution still might be to just remove the remaining
>> causes for CAS reboot entirely. CAS reboots pretty much suck when
>> they happen, anyway.
>
> I agree.
>
>> With the irq changeover condition removed, I think the remaining
>> causes are more theoretical than practical situations at this point.
>
> FWIW, hotplugging a PCI device before CAS results in a hung guest (not yet
> investigated the details).

commit 10f12e6450407b18b4d5a6b50d3852dcfd7fff75
Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Date:   Wed Aug 30 15:21:41 2017 -0300

    hw/ppc: CAS reset on early device hotplug

I'll have a look to see what can be done here. But I agree the other check
is more theoretical:

    /* capabilities that have been added since CAS-generated guest reset.
     * if capabilities have since been removed, generate another reset
     */
    spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas);

Unless changing kernels or tampering with the kernel command line, I don't
see how some capabilities could change between the two CAS calls in practice.
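To make the quoted check concrete: spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas) asks whether every option negotiated at the previous CAS is still requested now, so the reboot only triggers when the guest drops a capability. A self-contained sketch of that logic over plain bitmasks follows; the real SpaprOptionVector is a bitmap type rather than a uint64_t, so the names and types here are purely illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative stand-in for spapr_ovec_subset(): ov_old is a subset
     * of ov_new iff no bit set in ov_old is clear in ov_new. */
    static bool ovec_subset(uint64_t ov_old, uint64_t ov_new)
    {
        return (ov_old & ~ov_new) == 0;
    }

    /* Mirrors the quoted line: reboot only when a previously negotiated
     * capability has since been removed by the guest. */
    static bool needs_cas_reboot(uint64_t ov5_cas_old, uint64_t ov5_cas_new)
    {
        return !ovec_subset(ov5_cas_old, ov5_cas_new);
    }

This matches the discussion: newly added capabilities never set cas_reboot; only a guest that stops requesting something it had before does.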
On Wed, Jan 15, 2020 at 07:10:47PM +0100, Cédric Le Goater wrote:
> On 1/15/20 6:48 PM, Greg Kurz wrote:
>> Migration can potentially race with CAS reboot. [...]
>>
>> Reported-by: Lukáš Doktor <ldoktor@redhat.com>
>> Signed-off-by: Greg Kurz <groug@kaod.org>
>
> Reviewed-by: Cédric Le Goater <clg@kaod.org>
>
> Nice work ! That was quite complex to catch !

It is a very nice analysis. However, I'm disinclined to merge this
for the time being.

My preferred approach would be to just eliminate CAS reboots
altogether, since that has other benefits. I'm feeling like this
isn't super-urgent, since CAS reboots are extremely rare in practice,
now that we've eliminated the one for the irq switchover.

However, if it's not looking like we'll be ready to do that as the
qemu-5.0 release approaches, then I'll be more than willing to
reconsider this.
On Mon, Jan 20, 2020 at 09:04:38AM +0100, Greg Kurz wrote:
> On Fri, 17 Jan 2020 16:44:27 +0100
> Greg Kurz <groug@kaod.org> wrote:
> [...]
>> FWIW, hotplugging a PCI device before CAS results in a hung guest (not yet
>> investigated the details).
>
> commit 10f12e6450407b18b4d5a6b50d3852dcfd7fff75
> Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
> Date:   Wed Aug 30 15:21:41 2017 -0300
>
>     hw/ppc: CAS reset on early device hotplug
>
> I'll have a look to see what can be done here.

Ah.. yes, that one might be a bit tricky.

> But I agree the other check is more theoretical:
>
>     /* capabilities that have been added since CAS-generated guest reset.
>      * if capabilities have since been removed, generate another reset
>      */
>     spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas);
>
> Unless changing kernels or tampering with the kernel command line, I don't
> see how some capabilities could change between the two CAS calls in practice.

Well, we want to be robust, and it's at least theoretically possible
that the guest will request different things on subsequent reboots.
However, I believe that the original rationale for this check was that
while we could add things to the device tree for added capabilities, we
didn't have a way to roll back the changes for removed capabilities.

Now that we fully rebuild the device tree at CAS, I think this test can
probably just go, although there's some double checking to do.
On 1/21/20 4:41 AM, David Gibson wrote: > On Wed, Jan 15, 2020 at 07:10:47PM +0100, Cédric Le Goater wrote: >> On 1/15/20 6:48 PM, Greg Kurz wrote: >>> Migration can potentially race with CAS reboot. If the migration thread >>> completes migration after CAS has set spapr->cas_reboot but before the >>> mainloop could pick up the reset request and reset the machine, the >>> guest is migrated unrebooted and the destination doesn't reboot it >>> either because it isn't aware a CAS reboot was needed (eg, because a >>> device was added before CAS). This likely result in a broken or hung >>> guest. >>> >>> Even if it is small, the window between CAS and CAS reboot is enough to >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new >>> subsection for that and always send it when a CAS reboot is pending. >>> This may cause migration to older QEMUs to fail but it is still better >>> than end up with a broken guest. >>> >>> The destination cannot honour the CAS reboot request from a post load >>> handler because this must be done after the guest is fully restored. >>> It is thus done from a VM change state handler. >>> >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com> >>> Signed-off-by: Greg Kurz <groug@kaod.org> >> >> Cédric Le Goater <clg@kaod.org> >> >> Nice work ! That was quite complex to catch ! > > It is a very nice analysis. However, I'm disinclined to merge this > for the time being. > > My preferred approach would be to just eliminate CAS reboots > altogether, since that has other benefits. I'm feeling like this > isn't super-urgent, since CAS reboots are extremely rare in practice, > now that we've eliminated the one for the irq switchover. Yes. The possibility of a migration in the window between CAS and CAS reboot must be even more rare. C. > However, if it's not looking like we'll be ready to do that as the > qemu-5.0 release approaches, then I'll be more than willing to > reconsider this.
On Tue, 21 Jan 2020 14:41:26 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Wed, Jan 15, 2020 at 07:10:47PM +0100, Cédric Le Goater wrote: > > On 1/15/20 6:48 PM, Greg Kurz wrote: > > > Migration can potentially race with CAS reboot. If the migration thread > > > completes migration after CAS has set spapr->cas_reboot but before the > > > mainloop could pick up the reset request and reset the machine, the > > > guest is migrated unrebooted and the destination doesn't reboot it > > > either because it isn't aware a CAS reboot was needed (eg, because a > > > device was added before CAS). This likely result in a broken or hung > > > guest. > > > > > > Even if it is small, the window between CAS and CAS reboot is enough to > > > re-qualify spapr->cas_reboot as state that we should migrate. Add a new > > > subsection for that and always send it when a CAS reboot is pending. > > > This may cause migration to older QEMUs to fail but it is still better > > > than end up with a broken guest. > > > > > > The destination cannot honour the CAS reboot request from a post load > > > handler because this must be done after the guest is fully restored. > > > It is thus done from a VM change state handler. > > > > > > Reported-by: Lukáš Doktor <ldoktor@redhat.com> > > > Signed-off-by: Greg Kurz <groug@kaod.org> > > > > Cédric Le Goater <clg@kaod.org> > > > > Nice work ! That was quite complex to catch ! > > It is a very nice analysis. However, I'm disinclined to merge this > for the time being. > > My preferred approach would be to just eliminate CAS reboots > altogether, since that has other benefits. I'm feeling like this > isn't super-urgent, since CAS reboots are extremely rare in practice, > now that we've eliminated the one for the irq switchover. > Yeah. The only _true_ need for CAS rebooting now seems to be hotplug before CAS, which is likely not something frequent. > However, if it's not looking like we'll be ready to do that as the > qemu-5.0 release approaches, then I'll be more than willing to > reconsider this. > I hope we can drop CAS reboot in time.
On Tue, 21 Jan 2020 14:43:32 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Mon, Jan 20, 2020 at 09:04:38AM +0100, Greg Kurz wrote: > > On Fri, 17 Jan 2020 16:44:27 +0100 > > Greg Kurz <groug@kaod.org> wrote: > > > > > On Fri, 17 Jan 2020 19:16:08 +1000 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote: > > > > > On Thu, 16 Jan 2020 13:14:35 +0100 > > > > > Greg Kurz <groug@kaod.org> wrote: > > > > > > > > > > > On Thu, 16 Jan 2020 11:37:24 +0100 > > > > > > Laurent Vivier <lvivier@redhat.com> wrote: > > > > > > > > > > > > > On 16/01/2020 09:48, Greg Kurz wrote: > > > > > > > > On Wed, 15 Jan 2020 19:10:37 +0100 > > > > > > > > Laurent Vivier <lvivier@redhat.com> wrote: > > > > > > > > > > > > > > > >> Hi, > > > > > > > >> > > > > > > > >> On 15/01/2020 18:48, Greg Kurz wrote: > > > > > > > >>> Migration can potentially race with CAS reboot. If the migration thread > > > > > > > >>> completes migration after CAS has set spapr->cas_reboot but before the > > > > > > > >>> mainloop could pick up the reset request and reset the machine, the > > > > > > > >>> guest is migrated unrebooted and the destination doesn't reboot it > > > > > > > >>> either because it isn't aware a CAS reboot was needed (eg, because a > > > > > > > >>> device was added before CAS). This likely result in a broken or hung > > > > > > > >>> guest. > > > > > > > >>> > > > > > > > >>> Even if it is small, the window between CAS and CAS reboot is enough to > > > > > > > >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new > > > > > > > >>> subsection for that and always send it when a CAS reboot is pending. > > > > > > > >>> This may cause migration to older QEMUs to fail but it is still better > > > > > > > >>> than end up with a broken guest. > > > > > > > >>> > > > > > > > >>> The destination cannot honour the CAS reboot request from a post load > > > > > > > >>> handler because this must be done after the guest is fully restored. > > > > > > > >>> It is thus done from a VM change state handler. > > > > > > > >>> > > > > > > > >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com> > > > > > > > >>> Signed-off-by: Greg Kurz <groug@kaod.org> > > > > > > > >>> --- > > > > > > > >>> > > > > > > > >> > > > > > > > >> I'm wondering if the problem can be related with the fact that > > > > > > > >> main_loop_should_exit() could release qemu_global_mutex in > > > > > > > >> pause_all_vcpus() in the reset case? > > > > > > > >> > > > > > > > >> 1602 static bool main_loop_should_exit(void) > > > > > > > >> 1603 { > > > > > > > >> ... > > > > > > > >> 1633 request = qemu_reset_requested(); > > > > > > > >> 1634 if (request) { > > > > > > > >> 1635 pause_all_vcpus(); > > > > > > > >> 1636 qemu_system_reset(request); > > > > > > > >> 1637 resume_all_vcpus(); > > > > > > > >> 1638 if (!runstate_check(RUN_STATE_RUNNING) && > > > > > > > >> 1639 !runstate_check(RUN_STATE_INMIGRATE)) { > > > > > > > >> 1640 runstate_set(RUN_STATE_PRELAUNCH); > > > > > > > >> 1641 } > > > > > > > >> 1642 } > > > > > > > >> ... 
> > > > > > > >> > > > > > > > >> I already sent a patch for this kind of problem (in current Juan pull > > > > > > > >> request): > > > > > > > >> > > > > > > > >> "runstate: ignore finishmigrate -> prelaunch transition" > > > > > > > >> > > > > > > > > > > > > > > > > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate > > > > > > > > transition that can happen if the migration thread sets the runstate to > > > > > > > > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex. > > > > > > > > > > > > > > > > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm > > > > > > > > trying to fix is a guest breakage caused by a discrepancy between > > > > > > > > QEMU and the guest after migration has succeeded. > > > > > > > > > > > > > > > >> but I don't know if it could fix this one. > > > > > > > >> > > > > > > > > > > > > > > > > I don't think so and your patch kinda illustrates it. If the runstate > > > > > > > > is 'finishmigrate' when returning from pause_all_vcpus(), this means > > > > > > > > that state was sent to the destination before we could actually reset > > > > > > > > the machine. > > > > > > > > > > > > > > Yes, you're right. > > > > > > > > > > > > > > But the question behind my comment was: is it expected to have a pending > > > > > > > reset while we are migrating? > > > > > > > > > > > > > > > > > > > Nothing prevents qemu_system_reset_request() to be called when migration > > > > > > is active. > > > > > > > > > > > > > Perhaps H_CAS can return H_BUSY and wait the end of the migration and > > > > > > > then be fully executed on destination? > > > > > > > > > > > > > > > > > > > And so we would need to teach SLOF to try H_CAS again until it stops > > > > > > returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO. > > > > > > > > > > > > > > > > Ok I've tried that with a patched SLOF that sleeps 500ms and tries CAS > > > > > again if H_BUSY was returned. It fixes the issue but it looks a bit > > > > > ugly because of the polling with an arbitrary timeout in SLOF... I'm > > > > > not very comfortable either with calling migration_is_active() from > > > > > the CAS code in QEMU. > > > > > > > > > > David, > > > > > > > > > > Any suggestion ? > > > > > > > > Yeah, I think looping in SLOF is a worse idea than migrating the > > > > cas_reboot flag. > > > > > > > > But.. a better solution still might be to just remove the remaining > > > > causes for CAS reboot entirely. CAS reboots pretty much suck when > > > > they happen, anyway. > > > > > > > > > > I Agree. > > > > > > > With the irq changeover condition removed, I think the remaining > > > > causes are more theoretical than practical situations at this point. > > > > > > > > > > FWIW, hotpluggging a PCI device before CAS result in a hung guest (not yet > > > investigated the details). > > > > commit 10f12e6450407b18b4d5a6b50d3852dcfd7fff75 > > Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com> > > Date: Wed Aug 30 15:21:41 2017 -0300 > > > > hw/ppc: CAS reset on early device hotplug > > > > I'll have a look to see what can be done here. > > Ah.. yes, that one might be a bit tricky. > So far it seems to be related to SLOF not being able to create new nodes in the DT when parsing the FDT returned by CAS. SLOF stops the parsing and returns an error. The guest ends up with a broken DT and eventually hangs later (in my case the kernel believes it is going to do hash while radix was negotiated with QEMU). I need to dig some more. 
The guest ends up with a broken DT and eventually hangs later (in my case the kernel believes it is going to do hash while radix was negotiated with QEMU). I need to dig some more. > > But I agree the other check is more theoretical: > > > > /* capabilities that have been added since CAS-generated guest reset. > > * if capabilities have since been removed, generate another reset > > */ > > spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas); > > > > Unless changing kernels or tempering with the kernel command line, I don't > > see how some capabilities could change between the two CAS in practice. > > Well, we want to be robust and it's at least theoretically possible > that the guest will request different things on subsequent reboots. Yes but in the latter case a full machine reset occurs and spapr->ov5_cas gets cleared, ie. spapr_ovec_subset() returns true in the check above no matter what. > However I believe that the original rationale for this check was that > while we could add things to the device tree for added capabilities, > we didn't have a way to roll back the changes for removed > capabilities. > IIUC this is specifically for "removed capabilities since last CAS". This can happen if: 1) we're already processing a CAS reboot or, 2) a freshly rebooted guest calls CAS twice without being rebooted in between. Since a freshly booted or rebooted guest can only trigger a CAS reboot because of a "hotplug-before-CAS", if we manage to get rid of this limitation, 1) cannot happen anymore. The linux kernel seems to be only calling "ibm,client-architecture-support" once during early boot so 2) should _never_ happen. Do we care to support this scenario anyway ? > Now that we fully rebuild the device tree at CAS, I think this test > can probably just go, although there's some double checking to do. > I tend to agree.
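The option Greg floats here (and comes back to below) would make scenario 2) fail cleanly instead of being supported. A hedged sketch of the idea, not taken from any posted patch: the cas_completed flag is hypothetical, the handler signature matches the other spapr hcalls, and H_STATE is an existing return code though the exact error choice is illustrative.

/* Hypothetical sketch: refuse a repeated CAS without an intervening reset. */
static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
                                                  SpaprMachineState *spapr,
                                                  target_ulong opcode,
                                                  target_ulong *args)
{
    if (spapr->cas_completed) { /* hypothetical flag, cleared on machine reset */
        return H_STATE;         /* guest must reboot before renegotiating */
    }

    /* ... existing negotiation logic ... */

    spapr->cas_completed = true;
    return H_SUCCESS;
}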
On Tue, Jan 21, 2020 at 10:32:55AM +0100, Greg Kurz wrote: > On Tue, 21 Jan 2020 14:43:32 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Mon, Jan 20, 2020 at 09:04:38AM +0100, Greg Kurz wrote: > > > On Fri, 17 Jan 2020 16:44:27 +0100 > > > Greg Kurz <groug@kaod.org> wrote: > > > > > > > On Fri, 17 Jan 2020 19:16:08 +1000 > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote: > > > > > > On Thu, 16 Jan 2020 13:14:35 +0100 > > > > > > Greg Kurz <groug@kaod.org> wrote: > > > > > > > > > > > > > On Thu, 16 Jan 2020 11:37:24 +0100 > > > > > > > Laurent Vivier <lvivier@redhat.com> wrote: > > > > > > > > > > > > > > > On 16/01/2020 09:48, Greg Kurz wrote: > > > > > > > > > On Wed, 15 Jan 2020 19:10:37 +0100 > > > > > > > > > Laurent Vivier <lvivier@redhat.com> wrote: > > > > > > > > > > > > > > > > > >> Hi, > > > > > > > > >> > > > > > > > > >> On 15/01/2020 18:48, Greg Kurz wrote: > > > > > > > > >>> Migration can potentially race with CAS reboot. If the migration thread > > > > > > > > >>> completes migration after CAS has set spapr->cas_reboot but before the > > > > > > > > >>> mainloop could pick up the reset request and reset the machine, the > > > > > > > > >>> guest is migrated unrebooted and the destination doesn't reboot it > > > > > > > > >>> either because it isn't aware a CAS reboot was needed (eg, because a > > > > > > > > >>> device was added before CAS). This likely result in a broken or hung > > > > > > > > >>> guest. > > > > > > > > >>> > > > > > > > > >>> Even if it is small, the window between CAS and CAS reboot is enough to > > > > > > > > >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new > > > > > > > > >>> subsection for that and always send it when a CAS reboot is pending. > > > > > > > > >>> This may cause migration to older QEMUs to fail but it is still better > > > > > > > > >>> than end up with a broken guest. > > > > > > > > >>> > > > > > > > > >>> The destination cannot honour the CAS reboot request from a post load > > > > > > > > >>> handler because this must be done after the guest is fully restored. > > > > > > > > >>> It is thus done from a VM change state handler. > > > > > > > > >>> > > > > > > > > >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com> > > > > > > > > >>> Signed-off-by: Greg Kurz <groug@kaod.org> > > > > > > > > >>> --- > > > > > > > > >>> > > > > > > > > >> > > > > > > > > >> I'm wondering if the problem can be related with the fact that > > > > > > > > >> main_loop_should_exit() could release qemu_global_mutex in > > > > > > > > >> pause_all_vcpus() in the reset case? > > > > > > > > >> > > > > > > > > >> 1602 static bool main_loop_should_exit(void) > > > > > > > > >> 1603 { > > > > > > > > >> ... > > > > > > > > >> 1633 request = qemu_reset_requested(); > > > > > > > > >> 1634 if (request) { > > > > > > > > >> 1635 pause_all_vcpus(); > > > > > > > > >> 1636 qemu_system_reset(request); > > > > > > > > >> 1637 resume_all_vcpus(); > > > > > > > > >> 1638 if (!runstate_check(RUN_STATE_RUNNING) && > > > > > > > > >> 1639 !runstate_check(RUN_STATE_INMIGRATE)) { > > > > > > > > >> 1640 runstate_set(RUN_STATE_PRELAUNCH); > > > > > > > > >> 1641 } > > > > > > > > >> 1642 } > > > > > > > > >> ... 
> > > > > > > > >> > > > > > > > > >> I already sent a patch for this kind of problem (in current Juan pull > > > > > > > > >> request): > > > > > > > > >> > > > > > > > > >> "runstate: ignore finishmigrate -> prelaunch transition" > > > > > > > > >> > > > > > > > > > > > > > > > > > > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate > > > > > > > > > transition that can happen if the migration thread sets the runstate to > > > > > > > > > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex. > > > > > > > > > > > > > > > > > > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm > > > > > > > > > trying to fix is a guest breakage caused by a discrepancy between > > > > > > > > > QEMU and the guest after migration has succeeded. > > > > > > > > > > > > > > > > > >> but I don't know if it could fix this one. > > > > > > > > >> > > > > > > > > > > > > > > > > > > I don't think so and your patch kinda illustrates it. If the runstate > > > > > > > > > is 'finishmigrate' when returning from pause_all_vcpus(), this means > > > > > > > > > that state was sent to the destination before we could actually reset > > > > > > > > > the machine. > > > > > > > > > > > > > > > > Yes, you're right. > > > > > > > > > > > > > > > > But the question behind my comment was: is it expected to have a pending > > > > > > > > reset while we are migrating? > > > > > > > > > > > > > > > > > > > > > > Nothing prevents qemu_system_reset_request() to be called when migration > > > > > > > is active. > > > > > > > > > > > > > > > Perhaps H_CAS can return H_BUSY and wait the end of the migration and > > > > > > > > then be fully executed on destination? > > > > > > > > > > > > > > > > > > > > > > And so we would need to teach SLOF to try H_CAS again until it stops > > > > > > > returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO. > > > > > > > > > > > > > > > > > > > Ok I've tried that with a patched SLOF that sleeps 500ms and tries CAS > > > > > > again if H_BUSY was returned. It fixes the issue but it looks a bit > > > > > > ugly because of the polling with an arbitrary timeout in SLOF... I'm > > > > > > not very comfortable either with calling migration_is_active() from > > > > > > the CAS code in QEMU. > > > > > > > > > > > > David, > > > > > > > > > > > > Any suggestion ? > > > > > > > > > > Yeah, I think looping in SLOF is a worse idea than migrating the > > > > > cas_reboot flag. > > > > > > > > > > But.. a better solution still might be to just remove the remaining > > > > > causes for CAS reboot entirely. CAS reboots pretty much suck when > > > > > they happen, anyway. > > > > > > > > > > > > > I Agree. > > > > > > > > > With the irq changeover condition removed, I think the remaining > > > > > causes are more theoretical than practical situations at this point. > > > > > > > > > > > > > FWIW, hotpluggging a PCI device before CAS result in a hung guest (not yet > > > > investigated the details). > > > > > > commit 10f12e6450407b18b4d5a6b50d3852dcfd7fff75 > > > Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com> > > > Date: Wed Aug 30 15:21:41 2017 -0300 > > > > > > hw/ppc: CAS reset on early device hotplug > > > > > > I'll have a look to see what can be done here. > > > > Ah.. yes, that one might be a bit tricky. > > > > So far it seems to be related to SLOF not being able to create > new nodes in the DT when parsing the FDT returned by CAS. SLOF > stops the parsing and returns an error. 
The guest ends up with > a broken DT and eventually hangs later (in my case the kernel > believes it is going to do hash while radix was negotiated with > QEMU). I need to dig some more. Uh... I don't think this is right. I'm pretty sure SLOF *does* create new nodes when parsing the CAS FDT; it will need to for "ibm,dynamic-reconfiguration-memory" at least. I've done some looking and I think the actual reasons here are a bit more complicated (but also possibly easier to handle). 1. We can't send hotplug events to the guest until after CAS, because before then we don't know if it can process them, or if it can, which interrupt it uses for them. 2. Queueing hotplug events across CAS for delivery afterwards introduces other complications 3. We need to make sure that each device appears exactly once in either the initial device tree that the guest OS sees, *or* in a hotplug event, not both or neither. Now that we rebuild the DT at CAS time, I think we might be able to handle this by converting such devices to "cold plugged" at CAS time (similarly to what we already do at reset). Since they're in the CAS-time DT which is what the OS will consume, cold plugged is effectively how the OS will see them. A remaining problem is that new PCI devices won't get BARs assigned by SLOF in this scenario. We'll probably get away with it because of the linux,pci-probe-only property, but I don't know that we want to rely on that. Hotplugged PCI bridges introduce further complications, because they won't get enumerated. > > > > But I agree the other check is more theoretical: > > > > > > /* capabilities that have been added since CAS-generated guest reset. > > > > > * if capabilities have since been removed, generate another reset > > > > > */ > > > > > spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas); > > > > > > Unless changing kernels or tempering with the kernel command line, I don't > > > > see how some capabilities could change between the two CAS in practice. > > > > > > Well, we want to be robust and it's at least theoretically possible > > > that the guest will request different things on subsequent reboots. > > > > Yes but in the latter case a full machine reset occurs and > > Not necessarily. A guest could ask for something on one CAS cycle, > then reject it on another, without doing a full reboot. It'd be a > pointless thing for the guest to do, but it's possible. > > > spapr->ov5_cas gets cleared, ie. spapr_ovec_subset() returns > > true in the check above no matter what. > > Well, also it could happen if the guest rejects something we put in > the initial value of ov5_cas (which is populated from spapr->ov5, so > it's not entirely empty).
But it doesn't really matter because the full dt rebuild should handle it anyway. I have a draft patch which removes this cause for CAS reboots. Still grappling with the hotplug-before-CAS case. > > Now that we fully rebuild the device tree at CAS, I think this test > > can probably just go, although there's some double checking to do. > > > > I tend to agree.
On Wed, 22 Jan 2020 17:50:28 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Tue, Jan 21, 2020 at 10:32:55AM +0100, Greg Kurz wrote: > > On Tue, 21 Jan 2020 14:43:32 +1100 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > On Mon, Jan 20, 2020 at 09:04:38AM +0100, Greg Kurz wrote: > > > > On Fri, 17 Jan 2020 16:44:27 +0100 > > > > Greg Kurz <groug@kaod.org> wrote: > > > > > > > > > On Fri, 17 Jan 2020 19:16:08 +1000 > > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > > > On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote: > > > > > > > On Thu, 16 Jan 2020 13:14:35 +0100 > > > > > > > Greg Kurz <groug@kaod.org> wrote: > > > > > > > > > > > > > > > On Thu, 16 Jan 2020 11:37:24 +0100 > > > > > > > > Laurent Vivier <lvivier@redhat.com> wrote: > > > > > > > > > > > > > > > > > On 16/01/2020 09:48, Greg Kurz wrote: > > > > > > > > > > On Wed, 15 Jan 2020 19:10:37 +0100 > > > > > > > > > > Laurent Vivier <lvivier@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > >> Hi, > > > > > > > > > >> > > > > > > > > > >> On 15/01/2020 18:48, Greg Kurz wrote: > > > > > > > > > >>> Migration can potentially race with CAS reboot. If the migration thread > > > > > > > > > >>> completes migration after CAS has set spapr->cas_reboot but before the > > > > > > > > > >>> mainloop could pick up the reset request and reset the machine, the > > > > > > > > > >>> guest is migrated unrebooted and the destination doesn't reboot it > > > > > > > > > >>> either because it isn't aware a CAS reboot was needed (eg, because a > > > > > > > > > >>> device was added before CAS). This likely result in a broken or hung > > > > > > > > > >>> guest. > > > > > > > > > >>> > > > > > > > > > >>> Even if it is small, the window between CAS and CAS reboot is enough to > > > > > > > > > >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new > > > > > > > > > >>> subsection for that and always send it when a CAS reboot is pending. > > > > > > > > > >>> This may cause migration to older QEMUs to fail but it is still better > > > > > > > > > >>> than end up with a broken guest. > > > > > > > > > >>> > > > > > > > > > >>> The destination cannot honour the CAS reboot request from a post load > > > > > > > > > >>> handler because this must be done after the guest is fully restored. > > > > > > > > > >>> It is thus done from a VM change state handler. > > > > > > > > > >>> > > > > > > > > > >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com> > > > > > > > > > >>> Signed-off-by: Greg Kurz <groug@kaod.org> > > > > > > > > > >>> --- > > > > > > > > > >>> > > > > > > > > > >> > > > > > > > > > >> I'm wondering if the problem can be related with the fact that > > > > > > > > > >> main_loop_should_exit() could release qemu_global_mutex in > > > > > > > > > >> pause_all_vcpus() in the reset case? > > > > > > > > > >> > > > > > > > > > >> 1602 static bool main_loop_should_exit(void) > > > > > > > > > >> 1603 { > > > > > > > > > >> ... > > > > > > > > > >> 1633 request = qemu_reset_requested(); > > > > > > > > > >> 1634 if (request) { > > > > > > > > > >> 1635 pause_all_vcpus(); > > > > > > > > > >> 1636 qemu_system_reset(request); > > > > > > > > > >> 1637 resume_all_vcpus(); > > > > > > > > > >> 1638 if (!runstate_check(RUN_STATE_RUNNING) && > > > > > > > > > >> 1639 !runstate_check(RUN_STATE_INMIGRATE)) { > > > > > > > > > >> 1640 runstate_set(RUN_STATE_PRELAUNCH); > > > > > > > > > >> 1641 } > > > > > > > > > >> 1642 } > > > > > > > > > >> ... 
> > > > > > > > > >> > > > > > > > > > >> I already sent a patch for this kind of problem (in current Juan pull > > > > > > > > > >> request): > > > > > > > > > >> > > > > > > > > > >> "runstate: ignore finishmigrate -> prelaunch transition" > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate > > > > > > > > > > transition that can happen if the migration thread sets the runstate to > > > > > > > > > > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex. > > > > > > > > > > > > > > > > > > > > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm > > > > > > > > > > trying to fix is a guest breakage caused by a discrepancy between > > > > > > > > > > QEMU and the guest after migration has succeeded. > > > > > > > > > > > > > > > > > > > >> but I don't know if it could fix this one. > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > I don't think so and your patch kinda illustrates it. If the runstate > > > > > > > > > > is 'finishmigrate' when returning from pause_all_vcpus(), this means > > > > > > > > > > that state was sent to the destination before we could actually reset > > > > > > > > > > the machine. > > > > > > > > > > > > > > > > > > Yes, you're right. > > > > > > > > > > > > > > > > > > But the question behind my comment was: is it expected to have a pending > > > > > > > > > reset while we are migrating? > > > > > > > > > > > > > > > > > > > > > > > > > Nothing prevents qemu_system_reset_request() to be called when migration > > > > > > > > is active. > > > > > > > > > > > > > > > > > Perhaps H_CAS can return H_BUSY and wait the end of the migration and > > > > > > > > > then be fully executed on destination? > > > > > > > > > > > > > > > > > > > > > > > > > And so we would need to teach SLOF to try H_CAS again until it stops > > > > > > > > returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO. > > > > > > > > > > > > > > > > > > > > > > Ok I've tried that with a patched SLOF that sleeps 500ms and tries CAS > > > > > > > again if H_BUSY was returned. It fixes the issue but it looks a bit > > > > > > > ugly because of the polling with an arbitrary timeout in SLOF... I'm > > > > > > > not very comfortable either with calling migration_is_active() from > > > > > > > the CAS code in QEMU. > > > > > > > > > > > > > > David, > > > > > > > > > > > > > > Any suggestion ? > > > > > > > > > > > > Yeah, I think looping in SLOF is a worse idea than migrating the > > > > > > cas_reboot flag. > > > > > > > > > > > > But.. a better solution still might be to just remove the remaining > > > > > > causes for CAS reboot entirely. CAS reboots pretty much suck when > > > > > > they happen, anyway. > > > > > > > > > > > > > > > > I Agree. > > > > > > > > > > > With the irq changeover condition removed, I think the remaining > > > > > > causes are more theoretical than practical situations at this point. > > > > > > > > > > > > > > > > FWIW, hotpluggging a PCI device before CAS result in a hung guest (not yet > > > > > investigated the details). > > > > > > > > commit 10f12e6450407b18b4d5a6b50d3852dcfd7fff75 > > > > Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com> > > > > Date: Wed Aug 30 15:21:41 2017 -0300 > > > > > > > > hw/ppc: CAS reset on early device hotplug > > > > > > > > I'll have a look to see what can be done here. > > > > > > Ah.. yes, that one might be a bit tricky. 
> > > > > > > So far it seems to be related to SLOF not being able to create > > new nodes in the DT when parsing the FDT returned by CAS. SLOF > > stops the parsing and returns an error. The guest ends up with > > a broken DT and eventually hangs later (in my case the kernel > > believes it is going to do hash while radix was negotiated with > > QEMU). I need to dig some more. > > Uh... I don't think this is right. I'm pretty sure SLOF *does* create > new nodes when parsing the CAS FDT, it will need to for > "ibm,dynamic-reconfiguration-memory" at least. > It can create "memory@" or "ibm,dynamic-reconfiguration-memory" nodes but explicitly rejects all others. > I've done some looking and I think the actual reasons here are a bit > more complicated (but also possibly easier to handle). > > 1. We can't send hotplug events to the guest until after CAS, > because before then we don't know if it can process them, or if > it can, which interrupt it uses for them. > > 2. Queueing hotplug events across CAS for delivery afterwards > introduces other complications > > 3. We need to make sure that each device appears exactly once in > either the initial device tree that the guest OS sees, *or* in a > hotplug event, not both or neither. > > Now that we rebuild the DT at CAS time, I think we mightd be able toy > handle this by converting such devices to "cold plugged" at CAS time > (similarly to what we already do at reset). Since they're in the > CAS-time DT which is what the OS will consume, cold plugged is > effectively how the OS will see them. > I have tried hacking around to achieve that. Basically calling spapr_drc_reset() for all DRCs for which spapr_drc_needed() returns true. > A remaining problem is that new PCI devices won't get BARs assigned by > SLOF in this scenario. We'll probably get away with it because of the > linux,pci-probe-only property, but I don't know we want to rely on We currently only expose this property for pseries-4.2 and newer machine types... this could be a problem. > that. PCI bridges hotplugged introduce further complications, because > they won't get enumerated. > > > > > But I agree the other check is more theoretical: > > > > > > > > /* capabilities that have been added since CAS-generated guest reset. > > > > * if capabilities have since been removed, generate another reset > > > > */ > > > > spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas); > > > > > > > > Unless changing kernels or tempering with the kernel command line, I don't > > > > see how some capabilities could change between the two CAS in practice. > > > > > > Well, we want to be robust and it's at least theoretically possible > > > that the guest will request different things on subsequent reboots. > > > > Yes but in the latter case a full machine reset occurs and > > Not necessarily. A guest could ask for something on one CAS cycle, > then reject it on another, without doing a full reboot. It'd be a > pointless thing for the guest to do, but it's possible. > Ok, I was asking later on if we want to support the scenario of multiple CAS without an intermediate full reboot. I now have the answer :) > > spapr->ov5_cas gets cleared, ie. spapr_ovec_subset() returns > > true in the check above no matter what. > > Well, also it could happen if the guest rejects something we put in > the initial value of ov5_cas (which is populated from spapr->ov5, so > it's not entirely empty). 
> AFAICT the initial value of ov5_cas after a full machine reset is all zeroes until CAS does: /* full range of negotiated ov5 capabilities */ spapr_ovec_intersect(spapr->ov5_cas, spapr->ov5, ov5_guest); which is done between: ov5_cas_old = spapr_ovec_clone(spapr->ov5_cas); and spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas); So I don't quite understand how ov5_cas_old, ie. spapr->ov5_cas at the time the guest calls CAS, could have an "initial value not entirely empty"... This can only happen if the guest calls CAS several times without doing a full reboot in between. My initial thought was to refuse this scenario and fail any subsequent CAS attempt made by the guest before a full reboot. > > > However I believe that the original rationale for this check was that > > > while we could add things to the device tree for added capabilities, > > > we didn't have a way to roll back the changes for removed > > > capabilities. > > > > > > > IIUC this is specifically for "removed capabilities since last > > CAS". This can happen if: > > 1) we're already processing a CAS reboot or, > > 2) a freshly rebooted guest calls CAS twice without being rebooted > > in between. > > > > Since a freshly booted or rebooted guest can only trigger a CAS > > reboot because of a "hotplug-before-CAS", if we manage to get rid > > of this limitation, 1) cannot happen anymore. > > > > The linux kernel seems to be only calling "ibm,client-architecture-support" > > once during early boot so 2) should _never_ happen. Do we care to support > > this scenario anyway ? > > I think you've missed some things in your reasoning. But it doesn't > really matter because the full dt rebuilt should handle it anyway. I > have a draft patch which removes this cause for CAS reboots. > > Still grappling with the hotplug-before-CAS case. > Same here actually. I was struggling with SLOF to have it create new nodes for hotplugged-before-CAS devices without crashing :-\ I think I'll wait for your patches to arrive :) Please Cc: me. > > > Now that we fully rebuild the device tree at CAS, I think this test > > > can probably just go, although there's some double checking to do. > > > > > > > I tend to agree. >
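The hack Greg describes above, converting pending hotplugs to cold plug by resetting their DRCs, could look roughly like this sketch. spapr_drc_needed() and spapr_drc_reset() are the functions he names; spapr_drc_foreach() is a hypothetical iterator, since QEMU parents DRC objects under their owner devices and real code would need an object-tree walk.

/*
 * Hedged sketch of "cold plug at CAS": reset any DRC still carrying
 * transitional state, i.e. a device plugged before CAS that the guest
 * has never been notified about.
 */
static void cas_coldplug_one_drc(SpaprDrc *drc, void *opaque)
{
    /* spapr_drc_needed() is the vmstate .needed callback; it returns
     * true exactly when the DRC is mid-transition */
    if (spapr_drc_needed(drc)) {
        spapr_drc_reset(drc); /* the device now appears cold plugged */
    }
}

static void spapr_cas_coldplug_pending(SpaprMachineState *spapr)
{
    /* hypothetical helper: apply the callback to every DRC */
    spapr_drc_foreach(spapr, cas_coldplug_one_drc, NULL);
}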
On Wed, 15 Jan 2020 19:10:47 +0100 Cédric Le Goater <clg@kaod.org> wrote: > On 1/15/20 6:48 PM, Greg Kurz wrote: > > Migration can potentially race with CAS reboot. If the migration thread > > completes migration after CAS has set spapr->cas_reboot but before the > > mainloop could pick up the reset request and reset the machine, the > > guest is migrated unrebooted and the destination doesn't reboot it > > either because it isn't aware a CAS reboot was needed (eg, because a > > device was added before CAS). This likely result in a broken or hung > > guest. > > > > Even if it is small, the window between CAS and CAS reboot is enough to > > re-qualify spapr->cas_reboot as state that we should migrate. Add a new > > subsection for that and always send it when a CAS reboot is pending. > > This may cause migration to older QEMUs to fail but it is still better > > than end up with a broken guest. > > > > The destination cannot honour the CAS reboot request from a post load > > handler because this must be done after the guest is fully restored. > > It is thus done from a VM change state handler. > > > > Reported-by: Lukáš Doktor <ldoktor@redhat.com> > > Signed-off-by: Greg Kurz <groug@kaod.org> > > Cédric Le Goater <clg@kaod.org> > I guess you mean: Reviewed-by: Cédric Le Goater <clg@kaod.org> ? > Nice work ! That was quite complex to catch ! > > Thanks, > > C. > > > --- > > > > This patch is supposed to fix the interrupt controller mode inconsistency > > between QEMU and the guest reported in this BZ: > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1781315 (requires auth) > > > > Even if interrupt controller selection doesn't involve CAS reboot anymore, > > we still have other conditions that require CAS reboot. > > --- > > hw/ppc/spapr.c | 43 +++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 43 insertions(+) > > > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > > index 30a5fbd3bea6..bf2763aa16e5 100644 > > --- a/hw/ppc/spapr.c > > +++ b/hw/ppc/spapr.c > > @@ -1959,6 +1959,31 @@ static const VMStateDescription vmstate_spapr_dtb = { > > }, > > }; > > > > +static bool spapr_cas_reboot_needed(void *opaque) > > +{ > > + SpaprMachineState *spapr = SPAPR_MACHINE(opaque); > > + > > + /* > > + * This causes the "spapr_cas_reboot" subsection to always be > > + * sent if migration raced with CAS. This causes older QEMUs > > + * that don't know about the subsection to fail migration but > > + * it is still better than end up with a broken guest on the > > + * destination. 
> > + */ > > + return spapr->cas_reboot; > > +} > > + > > +static const VMStateDescription vmstate_spapr_cas_reboot = { > > + .name = "spapr_cas_reboot", > > + .version_id = 1, > > + .minimum_version_id = 1, > > + .needed = spapr_cas_reboot_needed, > > + .fields = (VMStateField[]) { > > + VMSTATE_BOOL(cas_reboot, SpaprMachineState), > > + VMSTATE_END_OF_LIST() > > + }, > > +}; > > + > > static const VMStateDescription vmstate_spapr = { > > .name = "spapr", > > .version_id = 3, > > @@ -1992,6 +2017,7 @@ static const VMStateDescription vmstate_spapr = { > > &vmstate_spapr_dtb, > > &vmstate_spapr_cap_large_decr, > > &vmstate_spapr_cap_ccf_assist, > > + &vmstate_spapr_cas_reboot, > > NULL > > } > > }; > > @@ -2577,6 +2603,21 @@ static PCIHostState *spapr_create_default_phb(void) > > return PCI_HOST_BRIDGE(dev); > > } > > > > +static void spapr_change_state_handler(void *opaque, int running, > > + RunState state) > > +{ > > + SpaprMachineState *spapr = opaque; > > + > > + if (running && spapr->cas_reboot) { > > + /* > > + * This happens when resuming from migration if the source > > + * processed a CAS but didn't have time to trigger the CAS > > + * reboot. Do it now. > > + */ > > + qemu_system_reset_request(SHUTDOWN_CAUSE_SUBSYSTEM_RESET); > > + } > > +} > > + > > /* pSeries LPAR / sPAPR hardware init */ > > static void spapr_machine_init(MachineState *machine) > > { > > @@ -2970,6 +3011,8 @@ static void spapr_machine_init(MachineState *machine) > > > > kvmppc_spapr_enable_inkernel_multitce(); > > } > > + > > + qemu_add_vm_change_state_handler(spapr_change_state_handler, spapr); > > } > > > > static int spapr_kvm_type(MachineState *machine, const char *vm_type) > > >
On 1/22/20 1:47 PM, Greg Kurz wrote: > On Wed, 15 Jan 2020 19:10:47 +0100 > Cédric Le Goater <clg@kaod.org> wrote: > >> On 1/15/20 6:48 PM, Greg Kurz wrote: >>> Migration can potentially race with CAS reboot. If the migration thread >>> completes migration after CAS has set spapr->cas_reboot but before the >>> mainloop could pick up the reset request and reset the machine, the >>> guest is migrated unrebooted and the destination doesn't reboot it >>> either because it isn't aware a CAS reboot was needed (eg, because a >>> device was added before CAS). This likely result in a broken or hung >>> guest. >>> >>> Even if it is small, the window between CAS and CAS reboot is enough to >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new >>> subsection for that and always send it when a CAS reboot is pending. >>> This may cause migration to older QEMUs to fail but it is still better >>> than end up with a broken guest. >>> >>> The destination cannot honour the CAS reboot request from a post load >>> handler because this must be done after the guest is fully restored. >>> It is thus done from a VM change state handler. >>> >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com> >>> Signed-off-by: Greg Kurz <groug@kaod.org> >> >> Cédric Le Goater <clg@kaod.org> >> > > I guess you mean: > > Reviewed-by: Cédric Le Goater <clg@kaod.org> Yes. Reviewed-by: Cédric Le Goater <clg@kaod.org> Since keyconfig was disabled in thunderbird, I have been adding tags manually ... C.
On Wed, Jan 22, 2020 at 11:06:18AM +0100, Greg Kurz wrote: > On Wed, 22 Jan 2020 17:50:28 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Tue, Jan 21, 2020 at 10:32:55AM +0100, Greg Kurz wrote: > > > On Tue, 21 Jan 2020 14:43:32 +1100 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > On Mon, Jan 20, 2020 at 09:04:38AM +0100, Greg Kurz wrote: > > > > > On Fri, 17 Jan 2020 16:44:27 +0100 > > > > > Greg Kurz <groug@kaod.org> wrote: > > > > > > > > > > > On Fri, 17 Jan 2020 19:16:08 +1000 > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > > > > > On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote: > > > > > > > > On Thu, 16 Jan 2020 13:14:35 +0100 > > > > > > > > Greg Kurz <groug@kaod.org> wrote: > > > > > > > > > > > > > > > > > On Thu, 16 Jan 2020 11:37:24 +0100 > > > > > > > > > Laurent Vivier <lvivier@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > On 16/01/2020 09:48, Greg Kurz wrote: > > > > > > > > > > > On Wed, 15 Jan 2020 19:10:37 +0100 > > > > > > > > > > > Laurent Vivier <lvivier@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > >> Hi, > > > > > > > > > > >> > > > > > > > > > > >> On 15/01/2020 18:48, Greg Kurz wrote: > > > > > > > > > > >>> Migration can potentially race with CAS reboot. If the migration thread > > > > > > > > > > >>> completes migration after CAS has set spapr->cas_reboot but before the > > > > > > > > > > >>> mainloop could pick up the reset request and reset the machine, the > > > > > > > > > > >>> guest is migrated unrebooted and the destination doesn't reboot it > > > > > > > > > > >>> either because it isn't aware a CAS reboot was needed (eg, because a > > > > > > > > > > >>> device was added before CAS). This likely result in a broken or hung > > > > > > > > > > >>> guest. > > > > > > > > > > >>> > > > > > > > > > > >>> Even if it is small, the window between CAS and CAS reboot is enough to > > > > > > > > > > >>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new > > > > > > > > > > >>> subsection for that and always send it when a CAS reboot is pending. > > > > > > > > > > >>> This may cause migration to older QEMUs to fail but it is still better > > > > > > > > > > >>> than end up with a broken guest. > > > > > > > > > > >>> > > > > > > > > > > >>> The destination cannot honour the CAS reboot request from a post load > > > > > > > > > > >>> handler because this must be done after the guest is fully restored. > > > > > > > > > > >>> It is thus done from a VM change state handler. > > > > > > > > > > >>> > > > > > > > > > > >>> Reported-by: Lukáš Doktor <ldoktor@redhat.com> > > > > > > > > > > >>> Signed-off-by: Greg Kurz <groug@kaod.org> > > > > > > > > > > >>> --- > > > > > > > > > > >>> > > > > > > > > > > >> > > > > > > > > > > >> I'm wondering if the problem can be related with the fact that > > > > > > > > > > >> main_loop_should_exit() could release qemu_global_mutex in > > > > > > > > > > >> pause_all_vcpus() in the reset case? > > > > > > > > > > >> > > > > > > > > > > >> 1602 static bool main_loop_should_exit(void) > > > > > > > > > > >> 1603 { > > > > > > > > > > >> ... 
> > > > > > > > > > >> 1633 request = qemu_reset_requested(); > > > > > > > > > > >> 1634 if (request) { > > > > > > > > > > >> 1635 pause_all_vcpus(); > > > > > > > > > > >> 1636 qemu_system_reset(request); > > > > > > > > > > >> 1637 resume_all_vcpus(); > > > > > > > > > > >> 1638 if (!runstate_check(RUN_STATE_RUNNING) && > > > > > > > > > > >> 1639 !runstate_check(RUN_STATE_INMIGRATE)) { > > > > > > > > > > >> 1640 runstate_set(RUN_STATE_PRELAUNCH); > > > > > > > > > > >> 1641 } > > > > > > > > > > >> 1642 } > > > > > > > > > > >> ... > > > > > > > > > > >> > > > > > > > > > > >> I already sent a patch for this kind of problem (in current Juan pull > > > > > > > > > > >> request): > > > > > > > > > > >> > > > > > > > > > > >> "runstate: ignore finishmigrate -> prelaunch transition" > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > IIUC your patch avoids an invalid 'prelaunch' -> 'postmigrate' runstate > > > > > > > > > > > transition that can happen if the migration thread sets the runstate to > > > > > > > > > > > 'finishmigrate' when pause_all_vcpus() releases the main loop mutex. > > > > > > > > > > > > > > > > > > > > > > ie. symptom of the problem is QEMU aborting, correct ? The issue I'm > > > > > > > > > > > trying to fix is a guest breakage caused by a discrepancy between > > > > > > > > > > > QEMU and the guest after migration has succeeded. > > > > > > > > > > > > > > > > > > > > > >> but I don't know if it could fix this one. > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > I don't think so and your patch kinda illustrates it. If the runstate > > > > > > > > > > > is 'finishmigrate' when returning from pause_all_vcpus(), this means > > > > > > > > > > > that state was sent to the destination before we could actually reset > > > > > > > > > > > the machine. > > > > > > > > > > > > > > > > > > > > Yes, you're right. > > > > > > > > > > > > > > > > > > > > But the question behind my comment was: is it expected to have a pending > > > > > > > > > > reset while we are migrating? > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nothing prevents qemu_system_reset_request() to be called when migration > > > > > > > > > is active. > > > > > > > > > > > > > > > > > > > Perhaps H_CAS can return H_BUSY and wait the end of the migration and > > > > > > > > > > then be fully executed on destination? > > > > > > > > > > > > > > > > > > > > > > > > > > > > And so we would need to teach SLOF to try H_CAS again until it stops > > > > > > > > > returning H_BUSY ? It seems safer to migrate the CAS reboot flag IMHO. > > > > > > > > > > > > > > > > > > > > > > > > > Ok I've tried that with a patched SLOF that sleeps 500ms and tries CAS > > > > > > > > again if H_BUSY was returned. It fixes the issue but it looks a bit > > > > > > > > ugly because of the polling with an arbitrary timeout in SLOF... I'm > > > > > > > > not very comfortable either with calling migration_is_active() from > > > > > > > > the CAS code in QEMU. > > > > > > > > > > > > > > > > David, > > > > > > > > > > > > > > > > Any suggestion ? > > > > > > > > > > > > > > Yeah, I think looping in SLOF is a worse idea than migrating the > > > > > > > cas_reboot flag. > > > > > > > > > > > > > > But.. a better solution still might be to just remove the remaining > > > > > > > causes for CAS reboot entirely. CAS reboots pretty much suck when > > > > > > > they happen, anyway. > > > > > > > > > > > > > > > > > > > I Agree. 
> > > > > > > > > > > > > With the irq changeover condition removed, I think the remaining > > > > > > > causes are more theoretical than practical situations at this point. > > > > > > > > > > > > > > > > > > FWIW, hotpluggging a PCI device before CAS result in a hung guest (not yet > > > > > > investigated the details). > > > > > > > > > > commit 10f12e6450407b18b4d5a6b50d3852dcfd7fff75 > > > > > Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com> > > > > > Date: Wed Aug 30 15:21:41 2017 -0300 > > > > > > > > > > hw/ppc: CAS reset on early device hotplug > > > > > > > > > > I'll have a look to see what can be done here. > > > > Ah.. yes, that one might be a bit tricky. > > > > > > So far it seems to be related to SLOF not being able to create > > > new nodes in the DT when parsing the FDT returned by CAS. SLOF > > > stops the parsing and returns an error. The guest ends up with > > > a broken DT and eventually hangs later (in my case the kernel > > > believes it is going to do hash while radix was negotiated with > > > QEMU). I need to dig some more. > > > > Uh... I don't think this is right. I'm pretty sure SLOF *does* create > > new nodes when parsing the CAS FDT, it will need to for > > "ibm,dynamic-reconfiguration-memory" at least. > > > > It can create "memory@" or "ibm,dynamic-reconfiguration-memory" nodes but > explicitly rejects all others. Huh. Well that's certainly not correct now that we're doing a full tree rebuild. > > I've done some looking and I think the actual reasons here are a bit > > more complicated (but also possibly easier to handle). > > > > 1. We can't send hotplug events to the guest until after CAS, > > because before then we don't know if it can process them, or if > > it can, which interrupt it uses for them. > > > > 2. Queueing hotplug events across CAS for delivery afterwards > > introduces other complications > > > > 3. We need to make sure that each device appears exactly once in > > either the initial device tree that the guest OS sees, *or* in a > > hotplug event, not both or neither. > > > > Now that we rebuild the DT at CAS time, I think we mightd be able toy > > handle this by converting such devices to "cold plugged" at CAS time > > (similarly to what we already do at reset). Since they're in the > > CAS-time DT which is what the OS will consume, cold plugged is > > effectively how the OS will see them. > > > > I have tried hacking around to achieve that. Basically calling > spapr_drc_reset() for all DRCs for which spapr_drc_needed() > returns true. > > > A remaining problem is that new PCI devices won't get BARs assigned by > > SLOF in this scenario. We'll probably get away with it because of the > > linux,pci-probe-only property, but I don't know we want to rely on > > We currently only expose this property for pseries-4.2 and newer > machine types... this could be a problem. Right. I don't like relying on it even for the newer machines, since it presumably wouldn't work for FreeBSD or AIX. > > that. PCI bridges hotplugged introduce further complications, because > > they won't get enumerated. Alexey has done some testing and it looks like there are already a bunch of problems with PCI bridges hotplugged during SLOF execution. > > > > > > > But I agree the other check is more theoretical: > > > > > > > > > > /* capabilities that have been added since CAS-generated guest reset.
> > > > > * if capabilities have since been removed, generate another reset > > > > > */ > > > > > spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas); > > > > > > > > > > Unless changing kernels or tempering with the kernel command line, I don't > > > > > see how some capabilities could change between the two CAS in practice. > > > > > > > > Well, we want to be robust and it's at least theoretically possible > > > > that the guest will request different things on subsequent reboots. > > > > > > Yes but in the latter case a full machine reset occurs and > > > > Not necessarily. A guest could ask for something on one CAS cycle, > > then reject it on another, without doing a full reboot. It'd be a > > pointless thing for the guest to do, but it's possible. > > > > Ok, I was asking later on if we want to support the scenario of > multiple CAS without an intermediate full reboot. I now have the > answer :) Well, "want" might be a bit strong. PAPR allows for the possibility, so we're trying to support it at this stage. Obviously this could happen relatively easily with the hotplug-before-CAS cause. > > > spapr->ov5_cas gets cleared, ie. spapr_ovec_subset() returns > > > true in the check above no matter what. > > > > Well, also it could happen if the guest rejects something we put in > > the initial value of ov5_cas (which is populated from spapr->ov5, so > > it's not entirely empty). > > > > AFAICT the initial value of ov5_cas after a full machine reset is > all zeroes until CAS does: > > /* full range of negotiated ov5 capabilities */ > spapr_ovec_intersect(spapr->ov5_cas, spapr->ov5, ov5_guest); > > which is done between: > > ov5_cas_old = spapr_ovec_clone(spapr->ov5_cas); > > and > > spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas); > > So I don't quite understand how ov5_cas_old, ie. spapr->ov5_cas at the > time the guest calls CAS, could have an "initial value not entirely > empty"... This can only happen if the guest calls CAS several times > without doing a full reboot in between. My initial thought was to > refuse this scenario and fail any subsequent CAS attempt made by > the guest before a full reboot. Oh, sorry, my mistake. I saw the line: spapr->ov5_cas = spapr_ovec_clone(spapr->ov5); but only now realized that only happens in the qtest case. Ok, yes, looks like this could only happen in the case of multiple CAS reboots in a row. > > > > > However I believe that the original rationale for this check was that > > > > while we could add things to the device tree for added capabilities, > > > > we didn't have a way to roll back the changes for removed > > > > capabilities. > > > > > > > > > > IIUC this is specifically for "removed capabilities since last > > > CAS". This can happen if: > > > 1) we're already processing a CAS reboot or, > > > 2) a freshly rebooted guest calls CAS twice without being rebooted > > > in between. > > > > > > Since a freshly booted or rebooted guest can only trigger a CAS > > > reboot because of a "hotplug-before-CAS", if we manage to get rid > > > of this limitation, 1) cannot happen anymore. > > > > > > The linux kernel seems to be only calling "ibm,client-architecture-support" > > > once during early boot so 2) should _never_ happen. Do we care to support > > > this scenario anyway ? > > > > I think you've missed some things in your reasoning. But it doesn't > > really matter because the full dt rebuilt should handle it anyway. I > > have a draft patch which removes this cause for CAS reboots.
> > > > Still grappling with the hotplug-before-CAS case. > > > > Same here actually. I was struggling with SLOF to have it create new nodes > for hotplugged-before-CAS devices without crashing :-\ > > I think I'll wait for your patches to arrive :) Please Cc: me. > > > > > Now that we fully rebuild the device tree at CAS, I think this test > > > > can probably just go, although there's some double checking to do. > > > > > > > > > > I tend to agree. > > >
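For reference, the three lines of h_client_architecture_support (hw/ppc/spapr_hcall.c) that this subthread keeps circling back to, condensed here with everything else stripped out:

/* Condensed from the CAS path quoted above; not standalone code. */
ov5_cas_old = spapr_ovec_clone(spapr->ov5_cas); /* snapshot of the previous CAS result */

/* full range of negotiated ov5 capabilities */
spapr_ovec_intersect(spapr->ov5_cas, spapr->ov5, ov5_guest);

/* reboot only if an option negotiated last time has now disappeared */
spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas);

On a freshly reset machine ov5_cas_old is empty, so the subset test passes regardless of what the guest requests; only a second CAS in the same boot can flip cas_reboot here, which is the conclusion both sides reach above.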
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 30a5fbd3bea6..bf2763aa16e5 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1959,6 +1959,31 @@ static const VMStateDescription vmstate_spapr_dtb = {
     },
 };
 
+static bool spapr_cas_reboot_needed(void *opaque)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(opaque);
+
+    /*
+     * This causes the "spapr_cas_reboot" subsection to always be
+     * sent if migration raced with CAS. This causes older QEMUs
+     * that don't know about the subsection to fail migration but
+     * it is still better than end up with a broken guest on the
+     * destination.
+     */
+    return spapr->cas_reboot;
+}
+
+static const VMStateDescription vmstate_spapr_cas_reboot = {
+    .name = "spapr_cas_reboot",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = spapr_cas_reboot_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_BOOL(cas_reboot, SpaprMachineState),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
 static const VMStateDescription vmstate_spapr = {
     .name = "spapr",
     .version_id = 3,
@@ -1992,6 +2017,7 @@ static const VMStateDescription vmstate_spapr = {
         &vmstate_spapr_dtb,
         &vmstate_spapr_cap_large_decr,
         &vmstate_spapr_cap_ccf_assist,
+        &vmstate_spapr_cas_reboot,
         NULL
     }
 };
@@ -2577,6 +2603,21 @@ static PCIHostState *spapr_create_default_phb(void)
     return PCI_HOST_BRIDGE(dev);
 }
 
+static void spapr_change_state_handler(void *opaque, int running,
+                                       RunState state)
+{
+    SpaprMachineState *spapr = opaque;
+
+    if (running && spapr->cas_reboot) {
+        /*
+         * This happens when resuming from migration if the source
+         * processed a CAS but didn't have time to trigger the CAS
+         * reboot. Do it now.
+         */
+        qemu_system_reset_request(SHUTDOWN_CAUSE_SUBSYSTEM_RESET);
+    }
+}
+
 /* pSeries LPAR / sPAPR hardware init */
 static void spapr_machine_init(MachineState *machine)
 {
@@ -2970,6 +3011,8 @@ static void spapr_machine_init(MachineState *machine)
 
         kvmppc_spapr_enable_inkernel_multitce();
     }
+
+    qemu_add_vm_change_state_handler(spapr_change_state_handler, spapr);
 }
 
 static int spapr_kvm_type(MachineState *machine, const char *vm_type)
Migration can potentially race with CAS reboot. If the migration thread completes migration after CAS has set spapr->cas_reboot but before the mainloop could pick up the reset request and reset the machine, the guest is migrated unrebooted and the destination doesn't reboot it either because it isn't aware a CAS reboot was needed (eg, because a device was added before CAS). This likely results in a broken or hung guest.

Even if it is small, the window between CAS and CAS reboot is enough to re-qualify spapr->cas_reboot as state that we should migrate. Add a new subsection for that and always send it when a CAS reboot is pending. This may cause migration to older QEMUs to fail but it is still better than ending up with a broken guest.

The destination cannot honour the CAS reboot request from a post load handler because this must be done after the guest is fully restored. It is thus done from a VM change state handler.

Reported-by: Lukáš Doktor <ldoktor@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
---
This patch is supposed to fix the interrupt controller mode inconsistency between QEMU and the guest reported in this BZ:

https://bugzilla.redhat.com/show_bug.cgi?id=1781315 (requires auth)

Even if interrupt controller selection doesn't involve CAS reboot anymore, we still have other conditions that require CAS reboot.
---
 hw/ppc/spapr.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)