
[COLO-Frame,v15,00/38] COarse-grain LOck-stepping(COLO) Virtual Machines for Non-stop Service (FT)

Message ID 56D15653.90406@huawei.com (mailing list archive)
State New, archived

Commit Message

Zhanghailiang Feb. 27, 2016, 7:54 a.m. UTC
On 2016/2/27 0:36, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
>> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
>>> From: root <root@localhost.localdomain>
>>>
>>> This is the 15th version of COLO (still only supports periodic checkpoints).
>>>
>>> Here is only COLO frame part, you can get the whole codes from github:
>>> https://github.com/coloft/qemu/commits/colo-v2.6-periodic-mode
>>>
>>> There are few changes in this series apart from the network-related part.
>>
>> I was looking at the time the guest is paused during COLO and
>> was surprised to find one of the larger chunks was the time to reset
>> the guest before loading each checkpoint;  I've traced it part way, the
>> biggest contributors for my test VM seem to be:
>>
>>    3.8ms  pcibus_reset: VGA
>>    1.8ms  pcibus_reset: virtio-net-pci
>>    1.5ms  pcibus_reset: virtio-blk-pci
>>    1.5ms  qemu_devices_reset: piix4_reset
>>    1.1ms  pcibus_reset: piix3-ide
>>    1.1ms  pcibus_reset: virtio-rng-pci
>>
>> I've not looked deeper yet, but some of these are very silly;
>> I'm running with -nographic so why it's taking 3.8ms to reset VGA is
>> going to be interesting.
>> Also, my only block device is the virtio-blk, so while I understand the
>> standard PC machine has the IDE controller, why it takes over a millisecond
>> to reset an unused device is also going to be interesting.
>
> OK, so I've dug a bit deeper, and it appears that it's the changes in
> PCI bars that actually take the time;  every time we do a reset we
> reset all the BARs, this causes it to do a pci_update_mappings and
> end up doing a memory_region_del_subregion.
> Then we load the config space of the PCI device as we do the vmstate_load,
> and this recreates all the mappings again.
>
> I'm not sure what the fix is, but that sounds like it would
> speed up the checkpoints usefully if we can avoid the map/remap when
> they're the same.
>
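
To make the quoted observation concrete, below is a minimal, hedged sketch of one
possible direction: wrap the reset and the subsequent device-state load in a single
memory-region transaction, so that the BAR unmap done by the reset and the remap done
by the PCI config-space load collapse into one topology update. The helper name and
call sites are assumptions for illustration only, not the actual COLO code path, and
whether every reset handler tolerates a deferred commit would still need checking.

#include "exec/memory.h"           /* memory_region_transaction_begin/commit */
#include "migration/qemu-file.h"   /* QEMUFile                               */
#include "sysemu/sysemu.h"         /* qemu_system_reset(), qemu_loadvm_state */

/* Hypothetical helper: load one checkpoint with batched memory-region updates. */
static int colo_load_checkpoint_batched(QEMUFile *f)
{
    int ret;

    memory_region_transaction_begin();    /* defer flatview/memslot rebuilds */

    qemu_system_reset(VMRESET_SILENT);    /* BARs cleared, mappings dropped  */
    ret = qemu_loadvm_state(f);           /* config space reloaded, remapped */

    memory_region_transaction_commit();   /* one rebuild covers both steps   */

    return ret;
}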

Interesting, and thanks for your report.

We already know qemu_system_reset() is a time-consuming function and we would
rather not call it here, but if we skip it, we hit a bug that we reported in a
previous COLO series; below is a copy of the related patch comment:

     COLO VMstate: Load VM state into qsb before restore it

     We should not destroy the state of the secondary until we have received the
     whole state from the primary, in case the primary fails in the middle of
     sending it; so here we cache the device state on the secondary before
     restoring it.

     Besides, we should call qemu_system_reset() before loading the VM state,
     which ensures the data is intact.
     Note: if we drop qemu_system_reset(), we get odd errors.
     For example, QEMU on the slave side crashes and reports:

     KVM: entry failed, hardware error 0x7
     EAX=00000000 EBX=0000e000 ECX=00009578 EDX=0000434f
     ESI=0000fc10 EDI=0000434f EBP=00000000 ESP=00001fca
     EIP=00009594 EFL=00010246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
     ES =0040 00000400 0000ffff 00009300
     CS =f000 000f0000 0000ffff 00009b00
     SS =434f 000434f0 0000ffff 00009300
     DS =434f 000434f0 0000ffff 00009300
     FS =0000 00000000 0000ffff 00009300
     GS =0000 00000000 0000ffff 00009300
     LDT=0000 00000000 0000ffff 00008200
     TR =0000 00000000 0000ffff 00008b00
     GDT=     0002dcc8 00000047
     IDT=     00000000 0000ffff
     CR0=00000010 CR2=ffffffff CR3=00000000 CR4=00000000
     DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
     DR6=00000000ffff0ff0 DR7=0000000000000400
     EFER=0000000000000000
     Code=c0 74 0f 66 b9 78 95 00 00 66 31 d2 66 31 c0 e9 47 e0 fb 90 <f3> 90 fa fc 66 c3 66 53 66 89 c3
     ERROR: invalid runstate transition: 'internal-error' -> 'colo'

     The reason is that some parts of the device state are skipped when saving device state
     for the slave, if the corresponding data still holds its initial value, such as 0.
     But the device state on the slave may no longer hold the initial value, so after a
     checkpoint cycle the device state becomes inconsistent.
     This happens when the PVM reboots, or when the SVM runs ahead of the PVM during startup.
     Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
     Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
     Signed-off-by: Gonglei <arei.gonglei@huawei.com>
     Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

As described above, some values in the device state are zero and are therefore
skipped during migration. That is fine for normal migration, because on the
destination VM the initial values are zero too. But COLO does more than one
round of migration, so a value may change from non-zero to zero; it is then
skipped again at the next checkpoint and the VMstate becomes inconsistent on
the SVM.

The above error is caused directly by a wrong value of 'async_pf_en_msr':

static const VMStateDescription vmstate_async_pf_msr = {
     .name = "cpu/async_pf_msr",
     .version_id = 1,
     .minimum_version_id = 1,
     .needed = async_pf_msr_needed,
     .fields = (VMStateField[]) {
         VMSTATE_UINT64(env.async_pf_en_msr, X86CPU),
         VMSTATE_END_OF_LIST()
     }
};

static bool async_pf_msr_needed(void *opaque)
{
     X86CPU *cpu = opaque;

     return cpu->env.async_pf_en_msr != 0;
}
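
As a concrete illustration of how this bites COLO (the MSR value below is
invented purely for the example):

/* Illustrative checkpoint timeline, not taken from a real trace:
 *
 *   checkpoint 1: PVM async_pf_en_msr = 0xfee00041 -> needed() returns true,
 *                 the subsection is sent, the SVM now holds 0xfee00041
 *   the guest reboots on the PVM, the MSR is cleared to 0
 *   checkpoint 2: PVM async_pf_en_msr = 0          -> needed() returns false,
 *                 the subsection is skipped, the SVM keeps the stale 0xfee00041
 *
 * The SVM's state now disagrees with the PVM's, which is the inconsistency
 * that the report above attributes to this MSR.
 */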

Some other VMState subsections for registers in CPUX86State have the same
problem, and we can't be sure they won't cause trouble if their values are
incorrect.
So here we simply call qemu_system_reset() to avoid the inconsistency.
Besides, compared with the most time-consuming operation (flushing RAM from
the COLO cache into the SVM), the time spent in qemu_system_reset() seems
acceptable ;)

Another way to fix the problem is to save the VMState while ignoring the
needed() return value (see the patch hunk at the bottom of this page), but
that method is not so graceful.

Thanks,
Hailiang

> Dave
>
>>
>> I guess reset is normally off anyone's radar since it's outside
>> the time anyone cares about, but I guess perhaps the guys trying
>> to make qemu start really quickly would be interested.
>>
>> Dave
>>
>>>
>>> Patch status:
>>> Unreviewed: patch 21,27,28,29,33,38
>>> Updated: patch 31,34,35,37
>>>
>>> TODO:
>>> 1. Checkpoint based on proxy in qemu
>>> 2. The capability of continuous FT
>>> 3. Optimize the VM's downtime during checkpoint
>>>
>>> v15:
>>>   - Go on with the shutdown process if an error is encountered while sending
>>>     the shutdown message to the SVM. (patch 24)
>>>   - Rename qemu_need_skip_netfilter to qemu_netfilter_can_skip and Remove
>>>     some useless comment. (patch 31, Jason)
>>>   - Call object_new_with_props() directly to add filter in
>>>     colo_add_buffer_filter. (patch 34, Jason)
>>>   - Re-implement colo_set_filter_status() based on COLOBufferFilters
>>>     list. (patch 35)
>>>   - Re-implement colo_flush_filter_packets() based on COLOBufferFilters
>>>     list. (patch 37)
>>> v14:
>>>   - Re-implement the network processing based on netfilter (Jason Wang)
>>>   - Rename 'COLOCommand' to 'COLOMessage'. (Markus's suggestion)
>>>   - Split two new patches (patch 27/28) from patch 29
>>>   - Fix some other comments from Dave and Markus.
>>>
>>> v13:
>>>   - Refactor colo_*_cmd helper functions to use 'Error **errp' parameter
>>>    instead of return value to indicate success or failure. (patch 10)
>>>   - Remove the optional error message for COLO_EXIT event. (patch 25)
>>>   - Use semaphore to notify colo/colo incoming loop that failover work is
>>>     finished. (patch 26)
>>>   - Move COLO shutdown related codes to colo.c file. (patch 28)
>>>   - Fix memory leak bug for colo incoming loop. (new patch 31)
>>>   - Re-use some existing helper functions to realize the process of
>>>     saving/loading ram and device. (patch 32)
>>>   - Fix some other comments from Dave and Markus.
>>>
>>> zhanghailiang (38):
>>>    configure: Add parameter for configure to enable/disable COLO support
>>>    migration: Introduce capability 'x-colo' to migration
>>>    COLO: migrate colo related info to secondary node
>>>    migration: Integrate COLO checkpoint process into migration
>>>    migration: Integrate COLO checkpoint process into loadvm
>>>    COLO/migration: Create a new communication path from destination to
>>>      source
>>>    COLO: Implement colo checkpoint protocol
>>>    COLO: Add a new RunState RUN_STATE_COLO
>>>    QEMUSizedBuffer: Introduce two help functions for qsb
>>>    COLO: Save PVM state to secondary side when do checkpoint
>>>    COLO: Load PVM's dirty pages into SVM's RAM cache temporarily
>>>    ram/COLO: Record the dirty pages that SVM received
>>>    COLO: Load VMState into qsb before restore it
>>>    COLO: Flush PVM's cached RAM into SVM's memory
>>>    COLO: Add checkpoint-delay parameter for migrate-set-parameters
>>>    COLO: synchronize PVM's state to SVM periodically
>>>    COLO failover: Introduce a new command to trigger a failover
>>>    COLO failover: Introduce state to record failover process
>>>    COLO: Implement failover work for Primary VM
>>>    COLO: Implement failover work for Secondary VM
>>>    qmp event: Add COLO_EXIT event to notify users while exited from COLO
>>>    COLO failover: Shutdown related socket fd when do failover
>>>    COLO failover: Don't do failover during loading VM's state
>>>    COLO: Process shutdown command for VM in COLO state
>>>    COLO: Update the global runstate after going into colo state
>>>    savevm: Introduce two helper functions for save/find loadvm_handlers
>>>      entry
>>>    migration/savevm: Add new helpers to process the different stages of
>>>      loadvm
>>>    migration/savevm: Export two helper functions for savevm process
>>>    COLO: Separate the process of saving/loading ram and device state
>>>    COLO: Split qemu_savevm_state_begin out of checkpoint process
>>>    net/filter: Add a 'status' property for filter object
>>>    filter-buffer: Accept zero interval
>>>    net: Add notifier/callback for netdev init
>>>    COLO/filter: add each netdev a buffer filter
>>>    COLO: manage the status of buffer filters for PVM
>>>    filter-buffer: make filter_buffer_flush() public
>>>    COLO: flush buffered packets in checkpoint process or exit COLO
>>>    COLO: Add block replication into colo process
>>>
>>>   configure                     |  11 +
>>>   docs/qmp-events.txt           |  16 +
>>>   hmp-commands.hx               |  15 +
>>>   hmp.c                         |  15 +
>>>   hmp.h                         |   1 +
>>>   include/exec/ram_addr.h       |   1 +
>>>   include/migration/colo.h      |  42 ++
>>>   include/migration/failover.h  |  33 ++
>>>   include/migration/migration.h |  16 +
>>>   include/migration/qemu-file.h |   3 +-
>>>   include/net/filter.h          |   5 +
>>>   include/net/net.h             |   4 +
>>>   include/sysemu/sysemu.h       |   9 +
>>>   migration/Makefile.objs       |   2 +
>>>   migration/colo-comm.c         |  76 ++++
>>>   migration/colo-failover.c     |  83 ++++
>>>   migration/colo.c              | 866 ++++++++++++++++++++++++++++++++++++++++++
>>>   migration/migration.c         | 109 +++++-
>>>   migration/qemu-file-buf.c     |  61 +++
>>>   migration/ram.c               | 175 ++++++++-
>>>   migration/savevm.c            | 114 ++++--
>>>   net/filter-buffer.c           |  14 +-
>>>   net/filter.c                  |  40 ++
>>>   net/net.c                     |  33 ++
>>>   qapi-schema.json              | 104 ++++-
>>>   qapi/event.json               |  15 +
>>>   qemu-options.hx               |   4 +-
>>>   qmp-commands.hx               |  23 +-
>>>   stubs/Makefile.objs           |   1 +
>>>   stubs/migration-colo.c        |  54 +++
>>>   trace-events                  |   8 +
>>>   vl.c                          |  31 +-
>>>   32 files changed, 1908 insertions(+), 76 deletions(-)
>>>   create mode 100644 include/migration/colo.h
>>>   create mode 100644 include/migration/failover.h
>>>   create mode 100644 migration/colo-comm.c
>>>   create mode 100644 migration/colo-failover.c
>>>   create mode 100644 migration/colo.c
>>>   create mode 100644 stubs/migration-colo.c
>>>
>>> --
>>> 1.8.3.1
>>>
>>>
>> --
>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
> .
>

Comments

Dr. David Alan Gilbert Feb. 29, 2016, 9:47 a.m. UTC | #1
* Hailiang Zhang (zhang.zhanghailiang@huawei.com) wrote:
> On 2016/2/27 0:36, Dr. David Alan Gilbert wrote:
> >* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> >>* zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> >>>From: root <root@localhost.localdomain>
> >>>
> >>>This is the 15th version of COLO (still only supports periodic checkpoints).
> >>>
> >>>Here is only COLO frame part, you can get the whole codes from github:
> >>>https://github.com/coloft/qemu/commits/colo-v2.6-periodic-mode
> >>>
> >>>There are few changes in this series apart from the network-related part.
> >>
> >>I was looking at the time the guest is paused during COLO and
> >>was surprised to find one of the larger chunks was the time to reset
> >>the guest before loading each checkpoint;  I've traced it part way, the
> >>biggest contributors for my test VM seem to be:
> >>
> >>   3.8ms  pcibus_reset: VGA
> >>   1.8ms  pcibus_reset: virtio-net-pci
> >>   1.5ms  pcibus_reset: virtio-blk-pci
> >>   1.5ms  qemu_devices_reset: piix4_reset
> >>   1.1ms  pcibus_reset: piix3-ide
> >>   1.1ms  pcibus_reset: virtio-rng-pci
> >>
> >>I've not looked deeper yet, but some of these are very silly;
> >>I'm running with -nographic so why it's taking 3.8ms to reset VGA is
> >>going to be interesting.
> >>Also, my only block device is the virtio-blk, so while I understand the
> >>standard PC machine has the IDE controller, why it takes over a millisecond
> >>to reset an unused device is also going to be interesting.
> >
> >OK, so I've dug a bit deeper, and it appears that it's the changes in
> >PCI bars that actually take the time;  every time we do a reset we
> >reset all the BARs, this causes it to do a pci_update_mappings and
> >end up doing a memory_region_del_subregion.
> >Then we load the config space of the PCI device as we do the vmstate_load,
> >and this recreates all the mappings again.
> >
> >I'm not sure what the fix is, but that sounds like it would
> >speed up the checkpoints usefully if we can avoid the map/remap when
> >they're the same.
> >
> 
> Interesting, and thanks for your report.
> 
> We already know qemu_system_reset() is a time-consuming function and we would
> rather not call it here, but if we skip it, we hit a bug that we reported in a
> previous COLO series; below is a copy of the related patch comment:
> 
> [...]
> 
> As described above, some values in the device state are zero and are therefore
> skipped during migration. That is fine for normal migration, because on the
> destination VM the initial values are zero too. But COLO does more than one
> round of migration, so a value may change from non-zero to zero; it is then
> skipped again at the next checkpoint and the VMstate becomes inconsistent on
> the SVM.

Yes, this doesn't really surprise me; a lot of the migration code does weird things
like this, so reset is the safest way.

> The above error is caused directly by wrong value of 'async_pf_en_msr'.
> 
> static const VMStateDescription vmstate_async_pf_msr = {
>     .name = "cpu/async_pf_msr",
>     .version_id = 1,
>     .minimum_version_id = 1,
>     .needed = async_pf_msr_needed,
>     .fields = (VMStateField[]) {
>         VMSTATE_UINT64(env.async_pf_en_msr, X86CPU),
>         VMSTATE_END_OF_LIST()
>     }
> };
> 
> static bool async_pf_msr_needed(void *opaque)
> {
>     X86CPU *cpu = opaque;
> 
>     return cpu->env.async_pf_en_msr != 0;
> }
> 
> Some other VMState subsections for registers in CPUX86State have the same
> problem, and we can't be sure they won't cause trouble if their values are
> incorrect.
> So here we simply call qemu_system_reset() to avoid the inconsistency.
> Besides, compared with the most time-consuming operation (flushing RAM from
> the COLO cache into the SVM), the time spent in qemu_system_reset() seems
> acceptable ;)

I've got a patch where I've tried to multithread the flush - it's made it a little
faster, but not as much as I hoped (~20ms down to ~16ms using 4 cores)
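
For illustration only, here is a hedged sketch of what such a multi-threaded flush
could look like; this is not the actual patch, and colo_flush_range() is an assumed
helper standing in for the per-page copy loop of the real COLO cache flush:

#include "qemu/osdep.h"
#include "qemu/thread.h"
#include "exec/cpu-common.h"       /* ram_addr_t */

#define FLUSH_THREADS 4

typedef struct FlushJob {
    QemuThread thread;
    ram_addr_t start;
    ram_addr_t end;
} FlushJob;

static void *flush_worker(void *opaque)
{
    FlushJob *job = opaque;

    /* Assumed helper: copy every page marked dirty in [start, end) from the
     * COLO cache into the SVM's RAM. */
    colo_flush_range(job->start, job->end);
    return NULL;
}

static void colo_flush_ram_cache_mt(ram_addr_t ram_size)
{
    FlushJob jobs[FLUSH_THREADS];
    ram_addr_t slice = QEMU_ALIGN_UP(ram_size / FLUSH_THREADS, TARGET_PAGE_SIZE);
    int i;

    for (i = 0; i < FLUSH_THREADS; i++) {
        jobs[i].start = i * slice;
        jobs[i].end = MIN(ram_size, (i + 1) * slice);
        qemu_thread_create(&jobs[i].thread, "colo-flush", flush_worker,
                           &jobs[i], QEMU_THREAD_JOINABLE);
    }
    for (i = 0; i < FLUSH_THREADS; i++) {
        qemu_thread_join(&jobs[i].thread);
    }
}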

> Another way to fix the problem is to save the VMState while ignoring the
> needed() return value, but that method is not so graceful:
> diff --git a/migration/vmstate.c b/migration/vmstate.c
> index e5388f0..7d15bba 100644
> --- a/migration/vmstate.c
> +++ b/migration/vmstate.c
> @@ -409,7 +409,7 @@ static void vmstate_subsection_save(QEMUFile *f, const VMStateDescription *vmsd,
>      bool subsection_found = false;
> 
>      while (sub && sub->needed) {
> -        if (sub->needed(opaque)) {
> +        if (sub->needed(opaque) || migrate_in_colo_state()) {
>              const VMStateDescription *vmsd = sub->vmsd;
>              uint8_t len;

Maybe, but I suspect this will also find other strange cases in devices.
For example, without the reset I wouldn't really trust devices to load the
new state in; they wouldn't have been tested with all the states they
might have left themselves in previously.

Dave



> [...]
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Zhanghailiang Feb. 29, 2016, 12:16 p.m. UTC | #2
On 2016/2/29 17:47, Dr. David Alan Gilbert wrote:
> * Hailiang Zhang (zhang.zhanghailiang@huawei.com) wrote:
>> On 2016/2/27 0:36, Dr. David Alan Gilbert wrote:
>>> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
>>>> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> [...]
>
> I've got a patch where I've tried to multithread the flush - it's made it a little
> faster, but not as much as I hoped (~20ms down to ~16ms using 4 cores)
>

Hmm, that seems like a good idea. After switching to COLO (hybrid) mode we will,
in most cases, get many more dirtied pages than in periodic mode, because the
delay between two checkpoints is usually longer.
Multi-threaded flushing may gain much more in that case, but I suspect that in
some bad cases users still won't be able to bear the pause time.

Actually, we have thought about this problem for a long time.
In our early tests based on the kernel COLO-proxy we easily got flushing times
of more than one second; IMHO, users can't bear such a long VM pause time if
they choose to use COLO.

We have designed another approach based on userfault's page-miss capability.
The basic idea is to convert the flush into a marking step, so the actual
flushing is processed while the SVM is running. For now it is only an idea
that we'd like to verify first. (I'm not quite sure whether userfault's
page-miss feature performs well when we use it to mark one page as MISS at a time.)
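
To make this a bit more concrete, here is a hedged sketch of the fault-handling side
of that idea, assuming the SVM's RAM has been registered with userfaultfd in MISSING
mode and that, at checkpoint time, dirty pages are merely invalidated (for example via
madvise(MADV_DONTNEED)) instead of being copied. colo_cache and host_ram_start are
assumed variables; this only illustrates the userfaultfd API, it is not COLO code:

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <unistd.h>

extern void *colo_cache;        /* assumed: cached copy of the PVM's RAM   */
extern void *host_ram_start;    /* assumed: host address of the SVM's RAM  */

static void *colo_page_miss_thread(void *opaque)
{
    int uffd = (intptr_t)opaque;     /* fd from userfaultfd(2), already registered */
    struct uffd_msg msg;
    struct uffdio_copy copy;
    uintptr_t addr;

    while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
        if (msg.event != UFFD_EVENT_PAGEFAULT) {
            continue;
        }
        addr = msg.arg.pagefault.address & ~(uintptr_t)(TARGET_PAGE_SIZE - 1);

        /* Resolve the miss by copying the page from the COLO cache on demand. */
        copy.dst = addr;
        copy.src = (uintptr_t)colo_cache + (addr - (uintptr_t)host_ram_start);
        copy.len = TARGET_PAGE_SIZE;
        copy.mode = 0;
        ioctl(uffd, UFFDIO_COPY, &copy);    /* fills the page and wakes the vCPU */
    }
    return NULL;
}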


Thanks,
Hailiang

> [...]
Dr. David Alan Gilbert Feb. 29, 2016, 1:04 p.m. UTC | #3
* Hailiang Zhang (zhang.zhanghailiang@huawei.com) wrote:
> On 2016/2/29 17:47, Dr. David Alan Gilbert wrote:
> >* Hailiang Zhang (zhang.zhanghailiang@huawei.com) wrote:
> >>On 2016/2/27 0:36, Dr. David Alan Gilbert wrote:
> >>>* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:

> >I've got a patch where I've tried to multithread the flush - it's made it a little
> >faster, but not as much as I hoped (~20ms down to ~16ms using 4 cores)
> >
> 
> Hmm, that seems like a good idea. After switching to COLO (hybrid) mode we will,
> in most cases, get many more dirtied pages than in periodic mode, because the
> delay between two checkpoints is usually longer.
> Multi-threaded flushing may gain much more in that case, but I suspect that in
> some bad cases users still won't be able to bear the pause time.
> 
> Actually, we have thought about this problem for a long time.
> In our early tests based on the kernel COLO-proxy we easily got flushing times
> of more than one second; IMHO, users can't bear such a long VM pause time if
> they choose to use COLO.

Yes, that's just too long; although only solving the 'flushing' time isn't enough in those
cases, because the same cases will probably need to transfer lots of RAM over the wire as well.

> We have designed another approach based on userfault's page-miss capability.
> The basic idea is to convert the flush into a marking step, so the actual
> flushing is processed while the SVM is running. For now it is only an idea
> that we'd like to verify first. (I'm not quite sure whether userfault's
> page-miss feature performs well when we use it to mark one page as MISS at a time.)

Yes, it's a different trade off, slower execution, but no flush time.

Dave

> 
> 
> Thanks,
> Hailiang
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Patch

diff --git a/migration/vmstate.c b/migration/vmstate.c
index e5388f0..7d15bba 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -409,7 +409,7 @@  static void vmstate_subsection_save(QEMUFile *f, const VMStateDescription *vmsd,
      bool subsection_found = false;

      while (sub && sub->needed) {
-        if (sub->needed(opaque)) {
+        if (sub->needed(opaque) || migrate_in_colo_state()) {
              const VMStateDescription *vmsd = sub->vmsd;
              uint8_t len;