Message ID | 20241206230838.1111496-1-peterx@redhat.com
---|---
Series | migration/block: disk activation rewrite
Peter Xu <peterx@redhat.com> writes:

> CI: https://gitlab.com/peterx/qemu/-/pipelines/1577280033
> (note: it's a pipeline of two patchsets, to save CI credits and time)
>
> v1: https://lore.kernel.org/r/20241204005138.702289-1-peterx@redhat.com
>
> This is v2 of the series, with the RFC tag removed, because my goal is to
> have these patches (or some newer version) merged.
>
> The major change is that I merged the last three patches and made quite a
> few changes here and there to make sure the global disk activation status
> is always consistent.  The whole idea is still the same, so I'd say a
> changelog won't help much.
>
> I also temporarily dropped Fabiano's ping-pong test cases to avoid
> different versions floating on the list (as I know a new version is
> coming at some point; Fabiano: you're taking over the 10.0 pulls, so I
> assume you're aware and there's no concern about the order of merges).
> I'll review the test cases separately when they're ready, but this series
> is still tested with that ping-pong test and it keeps working.
>
> I started looking at this problem as a whole when reviewing Fabiano's
> series, especially the patch (for a QEMU crash [1]):
>
> https://lore.kernel.org/r/20241125144612.16194-5-farosas@suse.de
>
> The proposed patch could work, but adding such a side effect to migration
> is unwanted.  So I started to think about whether we can provide a
> cleaner approach, because migration doesn't need the disks to be active
> to work at all.  Hence we should try to avoid adding a migration ABI
> (which doesn't matter now, but may matter some day) on disk activation
> status into the prepare phase.  Migration should happen with disks
> inactivated.
>
> It's also a pure wish that bdrv_inactivate_all() could be benign when
> called even if all disks are already inactive; then the bug would also be
> gone.  After all, a similar call to bdrv_activate_all() on already-active
> disks is fine.  I hope that wish is still fair, but I don't know the
> block layer well enough to say anything meaningful there.
>
> And while I was looking at that, I found more things spread all over the
> place around disk activation.  I decided to clean all of them up, while
> hopefully fixing the QEMU crash [1] too.
>
> For this v2, I did some more tests; I want to make sure all the past
> paths keep working, at least in failure or cancel races, and also in
> postcopy failure cases.  So I ran the tests below and they all pass (when
> I say "emulated" below, I mean I hacked something to trigger those races
> / rare failures, because they aren't easy to trigger with a vanilla
> binary):
>
> - Tested generic migrate_cancel during precopy; disk activation won't be
>   affected.  Disk status reports correct values in tracepoints.
>
> - Tested Fabiano's ping-pong migration tests on a PAUSED-state VM.
>
> - Emulated precopy failure before sending the non-iterable data; disk
>   inactivation won't happen, and activation also won't trigger after
>   migration cleanups (even if activation on top of active disks is
>   benign, I checked traces to make sure it provides consistent disk
>   status, skipping the activation).
>
> - Emulated precopy failure right after sending the non-iterable data.
>   Disks will be inactivated, but can then be reactivated properly before
>   the VM starts.
>
> - Emulated postcopy failure when sending the packed data (which is after
>   the disks are invalidated), making sure the src VM gets back to live
>   properly and re-activates the disks before starting.
>
> - Emulated concurrent migrate_cancel at the end of migration_completion()
>   of precopy, after disks are inactivated.  Disks can be reactivated
>   properly.
>
>   NOTE: here if dest QEMU didn't quit before migrate_cancel,
>   bdrv_activate_all() can crash src QEMU.  This behavior should be the
>   same before/after this patch.
>
> Comments welcomed, thanks.
>
> [1] https://gitlab.com/qemu-project/qemu/-/issues/2395
>
> Peter Xu (6):
>   migration: Add helper to get target runstate
>   qmp/cont: Only activate disks if migration completed
>   migration/block: Make late-block-active the default
>   migration/block: Apply late-block-active behavior to postcopy
>   migration/block: Fix possible race with block_inactive
>   migration/block: Rewrite disk activation
>
>  include/migration/misc.h |   4 ++
>  migration/migration.h    |   6 +-
>  migration/block-active.c |  94 +++++++++++++++++++++++++++
>  migration/colo.c         |   2 +-
>  migration/migration.c    | 136 +++++++++++++++------------------------
>  migration/savevm.c       |  46 ++++++-------
>  monitor/qmp-cmds.c       |  22 +++----
>  migration/meson.build    |   1 +
>  migration/trace-events   |   3 +
>  9 files changed, 188 insertions(+), 126 deletions(-)
>  create mode 100644 migration/block-active.c

Queued, thanks!
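
For readers of the archive who haven't opened the patches, the central idea the cover letter describes — keep one globally consistent notion of whether the disks are active, so that repeated activate/inactivate calls become benign no-ops — can be sketched roughly as below. This is a hypothetical illustration, not the code merged as migration/block-active.c: the helper names my_block_activate()/my_block_inactivate() are made up for the sketch, and the bdrv_activate_all()/bdrv_inactivate_all() prototypes are assumed from the general QEMU block API.

```c
/*
 * Hypothetical sketch of a global disk-activation guard.  Not the code
 * merged as migration/block-active.c; helper names are invented and the
 * bdrv_*_all() prototypes are assumptions.
 */
#include "qemu/osdep.h"
#include "block/block.h"
#include "qapi/error.h"

static bool disks_active = true;     /* QEMU starts with disks active */

/* Activate all disks unless they are already active (benign if repeated). */
static bool my_block_activate(Error **errp)
{
    Error *local_err = NULL;

    if (disks_active) {
        return true;                 /* nothing to do, state stays consistent */
    }
    bdrv_activate_all(&local_err);
    if (local_err) {
        error_propagate(errp, local_err);
        return false;
    }
    disks_active = true;
    return true;
}

/* Inactivate all disks unless already inactive (benign if repeated). */
static bool my_block_inactivate(void)
{
    if (!disks_active) {
        return true;                 /* skip the crash-prone double inactivation */
    }
    if (bdrv_inactivate_all() != 0) {
        return false;
    }
    disks_active = false;
    return true;
}
```

With such a guard in place, a second inactivation attempt on already-inactive disks (the crash scenario from [1]) and the reverse case of activating already-active disks (which the cover letter notes is already fine) both collapse into no-ops, which is the "pure wish" expressed above.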