Message ID | 20211231120127.22394-1-pankaj.gupta.linux@gmail.com |
---|---|
State | New, archived |
Series | [RFC] virtio_pmem: enable live migration support |
On 31.12.21 13:01, Pankaj Gupta wrote:
> From: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
>
> Enable live migration support for virtio-pmem device.
> Tested this: with live migration on same host.
>
> Need suggestion on below points to support virtio-pmem live migration
> between two separate host systems:

I assume emulated NVDIMMs would have the exact same issue, right?

There are two cases to consider, I think:

1) Backing storage is migrated manually to the destination (i.e., a file
   that is copied/moved/transmitted during migration)

2) Backing storage is located on shared network storage (i.e., a file
   that is not copied during migration)

IIRC you're concerned about 2).

>
> - There is still possibility of stale page cache page at the
>   destination host which we cannot invalidate currently as done in 1]
>   for write-back mode because virtio-pmem memory backend file is mmaped
>   in guest address space and invalidating corresponding page cache pages
>   would also fault all the other userspace process mappings on the same file.
>   Or we make it strict no other process would mmap this backing file?

I'd have assumed that a simple fsync on the src once migration is about
to switch over (e.g., pre_save/post_save handler) should be enough to
trigger writeback to the backing storage, at which point the dst can
take over. So handling the src is easy.

So is the issue that the dst might still have stale pagecache
information, because it already accessed some of that file previously,
correct?

>
> -- In commit 1] we first fsync and then invalidate all the pages from
>    destination page cache. fsync would sync the stale dirty page cache
>    page. Is this the right thing to do, as we might end up in data
>    discrepancy?

It would be weird if

a) The src used/modified the file and fsync'ed the modifications back to
   backing storage
b) The dst has stale dirty pagecache pages that would result in a
   modification of backing storage during fsync()

I mean, that would be fundamentally broken, because the fsync() would
corrupt the file. So I assume in a sane environment, the dst could only
have stale clean pagecache pages. And we'd have to get rid of these to
re-read everything from the file.

IIRC, an existing mmap of the file on the dst should not really be
problematic *as long as* we didn't actually access file content that way
and fault the pages in. So *maybe*, if we do the POSIX_FADV_DONTNEED
on the dst before accessing file content via the mmap, there shouldn't
be an issue. Unless the mmap itself is already problematic.

I think we can assume that once QEMU starts on the dst and wants to mmap
the file, it's not mapped into any other process yet. vhost-user will
only mmap *after* being told by QEMU about the mmap region and the
location in GPA.

So if the existing QEMU mmap is not problematic, it should be easy: just
do the POSIX_FADV_DONTNEED on the destination when initializing
virtio-pmem. If we have to POSIX_FADV_DONTNEED *before* performing the
mmap, we might need a way to tell QEMU to POSIX_FADV_DONTNEED before
doing the mmap. This could be a parameter for memory-backend-file like
"flush=on", or doing it implicitly when we're told to expect an incoming
migration.
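To make the destination-side sequence above concrete (write back, drop the
page cache, only then mmap), here is a minimal plain-C sketch. It is an
illustration only, not code from the patch or from QEMU; the helper name and
error handling are made up:

#define _POSIX_C_SOURCE 200809L

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical destination-side sequence: write back and drop the host's
 * page cache for the backing file *before* it is mmap'ed into the guest
 * address space, so the guest re-reads everything from the (shared)
 * backing storage.
 */
static void *map_pmem_backing(int fd, size_t size)
{
    /* POSIX_FADV_DONTNEED does not discard dirty pages, hence the fsync. */
    if (fsync(fd) < 0) {
        perror("fsync");
        return NULL;
    }

    /* offset == 0 and len == 0 cover the whole file. */
    int ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (ret != 0) {
        fprintf(stderr, "posix_fadvise: %s\n", strerror(ret));
        return NULL;
    }

    /* Only now establish the shared mapping used as the virtio-pmem region. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}

As discussed above, this only behaves sanely if the destination holds at most
clean pagecache pages; if stale dirty pages exist, the fsync() here would
write them back over a file the source has modified in the meantime, which is
exactly the corruption scenario being debated.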
Thank you David for replying!

>> From: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
>>
>> Enable live migration support for virtio-pmem device.
>> Tested this: with live migration on same host.
>>
>> Need suggestion on below points to support virtio-pmem live migration
>> between two separate host systems:
>
> I assume emulated NVDIMMs would have the exact same issue, right?
>
> There are two cases to consider, I think:
>
> 1) Backing storage is migrated manually to the destination (i.e., a file
>    that is copied/moved/transmitted during migration)
>
> 2) Backing storage is located on shared network storage (i.e., a file
>    that is not copied during migration)
>
> IIRC you're concerned about 2).

Yes.

>>
>> - There is still possibility of stale page cache page at the
>>   destination host which we cannot invalidate currently as done in 1]
>>   for write-back mode because virtio-pmem memory backend file is mmaped
>>   in guest address space and invalidating corresponding page cache pages
>>   would also fault all the other userspace process mappings on the same file.
>>   Or we make it strict no other process would mmap this backing file?
>
> I'd have assumed that a simple fsync on the src once migration is about
> to switch over (e.g., pre_save/post_save handler) should be enough to
> trigger writeback to the backing storage, at which point the dst can
> take over. So handling the src is easy.
>
> So is the issue that the dst might still have stale pagecache
> information, because it already accessed some of that file previously,
> correct?

Yes.

>>
>> -- In commit 1] we first fsync and then invalidate all the pages from
>>    destination page cache. fsync would sync the stale dirty page cache
>>    page. Is this the right thing to do, as we might end up in data
>>    discrepancy?
>
> It would be weird if
>
> a) The src used/modified the file and fsync'ed the modifications back to
>    backing storage
> b) The dst has stale dirty pagecache pages that would result in a
>    modification of backing storage during fsync()

Yes. That's what I thought we are currently doing with commit 1]; maybe
Stefan can confirm. If yes, it is broken as well.

>
> I mean, that would be fundamentally broken, because the fsync() would
> corrupt the file. So I assume in a sane environment, the dst could only
> have stale clean pagecache pages. And we'd have to get rid of these to
> re-read everything from the file.

In case of write-back cache mode, we could still have stale dirty pages
at the destination host, and a destination fsync is not the right thing
to do. We need to invalidate these pages. (Can we invalidate dirty pages
resident in the page cache with POSIX_FADV_DONTNEED as well? The man
page says we cannot, unless I misunderstood it.)

>
> IIRC, an existing mmap of the file on the dst should not really be
> problematic *as long as* we didn't actually access file content that way
> and fault the pages in. So *maybe*, if we do the POSIX_FADV_DONTNEED
> on the dst before accessing file content via the mmap, there shouldn't
> be an issue. Unless the mmap itself is already problematic.

mmap with shared=ON might result in stale dirty page cache pages?

>
> I think we can assume that once QEMU starts on the dst and wants to mmap
> the file, it's not mapped into any other process yet. vhost-user will
> only mmap *after* being told by QEMU about the mmap region and the
> location in GPA.

Maybe we have an old stale dirty page cache page even if there is no
mmaping process alive before mmaping the virtio-pmem backend file on the
destination?

>
> So if the existing QEMU mmap is not problematic, it should be easy: just
> do the POSIX_FADV_DONTNEED on the destination when initializing
> virtio-pmem. If we have to POSIX_FADV_DONTNEED *before* performing the
> mmap, we might need a way to tell QEMU to POSIX_FADV_DONTNEED before
> doing the mmap. This could be a parameter for memory-backend-file like
> "flush=on", or doing it implicitly when we're told to expect an incoming
> migration.

Yes, that's what I had in mind. Just wanted to confirm some of my doubts
for a correct implementation.

As I see it, page cache coherency across multiple host systems with live
migration needs to be addressed, or such scenarios need to be avoided.

Thanks,
Pankaj

> --
> Thanks,
>
> David / dhildenb
>>
>> I mean, that would be fundamentally broken, because the fsync() would
>> corrupt the file. So I assume in a sane environment, the dst could only
>> have stale clean pagecache pages. And we'd have to get rid of these to
>> re-read everything from the file.
>
> In case of write-back cache mode, we could still have stale dirty pages
> at the destination host, and a destination fsync is not the right thing
> to do. We need to invalidate these pages. (Can we invalidate dirty pages
> resident in the page cache with POSIX_FADV_DONTNEED as well? The man
> page says we cannot, unless I misunderstood it.)

I think you'd have to fsync + POSIX_FADV_DONTNEED. But I am still
confused how we could end up with dirty pagecache pages on the
destination. In my opinion, there should only be clean pagecache pages
-- can someone enlighten me? :)

>>
>> IIRC, an existing mmap of the file on the dst should not really be
>> problematic *as long as* we didn't actually access file content that way
>> and fault the pages in. So *maybe*, if we do the POSIX_FADV_DONTNEED
>> on the dst before accessing file content via the mmap, there shouldn't
>> be an issue. Unless the mmap itself is already problematic.
>
> mmap with shared=ON might result in stale dirty page cache pages?

But only if actually accessing memory, especially writing to it, no?

>
>>
>> I think we can assume that once QEMU starts on the dst and wants to mmap
>> the file, it's not mapped into any other process yet. vhost-user will
>> only mmap *after* being told by QEMU about the mmap region and the
>> location in GPA.
>
> Maybe we have an old stale dirty page cache page even if there is no
> mmaping process alive before mmaping the virtio-pmem backend file on the
> destination?

But how could that happen in a sane environment? As I said, any
writeback on that file from the destination would actually corrupt the
file that has been used+modified on the source in the meantime, no?
>>> I mean, that would be fundamentally broken, because the fsync() would
>>> corrupt the file. So I assume in a sane environment, the dst could only
>>> have stale clean pagecache pages. And we'd have to get rid of these to
>>> re-read everything from the file.
>>
>> In case of write-back cache mode, we could still have stale dirty pages
>> at the destination host, and a destination fsync is not the right thing
>> to do. We need to invalidate these pages. (Can we invalidate dirty pages
>> resident in the page cache with POSIX_FADV_DONTNEED as well? The man
>> page says we cannot, unless I misunderstood it.)
>
> I think you'd have to fsync + POSIX_FADV_DONTNEED. But I am still
> confused how we could end up with dirty pagecache pages on the
> destination. In my opinion, there should only be clean pagecache pages
> -- can someone enlighten me? :)

Because of activity on the page cache pages corresponding to the mmap
region in the past, which is not synced yet or not reclaimed yet. Maybe
this is hypothetical or not possible; happy to learn.

>>> IIRC, an existing mmap of the file on the dst should not really be
>>> problematic *as long as* we didn't actually access file content that way
>>> and fault the pages in. So *maybe*, if we do the POSIX_FADV_DONTNEED
>>> on the dst before accessing file content via the mmap, there shouldn't
>>> be an issue. Unless the mmap itself is already problematic.
>>
>> mmap with shared=ON might result in stale dirty page cache pages?
>
> But only if actually accessing memory, especially writing to it, no?

Yes.

>>> I think we can assume that once QEMU starts on the dst and wants to mmap
>>> the file, it's not mapped into any other process yet. vhost-user will
>>> only mmap *after* being told by QEMU about the mmap region and the
>>> location in GPA.
>>
>> Maybe we have an old stale dirty page cache page even if there is no
>> mmaping process alive before mmaping the virtio-pmem backend file on the
>> destination?
>
> But how could that happen in a sane environment? As I said, any
> writeback on that file from the destination would actually corrupt the
> file that has been used+modified on the source in the meantime, no?

> --
> Thanks,
>
> David / dhildenb
On 12.01.22 17:08, Pankaj Gupta wrote:
>>>> I mean, that would be fundamentally broken, because the fsync() would
>>>> corrupt the file. So I assume in a sane environment, the dst could only
>>>> have stale clean pagecache pages. And we'd have to get rid of these to
>>>> re-read everything from the file.
>>>
>>> In case of write-back cache mode, we could still have stale dirty pages
>>> at the destination host, and a destination fsync is not the right thing
>>> to do. We need to invalidate these pages. (Can we invalidate dirty pages
>>> resident in the page cache with POSIX_FADV_DONTNEED as well? The man
>>> page says we cannot, unless I misunderstood it.)
>>
>> I think you'd have to fsync + POSIX_FADV_DONTNEED. But I am still
>> confused how we could end up with dirty pagecache pages on the
>> destination. In my opinion, there should only be clean pagecache pages
>> -- can someone enlighten me? :)
>
> Because of activity on the page cache pages corresponding to the mmap
> region in the past, which is not synced yet or not reclaimed yet. Maybe
> this is hypothetical or not possible; happy to learn.

Right, but assume the following *sane* sequence:

#1 H0 starts and runs VM.
#2 H0 migrates VM to H1.
#3 H1 runs VM.
#4 H1 migrates VM to H0.
#5 H0 runs VM.

We'd expect a proper fsync during #2, writing back any dirty pages to
the memory backend. Otherwise, #3 would already be broken. Similarly,
we'd expect a proper fsync during #4.

I assume during #4 we could find clean pagecache pages that are actually
invalid, because the underlying file was changed by H1. So we have to
make sure to invalidate all pagecache pages (all clean).
>>>>> I mean, that would be fundamentally broken, because the fsync() would
>>>>> corrupt the file. So I assume in a sane environment, the dst could only
>>>>> have stale clean pagecache pages. And we'd have to get rid of these to
>>>>> re-read everything from the file.
>>>>
>>>> In case of write-back cache mode, we could still have stale dirty pages
>>>> at the destination host, and a destination fsync is not the right thing
>>>> to do. We need to invalidate these pages. (Can we invalidate dirty pages
>>>> resident in the page cache with POSIX_FADV_DONTNEED as well? The man
>>>> page says we cannot, unless I misunderstood it.)
>>>
>>> I think you'd have to fsync + POSIX_FADV_DONTNEED. But I am still
>>> confused how we could end up with dirty pagecache pages on the
>>> destination. In my opinion, there should only be clean pagecache pages
>>> -- can someone enlighten me? :)
>>
>> Because of activity on the page cache pages corresponding to the mmap
>> region in the past, which is not synced yet or not reclaimed yet. Maybe
>> this is hypothetical or not possible; happy to learn.
>
> Right, but assume the following *sane* sequence:
>
> #1 H0 starts and runs VM.
> #2 H0 migrates VM to H1.
> #3 H1 runs VM.
> #4 H1 migrates VM to H0.
> #5 H0 runs VM.
>
> We'd expect a proper fsync during #2, writing back any dirty pages to
> the memory backend. Otherwise, #3 would already be broken. Similarly,
> we'd expect a proper fsync during #4.
>
> I assume during #4 we could find clean pagecache pages that are actually
> invalid, because the underlying file was changed by H1. So we have to
> make sure to invalidate all pagecache pages (all clean).

Yes, you mean fsync on the source host before migration starts. My point
is a scenario like another process mmaping the same backend file on the
destination host, and/or the guest/QEMU crashing abruptly.
>>>>>> I mean, that would be fundamentally broken, because the fsync() would
>>>>>> corrupt the file. So I assume in a sane environment, the dst could only
>>>>>> have stale clean pagecache pages. And we'd have to get rid of these to
>>>>>> re-read everything from the file.
>>>>>
>>>>> In case of write-back cache mode, we could still have stale dirty pages
>>>>> at the destination host, and a destination fsync is not the right thing
>>>>> to do. We need to invalidate these pages. (Can we invalidate dirty pages
>>>>> resident in the page cache with POSIX_FADV_DONTNEED as well? The man
>>>>> page says we cannot, unless I misunderstood it.)
>>>>
>>>> I think you'd have to fsync + POSIX_FADV_DONTNEED. But I am still
>>>> confused how we could end up with dirty pagecache pages on the
>>>> destination. In my opinion, there should only be clean pagecache pages
>>>> -- can someone enlighten me? :)
>>>
>>> Because of activity on the page cache pages corresponding to the mmap
>>> region in the past, which is not synced yet or not reclaimed yet. Maybe
>>> this is hypothetical or not possible; happy to learn.
>>
>> Right, but assume the following *sane* sequence:
>>
>> #1 H0 starts and runs VM.
>> #2 H0 migrates VM to H1.
>> #3 H1 runs VM.
>> #4 H1 migrates VM to H0.
>> #5 H0 runs VM.
>>
>> We'd expect a proper fsync during #2, writing back any dirty pages to
>> the memory backend. Otherwise, #3 would already be broken. Similarly,
>> we'd expect a proper fsync during #4.
>>
>> I assume during #4 we could find clean pagecache pages that are actually
>> invalid, because the underlying file was changed by H1. So we have to
>> make sure to invalidate all pagecache pages (all clean).
>
> Yes, you mean fsync on the source host before migration starts. My point
> is a scenario like another process mmaping the same backend file on the
> destination host, and/or the guest/QEMU crashing abruptly.

In that case we should not start the guest if we cannot invalidate all
the corresponding page cache pages before starting it, i.e., before
mmaping the virtio-pmem backend file.

Thank you for the discussion!

Best regards,
Pankaj
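To make the "do not start the guest" policy from the message above concrete,
here is a hypothetical QEMU-flavoured sketch. The helper name, the error
messages, and the idea of calling it from virtio_pmem_realize() before the
backend is mapped are assumptions for illustration; none of this is in the
posted patch:

#include "qemu/osdep.h"
#include "qapi/error.h"

/*
 * Hypothetical helper (not in the posted patch): invalidate the host page
 * cache of the backing file before it is mmap'ed for the guest.  If this
 * fails, the caller would fail realization so the guest does not start on
 * top of potentially stale page cache contents.
 */
static bool virtio_pmem_invalidate_backing(int fd, Error **errp)
{
    if (fsync(fd) < 0) {
        error_setg_errno(errp, errno,
                         "virtio-pmem: fsync of backing file failed");
        return false;
    }

    /* posix_fadvise() returns an errno value directly (0 on success). */
    int ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (ret != 0) {
        error_setg_errno(errp, ret,
                         "virtio-pmem: cannot invalidate host page cache "
                         "of backing file; refusing to start the guest");
        return false;
    }
    return true;
}

The file descriptor would presumably come from the file-backed memory backend
(e.g., via memory_region_get_fd()); whether failing realize is the right
policy, and how crashes of a previous QEMU are detected, remain the open
questions from the thread.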
diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
index d1aeb90a31..a19619a387 100644
--- a/hw/virtio/virtio-pmem.c
+++ b/hw/virtio/virtio-pmem.c
@@ -123,6 +123,7 @@ static void virtio_pmem_realize(DeviceState *dev, Error **errp)
     }
 
     host_memory_backend_set_mapped(pmem->memdev, true);
+    vmstate_register_ram(&pmem->memdev->mr, DEVICE(pmem));
     virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
                 sizeof(struct virtio_pmem_config));
     pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
@@ -133,6 +134,7 @@ static void virtio_pmem_unrealize(DeviceState *dev)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOPMEM *pmem = VIRTIO_PMEM(dev);
 
+    vmstate_unregister_ram(&pmem->memdev->mr, DEVICE(pmem));
     host_memory_backend_set_mapped(pmem->memdev, false);
     virtio_delete_queue(pmem->rq_vq);
     virtio_cleanup(vdev);
@@ -157,6 +159,16 @@ static MemoryRegion *virtio_pmem_get_memory_region(VirtIOPMEM *pmem,
     return &pmem->memdev->mr;
 }
 
+static const VMStateDescription vmstate_virtio_pmem = {
+    .name = "virtio-pmem",
+    .minimum_version_id = 1,
+    .version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_VIRTIO_DEVICE,
+        VMSTATE_END_OF_LIST()
+    },
+};
+
 static Property virtio_pmem_properties[] = {
     DEFINE_PROP_UINT64(VIRTIO_PMEM_ADDR_PROP, VirtIOPMEM, start, 0),
     DEFINE_PROP_LINK(VIRTIO_PMEM_MEMDEV_PROP, VirtIOPMEM, memdev,
@@ -171,6 +183,7 @@ static void virtio_pmem_class_init(ObjectClass *klass, void *data)
     VirtIOPMEMClass *vpc = VIRTIO_PMEM_CLASS(klass);
 
     device_class_set_props(dc, virtio_pmem_properties);
+    dc->vmsd = &vmstate_virtio_pmem;
     vdc->realize = virtio_pmem_realize;
     vdc->unrealize = virtio_pmem_unrealize;
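The diff above only registers the RAM block for migration and wires up a
minimal VMStateDescription. For the source-side flush discussed earlier in
the thread (an fsync once migration is about to switch over), one conceivable
extension is a .pre_save hook on the same VMStateDescription. The following
is a sketch under the assumption that the memdev is file-backed; it is not
part of the posted patch:

/* Sketch only: flush guest-visible modifications to the backing file when
 * the device state is saved, so the destination reads up-to-date data from
 * the shared backing storage. */
static int virtio_pmem_pre_save(void *opaque)
{
    VirtIOPMEM *pmem = VIRTIO_PMEM(opaque);
    int fd = memory_region_get_fd(&pmem->memdev->mr);

    if (fd < 0) {
        return 0;   /* not file-backed, nothing to flush */
    }
    return qemu_fdatasync(fd);
}

static const VMStateDescription vmstate_virtio_pmem = {
    .name = "virtio-pmem",
    .minimum_version_id = 1,
    .version_id = 1,
    .pre_save = virtio_pmem_pre_save,   /* assumed addition, not in the patch */
    .fields = (VMStateField[]) {
        VMSTATE_VIRTIO_DEVICE,
        VMSTATE_END_OF_LIST()
    },
};

Whether pre_save is late enough in the switchover, and whether the
destination additionally needs to drop its page cache before mmaping the
file, is exactly what the discussion above tries to settle.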