Message ID | 20201119125940.20017-3-andrey.gruzdev@virtuozzo.com (mailing list archive) |
---|---|
State | New, archived |
Series | UFFD write-tracking migration/snapshots |
On Thu, Nov 19, 2020 at 03:59:35PM +0300, Andrey Gruzdev wrote:
> +/**
> + * uffd_register_memory: register memory range with UFFD
> + *
> + * Returns 0 in case of success, negative value on error
> + *
> + * @uffd: UFFD file descriptor
> + * @start: starting virtual address of memory range
> + * @length: length of memory range
> + * @track_missing: generate events on missing-page faults
> + * @track_wp: generate events on write-protected-page faults
> + */
> +static int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
> +        bool track_missing, bool track_wp)
> +{
> +    struct uffdio_register uffd_register;
> +
> +    uffd_register.range.start = start;
> +    uffd_register.range.len = length;
> +    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
> +                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
> +
> +    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
> +        error_report("uffd_register_memory() failed: "
> +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
> +                start, length, uffd_register.mode, errno);
> +        return -1;
> +    }
> +
> +    return 0;
> +}

These functions look good; we should even be able to refactor the existing
ones, e.g., ram_block_enable_notify(), but we can also do that later. As a
start, we can move these helpers into some common files under util/.

[...]

> +/**
> + * ram_write_tracking_start: start UFFD-WP memory tracking
> + *
> + * Returns 0 for success or negative value in case of error
> + *
> + */
> +int ram_write_tracking_start(void)
> +{

We need to be slightly careful about unwinding in this function, because if it
fails somehow we don't want to crash the existing, running good VM... more
below.

> +    int uffd;
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +
> +    /* Open UFFD file descriptor */
> +    uffd = uffd_create_fd();
> +    if (uffd < 0) {
> +        return uffd;
> +    }
> +    rs->uffdio_fd = uffd;
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        /* Nothing to do with read-only and MMIO-writable regions */
> +        if (bs->mr->readonly || bs->mr->rom_device) {
> +            continue;
> +        }
> +
> +        /* Register block memory with UFFD to track writes */
> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
> +                bs->max_length, false, true)) {
> +            goto fail;
> +        }
> +        /* Apply UFFD write protection to the block memory range */
> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
> +                bs->max_length, true)) {

Logically we need to undo the previous register here first; however,
userfaultfd will auto-clean these on close(fd), so it's ok. It's still better
to unwind the protection of pages, I think. So...

> +            goto fail;
> +        }
> +        bs->flags |= RAM_UF_WRITEPROTECT;
> +
> +        info_report("UFFD-WP write-tracking enabled: "
> +                "block_id=%s page_size=%zu start=%p length=%lu "
> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> +                bs->mr->nonvolatile, bs->mr->rom_device);
> +    }
> +
> +    return 0;
> +
> +fail:
> +    uffd_close_fd(uffd);

... maybe do the unprotect here as well: as long as any of the above steps
failed, we need to remember to unprotect all the protected pages (or just
unprotect everything), and also clear the RAM_UF_WRITEPROTECT flags that were
set.

> +    rs->uffdio_fd = -1;
> +    return -1;
> +}
> +
> +/**
> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection

This doesn't remove the protections yet?

We should remove those. For a successful snapshot we can avoid that (if we
want such an optimization), or imho we'd better unprotect everything, just in
case the user interrupted the snapshot.

> + */
> +void ram_write_tracking_stop(void)
> +{
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +    assert(rs->uffdio_fd >= 0);
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
> +            continue;
> +        }
> +        info_report("UFFD-WP write-tracking disabled: "
> +                "block_id=%s page_size=%zu start=%p length=%lu "
> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> +                bs->mr->nonvolatile, bs->mr->rom_device);
> +        /* Cleanup flags */
> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
> +    }
> +
> +    /*
> +     * Close UFFD file descriptor to remove protection,
> +     * release registered memory regions and flush wait queues
> +     */
> +    uffd_close_fd(rs->uffdio_fd);
> +    rs->uffdio_fd = -1;
> +}
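
The unwind asked for in the review above could be sketched roughly as follows. This is not QEMU code: `struct blk`, `BLK_WRITEPROTECT`, `protect_fn`, `wt_start()`, `wt_unwind()` and `fake_protect()` are hypothetical stand-ins for RAMBlock, RAM_UF_WRITEPROTECT and uffd_protect_memory(), and the protect operation is passed as a callback so the failure path can be exercised without a real userfaultfd descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BLK_WRITEPROTECT (1u << 6)  /* stand-in for RAM_UF_WRITEPROTECT */

/* Hypothetical, heavily simplified stand-in for RAMBlock */
struct blk {
    void *host;
    size_t len;
    unsigned flags;
};

/* Applies (wp == true) or removes (wp == false) write protection */
typedef int (*protect_fn)(void *host, size_t len, bool wp);

/* Unprotect every block flagged so far and clear its flag */
static void wt_unwind(struct blk *blks, size_t n, protect_fn protect)
{
    for (size_t i = 0; i < n; i++) {
        if (!(blks[i].flags & BLK_WRITEPROTECT)) {
            continue;
        }
        protect(blks[i].host, blks[i].len, false);  /* best effort */
        blks[i].flags &= ~BLK_WRITEPROTECT;
    }
}

/* Protect all blocks; on any failure leave nothing protected or flagged */
static int wt_start(struct blk *blks, size_t n, protect_fn protect)
{
    assert(blks != NULL && protect != NULL);

    for (size_t i = 0; i < n; i++) {
        if (protect(blks[i].host, blks[i].len, true)) {
            /* The failing block may be partially protected: undo it too */
            protect(blks[i].host, blks[i].len, false);
            wt_unwind(blks, n, protect);
            return -1;
        }
        blks[i].flags |= BLK_WRITEPROTECT;
    }
    return 0;
}

/* Test double: the fake_fail_at-th protect request fails (0 = never) */
static int fake_calls, fake_fail_at, fake_unprotects;

static int fake_protect(void *host, size_t len, bool wp)
{
    (void) host;
    (void) len;
    if (!wp) {
        fake_unprotects++;
        return 0;
    }
    return (++fake_calls == fake_fail_at) ? -1 : 0;
}
```

In the real function the same unwind would run under the `fail:` label before uffd_close_fd(), so a failed start leaves the running VM with no stale protection and no stale flags.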
On 19.11.2020 21:39, Peter Xu wrote:
> On Thu, Nov 19, 2020 at 03:59:35PM +0300, Andrey Gruzdev wrote:
>> +/**
>> + * uffd_register_memory: register memory range with UFFD
>> + *
>> + * Returns 0 in case of success, negative value on error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @start: starting virtual address of memory range
>> + * @length: length of memory range
>> + * @track_missing: generate events on missing-page faults
>> + * @track_wp: generate events on write-protected-page faults
>> + */
>> +static int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
>> +        bool track_missing, bool track_wp)
>> +{
>> +    struct uffdio_register uffd_register;
>> +
>> +    uffd_register.range.start = start;
>> +    uffd_register.range.len = length;
>> +    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
>> +                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
>> +
>> +    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
>> +        error_report("uffd_register_memory() failed: "
>> +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
>> +                start, length, uffd_register.mode, errno);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>
> These functions look good; we should even be able to refactor the existing
> ones, e.g., ram_block_enable_notify(), but we can also do that later. As a
> start, we can move these helpers into some common files under util/.
>
> [...]
>

Yep, agree.

>> +/**
>> + * ram_write_tracking_start: start UFFD-WP memory tracking
>> + *
>> + * Returns 0 for success or negative value in case of error
>> + *
>> + */
>> +int ram_write_tracking_start(void)
>> +{
>
> We need to be slightly careful about unwinding in this function, because if it
> fails somehow we don't want to crash the existing, running good VM... more
> below.
>
>> +    int uffd;
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +
>> +    /* Open UFFD file descriptor */
>> +    uffd = uffd_create_fd();
>> +    if (uffd < 0) {
>> +        return uffd;
>> +    }
>> +    rs->uffdio_fd = uffd;
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        /* Nothing to do with read-only and MMIO-writable regions */
>> +        if (bs->mr->readonly || bs->mr->rom_device) {
>> +            continue;
>> +        }
>> +
>> +        /* Register block memory with UFFD to track writes */
>> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
>> +                bs->max_length, false, true)) {
>> +            goto fail;
>> +        }
>> +        /* Apply UFFD write protection to the block memory range */
>> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
>> +                bs->max_length, true)) {
>
> Logically we need to undo the previous register here first; however,
> userfaultfd will auto-clean these on close(fd), so it's ok. It's still better
> to unwind the protection of pages, I think. So...
>

It should auto-clean, but as an additional safety measure - yes.

>> +            goto fail;
>> +        }
>> +        bs->flags |= RAM_UF_WRITEPROTECT;
>> +
>> +        info_report("UFFD-WP write-tracking enabled: "
>> +                "block_id=%s page_size=%zu start=%p length=%lu "
>> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
>> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
>> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
>> +                bs->mr->nonvolatile, bs->mr->rom_device);
>> +    }
>> +
>> +    return 0;
>> +
>> +fail:
>> +    uffd_close_fd(uffd);
>
> ... maybe do the unprotect here as well: as long as any of the above steps
> failed, we need to remember to unprotect all the protected pages (or just
> unprotect everything), and also clear the RAM_UF_WRITEPROTECT flags that were
> set.
>

Not resetting RAM_UF_WRITEPROTECT is certainly a bug here.
But it seems that simply calling close() on the UFFD file descriptor does all
the remaining cleanup for us: it unprotects registered memory regions, removes
all extra state from the kernel, etc. I never had a problem with a simple
close(uffd) for cleanup. But maybe it's really safer to do the
unprotect/unregister explicitly.

>> +    rs->uffdio_fd = -1;
>> +    return -1;
>> +}
>> +
>> +/**
>> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
>
> This doesn't remove the protections yet?
>
> We should remove those. For a successful snapshot we can avoid that (if we
> want such an optimization), or imho we'd better unprotect everything, just in
> case the user interrupted the snapshot.
>

It seems that closing the UFFD descriptor does the unprotect for us
implicitly. Am I wrong?

>> + */
>> +void ram_write_tracking_stop(void)
>> +{
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +    assert(rs->uffdio_fd >= 0);
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
>> +            continue;
>> +        }
>> +        info_report("UFFD-WP write-tracking disabled: "
>> +                "block_id=%s page_size=%zu start=%p length=%lu "
>> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
>> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
>> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
>> +                bs->mr->nonvolatile, bs->mr->rom_device);
>> +        /* Cleanup flags */
>> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
>> +    }
>> +
>> +    /*
>> +     * Close UFFD file descriptor to remove protection,
>> +     * release registered memory regions and flush wait queues
>> +     */
>> +    uffd_close_fd(rs->uffdio_fd);
>> +    rs->uffdio_fd = -1;
>> +}
>
On Fri, Nov 20, 2020 at 02:04:46PM +0300, Andrey Gruzdev wrote:
> > > +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> > > +        /* Nothing to do with read-only and MMIO-writable regions */
> > > +        if (bs->mr->readonly || bs->mr->rom_device) {
> > > +            continue;
> > > +        }
> > > +
> > > +        /* Register block memory with UFFD to track writes */
> > > +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
> > > +                bs->max_length, false, true)) {
> > > +            goto fail;
> > > +        }
> > > +        /* Apply UFFD write protection to the block memory range */
> > > +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
> > > +                bs->max_length, true)) {
> >
> > Logically we need to undo the previous register here first; however,
> > userfaultfd will auto-clean these on close(fd), so it's ok. It's still better
> > to unwind the protection of pages, I think. So...
> >
>
> It should auto-clean, but as an additional safety measure - yes.

I'm afraid it will only clean up the registrations, but not the page table
updates; at least that should be what we do now in the kernel. I'm not sure
whether we should always force the kernel to unprotect those on close(). The
problem is that the registered range is normally quite large while the
wr-protected range can be very small (page-based), so that could take a lot of
time, which can be unnecessary, since userspace is the one that knows best
which range was protected.

Indeed, I can't think of anything really bad happening even if we don't
unprotect the pages, as you do right now. What will happen is that the
wr-protected pages will have UFFD_WP set and PAGE_RW cleared in the page tables
even after the close(fd). It means that after the snapshot gets cancelled,
those wr-protected pages could still trigger a page fault again when being
written; since they are no longer covered by uffd-wp protected vmas, it'll
become a "normal cow" fault, and the write bit will be recovered. However, the
UFFD_WP bit in the page tables could be left over there... So maybe it's still
best to unprotect from userspace.

There's an idea that maybe we can auto-remove the UFFD_WP bit in the kernel
when cow happens for a page, but that's definitely off topic (and we must make
sure things like "enforced cow for read-only get_user_pages() won't happen
again"). It's not hard to do that in userspace, anyway.
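
Following the explanation above, an explicit userspace unprotect before close() might look like the sketch below. It is Linux-only and uses the UFFDIO_WRITEPROTECT ioctl with the WP mode bit left clear; `uffd_unprotect_range()` is a hypothetical helper name, not something from the patch, and the fallback branch only exists as an assumption that some kernel headers may not define the ioctl yet.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Remove uffd write protection from [start, start + len).
 * Returns 0 on success, -1 on error with errno set.
 * Intended to be called for every protected range *before* close(uffd),
 * since close() drops the registration but not the page table bits.
 */
static int uffd_unprotect_range(int uffd, void *start, size_t len)
{
#ifdef UFFDIO_WRITEPROTECT
    struct uffdio_writeprotect wp = {
        .range = { .start = (uintptr_t) start, .len = len },
        .mode = 0,  /* UFFDIO_WRITEPROTECT_MODE_WP not set: unprotect */
    };
    int res;

    do {
        res = ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    } while (res < 0 && errno == EINTR);

    return res < 0 ? -1 : 0;
#else
    /* Assumption: headers without UFFDIO_WRITEPROTECT, nothing to undo */
    (void) uffd;
    (void) start;
    (void) len;
    errno = ENOSYS;
    return -1;
#endif
}
```

In ram_write_tracking_stop() this would correspond to calling the existing uffd_protect_memory(..., false) on each block that has RAM_UF_WRITEPROTECT set, and only then closing the descriptor.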
On 20.11.2020 18:01, Peter Xu wrote:
> On Fri, Nov 20, 2020 at 02:04:46PM +0300, Andrey Gruzdev wrote:
>>>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>>>> +        /* Nothing to do with read-only and MMIO-writable regions */
>>>> +        if (bs->mr->readonly || bs->mr->rom_device) {
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        /* Register block memory with UFFD to track writes */
>>>> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
>>>> +                bs->max_length, false, true)) {
>>>> +            goto fail;
>>>> +        }
>>>> +        /* Apply UFFD write protection to the block memory range */
>>>> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
>>>> +                bs->max_length, true)) {
>>>
>>> Logically we need to undo the previous register here first; however,
>>> userfaultfd will auto-clean these on close(fd), so it's ok. It's still better
>>> to unwind the protection of pages, I think. So...
>>>
>>
>> It should auto-clean, but as an additional safety measure - yes.
>
> I'm afraid it will only clean up the registrations, but not the page table
> updates; at least that should be what we do now in the kernel. I'm not sure
> whether we should always force the kernel to unprotect those on close(). The
> problem is that the registered range is normally quite large while the
> wr-protected range can be very small (page-based), so that could take a lot of
> time, which can be unnecessary, since userspace is the one that knows best
> which range was protected.
>
> Indeed, I can't think of anything really bad happening even if we don't
> unprotect the pages, as you do right now. What will happen is that the
> wr-protected pages will have UFFD_WP set and PAGE_RW cleared in the page tables
> even after the close(fd). It means that after the snapshot gets cancelled,
> those wr-protected pages could still trigger a page fault again when being
> written; since they are no longer covered by uffd-wp protected vmas, it'll
> become a "normal cow" fault, and the write bit will be recovered. However, the
> UFFD_WP bit in the page tables could be left over there... So maybe it's still
> best to unprotect from userspace.
>
> There's an idea that maybe we can auto-remove the UFFD_WP bit in the kernel
> when cow happens for a page, but that's definitely off topic (and we must make
> sure things like "enforced cow for read-only get_user_pages() won't happen
> again"). It's not hard to do that in userspace, anyway.
>

Oh, I've got the point. Sure, I need to add un-protect to the cleanup code.
Thanks for clarifying the details of the kernel implementation!
* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> Implemented support for the whole RAM block memory
> protection/un-protection. Introduced higher level
> ram_write_tracking_start() and ram_write_tracking_stop()
> to start/stop tracking guest memory writes.
>
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> ---
>  include/exec/memory.h |   7 ++
>  migration/ram.c       | 267 ++++++++++++++++++++++++++++++++++++++++++
>  migration/ram.h       |   4 +
>  3 files changed, 278 insertions(+)
>
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 0f3e6bcd5e..3d798fce16 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier;
>  /* RAM is a persistent kind memory */
>  #define RAM_PMEM (1 << 5)
>
> +/*
> + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
> + * support 'write-tracking' migration type.
> + * Implies ram_state->ram_wt_enabled.
> + */
> +#define RAM_UF_WRITEPROTECT (1 << 6)
> +
>  static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>                                         IOMMUNotifierFlag flags,
>                                         hwaddr start, hwaddr end,
> diff --git a/migration/ram.c b/migration/ram.c
> index 7811cde643..7f273c9996 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -56,6 +56,12 @@
>  #include "savevm.h"
>  #include "qemu/iov.h"
>  #include "multifd.h"
> +#include <inttypes.h>
> +#include <poll.h>
> +#include <sys/syscall.h>
> +#include <sys/ioctl.h>
> +#include <linux/userfaultfd.h>
> +#include "sysemu/runstate.h"
>
>  /***********************************************************/
>  /* ram save/restore */
> @@ -298,6 +304,8 @@ struct RAMSrcPageRequest {
>  struct RAMState {
>      /* QEMUFile used for this migration */
>      QEMUFile *f;
> +    /* UFFD file descriptor, used in 'write-tracking' migration */
> +    int uffdio_fd;
>      /* Last block that we have visited searching for dirty pages */
>      RAMBlock *last_seen_block;
>      /* Last block from where we have sent data */
> @@ -453,6 +461,181 @@ static QemuThread *decompress_threads;
>  static QemuMutex decomp_done_lock;
>  static QemuCond decomp_done_cond;
>
> +/**
> + * uffd_create_fd: create UFFD file descriptor
> + *
> + * Returns non-negative file descriptor or negative value in case of an error
> + */
> +static int uffd_create_fd(void)
> +{
> +    int uffd;
> +    struct uffdio_api api_struct;
> +    uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);

You need to be a bit careful about doing this in migration/ram.c - it's
generic code; at minimum it needs ifdef'ing for Linux.

> +    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> +    if (uffd < 0) {
> +        error_report("uffd_create_fd() failed: UFFD not supported");
> +        return -1;
> +    }
> +
> +    api_struct.api = UFFD_API;
> +    api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
> +    if (ioctl(uffd, UFFDIO_API, &api_struct)) {
> +        error_report("uffd_create_fd() failed: "
> +                "API version not supported version=%llx errno=%i",
> +                api_struct.api, errno);
> +        goto fail;
> +    }
> +
> +    if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
> +        error_report("uffd_create_fd() failed: "
> +                "PAGEFAULT_FLAG_WP feature missing");
> +        goto fail;
> +    }
> +
> +    return uffd;

Should we be putting that somewhere that we can share with postcopy?

> +fail:
> +    close(uffd);
> +    return -1;
> +}
> +
> +/**
> + * uffd_close_fd: close UFFD file descriptor
> + *
> + * @uffd: UFFD file descriptor
> + */
> +static void uffd_close_fd(int uffd)
> +{
> +    assert(uffd >= 0);
> +    close(uffd);
> +}
> +
> +/**
> + * uffd_register_memory: register memory range with UFFD
> + *
> + * Returns 0 in case of success, negative value on error
> + *
> + * @uffd: UFFD file descriptor
> + * @start: starting virtual address of memory range
> + * @length: length of memory range
> + * @track_missing: generate events on missing-page faults
> + * @track_wp: generate events on write-protected-page faults
> + */
> +static int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
> +        bool track_missing, bool track_wp)
> +{
> +    struct uffdio_register uffd_register;
> +
> +    uffd_register.range.start = start;
> +    uffd_register.range.len = length;
> +    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
> +                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
> +
> +    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
> +        error_report("uffd_register_memory() failed: "
> +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
> +                start, length, uffd_register.mode, errno);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * uffd_protect_memory: protect/unprotect memory range for writes with UFFD
> + *
> + * Returns 0 on success or negative value in case of error
> + *
> + * @uffd: UFFD file descriptor
> + * @start: starting virtual address of memory range
> + * @length: length of memory range
> + * @wp: write-protect/unprotect
> + */
> +static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
> +{
> +    struct uffdio_writeprotect uffd_writeprotect;
> +    int res;
> +
> +    uffd_writeprotect.range.start = start;
> +    uffd_writeprotect.range.len = length;
> +    uffd_writeprotect.mode = (wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0);
> +
> +    do {
> +        res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect);
> +    } while (res < 0 && errno == EINTR);
> +    if (res < 0) {
> +        error_report("uffd_protect_memory() failed: "
> +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
> +                start, length, uffd_writeprotect.mode, errno);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +__attribute__ ((unused))
> +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
> +__attribute__ ((unused))
> +static bool uffd_poll_events(int uffd, int tmo);
> +
> +/**
> + * uffd_read_events: read pending UFFD events
> + *
> + * Returns number of fetched messages, 0 if non is available or
> + * negative value in case of an error
> + *
> + * @uffd: UFFD file descriptor
> + * @msgs: pointer to message buffer
> + * @count: number of messages that can fit in the buffer
> + */
> +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count)
> +{
> +    ssize_t res;
> +    do {
> +        res = read(uffd, msgs, count * sizeof(struct uffd_msg));
> +    } while (res < 0 && errno == EINTR);
> +
> +    if ((res < 0 && errno == EAGAIN)) {
> +        return 0;
> +    }
> +    if (res < 0) {
> +        error_report("uffd_read_events() failed: errno=%i", errno);
> +        return -1;
> +    }
> +
> +    return (int) (res / sizeof(struct uffd_msg));
> +}
> +
> +/**
> + * uffd_poll_events: poll UFFD file descriptor for read
> + *
> + * Returns true if events are available for read, false otherwise
> + *
> + * @uffd: UFFD file descriptor
> + * @tmo: timeout in milliseconds, 0 for non-blocking operation,
> + *       negative value for infinite wait
> + */
> +static bool uffd_poll_events(int uffd, int tmo)
> +{
> +    int res;
> +    struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 };
> +
> +    do {
> +        res = poll(&poll_fd, 1, tmo);
> +    } while (res < 0 && errno == EINTR);
> +
> +    if (res == 0) {
> +        return false;
> +    }
> +    if (res < 0) {
> +        error_report("uffd_poll_events() failed: errno=%i", errno);
> +        return false;
> +    }
> +
> +    return (poll_fd.revents & POLLIN) != 0;
> +}
> +
>  static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
>                                   ram_addr_t offset, uint8_t *source_buf);
>
> @@ -3788,6 +3971,90 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
>      return 0;
>  }
>
> +/**
> + * ram_write_tracking_start: start UFFD-WP memory tracking
> + *
> + * Returns 0 for success or negative value in case of error
> + *
> + */
> +int ram_write_tracking_start(void)
> +{
> +    int uffd;
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +
> +    /* Open UFFD file descriptor */
> +    uffd = uffd_create_fd();
> +    if (uffd < 0) {
> +        return uffd;
> +    }
> +    rs->uffdio_fd = uffd;
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        /* Nothing to do with read-only and MMIO-writable regions */
> +        if (bs->mr->readonly || bs->mr->rom_device) {
> +            continue;
> +        }
> +
> +        /* Register block memory with UFFD to track writes */
> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
> +                bs->max_length, false, true)) {
> +            goto fail;
> +        }
> +        /* Apply UFFD write protection to the block memory range */
> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
> +                bs->max_length, true)) {
> +            goto fail;
> +        }
> +        bs->flags |= RAM_UF_WRITEPROTECT;
> +
> +        info_report("UFFD-WP write-tracking enabled: "
> +                "block_id=%s page_size=%zu start=%p length=%lu "
> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> +                bs->mr->nonvolatile, bs->mr->rom_device);
> +    }
> +
> +    return 0;
> +
> +fail:
> +    uffd_close_fd(uffd);
> +    rs->uffdio_fd = -1;
> +    return -1;
> +}
> +
> +/**
> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
> + */
> +void ram_write_tracking_stop(void)
> +{
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +    assert(rs->uffdio_fd >= 0);
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
> +            continue;
> +        }
> +        info_report("UFFD-WP write-tracking disabled: "
> +                "block_id=%s page_size=%zu start=%p length=%lu "
> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> +                bs->mr->nonvolatile, bs->mr->rom_device);
> +        /* Cleanup flags */
> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
> +    }
> +
> +    /*
> +     * Close UFFD file descriptor to remove protection,
> +     * release registered memory regions and flush wait queues
> +     */
> +    uffd_close_fd(rs->uffdio_fd);
> +    rs->uffdio_fd = -1;
> +}
> +
>  static SaveVMHandlers savevm_ram_handlers = {
>      .save_setup = ram_save_setup,
>      .save_live_iterate = ram_save_iterate,
> diff --git a/migration/ram.h b/migration/ram.h
> index 011e85414e..3611cb51de 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -79,4 +79,8 @@ void colo_flush_ram_cache(void);
>  void colo_release_ram_cache(void);
>  void colo_incoming_start_dirty_log(void);
>
> +/* Live snapshots */
> +int ram_write_tracking_start(void);
> +void ram_write_tracking_stop(void);
> +
>  #endif
> --
> 2.25.1
>
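
The ifdef'ing for Linux asked for in the review above could take a shape like the sketch below: the Linux-only headers and syscall sit behind a guard, and other hosts get a stub that fails cleanly. `uffd_open()` is a hypothetical name used only for illustration; in QEMU itself the guard would more likely be the build system's CONFIG_LINUX rather than the compiler's `__linux__`, which is an assumption here.

```c
#define _GNU_SOURCE     /* for syscall() and O_CLOEXEC declarations */
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#ifdef __linux__
#include <fcntl.h>
#include <sys/syscall.h>
#endif

/*
 * Open a userfaultfd descriptor where the host supports it.
 * Returns a non-negative fd, or -1 with errno set (ENOSYS on hosts
 * that have no userfaultfd syscall at build time).
 */
static int uffd_open(void)
{
#if defined(__linux__) && defined(__NR_userfaultfd)
    return (int) syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
#else
    errno = ENOSYS;
    return -1;
#endif
}
```

Callers such as ram_write_tracking_start() would then see a plain error on non-Linux hosts instead of a build failure, and the generic migration code would not need to include any Linux headers itself.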
On 24.11.2020 20:57, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> Implemented support for the whole RAM block memory
>> protection/un-protection. Introduced higher level
>> ram_write_tracking_start() and ram_write_tracking_stop()
>> to start/stop tracking guest memory writes.
>>
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>> ---
>>  include/exec/memory.h |   7 ++
>>  migration/ram.c       | 267 ++++++++++++++++++++++++++++++++++++++++++
>>  migration/ram.h       |   4 +
>>  3 files changed, 278 insertions(+)
>>
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 0f3e6bcd5e..3d798fce16 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier;
>>  /* RAM is a persistent kind memory */
>>  #define RAM_PMEM (1 << 5)
>>
>> +/*
>> + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
>> + * support 'write-tracking' migration type.
>> + * Implies ram_state->ram_wt_enabled.
>> + */
>> +#define RAM_UF_WRITEPROTECT (1 << 6)
>> +
>>  static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>>                                         IOMMUNotifierFlag flags,
>>                                         hwaddr start, hwaddr end,
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 7811cde643..7f273c9996 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -56,6 +56,12 @@
>>  #include "savevm.h"
>>  #include "qemu/iov.h"
>>  #include "multifd.h"
>> +#include <inttypes.h>
>> +#include <poll.h>
>> +#include <sys/syscall.h>
>> +#include <sys/ioctl.h>
>> +#include <linux/userfaultfd.h>
>> +#include "sysemu/runstate.h"
>>
>>  /***********************************************************/
>>  /* ram save/restore */
>> @@ -298,6 +304,8 @@ struct RAMSrcPageRequest {
>>  struct RAMState {
>>      /* QEMUFile used for this migration */
>>      QEMUFile *f;
>> +    /* UFFD file descriptor, used in 'write-tracking' migration */
>> +    int uffdio_fd;
>>      /* Last block that we have visited searching for dirty pages */
>>      RAMBlock *last_seen_block;
>>      /* Last block from where we have sent data */
>> @@ -453,6 +461,181 @@ static QemuThread *decompress_threads;
>>  static QemuMutex decomp_done_lock;
>>  static QemuCond decomp_done_cond;
>>
>> +/**
>> + * uffd_create_fd: create UFFD file descriptor
>> + *
>> + * Returns non-negative file descriptor or negative value in case of an error
>> + */
>> +static int uffd_create_fd(void)
>> +{
>> +    int uffd;
>> +    struct uffdio_api api_struct;
>> +    uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);
>
> You need to be a bit careful about doing this in migration/ram.c - it's
> generic code; at minimum it needs ifdef'ing for Linux.
>

Yes, it's totally Linux-specific. I think it's better to move this code out of
migration/ram.c, as Peter proposed.

>> +    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
>> +    if (uffd < 0) {
>> +        error_report("uffd_create_fd() failed: UFFD not supported");
>> +        return -1;
>> +    }
>> +
>> +    api_struct.api = UFFD_API;
>> +    api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
>> +    if (ioctl(uffd, UFFDIO_API, &api_struct)) {
>> +        error_report("uffd_create_fd() failed: "
>> +                "API version not supported version=%llx errno=%i",
>> +                api_struct.api, errno);
>> +        goto fail;
>> +    }
>> +
>> +    if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
>> +        error_report("uffd_create_fd() failed: "
>> +                "PAGEFAULT_FLAG_WP feature missing");
>> +        goto fail;
>> +    }
>> +
>> +    return uffd;
>
> Should we be putting that somewhere that we can share with postcopy?
>

Sure, maybe to util/uffd-wp.c + include/qemu/uffd-wp.h.
What do you think?

>> +fail:
>> +    close(uffd);
>> +    return -1;
>> +}
>> +
>> +/**
>> + * uffd_close_fd: close UFFD file descriptor
>> + *
>> + * @uffd: UFFD file descriptor
>> + */
>> +static void uffd_close_fd(int uffd)
>> +{
>> +    assert(uffd >= 0);
>> +    close(uffd);
>> +}
>> +
>> +/**
>> + * uffd_register_memory: register memory range with UFFD
>> + *
>> + * Returns 0 in case of success, negative value on error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @start: starting virtual address of memory range
>> + * @length: length of memory range
>> + * @track_missing: generate events on missing-page faults
>> + * @track_wp: generate events on write-protected-page faults
>> + */
>> +static int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
>> +        bool track_missing, bool track_wp)
>> +{
>> +    struct uffdio_register uffd_register;
>> +
>> +    uffd_register.range.start = start;
>> +    uffd_register.range.len = length;
>> +    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
>> +                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
>> +
>> +    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
>> +        error_report("uffd_register_memory() failed: "
>> +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
>> +                start, length, uffd_register.mode, errno);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/**
>> + * uffd_protect_memory: protect/unprotect memory range for writes with UFFD
>> + *
>> + * Returns 0 on success or negative value in case of error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @start: starting virtual address of memory range
>> + * @length: length of memory range
>> + * @wp: write-protect/unprotect
>> + */
>> +static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
>> +{
>> +    struct uffdio_writeprotect uffd_writeprotect;
>> +    int res;
>> +
>> +    uffd_writeprotect.range.start = start;
>> +    uffd_writeprotect.range.len = length;
>> +    uffd_writeprotect.mode = (wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0);
>> +
>> +    do {
>> +        res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect);
>> +    } while (res < 0 && errno == EINTR);
>> +    if (res < 0) {
>> +        error_report("uffd_protect_memory() failed: "
>> +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
>> +                start, length, uffd_writeprotect.mode, errno);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +__attribute__ ((unused))
>> +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
>> +__attribute__ ((unused))
>> +static bool uffd_poll_events(int uffd, int tmo);
>> +
>> +/**
>> + * uffd_read_events: read pending UFFD events
>> + *
>> + * Returns number of fetched messages, 0 if non is available or
>> + * negative value in case of an error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @msgs: pointer to message buffer
>> + * @count: number of messages that can fit in the buffer
>> + */
>> +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count)
>> +{
>> +    ssize_t res;
>> +    do {
>> +        res = read(uffd, msgs, count * sizeof(struct uffd_msg));
>> +    } while (res < 0 && errno == EINTR);
>> +
>> +    if ((res < 0 && errno == EAGAIN)) {
>> +        return 0;
>> +    }
>> +    if (res < 0) {
>> +        error_report("uffd_read_events() failed: errno=%i", errno);
>> +        return -1;
>> +    }
>> +
>> +    return (int) (res / sizeof(struct uffd_msg));
>> +}
>> +
>> +/**
>> + * uffd_poll_events: poll UFFD file descriptor for read
>> + *
>> + * Returns true if events are available for read, false otherwise
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @tmo: timeout in milliseconds, 0 for non-blocking operation,
>> + *       negative value for infinite wait
>> + */
>> +static bool uffd_poll_events(int uffd, int tmo)
>> +{
>> +    int res;
>> +    struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 };
>> +
>> +    do {
>> +        res = poll(&poll_fd, 1, tmo);
>> +    } while (res < 0 && errno == EINTR);
>> +
>> +    if (res == 0) {
>> +        return false;
>> +    }
>> +    if (res < 0) {
>> +        error_report("uffd_poll_events() failed: errno=%i", errno);
>> +        return false;
>> +    }
>> +
>> +    return (poll_fd.revents & POLLIN) != 0;
>> +}
>> +
>>  static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
>>                                   ram_addr_t offset, uint8_t *source_buf);
>>
>> @@ -3788,6 +3971,90 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
>>      return 0;
>>  }
>>
>> +/**
>> + * ram_write_tracking_start: start UFFD-WP memory tracking
>> + *
>> + * Returns 0 for success or negative value in case of error
>> + *
>> + */
>> +int ram_write_tracking_start(void)
>> +{
>> +    int uffd;
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +
>> +    /* Open UFFD file descriptor */
>> +    uffd = uffd_create_fd();
>> +    if (uffd < 0) {
>> +        return uffd;
>> +    }
>> +    rs->uffdio_fd = uffd;
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        /* Nothing to do with read-only and MMIO-writable regions */
>> +        if (bs->mr->readonly || bs->mr->rom_device) {
>> +            continue;
>> +        }
>> +
>> +        /* Register block memory with UFFD to track writes */
>> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
>> +                bs->max_length, false, true)) {
>> +            goto fail;
>> +        }
>> +        /* Apply UFFD write protection to the block memory range */
>> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
>> +                bs->max_length, true)) {
>> +            goto fail;
>> +        }
>> +        bs->flags |= RAM_UF_WRITEPROTECT;
>> +
>> +        info_report("UFFD-WP write-tracking enabled: "
>> +                "block_id=%s page_size=%zu start=%p length=%lu "
>> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
>> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
>> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
>> +                bs->mr->nonvolatile, bs->mr->rom_device);
>> +    }
>> +
>> +    return 0;
>> +
>> +fail:
>> +    uffd_close_fd(uffd);
>> +    rs->uffdio_fd = -1;
>> +    return -1;
>> +}
>> +
>> +/**
>> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
>> + */
>> +void ram_write_tracking_stop(void)
>> +{
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +    assert(rs->uffdio_fd >= 0);
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
>> +            continue;
>> +        }
>> +        info_report("UFFD-WP write-tracking disabled: "
>> +                "block_id=%s page_size=%zu start=%p length=%lu "
>> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
>> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
>> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
>> +                bs->mr->nonvolatile, bs->mr->rom_device);
>> +        /* Cleanup flags */
>> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
>> +    }
>> +
>> +    /*
>> +     * Close UFFD file descriptor to remove protection,
>> +     * release registered memory regions and flush wait queues
>> +     */
>> +    uffd_close_fd(rs->uffdio_fd);
>> +    rs->uffdio_fd = -1;
>> +}
>> +
>>  static SaveVMHandlers savevm_ram_handlers = {
>>      .save_setup = ram_save_setup,
>>      .save_live_iterate = ram_save_iterate,
>> diff --git a/migration/ram.h b/migration/ram.h
>> index 011e85414e..3611cb51de 100644
>> --- a/migration/ram.h
>>
+++ b/migration/ram.h >> @@ -79,4 +79,8 @@ void colo_flush_ram_cache(void); >> void colo_release_ram_cache(void); >> void colo_incoming_start_dirty_log(void); >> >> +/* Live snapshots */ >> +int ram_write_tracking_start(void); >> +void ram_write_tracking_stop(void); >> + >> #endif >> -- >> 2.25.1 >>
* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote: > On 24.11.2020 20:57, Dr. David Alan Gilbert wrote: > > * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote: > > > Implemented support for the whole RAM block memory > > > protection/un-protection. Introduced higher level > > > ram_write_tracking_start() and ram_write_tracking_stop() > > > to start/stop tracking guest memory writes. > > > > > > Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com> > > > --- > > > include/exec/memory.h | 7 ++ > > > migration/ram.c | 267 ++++++++++++++++++++++++++++++++++++++++++ > > > migration/ram.h | 4 + > > > 3 files changed, 278 insertions(+) > > > > > > diff --git a/include/exec/memory.h b/include/exec/memory.h > > > index 0f3e6bcd5e..3d798fce16 100644 > > > --- a/include/exec/memory.h > > > +++ b/include/exec/memory.h > > > @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier; > > > /* RAM is a persistent kind memory */ > > > #define RAM_PMEM (1 << 5) > > > +/* > > > + * UFFDIO_WRITEPROTECT is used on this RAMBlock to > > > + * support 'write-tracking' migration type. > > > + * Implies ram_state->ram_wt_enabled. 
> > > + */ > > > +#define RAM_UF_WRITEPROTECT (1 << 6) > > > + > > > static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn, > > > IOMMUNotifierFlag flags, > > > hwaddr start, hwaddr end, > > > diff --git a/migration/ram.c b/migration/ram.c > > > index 7811cde643..7f273c9996 100644 > > > --- a/migration/ram.c > > > +++ b/migration/ram.c > > > @@ -56,6 +56,12 @@ > > > #include "savevm.h" > > > #include "qemu/iov.h" > > > #include "multifd.h" > > > +#include <inttypes.h> > > > +#include <poll.h> > > > +#include <sys/syscall.h> > > > +#include <sys/ioctl.h> > > > +#include <linux/userfaultfd.h> > > > +#include "sysemu/runstate.h" > > > /***********************************************************/ > > > /* ram save/restore */ > > > @@ -298,6 +304,8 @@ struct RAMSrcPageRequest { > > > struct RAMState { > > > /* QEMUFile used for this migration */ > > > QEMUFile *f; > > > + /* UFFD file descriptor, used in 'write-tracking' migration */ > > > + int uffdio_fd; > > > /* Last block that we have visited searching for dirty pages */ > > > RAMBlock *last_seen_block; > > > /* Last block from where we have sent data */ > > > @@ -453,6 +461,181 @@ static QemuThread *decompress_threads; > > > static QemuMutex decomp_done_lock; > > > static QemuCond decomp_done_cond; > > > +/** > > > + * uffd_create_fd: create UFFD file descriptor > > > + * > > > + * Returns non-negative file descriptor or negative value in case of an error > > > + */ > > > +static int uffd_create_fd(void) > > > +{ > > > + int uffd; > > > + struct uffdio_api api_struct; > > > + uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER); > > > > You need to be a bit careful about doing this in migration/ram.c - it's > > generic code; at minimum it needs ifdef'ing for Linux. > > > > Yes, it's totally linux-specific, I think better to move this code out of > migration/ram.c, as Peter proposed. 
> > > > + uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); > > > + if (uffd < 0) { > > > + error_report("uffd_create_fd() failed: UFFD not supported"); > > > + return -1; > > > + } > > > + > > > + api_struct.api = UFFD_API; > > > + api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; > > > + if (ioctl(uffd, UFFDIO_API, &api_struct)) { > > > + error_report("uffd_create_fd() failed: " > > > + "API version not supported version=%llx errno=%i", > > > + api_struct.api, errno); > > > + goto fail; > > > + } > > > + > > > + if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) { > > > + error_report("uffd_create_fd() failed: " > > > + "PAGEFAULT_FLAG_WP feature missing"); > > > + goto fail; > > > + } > > > + > > > + return uffd; > > > > Should we be putting that somewher that we can share with postcopy? > > > > Sure, maybe to util/uffd-wp.c + include/qemu/uffd-wp.h. > What do you think? Or how about a userfaultfd.c somewhere? Dave > > > +fail: > > > + close(uffd); > > > + return -1; > > > +} > > > + > > > +/** > > > + * uffd_close_fd: close UFFD file descriptor > > > + * > > > + * @uffd: UFFD file descriptor > > > + */ > > > +static void uffd_close_fd(int uffd) > > > +{ > > > + assert(uffd >= 0); > > > + close(uffd); > > > +} > > > + > > > +/** > > > + * uffd_register_memory: register memory range with UFFD > > > + * > > > + * Returns 0 in case of success, negative value on error > > > + * > > > + * @uffd: UFFD file descriptor > > > + * @start: starting virtual address of memory range > > > + * @length: length of memory range > > > + * @track_missing: generate events on missing-page faults > > > + * @track_wp: generate events on write-protected-page faults > > > + */ > > > +static int uffd_register_memory(int uffd, hwaddr start, hwaddr length, > > > + bool track_missing, bool track_wp) > > > +{ > > > + struct uffdio_register uffd_register; > > > + > > > + uffd_register.range.start = start; > > > + uffd_register.range.len = length; > > > + uffd_register.mode = 
(track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) | > > > + (track_wp ? UFFDIO_REGISTER_MODE_WP : 0); > > > + > > > + if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) { > > > + error_report("uffd_register_memory() failed: " > > > + "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i", > > > + start, length, uffd_register.mode, errno); > > > + return -1; > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +/** > > > + * uffd_protect_memory: protect/unprotect memory range for writes with UFFD > > > + * > > > + * Returns 0 on success or negative value in case of error > > > + * > > > + * @uffd: UFFD file descriptor > > > + * @start: starting virtual address of memory range > > > + * @length: length of memory range > > > + * @wp: write-protect/unprotect > > > + */ > > > +static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp) > > > +{ > > > + struct uffdio_writeprotect uffd_writeprotect; > > > + int res; > > > + > > > + uffd_writeprotect.range.start = start; > > > + uffd_writeprotect.range.len = length; > > > + uffd_writeprotect.mode = (wp ? 
UFFDIO_WRITEPROTECT_MODE_WP : 0); > > > + > > > + do { > > > + res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect); > > > + } while (res < 0 && errno == EINTR); > > > + if (res < 0) { > > > + error_report("uffd_protect_memory() failed: " > > > + "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i", > > > + start, length, uffd_writeprotect.mode, errno); > > > + return -1; > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +__attribute__ ((unused)) > > > +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count); > > > +__attribute__ ((unused)) > > > +static bool uffd_poll_events(int uffd, int tmo); > > > + > > > +/** > > > + * uffd_read_events: read pending UFFD events > > > + * > > > + * Returns number of fetched messages, 0 if non is available or > > > + * negative value in case of an error > > > + * > > > + * @uffd: UFFD file descriptor > > > + * @msgs: pointer to message buffer > > > + * @count: number of messages that can fit in the buffer > > > + */ > > > +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count) > > > +{ > > > + ssize_t res; > > > + do { > > > + res = read(uffd, msgs, count * sizeof(struct uffd_msg)); > > > + } while (res < 0 && errno == EINTR); > > > + > > > + if ((res < 0 && errno == EAGAIN)) { > > > + return 0; > > > + } > > > + if (res < 0) { > > > + error_report("uffd_read_events() failed: errno=%i", errno); > > > + return -1; > > > + } > > > + > > > + return (int) (res / sizeof(struct uffd_msg)); > > > +} > > > + > > > +/** > > > + * uffd_poll_events: poll UFFD file descriptor for read > > > + * > > > + * Returns true if events are available for read, false otherwise > > > + * > > > + * @uffd: UFFD file descriptor > > > + * @tmo: timeout in milliseconds, 0 for non-blocking operation, > > > + * negative value for infinite wait > > > + */ > > > +static bool uffd_poll_events(int uffd, int tmo) > > > +{ > > > + int res; > > > + struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 }; 
> > > + > > > + do { > > > + res = poll(&poll_fd, 1, tmo); > > > + } while (res < 0 && errno == EINTR); > > > + > > > + if (res == 0) { > > > + return false; > > > + } > > > + if (res < 0) { > > > + error_report("uffd_poll_events() failed: errno=%i", errno); > > > + return false; > > > + } > > > + > > > + return (poll_fd.revents & POLLIN) != 0; > > > +} > > > + > > > static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block, > > > ram_addr_t offset, uint8_t *source_buf); > > > @@ -3788,6 +3971,90 @@ static int ram_resume_prepare(MigrationState *s, void *opaque) > > > return 0; > > > } > > > +/** > > > + * ram_write_tracking_start: start UFFD-WP memory tracking > > > + * > > > + * Returns 0 for success or negative value in case of error > > > + * > > > + */ > > > +int ram_write_tracking_start(void) > > > +{ > > > + int uffd; > > > + RAMState *rs = ram_state; > > > + RAMBlock *bs; > > > + > > > + /* Open UFFD file descriptor */ > > > + uffd = uffd_create_fd(); > > > + if (uffd < 0) { > > > + return uffd; > > > + } > > > + rs->uffdio_fd = uffd; > > > + > > > + RAMBLOCK_FOREACH_NOT_IGNORED(bs) { > > > + /* Nothing to do with read-only and MMIO-writable regions */ > > > + if (bs->mr->readonly || bs->mr->rom_device) { > > > + continue; > > > + } > > > + > > > + /* Register block memory with UFFD to track writes */ > > > + if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host, > > > + bs->max_length, false, true)) { > > > + goto fail; > > > + } > > > + /* Apply UFFD write protection to the block memory range */ > > > + if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, > > > + bs->max_length, true)) { > > > + goto fail; > > > + } > > > + bs->flags |= RAM_UF_WRITEPROTECT; > > > + > > > + info_report("UFFD-WP write-tracking enabled: " > > > + "block_id=%s page_size=%zu start=%p length=%lu " > > > + "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i", > > > + bs->idstr, bs->page_size, bs->host, bs->max_length, > > > + 
bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly, > > > + bs->mr->nonvolatile, bs->mr->rom_device); > > > + } > > > + > > > + return 0; > > > + > > > +fail: > > > + uffd_close_fd(uffd); > > > + rs->uffdio_fd = -1; > > > + return -1; > > > +} > > > + > > > +/** > > > + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection > > > + */ > > > +void ram_write_tracking_stop(void) > > > +{ > > > + RAMState *rs = ram_state; > > > + RAMBlock *bs; > > > + assert(rs->uffdio_fd >= 0); > > > + > > > + RAMBLOCK_FOREACH_NOT_IGNORED(bs) { > > > + if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) { > > > + continue; > > > + } > > > + info_report("UFFD-WP write-tracking disabled: " > > > + "block_id=%s page_size=%zu start=%p length=%lu " > > > + "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i", > > > + bs->idstr, bs->page_size, bs->host, bs->max_length, > > > + bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly, > > > + bs->mr->nonvolatile, bs->mr->rom_device); > > > + /* Cleanup flags */ > > > + bs->flags &= ~RAM_UF_WRITEPROTECT; > > > + } > > > + > > > + /* > > > + * Close UFFD file descriptor to remove protection, > > > + * release registered memory regions and flush wait queues > > > + */ > > > + uffd_close_fd(rs->uffdio_fd); > > > + rs->uffdio_fd = -1; > > > +} > > > + > > > static SaveVMHandlers savevm_ram_handlers = { > > > .save_setup = ram_save_setup, > > > .save_live_iterate = ram_save_iterate, > > > diff --git a/migration/ram.h b/migration/ram.h > > > index 011e85414e..3611cb51de 100644 > > > --- a/migration/ram.h > > > +++ b/migration/ram.h > > > @@ -79,4 +79,8 @@ void colo_flush_ram_cache(void); > > > void colo_release_ram_cache(void); > > > void colo_incoming_start_dirty_log(void); > > > +/* Live snapshots */ > > > +int ram_write_tracking_start(void); > > > +void ram_write_tracking_stop(void); > > > + > > > #endif > > > -- > > > 2.25.1 > > > > > > -- > Andrey Gruzdev, Principal Engineer > Virtuozzo GmbH +7-903-247-6397 > virtuzzo.com >
On 25.11.2020 21:43, Dr. David Alan Gilbert wrote: > * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote: >> On 24.11.2020 20:57, Dr. David Alan Gilbert wrote: >>> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote: >>>> Implemented support for the whole RAM block memory >>>> protection/un-protection. Introduced higher level >>>> ram_write_tracking_start() and ram_write_tracking_stop() >>>> to start/stop tracking guest memory writes. >>>> >>>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com> >>>> --- >>>> include/exec/memory.h | 7 ++ >>>> migration/ram.c | 267 ++++++++++++++++++++++++++++++++++++++++++ >>>> migration/ram.h | 4 + >>>> 3 files changed, 278 insertions(+) >>>> >>>> diff --git a/include/exec/memory.h b/include/exec/memory.h >>>> index 0f3e6bcd5e..3d798fce16 100644 >>>> --- a/include/exec/memory.h >>>> +++ b/include/exec/memory.h >>>> @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier; >>>> /* RAM is a persistent kind memory */ >>>> #define RAM_PMEM (1 << 5) >>>> +/* >>>> + * UFFDIO_WRITEPROTECT is used on this RAMBlock to >>>> + * support 'write-tracking' migration type. >>>> + * Implies ram_state->ram_wt_enabled. 
>>>> + */ >>>> +#define RAM_UF_WRITEPROTECT (1 << 6) >>>> + >>>> static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn, >>>> IOMMUNotifierFlag flags, >>>> hwaddr start, hwaddr end, >>>> diff --git a/migration/ram.c b/migration/ram.c >>>> index 7811cde643..7f273c9996 100644 >>>> --- a/migration/ram.c >>>> +++ b/migration/ram.c >>>> @@ -56,6 +56,12 @@ >>>> #include "savevm.h" >>>> #include "qemu/iov.h" >>>> #include "multifd.h" >>>> +#include <inttypes.h> >>>> +#include <poll.h> >>>> +#include <sys/syscall.h> >>>> +#include <sys/ioctl.h> >>>> +#include <linux/userfaultfd.h> >>>> +#include "sysemu/runstate.h" >>>> /***********************************************************/ >>>> /* ram save/restore */ >>>> @@ -298,6 +304,8 @@ struct RAMSrcPageRequest { >>>> struct RAMState { >>>> /* QEMUFile used for this migration */ >>>> QEMUFile *f; >>>> + /* UFFD file descriptor, used in 'write-tracking' migration */ >>>> + int uffdio_fd; >>>> /* Last block that we have visited searching for dirty pages */ >>>> RAMBlock *last_seen_block; >>>> /* Last block from where we have sent data */ >>>> @@ -453,6 +461,181 @@ static QemuThread *decompress_threads; >>>> static QemuMutex decomp_done_lock; >>>> static QemuCond decomp_done_cond; >>>> +/** >>>> + * uffd_create_fd: create UFFD file descriptor >>>> + * >>>> + * Returns non-negative file descriptor or negative value in case of an error >>>> + */ >>>> +static int uffd_create_fd(void) >>>> +{ >>>> + int uffd; >>>> + struct uffdio_api api_struct; >>>> + uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER); >>> >>> You need to be a bit careful about doing this in migration/ram.c - it's >>> generic code; at minimum it needs ifdef'ing for Linux. >>> >> >> Yes, it's totally linux-specific, I think better to move this code out of >> migration/ram.c, as Peter proposed. 
>> >>>> + uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); >>>> + if (uffd < 0) { >>>> + error_report("uffd_create_fd() failed: UFFD not supported"); >>>> + return -1; >>>> + } >>>> + >>>> + api_struct.api = UFFD_API; >>>> + api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; >>>> + if (ioctl(uffd, UFFDIO_API, &api_struct)) { >>>> + error_report("uffd_create_fd() failed: " >>>> + "API version not supported version=%llx errno=%i", >>>> + api_struct.api, errno); >>>> + goto fail; >>>> + } >>>> + >>>> + if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) { >>>> + error_report("uffd_create_fd() failed: " >>>> + "PAGEFAULT_FLAG_WP feature missing"); >>>> + goto fail; >>>> + } >>>> + >>>> + return uffd; >>> >>> Should we be putting that somewher that we can share with postcopy? >>> >> >> Sure, maybe to util/uffd-wp.c + include/qemu/uffd-wp.h. >> What do you think? > > Or how about a userfaultfd.c somewhere? > > Dave > For userfaultfd.c I'm also ok. Andrey >>>> +fail: >>>> + close(uffd); >>>> + return -1; >>>> +} >>>> + >>>> +/** >>>> + * uffd_close_fd: close UFFD file descriptor >>>> + * >>>> + * @uffd: UFFD file descriptor >>>> + */ >>>> +static void uffd_close_fd(int uffd) >>>> +{ >>>> + assert(uffd >= 0); >>>> + close(uffd); >>>> +} >>>> + >>>> +/** >>>> + * uffd_register_memory: register memory range with UFFD >>>> + * >>>> + * Returns 0 in case of success, negative value on error >>>> + * >>>> + * @uffd: UFFD file descriptor >>>> + * @start: starting virtual address of memory range >>>> + * @length: length of memory range >>>> + * @track_missing: generate events on missing-page faults >>>> + * @track_wp: generate events on write-protected-page faults >>>> + */ >>>> +static int uffd_register_memory(int uffd, hwaddr start, hwaddr length, >>>> + bool track_missing, bool track_wp) >>>> +{ >>>> + struct uffdio_register uffd_register; >>>> + >>>> + uffd_register.range.start = start; >>>> + uffd_register.range.len = length; >>>> + uffd_register.mode = 
(track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) | >>>> + (track_wp ? UFFDIO_REGISTER_MODE_WP : 0); >>>> + >>>> + if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) { >>>> + error_report("uffd_register_memory() failed: " >>>> + "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i", >>>> + start, length, uffd_register.mode, errno); >>>> + return -1; >>>> + } >>>> + >>>> + return 0; >>>> +} >>>> + >>>> +/** >>>> + * uffd_protect_memory: protect/unprotect memory range for writes with UFFD >>>> + * >>>> + * Returns 0 on success or negative value in case of error >>>> + * >>>> + * @uffd: UFFD file descriptor >>>> + * @start: starting virtual address of memory range >>>> + * @length: length of memory range >>>> + * @wp: write-protect/unprotect >>>> + */ >>>> +static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp) >>>> +{ >>>> + struct uffdio_writeprotect uffd_writeprotect; >>>> + int res; >>>> + >>>> + uffd_writeprotect.range.start = start; >>>> + uffd_writeprotect.range.len = length; >>>> + uffd_writeprotect.mode = (wp ? 
UFFDIO_WRITEPROTECT_MODE_WP : 0); >>>> + >>>> + do { >>>> + res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect); >>>> + } while (res < 0 && errno == EINTR); >>>> + if (res < 0) { >>>> + error_report("uffd_protect_memory() failed: " >>>> + "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i", >>>> + start, length, uffd_writeprotect.mode, errno); >>>> + return -1; >>>> + } >>>> + >>>> + return 0; >>>> +} >>>> + >>>> +__attribute__ ((unused)) >>>> +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count); >>>> +__attribute__ ((unused)) >>>> +static bool uffd_poll_events(int uffd, int tmo); >>>> + >>>> +/** >>>> + * uffd_read_events: read pending UFFD events >>>> + * >>>> + * Returns number of fetched messages, 0 if non is available or >>>> + * negative value in case of an error >>>> + * >>>> + * @uffd: UFFD file descriptor >>>> + * @msgs: pointer to message buffer >>>> + * @count: number of messages that can fit in the buffer >>>> + */ >>>> +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count) >>>> +{ >>>> + ssize_t res; >>>> + do { >>>> + res = read(uffd, msgs, count * sizeof(struct uffd_msg)); >>>> + } while (res < 0 && errno == EINTR); >>>> + >>>> + if ((res < 0 && errno == EAGAIN)) { >>>> + return 0; >>>> + } >>>> + if (res < 0) { >>>> + error_report("uffd_read_events() failed: errno=%i", errno); >>>> + return -1; >>>> + } >>>> + >>>> + return (int) (res / sizeof(struct uffd_msg)); >>>> +} >>>> + >>>> +/** >>>> + * uffd_poll_events: poll UFFD file descriptor for read >>>> + * >>>> + * Returns true if events are available for read, false otherwise >>>> + * >>>> + * @uffd: UFFD file descriptor >>>> + * @tmo: timeout in milliseconds, 0 for non-blocking operation, >>>> + * negative value for infinite wait >>>> + */ >>>> +static bool uffd_poll_events(int uffd, int tmo) >>>> +{ >>>> + int res; >>>> + struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 }; >>>> + >>>> + do { >>>> + res = poll(&poll_fd, 1, tmo); >>>> 
+ } while (res < 0 && errno == EINTR); >>>> + >>>> + if (res == 0) { >>>> + return false; >>>> + } >>>> + if (res < 0) { >>>> + error_report("uffd_poll_events() failed: errno=%i", errno); >>>> + return false; >>>> + } >>>> + >>>> + return (poll_fd.revents & POLLIN) != 0; >>>> +} >>>> + >>>> static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block, >>>> ram_addr_t offset, uint8_t *source_buf); >>>> @@ -3788,6 +3971,90 @@ static int ram_resume_prepare(MigrationState *s, void *opaque) >>>> return 0; >>>> } >>>> +/** >>>> + * ram_write_tracking_start: start UFFD-WP memory tracking >>>> + * >>>> + * Returns 0 for success or negative value in case of error >>>> + * >>>> + */ >>>> +int ram_write_tracking_start(void) >>>> +{ >>>> + int uffd; >>>> + RAMState *rs = ram_state; >>>> + RAMBlock *bs; >>>> + >>>> + /* Open UFFD file descriptor */ >>>> + uffd = uffd_create_fd(); >>>> + if (uffd < 0) { >>>> + return uffd; >>>> + } >>>> + rs->uffdio_fd = uffd; >>>> + >>>> + RAMBLOCK_FOREACH_NOT_IGNORED(bs) { >>>> + /* Nothing to do with read-only and MMIO-writable regions */ >>>> + if (bs->mr->readonly || bs->mr->rom_device) { >>>> + continue; >>>> + } >>>> + >>>> + /* Register block memory with UFFD to track writes */ >>>> + if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host, >>>> + bs->max_length, false, true)) { >>>> + goto fail; >>>> + } >>>> + /* Apply UFFD write protection to the block memory range */ >>>> + if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, >>>> + bs->max_length, true)) { >>>> + goto fail; >>>> + } >>>> + bs->flags |= RAM_UF_WRITEPROTECT; >>>> + >>>> + info_report("UFFD-WP write-tracking enabled: " >>>> + "block_id=%s page_size=%zu start=%p length=%lu " >>>> + "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i", >>>> + bs->idstr, bs->page_size, bs->host, bs->max_length, >>>> + bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly, >>>> + bs->mr->nonvolatile, bs->mr->rom_device); >>>> + } >>>> + >>>> + return 0; >>>> 
+ >>>> +fail: >>>> + uffd_close_fd(uffd); >>>> + rs->uffdio_fd = -1; >>>> + return -1; >>>> +} >>>> + >>>> +/** >>>> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection >>>> + */ >>>> +void ram_write_tracking_stop(void) >>>> +{ >>>> + RAMState *rs = ram_state; >>>> + RAMBlock *bs; >>>> + assert(rs->uffdio_fd >= 0); >>>> + >>>> + RAMBLOCK_FOREACH_NOT_IGNORED(bs) { >>>> + if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) { >>>> + continue; >>>> + } >>>> + info_report("UFFD-WP write-tracking disabled: " >>>> + "block_id=%s page_size=%zu start=%p length=%lu " >>>> + "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i", >>>> + bs->idstr, bs->page_size, bs->host, bs->max_length, >>>> + bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly, >>>> + bs->mr->nonvolatile, bs->mr->rom_device); >>>> + /* Cleanup flags */ >>>> + bs->flags &= ~RAM_UF_WRITEPROTECT; >>>> + } >>>> + >>>> + /* >>>> + * Close UFFD file descriptor to remove protection, >>>> + * release registered memory regions and flush wait queues >>>> + */ >>>> + uffd_close_fd(rs->uffdio_fd); >>>> + rs->uffdio_fd = -1; >>>> +} >>>> + >>>> static SaveVMHandlers savevm_ram_handlers = { >>>> .save_setup = ram_save_setup, >>>> .save_live_iterate = ram_save_iterate, >>>> diff --git a/migration/ram.h b/migration/ram.h >>>> index 011e85414e..3611cb51de 100644 >>>> --- a/migration/ram.h >>>> +++ b/migration/ram.h >>>> @@ -79,4 +79,8 @@ void colo_flush_ram_cache(void); >>>> void colo_release_ram_cache(void); >>>> void colo_incoming_start_dirty_log(void); >>>> +/* Live snapshots */ >>>> +int ram_write_tracking_start(void); >>>> +void ram_write_tracking_stop(void); >>>> + >>>> #endif >>>> -- >>>> 2.25.1 >>>> >> >> >> -- >> Andrey Gruzdev, Principal Engineer >> Virtuozzo GmbH +7-903-247-6397 >> virtuzzo.com >>
diff --git a/include/exec/memory.h b/include/exec/memory.h index 0f3e6bcd5e..3d798fce16 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier; /* RAM is a persistent kind memory */ #define RAM_PMEM (1 << 5) +/* + * UFFDIO_WRITEPROTECT is used on this RAMBlock to + * support 'write-tracking' migration type. + * Implies ram_state->ram_wt_enabled. + */ +#define RAM_UF_WRITEPROTECT (1 << 6) + static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn, IOMMUNotifierFlag flags, hwaddr start, hwaddr end, diff --git a/migration/ram.c b/migration/ram.c index 7811cde643..7f273c9996 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -56,6 +56,12 @@ #include "savevm.h" #include "qemu/iov.h" #include "multifd.h" +#include <inttypes.h> +#include <poll.h> +#include <sys/syscall.h> +#include <sys/ioctl.h> +#include <linux/userfaultfd.h> +#include "sysemu/runstate.h" /***********************************************************/ /* ram save/restore */ @@ -298,6 +304,8 @@ struct RAMSrcPageRequest { struct RAMState { /* QEMUFile used for this migration */ QEMUFile *f; + /* UFFD file descriptor, used in 'write-tracking' migration */ + int uffdio_fd; /* Last block that we have visited searching for dirty pages */ RAMBlock *last_seen_block; /* Last block from where we have sent data */ @@ -453,6 +461,181 @@ static QemuThread *decompress_threads; static QemuMutex decomp_done_lock; static QemuCond decomp_done_cond; +/** + * uffd_create_fd: create UFFD file descriptor + * + * Returns non-negative file descriptor or negative value in case of an error + */ +static int uffd_create_fd(void) +{ + int uffd; + struct uffdio_api api_struct; + uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER); + + uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); + if (uffd < 0) { + error_report("uffd_create_fd() failed: UFFD not supported"); + return -1; + } + + api_struct.api = UFFD_API; + 
api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; + if (ioctl(uffd, UFFDIO_API, &api_struct)) { + error_report("uffd_create_fd() failed: " + "API version not supported version=%llx errno=%i", + api_struct.api, errno); + goto fail; + } + + if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) { + error_report("uffd_create_fd() failed: " + "PAGEFAULT_FLAG_WP feature missing"); + goto fail; + } + + return uffd; + +fail: + close(uffd); + return -1; +} + +/** + * uffd_close_fd: close UFFD file descriptor + * + * @uffd: UFFD file descriptor + */ +static void uffd_close_fd(int uffd) +{ + assert(uffd >= 0); + close(uffd); +} + +/** + * uffd_register_memory: register memory range with UFFD + * + * Returns 0 in case of success, negative value on error + * + * @uffd: UFFD file descriptor + * @start: starting virtual address of memory range + * @length: length of memory range + * @track_missing: generate events on missing-page faults + * @track_wp: generate events on write-protected-page faults + */ +static int uffd_register_memory(int uffd, hwaddr start, hwaddr length, + bool track_missing, bool track_wp) +{ + struct uffdio_register uffd_register; + + uffd_register.range.start = start; + uffd_register.range.len = length; + uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) | + (track_wp ? 
UFFDIO_REGISTER_MODE_WP : 0);
+
+    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
+        error_report("uffd_register_memory() failed: "
+                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
+                start, length, uffd_register.mode, errno);
+        return -1;
+    }
+
+    return 0;
+}
+
+/**
+ * uffd_protect_memory: protect/unprotect memory range for writes with UFFD
+ *
+ * Returns 0 on success or negative value in case of error
+ *
+ * @uffd: UFFD file descriptor
+ * @start: starting virtual address of memory range
+ * @length: length of memory range
+ * @wp: write-protect/unprotect
+ */
+static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
+{
+    struct uffdio_writeprotect uffd_writeprotect;
+    int res;
+
+    uffd_writeprotect.range.start = start;
+    uffd_writeprotect.range.len = length;
+    uffd_writeprotect.mode = (wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0);
+
+    do {
+        res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect);
+    } while (res < 0 && errno == EINTR);
+    if (res < 0) {
+        error_report("uffd_protect_memory() failed: "
+                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
+                start, length, uffd_writeprotect.mode, errno);
+        return -1;
+    }
+
+    return 0;
+}
+
+__attribute__ ((unused))
+static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
+__attribute__ ((unused))
+static bool uffd_poll_events(int uffd, int tmo);
+
+/**
+ * uffd_read_events: read pending UFFD events
+ *
+ * Returns number of fetched messages, 0 if none is available or
+ * negative value in case of an error
+ *
+ * @uffd: UFFD file descriptor
+ * @msgs: pointer to message buffer
+ * @count: number of messages that can fit in the buffer
+ */
+static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count)
+{
+    ssize_t res;
+    do {
+        res = read(uffd, msgs, count * sizeof(struct uffd_msg));
+    } while (res < 0 && errno == EINTR);
+
+    if ((res < 0 && errno == EAGAIN)) {
+        return 0;
+    }
+    if (res < 0) {
+        error_report("uffd_read_events() failed: errno=%i", errno);
+        return -1;
+    }
+
+    return (int) (res / sizeof(struct uffd_msg));
+}
+
+/**
+ * uffd_poll_events: poll UFFD file descriptor for read
+ *
+ * Returns true if events are available for read, false otherwise
+ *
+ * @uffd: UFFD file descriptor
+ * @tmo: timeout in milliseconds, 0 for non-blocking operation,
+ *       negative value for infinite wait
+ */
+static bool uffd_poll_events(int uffd, int tmo)
+{
+    int res;
+    struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 };
+
+    do {
+        res = poll(&poll_fd, 1, tmo);
+    } while (res < 0 && errno == EINTR);
+
+    if (res == 0) {
+        return false;
+    }
+    if (res < 0) {
+        error_report("uffd_poll_events() failed: errno=%i", errno);
+        return false;
+    }
+
+    return (poll_fd.revents & POLLIN) != 0;
+}
+
 static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
                                  ram_addr_t offset, uint8_t *source_buf);
 
@@ -3788,6 +3971,90 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
     return 0;
 }
 
+/**
+ * ram_write_tracking_start: start UFFD-WP memory tracking
+ *
+ * Returns 0 for success or negative value in case of error
+ *
+ */
+int ram_write_tracking_start(void)
+{
+    int uffd;
+    RAMState *rs = ram_state;
+    RAMBlock *bs;
+
+    /* Open UFFD file descriptor */
+    uffd = uffd_create_fd();
+    if (uffd < 0) {
+        return uffd;
+    }
+    rs->uffdio_fd = uffd;
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
+        /* Nothing to do with read-only and MMIO-writable regions */
+        if (bs->mr->readonly || bs->mr->rom_device) {
+            continue;
+        }
+
+        /* Register block memory with UFFD to track writes */
+        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
+                bs->max_length, false, true)) {
+            goto fail;
+        }
+        /* Apply UFFD write protection to the block memory range */
+        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
+                bs->max_length, true)) {
+            goto fail;
+        }
+        bs->flags |= RAM_UF_WRITEPROTECT;
+
+        info_report("UFFD-WP write-tracking enabled: "
+                "block_id=%s page_size=%zu start=%p length=%lu "
+                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
+                bs->idstr, bs->page_size, bs->host, bs->max_length,
+                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
+                bs->mr->nonvolatile, bs->mr->rom_device);
+    }
+
+    return 0;
+
+fail:
+    uffd_close_fd(uffd);
+    rs->uffdio_fd = -1;
+    return -1;
+}
+
+/**
+ * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
+ */
+void ram_write_tracking_stop(void)
+{
+    RAMState *rs = ram_state;
+    RAMBlock *bs;
+    assert(rs->uffdio_fd >= 0);
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
+        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
+            continue;
+        }
+        info_report("UFFD-WP write-tracking disabled: "
+                "block_id=%s page_size=%zu start=%p length=%lu "
+                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
+                bs->idstr, bs->page_size, bs->host, bs->max_length,
+                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
+                bs->mr->nonvolatile, bs->mr->rom_device);
+        /* Cleanup flags */
+        bs->flags &= ~RAM_UF_WRITEPROTECT;
+    }
+
+    /*
+     * Close UFFD file descriptor to remove protection,
+     * release registered memory regions and flush wait queues
+     */
+    uffd_close_fd(rs->uffdio_fd);
+    rs->uffdio_fd = -1;
+}
+
 static SaveVMHandlers savevm_ram_handlers = {
     .save_setup = ram_save_setup,
     .save_live_iterate = ram_save_iterate,
diff --git a/migration/ram.h b/migration/ram.h
index 011e85414e..3611cb51de 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -79,4 +79,8 @@ void colo_flush_ram_cache(void);
 void colo_release_ram_cache(void);
 void colo_incoming_start_dirty_log(void);
 
+/* Live snapshots */
+int ram_write_tracking_start(void);
+void ram_write_tracking_stop(void);
+
 #endif
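
The uffd_poll_events() and uffd_read_events() helpers are still marked unused in this patch. As an illustration only, here is a minimal sketch of how a poll-then-read pair of this shape might be driven. It works on any file descriptor so it can be tried with a pipe. `struct demo_msg` and the `demo_*` functions are hypothetical stand-ins, not QEMU or kernel APIs; real code would read `struct uffd_msg` from the userfaultfd descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fixed-size record standing in for struct uffd_msg. */
struct demo_msg {
    unsigned long long address;
    unsigned int flags;
};

/* Wait up to tmo milliseconds for fd to become readable; retry on EINTR. */
static bool demo_poll_events(int fd, int tmo)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int res;

    do {
        res = poll(&pfd, 1, tmo);
    } while (res < 0 && errno == EINTR);

    /* Timeout (0) and error (<0) both report "nothing to read". */
    return res > 0 && (pfd.revents & POLLIN) != 0;
}

/* Read up to count whole records; 0 if none are pending, -1 on error. */
static int demo_read_events(int fd, struct demo_msg *msgs, int count)
{
    ssize_t res;

    assert(msgs != NULL && count > 0);
    do {
        res = read(fd, msgs, (size_t) count * sizeof(*msgs));
    } while (res < 0 && errno == EINTR);

    if (res < 0) {
        return errno == EAGAIN ? 0 : -1;
    }
    return (int) ((size_t) res / sizeof(*msgs));
}

/* Poll first, then read, so the caller never blocks longer than tmo. */
static int demo_fetch_events(int fd, struct demo_msg *msgs, int count, int tmo)
{
    if (!demo_poll_events(fd, tmo)) {
        return 0;
    }
    return demo_read_events(fd, msgs, count);
}
```

Polling before reading keeps the wait bounded by the timeout, and a return value of 0 covers both "timed out" and "nothing pending", which mirrors the return convention documented for uffd_read_events().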
Implemented support for the whole RAM block memory protection/un-protection.
Introduced higher level ram_write_tracking_start() and
ram_write_tracking_stop() to start/stop tracking guest memory writes.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 include/exec/memory.h |   7 ++
 migration/ram.c       | 267 ++++++++++++++++++++++++++++++++++++++++++
 migration/ram.h       |   4 +
 3 files changed, 278 insertions(+)
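
To illustrate the start/stop pairing described in the commit message, the sketch below models RAM blocks with a plain struct and a stub protect step. It shows one possible way to undo protection on blocks that were already protected when a later block fails. Everything prefixed `demo_` is hypothetical: the stub stands in for the UFFDIO_WRITEPROTECT ioctl, `wp_enabled` plays the role of the RAM_UF_WRITEPROTECT flag, and none of this is the patch's actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a RAM block; not the QEMU RAMBlock. */
struct demo_block {
    bool readonly;      /* skipped, like read-only regions in the patch */
    bool wp_enabled;    /* plays the role of RAM_UF_WRITEPROTECT */
    bool fail_protect;  /* test knob: force the protect step to fail */
};

/* Stub for the write-protect step; real code would issue an ioctl here. */
static int demo_protect(struct demo_block *b, bool wp)
{
    if (wp && b->fail_protect) {
        return -1;
    }
    b->wp_enabled = wp;
    return 0;
}

/* Remove protection from every block that currently has it. */
static void demo_tracking_stop(struct demo_block *blocks, size_t n)
{
    assert(blocks != NULL);
    for (size_t i = 0; i < n; i++) {
        if (blocks[i].wp_enabled) {
            demo_protect(&blocks[i], false);
        }
    }
}

/* Protect all writable blocks; on any failure, unwind and report -1. */
static int demo_tracking_start(struct demo_block *blocks, size_t n)
{
    assert(blocks != NULL);
    for (size_t i = 0; i < n; i++) {
        if (blocks[i].readonly) {
            continue;
        }
        if (demo_protect(&blocks[i], true) < 0) {
            /* Leave no block protected after a failed start. */
            demo_tracking_stop(blocks, n);
            return -1;
        }
    }
    return 0;
}
```

Reusing the stop routine for the failure path keeps a single place that clears protection, so a failed start and a normal stop leave the blocks in the same state.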