From patchwork Thu Nov 26 15:17:29 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrey Gruzdev X-Patchwork-Id: 11934225 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B5468C63697 for ; Thu, 26 Nov 2020 15:20:55 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 3043E21D40 for ; Thu, 26 Nov 2020 15:20:55 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3043E21D40 Authentication-Results: mail.kernel.org; dmarc=pass (p=none dis=none) header.from=nongnu.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([::1]:33204 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1kiJ4o-0005kw-1A for qemu-devel@archiver.kernel.org; Thu, 26 Nov 2020 10:20:54 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]:47590) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kiJ2Q-00046Z-Ce for qemu-devel@nongnu.org; Thu, 26 Nov 2020 10:18:26 -0500 Received: from relay.sw.ru ([185.231.240.75]:49656 helo=relay3.sw.ru) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kiJ2O-00088T-LC for qemu-devel@nongnu.org; Thu, 26 Nov 2020 10:18:26 -0500 Received: from [192.168.15.178] 
(helo=andrey-MS-7B54.sw.ru) by relay3.sw.ru with esmtp (Exim 4.94) (envelope-from ) id 1kiJ1y-00AT4g-7Q; Thu, 26 Nov 2020 18:17:59 +0300 To: qemu-devel@nongnu.org Cc: Den Lunev , Eric Blake , Paolo Bonzini , Juan Quintela , "Dr . David Alan Gilbert" , Markus Armbruster , Peter Xu , Andrey Gruzdev Subject: [PATCH v4 1/6] introduce 'background-snapshot' migration capability Date: Thu, 26 Nov 2020 18:17:29 +0300 Message-Id: <20201126151734.743849-2-andrey.gruzdev@virtuozzo.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20201126151734.743849-1-andrey.gruzdev@virtuozzo.com> References: <20201126151734.743849-1-andrey.gruzdev@virtuozzo.com> MIME-Version: 1.0 Received-SPF: pass client-ip=185.231.240.75; envelope-from=andrey.gruzdev@virtuozzo.com; helo=relay3.sw.ru X-Spam_score_int: -18 X-Spam_score: -1.9 X-Spam_bar: - X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Reply-to: Andrey Gruzdev X-Patchwork-Original-From: Andrey Gruzdev via From: Andrey Gruzdev Signed-off-by: Andrey Gruzdev --- migration/migration.c | 67 +++++++++++++++++++++++++++++++++++++++++++ migration/migration.h | 1 + qapi/migration.json | 7 ++++- 3 files changed, 74 insertions(+), 1 deletion(-) diff --git a/migration/migration.c b/migration/migration.c index 87a9b59f83..8be3bd2b8c 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -56,6 +56,7 @@ #include "net/announce.h" #include "qemu/queue.h" #include "multifd.h" +#include "sysemu/cpus.h" #ifdef CONFIG_VFIO #include "hw/vfio/vfio-common.h" @@ -134,6 +135,38 @@ enum mig_rp_message_type { MIG_RP_MSG_MAX }; +/* Migration capabilities set */ +struct MigrateCapsSet { + int size; /* Capability 
set size */ + MigrationCapability caps[]; /* Variadic array of capabilities */ +}; +typedef struct MigrateCapsSet MigrateCapsSet; + +/* Define and initialize MigrateCapsSet */ +#define INITIALIZE_MIGRATE_CAPS_SET(_name, ...) \ + MigrateCapsSet _name = { \ + .size = sizeof((int []) { __VA_ARGS__ }) / sizeof(int), \ + .caps = { __VA_ARGS__ } \ + } + +/* Background-snapshot compatibility check list */ +static const +INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot, + MIGRATION_CAPABILITY_POSTCOPY_RAM, + MIGRATION_CAPABILITY_DIRTY_BITMAPS, + MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME, + MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE, + MIGRATION_CAPABILITY_RETURN_PATH, + MIGRATION_CAPABILITY_MULTIFD, + MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER, + MIGRATION_CAPABILITY_AUTO_CONVERGE, + MIGRATION_CAPABILITY_RELEASE_RAM, + MIGRATION_CAPABILITY_RDMA_PIN_ALL, + MIGRATION_CAPABILITY_COMPRESS, + MIGRATION_CAPABILITY_XBZRLE, + MIGRATION_CAPABILITY_X_COLO, + MIGRATION_CAPABILITY_VALIDATE_UUID); + /* When we add fault tolerance, we could have several migrations at once. For now we don't need to add dynamic creation of migration */ @@ -1165,6 +1198,29 @@ static bool migrate_caps_check(bool *cap_list, } } + if (cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) { +#ifdef CONFIG_LINUX + int idx; + /* + * Check if there are any migration capabilities + * incompatible with 'background-snapshot'. 
+ */ + for (idx = 0; idx < check_caps_background_snapshot.size; idx++) { + int incomp_cap = check_caps_background_snapshot.caps[idx]; + if (cap_list[incomp_cap]) { + error_setg(errp, + "Background-snapshot is not compatible with %s", + MigrationCapability_str(incomp_cap)); + return false; + } + } +#else + error_setg(errp, + "Background-snapshot is not supported on non-Linux hosts"); + return false; +#endif + } + return true; } @@ -2490,6 +2546,15 @@ bool migrate_use_block_incremental(void) return s->parameters.block_incremental; } +bool migrate_background_snapshot(void) +{ + MigrationState *s; + + s = migrate_get_current(); + + return s->enabled_capabilities[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]; +} + /* migration thread support */ /* * Something bad happened to the RP stream, mark an error @@ -3783,6 +3848,8 @@ static Property migration_properties[] = { DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK), DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH), DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD), + DEFINE_PROP_MIG_CAP("x-background-snapshot", + MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT), DEFINE_PROP_END_OF_LIST(), }; diff --git a/migration/migration.h b/migration/migration.h index d096b77f74..f40338cfbf 100644 --- a/migration/migration.h +++ b/migration/migration.h @@ -341,6 +341,7 @@ int migrate_compress_wait_thread(void); int migrate_decompress_threads(void); bool migrate_use_events(void); bool migrate_postcopy_blocktime(void); +bool migrate_background_snapshot(void); /* Sending on the return path - generic and then for each message type */ void migrate_send_rp_shut(MigrationIncomingState *mis, diff --git a/qapi/migration.json b/qapi/migration.json index 3c75820527..6291143678 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -442,6 +442,11 @@ # @validate-uuid: Send the UUID of the source to allow the destination # to ensure it is the same. 
(since 4.2) # +# @background-snapshot: If enabled, the migration stream will be a snapshot +# of the VM taken exactly at the point when the +# migration procedure starts. The VM RAM is saved +# while the VM keeps running. (since 6.0) +# # Since: 1.2 ## { 'enum': 'MigrationCapability', @@ -449,7 +454,7 @@ 'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram', 'block', 'return-path', 'pause-before-switchover', 'multifd', 'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate', - 'x-ignore-shared', 'validate-uuid' ] } + 'x-ignore-shared', 'validate-uuid', 'background-snapshot'] } ## # @MigrationCapabilityStatus: From patchwork Thu Nov 26 15:17:30 2020 X-Patchwork-Submitter: Andrey Gruzdev X-Patchwork-Id: 11934227 To: qemu-devel@nongnu.org Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela, "Dr. David Alan Gilbert", Markus Armbruster, Peter Xu, Andrey Gruzdev Subject: [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers Date: Thu, 26 Nov 2020 18:17:30 +0300 Message-Id: <20201126151734.743849-3-andrey.gruzdev@virtuozzo.com> In-Reply-To: <20201126151734.743849-1-andrey.gruzdev@virtuozzo.com> From: Andrey Gruzdev Implemented support for the whole RAM block memory
protection/un-protection. Introduced higher level ram_write_tracking_start() and ram_write_tracking_stop() to start/stop tracking guest memory writes. Signed-off-by: Andrey Gruzdev --- include/exec/memory.h | 7 ++ include/qemu/userfaultfd.h | 29 +++++ migration/ram.c | 120 +++++++++++++++++++++ migration/ram.h | 4 + util/meson.build | 1 + util/userfaultfd.c | 215 +++++++++++++++++++++++++++++++++++++ 6 files changed, 376 insertions(+) create mode 100644 include/qemu/userfaultfd.h create mode 100644 util/userfaultfd.c diff --git a/include/exec/memory.h b/include/exec/memory.h index 0f3e6bcd5e..3d798fce16 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier; /* RAM is a persistent kind memory */ #define RAM_PMEM (1 << 5) +/* + * UFFDIO_WRITEPROTECT is used on this RAMBlock to + * support 'write-tracking' migration type. + * Implies ram_state->ram_wt_enabled. + */ +#define RAM_UF_WRITEPROTECT (1 << 6) + static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn, IOMMUNotifierFlag flags, hwaddr start, hwaddr end, diff --git a/include/qemu/userfaultfd.h b/include/qemu/userfaultfd.h new file mode 100644 index 0000000000..fb843c76db --- /dev/null +++ b/include/qemu/userfaultfd.h @@ -0,0 +1,29 @@ +/* + * Linux UFFD-WP support + * + * Copyright Virtuozzo GmbH, 2020 + * + * Authors: + * Andrey Gruzdev + * + * This work is licensed under the terms of the GNU GPL, version 2 or + * later. See the COPYING file in the top-level directory. 
+ */ + +#ifndef USERFAULTFD_H +#define USERFAULTFD_H + +#include "qemu/osdep.h" +#include "exec/hwaddr.h" +#include <linux/userfaultfd.h> + +int uffd_create_fd(void); +void uffd_close_fd(int uffd); +int uffd_register_memory(int uffd, hwaddr start, hwaddr length, + bool track_missing, bool track_wp); +int uffd_unregister_memory(int uffd, hwaddr start, hwaddr length); +int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp); +int uffd_read_events(int uffd, struct uffd_msg *msgs, int count); +bool uffd_poll_events(int uffd, int tmo); + +#endif /* USERFAULTFD_H */ diff --git a/migration/ram.c b/migration/ram.c index 7811cde643..3adfd1948d 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -56,6 +56,11 @@ #include "savevm.h" #include "qemu/iov.h" #include "multifd.h" +#include "sysemu/runstate.h" + +#ifdef CONFIG_LINUX +#include "qemu/userfaultfd.h" +#endif /***********************************************************/ /* ram save/restore */ @@ -298,6 +303,8 @@ struct RAMSrcPageRequest { struct RAMState { /* QEMUFile used for this migration */ QEMUFile *f; + /* UFFD file descriptor, used in 'write-tracking' migration */ + int uffdio_fd; /* Last block that we have visited searching for dirty pages */ RAMBlock *last_seen_block; /* Last block from where we have sent data */ @@ -3788,6 +3795,119 @@ static int ram_resume_prepare(MigrationState *s, void *opaque) return 0; } +/* + * ram_write_tracking_start: start UFFD-WP memory tracking + * + * Returns 0 for success or negative value in case of error + * + */ +int ram_write_tracking_start(void) +{ +#ifdef CONFIG_LINUX + int uffd; + RAMState *rs = ram_state; + RAMBlock *bs; + + /* Open UFFD file descriptor */ + uffd = uffd_create_fd(); + if (uffd < 0) { + return uffd; + } + rs->uffdio_fd = uffd; + + RAMBLOCK_FOREACH_NOT_IGNORED(bs) { + /* Nothing to do with read-only and MMIO-writable regions */ + if (bs->mr->readonly || bs->mr->rom_device) { + continue; + } + + bs->flags |= RAM_UF_WRITEPROTECT; + /* Register block memory with
UFFD to track writes */ + if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host, + bs->max_length, false, true)) { + goto fail; + } + /* Apply UFFD write protection to the block memory range */ + if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, + bs->max_length, true)) { + goto fail; + } + + info_report("UFFD-WP write-tracking enabled: " + "block_id=%s page_size=%zu start=%p length=%lu " + "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i", + bs->idstr, bs->page_size, bs->host, bs->max_length, + bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly, + bs->mr->nonvolatile, bs->mr->rom_device); + } + + return 0; + +fail: + error_report("ram_write_tracking_start() failed: restoring initial memory state"); + + RAMBLOCK_FOREACH_NOT_IGNORED(bs) { + if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) { + continue; + } + /* + * In case some memory block failed to be write-protected + * remove protection and unregister all succeeded RAM blocks + */ + uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length, false); + uffd_unregister_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length); + /* Cleanup flags */ + bs->flags &= ~RAM_UF_WRITEPROTECT; + } + + uffd_close_fd(uffd); + rs->uffdio_fd = -1; + return -1; +#else + rs->uffdio_fd = -1; + error_setg(&migrate_get_current()->error, + "Background-snapshot not supported on non-Linux hosts"); + return -1; +#endif /* CONFIG_LINUX */ +} + +/** + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection + */ +void ram_write_tracking_stop(void) +{ +#ifdef CONFIG_LINUX + RAMState *rs = ram_state; + RAMBlock *bs; + assert(rs->uffdio_fd >= 0); + + RAMBLOCK_FOREACH_NOT_IGNORED(bs) { + if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) { + continue; + } + /* Remove protection and unregister all affected RAM blocks */ + uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length, false); + uffd_unregister_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length); + /* Cleanup flags */ + 
bs->flags &= ~RAM_UF_WRITEPROTECT; + + info_report("UFFD-WP write-tracking disabled: " + "block_id=%s page_size=%zu start=%p length=%lu " + "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i", + bs->idstr, bs->page_size, bs->host, bs->max_length, + bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly, + bs->mr->nonvolatile, bs->mr->rom_device); + } + + /* Finally close UFFD file descriptor */ + uffd_close_fd(rs->uffdio_fd); + rs->uffdio_fd = -1; +#else + error_setg(&migrate_get_current()->error, + "Background-snapshot not supported on non-Linux hosts"); +#endif /* CONFIG_LINUX */ +} + static SaveVMHandlers savevm_ram_handlers = { .save_setup = ram_save_setup, .save_live_iterate = ram_save_iterate, diff --git a/migration/ram.h b/migration/ram.h index 011e85414e..0ec63e27ee 100644 --- a/migration/ram.h +++ b/migration/ram.h @@ -79,4 +79,8 @@ void colo_flush_ram_cache(void); void colo_release_ram_cache(void); void colo_incoming_start_dirty_log(void); +/* Background snapshots */ +int ram_write_tracking_start(void); +void ram_write_tracking_stop(void); + #endif diff --git a/util/meson.build b/util/meson.build index f359af0d46..c64bfe94b3 100644 --- a/util/meson.build +++ b/util/meson.build @@ -50,6 +50,7 @@ endif if have_system util_ss.add(when: 'CONFIG_GIO', if_true: [files('dbus.c'), gio]) + util_ss.add(when: 'CONFIG_LINUX', if_true: files('userfaultfd.c')) endif if have_block diff --git a/util/userfaultfd.c b/util/userfaultfd.c new file mode 100644 index 0000000000..038953d7ed --- /dev/null +++ b/util/userfaultfd.c @@ -0,0 +1,215 @@ +/* + * Linux UFFD-WP support + * + * Copyright Virtuozzo GmbH, 2020 + * + * Authors: + * Andrey Gruzdev + * + * This work is licensed under the terms of the GNU GPL, version 2 or + * later. See the COPYING file in the top-level directory. 
+ */ + +#include "qemu/osdep.h" +#include "qemu/bitops.h" +#include "qemu/error-report.h" +#include "qemu/userfaultfd.h" +#include <poll.h> +#include <sys/syscall.h> +#include <sys/ioctl.h> + +/** + * uffd_create_fd: create UFFD file descriptor + * + * Returns non-negative file descriptor or negative value in case of an error + */ +int uffd_create_fd(void) +{ + int uffd; + struct uffdio_api api_struct; + uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER); + + uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); + if (uffd < 0) { + error_report("uffd_create_fd() failed: UFFD not supported"); + return -1; + } + + api_struct.api = UFFD_API; + api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; + if (ioctl(uffd, UFFDIO_API, &api_struct)) { + error_report("uffd_create_fd() failed: " + "API version not supported version=%llx errno=%i", + api_struct.api, errno); + goto fail; + } + + if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) { + error_report("uffd_create_fd() failed: " + "PAGEFAULT_FLAG_WP feature missing"); + goto fail; + } + + return uffd; + +fail: + close(uffd); + return -1; +} + +/** + * uffd_close_fd: close UFFD file descriptor + * + * @uffd: UFFD file descriptor + */ +void uffd_close_fd(int uffd) +{ + assert(uffd >= 0); + close(uffd); +} + +/** + * uffd_register_memory: register memory range with UFFD + * + * Returns 0 in case of success, negative value on error + * + * @uffd: UFFD file descriptor + * @start: starting virtual address of memory range + * @length: length of memory range + * @track_missing: generate events on missing-page faults + * @track_wp: generate events on write-protected-page faults + */ +int uffd_register_memory(int uffd, hwaddr start, hwaddr length, + bool track_missing, bool track_wp) +{ + struct uffdio_register uffd_register; + + uffd_register.range.start = start; + uffd_register.range.len = length; + uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) | + (track_wp ?
UFFDIO_REGISTER_MODE_WP : 0); + + if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) { + error_report("uffd_register_memory() failed: " + "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i", + start, length, uffd_register.mode, errno); + return -1; + } + + return 0; +} + +/** + * uffd_unregister_memory: un-register memory range with UFFD + * + * Returns 0 in case of success, negative value on error + * + * @uffd: UFFD file descriptor + * @start: starting virtual address of memory range + * @length: length of memory range + */ +int uffd_unregister_memory(int uffd, hwaddr start, hwaddr length) +{ + struct uffdio_range uffd_range; + + uffd_range.start = start; + uffd_range.len = length; + + if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) { + error_report("uffd_unregister_memory() failed: " + "start=%0"PRIx64" len=%"PRIu64" errno=%i", + start, length, errno); + return -1; + } + + return 0; +} + +/** + * uffd_protect_memory: protect/unprotect memory range for writes with UFFD + * + * Returns 0 on success or negative value in case of error + * + * @uffd: UFFD file descriptor + * @start: starting virtual address of memory range + * @length: length of memory range + * @wp: write-protect/unprotect + */ +int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp) +{ + struct uffdio_writeprotect uffd_writeprotect; + int res; + + uffd_writeprotect.range.start = start; + uffd_writeprotect.range.len = length; + uffd_writeprotect.mode = (wp ? 
UFFDIO_WRITEPROTECT_MODE_WP : 0); + + do { + res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect); + } while (res < 0 && errno == EINTR); + if (res < 0) { + error_report("uffd_protect_memory() failed: " + "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i", + start, length, uffd_writeprotect.mode, errno); + return -1; + } + + return 0; +} + +/** + * uffd_read_events: read pending UFFD events + * + * Returns number of fetched messages, 0 if none is available or + * negative value in case of an error + * + * @uffd: UFFD file descriptor + * @msgs: pointer to message buffer + * @count: number of messages that can fit in the buffer + */ +int uffd_read_events(int uffd, struct uffd_msg *msgs, int count) +{ + ssize_t res; + do { + res = read(uffd, msgs, count * sizeof(struct uffd_msg)); + } while (res < 0 && errno == EINTR); + + if ((res < 0 && errno == EAGAIN)) { + return 0; + } + if (res < 0) { + error_report("uffd_read_events() failed: errno=%i", errno); + return -1; + } + + return (int) (res / sizeof(struct uffd_msg)); +} + +/** + * uffd_poll_events: poll UFFD file descriptor for read + * + * Returns true if events are available for read, false otherwise + * + * @uffd: UFFD file descriptor + * @tmo: timeout in milliseconds, 0 for non-blocking operation, + * negative value for infinite wait + */ +bool uffd_poll_events(int uffd, int tmo) +{ + int res; + struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 }; + + do { + res = poll(&poll_fd, 1, tmo); + } while (res < 0 && errno == EINTR); + + if (res == 0) { + return false; + } + if (res < 0) { + error_report("uffd_poll_events() failed: errno=%i", errno); + return false; + } + + return (poll_fd.revents & POLLIN) != 0; +} From patchwork Thu Nov 26 15:17:31 2020 X-Patchwork-Submitter: Andrey Gruzdev X-Patchwork-Id: 11934231 To: qemu-devel@nongnu.org Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela, "Dr. David Alan Gilbert", Markus Armbruster, Peter Xu, Andrey Gruzdev Subject: [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate() Date: Thu, 26 Nov 2020 18:17:31 +0300 Message-Id: <20201126151734.743849-4-andrey.gruzdev@virtuozzo.com> In-Reply-To: <20201126151734.743849-1-andrey.gruzdev@virtuozzo.com> From: Andrey Gruzdev In this implementation the same single migration thread is responsible both for normal linear dirty page migration and for processing UFFD page fault events. Processing a write fault means reading the UFFD file descriptor, finding the respective RAM block and saving the faulting page to the migration stream. After the page has been saved, write protection can be removed. Since an asynchronous version of qemu_put_buffer() is expected to be used to save pages, we also have to flush the migration stream prior to un-protecting the saved memory range. Write protection is removed for any previously protected memory chunk that has hit the migration stream; that applies both to pages from the linear scan and to write-fault pages.
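The "finding the respective RAM block" step above is implemented in this patch as ram_find_block_by_host_address(), which scans the block list circularly starting from the last seen block. A simplified standalone sketch of that circular search, using a plain C array in place of QEMU's RCU-protected block list (the Block type, PAGE_SIZE value and find_block() name are illustrative, not QEMU's API):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a RAMBlock: one host-address range. */
typedef struct {
    uintptr_t host;   /* start of the block's host mapping */
    size_t length;    /* length of the mapping in bytes */
} Block;

#define PAGE_SIZE 4096

/*
 * Find the block fully containing [addr, addr + PAGE_SIZE), scanning
 * circularly from start_idx. Returns the block index, or -1 if no
 * block covers the page.
 */
static int find_block(const Block *blocks, int nblocks, int start_idx,
                      uintptr_t addr)
{
    int idx = start_idx;

    do {
        const Block *b = &blocks[idx];
        if (addr >= b->host && addr + PAGE_SIZE <= b->host + b->length) {
            return idx;
        }
        idx = (idx + 1) % nblocks;   /* wrap, like hitting the list end */
    } while (idx != start_idx);

    return -1;   /* came full circle without a match */
}
```

Starting the scan from the last block that matched exploits locality: consecutive write faults usually land in the same RAM block, so the common case returns on the first iteration.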
Signed-off-by: Andrey Gruzdev --- migration/ram.c | 155 +++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 147 insertions(+), 8 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 3adfd1948d..bcdccdaef7 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1441,6 +1441,76 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset) return block; } +#ifdef CONFIG_LINUX +/** + * ram_find_block_by_host_address: find RAM block containing host page + * + * Returns pointer to RAMBlock if found, NULL otherwise + * + * @rs: current RAM state + * @page_address: host page address + */ +static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr page_address) +{ + RAMBlock *bs = rs->last_seen_block; + + do { + if (page_address >= (hwaddr) bs->host && (page_address + TARGET_PAGE_SIZE) <= + ((hwaddr) bs->host + bs->max_length)) { + return bs; + } + + bs = QLIST_NEXT_RCU(bs, next); + if (!bs) { + /* Hit the end of the list */ + bs = QLIST_FIRST_RCU(&ram_list.blocks); + } + } while (bs != rs->last_seen_block); + + return NULL; +} + +/** + * poll_fault_page: try to get next UFFD write fault page and, if pending fault + * is found, return RAM block pointer and page offset + * + * Returns pointer to the RAMBlock containing faulting page, + * NULL if no write faults are pending + * + * @rs: current RAM state + * @offset: page offset from the beginning of the block + */ +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset) +{ + struct uffd_msg uffd_msg; + hwaddr page_address; + RAMBlock *bs; + int res; + + if (!migrate_background_snapshot()) { + return NULL; + } + + res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1); + if (res <= 0) { + return NULL; + } + + page_address = uffd_msg.arg.pagefault.address; + bs = ram_find_block_by_host_address(rs, page_address); + if (!bs) { + /* In case we couldn't find respective block, just unprotect faulting page. 
*/ + uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false); + error_report("ram_find_block_by_host_address() failed: address=0x%0lx", + page_address); + return NULL; + } + + *offset = (ram_addr_t) (page_address - (hwaddr) bs->host); + return bs; +} +#endif /* CONFIG_LINUX */ + /** * get_queued_page: unqueue a page from the postcopy requests * @@ -1480,6 +1550,16 @@ } while (block && !dirty); +#ifdef CONFIG_LINUX + if (!block) { + /* + * Poll write faults too if background snapshot is enabled; that is + * when vCPUs can get blocked by write-protected pages. + */ + block = poll_fault_page(rs, &offset); + } +#endif /* CONFIG_LINUX */ + if (block) { /* * As soon as we start servicing pages out of order, then we have @@ -1753,6 +1833,55 @@ return pages; } +/** + * ram_save_host_page_pre: ram_save_host_page() pre-notifier + * + * @rs: current RAM state + * @pss: page-search-status structure + * @opaque: pointer to receive opaque context value + */ +static inline +void ram_save_host_page_pre(RAMState *rs, PageSearchStatus *pss, void **opaque) +{ + *(ram_addr_t *) opaque = pss->page; +} + +/** + * ram_save_host_page_post: ram_save_host_page() post-notifier + * + * @rs: current RAM state + * @pss: page-search-status structure + * @opaque: opaque context value + * @res_override: pointer to the return value of ram_save_host_page(), + * overwritten in case of an error + */ +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss, + void *opaque, int *res_override) +{ + /* Check if page is from UFFD-managed region.
*/ + if (pss->block->flags & RAM_UF_WRITEPROTECT) { +#ifdef CONFIG_LINUX + ram_addr_t page_from = (ram_addr_t) opaque; + hwaddr page_address = (hwaddr) pss->block->host + + ((hwaddr) page_from << TARGET_PAGE_BITS); + hwaddr run_length = (hwaddr) (pss->page - page_from + 1) << TARGET_PAGE_BITS; + int res; + + /* Flush async buffers before un-protect. */ + qemu_fflush(rs->f); + /* Un-protect memory range. */ + res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false); + /* We don't want to override existing error from ram_save_host_page(). */ + if (res < 0 && *res_override >= 0) { + *res_override = res; + } +#else + /* Should never happen */ + qemu_file_set_error(rs->f, -ENOSYS); +#endif /* CONFIG_LINUX */ + } +} + /** * ram_find_and_save_block: finds a dirty page and sends it to f * @@ -1779,14 +1908,14 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) return pages; } + if (!rs->last_seen_block) { + rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks); + } + pss.block = rs->last_seen_block; pss.page = rs->last_page; pss.complete_round = false; - if (!pss.block) { - pss.block = QLIST_FIRST_RCU(&ram_list.blocks); - } - do { again = true; found = get_queued_page(rs, &pss); @@ -1797,7 +1926,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) } if (found) { + void *opaque; + + ram_save_host_page_pre(rs, &pss, &opaque); pages = ram_save_host_page(rs, &pss, last_stage); + ram_save_host_page_post(rs, &pss, opaque, &pages); } } while (!pages && again); @@ -3864,9 +3997,12 @@ fail: rs->uffdio_fd = -1; return -1; #else + /* + * Should never happen since we prohibit 'background-snapshot' + * capability on non-Linux hosts. 
+ */ rs->uffdio_fd = -1; - error_setg(&migrate_get_current()->error, - "Background-snapshot not supported on non-Linux hosts"); + error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR); return -1; #endif /* CONFIG_LINUX */ } @@ -3903,8 +4039,11 @@ void ram_write_tracking_stop(void) uffd_close_fd(rs->uffdio_fd); rs->uffdio_fd = -1; #else - error_setg(&migrate_get_current()->error, - "Background-snapshot not supported on non-Linux hosts"); + /* + * Should never happen since we prohibit 'background-snapshot' + * capability on non-Linux hosts. + */ + error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR); #endif /* CONFIG_LINUX */ } From patchwork Thu Nov 26 15:17:32 2020 X-Patchwork-Submitter: Andrey Gruzdev X-Patchwork-Id: 11934233 To: qemu-devel@nongnu.org Cc: Den Lunev , Eric Blake , Paolo Bonzini , Juan Quintela , "Dr . David Alan Gilbert" , Markus Armbruster , Peter Xu , Andrey Gruzdev Subject: [PATCH v4 4/6] implementation of background snapshot thread Date: Thu, 26 Nov 2020 18:17:32 +0300 Message-Id: <20201126151734.743849-5-andrey.gruzdev@virtuozzo.com> In-Reply-To: <20201126151734.743849-1-andrey.gruzdev@virtuozzo.com> References: <20201126151734.743849-1-andrey.gruzdev@virtuozzo.com> From: Andrey Gruzdev Introducing 
an implementation of the 'background' snapshot thread, which broadly follows the logic of precopy migration while internally using a completely different mechanism to 'freeze' the vmstate at the start of snapshot creation. This mechanism is based on userfault_fd with write-protection support and is Linux-specific. Signed-off-by: Andrey Gruzdev --- migration/migration.c | 177 +++++++++++++++++++++++++++++++++++++++++- migration/migration.h | 3 + migration/savevm.c | 1 - migration/savevm.h | 2 + 4 files changed, 180 insertions(+), 3 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index 8be3bd2b8c..c2efaf72f2 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -1983,6 +1983,7 @@ void migrate_init(MigrationState *s) * locks. */ s->cleanup_bh = 0; + s->vm_start_bh = 0; s->to_dst_file = NULL; s->state = MIGRATION_STATUS_NONE; s->rp_state.from_dst_file = NULL; @@ -3521,6 +3522,21 @@ static void migration_iteration_finish(MigrationState *s) qemu_mutex_unlock_iothread(); } +static void bg_migration_iteration_finish(MigrationState *s) +{ + /* TODO: implement */ +} + +/* + * Return true to continue to the next iteration directly, false + * otherwise. + */ +static MigIterateState bg_migration_iteration_run(MigrationState *s) +{ + /* TODO: implement */ + return MIG_ITERATE_RESUME; +} + void migration_make_urgent_request(void) { qemu_sem_post(&migrate_get_current()->rate_limit_sem); @@ -3668,6 +3684,157 @@ static void *migration_thread(void *opaque) return NULL; } +static void bg_migration_vm_start_bh(void *opaque) +{ + MigrationState *s = opaque; + + qemu_bh_delete(s->vm_start_bh); + s->vm_start_bh = NULL; + + vm_start(); + s->downtime = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - s->downtime_start; +} + +/** + * Background snapshot thread, based on live migration code. + * This is an alternative implementation of the live migration mechanism, + * introduced specifically to support background snapshots. 
+ * + * It takes advantage of the userfault_fd write protection mechanism introduced + * in the v5.7 kernel. Compared to the existing dirty page logging migration, + * much less stream traffic is produced, resulting in smaller snapshot images, + * simply because no page duplicates can get into the stream. + * + * Another key point is that the generated vmstate stream reflects the machine + * state 'frozen' at the beginning of snapshot creation, whereas with the dirty + * page logging mechanism the saved snapshot is effectively the state of the VM + * at the end of the process. + */ +static void *bg_migration_thread(void *opaque) +{ + MigrationState *s = opaque; + int64_t setup_start; + MigThrError thr_error; + QEMUFile *fb; + bool early_fail = true; + + rcu_register_thread(); + object_ref(OBJECT(s)); + + qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX); + + setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST); + /* + * We want to save the vmstate for the moment when migration has been + * initiated, but we also want to save the RAM content while the VM is + * running. The RAM content should appear first in the vmstate. So, we + * first stash the non-RAM part of the vmstate to the temporary buffer, + * then write the RAM part of the vmstate to the migration stream + * with vCPUs running and, finally, write the stashed non-RAM part of + * the vmstate from the buffer to the migration stream. 
+ */ + s->bioc = qio_channel_buffer_new(128 * 1024); + qio_channel_set_name(QIO_CHANNEL(s->bioc), "vmstate-buffer"); + fb = qemu_fopen_channel_output(QIO_CHANNEL(s->bioc)); + object_unref(OBJECT(s->bioc)); + + update_iteration_initial_status(s); + + qemu_savevm_state_header(s->to_dst_file); + qemu_savevm_state_setup(s->to_dst_file); + + if (qemu_savevm_state_guest_unplug_pending()) { + migrate_set_state(&s->state, MIGRATION_STATUS_SETUP, + MIGRATION_STATUS_WAIT_UNPLUG); + + while (s->state == MIGRATION_STATUS_WAIT_UNPLUG && + qemu_savevm_state_guest_unplug_pending()) { + qemu_sem_timedwait(&s->wait_unplug_sem, 250); + } + + migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG, + MIGRATION_STATUS_ACTIVE); + } + s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start; + + migrate_set_state(&s->state, MIGRATION_STATUS_SETUP, + MIGRATION_STATUS_ACTIVE); + trace_migration_thread_setup_complete(); + s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); + + qemu_mutex_lock_iothread(); + + if (global_state_store()) { + goto fail; + } + /* Forcibly stop VM before saving state of vCPUs and devices */ + if (vm_stop_force_state(RUN_STATE_PAUSED)) { + goto fail; + } + /* + * Put vCPUs in sync with shadow context structures, then + * save their state to channel-buffer along with devices. + */ + cpu_synchronize_all_states(); + if (qemu_savevm_state_complete_precopy_non_iterable(fb, false, false)) { + goto fail; + } + /* Now initialize UFFD context and start tracking RAM writes */ + if (ram_write_tracking_start()) { + goto fail; + } + early_fail = false; + + /* + * Start VM from BH handler to avoid write-fault lock here. + * UFFD-WP protection for the whole RAM is already enabled so + * calling VM state change notifiers from vm_start() would initiate + * writes to virtio VQs memory which is in write-protected region. 
+ */ + s->vm_start_bh = qemu_bh_new(bg_migration_vm_start_bh, s); + qemu_bh_schedule(s->vm_start_bh); + + qemu_mutex_unlock_iothread(); + + while (migration_is_active(s)) { + MigIterateState iter_state = bg_migration_iteration_run(s); + if (iter_state == MIG_ITERATE_SKIP) { + continue; + } else if (iter_state == MIG_ITERATE_BREAK) { + break; + } + + /* + * Try to detect any kind of failures, and see whether we + * should stop the migration now. + */ + thr_error = migration_detect_error(s); + if (thr_error == MIG_THR_ERR_FATAL) { + /* Stop migration */ + break; + } + + migration_update_counters(s, qemu_clock_get_ms(QEMU_CLOCK_REALTIME)); + } + + trace_migration_thread_after_loop(); + +fail: + if (early_fail) { + migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE, + MIGRATION_STATUS_FAILED); + qemu_mutex_unlock_iothread(); + } + + bg_migration_iteration_finish(s); + + qemu_fclose(fb); + object_unref(OBJECT(s)); + rcu_unregister_thread(); + + return NULL; +} + void migrate_fd_connect(MigrationState *s, Error *error_in) { Error *local_err = NULL; @@ -3731,8 +3898,14 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) migrate_fd_cleanup(s); return; } - qemu_thread_create(&s->thread, "live_migration", migration_thread, s, - QEMU_THREAD_JOINABLE); + + if (migrate_background_snapshot()) { + qemu_thread_create(&s->thread, "background_snapshot", + bg_migration_thread, s, QEMU_THREAD_JOINABLE); + } else { + qemu_thread_create(&s->thread, "live_migration", + migration_thread, s, QEMU_THREAD_JOINABLE); + } s->migration_thread_running = true; } diff --git a/migration/migration.h b/migration/migration.h index f40338cfbf..0723955cd7 100644 --- a/migration/migration.h +++ b/migration/migration.h @@ -20,6 +20,7 @@ #include "qemu/thread.h" #include "qemu/coroutine_int.h" #include "io/channel.h" +#include "io/channel-buffer.h" #include "net/announce.h" #include "qom/object.h" @@ -147,8 +148,10 @@ struct MigrationState { /*< public >*/ QemuThread thread; + QEMUBH 
*vm_start_bh; QEMUBH *cleanup_bh; QEMUFile *to_dst_file; + QIOChannelBuffer *bioc; /* * Protects to_dst_file pointer. We need to make sure we won't * yield or hang during the critical section, since this lock will diff --git a/migration/savevm.c b/migration/savevm.c index 5f937a2762..62d5f8a869 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -1352,7 +1352,6 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy) return 0; } -static int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f, bool in_postcopy, bool inactivate_disks) diff --git a/migration/savevm.h b/migration/savevm.h index ba64a7e271..aaee2528ed 100644 --- a/migration/savevm.h +++ b/migration/savevm.h @@ -64,5 +64,7 @@ int qemu_loadvm_state(QEMUFile *f); void qemu_loadvm_state_cleanup(void); int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis); int qemu_load_device_state(QEMUFile *f); +int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f, + bool in_postcopy, bool inactivate_disks); #endif From patchwork Thu Nov 26 15:17:33 2020 X-Patchwork-Submitter: Andrey Gruzdev X-Patchwork-Id: 11934235 To: qemu-devel@nongnu.org Cc: Den Lunev , Eric Blake , Paolo Bonzini , Juan Quintela , "Dr . 
David Alan Gilbert" , Markus Armbruster , Peter Xu , Andrey Gruzdev Subject: [PATCH v4 5/6] the rest of write tracking migration code Date: Thu, 26 Nov 2020 18:17:33 +0300 Message-Id: <20201126151734.743849-6-andrey.gruzdev@virtuozzo.com> In-Reply-To: <20201126151734.743849-1-andrey.gruzdev@virtuozzo.com> References: <20201126151734.743849-1-andrey.gruzdev@virtuozzo.com> From: Andrey Gruzdev Implementation of bg_migration iteration/completion/finish. Signed-off-by: Andrey Gruzdev --- migration/migration.c | 74 +++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 72 insertions(+), 2 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index c2efaf72f2..ef1522761c 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -3182,6 +3182,50 @@ fail: MIGRATION_STATUS_FAILED); } +/** + * bg_migration_completion: Used by bg_migration_thread after all the + * RAM has been saved. The caller 'breaks' the loop when this returns. + * + * @s: Current migration state + */ +static void bg_migration_completion(MigrationState *s) +{ + int current_active_state = s->state; + + /* + * Stop tracking RAM writes - un-protect memory, un-register UFFD + * memory ranges, flush kernel wait queues and wake up threads + * waiting for write faults to be resolved. 
+ */ + ram_write_tracking_stop(); + + if (s->state == MIGRATION_STATUS_ACTIVE) { + /* + * By this moment we have RAM content saved into the migration stream. + * The next step is to flush the non-RAM content (device state) + * right after the ram content. The device state has been stored into + * the temporary buffer before RAM saving started. + */ + qemu_put_buffer(s->to_dst_file, s->bioc->data, s->bioc->usage); + qemu_fflush(s->to_dst_file); + } else if (s->state == MIGRATION_STATUS_CANCELLING) { + goto fail; + } + + if (qemu_file_get_error(s->to_dst_file)) { + trace_migration_completion_file_err(); + goto fail; + } + + migrate_set_state(&s->state, current_active_state, + MIGRATION_STATUS_COMPLETED); + return; + +fail: + migrate_set_state(&s->state, current_active_state, + MIGRATION_STATUS_FAILED); +} + bool migrate_colo_enabled(void) { MigrationState *s = migrate_get_current(); @@ -3524,7 +3568,26 @@ static void migration_iteration_finish(MigrationState *s) static void bg_migration_iteration_finish(MigrationState *s) { - /* TODO: implement */ + qemu_mutex_lock_iothread(); + switch (s->state) { + case MIGRATION_STATUS_COMPLETED: + migration_calculate_complete(s); + break; + + case MIGRATION_STATUS_ACTIVE: + case MIGRATION_STATUS_FAILED: + case MIGRATION_STATUS_CANCELLED: + case MIGRATION_STATUS_CANCELLING: + break; + + default: + /* Should not reach here, but if so, forgive the VM. 
*/ + error_report("%s: Unknown ending state %d", __func__, s->state); + break; + } + + migrate_fd_cleanup_schedule(s); + qemu_mutex_unlock_iothread(); } /* @@ -3533,7 +3596,14 @@ static void bg_migration_iteration_finish(MigrationState *s) */ static MigIterateState bg_migration_iteration_run(MigrationState *s) { - /* TODO: implement */ + int res; + + res = qemu_savevm_state_iterate(s->to_dst_file, false); + if (res > 0) { + bg_migration_completion(s); + return MIG_ITERATE_BREAK; + } + return MIG_ITERATE_RESUME; } From patchwork Thu Nov 26 15:17:34 2020 X-Patchwork-Submitter: Andrey Gruzdev X-Patchwork-Id: 11934229 To: qemu-devel@nongnu.org Cc: Den Lunev , Eric Blake , Paolo Bonzini , Juan Quintela , "Dr . David Alan Gilbert" , Markus Armbruster , Peter Xu , Andrey Gruzdev Subject: [PATCH v4 6/6] introduce simple linear scan rate limiting mechanism Date: Thu, 26 Nov 2020 18:17:34 +0300 Message-Id: <20201126151734.743849-7-andrey.gruzdev@virtuozzo.com> In-Reply-To: <20201126151734.743849-1-andrey.gruzdev@virtuozzo.com> References: <20201126151734.743849-1-andrey.gruzdev@virtuozzo.com> From: Andrey Gruzdev Since reading UFFD events and saving paged data are performed from the same thread, write fault latencies are sensitive to migration stream stalls. 
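As a rough model of the rate-limiting idea this patch implements, the latency-driven throttle can be sketched in plain C. This is a standalone sketch with hypothetical names (`ThrottleState`, `fault_seen()`, `scan_should_yield()`); the real code lives in `limit_scan_rate()` below and uses QEMU clocks and `uffd_poll_events()` instead of returning a flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Constants mirroring the values introduced by this patch */
#define FAULT_HIGH_LATENCY_NS 5000000LL  /* 5 ms */
#define SLOW_FAULT_SKIP_PAGES 200

typedef struct {
    int64_t last_fault_ns;      /* timestamp of the last write fault, 0 if none */
    int throttle_skip_counter;  /* linear-scan pages left to skip */
} ThrottleState;

/* Record a write fault at time 'now_ns' (done by the fault poller). */
static void fault_seen(ThrottleState *ts, int64_t now_ns)
{
    ts->last_fault_ns = now_ns;
}

/*
 * Called once per page saved by the linear scan. Returns true if the
 * scan should yield (wait for the next fault or a timeout) instead of
 * saving another non-faulting page.
 */
static bool scan_should_yield(ThrottleState *ts, int64_t now_ns)
{
    if (ts->last_fault_ns) {
        int64_t latency_ns = now_ns - ts->last_fault_ns;
        ts->last_fault_ns = 0;
        if (latency_ns > FAULT_HIGH_LATENCY_NS) {
            /* Slow fault observed: throttle the next N scanned pages */
            ts->throttle_skip_counter = SLOW_FAULT_SKIP_PAGES;
        }
    }
    if (ts->throttle_skip_counter > 0) {
        ts->throttle_skip_counter--;
        return true;
    }
    return false;
}
```

In this model, a single out-of-threshold fault latency forces the next SLOW_FAULT_SKIP_PAGES scan iterations to yield, so during that window the stream carries write-faulting pages only.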
Limiting the total page saving rate is a method to reduce the number of noticeable fault resolution latencies. Migration bandwidth limiting is achieved by detecting cases of out-of-threshold write fault latencies and temporarily disabling (strictly speaking, severely throttling) the saving of non-faulting pages. Signed-off-by: Andrey Gruzdev --- migration/ram.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 65 insertions(+), 2 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index bcdccdaef7..d5b50b7804 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -322,6 +322,10 @@ struct RAMState { /* these variables are used for bitmap sync */ /* last time we did a full bitmap_sync */ int64_t time_last_bitmap_sync; + /* last time a UFFD fault occurred */ + int64_t last_fault_ns; + /* linear scan throttling counter */ + int throttle_skip_counter; /* bytes transferred at start_time */ uint64_t bytes_xfer_prev; /* number of dirty pages since start_time */ @@ -1506,6 +1510,8 @@ static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset) return NULL; } + rs->last_fault_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); + *offset = (ram_addr_t) (page_address - (hwaddr) bs->host); return bs; } @@ -1882,6 +1888,55 @@ static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss, } } +#define FAULT_HIGH_LATENCY_NS 5000000 /* 5 ms */ +#define SLOW_FAULT_POLL_TMO 5 /* 5 ms */ +#define SLOW_FAULT_SKIP_PAGES 200 + +/** + * limit_scan_rate: limit the RAM linear scan rate in case of growing write + * fault latencies, used in the write-tracking migration implementation + * + * @rs: current RAM state + */ +static void limit_scan_rate(RAMState *rs) +{ + if (!migrate_background_snapshot()) { + return; + } + +#ifdef CONFIG_LINUX + int64_t last_fault_latency_ns = 0; + + /* Check if the last write fault time is available */ + if (rs->last_fault_ns) { + last_fault_latency_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - + rs->last_fault_ns; + rs->last_fault_ns = 0; + } + + /* + * 
If the last fault time was available and we have a + * latency value, check that it's not too high + */ + if (last_fault_latency_ns > FAULT_HIGH_LATENCY_NS) { + /* Reset the counter after each slow write fault */ + rs->throttle_skip_counter = SLOW_FAULT_SKIP_PAGES; + } + /* + * Delay thread execution until the next write fault occurs or the timeout + * expires. The next SLOW_FAULT_SKIP_PAGES saved pages can then be write fault + * pages only, not pages coming from the linear scan logic. Thus we moderate + * the migration stream rate to reduce latencies. + */ + if (rs->throttle_skip_counter > 0) { + uffd_poll_events(rs->uffdio_fd, SLOW_FAULT_POLL_TMO); + rs->throttle_skip_counter--; + } +#else + /* Should never happen */ + qemu_file_set_error(rs->f, -ENOSYS); +#endif /* CONFIG_LINUX */ +} + /** * ram_find_and_save_block: finds a dirty page and sends it to f * @@ -1931,6 +1986,9 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) ram_save_host_page_pre(rs, &pss, &opaque); pages = ram_save_host_page(rs, &pss, last_stage); ram_save_host_page_post(rs, &pss, opaque, &pages); + + /* Linear scan rate limiting */ + limit_scan_rate(rs); } } while (!pages && again); @@ -2043,11 +2101,14 @@ static void ram_state_reset(RAMState *rs) rs->last_sent_block = NULL; rs->last_page = 0; rs->last_version = ram_list.version; + rs->last_fault_ns = 0; + rs->throttle_skip_counter = 0; rs->ram_bulk_stage = true; rs->fpo_enabled = false; } -#define MAX_WAIT 50 /* ms, half buffered_file limit */ +#define MAX_WAIT 50 /* ms, half buffered_file limit */ +#define BG_MAX_WAIT 1000 /* 1000 ms, need a bigger limit for background snapshot */ /* * 'expected' is the value you expect the bitmap mostly to be full @@ -2723,7 +2784,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque) if ((i & 63) == 0) { uint64_t t1 = (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - t0) / 1000000; - if (t1 > MAX_WAIT) { + uint64_t max_wait = migrate_background_snapshot() ? 
+ BG_MAX_WAIT : MAX_WAIT; + if (t1 > max_wait) { trace_ram_save_iterate_big_wait(t1, i); break; }
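For reference, the un-protect range computation performed by ram_save_host_page_post() earlier in the series boils down to simple page arithmetic. A standalone sketch (the typedefs, the helper name `run_to_unprotect`, and the 4 KiB `TARGET_PAGE_BITS` value are assumptions for illustration, not QEMU code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for QEMU's types and constants */
typedef uint64_t hwaddr;
typedef uint64_t ram_addr_t;
#define TARGET_PAGE_BITS 12   /* assuming 4 KiB target pages */

/*
 * Compute the host address and byte length of the page run
 * [page_from, page_to] within a RAM block, mirroring the values passed
 * to uffd_protect_memory() when un-protecting a just-saved run.
 */
static void run_to_unprotect(hwaddr block_host, ram_addr_t page_from,
                             ram_addr_t page_to,
                             hwaddr *address, hwaddr *length)
{
    *address = block_host + ((hwaddr) page_from << TARGET_PAGE_BITS);
    *length  = (hwaddr) (page_to - page_from + 1) << TARGET_PAGE_BITS;
}
```

Flushing the stream before un-protecting the run matters because once the range is writable again, vCPUs may dirty those pages while the already-queued copies are still in transit.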