From patchwork Tue Sep 10 16:30:27 2024
From: Patrick Roy <roypat@amazon.co.uk>
Subject: [RFC PATCH v2 01/10] kvm: gmem: Add option to remove gmem from direct map
Date: Tue, 10 Sep 2024 17:30:27 +0100
Message-ID: <20240910163038.1298452-2-roypat@amazon.co.uk>
In-Reply-To: <20240910163038.1298452-1-roypat@amazon.co.uk>
References: <20240910163038.1298452-1-roypat@amazon.co.uk>
X-Patchwork-Id: 13798880
Add a flag to the KVM_CREATE_GUEST_MEMFD ioctl that causes gmem pfns to be removed from the host kernel's direct map. Memory is removed immediately after allocation and preparation of gmem folios (after preparation, as the prepare callback might expect the direct map entry to be present).
Direct map entries are restored before kvm_arch_gmem_invalidate is called (as ->invalidate_folio is called before ->free_folio), for the same reason.

Use the PG_private flag to indicate that a folio is part of gmem with direct map removal enabled. While in this patch PG_private does mean "folio not in direct map", this will no longer be true in follow-up patches: gmem folios might get temporarily reinserted into the direct map, but the PG_private flag needs to remain set, as the folios will have private data that must be freed independently of direct map status. This is why kvm_gmem_folio_clear_private does not call folio_clear_private. kvm_gmem_folio_{set,clear}_private must be called with the folio lock held.

To ensure that failures in kvm_gmem_folio_{set,clear}_private do not cause system instability by leaving holes in the direct map, always try to restore direct map entries on failure. Pages whose direct map entries cannot be restored are marked HWPOISON, to prevent the kernel from ever touching them again.
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 include/uapi/linux/kvm.h |  2 +
 virt/kvm/guest_memfd.c   | 96 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 91 insertions(+), 7 deletions(-)

base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc0551453..81b0f4a236b8c 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1564,6 +1564,8 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };
 
+#define KVM_GMEM_NO_DIRECT_MAP (1ULL << 0)
+
 #define KVM_PRE_FAULT_MEMORY	_IOWR(KVMIO, 0xd5, struct kvm_pre_fault_memory)
 
 struct kvm_pre_fault_memory {
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 1c509c3512614..2ed27992206f3 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#include <linux/set_memory.h>
 
 #include "kvm_mm.h"
 
@@ -49,8 +50,69 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
 	return 0;
 }
 
+static bool kvm_gmem_test_no_direct_map(struct inode *inode)
+{
+	return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) == KVM_GMEM_NO_DIRECT_MAP;
+}
+
+static int kvm_gmem_folio_set_private(struct folio *folio)
+{
+	unsigned long start, npages, i;
+	int r;
+
+	start = (unsigned long) folio_address(folio);
+	npages = folio_nr_pages(folio);
+
+	for (i = 0; i < npages; ++i) {
+		r = set_direct_map_invalid_noflush(folio_page(folio, i));
+		if (r)
+			goto out_remap;
+	}
+	flush_tlb_kernel_range(start, start + folio_size(folio));
+	folio_set_private(folio);
+	return 0;
+out_remap:
+	for (; i > 0; i--) {
+		struct page *page = folio_page(folio, i - 1);
+
+		if (WARN_ON_ONCE(set_direct_map_default_noflush(page))) {
+			/*
+			 * Random holes in the direct map are bad, let's mark
+			 * these pages as corrupted memory so that the kernel
+			 * avoids ever touching them again.
+			 */
+			folio_set_hwpoison(folio);
+			r = -EHWPOISON;
+		}
+	}
+	return r;
+}
+
+static int kvm_gmem_folio_clear_private(struct folio *folio)
+{
+	unsigned long npages, i;
+	int r = 0;
+
+	npages = folio_nr_pages(folio);
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page = folio_page(folio, i);
+
+		if (WARN_ON_ONCE(set_direct_map_default_noflush(page))) {
+			folio_set_hwpoison(folio);
+			r = -EHWPOISON;
+		}
+	}
+	/*
+	 * no TLB flush here: pages without direct map entries should
+	 * never be in the TLB in the first place.
+	 */
+	return r;
+}
+
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
 {
+	int r;
 	struct folio *folio;
 
 	/* TODO: Support huge pages. */
@@ -78,19 +140,31 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
 	}
 
 	if (prepare) {
-		int r = kvm_gmem_prepare_folio(inode, index, folio);
-		if (r < 0) {
-			folio_unlock(folio);
-			folio_put(folio);
-			return ERR_PTR(r);
-		}
+		r = kvm_gmem_prepare_folio(inode, index, folio);
+		if (r < 0)
+			goto out_err;
 	}
 
+	if (!kvm_gmem_test_no_direct_map(inode))
+		goto out;
+
+	if (!folio_test_private(folio)) {
+		r = kvm_gmem_folio_set_private(folio);
+		if (r)
+			goto out_err;
+	}
+
+out:
 	/*
 	 * Ignore accessed, referenced, and dirty flags. The memory is
 	 * unevictable and there is no storage to write back to.
 	 */
 	return folio;
+
+out_err:
+	folio_unlock(folio);
+	folio_put(folio);
+	return ERR_PTR(r);
 }
 
 static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
@@ -343,6 +417,13 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 	return MF_DELAYED;
 }
 
+static void kvm_gmem_invalidate_folio(struct folio *folio, size_t start, size_t end)
+{
+	if (start == 0 && end == folio_size(folio)) {
+		kvm_gmem_folio_clear_private(folio);
+	}
+}
+
 #ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
 static void kvm_gmem_free_folio(struct folio *folio)
 {
@@ -358,6 +439,7 @@ static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio = noop_dirty_folio,
 	.migrate_folio = kvm_gmem_migrate_folio,
 	.error_remove_folio = kvm_gmem_error_folio,
+	.invalidate_folio = kvm_gmem_invalidate_folio,
 #ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
 	.free_folio = kvm_gmem_free_folio,
 #endif
@@ -442,7 +524,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
 	loff_t size = args->size;
 	u64 flags = args->flags;
-	u64 valid_flags = 0;
+	u64 valid_flags = KVM_GMEM_NO_DIRECT_MAP;
 
 	if (flags & ~valid_flags)
 		return -EINVAL;