From patchwork Tue Jul 9 13:20:34 2024
X-Patchwork-Submitter: Patrick Roy
X-Patchwork-Id: 13727956
d="scan'208";a="739970210" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.25.36.210]) by smtp-border-fw-9105.sea19.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Jul 2024 13:21:17 +0000 Received: from EX19MTAUEA002.ant.amazon.com [10.0.0.204:6203] by smtpin.naws.us-east-1.prod.farcaster.email.amazon.dev [10.0.50.89:2525] with esmtp (Farcaster) id 2ee5a019-f3c3-4312-aa43-5ccfe349ee40; Tue, 9 Jul 2024 13:21:15 +0000 (UTC) X-Farcaster-Flow-ID: 2ee5a019-f3c3-4312-aa43-5ccfe349ee40 Received: from EX19D008UEA002.ant.amazon.com (10.252.134.125) by EX19MTAUEA002.ant.amazon.com (10.252.134.9) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 9 Jul 2024 13:21:13 +0000 Received: from EX19MTAUEC001.ant.amazon.com (10.252.135.222) by EX19D008UEA002.ant.amazon.com (10.252.134.125) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 9 Jul 2024 13:21:13 +0000 Received: from ua2d7e1a6107c5b.ant.amazon.com (172.19.88.180) by mail-relay.amazon.com (10.252.135.200) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34 via Frontend Transport; Tue, 9 Jul 2024 13:21:11 +0000 From: Patrick Roy To: , , , , , CC: Patrick Roy , , , , , , , , , , , , , , , , , Subject: [RFC PATCH 6/8] kvm: gmem: Temporarily restore direct map entries when needed Date: Tue, 9 Jul 2024 14:20:34 +0100 Message-ID: <20240709132041.3625501-7-roypat@amazon.co.uk> X-Mailer: git-send-email 2.45.2 In-Reply-To: <20240709132041.3625501-1-roypat@amazon.co.uk> References: <20240709132041.3625501-1-roypat@amazon.co.uk> MIME-Version: 1.0 X-Rspamd-Queue-Id: E2AA31C001B X-Rspam-User: X-Rspamd-Server: rspam05 X-Stat-Signature: 8jq3r1f3wobk1f8yqto3yyc8xnquotb8 X-HE-Tag: 1720531285-347636 X-HE-Meta: U2FsdGVkX185vCTny1L0Mso0Tz3zTEiFYovhv+RRx0HOPtBcAKBWnCf3BibPXevniEXAAa18ziTwbgGZRy6d1Lnh70hoAWgXi7M+rx/HHbXSAwNLFz+J76x/8KZAHoYMFaYY/fEtUGFmtDDV1g/axnLAokACxeFmr9fnR8qCkiDZfjLUN8DpBDg5gopYOvvjG6iSxtjxZ9gCmSIb6G9YAmZUB4QJvaUJH6dqblu35MCFZ9VDjcyPSdkjgpOG3UrmTDATFarqOY6jZMVdSjvLe/MOixb3t32C1zu+qmLHDJ9YBLzXS1bsE/6KIK4wxrnyrtwkYlsWgLYlC7VCte2NPlXkhMpe4g0EsGdoMecXZ2a82B2/XoMfSotM6MznDMSOsAUmZAMVbNHtJi4DKCS5xFnTfhUEYh8/jcrpdLR2aE7U53GJvR94y+6f3H9TN+tBhxtjAj8+wcppWU/zPP2IQOnw4E6T+sQyR4OvSBWCUQLQfBsFsyXXqD5f9jBhQ1cIsdLvTM7MA6HABnbiIzK2dR3+6NdHzmkQ3fhI3xINEksheLqFz3dwjSJ/qvXLBwbDik0Vqklds9+lUKbCsucl5dQUm4j9+JW3MxsoRB2i175FYcX3LXolJmwQsEEUTTceg4/FUvHCDXwnVHOdHoxceuHxTH+S/zmRyLDxgewtl6/Z14QCde808m2a2Y3Y5TAv5DVr9oQEna5fIuZu43K6g4e8c97xLHtfJ/JBqwUqUR8zFlAkFWAn3MjcqSZ17EQPqPGgRRmEgwzr+aiWd85Hz/uVEju7n1wJqgbqGxrn4RXvAENTduacDn33aa2LLe4m1gH9D65abK36O2rYd/u+0MmmyfrX0YNbIjKTJpqbRNIz+ExwEB9rtRX0zffvD6jVOR50nd3SON/zDhj4rC4ZyFKCraQMT13WfD0ZeAUO18/Y98DZIH2sRiEiYXTFh6tPjUR33tV7rzYhJU+4YpV 11EH43An 8g9fcoxzcM1qvRGTq5Nra4g7X2Mg+LAVLCu5tZ/ULs9HRJsIIJcKs9MX+MjryXGHzVuGJ0xZe8LFWKRXGbiGo9e4RKEPezCXbF+sycNipItM5KhD79QnTOCdGtyPO3R6JByZAmEk3PONycG0qc5kgL3OIcic5VoyNkJdQipkfljnbkOKmZv+KgdPVCP/5fRYj37Z08rFeTYvqcNdhyPmBI/JO0LziWJbDHaQy5v++GivPVKNbCP+ArWNMETSumPUghoxJLbgI70zDpeRkG+D88RGmTWHJTuqdxpXwDbDghKkTP0yz4mLgnMK7ZYiURtZTTqkXtTer/zHxbAyz/+qqGQK1wformBY6pwkJe5G01mNsfjqNmROcHqEpgg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: If KVM_GMEM_NO_DIRECT_MAP is set, and KVM tries to internally access 
guest-private memory inside kvm_{read,write}_guest, or via a
gfn_to_pfn_cache, temporarily restore the direct map entry.

To avoid race conditions between two threads restoring or zapping
direct map entries for the same page and potentially interfering with
each other (e.g. unfortunate interleavings of map->read->unmap in the
form of map(A)->map(B)->read(A)->unmap(A)->read(B) [BOOM]), the
following invariants are upheld in this patch:

- Only a single gfn_to_pfn_cache can exist for any given pfn, and
- All non-gfn_to_pfn_cache code paths that temporarily restore direct
  map entries complete the entire map->access->unmap critical section
  while holding the folio lock.

To remember whether a given folio currently has a direct map entry,
use the PG_private flag. If this flag is set, then the folio is
removed from the direct map; otherwise it is present in the direct
map. Modifications of this flag, together with the corresponding
direct map manipulations, must happen while holding the folio's lock.

A gfn_to_pfn_cache cannot hold the folio lock for its entire lifetime,
so it operates as follows: in gpc_map, under folio lock, restore the
direct map entry and set PG_private to 0. In gpc_unmap, zap the direct
map entry again and set PG_private back to 1. If inside gpc_map the
cache finds a folio that has PG_private set to 0, it knows that
another gfn_to_pfn_cache is currently active for the given pfn (as
this is the only scenario in which PG_private can be 0 without the
folio lock being held), and so it returns -EINVAL.

The only other interesting scenario is if kvm_{read,write}_guest is
called for a gfn whose translation is currently cached inside a
gfn_to_pfn_cache. In this case, kvm_{read,write}_guest notices that
PG_private is 0 and skips all direct map manipulations. Since it is
holding the folio lock, it can be sure that gpc_unmap cannot
concurrently zap the direct map entries while kvm_{read,write}_guest
still needs them.

Note that this implementation is slightly too restrictive, as
sometimes multiple gfn_to_pfn_caches need to be active for the same
gfn (for example, each vCPU has its own kvm-clock structure, which
they all try to put into the same gfn).
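To make the second invariant concrete, below is a condensed sketch of
the map->access->unmap critical section that the non-gfn_to_pfn_cache
paths follow. It simply mirrors the __kvm_read_guest_private_page()
change in the diff; the helper name gmem_read_with_direct_map() is
made up purely for illustration and does not exist in the patch.

/* Illustrative sketch only; condensed from the
 * __kvm_read_guest_private_page() change below.
 */
static int gmem_read_with_direct_map(struct folio *folio, void *data,
				     int offset, int len)
{
	void *kaddr;
	int r = 0;

	folio_lock(folio);
	kaddr = folio_address(folio);

	/* PG_private set => the folio currently has no direct map entry. */
	if (folio_test_private(folio)) {
		r = set_direct_map_default_noflush(&folio->page);
		if (r)
			goto out_unlock;
	}

	memcpy(data, kaddr + offset, len);

	/* Zap the entry again before dropping the folio lock. */
	if (folio_test_private(folio))
		r = set_direct_map_invalid_noflush(&folio->page);

out_unlock:
	folio_unlock(folio);
	return r;
}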
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 virt/kvm/kvm_main.c | 59 +++++++++++++++++++++---------
 virt/kvm/pfncache.c | 89 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 123 insertions(+), 25 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4357f7cdf040..f968f1f3d7f7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -52,6 +52,7 @@
 #include
 #include
+#include
 #include

 #include "coalesced_mmio.h"
@@ -3291,8 +3292,8 @@ static int __kvm_read_guest_private_page(struct kvm *kvm,
 					 void *data, int offset, int len)
 {
 	kvm_pfn_t pfn;
-	int r;
-	struct page *page;
+	int r = 0;
+	struct folio *folio;
 	void *kaddr;

 	if (!kvm_can_access_gmem(kvm))
@@ -3303,13 +3304,24 @@ static int __kvm_read_guest_private_page(struct kvm *kvm,
 	if (r < 0)
 		return -EFAULT;

-	page = pfn_to_page(pfn);
-	lock_page(page);
-	kaddr = page_address(page) + offset;
-	memcpy(data, kaddr, len);
-	unlock_page(page);
-	put_page(page);
-	return 0;
+	folio = pfn_folio(pfn);
+	folio_lock(folio);
+	kaddr = folio_address(folio);
+	if (folio_test_private(folio)) {
+		r = set_direct_map_default_noflush(&folio->page);
+		if (r)
+			goto out_unlock;
+	}
+	memcpy(data, kaddr + offset, len);
+	if (folio_test_private(folio)) {
+		r = set_direct_map_invalid_noflush(&folio->page);
+		if (r)
+			goto out_unlock;
+	}
+out_unlock:
+	folio_unlock(folio);
+	folio_put(folio);
+	return r;
 }

 static int __kvm_vcpu_read_guest_private_page(struct kvm_vcpu *vcpu,
@@ -3437,8 +3449,8 @@ static int __kvm_write_guest_private_page(struct kvm *kvm,
 					  const void *data, int offset, int len)
 {
 	kvm_pfn_t pfn;
-	int r;
-	struct page *page;
+	int r = 0;
+	struct folio *folio;
 	void *kaddr;

 	if (!kvm_can_access_gmem(kvm))
@@ -3449,14 +3461,25 @@ static int __kvm_write_guest_private_page(struct kvm *kvm,
 	if (r < 0)
 		return -EFAULT;

-	page = pfn_to_page(pfn);
-	lock_page(page);
-	kaddr = page_address(page) + offset;
-	memcpy(kaddr, data, len);
-	unlock_page(page);
-	put_page(page);
+	folio = pfn_folio(pfn);
+	folio_lock(folio);
+	kaddr = folio_address(folio);
+	if (folio_test_private(folio)) {
+		r = set_direct_map_default_noflush(&folio->page);
+		if (r)
+			goto out_unlock;
+	}
+	memcpy(kaddr + offset, data, len);
+	if (folio_test_private(folio)) {
+		r = set_direct_map_invalid_noflush(&folio->page);
+		if (r)
+			goto out_unlock;
+	}

-	return 0;
+out_unlock:
+	folio_unlock(folio);
+	folio_put(folio);
+	return r;
 }

 static int __kvm_vcpu_write_guest_private_page(struct kvm_vcpu *vcpu,
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 6430e0a49558..95d2d5cdaa12 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -16,6 +16,9 @@
 #include
 #include
 #include
+#include
+
+#include

 #include "kvm_mm.h"
@@ -99,10 +102,68 @@ bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len)
 	return true;
 }

-static void *gpc_map(kvm_pfn_t pfn)
+static int gpc_map_gmem(kvm_pfn_t pfn)
 {
-	if (pfn_valid(pfn))
+	int r = 0;
+	struct folio *folio = pfn_folio(pfn);
+	struct inode *inode = folio_inode(folio);
+
+	if (((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) == 0)
+		goto out;
+
+	/* We need to avoid race conditions where set_memory_np is called for
+	 * pages that other parts of KVM still try to access. We use the
+	 * PG_private bit for this. If it is set, then the page is removed from
+	 * the direct map. If it is cleared, the page is present in the direct
+	 * map. All changes to this bit, and all modifications of the direct
+	 * map entries for the page happen under the page lock.
+	 * The _only_ place where a page will be in the direct map while the
+	 * page lock is _not_ held is if it is inside a gpc. All other parts
+	 * of KVM that temporarily re-insert gmem pages into the direct map
+	 * (currently only guest_{read,write}_page) take the page lock before
+	 * the direct map entry is restored, and hold it until it is zapped
+	 * again. This means
+	 * - If we reach gpc_map while, say, guest_read_page is operating on
+	 *   this page, we block on acquiring the page lock until
+	 *   guest_read_page is done.
+	 * - If we reach gpc_map while another gpc is already caching this
+	 *   page, the page is present in the direct map and the PG_private
+	 *   flag is cleared. In this case, we return -EINVAL below to avoid
+	 *   two gpcs caching the same page (since we do not ref-count
+	 *   insertions back into the direct map, when the first cache gets
+	 *   invalidated it would "break" the second cache that assumes the
+	 *   page is present in the direct map until the second cache itself
+	 *   gets invalidated).
+	 * - Lastly, if guest_read_page is called for a page inside of a gpc,
+	 *   it will see that the PG_private flag is cleared, and thus assume
+	 *   it is present in the direct map (and leave the direct map entry
+	 *   untouched). Since it will be holding the page lock, it cannot
+	 *   race with gpc_unmap.
+	 */
+	folio_lock(folio);
+	if (folio_test_private(folio)) {
+		r = set_direct_map_default_noflush(&folio->page);
+		if (r)
+			goto out_unlock;
+
+		folio_clear_private(folio);
+	} else {
+		r = -EINVAL;
+	}
+out_unlock:
+	folio_unlock(folio);
+out:
+	return r;
+}
+
+static void *gpc_map(kvm_pfn_t pfn, bool private)
+{
+	if (pfn_valid(pfn)) {
+		if (private) {
+			if (gpc_map_gmem(pfn))
+				return NULL;
+		}
 		return kmap(pfn_to_page(pfn));
+	}

 #ifdef CONFIG_HAS_IOMEM
 	return memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
@@ -111,13 +172,27 @@ static void *gpc_map(kvm_pfn_t pfn)
 #endif
 }

-static void gpc_unmap(kvm_pfn_t pfn, void *khva)
+static void gpc_unmap(kvm_pfn_t pfn, void *khva, bool private)
 {
 	/* Unmap the old pfn/page if it was mapped before. */
 	if (is_error_noslot_pfn(pfn) || !khva)
 		return;

 	if (pfn_valid(pfn)) {
+		if (private) {
+			struct folio *folio = pfn_folio(pfn);
+			struct inode *inode = folio_inode(folio);
+
+			if ((unsigned long)inode->i_private &
+			    KVM_GMEM_NO_DIRECT_MAP) {
+				folio_lock(folio);
+				BUG_ON(folio_test_private(folio));
+				BUG_ON(set_direct_map_invalid_noflush(
+					&folio->page));
+				folio_set_private(folio);
+				folio_unlock(folio);
+			}
+		}
 		kunmap(pfn_to_page(pfn));
 		return;
 	}
@@ -195,7 +270,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 			 * the existing mapping and didn't create a new one.
 			 */
 			if (new_khva != old_khva)
-				gpc_unmap(new_pfn, new_khva);
+				gpc_unmap(new_pfn, new_khva, gpc->is_private);

 			kvm_release_pfn_clean(new_pfn);

@@ -224,7 +299,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 		if (new_pfn == gpc->pfn)
 			new_khva = old_khva;
 		else
-			new_khva = gpc_map(new_pfn);
+			new_khva = gpc_map(new_pfn, gpc->is_private);

 		if (!new_khva) {
 			kvm_release_pfn_clean(new_pfn);
@@ -379,7 +454,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 	write_unlock_irq(&gpc->lock);

 	if (unmap_old)
-		gpc_unmap(old_pfn, old_khva);
+		gpc_unmap(old_pfn, old_khva, old_private);

 	return ret;
 }
@@ -500,6 +575,6 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
 		list_del(&gpc->list);
 		spin_unlock(&kvm->gpc_lock);

-		gpc_unmap(old_pfn, old_khva);
+		gpc_unmap(old_pfn, old_khva, gpc->is_private);
 	}
 }
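For completeness, here is a rough userspace-side sketch of how a VMM
would opt a guest_memfd into direct map removal so that the paths in
this patch are exercised. KVM_CREATE_GUEST_MEMFD and struct
kvm_create_guest_memfd are the existing guest_memfd uapi; the
KVM_GMEM_NO_DIRECT_MAP flag is introduced earlier in this series, so
its exact spelling and value are assumptions here.

/* Userspace sketch: create a guest_memfd whose pages are removed from
 * the direct map. Assumes the uapi additions from earlier patches in
 * this series (KVM_GMEM_NO_DIRECT_MAP).
 */
#include <linux/kvm.h>
#include <sys/ioctl.h>

static int create_no_direct_map_gmem(int vm_fd, __u64 size)
{
	struct kvm_create_guest_memfd gmem = {
		.size  = size,
		.flags = KVM_GMEM_NO_DIRECT_MAP, /* flag added by this series */
	};

	/* Returns a guest_memfd file descriptor on success. */
	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
}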