From patchwork Tue Sep 10 16:30:36 2024
X-Patchwork-Submitter: Patrick Roy
X-Patchwork-Id: 13798890
From: Patrick Roy <roypat@amazon.co.uk>
Subject: [RFC PATCH v2 10/10] kvm: x86: support walking guest page tables in gmem
Date: Tue, 10 Sep 2024 17:30:36 +0100
Message-ID: <20240910163038.1298452-11-roypat@amazon.co.uk>
In-Reply-To: <20240910163038.1298452-1-roypat@amazon.co.uk>
References: <20240910163038.1298452-1-roypat@amazon.co.uk>
MIME-Version: 1.0

Update the logic in paging_tmpl.h to work with guest_private memory. If
KVM cannot access gmem and the guest's page tables are in gfns marked as
private, then error out.

Let the guest page table walker access gmem by making it use
gfn_to_pfn_caches, which are already gmem aware, and also handle
on-demand mapping of gmem if KVM_GMEM_NO_DIRECT_MAP is set.
We re-use the gfn_to_pfn_cache here to avoid implementing yet another
remapping solution to support the cmpxchg used to set the "accessed" bit
on guest PTEs. The only case that now needs special handling is page
tables in read-only memslots, as gfn_to_pfn_caches cannot be used for
read-only memory. In this case, use kvm_vcpu_read_guest (which is also
gmem aware), as there is no need to cache the gfn->pfn translation (the
walker does not set the accessed bit for read-only PTEs, so no cmpxchg
on the PTE is needed).

gfn_to_pfn_caches are hooked up to the MMU notifiers, meaning that if
something about guest memory changes between the page table walk and
setting the dirty bits (for example a concurrent fallocate on gmem), the
gfn_to_pfn_caches will have been invalidated and the entire page table
walk is retried.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 arch/x86/kvm/mmu/paging_tmpl.h | 95 ++++++++++++++++++++++++++++------
 1 file changed, 78 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 69941cebb3a87..d96fa423bed05 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -84,7 +84,7 @@ struct guest_walker {
 	pt_element_t ptes[PT_MAX_FULL_LEVELS];
 	pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
 	gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
-	pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
+	struct gfn_to_pfn_cache ptep_caches[PT_MAX_FULL_LEVELS];
 	bool pte_writable[PT_MAX_FULL_LEVELS];
 	unsigned int pt_access[PT_MAX_FULL_LEVELS];
 	unsigned int pte_access;
@@ -201,7 +201,7 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 {
 	unsigned level, index;
 	pt_element_t pte, orig_pte;
-	pt_element_t __user *ptep_user;
+	struct gfn_to_pfn_cache *pte_cache;
 	gfn_t table_gfn;
 	int ret;
@@ -210,10 +210,12 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 		return 0;

 	for (level = walker->max_level; level >= walker->level; --level) {
+		unsigned long flags;
+
 		pte = orig_pte = walker->ptes[level - 1];
 		table_gfn = walker->table_gfn[level - 1];
-		ptep_user = walker->ptep_user[level - 1];
-		index = offset_in_page(ptep_user) / sizeof(pt_element_t);
+		pte_cache = &walker->ptep_caches[level - 1];
+		index = offset_in_page(pte_cache->khva) / sizeof(pt_element_t);
 		if (!(pte & PT_GUEST_ACCESSED_MASK)) {
 			trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte));
 			pte |= PT_GUEST_ACCESSED_MASK;
@@ -246,11 +248,26 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 		if (unlikely(!walker->pte_writable[level - 1]))
 			continue;

-		ret = __try_cmpxchg_user(ptep_user, &orig_pte, pte, fault);
+		read_lock_irqsave(&pte_cache->lock, flags);
+		if (!kvm_gpc_check(pte_cache, sizeof(pte))) {
+			read_unlock_irqrestore(&pte_cache->lock, flags);
+			/*
+			 * If the gpc got invalidated, then the page table
+			 * it contained probably changed, so we probably need
+			 * to redo the entire walk.
+			 */
+			return 1;
+		}
+		ret = __try_cmpxchg((pt_element_t *)pte_cache->khva, &orig_pte, pte, sizeof(pte));
+
+		if (!ret)
+			kvm_gpc_mark_dirty_in_slot(pte_cache);
+
+		read_unlock_irqrestore(&pte_cache->lock, flags);
+
 		if (ret)
 			return ret;

-		kvm_vcpu_mark_page_dirty(vcpu, table_gfn);
 		walker->ptes[level - 1] = pte;
 	}
 	return 0;
@@ -296,6 +313,13 @@ static inline bool FNAME(is_last_gpte)(struct kvm_mmu *mmu,
 	return gpte & PT_PAGE_SIZE_MASK;
 }

+
+static void FNAME(walk_deactivate_gpcs)(struct guest_walker *walker) {
+	for (unsigned int level = 0; level < PT_MAX_FULL_LEVELS; ++level)
+		if (walker->ptep_caches[level].active)
+			kvm_gpc_deactivate(&walker->ptep_caches[level]);
+}
+
 /*
  * Fetch a guest pte for a guest virtual address, or for an L2's GPA.
  */
@@ -305,7 +329,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 {
 	int ret;
 	pt_element_t pte;
-	pt_element_t __user *ptep_user;
 	gfn_t table_gfn;
 	u64 pt_access, pte_access;
 	unsigned index, accessed_dirty, pte_pkey;
@@ -320,8 +343,17 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	u16 errcode = 0;
 	gpa_t real_gpa;
 	gfn_t gfn;
+	struct gfn_to_pfn_cache *pte_cache;

 	trace_kvm_mmu_pagetable_walk(addr, access);
+
+	for (unsigned int level = 0; level < PT_MAX_FULL_LEVELS; ++level) {
+		pte_cache = &walker->ptep_caches[level];
+
+		memset(pte_cache, 0, sizeof(*pte_cache));
+		kvm_gpc_init(pte_cache, vcpu->kvm);
+	}
+
 retry_walk:
 	walker->level = mmu->cpu_role.base.level;
 	pte = kvm_mmu_get_guest_pgd(vcpu, mmu);
@@ -362,11 +394,13 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,

 	do {
 		struct kvm_memory_slot *slot;
-		unsigned long host_addr;
+		unsigned long flags;

 		pt_access = pte_access;
 		--walker->level;

+		pte_cache = &walker->ptep_caches[walker->level - 1];
+
 		index = PT_INDEX(addr, walker->level);
 		table_gfn = gpte_to_gfn(pte);
 		offset = index * sizeof(pt_element_t);
@@ -396,15 +430,36 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 		if (!kvm_is_visible_memslot(slot))
 			goto error;

-		host_addr = gfn_to_hva_memslot_prot(slot, gpa_to_gfn(real_gpa),
-					    &walker->pte_writable[walker->level - 1]);
-		if (unlikely(kvm_is_error_hva(host_addr)))
-			goto error;
+		/*
+		 * gfn_to_pfn_cache expects the memory to be writable. However,
+		 * if the memory is not writable, we do not need caching in the
+		 * first place, as we only need it to later potentially write
+		 * the access bit (which we cannot do anyway if the memory is
+		 * readonly).
+		 */
+		if (slot->flags & KVM_MEM_READONLY) {
+			if (kvm_vcpu_read_guest(vcpu, real_gpa + offset, &pte, sizeof(pte)))
+				goto error;
+		} else {
+			if (kvm_gpc_activate(pte_cache, real_gpa + offset,
+					     sizeof(pte)))
+				goto error;

-		ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
-		if (unlikely(__get_user(pte, ptep_user)))
-			goto error;
-		walker->ptep_user[walker->level - 1] = ptep_user;
+			read_lock_irqsave(&pte_cache->lock, flags);
+			while (!kvm_gpc_check(pte_cache, sizeof(pte))) {
+				read_unlock_irqrestore(&pte_cache->lock, flags);
+
+				if (kvm_gpc_refresh(pte_cache, sizeof(pte)))
+					goto error;
+
+				read_lock_irqsave(&pte_cache->lock, flags);
+			}
+
+			pte = *(pt_element_t *)pte_cache->khva;
+			read_unlock_irqrestore(&pte_cache->lock, flags);
+
+			walker->pte_writable[walker->level - 1] = true;
+		}

 		trace_kvm_mmu_paging_element(pte, walker->level);
@@ -467,13 +522,19 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 							addr, write_fault);
 		if (unlikely(ret < 0))
 			goto error;
-		else if (ret)
+		else if (ret) {
+			FNAME(walk_deactivate_gpcs)(walker);
 			goto retry_walk;
+		}
 	}

+	FNAME(walk_deactivate_gpcs)(walker);
+
 	return 1;

 error:
+	FNAME(walk_deactivate_gpcs)(walker);
+
 	errcode |= write_fault | user_fault;
 	if (fetch_fault && (is_efer_nx(mmu) || is_cr4_smep(mmu)))
 		errcode |= PFERR_FETCH_MASK;