From patchwork Mon May 10 16:58:49 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Marc Zyngier X-Patchwork-Id: 12249119 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-17.7 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 47A54C433B4 for ; Mon, 10 May 2021 17:43:33 +0000 (UTC) Received: from desiato.infradead.org (desiato.infradead.org [90.155.92.199]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id A683760249 for ; Mon, 10 May 2021 17:43:32 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A683760249 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=desiato.20200630; h=Sender:Content-Transfer-Encoding :Content-Type:List-Subscribe:List-Help:List-Post:List-Archive: List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To:Message-Id:Date: Subject:Cc:To:From:Reply-To:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=z0j1a+d5HS5+iMAfoc1dbRxwFRHpdaCRGtmfc3D1Ry4=; b=Xojs2/CBo168tKzTxvScX+B+E z3sfvRIMcWHzXfx7tIjiRG030zsG85PpLFalQVDBo+bKmFG1GtNINUsCPBVCPgf5C/y5A0vXWJPK9 5NiUSooj8Shejssh06xY8WVMcAk/glfreWCioprWJSPG1PoXUsy6tZZSonFs259ZfHCbgfei+IvDh cbk5ij1924gWalkCJmBfuxwDFhImXRv4bCafF2bD4BBr4SFBHTL96teMpSzcADGAgdU6WefNrZwKN O9UBH30TfcZxgrRjH1iWoCQwpyUNOhf/Rw6ro/5OBShMi8UjUKsocXYM0PI1gMgWuDj2jFoQXbXIq iD94ikXbg==; Received: from localhost ([::1] helo=desiato.infradead.org) by desiato.infradead.org with esmtp (Exim 4.94 #2 (Red Hat Linux)) id 1lg9uf-00FDws-H2; Mon, 10 May 2021 17:41:50 +0000 Received: from bombadil.infradead.org ([2607:7c80:54:e::133]) by desiato.infradead.org with esmtps (Exim 4.94 #2 (Red Hat Linux)) id 1lg9hb-00F94A-Ua for linux-arm-kernel@desiato.infradead.org; Mon, 10 May 2021 17:28:20 +0000 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20210309; h=Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From:Sender :Reply-To:Content-Type:Content-ID:Content-Description; bh=coRWHPm+TnRYB99zcWnOdmKeDKewGhHhRv6oV2B1paQ=; b=dovuRaSgSqs37eTQmLZKrXIw6q aWYh4IxMclY32STUk4V0axDCclYLU8ZyEJW+nQ/F5X/BHVzY+AF1iOAfqPqWNoPKWVD5ITCy1uCbF i72hJ6F9wQFA+7IojUDXDPnw3omiyuDm5LhYoaY5VoJwas5zkLi6DB5uJ3pe8+uC0Bx4tyvmDa508 unHsmoJVrPbQ8pTIY0Qx+UE85mpEd/7BvS6Wt4gw4+4bljGys5GV3q79/V2EoPBhh0DOlIujH0fp+ ie+hVAL2P7vRAkxrPvNgxM5GFA5q7aL3WJ/cVJOpiUaL+qzKWjv/8DzYFcJP0nD+D7o5yfj9zypFY o3PmQ1+A==; Received: from mail.kernel.org ([198.145.29.99]) by bombadil.infradead.org with esmtps (Exim 4.94 #2 (Red Hat Linux)) id 1lg9hY-008yem-Lj for linux-arm-kernel@lists.infradead.org; Mon, 10 May 2021 17:28:18 +0000 Received: from disco-boy.misterjones.org (disco-boy.misterjones.org [51.254.78.96]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 24A3D61469; Mon, 10 May 2021 17:28:16 +0000 (UTC) Received: from 78.163-31-62.static.virginmediabusiness.co.uk ([62.31.163.78] helo=why.lan) by disco-boy.misterjones.org with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1lg9GX-000Uqg-65; Mon, 10 May 2021 18:00:21 +0100 From: Marc Zyngier To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.cs.columbia.edu, kvm@vger.kernel.org Cc: Andre Przywara , Christoffer Dall , Jintack Lim , Haibo Xu , James Morse , Suzuki K Poulose , Alexandru Elisei , kernel-team@android.com, Christoffer Dall , Jintack Lim Subject: [PATCH v4 35/66] KVM: arm64: nv: Handle shadow stage 2 page faults Date: Mon, 10 May 2021 17:58:49 +0100 Message-Id: <20210510165920.1913477-36-maz@kernel.org> X-Mailer: git-send-email 2.29.2 In-Reply-To: <20210510165920.1913477-1-maz@kernel.org> References: <20210510165920.1913477-1-maz@kernel.org> MIME-Version: 1.0 X-SA-Exim-Connect-IP: 62.31.163.78 X-SA-Exim-Rcpt-To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.cs.columbia.edu, kvm@vger.kernel.org, andre.przywara@arm.com, christoffer.dall@arm.com, jintack@cs.columbia.edu, haibo.xu@linaro.org, james.morse@arm.com, suzuki.poulose@arm.com, alexandru.elisei@arm.com, kernel-team@android.com, christoffer.dall@linaro.org, jintack.lim@linaro.org X-SA-Exim-Mail-From: maz@kernel.org X-SA-Exim-Scanned: No (on disco-boy.misterjones.org); SAEximRunCond expanded to false X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20210510_102816_807483_E87818C0 X-CRM114-Status: GOOD ( 30.96 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org If we are faulting on a shadow stage 2 translation, we first walk the guest hypervisor's stage 2 page table to see if it has a mapping. If not, we inject a stage 2 page fault to the virtual EL2. Otherwise, we create a mapping in the shadow stage 2 page table. Note that we have to deal with two IPAs when we got a shadow stage 2 page fault. One is the address we faulted on, and is in the L2 guest phys space. The other is from the guest stage-2 page table walk, and is in the L1 guest phys space. To differentiate them, we rename variables so that fault_ipa is used for the former and ipa is used for the latter. Co-developed-by: Christoffer Dall Co-developed-by: Jintack Lim Signed-off-by: Christoffer Dall Signed-off-by: Jintack Lim [maz: rewrote this multiple times...] Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/kvm_emulate.h | 6 ++ arch/arm64/include/asm/kvm_nested.h | 18 ++++++ arch/arm64/kvm/mmu.c | 93 ++++++++++++++++++++++------ arch/arm64/kvm/nested.c | 48 ++++++++++++++ 4 files changed, 147 insertions(+), 18 deletions(-) diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h index 12935f6a7aa9..1fb2edc923cc 100644 --- a/arch/arm64/include/asm/kvm_emulate.h +++ b/arch/arm64/include/asm/kvm_emulate.h @@ -591,4 +591,10 @@ static __always_inline void kvm_incr_pc(struct kvm_vcpu *vcpu) vcpu->arch.flags |= KVM_ARM64_INCREMENT_PC; } +static inline bool kvm_is_shadow_s2_fault(struct kvm_vcpu *vcpu) +{ + return (vcpu->arch.hw_mmu != &vcpu->kvm->arch.mmu && + vcpu->arch.hw_mmu->nested_stage2_enabled); +} + #endif /* __ARM64_KVM_EMULATE_H__ */ diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h index b784d7891851..4f93a5dab183 100644 --- a/arch/arm64/include/asm/kvm_nested.h +++ b/arch/arm64/include/asm/kvm_nested.h @@ -78,9 +78,27 @@ struct kvm_s2_trans { u64 upper_attr; }; +static inline phys_addr_t kvm_s2_trans_output(struct kvm_s2_trans *trans) +{ + return trans->output; +} + +static inline unsigned long kvm_s2_trans_size(struct kvm_s2_trans *trans) +{ + return trans->block_size; +} + +static inline u32 kvm_s2_trans_esr(struct kvm_s2_trans *trans) +{ + return trans->esr; +} + extern int kvm_walk_nested_s2(struct kvm_vcpu *vcpu, phys_addr_t gipa, struct kvm_s2_trans *result); +extern int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu, + struct kvm_s2_trans *trans); +extern int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2); int handle_wfx_nested(struct kvm_vcpu *vcpu, bool is_wfe); extern bool __forward_traps(struct kvm_vcpu *vcpu, unsigned int reg, u64 control_bit); diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 5c1a9966ff31..6b3753460293 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -796,7 +796,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot, static unsigned long transparent_hugepage_adjust(struct kvm_memory_slot *memslot, unsigned long hva, kvm_pfn_t *pfnp, - phys_addr_t *ipap) + phys_addr_t *ipap, phys_addr_t *fault_ipap) { kvm_pfn_t pfn = *pfnp; @@ -825,6 +825,7 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot, * to PG_head and switch the pfn from a tail page to the head * page accordingly. */ + *fault_ipap &= PMD_MASK; *ipap &= PMD_MASK; kvm_release_pfn_clean(pfn); pfn &= ~(PTRS_PER_PMD - 1); @@ -839,14 +840,16 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot, } static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, - struct kvm_memory_slot *memslot, unsigned long hva, - unsigned long fault_status) + struct kvm_s2_trans *nested, + struct kvm_memory_slot *memslot, + unsigned long hva, unsigned long fault_status) { int ret = 0; - bool write_fault, writable, force_pte = false; + bool write_fault, writable; bool exec_fault; bool device = false; unsigned long mmu_seq; + phys_addr_t ipa = fault_ipa; struct kvm *kvm = vcpu->kvm; struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache; struct vm_area_struct *vma; @@ -858,6 +861,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, unsigned long vma_pagesize, fault_granule; enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R; struct kvm_pgtable *pgt; + unsigned long max_map_size = PUD_SIZE; fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level); write_fault = kvm_is_write_fault(vcpu); @@ -885,7 +889,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (logging_active || (vma->vm_flags & VM_PFNMAP)) { - force_pte = true; + max_map_size = vma_pagesize = PAGE_SIZE; vma_shift = PAGE_SHIFT; } @@ -905,7 +909,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, fallthrough; case CONT_PTE_SHIFT: vma_shift = PAGE_SHIFT; - force_pte = true; + max_map_size = PAGE_SIZE; fallthrough; case PAGE_SHIFT: break; @@ -914,10 +918,25 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, } vma_pagesize = 1UL << vma_shift; + + if (kvm_is_shadow_s2_fault(vcpu)) { + ipa = kvm_s2_trans_output(nested); + + /* + * If we're about to create a shadow stage 2 entry, then we + * can only create a block mapping if the guest stage 2 page + * table uses at least as big a mapping. + */ + max_map_size = min(kvm_s2_trans_size(nested), max_map_size); + } + + vma_pagesize = min(vma_pagesize, max_map_size); + if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE) fault_ipa &= ~(vma_pagesize - 1); - gfn = fault_ipa >> PAGE_SHIFT; + gfn = ipa >> PAGE_SHIFT; + mmap_read_unlock(current->mm); /* @@ -960,7 +979,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (kvm_is_device_pfn(pfn)) { device = true; - force_pte = true; + max_map_size = PAGE_SIZE; } else if (logging_active && !write_fault) { /* * Only actually map the page as writable if this was a write @@ -981,9 +1000,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * If we are not forced to use page mapping, check if we are * backed by a THP and thus use block mapping if possible. */ - if (vma_pagesize == PAGE_SIZE && !force_pte) - vma_pagesize = transparent_hugepage_adjust(memslot, hva, - &pfn, &fault_ipa); + if (vma_pagesize == PAGE_SIZE && max_map_size >= PMD_SIZE) + vma_pagesize = transparent_hugepage_adjust(memslot, hva, &pfn, + &ipa, &fault_ipa); if (writable) prot |= KVM_PGTABLE_PROT_W; @@ -1059,8 +1078,10 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa) int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) { unsigned long fault_status; - phys_addr_t fault_ipa; + phys_addr_t fault_ipa; /* The address we faulted on */ + phys_addr_t ipa; /* Always the IPA in the L1 guest phys space */ struct kvm_memory_slot *memslot; + struct kvm_s2_trans nested_trans; unsigned long hva; bool is_iabt, write_fault, writable; gfn_t gfn; @@ -1068,7 +1089,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) fault_status = kvm_vcpu_trap_get_fault_type(vcpu); - fault_ipa = kvm_vcpu_get_fault_ipa(vcpu); + ipa = fault_ipa = kvm_vcpu_get_fault_ipa(vcpu); is_iabt = kvm_vcpu_trap_is_iabt(vcpu); /* Synchronous External Abort? */ @@ -1089,6 +1110,12 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) /* Check the stage-2 fault is trans. fault or write fault */ if (fault_status != FSC_FAULT && fault_status != FSC_PERM && fault_status != FSC_ACCESS) { + /* + * We must never see an address size fault on shadow stage 2 + * page table walk, because we would have injected an addr + * size fault when we walked the nested s2 page and not + * create the shadow entry. + */ kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n", kvm_vcpu_trap_get_class(vcpu), (unsigned long)kvm_vcpu_trap_get_fault(vcpu), @@ -1098,7 +1125,36 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) idx = srcu_read_lock(&vcpu->kvm->srcu); - gfn = fault_ipa >> PAGE_SHIFT; + /* + * We may have faulted on a shadow stage 2 page table if we are + * running a nested guest. In this case, we have to resolve the L2 + * IPA to the L1 IPA first, before knowing what kind of memory should + * back the L1 IPA. + * + * If the shadow stage 2 page table walk faults, then we simply inject + * this to the guest and carry on. + */ + if (kvm_is_shadow_s2_fault(vcpu)) { + u32 esr; + + ret = kvm_walk_nested_s2(vcpu, fault_ipa, &nested_trans); + esr = kvm_s2_trans_esr(&nested_trans); + if (esr) + kvm_inject_s2_fault(vcpu, esr); + if (ret) + goto out_unlock; + + ret = kvm_s2_handle_perm_fault(vcpu, &nested_trans); + esr = kvm_s2_trans_esr(&nested_trans); + if (esr) + kvm_inject_s2_fault(vcpu, esr); + if (ret) + goto out_unlock; + + ipa = kvm_s2_trans_output(&nested_trans); + } + + gfn = ipa >> PAGE_SHIFT; memslot = gfn_to_memslot(vcpu->kvm, gfn); hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable); write_fault = kvm_is_write_fault(vcpu); @@ -1142,13 +1198,13 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) * faulting VA. This is always 12 bits, irrespective * of the page size. */ - fault_ipa |= kvm_vcpu_get_hfar(vcpu) & ((1 << 12) - 1); - ret = io_mem_abort(vcpu, fault_ipa); + ipa |= kvm_vcpu_get_hfar(vcpu) & ((1 << 12) - 1); + ret = io_mem_abort(vcpu, ipa); goto out_unlock; } /* Userspace should not be able to register out-of-bounds IPAs */ - VM_BUG_ON(fault_ipa >= kvm_phys_size(vcpu->kvm)); + VM_BUG_ON(ipa >= kvm_phys_size(vcpu->kvm)); if (fault_status == FSC_ACCESS) { handle_access_fault(vcpu, fault_ipa); @@ -1156,7 +1212,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) goto out_unlock; } - ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status); + ret = user_mem_abort(vcpu, fault_ipa, &nested_trans, + memslot, hva, fault_status); if (ret == 0) ret = 1; out: diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c index 1067b6422cf2..57f32768d04d 100644 --- a/arch/arm64/kvm/nested.c +++ b/arch/arm64/kvm/nested.c @@ -108,6 +108,15 @@ static u32 compute_fsc(int level, u32 fsc) return fsc | (level & 0x3); } +static int esr_s2_fault(struct kvm_vcpu *vcpu, int level, u32 fsc) +{ + u32 esr; + + esr = kvm_vcpu_get_esr(vcpu) & ~ESR_ELx_FSC; + esr |= compute_fsc(level, fsc); + return esr; +} + static int check_base_s2_limits(struct s2_walk_info *wi, int level, int input_size, int stride) { @@ -457,6 +466,45 @@ void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu) } } +/* + * Returns non-zero if permission fault is handled by injecting it to the next + * level hypervisor. + */ +int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu, struct kvm_s2_trans *trans) +{ + unsigned long fault_status = kvm_vcpu_trap_get_fault_type(vcpu); + bool forward_fault = false; + + trans->esr = 0; + + if (fault_status != FSC_PERM) + return 0; + + if (kvm_vcpu_trap_is_iabt(vcpu)) { + forward_fault = (trans->upper_attr & BIT(54)); + } else { + bool write_fault = kvm_is_write_fault(vcpu); + + forward_fault = ((write_fault && !trans->writable) || + (!write_fault && !trans->readable)); + } + + if (forward_fault) { + trans->esr = esr_s2_fault(vcpu, trans->level, ESR_ELx_FSC_PERM); + return 1; + } + + return 0; +} + +int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2) +{ + vcpu_write_sys_reg(vcpu, vcpu->arch.fault.far_el2, FAR_EL2); + vcpu_write_sys_reg(vcpu, vcpu->arch.fault.hpfar_el2, HPFAR_EL2); + + return kvm_inject_nested_sync(vcpu, esr_el2); +} + /* * Inject wfx to the virtual EL2 if this is not from the virtual EL2 and * the virtual HCR_EL2.TWX is set. Otherwise, let the host hypervisor