From patchwork Fri Apr 1 19:46:52 2022
X-Patchwork-Submitter: Oliver Upton
X-Patchwork-Id: 12798638
Date: Fri, 1 Apr 2022 19:46:52 +0000
Message-Id: <20220401194652.950240-1-oupton@google.com>
Subject: [PATCH v2] KVM: arm64: Don't split hugepages outside of MMU write lock
From: Oliver Upton
To: kvmarm@lists.cs.columbia.edu
Cc: kvm@vger.kernel.org, Marc Zyngier, James Morse, Alexandru Elisei,
    Suzuki K Poulose, linux-arm-kernel@lists.infradead.org, Peter Shier,
    Ricardo Koller, Reiji Watanabe, Oliver Upton, Jing Zhang

It is possible to take a stage-2 permission fault on a page larger than
PAGE_SIZE. For example, when running a guest backed by 2M HugeTLB, KVM
eagerly maps at the largest possible block size. When dirty logging is
enabled on a memslot, KVM does *not* eagerly split these 2M stage-2
mappings and instead clears the write bit on the pte. Since dirty
logging is always performed at PAGE_SIZE granularity, KVM lazily splits
these 2M block mappings down to PAGE_SIZE in the stage-2 fault handler.
This operation must be done under the write lock.

Since commit f783ef1c0e82 ("KVM: arm64: Add fast path to handle
permission relaxation during dirty logging"), the stage-2 fault handler
conditionally takes the read lock on permission faults with dirty
logging enabled. As a result, it is possible to split a 2M block
mapping while holding only the read lock.

The problem is demonstrated by running kvm_page_table_test with 2M
anonymous HugeTLB, which splats like so:

  WARNING: CPU: 5 PID: 15276 at arch/arm64/kvm/hyp/pgtable.c:153 stage2_map_walk_leaf+0x124/0x158

  [...]

  Call trace:
   stage2_map_walk_leaf+0x124/0x158
   stage2_map_walker+0x5c/0xf0
   __kvm_pgtable_walk+0x100/0x1d4
   __kvm_pgtable_walk+0x140/0x1d4
   __kvm_pgtable_walk+0x140/0x1d4
   kvm_pgtable_walk+0xa0/0xf8
   kvm_pgtable_stage2_map+0x15c/0x198
   user_mem_abort+0x56c/0x838
   kvm_handle_guest_abort+0x1fc/0x2a4
   handle_exit+0xa4/0x120
   kvm_arch_vcpu_ioctl_run+0x200/0x448
   kvm_vcpu_ioctl+0x588/0x664
   __arm64_sys_ioctl+0x9c/0xd4
   invoke_syscall+0x4c/0x144
   el0_svc_common+0xc4/0x190
   do_el0_svc+0x30/0x8c
   el0_svc+0x28/0xcc
   el0t_64_sync_handler+0x84/0xe4
   el0t_64_sync+0x1a4/0x1a8

Fix the issue by only acquiring the read lock if the guest faulted on a
PAGE_SIZE granule with dirty logging enabled. Add a WARN to catch
locking bugs in future changes.

Fixes: f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging")
Cc: Jing Zhang
Signed-off-by: Oliver Upton
Reviewed-by: Reiji Watanabe
---
Applies cleanly to kvmarm/fixes at the following commit:

  8872d9b3e35a ("KVM: arm64: Drop unneeded minor version check from PSCI v1.x handler")

Tested the patch by running KVM selftests. Additionally, I did 10
iterations of the kvm_page_table_test with 2M anon HugeTLB memory.

It is expected that this patch will cause fault serialization in the
pathological case where all vCPUs are faulting on the same granule of
memory, as every vCPU will attempt to acquire the write lock. The only
safe way to cure this contention is to dissolve pages eagerly outside
of the stage-2 fault handler (like x86 does).
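For reviewers who want to poke at the locking policy in isolation, here
is a minimal userspace sketch of the decision this patch codifies. It
uses pthreads rather than the kernel's rwlock_t, and FSC_PERM,
PAGE_SIZE, and struct fault below are simplified stand-ins for the real
KVM definitions, so treat it as an illustration rather than the kernel
code:

  /*
   * Sketch of the read-vs-write mmu_lock policy after this patch.
   * Build: cc -o lockpolicy lockpolicy.c -lpthread
   */
  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  #define FSC_PERM	0x0c	/* illustrative; see asm/kvm_arm.h */
  #define PAGE_SIZE	4096UL

  static pthread_rwlock_t mmu_lock = PTHREAD_RWLOCK_INITIALIZER;

  struct fault {
  	unsigned long fault_status;	/* fault status code */
  	bool write_fault;		/* was this a write access? */
  	unsigned long fault_granule;	/* size of the faulting mapping */
  	bool logging_active;		/* dirty logging on this memslot? */
  };

  /*
   * The read lock is only safe when the fault can be resolved by
   * relaxing permissions in place: a write permission fault on a
   * mapping that is already PAGE_SIZE. Anything that may split a
   * block mapping must hold the write lock.
   */
  static bool use_read_lock(const struct fault *f)
  {
  	return f->logging_active &&
  	       f->fault_status == FSC_PERM &&
  	       f->write_fault &&
  	       f->fault_granule == PAGE_SIZE;
  }

  static void handle_fault(const struct fault *f)
  {
  	bool rd = use_read_lock(f);

  	if (rd)
  		pthread_rwlock_rdlock(&mmu_lock);
  	else
  		pthread_rwlock_wrlock(&mmu_lock);

  	printf("%s lock: %s\n", rd ? "read" : "write",
  	       rd ? "relax permissions in place" : "may split block mapping");

  	pthread_rwlock_unlock(&mmu_lock);
  }

  int main(void)
  {
  	/* 2M block fault under dirty logging: must take the write lock */
  	struct fault split = { FSC_PERM, true, 2UL << 20, true };
  	/* PAGE_SIZE permission fault: the read lock is enough */
  	struct fault relax = { FSC_PERM, true, PAGE_SIZE, true };

  	handle_fault(&split);
  	handle_fault(&relax);
  	return 0;
  }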
v1: https://lore.kernel.org/kvmarm/20220331213844.2894006-1-oupton@google.com/

v1 -> v2:
 - Drop impossible check for !use_read_lock before
   kvm_pgtable_stage2_map() (Reiji)
 - Codify the requirement to hold the write lock to call
   kvm_pgtable_stage2_map() with a WARN

 arch/arm64/kvm/mmu.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 0d19259454d8..53ae2c0640bc 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1079,7 +1079,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	gfn_t gfn;
 	kvm_pfn_t pfn;
 	bool logging_active = memslot_is_logging(memslot);
-	bool logging_perm_fault = false;
+	bool use_read_lock = false;
 	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
 	unsigned long vma_pagesize, fault_granule;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
@@ -1114,7 +1114,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (logging_active) {
 		force_pte = true;
 		vma_shift = PAGE_SHIFT;
-		logging_perm_fault = (fault_status == FSC_PERM && write_fault);
+		use_read_lock = (fault_status == FSC_PERM && write_fault &&
+				 fault_granule == PAGE_SIZE);
 	} else {
 		vma_shift = get_vma_page_shift(vma, hva);
 	}
@@ -1218,7 +1219,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * logging dirty logging, only acquire read lock for permission
 	 * relaxation.
 	 */
-	if (logging_perm_fault)
+	if (use_read_lock)
 		read_lock(&kvm->mmu_lock);
 	else
 		write_lock(&kvm->mmu_lock);
@@ -1268,6 +1269,8 @@
 	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
 		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
 	} else {
+		WARN_ONCE(use_read_lock, "Attempted stage-2 map outside of write lock\n");
+
 		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
 					     __pfn_to_phys(pfn), prot,
 					     memcache);
@@ -1280,7 +1283,7 @@
 	}
 
 out_unlock:
-	if (logging_perm_fault)
+	if (use_read_lock)
 		read_unlock(&kvm->mmu_lock);
 	else
 		write_unlock(&kvm->mmu_lock);
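
A side note on the WARN_ONCE() added above: since the kernel's rwlock
has no owner tracking, the WARN keys off the locally computed
use_read_lock flag rather than querying the lock itself. A standalone
sketch of the same guard pattern (all names illustrative, not kernel
API) might look like:

  /* Build: cc -std=c11 -o guard guard.c -lpthread */
  #include <assert.h>
  #include <pthread.h>
  #include <stdbool.h>

  static pthread_rwlock_t mmu_lock = PTHREAD_RWLOCK_INITIALIZER;
  static _Thread_local bool holds_write;	/* per-thread lock mode */

  static void lock_mmu(bool write)
  {
  	if (write)
  		pthread_rwlock_wrlock(&mmu_lock);
  	else
  		pthread_rwlock_rdlock(&mmu_lock);
  	holds_write = write;
  }

  static void unlock_mmu(void)
  {
  	holds_write = false;
  	pthread_rwlock_unlock(&mmu_lock);
  }

  /* Stand-in for kvm_pgtable_stage2_map(): writers only. */
  static void stage2_map(void)
  {
  	assert(holds_write && "stage-2 map outside of write lock");
  	/* ... install or split mappings ... */
  }

  int main(void)
  {
  	lock_mmu(true);		/* correct: a writer may map */
  	stage2_map();
  	unlock_mmu();
  	return 0;
  }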