From patchwork Mon Sep 11 02:16:31 2023
From: David Stevens
To: Sean Christopherson
Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm@lists.linux.dev,
    linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Subject: [PATCH v9 1/6] KVM: Assert that a page's refcount is elevated when marking accessed/dirty
Date: Mon, 11 Sep 2023 11:16:31 +0900
Message-ID: <20230911021637.1941096-2-stevensd@google.com>
In-Reply-To: <20230911021637.1941096-1-stevensd@google.com>
References: <20230911021637.1941096-1-stevensd@google.com>
From: Sean Christopherson

Assert that a page's refcount is elevated, i.e. that _something_ holds a
reference to the page, when KVM marks a page as accessed and/or dirty.
KVM typically doesn't hold a reference to pages that are mapped into the
guest, e.g. to allow page migration, compaction, swap, etc., and instead
relies on mmu_notifiers to react to changes in the primary MMU.

Incorrect handling of mmu_notifier events (or similar mechanisms) can
result in KVM keeping a mapping beyond the lifetime of the backing page,
i.e. can (and often does) result in use-after-free.  Yelling if KVM marks
a freed page as accessed/dirty doesn't prevent badness, as KVM usually
only does A/D updates when unmapping memory from the guest, i.e. the
assertion fires well after an underlying bug has occurred.  But yelling
does help detect, triage, and debug use-after-free bugs.

Note, the assertion must use page_count(), NOT page_ref_count()!  For
hugepages, the returned struct page may be a tail page and thus not have
its own refcount.

Signed-off-by: Sean Christopherson
---
 virt/kvm/kvm_main.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d63cf1c4f5a7..ee6090ecb1fe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2914,6 +2914,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
 
 static bool kvm_is_ad_tracked_page(struct page *page)
 {
+	/*
+	 * Assert that KVM isn't attempting to mark a freed page as Accessed or
+	 * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug.  KVM
+	 * (typically) doesn't pin pages that are mapped in KVM's MMU, and
+	 * instead relies on mmu_notifiers to know when a mapping needs to be
+	 * zapped/invalidated.  Unmapping from KVM's MMU must happen _before_
+	 * KVM returns from its mmu_notifier, i.e. the page should have an
+	 * elevated refcount at this point even though KVM doesn't hold a
+	 * reference of its own.
+	 */
+	if (WARN_ON_ONCE(!page_count(page)))
+		return false;
+
 	/*
 	 * Per page-flags.h, pages tagged PG_reserved "should in general not be
 	 * touched (e.g. set dirty) except by its owner".
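The distinction called out above matters because page_count() reads the
refcount of the compound head (so a tail page of a hugepage reports the
head's elevated count), while page_ref_count() reads the given struct
page's own _refcount, which is zero for tail pages. The standalone C
sketch below models that difference with simplified stand-in types;
fake_page, fake_page_count() and fake_page_ref_count() are invented for
illustration and are not kernel APIs.

#include <assert.h>
#include <stdio.h>

/* Simplified stand-ins for struct page and compound (huge) pages. */
struct fake_page {
	int refcount;              /* like page->_refcount */
	struct fake_page *head;    /* compound head; points to itself for a head page */
};

/* Raw per-page refcount, analogous to page_ref_count(). */
static int fake_page_ref_count(const struct fake_page *p)
{
	return p->refcount;
}

/* Head-aware refcount, analogous to page_count(), which reads the compound head. */
static int fake_page_count(const struct fake_page *p)
{
	return p->head->refcount;
}

int main(void)
{
	/* A two-page compound allocation: only the head page carries the refcount. */
	struct fake_page head = { .refcount = 1 };
	struct fake_page tail = { .refcount = 0 };
	head.head = &head;
	tail.head = &head;

	/* The head-aware helper sees the elevated refcount even for the tail page... */
	assert(fake_page_count(&tail) == 1);
	/* ...while the raw per-page count on the tail is zero. */
	assert(fake_page_ref_count(&tail) == 0);

	printf("tail: page_count()=%d, page_ref_count()=%d\n",
	       fake_page_count(&tail), fake_page_ref_count(&tail));
	return 0;
}

Under this model, an assertion written against the raw per-page count
would fire spuriously on hugepage tail pages, which is why the patch's
WARN_ON_ONCE() is built on page_count().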
From patchwork Mon Sep 11 02:16:32 2023
From: David Stevens
To: Sean Christopherson
Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm@lists.linux.dev,
    linux-kernel@vger.kernel.org, kvm@vger.kernel.org, David Stevens
Subject: [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function
Date: Mon, 11 Sep 2023 11:16:32 +0900
Message-ID: <20230911021637.1941096-3-stevensd@google.com>
In-Reply-To: <20230911021637.1941096-1-stevensd@google.com>
References: <20230911021637.1941096-1-stevensd@google.com>
MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: David Stevens Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot. __kvm_follow_pfn refactors the old API's arguments into a struct and, where possible, combines the boolean arguments into a single flags argument. Signed-off-by: David Stevens Reviewed-by: Maxim Levitsky --- include/linux/kvm_host.h | 16 ++++ virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++----------------- virt/kvm/kvm_mm.h | 3 +- virt/kvm/pfncache.c | 10 ++- 4 files changed, 123 insertions(+), 77 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index fb6c6109fdca..c2e0ddf14dba 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -97,6 +97,7 @@ #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1) #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3) +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4) /* * error pfns indicate that the gfn is in slot but faild to @@ -1177,6 +1178,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn, void kvm_release_page_clean(struct page *page); void kvm_release_page_dirty(struct page *page); +struct kvm_follow_pfn { + const struct kvm_memory_slot *slot; + gfn_t gfn; + unsigned int flags; + bool atomic; + /* Try to create a writable mapping even for a read fault */ + bool try_map_writable; + + /* Outputs of __kvm_follow_pfn */ + hva_t hva; + bool writable; +}; + +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll); + kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn); kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, bool *writable); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index ee6090ecb1fe..9b33a59c6d65 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2512,8 +2512,7 @@ static inline int check_user_page_hwpoison(unsigned long addr) * true indicates success, otherwise false is returned. It's also the * only part that runs if we can in atomic context. */ -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault, - bool *writable, kvm_pfn_t *pfn) +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) { struct page *page[1]; @@ -2522,14 +2521,12 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault, * or the caller allows to map a writable pfn for a read fault * request. */ - if (!(write_fault || writable)) + if (!((foll->flags & FOLL_WRITE) || foll->try_map_writable)) return false; - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) { + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) { *pfn = page_to_pfn(page[0]); - - if (writable) - *writable = true; + foll->writable = true; return true; } @@ -2540,35 +2537,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault, * The slow path to get the pfn of the specified host virtual address, * 1 indicates success, -errno is returned if error is detected. 
*/ -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault, - bool interruptible, bool *writable, kvm_pfn_t *pfn) +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) { - unsigned int flags = FOLL_HWPOISON; + unsigned int flags = FOLL_HWPOISON | foll->flags; struct page *page; int npages; might_sleep(); - if (writable) - *writable = write_fault; - - if (write_fault) - flags |= FOLL_WRITE; - if (async) - flags |= FOLL_NOWAIT; - if (interruptible) - flags |= FOLL_INTERRUPTIBLE; - - npages = get_user_pages_unlocked(addr, 1, &page, flags); + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags); if (npages != 1) return npages; - /* map read fault as writable if possible */ - if (unlikely(!write_fault) && writable) { + if (foll->flags & FOLL_WRITE) { + foll->writable = true; + } else if (foll->try_map_writable) { struct page *wpage; - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) { - *writable = true; + /* map read fault as writable if possible */ + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) { + foll->writable = true; put_page(page); page = wpage; } @@ -2599,23 +2587,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn) } static int hva_to_pfn_remapped(struct vm_area_struct *vma, - unsigned long addr, bool write_fault, - bool *writable, kvm_pfn_t *p_pfn) + struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn) { kvm_pfn_t pfn; pte_t *ptep; pte_t pte; spinlock_t *ptl; + bool write_fault = foll->flags & FOLL_WRITE; int r; - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl); + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl); if (r) { /* * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does * not call the fault handler, so do it here. */ bool unlocked = false; - r = fixup_user_fault(current->mm, addr, + r = fixup_user_fault(current->mm, foll->hva, (write_fault ? FAULT_FLAG_WRITE : 0), &unlocked); if (unlocked) @@ -2623,7 +2611,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, if (r) return r; - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl); + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl); if (r) return r; } @@ -2635,8 +2623,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, goto out; } - if (writable) - *writable = pte_write(pte); + foll->writable = pte_write(pte); pfn = pte_pfn(pte); /* @@ -2681,24 +2668,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, * 2): @write_fault = false && @writable, @writable will tell the caller * whether the mapping is writable. 
*/ -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible, - bool *async, bool write_fault, bool *writable) +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll) { struct vm_area_struct *vma; kvm_pfn_t pfn; int npages, r; /* we can do it either atomically or asynchronously, not both */ - BUG_ON(atomic && async); + BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT)); - if (hva_to_pfn_fast(addr, write_fault, writable, &pfn)) + if (hva_to_pfn_fast(foll, &pfn)) return pfn; - if (atomic) + if (foll->atomic) return KVM_PFN_ERR_FAULT; - npages = hva_to_pfn_slow(addr, async, write_fault, interruptible, - writable, &pfn); + npages = hva_to_pfn_slow(foll, &pfn); if (npages == 1) return pfn; if (npages == -EINTR) @@ -2706,83 +2691,123 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible, mmap_read_lock(current->mm); if (npages == -EHWPOISON || - (!async && check_user_page_hwpoison(addr))) { + (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) { pfn = KVM_PFN_ERR_HWPOISON; goto exit; } retry: - vma = vma_lookup(current->mm, addr); + vma = vma_lookup(current->mm, foll->hva); if (vma == NULL) pfn = KVM_PFN_ERR_FAULT; else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) { - r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn); + r = hva_to_pfn_remapped(vma, foll, &pfn); if (r == -EAGAIN) goto retry; if (r < 0) pfn = KVM_PFN_ERR_FAULT; } else { - if (async && vma_is_valid(vma, write_fault)) - *async = true; - pfn = KVM_PFN_ERR_FAULT; + if ((foll->flags & FOLL_NOWAIT) && + vma_is_valid(vma, foll->flags & FOLL_WRITE)) + pfn = KVM_PFN_ERR_NEEDS_IO; + else + pfn = KVM_PFN_ERR_FAULT; } exit: mmap_read_unlock(current->mm); return pfn; } -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn, - bool atomic, bool interruptible, bool *async, - bool write_fault, bool *writable, hva_t *hva) +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll) { - unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault); + foll->writable = false; + foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL, + foll->flags & FOLL_WRITE); - if (hva) - *hva = addr; - - if (addr == KVM_HVA_ERR_RO_BAD) { - if (writable) - *writable = false; + if (foll->hva == KVM_HVA_ERR_RO_BAD) return KVM_PFN_ERR_RO_FAULT; - } - if (kvm_is_error_hva(addr)) { - if (writable) - *writable = false; + if (kvm_is_error_hva(foll->hva)) return KVM_PFN_NOSLOT; - } - /* Do not map writable pfn in the readonly memslot. 
*/ - if (writable && memslot_is_readonly(slot)) { - *writable = false; - writable = NULL; - } + if (memslot_is_readonly(foll->slot)) + foll->try_map_writable = false; - return hva_to_pfn(addr, atomic, interruptible, async, write_fault, - writable); + return hva_to_pfn(foll); +} +EXPORT_SYMBOL_GPL(__kvm_follow_pfn); + +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn, + bool atomic, bool interruptible, bool *async, + bool write_fault, bool *writable, hva_t *hva) +{ + kvm_pfn_t pfn; + struct kvm_follow_pfn foll = { + .slot = slot, + .gfn = gfn, + .flags = 0, + .atomic = atomic, + .try_map_writable = !!writable, + }; + + if (write_fault) + foll.flags |= FOLL_WRITE; + if (async) + foll.flags |= FOLL_NOWAIT; + if (interruptible) + foll.flags |= FOLL_INTERRUPTIBLE; + + pfn = __kvm_follow_pfn(&foll); + if (pfn == KVM_PFN_ERR_NEEDS_IO) { + *async = true; + pfn = KVM_PFN_ERR_FAULT; + } + if (hva) + *hva = foll.hva; + if (writable) + *writable = foll.writable; + return pfn; } EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot); kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, bool *writable) { - return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false, - NULL, write_fault, writable, NULL); + kvm_pfn_t pfn; + struct kvm_follow_pfn foll = { + .slot = gfn_to_memslot(kvm, gfn), + .gfn = gfn, + .flags = write_fault ? FOLL_WRITE : 0, + .try_map_writable = !!writable, + }; + pfn = __kvm_follow_pfn(&foll); + if (writable) + *writable = foll.writable; + return pfn; } EXPORT_SYMBOL_GPL(gfn_to_pfn_prot); kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn) { - return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true, - NULL, NULL); + struct kvm_follow_pfn foll = { + .slot = slot, + .gfn = gfn, + .flags = FOLL_WRITE, + }; + return __kvm_follow_pfn(&foll); } EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot); kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn) { - return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true, - NULL, NULL); + struct kvm_follow_pfn foll = { + .slot = slot, + .gfn = gfn, + .flags = FOLL_WRITE, + .atomic = true, + }; + return __kvm_follow_pfn(&foll); } EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic); diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h index 180f1a09e6ba..ed896aee5396 100644 --- a/virt/kvm/kvm_mm.h +++ b/virt/kvm/kvm_mm.h @@ -20,8 +20,7 @@ #define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock) #endif /* KVM_HAVE_MMU_RWLOCK */ -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible, - bool *async, bool write_fault, bool *writable); +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll); #ifdef CONFIG_HAVE_KVM_PFNCACHE void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index 2d6aba677830..86cd40acad11 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc) kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT; void *new_khva = NULL; unsigned long mmu_seq; + struct kvm_follow_pfn foll = { + .slot = gpc->memslot, + .gfn = gpa_to_gfn(gpc->gpa), + .flags = FOLL_WRITE, + .hva = gpc->uhva, + }; lockdep_assert_held(&gpc->refresh_lock); @@ -182,8 +188,8 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc) cond_resched(); } - /* We always request a writeable mapping */ - new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL); + /* We always request a writable mapping */ + new_pfn = hva_to_pfn(&foll); if 
(is_error_noslot_pfn(new_pfn)) goto out_error;

From patchwork Mon Sep 11 02:16:33 2023
From: David Stevens
To: Sean Christopherson
Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm@lists.linux.dev,
    linux-kernel@vger.kernel.org, kvm@vger.kernel.org, David Stevens
Subject: [PATCH v9 3/6] KVM: mmu: Improve handling of non-refcounted pfns
Date: Mon, 11 Sep 2023 11:16:33 +0900
Message-ID: <20230911021637.1941096-4-stevensd@google.com>
In-Reply-To: <20230911021637.1941096-1-stevensd@google.com>
References: <20230911021637.1941096-1-stevensd@google.com>
From: David Stevens

KVM's handling of non-refcounted pfns has two problems:

 - struct pages without refcounting (e.g. tail pages of non-compound
   higher order pages) cannot be used at all, as gfn_to_pfn does not
   provide enough information for callers to handle the refcount.

 - pfns without struct pages can be accessed without the protection of
   an mmu notifier. This is unsafe because KVM cannot monitor or control
   the lifespan of such pfns, so it may continue to access the pfns
   after they are freed.

This patch extends the __kvm_follow_pfn API to properly handle these
cases. First, it adds an is_refcounted_page output parameter so that
callers can tell whether or not a pfn has a struct page that needs to be
passed to put_page. Second, it adds a guarded_by_mmu_notifier parameter
that is used to avoid returning non-refcounted pages when the caller
cannot safely use them.

Since callers need to be updated on a case-by-case basis to pay
attention to is_refcounted_page, the new behavior of returning
non-refcounted pages is opt-in via the allow_non_refcounted_struct_page
parameter. Once all callers have been updated, this parameter should be
removed.

The fact that non-refcounted pfns can no longer be accessed without mmu
notifier protection is a breaking change. Since there is no timeline for
updating everything in KVM to use mmu notifiers, this change adds an
opt-in module parameter called allow_unsafe_mappings to allow such
mappings. Systems which trust userspace not to tear down such unsafe
mappings while KVM is using them can set this parameter to re-enable the
legacy behavior.

Signed-off-by: David Stevens
---
 include/linux/kvm_host.h | 21 ++++++++++
 virt/kvm/kvm_main.c      | 84 ++++++++++++++++++++++++----------------
 virt/kvm/pfncache.c      |  1 +
 3 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c2e0ddf14dba..2ed08ae1a9be 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1185,10 +1185,31 @@ struct kvm_follow_pfn {
 	bool atomic;
 	/* Try to create a writable mapping even for a read fault */
 	bool try_map_writable;
+	/* Usage of the returned pfn will be guarded by an mmu notifier. */
+	bool guarded_by_mmu_notifier;
+	/*
+	 * When false, do not return pfns for non-refcounted struct pages.
+	 *
+	 * TODO: This allows callers to use kvm_release_pfn on the pfns
+	 * returned by gfn_to_pfn without worrying about corrupting the
+	 * refcount of non-refcounted pages. Once all callers respect
+	 * is_refcounted_page, this flag should be removed.
+	 */
+	bool allow_non_refcounted_struct_page;
 
 	/* Outputs of __kvm_follow_pfn */
 	hva_t hva;
 	bool writable;
+	/*
+	 * True if the returned pfn is for a page with a valid refcount. False
+	 * if the returned pfn has no struct page or if the struct page is not
+	 * being refcounted (e.g. tail pages of non-compound higher order
+	 * allocations from IO/PFNMAP mappings).
+	 *
+	 * When this output flag is false, callers should not try to convert
+	 * the pfn to a struct page.
+ */ + bool is_refcounted_page; }; kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 9b33a59c6d65..235c5cb3fdac 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -96,6 +96,10 @@ unsigned int halt_poll_ns_shrink; module_param(halt_poll_ns_shrink, uint, 0644); EXPORT_SYMBOL_GPL(halt_poll_ns_shrink); +/* Allow non-struct page memory to be mapped without MMU notifier protection. */ +static bool allow_unsafe_mappings; +module_param(allow_unsafe_mappings, bool, 0444); + /* * Ordering of locks: * @@ -2507,6 +2511,15 @@ static inline int check_user_page_hwpoison(unsigned long addr) return rc == -EHWPOISON; } +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll, + struct page *page) +{ + kvm_pfn_t pfn = page_to_pfn(page); + + foll->is_refcounted_page = true; + return pfn; +} + /* * The fast path to get the writable pfn which will be stored in @pfn, * true indicates success, otherwise false is returned. It's also the @@ -2525,7 +2538,7 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) return false; if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) { - *pfn = page_to_pfn(page[0]); + *pfn = kvm_follow_refcounted_pfn(foll, page[0]); foll->writable = true; return true; } @@ -2561,7 +2574,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) page = wpage; } } - *pfn = page_to_pfn(page); + *pfn = kvm_follow_refcounted_pfn(foll, page); return npages; } @@ -2576,16 +2589,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault) return true; } -static int kvm_try_get_pfn(kvm_pfn_t pfn) -{ - struct page *page = kvm_pfn_to_refcounted_page(pfn); - - if (!page) - return 1; - - return get_page_unless_zero(page); -} - static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn) { @@ -2594,6 +2597,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, pte_t pte; spinlock_t *ptl; bool write_fault = foll->flags & FOLL_WRITE; + struct page *page; int r; r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl); @@ -2618,37 +2622,39 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, pte = ptep_get(ptep); + foll->writable = pte_write(pte); + pfn = pte_pfn(pte); + + page = kvm_pfn_to_refcounted_page(pfn); + if (write_fault && !pte_write(pte)) { pfn = KVM_PFN_ERR_RO_FAULT; goto out; } - foll->writable = pte_write(pte); - pfn = pte_pfn(pte); + if (!page) + goto out; - /* - * Get a reference here because callers of *hva_to_pfn* and - * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the - * returned pfn. This is only needed if the VMA has VM_MIXEDMAP - * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will - * simply do nothing for reserved pfns. - * - * Whoever called remap_pfn_range is also going to call e.g. - * unmap_mapping_range before the underlying pages are freed, - * causing a call to our MMU notifier. - * - * Certain IO or PFNMAP mappings can be backed with valid - * struct pages, but be allocated without refcounting e.g., - * tail pages of non-compound higher order allocations, which - * would then underflow the refcount when the caller does the - * required put_page. Don't allow those pages here. 
- */ - if (!kvm_try_get_pfn(pfn)) - r = -EFAULT; + if (get_page_unless_zero(page)) + WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn); out: pte_unmap_unlock(ptep, ptl); - *p_pfn = pfn; + + /* + * TODO: Remove the first branch once all callers have been + * taught to play nice with non-refcounted struct pages. + */ + if (page && !foll->is_refcounted_page && + !foll->allow_non_refcounted_struct_page) { + r = -EFAULT; + } else if (!foll->is_refcounted_page && + !foll->guarded_by_mmu_notifier && + !allow_unsafe_mappings) { + r = -EFAULT; + } else { + *p_pfn = pfn; + } return r; } @@ -2722,6 +2728,8 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll) kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll) { foll->writable = false; + foll->is_refcounted_page = false; + foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL, foll->flags & FOLL_WRITE); @@ -2749,6 +2757,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn, .flags = 0, .atomic = atomic, .try_map_writable = !!writable, + .allow_non_refcounted_struct_page = false, }; if (write_fault) @@ -2780,6 +2789,7 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, .gfn = gfn, .flags = write_fault ? FOLL_WRITE : 0, .try_map_writable = !!writable, + .allow_non_refcounted_struct_page = false, }; pfn = __kvm_follow_pfn(&foll); if (writable) @@ -2794,6 +2804,7 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn) .slot = slot, .gfn = gfn, .flags = FOLL_WRITE, + .allow_non_refcounted_struct_page = false, }; return __kvm_follow_pfn(&foll); } @@ -2806,6 +2817,11 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf .gfn = gfn, .flags = FOLL_WRITE, .atomic = true, + /* + * Setting atomic means __kvm_follow_pfn will never make it + * to hva_to_pfn_remapped, so this is vacuously true. 
+	 */
+	.allow_non_refcounted_struct_page = true,
 	};
 	return __kvm_follow_pfn(&foll);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 86cd40acad11..6bbf972c11f8 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -149,6 +149,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 		.gfn = gpa_to_gfn(gpc->gpa),
 		.flags = FOLL_WRITE,
 		.hva = gpc->uhva,
+		.allow_non_refcounted_struct_page = false,
 	};
 
 	lockdep_assert_held(&gpc->refresh_lock);

From patchwork Mon Sep 11 02:16:34 2023
From: David Stevens
To: Sean Christopherson
Cc: Yu Zhang , Isaku Yamahata , Zhi Wang , kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, David Stevens Subject: [PATCH v9 4/6] KVM: Migrate kvm_vcpu_map to __kvm_follow_pfn Date: Mon, 11 Sep 2023 11:16:34 +0900 Message-ID: <20230911021637.1941096-5-stevensd@google.com> X-Mailer: git-send-email 2.42.0.283.g2d96d420d3-goog In-Reply-To: <20230911021637.1941096-1-stevensd@google.com> References: <20230911021637.1941096-1-stevensd@google.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: David Stevens Migrate kvm_vcpu_map to __kvm_follow_pfn. Track is_refcounted_page so that kvm_vcpu_unmap know whether or not it needs to release the page. Signed-off-by: David Stevens Reviewed-by: Maxim Levitsky --- include/linux/kvm_host.h | 2 +- virt/kvm/kvm_main.c | 24 ++++++++++++++---------- 2 files changed, 15 insertions(+), 11 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 2ed08ae1a9be..b95c79b7833b 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -294,6 +294,7 @@ struct kvm_host_map { void *hva; kvm_pfn_t pfn; kvm_pfn_t gfn; + bool is_refcounted_page; }; /* @@ -1228,7 +1229,6 @@ void kvm_release_pfn_dirty(kvm_pfn_t pfn); void kvm_set_pfn_dirty(kvm_pfn_t pfn); void kvm_set_pfn_accessed(kvm_pfn_t pfn); -void kvm_release_pfn(kvm_pfn_t pfn, bool dirty); int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, int len); int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 235c5cb3fdac..913de4e86d9d 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2886,24 +2886,22 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn) } EXPORT_SYMBOL_GPL(gfn_to_page); -void kvm_release_pfn(kvm_pfn_t pfn, bool dirty) -{ - if (dirty) - kvm_release_pfn_dirty(pfn); - else - kvm_release_pfn_clean(pfn); -} - int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map) { kvm_pfn_t pfn; void *hva = NULL; struct page *page = KVM_UNMAPPED_PAGE; + struct kvm_follow_pfn foll = { + .slot = gfn_to_memslot(vcpu->kvm, gfn), + .gfn = gfn, + .flags = FOLL_WRITE, + .allow_non_refcounted_struct_page = true, + }; if (!map) return -EINVAL; - pfn = gfn_to_pfn(vcpu->kvm, gfn); + pfn = __kvm_follow_pfn(&foll); if (is_error_noslot_pfn(pfn)) return -EINVAL; @@ -2923,6 +2921,7 @@ int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map) map->hva = hva; map->pfn = pfn; map->gfn = gfn; + map->is_refcounted_page = foll.is_refcounted_page; return 0; } @@ -2946,7 +2945,12 @@ void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty) if (dirty) kvm_vcpu_mark_page_dirty(vcpu, map->gfn); - kvm_release_pfn(map->pfn, dirty); + if (map->is_refcounted_page) { + if (dirty) + kvm_release_page_dirty(map->page); + else + kvm_release_page_clean(map->page); + } map->hva = NULL; map->page = NULL; From patchwork Mon Sep 11 02:16:35 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Stevens X-Patchwork-Id: 13378587 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 78CF7C71153 for ; Mon, 11 Sep 2023 02:17:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232813AbjIKCSA (ORCPT ); Sun, 
From: David Stevens
To: Sean Christopherson
Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm@lists.linux.dev,
    linux-kernel@vger.kernel.org, kvm@vger.kernel.org, David Stevens
Subject: [PATCH v9 5/6] KVM: x86: Migrate to __kvm_follow_pfn
Date: Mon, 11 Sep 2023 11:16:35 +0900
Message-ID: <20230911021637.1941096-6-stevensd@google.com>
In-Reply-To: <20230911021637.1941096-1-stevensd@google.com>
References: <20230911021637.1941096-1-stevensd@google.com>

From: David Stevens

Migrate functions which need access to is_refcounted_page to
__kvm_follow_pfn. The functions which need this are __kvm_faultin_pfn
and reexecute_instruction. The former requires replacing the async
in/out parameter with the FOLL_NOWAIT flag and the KVM_PFN_ERR_NEEDS_IO
return value. Handling non-refcounted pages is complicated, so it will
be done in a follow-up. The latter is a straightforward refactor.
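The control flow this patch gives __kvm_faultin_pfn, an opportunistic
non-blocking attempt that can report "I/O needed", followed by either an
async #PF hand-off or a blocking, interruptible retry, can be sketched
in isolation as below. All DEMO_*/demo_* names are invented stand-ins
for this sketch and deliberately simplified; they are not the KVM
implementation.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins for the flags and sentinel used by the series. */
#define DEMO_FOLL_WRITE         (1u << 0)
#define DEMO_FOLL_NOWAIT        (1u << 1)
#define DEMO_FOLL_INTERRUPTIBLE (1u << 2)
#define DEMO_PFN_ERR_NEEDS_IO   ((unsigned long)-4)

struct demo_follow_pfn {
	unsigned long gfn;
	unsigned int flags;
	bool writable;		/* output */
};

/* Fake resolver: if the page is not resident, a NOWAIT lookup bails out. */
static unsigned long demo_follow_pfn(struct demo_follow_pfn *foll, bool resident)
{
	if (!resident && (foll->flags & DEMO_FOLL_NOWAIT))
		return DEMO_PFN_ERR_NEEDS_IO;
	foll->writable = foll->flags & DEMO_FOLL_WRITE;
	return 0x1234 + foll->gfn;	/* made-up pfn */
}

/* Two-step fault flow: opportunistic NOWAIT attempt, then async #PF or blocking retry. */
static unsigned long demo_faultin(unsigned long gfn, bool can_do_async_pf)
{
	struct demo_follow_pfn foll = {
		.gfn = gfn,
		.flags = DEMO_FOLL_WRITE | DEMO_FOLL_NOWAIT,
	};
	/* First attempt: pretend the page is not resident, so NOWAIT fails. */
	unsigned long pfn = demo_follow_pfn(&foll, /*resident=*/false);

	if (pfn != DEMO_PFN_ERR_NEEDS_IO)
		return pfn;

	if (can_do_async_pf) {
		printf("gfn %#lx: queue async #PF, let the vCPU keep running\n", gfn);
		return DEMO_PFN_ERR_NEEDS_IO;
	}

	/* No async #PF available: retry synchronously, allowing sleep and signals. */
	foll.flags &= ~DEMO_FOLL_NOWAIT;
	foll.flags |= DEMO_FOLL_INTERRUPTIBLE;
	return demo_follow_pfn(&foll, /*resident=*/true);
}

int main(void)
{
	unsigned long pfn;

	pfn = demo_faultin(0x10, /*can_do_async_pf=*/false);
	printf("sync retry path -> pfn %#lx\n", pfn);

	pfn = demo_faultin(0x20, /*can_do_async_pf=*/true);
	if (pfn == DEMO_PFN_ERR_NEEDS_IO)
		printf("async path      -> fault completed later by async #PF\n");
	return 0;
}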
APIC related callers do not need to migrate because KVM controls the memslot, so it will always be regular memory. Prefetch related callers do not need to be migrated because atomic gfn_to_pfn calls can never make it to hva_to_pfn_remapped. Signed-off-by: David Stevens Reviewed-by: Maxim Levitsky --- arch/x86/kvm/mmu/mmu.c | 43 ++++++++++++++++++++++++++++++++---------- arch/x86/kvm/x86.c | 12 ++++++++++-- 2 files changed, 43 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index e1d011c67cc6..e1eca26215e2 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4254,7 +4254,14 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct kvm_memory_slot *slot = fault->slot; - bool async; + struct kvm_follow_pfn foll = { + .slot = slot, + .gfn = fault->gfn, + .flags = fault->write ? FOLL_WRITE : 0, + .try_map_writable = true, + .guarded_by_mmu_notifier = true, + .allow_non_refcounted_struct_page = false, + }; /* * Retry the page fault if the gfn hit a memslot that is being deleted @@ -4283,12 +4290,20 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault return RET_PF_EMULATE; } - async = false; - fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async, - fault->write, &fault->map_writable, - &fault->hva); - if (!async) - return RET_PF_CONTINUE; /* *pfn has correct page already */ + foll.flags |= FOLL_NOWAIT; + fault->pfn = __kvm_follow_pfn(&foll); + + if (!is_error_noslot_pfn(fault->pfn)) + goto success; + + /* + * If __kvm_follow_pfn() failed because I/O is needed to fault in the + * page, then either set up an asynchronous #PF to do the I/O, or if + * doing an async #PF isn't possible, retry __kvm_follow_pfn() with + * I/O allowed. All other failures are fatal, i.e. retrying won't help. + */ + if (fault->pfn != KVM_PFN_ERR_NEEDS_IO) + return RET_PF_CONTINUE; if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) { trace_kvm_try_async_get_page(fault->addr, fault->gfn); @@ -4306,9 +4321,17 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault * to wait for IO. Note, gup always bails if it is unable to quickly * get a page and a fatal signal, i.e. SIGKILL, is pending. */ - fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL, - fault->write, &fault->map_writable, - &fault->hva); + foll.flags |= FOLL_INTERRUPTIBLE; + foll.flags &= ~FOLL_NOWAIT; + fault->pfn = __kvm_follow_pfn(&foll); + + if (!is_error_noslot_pfn(fault->pfn)) + goto success; + + return RET_PF_CONTINUE; +success: + fault->hva = foll.hva; + fault->map_writable = foll.writable; return RET_PF_CONTINUE; } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6c9c81e82e65..2011a7e47296 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8556,6 +8556,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, { gpa_t gpa = cr2_or_gpa; kvm_pfn_t pfn; + struct kvm_follow_pfn foll; if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF)) return false; @@ -8585,7 +8586,13 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, * retry instruction -> write #PF -> emulation fail -> retry * instruction -> ... 
*/ - pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa)); + foll = (struct kvm_follow_pfn) { + .slot = gfn_to_memslot(vcpu->kvm, gpa_to_gfn(gpa)), + .gfn = gpa_to_gfn(gpa), + .flags = FOLL_WRITE, + .allow_non_refcounted_struct_page = true, + }; + pfn = __kvm_follow_pfn(&foll); /* * If the instruction failed on the error pfn, it can not be fixed, @@ -8594,7 +8601,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, if (is_error_noslot_pfn(pfn)) return false; - kvm_release_pfn_clean(pfn); + if (foll.is_refcounted_page) + kvm_release_page_clean(pfn_to_page(pfn)); /* The instructions are well-emulated on direct mmu. */ if (vcpu->arch.mmu->root_role.direct) { From patchwork Mon Sep 11 02:16:36 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Stevens X-Patchwork-Id: 13378588 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9C266EE49A4 for ; Mon, 11 Sep 2023 02:18:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232904AbjIKCSe (ORCPT ); Sun, 10 Sep 2023 22:18:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46550 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232645AbjIKCSc (ORCPT ); Sun, 10 Sep 2023 22:18:32 -0400 Received: from mail-pf1-x430.google.com (mail-pf1-x430.google.com [IPv6:2607:f8b0:4864:20::430]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9E434E40 for ; Sun, 10 Sep 2023 19:17:46 -0700 (PDT) Received: by mail-pf1-x430.google.com with SMTP id d2e1a72fcca58-68fb7fb537dso558663b3a.2 for ; Sun, 10 Sep 2023 19:17:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; t=1694398631; x=1695003431; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=91mLy+G3O/xIpYQLZgTXP1iR98nXkRw1gYxVierAmLM=; b=BdvIMiTzDzYzyJroOoBF07UgFHQQAzzho3wIGtRbxNIw9ELXFKX66rc8aMI7Ki8dxl x/XrOAo6XWDKksTOeXN9LR6eZrkqpAy+aggAVP4Ru3IHK0rHIIlyalUhrvnrJ1+G4Xia 4XIWitulOhyNwIg3oN3Ec+3AyIdkLOwEauKTw= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1694398631; x=1695003431; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=91mLy+G3O/xIpYQLZgTXP1iR98nXkRw1gYxVierAmLM=; b=dSFbmCNavhqEWMjJXPI/Ws70mcCjT6DSdLLpmuHzRORIzEgkl9az7PoB+J6i9w7lbU DXzs6nvmgX6/0/H33Ghay4o8X7KBbo1M8h4ZfgHIA9bswHdGK8yME5ti8wCsCoDgZnLf 9jEwQwcWU/Pft9A7gneMo8CA1Qx/XBdq2XwCczhHkSxDKi2948kvzYqIJrltkigpioIV yLQcKcMI2DNhkD6qYG7ne+J6Qw77POCl0/aDJgsLjuDOdgqOkoSmha0W6J0+yVEwzoV1 x/ahsPBNZbUJcTiWasT5cuRipmogvZG0sdNbGJLmVGu75ty443fo5+tonlNQLAkI1Lh9 /KrA== X-Gm-Message-State: AOJu0YynsefhGAzRZC8Gj2IRP0WXmJefxKL01SBZ5szcliJ9nHmeIqh6 SWGyZBufV8ly5VWY+yqw7UYd+jbtQeyqNNInQCM= X-Google-Smtp-Source: AGHT+IEF78He+7+PkQmYAuwbgimpofpg6YfXs4vDl0ZW8a5/SBfT4Pu8o+FolgBcnBH93tJqiem73Q== X-Received: by 2002:a05:6a00:1989:b0:68c:705d:78b3 with SMTP id d9-20020a056a00198900b0068c705d78b3mr7505251pfl.28.1694398631461; Sun, 10 Sep 2023 19:17:11 -0700 (PDT) Received: from localhost ([2401:fa00:8f:203:282a:59c8:cc3a:2d6]) by smtp.gmail.com with UTF8SMTPSA id 
m2-20020aa79002000000b0068702b66ab1sm4596005pfo.174.2023.09.10.19.17.09 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 10 Sep 2023 19:17:11 -0700 (PDT) From: David Stevens X-Google-Original-From: David Stevens To: Sean Christopherson Cc: Yu Zhang , Isaku Yamahata , Zhi Wang , kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, David Stevens Subject: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages Date: Mon, 11 Sep 2023 11:16:36 +0900 Message-ID: <20230911021637.1941096-7-stevensd@google.com> X-Mailer: git-send-email 2.42.0.283.g2d96d420d3-goog In-Reply-To: <20230911021637.1941096-1-stevensd@google.com> References: <20230911021637.1941096-1-stevensd@google.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: David Stevens Handle non-refcounted pages in __kvm_faultin_pfn. This allows the host to map memory into the guest that is backed by non-refcounted struct pages - for example, the tail pages of higher order non-compound pages allocated by the amdgpu driver via ttm_pool_alloc_page. The bulk of this change is tracking the is_refcounted_page flag so that non-refcounted pages don't trigger page_count() == 0 warnings. This is done by storing the flag in an unused bit in the sptes. There are no bits available in PAE SPTEs, so non-refcounted pages can only be handled on TDP and x86-64. Signed-off-by: David Stevens --- arch/x86/kvm/mmu/mmu.c | 52 +++++++++++++++++++++++---------- arch/x86/kvm/mmu/mmu_internal.h | 1 + arch/x86/kvm/mmu/paging_tmpl.h | 8 +++-- arch/x86/kvm/mmu/spte.c | 4 ++- arch/x86/kvm/mmu/spte.h | 12 +++++++- arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++------ include/linux/kvm_host.h | 3 ++ virt/kvm/kvm_main.c | 6 ++-- 8 files changed, 76 insertions(+), 32 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index e1eca26215e2..b8168cc4cc96 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -545,12 +545,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte) if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) { flush = true; - kvm_set_pfn_accessed(spte_to_pfn(old_spte)); + if (is_refcounted_page_pte(old_spte)) + kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte))); } if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) { flush = true; - kvm_set_pfn_dirty(spte_to_pfn(old_spte)); + if (is_refcounted_page_pte(old_spte)) + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte))); } return flush; @@ -588,14 +590,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep) * before they are reclaimed. Sanity check that, if the pfn is backed * by a refcounted page, the refcount is elevated. */ - page = kvm_pfn_to_refcounted_page(pfn); - WARN_ON_ONCE(page && !page_count(page)); + if (is_refcounted_page_pte(old_spte)) { + page = kvm_pfn_to_refcounted_page(pfn); + WARN_ON_ONCE(!page || !page_count(page)); + } - if (is_accessed_spte(old_spte)) - kvm_set_pfn_accessed(pfn); + if (is_refcounted_page_pte(old_spte)) { + if (is_accessed_spte(old_spte)) + kvm_set_page_accessed(pfn_to_page(pfn)); - if (is_dirty_spte(old_spte)) - kvm_set_pfn_dirty(pfn); + if (is_dirty_spte(old_spte)) + kvm_set_page_dirty(pfn_to_page(pfn)); + } return old_spte; } @@ -631,8 +637,8 @@ static bool mmu_spte_age(u64 *sptep) * Capture the dirty status of the page, so that it doesn't get * lost when the SPTE is marked for access tracking. 
	 */
-	if (is_writable_pte(spte))
-		kvm_set_pfn_dirty(spte_to_pfn(spte));
+	if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
+		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));
 
 	spte = mark_spte_for_access_track(spte);
 	mmu_spte_update_no_track(sptep, spte);
@@ -1261,8 +1267,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
 {
 	bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
 					       (unsigned long *)sptep);
-	if (was_writable && !spte_ad_enabled(*sptep))
-		kvm_set_pfn_dirty(spte_to_pfn(*sptep));
+	if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_pte(*sptep))
+		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));
 
 	return was_writable;
 }
@@ -2913,6 +2919,11 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	bool host_writable = !fault || fault->map_writable;
 	bool prefetch = !fault || fault->prefetch;
 	bool write_fault = fault && fault->write;
+	/*
+	 * Prefetching uses gfn_to_page_many_atomic, which never gets
+	 * non-refcounted pages.
+	 */
+	bool is_refcounted = !fault || fault->is_refcounted_page;
 
 	if (unlikely(is_noslot_pfn(pfn))) {
 		vcpu->stat.pf_mmio_spte_created++;
@@ -2940,7 +2951,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	}
 
 	wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
-			   true, host_writable, &spte);
+			   true, host_writable, is_refcounted, &spte);
 
 	if (*sptep == spte) {
 		ret = RET_PF_SPURIOUS;
@@ -4254,13 +4265,18 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
+	/*
+	 * There are no extra bits for tracking non-refcounted pages in
+	 * PAE SPTEs, so reject non-refcounted struct pages in that case.
+	 */
+	bool has_spte_refcount_bit = tdp_enabled && IS_ENABLED(CONFIG_X86_64);
 	struct kvm_follow_pfn foll = {
 		.slot = slot,
 		.gfn = fault->gfn,
 		.flags = fault->write ? FOLL_WRITE : 0,
 		.try_map_writable = true,
 		.guarded_by_mmu_notifier = true,
-		.allow_non_refcounted_struct_page = false,
+		.allow_non_refcounted_struct_page = has_spte_refcount_bit,
 	};
 
 	/*
@@ -4277,6 +4293,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			fault->slot = NULL;
 			fault->pfn = KVM_PFN_NOSLOT;
 			fault->map_writable = false;
+			fault->is_refcounted_page = false;
 			return RET_PF_CONTINUE;
 		}
 		/*
@@ -4332,6 +4349,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 success:
 	fault->hva = foll.hva;
 	fault->map_writable = foll.writable;
+	fault->is_refcounted_page = foll.is_refcounted_page;
 	return RET_PF_CONTINUE;
 }
 
@@ -4420,8 +4438,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	r = direct_map(vcpu, fault);
 
 out_unlock:
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 
@@ -4496,8 +4515,9 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 	r = kvm_tdp_mmu_map(vcpu, fault);
 
 out_unlock:
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	read_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 #endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b102014e2c60..7f73bc2a552e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -239,6 +239,7 @@ struct kvm_page_fault {
 	kvm_pfn_t pfn;
 	hva_t hva;
 	bool map_writable;
+	bool is_refcounted_page;
 
 	/*
 	 * Indicates the guest is trying to write a gfn that contains one or
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c85255073f67..0ac4a4e5870c 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -848,7 +848,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	return r;
 }
 
@@ -902,7 +903,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
  */
 static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
 {
-	bool host_writable;
+	bool host_writable, is_refcounted;
 	gpa_t first_pte_gpa;
 	u64 *sptep, spte;
 	struct kvm_memory_slot *slot;
@@ -959,10 +960,11 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
 	sptep = &sp->spt[i];
 	spte = *sptep;
 	host_writable = spte & shadow_host_writable_mask;
+	is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 	make_spte(vcpu, sp, slot, pte_access, gfn,
 		  spte_to_pfn(spte), spte, true, false,
-		  host_writable, &spte);
+		  host_writable, is_refcounted, &spte);
 
 	return mmu_spte_update(sptep, spte);
 }
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..ce495819061f 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
-	       bool host_writable, u64 *new_spte)
+	       bool host_writable, bool is_refcounted, u64 *new_spte)
 {
 	int level = sp->role.level;
 	u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 
 	if (level > PG_LEVEL_4K)
 		spte |= PT_PAGE_SIZE_MASK;
+	if (is_refcounted)
+		spte |= SPTE_MMU_PAGE_REFCOUNTED;
 
 	if (shadow_memtype_mask)
 		spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index a129951c9a88..4bf4a535c23d 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -96,6 +96,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
 /* Defined only to keep the above static asserts readable. */
 #undef SHADOW_ACC_TRACK_SAVED_MASK
 
+/*
+ * Indicates that the SPTE refers to a page with a valid refcount.
+ */
+#define SPTE_MMU_PAGE_REFCOUNTED BIT_ULL(59)
+
 /*
  * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
  * the memslots generation and is derived as follows:
@@ -345,6 +350,11 @@ static inline bool is_dirty_spte(u64 spte)
 	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
 }
 
+static inline bool is_refcounted_page_pte(u64 spte)
+{
+	return spte & SPTE_MMU_PAGE_REFCOUNTED;
+}
+
 static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
 				int level)
 {
@@ -475,7 +485,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
-	       bool host_writable, u64 *new_spte);
+	       bool host_writable, bool is_refcounted, u64 *new_spte);
 u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
 			      union kvm_mmu_page_role role, int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6c63f2d1675f..185f3c666c2b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -474,6 +474,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	bool is_refcounted = is_refcounted_page_pte(old_spte);
 
 	WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
 	WARN_ON_ONCE(level < PG_LEVEL_4K);
@@ -538,9 +539,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	if (is_leaf != was_leaf)
 		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
 
-	if (was_leaf && is_dirty_spte(old_spte) &&
+	if (was_leaf && is_dirty_spte(old_spte) && is_refcounted &&
 	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
 
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
@@ -552,9 +553,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
 
-	if (was_leaf && is_accessed_spte(old_spte) &&
+	if (was_leaf && is_accessed_spte(old_spte) && is_refcounted &&
 	    (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+		kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
 }
 
 /*
@@ -988,8 +989,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
 	else
 		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
-					 fault->pfn, iter->old_spte, fault->prefetch, true,
-					 fault->map_writable, &new_spte);
+					 fault->pfn, iter->old_spte, fault->prefetch, true,
+					 fault->map_writable, fault->is_refcounted_page,
+					 &new_spte);
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
@@ -1205,8 +1207,9 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 		 * Capture the dirty status of the page, so that it doesn't get
 		 * lost when the SPTE is marked for access tracking.
 		 */
-		if (is_writable_pte(iter->old_spte))
-			kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
+		if (is_writable_pte(iter->old_spte) &&
+		    is_refcounted_page_pte(iter->old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter->old_spte)));
 
 		new_spte = mark_spte_for_access_track(iter->old_spte);
 		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
@@ -1628,7 +1631,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 		trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
 					       iter.old_spte,
 					       iter.old_spte & ~dbit);
-		kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
+		if (is_refcounted_page_pte(iter.old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter.old_spte)));
 	}
 
 	rcu_read_unlock();
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b95c79b7833b..6696925f01f1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1179,6 +1179,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 
+void kvm_set_page_accessed(struct page *page);
+void kvm_set_page_dirty(struct page *page);
+
 struct kvm_follow_pfn {
 	const struct kvm_memory_slot *slot;
 	gfn_t gfn;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 913de4e86d9d..4d8538cdb690 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2979,17 +2979,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
 	return !PageReserved(page);
 }
 
-static void kvm_set_page_dirty(struct page *page)
+void kvm_set_page_dirty(struct page *page)
 {
 	if (kvm_is_ad_tracked_page(page))
 		SetPageDirty(page);
 }
+EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
 
-static void kvm_set_page_accessed(struct page *page)
+void kvm_set_page_accessed(struct page *page)
 {
 	if (kvm_is_ad_tracked_page(page))
 		mark_page_accessed(page);
 }
+EXPORT_SYMBOL_GPL(kvm_set_page_accessed);
 
 void kvm_release_page_clean(struct page *page)
 {