
[RFC,0/11] Rework gfn_to_pfn_cache

Message ID 95fae9cf56b1a7f0a5f2b9a1934e29e924908ff2.camel@infradead.org

Message

David Woodhouse Nov. 15, 2021, 4:47 p.m. UTC
On Fri, 2021-11-12 at 15:56 +0100, Paolo Bonzini wrote:
> > I think we want to kill the struct kvm_host_map completely, merge its
> > extra 'hva' and 'page' fields into the (possibly renamed)
> > gfn_to_pfn_cache along with your 'guest_uses_pa' flag, and take it from
> > there.
> 
> Yes, that makes sense.

OK... here's what I have so far.

I ended up killing the gfn_to_pfn_cache completely in a preliminary
patch, just so I could cleanly untangle it from the implementation of
kvm_vcpu_map()/kvm_vcpu_unmap(). Those want to die too, but they can
die *after* we have converted all their users to something else.

I then reinstate a newly-invented gfn_to_pfn_cache in another commit,
working only for the rwlock case so far because I have questions... :)

So the idea that we discussed is that the user of the gfn_to_pfn_cache
may set the guest_uses_pa flag to indicate that the cached PFN is used
by (e.g.) the VMCS02.

So... a user of this must check the validity after setting its mode to
IN_GUEST_MODE, and the invalidation must make a request and wake any
vCPU(s) which might be using it. As an optimisation, since these are
likely to be single-vCPU only, I added a 'vcpu' field to the cache.
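
To make that concrete, the consumer side would look vaguely like this
(illustrative sketch only; the gpa/PAGE_SIZE arguments and the retry
path are placeholders, and kvm_gfn_to_pfn_cache_check() is the helper
from the later patch):

	vcpu->mode = IN_GUEST_MODE;
	smp_mb();	/* pairs with the invalidator's request + kick */

	read_lock(&gpc->lock);
	valid = kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpa, PAGE_SIZE);
	read_unlock(&gpc->lock);

	if (!valid) {
		vcpu->mode = OUTSIDE_GUEST_MODE;
		/* refresh the cache (may sleep), then retry the entry */
	}
	/* else enter the guest; an invalidation raises the req and kicks us */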

The wakeup (in #if 0 so far) looks a bit like this...

	unsigned int req = KVM_REQ_GPC_INVALIDATE;

	/*
	 * If the OOM reaper is active, then all vCPUs should have been stopped
	 * already, so perform the request without KVM_REQUEST_WAIT and be sad
	 * if anything needed to be woken.
	 */
	if (!may_block)
		req &= ~KVM_REQUEST_WAIT;

	if (wake_all_vcpus) {
		called = kvm_make_all_cpus_request(kvm, req);
	} else if (wake_vcpus) {
		called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);
	}
	WARN_ON_ONCE(called && !may_block);

I moved the invalidation to the invalidate_range MMU notifier, as
discussed. But that's where the plan falls down a little bit because
IIUC, that one can't sleep at all. I need to move it *back* to
invalidate_range_start() where I had it before, if I want to let it
wait for vCPUs to exit. Which means... that the cache 'refresh' call
must wait until the mmu_notifier_count reaches zero? Am I allowed to do
that, and make the "There can be only one waiter" comment in
kvm_mmu_notifier_invalidate_range_end() no longer true? Or is there a
better way?

I was also pondering whether to introduce a new arch-independent
KVM_REQ_GPC_INVALIDATE, or let it be arch-dependent and make it a field
of the cache, so that users can raise whatever requests they like?

Anyway, this much works for Xen event channel delivery (again) and
looks like it's *most* of the way there for fixing the nested stuff.

The first four or maybe even eight (modulo testing) patches are
probably ready to be merged anyway. The "Maintain valid mapping of Xen
shared_info page" patch is utterly trivial now and eventually I'll
probably fold it into the subsequent patch, but it's left separate for
illustration, for now.

David Woodhouse (11):
      KVM: x86: Fix steal time asm constraints in 32-bit mode
      KVM: x86/xen: Fix get_attr of KVM_XEN_ATTR_TYPE_SHARED_INFO
      KVM: selftests: Add event channel upcall support to xen_shinfo_test
      KVM: x86/xen: Use sizeof_field() instead of open-coding it
      KVM: nVMX: Use kvm_{read,write}_guest_cached() for shadow_vmcs12
      KVM: nVMX: Use kvm_read_guest_offset_cached() for nested VMCS check
      KVM: nVMX: Use a gfn_to_hva_cache for vmptrld
      KVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cache
      KVM: Reinstate gfn_to_pfn_cache with invalidation support
      KVM: x86/xen: Maintain valid mapping of Xen shared_info page
      KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery

 Documentation/virt/kvm/api.rst                       |  21 ++++++
 arch/x86/include/asm/kvm_host.h                      |   3 +-
 arch/x86/kvm/irq_comm.c                              |  12 ++++
 arch/x86/kvm/vmx/nested.c                            |  76 +++++++++++++---------
 arch/x86/kvm/vmx/vmx.h                               |  10 +++
 arch/x86/kvm/x86.c                                   |   5 +-
 arch/x86/kvm/xen.c                                   | 305 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 arch/x86/kvm/xen.h                                   |   9 +++
 include/linux/kvm_host.h                             |  27 ++++++--
 include/linux/kvm_types.h                            |  13 +++-
 include/uapi/linux/kvm.h                             |  11 ++++
 tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c | 187 +++++++++++++++++++++++++++++++++++++++++++++++++---
 virt/kvm/kvm_main.c                                  | 340 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
 13 files changed, 862 insertions(+), 157 deletions(-)

Comments

Paolo Bonzini Nov. 15, 2021, 6:50 p.m. UTC | #1
On 11/15/21 17:47, David Woodhouse wrote:
> So... a user of this must check the validity after setting its mode to
> IN_GUEST_MODE, and the invalidation must make a request and wake any
> vCPU(s) which might be using it.

Yes, though the check is implicit in the existing call to 
kvm_vcpu_exit_request(vcpu).

> I moved the invalidation to the invalidate_range MMU notifier, as
> discussed. But that's where the plan falls down a little bit because
> IIUC, that one can't sleep at all.

Which is a problem in the existing code, too.  It hasn't broken yet 
because invalidate_range() is _usually_ called with no spinlocks taken 
(the only caller that does call with a spinlock taken seems to be 
hugetlb_cow).

Once the dust settles, we need to add non_block_start/end around calls 
to ops->invalidate_range.

> I need to move it *back*  to
> invalidate_range_start() where I had it before, if I want to let it
> wait for vCPUs to exit. Which means... that the cache 'refresh' call
> must wait until the mmu_notifier_count reaches zero? Am I allowed to do
> that, and make the "There can be only one waiter" comment in
> kvm_mmu_notifier_invalidate_range_end() no longer true?

You can also update the cache while taking the mmu_lock for read, and 
retry if mmu_notifier_retry_hva tells you to do so.  Looking at the 
scenario from commit e649b3f0188 you would have:

       (Invalidator) kvm_mmu_notifier_invalidate_range_start()
       (Invalidator) write_lock(mmu_lock)
       (Invalidator) increment mmu_notifier_count
       (Invalidator) write_unlock(mmu_lock)
       (Invalidator) request KVM_REQ_APIC_PAGE_RELOAD
       (KVM VCPU) vcpu_enter_guest()
       (KVM VCPU) kvm_vcpu_reload_apic_access_page()
    +  (KVM VCPU) read_lock(mmu_lock)
    +  (KVM VCPU) mmu_notifier_retry_hva()
    +  (KVM VCPU) read_unlock(mmu_lock)
    +  (KVM VCPU) retry! (mmu_notifier_count>1)
       (Invalidator) actually unmap page
    +  (Invalidator) kvm_mmu_notifier_invalidate_range_end()
    +  (Invalidator) write_lock(mmu_lock)
    +  (Invalidator) decrement mmu_notifier_count
    +  (Invalidator) write_unlock(mmu_lock)
    +  (KVM VCPU) vcpu_enter_guest()
    +  (KVM VCPU) kvm_vcpu_reload_apic_access_page()
    +  (KVM VCPU) mmu_notifier_retry_hva()

Changing mn_memslots_update_rcuwait to a waitq (and renaming it to 
mn_invalidate_waitq) is of course also a possibility.
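
For illustration, a minimal sketch of that variant (mn_invalidate_waitq
is just the hypothetical name above; mn_active_invalidate_count already
exists, and the locking around it is elided):

	/* kvm_mmu_notifier_invalidate_range_end(): */
	if (!--kvm->mn_active_invalidate_count)
		wake_up_all(&kvm->mn_invalidate_waitq);

	/* and whoever needs to wait for invalidations to finish: */
	wait_event(kvm->mn_invalidate_waitq,
		   !READ_ONCE(kvm->mn_active_invalidate_count));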

Also, for the small requests: since you are at it, can you add the code 
in a new file under virt/kvm/?

Paolo

> I was also pondering whether to introduce a new arch-independent
> KVM_REQ_GPC_INVALIDATE, or let it be arch-dependent and make it a field
> of the cache, so that users can raise whatever requests they like?
David Woodhouse Nov. 15, 2021, 7:11 p.m. UTC | #2
On Mon, 2021-11-15 at 19:50 +0100, Paolo Bonzini wrote:
> On 11/15/21 17:47, David Woodhouse wrote:
> > So... a user of this must check the validity after setting its mode to
> > IN_GUEST_MODE, and the invalidation must make a request and wake any
> > vCPU(s) which might be using it.
> 
> Yes, though the check is implicit in the existing call to 
> kvm_vcpu_exit_request(vcpu).

Right, though *handling* the request isn't (and I'm still not sure
whether to use a single new KVM_REQ_GPC_INVALIDATE or let the user of
the cache specify the req to use).

I don't really want generic code refreshing these caches even when they
aren't going to be used (e.g. vmcs02 for a vCPU that isn't even in L2
guest mode right now).

> > I moved the invalidation to the invalidate_range MMU notifier, as
> > discussed. But that's where the plan falls down a little bit because
> > IIUC, that one can't sleep at all.
> 
> Which is a problem in the existing code, too.  It hasn't broken yet 
> because invalidate_range() is _usually_ called with no spinlocks taken 
> (the only caller that does call with a spinlock taken seems to be 
> hugetlb_cow).
> 
> Once the dust settles, we need to add non_block_start/end around calls 
> to ops->invalidate_range.
> 
> > I need to move it *back*  to
> > invalidate_range_start() where I had it before, if I want to let it
> > wait for vCPUs to exit. Which means... that the cache 'refresh' call
> > must wait until the mmu_notifier_count reaches zero? Am I allowed to do
> > that, and make the "There can be only one waiter" comment in
> > kvm_mmu_notifier_invalidate_range_end() no longer true?
> 
> You can also update the cache while taking the mmu_lock for read, and 
> retry if mmu_notifier_retry_hva tells you to do so.  Looking at the 
> scenario from commit e649b3f0188 you would have:
> 
>        (Invalidator) kvm_mmu_notifier_invalidate_range_start()
>        (Invalidator) write_lock(mmu_lock)
>        (Invalidator) increment mmu_notifier_count
>        (Invalidator) write_unlock(mmu_lock)
>        (Invalidator) request KVM_REQ_APIC_PAGE_RELOAD
>        (KVM VCPU) vcpu_enter_guest()
>        (KVM VCPU) kvm_vcpu_reload_apic_access_page()
>     +  (KVM VCPU) read_lock(mmu_lock)
>     +  (KVM VCPU) mmu_notifier_retry_hva()
>     +  (KVM VCPU) read_unlock(mmu_lock)
>     +  (KVM VCPU) retry! (mmu_notifier_count>1)


But unless we do start using a waitq, it can just spin and spin and
spin here can't it? 

    +  (KVM VCPU) read_lock(mmu_lock)
    +  (KVM VCPU) mmu_notifier_retry_hva()
    +  (KVM VCPU) read_unlock(mmu_lock)
    +  (KVM VCPU) retry! (mmu_notifier_count>1)

    +  (KVM VCPU) read_lock(mmu_lock)
    +  (KVM VCPU) mmu_notifier_retry_hva()
    +  (KVM VCPU) read_unlock(mmu_lock)
    +  (KVM VCPU) retry! (mmu_notifier_count>1)

    +  (KVM VCPU) read_lock(mmu_lock)
    +  (KVM VCPU) mmu_notifier_retry_hva()
    +  (KVM VCPU) read_unlock(mmu_lock)
    +  (KVM VCPU) retry! (mmu_notifier_count>1)

>        (Invalidator) actually unmap page

    +  (KVM VCPU) read_lock(mmu_lock)
    +  (KVM VCPU) mmu_notifier_retry_hva()
    +  (KVM VCPU) read_unlock(mmu_lock)
    +  (KVM VCPU) retry! (mmu_notifier_count>1)

    +  (KVM VCPU) read_lock(mmu_lock)
    +  (KVM VCPU) mmu_notifier_retry_hva()
    +  (KVM VCPU) read_unlock(mmu_lock)
    +  (KVM VCPU) retry! (mmu_notifier_count>1)

>     +  (Invalidator) kvm_mmu_notifier_invalidate_range_end()
>     +  (Invalidator) write_lock(mmu_lock)
>     +  (Invalidator) decrement mmu_notifier_count
>     +  (Invalidator) write_unlock(mmu_lock)
>     +  (KVM VCPU) vcpu_enter_guest()
>     +  (KVM VCPU) kvm_vcpu_reload_apic_access_page()
>     +  (KVM VCPU) mmu_notifier_retry_hva()
> 
> Changing mn_memslots_update_rcuwait to a waitq (and renaming it to 
> mn_invalidate_waitq) is of course also a possibility.

I suspect that's the answer.

I think the actual *invalidation* of the cache still lives in the
invalidate_range() callback where I have it at the moment. But making
the req to the affected vCPUs can live in invalidate_range_start(). And
then the code which *handles* that req can wait for the
mmu_notifier_count to reach zero before it proceeds. Atomic users of
the cache (like the Xen event channel code) don't have to get involved
with that.
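
Roughly this, on the vCPU side (sketch only; assumes the waitq variant
discussed above, with mn_active_invalidate_count as the thing we wait
on):

	if (kvm_check_request(KVM_REQ_GPC_INVALIDATE, vcpu)) {
		/* Wait for the in-flight invalidation(s) to complete... */
		wait_event(vcpu->kvm->mn_invalidate_waitq,
			   !READ_ONCE(vcpu->kvm->mn_active_invalidate_count));
		/* ...then refresh whichever caches this vCPU actually needs. */
	}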

> Also, for the small requests: since you are at it, can you add the code 
> in a new file under virt/kvm/?

Hm... only if I can make hva_to_pfn() and probably a handful of other
things non-static?
Paolo Bonzini Nov. 15, 2021, 7:26 p.m. UTC | #3
On 11/15/21 20:11, David Woodhouse wrote:
>> Changing mn_memslots_update_rcuwait to a waitq (and renaming it to
>> mn_invalidate_waitq) is of course also a possibility.
> I suspect that's the answer.
> 
> I think the actual *invalidation* of the cache still lives in the
> invalidate_range() callback where I have it at the moment. But making
> the req to the affected vCPUs can live in invalidate_range_start(). And
> then the code which *handles* that req can wait for the
> mmu_notifier_count to reach zero before it proceeds. Atomic users of
> the cache (like the Xen event channel code) don't have to get involved
> with that.
> 
>> Also, for the small requests: since you are at it, can you add the code
>> in a new file under virt/kvm/?
>
> Hm... only if I can make hva_to_pfn() and probably a handful of other
> things non-static?

Yes, I think sooner or later we also want all pfn stuff in one file 
(together with MMU notifiers) and all hva stuff in another; so for now 
you can create virt/kvm/hva_to_pfn.h, or virt/kvm/mm.h, or whatever 
color of the bikeshed you prefer.

Paolo
David Woodhouse Nov. 15, 2021, 9:38 p.m. UTC | #4
On Mon, 2021-11-15 at 19:50 +0100, Paolo Bonzini wrote:
> On 11/15/21 17:47, David Woodhouse wrote:
> > I need to move it *back*  to
> > invalidate_range_start() where I had it before, if I want to let it
> > wait for vCPUs to exit. Which means... that the cache 'refresh' call
> > must wait until the mmu_notifier_count reaches zero? Am I allowed to do
> > that, and make the "There can be only one waiter" comment in
> > kvm_mmu_notifier_invalidate_range_end() no longer true?
> 
> You can also update the cache while taking the mmu_lock for read, and 
> retry if mmu_notifier_retry_hva tells you to do so.  Looking at the 
> scenario from commit e649b3f0188 you would have:
> 
>        (Invalidator) kvm_mmu_notifier_invalidate_range_start()
>        (Invalidator) write_lock(mmu_lock)
>        (Invalidator) increment mmu_notifier_count
>        (Invalidator) write_unlock(mmu_lock)
>        (Invalidator) request KVM_REQ_APIC_PAGE_RELOAD
>        (KVM VCPU) vcpu_enter_guest()
>        (KVM VCPU) kvm_vcpu_reload_apic_access_page()
>     +  (KVM VCPU) read_lock(mmu_lock)
>     +  (KVM VCPU) mmu_notifier_retry_hva()
>     +  (KVM VCPU) read_unlock(mmu_lock)
>     +  (KVM VCPU) retry! (mmu_notifier_count>1)
>        (Invalidator) actually unmap page
>     +  (Invalidator) kvm_mmu_notifier_invalidate_range_end()
>     +  (Invalidator) write_lock(mmu_lock)
>     +  (Invalidator) decrement mmu_notifier_count
>     +  (Invalidator) write_unlock(mmu_lock)
>     +  (KVM VCPU) vcpu_enter_guest()
>     +  (KVM VCPU) kvm_vcpu_reload_apic_access_page()
>     +  (KVM VCPU) mmu_notifier_retry_hva()
> 
> Changing mn_memslots_update_rcuwait to a waitq (and renaming it to 
> mn_invalidate_waitq) is of course also a possibility.

I do think I'll go for a waitq but let's start *really* simple to make
sure I've got the basics right.... does this look vaguely sensible?

It returns -EAGAIN and lets the caller retry; I started with a 'goto'
but didn't have a sane exit condition. In fact, I *still* don't have a
sane exit condition for callers like evtchn_set_fn().

I'm actually tempted to split the caches into two lists
(kvm->guest_gpc_list, kvm->atomic_gpc_list) and invalidate only the
*former* from invalidate_range_start(), with these -EAGAIN semantics.
The atomic ones can stay precisely as they were in the series I already
sent since there's no need for them ever to have to spin/wait as long
as they're invalidated in the invalidate_range() MMU notifier.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d1187b051203..2d76c09e460c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -151,6 +151,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_UNBLOCK           2
 #define KVM_REQ_UNHALT            3
 #define KVM_REQ_VM_DEAD           (4 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_GPC_INVALIDATE    (5 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQUEST_ARCH_BASE     8
 
 #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7382aa45d5e8..9bc3162ba650 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -468,8 +468,6 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
 
-	gfn_to_pfn_cache_invalidate(kvm, start, end, false);
-
 	idx = srcu_read_lock(&kvm->srcu);
 	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
 	srcu_read_unlock(&kvm->srcu, idx);
@@ -689,6 +687,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	kvm->mn_active_invalidate_count++;
 	spin_unlock(&kvm->mn_invalidate_lock);
 
+	gfn_to_pfn_cache_invalidate(kvm, range->start, range->end, hva_range.may_block);
+
 	__kvm_handle_hva_range(kvm, &hva_range);
 
 	return 0;
@@ -2674,7 +2674,6 @@ static void gfn_to_pfn_cache_invalidate(struct kvm *kvm, unsigned long start,
 	}
 	spin_unlock(&kvm->gpc_lock);
 
-#if 0
 	unsigned int req = KVM_REQ_GPC_INVALIDATE;
 
 	/*
@@ -2690,7 +2689,6 @@ static void gfn_to_pfn_cache_invalidate(struct kvm *kvm, unsigned long start,
 	} else if (wake_vcpus) {
 		called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);
 	}
-#endif
 	WARN_ON_ONCE(called && !may_block);
 }
 
@@ -2767,6 +2765,8 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	if (!old_valid || old_uhva != gpc->uhva) {
 		unsigned long uhva = gpc->uhva;
 		void *new_khva = NULL;
+		unsigned long mmu_seq;
+		int retry;
 
 		/* Placeholders for "hva is valid but not yet mapped" */
 		gpc->pfn = KVM_PFN_ERR_FAULT;
@@ -2775,10 +2775,20 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 
 		write_unlock_irq(&gpc->lock);
 
+		mmu_seq = kvm->mmu_notifier_seq;
+		smp_rmb();
+
 		new_pfn = hva_to_pfn(uhva, false, NULL, true, NULL);
 		if (is_error_noslot_pfn(new_pfn))
 			ret = -EFAULT;
-		else if (gpc->kernel_map) {
+		else {
+			read_lock(&kvm->mmu_lock);
+			retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva);
+			read_unlock(&kvm->mmu_lock);
+			if (retry)
+				ret = -EAGAIN; // or goto the mmu_seq setting bit to retry?
+		}
+		if (!ret && gpc->kernel_map) {
 			if (new_pfn == old_pfn) {
 				new_khva = (void *)((unsigned long)old_khva - page_offset);
 				old_pfn = KVM_PFN_ERR_FAULT;
Sean Christopherson Nov. 15, 2021, 10:59 p.m. UTC | #5
On Mon, Nov 15, 2021, Paolo Bonzini wrote:
> On 11/15/21 20:11, David Woodhouse wrote:
> > > Changing mn_memslots_update_rcuwait to a waitq (and renaming it to
> > > mn_invalidate_waitq) is of course also a possibility.
> > I suspect that's the answer.
> > 
> > I think the actual *invalidation* of the cache still lives in the
> > invalidate_range() callback where I have it at the moment.

Oooh!  [finally had a lightbulb moment about ->invalidate_range() after years of
befuddlement].

Two things:

  1. Using _only_ ->invalidate_range() is not correct.  ->invalidate_range() is
     required if and only if the old PFN needs to be _unmapped_.  Specifically,
     if the protections are being downgraded without changing the PFN, it doesn't
     need to be called.  E.g. from hugetlb_change_protection():

	/*
	 * No need to call mmu_notifier_invalidate_range() we are downgrading
	 * page table protection not changing it to point to a new page.
	 *
	 * See Documentation/vm/mmu_notifier.rst
	 */

     x86's kvm_arch_mmu_notifier_invalidate_range() is a special snowflake because
     the APIC access page's VMA is controlled by KVM, i.e. is never downgraded, the
     only thing KVM cares about is if the PFN is changed, because that's the only
     thing that can change.

     In this case, if an HVA is downgraded from RW=>R, KVM may not invalidate the
     cache and end up writing to memory that is supposed to be read-only.

     I believe we could use ->invalidate_range() to handle the unmap case if KVM's
     ->invalidate_range_start() hook is enhanced to handle the RW=>R case.  The
     "struct mmu_notifier_range" provides the event type, IIUC we could have the
     _start() variant handle MMU_NOTIFY_PROTECTION_{VMA,PAGE} (and maybe
     MMU_NOTIFY_SOFT_DIRTY?), and let the more precise unmap-only variant handle
     everything else.

  2. If we do split the logic across the two hooks, we should (a) do it in a separate
     series and (b) make the logic common to the gfn_to_pfn cache and to the standard
     kvm_unmap_gfn_range().  That would in theory shave a bit of time off walking
     gfn ranges (maybe even moreso with the scalable memslots implementation?), and
     if we're lucky, would resurrect the mostly-dead .change_pte() hook (see commit
     c13fda237f08 ("KVM: Assert that notifier count is elevated in .change_pte()")).

> > But making the req to the affected vCPUs can live in
> > invalidate_range_start(). And then the code which *handles* that req can
> > wait for the mmu_notifier_count to reach zero before it proceeds. Atomic
> > users of the cache (like the Xen event channel code) don't have to get
> > involved with that.
> > 
> > > Also, for the small requests: since you are at it, can you add the code
> > > in a new file under virt/kvm/?
> > 
> > Hm... only if I can make hva_to_pfn() and probably a handful of other
> > things non-static?
> 
> Yes, I think sooner or later we also want all pfn stuff in one file
> (together with MMU notifiers) and all hva stuff in another; so for now you
> can create virt/kvm/hva_to_pfn.h, or virt/kvm/mm.h, or whatever color of the
> bikeshed you prefer.

Preemptive bikeshed strike... the MMU notifiers aren't strictly "pfn stuff", as
they operate on HVAs.  I don't know exactly what Paolo has in mind, but kvm/mm.h
or kvm/kvm_mm.h seems like it's less likely to become stale in the future.
David Woodhouse Nov. 15, 2021, 11:22 p.m. UTC | #6
On Mon, 2021-11-15 at 22:59 +0000, Sean Christopherson wrote:
> On Mon, Nov 15, 2021, Paolo Bonzini wrote:
> > On 11/15/21 20:11, David Woodhouse wrote:
> > > > Changing mn_memslots_update_rcuwait to a waitq (and renaming it to
> > > > mn_invalidate_waitq) is of course also a possibility.
> > > I suspect that's the answer.
> > > 
> > > I think the actual *invalidation* of the cache still lives in the
> > > invalidate_range() callback where I have it at the moment.
> 
> Oooh!  [finally had a lightbulb moment about ->invalidate_range() after years of
> befuddlement].
> 
> Two things:
> 
>   1. Using _only_ ->invalidate_range() is not correct.  ->invalidate_range() is
>      required if and only if the old PFN needs to be _unmapped_.  Specifically,
>      if the protections are being downgraded without changing the PFN, it doesn't
>      need to be called.  E.g. from hugetlb_change_protection():

OK, that's kind of important to realise. Thanks.

So, I had just split the atomic and guest-mode invalidations apart:
https://git.infradead.org/users/dwmw2/linux.git/commitdiff/6cf5fe318fd
but will go back to doing it all in invalidate_range_start from a
single list.

And just deal with the fact that the atomic users now have to
loop/retry/wait for there *not* to be an MMU notification in progress.

>      I believe we could use ->invalidate_range() to handle the unmap case if KVM's
>      ->invalidate_range_start() hook is enhanced to handle the RW=>R case.  The
>      "struct mmu_notifier_range" provides the event type, IIUC we could have the
>      _start() variant handle MMU_NOTIFY_PROTECTION_{VMA,PAGE} (and maybe
>      MMU_NOTIFY_SOFT_DIRTY?), and let the more precise unmap-only variant handle
>      everything else.

Not sure that helps us much. It was the termination condition on the
"when should we keep retrying, and when should we give up?" that was
painful, and a mixed mode doesn't make that problem go away.

I'll go back and have another look in the morning, with something much
closer to what I showed in
https://lore.kernel.org/kvm/040d61dad066eb2517c108232efb975bc1cda780.camel@infradead.org/

>   2. If we do split the logic across the two hooks, we should (a) do it in a separate
>      series and (b) make the logic common to the gfn_to_pfn cache and to the standard
>      kvm_unmap_gfn_range(). 
> > 
> > Yes, I think sooner or later we also want all pfn stuff in one file
> > (together with MMU notifiers) and all hva stuff in another; so for now you
> > can create virt/kvm/hva_to_pfn.h, or virt/kvm/mm.h, or whatever color of the
> > bikeshed you prefer.
> 
> Preemptive bikeshed strike... the MMU notifiers aren't strictly "pfn stuff", as
> they operate on HVAs.  I don't know exactly what Paolo has in mind, but kvm/mm.h
> or kvm/kvm_mm.h seems like it's less likely to become stale in the future.

I'd moved kvm/mmu_lock.h to kvm/kvm_mm.h and added to it.
https://git.infradead.org/users/dwmw2/linux.git/commitdiff/a247bc2d0d9
(which I'll make retrospective as I rework the series).

After frowning a little at all the different architectures' Makefiles
that all add the same(ish) list of $(KVM)/foobar.o I ended up punting
that problem by only adding pfncache.o on x86 anyway.

If we're going to split other parts of kvm_main.c out into smaller
files, providing a Makefile snippet in virt/kvm/Makefile.kvm that gives
the *list* of those files would be a useful thing to do. But
arch/powerpc/kvm/Makefile makes my head hurt too much for me to be
shaving that particular yak tonight (why is $(KVM)/irqchip.o handled
differently to the rest...?)
David Woodhouse Nov. 15, 2021, 11:24 p.m. UTC | #7
On Mon, 2021-11-15 at 22:59 +0000, Sean Christopherson wrote:
> > That would in theory shave a bit of time off walking
> > gfn ranges (maybe even moreso with the scalable memslots implementation?), 

(Sorry, missed that bit)

I don't care about memslots anyway for this case as I can just compare
against the cached hva.
David Woodhouse Nov. 16, 2021, 11:50 a.m. UTC | #8
On Mon, 2021-11-15 at 20:26 +0100, Paolo Bonzini wrote:
> > > Also, for the small requests: since you are at it, can you add the code
> > > in a new file under virt/kvm/?
> >
> > Hm... only if I can make hva_to_pfn() and probably a handful of other
> > things non-static?
> 
> Yes, I think sooner or later we also want all pfn stuff in one file 
> (together with MMU notifiers) and all hva stuff in another; so for now 
> you can create virt/kvm/hva_to_pfn.h, or virt/kvm/mm.h, or whatever 
> color of the bikeshed you prefer.


OK... let's start with this.

David Woodhouse (7):
      KVM: Introduce CONFIG_HAVE_KVM_DIRTY_RING
      KVM: Add Makefile.kvm for common files, use it for x86
      KVM: s390: Use Makefile.kvm for common files
      KVM: mips: Use Makefile.kvm for common files
      KVM: RISC-V: Use Makefile.kvm for common files
      KVM: powerpc: Use Makefile.kvm for common files
      KVM: arm64: Use Makefile.kvm for common files

 arch/arm64/kvm/Makefile        |  6 ++----
 arch/mips/kvm/Makefile         |  3 ++-
 arch/powerpc/kvm/Makefile      |  6 +-----
 arch/riscv/kvm/Makefile        |  6 +-----
 arch/s390/kvm/Makefile         |  6 ++----
 arch/x86/kvm/Kconfig           |  1 +
 arch/x86/kvm/Makefile          |  7 +------
 include/linux/kvm_dirty_ring.h |  8 ++++----
 virt/kvm/Kconfig               |  3 +++
 virt/kvm/Makefile.kvm          | 13 +++++++++++++
 virt/kvm/kvm_main.c            |  4 ++--
 11 files changed, 32 insertions(+), 31 deletions(-)
David Woodhouse Nov. 16, 2021, 1:17 p.m. UTC | #9
On Mon, 2021-11-15 at 23:22 +0000, David Woodhouse wrote:
> On Mon, 2021-11-15 at 22:59 +0000, Sean Christopherson wrote:
> > On Mon, Nov 15, 2021, Paolo Bonzini wrote:
> > > On 11/15/21 20:11, David Woodhouse wrote:
> > > > > Changing mn_memslots_update_rcuwait to a waitq (and renaming it to
> > > > > mn_invalidate_waitq) is of course also a possibility.
> > > > I suspect that's the answer.
> > > > 
> > > > I think the actual *invalidation* of the cache still lives in the
> > > > invalidate_range() callback where I have it at the moment.
> > 
> > Oooh!  [finally had a lightbulb moment about ->invalidate_range() after years of
> > befuddlement].
> > 
> > Two things:
> > 
> >   1. Using _only_ ->invalidate_range() is not correct.  ->invalidate_range() is
> >      required if and only if the old PFN needs to be _unmapped_.  Specifically,
> >      if the protections are being downgraded without changing the PFN, it doesn't
> >      need to be called.  E.g. from hugetlb_change_protection():
> 
> OK, that's kind of important to realise. Thanks.
> 
> So, I had just split the atomic and guest-mode invalidations apart:
> https://git.infradead.org/users/dwmw2/linux.git/commitdiff/6cf5fe318fd
> but will go back to doing it all in invalidate_range_start from a
> single list.
> 
> And just deal with the fact that the atomic users now have to
> loop/retry/wait for there *not* to be an MMU notification in progress.
> 
> >      I believe we could use ->invalidate_range() to handle the unmap case if KVM's
> >      ->invalidate_range_start() hook is enhanced to handle the RW=>R case.  The
> >      "struct mmu_notifier_range" provides the event type, IIUC we could have the
> >      _start() variant handle MMU_NOTIFY_PROTECTION_{VMA,PAGE} (and maybe
> >      MMU_NOTIFY_SOFT_DIRTY?), and let the more precise unmap-only variant handle
> >      everything else.
> 
> Not sure that helps us much. It was the termination condition on the
> "when should we keep retrying, and when should we give up?" that was
> painful, and a mixed mode doesn't make that problem go away.
> 
> I'll go back and have another look in the morning, with something much
> closer to what I showed in
> https://lore.kernel.org/kvm/040d61dad066eb2517c108232efb975bc1cda780.camel@infradead.org/
> 


Looks a bit like this, and it seems to be working for the Xen event
channel self-test. I'll port it into our actual Xen hosting environment
and give it some more serious testing.

I'm not sure I'm ready to sign up to immediately fix everything that's
hosed in nesting and kill off all users of the unsafe kvm_vcpu_map(),
but I'll at least convert one vCPU user to demonstrate that the new
gfn_to_pfn_cache is working sanely for that use case.



From: David Woodhouse <dwmw@amazon.co.uk>
Subject: [PATCH 08/10] KVM: Reinstate gfn_to_pfn_cache with invalidation support

This can be used in two modes. There is an atomic mode where the cached
mapping is accessed while holding the rwlock, and a mode where the
physical address is used by a vCPU in guest mode.

For the latter case, an invalidation will wake the vCPU with the new
KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any
caches it still needs to access before entering guest mode again.

Only one vCPU can be targeted by the wake requests; it's simple enough
to make it wake all vCPUs or even a mask but I don't see a use case for
that additional complexity right now.

Invalidation happens from the invalidate_range_start MMU notifier, which
needs to be able to sleep in order to wake the vCPU and wait for it.

This means that revalidation potentially needs to "wait" for the MMU
operation to complete and the invalidate_range_end notifier to be
invoked. Like the vCPU when it takes a page fault in that period, we
just spin — fixing that in a future patch by implementing an actual
*wait* may be another part of shaving this particularly hirsute yak.

As noted in the comments in the function itself, the only case where
the invalidate_range_start notifier is expected to be called *without*
being able to sleep is when the OOM reaper is killing the process. In
that case, we expect the vCPU threads already to have exited, and thus
there will be nothing to wake, and no reason to wait. So we clear the
KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly
if there actually *was* anything to wake up.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/Kconfig              |   1 +
 include/linux/kvm_host.h          |  14 ++
 include/linux/kvm_types.h         |  17 ++
 virt/kvm/Kconfig                  |   3 +
 virt/kvm/Makefile.kvm             |   1 +
 virt/kvm/dirty_ring.c             |   2 +-
 virt/kvm/kvm_main.c               |  13 +-
 virt/kvm/{mmu_lock.h => kvm_mm.h} |  23 ++-
 virt/kvm/pfncache.c               | 275 ++++++++++++++++++++++++++++++
 9 files changed, 342 insertions(+), 7 deletions(-)
 rename virt/kvm/{mmu_lock.h => kvm_mm.h} (55%)
 create mode 100644 virt/kvm/pfncache.c

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index d7fa0a42ac25..af351107d47f 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -26,6 +26,7 @@ config KVM
 	select PREEMPT_NOTIFIERS
 	select MMU_NOTIFIER
 	select HAVE_KVM_IRQCHIP
+	select HAVE_KVM_PFNCACHE
 	select HAVE_KVM_IRQFD
 	select HAVE_KVM_DIRTY_RING
 	select IRQ_BYPASS_MANAGER
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c310648cc8f1..52e17e4b7694 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -151,6 +151,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_UNBLOCK           2
 #define KVM_REQ_UNHALT            3
 #define KVM_REQ_VM_DEAD           (4 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_GPC_INVALIDATE    (5 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQUEST_ARCH_BASE     8
 
 #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
@@ -559,6 +560,10 @@ struct kvm {
 	unsigned long mn_active_invalidate_count;
 	struct rcuwait mn_memslots_update_rcuwait;
 
+	/* For management / invalidation of gfn_to_pfn_caches */
+	spinlock_t gpc_lock;
+	struct list_head gpc_list;
+
 	/*
 	 * created_vcpus is protected by kvm->lock, and is incremented
 	 * at the beginning of KVM_CREATE_VCPU.  online_vcpus is only
@@ -966,6 +971,15 @@ int kvm_vcpu_write_guest(struct kvm_vcpu *vcpu, gpa_t gpa, const void *data,
 			 unsigned long len);
 void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn);
 
+int kvm_gfn_to_pfn_cache_init(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
+			      struct kvm_vcpu *vcpu, bool kernel_map,
+			      gpa_t gpa, unsigned long len, bool write);
+int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
+				 gpa_t gpa, unsigned long len, bool write);
+bool kvm_gfn_to_pfn_cache_check(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
+				gpa_t gpa, unsigned long len);
+void kvm_gfn_to_pfn_cache_destroy(struct kvm *kvm, struct gfn_to_pfn_cache *gpc);
+
 void kvm_sigset_activate(struct kvm_vcpu *vcpu);
 void kvm_sigset_deactivate(struct kvm_vcpu *vcpu);
 
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 234eab059839..e454d2c003d6 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -19,6 +19,7 @@ struct kvm_memslots;
 enum kvm_mr_change;
 
 #include <linux/types.h>
+#include <linux/spinlock_types.h>
 
 #include <asm/kvm_types.h>
 
@@ -53,6 +54,22 @@ struct gfn_to_hva_cache {
 	struct kvm_memory_slot *memslot;
 };
 
+struct gfn_to_pfn_cache {
+	u64 generation;
+	gpa_t gpa;
+	unsigned long uhva;
+	struct kvm_memory_slot *memslot;
+	struct kvm_vcpu *vcpu;
+	struct list_head list;
+	rwlock_t lock;
+	void *khva;
+	kvm_pfn_t pfn;
+	bool active;
+	bool valid;
+	bool dirty;
+	bool kernel_map;
+};
+
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
 /*
  * Memory caches are used to preallocate memory ahead of various MMU flows,
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 97cf5413ac25..f4834c20e4a6 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -4,6 +4,9 @@
 config HAVE_KVM
        bool
 
+config HAVE_KVM_PFNCACHE
+       bool
+
 config HAVE_KVM_IRQCHIP
        bool
 
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index ee9c310f3601..ca499a216d0f 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -11,3 +11,4 @@ kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
 kvm-$(CONFIG_HAVE_KVM_IRQCHIP) += $(KVM)/irqchip.o
 kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
+kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 88f4683198ea..2b4474387895 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -9,7 +9,7 @@
 #include <linux/vmalloc.h>
 #include <linux/kvm_dirty_ring.h>
 #include <trace/events/kvm.h>
-#include "mmu_lock.h"
+#include "kvm_mm.h"
 
 int __weak kvm_cpu_dirty_log_size(void)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 356d636e037d..85506e4bd145 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -59,7 +59,7 @@
 
 #include "coalesced_mmio.h"
 #include "async_pf.h"
-#include "mmu_lock.h"
+#include "kvm_mm.h"
 #include "vfio.h"
 
 #define CREATE_TRACE_POINTS
@@ -684,6 +684,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	kvm->mn_active_invalidate_count++;
 	spin_unlock(&kvm->mn_invalidate_lock);
 
+	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
+					  hva_range.may_block);
+
 	__kvm_handle_hva_range(kvm, &hva_range);
 
 	return 0;
@@ -1051,6 +1054,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 
+	INIT_LIST_HEAD(&kvm->gpc_list);
+	spin_lock_init(&kvm->gpc_lock);
+
 	INIT_LIST_HEAD(&kvm->devices);
 
 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
@@ -2390,8 +2396,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
  * 2): @write_fault = false && @writable, @writable will tell the caller
  *     whether the mapping is writable.
  */
-static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
-			bool write_fault, bool *writable)
+kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
+		     bool write_fault, bool *writable)
 {
 	struct vm_area_struct *vma;
 	kvm_pfn_t pfn = 0;
diff --git a/virt/kvm/mmu_lock.h b/virt/kvm/kvm_mm.h
similarity index 55%
rename from virt/kvm/mmu_lock.h
rename to virt/kvm/kvm_mm.h
index 9e1308f9734c..b976e4b07e88 100644
--- a/virt/kvm/mmu_lock.h
+++ b/virt/kvm/kvm_mm.h
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
-#ifndef KVM_MMU_LOCK_H
-#define KVM_MMU_LOCK_H 1
+#ifndef __KVM_MM_H__
+#define __KVM_MM_H__ 1
 
 /*
  * Architectures can choose whether to use an rwlock or spinlock
@@ -20,4 +20,21 @@
 #define KVM_MMU_UNLOCK(kvm)    spin_unlock(&(kvm)->mmu_lock)
 #endif /* KVM_HAVE_MMU_RWLOCK */
 
-#endif
+kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
+		     bool write_fault, bool *writable);
+
+#ifdef CONFIG_HAVE_KVM_PFNCACHE
+void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
+				       unsigned long start,
+				       unsigned long end,
+				       bool may_block);
+#else
+static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
+						     unsigned long start,
+						     unsigned long end,
+						     bool may_block)
+{
+}
+#endif /* HAVE_KVM_PFNCACHE */
+
+#endif /* __KVM_MM_H__ */
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
new file mode 100644
index 000000000000..f2efc52039a8
--- /dev/null
+++ b/virt/kvm/pfncache.c
@@ -0,0 +1,275 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ *
+ * This module enables kernel and guest-mode vCPU access to guest physical
+ * memory with suitable invalidation mechanisms.
+ *
+ * Copyright © 2021 Amazon.com, Inc. or its affiliates.
+ *
+ * Authors:
+ *   David Woodhouse <dwmw2@infradead.org>
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/highmem.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+
+#include "kvm_mm.h"
+
+/*
+ * MMU notifier 'invalidate_range_start' hook.
+ */
+void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
+				       unsigned long end, bool may_block)
+{
+	DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS);
+	struct gfn_to_pfn_cache *gpc;
+	bool wake_vcpus = false;
+
+	spin_lock(&kvm->gpc_lock);
+	list_for_each_entry(gpc, &kvm->gpc_list, list) {
+		write_lock_irq(&gpc->lock);
+
+		/* Only a single page so no need to care about length */
+		if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
+		    gpc->uhva >= start && gpc->uhva < end) {
+			gpc->valid = false;
+
+			if (gpc->dirty) {
+				int idx = srcu_read_lock(&kvm->srcu);
+				mark_page_dirty(kvm, gpa_to_gfn(gpc->gpa));
+				srcu_read_unlock(&kvm->srcu, idx);
+
+				kvm_set_pfn_dirty(gpc->pfn);
+				gpc->dirty = false;
+			}
+
+			/*
+			 * If a guest vCPU could be using the physical address,
+			 * it needs to be woken.
+			 */
+			if (gpc->vcpu) {
+				if (!wake_vcpus) {
+					wake_vcpus = true;
+					bitmap_zero(vcpu_bitmap, KVM_MAX_VCPUS);
+				}
+				__set_bit(gpc->vcpu->vcpu_idx, vcpu_bitmap);
+			}
+		}
+		write_unlock_irq(&gpc->lock);
+	}
+	spin_unlock(&kvm->gpc_lock);
+
+	if (wake_vcpus) {
+		unsigned int req = KVM_REQ_GPC_INVALIDATE;
+		bool called;
+
+		/*
+		 * If the OOM reaper is active, then all vCPUs should have
+		 * been stopped already, so perform the request without
+		 * KVM_REQUEST_WAIT and be sad if any needed to be woken.
+		 */
+		if (!may_block)
+			req &= ~KVM_REQUEST_WAIT;
+
+		called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);
+
+		WARN_ON_ONCE(called && !may_block);
+	}
+}
+
+bool kvm_gfn_to_pfn_cache_check(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
+				gpa_t gpa, unsigned long len)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+
+	if ((gpa & ~PAGE_MASK) + len > PAGE_SIZE)
+		return false;
+
+	if (gpc->gpa != gpa || gpc->generation != slots->generation ||
+	    kvm_is_error_hva(gpc->uhva))
+		return false;
+
+	if (!gpc->valid)
+		return false;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_check);
+
+int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
+				 gpa_t gpa, unsigned long len, bool write)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	unsigned long page_offset = gpa & ~PAGE_MASK;
+	kvm_pfn_t old_pfn, new_pfn;
+	unsigned long old_uhva;
+	gpa_t old_gpa;
+	void *old_khva;
+	bool old_valid, old_dirty;
+	int ret = 0;
+
+	/*
+	 * It must fit within a single page. The 'len' argument is
+	 * only to enforce that.
+	 */
+	if (page_offset + len > PAGE_SIZE)
+		return -EINVAL;
+
+	write_lock_irq(&gpc->lock);
+
+	old_gpa = gpc->gpa;
+	old_pfn = gpc->pfn;
+	old_khva = gpc->khva;
+	old_uhva = gpc->uhva;
+	old_valid = gpc->valid;
+	old_dirty = gpc->dirty;
+
+	/* If the userspace HVA is invalid, refresh that first */
+	if (gpc->gpa != gpa || gpc->generation != slots->generation ||
+	    kvm_is_error_hva(gpc->uhva)) {
+		gfn_t gfn = gpa_to_gfn(gpa);
+
+		gpc->dirty = false;
+		gpc->gpa = gpa;
+		gpc->generation = slots->generation;
+		gpc->memslot = __gfn_to_memslot(slots, gfn);
+		gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn);
+
+		if (kvm_is_error_hva(gpc->uhva)) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		gpc->uhva += page_offset;
+	}
+
+	/*
+	 * If the userspace HVA changed or the PFN was already invalid,
+	 * drop the lock and do the HVA to PFN lookup again.
+	 */
+	if (!old_valid || old_uhva != gpc->uhva) {
+		unsigned long uhva = gpc->uhva;
+		void *new_khva = NULL;
+		unsigned long mmu_seq;
+		int retry;
+
+		/* Placeholders for "hva is valid but not yet mapped" */
+		gpc->pfn = KVM_PFN_ERR_FAULT;
+		gpc->khva = NULL;
+		gpc->valid = true;
+
+		write_unlock_irq(&gpc->lock);
+
+	retry_map:
+		mmu_seq = kvm->mmu_notifier_seq;
+		smp_rmb();
+
+		new_pfn = hva_to_pfn(uhva, false, NULL, true, NULL);
+		if (is_error_noslot_pfn(new_pfn)) {
+			ret = -EFAULT;
+			goto map_done;
+		}
+
+		read_lock(&kvm->mmu_lock);
+		retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva);
+		read_unlock(&kvm->mmu_lock);
+		if (retry) {
+			cond_resched();
+			goto retry_map;
+		}
+
+		if (gpc->kernel_map) {
+			if (new_pfn == old_pfn) {
+				new_khva = (void *)((unsigned long)old_khva - page_offset);
+				old_pfn = KVM_PFN_ERR_FAULT;
+				old_khva = NULL;
+			} else if (pfn_valid(new_pfn)) {
+				new_khva = kmap(pfn_to_page(new_pfn));
+#ifdef CONFIG_HAS_IOMEM
+			} else {
+				new_khva = memremap(pfn_to_hpa(new_pfn), PAGE_SIZE, MEMREMAP_WB);
+#endif
+			}
+			if (!new_khva)
+				ret = -EFAULT;
+		}
+
+	map_done:
+		write_lock_irq(&gpc->lock);
+		if (ret) {
+			gpc->valid = false;
+			gpc->pfn = KVM_PFN_ERR_FAULT;
+			gpc->khva = NULL;
+		} else {
+			/* At this point, gpc->valid may already have been cleared */
+			gpc->pfn = new_pfn;
+			gpc->khva = new_khva + page_offset;
+		}
+	}
+
+ out:
+	if (ret)
+		gpc->dirty = false;
+	else
+		gpc->dirty = write;
+
+	write_unlock_irq(&gpc->lock);
+
+	/* Unmap the old page if it was mapped before */
+	if (!is_error_noslot_pfn(old_pfn)) {
+		if (pfn_valid(old_pfn)) {
+			kunmap(pfn_to_page(old_pfn));
+#ifdef CONFIG_HAS_IOMEM
+		} else {
+			memunmap(old_khva);
+#endif
+		}
+		kvm_release_pfn(old_pfn, old_dirty);
+		if (old_dirty)
+			mark_page_dirty(kvm, old_gpa);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_refresh);
+
+int kvm_gfn_to_pfn_cache_init(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
+			      struct kvm_vcpu *vcpu, bool kernel_map,
+			      gpa_t gpa, unsigned long len, bool write)
+{
+	if (!gpc->active) {
+		rwlock_init(&gpc->lock);
+
+		gpc->khva = NULL;
+		gpc->pfn = KVM_PFN_ERR_FAULT;
+		gpc->uhva = KVM_HVA_ERR_BAD;
+		gpc->vcpu = vcpu;
+		gpc->kernel_map = kernel_map;
+		gpc->valid = false;
+		gpc->active = true;
+
+		spin_lock(&kvm->gpc_lock);
+		list_add(&gpc->list, &kvm->gpc_list);
+		spin_unlock(&kvm->gpc_lock);
+	}
+	return kvm_gfn_to_pfn_cache_refresh(kvm, gpc, gpa, len, write);
+}
+EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_init);
+
+void kvm_gfn_to_pfn_cache_destroy(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
+{
+	if (gpc->active) {
+		spin_lock(&kvm->gpc_lock);
+		list_del(&gpc->list);
+		spin_unlock(&kvm->gpc_lock);
+
+		/* Even if it fails, this will tear down any existing mapping */
+		(void)kvm_gfn_to_pfn_cache_refresh(kvm, gpc, GPA_INVALID, 0, false);
+		gpc->active = false;
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_destroy);
Paolo Bonzini Nov. 16, 2021, 2:11 p.m. UTC | #10
On 11/16/21 14:17, David Woodhouse wrote:
> I'm not sure I'm ready to sign up to immediately fix everything that's
> hosed in nesting and kill off all users of the unsafe kvm_vcpu_map(),
> but I'll at least convert one vCPU user to demonstrate that the new
> gfn_to_pfn_cache is working sanely for that use case.

I even have old patches that tried to do that, so I can try.

Paolo
David Woodhouse Nov. 16, 2021, 2:25 p.m. UTC | #11
On Tue, 2021-11-16 at 15:11 +0100, Paolo Bonzini wrote:
> On 11/16/21 14:17, David Woodhouse wrote:
> > I'm not sure I'm ready to sign up to immediately fix everything that's
> > hosed in nesting and kill off all users of the unsafe kvm_vcpu_map(),
> > but I'll at least convert one vCPU user to demonstrate that the new
> > gfn_to_pfn_cache is working sanely for that use case.
> 
> I even have old patches that tried to do that, so I can try.

Thanks. I think it starts with this on top of my current tree at 
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/xen-evtchn


--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9735,6 +9735,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
                if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
                        static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+               if (kvm_check_request(KVM_REQ_GPC_INVALIDATE, vcpu))
+                       ; /* Nothing to do. It just wanted to wake us */
        }
 
        if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
@@ -9781,6 +9783,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
        local_irq_disable();
        vcpu->mode = IN_GUEST_MODE;
 
+       /*
+        * If the guest requires direct access to mapped L1 pages, check
+        * the caches are valid. Will raise KVM_REQ_GET_NESTED_STATE_PAGES
+        * to go and revalidate them, if necessary.
+        */
+       if (is_guest_mode(vcpu) && kvm_x86_ops.nested_ops->check_guest_maps)
+               kvm_x86_ops.nested_ops->check_guest_maps();
+
        srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
 
        /*

That check_guest_maps() function can validate the caches which the L2
guest is actually using in the VMCS02, and if they need to be refreshed
then raising a req will immediately break out of vcpu_enter_guest() to
allow that to happen.

I *think* we can just use KVM_REQ_GET_NESTED_STATE_PAGES for that and
don't need to invent a new one?
Paolo Bonzini Nov. 16, 2021, 2:57 p.m. UTC | #12
On 11/16/21 15:25, David Woodhouse wrote:
> +       /*
> +        * If the guest requires direct access to mapped L1 pages, check
> +        * the caches are valid. Will raise KVM_REQ_GET_NESTED_STATE_PAGES
> +        * to go and revalidate them, if necessary.
> +        */
> +       if (is_guest_mode(vcpu) && kvm_x86_ops.nested_ops->check_guest_maps)
> +               kvm_x86_ops.nested_ops->check_guest_maps();
> +

This should not be needed, should it?  As long as the gfn-to-pfn
cache's vcpu field is handled properly, the request will just cause
the vCPU not to enter.  It would have to take the gpc->lock around
changes to gpc->vcpu though (meaning: it's probably best to add a
function gfn_to_pfn_cache_set_vcpu).

Doing it lockless would be harder; I cannot think of any well-known
pattern that is good for this scenario.
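
A sketch of what that helper might look like (the name is the one
suggested above; nothing here is in the posted series):

	void gfn_to_pfn_cache_set_vcpu(struct gfn_to_pfn_cache *gpc,
				       struct kvm_vcpu *vcpu)
	{
		write_lock_irq(&gpc->lock);
		gpc->vcpu = vcpu;	/* NULL to detach */
		write_unlock_irq(&gpc->lock);
	}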

> That check_guest_maps() function can validate the caches which the L2
> guest is actually using in the VMCS02, and if they need to be refreshed
> then raising a req will immediately break out of vcpu_enter_guest() to
> allow that to happen.
> 
> I *think* we can just use KVM_REQ_GET_NESTED_STATE_PAGES for that and
> don't need to invent a new one?

Yes, maybe even do it unconditionally?

-                if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
+                if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu) ||
		     kvm_check_request(KVM_REQ_GPC_INVALIDATE, vcpu))

if the gfn-to-pfn cache's vcpu field is set/reset properly across nested
VM entry and exit.

Paolo
David Woodhouse Nov. 16, 2021, 3:09 p.m. UTC | #13
On Tue, 2021-11-16 at 15:57 +0100, Paolo Bonzini wrote:
> On 11/16/21 15:25, David Woodhouse wrote:
> > +       /*
> > +        * If the guest requires direct access to mapped L1 pages, check
> > +        * the caches are valid. Will raise KVM_REQ_GET_NESTED_STATE_PAGES
> > +        * to go and revalidate them, if necessary.
> > +        */
> > +       if (is_guest_mode(vcpu) && kvm_x86_ops.nested_ops->check_guest_maps)
> > +               kvm_x86_ops.nested_ops->check_guest_maps();
> > +
> 
> This should not be needed, should it?  As long as the gfn-to-pfn
> cache's vcpu field is handled properly, the request will just cause
> the vCPU not to enter. 

If the MMU mappings never change, the request never happens. But the
memslots *can* change, so it does need to be revalidated each time
through I think?

>  It would have to take the gpc->lock around
> changes to gpc->vcpu though (meaning: it's probably best to add a
> function gfn_to_pfn_cache_set_vcpu).

Hm, in my head that was never going to *change* for a given gpc; it
*belongs* to that vCPU for ever (and was even part of vmx->nested. for
that vCPU, to replace e.g. vmx->nested.pi_desc_map).

If I flesh out what I had in my last email a bit more, perhaps my
vision is a little bit clearer...?

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 465455334c0c..9f279d08e570 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1510,6 +1510,7 @@ struct kvm_x86_nested_ops {
 	int (*enable_evmcs)(struct kvm_vcpu *vcpu,
 			    uint16_t *vmcs_version);
 	uint16_t (*get_evmcs_version)(struct kvm_vcpu *vcpu);
+	void (*check_guest_maps)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_x86_init_ops {
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 280f34ea02c3..71d2d8171a1c 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3242,6 +3242,31 @@ static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
 	return true;
 }
 
+static void nested_vmx_check_guest_maps(struct kvm_vcpu *vcpu)
+{
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct gfn_to_pfn_cache *gpc;
+
+	bool valid;
+
+	if (nested_cpu_has_posted_intr(vmcs12)) {
+		gpc = &vmx->nested.pi_desc_cache;
+
+		read_lock(&gpc->lock);
+		valid = kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc,
+						   vmcs12->posted_intr_desc_addr,
+						   PAGE_SIZE);
+		read_unlock(&gpc->lock);
+		if (!valid) {
+			/* XX: This isn't idempotent. Make it so, or use a different
+			 * req for the 'refresh'. */
+			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+			return;
+		}
+	}
+}
+
 static int nested_vmx_write_pml_buffer(struct kvm_vcpu *vcpu, gpa_t gpa)
 {
 	struct vmcs12 *vmcs12;
@@ -6744,4 +6769,5 @@ struct kvm_x86_nested_ops vmx_nested_ops = {
 	.write_log_dirty = nested_vmx_write_pml_buffer,
 	.enable_evmcs = nested_enable_evmcs,
 	.get_evmcs_version = nested_get_evmcs_version,
+	.check_guest_maps = nested_vmx_check_guest_maps,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0a689bb62e9e..a879e4d08758 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9735,6 +9735,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
 			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+		if (kvm_check_request(KVM_REQ_GPC_INVALIDATE, vcpu))
+			; /* Nothing to do. It just wanted to wake us */
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
@@ -9781,6 +9783,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	local_irq_disable();
 	vcpu->mode = IN_GUEST_MODE;
 
+	/*
+	 * If the guest requires direct access to mapped L1 pages, check
+	 * the caches are valid. Will raise KVM_REQ_GET_NESTED_STATE_PAGES
+	 * to go and revalidate them, if necessary.
+	 */
+	if (is_guest_mode(vcpu) && kvm_x86_ops.nested_ops->check_guest_maps)
+		kvm_x86_ops.nested_ops->check_guest_maps(vcpu);
+
 	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
 
 	/*


> Doing it lockless would be harder; I cannot think of any well-known
> pattern that is good for this scenario.
> 
> > That check_guest_maps() function can validate the caches which the L2
> > guest is actually using in the VMCS02, and if they need to be refreshed
> > then raising a req will immediately break out of vcpu_enter_guest() to
> > allow that to happen.
> > 
> > I *think* we can just use KVM_REQ_GET_NESTED_STATE_PAGES for that and
> > don't need to invent a new one?
> 
> Yes, maybe even do it unconditionally?
> 

So nested_get_vmcs12_pages() certainly isn't idempotent right now
because of all the kvm_vcpu_map() calls, which would just end up
leaking — but I suppose the point is to kill all those, and then maybe
it will be?

I quite liked the idea of *not* refreshing the caches immediately,
because we can wait until the vCPU is in L2 mode again and actually
*needs* them.
Paolo Bonzini Nov. 16, 2021, 3:49 p.m. UTC | #14
On 11/16/21 16:09, David Woodhouse wrote:
> On Tue, 2021-11-16 at 15:57 +0100, Paolo Bonzini wrote:
>> This should not be needed, should it?  As long as the gfn-to-pfn
>> cache's vcpu field is handled properly, the request will just cause
>> the vCPU not to enter.
> 
> If the MMU mappings never change, the request never happens. But the
> memslots *can* change, so it does need to be revalidated each time
> through I think?

That needs to be done on KVM_SET_USER_MEMORY_REGION, using the same 
request (or even the same list walking code) as the MMU notifiers.
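
Illustrative only: the same gpc_list walk, hooked into the memslot
update path (where exactly it hangs off KVM_SET_USER_MEMORY_REGION is
a guess):

	spin_lock(&kvm->gpc_lock);
	list_for_each_entry(gpc, &kvm->gpc_list, list) {
		write_lock_irq(&gpc->lock);
		gpc->valid = false;	/* generation check forces a refresh */
		if (gpc->vcpu) {
			kvm_make_request(KVM_REQ_GPC_INVALIDATE, gpc->vcpu);
			kvm_vcpu_kick(gpc->vcpu);
		}
		write_unlock_irq(&gpc->lock);
	}
	spin_unlock(&kvm->gpc_lock);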

>> It would have to take the gpc->lock around
>> changes to gpc->vcpu though (meaning: it's probably best to add a
>> function gfn_to_pfn_cache_set_vcpu).
> 
> Hm, in my head that was never going to *change* for a given gpc; it
> *belongs* to that vCPU for ever (and was even part of vmx->nested. for
> that vCPU, to replace e.g. vmx->nested.pi_desc_map).

Ah okay, I thought it would be set in nested vmentry and cleared in 
nested vmexit.

> +static void nested_vmx_check_guest_maps(struct kvm_vcpu *vcpu)
> +{
> +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	struct gfn_to_pfn_cache *gpc;
> +
> +	bool valid;
> +
> +	if (nested_cpu_has_posted_intr(vmcs12)) {
> +		gpc = &vmx->nested.pi_desc_cache;
> +
> +		read_lock(&gpc->lock);
> +		valid = kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc,
> +						   vmcs12->posted_intr_desc_addr,
> +						   PAGE_SIZE);
> +		read_unlock(&gpc->lock);
> +		if (!valid) {
> +			/* XX: This isn't idempotent. Make it so, or use a different
> +			 * req for the 'refresh'. */
> +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> +			return;
> +		}
> +	}
> +}

That's really slow to do on every vmentry.

> So nested_get_vmcs12_pages() certainly isn't idempotent right now
> because of all the kvm_vcpu_map() calls, which would just end up
> leaking — but I suppose the point is to kill all those, and then maybe
> it will be?

Yes, exactly.  That might be a larger than normal patch, but it should 
not be one too hard to review.  Once there's something that works, we 
can think of how to split (if it's worth it).

Paolo

> I quite liked the idea of *not* refreshing the caches immediately,
> because we can wait until the vCPU is in L2 mode again and actually
> *needs* them.
>   
>
David Woodhouse Nov. 16, 2021, 4:06 p.m. UTC | #15
On Tue, 2021-11-16 at 16:49 +0100, Paolo Bonzini wrote:
> On 11/16/21 16:09, David Woodhouse wrote:
> > On Tue, 2021-11-16 at 15:57 +0100, Paolo Bonzini wrote:
> > > This should not be needed, should it?  As long as the gfn-to-pfn
> > > cache's vcpu field is handled properly, the request will just cause
> > > the vCPU not to enter.
> > 
> > If the MMU mappings never change, the request never happens. But the
> > memslots *can* change, so it does need to be revalidated each time
> > through I think?
> 
> That needs to be done on KVM_SET_USER_MEMORY_REGION, using the same 
> request (or even the same list walking code) as the MMU notifiers.

Hm....  kvm_arch_memslots_updated() is already kicking every vCPU after
the update, and although that was asynchronous it was actually OK
because unlike in the MMU notifier case, that page wasn't actually
going away — and if that HVA *did* subsequently go away, our HVA-based
notifier check would still catch that and kill it synchronously.

But yes, we *could* come up with a wakeup mechanism which does it that
way.


> > > It would have to take the gpc->lock around
> > > changes to gpc->vcpu though (meaning: it's probably best to add a
> > > function gfn_to_pfn_cache_set_vcpu).
> > 
> > Hm, in my head that was never going to *change* for a given gpc; it
> > *belongs* to that vCPU for ever (and was even part of vmx->nested. for
> > that vCPU, to replace e.g. vmx->nested.pi_desc_map).
> 
> Ah okay, I thought it would be set in nested vmentry and cleared in 
> nested vmexit.

I don't think it needs to be proactively cleared; we just don't
*refresh* it until we need it again.

> > +static void nested_vmx_check_guest_maps(struct kvm_vcpu *vcpu)
> > +{
> > +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +	struct gfn_to_pfn_cache *gpc;
> > +
> > +	bool valid;
> > +
> > +	if (nested_cpu_has_posted_intr(vmcs12)) {
> > +		gpc = &vmx->nested.pi_desc_cache;
> > +
> > +		read_lock(&gpc->lock);
> > +		valid = kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc,
> > +						   vmcs12->posted_intr_desc_addr,
> > +						   PAGE_SIZE);
> > +		read_unlock(&gpc->lock);
> > +		if (!valid) {
> > +			/* XX: This isn't idempotent. Make it so, or use a different
> > +			 * req for the 'refresh'. */
> > +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > +			return;
> > +		}
> > +	}
> > +}
> 
> That's really slow to do on every vmentry.

It probably doesn't have to be.

Ultimately, all it *has* to check is that kvm->memslots->generation
hasn't changed since the caches were valid. That can surely be made
fast enough?

If we *know* the GPA and size haven't changed, and if we know that
gpc->valid becoming false would have been handled differently, then we
could optimise that whole thing away quite effectively to a single
check on ->generations?
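
Roughly, a sketch of what I mean (assuming the gpc remembers the
memslots generation it was valid against; the helper name is just
illustrative):

	/* Fast path: nothing to revalidate unless the memslots have
	 * changed since this cache was last refreshed. */
	static bool gpc_generation_current(struct kvm *kvm,
					   struct gfn_to_pfn_cache *gpc)
	{
		return gpc->valid &&
		       gpc->generation == kvm_memslots(kvm)->generation;
	}

Only when that fails would we need to take the lock and do the full
kvm_gfn_to_pfn_cache_check() dance.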

> > So nested_get_vmcs12_pages() certainly isn't idempotent right now
> > because of all the kvm_vcpu_map() calls, which would just end up
> > leaking — but I suppose the point is to kill all those, and then maybe
> > it will be?
> 
> Yes, exactly.  That might be a larger than normal patch, but it should 
> not be one too hard to review.  Once there's something that works, we 
> can think of how to split (if it's worth it).

This one actually compiles. Not sure we have any test cases that will
actually exercise it though, do we?

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 465455334c0c..9f279d08e570 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1510,6 +1510,7 @@ struct kvm_x86_nested_ops {
 	int (*enable_evmcs)(struct kvm_vcpu *vcpu,
 			    uint16_t *vmcs_version);
 	uint16_t (*get_evmcs_version)(struct kvm_vcpu *vcpu);
+	void (*check_guest_maps)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_x86_init_ops {
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 280f34ea02c3..f67751112633 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -309,7 +309,7 @@ static void free_nested(struct kvm_vcpu *vcpu)
 		kvm_release_page_clean(vmx->nested.apic_access_page);
 		vmx->nested.apic_access_page = NULL;
 	}
-	kvm_vcpu_unmap(vcpu, &vmx->nested.virtual_apic_map, true);
+	kvm_gfn_to_pfn_cache_destroy(vcpu->kvm, &vmx->nested.virtual_apic_cache);
 	kvm_vcpu_unmap(vcpu, &vmx->nested.pi_desc_map, true);
 	vmx->nested.pi_desc = NULL;
 
@@ -3170,10 +3170,12 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 	}
 
 	if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) {
-		map = &vmx->nested.virtual_apic_map;
+		struct gfn_to_pfn_cache *gpc = &vmx->nested.virtual_apic_cache;
 
-		if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr), map)) {
-			vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(map->pfn));
+		if (!kvm_gfn_to_pfn_cache_init(vcpu->kvm, gpc, vcpu, true,
+					       vmcs12->virtual_apic_page_addr,
+					       PAGE_SIZE, true)) {
+			vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(gpc->pfn));
 		} else if (nested_cpu_has(vmcs12, CPU_BASED_CR8_LOAD_EXITING) &&
 		           nested_cpu_has(vmcs12, CPU_BASED_CR8_STORE_EXITING) &&
 			   !nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
@@ -3198,6 +3200,9 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 	if (nested_cpu_has_posted_intr(vmcs12)) {
 		map = &vmx->nested.pi_desc_map;
 
+		if (kvm_vcpu_mapped(map))
+			kvm_vcpu_unmap(vcpu, map, true);
+
 		if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->posted_intr_desc_addr), map)) {
 			vmx->nested.pi_desc =
 				(struct pi_desc *)(((void *)map->hva) +
@@ -3242,6 +3247,29 @@ static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
 	return true;
 }
 
+static void nested_vmx_check_guest_maps(struct kvm_vcpu *vcpu)
+{
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct gfn_to_pfn_cache *gpc;
+
+	bool valid;
+
+	if (nested_cpu_has_posted_intr(vmcs12)) {
+		gpc = &vmx->nested.virtual_apic_cache;
+
+		read_lock(&gpc->lock);
+		valid = kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc,
+						   vmcs12->virtual_apic_page_addr,
+						   PAGE_SIZE);
+		read_unlock(&gpc->lock);
+		if (!valid) {
+			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+			return;
+		}
+	}
+}
+
 static int nested_vmx_write_pml_buffer(struct kvm_vcpu *vcpu, gpa_t gpa)
 {
 	struct vmcs12 *vmcs12;
@@ -3737,9 +3765,15 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
 
 	max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
 	if (max_irr != 256) {
-		vapic_page = vmx->nested.virtual_apic_map.hva;
-		if (!vapic_page)
+		struct gfn_to_pfn_cache *gpc = &vmx->nested.virtual_apic_cache;
+
+		read_lock(&gpc->lock);
+		if (!kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpc->gpa, PAGE_SIZE)) {
+			read_unlock(&gpc->lock);
 			goto mmio_needed;
+		}
+
+		vapic_page = gpc->khva;
 
 		__kvm_apic_update_irr(vmx->nested.pi_desc->pir,
 			vapic_page, &max_irr);
@@ -3749,6 +3783,7 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
 			status |= (u8)max_irr;
 			vmcs_write16(GUEST_INTR_STATUS, status);
 		}
+		read_unlock(&gpc->lock);
 	}
 
 	nested_mark_vmcs12_pages_dirty(vcpu);
@@ -4569,7 +4604,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 		kvm_release_page_clean(vmx->nested.apic_access_page);
 		vmx->nested.apic_access_page = NULL;
 	}
-	kvm_vcpu_unmap(vcpu, &vmx->nested.virtual_apic_map, true);
+	kvm_gfn_to_pfn_cache_destroy(vcpu->kvm, &vmx->nested.virtual_apic_cache);
 	kvm_vcpu_unmap(vcpu, &vmx->nested.pi_desc_map, true);
 	vmx->nested.pi_desc = NULL;
 
@@ -6744,4 +6779,5 @@ struct kvm_x86_nested_ops vmx_nested_ops = {
 	.write_log_dirty = nested_vmx_write_pml_buffer,
 	.enable_evmcs = nested_enable_evmcs,
 	.get_evmcs_version = nested_get_evmcs_version,
+	.check_guest_maps = nested_vmx_check_guest_maps,
 };
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ba66c171d951..6c61faef86d3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3839,19 +3839,23 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
 static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	void *vapic_page;
+	struct gfn_to_pfn_cache *gpc = &vmx->nested.virtual_apic_cache;
 	u32 vppr;
 	int rvi;
 
 	if (WARN_ON_ONCE(!is_guest_mode(vcpu)) ||
 		!nested_cpu_has_vid(get_vmcs12(vcpu)) ||
-		WARN_ON_ONCE(!vmx->nested.virtual_apic_map.gfn))
+		WARN_ON_ONCE(gpc->gpa == GPA_INVALID))
 		return false;
 
 	rvi = vmx_get_rvi();
 
-	vapic_page = vmx->nested.virtual_apic_map.hva;
-	vppr = *((u32 *)(vapic_page + APIC_PROCPRI));
+	read_lock(&gpc->lock);
+	if (kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpc->gpa, PAGE_SIZE))
+		vppr = *((u32 *)(gpc->khva + APIC_PROCPRI));
+	else
+		vppr = 0xff;
+	read_unlock(&gpc->lock);
 
 	return ((rvi & 0xf0) > (vppr & 0xf0));
 }
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 4df2ac24ffc1..8364e7fc92a0 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -195,7 +195,7 @@ struct nested_vmx {
 	 * pointers, so we must keep them pinned while L2 runs.
 	 */
 	struct page *apic_access_page;
-	struct kvm_host_map virtual_apic_map;
+	struct gfn_to_pfn_cache virtual_apic_cache;
 	struct kvm_host_map pi_desc_map;
 
 	struct kvm_host_map msr_bitmap_map;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0a689bb62e9e..a879e4d08758 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9735,6 +9735,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
 			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+		if (kvm_check_request(KVM_REQ_GPC_INVALIDATE, vcpu))
+			; /* Nothing to do. It just wanted to wake us */
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
@@ -9781,6 +9783,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	local_irq_disable();
 	vcpu->mode = IN_GUEST_MODE;
 
+	/*
+	 * If the guest requires direct access to mapped L1 pages, check
+	 * the caches are valid. Will raise KVM_REQ_GET_NESTED_STATE_PAGES
+	 * to go and revalidate them, if necessary.
+	 */
+	if (is_guest_mode(vcpu) && kvm_x86_ops.nested_ops->check_guest_maps)
+		kvm_x86_ops.nested_ops->check_guest_maps(vcpu);
+
 	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
 
 	/*
Paolo Bonzini Nov. 16, 2021, 5:42 p.m. UTC | #16
On 11/16/21 17:06, David Woodhouse wrote:
> On Tue, 2021-11-16 at 16:49 +0100, Paolo Bonzini wrote:
>> On 11/16/21 16:09, David Woodhouse wrote:
>>> On Tue, 2021-11-16 at 15:57 +0100, Paolo Bonzini wrote:
>>>> This should not be needed, should it?  As long as the gfn-to-pfn
>>>> cache's vcpu field is handled properly, the request will just cause
>>>> the vCPU not to enter.
>>>
>>> If the MMU mappings never change, the request never happens. But the
>>> memslots *can* change, so it does need to be revalidated each time
>>> through I think?
>>
>> That needs to be done on KVM_SET_USER_MEMORY_REGION, using the same
>> request (or even the same list walking code) as the MMU notifiers.
> 
> Hm....  kvm_arch_memslots_updated() is already kicking every vCPU after
> the update, and although that was asynchronous it was actually OK
> because unlike in the MMU notifier case, that page wasn't actually
> going away — and if that HVA *did* subsequently go away, our HVA-based
> notifier check would still catch that and kill it synchronously.

Right, so it only needs to change the kvm_vcpu_kick into a 
kvm_make_all_cpus_request without KVM_REQUEST_WAIT.
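
As a sketch, in kvm_arch_memslots_updated(), something like:

	/*
	 * Wake any vCPU that may be using a gfn_to_pfn_cache across the
	 * memslot update; no need to wait for it to acknowledge.
	 */
	kvm_make_all_cpus_request(kvm,
				  KVM_REQ_GPC_INVALIDATE & ~KVM_REQUEST_WAIT);

(or whichever request we end up defining for this) instead of the plain
per-vCPU kick.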

>>> Hm, in my head that was never going to *change* for a given gpc; it
>>> *belongs* to that vCPU for ever (and was even part of vmx->nested. for
>>> that vCPU, to replace e.g. vmx->nested.pi_desc_map).
>>
>> Ah okay, I thought it would be set in nested vmentry and cleared in
>> nested vmexit.
> 
> I don't think it needs to be proactively cleared; we just don't
> *refresh* it until we need it again.

True, but if it's cleared the vCPU won't be kicked, which is nice.

> If we *know* the GPA and size haven't changed, and if we know that
> gpc->valid becoming false would have been handled differently, then we
> could optimise that whole thing away quite effectively to a single
> check on ->generations?

I wonder if we need a per-gpc memslot generation...  Can it be global?

> This one actually compiles. Not sure we have any test cases that will
> actually exercise it though, do we?

I'll try to spend some time writing testcases.

> +		read_lock(&gpc->lock);
> +		if (!kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpc->gpa, PAGE_SIZE)) {
> +			read_unlock(&gpc->lock);
>   			goto mmio_needed;
> +		}
> +
> +		vapic_page = gpc->khva;

If we know this gpc is of the synchronous kind, I think we can skip the 
read_lock/read_unlock here?!?

>   		__kvm_apic_update_irr(vmx->nested.pi_desc->pir,
>   			vapic_page, &max_irr);
> @@ -3749,6 +3783,7 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
>   			status |= (u8)max_irr;
>   			vmcs_write16(GUEST_INTR_STATUS, status);
>   		}
> +		read_unlock(&gpc->lock);
>   	}
>   
>   	nested_mark_vmcs12_pages_dirty(vcpu);
> @@ -4569,7 +4604,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>   		kvm_release_page_clean(vmx->nested.apic_access_page);
>   		vmx->nested.apic_access_page = NULL;
>   	}
> -	kvm_vcpu_unmap(vcpu, &vmx->nested.virtual_apic_map, true);
> +	kvm_gfn_to_pfn_cache_destroy(vcpu->kvm, &vmx->nested.virtual_apic_cache);
>   	kvm_vcpu_unmap(vcpu, &vmx->nested.pi_desc_map, true);
>   	vmx->nested.pi_desc = NULL;
>   
> @@ -6744,4 +6779,5 @@ struct kvm_x86_nested_ops vmx_nested_ops = {
>   	.write_log_dirty = nested_vmx_write_pml_buffer,
>   	.enable_evmcs = nested_enable_evmcs,
>   	.get_evmcs_version = nested_get_evmcs_version,
> +	.check_guest_maps = nested_vmx_check_guest_maps,
>   };
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ba66c171d951..6c61faef86d3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3839,19 +3839,23 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
>   static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
> -	void *vapic_page;
> +	struct gfn_to_pfn_cache *gpc = &vmx->nested.virtual_apic_cache;
>   	u32 vppr;
>   	int rvi;
>   
>   	if (WARN_ON_ONCE(!is_guest_mode(vcpu)) ||
>   		!nested_cpu_has_vid(get_vmcs12(vcpu)) ||
> -		WARN_ON_ONCE(!vmx->nested.virtual_apic_map.gfn))
> +		WARN_ON_ONCE(gpc->gpa == GPA_INVALID))
>   		return false;
>   
>   	rvi = vmx_get_rvi();
>   
> -	vapic_page = vmx->nested.virtual_apic_map.hva;
> -	vppr = *((u32 *)(vapic_page + APIC_PROCPRI));
> +	read_lock(&gpc->lock);
> +	if (kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpc->gpa, PAGE_SIZE))
> +		vppr = *((u32 *)(gpc->khva + APIC_PROCPRI));
> +	else
> +		vppr = 0xff;
> +	read_unlock(&gpc->lock);
>   
>   	return ((rvi & 0xf0) > (vppr & 0xf0));
>   }
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 4df2ac24ffc1..8364e7fc92a0 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -195,7 +195,7 @@ struct nested_vmx {
>   	 * pointers, so we must keep them pinned while L2 runs.
>   	 */
>   	struct page *apic_access_page;
> -	struct kvm_host_map virtual_apic_map;
> +	struct gfn_to_pfn_cache virtual_apic_cache;
>   	struct kvm_host_map pi_desc_map;
>   
>   	struct kvm_host_map msr_bitmap_map;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0a689bb62e9e..a879e4d08758 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9735,6 +9735,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>   
>   		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
>   			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> +		if (kvm_check_request(KVM_REQ_GPC_INVALIDATE, vcpu))
> +			; /* Nothing to do. It just wanted to wake us */
>   	}
>   
>   	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
> @@ -9781,6 +9783,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>   	local_irq_disable();
>   	vcpu->mode = IN_GUEST_MODE;
>   
> +	/*
> +	 * If the guest requires direct access to mapped L1 pages, check
> +	 * the caches are valid. Will raise KVM_REQ_GET_NESTED_STATE_PAGES
> +	 * to go and revalidate them, if necessary.
> +	 */
> +	if (is_guest_mode(vcpu) && kvm_x86_ops.nested_ops->check_guest_maps)
> +		kvm_x86_ops.nested_ops->check_guest_maps(vcpu);
> +
>   	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
>   
>   	/*
>
David Woodhouse Nov. 16, 2021, 5:57 p.m. UTC | #17
On Tue, 2021-11-16 at 18:42 +0100, Paolo Bonzini wrote:
> On 11/16/21 17:06, David Woodhouse wrote:
> > On Tue, 2021-11-16 at 16:49 +0100, Paolo Bonzini wrote:
> > > On 11/16/21 16:09, David Woodhouse wrote:
> > > > On Tue, 2021-11-16 at 15:57 +0100, Paolo Bonzini wrote:
> > > > > This should not be needed, should it?  As long as the gfn-to-pfn
> > > > > cache's vcpu field is handled properly, the request will just cause
> > > > > the vCPU not to enter.
> > > > 
> > > > If the MMU mappings never change, the request never happens. But the
> > > > memslots *can* change, so it does need to be revalidated each time
> > > > through I think?
> > > 
> > > That needs to be done on KVM_SET_USER_MEMORY_REGION, using the same
> > > request (or even the same list walking code) as the MMU notifiers.
> > 
> > Hm....  kvm_arch_memslots_updated() is already kicking every vCPU after
> > the update, and although that was asynchronous it was actually OK
> > because unlike in the MMU notifier case, that page wasn't actually
> > going away — and if that HVA *did* subsequently go away, our HVA-based
> > notifier check would still catch that and kill it synchronously.
> 
> Right, so it only needs to change the kvm_vcpu_kick into a 
> kvm_make_all_cpus_request without KVM_REQUEST_WAIT.

Yeah, I think that works.

> > > > Hm, in my head that was never going to *change* for a given gpc; it
> > > > *belongs* to that vCPU for ever (and was even part of vmx->nested. for
> > > > that vCPU, to replace e.g. vmx->nested.pi_desc_map).
> > > 
> > > Ah okay, I thought it would be set in nested vmentry and cleared in
> > > nested vmexit.
> > 
> > I don't think it needs to be proactively cleared; we just don't
> > *refresh* it until we need it again.
> 
> True, but if it's cleared the vCPU won't be kicked, which is nice.

The vCPU will only be kicked once when it becomes invalid anyway. It's
a trade-off. Either we leave it valid for next time that L2 vCPU is
entered, or we actively clear it. Not sure I lose much sleep either
way?

> > If we *know* the GPA and size haven't changed, and if we know that
> > gpc->valid becoming false would have been handled differently, then we
> > could optimise that whole thing away quite effectively to a single
> > check on ->generations?
> 
> I wonder if we need a per-gpc memslot generation...  Can it be global?

Theoretically, maybe. It's kind of mathematically equivalent to the
highest of the individual gpc memslot generations. And any gpc which
*isn't* at that maximum is by definition invalid.

But I'm not sure I see how to implement it without actively going and
clearing the 'valid' bit on all GPCs when it gets bumped... which was
your previous suggestion of basically running the same code as in the
MMU notifier?




> > This one actually compiles. Not sure we have any test cases that will
> > actually exercise it though, do we?
> 
> I'll try to spend some time writing testcases.
> 
> > +		read_lock(&gpc->lock);
> > +		if (!kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpc->gpa, PAGE_SIZE)) {
> > +			read_unlock(&gpc->lock);
> >   			goto mmio_needed;
> > +		}
> > +
> > +		vapic_page = gpc->khva;
> 
> If we know this gpc is of the synchronous kind, I think we can skip the 
> read_lock/read_unlock here?!?

Er... this one was OUTSIDE_GUEST_MODE and is the atomic kind, which
means it needs to hold the lock for the duration of the access in order
to prevent (preemption and) racing with the invalidate?

It's the IN_GUEST_MODE one (in my check_guest_maps()) where we might
get away without the lock, perhaps?
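
For the atomic kind I mean a pattern along these lines (sketch only;
the check function as in the pfncache patch, refresh elided):

	read_lock(&gpc->lock);
	while (!kvm_gfn_to_pfn_cache_check(kvm, gpc, gpc->gpa, PAGE_SIZE)) {
		read_unlock(&gpc->lock);
		/* refresh the cache here (may sleep); bail out on failure */
		read_lock(&gpc->lock);
	}
	/* gpc->khva may only be dereferenced while the read lock is
	 * held; the invalidate side takes the write lock, so it can't
	 * pull the mapping out from under us mid-access. */
	read_unlock(&gpc->lock);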

> 
> >   		__kvm_apic_update_irr(vmx->nested.pi_desc->pir,
> >   			vapic_page, &max_irr);
> > @@ -3749,6 +3783,7 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
> >   			status |= (u8)max_irr;
> >   			vmcs_write16(GUEST_INTR_STATUS, status);
> >   		}
> > +		read_unlock(&gpc->lock);
> >   	}
> >   

I just realised that the mark_page_dirty() on invalidation and when
the irqfd workqueue refreshes the gpc might fall foul of the same
dirty_ring problem that I belatedly just spotted with the Xen shinfo
clock write. I'll fix it up to *always* require a vcpu (to be
associated with the writes), and reinstate the guest_uses_pa flag since
that can no longer be implicit in (vcpu!=NULL).
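
Roughly (sketch only, assuming the gpc carries a pointer to its
associated vCPU as discussed above):

	/* Attribute the write to the cache's vCPU so that the dirty
	 * ring, if enabled, has a vCPU to log it against. */
	kvm_vcpu_mark_page_dirty(gpc->vcpu, gpa_to_gfn(gpc->gpa));

instead of a bare mark_page_dirty(kvm, ...).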

I may leave the actual task of fixing nesting to you, if that's OK, as
long as we consider the new gfn_to_pfn_cache sufficient to address the
problem? I think it's mostly down to how we *use* it now, rather than
the fundamental design of the cache itself?
Paolo Bonzini Nov. 16, 2021, 6:46 p.m. UTC | #18
On 11/16/21 18:57, David Woodhouse wrote:
>>> +		read_lock(&gpc->lock);
>>> +		if (!kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpc->gpa, PAGE_SIZE)) {
>>> +			read_unlock(&gpc->lock);
>>>    			goto mmio_needed;
>>> +		}
>>> +
>>> +		vapic_page = gpc->khva;
>> If we know this gpc is of the synchronous kind, I think we can skip the
>> read_lock/read_unlock here?!?
> Er... this one was OUTSIDE_GUEST_MODE and is the atomic kind, which
> means it needs to hold the lock for the duration of the access in order
> to prevent (preemption and) racing with the invalidate?
> 
> It's the IN_GUEST_MODE one (in my check_guest_maps()) where we might
> get away without the lock, perhaps?

Ah, this is check_nested_events which is mostly IN_GUEST_MODE but not 
always (and that sucks for other reasons).  I'll think a bit more about 
it when I actually do the work.

>>>    		__kvm_apic_update_irr(vmx->nested.pi_desc->pir,
>>>    			vapic_page, &max_irr);
>>> @@ -3749,6 +3783,7 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
>>>    			status |= (u8)max_irr;
>>>    			vmcs_write16(GUEST_INTR_STATUS, status);
>>>    		}
>>> +		read_unlock(&gpc->lock);
>>>    	}
>>>    
> I just realised that the mark_page_dirty() on invalidation and when
> the irqfd workqueue refreshes the gpc might fall foul of the same
> dirty_ring problem that I belatedly just spotted with the Xen shinfo
> clock write. I'll fix it up to *always* require a vcpu (to be
> associated with the writes), and reinstate the guest_uses_pa flag since
> that can no longer be implicit in (vcpu!=NULL).

Okay.

> I may leave the actual task of fixing nesting to you, if that's OK, as
> long as we consider the new gfn_to_pfn_cache sufficient to address the
> problem? I think it's mostly down to how we *use* it now, rather than
> the fundamental design of the cache itself?

Yes, I agree.  Just stick whatever you have for nesting as an extra 
patch at the end, and I'll take it from there.

Paolo
David Woodhouse Nov. 16, 2021, 7:34 p.m. UTC | #19
On 16 November 2021 18:46:03 GMT, Paolo Bonzini <bonzini@gnu.org> wrote:
>On 11/16/21 18:57, David Woodhouse wrote:
>>>> +		read_lock(&gpc->lock);
>>>> +		if (!kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpc->gpa, PAGE_SIZE)) {
>>>> +			read_unlock(&gpc->lock);
>>>>    			goto mmio_needed;
>>>> +		}
>>>> +
>>>> +		vapic_page = gpc->khva;
>>> If we know this gpc is of the synchronous kind, I think we can skip the
>>> read_lock/read_unlock here?!?
>> Er... this one was OUTSIDE_GUEST_MODE and is the atomic kind, which
>> means it needs to hold the lock for the duration of the access in order
>> to prevent (preemption and) racing with the invalidate?
>> 
>> It's the IN_GUEST_MODE one (in my check_guest_maps()) where we might
>> get away without the lock, perhaps?
>
>Ah, this is check_nested_events which is mostly IN_GUEST_MODE but not 
>always (and that sucks for other reasons).  I'll think a bit more about 
>it when I actually do the work.
>
>>>>    		__kvm_apic_update_irr(vmx->nested.pi_desc->pir,
>>>>    			vapic_page, &max_irr);
>>>> @@ -3749,6 +3783,7 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
>>>>    			status |= (u8)max_irr;
>>>>    			vmcs_write16(GUEST_INTR_STATUS, status);
>>>>    		}
>>>> +		read_unlock(&gpc->lock);
>>>>    	}
>>>>    
>> I just realised that the mark_page_dirty() on invalidation and when
>> the irqfd workqueue refreshes the gpc might fall foul of the same
>> dirty_ring problem that I belatedly just spotted with the Xen shinfo
>> clock write. I'll fix it up to *always* require a vcpu (to be
>> associated with the writes), and reinstate the guest_uses_pa flag since
>> that can no longer be implicit in (vcpu!=NULL).
>
>Okay.
>
>> I may leave the actual task of fixing nesting to you, if that's OK, as
>> long as we consider the new gfn_to_pfn_cache sufficient to address the
>> problem? I think it's mostly down to how we *use* it now, rather than
>> the fundamental design of the cache itself?
>
>Yes, I agree.  Just stick whatever you have for nesting as an extra 
>patch at the end, and I'll take it from there.


Will do; thanks. I believe you said you'd already merged a batch
including killing the first three of the nesting kvm_vcpu_map() users,
so I'll wait for those to show up in a tree I can pull from and then
post a single series with the unified Makefile.kvm bits followed by
pfncache.c and the Xen event channel support using it. And as
requested, ending with the nested pfncache one I posted a couple of
messages ago.