
[v2,6/8] KVM: Fix multiple races in gfn=>pfn cache refresh

Message ID 20220427014004.1992589-7-seanjc@google.com (mailing list archive)
State New, archived
Series KVM: Fix mmu_notifier vs. pfncache vs. pfncache races

Commit Message

Sean Christopherson April 27, 2022, 1:40 a.m. UTC
Rework the gfn=>pfn cache (gpc) refresh logic to address multiple races
between concurrent users of the cache, and between the cache and
mmu_notifier events.

The existing refresh code attempts to guard against races with the
mmu_notifier by speculatively marking the cache valid, and then marking
it invalid if a mmu_notifier invalidation occurs.  That handles the case
where an invalidation occurs between dropping and re-acquiring gpc->lock,
but it doesn't handle the scenario where the cache is refreshed after the
cache was invalidated by the notifier, but before the notifier elevates
mmu_notifier_count.  The gpc refresh can't use the "retry" helper as its
invalidation occurs _before_ mmu_notifier_count is elevated and before
mmu_notifier_range_start is set/updated.

  CPU0                                    CPU1
  ----                                    ----

  gfn_to_pfn_cache_invalidate_start()
  |
  -> gpc->valid = false;
                                          kvm_gfn_to_pfn_cache_refresh()
                                          |
                                          |-> gpc->valid = true;

                                          hva_to_pfn_retry()
                                          |
                                          -> acquire kvm->mmu_lock
                                             kvm->mmu_notifier_count == 0
                                             mmu_seq == kvm->mmu_notifier_seq
                                             drop kvm->mmu_lock
                                             return pfn 'X'
  acquire kvm->mmu_lock
  kvm_inc_notifier_count()
  drop kvm->mmu_lock()
  kernel frees pfn 'X'
                                          kvm_gfn_to_pfn_cache_check()
                                          |
                                          |-> gpc->valid == true

                                          caller accesses freed pfn 'X'

Key off of mn_active_invalidate_count to detect that a pfncache refresh
needs to wait for an in-progress mmu_notifier invalidation.  While
mn_active_invalidate_count is not guaranteed to be stable, it is
guaranteed to be elevated prior to an invalidation acquiring gpc->lock,
so either the refresh will see an active invalidation and wait, or the
invalidation will run after the refresh completes.
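
Condensed, the ordering the refresh relies on looks roughly like the
sketch below.  This is distilled from the hunks later in this patch; the
surrounding retry loop, locking, and error handling are elided, so it is
not meant to stand alone:

/*
 * Invalidation side (kvm_mmu_notifier_invalidate_range_start(), see the
 * kvm_main.c hunk below): the count is elevated before the pfncache
 * invalidation, which takes gpc->lock and clears gpc->valid, and isn't
 * dropped until after mmu_notifier_seq is bumped.
 */
kvm->mn_active_invalidate_count++;
gfn_to_pfn_cache_invalidate_start(kvm, start, end, may_block);

/*
 * Refresh side (hva_to_pfn_retry(), see the pfncache.c hunk below):
 * after re-acquiring gpc->lock, redo the hva=>pfn resolution if an
 * invalidation is in flight or completed while the lock was dropped.
 */
if (kvm->mn_active_invalidate_count)
	continue;	/* invalidation in flight, go around again */
smp_rmb();		/* pairs with smp_wmb() in invalidate_range_end() */
if (kvm->mmu_notifier_seq == mmu_seq)
	break;		/* no invalidation raced with this refresh */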

Speculatively marking the cache valid is itself flawed, as a concurrent
kvm_gfn_to_pfn_cache_check() would see a valid cache with stale pfn/khva
values.  The KVM Xen use case explicitly allows/wants multiple users;
even though the caches are allocated per vCPU, __kvm_xen_has_interrupt()
can read a different vCPU's cache (or multiple vCPUs' caches).  Address
this race by invalidating the cache prior to dropping gpc->lock (this is
made possible by fixing the above mmu_notifier race).

Finally, the refresh logic doesn't protect against concurrent refreshes
with different GPAs (which may or may not be a desired use case, but it's
allowed in the code), nor does it protect against a false negative on the
memslot generation.  If the first refresh sees a stale memslot generation,
it will refresh the hva and generation before moving on to the hva=>pfn
translation.  If it then drops gpc->lock, a different user can come along,
acquire gpc->lock, see that the memslot generation is fresh, and skip
the hva=>pfn update due to the userspace address also matching (because
it too was updated).  Address this race by adding an "in-progress" flag
so that the refresh that acquires gpc->lock first runs to completion
before other users can start their refresh.

Complicating all of this is the fact that both the hva=>pfn resolution
and mapping of the kernel address can sleep, i.e. must be done outside
of gpc->lock.

Fix the above races in one fell swoop; trying to fix each individual race
in a sane manner is, for all intents and purposes, impossible.

Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_types.h |   1 +
 virt/kvm/kvm_main.c       |   9 ++
 virt/kvm/pfncache.c       | 209 +++++++++++++++++++++++++-------------
 3 files changed, 148 insertions(+), 71 deletions(-)

Comments

Sean Christopherson April 27, 2022, 2:10 p.m. UTC | #1
On Wed, Apr 27, 2022, Sean Christopherson wrote:
> Finally, the refresh logic doesn't protect against concurrent refreshes
> with different GPAs (which may or may not be a desired use case, but it's
> allowed in the code), nor does it protect against a false negative on the
> memslot generation.  If the first refresh sees a stale memslot generation,
> it will refresh the hva and generation before moving on to the hva=>pfn
> translation.  If it then drops gpc->lock, a different user can come along,
> acquire gpc->lock, see that the memslot generation is fresh, and skip
> the hva=>pfn update due to the userspace address also matching (because
> it too was updated).  Address this race by adding an "in-progress" flag
> so that the refresh that acquires gpc->lock first runs to completion
> before other users can start their refresh.

...

> @@ -159,10 +249,23 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
>  
>  	write_lock_irq(&gpc->lock);
>  
> +	/*
> +	 * If another task is refreshing the cache, wait for it to complete.
> +	 * There is no guarantee that concurrent refreshes will see the same
> +	 * gpa, memslots generation, etc..., so they must be fully serialized.
> +	 */
> +	while (gpc->refresh_in_progress) {
> +		write_unlock_irq(&gpc->lock);
> +
> +		cond_resched();
> +
> +		write_lock_irq(&gpc->lock);
> +	}
> +	gpc->refresh_in_progress = true;

Adding refresh_in_progress can likely go in a separate patch.  I'll plan on doing
that in a v3 unless it proves to be painful.

> @@ -246,9 +296,26 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
>  	}
>  
>   out:
> +	/*
> +	 * Invalidate the cache and purge the pfn/khva if the refresh failed.
> +	 * Some/all of the uhva, gpa, and memslot generation info may still be
> +	 * valid, leave it as is.
> +	 */
> +	if (ret) {
> +		gpc->valid = false;
> +		gpc->pfn = KVM_PFN_ERR_FAULT;
> +		gpc->khva = NULL;
> +	}
> +
> +	gpc->refresh_in_progress = false;
Lai Jiangshan April 28, 2022, 3:39 a.m. UTC | #2
On Wed, Apr 27, 2022 at 7:16 PM Sean Christopherson <seanjc@google.com> wrote:

> @@ -159,10 +249,23 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
>

The following refresh_in_progress code behaves somewhat like a mutex.

+ mutex_lock(&gpc->refresh_in_progress); // before write_lock_irq(&gpc->lock);

Would a mutex fit the intent here?

Thanks
Lai

>         write_lock_irq(&gpc->lock);
>
> +       /*
> +        * If another task is refreshing the cache, wait for it to complete.
> +        * There is no guarantee that concurrent refreshes will see the same
> +        * gpa, memslots generation, etc..., so they must be fully serialized.
> +        */
> +       while (gpc->refresh_in_progress) {
> +               write_unlock_irq(&gpc->lock);
> +
> +               cond_resched();
> +
> +               write_lock_irq(&gpc->lock);
> +       }
> +       gpc->refresh_in_progress = true;
> +
>         old_pfn = gpc->pfn;
>         old_khva = gpc->khva - offset_in_page(gpc->khva);
>         old_uhva = gpc->uhva;
> -       old_valid = gpc->valid;
>
Sean Christopherson April 28, 2022, 2:33 p.m. UTC | #3
On Thu, Apr 28, 2022, Lai Jiangshan wrote:
> On Wed, Apr 27, 2022 at 7:16 PM Sean Christopherson <seanjc@google.com> wrote:
> 
> > @@ -159,10 +249,23 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
> >
> 
> The following refresh_in_progress code behaves somewhat like a mutex.
> 
> + mutex_lock(&gpc->refresh_in_progress); // before write_lock_irq(&gpc->lock);
> 
> Would a mutex fit the intent here?

Yeah, a mutex should work.  Not sure why I shied away from a mutex...
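
For reference, a minimal sketch of what the mutex-based alternative being
discussed might look like; the refresh_lock field name and the
gpc_refresh_locked() helper are purely illustrative and not part of the
posted patch:

/*
 * Hypothetical: serialize refreshes with a mutex taken outside gpc->lock,
 * instead of polling a refresh_in_progress flag under gpc->lock.  Refresh
 * may sleep anyway (hva=>pfn resolution, kmap()/memremap()), so sleeping
 * on a mutex here is fine and fully serializes concurrent refreshes.
 */
int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
				 gpa_t gpa, unsigned long len)
{
	int ret;

	mutex_lock(&gpc->refresh_lock);			/* illustrative field */
	ret = gpc_refresh_locked(kvm, gpc, gpa, len);	/* the existing refresh body */
	mutex_unlock(&gpc->refresh_lock);

	return ret;
}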
Paolo Bonzini May 20, 2022, 2:49 p.m. UTC | #4
On 4/27/22 03:40, Sean Christopherson wrote:
> +		 * Wait for mn_active_invalidate_count, not mmu_notifier_count,
> +		 * to go away, as the invalidation in the mmu_notifier event
> +		 * occurs _before_ mmu_notifier_count is elevated.
> +		 *
> +		 * Note, mn_active_invalidate_count can change at any time as
> +		 * it's not protected by gpc->lock.  But, it is guaranteed to
> +		 * be elevated before the mmu_notifier acquires gpc->lock, and
> +		 * isn't dropped until after mmu_notifier_seq is updated.  So,
> +		 * this task may get a false positive of sorts, i.e. see an
> +		 * elevated count and wait even though it's technically safe to
> +		 * proceed (because the mmu_notifier will invalidate the cache
> +		 * _after_ it's refreshed here), but the cache will never be
> +		 * refreshed with stale data, i.e. won't get false negatives.

I am all for lavish comments, but I think this is even too detailed.  What about:

                 /*
                  * mn_active_invalidate_count acts for all intents and purposes
                  * like mmu_notifier_count here; but we cannot use the latter
                  * because the invalidation in the mmu_notifier event occurs
                  * _before_ mmu_notifier_count is elevated.
                  *
                  * Note, it does not matter that mn_active_invalidate_count
                  * is not protected by gpc->lock.  It is guaranteed to
                  * be elevated before the mmu_notifier acquires gpc->lock, and
                  * isn't dropped until after mmu_notifier_seq is updated.
                  */

Paolo
Paolo Bonzini May 20, 2022, 2:58 p.m. UTC | #5
On 5/20/22 16:49, Paolo Bonzini wrote:
> On 4/27/22 03:40, Sean Christopherson wrote:
>> +         * Wait for mn_active_invalidate_count, not mmu_notifier_count,
>> +         * to go away, as the invalidation in the mmu_notifier event
>> +         * occurs _before_ mmu_notifier_count is elevated.
>> +         *
>> +         * Note, mn_active_invalidate_count can change at any time as
>> +         * it's not protected by gpc->lock.  But, it is guaranteed to
>> +         * be elevated before the mmu_notifier acquires gpc->lock, and
>> +         * isn't dropped until after mmu_notifier_seq is updated.  So,
>> +         * this task may get a false positive of sorts, i.e. see an
>> +         * elevated count and wait even though it's technically safe to
>> +         * proceed (because the mmu_notifier will invalidate the cache
>> +         * _after_ it's refreshed here), but the cache will never be
>> +         * refreshed with stale data, i.e. won't get false negatives.
> 
> I am all for lavish comments, but I think this is even too detailed.  
> What about:

And in fact this should be moved to a separate function.

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 50ce7b78b42f..321964ff42e1 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -112,6 +112,36 @@ static void gpc_release_pfn_and_khva(struct kvm *kvm, kvm_pfn_t pfn, void *khva)
  	}
  }
  
+
+static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
+{
+	/*
+	 * mn_active_invalidate_count acts for all intents and purposes
+	 * like mmu_notifier_count here; but we cannot use the latter
+	 * because the invalidation in the mmu_notifier event occurs
+	 * _before_ mmu_notifier_count is elevated.
+	 *
+	 * Note, it does not matter that mn_active_invalidate_count
+	 * is not protected by gpc->lock.  It is guaranteed to
+	 * be elevated before the mmu_notifier acquires gpc->lock, and
+	 * isn't dropped until after mmu_notifier_seq is updated.
+	 */
+	if (kvm->mn_active_invalidate_count)
+		return true;
+
+	/*
+	 * Ensure mn_active_invalidate_count is read before
+	 * mmu_notifier_seq.  This pairs with the smp_wmb() in
+	 * mmu_notifier_invalidate_range_end() to guarantee either the
+	 * old (non-zero) value of mn_active_invalidate_count or the
+	 * new (incremented) value of mmu_notifier_seq is observed.
+	 */
+	smp_rmb();
+	if (kvm->mmu_notifier_seq != mmu_seq)
+		return true;
+	return false;
+}
+
  static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
  {
  	/* Note, the new page offset may be different than the old! */
@@ -129,7 +159,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
  	 */
  	gpc->valid = false;
  
-	for (;;) {
+	do {
  		mmu_seq = kvm->mmu_notifier_seq;
  		smp_rmb();
  
@@ -188,32 +218,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
  		 * attempting to refresh.
  		 */
  		WARN_ON_ONCE(gpc->valid);
-
-		/*
-		 * mn_active_invalidate_count acts for all intents and purposes
-		 * like mmu_notifier_count here; but we cannot use the latter
-		 * because the invalidation in the mmu_notifier event occurs
-		 * _before_ mmu_notifier_count is elevated.
-		 *
-		 * Note, it does not matter that mn_active_invalidate_count
-		 * is not protected by gpc->lock.  It is guaranteed to
-		 * be elevated before the mmu_notifier acquires gpc->lock, and
-		 * isn't dropped until after mmu_notifier_seq is updated.
-		 */
-		if (kvm->mn_active_invalidate_count)
-			continue;
-
-		/*
-		 * Ensure mn_active_invalidate_count is read before
-		 * mmu_notifier_seq.  This pairs with the smp_wmb() in
-		 * mmu_notifier_invalidate_range_end() to guarantee either the
-		 * old (non-zero) value of mn_active_invalidate_count or the
-		 * new (incremented) value of mmu_notifier_seq is observed.
-		 */
-		smp_rmb();
-		if (kvm->mmu_notifier_seq == mmu_seq)
-			break;
-	}
+	} while (mmu_notifier_retry_cache(kvm, mmu_seq));
  
  	gpc->valid = true;
  	gpc->pfn = new_pfn;
Sean Christopherson May 20, 2022, 2:59 p.m. UTC | #6
On Fri, May 20, 2022, Paolo Bonzini wrote:
> On 4/27/22 03:40, Sean Christopherson wrote:
> > +		 * Wait for mn_active_invalidate_count, not mmu_notifier_count,
> > +		 * to go away, as the invalidation in the mmu_notifier event
> > +		 * occurs _before_ mmu_notifier_count is elevated.
> > +		 *
> > +		 * Note, mn_active_invalidate_count can change at any time as
> > +		 * it's not protected by gpc->lock.  But, it is guaranteed to
> > +		 * be elevated before the mmu_notifier acquires gpc->lock, and
> > +		 * isn't dropped until after mmu_notifier_seq is updated.  So,
> > +		 * this task may get a false positive of sorts, i.e. see an
> > +		 * elevated count and wait even though it's technically safe to
> > +		 * proceed (because the mmu_notifier will invalidate the cache
> > +		 * _after_ it's refreshed here), but the cache will never be
> > +		 * refreshed with stale data, i.e. won't get false negatives.
> 
> I am all for lavish comments, but I think this is even too detailed.

Yeah, the false positive/negative stuff is probably overkill.

> What about:
> 
>                 /*
>                  * mn_active_invalidate_count acts for all intents and purposes
>                  * like mmu_notifier_count here; but we cannot use the latter
>                  * because the invalidation in the mmu_notifier event occurs
>                  * _before_ mmu_notifier_count is elevated.

Looks good, though I'd prefer to avoid the "we", and explicitly call out that it's
the invalidation of the caches.


		/*
		 * mn_active_invalidate_count acts for all intents and purposes
		 * like mmu_notifier_count here; but the latter cannot be used
		 * here because the invalidation of caches in the mmu_notifier
		 * event occurs _before_ mmu_notifier_count is elevated.
		 *
		 * Note, it does not matter that mn_active_invalidate_count
		 * is not protected by gpc->lock.  It is guaranteed to
		 * be elevated before the mmu_notifier acquires gpc->lock, and
		 * isn't dropped until after mmu_notifier_seq is updated.
		 */


Also, you'll definitely want to look at v3 of this series.  I'm 99% certain I didn't
change the comment though :-)

https://lore.kernel.org/all/20220429210025.3293691-1-seanjc@google.com
Sean Christopherson May 20, 2022, 3:02 p.m. UTC | #7
On Fri, May 20, 2022, Paolo Bonzini wrote:
> On 5/20/22 16:49, Paolo Bonzini wrote:
> > On 4/27/22 03:40, Sean Christopherson wrote:
> > > +         * Wait for mn_active_invalidate_count, not mmu_notifier_count,
> > > +         * to go away, as the invalidation in the mmu_notifier event
> > > +         * occurs _before_ mmu_notifier_count is elevated.
> > > +         *
> > > +         * Note, mn_active_invalidate_count can change at any time as
> > > +         * it's not protected by gpc->lock.  But, it is guaranteed to
> > > +         * be elevated before the mmu_notifier acquires gpc->lock, and
> > > +         * isn't dropped until after mmu_notifier_seq is updated.  So,
> > > +         * this task may get a false positive of sorts, i.e. see an
> > > +         * elevated count and wait even though it's technically safe to
> > > +         * proceed (because the mmu_notifier will invalidate the cache
> > > +         * _after_ it's refreshed here), but the cache will never be
> > > +         * refreshed with stale data, i.e. won't get false negatives.
> > 
> > I am all for lavish comments, but I think this is even too detailed.
> > What about:
> 
> And in fact this should be moved to a separate function.
> 
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index 50ce7b78b42f..321964ff42e1 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -112,6 +112,36 @@ static void gpc_release_pfn_and_khva(struct kvm *kvm, kvm_pfn_t pfn, void *khva)
>  	}
>  }
> +
> +static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
> +{
> +	/*
> +	 * mn_active_invalidate_count acts for all intents and purposes
> +	 * like mmu_notifier_count here; but we cannot use the latter
> +	 * because the invalidation in the mmu_notifier event occurs
> +	 * _before_ mmu_notifier_count is elevated.
> +	 *
> +	 * Note, it does not matter that mn_active_invalidate_count
> +	 * is not protected by gpc->lock.  It is guaranteed to
> +	 * be elevated before the mmu_notifier acquires gpc->lock, and
> +	 * isn't dropped until after mmu_notifier_seq is updated.
> +	 */
> +	if (kvm->mn_active_invalidate_count)
> +		return true;
> +
> +	/*
> +	 * Ensure mn_active_invalidate_count is read before
> +	 * mmu_notifier_seq.  This pairs with the smp_wmb() in
> +	 * mmu_notifier_invalidate_range_end() to guarantee either the
> +	 * old (non-zero) value of mn_active_invalidate_count or the
> +	 * new (incremented) value of mmu_notifier_seq is observed.
> +	 */
> +	smp_rmb();
> +	if (kvm->mmu_notifier_seq != mmu_seq)
> +		return true;
> +	return false;

This can be

	return kvm->mmu_notifier_seq != mmu_seq;

Looks good otherwise.  It'll probably yield a smaller diff too.
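
Folding the two suggestions together (Paolo's standalone helper, Sean's
reworded comment from #6, and the one-line return above), the helper would
presumably end up looking something like the sketch below; this is an
illustration of the cleanup being discussed, not necessarily the exact
code that lands in a later revision:

static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
{
	/*
	 * mn_active_invalidate_count acts for all intents and purposes
	 * like mmu_notifier_count here; but the latter cannot be used
	 * here because the invalidation of caches in the mmu_notifier
	 * event occurs _before_ mmu_notifier_count is elevated.
	 *
	 * Note, it does not matter that mn_active_invalidate_count
	 * is not protected by gpc->lock.  It is guaranteed to
	 * be elevated before the mmu_notifier acquires gpc->lock, and
	 * isn't dropped until after mmu_notifier_seq is updated.
	 */
	if (kvm->mn_active_invalidate_count)
		return true;

	/*
	 * Ensure mn_active_invalidate_count is read before
	 * mmu_notifier_seq.  This pairs with the smp_wmb() in
	 * mmu_notifier_invalidate_range_end() to guarantee either the
	 * old (non-zero) value of mn_active_invalidate_count or the
	 * new (incremented) value of mmu_notifier_seq is observed.
	 */
	smp_rmb();
	return kvm->mmu_notifier_seq != mmu_seq;
}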

Patch

diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index ac1ebb37a0ff..83dcb97dddf1 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -74,6 +74,7 @@  struct gfn_to_pfn_cache {
 	enum pfn_cache_usage usage;
 	bool active;
 	bool valid;
+	bool refresh_in_progress;
 };
 
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfb7dabdbc63..0848430f36c6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -705,6 +705,15 @@  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	kvm->mn_active_invalidate_count++;
 	spin_unlock(&kvm->mn_invalidate_lock);
 
+	/*
+	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
+	 * before acquiring mmu_lock, to avoid holding mmu_lock while acquiring
+	 * each cache's lock.  There are relatively few caches in existence at
+	 * any given time, and the caches themselves can check for hva overlap,
+	 * i.e. don't need to rely on memslot overlap checks for performance.
+	 * Because this runs without holding mmu_lock, the pfn caches must use
+	 * mn_active_invalidate_count (see above) instead of mmu_notifier_count.
+	 */
 	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
 					  hva_range.may_block);
 
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 05cb0bcbf662..b1665d0e6c32 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -112,31 +112,122 @@  static void gpc_release_pfn_and_khva(struct kvm *kvm, kvm_pfn_t pfn, void *khva)
 	}
 }
 
-static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, unsigned long uhva)
+static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
 {
+	/* Note, the new page offset may be different than the old! */
+	void *old_khva = gpc->khva - offset_in_page(gpc->khva);
+	kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
+	void *new_khva = NULL;
 	unsigned long mmu_seq;
-	kvm_pfn_t new_pfn;
-	int retry;
 
-	do {
+	lockdep_assert_held_write(&gpc->lock);
+
+	/*
+	 * Invalidate the cache prior to dropping gpc->lock, gpc->uhva has
+	 * already been updated and so a concurrent refresh from a different
+	 * task will not detect that gpa/uhva changed.
+	 */
+	gpc->valid = false;
+
+	for (;;) {
 		mmu_seq = kvm->mmu_notifier_seq;
 		smp_rmb();
 
+		write_unlock_irq(&gpc->lock);
+
+		/*
+		 * If the previous iteration "failed" due to an mmu_notifier
+		 * event, release the pfn and unmap the kernel virtual address
+		 * from the previous attempt.  Unmapping might sleep, so this
+		 * needs to be done after dropping the lock.  Opportunistically
+		 * check for resched while the lock isn't held.
+		 */
+		if (new_pfn != KVM_PFN_ERR_FAULT) {
+			/*
+			 * Keep the mapping if the previous iteration reused
+			 * the existing mapping and didn't create a new one.
+			 */
+			if (new_khva == old_khva)
+				new_khva = NULL;
+
+			gpc_release_pfn_and_khva(kvm, new_pfn, new_khva);
+
+			cond_resched();
+		}
+
 		/* We always request a writeable mapping */
-		new_pfn = hva_to_pfn(uhva, false, NULL, true, NULL);
+		new_pfn = hva_to_pfn(gpc->uhva, false, NULL, true, NULL);
 		if (is_error_noslot_pfn(new_pfn))
-			break;
+			goto out_error;
+
+		/*
+		 * Obtain a new kernel mapping if KVM itself will access the
+		 * pfn.  Note, kmap() and memremap() can both sleep, so this
+		 * too must be done outside of gpc->lock!
+		 */
+		if (gpc->usage & KVM_HOST_USES_PFN) {
+			if (new_pfn == gpc->pfn) {
+				new_khva = old_khva;
+			} else if (pfn_valid(new_pfn)) {
+				new_khva = kmap(pfn_to_page(new_pfn));
+#ifdef CONFIG_HAS_IOMEM
+			} else {
+				new_khva = memremap(pfn_to_hpa(new_pfn), PAGE_SIZE, MEMREMAP_WB);
+#endif
+			}
+			if (!new_khva) {
+				kvm_release_pfn_clean(new_pfn);
+				goto out_error;
+			}
+		}
+
+		write_lock_irq(&gpc->lock);
 
-		KVM_MMU_READ_LOCK(kvm);
-		retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva);
-		KVM_MMU_READ_UNLOCK(kvm);
-		if (!retry)
+		/*
+		 * Other tasks must wait for _this_ refresh to complete before
+		 * attempting to refresh.
+		 */
+		WARN_ON_ONCE(gpc->valid);
+
+		/*
+		 * Wait for mn_active_invalidate_count, not mmu_notifier_count,
+		 * to go away, as the invalidation in the mmu_notifier event
+		 * occurs _before_ mmu_notifier_count is elevated.
+		 *
+		 * Note, mn_active_invalidate_count can change at any time as
+		 * it's not protected by gpc->lock.  But, it is guaranteed to
+		 * be elevated before the mmu_notifier acquires gpc->lock, and
+		 * isn't dropped until after mmu_notifier_seq is updated.  So,
+		 * this task may get a false positive of sorts, i.e. see an
+		 * elevated count and wait even though it's technically safe to
+		 * proceed (because the mmu_notifier will invalidate the cache
+		 * _after_ it's refreshed here), but the cache will never be
+		 * refreshed with stale data, i.e. won't get false negatives.
+		 */
+		if (kvm->mn_active_invalidate_count)
+			continue;
+
+		/*
+		 * Ensure mn_active_invalidate_count is read before
+		 * mmu_notifier_seq.  This pairs with the smp_wmb() in
+		 * mmu_notifier_invalidate_range_end() to guarantee either the
+		 * old (non-zero) value of mn_active_invalidate_count or the
+		 * new (incremented) value of mmu_notifier_seq is observed.
+		 */
+		smp_rmb();
+		if (kvm->mmu_notifier_seq == mmu_seq)
 			break;
+	}
+
+	gpc->valid = true;
+	gpc->pfn = new_pfn;
+	gpc->khva = new_khva + (gpc->gpa & ~PAGE_MASK);
+	return 0;
 
-		cond_resched();
-	} while (1);
+out_error:
+	write_lock_irq(&gpc->lock);
 
-	return new_pfn;
+	return -EFAULT;
 }
 
 int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
@@ -147,7 +238,6 @@  int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	kvm_pfn_t old_pfn, new_pfn;
 	unsigned long old_uhva;
 	void *old_khva;
-	bool old_valid;
 	int ret = 0;
 
 	/*
@@ -159,10 +249,23 @@  int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 
 	write_lock_irq(&gpc->lock);
 
+	/*
+	 * If another task is refreshing the cache, wait for it to complete.
+	 * There is no guarantee that concurrent refreshes will see the same
+	 * gpa, memslots generation, etc..., so they must be fully serialized.
+	 */
+	while (gpc->refresh_in_progress) {
+		write_unlock_irq(&gpc->lock);
+
+		cond_resched();
+
+		write_lock_irq(&gpc->lock);
+	}
+	gpc->refresh_in_progress = true;
+
 	old_pfn = gpc->pfn;
 	old_khva = gpc->khva - offset_in_page(gpc->khva);
 	old_uhva = gpc->uhva;
-	old_valid = gpc->valid;
 
 	/* If the userspace HVA is invalid, refresh that first */
 	if (gpc->gpa != gpa || gpc->generation != slots->generation ||
@@ -175,7 +278,6 @@  int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 		gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn);
 
 		if (kvm_is_error_hva(gpc->uhva)) {
-			gpc->pfn = KVM_PFN_ERR_FAULT;
 			ret = -EFAULT;
 			goto out;
 		}
@@ -185,60 +287,8 @@  int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	 * If the userspace HVA changed or the PFN was already invalid,
 	 * drop the lock and do the HVA to PFN lookup again.
 	 */
-	if (!old_valid || old_uhva != gpc->uhva) {
-		unsigned long uhva = gpc->uhva;
-		void *new_khva = NULL;
-
-		/* Placeholders for "hva is valid but not yet mapped" */
-		gpc->pfn = KVM_PFN_ERR_FAULT;
-		gpc->khva = NULL;
-		gpc->valid = true;
-
-		write_unlock_irq(&gpc->lock);
-
-		new_pfn = hva_to_pfn_retry(kvm, uhva);
-		if (is_error_noslot_pfn(new_pfn)) {
-			ret = -EFAULT;
-			goto map_done;
-		}
-
-		if (gpc->usage & KVM_HOST_USES_PFN) {
-			if (new_pfn == old_pfn) {
-				/*
-				 * Reuse the existing pfn and khva, but put the
-				 * reference acquired hva_to_pfn_retry(); the
-				 * cache still holds a reference to the pfn
-				 * from the previous refresh.
-				 */
-				gpc_release_pfn_and_khva(kvm, new_pfn, NULL);
-
-				new_khva = old_khva;
-				old_pfn = KVM_PFN_ERR_FAULT;
-				old_khva = NULL;
-			} else if (pfn_valid(new_pfn)) {
-				new_khva = kmap(pfn_to_page(new_pfn));
-#ifdef CONFIG_HAS_IOMEM
-			} else {
-				new_khva = memremap(pfn_to_hpa(new_pfn), PAGE_SIZE, MEMREMAP_WB);
-#endif
-			}
-			if (new_khva)
-				new_khva += page_offset;
-			else
-				ret = -EFAULT;
-		}
-
-	map_done:
-		write_lock_irq(&gpc->lock);
-		if (ret) {
-			gpc->valid = false;
-			gpc->pfn = KVM_PFN_ERR_FAULT;
-			gpc->khva = NULL;
-		} else {
-			/* At this point, gpc->valid may already have been cleared */
-			gpc->pfn = new_pfn;
-			gpc->khva = new_khva;
-		}
+	if (!gpc->valid || old_uhva != gpc->uhva) {
+		ret = hva_to_pfn_retry(kvm, gpc);
 	} else {
 		/* If the HVA→PFN mapping was already valid, don't unmap it. */
 		old_pfn = KVM_PFN_ERR_FAULT;
@@ -246,9 +296,26 @@  int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	}
 
  out:
+	/*
+	 * Invalidate the cache and purge the pfn/khva if the refresh failed.
+	 * Some/all of the uhva, gpa, and memslot generation info may still be
+	 * valid, leave it as is.
+	 */
+	if (ret) {
+		gpc->valid = false;
+		gpc->pfn = KVM_PFN_ERR_FAULT;
+		gpc->khva = NULL;
+	}
+
+	gpc->refresh_in_progress = false;
+
+	/* Snapshot the new pfn before dropping the lock! */
+	new_pfn = gpc->pfn;
+
 	write_unlock_irq(&gpc->lock);
 
-	gpc_release_pfn_and_khva(kvm, old_pfn, old_khva);
+	if (old_pfn != new_pfn)
+		gpc_release_pfn_and_khva(kvm, old_pfn, old_khva);
 
 	return ret;
 }