
[09/54] KVM: x86/mmu: Unconditionally zap unsync SPs when creating >4k SP at GFN

Message ID 20210622175739.3610207-10-seanjc@google.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: Bug fixes and summer cleaning

Commit Message

Sean Christopherson June 22, 2021, 5:56 p.m. UTC
When creating a new upper-level shadow page, zap unsync shadow pages at
the same target gfn instead of attempting to sync the pages.  This fixes
a bug where an unsync shadow page could be sync'd with an incompatible
context, e.g. wrong smm, is_guest, etc... flags.  In practice, the bug is
relatively benign as sync_page() is all but guaranteed to fail its check
that the guest's desired gfn (for the to-be-sync'd page) matches the
current gfn associated with the shadow page.  I.e. kvm_sync_page() would
end up zapping the page anyways.

Alternatively, __kvm_sync_page() could be modified to explicitly verify
the mmu_role of the unsync shadow page is compatible with the current MMU
context.  But, except for this specific case, __kvm_sync_page() is called
iff the page is compatible, e.g. the transient sync in kvm_mmu_get_page()
requires an exact role match, and the call from kvm_sync_mmu_roots() is
only synchronizing shadow pages from the current MMU (which better be
compatible or KVM has problems).  And as described above, attempting to
sync shadow pages when creating an upper-level shadow page is unlikely
to succeed, e.g. zero successful syncs were observed when running Linux
guests despite over a million attempts.

Fixes: 9f1a122f970d ("KVM: MMU: allow more page become unsync at getting sp time")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 50 ++++++++++++++----------------------------
 1 file changed, 16 insertions(+), 34 deletions(-)

Comments

Paolo Bonzini June 23, 2021, 2:36 p.m. UTC | #1
On 22/06/21 19:56, Sean Christopherson wrote:
> When creating a new upper-level shadow page, zap unsync shadow pages at
> the same target gfn instead of attempting to sync the pages.  This fixes
> a bug where an unsync shadow page could be sync'd with an incompatible
> context, e.g. wrong smm, is_guest, etc... flags.  In practice, the bug is
> relatively benign as sync_page() is all but guaranteed to fail its check
> that the guest's desired gfn (for the to-be-sync'd page) matches the
> current gfn associated with the shadow page.  I.e. kvm_sync_page() would
> end up zapping the page anyways.
> 
> Alternatively, __kvm_sync_page() could be modified to explicitly verify
> the mmu_role of the unsync shadow page is compatible with the current MMU
> context.  But, except for this specific case, __kvm_sync_page() is called
> iff the page is compatible, e.g. the transient sync in kvm_mmu_get_page()
> requires an exact role match, and the call from kvm_sync_mmu_roots() is
> only synchronizing shadow pages from the current MMU (which better be
> compatible or KVM has problems).  And as described above, attempting to
> sync shadow pages when creating an upper-level shadow page is unlikely
> to succeed, e.g. zero successful syncs were observed when running Linux
> guests despite over a million attempts.

One issue, this WARN_ON may now trigger:

                         WARN_ON(!list_empty(&invalid_list));

due to a kvm_mmu_prepare_zap_page that could have happened on an earlier 
iteration of the for_each_valid_sp.  Before your change, __kvm_sync_page 
would be called always before kvm_sync_pages could add anything to 
invalid_list.

Paolo
Sean Christopherson June 23, 2021, 3:08 p.m. UTC | #2
On Wed, Jun 23, 2021, Paolo Bonzini wrote:
> On 22/06/21 19:56, Sean Christopherson wrote:
> > When creating a new upper-level shadow page, zap unsync shadow pages at
> > the same target gfn instead of attempting to sync the pages.  This fixes
> > a bug where an unsync shadow page could be sync'd with an incompatible
> > context, e.g. wrong smm, is_guest, etc... flags.  In practice, the bug is
> > relatively benign as sync_page() is all but guaranteed to fail its check
> > that the guest's desired gfn (for the to-be-sync'd page) matches the
> > current gfn associated with the shadow page.  I.e. kvm_sync_page() would
> > end up zapping the page anyways.
> > 
> > Alternatively, __kvm_sync_page() could be modified to explicitly verify
> > the mmu_role of the unsync shadow page is compatible with the current MMU
> > context.  But, except for this specific case, __kvm_sync_page() is called
> > iff the page is compatible, e.g. the transient sync in kvm_mmu_get_page()
> > requires an exact role match, and the call from kvm_sync_mmu_roots() is
> > only synchronizing shadow pages from the current MMU (which better be
> > compatible or KVM has problems).  And as described above, attempting to
> > sync shadow pages when creating an upper-level shadow page is unlikely
> > to succeed, e.g. zero successful syncs were observed when running Linux
> > guests despite over a million attempts.
> 
> One issue, this WARN_ON may now trigger:
> 
>                         WARN_ON(!list_empty(&invalid_list));
> 
> due to a kvm_mmu_prepare_zap_page that could have happened on an earlier
> iteration of the for_each_valid_sp.  Before your change, __kvm_sync_page
> would be called always before kvm_sync_pages could add anything to
> invalid_list.

Ah, I should have added a comment.  It took me a few minutes of staring to
remember why it can't fire.

The branch at (2), which adds to invalid_list, is taken if and only if the new
page is not a 4k page.

The branch at (3) is taken if and only if the existing page is a 4k page, because
only 4k pages can become unsync.

Because the shadow page's level is incorporated into its role, if the level of
the new page is >4k, the branch at (1) will be taken for all 4k shadow pages.

Maybe something like this for a comment?

			/*
			 * Assert that the page was not zapped if the "sync" was
			 * successful.  Note, this cannot collide with the above
			 * zapping of unsync pages, as this point is reached iff
			 * the new page is a 4k page (only 4k pages can become
			 * unsync and the role check ensures identical levels),
			 * and zapping occurs iff the new page is NOT a 4k page.
			 */
			WARN_ON(!list_empty(&invalid_list));




1)		if (sp->role.word != role.word) {
			/*
			 * If the guest is creating an upper-level page, zap
			 * unsync pages for the same gfn.  While it's possible
			 * the guest is using recursive page tables, in all
			 * likelihood the guest has stopped using the unsync
			 * page and is installing a completely unrelated page.
			 * Unsync pages must not be left as is, because the new
			 * upper-level page will be write-protected.
			 */
2)			if (level > PG_LEVEL_4K && sp->unsync)
				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
							 &invalid_list);
			continue;
		}

		if (direct_mmu)
			goto trace_get_page;

3)		if (sp->unsync) {
			/*
			 * The page is good, but is stale.  "Sync" the page to
			 * get the latest guest state, but don't write-protect
			 * the page and don't mark it synchronized!  KVM needs
			 * to ensure the mapping is valid, but doesn't need to
			 * fully sync (write-protect) the page until the guest
			 * invalidates the TLB mapping.  This allows multiple
			 * SPs for a single gfn to be unsync.
			 *
			 * If the sync fails, the page is zapped.  If so, break
			 * in order to rebuild it.
			 */
			if (!kvm_sync_page(vcpu, sp, &invalid_list))
				break;

			WARN_ON(!list_empty(&invalid_list));
			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
		}
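
As an aside, the role argument can be seen in a tiny standalone sketch; the
union below is a simplified stand-in, not the kernel's actual
kvm_mmu_page_role layout.  Because the level is packed into the role word, a
>4K page and a 4K page can never compare equal, so branch (1) is always taken
for 4K SPs:

#include <stdio.h>

/* Simplified stand-in for the role word; level is one of the packed fields. */
union toy_role {
	struct {
		unsigned level:4;
		unsigned smm:1;
		unsigned guest_mode:1;
	};
	unsigned word;
};

int main(void)
{
	union toy_role new_sp = { .word = 0 };
	union toy_role old_sp = { .word = 0 };

	new_sp.level = 2;	/* upper-level (>4K) page being created */
	old_sp.level = 1;	/* existing 4K page, otherwise identical role */

	/* Different levels guarantee different role words; prints "roles match? 0". */
	printf("roles match? %d\n", new_sp.word == old_sp.word);
	return 0;
}
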
Paolo Bonzini June 23, 2021, 4:38 p.m. UTC | #3
On 23/06/21 17:08, Sean Christopherson wrote:
> Because the shadow page's level is incorporated into its role, if the level of
> the new page is >4k, the branch at (1) will be taken for all 4k shadow pages.
> 
> Maybe something like this for a comment?

Good, integrated.

Though I also wonder why breaking out of the loop early is okay.  Initially I thought
that zapping only matters if there's no existing page with the desired role,
because otherwise the unsync page would have been zapped already by an earlier
kvm_mmu_get_page, but what if the page was synced at the time of kvm_mmu_get_page
and then both were unsynced?

It may be easier to just split the loop to avoid that additional confusion,
something like:

         /*
          * If the guest is creating an upper-level page, zap unsync pages
          * for the same gfn, because the gfn will be write protected and
          * future syncs of those unsync pages could happen with an incompatible
          * context.  While it's possible the guest is using recursive page
          * tables, in all likelihood the guest has stopped using the unsync
          * page and is installing a completely unrelated page.
          */
         if (level > PG_LEVEL_4K) {
                 for_each_valid_sp(vcpu->kvm, sp, sp_list)
                         if (sp->gfn == gfn && sp->role.word != role.word && sp->unsync)
                                 kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
                                                          &invalid_list);
         }

Paolo
Sean Christopherson June 23, 2021, 10:04 p.m. UTC | #4
On Wed, Jun 23, 2021, Paolo Bonzini wrote:
> On 23/06/21 17:08, Sean Christopherson wrote:
> > Because the shadow page's level is incorporated into its role, if the level of
> > the new page is >4k, the branch at (1) will be taken for all 4k shadow pages.
> > 
> > Maybe something like this for a comment?
> 
> Good, integrated.
> 
> Though I also wonder why breaking out of the loop early is okay.  Initially I thought
> that zapping only matters if there's no existing page with the desired role,
> because otherwise the unsync page would have been zapped already by an earlier
> kvm_mmu_get_page, but what if the page was synced at the time of kvm_mmu_get_page
> and then both were unsynced?

That can't happen, because the new >4k SP will mark the page for write-tracking
via account_shadowed(), and any attempt to unsync the page will fail and
write-protect the entry.

It would be possible to have both an unsync and a sync SP, e.g. unsync, then INVLPG
only one of the pages.  But as you pointed out, creating the first >4k SP would
be guaranteed to wipe out the unsync SP because no match should exist.
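
A toy model of that ordering, purely for illustration (the names below are
made up and are not KVM helpers): the gfn is write-tracked when the >4K SP is
created, so any later unsync attempt at that gfn is refused.

#include <stdbool.h>
#include <stdio.h>

#define NR_GFNS 16

/* Toy stand-in for KVM's per-gfn write-tracking counts. */
static bool gfn_write_tracked[NR_GFNS];

/* Creating a >4K shadow page write-tracks its gfn first. */
static void toy_create_upper_level_sp(unsigned gfn)
{
	gfn_write_tracked[gfn] = true;
}

/* A later attempt to unsync a 4K SP at a write-tracked gfn is refused. */
static bool toy_can_unsync(unsigned gfn)
{
	return !gfn_write_tracked[gfn];
}

int main(void)
{
	toy_create_upper_level_sp(5);
	printf("unsync of gfn 5 allowed? %d\n", toy_can_unsync(5));	/* prints 0 */
	return 0;
}
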

> It may be easier to just split the loop to avoid that additional confusion,
> something like:
> 
>         /*
>          * If the guest is creating an upper-level page, zap unsync pages
>          * for the same gfn, because the gfn will be write protected and
>          * future syncs of those unsync pages could happen with an incompatible
>          * context.

I don't think the part about "future syncs ... with an incompatible context" is
correct.  The unsync walks, i.e. the sync() flows, are done with the current root
and it should be impossible to reach a SP with an invalid context when walking
the child SPs.

I also can't find anything that would break if the SP were left unsync, i.e.
I haven't found any code that assumes a write-protected SP can't be unsync.
E.g. mmu_try_to_unsync_pages() will force write-protection due to write tracking
even if unsync is left true.  Maybe there was a rule/assumption at some point
that has since gone away?  That's why my comment hedged and just said "don't
do it" without explaining why :-)

All that said, I'm definitely not opposed to simplifying/clarifying the code and
ensuring all unsync SPs are zapped in this case.

>	   * While it's possible the guest is using recursive page
>          * tables, in all likelihood the guest has stopped using the unsync
>          * page and is installing a completely unrelated page.
>          */
>         if (level > PG_LEVEL_4K) {

I believe this can be "if (!direct && level > PG_LEVEL_4K)", because the direct
case won't write protect/track anything.

>                 for_each_valid_sp(vcpu->kvm, sp, sp_list)

This can technically be "for_each_gfn_indirect_valid_sp", though I'm not sure it
saves much, if anything.

>                         if (sp->gfn == gfn && sp->role.word != role.word && sp->unsync)
>                                 kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
>                                                          &invalid_list);
>         }
> 
> Paolo
>
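
Putting those two suggestions on top of the proposed split loop gives a rough
sketch like the following (not necessarily what was ultimately committed); the
explicit sp->gfn == gfn check folds into the iterator:

         /* Zap unsync pages at this gfn before it becomes write-protected. */
         if (!direct && level > PG_LEVEL_4K) {
                 for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn)
                         if (sp->role.word != role.word && sp->unsync)
                                 kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
                                                          &invalid_list);
         }
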
Yu Zhang June 25, 2021, 9:51 a.m. UTC | #5
While reading the sync pages code, I just realized that patch
https://lkml.org/lkml/2021/2/9/212 has not been merged upstream
(though it is irrelevant to this one). May I ask the reason? Thanks!

B.R.
Yu
Paolo Bonzini June 25, 2021, 10:26 a.m. UTC | #6
On 25/06/21 11:51, Yu Zhang wrote:
> While reading the sync pages code, I just realized that patch
> https://lkml.org/lkml/2021/2/9/212 has not been merged upstream
> (though it is irrelevant to this one). May I ask the reason? Thanks!

I hadn't noticed it, thanks for reminding me.

Paolo
Yu Zhang June 25, 2021, 1:08 p.m. UTC | #7
On Fri, Jun 25, 2021 at 12:26:10PM +0200, Paolo Bonzini wrote:
> On 25/06/21 11:51, Yu Zhang wrote:
> > While reading the sync pages code, I just realized that patch
> > https://lkml.org/lkml/2021/2/9/212 has not been merged upstream
> > (though it is irrelevant to this one). May I ask the reason? Thanks!
> 
> I hadn't noticed it, thanks for reminding me.

It's just a cleanup patch, and I forgot it too. :) Thanks!

B.R.
Yu

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 04cab330c445..99d26859021d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1843,24 +1843,6 @@  static bool kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	return __kvm_sync_page(vcpu, sp, invalid_list);
 }
 
-/* @gfn should be write-protected at the call site */
-static bool kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
-			   struct list_head *invalid_list)
-{
-	struct kvm_mmu_page *s;
-	bool ret = false;
-
-	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
-		if (!s->unsync)
-			continue;
-
-		WARN_ON(s->role.level != PG_LEVEL_4K);
-		ret |= kvm_sync_page(vcpu, s, invalid_list);
-	}
-
-	return ret;
-}
-
 struct mmu_page_path {
 	struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
 	unsigned int idx[PT64_ROOT_MAX_LEVEL];
@@ -1990,8 +1972,6 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	struct hlist_head *sp_list;
 	unsigned quadrant;
 	struct kvm_mmu_page *sp;
-	bool need_sync = false;
-	bool flush = false;
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
@@ -2014,11 +1994,21 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			continue;
 		}
 
-		if (!need_sync && sp->unsync)
-			need_sync = true;
-
-		if (sp->role.word != role.word)
+		if (sp->role.word != role.word) {
+			/*
+			 * If the guest is creating an upper-level page, zap
+			 * unsync pages for the same gfn.  While it's possible
+			 * the guest is using recursive page tables, in all
+			 * likelihood the guest has stopped using the unsync
+			 * page and is installing a completely unrelated page.
+			 * Unsync pages must not be left as is, because the new
+			 * upper-level page will be write-protected.
+			 */
+			if (level > PG_LEVEL_4K && sp->unsync)
+				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
+							 &invalid_list);
 			continue;
+		}
 
 		if (direct_mmu)
 			goto trace_get_page;
@@ -2052,22 +2042,14 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
 	if (!direct) {
-		/*
-		 * we should do write protection before syncing pages
-		 * otherwise the content of the synced shadow page may
-		 * be inconsistent with guest page table.
-		 */
 		account_shadowed(vcpu->kvm, sp);
 		if (level == PG_LEVEL_4K && rmap_write_protect(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
-
-		if (level > PG_LEVEL_4K && need_sync)
-			flush |= kvm_sync_pages(vcpu, gfn, &invalid_list);
 	}
 	trace_kvm_mmu_get_page(sp, true);
-
-	kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
 out:
+	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+
 	if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
 		vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
 	return sp;