diff mbox series

[2/2] KVM: x86/mmu: Improve comment about TLB flush semantics for write-protection

Message ID 20220112215801.3502286-3-dmatlack@google.com (mailing list archive)
State New, archived
Headers show
Series KVM: x86/mmu: Fix write-protection bug in the TDP MMU | expand

Commit Message

David Matlack Jan. 12, 2022, 9:58 p.m. UTC
Rewrite the comment in kvm_mmu_slot_remove_write_access() that explains
why it is safe to flush TLBs outside of the MMU lock after
write-protecting SPTEs for dirty logging. The current comment is a long
run-on sentance that was difficult to undertsand. In addition it was
specific to the shadow MMU (mentioning mmu_spte_update()) when the TDP
MMU has to handle this as well.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

Comments

Sean Christopherson Jan. 13, 2022, 12:46 a.m. UTC | #1
On Wed, Jan 12, 2022, David Matlack wrote:
> Rewrite the comment in kvm_mmu_slot_remove_write_access() that explains
> why it is safe to flush TLBs outside of the MMU lock after
> write-protecting SPTEs for dirty logging. The current comment is a long
> run-on sentance that was difficult to undertsand. In addition it was
> specific to the shadow MMU (mentioning mmu_spte_update()) when the TDP
> MMU has to handle this as well.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 29 ++++++++++++++++++++---------
>  1 file changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1d275e9d76b5..33f550b3be8f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5825,15 +5825,26 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  	}
>  
>  	/*
> -	 * We can flush all the TLBs out of the mmu lock without TLB
> -	 * corruption since we just change the spte from writable to
> -	 * readonly so that we only need to care the case of changing
> -	 * spte from present to present (changing the spte from present
> -	 * to nonpresent will flush all the TLBs immediately), in other
> -	 * words, the only case we care is mmu_spte_update() where we
> -	 * have checked Host-writable | MMU-writable instead of
> -	 * PT_WRITABLE_MASK, that means it does not depend on PT_WRITABLE_MASK
> -	 * anymore.
> +	 * It is safe to flush TLBs outside of the MMU lock since SPTEs are only
> +	 * being changed from writable to read-only (i.e. the mapping to host
> +	 * PFNs is not changing). 

Hmm, you mostly address things in the next sentence/paragraph, but it's more than
the SPTE being downgraded from writable => read-only, e.g. if the SPTE were being
made read-only due to userspace removing permissions, then KVM would need to flush
before dropping mmu_lock.  The qualifier about the PFN not changing actually does
more harm than good because it further suggests that writable => read-only is
somehow inherently safe.  

> +      * All we care about is that CPUs start using the
> +	 * read-only mappings from this point forward to ensure the dirty bitmap
> +	 * gets updated, but that does not need to run under the MMU lock.

"this point forward" isn't technically true, the requirement is that the flush
occurs before the memslot update completes.  Definitely splitting hairs, I mean,
this basically is the end of the memslot update, but it's an opportunity to
clarify _why_ the flush needs to happen at this point.

> +	 *
> +	 * Note that there are other reasons why SPTEs can be write-protected
> +	 * besides dirty logging: (1) to intercept guest page table
> +	 * modifications when doing shadow paging and (2) to protecting guest
> +	 * memory that is not host-writable.

So, technically, #2 is not possible.  KVM doesn't allow a memslot to be converted
from writable => read-only, userspace must first delete the entire memslot.  That
means the relevant SPTEs never transition directly from writable to !writable,
they are instead zapped entirely and "new" SPTEs are created that are read-only
from their genesis.

Making a VMA read-only also results in SPTEs being zapped and recreated, though
this is an area for improvement.  We could cover future changes in this area by
being a bit fuzzy in the wording, but I think it would be better to talk only
about the shadow paging case and thus only about MMU-writable, because Host-writable
is (currently) immutable and making it mutable (in the mmu_notifier path) will
have additional "rule" changes.

> + 	 * Both of these usecases require
> +	 * flushing the TLB under the MMU lock to ensure CPUs are not running
> +	 * with writable SPTEs in their TLB. The tricky part is knowing when it
> +	 * is safe to skip a TLB flush if an SPTE is already write-protected,
> +	 * since it could have been write-protected for dirty-logging which does
> +	 * not flush under the lock.

It's a bit unclear that the last "skip a TLB flush" snippet is referring to a
future TLB flush, not this TLB flush.

> +	 *
> +	 * To handle this each SPTE has an MMU-writable bit and a Host-writable
> +	 * bit (KVM-specific bits that are not used by hardware). These bits
> +	 * allow KVM to deduce *why* a given SPTE is currently write-protected,
> +	 * so that it knows when it needs to flush TLBs under the MMU lock.

I much rather we add this type of comment over the definitions for
DEFAULT_SPTE_{HOST,MMU}_WRITEABLE and then provide a reference to those definitions
after a very brief "KVM uses the MMU-writable bit".

So something like this?  Plus more commentry in spte.h.

	/*
	 * It's safe to flush TLBs after dropping mmu_lock as making a writable
	 * SPTE read-only for dirty logging only needs to ensure KVM starts
	 * logging writes to the memslot before the memslot update completes,
	 * i.e. before the enabling of dirty logging is visible to userspace.
	 *
	 * Note, KVM also write-protects SPTEs when shadowing guest page tables,
	 * in which case a TLB flush is needed before dropping mmu_lock().  To
	 * ensure a future TLB flush isn't missed, KVM uses a software-available
	 * bit to track if a SPTE is MMU-Writable, i.e. is considered writable
	 * for shadow paging purposes.  When write-protecting for shadow paging,
	 * KVM clears both WRITABLE and MMU-Writable, and performs a TLB flush
	 * while holding mmu_lock if either bit is cleared.
	 *
	 * See DEFAULT_SPTE_{HOST,MMU}_WRITEABLE for more details.
	 */

>  	 */
>  	if (flush)
>  		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> -- 
> 2.34.1.703.g22d0c6ccf7-goog
>
David Matlack Jan. 13, 2022, 5:10 p.m. UTC | #2
On Wed, Jan 12, 2022 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Jan 12, 2022, David Matlack wrote:
> > Rewrite the comment in kvm_mmu_slot_remove_write_access() that explains
> > why it is safe to flush TLBs outside of the MMU lock after
> > write-protecting SPTEs for dirty logging. The current comment is a long
> > run-on sentance that was difficult to undertsand. In addition it was
> > specific to the shadow MMU (mentioning mmu_spte_update()) when the TDP
> > MMU has to handle this as well.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 29 ++++++++++++++++++++---------
> >  1 file changed, 20 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 1d275e9d76b5..33f550b3be8f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5825,15 +5825,26 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >       }
> >
> >       /*
> > -      * We can flush all the TLBs out of the mmu lock without TLB
> > -      * corruption since we just change the spte from writable to
> > -      * readonly so that we only need to care the case of changing
> > -      * spte from present to present (changing the spte from present
> > -      * to nonpresent will flush all the TLBs immediately), in other
> > -      * words, the only case we care is mmu_spte_update() where we
> > -      * have checked Host-writable | MMU-writable instead of
> > -      * PT_WRITABLE_MASK, that means it does not depend on PT_WRITABLE_MASK
> > -      * anymore.
> > +      * It is safe to flush TLBs outside of the MMU lock since SPTEs are only
> > +      * being changed from writable to read-only (i.e. the mapping to host
> > +      * PFNs is not changing).
>
> Hmm, you mostly address things in the next sentence/paragraph, but it's more than
> the SPTE being downgraded from writable => read-only, e.g. if the SPTE were being
> made read-only due to userspace removing permissions, then KVM would need to flush
> before dropping mmu_lock.  The qualifier about the PFN not changing actually does
> more harm than good because it further suggests that writable => read-only is
> somehow inherently safe.
>
> > +      * All we care about is that CPUs start using the
> > +      * read-only mappings from this point forward to ensure the dirty bitmap
> > +      * gets updated, but that does not need to run under the MMU lock.
>
> "this point forward" isn't technically true, the requirement is that the flush
> occurs before the memslot update completes.  Definitely splitting hairs, I mean,
> this basically is the end of the memslot update, but it's an opportunity to
> clarify _why_ the flush needs to happen at this point.
>
> > +      *
> > +      * Note that there are other reasons why SPTEs can be write-protected
> > +      * besides dirty logging: (1) to intercept guest page table
> > +      * modifications when doing shadow paging and (2) to protecting guest
> > +      * memory that is not host-writable.
>
> So, technically, #2 is not possible.  KVM doesn't allow a memslot to be converted
> from writable => read-only, userspace must first delete the entire memslot.  That
> means the relevant SPTEs never transition directly from writable to !writable,
> they are instead zapped entirely and "new" SPTEs are created that are read-only
> from their genesis.
>
> Making a VMA read-only also results in SPTEs being zapped and recreated, though
> this is an area for improvement.  We could cover future changes in this area by
> being a bit fuzzy in the wording, but I think it would be better to talk only
> about the shadow paging case and thus only about MMU-writable, because Host-writable
> is (currently) immutable and making it mutable (in the mmu_notifier path) will
> have additional "rule" changes.
>
> > +      * Both of these usecases require
> > +      * flushing the TLB under the MMU lock to ensure CPUs are not running
> > +      * with writable SPTEs in their TLB. The tricky part is knowing when it
> > +      * is safe to skip a TLB flush if an SPTE is already write-protected,
> > +      * since it could have been write-protected for dirty-logging which does
> > +      * not flush under the lock.
>
> It's a bit unclear that the last "skip a TLB flush" snippet is referring to a
> future TLB flush, not this TLB flush.
>
> > +      *
> > +      * To handle this each SPTE has an MMU-writable bit and a Host-writable
> > +      * bit (KVM-specific bits that are not used by hardware). These bits
> > +      * allow KVM to deduce *why* a given SPTE is currently write-protected,
> > +      * so that it knows when it needs to flush TLBs under the MMU lock.
>
> I much rather we add this type of comment over the definitions for
> DEFAULT_SPTE_{HOST,MMU}_WRITEABLE and then provide a reference to those definitions
> after a very brief "KVM uses the MMU-writable bit".
>
> So something like this?  Plus more commentry in spte.h.
>
>         /*
>          * It's safe to flush TLBs after dropping mmu_lock as making a writable
>          * SPTE read-only for dirty logging only needs to ensure KVM starts
>          * logging writes to the memslot before the memslot update completes,
>          * i.e. before the enabling of dirty logging is visible to userspace.
>          *
>          * Note, KVM also write-protects SPTEs when shadowing guest page tables,
>          * in which case a TLB flush is needed before dropping mmu_lock().  To
>          * ensure a future TLB flush isn't missed, KVM uses a software-available
>          * bit to track if a SPTE is MMU-Writable, i.e. is considered writable
>          * for shadow paging purposes.  When write-protecting for shadow paging,
>          * KVM clears both WRITABLE and MMU-Writable, and performs a TLB flush
>          * while holding mmu_lock if either bit is cleared.
>          *
>          * See DEFAULT_SPTE_{HOST,MMU}_WRITEABLE for more details.
>          */

Makes sense. I'll rework the comment per your feedback and also
document the {host,mmu}-writable bits. Although I think it'd make more
sense to put those comments on shadow_{host,mmu}_writable_mask as
those are the symbols used throughout the code and EPT uses different
bits than DEFAULT_..._WRITABLE.

>
> >        */
> >       if (flush)
> >               kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> > --
> > 2.34.1.703.g22d0c6ccf7-goog
> >
Sean Christopherson Jan. 13, 2022, 10:40 p.m. UTC | #3
On Thu, Jan 13, 2022, David Matlack wrote:
> On Wed, Jan 12, 2022 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
> > So something like this?  Plus more commentry in spte.h.
> >
> >         /*
> >          * It's safe to flush TLBs after dropping mmu_lock as making a writable
> >          * SPTE read-only for dirty logging only needs to ensure KVM starts
> >          * logging writes to the memslot before the memslot update completes,
> >          * i.e. before the enabling of dirty logging is visible to userspace.
> >          *
> >          * Note, KVM also write-protects SPTEs when shadowing guest page tables,
> >          * in which case a TLB flush is needed before dropping mmu_lock().  To
> >          * ensure a future TLB flush isn't missed, KVM uses a software-available
> >          * bit to track if a SPTE is MMU-Writable, i.e. is considered writable
> >          * for shadow paging purposes.  When write-protecting for shadow paging,
> >          * KVM clears both WRITABLE and MMU-Writable, and performs a TLB flush
> >          * while holding mmu_lock if either bit is cleared.
> >          *
> >          * See DEFAULT_SPTE_{HOST,MMU}_WRITEABLE for more details.
> >          */
> 
> Makes sense. I'll rework the comment per your feedback and also
> document the {host,mmu}-writable bits. Although I think it'd make more
> sense to put those comments on shadow_{host,mmu}_writable_mask as
> those are the symbols used throughout the code and EPT uses different
> bits than DEFAULT_..._WRITABLE.

I don't necessarily disagree, but all of the existing comments for SPTE bits are
in spte.h, even though the dynamic masks that are actually used in code are defined
elsewhere.  I'd prefer to keep all the "documentation" somewhat centralized, and it
shouldn't be too onerous to get from shadow_*_mask to DEFAULT_*_WRITABLE.
diff mbox series

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1d275e9d76b5..33f550b3be8f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5825,15 +5825,26 @@  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 	}
 
 	/*
-	 * We can flush all the TLBs out of the mmu lock without TLB
-	 * corruption since we just change the spte from writable to
-	 * readonly so that we only need to care the case of changing
-	 * spte from present to present (changing the spte from present
-	 * to nonpresent will flush all the TLBs immediately), in other
-	 * words, the only case we care is mmu_spte_update() where we
-	 * have checked Host-writable | MMU-writable instead of
-	 * PT_WRITABLE_MASK, that means it does not depend on PT_WRITABLE_MASK
-	 * anymore.
+	 * It is safe to flush TLBs outside of the MMU lock since SPTEs are only
+	 * being changed from writable to read-only (i.e. the mapping to host
+	 * PFNs is not changing). All we care about is that CPUs start using the
+	 * read-only mappings from this point forward to ensure the dirty bitmap
+	 * gets updated, but that does not need to run under the MMU lock.
+	 *
+	 * Note that there are other reasons why SPTEs can be write-protected
+	 * besides dirty logging: (1) to intercept guest page table
+	 * modifications when doing shadow paging and (2) to protecting guest
+	 * memory that is not host-writable. Both of these usecases require
+	 * flushing the TLB under the MMU lock to ensure CPUs are not running
+	 * with writable SPTEs in their TLB. The tricky part is knowing when it
+	 * is safe to skip a TLB flush if an SPTE is already write-protected,
+	 * since it could have been write-protected for dirty-logging which does
+	 * not flush under the lock.
+	 *
+	 * To handle this each SPTE has an MMU-writable bit and a Host-writable
+	 * bit (KVM-specific bits that are not used by hardware). These bits
+	 * allow KVM to deduce *why* a given SPTE is currently write-protected,
+	 * so that it knows when it needs to flush TLBs under the MMU lock.
 	 */
 	if (flush)
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);