mbox series

[v2,0/6] KVM: x86: Apply NX mitigation more precisely

Message ID 20220723012325.1715714-1-seanjc@google.com (mailing list archive)
Headers show
Series KVM: x86: Apply NX mitigation more precisely | expand

Message

Sean Christopherson July 23, 2022, 1:23 a.m. UTC
Patch 6 from Mingwei is the end goal of the series.  KVM incorrectly
assumes that the NX huge page mitigation is the only scenario where KVM
will create a non-leaf page instead of a huge page.   Precisely track
(via kvm_mmu_page) if a non-huge page is being forced and use that info
to avoid unnecessarily forcing smaller page sizes in
disallowed_hugepage_adjust().

v2: Rebase, tweak a changelog accordingly.

v1: https://lore.kernel.org/all/20220409003847.819686-1-seanjc@google.com

Mingwei Zhang (1):
  KVM: x86/mmu: explicitly check nx_hugepage in
    disallowed_hugepage_adjust()

Sean Christopherson (5):
  KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
  KVM: x86/mmu: Properly account NX huge page workaround for nonpaging
    MMUs
  KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting
    SPTE
  KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
    pages
  KVM: x86/mmu: Add helper to convert SPTE value to its shadow page

 arch/x86/include/asm/kvm_host.h |  17 ++---
 arch/x86/kvm/mmu/mmu.c          | 107 ++++++++++++++++++++++----------
 arch/x86/kvm/mmu/mmu_internal.h |  41 +++++++-----
 arch/x86/kvm/mmu/paging_tmpl.h  |   6 +-
 arch/x86/kvm/mmu/spte.c         |  11 ++++
 arch/x86/kvm/mmu/spte.h         |  17 +++++
 arch/x86/kvm/mmu/tdp_mmu.c      |  49 +++++++++------
 arch/x86/kvm/mmu/tdp_mmu.h      |   2 +
 8 files changed, 167 insertions(+), 83 deletions(-)


base-commit: 1a4d88a361af4f2e91861d632c6a1fe87a9665c2

Comments

Mingwei Zhang July 26, 2022, 5:37 a.m. UTC | #1
On Sat, Jul 23, 2022, Sean Christopherson wrote:
> Patch 6 from Mingwei is the end goal of the series.  KVM incorrectly
> assumes that the NX huge page mitigation is the only scenario where KVM
> will create a non-leaf page instead of a huge page.   Precisely track
> (via kvm_mmu_page) if a non-huge page is being forced and use that info
> to avoid unnecessarily forcing smaller page sizes in
> disallowed_hugepage_adjust().
> 
> v2: Rebase, tweak a changelog accordingly.

hmm, I applied this patch set (v2) on top of kvm/queue (HEAD:
1a4d88a361af) and it seems kvm-unit-tests/vmx failed on both ept=1 and
ept=0. And it did not work on our internel kernel either (kernel
crashed).

Maybe there is still minor issues?

> 
> v1: https://lore.kernel.org/all/20220409003847.819686-1-seanjc@google.com
> 
> Mingwei Zhang (1):
>   KVM: x86/mmu: explicitly check nx_hugepage in
>     disallowed_hugepage_adjust()
> 
> Sean Christopherson (5):
>   KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
>   KVM: x86/mmu: Properly account NX huge page workaround for nonpaging
>     MMUs
>   KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting
>     SPTE
>   KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
>     pages
>   KVM: x86/mmu: Add helper to convert SPTE value to its shadow page
> 
>  arch/x86/include/asm/kvm_host.h |  17 ++---
>  arch/x86/kvm/mmu/mmu.c          | 107 ++++++++++++++++++++++----------
>  arch/x86/kvm/mmu/mmu_internal.h |  41 +++++++-----
>  arch/x86/kvm/mmu/paging_tmpl.h  |   6 +-
>  arch/x86/kvm/mmu/spte.c         |  11 ++++
>  arch/x86/kvm/mmu/spte.h         |  17 +++++
>  arch/x86/kvm/mmu/tdp_mmu.c      |  49 +++++++++------
>  arch/x86/kvm/mmu/tdp_mmu.h      |   2 +
>  8 files changed, 167 insertions(+), 83 deletions(-)
> 
> 
> base-commit: 1a4d88a361af4f2e91861d632c6a1fe87a9665c2
> -- 
> 2.37.1.359.gd136c6c3e2-goog
>
Sean Christopherson July 26, 2022, 4:40 p.m. UTC | #2
On Tue, Jul 26, 2022, Mingwei Zhang wrote:
> On Sat, Jul 23, 2022, Sean Christopherson wrote:
> > Patch 6 from Mingwei is the end goal of the series.  KVM incorrectly
> > assumes that the NX huge page mitigation is the only scenario where KVM
> > will create a non-leaf page instead of a huge page.   Precisely track
> > (via kvm_mmu_page) if a non-huge page is being forced and use that info
> > to avoid unnecessarily forcing smaller page sizes in
> > disallowed_hugepage_adjust().
> > 
> > v2: Rebase, tweak a changelog accordingly.
> 
> hmm, I applied this patch set (v2) on top of kvm/queue (HEAD:
> 1a4d88a361af) and it seems kvm-unit-tests/vmx failed on both ept=1 and
> ept=0. And it did not work on our internel kernel either (kernel
> crashed).
> 
> Maybe there is still minor issues?

Heh, or not so minor issues.  I'll see what I broke.  I have a bad feeling that
it's the EPT tests; IIRC I only ran VMX on a platform with MAXPHYADDR < 40.
Sean Christopherson July 26, 2022, 5:21 p.m. UTC | #3
On Tue, Jul 26, 2022, Sean Christopherson wrote:
> On Tue, Jul 26, 2022, Mingwei Zhang wrote:
> > On Sat, Jul 23, 2022, Sean Christopherson wrote:
> > > Patch 6 from Mingwei is the end goal of the series.  KVM incorrectly
> > > assumes that the NX huge page mitigation is the only scenario where KVM
> > > will create a non-leaf page instead of a huge page.   Precisely track
> > > (via kvm_mmu_page) if a non-huge page is being forced and use that info
> > > to avoid unnecessarily forcing smaller page sizes in
> > > disallowed_hugepage_adjust().
> > > 
> > > v2: Rebase, tweak a changelog accordingly.
> > 
> > hmm, I applied this patch set (v2) on top of kvm/queue (HEAD:
> > 1a4d88a361af) and it seems kvm-unit-tests/vmx failed on both ept=1 and
> > ept=0. And it did not work on our internel kernel either (kernel
> > crashed).
> > 
> > Maybe there is still minor issues?
> 
> Heh, or not so minor issues.  I'll see what I broke.  I have a bad feeling that
> it's the EPT tests; IIRC I only ran VMX on a platform with MAXPHYADDR < 40.

Hrm, not seeing failures (beyond the VMX_VMCS_ENUM.MAX_INDEX failure because I'm
running an older QEMU).

I'll follow up off-list to figure out what's going on.
Paolo Bonzini July 28, 2022, 8:17 p.m. UTC | #4
On 7/23/22 03:23, Sean Christopherson wrote:
> Patch 6 from Mingwei is the end goal of the series.  KVM incorrectly
> assumes that the NX huge page mitigation is the only scenario where KVM
> will create a non-leaf page instead of a huge page.   Precisely track
> (via kvm_mmu_page) if a non-huge page is being forced and use that info
> to avoid unnecessarily forcing smaller page sizes in
> disallowed_hugepage_adjust().
> 
> v2: Rebase, tweak a changelog accordingly.
> 
> v1:https://lore.kernel.org/all/20220409003847.819686-1-seanjc@google.com
> 
> Mingwei Zhang (1):
>    KVM: x86/mmu: explicitly check nx_hugepage in
>      disallowed_hugepage_adjust()
> 
> Sean Christopherson (5):
>    KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
>    KVM: x86/mmu: Properly account NX huge page workaround for nonpaging
>      MMUs
>    KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting
>      SPTE
>    KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
>      pages
>    KVM: x86/mmu: Add helper to convert SPTE value to its shadow page

Some of the benefits are cool, such as not having to track the pages for 
the TDP MMU, and patch 2 is a borderline bugfix, but there's quite a lot 
of new non-obvious complexity here.

So the obligatory question is: is it worth a hundred lines of new code?

Paolo
Sean Christopherson July 28, 2022, 9:20 p.m. UTC | #5
On Thu, Jul 28, 2022, Paolo Bonzini wrote:
> On 7/23/22 03:23, Sean Christopherson wrote:
> > Patch 6 from Mingwei is the end goal of the series.  KVM incorrectly
> > assumes that the NX huge page mitigation is the only scenario where KVM
> > will create a non-leaf page instead of a huge page.   Precisely track
> > (via kvm_mmu_page) if a non-huge page is being forced and use that info
> > to avoid unnecessarily forcing smaller page sizes in
> > disallowed_hugepage_adjust().
> > 
> > v2: Rebase, tweak a changelog accordingly.
> > 
> > v1:https://lore.kernel.org/all/20220409003847.819686-1-seanjc@google.com
> > 
> > Mingwei Zhang (1):
> >    KVM: x86/mmu: explicitly check nx_hugepage in
> >      disallowed_hugepage_adjust()
> > 
> > Sean Christopherson (5):
> >    KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
> >    KVM: x86/mmu: Properly account NX huge page workaround for nonpaging
> >      MMUs
> >    KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting
> >      SPTE
> >    KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
> >      pages
> >    KVM: x86/mmu: Add helper to convert SPTE value to its shadow page
> 
> Some of the benefits are cool, such as not having to track the pages for the
> TDP MMU, and patch 2 is a borderline bugfix, but there's quite a lot of new
> non-obvious complexity here.

100% agree on the complexity.

> So the obligatory question is: is it worth a hundred lines of new code?

Assuming I understanding the bug Mingwei's patch fixes, yes.  Though after
re-reading that changelog, it should more explicitly call out the scenario we
actually care about.

Anyways, the bug we really care about is that by not precisely checking if a
huge page is disallowed, KVM would refuse to create huge page after disabling
dirty logging, which is a very noticeable performance issue for large VMs if
a migration is canceled.  That particular bug has since been unintentionally
fixed in the TDP MMU by zapping the non-leaf SPTE, but there are other paths
that could similarly be affected, e.g. I believe zapping leaf SPTEs in response
to a host page migration (mmu_notifier invalidation) to create a huge page would
yield a similar result; KVM would see the shadow-present non-leaf SPTE and assume
a huge page is disallowed.
Mingwei Zhang July 28, 2022, 9:41 p.m. UTC | #6
On Thu, Jul 28, 2022 at 2:20 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jul 28, 2022, Paolo Bonzini wrote:
> > On 7/23/22 03:23, Sean Christopherson wrote:
> > > Patch 6 from Mingwei is the end goal of the series.  KVM incorrectly
> > > assumes that the NX huge page mitigation is the only scenario where KVM
> > > will create a non-leaf page instead of a huge page.   Precisely track
> > > (via kvm_mmu_page) if a non-huge page is being forced and use that info
> > > to avoid unnecessarily forcing smaller page sizes in
> > > disallowed_hugepage_adjust().
> > >
> > > v2: Rebase, tweak a changelog accordingly.
> > >
> > > v1:https://lore.kernel.org/all/20220409003847.819686-1-seanjc@google.com
> > >
> > > Mingwei Zhang (1):
> > >    KVM: x86/mmu: explicitly check nx_hugepage in
> > >      disallowed_hugepage_adjust()
> > >
> > > Sean Christopherson (5):
> > >    KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
> > >    KVM: x86/mmu: Properly account NX huge page workaround for nonpaging
> > >      MMUs
> > >    KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting
> > >      SPTE
> > >    KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
> > >      pages
> > >    KVM: x86/mmu: Add helper to convert SPTE value to its shadow page
> >
> > Some of the benefits are cool, such as not having to track the pages for the
> > TDP MMU, and patch 2 is a borderline bugfix, but there's quite a lot of new
> > non-obvious complexity here.
>
> 100% agree on the complexity.
>
> > So the obligatory question is: is it worth a hundred lines of new code?
>
> Assuming I understanding the bug Mingwei's patch fixes, yes.  Though after
> re-reading that changelog, it should more explicitly call out the scenario we
> actually care about.
>
> Anyways, the bug we really care about is that by not precisely checking if a
> huge page is disallowed, KVM would refuse to create huge page after disabling
> dirty logging, which is a very noticeable performance issue for large VMs if
> a migration is canceled.  That particular bug has since been unintentionally
> fixed in the TDP MMU by zapping the non-leaf SPTE, but there are other paths
> that could similarly be affected, e.g. I believe zapping leaf SPTEs in response
> to a host page migration (mmu_notifier invalidation) to create a huge page would
> yield a similar result; KVM would see the shadow-present non-leaf SPTE and assume
> a huge page is disallowed.

Just a quick update: the kernel crash has been resolved. It turns out
to be a bug introduced when I rebase the patch.

I see the patch set is working now.
Paolo Bonzini July 28, 2022, 10:09 p.m. UTC | #7
On 7/28/22 23:20, Sean Christopherson wrote:
> 
> Anyways, the bug we really care about is that by not precisely checking if a
> huge page is disallowed, KVM would refuse to create huge page after disabling
> dirty logging, which is a very noticeable performance issue for large VMs if
> a migration is canceled.  That particular bug has since been unintentionally
> fixed in the TDP MMU by zapping the non-leaf SPTE, but there are other paths
> that could similarly be affected, e.g. I believe zapping leaf SPTEs in response
> to a host page migration (mmu_notifier invalidation) to create a huge page would
> yield a similar result; KVM would see the shadow-present non-leaf SPTE and assume
> a huge page is disallowed.

Ok, thanks.  So this will be 5.21 material even during the -rc phase; I 
have posted a couple comments for patch 1 and 2.

One way to simplify the rmb/wmb logic could be to place the rmb/wmb 
respectively after loading iter.old_spte and in tdp_mmu_link_sp.  If you 
like it, feel free to integrate it in v3.

Paolo
Sean Christopherson July 28, 2022, 10:15 p.m. UTC | #8
On Fri, Jul 29, 2022, Paolo Bonzini wrote:
> On 7/28/22 23:20, Sean Christopherson wrote:
> > 
> > Anyways, the bug we really care about is that by not precisely checking if a
> > huge page is disallowed, KVM would refuse to create huge page after disabling
> > dirty logging, which is a very noticeable performance issue for large VMs if
> > a migration is canceled.  That particular bug has since been unintentionally
> > fixed in the TDP MMU by zapping the non-leaf SPTE, but there are other paths
> > that could similarly be affected, e.g. I believe zapping leaf SPTEs in response
> > to a host page migration (mmu_notifier invalidation) to create a huge page would
> > yield a similar result; KVM would see the shadow-present non-leaf SPTE and assume
> > a huge page is disallowed.
> 
> Ok, thanks.  So this will be 5.21 material even during the -rc phase

Yes, this definitely needs more time in the queue before being sent to Linus.