Message ID: 20221208193857.4090582-1-dmatlack@google.com (mailing list archive)
Series: KVM: Refactor the KVM/x86 TDP MMU into common code
On Thu, Dec 08, 2022 at 11:38:20AM -0800, David Matlack wrote:
> [ mm folks: You are being cc'd since this series includes a mm patch
>   ("mm: Introduce architecture-neutral PG_LEVEL macros"), but general
>   feedback is also welcome. I imagine there are a lot of lessons KVM can
>   learn from mm about sharing page table code across architectures. ]
>
> Hello,
>
> This series refactors the KVM/x86 "TDP MMU" into common code. This is
> the first step toward sharing TDP (aka Stage-2) page table management
> code across architectures that support KVM. For more background on this
> effort please see my talk from KVM Forum 2022, "Exploring an
> architecture-neutral MMU":
>
> https://youtu.be/IBhW34fCFi0
>
> By the end of this series, 90% of the TDP MMU code is in common
> directories (virt/kvm/mmu/ and include/kvm/). The only pieces that
> remain in arch/x86 are code that deals with
> constructing/inspecting/modifying PTEs and arch hooks to implement NX
> Huge Pages (a mitigation for an Intel-specific vulnerability).
>
> Before:
>
>  180 arch/x86/kvm/mmu/tdp_iter.c
>  118 arch/x86/kvm/mmu/tdp_iter.h
> 1917 arch/x86/kvm/mmu/tdp_mmu.c
>   98 arch/x86/kvm/mmu/tdp_mmu.h
> ----
> 2313 total
>
> After:
>
>  178 virt/kvm/mmu/tdp_iter.c
> 1867 virt/kvm/mmu/tdp_mmu.c
>  117 include/kvm/tdp_iter.h
>   78 include/kvm/tdp_mmu.h
>   39 include/kvm/tdp_pgtable.h
> ----
>  184 arch/x86/kvm/mmu/tdp_pgtable.c
>   76 arch/x86/include/asm/kvm/tdp_pgtable.h
> ----
> 2539 total
>
> This series is very much an RFC, but it does build (I tested x86_64 and
> ARM64) and pass basic testing (KVM selftests and kvm-unit-tests on
> x86_64), so it is entirely functional aside from any bugs.
>
> The main areas where I would like feedback are:
>
> - NX Huge Pages support in the TDP MMU requires 5 arch hooks in
>   the common code, which IMO makes the NX Huge Pages implementation
>   harder to read. The alternative is to move the NX Huge Pages
>   implementation into common code, including the fields in struct
>   kvm_mmu_page and kvm_page_fault, which would increase memory usage
>   a tiny bit (for non-x86 architectures) and pollute the common code
>   with an x86-specific security mitigation. Ideas on better ways to
>   handle this would be appreciated.
>
> - struct kvm_mmu_page increased by 64 bytes because the separation of
>   arch and common state eliminated the ability to use unions to
>   optimize the size of the struct. There are two things we can do to
>   reduce the size of the struct back down: (1) dynamically allocate
>   root-specific fields only for root page tables and (2) dynamically
>   allocate Shadow MMU state in kvm_mmu_page_arch only for Shadow MMU
>   pages. This should actually be a net *reduction* in the size of
>   kvm_mmu_page relative to today for most pages, but I have not
>   implemented it.
>
>   Note that an alternative approach I implemented avoided this problem
>   by creating an entirely separate struct for the common TDP MMU (e.g.
>   struct tdp_mmu_page). This however had a lot of downsides that I
>   don't think make it a good solution. Notably, it complicated a ton of
>   existing code in arch/x86/kvm/mmu/mmu.c (e.g. anything that touches
>   vcpu->arch.mmu->root and kvm_recover_nx_huge_pages()) and created a
>   new runtime failure mode in to_shadow_page().
>
> - Naming. This series does not change the names of any existing code.
>   So all the KVM/x86 Shadow MMU-style terminology like
>   "shadow_page"/"sp"/"spte" persists. Should we keep that style in
>   common code or move toward something less shadow-paging-specific?
>   e.g. "page_table"/"pt"/"pte".
I would strongly be in favor of discarding the shadow paging residue if
x86 folks are willing to part ways with it :)

> Also do we want to keep "TDP" or switch to something more familiar
> across architectures (e.g. ARM and RISC-V both use "Stage-2")?

As it relates to guest memory management I don't see much of an issue
with it, TBH. It is sufficiently arch-generic and gets the point across.

Beyond that I think it really depends on the scope of the common code.

To replace the arm64 table walkers we will need to use it for stage-1
tables. I'm only hand-waving at the cover letter and need to do more
reading, but is it possible to accomplish some division:

- A set of generic table walkers that implement common operations, like
  map and unmap. Names and types at this layer wouldn't be
  virt-specific.

- Memory management for KVM guests that uses the table walker library,
  which we can probably still call the TDP MMU. (See the sketch below.)

Certainly this doesn't need to be addressed in the first series, as the
x86 surgery is enough on its own. Nonetheless, it is probably worthwhile
to get the conversation started about how this code can actually be used
by the other arches.

--
Thanks,
Oliver
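For illustration only, a rough sketch of the two-layer split described
above. Every name and type here is hypothetical; nothing below comes from
the series. The point is simply that the lower layer knows nothing about
KVM, while the guest memory-management layer does:

  #include <linux/kvm_host.h>
  #include <linux/types.h>

  /* Layer 1: a generic page-table walker library, not virt-specific. */
  struct pt_ops {
          u64  (*make_leaf)(u64 pfn, int level, u32 prot);      /* build a leaf PTE  */
          bool (*is_leaf)(u64 pte, int level);                  /* decode a PTE      */
          void (*flush_range)(void *priv, u64 start, u64 end);  /* TLB maintenance   */
  };

  struct pt_walker {
          u64 *root;                  /* root table of the walk */
          int levels;                 /* number of levels in the table format */
          const struct pt_ops *ops;   /* format-specific callbacks */
          void *priv;                 /* opaque cookie passed back to ops */
  };

  int pt_map(struct pt_walker *pt, u64 addr, u64 pfn, int level, u32 prot);
  int pt_unmap(struct pt_walker *pt, u64 addr, u64 size);

  /* Layer 2: guest memory management ("the TDP MMU"), built on layer 1.
   * This is where struct kvm, memslots, dirty logging, etc. would live. */
  int tdp_mmu_map(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int level);
  void tdp_mmu_unmap(struct kvm *kvm, gfn_t gfn, u64 nr_pages);

The same layer-1 walkers could then, in principle, be reused for arm64
stage-1 tables by plugging in a different struct pt_ops.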
On Fri, Dec 9, 2022 at 11:07 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Thu, Dec 08, 2022 at 11:38:20AM -0800, David Matlack wrote:
> > Also do we want to keep "TDP" or switch to something more familiar
> > across architectures (e.g. ARM and RISC-V both use "Stage-2")?
>
> As it relates to guest memory management I don't see much of an issue
> with it, TBH. It is sufficiently arch-generic and gets the point across.
>
> Beyond that I think it really depends on the scope of the common code.
>
> To replace the arm64 table walkers we will need to use it for stage-1
> tables.

Speaking of, have ARM folks ever discussed deduplicating the KVM/ARM
stage-1 code with the Linux stage-1 table code (<linux/pgtable.h>), which
is already architecture-neutral? It seems backwards for us to build out
an architecture-neutral stage-1 walker in KVM when one already exists.

For example, arch/arm64/kvm/mmu.c:get_user_mapping_size() looks like it
could be reimplemented using <linux/pgtable.h>, rather than using KVM
code. In fact that's what we do for walking stage-1 page tables in
KVM/x86. Take a look at arch/x86/kvm/mmu/mmu.c:host_pfn_mapping_level().
I bet we could move that somewhere in mm/ so that it could be shared
across KVM/x86 and KVM/ARM. (A simplified sketch of such a walk follows
at the end of this message.)

> I'm only hand-waving at the cover letter and need to do more
> reading, but is it possible to accomplish some division:
>
> - A set of generic table walkers that implement common operations, like
>   map and unmap. Names and types at this layer wouldn't be
>   virt-specific.
>
> - Memory management for KVM guests that uses the table walker library,
>   which we can probably still call the TDP MMU.
>
> Certainly this doesn't need to be addressed in the first series, as the
> x86 surgery is enough on its own. Nonetheless, it is probably worthwhile
> to get the conversation started about how this code can actually be used
> by the other arches.

Yup, we'll need some sort of split like that in order to integrate with
KVM/ARM, since the hyp can't access struct kvm, work_queues, etc. in
tdp_mmu.c. I don't think we'll need that split for KVM/RISC-V though. So
for the sake of incremental progress I'm not planning on doing any of
that refactoring preemptively. Plus it should be possible to keep the TDP
MMU API constant when the internal implementation eventually gets split
up, i.e. I don't foresee it creating a bunch of churn down the road.
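A simplified sketch of the kind of <linux/pgtable.h> walk referred to
above, loosely modeled on host_pfn_mapping_level(). It is illustrative
only: it assumes the generic p?d_leaf() helpers and omits details the
real KVM code handles (memslot bounds, hugetlbfs sizing, coordination
with mmu_notifier invalidations):

  #include <linux/mm.h>
  #include <linux/pgtable.h>

  /*
   * Return the level at which @hva is mapped in the host (stage-1) page
   * tables of @mm: 1 = base page, 2 = PMD leaf (e.g. 2MiB), 3 = PUD leaf
   * (e.g. 1GiB). Returns 1 if the address is unmapped or unknown.
   */
  static int host_mapping_level(struct mm_struct *mm, unsigned long hva)
  {
          unsigned long flags;
          int level = 1;
          pgd_t pgd;
          p4d_t p4d;
          pud_t pud;
          pmd_t pmd;

          /*
           * Disable IRQs so the page tables cannot be freed out from
           * under the lockless walk (freeing is deferred via IPI/RCU).
           */
          local_irq_save(flags);

          pgd = READ_ONCE(*pgd_offset(mm, hva));
          if (pgd_none(pgd))
                  goto out;

          p4d = READ_ONCE(*p4d_offset(&pgd, hva));
          if (p4d_none(p4d) || !p4d_present(p4d))
                  goto out;

          pud = READ_ONCE(*pud_offset(&p4d, hva));
          if (pud_none(pud) || !pud_present(pud))
                  goto out;

          if (pud_leaf(pud)) {            /* e.g. a 1GiB mapping */
                  level = 3;
                  goto out;
          }

          pmd = READ_ONCE(*pmd_offset(&pud, hva));
          if (pmd_none(pmd) || !pmd_present(pmd))
                  goto out;

          if (pmd_leaf(pmd))              /* e.g. a 2MiB mapping */
                  level = 2;
  out:
          local_irq_restore(flags);
          return level;
  }

Because everything above comes from <linux/pgtable.h>, a helper like this
is not tied to KVM at all, which is the argument for hosting it in mm/.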
On 12/9/22 20:07, Oliver Upton wrote:
>> - Naming. This series does not change the names of any existing code.
>>   So all the KVM/x86 Shadow MMU-style terminology like
>>   "shadow_page"/"sp"/"spte" persists. Should we keep that style in
>>   common code or move toward something less shadow-paging-specific?
>>   e.g. "page_table"/"pt"/"pte".
>
> I would strongly be in favor of discarding the shadow paging residue if
> x86 folks are willing to part ways with it
On Mon, Dec 12, 2022, Paolo Bonzini wrote:
> On 12/9/22 20:07, Oliver Upton wrote:
> > > - Naming. This series does not change the names of any existing code.
> > >   So all the KVM/x86 Shadow MMU-style terminology like
> > >   "shadow_page"/"sp"/"spte" persists. Should we keep that style in
> > >   common code or move toward something less shadow-paging-specific?
> > >   e.g. "page_table"/"pt"/"pte".
> >
> > I would strongly be in favor of discarding the shadow paging residue if
> > x86 folks are willing to part ways with it
On 12/13/22 00:26, Sean Christopherson wrote:
>>> I would strongly be in favor of discarding the shadow paging residue if
>>> x86 folks are willing to part ways with it
On Thu, Dec 08, 2022 at 11:38:20AM -0800, David Matlack wrote:
>
> Hello,
>
> This series refactors the KVM/x86 "TDP MMU" into common code. This is
> the first step toward sharing TDP (aka Stage-2) page table management
> code across architectures that support KVM.

Thank you everyone for the feedback on this RFC. I have a couple of
updates to share and a question at the end.

First, Alexandre Ghiti from Rivos is going to work on the RISC-V port.
I'd like to target RISC-V first, since it has significantly lower risk
and complexity than e.g. ARM (which has pKVM, stage-1 walkers, and
[soon] nested virtualization to deal with).

Before I send a v2, I am working on sending out several related patches.
These are patches that should have enough justification to be merged
regardless of the fate of the Common MMU. By sending them out
separately, I figure they will be easier to review, allow me to make
incremental progress, and ultimately simplify the v2 of this series.

What I've sent so far:

- https://lore.kernel.org/kvm/20230117222707.3949974-1-dmatlack@google.com/
- https://lore.kernel.org/kvm/20230118175300.790835-1-dmatlack@google.com/

What's coming soon:

- A series to add a common API for range-based TLB flushing (patches
  29-33 from this series, plus another cleanup). This cleanup stands on
  its own, plus Raghavendra from Google needs it for his ARM series to
  add range-based TLBI support [1].

- A patch to move sp->tdp_mmu_page into sp->role.tdp_mmu. This was
  suggested by Paolo as an alternative to patch 4, and saves a byte from
  struct kvm_mmu_page. (See the sketch below this message.)

There will probably be more related cleanups I will send, but this is
everything I'm tracking so far. If anyone wants to see a complete v2
sooner, let me know.

Paolo and Sean, what are your thoughts on merging the Common MMU
refactor without RISC-V support? e.g. Should Alexandre and I work on
developing a functional prototype first, or are you open to merging the
refactor and then building RISC-V support on top of that? My preference
is the latter, so that there is a more stable base on which to build the
RISC-V support, we can make incremental progress, and we keep everyone
upstream more involved in the development.

Thanks.

[1] https://lore.kernel.org/kvm/20230109215347.3119271-4-rananta@google.com/
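A rough illustration of the sp->role.tdp_mmu idea mentioned above. The
layout here is heavily simplified and is not the actual KVM definition;
it only shows the general technique of folding a flag into a spare bit
of the existing 32-bit role word instead of carrying a separate bool:

  #include <linux/types.h>

  /* Simplified sketch only -- not the real kvm_mmu_page_role layout. */
  union kvm_mmu_page_role {
          u32 word;
          struct {
                  unsigned level:4;      /* existing role bits (truncated here) */
                  unsigned tdp_mmu:1;    /* new: replaces the separate bool flag */
                  unsigned unused:27;
          };
  };

  struct kvm_mmu_page {
          union kvm_mmu_page_role role;
          /* bool tdp_mmu_page; -- no longer needed, saving (at least) a byte */
          /* ... remaining fields ... */
  };

A caller would then test sp->role.tdp_mmu instead of sp->tdp_mmu_page.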
On Thu, Jan 19, 2023 at 6:14 PM David Matlack <dmatlack@google.com> wrote:
> Paolo and Sean, what are your thoughts on merging the Common MMU
> refactor without RISC-V support?

I have no objection. We know what the long-term plan is, and it's not so
long-term anyway.

Paolo
On Thu, 19 Jan 2023 17:14:34 +0000, David Matlack <dmatlack@google.com> wrote:
>
> On Thu, Dec 08, 2022 at 11:38:20AM -0800, David Matlack wrote:
> >
> > Hello,
> >
> > This series refactors the KVM/x86 "TDP MMU" into common code. This is
> > the first step toward sharing TDP (aka Stage-2) page table management
> > code across architectures that support KVM.
>
> Thank you everyone for the feedback on this RFC. I have a couple of
> updates to share and a question at the end.
>
> First, Alexandre Ghiti from Rivos is going to work on the RISC-V port.
> I'd like to target RISC-V first, since it has significantly lower risk
> and complexity than e.g. ARM (which has pKVM, stage-1 walkers, and
> [soon] nested virtualization to deal with).

And (joy, happiness), the upcoming 128bit page table support[1].

M.

[1] https://developer.arm.com/documentation/ddi0601/2022-12/AArch64-Registers/TTBR0-EL1--Translation-Table-Base-Register-0--EL1-?lang=en
On Thu, Jan 19, 2023 at 9:24 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Thu, 19 Jan 2023 17:14:34 +0000, David Matlack <dmatlack@google.com> wrote:
> > I'd like to target RISC-V first, since it has significantly lower risk
> > and complexity than e.g. ARM (which has pKVM, stage-1 walkers, and
> > [soon] nested virtualization to deal with).
>
> And (joy, happiness), the upcoming 128bit page table support[1].

Oh good, I was worried the ARM port was going to be too easy :)
On Thu, Jan 19, 2023 at 10:38 AM David Matlack <dmatlack@google.com> wrote:
>
> On Thu, Jan 19, 2023 at 9:24 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Thu, 19 Jan 2023 17:14:34 +0000, David Matlack <dmatlack@google.com> wrote:
> > > I'd like to target RISC-V first, since it has significantly lower risk
> > > and complexity than e.g. ARM (which has pKVM, stage-1 walkers, and
> > > [soon] nested virtualization to deal with).
> >
> > And (joy, happiness), the upcoming 128bit page table support[1].
>
> Oh good, I was worried the ARM port was going to be too easy :)

But in all seriousness, I'm not too worried about supporting 128-bit
page tables in the Common MMU, assuming it is a compile-time decision.
The way I'm planning to organize the code, architecture-specific code
will own the PTEs, so each architecture can do whatever it wants. There
is a hard-coded assumption that PTEs are u64 in the current code, but we
can abstract that behind a typedef for 128-bit support.

We will need to figure out how to deal with concurrency though. Will
128-bit page table support come with 128-bit atomic support (e.g.
compare-exchange)? If so we should be good to go. If not, we'll need to
emulate them with e.g. spinlocks. But either way, figuring this out is
not specific to the Common MMU. Even if ARM kept its own stage-2 MMU
we'd have to solve the same problem there.
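A sketch of what the compile-time PTE abstraction described above might
look like. The config symbol and the 128-bit helper are hypothetical
placeholders; only the 64-bit path maps to something that exists today
(the TDP MMU already updates SPTEs with try_cmpxchg64()):

  #include <linux/atomic.h>
  #include <linux/types.h>

  /* Hypothetical: an arch needing 128-bit stage-2 PTEs would select this. */
  #ifdef CONFIG_KVM_TDP_PTE_128
  typedef struct { u64 lo; u64 hi; } tdp_pte_t;
  #else
  typedef u64 tdp_pte_t;
  #endif

  /*
   * Atomically install @new if *@ptep still equals *@old; on failure,
   * *@old is updated with the current value. With 64-bit PTEs this is a
   * plain try_cmpxchg64(). A 128-bit build would need either a native
   * 16-byte compare-exchange or emulation (e.g. a lock protecting the
   * PTE), which is exactly the open question raised above.
   */
  static inline bool tdp_pte_cmpxchg(tdp_pte_t *ptep, tdp_pte_t *old,
                                     tdp_pte_t new)
  {
  #ifdef CONFIG_KVM_TDP_PTE_128
          return arch_tdp_pte_cmpxchg128(ptep, old, new); /* hypothetical hook */
  #else
          return try_cmpxchg64(ptep, old, new);
  #endif
  }

Keeping the common code behind accessors like this is what makes the PTE
width an arch-owned, compile-time decision rather than a common-MMU one.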