Message ID: 20231202091211.13376-1-yan.y.zhao@intel.com
Series: Sharing KVM TDP to IOMMU
On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> In this series, term "exported" is used in place of "shared" to avoid confusion with terminology "shared EPT" in TDX.
>
> The framework contains 3 main objects:
>
> "KVM TDP FD" object - The interface of KVM to export TDP page tables. With this object, KVM allows external components to access a TDP page table exported by KVM.

I don't know much about the internals of kvm, but why have this extra user visible piece? Isn't there only one "TDP" per kvm fd? Why not just use the KVM FD as a handle for the TDP?

> "IOMMUFD KVM HWPT" object - A proxy connecting KVM TDP FD to IOMMU driver. This HWPT has no IOAS associated.
>
> "KVM domain" in IOMMU driver - Stage 2 domain in IOMMU driver whose paging structures are managed by KVM. Its hardware TLB invalidation requests are notified from KVM via IOMMUFD KVM HWPT object.

This seems broadly the right direction.

> - About device which partially supports IOPF
>
> Many devices claiming PCIe PRS capability actually only tolerate IOPF in certain paths (e.g. DMA paths for SVM applications, but not for non-SVM applications or driver data such as ring descriptors). But the PRS capability doesn't include a bit to tell whether a device 100% tolerates IOPF in all DMA paths.

The lack of tolerance for truly DMA-pinned guest memory is a significant problem for any real deployment, IMHO. I am aware of no device that can handle PRI on every single DMA path. :(

> A simple way is to track an allowed list of devices which are known 100% IOPF-friendly in VFIO. Another option is to extend PCIe spec to allow device reporting whether it fully or partially supports IOPF in the PRS capability.

I think we need something like this.

> - How to map MSI page on arm platform demands discussions.

Yes, the recurring problem :(

Probably the same approach as nesting would work for a hack - map the ITS page into the fixed reserved slot and tell the guest not to touch it and to identity map it.

Jason
On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > In this series, term "exported" is used in place of "shared" to avoid confusion with terminology "shared EPT" in TDX.
> >
> > The framework contains 3 main objects:
> >
> > "KVM TDP FD" object - The interface of KVM to export TDP page tables. With this object, KVM allows external components to access a TDP page table exported by KVM.
>
> I don't know much about the internals of kvm, but why have this extra user visible piece?

That I don't know, I haven't looked at the gory details of this RFC.

> Isn't there only one "TDP" per kvm fd?

No. In steady state, with TDP (EPT) enabled and assuming homogeneous capabilities across all vCPUs, KVM will have 3+ sets of TDP page tables *active* at any given time:

  1. "Normal"
  2. SMM
  3-N. Guest (for L2, i.e. nested, VMs)

The number of possible TDP page tables used for nested VMs is well bounded, but since devices obviously can't be nested VMs, I won't bother trying to explain the various possibilities (nested NPT on AMD is downright ridiculous).

Nested virtualization aside, devices are obviously not capable of running in SMM and so they all need to use the "normal" page tables.

I highlighted "active" above because if _any_ memslot is deleted, KVM will invalidate *all* existing page tables and rebuild new page tables as needed. So over the lifetime of a VM, KVM could theoretically use an infinite number of page tables.
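For reference, the roots Sean enumerates are distinguished by bits in each root page's role; a much-simplified sketch (the real union kvm_mmu_page_role in arch/x86/include/asm/kvm_host.h carries many more fields) looks roughly like:

#include <linux/types.h>

/*
 * Simplified, illustrative sketch of the role bits KVM uses to key TDP
 * roots; the real union kvm_mmu_page_role also carries level, access,
 * invalid, etc.
 */
union tdp_root_role_sketch {
    u32 word;
    struct {
        u32 smm:1;          /* root used while the vCPU is in SMM */
        u32 guest_mode:1;   /* root used for a nested (L2) guest */
        /* ... */
    };
};

/*
 * Roots with identical role bits are shared; a vCPU entering SMM or L2
 * switches to (or creates) the root matching its current role, which is
 * why a VM has 3+ active TDP page tables rather than one.
 */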
On Sat, Dec 02, 2023, Yan Zhao wrote:
> This RFC series proposes a framework to resolve IOPF by sharing the KVM TDP (Two Dimensional Paging) page table to the IOMMU as its stage 2 paging structure, to support IOPF (IO page fault) on the IOMMU's stage 2 paging structure.
>
> Previously, all guest pages have to be pinned and mapped in IOMMU stage 2 paging structures after pass-through devices are attached, even if the device has IOPF capability. Such all-guest-memory pinning can be avoided when IOPF handling for the stage 2 paging structure is supported and if there are only IOPF-capable devices attached to a VM.
>
> There are 2 approaches to support IOPF on IOMMU stage 2 paging structures:
> - Supporting by IOMMUFD/IOMMU alone
>   IOMMUFD handles IO page faults on the stage-2 HWPT by calling GUP and then iommu_map() to set up IOVA mappings. (The IOAS is required to keep GPA-to-HVA info, but page pinning/unpinning needs to be skipped.) Then upon MMU notifiers on the host primary MMU, iommu_unmap() is called to adjust IOVA mappings accordingly. The IOMMU driver needs to support unmapping sub-ranges of a previously mapped range and take care of huge page merge and split in an atomic way. [1][2]
>
> - Sharing KVM TDP
>   IOMMUFD sets the root of the KVM TDP page table (EPT/NPT in x86) as the root of the IOMMU stage 2 paging structure, and routes IO page faults to KVM. (This assumes that the iommu hw supports the same stage-2 page table format as the CPU.) In this model the page table is centrally managed by KVM (mmu notifier, page mapping, subpage unmapping, atomic huge page split/merge, etc.), while IOMMUFD only needs to invalidate iotlb/devtlb properly.

There are more approaches beyond having IOMMUFD and KVM be completely separate entities. E.g. extract the bulk of KVM's "TDP MMU" implementation to common code so that IOMMUFD doesn't need to reinvent the wheel.

> Currently, there's no upstream code available to support stage 2 IOPF yet.
>
> This RFC chooses to implement the "Sharing KVM TDP" approach, which has the below main benefits:

Please list out the pros and cons for each. In the cons column for piggybacking KVM's page tables:

 - *Significantly* increases the complexity in KVM

 - Puts constraints on what KVM can/can't do in the future (see the movement of SPTE_MMU_PRESENT).

 - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion mess, the truly nasty MTRR emulation (which I still hope to delete), the NX hugepage mitigation, etc.

Please also explain the intended/expected/targeted use cases. E.g. if the main use case is for device passthrough to slice-of-hardware VMs that aren't memory oversubscribed,

> - Unified page table management
>   The complexity of allocating guest pages per GPAs, registering to MMU notifier on host primary MMU, sub-page unmapping, atomic page merge/split

Please find different terminology than "sub-page". With Sub-Page Protection, Intel has more or less established "sub-page" to mean "less than 4KiB granularity". But that can't possibly be what you mean here because KVM doesn't support (un)mapping memory at <4KiB granularity. Based on context above, I assume you mean "unmapping arbitrary pages within a given range".

> are only required to be handled in KVM side, which has been doing that well for a long time.
>
> - Reduced page faults:
>   Only one page fault is triggered on a single GPA, either caused by IO access or by vCPU access. (compared to one IO page fault for DMA and one CPU page fault for vCPUs in the non-shared approach.)

This would be relatively easy to solve with bi-directional notifiers, i.e. KVM notifies IOMMUFD when a vCPU faults in a page, and vice versa.

> - Reduced memory consumption:
>   The memory of one page table is saved.

I'm not convinced that memory consumption is all that interesting. If a VM is mapping the majority of memory into a device, then odds are good that the guest is backed with at least 2MiB pages, if not 1GiB pages, at which point the memory overhead for page tables is quite small, especially relative to the total amount of memory overheads for such systems.

If a VM is mapping only a small subset of its memory into devices, then the IOMMU page tables should be sparsely populated, i.e. won't consume much memory.
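As a rough illustration of the first, IOMMUFD-only approach from the quoted cover letter, a stage-2 IO page fault handler could look something like the sketch below. The helper name is an assumption, and the GPA-to-HVA translation via the IOAS is assumed to have been done by the caller; this is not code from this series:

#include <linux/iommu.h>
#include <linux/mm.h>

/*
 * Hypothetical sketch of IOPF handling on a stage-2 HWPT without KVM:
 * fault the backing page in through GUP (no long-term pin), then install
 * the IOVA->PFN mapping.  MMU-notifier invalidations would later call
 * iommu_unmap() on the range.
 */
static int hwpt_s2_handle_iopf(struct iommu_domain *domain, unsigned long gpa,
                               unsigned long hva, bool write)
{
    unsigned int gup_flags = write ? FOLL_WRITE : 0;
    struct page *page;
    int ret;

    /* Fault the backing page into the primary MMU. */
    ret = get_user_pages_fast(hva & PAGE_MASK, 1, gup_flags, &page);
    if (ret != 1)
        return ret < 0 ? ret : -EFAULT;

    /* IOVA == GPA for a stage-2 domain. */
    ret = iommu_map(domain, gpa & PAGE_MASK, page_to_phys(page), PAGE_SIZE,
                    IOMMU_READ | (write ? IOMMU_WRITE : 0), GFP_KERNEL);

    /* No pinning: teardown relies on mmu-notifier-driven iommu_unmap(). */
    put_page(page);
    return ret;
}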
On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> There are more approaches beyond having IOMMUFD and KVM be completely separate entities. E.g. extract the bulk of KVM's "TDP MMU" implementation to common code so that IOMMUFD doesn't need to reinvent the wheel.

We've pretty much done this already, it is called "hmm" and it is what the IO world uses. Merging/splitting huge pages is just something that needs some coding in the page table code, that people want for other reasons anyhow.

> - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion mess, the truly nasty MTRR emulation (which I still hope to delete), the NX hugepage mitigation, etc.

Does it? I think that just remains isolated in kvm. The output from KVM is only a radix table top pointer, it is up to KVM how to manage it still.

> I'm not convinced that memory consumption is all that interesting. If a VM is mapping the majority of memory into a device, then odds are good that the guest is backed with at least 2MiB pages, if not 1GiB pages, at which point the memory overhead for page tables is quite small, especially relative to the total amount of memory overheads for such systems.

AFAIK the main argument is performance. It is similar to why we want to do IOMMU SVA with MM page table sharing.

If the IOMMU mirrors/shadows/copies a page table using something like HMM techniques then the invalidations will mark ranges of IOVA as non-present and faults will occur to trigger hmm_range_fault to do the shadowing.

This means that pretty much all IO will always encounter a non-present fault, certainly at the start and maybe worse while ongoing.

On the other hand, if we share the exact page table then natural CPU touches will usually make the page present before an IO happens in almost all cases and we don't have to take the horribly expensive IO page fault at all.

We were not able to make bi-dir notifiers work with the CPU mm, I'm not sure that is "relatively easy" :(

Jason
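For reference, the hmm-style shadowing Jason describes typically follows the mmu_interval_notifier retry pattern; a minimal, driver-agnostic sketch, assuming the caller already registered the notifier over the mirrored range:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/*
 * Illustrative hmm shadowing loop: fault the range in the primary MMU and
 * retry if an invalidation raced with us.  Invalidations arriving through
 * the mmu_interval_notifier mark the mirrored IOVA range non-present,
 * which is why IO tends to hit a fault first.
 */
static int mirror_fault_range(struct mmu_interval_notifier *notifier,
                              unsigned long start, unsigned long end,
                              unsigned long *pfns)
{
    struct hmm_range range = {
        .notifier      = notifier,
        .start         = start,
        .end           = end,
        .hmm_pfns      = pfns,
        .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
    };
    int ret;

    do {
        range.notifier_seq = mmu_interval_read_begin(notifier);
        mmap_read_lock(notifier->mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(notifier->mm);
        if (ret == -EBUSY)
            continue;       /* invalidation raced, start over */
        if (ret)
            return ret;
        /* A real driver re-checks under its page table lock here. */
    } while (mmu_interval_read_retry(notifier, range.notifier_seq));

    /* pfns[] now holds the PFNs to install in the device page table. */
    return 0;
}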
On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> > There are more approaches beyond having IOMMUFD and KVM be completely separate entities. E.g. extract the bulk of KVM's "TDP MMU" implementation to common code so that IOMMUFD doesn't need to reinvent the wheel.
>
> We've pretty much done this already, it is called "hmm" and it is what the IO world uses. Merging/splitting huge pages is just something that needs some coding in the page table code, that people want for other reasons anyhow.

Not really. HMM is a wildly different implementation than KVM's TDP MMU. At a glance, HMM is basically a variation on the primary MMU, e.g. deals with VMAs, runs under mmap_lock (or per-VMA locks?), and faults memory into the primary MMU while walking the "secondary" HMM page tables.

KVM's TDP MMU (and all of KVM's flavors of MMUs) is much more of a pure secondary MMU. The core of a KVM MMU maps GFNs to PFNs; the intermediate steps that involve the primary MMU are largely orthogonal. E.g. getting a PFN from guest_memfd instead of the primary MMU essentially boils down to invoking kvm_gmem_get_pfn() instead of __gfn_to_pfn_memslot(); the MMU proper doesn't care how the PFN was resolved. I.e. 99% of KVM's MMU logic has no interaction with the primary MMU.

> > - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion mess, the truly nasty MTRR emulation (which I still hope to delete), the NX hugepage mitigation, etc.
>
> Does it? I think that just remains isolated in kvm. The output from KVM is only a radix table top pointer, it is up to KVM how to manage it still.

Oh, I didn't mean from a code perspective, I meant from a behavioral perspective. E.g. there's no reason to disallow huge mappings in the IOMMU because the CPU is vulnerable to the iTLB multi-hit mitigation.

> > I'm not convinced that memory consumption is all that interesting. If a VM is mapping the majority of memory into a device, then odds are good that the guest is backed with at least 2MiB pages, if not 1GiB pages, at which point the memory overhead for page tables is quite small, especially relative to the total amount of memory overheads for such systems.
>
> AFAIK the main argument is performance. It is similar to why we want to do IOMMU SVA with MM page table sharing.
>
> If the IOMMU mirrors/shadows/copies a page table using something like HMM techniques then the invalidations will mark ranges of IOVA as non-present and faults will occur to trigger hmm_range_fault to do the shadowing.
>
> This means that pretty much all IO will always encounter a non-present fault, certainly at the start and maybe worse while ongoing.
>
> On the other hand, if we share the exact page table then natural CPU touches will usually make the page present before an IO happens in almost all cases and we don't have to take the horribly expensive IO page fault at all.

I'm not advocating mirroring/copying/shadowing page tables between KVM and the IOMMU. I'm suggesting managing IOMMU page tables mostly independently, but reusing KVM code to do so.

I wouldn't even be opposed to KVM outright managing the IOMMU's page tables. E.g. add an "iommu" flag to "union kvm_mmu_page_role" and then the implementation looks rather similar to this series.

What terrifies me is sharing page tables between the CPU and the IOMMU verbatim.

Yes, sharing page tables will Just Work for faulting in memory, but the downside is that _when_, not if, KVM modifies PTEs for whatever reason, those modifications will also impact the IO path. My understanding is that IO page faults are at least an order of magnitude more expensive than CPU page faults. That means that what's optimal for CPU page tables may not be optimal, or even _viable_, for IOMMU page tables.

E.g. based on our conversation at LPC, write-protecting guest memory to do dirty logging is not a viable option for the IOMMU because the latency of the resulting IOPF is too high. Forcing KVM to use D-bit dirty logging for CPUs just because the VM has passthrough (mediated?) devices would likely be a non-starter.

One of my biggest concerns with sharing page tables between KVM and IOMMUs is that we will end up having to revert/reject changes that benefit KVM's usage due to regressing the IOMMU usage.

If instead KVM treats IOMMU page tables as their own thing, then we can have divergent behavior as needed, e.g. different dirty logging algorithms, different software-available bits, etc. It would also allow us to define new ABI instead of trying to reconcile the many incompatibilities and warts in KVM's existing ABI. E.g. off the top of my head:

 - The virtual APIC page shouldn't be visible to devices, as it's not "real" guest memory.

 - Access tracking, i.e. page aging, by making PTEs !PRESENT because the CPU doesn't support A/D bits or because the admin turned them off via KVM's enable_ept_ad_bits module param.

 - Write-protecting GFNs for shadow paging when L1 is running nested VMs. KVM's ABI can be that device writes to L1's page tables are exempt.

 - KVM can exempt IOMMU page tables from KVM's awful "drop all page tables if any memslot is deleted" ABI.

> We were not able to make bi-dir notifiers work with the CPU mm, I'm not sure that is "relatively easy" :(

I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the same".

It wouldn't even necessarily need to be a notifier per se, e.g. if we taught KVM to manage IOMMU page tables, then KVM could simply install mappings for multiple sets of page tables as appropriate.
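A minimal sketch of that fire-and-forget idea as a hypothetical interface; none of these names exist in KVM or IOMMUFD today:

#include <linux/types.h>

/*
 * Hypothetical one-way hint from KVM to IOMMUFD: "GFN X just became
 * present for vCPUs, consider pre-faulting it on the IOMMU side".
 * No reply and no synchronization requirement; IOMMUFD may drop it.
 */
struct kvm_gfn_prefault_hint {
    u64 gfn;
    u32 order;      /* mapping size KVM installed, e.g. 0/9/18 */
    u32 flags;      /* e.g. writable */
};

struct kvm_iommu_hint_ops {
    void (*gfn_mapped)(void *data, const struct kvm_gfn_prefault_hint *hint);
};

/* KVM side: called from the TDP fault path after installing the SPTE. */
static inline void kvm_hint_gfn_mapped(const struct kvm_iommu_hint_ops *ops,
                                       void *data, u64 gfn, u32 order,
                                       u32 flags)
{
    struct kvm_gfn_prefault_hint hint = {
        .gfn = gfn,
        .order = order,
        .flags = flags,
    };

    if (ops && ops->gfn_mapped)
        ops->gfn_mapped(data, &hint);
}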
On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> > > There are more approaches beyond having IOMMUFD and KVM be completely separate entities. E.g. extract the bulk of KVM's "TDP MMU" implementation to common code so that IOMMUFD doesn't need to reinvent the wheel.
> >
> > We've pretty much done this already, it is called "hmm" and it is what the IO world uses. Merging/splitting huge pages is just something that needs some coding in the page table code, that people want for other reasons anyhow.
>
> Not really. HMM is a wildly different implementation than KVM's TDP MMU. At a glance, HMM is basically a variation on the primary MMU, e.g. deals with VMAs, runs under mmap_lock (or per-VMA locks?), and faults memory into the primary MMU while walking the "secondary" HMM page tables.

hmm supports the essential idea of shadowing parts of the primary MMU. This is a big chunk of what kvm is doing, just differently.

> KVM's TDP MMU (and all of KVM's flavors of MMUs) is much more of a pure secondary MMU. The core of a KVM MMU maps GFNs to PFNs; the intermediate steps that involve the primary MMU are largely orthogonal. E.g. getting a PFN from guest_memfd instead of the primary MMU essentially boils down to invoking kvm_gmem_get_pfn() instead of __gfn_to_pfn_memslot(); the MMU proper doesn't care how the PFN was resolved. I.e. 99% of KVM's MMU logic has no interaction with the primary MMU.

Hopefully the memfd stuff will be generalized so we can use it in iommufd too, without relying on kvm. At least the first basic stuff should be doable fairly soon.

> I'm not advocating mirroring/copying/shadowing page tables between KVM and the IOMMU. I'm suggesting managing IOMMU page tables mostly independently, but reusing KVM code to do so.

I guess from my POV, if KVM has two copies of the logically same radix tree then that is fine too.

> Yes, sharing page tables will Just Work for faulting in memory, but the downside is that _when_, not if, KVM modifies PTEs for whatever reason, those modifications will also impact the IO path. My understanding is that IO page faults are at least an order of magnitude more expensive than CPU page faults. That means that what's optimal for CPU page tables may not be optimal, or even _viable_, for IOMMU page tables.

Yes, you wouldn't want to do some of the same KVM techniques today in a shared mode.

> E.g. based on our conversation at LPC, write-protecting guest memory to do dirty logging is not a viable option for the IOMMU because the latency of the resulting IOPF is too high. Forcing KVM to use D-bit dirty logging for CPUs just because the VM has passthrough (mediated?) devices would likely be a non-starter.

Yes

> One of my biggest concerns with sharing page tables between KVM and IOMMUs is that we will end up having to revert/reject changes that benefit KVM's usage due to regressing the IOMMU usage.

It is certainly a strong argument

> I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the same".

If we say the only thing this works with is the memfd version of KVM, could we design the memfd stuff to not have the same challenges with mirroring as normal VMAs?

> It wouldn't even necessarily need to be a notifier per se, e.g.
> if we taught KVM to manage IOMMU page tables, then KVM could simply install mappings for multiple sets of page tables as appropriate.

This somehow feels more achievable to me since KVM already has all the code to handle multiple TDPs; having two parallel ones is probably much easier than trying to weld KVM to a different page table implementation through some kind of loosely coupled notifier.

Jason
On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> > I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the same".
>
> If we say the only thing this works with is the memfd version of KVM,

That's likely a big "if", as guest_memfd is not and will not be a wholesale replacement of VMA-based guest memory, at least not in the foreseeable future. I would be quite surprised if the target use cases for this could be moved to guest_memfd without losing required functionality.

> could we design the memfd stuff to not have the same challenges with mirroring as normal VMAs?

What challenges in particular are you concerned about? And maybe also define "mirroring"? E.g. ensuring that the CPU and IOMMU page tables are synchronized is very different than ensuring that the IOMMU page tables can only map memory that is mappable by the guest, i.e. that KVM can map into the CPU page tables.
On Mon, Dec 04, 2023 at 12:11:46PM -0800, Sean Christopherson wrote:
> > could we design the memfd stuff to not have the same challenges with mirroring as normal VMAs?
>
> What challenges in particular are you concerned about? And maybe also define "mirroring"? E.g. ensuring that the CPU and IOMMU page tables are synchronized is very different than ensuring that the IOMMU page tables can only map memory that is mappable by the guest, i.e. that KVM can map into the CPU page tables.

IIRC, it has been awhile, it is difficult to get a new populated PTE out of the MM side and into an hmm user and get all the invalidation locking to work as well. Especially when the devices want to do sleeping invalidations.

kvm doesn't solve this problem either, but pushing populated TDP PTEs to another observer may be simpler, as perhaps would pushing populated memfd pages or something like that?

"mirroring" here would simply mean that if the CPU side has a populated page then the hmm side copying it would also have a populated page. Instead of a fault-on-use model.

Jason
On Mon, Dec 04, 2023 at 08:38:17AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > > In this series, term "exported" is used in place of "shared" to avoid confusion with terminology "shared EPT" in TDX.
> > >
> > > The framework contains 3 main objects:
> > >
> > > "KVM TDP FD" object - The interface of KVM to export TDP page tables. With this object, KVM allows external components to access a TDP page table exported by KVM.
> >
> > I don't know much about the internals of kvm, but why have this extra user visible piece?
>
> That I don't know, I haven't looked at the gory details of this RFC.
>
> > Isn't there only one "TDP" per kvm fd?
>
> No. In steady state, with TDP (EPT) enabled and assuming homogeneous capabilities across all vCPUs, KVM will have 3+ sets of TDP page tables *active* at any given time:
>
> 1. "Normal"
> 2. SMM
> 3-N. Guest (for L2, i.e. nested, VMs)

Yes, the reason to introduce KVM TDP FD is to let KVM know which TDP the user wants to export (share).

For as_id=0 (which is currently the only as_id supported for sharing), a TDP with smm=0, guest_mode=0 will be chosen. Upon receiving the KVM_CREATE_TDP_FD ioctl, KVM will try to find an existing TDP root with the role specified by as_id 0. If an existing TDP root with the target role is found, KVM will just export that one; if none is found, KVM will create a new TDP root in non-vCPU context. Then, KVM will mark the exported TDP as "exported".

 tdp_mmu_roots                            | role | smm | guest_mode
       |                                  +------+-----+-----------+
    ------|-----------------              |  0   |  0  |     0      ==> address space 0
    |     |        |       |              |  1   |  1  |     0
    |     v        v       v              |  2   |  0  |     1
    | .--------. .--------. .--------.    |  3   |  1  |     1
    | |  root  | |  root  | |  root  |
    | |(role 1)| |(role 2)| |(role 3)|
    | '--------' '--------' '--------'
    |      ^
    |      |    create or get   .------.
    |      +--------------------| vCPU |
    |             fault         '------'
    |                            smm=1
    |                            guest_mode=0
    |
    |  (set root as exported)
    v
 .--------.  create or get  .---------------.  create or get  .------.
 | TDP FD |---------------->| root (role 0) |<----------------| vCPU |
 '--------'      fault      '---------------'      fault      '------'
                                  .  smm=0
                                  .  guest_mode=0
                                  .
           non-vCPU context  <---|--->  vCPU context
                                  .
                                  .

No matter whether the TDP is exported or not, vCPUs just load TDP roots according to their vCPU modes. In this way, KVM is able to share the TDP in KVM address space 0 to the IOMMU side.

> The number of possible TDP page tables used for nested VMs is well bounded, but since devices obviously can't be nested VMs, I won't bother trying to explain the various possibilities (nested NPT on AMD is downright ridiculous).

In future, if possible, I wonder if we can export a TDP for a nested VM too. E.g. in scenarios where the TDP is partitioned, and one piece is for the L2 VM. Maybe we can specify that and tell KVM the very piece of TDP to export.

> Nested virtualization aside, devices are obviously not capable of running in SMM and so they all need to use the "normal" page tables.
>
> I highlighted "active" above because if _any_ memslot is deleted, KVM will invalidate *all* existing page tables and rebuild new page tables as needed. So over the lifetime of a VM, KVM could theoretically use an infinite number of page tables.

Right. In patch 36, the TDP root which is marked as "exported" will be exempted from "invalidate". Instead, an "exported" TDP just zaps all leaf entries upon memory slot removal. That is to say, an exported TDP can stay "active" until it's unmarked as exported.
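A rough sketch of the lookup described above; the helper names, role setup, and locking are simplified and illustrative, not the RFC's exact code:

/*
 * Illustrative sketch of exporting the address-space-0 TDP root (within
 * arch/x86/kvm/mmu).  tdp_mmu_alloc_root() is an assumed helper.
 */
static struct kvm_mmu_page *tdp_mmu_get_exported_root(struct kvm *kvm)
{
    union kvm_mmu_page_role role = {};  /* smm=0, guest_mode=0; a real role also sets level, etc. */
    struct kvm_mmu_page *root;

    lockdep_assert_held_write(&kvm->mmu_lock);

    /* Reuse a live root with the wanted role if vCPUs already built one. */
    list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
        if (root->role.word == role.word && !root->role.invalid &&
            kvm_tdp_mmu_get_root(root))
            return root;
    }

    /*
     * Otherwise build a fresh root in non-vCPU context; vCPUs with a
     * matching role pick it up on their next root load, and the RFC
     * then marks it "exported".
     */
    return tdp_mmu_alloc_root(kvm, role);
}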
On Mon, Dec 04, 2023 at 11:08:00AM -0400, Jason Gunthorpe wrote:
> On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > In this series, term "exported" is used in place of "shared" to avoid confusion with terminology "shared EPT" in TDX.
> >
> > The framework contains 3 main objects:
> >
> > "KVM TDP FD" object - The interface of KVM to export TDP page tables. With this object, KVM allows external components to access a TDP page table exported by KVM.
>
> I don't know much about the internals of kvm, but why have this extra user visible piece? Isn't there only one "TDP" per kvm fd? Why not just use the KVM FD as a handle for the TDP?

As explained in a parallel mail, the reason to introduce KVM TDP FD is to let KVM know which TDP the user wants to export (share).

Another reason is to wrap the exported TDP and its exported ops in a single structure, so that components outside of KVM can query metadata, request page faults, and register invalidate callbacks through the exported ops.

struct kvm_tdp_fd {
    /* Public */
    struct file *file;
    const struct kvm_exported_tdp_ops *ops;

    /* private to KVM */
    struct kvm_exported_tdp *priv;
};

For KVM, it only needs to expose this struct kvm_tdp_fd and two symbols, kvm_tdp_fd_get() and kvm_tdp_fd_put().

> > "IOMMUFD KVM HWPT" object - A proxy connecting KVM TDP FD to IOMMU driver. This HWPT has no IOAS associated.
> >
> > "KVM domain" in IOMMU driver - Stage 2 domain in IOMMU driver whose paging structures are managed by KVM. Its hardware TLB invalidation requests are notified from KVM via IOMMUFD KVM HWPT object.
>
> This seems broadly the right direction.
>
> > - About device which partially supports IOPF
> >
> > Many devices claiming PCIe PRS capability actually only tolerate IOPF in certain paths (e.g. DMA paths for SVM applications, but not for non-SVM applications or driver data such as ring descriptors). But the PRS capability doesn't include a bit to tell whether a device 100% tolerates IOPF in all DMA paths.
>
> The lack of tolerance for truly DMA-pinned guest memory is a significant problem for any real deployment, IMHO. I am aware of no device that can handle PRI on every single DMA path. :(

DSA actually can handle PRI on all DMA paths. But it requires the driver to turn on this capability :(

> > A simple way is to track an allowed list of devices which are known 100% IOPF-friendly in VFIO. Another option is to extend PCIe spec to allow device reporting whether it fully or partially supports IOPF in the PRS capability.
>
> I think we need something like this.
>
> > - How to map MSI page on arm platform demands discussions.
>
> Yes, the recurring problem :(
>
> Probably the same approach as nesting would work for a hack - map the ITS page into the fixed reserved slot and tell the guest not to touch it and to identity map it.

Ok.
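To illustrate what "query metadata, request page faults, register invalidate callbacks" could look like through those ops, here is a hypothetical shape; the RFC's actual callback list and signatures may differ:

#include <linux/types.h>

/* Illustrative only; not the RFC's exact definitions. */
struct kvm_exported_tdp_meta {
    int as_id;      /* KVM address space the TDP translates */
    int level;      /* paging-structure level of the root */
    u64 root_hpa;   /* root to program into the IOMMU */
};

struct kvm_exported_tdp_ops {
    /* Query static properties needed to configure the IOMMU domain. */
    const struct kvm_exported_tdp_meta *(*get_meta)(struct kvm_tdp_fd *tdp_fd);

    /* Ask KVM to fault in a GFN on behalf of a device (IOPF path). */
    int (*fault)(struct kvm_tdp_fd *tdp_fd, u64 gfn, bool write);

    /* Register for hardware TLB invalidation notifications from KVM. */
    int (*register_invalidate)(struct kvm_tdp_fd *tdp_fd,
                               void (*cb)(void *data, u64 gfn, u64 nr_pages),
                               void *data);
};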
On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> On Sat, Dec 02, 2023, Yan Zhao wrote:
> Please list out the pros and cons for each. In the cons column for piggybacking KVM's page tables:
>
> - *Significantly* increases the complexity in KVM

The complexity to KVM (up to now) is:
a. fault in non-vCPU context
b. keep the exported root always "active"
c. disallow non-coherent DMAs
d. movement of SPTE_MMU_PRESENT

For a, I think it's accepted, and we can see eager page split already allocates non-leaf pages in non-vCPU context.

For b, it requires the exported TDP root to stay "active" across KVM's "fast zap" (which invalidates all active TDP roots). Instead, the exported TDP's leaf entries are all zapped. Though it looks not "fast" enough, it avoids an unnecessary root page zap, and fast zap is actually not frequent --
- one for memslot removal (IO page fault is unlikely to happen during VM boot-up)
- one for MMIO generation wraparound (which is rare)
- one for NX huge page mode change (which is rare too)

For c, maybe we can work out a way to remove the MTRR stuff.

For d, I added a config to turn this movement on/off. But right, the KVM side will have to sacrifice a bit for software usage and take care of it when the config is on.

> - Puts constraints on what KVM can/can't do in the future (see the movement of SPTE_MMU_PRESENT).
> - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion mess, the truly nasty MTRR emulation (which I still hope to delete), the NX hugepage mitigation, etc.

The NX hugepage mitigation only exists on certain CPUs. I don't see it on recent Intel platforms, e.g. SPR and GNR... We can disallow the sharing approach if the NX huge page mitigation is enabled. But if pinning or partial pinning is not involved, NX huge pages will only cause unnecessary zaps that reduce performance; functionally it still works well.

Besides, regarding the extra IO invalidation involved in TDP zaps, I think SVM has the same issue, i.e. each zap in the primary MMU is also accompanied by an IO invalidation.

> Please also explain the intended/expected/targeted use cases. E.g. if the main use case is for device passthrough to slice-of-hardware VMs that aren't memory oversubscribed,

The main use case is device passthrough with all devices supporting full IOPF.

Opportunistically, we hope it can be used in trusted IO, where the TDP is shared to the IO side. Then only one page table audit is required, and the out-of-sync window for mappings between the CPU and IO side can also be eliminated.

> > - Unified page table management
> >   The complexity of allocating guest pages per GPAs, registering to MMU notifier on host primary MMU, sub-page unmapping, atomic page merge/split
>
> Please find different terminology than "sub-page". With Sub-Page Protection, Intel has more or less established "sub-page" to mean "less than 4KiB granularity". But that can't possibly be what you mean here because KVM doesn't support (un)mapping memory at <4KiB granularity. Based on context above, I assume you mean "unmapping arbitrary pages within a given range".

Ok, sorry for this confusion. By "sub-page unmapping", I mean atomic huge page splitting and unmapping a smaller range within the previous huge page.
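To make item b concrete, the described behavior amounts to roughly the following in the "fast zap" path; helper names are assumed, and the actual change is in patch 36 of the RFC:

/*
 * Illustrative sketch: during a "fast zap" (memslot removal, MMIO
 * generation wraparound, NX huge page mode change), an exported root is
 * not invalidated; only its leaf SPTEs are zapped so the root stays live
 * for the IOMMU.  root_is_exported() and tdp_mmu_zap_root_leafs() are
 * assumed helpers.
 */
static void tdp_mmu_invalidate_all_roots_sketch(struct kvm *kvm)
{
    struct kvm_mmu_page *root;

    list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
        if (root_is_exported(root))
            tdp_mmu_zap_root_leafs(kvm, root);  /* keep root active */
        else
            root->role.invalid = true;          /* normal fast-zap path */
    }
}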
On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> > > I'm not convinced that memory consumption is all that interesting. If a VM is mapping the majority of memory into a device, then odds are good that the guest is backed with at least 2MiB pages, if not 1GiB pages, at which point the memory overhead for page tables is quite small, especially relative to the total amount of memory overheads for such systems.
> >
> > AFAIK the main argument is performance. It is similar to why we want to do IOMMU SVA with MM page table sharing.
> >
> > If the IOMMU mirrors/shadows/copies a page table using something like HMM techniques then the invalidations will mark ranges of IOVA as non-present and faults will occur to trigger hmm_range_fault to do the shadowing.
> >
> > This means that pretty much all IO will always encounter a non-present fault, certainly at the start and maybe worse while ongoing.
> >
> > On the other hand, if we share the exact page table then natural CPU touches will usually make the page present before an IO happens in almost all cases and we don't have to take the horribly expensive IO page fault at all.
>
> I'm not advocating mirroring/copying/shadowing page tables between KVM and the IOMMU. I'm suggesting managing IOMMU page tables mostly independently, but reusing KVM code to do so.
>
> I wouldn't even be opposed to KVM outright managing the IOMMU's page tables. E.g. add an "iommu" flag to "union kvm_mmu_page_role" and then the implementation looks rather similar to this series.

Yes, very similar to the current implementation, which adds an "exported" flag to "union kvm_mmu_page_role".

> What terrifies me is sharing page tables between the CPU and the IOMMU verbatim.
>
> Yes, sharing page tables will Just Work for faulting in memory, but the downside is that _when_, not if, KVM modifies PTEs for whatever reason, those modifications will also impact the IO path. My understanding is that IO page faults are at least an order of magnitude more expensive than CPU page faults. That means that what's optimal for CPU page tables may not be optimal, or even _viable_, for IOMMU page tables.
>
> E.g. based on our conversation at LPC, write-protecting guest memory to do dirty logging is not a viable option for the IOMMU because the latency of the resulting IOPF is too high. Forcing KVM to use D-bit dirty logging for CPUs just because the VM has passthrough (mediated?) devices would likely be a non-starter.
>
> One of my biggest concerns with sharing page tables between KVM and IOMMUs is that we will end up having to revert/reject changes that benefit KVM's usage due to regressing the IOMMU usage.

As the TDP shared with the IOMMU is marked by KVM, could we limit the changes (that benefit KVM but regress the IOMMU) to TDPs that are not shared?

> If instead KVM treats IOMMU page tables as their own thing, then we can have divergent behavior as needed, e.g. different dirty logging algorithms, different software-available bits, etc. It would also allow us to define new ABI instead of trying to reconcile the many incompatibilities and warts in KVM's existing ABI. E.g. off the top of my head:
>
> - The virtual APIC page shouldn't be visible to devices, as it's not "real" guest memory.
>
> - Access tracking, i.e. page aging, by making PTEs !PRESENT because the CPU doesn't support A/D bits or because the admin turned them off via KVM's enable_ept_ad_bits module param.
>
> - Write-protecting GFNs for shadow paging when L1 is running nested VMs. KVM's ABI can be that device writes to L1's page tables are exempt.
>
> - KVM can exempt IOMMU page tables from KVM's awful "drop all page tables if any memslot is deleted" ABI.
>
> > We were not able to make bi-dir notifiers work with the CPU mm, I'm not sure that is "relatively easy" :(
>
> I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the same".
>
> It wouldn't even necessarily need to be a notifier per se, e.g. if we taught KVM to manage IOMMU page tables, then KVM could simply install mappings for multiple sets of page tables as appropriate.

Not sure which approach below is the one you are referring to by "fire-and-forget notifier" and "if we taught KVM to manage IOMMU page tables".

Approach A:
1. User space or IOMMUFD tells KVM which address space to share to IOMMUFD.
2. KVM creates a special TDP and maps this page table whenever a GFN in the specified address space is faulted to a PFN on the vCPU side.
3. IOMMUFD imports this special TDP and receives zap notifications from KVM. KVM will only send the zap notification for memslot removal or for certain MMU zap notifications.

Approach B:
1. User space or IOMMUFD tells KVM which address space to notify.
2. KVM notifies IOMMUFD whenever a GFN in the specified address space is faulted to a PFN on the vCPU side.
3. IOMMUFD translates GFN to PFN in its own way (through VMAs or through a certain new memfd interface), and maps IO PTEs by itself.
4. IOMMUFD zaps IO PTEs when a memslot is removed and interacts with the MMU notifier for zap notifications in the primary MMU.

If approach A is preferred, could vCPUs also be allowed to attach to this special TDP in VMs that don't suffer from the NX hugepage mitigation, do not want live migration with passthrough devices, and don't rely on write-protection for nested VMs?
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, December 4, 2023 11:08 PM
>
> On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > - How to map MSI page on arm platform demands discussions.
>
> Yes, the recurring problem :(
>
> Probably the same approach as nesting would work for a hack - map the ITS page into the fixed reserved slot and tell the guest not to touch it and to identity map it.

Yes, logically it should follow what is planned for nesting. It's just that kvm needs to involve more iommu-specific knowledge, e.g. iommu_get_msi_cookie(), to reserve the slot.
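For reference, reserving the MSI window via iommu_get_msi_cookie() looks roughly like the sketch below; the fixed base address and how this hooks into a KVM-managed stage-2 domain are assumptions, which is exactly the open question here:

#include <linux/iommu.h>

/*
 * Sketch: reserve an IOVA window for MSI doorbells on the KVM-managed
 * stage-2 domain, mirroring what nesting does today.  msi_iova_base is an
 * assumed, platform-chosen fixed slot the guest must identity map.
 * iommu_dma_prepare_msi()/compose later use this cookie per IRQ.
 */
static int kvm_domain_prepare_msi(struct iommu_domain *domain,
                                  dma_addr_t msi_iova_base)
{
    return iommu_get_msi_cookie(domain, msi_iova_base);
}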
> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Tuesday, December 5, 2023 9:32 AM
>
> On Mon, Dec 04, 2023 at 08:38:17AM -0800, Sean Christopherson wrote:
> > The number of possible TDP page tables used for nested VMs is well bounded, but since devices obviously can't be nested VMs, I won't bother trying to explain the various possibilities (nested NPT on AMD is downright ridiculous).
>
> In future, if possible, I wonder if we can export a TDP for a nested VM too. E.g. in scenarios where the TDP is partitioned, and one piece is for the L2 VM. Maybe we can specify that and tell KVM the very piece of TDP to export.

Nesting is tricky. The reason why the sharing (w/o nesting) is logically OK is that both the IOMMU and KVM page tables are for the same GPA address space created by the host.

For a nested VM together with a vIOMMU, the same sharing story holds if the stage-2 page table on both sides still translates GPA. That implies the vIOMMU is enabled in nested translation mode and L0 KVM doesn't expose vEPT to the L1 VMM (which then uses shadowing instead).

Things become tricky when the vIOMMU is working in a shadowing mode or when L0 KVM exposes vEPT to the L1 VMM. In either case the stage-2 page table of the L0 IOMMU/KVM actually translates a guest address space, and then sharing becomes problematic (on figuring out whether both refer to the same guest address space while that fact might change at any time).
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, December 5, 2023 3:51 AM
>
> On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> > It wouldn't even necessarily need to be a notifier per se, e.g. if we taught KVM to manage IOMMU page tables, then KVM could simply install mappings for multiple sets of page tables as appropriate.

The iommu driver still needs to be notified to invalidate the iotlb, unless we want KVM to directly call the IOMMU API instead of going through iommufd.

> This somehow feels more achievable to me since KVM already has all the code to handle multiple TDPs; having two parallel ones is probably much easier than trying to weld KVM to a different page table implementation through some kind of loosely coupled notifier.

Yes, performance-wise this can also reduce the I/O page faults as the sharing approach achieves.

But how does it compare to another way of supporting IOPF natively in iommufd and the iommu drivers?

Note that iommufd also needs to support native vfio applications, e.g. dpdk. I'm not sure whether there will be strong interest in enabling IOPF for those applications. But if the answer is yes then it's inevitable to have such logic implemented in the iommu stack, given KVM is not in the picture there.

With that, is it more reasonable to develop the IOPF support natively on the iommu side, plus an optional notifier mechanism to sync with KVM-induced host PTE installation as an optimization?