Message ID: 20240823132137.336874-1-aik@amd.com (mailing list archive)
Series: Secure VFIO, TDISP, SEV TIO
Alexey Kardashevskiy wrote:
> Hi everyone,
>
> Here are some patches to enable SEV-TIO (aka TDISP, aka secure VFIO)
> on AMD Turin.
>
> The basic idea is to allow DMA to/from encrypted memory of SNP VMs and
> secure MMIO in SNP VMs (i.e. with Cbit set) as well.
>
> These include both guest and host support. QEMU also requires
> some patches, links below.
>
> The patches are organized as:
> 01..06 - preparing the host OS;
> 07     - new TSM module;
> 08     - add PSP SEV TIO ABI (IDE should start working at this point);
> 09..14 - add KVM support (TDI binding, MMIO faulting, etc);
> 15..19 - guest changes (the rest of SEV TIO ABI, DMA, secure MMIO).
> 20, 21 - some helpers for guest OS to use encrypted MMIO
>
> This is based on a merge of
> ee3248f9f8d6 Lukas Wunner   spdm: Allow control of next requester nonce through sysfs
> 85ef1ac03941 (AMDESE/snp-host-latest) Michael Roth  [TEMP] KVM: guest_memfd: Update gmem_prepare hook to handle partially-allocated folios
>
>
> Please comment. Thanks.

This cover letter is something I can read after having been in and
around this space for a while, but I wonder how much of it makes sense
to casual reviewers?

>
> Thanks,
>
>
> SEV TIO tree prototype
> ======================
[..]
> Code
> ----
>
> Written with AMD SEV SNP in mind, the TSM is the PSP and therefore not
> much of IDE/TDISP is left for the host or guest OS.
>
> Add a common module to expose various data objects in the same way in
> host and guest OS.
>
> Provide a knob on the host to enable IDE encryption.
>
> Add another version of Guest Request for secure guest<->PSP
> communication.
>
> Enable secure DMA by:
> - configuring vTOM in a secure DTE via the PSP to cover the entire
>   guest RAM;
> - mapping all private memory pages in the IOMMU as if they were shared
>   (requires hacking iommufd);

What kind of hack are we talking about here? An upstream-suitable
change, or something that needs quite a bit more work to be done
properly?

I jumped ahead to read Jason's reaction, but please do at least provide
a map of the controversy in the cover letter, something like "see patch
12 for details".

> - skipping various enforcements of non-SME or SWIOTLB in the guest;

Is this based on some concept of private vs shared mode devices?

>   No mixed shared+private DMA supported within the same IOMMU.

What does this mean? A device may not have mixed mappings (makes sense),
or an IOMMU cannot host devices that do not all agree on whether DMA is
private or shared?

> Enable secure MMIO by:
> - configuring RMP entries via the PSP;
> - adding necessary helpers for mapping MMIO with the Cbit set;
> - hacking the KVM #PF handler to allow private MMIO faults.
>
> Based on the latest upstream KVM (at the moment it is kvm-coco-queue).

Here is where I lament that kvm-coco-queue is not run like akpm/mm where
it is possible to try out "yesterday's mm". Perhaps this is an area to
collaborate on kvm-coco-queue snapshots to help with testing.

> Workflow
> --------
>
> 1. Boot host OS.
> 2. "Connect" the physical device.
> 3. Bind a VF to VFIO-PCI.
> 4. Run QEMU _without_ the device yet.
> 5. Hotplug the VF to the VM.
> 6. (if not already) Load the device driver.
> 7. Right after BusMaster is enabled, tsm.ko performs secure DMA and
>    MMIO setup.
> 8. Run tests, for example:
>    sudo ./pcimem/pcimem /sys/bus/pci/devices/0000\:01\:00.0/resource4_enc 0 w*4 0xabcd
>
>
> Assumptions
> -----------
>
> This requires hotplugging into the VM vs passing the device via the
> command line, as VFIO maps all guest memory at the device init step,
> which is too soon: SNP LAUNCH UPDATE happens later and will fail if
> VFIO maps private memory before that.

Would the device not just launch in "shared" mode until it is later
converted to private? I am missing the detail of why passing the device
on the command line requires that private memory be mapped early.

That said, the implication that private device assignment requires
hotplug events is a useful property. This matches nicely with initial
thoughts that device conversion events are violent and might as well be
unplug/replug events to match all the assumptions around what needs to
be updated.

> This requires the BME hack as MMIO and

Not sure what the "BME hack" is, I guess this is foreshadowing for later
in this story.

> BusMaster enable bits cannot be 0 after MMIO validation is done

It would be useful to call out what is a TDISP requirement, vs a
device-specific DSM vs a host-specific TSM requirement. In this case I
assume you are referring to PCIe 6.2 sec 11.2.6 where it notes that TDIs
must enter the TDISP ERROR state if BME is cleared after the device is
locked?

...but this begs the question of whether it needs to be avoided outright
or handled as an error recovery case depending on policy.

> the guest OS booting process when this happens.
>
> SVSM could help addressing these (not implemented at the moment).

At first thought, avoiding SVSM entanglements where the kernel can be
enlightened should be the policy. I would only expect SVSM hacks to
cover for legacy OSes that will never be TDISP enlightened, but in that
case we are likely talking about a fully unaware L2. Let's assume a
fully enlightened L1 for now.

> QEMU advertises TEE-IO capability to the VM.
> An additional x-tio flag is added to vfio-pci.
>
>
> TODOs
> -----
>
> Deal with PCI reset. Hot unplug+plug? Power states too.
>
> Do better generalization, the current code heavily uses SEV TIO defined
> structures in supposedly generic code.
>
> Fix the documentation comments of SEV TIO structures.

Hey, it's a start. I appreciate the "release early" aspect of this
posting.

> Git trees
> ---------
>
> https://github.com/AMDESE/linux-kvm/tree/tio
> https://github.com/AMDESE/qemu/tree/tio
[..]
>
>
> Alexey Kardashevskiy (21):
>   tsm-report: Rename module to reflect what it does
>   pci/doe: Define protocol types and make those public
>   pci: Define TEE-IO bit in PCIe device capabilities
>   PCI/IDE: Define Integrity and Data Encryption (IDE) extended capability
>   crypto/ccp: Make some SEV helpers public
>   crypto: ccp: Enable SEV-TIO feature in the PSP when supported
>   pci/tdisp: Introduce tsm module
>   crypto/ccp: Implement SEV TIO firmware interface
>   kvm: Export kvm_vm_set_mem_attributes
>   vfio: Export helper to get vfio_device from fd
>   KVM: SEV: Add TIO VMGEXIT and bind TDI
>   KVM: IOMMUFD: MEMFD: Map private pages
>   KVM: X86: Handle private MMIO as shared
>   RFC: iommu/iommufd/amd: Add IOMMU_HWPT_TRUSTED flag, tweak DTE's DomainID, IOTLB
>   coco/sev-guest: Allow multiple source files in the driver
>   coco/sev-guest: Make SEV-to-PSP request helpers public
>   coco/sev-guest: Implement the guest side of things
>   RFC: pci: Add BUS_NOTIFY_PCI_BUS_MASTER event
>   sev-guest: Stop changing encrypted page state for TDISP devices
>   pci: Allow encrypted MMIO mapping via sysfs
>   pci: Define pci_iomap_range_encrypted
>
>  drivers/crypto/ccp/Makefile | 2 +
>  drivers/pci/Makefile | 1 +
>  drivers/virt/coco/Makefile | 3 +-
>  drivers/virt/coco/sev-guest/Makefile | 1 +
>  arch/x86/include/asm/kvm-x86-ops.h | 2 +
>  arch/x86/include/asm/kvm_host.h | 2 +
>  arch/x86/include/asm/sev.h | 23 +
>  arch/x86/include/uapi/asm/svm.h | 2 +
>  arch/x86/kvm/svm/svm.h | 2 +
>  drivers/crypto/ccp/sev-dev-tio.h | 105 ++
>  drivers/crypto/ccp/sev-dev.h | 4 +
>  drivers/iommu/amd/amd_iommu_types.h | 2 +
>  drivers/iommu/iommufd/io_pagetable.h | 3 +
>  drivers/iommu/iommufd/iommufd_private.h | 4 +
>  drivers/virt/coco/sev-guest/sev-guest.h | 56 +
>  include/asm-generic/pci_iomap.h | 4 +
>  include/linux/device.h | 5 +
>  include/linux/device/bus.h | 3 +
>  include/linux/dma-direct.h | 4 +
>  include/linux/iommufd.h | 6 +
>  include/linux/kvm_host.h | 70 +
>  include/linux/pci-doe.h | 4 +
>  include/linux/pci-ide.h | 18 +
>  include/linux/pci.h | 2 +-
>  include/linux/psp-sev.h | 116 +-
>  include/linux/swiotlb.h | 4 +
>  include/linux/tsm-report.h | 113 ++
>  include/linux/tsm.h | 337 +++--
>  include/linux/vfio.h | 1 +
>  include/uapi/linux/iommufd.h | 1 +
>  include/uapi/linux/kvm.h | 29 +
>  include/uapi/linux/pci_regs.h | 77 +-
>  include/uapi/linux/psp-sev.h | 4 +-
>  arch/x86/coco/sev/core.c | 11 +
>  arch/x86/kvm/mmu/mmu.c | 6 +-
>  arch/x86/kvm/svm/sev.c | 217 +++
>  arch/x86/kvm/svm/svm.c | 3 +
>  arch/x86/kvm/x86.c | 12 +
>  arch/x86/mm/mem_encrypt.c | 5 +
>  arch/x86/virt/svm/sev.c | 23 +-
>  drivers/crypto/ccp/sev-dev-tio.c | 1565 ++++++++++++++++++++
>  drivers/crypto/ccp/sev-dev-tsm.c | 397 +++++
>  drivers/crypto/ccp/sev-dev.c | 87 +-
>  drivers/iommu/amd/iommu.c | 20 +-
>  drivers/iommu/iommufd/hw_pagetable.c | 4 +
>  drivers/iommu/iommufd/io_pagetable.c | 2 +
>  drivers/iommu/iommufd/main.c | 21 +
>  drivers/iommu/iommufd/pages.c | 94 +-
>  drivers/pci/doe.c | 2 -
>  drivers/pci/ide.c | 186 +++
>  drivers/pci/iomap.c | 24 +
>  drivers/pci/mmap.c | 11 +-
>  drivers/pci/pci-sysfs.c | 27 +-
>  drivers/pci/pci.c | 3 +
>  drivers/pci/proc.c | 2 +-
>  drivers/vfio/vfio_main.c | 13 +
>  drivers/virt/coco/sev-guest/{sev-guest.c => sev_guest.c} | 68 +-
>  drivers/virt/coco/sev-guest/sev_guest_tio.c | 513 +++++++
>  drivers/virt/coco/tdx-guest/tdx-guest.c | 8 +-
>  drivers/virt/coco/tsm-report.c | 512 +++++++
>  drivers/virt/coco/tsm.c | 1542 ++++++++++++++-----
>  virt/kvm/guest_memfd.c | 40 +
>  virt/kvm/kvm_main.c | 4 +-
>  virt/kvm/vfio.c | 197 ++-
>  Documentation/virt/coco/tsm.rst | 62 +
>  MAINTAINERS | 4 +-
>  arch/x86/kvm/Kconfig | 1 +
>  drivers/pci/Kconfig | 4 +
>  drivers/virt/coco/Kconfig | 11 +
>  69 files changed, 6163 insertions(+), 548 deletions(-)
>  create mode 100644 drivers/crypto/ccp/sev-dev-tio.h
>  create mode 100644 drivers/virt/coco/sev-guest/sev-guest.h
>  create mode 100644 include/linux/pci-ide.h
>  create mode 100644 include/linux/tsm-report.h
>  create mode 100644 drivers/crypto/ccp/sev-dev-tio.c
>  create mode 100644 drivers/crypto/ccp/sev-dev-tsm.c
>  create mode 100644 drivers/pci/ide.c
>  rename drivers/virt/coco/sev-guest/{sev-guest.c => sev_guest.c} (96%)
>  create mode 100644 drivers/virt/coco/sev-guest/sev_guest_tio.c
>  create mode 100644 drivers/virt/coco/tsm-report.c
>  create mode 100644 Documentation/virt/coco/tsm.rst
>
> --
> 2.45.2
>
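For reference, the test in step 8 of the workflow above boils down to mmap'ing
the new resource4_enc sysfs node (added by the "pci: Allow encrypted MMIO
mapping via sysfs" patch) and performing a 32-bit store. A minimal userspace
sketch of what the pcimem invocation does, assuming the node behaves like a
regular PCI resourceN file with the C-bit applied; error handling is minimal
and the device path is the one from the cover letter:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/bus/pci/devices/0000:01:00.0/resource4_enc";
	int fd = open(path, O_RDWR | O_SYNC);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Map one page of the encrypted BAR4 and poke offset 0 */
	volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				      MAP_SHARED, fd, 0);
	if (bar == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}
	bar[0] = 0xabcd;		/* "0 w*4 0xabcd" in pcimem terms */
	printf("readback: 0x%x\n", bar[0]);
	munmap((void *)bar, 4096);
	close(fd);
	return 0;
}
```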
On 29/8/24 06:43, Dan Williams wrote:
> Alexey Kardashevskiy wrote:
[..]
>> Enable secure DMA by:
>> - configuring vTOM in a secure DTE via the PSP to cover the entire
>>   guest RAM;
>> - mapping all private memory pages in the IOMMU as if they were shared
>>   (requires hacking iommufd);
>
> What kind of hack are we talking about here? An upstream-suitable
> change, or something that needs quite a bit more work to be done
> properly?

Right now it is hacking IOMMUFD to go to the KVM for
private_gfn->host_pfn. As I am being told in this thread, VFIO DMA
map/unmap needs to be taught to accept {memfd, offset}.

> I jumped ahead to read Jason's reaction, but please do at least provide
> a map of the controversy in the cover letter, something like "see patch
> 12 for details".

Yeah, noticed that, thanks, appreciated!

>> - skipping various enforcements of non-SME or SWIOTLB in the guest;
>
> Is this based on some concept of private vs shared mode devices?
>
>>   No mixed shared+private DMA supported within the same IOMMU.
>
> What does this mean? A device may not have mixed mappings (makes sense),

Currently devices do not have an idea about private host memory (but it
is being worked on afaik).

> or an IOMMU cannot host devices that do not all agree on whether DMA is
> private or shared?

The hardware allows that via hardware-assisted vIOMMU and I/O page
tables in the guest with the C-bit taken into account by the IOMMU, but
the software support is missing right now. So for this initial drop,
vTOM is used for DMA - this thing says "everything below <addr> is
private, above <addr> - shared" so nothing needs to bother with the
C-bit, and in my exercise I set the <addr> to the allowed maximum.

So each IOMMUFD instance in the VM is either "all private mappings" or
"all shared".
Could be half/half by moving that <addr> :)

>> Enable secure MMIO by:
>> - configuring RMP entries via the PSP;
>> - adding necessary helpers for mapping MMIO with the Cbit set;
>> - hacking the KVM #PF handler to allow private MMIO faults.
>>
>> Based on the latest upstream KVM (at the moment it is kvm-coco-queue).
>
> Here is where I lament that kvm-coco-queue is not run like akpm/mm where
> it is possible to try out "yesterday's mm". Perhaps this is an area to
> collaborate on kvm-coco-queue snapshots to help with testing.

Yeah, this is more an idea of what it is based on; I normally push a
tested branch somewhere on github, just to eliminate uncertainty.

[..]
>> Assumptions
>> -----------
>>
>> This requires hotplugging into the VM vs passing the device via the
>> command line, as VFIO maps all guest memory at the device init step,
>> which is too soon: SNP LAUNCH UPDATE happens later and will fail if
>> VFIO maps private memory before that.
>
> Would the device not just launch in "shared" mode until it is later
> converted to private? I am missing the detail of why passing the device
> on the command line requires that private memory be mapped early.

A sequencing problem.

QEMU "realizes" a VFIO device, it creates an iommufd instance which
creates a domain and writes to a DTE (an IOMMU descriptor for a PCI
BDFn). And the DTE is not updated after that. For the secure stuff, the
DTE needs to be slightly different, so right then I tell IOMMUFD that it
will handle private memory.

Then, the same VFIO "realize" handler maps the guest memory in iommufd.
I use the same flag (well, pointer to kvm) in the iommufd pinning code,
private memory is pinned and mapped (and the related page state change
happens as the guest memory is made guest-owned in the RMP).

QEMU goes to machine_reset() and calls "SNP LAUNCH UPDATE" (the actual
place changed recently, huh) and the latter will measure the guest and
try making all guest memory private, but it already happened => error.

I think I have to decouple the pinning and the IOMMU/DTE setting.

> That said, the implication that private device assignment requires
> hotplug events is a useful property. This matches nicely with initial
> thoughts that device conversion events are violent and might as well be
> unplug/replug events to match all the assumptions around what needs to
> be updated.

For the initial drop, I tell QEMU via "-device vfio-pci,x-tio=true" that
it is going to be private so there should be no massive conversion.

>> This requires the BME hack as MMIO and
>
> Not sure what the "BME hack" is, I guess this is foreshadowing for later
> in this story.

>> BusMaster enable bits cannot be 0 after MMIO validation is done
>
> It would be useful to call out what is a TDISP requirement, vs a
> device-specific DSM vs a host-specific TSM requirement. In this case I
> assume you are referring to PCIe 6.2 sec 11.2.6 where it notes that TDIs
> must

Oh, there is 6.2 already.

> enter the TDISP ERROR state if BME is cleared after the device is
> locked?
>
> ...but this begs the question of whether it needs to be avoided outright

Well, besides a couple of avoidable places (like testing INTx support,
which we know is not going to work on VFs anyway), a standard driver
enables MSE first (and the value written to the command register does
not have 1 for BME) and only then BME. TBH I do not think writing BME=0
when BME is already 0 is "clearing", but my test device disagrees.

> or handled as an error recovery case depending on policy.

Avoiding seems more straightforward unless we actually want enlightened
device drivers which want to examine the interface report before
enabling the device. Not sure.

>> the guest OS booting process when this happens.
>>
>> SVSM could help addressing these (not implemented at the moment).
>
> At first thought, avoiding SVSM entanglements where the kernel can be
> enlightened should be the policy. I would only expect SVSM hacks to
> cover for legacy OSes that will never be TDISP enlightened, but in that
> case we are likely talking about a fully unaware L2. Let's assume a
> fully enlightened L1 for now.

Well, I could also tweak OVMF to make necessary calls to the PSP and
hack QEMU to postpone the command register updates to get this going,
just a matter of ugliness.

>> QEMU advertises TEE-IO capability to the VM.
>> An additional x-tio flag is added to vfio-pci.
>>
>>
>> TODOs
>> -----
>>
>> Deal with PCI reset. Hot unplug+plug? Power states too.
>>
>> Do better generalization, the current code heavily uses SEV TIO defined
>> structures in supposedly generic code.
>>
>> Fix the documentation comments of SEV TIO structures.
>
> Hey, it's a start. I appreciate the "release early" aspect of this
> posting.

:) Thanks,

>> Git trees
>> ---------
>>
>> https://github.com/AMDESE/linux-kvm/tree/tio
>> https://github.com/AMDESE/qemu/tree/tio
> [..]
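A minimal sketch of the vTOM rule described in the reply above
("everything below <addr> is private, above <addr> - shared"); the
helper name and signature are illustrative only, not the AMD IOMMU
driver's actual interface:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * With the IOMMU's vTOM programmed in the secure DTE, the encryption
 * attribute of a DMA address is decided purely by comparing against
 * that boundary, so neither the guest nor the I/O page tables need to
 * carry a per-page C-bit.
 */
static inline bool dma_addr_is_private(uint64_t iova, uint64_t vtom)
{
	return iova < vtom;	/* below vTOM: private; at/above: shared */
}
```

Setting the boundary to the maximum supported address, as described in
the cover letter, makes the whole IOMMUFD instance "all private";
lowering it would carve out a shared region above it ("half/half").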
Alexey Kardashevskiy wrote:
[..]
> >> - skipping various enforcements of non-SME or SWIOTLB in the guest;
> >
> > Is this based on some concept of private vs shared mode devices?
> >
> >>   No mixed shared+private DMA supported within the same IOMMU.
> >
> > What does this mean? A device may not have mixed mappings (makes sense),
>
> Currently devices do not have an idea about private host memory (but it
> is being worked on afaik).

Worked on where? You mean the PCI core indicating that a device is
private or not? Is that not indicated by guest-side TSM connection
state?

> > or an IOMMU cannot host devices that do not all agree on whether DMA is
> > private or shared?
>
> The hardware allows that via hardware-assisted vIOMMU and I/O page
> tables in the guest with the C-bit taken into account by the IOMMU, but
> the software support is missing right now. So for this initial drop,
> vTOM is used for DMA - this thing says "everything below <addr> is
> private, above <addr> - shared" so nothing needs to bother with the
> C-bit, and in my exercise I set the <addr> to the allowed maximum.
>
> So each IOMMUFD instance in the VM is either "all private mappings" or
> "all shared". Could be half/half by moving that <addr> :)

I thought existing use cases assume that the CC-VM can trigger page
conversions at will without regard to a vTOM concept? It would be nice
to have that address-map separation arrangement, but has not that ship
already sailed?

[..]
> > Would the device not just launch in "shared" mode until it is later
> > converted to private? I am missing the detail of why passing the device
> > on the command line requires that private memory be mapped early.
>
> A sequencing problem.
>
> QEMU "realizes" a VFIO device, it creates an iommufd instance which
> creates a domain and writes to a DTE (an IOMMU descriptor for a PCI
> BDFn). And the DTE is not updated after that. For the secure stuff, the
> DTE needs to be slightly different, so right then I tell IOMMUFD that it
> will handle private memory.
>
> Then, the same VFIO "realize" handler maps the guest memory in iommufd.
> I use the same flag (well, pointer to kvm) in the iommufd pinning code,
> private memory is pinned and mapped (and the related page state change
> happens as the guest memory is made guest-owned in the RMP).
>
> QEMU goes to machine_reset() and calls "SNP LAUNCH UPDATE" (the actual
> place changed recently, huh) and the latter will measure the guest and
> try making all guest memory private, but it already happened => error.
>
> I think I have to decouple the pinning and the IOMMU/DTE setting.
>
> > That said, the implication that private device assignment requires
> > hotplug events is a useful property. This matches nicely with initial
> > thoughts that device conversion events are violent and might as well be
> > unplug/replug events to match all the assumptions around what needs to
> > be updated.
>
> For the initial drop, I tell QEMU via "-device vfio-pci,x-tio=true" that
> it is going to be private so there should be no massive conversion.

That's a SEV-TIO RFC-specific hack, or a proposal?

An approach that aligns more closely with the VFIO operational model,
where it maps and waits for guest faults / usages, is that QEMU would be
told that the device is "bind capable", because the host is not in a
position to assume how the guest will use the device. A "bind capable"
device operates in shared mode unless and until the guest triggers
private conversion.
> >> This requires the BME hack as MMIO and
> >
> > Not sure what the "BME hack" is, I guess this is foreshadowing for later
> > in this story.
> >
> >> BusMaster enable bits cannot be 0 after MMIO validation is done
> >
> > It would be useful to call out what is a TDISP requirement, vs a
> > device-specific DSM vs a host-specific TSM requirement. In this case I
> > assume you are referring to PCIe 6.2 sec 11.2.6 where it notes that TDIs
> > must
>
> Oh, there is 6.2 already.
>
> > enter the TDISP ERROR state if BME is cleared after the device is
> > locked?
> >
> > ...but this begs the question of whether it needs to be avoided outright
>
> Well, besides a couple of avoidable places (like testing INTx support,
> which we know is not going to work on VFs anyway), a standard driver
> enables MSE first (and the value written to the command register does
> not have 1 for BME) and only then BME. TBH I do not think writing BME=0
> when BME is already 0 is "clearing", but my test device disagrees.

...but we should not be creating kernel policy around test devices. What
matters is real devices. Now, if it is likely that real / production
devices will go into the TDISP ERROR state by not coalescing MSE + BME
updates, then we need a solution.

Given it is unlikely that TDISP support will be widespread any time soon,
it is likely tenable to assume TDISP-compatible drivers call a new:

    pci_enable(pdev, PCI_ENABLE_TARGET | PCI_ENABLE_INITIATOR);

...or something like that to coalesce command register writes.

Otherwise, if that retrofit ends up being too much work or confusion,
then the ROI of teaching the PCI core to recover from this scenario
needs to be evaluated.

> > or handled as an error recovery case depending on policy.
>
> Avoiding seems more straightforward unless we actually want enlightened
> device drivers which want to examine the interface report before
> enabling the device. Not sure.

If TDISP-capable devices trend towards a handful of devices in the near
term, then some driver fixups seem reasonable. Otherwise, if every PCI
device driver Linux has ever seen needs to be ready for that device to
have a TDISP-capable flavor, then mitigating this in the PCI core makes
more sense than playing driver whack-a-mole.

> >> the guest OS booting process when this happens.
> >>
> >> SVSM could help addressing these (not implemented at the moment).
> >
> > At first thought, avoiding SVSM entanglements where the kernel can be
> > enlightened should be the policy. I would only expect SVSM hacks to
> > cover for legacy OSes that will never be TDISP enlightened, but in that
> > case we are likely talking about a fully unaware L2. Let's assume a
> > fully enlightened L1 for now.
>
> Well, I could also tweak OVMF to make necessary calls to the PSP and
> hack QEMU to postpone the command register updates to get this going,
> just a matter of ugliness.

Per above, the tradeoff should be in ROI, not ugliness. I don't see how
OVMF helps when devices might be being virtually hotplugged or reset.
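The pci_enable(pdev, PCI_ENABLE_TARGET | PCI_ENABLE_INITIATOR) above is
a hypothetical interface. A minimal sketch of what such a coalesced
enable could do, using only existing config accessors; the helper name
is made up for illustration, and the point is a single PCI_COMMAND write
that sets MSE and BME together, instead of the usual
pci_enable_device() followed by pci_set_master(), which touches the
register twice:

```c
#include <linux/pci.h>

/*
 * Hypothetical helper: set Memory Space Enable and Bus Master Enable in
 * one PCI_COMMAND write so a locked TDI never observes a config write
 * with BME=0 after MMIO validation. Only the config accessors and the
 * PCI_COMMAND_* bits are existing kernel APIs.
 */
static void pci_enable_mem_and_master_coalesced(struct pci_dev *pdev)
{
	u16 cmd;

	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
	cmd |= PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER;
	pci_write_config_word(pdev, PCI_COMMAND, cmd);
}
```

Whether something like this belongs in a new pci_enable() flavor that
TDISP-aware drivers opt into, or in the PCI core, is exactly the ROI
question weighed above.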
On 30/8/24 09:41, Dan Williams wrote:
> Alexey Kardashevskiy wrote:
[..]
>> So each IOMMUFD instance in the VM is either "all private mappings" or
>> "all shared". Could be half/half by moving that <addr> :)
>
> I thought existing use cases assume that the CC-VM can trigger page
> conversions at will without regard to a vTOM concept? It would be nice
> to have that address-map separation arrangement, but has not that ship
> already sailed?

Mmm. I am either confusing you too much or not following you :) Any page
can be converted; would the proposed arrangement require that
conversion-candidate pages are allocated from a specific pool?

There are two vTOMs - one in the IOMMU to decide on the C-bit for DMA
traffic (I use this one), and one in the VMSA ("VIRTUAL_TOM") for guest
memory (this exercise is not using it). Which one do you mean?

> [..]
>>> Would the device not just launch in "shared" mode until it is later
>>> converted to private? I am missing the detail of why passing the device
>>> on the command line requires that private memory be mapped early.
[..]
>> For the initial drop, I tell QEMU via "-device vfio-pci,x-tio=true" that
>> it is going to be private so there should be no massive conversion.
>
> That's a SEV-TIO RFC-specific hack, or a proposal?

Not sure at the moment :)

> An approach that aligns more closely with the VFIO operational model,
> where it maps and waits for guest faults / usages, is that QEMU would be
> told that the device is "bind capable", because the host is not in a
> position to assume how the guest will use the device. A "bind capable"
> device operates in shared mode unless and until the guest triggers
> private conversion.

True. I just started this exercise without the QEMU DiscardManager. Now
I rely on it, but it either needs to allow a dynamic flip from
discarded==private to discarded==shared (should do for now) or allow 3
states for guest pages.

[..]
> ...but we should not be creating kernel policy around test devices. What
> matters is real devices. Now, if it is likely that real / production
> devices will go into the TDISP ERROR state by not coalescing MSE + BME
> updates, then we need a solution.

True, but I do not even know who to ask this question :)

> Given it is unlikely that TDISP support will be widespread any time soon,
> it is likely tenable to assume TDISP-compatible drivers call a new:
>
>     pci_enable(pdev, PCI_ENABLE_TARGET | PCI_ENABLE_INITIATOR);
>
> ...or something like that to coalesce command register writes.
>
> Otherwise, if that retrofit ends up being too much work or confusion,
> then the ROI of teaching the PCI core to recover from this scenario
> needs to be evaluated.

Agree.

[..]
>> Well, I could also tweak OVMF to make necessary calls to the PSP and
>> hack QEMU to postpone the command register updates to get this going,
>> just a matter of ugliness.
>
> Per above, the tradeoff should be in ROI, not ugliness. I don't see how
> OVMF helps when devices might be being virtually hotplugged or reset.

I have no clue how exactly hotplug works on x86 - isn't the BIOS playing
a role in it? Thanks,
Alexey Kardashevskiy wrote:
[..]
> > I thought existing use cases assume that the CC-VM can trigger page
> > conversions at will without regard to a vTOM concept? It would be nice
> > to have that address-map separation arrangement, but has not that ship
> > already sailed?
>
> Mmm. I am either confusing you too much or not following you :) Any page
> can be converted; would the proposed arrangement require that
> conversion-candidate pages are allocated from a specific pool?
>
> There are two vTOMs - one in the IOMMU to decide on the C-bit for DMA
> traffic (I use this one), and one in the VMSA ("VIRTUAL_TOM") for guest
> memory (this exercise is not using it). Which one do you mean?

Dunno, you introduced the vTOM term. Suffice to say, if any page can be
converted in this model then I was confused.

[..]
> > That's a SEV-TIO RFC-specific hack, or a proposal?
>
> Not sure at the moment :)

Ok, without more information it looks like a SEV-TIO shortcut.

> > An approach that aligns more closely with the VFIO operational model,
> > where it maps and waits for guest faults / usages, is that QEMU would be
> > told that the device is "bind capable", because the host is not in a
> > position to assume how the guest will use the device. A "bind capable"
> > device operates in shared mode unless and until the guest triggers
> > private conversion.
>
> True. I just started this exercise without the QEMU DiscardManager. Now
> I rely on it, but it either needs to allow a dynamic flip from
> discarded==private to discarded==shared (should do for now) or allow 3
> states for guest pages.

As we talked about on the KernelSIG call, there is potentially a
guestmemfd proposal to handle in-place conversion without a
DiscardManager:

https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@google.com/

[..]
> > Per above, the tradeoff should be in ROI, not ugliness.
> > I don't see how OVMF helps when devices might be being virtually
> > hotplugged or reset.
>
> I have no clue how exactly hotplug works on x86 - isn't the BIOS playing
> a role in it? Thanks,

The hotplug controller can either be native PCIe or firmware managed.
Likely we would pick the path of least resistance for QEMU to facilitate
device conversion.
On Fri, Aug 23, 2024, Alexey Kardashevskiy wrote:
> Hi everyone,
>
> Here are some patches to enable SEV-TIO (aka TDISP, aka secure VFIO)
> on AMD Turin.
[..]
> Please comment. Thanks.

1. Use scripts/get_maintainer.pl
2. Fix your MUA to wrap closer to 80 chars
3. Explain the core design, e.g. roles and responsibilities, coordination
   between KVM, VFIO/IOMMUFD, userspace, etc.
> From: Alexey Kardashevskiy <aik@amd.com>
> Sent: Thursday, August 29, 2024 10:14 PM
>
> On 29/8/24 06:43, Dan Williams wrote:
> > Alexey Kardashevskiy wrote:
> >> Assumptions
> >> -----------
> >>
> >> This requires hotplugging into the VM vs passing the device via the
> >> command line, as VFIO maps all guest memory at the device init step,
> >> which is too soon: SNP LAUNCH UPDATE happens later and will fail if
> >> VFIO maps private memory before that.
> >
> > Would the device not just launch in "shared" mode until it is later
> > converted to private? I am missing the detail of why passing the device
> > on the command line requires that private memory be mapped early.
>
> A sequencing problem.
>
> QEMU "realizes" a VFIO device, it creates an iommufd instance which
> creates a domain and writes to a DTE (an IOMMU descriptor for a PCI
> BDFn). And the DTE is not updated after that. For the secure stuff, the
> DTE needs to be slightly different, so right then I tell IOMMUFD that it
> will handle private memory.
>
> Then, the same VFIO "realize" handler maps the guest memory in iommufd.
> I use the same flag (well, pointer to kvm) in the iommufd pinning code,
> private memory is pinned and mapped (and the related page state change
> happens as the guest memory is made guest-owned in the RMP).
>
> QEMU goes to machine_reset() and calls "SNP LAUNCH UPDATE" (the actual
> place changed recently, huh) and the latter will measure the guest and
> try making all guest memory private, but it already happened => error.
>
> I think I have to decouple the pinning and the IOMMU/DTE setting.
>

Assume there will be a new hwpt type which hints at special DTE setup at
attach time and connects to a guest memfd. It'd make sense to defer
mapping guest memory until after "SNP LAUNCH UPDATE" has completed for
devices attached to such a hwpt, as long as we document that restriction
clearly for the new type.
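A rough userspace-side sketch of the ordering suggested above, assuming
a hwpt allocation flag along the lines of the series' "RFC:
iommu/iommufd/amd: Add IOMMU_HWPT_TRUSTED flag" patch. The flag spelling
and value below are placeholders taken from the patch title, and the
deferral itself is a usage convention rather than anything enforced by
the uAPI shown here:

```c
#include <linux/iommufd.h>
#include <stdint.h>
#include <sys/ioctl.h>

#ifndef IOMMU_HWPT_TRUSTED
#define IOMMU_HWPT_TRUSTED (1u << 2)	/* placeholder, not the real uAPI value */
#endif

/* Allocate a hwpt flagged for private ("trusted") DTE setup at attach time */
static int alloc_trusted_hwpt(int iommufd, uint32_t dev_id, uint32_t ioas_id,
			      uint32_t *out_hwpt_id)
{
	struct iommu_hwpt_alloc cmd = {
		.size = sizeof(cmd),
		.flags = IOMMU_HWPT_TRUSTED,	/* hypothetical, per the RFC patch */
		.dev_id = dev_id,
		.pt_id = ioas_id,
	};

	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &cmd))
		return -1;
	*out_hwpt_id = cmd.out_hwpt_id;
	return 0;
}

/*
 * Intended ordering, per the suggestion above:
 *   1. alloc_trusted_hwpt() and attach the VFIO device to it;
 *   2. complete SNP LAUNCH UPDATE so guest memory becomes private;
 *   3. only then map the guest-memfd-backed ranges into the IOAS.
 */
```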