[RFC,00/21] Secure VFIO, TDISP, SEV TIO

Message ID	20240823132137.336874-1-aik@amd.com (mailing list archive)
Headers	show Received: from NAM12-MW2-obe.outbound.protection.outlook.com (mail-mw2nam12on2051.outbound.protection.outlook.com [40.107.244.51]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 709FC185945; Fri, 23 Aug 2024 13:22:29 +0000 (UTC) Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C From: Alexey Kardashevskiy <aik@amd.com> To: <kvm@vger.kernel.org> CC: <iommu@lists.linux.dev>, <linux-coco@lists.linux.dev>, <linux-pci@vger.kernel.org>, Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>, Alex Williamson <alex.williamson@redhat.com>, Dan Williams <dan.j.williams@intel.com>, <pratikrajesh.sampat@amd.com>, <michael.day@amd.com>, <david.kaplan@amd.com>, <dhaval.giani@amd.com>, Santosh Shukla <santosh.shukla@amd.com>, Tom Lendacky <thomas.lendacky@amd.com>, Michael Roth <michael.roth@amd.com>, "Alexander Graf" <agraf@suse.de>, Nikunj A Dadhania <nikunj@amd.com>, Vasant Hegde <vasant.hegde@amd.com>, Lukas Wunner <lukas@wunner.de>, Alexey Kardashevskiy <aik@amd.com> Subject: [RFC PATCH 00/21] Secure VFIO, TDISP, SEV TIO Date: Fri, 23 Aug 2024 23:21:14 +1000 Message-ID: <20240823132137.336874-1-aik@amd.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain
Series	Secure VFIO, TDISP, SEV TIO \| expand [RFC,00/21] Secure VFIO, TDISP, SEV TIO [RFC,01/21] tsm-report: Rename module to reflect what it does [RFC,02/21] pci/doe: Define protocol types and make those public [RFC,03/21] pci: Define TEE-IO bit in PCIe device capabilities [RFC,04/21] PCI/IDE: Define Integrity and Data Encryption (IDE) extended capability [RFC,05/21] crypto/ccp: Make some SEV helpers public [RFC,06/21] crypto: ccp: Enable SEV-TIO feature in the PSP when supported [RFC,07/21] pci/tdisp: Introduce tsm module [RFC,08/21] crypto/ccp: Implement SEV TIO firmware interface [RFC,09/21] kvm: Export kvm_vm_set_mem_attributes [RFC,10/21] vfio: Export helper to get vfio_device from fd [RFC,11/21] KVM: SEV: Add TIO VMGEXIT and bind TDI [RFC,12/21] KVM: IOMMUFD: MEMFD: Map private pages [RFC,13/21] KVM: X86: Handle private MMIO as shared [RFC,14/21] RFC: iommu/iommufd/amd: Add IOMMU_HWPT_TRUSTED flag, tweak DTE's DomainID, IOTLB [RFC,15/21] coco/sev-guest: Allow multiple source files in the driver [RFC,16/21] coco/sev-guest: Make SEV-to-PSP request helpers public [RFC,17/21] coco/sev-guest: Implement the guest side of things [RFC,18/21] RFC: pci: Add BUS_NOTIFY_PCI_BUS_MASTER event [RFC,19/21] sev-guest: Stop changing encrypted page state for TDISP devices [RFC,20/21] pci: Allow encrypted MMIO mapping via sysfs [RFC,21/21] pci: Define pci_iomap_range_encrypted

Alexey Kardashevskiy Aug. 23, 2024, 1:21 p.m. UTC

Hi everyone,

Here are some patches to enable SEV-TIO (aka TDISP, aka secure VFIO)
on AMD Turin.

The basic idea is to allow DMA to/from encrypted memory of SNP VMs and
secure MMIO in SNP VMs (i.e. with Cbit set) as well.

These include both guest and host support. QEMU also requires
some patches, links below.

The patches are organized as:
01..06 - preparing the host OS;
07 - new TSM module;
08 - add PSP SEV TIO ABI (IDE should start working at this point);
09..14 - add KVM support (TDI binding, MMIO faulting, etc);
15..19 - guest changes (the rest of SEV TIO ABI, DMA, secure MMIO).
20, 21 - some helpers for guest OS to use encrypted MMIO

This is based on a merge of
ee3248f9f8d6 Lukas Wunner spdm: Allow control of next requester nonce
through sysfs
85ef1ac03941 (AMDESE/snp-host-latest) 4 days ago Michael Roth [TEMP] KVM: guest_memfd: Update gmem_prep are hook to handle partially-allocated folios


Please comment. Thanks.

Thanks,


SEV TIO tree prototype
======================

Goal
----

Support secure PCI devices pass through to confidential VMs.
The support is defined by PCIe 6, SPDM, TDISP (not AMD) and SEV TIO
specification (by AMD).

SEV TIO spec:
https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/specifications/58271_0_70.pdf
Whitepaper:
https://www.amd.com/content/dam/amd/en/documents/epyc-business-docs/white-papers/sev-tio-whitepaper.pdf
GHCB spec update is coming.

The changeset adds:
- a generic TSM.ko module;
makes necessary changes to:
- ccp (interface to AMD secure platform processor, "PSP", works on the
host),
- sev-guest (interace to PSP for confidential SNP guest),
- kvm_amd,
- memfd,
- vfio kvm device,
- kvm-x86.

Acronyms
--------

TEE - Trusted Execution Environments, a concept of managing trust
between the host
and devices
TSM - TEE Security Manager (TSM), an entity which ensures
security on the host
PSP - AMD platform secure processor (also "ASP", "AMD-SP"), acts
as TSM on AMD.
SEV TIO - the TIO protocol implemented by the PSP and used by
the host
GHCB - guest/host communication block - a protocol for
guest-to-host communication
via a shared page

CMA, IDE, TDISP,
TVM, DSM, SPDM, DOE

Code
----

Written with AMD SEV SNP in mind, TSM is the PSP and
therefore no much of IDE/TDISP
is left for the host or guest OS.

Add a common module to expose various data objects in
the same way in host and
guest OS.

Provide a know on the host to enable IDE encryption.

Add another version of Guest Request for secure
guest<->PSP communication.

Enable secure DMA by:
- configuring vTOM in a secure DTE via the PSP to cover
the entire guest RAM;
- mapping all private memory pages in IOMMU just like
as they were shared
(requires hacking iommufd);
- skipping various enforcements of non-SME or
SWIOTLB in the guest;

No mixed share+private DMA supported within the
same IOMMU.

Enable secure MMIO by:
- configuring RMP entries via the PSP;
- adding necessary helpers for mapping MMIO with
the Cbit set;
- hacking the KVM #PF handler to allow private
MMIO failts.

Based on the latest upstream KVM (at the
moment it is kvm-coco-queue).


Workflow
--------

1. Boot host OS.
2. "Connect" the physical device.
3. Bind a VF to VFIO-PCI.
4. Run QEMU _without_ the device yet.
5. Hotplug the VF to the VM.
6. (if not already) Load the device driver.
7. Right after the BusMaster is enabled,
tsm.ko performs secure DMA and MMIO setup.
8. Run tests, for example:
sudo ./pcimem/pcimem
/sys/bus/pci/devices/0000\:01\:00.0/resource4_enc
0 w*4 0xabcd


Assumptions
-----------

This requires hotpligging into the VM vs
passing the device via the command line as
VFIO maps all guest memory as the device init
step which is too soon as
SNP LAUNCH UPDATE happens later and will fail
if VFIO maps private memory before that.

This requires the BME hack as MMIO and
BusMaster enable bits cannot be 0 after MMIO
validation is done and there are moments in
the guest OS booting process when this
appens.

SVSM could help addressing these (not
implemented at the moment).

QEMU advertises TEE-IO capability to the VM.
An additional x-tio flag is added to
vfio-pci.


TODOs
-----

Deal with PCI reset. Hot unplug+plug? Power
states too.

Do better generalization, the current code
heavily uses SEV TIO defined
structures in supposedly generic code.

Fix the documentation comments of SEV TIO structures.


Git trees
---------

https://github.com/AMDESE/linux-kvm/tree/tio
https://github.com/AMDESE/qemu/tree/tio




Alexey Kardashevskiy (21):
  tsm-report: Rename module to reflect what it does
  pci/doe: Define protocol types and make those public
  pci: Define TEE-IO bit in PCIe device capabilities
  PCI/IDE: Define Integrity and Data Encryption (IDE) extended
    capability
  crypto/ccp: Make some SEV helpers public
  crypto: ccp: Enable SEV-TIO feature in the PSP when supported
  pci/tdisp: Introduce tsm module
  crypto/ccp: Implement SEV TIO firmware interface
  kvm: Export kvm_vm_set_mem_attributes
  vfio: Export helper to get vfio_device from fd
  KVM: SEV: Add TIO VMGEXIT and bind TDI
  KVM: IOMMUFD: MEMFD: Map private pages
  KVM: X86: Handle private MMIO as shared
  RFC: iommu/iommufd/amd: Add IOMMU_HWPT_TRUSTED flag, tweak DTE's
    DomainID, IOTLB
  coco/sev-guest: Allow multiple source files in the driver
  coco/sev-guest: Make SEV-to-PSP request helpers public
  coco/sev-guest: Implement the guest side of things
  RFC: pci: Add BUS_NOTIFY_PCI_BUS_MASTER event
  sev-guest: Stop changing encrypted page state for TDISP devices
  pci: Allow encrypted MMIO mapping via sysfs
  pci: Define pci_iomap_range_encrypted

 drivers/crypto/ccp/Makefile                              |    2 +
 drivers/pci/Makefile                                     |    1 +
 drivers/virt/coco/Makefile                               |    3 +-
 drivers/virt/coco/sev-guest/Makefile                     |    1 +
 arch/x86/include/asm/kvm-x86-ops.h                       |    2 +
 arch/x86/include/asm/kvm_host.h                          |    2 +
 arch/x86/include/asm/sev.h                               |   23 +
 arch/x86/include/uapi/asm/svm.h                          |    2 +
 arch/x86/kvm/svm/svm.h                                   |    2 +
 drivers/crypto/ccp/sev-dev-tio.h                         |  105 ++
 drivers/crypto/ccp/sev-dev.h                             |    4 +
 drivers/iommu/amd/amd_iommu_types.h                      |    2 +
 drivers/iommu/iommufd/io_pagetable.h                     |    3 +
 drivers/iommu/iommufd/iommufd_private.h                  |    4 +
 drivers/virt/coco/sev-guest/sev-guest.h                  |   56 +
 include/asm-generic/pci_iomap.h                          |    4 +
 include/linux/device.h                                   |    5 +
 include/linux/device/bus.h                               |    3 +
 include/linux/dma-direct.h                               |    4 +
 include/linux/iommufd.h                                  |    6 +
 include/linux/kvm_host.h                                 |   70 +
 include/linux/pci-doe.h                                  |    4 +
 include/linux/pci-ide.h                                  |   18 +
 include/linux/pci.h                                      |    2 +-
 include/linux/psp-sev.h                                  |  116 +-
 include/linux/swiotlb.h                                  |    4 +
 include/linux/tsm-report.h                               |  113 ++
 include/linux/tsm.h                                      |  337 +++--
 include/linux/vfio.h                                     |    1 +
 include/uapi/linux/iommufd.h                             |    1 +
 include/uapi/linux/kvm.h                                 |   29 +
 include/uapi/linux/pci_regs.h                            |   77 +-
 include/uapi/linux/psp-sev.h                             |    4 +-
 arch/x86/coco/sev/core.c                                 |   11 +
 arch/x86/kvm/mmu/mmu.c                                   |    6 +-
 arch/x86/kvm/svm/sev.c                                   |  217 +++
 arch/x86/kvm/svm/svm.c                                   |    3 +
 arch/x86/kvm/x86.c                                       |   12 +
 arch/x86/mm/mem_encrypt.c                                |    5 +
 arch/x86/virt/svm/sev.c                                  |   23 +-
 drivers/crypto/ccp/sev-dev-tio.c                         | 1565 ++++++++++++++++++++
 drivers/crypto/ccp/sev-dev-tsm.c                         |  397 +++++
 drivers/crypto/ccp/sev-dev.c                             |   87 +-
 drivers/iommu/amd/iommu.c                                |   20 +-
 drivers/iommu/iommufd/hw_pagetable.c                     |    4 +
 drivers/iommu/iommufd/io_pagetable.c                     |    2 +
 drivers/iommu/iommufd/main.c                             |   21 +
 drivers/iommu/iommufd/pages.c                            |   94 +-
 drivers/pci/doe.c                                        |    2 -
 drivers/pci/ide.c                                        |  186 +++
 drivers/pci/iomap.c                                      |   24 +
 drivers/pci/mmap.c                                       |   11 +-
 drivers/pci/pci-sysfs.c                                  |   27 +-
 drivers/pci/pci.c                                        |    3 +
 drivers/pci/proc.c                                       |    2 +-
 drivers/vfio/vfio_main.c                                 |   13 +
 drivers/virt/coco/sev-guest/{sev-guest.c => sev_guest.c} |   68 +-
 drivers/virt/coco/sev-guest/sev_guest_tio.c              |  513 +++++++
 drivers/virt/coco/tdx-guest/tdx-guest.c                  |    8 +-
 drivers/virt/coco/tsm-report.c                           |  512 +++++++
 drivers/virt/coco/tsm.c                                  | 1542 ++++++++++++++-----
 virt/kvm/guest_memfd.c                                   |   40 +
 virt/kvm/kvm_main.c                                      |    4 +-
 virt/kvm/vfio.c                                          |  197 ++-
 Documentation/virt/coco/tsm.rst                          |   62 +
 MAINTAINERS                                              |    4 +-
 arch/x86/kvm/Kconfig                                     |    1 +
 drivers/pci/Kconfig                                      |    4 +
 drivers/virt/coco/Kconfig                                |   11 +
 69 files changed, 6163 insertions(+), 548 deletions(-)
 create mode 100644 drivers/crypto/ccp/sev-dev-tio.h
 create mode 100644 drivers/virt/coco/sev-guest/sev-guest.h
 create mode 100644 include/linux/pci-ide.h
 create mode 100644 include/linux/tsm-report.h
 create mode 100644 drivers/crypto/ccp/sev-dev-tio.c
 create mode 100644 drivers/crypto/ccp/sev-dev-tsm.c
 create mode 100644 drivers/pci/ide.c
 rename drivers/virt/coco/sev-guest/{sev-guest.c => sev_guest.c} (96%)
 create mode 100644 drivers/virt/coco/sev-guest/sev_guest_tio.c
 create mode 100644 drivers/virt/coco/tsm-report.c
 create mode 100644 Documentation/virt/coco/tsm.rst

Dan Williams Aug. 28, 2024, 8:43 p.m. UTC | #1

Alexey Kardashevskiy wrote:
> Hi everyone,
> 
> Here are some patches to enable SEV-TIO (aka TDISP, aka secure VFIO)
> on AMD Turin.
> 
> The basic idea is to allow DMA to/from encrypted memory of SNP VMs and
> secure MMIO in SNP VMs (i.e. with Cbit set) as well.
> 
> These include both guest and host support. QEMU also requires
> some patches, links below.
> 
> The patches are organized as:
> 01..06 - preparing the host OS;
> 07 - new TSM module;
> 08 - add PSP SEV TIO ABI (IDE should start working at this point);
> 09..14 - add KVM support (TDI binding, MMIO faulting, etc);
> 15..19 - guest changes (the rest of SEV TIO ABI, DMA, secure MMIO).
> 20, 21 - some helpers for guest OS to use encrypted MMIO
> 
> This is based on a merge of
> ee3248f9f8d6 Lukas Wunner spdm: Allow control of next requester nonce
> through sysfs
> 85ef1ac03941 (AMDESE/snp-host-latest) 4 days ago Michael Roth [TEMP] KVM: guest_memfd: Update gmem_prep are hook to handle partially-allocated folios
> 
> 
> Please comment. Thanks.

This cover letter is something I can read after having been in and
around this space for a while, but I wonder how much of it makes sense
to casual reviewers?

> 
> Thanks,
> 
> 
> SEV TIO tree prototype
> ======================
[..]
> Code
> ----
> 
> Written with AMD SEV SNP in mind, TSM is the PSP and
> therefore no much of IDE/TDISP
> is left for the host or guest OS.
> 
> Add a common module to expose various data objects in
> the same way in host and
> guest OS.
> 
> Provide a know on the host to enable IDE encryption.
> 
> Add another version of Guest Request for secure
> guest<->PSP communication.
> 
> Enable secure DMA by:
> - configuring vTOM in a secure DTE via the PSP to cover
> the entire guest RAM;
> - mapping all private memory pages in IOMMU just like
> as they were shared
> (requires hacking iommufd);

What kind of hack are we talking about here? An upstream suitable
change, or something that needs quite a bit more work to be done
properly?

I jumped ahead to read Jason's reaction but please do at least provide a
map the controversy in the cover letter, something like "see patch 12 for
details".

> - skipping various enforcements of non-SME or
> SWIOTLB in the guest;

Is this based on some concept of private vs shared mode devices?

> No mixed share+private DMA supported within the
> same IOMMU.

What does this mean? A device may not have mixed mappings (makes sense),
or an IOMMU can not host devices that do not all agree on whether DMA is
private or shared?

> Enable secure MMIO by:
> - configuring RMP entries via the PSP;
> - adding necessary helpers for mapping MMIO with
> the Cbit set;
> - hacking the KVM #PF handler to allow private
> MMIO failts.
> 
> Based on the latest upstream KVM (at the
> moment it is kvm-coco-queue).

Here is where I lament that kvm-coco-queue is not run like akpm/mm where
it is possible to try out "yesterday's mm". Perhaps this is an area to
collaborate on kvm-coco-queue snapshots to help with testing.

> Workflow
> --------
> 
> 1. Boot host OS.
> 2. "Connect" the physical device.
> 3. Bind a VF to VFIO-PCI.
> 4. Run QEMU _without_ the device yet.
> 5. Hotplug the VF to the VM.
> 6. (if not already) Load the device driver.
> 7. Right after the BusMaster is enabled,
> tsm.ko performs secure DMA and MMIO setup.
> 8. Run tests, for example:
> sudo ./pcimem/pcimem
> /sys/bus/pci/devices/0000\:01\:00.0/resource4_enc
> 0 w*4 0xabcd
> 
> 
> Assumptions
> -----------
> 
> This requires hotpligging into the VM vs
> passing the device via the command line as
> VFIO maps all guest memory as the device init
> step which is too soon as
> SNP LAUNCH UPDATE happens later and will fail
> if VFIO maps private memory before that.

Would the device not just launch in "shared" mode until it is later
converted to private? I am missing the detail of why passing the device
on the command line requires that private memory be mapped early.

That said, the implication that private device assignment requires
hotplug events is a useful property. This matches nicely with initial
thoughts that device conversion events are violent and might as well be
unplug/replug events to match all the assumptions around what needs to
be updated.

> This requires the BME hack as MMIO and

Not sure what the "BME hack" is, I guess this is foreshadowing for later
in this story.

> BusMaster enable bits cannot be 0 after MMIO
> validation is done

It would be useful to call out what is a TDISP requirement, vs
device-specific DSM vs host-specific TSM requirement. In this case I
assume you are referring to PCI 6.2 11.2.6 where it notes that TDIs must
enter the TDISP ERROR state if BME is cleared after the device is
locked?

...but this begs the question of whether it needs to be avoided outright
or handled as an error recovery case dependending on policy.

> the guest OS booting process when this
> appens.
> 
> SVSM could help addressing these (not
> implemented at the moment).

At first though avoiding SVSM entanglements where the kernel can be
enlightened shoud be the policy. I would only expect SVSM hacks to cover
for legacy OSes that will never be TDISP enlightened, but in that case
we are likely talking about fully unaware L2. Lets assume fully
enlightened L1 for now.

> QEMU advertises TEE-IO capability to the VM.
> An additional x-tio flag is added to
> vfio-pci.
> 
> 
> TODOs
> -----
> 
> Deal with PCI reset. Hot unplug+plug? Power
> states too.
> 
> Do better generalization, the current code
> heavily uses SEV TIO defined
> structures in supposedly generic code.
> 
> Fix the documentation comments of SEV TIO structures.

Hey, it's a start. I appreciate the "release early" aspect of this
posting.

> Git trees
> ---------
> 
> https://github.com/AMDESE/linux-kvm/tree/tio
> https://github.com/AMDESE/qemu/tree/tio
[..]
> 
> 
> Alexey Kardashevskiy (21):
>   tsm-report: Rename module to reflect what it does
>   pci/doe: Define protocol types and make those public
>   pci: Define TEE-IO bit in PCIe device capabilities
>   PCI/IDE: Define Integrity and Data Encryption (IDE) extended
>     capability
>   crypto/ccp: Make some SEV helpers public
>   crypto: ccp: Enable SEV-TIO feature in the PSP when supported
>   pci/tdisp: Introduce tsm module
>   crypto/ccp: Implement SEV TIO firmware interface
>   kvm: Export kvm_vm_set_mem_attributes
>   vfio: Export helper to get vfio_device from fd
>   KVM: SEV: Add TIO VMGEXIT and bind TDI
>   KVM: IOMMUFD: MEMFD: Map private pages
>   KVM: X86: Handle private MMIO as shared
>   RFC: iommu/iommufd/amd: Add IOMMU_HWPT_TRUSTED flag, tweak DTE's
>     DomainID, IOTLB
>   coco/sev-guest: Allow multiple source files in the driver
>   coco/sev-guest: Make SEV-to-PSP request helpers public
>   coco/sev-guest: Implement the guest side of things
>   RFC: pci: Add BUS_NOTIFY_PCI_BUS_MASTER event
>   sev-guest: Stop changing encrypted page state for TDISP devices
>   pci: Allow encrypted MMIO mapping via sysfs
>   pci: Define pci_iomap_range_encrypted
> 
>  drivers/crypto/ccp/Makefile                              |    2 +
>  drivers/pci/Makefile                                     |    1 +
>  drivers/virt/coco/Makefile                               |    3 +-
>  drivers/virt/coco/sev-guest/Makefile                     |    1 +
>  arch/x86/include/asm/kvm-x86-ops.h                       |    2 +
>  arch/x86/include/asm/kvm_host.h                          |    2 +
>  arch/x86/include/asm/sev.h                               |   23 +
>  arch/x86/include/uapi/asm/svm.h                          |    2 +
>  arch/x86/kvm/svm/svm.h                                   |    2 +
>  drivers/crypto/ccp/sev-dev-tio.h                         |  105 ++
>  drivers/crypto/ccp/sev-dev.h                             |    4 +
>  drivers/iommu/amd/amd_iommu_types.h                      |    2 +
>  drivers/iommu/iommufd/io_pagetable.h                     |    3 +
>  drivers/iommu/iommufd/iommufd_private.h                  |    4 +
>  drivers/virt/coco/sev-guest/sev-guest.h                  |   56 +
>  include/asm-generic/pci_iomap.h                          |    4 +
>  include/linux/device.h                                   |    5 +
>  include/linux/device/bus.h                               |    3 +
>  include/linux/dma-direct.h                               |    4 +
>  include/linux/iommufd.h                                  |    6 +
>  include/linux/kvm_host.h                                 |   70 +
>  include/linux/pci-doe.h                                  |    4 +
>  include/linux/pci-ide.h                                  |   18 +
>  include/linux/pci.h                                      |    2 +-
>  include/linux/psp-sev.h                                  |  116 +-
>  include/linux/swiotlb.h                                  |    4 +
>  include/linux/tsm-report.h                               |  113 ++
>  include/linux/tsm.h                                      |  337 +++--
>  include/linux/vfio.h                                     |    1 +
>  include/uapi/linux/iommufd.h                             |    1 +
>  include/uapi/linux/kvm.h                                 |   29 +
>  include/uapi/linux/pci_regs.h                            |   77 +-
>  include/uapi/linux/psp-sev.h                             |    4 +-
>  arch/x86/coco/sev/core.c                                 |   11 +
>  arch/x86/kvm/mmu/mmu.c                                   |    6 +-
>  arch/x86/kvm/svm/sev.c                                   |  217 +++
>  arch/x86/kvm/svm/svm.c                                   |    3 +
>  arch/x86/kvm/x86.c                                       |   12 +
>  arch/x86/mm/mem_encrypt.c                                |    5 +
>  arch/x86/virt/svm/sev.c                                  |   23 +-
>  drivers/crypto/ccp/sev-dev-tio.c                         | 1565 ++++++++++++++++++++
>  drivers/crypto/ccp/sev-dev-tsm.c                         |  397 +++++
>  drivers/crypto/ccp/sev-dev.c                             |   87 +-
>  drivers/iommu/amd/iommu.c                                |   20 +-
>  drivers/iommu/iommufd/hw_pagetable.c                     |    4 +
>  drivers/iommu/iommufd/io_pagetable.c                     |    2 +
>  drivers/iommu/iommufd/main.c                             |   21 +
>  drivers/iommu/iommufd/pages.c                            |   94 +-
>  drivers/pci/doe.c                                        |    2 -
>  drivers/pci/ide.c                                        |  186 +++
>  drivers/pci/iomap.c                                      |   24 +
>  drivers/pci/mmap.c                                       |   11 +-
>  drivers/pci/pci-sysfs.c                                  |   27 +-
>  drivers/pci/pci.c                                        |    3 +
>  drivers/pci/proc.c                                       |    2 +-
>  drivers/vfio/vfio_main.c                                 |   13 +
>  drivers/virt/coco/sev-guest/{sev-guest.c => sev_guest.c} |   68 +-
>  drivers/virt/coco/sev-guest/sev_guest_tio.c              |  513 +++++++
>  drivers/virt/coco/tdx-guest/tdx-guest.c                  |    8 +-
>  drivers/virt/coco/tsm-report.c                           |  512 +++++++
>  drivers/virt/coco/tsm.c                                  | 1542 ++++++++++++++-----
>  virt/kvm/guest_memfd.c                                   |   40 +
>  virt/kvm/kvm_main.c                                      |    4 +-
>  virt/kvm/vfio.c                                          |  197 ++-
>  Documentation/virt/coco/tsm.rst                          |   62 +
>  MAINTAINERS                                              |    4 +-
>  arch/x86/kvm/Kconfig                                     |    1 +
>  drivers/pci/Kconfig                                      |    4 +
>  drivers/virt/coco/Kconfig                                |   11 +
>  69 files changed, 6163 insertions(+), 548 deletions(-)
>  create mode 100644 drivers/crypto/ccp/sev-dev-tio.h
>  create mode 100644 drivers/virt/coco/sev-guest/sev-guest.h
>  create mode 100644 include/linux/pci-ide.h
>  create mode 100644 include/linux/tsm-report.h
>  create mode 100644 drivers/crypto/ccp/sev-dev-tio.c
>  create mode 100644 drivers/crypto/ccp/sev-dev-tsm.c
>  create mode 100644 drivers/pci/ide.c
>  rename drivers/virt/coco/sev-guest/{sev-guest.c => sev_guest.c} (96%)
>  create mode 100644 drivers/virt/coco/sev-guest/sev_guest_tio.c
>  create mode 100644 drivers/virt/coco/tsm-report.c
>  create mode 100644 Documentation/virt/coco/tsm.rst
> 
> -- 
> 2.45.2
>

Alexey Kardashevskiy Aug. 29, 2024, 2:13 p.m. UTC | #2

On 29/8/24 06:43, Dan Williams wrote:
> Alexey Kardashevskiy wrote:
>> Hi everyone,
>>
>> Here are some patches to enable SEV-TIO (aka TDISP, aka secure VFIO)
>> on AMD Turin.
>>
>> The basic idea is to allow DMA to/from encrypted memory of SNP VMs and
>> secure MMIO in SNP VMs (i.e. with Cbit set) as well.
>>
>> These include both guest and host support. QEMU also requires
>> some patches, links below.
>>
>> The patches are organized as:
>> 01..06 - preparing the host OS;
>> 07 - new TSM module;
>> 08 - add PSP SEV TIO ABI (IDE should start working at this point);
>> 09..14 - add KVM support (TDI binding, MMIO faulting, etc);
>> 15..19 - guest changes (the rest of SEV TIO ABI, DMA, secure MMIO).
>> 20, 21 - some helpers for guest OS to use encrypted MMIO
>>
>> This is based on a merge of
>> ee3248f9f8d6 Lukas Wunner spdm: Allow control of next requester nonce
>> through sysfs
>> 85ef1ac03941 (AMDESE/snp-host-latest) 4 days ago Michael Roth [TEMP] KVM: guest_memfd: Update gmem_prep are hook to handle partially-allocated folios
>>
>>
>> Please comment. Thanks.
> 
> This cover letter is something I can read after having been in and
> around this space for a while, but I wonder how much of it makes sense
> to casual reviewers?
> 
>>
>> Thanks,
>>
>>
>> SEV TIO tree prototype
>> ======================
> [..]
>> Code
>> ----
>>
>> Written with AMD SEV SNP in mind, TSM is the PSP and
>> therefore no much of IDE/TDISP
>> is left for the host or guest OS.
>>
>> Add a common module to expose various data objects in
>> the same way in host and
>> guest OS.
>>
>> Provide a know on the host to enable IDE encryption.
>>
>> Add another version of Guest Request for secure
>> guest<->PSP communication.
>>
>> Enable secure DMA by:
>> - configuring vTOM in a secure DTE via the PSP to cover
>> the entire guest RAM;
>> - mapping all private memory pages in IOMMU just like
>> as they were shared
>> (requires hacking iommufd);
> 
> What kind of hack are we talking about here? An upstream suitable
> change, or something that needs quite a bit more work to be done
> properly?

Right now it is hacking IOMMUFD to go to the KVM for 
private_gfn->host_pfn. As I am being told in this thread, VFIO DMA 
map/unmap needs to be taught to accept {memfd, offset}.


> I jumped ahead to read Jason's reaction but please do at least provide a
> map the controversy in the cover letter, something like "see patch 12 for
> details".

Yeah, noticed that, thanks, appreciated!

>> - skipping various enforcements of non-SME or
>> SWIOTLB in the guest;
> 
> Is this based on some concept of private vs shared mode devices?
> 
>> No mixed share+private DMA supported within the
>> same IOMMU.
> 
> What does this mean? A device may not have mixed mappings (makes sense),

Currently devices do not have an idea about private host memory (but it 
is being worked on afaik).

> or an IOMMU can not host devices that do not all agree on whether DMA is
> private or shared?

The hardware allows that via hardware-assisted vIOMMU and I/O page 
tables in the guest with C-bit takes into accound by the IOMMU but the 
software support is missing right now. So for this initial drop, vTOM is 
used for DMA - this thing says "everything below <addr> is private, 
above <addr> - shared" so nothing needs to bother with the C-bit, and in 
my exercise I set the <addr> to the allowed maximum.

So each IOMMUFD instance in the VM is either "all private mappings" or 
"all shared". Could be half/half by moving that <addr> :)


>> Enable secure MMIO by:
>> - configuring RMP entries via the PSP;
>> - adding necessary helpers for mapping MMIO with
>> the Cbit set;
>> - hacking the KVM #PF handler to allow private
>> MMIO failts.
>>
>> Based on the latest upstream KVM (at the
>> moment it is kvm-coco-queue).
> 
> Here is where I lament that kvm-coco-queue is not run like akpm/mm where
> it is possible to try out "yesterday's mm". Perhaps this is an area to
> collaborate on kvm-coco-queue snapshots to help with testing.

Yeah this more an idea of what it is based on, I normally push a tested 
branch somewhere on github, just to eliminate uncertainty.

> 
>> Workflow
>> --------
>>
>> 1. Boot host OS.
>> 2. "Connect" the physical device.
>> 3. Bind a VF to VFIO-PCI.
>> 4. Run QEMU _without_ the device yet.
>> 5. Hotplug the VF to the VM.
>> 6. (if not already) Load the device driver.
>> 7. Right after the BusMaster is enabled,
>> tsm.ko performs secure DMA and MMIO setup.
>> 8. Run tests, for example:
>> sudo ./pcimem/pcimem
>> /sys/bus/pci/devices/0000\:01\:00.0/resource4_enc
>> 0 w*4 0xabcd
>>
>>
>> Assumptions
>> -----------
>>
>> This requires hotpligging into the VM vs
>> passing the device via the command line as
>> VFIO maps all guest memory as the device init
>> step which is too soon as
>> SNP LAUNCH UPDATE happens later and will fail
>> if VFIO maps private memory before that.
> 
> Would the device not just launch in "shared" mode until it is later
> converted to private? I am missing the detail of why passing the device
> on the command line requires that private memory be mapped early.

A sequencing problem.

QEMU "realizes" a VFIO device, it creates an iommufd instance which 
creates a domain and writes to a DTE (a IOMMU descriptor for PCI BDFn). 
And DTE is not updated after than. For secure stuff, DTE needs to be 
slightly different. So right then I tell IOMMUFD that it will handle 
private memory.

Then, the same VFIO "realize" handler maps the guest memory in iommufd. 
I use the same flag (well, pointer to kvm) in the iommufd pinning code, 
private memory is pinned and mapped (and related page state change 
happens as the guest memory is made guest-owned in RMP).

QEMU goes to machine_reset() and calls "SNP LAUNCH UPDATE" (the actual 
place changed recenly, huh) and the latter will measure the guest and 
try making all guest memory private but it already happened => error.

I think I have to decouple the pinning and the IOMMU/DTE setting.

> That said, the implication that private device assignment requires
> hotplug events is a useful property. This matches nicely with initial
> thoughts that device conversion events are violent and might as well be
> unplug/replug events to match all the assumptions around what needs to
> be updated.

For the initial drop, I tell QEMU via "-device vfio-pci,x-tio=true" that 
it is going to be private so there should be no massive conversion.

> 
>> This requires the BME hack as MMIO and
> 
> Not sure what the "BME hack" is, I guess this is foreshadowing for later
> in this story.
 >
>> BusMaster enable bits cannot be 0 after MMIO
>> validation is done
> 
> It would be useful to call out what is a TDISP requirement, vs
> device-specific DSM vs host-specific TSM requirement. In this case I
> assume you are referring to PCI 6.2 11.2.6 where it notes that TDIs must

Oh there is 6.2 already.

> enter the TDISP ERROR state if BME is cleared after the device is
> locked?
> 
> ...but this begs the question of whether it needs to be avoided outright

Well, besides a couple of avoidable places (like testing INTx support 
which we know is not going to work on VFs anyway), a standard driver 
enables MSE first (and the value for the command register does not have 
1 for BME) and only then BME. TBH I do not think writing BME=0 when 
BME=0 already is "clearing" but my test device disagrees.

> or handled as an error recovery case dependending on policy.

Avoding seems more straight forward unless we actually want enlightened 
device drivers which want to examine the interface report before 
enabling the device. Not sure.

>> the guest OS booting process when this
>> appens.
>>
>> SVSM could help addressing these (not
>> implemented at the moment).
> 
> At first though avoiding SVSM entanglements where the kernel can be
> enlightened shoud be the policy. I would only expect SVSM hacks to cover
> for legacy OSes that will never be TDISP enlightened, but in that case
> we are likely talking about fully unaware L2. Lets assume fully
> enlightened L1 for now.

Well, I could also tweak OVMF to make necessary calls to the PSP and 
hack QEMU to postpone the command register updates to get this going, 
just a matter of ugliness.

>> QEMU advertises TEE-IO capability to the VM.
>> An additional x-tio flag is added to
>> vfio-pci.
>>
>>
>> TODOs
>> -----
>>
>> Deal with PCI reset. Hot unplug+plug? Power
>> states too.
>>
>> Do better generalization, the current code
>> heavily uses SEV TIO defined
>> structures in supposedly generic code.
>>
>> Fix the documentation comments of SEV TIO structures.
> 
> Hey, it's a start. I appreciate the "release early" aspect of this
> posting.

:)

Thanks,


>> Git trees
>> ---------
>>
>> https://github.com/AMDESE/linux-kvm/tree/tio
>> https://github.com/AMDESE/qemu/tree/tio
> [..]
>>
>>
>> Alexey Kardashevskiy (21):
>>    tsm-report: Rename module to reflect what it does
>>    pci/doe: Define protocol types and make those public
>>    pci: Define TEE-IO bit in PCIe device capabilities
>>    PCI/IDE: Define Integrity and Data Encryption (IDE) extended
>>      capability
>>    crypto/ccp: Make some SEV helpers public
>>    crypto: ccp: Enable SEV-TIO feature in the PSP when supported
>>    pci/tdisp: Introduce tsm module
>>    crypto/ccp: Implement SEV TIO firmware interface
>>    kvm: Export kvm_vm_set_mem_attributes
>>    vfio: Export helper to get vfio_device from fd
>>    KVM: SEV: Add TIO VMGEXIT and bind TDI
>>    KVM: IOMMUFD: MEMFD: Map private pages
>>    KVM: X86: Handle private MMIO as shared
>>    RFC: iommu/iommufd/amd: Add IOMMU_HWPT_TRUSTED flag, tweak DTE's
>>      DomainID, IOTLB
>>    coco/sev-guest: Allow multiple source files in the driver
>>    coco/sev-guest: Make SEV-to-PSP request helpers public
>>    coco/sev-guest: Implement the guest side of things
>>    RFC: pci: Add BUS_NOTIFY_PCI_BUS_MASTER event
>>    sev-guest: Stop changing encrypted page state for TDISP devices
>>    pci: Allow encrypted MMIO mapping via sysfs
>>    pci: Define pci_iomap_range_encrypted
>>
>>   drivers/crypto/ccp/Makefile                              |    2 +
>>   drivers/pci/Makefile                                     |    1 +
>>   drivers/virt/coco/Makefile                               |    3 +-
>>   drivers/virt/coco/sev-guest/Makefile                     |    1 +
>>   arch/x86/include/asm/kvm-x86-ops.h                       |    2 +
>>   arch/x86/include/asm/kvm_host.h                          |    2 +
>>   arch/x86/include/asm/sev.h                               |   23 +
>>   arch/x86/include/uapi/asm/svm.h                          |    2 +
>>   arch/x86/kvm/svm/svm.h                                   |    2 +
>>   drivers/crypto/ccp/sev-dev-tio.h                         |  105 ++
>>   drivers/crypto/ccp/sev-dev.h                             |    4 +
>>   drivers/iommu/amd/amd_iommu_types.h                      |    2 +
>>   drivers/iommu/iommufd/io_pagetable.h                     |    3 +
>>   drivers/iommu/iommufd/iommufd_private.h                  |    4 +
>>   drivers/virt/coco/sev-guest/sev-guest.h                  |   56 +
>>   include/asm-generic/pci_iomap.h                          |    4 +
>>   include/linux/device.h                                   |    5 +
>>   include/linux/device/bus.h                               |    3 +
>>   include/linux/dma-direct.h                               |    4 +
>>   include/linux/iommufd.h                                  |    6 +
>>   include/linux/kvm_host.h                                 |   70 +
>>   include/linux/pci-doe.h                                  |    4 +
>>   include/linux/pci-ide.h                                  |   18 +
>>   include/linux/pci.h                                      |    2 +-
>>   include/linux/psp-sev.h                                  |  116 +-
>>   include/linux/swiotlb.h                                  |    4 +
>>   include/linux/tsm-report.h                               |  113 ++
>>   include/linux/tsm.h                                      |  337 +++--
>>   include/linux/vfio.h                                     |    1 +
>>   include/uapi/linux/iommufd.h                             |    1 +
>>   include/uapi/linux/kvm.h                                 |   29 +
>>   include/uapi/linux/pci_regs.h                            |   77 +-
>>   include/uapi/linux/psp-sev.h                             |    4 +-
>>   arch/x86/coco/sev/core.c                                 |   11 +
>>   arch/x86/kvm/mmu/mmu.c                                   |    6 +-
>>   arch/x86/kvm/svm/sev.c                                   |  217 +++
>>   arch/x86/kvm/svm/svm.c                                   |    3 +
>>   arch/x86/kvm/x86.c                                       |   12 +
>>   arch/x86/mm/mem_encrypt.c                                |    5 +
>>   arch/x86/virt/svm/sev.c                                  |   23 +-
>>   drivers/crypto/ccp/sev-dev-tio.c                         | 1565 ++++++++++++++++++++
>>   drivers/crypto/ccp/sev-dev-tsm.c                         |  397 +++++
>>   drivers/crypto/ccp/sev-dev.c                             |   87 +-
>>   drivers/iommu/amd/iommu.c                                |   20 +-
>>   drivers/iommu/iommufd/hw_pagetable.c                     |    4 +
>>   drivers/iommu/iommufd/io_pagetable.c                     |    2 +
>>   drivers/iommu/iommufd/main.c                             |   21 +
>>   drivers/iommu/iommufd/pages.c                            |   94 +-
>>   drivers/pci/doe.c                                        |    2 -
>>   drivers/pci/ide.c                                        |  186 +++
>>   drivers/pci/iomap.c                                      |   24 +
>>   drivers/pci/mmap.c                                       |   11 +-
>>   drivers/pci/pci-sysfs.c                                  |   27 +-
>>   drivers/pci/pci.c                                        |    3 +
>>   drivers/pci/proc.c                                       |    2 +-
>>   drivers/vfio/vfio_main.c                                 |   13 +
>>   drivers/virt/coco/sev-guest/{sev-guest.c => sev_guest.c} |   68 +-
>>   drivers/virt/coco/sev-guest/sev_guest_tio.c              |  513 +++++++
>>   drivers/virt/coco/tdx-guest/tdx-guest.c                  |    8 +-
>>   drivers/virt/coco/tsm-report.c                           |  512 +++++++
>>   drivers/virt/coco/tsm.c                                  | 1542 ++++++++++++++-----
>>   virt/kvm/guest_memfd.c                                   |   40 +
>>   virt/kvm/kvm_main.c                                      |    4 +-
>>   virt/kvm/vfio.c                                          |  197 ++-
>>   Documentation/virt/coco/tsm.rst                          |   62 +
>>   MAINTAINERS                                              |    4 +-
>>   arch/x86/kvm/Kconfig                                     |    1 +
>>   drivers/pci/Kconfig                                      |    4 +
>>   drivers/virt/coco/Kconfig                                |   11 +
>>   69 files changed, 6163 insertions(+), 548 deletions(-)
>>   create mode 100644 drivers/crypto/ccp/sev-dev-tio.h
>>   create mode 100644 drivers/virt/coco/sev-guest/sev-guest.h
>>   create mode 100644 include/linux/pci-ide.h
>>   create mode 100644 include/linux/tsm-report.h
>>   create mode 100644 drivers/crypto/ccp/sev-dev-tio.c
>>   create mode 100644 drivers/crypto/ccp/sev-dev-tsm.c
>>   create mode 100644 drivers/pci/ide.c
>>   rename drivers/virt/coco/sev-guest/{sev-guest.c => sev_guest.c} (96%)
>>   create mode 100644 drivers/virt/coco/sev-guest/sev_guest_tio.c
>>   create mode 100644 drivers/virt/coco/tsm-report.c
>>   create mode 100644 Documentation/virt/coco/tsm.rst
>>
>> -- 
>> 2.45.2
>>
> 
>

Dan Williams Aug. 29, 2024, 11:41 p.m. UTC | #3

Alexey Kardashevskiy wrote:
[..]
> >> - skipping various enforcements of non-SME or
> >> SWIOTLB in the guest;
> > 
> > Is this based on some concept of private vs shared mode devices?
> > 
> >> No mixed share+private DMA supported within the
> >> same IOMMU.
> > 
> > What does this mean? A device may not have mixed mappings (makes sense),
> 
> Currently devices do not have an idea about private host memory (but it 
> is being worked on afaik).

Worked on where? You mean the PCI core indicating that a device is
private or not? Is that not indicated by guest-side TSM connection
state?

> > or an IOMMU can not host devices that do not all agree on whether DMA is
> > private or shared?
> 
> The hardware allows that via hardware-assisted vIOMMU and I/O page 
> tables in the guest with C-bit takes into accound by the IOMMU but the 
> software support is missing right now. So for this initial drop, vTOM is 
> used for DMA - this thing says "everything below <addr> is private, 
> above <addr> - shared" so nothing needs to bother with the C-bit, and in 
> my exercise I set the <addr> to the allowed maximum.
> 
> So each IOMMUFD instance in the VM is either "all private mappings" or 
> "all shared". Could be half/half by moving that <addr> :)

I thought existing use cases assume that the CC-VM can trigger page
conversions at will without regard to a vTOM concept? It would be nice
to have that address-map separation arrangement, has not that ship
already sailed?

[..]
> > Would the device not just launch in "shared" mode until it is later
> > converted to private? I am missing the detail of why passing the device
> > on the command line requires that private memory be mapped early.
> 
> A sequencing problem.
> 
> QEMU "realizes" a VFIO device, it creates an iommufd instance which 
> creates a domain and writes to a DTE (a IOMMU descriptor for PCI BDFn). 
> And DTE is not updated after than. For secure stuff, DTE needs to be 
> slightly different. So right then I tell IOMMUFD that it will handle 
> private memory.
> 
> Then, the same VFIO "realize" handler maps the guest memory in iommufd. 
> I use the same flag (well, pointer to kvm) in the iommufd pinning code, 
> private memory is pinned and mapped (and related page state change 
> happens as the guest memory is made guest-owned in RMP).
> 
> QEMU goes to machine_reset() and calls "SNP LAUNCH UPDATE" (the actual 
> place changed recenly, huh) and the latter will measure the guest and 
> try making all guest memory private but it already happened => error.
> 
> I think I have to decouple the pinning and the IOMMU/DTE setting.
> 
> > That said, the implication that private device assignment requires
> > hotplug events is a useful property. This matches nicely with initial
> > thoughts that device conversion events are violent and might as well be
> > unplug/replug events to match all the assumptions around what needs to
> > be updated.
> 
> For the initial drop, I tell QEMU via "-device vfio-pci,x-tio=true" that 
> it is going to be private so there should be no massive conversion.

That's a SEV-TIO RFC-specific hack, or a proposal?

An approach that aligns more closely with the VFIO operational model,
where it maps and waits for guest faults / usages, is that QEMU would be
told that the device is "bind capable", because the host is not in a
position to assume how the guest will use the device. A "bind capable"
device operates in shared mode unless and until the guest triggers
private conversion.

> >> This requires the BME hack as MMIO and
> > 
> > Not sure what the "BME hack" is, I guess this is foreshadowing for later
> > in this story.
>  >
> >> BusMaster enable bits cannot be 0 after MMIO
> >> validation is done
> > 
> > It would be useful to call out what is a TDISP requirement, vs
> > device-specific DSM vs host-specific TSM requirement. In this case I
> > assume you are referring to PCI 6.2 11.2.6 where it notes that TDIs must
> 
> Oh there is 6.2 already.
> 
> > enter the TDISP ERROR state if BME is cleared after the device is
> > locked?
> > 
> > ...but this begs the question of whether it needs to be avoided outright
> 
> Well, besides a couple of avoidable places (like testing INTx support 
> which we know is not going to work on VFs anyway), a standard driver 
> enables MSE first (and the value for the command register does not have 
> 1 for BME) and only then BME. TBH I do not think writing BME=0 when 
> BME=0 already is "clearing" but my test device disagrees.

...but we should not be creating kernel policy around test devices. What
matters is real devices. Now, if it is likely that real / production
devices will go into the TDISP ERROR state by not coalescing MSE + BME
updates then we need a solution.

Given it is unlikely that TDISP support will be widespread any time soon
it is likely tenable to assume TDISP compatible drivers call a new:

   pci_enable(pdev, PCI_ENABLE_TARGET | PCI_ENABLE_INITIATOR);

...or something like that to coalesce command register writes.

Otherwise if that retrofit ends up being too much work or confusion then
the ROI of teaching the PCI core to recover this scenario needs to be
evaluated.

> > or handled as an error recovery case dependending on policy.
> 
> Avoding seems more straight forward unless we actually want enlightened 
> device drivers which want to examine the interface report before 
> enabling the device. Not sure.

If TDISP capable devices trends towards a handful of devices in the near
term then some driver fixups seems reasonable. Otherwise if every PCI
device driver Linux has ever seens needs to be ready for that device to
have a TDISP capable flavor then mitigating this in the PCI core makes
more sense than playing driver whack-a-mole.

> >> the guest OS booting process when this
> >> appens.
> >>
> >> SVSM could help addressing these (not
> >> implemented at the moment).
> > 
> > At first though avoiding SVSM entanglements where the kernel can be
> > enlightened shoud be the policy. I would only expect SVSM hacks to cover
> > for legacy OSes that will never be TDISP enlightened, but in that case
> > we are likely talking about fully unaware L2. Lets assume fully
> > enlightened L1 for now.
> 
> Well, I could also tweak OVMF to make necessary calls to the PSP and 
> hack QEMU to postpone the command register updates to get this going, 
> just a matter of ugliness.

Per above, the tradeoff should be in ROI, not ugliness. I don't see how
OVMF helps when devices might be being virtually hotplugged or reset.

Alexey Kardashevskiy Aug. 30, 2024, 4:38 a.m. UTC | #4

On 30/8/24 09:41, Dan Williams wrote:
> Alexey Kardashevskiy wrote:
> [..]
>>>> - skipping various enforcements of non-SME or
>>>> SWIOTLB in the guest;
>>>
>>> Is this based on some concept of private vs shared mode devices?
>>>
>>>> No mixed share+private DMA supported within the
>>>> same IOMMU.
>>>
>>> What does this mean? A device may not have mixed mappings (makes sense),
>>
>> Currently devices do not have an idea about private host memory (but it
>> is being worked on afaik).
> 
> Worked on where? You mean the PCI core indicating that a device is
> private or not? Is that not indicated by guest-side TSM connection
> state?
> >>> or an IOMMU can not host devices that do not all agree on whether 
DMA is
>>> private or shared?
>>
>> The hardware allows that via hardware-assisted vIOMMU and I/O page
>> tables in the guest with C-bit takes into accound by the IOMMU but the
>> software support is missing right now. So for this initial drop, vTOM is
>> used for DMA - this thing says "everything below <addr> is private,
>> above <addr> - shared" so nothing needs to bother with the C-bit, and in
>> my exercise I set the <addr> to the allowed maximum.
>>
>> So each IOMMUFD instance in the VM is either "all private mappings" or
>> "all shared". Could be half/half by moving that <addr> :)
> 
> I thought existing use cases assume that the CC-VM can trigger page
> conversions at will without regard to a vTOM concept? It would be nice
> to have that address-map separation arrangement, has not that ship
> already sailed?

Mmm. I am either confusing you too much or not following you :) Any page 
can be converted, the proposed arrangement would require that 
convertion-candidate-pages are allocated from a specific pool?

There are two vTOMs - one in IOMMU to decide on Cbit for DMA trafic (I 
use this one), one in VMSA ("VIRTUAL_TOM") for guest memory (this 
exercise is not using it). Which one do you mean?

> 
> [..]
>>> Would the device not just launch in "shared" mode until it is later
>>> converted to private? I am missing the detail of why passing the device
>>> on the command line requires that private memory be mapped early.
>>
>> A sequencing problem.
>>
>> QEMU "realizes" a VFIO device, it creates an iommufd instance which
>> creates a domain and writes to a DTE (a IOMMU descriptor for PCI BDFn).
>> And DTE is not updated after than. For secure stuff, DTE needs to be
>> slightly different. So right then I tell IOMMUFD that it will handle
>> private memory.
>>
>> Then, the same VFIO "realize" handler maps the guest memory in iommufd.
>> I use the same flag (well, pointer to kvm) in the iommufd pinning code,
>> private memory is pinned and mapped (and related page state change
>> happens as the guest memory is made guest-owned in RMP).
>>
>> QEMU goes to machine_reset() and calls "SNP LAUNCH UPDATE" (the actual
>> place changed recenly, huh) and the latter will measure the guest and
>> try making all guest memory private but it already happened => error.
>>
>> I think I have to decouple the pinning and the IOMMU/DTE setting.
>>
>>> That said, the implication that private device assignment requires
>>> hotplug events is a useful property. This matches nicely with initial
>>> thoughts that device conversion events are violent and might as well be
>>> unplug/replug events to match all the assumptions around what needs to
>>> be updated.
>>
>> For the initial drop, I tell QEMU via "-device vfio-pci,x-tio=true" that
>> it is going to be private so there should be no massive conversion.
> 
> That's a SEV-TIO RFC-specific hack, or a proposal?

Not sure at the moment :)

> An approach that aligns more closely with the VFIO operational model,
> where it maps and waits for guest faults / usages, is that QEMU would be
> told that the device is "bind capable", because the host is not in a
> position to assume how the guest will use the device. A "bind capable"
> device operates in shared mode unless and until the guest triggers
> private conversion.

True. I just started this exercise without QEMU DiscardManager. Now I 
rely on it but it either needs to allow dynamic flip from 
discarded==private to discarded==shared (should do for now) or  allow 3 
states for guest pages.

>>>> This requires the BME hack as MMIO and
>>>
>>> Not sure what the "BME hack" is, I guess this is foreshadowing for later
>>> in this story.
>>   >
>>>> BusMaster enable bits cannot be 0 after MMIO
>>>> validation is done
>>>
>>> It would be useful to call out what is a TDISP requirement, vs
>>> device-specific DSM vs host-specific TSM requirement. In this case I
>>> assume you are referring to PCI 6.2 11.2.6 where it notes that TDIs must
>>
>> Oh there is 6.2 already.
>>
>>> enter the TDISP ERROR state if BME is cleared after the device is
>>> locked?
>>>
>>> ...but this begs the question of whether it needs to be avoided outright
>>
>> Well, besides a couple of avoidable places (like testing INTx support
>> which we know is not going to work on VFs anyway), a standard driver
>> enables MSE first (and the value for the command register does not have
>> 1 for BME) and only then BME. TBH I do not think writing BME=0 when
>> BME=0 already is "clearing" but my test device disagrees.
> 
> ...but we should not be creating kernel policy around test devices. What
> matters is real devices. Now, if it is likely that real / production
> devices will go into the TDISP ERROR state by not coalescing MSE + BME
> updates then we need a solution.

True but I do not even know who to ask this question :)

> Given it is unlikely that TDISP support will be widespread any time soon
> it is likely tenable to assume TDISP compatible drivers call a new:
> 
>     pci_enable(pdev, PCI_ENABLE_TARGET | PCI_ENABLE_INITIATOR);
> 
> ...or something like that to coalesce command register writes.
> 
> Otherwise if that retrofit ends up being too much work or confusion then
> the ROI of teaching the PCI core to recover this scenario needs to be
> evaluated.

Agree.

>>> or handled as an error recovery case dependending on policy.
>>
>> Avoding seems more straight forward unless we actually want enlightened
>> device drivers which want to examine the interface report before
>> enabling the device. Not sure.
> 
> If TDISP capable devices trends towards a handful of devices in the near
> term then some driver fixups seems reasonable. Otherwise if every PCI
> device driver Linux has ever seens needs to be ready for that device to
> have a TDISP capable flavor then mitigating this in the PCI core makes
> more sense than playing driver whack-a-mole.
 >
>>>> the guest OS booting process when this
>>>> appens.
>>>>
>>>> SVSM could help addressing these (not
>>>> implemented at the moment).
>>>
>>> At first though avoiding SVSM entanglements where the kernel can be
>>> enlightened shoud be the policy. I would only expect SVSM hacks to cover
>>> for legacy OSes that will never be TDISP enlightened, but in that case
>>> we are likely talking about fully unaware L2. Lets assume fully
>>> enlightened L1 for now.
>>
>> Well, I could also tweak OVMF to make necessary calls to the PSP and
>> hack QEMU to postpone the command register updates to get this going,
>> just a matter of ugliness.
> 
> Per above, the tradeoff should be in ROI, not ugliness. I don't see how
> OVMF helps when devices might be being virtually hotplugged or reset.

I have no clue how exactly hotplug works on x86, is not BIOS playing 
role in it? Thanks,

Dan Williams Aug. 30, 2024, 9:57 p.m. UTC | #5

Alexey Kardashevskiy wrote:
[..]
> > I thought existing use cases assume that the CC-VM can trigger page
> > conversions at will without regard to a vTOM concept? It would be nice
> > to have that address-map separation arrangement, has not that ship
> > already sailed?
> 
> Mmm. I am either confusing you too much or not following you :) Any page 
> can be converted, the proposed arrangement would require that 
> convertion-candidate-pages are allocated from a specific pool?
> 
> There are two vTOMs - one in IOMMU to decide on Cbit for DMA trafic (I 
> use this one), one in VMSA ("VIRTUAL_TOM") for guest memory (this 
> exercise is not using it). Which one do you mean?

Dunno, you introduced the vTOM term. Suffice to say if any page can be
converted in this model then I was confused.

> > [..]
> >>> Would the device not just launch in "shared" mode until it is later
> >>> converted to private? I am missing the detail of why passing the device
> >>> on the command line requires that private memory be mapped early.
> >>
> >> A sequencing problem.
> >>
> >> QEMU "realizes" a VFIO device, it creates an iommufd instance which
> >> creates a domain and writes to a DTE (a IOMMU descriptor for PCI BDFn).
> >> And DTE is not updated after than. For secure stuff, DTE needs to be
> >> slightly different. So right then I tell IOMMUFD that it will handle
> >> private memory.
> >>
> >> Then, the same VFIO "realize" handler maps the guest memory in iommufd.
> >> I use the same flag (well, pointer to kvm) in the iommufd pinning code,
> >> private memory is pinned and mapped (and related page state change
> >> happens as the guest memory is made guest-owned in RMP).
> >>
> >> QEMU goes to machine_reset() and calls "SNP LAUNCH UPDATE" (the actual
> >> place changed recenly, huh) and the latter will measure the guest and
> >> try making all guest memory private but it already happened => error.
> >>
> >> I think I have to decouple the pinning and the IOMMU/DTE setting.
> >>
> >>> That said, the implication that private device assignment requires
> >>> hotplug events is a useful property. This matches nicely with initial
> >>> thoughts that device conversion events are violent and might as well be
> >>> unplug/replug events to match all the assumptions around what needs to
> >>> be updated.
> >>
> >> For the initial drop, I tell QEMU via "-device vfio-pci,x-tio=true" that
> >> it is going to be private so there should be no massive conversion.
> > 
> > That's a SEV-TIO RFC-specific hack, or a proposal?
> 
> Not sure at the moment :)

Ok, without more information it looks like a SEV-TIO shortcut.

> > An approach that aligns more closely with the VFIO operational model,
> > where it maps and waits for guest faults / usages, is that QEMU would be
> > told that the device is "bind capable", because the host is not in a
> > position to assume how the guest will use the device. A "bind capable"
> > device operates in shared mode unless and until the guest triggers
> > private conversion.
> 
> True. I just started this exercise without QEMU DiscardManager. Now I 
> rely on it but it either needs to allow dynamic flip from 
> discarded==private to discarded==shared (should do for now) or  allow 3 
> states for guest pages.

As we talked about on the KernelSIG call there is a potentially a
guestmemfd proposal to handle in place conversion without a
DiscardManager:

https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@google.com/

[..]
> > Per above, the tradeoff should be in ROI, not ugliness. I don't see how
> > OVMF helps when devices might be being virtually hotplugged or reset.
> 
> I have no clue how exactly hotplug works on x86, is not BIOS playing 
> role in it? Thanks,

The hotplug controller can either be native PCIe or firmware managed.
Likely we would pick the path of least of resistance for QEMU to
facilitate device conversion.

Sean Christopherson Sept. 3, 2024, 3:56 p.m. UTC | #6

On Fri, Aug 23, 2024, Alexey Kardashevskiy wrote:
> Hi everyone,
> 
> Here are some patches to enable SEV-TIO (aka TDISP, aka secure VFIO)
> on AMD Turin.
> 
> The basic idea is to allow DMA to/from encrypted memory of SNP VMs and
> secure MMIO in SNP VMs (i.e. with Cbit set) as well.
> 
> These include both guest and host support. QEMU also requires
> some patches, links below.
> 
> The patches are organized as:
> 01..06 - preparing the host OS;
> 07 - new TSM module;
> 08 - add PSP SEV TIO ABI (IDE should start working at this point);
> 09..14 - add KVM support (TDI binding, MMIO faulting, etc);
> 15..19 - guest changes (the rest of SEV TIO ABI, DMA, secure MMIO).
> 20, 21 - some helpers for guest OS to use encrypted MMIO
> 
> This is based on a merge of
> ee3248f9f8d6 Lukas Wunner spdm: Allow control of next requester nonce
> through sysfs
> 85ef1ac03941 (AMDESE/snp-host-latest) 4 days ago Michael Roth [TEMP] KVM: guest_memfd: Update gmem_prep are hook to handle partially-allocated folios
> 
> 
> Please comment. Thanks.

1. Use scripts/get_maintainer.pl
2. Fix your MUA to wrap closer to 80 chars
3. Explain the core design, e.g. roles and responsibilities, coordination between
   KVM, VFIO/IOMMUFD, userspace, etc.

Tian, Kevin Sept. 5, 2024, 8:21 a.m. UTC | #7

> From: Alexey Kardashevskiy <aik@amd.com>
> Sent: Thursday, August 29, 2024 10:14 PM
>
> On 29/8/24 06:43, Dan Williams wrote:
> > Alexey Kardashevskiy wrote:
> >> Assumptions
> >> -----------
> >>
> >> This requires hotpligging into the VM vs
> >> passing the device via the command line as
> >> VFIO maps all guest memory as the device init
> >> step which is too soon as
> >> SNP LAUNCH UPDATE happens later and will fail
> >> if VFIO maps private memory before that.
> >
> > Would the device not just launch in "shared" mode until it is later
> > converted to private? I am missing the detail of why passing the device
> > on the command line requires that private memory be mapped early.
> 
> A sequencing problem.
> 
> QEMU "realizes" a VFIO device, it creates an iommufd instance which
> creates a domain and writes to a DTE (a IOMMU descriptor for PCI BDFn).
> And DTE is not updated after than. For secure stuff, DTE needs to be
> slightly different. So right then I tell IOMMUFD that it will handle
> private memory.
> 
> Then, the same VFIO "realize" handler maps the guest memory in iommufd.
> I use the same flag (well, pointer to kvm) in the iommufd pinning code,
> private memory is pinned and mapped (and related page state change
> happens as the guest memory is made guest-owned in RMP).
> 
> QEMU goes to machine_reset() and calls "SNP LAUNCH UPDATE" (the actual
> place changed recenly, huh) and the latter will measure the guest and
> try making all guest memory private but it already happened => error.
> 
> I think I have to decouple the pinning and the IOMMU/DTE setting.
> 

Assume there will be a new hwpt type which hints for special DTE
setting at attach time and connects to a guest memfd. It'd make
sense to defer mapping guest memory to a point after "SNP
LAUNCH UPDATE" is completed for devices attached to such hwpt,
as long as we document such restriction clearly for that new type.

[RFC,00/21] Secure VFIO, TDISP, SEV TIO

Message

Comments