
[RFC,v2,00/58] KVM: Arm SMMUv3 driver for pKVM

Message ID 20241212180423.1578358-1-smostafa@google.com (mailing list archive)

Mostafa Saleh Dec. 12, 2024, 6:03 p.m. UTC
This is v2 of the series sent last year:
https://lore.kernel.org/kvmarm/20230201125328.2186498-1-jean-philippe@linaro.org/

pKVM overview:
=============
The pKVM hypervisor, recently introduced on arm64, provides a separation
of privileges between the host and hypervisor parts of KVM, where the
hypervisor is trusted by guests but the host is not [1][2]. The host is
initially trusted during boot, but its privileges are reduced after KVM
is initialized so that, if an adversary later gains access to the large
attack surface of the host, it cannot access guest data.

Currently with pKVM, the host can still instruct DMA-capable devices
like the GPU to access guest and hypervisor memory, which undermines
this isolation. Preventing DMA attacks requires an IOMMU, owned by the
hypervisor.

This series adds a hypervisor driver for the Arm SMMUv3 IOMMU. Since the
hypervisor part of pKVM (called nVHE here) is minimal, moving the whole
host SMMU driver into nVHE isn't really an option. It is too large and
complex and requires infrastructure from all over the kernel. We add a
reduced nVHE driver that deals with populating the SMMU tables and the
command queue, and the host driver still deals with probing and some
initialization.

Some of the pKVM infrastructure that this series depends on is not
upstream yet, so this should be considered a forward-looking RFC for
how we think DMA isolation can be supported in pKVM, or in other
similar confidential computing solutions, rather than a ready-to-merge
solution.
This is discussed further in the dependencies section below.

Patches overview
================
The patches are split as follows:
Patches 1-10: Mostly about splitting the current SMMUv3 driver and the
io-pgtable-arm library, so the code can be re-used by the KVM driver,
either in the kernel or in the hypervisor.
Most of these patches are best reviewed with git's --color-moved.

Patches 11-24: Introduce the hypervisor core code for IOMMUs, which is
not specific to SMMUv3: the hypercall handlers and the common logic in
the hypervisor.
They also introduce the key functions __pkvm_host_{un}use_dma_page,
which are used to track DMA-mapped pages; more on this in the design
section.

Patches 25-41: Add the hypervisor part of the KVM SMMUv3 driver, which
is called by the hypervisor core IOMMU code; these are para-virtualized
operations such as attach/detach, map/unmap...

Patches 42-54: Add the kernel part of the KVM SMMUv3 driver. This
probes and initialises the IOMMUs and passes the list of SMMUs to the
hypervisor; it also implements the kernel iommu_ops and registers the
IOMMUs with the kernel.

Patches 55-58: Two extra optimizations, introduced at the end to avoid
complicating the start of the series: one to optimise iommu_map_sg and
the other to batch TLB invalidation, which I noticed to be a problem
while testing as my HW doesn’t support range invalidation.

A development branch is available at:
https://android-kvm.googlesource.com/linux/+log/refs/heads/for-upstream/pkvm-smmu

Design
======
We've explored 4 solutions so far; only the two of them that I believe
are the most promising are mentioned here, as they offer private IO
spaces. The others were discussed in the cover letter of v1 of the
series.

1. Paravirtual I/O page tables
This is the solution implemented in this series. The host creates
IOVA->HPA mappings with two hypercalls, map_pages() and unmap_pages(),
and the hypervisor populates the page tables. Page tables are abstracted
into IOMMU domains, which allow multiple devices to share the same
address space. Another four hypercalls, alloc_domain(), attach_dev(),
detach_dev() and free_domain(), manage the domains; the semantics of
those hypercalls are almost identical to the IOMMU ops, which makes the
kernel driver part simpler.
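
As a rough illustration of the hypercall semantics (a sketch only: the
hypercall IDs, kvm_iommu_hvc() and the wrapper below are hypothetical
stand-ins, not the actual ABI added by this series):

/* Hypothetical hypercall IDs, for illustration only. */
enum kvm_iommu_hvc {
	KVM_IOMMU_ALLOC_DOMAIN,
	KVM_IOMMU_ATTACH_DEV,
	KVM_IOMMU_MAP_PAGES,
	KVM_IOMMU_UNMAP_PAGES,
	KVM_IOMMU_DETACH_DEV,
	KVM_IOMMU_FREE_DOMAIN,
};

/* Stand-in for the real SMCCC/HVC plumbing into the hypervisor. */
long kvm_iommu_hvc(enum kvm_iommu_hvc id, unsigned long a0,
		   unsigned long a1, unsigned long a2, unsigned long a3);

/* Kernel side of map_pages(): ask EL2 to install IOVA->PA mappings. */
static int kvm_iommu_map_pages(unsigned long domain_id, unsigned long iova,
			       unsigned long paddr, unsigned long pgsize,
			       unsigned long pgcount)
{
	/* The hypervisor validates ownership and refcounts each page. */
	return kvm_iommu_hvc(KVM_IOMMU_MAP_PAGES, domain_id, iova, paddr,
			     pgsize * pgcount);
}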

Some key points in the hypervisor design:
a- Tracking mapped pages: the hypervisor must prevent pages that are
   mapped in the IOMMU from being donated to a protected guest or to
   the hypervisor, and must not allow a protected guest/hypervisor page
   to be mapped in an IOMMU domain.

   For that we rely on the vmemmap refcount: each time a page is mapped
   its refcount is incremented and its ownership is checked, and each
   time it is successfully unmapped the refcount is decremented. Any
   memory donation is denied for refcounted pages.
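
   A minimal sketch of that tracking logic, assuming hypothetical
   helpers (hyp_phys_to_page(), host_owns_page()) and an open-coded
   refcount; the actual series uses pKVM's own page-tracking structures
   and helpers:

static int __pkvm_host_use_dma_page(u64 phys)
{
	struct hyp_page *p = hyp_phys_to_page(phys);

	/* Only pages currently owned by the host may be mapped for DMA. */
	if (!host_owns_page(p))
		return -EPERM;

	/* A non-zero refcount blocks any later donation of this page. */
	p->refcount++;
	return 0;
}

static void __pkvm_host_unuse_dma_page(u64 phys)
{
	struct hyp_page *p = hyp_phys_to_page(phys);

	/* Once this drops to zero the page may be donated again. */
	p->refcount--;
}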

b- Locking: io-pgtable-arm is lockless under some guarantees about how
   the IOMMU code behaves. However, with pKVM the kernel is not trusted,
   and a malicious kernel can issue concurrent requests causing memory
   corruption or UAF, so the page tables have to be locked in the
   hypervisor.
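
   For example (a sketch with hypothetical domain/lock names), every
   io-pgtable operation at EL2 could be serialized per domain:

static int kvm_iommu_domain_map(struct kvm_hyp_iommu_domain *domain,
				unsigned long iova, phys_addr_t paddr,
				size_t pgsize, size_t pgcount, int prot,
				size_t *mapped)
{
	struct io_pgtable_ops *ops = domain->pgtable_ops;
	int ret;

	/* Serialize against concurrent hypercalls from the untrusted host. */
	hyp_spin_lock(&domain->lock);
	ret = ops->map_pages(ops, iova, paddr, pgsize, pgcount, prot,
			     0, mapped);
	hyp_spin_unlock(&domain->lock);

	return ret;
}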

c- Memory management: The hypervisor needs a way to allocate pages for
   the paravirtual page tables. For that, an IOMMU pool is created which
   can be topped up from a hypercall, and the IOMMU hypercalls return
   encoded memory requests which can be fulfilled by the kernel driver.
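
   A sketch of that flow from the kernel side, reusing the hypothetical
   kvm_iommu_hvc() wrapper from the earlier sketch; the return-value
   encoding and kvm_iommu_topup_pool() are illustrative assumptions:

static int kvm_iommu_map_with_topup(unsigned long domain_id,
				    unsigned long iova, unsigned long paddr,
				    size_t size)
{
	long ret;

	for (;;) {
		ret = kvm_iommu_hvc(KVM_IOMMU_MAP_PAGES, domain_id, iova,
				    paddr, size);
		if (ret <= 0)
			return ret;

		/* A positive return encodes how many pages EL2 needs. */
		if (kvm_iommu_topup_pool(ret))
			return -ENOMEM;
	}
}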

2. Nested SMMUv3 translation (with emulation)
Another approach is to rely on nested translation support, which is
optional in SMMUv3. It requires an architecturally accurate emulation
of SMMUv3, which can be complicated, including cmdq emulation.

With this approach, we can use the same page tables as the CPU stage-2,
which adds more constraints on the HW (SMMUv3 features must match the
CPU) and requires devices to be able to handle faults, as the CPU part
relies on lazy mapping and gives no guarantee that pages are mapped.
Alternatively, we can use a shadow IOMMU page table instead.

I have a prototype for nested translation that is not yet ready to be posted:
https://android-kvm.googlesource.com/linux/+log/refs/heads/smostafa/android15-6.6-smmu-nesting-wip


The trade-off between the 2 approaches can be roughly summarised as:
Paravirtualization:
- Compatible with more HW (and IOMMUs).
- Better DMA performance due to shorter table walks/less TLB pressure.
- Needs extra complexity to squeeze the last bit of optimization (around
  unmap, and map_sg).

Nested Emulation
- Faster map_pages (not sure about unmap because it requires cmdq
  emulation for TLB invalidation if DVM not used).
- Needs extra complexity for architecturally emulating SMMUv3.

I believe the first approach looks more promising given this trade-off.
However, I plan to complete the nested emulation and post it with a
comparison with this approach in terms of performance, and maybe this
topic can be discussed in an upcoming conference.

Dependencies
============
This series depends on some parts of pKVM that are not upstream yet,
some of which are currently posted [3][4]. To avoid spamming the list
with changes that are not relevant to IOMMU/SMMUv3, the patches are
developed on top of them.

This series also depends on another series reworking the io-pgtable walker [5].

Performance
===========
Measured with CONFIG_DMA_MAP_BENCHMARK on a 4-core Morello board.
Numbers represent the average time needed for one dma_map/dma_unmap call
in μs, lower is better.
It is compared with the kernel driver, which is not quite a fair
comparison as the latter doesn't fulfil pKVM's DMA isolation
requirements. However, these numbers are provided just to give a rough
idea of what the overhead looks like.
                        Kernel driver         pKVM driver
                        (map/unmap)           (map/unmap)
4K - 1 thread		0.1/0.7               0.3/1.3
4K - 4 threads		0.1/1.1               0.5/3.3
1M - 1 thread		0.8/21.5              2.6/27.3
1M - 4 threads		1.1/45.7              3.6/46.2

And tested as follows:
echo dma_map_benchmark > /sys/bus/pci/devices/0000\:06\:00.0/driver_override
echo 0000:06:00.0 >  /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind
echo 0000:06:00.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
./dma_map_benchmark -t $threads -g $nr_pages


Future work
==========
- Add IDENTITY_DOMAIN support. I already have some patches for that,
  but didn’t want to complicate this series; I can send them separately.
- Complete the comparison with the nesting support and find the most
  suitable solution for upstream.


Main changes since v1
=====================
- Patches are reordered to split the introduction of the KVM IOMMU
  code and the SMMUv3 driver.
- KVM EL2 code is closer to the EL1 code, where domains are decoupled
  from IOMMUs.
- SMMUv3 new features (stage-1 support, IRQ and EVTQ in the kernel).
- Adaptations to the new SMMUv3 cleanups.
- Rework tracking of mapped pages to improve performance.
- Rework locking to improve performance.
- Rework unmap to improve performance.
- Adding iotlb_gather to optimize unmap.
- Add new operations to optimize map_sg operation.
- Driver registration is now done dynamically instead of being
  statically checked.
- Memory allocation for page table pages is changed to a separate pool
  and HVCs, instead of a shared mc that required atomic allocation.
- Support for higher order page allocation.
- Support for non-coherent SMMUs.
- Support for DABT and MMIO emulation.


[1] https://lore.kernel.org/kvmarm/20220519134204.5379-1-will@kernel.org/
[2] https://www.youtube.com/watch?v=9npebeVFbFw
[3] https://lore.kernel.org/kvmarm/20241203103735.2267589-1-qperret@google.com/
[4] https://lore.kernel.org/all/20241202154742.3611749-1-tabba@google.com/
[5] https://lore.kernel.org/linux-iommu/20241028213146.238941-1-robdclark@gmail.com/T/#t


Jean-Philippe Brucker (23):
  iommu/io-pgtable-arm: Split the page table driver
  iommu/io-pgtable-arm: Split initialization
  iommu/io-pgtable: Add configure() operation
  iommu/arm-smmu-v3: Move some definitions to arm64 include/
  iommu/arm-smmu-v3: Extract driver-specific bits from probe function
  iommu/arm-smmu-v3: Move some functions to arm-smmu-v3-common.c
  iommu/arm-smmu-v3: Move queue and table allocation to
    arm-smmu-v3-common.c
  iommu/arm-smmu-v3: Move firmware probe to arm-smmu-v3-common
  iommu/arm-smmu-v3: Move IOMMU registration to arm-smmu-v3-common.c
  KVM: arm64: pkvm: Add pkvm_udelay()
  KVM: arm64: pkvm: Add __pkvm_host_add_remove_page()
  KVM: arm64: pkvm: Support SCMI power domain
  KVM: arm64: iommu: Support power management
  KVM: arm64: iommu: Add SMMUv3 driver
  KVM: arm64: smmu-v3: Initialize registers
  KVM: arm64: smmu-v3: Setup command queue
  KVM: arm64: smmu-v3: Reset the device
  KVM: arm64: smmu-v3: Support io-pgtable
  iommu/arm-smmu-v3-kvm: Add host driver for pKVM
  iommu/arm-smmu-v3-kvm: Pass a list of SMMU devices to the hypervisor
  iommu/arm-smmu-v3-kvm: Validate device features
  iommu/arm-smmu-v3-kvm: Allocate structures and reset device
  iommu/arm-smmu-v3-kvm: Probe power domains

Mostafa Saleh (35):
  iommu/arm-smmu-v3: Move common irq code to common file
  KVM: arm64: Add __pkvm_{use, unuse}_dma()
  KVM: arm64: Introduce IOMMU driver infrastructure
  KVM: arm64: pkvm: Add IOMMU hypercalls
  KVM: arm64: iommu: Add a memory pool for the IOMMU
  KVM: arm64: iommu: Add domains
  KVM: arm64: iommu: Add {attach, detach}_dev
  KVM: arm64: iommu: Add map/unmap() operations
  KVM: arm64: iommu: support iommu_iotlb_gather
  KVM: arm64: Support power domains
  KVM: arm64: iommu: Support DABT for IOMMU
  KVM: arm64: smmu-v3: Setup stream table
  KVM: arm64: smmu-v3: Setup event queue
  KVM: arm64: smmu-v3: Add {alloc/free}_domain
  KVM: arm64: smmu-v3: Add TLB ops
  KVM: arm64: smmu-v3: Add context descriptor functions
  KVM: arm64: smmu-v3: Add attach_dev
  KVM: arm64: smmu-v3: Add detach_dev
  iommu/io-pgtable: Generalize walker interface
  iommu/io-pgtable-arm: Add post table walker callback
  drivers/iommu: io-pgtable: Add IO_PGTABLE_QUIRK_UNMAP_INVAL
  KVM: arm64: smmu-v3: Add map/unmap pages and iova_to_phys
  KVM: arm64: smmu-v3: Add DABT handler
  KVM: arm64: Add function to topup generic allocator
  KVM: arm64: Add macro for SMCCC call with all returns
  iommu/arm-smmu-v3-kvm: Add function to topup IOMMU allocator
  iommu/arm-smmu-v3-kvm: Add IOMMU ops
  iommu/arm-smmu-v3-kvm: Add map, unmap and iova_to_phys operations
  iommu/arm-smmu-v3-kvm: Support PASID operations
  iommu/arm-smmu-v3-kvm: Add IRQs for the driver
  iommu/arm-smmu-v3-kvm: Enable runtime PM
  drivers/iommu: Add deferred map_sg operations
  KVM: arm64: iommu: Add hypercall for map_sg
  iommu/arm-smmu-v3-kvm: Implement sg operations
  iommu/arm-smmu-v3-kvm: Support command queue batching

 arch/arm64/include/asm/arm-smmu-v3-common.h   |  592 +++++++
 arch/arm64/include/asm/kvm_asm.h              |    9 +
 arch/arm64/include/asm/kvm_host.h             |   48 +-
 arch/arm64/include/asm/kvm_hyp.h              |    2 +
 arch/arm64/kvm/Makefile                       |    2 +-
 arch/arm64/kvm/arm.c                          |    8 +-
 arch/arm64/kvm/hyp/hyp-constants.c            |    1 +
 arch/arm64/kvm/hyp/include/nvhe/iommu.h       |   91 ++
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |    3 +
 arch/arm64/kvm/hyp/include/nvhe/mm.h          |    1 +
 arch/arm64/kvm/hyp/include/nvhe/pkvm.h        |   37 +
 .../arm64/kvm/hyp/include/nvhe/trap_handler.h |    2 +
 arch/arm64/kvm/hyp/nvhe/Makefile              |    6 +-
 arch/arm64/kvm/hyp/nvhe/alloc_mgt.c           |    2 +
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  114 ++
 arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c   | 1390 +++++++++++++++++
 .../arm64/kvm/hyp/nvhe/iommu/io-pgtable-arm.c |  153 ++
 arch/arm64/kvm/hyp/nvhe/iommu/iommu.c         |  490 ++++++
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         |  133 +-
 arch/arm64/kvm/hyp/nvhe/mm.c                  |   17 +
 arch/arm64/kvm/hyp/nvhe/power/hvc.c           |   47 +
 arch/arm64/kvm/hyp/nvhe/power/scmi.c          |  231 +++
 arch/arm64/kvm/hyp/nvhe/setup.c               |    9 +
 arch/arm64/kvm/hyp/nvhe/timer-sr.c            |   42 +
 arch/arm64/kvm/iommu.c                        |   89 ++
 arch/arm64/kvm/mmu.c                          |   20 +
 arch/arm64/kvm/pkvm.c                         |   20 +
 drivers/gpu/drm/msm/msm_iommu.c               |    5 +-
 drivers/iommu/Kconfig                         |    9 +
 drivers/iommu/Makefile                        |    2 +-
 drivers/iommu/arm/arm-smmu-v3/Makefile        |    7 +
 .../arm/arm-smmu-v3/arm-smmu-v3-common.c      |  824 ++++++++++
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c   | 1093 +++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  989 +-----------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  758 +++------
 drivers/iommu/io-pgtable-arm-common.c         |  929 +++++++++++
 drivers/iommu/io-pgtable-arm.c                | 1061 +------------
 drivers/iommu/io-pgtable-arm.h                |   30 -
 drivers/iommu/io-pgtable.c                    |   15 +
 drivers/iommu/iommu.c                         |   53 +-
 include/kvm/arm_smmu_v3.h                     |   46 +
 include/kvm/iommu.h                           |   59 +
 include/kvm/power_domain.h                    |   24 +
 include/linux/io-pgtable-arm.h                |  233 +++
 include/linux/io-pgtable.h                    |   38 +-
 include/linux/iommu.h                         |   43 +-
 46 files changed, 7169 insertions(+), 2608 deletions(-)
 create mode 100644 arch/arm64/include/asm/arm-smmu-v3-common.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/iommu.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/io-pgtable-arm.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/power/hvc.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/power/scmi.c
 create mode 100644 arch/arm64/kvm/iommu.c
 create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
 create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
 create mode 100644 drivers/iommu/io-pgtable-arm-common.c
 delete mode 100644 drivers/iommu/io-pgtable-arm.h
 create mode 100644 include/kvm/arm_smmu_v3.h
 create mode 100644 include/kvm/iommu.h
 create mode 100644 include/kvm/power_domain.h
 create mode 100644 include/linux/io-pgtable-arm.h

Comments

Jason Gunthorpe Dec. 12, 2024, 7:41 p.m. UTC | #1
On Thu, Dec 12, 2024 at 06:03:24PM +0000, Mostafa Saleh wrote:

> This series adds a hypervisor driver for the Arm SMMUv3 IOMMU. Since the
> hypervisor part of pKVM (called nVHE here) is minimal, moving the whole
> host SMMU driver into nVHE isn't really an option. It is too large and
> complex and requires infrastructure from all over the kernel. We add a
> reduced nVHE driver that deals with populating the SMMU tables and the
> command queue, and the host driver still deals with probing and some
> initialization.

The cover letter doesn't explain why someone needs page tables in the
guest at all?

If you are able to implement nested support then you can boot the
guest with no-iommu and an effective identity translation through a
hypervisor controlled S2. ie no guest map/unmap. Great DMA
performance.

I thought the point of doing the paravirt here was to allow dynamic
pinning of the guest memory? This is the primary downside with nested.
The entire guest memory has to be pinned down at guest boot.

> 1. Paravirtual I/O page tables
> This is the solution implemented in this series. The host creates
> IOVA->HPA mappings with two hypercalls map_pages() and unmap_pages(), and
> the hypervisor populates the page tables. Page tables are abstracted into
> IOMMU domains, which allow multiple devices to share the same address
> space. Another four hypercalls, alloc_domain(), attach_dev(), detach_dev()
> and free_domain(), manage the domains, the semantics of those hypercalls
> are almost identical to the IOMMU ops which make the kernel driver part
> simpler.

That is re-inventing virtio-iommu. I don't really understand why this
series is hacking up arm-smmuv3 so much, that is not, and should not,
be a paravirt driver. Why not create a clean new pkvm specific driver
for the paravirt?? Or find a way to re-use parts of virtio-iommu?

Shouldn't other arch versions of pkvm be able to re-use the same guest
iommu driver?

> b- Locking: The io-pgtable-arm is lockless under some guarantees of how
>    the IOMMU code behaves. However with pKVM, the kernel is not trusted
>    and a malicious kernel can issue concurrent requests causing memory
>    corruption or UAF, so that it has to be locked in the hypervisor.

? I don't get it, the hypervisor page table has to be private to the
hypervisor. It is not that io-pgtable-arm is lockless, it is that it
relies on a particular kind of caller supplied locking. pkvm's calls
into its private io-pgtable-arm would need pkvm specific locking that
makes sense for it. Where does a malicious guest kernel get into this?

> 2. Nested SMMUv3 translation (with emulation)
> Another approach is to rely on nested translation support which is
> optional in SMMUv3, that requires an architecturally accurate emulation
> of SMMUv3 which can be complicated including cmdq emulation.

The confidential compute folks are going in this direction.

> The trade off between the 2 approaches can be roughly summarised as:
> Paravirtualization:
> - Compatible with more HW (and IOMMUs).
> - Better DMA performance due to shorter table walks/less TLB pressure
> - Needs extra complexity to squeeze the last bit of optimization (around
>   unmap, and map_sg).

It has better straight line DMA performance if the DMAs are all
static. Generally much, much worse performance if the DMAs are
dynamically mapped as you have to trap so much stuff.

The other negative is there is no way to get SVA support with
para-virtualization.

The positive is you don't have to pin the VM's memory.

> Nested Emulation
> - Faster map_pages (not sure about unmap because it requires cmdq
>   emulation for TLB invalidation if DVM not used).

If you can do nested then you can run in identity mode and then you
don't have any performance down side. It is a complete win.

Even if you do non-identity, nested is still likely faster for changing
translation than paravirt approaches. A single cmdq range invalidate
should be about the same broad overhead as a single paravirt call to
unmap except they can be batched under load.

Things like vCMDQ eliminate this overhead entirely, to my mind that is
the future direction of this HW as you obviously need to HW optimize
invalidation...

> - Needs extra complexity for architecturally emulating SMMUv3.

Lots of people have now done this, it is not really so bad. In
exchange you get a full architected feature set, better performance,
and are ready for HW optimizations.

> - Add IDENTITY_DOMAIN support, I already have some patches for that, but
>   didn’t want to complicate this series, I can send them separately.

This seems kind of pointless to me. If you can tolerate identity (ie
pin all memory) then do nested, and maybe don't even bother with a
guest iommu.

If you want most of the guest memory to be swappable/movable/whatever
then paravirt is the only choice, and you really don't want the guest
to have any identity support at all.

Really, I think you'd want to have both options, there is no "best"
here. It depends what people want to use the VM for.

My advice for merging would be to start with the pkvm side setting up
a fully pinned S2 and do not have a guest driver. Nesting without
emulating smmuv3. Basically you get protected identity DMA support. I
think that would be a much less sprawling patch series. From there it
would be well positioned to add both smmuv3 emulation and a paravirt
iommu flow.

Jason
Mostafa Saleh Dec. 13, 2024, 7:39 p.m. UTC | #2
Hi Jason,

Thanks a lot for taking the time to review this; I tried to reply to
all the points. However, I think a main source of confusion was that
this is only for the host kernel, not guests: with this series, guests
still have no access to DMA under pKVM. I hope that clarifies some of
the points.

On Thu, Dec 12, 2024 at 03:41:19PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 12, 2024 at 06:03:24PM +0000, Mostafa Saleh wrote:
> 
> > This series adds a hypervisor driver for the Arm SMMUv3 IOMMU. Since the
> > hypervisor part of pKVM (called nVHE here) is minimal, moving the whole
> > host SMMU driver into nVHE isn't really an option. It is too large and
> > complex and requires infrastructure from all over the kernel. We add a
> > reduced nVHE driver that deals with populating the SMMU tables and the
> > command queue, and the host driver still deals with probing and some
> > initialization.
> 
> The cover letter doesn't explain why someone needs page tables in the
> guest at all?

This is not for guests but for the host: the hypervisor needs to
establish DMA isolation between the host and the hypervisor/guests.
Before these patches, as mentioned, the host can program a DMA device
to read/write any memory (that has nothing to do with whether the
guest has DMA access or not).

So it’s mandatory for pKVM to establish DMA isolation, otherwise
it can be easily defeated.

However, guest DMA support is optional and only needed for device
passthrough. I have some patches to support that in pKVM as well (only
with vfio-platform), but they are unlikely to be posted upstream before
a host DMA isolation solution is merged, as that is mandatory.

> 
> If you are able to implement nested support then you can boot the
> guest with no-iommu and an effective identity translation through a
> hypervisor controlled S2. ie no guest map/unmap. Great DMA
> performance.

We can do that for the host also, which is discussed in the v1 cover
letter. However, we try to keep feature parity with the normal (VHE)
KVM arm64 support, so constraining KVM support to not have IOVA spaces
for devices seems too restrictive and impractical on modern systems
(phones for example).

> 
> I thought the point of doing the paravirt here was to allow dynamic
> pinning of the guest memory? This is the primary downside with nested.
> The entire guest memory has to be pinned down at guest boot.

As this is for the host, memory pinning is not really an issue (however,
with nesting and a shared CPU stage-2 there are other challenges, as
mentioned).

> 
> > 1. Paravirtual I/O page tables
> > This is the solution implemented in this series. The host creates
> > IOVA->HPA mappings with two hypercalls map_pages() and unmap_pages(), and
> > the hypervisor populates the page tables. Page tables are abstracted into
> > IOMMU domains, which allow multiple devices to share the same address
> > space. Another four hypercalls, alloc_domain(), attach_dev(), detach_dev()
> > and free_domain(), manage the domains, the semantics of those hypercalls
> > are almost identical to the IOMMU ops which make the kernel driver part
> > simpler.
> 
> That is re-inventing virtio-iommu. I don't really understand why this
> series is hacking up arm-smmuv3 so much, that is not, and should not,
> be a paravirt driver. Why not create a clean new pkvm specific driver
> for the paravirt?? Or find a way to re-use parts of virtio-iommu?
> 
> Shouldn't other arch versions of pkvm be able to re-use the same guest
> iommu driver?

As mentioned, this is for the host kernel, not the guest. However, the
hypervisor/kernel interface is not IOMMU specific, and it can be
extended to other IOMMUs/archs.

There is no hacking up of the arm-smmu-v3 driver, mostly splitting the
driver so it can be re-used, plus introducing a separate hypervisor
driver; it’s similar to how SVA re-uses part of the driver, just on a
bigger scale.

> 
> > b- Locking: The io-pgtable-arm is lockless under some guarantees of how
> >    the IOMMU code behaves. However with pKVM, the kernel is not trusted
> >    and a malicious kernel can issue concurrent requests causing memory
> >    corruption or UAF, so that it has to be locked in the hypervisor.
> 
> ? I don't get it, the hypervisor page table has to be private to the
> hypervisor. It is not that io-pgtable-arm is lockless, it is that it
> relies on a particular kind of caller supplied locking. pkvm's calls
> into its private io-pgtable-arm would need pkvm specific locking that
> makes sense for it. Where does a malicious guest kernel get into this?

At the moment, when the kernel driver uses io-pgtable-arm, it doesn’t
protect it with any locks, based on some assumptions; for example,
concurrently unmapping a block-sized table and a page-sized leaf inside
it can cause a UAF, but the DMA API never does that.

With pKVM, the host kernel is not trusted, and if compromised it can
mount such attacks to corrupt hypervisor memory, so the hypervisor has
to lock io-pgtable-arm operations at EL2 to avoid that.

> 
> > 2. Nested SMMUv3 translation (with emulation)
> > Another approach is to rely on nested translation support which is
> > optional in SMMUv3, that requires an architecturally accurate emulation
> > of SMMUv3 which can be complicated including cmdq emulation.
> 
> The confidential compute folks are going in this direction.

I see, but one key advantage of pKVM is that it requires minimal
hardware; with the paravirtual approach we can support single-stage
SMMUv3 or even non-architected IOMMUs. That, plus the DMA performance,
might give it a slight edge, but as I mentioned I plan to do a more
thorough comparison with nesting and maybe discuss it in a conference
this year.

> 
> > The trade off between the 2 approaches can be roughly summarised as:
> > Paravirtualization:
> > - Compatible with more HW (and IOMMUs).
> > - Better DMA performance due to shorter table walks/less TLB pressure
> > - Needs extra complexity to squeeze the last bit of optimization (around
> >   unmap, and map_sg).
> 
> It has better straight line DMA performance if the DMAs are all
> static. Generally much, much worse performance if the DMAs are
> dynamically mapped as you have to trap so much stuff.

I agree it’s not that clear; I will finish the nested implementation
and run some standard IO benchmarks.

> 
> The other negative is there is no way to get SVA support with
> para-virtualization.
> 
Yeah, SVA is tricky, I guess for that we would have to use nesting,
but tbh, I don’t think it’s a deal breaker for now.

> The positive is you don't have to pin the VM's memory.
> 
> > Nested Emulation
> > - Faster map_pages (not sure about unmap because it requires cmdq
> >   emulation for TLB invalidation if DVM not used).
> 
> If you can do nested then you can run in identity mode and then you
> don't have any performance down side. It is a complete win.

Unfortunately, as mentioned above, it’s not that practical; many devices
in the mobile space expect IO translation capability.

> 
> Even if you do non-idenity nested is still likely faster for changing
> translation than paravirt approaches. A single cmdq range invalidate
> should be about the same broad overhead as a single paravirt call to
> unmap except they can be batched under load.
> 
> Things like vCMDQ eliminate this overhead entirely, to my mind that is
> the future direction of this HW as you obviously need to HW optimize
> invalidation...
> 
> > - Needs extra complexity for architecturally emulating SMMUv3.
> 
> Lots of people have now done this, it is not really so bad. In
> exchange you get a full architected feature set, better performance,
> and are ready for HW optimizations.

It’s not impossible, it’s just more complicated to do in the
hypervisor, which has limited features compared to the kernel. Also, I
haven’t seen any open source implementation of that except for Qemu,
which is in userspace.

> 
> > - Add IDENTITY_DOMAIN support, I already have some patches for that, but
> >   didn’t want to complicate this series, I can send them separately.
> 
> This seems kind of pointless to me. If you can tolerate identity (ie
> pin all memory) then do nested, and maybe don't even bother with a
> guest iommu.

As mentioned, the choice of para-virt was not only to avoid pinning.
As this is the host, for IDENTITY_DOMAIN we either share the CPU
stage-2 page table, and then we have to deal with lazy mapping (SMMU
features, BBM...), or we mirror it in a shadow, SMMU-only identity
page table.

> 
> If you want most of the guest memory to be swappable/movable/whatever
> then paravirt is the only choice, and you really don't want the guest
> to have any identiy support at all.
> 
> Really, I think you'd want to have both options, there is no "best"
> here. It depends what people want to use the VM for.
> 
> My advice for merging would be to start with the pkvm side setting up
> a fully pinned S2 and do not have a guest driver. Nesting without
> emulating smmuv3. Basically you get protected identity DMA support. I
> think that would be a much less sprawling patch series. From there it
> would be well positioned to add both smmuv3 emulation and a paravirt
> iommu flow.
> 

I am open to any suggestions, but I believe any solution considered
for merging should have enough features to be usable on actual systems
(for example, a translating IOMMU can be used), so either para-virt as
in this series or full nesting as in the PoC above (or maybe both?),
which IMO comes down to the trade-off mentioned above.

Thanks,
Mostafa

> Jason