[RFC,19/45] KVM: arm64: iommu: Add domains

Message ID 20230201125328.2186498-20-jean-philippe@linaro.org (mailing list archive)
State New, archived
Series KVM: Arm SMMUv3 driver for pKVM

Commit Message

Jean-Philippe Brucker Feb. 1, 2023, 12:53 p.m. UTC
The IOMMU domain abstraction allows the same page tables to be shared
between multiple devices. That may be necessary due to hardware
constraints, if multiple devices cannot be isolated by the IOMMU
(conventional PCI bus for example). It may also help with optimizing
resource or TLB use. For pKVM in particular, it may be useful to reduce
the amount of memory required for page tables. All devices owned by the
host kernel could be attached to the same domain (though that requires
host changes).

Each IOMMU device holds an array of domains, and the host allocates
domain IDs that index this array. The alloc() operation initializes the
domain and prepares the page tables. The attach() operation initializes
the device table that holds the PGD and its configuration.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 arch/arm64/kvm/hyp/include/nvhe/iommu.h |  16 +++
 include/kvm/iommu.h                     |  55 ++++++++
 arch/arm64/kvm/hyp/nvhe/iommu/iommu.c   | 161 ++++++++++++++++++++++++
 3 files changed, 232 insertions(+)
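
As an editorial aside (not part of the patch): using the constants this patch
adds to include/kvm/iommu.h, a domain ID indexes the two-level domain table
roughly as sketched below, mirroring handle_to_domain() in the diff; the
allocate-on-demand path for missing leaf pages is omitted.

static struct kvm_hyp_iommu_domain *
lookup_domain_sketch(struct kvm_hyp_iommu *iommu, pkvm_handle_t domain_id)
{
	struct kvm_hyp_iommu_domain *leaf;

	if (domain_id >= iommu->nr_domains)
		return NULL;

	/* High bits (above KVM_IOMMU_DOMAIN_ID_SPLIT) pick the root entry */
	leaf = iommu->domains[domain_id >> KVM_IOMMU_DOMAIN_ID_SPLIT];
	if (!leaf)
		return NULL;	/* leaf page not donated yet */

	/* Low bits pick the 16-byte entry within the leaf page */
	return &leaf[domain_id & KVM_IOMMU_DOMAIN_ID_LEAF_MASK];
}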

Comments

Mostafa Saleh Feb. 7, 2023, 1:13 p.m. UTC | #1
Hi Jean,

On Wed, Feb 01, 2023 at 12:53:03PM +0000, Jean-Philippe Brucker wrote:
> The IOMMU domain abstraction allows to share the same page tables
> between multiple devices. That may be necessary due to hardware
> constraints, if multiple devices cannot be isolated by the IOMMU
> (conventional PCI bus for example). It may also help with optimizing
> resource or TLB use. For pKVM in particular, it may be useful to reduce
> the amount of memory required for page tables. All devices owned by the
> host kernel could be attached to the same domain (though that requires
> host changes).
> 
> Each IOMMU device holds an array of domains, and the host allocates
> domain IDs that index this array. The alloc() operation initializes the
> domain and prepares the page tables. The attach() operation initializes
> the device table that holds the PGD and its configuration.

I was wondering about the need for pre-allocation of the domain array.

An alternative way I see:
- We don’t pre-allocate any domain.

- When the EL1 driver gets a domain_alloc request, it allocates both the
kernel domain (iommu_domain) and the hypervisor domain (kvm_hyp_iommu_domain).

- In __pkvm_host_iommu_alloc_domain, it will take over the hyp struct
from the kernel (via donation).

- In all other hypercalls, the kernel address of kvm_hyp_iommu_domain will
be used as domain ID, which guarantees uniqueness and O(1) access.

- The hypervisor would just need to transform the address (kern_hyp_va)
to get the domain pointer.

I believe that would save some memory as we allocate domains when needed.
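
For illustration only, the hypercall side of this scheme could look roughly
like the sketch below (the function name is made up; how to validate the
host-provided address is the open question):

/*
 * Sketch: resolve a host-provided domain "ID" (the kernel VA of the donated
 * kvm_hyp_iommu_domain) into a hypervisor pointer. This does not yet check
 * that the address really is a previously donated domain structure.
 */
static struct kvm_hyp_iommu_domain *
domain_from_host_handle(unsigned long domain_id)
{
	return kern_hyp_va((struct kvm_hyp_iommu_domain *)domain_id);
}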

Please let me know what you think about this?

Thanks,
Mostafa
Mostafa Saleh Feb. 8, 2023, 12:31 p.m. UTC | #2
On Tue, Feb 7, 2023 at 1:13 PM Mostafa Saleh <smostafa@google.com> wrote:

> I was wondering about the need for pre-allocation of the domain array.
>
> An alternative way I see:
> - We don’t pre-allocate any domain.
>
> - When the EL1 driver has a request to domain_alloc, it will allocate
> both kernel(iommu_domain) and hypervisor domains(kvm_hyp_iommu_domain).
>
> - In __pkvm_host_iommu_alloc_domain, it will take over the hyp struct
> from the kernel (via donation).
>
> - In all other hypercalls, the kernel address of kvm_hyp_iommu_domain will
> be used as domain ID, which guarantees uniqueness and O(1) access.
>
> - The hypervisor would just need to transform the address(kern_hyp_va)
> to get the domain pointer.


This actually will not work with the current sequence: we can't
guarantee that the domain_id sent later from the host can be trusted, and
since the domain points to the page table this can be dangerous. I will
have a closer look to see if we can make this work somehow.
Jean-Philippe Brucker Feb. 8, 2023, 6:05 p.m. UTC | #3
On Wed, Feb 08, 2023 at 12:31:15PM +0000, Mostafa Saleh wrote:
> On Tue, Feb 7, 2023 at 1:13 PM Mostafa Saleh <smostafa@google.com> wrote:
> 
> > I was wondering about the need for pre-allocation of the domain array.
> >
> > An alternative way I see:
> > - We don’t pre-allocate any domain.
> >
> > - When the EL1 driver has a request to domain_alloc, it will allocate
> > both kernel(iommu_domain) and hypervisor domains(kvm_hyp_iommu_domain).
> >
> > - In __pkvm_host_iommu_alloc_domain, it will take over the hyp struct
> > from the kernel (via donation).

That also requires an entire page for each domain no?  I guess this domain
table would only be worse in memory use if we have fewer than 2 domains,
since it costs one page for the root table, and then stores 256 domains
per leaf page.

What I've been trying to avoid with this table is introducing a malloc in
the hypervisor, but we might have to bite the bullet eventually (although
with a malloc, access will probably be worse than O(1)).
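
To put rough numbers on it (4kB pages, 16-byte kvm_hyp_iommu_domain): a leaf
page holds 4096 / 16 = 256 domains and covering all 2^16 IDs takes
65536 / 256 = 256 root pointers, so N domains cost about one page for the
root plus one 4kB leaf page per 256 domains in use, whereas donating a page
per domain costs 4kB * N; the table only loses when there is a single domain.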

Thanks,
Jean

> >
> > - In all other hypercalls, the kernel address of kvm_hyp_iommu_domain will
> > be used as domain ID, which guarantees uniqueness and O(1) access.
> >
> > - The hypervisor would just need to transform the address(kern_hyp_va)
> > to get the domain pointer.
> 
> 
> This actually will not work with the current sequence, as we can't
> guarantee that the domain_id sent later from the host is trusted, and as
> the domain points to the page table this can be dangerous, I will have a
> closer look to see if we can make this work somehow.
Mostafa Saleh Feb. 10, 2023, 10:03 p.m. UTC | #4
On Wed, Feb 8, 2023 at 6:05 PM Jean-Philippe Brucker
<jean-philippe@linaro.org> wrote:
>
> On Wed, Feb 08, 2023 at 12:31:15PM +0000, Mostafa Saleh wrote:
> > On Tue, Feb 7, 2023 at 1:13 PM Mostafa Saleh <smostafa@google.com> wrote:
> >
> > > I was wondering about the need for pre-allocation of the domain array.
> > >
> > > An alternative way I see:
> > > - We don’t pre-allocate any domain.
> > >
> > > - When the EL1 driver has a request to domain_alloc, it will allocate
> > > both kernel(iommu_domain) and hypervisor domains(kvm_hyp_iommu_domain).
> > >
> > > - In __pkvm_host_iommu_alloc_domain, it will take over the hyp struct
> > > from the kernel (via donation).
>
> That also requires an entire page for each domain no?  I guess this domain
> table would only be worse in memory use if we have fewer than 2 domains,
> since it costs one page for the root table, and then stores 256 domains
> per leaf page.

Yes, that would require a page for a domain also, which is inefficient.

> What I've been trying to avoid with this table is introducing a malloc in
> the hypervisor, but we might have to bite the bullet eventually (although
> with a malloc, access will probably worse than O(1)).

An alternative approach (rough sketch below):

1- At SMMU init, allocate a VA range (via pkvm_alloc_private_va_range) that
is not backed by any memory and is large enough for the maximum number of
domains.
2- This acts as one large array indexed by domain ID, which gets filled on
demand from the memcache.
3- alloc_domain makes sure the new domain_id has a backing page; any other
access from map and unmap then simply indexes this memory.

This saves the extra page for the root table, and handle_to_domain becomes
slightly more efficient.
But it can cause page faults at EL2 if the domain_id is not valid (i.e. not
previously allocated at EL2), so I am not sure it is worth it.
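
For what it's worth, a rough sketch of that idea (hyp_map_page_at() is a
hypothetical helper standing in for whatever would map a donated page at a
fixed private VA; a real version would also check whether the slot's page is
already backed):

/* Reserve an unbacked VA range at init, back it page by page on demand */
static struct kvm_hyp_iommu_domain *domains_base;

static int init_domains_va(void)
{
	unsigned long va;
	size_t size = KVM_IOMMU_MAX_DOMAINS *
		      sizeof(struct kvm_hyp_iommu_domain);
	int ret = pkvm_alloc_private_va_range(size, &va);

	if (!ret)
		domains_base = (struct kvm_hyp_iommu_domain *)va;
	return ret;
}

static int back_domain_slot(pkvm_handle_t domain_id)
{
	unsigned long va = ALIGN_DOWN((unsigned long)&domains_base[domain_id],
				      PAGE_SIZE);
	void *p = kvm_iommu_donate_page();

	if (!p)
		return -ENOMEM;
	/* hypothetical: map the donated page at the fixed hyp VA */
	return hyp_map_page_at(va, hyp_virt_to_phys(p));
}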


Thanks,
Mostafa
Mostafa Saleh May 19, 2023, 3:33 p.m. UTC | #5
Hi Jean,

On Wed, Feb 01, 2023 at 12:53:03PM +0000, Jean-Philippe Brucker wrote:
> +/*
> + * Serialize access to domains and IOMMU driver internal structures (command
> + * queue, device tables)
> + */
> +static hyp_spinlock_t iommu_lock;
> +
I was looking more into this lock and I think we can make it per-IOMMU instead
of having one big lock, to avoid contention, as I see it is only used to
protect per-IOMMU resources.
Some special handling is needed, as hyp_spinlock_t is not exposed to EL1.
Maybe something like this:

diff --git a/arch/arm64/kvm/hyp/hyp-constants.c b/arch/arm64/kvm/hyp/hyp-constants.c
index b257a3b4bfc5..96d30a37f9e6 100644
--- a/arch/arm64/kvm/hyp/hyp-constants.c
+++ b/arch/arm64/kvm/hyp/hyp-constants.c
@@ -3,11 +3,13 @@
 #include <linux/kbuild.h>
 #include <nvhe/memory.h>
 #include <nvhe/pkvm.h>
+#include <nvhe/spinlock.h>

 int main(void)
 {
 	DEFINE(STRUCT_HYP_PAGE_SIZE,	sizeof(struct hyp_page));
 	DEFINE(PKVM_HYP_VM_SIZE,	sizeof(struct pkvm_hyp_vm));
 	DEFINE(PKVM_HYP_VCPU_SIZE,	sizeof(struct pkvm_hyp_vcpu));
+	DEFINE(HYP_SPINLOCK_SIZE,	sizeof(hyp_spinlock_t));
 	return 0;
 }
diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
index a90b97d8bae3..cf9195e24a08 100644
--- a/drivers/iommu/arm/arm-smmu-v3/Makefile
+++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
@@ -6,6 +6,7 @@ arm_smmu_v3-objs-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
 arm_smmu_v3-objs := $(arm_smmu_v3-objs-y)
 
 obj-$(CONFIG_ARM_SMMU_V3_PKVM) += arm_smmu_v3_kvm.o
+ccflags-$(CONFIG_ARM_SMMU_V3_PKVM) += -Iarch/arm64/kvm/
 arm_smmu_v3_kvm-objs-y += arm-smmu-v3-kvm.o
 arm_smmu_v3_kvm-objs-y += arm-smmu-v3-common.o
 arm_smmu_v3_kvm-objs := $(arm_smmu_v3_kvm-objs-y)
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index ab888da731bc..82827b99b1ed 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -5,6 +5,12 @@
 #include <asm/kvm_host.h>
 #include <kvm/power_domain.h>
 #include <linux/io-pgtable.h>
+#ifdef __KVM_NVHE_HYPERVISOR__
+#include <nvhe/spinlock.h>
+#else
+#include "hyp_constants.h"
+#endif
 
 /*
  * Parameters from the trusted host:
@@ -23,6 +29,11 @@ struct kvm_hyp_iommu {
 
 	struct io_pgtable_params	*pgtable;
 	bool				power_is_off;
+#ifdef __KVM_NVHE_HYPERVISOR__
+	hyp_spinlock_t			iommu_lock;
+#else
+	u8 unused[HYP_SPINLOCK_SIZE];
+#endif
 };
 
 struct kvm_hyp_iommu_memcache {
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index 1f4d5fcc1386..afaf173e65ed 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -14,12 +14,6 @@
 
 struct kvm_hyp_iommu_memcache __ro_after_init *kvm_hyp_iommu_memcaches;
 
-/*
- * Serialize access to domains and IOMMU driver internal structures (command
- * queue, device tables)
- */
-static hyp_spinlock_t iommu_lock;
-
 #define domain_to_iopt(_iommu, _domain, _domain_id)		\
 	(struct io_pgtable) {					\
 		.ops = &(_iommu)->pgtable->ops,			\
@@ -93,10 +87,10 @@ int kvm_iommu_alloc_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
 {
 	int ret = -EINVAL;
 	struct io_pgtable iopt;
-	struct kvm_hyp_iommu *iommu;
+	struct kvm_hyp_iommu *iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
 	struct kvm_hyp_iommu_domain *domain;
 
-	hyp_spin_lock(&iommu_lock);
+	hyp_spin_lock(&iommu->iommu_lock);
 	domain = handle_to_domain(iommu_id, domain_id, &iommu);
 	if (!domain)
 		goto out_unlock;
@@ -112,7 +106,7 @@ int kvm_iommu_alloc_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
 	domain->refs = 1;
 	domain->pgd = iopt.pgd;
 out_unlock:
-	hyp_spin_unlock(&iommu_lock);
+	hyp_spin_unlock(&iommu->iommu_lock);
 	return ret;
 }
 
@@ -120,10 +114,10 @@ int kvm_iommu_free_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id)
 {
 	int ret = -EINVAL;
 	struct io_pgtable iopt;
-	struct kvm_hyp_iommu *iommu;
+	struct kvm_hyp_iommu *iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
 	struct kvm_hyp_iommu_domain *domain;
 
-	hyp_spin_lock(&iommu_lock);
+	hyp_spin_lock(&iommu->iommu_lock);
 	domain = handle_to_domain(iommu_id, domain_id, &iommu);
 	if (!domain)
 		goto out_unlock;
@@ -137,7 +131,7 @@ int kvm_iommu_free_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id)
 	memset(domain, 0, sizeof(*domain));
 
 out_unlock:
-	hyp_spin_unlock(&iommu_lock);
+	hyp_spin_unlock(&iommu->iommu_lock);
 	return ret;
 }
Jean-Philippe Brucker June 2, 2023, 3:29 p.m. UTC | #6
Hi Mostafa,

On Fri, 19 May 2023 at 16:33, Mostafa Saleh <smostafa@google.com> wrote:
>
> Hi Jean,
>
> On Wed, Feb 01, 2023 at 12:53:03PM +0000, Jean-Philippe Brucker wrote:
> > +/*
> > + * Serialize access to domains and IOMMU driver internal structures (command
> > + * queue, device tables)
> > + */
> > +static hyp_spinlock_t iommu_lock;
> > +
> I was looking more into this lock and I think we can make it per IOMMU instead
> of having one big lock to avoid congestion, as I see it is only used to
> protect per-IOMMU resources.

Yes it's a major bottleneck, thanks for looking into this. I think we'll
eventually want to improve scalability within an IOMMU and even a domain:
a multi-queue network device will share a single domain between multiple
CPUs, each issuing lots of map/unmap calls. Or just two device drivers
working on different CPUs and sharing one IOMMU. In those cases the
per-IOMMU lock won't be good enough and we'll be back to this problem, but
a per-IOMMU lock should already improve scalability for some systems.

Currently the global lock protects:

(1) The domain array (per IOMMU)
(2) The io-pgtables (per domain)
(3) The command queue (per SMMU)
(4) Power state (per SMMU)

I ran some experiments with refcounting the domains to avoid the lock for
(1) and (2), which improves map() scalability. See
https://jpbrucker.net/git/linux/commit/?h=pkvm/smmu-wip&id=5ad3bc6fd589b6f11fe31ccee5b8777ba8739167
(and another experiment to optimize the DMA refcount on branch pkvm/smmu-wip)

For (2), the io-pgtable-arm code already supports concurrent accesses
(2c3d273eabe8 ("iommu/io-pgtable-arm: Support lockless operation")). But
this one needs careful consideration because in the host, the io-pgtable
code trusts the device drivers. For example it expects that only buggy
drivers call map()/unmap() to the same IOVA concurrently. We need to make
sure a compromised host driver can't exploit these assumptions to do
anything nasty.

There are options to improve (3) as well. The host SMMU driver supports
lockless command-queue (587e6c10a7ce ("iommu/arm-smmu-v3: Reduce
contention during command-queue insertion")) but it may be too complex to
put in the hypervisor (or rather, I haven't tried to understand it yet).
We could also wait for systems with ECMDQ, which enables per-CPU queues.

> Some special handling needed as hyp_spinlock_t is not exposed to EL1.
> Maybe something like this:
>
> diff --git a/arch/arm64/kvm/hyp/hyp-constants.c b/arch/arm64/kvm/hyp/hyp-constants.c
> index b257a3b4bfc5..96d30a37f9e6 100644
> --- a/arch/arm64/kvm/hyp/hyp-constants.c
> +++ b/arch/arm64/kvm/hyp/hyp-constants.c
> @@ -3,11 +3,13 @@
>  #include <linux/kbuild.h>
>  #include <nvhe/memory.h>
>  #include <nvhe/pkvm.h>
> +#include <nvhe/spinlock.h>
>
>  int main(void)
>  {
>         DEFINE(STRUCT_HYP_PAGE_SIZE,    sizeof(struct hyp_page));
>         DEFINE(PKVM_HYP_VM_SIZE,        sizeof(struct pkvm_hyp_vm));
>         DEFINE(PKVM_HYP_VCPU_SIZE,      sizeof(struct pkvm_hyp_vcpu));
> +       DEFINE(HYP_SPINLOCK_SIZE,       sizeof(hyp_spinlock_t));
>         return 0;
>  }
> diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
> index a90b97d8bae3..cf9195e24a08 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/Makefile
> +++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
> @@ -6,6 +6,7 @@ arm_smmu_v3-objs-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
>  arm_smmu_v3-objs := $(arm_smmu_v3-objs-y)
>
>  obj-$(CONFIG_ARM_SMMU_V3_PKVM) += arm_smmu_v3_kvm.o
> +ccflags-$(CONFIG_ARM_SMMU_V3_PKVM) += -Iarch/arm64/kvm/
>  arm_smmu_v3_kvm-objs-y += arm-smmu-v3-kvm.o
>  arm_smmu_v3_kvm-objs-y += arm-smmu-v3-common.o
>  arm_smmu_v3_kvm-objs := $(arm_smmu_v3_kvm-objs-y)
> diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
> index ab888da731bc..82827b99b1ed 100644
> --- a/include/kvm/iommu.h
> +++ b/include/kvm/iommu.h
> @@ -5,6 +5,12 @@
>  #include <asm/kvm_host.h>
>  #include <kvm/power_domain.h>
>  #include <linux/io-pgtable.h>
> +#ifdef __KVM_NVHE_HYPERVISOR__
> +#include <nvhe/spinlock.h>
> +#else
> +#include "hyp_constants.h"
> +#endif
>
>  /*
>   * Parameters from the trusted host:
> @@ -23,6 +29,11 @@ struct kvm_hyp_iommu {
>
>         struct io_pgtable_params        *pgtable;
>         bool                            power_is_off;
> +#ifdef __KVM_NVHE_HYPERVISOR__
> +       hyp_spinlock_t                  iommu_lock;
> +#else
> +       u8 unused[HYP_SPINLOCK_SIZE];
> +#endif
>  };
>
>  struct kvm_hyp_iommu_memcache {
> diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> index 1f4d5fcc1386..afaf173e65ed 100644
> --- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> +++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> @@ -14,12 +14,6 @@
>
>  struct kvm_hyp_iommu_memcache __ro_after_init *kvm_hyp_iommu_memcaches;
>
> -/*
> - * Serialize access to domains and IOMMU driver internal structures (command
> - * queue, device tables)
> - */
> -static hyp_spinlock_t iommu_lock;
> -
>  #define domain_to_iopt(_iommu, _domain, _domain_id)            \
>         (struct io_pgtable) {                                   \
>                 .ops = &(_iommu)->pgtable->ops,                 \
> @@ -93,10 +87,10 @@ int kvm_iommu_alloc_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
>  {
>         int ret = -EINVAL;
>         struct io_pgtable iopt;
> -       struct kvm_hyp_iommu *iommu;
> +       struct kvm_hyp_iommu *iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
>         struct kvm_hyp_iommu_domain *domain;
>
> -       hyp_spin_lock(&iommu_lock);
> +       hyp_spin_lock(&iommu->iommu_lock);
>         domain = handle_to_domain(iommu_id, domain_id, &iommu);
>         if (!domain)
>                 goto out_unlock;
> @@ -112,7 +106,7 @@ int kvm_iommu_alloc_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
>         domain->refs = 1;
>         domain->pgd = iopt.pgd;
>  out_unlock:
> -       hyp_spin_unlock(&iommu_lock);
> +       hyp_spin_unlock(&iommu->iommu_lock);
>         return ret;
>  }
>
> @@ -120,10 +114,10 @@ int kvm_iommu_free_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id)
>  {
>         int ret = -EINVAL;
>         struct io_pgtable iopt;
> -       struct kvm_hyp_iommu *iommu;
> +       struct kvm_hyp_iommu *iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
>         struct kvm_hyp_iommu_domain *domain;
>
> -       hyp_spin_lock(&iommu_lock);
> +       hyp_spin_lock(&iommu->iommu_lock);
>         domain = handle_to_domain(iommu_id, domain_id, &iommu);
>         if (!domain)
>                 goto out_unlock;
> @@ -137,7 +131,7 @@ int kvm_iommu_free_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id)
>         memset(domain, 0, sizeof(*domain));
>
>  out_unlock:
> -       hyp_spin_unlock(&iommu_lock);
> +       hyp_spin_unlock(&iommu->iommu_lock);
>         return ret;
>  }
> --
>
> (I didn't include full patch as it is too long, but mainly the
> rest is s/&iommu_lock/&iommu->iommu_lock)
>
> Please let me know what do you think?

It makes sense; I can append it to my tree, or squash it (with your S-o-B)
if you want to send the full patch.

Thanks,
Jean
Mostafa Saleh June 15, 2023, 1:32 p.m. UTC | #7
Hi Jean,

On Fri, Jun 02, 2023 at 04:29:07PM +0100, Jean-Philippe Brucker wrote:
> Hi Mostafa,
> 
> On Fri, 19 May 2023 at 16:33, Mostafa Saleh <smostafa@google.com> wrote:
> >
> > Hi Jean,
> >
> > On Wed, Feb 01, 2023 at 12:53:03PM +0000, Jean-Philippe Brucker wrote:
> > > +/*
> > > + * Serialize access to domains and IOMMU driver internal structures (command
> > > + * queue, device tables)
> > > + */
> > > +static hyp_spinlock_t iommu_lock;
> > > +
> > I was looking more into this lock and I think we can make it per IOMMU instead
> > of having one big lock to avoid congestion, as I see it is only used to
> > protect per-IOMMU resources.
> 
> Yes it's a major bottleneck, thanks for looking into this. I think we'll
> eventually want to improve scalability within an IOMMU and even a domain:
> a multi-queue network device will share a single domain between multiple
> CPUs, each issuing lots of map/unmap calls. Or just two devices drivers
> working on different CPUs and sharing one IOMMU. In those cases the
> per-IOMMU lock won't be good enough and we'll be back to this problem, but
> a per-IOMMU lock should already improve scalability for some systems.
> 
> Currently the global lock protects:
> 
> (1) The domain array (per IOMMU)
> (2) The io-pgtables (per domain)
> (3) The command queue (per SMMU)
> (4) Power state (per SMMU)
> 
> I ran some experiments with refcounting the domains to avoid the lock for
> (1) and (2), which improves map() scalability. See
> https://jpbrucker.net/git/linux/commit/?h=pkvm/smmu-wip&id=5ad3bc6fd589b6f11fe31ccee5b8777ba8739167
> (and another experiment to optimize the DMA refcount on branch pkvm/smmu-wip)

Thanks for the detailed explanation! I will give it a try.

> For (2), the io-pgtable-arm code already supports concurrent accesses
> (2c3d273eabe8 ("iommu/io-pgtable-arm: Support lockless operation")). But
> this one needs careful consideration because in the host, the io-pgtable
> code trusts the device drivers. For example it expects that only buggy
> drivers call map()/unmap() to the same IOVA concurrently. We need to make
> sure a compromised host drivers can't exploit these assumptions to do
> anything nasty.
> 
> There are options to improve (3) as well. The host SMMU driver supports
> lockless command-queue (587e6c10a7ce ("iommu/arm-smmu-v3: Reduce
> contention during command-queue insertion")) but it may be too complex to
> put in the hypervisor (or rather, I haven't tried to understand it yet).
> We could also wait for systems with ECMDQ, which enables per-CPU queues.
I was thinking about the same thing: in the future we should aim to have a
lockless approach for the cmdq at EL2, similar to the host's. I am not sure
how feasible it is, I will have a closer look. But having a per-IOMMU lock
should be a step in the right direction.

> > Some special handling needed as hyp_spinlock_t is not exposed to EL1.
> > Maybe something like this:
> >
> > diff --git a/arch/arm64/kvm/hyp/hyp-constants.c b/arch/arm64/kvm/hyp/hyp-constants.c
> > index b257a3b4bfc5..96d30a37f9e6 100644
> > --- a/arch/arm64/kvm/hyp/hyp-constants.c
> > +++ b/arch/arm64/kvm/hyp/hyp-constants.c
> > @@ -3,11 +3,13 @@
> >  #include <linux/kbuild.h>
> >  #include <nvhe/memory.h>
> >  #include <nvhe/pkvm.h>
> > +#include <nvhe/spinlock.h>
> >
> >  int main(void)
> >  {
> >         DEFINE(STRUCT_HYP_PAGE_SIZE,    sizeof(struct hyp_page));
> >         DEFINE(PKVM_HYP_VM_SIZE,        sizeof(struct pkvm_hyp_vm));
> >         DEFINE(PKVM_HYP_VCPU_SIZE,      sizeof(struct pkvm_hyp_vcpu));
> > +       DEFINE(HYP_SPINLOCK_SIZE,       sizeof(hyp_spinlock_t));
> >         return 0;
> >  }
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
> > index a90b97d8bae3..cf9195e24a08 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/Makefile
> > +++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
> > @@ -6,6 +6,7 @@ arm_smmu_v3-objs-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
> >  arm_smmu_v3-objs := $(arm_smmu_v3-objs-y)
> >
> >  obj-$(CONFIG_ARM_SMMU_V3_PKVM) += arm_smmu_v3_kvm.o
> > +ccflags-$(CONFIG_ARM_SMMU_V3_PKVM) += -Iarch/arm64/kvm/
> >  arm_smmu_v3_kvm-objs-y += arm-smmu-v3-kvm.o
> >  arm_smmu_v3_kvm-objs-y += arm-smmu-v3-common.o
> >  arm_smmu_v3_kvm-objs := $(arm_smmu_v3_kvm-objs-y)
> > diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
> > index ab888da731bc..82827b99b1ed 100644
> > --- a/include/kvm/iommu.h
> > +++ b/include/kvm/iommu.h
> > @@ -5,6 +5,12 @@
> >  #include <asm/kvm_host.h>
> >  #include <kvm/power_domain.h>
> >  #include <linux/io-pgtable.h>
> > +#ifdef __KVM_NVHE_HYPERVISOR__
> > +#include <nvhe/spinlock.h>
> > +#else
> > +#include "hyp_constants.h"
> > +#endif
> >
> >  /*
> >   * Parameters from the trusted host:
> > @@ -23,6 +29,11 @@ struct kvm_hyp_iommu {
> >
> >         struct io_pgtable_params        *pgtable;
> >         bool                            power_is_off;
> > +#ifdef __KVM_NVHE_HYPERVISOR__
> > +       hyp_spinlock_t                  iommu_lock;
> > +#else
> > +       u8 unused[HYP_SPINLOCK_SIZE];
> > +#endif
> >  };
> >
> >  struct kvm_hyp_iommu_memcache {
> > diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> > index 1f4d5fcc1386..afaf173e65ed 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> > @@ -14,12 +14,6 @@
> >
> >  struct kvm_hyp_iommu_memcache __ro_after_init *kvm_hyp_iommu_memcaches;
> >
> > -/*
> > - * Serialize access to domains and IOMMU driver internal structures (command
> > - * queue, device tables)
> > - */
> > -static hyp_spinlock_t iommu_lock;
> > -
> >  #define domain_to_iopt(_iommu, _domain, _domain_id)            \
> >         (struct io_pgtable) {                                   \
> >                 .ops = &(_iommu)->pgtable->ops,                 \
> > @@ -93,10 +87,10 @@ int kvm_iommu_alloc_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
> >  {
> >         int ret = -EINVAL;
> >         struct io_pgtable iopt;
> > -       struct kvm_hyp_iommu *iommu;
> > +       struct kvm_hyp_iommu *iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
> >         struct kvm_hyp_iommu_domain *domain;
> >
> > -       hyp_spin_lock(&iommu_lock);
> > +       hyp_spin_lock(&iommu->iommu_lock);
> >         domain = handle_to_domain(iommu_id, domain_id, &iommu);
> >         if (!domain)
> >                 goto out_unlock;
> > @@ -112,7 +106,7 @@ int kvm_iommu_alloc_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
> >         domain->refs = 1;
> >         domain->pgd = iopt.pgd;
> >  out_unlock:
> > -       hyp_spin_unlock(&iommu_lock);
> > +       hyp_spin_unlock(&iommu->iommu_lock);
> >         return ret;
> >  }
> >
> > @@ -120,10 +114,10 @@ int kvm_iommu_free_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id)
> >  {
> >         int ret = -EINVAL;
> >         struct io_pgtable iopt;
> > -       struct kvm_hyp_iommu *iommu;
> > +       struct kvm_hyp_iommu *iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
> >         struct kvm_hyp_iommu_domain *domain;
> >
> > -       hyp_spin_lock(&iommu_lock);
> > +       hyp_spin_lock(&iommu->iommu_lock);
> >         domain = handle_to_domain(iommu_id, domain_id, &iommu);
> >         if (!domain)
> >                 goto out_unlock;
> > @@ -137,7 +131,7 @@ int kvm_iommu_free_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id)
> >         memset(domain, 0, sizeof(*domain));
> >
> >  out_unlock:
> > -       hyp_spin_unlock(&iommu_lock);
> > +       hyp_spin_unlock(&iommu->iommu_lock);
> >         return ret;
> >  }
> > --
> >
> > (I didn't include full patch as it is too long, but mainly the
> > rest is s/&iommu_lock/&iommu->iommu_lock)
> >
> > Please let me know what do you think?
> 
> It makes sense, I can append it to my tree or squash it (with your S-o-B)
> if you want to send the full patch
Sure, here is the full patch, based on the latest
https://jpbrucker.net/git/linux/log/?h=pkvm/smmu

Thanks,
Mostafa

--

From 5b43b97ca2486d52aa2ad475e4a34357dea57bca Mon Sep 17 00:00:00 2001
From: Mostafa Saleh <smostafa@google.com>
Date: Tue, 13 Jun 2023 14:59:37 +0000
Subject: [PATCH] KVM: arm: iommu: Use per-iommu lock instead of one global
 lock

Currently there is one big lock for all IOMMU operations, used to
protect per-IOMMU resources and domains (which are also per-IOMMU at
the moment).

This can be improved for setups with many IOMMUs by having a per-IOMMU
lock.

Since we now always look up the iommu before calling handle_to_domain,
we can remove the out_iommu argument and iommu_id, and pass the iommu
instead.

Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
 arch/arm64/kvm/hyp/hyp-constants.c     |  1 +
 arch/arm64/kvm/hyp/nvhe/iommu/iommu.c  | 75 +++++++++++++-------------
 drivers/iommu/arm/arm-smmu-v3/Makefile |  1 +
 include/kvm/iommu.h                    | 10 ++++
 4 files changed, 48 insertions(+), 39 deletions(-)

diff --git a/arch/arm64/kvm/hyp/hyp-constants.c b/arch/arm64/kvm/hyp/hyp-constants.c
index b257a3b4bfc5..d84cc8ec1e46 100644
--- a/arch/arm64/kvm/hyp/hyp-constants.c
+++ b/arch/arm64/kvm/hyp/hyp-constants.c
@@ -9,5 +9,6 @@ int main(void)
 	DEFINE(STRUCT_HYP_PAGE_SIZE,	sizeof(struct hyp_page));
 	DEFINE(PKVM_HYP_VM_SIZE,	sizeof(struct pkvm_hyp_vm));
 	DEFINE(PKVM_HYP_VCPU_SIZE,	sizeof(struct pkvm_hyp_vcpu));
+	DEFINE(HYP_SPINLOCK_SIZE,       sizeof(hyp_spinlock_t));
 	return 0;
 }
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index c70d3b6934eb..1dcba39d1d41 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -14,12 +14,6 @@

 struct kvm_hyp_iommu_memcache __ro_after_init *kvm_hyp_iommu_memcaches;

-/*
- * Serialize access to domains and IOMMU driver internal structures (command
- * queue, device tables)
- */
-static hyp_spinlock_t iommu_lock;
-
 #define domain_to_iopt(_iommu, _domain, _domain_id)		\
 	(struct io_pgtable) {					\
 		.ops = &(_iommu)->pgtable->ops,			\
@@ -59,14 +53,11 @@ void kvm_iommu_reclaim_page(void *p)
 }

 static struct kvm_hyp_iommu_domain *
-handle_to_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
-		 struct kvm_hyp_iommu **out_iommu)
+handle_to_domain(struct kvm_hyp_iommu *iommu, pkvm_handle_t domain_id)
 {
 	int idx;
-	struct kvm_hyp_iommu *iommu;
 	struct kvm_hyp_iommu_domain *domains;

-	iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
 	if (!iommu)
 		return NULL;

@@ -83,7 +74,6 @@ handle_to_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
 		iommu->domains[idx] = domains;
 	}

-	*out_iommu = iommu;
 	return &domains[domain_id & KVM_IOMMU_DOMAIN_ID_LEAF_MASK];
 }

@@ -95,8 +85,9 @@ int kvm_iommu_alloc_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
 	struct kvm_hyp_iommu *iommu;
 	struct kvm_hyp_iommu_domain *domain;

-	hyp_spin_lock(&iommu_lock);
-	domain = handle_to_domain(iommu_id, domain_id, &iommu);
+	iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
+	hyp_spin_lock(&iommu->iommu_lock);
+	domain = handle_to_domain(iommu, domain_id);
 	if (!domain)
 		goto out_unlock;

@@ -111,7 +102,7 @@ int kvm_iommu_alloc_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
 	domain->refs = 1;
 	domain->pgd = iopt.pgd;
 out_unlock:
-	hyp_spin_unlock(&iommu_lock);
+	hyp_spin_unlock(&iommu->iommu_lock);
 	return ret;
 }

@@ -122,8 +113,9 @@ int kvm_iommu_free_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id)
 	struct kvm_hyp_iommu *iommu;
 	struct kvm_hyp_iommu_domain *domain;

-	hyp_spin_lock(&iommu_lock);
-	domain = handle_to_domain(iommu_id, domain_id, &iommu);
+	iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
+	hyp_spin_lock(&iommu->iommu_lock);
+	domain = handle_to_domain(iommu, domain_id);
 	if (!domain)
 		goto out_unlock;

@@ -136,7 +128,7 @@ int kvm_iommu_free_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id)
 	memset(domain, 0, sizeof(*domain));

 out_unlock:
-	hyp_spin_unlock(&iommu_lock);
+	hyp_spin_unlock(&iommu->iommu_lock);
 	return ret;
 }

@@ -147,8 +139,9 @@ int kvm_iommu_attach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
 	struct kvm_hyp_iommu *iommu;
 	struct kvm_hyp_iommu_domain *domain;

-	hyp_spin_lock(&iommu_lock);
-	domain = handle_to_domain(iommu_id, domain_id, &iommu);
+	iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
+	hyp_spin_lock(&iommu->iommu_lock);
+	domain = handle_to_domain(iommu, domain_id);
 	if (!domain || !domain->refs || domain->refs == UINT_MAX)
 		goto out_unlock;

@@ -158,7 +151,7 @@ int kvm_iommu_attach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,

 	domain->refs++;
 out_unlock:
-	hyp_spin_unlock(&iommu_lock);
+	hyp_spin_unlock(&iommu->iommu_lock);
 	return ret;
 }

@@ -169,8 +162,9 @@ int kvm_iommu_detach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
 	struct kvm_hyp_iommu *iommu;
 	struct kvm_hyp_iommu_domain *domain;

-	hyp_spin_lock(&iommu_lock);
-	domain = handle_to_domain(iommu_id, domain_id, &iommu);
+	iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
+	hyp_spin_lock(&iommu->iommu_lock);
+	domain = handle_to_domain(iommu, domain_id);
 	if (!domain || domain->refs <= 1)
 		goto out_unlock;

@@ -180,7 +174,7 @@ int kvm_iommu_detach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,

 	domain->refs--;
 out_unlock:
-	hyp_spin_unlock(&iommu_lock);
+	hyp_spin_unlock(&iommu->iommu_lock);
 	return ret;
 }

@@ -209,9 +203,10 @@ int kvm_iommu_map_pages(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
 	    iova + size < iova || paddr + size < paddr)
 		return -EOVERFLOW;

-	hyp_spin_lock(&iommu_lock);
+	iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
+	hyp_spin_lock(&iommu->iommu_lock);

-	domain = handle_to_domain(iommu_id, domain_id, &iommu);
+	domain = handle_to_domain(iommu, domain_id);
 	if (!domain)
 		goto err_unlock;

@@ -236,7 +231,7 @@ int kvm_iommu_map_pages(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
 		paddr += mapped;
 	}

-	hyp_spin_unlock(&iommu_lock);
+	hyp_spin_unlock(&iommu->iommu_lock);
 	return 0;

 err_unmap:
@@ -245,7 +240,7 @@ int kvm_iommu_map_pages(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
 		iopt_unmap_pages(&iopt, iova_orig, pgsize, pgcount, NULL);
 	__pkvm_host_unshare_dma(paddr_orig, size);
 err_unlock:
-	hyp_spin_unlock(&iommu_lock);
+	hyp_spin_unlock(&iommu->iommu_lock);
 	return ret;
 }

@@ -269,8 +264,9 @@ size_t kvm_iommu_unmap_pages(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
 	    iova + size < iova)
 		return 0;

-	hyp_spin_lock(&iommu_lock);
-	domain = handle_to_domain(iommu_id, domain_id, &iommu);
+	iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
+	hyp_spin_lock(&iommu->iommu_lock);
+	domain = handle_to_domain(iommu, domain_id);
 	if (!domain)
 		goto out_unlock;

@@ -299,7 +295,7 @@ size_t kvm_iommu_unmap_pages(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
 	}

 out_unlock:
-	hyp_spin_unlock(&iommu_lock);
+	hyp_spin_unlock(&iommu->iommu_lock);
 	return total_unmapped;
 }

@@ -311,14 +307,15 @@ phys_addr_t kvm_iommu_iova_to_phys(pkvm_handle_t iommu_id,
 	struct kvm_hyp_iommu *iommu;
 	struct kvm_hyp_iommu_domain *domain;

-	hyp_spin_lock(&iommu_lock);
-	domain = handle_to_domain(iommu_id, domain_id, &iommu);
+	iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
+	hyp_spin_lock(&iommu->iommu_lock);
+	domain = handle_to_domain(iommu, domain_id);
 	if (domain) {
 		iopt = domain_to_iopt(iommu, domain, domain_id);

 		phys = iopt_iova_to_phys(&iopt, iova);
 	}
-	hyp_spin_unlock(&iommu_lock);
+	hyp_spin_unlock(&iommu->iommu_lock);
 	return phys;
 }

@@ -333,9 +330,9 @@ static int iommu_power_on(struct kvm_power_domain *pd)
 	 * We currently assume that the device retains its architectural state
 	 * across power off, hence no save/restore.
 	 */
-	hyp_spin_lock(&iommu_lock);
+	hyp_spin_lock(&iommu->iommu_lock);
 	iommu->power_is_off = false;
-	hyp_spin_unlock(&iommu_lock);
+	hyp_spin_unlock(&iommu->iommu_lock);
 	return 0;
 }

@@ -346,9 +343,9 @@ static int iommu_power_off(struct kvm_power_domain *pd)

 	pkvm_debug("%s\n", __func__);

-	hyp_spin_lock(&iommu_lock);
+	hyp_spin_lock(&iommu->iommu_lock);
 	iommu->power_is_off = true;
-	hyp_spin_unlock(&iommu_lock);
+	hyp_spin_unlock(&iommu->iommu_lock);
 	return 0;
 }

@@ -362,6 +359,8 @@ int kvm_iommu_init_device(struct kvm_hyp_iommu *iommu)
 	int ret;
 	void *domains;

+	hyp_spin_lock_init(&iommu->iommu_lock);
+
 	ret = pkvm_init_power_domain(&iommu->power_domain, &iommu_power_ops);
 	if (ret)
 		return ret;
@@ -376,8 +375,6 @@ int kvm_iommu_init(void)
 {
 	enum kvm_pgtable_prot prot;

-	hyp_spin_lock_init(&iommu_lock);
-
 	if (WARN_ON(!kvm_iommu_ops.get_iommu_by_id ||
 		    !kvm_iommu_ops.alloc_iopt ||
 		    !kvm_iommu_ops.free_iopt ||
diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
index a90b97d8bae3..cf9195e24a08 100644
--- a/drivers/iommu/arm/arm-smmu-v3/Makefile
+++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
@@ -6,6 +6,7 @@ arm_smmu_v3-objs-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
 arm_smmu_v3-objs := $(arm_smmu_v3-objs-y)

 obj-$(CONFIG_ARM_SMMU_V3_PKVM) += arm_smmu_v3_kvm.o
+ccflags-$(CONFIG_ARM_SMMU_V3_PKVM) += -Iarch/arm64/kvm/
 arm_smmu_v3_kvm-objs-y += arm-smmu-v3-kvm.o
 arm_smmu_v3_kvm-objs-y += arm-smmu-v3-common.o
 arm_smmu_v3_kvm-objs := $(arm_smmu_v3_kvm-objs-y)
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index ab888da731bc..b7cda7540156 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -5,6 +5,11 @@
 #include <asm/kvm_host.h>
 #include <kvm/power_domain.h>
 #include <linux/io-pgtable.h>
+#ifdef __KVM_NVHE_HYPERVISOR__
+#include <nvhe/spinlock.h>
+#else
+#include "hyp_constants.h"
+#endif

 /*
  * Parameters from the trusted host:
@@ -23,6 +28,11 @@ struct kvm_hyp_iommu {

 	struct io_pgtable_params	*pgtable;
 	bool				power_is_off;
+#ifdef __KVM_NVHE_HYPERVISOR__
+	hyp_spinlock_t			iommu_lock;
+#else
+	u8 unused[HYP_SPINLOCK_SIZE];
+#endif
 };

 struct kvm_hyp_iommu_memcache {
--
2.41.0.162.gfafddb0af9-goog

Patch

diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
index 4959c30977b8..76d3fa6ce331 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -2,8 +2,12 @@ 
 #ifndef __ARM64_KVM_NVHE_IOMMU_H__
 #define __ARM64_KVM_NVHE_IOMMU_H__
 
+#include <kvm/iommu.h>
+#include <linux/io-pgtable.h>
+
 #if IS_ENABLED(CONFIG_KVM_IOMMU)
 int kvm_iommu_init(void);
+int kvm_iommu_init_device(struct kvm_hyp_iommu *iommu);
 void *kvm_iommu_donate_page(void);
 void kvm_iommu_reclaim_page(void *p);
 
@@ -74,8 +78,20 @@  static inline phys_addr_t kvm_iommu_iova_to_phys(pkvm_handle_t iommu_id,
 }
 #endif /* CONFIG_KVM_IOMMU */
 
+struct kvm_iommu_tlb_cookie {
+	struct kvm_hyp_iommu	*iommu;
+	pkvm_handle_t		domain_id;
+};
+
 struct kvm_iommu_ops {
 	int (*init)(void);
+	struct kvm_hyp_iommu *(*get_iommu_by_id)(pkvm_handle_t smmu_id);
+	int (*alloc_iopt)(struct io_pgtable *iopt, unsigned long pgd_hva);
+	int (*free_iopt)(struct io_pgtable *iopt);
+	int (*attach_dev)(struct kvm_hyp_iommu *iommu, pkvm_handle_t domain_id,
+			  struct kvm_hyp_iommu_domain *domain, u32 endpoint_id);
+	int (*detach_dev)(struct kvm_hyp_iommu *iommu, pkvm_handle_t domain_id,
+			  struct kvm_hyp_iommu_domain *domain, u32 endpoint_id);
 };
 
 extern struct kvm_iommu_ops kvm_iommu_ops;
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 12b06a5df889..2bbe5f7bf726 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -3,6 +3,23 @@ 
 #define __KVM_IOMMU_H
 
 #include <asm/kvm_host.h>
+#include <linux/io-pgtable.h>
+
+/*
+ * Parameters from the trusted host:
+ * @pgtable_cfg:	page table configuration
+ * @domains:		root domain table
+ * @nr_domains:		max number of domains (exclusive)
+ *
+ * Other members are filled and used at runtime by the IOMMU driver.
+ */
+struct kvm_hyp_iommu {
+	struct io_pgtable_cfg		pgtable_cfg;
+	void				**domains;
+	size_t				nr_domains;
+
+	struct io_pgtable_params	*pgtable;
+};
 
 struct kvm_hyp_iommu_memcache {
 	struct kvm_hyp_memcache	pages;
@@ -12,4 +29,42 @@  struct kvm_hyp_iommu_memcache {
 extern struct kvm_hyp_iommu_memcache *kvm_nvhe_sym(kvm_hyp_iommu_memcaches);
 #define kvm_hyp_iommu_memcaches kvm_nvhe_sym(kvm_hyp_iommu_memcaches)
 
+struct kvm_hyp_iommu_domain {
+	void			*pgd;
+	u32			refs;
+};
+
+/*
+ * At the moment the number of domains is limited by the ASID and VMID size on
+ * Arm. With single-stage translation, that size is 2^8 or 2^16. On a lot of
+ * platforms the number of devices is actually the limiting factor and we'll
+ * only need a handful of domains, but with PASID or SR-IOV support that limit
+ * can be reached.
+ *
+ * In practice we're rarely going to need a lot of domains. To avoid allocating
+ * a large domain table, we use a two-level table, indexed by domain ID. With
+ * 4kB pages and 16-bytes domains, the leaf table contains 256 domains, and the
+ * root table 256 pointers. With 64kB pages, the leaf table contains 4096
+ * domains and the root table 16 pointers. In this case, or when using 8-bit
+ * VMIDs, it may be more advantageous to use a single level. But using two
+ * levels allows to easily extend the domain size.
+ */
+#define KVM_IOMMU_MAX_DOMAINS	(1 << 16)
+
+/* Number of entries in the level-2 domain table */
+#define KVM_IOMMU_DOMAINS_PER_PAGE \
+	(PAGE_SIZE / sizeof(struct kvm_hyp_iommu_domain))
+
+/* Number of entries in the root domain table */
+#define KVM_IOMMU_DOMAINS_ROOT_ENTRIES \
+	(KVM_IOMMU_MAX_DOMAINS / KVM_IOMMU_DOMAINS_PER_PAGE)
+
+#define KVM_IOMMU_DOMAINS_ROOT_SIZE \
+	(KVM_IOMMU_DOMAINS_ROOT_ENTRIES * sizeof(void *))
+
+/* Bits [15:split] index the root table, bits [split-1:0] index the leaf table */
+#define KVM_IOMMU_DOMAIN_ID_SPLIT	ilog2(KVM_IOMMU_DOMAINS_PER_PAGE)
+
+#define KVM_IOMMU_DOMAIN_ID_LEAF_MASK	((1 << KVM_IOMMU_DOMAIN_ID_SPLIT) - 1)
+
 #endif /* __KVM_IOMMU_H */
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index 1a9184fbbd27..7404ea77ed9f 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -13,6 +13,22 @@ 
 
 struct kvm_hyp_iommu_memcache __ro_after_init *kvm_hyp_iommu_memcaches;
 
+/*
+ * Serialize access to domains and IOMMU driver internal structures (command
+ * queue, device tables)
+ */
+static hyp_spinlock_t iommu_lock;
+
+#define domain_to_iopt(_iommu, _domain, _domain_id)		\
+	(struct io_pgtable) {					\
+		.ops = &(_iommu)->pgtable->ops,			\
+		.pgd = (_domain)->pgd,				\
+		.cookie = &(struct kvm_iommu_tlb_cookie) {	\
+			.iommu		= (_iommu),		\
+			.domain_id	= (_domain_id),		\
+		},						\
+	}
+
 void *kvm_iommu_donate_page(void)
 {
 	void *p;
@@ -41,10 +57,155 @@  void kvm_iommu_reclaim_page(void *p)
 				     PAGE_SIZE);
 }
 
+static struct kvm_hyp_iommu_domain *
+handle_to_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
+		 struct kvm_hyp_iommu **out_iommu)
+{
+	int idx;
+	struct kvm_hyp_iommu *iommu;
+	struct kvm_hyp_iommu_domain *domains;
+
+	iommu = kvm_iommu_ops.get_iommu_by_id(iommu_id);
+	if (!iommu)
+		return NULL;
+
+	if (domain_id >= iommu->nr_domains)
+		return NULL;
+	domain_id = array_index_nospec(domain_id, iommu->nr_domains);
+
+	idx = domain_id >> KVM_IOMMU_DOMAIN_ID_SPLIT;
+	domains = iommu->domains[idx];
+	if (!domains) {
+		domains = kvm_iommu_donate_page();
+		if (!domains)
+			return NULL;
+		iommu->domains[idx] = domains;
+	}
+
+	*out_iommu = iommu;
+	return &domains[domain_id & KVM_IOMMU_DOMAIN_ID_LEAF_MASK];
+}
+
+int kvm_iommu_alloc_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
+			   unsigned long pgd_hva)
+{
+	int ret = -EINVAL;
+	struct io_pgtable iopt;
+	struct kvm_hyp_iommu *iommu;
+	struct kvm_hyp_iommu_domain *domain;
+
+	hyp_spin_lock(&iommu_lock);
+	domain = handle_to_domain(iommu_id, domain_id, &iommu);
+	if (!domain)
+		goto out_unlock;
+
+	if (domain->refs)
+		goto out_unlock;
+
+	iopt = domain_to_iopt(iommu, domain, domain_id);
+	ret = kvm_iommu_ops.alloc_iopt(&iopt, pgd_hva);
+	if (ret)
+		goto out_unlock;
+
+	domain->refs = 1;
+	domain->pgd = iopt.pgd;
+out_unlock:
+	hyp_spin_unlock(&iommu_lock);
+	return ret;
+}
+
+int kvm_iommu_free_domain(pkvm_handle_t iommu_id, pkvm_handle_t domain_id)
+{
+	int ret = -EINVAL;
+	struct io_pgtable iopt;
+	struct kvm_hyp_iommu *iommu;
+	struct kvm_hyp_iommu_domain *domain;
+
+	hyp_spin_lock(&iommu_lock);
+	domain = handle_to_domain(iommu_id, domain_id, &iommu);
+	if (!domain)
+		goto out_unlock;
+
+	if (domain->refs != 1)
+		goto out_unlock;
+
+	iopt = domain_to_iopt(iommu, domain, domain_id);
+	ret = kvm_iommu_ops.free_iopt(&iopt);
+
+	memset(domain, 0, sizeof(*domain));
+
+out_unlock:
+	hyp_spin_unlock(&iommu_lock);
+	return ret;
+}
+
+int kvm_iommu_attach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
+			 u32 endpoint_id)
+{
+	int ret = -EINVAL;
+	struct kvm_hyp_iommu *iommu;
+	struct kvm_hyp_iommu_domain *domain;
+
+	hyp_spin_lock(&iommu_lock);
+	domain = handle_to_domain(iommu_id, domain_id, &iommu);
+	if (!domain || !domain->refs || domain->refs == UINT_MAX)
+		goto out_unlock;
+
+	ret = kvm_iommu_ops.attach_dev(iommu, domain_id, domain, endpoint_id);
+	if (ret)
+		goto out_unlock;
+
+	domain->refs++;
+out_unlock:
+	hyp_spin_unlock(&iommu_lock);
+	return ret;
+}
+
+int kvm_iommu_detach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
+			 u32 endpoint_id)
+{
+	int ret = -EINVAL;
+	struct kvm_hyp_iommu *iommu;
+	struct kvm_hyp_iommu_domain *domain;
+
+	hyp_spin_lock(&iommu_lock);
+	domain = handle_to_domain(iommu_id, domain_id, &iommu);
+	if (!domain || domain->refs <= 1)
+		goto out_unlock;
+
+	ret = kvm_iommu_ops.detach_dev(iommu, domain_id, domain, endpoint_id);
+	if (ret)
+		goto out_unlock;
+
+	domain->refs--;
+out_unlock:
+	hyp_spin_unlock(&iommu_lock);
+	return ret;
+}
+
+int kvm_iommu_init_device(struct kvm_hyp_iommu *iommu)
+{
+	void *domains;
+
+	domains = iommu->domains;
+	iommu->domains = kern_hyp_va(domains);
+	return pkvm_create_mappings(iommu->domains, iommu->domains +
+				    KVM_IOMMU_DOMAINS_ROOT_ENTRIES, PAGE_HYP);
+}
+
 int kvm_iommu_init(void)
 {
 	enum kvm_pgtable_prot prot;
 
+	hyp_spin_lock_init(&iommu_lock);
+
+	if (WARN_ON(!kvm_iommu_ops.get_iommu_by_id ||
+		    !kvm_iommu_ops.alloc_iopt ||
+		    !kvm_iommu_ops.free_iopt ||
+		    !kvm_iommu_ops.attach_dev ||
+		    !kvm_iommu_ops.detach_dev))
+		return -ENODEV;
+
 	/* The memcache is shared with the host */
 	prot = pkvm_mkstate(PAGE_HYP, PKVM_PAGE_SHARED_OWNED);
 	return pkvm_create_mappings(kvm_hyp_iommu_memcaches,