[v14,00/13] SMMUv3 Nested Stage Setup (IOMMU part)

Message ID: 20210223205634.604221-1-eric.auger@redhat.com

Message

Eric Auger Feb. 23, 2021, 8:56 p.m. UTC
This series brings the IOMMU part of HW nested paging support
in the SMMUv3. The VFIO part is submitted separately.

This is based on Jean-Philippe's
[PATCH v12 00/10] iommu: I/O page faults for SMMUv3
https://lore.kernel.org/linux-arm-kernel/YBfij71tyYvh8LhB@myrica/T/

The IOMMU API is extended with two new functionalities:
1) pass the guest stage 1 configuration
2) pass stage 1 MSI bindings

Those capabilities are then implemented in the SMMUv3 driver.

The virtualizer passes information through the VFIO user API,
which cascades it to the IOMMU subsystem. This allows the guest
to own the stage 1 tables and context descriptors (the so-called
PASID table) while the host owns the stage 2 tables and the main
configuration structures (STE).
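
Roughly, the two new entry points have the following shape (this is
a sketch only; the exact signatures and uapi structures are in
patches 1 and 2):

/* Pass/withdraw the guest stage 1 configuration (the PASID table) */
int iommu_attach_pasid_table(struct iommu_domain *domain,
			     struct iommu_pasid_table_config *cfg);
void iommu_detach_pasid_table(struct iommu_domain *domain);

/* Bind/unbind a guest MSI IOVA (gIOVA) to a host physical doorbell */
int iommu_bind_guest_msi(struct iommu_domain *domain,
			 dma_addr_t giova, phys_addr_t gpa, size_t size);
void iommu_unbind_guest_msi(struct iommu_domain *domain,
			    dma_addr_t giova);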

Best Regards

Eric

This series can be found at:
https://github.com/eauger/linux/tree/v5.11-stallv12-2stage-v14
(including the VFIO part in its last version: v12)

The VFIO series is sent separately.

History:

Previous version (v13):
https://github.com/eauger/linux/tree/5.10-rc4-2stage-v13

v13 -> v14:
- Took into account all received comments, I think. Many
  thanks to all the testers for their effort and, in some cases,
  their fixes. I am really grateful to you!
- Numerous fixes, including for guests running with
  noiommu, iommu.strict=0, iommu.passthrough=on, and
  enable_unsafe_noiommu_mode

v12 -> v13:
- Fixed a compilation issue with CONFIG_ARM_SMMU_V3_SVA
  reported by Shameer. This urged me to rework patch 4,
  "iommu/smmuv3: Allow s1 and s2 configs to coexist", so that
  s1_cfg and s2_cfg are no longer dynamically allocated.
  Instead I use a new "set" field in the existing structs
- Fixed two other config checks
- Updated "iommu/arm-smmu-v3: Maintain a SID->device structure"
  according to the latest version

v11 -> v12:
- rebase on top of v5.10-rc4

Eric Auger (13):
  iommu: Introduce attach/detach_pasid_table API
  iommu: Introduce bind/unbind_guest_msi
  iommu/smmuv3: Allow s1 and s2 configs to coexist
  iommu/smmuv3: Get prepared for nested stage support
  iommu/smmuv3: Implement attach/detach_pasid_table
  iommu/smmuv3: Allow stage 1 invalidation with unmanaged ASIDs
  iommu/smmuv3: Implement cache_invalidate
  dma-iommu: Implement NESTED_MSI cookie
  iommu/smmuv3: Nested mode single MSI doorbell per domain enforcement
  iommu/smmuv3: Enforce incompatibility between nested mode and HW MSI
    regions
  iommu/smmuv3: Implement bind/unbind_guest_msi
  iommu/smmuv3: report additional recoverable faults
  iommu/smmuv3: Accept configs with more than one context descriptor

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 444 ++++++++++++++++++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 +-
 drivers/iommu/dma-iommu.c                   | 142 ++++++-
 drivers/iommu/iommu.c                       | 106 +++++
 include/linux/dma-iommu.h                   |  16 +
 include/linux/iommu.h                       |  47 +++
 include/uapi/linux/iommu.h                  |  54 +++
 7 files changed, 781 insertions(+), 42 deletions(-)

Comments

Eric Auger Feb. 25, 2021, 4:06 p.m. UTC | #1
Hi Shameer, all

On 2/23/21 9:56 PM, Eric Auger wrote:
> This series brings the IOMMU part of HW nested paging support
> in the SMMUv3. The VFIO part is submitted separately.
> 
> This is based on Jean-Philippe's
> [PATCH v12 00/10] iommu: I/O page faults for SMMUv3
> https://lore.kernel.org/linux-arm-kernel/YBfij71tyYvh8LhB@myrica/T/
> 
> The IOMMU API is extended with two new functionalities:
> 1) pass the guest stage 1 configuration
> 2) pass stage 1 MSI bindings
> 
> Those capabilities are then implemented in the SMMUv3 driver.
> 
> The virtualizer passes information through the VFIO user API,
> which cascades it to the IOMMU subsystem. This allows the guest
> to own the stage 1 tables and context descriptors (the so-called
> PASID table) while the host owns the stage 2 tables and the main
> configuration structures (STE).
> 
> Best Regards
> 
> Eric
> 
> This series can be found at:
> https://github.com/eauger/linux/tree/v5.11-stallv12-2stage-v14
> (including the VFIO part in its last version: v12)

As promised, I have rebased the iommu + vfio parts on top of Jean's
sva/current (5.11-rc4):

https://github.com/eauger/linux/tree/jean_sva_current_2stage_v14

I have not tested the SVA bits, but I have verified there is no
regression from my point of view.

From the QEMU perspective it works off the shelf with that branch,
but if you want to use other SVA-related ioctls, please remember to
update the Linux headers.

Again thank you to all of you who reviewed and tested the previous version.

Thanks

Eric
Krishna Reddy March 18, 2021, 12:16 a.m. UTC | #2
Tested-by: Krishna Reddy <vdumpa@nvidia.com>

Validated nested translations with an NVMe PCI device assigned to a
guest VM. Tested with both v12 and v13 of Jean-Philippe's patches as
the base.

> This is based on Jean-Philippe's
> [PATCH v12 00/10] iommu: I/O page faults for SMMUv3
> https://lore.kernel.org/linux-arm-kernel/YBfij71tyYvh8LhB@myrica/T/

With Jean-Philippe's V13, Patch 12 of this series has a conflict that had to be resolved manually.

-KR
Eric Auger March 19, 2021, 1:17 p.m. UTC | #3
Hi Krishna,

On 3/18/21 1:16 AM, Krishna Reddy wrote:
> Tested-by: Krishna Reddy <vdumpa@nvidia.com>
> 
> Validated nested translations with an NVMe PCI device assigned to a
> guest VM. Tested with both v12 and v13 of Jean-Philippe's patches as
> the base.

Many thanks for that.
> 
>> This is based on Jean-Philippe's
>> [PATCH v12 00/10] iommu: I/O page faults for SMMUv3
>> https://lore.kernel.org/linux-arm-kernel/YBfij71tyYvh8LhB@myrica/T/
> 
> With Jean-Philippe's V13, Patch 12 of this series has a conflict that had to be resolved manually.

Yep I will respin accordingly.

Best Regards

Eric
Sumit Gupta April 22, 2021, 3:04 p.m. UTC | #4
Hi Eric,
I have validated v14 of the patch series from the branch
"jean_sva_current_2stage_v14" and verified nested translations with an
NVMe PCI device assigned to a QEMU 5.2 guest.
I had to revert the patch "mm: notify remote TLBs when dirtying a PTE".

Tested-by: Sumit Gupta <sumitg@nvidia.com>
Jean-Philippe Brucker April 23, 2021, 1 p.m. UTC | #5
Hi Sumit,

On Thu, Apr 22, 2021 at 08:34:38PM +0530, Sumit Gupta wrote:
> I had to revert the patch "mm: notify remote TLBs when dirtying a PTE".

Did that patch cause any issue, or is it just not needed on your system?
It fixes a hypothetical problem with the way ATS is implemented. Maybe I
actually observed it on an old software model, I don't remember. Either
way it's unlikely to go upstream, but I'd like to know if I should drop
it from my tree.

Thanks,
Jean
Sumit Gupta April 23, 2021, 5:16 p.m. UTC | #6
Hi Jean,


> Hi Sumit,
> 
> On Thu, Apr 22, 2021 at 08:34:38PM +0530, Sumit Gupta wrote:
>> I had to revert the patch "mm: notify remote TLBs when dirtying a PTE".
> 
> Did that patch cause any issue, or is it just not needed on your system?
> It fixes a hypothetical problem with the way ATS is implemented. Maybe I
> actually observed it on an old software model, I don't remember. Either
> way it's unlikely to go upstream, but I'd like to know if I should drop
> it from my tree.

I tried the nested SMMUv3 v15 patches (Eric's branch:
v5.12-rc6-jean-iopf-14-2stage-v15) on top of your current SVA patches
with kernel 5.12.0-rc8.
I had to revert the same patch, "mm: notify remote TLBs when dirtying a
PTE", to avoid the crash below [1]. I am not sure about the cause yet.
After reverting the patch, the crash was gone and nested translations
worked.
[1]
[   11.730943] arm-smmu-v3 9050000.smmuv3: ias 44-bit, oas 44-bit (features 0x00008305)
[   11.833791] arm-smmu-v3 9050000.smmuv3: allocated 524288 entries for cmdq
[   11.979456] arm-smmu-v3 9050000.smmuv3: allocated 524288 entries for evtq
[   12.048895] cacheinfo: Unable to detect cache hierarchy for CPU 0
[   12.234175] loop: module loaded
[   12.279552] megasas: 07.714.04.00-rc1
[   12.408831] nvme 0000:00:02.0: Adding to iommu group 0
[   12.488063] nvme nvme0: pci function 0000:00:02.0
[   12.525887] nvme 0000:00:02.0: enabling device (0000 -> 0002)
[   12.612159] physmap-flash 0.flash: physmap platform flash device: [mem 0x00000000-0x03ffffff]
[ 1721.586943] Unable to handle kernel paging request at virtual address ffff617f80000000
[ 1721.587263] Mem abort info:
[ 1721.587776]   ESR = 0x96000145
[ 1721.587968]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 1721.588416]   SET = 0, FnV = 0
[ 1721.588672]   EA = 0, S1PTW = 0
[ 1721.588863] Data abort info:
[ 1721.589120]   ISV = 0, ISS = 0x00000145
[ 1721.589311]   CM = 1, WnR = 1
[ 1721.589568] swapper pgtable: 64k pages, 48-bit VAs, pgdp=0000000111280000
[ 1721.589951] [ffff617f80000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 1721.590592] Internal error: Oops: 96000145 [#1] PREEMPT SMP
[ 1721.590912] Modules linked in:
[ 1721.591232] CPU: 0 PID: 664 Comm: qemu-system-aar Not tainted 5.12.0-rc8-tegra-229886-g4786d4a20d7 #22
[ 1721.591680] pstate: a0400005 (NzCv daif +PAN -UAO -TCO BTYPE=--)
[ 1721.592128] pc : __flush_dcache_area+0x20/0x38
[ 1721.592511] lr : kvm_set_spte_hva+0x64/0xc8
[ 1721.592832] sp : ffff8000145cfc30
[ 1721.593087] x29: ffff8000145cfc30 x28: ffff000095221c80
[ 1721.593599] x27: 0000000000000002 x26: ffff0000a3711c88
[ 1721.594112] x25: ffff00009333a740 x24: 01e8618000000f53
[ 1721.594624] x23: 0000ffffb8320000 x22: 0000000000000001
[ 1721.595136] x21: 0000ffffb8320000 x20: ffff0000a1268000
[ 1721.595647] x19: ffff800011c95000 x18: 0000000000000000
[ 1721.596160] x17: 0000000000000000 x16: 0000000000000000
[ 1721.596608] x15: 0000000000000000 x14: 0000000000000000
[ 1721.597120] x13: 0000000000000000 x12: 0000000000000000
[ 1721.597568] x11: 0000000000000000 x10: 0000000000000000
[ 1721.598080] x9 : 0000000000000000 x8 : ffff00009333a740
[ 1721.598592] x7 : 07fd000ffffb8320 x6 : ffff0000815bc190
[ 1721.599104] x5 : 0000000000011b06 x4 : 0000000000000000
[ 1721.599552] x3 : 000000000000003f x2 : 0000000000000040
[ 1721.600064] x1 : ffff617f80010000 x0 : ffff617f80000000
[ 1721.600576] Call trace:
[ 1721.600768]  __flush_dcache_area+0x20/0x38
[ 1721.601216]  kvm_mmu_notifier_change_pte+0x5c/0xa8
[ 1721.601601]  __mmu_notifier_change_pte+0x60/0xa0
[ 1721.601985]  __handle_mm_fault+0x740/0xde8
[ 1721.602367]  handle_mm_fault+0xe8/0x238
[ 1721.602751]  do_page_fault+0x160/0x3a8
[ 1721.603200]  do_mem_abort+0x40/0xb0
[ 1721.603520]  el0_da+0x20/0x30
[ 1721.603967]  el0_sync_handler+0x68/0xd0
[ 1721.604416]  el0_sync+0x154/0x180
[ 1721.604864] Code: 9ac32042 8b010001 d1000443 8a230000 (d50b7e20)
[ 1721.605184] ---[ end trace 7678eb97889b6fbd ]---
[ 1721.605504] Kernel panic - not syncing: Oops: Fatal exception
[ 1721.605824] Kernel Offset: disabled
[ 1721.606016] CPU features: 0x00340216,6280a018
[ 1721.606335] Memory Limit: 2909 MB
[ 1721.606656] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
Krishna Reddy April 23, 2021, 5:58 p.m. UTC | #7
>> Did that patch cause any issue, or is it just not needed on your system?
>> It fixes a hypothetical problem with the way ATS is implemented.
>> Maybe I actually observed it on an old software model, I don't
>> remember. Either way it's unlikely to go upstream, but I'd like to know
>> if I should drop it from my tree.

> I had to revert the same patch, "mm: notify remote TLBs when dirtying a
> PTE", to avoid the crash below [1]. I am not sure about the cause yet.

I noticed this issue earlier with the patch pointed to here and root
caused it as below. It happens after the vfio_mmap request from QEMU for
the PCIe device, during the access of the VA when the PTE access flags
are updated.

kvm_mmu_notifier_change_pte() --> kvm_set_spte_hva() --> clean_dcache_guest_page()

The validation model doesn't have the FWB capability.
__clean_dcache_guest_page() attempts a dcache flush on the PCIe BAR
address (for which pfn_valid() is false) through page_address(), which
has no page table mapping and leads to the exception.

I worked around the issue by filtering out the request in
__clean_dcache_guest_page() when the pfn is not valid (sketched below).
As the patch wasn't posted in the community, I reverted it as well.
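
A minimal sketch of that filter, assuming the arm64 helper shape of
that era (the actual hunk was never posted):

static inline void __clean_dcache_guest_page(kvm_pfn_t pfn,
					     unsigned long size)
{
	void *va;

	/*
	 * Device memory such as a PCIe BAR has no struct page and no
	 * linear mapping, so page_address() is meaningless there:
	 * skip the cache maintenance for such pfns.
	 */
	if (!pfn_valid(pfn))
		return;

	va = page_address(pfn_to_page(pfn));
	kvm_flush_dcache_to_poc(va, size);
}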

-KR
Sumit Gupta April 23, 2021, 6:19 p.m. UTC | #8
>>> Did that patch cause any issue, or is it just not needed on your system?
>>> It fixes a hypothetical problem with the way ATS is implemented.
>>> Maybe I actually observed it on an old software model, I don't
>>> remember. Either way it's unlikely to go upstream, but I'd like to know
>>> if I should drop it from my tree.
> 
>> I had to revert the same patch, "mm: notify remote TLBs when dirtying a
>> PTE", to avoid the crash below [1]. I am not sure about the cause yet.
> 
> I noticed this issue earlier with the patch pointed to here and root
> caused it as below. It happens after the vfio_mmap request from QEMU for
> the PCIe device, during the access of the VA when the PTE access flags
> are updated.
> 
> kvm_mmu_notifier_change_pte() --> kvm_set_spte_hva() --> clean_dcache_guest_page()
> 
> The validation model doesn't have the FWB capability.
> __clean_dcache_guest_page() attempts a dcache flush on the PCIe BAR
> address (for which pfn_valid() is false) through page_address(), which
> has no page table mapping and leads to the exception.
> 
> I worked around the issue by filtering out the request in
> __clean_dcache_guest_page() when the pfn is not valid. As the patch
> wasn't posted in the community, I reverted it as well.

Thank you Krishna for sharing the analysis.

Best Regards,
Sumit Gupta
Marc Zyngier April 24, 2021, 9:06 a.m. UTC | #9
On Fri, 23 Apr 2021 18:58:23 +0100,
Krishna Reddy <vdumpa@nvidia.com> wrote:
> 
> >> Did that patch cause any issue, or is it just not needed on your system?
> >> It fixes a hypothetical problem with the way ATS is implemented.
> >> Maybe I actually observed it on an old software model, I don't
> >> remember. Either way it's unlikely to go upstream, but I'd like to know
> >> if I should drop it from my tree.
> 
> > I had to revert the same patch, "mm: notify remote TLBs when dirtying a
> > PTE", to avoid the crash below [1]. I am not sure about the cause yet.
> 
> I noticed this issue earlier with the patch pointed to here and root
> caused it as below. It happens after the vfio_mmap request from QEMU
> for the PCIe device, during the access of the VA when the PTE access
> flags are updated.
> 
> kvm_mmu_notifier_change_pte() --> kvm_set_spte_hva() -->
> clean_dcache_guest_page()
> 
> The validation model doesn't have the FWB capability.
> __clean_dcache_guest_page() attempts a dcache flush on the PCIe BAR
> address (for which pfn_valid() is false) through page_address(),
> which has no page table mapping and leads to the exception.
> 
> I worked around the issue by filtering out the request in
> __clean_dcache_guest_page() when the pfn is not valid. As the patch
> wasn't posted in the community, I reverted it as well.

That's papering over the real issue, and this mapping path needs
fixing as it was only ever expected to be called for CoW.

Can you please try the following patch and let me know if that fixes
the issue for good?

Thanks,

	M.

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..b62dd40a4083 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1147,7 +1147,8 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 	 * We've moved a page around, probably through CoW, so let's treat it
 	 * just like a translation fault and clean the cache to the PoC.
 	 */
-	clean_dcache_guest_page(pfn, PAGE_SIZE);
+	if (!kvm_is_device_pfn(pfn))
+		clean_dcache_guest_page(pfn, PAGE_SIZE);
 	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pfn);
 	return 0;
 }
Sumit Gupta April 24, 2021, 11:29 a.m. UTC | #10
>> I worked around the issue by filtering out the request in
>> __clean_dcache_guest_page() when the pfn is not valid. As the patch
>> wasn't posted in the community, I reverted it as well.
> 
> That's papering over the real issue, and this mapping path needs
> fixing as it was only ever expected to be called for CoW.
> 
> Can you please try the following patch and let me know if that fixes
> the issue for good?
> 

Hi Marc,

Thank you for the patch. This patch fixed the crash for me.
For the formal patch, please add:

Tested-by: Sumit Gupta <sumitg@nvidia.com>

> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 77cb2d28f2a4..b62dd40a4083 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1147,7 +1147,8 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
>           * We've moved a page around, probably through CoW, so let's treat it
>           * just like a translation fault and clean the cache to the PoC.
>           */
> -       clean_dcache_guest_page(pfn, PAGE_SIZE);
> +       if (!kvm_is_device_pfn(pfn))
> +               clean_dcache_guest_page(pfn, PAGE_SIZE);
>          handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pfn);
>          return 0;
>   }