[v4,0/7] Use 1st-level for IOVA translation

Message ID: 20191219031634.15168-1-baolu.lu@linux.intel.com (mailing list archive)
Series: Use 1st-level for IOVA translation

Message

Baolu Lu Dec. 19, 2019, 3:16 a.m. UTC
Intel VT-d in scalable mode supports two types of page tables
for DMA translation: the first-level page table and the second-level
page table. The first-level page table uses the same format as the
CPU page table, while the second-level page table stays compatible
with the formats of previous hardware generations. Software can
choose either of them for DMA remapping according to the use case.

This patchset aims to move IOVA (I/O Virtual Address) translation
to the 1st-level page table in scalable mode. This will simplify the
vIOMMU (the IOMMU emulated by the VM hypervisor) design by using
two-stage translation, a.k.a. nested mode translation.

Because the Intel VT-d architecture offers caching mode, guest IOVA
(GIOVA) support is currently implemented by shadowing page tables.
The device emulation software, such as QEMU, has to figure out the
GIOVA->GPA mappings and write them to a shadow page table, which is
then used by the physical IOMMU. Caching mode requires the guest to
flush the IOTLB whenever a mapping is created or destroyed in the
vIOMMU, so every such change traps to the emulation software, which
can then shadow the GIOVA->GPA changes to the host.


     .-----------.
     |  vIOMMU   |
     |-----------|                 .--------------------.
     |           |IOTLB flush trap |        QEMU        |
     .-----------. (map/unmap)     |--------------------|
     |GIOVA->GPA |---------------->|    .------------.  |
     '-----------'                 |    | GIOVA->HPA |  |
     |           |                 |    '------------'  |
     '-----------'                 |                    |
                                   |                    |
                                   '--------------------'
                                                |
            <------------------------------------
            |
            v VFIO/IOMMU API
      .-----------.
      |  pIOMMU   |
      |-----------|
      |           |
      .-----------.
      |GIOVA->HPA |
      '-----------'
      |           |
      '-----------'
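
As a rough illustration of this shadowing flow, here is a minimal
sketch; the helper names are hypothetical, not the actual QEMU
interfaces:

    /* Hypothetical sketch of the shadowing flow shown above. */
    static void viommu_iotlb_flush_trap(struct viommu *v, u64 giova,
                                        u64 size)
    {
            u64 gpa;

            if (!guest_io_pgtable_walk(v, giova, &gpa))
                    /* Mapping created: shadow GIOVA->HPA to the pIOMMU. */
                    shadow_map(v->container, giova, gpa_to_hva(v, gpa),
                               size);
            else
                    /* Mapping destroyed: remove the shadow entry. */
                    shadow_unmap(v->container, giova, size);
    }

Every map and unmap in the guest takes this round trip through user
space, which is exactly what the nested approach below avoids.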

VT-d 3.0 introduces scalable mode, which offers two levels of
translation page tables and a nested translation mode. With it,
GIOVA support can be simplified by 1) moving GIOVA support over to
the 1st-level page table, which stores the GIOVA->GPA mapping in the
vIOMMU, 2) binding the vIOMMU 1st-level page table to the pIOMMU,
3) using the pIOMMU second level for GPA->HPA translation, and
4) enabling nested (a.k.a. dual-stage) translation in the host.
Compared with the current shadow GIOVA support, the new approach
makes the vIOMMU design simpler and more efficient, as we only need
to flush the pIOMMU IOTLB and possibly the device-IOTLB when an IOVA
mapping is torn down in the vIOMMU.

     .-----------.
     |  vIOMMU   |
     |-----------|                 .-----------.
     |           |IOTLB flush trap |   QEMU    |
     .-----------.    (unmap)      |-----------|
     |GIOVA->GPA |---------------->|           |
     '-----------'                 '-----------'
     |           |                       |
     '-----------'                       |
           <------------------------------
           |      VFIO/IOMMU
           |  cache invalidation and
           | guest PGD bind interfaces
           v
     .-----------.
     |  pIOMMU   |
     |-----------|
     .-----------.
     |GIOVA->GPA |<---First level
     '-----------'
     | GPA->HPA  |<---Second level
     '-----------'
     '-----------'
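
A minimal sketch of this simplified flow (interface names are again
hypothetical):

    /*
     * Hypothetical sketch: the guest's first-level table is bound to
     * the pIOMMU once; afterwards only invalidations are forwarded.
     */
    static void viommu_setup_nested(struct viommu_dev *vdev)
    {
            /* One time: bind the guest first level (GIOVA->GPA). */
            piommu_bind_guest_pgd(vdev->pasid, vdev->guest_flpt_pgd);
    }

    static void viommu_iotlb_unmap_trap(struct viommu_dev *vdev,
                                        u64 giova, u64 size)
    {
            /*
             * Per unmap: forward the invalidation only; the hardware
             * itself walks GIOVA->GPA (first level) and then GPA->HPA
             * (second level).
             */
            piommu_cache_invalidate(vdev->pasid, giova, size);
    }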

This patch series applies the first-level page table for IOVA
translation unless the DOMAIN_ATTR_NESTING domain attribute has been
set. Setting this attribute means that the second level will be used
to map gPA (guest physical address) to hPA (host physical address),
while the mappings between gVA (guest virtual address) and gPA are
maintained by the guest, with the guest page table address bound to
the host's first level.
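
A minimal sketch of that selection logic (the flag and helper names
below are assumptions, not necessarily the exact code in this
series):

    /* Sketch: use first level for IOVA unless nesting is requested. */
    static bool domain_use_first_level(struct dmar_domain *domain)
    {
            /*
             * With DOMAIN_ATTR_NESTING set, the second level maps
             * GPA->HPA and the guest-owned table is bound as the
             * first level instead.
             */
            return !(domain->flags & DOMAIN_FLAG_NESTING_MODE);
    }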

Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
Based-on-idea-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>

Change log:

v3->v4:
 - The previous version was posted here
   https://lkml.org/lkml/2019/12/10/2126
 - Set Execute Disable (bit 63) in first-level page table entries
   (a sketch follows this change log).
 - Enhance PASID-based IOTLB invalidation for both the default domain
   and auxiliary domains.
 - Add a debugfs file to expose page table internals.

v2->v3:
 - The previous version was posted here
   https://lkml.org/lkml/2019/11/27/1831
 - Accept Jacob's suggestion on merging two page tables.

v1->v2:
 - The first series was posted here
   https://lkml.org/lkml/2019/9/23/297
 - Use per-domain page table ops to handle different page tables.
 - Use first level for DMA remapping by default on both bare metal
   and VM guests.
 - Code refinements according to review comments for v1.
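
As for the Execute Disable item noted in v3->v4, a sketch of the idea
(names assumed, not necessarily the exact code): DMA remapping never
needs execute permission, so bit 63 is set in every first-level PTE.

    #define DMA_FL_PTE_XD   BIT_ULL(63)     /* assumed macro name */

    /* Sketch: compose a first-level PTE with Execute Disable set. */
    static u64 first_level_pteval(u64 paddr, u64 prot)
    {
            return (paddr & VTD_PAGE_MASK) | prot | DMA_FL_PTE_XD;
    }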

Lu Baolu (7):
  iommu/vt-d: Identify domains using first level page table
  iommu/vt-d: Add set domain DOMAIN_ATTR_NESTING attr
  iommu/vt-d: Add PASID_FLAG_FL5LP for first-level pasid setup
  iommu/vt-d: Setup pasid entries for iova over first level
  iommu/vt-d: Flush PASID-based iotlb for iova over first level
  iommu/vt-d: Use iova over first level
  iommu/vt-d: debugfs: Add support to show page table internals

 drivers/iommu/dmar.c                |  41 ++++++
 drivers/iommu/intel-iommu-debugfs.c |  75 +++++++++++
 drivers/iommu/intel-iommu.c         | 201 +++++++++++++++++++++++++---
 drivers/iommu/intel-pasid.c         |   7 +-
 drivers/iommu/intel-pasid.h         |   6 +
 drivers/iommu/intel-svm.c           |   8 +-
 include/linux/intel-iommu.h         |  20 ++-
 7 files changed, 326 insertions(+), 32 deletions(-)

Comments

Yi Liu Dec. 20, 2019, 11:50 a.m. UTC | #1
Hi Baolu,

In brief, this version looks pretty good to me. However, I still want
to raise the following checks to see if anything was missed. Hope it
helps.

1) Would using IOVA over FLPT be on by default?
My opinion is that before we have gIOVA nested translation done for
passthrough devices, we should keep this feature off.

2) domain->agaw is currently calculated from the capabilities related
to the second-level page table. As we are moving IOVA to FLPT, I'd
suggest calculating domain->agaw from the translation modes FLPT
supports (e.g. 4-level and 5-level).
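
As a rough illustration (assuming the existing cap_5lp_support()
capability check; the helper name is hypothetical):

    /* Sketch: size the domain from the paging modes the first level
     * supports (4-level/48-bit or 5-level/57-bit input addresses)
     * instead of from the second-level SAGAW capability. */
    static int domain_first_level_gaw(struct intel_iommu *iommu)
    {
            return cap_5lp_support(iommu->cap) ? 57 : 48;
    }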

3) Per the VT-d spec, FLPT places a canonical-address requirement on
the input addresses, so I'd suggest adding some enhancement for it.
Please refer to chapter 3.6 :-).

3.6 First-Level Translation
First-level translation restricts the input-address to a canonical address (i.e., address bits 63:N have
the same value as address bit [N-1], where N is 48-bits with 4-level paging and 57-bits with 5-level
paging). Requests subject to first-level translation by remapping hardware are subject to canonical
address checking as a pre-condition for first-level translation, and a violation is treated as a
translation-fault.
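
Something along these lines could serve as the check (hypothetical
helper; N is 48 or 57 depending on the paging mode):

    /* Sketch: bits 63:N must replicate bit N-1, i.e. the address is
     * sign-extended from bit N-1 (VT-d spec section 3.6). */
    static inline bool iova_canonical(u64 addr, int n)
    {
            return (s64)(addr << (64 - n)) >> (64 - n) == (s64)addr;
    }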

Regards,
Yi Liu

> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> Sent: Thursday, December 19, 2019 11:16 AM
> To: Joerg Roedel <joro@8bytes.org>; David Woodhouse <dwmw2@infradead.org>;
> Alex Williamson <alex.williamson@redhat.com>
> Subject: [PATCH v4 0/7] Use 1st-level for IOVA translation
> 
> [... full cover letter, change log, and diffstat quoted verbatim; trimmed ...]
Baolu Lu Dec. 21, 2019, 2:51 a.m. UTC | #2
Hi Yi,

Thanks for the comments.

On 12/20/19 7:50 PM, Liu, Yi L wrote:
> Hi Baolu,
> 
> In brief, this version looks pretty good to me. However, I still want
> to raise the following checks to see if anything was missed. Hope it
> helps.
> 
> 1) Would using IOVA over FLPT be on by default?
> My opinion is that before we have gIOVA nested translation done for
> passthrough devices, we should keep this feature off.

No worries.

IOVA over first level is a sub-feature of scalable mode. Currently,
scalable mode is off by default, and we won't switch it on until all
features are done.

> 
> 2) domain->agaw is currently calculated from the capabilities related
> to the second-level page table. As we are moving IOVA to FLPT, I'd
> suggest calculating domain->agaw from the translation modes FLPT
> supports (e.g. 4-level and 5-level).

We merged the first-level and second-level page tables, hence
domain->agaw should be selected to suit both. The only shortcoming of
this is that it can't support a second level that is limited to
3-level paging in scalable mode. But I don't think we are likely to
see such hardware.

> 
> 3) Per the VT-d spec, FLPT places a canonical-address requirement on
> the input addresses, so I'd suggest adding some enhancement for it.
> Please refer to chapter 3.6 :-).

Yes, good catch! We should manipulate the page table entries according
to this requirement.

> 
> 3.6 First-Level Translation
> First-level translation restricts the input-address to a canonical address (i.e., address bits 63:N have
> the same value as address bit [N-1], where N is 48-bits with 4-level paging and 57-bits with 5-level
> paging). Requests subject to first-level translation by remapping hardware are subject to canonical
> address checking as a pre-condition for first-level translation, and a violation is treated as a
> translation-fault.
> 
> Regards,
> Yi Liu

Best regards,
baolu
Baolu Lu Dec. 21, 2019, 3:14 a.m. UTC | #3
Hi again,

On 2019/12/20 19:50, Liu, Yi L wrote:
> 3) Per the VT-d spec, FLPT places a canonical-address requirement on
> the input addresses, so I'd suggest adding some enhancement for it.
> Please refer to chapter 3.6 :-).
> 
> 3.6 First-Level Translation
> First-level translation restricts the input-address to a canonical address (i.e., address bits 63:N have
> the same value as address bit [N-1], where N is 48-bits with 4-level paging and 57-bits with 5-level
> paging). Requests subject to first-level translation by remapping hardware are subject to canonical
> address checking as a pre-condition for first-level translation, and a violation is treated as a
> translation-fault.

There seems to be a conflict at bit 63: it should be the same as bit
[N-1] according to the canonical-address requirement, but it is also
used as the XD control. Any thoughts?

Best regards,
baolu
Baolu Lu Dec. 22, 2019, 7 a.m. UTC | #4
Hi Yi,

On 12/21/19 11:14 AM, Lu Baolu wrote:
> Hi again,
> 
> On 2019/12/20 19:50, Liu, Yi L wrote:
>> 3) Per the VT-d spec, FLPT places a canonical-address requirement on
>> the input addresses, so I'd suggest adding some enhancement for it.
>> Please refer to chapter 3.6 :-).
>>
>> 3.6 First-Level Translation
>> First-level translation restricts the input-address to a canonical 
>> address (i.e., address bits 63:N have
>> the same value as address bit [N-1], where N is 48-bits with 4-level 
>> paging and 57-bits with 5-level
>> paging). Requests subject to first-level translation by remapping 
>> hardware are subject to canonical
>> address checking as a pre-condition for first-level translation, and a 
>> violation is treated as a
>> translation-fault.
> 
> There seems to be a conflict at bit 63: it should be the same as bit
> [N-1] according to the canonical-address requirement, but it is also
> used as the XD control. Any thoughts?

Please ignore this; it makes no sense. :-) I confused the input
address, whose high bits are subject to the canonical check, with the
page table entry, whose bit 63 is the XD control.

Best regards,
baolu