[RFC,0/4] Use 1st-level for DMA remapping in guest

Message ID 20190923122454.9888-1-baolu.lu@linux.intel.com

Message

Baolu Lu Sept. 23, 2019, 12:24 p.m. UTC
This patchset aims to move IOVA (I/O Virtual Address) translation
to the 1st-level page table under scalable mode. The major purpose of
this effort is to make guest IOVA support more efficient.

Because the Intel VT-d architecture offers caching mode, guest IOVA
(GIOVA) support is currently implemented with shadow page tables. The
device simulation software, such as QEMU, has to figure out the
GIOVA->GPA mappings and write them into a shadow page table, which is
then used by the pIOMMU. Every time a mapping is created or destroyed
in the vIOMMU, the simulation software intervenes: the GIOVA->GPA
change is shadowed to the host, and the pIOMMU is updated through the
VFIO/IOMMU interfaces.


     .-----------.
     |  vIOMMU   |
     |-----------|                 .--------------------.
     |           |IOTLB flush trap |        QEMU        |
     .-----------. (map/unmap)     |--------------------|
     | GVA->GPA  |---------------->|      .----------.  |
     '-----------'                 |      | GPA->HPA |  |
     |           |                 |      '----------'  |
     '-----------'                 |                    |
                                   |                    |
                                   '--------------------'
                                                |
            <------------------------------------
            |
            v VFIO/IOMMU API
      .-----------.
      |  pIOMMU   |
      |-----------|
      |           |
      .-----------.
      | GVA->HPA  |
      '-----------'
      |           |
      '-----------'
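
To make the figure concrete, a minimal sketch of the map/unmap replay
performed by the simulation software is shown below. This is
illustrative userspace C only (not QEMU code); the GIOVA->GPA and
GPA->HVA lookups are assumed to be done by the caller, and only the
VFIO type1 ioctls actually update the shadow page table used by the
pIOMMU.

/* Illustrative only: replay a guest map/unmap trap into the pIOMMU
 * through the standard VFIO type1 interface (<linux/vfio.h>).
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int shadow_map(int container_fd, uint64_t giova, void *hva,
                      uint64_t size)
{
        struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = (uint64_t)(uintptr_t)hva, /* host VA backing the GPA */
                .iova  = giova,                    /* GIOVA reused as host IOVA */
                .size  = size,
        };

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

static int shadow_unmap(int container_fd, uint64_t giova, uint64_t size)
{
        struct vfio_iommu_type1_dma_unmap unmap = {
                .argsz = sizeof(unmap),
                .iova  = giova,
                .size  = size,
        };

        return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}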

VT-d 3.0 introduces scalable mode, which offers two levels of
translation page tables and a nested translation mode. With regard to
GIOVA support, this can be simplified by 1) moving the GIOVA support
onto the 1st-level page table to store the GIOVA->GPA mapping in the
vIOMMU, 2) binding the vIOMMU 1st-level page table to the pIOMMU,
3) using the pIOMMU second level for GPA->HPA translation, and
4) enabling nested (a.k.a. dual-stage) translation in the host.
Compared with the current shadow GIOVA support, the new approach is
more secure and simplifies the software, as we only need to flush the
pIOMMU IOTLB and possibly the device-IOTLB when an IOVA mapping in
the vIOMMU is torn down.

     .-----------.
     |  vIOMMU   |
     |-----------|                 .-----------.
     |           |IOTLB flush trap |   QEMU    |
     .-----------.    (unmap)      |-----------|
     | GVA->GPA  |---------------->|           |
     '-----------'                 '-----------'
     |           |                       |
     '-----------'                       |
           <------------------------------
           |      VFIO/IOMMU          
           |  cache invalidation and  
            | guest pgd bind interfaces
           v
     .-----------.
     |  pIOMMU   |
     |-----------|
     .-----------.
     | GVA->GPA  |<---First level
     '-----------'
     | GPA->HPA  |<---Second level
     '-----------'
     '-----------'
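
For step 1), which is what this series implements in the host driver,
the first-level table simply reuses the IA-32e paging format. As a
rough, self-contained model (plain userspace C, not the actual code in
this series; cache-attribute bits, flushing and error handling are all
omitted), installing a 4KiB GIOVA->GPA mapping looks like:

/* Userspace model of a first-level (IA-32e format) page-table insert.
 * In this model the "address" stored in a non-leaf entry is simply the
 * pointer of the lower-level table; real tables store physical
 * addresses and extra attribute bits.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FL_PRESENT   (1ULL << 0)
#define FL_WRITABLE  (1ULL << 1)
#define FL_ADDR_MASK 0x000ffffffffff000ULL

static uint64_t *fl_next_table(uint64_t *entry)
{
        if (!(*entry & FL_PRESENT)) {
                uint64_t *table = aligned_alloc(4096, 4096);

                memset(table, 0, 4096);
                *entry = ((uint64_t)(uintptr_t)table & FL_ADDR_MASK) |
                         FL_PRESENT | FL_WRITABLE;
        }

        return (uint64_t *)(uintptr_t)(*entry & FL_ADDR_MASK);
}

/* Map one 4KiB page, giova -> gpa, into a 4-level first-level table. */
static void fl_map_page(uint64_t *root, uint64_t giova, uint64_t gpa)
{
        uint64_t *l3 = fl_next_table(&root[(giova >> 39) & 0x1ff]);
        uint64_t *l2 = fl_next_table(&l3[(giova >> 30) & 0x1ff]);
        uint64_t *l1 = fl_next_table(&l2[(giova >> 21) & 0x1ff]);

        l1[(giova >> 12) & 0x1ff] = (gpa & FL_ADDR_MASK) |
                                    FL_PRESENT | FL_WRITABLE;
}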

This patch series only aims to achieve the first goal, i.e. using
first-level translation for IOVA mappings in the vIOMMU. I am sending
it out for your comments. Any comments, suggestions and concerns are
welcome.

Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>
Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>

Lu Baolu (4):
  iommu/vt-d: Move domain_flush_cache helper into header
  iommu/vt-d: Add first level page table interfaces
  iommu/vt-d: Map/unmap domain with mmmap/mmunmap
  iommu/vt-d: Identify domains using first level page table

 drivers/iommu/Makefile             |   2 +-
 drivers/iommu/intel-iommu.c        | 142 ++++++++++--
 drivers/iommu/intel-pgtable.c      | 342 +++++++++++++++++++++++++++++
 include/linux/intel-iommu.h        |  31 ++-
 include/trace/events/intel_iommu.h |  60 +++++
 5 files changed, 553 insertions(+), 24 deletions(-)
 create mode 100644 drivers/iommu/intel-pgtable.c

Comments

Jacob Pan Sept. 23, 2019, 7:27 p.m. UTC | #1
Hi Baolu,

On Mon, 23 Sep 2019 20:24:50 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> This patchset aims to move IOVA (I/O Virtual Address) translation
> to 1st-level page table under scalable mode. The major purpose of
> this effort is to make guest IOVA support more efficient.
> 
> As Intel VT-d architecture offers caching-mode, guest IOVA (GIOVA)
> support is now implemented in a shadow page manner. The device
> simulation software, like QEMU, has to figure out GIOVA->GPA mapping
> and writes to a shadowed page table, which will be used by pIOMMU.
> Each time when mappings are created or destroyed in vIOMMU, the
> simulation software will intervene. The change on GIOVA->GPA will be
> shadowed to host, and the pIOMMU will be updated via VFIO/IOMMU
> interfaces.
> 
> 
>      .-----------.
>      |  vIOMMU   |
>      |-----------|                 .--------------------.
>      |           |IOTLB flush trap |        QEMU        |
>      .-----------. (map/unmap)     |--------------------|
>      | GVA->GPA  |---------------->|      .----------.  |
>      '-----------'                 |      | GPA->HPA |  |
>      |           |                 |      '----------'  |
>      '-----------'                 |                    |
>                                    |                    |
>                                    '--------------------'
>                                                 |
>             <------------------------------------
>             |
>             v VFIO/IOMMU API
>       .-----------.
>       |  pIOMMU   |
>       |-----------|
>       |           |
>       .-----------.
>       | GVA->HPA  |
>       '-----------'
>       |           |
>       '-----------'
> 
> In VT-d 3.0, scalable mode is introduced, which offers two level
> translation page tables and nested translation mode. Regards to
> GIOVA support, it can be simplified by 1) moving the GIOVA support
> over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
> 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using pIOMMU
> second level for GPA->HPA translation, and 4) enable nested (a.k.a.
> dual stage) translation in host. Compared with current shadow GIOVA
> support, the new approach is more secure and software is simplified
> as we only need to flush the pIOMMU IOTLB and possible device-IOTLB
> when an IOVA mapping in vIOMMU is torn down.
> 
>      .-----------.
>      |  vIOMMU   |
>      |-----------|                 .-----------.
>      |           |IOTLB flush trap |   QEMU    |
>      .-----------.    (unmap)      |-----------|
>      | GVA->GPA  |---------------->|           |
>      '-----------'                 '-----------'
>      |           |                       |
>      '-----------'                       |
>            <------------------------------
>            |      VFIO/IOMMU          
>            |  cache invalidation and  
>            | guest gpd bind interfaces
>            v
For vSVA, the guest PGD bind interface will mark the PASID as a guest
PASID and will inject page requests into the guest. In the FL gIOVA
case, I guess we are assuming there is no page fault for GIOVA. I will
need to add a flag in the gpgd bind such that any PRS will be
auto-responded with an invalid response.

Also, is native use of the IOVA FL map not to be supported? i.e. will
the IOMMU API and DMA API for native usage continue to be SL only?
>      .-----------.
>      |  pIOMMU   |
>      |-----------|
>      .-----------.
>      | GVA->GPA  |<---First level
>      '-----------'
>      | GPA->HPA  |<---Scond level
>      '-----------'
>      '-----------'
> 
> This patch series only aims to achieve the first goal, a.k.a using
> first level translation for IOVA mappings in vIOMMU. I am sending
> it out for your comments. Any comments, suggestions and concerns are
> welcomed.
> 


> Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
> Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
> Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
> Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>
> Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
> 
> Lu Baolu (4):
>   iommu/vt-d: Move domain_flush_cache helper into header
>   iommu/vt-d: Add first level page table interfaces
>   iommu/vt-d: Map/unmap domain with mmmap/mmunmap
>   iommu/vt-d: Identify domains using first level page table
> 
>  drivers/iommu/Makefile             |   2 +-
>  drivers/iommu/intel-iommu.c        | 142 ++++++++++--
>  drivers/iommu/intel-pgtable.c      | 342
> +++++++++++++++++++++++++++++ include/linux/intel-iommu.h        |
> 31 ++- include/trace/events/intel_iommu.h |  60 +++++
>  5 files changed, 553 insertions(+), 24 deletions(-)
>  create mode 100644 drivers/iommu/intel-pgtable.c
> 

[Jacob Pan]
Ashok Raj Sept. 23, 2019, 8:25 p.m. UTC | #2
Hi Jacob

On Mon, Sep 23, 2019 at 12:27:15PM -0700, Jacob Pan wrote:
> > 
> > In VT-d 3.0, scalable mode is introduced, which offers two level
> > translation page tables and nested translation mode. Regards to
> > GIOVA support, it can be simplified by 1) moving the GIOVA support
> > over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
> > 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using pIOMMU
> > second level for GPA->HPA translation, and 4) enable nested (a.k.a.
> > dual stage) translation in host. Compared with current shadow GIOVA
> > support, the new approach is more secure and software is simplified
> > as we only need to flush the pIOMMU IOTLB and possible device-IOTLB
> > when an IOVA mapping in vIOMMU is torn down.
> > 
> >      .-----------.
> >      |  vIOMMU   |
> >      |-----------|                 .-----------.
> >      |           |IOTLB flush trap |   QEMU    |
> >      .-----------.    (unmap)      |-----------|
> >      | GVA->GPA  |---------------->|           |
> >      '-----------'                 '-----------'
> >      |           |                       |
> >      '-----------'                       |
> >            <------------------------------
> >            |      VFIO/IOMMU          
> >            |  cache invalidation and  
> >            | guest gpd bind interfaces
> >            v
> For vSVA, the guest PGD bind interface will mark the PASID as guest
> PASID and will inject page request into the guest. In FL gIOVA case, I
> guess we are assuming there is no page fault for GIOVA. I will need to
> add a flag in the gpgd bind such that any PRS will be auto responded
> with invalid.

Is there a real need to enforce this? I'm not sure if there is any
limitation in the spec, and if so, can the guest check that instead?

Also, I believe the idea is to overcommit PASID#0 for such uses. I
thought we had a capability to expose this to the vIOMMU as well. Not
sure if this is already documented; if not, it should be in the next rev.


> 
> Also, native use of IOVA FL map is not to be supported? i.e. IOMMU API
> and DMA API for native usage will continue to be SL only?
> >      .-----------.
> >      |  pIOMMU   |
> >      |-----------|
> >      .-----------.
> >      | GVA->GPA  |<---First level
> >      '-----------'
> >      | GPA->HPA  |<---Scond level

s/Scond/Second

> >      '-----------'
> >      '-----------'
> > 
> > This patch series only aims to achieve the first goal, a.k.a using
> > first level translation for IOVA mappings in vIOMMU. I am sending
> > it out for your comments. Any comments, suggestions and concerns are
> > welcomed.
> > 
> 
> 
> > Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
> > Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
> > Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
> > Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>
> > Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
> > 
> > Lu Baolu (4):
> >   iommu/vt-d: Move domain_flush_cache helper into header
> >   iommu/vt-d: Add first level page table interfaces
> >   iommu/vt-d: Map/unmap domain with mmmap/mmunmap
> >   iommu/vt-d: Identify domains using first level page table
> > 
> >  drivers/iommu/Makefile             |   2 +-
> >  drivers/iommu/intel-iommu.c        | 142 ++++++++++--
> >  drivers/iommu/intel-pgtable.c      | 342
> > +++++++++++++++++++++++++++++ include/linux/intel-iommu.h        |
> > 31 ++- include/trace/events/intel_iommu.h |  60 +++++
> >  5 files changed, 553 insertions(+), 24 deletions(-)
> >  create mode 100644 drivers/iommu/intel-pgtable.c
> > 
> 
> [Jacob Pan]
Baolu Lu Sept. 24, 2019, 4:27 a.m. UTC | #3
Hi Jacob,

On 9/24/19 3:27 AM, Jacob Pan wrote:
> Hi Baolu,
> 
> On Mon, 23 Sep 2019 20:24:50 +0800
> Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
>> This patchset aims to move IOVA (I/O Virtual Address) translation
>> to 1st-level page table under scalable mode. The major purpose of
>> this effort is to make guest IOVA support more efficient.
>>
>> As Intel VT-d architecture offers caching-mode, guest IOVA (GIOVA)
>> support is now implemented in a shadow page manner. The device
>> simulation software, like QEMU, has to figure out GIOVA->GPA mapping
>> and writes to a shadowed page table, which will be used by pIOMMU.
>> Each time when mappings are created or destroyed in vIOMMU, the
>> simulation software will intervene. The change on GIOVA->GPA will be
>> shadowed to host, and the pIOMMU will be updated via VFIO/IOMMU
>> interfaces.
>>
>>
>>       .-----------.
>>       |  vIOMMU   |
>>       |-----------|                 .--------------------.
>>       |           |IOTLB flush trap |        QEMU        |
>>       .-----------. (map/unmap)     |--------------------|
>>       | GVA->GPA  |---------------->|      .----------.  |
>>       '-----------'                 |      | GPA->HPA |  |
>>       |           |                 |      '----------'  |
>>       '-----------'                 |                    |
>>                                     |                    |
>>                                     '--------------------'
>>                                                  |
>>              <------------------------------------
>>              |
>>              v VFIO/IOMMU API
>>        .-----------.
>>        |  pIOMMU   |
>>        |-----------|
>>        |           |
>>        .-----------.
>>        | GVA->HPA  |
>>        '-----------'
>>        |           |
>>        '-----------'
>>
>> In VT-d 3.0, scalable mode is introduced, which offers two level
>> translation page tables and nested translation mode. Regards to
>> GIOVA support, it can be simplified by 1) moving the GIOVA support
>> over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
>> 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using pIOMMU
>> second level for GPA->HPA translation, and 4) enable nested (a.k.a.
>> dual stage) translation in host. Compared with current shadow GIOVA
>> support, the new approach is more secure and software is simplified
>> as we only need to flush the pIOMMU IOTLB and possible device-IOTLB
>> when an IOVA mapping in vIOMMU is torn down.
>>
>>       .-----------.
>>       |  vIOMMU   |
>>       |-----------|                 .-----------.
>>       |           |IOTLB flush trap |   QEMU    |
>>       .-----------.    (unmap)      |-----------|
>>       | GVA->GPA  |---------------->|           |
>>       '-----------'                 '-----------'
>>       |           |                       |
>>       '-----------'                       |
>>             <------------------------------
>>             |      VFIO/IOMMU
>>             |  cache invalidation and
>>             | guest gpd bind interfaces
>>             v
> For vSVA, the guest PGD bind interface will mark the PASID as guest
> PASID and will inject page request into the guest. In FL gIOVA case, I
> guess we are assuming there is no page fault for GIOVA. I will need to
> add a flag in the gpgd bind such that any PRS will be auto responded
> with invalid.

There should be no page fault. The pages should have been pinned.

> 
> Also, native use of IOVA FL map is not to be supported? i.e. IOMMU API
> and DMA API for native usage will continue to be SL only?

Yes. There isn't such a use case as far as I can see.

Best regards,
Baolu

>>       .-----------.
>>       |  pIOMMU   |
>>       |-----------|
>>       .-----------.
>>       | GVA->GPA  |<---First level
>>       '-----------'
>>       | GPA->HPA  |<---Scond level
>>       '-----------'
>>       '-----------'
>>
>> This patch series only aims to achieve the first goal, a.k.a using
>> first level translation for IOVA mappings in vIOMMU. I am sending
>> it out for your comments. Any comments, suggestions and concerns are
>> welcomed.
>>
> 
> 
>> Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
>> Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
>> Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
>> Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>
>> Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
>>
>> Lu Baolu (4):
>>    iommu/vt-d: Move domain_flush_cache helper into header
>>    iommu/vt-d: Add first level page table interfaces
>>    iommu/vt-d: Map/unmap domain with mmmap/mmunmap
>>    iommu/vt-d: Identify domains using first level page table
>>
>>   drivers/iommu/Makefile             |   2 +-
>>   drivers/iommu/intel-iommu.c        | 142 ++++++++++--
>>   drivers/iommu/intel-pgtable.c      | 342
>> +++++++++++++++++++++++++++++ include/linux/intel-iommu.h        |
>> 31 ++- include/trace/events/intel_iommu.h |  60 +++++
>>   5 files changed, 553 insertions(+), 24 deletions(-)
>>   create mode 100644 drivers/iommu/intel-pgtable.c
>>
> 
> [Jacob Pan]
>
Baolu Lu Sept. 24, 2019, 4:40 a.m. UTC | #4
Hi,

On 9/24/19 4:25 AM, Raj, Ashok wrote:
> Hi Jacob
> 
> On Mon, Sep 23, 2019 at 12:27:15PM -0700, Jacob Pan wrote:
>>>
>>> In VT-d 3.0, scalable mode is introduced, which offers two level
>>> translation page tables and nested translation mode. Regards to
>>> GIOVA support, it can be simplified by 1) moving the GIOVA support
>>> over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
>>> 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using pIOMMU
>>> second level for GPA->HPA translation, and 4) enable nested (a.k.a.
>>> dual stage) translation in host. Compared with current shadow GIOVA
>>> support, the new approach is more secure and software is simplified
>>> as we only need to flush the pIOMMU IOTLB and possible device-IOTLB
>>> when an IOVA mapping in vIOMMU is torn down.
>>>
>>>       .-----------.
>>>       |  vIOMMU   |
>>>       |-----------|                 .-----------.
>>>       |           |IOTLB flush trap |   QEMU    |
>>>       .-----------.    (unmap)      |-----------|
>>>       | GVA->GPA  |---------------->|           |
>>>       '-----------'                 '-----------'
>>>       |           |                       |
>>>       '-----------'                       |
>>>             <------------------------------
>>>             |      VFIO/IOMMU
>>>             |  cache invalidation and
>>>             | guest gpd bind interfaces
>>>             v
>> For vSVA, the guest PGD bind interface will mark the PASID as guest
>> PASID and will inject page request into the guest. In FL gIOVA case, I
>> guess we are assuming there is no page fault for GIOVA. I will need to
>> add a flag in the gpgd bind such that any PRS will be auto responded
>> with invalid.
> 
> Is there real need to enforce this? I'm not sure if there is any
> limitation in the spec, and if so, can the guest check that instead?

For the FL gIOVA case, gPASID is always 0. If a physical device is
passed through, hPASID is also 0; if it is an mdev device (representing
an ADI) instead, hPASID would be the PASID corresponding to the ADI.
The simulation software (i.e. QEMU) maintains a map between gPASID and
hPASID.

I second Ashok's idea. We don't need to distinguish these two cases in
the API and can handle the page request interrupt in the guest as an
unrecoverable one.
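
Roughly speaking, the gPASID->hPASID map above could look like the
sketch below (purely illustrative; the names and structure are made up
and are not QEMU code):

/* Illustrative gPASID -> hPASID lookup maintained by the simulation
 * software: gPASID 0 maps to hPASID 0 for a passed-through PF/VF, or
 * to the ADI's PASID for an mdev.
 */
#include <stdint.h>

#define PASID_INVALID ((uint32_t)-1)

struct pasid_mapping {
        uint32_t gpasid;
        uint32_t hpasid;
};

static uint32_t gpasid_to_hpasid(const struct pasid_mapping *map,
                                 unsigned int nr, uint32_t gpasid)
{
        unsigned int i;

        for (i = 0; i < nr; i++)
                if (map[i].gpasid == gpasid)
                        return map[i].hpasid;

        return PASID_INVALID;
}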

> 
> Also i believe the idea is to overcommit PASID#0 such uses. Thought
> we had a capability to expose this to the vIOMMU as well. Not sure if this
> is already documented, if not should be up in the next rev.
> 
> 
>>
>> Also, native use of IOVA FL map is not to be supported? i.e. IOMMU API
>> and DMA API for native usage will continue to be SL only?
>>>       .-----------.
>>>       |  pIOMMU   |
>>>       |-----------|
>>>       .-----------.
>>>       | GVA->GPA  |<---First level
>>>       '-----------'
>>>       | GPA->HPA  |<---Scond level
> 
> s/Scond/Second

Yes. Thanks!

Best regards,
Baolu
Tian, Kevin Sept. 24, 2019, 7 a.m. UTC | #5
> From: Raj, Ashok
> Sent: Tuesday, September 24, 2019 4:26 AM
> 
> Hi Jacob
> 
> On Mon, Sep 23, 2019 at 12:27:15PM -0700, Jacob Pan wrote:
> > >
> > > In VT-d 3.0, scalable mode is introduced, which offers two level
> > > translation page tables and nested translation mode. Regards to
> > > GIOVA support, it can be simplified by 1) moving the GIOVA support
> > > over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
> > > 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using
> pIOMMU
> > > second level for GPA->HPA translation, and 4) enable nested (a.k.a.
> > > dual stage) translation in host. Compared with current shadow GIOVA
> > > support, the new approach is more secure and software is simplified
> > > as we only need to flush the pIOMMU IOTLB and possible device-IOTLB
> > > when an IOVA mapping in vIOMMU is torn down.
> > >
> > >      .-----------.
> > >      |  vIOMMU   |
> > >      |-----------|                 .-----------.
> > >      |           |IOTLB flush trap |   QEMU    |
> > >      .-----------.    (unmap)      |-----------|
> > >      | GVA->GPA  |---------------->|           |
> > >      '-----------'                 '-----------'

GVA should be replaced by GIOVA in all the figures.

> > >      |           |                       |
> > >      '-----------'                       |
> > >            <------------------------------
> > >            |      VFIO/IOMMU
> > >            |  cache invalidation and
> > >            | guest gpd bind interfaces
> > >            v
> > For vSVA, the guest PGD bind interface will mark the PASID as guest
> > PASID and will inject page request into the guest. In FL gIOVA case, I
> > guess we are assuming there is no page fault for GIOVA. I will need to
> > add a flag in the gpgd bind such that any PRS will be auto responded
> > with invalid.
> 
> Is there real need to enforce this? I'm not sure if there is any
> limitation in the spec, and if so, can the guest check that instead?

Whether to allow page faults is not usage specific (GIOVA, GVA, etc.).
It's really about the device capability and the IOMMU capability. VT-d
allows page faults on both levels, so we don't need to enforce it.

Btw, in the future we may need an interface to tell VFIO whether a
device is 100% DMA-faultable so that pinning can be avoided. But for
now I'm not sure how such knowledge can be retrieved w/o device-specific
knowledge. The PCI PRI capability only indicates that the device
supports page faults, not that the device enables page faults on
every DMA access. Maybe we need a new bit in the PRI capability for
such a purpose.

> 
> Also i believe the idea is to overcommit PASID#0 such uses. Thought
> we had a capability to expose this to the vIOMMU as well. Not sure if this
> is already documented, if not should be up in the next rev.
> 
> 
> >
> > Also, native use of IOVA FL map is not to be supported? i.e. IOMMU API
> > and DMA API for native usage will continue to be SL only?
> > >      .-----------.
> > >      |  pIOMMU   |
> > >      |-----------|
> > >      .-----------.
> > >      | GVA->GPA  |<---First level
> > >      '-----------'
> > >      | GPA->HPA  |<---Scond level
> 
> s/Scond/Second
> 
> > >      '-----------'
> > >      '-----------'
> > >
> > > This patch series only aims to achieve the first goal, a.k.a using

First goal? Then what are the other goals? I didn't spot such information.

Also, earlier you mentioned that the new approach (nested) is more
secure than shadowing. Why?

> > > first level translation for IOVA mappings in vIOMMU. I am sending
> > > it out for your comments. Any comments, suggestions and concerns are
> > > welcomed.
> > >
> >
> >
> > > Based-on-idea-by: Ashok Raj <ashok.raj@intel.com>
> > > Based-on-idea-by: Kevin Tian <kevin.tian@intel.com>
> > > Based-on-idea-by: Liu Yi L <yi.l.liu@intel.com>
> > > Based-on-idea-by: Lu Baolu <baolu.lu@linux.intel.com>
> > > Based-on-idea-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
> > >
> > > Lu Baolu (4):
> > >   iommu/vt-d: Move domain_flush_cache helper into header
> > >   iommu/vt-d: Add first level page table interfaces
> > >   iommu/vt-d: Map/unmap domain with mmmap/mmunmap
> > >   iommu/vt-d: Identify domains using first level page table
> > >
> > >  drivers/iommu/Makefile             |   2 +-
> > >  drivers/iommu/intel-iommu.c        | 142 ++++++++++--
> > >  drivers/iommu/intel-pgtable.c      | 342
> > > +++++++++++++++++++++++++++++ include/linux/intel-iommu.h        |
> > > 31 ++- include/trace/events/intel_iommu.h |  60 +++++
> > >  5 files changed, 553 insertions(+), 24 deletions(-)
> > >  create mode 100644 drivers/iommu/intel-pgtable.c
> > >
> >
> > [Jacob Pan]
Baolu Lu Sept. 25, 2019, 2:48 a.m. UTC | #6
Hi Kevin,

On 9/24/19 3:00 PM, Tian, Kevin wrote:
>>>>       '-----------'
>>>>       '-----------'
>>>>
>>>> This patch series only aims to achieve the first goal, a.k.a using
> first goal? then what are other goals? I didn't spot such information.
> 

The overall goal is to use IOMMU nested mode to avoid the shadow page
table and VMEXITs when mapping a gIOVA. This includes the 4 steps below
(maybe not accurate, but you can get the point):

1) GIOVA mappings over the 1st-level page table;
2) binding the vIOMMU 1st-level page table to the pIOMMU;
3) using the pIOMMU second level for GPA->HPA translation;
4) enabling nested (a.k.a. dual-stage) translation in the host.

This patch set aims to achieve 1).

> Also earlier you mentioned the new approach (nested) is more secure
> than shadowing. why?
> 

My bad! After reconsideration, I realized that it's not "more secure".

Thanks for pointing this out.

Best regards,
Baolu
Peter Xu Sept. 25, 2019, 6:56 a.m. UTC | #7
On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> Hi Kevin,
> 
> On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > >       '-----------'
> > > > >       '-----------'
> > > > > 
> > > > > This patch series only aims to achieve the first goal, a.k.a using
> > first goal? then what are other goals? I didn't spot such information.
> > 
> 
> The overall goal is to use IOMMU nested mode to avoid shadow page table
> and VMEXIT when map an gIOVA. This includes below 4 steps (maybe not
> accurate, but you could get the point.)
> 
> 1) GIOVA mappings over 1st-level page table;
> 2) binding vIOMMU 1st level page table to the pIOMMU;
> 3) using pIOMMU second level for GPA->HPA translation;
> 4) enable nested (a.k.a. dual stage) translation in host.
> 
> This patch set aims to achieve 1).

Would it make sense to use the 1st level even for bare metal, to
replace the 2nd level?

What I'm thinking of is DPDK apps - they already have the MMU page
table there for the huge pages, so if they could use the 1st level as
the default device page table they would not even need to map anything,
because they could simply bind the process root page table pointer to
the 1st-level page root pointer of the device contexts they use.
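
(For reference, a rough kernel-driver-side sketch of that kind of
binding using the existing iommu_sva_* API is below - illustrative
only, and a DPDK-style userspace path would of course still need some
new uAPI on top of it.)

/* Sketch: share the current process address space with a device so
 * the device walks the same CPU page table (assumes the
 * iommu_sva_bind_device() interface as it exists today).
 */
#include <linux/device.h>
#include <linux/iommu.h>
#include <linux/sched.h>

static struct iommu_sva *bind_current_mm(struct device *dev)
{
        /* No separate IOVA map step: DMA uses the process VAs. */
        return iommu_sva_bind_device(dev, current->mm, NULL);
}

static void unbind_mm(struct iommu_sva *handle)
{
        iommu_sva_unbind_device(handle);
}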

Regards,
Tian, Kevin Sept. 25, 2019, 7:21 a.m. UTC | #8
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, September 25, 2019 2:57 PM
> 
> On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> > Hi Kevin,
> >
> > On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > > >       '-----------'
> > > > > >       '-----------'
> > > > > >
> > > > > > This patch series only aims to achieve the first goal, a.k.a using
> > > first goal? then what are other goals? I didn't spot such information.
> > >
> >
> > The overall goal is to use IOMMU nested mode to avoid shadow page
> table
> > and VMEXIT when map an gIOVA. This includes below 4 steps (maybe not
> > accurate, but you could get the point.)
> >
> > 1) GIOVA mappings over 1st-level page table;
> > 2) binding vIOMMU 1st level page table to the pIOMMU;
> > 3) using pIOMMU second level for GPA->HPA translation;
> > 4) enable nested (a.k.a. dual stage) translation in host.
> >
> > This patch set aims to achieve 1).
> 
> Would it make sense to use 1st level even for bare-metal to replace
> the 2nd level?
> 
> What I'm thinking is the DPDK apps - they have MMU page table already
> there for the huge pages, then if they can use 1st level as the
> default device page table then it even does not need to map, because
> it can simply bind the process root page table pointer to the 1st
> level page root pointer of the device contexts that it uses.
> 

Then you need to bear with possible page faults from using the CPU
page table, while most devices don't support that today.

Thanks
Kevin
Peter Xu Sept. 25, 2019, 7:45 a.m. UTC | #9
On Wed, Sep 25, 2019 at 07:21:51AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, September 25, 2019 2:57 PM
> > 
> > On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> > > Hi Kevin,
> > >
> > > On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > > > >       '-----------'
> > > > > > >       '-----------'
> > > > > > >
> > > > > > > This patch series only aims to achieve the first goal, a.k.a using
> > > > first goal? then what are other goals? I didn't spot such information.
> > > >
> > >
> > > The overall goal is to use IOMMU nested mode to avoid shadow page
> > table
> > > and VMEXIT when map an gIOVA. This includes below 4 steps (maybe not
> > > accurate, but you could get the point.)
> > >
> > > 1) GIOVA mappings over 1st-level page table;
> > > 2) binding vIOMMU 1st level page table to the pIOMMU;
> > > 3) using pIOMMU second level for GPA->HPA translation;
> > > 4) enable nested (a.k.a. dual stage) translation in host.
> > >
> > > This patch set aims to achieve 1).
> > 
> > Would it make sense to use 1st level even for bare-metal to replace
> > the 2nd level?
> > 
> > What I'm thinking is the DPDK apps - they have MMU page table already
> > there for the huge pages, then if they can use 1st level as the
> > default device page table then it even does not need to map, because
> > it can simply bind the process root page table pointer to the 1st
> > level page root pointer of the device contexts that it uses.
> > 
> 
> Then you need bear with possible page faults from using CPU page
> table, while most devices don't support it today. 

Right, I was just thinking aloud.  After all, we don't have IOMMU
hardware that supports the 1st level yet either (or am I wrong?)...
It's just that when the 1st level is ready it sounds doable, because
IIUC PRI should always come along with 1st-level support, no matter
whether on the IOMMU side or the device side?

I'm actually not sure whether my understanding here is correct... I
thought the PASID binding was previously only for some vendor kernel
drivers and not a general thing for userspace.  I feel like that should
be doable in the future once we've got some new syscall interface ready
to deliver a 1st-level page table (e.g., via vfio?); then applications
like DPDK seem to be able to use that too, even directly on bare metal.

Regards,
Tian, Kevin Sept. 25, 2019, 8:02 a.m. UTC | #10
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, September 25, 2019 3:45 PM
> 
> On Wed, Sep 25, 2019 at 07:21:51AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Wednesday, September 25, 2019 2:57 PM
> > >
> > > On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> > > > Hi Kevin,
> > > >
> > > > On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > > > > >       '-----------'
> > > > > > > >       '-----------'
> > > > > > > >
> > > > > > > > This patch series only aims to achieve the first goal, a.k.a using
> > > > > first goal? then what are other goals? I didn't spot such information.
> > > > >
> > > >
> > > > The overall goal is to use IOMMU nested mode to avoid shadow page
> > > table
> > > > and VMEXIT when map an gIOVA. This includes below 4 steps (maybe
> not
> > > > accurate, but you could get the point.)
> > > >
> > > > 1) GIOVA mappings over 1st-level page table;
> > > > 2) binding vIOMMU 1st level page table to the pIOMMU;
> > > > 3) using pIOMMU second level for GPA->HPA translation;
> > > > 4) enable nested (a.k.a. dual stage) translation in host.
> > > >
> > > > This patch set aims to achieve 1).
> > >
> > > Would it make sense to use 1st level even for bare-metal to replace
> > > the 2nd level?
> > >
> > > What I'm thinking is the DPDK apps - they have MMU page table already
> > > there for the huge pages, then if they can use 1st level as the
> > > default device page table then it even does not need to map, because
> > > it can simply bind the process root page table pointer to the 1st
> > > level page root pointer of the device contexts that it uses.
> > >
> >
> > Then you need bear with possible page faults from using CPU page
> > table, while most devices don't support it today.
> 
> Right, I was just thinking aloud.  After all neither do we have IOMMU
> hardware to support 1st level (or am I wrong?)...  It's just that when

You are right. Current VT-d supports only 2nd level.

> the 1st level is ready it should sound doable because IIUC PRI should
> be always with the 1st level support no matter on IOMMU side or the
> device side?

No. PRI is not tied to the 1st or 2nd level. Actually, from the device
p.o.v, it's just a protocol to trigger a page fault, and the device
doesn't care whether the page fault is handled on the 1st or 2nd level
on the IOMMU side. The only relevant part is that a PRI request can
have a PASID tagged or cleared. When it's tagged with a PASID, the
IOMMU will locate the translation table under the given PASID (either
1st or 2nd level is fine, according to the PASID entry setting). When
no PASID is included, the IOMMU locates the translation from the
default entry (e.g. PASID#0 or whatever PASID is contained in RID2PASID
in VT-d).

Your knowledge happened to be correct for the deprecated ECS mode. At
that time, there was only one 2nd level per context entry, which didn't
support page faults, and only one 1st level per PASID entry, which did.
Then PRI could be indirectly connected to the 1st level, but this just
changed with the new scalable mode.

Another note is that the PRI capability only indicates that a device is
capable of handling page faults, not that a device can tolerate a page
fault on any of its DMA accesses. If the latter is false, using the CPU
page table for the DPDK usage is still risky (and specific to device
behavior).
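
As a concrete illustration of that last point (sketch only, using the
standard PCI PRI helpers from <linux/pci-ats.h>): all we can discover
here is that the device supports the page-request protocol, not that
every DMA access it issues can tolerate a fault.

/* The PRI capability tells us the device can be asked to handle page
 * faults; whether it tolerates a fault on every DMA access is device
 * specific and not visible here.
 */
#include <linux/errno.h>
#include <linux/pci.h>
#include <linux/pci-ats.h>

static int try_enable_pri(struct pci_dev *pdev)
{
        if (!pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI))
                return -ENODEV;

        /* 32 outstanding page requests; just an example value. */
        return pci_enable_pri(pdev, 32);
}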

> 
> I'm actually not sure about whether my understanding here is
> correct... I thought the pasid binding previously was only for some
> vendor kernel drivers but not a general thing to userspace.  I feel
> like that should be doable in the future once we've got some new
> syscall interface ready to deliver 1st level page table (e.g., via
> vfio?) then applications like DPDK seems to be able to use that too
> even directly via bare metal.
> 

Using the 1st level for userspace is different from supporting DMA
page faults in userspace. The former is purely about which structure
keeps the mapping. I think we may do the same thing for both bare
metal and guest (using the 2nd level only for GPA when nesting is
enabled on the IOMMU). But reusing the CPU page table for userspace is
more tricky. :-)

Thanks
Kevin
Peter Xu Sept. 25, 2019, 8:52 a.m. UTC | #11
On Wed, Sep 25, 2019 at 08:02:23AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, September 25, 2019 3:45 PM
> > 
> > On Wed, Sep 25, 2019 at 07:21:51AM +0000, Tian, Kevin wrote:
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Wednesday, September 25, 2019 2:57 PM
> > > >
> > > > On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> > > > > Hi Kevin,
> > > > >
> > > > > On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > > > > > >       '-----------'
> > > > > > > > >       '-----------'
> > > > > > > > >
> > > > > > > > > This patch series only aims to achieve the first goal, a.k.a using
> > > > > > first goal? then what are other goals? I didn't spot such information.
> > > > > >
> > > > >
> > > > > The overall goal is to use IOMMU nested mode to avoid shadow page
> > > > table
> > > > > and VMEXIT when map an gIOVA. This includes below 4 steps (maybe
> > not
> > > > > accurate, but you could get the point.)
> > > > >
> > > > > 1) GIOVA mappings over 1st-level page table;
> > > > > 2) binding vIOMMU 1st level page table to the pIOMMU;
> > > > > 3) using pIOMMU second level for GPA->HPA translation;
> > > > > 4) enable nested (a.k.a. dual stage) translation in host.
> > > > >
> > > > > This patch set aims to achieve 1).
> > > >
> > > > Would it make sense to use 1st level even for bare-metal to replace
> > > > the 2nd level?
> > > >
> > > > What I'm thinking is the DPDK apps - they have MMU page table already
> > > > there for the huge pages, then if they can use 1st level as the
> > > > default device page table then it even does not need to map, because
> > > > it can simply bind the process root page table pointer to the 1st
> > > > level page root pointer of the device contexts that it uses.
> > > >
> > >
> > > Then you need bear with possible page faults from using CPU page
> > > table, while most devices don't support it today.
> > 
> > Right, I was just thinking aloud.  After all neither do we have IOMMU
> > hardware to support 1st level (or am I wrong?)...  It's just that when
> 
> You are right. Current VT-d supports only 2nd level.
> 
> > the 1st level is ready it should sound doable because IIUC PRI should
> > be always with the 1st level support no matter on IOMMU side or the
> > device side?
> 
> No. PRI is not tied to 1st or 2nd level. Actually from device p.o.v, it's
> just a protocol to trigger page fault, but the device doesn't care whether
> the page fault is on 1st or 2nd level in the IOMMU side. The only
> relevant part is that a PRI request can have PASID tagged or cleared.
> When it's tagged with PASID, the IOMMU will locate the translation
> table under the given PASID (either 1st or 2nd level is fine, according
> to PASID entry setting). When no PASID is included, the IOMMU locates
> the translation from default entry (e.g. PASID#0 or any PASID contained
> in RID2PASID in VT-d).
> 
> Your knowledge happened to be correct in deprecated ECS mode. At
> that time, there is only one 2nd level per context entry which doesn't
> support page fault, and there is only one 1st level per PASID entry which
> supports page fault. Then PRI could be indirectly connected to 1st level,
> but this just changed with new scalable mode.
> 
> Another note is that the PRI capability only indicates that a device is
> capable of handling page faults, but not that a device can tolerate
> page fault for any of its DMA access. If the latter is fasle, using CPU 
> page table for DPDK usage is still risky (and specific to device behavior)
> 
> > 
> > I'm actually not sure about whether my understanding here is
> > correct... I thought the pasid binding previously was only for some
> > vendor kernel drivers but not a general thing to userspace.  I feel
> > like that should be doable in the future once we've got some new
> > syscall interface ready to deliver 1st level page table (e.g., via
> > vfio?) then applications like DPDK seems to be able to use that too
> > even directly via bare metal.
> > 
> 
> using 1st level for userspace is different from supporting DMA page
> fault in userspace. The former is purely about which structure to
> keep the mapping. I think we may do the same thing for both bare
> metal and guest (using 2nd level only for GPA when nested is enabled
> on the IOMMU). But reusing CPU page table for userspace is more
> tricky. :-)

Yes, I think I mixed up the 1st-level page table and PRI a bit, and
after all my initial question is irrelevant to this series anyway, so
it's already a bit off topic (sorry for that).

And, thanks for explaining these. :)
Baolu Lu Sept. 26, 2019, 1:37 a.m. UTC | #12
Hi Peter,

On 9/25/19 4:52 PM, Peter Xu wrote:
> On Wed, Sep 25, 2019 at 08:02:23AM +0000, Tian, Kevin wrote:
>>> From: Peter Xu [mailto:peterx@redhat.com]
>>> Sent: Wednesday, September 25, 2019 3:45 PM
>>>
>>> On Wed, Sep 25, 2019 at 07:21:51AM +0000, Tian, Kevin wrote:
>>>>> From: Peter Xu [mailto:peterx@redhat.com]
>>>>> Sent: Wednesday, September 25, 2019 2:57 PM
>>>>>
>>>>> On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
>>>>>> Hi Kevin,
>>>>>>
>>>>>> On 9/24/19 3:00 PM, Tian, Kevin wrote:
>>>>>>>>>>        '-----------'
>>>>>>>>>>        '-----------'
>>>>>>>>>>
>>>>>>>>>> This patch series only aims to achieve the first goal, a.k.a using
>>>>>>> first goal? then what are other goals? I didn't spot such information.
>>>>>>>
>>>>>>
>>>>>> The overall goal is to use IOMMU nested mode to avoid shadow page
>>>>> table
>>>>>> and VMEXIT when map an gIOVA. This includes below 4 steps (maybe
>>> not
>>>>>> accurate, but you could get the point.)
>>>>>>
>>>>>> 1) GIOVA mappings over 1st-level page table;
>>>>>> 2) binding vIOMMU 1st level page table to the pIOMMU;
>>>>>> 3) using pIOMMU second level for GPA->HPA translation;
>>>>>> 4) enable nested (a.k.a. dual stage) translation in host.
>>>>>>
>>>>>> This patch set aims to achieve 1).
>>>>>
>>>>> Would it make sense to use 1st level even for bare-metal to replace
>>>>> the 2nd level?
>>>>>
>>>>> What I'm thinking is the DPDK apps - they have MMU page table already
>>>>> there for the huge pages, then if they can use 1st level as the
>>>>> default device page table then it even does not need to map, because
>>>>> it can simply bind the process root page table pointer to the 1st
>>>>> level page root pointer of the device contexts that it uses.
>>>>>
>>>>
>>>> Then you need bear with possible page faults from using CPU page
>>>> table, while most devices don't support it today.
>>>
>>> Right, I was just thinking aloud.  After all neither do we have IOMMU
>>> hardware to support 1st level (or am I wrong?)...  It's just that when
>>
>> You are right. Current VT-d supports only 2nd level.
>>
>>> the 1st level is ready it should sound doable because IIUC PRI should
>>> be always with the 1st level support no matter on IOMMU side or the
>>> device side?
>>
>> No. PRI is not tied to 1st or 2nd level. Actually from device p.o.v, it's
>> just a protocol to trigger page fault, but the device doesn't care whether
>> the page fault is on 1st or 2nd level in the IOMMU side. The only
>> relevant part is that a PRI request can have PASID tagged or cleared.
>> When it's tagged with PASID, the IOMMU will locate the translation
>> table under the given PASID (either 1st or 2nd level is fine, according
>> to PASID entry setting). When no PASID is included, the IOMMU locates
>> the translation from default entry (e.g. PASID#0 or any PASID contained
>> in RID2PASID in VT-d).
>>
>> Your knowledge happened to be correct in deprecated ECS mode. At
>> that time, there is only one 2nd level per context entry which doesn't
>> support page fault, and there is only one 1st level per PASID entry which
>> supports page fault. Then PRI could be indirectly connected to 1st level,
>> but this just changed with new scalable mode.
>>
>> Another note is that the PRI capability only indicates that a device is
>> capable of handling page faults, but not that a device can tolerate
>> page fault for any of its DMA access. If the latter is fasle, using CPU
>> page table for DPDK usage is still risky (and specific to device behavior)
>>
>>>
>>> I'm actually not sure about whether my understanding here is
>>> correct... I thought the pasid binding previously was only for some
>>> vendor kernel drivers but not a general thing to userspace.  I feel
>>> like that should be doable in the future once we've got some new
>>> syscall interface ready to deliver 1st level page table (e.g., via
>>> vfio?) then applications like DPDK seems to be able to use that too
>>> even directly via bare metal.
>>>
>>
>> using 1st level for userspace is different from supporting DMA page
>> fault in userspace. The former is purely about which structure to
>> keep the mapping. I think we may do the same thing for both bare
>> metal and guest (using 2nd level only for GPA when nested is enabled
>> on the IOMMU). But reusing CPU page table for userspace is more
>> tricky. :-)
> 
> Yes I should have mixed up the 1st level page table and PRI a bit, and
> after all my initial question should be irrelevant to this series as
> well so it's already a bit out of topic (sorry for that).

Never mind. Good discussion. :-)

Actually I have a plan to use the 1st level on bare metal as well.
Just looking forward to more motivation and use cases.

> 
> And, thanks for explaining these. :)
> 

Thanks to Kevin for the explanation. :-)

Best regards,
Baolu