
[RFC,v3,0/8] Add IOPF support for VFIO passthrough

Message ID: 20210409034420.1799-1-lushenming@huawei.com

Message

Shenming Lu April 9, 2021, 3:44 a.m. UTC
Hi,

Requesting your comments and suggestions. :-)

The static pinning and mapping problem in VFIO and possible solutions
have been discussed a lot [1, 2]. One of the solutions is to add I/O
Page Fault (IOPF) support for VFIO devices. Unlike relatively complicated
software approaches, such as presenting a vIOMMU that provides the DMA
buffer information (possibly with para-virtualized optimizations), IOPF
mainly relies on hardware faulting capabilities, such as the PCIe PRI
extension or the Arm SMMU stall model. Moreover, IOPF support in the
IOMMU driver has already been implemented for SVA [3]. So this series
adds IOPF support for VFIO passthrough based on the IOPF part of SVA.
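
For reference, the host-side IOPF path builds on the existing IOMMU fault
reporting API. Below is a minimal sketch of a recoverable-fault handler in
that style; it is illustrative only (vfio_iopf_handle_fault() is a
hypothetical helper standing in for the actual handling this series adds to
vfio_iommu_type1.c):

#include <linux/iommu.h>

/*
 * Hypothetical helper (not defined here): would pin the faulting page and
 * iommu_map() it into the container's domain before responding.
 */
static int vfio_iopf_handle_fault(struct device *dev, u64 iova, u32 perm);

static int vfio_iommu_dev_fault_handler(struct iommu_fault *fault, void *data)
{
	struct iommu_page_response resp = {
		.argsz	 = sizeof(resp),
		.version = IOMMU_PAGE_RESP_VERSION_1,
		.pasid	 = fault->prm.pasid,
		.grpid	 = fault->prm.grpid,
	};
	struct device *dev = data;

	/* Unrecoverable faults are reported to userspace, not handled here. */
	if (fault->type != IOMMU_FAULT_PAGE_REQ)
		return -EOPNOTSUPP;

	if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID)
		resp.flags |= IOMMU_PAGE_RESP_PASID_VALID;

	resp.code = vfio_iopf_handle_fault(dev, fault->prm.addr, fault->prm.perm) ?
		    IOMMU_PAGE_RESP_INVALID : IOMMU_PAGE_RESP_SUCCESS;

	return iommu_page_response(dev, &resp);
}

/* Registered per device in the container, e.g.: */
/* iommu_register_device_fault_handler(dev, vfio_iommu_dev_fault_handler, dev); */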

We measured its performance with UADK [4] (an accelerator passed through to
a VM with 1 vCPU and 16 GB of memory) on a HiSilicon Kunpeng 920 board, and
compared it with host SVA:

Run hisi_sec_test...
 - with varying sending times and message lengths
 - with/without IOPF enabled (speed slowdown)

when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
            slowdown (num of faults)
 times      VFIO IOPF      host SVA
 1          63.4% (518)    82.8% (512)
 100        22.9% (1058)   47.9% (1024)
 1000       2.6% (1071)    8.5% (1024)

when msg_len = 10MB (and PREMAP_LEN = 512):
            slowdown (num of faults)
 times      VFIO IOPF
 1          32.6% (13)
 100        3.5% (26)
 1000       1.6% (26)

History:

v2 -> v3
 - Nit fixes.
 - No reason to disable reporting the unrecoverable faults. (baolu)
 - Maintain a global list of IOPF-enabled groups.
 - Split the pre-mapping optimization into a separate patch.
 - Add selective faulting support (use vfio_pin_pages to indicate the
   non-faultable scope and add a new struct vfio_range to record it,
   untested). (Kevin)

v1 -> v2
 - Numerous improvements following the suggestions. Thanks a lot to all
   of you.

Note that PRI is not supported at the moment since no hardware is available
for testing.

Links:
[1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
    2016.
[2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
    for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
[3] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-philippe@linaro.org/
[4] https://github.com/Linaro/uadk

Thanks,
Shenming


Shenming Lu (8):
  iommu: Evolve the device fault reporting framework
  vfio/type1: Add a page fault handler
  vfio/type1: Add an MMU notifier to avoid pinning
  vfio/type1: Pre-map more pages than requested in the IOPF handling
  vfio/type1: VFIO_IOMMU_ENABLE_IOPF
  vfio/type1: No need to statically pin and map if IOPF enabled
  vfio/type1: Add selective DMA faulting support
  vfio: Add nested IOPF support

 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |    3 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   18 +-
 drivers/iommu/iommu.c                         |   56 +-
 drivers/vfio/vfio.c                           |   85 +-
 drivers/vfio/vfio_iommu_type1.c               | 1000 ++++++++++++++++-
 include/linux/iommu.h                         |   19 +-
 include/linux/vfio.h                          |   13 +
 include/uapi/linux/iommu.h                    |    4 +
 include/uapi/linux/vfio.h                     |    6 +
 9 files changed, 1181 insertions(+), 23 deletions(-)

Comments

Shenming Lu April 26, 2021, 1:41 a.m. UTC | #1
On 2021/4/9 11:44, Shenming Lu wrote:
> Hi,
> 
> Requesting for your comments and suggestions. :-)

Kind ping...

Shenming Lu May 11, 2021, 11:30 a.m. UTC | #2
Hi Alex,

Hoping for some suggestions or comments from you, since there seem to be many
uncertain points in this series. :-)

Thanks,
Shenming


Alex Williamson May 18, 2021, 6:57 p.m. UTC | #3
On Fri, 9 Apr 2021 11:44:12 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> Hi,
> 
> Requesting for your comments and suggestions. :-)
> 
> The static pinning and mapping problem in VFIO and possible solutions
> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> Page Fault support for VFIO devices. Different from those relatively
> complicated software approaches such as presenting a vIOMMU that provides
> the DMA buffer information (might include para-virtualized optimizations),
> IOPF mainly depends on the hardware faulting capability, such as the PCIe
> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
> support for VFIO passthrough based on the IOPF part of SVA in this series.

The SVA proposals are being reworked to make use of a new IOASID
object; it's not clear to me that this shouldn't also make use of that
work, as it does a significant expansion of the type1 IOMMU with fault
handlers that would duplicate new work using that new model.

> We have measured its performance with UADK [4] (passthrough an accelerator
> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
> 
> Run hisi_sec_test...
>  - with varying sending times and message lengths
>  - with/without IOPF enabled (speed slowdown)
> 
> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>             slowdown (num of faults)
>  times      VFIO IOPF      host SVA
>  1          63.4% (518)    82.8% (512)
>  100        22.9% (1058)   47.9% (1024)
>  1000       2.6% (1071)    8.5% (1024)
> 
> when msg_len = 10MB (and PREMAP_LEN = 512):
>             slowdown (num of faults)
>  times      VFIO IOPF
>  1          32.6% (13)
>  100        3.5% (26)
>  1000       1.6% (26)

It seems like this is only an example that you can make a benchmark
show anything you want.  The best results would be to pre-map
everything, which is what we have without this series.  What is an
acceptable overhead to incur to avoid page pinning?  What if userspace
had more fine grained control over which mappings were available for
faulting and which were statically mapped?  I don't really see what
sense the pre-mapping range makes.  If I assume the user is QEMU in a
non-vIOMMU configuration, pre-mapping the beginning of each RAM section
doesn't make any logical sense relative to device DMA.

Comments per patch to follow.  Thanks,

Alex


Shenming Lu May 21, 2021, 6:37 a.m. UTC | #4
Hi Alex,

On 2021/5/19 2:57, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:12 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> Hi,
>>
>> Requesting for your comments and suggestions. :-)
>>
>> The static pinning and mapping problem in VFIO and possible solutions
>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>> Page Fault support for VFIO devices. Different from those relatively
>> complicated software approaches such as presenting a vIOMMU that provides
>> the DMA buffer information (might include para-virtualized optimizations),
>> IOPF mainly depends on the hardware faulting capability, such as the PCIe
>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
>> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
>> support for VFIO passthrough based on the IOPF part of SVA in this series.
> 
> The SVA proposals are being reworked to make use of a new IOASID
> object, it's not clear to me that this shouldn't also make use of that
> work as it does a significant expansion of the type1 IOMMU with fault
> handlers that would duplicate new work using that new model.

It seems that the IOASID extension for guest SVA would not affect this series;
would we do any host-guest IOASID translation in the VFIO fault handler?

> 
>> We have measured its performance with UADK [4] (passthrough an accelerator
>> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
>>
>> Run hisi_sec_test...
>>  - with varying sending times and message lengths
>>  - with/without IOPF enabled (speed slowdown)
>>
>> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>>             slowdown (num of faults)
>>  times      VFIO IOPF      host SVA
>>  1          63.4% (518)    82.8% (512)
>>  100        22.9% (1058)   47.9% (1024)
>>  1000       2.6% (1071)    8.5% (1024)
>>
>> when msg_len = 10MB (and PREMAP_LEN = 512):
>>             slowdown (num of faults)
>>  times      VFIO IOPF
>>  1          32.6% (13)
>>  100        3.5% (26)
>>  1000       1.6% (26)
> 
> It seems like this is only an example that you can make a benchmark
> show anything you want.  The best results would be to pre-map
> everything, which is what we have without this series.  What is an
> acceptable overhead to incur to avoid page pinning?  What if userspace
> had more fine grained control over which mappings were available for
> faulting and which were statically mapped?  I don't really see what
> sense the pre-mapping range makes.  If I assume the user is QEMU in a
> non-vIOMMU configuration, pre-mapping the beginning of each RAM section
> doesn't make any logical sense relative to device DMA.

As you said in Patch 4, we can introduce full end-to-end functionality before
trying to improve performance, so I will drop the pre-mapping patch at the
current stage...

Is there a need for userspace to have more fine-grained control over which
mappings are available for faulting? If so, we could evolve the MAP ioctl to
support specifying the faulting range.
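
For illustration, from the userspace side that might look roughly like the
sketch below; VFIO_DMA_MAP_FLAG_IOPF and its bit position are hypothetical,
not part of the current uAPI or of this series:

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical: mark a mapping as faultable instead of pinned up front. */
#define VFIO_DMA_MAP_FLAG_IOPF	(1 << 3)	/* assumed next free flag bit */

static int map_faultable(int container_fd, void *vaddr, __u64 iova, __u64 size)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
			 VFIO_DMA_MAP_FLAG_IOPF,
		.vaddr = (__u64)(unsigned long)vaddr,
		.iova  = iova,
		.size  = size,
	};

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}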

As for the overhead of IOPF, it is unavoidable when enabling on-demand paging
(page faults occur almost only on first access), and there seems to be a large
optimization space compared to CPU page faulting.

Thanks,
Shenming

Alex Williamson May 24, 2021, 10:11 p.m. UTC | #5
On Fri, 21 May 2021 14:37:21 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> Hi Alex,
> 
> On 2021/5/19 2:57, Alex Williamson wrote:
> > On Fri, 9 Apr 2021 11:44:12 +0800
> > Shenming Lu <lushenming@huawei.com> wrote:
> >   
> >> Hi,
> >>
> >> Requesting for your comments and suggestions. :-)
> >>
> >> The static pinning and mapping problem in VFIO and possible solutions
> >> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> >> Page Fault support for VFIO devices. Different from those relatively
> >> complicated software approaches such as presenting a vIOMMU that provides
> >> the DMA buffer information (might include para-virtualized optimizations),
> >> IOPF mainly depends on the hardware faulting capability, such as the PCIe
> >> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
> >> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
> >> support for VFIO passthrough based on the IOPF part of SVA in this series.  
> > 
> > The SVA proposals are being reworked to make use of a new IOASID
> > object, it's not clear to me that this shouldn't also make use of that
> > work as it does a significant expansion of the type1 IOMMU with fault
> > handlers that would duplicate new work using that new model.  
> 
> It seems that the IOASID extension for guest SVA would not affect this series,
> will we do any host-guest IOASID translation in the VFIO fault handler?

Surely it will, we don't currently have any IOMMU fault handling or
forwarding of IOMMU faults through to the vfio bus driver, both of
those would be included in an IOASID implementation.  I think Jason's
vision is to use IOASID to deprecate type1 for all use cases, so even
if we were to choose to implement IOPF in type1 we should agree on
common interfaces with IOASID.
 
> >> We have measured its performance with UADK [4] (passthrough an accelerator
> >> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
> >>
> >> Run hisi_sec_test...
> >>  - with varying sending times and message lengths
> >>  - with/without IOPF enabled (speed slowdown)
> >>
> >> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
> >>             slowdown (num of faults)
> >>  times      VFIO IOPF      host SVA
> >>  1          63.4% (518)    82.8% (512)
> >>  100        22.9% (1058)   47.9% (1024)
> >>  1000       2.6% (1071)    8.5% (1024)
> >>
> >> when msg_len = 10MB (and PREMAP_LEN = 512):
> >>             slowdown (num of faults)
> >>  times      VFIO IOPF
> >>  1          32.6% (13)
> >>  100        3.5% (26)
> >>  1000       1.6% (26)  
> > 
> > It seems like this is only an example that you can make a benchmark
> > show anything you want.  The best results would be to pre-map
> > everything, which is what we have without this series.  What is an
> > acceptable overhead to incur to avoid page pinning?  What if userspace
> > had more fine grained control over which mappings were available for
> > faulting and which were statically mapped?  I don't really see what
> > sense the pre-mapping range makes.  If I assume the user is QEMU in a
> > non-vIOMMU configuration, pre-mapping the beginning of each RAM section
> > doesn't make any logical sense relative to device DMA.  
> 
> As you said in Patch 4, we can introduce full end-to-end functionality
> before trying to improve performance, and I will drop the pre-mapping patch
> in the current stage...
> 
> Is there a need that userspace wants more fine grained control over which
> mappings are available for faulting? If so, we may evolve the MAP ioctl
> to support for specifying the faulting range.

You're essentially enabling this for a vfio bus driver via patch 7/8,
pinning for selective DMA faulting.  How would a driver in userspace
make equivalent requests?  In the case of performance, the user could
mlock the page but they have no mechanism here to pre-fault it.  Should
they?

> As for the overhead of IOPF, it is unavoidable if enabling on-demand paging
> (and page faults occur almost only when first accessing), and it seems that
> there is a large optimization space compared to CPU page faulting.

Yes, there's of course going to be overhead in terms of latency for the
page faults.  My point was more that when a host is not under memory
pressure we should trend towards the performance of pinned, static
mappings and we should be able to create arbitrarily large pre-fault
behavior to show that.  But I think what we really want to enable via
IOPF is density, right?  Therefore how many more assigned device guests
can you run on a host with IOPF?  How does the slope, plateau, and
inflection point of their aggregate throughput compare to static
pinning?  VM startup time is probably also a useful metric, ie. trading
device latency for startup latency.  Thanks,

Alex
Shenming Lu May 27, 2021, 11:25 a.m. UTC | #6
On 2021/5/25 6:11, Alex Williamson wrote:
> On Fri, 21 May 2021 14:37:21 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> Hi Alex,
>>
>> On 2021/5/19 2:57, Alex Williamson wrote:
>>> On Fri, 9 Apr 2021 11:44:12 +0800
>>> Shenming Lu <lushenming@huawei.com> wrote:
>>>   
>>>> Hi,
>>>>
>>>> Requesting for your comments and suggestions. :-)
>>>>
>>>> The static pinning and mapping problem in VFIO and possible solutions
>>>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>>>> Page Fault support for VFIO devices. Different from those relatively
>>>> complicated software approaches such as presenting a vIOMMU that provides
>>>> the DMA buffer information (might include para-virtualized optimizations),
>>>> IOPF mainly depends on the hardware faulting capability, such as the PCIe
>>>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
>>>> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
>>>> support for VFIO passthrough based on the IOPF part of SVA in this series.  
>>>
>>> The SVA proposals are being reworked to make use of a new IOASID
>>> object, it's not clear to me that this shouldn't also make use of that
>>> work as it does a significant expansion of the type1 IOMMU with fault
>>> handlers that would duplicate new work using that new model.  
>>
>> It seems that the IOASID extension for guest SVA would not affect this series,
>> will we do any host-guest IOASID translation in the VFIO fault handler?
> 
> Surely it will, we don't currently have any IOMMU fault handling or
> forwarding of IOMMU faults through to the vfio bus driver, both of
> those would be included in an IOASID implementation.  I think Jason's
> vision is to use IOASID to deprecate type1 for all use cases, so even
> if we were to choose to implement IOPF in type1 we should agree on
> common interfaces with IOASID.

Yeah, the guest IOPF (L1) handling may include the host-guest IOASID
translation, which could live in the IOASID layer (in fact it could live in
many places, such as the vfio-pci driver, since that driver doesn't really
handle the fault event, it just forwards the event to the vIOMMU).
But the host IOPF (L2) has no relationship with IOASID at all; it only needs
knowledge of the vfio_dma ranges.
Could we add the host IOPF support to type1 first (integrated with the MAP
ioctl)? We could then migrate the generic IOMMU controls (such as MAP/UNMAP...)
from type1 to IOASID in the future (it seems to be a huge amount of work, and
I would be very happy to help with it)... :-)

>  
>>>> We have measured its performance with UADK [4] (passthrough an accelerator
>>>> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
>>>>
>>>> Run hisi_sec_test...
>>>>  - with varying sending times and message lengths
>>>>  - with/without IOPF enabled (speed slowdown)
>>>>
>>>> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>>>>             slowdown (num of faults)
>>>>  times      VFIO IOPF      host SVA
>>>>  1          63.4% (518)    82.8% (512)
>>>>  100        22.9% (1058)   47.9% (1024)
>>>>  1000       2.6% (1071)    8.5% (1024)
>>>>
>>>> when msg_len = 10MB (and PREMAP_LEN = 512):
>>>>             slowdown (num of faults)
>>>>  times      VFIO IOPF
>>>>  1          32.6% (13)
>>>>  100        3.5% (26)
>>>>  1000       1.6% (26)  
>>>
>>> It seems like this is only an example that you can make a benchmark
>>> show anything you want.  The best results would be to pre-map
>>> everything, which is what we have without this series.  What is an
>>> acceptable overhead to incur to avoid page pinning?  What if userspace
>>> had more fine grained control over which mappings were available for
>>> faulting and which were statically mapped?  I don't really see what
>>> sense the pre-mapping range makes.  If I assume the user is QEMU in a
>>> non-vIOMMU configuration, pre-mapping the beginning of each RAM section
>>> doesn't make any logical sense relative to device DMA.  
>>
>> As you said in Patch 4, we can introduce full end-to-end functionality
>> before trying to improve performance, and I will drop the pre-mapping patch
>> in the current stage...
>>
>> Is there a need that userspace wants more fine grained control over which
>> mappings are available for faulting? If so, we may evolve the MAP ioctl
>> to support for specifying the faulting range.
> 
> You're essentially enabling this for a vfio bus driver via patch 7/8,
> pinning for selective DMA faulting.  How would a driver in userspace
> make equivalent requests?  In the case of performance, the user could
> mlock the page but they have no mechanism here to pre-fault it.  Should
> they?

Makes sense.

It seems that we should additionally iommu_map the pages that are IOPF-enabled
and pinned in vfio_iommu_type1_pin_pages(), and then there would be no need to
add more tracking structures in Patch 7...
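
Roughly, a sketch of that idea (simplified, ignoring locking, batching and the
existing pfn accounting in type1) could be:

/*
 * Sketch, not the patch: after pinning a user page for an IOPF-enabled
 * domain, map it immediately so the device never faults on this range,
 * which also makes it effectively non-faultable. The range would still
 * need to be recorded and unmapped again on unpin.
 */
static int vfio_iopf_map_pinned(struct iommu_domain *domain, dma_addr_t iova,
				unsigned long phys_pfn, int prot)
{
	return iommu_map(domain, iova, (phys_addr_t)phys_pfn << PAGE_SHIFT,
			 PAGE_SIZE, prot);
}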

> 
>> As for the overhead of IOPF, it is unavoidable if enabling on-demand paging
>> (and page faults occur almost only when first accessing), and it seems that
>> there is a large optimization space compared to CPU page faulting.
> 
> Yes, there's of course going to be overhead in terms of latency for the
> page faults.  My point was more that when a host is not under memory
> pressure we should trend towards the performance of pinned, static
> mappings and we should be able to create arbitrarily large pre-fault
> behavior to show that.

Makes sense.

> But I think what we really want to enable via IOPF is density, right?

Density? Did you mean the proportion of IOPF-enabled mappings?

> Therefore how many more assigned device guests
> can you run on a host with IOPF? How does the slope, plateau, and
> inflection point of their aggregate throughput compare to static
> pinning?  VM startup time is probably also a useful metric, ie. trading
> device latency for startup latency.  Thanks,

Yeah, these are what we need to consider and test later. :-)
And the slope, plateau, and inflection point of the aggregate throughput may
depend on the specific device driver's behavior (such as whether it reuses
the DMA buffers)...

Thanks,
Shenming
