mbox series

[v7,00/15] Optimizing iommu_[map/unmap] performance

Message ID 1623850736-389584-1-git-send-email-quic_c_gdjako@quicinc.com (mailing list archive)
Headers show
Series Optimizing iommu_[map/unmap] performance | expand

Message

Georgi Djakov June 16, 2021, 1:38 p.m. UTC
When unmapping a buffer from an IOMMU domain, the IOMMU framework unmaps
the buffer at a granule of the largest page size that is supported by
the IOMMU hardware and fits within the buffer. For every block that
is unmapped, the IOMMU framework will call into the IOMMU driver, and
then the io-pgtable framework to walk the page tables to find the entry
that corresponds to the IOVA, and then unmaps the entry.

This can be suboptimal in scenarios where a buffer or a piece of a
buffer can be split into several contiguous page blocks of the same size.
For example, consider an IOMMU that supports 4 KB page blocks, 2 MB page
blocks, and 1 GB page blocks, and a buffer that is 4 MB in size is being
unmapped at IOVA 0. The current call-flow will result in 4 indirect calls,
and 2 page table walks, to unmap 2 entries that are next to each other in
the page-tables, when both entries could have been unmapped in one shot
by clearing both page table entries in the same call.

The same optimization is applicable to mapping buffers as well, so
these patches implement a set of callbacks called unmap_pages and
map_pages to the io-pgtable code and IOMMU drivers which unmaps or maps
an IOVA range that consists of a number of pages of the same
page size that is supported by the IOMMU hardware, and allows for
manipulating multiple page table entries in the same set of indirect
calls. The reason for introducing these callbacks is to give other IOMMU
drivers/io-pgtable formats time to change to using the new callbacks, so
that the transition to using this approach can be done piecemeal.

Changes since V6:
(https://lore.kernel.org/r/1623776913-390160-1-git-send-email-quic_c_gdjako@quicinc.com/)

* Fix compiler warning (patch 08/15)
* Free underlying page tables for large mappings (patch 10/15)
    Consider the case where a 2N--where N > 1--MB buffer is composed
    entirely of 4 KB pages. This means that at the second to last level,
    the buffer will have N non-leaf entries that point to page tables
    with 4 KB mappings.

    When the buffer is unmapped, all N entries will be cleared at the
    second to last level. However, the existing logic only checks if
    it needs to free the underlying page tables for the first non-leaf
    entry. Therefore, the page table memory for the other entries N-1
    entries will be leaked.

    Fix this memory leak by ensuring that we apply the same check to
    all N entries that are being unmapped.

    When unmapping multiple entries, __arm_lpae_unmap() should unmap
    one entry at a time and perform TLB maintenance as required for that
    entry.

Changes since V5: 
(https://lore.kernel.org/r/20210408171402.12607-1-isaacm@codeaurora.org/)

* Rebased on next-20210515.
* Fixed minor checkpatch warnings - indentation, extra blank lines.
* Use the correct function argument in __arm_lpae_map(). (chenxiang)

Changes since V4:

* Fixed type for addr_merge from phys_addr_t to unsigned long so
  that GENMASK() can be used.
* Hooked up arm_v7s_[unmap/map]_pages to the io-pgtable ops.
* Introduced a macro for calculating the number of page table entries
  for the ARM LPAE io-pgtable format.

Changes since V3:

* Removed usage of ULL variants of bitops from Will's patches, as
  they were not needed.
* Instead of unmapping/mapping pgcount pages, unmap_pages() and
  map_pages() will at most unmap and map pgcount pages, allowing
  for part of the pages in pgcount to be mapped and unmapped. This
  was done to simplify the handling in the io-pgtable layer.
* Extended the existing PTE manipulation methods in io-pgtable-arm
  to handle multiple entries, per Robin's suggestion, eliminating
  the need to add functions to clear multiple PTEs.
* Implemented a naive form of [map/unmap]_pages() for ARM v7s io-pgtable
  format.
* arm_[v7s/lpae]_[map/unmap] will call
  arm_[v7s/lpae]_[map_pages/unmap_pages] with an argument of 1 page.
* The arm_smmu_[map/unmap] functions have been removed, since they
  have been replaced by arm_smmu_[map/unmap]_pages.

Changes since V2:

* Added a check in __iommu_map() to check for the existence
  of either the map or map_pages callback as per Lu's suggestion.

Changes since V1:

* Implemented the map_pages() callbacks
* Integrated Will's patches into this series which
  address several concerns about how iommu_pgsize() partitioned a
  buffer (I made a minor change to the patch which changes
  iommu_pgsize() to use bitmaps by using the ULL variants of
  the bitops)

Isaac J. Manjarres (12):
  iommu/io-pgtable: Introduce unmap_pages() as a page table op
  iommu: Add an unmap_pages() op for IOMMU drivers
  iommu/io-pgtable: Introduce map_pages() as a page table op
  iommu: Add a map_pages() op for IOMMU drivers
  iommu: Add support for the map_pages() callback
  iommu/io-pgtable-arm: Prepare PTE methods for handling multiple
    entries
  iommu/io-pgtable-arm: Implement arm_lpae_unmap_pages()
  iommu/io-pgtable-arm: Implement arm_lpae_map_pages()
  iommu/io-pgtable-arm-v7s: Implement arm_v7s_unmap_pages()
  iommu/io-pgtable-arm-v7s: Implement arm_v7s_map_pages()
  iommu/arm-smmu: Implement the unmap_pages() IOMMU driver callback
  iommu/arm-smmu: Implement the map_pages() IOMMU driver callback

Will Deacon (3):
  iommu: Use bitmap to calculate page size in iommu_pgsize()
  iommu: Split 'addr_merge' argument to iommu_pgsize() into separate
    parts
  iommu: Hook up '->unmap_pages' driver callback

 drivers/iommu/arm/arm-smmu/arm-smmu.c |  18 +--
 drivers/iommu/io-pgtable-arm-v7s.c    |  50 ++++++--
 drivers/iommu/io-pgtable-arm.c        | 223 +++++++++++++++++++++-------------
 drivers/iommu/iommu.c                 | 129 +++++++++++++++-----
 include/linux/io-pgtable.h            |   8 ++
 include/linux/iommu.h                 |   9 ++
 6 files changed, 307 insertions(+), 130 deletions(-)

Comments

Georgi Djakov July 14, 2021, 2:24 p.m. UTC | #1
On 16.06.21 16:38, Georgi Djakov wrote:
> When unmapping a buffer from an IOMMU domain, the IOMMU framework unmaps
> the buffer at a granule of the largest page size that is supported by
> the IOMMU hardware and fits within the buffer. For every block that
> is unmapped, the IOMMU framework will call into the IOMMU driver, and
> then the io-pgtable framework to walk the page tables to find the entry
> that corresponds to the IOVA, and then unmaps the entry.
> 
> This can be suboptimal in scenarios where a buffer or a piece of a
> buffer can be split into several contiguous page blocks of the same size.
> For example, consider an IOMMU that supports 4 KB page blocks, 2 MB page
> blocks, and 1 GB page blocks, and a buffer that is 4 MB in size is being
> unmapped at IOVA 0. The current call-flow will result in 4 indirect calls,
> and 2 page table walks, to unmap 2 entries that are next to each other in
> the page-tables, when both entries could have been unmapped in one shot
> by clearing both page table entries in the same call.
> 
> The same optimization is applicable to mapping buffers as well, so
> these patches implement a set of callbacks called unmap_pages and
> map_pages to the io-pgtable code and IOMMU drivers which unmaps or maps
> an IOVA range that consists of a number of pages of the same
> page size that is supported by the IOMMU hardware, and allows for
> manipulating multiple page table entries in the same set of indirect
> calls. The reason for introducing these callbacks is to give other IOMMU
> drivers/io-pgtable formats time to change to using the new callbacks, so
> that the transition to using this approach can be done piecemeal.

Hi Will,

Did you get a chance to look at this patchset? Most patches are already
acked/reviewed and all still applies clean on rc1.

Thanks,
Georgi
Baolu Lu July 15, 2021, 1:23 a.m. UTC | #2
On 7/14/21 10:24 PM, Georgi Djakov wrote:
> On 16.06.21 16:38, Georgi Djakov wrote:
>> When unmapping a buffer from an IOMMU domain, the IOMMU framework unmaps
>> the buffer at a granule of the largest page size that is supported by
>> the IOMMU hardware and fits within the buffer. For every block that
>> is unmapped, the IOMMU framework will call into the IOMMU driver, and
>> then the io-pgtable framework to walk the page tables to find the entry
>> that corresponds to the IOVA, and then unmaps the entry.
>>
>> This can be suboptimal in scenarios where a buffer or a piece of a
>> buffer can be split into several contiguous page blocks of the same size.
>> For example, consider an IOMMU that supports 4 KB page blocks, 2 MB page
>> blocks, and 1 GB page blocks, and a buffer that is 4 MB in size is being
>> unmapped at IOVA 0. The current call-flow will result in 4 indirect 
>> calls,
>> and 2 page table walks, to unmap 2 entries that are next to each other in
>> the page-tables, when both entries could have been unmapped in one shot
>> by clearing both page table entries in the same call.
>>
>> The same optimization is applicable to mapping buffers as well, so
>> these patches implement a set of callbacks called unmap_pages and
>> map_pages to the io-pgtable code and IOMMU drivers which unmaps or maps
>> an IOVA range that consists of a number of pages of the same
>> page size that is supported by the IOMMU hardware, and allows for
>> manipulating multiple page table entries in the same set of indirect
>> calls. The reason for introducing these callbacks is to give other IOMMU
>> drivers/io-pgtable formats time to change to using the new callbacks, so
>> that the transition to using this approach can be done piecemeal.
> 
> Hi Will,
> 
> Did you get a chance to look at this patchset? Most patches are already
> acked/reviewed and all still applies clean on rc1.

I also have the ops->[un]map_pages implementation for the Intel IOMMU
driver. I will post them once the iommu/core part get applied.

Best regards,
baolu
chenxiang July 15, 2021, 1:51 a.m. UTC | #3
在 2021/7/15 9:23, Lu Baolu 写道:
> On 7/14/21 10:24 PM, Georgi Djakov wrote:
>> On 16.06.21 16:38, Georgi Djakov wrote:
>>> When unmapping a buffer from an IOMMU domain, the IOMMU framework 
>>> unmaps
>>> the buffer at a granule of the largest page size that is supported by
>>> the IOMMU hardware and fits within the buffer. For every block that
>>> is unmapped, the IOMMU framework will call into the IOMMU driver, and
>>> then the io-pgtable framework to walk the page tables to find the entry
>>> that corresponds to the IOVA, and then unmaps the entry.
>>>
>>> This can be suboptimal in scenarios where a buffer or a piece of a
>>> buffer can be split into several contiguous page blocks of the same 
>>> size.
>>> For example, consider an IOMMU that supports 4 KB page blocks, 2 MB 
>>> page
>>> blocks, and 1 GB page blocks, and a buffer that is 4 MB in size is 
>>> being
>>> unmapped at IOVA 0. The current call-flow will result in 4 indirect 
>>> calls,
>>> and 2 page table walks, to unmap 2 entries that are next to each 
>>> other in
>>> the page-tables, when both entries could have been unmapped in one shot
>>> by clearing both page table entries in the same call.
>>>
>>> The same optimization is applicable to mapping buffers as well, so
>>> these patches implement a set of callbacks called unmap_pages and
>>> map_pages to the io-pgtable code and IOMMU drivers which unmaps or maps
>>> an IOVA range that consists of a number of pages of the same
>>> page size that is supported by the IOMMU hardware, and allows for
>>> manipulating multiple page table entries in the same set of indirect
>>> calls. The reason for introducing these callbacks is to give other 
>>> IOMMU
>>> drivers/io-pgtable formats time to change to using the new 
>>> callbacks, so
>>> that the transition to using this approach can be done piecemeal.
>>
>> Hi Will,
>>
>> Did you get a chance to look at this patchset? Most patches are already
>> acked/reviewed and all still applies clean on rc1.
>
> I also have the ops->[un]map_pages implementation for the Intel IOMMU
> driver. I will post them once the iommu/core part get applied.

I also implement those callbacks on ARM SMMUV3 based on this series, and 
use dma_map_benchmark to have a test on
the latency of map/unmap as follows, and i think it promotes much on the 
latency of map/unmap. I will also plan to post
the implementations for ARM SMMUV3 after this series are applied.

t = 1(thread = 1):
                    before opt(us)   after opt(us)
g=1(4K size)        0.1/1.3          0.1/0.8
g=2(8K size)        0.2/1.5          0.2/0.9
g=4(16K size)       0.3/1.9          0.1/1.1
g=8(32K size)       0.5/2.7          0.2/1.4
g=16(64K size)      1.0/4.5          0.2/2.0
g=32(128K size)     1.8/7.9          0.2/3.3
g=64(256K size)     3.7/14.8         0.4/6.1
g=128(512K size)    7.1/14.7         0.5/10.4
g=256(1M size)      14.0/53.9        0.8/19.3
g=512(2M size)      0.2/0.9          0.2/0.9
g=1024(4M size)     0.5/1.5          0.4/1.0

t = 10(thread = 10):
                    before opt(us)   after opt(us)
g=1(4K size)        0.3/7.0          0.1/5.8
g=2(8K size)        0.4/6.7          0.3/6.0
g=4(16K size)       0.5/6.3          0.3/5.6
g=8(32K size)       0.5/8.3          0.2/6.3
g=16(64K size)      1.0/17.3         0.3/12.4
g=32(128K size)     1.8/36.0         0.2/24.2
g=64(256K size)     4.3/67.2         1.2/46.4
g=128(512K size)    7.8/93.7         1.3/94.2
g=256(1M size)      14.7/280.8       1.8/191.5
g=512(2M size)      3.6/3.2          1.5/2.5
g=1024(4M size)     2.0/3.1          1.8/2.6


>
> Best regards,
> baolu
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
> .
>
Joerg Roedel July 26, 2021, 10:37 a.m. UTC | #4
On Wed, Jun 16, 2021 at 06:38:41AM -0700, Georgi Djakov wrote:
> Isaac J. Manjarres (12):
>   iommu/io-pgtable: Introduce unmap_pages() as a page table op
>   iommu: Add an unmap_pages() op for IOMMU drivers
>   iommu/io-pgtable: Introduce map_pages() as a page table op
>   iommu: Add a map_pages() op for IOMMU drivers
>   iommu: Add support for the map_pages() callback
>   iommu/io-pgtable-arm: Prepare PTE methods for handling multiple
>     entries
>   iommu/io-pgtable-arm: Implement arm_lpae_unmap_pages()
>   iommu/io-pgtable-arm: Implement arm_lpae_map_pages()
>   iommu/io-pgtable-arm-v7s: Implement arm_v7s_unmap_pages()
>   iommu/io-pgtable-arm-v7s: Implement arm_v7s_map_pages()
>   iommu/arm-smmu: Implement the unmap_pages() IOMMU driver callback
>   iommu/arm-smmu: Implement the map_pages() IOMMU driver callback
> 
> Will Deacon (3):
>   iommu: Use bitmap to calculate page size in iommu_pgsize()
>   iommu: Split 'addr_merge' argument to iommu_pgsize() into separate
>     parts
>   iommu: Hook up '->unmap_pages' driver callback

Applied to iommu/core branch, thanks to everyone involved!