
iommu/dma: Add config for PCI SAC address trick

Message ID ef8abf1c6b0839e39b272738fc7bc4d9699c8bcb.1652895419.git.robin.murphy@arm.com (mailing list archive)
State New, archived

Commit Message

Robin Murphy May 18, 2022, 5:36 p.m. UTC
For devices stuck behind a conventional PCI bus, saving extra cycles at
33MHz is probably fairly significant. However since native PCI Express
is now the norm for high-performance devices, the optimisation to always
prefer 32-bit addresses for the sake of avoiding DAC is starting to look
rather anachronistic. Technically 32-bit addresses do have shorter TLPs
on PCIe, but unless the device is saturating its link bandwidth with
small transfers it seems unlikely that the difference is appreciable.

What definitely is appreciable, however, is that the IOVA allocator
doesn't behave all that well once the 32-bit space starts getting full.
As DMA working sets get bigger, this optimisation increasingly backfires
and adds considerable overhead to the dma_map path for use-cases like
high-bandwidth networking. We've increasingly bandaged the allocator
in attempts to mitigate this, but it remains fundamentally at odds with
other valid requirements to try as hard as possible to satisfy a request
within the given limit; what we really need is to just avoid this odd
notion of a speculative allocation when it isn't beneficial anyway.
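
[For reference, the behaviour in question looks roughly like this in
iommu_dma_alloc_iova() (a simplified sketch paraphrasing the mainline
code of the time, not a verbatim quote; iovad, iova_len, shift and
dma_limit are locals of that function):

	/* Speculatively try to get PCI devices a SAC address first */
	if (dma_limit > DMA_BIT_MASK(32) && !iommu_dma_forcedac &&
	    dev_is_pci(dev))
		iova = alloc_iova_fast(iovad, iova_len,
				       DMA_BIT_MASK(32) >> shift, false);

	/* ...and only then fall back to the full usable range */
	if (!iova)
		iova = alloc_iova_fast(iovad, iova_len,
				       dma_limit >> shift, true);

It is the first, speculative call which becomes expensive once the
32-bit space fills up, since each failure is only discovered after
searching that whole range.]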

Unfortunately that's where things get awkward... The trick has been
present on x86 for 15 years or so now, and it turns out there are
systems which fail to properly define the upper limit of usable IOVA
space for certain devices, for which this trick was the only thing
letting them work OK. I had a similar
ulterior motive for a couple of early arm64 systems when originally
adding it to iommu-dma, but those really should now be fixed with proper
firmware bindings, and other arm64 users really need it out of the way,
so let's just leave it default-on for x86.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/Kconfig     | 24 ++++++++++++++++++++++++
 drivers/iommu/dma-iommu.c |  2 +-
 2 files changed, 25 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig May 19, 2022, 7:11 a.m. UTC | #1
On Wed, May 18, 2022 at 06:36:59PM +0100, Robin Murphy wrote:
> +config IOMMU_DMA_PCI_SAC_OPT
> +	bool "Enable 64-bit legacy PCI optimisation by default"
> +	depends on IOMMU_DMA
> +	default X86
> +	help
> +	  Enable by default an IOMMU optimisation for 64-bit legacy PCI devices,
> +	  wherein the DMA API layer will always first try to allocate a 32-bit
> +	  DMA address suitable for a single address cycle, before falling back
> +	  to allocating from the full usable address range. If your system has
> +	  64-bit legacy PCI devices in 32-bit slots where using dual address
> +	  cycles reduces DMA throughput significantly, this optimisation may be
> +	  beneficial to overall performance.

The config option name sounds odd.  Yes, maybe for actual 64-bit PCI
this actually is an optimization, but I'd think of it more as a
workaround, and I'd probably word it as such.  I also would not
default to true for x86, just allow for that.  There is nothing
fundamental about x86 wanting that, just that people use more crap
drivers on x86.  And the fact that AMD SEV sets the high bit for
encrypted memory has been weeding out at least some of them.
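
[One possible shape for that rewording (a purely hypothetical sketch
of Christoph's suggestion; this name and text were not proposed in the
thread):

	config IOMMU_DMA_PCI_SAC_WORKAROUND
		bool "Prefer 32-bit DMA addresses for PCI devices (bug workaround)"
		depends on IOMMU_DMA

i.e. framed as a workaround, with no "default X86" line.]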

> +bool iommu_dma_forcedac __read_mostly = !IS_ENABLED(CONFIG_IOMMU_DMA_PCI_SAC_OPT);

Overly long line here.
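
[One way to keep it within 80 columns would be (an editorial sketch):

	bool iommu_dma_forcedac __read_mostly =
		!IS_ENABLED(CONFIG_IOMMU_DMA_PCI_SAC_OPT);
]
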
John Garry May 19, 2022, 10:20 a.m. UTC | #2
On 18/05/2022 18:36, Robin Murphy wrote:
> For devices stuck behind a conventional PCI bus, saving extra cycles at
> 33MHz is probably fairly significant. However since native PCI Express
> is now the norm for high-performance devices, the optimisation to always
> prefer 32-bit addresses for the sake of avoiding DAC is starting to look
> rather anachronistic. Technically 32-bit addresses do have shorter TLPs
> on PCIe, but unless the device is saturating its link bandwidth with
> small transfers it seems unlikely that the difference is appreciable.
> 
> What definitely is appreciable, however, is that the IOVA allocator
> doesn't behave all that well once the 32-bit space starts getting full.
> As DMA working sets get bigger, this optimisation increasingly backfires
> and adds considerable overhead to the dma_map path for use-cases like
> high-bandwidth networking. We've increasingly bandaged the allocator
> in attempts to mitigate this, but it remains fundamentally at odds with
> other valid requirements to try as hard as possible to satisfy a request
> within the given limit; what we really need is to just avoid this odd
> notion of a speculative allocation when it isn't beneficial anyway.
> 
> Unfortunately that's where things get awkward... The trick has been
> present on x86 for 15 years or so now, and it turns out there are
> systems which fail to properly define the upper limit of usable IOVA
> space for certain devices, for which this trick was the only thing
> letting them work OK. I had a similar
> ulterior motive for a couple of early arm64 systems when originally
> adding it to iommu-dma, but those really should now be fixed with proper
> firmware bindings, and other arm64 users really need it out of the way,
> so let's just leave it default-on for x86.
> 
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
>   drivers/iommu/Kconfig     | 24 ++++++++++++++++++++++++
>   drivers/iommu/dma-iommu.c |  2 +-

It might be worth printing this default value always, and not just 
when it is set from the command line, like we do for the default 
domain type and the IOTLB invalidation policy.
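
[For illustration, that could look something like this (hypothetical
placement and wording, not part of the patch):

	pr_info("PCI SAC address trick %s\n",
		iommu_dma_forcedac ? "disabled" : "enabled");
]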

>   2 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index c79a0df090c0..bf9b295f1c89 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -144,6 +144,30 @@ config IOMMU_DMA
>   	select IRQ_MSI_IOMMU
>   	select NEED_SG_DMA_LENGTH
>   
> +config IOMMU_DMA_PCI_SAC_OPT
> +	bool "Enable 64-bit legacy PCI optimisation by default"
> +	depends on IOMMU_DMA
> +	default X86

Do we have a strategy for if and when issues start popping up on other 
architectures? Is it simply to tell people to turn this flag on (and 
also fix their platform)?

> +	help
> +	  Enable by default an IOMMU optimisation for 64-bit legacy PCI devices,
> +	  wherein the DMA API layer will always first try to allocate a 32-bit
> +	  DMA address suitable for a single address cycle, before falling back
> +	  to allocating from the full usable address range. If your system has
> +	  64-bit legacy PCI devices in 32-bit slots where using dual address
> +	  cycles reduces DMA throughput significantly, this optimisation may be
> +	  beneficial to overall performance.
> +
> +	  If you have a modern PCI Express based system, it should usually be
> +	  safe to say "n" here and avoid the potential extra allocation overhead.
> +	  However, beware that this optimisation has also historically papered
> +	  over bugs where the IOMMU address range above 32 bits is not fully
> +	  usable. If device DMA problems and/or IOMMU faults start occurring
> +	  with IOMMU translation enabled after disabling this option, it is
> +	  likely a sign of a latent driver or firmware/BIOS bug.
> +
> +	  If this option is not set, the optimisation can be enabled at
> +	  boot time with the "iommu.forcedac=0" command-line argument.
> +

Thanks,
John

Patch

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index c79a0df090c0..bf9b295f1c89 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -144,6 +144,30 @@ config IOMMU_DMA
 	select IRQ_MSI_IOMMU
 	select NEED_SG_DMA_LENGTH
 
+config IOMMU_DMA_PCI_SAC_OPT
+	bool "Enable 64-bit legacy PCI optimisation by default"
+	depends on IOMMU_DMA
+	default X86
+	help
+	  Enable by default an IOMMU optimisation for 64-bit legacy PCI devices,
+	  wherein the DMA API layer will always first try to allocate a 32-bit
+	  DMA address suitable for a single address cycle, before falling back
+	  to allocating from the full usable address range. If your system has
+	  64-bit legacy PCI devices in 32-bit slots where using dual address
+	  cycles reduces DMA throughput significantly, this optimisation may be
+	  beneficial to overall performance.
+
+	  If you have a modern PCI Express based system, it should usually be
+	  safe to say "n" here and avoid the potential extra allocation overhead.
+	  However, beware that this optimisation has also historically papered
+	  over bugs where the IOMMU address range above 32 bits is not fully
+	  usable. If device DMA problems and/or IOMMU faults start occurring
+	  with IOMMU translation enabled after disabling this option, it is
+	  likely a sign of a latent driver or firmware/BIOS bug.
+
+	  If this option is not set, the optimisation can be enabled at
+	  boot time with the "iommu.forcedac=0" command-line argument.
+
 # Shared Virtual Addressing
 config IOMMU_SVA
 	bool
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 09f6e1c0f9c0..c8d409d3f861 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -66,7 +66,7 @@ struct iommu_dma_cookie {
 };
 
 static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
-bool iommu_dma_forcedac __read_mostly;
+bool iommu_dma_forcedac __read_mostly = !IS_ENABLED(CONFIG_IOMMU_DMA_PCI_SAC_OPT);
 
 static int __init iommu_dma_forcedac_setup(char *str)
 {
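
[For context, the full parser in dma-iommu.c, which the diff context
above is truncating, looks like this (reproduced from memory, so treat
as approximate); note it only logs a message when the option is set on
the command line, which is what John's suggestion above refers to:

	static int __init iommu_dma_forcedac_setup(char *str)
	{
		int ret = kstrtobool(str, &iommu_dma_forcedac);

		if (!ret && iommu_dma_forcedac)
			pr_info("Forcing DAC for PCI devices\n");
		return ret;
	}
	early_param("iommu.forcedac", iommu_dma_forcedac_setup);
]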