diff mbox series

swiotlb: Introduce DMA_ATTR_SKIP_DEVICE_SYNC

Message ID 20250115194659.618438-1-florian.fainelli@broadcom.com (mailing list archive)
State New
Headers show
Series swiotlb: Introduce DMA_ATTR_SKIP_DEVICE_SYNC | expand

Commit Message

Florian Fainelli Jan. 15, 2025, 7:46 p.m. UTC
From: Justin Chen <justin.chen@broadcom.com>

Network device driver's receive path typically do the following:

- dma_map_single(.., DMA_FROM_DEVICE)
- dma_sync_single_for_cpu() to allow the CPU to inspect packet
  descriptors
- dma_unmap_single(.., DMA_FROM_DEVICE) when releasing the buffer

Each of those operations incurs a copy from the original buffer to the
TLB buffer, even if the device is known to be writing full buffers.

Add a DMA_ATTR_SKIP_DEVICE_SYNC flag which can be set by device drivers
to skip the copy at dma_map_single() to speed up the RX path when the
device is known to be doing full buffer writes.

This has been seen to provide a 20% speedup for Wi-Fi RX throughput
testing.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
[florian: commit message, add DMA-API attribute flag]
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
---
 Documentation/core-api/dma-attributes.rst | 9 +++++++++
 Documentation/core-api/swiotlb.rst        | 4 +++-
 include/linux/dma-mapping.h               | 6 ++++++
 include/trace/events/dma.h                | 3 ++-
 kernel/dma/swiotlb.c                      | 8 ++++++++
 5 files changed, 28 insertions(+), 2 deletions(-)

Comments

Robin Murphy Jan. 15, 2025, 8:12 p.m. UTC | #1
On 2025-01-15 7:46 pm, Florian Fainelli wrote:
> From: Justin Chen <justin.chen@broadcom.com>
> 
> Network device driver's receive path typically do the following:
> 
> - dma_map_single(.., DMA_FROM_DEVICE)
> - dma_sync_single_for_cpu() to allow the CPU to inspect packet
>    descriptors
> - dma_unmap_single(.., DMA_FROM_DEVICE) when releasing the buffer
> 
> Each of those operations incurs a copy from the original buffer to the
> TLB buffer, even if the device is known to be writing full buffers.
> 
> Add a DMA_ATTR_SKIP_DEVICE_SYNC flag which can be set by device drivers
> to skip the copy at dma_map_single() to speed up the RX path when the
> device is known to be doing full buffer writes.
> 
> This has been seen to provide a 20% speedup for Wi-Fi RX throughput
> testing.
> 
> Signed-off-by: Justin Chen <justin.chen@broadcom.com>
> [florian: commit message, add DMA-API attribute flag]
> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
> ---
>   Documentation/core-api/dma-attributes.rst | 9 +++++++++
>   Documentation/core-api/swiotlb.rst        | 4 +++-
>   include/linux/dma-mapping.h               | 6 ++++++
>   include/trace/events/dma.h                | 3 ++-
>   kernel/dma/swiotlb.c                      | 8 ++++++++
>   5 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
> index 1887d92e8e92..ccd9c1891200 100644
> --- a/Documentation/core-api/dma-attributes.rst
> +++ b/Documentation/core-api/dma-attributes.rst
> @@ -130,3 +130,12 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
>   subsystem that the buffer is fully accessible at the elevated privilege
>   level (and ideally inaccessible or at least read-only at the
>   lesser-privileged levels).
> +
> +DMA_ATTR_SKIP_DEVICE_SYNC
> +-------------------------
> +
> +Device drivers can set DMA_ATTR_SKIP_DEVICE_SYNC in order to avoid doing a copy
> +from the original buffer to the TLB buffer for dma_map_single() with a
> +DMA_FROM_DEVICE direction. This can be used to save an extra copy in a device
> +driver's data path when using swiotlb bounce buffering.
> +
> diff --git a/Documentation/core-api/swiotlb.rst b/Documentation/core-api/swiotlb.rst
> index 9e0fe027dd3b..3bc1f9ba67b2 100644
> --- a/Documentation/core-api/swiotlb.rst
> +++ b/Documentation/core-api/swiotlb.rst
> @@ -67,7 +67,9 @@ to the driver for programming into the device. If a DMA operation specifies
>   multiple memory buffer segments, a separate bounce buffer must be allocated for
>   each segment. swiotlb_tbl_map_single() always does a "sync" operation (i.e., a
>   CPU copy) to initialize the bounce buffer to match the contents of the original
> -buffer.
> +buffer, except if DMA_ATTR_SKIP_DEVICE_SYNC is specified and the direction is
> +DMA_FROM_DEVICE. This is a performance optimization that may not be suitable for
> +all platforms.
>   
>   swiotlb_tbl_unmap_single() does the reverse. If the DMA operation might have
>   updated the bounce buffer memory and DMA_ATTR_SKIP_CPU_SYNC is not set, the
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index b79925b1c433..bfdaa65f8e9d 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -58,6 +58,12 @@
>    */
>   #define DMA_ATTR_PRIVILEGED		(1UL << 9)
>   
> +/*
> + * DMA_ATTR_SKIP_DEVICE_SYNC: used to indicate that the buffer does not need to
> + * be synchronized to the device.
> + */
> +#define DMA_ATTR_SKIP_DEVICE_SYNC	(1UL << 10)
> +
>   /*
>    * A dma_addr_t can hold any valid DMA or bus address for the platform.  It can
>    * be given to a device to use as a DMA source or target.  It is specific to a
> diff --git a/include/trace/events/dma.h b/include/trace/events/dma.h
> index d8ddc27b6a7c..6eb8fd7e3515 100644
> --- a/include/trace/events/dma.h
> +++ b/include/trace/events/dma.h
> @@ -31,7 +31,8 @@ TRACE_DEFINE_ENUM(DMA_NONE);
>   		{ DMA_ATTR_FORCE_CONTIGUOUS, "FORCE_CONTIGUOUS" }, \
>   		{ DMA_ATTR_ALLOC_SINGLE_PAGES, "ALLOC_SINGLE_PAGES" }, \
>   		{ DMA_ATTR_NO_WARN, "NO_WARN" }, \
> -		{ DMA_ATTR_PRIVILEGED, "PRIVILEGED" })
> +		{ DMA_ATTR_PRIVILEGED, "PRIVILEGED" }, \
> +		{ DMA_ATTR_SKIP_DEVICE_SYNC, "SKIP_DEVICE_SYNC" })
>   
>   DECLARE_EVENT_CLASS(dma_map,
>   	TP_PROTO(struct device *dev, phys_addr_t phys_addr, dma_addr_t dma_addr,
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index abcf3fa63a56..8dab89bf5e33 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -1435,8 +1435,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>   	 * the original data, even if it's garbage, is necessary to match
>   	 * hardware behavior.  Use of swiotlb is supposed to be transparent,
>   	 * i.e. swiotlb must not corrupt memory by clobbering unwritten bytes.
> +	 *
> +	 * Setting DMA_ATTR_SKIP_DEVICE_SYNC will negate the behavior described
> +	 * before and avoid the copy from the original buffer to the TLB
> +	 * buffer.
>   	 */
> +	if (dir == DMA_FROM_DEVICE && (attrs & DMA_ATTR_SKIP_DEVICE_SYNC))
> +		goto out;

Nope, we still need to initialise the SWIOTLB slot with *something*, or 
we're just reintroducing the same data leakage issue again. The whole 
deal with that was that the caller *did* expect the entire buffer to be 
written, but the device had an error, and thus the subsequent unmap 
bounced out whatever data was in SWIOTLB before.

A memset is hopefully at least a bit faster than a copy, so maybe there 
is still some life in this idea, but the semantic is not "skip sync", 
it's "I'm OK with this entire buffer getting scribbled even my device 
ends up never touching it".

Thanks,
Robin.

> +
>   	swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE, pool);
> +out:
>   	return tlb_addr;
>   }
>
Florian Fainelli Jan. 15, 2025, 10:16 p.m. UTC | #2
On 1/15/25 12:12, Robin Murphy wrote:
> On 2025-01-15 7:46 pm, Florian Fainelli wrote:
>> From: Justin Chen <justin.chen@broadcom.com>
>>
>> Network device driver's receive path typically do the following:
>>
>> - dma_map_single(.., DMA_FROM_DEVICE)
>> - dma_sync_single_for_cpu() to allow the CPU to inspect packet
>>    descriptors
>> - dma_unmap_single(.., DMA_FROM_DEVICE) when releasing the buffer
>>
>> Each of those operations incurs a copy from the original buffer to the
>> TLB buffer, even if the device is known to be writing full buffers.
>>
>> Add a DMA_ATTR_SKIP_DEVICE_SYNC flag which can be set by device drivers
>> to skip the copy at dma_map_single() to speed up the RX path when the
>> device is known to be doing full buffer writes.
>>
>> This has been seen to provide a 20% speedup for Wi-Fi RX throughput
>> testing.
>>
>> Signed-off-by: Justin Chen <justin.chen@broadcom.com>
>> [florian: commit message, add DMA-API attribute flag]
>> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
>> ---
>>   Documentation/core-api/dma-attributes.rst | 9 +++++++++
>>   Documentation/core-api/swiotlb.rst        | 4 +++-
>>   include/linux/dma-mapping.h               | 6 ++++++
>>   include/trace/events/dma.h                | 3 ++-
>>   kernel/dma/swiotlb.c                      | 8 ++++++++
>>   5 files changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/core-api/dma-attributes.rst b/ 
>> Documentation/core-api/dma-attributes.rst
>> index 1887d92e8e92..ccd9c1891200 100644
>> --- a/Documentation/core-api/dma-attributes.rst
>> +++ b/Documentation/core-api/dma-attributes.rst
>> @@ -130,3 +130,12 @@ accesses to DMA buffers in both privileged 
>> "supervisor" and unprivileged
>>   subsystem that the buffer is fully accessible at the elevated privilege
>>   level (and ideally inaccessible or at least read-only at the
>>   lesser-privileged levels).
>> +
>> +DMA_ATTR_SKIP_DEVICE_SYNC
>> +-------------------------
>> +
>> +Device drivers can set DMA_ATTR_SKIP_DEVICE_SYNC in order to avoid 
>> doing a copy
>> +from the original buffer to the TLB buffer for dma_map_single() with a
>> +DMA_FROM_DEVICE direction. This can be used to save an extra copy in 
>> a device
>> +driver's data path when using swiotlb bounce buffering.
>> +
>> diff --git a/Documentation/core-api/swiotlb.rst b/Documentation/core- 
>> api/swiotlb.rst
>> index 9e0fe027dd3b..3bc1f9ba67b2 100644
>> --- a/Documentation/core-api/swiotlb.rst
>> +++ b/Documentation/core-api/swiotlb.rst
>> @@ -67,7 +67,9 @@ to the driver for programming into the device. If a 
>> DMA operation specifies
>>   multiple memory buffer segments, a separate bounce buffer must be 
>> allocated for
>>   each segment. swiotlb_tbl_map_single() always does a "sync" 
>> operation (i.e., a
>>   CPU copy) to initialize the bounce buffer to match the contents of 
>> the original
>> -buffer.
>> +buffer, except if DMA_ATTR_SKIP_DEVICE_SYNC is specified and the 
>> direction is
>> +DMA_FROM_DEVICE. This is a performance optimization that may not be 
>> suitable for
>> +all platforms.
>>   swiotlb_tbl_unmap_single() does the reverse. If the DMA operation 
>> might have
>>   updated the bounce buffer memory and DMA_ATTR_SKIP_CPU_SYNC is not 
>> set, the
>> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
>> index b79925b1c433..bfdaa65f8e9d 100644
>> --- a/include/linux/dma-mapping.h
>> +++ b/include/linux/dma-mapping.h
>> @@ -58,6 +58,12 @@
>>    */
>>   #define DMA_ATTR_PRIVILEGED        (1UL << 9)
>> +/*
>> + * DMA_ATTR_SKIP_DEVICE_SYNC: used to indicate that the buffer does 
>> not need to
>> + * be synchronized to the device.
>> + */
>> +#define DMA_ATTR_SKIP_DEVICE_SYNC    (1UL << 10)
>> +
>>   /*
>>    * A dma_addr_t can hold any valid DMA or bus address for the 
>> platform.  It can
>>    * be given to a device to use as a DMA source or target.  It is 
>> specific to a
>> diff --git a/include/trace/events/dma.h b/include/trace/events/dma.h
>> index d8ddc27b6a7c..6eb8fd7e3515 100644
>> --- a/include/trace/events/dma.h
>> +++ b/include/trace/events/dma.h
>> @@ -31,7 +31,8 @@ TRACE_DEFINE_ENUM(DMA_NONE);
>>           { DMA_ATTR_FORCE_CONTIGUOUS, "FORCE_CONTIGUOUS" }, \
>>           { DMA_ATTR_ALLOC_SINGLE_PAGES, "ALLOC_SINGLE_PAGES" }, \
>>           { DMA_ATTR_NO_WARN, "NO_WARN" }, \
>> -        { DMA_ATTR_PRIVILEGED, "PRIVILEGED" })
>> +        { DMA_ATTR_PRIVILEGED, "PRIVILEGED" }, \
>> +        { DMA_ATTR_SKIP_DEVICE_SYNC, "SKIP_DEVICE_SYNC" })
>>   DECLARE_EVENT_CLASS(dma_map,
>>       TP_PROTO(struct device *dev, phys_addr_t phys_addr, dma_addr_t 
>> dma_addr,
>> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
>> index abcf3fa63a56..8dab89bf5e33 100644
>> --- a/kernel/dma/swiotlb.c
>> +++ b/kernel/dma/swiotlb.c
>> @@ -1435,8 +1435,16 @@ phys_addr_t swiotlb_tbl_map_single(struct 
>> device *dev, phys_addr_t orig_addr,
>>        * the original data, even if it's garbage, is necessary to match
>>        * hardware behavior.  Use of swiotlb is supposed to be 
>> transparent,
>>        * i.e. swiotlb must not corrupt memory by clobbering unwritten 
>> bytes.
>> +     *
>> +     * Setting DMA_ATTR_SKIP_DEVICE_SYNC will negate the behavior 
>> described
>> +     * before and avoid the copy from the original buffer to the TLB
>> +     * buffer.
>>        */
>> +    if (dir == DMA_FROM_DEVICE && (attrs & DMA_ATTR_SKIP_DEVICE_SYNC))
>> +        goto out;
> 
> Nope, we still need to initialise the SWIOTLB slot with *something*, or 
> we're just reintroducing the same data leakage issue again. The whole 
> deal with that was that the caller *did* expect the entire buffer to be 
> written, but the device had an error, and thus the subsequent unmap 
> bounced out whatever data was in SWIOTLB before.

Do you know how common that assumption is within drivers? Could that 
behavior been hidden behind a flag, sort of like what is being done here?

> 
> A memset is hopefully at least a bit faster than a copy, so maybe there 
> is still some life in this idea, but the semantic is not "skip sync", 
> it's "I'm OK with this entire buffer getting scribbled even my device 
> ends up never touching it".

Sure, let me test with a memset(), any suggestion on a name that carries 
better semantics, I do agree that "skip sync" is not quite descriptive 
enough.
diff mbox series

Patch

diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
index 1887d92e8e92..ccd9c1891200 100644
--- a/Documentation/core-api/dma-attributes.rst
+++ b/Documentation/core-api/dma-attributes.rst
@@ -130,3 +130,12 @@  accesses to DMA buffers in both privileged "supervisor" and unprivileged
 subsystem that the buffer is fully accessible at the elevated privilege
 level (and ideally inaccessible or at least read-only at the
 lesser-privileged levels).
+
+DMA_ATTR_SKIP_DEVICE_SYNC
+-------------------------
+
+Device drivers can set DMA_ATTR_SKIP_DEVICE_SYNC in order to avoid doing a copy
+from the original buffer to the TLB buffer for dma_map_single() with a
+DMA_FROM_DEVICE direction. This can be used to save an extra copy in a device
+driver's data path when using swiotlb bounce buffering.
+
diff --git a/Documentation/core-api/swiotlb.rst b/Documentation/core-api/swiotlb.rst
index 9e0fe027dd3b..3bc1f9ba67b2 100644
--- a/Documentation/core-api/swiotlb.rst
+++ b/Documentation/core-api/swiotlb.rst
@@ -67,7 +67,9 @@  to the driver for programming into the device. If a DMA operation specifies
 multiple memory buffer segments, a separate bounce buffer must be allocated for
 each segment. swiotlb_tbl_map_single() always does a "sync" operation (i.e., a
 CPU copy) to initialize the bounce buffer to match the contents of the original
-buffer.
+buffer, except if DMA_ATTR_SKIP_DEVICE_SYNC is specified and the direction is
+DMA_FROM_DEVICE. This is a performance optimization that may not be suitable for
+all platforms.
 
 swiotlb_tbl_unmap_single() does the reverse. If the DMA operation might have
 updated the bounce buffer memory and DMA_ATTR_SKIP_CPU_SYNC is not set, the
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index b79925b1c433..bfdaa65f8e9d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -58,6 +58,12 @@ 
  */
 #define DMA_ATTR_PRIVILEGED		(1UL << 9)
 
+/*
+ * DMA_ATTR_SKIP_DEVICE_SYNC: used to indicate that the buffer does not need to
+ * be synchronized to the device.
+ */
+#define DMA_ATTR_SKIP_DEVICE_SYNC	(1UL << 10)
+
 /*
  * A dma_addr_t can hold any valid DMA or bus address for the platform.  It can
  * be given to a device to use as a DMA source or target.  It is specific to a
diff --git a/include/trace/events/dma.h b/include/trace/events/dma.h
index d8ddc27b6a7c..6eb8fd7e3515 100644
--- a/include/trace/events/dma.h
+++ b/include/trace/events/dma.h
@@ -31,7 +31,8 @@  TRACE_DEFINE_ENUM(DMA_NONE);
 		{ DMA_ATTR_FORCE_CONTIGUOUS, "FORCE_CONTIGUOUS" }, \
 		{ DMA_ATTR_ALLOC_SINGLE_PAGES, "ALLOC_SINGLE_PAGES" }, \
 		{ DMA_ATTR_NO_WARN, "NO_WARN" }, \
-		{ DMA_ATTR_PRIVILEGED, "PRIVILEGED" })
+		{ DMA_ATTR_PRIVILEGED, "PRIVILEGED" }, \
+		{ DMA_ATTR_SKIP_DEVICE_SYNC, "SKIP_DEVICE_SYNC" })
 
 DECLARE_EVENT_CLASS(dma_map,
 	TP_PROTO(struct device *dev, phys_addr_t phys_addr, dma_addr_t dma_addr,
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index abcf3fa63a56..8dab89bf5e33 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1435,8 +1435,16 @@  phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 	 * the original data, even if it's garbage, is necessary to match
 	 * hardware behavior.  Use of swiotlb is supposed to be transparent,
 	 * i.e. swiotlb must not corrupt memory by clobbering unwritten bytes.
+	 *
+	 * Setting DMA_ATTR_SKIP_DEVICE_SYNC will negate the behavior described
+	 * before and avoid the copy from the original buffer to the TLB
+	 * buffer.
 	 */
+	if (dir == DMA_FROM_DEVICE && (attrs & DMA_ATTR_SKIP_DEVICE_SYNC))
+		goto out;
+
 	swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE, pool);
+out:
 	return tlb_addr;
 }