[v6,2/7] dma: avoid redundant calls for sync operations

Message ID 20240507112026.1803778-3-aleksander.lobakin@intel.com (mailing list archive)
State Not Applicable
Series dma: skip calling no-op sync ops when possible

Checks

Context                 Check     Description
netdev/tree_selection   success   Guessing tree name failed - patch did not apply

Commit Message

Alexander Lobakin May 7, 2024, 11:20 a.m. UTC
Quite often, devices do not need dma_sync operations on x86_64 at least.
Indeed, when dev_is_dma_coherent(dev) is true and
dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
and friends do nothing.

However, indirectly calling them when CONFIG_RETPOLINE=y consumes about
10% of cycles on a CPU receiving packets from softirq at a ~100 Gbit rate.
Even when CONFIG_RETPOLINE is not set, there is a cost of about 3%.

Add a dev->dma_need_sync boolean and turn it off during device
initialization (dma_set_mask()) depending on the setup:
dev_is_dma_coherent() for direct DMA, the absence of the
->sync_single_for_device/->sync_single_for_cpu callbacks, or the new
dma_map_ops flag, %DMA_F_CAN_SKIP_SYNC, advertised by non-NULL DMA ops.
Then later, if/when SWIOTLB is used for the first time, the flag
is set back on, from swiotlb_tbl_map_single().

On iavf, a UDP trafficgen test with XDP_DROP in skb mode shows a
+3-5% performance increase for direct DMA.

Suggested-by: Christoph Hellwig <hch@lst.de> # direct DMA shortcut
Co-developed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/linux/device.h      |  4 +++
 include/linux/dma-map-ops.h | 12 ++++++++
 include/linux/dma-mapping.h | 53 +++++++++++++++++++++++++++++++----
 kernel/dma/mapping.c        | 55 +++++++++++++++++++++++++++++--------
 kernel/dma/swiotlb.c        |  6 ++++
 5 files changed, 113 insertions(+), 17 deletions(-)
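
For readers skimming the thread: the core of the change is a cached per-device
boolean checked in a static inline wrapper, so the (possibly retpoline-guarded)
indirect call into the dma_map_ops sync callbacks is skipped entirely for
devices that never need syncing. A condensed sketch of the pattern, taken from
the include/linux/dma-mapping.h hunk in the full patch below:

static inline bool dma_dev_need_sync(const struct device *dev)
{
	/* Always call DMA sync operations when debugging is enabled */
	return dev->dma_need_sync || IS_ENABLED(CONFIG_DMA_API_DEBUG);
}

static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
		size_t size, enum dma_data_direction dir)
{
	/*
	 * Cheap inline check; the out-of-line __dma_sync_single_for_cpu() and
	 * the indirect ops->sync_single_for_cpu() call inside it are reached
	 * only when the device actually needs syncing.
	 */
	if (dma_dev_need_sync(dev))
		__dma_sync_single_for_cpu(dev, addr, size, dir);
}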

Comments

Marek Szyprowski May 9, 2024, 11:41 a.m. UTC | #1
Dear All,

On 07.05.2024 13:20, Alexander Lobakin wrote:
> Quite often, devices do not need dma_sync operations on x86_64 at least.
> [...]


This patch landed in today's linux-next as commit f406c8e4b770 ("dma:
avoid redundant calls for sync operations"). Unfortunately, I found that
it breaks some of the 32-bit ARM boards by forcing DMA sync operations
to be skipped on non-coherent systems. This happens because this patch
hooks the dma_need_sync=true initialization into dma_set_mask(), but
dma_set_mask() is not called by all device drivers, especially by
those that operate properly with the default 32-bit DMA mask (like most
of the platform devices created by the OF layer).

Frankly speaking, I have no idea how this should be fixed. I expect that
there are lots of broken devices after this change, because I don't
recall calling dma_set_mask() being mandatory for device drivers.

After adding dma_set_mask(dev, DMA_BIT_MASK(32)) to the drivers relevant
for my boards, the issues are gone, but I'm not sure this is the right
approach...


> ...

Best regards
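
To illustrate the failure mode Marek describes, here is a minimal, hypothetical
non-coherent platform driver (the name, buffer size and layout below are made
up, not from the thread) that relies on the default 32-bit mask and therefore
never goes through dma_set_mask() / dma_setup_need_sync(), leaving
dev->dma_need_sync at its zero-initialized value of false:

#include <linux/dma-mapping.h>
#include <linux/platform_device.h>

#define FOO_BUF_SIZE	2048		/* hypothetical buffer size */

struct foo_priv {
	void		*rx_buf;
	dma_addr_t	rx_dma;
};

static int foo_probe(struct platform_device *pdev)
{
	struct foo_priv *priv;

	priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
	if (!priv)
		return -ENOMEM;

	priv->rx_buf = devm_kzalloc(&pdev->dev, FOO_BUF_SIZE, GFP_KERNEL);
	if (!priv->rx_buf)
		return -ENOMEM;

	/*
	 * No dma_set_mask*() call here: the driver is happy with the default
	 * 32-bit mask, so dma_setup_need_sync() is never run for this device
	 * and dev->dma_need_sync stays false.
	 */
	priv->rx_dma = dma_map_single(&pdev->dev, priv->rx_buf, FOO_BUF_SIZE,
				      DMA_FROM_DEVICE);
	if (dma_mapping_error(&pdev->dev, priv->rx_dma))
		return -ENOMEM;

	/*
	 * From this point on, dma_sync_single_for_cpu(&pdev->dev, ...) on the
	 * buffer is a no-op even though the platform is non-coherent.
	 */
	return 0;
}

Marek's workaround amounts to adding an explicit
dma_set_mask(&pdev->dev, DMA_BIT_MASK(32)) before the first mapping in such
drivers.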
Alexander Lobakin May 9, 2024, 11:44 a.m. UTC | #2
From: Marek Szyprowski <m.szyprowski@samsung.com>
Date: Thu, 9 May 2024 13:41:16 +0200

> Dear All,
> 
> On 07.05.2024 13:20, Alexander Lobakin wrote:
>> Quite often, devices do not need dma_sync operations on x86_64 at least.
>> [...]
> 
> 
> This patch landed in today's linux-next as commit f406c8e4b770 ("dma:
> avoid redundant calls for sync operations"). Unfortunately, I found that
> it breaks some of the 32-bit ARM boards by forcing DMA sync operations
> to be skipped on non-coherent systems. This happens because this patch
> hooks the dma_need_sync=true initialization into dma_set_mask(), but
> dma_set_mask() is not called by all device drivers, especially by
> those that operate properly with the default 32-bit DMA mask (like most
> of the platform devices created by the OF layer).
> 
> Frankly speaking, I have no idea how this should be fixed. I expect that
> there are lots of broken devices after this change, because I don't
> recall calling dma_set_mask() being mandatory for device drivers.
> 
> After adding dma_set_mask(dev, DMA_BIT_MASK(32)) to the drivers relevant
> for my boards, the issues are gone, but I'm not sure this is the right
> approach...

If I remember correctly, *all* device drivers which use DMA *must* call
dma_set_*mask() on probe. That's why we added it there and didn't care.
Alternatively, if it really breaks a lot of drivers, we can set
dma_need_sync = true by default before the driver probing. I thought of
this, but the correct approach would be to call dma_set_*mask() from the
respective drivers.

> 
> 
>> ...
> 
> Best regards

Thanks,
Olek
Alexander Lobakin May 9, 2024, 11:59 a.m. UTC | #3
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Thu, 9 May 2024 13:44:37 +0200

> From: Marek Szyprowski <m.szyprowski@samsung.com>
> Date: Thu, 9 May 2024 13:41:16 +0200
> 
>> Dear All,
>>
>> On 07.05.2024 13:20, Alexander Lobakin wrote:
>>> Quite often, devices do not need dma_sync operations on x86_64 at least.
>>> [...]
>>
>>
>> [...]
> 
> If I remember correctly, *all* device drivers which use DMA *must* call
> dma_set_*mask() on probe. That's why we added it there and didn't care.
> Alternatively, if it really breaks a lot of drivers, we can set
> dma_need_sync = true by default before the driver probing. I thought of

Or invert the flag, so that false would mean "it needs sync" and it
would be the default if dma_*mask*() wasn't called.

Chris, what do you think?

> this, but the correct approach would be to call dma_set_*mask() from the
> respective drivers.
> 
>>
>>
>>> ...
>>
>> Best regards

Thanks,
Olek
Christoph Hellwig May 9, 2024, 12:01 p.m. UTC | #4
On Thu, May 09, 2024 at 01:44:37PM +0200, Alexander Lobakin wrote:
> If I remember correctly, *all* device drivers which use DMA *must* call
> dma_set_*mask() on probe. That's why we added it there and didn't care.
> Alternatively, if it really breaks a lot of drivers, we can set
> dma_need_sync = true by default before the driver probing. I thought of
> this, but the correct approach would be to call dma_set_*mask() from the
> respective drivers.

No, we default to 32-bit DMA for a lot of busses for historical reasons,
especially platform devices.
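
For reference, the platform bus sets up that default itself when a platform
device is created, so such drivers have a perfectly valid DMA configuration
without ever touching the mask. Roughly (paraphrased from
drivers/base/platform.c from memory, so treat it as a sketch rather than an
exact quote):

static void setup_pdev_dma_masks(struct platform_device *pdev)
{
	pdev->dev.dma_parms = &pdev->dma_parms;

	/* A freshly created platform device already has 32-bit masks. */
	if (!pdev->dev.coherent_dma_mask)
		pdev->dev.coherent_dma_mask = DMA_BIT_MASK(32);
	if (!pdev->dev.dma_mask) {
		pdev->platform_dma_mask = DMA_BIT_MASK(32);
		pdev->dev.dma_mask = &pdev->platform_dma_mask;
	}
}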
Christoph Hellwig May 9, 2024, 12:02 p.m. UTC | #5
On Thu, May 09, 2024 at 01:59:41PM +0200, Alexander Lobakin wrote:
> Or invert the flag, so that false would mean "it needs sync" and it
> would be the default if dma_*mask*() wasn't called.

That is probably the safest option.
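
A sketch of what that inversion could look like (the dma_skip_sync name is
illustrative, not necessarily what the eventual fix uses): the per-device bit
means "sync may be skipped", so the all-zeroes default of a freshly allocated
struct device - i.e. a driver that never calls dma_set_mask*() - stays on the
safe "always sync" path:

static inline bool dma_dev_need_sync(const struct device *dev)
{
	/* Devices that never went through dma_set_mask*() keep syncing. */
	return !dev->dma_skip_sync || IS_ENABLED(CONFIG_DMA_API_DEBUG);
}

static inline void dma_reset_need_sync(struct device *dev)
{
	/* Called from swiotlb_tbl_map_single(): force syncing again. */
	if (unlikely(dev->dma_skip_sync))
		dev->dma_skip_sync = false;
}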
Steven Price May 9, 2024, 1:43 p.m. UTC | #6
On 07/05/2024 12:20, Alexander Lobakin wrote:
> Quite often, devices do not need dma_sync operations on x86_64 at least.
> [...]

I've bisected a boot failure (on a Firefly RK3288) to this commit.
AFAICT the problem is that I have (at least) two drivers which don't
call dma_set_mask() and therefore never initialise the new dma_need_sync
variable.

The specific drivers are "rockchip-drm" and "rk_gmac-dwmac". Is it a
requirement that all drivers engaging in DMA should call dma_set_mask()
- and therefore this has uncovered a bug in those drivers. Or is the
assumption that all drivers call dma_set_mask() faulty?

Thanks,

Steve

> [...]
Christoph Hellwig May 9, 2024, 1:49 p.m. UTC | #7
On Thu, May 09, 2024 at 02:43:52PM +0100, Steven Price wrote:
> The specific drivers are "rockchip-drm" and "rk_gmac-dwmac". Is it a
> requirement that all drivers engaging in DMA should call dma_set_mask()
> - and therefore this has uncovered a bug in those drivers. Or is the
> assumption that all drivers call dma_set_mask() faulty?

That was the assumption behind the code, but the assumption is wrong.
Alex is working on a fix.
Robin Murphy May 9, 2024, 2:33 p.m. UTC | #8
On 09/05/2024 2:43 pm, Steven Price wrote:
> On 07/05/2024 12:20, Alexander Lobakin wrote:
>> Quite often, devices do not need dma_sync operations on x86_64 at least.
>> [...]
> 
> I've bisected a boot failure (on a Firefly RK3288) to this commit.
> AFAICT the problem is that I have (at least) two drivers which don't
> call dma_set_mask() and therefore never initialise the new dma_need_sync
> variable.
> 
> The specific drivers are "rockchip-drm" and "rk_gmac-dwmac". Is it a
> requirement that all drivers engaging in DMA should call dma_set_mask()
> - and therefore this has uncovered a bug in those drivers. Or is the
> assumption that all drivers call dma_set_mask() faulty?

Historically it's long been documented (at least in DMA-API-HOWTO) that 
a 32-bit DMA mask is assumed by default, so as much as we would prefer 
to shift expectations, there are still going to be a great many drivers 
relying on that :(

Perhaps it's time for dma-debug to start warning about implicit mask 
usage; maybe that might help push the agenda a bit?

Thanks,
Robin.
Alexander Lobakin May 9, 2024, 2:43 p.m. UTC | #9
From: Robin Murphy <robin.murphy@arm.com>
Date: Thu, 9 May 2024 15:33:13 +0100

> On 09/05/2024 2:43 pm, Steven Price wrote:
>> On 07/05/2024 12:20, Alexander Lobakin wrote:
>>> Quite often, devices do not need dma_sync operations on x86_64 at least.
>>> [...]
>>
>> I've bisected a boot failure (on a Firefly RK3288) to this commit.
>> AFAICT the problem is that I have (at least) two drivers which don't
>> call dma_set_mask() and therefore never initialise the new dma_need_sync
>> variable.
>>
>> The specific drivers are "rockchip-drm" and "rk_gmac-dwmac". Is it a
>> requirement that all drivers engaging in DMA should call dma_set_mask()
>> - and therefore this has uncovered a bug in those drivers. Or is the
>> assumption that all drivers call dma_set_mask() faulty?
> 
> Historically it's long been documented (at least in DMA-API-HOWTO) that
> a 32-bit DMA mask is assumed by default, so as much as we would prefer
> to shift expectations, there are still going to be a great many drivers
> relying on that :(
> 
> Perhaps it's time for dma-debug to start warning about implicit mask
> usage; maybe that might help push the agenda a bit?

I also thought of this, but currently don't know how to detect whether a
driver has called dma_set_mask*().

The fix will arrive in several minutes.

> 
> Thanks,
> Robin.

Thanks,
Olek
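
One way such a dma-debug check could be wired up (purely a hypothetical sketch,
not something proposed in the thread; the flag and helper names are invented):
record in struct device whether dma_set_mask*() was ever called, and warn once
on the first mapping for devices still running on the implicit default:

/* Hypothetical: set from dma_set_mask() and dma_set_coherent_mask(). */
static inline void dma_mark_mask_set(struct device *dev)
{
	dev->dma_mask_explicitly_set = true;	/* invented member */
}

/* Hypothetical: called from debug_dma_map_page() and friends. */
static void debug_dma_check_implicit_mask(struct device *dev)
{
	if (dev->dma_mask_explicitly_set)
		return;

	WARN_ONCE(1, "%s %s: DMA used with the implicit default mask, consider calling dma_set_mask*() in probe\n",
		  dev_driver_string(dev), dev_name(dev));
}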

Patch

diff --git a/include/linux/device.h b/include/linux/device.h
index b9f5464f44ed..ed95b829f05b 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -691,6 +691,7 @@  struct device_physical_location {
  *		and optionall (if the coherent mask is large enough) also
  *		for dma allocations.  This flag is managed by the dma ops
  *		instance from ->dma_supported.
+ * @dma_need_sync: The device needs performing DMA sync operations.
  *
  * At the lowest level, every device in a Linux system is represented by an
  * instance of struct device. The device structure contains the information
@@ -803,6 +804,9 @@  struct device {
 #ifdef CONFIG_DMA_OPS_BYPASS
 	bool			dma_ops_bypass : 1;
 #endif
+#ifdef CONFIG_DMA_NEED_SYNC
+	bool			dma_need_sync:1;
+#endif
 };
 
 /**
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 4abc60f04209..4893cb89cb52 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -18,8 +18,11 @@  struct iommu_ops;
  *
  * DMA_F_PCI_P2PDMA_SUPPORTED: Indicates the dma_map_ops implementation can
  * handle PCI P2PDMA pages in the map_sg/unmap_sg operation.
+ * DMA_F_CAN_SKIP_SYNC: DMA sync operations can be skipped if the device is
+ * coherent and it's not an SWIOTLB buffer.
  */
 #define DMA_F_PCI_P2PDMA_SUPPORTED     (1 << 0)
+#define DMA_F_CAN_SKIP_SYNC            (1 << 1)
 
 struct dma_map_ops {
 	unsigned int flags;
@@ -273,6 +276,15 @@  static inline bool dev_is_dma_coherent(struct device *dev)
 }
 #endif /* CONFIG_ARCH_HAS_DMA_COHERENCE_H */
 
+static inline void dma_reset_need_sync(struct device *dev)
+{
+#ifdef CONFIG_DMA_NEED_SYNC
+	/* Reset it only once so that the function can be called on hotpath */
+	if (unlikely(!dev->dma_need_sync))
+		dev->dma_need_sync = true;
+#endif
+}
+
 /*
  * Check whether potential kmalloc() buffers are safe for non-coherent DMA.
  */
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index a569b56b25e2..eb4e15893b6c 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -282,16 +282,59 @@  static inline int dma_mmap_noncontiguous(struct device *dev,
 #endif /* CONFIG_HAS_DMA */
 
 #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
-void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
+void __dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
 		enum dma_data_direction dir);
-void dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
+void __dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
 		size_t size, enum dma_data_direction dir);
-void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
+void __dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
 		int nelems, enum dma_data_direction dir);
-void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
+void __dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
 		int nelems, enum dma_data_direction dir);
-bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
+bool __dma_need_sync(struct device *dev, dma_addr_t dma_addr);
+
+static inline bool dma_dev_need_sync(const struct device *dev)
+{
+	/* Always call DMA sync operations when debugging is enabled */
+	return dev->dma_need_sync || IS_ENABLED(CONFIG_DMA_API_DEBUG);
+}
+
+static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
+		size_t size, enum dma_data_direction dir)
+{
+	if (dma_dev_need_sync(dev))
+		__dma_sync_single_for_cpu(dev, addr, size, dir);
+}
+
+static inline void dma_sync_single_for_device(struct device *dev,
+		dma_addr_t addr, size_t size, enum dma_data_direction dir)
+{
+	if (dma_dev_need_sync(dev))
+		__dma_sync_single_for_device(dev, addr, size, dir);
+}
+
+static inline void dma_sync_sg_for_cpu(struct device *dev,
+		struct scatterlist *sg, int nelems, enum dma_data_direction dir)
+{
+	if (dma_dev_need_sync(dev))
+		__dma_sync_sg_for_cpu(dev, sg, nelems, dir);
+}
+
+static inline void dma_sync_sg_for_device(struct device *dev,
+		struct scatterlist *sg, int nelems, enum dma_data_direction dir)
+{
+	if (dma_dev_need_sync(dev))
+		__dma_sync_sg_for_device(dev, sg, nelems, dir);
+}
+
+static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
+{
+	return dma_dev_need_sync(dev) ? __dma_need_sync(dev, dma_addr) : false;
+}
 #else /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */
+static inline bool dma_dev_need_sync(const struct device *dev)
+{
+	return false;
+}
 static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
 		size_t size, enum dma_data_direction dir)
 {
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index c78b78e95a26..3524bc92c37f 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -330,7 +330,7 @@  void dma_unmap_resource(struct device *dev, dma_addr_t addr, size_t size,
 EXPORT_SYMBOL(dma_unmap_resource);
 
 #ifdef CONFIG_DMA_NEED_SYNC
-void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
+void __dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
 		enum dma_data_direction dir)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
@@ -342,9 +342,9 @@  void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
 		ops->sync_single_for_cpu(dev, addr, size, dir);
 	debug_dma_sync_single_for_cpu(dev, addr, size, dir);
 }
-EXPORT_SYMBOL(dma_sync_single_for_cpu);
+EXPORT_SYMBOL(__dma_sync_single_for_cpu);
 
-void dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
+void __dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
 		size_t size, enum dma_data_direction dir)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
@@ -356,9 +356,9 @@  void dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
 		ops->sync_single_for_device(dev, addr, size, dir);
 	debug_dma_sync_single_for_device(dev, addr, size, dir);
 }
-EXPORT_SYMBOL(dma_sync_single_for_device);
+EXPORT_SYMBOL(__dma_sync_single_for_device);
 
-void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
+void __dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
 		    int nelems, enum dma_data_direction dir)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
@@ -370,9 +370,9 @@  void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
 		ops->sync_sg_for_cpu(dev, sg, nelems, dir);
 	debug_dma_sync_sg_for_cpu(dev, sg, nelems, dir);
 }
-EXPORT_SYMBOL(dma_sync_sg_for_cpu);
+EXPORT_SYMBOL(__dma_sync_sg_for_cpu);
 
-void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
+void __dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
 		       int nelems, enum dma_data_direction dir)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
@@ -384,18 +384,47 @@  void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
 		ops->sync_sg_for_device(dev, sg, nelems, dir);
 	debug_dma_sync_sg_for_device(dev, sg, nelems, dir);
 }
-EXPORT_SYMBOL(dma_sync_sg_for_device);
+EXPORT_SYMBOL(__dma_sync_sg_for_device);
 
-bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
+bool __dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
 
 	if (dma_map_direct(dev, ops))
+		/*
+		 * dma_need_sync could've been reset on first SWIOTLB buffer
+		 * mapping, but @dma_addr is not necessary an SWIOTLB buffer.
+		 * In this case, fall back to more granular check.
+		 */
 		return dma_direct_need_sync(dev, dma_addr);
-	return ops->sync_single_for_cpu || ops->sync_single_for_device;
+	return true;
 }
-EXPORT_SYMBOL_GPL(dma_need_sync);
-#endif /* CONFIG_DMA_NEED_SYNC */
+EXPORT_SYMBOL_GPL(__dma_need_sync);
+
+static void dma_setup_need_sync(struct device *dev)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+
+	if (dma_map_direct(dev, ops) || (ops->flags & DMA_F_CAN_SKIP_SYNC))
+		/*
+		 * dma_need_sync will be reset to %true on first SWIOTLB buffer
+		 * mapping, if any. During the device initialization, it's
+		 * enough to check only for the DMA coherence.
+		 */
+		dev->dma_need_sync = !dev_is_dma_coherent(dev);
+	else if (!ops->sync_single_for_device && !ops->sync_single_for_cpu &&
+		 !ops->sync_sg_for_device && !ops->sync_sg_for_cpu)
+		/*
+		 * Synchronization is not possible when none of DMA sync ops
+		 * is set.
+		 */
+		dev->dma_need_sync = false;
+	else
+		dev->dma_need_sync = true;
+}
+#else /* !CONFIG_DMA_NEED_SYNC */
+static inline void dma_setup_need_sync(struct device *dev) { }
+#endif /* !CONFIG_DMA_NEED_SYNC */
 
 /*
  * The whole dma_get_sgtable() idea is fundamentally unsafe - it seems
@@ -785,6 +814,8 @@  int dma_set_mask(struct device *dev, u64 mask)
 
 	arch_dma_set_mask(dev, mask);
 	*dev->dma_mask = mask;
+	dma_setup_need_sync(dev);
+
 	return 0;
 }
 EXPORT_SYMBOL(dma_set_mask);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 046da973a7e2..ae3e593eaadb 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1408,6 +1408,12 @@  phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		return (phys_addr_t)DMA_MAPPING_ERROR;
 	}
 
+	/*
+	 * If dma_need_sync wasn't set, reset it on first SWIOTLB buffer
+	 * mapping to always sync SWIOTLB buffers.
+	 */
+	dma_reset_need_sync(dev);
+
 	/*
 	 * Save away the mapping from the original address to the DMA address.
 	 * This is needed when we sync the memory.  Then we sync the buffer if
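
For completeness, an ops-based DMA implementation opts in to the shortcut by
advertising the new flag in its dma_map_ops. A hypothetical example (the IOMMU
DMA ops are the likely candidate elsewhere in this series; the foo_* callbacks
below are placeholders, the snippet is only illustrative):

static const struct dma_map_ops foo_dma_ops = {
	/*
	 * Tell the core that the sync callbacks may be skipped when the
	 * device is coherent and no SWIOTLB bounce buffer is involved.
	 */
	.flags			= DMA_F_CAN_SKIP_SYNC,
	.map_page		= foo_map_page,
	.unmap_page		= foo_unmap_page,
	.sync_single_for_cpu	= foo_sync_single_for_cpu,
	.sync_single_for_device	= foo_sync_single_for_device,
	/* ... */
};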