
[v5,00/15] mm, dma, arm64: Reduce ARCH_KMALLOC_MINALIGN to 8

Message ID 20230524171904.3967031-1-catalin.marinas@arm.com (mailing list archive)

Message

Catalin Marinas May 24, 2023, 5:18 p.m. UTC
Hi,

Another version of the series reducing the kmalloc() minimum alignment
on arm64 to 8 (from 128). Other architectures can easily opt in by
defining ARCH_KMALLOC_MINALIGN as 8 and selecting
DMA_BOUNCE_UNALIGNED_KMALLOC.
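
For a hypothetical architecture the opt-in would look roughly like the below
(illustrative only; "myarch"/MYARCH is a placeholder):

	# arch/myarch/Kconfig
	config MYARCH
		...
		select DMA_BOUNCE_UNALIGNED_KMALLOC if SWIOTLB

	/* arch/myarch/include/asm/cache.h */
	#define ARCH_DMA_MINALIGN	128	/* worst-case non-coherent DMA alignment */
	#define ARCH_KMALLOC_MINALIGN	8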

The first 10 patches decouple ARCH_KMALLOC_MINALIGN from
ARCH_DMA_MINALIGN and, for arm64, limit the kmalloc() caches to those
aligned to the run-time probed cache_line_size(). On arm64 we gain the
kmalloc-{64,192} caches.

The subsequent patches (11 to 15) further reduce the kmalloc() caches to
kmalloc-{8,16,32,96} if the default swiotlb is present by bouncing small
buffers in the DMA API.

Changes since v4:

- Following Robin's suggestions, reworked the iommu handling so that the
  buffer size checks are done in the dev_use_swiotlb() and
  dev_use_sg_swiotlb() functions (together with dev_is_untrusted()). The
  sync operations can now check for the SG_DMA_USE_SWIOTLB flag. Since
  this flag is no longer specific to kmalloc() bouncing (covers
  dev_is_untrusted() as well), the sg_is_dma_use_swiotlb() and
  sg_dma_mark_use_swiotlb() functions are always defined if
  CONFIG_SWIOTLB.

- Dropped ARCH_WANT_KMALLOC_DMA_BOUNCE, only left the
  DMA_BOUNCE_UNALIGNED_KMALLOC option, selectable by the arch code. The
  NEED_SG_DMA_FLAGS is now selected by IOMMU_DMA if SWIOTLB.

- Rather than adding another config option, allow
  dma_get_cache_alignment() to be overridden by the arch code
  (Christoph's suggestion).

- Added a comment to the dma_kmalloc_needs_bounce() function on the
  heuristics behind the bouncing (see the sketch after this list).

- Added acked-by/reviewed-by tags (not adding Ard's tested-by yet as
  there were some changes).
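
As a rough sketch (paraphrased, not the exact patch code), the
dma_kmalloc_needs_bounce() heuristic boils down to:

	static inline bool dma_kmalloc_needs_bounce(struct device *dev, size_t size,
						    enum dma_data_direction dir)
	{
		/* coherent devices and DMA_TO_DEVICE transfers are always safe */
		if (dev_is_dma_coherent(dev) || dir == DMA_TO_DEVICE)
			return false;
		/*
		 * Larger kmalloc() buffers, or sizes that round up to a multiple
		 * of the cache line size, do not share cache lines with other
		 * objects and need no bouncing.
		 */
		if (size >= 2 * ARCH_DMA_MINALIGN ||
		    IS_ALIGNED(kmalloc_size_roundup(size), dma_get_cache_alignment()))
			return false;
		return true;
	}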

The updated patches are also available on this branch:

git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux devel/kmalloc-minalign

Thanks.

Catalin Marinas (14):
  mm/slab: Decouple ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN
  dma: Allow dma_get_cache_alignment() to be overridden by the arch code
  mm/slab: Simplify create_kmalloc_cache() args and make it static
  mm/slab: Limit kmalloc() minimum alignment to
    dma_get_cache_alignment()
  drivers/base: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/gpu: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/usb: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/spi: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/md: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  arm64: Allow kmalloc() caches aligned to the smaller cache_line_size()
  dma-mapping: Force bouncing if the kmalloc() size is not
    cache-line-aligned
  iommu/dma: Force bouncing if the size is not cacheline-aligned
  mm: slab: Reduce the kmalloc() minimum alignment if DMA bouncing
    possible
  arm64: Enable ARCH_WANT_KMALLOC_DMA_BOUNCE for arm64

Robin Murphy (1):
  scatterlist: Add dedicated config for DMA flags

 arch/arm64/Kconfig             |  1 +
 arch/arm64/include/asm/cache.h |  3 ++
 arch/arm64/mm/init.c           |  7 +++-
 drivers/base/devres.c          |  6 ++--
 drivers/gpu/drm/drm_managed.c  |  6 ++--
 drivers/iommu/Kconfig          |  1 +
 drivers/iommu/dma-iommu.c      | 50 +++++++++++++++++++++++-----
 drivers/md/dm-crypt.c          |  2 +-
 drivers/pci/Kconfig            |  1 +
 drivers/spi/spidev.c           |  2 +-
 drivers/usb/core/buffer.c      |  8 ++---
 include/linux/dma-map-ops.h    | 61 ++++++++++++++++++++++++++++++++++
 include/linux/dma-mapping.h    |  4 ++-
 include/linux/scatterlist.h    | 29 +++++++++++++---
 include/linux/slab.h           | 14 ++++++--
 kernel/dma/Kconfig             |  7 ++++
 kernel/dma/direct.h            |  3 +-
 mm/slab.c                      |  6 +---
 mm/slab.h                      |  5 ++-
 mm/slab_common.c               | 46 +++++++++++++++++++------
 20 files changed, 213 insertions(+), 49 deletions(-)

Comments

Jonathan Cameron May 25, 2023, 12:31 p.m. UTC | #1
On Wed, 24 May 2023 18:18:49 +0100
Catalin Marinas <catalin.marinas@arm.com> wrote:

> Hi,
> 
> Another version of the series reducing the kmalloc() minimum alignment
> on arm64 to 8 (from 128). Other architectures can easily opt in by
> defining ARCH_KMALLOC_MINALIGN as 8 and selecting
> DMA_BOUNCE_UNALIGNED_KMALLOC.
> 
> The first 10 patches decouple ARCH_KMALLOC_MINALIGN from
> ARCH_DMA_MINALIGN and, for arm64, limit the kmalloc() caches to those
> aligned to the run-time probed cache_line_size(). On arm64 we gain the
> kmalloc-{64,192} caches.
> 
> The subsequent patches (11 to 15) further reduce the kmalloc() caches to
> kmalloc-{8,16,32,96} if the default swiotlb is present by bouncing small
> buffers in the DMA API.

Hi Catalin,

I think IIO_DMA_MINALIGN needs to switch to ARCH_DMA_MINALIGN as well.

It's used to force static alignment of buffers within larger structures,
to make them suitable for non-coherent DMA, similar to your other cases.
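
The change would be roughly:

	/* include/linux/iio/iio.h */
	-#define IIO_DMA_MINALIGN	ARCH_KMALLOC_MINALIGN
	+#define IIO_DMA_MINALIGN	ARCH_DMA_MINALIGN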

Thanks,

Jonathan


> 
> Changes since v4:
> 
> - Following Robin's suggestions, reworked the iommu handling so that the
>   buffer size checks are done in the dev_use_swiotlb() and
>   dev_use_sg_swiotlb() functions (together with dev_is_untrusted()). The
>   sync operations can now check for the SG_DMA_USE_SWIOTLB flag. Since
>   this flag is no longer specific to kmalloc() bouncing (covers
>   dev_is_untrusted() as well), the sg_is_dma_use_swiotlb() and
>   sg_dma_mark_use_swiotlb() functions are always defined if
>   CONFIG_SWIOTLB.
> 
> - Dropped ARCH_WANT_KMALLOC_DMA_BOUNCE, only left the
>   DMA_BOUNCE_UNALIGNED_KMALLOC option, selectable by the arch code. The
>   NEED_SG_DMA_FLAGS is now selected by IOMMU_DMA if SWIOTLB.
> 
> - Rather than adding another config option, allow
>   dma_get_cache_alignment() to be overridden by the arch code
>   (Christoph's suggestion).
> 
> - Added a comment to the dma_kmalloc_needs_bounce() function on the
>   heuristics behind the bouncing.
> 
> - Added acked-by/reviewed-by tags (not adding Ard's tested-by yet as
>   there were some changes).
> 
> The updated patches are also available on this branch:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux devel/kmalloc-minalign
> 
> Thanks.
> 
> Catalin Marinas (14):
>   mm/slab: Decouple ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN
>   dma: Allow dma_get_cache_alignment() to be overridden by the arch code
>   mm/slab: Simplify create_kmalloc_cache() args and make it static
>   mm/slab: Limit kmalloc() minimum alignment to
>     dma_get_cache_alignment()
>   drivers/base: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
>   drivers/gpu: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
>   drivers/usb: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
>   drivers/spi: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
>   drivers/md: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
>   arm64: Allow kmalloc() caches aligned to the smaller cache_line_size()
>   dma-mapping: Force bouncing if the kmalloc() size is not
>     cache-line-aligned
>   iommu/dma: Force bouncing if the size is not cacheline-aligned
>   mm: slab: Reduce the kmalloc() minimum alignment if DMA bouncing
>     possible
>   arm64: Enable ARCH_WANT_KMALLOC_DMA_BOUNCE for arm64
> 
> Robin Murphy (1):
>   scatterlist: Add dedicated config for DMA flags
> 
>  arch/arm64/Kconfig             |  1 +
>  arch/arm64/include/asm/cache.h |  3 ++
>  arch/arm64/mm/init.c           |  7 +++-
>  drivers/base/devres.c          |  6 ++--
>  drivers/gpu/drm/drm_managed.c  |  6 ++--
>  drivers/iommu/Kconfig          |  1 +
>  drivers/iommu/dma-iommu.c      | 50 +++++++++++++++++++++++-----
>  drivers/md/dm-crypt.c          |  2 +-
>  drivers/pci/Kconfig            |  1 +
>  drivers/spi/spidev.c           |  2 +-
>  drivers/usb/core/buffer.c      |  8 ++---
>  include/linux/dma-map-ops.h    | 61 ++++++++++++++++++++++++++++++++++
>  include/linux/dma-mapping.h    |  4 ++-
>  include/linux/scatterlist.h    | 29 +++++++++++++---
>  include/linux/slab.h           | 14 ++++++--
>  kernel/dma/Kconfig             |  7 ++++
>  kernel/dma/direct.h            |  3 +-
>  mm/slab.c                      |  6 +---
>  mm/slab.h                      |  5 ++-
>  mm/slab_common.c               | 46 +++++++++++++++++++------
>  20 files changed, 213 insertions(+), 49 deletions(-)
> 
> 
>
Catalin Marinas May 25, 2023, 2:31 p.m. UTC | #2
On Thu, May 25, 2023 at 01:31:38PM +0100, Jonathan Cameron wrote:
> On Wed, 24 May 2023 18:18:49 +0100
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> > Another version of the series reducing the kmalloc() minimum alignment
> > on arm64 to 8 (from 128). Other architectures can easily opt in by
> > defining ARCH_KMALLOC_MINALIGN as 8 and selecting
> > DMA_BOUNCE_UNALIGNED_KMALLOC.
> > 
> > The first 10 patches decouple ARCH_KMALLOC_MINALIGN from
> > ARCH_DMA_MINALIGN and, for arm64, limit the kmalloc() caches to those
> > aligned to the run-time probed cache_line_size(). On arm64 we gain the
> > kmalloc-{64,192} caches.
> > 
> > The subsequent patches (11 to 15) further reduce the kmalloc() caches to
> > kmalloc-{8,16,32,96} if the default swiotlb is present by bouncing small
> > buffers in the DMA API.
> 
> I think IIO_DMA_MINALIGN needs to switch to ARCH_DMA_MINALIGN as well.
> 
> It's used to force static alignment of buffers within larger structures,
> to make them suitable for non-coherent DMA, similar to your other cases.

Ah, I forgot that you introduced that macro. However, at a quick grep, I
don't think this forced alignment always works as intended (irrespective
of this series). Let's take an example:

struct ltc2496_driverdata {
	/* this must be the first member */
	struct ltc2497core_driverdata common_ddata;
	struct spi_device *spi;

	/*
	 * DMA (thus cache coherency maintenance) may require the
	 * transfer buffers to live in their own cache lines.
	 */
	unsigned char rxbuf[3] __aligned(IIO_DMA_MINALIGN);
	unsigned char txbuf[3];
};

The rxbuf is aligned to IIO_DMA_MINALIGN, the structure and its size as
well but txbuf is at an offset of 3 bytes from the aligned
IIO_DMA_MINALIGN. So basically any cache maintenance on rxbuf would
corrupt txbuf. You need rxbuf to be the only resident of a
cache line, therefore the next member needs such alignment as well.
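
To make rxbuf the sole occupant of its cache line(s), the layout would need
to be something like:

	unsigned char rxbuf[3] __aligned(IIO_DMA_MINALIGN);
	unsigned char txbuf[3] __aligned(IIO_DMA_MINALIGN);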

With this series and SWIOTLB enabled, however, if you try to transfer 3
bytes, they will be bounced, so the missing alignment won't matter much.
Jonathan Cameron May 26, 2023, 4:07 p.m. UTC | #3
On Thu, 25 May 2023 15:31:34 +0100
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Thu, May 25, 2023 at 01:31:38PM +0100, Jonathan Cameron wrote:
> > On Wed, 24 May 2023 18:18:49 +0100
> > Catalin Marinas <catalin.marinas@arm.com> wrote:  
> > > Another version of the series reducing the kmalloc() minimum alignment
> > > on arm64 to 8 (from 128). Other architectures can easily opt in by
> > > defining ARCH_KMALLOC_MINALIGN as 8 and selecting
> > > DMA_BOUNCE_UNALIGNED_KMALLOC.
> > > 
> > > The first 10 patches decouple ARCH_KMALLOC_MINALIGN from
> > > ARCH_DMA_MINALIGN and, for arm64, limit the kmalloc() caches to those
> > > aligned to the run-time probed cache_line_size(). On arm64 we gain the
> > > kmalloc-{64,192} caches.
> > > 
> > > The subsequent patches (11 to 15) further reduce the kmalloc() caches to
> > > kmalloc-{8,16,32,96} if the default swiotlb is present by bouncing small
> > > buffers in the DMA API.  
> > 
> > I think IIO_DMA_MINALIGN needs to switch to ARCH_DMA_MINALIGN as well.
> > 
> > It's used to force static alignment of buffers within larger structures,
> > to make them suitable for non-coherent DMA, similar to your other cases.
> 
> Ah, I forgot that you introduced that macro. However, at a quick grep, I
> don't think this forced alignment always works as intended (irrespective
> of this series). Let's take an example:
> 
> struct ltc2496_driverdata {
> 	/* this must be the first member */
> 	struct ltc2497core_driverdata common_ddata;
> 	struct spi_device *spi;
> 
> 	/*
> 	 * DMA (thus cache coherency maintenance) may require the
> 	 * transfer buffers to live in their own cache lines.
> 	 */
> 	unsigned char rxbuf[3] __aligned(IIO_DMA_MINALIGN);
> 	unsigned char txbuf[3];
> };
> 
> The rxbuf is aligned to IIO_DMA_MINALIGN, the structure and its size as
> well but txbuf is at an offset of 3 bytes from the aligned
> IIO_DMA_MINALIGN. So basically any cache maintenance on rxbuf would
> corrupt txbuf.

That was intentional (though possibly wrong if I've misunderstood
the underlying issue).

For SPI controllers at least my understanding was that it is safe to
assume that they won't trample on themselves.  The driver doesn't
touch the buffers when DMA is in flight - to do so would indeed result
in corruption.

So whilst we could end up with the SPI master writing stale data back
to txbuf after the transfer it will never matter (as the value is unchanged).
Any flushes in the other direction will end up flushing both rxbuf and
txbuf anyway which is also harmless.

> You need rxbuf to be the only resident of a
> cache line, therefore the next member needs such alignment as well.
> 
> With this series and SWIOTLB enabled, however, if you try to transfer 3
> bytes, they will be bounced, so the missing alignment won't matter much.
> 

Only on arm64?  If the above is wrong, it might cause trouble on
some other architectures.

As a side note, spi_write_then_read() goes through a bounce buffer dance
to avoid using DMA-unsafe buffers. Superficially that looks to me like it might
now end up with an undersized buffer and hence end up bouncing, which rather
defeats the point of it.  It uses SMP_CACHE_BYTES for the size.
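
From memory, the relevant sizing in drivers/spi/spi.c is along the lines of:

	#define SPI_BUFSIZ	max(32, SMP_CACHE_BYTES)
	...
	local_buf = kmalloc(max((unsigned)SPI_BUFSIZ, n_tx + n_rx),
			    GFP_KERNEL | GFP_DMA);

so a 64-byte SMP_CACHE_BYTES allocation may now be smaller than
dma_get_cache_alignment() and get bounced after all.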

Jonathan
Jonathan Cameron May 26, 2023, 4:29 p.m. UTC | #4
On Fri, 26 May 2023 17:07:40 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Thu, 25 May 2023 15:31:34 +0100
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> > On Thu, May 25, 2023 at 01:31:38PM +0100, Jonathan Cameron wrote:  
> > > On Wed, 24 May 2023 18:18:49 +0100
> > > Catalin Marinas <catalin.marinas@arm.com> wrote:    
> > > > Another version of the series reducing the kmalloc() minimum alignment
> > > > on arm64 to 8 (from 128). Other architectures can easily opt in by
> > > > defining ARCH_KMALLOC_MINALIGN as 8 and selecting
> > > > DMA_BOUNCE_UNALIGNED_KMALLOC.
> > > > 
> > > > The first 10 patches decouple ARCH_KMALLOC_MINALIGN from
> > > > ARCH_DMA_MINALIGN and, for arm64, limit the kmalloc() caches to those
> > > > aligned to the run-time probed cache_line_size(). On arm64 we gain the
> > > > kmalloc-{64,192} caches.
> > > > 
> > > > The subsequent patches (11 to 15) further reduce the kmalloc() caches to
> > > > kmalloc-{8,16,32,96} if the default swiotlb is present by bouncing small
> > > > buffers in the DMA API.    
> > > 
> > > I think IIO_DMA_MINALIGN needs to switch to ARCH_DMA_MINALIGN as well.
> > > 
> > > It's used to force static alignment of buffers within larger structures,
> > > to make them suitable for non-coherent DMA, similar to your other cases.
> > 
> > Ah, I forgot that you introduced that macro. However, at a quick grep, I
> > don't think this forced alignment always works as intended (irrespective
> > of this series). Let's take an example:
> > 
> > struct ltc2496_driverdata {
> > 	/* this must be the first member */
> > 	struct ltc2497core_driverdata common_ddata;
> > 	struct spi_device *spi;
> > 
> > 	/*
> > 	 * DMA (thus cache coherency maintenance) may require the
> > 	 * transfer buffers to live in their own cache lines.
> > 	 */
> > 	unsigned char rxbuf[3] __aligned(IIO_DMA_MINALIGN);
> > 	unsigned char txbuf[3];
> > };
> > 
> > The rxbuf is aligned to IIO_DMA_MINALIGN, the structure and its size as
> > well but txbuf is at an offset of 3 bytes from the aligned
> > IIO_DMA_MINALIGN. So basically any cache maintenance on rxbuf would
> > corrupt txbuf.  
> 
> That was intentional (though possibly wrong if I've misunderstood
> the underlying issue).
> 
> For SPI controllers at least my understanding was that it is safe to
> assume that they won't trample on themselves.  The driver doesn't
> touch the buffers when DMA is in flight - to do so would indeed result
> in corruption.
> 
> So whilst we could end up with the SPI master writing stale data back
> to txbuf after the transfer it will never matter (as the value is unchanged).
> Any flushes in the other direction will end up flushing both rxbuf and
> txbuf anyway which is also harmless.

Adding missing detail.  As the driver never writes txbuf whilst any DMA
is going on, the second cache evict (to flush out any lines that have
crept back into cache after the flush - and write backs - pre DMA) will
find a clean line and will drop it without writing back - thus no corruption.



> 
> > You need rxbuf to be the only resident of a
> > cache line, therefore the next member needs such alignment as well.
> > 
> > With this series and SWIOTLB enabled, however, if you try to transfer 3
> > bytes, they will be bounced, so the missing alignment won't matter much.
> >   
> 
> Only on arm64?  If the above is wrong, it might cause trouble on
> some other architectures.
> 
> As a side note, spi_write_then_read() goes through a bounce buffer dance
> to avoid using DMA-unsafe buffers. Superficially that looks to me like it might
> now end up with an undersized buffer and hence end up bouncing, which rather
> defeats the point of it.  It uses SMP_CACHE_BYTES for the size.
> 
> Jonathan
> 
>
Catalin Marinas May 30, 2023, 1:38 p.m. UTC | #5
On Fri, May 26, 2023 at 05:29:30PM +0100, Jonathan Cameron wrote:
> On Fri, 26 May 2023 17:07:40 +0100
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> > On Thu, 25 May 2023 15:31:34 +0100
> > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Thu, May 25, 2023 at 01:31:38PM +0100, Jonathan Cameron wrote:  
> > > > On Wed, 24 May 2023 18:18:49 +0100
> > > > Catalin Marinas <catalin.marinas@arm.com> wrote:    
> > > > > Another version of the series reducing the kmalloc() minimum alignment
> > > > > on arm64 to 8 (from 128). Other architectures can easily opt in by
> > > > > defining ARCH_KMALLOC_MINALIGN as 8 and selecting
> > > > > DMA_BOUNCE_UNALIGNED_KMALLOC.
> > > > > 
> > > > > The first 10 patches decouple ARCH_KMALLOC_MINALIGN from
> > > > > ARCH_DMA_MINALIGN and, for arm64, limit the kmalloc() caches to those
> > > > > aligned to the run-time probed cache_line_size(). On arm64 we gain the
> > > > > kmalloc-{64,192} caches.
> > > > > 
> > > > > The subsequent patches (11 to 15) further reduce the kmalloc() caches to
> > > > > kmalloc-{8,16,32,96} if the default swiotlb is present by bouncing small
> > > > > buffers in the DMA API.    
> > > > 
> > > > I think IIO_DMA_MINALIGN needs to switch to ARCH_DMA_MINALIGN as well.
> > > > 
> > > > It's used to force static alignment of buffers within larger structures,
> > > > to make them suitable for non-coherent DMA, similar to your other cases.
> > > 
> > > Ah, I forgot that you introduced that macro. However, at a quick grep, I
> > > don't think this forced alignment always works as intended (irrespective
> > > of this series). Let's take an example:
> > > 
> > > struct ltc2496_driverdata {
> > > 	/* this must be the first member */
> > > 	struct ltc2497core_driverdata common_ddata;
> > > 	struct spi_device *spi;
> > > 
> > > 	/*
> > > 	 * DMA (thus cache coherency maintenance) may require the
> > > 	 * transfer buffers to live in their own cache lines.
> > > 	 */
> > > 	unsigned char rxbuf[3] __aligned(IIO_DMA_MINALIGN);
> > > 	unsigned char txbuf[3];
> > > };
> > > 
> > > The rxbuf is aligned to IIO_DMA_MINALIGN, the structure and its size as
> > > well but txbuf is at an offset of 3 bytes from the aligned
> > > IIO_DMA_MINALIGN. So basically any cache maintenance on rxbuf would
> > > corrupt txbuf.  
> > 
> > That was intentional (though possibly wrong if I've misunderstood
> > the underlying issue).
> > 
> > For SPI controllers at least my understanding was that it is safe to
> > assume that they won't trample on themselves.  The driver doesn't
> > touch the buffers when DMA is in flight - to do so would indeed result
> > in corruption.
> > 
> > So whilst we could end up with the SPI master writing stale data back
> > to txbuf after the transfer it will never matter (as the value is unchanged).
> > Any flushes in the other direction will end up flushing both rxbuf and
> > txbuf anyway which is also harmless.
> 
> Adding missing detail.  As the driver never writes txbuf whilst any DMA
> is going on, the second cache evict (to flush out any lines that have
> crept back into cache after the flush - and write backs - pre DMA) will
> find a clean line and will drop it without writing back - thus no corruption.

Thanks for the clarification. One more thing: can the txbuf be written
prior to the DMA_FROM_DEVICE transfer into rxbuf? Or is the txbuf write
always followed by a DMA_TO_DEVICE mapping (which would flush the
cache line)?
Jonathan Cameron May 30, 2023, 4:31 p.m. UTC | #6
On Tue, 30 May 2023 14:38:55 +0100
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Fri, May 26, 2023 at 05:29:30PM +0100, Jonathan Cameron wrote:
> > On Fri, 26 May 2023 17:07:40 +0100
> > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:  
> > > On Thu, 25 May 2023 15:31:34 +0100
> > > Catalin Marinas <catalin.marinas@arm.com> wrote:  
> > > > On Thu, May 25, 2023 at 01:31:38PM +0100, Jonathan Cameron wrote:    
> > > > > On Wed, 24 May 2023 18:18:49 +0100
> > > > > Catalin Marinas <catalin.marinas@arm.com> wrote:      
> > > > > > Another version of the series reducing the kmalloc() minimum alignment
> > > > > > on arm64 to 8 (from 128). Other architectures can easily opt in by
> > > > > > defining ARCH_KMALLOC_MINALIGN as 8 and selecting
> > > > > > DMA_BOUNCE_UNALIGNED_KMALLOC.
> > > > > > 
> > > > > > The first 10 patches decouple ARCH_KMALLOC_MINALIGN from
> > > > > > ARCH_DMA_MINALIGN and, for arm64, limit the kmalloc() caches to those
> > > > > > aligned to the run-time probed cache_line_size(). On arm64 we gain the
> > > > > > kmalloc-{64,192} caches.
> > > > > > 
> > > > > > The subsequent patches (11 to 15) further reduce the kmalloc() caches to
> > > > > > kmalloc-{8,16,32,96} if the default swiotlb is present by bouncing small
> > > > > > buffers in the DMA API.      
> > > > > 
> > > > > I think IIO_DMA_MINALIGN needs to switch to ARCH_DMA_MINALIGN as well.
> > > > > 
> > > > > It's used to force static alignment of buffers within larger structures,
> > > > > to make them suitable for non-coherent DMA, similar to your other cases.
> > > > 
> > > > Ah, I forgot that you introduced that macro. However, at a quick grep, I
> > > > don't think this forced alignment always works as intended (irrespective
> > > > of this series). Let's take an example:
> > > > 
> > > > struct ltc2496_driverdata {
> > > > 	/* this must be the first member */
> > > > 	struct ltc2497core_driverdata common_ddata;
> > > > 	struct spi_device *spi;
> > > > 
> > > > 	/*
> > > > 	 * DMA (thus cache coherency maintenance) may require the
> > > > 	 * transfer buffers to live in their own cache lines.
> > > > 	 */
> > > > 	unsigned char rxbuf[3] __aligned(IIO_DMA_MINALIGN);
> > > > 	unsigned char txbuf[3];
> > > > };
> > > > 
> > > > The rxbuf is aligned to IIO_DMA_MINALIGN, the structure and its size as
> > > > well but txbuf is at an offset of 3 bytes from the aligned
> > > > IIO_DMA_MINALIGN. So basically any cache maintenance on rxbuf would
> > > > corrupt txbuf.    
> > > 
> > > That was intentional (though possibly wrong if I've misunderstood
> > > the underlying issue).
> > > 
> > > For SPI controllers at least my understanding was that it is safe to
> > > assume that they won't trample on themselves.  The driver doesn't
> > > touch the buffers when DMA is in flight - to do so would indeed result
> > > in corruption.
> > > 
> > > So whilst we could end up with the SPI master writing stale data back
> > > to txbuf after the transfer it will never matter (as the value is unchanged).
> > > Any flushes in the other direction will end up flushing both rxbuf and
> > > txbuf anyway which is also harmless.  
> > 
> > Adding missing detail.  As the driver never writes txbuf whilst any DMA
> > is going on, the second cache evict (to flush out any lines that have
> > crept back into cache after the flush - and write backs - pre DMA) will
> > find a clean line and will drop it without writing back - thus no corruption.  
> 
> Thanks for the clarification. One more thing: can the txbuf be written
> prior to the DMA_FROM_DEVICE transfer into rxbuf? Or is the txbuf write
> always followed by a DMA_TO_DEVICE mapping (which would flush the
> cache line)?
> 

In practice, the driver should never write the txbuf unless it's about to
use it for a transfer (and a lock prevents anything racing against that -
the lock covers both tx and rx buffers), in which case the SPI controller
would have to follow it with a DMA_TO_DEVICE mapping.

Having said that, there might be examples in-tree where a really weird sequence occurs:
1. tx is written with a default value (say in probe()).
2. An rx-only transfer occurs first.
That would require something like a device that needs a read to wake it up before it'll
take any transmitted data, or one that resets on a particularly long read (yuk,
the ad7476 does that, but it has no tx buffer so that's fine).
I don't think we have that case, but it may be worth looking out for in future.

In more detail:
The two buffers are either both used for a given call to the SPI core driver
or they are used individually.
In any case where the data is used by host or device the mappings should be
fine.

TX and RX pair.
 - Take lock
 - Write tx from CPU
 - Pass both tx and rx to the spi controller, which will map the tx buffer DMA_TO_DEVICE and
   rx DMA_FROM_DEVICE - writing the lines back to memory (dcache_clean_poc() for both the tx
   and rx maps)
 - On interrupt or similar, the spi controller will unmap DMA_FROM_DEVICE on the rx buffer - caches
   drop what they think are clean lines, so the read will then be from memory
   (dcache_inval_poc() for the rx unmap, a no-op for the tx unmap - but the line is taken out
   anyway by the rx one, which is fine)
 - Read rx content from CPU.
 - Release lock

TX only
 - Take lock
 - Write tx from CPU
 - tx only passed to the spi controller ... DMA_TO_DEVICE ... line written back to memory
 - No DMA_FROM_DEVICE unmap needed as nothing is updated by the device (it only reads), so there
   is no need to flush anything; any cached lines are still correct.
 - Mapping torn down.
 - Release lock

RX only
 - Take lock
 - rx only passed to the spi controller ... DMA_FROM_DEVICE ... cleans rx (probably clean anyway)
 - Device fills rx in memory.
 - rx unmapped DMA_FROM_DEVICE to drop any clean lines from the cache, ensuring we read rx from memory.
 - Mapping torn down.
 - CPU reads rx
 - Release lock

I've probably missed some mappings in there, but the short version is that rx and tx are
treated as one resource by the driver, with all access serialized.  In some cases we will
get bonus flushes, but that shouldn't hurt anything.
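
In DMA API terms, the TX and RX pair case ends up as roughly the following
(hypothetical driver state "st"; in reality the SPI core/controller driver does
the mapping on the driver's behalf, and error handling is omitted):

	mutex_lock(&st->lock);				/* serialise access to both buffers */
	st->txbuf[0] = cmd;				/* CPU writes tx */
	tx_dma = dma_map_single(dev, st->txbuf, sizeof(st->txbuf),
				DMA_TO_DEVICE);		/* clean tx to memory */
	rx_dma = dma_map_single(dev, st->rxbuf, sizeof(st->rxbuf),
				DMA_FROM_DEVICE);	/* clean rx to memory */

	/* ... controller performs the transfer ... */

	dma_unmap_single(dev, rx_dma, sizeof(st->rxbuf),
			 DMA_FROM_DEVICE);		/* invalidate, CPU reads from memory */
	dma_unmap_single(dev, tx_dma, sizeof(st->txbuf),
			 DMA_TO_DEVICE);		/* no-op on the CPU cache side */
	val = st->rxbuf[0];				/* CPU reads rx */
	mutex_unlock(&st->lock);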

Jonathan