[v4,00/15] mm, dma, arm64: Reduce ARCH_KMALLOC_MINALIGN to 8

Message ID 20230518173403.1150549-1-catalin.marinas@arm.com (mailing list archive)

Message

Catalin Marinas May 18, 2023, 5:33 p.m. UTC
Hi,

That's the fourth version of the series reducing the kmalloc() minimum
alignment on arm64 to 8 (from 128).

The first 10 patches decouple ARCH_KMALLOC_MINALIGN from
ARCH_DMA_MINALIGN and, for arm64, limit the kmalloc() caches to those
aligned to the run-time probed cache_line_size(). The advantage on
arm64 is that we gain the kmalloc-{64,192} caches.
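
As a rough illustration of why a 64-byte cache_line_size() brings back
kmalloc-{64,192}, here is a minimal user-space sketch. It is a model only,
not the code from the series: the model_* names are hypothetical, the size
table is truncated at 256, and it assumes a size class is kept whenever its
size is a multiple of the minimum alignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative subset of the kmalloc size classes (assumption: the real
 * table continues with larger powers of two). */
static const size_t model_kmalloc_sizes[] = { 8, 16, 32, 64, 96, 128, 192, 256 };

#define MODEL_NR_SIZES \
	(sizeof(model_kmalloc_sizes) / sizeof(model_kmalloc_sizes[0]))

/* A size class is kept only if its size is a multiple of min_align, so
 * that every object in the slab starts on a min_align boundary. */
static bool model_cache_usable(size_t size, size_t min_align)
{
	/* min_align must be a non-zero power of two */
	assert(min_align != 0 && (min_align & (min_align - 1)) == 0);
	return (size & (min_align - 1)) == 0;
}

/* Smallest usable size class that can hold len bytes, or 0 if the
 * (truncated) table has none. */
static size_t model_kmalloc_bucket(size_t len, size_t min_align)
{
	for (size_t i = 0; i < MODEL_NR_SIZES; i++) {
		size_t size = model_kmalloc_sizes[i];

		if (size >= len && model_cache_usable(size, min_align))
			return size;
	}
	return 0;
}
```

In this model, a 16-byte request lands in the 128-byte class with a minimum
alignment of 128 and in the 64-byte class with 64, while a 130-byte request
moves from the 256-byte class to the 192-byte one.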

The subsequent patches (11 to 15) further reduce the kmalloc() caches to
kmalloc-{8,16,32,96} if the default swiotlb is present by bouncing small
buffers in the DMA API. For iommu, following discussions with Robin, we
concluded that it's still simpler to walk the sg list if the device is
non-coherent and follow the bouncing path when any of the elements may
originate from a small kmalloc() allocation.
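
The bouncing decision described above can be sketched as a simplified
user-space model. This is not the code from the patches: the model_* types
and functions are hypothetical, the alignment check is reduced to "the
buffer size is a multiple of the cache line", and the sg list is
represented as a plain array of lengths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the DMA direction and device state. */
enum model_dma_dir {
	MODEL_DMA_BIDIRECTIONAL,
	MODEL_DMA_TO_DEVICE,
	MODEL_DMA_FROM_DEVICE,
};

struct model_device {
	bool dma_coherent;	/* device snoops the CPU caches */
	size_t cache_line;	/* run-time cache line size, power of two */
};

/* No bouncing is needed for coherent devices or for transfers that only
 * go to the device. */
static bool model_kmalloc_safe(const struct model_device *dev,
			       enum model_dma_dir dir)
{
	return dev->dma_coherent || dir == MODEL_DMA_TO_DEVICE;
}

/* Assumption: a buffer is treated as safe when its size is a multiple of
 * the cache line, so it cannot share a line with another object. */
static bool model_size_aligned(size_t size, size_t cache_line)
{
	assert(cache_line != 0 && (cache_line & (cache_line - 1)) == 0);
	return (size & (cache_line - 1)) == 0;
}

static bool model_needs_bounce(const struct model_device *dev, size_t size,
			       enum model_dma_dir dir)
{
	return !model_kmalloc_safe(dev, dir) &&
	       !model_size_aligned(size, dev->cache_line);
}

/* Walk the list and take the bouncing path if any element may come from
 * a small allocation. */
static bool model_sg_needs_bounce(const struct model_device *dev,
				  const size_t *lens, size_t nents,
				  enum model_dma_dir dir)
{
	for (size_t i = 0; i < nents; i++)
		if (model_needs_bounce(dev, lens[i], dir))
			return true;
	return false;
}
```

Under these assumptions, a single 16-byte element in a list mapped
DMA_FROM_DEVICE for a non-coherent device is enough to send the whole list
down the bouncing path, while the same list on a coherent device is mapped
directly.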

Main changes since v3:

- Reorganise the series so that the first 10 patches could be applied
  before the DMA bouncing. They are still useful on arm64 reducing the
  kmalloc() alignment to 64.

- There is no dma_sg_kmalloc_needs_bounce() function any more; it has
  been unrolled in the iommu_dma_sync_sg_for_device() function.

- No crypto changes needed following Herbert's reworking of the crypto
  code (thanks!).

The patches are also available on this branch:

git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux devel/kmalloc-minalign

Thanks.

Catalin Marinas (14):
  mm/slab: Decouple ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN
  dma: Allow dma_get_cache_alignment() to return the smaller
    cache_line_size()
  mm/slab: Simplify create_kmalloc_cache() args and make it static
  mm/slab: Limit kmalloc() minimum alignment to
    dma_get_cache_alignment()
  drivers/base: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/gpu: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/usb: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/spi: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/md: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  arm64: Allow kmalloc() caches aligned to the smaller cache_line_size()
  dma-mapping: Force bouncing if the kmalloc() size is not
    cache-line-aligned
  iommu/dma: Force bouncing if the size is not cacheline-aligned
  mm: slab: Reduce the kmalloc() minimum alignment if DMA bouncing
    possible
  arm64: Enable ARCH_WANT_KMALLOC_DMA_BOUNCE for arm64

Robin Murphy (1):
  scatterlist: Add dedicated config for DMA flags

 arch/arm64/Kconfig             |  2 ++
 arch/arm64/include/asm/cache.h |  1 +
 arch/arm64/mm/init.c           |  7 ++++-
 drivers/base/devres.c          |  6 ++---
 drivers/gpu/drm/drm_managed.c  |  6 ++---
 drivers/iommu/dma-iommu.c      | 25 ++++++++++++++----
 drivers/md/dm-crypt.c          |  2 +-
 drivers/pci/Kconfig            |  1 +
 drivers/spi/spidev.c           |  2 +-
 drivers/usb/core/buffer.c      |  8 +++---
 include/linux/dma-map-ops.h    | 48 ++++++++++++++++++++++++++++++++++
 include/linux/dma-mapping.h    |  4 ++-
 include/linux/scatterlist.h    | 29 +++++++++++++++++---
 include/linux/slab.h           | 16 +++++++++---
 kernel/dma/Kconfig             | 19 ++++++++++++++
 kernel/dma/direct.h            |  3 ++-
 mm/slab.c                      |  6 +----
 mm/slab.h                      |  5 ++--
 mm/slab_common.c               | 43 +++++++++++++++++++++++-------
 19 files changed, 188 insertions(+), 45 deletions(-)

Comments

Linus Torvalds May 18, 2023, 5:56 p.m. UTC | #1
On Thu, May 18, 2023 at 10:34 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> That's the fourth version of the series reducing the kmalloc() minimum
> alignment on arm64 to 8 (from 128).

Lovely. On my M2 Macbook air, I right now have about 24MB in the
kmalloc-128 bucket, and most of it is presumably just 16 byte
allocations (judging by my x86-64 slabinfo).

I guess it doesn't really matter when I have 16GB in the machine, but
this has annoyed me for a while.

It feels like this is ready for 6.5, no?

                 Linus

Ard Biesheuvel May 18, 2023, 6:13 p.m. UTC | #2
On Thu, 18 May 2023 at 19:56, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, May 18, 2023 at 10:34 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> >
> > That's the fourth version of the series reducing the kmalloc() minimum
> > alignment on arm64 to 8 (from 128).

For the series:

Tested-by: Ard Biesheuvel <ardb@kernel.org> # tx2

and I am seeing lots of smaller allocations, all of which would have
otherwise taken up 128 or 256 bytes:

kmalloc-192         6426   6426    192   42    2 : tunables    0    0    0 : slabdata    153    153      0
kmalloc-128         9472   9472    128   64    2 : tunables    0    0    0 : slabdata    148    148      0
kmalloc-96         15666  15666     96   42    1 : tunables    0    0    0 : slabdata    373    373      0
kmalloc-64         21952  21952     64   64    1 : tunables    0    0    0 : slabdata    343    343      0
kmalloc-32         23424  23424     32  128    1 : tunables    0    0    0 : slabdata    183    183      0
kmalloc-16         41216  41216     16  256    1 : tunables    0    0    0 : slabdata    161    161      0
kmalloc-8          77846  80384      8  512    1 : tunables    0    0    0 : slabdata    157    157      0
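
A back-of-the-envelope estimate of the difference can be derived from the
active-object counts and object sizes in the listing above. The sketch
below is only an approximation under stated assumptions: the model_* names
are hypothetical, each object is assumed to round up to the next multiple
of the minimum alignment, and slab metadata and partially filled slabs are
ignored.

```c
#include <assert.h>
#include <stddef.h>

struct model_slab_row {
	const char *name;
	unsigned long long active_objs;	/* first numeric slabinfo column */
	unsigned long long objsize;	/* third numeric slabinfo column */
};

/* Values copied from the slabinfo listing above. */
static const struct model_slab_row model_rows[] = {
	{ "kmalloc-192",  6426, 192 },
	{ "kmalloc-128",  9472, 128 },
	{ "kmalloc-96",  15666,  96 },
	{ "kmalloc-64",  21952,  64 },
	{ "kmalloc-32",  23424,  32 },
	{ "kmalloc-16",  41216,  16 },
	{ "kmalloc-8",   77846,   8 },
};

#define MODEL_NR_ROWS (sizeof(model_rows) / sizeof(model_rows[0]))

/* Round size up to a multiple of align (align must be a power of two). */
static unsigned long long model_round_up(unsigned long long size,
					 unsigned long long align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size + align - 1) & ~(align - 1);
}

/* Bytes occupied by the active objects if every object were rounded up
 * to min_align. */
static unsigned long long model_total_bytes(unsigned long long min_align)
{
	unsigned long long total = 0;

	for (size_t i = 0; i < MODEL_NR_ROWS; i++)
		total += model_rows[i].active_objs *
			 model_round_up(model_rows[i].objsize, min_align);
	return total;
}
```

Comparing model_total_bytes(8) with model_total_bytes(128) gives the
estimated saving for these seven caches under the assumptions above.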

The box is fully DMA coherent, of course, so this doesn't really tell
us whether the swiotlb DMA bouncing stuff works or not.

>
> Lovely. On my M2 Macbook air, I right now have about 24MB in the
> kmalloc-128 bucket, and most of it is presumably just 16 byte
> allocations (judging by my x86-64 slabinfo).
>
> I guess it doesn't really matter when I have 16GB in the machine, but
> this has annoyed me for a while.
>

Yeah but surely the overhead in terms of D-cache footprint is a factor here too?

> It feels like this is ready for 6.5, no?
>

Yes, please...

Catalin Marinas May 18, 2023, 6:46 p.m. UTC | #3
On Thu, May 18, 2023 at 10:56:24AM -0700, Linus Torvalds wrote:
> On Thu, May 18, 2023 at 10:34 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> >
> > That's the fourth version of the series reducing the kmalloc() minimum
> > alignment on arm64 to 8 (from 128).
> 
> Lovely. On my M2 Macbook air, I right now have about 24MB in the
> kmalloc-128 bucket, and most of it is presumably just 16 byte
> allocations (judging by my x86-64 slabinfo).
> 
> I guess it doesn't really matter when I have 16GB in the machine, but
> this has annoyed me for a while.
> 
> It feels like this is ready for 6.5, no?

From an implementation approach perspective, I definitely target 6.5.
But I need help with testing this, especially the iommu bits (can buy
Robin some beers ;)).

Catalin Marinas May 18, 2023, 6:50 p.m. UTC | #4
On Thu, May 18, 2023 at 08:13:08PM +0200, Ard Biesheuvel wrote:
> On Thu, 18 May 2023 at 19:56, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, May 18, 2023 at 10:34 AM Catalin Marinas
> > <catalin.marinas@arm.com> wrote:
> > >
> > > That's the fourth version of the series reducing the kmalloc() minimum
> > > alignment on arm64 to 8 (from 128).
> 
> For the series:
> 
> Tested-by: Ard Biesheuvel <ardb@kernel.org> # tx2
[...]
> The box is fully DMA coherent, of course, so this doesn't really tell
> us whether the swiotlb DMA bouncing stuff works or not.

Thanks.

On TX2, I forced the bouncing with:

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 43bf50c35e14..9006bf680db0 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -296,7 +296,7 @@ static inline bool dma_kmalloc_safe(struct device *dev,
	 * is coherent or the direction is DMA_TO_DEVICE (non-desctructive
	 * cache maintenance and benign cache line evictions).
	 */
-	if (dev_is_dma_coherent(dev) || dir == DMA_TO_DEVICE)
+	if (/*dev_is_dma_coherent(dev) || */dir == DMA_TO_DEVICE)
		return true;
 
	return false;