[00/10] mm, arm64: Reduce ARCH_KMALLOC_MINALIGN below the cache line size

Message ID: 20220405135758.774016-1-catalin.marinas@arm.com

Message

Catalin Marinas April 5, 2022, 1:57 p.m. UTC
Hi,

On arm64 ARCH_DMA_MINALIGN (and therefore ARCH_KMALLOC_MINALIGN) is 128.
While the majority of arm64 SoCs have a 64-byte cache line size (or
rather CWG - cache writeback granule), we chose a less than optimal
value in order to support all SoCs in a single kernel image.

The aim of this series is to allow a smaller default ARCH_KMALLOC_MINALIGN,
with the kmalloc() caches configured at boot time so that they remain safe
when an SoC has a larger DMA alignment requirement.

The first patch decouples ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN
with the aim to only use the latter in DMA-specific compile-time
annotations. ARCH_KMALLOC_MINALIGN becomes the minimum (static)
guaranteed kmalloc() alignment but not necessarily safe for non-coherent
DMA. Patches 2-7 change some drivers/ code to use ARCH_DMA_MINALIGN
instead of ARCH_KMALLOC_MINALIGN.
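
For illustration, the changes in patches 2-7 are mostly one-line switches
of the compile-time constant; a sketch in the flavour of the
include/linux/crypto.h hunk (not the exact patch):

  -#define CRYPTO_MINALIGN ARCH_KMALLOC_MINALIGN
  +#define CRYPTO_MINALIGN ARCH_DMA_MINALIGN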

Patch 8 introduces the dynamic arch_kmalloc_minalign() and the slab code
changes to set the corresponding minimum alignment on the newly created
kmalloc() caches. Patch 10 defines arch_kmalloc_minalign() for arm64 to
return cache_line_size() and reduces ARCH_KMALLOC_MINALIGN to 64.
ARCH_DMA_MINALIGN remains 128 on arm64.
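
In outline, the hook looks like this (the arm64 override follows what is
described above; the generic fallback is a sketch of the likely shape,
not the applied hunks):

  /* generic default, overridable by the architecture */
  #ifndef arch_kmalloc_minalign
  static inline unsigned int arch_kmalloc_minalign(void)
  {
          return ARCH_KMALLOC_MINALIGN;
  }
  #endif

  /* arm64 override: return the runtime cache writeback granule */
  unsigned int arch_kmalloc_minalign(void)
  {
          return cache_line_size();
  }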

I don't have access to it but there's the Fujitsu A64FX with a CWG of
256 (the arm64 cache_line_size() returns 256). This series will bump the
smallest kmalloc cache to kmalloc-256. The platform is known to be fully
cache coherent (or so I think) and we decided long ago not to bump
ARCH_DMA_MINALIGN to 256. If problematic, we could make the dynamic
kmalloc() alignment on arm64 min(ARCH_DMA_MINALIGN, cache_line_size()).
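
A sketch of that fallback (assuming the arch_kmalloc_minalign() hook from
patch 8):

  unsigned int arch_kmalloc_minalign(void)
  {
          /* min_t() rather than min() to avoid a type mismatch */
          return min_t(unsigned int, ARCH_DMA_MINALIGN,
                       cache_line_size());
  }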

This series is beneficial to arm64 even if it's only reducing the
kmalloc() minimum alignment to 64. While it would be nice to reduce this
further to 8 (or 16) on SoCs known to be fully DMA coherent, detecting
this via arch_setup_dma_ops() is problematic, especially with late
probed devices. I'd leave it for an additional RFC series on top of
this (there are ideas like bounce buffering for non-coherent devices if
the SoC was deemed coherent).

Thanks.

Catalin Marinas (10):
  mm/slab: Decouple ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN
  drivers/base: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/gpu: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/md: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/spi: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/usb: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  crypto: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  mm/slab: Allow dynamic kmalloc() minimum alignment
  mm/slab: Simplify create_kmalloc_cache() args and make it static
  arm64: Enable dynamic kmalloc() minimum alignment

 arch/arm64/include/asm/cache.h |  1 +
 arch/arm64/kernel/cacheinfo.c  |  7 ++++++
 drivers/base/devres.c          |  4 ++--
 drivers/gpu/drm/drm_managed.c  |  4 ++--
 drivers/md/dm-crypt.c          |  2 +-
 drivers/spi/spidev.c           |  2 +-
 drivers/usb/core/buffer.c      |  8 +++----
 drivers/usb/misc/usbtest.c     |  2 +-
 include/linux/crypto.h         |  2 +-
 include/linux/slab.h           | 25 ++++++++++++++++-----
 mm/slab.c                      |  6 +----
 mm/slab.h                      |  5 ++---
 mm/slab_common.c               | 40 ++++++++++++++++++++++------------
 13 files changed, 69 insertions(+), 39 deletions(-)

Comments

Vlastimil Babka April 7, 2022, 2:40 p.m. UTC | #1
On 4/5/22 15:57, Catalin Marinas wrote:
> Hi,
> 
> On arm64 ARCH_DMA_MINALIGN (and therefore ARCH_KMALLOC_MINALIGN) is 128.
> While the majority of arm64 SoCs have a 64-byte cache line size (or
> rather CWG - cache writeback granule), we chose a less than optimal
> value in order to support all SoCs in a single kernel image.
> 
> The aim of this series is to allow a smaller default ARCH_KMALLOC_MINALIGN,
> with the kmalloc() caches configured at boot time so that they remain safe
> when an SoC has a larger DMA alignment requirement.
> 
> The first patch decouples ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN
> with the aim to only use the latter in DMA-specific compile-time
> annotations. ARCH_KMALLOC_MINALIGN becomes the minimum (static)
> guaranteed kmalloc() alignment but not necessarily safe for non-coherent
> DMA. Patches 2-7 change some drivers/ code to use ARCH_DMA_MINALIGN
> instead of ARCH_KMALLOC_MINALIGN.
> 
> Patch 8 introduces the dynamic arch_kmalloc_minalign() and the slab code
> changes to set the corresponding minimum alignment on the newly created
> kmalloc() caches. Patch 10 defines arch_kmalloc_minalign() for arm64 to
> return cache_line_size() and reduces ARCH_KMALLOC_MINALIGN to 64.
> ARCH_DMA_MINALIGN remains 128 on arm64.
> 
> I don't have access to it but there's the Fujitsu A64FX with a CWG of
> 256 (the arm64 cache_line_size() returns 256). This series will bump the
> smallest kmalloc cache to kmalloc-256. The platform is known to be fully
> cache coherent (or so I think) and we decided long ago not to bump
> ARCH_DMA_MINALIGN to 256. If problematic, we could make the dynamic
> kmalloc() alignment on arm64 min(ARCH_DMA_MINALIGN, cache_line_size()).
> 
> This series is beneficial to arm64 even if it's only reducing the
> kmalloc() minimum alignment to 64. While it would be nice to reduce this
> further to 8 (or 16) on SoCs known to be fully DMA coherent, detecting
> this via arch_setup_dma_ops() is problematic, especially with late
> probed devices. I'd leave it for an additional RFC series on top of
> this (there are ideas like bounce buffering for non-coherent devices if
> the SoC was deemed coherent).

Oh that sounds great, and perhaps should help with our SLOB problem as
detailed in this subthread [1]. To recap:

- we would like kfree() to work on allocations done by kmem_cache_alloc(),
in addition to kmalloc()
- for SLOB this would mean that kmem_cache_alloc() objects have to store
their allocation size (prepended to the allocated object), which is
currently done for kmalloc() objects only (see the sketch after this list)
- we don't have to store the size if kmem_cache_free() gives us the
kmem_cache pointer, which contains the per-cache size.
- due to ARCH_KMALLOC_MINALIGN and DMA guarantees, we should return
allocations aligned to ARCH_KMALLOC_MINALIGN, and the prepended size
header should also not share its ARCH_KMALLOC_MINALIGN block with another
(shorter) allocation that has a different lifetime, for DMA coherency
reasons
- this is very wasteful, especially with the 128-byte alignment, and it
seems we already violate it in some scenarios anyway [2]. Extending this
to all objects would be even more wasteful.
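
For reference, this is roughly what SLOB's kmalloc() path does today,
paraphrased from mm/slob.c's __do_kmalloc_node() (the wrapper name and
the slob_alloc() arguments are simplified, so treat it as a sketch):

  static void *kmalloc_slob_sketch(size_t size, gfp_t gfp, int node)
  {
          unsigned int minalign = max_t(unsigned int,
                                        ARCH_KMALLOC_MINALIGN,
                                        ARCH_SLAB_MINALIGN);
          unsigned long *m = slob_alloc(size + minalign, gfp,
                                        minalign, node);

          *m = size;                   /* the prepended size header */
          return (void *)m + minalign; /* burns a whole minalign block,
                                          128 bytes on arm64 today */
  }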

So this series would help here, especially if we can get to the 8/16 size.
But now I also wonder if keeping the name and meaning of "MINALIGN" is in
fact misleading and unnecessarily constraining us? What this is really about
is "granularity of exclusive access", no? Let's say the DMA granularity is
64 bytes, and there's a kmalloc(56). If SLOB finds a 64-byte aligned block,
uses the first 8 bytes for the size header, and returns the remaining 56
bytes, then the returned pointer is not *aligned* to 64 bytes, but it's
still aligned enough for CPU accesses (which need only e.g. 8), and
non-coherent DMA should also be safe because nobody will be accessing the
8-byte header until the user of the object calls kfree(), which should
happen only when it's done with any DMA operations. Is my reasoning correct
and would this be safe?
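
To make the layout I have in mind concrete (a hypothetical sketch, not
actual SLOB code):

  static void *slob_alloc_56_sketch(char *granule)
  {
          /* 'granule' is a 64-byte aligned, 64-byte free block */
          *(size_t *)granule = 56;        /* 8-byte size header */
          return granule + sizeof(size_t);/* 8-byte aligned only, but
                                             the sole owner of this
                                             granule */
  }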

[1] https://lore.kernel.org/all/20211122013026.909933-1-rkovhaev@gmail.com/
[2] https://lore.kernel.org/all/d0927ca6-1710-5b2b-3682-6a80eb4e48d1@suse.cz/
Catalin Marinas April 7, 2022, 5:48 p.m. UTC | #2
On Thu, Apr 07, 2022 at 04:40:15PM +0200, Vlastimil Babka wrote:
> On 4/5/22 15:57, Catalin Marinas wrote:
> > This series is beneficial to arm64 even if it's only reducing the
> > kmalloc() minimum alignment to 64. While it would be nice to reduce this
> > further to 8 (or 16) on SoCs known to be fully DMA coherent, detecting
> > this via arch_setup_dma_ops() is problematic, especially with late
> > probed devices. I'd leave it for an additional RFC series on top of
> > this (there are ideas like bounce buffering for non-coherent devices if
> > the SoC was deemed coherent).
[...]
> - due to ARCH_KMALLOC_MINALIGN and DMA guarantees, we should return
> allocations aligned to ARCH_KMALLOC_MINALIGN, and the prepended size
> header should also not share its ARCH_KMALLOC_MINALIGN block with another
> (shorter) allocation that has a different lifetime, for DMA coherency
> reasons
> - this is very wasteful, especially with the 128-byte alignment, and it
> seems we already violate it in some scenarios anyway [2]. Extending this
> to all objects would be even more wasteful.
> 
> So this series would help here, especially if we can get to the 8/16 size.

If we get to 8/16 size, it would only be for platforms that are fully
coherent. Otherwise, with non-coherent DMA, the minimum kmalloc()
alignment would still be the cache line size (typically 64) even if
ARCH_KMALLOC_MINALIGN is 8.

IIUC your point is that if ARCH_KMALLOC_MINALIGN is 8, kmalloc() could
return pointers 8-byte aligned only as long as DMA safety is preserved
(like not sharing the rest of the cache line with any other
writers).

> But now I also wonder if keeping the name and meaning of "MINALIGN" is in
> fact misleading and unnecessarily constraining us? What this is really about
> is "granularity of exclusive access", no?

Not necessarily. Yes, in lots of cases it is about granularity of access
but there are others where the code does need the pointer returned
aligned to ARCH_DMA_MINALIGN (currently via ARCH_KMALLOC_MINALIGN).
Crypto seems to have such a requirement (see the sub-thread with Herbert).
Some (all?) callers ask kmalloc() for the aligned size and there's an
expectation that if the size is a multiple of a power of two, kmalloc()
will return a pointer aligned to that power of two. I think we need to
preserve these semantics which may lead to some more wastage if you add
the header (e.g. a size of 3*64 returns a pointer either aligned to 192
or 256).
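
To spell out that expectation (my illustration, not code from the
series):

  u8 *p = kmalloc(3 * 64, GFP_KERNEL);         /* 192 bytes */
  WARN_ON(!IS_ALIGNED((unsigned long)p, 64));  /* callers rely on
                                                  64-byte alignment */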

> Let's say the DMA granularity is 64 bytes, and there's a kmalloc(56).

In your example, the size is not a power of two (or a multiple of one),
so I guess there's no expectation for a 64-byte alignment (it can be 8)
unless DMA is involved. See below.

> If SLOB finds a 64-byte aligned block, uses the first 8 bytes for
> the size header, and returns the remaining 56 bytes, then the
> returned pointer is not *aligned* to 64 bytes, but it's still
> aligned enough for CPU accesses (which need only e.g. 8), and
> non-coherent DMA should also be safe because nobody will be
> accessing the 8-byte header until the user of the object calls
> kfree(), which should happen only when it's done with any DMA
> operations. Is my reasoning correct and would this be safe?

From the DMA perspective, it's not safe currently. Let's say we have an
inbound DMA transfer: the DMA API will invalidate the cache line prior
to DMA. In arm64 terms, it means that the cache line is discarded, not
flushed to memory. If the first 8 bytes had not been written back to
RAM, they'd be lost. If we can guarantee that no CPU write happens to
the cache line during the DMA transfer, we can change the DMA mapping
operation to do a clean+invalidate (flush the cache line to RAM) first.
I guess this could be done with an IS_ENABLED(CONFIG_SLOB) check.
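
Roughly along these lines, assuming the arm64 dcache_*_poc() helpers
(only a sketch of the idea, not a tested patch):

  void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
                                enum dma_data_direction dir)
  {
          unsigned long start = (unsigned long)phys_to_virt(paddr);

          if (dir == DMA_FROM_DEVICE) {
                  if (IS_ENABLED(CONFIG_SLOB))
                          /* flush the size header to RAM, then discard */
                          dcache_clean_inval_poc(start, start + size);
                  else
                          /* discard only: dirty lines are dropped */
                          dcache_inval_poc(start, start + size);
          } else {
                  dcache_clean_poc(start, start + size);
          }
  }
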
Vlastimil Babka April 8, 2022, 2:37 p.m. UTC | #3
On 4/7/22 19:48, Catalin Marinas wrote:
> On Thu, Apr 07, 2022 at 04:40:15PM +0200, Vlastimil Babka wrote:
>> On 4/5/22 15:57, Catalin Marinas wrote:
>> > This series is beneficial to arm64 even if it's only reducing the
>> > kmalloc() minimum alignment to 64. While it would be nice to reduce this
>> > further to 8 (or 16) on SoCs known to be fully DMA coherent, detecting
>> > this via arch_setup_dma_ops() is problematic, especially with late
>> > probed devices. I'd leave it for an additional RFC series on top of
>> > this (there are ideas like bounce buffering for non-coherent devices if
>> > the SoC was deemed coherent).
> [...]
>> - due to ARCH_KMALLOC_MINALIGN and DMA guarantees, we should return
>> allocations aligned to ARCH_KMALLOC_MINALIGN, and the prepended size
>> header should also not share its ARCH_KMALLOC_MINALIGN block with another
>> (shorter) allocation that has a different lifetime, for DMA coherency
>> reasons
>> - this is very wasteful, especially with the 128-byte alignment, and it
>> seems we already violate it in some scenarios anyway [2]. Extending this
>> to all objects would be even more wasteful.
>> 
>> So this series would help here, especially if we can get to the 8/16 size.
> 
> If we get to 8/16 size, it would only be for platforms that are fully
> coherent. Otherwise, with non-coherent DMA, the minimum kmalloc()
> alignment would still be the cache line size (typically 64) even if
> ARCH_KMALLOC_MINALIGN is 8.

OK.

> IIUC your point is that if ARCH_KMALLOC_MINALIGN is 8, kmalloc() could
> return pointers 8-byte aligned only as long as DMA safety is preserved
> (like not sharing the rest of the cache line with any other
> writers).

Yeah, IIUC we now have no way to distinguish the callers of kmalloc() that
actually rely on the allocations being DMA safe, so we can only assume it's
potentially all of them. Because __GFP_DMA is something different from
coherency.

>> But now I also wonder if keeping the name and meaning of "MINALIGN" is in
>> fact misleading and unnecessarily constraining us? What this is really about
>> is "granularity of exclusive access", no?
> 
> Not necessarily. Yes, in lots of cases it is about granularity of access
> but there are others where the code does need the pointer returned
> aligned to ARCH_DMA_MINALIGN (currently via ARCH_KMALLOC_MINALIGN).
> Crypto seems to have such a requirement (see the sub-thread with Herbert).
> Some (all?) callers ask kmalloc() for the aligned size and there's an
> expectation that if the size is a multiple of a power of two, kmalloc()
> will return a pointer aligned to that power of two. I think we need to
> preserve these semantics which may lead to some more wastage if you add
> the header (e.g. a size of 3*64 returns a pointer either aligned to 192
> or 256).

Agree, I wasn't suggesting we start ignoring the (explicit or implicit
power-of-2) alignment requests. In that case the size header would still
be in the preceding block (relative to the alignment address, e.g. 64
bytes). But if DMA safety didn't require 64-byte granularity, the rest of
that block could be used by another allocation, so that would reduce the
wastage.
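
To illustrate (my sketch, with hypothetical names):

  /* 'blk': the 64-byte block preceding a 64-byte aligned object */
  static void *header_placement_sketch(char *blk)
  {
          size_t *hdr = (size_t *)(blk + 64 - sizeof(*hdr));

          *hdr = 192;        /* size header at the tail of 'blk' */
          return blk + 64;   /* returned pointer, 64-byte aligned;
                                blk[0..55] could hold another
                                allocation once DMA safety no longer
                                demands 64-byte granularity */
  }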

>> Let's say the DMA granularity is 64 bytes, and there's a kmalloc(56).
> 
> In your example, the size is not a power of two (or a multiple of one),
> so I guess there's no expectation for a 64-byte alignment (it can be 8)
> unless DMA is involved. See below.

Yes, and AFAIU we cannot rule out the DMA involvement, unfortunately.

>> If SLOB finds a 64-byte aligned block, uses the first 8 bytes for
>> the size header, and returns the remaining 56 bytes, then the
>> returned pointer is not *aligned* to 64 bytes, but it's still
>> aligned enough for CPU accesses (which need only e.g. 8), and
>> non-coherent DMA should also be safe because nobody will be
>> accessing the 8-byte header until the user of the object calls
>> kfree(), which should happen only when it's done with any DMA
>> operations. Is my reasoning correct and would this be safe?
> 
> From the DMA perspective, it's not safe currently. Let's say we have an
> inbound DMA transfer: the DMA API will invalidate the cache line prior
> to DMA. In arm64 terms, it means that the cache line is discarded, not
> flushed to memory. If the first 8 bytes had not been written back to
> RAM, they'd be lost.

Thanks for the insight, that's what I was looking for.

> If we can guarantee that no CPU write happens to
> the cache line during the DMA transfer, we can change the DMA mapping
> operation to do a clean+invalidate (flush the cache line to RAM) first.
> I guess this could be done with an IS_ENABLED(CONFIG_SLOB) check.
 
I see. Maybe we'll think of another solution for SLOB.