[v3,00/13] mm, dma, arm64: Reduce ARCH_KMALLOC_MINALIGN to 8

Message ID 20221106220143.2129263-1-catalin.marinas@arm.com (mailing list archive)

Message

Catalin Marinas Nov. 6, 2022, 10:01 p.m. UTC
Hi,

This is the third attempt at reducing the kmalloc() minimum alignment
on arm64 below the ARCH_DMA_MINALIGN of 128. The first version was not
aggressive enough, limiting ARCH_KMALLOC_MINALIGN to 64, while the
second version added an explicit __GFP_PACKED flag.

This third version reduces ARCH_KMALLOC_MINALIGN to 8 while defining
ARCH_DMA_MINALIGN for all platforms and using it instead of the former
in places where a static alignment is needed (structure or member
alignment attributes).

The first patch decouples the kmalloc() and DMA alignment, though this
only takes effect after the Kconfig entry is enabled by the last patch.
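
For illustration, a rough sketch of what the decoupling amounts to
(sketch only, not the literal patch; the Kconfig name is taken from the
last patch in the series):

  #ifdef CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC
  /* Bouncing available: kmalloc() only needs natural word alignment. */
  #define ARCH_KMALLOC_MINALIGN  __alignof__(unsigned long long) /* 8 */
  #else
  /* No bouncing: keep the DMA-safe worst case. */
  #define ARCH_KMALLOC_MINALIGN  ARCH_DMA_MINALIGN  /* 128 on arm64 */
  #endif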

Patches 2 and 3 add bouncing via the swiotlb if a buffer size is small
enough to have originated from an unaligned kmalloc() cache. I'm not
entirely sure whether my approach to iommu bouncing is correct, so I'm
open to suggestions.
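
As a rough sketch of the bounce decision (the name and exact checks are
illustrative; patches 2-3 may differ):

  /*
   * Bounce only when cache maintenance could touch data outside the
   * buffer: non-coherent device, a transfer that isn't device-read-only
   * and a size that is not a whole number of cache lines.
   */
  static inline bool dma_kmalloc_needs_bounce(struct device *dev,
                                              size_t size,
                                              enum dma_data_direction dir)
  {
          if (dev_is_dma_coherent(dev) || dir == DMA_TO_DEVICE)
                  return false;
          return !IS_ALIGNED(size, dma_get_cache_alignment());
  }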

Patch 4 is a fallback in case there is no swiotlb buffer. Together with
patch 6, we can still get a smaller kmalloc() minalign of 64 (typical
cache line size) rather than 128 on arm64. If we improve the bouncing to
use the DMA coherent pool, this run-time __kmalloc_minalign() can go
away. Patch 5 is some cleanup following the refactoring in patch 4.
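
Conceptually, the fallback in patch 4 looks something like this (a
sketch; swiotlb_buffer_available() is a hypothetical stand-in for
however the presence of a swiotlb buffer is queried):

  static unsigned int __kmalloc_minalign(void)
  {
          /* Bouncing possible: the full reduction to 8 is safe. */
          if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) &&
              swiotlb_buffer_available())
                  return ARCH_KMALLOC_MINALIGN;
          /* Otherwise: run-time cache line size, 64 on most arm64 CPUs. */
          return dma_get_cache_alignment();
  }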

Patches 7-12 change some ARCH_KMALLOC_MINALIGN uses to
ARCH_DMA_MINALIGN. The crypto changes have been rejected by Herbert
previously but I still included them here until the crypto code is
refactored.
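
The conversions in patches 7-12 all have the same one-line shape, e.g.
(illustrative, based on the include/linux/crypto.h change in the
diffstat below):

  -#define CRYPTO_MINALIGN ARCH_KMALLOC_MINALIGN
  +#define CRYPTO_MINALIGN ARCH_DMA_MINALIGN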

The last patch enables the bouncing for arm64.

Thanks.

Catalin Marinas (13):
  mm/slab: Decouple ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN
  dma-mapping: Force bouncing if the kmalloc() size is not
    cacheline-aligned
  iommu/dma: Force bouncing if the size is not cacheline-aligned
  mm/slab: Allow kmalloc() minimum alignment fallback to
    dma_get_cache_alignment()
  mm/slab: Simplify create_kmalloc_cache() args and make it static
  dma: Allow the smaller cache_line_size() returned by
    dma_get_cache_alignment()
  drivers/base: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/gpu: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/usb: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/spi: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  crypto: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  drivers/md: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
  dma: arm64: Add CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC and enable it for
    arm64

 arch/arm64/Kconfig            |  2 ++
 drivers/base/devres.c         |  6 ++---
 drivers/gpu/drm/drm_managed.c |  6 ++---
 drivers/iommu/dma-iommu.c     | 12 ++++++---
 drivers/md/dm-crypt.c         |  2 +-
 drivers/spi/spidev.c          |  2 +-
 drivers/usb/core/buffer.c     |  8 +++---
 include/linux/crypto.h        |  2 +-
 include/linux/dma-map-ops.h   | 50 +++++++++++++++++++++++++++++++++++
 include/linux/dma-mapping.h   |  4 ++-
 include/linux/scatterlist.h   | 27 ++++++++++++++++---
 include/linux/slab.h          | 14 +++++++---
 kernel/dma/Kconfig            | 14 ++++++++++
 kernel/dma/direct.h           |  3 ++-
 mm/slab.c                     |  6 +----
 mm/slab.h                     |  5 ++--
 mm/slab_common.c              | 49 +++++++++++++++++++++++++++-------
 17 files changed, 169 insertions(+), 43 deletions(-)

Comments

Isaac Manjarres March 16, 2023, 6:38 p.m. UTC | #1
On Sun, Nov 06, 2022 at 10:01:30PM +0000, Catalin Marinas wrote:
> Patches 7-12 change some ARCH_KMALLOC_MINALIGN uses to
> ARCH_DMA_MINALIGN. The crypto changes have been rejected by Herbert
> previously but I still included them here until the crypto code is
> refactored.
Hi Catalin,

Herbert merged the changes to the crypto code that were required to be
able to safely lower the minimum alignment for kmalloc in [1].

Given this, I don't think there's anything blocking this series from
being merged. The requirement for SWIOTLB to get the minimum kmalloc
alignment down to 8 bytes shouldn't prevent this series from being
merged, as the amount of memory allocated for SWIOTLB can be configured
on the kernel command line to minimize its impact. Additionally, even
if no SWIOTLB is present,
this series still offers memory savings on a lot of ARM64 platforms
by using the cache line size as the minimum alignment for kmalloc.

Can you please rebase this series so that it can be merged?

Thanks,
Isaac

[1]: https://lore.kernel.org/all/Y4nDL50nToBbi4DS@gondor.apana.org.au/
Catalin Marinas April 19, 2023, 4:06 p.m. UTC | #2
On Thu, Mar 16, 2023 at 11:38:47AM -0700, Isaac Manjarres wrote:
> On Sun, Nov 06, 2022 at 10:01:30PM +0000, Catalin Marinas wrote:
> > Patches 7-12 change some ARCH_KMALLOC_MINALIGN uses to
> > ARCH_DMA_MINALIGN. The crypto changes have been rejected by Herbert
> > previously but I still included them here until the crypto code is
> > refactored.
> 
> Herbert merged the changes to the crypto code that were required to be
> able to safely lower the minimum alignment for kmalloc in [1].

Yes, I saw this.

> Given this, I don't think there's anything blocking this series from
> being merged. The requirement for SWIOTLB to get the minimum kmalloc
> alignment down to 8 bytes shouldn't prevent this series from being
> merged, as the amount of memory allocated for SWIOTLB can be configured
> on the kernel command line to minimize its impact. Additionally, even
> if no SWIOTLB is present,
> this series still offers memory savings on a lot of ARM64 platforms
> by using the cache line size as the minimum alignment for kmalloc.

Actually, there's some progress on the swiotlb front to allow dynamic
allocation. I haven't reviewed the series yet (I wasn't aware of it
until v2) but at a quick look, it limits the dynamic allocation to
bounce buffers of at least a page size. Maybe this can later be
improved for buffers below ARCH_DMA_MINALIGN.

https://lore.kernel.org/r/cover.1681898595.git.petr.tesarik.ext@huawei.com

> Can you please rebase this series so that it can be merged?

I rebased it locally but the last stumbling block is sorting out the
iommu bouncing. I was hoping Robin Murphy can lend a hand but he's been
busy with other bits. I'll repost the series at 6.4-rc1.
Petr Tesarik April 20, 2023, 9:52 a.m. UTC | #3
On 4/19/2023 6:06 PM, Catalin Marinas wrote:
> On Thu, Mar 16, 2023 at 11:38:47AM -0700, Isaac Manjarres wrote:
> [...]
>> Given this, I don't think there's anything blocking this series from
>> being merged. The requirement for SWIOTLB to get the minimum kmalloc
>> alignment down to 8 bytes shouldn't prevent this series from being
>> merged, as the amount of memory allocated for SWIOTLB can be
>> configured on the kernel command line to minimize its impact.
>> Additionally, even if no SWIOTLB is present,
>> this series still offers memory savings on a lot of ARM64 platforms
>> by using the cache line size as the minimum alignment for kmalloc.
> 
> Actually, there's some progress on the swiotlb front to allow dynamic
> allocation. I haven't reviewed the series yet (I wasn't aware of it
> until v2) but at a quick look, it limits the dynamic allocation to
> bounce buffers of at least a page size. Maybe this can later be
> improved for buffers below ARCH_DMA_MINALIGN.

Indeed. My patch allocates dynamic bounce buffers with
dma_direct_alloc_pages() to keep things simple for now, but there is no
real reason against allocating less than a page with another suitable
allocator.
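
(For reference, a sketch of the allocation shape described above;
dev, size, dma_handle, dir and gfp are assumed from the mapping
context:)

  struct page *page = dma_direct_alloc_pages(dev, PAGE_ALIGN(size),
                                             &dma_handle, dir, gfp);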

However, I'd be interested in what the use case is, so I can assess
the performance impact, which depends on the workload, and FYI it may
not even be negative. ;-)

Petr Tesarik
Catalin Marinas April 20, 2023, 5:43 p.m. UTC | #4
On Thu, Apr 20, 2023 at 11:52:00AM +0200, Petr Tesarik wrote:
> On 4/19/2023 6:06 PM, Catalin Marinas wrote:
> > On Thu, Mar 16, 2023 at 11:38:47AM -0700, Isaac Manjarres wrote:
> > [...]
> >> Given this, I don't think there's anything blocking this series from
> >> being merged. The requirement for SWIOTLB to get the minimum kmalloc
> >> alignment down to 8 bytes shouldn't prevent this series from being
> >> merged, as the amount of memory allocated for SWIOTLB can be
> >> configured on the kernel command line to minimize its impact.
> >> Additionally, even if no SWIOTLB is present,
> >> this series still offers memory savings on a lot of ARM64 platforms
> >> by using the cache line size as the minimum alignment for kmalloc.
> > 
> > Actually, there's some progress on the swiotlb front to allow dynamic
> > allocation. I haven't reviewed the series yet (I wasn't aware of it
> > until v2) but at a quick look, it limits the dynamic allocation to
> > bounce buffers of at least a page size. Maybe this can later be
> > improved for buffers below ARCH_DMA_MINALIGN.
> 
> Indeed. My patch allocates dynamic bounce buffers with
> dma_direct_alloc_pages() to keep things simple for now, but there is no
> real reason against allocating less than a page with another suitable
> allocator.

I guess it could fall back to a suitably aligned kmalloc() for smaller
sizes.
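
Something like this, purely hypothetically (relying on the series'
premise that a kmalloc() size padded to a whole number of cache lines
cannot share a line with unrelated data):

  size_t padded = ALIGN(size, dma_get_cache_alignment());
  void *buf = kmalloc(padded, gfp);  /* gfp as per mapping context */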

> However, I'd be interested in what the use case is, so I can assess
> the performance impact, which depends on the workload, and FYI it may
> not even be negative. ;-)

On arm64 we have an ARCH_DMA_MINALIGN of 128 bytes as that's the largest
cache line size that you can find on a non-coherent platform. The
implication is that ARCH_KMALLOC_MINALIGN is also 128, so the smaller
kmalloc-{8,16,32,64,96,192} caches cannot be created, leading to some
memory wastage.

This series decouples the two static alignments so that we can have an
ARCH_KMALLOC_MINALIGN of 8 while keeping ARCH_DMA_MINALIGN as 128. The
problem is that there are some drivers that do a small kmalloc() (below
the cache line size; typically USB drivers) and expect DMA to such a
buffer to work. If the cache line is shared with some unrelated data,
either the cache maintenance in the DMA API corrupts that data or
dirtying the cache line overwrites inbound DMA data.
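
To illustrate the hazard with made-up numbers:

  u8 *buf   = kmalloc(32, GFP_KERNEL);  /* DMA target */
  u8 *other = kmalloc(32, GFP_KERNEL);  /* may share buf's cache line */

  /*
   * With a 128-byte cache line, invalidation for the inbound DMA can
   * corrupt 'other', and a writeback of the dirty line can overwrite
   * the data the device just placed in 'buf'.
   */
  dma_map_single(dev, buf, 32, DMA_FROM_DEVICE);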

So, the solution is to bounce such small buffers if they end up in
functions like dma_map_single(). All we need is for the bounce buffer to
be aligned to the cache line size and honour the coherent mask (normally
ok with one of the GFP_DMA/DMA32 flags if required).

The swiotlb buffer would solve this but there are some (mobile)
platforms where the vendor disables the bounce buffer to save memory.
Having a way to dynamically allocate it in those rare cases above would
be helpful.
Isaac Manjarres May 15, 2023, 7:09 p.m. UTC | #5
On Wed, Apr 19, 2023 at 05:06:04PM +0100, Catalin Marinas wrote:
> I rebased it locally but the last stumbling block is sorting out the
> iommu bouncing. I was hoping Robin Murphy can lend a hand but he's been
> busy with other bits. I'll repost the series at 6.4-rc1.
Hey Catalin, just following up on this. I think it might be worthwhile
to split this series into two series:

Series 1: Decouple ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN,
and use the cacheline size to determine the minimum kmalloc
alignment.

Series 2: Lower the minimum kmalloc alignment to 8 bytes by adding
support for using SWIOTLB to bounce unaligned kmalloc buffers for DMA
transactions.

Dividing the patches as such has the advantage of lowering the minimum
kmalloc alignment to 64 bytes on many ARM64 systems while the work for
lowering the minimum alignment to 8 bytes proceeds. This provides a
noticeable decrease in the slab memory footprint (e.g. I observed a 15
MB decrease in slab usage on a device I was using).

What are your thoughts on this?

--Isaac
Catalin Marinas May 16, 2023, 5:19 p.m. UTC | #6
On Mon, May 15, 2023 at 12:09:12PM -0700, Isaac Manjarres wrote:
> On Wed, Apr 19, 2023 at 05:06:04PM +0100, Catalin Marinas wrote:
> > I rebased it locally but the last stumbling block is sorting out the
> > iommu bouncing. I was hoping Robin Murphy can lend a hand but he's been
> > busy with other bits. I'll repost the series at 6.4-rc1.
> Hey Catalin, just following up on this. I think it might be worthwhile
> to split this series into two series:
> 
> Series 1: Decouple ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN,
> and use the cacheline size to determine the minimum kmalloc
> alignment.
> 
> Series 2: Lower the minimum kmalloc alignment to 8 bytes by adding
> support for using SWIOTLB to bounce unaligned kmalloc buffers for DMA
> transactions.
> 
> Dividing the patches as such has the advantage of lowering the minimum
> kmalloc alignment to 64 bytes on many ARM64 systems while the work for
> lowering the minimum alignment to 8 bytes proceeds. This provides a
> noticeable decrease in the slab memory footprint (e.g. I observed a 15
> MB decrease in slab usage on a device I was using).

I attempted "series 1" some time ago and the discussion led to the
combined approach (i.e. don't bother with limiting kmalloc minimum
alignment to cache_line_size() but instead bounce those small buffers).
In my series, I still have this fallback in case there's no swiotlb
buffer.

I'll post a new series this week (including DMA bouncing) but I'll try
to move the bouncing towards the end of the series in case there are
more discussions around this, so that at least the first part can be
picked up.
Isaac Manjarres May 16, 2023, 6:19 p.m. UTC | #7
On Tue, May 16, 2023 at 06:19:43PM +0100, Catalin Marinas wrote:
> On Mon, May 15, 2023 at 12:09:12PM -0700, Isaac Manjarres wrote:
> > Hey Catalin, just following up on this. I think it might be worthwhile
> > to split this series into two series:
> > 
> > Series 1: Decouple ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN,
> > and use the cacheline size to determine the minimum kmalloc
> > alignment.
> > 
> > Series 2: Lower the minimum kmalloc alignment to 8 bytes by adding
> > support for using SWIOTLB to bounce unaligned kmalloc buffers for DMA
> > transactions.
> 
> I attempted "series 1" some time ago and the discussion led to the
> combined approach (i.e. don't bother with limiting kmalloc minimum
> alignment to cache_line_size() but instead bounce those small buffers).
> In my series, I still have this fallback in case there's no swiotlb
> buffer.

> I'll post a new series this week (including DMA bouncing) but I'll try
> to move the bouncing towards the end of the series in case there are
> more discussions around this, so that at least the first part can be
> picked up.

Thanks Catalin! I think restructuring the series as you're suggesting
makes sense. At least being able to pick up the first part of the series
would be great, since it will have a positive impact on the memory
footprint for a lot of devices. This also helps alleviate some of the
memory overhead for devices that move from a 32-bit ARM kernel to a
64-bit ARM kernel.

The second part can continue to be refined until the SWIOTLB and IOMMU bounce
buffering refactor is complete.

-Isaac