
[v2,0/3] iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables

Message ID 20181111090341.120786-1-drinkcat@chromium.org (mailing list archive)

Message

Nicolas Boichat Nov. 11, 2018, 9:03 a.m. UTC
This is a follow-up to the discussion in [1], to make sure that the page
tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit
physical address space.

[1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html

Fixes since v1:
 - Add support for SLAB_CACHE_DMA32 in slab and slub (patches 1/2)
 - iommu/io-pgtable-arm-v7s (patch 3):
   - Changed approach to use SLAB_CACHE_DMA32 added by the previous
     commit.
   - Use DMA or DMA32 depending on the architecture (DMA for arm,
     DMA32 for arm64).

Nicolas Boichat (3):
  mm: slab/slub: Add check_slab_flags function to check for valid flags
  mm: Add support for SLAB_CACHE_DMA32
  iommu/io-pgtable-arm-v7s: Request DMA32 memory, and improve debugging

 drivers/iommu/io-pgtable-arm-v7s.c | 20 ++++++++++++++++----
 include/linux/slab.h               |  2 ++
 mm/internal.h                      | 21 +++++++++++++++++++--
 mm/slab.c                          | 10 +++-------
 mm/slab.h                          |  3 ++-
 mm/slab_common.c                   |  2 +-
 mm/slub.c                          | 24 +++++++++++++++++-------
 7 files changed, 60 insertions(+), 22 deletions(-)

Comments

Christoph Lameter (Ampere) Nov. 21, 2018, 6:20 p.m. UTC | #1
On Sun, 11 Nov 2018, Nicolas Boichat wrote:

> This is a follow-up to the discussion in [1], to make sure that the page
> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit
> physical address space.

Page tables? This means you need a page frame? Why go through the slab
allocators?
Matthew Wilcox (Oracle) Nov. 21, 2018, 9:38 p.m. UTC | #2
On Wed, Nov 21, 2018 at 06:20:02PM +0000, Christopher Lameter wrote:
> On Sun, 11 Nov 2018, Nicolas Boichat wrote:
> 
> > This is a follow-up to the discussion in [1], to make sure that the page
> > tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit
> > physical address space.
> 
> Page tables? This means you need a page frame? Why go through the slab
> allocators?

Because this particular architecture has sub-page-size PMD page tables.
We desperately need to hoist page table allocation out of the architectures;
there're a bunch of different implementations and they're mostly bad,
one way or another.

For each level of page table we generally have three cases:

1. single page
2. sub-page, naturally aligned
3. multiple pages, naturally aligned

for 1 and 3, the page allocator will do just fine.
for 2, we should have a per-MM page_frag allocator.  s390 already has
something like this, although it's more complicated.  ppc also has
something a little more complex for the cases when it's configured with
a 64k page size but wants to use a 4k page table entry.

I'd like x86 to be able to simply do:

#define pte_alloc_one(mm, addr)	page_alloc_table(mm, addr, 0)
#define pmd_alloc_one(mm, addr)	page_alloc_table(mm, addr, 0)
#define pud_alloc_one(mm, addr)	page_alloc_table(mm, addr, 0)
#define p4d_alloc_one(mm, addr)	page_alloc_table(mm, addr, 0)

An architecture with 4k page size and needing a 16k PMD would do:

#define pmd_alloc_one(mm, addr) page_alloc_table(mm, addr, 2)

while an architecture with a 64k page size needing a 4k PTE would do:

#define ARCH_PAGE_TABLE_FRAG
#define pte_alloc_one(mm, addr) pagefrag_alloc_table(mm, addr, 4096)

I haven't had time to work on this, but perhaps someone with a problem
that needs fixing would like to, instead of burying yet another awful
implementation away in arch/ somewhere.
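
For the avoidance of doubt, page_alloc_table() above is hypothetical --
nothing by that name exists in the tree today.  A bare-bones, untested
sketch of what it might amount to, ignoring the pgtable ctor/accounting
hooks a real version would need:

static struct page *page_alloc_table(struct mm_struct *mm,
				     unsigned long addr, unsigned int order)
{
	/* mm and addr are unused in this sketch; kept to match the macros above */
	return alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, order);
}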
Robin Murphy Nov. 21, 2018, 10:26 p.m. UTC | #3
On 2018-11-21 9:38 pm, Matthew Wilcox wrote:
> On Wed, Nov 21, 2018 at 06:20:02PM +0000, Christopher Lameter wrote:
>> On Sun, 11 Nov 2018, Nicolas Boichat wrote:
>>
>>> This is a follow-up to the discussion in [1], to make sure that the page
>>> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit
>>> physical address space.
>>
>> Page tables? This means you need a page frame? Why go through the slab
>> allocators?
> 
> Because this particular architecture has sub-page-size PMD page tables.
> We desperately need to hoist page table allocation out of the architectures;
> there're a bunch of different implementations and they're mostly bad,
> one way or another.

These are IOMMU page tables, rather than CPU ones, so we're already well 
outside arch code - indeed the original motivation of io-pgtable was to 
be entirely independent of the p*d types and arch-specific MM code (this 
Armv7 short-descriptor format is already "non-native" when used by 
drivers in an arm64 kernel).

There are various efficiency reasons for using regular kernel memory 
instead of coherent DMA allocations - for the most part it works well, 
we just have the odd corner case like this one where the 32-bit format 
gets used on 64-bit systems such that the tables themselves still need 
to be allocated below 4GB (although the final output address can point 
at higher memory by virtue of the IOMMU in question not implementing 
permissions and repurposing some of those PTE fields as extra address bits).

TBH, if this DMA32 stuff is going to be contentious we could possibly 
just rip out the offending kmem_cache - it seemed like good practice for 
the use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied 
upon to give the same 1KB alignment and chance of succeeding as the 
equivalent kmem_cache_alloc(), then we could quite easily make do with 
that instead.
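
Concretely, that would just mean the lvl == 2 path in
__arm_v7s_alloc_table() doing something like the untested sketch below
(ARM_V7S_TABLE_SIZE(2) being the same 1KB) instead of going through the
kmem_cache:

	table = kzalloc(ARM_V7S_TABLE_SIZE(lvl), gfp | GFP_DMA32);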

Thanks,
Robin.

> For each level of page table we generally have three cases:
> 
> 1. single page
> 2. sub-page, naturally aligned
> 3. multiple pages, naturally aligned
> 
> for 1 and 3, the page allocator will do just fine.
> for 2, we should have a per-MM page_frag allocator.  s390 already has
> something like this, although it's more complicated.  ppc also has
> something a little more complex for the cases when it's configured with
> a 64k page size but wants to use a 4k page table entry.
> 
> I'd like x86 to be able to simply do:
> 
> #define pte_alloc_one(mm, addr)	page_alloc_table(mm, addr, 0)
> #define pmd_alloc_one(mm, addr)	page_alloc_table(mm, addr, 0)
> #define pud_alloc_one(mm, addr)	page_alloc_table(mm, addr, 0)
> #define p4d_alloc_one(mm, addr)	page_alloc_table(mm, addr, 0)
> 
> An architecture with 4k page size and needing a 16k PMD would do:
> 
> #define pmd_alloc_one(mm, addr) page_alloc_table(mm, addr, 2)
> 
> while an architecture with a 64k page size needing a 4k PTE would do:
> 
> #define ARCH_PAGE_TABLE_FRAG
> #define pte_alloc_one(mm, addr) pagefrag_alloc_table(mm, addr, 4096)
> 
> I haven't had time to work on this, but perhaps someone with a problem
> that needs fixing would like to, instead of burying yet another awful
> implementation away in arch/ somewhere.
>
Nicolas Boichat Nov. 22, 2018, 1:05 a.m. UTC | #4
On Thu, Nov 22, 2018 at 6:27 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2018-11-21 9:38 pm, Matthew Wilcox wrote:
> > On Wed, Nov 21, 2018 at 06:20:02PM +0000, Christopher Lameter wrote:
> >> On Sun, 11 Nov 2018, Nicolas Boichat wrote:
> >>
> >>> This is a follow-up to the discussion in [1], to make sure that the page
> >>> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit
> >>> physical address space.
> >>
> >> Page tables? This means you need a page frame? Why go through the slab
> >> allocators?
> >
> > Because this particular architecture has sub-page-size PMD page tables.
> > We desperately need to hoist page table allocation out of the architectures;
> > there're a bunch of different implementations and they're mostly bad,
> > one way or another.
>
> These are IOMMU page tables, rather than CPU ones, so we're already well
> outside arch code - indeed the original motivation of io-pgtable was to
> be entirely independent of the p*d types and arch-specific MM code (this
> Armv7 short-descriptor format is already "non-native" when used by
> drivers in an arm64 kernel).
>
> There are various efficiency reasons for using regular kernel memory
> instead of coherent DMA allocations - for the most part it works well,
> we just have the odd corner case like this one where the 32-bit format
> gets used on 64-bit systems such that the tables themselves still need
> to be allocated below 4GB (although the final output address can point
> at higher memory by virtue of the IOMMU in question not implementing
> permissions and repurposing some of those PTE fields as extra address bits).
>
> TBH, if this DMA32 stuff is going to be contentious we could possibly
> just rip out the offending kmem_cache - it seemed like good practice for
> the use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied
> upon to give the same 1KB alignment and chance of succeeding as the
> equivalent kmem_cache_alloc(), then we could quite easily make do with
> that instead.

Yes, but if we want to use kzalloc, we'll need to create
kmalloc_caches for DMA32, which seems wasteful as there are no other
users (see my comment here:
https://patchwork.kernel.org/patch/10677525/#22332697).

Thanks,

> Thanks,
> Robin.
>
> > For each level of page table we generally have three cases:
> >
> > 1. single page
> > 2. sub-page, naturally aligned
> > 3. multiple pages, naturally aligned
> >
> > for 1 and 3, the page allocator will do just fine.
> > for 2, we should have a per-MM page_frag allocator.  s390 already has
> > something like this, although it's more complicated.  ppc also has
> > something a little more complex for the cases when it's configured with
> > a 64k page size but wants to use a 4k page table entry.
> >
> > I'd like x86 to be able to simply do:
> >
> > #define pte_alloc_one(mm, addr)       page_alloc_table(mm, addr, 0)
> > #define pmd_alloc_one(mm, addr)       page_alloc_table(mm, addr, 0)
> > #define pud_alloc_one(mm, addr)       page_alloc_table(mm, addr, 0)
> > #define p4d_alloc_one(mm, addr)       page_alloc_table(mm, addr, 0)
> >
> > An architecture with 4k page size and needing a 16k PMD would do:
> >
> > #define pmd_alloc_one(mm, addr) page_alloc_table(mm, addr, 2)
> >
> > while an architecture with a 64k page size needing a 4k PTE would do:
> >
> > #define ARCH_PAGE_TABLE_FRAG
> > #define pte_alloc_one(mm, addr) pagefrag_alloc_table(mm, addr, 4096)
> >
> > I haven't had time to work on this, but perhaps someone with a problem
> > that needs fixing would like to, instead of burying yet another awful
> > implementation away in arch/ somewhere.
> >
Matthew Wilcox (Oracle) Nov. 22, 2018, 2:35 a.m. UTC | #5
On Wed, Nov 21, 2018 at 10:26:26PM +0000, Robin Murphy wrote:
> These are IOMMU page tables, rather than CPU ones, so we're already well
> outside arch code - indeed the original motivation of io-pgtable was to be
> entirely independent of the p*d types and arch-specific MM code (this Armv7
> short-descriptor format is already "non-native" when used by drivers in an
> arm64 kernel).

There was quite a lot of explanation missing from this patch description!

> There are various efficiency reasons for using regular kernel memory instead
> of coherent DMA allocations - for the most part it works well, we just have
> the odd corner case like this one where the 32-bit format gets used on
> 64-bit systems such that the tables themselves still need to be allocated
> below 4GB (although the final output address can point at higher memory by
> virtue of the IOMMU in question not implementing permissions and repurposing
> some of those PTE fields as extra address bits).
> 
> TBH, if this DMA32 stuff is going to be contentious we could possibly just
> rip out the offending kmem_cache - it seemed like good practice for the
> use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied upon to
> give the same 1KB alignment and chance of succeeding as the equivalent
> kmem_cache_alloc(), then we could quite easily make do with that instead.

I think you should look at using the page_frag allocator here.  You can
use whatever GFP_DMA flags you like.
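
Something along these lines, as a rough untested sketch (the l2_frags
field is made up -- it would be a per-pgtable struct page_frag_cache;
since fragments are carved off the backing page in fragsz steps, 1KB
requests should also come back 1KB-aligned):

	/* in the io-pgtable data: struct page_frag_cache l2_frags; */

	table = page_frag_alloc(&data->l2_frags, ARM_V7S_TABLE_SIZE(2),
				gfp | GFP_DMA32);
	if (table)
		memset(table, 0, ARM_V7S_TABLE_SIZE(2));

	/* and on the free side */
	page_frag_free(table);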
Nicolas Boichat Nov. 22, 2018, 5:56 a.m. UTC | #6
On Thu, Nov 22, 2018 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Nov 21, 2018 at 10:26:26PM +0000, Robin Murphy wrote:
> > These are IOMMU page tables, rather than CPU ones, so we're already well
> > outside arch code - indeed the original motivation of io-pgtable was to be
> > entirely independent of the p*d types and arch-specific MM code (this Armv7
> > short-descriptor format is already "non-native" when used by drivers in an
> > arm64 kernel).
>
> There was quite a lot of explanation missing from this patch description!

I totally agree ,-) I'm not familiar at all with either iommu or
mm/... Looks like the patchset triggered a helpful discussion, and I
understand the problem better now. I'll improve the description in the
next revision.

> > There are various efficiency reasons for using regular kernel memory instead
> > of coherent DMA allocations - for the most part it works well, we just have
> > the odd corner case like this one where the 32-bit format gets used on
> > 64-bit systems such that the tables themselves still need to be allocated
> > below 4GB (although the final output address can point at higher memory by
> > virtue of the IOMMU in question not implementing permissions and repurposing
> > some of those PTE fields as extra address bits).
> >
> > TBH, if this DMA32 stuff is going to be contentious we could possibly just
> > rip out the offending kmem_cache - it seemed like good practice for the
> > use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied upon to
> > give the same 1KB alignment and chance of succeeding as the equivalent
> > kmem_cache_alloc(), then we could quite easily make do with that instead.
>
> I think you should look at using the page_frag allocator here.  You can
> use whatever GFP_DMA flags you like.

I'll try that.

Thanks!
Christoph Hellwig Nov. 22, 2018, 8:23 a.m. UTC | #7
On Wed, Nov 21, 2018 at 10:26:26PM +0000, Robin Murphy wrote:
> TBH, if this DMA32 stuff is going to be contentious we could possibly just
> rip out the offending kmem_cache - it seemed like good practice for the
> use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied upon to
> give the same 1KB alignment and chance of succeeding as the equivalent
> kmem_cache_alloc(), then we could quite easily make do with that instead.

Neither is the slab support for kmalloc, nor do kmalloc allocations
have useful alignment apparently (at least if you use slub debug).

But I do agree with the sentiment of not wanting to spread GFP_DMA32
further into the slab allocator.

I think you want a simple genalloc allocator for this rather special
use case.
Christoph Hellwig Nov. 22, 2018, 8:26 a.m. UTC | #8
On Wed, Nov 21, 2018 at 06:35:58PM -0800, Matthew Wilcox wrote:
> I think you should look at using the page_frag allocator here.  You can
> use whatever GFP_DMA flags you like.

So I actually tried to use page_frag to solve the XFS unaligned kmalloc
allocations problem, and I don't think it is the right hammer for this
nail (or any other nail outside of networking).

The problem with the page_frag allocator is that it never reuses
fragments returned to the page, but only frees the page once all
fragments are freed.  This means that if you have some long(er) term
allocations you are effectively creating memory leaks.
Matthew Wilcox (Oracle) Nov. 22, 2018, 3:16 p.m. UTC | #9
On Thu, Nov 22, 2018 at 12:26:02AM -0800, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 06:35:58PM -0800, Matthew Wilcox wrote:
> > I think you should look at using the page_frag allocator here.  You can
> > use whatever GFP_DMA flags you like.
> 
> So I actually tried to use page_frag to solve the XFS unaligned kmalloc
> allocations problem, and I don't think it is the right hammer for this
> nail (or any other nail outside of networking).
> 
> The problem with the page_frag allocator is that it never reuses
> fragments returned to the page, but only frees the page once all
> fragments are freed.  This means that if you have some long(er) term
> allocations you are effectively creating memory leaks.

Yes, your allocations from the page_frag allocator have to have similar
lifetimes.  I thought that would be ideal for XFS though; as I understood
the problem, these were per-IO allocations, and IOs to the same filesystem
tend to take roughly the same amount of time.  Sure, in an error case,
some IOs will take a long time before timing out, but it should be OK
to have pages unavailable during that time in these rare situations.
What am I missing?
Christoph Hellwig Nov. 22, 2018, 3:19 p.m. UTC | #10
On Thu, Nov 22, 2018 at 07:16:32AM -0800, Matthew Wilcox wrote:
> Yes, your allocations from the page_frag allocator have to have similar
> lifetimes.  I thought that would be ideal for XFS though; as I understood
> the problem, these were per-IO allocations, and IOs to the same filesystem
> tend to take roughly the same amount of time.  Sure, in an error case,
> some IOs will take a long time before timing out, but it should be OK
> to have pages unavailable during that time in these rare situations.
> What am I missing?

No, these are allocations for metadata buffers, which can stay around
for a long time.  Worse still, that depends on usage, so one buffer
allocated from a page might basically stay around forever, while
another one might get reclaimed very quickly.
Nicolas Boichat Nov. 23, 2018, 3:04 a.m. UTC | #11
On Thu, Nov 22, 2018 at 4:23 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Nov 21, 2018 at 10:26:26PM +0000, Robin Murphy wrote:
> > TBH, if this DMA32 stuff is going to be contentious we could possibly just
> > rip out the offending kmem_cache - it seemed like good practice for the
> > use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied upon to
> > give the same 1KB alignment and chance of succeeding as the equivalent
> > kmem_cache_alloc(), then we could quite easily make do with that instead.
>
> Neither is the slab support for kmalloc, nor do kmalloc allocations
> have useful alignment apparently (at least if you use slub debug).
>
> But I do agree with the sentiment of not wanting to spread GFP_DMA32
> further into the slab allocator.
>
> I think you want a simple genalloc allocator for this rather special
> use case.

So I had a look at genalloc, we'd need to add pre-allocated memory
using gen_pool_add [1]. There can be up to 4096 L2 page tables, so we
may need to pre-allocate 4MB of memory (1KB per L2 page table). We
could add chunks on demand, but then it'd be difficult to free them up
(genalloc does not have a "gen_pool_remove" call). So basically if the
full 4MB ends up being requested, we'd be stuck with that until the
iommu domain is freed (on the arm64 Mediatek platforms I looked at,
there is only one iommu domain, and it never gets freed).
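
For concreteness, the pre-allocated variant would look something like
this (untested sketch, error handling omitted):

	struct gen_pool *pool;
	unsigned long base;
	void *table;

	/* at init time: one 4MB chunk below 4GB, handed out in 1KB pieces */
	pool = gen_pool_create(ilog2(SZ_1K), -1);
	base = __get_free_pages(GFP_KERNEL | GFP_DMA32, get_order(SZ_4M));
	gen_pool_add(pool, base, SZ_4M, -1);

	/* in __arm_v7s_alloc_table(); gen_pool_alloc() is atomic-safe,
	 * and the caller would still need to zero the table */
	table = (void *)gen_pool_alloc(pool, SZ_1K);

	/* and on the free path */
	gen_pool_free(pool, (unsigned long)table, SZ_1K);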

page_frag would at least have a chance to reclaim those pages (if I
understand Christoph's statement correctly).

Robin: Do you have some ideas of the lifetime/usage of L2 tables? If
there are usually few of them, or if they don't get reclaimed easily,
some on demand genalloc allocation would be ok (or even 4MB allocation
on init, if we're willing to take that hit). If they get allocated and
freed together, maybe page_frag is a better option?

Thanks,

[1] https://www.kernel.org/doc/html/v4.19/core-api/genalloc.html
Nicolas Boichat Nov. 23, 2018, 5:37 a.m. UTC | #12
On Fri, Nov 23, 2018 at 11:04 AM Nicolas Boichat <drinkcat@chromium.org> wrote:
>
> On Thu, Nov 22, 2018 at 4:23 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Wed, Nov 21, 2018 at 10:26:26PM +0000, Robin Murphy wrote:
> > > TBH, if this DMA32 stuff is going to be contentious we could possibly just
> > > rip out the offending kmem_cache - it seemed like good practice for the
> > > use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied upon to
> > > give the same 1KB alignment and chance of succeeding as the equivalent
> > > kmem_cache_alloc(), then we could quite easily make do with that instead.
> >
> > Neither is the slab support for kmalloc, nor do kmalloc allocations
> > have useful alignment apparently (at least if you use slub debug).
> >
> > But I do agree with the sentiment of not wanting to spread GFP_DMA32
> > further into the slab allocator.
> >
> > I think you want a simple genalloc allocator for this rather special
> > use case.
>
> So I had a look at genalloc, we'd need to add pre-allocated memory
> using gen_pool_add [1]. There can be up to 4096 L2 page tables, so we
> may need to pre-allocate 4MB of memory (1KB per L2 page table). We
> could add chunks on demand, but then it'd be difficult to free them up
> (genalloc does not have a "gen_pool_remove" call). So basically if the
> full 4MB ends up being requested, we'd be stuck with that until the
> iommu domain is freed (on the arm64 Mediatek platforms I looked at,
> there is only one iommu domain, and it never gets freed).

I tried out genalloc with pre-allocated 4MB, and that seems to work
fine. Allocating in chunks would require genalloc changes as
gen_pool_add calls kmalloc with just GFP_KERNEL [2], and we are in
atomic context in __arm_v7s_alloc_table...

[2] https://elixir.bootlin.com/linux/latest/source/lib/genalloc.c#L190

> page_frag would at least have a chance to reclaim those pages (if I
> understand Christoph's statement correctly).
>
> Robin: Do you have some ideas of the lifetime/usage of L2 tables? If
> there are usually few of them, or if they don't get reclaimed easily,
> some on demand genalloc allocation would be ok (or even 4MB allocation
> on init, if we're willing to take that hit). If they get allocated and
> freed together, maybe page_frag is a better option?
>
> Thanks,
>
> [1] https://www.kernel.org/doc/html/v4.19/core-api/genalloc.html
Vlastimil Babka Nov. 23, 2018, 12:23 p.m. UTC | #13
On 11/22/18 9:23 AM, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 10:26:26PM +0000, Robin Murphy wrote:
>> TBH, if this DMA32 stuff is going to be contentious we could possibly just
>> rip out the offending kmem_cache - it seemed like good practice for the
>> use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied upon to
>> give the same 1KB alignment and chance of succeeding as the equivalent
>> kmem_cache_alloc(), then we could quite easily make do with that instead.
> 
> Neither is the slab support for kmalloc, nor do kmalloc allocations
> have useful alignment apparently (at least if you use slub debug).

Is this also true for caches created by kmem_cache_create(), that
debugging options can result in not respecting the alignment passed to
kmem_cache_create()? That would be rather bad, IMHO.

> But I do agree with the sentiment of not wanting to spread GFP_DMA32
> further into the slab allocator.

I don't see a problem with GFP_DMA32 for custom caches. Generic
kmalloc() would be worse, since it would have to create a new array of
kmalloc caches. But that's already ruled out due to the alignment.

> I think you want a simple genalloc allocator for this rather special
> use case.

I would prefer if slab could support it, as it doesn't have to
preallocate. OTOH if the allocations are GFP_ATOMIC as suggested later
in the thread, and need to always succeed, then preallocation could be
better, and thus maybe genalloc.
Michal Hocko Nov. 23, 2018, 12:30 p.m. UTC | #14
On Fri 23-11-18 13:23:41, Vlastimil Babka wrote:
> On 11/22/18 9:23 AM, Christoph Hellwig wrote:
[...]
> > But I do agree with the sentiment of not wanting to spread GFP_DMA32
> > further into the slab allocator.
> 
> I don't see a problem with GFP_DMA32 for custom caches. Generic
> kmalloc() would be worse, since it would have to create a new array of
> kmalloc caches. But that's already ruled out due to the alignment.

Yes that makes quite a lot of sense to me. We do not really need a
generic support. Just make sure that if somebody creates a GFP_DMA32
restricted cache then allow allocating restricted memory from that.

Is there any fundamental reason that this wouldn't be possible?
Christoph Hellwig Nov. 26, 2018, 8:02 a.m. UTC | #15
On Fri, Nov 23, 2018 at 01:23:41PM +0100, Vlastimil Babka wrote:
> Is this also true for caches created by kmem_cache_create(), that
> debugging options can result in not respecting the alignment passed to
> kmem_cache_create()? That would be rather bad, IMHO.

That's what I understood in the discussion.  If not, it would make
our lives simpler, but it would need to be well documented.

Christoph can probably explain the alignment choices in slub.

> 
> > But I do agree with the sentiment of not wanting to spread GFP_DMA32
> > further into the slab allocator.
> 
> I don't see a problem with GFP_DMA32 for custom caches. Generic
> kmalloc() would be worse, since it would have to create a new array of
> kmalloc caches. But that's already ruled out due to the alignment.

True, purely slab probably isn't too bad.
Nicolas Boichat Nov. 28, 2018, 8:55 a.m. UTC | #16
On Mon, Nov 26, 2018 at 4:02 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Nov 23, 2018 at 01:23:41PM +0100, Vlastimil Babka wrote:
> > Is this also true for caches created by kmem_cache_create(), that
> > debugging options can result in not respecting the alignment passed to
> > kmem_cache_create()? That would be rather bad, IMHO.
>
> That's what I understood in the discussion.  If not, it would make
> our lives simpler, but it would need to be well documented.

From my experiment, adding `slub_debug` to the command line does _not_
break the alignment of kmem_cache_alloc'ed objects.

We do see an increase in slab_size
(/sys/kernel/slab/io-pgtable_armv7s_l2/slab_size), from 1024 to 3072
(probably because slub needs to allocate space on each side for the
red zone/padding, while keeping the alignment?)
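
For reference, the cache in question is created roughly like this
(sketch -- the actual code is in patch 3, which picks SLAB_CACHE_DMA or
SLAB_CACHE_DMA32 depending on the architecture):

	data->l2_tables = kmem_cache_create("io-pgtable_armv7s_l2",
					    ARM_V7S_TABLE_SIZE(2),
					    ARM_V7S_TABLE_SIZE(2),
					    SLAB_CACHE_DMA32, NULL);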

> Christoph can probably explain the alignment choices in slub.
>
> >
> > > But I do agree with the sentiment of not wanting to spread GFP_DMA32
> > > further into the slab allocator.
> >
> > I don't see a problem with GFP_DMA32 for custom caches. Generic
> > kmalloc() would be worse, since it would have to create a new array of
> > kmalloc caches. But that's already ruled out due to the alignment.
>
> True, purely slab probably isn't too bad.
Nicolas Boichat Dec. 4, 2018, 9:37 a.m. UTC | #17
On Sun, Nov 11, 2018 at 5:04 PM Nicolas Boichat <drinkcat@chromium.org> wrote:
>
> This is a follow-up to the discussion in [1], to make sure that the page
> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit
> physical address space.
>
> [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html

Hi everyone,

Let's try to summarize here.

First, we confirmed that this is a regression, and IOMMU errors happen
on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13).
The issue most likely starts from ad67f5a6545f ("arm64: replace
ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number
of Mediatek platforms (and maybe others?).

We have a few options here:
1. This series [2], that adds support for GFP_DMA32 slab caches,
_without_ adding kmalloc caches (since there are no users of
kmalloc(..., GFP_DMA32)). I think I've addressed all the comments on
the 3 patches, and AFAICT this solution works fine.
2. genalloc. That works, but unless we preallocate 4MB for L2 tables
(which is wasteful as we usually only need a handful of L2 tables),
we'll need changes in the core (use GFP_ATOMIC) to allow allocating on
demand, and as it stands we'd have no way to shrink the allocation.
3. page_frag [3]. That works fine, and the code is quite simple. One
drawback is that fragments in partially freed pages cannot be reused
(from limited experiments, I see that IOMMU L2 tables are rarely
freed, so it's unlikely a whole page would get freed). But given the
low number of L2 tables, maybe we can live with that.

I think 2 is out. Any preference between 1 and 3? I think 1 makes
better use of the memory, so that'd be my preference. But I'm probably
missing something.

[2] https://patchwork.kernel.org/cover/10677529/, 3 patches
[3] https://patchwork.codeaurora.org/patch/671639/

Thanks,

Nicolas
Vlastimil Babka Dec. 4, 2018, 2:35 p.m. UTC | #18
On 12/4/18 10:37 AM, Nicolas Boichat wrote:
> On Sun, Nov 11, 2018 at 5:04 PM Nicolas Boichat <drinkcat@chromium.org> wrote:
>>
>> This is a follow-up to the discussion in [1], to make sure that the page
>> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit
>> physical address space.
>>
>> [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
> 
> Hi everyone,
> 
> Let's try to summarize here.
> 
> First, we confirmed that this is a regression, and IOMMU errors happen
> on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13).
> The issue most likely starts from ad67f5a6545f ("arm64: replace
> ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number
> of Mediatek platforms (and maybe others?).
> 
> We have a few options here:
> 1. This series [2], that adds support for GFP_DMA32 slab caches,
> _without_ adding kmalloc caches (since there are no users of
> kmalloc(..., GFP_DMA32)). I think I've addressed all the comments on
> the 3 patches, and AFAICT this solution works fine.
> 2. genalloc. That works, but unless we preallocate 4MB for L2 tables
> (which is wasteful as we usually only need a handful of L2 tables),
> we'll need changes in the core (use GFP_ATOMIC) to allow allocating on
> demand, and as it stands we'd have no way to shrink the allocation.
> 3. page_frag [3]. That works fine, and the code is quite simple. One
> drawback is that fragments in partially freed pages cannot be reused
> (from limited experiments, I see that IOMMU L2 tables are rarely
> freed, so it's unlikely a whole page would get freed). But given the
> low number of L2 tables, maybe we can live with that.
> 
> I think 2 is out. Any preference between 1 and 3? I think 1 makes
> better use of the memory, so that'd be my preference. But I'm probably
> missing something.

I would prefer 1 as well. IIRC you already confirmed that alignment
requirements are not broken for custom kmem caches even in presence of
SLUB debug options (and I would say it's a bug to be fixed if they
weren't). I just asked (and didn't get a reply I think) about your
ability to handle the GFP_ATOMIC allocation failures. They should be
rare when only single page allocations are needed for the kmem cache.
But in case they are not an option, then preallocating would be needed,
thus probably option 2.

> [2] https://patchwork.kernel.org/cover/10677529/, 3 patches
> [3] https://patchwork.codeaurora.org/patch/671639/
> 
> Thanks,
> 
> Nicolas
>
Will Deacon Dec. 4, 2018, 4:28 p.m. UTC | #19
On Tue, Dec 04, 2018 at 05:37:13PM +0800, Nicolas Boichat wrote:
> On Sun, Nov 11, 2018 at 5:04 PM Nicolas Boichat <drinkcat@chromium.org> wrote:
> >
> > This is a follow-up to the discussion in [1], to make sure that the page
> > tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit
> > physical address space.
> >
> > [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
> 
> Hi everyone,
> 
> Let's try to summarize here.
> 
> First, we confirmed that this is a regression, and IOMMU errors happen
> on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13).
> The issue most likely starts from ad67f5a6545f ("arm64: replace
> ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number
> of Mediatek platforms (and maybe others?).
> 
> We have a few options here:
> 1. This series [2], that adds support for GFP_DMA32 slab caches,
> _without_ adding kmalloc caches (since there are no users of
> kmalloc(..., GFP_DMA32)). I think I've addressed all the comments on
> the 3 patches, and AFAICT this solution works fine.
> 2. genalloc. That works, but unless we preallocate 4MB for L2 tables
> (which is wasteful as we usually only need a handful of L2 tables),
> we'll need changes in the core (use GFP_ATOMIC) to allow allocating on
> demand, and as it stands we'd have no way to shrink the allocation.
> 3. page_frag [3]. That works fine, and the code is quite simple. One
> drawback is that fragments in partially freed pages cannot be reused
> (from limited experiments, I see that IOMMU L2 tables are rarely
> freed, so it's unlikely a whole page would get freed). But given the
> low number of L2 tables, maybe we can live with that.
> 
> I think 2 is out. Any preference between 1 and 3? I think 1 makes
> better use of the memory, so that'd be my preference. But I'm probably
> missing something.

FWIW, I'm open to any solution at this point, since I'd like to see this
regression fixed. (1) does sound better longer-term, but (3) looks pretty
much ready to do afaict.

Will
Nicolas Boichat Dec. 5, 2018, 2:04 a.m. UTC | #20
On Tue, Dec 4, 2018 at 10:35 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/4/18 10:37 AM, Nicolas Boichat wrote:
> > On Sun, Nov 11, 2018 at 5:04 PM Nicolas Boichat <drinkcat@chromium.org> wrote:
> >>
> >> This is a follow-up to the discussion in [1], to make sure that the page
> >> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit
> >> physical address space.
> >>
> >> [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
> >
> > Hi everyone,
> >
> > Let's try to summarize here.
> >
> > First, we confirmed that this is a regression, and IOMMU errors happen
> > on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13).
> > The issue most likely starts from ad67f5a6545f ("arm64: replace
> > ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number
> > of Mediatek platforms (and maybe others?).
> >
> > We have a few options here:
> > 1. This series [2], that adds support for GFP_DMA32 slab caches,
> > _without_ adding kmalloc caches (since there are no users of
> > kmalloc(..., GFP_DMA32)). I think I've addressed all the comments on
> > the 3 patches, and AFAICT this solution works fine.
> > 2. genalloc. That works, but unless we preallocate 4MB for L2 tables
> > (which is wasteful as we usually only need a handful of L2 tables),
> > we'll need changes in the core (use GFP_ATOMIC) to allow allocating on
> > demand, and as it stands we'd have no way to shrink the allocation.
> > 3. page_frag [3]. That works fine, and the code is quite simple. One
> > drawback is that fragments in partially freed pages cannot be reused
> > (from limited experiments, I see that IOMMU L2 tables are rarely
> > freed, so it's unlikely a whole page would get freed). But given the
> > low number of L2 tables, maybe we can live with that.
> >
> > I think 2 is out. Any preference between 1 and 3? I think 1 makes
> > better use of the memory, so that'd be my preference. But I'm probably
> > missing something.
>
> I would prefer 1 as well. IIRC you already confirmed that alignment
> requirements are not broken for custom kmem caches even in presence of
> SLUB debug options (and I would say it's a bug to be fixed if they
> weren't).

> I just asked (and didn't get a reply I think) about your
> ability to handle the GFP_ATOMIC allocation failures. They should be
> rare when only single page allocations are needed for the kmem cache.
> But in case they are not an option, then preallocating would be needed,
> thus probably option 2.

Oh, sorry, I missed your question.

I don't have a full answer, but:
- The allocations themselves are rare (I count a few 10s of L2 tables
at most on my system, I assume we rarely have >100), and yes, we only
need a single page, so the failures should be exceptional.
- My change is probably not making anything worse: I assume that even
with the current approach using GFP_DMA slab caches on older kernels,
failures could potentially happen. I don't think we've seen those. If
we are really concerned about this, maybe we'd need to modify
mtk_iommu_map to not hold a spinlock (if that's possible), so we don't
need to use GFP_ATOMIC. I suggest we just keep an eye on such issues,
and address them if they show up (we can even revisit genalloc at that
stage).

Anyway, I'll clean up patches for 1 (mostly commit message changes
based on the comments in the threads) and resend.

Thanks,

> > [2] https://patchwork.kernel.org/cover/10677529/, 3 patches
> > [3] https://patchwork.codeaurora.org/patch/671639/
> >
> > Thanks,
> >
> > Nicolas
> >
>
Nicolas Boichat Dec. 5, 2018, 5:51 a.m. UTC | #21
On Wed, Dec 5, 2018 at 10:04 AM Nicolas Boichat <drinkcat@chromium.org> wrote:
>
> On Tue, Dec 4, 2018 at 10:35 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 12/4/18 10:37 AM, Nicolas Boichat wrote:
> > > On Sun, Nov 11, 2018 at 5:04 PM Nicolas Boichat <drinkcat@chromium.org> wrote:
> > >>
> > >> This is a follow-up to the discussion in [1], to make sure that the page
> > >> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit
> > >> physical address space.
> > >>
> > >> [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
> > >
> > > Hi everyone,
> > >
> > > Let's try to summarize here.
> > >
> > > First, we confirmed that this is a regression, and IOMMU errors happen
> > > on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13).
> > > The issue most likely starts from ad67f5a6545f ("arm64: replace
> > > ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number
> > > of Mediatek platforms (and maybe others?).
> > >
> > > We have a few options here:
> > > 1. This series [2], that adds support for GFP_DMA32 slab caches,
> > > _without_ adding kmalloc caches (since there are no users of
> > > kmalloc(..., GFP_DMA32)). I think I've addressed all the comments on
> > > the 3 patches, and AFAICT this solution works fine.
> > > 2. genalloc. That works, but unless we preallocate 4MB for L2 tables
> > > (which is wasteful as we usually only need a handful of L2 tables),
> > > we'll need changes in the core (use GFP_ATOMIC) to allow allocating on
> > > demand, and as it stands we'd have no way to shrink the allocation.
> > > 3. page_frag [3]. That works fine, and the code is quite simple. One
> > > drawback is that fragments in partially freed pages cannot be reused
> > > (from limited experiments, I see that IOMMU L2 tables are rarely
> > > freed, so it's unlikely a whole page would get freed). But given the
> > > low number of L2 tables, maybe we can live with that.
> > >
> > > I think 2 is out. Any preference between 1 and 3? I think 1 makes
> > > better use of the memory, so that'd be my preference. But I'm probably
> > > missing something.
> >
> > I would prefer 1 as well. IIRC you already confirmed that alignment
> > requirements are not broken for custom kmem caches even in presence of
> > SLUB debug options (and I would say it's a bug to be fixed if they
> > weren't).
>
> > I just asked (and didn't get a reply I think) about your
> > ability to handle the GFP_ATOMIC allocation failures. They should be
> > rare when only single page allocations are needed for the kmem cache.
> > But in case they are not an option, then preallocating would be needed,
> > thus probably option 2.
>
> Oh, sorry, I missed your question.
>
> I don't have a full answer, but:
> - The allocations themselves are rare (I count a few 10s of L2 tables
> at most on my system, I assume we rarely have >100), and yes, we only
> need a single page, so the failures should be exceptional.
> - My change is probably not making anything worse: I assume that even
> with the current approach using GFP_DMA slab caches on older kernels,
> failures could potentially happen. I don't think we've seen those. If
> we are really concerned about this, maybe we'd need to modify
> mtk_iommu_map to not hold a spinlock (if that's possible), so we don't
> need to use GFP_ATOMIC. I suggest we just keep an eye on such issues,
> and address them if they show up (we can even revisit genalloc at that
> stage).
>
> Anyway, I'll clean up patches for 1 (mostly commit message changes
> based on the comments in the threads) and resend.

Done here: https://patchwork.kernel.org/cover/10713019/ .

> Thanks,
>
> > > [2] https://patchwork.kernel.org/cover/10677529/, 3 patches
> > > [3] https://patchwork.codeaurora.org/patch/671639/
> > >
> > > Thanks,
> > >
> > > Nicolas
> > >
> >
Will Deacon Dec. 5, 2018, 2:41 p.m. UTC | #22
On Wed, Dec 05, 2018 at 10:04:00AM +0800, Nicolas Boichat wrote:
> On Tue, Dec 4, 2018 at 10:35 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 12/4/18 10:37 AM, Nicolas Boichat wrote:
> > > On Sun, Nov 11, 2018 at 5:04 PM Nicolas Boichat <drinkcat@chromium.org> wrote:
> > >>
> > >> This is a follow-up to the discussion in [1], to make sure that the page
> > >> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit
> > >> physical address space.
> > >>
> > >> [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
> > >
> > > Hi everyone,
> > >
> > > Let's try to summarize here.
> > >
> > > First, we confirmed that this is a regression, and IOMMU errors happen
> > > on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13).
> > > The issue most likely starts from ad67f5a6545f ("arm64: replace
> > > ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number
> > > of Mediatek platforms (and maybe others?).
> > >
> > > We have a few options here:
> > > 1. This series [2], that adds support for GFP_DMA32 slab caches,
> > > _without_ adding kmalloc caches (since there are no users of
> > > kmalloc(..., GFP_DMA32)). I think I've addressed all the comments on
> > > the 3 patches, and AFAICT this solution works fine.
> > > 2. genalloc. That works, but unless we preallocate 4MB for L2 tables
> > > (which is wasteful as we usually only need a handful of L2 tables),
> > > we'll need changes in the core (use GFP_ATOMIC) to allow allocating on
> > > demand, and as it stands we'd have no way to shrink the allocation.
> > > 3. page_frag [3]. That works fine, and the code is quite simple. One
> > > drawback is that fragments in partially freed pages cannot be reused
> > > (from limited experiments, I see that IOMMU L2 tables are rarely
> > > freed, so it's unlikely a whole page would get freed). But given the
> > > low number of L2 tables, maybe we can live with that.
> > >
> > > I think 2 is out. Any preference between 1 and 3? I think 1 makes
> > > better use of the memory, so that'd be my preference. But I'm probably
> > > missing something.
> >
> > I would prefer 1 as well. IIRC you already confirmed that alignment
> > requirements are not broken for custom kmem caches even in presence of
> > SLUB debug options (and I would say it's a bug to be fixed if they
> > weren't).
> 
> > I just asked (and didn't get a reply I think) about your
> > ability to handle the GFP_ATOMIC allocation failures. They should be
> > rare when only single page allocations are needed for the kmem cache.
> > But in case they are not an option, then preallocating would be needed,
> > thus probably option 2.
> 
> Oh, sorry, I missed your question.
> 
> I don't have a full answer, but:
> - The allocations themselves are rare (I count a few 10s of L2 tables
> at most on my system, I assume we rarely have >100), and yes, we only
> need a single page, so the failures should be exceptional.
> - My change is probably not making anything worse: I assume that even
> with the current approach using GFP_DMA slab caches on older kernels,
> failures could potentially happen. I don't think we've seen those. If
> we are really concerned about this, maybe we'd need to modify
> mtk_iommu_map to not hold a spinlock (if that's possible), so we don't
> need to use GFP_ATOMIC. I suggest we just keep an eye on such issues,
> and address them if they show up (we can even revisit genalloc at that
> stage).

I think the spinlock is the least of our worries: the map/unmap routines
can be called in irq context and may need to allocate second-level tables.

Will