Message ID | 20181111090341.120786-1-drinkcat@chromium.org (mailing list archive)
---|---
Series | iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables
On Sun, 11 Nov 2018, Nicolas Boichat wrote: > This is a follow-up to the discussion in [1], to make sure that the page > tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit > physical address space. Page tables? This means you need a page frame? Why go through the slab allocators?
On Wed, Nov 21, 2018 at 06:20:02PM +0000, Christopher Lameter wrote: > On Sun, 11 Nov 2018, Nicolas Boichat wrote: > > > This is a follow-up to the discussion in [1], to make sure that the page > > tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit > > physical address space. > > Page tables? This means you need a page frame? Why go through the slab > allocators? Because this particular architecture has sub-page-size PMD page tables. We desperately need to hoist page table allocation out of the architectures; there're a bunch of different implementations and they're mostly bad, one way or another. For each level of page table we generally have three cases: 1. single page 2. sub-page, naturally aligned 3. multiple pages, naturally aligned for 1 and 3, the page allocator will do just fine. for 2, we should have a per-MM page_frag allocator. s390 already has something like this, although it's more complicated. ppc also has something a little more complex for the cases when it's configured with a 64k page size but wants to use a 4k page table entry. I'd like x86 to be able to simply do: #define pte_alloc_one(mm, addr) page_alloc_table(mm, addr, 0) #define pmd_alloc_one(mm, addr) page_alloc_table(mm, addr, 0) #define pud_alloc_one(mm, addr) page_alloc_table(mm, addr, 0) #define p4d_alloc_one(mm, addr) page_alloc_table(mm, addr, 0) An architecture with 4k page size and needing a 16k PMD would do: #define pmd_alloc_one(mm, addr) page_alloc_table(mm, addr, 2) while an architecture with a 64k page size needing a 4k PTE would do: #define ARCH_PAGE_TABLE_FRAG #define pte_alloc_one(mm, addr) pagefrag_alloc_table(mm, addr, 4096) I haven't had time to work on this, but perhaps someone with a problem that needs fixing would like to, instead of burying yet another awful implementation away in arch/ somewhere.
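[Editorial note: for readers skimming the thread, below is a minimal sketch of what the generic page_alloc_table() helper proposed above might look like for cases 1 and 3 (a single page, or naturally aligned multiple pages). The helper does not exist in mainline; its name comes from the macros above, and the GFP flags and the omitted ctor/accounting hooks are assumptions.]

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical helper sketched from the macros above: hand back
 * 2^order naturally aligned pages for a page table. Real code would
 * also run the pgtable ctor/accounting hooks, omitted here, and the
 * sub-page case 2 would go through a per-MM page_frag allocator
 * instead.
 */
static inline struct page *page_alloc_table(struct mm_struct *mm,
					    unsigned long addr,
					    unsigned int order)
{
	/* mm and addr are unused in this sketch. */
	return alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, order);
}
```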
On 2018-11-21 9:38 pm, Matthew Wilcox wrote: > On Wed, Nov 21, 2018 at 06:20:02PM +0000, Christopher Lameter wrote: >> On Sun, 11 Nov 2018, Nicolas Boichat wrote: >> >>> This is a follow-up to the discussion in [1], to make sure that the page >>> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit >>> physical address space. >> >> Page tables? This means you need a page frame? Why go through the slab >> allocators? > > Because this particular architecture has sub-page-size PMD page tables. > We desperately need to hoist page table allocation out of the architectures; > there're a bunch of different implementations and they're mostly bad, > one way or another. These are IOMMU page tables, rather than CPU ones, so we're already well outside arch code - indeed the original motivation of io-pgtable was to be entirely independent of the p*d types and arch-specific MM code (this Armv7 short-descriptor format is already "non-native" when used by drivers in an arm64 kernel). There are various efficiency reasons for using regular kernel memory instead of coherent DMA allocations - for the most part it works well, we just have the odd corner case like this one where the 32-bit format gets used on 64-bit systems such that the tables themselves still need to be allocated below 4GB (although the final output address can point at higher memory by virtue of the IOMMU in question not implementing permissions and repurposing some of those PTE fields as extra address bits). TBH, if this DMA32 stuff is going to be contentious we could possibly just rip out the offending kmem_cache - it seemed like good practice for the use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied upon to give the same 1KB alignment and chance of succeeding as the equivalent kmem_cache_alloc(), then we could quite easily make do with that instead. Thanks, Robin. > For each level of page table we generally have three cases: > > 1. single page > 2. sub-page, naturally aligned > 3. multiple pages, naturally aligned > > for 1 and 3, the page allocator will do just fine. > for 2, we should have a per-MM page_frag allocator. s390 already has > something like this, although it's more complicated. ppc also has > something a little more complex for the cases when it's configured with > a 64k page size but wants to use a 4k page table entry. > > I'd like x86 to be able to simply do: > > #define pte_alloc_one(mm, addr) page_alloc_table(mm, addr, 0) > #define pmd_alloc_one(mm, addr) page_alloc_table(mm, addr, 0) > #define pud_alloc_one(mm, addr) page_alloc_table(mm, addr, 0) > #define p4d_alloc_one(mm, addr) page_alloc_table(mm, addr, 0) > > An architecture with 4k page size and needing a 16k PMD would do: > > #define pmd_alloc_one(mm, addr) page_alloc_table(mm, addr, 2) > > while an architecture with a 64k page size needing a 4k PTE would do: > > #define ARCH_PAGE_TABLE_FRAG > #define pte_alloc_one(mm, addr) pagefrag_alloc_table(mm, addr, 4096) > > I haven't had time to work on this, but perhaps someone with a problem > that needs fixing would like to, instead of burying yet another awful > implementation away in arch/ somewhere. >
On Thu, Nov 22, 2018 at 6:27 AM Robin Murphy <robin.murphy@arm.com> wrote: > > On 2018-11-21 9:38 pm, Matthew Wilcox wrote: > > On Wed, Nov 21, 2018 at 06:20:02PM +0000, Christopher Lameter wrote: > >> On Sun, 11 Nov 2018, Nicolas Boichat wrote: > >> > >>> This is a follow-up to the discussion in [1], to make sure that the page > >>> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit > >>> physical address space. > >> > >> Page tables? This means you need a page frame? Why go through the slab > >> allocators? > > > > Because this particular architecture has sub-page-size PMD page tables. > > We desperately need to hoist page table allocation out of the architectures; > > there're a bunch of different implementations and they're mostly bad, > > one way or another. > > These are IOMMU page tables, rather than CPU ones, so we're already well > outside arch code - indeed the original motivation of io-pgtable was to > be entirely independent of the p*d types and arch-specific MM code (this > Armv7 short-descriptor format is already "non-native" when used by > drivers in an arm64 kernel). > > There are various efficiency reasons for using regular kernel memory > instead of coherent DMA allocations - for the most part it works well, > we just have the odd corner case like this one where the 32-bit format > gets used on 64-bit systems such that the tables themselves still need > to be allocated below 4GB (although the final output address can point > at higher memory by virtue of the IOMMU in question not implementing > permissions and repurposing some of those PTE fields as extra address bits). > > TBH, if this DMA32 stuff is going to be contentious we could possibly > just rip out the offending kmem_cache - it seemed like good practice for > the use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied > upon to give the same 1KB alignment and chance of succeeding as the > equivalent kmem_cache_alloc(), then we could quite easily make do with > that instead. Yes, but if we want to use kzalloc, we'll need to create kmalloc_caches for DMA32, which seems wasteful as there are no other users (see my comment here: https://patchwork.kernel.org/patch/10677525/#22332697). Thanks, > Thanks, > Robin. > > > For each level of page table we generally have three cases: > > > > 1. single page > > 2. sub-page, naturally aligned > > 3. multiple pages, naturally aligned > > > > for 1 and 3, the page allocator will do just fine. > > for 2, we should have a per-MM page_frag allocator. s390 already has > > something like this, although it's more complicated. ppc also has > > something a little more complex for the cases when it's configured with > > a 64k page size but wants to use a 4k page table entry. 
> > > > I'd like x86 to be able to simply do: > > > > #define pte_alloc_one(mm, addr) page_alloc_table(mm, addr, 0) > > #define pmd_alloc_one(mm, addr) page_alloc_table(mm, addr, 0) > > #define pud_alloc_one(mm, addr) page_alloc_table(mm, addr, 0) > > #define p4d_alloc_one(mm, addr) page_alloc_table(mm, addr, 0) > > > > An architecture with 4k page size and needing a 16k PMD would do: > > > > #define pmd_alloc_one(mm, addr) page_alloc_table(mm, addr, 2) > > > > while an architecture with a 64k page size needing a 4k PTE would do: > > > > #define ARCH_PAGE_TABLE_FRAG > > #define pte_alloc_one(mm, addr) pagefrag_alloc_table(mm, addr, 4096) > > > > I haven't had time to work on this, but perhaps someone with a problem > > that needs fixing would like to, instead of burying yet another awful > > implementation away in arch/ somewhere. > >
On Wed, Nov 21, 2018 at 10:26:26PM +0000, Robin Murphy wrote: > These are IOMMU page tables, rather than CPU ones, so we're already well > outside arch code - indeed the original motivation of io-pgtable was to be > entirely independent of the p*d types and arch-specific MM code (this Armv7 > short-descriptor format is already "non-native" when used by drivers in an > arm64 kernel). There was quite a lot of explanation missing from this patch description! > There are various efficiency reasons for using regular kernel memory instead > of coherent DMA allocations - for the most part it works well, we just have > the odd corner case like this one where the 32-bit format gets used on > 64-bit systems such that the tables themselves still need to be allocated > below 4GB (although the final output address can point at higher memory by > virtue of the IOMMU in question not implementing permissions and repurposing > some of those PTE fields as extra address bits). > > TBH, if this DMA32 stuff is going to be contentious we could possibly just > rip out the offending kmem_cache - it seemed like good practice for the > use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied upon to > give the same 1KB alignment and chance of succeeding as the equivalent > kmem_cache_alloc(), then we could quite easily make do with that instead. I think you should look at using the page_frag allocator here. You can use whatever GFP_DMA flags you like.
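[Editorial note: as a rough illustration of that suggestion, a page_frag-based allocator for the 1KB v7s L2 tables might look like the sketch below. The names and the single shared cache are assumptions for brevity, not code from the driver or from any posted patch.]

```c
#include <linux/gfp.h>
#include <linux/mm_types.h>
#include <linux/sizes.h>
#include <linux/string.h>

/* Sketch only: one frag cache shared by all tables; a real driver would
 * want one per io-pgtable instance plus locking around it. */
static struct page_frag_cache l2_frag_cache;

static void *v7s_alloc_l2_table(gfp_t gfp)
{
	/* 1KB fragments carved from GFP_DMA32 pages stay below 4GB and,
	 * because PAGE_SIZE is a multiple of 1KB, remain 1KB aligned. */
	void *table = page_frag_alloc(&l2_frag_cache, SZ_1K, gfp | GFP_DMA32);

	if (table)
		memset(table, 0, SZ_1K);
	return table;
}

static void v7s_free_l2_table(void *table)
{
	/* The backing page is only released once every fragment carved
	 * from it has been freed. */
	page_frag_free(table);
}
```

Because fragments are handed out in 1KB steps from a page-aligned page, each table keeps the required 1KB alignment; the lifetime caveat Christoph raises below still applies.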
On Thu, Nov 22, 2018 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Wed, Nov 21, 2018 at 10:26:26PM +0000, Robin Murphy wrote: > > These are IOMMU page tables, rather than CPU ones, so we're already well > > outside arch code - indeed the original motivation of io-pgtable was to be > > entirely independent of the p*d types and arch-specific MM code (this Armv7 > > short-descriptor format is already "non-native" when used by drivers in an > > arm64 kernel). > > There was quite a lot of explanation missing from this patch description! I totally agree ,-) I'm not familiar at all with either iommu or mm/... Looks like the patchset triggered a helpful discussion, and I understand the problem better now. I'll improve the description in the next revision. > > There are various efficiency reasons for using regular kernel memory instead > > of coherent DMA allocations - for the most part it works well, we just have > > the odd corner case like this one where the 32-bit format gets used on > > 64-bit systems such that the tables themselves still need to be allocated > > below 4GB (although the final output address can point at higher memory by > > virtue of the IOMMU in question not implementing permissions and repurposing > > some of those PTE fields as extra address bits). > > > > TBH, if this DMA32 stuff is going to be contentious we could possibly just > > rip out the offending kmem_cache - it seemed like good practice for the > > use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied upon to > > give the same 1KB alignment and chance of succeeding as the equivalent > > kmem_cache_alloc(), then we could quite easily make do with that instead. > > I think you should look at using the page_frag allocator here. You can > use whatever GFP_DMA flags you like. I'll try that. Thanks!
On Wed, Nov 21, 2018 at 10:26:26PM +0000, Robin Murphy wrote: > TBH, if this DMA32 stuff is going to be contentious we could possibly just > rip out the offending kmem_cache - it seemed like good practice for the > use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied upon to > give the same 1KB alignment and chance of succeeding as the equivalent > kmem_cache_alloc(), then we could quite easily make do with that instead. Neither is there slab support for kmalloc with GFP_DMA32, nor do kmalloc allocations have useful alignment apparently (at least if you use slub debug). But I do agree with the sentiment of not wanting to spread GFP_DMA32 further into the slab allocator. I think you want a simple genalloc allocator for this rather special use case.
On Wed, Nov 21, 2018 at 06:35:58PM -0800, Matthew Wilcox wrote: > I think you should look at using the page_frag allocator here. You can > use whatever GFP_DMA flags you like. So I actually tried to use page_frag to solve the XFS unaligned kmalloc allocations problem, and I don't think it is the right hammer for this nail (or any other nail outside of networking). The problem with the page_frag allocator is that it never reuses fragments returned to the page, but only frees the page once all fragments are freed. This means that if you have some long(er) term allocations you are effectively creating memory leaks.
On Thu, Nov 22, 2018 at 12:26:02AM -0800, Christoph Hellwig wrote: > On Wed, Nov 21, 2018 at 06:35:58PM -0800, Matthew Wilcox wrote: > > I think you should look at using the page_frag allocator here. You can > > use whatever GFP_DMA flags you like. > > So I actually tries to use page_frag to solve the XFS unaligned kmalloc > allocations problem, and I don't think it is the right hammer for this > nail (or any other nail outside of networking). > > The problem with the page_frag allocator is that it never reuses > fragments returned to the page, but only only frees the page once all > fragments are freed. This means that if you have some long(er) term > allocations you are effectively creating memory leaks. Yes, your allocations from the page_frag allocator have to have similar lifetimes. I thought that would be ideal for XFS though; as I understood the problem, these were per-IO allocations, and IOs to the same filesystem tend to take roughly the same amount of time. Sure, in an error case, some IOs will take a long time before timing out, but it should be OK to have pages unavailable during that time in these rare situations. What am I missing?
On Thu, Nov 22, 2018 at 07:16:32AM -0800, Matthew Wilcox wrote: > Yes, your allocations from the page_frag allocator have to have similar > lifetimes. I thought that would be ideal for XFS though; as I understood > the problem, these were per-IO allocations, and IOs to the same filesystem > tend to take roughly the same amount of time. Sure, in an error case, > some IOs will take a long time before timing out, but it should be OK > to have pages unavailable during that time in these rare situations. > What am I missing? No, these are allocations for metadata buffers, which can stay around for a long time. Worse still, that depends on usage, so one buffer allocated from a page might basically stay around forever, while another one might get reclaimed very quickly.
On Thu, Nov 22, 2018 at 4:23 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Wed, Nov 21, 2018 at 10:26:26PM +0000, Robin Murphy wrote: > > TBH, if this DMA32 stuff is going to be contentious we could possibly just > > rip out the offending kmem_cache - it seemed like good practice for the > > use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied upon to > > give the same 1KB alignment and chance of succeeding as the equivalent > > kmem_cache_alloc(), then we could quite easily make do with that instead. > > Neither is the slab support for kmalloc, not do kmalloc allocations > have useful alignment apparently (at least if you use slub debug). > > But I do agree with the sentiment of not wanting to spread GFP_DMA32 > futher into the slab allocator. > > I think you want a simple genalloc allocator for this rather special > use case. So I had a look at genalloc, we'd need to add pre-allocated memory using gen_pool_add [1]. There can be up to 4096 L2 page tables, so we may need to pre-allocate 4MB of memory (1KB per L2 page table). We could add chunks on demand, but then it'd be difficult to free them up (genalloc does not have a "gen_pool_remove" call). So basically if the full 4MB end up being requested, we'd be stuck with that until the iommu domain is freed (on the arm64 Mediatek platforms I looked at, there is only one iommu domain, and it never gets freed). page_frag would at least have a chance to reclaim those pages (if I understand Christoph's statement correctly) Robin: Do you have some ideas of the lifetime/usage of L2 tables? If they are usually few of them, or if they don't get reclaimed easily, some on demand genalloc allocation would be ok (or even 4MB allocation on init, if we're willing to take that hit). If they get allocated and freed together, maybe page_frag is a better option? Thanks, [1] https://www.kernel.org/doc/html/v4.19/core-api/genalloc.html
On Fri, Nov 23, 2018 at 11:04 AM Nicolas Boichat <drinkcat@chromium.org> wrote: > > On Thu, Nov 22, 2018 at 4:23 PM Christoph Hellwig <hch@infradead.org> wrote: > > > > On Wed, Nov 21, 2018 at 10:26:26PM +0000, Robin Murphy wrote: > > > TBH, if this DMA32 stuff is going to be contentious we could possibly just > > > rip out the offending kmem_cache - it seemed like good practice for the > > > use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied upon to > > > give the same 1KB alignment and chance of succeeding as the equivalent > > > kmem_cache_alloc(), then we could quite easily make do with that instead. > > > > Neither is the slab support for kmalloc, not do kmalloc allocations > > have useful alignment apparently (at least if you use slub debug). > > > > But I do agree with the sentiment of not wanting to spread GFP_DMA32 > > futher into the slab allocator. > > > > I think you want a simple genalloc allocator for this rather special > > use case. > > So I had a look at genalloc, we'd need to add pre-allocated memory > using gen_pool_add [1]. There can be up to 4096 L2 page tables, so we > may need to pre-allocate 4MB of memory (1KB per L2 page table). We > could add chunks on demand, but then it'd be difficult to free them up > (genalloc does not have a "gen_pool_remove" call). So basically if the > full 4MB end up being requested, we'd be stuck with that until the > iommu domain is freed (on the arm64 Mediatek platforms I looked at, > there is only one iommu domain, and it never gets freed). I tried out genalloc with pre-allocated 4MB, and that seems to work fine. Allocating in chunks would require genalloc changes as gen_pool_add calls kmalloc with just GFP_KERNEL [2], and we are in atomic context in __arm_v7s_alloc_table... [2] https://elixir.bootlin.com/linux/latest/source/lib/genalloc.c#L190 > page_frag would at least have a chance to reclaim those pages (if I > understand Christoph's statement correctly) > > Robin: Do you have some ideas of the lifetime/usage of L2 tables? If > they are usually few of them, or if they don't get reclaimed easily, > some on demand genalloc allocation would be ok (or even 4MB allocation > on init, if we're willing to take that hit). If they get allocated and > freed together, maybe page_frag is a better option? > > Thanks, > > [1] https://www.kernel.org/doc/html/v4.19/core-api/genalloc.html
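[Editorial note: to make the trade-off concrete, a genalloc variant along the lines described above might look roughly like the sketch below: a 4MB worst-case pool (4096 tables x 1KB) populated page by page at init time, in process context because gen_pool_add() allocates with GFP_KERNEL. All names are made up and error unwinding is omitted.]

```c
#include <linux/errno.h>
#include <linux/genalloc.h>
#include <linux/gfp.h>
#include <linux/log2.h>
#include <linux/sizes.h>
#include <linux/string.h>

static struct gen_pool *l2_pool;

/* Called once at init (process context only, see above). */
static int v7s_l2_pool_init(void)
{
	unsigned long i, va;
	int ret;

	l2_pool = gen_pool_create(ilog2(SZ_1K), -1);	/* 1KB granularity */
	if (!l2_pool)
		return -ENOMEM;

	/* Worst case: 4096 tables x 1KB = 4MB, preallocated page by page. */
	for (i = 0; i < SZ_4M / PAGE_SIZE; i++) {
		va = __get_free_page(GFP_KERNEL | GFP_DMA32 | __GFP_ZERO);
		if (!va)
			return -ENOMEM;
		ret = gen_pool_add(l2_pool, va, PAGE_SIZE, -1);
		if (ret)
			return ret;
	}
	return 0;
}

/* Usable in atomic context: gen_pool_alloc() does not sleep. */
static void *v7s_alloc_l2_table(void)
{
	void *table = (void *)gen_pool_alloc(l2_pool, SZ_1K);

	if (table)
		memset(table, 0, SZ_1K);
	return table;
}

static void v7s_free_l2_table(void *table)
{
	gen_pool_free(l2_pool, (unsigned long)table, SZ_1K);
}
```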
On 11/22/18 9:23 AM, Christoph Hellwig wrote: > On Wed, Nov 21, 2018 at 10:26:26PM +0000, Robin Murphy wrote: >> TBH, if this DMA32 stuff is going to be contentious we could possibly just >> rip out the offending kmem_cache - it seemed like good practice for the >> use-case, but provided kzalloc(SZ_1K, gfp | GFP_DMA32) can be relied upon to >> give the same 1KB alignment and chance of succeeding as the equivalent >> kmem_cache_alloc(), then we could quite easily make do with that instead. > > Neither is the slab support for kmalloc, not do kmalloc allocations > have useful alignment apparently (at least if you use slub debug). Is this also true for caches created by kmem_cache_create(), that debugging options can result in not respecting the alignment passed to kmem_cache_create()? That would be rather bad, IMHO. > But I do agree with the sentiment of not wanting to spread GFP_DMA32 > futher into the slab allocator. I don't see a problem with GFP_DMA32 for custom caches. Generic kmalloc() would be worse, since it would have to create a new array of kmalloc caches. But that's already ruled out due to the alignment. > I think you want a simple genalloc allocator for this rather special > use case. I would prefer if slab could support it, as it doesn't have to preallocate. OTOH if the allocations are GFP_ATOMIC as suggested later in the thread, and need to always succeed, then preallocation could be better, and thus maybe genalloc.
On Fri 23-11-18 13:23:41, Vlastimil Babka wrote: > On 11/22/18 9:23 AM, Christoph Hellwig wrote: [...] > > But I do agree with the sentiment of not wanting to spread GFP_DMA32 > > futher into the slab allocator. > > I don't see a problem with GFP_DMA32 for custom caches. Generic > kmalloc() would be worse, since it would have to create a new array of > kmalloc caches. But that's already ruled out due to the alignment. Yes, that makes quite a lot of sense to me. We do not really need generic support. Just make sure that if somebody creates a GFP_DMA32-restricted cache, allocating restricted memory from it is allowed. Is there any fundamental reason that this wouldn't be possible?
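[Editorial note: concretely, the custom-cache route would look something like the sketch below, assuming the SLAB_CACHE_DMA32 flag proposed by the series under discussion (not in mainline at this point in the thread); function names are illustrative.]

```c
#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/sizes.h>

static struct kmem_cache *l2_tables;

static int v7s_l2_cache_init(void)
{
	/* 1KB objects, 1KB aligned, slab pages restricted to ZONE_DMA32
	 * via the SLAB_CACHE_DMA32 flag proposed in the series. */
	l2_tables = kmem_cache_create("io-pgtable_armv7s_l2",
				      SZ_1K, SZ_1K,
				      SLAB_CACHE_DMA32, NULL);
	return l2_tables ? 0 : -ENOMEM;
}

static void *v7s_alloc_l2_table(gfp_t gfp)
{
	/* The series also passes GFP_DMA32 at allocation time so the
	 * backing slab page is taken from ZONE_DMA32. */
	return kmem_cache_zalloc(l2_tables, gfp | GFP_DMA32);
}
```

Unlike the genalloc option, this needs no preallocation, which is the point Vlastimil makes above.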
On Fri, Nov 23, 2018 at 01:23:41PM +0100, Vlastimil Babka wrote: > Is this also true for caches created by kmem_cache_create(), that > debugging options can result in not respecting the alignment passed to > kmem_cache_create()? That would be rather bad, IMHO. That's what I understood in the discussion. If not, it would make our lives simpler, but it would need to be well documented. Christoph can probably explain the alignment choices in slub. > > > But I do agree with the sentiment of not wanting to spread GFP_DMA32 > > futher into the slab allocator. > > I don't see a problem with GFP_DMA32 for custom caches. Generic > kmalloc() would be worse, since it would have to create a new array of > kmalloc caches. But that's already ruled out due to the alignment. True, purely slab probably isn't too bad.
On Mon, Nov 26, 2018 at 4:02 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Fri, Nov 23, 2018 at 01:23:41PM +0100, Vlastimil Babka wrote: > > Is this also true for caches created by kmem_cache_create(), that > > debugging options can result in not respecting the alignment passed to > > kmem_cache_create()? That would be rather bad, IMHO. > > That's what I understood in the discussion. If not it would make > our live simpler, but would need to be well document. From my experiment, adding `slub_debug` to the command line does _not_ break the alignment of kmem_cache_alloc'ed objects. We do see an increase in slab_size (/sys/kernel/slab/io-pgtable_armv7s_l2/slab_size), from 1024 to 3072 (probably because slub needs to allocate space on each side for the red zone/padding, while keeping the alignment?) > Christoph can probably explain the alignment choices in slub. > > > > > > But I do agree with the sentiment of not wanting to spread GFP_DMA32 > > > futher into the slab allocator. > > > > I don't see a problem with GFP_DMA32 for custom caches. Generic > > kmalloc() would be worse, since it would have to create a new array of > > kmalloc caches. But that's already ruled out due to the alignment. > > True, purely slab probably isn't too bad.
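[Editorial note: the experiment above can be reproduced with a throwaway check along these lines (hypothetical test code, not from the thread): create a 1KB-aligned cache, boot with slub_debug, and warn if an object comes back misaligned.]

```c
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/sizes.h>

/* Hypothetical one-off check: does a 1KB-aligned cache still return
 * 1KB-aligned objects when slub_debug adds red zones and padding? */
static int check_cache_alignment(void)
{
	struct kmem_cache *c;
	void *obj;

	c = kmem_cache_create("align-check", SZ_1K, SZ_1K, 0, NULL);
	if (!c)
		return -ENOMEM;

	obj = kmem_cache_alloc(c, GFP_KERNEL);
	if (obj) {
		WARN_ON(!IS_ALIGNED((unsigned long)obj, SZ_1K));
		kmem_cache_free(c, obj);
	}
	kmem_cache_destroy(c);
	return 0;
}
```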
On Sun, Nov 11, 2018 at 5:04 PM Nicolas Boichat <drinkcat@chromium.org> wrote: > > This is a follow-up to the discussion in [1], to make sure that the page > tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit > physical address space. > > [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html Hi everyone, Let's try to summarize here. First, we confirmed that this is a regression, and IOMMU errors happen on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue most likely starts from ad67f5a6545f ("arm64: replace ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek platforms (and maybe others?). We have a few options here: 1. This series [2], that adds support for GFP_DMA32 slab caches, _without_ adding kmalloc caches (since there are no users of kmalloc(..., GFP_DMA32)). I think I've addressed all the comments on the 3 patches, and AFAICT this solution works fine. 2. genalloc. That works, but unless we preallocate 4MB for L2 tables (which is wasteful as we usually only need a handful of L2 tables), we'll need changes in the core (use GFP_ATOMIC) to allow allocating on demand, and as it stands we'd have no way to shrink the allocation. 3. page_frag [3]. That works fine, and the code is quite simple. One drawback is that fragments in partially freed pages cannot be reused (from limited experiments, I see that IOMMU L2 tables are rarely freed, so it's unlikely a whole page would get freed). But given the low number of L2 tables, maybe we can live with that. I think 2 is out. Any preference between 1 and 3? I think 1 makes better use of the memory, so that'd be my preference. But I'm probably missing something. [2] https://patchwork.kernel.org/cover/10677529/, 3 patches [3] https://patchwork.codeaurora.org/patch/671639/ Thanks, Nicolas
On 12/4/18 10:37 AM, Nicolas Boichat wrote: > On Sun, Nov 11, 2018 at 5:04 PM Nicolas Boichat <drinkcat@chromium.org> wrote: >> >> This is a follow-up to the discussion in [1], to make sure that the page >> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit >> physical address space. >> >> [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html > > Hi everyone, > > Let's try to summarize here. > > First, we confirmed that this is a regression, and IOMMU errors happen > on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13). > The issue most likely starts from ad67f5a6545f ("arm64: replace > ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number > of Mediatek platforms (and maybe others?). > > We have a few options here: > 1. This series [2], that adds support for GFP_DMA32 slab caches, > _without_ adding kmalloc caches (since there are no users of > kmalloc(..., GFP_DMA32)). I think I've addressed all the comments on > the 3 patches, and AFAICT this solution works fine. > 2. genalloc. That works, but unless we preallocate 4MB for L2 tables > (which is wasteful as we usually only need a handful of L2 tables), > we'll need changes in the core (use GFP_ATOMIC) to allow allocating on > demand, and as it stands we'd have no way to shrink the allocation. > 3. page_frag [3]. That works fine, and the code is quite simple. One > drawback is that fragments in partially freed pages cannot be reused > (from limited experiments, I see that IOMMU L2 tables are rarely > freed, so it's unlikely a whole page would get freed). But given the > low number of L2 tables, maybe we can live with that. > > I think 2 is out. Any preference between 1 and 3? I think 1 makes > better use of the memory, so that'd be my preference. But I'm probably > missing something. I would prefer 1 as well. IIRC you already confirmed that alignment requirements are not broken for custom kmem caches even in presence of SLUB debug options (and I would say it's a bug to be fixed if they weren't). I just asked (and didn't get a reply I think) about your ability to handle the GFP_ATOMIC allocation failures. They should be rare when only single page allocations are needed for the kmem cache. But in case they are not an option, then preallocating would be needed, thus probably option 2. > [2] https://patchwork.kernel.org/cover/10677529/, 3 patches > [3] https://patchwork.codeaurora.org/patch/671639/ > > Thanks, > > Nicolas >
On Tue, Dec 04, 2018 at 05:37:13PM +0800, Nicolas Boichat wrote: > On Sun, Nov 11, 2018 at 5:04 PM Nicolas Boichat <drinkcat@chromium.org> wrote: > > > > This is a follow-up to the discussion in [1], to make sure that the page > > tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit > > physical address space. > > > > [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html > > Hi everyone, > > Let's try to summarize here. > > First, we confirmed that this is a regression, and IOMMU errors happen > on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13). > The issue most likely starts from ad67f5a6545f ("arm64: replace > ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number > of Mediatek platforms (and maybe others?). > > We have a few options here: > 1. This series [2], that adds support for GFP_DMA32 slab caches, > _without_ adding kmalloc caches (since there are no users of > kmalloc(..., GFP_DMA32)). I think I've addressed all the comments on > the 3 patches, and AFAICT this solution works fine. > 2. genalloc. That works, but unless we preallocate 4MB for L2 tables > (which is wasteful as we usually only need a handful of L2 tables), > we'll need changes in the core (use GFP_ATOMIC) to allow allocating on > demand, and as it stands we'd have no way to shrink the allocation. > 3. page_frag [3]. That works fine, and the code is quite simple. One > drawback is that fragments in partially freed pages cannot be reused > (from limited experiments, I see that IOMMU L2 tables are rarely > freed, so it's unlikely a whole page would get freed). But given the > low number of L2 tables, maybe we can live with that. > > I think 2 is out. Any preference between 1 and 3? I think 1 makes > better use of the memory, so that'd be my preference. But I'm probably > missing something. FWIW, I'm open to any solution at this point, since I'd like to see this regression fixed. (1) does sound better longer-term, but (3) looks pretty much ready to do afaict. Will
On Tue, Dec 4, 2018 at 10:35 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 12/4/18 10:37 AM, Nicolas Boichat wrote: > > On Sun, Nov 11, 2018 at 5:04 PM Nicolas Boichat <drinkcat@chromium.org> wrote: > >> > >> This is a follow-up to the discussion in [1], to make sure that the page > >> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit > >> physical address space. > >> > >> [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html > > > > Hi everyone, > > > > Let's try to summarize here. > > > > First, we confirmed that this is a regression, and IOMMU errors happen > > on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13). > > The issue most likely starts from ad67f5a6545f ("arm64: replace > > ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number > > of Mediatek platforms (and maybe others?). > > > > We have a few options here: > > 1. This series [2], that adds support for GFP_DMA32 slab caches, > > _without_ adding kmalloc caches (since there are no users of > > kmalloc(..., GFP_DMA32)). I think I've addressed all the comments on > > the 3 patches, and AFAICT this solution works fine. > > 2. genalloc. That works, but unless we preallocate 4MB for L2 tables > > (which is wasteful as we usually only need a handful of L2 tables), > > we'll need changes in the core (use GFP_ATOMIC) to allow allocating on > > demand, and as it stands we'd have no way to shrink the allocation. > > 3. page_frag [3]. That works fine, and the code is quite simple. One > > drawback is that fragments in partially freed pages cannot be reused > > (from limited experiments, I see that IOMMU L2 tables are rarely > > freed, so it's unlikely a whole page would get freed). But given the > > low number of L2 tables, maybe we can live with that. > > > > I think 2 is out. Any preference between 1 and 3? I think 1 makes > > better use of the memory, so that'd be my preference. But I'm probably > > missing something. > > I would prefer 1 as well. IIRC you already confirmed that alignment > requirements are not broken for custom kmem caches even in presence of > SLUB debug options (and I would say it's a bug to be fixed if they > weren't). > I just asked (and didn't get a reply I think) about your > ability to handle the GFP_ATOMIC allocation failures. They should be > rare when only single page allocations are needed for the kmem cache. > But in case they are not an option, then preallocating would be needed, > thus probably option 2. Oh, sorry, I missed your question. I don't have a full answer, but: - The allocations themselves are rare (I count a few 10s of L2 tables at most on my system, I assume we rarely have >100), and yes, we only need a single page, so the failures should be exceptional. - My change is probably not making anything worse: I assume that even with the current approach using GFP_DMA slab caches on older kernels, failures could potentially happen. I don't think we've seen those. If we are really concerned about this, maybe we'd need to modify mtk_iommu_map to not hold a spinlock (if that's possible), so we don't need to use GFP_ATOMIC. I suggest we just keep an eye on such issues, and address them if they show up (we can even revisit genalloc at that stage). Anyway, I'll clean up patches for 1 (mostly commit message changes based on the comments in the threads) and resend. 
Thanks, > > [2] https://patchwork.kernel.org/cover/10677529/, 3 patches > > [3] https://patchwork.codeaurora.org/patch/671639/ > > > > Thanks, > > > > Nicolas > > >
On Wed, Dec 5, 2018 at 10:04 AM Nicolas Boichat <drinkcat@chromium.org> wrote: > > On Tue, Dec 4, 2018 at 10:35 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > > > On 12/4/18 10:37 AM, Nicolas Boichat wrote: > > > On Sun, Nov 11, 2018 at 5:04 PM Nicolas Boichat <drinkcat@chromium.org> wrote: > > >> > > >> This is a follow-up to the discussion in [1], to make sure that the page > > >> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit > > >> physical address space. > > >> > > >> [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html > > > > > > Hi everyone, > > > > > > Let's try to summarize here. > > > > > > First, we confirmed that this is a regression, and IOMMU errors happen > > > on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13). > > > The issue most likely starts from ad67f5a6545f ("arm64: replace > > > ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number > > > of Mediatek platforms (and maybe others?). > > > > > > We have a few options here: > > > 1. This series [2], that adds support for GFP_DMA32 slab caches, > > > _without_ adding kmalloc caches (since there are no users of > > > kmalloc(..., GFP_DMA32)). I think I've addressed all the comments on > > > the 3 patches, and AFAICT this solution works fine. > > > 2. genalloc. That works, but unless we preallocate 4MB for L2 tables > > > (which is wasteful as we usually only need a handful of L2 tables), > > > we'll need changes in the core (use GFP_ATOMIC) to allow allocating on > > > demand, and as it stands we'd have no way to shrink the allocation. > > > 3. page_frag [3]. That works fine, and the code is quite simple. One > > > drawback is that fragments in partially freed pages cannot be reused > > > (from limited experiments, I see that IOMMU L2 tables are rarely > > > freed, so it's unlikely a whole page would get freed). But given the > > > low number of L2 tables, maybe we can live with that. > > > > > > I think 2 is out. Any preference between 1 and 3? I think 1 makes > > > better use of the memory, so that'd be my preference. But I'm probably > > > missing something. > > > > I would prefer 1 as well. IIRC you already confirmed that alignment > > requirements are not broken for custom kmem caches even in presence of > > SLUB debug options (and I would say it's a bug to be fixed if they > > weren't). > > > I just asked (and didn't get a reply I think) about your > > ability to handle the GFP_ATOMIC allocation failures. They should be > > rare when only single page allocations are needed for the kmem cache. > > But in case they are not an option, then preallocating would be needed, > > thus probably option 2. > > Oh, sorry, I missed your question. > > I don't have a full answer, but: > - The allocations themselves are rare (I count a few 10s of L2 tables > at most on my system, I assume we rarely have >100), and yes, we only > need a single page, so the failures should be exceptional. > - My change is probably not making anything worse: I assume that even > with the current approach using GFP_DMA slab caches on older kernels, > failures could potentially happen. I don't think we've seen those. If > we are really concerned about this, maybe we'd need to modify > mtk_iommu_map to not hold a spinlock (if that's possible), so we don't > need to use GFP_ATOMIC. I suggest we just keep an eye on such issues, > and address them if they show up (we can even revisit genalloc at that > stage). 
> > Anyway, I'll clean up patches for 1 (mostly commit message changes > based on the comments in the threads) and resend. Done here: https://patchwork.kernel.org/cover/10713019/ . > Thanks, > > > > [2] https://patchwork.kernel.org/cover/10677529/, 3 patches > > > [3] https://patchwork.codeaurora.org/patch/671639/ > > > > > > Thanks, > > > > > > Nicolas > > > > >
On Wed, Dec 05, 2018 at 10:04:00AM +0800, Nicolas Boichat wrote: > On Tue, Dec 4, 2018 at 10:35 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > > > On 12/4/18 10:37 AM, Nicolas Boichat wrote: > > > On Sun, Nov 11, 2018 at 5:04 PM Nicolas Boichat <drinkcat@chromium.org> wrote: > > >> > > >> This is a follow-up to the discussion in [1], to make sure that the page > > >> tables allocated by iommu/io-pgtable-arm-v7s are contained within 32-bit > > >> physical address space. > > >> > > >> [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html > > > > > > Hi everyone, > > > > > > Let's try to summarize here. > > > > > > First, we confirmed that this is a regression, and IOMMU errors happen > > > on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13). > > > The issue most likely starts from ad67f5a6545f ("arm64: replace > > > ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number > > > of Mediatek platforms (and maybe others?). > > > > > > We have a few options here: > > > 1. This series [2], that adds support for GFP_DMA32 slab caches, > > > _without_ adding kmalloc caches (since there are no users of > > > kmalloc(..., GFP_DMA32)). I think I've addressed all the comments on > > > the 3 patches, and AFAICT this solution works fine. > > > 2. genalloc. That works, but unless we preallocate 4MB for L2 tables > > > (which is wasteful as we usually only need a handful of L2 tables), > > > we'll need changes in the core (use GFP_ATOMIC) to allow allocating on > > > demand, and as it stands we'd have no way to shrink the allocation. > > > 3. page_frag [3]. That works fine, and the code is quite simple. One > > > drawback is that fragments in partially freed pages cannot be reused > > > (from limited experiments, I see that IOMMU L2 tables are rarely > > > freed, so it's unlikely a whole page would get freed). But given the > > > low number of L2 tables, maybe we can live with that. > > > > > > I think 2 is out. Any preference between 1 and 3? I think 1 makes > > > better use of the memory, so that'd be my preference. But I'm probably > > > missing something. > > > > I would prefer 1 as well. IIRC you already confirmed that alignment > > requirements are not broken for custom kmem caches even in presence of > > SLUB debug options (and I would say it's a bug to be fixed if they > > weren't). > > > I just asked (and didn't get a reply I think) about your > > ability to handle the GFP_ATOMIC allocation failures. They should be > > rare when only single page allocations are needed for the kmem cache. > > But in case they are not an option, then preallocating would be needed, > > thus probably option 2. > > Oh, sorry, I missed your question. > > I don't have a full answer, but: > - The allocations themselves are rare (I count a few 10s of L2 tables > at most on my system, I assume we rarely have >100), and yes, we only > need a single page, so the failures should be exceptional. > - My change is probably not making anything worse: I assume that even > with the current approach using GFP_DMA slab caches on older kernels, > failures could potentially happen. I don't think we've seen those. If > we are really concerned about this, maybe we'd need to modify > mtk_iommu_map to not hold a spinlock (if that's possible), so we don't > need to use GFP_ATOMIC. I suggest we just keep an eye on such issues, > and address them if they show up (we can even revisit genalloc at that > stage). 
I think the spinlock is the least of our worries: the map/unmap routines can be called in irq context and may need to allocate second-level tables. Will
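[Editorial note: for context, the call site Will refers to looks roughly like the paraphrased sketch below (simplified from the mtk_iommu map path; the stand-in struct and names are abbreviations, not the driver's exact code). The io-pgtable ->map() call runs with a spinlock held and can be reached from IRQ context, and it is inside that call that a new L2 table may have to be allocated, hence GFP_ATOMIC.]

```c
#include <linux/spinlock.h>
#include <linux/types.h>

/* Simplified stand-in for the driver's domain structure; io_pgtable_ops
 * is declared by the io-pgtable code (drivers/iommu/io-pgtable.h at the
 * time of this thread). */
struct mtk_domain_sketch {
	spinlock_t		pgtlock;
	struct io_pgtable_ops	*iop;
};

/* Paraphrased from the mtk_iommu map path: ->map() runs under a
 * spinlock, possibly in IRQ context, so any L2 table allocation inside
 * it cannot sleep and must use GFP_ATOMIC. */
static int sketch_iommu_map(struct mtk_domain_sketch *dom, unsigned long iova,
			    phys_addr_t paddr, size_t size, int prot)
{
	unsigned long flags;
	int ret;

	spin_lock_irqsave(&dom->pgtlock, flags);
	ret = dom->iop->map(dom->iop, iova, paddr, size, prot);
	spin_unlock_irqrestore(&dom->pgtlock, flags);

	return ret;
}
```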