Message ID | 20201029221806.189523375@linutronix.de (mailing list archive)
---|---
Series | mm/highmem: Preemptible variant of kmap_atomic & friends
On Thu, Oct 29, 2020 at 3:32 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Though I wanted to share the current state of affairs before investigating
> that further. If there is consensus in going forward with this, I'll have a
> deeper look into this issue.

Me likee. I think this looks like the right thing to do.

I didn't actually apply the patches, but just from reading them it
_looks_ to me like you do the migrate_disable() unconditionally, even
if it's not a highmem page..

That sounds like it might be a good thing for debugging, but not
necessarily great in general.

Or am I misreading things?

            Linus
On Thu, Oct 29 2020 at 16:11, Linus Torvalds wrote:
> On Thu, Oct 29, 2020 at 3:32 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Though I wanted to share the current state of affairs before investigating
>> that further. If there is consensus in going forward with this, I'll have a
>> deeper look into this issue.
>
> Me likee. I think this looks like the right thing to do.
>
> I didn't actually apply the patches, but just from reading them it
> _looks_ to me like you do the migrate_disable() unconditionally, even
> if it's not a highmem page..
>
> That sounds like it might be a good thing for debugging, but not
> necessarily great in general.
>
> Or am I misreading things?

No, you're not misreading it, but doing it conditionally would be a
complete semantical disaster. kmap_atomic*() also disables preemption
and pagefaults unconditionally. If that weren't the case then every
caller would have to have conditionals like 'if (CONFIG_HIGHMEM)' or
worse 'if (PageHighMem(page))'.

Let's not go there.

Migrate disable is a less horrible plague than preempt and pagefault
disable even if the scheduler people disagree due to the lack of theory
backing that up :)

The charm of the new interface is that users still can rely on per
cpuness independent of being on a highmem plagued system. For non
highmem systems the extra migrate disable/enable is really a minor
nuisance.

Thanks,

        tglx
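The semantic difference being argued here can be sketched in a small
userspace model. Everything below is a made-up simulation for
illustration (the _sim names and the counters are not kernel code); it
just shows the contract: kmap_atomic() disables preemption and
pagefaults unconditionally, while the proposed kmap_local() only
disables migration, so the caller stays preemptible:

```c
#include <assert.h>

/* Userspace model of the two mapping disciplines discussed above.
 * Counters stand in for the real per-task/per-CPU state; the page
 * pointer is returned directly as in the !HIGHMEM case, where the
 * mapping is simply the page's lowmem address.
 */
static int preempt_disabled;
static int pagefaults_disabled;
static int migration_disabled;

static void *kmap_atomic_sim(void *page)
{
	preempt_disabled++;		/* caller may not schedule ...   */
	pagefaults_disabled++;		/* ... and may not take faults   */
	return page;
}

static void kunmap_atomic_sim(void *addr)
{
	(void)addr;
	pagefaults_disabled--;
	preempt_disabled--;
}

static void *kmap_local_sim(void *page)
{
	migration_disabled++;		/* only pin the task to its CPU's
					 * mapping so the address stays
					 * stable across preemption */
	return page;
}

static void kunmap_local_sim(void *addr)
{
	(void)addr;
	migration_disabled--;
}
```

The point of doing the migrate disable unconditionally is visible in
the model: callers get one contract regardless of CONFIG_HIGHMEM or
PageHighMem(), instead of having to open-code those conditionals.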
On Thu, Oct 29, 2020 at 11:18:06PM +0100, Thomas Gleixner wrote:
> This is achieved by:
...
> - Consolidating all kmap atomic implementations in generic code
...
> Though I wanted to share the current state of affairs before investigating
> that further. If there is consensus in going forward with this, I'll have a
> deeper look into this issue.

I think the consolidation is a winner no matter where we go next. Maybe
split it out in a prep series so we can get it in ASAP?
On Fri, Oct 30 2020 at 00:41, Thomas Gleixner wrote:
> On Thu, Oct 29 2020 at 16:11, Linus Torvalds wrote:
> No, you're not misreading it, but doing it conditionally would be a
> complete semantical disaster. kmap_atomic*() also disables preemption
> and pagefaults unconditionally. If that weren't the case then every
> caller would have to have conditionals like 'if (CONFIG_HIGHMEM)' or
> worse 'if (PageHighMem(page))'.
>
> Let's not go there.
>
> Migrate disable is a less horrible plague than preempt and pagefault
> disable even if the scheduler people disagree due to the lack of theory
> backing that up :)
>
> The charm of the new interface is that users still can rely on per
> cpuness independent of being on a highmem plagued system. For non
> highmem systems the extra migrate disable/enable is really a minor
> nuisance.

Thinking about it some more after having sleep and coffee, we actually
could hide the migrate disable in the actual highmem part. But then we
really should not name it kmap_local. 'local' suggests locality, think
local_irq*, local_bh* ... kmap_task would be more accurate then.

Thoughts?

        tglx
On Fri, Oct 30 2020 at 08:25, Christoph Hellwig wrote:
> On Thu, Oct 29, 2020 at 11:18:06PM +0100, Thomas Gleixner wrote:
>> - Consolidating all kmap atomic implementations in generic code
>
> I think the consolidation is a winner no matter where we go next. Maybe
> split it out in a prep series so we can get it in ASAP?

Yes, patches 2-15 can just go without any dependency. The only thing
which needs a bit of thought is naming. See the other reply to Linus.

Thanks,

        tglx
On Thu, Oct 29, 2020 at 11:18:06PM +0100, Thomas Gleixner wrote:
> This series provides kmap_local.* iomap_local variants which only disable
> migration to keep the virtual mapping address stable across preemption,
> but do neither disable pagefaults nor preemption. The new functions can be
> used in any context, but if used in atomic context the caller has to take
> care of eventually disabling pagefaults.

Could I ask for a CONFIG_KMAP_DEBUG which aliases all the kmap variants
to vmap()? I think we currently have a problem in iov_iter on HIGHMEM
configs:

copy_page_to_iter() calls page_copy_sane() which checks:

        head = compound_head(page);
        if (likely(n <= v && v <= page_size(head)))
                return true;

but then:

        void *kaddr = kmap_atomic(page);
        size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
        kunmap_atomic(kaddr);

so if offset to offset+bytes is larger than PAGE_SIZE, this is going to
work for lowmem pages and fail miserably for highmem pages. I suggest
vmap() because vmap has a PAGE_SIZE gap between each allocation.

Alternatively if we could have a kmap_atomic_compound(), that would be
awesome, but probably not realistic to implement. I've more or less
resigned myself to having to map things one page at a time.
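The "map things one page at a time" discipline Willy resigns himself to
can be sketched in userspace. This is a hypothetical model, not kernel
code: pages[] stands in for a compound page's constituent pages, and
the map/unmap comments mark where kmap_atomic()/kunmap_atomic() would
sit, since on HIGHMEM each map only ever covers a single PAGE_SIZE
window:

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096UL
#define PAGE_MASK (PAGE_SIZE - 1)

/* Copy 'bytes' bytes from 'src' into a compound page starting at byte
 * 'offset', touching only one PAGE_SIZE mapping at a time.  'pages' is
 * an array of pointers to the constituent pages.  This is the safe
 * pattern; the bug described above comes from doing a single
 * kmap_atomic(page) and then writing past PAGE_SIZE.
 */
static void copy_to_pages(char **pages, size_t offset,
			  const char *src, size_t bytes)
{
	while (bytes) {
		size_t page_idx = offset / PAGE_SIZE;
		size_t page_off = offset & PAGE_MASK;
		size_t chunk = PAGE_SIZE - page_off;

		if (chunk > bytes)
			chunk = bytes;

		/* kaddr = kmap_atomic(pages[page_idx]) would go here */
		char *kaddr = pages[page_idx];
		memcpy(kaddr + page_off, src, chunk);
		/* kunmap_atomic(kaddr) would go here */

		src += chunk;
		offset += chunk;
		bytes -= chunk;
	}
}
```

A copy that straddles a page boundary ends up split into two map/copy/
unmap rounds, which works identically for lowmem and highmem pages.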
On Fri, Oct 30 2020 at 13:06, Matthew Wilcox wrote:
> On Thu, Oct 29, 2020 at 11:18:06PM +0100, Thomas Gleixner wrote:
>> This series provides kmap_local.* iomap_local variants which only disable
>> migration to keep the virtual mapping address stable across preemption,
>> but do neither disable pagefaults nor preemption. The new functions can be
>> used in any context, but if used in atomic context the caller has to take
>> care of eventually disabling pagefaults.
>
> Could I ask for a CONFIG_KMAP_DEBUG which aliases all the kmap variants
> to vmap()? I think we currently have a problem in iov_iter on HIGHMEM
> configs:

For kmap() that would work, but for kmap_atomic() not so much when it
is called in non-preemptible context, because vmap() might sleep.

> copy_page_to_iter() calls page_copy_sane() which checks:
>
>         head = compound_head(page);
>         if (likely(n <= v && v <= page_size(head)))
>                 return true;
>
> but then:
>
>         void *kaddr = kmap_atomic(page);
>         size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
>         kunmap_atomic(kaddr);
>
> so if offset to offset+bytes is larger than PAGE_SIZE, this is going to
> work for lowmem pages and fail miserably for highmem pages. I suggest
> vmap() because vmap has a PAGE_SIZE gap between each allocation.

On 32bit highmem the kmap_atomic() case is easy: Double the number of
mapping slots and only use every second one, which gives you a guard
page between the maps.

For 64bit we could do something ugly: Enable the highmem kmap_atomic()
crud and enforce an alias mapping (at least on the architectures where
this is reasonable). Then you get the same as for 32bit.

> Alternatively if we could have a kmap_atomic_compound(), that would
> be awesome, but probably not realistic to implement. I've more
> or less resigned myself to having to map things one page at a time.

That might be horribly awesome on 32bit :)

Thanks,

        tglx
On Fri, Oct 30, 2020 at 2:39 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> But then we really should not name it kmap_local. 'local' suggests
> locality, think local_irq*, local_bh* ... kmap_task would be more
> accurate then.

So the main reason I'd like to see it is because I think on a
non-highmem machine, the new kmap should be a complete no-op. IOW, we'd
make sure that there are no costs, no need to increment any "restrict
migration" counts etc.

It's been a bit of a pain to have kmap_atomic() have magical side
semantics that people might then depend on.

I think "local" could still work as a name, because it would have to be
thread-local (and maybe we'd want a debug mode where that gets
verified, as actual HIGHMEM machines are getting rare).

I'd avoid "task", because that implies (to me, at least) that it
wouldn't be good for interrupts etc that don't have a task context.

I think the main issue is that it has to be released in the same
context as it was created (ie no passing those things around to other
contexts). I think "local" is fine for that, but I could imagine other
names. The ones that come to mind are somewhat cumbersome, though
("shortterm" or "local_ctx" or something along those lines).

I dunno.

            Linus
On Fri, Oct 30 2020 at 13:28, Linus Torvalds wrote:
> On Fri, Oct 30, 2020 at 2:39 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> But then we really should not name it kmap_local. 'local' suggests
>> locality, think local_irq*, local_bh* ... kmap_task would be more
>> accurate then.
>
> So the main reason I'd like to see it is because I think on a
> non-highmem machine, the new kmap should be a complete no-op. IOW,
> we'd make sure that there are no costs, no need to increment any
> "restrict migration" counts etc.

Fair enough.

> It's been a bit of a pain to have kmap_atomic() have magical side
> semantics that people might then depend on.

kmap_atomic() will still have the side semantics :)

> I think "local" could still work as a name, because it would have to
> be thread-local (and maybe we'd want a debug mode where that gets
> verified, as actual HIGHMEM machines are getting rare).
>
> I'd avoid "task", because that implies (to me, at least) that it
> wouldn't be good for interrupts etc that don't have a task context.
>
> I think the main issue is that it has to be released in the same
> context as it was created (ie no passing those things around to other
> contexts). I think "local" is fine for that, but I could imagine other
> names. The ones that come to mind are somewhat cumbersome, though
> ("shortterm" or "local_ctx" or something along those lines).

Yeah, not really intuitive either. Let's stick with _local and add
proper function documentation which clearly says that the side effect
of non-migratability applies only to the 32bit highmem case, in order
to make it work at all. So code which needs CPU locality cannot rely on
it, and we have enough debug stuff to catch something like:

     kmap_local()
       this_cpu_write(....)
     kunmap_local()

Let me redo the pile. While at it I might have a look at that debug
request from Willy in the other end of this thread. Any comment on
that?

  https://lore.kernel.org/r/87k0v7mrrd.fsf@nanos.tec.linutronix.de

Thanks,

        tglx
On Fri, Oct 30, 2020 at 3:26 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> While at it I might have a look at that debug request from Willy in the
> other end of this thread. Any comment on that?
>
> https://lore.kernel.org/r/87k0v7mrrd.fsf@nanos.tec.linutronix.de

I do think that it would be nice to have a debug mode, particularly
since over the last few years we've really lost a lot of HIGHMEM
coverage (to the point that I've wondered how worthwhile it really is
to support at all any more - I think it's Arnd who argued that it's
mainly some embedded ARM variants that will want it for the foreseeable
future).

So I'm honestly somewhat torn. I think HIGHMEM is dying, and yes that
argues for "non-HIGHMEM had better have some debug coverage", but at
the same time I think it doesn't even really matter any more. At some
point those embedded ARM platforms just aren't even interesting - they
might as well use older kernels if they are the only thing really
arguing for HIGHMEM at all.

This is one reason why I'd like the new kmap_local() to be a no-op, and
I'd prefer for it to have no other side effects - because I want to be
ready to remove it entirely some day.

And if we end up having some transition where people start rewriting
"kmap_atomic()" to be "kmap_local() + explicit preemption disable",
then I think that would be a good step on that whole "kmap will
eventually go away" path.

But I do *not* believe that we need to add _so_ much debug support that
we'd catch Willy's "more than one page" case. And I absolutely do not
believe for a second that we should start caring about compound pages.
NO. kmap() is almost dead already, we're not making it worse.

To me, your patch series has two big advantages:

 - more common code

 - kmap_local() becomes more of a no-op

and the last thing we want is to expand on kmap.

            Linus
On Fri, Oct 30 2020 at 15:46, Linus Torvalds wrote:
> On Fri, Oct 30, 2020 at 3:26 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> To me, your patch series has two big advantages:
>
>  - more common code
>
>  - kmap_local() becomes more of a no-op
>
> and the last thing we want is to expand on kmap.

Happy to go with that.

While trying to document the mess, I just stumbled over the abuse of
kmap_atomic_prot() in

  drivers/gpu/drm/ttm/ttm_bo_util.c:      dst = kmap_atomic_prot(d, prot);
  drivers/gpu/drm/ttm/ttm_bo_util.c:      src = kmap_atomic_prot(s, prot);
  drivers/gpu/drm/vmwgfx/vmwgfx_blit.c:   kmap_atomic_prot(d->dst_pages[dst_page],
  drivers/gpu/drm/vmwgfx/vmwgfx_blit.c:   kmap_atomic_prot(d->src_pages[src_page],

For !HIGHMEM kmap_atomic_prot() just ignores the 'prot' argument and
returns the page address.

Moar patches to be written ... Sigh!

Thanks,

        tglx
On Sat, Oct 31 2020 at 00:26, Thomas Gleixner wrote:
> On Fri, Oct 30 2020 at 15:46, Linus Torvalds wrote:
>> On Fri, Oct 30, 2020 at 3:26 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> To me, your patch series has two big advantages:
>>
>>  - more common code
>>
>>  - kmap_local() becomes more of a no-op
>>
>> and the last thing we want is to expand on kmap.
>
> Happy to go with that.
>
> While trying to document the mess, I just stumbled over the abuse of
> kmap_atomic_prot() in
>
>   drivers/gpu/drm/ttm/ttm_bo_util.c:      dst = kmap_atomic_prot(d, prot);
>   drivers/gpu/drm/ttm/ttm_bo_util.c:      src = kmap_atomic_prot(s, prot);
>   drivers/gpu/drm/vmwgfx/vmwgfx_blit.c:   kmap_atomic_prot(d->dst_pages[dst_page],
>   drivers/gpu/drm/vmwgfx/vmwgfx_blit.c:   kmap_atomic_prot(d->src_pages[src_page],
>
> For !HIGHMEM kmap_atomic_prot() just ignores the 'prot' argument and
> returns the page address.
>
> Moar patches to be written ... Sigh!

Or not. This is actually correct by some definition of correct. For the
non highmem case pgprot is set via the set_memory_*() functions and
this just handles the highmem case. Comments are overrated ...
On Fri, Oct 30, 2020 at 11:46 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Oct 30, 2020 at 3:26 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > While at it I might have a look at that debug request from Willy in the
> > other end of this thread. Any comment on that?
> >
> > https://lore.kernel.org/r/87k0v7mrrd.fsf@nanos.tec.linutronix.de
>
> I do think that it would be nice to have a debug mode, particularly
> since over the last few years we've really lost a lot of HIGHMEM
> coverage (to the point that I've wondered how worthwhile it really is
> to support at all any more - I think it's Arnd who argued that it's
> mainly some embedded ARM variants that will want it for the foreseeable
> future).
>
> So I'm honestly somewhat torn. I think HIGHMEM is dying, and yes that
> argues for "non-HIGHMEM had better have some debug coverage", but at
> the same time I think it doesn't even really matter any more.

Shifting the testing of highmem code to the actual users of highmem
sounds reasonable to me. This means it will get broken more often, but
if it doesn't happen all the time, it might serve as a canary: once a
bug survives for long enough, we have a good indication that users
stopped caring ;-)

> At some point those embedded ARM platforms just aren't even
> interesting - they might as well use older kernels if they are the
> only thing really arguing for HIGHMEM at all.

Agreed, but it does need a little time to get there. My best guess is
three to five years from now we can remove it for good, provided a few
things happen first:

1. The remaining users of TI Keystone2, NXP i.MX6 Quad(Plus) and
   Renesas R8A7790/R8A7791/R8A7742/R8A7743 that use the largest memory
   configuration must have stopped requiring kernel version updates.
   These were all introduced at the peak of 32-bit Arm embedded
   systems around 2013, but they have long (15+ year) product life
   cycles, and users pick these because they do promise kernel
   updates, unlike other SoC families that get stuck on old vendor
   kernels much earlier.

2. The plan to add a CONFIG_VMSPLIT_4G_4G option on arch/arm/ must
   work out. We don't have all the code yet, and so far it looks like
   it's going to be a bit ugly and a bit slow, but not nearly as ugly
   or slow as it was on x86 20 years ago. This would cover
   Cortex-A7/A15 systems from 2GB to 4GB, which are still fairly
   common and need to keep using mainline kernels for much longer than
   the systems in point 1. It won't help on Cortex-A9 machines with
   2GB, which I hope can migrate to CONFIG_VMSPLIT_2G_OPT as a
   fallback.

3. No new systems with larger memory must appear. I noticed that e.g.
   the newly introduced Rockchips RV1109 and Allwinner A50/R328/V316
   support LP-DDR4 on a dual Cortex-A7, but I hope that nobody will in
   practice combine a $2 SoC with a $15 memory chip. Most other 32-bit
   chips use DDR3, which tends to prohibit configurations over 4GB in
   new designs, with the cheapest ones limited to 512MB (a single
   256Mx16 chip) and the high end having already moved on to 64 bit.

Regarding 32-bit non-Arm systems, support for the MIPS-based Baikal T1
was merged earlier this year and is used in Russian PC style systems
with up to 8GB. There are also some users on 10+ year old 32-bit
netbooks or business laptops, both x86 and Apple G4. The longest-lived
32-bit embedded systems with large memory (other than Arm) are
probably NXP QorIQ P20xx/P40xx used in military VME bus systems, and
low-end embedded systems based on Vortex86. I'm less worried about all
of these because upstream kernel support for ppc32 and x86-32 is
already bitrotting and they will likely get stuck on the last working
kernel before the TI/Renesas/NXP Arm systems do.

        Arnd
> There are also some users on 10+ year old 32-bit netbooks or
> business laptops, both x86 and Apple G4.
> The longest-lived 32-bit embedded systems with large memory
> (other than Arm) are probably NXP QorIQ P20xx/P40xx used in
> military VME bus systems, and low-end embedded systems based
> on Vortex86.
> I'm less worried about all of these because upstream kernel
> support for ppc32 and x86-32 is already bitrotting and they will
> likely get stuck on the last working kernel before the
> TI/Renesas/NXP Arm systems do.

Upstream kernel support for ppc32 is bitrotting, seriously? What do
you mean exactly?

ppc32 is actively supported, with recent additions of support for
hugepages, kasan, uaccess protection, VMAP stack, etc ...

Christophe
On Sat, Oct 31, 2020 at 4:04 PM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
> > There are also some users on 10+ year old 32-bit netbooks or
> > business laptops, both x86 and Apple G4.
> > The longest-lived 32-bit embedded systems with large memory
> > (other than Arm) are probably NXP QorIQ P20xx/P40xx used in
> > military VME bus systems, and low-end embedded systems based
> > on Vortex86.
> > I'm less worried about all of these because upstream kernel
> > support for ppc32 and x86-32 is already bitrotting and they will
> > likely get stuck on the last working kernel before the
> > TI/Renesas/NXP Arm systems do.
>
> Upstream kernel support for ppc32 is bitrotting, seriously? What do
> you mean exactly?

I was thinking more of the platform support: out of the twelve 32-bit
platforms in arch/powerpc/platforms/, your 8xx is the only one listed
as 'maintained' or 'supported' in the maintainers list, and that seems
to accurately describe the current state. Freescale seems to have
practically stopped contributing to any of their 32-bit platforms in
2016 after the NXP acquisition, no longer employing either of the
maintainers. Similarly, Ben seems to have stopped working on powermac
in 2016, which was ten years after the last 32-bit hardware shipped
for that platform.

> ppc32 is actively supported, with recent additions of support for
> hugepages, kasan, uaccess protection, VMAP stack, etc ...

That is good to hear, I didn't know about these additions. What
platforms are people using to develop these? Is this mainly your 8xx
work, or is there ongoing development for platforms that need highmem?

        Arnd
On Thu, Oct 29 2020 at 23:18, Thomas Gleixner wrote:
>
> There is also a still to be investigated question from Linus on the
> initial posting versus the per cpu / per task mapping stack depth
> which might need to be made larger due to the ability to take page
> faults within a mapping region.

I looked deeper into that, and we have a stack depth of 20. That's
plenty, and I couldn't find a way to get above 10 nested ones including
faults, interrupts and softirqs. With some stress testing I was not
able to get over a maximum of 6 according to the trace_printk() I
added.

For some obscure reason, when CONFIG_DEBUG_HIGHMEM is enabled the
stack depth is increased from 20 to 41. But the only thing
DEBUG_HIGHMEM does is to enable a few BUG_ON()'s in the mapping code.

That's a leftover from the historical mapping code which had fixed
entries for various purposes. DEBUG_HIGHMEM inserted guard mappings
between the map types. But that all got ditched when kmap_atomic()
switched to a stack-based map management. Though the WITH_KM_FENCE
magic survived without being functional. All the thing does today is
to increase the stack depth.

I just made that functional again by keeping the stack depth increase
and utilizing every second slot. That should catch Willy's mapping
problem nicely if he bothers to test on 32bit :)

Thanks,

        tglx
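The "utilize every second slot" trick can be modeled in a few lines.
This is a hypothetical userspace sketch, not the actual patch: the base
address and slot-stack functions are invented for illustration. In the
real code the slots are fixmap entries and the odd slots are simply
left unmapped, so a copy that runs past PAGE_SIZE faults in the guard
instead of silently scribbling over the next mapping:

```c
#include <assert.h>

#define PAGE_SIZE  4096UL
#define KM_MAX_IDX 20	/* nominal stack depth from the mail above */

/* Debug variant of a stack-based kmap slot allocator: double the slot
 * space and hand out only every second slot, so each mapping is
 * followed by an unused guard slot.
 */
static unsigned long km_base = 0xfe000000UL;	/* made-up fixmap base */
static int km_depth;

static unsigned long kmap_slot_push(void)
{
	assert(km_depth < KM_MAX_IDX);
	/* slots 0, 2, 4, ... -- odd slots stay unmapped as guards */
	return km_base + (unsigned long)(2 * km_depth++) * PAGE_SIZE;
}

static void kmap_slot_pop(void)
{
	assert(km_depth > 0);
	km_depth--;
}
```

Consecutive mappings end up 2 * PAGE_SIZE apart, giving exactly the
PAGE_SIZE gap Willy asked vmap() for, without changing the map/unmap
fast path.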