Message ID: 20200919091751.011116649@linutronix.de (mailing list archive)
Series: mm/highmem: Provide a preemptible variant of kmap_atomic & friends
On Sat, Sep 19, 2020 at 11:50 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > First of all, sorry for the horribly big Cc list! > > Following up to the discussion in: > > https://lore.kernel.org/r/20200914204209.256266093@linutronix.de > > this provides a preemptible variant of kmap_atomic & related > interfaces. This is achieved by: > > - Consolidating all kmap atomic implementations in generic code > > - Switching from per CPU storage of the kmap index to a per task storage > > - Adding a pteval array to the per task storage which contains the ptevals > of the currently active temporary kmaps > > - Adding context switch code which checks whether the outgoing or the > incoming task has active temporary kmaps. If so, the outgoing task's > kmaps are removed and the incoming task's kmaps are restored. > > - Adding new interfaces k[un]map_temporary*() which are not disabling > preemption and can be called from any context (except NMI). > > Contrary to kmap() which provides preemptible and "persistant" mappings, > these interfaces are meant to replace the temporary mappings provided by > kmap_atomic*() today. > > This allows to get rid of conditional mapping choices and allows to have > preemptible short term mappings on 64bit which are today enforced to be > non-preemptible due to the highmem constraints. It clearly puts overhead on > the highmem users, but highmem is slow anyway. > > This is not a wholesale conversion which makes kmap_atomic magically > preemptible because there might be usage sites which rely on the implicit > preempt disable. So this needs to be done on a case by case basis and the > call sites converted to kmap_temporary. > > Note, that this is only lightly tested on X86 and completely untested on > all other architectures. > > The lot is also available from > > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git highmem I think it should be the case, but I want to double check: Will copy_*_user be allowed within a kmap_temporary section? This would allow us to ditch an absolute pile of slowpaths. 
-Daniel > > Thanks, > > tglx > --- > a/arch/arm/mm/highmem.c | 121 --------------------- > a/arch/microblaze/mm/highmem.c | 78 ------------- > a/arch/nds32/mm/highmem.c | 48 -------- > a/arch/powerpc/mm/highmem.c | 67 ----------- > a/arch/sparc/mm/highmem.c | 115 -------------------- > arch/arc/Kconfig | 1 > arch/arc/include/asm/highmem.h | 8 + > arch/arc/mm/highmem.c | 44 ------- > arch/arm/Kconfig | 1 > arch/arm/include/asm/highmem.h | 30 +++-- > arch/arm/mm/Makefile | 1 > arch/csky/Kconfig | 1 > arch/csky/include/asm/highmem.h | 4 > arch/csky/mm/highmem.c | 75 ------------- > arch/microblaze/Kconfig | 1 > arch/microblaze/include/asm/highmem.h | 6 - > arch/microblaze/mm/Makefile | 1 > arch/microblaze/mm/init.c | 6 - > arch/mips/Kconfig | 1 > arch/mips/include/asm/highmem.h | 4 > arch/mips/mm/highmem.c | 77 ------------- > arch/mips/mm/init.c | 3 > arch/nds32/Kconfig.cpu | 1 > arch/nds32/include/asm/highmem.h | 21 ++- > arch/nds32/mm/Makefile | 1 > arch/powerpc/Kconfig | 1 > arch/powerpc/include/asm/highmem.h | 6 - > arch/powerpc/mm/Makefile | 1 > arch/powerpc/mm/mem.c | 7 - > arch/sparc/Kconfig | 1 > arch/sparc/include/asm/highmem.h | 7 - > arch/sparc/mm/Makefile | 3 > arch/sparc/mm/srmmu.c | 2 > arch/x86/include/asm/fixmap.h | 1 > arch/x86/include/asm/highmem.h | 12 +- > arch/x86/include/asm/iomap.h | 29 +++-- > arch/x86/mm/highmem_32.c | 59 ---------- > arch/x86/mm/init_32.c | 15 -- > arch/x86/mm/iomap_32.c | 57 ---------- > arch/xtensa/Kconfig | 1 > arch/xtensa/include/asm/highmem.h | 9 + > arch/xtensa/mm/highmem.c | 44 ------- > b/arch/x86/Kconfig | 3 > include/linux/highmem.h | 141 +++++++++++++++--------- > include/linux/io-mapping.h | 2 > include/linux/sched.h | 9 + > kernel/sched/core.c | 10 + > mm/Kconfig | 3 > mm/highmem.c | 192 ++++++++++++++++++++++++++++++++-- > 49 files changed, 422 insertions(+), 909 deletions(-)
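For orientation, a minimal before/after sketch of the usage the cover letter describes. The exact k[un]map_temporary() signatures are assumed from the description above rather than copied from the patches:

	/* Old style: short-lived mapping, implicitly disables preemption */
	vdst = kmap_atomic(dst_page);
	vsrc = kmap_atomic(src_page);
	memcpy(vdst, vsrc, PAGE_SIZE);
	kunmap_atomic(vsrc);
	kunmap_atomic(vdst);

	/* Proposed: the same short-lived, strictly nested mapping, but
	 * preemptible and usable from any context except NMI */
	vdst = kmap_temporary(dst_page);
	vsrc = kmap_temporary(src_page);
	memcpy(vdst, vsrc, PAGE_SIZE);
	kunmap_temporary(vsrc);
	kunmap_temporary(vdst);

On !HIGHMEM configurations both variants boil down to page_address(), so the added cost is confined to the highmem case, as the cover letter notes.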
On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter <daniel@ffwll.ch> wrote: > > On Sat, Sep 19, 2020 at 11:50 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > First of all, sorry for the horribly big Cc list! > > > > Following up to the discussion in: > > > > https://lore.kernel.org/r/20200914204209.256266093@linutronix.de > > > > this provides a preemptible variant of kmap_atomic & related > > interfaces. This is achieved by: > > > > - Consolidating all kmap atomic implementations in generic code > > > > - Switching from per CPU storage of the kmap index to a per task storage > > > > - Adding a pteval array to the per task storage which contains the ptevals > > of the currently active temporary kmaps > > > > - Adding context switch code which checks whether the outgoing or the > > incoming task has active temporary kmaps. If so, the outgoing task's > > kmaps are removed and the incoming task's kmaps are restored. > > > > - Adding new interfaces k[un]map_temporary*() which are not disabling > > preemption and can be called from any context (except NMI). > > > > Contrary to kmap() which provides preemptible and "persistant" mappings, > > these interfaces are meant to replace the temporary mappings provided by > > kmap_atomic*() today. > > > > This allows to get rid of conditional mapping choices and allows to have > > preemptible short term mappings on 64bit which are today enforced to be > > non-preemptible due to the highmem constraints. It clearly puts overhead on > > the highmem users, but highmem is slow anyway. > > > > This is not a wholesale conversion which makes kmap_atomic magically > > preemptible because there might be usage sites which rely on the implicit > > preempt disable. So this needs to be done on a case by case basis and the > > call sites converted to kmap_temporary. > > > > Note, that this is only lightly tested on X86 and completely untested on > > all other architectures. > > > > The lot is also available from > > > > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git highmem > > I think it should be the case, but I want to double check: Will > copy_*_user be allowed within a kmap_temporary section? This would > allow us to ditch an absolute pile of slowpaths. (coffee just kicked in) copy_*_user is ofc allowed, but if you hit a page fault you get a short read/write. This looks like it would remove the need to handle these in a slowpath, since page faults can now be served in this new kmap_temporary sections. But this sounds too good to be true, so I'm wondering what I'm missing. 
-Daniel > > > > > Thanks, > > > > tglx > > --- > > a/arch/arm/mm/highmem.c | 121 --------------------- > > a/arch/microblaze/mm/highmem.c | 78 ------------- > > a/arch/nds32/mm/highmem.c | 48 -------- > > a/arch/powerpc/mm/highmem.c | 67 ----------- > > a/arch/sparc/mm/highmem.c | 115 -------------------- > > arch/arc/Kconfig | 1 > > arch/arc/include/asm/highmem.h | 8 + > > arch/arc/mm/highmem.c | 44 ------- > > arch/arm/Kconfig | 1 > > arch/arm/include/asm/highmem.h | 30 +++-- > > arch/arm/mm/Makefile | 1 > > arch/csky/Kconfig | 1 > > arch/csky/include/asm/highmem.h | 4 > > arch/csky/mm/highmem.c | 75 ------------- > > arch/microblaze/Kconfig | 1 > > arch/microblaze/include/asm/highmem.h | 6 - > > arch/microblaze/mm/Makefile | 1 > > arch/microblaze/mm/init.c | 6 - > > arch/mips/Kconfig | 1 > > arch/mips/include/asm/highmem.h | 4 > > arch/mips/mm/highmem.c | 77 ------------- > > arch/mips/mm/init.c | 3 > > arch/nds32/Kconfig.cpu | 1 > > arch/nds32/include/asm/highmem.h | 21 ++- > > arch/nds32/mm/Makefile | 1 > > arch/powerpc/Kconfig | 1 > > arch/powerpc/include/asm/highmem.h | 6 - > > arch/powerpc/mm/Makefile | 1 > > arch/powerpc/mm/mem.c | 7 - > > arch/sparc/Kconfig | 1 > > arch/sparc/include/asm/highmem.h | 7 - > > arch/sparc/mm/Makefile | 3 > > arch/sparc/mm/srmmu.c | 2 > > arch/x86/include/asm/fixmap.h | 1 > > arch/x86/include/asm/highmem.h | 12 +- > > arch/x86/include/asm/iomap.h | 29 +++-- > > arch/x86/mm/highmem_32.c | 59 ---------- > > arch/x86/mm/init_32.c | 15 -- > > arch/x86/mm/iomap_32.c | 57 ---------- > > arch/xtensa/Kconfig | 1 > > arch/xtensa/include/asm/highmem.h | 9 + > > arch/xtensa/mm/highmem.c | 44 ------- > > b/arch/x86/Kconfig | 3 > > include/linux/highmem.h | 141 +++++++++++++++--------- > > include/linux/io-mapping.h | 2 > > include/linux/sched.h | 9 + > > kernel/sched/core.c | 10 + > > mm/Kconfig | 3 > > mm/highmem.c | 192 ++++++++++++++++++++++++++++++++-- > > 49 files changed, 422 insertions(+), 909 deletions(-) > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch
On Sat, Sep 19, 2020 at 2:50 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > this provides a preemptible variant of kmap_atomic & related > interfaces. This is achieved by: Ack. This looks really nice, even apart from the new capability. The only thing I really reacted to is that the name doesn't make sense to me: "kmap_temporary()" seems a bit odd. Particularly for an interface that really is basically meant as a better replacement of "kmap_atomic()" (but is perhaps also a better replacement for "kmap()"). I think I understand how the name came about: I think the "temporary" is there as a distinction from the "longterm" regular kmap(). So I think it makes some sense from an internal implementation angle, but I don't think it makes a lot of sense from an interface name. I don't know what might be a better name, but if we want to emphasize that it's thread-private and a one-off, maybe "local" would be a better naming, and make it distinct from the "global" nature of the old kmap() interface? However, another solution might be to just use this new preemptible "local" kmap(), and remove the old global one entirely. Yes, the old global one caches the page table mapping and that sounds really efficient and nice. But it's actually horribly horribly bad, because it means that we need to use locking for them. Your new "temporary" implementation seems to be fundamentally better locking-wise, and only need preemption disabling as locking (and is equally fast for the non-highmem case). So I wonder if the single-page TLB flush isn't a better model, and whether it wouldn't be a lot simpler to just get rid of the old complex kmap() entirely, and replace it with this? I agree we can't replace the kmap_atomic() version, because maybe people depend on the preemption disabling it also implied. But what about replacing the non-atomic kmap()? Maybe I've missed something. Is it because the new interface still does "pagefault_disable()" perhaps? But does it even need the pagefault_disable() at all? Yes, the *atomic* one obviously needed it. But why does this new one need to disable page faults? [ I'm just reading the patches, I didn't try to apply them and look at the end result, so I might have missed something ] The only other worry I would have is just test coverage of this change. I suspect very few developers use HIGHMEM. And I know the various test robots don't tend to test 32-bit either. But apart from that question about naming (and perhaps replacing kmap() entirely), I very much like it. Linus
On Sat, Sep 19, 2020 at 10:18:54AM -0700, Linus Torvalds wrote: > On Sat, Sep 19, 2020 at 2:50 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > this provides a preemptible variant of kmap_atomic & related > > interfaces. This is achieved by: > > Ack. This looks really nice, even apart from the new capability. > > The only thing I really reacted to is that the name doesn't make sense > to me: "kmap_temporary()" seems a bit odd. > > Particularly for an interface that really is basically meant as a > better replacement of "kmap_atomic()" (but is perhaps also a better > replacement for "kmap()"). > > I think I understand how the name came about: I think the "temporary" > is there as a distinction from the "longterm" regular kmap(). So I > think it makes some sense from an internal implementation angle, but I > don't think it makes a lot of sense from an interface name. > > I don't know what might be a better name, but if we want to emphasize > that it's thread-private and a one-off, maybe "local" would be a > better naming, and make it distinct from the "global" nature of the > old kmap() interface? > > However, another solution might be to just use this new preemptible > "local" kmap(), and remove the old global one entirely. Yes, the old > global one caches the page table mapping and that sounds really > efficient and nice. But it's actually horribly horribly bad, because > it means that we need to use locking for them. Your new "temporary" > implementation seems to be fundamentally better locking-wise, and only > need preemption disabling as locking (and is equally fast for the > non-highmem case). > > So I wonder if the single-page TLB flush isn't a better model, and > whether it wouldn't be a lot simpler to just get rid of the old > complex kmap() entirely, and replace it with this? > > I agree we can't replace the kmap_atomic() version, because maybe > people depend on the preemption disabling it also implied. But what > about replacing the non-atomic kmap()? My concern with that is people might use kmap() and then pass the address to a different task. So we need to audit the current users of kmap() and convert any that do that into using vmap() instead. I like kmap_local(). Or kmap_thread().
On Sat, Sep 19, 2020 at 10:39 AM Matthew Wilcox <willy@infradead.org> wrote: > > My concern with that is people might use kmap() and then pass the address > to a different task. So we need to audit the current users of kmap() > and convert any that do that into using vmap() instead. Ahh. Yes, I guess they might do that. It sounds strange, but not entirely crazy - I could imagine some "PIO thread" that does IO to a page that has been set up by somebody else using kmap(). Or similar. Linus
On Sat, Sep 19 2020 at 12:37, Daniel Vetter wrote: > On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter <daniel@ffwll.ch> wrote: >> I think it should be the case, but I want to double check: Will >> copy_*_user be allowed within a kmap_temporary section? This would >> allow us to ditch an absolute pile of slowpaths. > > (coffee just kicked in) copy_*_user is ofc allowed, but if you hit a > page fault you get a short read/write. This looks like it would remove > the need to handle these in a slowpath, since page faults can now be > served in this new kmap_temporary sections. But this sounds too good > to be true, so I'm wondering what I'm missing. In principle we could allow pagefaults, but not with the currently proposed interface which can be called from any context. Obviously if called from atomic context it can't handle user page faults. In theory we could make a variant which does not disable pagefaults, but that's what kmap() already provides. Thanks, tglx
On Sat, Sep 19 2020 at 10:18, Linus Torvalds wrote: > On Sat, Sep 19, 2020 at 2:50 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> >> this provides a preemptible variant of kmap_atomic & related >> interfaces. This is achieved by: > > Ack. This looks really nice, even apart from the new capability. > > The only thing I really reacted to is that the name doesn't make sense > to me: "kmap_temporary()" seems a bit odd. Yeah. Couldn't come up with something useful. > Particularly for an interface that really is basically meant as a > better replacement of "kmap_atomic()" (but is perhaps also a better > replacement for "kmap()"). > > I think I understand how the name came about: I think the "temporary" > is there as a distinction from the "longterm" regular kmap(). So I > think it makes some sense from an internal implementation angle, but I > don't think it makes a lot of sense from an interface name. > > I don't know what might be a better name, but if we want to emphasize > that it's thread-private and a one-off, maybe "local" would be a > better naming, and make it distinct from the "global" nature of the > old kmap() interface? Right, _local or _thread would be more intuitive. > However, another solution might be to just use this new preemptible > "local" kmap(), and remove the old global one entirely. Yes, the old > global one caches the page table mapping and that sounds really > efficient and nice. But it's actually horribly horribly bad, because > it means that we need to use locking for them. Your new "temporary" > implementation seems to be fundamentally better locking-wise, and only > need preemption disabling as locking (and is equally fast for the > non-highmem case). > > So I wonder if the single-page TLB flush isn't a better model, and > whether it wouldn't be a lot simpler to just get rid of the old > complex kmap() entirely, and replace it with this? > > I agree we can't replace the kmap_atomic() version, because maybe > people depend on the preemption disabling it also implied. But what > about replacing the non-atomic kmap()? > > Maybe I've missed something. Is it because the new interface still > does "pagefault_disable()" perhaps? > > But does it even need the pagefault_disable() at all? Yes, the > *atomic* one obviously needed it. But why does this new one need to > disable page faults? It disables pagefaults because it can be called from atomic and non-atomic context. That was the point to get rid of if (preeemptible()) kmap(); else kmap_atomic(); If it does not disable pagefaults, then it's just a lightweight variant of kmap() for short lived mappings. > But apart from that question about naming (and perhaps replacing > kmap() entirely), I very much like it. I thought about it, but then I figured that kmap pointers can be handed to other contexts from the thread which sets up the mapping because it's 'permanent'. I'm not sure whether that actually happens, so we'd need to audit all kmap() users to be sure. If there is no such use case, then we surely can get of rid of kmap() completely. It's only 300+ instances to stare at and quite some of them are wrapped into other functions. Highmem sucks no matter what and the only sane solution is to remove it completely. Thanks, tglx
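Spelled out with the matching unmap side, the conditional pattern Thomas wants to eliminate looks roughly like this at a call site (shape assumed for illustration):

	void *addr;
	bool atomic = !preemptible();

	/* pick a variant based on the calling context ... */
	addr = atomic ? kmap_atomic(page) : kmap(page);

	/* ... work on addr ... */

	/* ... and remember which variant was picked to undo it */
	if (atomic)
		kunmap_atomic(addr);
	else
		kunmap(page);

A single interface that is legal in both contexts removes that bookkeeping; whether it also has to disable pagefaults is exactly what is being discussed here.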
On Sun, Sep 20, 2020 at 08:23:26AM +0200, Thomas Gleixner wrote: > On Sat, Sep 19 2020 at 12:37, Daniel Vetter wrote: > > On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter <daniel@ffwll.ch> wrote: > >> I think it should be the case, but I want to double check: Will > >> copy_*_user be allowed within a kmap_temporary section? This would > >> allow us to ditch an absolute pile of slowpaths. > > > > (coffee just kicked in) copy_*_user is ofc allowed, but if you hit a > > page fault you get a short read/write. This looks like it would remove > > the need to handle these in a slowpath, since page faults can now be > > served in this new kmap_temporary sections. But this sounds too good > > to be true, so I'm wondering what I'm missing. > > In principle we could allow pagefaults, but not with the currently > proposed interface which can be called from any context. Obviously if > called from atomic context it can't handle user page faults. Yeah that's clear, but does the implementation need to disable pagefaults unconditionally? > In theory we could make a variant which does not disable pagefaults, but > that's what kmap() already provides. Currently we have a bunch of code which roughly does kmap_atomic(); copy_*_user(); kunmap_atomic(); if (short_copy_user) { kmap(); copy_*_user(remaining_stuff); kunmap(); } And the only reason is that kmap is a notch slower, hence the fastpath. If we get a kmap which is fast and allows pagefaults (only in contexts that allow pagefaults already ofc) then we can ditch a pile of kmap users. Cheers, Daniel
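As a sketch of the pattern Daniel describes (the shape is assumed rather than lifted from a particular driver):

	void *vaddr;
	unsigned long left;

	/* Fastpath: atomic map; pagefaults are disabled, so only the
	 * _inatomic copy is legal and it may return a short copy */
	vaddr = kmap_atomic(page);
	left = __copy_from_user_inatomic(vaddr + offset, ubuf, len);
	kunmap_atomic(vaddr);

	if (unlikely(left)) {
		/* Slowpath: sleeping map so the fault can be serviced */
		vaddr = kmap(page);
		left = copy_from_user(vaddr + offset, ubuf, len);
		kunmap(page);
	}

A preemptible mapping that also allowed pagefaults would collapse this into a single copy_from_user() inside the mapped section.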
On Sun, Sep 20 2020 at 08:41, Thomas Gleixner wrote: > On Sat, Sep 19 2020 at 10:18, Linus Torvalds wrote: >> Maybe I've missed something. Is it because the new interface still >> does "pagefault_disable()" perhaps? >> >> But does it even need the pagefault_disable() at all? Yes, the >> *atomic* one obviously needed it. But why does this new one need to >> disable page faults? > > It disables pagefaults because it can be called from atomic and > non-atomic context. That was the point to get rid of > > if (preeemptible()) > kmap(); > else > kmap_atomic(); > > If it does not disable pagefaults, then it's just a lightweight variant > of kmap() for short lived mappings. Actually most usage sites of kmap atomic do not need page faults to be disabled at all. As Daniel pointed out the implicit pagefault disable enforces copy_from_user_inatomic() even in places which are clearly preemptible task context. As we need to look at each instance anyway we can add the PF disable explicitly to the very few places which actually need it. Thanks, tglx
On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > Actually most usage sites of kmap atomic do not need page faults to be > disabled at all. Right. I think the pagefault disabling has (almost) nothing at all to do with the kmap() itself - it comes from the "atomic" part, not the "kmap" part. I say *almost*, because there is one issue that needs some thought: the amount of kmap nesting. The kmap_atomic() interface - and your local/temporary/whatever versions of it - depends very much inherently on being strictly nesting. In fact, it depends so much on it that maybe that should be part of the new name? It's very wrong to do addr1 = kmap_atomic(); addr2 = kmap_atomic(); ..do something with addr 1.. kunmap_atomic(addr1); .. do something with addr 2.. kunmap_atomic(addr2); because the way we allocate the slots is by using a percpu-atomic inc-return (and we deallocate using dec). So it's fundamentally a stack. And that's perfectly fine for page faults - if they do any kmaps, those will obviously nest. So the only issue with page faults might be that the stack grows _larger_. And that might need some thought. We already make the kmap stack bigger for CONFIG_DEBUG_HIGHMEM, and it's possible that if we allow page faults we need to make the kmap stack bigger still. Btw, looking at the stack code, I think your new implementation of it is a bit scary: static inline int kmap_atomic_idx_push(void) { - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; + int idx = current->kmap_ctrl.idx++; and now that 'current->kmap_ctrl.idx' is not atomic wrt (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with nesting I think it's fine anyway - the NMI will undo whatever it did) (b) the prev/next switch And that (b) part worries me. You do the kmap_switch_temporary() to switch the entries, but you do that *separately* from actually switching 'current' to the new value. So kmap_switch_temporary() looks safe, but I don't think it actually is. Because while it first unmaps the old entries and then remaps the new ones, an interrupt can come in, and at that point it matters what is *CURRENT*. And regardless of whether 'current' is 'prev' or 'next', that kmap_switch_temporary() loop may be doing the wrong thing, depending on which one had the deeper stack. The interrupt will be using whatever "current->kmap_ctrl.idx" is, but that might overwrite entries that are in the process of being restored (if current is still 'prev', but kmap_switch_temporary() is in the "restore @next's kmaps" phase), or it might stomp on entries that have been pte_clear()'ed by the 'prev' thing. I dunno. The latter may be one of those "it works anyway, it overwrites things we don't care about", but the former will most definitely not work. And it will be completely impossible to debug, because it will depend on an interrupt that uses kmap_local/atomic/whatever() coming in _just_ at the right point in the scheduler, and only when the scheduler has been entered with the right number of kmap entries on the prev/next stack. And no developer will ever see this with any amount of debug code enabled, because it will only hit on legacy platforms that do this kmap anyway. So honestly, that code scares me. I think it's buggy. And even if it "happens to work", it does so for all the wrong reasons, and is very fragile.
So I would suggest: - continue to use an actual per-cpu kmap_atomic_idx - make the switching code save the old idx, then unmap the old entries one by one (while doing the proper "pop" action), and then map the new entries one by one (while doing the proper "push" action). which would mean that the only index that is actually ever *USED* is the percpu one, and it's always up-to-date and pushed/popped for individual entries, rather than this - imho completely bogus - optimization where you use "p->kmap_ctrl.idx" directly and very very unsafely. Alternatively, that process counter would need about a hundred lines of commentary about exactly why it's safe. Because I don't think it is. Linus
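For reference, the existing per-CPU index helpers Linus wants to keep authoritative look roughly like this (abridged from the mainline header of that time; debug checks omitted):

	DECLARE_PER_CPU(int, __kmap_atomic_idx);

	static inline int kmap_atomic_idx_push(void)
	{
		/* Slot allocation is a per-CPU stack: inc on map ... */
		return __this_cpu_inc_return(__kmap_atomic_idx) - 1;
	}

	static inline void kmap_atomic_idx_pop(void)
	{
		/* ... dec on unmap, so nesting must be strictly LIFO */
		__this_cpu_dec(__kmap_atomic_idx);
	}

In the scheme suggested above, the context switch code would pop the outgoing task's entries and push the incoming task's entries one by one, so the per-CPU counter stays consistent at every instant an interrupt could do a nested kmap.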
On Sun, Sep 20 2020 at 10:23, Daniel Vetter wrote: > On Sun, Sep 20, 2020 at 08:23:26AM +0200, Thomas Gleixner wrote: >> On Sat, Sep 19 2020 at 12:37, Daniel Vetter wrote: >> > On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter <daniel@ffwll.ch> wrote: >> >> I think it should be the case, but I want to double check: Will >> >> copy_*_user be allowed within a kmap_temporary section? This would >> >> allow us to ditch an absolute pile of slowpaths. >> > >> > (coffee just kicked in) copy_*_user is ofc allowed, but if you hit a >> > page fault you get a short read/write. This looks like it would remove >> > the need to handle these in a slowpath, since page faults can now be >> > served in this new kmap_temporary sections. But this sounds too good >> > to be true, so I'm wondering what I'm missing. >> >> In principle we could allow pagefaults, but not with the currently >> proposed interface which can be called from any context. Obviously if >> called from atomic context it can't handle user page faults. > > Yeah that's clear, but does the implemention need to disable pagefaults > unconditionally? As I wrote in the other reply it's not required and the final interface will neither disable preemption nor pagefaults (except for the atomic wrapper around it). Thanks, tglx
On Sun, Sep 20 2020 at 09:57, Linus Torvalds wrote: > On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner <tglx@linutronix.de> wrote: > Btw, looking at the stack code, I think your new implementation of it > is a bit scary: > > static inline int kmap_atomic_idx_push(void) > { > - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; > + int idx = current->kmap_ctrl.idx++; > > and now that 'current->kmap_ctrl.idx' is not atomic wrt > > (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with > nesting I think it's fine anyway - the NMI will undo whatever it did) Right. Nesting should be a non-issue, but I don't think we have kmap_atomic() in NMI context. > (b) the prev/next switch > > And that (b) part worries me. You do the kmap_switch_temporary() to > switch the entries, but you do that *separately* from actually > switching 'current' to the new value. > > So kmap_switch_temporary() looks safe, but I don't think it actually > is. Because while it first unmaps the old entries and then remaps the > new ones, an interrupt can come in, and at that point it matters what > is *CURRENT*. > > And regardless of whether 'current' is 'prev' or 'next', that > kmap_switch_temporary() loop may be doing the wrong thing, depending > on which one had the deeper stack. The interrupt will be using > whatever "current->kmap_ctrl.idx" is, but that might overwrite entries > that are in the process of being restored (if current is still 'prev', > but kmap_switch_temporary() is in the "restore @next's kmaps" phase), > or it might stomp on entries that have been pte_clear()'ed by the > 'prev' thing. Duh yes. Never thought about that. > Alternatively, that process counter would need about a hundred lines > of commentary about exactly why it's safe. Because I don't think it > is. I think the more obvious solution is to split the whole exercise: schedule() prepare_switch() unmap() switch_to() finish_switch() map() That's safe because neither the unmap() nor the map() code changes kmap_ctrl.idx. So if there is an interrupt coming in between unmap() and switch_to() then a kmap_local() there will use the next entry. So we could even do the unmap() with interrupts enabled (preemption disabled). Same for the map() part. To explain that we need only a few lines of commentary methinks. Thanks, tglx
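A rough sketch of that split (the helper names here are invented for illustration; this is not the posted code):

	/* Runs with preemption disabled, before switch_to() */
	static void kmap_local_sched_out(struct task_struct *prev)
	{
		int i;

		/* Clear prev's temporary PTEs. kmap_ctrl.idx is left
		 * untouched, so a nested kmap_local() from an interrupt
		 * simply uses the next free slot. */
		for (i = 0; i < prev->kmap_ctrl.idx; i++)
			kmap_local_clear_pte(prev, i);
	}

	/* Runs from finish_task_switch(), after 'current' is next */
	static void kmap_local_sched_in(struct task_struct *next)
	{
		int i;

		/* Re-establish next's mappings from the saved ptevals */
		for (i = 0; i < next->kmap_ctrl.idx; i++)
			kmap_local_restore_pte(next, i);
	}

The unmap side corresponds to prepare_switch() above and the map side to finish_switch(), which is why neither needs to modify kmap_ctrl.idx.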
On Sun, Sep 20, 2020 at 10:40 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > I think the more obvious solution is to split the whole exercise: > > schedule() > prepare_switch() > unmap() > > switch_to() > > finish_switch() > map() Yeah, that looks much easier to explain. Ack. Linus
On Sun, Sep 20, 2020 at 10:42 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Yeah, that looks much easier to explain. Ack. Btw, one thing that might be a good idea at least initially is to add a check for p->kmap_ctrl.idx being zero at fork, exit and maybe syscall return time (but that last one may be too cumbersome to really worry about). The kmap_atomic() interface basically has a lot of coverage for leaked mappings as part of all the "might_sleep()" checks sprinkled around. The new kmap_temporary/local/whatever wouldn't have that kind of incidental debug checking, and any leaked kmap indexes would be rather hard to debug (much) later when they cause index overflows or whatever. Linus
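A minimal form of the check Linus suggests could look like this (the name and the fork/exit call sites are assumptions, not part of the posted series):

	static inline void kmap_assert_nomap(void)
	{
		/* Catch thread-local kmaps leaked past fork or exit */
		WARN_ON_ONCE(current->kmap_ctrl.idx);
	}

Calling this from the fork and exit paths would restore some of the incidental leak detection that the might_sleep() checks provide for kmap_atomic() today.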
On Sun, Sep 20 2020 at 10:42, Linus Torvalds wrote: > On Sun, Sep 20, 2020 at 10:40 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> >> I think the more obvious solution is to split the whole exercise: >> >> schedule() >> prepare_switch() >> unmap() >> >> switch_to() >> >> finish_switch() >> map() > > Yeah, that looks much easier to explain. Ack. So far so good, but Peter Z. just pointed out to me that I completely missed the fact that this cannot work. If a task is migrated to a different CPU then the mapping address will change which will explode in colourful ways. On RT kernels this works because we pin the task to the CPU via migrate_disable(). On a !RT kernel migrate_disable() maps to preempt_disable() which brings us back to square one. /me goes back to the drawing board. Thanks, tglx
On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > If a task is migrated to a different CPU then the mapping address will > change which will explode in colourful ways. Heh. Right you are. Maybe we really *could* call this new kmap functionality something like "kmap_percpu()" (or maybe "local" is good enough), and make it act like your RT code does for spinlocks - not disable preemption, but only disabling CPU migration. That would probably be good enough for a lot of users that don't want to expose excessive latencies, but where it's really not a huge deal to say "stick to this CPU for a short while". The crypto code certainly sounds like one such case. Linus
On Mon, Sep 21 2020 at 09:24, Linus Torvalds wrote: > On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> >> If a task is migrated to a different CPU then the mapping address will >> change which will explode in colourful ways. > > Right you are. > > Maybe we really *could* call this new kmap functionality something > like "kmap_percpu()" (or maybe "local" is good enough), and make it > act like your RT code does for spinlocks - not disable preemption, but > only disabling CPU migration. I'm all for it, but the scheduler people have opinions :) > That would probably be good enough for a lot of users that don't want > to expose excessive latencies, but where it's really not a huge deal > to say "stick to this CPU for a short while". > > The crypto code certainly sounds like one such case. I looked at a lot of the kmap_atomic() places and quite some of them only require migration to be disabled to keep the temporary map stable. Quite some code could be simplified significantly, especially those places which need to do copy_from/to_user inside these sections. Graphics is the main example here as Daniel pointed out. Alternatively this could of course be solved with per CPU page tables which will come around some day anyway I fear. Thanks, tglx
On Sat, Sep 19, 2020 at 06:39:06PM +0100, Matthew Wilcox wrote: > On Sat, Sep 19, 2020 at 10:18:54AM -0700, Linus Torvalds wrote: > > On Sat, Sep 19, 2020 at 2:50 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > this provides a preemptible variant of kmap_atomic & related > > > interfaces. This is achieved by: > > > > Ack. This looks really nice, even apart from the new capability. > > > > The only thing I really reacted to is that the name doesn't make sense > > to me: "kmap_temporary()" seems a bit odd. > > > > Particularly for an interface that really is basically meant as a > > better replacement of "kmap_atomic()" (but is perhaps also a better > > replacement for "kmap()"). > > > > I think I understand how the name came about: I think the "temporary" > > is there as a distinction from the "longterm" regular kmap(). So I > > think it makes some sense from an internal implementation angle, but I > > don't think it makes a lot of sense from an interface name. > > > > I don't know what might be a better name, but if we want to emphasize > > that it's thread-private and a one-off, maybe "local" would be a > > better naming, and make it distinct from the "global" nature of the > > old kmap() interface? > > > > However, another solution might be to just use this new preemptible > > "local" kmap(), and remove the old global one entirely. Yes, the old > > global one caches the page table mapping and that sounds really > > efficient and nice. But it's actually horribly horribly bad, because > > it means that we need to use locking for them. Your new "temporary" > > implementation seems to be fundamentally better locking-wise, and only > > need preemption disabling as locking (and is equally fast for the > > non-highmem case). > > > > So I wonder if the single-page TLB flush isn't a better model, and > > whether it wouldn't be a lot simpler to just get rid of the old > > complex kmap() entirely, and replace it with this? > > > > I agree we can't replace the kmap_atomic() version, because maybe > > people depend on the preemption disabling it also implied. But what > > about replacing the non-atomic kmap()? > > My concern with that is people might use kmap() and then pass the address > to a different task. So we need to audit the current users of kmap() > and convert any that do that into using vmap() instead. > I've done some of this work.[3] PKS and pmem stray write protection[2] depend on kmap to enable the correct PKS settings. After working through the exception handling we realized that some users of kmap() seem to be doing just this; passing the address to a different task. From what I have found ~90% of kmap() callers are 'kmap_thread()' and the other ~10% are kmap().[3] But of those 10% I'm not familiar with the code enough to know if they really require a 'global' map. What I do know is they save an address which appears to be used in other threads. But I could be wrong. For PKS I added a 'global' implementation which could then be called by kmap() and added a new kmap_thread() call which used the original 'local' version of the PKS interface. The PKS work is still being reviewed internally for the TIP core code. But I've pushed it all to git hub for purposes of this discussion.[1] > I like kmap_local(). Or kmap_thread(). I chose kmap_thread() so that makes sense to me. I also thought about using kmap_global() as an alternative interface which would change just ~10% of the callers and make the series much smaller. 
But internal discussions lead me to chose kmap_thread() as the new interface so that we don't change the semantics of kmap(). Ira [1] https://github.com/weiny2/linux-kernel/tree/lm-pks-pmem-for-5.10-v3 [2] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.weiny@intel.com/ [3] 12:42:06 > git grep ' kmap(' *.c | grep -v '* ' | wc -l 22 12:43:32 > git grep ' kmap_thread(' *.c | grep -v '* ' | wc -l 204 Here are the callers which hand an address to another thread. 12:45:25 > git grep ' kmap(' *.c | grep -v '* ' arch/x86/mm/dump_pagetables.c: [PKMAP_BASE_NR] = { 0UL, "Persistent kmap() Area" }, drivers/firewire/net.c: ptr = kmap(dev->broadcast_rcv_buffer.pages[u]); drivers/gpu/drm/i915/gem/i915_gem_pages.c: return kmap(sg_page(sgt->sgl)); drivers/gpu/drm/i915/selftests/i915_perf.c: scratch = kmap(ce->vm->scratch[0].base.page); drivers/gpu/drm/ttm/ttm_bo_util.c: map->virtual = kmap(map->page); drivers/infiniband/hw/qib/qib_user_sdma.c: mpage = kmap(page); drivers/misc/vmw_vmci/vmci_host.c: context->notify = kmap(context->notify_page) + (uva & (PAGE_SIZE - 1)); drivers/misc/xilinx_sdfec.c: addr = kmap(pages[i]); drivers/mmc/host/usdhi6rol0.c: host->pg.mapped = kmap(host->pg.page); drivers/mmc/host/usdhi6rol0.c: host->pg.mapped = kmap(host->pg.page); drivers/mmc/host/usdhi6rol0.c: host->pg.mapped = kmap(host->pg.page); drivers/nvme/target/tcp.c: iov->iov_base = kmap(sg_page(sg)) + sg->offset + sg_offset; drivers/scsi/libiscsi_tcp.c: segment->sg_mapped = kmap(sg_page(sg)); drivers/target/iscsi/iscsi_target.c: iov[i].iov_base = kmap(sg_page(sg)) + sg->offset + page_off; drivers/target/target_core_transport.c: return kmap(sg_page(sg)) + sg->offset; fs/btrfs/check-integrity.c: block_ctx->datav[i] = kmap(block_ctx->pagev[i]); fs/ceph/dir.c: cache_ctl->dentries = kmap(cache_ctl->page); fs/ceph/inode.c: ctl->dentries = kmap(ctl->page); lib/scatterlist.c: miter->addr = kmap(miter->page) + miter->__offset; net/ceph/pagelist.c: pl->mapped_tail = kmap(page); net/ceph/pagelist.c: pl->mapped_tail = kmap(page); virt/kvm/kvm_main.c: hva = kmap(page);
On Mon, Sep 21, 2020 at 09:27:57PM +0200, Thomas Gleixner wrote: > On Mon, Sep 21 2020 at 09:24, Linus Torvalds wrote: > > On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: > >> > >> If a task is migrated to a different CPU then the mapping address will > >> change which will explode in colourful ways. > > > > Right you are. > > > > Maybe we really *could* call this new kmap functionality something > > like "kmap_percpu()" (or maybe "local" is good enough), and make it > > act like your RT code does for spinlocks - not disable preemption, but > > only disabling CPU migration. > > I'm all for it, but the scheduler people have opinions :) Right, so I'm concerned. migrate_disable() wrecks pretty much all Real-Time scheduler theory we have, and PREEMPT_RT bringing it in is somewhat ironic. Yes, it allows breaking up non-preemptible regions of non-deterministic duration, and thereby both reduce and bound the scheduling latency; the cost for doing that is that the theory on CPU utilization/bandwidth goes out the window. To easily see this consider an SMP system with a number of tasks equal to the number of CPUs. On a regular (preempt_disable) kernel we can always run each task, by virtue of always having an idle CPU to take the task. However, with migrate_disable() we can have each task preempted in a migrate_disable() region, worse we can stack them all on the _same_ CPU (super ridiculous odds, sure). And then we end up only able to run one task, with the rest of the CPUs picking their nose. The end result is that, like with unbounded latency, we will still miss our deadline, simply because we got starved for CPU. Now, while we could (with a _lot_ of work) rework the kernel to not rely on the implicit per-cpu ness of things like spinlock_t, the moment we bring in basic primitives that rely on migrate_disable() we're stuck with it. The problem is: afaict there's been no research into this problem. There might be scheduling (read: balancing) schemes that can mitigate/solve this problem, or it might prove to be a 'hard' problem, I just don't know. But once we start down this road, it's going to be hell to get rid of it. That's why I've been arguing (for many years) to strictly limit this to PREEMPT_RT and only as a stop-gap, not a fundamental primitive to build on.
On Mon, Sep 21, 2020 at 09:27:57PM +0200, Thomas Gleixner wrote: > Alternatively this could of course be solved with per CPU page tables > which will come around some day anyway I fear. Previously (with PTI) we looked at making the entire kernel map per-CPU, and that takes a 2K copy on switch_mm() (or more general, the user part of whatever the top level directory is for architectures that have a shared kernel/user page-table setup in the first place). The idea was having a fixed per-cpu kernel page-table, share a bunch of (kernel) page-tables between all CPUs and then copy in the user part on switch. I've forgotten what the plan was for ASID/PCID in that scheme. For x86_64 we've been fearing the performance of that 2k copy, but I don't think we've ever actually bit the bullet and implemented it to see how bad it really is.
On Wed, Sep 23 2020 at 12:19, peterz wrote: > On Mon, Sep 21, 2020 at 09:27:57PM +0200, Thomas Gleixner wrote: >> Alternatively this could of course be solved with per CPU page tables >> which will come around some day anyway I fear. > > Previously (with PTI) we looked at making the entire kernel map per-CPU, > and that takes a 2K copy on switch_mm() (or more general, the user part > of whatever the top level directory is for architectures that have a > shared kernel/user page-table setup in the first place). > > The idea was having a fixed per-cpu kernel page-table, share a bunch of > (kernel) page-tables between all CPUs and then copy in the user part on > switch. > > I've forgotten what the plan was for ASID/PCID in that scheme. > > For x86_64 we've been fearing the performance of that 2k copy, but I > don't think we've ever actually bit the bullet and implemented it to see > how bad it really is. I actually did at some point and depending on the workload the overhead was clearly measurable. And yes, it fell apart with PCID and I could not come up with a scheme for it which did not suck horribly. So I buried the patches in the poison cabinet. Aside of that, we'd need to implement that for eight other architectures as well... Thanks, tglx
On Wed, Sep 23 2020 at 10:40, peterz wrote: > Right, so I'm concerned. migrate_disable() wrecks pretty much all > Real-Time scheduler theory we have, and PREEMPT_RT bringing it in is > somewhat ironic. It's even more ironic that the approach of PREEMPT_RT has been 'pragmatic ignorance of theory' from the very beginning and despite violating all theories it still works. :) > Yes, it allows breaking up non-preemptible regions of non-deterministic > duration, and thereby both reduce and bound the scheduling latency; the > cost for doing that is that the theory on CPU utilization/bandwidth goes > out the window. I agree that the theory goes out of the window, but does it matter in practice? I've yet to see a report of migrate disable stacking being the culprit of a missed deadline and I surely have stared at lots of reports in the past 10+ years. > To easily see this consider an SMP system with a number of tasks equal > to the number of CPUs. On a regular (preempt_disable) kernel we can > always run each task, by virtue of always having an idle CPU to take the > task. > > However, with migrate_disable() we can have each task preempted in a > migrate_disable() region, worse we can stack them all on the _same_ CPU > (super ridiculous odds, sure). And then we end up only able to run one > task, with the rest of the CPUs picking their nose. > > The end result is that, like with unbounded latency, we will still miss > our deadline, simply because we got starved for CPU. I'm well aware of that. > Now, while we could (with a _lot_ of work) rework the kernel to not rely > on the implicit per-cpu ness of things like spinlock_t, the moment we > bring in basic primitives that rely on migrate_disable() we're stuck > with it. Right, but we are stuck with per CPUness and disentangling that is just infeasible IMO. > The problem is: afaict there's been no research into this problem. There is no research on a lot of things the kernel does :) > There might be scheduling (read: balancing) schemes that can > mitigate/solve this problem, or it might prove to be a 'hard' problem, > I just don't know. In practice balancing surely can take the number of preempted tasks which are in a migrate disable section into account which would be just another measure to work around the fact that the kernel is not adhering to the theories. It never did that even w/o migrate disable. > But once we start down this road, it's going to be hell to get rid of > it. Like most of the other things the kernel came up with to deal with the oddities of modern hardware :) > That's why I've been arguing (for many years) to strictly limit this to > PREEMPT_RT and only as a stop-gap, not a fundamental primitive to build > on. I know, but short of rewriting the world, I'm not seeing the faintest plan to remove the stop-gap. :) As we discussed not long ago we have too many inconsistent preemption models already. RT is adding yet another one. And that's worse than introducing migrate disable as a generally available facility. IMO, reaching a point of consistency where our different preemption models (NONE, VOLUNTARY, PREEMPT, RT) build on each other is far more important. For !RT migrate disable is far less of a danger than for RT kernels because the amount of code which will use it is rather limited compared to code which still will disable preemption implicitly through spin and rw locks.
On RT converting these locks to 'sleepable spinlocks' is only possible because RT forces almost everything into task context and migrate disable is just the obvious decomposition of preempt disable which implicitly disables migration. But that means that RT is by orders of magnitude more prone to run into the scheduling trainwreck you are worried about. It just refuses to do so at least with real-world workloads. I'm surely in favour of having solid theories behind implementation, but at some point you just have to bite the bullet and choose pragmatism in order to make progress. Proliferating inconsistency is not real progress, as it is violating the most fundamental engineering principles. That's by far more dangerous than violating scheduling theories which are built on perfect models and therefore enforce violation by practical implementations anyway. Thanks, tglx
On Mon, Sep 21 2020 at 21:27, Thomas Gleixner wrote: > On Mon, Sep 21 2020 at 09:24, Linus Torvalds wrote: >> On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> Maybe we really *could* call this new kmap functionality something >> like "kmap_percpu()" (or maybe "local" is good enough), and make it >> act like your RT code does for spinlocks - not disable preemption, but >> only disabling CPU migration. > > I"m all for it, but the scheduler people have opinions :) I just took the latest version of migrate disable patches https://lore.kernel.org/r/20200921163557.234036895@infradead.org removed the RT dependency on top of them and adopted the kmap stuff (addressing the various comments while at it and unbreaking ARM). I'm not going to post that until there is consensus about the general availability of migrate disable, but for those who want to play with it I've pushed the resulting combo out to: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git highmem For testing I've tweaked a few places to use the new _local() variants and it survived testing so far and I've verified that there is actual preemption which means zap/restore of the thread local kmaps. Thanks, tglx
On Wed, 23 Sep 2020 10:40:32 +0200 peterz@infradead.org wrote: > However, with migrate_disable() we can have each task preempted in a > migrate_disable() region, worse we can stack them all on the _same_ CPU > (super ridiculous odds, sure). And then we end up only able to run one > task, with the rest of the CPUs picking their nose. What if we just made migrate_disable() a local_lock() available for !RT? I mean make it a priority inheritance PER CPU lock. That is, no two tasks could do a migrate_disable() on the same CPU? If one task does a migrate_disable() and then gets preempted and the preempting task does a migrate_disable() on the same CPU, it will block and wait for the first task to do a migrate_enable(). No two tasks on the same CPU could enter the migrate_disable() section simultaneously, just like no two tasks could enter a preempt_disable() section. In essence, we just allow local_lock() to be used for both RT and !RT. Perhaps make migrate_disable() an anonymous local_lock()? This should lower the SHC in theory, if you can't have stacked migrate disables on the same CPU. -- Steve
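For reference, the shape of what Steven is proposing, expressed with the existing local_lock primitives (purely illustrative, not a posted patch; today's !RT local_lock() is a preempt_disable() plus lockdep annotation, so the blocking PI behaviour he describes would be an extension of that):

	#include <linux/local_lock.h>

	static DEFINE_PER_CPU(local_lock_t, kmap_migrate_lock) =
		INIT_LOCAL_LOCK(kmap_migrate_lock);

	static inline void migrate_disable_as_lock(void)
	{
		/* On RT: a per-CPU sleeping lock with priority inheritance,
		 * so at most one task per CPU sits in such a section */
		local_lock(&kmap_migrate_lock);
	}

	static inline void migrate_enable_as_unlock(void)
	{
		local_unlock(&kmap_migrate_lock);
	}

Whether serializing all would-be migrate_disable() users on a per-CPU lock is acceptable is exactly what the follow-ups dispute.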
On Wed, Sep 23 2020 at 11:52, Steven Rostedt wrote: > On Wed, 23 Sep 2020 10:40:32 +0200 > peterz@infradead.org wrote: > >> However, with migrate_disable() we can have each task preempted in a >> migrate_disable() region, worse we can stack them all on the _same_ CPU >> (super ridiculous odds, sure). And then we end up only able to run one >> task, with the rest of the CPUs picking their nose. > > What if we just made migrate_disable() a local_lock() available for !RT? > > I mean make it a priority inheritance PER CPU lock. > > That is, no two tasks could do a migrate_disable() on the same CPU? If > one task does a migrate_disable() and then gets preempted and the > preempting task does a migrate_disable() on the same CPU, it will block > and wait for the first task to do a migrate_enable(). > > No two tasks on the same CPU could enter the migrate_disable() section > simultaneously, just like no two tasks could enter a preempt_disable() > section. > > In essence, we just allow local_lock() to be used for both RT and !RT. > > Perhaps make migrate_disable() an anonymous local_lock()? > > This should lower the SHC in theory, if you can't have stacked migrate > disables on the same CPU. I'm pretty sure this ends up in locking hell pretty fast and aside of that it's not working for scenarios like: kmap_local(); migrate_disable(); ... copy_from_user() -> #PF -> schedule() which brought us into that discussion in the first place. You would stop any other migrate disable user from running until the page fault is resolved... Thanks, tglx
On Wed, 23 Sep 2020 22:55:54 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > > Perhaps make migrate_disable() an anonymous local_lock()? > > > > This should lower the SHC in theory, if you can't have stacked migrate > > disables on the same CPU. > > I'm pretty sure this ends up in locking hell pretty fast and aside of > that it's not working for scenarios like: > > kmap_local(); > migrate_disable(); > ... > > copy_from_user() > -> #PF > -> schedule() > > which brought us into that discussion in the first place. You would stop > any other migrate disable user from running until the page fault is > resolved... Then scratch the idea of having anonymous local_lock() and just bring local_lock in directly? Then have a kmap local lock, which would only block those that need to do a kmap. Now as for migration disabled nesting, at least now we would have groupings of this, and perhaps the theorists can handle that. I mean, how is this much different that having a bunch of tasks blocked on a mutex with the owner is pinned on a CPU? migrate_disable() is a BKL of pinning affinity. If we only have local_lock() available (even on !RT), then it makes the blocking in groups. At least this way you could grep for all the different local_locks in the system and plug that into the algorithm for WCS, just like one would with a bunch of mutexes. -- Steve
On Wed, Sep 23 2020 at 17:12, Steven Rostedt wrote: > On Wed, 23 Sep 2020 22:55:54 +0200 > Then scratch the idea of having anonymous local_lock() and just bring > local_lock in directly? Then have a kmap local lock, which would only > block those that need to do a kmap. That's still going to end up in lock ordering nightmares and you lose the ability to use kmap_local from arbitrary contexts which was again one of the goals of this exercise. Aside of that you're imposing reentrancy protections on something which does not need it in the first place. > Now as for migration disabled nesting, at least now we would have > groupings of this, and perhaps the theorists can handle that. I mean, > how is this much different that having a bunch of tasks blocked on a > mutex with the owner is pinned on a CPU? > > migrate_disable() is a BKL of pinning affinity. No. That's just wrong. preempt disable is a concurrency control, i.e. protecting against reentrancy on a given CPU. But it's a cpu global protection which means that it's not protecting a specific code path. Contrary to preempt disable, migrate disable is not protecting against reentrancy on a given CPU. It's a temporary restriction to the scheduler on placement. The fact that disabling preemption implicitly disables migration does not make them semantically equivalent. > If we only have local_lock() available (even on !RT), then it makes > the blocking in groups. At least this way you could grep for all the > different local_locks in the system and plug that into the algorithm > for WCS, just like one would with a bunch of mutexes. You cannot do that on RT at all where migrate disable is substituting preempt disable in spin and rw locks. The result would be the same as with a !RT kernel just with horribly bad performance. That means the stacking problem has to be solved anyway. So why on earth do you want to create yet another special duct tape case for kmap_local() which proliferates inconsistency instead of aiming for consistency across all preemption models? Thanks, tglx
On Wed, Sep 23, 2020 at 11:52:51AM -0400, Steven Rostedt wrote: > On Wed, 23 Sep 2020 10:40:32 +0200 > peterz@infradead.org wrote: > > > However, with migrate_disable() we can have each task preempted in a > > migrate_disable() region, worse we can stack them all on the _same_ CPU > > (super ridiculous odds, sure). And then we end up only able to run one > > task, with the rest of the CPUs picking their nose. > > What if we just made migrate_disable() a local_lock() available for !RT? Can't, neither migrate_disable() nor migrate_enable() are allowed to block -- which is what makes their implementation so 'interesting'. > This should lower the SHC in theory, if you can't have stacked migrate > disables on the same CPU. See this email in that other thread (you're on Cc too IIRC): https://lkml.kernel.org/r/20200923170809.GY1362448@hirez.programming.kicks-ass.net I think that if we 'frob' the balance PULL, we'll end up with something similar. Whichever way around we turn this thing, the migrate_disable() runtime (we'll have to add a tracer for that) will be an interference term on the lower priority task, exactly like preempt_disable() would be. We'll just not exclude a higher priority task from running. AFAICT, the best case is a single migrate_disable() nesting, where a higher priority task preempts in a migrate_disable() section -- this is per design. When this preempted task becomes eligible to run under the ideal model (IOW it becomes one of the M highest priority tasks), it might still have to wait for the preemptee's migrate_disable() section to complete. Thereby suffering interference in the exact duration of migrate_disable() section. Per this argument, the change from preempt_disable() to migrate_disable() gets us: - higher priority tasks gain reduced wake-up latency - lower priority tasks are unchanged and are subject to the exact same interference term as if the higher priority task were using preempt_disable(). Since we've already established this term is unbounded, any task but the highest priority task is basically buggered. TL;DR, if we get balancing fixed and achieve (near) the optimal case above, migrate_disable() is an over-all win. But it's provably non-deterministic as long as the migrate_disable() sections are non-deterministic. The reason this all mostly works in practice is (I think) because: - People care most about the higher prio RT tasks and craft them to mostly avoid the migrate_disable() infected code. - The preemption scenario is most pronounced at higher utilization scenarios, and I suspect this is fairly rare to begin with. - And while many of these migrate_disable() regions are unbound in theory, in practice they're often fairly reasonable. So my current todo list is: - Change RT PULL - Change DL PULL - Add migrate_disable() tracer; exactly like preempt/irqoff, except measuring task-runtime instead of cpu-time. - Add a mode that measures actual interference. - Add a traceevent to detect preemption in migrate_disable(). And then I suppose I should twist Daniel's arm to update his model to include these scenarios and numbers.
On Thu, 24 Sep 2020 08:57:52 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> > Now as for migration disabled nesting, at least now we would have
> > groupings of this, and perhaps the theorists can handle that. I mean,
> > how is this much different than having a bunch of tasks blocked on a
> > mutex where the owner is pinned on a CPU?
> >
> > migrate_disable() is a BKL of pinning affinity.
>
> No. That's just wrong. preempt disable is a concurrency control,

I think you totally misunderstood what I was saying. The above wasn't
about comparing preempt_disable to migrate_disable. It was comparing
migrate_disable to a chain of tasks blocked on mutexes where the top
owner has preempt_disable set. You still have a bunch of tasks that
can't move to other CPUs.

> > If we only have local_lock() available (even on !RT), then it makes
> > the blocking in groups. At least this way you could grep for all the
> > different local_locks in the system and plug that into the algorithm
> > for WCS, just like one would with a bunch of mutexes.
>
> You cannot do that on RT at all where migrate disable is substituting
> preempt disable in spin and rw locks. The result would be the same as
> with a !RT kernel just with horribly bad performance.

Note, the spin and rwlocks already have a lock associated with them. Why
would it be any different on RT? I wasn't suggesting adding another lock
inside a spinlock. Why would I recommend THAT? I wasn't recommending
blindly replacing migrate_disable() with local_lock(). I just meant to
expose local_lock() but not migrate_disable().

>
> That means the stacking problem has to be solved anyway.
>
> So why on earth do you want to create yet another special duct tape case
> for kmap_local() which proliferates inconsistency instead of aiming for
> consistency across all preemption models?

The idea was to help with the scheduling issue.

Anyway, instead of blocking, what about having a counter of the number of
migrate-disabled tasks per CPU, and when taking a migrate_disable(), if
there's already another task with migrate_disable() set and the current
task has an affinity greater than 1, it tries to migrate to another CPU?

This way migrate_disable() is less likely to end up with a bunch of tasks
stuck on one CPU, serialized by each task exiting its migrate_disable()
section. Yes, there's more overhead, but it only happens if multiple tasks
enter a migrate disable section on the same CPU.

-- Steve
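A hand-wavy sketch of that idea (hypothetical code, not a patch:
nr_migrate_disabled and try_to_push_current_elsewhere() are invented
names, nesting and the migrate_enable() side are omitted, and the real
migrate_disable() looks nothing like this):

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/sched.h>

static DEFINE_PER_CPU(unsigned int, nr_migrate_disabled);

/* Hypothetical hook asking the scheduler to move @p to another CPU. */
static void try_to_push_current_elsewhere(struct task_struct *p);

/* Stands in for the real migrate_disable(). */
static void sketch_migrate_disable(void)
{
        struct task_struct *p = current;

        preempt_disable();
        if (this_cpu_read(nr_migrate_disabled) &&
            cpumask_weight(p->cpus_ptr) > 1) {
                /*
                 * Another migrate-disabled task is already pinned to this
                 * CPU and we are still allowed elsewhere: try to move away
                 * before pinning ourselves, so the sections don't
                 * serialize behind each other.
                 */
                try_to_push_current_elsewhere(p);
        }
        this_cpu_inc(nr_migrate_disabled);
        /* ... the actual pinning of p would happen here ... */
        preempt_enable();
}

migrate_enable() would decrement the counter again; as the reply below
points out, this still does not help once more pinned tasks are runnable
than there are CPUs.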
On Thu, Sep 24, 2020 at 08:32:41AM -0400, Steven Rostedt wrote:
> Anyway, instead of blocking, what about having a counter of the number of
> migrate-disabled tasks per CPU, and when taking a migrate_disable(), if
> there's already another task with migrate_disable() set and the current
> task has an affinity greater than 1, it tries to migrate to another CPU?

That doesn't solve the problem. On wakeup we should already prefer an
idle CPU over one running a (RT) task, but you can always wake more tasks
than there are CPUs around and you'll _have_ to stack at some point.

The trick is how to unstack them correctly. We need to detect when a
migrate_disable() task _should_ start running again, and migrate away
whoever is in the way at that point.

It turns out that getting selected for pull-balance is exactly that
condition, and clearly a migrate_disable() task cannot be pulled, but we
can use that signal to try and pull away the running task that's in the
way.
On Thu, 24 Sep 2020 14:42:41 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 24, 2020 at 08:32:41AM -0400, Steven Rostedt wrote:
> > Anyway, instead of blocking, what about having a counter of the number of
> > migrate-disabled tasks per CPU, and when taking a migrate_disable(), if
> > there's already another task with migrate_disable() set and the current
> > task has an affinity greater than 1, it tries to migrate to another CPU?
>
> That doesn't solve the problem. On wakeup we should already prefer an
> idle CPU over one running a (RT) task, but you can always wake more
> tasks than there are CPUs around and you'll _have_ to stack at some point.

Yes, understood.

> The trick is how to unstack them correctly. We need to detect when a
> migrate_disable() task _should_ start running again, and migrate away
> whoever is in the way at that point.
>
> It turns out that getting selected for pull-balance is exactly that
> condition, and clearly a migrate_disable() task cannot be pulled, but we
> can use that signal to try and pull away the running task that's in the
> way.

Unless of course that running task is in a migrate disable section
itself ;-)

But I guess we will always have that SHC, and there will always be a
scenario that you can't balance properly. Hopefully in practice we won't
see that.

As for how to handle kmap_local(): will migrate_disable() be used only for
32bit, or will it, for consistency, also apply to 64bit?

-- Steve
On Thu, Sep 24, 2020 at 09:51:38AM -0400, Steven Rostedt wrote:
> > It turns out that getting selected for pull-balance is exactly that
> > condition, and clearly a migrate_disable() task cannot be pulled, but we
> > can use that signal to try and pull away the running task that's in the
> > way.
>
> Unless of course that running task is in a migrate disable section
> itself ;-)

See my ramblings here:

  https://lkml.kernel.org/r/20200924082717.GA1362448@hirez.programming.kicks-ass.net

My plan was to have the migrate_enable() of the running task trigger the
migration in that case.
On Thu, Sep 24 2020 at 08:32, Steven Rostedt wrote:
> On Thu, 24 Sep 2020 08:57:52 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> > Now as for migration disabled nesting, at least now we would have
>> > groupings of this, and perhaps the theorists can handle that. I mean,
>> > how is this much different than having a bunch of tasks blocked on a
>> > mutex where the owner is pinned on a CPU?
>> >
>> > migrate_disable() is a BKL of pinning affinity.
>>
>> No. That's just wrong. preempt disable is a concurrency control,
>
> I think you totally misunderstood what I was saying. The above wasn't about
> comparing preempt_disable to migrate_disable. It was comparing
> migrate_disable to a chain of tasks blocked on mutexes where the top owner
> has preempt_disable set. You still have a bunch of tasks that can't move to
> other CPUs.

What? The top owner does not prevent any task from moving. The tasks
cannot move because they are blocked on the mutex, which means they are
not runnable, and non-runnable tasks are not migrated at all.

I really don't understand what you are trying to say.

>> > If we only have local_lock() available (even on !RT), then it makes
>> > the blocking in groups. At least this way you could grep for all the
>> > different local_locks in the system and plug that into the algorithm
>> > for WCS, just like one would with a bunch of mutexes.
>>
>> You cannot do that on RT at all where migrate disable is substituting
>> preempt disable in spin and rw locks. The result would be the same as
>> with a !RT kernel just with horribly bad performance.
>
> Note, the spin and rwlocks already have a lock associated with them. Why
> would it be any different on RT? I wasn't suggesting adding another lock
> inside a spinlock. Why would I recommend THAT? I wasn't recommending
> blindly replacing migrate_disable() with local_lock(). I just meant to
> expose local_lock() but not migrate_disable().

We already exposed local_lock() to non-RT and it's for places which do
preempt_disable() or local_irq_disable() without having a lock
associated. But both primitives are scopeless and therefore behave like
CPU-local BKLs. What local_lock() provides in these cases is:

  - Making the protection scope clear by associating a named local lock
    which is covered by lockdep.

  - It still maps to preempt_disable() or local_irq_disable() in !RT
    kernels.

  - The scope and the named lock allow RT kernels to substitute real
    (recursion aware) locking primitives which keep preemption and
    interrupts enabled, but provide the fine grained protection for the
    scoped critical section.

So how would you substitute migrate_disable() with a local_lock()? You
can't. Again, migrate_disable() is NOT a concurrency control and
therefore it cannot be substituted by any concurrency control primitive.

>> That means the stacking problem has to be solved anyway.
>>
>> So why on earth do you want to create yet another special duct tape case
>> for kmap_local() which proliferates inconsistency instead of aiming for
>> consistency across all preemption models?
>
> The idea was to help with the scheduling issue.

I don't see how mixing concepts and trying to duct tape a problem which is
clearly in the realm of the scheduler, i.e. load balancing and placement
algorithms, is helpful.

Thanks,

        tglx
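For reference, a minimal example of the scoped protection local_lock()
gives on both preemption models, based on the include/linux/local_lock.h
API (the struct and function names here are made up):

#include <linux/local_lock.h>
#include <linux/percpu.h>

struct foo_stats {
        local_lock_t    lock;
        unsigned long   events;
};

static DEFINE_PER_CPU(struct foo_stats, foo_stats) = {
        .lock = INIT_LOCAL_LOCK(lock),
};

static void foo_count_event(void)
{
        /*
         * !RT: maps to preempt_disable(), with the scope documented and
         *      covered by lockdep.
         * RT:  takes a per-CPU lock; preemption and interrupts stay
         *      enabled.
         */
        local_lock(&foo_stats.lock);
        __this_cpu_inc(foo_stats.events);
        local_unlock(&foo_stats.lock);
}

There is no way to attach such a named scope to migrate_disable(), which
is the point about it not being substitutable by a concurrency control
primitive.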
On Thu, 24 Sep 2020 19:55:10 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Thu, Sep 24 2020 at 08:32, Steven Rostedt wrote:
> > On Thu, 24 Sep 2020 08:57:52 +0200
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> >> > Now as for migration disabled nesting, at least now we would have
> >> > groupings of this, and perhaps the theorists can handle that. I mean,
> >> > how is this much different than having a bunch of tasks blocked on a
> >> > mutex where the owner is pinned on a CPU?
> >> >
> >> > migrate_disable() is a BKL of pinning affinity.
> >>
> >> No. That's just wrong. preempt disable is a concurrency control,
> >
> > I think you totally misunderstood what I was saying. The above wasn't about
> > comparing preempt_disable to migrate_disable. It was comparing
> > migrate_disable to a chain of tasks blocked on mutexes where the top owner
> > has preempt_disable set. You still have a bunch of tasks that can't move to
> > other CPUs.
>
> What? The top owner does not prevent any task from moving. The tasks
> cannot move because they are blocked on the mutex, which means they are
> not runnable, and non-runnable tasks are not migrated at all.

And neither are migrate-disabled tasks that have been preempted by a
higher priority task.

> I really don't understand what you are trying to say.

Don't worry about it. I was just making a high level comparison of how
migrate-disabled tasks blocked behind a higher priority task are similar
to tasks blocked on a mutex held by a pinned task that is preempted by a
higher priority task. But we can forget this analogy as it's not
appropriate for the current conversation.

> >> > If we only have local_lock() available (even on !RT), then it makes
> >> > the blocking in groups. At least this way you could grep for all the
> >> > different local_locks in the system and plug that into the algorithm
> >> > for WCS, just like one would with a bunch of mutexes.
> >>
> >> You cannot do that on RT at all where migrate disable is substituting
> >> preempt disable in spin and rw locks. The result would be the same as
> >> with a !RT kernel just with horribly bad performance.
> >
> > Note, the spin and rwlocks already have a lock associated with them. Why
> > would it be any different on RT? I wasn't suggesting adding another lock
> > inside a spinlock. Why would I recommend THAT? I wasn't recommending
> > blindly replacing migrate_disable() with local_lock(). I just meant to
> > expose local_lock() but not migrate_disable().
>
> We already exposed local_lock() to non-RT and it's for places which do
> preempt_disable() or local_irq_disable() without having a lock
> associated. But both primitives are scopeless and therefore behave like
> CPU-local BKLs. What local_lock() provides in these cases is:
>
>   - Making the protection scope clear by associating a named local lock
>     which is covered by lockdep.
>
>   - It still maps to preempt_disable() or local_irq_disable() in !RT
>     kernels.
>
>   - The scope and the named lock allow RT kernels to substitute real
>     (recursion aware) locking primitives which keep preemption and
>     interrupts enabled, but provide the fine grained protection for the
>     scoped critical section.

I'm very much aware of the above.

> So how would you substitute migrate_disable() with a local_lock()? You
> can't. Again, migrate_disable() is NOT a concurrency control and
> therefore it cannot be substituted by any concurrency control primitive.
When I was first writing my email, I was describing a way to replace
migrate_disable() with a construct similar to local locks without actually
mentioning local locks, but then rewrote it to say "local locks" in an
attempt to simplify what I was writing. I shouldn't have done that,
because it made it look like I wanted to use local_lock() unmodified. I
was actually thinking of a new construct that was similar to, but not
exactly the same as, a local lock.

But this will just make things more complex and we can forget about it.
I'll wait to see what Peter produces.

-- Steve
On 9/24/20 10:27 AM, peterz@infradead.org wrote:
> So my current todo list is:
>
>  - Change RT PULL
>  - Change DL PULL
>  - Add migrate_disable() tracer; exactly like preempt/irqoff, except
>    measuring task-runtime instead of cpu-time.
>  - Add a mode that measures actual interference.
>  - Add a traceevent to detect preemption in migrate_disable().
>
> And then I suppose I should twist Daniel's arm to update his model to
> include these scenarios and numbers.

Challenge accepted :-)

-- Daniel