[00/13] preempt: Make preempt count unconditional

Message ID 20200914204209.256266093@linutronix.de

Message

Thomas Gleixner Sept. 14, 2020, 8:42 p.m. UTC
Folks!

While working on various preempt count related things, I stumbled (again)
over the inconsistency of our preempt count handling.

The handling of preempt_count() is inconsistent across kernel
configurations. On kernels with PREEMPT_COUNT=n,
preempt_disable/enable() and the lock/unlock functions do not affect the
preempt count; only local_bh_disable/enable() and the _bh variants of
locking, soft interrupt delivery, hard interrupt and NMI context affect it.

It's therefore impossible to have a consistent set of checks which provide
information about the context in which a function is called. In many cases
it makes sense to have separate functions for separate contexts, but there
are valid reasons to avoid that and handle different calling contexts
conditionally.

The lack of such indicators which work on all kernel configurations is a
constant source of trouble because developers either do not understand the
implications or try to work around this inconsistency in weird ways. Nor
are these issues caught by reviewers and testing.

Recently merged code does:

	 gfp = preemptible() ? GFP_KERNEL : GFP_ATOMIC;

Looks obviously correct, except for the fact that preemptible() is
unconditionally false for CONFIG_PREEMPT_COUNT=n, i.e. all allocations in
that code use GFP_ATOMIC on such kernels.
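
For reference, preemptible() is defined roughly like this in
include/linux/preempt.h (a sketch; the real header has more detail):

	#ifdef CONFIG_PREEMPTION	/* selects CONFIG_PREEMPT_COUNT */
	#define preemptible()	(preempt_count() == 0 && !irqs_disabled())
	#else
	#define preemptible()	0	/* compile-time false */
	#endif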

Attempts to make preempt count unconditional and consistent have been
rejected in the past with handwaving performance arguments.

Freshly conducted benchmarks did not reveal any measurable impact from
enabling preempt count unconditionally. On kernels with CONFIG_PREEMPT_NONE
or CONFIG_PREEMPT_VOLUNTARY the preempt count is only incremented and
decremented, but the result of the decrement is not tested. In contrast,
enabling CONFIG_PREEMPT, which tests the result, has a small but
measurable impact due to the conditional branch/call.
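
The difference shows up directly in the two flavors of preempt_enable(),
which with an unconditional preempt count reduce roughly to the following
(a sketch of include/linux/preempt.h, details elided; the second flavor is
a plain decrement without the conditional):

	#ifdef CONFIG_PREEMPTION
	#define preempt_enable() \
	do { \
		barrier(); \
		if (unlikely(preempt_count_dec_and_test())) \
			__preempt_schedule(); \
	} while (0)
	#else
	#define preempt_enable() \
	do { \
		barrier(); \
		preempt_count_dec(); \
	} while (0)
	#endif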

It's about time to make essential functionality of the kernel consistent
across the various preemption models.

The series is also available from git:

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git preempt

That's the first part of a larger effort related to preempt count:

 1) The analysis of the usage sites of in_interrupt(), in_atomic(),
    in_softirq() is still ongoing, but so far the buggy users are clearly
    the vast majority. There will be separate patch series (currently 46
    and counting) to address these issues once the analysis is complete
    in the coming days.

 2) The long discussed state tracking of local irq disable in the preempt
    count, which accounts interrupt disabled sections as atomic and avoids
    issuing costly instructions (sti, cli, popf or their non-X86
    counterparts) when the state does not change, i.e. on nested
    irq_save()/irq_restore(). I have this working on X86 already, and
    contrary to my earlier attempts it was reasonably straightforward due
    to the recent entry/exit code consolidation. A rough model of the
    idea is sketched below, after this list.

    What I've not done yet is to optimize the preempt count handling
    of the [un]lock_irq* operations so they handle the interrupt disabled
    state and the preempt count modification in one go. That's an obvious
    add-on, but correctness first ...

 3) Lazy interrupt disabling as a straightforward extension of #2. This
    avoids the actual disabling at the CPU level completely, catches an
    incoming interrupt in the low level entry code, modifies the interrupt
    disabled state on the return stack, notes the interrupt as pending in
    software and raises it again when interrupts are reenabled. This
    still has a few issues which I'm hunting down (cpuidle is unhappy ...)
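
As a rough, self-contained model of the state tracking in #2 (the names
are illustrative, not actual kernel interfaces):

	#include <stdbool.h>

	/*
	 * The interrupt disabled state is tracked in software so that
	 * nested save/restore pairs can skip the costly cli/sti/popf
	 * when the state does not change.
	 */
	static unsigned int preempt_cnt;  /* stand-in for the preempt count */
	static bool irqs_off;             /* tracked state, not the HW state */

	static void cpu_irq_off(void) { } /* would execute cli or equivalent */
	static void cpu_irq_on(void)  { } /* would execute sti or equivalent */

	static bool model_irq_save(void)
	{
		bool was_off = irqs_off;

		if (!was_off) {           /* touch the HW only on a change */
			cpu_irq_off();
			irqs_off = true;
			preempt_cnt++;    /* irq-off section counts as atomic */
		}
		return was_off;           /* caller hands this to restore() */
	}

	static void model_irq_restore(bool was_off)
	{
		if (!was_off) {           /* only the outermost restore */
			preempt_cnt--;
			irqs_off = false;
			cpu_irq_on();
		}
	}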

Thanks,

	tglx
---
 arch/arm/include/asm/assembler.h                                 |   11 --
 arch/arm/kernel/iwmmxt.S                                         |    2 
 arch/arm/mach-ep93xx/crunch-bits.S                               |    2 
 arch/xtensa/kernel/entry.S                                       |    2 
 drivers/gpu/drm/i915/Kconfig.debug                               |    1 
 drivers/gpu/drm/i915/i915_utils.h                                |    3 
 include/linux/bit_spinlock.h                                     |    4 -
 include/linux/lockdep.h                                          |    6 -
 include/linux/pagemap.h                                          |    4 -
 include/linux/preempt.h                                          |   37 +---------
 include/linux/uaccess.h                                          |    6 -
 kernel/Kconfig.preempt                                           |    4 -
 kernel/sched/core.c                                              |    6 -
 lib/Kconfig.debug                                                |    3 
 lib/Kconfig.debug.rej                                            |   14 +--
 tools/testing/selftests/rcutorture/configs/rcu/SRCU-t            |    1 
 tools/testing/selftests/rcutorture/configs/rcu/SRCU-u            |    1 
 tools/testing/selftests/rcutorture/configs/rcu/TINY01            |    1 
 tools/testing/selftests/rcutorture/doc/TINY_RCU.txt              |    5 -
 tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt      |    1 
 tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h |    1 
 21 files changed, 23 insertions(+), 92 deletions(-)

Comments

Steven Rostedt Sept. 14, 2020, 8:54 p.m. UTC | #1
On Mon, 14 Sep 2020 22:42:09 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> 21 files changed, 23 insertions(+), 92 deletions(-)

This alone makes it look promising, and hopefully acceptable to Linus :-)

-- Steve
Linus Torvalds Sept. 14, 2020, 8:59 p.m. UTC | #2
On Mon, Sep 14, 2020 at 1:45 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Recently merged code does:
>
>          gfp = preemptible() ? GFP_KERNEL : GFP_ATOMIC;
>
> Looks obviously correct, except for the fact that preemptible() is
> unconditionally false for CONFIG_PREEMPT_COUNT=n, i.e. all allocations in
> that code use GFP_ATOMIC on such kernels.

I don't think this is a good reason to entirely get rid of the no-preempt thing.

The above is just garbage. It's bogus. You can't do it.

Blaming the no-preempt code for this bug is extremely unfair, imho.

And the no-preempt code does help make for much better code generation
for simple spinlocks.
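
Roughly, on the generic SMP lock path (a sketch with the lockdep details
elided):

	static inline void __raw_spin_lock(raw_spinlock_t *lock)
	{
		preempt_disable();	/* a compiler barrier only when
					   PREEMPT_COUNT=n, an increment of
					   the preempt count otherwise */
		do_raw_spin_lock(lock);	/* the actual atomic acquisition */
	}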

Where is that horribly buggy recent code? It's not in that exact
format, certainly, since 'grep' doesn't find it.

             Linus
Thomas Gleixner Sept. 14, 2020, 9:55 p.m. UTC | #3
On Mon, Sep 14 2020 at 13:59, Linus Torvalds wrote:
> On Mon, Sep 14, 2020 at 1:45 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Recently merged code does:
>>
>>          gfp = preemptible() ? GFP_KERNEL : GFP_ATOMIC;
>>
>> Looks obviously correct, except for the fact that preemptible() is
>> unconditionally false for CONFIG_PREEMPT_COUNT=n, i.e. all allocations in
>> that code use GFP_ATOMIC on such kernels.
>
> I don't think this is a good reason to entirely get rid of the
> no-preempt thing.

I did not say that this is a good reason. It just illustrates the
problem.

> The above is just garbage. It's bogus. You can't do it.
>
> Blaming the no-preempt code for this bug is extremely unfair, imho.

I'm not blaming the no-preempt code. I'm blaming inconsistency, and there
is no really good argument for inconsistent behaviour, TBH.

> And the no-preempt code does help make for much better code generation
> for simple spinlocks.

Yes, it does generate better code, but I tried hard to spot a difference
in the various metrics exposed by perf. It's all in the noise, and I can
only spot a difference when the actual preemption check after the
decrement, which still depends on CONFIG_PREEMPT, is in place. That's not
the case for PREEMPT_NONE or PREEMPT_VOLUNTARY kernels, where the
decrement is just a decrement without any conditional behind it.

> Where is that horribly buggy recent code? It's not in that exact
> format, certainly, since 'grep' doesn't find it.

Bah, that was stuff in linux-next which got dropped again.

But just look at any check which uses preemptible(), especially those
which check !preemptible():

In the X86 #GP handler we have:

	/*
	 * To be potentially processing a kprobe fault and to trust the result
	 * from kprobe_running(), we have to be non-preemptible.
	 */
	if (!preemptible() &&
	    kprobe_running() &&
	    kprobe_fault_handler(regs, X86_TRAP_GP))
		goto exit;

and a similar check in the S390 code in kprobe_exceptions_notify(). That
all magically 'works' because that code might have actually been tested
with lockdep enabled, which enforces PREEMPT_COUNT...

The SG code has some interesting usage as well:

		if (miter->__flags & SG_MITER_ATOMIC) {
			WARN_ON_ONCE(preemptible());
			kunmap_atomic(miter->addr);

How is that WARN_ON_ONCE() supposed to catch anything? Especially as the
calling code does:

	flags = SG_MITER_TO_SG;
	if (!preemptible())
		flags |= SG_MITER_ATOMIC;

which is equally useless on kernels which have PREEMPT_COUNT=n.
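
Spelled out: with CONFIG_PREEMPT_COUNT=n, preemptible() is the constant 0,
so both checks degenerate at compile time:

	WARN_ON_ONCE(preemptible());	/* never fires, not even when called
					   from preemptible context */

	if (!preemptible())		/* always true, even in process
					   context ... */
		flags |= SG_MITER_ATOMIC;	/* ... so every iteration is
						   flagged "atomic" */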

There are bugs related to in_atomic() or other in_***() usage all over
the place as well.

Inconsistency at the core level is a clear recipe for disaster, and at
some point we have to bite the bullet and accept that consistency is more
important than a non-measurable 3 cycles.

Thanks,

        tglx
Paul E. McKenney Sept. 15, 2020, 5:25 p.m. UTC | #4
On Mon, Sep 14, 2020 at 01:59:15PM -0700, Linus Torvalds wrote:
> On Mon, Sep 14, 2020 at 1:45 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Recently merged code does:
> >
> >          gfp = preemptible() ? GFP_KERNEL : GFP_ATOMIC;
> >
> > Looks obviously correct, except for the fact that preemptible() is
> > unconditionally false for CONFIG_PREEMPT_COUNT=n, i.e. all allocations in
> > that code use GFP_ATOMIC on such kernels.
> 
> I don't think this is a good reason to entirely get rid of the no-preempt thing.
> 
> The above is just garbage. It's bogus. You can't do it.
> 
> Blaming the no-preempt code for this bug is extremely unfair, imho.
> 
> And the no-preempt code does help make for much better code generation
> for simple spinlocks.
> 
> Where is that horribly buggy recent code? It's not in that exact
> format, certainly, since 'grep' doesn't find it.

It would be convenient for that "gfp =" code to work, as this would
allow better cache locality while invoking RCU callbacks, and would
further provide better robustness against callback floods.  The full story
is quite long, but here are the alternatives that have not yet been proven
to be abject failures:

1.	Use workqueues to do the allocations in a clean context.
	While waiting for the allocations, the callbacks are queued
	in the old cache-busting manner.  This functions correctly,
	but in the meantime (which on busy systems can be some time)
	the cache locality and robustness are lost.

2.	Provide the ability to allocate memory in raw atomic context.
	This is extremely effective, especially when used in combination
	with #1 above, but as you might suspect, the MM guys don't like
	it much.

In contrast, with Thomas's patch series, call_rcu() and kvfree_rcu()
could just look at preemptible() to see whether or not it was safe to
allocate memory, even in !PREEMPT kernels -- and in the common case,
it almost always would be safe.  It is quite possible that this approach
would work in isolation, or failing that, that adding #1 above would do
the trick.
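
A rough sketch of the intended pattern (a hypothetical helper, not the
actual RCU implementation):

	/* Hypothetical: grab a page for callback batching when the context
	 * allows sleeping allocations, otherwise try atomically; a NULL
	 * return would fall back to the old cache-busting queueing (#1). */
	static void *alloc_batch_page(void)
	{
		gfp_t gfp = preemptible() ? GFP_KERNEL : GFP_ATOMIC;

		return (void *)__get_free_page(gfp);
	}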

I understand that this is all very hand-wavy, and I do apologize for that.
If you really want the full sad story with performance numbers and the
works, let me know!

							Thanx, Paul