diff mbox series

[V2,08/15] Documentation: Add lock ordering and nesting documentation

Message ID 20200318204408.211530902@linutronix.de (mailing list archive)
State Superseded
Headers show
Series Lock ordering documentation and annotation for lockdep | expand

Commit Message

Thomas Gleixner March 18, 2020, 8:43 p.m. UTC
From: Thomas Gleixner <tglx@linutronix.de>

The kernel provides a variety of locking primitives. The nesting of these
lock types and the implications of them on RT enabled kernels is nowhere
documented.

Add initial documentation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: Addressed review comments from Randy
---
 Documentation/locking/index.rst     |    1 
 Documentation/locking/locktypes.rst |  298 ++++++++++++++++++++++++++++++++++++
 2 files changed, 299 insertions(+)
 create mode 100644 Documentation/locking/locktypes.rst

Comments

Paul E. McKenney March 18, 2020, 10:31 p.m. UTC | #1
On Wed, Mar 18, 2020 at 09:43:10PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> The kernel provides a variety of locking primitives. The nesting of these
> lock types and the implications of them on RT enabled kernels is nowhere
> documented.
> 
> Add initial documentation.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Mostly native-English-speaker services below, so please feel free to
ignore.  The one place I made a substantive change, I marked it "@@@".
I only did about half of this document, but should this prove useful,
I will do the other half later.

							Thanx, Paul

> ---
> V2: Addressed review comments from Randy
> ---
>  Documentation/locking/index.rst     |    1 
>  Documentation/locking/locktypes.rst |  298 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 299 insertions(+)
>  create mode 100644 Documentation/locking/locktypes.rst
> 
> --- a/Documentation/locking/index.rst
> +++ b/Documentation/locking/index.rst
> @@ -7,6 +7,7 @@ locking
>  .. toctree::
>      :maxdepth: 1
>  
> +    locktypes
>      lockdep-design
>      lockstat
>      locktorture
> --- /dev/null
> +++ b/Documentation/locking/locktypes.rst
> @@ -0,0 +1,298 @@
> +.. _kernel_hacking_locktypes:
> +
> +==========================
> +Lock types and their rules
> +==========================
> +
> +Introduction
> +============
> +
> +The kernel provides a variety of locking primitives which can be divided
> +into two categories:
> +
> + - Sleeping locks
> + - Spinning locks
> +
> +This document describes the lock types at least at the conceptual level and
> +provides rules for nesting of lock types also under the aspect of PREEMPT_RT.

I suggest something like this:

This document conceptually describes these lock types and provides rules
for their nesting, including the rules for use under PREEMPT_RT.

> +
> +Lock categories
> +===============
> +
> +Sleeping locks
> +--------------
> +
> +Sleeping locks can only be acquired in preemptible task context.
> +
> +Some of the implementations allow try_lock() attempts from other contexts,
> +but that has to be really evaluated carefully including the question
> +whether the unlock can be done from that context safely as well.
> +
> +Note that some lock types change their implementation details when
> +debugging is enabled, so this should be really only considered if there is
> +no other option.

How about something like this?

Although implementations allow try_lock() from other contexts, it is
necessary to carefully evaluate the safety of unlock() as well as of
try_lock().  Furthermore, it is also necessary to evaluate the debugging
versions of these primitives.  In short, don't acquire sleeping locks
from other contexts unless there is no other option.

> +Sleeping lock types:
> +
> + - mutex
> + - rt_mutex
> + - semaphore
> + - rw_semaphore
> + - ww_mutex
> + - percpu_rw_semaphore
> +
> +On a PREEMPT_RT enabled kernel the following lock types are converted to
> +sleeping locks:

On PREEMPT_RT kernels, these lock types are converted to sleeping locks:

> + - spinlock_t
> + - rwlock_t
> +
> +Spinning locks
> +--------------
> +
> + - raw_spinlock_t
> + - bit spinlocks
> +
> +On a non PREEMPT_RT enabled kernel the following lock types are spinning
> +locks as well:

On non-PREEMPT_RT kernels, these lock types are also spinning locks:

> + - spinlock_t
> + - rwlock_t
> +
> +Spinning locks implicitly disable preemption and the lock / unlock functions
> +can have suffixes which apply further protections:
> +
> + ===================  ====================================================
> + _bh()                Disable / enable bottom halves (soft interrupts)
> + _irq()               Disable / enable interrupts
> + _irqsave/restore()   Save and disable / restore interrupt disabled state
> + ===================  ====================================================
> +
> +
> +rtmutex
> +=======
> +
> +RT-mutexes are mutexes with support for priority inheritance (PI).
> +
> +PI has limitations on non PREEMPT_RT enabled kernels due to preemption and
> +interrupt disabled sections.
> +
> +On a PREEMPT_RT enabled kernel most of these sections are fully
> +preemptible. This is possible because PREEMPT_RT forces most executions
> +into task context, especially interrupt handlers and soft interrupts, which
> +allows to substitute spinlock_t and rwlock_t with RT-mutex based
> +implementations.

PI clearly cannot preempt preemption-disabled or interrupt-disabled
regions of code, even on PREEMPT_RT kernels.  Instead, PREEMPT_RT kernels
execute most such regions of code in preemptible task context, especially
interrupt handlers and soft interrupts.  This conversion allows spinlock_t
and rwlock_t to be implemented via RT-mutexes.

> +
> +raw_spinlock_t and spinlock_t
> +=============================
> +
> +raw_spinlock_t
> +--------------
> +
> +raw_spinlock_t is a strict spinning lock implementation regardless of the
> +kernel configuration including PREEMPT_RT enabled kernels.
> +
> +raw_spinlock_t is to be used only in real critical core code, low level
> +interrupt handling and places where protecting (hardware) state is required
> +to be safe against preemption and eventually interrupts.
> +
> +Another reason to use raw_spinlock_t is when the critical section is tiny
> +to avoid the overhead of spinlock_t on a PREEMPT_RT enabled kernel in the
> +contended case.

raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical
core code, low level interrupt handling and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state.  raw_spinlock_t can sometimes also be used when the
critical section is tiny and the lock is lightly contended, thus avoiding
RT-mutex overhead.

@@@  I added the point about the lock being lightly contended.

> +spinlock_t
> +----------
> +
> +The semantics of spinlock_t change with the state of CONFIG_PREEMPT_RT.
> +
> +On a non PREEMPT_RT enabled kernel spinlock_t is mapped to raw_spinlock_t
> +and has exactly the same semantics.
> +
> +spinlock_t and PREEMPT_RT
> +-------------------------
> +
> +On a PREEMPT_RT enabled kernel spinlock_t is mapped to a separate
> +implementation based on rt_mutex which changes the semantics:
> +
> + - Preemption is not disabled
> +
> + - The hard interrupt related suffixes for spin_lock / spin_unlock
> +   operations (_irq, _irqsave / _irqrestore) do not affect the CPUs
                                                                  CPU's
> +   interrupt disabled state
> +
> + - The soft interrupt related suffix (_bh()) is still disabling the
> +   execution of soft interrupts, but contrary to a non PREEMPT_RT enabled
> +   kernel, which utilizes the preemption count, this is achieved by a per
> +   CPU bottom half locking mechanism.

 - The soft interrupt related suffix (_bh()) still disables softirq
   handlers.  However, unlike non-PREEMPT_RT kernels (which disable
   preemption to get this effect), PREEMPT_RT kernels use a per-CPU
   per-bottom-half locking mechanism.

> +All other semantics of spinlock_t are preserved:
> +
> + - Migration of tasks which hold a spinlock_t is prevented. On a non
> +   PREEMPT_RT enabled kernel this is implicit due to preemption disable.
> +   PREEMPT_RT has a separate mechanism to achieve this. This ensures that
> +   pointers to per CPU variables stay valid even if the task is preempted.
> +
> + - Task state preservation. The task state is not affected when a lock is
> +   contended and the task has to schedule out and wait for the lock to
> +   become available. The lock wake up restores the task state unless there
> +   was a regular (not lock related) wake up on the task. This ensures that
> +   the task state rules are always correct independent of the kernel
> +   configuration.
> +
> +rwlock_t
> +========
> +
> +rwlock_t is a multiple readers and single writer lock mechanism.
> +
> +On a non PREEMPT_RT enabled kernel rwlock_t is implemented as a spinning
> +lock and the suffix rules of spinlock_t apply accordingly. The
> +implementation is fair and prevents writer starvation.
> +
> +rwlock_t and PREEMPT_RT
> +-----------------------
> +
> +On a PREEMPT_RT enabled kernel rwlock_t is mapped to a separate
> +implementation based on rt_mutex which changes the semantics:
> +
> + - Same changes as for spinlock_t
> +
> + - The implementation is not fair and can cause writer starvation under
> +   certain circumstances. The reason for this is that a writer cannot grant
> +   its priority to multiple readers. Readers which are blocked on a writer
> +   fully support the priority inheritance protocol.
> +
> +
> +PREEMPT_RT caveats
> +==================
> +
> +spinlock_t and rwlock_t
> +-----------------------
> +
> +The substitution of spinlock_t and rwlock_t on PREEMPT_RT enabled kernels
> +with RT-mutex based implementations has a few implications.
> +
> +On a non PREEMPT_RT enabled kernel the following code construct is
> +perfectly fine::
> +
> +   local_irq_disable();
> +   spin_lock(&lock);
> +
> +and fully equivalent to::
> +
> +   spin_lock_irq(&lock);
> +
> +Same applies to rwlock_t and the _irqsave() suffix variant.
> +
> +On a PREEMPT_RT enabled kernel this breaks because the RT-mutex
> +substitution expects a fully preemptible context.
> +
> +The preferred solution is to use :c:func:`spin_lock_irq()` or
> +:c:func:`spin_lock_irqsave()` and their unlock counterparts.
> +
> +PREEMPT_RT also offers a local_lock mechanism to substitute the
> +local_irq_disable/save() constructs in cases where a separation of the
> +interrupt disabling and the locking is really unavoidable. This should be
> +restricted to very rare cases.
> +
> +
> +raw_spinlock_t
> +--------------
> +
> +Locking of a raw_spinlock_t disables preemption and eventually interrupts.
> +Therefore code inside the critical region has to be careful to avoid calls
> +into code which takes a regular spinlock_t or rwlock_t. A prime example is
> +memory allocation.
> +
> +On a non PREEMPT_RT enabled kernel the following code construct is
> +perfectly fine code::
> +
> +  raw_spin_lock(&lock);
> +  p = kmalloc(sizeof(*p), GFP_ATOMIC);
> +
> +On a PREEMPT_RT enabled kernel this breaks because the memory allocator is
> +fully preemptible and therefore does not support allocations from truly
> +atomic contexts.
> +
> +Contrary to that the following code construct is perfectly fine on
> +PREEMPT_RT as spin_lock() does not disable preemption::
> +
> +  spin_lock(&lock);
> +  p = kmalloc(sizeof(*p), GFP_ATOMIC);
> +
> +Most places which use GFP_ATOMIC allocations are safe on PREEMPT_RT as the
> +execution is forced into thread context and the lock substitution is
> +ensuring preemptibility.
> +
> +
> +bit spinlocks
> +-------------
> +
> +Bit spinlocks are problematic for PREEMPT_RT as they cannot be easily
> +substituted by an RT-mutex based implementation for obvious reasons.
> +
> +The semantics of bit spinlocks are preserved on a PREEMPT_RT enabled kernel
> +and the caveats vs. raw_spinlock_t apply.
> +
> +Some bit spinlocks are substituted by regular spinlock_t for PREEMPT_RT but
> +this requires conditional (#ifdef'ed) code changes at the usage side while
> +the spinlock_t substitution is simply done by the compiler and the
> +conditionals are restricted to header files and core implementation of the
> +locking primitives and the usage sites do not require any changes.
> +
> +
> +Lock type nesting rules
> +=======================
> +
> +The most basic rules are:
> +
> +  - Lock types of the same lock category (sleeping, spinning) can nest
> +    arbitrarily as long as they respect the general lock ordering rules to
> +    prevent deadlocks.
> +
> +  - Sleeping lock types cannot nest inside spinning lock types.
> +
> +  - Spinning lock types can nest inside sleeping lock types.
> +
> +These rules apply in general independent of CONFIG_PREEMPT_RT.
> +
> +As PREEMPT_RT changes the lock category of spinlock_t and rwlock_t from
> +spinning to sleeping this has obviously restrictions how they can nest with
> +raw_spinlock_t.
> +
> +This results in the following nest ordering:
> +
> +  1) Sleeping locks
> +  2) spinlock_t and rwlock_t
> +  3) raw_spinlock_t and bit spinlocks
> +
> +Lockdep is aware of these constraints to ensure that they are respected.
> +
> +
> +Owner semantics
> +===============
> +
> +Most lock types in the Linux kernel have strict owner semantics, i.e. the
> +context (task) which acquires a lock has to release it.
> +
> +There are two exceptions:
> +
> +  - semaphores
> +  - rwsems
> +
> +semaphores have no strict owner semantics for historical reasons. They are
> +often used for both serialization and waiting purposes. That's generally
> +discouraged and should be replaced by separate serialization and wait
> +mechanisms.
> +
> +rwsems have grown interfaces which allow non owner release for special
> +purposes. This usage is problematic on PREEMPT_RT because PREEMPT_RT
> +substitutes all locking primitives except semaphores with RT-mutex based
> +implementations to provide priority inheritance for all lock types except
> +the truly spinning ones. Priority inheritance on ownerless locks is
> +obviously impossible.
> +
> +For now the rwsem non-owner release excludes code which utilizes it from
> +being used on PREEMPT_RT enabled kernels. In same cases this can be
> +mitigated by disabling portions of the code, in other cases the complete
> +functionality has to be disabled until a workable solution has been found.
>
Davidlohr Bueso March 19, 2020, 8:51 a.m. UTC | #2
On Wed, 18 Mar 2020, Thomas Gleixner wrote:
>+Owner semantics
>+===============
>+
>+Most lock types in the Linux kernel have strict owner semantics, i.e. the
>+context (task) which acquires a lock has to release it.
>+
>+There are two exceptions:
>+
>+  - semaphores
>+  - rwsems
>+
>+semaphores have no strict owner semantics for historical reasons. They are

I would rephrase this to:

semaphores have no owner semantics for historical reason, and as such
trylock and release operations can be called from interrupt context. They
are ...

>+often used for both serialization and waiting purposes. That's generally
>+discouraged and should be replaced by separate serialization and wait
>+mechanisms.
            ^ , such as mutexes or completions.

>+
>+rwsems have grown interfaces which allow non owner release for special
>+purposes. This usage is problematic on PREEMPT_RT because PREEMPT_RT
>+substitutes all locking primitives except semaphores with RT-mutex based
>+implementations to provide priority inheritance for all lock types except
>+the truly spinning ones. Priority inheritance on ownerless locks is
>+obviously impossible.
>+
>+For now the rwsem non-owner release excludes code which utilizes it from
>+being used on PREEMPT_RT enabled kernels. In same cases this can be
>+mitigated by disabling portions of the code, in other cases the complete
>+functionality has to be disabled until a workable solution has been found.

Thanks,
Davidlohr
Jonathan Corbet March 19, 2020, 3:04 p.m. UTC | #3
On Wed, 18 Mar 2020 21:43:10 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> 
> The kernel provides a variety of locking primitives. The nesting of these
> lock types and the implications of them on RT enabled kernels is nowhere
> documented.
> 
> Add initial documentation.

...time to add a a couple of nits...:)

> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V2: Addressed review comments from Randy
> ---
>  Documentation/locking/index.rst     |    1 
>  Documentation/locking/locktypes.rst |  298 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 299 insertions(+)
>  create mode 100644 Documentation/locking/locktypes.rst
> 
> --- a/Documentation/locking/index.rst
> +++ b/Documentation/locking/index.rst
> @@ -7,6 +7,7 @@ locking
>  .. toctree::
>      :maxdepth: 1
>  
> +    locktypes
>      lockdep-design
>      lockstat
>      locktorture
> --- /dev/null
> +++ b/Documentation/locking/locktypes.rst
> @@ -0,0 +1,298 @@
> +.. _kernel_hacking_locktypes:
> +

So ... I vaguely remember that some Thomas guy added a document saying we
should be putting SPDX tags on our files? :)

> +==========================
> +Lock types and their rules
> +==========================

[...]

> +PREEMPT_RT caveats
> +==================
> +
> +spinlock_t and rwlock_t
> +-----------------------
> +
> +The substitution of spinlock_t and rwlock_t on PREEMPT_RT enabled kernels
> +with RT-mutex based implementations has a few implications.
> +
> +On a non PREEMPT_RT enabled kernel the following code construct is
> +perfectly fine::
> +
> +   local_irq_disable();
> +   spin_lock(&lock);
> +
> +and fully equivalent to::
> +
> +   spin_lock_irq(&lock);
> +
> +Same applies to rwlock_t and the _irqsave() suffix variant.
> +
> +On a PREEMPT_RT enabled kernel this breaks because the RT-mutex
> +substitution expects a fully preemptible context.
> +
> +The preferred solution is to use :c:func:`spin_lock_irq()` or
> +:c:func:`spin_lock_irqsave()` and their unlock counterparts.

We don't need (and shouldn't use) :c:func: anymore; just saying
spin_lock_irq() will cause the Right Things to happen.

Thanks,
jon
Thomas Gleixner March 19, 2020, 6:02 p.m. UTC | #4
Paul,

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Wed, Mar 18, 2020 at 09:43:10PM +0100, Thomas Gleixner wrote:
>
> Mostly native-English-speaker services below, so please feel free to
> ignore.  The one place I made a substantive change, I marked it "@@@".
> I only did about half of this document, but should this prove useful,
> I will do the other half later.

Native speaker services are always useful and appreciated.

>> +The kernel provides a variety of locking primitives which can be divided
>> +into two categories:
>> +
>> + - Sleeping locks
>> + - Spinning locks
>> +
>> +This document describes the lock types at least at the conceptual level and
>> +provides rules for nesting of lock types also under the aspect of PREEMPT_RT.
>
> I suggest something like this:
>
> This document conceptually describes these lock types and provides rules
> for their nesting, including the rules for use under PREEMPT_RT.

Way better :)

>> +Sleeping locks can only be acquired in preemptible task context.
>> +
>> +Some of the implementations allow try_lock() attempts from other contexts,
>> +but that has to be really evaluated carefully including the question
>> +whether the unlock can be done from that context safely as well.
>> +
>> +Note that some lock types change their implementation details when
>> +debugging is enabled, so this should be really only considered if there is
>> +no other option.
>
> How about something like this?
>
> Although implementations allow try_lock() from other contexts, it is
> necessary to carefully evaluate the safety of unlock() as well as of
> try_lock().  Furthermore, it is also necessary to evaluate the debugging
> versions of these primitives.  In short, don't acquire sleeping locks
> from other contexts unless there is no other option.

Yup.

>> +Sleeping lock types:
>> +
>> + - mutex
>> + - rt_mutex
>> + - semaphore
>> + - rw_semaphore
>> + - ww_mutex
>> + - percpu_rw_semaphore
>> +
>> +On a PREEMPT_RT enabled kernel the following lock types are converted to
>> +sleeping locks:
>
> On PREEMPT_RT kernels, these lock types are converted to sleeping
> locks:

Ok.

>> + - spinlock_t
>> + - rwlock_t
>> +
>> +Spinning locks
>> +--------------
>> +
>> + - raw_spinlock_t
>> + - bit spinlocks
>> +
>> +On a non PREEMPT_RT enabled kernel the following lock types are spinning
>> +locks as well:
>
> On non-PREEMPT_RT kernels, these lock types are also spinning locks:

Ok.

>> + - spinlock_t
>> + - rwlock_t
>> +
>> +Spinning locks implicitly disable preemption and the lock / unlock functions
>> +can have suffixes which apply further protections:
>> +
>> + ===================  ====================================================
>> + _bh()                Disable / enable bottom halves (soft interrupts)
>> + _irq()               Disable / enable interrupts
>> + _irqsave/restore()   Save and disable / restore interrupt disabled state
>> + ===================  ====================================================
>> +
>> +
>> +rtmutex
>> +=======
>> +
>> +RT-mutexes are mutexes with support for priority inheritance (PI).
>> +
>> +PI has limitations on non PREEMPT_RT enabled kernels due to preemption and
>> +interrupt disabled sections.
>> +
>> +On a PREEMPT_RT enabled kernel most of these sections are fully
>> +preemptible. This is possible because PREEMPT_RT forces most executions
>> +into task context, especially interrupt handlers and soft interrupts, which
>> +allows to substitute spinlock_t and rwlock_t with RT-mutex based
>> +implementations.
>
> PI clearly cannot preempt preemption-disabled or interrupt-disabled
> regions of code, even on PREEMPT_RT kernels.  Instead, PREEMPT_RT kernels
> execute most such regions of code in preemptible task context, especially
> interrupt handlers and soft interrupts.  This conversion allows spinlock_t
> and rwlock_t to be implemented via RT-mutexes.

Nice.

>> +
>> +raw_spinlock_t and spinlock_t
>> +=============================
>> +
>> +raw_spinlock_t
>> +--------------
>> +
>> +raw_spinlock_t is a strict spinning lock implementation regardless of the
>> +kernel configuration including PREEMPT_RT enabled kernels.
>> +
>> +raw_spinlock_t is to be used only in real critical core code, low level
>> +interrupt handling and places where protecting (hardware) state is required
>> +to be safe against preemption and eventually interrupts.
>> +
>> +Another reason to use raw_spinlock_t is when the critical section is tiny
>> +to avoid the overhead of spinlock_t on a PREEMPT_RT enabled kernel in the
>> +contended case.
>
> raw_spinlock_t is a strict spinning lock implementation in all kernels,
> including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical
> core code, low level interrupt handling and places where disabling
> preemption or interrupts is required, for example, to safely access
> hardware state.  raw_spinlock_t can sometimes also be used when the
> critical section is tiny and the lock is lightly contended, thus avoiding
> RT-mutex overhead.
>
> @@@  I added the point about the lock being lightly contended.

Hmm, not sure. The point is that if the critical section is small the
overhead of cross CPU boosting along with the resulting IPIs is going to
be at least an order of magnitude larger. And on contention this is just
pushing the raw_spinlock contention off to the raw_spinlock in the rt
mutex plus the owning tasks pi_lock which makes things even worse.

>> + - The hard interrupt related suffixes for spin_lock / spin_unlock
>> +   operations (_irq, _irqsave / _irqrestore) do not affect the CPUs
>                                                                   CPU's

Si senor!

>> +   interrupt disabled state
>> +
>> + - The soft interrupt related suffix (_bh()) is still disabling the
>> +   execution of soft interrupts, but contrary to a non PREEMPT_RT enabled
>> +   kernel, which utilizes the preemption count, this is achieved by a per
>> +   CPU bottom half locking mechanism.
>
>  - The soft interrupt related suffix (_bh()) still disables softirq
>    handlers.  However, unlike non-PREEMPT_RT kernels (which disable
>    preemption to get this effect), PREEMPT_RT kernels use a per-CPU
>    per-bottom-half locking mechanism.

it's not per-bottom-half anymore. That turned out to be dangerous due to
dependencies between BH types, e.g. network and timers.

I hope I was able to encourage you to comment on the other half as well :)

Thanks,

        tglx
Thomas Gleixner March 19, 2020, 6:04 p.m. UTC | #5
Jonathan Corbet <corbet@lwn.net> writes:
> On Wed, 18 Mar 2020 21:43:10 +0100
> Thomas Gleixner <tglx@linutronix.de> wrote:
>> Add initial documentation.
>
> ...time to add a a couple of nits...:)

...time

Is that valid RST?

>> +++ b/Documentation/locking/locktypes.rst
>> @@ -0,0 +1,298 @@
>> +.. _kernel_hacking_locktypes:
>> +
>
> So ... I vaguely remember that some Thomas guy added a document saying we
> should be putting SPDX tags on our files? :)

Never met him or heard about that.

>> +
>> +The preferred solution is to use :c:func:`spin_lock_irq()` or
>> +:c:func:`spin_lock_irqsave()` and their unlock counterparts.
>
> We don't need (and shouldn't use) :c:func: anymore; just saying
> spin_lock_irq() will cause the Right Things to happen.

Good to know. Will fix.

Thanks,

        tglx
Paul E. McKenney March 20, 2020, 4:01 p.m. UTC | #6
On Thu, Mar 19, 2020 at 07:02:17PM +0100, Thomas Gleixner wrote:
> Paul,
> 
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Wed, Mar 18, 2020 at 09:43:10PM +0100, Thomas Gleixner wrote:
> >
> > Mostly native-English-speaker services below, so please feel free to
> > ignore.  The one place I made a substantive change, I marked it "@@@".
> > I only did about half of this document, but should this prove useful,
> > I will do the other half later.
> 
> Native speaker services are always useful and appreciated.

Glad it is helpful.  ;-)

[ . . . ]

> >> +
> >> +raw_spinlock_t and spinlock_t
> >> +=============================
> >> +
> >> +raw_spinlock_t
> >> +--------------
> >> +
> >> +raw_spinlock_t is a strict spinning lock implementation regardless of the
> >> +kernel configuration including PREEMPT_RT enabled kernels.
> >> +
> >> +raw_spinlock_t is to be used only in real critical core code, low level
> >> +interrupt handling and places where protecting (hardware) state is required
> >> +to be safe against preemption and eventually interrupts.
> >> +
> >> +Another reason to use raw_spinlock_t is when the critical section is tiny
> >> +to avoid the overhead of spinlock_t on a PREEMPT_RT enabled kernel in the
> >> +contended case.
> >
> > raw_spinlock_t is a strict spinning lock implementation in all kernels,
> > including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical
> > core code, low level interrupt handling and places where disabling
> > preemption or interrupts is required, for example, to safely access
> > hardware state.  raw_spinlock_t can sometimes also be used when the
> > critical section is tiny and the lock is lightly contended, thus avoiding
> > RT-mutex overhead.
> >
> > @@@  I added the point about the lock being lightly contended.
> 
> Hmm, not sure. The point is that if the critical section is small the
> overhead of cross CPU boosting along with the resulting IPIs is going to
> be at least an order of magnitude larger. And on contention this is just
> pushing the raw_spinlock contention off to the raw_spinlock in the rt
> mutex plus the owning tasks pi_lock which makes things even worse.

Fair enough.  So, leaving that out:

raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical
core code, low level interrupt handling and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state.  In addition, raw_spinlock_t can sometimes be used when
the critical section is tiny, thus avoiding RT-mutex overhead.

> >> + - The hard interrupt related suffixes for spin_lock / spin_unlock
> >> +   operations (_irq, _irqsave / _irqrestore) do not affect the CPUs
> 
> Si senor!

;-)

> >> +   interrupt disabled state
> >> +
> >> + - The soft interrupt related suffix (_bh()) is still disabling the
> >> +   execution of soft interrupts, but contrary to a non PREEMPT_RT enabled
> >> +   kernel, which utilizes the preemption count, this is achieved by a per
> >> +   CPU bottom half locking mechanism.
> >
> >  - The soft interrupt related suffix (_bh()) still disables softirq
> >    handlers.  However, unlike non-PREEMPT_RT kernels (which disable
> >    preemption to get this effect), PREEMPT_RT kernels use a per-CPU
> >    per-bottom-half locking mechanism.
> 
> it's not per-bottom-half anymore. That turned out to be dangerous due to
> dependencies between BH types, e.g. network and timers.

Ah!  OK, how about this?

 - The soft interrupt related suffix (_bh()) still disables softirq
   handlers.  However, unlike non-PREEMPT_RT kernels (which disable
   preemption to get this effect), PREEMPT_RT kernels use a per-CPU
   lock to exclude softirq handlers.

> I hope I was able to encourage you to comment on the other half as well :)

OK, here goes...

> +All other semantics of spinlock_t are preserved:
> +
> + - Migration of tasks which hold a spinlock_t is prevented. On a non
> +   PREEMPT_RT enabled kernel this is implicit due to preemption disable.
> +   PREEMPT_RT has a separate mechanism to achieve this. This ensures that
> +   pointers to per CPU variables stay valid even if the task is preempted.
> +
> + - Task state preservation. The task state is not affected when a lock is
> +   contended and the task has to schedule out and wait for the lock to
> +   become available. The lock wake up restores the task state unless there
> +   was a regular (not lock related) wake up on the task. This ensures that
> +   the task state rules are always correct independent of the kernel
> +   configuration.

How about this?

PREEMPT_RT kernels preserve all other spinlock_t semantics:

 - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels
   avoid migration by disabling preemption.  PREEMPT_RT kernels instead
   disable migration, which ensures that pointers to per-CPU variables
   remain valid even if the task is preempted.

 - Task state is preserved across spinlock acquisition, ensuring that the
   task-state rules apply to all kernel configurations.  In non-PREEMPT_RT
   kernels leave task state untouched.  However, PREEMPT_RT must change
   task state if the task blocks during acquisition.  Therefore, the
   corresponding lock wakeup restores the task state.  Note that regular
   (not lock related) wakeups do not restore task state.

> +rwlock_t
> +========
> +
> +rwlock_t is a multiple readers and single writer lock mechanism.
> +
> +On a non PREEMPT_RT enabled kernel rwlock_t is implemented as a spinning
> +lock and the suffix rules of spinlock_t apply accordingly. The
> +implementation is fair and prevents writer starvation.

Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the
suffix rules of spinlock_t apply accordingly. The implementation is fair,
thus preventing writer starvation.

> +rwlock_t and PREEMPT_RT
> +-----------------------
> +
> +On a PREEMPT_RT enabled kernel rwlock_t is mapped to a separate
> +implementation based on rt_mutex which changes the semantics:
> +
> + - Same changes as for spinlock_t
> +
> + - The implementation is not fair and can cause writer starvation under
> +   certain circumstances. The reason for this is that a writer cannot grant
> +   its priority to multiple readers. Readers which are blocked on a writer
> +   fully support the priority inheritance protocol.

PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based
implementation, thus changing semantics:

 - All the spinlock_t changes also apply to rwlock_t.

 - Because an rwlock_t writer cannot grant its priority to multiple
   readers, a preempted low-priority reader will continue holding its
   lock, thus starving even high-priority writers.  In contrast, because
   readers can grant their priority to a writer, a preempted low-priority
   writer will have its priority boosted until it releases the lock,
   thus preventing that writer from starving readers.

> +PREEMPT_RT caveats
> +==================
> +
> +spinlock_t and rwlock_t
> +-----------------------
> +
> +The substitution of spinlock_t and rwlock_t on PREEMPT_RT enabled kernels
> +with RT-mutex based implementations has a few implications.
> +
> +On a non PREEMPT_RT enabled kernel the following code construct is
> +perfectly fine::

These changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
have a few implications.  For example, on a non-PREEMPT_RT kernel the
following code sequence works as expected::

> +   local_irq_disable();
> +   spin_lock(&lock);
> +
> +and fully equivalent to::

and is fully equivalent to::

> +   spin_lock_irq(&lock);
> +
> +Same applies to rwlock_t and the _irqsave() suffix variant.

The same applies to rwlock_t and its _irqsave() variant.

> +On a PREEMPT_RT enabled kernel this breaks because the RT-mutex
> +substitution expects a fully preemptible context.
> +
> +The preferred solution is to use :c:func:`spin_lock_irq()` or
> +:c:func:`spin_lock_irqsave()` and their unlock counterparts.
> +
> +PREEMPT_RT also offers a local_lock mechanism to substitute the
> +local_irq_disable/save() constructs in cases where a separation of the
> +interrupt disabling and the locking is really unavoidable. This should be
> +restricted to very rare cases.

On PREEMPT_RT kernel this code sequence breaks because RT-mutex
requires a fully preemptible context.  Instead, use spin_lock_irq() or
spin_lock_irqsave() and their unlock counterparts.  In cases where the
interrupt disabling and locking must remain separate, PREEMPT_RT offers
a local_lock mechanism.  Acquiring the local_lock pins the task to a
CPU, allowing things like per-CPU irq-disabled locks to be acquired.
However, this approach should be used only where absolutely necessary.


> +raw_spinlock_t
> +--------------
> +
> +Locking of a raw_spinlock_t disables preemption and eventually interrupts.
> +Therefore code inside the critical region has to be careful to avoid calls
> +into code which takes a regular spinlock_t or rwlock_t. A prime example is
> +memory allocation.
> +
> +On a non PREEMPT_RT enabled kernel the following code construct is
> +perfectly fine code::

Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
spinlock_t or rwlock_t, for example, the critical section must avoid
allocating memory.  Thus, on a non-PREEMPT_RT kernel the following code
works perfectly::

> +  raw_spin_lock(&lock);
> +  p = kmalloc(sizeof(*p), GFP_ATOMIC);
> +
> +On a PREEMPT_RT enabled kernel this breaks because the memory allocator is
> +fully preemptible and therefore does not support allocations from truly
> +atomic contexts.
> +
> +Contrary to that the following code construct is perfectly fine on
> +PREEMPT_RT as spin_lock() does not disable preemption::

But this code failes on PREEMPT_RT kernels because the memory allocator
is fully preemptible and therefore cannot be invoked from truly atomic
contexts.  However, it is perfectly fine to invoke the memory allocator
while holding a normal non-raw spinlocks because they do not disable
preemption::

> +  spin_lock(&lock);
> +  p = kmalloc(sizeof(*p), GFP_ATOMIC);
> +
> +Most places which use GFP_ATOMIC allocations are safe on PREEMPT_RT as the
> +execution is forced into thread context and the lock substitution is
> +ensuring preemptibility.

Interestingly enough, most uses of GFP_ATOMIC allocations are
actually safe on PREEMPT_RT because the the lock substitution ensures
preemptibility.  Only those GFP_ATOMIC allocations that are invoke
while holding a raw spinlock or with preemption otherwise disabled need
adjustment to work correctly on PREEMPT_RT.

[ I am not as confident of the above as I would like to be... ]

And meeting time, will continue later!

							Thanx, Paul

> +bit spinlocks
> +-------------
> +
> +Bit spinlocks are problematic for PREEMPT_RT as they cannot be easily
> +substituted by an RT-mutex based implementation for obvious reasons.
> +
> +The semantics of bit spinlocks are preserved on a PREEMPT_RT enabled kernel
> +and the caveats vs. raw_spinlock_t apply.
> +
> +Some bit spinlocks are substituted by regular spinlock_t for PREEMPT_RT but
> +this requires conditional (#ifdef'ed) code changes at the usage side while
> +the spinlock_t substitution is simply done by the compiler and the
> +conditionals are restricted to header files and core implementation of the
> +locking primitives and the usage sites do not require any changes.
> +
> +
> +Lock type nesting rules
> +=======================
> +
> +The most basic rules are:
> +
> +  - Lock types of the same lock category (sleeping, spinning) can nest
> +    arbitrarily as long as they respect the general lock ordering rules to
> +    prevent deadlocks.
> +
> +  - Sleeping lock types cannot nest inside spinning lock types.
> +
> +  - Spinning lock types can nest inside sleeping lock types.
> +
> +These rules apply in general independent of CONFIG_PREEMPT_RT.
> +
> +As PREEMPT_RT changes the lock category of spinlock_t and rwlock_t from
> +spinning to sleeping this has obviously restrictions how they can nest with
> +raw_spinlock_t.
> +
> +This results in the following nest ordering:
> +
> +  1) Sleeping locks
> +  2) spinlock_t and rwlock_t
> +  3) raw_spinlock_t and bit spinlocks
> +
> +Lockdep is aware of these constraints to ensure that they are respected.
> +
> +
> +Owner semantics
> +===============
> +
> +Most lock types in the Linux kernel have strict owner semantics, i.e. the
> +context (task) which acquires a lock has to release it.
> +
> +There are two exceptions:
> +
> +  - semaphores
> +  - rwsems
> +
> +semaphores have no strict owner semantics for historical reasons. They are
> +often used for both serialization and waiting purposes. That's generally
> +discouraged and should be replaced by separate serialization and wait
> +mechanisms.
> +
> +rwsems have grown interfaces which allow non owner release for special
> +purposes. This usage is problematic on PREEMPT_RT because PREEMPT_RT
> +substitutes all locking primitives except semaphores with RT-mutex based
> +implementations to provide priority inheritance for all lock types except
> +the truly spinning ones. Priority inheritance on ownerless locks is
> +obviously impossible.
> +
> +For now the rwsem non-owner release excludes code which utilizes it from
> +being used on PREEMPT_RT enabled kernels. In same cases this can be
> +mitigated by disabling portions of the code, in other cases the complete
> +functionality has to be disabled until a workable solution has been found.
>
Thomas Gleixner March 20, 2020, 7:51 p.m. UTC | #7
"Paul E. McKenney" <paulmck@kernel.org> writes:
>
>  - The soft interrupt related suffix (_bh()) still disables softirq
>    handlers.  However, unlike non-PREEMPT_RT kernels (which disable
>    preemption to get this effect), PREEMPT_RT kernels use a per-CPU
>    lock to exclude softirq handlers.

I've made that:

  - The soft interrupt related suffix (_bh()) still disables softirq
    handlers.

    Non-PREEMPT_RT kernels disable preemption to get this effect.

    PREEMPT_RT kernels use a per-CPU lock for serialization. The lock
    disables softirq handlers and prevents reentrancy by a preempting
    task.
    
On non-RT this is implicit through preemption disable, but it's non
obvious for RT as preemption stays enabled.

> PREEMPT_RT kernels preserve all other spinlock_t semantics:
>
>  - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels
>    avoid migration by disabling preemption.  PREEMPT_RT kernels instead
>    disable migration, which ensures that pointers to per-CPU variables
>    remain valid even if the task is preempted.
>
>  - Task state is preserved across spinlock acquisition, ensuring that the
>    task-state rules apply to all kernel configurations.  In non-PREEMPT_RT
>    kernels leave task state untouched.  However, PREEMPT_RT must change
>    task state if the task blocks during acquisition.  Therefore, the
>    corresponding lock wakeup restores the task state.  Note that regular
>    (not lock related) wakeups do not restore task state.

   - Task state is preserved across spinlock acquisition, ensuring that the
     task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
     kernels leave task state untouched.  However, PREEMPT_RT must change
     task state if the task blocks during acquisition.  Therefore, it
     saves the current task state before blocking and the corresponding
     lock wakeup restores it. A regular not lock related wakeup sets the
     task state to RUNNING. If this happens while the task is blocked on
     a spinlock then the saved task state is changed so that correct
     state is restored on lock wakeup.

Hmm?

> But this code failes on PREEMPT_RT kernels because the memory allocator
> is fully preemptible and therefore cannot be invoked from truly atomic
> contexts.  However, it is perfectly fine to invoke the memory allocator
> while holding a normal non-raw spinlocks because they do not disable
> preemption::
>
>> +  spin_lock(&lock);
>> +  p = kmalloc(sizeof(*p), GFP_ATOMIC);
>> +
>> +Most places which use GFP_ATOMIC allocations are safe on PREEMPT_RT as the
>> +execution is forced into thread context and the lock substitution is
>> +ensuring preemptibility.
>
> Interestingly enough, most uses of GFP_ATOMIC allocations are
> actually safe on PREEMPT_RT because the the lock substitution ensures
> preemptibility.  Only those GFP_ATOMIC allocations that are invoke
> while holding a raw spinlock or with preemption otherwise disabled need
> adjustment to work correctly on PREEMPT_RT.
>
> [ I am not as confident of the above as I would like to be... ]

I'd leave that whole paragraph out. This documents the rules and from
the above code examples it's pretty clear what works and what not :)

> And meeting time, will continue later!

Enjoy!

Thanks,

        tglx
Paul E. McKenney March 20, 2020, 9:02 p.m. UTC | #8
On Fri, Mar 20, 2020 at 08:51:44PM +0100, Thomas Gleixner wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >
> >  - The soft interrupt related suffix (_bh()) still disables softirq
> >    handlers.  However, unlike non-PREEMPT_RT kernels (which disable
> >    preemption to get this effect), PREEMPT_RT kernels use a per-CPU
> >    lock to exclude softirq handlers.
> 
> I've made that:
> 
>   - The soft interrupt related suffix (_bh()) still disables softirq
>     handlers.
> 
>     Non-PREEMPT_RT kernels disable preemption to get this effect.
> 
>     PREEMPT_RT kernels use a per-CPU lock for serialization. The lock
>     disables softirq handlers and prevents reentrancy by a preempting
>     task.

That works!  At the end, I would instead say "prevents reentrancy
due to task preemption", but what you have works.

> On non-RT this is implicit through preemption disable, but it's non
> obvious for RT as preemption stays enabled.
> 
> > PREEMPT_RT kernels preserve all other spinlock_t semantics:
> >
> >  - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels
> >    avoid migration by disabling preemption.  PREEMPT_RT kernels instead
> >    disable migration, which ensures that pointers to per-CPU variables
> >    remain valid even if the task is preempted.
> >
> >  - Task state is preserved across spinlock acquisition, ensuring that the
> >    task-state rules apply to all kernel configurations.  In non-PREEMPT_RT
> >    kernels leave task state untouched.  However, PREEMPT_RT must change
> >    task state if the task blocks during acquisition.  Therefore, the
> >    corresponding lock wakeup restores the task state.  Note that regular
> >    (not lock related) wakeups do not restore task state.
> 
>    - Task state is preserved across spinlock acquisition, ensuring that the
>      task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
>      kernels leave task state untouched.  However, PREEMPT_RT must change
>      task state if the task blocks during acquisition.  Therefore, it
>      saves the current task state before blocking and the corresponding
>      lock wakeup restores it. A regular not lock related wakeup sets the
>      task state to RUNNING. If this happens while the task is blocked on
>      a spinlock then the saved task state is changed so that correct
>      state is restored on lock wakeup.
> 
> Hmm?

I of course cannot resist editing the last two sentences:

   ... Other types of wakeups unconditionally set task state to RUNNING.
   If this happens while a task is blocked while acquiring a spinlock,
   then the task state is restored to its pre-acquisition value at
   lock-wakeup time.

> > But this code failes on PREEMPT_RT kernels because the memory allocator
> > is fully preemptible and therefore cannot be invoked from truly atomic
> > contexts.  However, it is perfectly fine to invoke the memory allocator
> > while holding a normal non-raw spinlocks because they do not disable
> > preemption::
> >
> >> +  spin_lock(&lock);
> >> +  p = kmalloc(sizeof(*p), GFP_ATOMIC);
> >> +
> >> +Most places which use GFP_ATOMIC allocations are safe on PREEMPT_RT as the
> >> +execution is forced into thread context and the lock substitution is
> >> +ensuring preemptibility.
> >
> > Interestingly enough, most uses of GFP_ATOMIC allocations are
> > actually safe on PREEMPT_RT because the the lock substitution ensures
> > preemptibility.  Only those GFP_ATOMIC allocations that are invoke
> > while holding a raw spinlock or with preemption otherwise disabled need
> > adjustment to work correctly on PREEMPT_RT.
> >
> > [ I am not as confident of the above as I would like to be... ]
> 
> I'd leave that whole paragraph out. This documents the rules and from
> the above code examples it's pretty clear what works and what not :)

Works for me!  ;-)

> > And meeting time, will continue later!
> 
> Enjoy!

Not bad, actually, as meetings go.

							Thanx, Paul
Thomas Gleixner March 20, 2020, 10:36 p.m. UTC | #9
"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Fri, Mar 20, 2020 at 08:51:44PM +0100, Thomas Gleixner wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >
>> >  - The soft interrupt related suffix (_bh()) still disables softirq
>> >    handlers.  However, unlike non-PREEMPT_RT kernels (which disable
>> >    preemption to get this effect), PREEMPT_RT kernels use a per-CPU
>> >    lock to exclude softirq handlers.
>> 
>> I've made that:
>> 
>>   - The soft interrupt related suffix (_bh()) still disables softirq
>>     handlers.
>> 
>>     Non-PREEMPT_RT kernels disable preemption to get this effect.
>> 
>>     PREEMPT_RT kernels use a per-CPU lock for serialization. The lock
>>     disables softirq handlers and prevents reentrancy by a preempting
>>     task.
>
> That works!  At the end, I would instead say "prevents reentrancy
> due to task preemption", but what you have works.

Yours is better.

>>    - Task state is preserved across spinlock acquisition, ensuring that the
>>      task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
>>      kernels leave task state untouched.  However, PREEMPT_RT must change
>>      task state if the task blocks during acquisition.  Therefore, it
>>      saves the current task state before blocking and the corresponding
>>      lock wakeup restores it. A regular not lock related wakeup sets the
>>      task state to RUNNING. If this happens while the task is blocked on
>>      a spinlock then the saved task state is changed so that correct
>>      state is restored on lock wakeup.
>> 
>> Hmm?
>
> I of course cannot resist editing the last two sentences:
>
>    ... Other types of wakeups unconditionally set task state to RUNNING.
>    If this happens while a task is blocked while acquiring a spinlock,
>    then the task state is restored to its pre-acquisition value at
>    lock-wakeup time.

Errm no. That would mean

     state = UNINTERRUPTIBLE
     lock()
       block()
         real_state = state
         state = SLEEPONLOCK

                               non lock wakeup
                                 state = RUNNING    <--- FAIL #1

                               lock wakeup
                                 state = real_state <--- FAIL #2

How it works is:

     state = UNINTERRUPTIBLE
     lock()
       block()
         real_state = state
         state = SLEEPONLOCK

                               non lock wakeup
                                 real_state = RUNNING

                               lock wakeup
                                 state = real_state == RUNNING

If there is no 'non lock wakeup' before the lock wakeup:

     state = UNINTERRUPTIBLE
     lock()
       block()
         real_state = state
         state = SLEEPONLOCK

                               lock wakeup
                                 state = real_state == UNINTERRUPTIBLE

I agree that what I tried to express is hard to parse, but it's at least
halfways correct :)

Thanks,

        tglx
Paul E. McKenney March 21, 2020, 2:29 a.m. UTC | #10
On Fri, Mar 20, 2020 at 11:36:03PM +0100, Thomas Gleixner wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> > On Fri, Mar 20, 2020 at 08:51:44PM +0100, Thomas Gleixner wrote:
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >
> >> >  - The soft interrupt related suffix (_bh()) still disables softirq
> >> >    handlers.  However, unlike non-PREEMPT_RT kernels (which disable
> >> >    preemption to get this effect), PREEMPT_RT kernels use a per-CPU
> >> >    lock to exclude softirq handlers.
> >> 
> >> I've made that:
> >> 
> >>   - The soft interrupt related suffix (_bh()) still disables softirq
> >>     handlers.
> >> 
> >>     Non-PREEMPT_RT kernels disable preemption to get this effect.
> >> 
> >>     PREEMPT_RT kernels use a per-CPU lock for serialization. The lock
> >>     disables softirq handlers and prevents reentrancy by a preempting
> >>     task.
> >
> > That works!  At the end, I would instead say "prevents reentrancy
> > due to task preemption", but what you have works.
> 
> Yours is better.
> 
> >>    - Task state is preserved across spinlock acquisition, ensuring that the
> >>      task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
> >>      kernels leave task state untouched.  However, PREEMPT_RT must change
> >>      task state if the task blocks during acquisition.  Therefore, it
> >>      saves the current task state before blocking and the corresponding
> >>      lock wakeup restores it. A regular not lock related wakeup sets the
> >>      task state to RUNNING. If this happens while the task is blocked on
> >>      a spinlock then the saved task state is changed so that correct
> >>      state is restored on lock wakeup.
> >> 
> >> Hmm?
> >
> > I of course cannot resist editing the last two sentences:
> >
> >    ... Other types of wakeups unconditionally set task state to RUNNING.
> >    If this happens while a task is blocked while acquiring a spinlock,
> >    then the task state is restored to its pre-acquisition value at
> >    lock-wakeup time.
> 
> Errm no. That would mean
> 
>      state = UNINTERRUPTIBLE
>      lock()
>        block()
>          real_state = state
>          state = SLEEPONLOCK
> 
>                                non lock wakeup
>                                  state = RUNNING    <--- FAIL #1
> 
>                                lock wakeup
>                                  state = real_state <--- FAIL #2
> 
> How it works is:
> 
>      state = UNINTERRUPTIBLE
>      lock()
>        block()
>          real_state = state
>          state = SLEEPONLOCK
> 
>                                non lock wakeup
>                                  real_state = RUNNING
> 
>                                lock wakeup
>                                  state = real_state == RUNNING
> 
> If there is no 'non lock wakeup' before the lock wakeup:
> 
>      state = UNINTERRUPTIBLE
>      lock()
>        block()
>          real_state = state
>          state = SLEEPONLOCK
> 
>                                lock wakeup
>                                  state = real_state == UNINTERRUPTIBLE
> 
> I agree that what I tried to express is hard to parse, but it's at least
> halfways correct :)

Apologies!  That is what I get for not looking it up in the source.  :-/

OK, so I am stupid enough not only to get it wrong, but also to try again:

   ... Other types of wakeups would normally unconditionally set the
   task state to RUNNING, but that does not work here because the task
   must remain blocked until the lock becomes available.  Therefore,
   when a non-lock wakeup attempts to awaken a task blocked waiting
   for a spinlock, it instead sets the saved state to RUNNING.  Then,
   when the lock acquisition completes, the lock wakeup sets the task
   state to the saved state, in this case setting it to RUNNING.

Is that better?

							Thanx, Paul
Thomas Gleixner March 21, 2020, 10:26 a.m. UTC | #11
"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Fri, Mar 20, 2020 at 11:36:03PM +0100, Thomas Gleixner wrote:
>> I agree that what I tried to express is hard to parse, but it's at least
>> halfways correct :)
>
> Apologies!  That is what I get for not looking it up in the source.  :-/
>
> OK, so I am stupid enough not only to get it wrong, but also to try again:
>
>    ... Other types of wakeups would normally unconditionally set the
>    task state to RUNNING, but that does not work here because the task
>    must remain blocked until the lock becomes available.  Therefore,
>    when a non-lock wakeup attempts to awaken a task blocked waiting
>    for a spinlock, it instead sets the saved state to RUNNING.  Then,
>    when the lock acquisition completes, the lock wakeup sets the task
>    state to the saved state, in this case setting it to RUNNING.
>
> Is that better?

Definitely!

Thanks for all the editorial work!

       tglx
Paul E. McKenney March 21, 2020, 5:23 p.m. UTC | #12
On Sat, Mar 21, 2020 at 11:26:06AM +0100, Thomas Gleixner wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> > On Fri, Mar 20, 2020 at 11:36:03PM +0100, Thomas Gleixner wrote:
> >> I agree that what I tried to express is hard to parse, but it's at least
> >> halfways correct :)
> >
> > Apologies!  That is what I get for not looking it up in the source.  :-/
> >
> > OK, so I am stupid enough not only to get it wrong, but also to try again:
> >
> >    ... Other types of wakeups would normally unconditionally set the
> >    task state to RUNNING, but that does not work here because the task
> >    must remain blocked until the lock becomes available.  Therefore,
> >    when a non-lock wakeup attempts to awaken a task blocked waiting
> >    for a spinlock, it instead sets the saved state to RUNNING.  Then,
> >    when the lock acquisition completes, the lock wakeup sets the task
> >    state to the saved state, in this case setting it to RUNNING.
> >
> > Is that better?
> 
> Definitely!
> 
> Thanks for all the editorial work!

NP, and glad you like it!

But I felt even more stupid sometime in the middle of the night.  Why on
earth didn't I work in your nice examples?  :-/

I will pull them in later.  Time to go hike!!!

							Thanx, Paul
Joel Fernandes March 21, 2020, 9:21 p.m. UTC | #13
Hi Thomas,

Just a few comments:

[...]
> +rtmutex
> +=======
> +
> +RT-mutexes are mutexes with support for priority inheritance (PI).
> +
> +PI has limitations on non PREEMPT_RT enabled kernels due to preemption and
> +interrupt disabled sections.
> +
> +On a PREEMPT_RT enabled kernel most of these sections are fully
> +preemptible. This is possible because PREEMPT_RT forces most executions
> +into task context, especially interrupt handlers and soft interrupts, which
> +allows to substitute spinlock_t and rwlock_t with RT-mutex based
> +implementations.
> +
> +
> +raw_spinlock_t and spinlock_t
> +=============================
> +
> +raw_spinlock_t
> +--------------
> +
> +raw_spinlock_t is a strict spinning lock implementation regardless of the
> +kernel configuration including PREEMPT_RT enabled kernels.
> +
> +raw_spinlock_t is to be used only in real critical core code, low level
> +interrupt handling and places where protecting (hardware) state is required
> +to be safe against preemption and eventually interrupts.
> +
> +Another reason to use raw_spinlock_t is when the critical section is tiny
> +to avoid the overhead of spinlock_t on a PREEMPT_RT enabled kernel in the
> +contended case.
> +
> +spinlock_t
> +----------
> +
> +The semantics of spinlock_t change with the state of CONFIG_PREEMPT_RT.
> +
> +On a non PREEMPT_RT enabled kernel spinlock_t is mapped to raw_spinlock_t
> +and has exactly the same semantics.
> +
> +spinlock_t and PREEMPT_RT
> +-------------------------
> +
> +On a PREEMPT_RT enabled kernel spinlock_t is mapped to a separate
> +implementation based on rt_mutex which changes the semantics:
> +
> + - Preemption is not disabled
> +
> + - The hard interrupt related suffixes for spin_lock / spin_unlock
> +   operations (_irq, _irqsave / _irqrestore) do not affect the CPUs
> +   interrupt disabled state
> +
> + - The soft interrupt related suffix (_bh()) is still disabling the
> +   execution of soft interrupts, but contrary to a non PREEMPT_RT enabled
> +   kernel, which utilizes the preemption count, this is achieved by a per
> +   CPU bottom half locking mechanism.
> +
> +All other semantics of spinlock_t are preserved:
> +
> + - Migration of tasks which hold a spinlock_t is prevented. On a non
> +   PREEMPT_RT enabled kernel this is implicit due to preemption disable.
> +   PREEMPT_RT has a separate mechanism to achieve this. This ensures that
> +   pointers to per CPU variables stay valid even if the task is preempted.
> +
> + - Task state preservation. The task state is not affected when a lock is
> +   contended and the task has to schedule out and wait for the lock to
> +   become available. The lock wake up restores the task state unless there
> +   was a regular (not lock related) wake up on the task. This ensures that
> +   the task state rules are always correct independent of the kernel
> +   configuration.
> +
> +rwlock_t
> +========
> +
> +rwlock_t is a multiple readers and single writer lock mechanism.
> +
> +On a non PREEMPT_RT enabled kernel rwlock_t is implemented as a spinning
> +lock and the suffix rules of spinlock_t apply accordingly. The
> +implementation is fair and prevents writer starvation.
>

You mentioned writer starvation, but I think it would be good to also mention
that rwlock_t on a non-PREEMPT_RT kernel also does not have _reader_
starvation problem, since it uses queued implementation.  This fact is worth
mentioning here, since further below you explain that an rwlock in PREEMPT_RT
does have reader starvation problem.

> +rwlock_t and PREEMPT_RT
> +-----------------------
> +
> +On a PREEMPT_RT enabled kernel rwlock_t is mapped to a separate
> +implementation based on rt_mutex which changes the semantics:
> +
> + - Same changes as for spinlock_t
> +
> + - The implementation is not fair and can cause writer starvation under
> +   certain circumstances. The reason for this is that a writer cannot grant
> +   its priority to multiple readers. Readers which are blocked on a writer
> +   fully support the priority inheritance protocol.

Is it hard to give priority to multiple readers because the number of readers
to give priority to could be unbounded?

> +
> +
> +PREEMPT_RT caveats
> +==================
> +
> +spinlock_t and rwlock_t
> +-----------------------
> +
> +The substitution of spinlock_t and rwlock_t on PREEMPT_RT enabled kernels
> +with RT-mutex based implementations has a few implications.
> +
> +On a non PREEMPT_RT enabled kernel the following code construct is
> +perfectly fine::
> +
> +   local_irq_disable();
> +   spin_lock(&lock);
> +
> +and fully equivalent to::
> +
> +   spin_lock_irq(&lock);
> +
> +Same applies to rwlock_t and the _irqsave() suffix variant.
> +
> +On a PREEMPT_RT enabled kernel this breaks because the RT-mutex
> +substitution expects a fully preemptible context.
> +
> +The preferred solution is to use :c:func:`spin_lock_irq()` or
> +:c:func:`spin_lock_irqsave()` and their unlock counterparts.
> +
> +PREEMPT_RT also offers a local_lock mechanism to substitute the
> +local_irq_disable/save() constructs in cases where a separation of the
> +interrupt disabling and the locking is really unavoidable. This should be
> +restricted to very rare cases.

It would also be nice to mention where else local_lock() can be used, such as
protecting per-cpu variables without disabling preemption. Could we add a
section on protecting per-cpu data? (Happy to do that and send a patch if you
prefer).

> +raw_spinlock_t
> +--------------
> +
> +Locking of a raw_spinlock_t disables preemption and eventually interrupts.
> +Therefore code inside the critical region has to be careful to avoid calls
> +into code which takes a regular spinlock_t or rwlock_t. A prime example is
> +memory allocation.
> +
> +On a non PREEMPT_RT enabled kernel the following code construct is
> +perfectly fine code::
> +
> +  raw_spin_lock(&lock);
> +  p = kmalloc(sizeof(*p), GFP_ATOMIC);
> +
> +On a PREEMPT_RT enabled kernel this breaks because the memory allocator is
> +fully preemptible and therefore does not support allocations from truly
> +atomic contexts.
> +
> +Contrary to that the following code construct is perfectly fine on
> +PREEMPT_RT as spin_lock() does not disable preemption::
> +
> +  spin_lock(&lock);
> +  p = kmalloc(sizeof(*p), GFP_ATOMIC);
> +
> +Most places which use GFP_ATOMIC allocations are safe on PREEMPT_RT as the
> +execution is forced into thread context and the lock substitution is
> +ensuring preemptibility.
> +
> +
> +bit spinlocks
> +-------------
> +
> +Bit spinlocks are problematic for PREEMPT_RT as they cannot be easily
> +substituted by an RT-mutex based implementation for obvious reasons.
> +
> +The semantics of bit spinlocks are preserved on a PREEMPT_RT enabled kernel
> +and the caveats vs. raw_spinlock_t apply.
> +
> +Some bit spinlocks are substituted by regular spinlock_t for PREEMPT_RT but
> +this requires conditional (#ifdef'ed) code changes at the usage side while
> +the spinlock_t substitution is simply done by the compiler and the
> +conditionals are restricted to header files and core implementation of the
> +locking primitives and the usage sites do not require any changes.
> +
> +
> +Lock type nesting rules
> +=======================
> +
> +The most basic rules are:
> +
> +  - Lock types of the same lock category (sleeping, spinning) can nest
> +    arbitrarily as long as they respect the general lock ordering rules to
> +    prevent deadlocks.
> +
> +  - Sleeping lock types cannot nest inside spinning lock types.
> +
> +  - Spinning lock types can nest inside sleeping lock types.
> +
> +These rules apply in general independent of CONFIG_PREEMPT_RT.
> +
> +As PREEMPT_RT changes the lock category of spinlock_t and rwlock_t from
> +spinning to sleeping this has obviously restrictions how they can nest with
> +raw_spinlock_t.
> +
> +This results in the following nest ordering:
> +
> +  1) Sleeping locks
> +  2) spinlock_t and rwlock_t
> +  3) raw_spinlock_t and bit spinlocks
> +
> +Lockdep is aware of these constraints to ensure that they are respected.
> +
> +
> +Owner semantics
> +===============
> +
> +Most lock types in the Linux kernel have strict owner semantics, i.e. the
> +context (task) which acquires a lock has to release it.
> +
> +There are two exceptions:
> +
> +  - semaphores
> +  - rwsems
> +
> +semaphores have no strict owner semantics for historical reasons. They are
> +often used for both serialization and waiting purposes. That's generally
> +discouraged and should be replaced by separate serialization and wait
> +mechanisms.
> +
> +rwsems have grown interfaces which allow non owner release for special
> +purposes. This usage is problematic on PREEMPT_RT because PREEMPT_RT
> +substitutes all locking primitives except semaphores with RT-mutex based
> +implementations to provide priority inheritance for all lock types except
> +the truly spinning ones. Priority inheritance on ownerless locks is
> +obviously impossible.
> +
> +For now the rwsem non-owner release excludes code which utilizes it from
> +being used on PREEMPT_RT enabled kernels.

I could not parse the last sentence here, but I think you meant "For now,
PREEMPT_RT enabled kernels disable code that perform a non-owner release of
an rwsem". Correct me if I'm wrong.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel


> In same cases this can be
> +mitigated by disabling portions of the code, in other cases the complete
> +functionality has to be disabled until a workable solution has been found.
>
Thomas Gleixner March 21, 2020, 9:49 p.m. UTC | #14
Joel Fernandes <joel@joelfernandes.org> writes:
>> +rwlock_t
>> +========
>> +
>> +rwlock_t is a multiple readers and single writer lock mechanism.
>> +
>> +On a non PREEMPT_RT enabled kernel rwlock_t is implemented as a spinning
>> +lock and the suffix rules of spinlock_t apply accordingly. The
>> +implementation is fair and prevents writer starvation.
>>
>
> You mentioned writer starvation, but I think it would be good to also mention
> that rwlock_t on a non-PREEMPT_RT kernel also does not have _reader_
> starvation problem, since it uses queued implementation.  This fact is worth
> mentioning here, since further below you explain that an rwlock in PREEMPT_RT
> does have reader starvation problem.

It's worth mentioning. But RT really has only write starvation not
reader starvation.

>> +rwlock_t and PREEMPT_RT
>> +-----------------------
>> +
>> +On a PREEMPT_RT enabled kernel rwlock_t is mapped to a separate
>> +implementation based on rt_mutex which changes the semantics:
>> +
>> + - Same changes as for spinlock_t
>> +
>> + - The implementation is not fair and can cause writer starvation under
>> +   certain circumstances. The reason for this is that a writer cannot grant
>> +   its priority to multiple readers. Readers which are blocked on a writer
>> +   fully support the priority inheritance protocol.
>
> Is it hard to give priority to multiple readers because the number of readers
> to give priority to could be unbounded?

Yes, and it's horribly complex and racy. We had an implemetation years
ago which taught us not to try it again :)

>> +PREEMPT_RT also offers a local_lock mechanism to substitute the
>> +local_irq_disable/save() constructs in cases where a separation of the
>> +interrupt disabling and the locking is really unavoidable. This should be
>> +restricted to very rare cases.
>
> It would also be nice to mention where else local_lock() can be used, such as
> protecting per-cpu variables without disabling preemption. Could we add a
> section on protecting per-cpu data? (Happy to do that and send a patch if you
> prefer).

The local lock section will come soon when we post the local lock
patches again.

>> +rwsems have grown interfaces which allow non owner release for special
>> +purposes. This usage is problematic on PREEMPT_RT because PREEMPT_RT
>> +substitutes all locking primitives except semaphores with RT-mutex based
>> +implementations to provide priority inheritance for all lock types except
>> +the truly spinning ones. Priority inheritance on ownerless locks is
>> +obviously impossible.
>> +
>> +For now the rwsem non-owner release excludes code which utilizes it from
>> +being used on PREEMPT_RT enabled kernels.
>
> I could not parse the last sentence here, but I think you meant "For now,
> PREEMPT_RT enabled kernels disable code that perform a non-owner release of
> an rwsem". Correct me if I'm wrong.

Right, that's what I wanted to say :)

Care to send a delta patch?

Thanks!

        tglx
Joel Fernandes March 22, 2020, 1:36 a.m. UTC | #15
On Sat, Mar 21, 2020 at 10:49:04PM +0100, Thomas Gleixner wrote:
[...] 
> >> +rwsems have grown interfaces which allow non owner release for special
> >> +purposes. This usage is problematic on PREEMPT_RT because PREEMPT_RT
> >> +substitutes all locking primitives except semaphores with RT-mutex based
> >> +implementations to provide priority inheritance for all lock types except
> >> +the truly spinning ones. Priority inheritance on ownerless locks is
> >> +obviously impossible.
> >> +
> >> +For now the rwsem non-owner release excludes code which utilizes it from
> >> +being used on PREEMPT_RT enabled kernels.
> >
> > I could not parse the last sentence here, but I think you meant "For now,
> > PREEMPT_RT enabled kernels disable code that perform a non-owner release of
> > an rwsem". Correct me if I'm wrong.
> 
> Right, that's what I wanted to say :)
> 
> Care to send a delta patch?

Absolutely, doing that now. :-)

thanks,

 - Joel
diff mbox series

Patch

--- a/Documentation/locking/index.rst
+++ b/Documentation/locking/index.rst
@@ -7,6 +7,7 @@  locking
 .. toctree::
     :maxdepth: 1
 
+    locktypes
     lockdep-design
     lockstat
     locktorture
--- /dev/null
+++ b/Documentation/locking/locktypes.rst
@@ -0,0 +1,298 @@ 
+.. _kernel_hacking_locktypes:
+
+==========================
+Lock types and their rules
+==========================
+
+Introduction
+============
+
+The kernel provides a variety of locking primitives which can be divided
+into two categories:
+
+ - Sleeping locks
+ - Spinning locks
+
+This document describes the lock types at least at the conceptual level and
+provides rules for nesting of lock types also under the aspect of PREEMPT_RT.
+
+Lock categories
+===============
+
+Sleeping locks
+--------------
+
+Sleeping locks can only be acquired in preemptible task context.
+
+Some of the implementations allow try_lock() attempts from other contexts,
+but that has to be really evaluated carefully including the question
+whether the unlock can be done from that context safely as well.
+
+Note that some lock types change their implementation details when
+debugging is enabled, so this should be really only considered if there is
+no other option.
+
+Sleeping lock types:
+
+ - mutex
+ - rt_mutex
+ - semaphore
+ - rw_semaphore
+ - ww_mutex
+ - percpu_rw_semaphore
+
+On a PREEMPT_RT enabled kernel the following lock types are converted to
+sleeping locks:
+
+ - spinlock_t
+ - rwlock_t
+
+Spinning locks
+--------------
+
+ - raw_spinlock_t
+ - bit spinlocks
+
+On a non PREEMPT_RT enabled kernel the following lock types are spinning
+locks as well:
+
+ - spinlock_t
+ - rwlock_t
+
+Spinning locks implicitly disable preemption and the lock / unlock functions
+can have suffixes which apply further protections:
+
+ ===================  ====================================================
+ _bh()                Disable / enable bottom halves (soft interrupts)
+ _irq()               Disable / enable interrupts
+ _irqsave/restore()   Save and disable / restore interrupt disabled state
+ ===================  ====================================================
+
+
+rtmutex
+=======
+
+RT-mutexes are mutexes with support for priority inheritance (PI).
+
+PI has limitations on non PREEMPT_RT enabled kernels due to preemption and
+interrupt disabled sections.
+
+On a PREEMPT_RT enabled kernel most of these sections are fully
+preemptible. This is possible because PREEMPT_RT forces most executions
+into task context, especially interrupt handlers and soft interrupts, which
+allows to substitute spinlock_t and rwlock_t with RT-mutex based
+implementations.
+
+
+raw_spinlock_t and spinlock_t
+=============================
+
+raw_spinlock_t
+--------------
+
+raw_spinlock_t is a strict spinning lock implementation regardless of the
+kernel configuration including PREEMPT_RT enabled kernels.
+
+raw_spinlock_t is to be used only in real critical core code, low level
+interrupt handling and places where protecting (hardware) state is required
+to be safe against preemption and eventually interrupts.
+
+Another reason to use raw_spinlock_t is when the critical section is tiny
+to avoid the overhead of spinlock_t on a PREEMPT_RT enabled kernel in the
+contended case.
+
+spinlock_t
+----------
+
+The semantics of spinlock_t change with the state of CONFIG_PREEMPT_RT.
+
+On a non PREEMPT_RT enabled kernel spinlock_t is mapped to raw_spinlock_t
+and has exactly the same semantics.
+
+spinlock_t and PREEMPT_RT
+-------------------------
+
+On a PREEMPT_RT enabled kernel spinlock_t is mapped to a separate
+implementation based on rt_mutex which changes the semantics:
+
+ - Preemption is not disabled
+
+ - The hard interrupt related suffixes for spin_lock / spin_unlock
+   operations (_irq, _irqsave / _irqrestore) do not affect the CPUs
+   interrupt disabled state
+
+ - The soft interrupt related suffix (_bh()) is still disabling the
+   execution of soft interrupts, but contrary to a non PREEMPT_RT enabled
+   kernel, which utilizes the preemption count, this is achieved by a per
+   CPU bottom half locking mechanism.
+
+All other semantics of spinlock_t are preserved:
+
+ - Migration of tasks which hold a spinlock_t is prevented. On a non
+   PREEMPT_RT enabled kernel this is implicit due to preemption disable.
+   PREEMPT_RT has a separate mechanism to achieve this. This ensures that
+   pointers to per CPU variables stay valid even if the task is preempted.
+
+ - Task state preservation. The task state is not affected when a lock is
+   contended and the task has to schedule out and wait for the lock to
+   become available. The lock wake up restores the task state unless there
+   was a regular (not lock related) wake up on the task. This ensures that
+   the task state rules are always correct independent of the kernel
+   configuration.
+
+rwlock_t
+========
+
+rwlock_t is a multiple readers and single writer lock mechanism.
+
+On a non PREEMPT_RT enabled kernel rwlock_t is implemented as a spinning
+lock and the suffix rules of spinlock_t apply accordingly. The
+implementation is fair and prevents writer starvation.
+
+rwlock_t and PREEMPT_RT
+-----------------------
+
+On a PREEMPT_RT enabled kernel rwlock_t is mapped to a separate
+implementation based on rt_mutex which changes the semantics:
+
+ - Same changes as for spinlock_t
+
+ - The implementation is not fair and can cause writer starvation under
+   certain circumstances. The reason for this is that a writer cannot grant
+   its priority to multiple readers. Readers which are blocked on a writer
+   fully support the priority inheritance protocol.
+
+
+PREEMPT_RT caveats
+==================
+
+spinlock_t and rwlock_t
+-----------------------
+
+The substitution of spinlock_t and rwlock_t on PREEMPT_RT enabled kernels
+with RT-mutex based implementations has a few implications.
+
+On a non PREEMPT_RT enabled kernel the following code construct is
+perfectly fine::
+
+   local_irq_disable();
+   spin_lock(&lock);
+
+and fully equivalent to::
+
+   spin_lock_irq(&lock);
+
+Same applies to rwlock_t and the _irqsave() suffix variant.
+
+On a PREEMPT_RT enabled kernel this breaks because the RT-mutex
+substitution expects a fully preemptible context.
+
+The preferred solution is to use :c:func:`spin_lock_irq()` or
+:c:func:`spin_lock_irqsave()` and their unlock counterparts.
+
+PREEMPT_RT also offers a local_lock mechanism to substitute the
+local_irq_disable/save() constructs in cases where a separation of the
+interrupt disabling and the locking is really unavoidable. This should be
+restricted to very rare cases.
+
+
+raw_spinlock_t
+--------------
+
+Locking of a raw_spinlock_t disables preemption and eventually interrupts.
+Therefore code inside the critical region has to be careful to avoid calls
+into code which takes a regular spinlock_t or rwlock_t. A prime example is
+memory allocation.
+
+On a non PREEMPT_RT enabled kernel the following code construct is
+perfectly fine code::
+
+  raw_spin_lock(&lock);
+  p = kmalloc(sizeof(*p), GFP_ATOMIC);
+
+On a PREEMPT_RT enabled kernel this breaks because the memory allocator is
+fully preemptible and therefore does not support allocations from truly
+atomic contexts.
+
+Contrary to that the following code construct is perfectly fine on
+PREEMPT_RT as spin_lock() does not disable preemption::
+
+  spin_lock(&lock);
+  p = kmalloc(sizeof(*p), GFP_ATOMIC);
+
+Most places which use GFP_ATOMIC allocations are safe on PREEMPT_RT as the
+execution is forced into thread context and the lock substitution is
+ensuring preemptibility.
+
+
+bit spinlocks
+-------------
+
+Bit spinlocks are problematic for PREEMPT_RT as they cannot be easily
+substituted by an RT-mutex based implementation for obvious reasons.
+
+The semantics of bit spinlocks are preserved on a PREEMPT_RT enabled kernel
+and the caveats vs. raw_spinlock_t apply.
+
+Some bit spinlocks are substituted by regular spinlock_t for PREEMPT_RT but
+this requires conditional (#ifdef'ed) code changes at the usage side while
+the spinlock_t substitution is simply done by the compiler and the
+conditionals are restricted to header files and core implementation of the
+locking primitives and the usage sites do not require any changes.
+
+
+Lock type nesting rules
+=======================
+
+The most basic rules are:
+
+  - Lock types of the same lock category (sleeping, spinning) can nest
+    arbitrarily as long as they respect the general lock ordering rules to
+    prevent deadlocks.
+
+  - Sleeping lock types cannot nest inside spinning lock types.
+
+  - Spinning lock types can nest inside sleeping lock types.
+
+These rules apply in general independent of CONFIG_PREEMPT_RT.
+
+As PREEMPT_RT changes the lock category of spinlock_t and rwlock_t from
+spinning to sleeping this has obviously restrictions how they can nest with
+raw_spinlock_t.
+
+This results in the following nest ordering:
+
+  1) Sleeping locks
+  2) spinlock_t and rwlock_t
+  3) raw_spinlock_t and bit spinlocks
+
+Lockdep is aware of these constraints to ensure that they are respected.
+
+
+Owner semantics
+===============
+
+Most lock types in the Linux kernel have strict owner semantics, i.e. the
+context (task) which acquires a lock has to release it.
+
+There are two exceptions:
+
+  - semaphores
+  - rwsems
+
+semaphores have no strict owner semantics for historical reasons. They are
+often used for both serialization and waiting purposes. That's generally
+discouraged and should be replaced by separate serialization and wait
+mechanisms.
+
+rwsems have grown interfaces which allow non owner release for special
+purposes. This usage is problematic on PREEMPT_RT because PREEMPT_RT
+substitutes all locking primitives except semaphores with RT-mutex based
+implementations to provide priority inheritance for all lock types except
+the truly spinning ones. Priority inheritance on ownerless locks is
+obviously impossible.
+
+For now the rwsem non-owner release excludes code which utilizes it from
+being used on PREEMPT_RT enabled kernels. In same cases this can be
+mitigated by disabling portions of the code, in other cases the complete
+functionality has to be disabled until a workable solution has been found.