Documentation: Fill the gaps about entry/noinstr constraints

Message ID 878rx5b7i5.ffs@tglx (mailing list archive)
State New, archived

Commit Message

Thomas Gleixner Nov. 30, 2021, 10:31 p.m. UTC
The entry/exit handling for exceptions, interrupts, syscalls and KVM is
not really documented except for some comments.

Fill the gaps.

Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
 Documentation/core-api/index.rst |    8 +
 kernel/entry/common.c            |    1 
 3 files changed, 276 insertions(+), 1 deletion(-)

Comments

Mark Rutland Dec. 1, 2021, 10:56 a.m. UTC | #1
Hi Thomas,

On Tue, Nov 30, 2021 at 11:31:30PM +0100, Thomas Gleixner wrote:
> The entry/exit handling for exceptions, interrupts, syscalls and KVM is
> not really documented except for some comments.
> 
> Fill the gaps.

Thanks for this! Now there's less chance I'll get this wrong. :)

I have a couple of minor comments below -- mostly typo/formatting junk.

> 
> Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
>  Documentation/core-api/index.rst |    8 +
>  kernel/entry/common.c            |    1 

I think the change to kernel/entry/common.c got included by accident?

>  3 files changed, 276 insertions(+), 1 deletion(-)
> 
> --- /dev/null
> +++ b/Documentation/core-api/entry.rst
> @@ -0,0 +1,268 @@
> +Entry/exit handling for exceptions, interrupts, syscalls and KVM
> +================================================================
> +
> +For any transition from one execution domain into another the kernel
> +requires update of various states. The state updates have strict rules
> +versus ordering.
> +
> +The states which need to be updated are:
> +
> +  * Lockdep
> +  * RCU
> +  * Preemption counter
> +  * Tracing
> +  * Time accounting
> +
> +The update order depends on the transition type and is explained below in
> +the transition type sections.
> +
> +Non-instrumentable code - noinstr
> +---------------------------------
> +
> +Low level transition code cannot be instrumented before RCU is watching and
> +after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most
> +instrumentation facilities depend on RCU.
> +
> +Aside of that many architectures have to save register state, e.g. debug or
> +cause registers before another exception of the same type can happen. A
> +breakpoint in the breakpoint entry code would overwrite the debug registers
> +of the inital breakpoint.
> +
> +Such code has to be marked with the 'noinstr' attribute. That places the
> +code into a special section which is taboo for instrumentation and debug
> +facilities.
> +
> +In a function which is marked 'noinstr' it's only allowed to call into
> +non-instrumentable code except when the invocation of instrumentable code
> +is annotated with a instrumentation_begin()/instrumentation_end() pair::
> +
> +  noinstr void entry(void)
> +  {
> +  	handle_entry();     <-- must be 'noinstr' or '__always_inline'
> +	...
> +	instrumentation_begin();
> +	handle_context();   <-- instrumentable code
> +	instrumentation_end();
> +	...
> +	handle_exit();     <-- must be 'noinstr' or '__always_inline'
> +  }
> +
> +This allows verification of the 'noinstr' restrictions via objtool on
> +supported architectures.
> +
> +Invoking non-instrumentable functions from instrumentable context has no
> +restrictions and is useful to protect e.g. state switching which would
> +cause malfunction if instrumented.
> +
> +All non-instrumentable entry/exit code sections before and after the RCU
> +state transitions must run with interrupts disabled.
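
For readers who cannot run objtool, the calling rule can be illustrated with a
small user-space model. This is only a sketch under loud assumptions: in the
kernel, instrumentation_begin()/instrumentation_end() are build-time
annotations with no runtime effect, whereas the hypothetical model_* helpers
below turn them into a counter so that a violation trips an assert:

```c
#include <assert.h>

/*
 * Hypothetical runtime model. The real instrumentation_begin()/end()
 * are annotations verified by objtool, not counters.
 */
static int model_instr_depth;
static int model_context_calls;

void model_instrumentation_begin(void) { model_instr_depth++; }
void model_instrumentation_end(void)   { model_instr_depth--; }

/* Stands in for instrumentable code: only legal inside a begin/end pair. */
void model_handle_context(void)
{
	assert(model_instr_depth > 0);
	model_context_calls++;
}

/* Stand-ins for 'noinstr' or '__always_inline' helpers: legal anywhere. */
void model_handle_entry(void) { }
void model_handle_exit(void) { }

/* Mirrors the shape of the entry() example quoted above. */
void model_entry(void)
{
	model_handle_entry();

	model_instrumentation_begin();
	model_handle_context();
	model_instrumentation_end();

	model_handle_exit();
}
```

Calling model_handle_context() outside the begin/end pair aborts in this
model, which is roughly the class of mistake objtool reports at build time.
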
> +
> +Syscalls
> +--------
> +
> +Syscall entry exit code starts obviously in low level architecture specific

As a small nit, can we remove the "obviously"? It's certainly obvious to you
and me, but it doesn't meaningfully affect the sentence either way.

> +assembly code and calls out into C-code after establishing low level
> +architecture specific state and stack frames. This low level code must not
> +be instrumented. A typical syscall handling function invoked from low level
> +assembly code looks like this::
> +
> +  noinstr void do_syscall(struct pt_regs \*regs, int nr)
                                            ^^

Is `\*` necessary here? ... and/or should this be an explicit code block (which
IIUC doesn't require this escaping), e.g.

.. code-block:: c

	noinstr void do_syscall(struct pt_regs *regs, int nr)
	{
		...
	}

Similar comment for the other code snippets in this patch.

> +  {
> +	arch_syscall_enter(regs);
> +	nr = syscall_enter_from_user_mode(regs, nr);
> +
> +	instrumentation_begin();
> +
> +	if (!invoke_syscall(regs, nr) && nr != -1)
> +	 	result_reg(regs) = __sys_ni_syscall(regs);
> +
> +	instrumentation_end();
> +
> +	syscall_exit_to_user_mode(regs);
> +  }
> +
> +syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
> +establishes state in the following order:
> +
> +  * Lockdep
> +  * RCU / Context tracking
> +  * Tracing
> +
> +and then invokes the various entry work functions like ptrace, seccomp,
> +audit, syscall tracing etc. After the function returns instrumentable code
> +can be invoked. After returning from the syscall handler the instrumentable
> +code section ends and syscall_exit_to_user_mode() is invoked.
> +
> +syscall_exit_to_user_mode() handles all work which needs to be done before
> +returning to user space like tracing, audit, signals, task work etc. After
> +that it invokes exit_to_user_mode() which again handles the state
> +transition in the reverse order:
> +
> +  * Tracing
> +  * RCU / Context tracking
> +  * Lockdep
> +
> +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> +available as fine grained subfunctions in cases where the architecture code
> +has to do extra work between the various steps. In such cases it has to
> +ensure that enter_from_user_mode() is called first on entry and
> +exit_to_user_mode() is called last on exit.
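
The ordering described in this section can be sketched as a user-space model
that records each step in a log. All model_* names and the log strings are
made up for illustration; the real functions update lockdep, RCU/context
tracking and tracing state rather than writing strings:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical trace buffer recording the order of state updates. */
static char model_log[128];

static void model_note(const char *step)
{
	size_t used = strlen(model_log);

	assert(used + strlen(step) + 2 <= sizeof(model_log));
	snprintf(model_log + used, sizeof(model_log) - used, "%s ", step);
}

/* Entry: lockdep first, then RCU / context tracking, then tracing. */
void model_enter_from_user_mode(void)
{
	model_note("lockdep");
	model_note("rcu");
	model_note("tracing");
}

/* Exit: the same states in reverse order. */
void model_exit_to_user_mode(void)
{
	model_note("tracing");
	model_note("rcu");
	model_note("lockdep");
}

void model_syscall_enter_from_user_mode(void)
{
	model_enter_from_user_mode();	/* must come first on entry */
	model_note("entry-work");	/* ptrace, seccomp, audit, ... */
}

void model_syscall_exit_to_user_mode(void)
{
	model_note("exit-work");	/* tracing, audit, signals, ... */
	model_exit_to_user_mode();	/* must come last on exit */
}

void model_do_syscall(void)
{
	model_syscall_enter_from_user_mode();
	model_note("handler");		/* instrumentable section */
	model_syscall_exit_to_user_mode();
}
```

After one model_do_syscall() call the log shows the state updates wrapped
symmetrically around the entry work, the handler and the exit work.
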
> +
> +
> +KVM
> +---
> +
> +Entering or exiting guest mode is very similar to syscalls. From the host
> +kernel point of view the CPU goes off into user space when entering the
> +guest and returns to the kernel on exit.
> +
> +kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()
> +and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
> +The state operations have the same ordering.
> +
> +Task work handling is done separately for guest at the boundary of the
> +vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
> +the work handled on return to user space.
> +
> +Interrupts and regular exceptions
> +---------------------------------
> +
> +Interrupts entry and exit handling is slightly more complex than syscalls
> +and KVM transitions.
> +
> +If an interrupt is raised while the CPU executes in user space, the entry
> +and exit handling is exactly the same as for syscalls.
> +
> +If the interrupt is raised while the CPU executes in kernel space the entry
> +and exit handling is slightly different. RCU state is only updated when the
> +interrupt was raised in context of the idle task because that's the only

Since we have an idle task for each cpu, perhaps either:

  s/the idle task/an idle task/
  s/the idle task/the CPU's idle task/

> +kernel context where RCU can be not watching on NOHZ enabled kernels.
> +Lockdep and tracing have to be updated unconditionally.
> +
> +irqentry_enter() and irqentry_exit() provide the implementation for this.
> +
> +The architecture specific part looks similar to syscall handling::
> +
> +  noinstr void do_interrupt(struct pt_regs \*regs, int nr)
> +  {
> +	arch_interrupt_enter(regs);
> +	state = irqentry_enter(regs);
> +
> +	instrumentation_begin();
> +
> +	irq_enter_rcu();
> +	invoke_irq_handler(regs, nr);
> +	irq_exit_rcu();
> +
> +	instrumentation_end();
> +
> +	irqentry_exit(regs, state);
> +  }
> +
> +Note, that the invocation of the actual interrupt handler is within a
> +irq_enter_rcu() and irq_exit_rcu() pair.
> +
> +irq_enter_rcu() updates the preemption count which makes in_hardirq()
> +return true, handles NOHZ tick state and interrupt time accounting. This
> +means that up to the point where irq_enter_rcu() is invoked in_hardirq()
> +returns false.
> +
> +irq_exit_rcu() handles interrupt time accounting, undoes the preemption
> +count update and eventually handles soft interrupts and NOHZ tick state.
> +
> +The preemption count could be established in irqentry_enter() already, but
> +there is no real value to do so. This allows the preemption count to be
> +traced and just puts a restriction on the early entry code up to
> +irq_enter_rcu().
> +
> +This also keeps the handling vs. irq_exit_rcu() symmetric and
> +irq_exit_rcu() must undo the preempt count elevation before handling soft
> +interrupts and irqentry_exit() also requires that because it might
> +schedule.
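
Two points from this section, the conditional RCU update and in_hardirq()
only becoming true in irq_enter_rcu(), can be sketched as a user-space model.
Everything below is an assumption for illustration: the model_* names, the
offset value and the single exit_rcu flag are simplifications and not the
kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_HARDIRQ_OFFSET 0x10000u	/* illustrative value */

static unsigned int model_preempt_count;
static bool model_rcu_watching = true;
static bool model_in_hardirq_before;
static bool model_in_hardirq_in_handler;

/* Hypothetical stand-in for the state returned by irqentry_enter(). */
struct model_irqentry_state {
	bool exit_rcu;
};

bool model_in_hardirq(void)
{
	return (model_preempt_count & MODEL_HARDIRQ_OFFSET) != 0;
}

/* RCU is only switched when it was not watching (idle task case). */
struct model_irqentry_state model_irqentry_enter(void)
{
	struct model_irqentry_state state = { .exit_rcu = !model_rcu_watching };

	if (state.exit_rcu)
		model_rcu_watching = true;
	return state;
}

void model_irqentry_exit(struct model_irqentry_state state)
{
	if (state.exit_rcu)
		model_rcu_watching = false;
}

void model_irq_enter_rcu(void) { model_preempt_count += MODEL_HARDIRQ_OFFSET; }
void model_irq_exit_rcu(void)  { model_preempt_count -= MODEL_HARDIRQ_OFFSET; }

void model_do_interrupt(void)
{
	struct model_irqentry_state state = model_irqentry_enter();

	/* Early entry code: the preemption count is not elevated yet. */
	model_in_hardirq_before = model_in_hardirq();

	model_irq_enter_rcu();
	assert(model_rcu_watching);
	model_in_hardirq_in_handler = model_in_hardirq();
	model_irq_exit_rcu();

	model_irqentry_exit(state);
}
```

Setting model_rcu_watching to false before the call models an interrupt that
hits the idle task; the state is restored on exit.
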
> +
> +
> +NMI and NMI-like exceptions
> +---------------------------
> +
> +NMIs and NMI like exceptions, e.g. Machine checks, double faults, debug
> +interrupts etc. can hit any context and have to be extra careful vs. the
> +state.
> +
> +Debug exceptions can handle user space breakpoints or watchpoints in the
> +same way as an interrupt which was raised while executing in user space,
> +but kernel mode debug exceptions have to be treated like NMIs as they can
> +even happen in NMI context, e.g. due to code patching.
> +
> +Also Machine check exceptions can handle user mode exceptions like regular
> +interrupts, but for kernel mode exceptions they have to be treated like
> +NMIs.
> +
> +NMIs and the other NMI-like exceptions handle state transitions in the most
> +straight forward way and do not differentiate between user and kernel mode
> +origin.
> +
> +The state update on entry is handled in irqentry_nmi_enter() which updates
> +state in the following order:
> +
> +  * Preemption counter
> +  * Lockdep
> +  * RCU
> +  * Tracing
> +
> +The exit counterpart irqenttry_nmi_exit() does the reverse operation in the
                        ^^^^^^^^^

 s/irqenttry/irqentry/

> +reverse order.
> +
> +Note, that the update of the preemption counter has to be the first
> +operation on enter and the last operation on exit. The reason is that both
> +lockdep and RCU rely on in_nmi() returning true in this case. The
> +preemption count modification in the NMI entry/exit case can obviously not
> +be traced.

Could we say "must not" instead of "can not", e.g.

  The preemption count modification in the NMI entry/exit must not be traced.

That way it's clearly a requirement, rather than a limitation.
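
As an aside, the reason for "first on enter, last on exit" can be sketched in
a user-space model where every state hook checks in_nmi(). The model_* names
and the offset value are assumptions for illustration only and do not mirror
the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NMI_OFFSET 0x100000u	/* illustrative value */

static unsigned int model_preempt_count;
static int model_hooks_total;
static int model_hooks_seen_in_nmi;

bool model_in_nmi(void)
{
	return (model_preempt_count & MODEL_NMI_OFFSET) != 0;
}

/* Stands in for a lockdep, RCU or tracing update that relies on in_nmi(). */
void model_state_hook(void)
{
	model_hooks_total++;
	if (model_in_nmi())
		model_hooks_seen_in_nmi++;
}

void model_irqentry_nmi_enter(void)
{
	model_preempt_count += MODEL_NMI_OFFSET;	/* first operation */
	model_state_hook();				/* lockdep */
	model_state_hook();				/* RCU */
	model_state_hook();				/* tracing */
}

void model_irqentry_nmi_exit(void)
{
	model_state_hook();				/* tracing */
	model_state_hook();				/* RCU */
	model_state_hook();				/* lockdep */
	assert(model_in_nmi());
	model_preempt_count -= MODEL_NMI_OFFSET;	/* last operation */
}
```

With this ordering all six hooks observe in_nmi() as true. Moving the counter
update after any hook on entry, or before any hook on exit, would break that.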

> +Architecture specific code looks like this::
> +
> +  noinstr void do_nmi(struct pt_regs \*regs)
> +  {
> +	arch_nmi_enter(regs);
> +	state = irqentry_nmi_enter(regs);
> +
> +	instrumentation_begin();
> +
> +	invoke_nmi_handler(regs);
> +
> +	instrumentation_end();
> +	irqentry_nmi_exit(regs);
> +  }

To keep the begin/end and enter/exit calls visually balanced, should the
instrumentation_end() call be followed by a blank line, e.g.

	arch_nmi_enter(regs);
	state = irqentry_nmi_enter(regs);

	instrumentation_begin();

	invoke_nmi_handler(regs);

	instrumentation_end();

	irqentry_nmi_exit(regs);

... or sandwiched around invoke_nmi_handler(), e.g.

	arch_nmi_enter(regs);
	state = irqentry_nmi_enter(regs);

	instrumentation_begin();
	invoke_nmi_handler(regs);
	instrumentation_end();

	irqentry_nmi_exit(regs);

Since the examples in this file are fairly simple I'd suggest the latter.

> +and for e.g. a debug exception it can look like this::
> +
> +  noinstr void do_debug(struct pt_regs \*regs)
> +  {
> +	arch_nmi_enter(regs);
> +
> +	debug_regs = save_debug_regs();
> +
> +	if (user_mode(regs)) {
> +		state = irqentry_enter(regs);
> +
> +		instrumentation_begin();
> +
> +		user_mode_debug_handler(regs, debug_regs);
> +
> +		instrumentation_end();
> +
> +		irqentry_exit(regs, state);
> +  	} else {
> +  		state = irqentry_nmi_enter(regs);
> +
> +		instrumentation_begin();
> +
> +		kernel_mode_debug_handler(regs, debug_regs);
> +
> +		instrumentation_end();
> +
> +		irqentry_nmi_exit(regs, state);
> +	}
> +  }
> +
> +There is no combined irqentry_nmi_if_kernel() function available as the
> +above cannot be handled in an exception agnostic way.
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -44,6 +44,14 @@ Library functionality that is used throu
>     timekeeping
>     errseq
>  
> +Low level entry and exit
> +========================
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   entry
> +
>  Concurrency primitives
>  ======================
>  
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -1,5 +1,4 @@
>  // SPDX-License-Identifier: GPL-2.0
> -

Unrelated whitespace change?

Thanks,
Mark.

>  #include <linux/context_tracking.h>
>  #include <linux/entry-common.h>
>  #include <linux/highmem.h>
Thomas Gleixner Dec. 1, 2021, 6:14 p.m. UTC | #2
Mark,

On Wed, Dec 01 2021 at 10:56, Mark Rutland wrote:
> On Tue, Nov 30, 2021 at 11:31:30PM +0100, Thomas Gleixner wrote:
>> ---
>>  Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
>>  Documentation/core-api/index.rst |    8 +
>>  kernel/entry/common.c            |    1 
>
> I think the change to kernel/entry/common.c got included by accident?

That's what I get from doing such things 30 minutes before midnight...

>> +
>> +Syscall entry exit code starts obviously in low level architecture specific
>
> As a small nit, can we remove the "obviously"? It's certainly obvious to you
> and me, but it doesn't meaningfully affect the sentence either way.

Indeed.

>> +assembly code and calls out into C-code after establishing low level
>> +architecture specific state and stack frames. This low level code must not
>> +be instrumented. A typical syscall handling function invoked from low level
>> +assembly code looks like this::
>> +
>> +  noinstr void do_syscall(struct pt_regs \*regs, int nr)
>                                             ^^
>
> Is `\*` necessary here? ... and/or should this be an explicit code block (which
> IIUC doesn't require this esacping), e.g.
>
> .. code-block:: c

Right. Let me try that.

>       noinstr void do_syscall(struct pt_regs *regs, int nr)
>> +
>> +If the interrupt is raised while the CPU executes in kernel space the entry
>> +and exit handling is slightly different. RCU state is only updated when the
>> +interrupt was raised in context of the idle task because that's the only
>
> Since we have an idle task for each cpu, perhaps either:
>
>   s/the idle task/an idle task/
>   s/the idle task/the CPU's idle task/

Yes, that's more precise

>> +Note, that the update of the preemption counter has to be the first
>> +operation on enter and the last operation on exit. The reason is that both
>> +lockdep and RCU rely on in_nmi() returning true in this case. The
>> +preemption count modification in the NMI entry/exit case can obviously not
>> +be traced.
>
> Could we say "must not" instead of "can not", e.g.
>
>   The preemption count modification in the NMI entry/exit must not be traced.
>
> That way it's clearly a requirement, rather than a limitation.

Yes.

>> +Architecture specific code looks like this::
>> +
>> +  noinstr void do_nmi(struct pt_regs \*regs)
>> +  {
>> +	arch_nmi_enter(regs);
>> +	state = irqentry_nmi_enter(regs);
>> +
>> +	instrumentation_begin();
>> +
>> +	invoke_nmi_handler(regs);
>> +
>> +	instrumentation_end();
>> +	irqentry_nmi_exit(regs);
>> +  }
>
> To keep the begin/end and enter/exit calls visually balanced, should the
> instrumentation_end() call have trailing a line space, e.g.

Yup.

Thanks,

        tglx
Mark Rutland Dec. 1, 2021, 6:23 p.m. UTC | #3
On Wed, Dec 01, 2021 at 07:14:41PM +0100, Thomas Gleixner wrote:
> Mark,
> 
> On Wed, Dec 01 2021 at 10:56, Mark Rutland wrote:
> > On Tue, Nov 30, 2021 at 11:31:30PM +0100, Thomas Gleixner wrote:
> >> ---
> >>  Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
> >>  Documentation/core-api/index.rst |    8 +
> >>  kernel/entry/common.c            |    1 
> >
> > I think the change to kernel/entry/common.c got included by accident?
> 
> That's what I get from doing such things 30 minutes before midnight...

Ah, I had debugged it down to:

nobikeshed void do_rst(struct tglx *tglx);
{
	aargh_rst_enter(tglx);

	documentation_begin();
	invoke_editor(tglx);
 	documentation_end();
}

... where I think we forgot the:

	enter_from_sleep_mode(tglx);
	...
	exit_to_sleep_mode(tglx);

Mark.
Thomas Gleixner Dec. 1, 2021, 8:28 p.m. UTC | #4
On Wed, Dec 01 2021 at 18:23, Mark Rutland wrote:
> On Wed, Dec 01, 2021 at 07:14:41PM +0100, Thomas Gleixner wrote:
>> Mark,
>> 
>> On Wed, Dec 01 2021 at 10:56, Mark Rutland wrote:
>> > On Tue, Nov 30, 2021 at 11:31:30PM +0100, Thomas Gleixner wrote:
>> >> ---
>> >>  Documentation/core-api/entry.rst |  268 +++++++++++++++++++++++++++++++++++++++
>> >>  Documentation/core-api/index.rst |    8 +
>> >>  kernel/entry/common.c            |    1 
>> >
>> > I think the change to kernel/entry/common.c got included by accident?
>> 
>> That's what I get from doing such things 30 minutes before midnight...
>
> Ah, I had debugged it down to:
>
> nobikeshed void do_rst(struct tglx *tglx);
> {
> 	aargh_rst_enter(tglx);
>
> 	documentation_begin();
> 	invoke_editor(tglx);
>  	documentation_end();
> }
>
> ... where I think we forgot the:
>
> 	enter_from_sleep_mode(tglx);
> 	...
> 	exit_to_sleep_mode(tglx);

ROTFL. You made my day!
Patch

--- /dev/null
+++ b/Documentation/core-api/entry.rst
@@ -0,0 +1,268 @@ 
+Entry/exit handling for exceptions, interrupts, syscalls and KVM
+================================================================
+
+For any transition from one execution domain into another the kernel
+requires updates of various states. These state updates are subject to
+strict ordering rules.
+
+The states which need to be updated are:
+
+  * Lockdep
+  * RCU
+  * Preemption counter
+  * Tracing
+  * Time accounting
+
+The update order depends on the transition type and is explained below in
+the transition type sections.
+
+Non-instrumentable code - noinstr
+---------------------------------
+
+Low level transition code cannot be instrumented before RCU is watching and
+after RCU went into a non watching state (NOHZ, NOHZ_FULL) as most
+instrumentation facilities depend on RCU.
+
+Aside from that, many architectures have to save register state, e.g. debug
+or cause registers, before another exception of the same type can happen. A
+breakpoint in the breakpoint entry code would overwrite the debug registers
+of the initial breakpoint.
+
+Such code has to be marked with the 'noinstr' attribute. That places the
+code into a special section which is taboo for instrumentation and debug
+facilities.
+
+A function which is marked 'noinstr' is only allowed to call into
+non-instrumentable code, except when the invocation of instrumentable code
+is annotated with an instrumentation_begin()/instrumentation_end() pair::
+
+  noinstr void entry(void)
+  {
+	handle_entry();     <-- must be 'noinstr' or '__always_inline'
+	...
+	instrumentation_begin();
+	handle_context();   <-- instrumentable code
+	instrumentation_end();
+	...
+	handle_exit();     <-- must be 'noinstr' or '__always_inline'
+  }
+
+This allows verification of the 'noinstr' restrictions via objtool on
+supported architectures.
+
+Invoking non-instrumentable functions from instrumentable context has no
+restrictions and is useful to protect e.g. state switching which would
+cause malfunction if instrumented.
+
+All non-instrumentable entry/exit code sections before and after the RCU
+state transitions must run with interrupts disabled.
+
+Syscalls
+--------
+
+Syscall entry/exit code starts in low level architecture specific assembly
+code and calls out into C code after establishing low level architecture
+specific state and stack frames. This low level code must not be
+instrumented. A typical syscall handling function invoked from low level
+assembly code looks like this::
+
+  noinstr void do_syscall(struct pt_regs *regs, int nr)
+  {
+	arch_syscall_enter(regs);
+	nr = syscall_enter_from_user_mode(regs, nr);
+
+	instrumentation_begin();
+
+	if (!invoke_syscall(regs, nr) && nr != -1)
+		result_reg(regs) = __sys_ni_syscall(regs);
+
+	instrumentation_end();
+
+	syscall_exit_to_user_mode(regs);
+  }
+
+syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
+establishes state in the following order:
+
+  * Lockdep
+  * RCU / Context tracking
+  * Tracing
+
+and then invokes the various entry work functions like ptrace, seccomp,
+audit, syscall tracing etc. After the function returns instrumentable code
+can be invoked. After returning from the syscall handler the instrumentable
+code section ends and syscall_exit_to_user_mode() is invoked.
+
+syscall_exit_to_user_mode() handles all work which needs to be done before
+returning to user space like tracing, audit, signals, task work etc. After
+that it invokes exit_to_user_mode() which again handles the state
+transition in the reverse order:
+
+  * Tracing
+  * RCU / Context tracking
+  * Lockdep
+
+syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
+available as fine grained subfunctions in cases where the architecture code
+has to do extra work between the various steps. In such cases it has to
+ensure that enter_from_user_mode() is called first on entry and
+exit_to_user_mode() is called last on exit.
+
+
+KVM
+---
+
+Entering or exiting guest mode is very similar to syscalls. From the host
+kernel point of view the CPU goes off into user space when entering the
+guest and returns to the kernel on exit.
+
+kvm_guest_enter_irqoff() is a KVM specific variant of exit_to_user_mode()
+and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
+The state operations have the same ordering.
+
+Task work handling is done separately for the guest at the boundary of the
+vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
+the work handled on return to user space.
+
+Interrupts and regular exceptions
+---------------------------------
+
+Interrupt entry and exit handling is slightly more complex than syscall
+and KVM transitions.
+
+If an interrupt is raised while the CPU executes in user space, the entry
+and exit handling is exactly the same as for syscalls.
+
+If the interrupt is raised while the CPU executes in kernel space the entry
+and exit handling is slightly different. RCU state is only updated when the
+interrupt was raised in the context of the CPU's idle task because that's
+the only kernel context where RCU might not be watching on NOHZ enabled
+kernels. Lockdep and tracing have to be updated unconditionally.
+
+irqentry_enter() and irqentry_exit() provide the implementation for this.
+
+The architecture specific part looks similar to syscall handling::
+
+  noinstr void do_interrupt(struct pt_regs *regs, int nr)
+  {
+	arch_interrupt_enter(regs);
+	state = irqentry_enter(regs);
+
+	instrumentation_begin();
+
+	irq_enter_rcu();
+	invoke_irq_handler(regs, nr);
+	irq_exit_rcu();
+
+	instrumentation_end();
+
+	irqentry_exit(regs, state);
+  }
+
+Note that the invocation of the actual interrupt handler is within an
+irq_enter_rcu() and irq_exit_rcu() pair.
+
+irq_enter_rcu() updates the preemption count which makes in_hardirq()
+return true, handles NOHZ tick state and interrupt time accounting. This
+means that up to the point where irq_enter_rcu() is invoked in_hardirq()
+returns false.
+
+irq_exit_rcu() handles interrupt time accounting, undoes the preemption
+count update and eventually handles soft interrupts and NOHZ tick state.
+
+The preemption count could be established in irqentry_enter() already, but
+there is no real value in doing so. Deferring it allows the preemption count
+to be traced and just puts a restriction on the early entry code up to
+irq_enter_rcu().
+
+This also keeps the handling symmetric with irq_exit_rcu(), which must
+undo the preempt count elevation before handling soft interrupts.
+irqentry_exit() requires the elevation to be undone as well because it
+might schedule.
+
+
+NMI and NMI-like exceptions
+---------------------------
+
+NMIs and NMI-like exceptions, e.g. machine checks, double faults, debug
+interrupts etc. can hit any context and have to be extra careful vs. the
+state.
+
+Debug exceptions can handle user space breakpoints or watchpoints in the
+same way as an interrupt which was raised while executing in user space,
+but kernel mode debug exceptions have to be treated like NMIs as they can
+even happen in NMI context, e.g. due to code patching.
+
+Similarly, machine check exceptions raised in user mode can be handled like
+regular interrupts, but those raised in kernel mode have to be treated like
+NMIs.
+
+NMIs and the other NMI-like exceptions handle state transitions in the most
+straightforward way and do not differentiate between user and kernel mode
+origin.
+
+The state update on entry is handled in irqentry_nmi_enter() which updates
+state in the following order:
+
+  * Preemption counter
+  * Lockdep
+  * RCU
+  * Tracing
+
+The exit counterpart irqentry_nmi_exit() does the reverse operation in the
+reverse order.
+
+Note that the update of the preemption counter has to be the first
+operation on enter and the last operation on exit. The reason is that both
+lockdep and RCU rely on in_nmi() returning true in this case. The
+preemption count modification in the NMI entry/exit case must not be
+traced.
+
+Architecture specific code looks like this::
+
+  noinstr void do_nmi(struct pt_regs *regs)
+  {
+	arch_nmi_enter(regs);
+	state = irqentry_nmi_enter(regs);
+
+	instrumentation_begin();
+
+	invoke_nmi_handler(regs);
+
+	instrumentation_end();
+	irqentry_nmi_exit(regs, state);
+  }
+
+and for e.g. a debug exception it can look like this::
+
+  noinstr void do_debug(struct pt_regs *regs)
+  {
+	arch_nmi_enter(regs);
+
+	debug_regs = save_debug_regs();
+
+	if (user_mode(regs)) {
+		state = irqentry_enter(regs);
+
+		instrumentation_begin();
+
+		user_mode_debug_handler(regs, debug_regs);
+
+		instrumentation_end();
+
+		irqentry_exit(regs, state);
+	} else {
+		state = irqentry_nmi_enter(regs);
+
+		instrumentation_begin();
+
+		kernel_mode_debug_handler(regs, debug_regs);
+
+		instrumentation_end();
+
+		irqentry_nmi_exit(regs, state);
+	}
+  }
+
+There is no combined irqentry_nmi_if_kernel() function available as the
+above cannot be handled in an exception agnostic way.
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -44,6 +44,14 @@  Library functionality that is used throu
    timekeeping
    errseq
 
+Low level entry and exit
+========================
+
+.. toctree::
+   :maxdepth: 1
+
+   entry
+
 Concurrency primitives
 ======================
 
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -1,5 +1,4 @@ 
 // SPDX-License-Identifier: GPL-2.0
-
 #include <linux/context_tracking.h>
 #include <linux/entry-common.h>
 #include <linux/highmem.h>