diff mbox

ARM: document the use of NEON in kernel mode

Message ID 1376050869-29255-1-git-send-email-ard.biesheuvel@linaro.org (mailing list archive)
State New, archived
Headers show

Commit Message

Ard Biesheuvel Aug. 9, 2013, 12:21 p.m. UTC
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 Documentation/arm/kernel_mode_neon.txt | 132 +++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100644 Documentation/arm/kernel_mode_neon.txt

Comments

Domenico Andreoli Aug. 10, 2013, 1:34 p.m. UTC | #1
Hi Ard!

On Fri, Aug 09, 2013 at 02:21:09PM +0200, Ard Biesheuvel wrote:
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  Documentation/arm/kernel_mode_neon.txt | 132 +++++++++++++++++++++++++++++++++
>  1 file changed, 132 insertions(+)
>  create mode 100644 Documentation/arm/kernel_mode_neon.txt
> 
> diff --git a/Documentation/arm/kernel_mode_neon.txt b/Documentation/arm/kernel_mode_neon.txt
> new file mode 100644
> index 0000000..4c2de85
> --- /dev/null
> +++ b/Documentation/arm/kernel_mode_neon.txt
> @@ -0,0 +1,132 @@
> +Kernel mode NEON
> +================
> +
> +TL;DR summary
> +-------------
> +* Use only NEON instructions, or VFP instructions that don't rely on support
> +  code
> +* Isolate your NEON code in a separate compilation unit, and compile it with
> +  '-mfpu=neon -mfloat-abi=softfp'
> +* Put kernel_neon_begin() and kernel_neon_end() calls around the calls into your
> +  NEON code
> +* Don't sleep in your NEON code, and be aware that it will be executed with
> +  preemption disabled
> +
> +
> +Introduction
> +------------
> +It is possible to use NEON instructions (and in some cases, VFP instructions) in
> +code that runs in kernel mode. However, for performance reasons, the NEON/VFP
> +register file is not preserved and restored at every context switch or taken
> +exception like the normal register file is, so some manual intervention is
> +required. Furthermore, special care is required for code that may sleep [i.e.,
> +may call schedule()], as NEON or VFP instructions will be executed in a
> +non-preemptible section for reasons outlined below.
> +
> +
> +Lazy preserve and restore
> +-------------------------
> +The NEON/VFP register file is managed using lazy preserve (on UP systems) and
> +lazy restore (on both SMP and UP systems). This means that the register file is
> +kept 'live', and is only preserved and restored when multiple tasks are
> +contending for the NEON/VFP unit (or, in the SMP case, when a task migrates to
> +another core). Lazy restore is implemented by disabling the NEON/VFP unit after
> +every context switch, resulting in a trap when subsequently a NEON/VFP
> +instruction is issued, allowing the kernel to step in and perform the restore if
> +necessary.
> +
> +Any use of the NEON/VFP unit in kernel mode should not interfere with this, so
> +it is required to do an 'eager' preserve of the NEON/VFP register file, and
> +enable the NEON/VFP unit explicitly so no exceptions are generated on first
> +subsequent use. This is handled by the function kernel_neon_begin(), which
> +should be called before any kernel mode NEON or VFP instructions are issued.
> +Likewise, the NEON/VFP unit should be disabled again after use to make sure user
> +mode will hit the lazy restore trap upon next use. This is handled by the
> +function kernel_neon_end().
> +
> +
> +Interruptions in kernel mode
> +----------------------------
> +For reasons of performance and simplicity, it was decided that there shall be no
> +preserve/restore mechanism for the kernel mode NEON/VFP register contents. This
> +implies that interruptions of a kernel mode NEON section can only be allowed if
> +they are guaranteed not to touch the NEON/VFP registers. For this reason, the
> +following rules and restrictions apply in the kernel:
> +* NEON/VFP code is not allowed in interrupt context;
> +* NEON/VFP code is not allowed to sleep;
> +* NEON/VFP code is executed with preemption disabled.
> +
> +If latency is a concern, it is possible to put back to back calls to
> +kernel_neon_end() and kernel_neon_begin() in places in your code where none of
> +the NEON registers are live. (Additional calls to kernel_neon_begin() should be
> +reasonably cheap if no context switch occurred in the meantime)

I don't understand. How latency concerns could be relieved with scattering
non-NEON code with calls to kernel_neon_end() and kernel_neon_begin()?

I expect such NEON code would be rather at leaves of the call trees,
so there should not be so many functions called with disabled preemption
from within a NEON critical section, right?

Definitively I don't know the complexity of code that could benefit
from NEON.

> +
> +
> +VFP and support code
> +--------------------
> +Earlier versions of VFP (prior to version 3) rely on software support for things
> +like IEEE-754 compliant underflow handling etc. When the VFP unit needs such
> +software assistance, it signals the kernel by raising an undefined instruction
> +exception. The kernel responds by inspecting the VFP control registers and the
> +current instruction and arguments, and emulates the instruction in software.
> +
> +Such software assistance is currently not implemented for VFP instructions
> +executed in kernel mode. If such a condition is encountered, the kernel will
> +fail and generate an OOPS.
> +
> +
> +Separating NEON code from ordinary code
> +---------------------------------------
> +The compiler is not aware of the special significance of kernel_neon_begin() and
> +kernel_neon_end(), i.e., that it is only allowed to issue NEON/VFP instructions
> +between calls to these respective functions. Furthermore, GCC may generate NEON
> +instructions of its own at -O3 level if -mfpu=neon is selected, and even if the
> +kernel is currently compiled at -O2, future changes may result in NEON/VFP
> +instructions appearing in unexpected places if no special care is taken.
> +
> +Therefore, the recommended and only supported way of using NEON/VFP in the
> +kernel is by adhering to the following rules:
> +* isolate the NEON code in a separate compilation unit and compile it with
> +  '-mfpu=neon -mfloat-abi=softfp';
> +* issue the calls to kernel_neon_begin(), kernel_neon_end() as well as the calls
> +  into the unit containing the NEON code from a compilation unit which is *not*
> +  built with the GCC flag '-mfpu=neon' set.
> +
> +As the kernel is compiled with '-msoft-float', the above will guarantee that
> +both NEON and VFP instructions will only ever appear in designated compilation
> +units at any optimization level.
> +
> +
> +NEON assembler
> +--------------
> +NEON assembler is supported with no additional caveats as long as the rules
> +above are followed.
> +
> +
> +NEON code generated by GCC
> +--------------------------
> +The GCC option -ftree-vectorize (implied by -O3) tries to exploit implicit
> +parallelism, and generates NEON code from ordinary C source code. This is fully
> +supported as long as the rules above are followed.

It's not clear to me the purpose of this paragraph.

That -O3 should be not used anywhere in the kernel except for those units
already built with -mfpu=neon -mfloat-abi=softfp?

> +
> +
> +NEON intrinsics
> +---------------
> +NEON intrinsics are also supported. However, as code using NEON intrinsics
> +relies on the GCC header 'arm_neon.h', the following tricks are necessary as
> +this header is not fully compatible with the kernel:
> +* Declare the interface between the NEON intrinsics code and its caller using
> +  only plain old C types (i.e., avoid using uintXX_t and uXX types; they seem
> +  unambiguous but they are not[1]);
> +* Compile the unit containing the NEON intrinsics with '-ffreestanding' so it
> +  does not choke on the missing 'stdint.h' #included by 'arm_neon.h' (this is a
> +  C99 header which the kernel does not supply) and don't include any ordinary
> +  kernel headers;
> +* Call the NEON code from a separate compilation unit that does the interfacing
> +  with the rest of the kernel, includes kernel headers, types etc.
> +
> +----
> +[1] Neither the bare metal version nor the glibc version of GCC agrees with the
> +    kernel on the definitions of int32_t, uint32_t and/or uintptr_t: this
> +    becomes a problem when you try to include both <linux/types.h> and
> +    <stdint.h> in the same compilation unit.
> -- 
> 1.8.1.2

ciao,
Domenico
Ard Biesheuvel Aug. 10, 2013, 2:09 p.m. UTC | #2
On 10 August 2013 15:34, Domenico Andreoli <cavokz@gmail.com> wrote:
> Hi Ard!
>

Ciao!

> On Fri, Aug 09, 2013 at 02:21:09PM +0200, Ard Biesheuvel wrote:

[...]

>> +If latency is a concern, it is possible to put back to back calls to
>> +kernel_neon_end() and kernel_neon_begin() in places in your code where none of
>> +the NEON registers are live. (Additional calls to kernel_neon_begin() should be
>> +reasonably cheap if no context switch occurred in the meantime)
>
> I don't understand. How latency concerns could be relieved with scattering
> non-NEON code with calls to kernel_neon_end() and kernel_neon_begin()?
>

The idea is that by doing something like

kernel_neon_begin();
... use the NEON for a very long time ...
kernel_neon_end();

can in some cases be changed to

kernel_neon_begin();
... use the NEON for not so long a time ...
kernel_neon_end();
kernel_neon_begin();
... use the NEON for not so long a time ...
kernel_neon_end();

Note that kernel_neon_end() re-enables preemption (in the
CONFIG_PREEMPT case) which triggers a context switch if a higher
priority task is pending.
The point I am trying to make is that
a) the second call to kernel_neon_begin() is not as costly as the
first one if in fact no context switch occurred,
b) you should only put this in places where clobbering the NEON
registers is allowable (i.e., no NEON registers are live)

> I expect such NEON code would be rather at leaves of the call trees,
> so there should not be so many functions called with disabled preemption
> from within a NEON critical section, right?
>

The point is that the NEON code itself runs with preemption disabled,
and may take some time to complete, so you can trade some speed for
lower latency by releasing and re-acquiring the NEON unit more often
(but only in places where you can tolerate losing your NEON register
contents).

> Definitively I don't know the complexity of code that could benefit
> from NEON.
>

On Cortex-A15, I saw:
- 60% speedup in xor_blocks
- 400% speedup in RAID6
- 50% speedup in AES (CTR and XTS-encrypt modes, and potentially CCMP)

so there is definitely a case for NEON in the kernel.

[...]


>> +NEON assembler
>> +--------------
>> +NEON assembler is supported with no additional caveats as long as the rules
>> +above are followed.
>> +
>> +
>> +NEON code generated by GCC
>> +--------------------------
>> +The GCC option -ftree-vectorize (implied by -O3) tries to exploit implicit
>> +parallelism, and generates NEON code from ordinary C source code. This is fully
>> +supported as long as the rules above are followed.
>
> It's not clear to me the purpose of this paragraph.
>

The purpose of the document is to explain how kernel mode NEON is
intended to be used. In order to cover as many potential use cases as
possible, this paragraph is included to explain that not only explicit
NEON like assembler or intrinsics can be used, but also implicit NEON
like the code GCC generates in some cases.

> That -O3 should be not used anywhere in the kernel except for those units
> already built with -mfpu=neon -mfloat-abi=softfp?
>

People may have valid reasons for preferring -O3 in some places. They
should just be aware that combining -O3 with -mfpu=neon can result in
NEON code turning up anywhere, not just in the places where intrinsics
or (inline) assembler were used.

[...]

Regards,
Ard.
Nicolas Pitre Aug. 13, 2013, 9:21 p.m. UTC | #3
On Fri, 9 Aug 2013, Ard Biesheuvel wrote:

> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

Reviewed-by: Nicolas Pitre <nico@linaro.org>

> ---
>  Documentation/arm/kernel_mode_neon.txt | 132 +++++++++++++++++++++++++++++++++
>  1 file changed, 132 insertions(+)
>  create mode 100644 Documentation/arm/kernel_mode_neon.txt
> 
> diff --git a/Documentation/arm/kernel_mode_neon.txt b/Documentation/arm/kernel_mode_neon.txt
> new file mode 100644
> index 0000000..4c2de85
> --- /dev/null
> +++ b/Documentation/arm/kernel_mode_neon.txt
> @@ -0,0 +1,132 @@
> +Kernel mode NEON
> +================
> +
> +TL;DR summary
> +-------------
> +* Use only NEON instructions, or VFP instructions that don't rely on support
> +  code
> +* Isolate your NEON code in a separate compilation unit, and compile it with
> +  '-mfpu=neon -mfloat-abi=softfp'
> +* Put kernel_neon_begin() and kernel_neon_end() calls around the calls into your
> +  NEON code
> +* Don't sleep in your NEON code, and be aware that it will be executed with
> +  preemption disabled
> +
> +
> +Introduction
> +------------
> +It is possible to use NEON instructions (and in some cases, VFP instructions) in
> +code that runs in kernel mode. However, for performance reasons, the NEON/VFP
> +register file is not preserved and restored at every context switch or taken
> +exception like the normal register file is, so some manual intervention is
> +required. Furthermore, special care is required for code that may sleep [i.e.,
> +may call schedule()], as NEON or VFP instructions will be executed in a
> +non-preemptible section for reasons outlined below.
> +
> +
> +Lazy preserve and restore
> +-------------------------
> +The NEON/VFP register file is managed using lazy preserve (on UP systems) and
> +lazy restore (on both SMP and UP systems). This means that the register file is
> +kept 'live', and is only preserved and restored when multiple tasks are
> +contending for the NEON/VFP unit (or, in the SMP case, when a task migrates to
> +another core). Lazy restore is implemented by disabling the NEON/VFP unit after
> +every context switch, resulting in a trap when subsequently a NEON/VFP
> +instruction is issued, allowing the kernel to step in and perform the restore if
> +necessary.
> +
> +Any use of the NEON/VFP unit in kernel mode should not interfere with this, so
> +it is required to do an 'eager' preserve of the NEON/VFP register file, and
> +enable the NEON/VFP unit explicitly so no exceptions are generated on first
> +subsequent use. This is handled by the function kernel_neon_begin(), which
> +should be called before any kernel mode NEON or VFP instructions are issued.
> +Likewise, the NEON/VFP unit should be disabled again after use to make sure user
> +mode will hit the lazy restore trap upon next use. This is handled by the
> +function kernel_neon_end().
> +
> +
> +Interruptions in kernel mode
> +----------------------------
> +For reasons of performance and simplicity, it was decided that there shall be no
> +preserve/restore mechanism for the kernel mode NEON/VFP register contents. This
> +implies that interruptions of a kernel mode NEON section can only be allowed if
> +they are guaranteed not to touch the NEON/VFP registers. For this reason, the
> +following rules and restrictions apply in the kernel:
> +* NEON/VFP code is not allowed in interrupt context;
> +* NEON/VFP code is not allowed to sleep;
> +* NEON/VFP code is executed with preemption disabled.
> +
> +If latency is a concern, it is possible to put back to back calls to
> +kernel_neon_end() and kernel_neon_begin() in places in your code where none of
> +the NEON registers are live. (Additional calls to kernel_neon_begin() should be
> +reasonably cheap if no context switch occurred in the meantime)
> +
> +
> +VFP and support code
> +--------------------
> +Earlier versions of VFP (prior to version 3) rely on software support for things
> +like IEEE-754 compliant underflow handling etc. When the VFP unit needs such
> +software assistance, it signals the kernel by raising an undefined instruction
> +exception. The kernel responds by inspecting the VFP control registers and the
> +current instruction and arguments, and emulates the instruction in software.
> +
> +Such software assistance is currently not implemented for VFP instructions
> +executed in kernel mode. If such a condition is encountered, the kernel will
> +fail and generate an OOPS.
> +
> +
> +Separating NEON code from ordinary code
> +---------------------------------------
> +The compiler is not aware of the special significance of kernel_neon_begin() and
> +kernel_neon_end(), i.e., that it is only allowed to issue NEON/VFP instructions
> +between calls to these respective functions. Furthermore, GCC may generate NEON
> +instructions of its own at -O3 level if -mfpu=neon is selected, and even if the
> +kernel is currently compiled at -O2, future changes may result in NEON/VFP
> +instructions appearing in unexpected places if no special care is taken.
> +
> +Therefore, the recommended and only supported way of using NEON/VFP in the
> +kernel is by adhering to the following rules:
> +* isolate the NEON code in a separate compilation unit and compile it with
> +  '-mfpu=neon -mfloat-abi=softfp';
> +* issue the calls to kernel_neon_begin(), kernel_neon_end() as well as the calls
> +  into the unit containing the NEON code from a compilation unit which is *not*
> +  built with the GCC flag '-mfpu=neon' set.
> +
> +As the kernel is compiled with '-msoft-float', the above will guarantee that
> +both NEON and VFP instructions will only ever appear in designated compilation
> +units at any optimization level.
> +
> +
> +NEON assembler
> +--------------
> +NEON assembler is supported with no additional caveats as long as the rules
> +above are followed.
> +
> +
> +NEON code generated by GCC
> +--------------------------
> +The GCC option -ftree-vectorize (implied by -O3) tries to exploit implicit
> +parallelism, and generates NEON code from ordinary C source code. This is fully
> +supported as long as the rules above are followed.
> +
> +
> +NEON intrinsics
> +---------------
> +NEON intrinsics are also supported. However, as code using NEON intrinsics
> +relies on the GCC header 'arm_neon.h', the following tricks are necessary as
> +this header is not fully compatible with the kernel:
> +* Declare the interface between the NEON intrinsics code and its caller using
> +  only plain old C types (i.e., avoid using uintXX_t and uXX types; they seem
> +  unambiguous but they are not[1]);
> +* Compile the unit containing the NEON intrinsics with '-ffreestanding' so it
> +  does not choke on the missing 'stdint.h' #included by 'arm_neon.h' (this is a
> +  C99 header which the kernel does not supply) and don't include any ordinary
> +  kernel headers;
> +* Call the NEON code from a separate compilation unit that does the interfacing
> +  with the rest of the kernel, includes kernel headers, types etc.
> +
> +----
> +[1] Neither the bare metal version nor the glibc version of GCC agrees with the
> +    kernel on the definitions of int32_t, uint32_t and/or uintptr_t: this
> +    becomes a problem when you try to include both <linux/types.h> and
> +    <stdint.h> in the same compilation unit.
> -- 
> 1.8.1.2
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
diff mbox

Patch

diff --git a/Documentation/arm/kernel_mode_neon.txt b/Documentation/arm/kernel_mode_neon.txt
new file mode 100644
index 0000000..4c2de85
--- /dev/null
+++ b/Documentation/arm/kernel_mode_neon.txt
@@ -0,0 +1,132 @@ 
+Kernel mode NEON
+================
+
+TL;DR summary
+-------------
+* Use only NEON instructions, or VFP instructions that don't rely on support
+  code
+* Isolate your NEON code in a separate compilation unit, and compile it with
+  '-mfpu=neon -mfloat-abi=softfp'
+* Put kernel_neon_begin() and kernel_neon_end() calls around the calls into your
+  NEON code
+* Don't sleep in your NEON code, and be aware that it will be executed with
+  preemption disabled
+
+
+Introduction
+------------
+It is possible to use NEON instructions (and in some cases, VFP instructions) in
+code that runs in kernel mode. However, for performance reasons, the NEON/VFP
+register file is not preserved and restored at every context switch or taken
+exception like the normal register file is, so some manual intervention is
+required. Furthermore, special care is required for code that may sleep [i.e.,
+may call schedule()], as NEON or VFP instructions will be executed in a
+non-preemptible section for reasons outlined below.
+
+
+Lazy preserve and restore
+-------------------------
+The NEON/VFP register file is managed using lazy preserve (on UP systems) and
+lazy restore (on both SMP and UP systems). This means that the register file is
+kept 'live', and is only preserved and restored when multiple tasks are
+contending for the NEON/VFP unit (or, in the SMP case, when a task migrates to
+another core). Lazy restore is implemented by disabling the NEON/VFP unit after
+every context switch, resulting in a trap when subsequently a NEON/VFP
+instruction is issued, allowing the kernel to step in and perform the restore if
+necessary.
+
+Any use of the NEON/VFP unit in kernel mode should not interfere with this, so
+it is required to do an 'eager' preserve of the NEON/VFP register file, and
+enable the NEON/VFP unit explicitly so no exceptions are generated on first
+subsequent use. This is handled by the function kernel_neon_begin(), which
+should be called before any kernel mode NEON or VFP instructions are issued.
+Likewise, the NEON/VFP unit should be disabled again after use to make sure user
+mode will hit the lazy restore trap upon next use. This is handled by the
+function kernel_neon_end().
+
+
+Interruptions in kernel mode
+----------------------------
+For reasons of performance and simplicity, it was decided that there shall be no
+preserve/restore mechanism for the kernel mode NEON/VFP register contents. This
+implies that interruptions of a kernel mode NEON section can only be allowed if
+they are guaranteed not to touch the NEON/VFP registers. For this reason, the
+following rules and restrictions apply in the kernel:
+* NEON/VFP code is not allowed in interrupt context;
+* NEON/VFP code is not allowed to sleep;
+* NEON/VFP code is executed with preemption disabled.
+
+If latency is a concern, it is possible to put back to back calls to
+kernel_neon_end() and kernel_neon_begin() in places in your code where none of
+the NEON registers are live. (Additional calls to kernel_neon_begin() should be
+reasonably cheap if no context switch occurred in the meantime)
+
+
+VFP and support code
+--------------------
+Earlier versions of VFP (prior to version 3) rely on software support for things
+like IEEE-754 compliant underflow handling etc. When the VFP unit needs such
+software assistance, it signals the kernel by raising an undefined instruction
+exception. The kernel responds by inspecting the VFP control registers and the
+current instruction and arguments, and emulates the instruction in software.
+
+Such software assistance is currently not implemented for VFP instructions
+executed in kernel mode. If such a condition is encountered, the kernel will
+fail and generate an OOPS.
+
+
+Separating NEON code from ordinary code
+---------------------------------------
+The compiler is not aware of the special significance of kernel_neon_begin() and
+kernel_neon_end(), i.e., that it is only allowed to issue NEON/VFP instructions
+between calls to these respective functions. Furthermore, GCC may generate NEON
+instructions of its own at -O3 level if -mfpu=neon is selected, and even if the
+kernel is currently compiled at -O2, future changes may result in NEON/VFP
+instructions appearing in unexpected places if no special care is taken.
+
+Therefore, the recommended and only supported way of using NEON/VFP in the
+kernel is by adhering to the following rules:
+* isolate the NEON code in a separate compilation unit and compile it with
+  '-mfpu=neon -mfloat-abi=softfp';
+* issue the calls to kernel_neon_begin(), kernel_neon_end() as well as the calls
+  into the unit containing the NEON code from a compilation unit which is *not*
+  built with the GCC flag '-mfpu=neon' set.
+
+As the kernel is compiled with '-msoft-float', the above will guarantee that
+both NEON and VFP instructions will only ever appear in designated compilation
+units at any optimization level.
+
+
+NEON assembler
+--------------
+NEON assembler is supported with no additional caveats as long as the rules
+above are followed.
+
+
+NEON code generated by GCC
+--------------------------
+The GCC option -ftree-vectorize (implied by -O3) tries to exploit implicit
+parallelism, and generates NEON code from ordinary C source code. This is fully
+supported as long as the rules above are followed.
+
+
+NEON intrinsics
+---------------
+NEON intrinsics are also supported. However, as code using NEON intrinsics
+relies on the GCC header 'arm_neon.h', the following tricks are necessary as
+this header is not fully compatible with the kernel:
+* Declare the interface between the NEON intrinsics code and its caller using
+  only plain old C types (i.e., avoid using uintXX_t and uXX types; they seem
+  unambiguous but they are not[1]);
+* Compile the unit containing the NEON intrinsics with '-ffreestanding' so it
+  does not choke on the missing 'stdint.h' #included by 'arm_neon.h' (this is a
+  C99 header which the kernel does not supply) and don't include any ordinary
+  kernel headers;
+* Call the NEON code from a separate compilation unit that does the interfacing
+  with the rest of the kernel, includes kernel headers, types etc.
+
+----
+[1] Neither the bare metal version nor the glibc version of GCC agrees with the
+    kernel on the definitions of int32_t, uint32_t and/or uintptr_t: this
+    becomes a problem when you try to include both <linux/types.h> and
+    <stdint.h> in the same compilation unit.