[RFC,v9,01/27] Documentation/x86: Add CET description

Message ID: 20200205181935.3712-2-yu-cheng.yu@intel.com
Series: Control-flow Enforcement: Shadow Stack

Commit Message

Yu-cheng Yu Feb. 5, 2020, 6:19 p.m. UTC
Explain no_cet_shstk/no_cet_ibt kernel parameters, and introduce a new
document on Control-flow Enforcement Technology (CET).

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 .../admin-guide/kernel-parameters.txt         |   6 +
 Documentation/x86/index.rst                   |   1 +
 Documentation/x86/intel_cet.rst               | 294 ++++++++++++++++++
 3 files changed, 301 insertions(+)
 create mode 100644 Documentation/x86/intel_cet.rst

Comments

Randy Dunlap Feb. 6, 2020, 12:16 a.m. UTC | #1
Hi,

I have a few comments and a question (please see inline below).


On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
> Explain no_cet_shstk/no_cet_ibt kernel parameters, and introduce a new
> document on Control-flow Enforcement Technology (CET).
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |   6 +
>  Documentation/x86/index.rst                   |   1 +
>  Documentation/x86/intel_cet.rst               | 294 ++++++++++++++++++
>  3 files changed, 301 insertions(+)
>  create mode 100644 Documentation/x86/intel_cet.rst
> 

> diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
> new file mode 100644
> index 000000000000..71e2462fea5c
> --- /dev/null
> +++ b/Documentation/x86/intel_cet.rst
> @@ -0,0 +1,294 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=========================================
> +Control-flow Enforcement Technology (CET)
> +=========================================
> +

...

> +
> +[5] CET system calls
> +====================
> +
> +The following arch_prctl() system calls are added for CET:
> +
> +arch_prctl(ARCH_X86_CET_STATUS, unsigned long *addr)
> +    Return CET feature status.
> +
> +    The parameter 'addr' is a pointer to a user buffer.
> +    On returning to the caller, the kernel fills the following
> +    information::
> +
> +        *addr       = SHSTK/IBT status
> +        *(addr + 1) = SHSTK base address
> +        *(addr + 2) = SHSTK size
> +
> +arch_prctl(ARCH_X86_CET_DISABLE, unsigned long features)
> +    Disable SHSTK and/or IBT specified in 'features'.  Return -EPERM
> +    if CET is locked.
> +
> +arch_prctl(ARCH_X86_CET_LOCK)
> +    Lock in CET feature.

which feature?

> +
> +arch_prctl(ARCH_X86_CET_ALLOC_SHSTK, unsigned long *addr)
> +    Allocate a new SHSTK and put a restore token at top.
> +
> +    The parameter 'addr' is a pointer to a user buffer and indicates
> +    the desired SHSTK size to allocate.  On returning to the caller,
> +    the kernel fills '*addr' with the base address of the new SHSTK.
> +
> +arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE, unsigned long *addr)
> +    Mark an address range as IBT legacy code.
> +
> +    The parameter 'addr' is a pointer to a user buffer that has the
> +    following information::
> +
> +        *addr       = starting linear address of the legacy code
> +        *(addr + 1) = size of the legacy code
> +        *(addr + 2) = set (1); clear (0)
> +
> +Note:
> +  There is no CET-enabling arch_prctl function.  By design, CET is
> +  enabled automatically if the binary and the system can support it.
> +
> +  The parameters passed are always unsigned 64-bit.  When an IA32
> +  application passing pointers, it should only use the lower 32 bits.
> +
> +[6] The implementation of the SHSTK
> +===================================
> +
> +SHSTK size
> +----------
> +
> +A task's SHSTK is allocated from memory to a fixed size of
> +RLIMIT_STACK.  A compat-mode thread's SHSTK size is 1/4 of
> +RLIMIT_STACK.  The smaller 32-bit thread SHSTK allows more threads to
> +share a 32-bit address space.
> +
> +Signal
> +------
> +
> +The main program and its signal handlers use the same SHSTK.  Because
> +the SHSTK stores only return addresses, a large SHSTK will cover the
> +condition that both the program stack and the sigaltstack run out.
> +
> +The kernel creates a restore token at the SHSTK restoring address and
> +verifies that token when restoring from the signal handler.
> +
> +IBT for signal delivering and sigreturn is the same as the main
> +program's setup; except for WAIT_ENDBR status, which can be read from

s/;/,/

> +MSR_IA32_U_CET.  In general, a task is in WAIT_ENDBR after an
> +indirect CALL/JMP and before the next instruction starts.
> +
> +A task's WAIT_ENDBR is reset for its signal handler, but preserved on
> +the task's stack; and then restored from sigreturn.

s/;/,/

> +
> +Fork
> +----
> +
> +The SHSTK's vma has VM_SHSTK flag set; its PTEs are required to be
> +read-only and dirty.  When a SHSTK PTE is not present, RO, and dirty,
> +a SHSTK access triggers a page fault with an additional SHSTK bit set
> +in the page fault error code.
> +
> +When a task forks a child, its SHSTK PTEs are copied and both the
> +parent's and the child's SHSTK PTEs are cleared of the dirty bit.
> +Upon the next SHSTK access, the resulting SHSTK page fault is handled
> +by page copy/re-use.
> +
> +When a pthread child is created, the kernel allocates a new SHSTK for
> +the new thread.
> +
> +Setjmp/Longjmp
> +--------------
> +
> +Longjmp unwinds SHSTK until it matches the program stack.
> +
> +Ucontext
> +--------
> +
> +In GLIBC, getcontext/setcontext is implemented in similar way as
> +setjmp/longjmp.
> +
> +When makecontext creates a new ucontext, a new SHSTK is allocated for
> +that context with ARCH_X86_CET_ALLOC_SHSTK syscall.  The kernel
> +creates a restore token at the top of the new SHSTK and the user-mode
> +code switches to the new SHSTK with the RSTORSSP instruction.
> +
> +[7] The management of read-only & dirty PTEs for SHSTK
> +======================================================
> +
> +A RO and dirty PTE exists in the following cases:
> +
> +(a) A page is modified and then shared with a fork()'ed child;
> +(b) A R/O page that has been COW'ed;
> +(c) A SHSTK page.
> +
> +The processor only checks the dirty bit for (c).  To prevent the use
> +of non-SHSTK memory as SHSTK, we use a spare bit of the 64-bit PTE as
> +DIRTY_SW for (a) and (b) above.  This results to the following PTE
> +settings::
> +
> +    Modified PTE:             (R/W + DIRTY_HW)
> +    Modified and shared PTE:  (R/O + DIRTY_SW)
> +    R/O PTE, COW'ed:          (R/O + DIRTY_SW)
> +    SHSTK PTE:                (R/O + DIRTY_HW)
> +    SHSTK PTE, COW'ed:        (R/O + DIRTY_HW)
> +    SHSTK PTE, shared:        (R/O + DIRTY_SW)
> +
> +Note that DIRTY_SW is only used in R/O PTEs but not R/W PTEs.
> +
> +[8] The implementation of IBT legacy bitmap
> +===========================================
> +
> +When IBT is active, a non-IBT-capable legacy library can be executed
> +if its address ranges are specified in the legacy code bitmap.  The
> +bitmap covers the whole user-space address, which is TASK_SIZE_MAX
> +for 64-bit and TASK_SIZE for IA32, and its each bit indicates a 4-KB

confusing:
                                          its each bit

> +legacy code page.  It is read-only from an application, and setup by
> +the kernel as a special mapping when the first time the application

                           drop:   when

> +calls arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE).  The application
> +manages the bitmap through the arch_prctl.

                      through the arch_prctl() interface.
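
To make the "each bit indicates a 4-KB legacy code page" wording concrete, here is a minimal C sketch of the bitmap arithmetic.  It only assumes what the quoted text states (one bit per 4-KB page); the helper names, the LSB-first bit order within a byte, and the 47-bit address-space size in the note below are illustrative assumptions, not taken from the patch.

```c
#include <stdint.h>

#define LEGACY_PAGE_SHIFT 12	/* 4-KB pages, as stated in the quoted text */

/* Bytes of bitmap needed to cover 'task_size' bytes of user address space. */
static uint64_t legacy_bitmap_bytes(uint64_t task_size)
{
	uint64_t page = 1ULL << LEGACY_PAGE_SHIFT;
	uint64_t pages = (task_size + page - 1) >> LEGACY_PAGE_SHIFT;

	return (pages + 7) / 8;
}

/* Byte offset within the bitmap that holds the bit for 'addr'. */
static uint64_t legacy_bit_byte(uint64_t addr)
{
	return (addr >> LEGACY_PAGE_SHIFT) / 8;
}

/* Mask of the bit for 'addr' within that byte (LSB-first is an assumption). */
static uint8_t legacy_bit_mask(uint64_t addr)
{
	return (uint8_t)(1u << ((addr >> LEGACY_PAGE_SHIFT) % 8));
}
```

For a hypothetical 47-bit user address space, legacy_bitmap_bytes(1ULL << 47) works out to 4 GiB of bitmap, and a full 4-GiB IA32 address space needs 128 KiB.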


cheers.
Yu-cheng Yu Feb. 6, 2020, 8:17 p.m. UTC | #2
On Wed, 2020-02-05 at 16:16 -0800, Randy Dunlap wrote:
> Hi,
> 
> I have a few comments and a question (please see inline below).
> 
> 
> On 2/5/20 10:19 AM, Yu-cheng Yu wrote:
[...]
> > +arch_prctl(ARCH_X86_CET_LOCK)
> > +    Lock in CET feature.
> 
> which feature?

Both SHSTK and IBT are locked.  They cannot be turned off afterwards.

I will check things you pointed out.

Thanks,
Yu-cheng
Kees Cook Feb. 25, 2020, 8:02 p.m. UTC | #3
On Wed, Feb 05, 2020 at 10:19:09AM -0800, Yu-cheng Yu wrote:
> Explain no_cet_shstk/no_cet_ibt kernel parameters, and introduce a new
> document on Control-flow Enforcement Technology (CET).
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

I'm not a huge fan of the boot param names, but I can't suggest anything
better. ;) I love the extensive docs!

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  .../admin-guide/kernel-parameters.txt         |   6 +
>  Documentation/x86/index.rst                   |   1 +
>  Documentation/x86/intel_cet.rst               | 294 ++++++++++++++++++
>  3 files changed, 301 insertions(+)
>  create mode 100644 Documentation/x86/intel_cet.rst
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index ade4e6ec23e0..8b69ebf0baed 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3001,6 +3001,12 @@
>  			noexec=on: enable non-executable mappings (default)
>  			noexec=off: disable non-executable mappings
>  
> +	no_cet_shstk	[X86-64] Disable Shadow Stack for user-mode
> +			applications
> +
> +	no_cet_ibt	[X86-64] Disable Indirect Branch Tracking for user-mode
> +			applications
> +
>  	nosmap		[X86,PPC]
>  			Disable SMAP (Supervisor Mode Access Prevention)
>  			even if it is supported by processor.
> diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
> index a8de2fbc1caa..81f919801765 100644
> --- a/Documentation/x86/index.rst
> +++ b/Documentation/x86/index.rst
> @@ -19,6 +19,7 @@ x86-specific Documentation
>     tlb
>     mtrr
>     pat
> +   intel_cet
>     intel_mpx
>     intel-iommu
>     intel_txt
> diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
> new file mode 100644
> index 000000000000..71e2462fea5c
> --- /dev/null
> +++ b/Documentation/x86/intel_cet.rst
> @@ -0,0 +1,294 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=========================================
> +Control-flow Enforcement Technology (CET)
> +=========================================
> +
> +[1] Overview
> +============
> +
> +Control-flow Enforcement Technology (CET) provides protection against
> +return/jump-oriented programming (ROP) attacks.  It can be setup to
> +protect both applications and the kernel.  In the first phase, only
> +user-mode protection is implemented in the 64-bit kernel; 32-bit
> +applications are supported in compatibility mode.
> +
> +CET introduces Shadow Stack (SHSTK) and Indirect Branch Tracking
> +(IBT).  SHSTK is a secondary stack allocated from memory and cannot
> +be directly modified by applications.  When executing a CALL, the
> +processor pushes a copy of the return address to SHSTK.  Upon
> +function return, the processor pops the SHSTK copy and compares it
> +to the one from the program stack.  If the two copies differ, the
> +processor raises a control-protection fault.  IBT verifies indirect
> +CALL/JMP targets are intended as marked by the compiler with 'ENDBR'
> +opcodes (see CET instructions below).
> +
> +There are two kernel configuration options:
> +
> +    X86_INTEL_SHADOW_STACK_USER, and
> +    X86_INTEL_BRANCH_TRACKING_USER.
> +
> +To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or later
> +are required.  To build a CET-enabled application, GLIBC v2.28 or
> +later is also required.
> +
> +There are two command-line options for disabling CET features::
> +
> +    no_cet_shstk - disables SHSTK, and
> +    no_cet_ibt   - disables IBT.
> +
> +At run time, /proc/cpuinfo shows the availability of SHSTK and IBT.
> +
> +[2] CET assembly instructions
> +=============================
> +
> +RDSSP %r
> +    Read the SHSTK pointer into %r.
> +
> +INCSSP %r
> +    Unwind (increment) the SHSTK pointer (0 ~ 255) steps as indicated
> +    in the operand register.  The GLIBC longjmp uses INCSSP to unwind
> +    the SHSTK until that matches the program stack.  When it is
> +    necessary to unwind beyond 255 steps, longjmp divides and repeats
> +    the process.
> +
> +RSTORSSP (%r)
> +    Switch to the SHSTK indicated in the 'restore token' pointed by
> +    the operand register and replace the 'restore token' with a new
> +    token to be saved (with SAVEPREVSSP) for the outgoing SHSTK.
> +
> +::
> +
> +                                Before RSTORSSP
> +
> +               Incoming SHSTK                 Current/Outgoing SHSTK
> +
> +          |----------------------|           |----------------------|
> +   addr=x |                      |     ssp-> |                      |
> +          |----------------------|           |----------------------|
> +   (%r)-> | rstor_token=(x|Lg)   |  addr=y-8 |                      |
> +          |----------------------|           |----------------------|
> +
> +                                After RSTORSSP
> +
> +          |----------------------|           |----------------------|
> +   addr=x |                      |           |                      |
> +          |----------------------|           |----------------------|
> +    ssp-> | rstor_token=(y|Pv|Lg)|  addr=y-8 |                      |
> +          |----------------------|           |----------------------|
> +
> +    note:
> +        1. Only valid addresses and restore tokens can be on the
> +           user-mode SHSTK.
> +        2. A token is always of type u64 and must align to u64.
> +        3. The incoming SHSTK pointer in a rstor_token must point to
> +           immediately above the token.
> +        4. 'Lg' is bit[0] of a rstor_token indicating a 64-bit SHSTK.
> +        5. 'Pv' is bit[1] of a rstor_token indicating the token is to
> +           be used only for the next SAVEPREVSSP and invalid for
> +           RSTORSSP.
> +
> +SAVEPREVSSP
> +    Pop the SHSTK 'restore token' pointed by current SHSTK pointer
> +    and store it at (previous SHSTK pointer - 8).
> +
> +::
> +
> +                               After SAVEPREVSSP
> +
> +          |----------------------|           |----------------------|
> +    ssp-> |                      |           |                      |
> +          |----------------------|           |----------------------|
> + addr=x-8 | rstor_token=(y|Pv|Lg)|  addr=y-8 | rstor_token(y|Lg)    |
> +          |----------------------|           |----------------------|
> +
> +WRUSS %r0, (%r1)
> +    Write the value in %r0 to the SHSTK address pointed by (%r1).
> +    This is a kernel-mode only instruction.
> +
> +ENDBR and NOTRACK prefix
> +    When IBT is enabled, an indirect CALL/JMP must either::
> +
> +        have a NOTRACK prefix,
> +        reach an ENDBR, or
> +        reach an address within a legacy code page;
> +
> +    or it results in a control-protection fault.
> +
> +    When the target address is derived from information that cannot
> +    be modified, the compiler uses the NOTRACK prefix.  In other
> +    cases, the compiler inserts an ENDBR at the target address.
> +
> +    A legacy code page is designated in the legacy code bitmap, which
> +    is explained below in section [8].
> +
> +[3] Application Enabling
> +========================
> +
> +An application's CET capability is marked in its ELF header and can
> +be verified from the following command output, in the
> +NT_GNU_PROPERTY_TYPE_0 field:
> +
> +    readelf -n <application>
> +
> +If an application supports CET and is statically linked, it will run
> +with CET protection.  If the application needs any shared libraries,
> +the loader checks all dependencies and enables CET only when all
> +requirements are met.
> +
> +[4] Legacy Libraries
> +====================
> +
> +GLIBC provides a few tunables for backward compatibility.
> +
> +GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT
> +    Turn off SHSTK/IBT for the current shell.
> +
> +GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
> +    This controls how dlopen() handles SHSTK legacy libraries::
> +
> +        on         - continue with SHSTK enabled;
> +        permissive - continue with SHSTK off.
> +
> +[5] CET system calls
> +====================
> +
> +The following arch_prctl() system calls are added for CET:
> +
> +arch_prctl(ARCH_X86_CET_STATUS, unsigned long *addr)
> +    Return CET feature status.
> +
> +    The parameter 'addr' is a pointer to a user buffer.
> +    On returning to the caller, the kernel fills the following
> +    information::
> +
> +        *addr       = SHSTK/IBT status
> +        *(addr + 1) = SHSTK base address
> +        *(addr + 2) = SHSTK size
> +
> +arch_prctl(ARCH_X86_CET_DISABLE, unsigned long features)
> +    Disable SHSTK and/or IBT specified in 'features'.  Return -EPERM
> +    if CET is locked.
> +
> +arch_prctl(ARCH_X86_CET_LOCK)
> +    Lock in CET feature.
> +
> +arch_prctl(ARCH_X86_CET_ALLOC_SHSTK, unsigned long *addr)
> +    Allocate a new SHSTK and put a restore token at top.
> +
> +    The parameter 'addr' is a pointer to a user buffer and indicates
> +    the desired SHSTK size to allocate.  On returning to the caller,
> +    the kernel fills '*addr' with the base address of the new SHSTK.
> +
> +arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE, unsigned long *addr)
> +    Mark an address range as IBT legacy code.
> +
> +    The parameter 'addr' is a pointer to a user buffer that has the
> +    following information::
> +
> +        *addr       = starting linear address of the legacy code
> +        *(addr + 1) = size of the legacy code
> +        *(addr + 2) = set (1); clear (0)
> +
> +Note:
> +  There is no CET-enabling arch_prctl function.  By design, CET is
> +  enabled automatically if the binary and the system can support it.
> +
> +  The parameters passed are always unsigned 64-bit.  When an IA32
> +  application passing pointers, it should only use the lower 32 bits.
> +
> +[6] The implementation of the SHSTK
> +===================================
> +
> +SHSTK size
> +----------
> +
> +A task's SHSTK is allocated from memory to a fixed size of
> +RLIMIT_STACK.  A compat-mode thread's SHSTK size is 1/4 of
> +RLIMIT_STACK.  The smaller 32-bit thread SHSTK allows more threads to
> +share a 32-bit address space.
> +
> +Signal
> +------
> +
> +The main program and its signal handlers use the same SHSTK.  Because
> +the SHSTK stores only return addresses, a large SHSTK will cover the
> +condition that both the program stack and the sigaltstack run out.
> +
> +The kernel creates a restore token at the SHSTK restoring address and
> +verifies that token when restoring from the signal handler.
> +
> +IBT for signal delivering and sigreturn is the same as the main
> +program's setup; except for WAIT_ENDBR status, which can be read from
> +MSR_IA32_U_CET.  In general, a task is in WAIT_ENDBR after an
> +indirect CALL/JMP and before the next instruction starts.
> +
> +A task's WAIT_ENDBR is reset for its signal handler, but preserved on
> +the task's stack; and then restored from sigreturn.
> +
> +Fork
> +----
> +
> +The SHSTK's vma has VM_SHSTK flag set; its PTEs are required to be
> +read-only and dirty.  When a SHSTK PTE is not present, RO, and dirty,
> +a SHSTK access triggers a page fault with an additional SHSTK bit set
> +in the page fault error code.
> +
> +When a task forks a child, its SHSTK PTEs are copied and both the
> +parent's and the child's SHSTK PTEs are cleared of the dirty bit.
> +Upon the next SHSTK access, the resulting SHSTK page fault is handled
> +by page copy/re-use.
> +
> +When a pthread child is created, the kernel allocates a new SHSTK for
> +the new thread.
> +
> +Setjmp/Longjmp
> +--------------
> +
> +Longjmp unwinds SHSTK until it matches the program stack.
> +
> +Ucontext
> +--------
> +
> +In GLIBC, getcontext/setcontext is implemented in similar way as
> +setjmp/longjmp.
> +
> +When makecontext creates a new ucontext, a new SHSTK is allocated for
> +that context with ARCH_X86_CET_ALLOC_SHSTK syscall.  The kernel
> +creates a restore token at the top of the new SHSTK and the user-mode
> +code switches to the new SHSTK with the RSTORSSP instruction.
> +
> +[7] The management of read-only & dirty PTEs for SHSTK
> +======================================================
> +
> +A RO and dirty PTE exists in the following cases:
> +
> +(a) A page is modified and then shared with a fork()'ed child;
> +(b) A R/O page that has been COW'ed;
> +(c) A SHSTK page.
> +
> +The processor only checks the dirty bit for (c).  To prevent the use
> +of non-SHSTK memory as SHSTK, we use a spare bit of the 64-bit PTE as
> +DIRTY_SW for (a) and (b) above.  This results to the following PTE
> +settings::
> +
> +    Modified PTE:             (R/W + DIRTY_HW)
> +    Modified and shared PTE:  (R/O + DIRTY_SW)
> +    R/O PTE, COW'ed:          (R/O + DIRTY_SW)
> +    SHSTK PTE:                (R/O + DIRTY_HW)
> +    SHSTK PTE, COW'ed:        (R/O + DIRTY_HW)
> +    SHSTK PTE, shared:        (R/O + DIRTY_SW)
> +
> +Note that DIRTY_SW is only used in R/O PTEs but not R/W PTEs.
> +
> +[8] The implementation of IBT legacy bitmap
> +===========================================
> +
> +When IBT is active, a non-IBT-capable legacy library can be executed
> +if its address ranges are specified in the legacy code bitmap.  The
> +bitmap covers the whole user-space address, which is TASK_SIZE_MAX
> +for 64-bit and TASK_SIZE for IA32, and its each bit indicates a 4-KB
> +legacy code page.  It is read-only from an application, and setup by
> +the kernel as a special mapping when the first time the application
> +calls arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE).  The application
> +manages the bitmap through the arch_prctl.
> -- 
> 2.21.0
>
Dave Hansen Feb. 26, 2020, 5:57 p.m. UTC | #4
> index ade4e6ec23e0..8b69ebf0baed 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3001,6 +3001,12 @@
>  			noexec=on: enable non-executable mappings (default)
>  			noexec=off: disable non-executable mappings
>  
> +	no_cet_shstk	[X86-64] Disable Shadow Stack for user-mode
> +			applications

If we ever add kernel support, "no_cet_shstk" will mean "no cet shstk
for userspace"?

> +	no_cet_ibt	[X86-64] Disable Indirect Branch Tracking for user-mode
> +			applications
> +
>  	nosmap		[X86,PPC]
>  			Disable SMAP (Supervisor Mode Access Prevention)
>  			even if it is supported by processor.

BTW, this documentation is misplaced.  It needs to go to the spot where
you introduce the code for these options.

> diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
> index a8de2fbc1caa..81f919801765 100644
> --- a/Documentation/x86/index.rst
> +++ b/Documentation/x86/index.rst
> @@ -19,6 +19,7 @@ x86-specific Documentation
>     tlb
>     mtrr
>     pat
> +   intel_cet
>     intel_mpx
>     intel-iommu
>     intel_txt
> diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
> new file mode 100644
> index 000000000000..71e2462fea5c
> --- /dev/null
> +++ b/Documentation/x86/intel_cet.rst
> @@ -0,0 +1,294 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=========================================
> +Control-flow Enforcement Technology (CET)
> +=========================================
> +
> +[1] Overview
> +============
> +
> +Control-flow Enforcement Technology (CET) provides protection against
> +return/jump-oriented programming (ROP) attacks.  It can be setup to

							      ^ set up

> +protect both applications and the kernel.  In the first phase, only
> +user-mode protection is implemented in the 64-bit kernel; 32-bit
> +applications are supported in compatibility mode.

Please just say what *is* at the time of the writing.  We don't need to
talk about "phases".

Also, you haven't mentioned that this is a *hardware* feature and that
it's only on Intel CPUs at the moment.  That's kinda essential.  If I've
got an AMD CPU, I can just stop reading. :)

The hardware supports shadow stacks for both userspace and the kernel in
both 32-bit and 64-bit modes.  32-bit kernel support is not implemented.
Both 32-bit and 64-bit user applications can run on 64-bit kernels.

This is also missing the same key points about enabling as the Kconfig
text: apps don't get this for free and must be specifically enabled.

> +CET introduces Shadow Stack (SHSTK) and Indirect Branch Tracking
> +(IBT).  SHSTK is a secondary stack allocated from memory and cannot
> +be directly modified by applications.  When executing a CALL, the
> +processor pushes a copy of the return address to SHSTK. 

... and to the normal stack

>  Upon
> +function return, the processor pops the SHSTK copy and compares it
> +to the one from the program stack.  If the two copies differ, the
> +processor raises a control-protection fault.  IBT verifies indirect
> +CALL/JMP targets are intended as marked by the compiler with 'ENDBR'
> +opcodes (see CET instructions below).
> +
> +There are two kernel configuration options:
> +
> +    X86_INTEL_SHADOW_STACK_USER, and
> +    X86_INTEL_BRANCH_TRACKING_USER.
> +
> +To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or later
> +are required.

Why are these needed to build a CET-enabled kernel?

>  To build a CET-enabled application, GLIBC v2.28 or
> +later is also required.
> +
> +There are two command-line options for disabling CET features::
> +
> +    no_cet_shstk - disables SHSTK, and
> +    no_cet_ibt   - disables IBT.
> +
> +At run time, /proc/cpuinfo shows the availability of SHSTK and IBT.

Availability of what?

If I set X86_INTEL_SHADOW_STACK_USER=n, I'll still see the cpuinfo
flags, but I won't have runtime support.

Probably best to say that cpuinfo tells you about processor support only.
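
As a sketch of what such a check looks like from userspace, the helper below tests for a whole-word flag in a /proc/cpuinfo "flags" line.  The flag names "shstk" and "ibt" used in the note are assumptions about how the feature bits are spelled, and a match only says the processor advertises the feature, not that the kernel was built with the Kconfig options.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if 'flag' appears as a whole whitespace-separated token
 * in 'line' (for example, one "flags : ..." line from /proc/cpuinfo).
 */
static bool has_cpu_flag(const char *line, const char *flag)
{
	size_t n = strlen(flag);
	const char *p = line;

	if (n == 0)
		return false;

	while ((p = strstr(p, flag)) != NULL) {
		bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
		bool ends = p[n] == '\0' || p[n] == ' ' ||
			    p[n] == '\t' || p[n] == '\n';

		if (starts && ends)
			return true;
		p++;	/* partial match (e.g. a longer flag name); keep looking */
	}
	return false;
}
```

A caller would read /proc/cpuinfo line by line and pass each "flags" line in, e.g. has_cpu_flag(line, "shstk").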

> +[2] CET assembly instructions
> +=============================

Why do we need this in the kernel?  What is specific to Linux or the
kernel?  Why wouldn't I just go read the SDM if I want to know how the
instructions work?
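
For reference, the one piece of that section that describes software behaviour rather than the instruction itself is the GLIBC longjmp loop around INCSSP: at most 255 steps per instruction, repeated as needed.  A minimal C model of that loop follows; it assumes 8-byte shadow stack slots (64-bit mode), and the function name is made up for illustration.

```c
#include <stdint.h>

#define INCSSP_MAX_STEPS 255u	/* quoted text: 0 ~ 255 steps per INCSSP */

/*
 * Model of unwinding 'frames' shadow stack entries in chunks.
 * Each loop iteration stands in for one INCSSP with count = step.
 * The shadow stack grows down, so unwinding increments the pointer.
 * Returns the number of INCSSP instructions that would be issued.
 */
static unsigned int unwind_shstk(uint64_t *ssp, uint64_t frames)
{
	unsigned int chunks = 0;

	while (frames > 0) {
		uint64_t step = frames > INCSSP_MAX_STEPS ?
				INCSSP_MAX_STEPS : frames;

		*ssp += step * 8;	/* 8-byte slots: an assumption (64-bit) */
		frames -= step;
		chunks++;
	}
	return chunks;
}
```

Unwinding 600 entries, for example, takes three iterations (255 + 255 + 90).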

> +[3] Application Enabling
> +========================
> +
> +An application's CET capability is marked in its ELF header and can
> +be verified from the following command output, in the
> +NT_GNU_PROPERTY_TYPE_0 field:
> +
> +    readelf -n <application>
> +
> +If an application supports CET and is statically linked, it will run
> +with CET protection.  If the application needs any shared libraries,
> +the loader checks all dependencies and enables CET only when all
> +requirements are met.

What about shared libraries loaded after the program starts?

> +[4] Legacy Libraries
> +====================
> +
> +GLIBC provides a few tunables for backward compatibility.
> +
> +GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT
> +    Turn off SHSTK/IBT for the current shell.
> +
> +GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
> +    This controls how dlopen() handles SHSTK legacy libraries::
> +
> +        on         - continue with SHSTK enabled;
> +        permissive - continue with SHSTK off.

This seems like manpage fodder more than kernel documentation to me.

> +[5] CET system calls
> +====================
> +
> +The following arch_prctl() system calls are added for CET:

FWIW, I wouldn't call each of these a "system call".

"Several arch_prctl()'s have been added for CET:"

> +arch_prctl(ARCH_X86_CET_STATUS, unsigned long *addr)
> +    Return CET feature status.
> +
> +    The parameter 'addr' is a pointer to a user buffer.
> +    On returning to the caller, the kernel fills the following
> +    information::
> +
> +        *addr       = SHSTK/IBT status
> +        *(addr + 1) = SHSTK base address
> +        *(addr + 2) = SHSTK size
> +
> +arch_prctl(ARCH_X86_CET_DISABLE, unsigned long features)
> +    Disable SHSTK and/or IBT specified in 'features'.  Return -EPERM
> +    if CET is locked.
> +
> +arch_prctl(ARCH_X86_CET_LOCK)
> +    Lock in CET feature.

Shouldn't this say what "locking" means?
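
Going by the author's reply earlier in the thread (both SHSTK and IBT are locked and cannot be turned off afterwards) and the -EPERM note under ARCH_X86_CET_DISABLE, the intended semantics can be modelled roughly as below.  This is a userspace model only: the struct, the function names and the feature bit values are hypothetical, not the patch's definitions.

```c
#include <errno.h>

#define CET_SHSTK 0x1UL	/* hypothetical feature bit values */
#define CET_IBT   0x2UL

struct cet_state {
	unsigned long enabled;	/* currently enabled features */
	int locked;		/* set once by the LOCK operation */
};

/* Model of ARCH_X86_CET_DISABLE: refused with -EPERM once locked. */
static int cet_disable(struct cet_state *s, unsigned long features)
{
	if (s->locked)
		return -EPERM;
	s->enabled &= ~features;
	return 0;
}

/* Model of ARCH_X86_CET_LOCK: a one-way switch covering all features. */
static void cet_lock(struct cet_state *s)
{
	s->locked = 1;
}
```

In this model there is deliberately no unlock operation, which is what "cannot be turned off afterwards" appears to imply.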

> +arch_prctl(ARCH_X86_CET_ALLOC_SHSTK, unsigned long *addr)
> +    Allocate a new SHSTK and put a restore token at top.
> +
> +    The parameter 'addr' is a pointer to a user buffer and indicates
> +    the desired SHSTK size to allocate.  On returning to the caller,
> +    the kernel fills '*addr' with the base address of the new SHSTK.
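
Note that 'addr' is an in/out parameter here: size on entry, base address on return.  The "restore token at top" part can be sketched as follows, using the token rules quoted in section [2] of the patch (the token holds the address immediately above itself, with bit[0] 'Lg' set for a 64-bit SHSTK).  The function operates on an ordinary array standing in for the shadow stack, and its name and signature are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define RSTOR_TOKEN_LG 0x1ULL	/* bit[0]: 64-bit SHSTK, per the quoted notes */

/*
 * Write a restore token into the top slot of a simulated shadow stack.
 * 'stack' is the backing array, 'slots' its length in 8-byte words
 * (must be > 0), and 'base' the address the stack would be mapped at.
 * Returns the address at which the token lives.
 */
static uint64_t put_restore_token(uint64_t *stack, size_t slots, uint64_t base)
{
	uint64_t top = base + slots * sizeof(uint64_t);

	/* The token points to the slot immediately above itself. */
	stack[slots - 1] = top | RSTOR_TOKEN_LG;
	return top - sizeof(uint64_t);
}
```

User-mode code would then hand the returned token address to RSTORSSP to switch onto the new stack.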



> +arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE, unsigned long *addr)
> +    Mark an address range as IBT legacy code.
> +
> +    The parameter 'addr' is a pointer to a user buffer that has the
> +    following information::
> +
> +        *addr       = starting linear address of the legacy code
> +        *(addr + 1) = size of the legacy code
> +        *(addr + 2) = set (1); clear (0)
> +
> +Note:
> +  There is no CET-enabling arch_prctl function.  By design, CET is
> +  enabled automatically if the binary and the system can support it.

This is kinda interesting.  It means that a JIT couldn't choose to
protect the code it generates while running under different rules itself?

> +  The parameters passed are always unsigned 64-bit.  When an IA32
> +  application passing pointers, it should only use the lower 32 bits.

Won't a 32-bit app calling prctl() use the 32-bit ABI?  How would it
even know it's running on a 64-bit kernel?

> +[6] The implementation of the SHSTK
> +===================================
> +
> +SHSTK size
> +----------
> +
> +A task's SHSTK is allocated from memory to a fixed size of
> +RLIMIT_STACK.

I can't really parse that sentence.  Is this saying that shadow stacks
are limited by and share space with normal stacks via RLIMIT_STACK?

>  A compat-mode thread's SHSTK size is 1/4 of
> +RLIMIT_STACK.  The smaller 32-bit thread SHSTK allows more threads to
> +share a 32-bit address space.

I thought the size was passed in from userspace?  Where does this sizing
take place?  Is this a convention or is it being enforced?

> +Signal
> +------
> +
> +The main program and its signal handlers use the same SHSTK.  Because
> +the SHSTK stores only return addresses, a large SHSTK will cover the
> +condition that both the program stack and the sigaltstack run out.

						 ^ typo?

I'm not sure what this is trying to say.

> +The kernel creates a restore token at the SHSTK restoring address and
> +verifies that token when restoring from the signal handler.

I think there's a sentence or two of background missing here.  I'm
really lost as to what this is trying to tell me.

> +IBT for signal delivering and sigreturn is the same as the main
> +program's setup; except for WAIT_ENDBR status, which can be read from
> +MSR_IA32_U_CET.  In general, a task is in WAIT_ENDBR after an
> +indirect CALL/JMP and before the next instruction starts.

I'm 100% lost.  I have no idea what this is trying to tell me or why it
is relevant to the kernel.

> +Fork
> +----
> +
> +The SHSTK's vma has VM_SHSTK flag set; its PTEs are required to be
> +read-only and dirty.  When a SHSTK PTE is not present, RO, and dirty,
> +a SHSTK access triggers a page fault with an additional SHSTK bit set
> +in the page fault error code.
> +
> +When a task forks a child, its SHSTK PTEs are copied and both the
> +parent's and the child's SHSTK PTEs are cleared of the dirty bit.
> +Upon the next SHSTK access, the resulting SHSTK page fault is handled
> +by page copy/re-use.

What's the most important thing about shadow stacks and fork()?  Does
this documentation tell that to the reader?

> +When a pthread child is created, the kernel allocates a new SHSTK for
> +the new thread.

Why is this here?  Are pthread children created with fork()?

> +Setjmp/Longjmp
> +--------------
> +
> +Longjmp unwinds SHSTK until it matches the program stack.

I'm missing what this has to do with the kernel.

> +Ucontext
> +--------
> +
> +In GLIBC, getcontext/setcontext is implemented in similar way as
> +setjmp/longjmp.
> +
> +When makecontext creates a new ucontext, a new SHSTK is allocated for
> +that context with ARCH_X86_CET_ALLOC_SHSTK syscall.  The kernel

Nit: ARCH_X86_CET_ALLOC_SHSTK is not a syscall.

> +creates a restore token at the top of the new SHSTK and the user-mode
> +code switches to the new SHSTK with the RSTORSSP instruction.

This seems like a howto for doing user-level threading.  It seems like
it could be replaced by a single sentence in the
ARCH_X86_CET_ALLOC_SHSTK documentation explaining that new shadow stacks
are generally (always??) allocated along with new stacks.  Since new
clone() threads need a new stack, they also need a new shadow stack.
User-level threads that need a new stack are also expected to allocate a
new shadow stack.

Right?

> +[7] The management of read-only & dirty PTEs for SHSTK
> +======================================================
> +
> +A RO and dirty PTE exists in the following cases:
> +
> +(a) A page is modified and then shared with a fork()'ed child;
> +(b) A R/O page that has been COW'ed;
> +(c) A SHSTK page.
> +
> +The processor only checks the dirty bit for (c).  To prevent the use
> +of non-SHSTK memory as SHSTK, we use a spare bit of the 64-bit PTE as
> +DIRTY_SW for (a) and (b) above.  This results to the following PTE
> +settings::
> +
> +    Modified PTE:             (R/W + DIRTY_HW)
> +    Modified and shared PTE:  (R/O + DIRTY_SW)
> +    R/O PTE, COW'ed:          (R/O + DIRTY_SW)
> +    SHSTK PTE:                (R/O + DIRTY_HW)
> +    SHSTK PTE, COW'ed:        (R/O + DIRTY_HW)
> +    SHSTK PTE, shared:        (R/O + DIRTY_SW)
> +
> +Note that DIRTY_SW is only used in R/O PTEs but not R/W PTEs.

I really don't think this belongs in the documentation, especially since
it's duplicated almost verbatim in code comments.
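
For readers trying to follow the quoted table, here is a minimal sketch of the state transitions it describes. It is a toy model only: the bit positions and the helper names (is_shstk, share_for_fork) are made up for illustration and are not the kernel's definitions.

```python
# Illustrative bit positions only; these are NOT the real x86 PTE bits.
RW = 1 << 0        # page is writable
DIRTY_HW = 1 << 1  # hardware dirty bit
DIRTY_SW = 1 << 2  # spare software bit used as "dirty" for R/O PTEs


def is_shstk(pte: int) -> bool:
    """A PTE looks like shadow stack when it is read-only and hardware-dirty."""
    return (pte & RW) == 0 and (pte & DIRTY_HW) != 0


def is_dirty(pte: int) -> bool:
    """Dirty state is kept in either the hardware bit or the software bit."""
    return (pte & (DIRTY_HW | DIRTY_SW)) != 0


def share_for_fork(pte: int) -> int:
    """Write-protect a PTE for sharing and move dirty state to DIRTY_SW."""
    pte &= ~RW
    if pte & DIRTY_HW:
        pte = (pte & ~DIRTY_HW) | DIRTY_SW
    return pte
```

In this model a modified page (R/W + DIRTY_HW) that gets shared ends up as R/O + DIRTY_SW, so it stays dirty without ever matching the R/O + DIRTY_HW pattern reserved for shadow stack pages.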

> +[8] The implementation of IBT legacy bitmap
> +===========================================
> +
> +When IBT is active, a non-IBT-capable legacy library can be executed
> +if its address ranges are specified in the legacy code bitmap.  The
> +bitmap covers the whole user-space address, which is TASK_SIZE_MAX
> +for 64-bit and TASK_SIZE for IA32, and its each bit indicates a 4-KB
> +legacy code page.  It is read-only from an application, and setup by

								^ set up

> +the kernel as a special mapping when the first time the application
> +calls arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE).  The application
> +manages the bitmap through the arch_prctl.
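
As a rough illustration of the bitmap layout described in the quoted text (one bit per 4-KB page), the arithmetic can be sketched as follows. The least-significant-bit-first ordering within a byte, the 47-bit user address space used in the size example, and all function names are assumptions for illustration; the patch text does not specify them.

```python
PAGE_SHIFT = 12  # 4-KB pages, as stated in the quoted text


def bitmap_size_bytes(task_size: int) -> int:
    """Bytes needed for one bit per 4-KB page of the address space."""
    pages = (task_size + (1 << PAGE_SHIFT) - 1) >> PAGE_SHIFT
    return (pages + 7) // 8


def bit_position(addr: int) -> tuple:
    """Return (byte index, bit index) for the page containing addr."""
    page = addr >> PAGE_SHIFT
    return page // 8, page % 8  # LSB-first ordering is an assumption


def mark_legacy(bitmap: bytearray, start: int, length: int) -> None:
    """Set the bit of every page overlapping [start, start + length)."""
    if length <= 0:
        return
    first = start >> PAGE_SHIFT
    last = (start + length - 1) >> PAGE_SHIFT
    for page in range(first, last + 1):
        bitmap[page // 8] |= 1 << (page % 8)


def is_legacy(bitmap: bytearray, addr: int) -> bool:
    """True if the page containing addr is marked as legacy code."""
    byte, bit = bit_position(addr)
    return (bitmap[byte] >> bit) & 1 == 1
```

Under these assumptions, covering a 47-bit user address space takes 2^32 bytes (4 GiB) of bitmap address range, which is presumably why the kernel sets it up as a special mapping rather than allocating it up front.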
Yu-cheng Yu Feb. 28, 2020, 3:55 p.m. UTC | #5
On Tue, 2020-02-25 at 12:02 -0800, Kees Cook wrote:
> On Wed, Feb 05, 2020 at 10:19:09AM -0800, Yu-cheng Yu wrote:
> > Explain no_cet_shstk/no_cet_ibt kernel parameters, and introduce a new
> > document on Control-flow Enforcement Technology (CET).
> > 
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> I'm not a huge fan of the boot param names, but I can't suggest anything
> better. ;) I love the extensive docs!
> 
> Reviewed-by: Kees Cook <keescook@chromium.org>

Thanks for reviewing!

Yu-cheng
Yu-cheng Yu March 9, 2020, 5 p.m. UTC | #6
On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
> > index ade4e6ec23e0..8b69ebf0baed 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -3001,6 +3001,12 @@
> >  			noexec=on: enable non-executable mappings (default)
> >  			noexec=off: disable non-executable mappings
> >  
> > +	no_cet_shstk	[X86-64] Disable Shadow Stack for user-mode
> > +			applications
> 
> If we ever add kernel support, "no_cet_shstk" will mean "no cet shstk
> for userspace"?

What about no_user_shstk, no_kernel_shstk?

> 
> > +	no_cet_ibt	[X86-64] Disable Indirect Branch Tracking for user-mode
> > +			applications
> > +
> >  	nosmap		[X86,PPC]
> >  			Disable SMAP (Supervisor Mode Access Prevention)
> >  			even if it is supported by processor.
> 
> BTW, this documentation is misplaced.  It needs to go to the spot where
> you introduce the code for these options.

We used to introduce the document later in the series.  The feedback was to
introduce it first so that readers know what to expect.

[...]

> > diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
> > new file mode 100644
> > index 000000000000..71e2462fea5c
> > --- /dev/null
> > +++ b/Documentation/x86/intel_cet.rst
> > @@ -0,0 +1,294 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=========================================
> > +Control-flow Enforcement Technology (CET)
> > +=========================================
> > +
> > +[1] Overview
> > +============
> > +
> > +Control-flow Enforcement Technology (CET) provides protection against
> > +return/jump-oriented programming (ROP) attacks.  It can be setup to

[...]

> > +
> > +There are two kernel configuration options:
> > +
> > +    X86_INTEL_SHADOW_STACK_USER, and
> > +    X86_INTEL_BRANCH_TRACKING_USER.
> > +
> > +To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or later
> > +are required.
> 
> Why are these needed to build a CET-enabled kernel?

We could (and used to) allow legacy toolchains, but after considering
practical purposes, dropped the support.  We can continue the discussion,
and if those are desired, bring them back.

[...]

> > +[2] CET assembly instructions
> > +=============================
> 
> Why do we need this in the kernel?  What is specific to Linux or the
> kernel?  Why wouldn't I just go read the SDM if I want to know how the
> instructions work?

Now the SDM has this.  I will drop this section.

> > +[3] Application Enabling
> > +========================
> > +
> > +An application's CET capability is marked in its ELF header and can
> > +be verified from the following command output, in the
> > +NT_GNU_PROPERTY_TYPE_0 field:
> > +
> > +    readelf -n <application>
> > +
> > +If an application supports CET and is statically linked, it will run
> > +with CET protection.  If the application needs any shared libraries,
> > +the loader checks all dependencies and enables CET only when all
> > +requirements are met.
> 
> What about shared libraries loaded after the program starts?

The loader does the check for dlopen().


> > +[4] Legacy Libraries
> > +====================
> > +
> > +GLIBC provides a few tunables for backward compatibility.
> > +
> > +GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT
> > +    Turn off SHSTK/IBT for the current shell.
> > +
> > +GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
> > +    This controls how dlopen() handles SHSTK legacy libraries::
> > +
> > +        on         - continue with SHSTK enabled;
> > +        permissive - continue with SHSTK off.
> 
> This seems like manpage fodder more than kernel documentation to me.

Yes, we can drop this as well.

[...]

> > +Note:
> > +  There is no CET-enabling arch_prctl function.  By design, CET is
> > +  enabled automatically if the binary and the system can support it.
> 
> This is kinda interesting.  It means that a JIT couldn't choose to
> protect the code it generates and have different rules from itself?

JIT needs to be updated for CET first.  Once that is done, it runs with CET
enabled.  It can use the NOTRACK prefix, for example.

> > +  The parameters passed are always unsigned 64-bit.  When an IA32
> > +  application passing pointers, it should only use the lower 32 bits.
> 
> Won't a 32-bit app calling prctl() use the 32-bit ABI?  How would it
> even know it's running on a 64-bit kernel?

The 32-bit app is passing only a pointer to an array of 64-bit numbers.

> 
> > +[6] The implementation of the SHSTK
> > +===================================
> > +
> > +SHSTK size
> > +----------
> > +
> > +A task's SHSTK is allocated from memory to a fixed size of
> > +RLIMIT_STACK.
> 
> I can't really parse that sentence.  Is this saying that shadow stacks
> are limited by and share space with normal stacks via RLIMIT_STACK?
> 
> >  A compat-mode thread's SHSTK size is 1/4 of
> > +RLIMIT_STACK.  The smaller 32-bit thread SHSTK allows more threads to
> > +share a 32-bit address space.
> 
> I thought the size was passed in from userspace?  Where does this sizing
> take place?  Is this a convention or is it being enforced?

I will make this (and other things you pointed out) clear in the next
version.

Yu-cheng
Dave Hansen March 9, 2020, 5:21 p.m. UTC | #7
On 3/9/20 10:00 AM, Yu-cheng Yu wrote:
> On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
>>> index ade4e6ec23e0..8b69ebf0baed 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -3001,6 +3001,12 @@
>>>  			noexec=on: enable non-executable mappings (default)
>>>  			noexec=off: disable non-executable mappings
>>>  
>>> +	no_cet_shstk	[X86-64] Disable Shadow Stack for user-mode
>>> +			applications
>>
>> If we ever add kernel support, "no_cet_shstk" will mean "no cet shstk
>> for userspace"?
> 
> What about no_user_shstk, no_kernel_shstk?

Those are better.

>>> +	no_cet_ibt	[X86-64] Disable Indirect Branch Tracking for user-mode
>>> +			applications
>>> +
>>>  	nosmap		[X86,PPC]
>>>  			Disable SMAP (Supervisor Mode Access Prevention)
>>>  			even if it is supported by processor.
>>
>> BTW, this documentation is misplaced.  It needs to go to the spot where
>> you introduce the code for these options.
> 
> We used to introduce the document later in the series.  The feedback was to
> introduce it first so that readers know what to expect.

To me, that doesn't apply for things that are implemented in this
specific of a spot in the code and *ALSO* might not even make the final
series.


>>> +Note:
>>> +  There is no CET-enabling arch_prctl function.  By design, CET is
>>> +  enabled automatically if the binary and the system can support it.
>>
>> This is kinda interesting.  It means that a JIT couldn't choose to
>> protect the code it generates and have different rules from itself?
> 
> JIT needs to be updated for CET first.  Once that is done, it runs with CET
> enabled.  It can use the NOTRACK prefix, for example.

Am I missing something?

What's the direct connection between shadow stacks and Indirect Branch
Tracking other than Intel marketing umbrellas?

>>> +  The parameters passed are always unsigned 64-bit.  When an IA32
>>> +  application passing pointers, it should only use the lower 32 bits.
>>
>> Won't a 32-bit app calling prctl() use the 32-bit ABI?  How would it
>> even know it's running on a 64-bit kernel?
> 
> The 32-bit app is passing only a pointer to an array of 64-bit numbers.

Well, the documentation just talked about pointers and I naively assume
it means the "unsigned long *" you had in there.

Rather than make suggestions, just say that the ABI is universally
64-bit.  Saying that the pointers must be valid is just kinda silly.
It's also not 100% clear what an "IA32 application" *MEANS* given fun
things like x32.

Also, I went to go find this implementation in your series.  I couldn't
find it.  Did I miss a patch?  Or are you documenting things you didn't
even implement?
Yu-cheng Yu March 9, 2020, 7:27 p.m. UTC | #8
On Mon, 2020-03-09 at 10:21 -0700, Dave Hansen wrote:
> On 3/9/20 10:00 AM, Yu-cheng Yu wrote:
> > On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
> > > > index ade4e6ec23e0..8b69ebf0baed 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -3001,6 +3001,12 @@
> > > >  			noexec=on: enable non-executable mappings (default)
> > > >  			noexec=off: disable non-executable mappings
> > > >  
> > > > +	no_cet_shstk	[X86-64] Disable Shadow Stack for user-mode
> > > > +			applications
> > > 
> > > If we ever add kernel support, "no_cet_shstk" will mean "no cet shstk
> > > for userspace"?
> > 
> > What about no_user_shstk, no_kernel_shstk?

[...]

> > > > +Note:
> > > > +  There is no CET-enabling arch_prctl function.  By design, CET is
> > > > +  enabled automatically if the binary and the system can support it.
> > > 
> > > This is kinda interesting.  It means that a JIT couldn't choose to
> > > protect the code it generates and have different rules from itself?
> > 
> > JIT needs to be updated for CET first.  Once that is done, it runs with CET
> > enabled.  It can use the NOTRACK prefix, for example.
> 
> Am I missing something?
> 
> What's the direct connection between shadow stacks and Indirect Branch
> Tracking other than Intel marketing umbrellas?

What I meant is that JIT code needs to be updated first; if it skips RETs,
it needs to unwind the stack, and if it does indirect JMPs somewhere it
needs to fix up the branch target or use NOTRACK.

> > > > +  The parameters passed are always unsigned 64-bit.  When an IA32
> > > > +  application passing pointers, it should only use the lower 32 bits.
> > > 
> > > Won't a 32-bit app calling prctl() use the 32-bit ABI?  How would it
> > > even know it's running on a 64-bit kernel?
> > 
> > The 32-bit app is passing only a pointer to an array of 64-bit numbers.
> 
> Well, the documentation just talked about pointers and I naively assume
> it means the "unsigned long *" you had in there.
> 
> Rather than make suggestions, just say that the ABI is universally
> 64-bit.  Saying that the pointers must be valid is just kinda silly.
> It's also not 100% clear what an "IA32 application" *MEANS* given fun
> things like x32.

Ok, I will update the text.

> 
> Also, I went to go find this implementation in your series.  I couldn't
> find it.  Did I miss a patch?  Or are you documenting things you didn't
> even implement?

In patch #27: Add arch_prctl functions for Shadow Stack.

Yu-cheng
Dave Hansen March 9, 2020, 7:35 p.m. UTC | #9
On 3/9/20 12:27 PM, Yu-cheng Yu wrote:
> On Mon, 2020-03-09 at 10:21 -0700, Dave Hansen wrote:
>> On 3/9/20 10:00 AM, Yu-cheng Yu wrote:
>>> On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
>>>>> +Note:
>>>>> +  There is no CET-enabling arch_prctl function.  By design, CET is
>>>>> +  enabled automatically if the binary and the system can support it.
>>>>
>>>> This is kinda interesting.  It means that a JIT couldn't choose to
>>>> protect the code it generates and have different rules from itself?
>>>
>>> JIT needs to be updated for CET first.  Once that is done, it runs with CET
>>> enabled.  It can use the NOTRACK prefix, for example.
>>
>> Am I missing something?
>>
>> What's the direct connection between shadow stacks and Indirect Branch
>> Tracking other than Intel marketing umbrellas?
> 
> What I meant is that JIT code needs to be updated first; if it skips RETs,
> it needs to unwind the stack, and if it does indirect JMPs somewhere it
> needs to fix up the branch target or use NOTRACK.

I'm totally lost.  I think we have very different models of how a JIT
might generate and run code.

I can totally see a scenario where a JIT goes and generates a bunch of
code, then forks a new thread to go run that code.  The control flow of
the JIT thread itself *NEVER* interacts with the control flow of the
program it writes.  They never share a stack and nothing ever jumps or
rets between the two worlds.

Does anything actually do that?  I've got no idea.  But, I can clearly
see a world where the entirety of Chrome and Firefox and the entire rust
runtime might not be fully recompiled and CET-enabled for a while.  But,
we still want the JIT-generated code to be CET-protected since it has
the most exposed attack surface.

I don't think that's too far-fetched.
H.J. Lu March 9, 2020, 7:50 p.m. UTC | #10
On Mon, Mar 9, 2020 at 12:35 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 3/9/20 12:27 PM, Yu-cheng Yu wrote:
> > On Mon, 2020-03-09 at 10:21 -0700, Dave Hansen wrote:
> >> On 3/9/20 10:00 AM, Yu-cheng Yu wrote:
> >>> On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
> >>>>> +Note:
> >>>>> +  There is no CET-enabling arch_prctl function.  By design, CET is
> >>>>> +  enabled automatically if the binary and the system can support it.
> >>>>
> >>>> This is kinda interesting.  It means that a JIT couldn't choose to
> >>>> protect the code it generates and have different rules from itself?
> >>>
> >>> JIT needs to be updated for CET first.  Once that is done, it runs with CET
> >>> enabled.  It can use the NOTRACK prefix, for example.
> >>
> >> Am I missing something?
> >>
> >> What's the direct connection between shadow stacks and Indirect Branch
> >> Tracking other than Intel marketing umbrellas?
> >
> > What I meant is that JIT code needs to be updated first; if it skips RETs,
> > it needs to unwind the stack, and if it does indirect JMPs somewhere it
> > needs to fix up the branch target or use NOTRACK.
>
> I'm totally lost.  I think we have very different models of how a JIT
> might generate and run code.
>
> I can totally see a scenario where a JIT goes and generates a bunch of
> code, then forks a new thread to go run that code.  The control flow of
> the JIT thread itself *NEVER* interacts with the control flow of the
> program it writes.  They never share a stack and nothing ever jumps or
> rets between the two worlds.
>
> Does anything actually do that?  I've got no idea.  But, I can clearly
> see a world where the entirety of Chrome and Firefox and the entire rust
> runtime might not be fully recompiled and CET-enabled for a while.  But,
> we still want the JIT-generated code to be CET-protected since it has
> the most exposed attack surface.
>
> I don't think that's too far-fetched.

CET support is all or nothing.   You can mix and match, but you will get
no CET protection, similar to NX feature.
Andy Lutomirski March 9, 2020, 8:16 p.m. UTC | #11
> On Mar 9, 2020, at 12:50 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> 
> On Mon, Mar 9, 2020 at 12:35 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> 
>>> On 3/9/20 12:27 PM, Yu-cheng Yu wrote:
>>> On Mon, 2020-03-09 at 10:21 -0700, Dave Hansen wrote:
>>>> On 3/9/20 10:00 AM, Yu-cheng Yu wrote:
>>>>> On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
>>>>>>> +Note:
>>>>>>> +  There is no CET-enabling arch_prctl function.  By design, CET is
>>>>>>> +  enabled automatically if the binary and the system can support it.
>>>>>> 
>>>>>> This is kinda interesting.  It means that a JIT couldn't choose to
>>>>>> protect the code it generates and have different rules from itself?
>>>>> 
>>>>> JIT needs to be updated for CET first.  Once that is done, it runs with CET
>>>>> enabled.  It can use the NOTRACK prefix, for example.
>>>> 
>>>> Am I missing something?
>>>> 
>>>> What's the direct connection between shadow stacks and Indirect Branch
>>>> Tracking other than Intel marketing umbrellas?
>>> 
>>> What I meant is that JIT code needs to be updated first; if it skips RETs,
>>> it needs to unwind the stack, and if it does indirect JMPs somewhere it
>>> needs to fix up the branch target or use NOTRACK.
>> 
>> I'm totally lost.  I think we have very different models of how a JIT
>> might generate and run code.
>> 
>> I can totally see a scenario where a JIT goes and generates a bunch of
>> code, then forks a new thread to go run that code.  The control flow of
>> the JIT thread itself *NEVER* interacts with the control flow of the
>> program it writes.  They never share a stack and nothing ever jumps or
>> rets between the two worlds.
>> 
>> Does anything actually do that?  I've got no idea.  But, I can clearly
>> see a world where the entirety of Chrome and Firefox and the entire rust
>> runtime might not be fully recompiled and CET-enabled for a while.  But,
>> we still want the JIT-generated code to be CET-protected since it has
>> the most exposed attack surface.
>> 
>> I don't think that's too far-fetched.
> 
> CET support is all or nothing.   You can mix and match, but you will get
> no CET protection, similar to NX feature.
> 

Can you explain?

If a program with the magic ELF CET flags missing can’t make a thread with IBT and/or SHSTK enabled, then I think we’ve made an error and should fix it.

> -- 
> H.J.
H.J. Lu March 9, 2020, 8:54 p.m. UTC | #12
On Mon, Mar 9, 2020 at 1:16 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>
> > On Mar 9, 2020, at 12:50 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, Mar 9, 2020 at 12:35 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >>
> >>> On 3/9/20 12:27 PM, Yu-cheng Yu wrote:
> >>> On Mon, 2020-03-09 at 10:21 -0700, Dave Hansen wrote:
> >>>> On 3/9/20 10:00 AM, Yu-cheng Yu wrote:
> >>>>> On Wed, 2020-02-26 at 09:57 -0800, Dave Hansen wrote:
> >>>>>>> +Note:
> >>>>>>> +  There is no CET-enabling arch_prctl function.  By design, CET is
> >>>>>>> +  enabled automatically if the binary and the system can support it.
> >>>>>>
> >>>>>> This is kinda interesting.  It means that a JIT couldn't choose to
> >>>>>> protect the code it generates and have different rules from itself?
> >>>>>
> >>>>> JIT needs to be updated for CET first.  Once that is done, it runs with CET
> >>>>> enabled.  It can use the NOTRACK prefix, for example.
> >>>>
> >>>> Am I missing something?
> >>>>
> >>>> What's the direct connection between shadow stacks and Indirect Branch
> >>>> Tracking other than Intel marketing umbrellas?
> >>>
> >>> What I meant is that JIT code needs to be updated first; if it skips RETs,
> >>> it needs to unwind the stack, and if it does indirect JMPs somewhere it
> >>> needs to fix up the branch target or use NOTRACK.
> >>
> >> I'm totally lost.  I think we have very different models of how a JIT
> >> might generate and run code.
> >>
> >> I can totally see a scenario where a JIT goes and generates a bunch of
> >> code, then forks a new thread to go run that code.  The control flow of
> >> the JIT thread itself *NEVER* interacts with the control flow of the
> >> program it writes.  They never share a stack and nothing ever jumps or
> >> rets between the two worlds.
> >>
> >> Does anything actually do that?  I've got no idea.  But, I can clearly
> >> see a world where the entirety of Chrome and Firefox and the entire rust
> >> runtime might not be fully recompiled and CET-enabled for a while.  But,
> >> we still want the JIT-generated code to be CET-protected since it has
> >> the most exposed attack surface.
> >>
> >> I don't think that's too far-fetched.
> >
> > CET support is all or nothing.   You can mix and match, but you will get
> > no CET protection, similar to NX feature.
> >
>
> Can you explain?

I was talking about creating a program from mixed object files with and without
CET marker.

> If a program with the magic ELF CET flags missing can’t make a thread with IBT and/or SHSTK enabled, then I think we’ve made an error and should fix it.
>

A non-CET program can start a CET program and vice versa.
Dave Hansen March 9, 2020, 8:59 p.m. UTC | #13
On 3/9/20 1:54 PM, H.J. Lu wrote:
>> If a program with the magic ELF CET flags missing can’t make a
>> thread with IBT and/or SHSTK enabled, then I think we’ve made an
>> error and should fix it.
>> 
> A non-CET program can start a CET program and vice versa.

Could we be specific here, please?

HJ are you saying that:
* CET program can execve() a non-CET program, and
* a non-CET program can execve() a CET program

?

That's obvious.

But what are the rules for clone()?  Should there be rules for
mismatches for CET enabling between threads of a process (not child
processes)?
H.J. Lu March 9, 2020, 9:12 p.m. UTC | #14
On Mon, Mar 9, 2020 at 1:59 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 3/9/20 1:54 PM, H.J. Lu wrote:
> >> If a program with the magic ELF CET flags missing can’t make a
> >> thread with IBT and/or SHSTK enabled, then I think we’ve made an
> >> error and should fix it.
> >>
> > A non-CET program can start a CET program and vice versa.
>
> Could we be specific here, please?
>
> HJ are you saying that:
> * CET program can execve() a non-CET program, and
> * a non-CET program can execve() a CET program
>
> ?

Yes.

> That's obvious.
>
> But what are the rules for clone()?  Should there be rules for
> mismatches for CET enabling between threads if a process (not child
> processes)?

What did you mean? A threaded application is either CET enabled or not
CET enabled.   A new thread from clone makes no difference.
Andy Lutomirski March 9, 2020, 10:02 p.m. UTC | #15
> On Mar 9, 2020, at 2:13 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> 
> On Mon, Mar 9, 2020 at 1:59 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> 
>> On 3/9/20 1:54 PM, H.J. Lu wrote:
>>>> If a program with the magic ELF CET flags missing can’t make a
>>>> thread with IBT and/or SHSTK enabled, then I think we’ve made an
>>>> error and should fix it.
>>>> 
>>> A non-CET program can start a CET program and vice versa.
>> 
>> Could we be specific here, please?
>> 
>> HJ are you saying that:
>> * CET program can execve() a non-CET program, and
>> * a non-CET program can execve() a CET program
>> 
>> ?
> 
> Yes.
> 
>> That's obvious.
>> 
>> But what are the rules for clone()?  Should there be rules for
>> mismatches for CET enabling between threads if a process (not child
>> processes)?
> 
> What did you mean? A threaded application is either CET enabled or not
> CET enabled.   A new thread from clone makes no difference.

Why?  Dave’s example seems like a good reason to allow per-thread control.



> 
> -- 
> H.J.
Dave Hansen March 9, 2020, 10:19 p.m. UTC | #16
On 3/9/20 2:12 PM, H.J. Lu wrote:
>> But what are the rules for clone()?  Should there be rules for
>> mismatches for CET enabling between threads if a process (not child
>> processes)?
> What did you mean? A threaded application is either CET enabled or not
> CET enabled.   A new thread from clone makes no difference.

Stacks are fundamentally thread-local resources.  The registers that
point to them and MSRs that manage shadow stacks are all CPU-thread
local.  Nothing is fundamentally tied to the address space shared across
the process.

A thread might also share *no* control flow with its child.  It might
ask the thread to start in code that the parent can never even reach.

It sounds like you've picked a Linux implementation that has
restrictions on top of the fundamentals.  That's not wrong per se, but
it does deserve explanation and deliberate, not experimental design.

Could you go back to the folks at Intel and try to figure out what this
was designed to *do*?  Yes, I'm probably one of those folks.  You know
where to find me. :)
H.J. Lu March 9, 2020, 11:11 p.m. UTC | #17
On Mon, Mar 9, 2020 at 3:19 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 3/9/20 2:12 PM, H.J. Lu wrote:
> >> But what are the rules for clone()?  Should there be rules for
> >> mismatches for CET enabling between threads if a process (not child
> >> processes)?
> > What did you mean? A threaded application is either CET enabled or not
> > CET enabled.   A new thread from clone makes no difference.
>
> Stacks are fundamentally thread-local resources.  The registers that
> point to them and MSRs that manage shadow stacks are all CPU-thread
> local.  Nothing is fundamentally tied to the address space shared across
> the process.
>
> A thread might also share *no* control flow with its child.  It might
> ask the thread to start in code that the parent can never even reach.
>
> It sounds like you've picked a Linux implementation that has
> restrictions on top of the fundamentals.  That's not wrong per se, but
> it does deserve explanation and deliberate, not experimental design.
>
> Could you go back to the folks at Intel and try to figure out what this
> was designed to *do*?  Yes, I'm probably one of those folks.  You know
> where to find me. :)

A threaded application is loaded from disk.  The object file on disk is
either CET enabled or not CET enabled.
Dave Hansen March 9, 2020, 11:20 p.m. UTC | #18
On 3/9/20 4:11 PM, H.J. Lu wrote:
> A threaded application is loaded from disk.  The object file on disk is
> either CET enabled or not CET enabled.

Huh.  Are you saying that all instructions executed on userspace on
Linux come off of object files on the disk?  That's an interesting
assertion.  You might want to go take a look at the processes on your
systems.  Here's my browser for example:

# for p in $(ps aux | grep chromium | awk '{print $2}' ); do cat /proc/$p/maps; done | grep ' r-xp 00000000 00:00 0'
...
202f00082000-202f000bf000 r-xp 00000000 00:00 0
202f000c2000-202f000c3000 r-xp 00000000 00:00 0
202f00102000-202f00103000 r-xp 00000000 00:00 0
202f00142000-202f00143000 r-xp 00000000 00:00 0
202f00182000-202f001bf000 r-xp 00000000 00:00 0

Lots of funny looking memory areas which are anonymous and executable!
Those didn't come off the disk.  Same thing in firefox.  Weird.  Any
idea what those are?

One guess: https://en.wikipedia.org/wiki/Just-in-time_compilation
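
The shell pipeline above can also be sketched as a small script, for anyone who wants to repeat the check on their own processes. This is only an illustration: the function names are made up here, and unlike the grep it also skips named pseudo-mappings such as [vdso] by requiring an empty pathname.

```python
import re
from typing import Iterable, List, Tuple

# One line of /proc/<pid>/maps: "start-end perms offset dev inode [pathname]"
_MAPS_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s+([0-9a-f]+)\s+(\S+)\s+(\d+)\s*(.*)$"
)


def anon_exec_regions(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (start, end) of executable mappings that have no backing file."""
    regions = []
    for line in lines:
        m = _MAPS_LINE.match(line.strip())
        if m is None:
            continue  # not a maps line
        start, end, perms, _offset, _dev, inode, path = m.groups()
        # Anonymous here means inode 0 and no pathname at all.
        if "x" in perms and inode == "0" and path == "":
            regions.append((int(start, 16), int(end, 16)))
    return regions


def anon_exec_regions_of(pid: str = "self") -> List[Tuple[int, int]]:
    """Scan /proc/<pid>/maps of a live process (Linux only)."""
    with open(f"/proc/{pid}/maps") as f:
        return anon_exec_regions(f)
```

Any region this reports holds code that did not come from an ELF file on disk, so no ELF property note describes whether it is CET-capable.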
H.J. Lu March 9, 2020, 11:51 p.m. UTC | #19
On Mon, Mar 9, 2020 at 4:21 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 3/9/20 4:11 PM, H.J. Lu wrote:
> > A threaded application is loaded from disk.  The object file on disk is
> > either CET enabled or not CET enabled.
>
> Huh.  Are you saying that all instructions executed on userspace on
> Linux come off of object files on the disk?  That's an interesting
> assertion.  You might want to go take a look at the processes on your
> systems.  Here's my browser for example:
>
> # for p in $(ps aux | grep chromium | awk '{print $2}' ); do cat
> /proc/$p/maps; done | grep ' r-xp 00000000 00:00 0'
> ...
> 202f00082000-202f000bf000 r-xp 00000000 00:00 0
> 202f000c2000-202f000c3000 r-xp 00000000 00:00 0
> 202f00102000-202f00103000 r-xp 00000000 00:00 0
> 202f00142000-202f00143000 r-xp 00000000 00:00 0
> 202f00182000-202f001bf000 r-xp 00000000 00:00 0
>
> Lots of funny looking memory areas which are anonymous and executable!
> Those didn't come off the disk.  Same thing in firefox.  Weird.  Any
> idea what those are?
>
> One guess: https://en.wikipedia.org/wiki/Just-in-time_compilation

JITted code belongs to a process loaded from disk.  Enabling CET in
an application which uses a JIT engine means also enabling CET in the
JIT engine.  Take git as an example: "git grep" crashed for me on Tiger
Lake.  It turned out that git itself was compiled with -fcf-protection and
was linked against libpcre2-8.so.0, also compiled with -fcf-protection,
which has a JIT, sljit, that was not CET enabled.  git crashed in the
JITted code due to a missing ENDBR.  I had to enable CET in sljit to make
git work on CET-enabled Tiger Lake.  So we need to enable CET in the
JIT engine before enabling CET in applications which use it.
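
To make the "missing ENDBR" failure concrete: with IBT active, the target of an indirect CALL/JMP must begin with ENDBR64 (encoded as F3 0F 1E FA), so a JIT has to emit it at every entry point it hands out. The sketch below shows only that byte-level requirement. The emitter and checker names are hypothetical and have nothing to do with sljit's actual code.

```python
ENDBR64 = bytes.fromhex("f30f1efa")  # encoding of the ENDBR64 instruction


def emit_function(body: bytes, ibt: bool = True) -> bytes:
    """Return machine code for a JITted function, prefixed with ENDBR64
    when IBT support is requested and the marker is not already there."""
    if ibt and not body.startswith(ENDBR64):
        return ENDBR64 + body
    return body


def is_valid_indirect_target(code: bytes, offset: int = 0) -> bool:
    """True if the code at offset starts with ENDBR64, which is what an
    indirect branch landing there requires when IBT is active."""
    return code[offset:offset + 4] == ENDBR64
```

For example, a body of "xor eax, eax; ret" (31 C0 C3) emitted with ibt=False fails the check, which corresponds to the crash described above, while the same body emitted with ibt=True passes.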
Andy Lutomirski March 9, 2020, 11:59 p.m. UTC | #20
On Mon, Mar 9, 2020 at 4:52 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Mar 9, 2020 at 4:21 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 3/9/20 4:11 PM, H.J. Lu wrote:
> > > A threaded application is loaded from disk.  The object file on disk is
> > > either CET enabled or not CET enabled.
> >
> > Huh.  Are you saying that all instructions executed on userspace on
> > Linux come off of object files on the disk?  That's an interesting
> > assertion.  You might want to go take a look at the processes on your
> > systems.  Here's my browser for example:
> >
> > # for p in $(ps aux | grep chromium | awk '{print $2}' ); do cat
> > /proc/$p/maps; done | grep ' r-xp 00000000 00:00 0'
> > ...
> > 202f00082000-202f000bf000 r-xp 00000000 00:00 0
> > 202f000c2000-202f000c3000 r-xp 00000000 00:00 0
> > 202f00102000-202f00103000 r-xp 00000000 00:00 0
> > 202f00142000-202f00143000 r-xp 00000000 00:00 0
> > 202f00182000-202f001bf000 r-xp 00000000 00:00 0
> >
> > Lots of funny looking memory areas which are anonymous and executable!
> > Those didn't come off the disk.  Same thing in firefox.  Weird.  Any
> > idea what those are?
> >
> > One guess: https://en.wikipedia.org/wiki/Just-in-time_compilation
>
> jitted code belongs to a process loaded from disk.  Enable CET in
> an application which uses JIT engine means to also enable CET in
> JIT engine.  Take git as an example, "git grep" crashed for me on Tiger
> Lake.   It turned out that git itself was compiled with -fcf-protection and
> git was linked against libpcre2-8.so.0 also compiled with -fcf-protection,
> which has a JIT, sljit, which was not CET enabled.  git crashed in the
> jitted codes due to missing ENDBR.  I had to enable CET in sljit to make
> git working on CET enabled Tiger Lake.  So we need to enable CET in
> JIT engine before enabling CET in applications which use JIT engine.

This could presumably have been fixed by having libpcre or sljit
disable IBT before calling into JIT code or by running the JIT code in
another thread.  In the other direction, a non-CET libpcre build could
build IBT-capable JITted code and enable JIT (by syscall if we allow
that or by creating a thread?) when calling it.  And IBT has this
fancy legacy bitmap to allow non-instrumented code to run with IBT on,
although SHSTK doesn't have hardware support for a similar feature.

So, sure, the glibc-linked ELF ecosystem needs some degree of CET
coordination, but it is absolutely not the case that a process MUST
have all CET or no CET.  Let's please support the complicated cases in
the kernel and the ABI too.  If glibc wants to make it annoying to do
complicated things, so be it.  People work behind glibc's back all the
time.

--Andy
H.J. Lu March 10, 2020, 12:08 a.m. UTC | #21
On Mon, Mar 9, 2020 at 4:59 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> On Mon, Mar 9, 2020 at 4:52 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, Mar 9, 2020 at 4:21 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > > On 3/9/20 4:11 PM, H.J. Lu wrote:
> > > > A threaded application is loaded from disk.  The object file on disk is
> > > > either CET enabled or not CET enabled.
> > >
> > > Huh.  Are you saying that all instructions executed on userspace on
> > > Linux come off of object files on the disk?  That's an interesting
> > > assertion.  You might want to go take a look at the processes on your
> > > systems.  Here's my browser for example:
> > >
> > > # for p in $(ps aux | grep chromium | awk '{print $2}' ); do cat
> > > /proc/$p/maps; done | grep ' r-xp 00000000 00:00 0'
> > > ...
> > > 202f00082000-202f000bf000 r-xp 00000000 00:00 0
> > > 202f000c2000-202f000c3000 r-xp 00000000 00:00 0
> > > 202f00102000-202f00103000 r-xp 00000000 00:00 0
> > > 202f00142000-202f00143000 r-xp 00000000 00:00 0
> > > 202f00182000-202f001bf000 r-xp 00000000 00:00 0
> > >
> > > Lots of funny looking memory areas which are anonymous and executable!
> > > Those didn't come off the disk.  Same thing in firefox.  Weird.  Any
> > > idea what those are?
> > >
> > > One guess: https://en.wikipedia.org/wiki/Just-in-time_compilation
> >
> > jitted code belongs to a process loaded from disk.  Enable CET in
> > an application which uses JIT engine means to also enable CET in
> > JIT engine.  Take git as an example, "git grep" crashed for me on Tiger
> > Lake.   It turned out that git itself was compiled with -fcf-protection and
> > git was linked against libpcre2-8.so.0 also compiled with -fcf-protection,
> > which has a JIT, sljit, which was not CET enabled.  git crashed in the
> > jitted codes due to missing ENDBR.  I had to enable CET in sljit to make
> > git working on CET enabled Tiger Lake.  So we need to enable CET in
> > JIT engine before enabling CET in applications which use JIT engine.
>
> This could presumably have been fixed by having libpcre or sljit
> disable IBT before calling into JIT code or by running the JIT code in
> another thread.  In the other direction, a non-CET libpcre build could
> build IBT-capable JITted code and enable JIT (by syscall if we allow
> that or by creating a thread?) when calling it.  And IBT has this

This is not how threads in user space work.

> fancy legacy bitmap to allow non-instrumented code to run with IBT on,
> although SHSTK doesn't have hardware support for a similar feature.

All these changes are called CET enabling.

> So, sure, the glibc-linked ELF ecosystem needs some degree of CET
> coordination, but it is absolutely not the case that a process MUST
> have all CET or no CET.  Let's please support the complicated cases in
> the kernel and the ABI too.  If glibc wants to make it annoying to do
> complicated things, so be it.  People work behind glibc's back all the
> time.

CET is no different from NX in this regard.
Andy Lutomirski March 10, 2020, 1:21 a.m. UTC | #22
I am baffled by this discussion.

>> On Mar 9, 2020, at 5:09 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> 
>> On Mon, Mar 9, 2020 at 4:59 PM Andy Lutomirski <luto@amacapital.net> wrote:
> 
>>>> .
>> This could presumably have been fixed by having libpcre or sljit
>> disable IBT before calling into JIT code or by running the JIT code in
>> another thread.  In the other direction, a non-CET libpcre build could
>> build IBT-capable JITted code and enable JIT (by syscall if we allow
>> that or by creating a thread?) when calling it.  And IBT has this
> 
> This is not how threads in user space work.

void create_cet_thread(void (*func)(), unsigned int cet_flags);

I could implement this using clone() if the kernel provides the requisite support. Sure, creating threads behind libc’s back like this is perilous, but it can be done.

> 
>> fancy legacy bitmap to allow non-instrumented code to run with IBT on,
>> although SHSTK doesn't have hardware support for a similar feature.
> 
> All these changes are called CET enabling.

What does that mean?  If program A loads library B, and library B very carefully loads CET-mismatched code, program A may be blissfully unaware.

> 
>> So, sure, the glibc-linked ELF ecosystem needs some degree of CET
>> coordination, but it is absolutely not the case that a process MUST
>> have all CET or no CET.  Let's please support the complicated cases in
>> the kernel and the ABI too.  If glibc wants to make it annoying to do
>> complicated things, so be it.  People work behind glibc's back all the
>> time.
> 
> CET is no different from NX in this regard.

NX is in the page tables, and CET, mostly, is not.  Also, we seriously flubbed READ_IMPLIES_EXEC and made it affect far more mappings than ever should have been affected.

If a legacy program (non-NX-aware) loads a newer library, and the library opens a device node and mmaps it PROT_READ, it gets RX.  This is not a good design. In fact, it’s actively problematic.

Let us please not take Linux’s NX legacy support as an example of good design.
H.J. Lu March 10, 2020, 2:13 a.m. UTC | #23
On Mon, Mar 9, 2020 at 6:21 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> I am baffled by this discussion.
>
> >> On Mar 9, 2020, at 5:09 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >>
> >> On Mon, Mar 9, 2020 at 4:59 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >>>> .
> >> This could presumably have been fixed by having libpcre or sljit
> >> disable IBT before calling into JIT code or by running the JIT code in
> >> another thread.  In the other direction, a non-CET libpcre build could
> >> build IBT-capable JITted code and enable JIT (by syscall if we allow
> >> that or by creating a thread?) when calling it.  And IBT has this
> >
> > > This is not how threads in user space work.
>
> void create_cet_thread(void (*func)(), unsigned int cet_flags);
>
> I could implement this using clone() if the kernel provides the requisite support. Sure, creating threads behind libc’s back like this is perilous, but it can be done.

Sure, this can live outside of libc with kernel support.

> >
> >> fancy legacy bitmap to allow non-instrumented code to run with IBT on,
> >> although SHSTK doesn't have hardware support for a similar feature.
> >
> > > All these changes are called CET enabling.
>
> What does that mean?  If program A loads library B, and library B very carefully loads CET-mismatched code, program A may be blissfully unaware.

Any source change that makes code CET compatible counts as CET enabling.

Shadow stack can't be turned on or off arbitrarily.  ld.so checks it and
makes sure that everything is consistent.  But this is entirely done in
user space.  In the first phase, we want to make CET simple, not too
complicated.


H.J.
Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ade4e6ec23e0..8b69ebf0baed 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3001,6 +3001,12 @@ 
 			noexec=on: enable non-executable mappings (default)
 			noexec=off: disable non-executable mappings
 
+	no_cet_shstk	[X86-64] Disable Shadow Stack for user-mode
+			applications
+
+	no_cet_ibt	[X86-64] Disable Indirect Branch Tracking for user-mode
+			applications
+
 	nosmap		[X86,PPC]
 			Disable SMAP (Supervisor Mode Access Prevention)
 			even if it is supported by processor.
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index a8de2fbc1caa..81f919801765 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -19,6 +19,7 @@  x86-specific Documentation
    tlb
    mtrr
    pat
+   intel_cet
    intel_mpx
    intel-iommu
    intel_txt
diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
new file mode 100644
index 000000000000..71e2462fea5c
--- /dev/null
+++ b/Documentation/x86/intel_cet.rst
@@ -0,0 +1,294 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Control-flow Enforcement Technology (CET)
+=========================================
+
+[1] Overview
+============
+
+Control-flow Enforcement Technology (CET) provides protection against
+return/jump-oriented programming (ROP/JOP) attacks.  It can be set up to
+protect both applications and the kernel.  In the first phase, only
+user-mode protection is implemented in the 64-bit kernel; 32-bit
+applications are supported in compatibility mode.
+
+CET introduces Shadow Stack (SHSTK) and Indirect Branch Tracking
+(IBT).  SHSTK is a secondary stack allocated from memory and cannot
+be directly modified by applications.  When executing a CALL, the
+processor pushes a copy of the return address to SHSTK.  Upon
+function return, the processor pops the SHSTK copy and compares it
+to the one from the program stack.  If the two copies differ, the
+processor raises a control-protection fault.  IBT verifies indirect
+CALL/JMP targets are intended as marked by the compiler with 'ENDBR'
+opcodes (see CET instructions below).
+
+There are two kernel configuration options:
+
+    X86_INTEL_SHADOW_STACK_USER, and
+    X86_INTEL_BRANCH_TRACKING_USER.
+
+To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or later
+are required.  To build a CET-enabled application, GLIBC v2.28 or
+later is also required.
+
+There are two kernel command-line options for disabling CET features::
+
+    no_cet_shstk - disables SHSTK, and
+    no_cet_ibt   - disables IBT.
+
+At run time, /proc/cpuinfo shows the availability of SHSTK and IBT.
+
+[2] CET assembly instructions
+=============================
+
+RDSSP %r
+    Read the SHSTK pointer into %r.
+
+INCSSP %r
+    Unwind (increment) the SHSTK pointer (0 ~ 255) steps as indicated
+    in the operand register.  The GLIBC longjmp uses INCSSP to unwind
+    the SHSTK until that matches the program stack.  When it is
+    necessary to unwind beyond 255 steps, longjmp divides and repeats
+    the process.
+
+RSTORSSP (%r)
+    Switch to the SHSTK indicated in the 'restore token' pointed to by
+    the operand register and replace the 'restore token' with a new
+    token to be saved (with SAVEPREVSSP) for the outgoing SHSTK.
+
+::
+
+                                Before RSTORSSP
+
+               Incoming SHSTK                 Current/Outgoing SHSTK
+
+          |----------------------|           |----------------------|
+   addr=x |                      |     ssp-> |                      |
+          |----------------------|           |----------------------|
+   (%r)-> | rstor_token=(x|Lg)   |  addr=y-8 |                      |
+          |----------------------|           |----------------------|
+
+                                After RSTORSSP
+
+          |----------------------|           |----------------------|
+   addr=x |                      |           |                      |
+          |----------------------|           |----------------------|
+    ssp-> | rstor_token=(y|Pv|Lg)|  addr=y-8 |                      |
+          |----------------------|           |----------------------|
+
+    note:
+        1. Only valid addresses and restore tokens can be on the
+           user-mode SHSTK.
+        2. A token is always of type u64 and must align to u64.
+        3. The incoming SHSTK pointer in a rstor_token must point to
+           immediately above the token.
+        4. 'Lg' is bit[0] of a rstor_token indicating a 64-bit SHSTK.
+        5. 'Pv' is bit[1] of a rstor_token indicating the token is to
+           be used only for the next SAVEPREVSSP and invalid for
+           RSTORSSP.
+
+SAVEPREVSSP
+    Pop the SHSTK 'restore token' pointed to by the current SHSTK pointer
+    and store it at (previous SHSTK pointer - 8).
+
+::
+
+                               After SAVEPREVSSP
+
+          |----------------------|           |----------------------|
+    ssp-> |                      |           |                      |
+          |----------------------|           |----------------------|
+ addr=x-8 | rstor_token=(y|Pv|Lg)|  addr=y-8 | rstor_token(y|Lg)    |
+          |----------------------|           |----------------------|
+
+WRUSS %r0, (%r1)
+    Write the value in %r0 to the SHSTK address pointed to by %r1.
+    This is a kernel-mode only instruction.
+
+ENDBR and NOTRACK prefix
+    When IBT is enabled, an indirect CALL/JMP must either::
+
+        have a NOTRACK prefix,
+        reach an ENDBR, or
+        reach an address within a legacy code page;
+
+    or it results in a control-protection fault.
+
+    When the target address is derived from information that cannot
+    be modified, the compiler uses the NOTRACK prefix.  In other
+    cases, the compiler inserts an ENDBR at the target address.
+
+    A legacy code page is designated in the legacy code bitmap, which
+    is explained below in section [8].
+
+[3] Application Enabling
+========================
+
+An application's CET capability is marked in its ELF .note.gnu.property
+section and can be verified from the following command output, in the
+NT_GNU_PROPERTY_TYPE_0 field::
+
+    readelf -n <application>
+
+If an application supports CET and is statically linked, it will run
+with CET protection.  If the application needs any shared libraries,
+the loader checks all dependencies and enables CET only when all
+requirements are met.
+
+[4] Legacy Libraries
+====================
+
+GLIBC provides a few tunables for backward compatibility.
+
+GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT
+    Turn off SHSTK/IBT for the current shell.
+
+GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
+    This controls how dlopen() handles SHSTK legacy libraries::
+
+        on         - continue with SHSTK enabled;
+        permissive - continue with SHSTK off.
+
+[5] CET system calls
+====================
+
+The following arch_prctl() system calls are added for CET:
+
+arch_prctl(ARCH_X86_CET_STATUS, unsigned long *addr)
+    Return CET feature status.
+
+    The parameter 'addr' is a pointer to a user buffer.
+    On returning to the caller, the kernel fills the following
+    information::
+
+        *addr       = SHSTK/IBT status
+        *(addr + 1) = SHSTK base address
+        *(addr + 2) = SHSTK size
+
+arch_prctl(ARCH_X86_CET_DISABLE, unsigned long features)
+    Disable SHSTK and/or IBT specified in 'features'.  Return -EPERM
+    if CET is locked.
+
+arch_prctl(ARCH_X86_CET_LOCK)
+    Lock in CET feature.
+
+arch_prctl(ARCH_X86_CET_ALLOC_SHSTK, unsigned long *addr)
+    Allocate a new SHSTK and put a restore token at top.
+
+    The parameter 'addr' is a pointer to a user buffer and indicates
+    the desired SHSTK size to allocate.  On returning to the caller,
+    the kernel fills '*addr' with the base address of the new SHSTK.
+
+arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE, unsigned long *addr)
+    Mark an address range as IBT legacy code.
+
+    The parameter 'addr' is a pointer to a user buffer that has the
+    following information::
+
+        *addr       = starting linear address of the legacy code
+        *(addr + 1) = size of the legacy code
+        *(addr + 2) = set (1); clear (0)
+
+Note:
+  There is no CET-enabling arch_prctl function.  By design, CET is
+  enabled automatically if the binary and the system can support it.
+
+  The parameters passed are always unsigned 64-bit.  When an IA32
+  application passes pointers, it should use only the lower 32 bits.
+
+[6] The implementation of the SHSTK
+===================================
+
+SHSTK size
+----------
+
+A task's SHSTK is allocated from memory to a fixed size of
+RLIMIT_STACK.  A compat-mode thread's SHSTK size is 1/4 of
+RLIMIT_STACK.  The smaller 32-bit thread SHSTK allows more threads to
+share a 32-bit address space.
+
+Signal
+------
+
+The main program and its signal handlers use the same SHSTK.  Because
+the SHSTK stores only return addresses, a large SHSTK will cover the
+condition that both the program stack and the sigaltstack run out.
+
+The kernel creates a restore token at the SHSTK restoring address and
+verifies that token when restoring from the signal handler.
+
+IBT for signal delivering and sigreturn is the same as the main
+program's setup; except for WAIT_ENDBR status, which can be read from
+MSR_IA32_U_CET.  In general, a task is in WAIT_ENDBR after an
+indirect CALL/JMP and before the next instruction starts.
+
+A task's WAIT_ENDBR is reset for its signal handler, but preserved on
+the task's stack; and then restored from sigreturn.
+
+Fork
+----
+
+The SHSTK's vma has the VM_SHSTK flag set; its PTEs are required to be
+read-only and dirty.  When a SHSTK PTE is not all of present, RO, and
+dirty, a SHSTK access triggers a page fault with an additional SHSTK
+bit set in the page fault error code.
+
+When a task forks a child, its SHSTK PTEs are copied and both the
+parent's and the child's SHSTK PTEs are cleared of the dirty bit.
+Upon the next SHSTK access, the resulting SHSTK page fault is handled
+by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new SHSTK for
+the new thread.
+
+Setjmp/Longjmp
+--------------
+
+Longjmp unwinds SHSTK until it matches the program stack.
+
+Ucontext
+--------
+
+In GLIBC, getcontext/setcontext is implemented in a similar way to
+setjmp/longjmp.
+
+When makecontext creates a new ucontext, a new SHSTK is allocated for
+that context with ARCH_X86_CET_ALLOC_SHSTK syscall.  The kernel
+creates a restore token at the top of the new SHSTK and the user-mode
+code switches to the new SHSTK with the RSTORSSP instruction.
+
+[7] The management of read-only & dirty PTEs for SHSTK
+======================================================
+
+A RO and dirty PTE exists in the following cases:
+
+(a) A page is modified and then shared with a fork()'ed child;
+(b) A R/O page that has been COW'ed;
+(c) A SHSTK page.
+
+The processor only checks the dirty bit for (c).  To prevent the use
+of non-SHSTK memory as SHSTK, we use a spare bit of the 64-bit PTE as
+DIRTY_SW for (a) and (b) above.  This results in the following PTE
+settings::
+
+    Modified PTE:             (R/W + DIRTY_HW)
+    Modified and shared PTE:  (R/O + DIRTY_SW)
+    R/O PTE, COW'ed:          (R/O + DIRTY_SW)
+    SHSTK PTE:                (R/O + DIRTY_HW)
+    SHSTK PTE, COW'ed:        (R/O + DIRTY_HW)
+    SHSTK PTE, shared:        (R/O + DIRTY_SW)
+
+Note that DIRTY_SW is only used in R/O PTEs but not R/W PTEs.
+
+[8] The implementation of IBT legacy bitmap
+===========================================
+
+When IBT is active, a non-IBT-capable legacy library can be executed
+if its address ranges are specified in the legacy code bitmap.  The
+bitmap covers the whole user address space, which is TASK_SIZE_MAX
+for 64-bit and TASK_SIZE for IA32, and each of its bits indicates a
+4-KB legacy code page.  It is read-only to the application, and is set
+up by the kernel as a special mapping the first time the application
+calls arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE).  The application
+manages the bitmap through the arch_prctl.