[00/35] Shadow stacks for userspace

Message ID	20220130211838.8382-1-rick.p.edgecombe@intel.com (mailing list archive)
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Rick Edgecombe <rick.p.edgecombe@intel.com> To: x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann <arnd@arndb.de>, Andy Lutomirski <luto@kernel.org>, Balbir Singh <bsingharora@gmail.com>, Borislav Petkov <bp@alien8.de>, Cyrill Gorcunov <gorcunov@gmail.com>, Dave Hansen <dave.hansen@linux.intel.com>, Eugene Syromiatnikov <esyr@redhat.com>, Florian Weimer <fweimer@redhat.com>, "H . J . Lu" <hjl.tools@gmail.com>, Jann Horn <jannh@google.com>, Jonathan Corbet <corbet@lwn.net>, Kees Cook <keescook@chromium.org>, Mike Kravetz <mike.kravetz@oracle.com>, Nadav Amit <nadav.amit@gmail.com>, Oleg Nesterov <oleg@redhat.com>, Pavel Machek <pavel@ucw.cz>, Peter Zijlstra <peterz@infradead.org>, Randy Dunlap <rdunlap@infradead.org>, "Ravi V . Shankar" <ravi.v.shankar@intel.com>, Dave Martin <Dave.Martin@arm.com>, Weijiang Yang <weijiang.yang@intel.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, joao.moreira@intel.com, John Allen <john.allen@amd.com>, kcc@google.com, eranian@google.com Cc: rick.p.edgecombe@intel.com Subject: [PATCH 00/35] Shadow stacks for userspace Date: Sun, 30 Jan 2022 13:18:03 -0800 Message-Id: <20220130211838.8382-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Shadow stacks for userspace \| expand [00/35] Shadow stacks for userspace [01/35] Documentation/x86: Add CET description [02/35] x86/cet/shstk: Add Kconfig option for Shadow Stack [03/35] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET) [04/35] x86/cpufeatures: Introduce CPU setup and option parsing for CET [05/35] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states [06/35] x86/cet: Add control-protection fault handler [07/35] x86/mm: Remove _PAGE_DIRTY from kernel RO pages [08/35] x86/mm: Move pmd_write(), pud_write() up in the file [09/35] x86/mm: Introduce _PAGE_COW [10/35] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS [11/35] x86/mm: Update pte_modify for _PAGE_COW [12/35] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRT… [13/35] mm: Move VM_UFFD_MINOR_BIT from 37 to 38 [14/35] mm: Introduce VM_SHADOW_STACK for shadow stack memory [15/35] x86/mm: Check Shadow Stack page fault errors [16/35] x86/mm: Update maybe_mkwrite() for shadow stack [17/35] mm: Fixup places that call pte_mkwrite() directly [18/35] mm: Add guard pages around a shadow stack. [19/35] mm/mmap: Add shadow stack pages to memory accounting [20/35] mm: Update can_follow_write_pte() for shadow stack [21/35] mm/mprotect: Exclude shadow stack from preserve_write [22/35] x86/mm: Prevent VM_WRITE shadow stacks [23/35] x86/fpu: Add helpers for modifying supervisor xstate [24/35] mm: Re-introduce vm_flags to do_mmap() [25/35] x86/cet/shstk: Add user-mode shadow stack support [26/35] x86/process: Change copy_thread() argument 'arg' to 'stack_size' [27/35] x86/fpu: Add unsafe xsave buffer helpers [28/35] x86/cet/shstk: Handle thread shadow stack [29/35] x86/cet/shstk: Introduce shadow stack token setup/verify routines [30/35] x86/cet/shstk: Handle signals for shadow stack [31/35] x86/cet/shstk: Add arch_prctl elf feature functions [32/35] x86/cet/shstk: Introduce map_shadow_stack syscall [33/35] selftests/x86: Add map_shadow_stack syscall test [34/35] x86/cet/shstk: Support wrss for userspace [35/35] x86/cpufeatures: Limit shadow stack to Intel CPUs

Edgecombe, Rick P Jan. 30, 2022, 9:18 p.m. UTC

Hi,

This is a slight reboot of the userspace CET series. I will be taking over the 
series from Yu-cheng. Per some internal recommendations, I’ve reset the version
number and am calling it a new series. Hopefully, it doesn’t cause confusion.

The new plan is to upstream only userspace Shadow Stack support at this point. 
IBT can follow later, but for now I’ll focus solely on the most in-demand and
widely available (with the feature on AMD CPUs now) part of CET.

I thought as part of this reset, it might be useful to more fully write-up the 
design and summarize the history of the previous CET series. So this slightly
long cover letter does that. The "Updates" section has the changes, if anyone
doesn't want the history.


Why is Shadow Stack Wanted
==========================
The main use case for userspace shadow stack is providing protection against 
return oriented programming attacks. Fedora and Ubuntu already have many/most 
packages enabled for shadow stack. The main missing piece is Linux kernel 
support and there seems to be a high amount of interest in the ecosystem for
getting this feature supported. Besides security, Google has also done some
work on using shadow stack to improve performance and reliability of tracing.


Userspace Shadow Stack Implementation
=====================================
Shadow stack works by maintaining a secondary (shadow) stack that cannot be 
directly modified by applications. When executing a CALL instruction, the 
processor pushes the return address to both the normal stack and to the special 
permissioned shadow stack. Upon ret, the processor pops the shadow stack copy 
and compares it to the normal stack copy. If the two differ, the processor 
raises a control protection fault. This implementation supports shadow stack on 
64 bit kernels only, with support for 32 bit only via IA32 emulation.

	Shadow Stack Memory
	-------------------
	The majority of this series deals with changes for handling the special 
	shadow stack memory permissions. This memory is specified by the 
	Dirty+RO PTE bits. A tricky aspect of this is that this combination was 
	previously used to specify COW memory. So Linux needs to handle COW 
	differently when shadow stack is in use. The solution is to use a 
	software PTE bit to denote COW memory, and take care to clear the dirty
	bit when setting the memory RO.

	Setup and Upkeep of HW Registers
	--------------------------------
	Using userspace CET requires a CR4 bit set, and also the manipulation 
	of two xsave managed MSRs. The kernel needs to modify these registers 
	during various operations like clone and signal handling. These 
	operations may happen when the registers are restored to the CPU, or 
	saved in an xsave buffer. Since the recent AMX triggered FPU overhaul 
	removed direct access to the xsave buffer, this series adds an 
	interface to operate on the supervisor xstate.

	New ABIs
	--------
	This series introduces some new ABIs. The primary one is the shadow 
	stack itself. Since it is readable and the shadow stack pointer is 
	exposed to user space, applications can easily read and process the 
	shadow stack. And in fact the tracing usages plan to do exactly that.

	Most of the shadow stack contents are written by HW, but some of the 
	entries are added by the kernel. The main place for this is signals. As 
	part of handling the signal the kernel does some manual adjustment of 
	the shadow stack that userspace depends on.

	In addition to the contents of the shadow stack there is also user 
	visible behavior around when new shadow stacks are created and set in 
	the shadow stack pointer (SSP) register. This is relatively 
	straightforward – shadow stacks are created when new stacks are created 
	(thread creation, fork, etc). It is more or less what is required to 
	keep apps working.

	For situations when userspace creates a new stack (i.e. makecontext(), 
	fibers, etc), a new syscall is provided for creating shadow stack 
	memory. To make the shadow stack usable, it needs to have a restore 
	token written to the protected memory. So the syscall provides a way to 
	specificity this should be done by the kernel.

	When a shadow stack violation happens (when the return address of stack 
	not matching return address in shadow stack), a segfault is generated 
	with a new si_code specific to CET violations.

	Lastly, a new arch_prctl interface is created for controlling the 
	enablement of CET-like features. It is intended to also be used for 
	LAM. It operates on the feature status per-thread, so for process wide 
	enabling it is intended to be used early in things like dynamic 
	linker/loaders. However, it can be used later for per-thread enablement 
	of features like WRSS.

	WRSS
	----
	WRSS is an instruction that can write to shadow stacks. The HW provides 
	a way to enable this instruction for userspace use. Since shadow 
	stack’s are created initially protected, enabling WRSS allows any apps 
	that want to do unusual things with their stacks to have a way to 
	weaken protection and make things more flexible. A new feature bit is 
	defined to control enabling/disabling of WRSS.


History
=======
The branding “CET” really consists of two features: “Shadow Stack” and 
“Indirect Branch Tracking”. They both restrict previously allowed, but rarely 
valid behaviors and require userspace to change to avoid these behaviors before 
enabling the protection. These raw HW features need to be assembled into a 
software solution across userspace and kernel in order to add security value.
The kernel part of this solution has evolved iteratively starting with a lengthy
RFC period. 

Until now, the enabling effort was trying to support both Shadow Stack and IBT. 
This history will focus on a few areas of the shadow stack development history 
that I thought stood out.

	Signals
	-------
	Originally signals placed the location of the shadow stack restore 
	token inside the saved state on the stack. This was problematic from a 
	past ABI promises perspective. So the restore location was instead just 
	assumed from the shadow stack pointer. This works because in normal 
	allowed cases of calling sigreturn, the shadow stack pointer should be 
	right at the restore token at that time. There is no alternate shadow 
	stack support. If an alt shadow stack is added later we would need to 
	find a place to store the regular shadow stack token location. Options 
	could be to push something on the alt shadow stack, or to keep 
	something on the kernel side. So the current design keeps things simple 
	while slightly kicking the can down the road if alt shadow stacks 
	become a thing later. Siglongjmp is handled in glibc, using the incssp 
	instruction to unwind the shadow stack over the token.

	Shadow Stack Allocation
	-----------------------
	makecontext() implementations need a way to create new shadow stacks 
	with restore token’s such that they can be pivoted to from userspace. 
	The first interface to do this was an arch_prctl(). It created a shadow 
	stack with a restore token pre-setup, since the kernel has an 
	instruction that can write to user shadow stacks. However, this 
	interface was abandoned for being strange.

	The next version created PROT_SHADOW_STACK. This interface had two 
	problems. One, it left no options but for userspace to create writable 
	memory, write a restore token, then mproctect() it PROT_SHADOW_STACK. 
	The writable window left the shadow stack exposed, weakening the 
	security. Second, it caused problems with the guard pages. Since the 
	memory was initially created writable it did not have a guard page, but 
	then was mprotected later to a type of memory that should have one. 
	This resulted in missing guard pages and confused rb_subtree_gap’s.

	This version introduces a new syscall that behaves similarly to the 
	initial arch_prctl() interface in that it has the kernel write the 
	restore token.

	Enabling Interface
	------------------
	For the entire history of the original CET series, the design was to 
	enable shadow stack automatically if the feature bit was detected in 
	the elf header. Then it was userspace’s responsibility to turn it off 
	via an arch_prctl() if it was not desired, and this was handled by the 
	glibc dynamic loader. Glibc’s standard behavior (when CET if configured 
	is to leave shadow stack enabled if the executable and all linked 
	libraries are marked with shadow stacks.

	Many distros (Fedora and others) have binaries already marked with 
	shadow stack, waiting for kernel support. Unfortunately their glibc 
	binaries expect the original arch_prctl() interface for allocating 
	shadow stacks, as those changes were pushed ahead of kernel support. 
	The net result of it all is, when updating to a kernel with shadow 
	stack these binaries would suddenly get shadow stack enabled and expect 
	the arch_prctl() interface to be there. And so calls to makecontext() 
	will fail, resulting in visible breakages. This series deals with this 
	problem as described below in "Updates".


Updates
=======
These updates were mostly driven by public comments, but a lot of the design 
elements are new. I would like some extra scrutiny on the updates.

	New syscall for Shadow Stack Allocation
	---------------------------------------
	A new syscall is added for allocating shadow stacks to replace 
	PROT_SHADOW_STACK. Several options were considered, as described in the 
	“x86/cet/shstk: Introduce map_shadow_stack syscall”.

	Xsave Managed Supervisor State Modifications
	--------------------------------------------
	The shadow stack feature requires the kernel to modify xsaves managed 
	state. On one of the last versions of Yu-cheng’s series Boris had 
	commented on the pattern it was using to do this not necessarily being 
	ideal. The pattern was to force a restore to the registers and always 
	do the modification there. Then Thomas did an overhaul of the fpu code, 
	part of which consisted of making raw access to the xsave buffer 
	private to the fpu code. So this series tries to expose access again, 
	and in a way that addresses Boris’ comments.

	The method is to provide functions like wmsrl/rdmsrl, but that can 
	direct the operation to the correct location (registers or buffer), 
	while giving the proper notice to the fpu subsystem so things don’t get 
	clobbered or corrupted.

	In the past a solution like this was discussed as part of the PASID 
	series, and Thomas was not in favor. In CET’s case there is a more 
	logic around the CET MSR’s than in PASID's, and wrapping this logic 
	minimizes near identical open coded logic needed to do this more 
	efficiently. In addition it resolves the above described problem of 
	having no access to the xsave buffer. So it is being put forward here 
	under the supposition that CET’s usage may lead to a different 
	conclusion, not to try to ignore past direction.

	The user interrupt series has similar needs as CET, and will also use
	this internal interface if it’s found acceptable.

	Support for WRSS
	----------------
	Andy Lutomirski had asked if we change the shadow stack allocation API 
	such that userspace cannot create arbitrary shadow stacks, then we look 
	at exposing an interface to enable the WRSS instruction for userspace. 
	This way app’s that want to do unexpected things with shadow stacks 
	would still have the option to create shadow stacks with arbitrary 
	data.

	Switch Enabling Interface
	-------------------------
	As described above there is a problem with userspace binaries waiting 
	to break as soon as the kernel supports CET. This needs to be prevented 
	by changing the interface such that the old binaries will not enable 
	shadow stack AND behave as if shadow stack is not enabled. They should 
	run normally without shadow stack protection. Creating a new feature 
	(SHSTK2) for shadow stack was explored. SHSTK would never be supported 
	by the kernel, and all the userspace build tools would be updated to 
	target SHSTK2 instead of SHSTK. So old SHSTK binaries would be cleanly
	disabled.

	But there are existing downsides to automatic elf header processing 
	based enabling. The elf header feature spec is not defined by the 
	kernel and there are proposals to expand it to describe additional 
	logic. A simpler interface where the kernel is simply told what to 
	enable, and leaves all the decision making to userspace, is more 
	flexible for userspace and simpler for the kernel. There also already 
	needs to be an ARCH_X86_FEATURE_ENABLE arch_prctl() for WRSS (and 
	likely LAM will use it too), so it avoids there being two ways to turn 
	on these types of features. The only tricky part for shadow stack, is 
	that it has to be enabled very early. Wherever the shadow stack is 
	enabled, the app cannot return from that point, otherwise there will be 
	a shadow stack violation. It turns out glibc can enable shadow stack 
	this early, so it works nicely. So not automatically enabling any 
	features in the elf header will cleanly disable all old binaries, which 
	expect the kernel to enable CET features automatically. Then after the 
	kernel changes are upstream, glibc can be updated to use the new
	interface. This is the solution implemented in this series.

	Expand Commit Logs
	------------------
	As part of spinning up on this series, I found some of the commit logs 
	did not describe the changes in enough detail for me understand their 
	purpose. I tried to expand the logs and comments, where I had to go 
	digging. Hopefully it’s useful.
	
	Limit to only Intel Processors
	------------------------------
	Shadow stack is supported on some AMD processors, but this revision 
	(with expanded HW usage and xsaves changes) has only has been tested on 
	Intel ones. So this series has a patch to limit shadow stack support to 
	Intel processors. Ideally the patch would not even make it to mainline, 
	and should be dropped as soon as this testing is done. It's included 
	just in case.


Future Work
===========
Even though this is now exclusively a shadow stack series, there is still some 
remaining shadow stack work to be done.

	Ptrace
	------
	Early in the series, there was a patch to allow IA32_U_CET and
	IA32_PL3_SSP to be set. This patch was dropped and planned as a follow
	up to basic support, and it remains the plan. It will be needed for
	in-progress gdb support.

	CRIU Support
	------------
	In the past there was some speculation on the mailing list about 
	whether CRIU would need to be taught about CET. It turns out, it does. 
	The first issue hit is that CRIU calls sigreturn directly from its 
	“parasite code” that it injects into the dumper process. This violates
	this shadow stack implementation’s protection that intends to prevent
	attackers from doing this.

	With so many packages already enabled with shadow stack, there is 
	probably desire to make it work seamlessly. But in the meantime if 
	distros want to support shadow stack and CRIU, users could manually 
	disabled shadow stack via “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for 
	a process they will wants to dump. It’s not ideal.

	I’d like to hear what people think about having shadow stack in the 
	kernel without this resolved. Nothing would change for any users until 
	they enable shadow stack in the kernel and update to a glibc configured
	with CET. Should CRIU userspace be solved before kernel support?

	Selftests
	---------
	There are some CET selftests being worked on and they are not included
	here.

Thanks,

Rick

Rick Edgecombe (7):
  x86/mm: Prevent VM_WRITE shadow stacks
  x86/fpu: Add helpers for modifying supervisor xstate
  x86/fpu: Add unsafe xsave buffer helpers
  x86/cet/shstk: Introduce map_shadow_stack syscall
  selftests/x86: Add map_shadow_stack syscall test
  x86/cet/shstk: Support wrss for userspace
  x86/cpufeatures: Limit shadow stack to Intel CPUs

Yu-cheng Yu (28):
  Documentation/x86: Add CET description
  x86/cet/shstk: Add Kconfig option for Shadow Stack
  x86/cpufeatures: Add CET CPU feature flags for Control-flow
    Enforcement Technology (CET)
  x86/cpufeatures: Introduce CPU setup and option parsing for CET
  x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  x86/cet: Add control-protection fault handler
  x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  x86/mm: Move pmd_write(), pud_write() up in the file
  x86/mm: Introduce _PAGE_COW
  drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
  x86/mm: Update pte_modify for _PAGE_COW
  x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
    transition from _PAGE_DIRTY to _PAGE_COW
  mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  mm: Introduce VM_SHADOW_STACK for shadow stack memory
  x86/mm: Check Shadow Stack page fault errors
  x86/mm: Update maybe_mkwrite() for shadow stack
  mm: Fixup places that call pte_mkwrite() directly
  mm: Add guard pages around a shadow stack.
  mm/mmap: Add shadow stack pages to memory accounting
  mm: Update can_follow_write_pte() for shadow stack
  mm/mprotect: Exclude shadow stack from preserve_write
  mm: Re-introduce vm_flags to do_mmap()
  x86/cet/shstk: Add user-mode shadow stack support
  x86/process: Change copy_thread() argument 'arg' to 'stack_size'
  x86/cet/shstk: Handle thread shadow stack
  x86/cet/shstk: Introduce shadow stack token setup/verify routines
  x86/cet/shstk: Handle signals for shadow stack
  x86/cet/shstk: Add arch_prctl elf feature functions

 .../admin-guide/kernel-parameters.txt         |   4 +
 Documentation/filesystems/proc.rst            |   1 +
 Documentation/x86/cet.rst                     | 145 ++++++
 Documentation/x86/index.rst                   |   1 +
 arch/arm/kernel/signal.c                      |   2 +-
 arch/arm64/kernel/signal.c                    |   2 +-
 arch/arm64/kernel/signal32.c                  |   2 +-
 arch/sparc/kernel/signal32.c                  |   2 +-
 arch/sparc/kernel/signal_64.c                 |   2 +-
 arch/x86/Kconfig                              |  22 +
 arch/x86/Kconfig.assembler                    |   5 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/x86/ia32/ia32_signal.c                   |  25 +-
 arch/x86/include/asm/cet.h                    |  54 +++
 arch/x86/include/asm/cpufeatures.h            |   1 +
 arch/x86/include/asm/disabled-features.h      |   8 +-
 arch/x86/include/asm/fpu/api.h                |   8 +
 arch/x86/include/asm/fpu/types.h              |  23 +-
 arch/x86/include/asm/fpu/xstate.h             |   6 +-
 arch/x86/include/asm/idtentry.h               |   4 +
 arch/x86/include/asm/mman.h                   |  24 +
 arch/x86/include/asm/mmu_context.h            |   2 +
 arch/x86/include/asm/msr-index.h              |  20 +
 arch/x86/include/asm/page_types.h             |   7 +
 arch/x86/include/asm/pgtable.h                | 302 ++++++++++--
 arch/x86/include/asm/pgtable_types.h          |  48 +-
 arch/x86/include/asm/processor.h              |   6 +
 arch/x86/include/asm/special_insns.h          |  30 ++
 arch/x86/include/asm/trap_pf.h                |   2 +
 arch/x86/include/uapi/asm/mman.h              |   8 +-
 arch/x86/include/uapi/asm/prctl.h             |  10 +
 arch/x86/include/uapi/asm/processor-flags.h   |   2 +
 arch/x86/kernel/Makefile                      |   1 +
 arch/x86/kernel/cpu/common.c                  |  20 +
 arch/x86/kernel/cpu/cpuid-deps.c              |   1 +
 arch/x86/kernel/elf_feature_prctl.c           |  72 +++
 arch/x86/kernel/fpu/xstate.c                  | 167 ++++++-
 arch/x86/kernel/idt.c                         |   4 +
 arch/x86/kernel/process.c                     |  17 +-
 arch/x86/kernel/process_64.c                  |   2 +
 arch/x86/kernel/shstk.c                       | 446 ++++++++++++++++++
 arch/x86/kernel/signal.c                      |  13 +
 arch/x86/kernel/signal_compat.c               |   2 +-
 arch/x86/kernel/traps.c                       |  62 +++
 arch/x86/mm/fault.c                           |  19 +
 arch/x86/mm/mmap.c                            |  48 ++
 arch/x86/mm/pat/set_memory.c                  |   2 +-
 arch/x86/mm/pgtable.c                         |  25 +
 drivers/gpu/drm/i915/gvt/gtt.c                |   2 +-
 fs/aio.c                                      |   2 +-
 fs/proc/task_mmu.c                            |   3 +
 include/linux/mm.h                            |  19 +-
 include/linux/pgtable.h                       |   8 +
 include/linux/syscalls.h                      |   1 +
 include/uapi/asm-generic/siginfo.h            |   3 +-
 include/uapi/asm-generic/unistd.h             |   2 +-
 ipc/shm.c                                     |   2 +-
 kernel/sys_ni.c                               |   1 +
 mm/gup.c                                      |  16 +-
 mm/huge_memory.c                              |  27 +-
 mm/memory.c                                   |   5 +-
 mm/migrate.c                                  |   3 +-
 mm/mmap.c                                     |  15 +-
 mm/mprotect.c                                 |   9 +-
 mm/nommu.c                                    |   4 +-
 mm/util.c                                     |   2 +-
 tools/testing/selftests/x86/Makefile          |   9 +-
 .../selftests/x86/test_map_shadow_stack.c     |  75 +++
 69 files changed, 1797 insertions(+), 92 deletions(-)
 create mode 100644 Documentation/x86/cet.rst
 create mode 100644 arch/x86/include/asm/cet.h
 create mode 100644 arch/x86/include/asm/mman.h
 create mode 100644 arch/x86/kernel/elf_feature_prctl.c
 create mode 100644 arch/x86/kernel/shstk.c
 create mode 100644 tools/testing/selftests/x86/test_map_shadow_stack.c


base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07

Thomas Gleixner Feb. 3, 2022, 9:07 p.m. UTC | #1

Rick,

On Sun, Jan 30 2022 at 13:18, Rick Edgecombe wrote:
> This is a slight reboot of the userspace CET series. I will be taking over the 
> series from Yu-cheng. Per some internal recommendations, I’ve reset the version
> number and am calling it a new series. Hopefully, it doesn’t cause
> confusion.

That's fine as it seems to be a major change in course, so a reset to V1
is justified. Don't worry about confusion, we can easily confuse ourself
with minor things than that version reset :)

> The new plan is to upstream only userspace Shadow Stack support at this point. 
> IBT can follow later, but for now I’ll focus solely on the most in-demand and
> widely available (with the feature on AMD CPUs now) part of CET.

We just have to keep in IBT mind so that we don't add roadblocks which
we regret some time later.

> I thought as part of this reset, it might be useful to more fully write-up the 
> design and summarize the history of the previous CET series. So this slightly
> long cover letter does that. The "Updates" section has the changes, if anyone
> doesn't want the history.

Thanks for that lengthy writeup. It's appreciated. There is too much
confusion already so a coherent summary is helpful.

> Why is Shadow Stack Wanted
> ==========================
> The main use case for userspace shadow stack is providing protection against 
> return oriented programming attacks. Fedora and Ubuntu already have many/most 
> packages enabled for shadow stack.

Which is unfortunately part of the overall problem ...

> History
> =======
> The branding “CET” really consists of two features: “Shadow Stack” and 
> “Indirect Branch Tracking”. They both restrict previously allowed, but rarely 
> valid behaviors and require userspace to change to avoid these behaviors before 
> enabling the protection. These raw HW features need to be assembled into a 
> software solution across userspace and kernel in order to add security value.
> The kernel part of this solution has evolved iteratively starting with a lengthy
> RFC period. 
>
> Until now, the enabling effort was trying to support both Shadow Stack and IBT. 
> This history will focus on a few areas of the shadow stack development history 
> that I thought stood out.
>
> 	Signals
> 	-------
> 	Originally signals placed the location of the shadow stack restore 
> 	token inside the saved state on the stack. This was problematic from a 
> 	past ABI promises perspective. So the restore location was instead just 
> 	assumed from the shadow stack pointer. This works because in normal 
> 	allowed cases of calling sigreturn, the shadow stack pointer should be 
> 	right at the restore token at that time. There is no alternate shadow 
> 	stack support. If an alt shadow stack is added later we would
> 	need to

So how is that going to work? altstack is not an esoteric corner case.

> 	Enabling Interface
> 	------------------
> 	For the entire history of the original CET series, the design was to 
> 	enable shadow stack automatically if the feature bit was detected in 
> 	the elf header. Then it was userspace’s responsibility to turn it off 
> 	via an arch_prctl() if it was not desired, and this was handled by the 
> 	glibc dynamic loader. Glibc’s standard behavior (when CET if configured 
> 	is to leave shadow stack enabled if the executable and all linked 
> 	libraries are marked with shadow stacks.
>
> 	Many distros (Fedora and others) have binaries already marked with 
> 	shadow stack, waiting for kernel support. Unfortunately their glibc 
> 	binaries expect the original arch_prctl() interface for allocating 
> 	shadow stacks, as those changes were pushed ahead of kernel support. 
> 	The net result of it all is, when updating to a kernel with shadow 
> 	stack these binaries would suddenly get shadow stack enabled and expect 
> 	the arch_prctl() interface to be there. And so calls to makecontext() 
> 	will fail, resulting in visible breakages. This series deals with this 
> 	problem as described below in "Updates".

I'm really impressed by the well thought out coordination on the glibc and
distro side. Designed by committee never worked ...

> Updates
> =======
> These updates were mostly driven by public comments, but a lot of the design 
> elements are new. I would like some extra scrutiny on the updates.
>
> 	New syscall for Shadow Stack Allocation
> 	---------------------------------------
> 	A new syscall is added for allocating shadow stacks to replace 
> 	PROT_SHADOW_STACK. Several options were considered, as described in the 
> 	“x86/cet/shstk: Introduce map_shadow_stack syscall”.
>
> 	Xsave Managed Supervisor State Modifications
> 	--------------------------------------------
> 	The shadow stack feature requires the kernel to modify xsaves managed 
> 	state. On one of the last versions of Yu-cheng’s series Boris had 
> 	commented on the pattern it was using to do this not necessarily being 
> 	ideal. The pattern was to force a restore to the registers and always 
> 	do the modification there. Then Thomas did an overhaul of the fpu code, 
> 	part of which consisted of making raw access to the xsave buffer 
> 	private to the fpu code. So this series tries to expose access again, 
> 	and in a way that addresses Boris’ comments.
>
> 	The method is to provide functions like wmsrl/rdmsrl, but that can 
> 	direct the operation to the correct location (registers or buffer), 
> 	while giving the proper notice to the fpu subsystem so things don’t get 
> 	clobbered or corrupted.
>
> 	In the past a solution like this was discussed as part of the PASID 
> 	series, and Thomas was not in favor. In CET’s case there is a more 
> 	logic around the CET MSR’s than in PASID's, and wrapping this logic 
> 	minimizes near identical open coded logic needed to do this more 
> 	efficiently. In addition it resolves the above described problem of 
> 	having no access to the xsave buffer. So it is being put forward here 
> 	under the supposition that CET’s usage may lead to a different 
> 	conclusion, not to try to ignore past direction.
>
> 	The user interrupt series has similar needs as CET, and will also use
> 	this internal interface if it’s found acceptable.

I'll have a look.

> 	Switch Enabling Interface
> 	-------------------------
> 	But there are existing downsides to automatic elf header processing 
> 	based enabling. The elf header feature spec is not defined by the 
> 	kernel and there are proposals to expand it to describe additional 
> 	logic. A simpler interface where the kernel is simply told what to 
> 	enable, and leaves all the decision making to userspace, is more 
> 	flexible for userspace and simpler for the kernel. There also already 
> 	needs to be an ARCH_X86_FEATURE_ENABLE arch_prctl() for WRSS (and 
> 	likely LAM will use it too), so it avoids there being two ways to turn 
> 	on these types of features. The only tricky part for shadow stack, is 
> 	that it has to be enabled very early. Wherever the shadow stack is 
> 	enabled, the app cannot return from that point, otherwise there will be 
> 	a shadow stack violation. It turns out glibc can enable shadow stack 
> 	this early, so it works nicely. So not automatically enabling any 
> 	features in the elf header will cleanly disable all old binaries, which 
> 	expect the kernel to enable CET features automatically. Then after the 
> 	kernel changes are upstream, glibc can be updated to use the new
> 	interface. This is the solution implemented in this series.

Makes sense.

> 	Expand Commit Logs
> 	------------------
> 	As part of spinning up on this series, I found some of the commit logs 
> 	did not describe the changes in enough detail for me understand their 
> 	purpose. I tried to expand the logs and comments, where I had to go 
> 	digging. Hopefully it’s useful.

Proper changelogs are always appreciated.
	
> 	Limit to only Intel Processors
> 	------------------------------
> 	Shadow stack is supported on some AMD processors, but this revision 
> 	(with expanded HW usage and xsaves changes) has only has been tested on 
> 	Intel ones. So this series has a patch to limit shadow stack support to 
> 	Intel processors. Ideally the patch would not even make it to mainline, 
> 	and should be dropped as soon as this testing is done. It's included 
> 	just in case.

Ha. I can give you access to an AMD machine with CET SS supported :)

> Future Work
> ===========
> Even though this is now exclusively a shadow stack series, there is still some 
> remaining shadow stack work to be done.
>
> 	Ptrace
> 	------
> 	Early in the series, there was a patch to allow IA32_U_CET and
> 	IA32_PL3_SSP to be set. This patch was dropped and planned as a follow
> 	up to basic support, and it remains the plan. It will be needed for
> 	in-progress gdb support.

It's pretty much a prerequisite for enabling it, right?

> 	CRIU Support
> 	------------
> 	In the past there was some speculation on the mailing list about 
> 	whether CRIU would need to be taught about CET. It turns out, it does. 
> 	The first issue hit is that CRIU calls sigreturn directly from its 
> 	“parasite code” that it injects into the dumper process. This violates
> 	this shadow stack implementation’s protection that intends to prevent
> 	attackers from doing this.
>
> 	With so many packages already enabled with shadow stack, there is 
> 	probably desire to make it work seamlessly. But in the meantime if 
> 	distros want to support shadow stack and CRIU, users could manually 
> 	disabled shadow stack via “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for 
> 	a process they will wants to dump. It’s not ideal.
>
> 	I’d like to hear what people think about having shadow stack in the 
> 	kernel without this resolved. Nothing would change for any users until 
> 	they enable shadow stack in the kernel and update to a glibc configured
> 	with CET. Should CRIU userspace be solved before kernel support?

Definitely yes. Making CRIU users add a glibc tunable is not really an
option. We can't break CRIU systems with a kernel upgrade.

Thanks,

        tglx

Edgecombe, Rick P Feb. 4, 2022, 1:08 a.m. UTC | #2

Hi Thomas,

Thanks for feedback on the plan.

On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > Until now, the enabling effort was trying to support both Shadow
> > Stack and IBT. 
> > This history will focus on a few areas of the shadow stack
> > development history 
> > that I thought stood out.
> > 
> >        Signals
> >        -------
> >        Originally signals placed the location of the shadow stack
> > restore 
> >        token inside the saved state on the stack. This was
> > problematic from a 
> >        past ABI promises perspective. So the restore location was
> > instead just 
> >        assumed from the shadow stack pointer. This works because in
> > normal 
> >        allowed cases of calling sigreturn, the shadow stack pointer
> > should be 
> >        right at the restore token at that time. There is no
> > alternate shadow 
> >        stack support. If an alt shadow stack is added later we
> > would
> >        need to
> 
> So how is that going to work? altstack is not an esoteric corner
> case.

My understanding is that the main usages for the signal stack were
handling stack overflows and corruption. Since the shadow stack only
contains return addresses rather than large stack allocations, and is
not generally writable or pivotable, I thought there was a good
possibility an alt shadow stack would not end up being especially
useful. Does it seem like reasonable guesswork?

If it does seems likely, I'll give it some more thought than that hand
wavy plan.

>         
> >        Limit to only Intel Processors
> >        ------------------------------
> >        Shadow stack is supported on some AMD processors, but this
> > revision 
> >        (with expanded HW usage and xsaves changes) has only has
> > been tested on 
> >        Intel ones. So this series has a patch to limit shadow stack
> > support to 
> >        Intel processors. Ideally the patch would not even make it
> > to mainline, 
> >        and should be dropped as soon as this testing is done. It's
> > included 
> >        just in case.
> 
> Ha. I can give you access to an AMD machine with CET SS supported :)

Thanks for the offer. It sounds like John Allen can do this testing.

> 
> > Future Work
> > ===========
> > Even though this is now exclusively a shadow stack series, there is
> > still some 
> > remaining shadow stack work to be done.
> > 
> >        Ptrace
> >        ------
> >        Early in the series, there was a patch to allow IA32_U_CET
> > and
> >        IA32_PL3_SSP to be set. This patch was dropped and planned
> > as a follow
> >        up to basic support, and it remains the plan. It will be
> > needed for
> >        in-progress gdb support.
> 
> It's pretty much a prerequisite for enabling it, right?

Yes.

> 
> >        CRIU Support
> >        ------------
> >        In the past there was some speculation on the mailing list
> > about 
> >        whether CRIU would need to be taught about CET. It turns
> > out, it does. 
> >        The first issue hit is that CRIU calls sigreturn directly
> > from its 
> >        “parasite code” that it injects into the dumper process.
> > This violates
> >        this shadow stack implementation’s protection that intends
> > to prevent
> >        attackers from doing this.
> > 
> >        With so many packages already enabled with shadow stack,
> > there is 
> >        probably desire to make it work seamlessly. But in the
> > meantime if 
> >        distros want to support shadow stack and CRIU, users could
> > manually 
> >        disabled shadow stack via
> > “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for 
> >        a process they will wants to dump. It’s not ideal.
> > 
> >        I’d like to hear what people think about having shadow stack
> > in the 
> >        kernel without this resolved. Nothing would change for any
> > users until 
> >        they enable shadow stack in the kernel and update to a glibc
> > configured
> >        with CET. Should CRIU userspace be solved before kernel
> > support?
> 
> Definitely yes. Making CRIU users add a glibc tunable is not really
> an
> option. We can't break CRIU systems with a kernel upgrade.

Ok got it, thanks. Just to be clear though, existing distros/binaries
out there will not have shadow stack enabled with just an updated
kernel (due to the enabling changes). So the CRIU tools would only
break after future glibc binaries enable CET, which users/distros would
have to do specifically (glibc doesn't even enable cet by default).

Since the purpose of this feature is to restrict previously allowed
behaviors, and it’s apparently getting enabled by default in some
distro's packages, I guess there is a decent chance that once a system
is updated with a future glibc some app somewhere will break. I was
under the impression that as long as there were no breakages under a
current set of binaries (including glibc), this was not considered a
kernel regression. Please correct me if this is wrong. I think there
are other options if we want to make this softer.

Of course none of that prevents known breakages from being fixed for
normal reasons and I’ll look into that for CRIU.

Andy Lutomirski Feb. 4, 2022, 5:20 a.m. UTC | #3

On Thu, Feb 3, 2022, at 5:08 PM, Edgecombe, Rick P wrote:
> Hi Thomas,

>> >        Signals
>> >        -------
>> >        Originally signals placed the location of the shadow stack
>> > restore 
>> >        token inside the saved state on the stack. This was
>> > problematic from a 
>> >        past ABI promises perspective.

What was the actual problem?

>> > So the restore location was
>> > instead just 
>> >        assumed from the shadow stack pointer. This works because in
>> > normal 
>> >        allowed cases of calling sigreturn, the shadow stack pointer
>> > should be 
>> >        right at the restore token at that time. There is no
>> > alternate shadow 
>> >        stack support. If an alt shadow stack is added later we
>> > would
>> >        need to
>> 
>> So how is that going to work? altstack is not an esoteric corner
>> case.
>
> My understanding is that the main usages for the signal stack were
> handling stack overflows and corruption. Since the shadow stack only
> contains return addresses rather than large stack allocations, and is
> not generally writable or pivotable, I thought there was a good
> possibility an alt shadow stack would not end up being especially
> useful. Does it seem like reasonable guesswork?

It's also used for things like DOSEMU that execute in a weird context and then trap back out to the outer program using a signal handler and an altstack.  Also, imagine someone writing a SIGSEGV handler specifically intended to handle shadow stack overflow.

The shadow stack can be pivoted using RSTORSSP.

Edgecombe, Rick P Feb. 4, 2022, 8:23 p.m. UTC | #4

On Thu, 2022-02-03 at 21:20 -0800, Andy Lutomirski wrote:
> On Thu, Feb 3, 2022, at 5:08 PM, Edgecombe, Rick P wrote:
> > Hi Thomas,
> > > >         Signals
> > > >         -------
> > > >         Originally signals placed the location of the shadow
> > > > stack
> > > > restore 
> > > >         token inside the saved state on the stack. This was
> > > > problematic from a 
> > > >         past ABI promises perspective.
> 
> What was the actual problem?

The meat of the discussion that I saw was this thread:

https://lore.kernel.org/lkml/CALCETrVTeYfzO-XWh+VwTuKCyPyp-oOMGH=QR_msG9tPQ4xPmA@mail.gmail.com/

The problem was that the saved shadow stack pointer was placed in a
location unexpected by userspace (at the end of the floating point
state), which apparently can be relocated by userspace that would not
be aware of the shadow stack state at the end. I think an earlier
version was not extendable either.

It does not seem to be a fully dead end, but I have to admit I didn’t
fully internalize the limits imposed by the userspace expectations of
the sigframe because the current solution seemed good. I’ll have to dig
into it a little more because alt stack, CRIU and these expectations
all seem intertwined.

Here is a question for you guys though – is a general ucontext
extension solution a nice-to-have on its own?

I was thinking that it would be better to keep CET stuff out of the
sigframe related structures if it can be avoided. One reason is that it
keeps security related register values out of writable locations. I’m
not thinking of any specific problem, but just a general idea of not
opening that stuff up if it’s not needed. An example is in the IBT
series, the wait-for-endbranch state was saved in a ucontext flag. Some
usages may want to keep it from being cleared in a signal and the
endbranch check skipped. So for shadow stack, just the general notion
that this is not ideal state to open up.

On where to keep the wait-for-endbranch state, PeterZ had suggested the
possibility of having a per-mm hashmap of (userspace stack addresses)-> 
(kernel side saved state), limited to some sensible limit of items.
This could be extendable to other state besides CET stuff. I was
thinking to look in that direction if it’s needed for the alt shadow
stack.

But, is it swimming against the place where saved state is "supposed"
to be? There is some optimum of compatibility (more apps able to opt-
in) and security. I guess it's probably not good to have the kernel
bend over backwards trying to get both.

> > > > So the restore location was
> > > > instead just 
> > > >         assumed from the shadow stack pointer. This works
> > > > because in
> > > > normal 
> > > >         allowed cases of calling sigreturn, the shadow stack
> > > > pointer
> > > > should be 
> > > >         right at the restore token at that time. There is no
> > > > alternate shadow 
> > > >         stack support. If an alt shadow stack is added later we
> > > > would
> > > >         need to
> > > 
> > > So how is that going to work? altstack is not an esoteric corner
> > > case.
> > 
> > My understanding is that the main usages for the signal stack were
> > handling stack overflows and corruption. Since the shadow stack
> > only
> > contains return addresses rather than large stack allocations, and
> > is
> > not generally writable or pivotable, I thought there was a good
> > possibility an alt shadow stack would not end up being especially
> > useful. Does it seem like reasonable guesswork?
> 
> It's also used for things like DOSEMU that execute in a weird context
> and then trap back out to the outer program using a signal handler
> and an altstack.  Also, imagine someone writing a SIGSEGV handler
> specifically intended to handle shadow stack overflow.

Interesting, thanks. I had been thinking that an alt shadow stack would
require a new interface that would mostly just sit dormant taking up
space. But probably an (obvious) better way would be to just have the
signalstack() syscall automatically create a new shadow stack, like the
rest of the series does automatically for new threads. I’ll think I’ll
see how that would look.

> 
> The shadow stack can be pivoted using RSTORSSP.

Yes, I just meant that the ability to pivot or modify is restricted (in
RSTORSSP's case by restore token checks) and so with less ability to
interact with it, it could be less likely for there to be corruptions.
This if of course just speculation.

David Laight Feb. 5, 2022, 1:26 p.m. UTC | #5

From: Edgecombe, Rick P
> Sent: 04 February 2022 01:08
> Hi Thomas,
> 
> Thanks for feedback on the plan.
> 
> On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > > Until now, the enabling effort was trying to support both Shadow
> > > Stack and IBT.
> > > This history will focus on a few areas of the shadow stack
> > > development history
> > > that I thought stood out.
> > >
> > >        Signals
> > >        -------
> > >        Originally signals placed the location of the shadow stack
> > > restore
> > >        token inside the saved state on the stack. This was
> > > problematic from a
> > >        past ABI promises perspective. So the restore location was
> > > instead just
> > >        assumed from the shadow stack pointer. This works because in
> > > normal
> > >        allowed cases of calling sigreturn, the shadow stack pointer
> > > should be
> > >        right at the restore token at that time. There is no
> > > alternate shadow
> > >        stack support. If an alt shadow stack is added later we
> > > would
> > >        need to
> >
> > So how is that going to work? altstack is not an esoteric corner
> > case.
> 
> My understanding is that the main usages for the signal stack were
> handling stack overflows and corruption. Since the shadow stack only
> contains return addresses rather than large stack allocations, and is
> not generally writable or pivotable, I thought there was a good
> possibility an alt shadow stack would not end up being especially
> useful. Does it seem like reasonable guesswork?

The other 'problem' is that it is valid to longjump out of a signal handler.
These days you have to use siglongjmp() not longjmp() but it is still used.

It is probably also valid to use siglongjmp() to jump from a nested
signal handler into the outer handler.
Given both signal handlers can have their own stack, there can be three
stacks involved.

I think the shadow stack pointer has to be in ucontext - which also
means the application can change it before returning from a signal.
In much the same way as all the segment registers can be changed
leading to all the nasty bugs when the final 'return to user' code
traps in kernel when loading invalid segment registers or executing iret.

Hmmm... do shadow stacks mean that longjmp() has to be a system call?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

H.J. Lu Feb. 5, 2022, 1:29 p.m. UTC | #6

On Sat, Feb 5, 2022 at 5:27 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Edgecombe, Rick P
> > Sent: 04 February 2022 01:08
> > Hi Thomas,
> >
> > Thanks for feedback on the plan.
> >
> > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > > > Until now, the enabling effort was trying to support both Shadow
> > > > Stack and IBT.
> > > > This history will focus on a few areas of the shadow stack
> > > > development history
> > > > that I thought stood out.
> > > >
> > > >        Signals
> > > >        -------
> > > >        Originally signals placed the location of the shadow stack
> > > > restore
> > > >        token inside the saved state on the stack. This was
> > > > problematic from a
> > > >        past ABI promises perspective. So the restore location was
> > > > instead just
> > > >        assumed from the shadow stack pointer. This works because in
> > > > normal
> > > >        allowed cases of calling sigreturn, the shadow stack pointer
> > > > should be
> > > >        right at the restore token at that time. There is no
> > > > alternate shadow
> > > >        stack support. If an alt shadow stack is added later we
> > > > would
> > > >        need to
> > >
> > > So how is that going to work? altstack is not an esoteric corner
> > > case.
> >
> > My understanding is that the main usages for the signal stack were
> > handling stack overflows and corruption. Since the shadow stack only
> > contains return addresses rather than large stack allocations, and is
> > not generally writable or pivotable, I thought there was a good
> > possibility an alt shadow stack would not end up being especially
> > useful. Does it seem like reasonable guesswork?
>
> The other 'problem' is that it is valid to longjump out of a signal handler.
> These days you have to use siglongjmp() not longjmp() but it is still used.
>
> It is probably also valid to use siglongjmp() to jump from a nested
> signal handler into the outer handler.
> Given both signal handlers can have their own stack, there can be three
> stacks involved.
>
> I think the shadow stack pointer has to be in ucontext - which also
> means the application can change it before returning from a signal.
> In much the same way as all the segment registers can be changed
> leading to all the nasty bugs when the final 'return to user' code
> traps in kernel when loading invalid segment registers or executing iret.
>
> Hmmm... do shadow stacks mean that longjmp() has to be a system call?

No.  setjmp/longjmp save and restore shadow stack pointer.

--
H.J.

Edgecombe, Rick P Feb. 5, 2022, 8:15 p.m. UTC | #7

On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote:
> On Sat, Feb 5, 2022 at 5:27 AM David Laight <David.Laight@aculab.com>
> wrote:
> > 
> > From: Edgecombe, Rick P
> > > Sent: 04 February 2022 01:08
> > > Hi Thomas,
> > > 
> > > Thanks for feedback on the plan.
> > > 
> > > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > > > > Until now, the enabling effort was trying to support both
> > > > > Shadow
> > > > > Stack and IBT.
> > > > > This history will focus on a few areas of the shadow stack
> > > > > development history
> > > > > that I thought stood out.
> > > > > 
> > > > >        Signals
> > > > >        -------
> > > > >        Originally signals placed the location of the shadow
> > > > > stack
> > > > > restore
> > > > >        token inside the saved state on the stack. This was
> > > > > problematic from a
> > > > >        past ABI promises perspective. So the restore location
> > > > > was
> > > > > instead just
> > > > >        assumed from the shadow stack pointer. This works
> > > > > because in
> > > > > normal
> > > > >        allowed cases of calling sigreturn, the shadow stack
> > > > > pointer
> > > > > should be
> > > > >        right at the restore token at that time. There is no
> > > > > alternate shadow
> > > > >        stack support. If an alt shadow stack is added later
> > > > > we
> > > > > would
> > > > >        need to
> > > > 
> > > > So how is that going to work? altstack is not an esoteric
> > > > corner
> > > > case.
> > > 
> > > My understanding is that the main usages for the signal stack
> > > were
> > > handling stack overflows and corruption. Since the shadow stack
> > > only
> > > contains return addresses rather than large stack allocations,
> > > and is
> > > not generally writable or pivotable, I thought there was a good
> > > possibility an alt shadow stack would not end up being especially
> > > useful. Does it seem like reasonable guesswork?
> > 
> > The other 'problem' is that it is valid to longjump out of a signal
> > handler.
> > These days you have to use siglongjmp() not longjmp() but it is
> > still used.
> > 
> > It is probably also valid to use siglongjmp() to jump from a nested
> > signal handler into the outer handler.
> > Given both signal handlers can have their own stack, there can be
> > three
> > stacks involved.

So the scenario is?

1. Handle signal 1
2. sigsetjmp()
3. signalstack()
4. Handle signal 2 on alt stack
5. siglongjmp()

I'll check that it is covered by the tests, but I think it should work
in this series that has no alt shadow stack. I have only done a high
level overview of how the shadow stack stuff, that doesn't involve the
kernel, works in glibc. Sounds like I'll need to do a deeper dive.

> > 
> > I think the shadow stack pointer has to be in ucontext - which also
> > means the application can change it before returning from a signal.

Yes we might need to change it to support alt shadow stacks. Can you
elaborate why you think it has to be in ucontext? I was thinking of
looking at three options for storing the ssp:
 - Stored in the shadow stack like a token using WRUSS from the kernel.
 - Stored on the kernel side using a hashmap that maps ucontext or
   sigframe userspace address to ssp (this is of course similar to 
   storing in ucontext, except that the user can’t change the ssp).
 - Stored writable in userspace in ucontext.

But in this version, without alt shadow stacks, the shadow stack
pointer is not stored in ucontext. This causes the limitation that
userspace can only call sigreturn when it has returned back to a point
where there is a restore token on the shadow stack (which was placed
there by the kernel). This doesn’t mean it can’t switch to a different
shadow stack or handle a nested signal, but it limits the possibility
for calling sigreturn with a totally different sigframe (like CRIU and
SROP attacks do). It should hopefully be a helpful, protective
limitation for most apps and I'm hoping CRIU can be fixed without
removing it.

I am not aware of other limitations to signals (besides normal shadow
stack enforcement), but I could be missing it. And people's skepticism
is making me want to go back over it with more scrutiny.

> > In much the same way as all the segment registers can be changed
> > leading to all the nasty bugs when the final 'return to user' code
> > traps in kernel when loading invalid segment registers or executing
> > iret.

I don't think this is as difficult to avoid because userspace ssp has
its own register that should not be accessed at that point, but I have
not given this aspect enough analysis. Thanks for bringing it up.

> > 
> > Hmmm... do shadow stacks mean that longjmp() has to be a system
> > call?
> 
> No.  setjmp/longjmp save and restore shadow stack pointer.
> 

It sounds like it would help to write up in a lot more detail exactly
how all the signal and specialer stack manipulation scenarios work in
glibc.

H.J. Lu Feb. 5, 2022, 8:21 p.m. UTC | #8

On Sat, Feb 5, 2022 at 12:15 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote:
> > On Sat, Feb 5, 2022 at 5:27 AM David Laight <David.Laight@aculab.com>
> > wrote:
> > >
> > > From: Edgecombe, Rick P
> > > > Sent: 04 February 2022 01:08
> > > > Hi Thomas,
> > > >
> > > > Thanks for feedback on the plan.
> > > >
> > > > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > > > > > Until now, the enabling effort was trying to support both
> > > > > > Shadow
> > > > > > Stack and IBT.
> > > > > > This history will focus on a few areas of the shadow stack
> > > > > > development history
> > > > > > that I thought stood out.
> > > > > >
> > > > > >        Signals
> > > > > >        -------
> > > > > >        Originally signals placed the location of the shadow
> > > > > > stack
> > > > > > restore
> > > > > >        token inside the saved state on the stack. This was
> > > > > > problematic from a
> > > > > >        past ABI promises perspective. So the restore location
> > > > > > was
> > > > > > instead just
> > > > > >        assumed from the shadow stack pointer. This works
> > > > > > because in
> > > > > > normal
> > > > > >        allowed cases of calling sigreturn, the shadow stack
> > > > > > pointer
> > > > > > should be
> > > > > >        right at the restore token at that time. There is no
> > > > > > alternate shadow
> > > > > >        stack support. If an alt shadow stack is added later
> > > > > > we
> > > > > > would
> > > > > >        need to
> > > > >
> > > > > So how is that going to work? altstack is not an esoteric
> > > > > corner
> > > > > case.
> > > >
> > > > My understanding is that the main usages for the signal stack
> > > > were
> > > > handling stack overflows and corruption. Since the shadow stack
> > > > only
> > > > contains return addresses rather than large stack allocations,
> > > > and is
> > > > not generally writable or pivotable, I thought there was a good
> > > > possibility an alt shadow stack would not end up being especially
> > > > useful. Does it seem like reasonable guesswork?
> > >
> > > The other 'problem' is that it is valid to longjump out of a signal
> > > handler.
> > > These days you have to use siglongjmp() not longjmp() but it is
> > > still used.
> > >
> > > It is probably also valid to use siglongjmp() to jump from a nested
> > > signal handler into the outer handler.
> > > Given both signal handlers can have their own stack, there can be
> > > three
> > > stacks involved.
>
> So the scenario is?
>
> 1. Handle signal 1
> 2. sigsetjmp()
> 3. signalstack()
> 4. Handle signal 2 on alt stack
> 5. siglongjmp()
>
> I'll check that it is covered by the tests, but I think it should work
> in this series that has no alt shadow stack. I have only done a high
> level overview of how the shadow stack stuff, that doesn't involve the
> kernel, works in glibc. Sounds like I'll need to do a deeper dive.
>
> > >
> > > I think the shadow stack pointer has to be in ucontext - which also
> > > means the application can change it before returning from a signal.
>
> Yes we might need to change it to support alt shadow stacks. Can you
> elaborate why you think it has to be in ucontext? I was thinking of
> looking at three options for storing the ssp:
>  - Stored in the shadow stack like a token using WRUSS from the kernel.
>  - Stored on the kernel side using a hashmap that maps ucontext or
>    sigframe userspace address to ssp (this is of course similar to
>    storing in ucontext, except that the user can’t change the ssp).
>  - Stored writable in userspace in ucontext.
>
> But in this version, without alt shadow stacks, the shadow stack
> pointer is not stored in ucontext. This causes the limitation that
> userspace can only call sigreturn when it has returned back to a point
> where there is a restore token on the shadow stack (which was placed
> there by the kernel). This doesn’t mean it can’t switch to a different
> shadow stack or handle a nested signal, but it limits the possibility
> for calling sigreturn with a totally different sigframe (like CRIU and
> SROP attacks do). It should hopefully be a helpful, protective
> limitation for most apps and I'm hoping CRIU can be fixed without
> removing it.
>
> I am not aware of other limitations to signals (besides normal shadow
> stack enforcement), but I could be missing it. And people's skepticism
> is making me want to go back over it with more scrutiny.
>
> > > In much the same way as all the segment registers can be changed
> > > leading to all the nasty bugs when the final 'return to user' code
> > > traps in kernel when loading invalid segment registers or executing
> > > iret.
>
> I don't think this is as difficult to avoid because userspace ssp has
> its own register that should not be accessed at that point, but I have
> not given this aspect enough analysis. Thanks for bringing it up.
>
> > >
> > > Hmmm... do shadow stacks mean that longjmp() has to be a system
> > > call?
> >
> > No.  setjmp/longjmp save and restore shadow stack pointer.
> >
>
> It sounds like it would help to write up in a lot more detail exactly
> how all the signal and specialer stack manipulation scenarios work in
> glibc.
>

setjmp/longjmp work on the same sigjmp_buf.  Shadow stack pointer
is saved and restored, just like any other callee-saved registers.

Peter Zijlstra Feb. 6, 2022, 1:06 p.m. UTC | #9

On Fri, Feb 04, 2022 at 01:08:25AM +0000, Edgecombe, Rick P wrote:

> > So how is that going to work? altstack is not an esoteric corner
> > case.
> 
> My understanding is that the main usages for the signal stack were
> handling stack overflows and corruption. Since the shadow stack only
> contains return addresses rather than large stack allocations, and is
> not generally writable or pivotable, I thought there was a good
> possibility an alt shadow stack would not end up being especially
> useful. Does it seem like reasonable guesswork?

altstacks are also used in userspace threading implemenations.

Peter Zijlstra Feb. 6, 2022, 1:19 p.m. UTC | #10

On Sat, Feb 05, 2022 at 12:21:12PM -0800, H.J. Lu wrote:

> setjmp/longjmp work on the same sigjmp_buf.  Shadow stack pointer
> is saved and restored, just like any other callee-saved registers.

How is having that shadow stack pointer in user-writable memory not a
problem? That seems like a prime target to subvert the whole shadow
stack machinery.

David Laight Feb. 6, 2022, 1:42 p.m. UTC | #11

From: Edgecombe, Rick P
> Sent: 05 February 2022 20:15
> 
> On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote:
> > On Sat, Feb 5, 2022 at 5:27 AM David Laight <David.Laight@aculab.com>
> > wrote:
> > >
> > > From: Edgecombe, Rick P
> > > > Sent: 04 February 2022 01:08
> > > > Hi Thomas,
> > > >
> > > > Thanks for feedback on the plan.
> > > >
> > > > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > > > > > Until now, the enabling effort was trying to support both
> > > > > > Shadow
> > > > > > Stack and IBT.
> > > > > > This history will focus on a few areas of the shadow stack
> > > > > > development history
> > > > > > that I thought stood out.
> > > > > >
> > > > > >        Signals
> > > > > >        -------
> > > > > >        Originally signals placed the location of the shadow
> > > > > > stack
> > > > > > restore
> > > > > >        token inside the saved state on the stack. This was
> > > > > > problematic from a
> > > > > >        past ABI promises perspective. So the restore location
> > > > > > was
> > > > > > instead just
> > > > > >        assumed from the shadow stack pointer. This works
> > > > > > because in
> > > > > > normal
> > > > > >        allowed cases of calling sigreturn, the shadow stack
> > > > > > pointer
> > > > > > should be
> > > > > >        right at the restore token at that time. There is no
> > > > > > alternate shadow
> > > > > >        stack support. If an alt shadow stack is added later
> > > > > > we
> > > > > > would
> > > > > >        need to
> > > > >
> > > > > So how is that going to work? altstack is not an esoteric
> > > > > corner
> > > > > case.
> > > >
> > > > My understanding is that the main usages for the signal stack
> > > > were
> > > > handling stack overflows and corruption. Since the shadow stack
> > > > only
> > > > contains return addresses rather than large stack allocations,
> > > > and is
> > > > not generally writable or pivotable, I thought there was a good
> > > > possibility an alt shadow stack would not end up being especially
> > > > useful. Does it seem like reasonable guesswork?
> > >
> > > The other 'problem' is that it is valid to longjump out of a signal
> > > handler.
> > > These days you have to use siglongjmp() not longjmp() but it is
> > > still used.
> > >
> > > It is probably also valid to use siglongjmp() to jump from a nested
> > > signal handler into the outer handler.
> > > Given both signal handlers can have their own stack, there can be
> > > three
> > > stacks involved.
> 
> So the scenario is?
> 
> 1. Handle signal 1
> 2. sigsetjmp()
> 3. signalstack()
> 4. Handle signal 2 on alt stack
> 5. siglongjmp()
> 
> I'll check that it is covered by the tests, but I think it should work
> in this series that has no alt shadow stack. I have only done a high
> level overview of how the shadow stack stuff, that doesn't involve the
> kernel, works in glibc. Sounds like I'll need to do a deeper dive.

The posix/xopen definition for setjmp/longjmp doesn't require such
longjmp requests to work.

Although they still have to do something that doesn't break badly.
Aborting the process is probably fine!

> > > I think the shadow stack pointer has to be in ucontext - which also
> > > means the application can change it before returning from a signal.
> 
> Yes we might need to change it to support alt shadow stacks. Can you
> elaborate why you think it has to be in ucontext? I was thinking of
> looking at three options for storing the ssp:
>  - Stored in the shadow stack like a token using WRUSS from the kernel.
>  - Stored on the kernel side using a hashmap that maps ucontext or
>    sigframe userspace address to ssp (this is of course similar to
>    storing in ucontext, except that the user can’t change the ssp).
>  - Stored writable in userspace in ucontext.
> 
> But in this version, without alt shadow stacks, the shadow stack
> pointer is not stored in ucontext. This causes the limitation that
> userspace can only call sigreturn when it has returned back to a point
> where there is a restore token on the shadow stack (which was placed
> there by the kernel). This doesn’t mean it can’t switch to a different
> shadow stack or handle a nested signal, but it limits the possibility
> for calling sigreturn with a totally different sigframe (like CRIU and
> SROP attacks do). It should hopefully be a helpful, protective
> limitation for most apps and I'm hoping CRIU can be fixed without
> removing it.
> 
> I am not aware of other limitations to signals (besides normal shadow
> stack enforcement), but I could be missing it. And people's skepticism
> is making me want to go back over it with more scrutiny.
> 
> > > In much the same way as all the segment registers can be changed
> > > leading to all the nasty bugs when the final 'return to user' code
> > > traps in kernel when loading invalid segment registers or executing
> > > iret.
> 
> I don't think this is as difficult to avoid because userspace ssp has
> its own register that should not be accessed at that point, but I have
> not given this aspect enough analysis. Thanks for bringing it up.

So the user ssp isn't saved (or restored) by the trap entry/exit.
So it needs to be saved by the context switch code?
Much like the user segment registers?
So you are likely to get the same problems if restoring it can fault
in kernel (eg for a non-canonical address).

> > > Hmmm... do shadow stacks mean that longjmp() has to be a system
> > > call?
> >
> > No.  setjmp/longjmp save and restore shadow stack pointer.

Ok, I was thinking that direct access to the user ssp would be
a privileged operation.
If it can be written you don't really have to worry about what code
is trying to do - it can actually do what it likes.
It just catches unintentional operations (like buffer overflows).

Was there any 'spare' space in struct jmpbuf ?
Otherwise you can only enable shadow stacks if everything has been
recompiled - including any shared libraries the might be dlopen()ed.
(or does the compiler invent an alloca() call somehow for a
size that comes back from glibc?)

I've never really considered how setjmp/longjmp handle callee saved
register variables (apart from it being hard).
The original pdp11 implementation probably only needed to save r6 and r7.

What does happen to all the 'extended state' that XSAVE handles?
IIRC all the AVX registers are caller saved (so should probably
be zerod), but some of the SSE ones are callee saved, and one or
two of the fpu flags are sticky and annoying enough the save/restore
at the best of times.

> It sounds like it would help to write up in a lot more detail exactly
> how all the signal and specialer stack manipulation scenarios work in
> glibc.

Some cross references might have made people notice that the ucontext
extensions for AVX512 (if not earlier ones) broke the minimal/default
signal stack size.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

H.J. Lu Feb. 6, 2022, 1:55 p.m. UTC | #12

On Sun, Feb 6, 2022 at 5:42 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Edgecombe, Rick P
> > Sent: 05 February 2022 20:15
> >
> > On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote:
> > > On Sat, Feb 5, 2022 at 5:27 AM David Laight <David.Laight@aculab.com>
> > > wrote:
> > > >
> > > > From: Edgecombe, Rick P
> > > > > Sent: 04 February 2022 01:08
> > > > > Hi Thomas,
> > > > >
> > > > > Thanks for feedback on the plan.
> > > > >
> > > > > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > > > > > > Until now, the enabling effort was trying to support both
> > > > > > > Shadow
> > > > > > > Stack and IBT.
> > > > > > > This history will focus on a few areas of the shadow stack
> > > > > > > development history
> > > > > > > that I thought stood out.
> > > > > > >
> > > > > > >        Signals
> > > > > > >        -------
> > > > > > >        Originally signals placed the location of the shadow
> > > > > > > stack
> > > > > > > restore
> > > > > > >        token inside the saved state on the stack. This was
> > > > > > > problematic from a
> > > > > > >        past ABI promises perspective. So the restore location
> > > > > > > was
> > > > > > > instead just
> > > > > > >        assumed from the shadow stack pointer. This works
> > > > > > > because in
> > > > > > > normal
> > > > > > >        allowed cases of calling sigreturn, the shadow stack
> > > > > > > pointer
> > > > > > > should be
> > > > > > >        right at the restore token at that time. There is no
> > > > > > > alternate shadow
> > > > > > >        stack support. If an alt shadow stack is added later
> > > > > > > we
> > > > > > > would
> > > > > > >        need to
> > > > > >
> > > > > > So how is that going to work? altstack is not an esoteric
> > > > > > corner
> > > > > > case.
> > > > >
> > > > > My understanding is that the main usages for the signal stack
> > > > > were
> > > > > handling stack overflows and corruption. Since the shadow stack
> > > > > only
> > > > > contains return addresses rather than large stack allocations,
> > > > > and is
> > > > > not generally writable or pivotable, I thought there was a good
> > > > > possibility an alt shadow stack would not end up being especially
> > > > > useful. Does it seem like reasonable guesswork?
> > > >
> > > > The other 'problem' is that it is valid to longjump out of a signal
> > > > handler.
> > > > These days you have to use siglongjmp() not longjmp() but it is
> > > > still used.
> > > >
> > > > It is probably also valid to use siglongjmp() to jump from a nested
> > > > signal handler into the outer handler.
> > > > Given both signal handlers can have their own stack, there can be
> > > > three
> > > > stacks involved.
> >
> > So the scenario is?
> >
> > 1. Handle signal 1
> > 2. sigsetjmp()
> > 3. signalstack()
> > 4. Handle signal 2 on alt stack
> > 5. siglongjmp()
> >
> > I'll check that it is covered by the tests, but I think it should work
> > in this series that has no alt shadow stack. I have only done a high
> > level overview of how the shadow stack stuff, that doesn't involve the
> > kernel, works in glibc. Sounds like I'll need to do a deeper dive.
>
> The posix/xopen definition for setjmp/longjmp doesn't require such
> longjmp requests to work.
>
> Although they still have to do something that doesn't break badly.
> Aborting the process is probably fine!
>
> > > > I think the shadow stack pointer has to be in ucontext - which also
> > > > means the application can change it before returning from a signal.
> >
> > Yes we might need to change it to support alt shadow stacks. Can you
> > elaborate why you think it has to be in ucontext? I was thinking of
> > looking at three options for storing the ssp:
> >  - Stored in the shadow stack like a token using WRUSS from the kernel.
> >  - Stored on the kernel side using a hashmap that maps ucontext or
> >    sigframe userspace address to ssp (this is of course similar to
> >    storing in ucontext, except that the user can’t change the ssp).
> >  - Stored writable in userspace in ucontext.
> >
> > But in this version, without alt shadow stacks, the shadow stack
> > pointer is not stored in ucontext. This causes the limitation that
> > userspace can only call sigreturn when it has returned back to a point
> > where there is a restore token on the shadow stack (which was placed
> > there by the kernel). This doesn’t mean it can’t switch to a different
> > shadow stack or handle a nested signal, but it limits the possibility
> > for calling sigreturn with a totally different sigframe (like CRIU and
> > SROP attacks do). It should hopefully be a helpful, protective
> > limitation for most apps and I'm hoping CRIU can be fixed without
> > removing it.
> >
> > I am not aware of other limitations to signals (besides normal shadow
> > stack enforcement), but I could be missing it. And people's skepticism
> > is making me want to go back over it with more scrutiny.
> >
> > > > In much the same way as all the segment registers can be changed
> > > > leading to all the nasty bugs when the final 'return to user' code
> > > > traps in kernel when loading invalid segment registers or executing
> > > > iret.
> >
> > I don't think this is as difficult to avoid because userspace ssp has
> > its own register that should not be accessed at that point, but I have
> > not given this aspect enough analysis. Thanks for bringing it up.
>
> So the user ssp isn't saved (or restored) by the trap entry/exit.
> So it needs to be saved by the context switch code?
> Much like the user segment registers?
> So you are likely to get the same problems if restoring it can fault
> in kernel (eg for a non-canonical address).
>
> > > > Hmmm... do shadow stacks mean that longjmp() has to be a system
> > > > call?
> > >
> > > No.  setjmp/longjmp save and restore shadow stack pointer.
>
> Ok, I was thinking that direct access to the user ssp would be
> a privileged operation.

User space can only pop shadow stack.  longjmp does

#ifdef SHADOW_STACK_POINTER_OFFSET
# if IS_IN (libc) && defined SHARED && defined FEATURE_1_OFFSET
/* Check if Shadow Stack is enabled.  */
testl $X86_FEATURE_1_SHSTK, %fs:FEATURE_1_OFFSET
jz L(skip_ssp)
# else
xorl %eax, %eax
# endif
/* Check and adjust the Shadow-Stack-Pointer.  */
/* Get the current ssp.  */
rdsspq %rax
/* And compare it with the saved ssp value.  */
subq SHADOW_STACK_POINTER_OFFSET(%rdi), %rax
je L(skip_ssp)
/* Count the number of frames to adjust and adjust it
   with incssp instruction.  The instruction can adjust
   the ssp by [0..255] value only thus use a loop if
   the number of frames is bigger than 255.  */
negq %rax
shrq $3, %rax
/* NB: We saved Shadow-Stack-Pointer of setjmp.  Since we are
       restoring Shadow-Stack-Pointer of setjmp's caller, we
       need to unwind shadow stack by one more frame.  */
addq $1, %rax

movl $255, %ebx
L(loop):
cmpq %rbx, %rax
cmovb %rax, %rbx
incsspq %rbx
subq %rbx, %rax
ja L(loop)

L(skip_ssp):
#endif

> If it can be written you don't really have to worry about what code
> is trying to do - it can actually do what it likes.
> It just catches unintentional operations (like buffer overflows).
>
> Was there any 'spare' space in struct jmpbuf ?

By pure luck, we have ONE spare space in sigjmp_buf.

> Otherwise you can only enable shadow stacks if everything has been
> recompiled - including any shared libraries the might be dlopen()ed.
> (or does the compiler invent an alloca() call somehow for a
> size that comes back from glibc?)
>
> I've never really considered how setjmp/longjmp handle callee saved
> register variables (apart from it being hard).
> The original pdp11 implementation probably only needed to save r6 and r7.
>
> What does happen to all the 'extended state' that XSAVE handles?
> IIRC all the AVX registers are caller saved (so should probably
> be zerod), but some of the SSE ones are callee saved, and one or
> two of the fpu flags are sticky and annoying enough the save/restore
> at the best of times.
>
> > It sounds like it would help to write up in a lot more detail exactly
> > how all the signal and specialer stack manipulation scenarios work in
> > glibc.
>
> Some cross references might have made people notice that the ucontext
> extensions for AVX512 (if not earlier ones) broke the minimal/default
> signal stack size.
>
>         David
>

Mike Rapoport Feb. 6, 2022, 6:42 p.m. UTC | #13

(added more CRIU people)

On Sun, Jan 30, 2022 at 01:18:03PM -0800, Rick Edgecombe wrote:
> Hi,
> 
> This is a slight reboot of the userspace CET series. I will be taking over the 
> series from Yu-cheng. Per some internal recommendations, I’ve reset the version
> number and am calling it a new series. Hopefully, it doesn’t cause confusion.
> 
> The new plan is to upstream only userspace Shadow Stack support at this point. 
> IBT can follow later, but for now I’ll focus solely on the most in-demand and
> widely available (with the feature on AMD CPUs now) part of CET.
> 
> I thought as part of this reset, it might be useful to more fully write-up the 
> design and summarize the history of the previous CET series. So this slightly
> long cover letter does that. The "Updates" section has the changes, if anyone
> doesn't want the history.
> 
> 
> Why is Shadow Stack Wanted
> ==========================
> The main use case for userspace shadow stack is providing protection against 
> return oriented programming attacks. Fedora and Ubuntu already have many/most 
> packages enabled for shadow stack. The main missing piece is Linux kernel 
> support and there seems to be a high amount of interest in the ecosystem for
> getting this feature supported. Besides security, Google has also done some
> work on using shadow stack to improve performance and reliability of tracing.
> 
> 
> Userspace Shadow Stack Implementation
> =====================================
> Shadow stack works by maintaining a secondary (shadow) stack that cannot be 
> directly modified by applications. When executing a CALL instruction, the 
> processor pushes the return address to both the normal stack and to the special 
> permissioned shadow stack. Upon ret, the processor pops the shadow stack copy 
> and compares it to the normal stack copy. If the two differ, the processor 
> raises a control protection fault. This implementation supports shadow stack on 
> 64 bit kernels only, with support for 32 bit only via IA32 emulation.
> 
> 	Shadow Stack Memory
> 	-------------------
> 	The majority of this series deals with changes for handling the special 
> 	shadow stack memory permissions. This memory is specified by the 
> 	Dirty+RO PTE bits. A tricky aspect of this is that this combination was 
> 	previously used to specify COW memory. So Linux needs to handle COW 
> 	differently when shadow stack is in use. The solution is to use a 
> 	software PTE bit to denote COW memory, and take care to clear the dirty
> 	bit when setting the memory RO.
> 
> 	Setup and Upkeep of HW Registers
> 	--------------------------------
> 	Using userspace CET requires a CR4 bit set, and also the manipulation 
> 	of two xsave managed MSRs. The kernel needs to modify these registers 
> 	during various operations like clone and signal handling. These 
> 	operations may happen when the registers are restored to the CPU, or 
> 	saved in an xsave buffer. Since the recent AMX triggered FPU overhaul 
> 	removed direct access to the xsave buffer, this series adds an 
> 	interface to operate on the supervisor xstate.
> 
> 	New ABIs
> 	--------
> 	This series introduces some new ABIs. The primary one is the shadow 
> 	stack itself. Since it is readable and the shadow stack pointer is 
> 	exposed to user space, applications can easily read and process the 
> 	shadow stack. And in fact the tracing usages plan to do exactly that.
> 
> 	Most of the shadow stack contents are written by HW, but some of the 
> 	entries are added by the kernel. The main place for this is signals. As 
> 	part of handling the signal the kernel does some manual adjustment of 
> 	the shadow stack that userspace depends on.
> 
> 	In addition to the contents of the shadow stack there is also user 
> 	visible behavior around when new shadow stacks are created and set in 
> 	the shadow stack pointer (SSP) register. This is relatively 
> 	straightforward – shadow stacks are created when new stacks are created 
> 	(thread creation, fork, etc). It is more or less what is required to 
> 	keep apps working.
> 
> 	For situations when userspace creates a new stack (i.e. makecontext(), 
> 	fibers, etc), a new syscall is provided for creating shadow stack 
> 	memory. To make the shadow stack usable, it needs to have a restore 
> 	token written to the protected memory. So the syscall provides a way to 
> 	specificity this should be done by the kernel.
> 
> 	When a shadow stack violation happens (when the return address of stack 
> 	not matching return address in shadow stack), a segfault is generated 
> 	with a new si_code specific to CET violations.
> 
> 	Lastly, a new arch_prctl interface is created for controlling the 
> 	enablement of CET-like features. It is intended to also be used for 
> 	LAM. It operates on the feature status per-thread, so for process wide 
> 	enabling it is intended to be used early in things like dynamic 
> 	linker/loaders. However, it can be used later for per-thread enablement 
> 	of features like WRSS.
> 
> 	WRSS
> 	----
> 	WRSS is an instruction that can write to shadow stacks. The HW provides 
> 	a way to enable this instruction for userspace use. Since shadow 
> 	stack’s are created initially protected, enabling WRSS allows any apps 
> 	that want to do unusual things with their stacks to have a way to 
> 	weaken protection and make things more flexible. A new feature bit is 
> 	defined to control enabling/disabling of WRSS.
> 
> 
> History
> =======
> The branding “CET” really consists of two features: “Shadow Stack” and 
> “Indirect Branch Tracking”. They both restrict previously allowed, but rarely 
> valid behaviors and require userspace to change to avoid these behaviors before 
> enabling the protection. These raw HW features need to be assembled into a 
> software solution across userspace and kernel in order to add security value.
> The kernel part of this solution has evolved iteratively starting with a lengthy
> RFC period. 
> 
> Until now, the enabling effort was trying to support both Shadow Stack and IBT. 
> This history will focus on a few areas of the shadow stack development history 
> that I thought stood out.
> 
> 	Signals
> 	-------
> 	Originally signals placed the location of the shadow stack restore 
> 	token inside the saved state on the stack. This was problematic from a 
> 	past ABI promises perspective. So the restore location was instead just 
> 	assumed from the shadow stack pointer. This works because in normal 
> 	allowed cases of calling sigreturn, the shadow stack pointer should be 
> 	right at the restore token at that time. There is no alternate shadow 
> 	stack support. If an alt shadow stack is added later we would need to 
> 	find a place to store the regular shadow stack token location. Options 
> 	could be to push something on the alt shadow stack, or to keep 
> 	something on the kernel side. So the current design keeps things simple 
> 	while slightly kicking the can down the road if alt shadow stacks 
> 	become a thing later. Siglongjmp is handled in glibc, using the incssp 
> 	instruction to unwind the shadow stack over the token.
> 
> 	Shadow Stack Allocation
> 	-----------------------
> 	makecontext() implementations need a way to create new shadow stacks 
> 	with restore token’s such that they can be pivoted to from userspace. 
> 	The first interface to do this was an arch_prctl(). It created a shadow 
> 	stack with a restore token pre-setup, since the kernel has an 
> 	instruction that can write to user shadow stacks. However, this 
> 	interface was abandoned for being strange.
> 
> 	The next version created PROT_SHADOW_STACK. This interface had two 
> 	problems. One, it left no options but for userspace to create writable 
> 	memory, write a restore token, then mproctect() it PROT_SHADOW_STACK. 
> 	The writable window left the shadow stack exposed, weakening the 
> 	security. Second, it caused problems with the guard pages. Since the 
> 	memory was initially created writable it did not have a guard page, but 
> 	then was mprotected later to a type of memory that should have one. 
> 	This resulted in missing guard pages and confused rb_subtree_gap’s.
> 
> 	This version introduces a new syscall that behaves similarly to the 
> 	initial arch_prctl() interface in that it has the kernel write the 
> 	restore token.
> 
> 	Enabling Interface
> 	------------------
> 	For the entire history of the original CET series, the design was to 
> 	enable shadow stack automatically if the feature bit was detected in 
> 	the elf header. Then it was userspace’s responsibility to turn it off 
> 	via an arch_prctl() if it was not desired, and this was handled by the 
> 	glibc dynamic loader. Glibc’s standard behavior (when CET if configured 
> 	is to leave shadow stack enabled if the executable and all linked 
> 	libraries are marked with shadow stacks.
> 
> 	Many distros (Fedora and others) have binaries already marked with 
> 	shadow stack, waiting for kernel support. Unfortunately their glibc 
> 	binaries expect the original arch_prctl() interface for allocating 
> 	shadow stacks, as those changes were pushed ahead of kernel support. 
> 	The net result of it all is, when updating to a kernel with shadow 
> 	stack these binaries would suddenly get shadow stack enabled and expect 
> 	the arch_prctl() interface to be there. And so calls to makecontext() 
> 	will fail, resulting in visible breakages. This series deals with this 
> 	problem as described below in "Updates".
> 
> 
> Updates
> =======
> These updates were mostly driven by public comments, but a lot of the design 
> elements are new. I would like some extra scrutiny on the updates.
> 
> 	New syscall for Shadow Stack Allocation
> 	---------------------------------------
> 	A new syscall is added for allocating shadow stacks to replace 
> 	PROT_SHADOW_STACK. Several options were considered, as described in the 
> 	“x86/cet/shstk: Introduce map_shadow_stack syscall”.
> 
> 	Xsave Managed Supervisor State Modifications
> 	--------------------------------------------
> 	The shadow stack feature requires the kernel to modify xsaves managed 
> 	state. On one of the last versions of Yu-cheng’s series Boris had 
> 	commented on the pattern it was using to do this not necessarily being 
> 	ideal. The pattern was to force a restore to the registers and always 
> 	do the modification there. Then Thomas did an overhaul of the fpu code, 
> 	part of which consisted of making raw access to the xsave buffer 
> 	private to the fpu code. So this series tries to expose access again, 
> 	and in a way that addresses Boris’ comments.
> 
> 	The method is to provide functions like wmsrl/rdmsrl, but that can 
> 	direct the operation to the correct location (registers or buffer), 
> 	while giving the proper notice to the fpu subsystem so things don’t get 
> 	clobbered or corrupted.
> 
> 	In the past a solution like this was discussed as part of the PASID 
> 	series, and Thomas was not in favor. In CET’s case there is a more 
> 	logic around the CET MSR’s than in PASID's, and wrapping this logic 
> 	minimizes near identical open coded logic needed to do this more 
> 	efficiently. In addition it resolves the above described problem of 
> 	having no access to the xsave buffer. So it is being put forward here 
> 	under the supposition that CET’s usage may lead to a different 
> 	conclusion, not to try to ignore past direction.
> 
> 	The user interrupt series has similar needs as CET, and will also use
> 	this internal interface if it’s found acceptable.
> 
> 	Support for WRSS
> 	----------------
> 	Andy Lutomirski had asked if we change the shadow stack allocation API 
> 	such that userspace cannot create arbitrary shadow stacks, then we look 
> 	at exposing an interface to enable the WRSS instruction for userspace. 
> 	This way app’s that want to do unexpected things with shadow stacks 
> 	would still have the option to create shadow stacks with arbitrary 
> 	data.
> 
> 	Switch Enabling Interface
> 	-------------------------
> 	As described above there is a problem with userspace binaries waiting 
> 	to break as soon as the kernel supports CET. This needs to be prevented 
> 	by changing the interface such that the old binaries will not enable 
> 	shadow stack AND behave as if shadow stack is not enabled. They should 
> 	run normally without shadow stack protection. Creating a new feature 
> 	(SHSTK2) for shadow stack was explored. SHSTK would never be supported 
> 	by the kernel, and all the userspace build tools would be updated to 
> 	target SHSTK2 instead of SHSTK. So old SHSTK binaries would be cleanly
> 	disabled.
> 
> 	But there are existing downsides to automatic elf header processing 
> 	based enabling. The elf header feature spec is not defined by the 
> 	kernel and there are proposals to expand it to describe additional 
> 	logic. A simpler interface where the kernel is simply told what to 
> 	enable, and leaves all the decision making to userspace, is more 
> 	flexible for userspace and simpler for the kernel. There also already 
> 	needs to be an ARCH_X86_FEATURE_ENABLE arch_prctl() for WRSS (and 
> 	likely LAM will use it too), so it avoids there being two ways to turn 
> 	on these types of features. The only tricky part for shadow stack, is 
> 	that it has to be enabled very early. Wherever the shadow stack is 
> 	enabled, the app cannot return from that point, otherwise there will be 
> 	a shadow stack violation. It turns out glibc can enable shadow stack 
> 	this early, so it works nicely. So not automatically enabling any 
> 	features in the elf header will cleanly disable all old binaries, which 
> 	expect the kernel to enable CET features automatically. Then after the 
> 	kernel changes are upstream, glibc can be updated to use the new
> 	interface. This is the solution implemented in this series.
> 
> 	Expand Commit Logs
> 	------------------
> 	As part of spinning up on this series, I found some of the commit logs 
> 	did not describe the changes in enough detail for me understand their 
> 	purpose. I tried to expand the logs and comments, where I had to go 
> 	digging. Hopefully it’s useful.
> 	
> 	Limit to only Intel Processors
> 	------------------------------
> 	Shadow stack is supported on some AMD processors, but this revision 
> 	(with expanded HW usage and xsaves changes) has only has been tested on 
> 	Intel ones. So this series has a patch to limit shadow stack support to 
> 	Intel processors. Ideally the patch would not even make it to mainline, 
> 	and should be dropped as soon as this testing is done. It's included 
> 	just in case.
> 
> 
> Future Work
> ===========
> Even though this is now exclusively a shadow stack series, there is still some 
> remaining shadow stack work to be done.
> 
> 	Ptrace
> 	------
> 	Early in the series, there was a patch to allow IA32_U_CET and
> 	IA32_PL3_SSP to be set. This patch was dropped and planned as a follow
> 	up to basic support, and it remains the plan. It will be needed for
> 	in-progress gdb support.
> 
> 	CRIU Support
> 	------------
> 	In the past there was some speculation on the mailing list about 
> 	whether CRIU would need to be taught about CET. It turns out, it does. 
> 	The first issue hit is that CRIU calls sigreturn directly from its 
> 	“parasite code” that it injects into the dumper process. This violates
> 	this shadow stack implementation’s protection that intends to prevent
> 	attackers from doing this.
> 
> 	With so many packages already enabled with shadow stack, there is 
> 	probably desire to make it work seamlessly. But in the meantime if 
> 	distros want to support shadow stack and CRIU, users could manually 
> 	disabled shadow stack via “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for 
> 	a process they will wants to dump. It’s not ideal.
> 
> 	I’d like to hear what people think about having shadow stack in the 
> 	kernel without this resolved. Nothing would change for any users until 
> 	they enable shadow stack in the kernel and update to a glibc configured
> 	with CET. Should CRIU userspace be solved before kernel support?
> 
> 	Selftests
> 	---------
> 	There are some CET selftests being worked on and they are not included
> 	here.
> 
> Thanks,
> 
> Rick
> 
> Rick Edgecombe (7):
>   x86/mm: Prevent VM_WRITE shadow stacks
>   x86/fpu: Add helpers for modifying supervisor xstate
>   x86/fpu: Add unsafe xsave buffer helpers
>   x86/cet/shstk: Introduce map_shadow_stack syscall
>   selftests/x86: Add map_shadow_stack syscall test
>   x86/cet/shstk: Support wrss for userspace
>   x86/cpufeatures: Limit shadow stack to Intel CPUs
> 
> Yu-cheng Yu (28):
>   Documentation/x86: Add CET description
>   x86/cet/shstk: Add Kconfig option for Shadow Stack
>   x86/cpufeatures: Add CET CPU feature flags for Control-flow
>     Enforcement Technology (CET)
>   x86/cpufeatures: Introduce CPU setup and option parsing for CET
>   x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
>   x86/cet: Add control-protection fault handler
>   x86/mm: Remove _PAGE_DIRTY from kernel RO pages
>   x86/mm: Move pmd_write(), pud_write() up in the file
>   x86/mm: Introduce _PAGE_COW
>   drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
>   x86/mm: Update pte_modify for _PAGE_COW
>   x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
>     transition from _PAGE_DIRTY to _PAGE_COW
>   mm: Move VM_UFFD_MINOR_BIT from 37 to 38
>   mm: Introduce VM_SHADOW_STACK for shadow stack memory
>   x86/mm: Check Shadow Stack page fault errors
>   x86/mm: Update maybe_mkwrite() for shadow stack
>   mm: Fixup places that call pte_mkwrite() directly
>   mm: Add guard pages around a shadow stack.
>   mm/mmap: Add shadow stack pages to memory accounting
>   mm: Update can_follow_write_pte() for shadow stack
>   mm/mprotect: Exclude shadow stack from preserve_write
>   mm: Re-introduce vm_flags to do_mmap()
>   x86/cet/shstk: Add user-mode shadow stack support
>   x86/process: Change copy_thread() argument 'arg' to 'stack_size'
>   x86/cet/shstk: Handle thread shadow stack
>   x86/cet/shstk: Introduce shadow stack token setup/verify routines
>   x86/cet/shstk: Handle signals for shadow stack
>   x86/cet/shstk: Add arch_prctl elf feature functions
> 
>  .../admin-guide/kernel-parameters.txt         |   4 +
>  Documentation/filesystems/proc.rst            |   1 +
>  Documentation/x86/cet.rst                     | 145 ++++++
>  Documentation/x86/index.rst                   |   1 +
>  arch/arm/kernel/signal.c                      |   2 +-
>  arch/arm64/kernel/signal.c                    |   2 +-
>  arch/arm64/kernel/signal32.c                  |   2 +-
>  arch/sparc/kernel/signal32.c                  |   2 +-
>  arch/sparc/kernel/signal_64.c                 |   2 +-
>  arch/x86/Kconfig                              |  22 +
>  arch/x86/Kconfig.assembler                    |   5 +
>  arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
>  arch/x86/ia32/ia32_signal.c                   |  25 +-
>  arch/x86/include/asm/cet.h                    |  54 +++
>  arch/x86/include/asm/cpufeatures.h            |   1 +
>  arch/x86/include/asm/disabled-features.h      |   8 +-
>  arch/x86/include/asm/fpu/api.h                |   8 +
>  arch/x86/include/asm/fpu/types.h              |  23 +-
>  arch/x86/include/asm/fpu/xstate.h             |   6 +-
>  arch/x86/include/asm/idtentry.h               |   4 +
>  arch/x86/include/asm/mman.h                   |  24 +
>  arch/x86/include/asm/mmu_context.h            |   2 +
>  arch/x86/include/asm/msr-index.h              |  20 +
>  arch/x86/include/asm/page_types.h             |   7 +
>  arch/x86/include/asm/pgtable.h                | 302 ++++++++++--
>  arch/x86/include/asm/pgtable_types.h          |  48 +-
>  arch/x86/include/asm/processor.h              |   6 +
>  arch/x86/include/asm/special_insns.h          |  30 ++
>  arch/x86/include/asm/trap_pf.h                |   2 +
>  arch/x86/include/uapi/asm/mman.h              |   8 +-
>  arch/x86/include/uapi/asm/prctl.h             |  10 +
>  arch/x86/include/uapi/asm/processor-flags.h   |   2 +
>  arch/x86/kernel/Makefile                      |   1 +
>  arch/x86/kernel/cpu/common.c                  |  20 +
>  arch/x86/kernel/cpu/cpuid-deps.c              |   1 +
>  arch/x86/kernel/elf_feature_prctl.c           |  72 +++
>  arch/x86/kernel/fpu/xstate.c                  | 167 ++++++-
>  arch/x86/kernel/idt.c                         |   4 +
>  arch/x86/kernel/process.c                     |  17 +-
>  arch/x86/kernel/process_64.c                  |   2 +
>  arch/x86/kernel/shstk.c                       | 446 ++++++++++++++++++
>  arch/x86/kernel/signal.c                      |  13 +
>  arch/x86/kernel/signal_compat.c               |   2 +-
>  arch/x86/kernel/traps.c                       |  62 +++
>  arch/x86/mm/fault.c                           |  19 +
>  arch/x86/mm/mmap.c                            |  48 ++
>  arch/x86/mm/pat/set_memory.c                  |   2 +-
>  arch/x86/mm/pgtable.c                         |  25 +
>  drivers/gpu/drm/i915/gvt/gtt.c                |   2 +-
>  fs/aio.c                                      |   2 +-
>  fs/proc/task_mmu.c                            |   3 +
>  include/linux/mm.h                            |  19 +-
>  include/linux/pgtable.h                       |   8 +
>  include/linux/syscalls.h                      |   1 +
>  include/uapi/asm-generic/siginfo.h            |   3 +-
>  include/uapi/asm-generic/unistd.h             |   2 +-
>  ipc/shm.c                                     |   2 +-
>  kernel/sys_ni.c                               |   1 +
>  mm/gup.c                                      |  16 +-
>  mm/huge_memory.c                              |  27 +-
>  mm/memory.c                                   |   5 +-
>  mm/migrate.c                                  |   3 +-
>  mm/mmap.c                                     |  15 +-
>  mm/mprotect.c                                 |   9 +-
>  mm/nommu.c                                    |   4 +-
>  mm/util.c                                     |   2 +-
>  tools/testing/selftests/x86/Makefile          |   9 +-
>  .../selftests/x86/test_map_shadow_stack.c     |  75 +++
>  69 files changed, 1797 insertions(+), 92 deletions(-)
>  create mode 100644 Documentation/x86/cet.rst
>  create mode 100644 arch/x86/include/asm/cet.h
>  create mode 100644 arch/x86/include/asm/mman.h
>  create mode 100644 arch/x86/kernel/elf_feature_prctl.c
>  create mode 100644 arch/x86/kernel/shstk.c
>  create mode 100644 tools/testing/selftests/x86/test_map_shadow_stack.c
> 
> 
> base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
> -- 
> 2.17.1

Adrian Reber Feb. 7, 2022, 7:20 a.m. UTC | #14

On Sun, Feb 06, 2022 at 08:42:03PM +0200, Mike Rapoport wrote:
> (added more CRIU people)

Thanks, Mike.

> On Sun, Jan 30, 2022 at 01:18:03PM -0800, Rick Edgecombe wrote:
> > This is a slight reboot of the userspace CET series. I will be taking over the 
> > series from Yu-cheng. Per some internal recommendations, I’ve reset the version
> > number and am calling it a new series. Hopefully, it doesn’t cause confusion.
> > 
> > The new plan is to upstream only userspace Shadow Stack support at this point. 
> > IBT can follow later, but for now I’ll focus solely on the most in-demand and
> > widely available (with the feature on AMD CPUs now) part of CET.
> > 
> > I thought as part of this reset, it might be useful to more fully write-up the 
> > design and summarize the history of the previous CET series. So this slightly
> > long cover letter does that. The "Updates" section has the changes, if anyone
> > doesn't want the history.

[...]

> > 	CRIU Support
> > 	------------
> > 	In the past there was some speculation on the mailing list about 
> > 	whether CRIU would need to be taught about CET. It turns out, it does. 
> > 	The first issue hit is that CRIU calls sigreturn directly from its 
> > 	“parasite code” that it injects into the dumper process. This violates
> > 	this shadow stack implementation’s protection that intends to prevent
> > 	attackers from doing this.
> > 
> > 	With so many packages already enabled with shadow stack, there is 
> > 	probably desire to make it work seamlessly. But in the meantime if 
> > 	distros want to support shadow stack and CRIU, users could manually 
> > 	disabled shadow stack via “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for 
> > 	a process they will wants to dump. It’s not ideal.
> > 
> > 	I’d like to hear what people think about having shadow stack in the 
> > 	kernel without this resolved. Nothing would change for any users until 
> > 	they enable shadow stack in the kernel and update to a glibc configured
> > 	with CET. Should CRIU userspace be solved before kernel support?

From the CRIU side I can say that I would definitely like to see this
resolved. CRIU just went through a similar exercise with rseq() being
enabled in glibc and CI broke all around for us and other projects
relying on CRIU. Although rseq() was around for a long time we were not
aware of it but luckily 5.13 introduced a way to handle it for CRIU with
ptrace. An environment variable existed but did not really help when
CRIU is called somewhere in the middle of the container software stack.

From my point of view a solution not involving an environment variable
would definitely be preferred.

		Adrian

Florian Weimer Feb. 7, 2022, 10:22 a.m. UTC | #15

* David Laight:

> Was there any 'spare' space in struct jmpbuf ?

jmp_buf in glibc looks like this:

(gdb) ptype/o jmp_buf
type = struct __jmp_buf_tag {
/*      0      |      64 */    __jmp_buf __jmpbuf;
/*     64      |       4 */    int __mask_was_saved;
/* XXX  4-byte hole      */
/*     72      |     128 */    __sigset_t __saved_mask;

                               /* total size (bytes):  200 */
                             } [1]
(gdb) ptype/o __jmp_buf
type = long [8]

The glibc ABI reserves space for 1024 signals, something that Linux is
never going to implement.  We can use that space to store a few extra
registers in __save_mask.  There is a complication because the
pthread_cancel unwinding allocates only space for the __jmpbuf member.
Fortunately, we do not need to unwind the shadow stack for thread
cancellation, so we don't need that extra space in that case.

Thanks,
Florian

Dave Hansen Feb. 7, 2022, 4:30 p.m. UTC | #16

On 2/6/22 23:20, Adrian Reber wrote:
>>> 	CRIU Support
>>> 	------------
>>> 	In the past there was some speculation on the mailing list about 
>>> 	whether CRIU would need to be taught about CET. It turns out, it does. 
>>> 	The first issue hit is that CRIU calls sigreturn directly from its 
>>> 	“parasite code” that it injects into the dumper process. This violates
>>> 	this shadow stack implementation’s protection that intends to prevent
>>> 	attackers from doing this.
...
>>From the CRIU side I can say that I would definitely like to see this
> resolved. CRIU just went through a similar exercise with rseq() being
> enabled in glibc and CI broke all around for us and other projects
> relying on CRIU. Although rseq() was around for a long time we were not
> aware of it but luckily 5.13 introduced a way to handle it for CRIU with
> ptrace. An environment variable existed but did not really help when
> CRIU is called somewhere in the middle of the container software stack.
> 
>>From my point of view a solution not involving an environment variable
> would definitely be preferred.

Have there been things like this for CRIU in the past?  Something where
CRIU needs control but that's also security-sensitive?

Any thoughts on how you would _like_ to see this resolved?

Andy Lutomirski Feb. 8, 2022, 1:31 a.m. UTC | #17

On 2/5/22 12:15, Edgecombe, Rick P wrote:
> On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote:
>> On Sat, Feb 5, 2022 at 5:27 AM David Laight <David.Laight@aculab.com>
>> wrote:
>>>
>>> From: Edgecombe, Rick P
>>>> Sent: 04 February 2022 01:08
>>>> Hi Thomas,
>>>>
>>>> Thanks for feedback on the plan.
>>>>
>>>> On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
>>>>>> Until now, the enabling effort was trying to support both
>>>>>> Shadow
>>>>>> Stack and IBT.
>>>>>> This history will focus on a few areas of the shadow stack
>>>>>> development history
>>>>>> that I thought stood out.
>>>>>>
>>>>>>         Signals
>>>>>>         -------
>>>>>>         Originally signals placed the location of the shadow
>>>>>> stack
>>>>>> restore
>>>>>>         token inside the saved state on the stack. This was
>>>>>> problematic from a
>>>>>>         past ABI promises perspective. So the restore location
>>>>>> was
>>>>>> instead just
>>>>>>         assumed from the shadow stack pointer. This works
>>>>>> because in
>>>>>> normal
>>>>>>         allowed cases of calling sigreturn, the shadow stack
>>>>>> pointer
>>>>>> should be
>>>>>>         right at the restore token at that time. There is no
>>>>>> alternate shadow
>>>>>>         stack support. If an alt shadow stack is added later
>>>>>> we
>>>>>> would
>>>>>>         need to
>>>>>
>>>>> So how is that going to work? altstack is not an esoteric
>>>>> corner
>>>>> case.
>>>>
>>>> My understanding is that the main usages for the signal stack
>>>> were
>>>> handling stack overflows and corruption. Since the shadow stack
>>>> only
>>>> contains return addresses rather than large stack allocations,
>>>> and is
>>>> not generally writable or pivotable, I thought there was a good
>>>> possibility an alt shadow stack would not end up being especially
>>>> useful. Does it seem like reasonable guesswork?
>>>
>>> The other 'problem' is that it is valid to longjump out of a signal
>>> handler.
>>> These days you have to use siglongjmp() not longjmp() but it is
>>> still used.
>>>
>>> It is probably also valid to use siglongjmp() to jump from a nested
>>> signal handler into the outer handler.
>>> Given both signal handlers can have their own stack, there can be
>>> three
>>> stacks involved.
> 
> So the scenario is?
> 
> 1. Handle signal 1
> 2. sigsetjmp()
> 3. signalstack()
> 4. Handle signal 2 on alt stack
> 5. siglongjmp()
> 
> I'll check that it is covered by the tests, but I think it should work
> in this series that has no alt shadow stack. I have only done a high
> level overview of how the shadow stack stuff, that doesn't involve the
> kernel, works in glibc. Sounds like I'll need to do a deeper dive.
> 
>>>
>>> I think the shadow stack pointer has to be in ucontext - which also
>>> means the application can change it before returning from a signal.
> 
> Yes we might need to change it to support alt shadow stacks. Can you
> elaborate why you think it has to be in ucontext? I was thinking of
> looking at three options for storing the ssp:
>   - Stored in the shadow stack like a token using WRUSS from the kernel.
>   - Stored on the kernel side using a hashmap that maps ucontext or
>     sigframe userspace address to ssp (this is of course similar to
>     storing in ucontext, except that the user can’t change the ssp).
>   - Stored writable in userspace in ucontext.
> 
> But in this version, without alt shadow stacks, the shadow stack
> pointer is not stored in ucontext. This causes the limitation that
> userspace can only call sigreturn when it has returned back to a point
> where there is a restore token on the shadow stack (which was placed
> there by the kernel).

I'll reply here and maybe cover multiple things.

User code already needs to rewind the regular stack to call sigreturn -- 
sigreturn find the signal frame based on ESP/RSP.  So if you call it 
from the wrong place, you go boom.  I think that the Linux SHSTK ABI 
should have the property that no amount of tampering with just the 
ucontext and associated structures can cause sigreturn to redirect to 
the wrong IP -- there should be something on the shadow stack that also 
gets verified in sigreturn.  IIRC the series does this, but it's been a 
while.  The post-sigreturn SSP should be entirely implied by 
pre-sigreturn SSP (or perhaps something on the shadow stack), so, in the 
absence of an altshadowstack feature, no ucontext changes should be needed.

We can also return from a signal or from more than one signal at once, 
as above, using siglongjmp.  It seems like this should Just Work (tm), 
at least in the absence of altshadowstack.

So this leaves altshadowstack.  If we want to allow userspace to handle 
a shstk overflow, I think we need altshadowstack.  And I can easily 
imagine signal handling in a coroutine or user-threading evironment (Go? 
UMCG or whatever it's called?) wanting this.  As noted, this obnoxious 
Andy person didn't like putting any shstk-related extensions in the FPU 
state.

For better or for worse, altshadowstack is (I think) fundamentally a new 
API.  No amount of ucontext magic is going to materialize an entire 
shadow stack out of nowhere when someone calls sigaltstack().  So the 
questions are: should we support altshadowstack from day one and, if so, 
what should it look like?

If we want to be clever, we could attempt to make altstadowstack 
compatible with RSTORSSP.  Signal delivery pushes a restore token to the 
old stack (hah!  what if the old stack is full?) and pushes the RSTORSSP 
busy magic to the new stack, and sigreturn inverts it.  Code that wants 
to return without sigreturn does it manually with RSTORSSP.  (Assuming 
that I've understood the arcane RSTORSSP sequence right.  Intel wins 
major points for documentation quality here.)  Or we could invent our 
own scheme.  In either case, I don't immediately see any reason that the 
ucontext needs to contain a shadow stack pointer.

There's a delightful wart to consider, though.  siglongjmp, at least as 
currently envisioned, can't return off an altshadowstack: the whole 
point of the INCSSP distance restrictions to to avoid incrementing right 
off the top of the current stack, but siglongjmp off an altshadowstack 
fundamentally switches stacks.  So either siglongjmp off an 
altshadowstack needs to be illegal or it needs to work differently.  (By 
incssp-ing to the top of the altshadowstack, then switching, then 
incssp-ing some more?  How does it even find the top of the current 
altshadowstack?)  And the plot thickens if one tries to siglongjmp off 
two nested altshadowstack-using signals in a single call.   Fortunately, 
since altshadowstack is a new API, it's not entirely crazy to have 
different rules.

So I don't have a complete or even almost complete design in mind, but I 
think we do need to make a conscious decision either to design this 
right or to skip it for v1.

As for CRIU, I don't think anyone really expects a new kernel, running 
new userspace that takes advantage of features in the new kernel, to 
work with old CRIU.  Upgrading to a SHSTK kernel should still allow 
using CRIU with non-SHSTK userspace, but I don't see how it's possible 
for CRIU to handle SHSTK without updates.  We should certainly do our 
best to make CRIU's life easy, though.

  This doesn’t mean it can’t switch to a different
> shadow stack or handle a nested signal, but it limits the possibility
> for calling sigreturn with a totally different sigframe (like CRIU and
> SROP attacks do). It should hopefully be a helpful, protective
> limitation for most apps and I'm hoping CRIU can be fixed without
> removing it.
> 
> I am not aware of other limitations to signals (besides normal shadow
> stack enforcement), but I could be missing it. And people's skepticism
> is making me want to go back over it with more scrutiny.
> 
>>> In much the same way as all the segment registers can be changed
>>> leading to all the nasty bugs when the final 'return to user' code
>>> traps in kernel when loading invalid segment registers or executing
>>> iret.
> 
> I don't think this is as difficult to avoid because userspace ssp has
> its own register that should not be accessed at that point, but I have
> not given this aspect enough analysis. Thanks for bringing it up.
> 
>>>
>>> Hmmm... do shadow stacks mean that longjmp() has to be a system
>>> call?
>>
>> No.  setjmp/longjmp save and restore shadow stack pointer.
>>
> 
> It sounds like it would help to write up in a lot more detail exactly
> how all the signal and specialer stack manipulation scenarios work in
> glibc.
>

Edgecombe, Rick P Feb. 8, 2022, 1:46 a.m. UTC | #18

On Sun, 2022-02-06 at 13:42 +0000, David Laight wrote:
> > I don't think this is as difficult to avoid because userspace ssp
> > has
> > its own register that should not be accessed at that point, but I
> > have
> > not given this aspect enough analysis. Thanks for bringing it up.
> 
> So the user ssp isn't saved (or restored) by the trap entry/exit.
> So it needs to be saved by the context switch code?
> Much like the user segment registers?
> So you are likely to get the same problems if restoring it can fault
> in kernel (eg for a non-canonical address).

PL3_SSP is lazily saved and restored by the FPU supervisor xsave code,
which has its buffer in kernel memory. For the most part it is
userspace instructions that use this register and they can only modify
it in limited ways.

It does look like IRET can cause a #CP if the PL3 SSP is not aligned,
but only after RIP and CPL are set back to userspace. I'm not confident
enough interpreting the specs to assert the specific behavior and will
follow up internally to clarify.

Mike Rapoport Feb. 8, 2022, 9:16 a.m. UTC | #19

On Mon, Feb 07, 2022 at 08:30:50AM -0800, Dave Hansen wrote:
> On 2/6/22 23:20, Adrian Reber wrote:
> >>> 	CRIU Support
> >>> 	------------
> >>> 	In the past there was some speculation on the mailing list about 
> >>> 	whether CRIU would need to be taught about CET. It turns out, it does. 
> >>> 	The first issue hit is that CRIU calls sigreturn directly from its 
> >>> 	“parasite code” that it injects into the dumper process. This violates
> >>> 	this shadow stack implementation’s protection that intends to prevent
> >>> 	attackers from doing this.
> ...
> >>From the CRIU side I can say that I would definitely like to see this
> > resolved. CRIU just went through a similar exercise with rseq() being
> > enabled in glibc and CI broke all around for us and other projects
> > relying on CRIU. Although rseq() was around for a long time we were not
> > aware of it but luckily 5.13 introduced a way to handle it for CRIU with
> > ptrace. An environment variable existed but did not really help when
> > CRIU is called somewhere in the middle of the container software stack.
> > 
> >>From my point of view a solution not involving an environment variable
> > would definitely be preferred.
> 
> Have there been things like this for CRIU in the past?  Something where
> CRIU needs control but that's also security-sensitive?

Generally CRIU requires (almost) root privileges to work, but I don't think
it handles something as security sensitive and restrictive as shadow stacks. 
 
> Any thoughts on how you would _like_ to see this resolved?

Ideally, CRIU will need a knob that will tell the kernel/CET machinery
where the next RET will jump, along the lines of
restore_signal_shadow_stack() AFAIU.

But such a knob will immediately reduce the security value of the entire
thing, and I don't have good ideas how to deal with it :(

Cyrill Gorcunov Feb. 8, 2022, 9:29 a.m. UTC | #20

On Tue, Feb 08, 2022 at 11:16:51AM +0200, Mike Rapoport wrote:
>  
> > Any thoughts on how you would _like_ to see this resolved?
> 
> Ideally, CRIU will need a knob that will tell the kernel/CET machinery
> where the next RET will jump, along the lines of
> restore_signal_shadow_stack() AFAIU.
> 
> But such a knob will immediately reduce the security value of the entire
> thing, and I don't have good ideas how to deal with it :(

Probably a kind of latch in the task_struct which would trigger off once
returt to a different address happened, thus we would be able to jump inside
paratite code. Of course such trigger should be available under proper
capability only.

Thomas Gleixner Feb. 8, 2022, 9:31 a.m. UTC | #21

On Mon, Feb 07 2022 at 17:31, Andy Lutomirski wrote:
> So this leaves altshadowstack.  If we want to allow userspace to handle 
> a shstk overflow, I think we need altshadowstack.  And I can easily 
> imagine signal handling in a coroutine or user-threading evironment (Go? 
> UMCG or whatever it's called?) wanting this.  As noted, this obnoxious 
> Andy person didn't like putting any shstk-related extensions in the FPU 
> state.
>
> For better or for worse, altshadowstack is (I think) fundamentally a new 
> API.  No amount of ucontext magic is going to materialize an entire 
> shadow stack out of nowhere when someone calls sigaltstack().  So the 
> questions are: should we support altshadowstack from day one and, if so, 
> what should it look like?

I think we should support them from day one.

> So I don't have a complete or even almost complete design in mind, but I 
> think we do need to make a conscious decision either to design this 
> right or to skip it for v1.

Skipping it might create a fundamental design fail situation as it might
require changes to the shadow stack signal handling in general which
becomes a nightmare once a non-altstack API is exposed.

> As for CRIU, I don't think anyone really expects a new kernel, running 
> new userspace that takes advantage of features in the new kernel, to 
> work with old CRIU.

Yes, CRIU needs updates, but what ensures that CRIU managed user space
does not use SHSTK if CRIU is not updated yet?

> Upgrading to a SHSTK kernel should still allow using CRIU with
> non-SHSTK userspace, but I don't see how it's possible for CRIU to
> handle SHSTK without updates.  We should certainly do our best to make
> CRIU's life easy, though.

Handling CRIU with SHSTK enabled has to be part of the overall design
otherwise we'll either end up with horrible hacks or with a requirement
to change the V1 UAPI....

Thanks,

        tglx

Andy Lutomirski Feb. 8, 2022, 4:15 p.m. UTC | #22

On Tue, Feb 8, 2022, at 1:31 AM, Thomas Gleixner wrote:
> On Mon, Feb 07 2022 at 17:31, Andy Lutomirski wrote:
>> So this leaves altshadowstack.  If we want to allow userspace to handle 
>> a shstk overflow, I think we need altshadowstack.  And I can easily 
>> imagine signal handling in a coroutine or user-threading evironment (Go? 
>> UMCG or whatever it's called?) wanting this.  As noted, this obnoxious 
>> Andy person didn't like putting any shstk-related extensions in the FPU 
>> state.
>>
>> For better or for worse, altshadowstack is (I think) fundamentally a new 
>> API.  No amount of ucontext magic is going to materialize an entire 
>> shadow stack out of nowhere when someone calls sigaltstack().  So the 
>> questions are: should we support altshadowstack from day one and, if so, 
>> what should it look like?
>
> I think we should support them from day one.
>
>> So I don't have a complete or even almost complete design in mind, but I 
>> think we do need to make a conscious decision either to design this 
>> right or to skip it for v1.
>
> Skipping it might create a fundamental design fail situation as it might
> require changes to the shadow stack signal handling in general which
> becomes a nightmare once a non-altstack API is exposed.

It would also expose a range of kernels in which shstk is on but programs that want altshadowstack don't have it.  That would be annoying.

>
>> As for CRIU, I don't think anyone really expects a new kernel, running 
>> new userspace that takes advantage of features in the new kernel, to 
>> work with old CRIU.
>
> Yes, CRIU needs updates, but what ensures that CRIU managed user space
> does not use SHSTK if CRIU is not updated yet?

In some sense this is like any other feature.  If a program uses timerfd but CRIU doesn't support timerfd, then it won't work.  SHSTK is a bit unique because it's likely that all programs on a system will start using it all at once.

Andy Lutomirski Feb. 8, 2022, 4:21 p.m. UTC | #23

On Tue, Feb 8, 2022, at 1:29 AM, Cyrill Gorcunov wrote:
> On Tue, Feb 08, 2022 at 11:16:51AM +0200, Mike Rapoport wrote:
>>  
>> > Any thoughts on how you would _like_ to see this resolved?
>> 
>> Ideally, CRIU will need a knob that will tell the kernel/CET machinery
>> where the next RET will jump, along the lines of
>> restore_signal_shadow_stack() AFAIU.
>> 
>> But such a knob will immediately reduce the security value of the entire
>> thing, and I don't have good ideas how to deal with it :(
>
> Probably a kind of latch in the task_struct which would trigger off once
> returt to a different address happened, thus we would be able to jump inside
> paratite code. Of course such trigger should be available under proper
> capability only.

I'm not fully in touch with how parasite, etc works.  Are we talking about save or restore?  If it's restore, what exactly does CRIU need to do?  Is it just that CRIU needs to return out from its resume code into the to-be-resumed program without tripping CET?  Would it be acceptable for CRIU to require that at least one shstk slot be free at save time?  Or do we need a mechanism to atomically switch to a completely full shadow stack at resume?

Off the top of my head, a sigreturn (or sigreturn-like mechanism) that is intended for use for altshadowstack could safely verify a token on the altshdowstack, possibly compare to something in ucontext (or not -- this isn't clearly necessary) and switch back to the previous stack.  CRIU could use that too.  Obviously CRIU will need a way to populate the relevant stacks, but WRUSS can be used for that, and I think this is a fundamental requirement for CRIU -- CRIU restore absolutely needs a way to write the saved shadow stack data into the shadow stack.

So I think the only special capability that CRIU really needs is WRUSS, and we need to wire that up anyway.

Cyrill Gorcunov Feb. 8, 2022, 5:02 p.m. UTC | #24

On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
> >> But such a knob will immediately reduce the security value of the entire
> >> thing, and I don't have good ideas how to deal with it :(
> >
> > Probably a kind of latch in the task_struct which would trigger off once
> > returt to a different address happened, thus we would be able to jump inside
> > paratite code. Of course such trigger should be available under proper
> > capability only.
> 
> I'm not fully in touch with how parasite, etc works.  Are we talking about save or restore?

We use parasite code in question during checkpoint phase as far as I remember.
push addr/lret trick is used to run "injected" code (code injection itself is
done via ptrace) in compat mode at least. Dima, Andrei, I didn't look into this code
for years already, do we still need to support compat mode at all?

> If it's restore, what exactly does CRIU need to do?  Is it just that CRIU needs to return
> out from its resume code into the to-be-resumed program without tripping CET?  Would it
> be acceptable for CRIU to require that at least one shstk slot be free at save time?
> Or do we need a mechanism to atomically switch to a completely full shadow stack at resume?
> 
> Off the top of my head, a sigreturn (or sigreturn-like mechanism) that is intended for
> use for altshadowstack could safely verify a token on the altshdowstack, possibly
> compare to something in ucontext (or not -- this isn't clearly necessary) and switch
> back to the previous stack.  CRIU could use that too.  Obviously CRIU will need a way
> to populate the relevant stacks, but WRUSS can be used for that, and I think this
> is a fundamental requirement for CRIU -- CRIU restore absolutely needs a way to write
> the saved shadow stack data into the shadow stack.
> 
> So I think the only special capability that CRIU really needs is WRUSS, and
> we need to wire that up anyway.

Thanks for these notes, Andy! I can't provide any sane answer here since didn't
read tech spec for this feature yet :-)

Edgecombe, Rick P Feb. 9, 2022, 2:18 a.m. UTC | #25

On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
> > > > But such a knob will immediately reduce the security value of
> > > > the entire
> > > > thing, and I don't have good ideas how to deal with it :(
> > > 
> > > Probably a kind of latch in the task_struct which would trigger
> > > off once
> > > returt to a different address happened, thus we would be able to
> > > jump inside
> > > paratite code. Of course such trigger should be available under
> > > proper
> > > capability only.
> > 
> > I'm not fully in touch with how parasite, etc works.  Are we
> > talking about save or restore?
> 
> We use parasite code in question during checkpoint phase as far as I
> remember.
> push addr/lret trick is used to run "injected" code (code injection
> itself is
> done via ptrace) in compat mode at least. Dima, Andrei, I didn't look
> into this code
> for years already, do we still need to support compat mode at all?
> 
> > If it's restore, what exactly does CRIU need to do?  Is it just
> > that CRIU needs to return
> > out from its resume code into the to-be-resumed program without
> > tripping CET?  Would it
> > be acceptable for CRIU to require that at least one shstk slot be
> > free at save time?
> > Or do we need a mechanism to atomically switch to a completely full
> > shadow stack at resume?
> > 
> > Off the top of my head, a sigreturn (or sigreturn-like mechanism)
> > that is intended for
> > use for altshadowstack could safely verify a token on the
> > altshdowstack, possibly
> > compare to something in ucontext (or not -- this isn't clearly
> > necessary) and switch
> > back to the previous stack.  CRIU could use that too.  Obviously
> > CRIU will need a way
> > to populate the relevant stacks, but WRUSS can be used for that,
> > and I think this
> > is a fundamental requirement for CRIU -- CRIU restore absolutely
> > needs a way to write
> > the saved shadow stack data into the shadow stack.

Still wrapping my head around the CRIU save and restore steps, but
another general approach might be to give ptrace the ability to
temporarily pause/resume/set CET enablement and SSP for a stopped
thread. Then injected code doesn't need to jump through any hoops or
possibly run into road blocks. I'm not sure how much this opens things
up if the thread has to be stopped...

Cyrill, could it fit into the CRIU pause and resume flow? What action
causes the final resuming of execution of the restored process for
checkpointing and for restore? Wondering if we could somehow make CET
re-enable exactly then.

And I guess this also needs a way to create shadow stack allocations at
a specific address to match where they were in the dumped process. That
is missing in this series.


> > 
> > So I think the only special capability that CRIU really needs is
> > WRUSS, and
> > we need to wire that up anyway.
> 
> Thanks for these notes, Andy! I can't provide any sane answer here
> since didn't
> read tech spec for this feature yet :-)

Cyrill Gorcunov Feb. 9, 2022, 6:43 a.m. UTC | #26

On Wed, Feb 09, 2022 at 02:18:42AM +0000, Edgecombe, Rick P wrote:
...
> 
> Still wrapping my head around the CRIU save and restore steps, but
> another general approach might be to give ptrace the ability to
> temporarily pause/resume/set CET enablement and SSP for a stopped
> thread. Then injected code doesn't need to jump through any hoops or
> possibly run into road blocks. I'm not sure how much this opens things
> up if the thread has to be stopped...
> 
> Cyrill, could it fit into the CRIU pause and resume flow? What action
> causes the final resuming of execution of the restored process for
> checkpointing and for restore? Wondering if we could somehow make CET
> re-enable exactly then.
> 
> And I guess this also needs a way to create shadow stack allocations at
> a specific address to match where they were in the dumped process. That
> is missing in this series.

Thanks Rick! This sounds like an option. I need a couple of days to refresh
my memory about criu internals. Let me CC a few current active criu developers
(CC list is already big enough though:), maybe this will speedup the procedure.

Mike Rapoport Feb. 9, 2022, 10:53 a.m. UTC | #27

Hi Rick,

On Wed, Feb 09, 2022 at 02:18:42AM +0000, Edgecombe, Rick P wrote:
> On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> > On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
> > > > > But such a knob will immediately reduce the security value of
> > > > > the entire
> > > > > thing, and I don't have good ideas how to deal with it :(
> > > > 
> > > > Probably a kind of latch in the task_struct which would trigger
> > > > off once
> > > > returt to a different address happened, thus we would be able to
> > > > jump inside
> > > > paratite code. Of course such trigger should be available under
> > > > proper
> > > > capability only.
> > > 
> > > I'm not fully in touch with how parasite, etc works.  Are we
> > > talking about save or restore?
> > 
> > We use parasite code in question during checkpoint phase as far as I
> > remember.
> > push addr/lret trick is used to run "injected" code (code injection
> > itself is
> > done via ptrace) in compat mode at least. Dima, Andrei, I didn't look
> > into this code
> > for years already, do we still need to support compat mode at all?
> > 
> > > If it's restore, what exactly does CRIU need to do?  Is it just
> > > that CRIU needs to return
> > > out from its resume code into the to-be-resumed program without
> > > tripping CET?  Would it
> > > be acceptable for CRIU to require that at least one shstk slot be
> > > free at save time?
> > > Or do we need a mechanism to atomically switch to a completely full
> > > shadow stack at resume?
> > > 
> > > Off the top of my head, a sigreturn (or sigreturn-like mechanism)
> > > that is intended for
> > > use for altshadowstack could safely verify a token on the
> > > altshdowstack, possibly
> > > compare to something in ucontext (or not -- this isn't clearly
> > > necessary) and switch
> > > back to the previous stack.  CRIU could use that too.  Obviously
> > > CRIU will need a way
> > > to populate the relevant stacks, but WRUSS can be used for that,
> > > and I think this
> > > is a fundamental requirement for CRIU -- CRIU restore absolutely
> > > needs a way to write
> > > the saved shadow stack data into the shadow stack.
> 
> Still wrapping my head around the CRIU save and restore steps, but
> another general approach might be to give ptrace the ability to
> temporarily pause/resume/set CET enablement and SSP for a stopped
> thread. Then injected code doesn't need to jump through any hoops or
> possibly run into road blocks. I'm not sure how much this opens things
> up if the thread has to be stopped...
 
IIRC, criu dump does something like this:
* Stop the process being dumped (victim) with ptrace
* Inject parasite code and data into the victim, again with ptrace.
  Among other things the parasite data contains a sigreturn frame with
  saved victim state.
* Resume the victim process, which will run parasite code now.
* When parasite finishes it uses that frame to sigreturn to normal victim
  execution

So, my feeling is that for dump side WRUSS should be enough.

> Cyrill, could it fit into the CRIU pause and resume flow? What action
> causes the final resuming of execution of the restored process for
> checkpointing and for restore? Wondering if we could somehow make CET
> re-enable exactly then.
> 
> And I guess this also needs a way to create shadow stack allocations at
> a specific address to match where they were in the dumped process. That
> is missing in this series.

Yes, criu restore will need to recreate shadow stack mappings. Currently,
we recreate the restored process (target) address space based on
/proc/pid/maps and /proc/pid/smaps. CRIU preserves the virtual addresses
and VMA flags. The relevant steps of restore process can be summarised as:
* Clone() the target process tree
* Recreate VMAs with the needed size and flags, but not necessarily at the
  correct place yet
* Partially populate memory data from the saved images
* Move VMAs to their exact addresses
* Complete restoring the data
* Create a frame for sigreturn and jump to the target.

Here, the stack used after sigreturn contains the data that was captured
during dump and it entirely different from what shadow stack will contain.

There are several points when the target threads are stopped, so
pausing/resuming CET may help.
 
> > > So I think the only special capability that CRIU really needs is
> > > WRUSS, and
> > > we need to wire that up anyway.
> > 
> > Thanks for these notes, Andy! I can't provide any sane answer here
> > since didn't
> > read tech spec for this feature yet :-)
> 
> 
> 
>

Andy Lutomirski Feb. 10, 2022, 2:37 a.m. UTC | #28

On 2/8/22 18:18, Edgecombe, Rick P wrote:
> On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
>> On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
>>>>> But such a knob will immediately reduce the security value of
>>>>> the entire
>>>>> thing, and I don't have good ideas how to deal with it :(
>>>>
>>>> Probably a kind of latch in the task_struct which would trigger
>>>> off once
>>>> returt to a different address happened, thus we would be able to
>>>> jump inside
>>>> paratite code. Of course such trigger should be available under
>>>> proper
>>>> capability only.
>>>
>>> I'm not fully in touch with how parasite, etc works.  Are we
>>> talking about save or restore?
>>
>> We use parasite code in question during checkpoint phase as far as I
>> remember.
>> push addr/lret trick is used to run "injected" code (code injection
>> itself is
>> done via ptrace) in compat mode at least. Dima, Andrei, I didn't look
>> into this code
>> for years already, do we still need to support compat mode at all?
>>
>>> If it's restore, what exactly does CRIU need to do?  Is it just
>>> that CRIU needs to return
>>> out from its resume code into the to-be-resumed program without
>>> tripping CET?  Would it
>>> be acceptable for CRIU to require that at least one shstk slot be
>>> free at save time?
>>> Or do we need a mechanism to atomically switch to a completely full
>>> shadow stack at resume?
>>>
>>> Off the top of my head, a sigreturn (or sigreturn-like mechanism)
>>> that is intended for
>>> use for altshadowstack could safely verify a token on the
>>> altshdowstack, possibly
>>> compare to something in ucontext (or not -- this isn't clearly
>>> necessary) and switch
>>> back to the previous stack.  CRIU could use that too.  Obviously
>>> CRIU will need a way
>>> to populate the relevant stacks, but WRUSS can be used for that,
>>> and I think this
>>> is a fundamental requirement for CRIU -- CRIU restore absolutely
>>> needs a way to write
>>> the saved shadow stack data into the shadow stack.
> 
> Still wrapping my head around the CRIU save and restore steps, but
> another general approach might be to give ptrace the ability to
> temporarily pause/resume/set CET enablement and SSP for a stopped
> thread. Then injected code doesn't need to jump through any hoops or
> possibly run into road blocks. I'm not sure how much this opens things
> up if the thread has to be stopped...

Hmm, that's maybe not insane.

An alternative would be to add a bona fide ptrace call-a-function 
mechanism.  I can think of two potentially usable variants:

1. Straight call.  PTRACE_CALL_FUNCTION(addr) just emulates CALL addr, 
shadow stack push and all.

2. Signal-style.  PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal 
frame just like a real signal is being delivered with the specified 
handler.  There could be a variant to opt-in to also using a specified 
altstack and altshadowstack.

2 would be more expensive but would avoid the need for much in the way 
of asm magic.  The injected code could be plain C (or Rust or Zig or 
whatever).

All of this only really handles save, not restore.  I don't understand 
restore enough to fully understand the issue.

--Andy

H.J. Lu Feb. 10, 2022, 2:53 a.m. UTC | #29

On Wed, Feb 9, 2022 at 6:37 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On 2/8/22 18:18, Edgecombe, Rick P wrote:
> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> >> On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
> >>>>> But such a knob will immediately reduce the security value of
> >>>>> the entire
> >>>>> thing, and I don't have good ideas how to deal with it :(
> >>>>
> >>>> Probably a kind of latch in the task_struct which would trigger
> >>>> off once
> >>>> returt to a different address happened, thus we would be able to
> >>>> jump inside
> >>>> paratite code. Of course such trigger should be available under
> >>>> proper
> >>>> capability only.
> >>>
> >>> I'm not fully in touch with how parasite, etc works.  Are we
> >>> talking about save or restore?
> >>
> >> We use parasite code in question during checkpoint phase as far as I
> >> remember.
> >> push addr/lret trick is used to run "injected" code (code injection
> >> itself is
> >> done via ptrace) in compat mode at least. Dima, Andrei, I didn't look
> >> into this code
> >> for years already, do we still need to support compat mode at all?
> >>
> >>> If it's restore, what exactly does CRIU need to do?  Is it just
> >>> that CRIU needs to return
> >>> out from its resume code into the to-be-resumed program without
> >>> tripping CET?  Would it
> >>> be acceptable for CRIU to require that at least one shstk slot be
> >>> free at save time?
> >>> Or do we need a mechanism to atomically switch to a completely full
> >>> shadow stack at resume?
> >>>
> >>> Off the top of my head, a sigreturn (or sigreturn-like mechanism)
> >>> that is intended for
> >>> use for altshadowstack could safely verify a token on the
> >>> altshdowstack, possibly
> >>> compare to something in ucontext (or not -- this isn't clearly
> >>> necessary) and switch
> >>> back to the previous stack.  CRIU could use that too.  Obviously
> >>> CRIU will need a way
> >>> to populate the relevant stacks, but WRUSS can be used for that,
> >>> and I think this
> >>> is a fundamental requirement for CRIU -- CRIU restore absolutely
> >>> needs a way to write
> >>> the saved shadow stack data into the shadow stack.
> >
> > Still wrapping my head around the CRIU save and restore steps, but
> > another general approach might be to give ptrace the ability to
> > temporarily pause/resume/set CET enablement and SSP for a stopped
> > thread. Then injected code doesn't need to jump through any hoops or
> > possibly run into road blocks. I'm not sure how much this opens things
> > up if the thread has to be stopped...
>
> Hmm, that's maybe not insane.
>
> An alternative would be to add a bona fide ptrace call-a-function
> mechanism.  I can think of two potentially usable variants:
>
> 1. Straight call.  PTRACE_CALL_FUNCTION(addr) just emulates CALL addr,
> shadow stack push and all.
>
> 2. Signal-style.  PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal
> frame just like a real signal is being delivered with the specified
> handler.  There could be a variant to opt-in to also using a specified
> altstack and altshadowstack.
>
> 2 would be more expensive but would avoid the need for much in the way
> of asm magic.  The injected code could be plain C (or Rust or Zig or
> whatever).
>
> All of this only really handles save, not restore.  I don't understand
> restore enough to fully understand the issue.

FWIW, CET enabled GDB can call a function in a CET enabled process.
Adding Felix who may know more about it.

Willgerodt, Felix Feb. 10, 2022, 1:52 p.m. UTC | #30

> -----Original Message-----
> From: H.J. Lu <hjl.tools@gmail.com>
> Sent: Donnerstag, 10. Februar 2022 03:54
> To: Lutomirski, Andy <luto@kernel.org>; Willgerodt, Felix
> <felix.willgerodt@intel.com>
> Cc: Edgecombe, Rick P <rick.p.edgecombe@intel.com>; gorcunov@gmail.com;
> bsingharora@gmail.com; hpa@zytor.com; Syromiatnikov, Eugene
> <esyr@redhat.com>; peterz@infradead.org; rdunlap@infradead.org;
> keescook@chromium.org; 0x7f454c46@gmail.com;
> dave.hansen@linux.intel.com; kirill.shutemov@linux.intel.com; Eranian,
> Stephane <eranian@google.com>; linux-mm@kvack.org; adrian@lisas.de;
> fweimer@redhat.com; nadav.amit@gmail.com; jannh@google.com;
> avagin@gmail.com; linux-arch@vger.kernel.org; kcc@google.com;
> bp@alien8.de; oleg@redhat.com; pavel@ucw.cz; linux-doc@vger.kernel.org;
> arnd@arndb.de; Moreira, Joao <joao.moreira@intel.com>; tglx@linutronix.de;
> mike.kravetz@oracle.com; x86@kernel.org; Yang, Weijiang
> <weijiang.yang@intel.com>; rppt@kernel.org; Dave.Martin@arm.com;
> john.allen@amd.com; mingo@redhat.com; Hansen, Dave
> <dave.hansen@intel.com>; corbet@lwn.net; linux-kernel@vger.kernel.org;
> linux-api@vger.kernel.org; Shankar, Ravi V <ravi.v.shankar@intel.com>
> Subject: Re: [PATCH 00/35] Shadow stacks for userspace
> 
> On Wed, Feb 9, 2022 at 6:37 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On 2/8/22 18:18, Edgecombe, Rick P wrote:
> > > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> > >> On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
> > >>>>> But such a knob will immediately reduce the security value of
> > >>>>> the entire
> > >>>>> thing, and I don't have good ideas how to deal with it :(
> > >>>>
> > >>>> Probably a kind of latch in the task_struct which would trigger
> > >>>> off once
> > >>>> returt to a different address happened, thus we would be able to
> > >>>> jump inside
> > >>>> paratite code. Of course such trigger should be available under
> > >>>> proper
> > >>>> capability only.
> > >>>
> > >>> I'm not fully in touch with how parasite, etc works.  Are we
> > >>> talking about save or restore?
> > >>
> > >> We use parasite code in question during checkpoint phase as far as I
> > >> remember.
> > >> push addr/lret trick is used to run "injected" code (code injection
> > >> itself is
> > >> done via ptrace) in compat mode at least. Dima, Andrei, I didn't look
> > >> into this code
> > >> for years already, do we still need to support compat mode at all?
> > >>
> > >>> If it's restore, what exactly does CRIU need to do?  Is it just
> > >>> that CRIU needs to return
> > >>> out from its resume code into the to-be-resumed program without
> > >>> tripping CET?  Would it
> > >>> be acceptable for CRIU to require that at least one shstk slot be
> > >>> free at save time?
> > >>> Or do we need a mechanism to atomically switch to a completely full
> > >>> shadow stack at resume?
> > >>>
> > >>> Off the top of my head, a sigreturn (or sigreturn-like mechanism)
> > >>> that is intended for
> > >>> use for altshadowstack could safely verify a token on the
> > >>> altshdowstack, possibly
> > >>> compare to something in ucontext (or not -- this isn't clearly
> > >>> necessary) and switch
> > >>> back to the previous stack.  CRIU could use that too.  Obviously
> > >>> CRIU will need a way
> > >>> to populate the relevant stacks, but WRUSS can be used for that,
> > >>> and I think this
> > >>> is a fundamental requirement for CRIU -- CRIU restore absolutely
> > >>> needs a way to write
> > >>> the saved shadow stack data into the shadow stack.
> > >
> > > Still wrapping my head around the CRIU save and restore steps, but
> > > another general approach might be to give ptrace the ability to
> > > temporarily pause/resume/set CET enablement and SSP for a stopped
> > > thread. Then injected code doesn't need to jump through any hoops or
> > > possibly run into road blocks. I'm not sure how much this opens things
> > > up if the thread has to be stopped...
> >
> > Hmm, that's maybe not insane.
> >
> > An alternative would be to add a bona fide ptrace call-a-function
> > mechanism.  I can think of two potentially usable variants:
> >
> > 1. Straight call.  PTRACE_CALL_FUNCTION(addr) just emulates CALL addr,
> > shadow stack push and all.
> >
> > 2. Signal-style.  PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal
> > frame just like a real signal is being delivered with the specified
> > handler.  There could be a variant to opt-in to also using a specified
> > altstack and altshadowstack.
> >
> > 2 would be more expensive but would avoid the need for much in the way
> > of asm magic.  The injected code could be plain C (or Rust or Zig or
> > whatever).
> >
> > All of this only really handles save, not restore.  I don't understand
> > restore enough to fully understand the issue.
> 
> FWIW, CET enabled GDB can call a function in a CET enabled process.
> Adding Felix who may know more about it.
> 
> 
> --
> H.J.

I don't know much about CRIU or kernel code, so I will stick to explaining
what our GDB patches for CET (not upstream yet) currently do.

GDB does inferior calls by setting the PC to the function it wants to call
and by manipulating the return address. It basically creates a dummy
frame and runs that on top of where it currently is.

To enable this for CET, our GDB CET patches push onto the shstk of the
inferior by writing to the inferiors memory, if it isn't out of range,
and by incrementing the SSP (using NT_X86_CET), both via ptrace.

(GDB also has a command called 'return', which basically returns early from
a function. Here GDB just decrements the SSP via ptrace.)

This was based on earlier versions of the kernel patches.
If the interface needs to change or if new interfaces would be available to
do this better, that is fine from our pov. We didn't upstream this yet.

Also, if you have any concerns with this approach please let me know.

Regards,
Felix

Intel Deutschland GmbH
Registered Address: Am Campeon 10, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de <http://www.intel.de>
Managing Directors: Christin Eisenschmid, Sharon Heck, Tiffany Doon Silva  
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928

Andrei Vagin Feb. 11, 2022, 7:41 a.m. UTC | #31

On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
> On 2/8/22 18:18, Edgecombe, Rick P wrote:
> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> > > On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
> > > > > > But such a knob will immediately reduce the security value of
> > > > > > the entire
> > > > > > thing, and I don't have good ideas how to deal with it :(
> > > > > 
> > > > > Probably a kind of latch in the task_struct which would trigger
> > > > > off once
> > > > > returt to a different address happened, thus we would be able to
> > > > > jump inside
> > > > > paratite code. Of course such trigger should be available under
> > > > > proper
> > > > > capability only.
> > > > 
> > > > I'm not fully in touch with how parasite, etc works.  Are we
> > > > talking about save or restore?
> > > 
> > > We use parasite code in question during checkpoint phase as far as I
> > > remember.
> > > push addr/lret trick is used to run "injected" code (code injection
> > > itself is
> > > done via ptrace) in compat mode at least. Dima, Andrei, I didn't look
> > > into this code
> > > for years already, do we still need to support compat mode at all?
> > > 
> > > > If it's restore, what exactly does CRIU need to do?  Is it just
> > > > that CRIU needs to return
> > > > out from its resume code into the to-be-resumed program without
> > > > tripping CET?  Would it
> > > > be acceptable for CRIU to require that at least one shstk slot be
> > > > free at save time?
> > > > Or do we need a mechanism to atomically switch to a completely full
> > > > shadow stack at resume?
> > > > 
> > > > Off the top of my head, a sigreturn (or sigreturn-like mechanism)
> > > > that is intended for
> > > > use for altshadowstack could safely verify a token on the
> > > > altshdowstack, possibly
> > > > compare to something in ucontext (or not -- this isn't clearly
> > > > necessary) and switch
> > > > back to the previous stack.  CRIU could use that too.  Obviously
> > > > CRIU will need a way
> > > > to populate the relevant stacks, but WRUSS can be used for that,
> > > > and I think this
> > > > is a fundamental requirement for CRIU -- CRIU restore absolutely
> > > > needs a way to write
> > > > the saved shadow stack data into the shadow stack.
> > 
> > Still wrapping my head around the CRIU save and restore steps, but
> > another general approach might be to give ptrace the ability to
> > temporarily pause/resume/set CET enablement and SSP for a stopped
> > thread. Then injected code doesn't need to jump through any hoops or
> > possibly run into road blocks. I'm not sure how much this opens things
> > up if the thread has to be stopped...
> 
> Hmm, that's maybe not insane.
> 
> An alternative would be to add a bona fide ptrace call-a-function mechanism.
> I can think of two potentially usable variants:
> 
> 1. Straight call.  PTRACE_CALL_FUNCTION(addr) just emulates CALL addr,
> shadow stack push and all.
> 
> 2. Signal-style.  PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal
> frame just like a real signal is being delivered with the specified handler.
> There could be a variant to opt-in to also using a specified altstack and
> altshadowstack.

I think this would be ideal. In CRIU, the parasite code is executed in
the "daemon" mode and returns back via sigreturn.  Right now, CRIU needs
to generate a signal frame. If I understand your idea right, the signal
frame will be generated by the kernel.

> 
> 2 would be more expensive but would avoid the need for much in the way of
> asm magic.  The injected code could be plain C (or Rust or Zig or whatever).
> 
> All of this only really handles save, not restore.  I don't understand
> restore enough to fully understand the issue.

In a few words, it works like this: CRIU restores all required resources
and prepares a signal frame with a target process state, then it
switches to a small PIE blob, where it restores vma-s and calls
rt_sigreturn.

> 
> --Andy

Mike Rapoport Feb. 11, 2022, 8:04 a.m. UTC | #32

On Thu, Feb 10, 2022 at 11:41:16PM -0800, avagin@gmail.com wrote:
> On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
> >
> > An alternative would be to add a bona fide ptrace call-a-function mechanism.
> > I can think of two potentially usable variants:
> > 
> > 1. Straight call.  PTRACE_CALL_FUNCTION(addr) just emulates CALL addr,
> > shadow stack push and all.
> > 
> > 2. Signal-style.  PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal
> > frame just like a real signal is being delivered with the specified handler.
> > There could be a variant to opt-in to also using a specified altstack and
> > altshadowstack.
> 
> I think this would be ideal. In CRIU, the parasite code is executed in
> the "daemon" mode and returns back via sigreturn.  Right now, CRIU needs
> to generate a signal frame. If I understand your idea right, the signal
> frame will be generated by the kernel.
> 
> > 
> > 2 would be more expensive but would avoid the need for much in the way of
> > asm magic.  The injected code could be plain C (or Rust or Zig or whatever).
> > 
> > All of this only really handles save, not restore.  I don't understand
> > restore enough to fully understand the issue.
> 
> In a few words, it works like this: CRIU restores all required resources
> and prepares a signal frame with a target process state, then it
> switches to a small PIE blob, where it restores vma-s and calls
> rt_sigreturn.

I think it's also important to note that the stack is restored as a part of
the process memory, i.e. its contents is read from the images.
 
> > 
> > --Andy

Mike Rapoport Feb. 28, 2022, 8:27 p.m. UTC | #33

On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
> On 2/8/22 18:18, Edgecombe, Rick P wrote:
> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> > 
> > Still wrapping my head around the CRIU save and restore steps, but
> > another general approach might be to give ptrace the ability to
> > temporarily pause/resume/set CET enablement and SSP for a stopped
> > thread. Then injected code doesn't need to jump through any hoops or
> > possibly run into road blocks. I'm not sure how much this opens things
> > up if the thread has to be stopped...
> 
> Hmm, that's maybe not insane.
> 
> An alternative would be to add a bona fide ptrace call-a-function mechanism.
> I can think of two potentially usable variants:
> 
> 1. Straight call.  PTRACE_CALL_FUNCTION(addr) just emulates CALL addr,
> shadow stack push and all.
> 
> 2. Signal-style.  PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal
> frame just like a real signal is being delivered with the specified handler.
> There could be a variant to opt-in to also using a specified altstack and
> altshadowstack.

Using ptrace() will not solve CRIU's issue with sigreturn because sigreturn
is called from the victim context rather than from the criu process that
controls the dump and uses ptrace().

Even with the current shadow stack interface Rick proposed, CRIU can restore
the victim using ptrace without any additional knobs, but we loose an
important ability to "self-cure" the victim from the parasite in case
anything goes wrong with criu control process.

Moreover, the issue with backward compatibility is not with ptrace but with
sigreturn and it seems that criu is not its only user.

So I think we need a way to allow direct calls to sigreturn that will
bypass check and restore of the shadow stack.

I only know that there are sigreturn users except criu that show up in
Debian codesearch, and I don't know how do they use it, but for full
backward compatibility we'd need to have no-CET sigreturn as default and
add a new, say UC_CHECK_SHSTK flag to rt_sigframe->uc.uc_flags or even a
new syscall for libc signal handling.

> 2 would be more expensive but would avoid the need for much in the way of
> asm magic.  The injected code could be plain C (or Rust or Zig or whatever).
> 
> All of this only really handles save, not restore.  I don't understand
> restore enough to fully understand the issue.

Restore is more complex, will get to it later.

> --Andy

Andy Lutomirski Feb. 28, 2022, 8:30 p.m. UTC | #34

On Mon, Feb 28, 2022, at 12:27 PM, Mike Rapoport wrote:
> On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
>> On 2/8/22 18:18, Edgecombe, Rick P wrote:
>> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
>> > 
>> > Still wrapping my head around the CRIU save and restore steps, but
>> > another general approach might be to give ptrace the ability to
>> > temporarily pause/resume/set CET enablement and SSP for a stopped
>> > thread. Then injected code doesn't need to jump through any hoops or
>> > possibly run into road blocks. I'm not sure how much this opens things
>> > up if the thread has to be stopped...
>> 
>> Hmm, that's maybe not insane.
>> 
>> An alternative would be to add a bona fide ptrace call-a-function mechanism.
>> I can think of two potentially usable variants:
>> 
>> 1. Straight call.  PTRACE_CALL_FUNCTION(addr) just emulates CALL addr,
>> shadow stack push and all.
>> 
>> 2. Signal-style.  PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal
>> frame just like a real signal is being delivered with the specified handler.
>> There could be a variant to opt-in to also using a specified altstack and
>> altshadowstack.
>
> Using ptrace() will not solve CRIU's issue with sigreturn because sigreturn
> is called from the victim context rather than from the criu process that
> controls the dump and uses ptrace().

I'm not sure I follow.

>
> Even with the current shadow stack interface Rick proposed, CRIU can restore
> the victim using ptrace without any additional knobs, but we loose an
> important ability to "self-cure" the victim from the parasite in case
> anything goes wrong with criu control process.
>
> Moreover, the issue with backward compatibility is not with ptrace but with
> sigreturn and it seems that criu is not its only user.

So we need an ability for a tracer to cause the tracee to call a function and to return successfully.  Apparently a gdb branch can already do this with shstk, and my PTRACE_CALL_FUNCTION_SIGFRAME should also do the trick.  I don't see why we need a sigretur-but-dont-verify -- we just need this mechanism to create a frame such that sigreturn actually works.

--Andy

Mike Rapoport Feb. 28, 2022, 9:30 p.m. UTC | #35

On Mon, Feb 28, 2022 at 12:30:41PM -0800, Andy Lutomirski wrote:
> 
> 
> On Mon, Feb 28, 2022, at 12:27 PM, Mike Rapoport wrote:
> > On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
> >> On 2/8/22 18:18, Edgecombe, Rick P wrote:
> >> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> >> > 
> >
> > Even with the current shadow stack interface Rick proposed, CRIU can restore
> > the victim using ptrace without any additional knobs, but we loose an
> > important ability to "self-cure" the victim from the parasite in case
> > anything goes wrong with criu control process.
> >
> > Moreover, the issue with backward compatibility is not with ptrace but with
> > sigreturn and it seems that criu is not its only user.
> 
> So we need an ability for a tracer to cause the tracee to call a function
> and to return successfully.  Apparently a gdb branch can already do this
> with shstk, and my PTRACE_CALL_FUNCTION_SIGFRAME should also do the
> trick.  I don't see why we need a sigretur-but-dont-verify -- we just
> need this mechanism to create a frame such that sigreturn actually works.

If I understand correctly, PTRACE_CALL_FUNCTION_SIGFRAME() injects a frame
into the tracee and makes the tracee call sigreturn.
I.e. the tracee is stopped and this is used pretty much as PTRACE_CONT or
PTRACE_SYSCALL.

In such case this defeats the purpose of sigreturn in CRIU because it is
called asynchronously by the tracee when the tracer is about to detach or
even already detached.

For synchronous use-case PTRACE_SETREGSET will be enough, the rest of the
sigframe can be restored by other means.

And with 'criu restore' there may be even no tracer by the time sigreturn
is called.

> --Andy

Andy Lutomirski Feb. 28, 2022, 10:55 p.m. UTC | #36

On Mon, Feb 28, 2022, at 1:30 PM, Mike Rapoport wrote:
> On Mon, Feb 28, 2022 at 12:30:41PM -0800, Andy Lutomirski wrote:
>> 
>> 
>> On Mon, Feb 28, 2022, at 12:27 PM, Mike Rapoport wrote:
>> > On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
>> >> On 2/8/22 18:18, Edgecombe, Rick P wrote:
>> >> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
>> >> > 
>> >
>> > Even with the current shadow stack interface Rick proposed, CRIU can restore
>> > the victim using ptrace without any additional knobs, but we loose an
>> > important ability to "self-cure" the victim from the parasite in case
>> > anything goes wrong with criu control process.
>> >
>> > Moreover, the issue with backward compatibility is not with ptrace but with
>> > sigreturn and it seems that criu is not its only user.
>> 
>> So we need an ability for a tracer to cause the tracee to call a function
>> and to return successfully.  Apparently a gdb branch can already do this
>> with shstk, and my PTRACE_CALL_FUNCTION_SIGFRAME should also do the
>> trick.  I don't see why we need a sigretur-but-dont-verify -- we just
>> need this mechanism to create a frame such that sigreturn actually works.
>
> If I understand correctly, PTRACE_CALL_FUNCTION_SIGFRAME() injects a frame
> into the tracee and makes the tracee call sigreturn.
> I.e. the tracee is stopped and this is used pretty much as PTRACE_CONT or
> PTRACE_SYSCALL.
>
> In such case this defeats the purpose of sigreturn in CRIU because it is
> called asynchronously by the tracee when the tracer is about to detach or
> even already detached.

The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal frame onto the stack and call a function.  That function should then be able to call sigreturn just like any normal signal handler.  There should be no requirement that the tracer still be attached when this happens, although the code calling sigreturn still needs to be mapped.

(Specifically, on modern arches, the user runtime is expected to provide a "restorer" that calls sigreturn.  A hypotheticall PTRACE_CALL_FUNCTION_SIGFRAME would either need to call sigreturn directly or provide a restorer.)

Mike Rapoport March 3, 2022, 7:40 p.m. UTC | #37

On Mon, Feb 28, 2022 at 02:55:30PM -0800, Andy Lutomirski wrote:
> 
> 
> On Mon, Feb 28, 2022, at 1:30 PM, Mike Rapoport wrote:
> > On Mon, Feb 28, 2022 at 12:30:41PM -0800, Andy Lutomirski wrote:
> >> 
> >> 
> >> On Mon, Feb 28, 2022, at 12:27 PM, Mike Rapoport wrote:
> >> > On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
> >> >> On 2/8/22 18:18, Edgecombe, Rick P wrote:
> >> >> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> >> >> > 
> >> >
> >> > Even with the current shadow stack interface Rick proposed, CRIU can restore
> >> > the victim using ptrace without any additional knobs, but we loose an
> >> > important ability to "self-cure" the victim from the parasite in case
> >> > anything goes wrong with criu control process.
> >> >
> >> > Moreover, the issue with backward compatibility is not with ptrace but with
> >> > sigreturn and it seems that criu is not its only user.
> >> 
> >> So we need an ability for a tracer to cause the tracee to call a function
> >> and to return successfully.  Apparently a gdb branch can already do this
> >> with shstk, and my PTRACE_CALL_FUNCTION_SIGFRAME should also do the
> >> trick.  I don't see why we need a sigretur-but-dont-verify -- we just
> >> need this mechanism to create a frame such that sigreturn actually works.
> >
> > If I understand correctly, PTRACE_CALL_FUNCTION_SIGFRAME() injects a frame
> > into the tracee and makes the tracee call sigreturn.
> > I.e. the tracee is stopped and this is used pretty much as PTRACE_CONT or
> > PTRACE_SYSCALL.
> >
> > In such case this defeats the purpose of sigreturn in CRIU because it is
> > called asynchronously by the tracee when the tracer is about to detach or
> > even already detached.
> 
> The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal frame onto
> the stack and call a function.  That function should then be able to call
> sigreturn just like any normal signal handler.  

Ok, let me reiterate.

We have a seized and stopped tracee, use PTRACE_CALL_FUNCTION_SIGFRAME
to push a signal frame onto the tracee's stack so that sigreturn could use
that frame, then set the tracee %rip to the function we'd like to call and
then we PTRACE_CONT the tracee. Tracee continues to execute the parasite
code that calls sigreturn to clean up and restore the tracee process.

PTRACE_CALL_FUNCTION_SIGFRAME also pushes a restore token to the shadow
stack, just like setup_rt_frame() does, so that sys_rt_sigreturn() won't
bail out at restore_signal_shadow_stack().

The only thing that CRIU actually needs is to push a restore token to the
shadow stack, so for us a ptrace call that does that would be ideal.

Andy Lutomirski March 3, 2022, 11 p.m. UTC | #38

On Thu, Mar 3, 2022 at 11:43 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Mon, Feb 28, 2022 at 02:55:30PM -0800, Andy Lutomirski wrote:
> >
> >
> > On Mon, Feb 28, 2022, at 1:30 PM, Mike Rapoport wrote:
> > > On Mon, Feb 28, 2022 at 12:30:41PM -0800, Andy Lutomirski wrote:
> > >>
> > >>
> > >> On Mon, Feb 28, 2022, at 12:27 PM, Mike Rapoport wrote:
> > >> > On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
> > >> >> On 2/8/22 18:18, Edgecombe, Rick P wrote:
> > >> >> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> > >> >> >
> > >> >
> > >> > Even with the current shadow stack interface Rick proposed, CRIU can restore
> > >> > the victim using ptrace without any additional knobs, but we loose an
> > >> > important ability to "self-cure" the victim from the parasite in case
> > >> > anything goes wrong with criu control process.
> > >> >
> > >> > Moreover, the issue with backward compatibility is not with ptrace but with
> > >> > sigreturn and it seems that criu is not its only user.
> > >>
> > >> So we need an ability for a tracer to cause the tracee to call a function
> > >> and to return successfully.  Apparently a gdb branch can already do this
> > >> with shstk, and my PTRACE_CALL_FUNCTION_SIGFRAME should also do the
> > >> trick.  I don't see why we need a sigretur-but-dont-verify -- we just
> > >> need this mechanism to create a frame such that sigreturn actually works.
> > >
> > > If I understand correctly, PTRACE_CALL_FUNCTION_SIGFRAME() injects a frame
> > > into the tracee and makes the tracee call sigreturn.
> > > I.e. the tracee is stopped and this is used pretty much as PTRACE_CONT or
> > > PTRACE_SYSCALL.
> > >
> > > In such case this defeats the purpose of sigreturn in CRIU because it is
> > > called asynchronously by the tracee when the tracer is about to detach or
> > > even already detached.
> >
> > The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal frame onto
> > the stack and call a function.  That function should then be able to call
> > sigreturn just like any normal signal handler.
>
> Ok, let me reiterate.
>
> We have a seized and stopped tracee, use PTRACE_CALL_FUNCTION_SIGFRAME
> to push a signal frame onto the tracee's stack so that sigreturn could use
> that frame, then set the tracee %rip to the function we'd like to call and
> then we PTRACE_CONT the tracee. Tracee continues to execute the parasite
> code that calls sigreturn to clean up and restore the tracee process.
>
> PTRACE_CALL_FUNCTION_SIGFRAME also pushes a restore token to the shadow
> stack, just like setup_rt_frame() does, so that sys_rt_sigreturn() won't
> bail out at restore_signal_shadow_stack().

That is the intent.

>
> The only thing that CRIU actually needs is to push a restore token to the
> shadow stack, so for us a ptrace call that does that would be ideal.
>

That seems fine too.  The main benefit of the SIGFRAME approach is
that, AIUI, CRIU eventually constructs a signal frame anyway, and
getting one ready-made seems plausibly helpful.  But if it's not
actually that useful, then there's no need to do it.

Edgecombe, Rick P March 4, 2022, 1:30 a.m. UTC | #39

On Thu, 2022-03-03 at 15:00 -0800, Andy Lutomirski wrote:
> > > The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal
> > > frame onto
> > > the stack and call a function.  That function should then be able
> > > to call
> > > sigreturn just like any normal signal handler.
> > 
> > Ok, let me reiterate.
> > 
> > We have a seized and stopped tracee, use
> > PTRACE_CALL_FUNCTION_SIGFRAME
> > to push a signal frame onto the tracee's stack so that sigreturn
> > could use
> > that frame, then set the tracee %rip to the function we'd like to
> > call and
> > then we PTRACE_CONT the tracee. Tracee continues to execute the
> > parasite
> > code that calls sigreturn to clean up and restore the tracee
> > process.
> > 
> > PTRACE_CALL_FUNCTION_SIGFRAME also pushes a restore token to the
> > shadow
> > stack, just like setup_rt_frame() does, so that sys_rt_sigreturn()
> > won't
> > bail out at restore_signal_shadow_stack().
> 
> That is the intent.
> 
> > 
> > The only thing that CRIU actually needs is to push a restore token
> > to the
> > shadow stack, so for us a ptrace call that does that would be
> > ideal.
> > 
> 
> That seems fine too.  The main benefit of the SIGFRAME approach is
> that, AIUI, CRIU eventually constructs a signal frame anyway, and
> getting one ready-made seems plausibly helpful.  But if it's not
> actually that useful, then there's no need to do it.

I guess pushing a token to the shadow stack could be done like GDB does
calls, with just the basic CET ptrace support. So do we even need a
specific push token operation?

I suppose if CRIU already used some kernel encapsulation of a seized
call/return operation it would have been easier to make CRIU work with
the introduction of CET. But the design of CRIU seems to be to have the
kernel expose just enough and then tie it all together in userspace.

Andy, did you have any other usages for PTRACE_CALL_FUNCTION in mind? I
couldn't find any other CRIU-like users of sigreturn in the debian
source search (but didn't read all 819 pages that come up with
"sigreturn"). It seemed to be mostly seccomp sandbox references.

Andy Lutomirski March 4, 2022, 7:13 p.m. UTC | #40

On 3/3/22 17:30, Edgecombe, Rick P wrote:
> On Thu, 2022-03-03 at 15:00 -0800, Andy Lutomirski wrote:
>>>> The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal
>>>> frame onto
>>>> the stack and call a function.  That function should then be able
>>>> to call
>>>> sigreturn just like any normal signal handler.
>>>
>>> Ok, let me reiterate.
>>>
>>> We have a seized and stopped tracee, use
>>> PTRACE_CALL_FUNCTION_SIGFRAME
>>> to push a signal frame onto the tracee's stack so that sigreturn
>>> could use
>>> that frame, then set the tracee %rip to the function we'd like to
>>> call and
>>> then we PTRACE_CONT the tracee. Tracee continues to execute the
>>> parasite
>>> code that calls sigreturn to clean up and restore the tracee
>>> process.
>>>
>>> PTRACE_CALL_FUNCTION_SIGFRAME also pushes a restore token to the
>>> shadow
>>> stack, just like setup_rt_frame() does, so that sys_rt_sigreturn()
>>> won't
>>> bail out at restore_signal_shadow_stack().
>>
>> That is the intent.
>>
>>>
>>> The only thing that CRIU actually needs is to push a restore token
>>> to the
>>> shadow stack, so for us a ptrace call that does that would be
>>> ideal.
>>>
>>
>> That seems fine too.  The main benefit of the SIGFRAME approach is
>> that, AIUI, CRIU eventually constructs a signal frame anyway, and
>> getting one ready-made seems plausibly helpful.  But if it's not
>> actually that useful, then there's no need to do it.
> 
> I guess pushing a token to the shadow stack could be done like GDB does
> calls, with just the basic CET ptrace support. So do we even need a
> specific push token operation?
> 
> I suppose if CRIU already used some kernel encapsulation of a seized
> call/return operation it would have been easier to make CRIU work with
> the introduction of CET. But the design of CRIU seems to be to have the
> kernel expose just enough and then tie it all together in userspace.
> 
> Andy, did you have any other usages for PTRACE_CALL_FUNCTION in mind? I
> couldn't find any other CRIU-like users of sigreturn in the debian
> source search (but didn't read all 819 pages that come up with
> "sigreturn"). It seemed to be mostly seccomp sandbox references.

I don't see a benefit compelling enough to justify the added complexity, 
given that existing mechanisms can do it.

The sigframe thing, OTOH, seems genuinely useful if CRIU would actually 
use it to save the full register state.  Generating a signal frame from 
scratch is a pain.  That being said, if CRIU isn't excited, then don't 
bother.

Mike Rapoport March 7, 2022, 6:56 p.m. UTC | #41

On Fri, Mar 04, 2022 at 11:13:19AM -0800, Andy Lutomirski wrote:
> On 3/3/22 17:30, Edgecombe, Rick P wrote:
> > On Thu, 2022-03-03 at 15:00 -0800, Andy Lutomirski wrote:
> > > > > The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal
> > > > > frame onto
> > > > > the stack and call a function.  That function should then be able
> > > > > to call
> > > > > sigreturn just like any normal signal handler.
> > > > 
> > > > Ok, let me reiterate.
> > > > 
> > > > We have a seized and stopped tracee, use
> > > > PTRACE_CALL_FUNCTION_SIGFRAME
> > > > to push a signal frame onto the tracee's stack so that sigreturn
> > > > could use
> > > > that frame, then set the tracee %rip to the function we'd like to
> > > > call and
> > > > then we PTRACE_CONT the tracee. Tracee continues to execute the
> > > > parasite
> > > > code that calls sigreturn to clean up and restore the tracee
> > > > process.
> > > > 
> > > > PTRACE_CALL_FUNCTION_SIGFRAME also pushes a restore token to the
> > > > shadow
> > > > stack, just like setup_rt_frame() does, so that sys_rt_sigreturn()
> > > > won't
> > > > bail out at restore_signal_shadow_stack().
> > > 
> > > That is the intent.
> > > 
> > > > 
> > > > The only thing that CRIU actually needs is to push a restore token
> > > > to the
> > > > shadow stack, so for us a ptrace call that does that would be
> > > > ideal.
> > > > 
> > > 
> > > That seems fine too.  The main benefit of the SIGFRAME approach is
> > > that, AIUI, CRIU eventually constructs a signal frame anyway, and
> > > getting one ready-made seems plausibly helpful.  But if it's not
> > > actually that useful, then there's no need to do it.
> > 
> > I guess pushing a token to the shadow stack could be done like GDB does
> > calls, with just the basic CET ptrace support. So do we even need a
> > specific push token operation?

I've tried to follow gdb CET push implementation, but got lost.
What is "basic CET ptrace support"? I don't see any ptrace changes in this
series.
 
> > I suppose if CRIU already used some kernel encapsulation of a seized
> > call/return operation it would have been easier to make CRIU work with
> > the introduction of CET. But the design of CRIU seems to be to have the
> > kernel expose just enough and then tie it all together in userspace.
> > 
> > Andy, did you have any other usages for PTRACE_CALL_FUNCTION in mind? I
> > couldn't find any other CRIU-like users of sigreturn in the debian
> > source search (but didn't read all 819 pages that come up with
> > "sigreturn"). It seemed to be mostly seccomp sandbox references.
> 
> I don't see a benefit compelling enough to justify the added complexity,
> given that existing mechanisms can do it.
> 
> The sigframe thing, OTOH, seems genuinely useful if CRIU would actually use
> it to save the full register state.  Generating a signal frame from scratch
> is a pain.  That being said, if CRIU isn't excited, then don't bother.

CRIU is excited :)

I just was looking for the minimal possible interface that will allow us to
call sigreturn. Rick is right and CRIU does try to expose as little as
possible and handle the pain in the userspace.

The SIGFRAME approach is indeed very helpful, especially if we can make it
work on other architectures eventually.

H.J. Lu March 7, 2022, 7:07 p.m. UTC | #42

On Mon, Mar 7, 2022 at 10:57 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Fri, Mar 04, 2022 at 11:13:19AM -0800, Andy Lutomirski wrote:
> > On 3/3/22 17:30, Edgecombe, Rick P wrote:
> > > On Thu, 2022-03-03 at 15:00 -0800, Andy Lutomirski wrote:
> > > > > > The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal
> > > > > > frame onto
> > > > > > the stack and call a function.  That function should then be able
> > > > > > to call
> > > > > > sigreturn just like any normal signal handler.
> > > > >
> > > > > Ok, let me reiterate.
> > > > >
> > > > > We have a seized and stopped tracee, use
> > > > > PTRACE_CALL_FUNCTION_SIGFRAME
> > > > > to push a signal frame onto the tracee's stack so that sigreturn
> > > > > could use
> > > > > that frame, then set the tracee %rip to the function we'd like to
> > > > > call and
> > > > > then we PTRACE_CONT the tracee. Tracee continues to execute the
> > > > > parasite
> > > > > code that calls sigreturn to clean up and restore the tracee
> > > > > process.
> > > > >
> > > > > PTRACE_CALL_FUNCTION_SIGFRAME also pushes a restore token to the
> > > > > shadow
> > > > > stack, just like setup_rt_frame() does, so that sys_rt_sigreturn()
> > > > > won't
> > > > > bail out at restore_signal_shadow_stack().
> > > >
> > > > That is the intent.
> > > >
> > > > >
> > > > > The only thing that CRIU actually needs is to push a restore token
> > > > > to the
> > > > > shadow stack, so for us a ptrace call that does that would be
> > > > > ideal.
> > > > >
> > > >
> > > > That seems fine too.  The main benefit of the SIGFRAME approach is
> > > > that, AIUI, CRIU eventually constructs a signal frame anyway, and
> > > > getting one ready-made seems plausibly helpful.  But if it's not
> > > > actually that useful, then there's no need to do it.
> > >
> > > I guess pushing a token to the shadow stack could be done like GDB does
> > > calls, with just the basic CET ptrace support. So do we even need a
> > > specific push token operation?
>
> I've tried to follow gdb CET push implementation, but got lost.
> What is "basic CET ptrace support"? I don't see any ptrace changes in this
> series.

Here is the CET ptrace patch on CET 5.16 kernel branch:

https://github.com/hjl-tools/linux/commit/3a43ec29ddac56f87807161b5aeafa80f632363d

> > > I suppose if CRIU already used some kernel encapsulation of a seized
> > > call/return operation it would have been easier to make CRIU work with
> > > the introduction of CET. But the design of CRIU seems to be to have the
> > > kernel expose just enough and then tie it all together in userspace.
> > >
> > > Andy, did you have any other usages for PTRACE_CALL_FUNCTION in mind? I
> > > couldn't find any other CRIU-like users of sigreturn in the debian
> > > source search (but didn't read all 819 pages that come up with
> > > "sigreturn"). It seemed to be mostly seccomp sandbox references.
> >
> > I don't see a benefit compelling enough to justify the added complexity,
> > given that existing mechanisms can do it.
> >
> > The sigframe thing, OTOH, seems genuinely useful if CRIU would actually use
> > it to save the full register state.  Generating a signal frame from scratch
> > is a pain.  That being said, if CRIU isn't excited, then don't bother.
>
> CRIU is excited :)
>
> I just was looking for the minimal possible interface that will allow us to
> call sigreturn. Rick is right and CRIU does try to expose as little as
> possible and handle the pain in the userspace.
>
> The SIGFRAME approach is indeed very helpful, especially if we can make it
> work on other architectures eventually.
>
> --
> Sincerely yours,
> Mike.

David Laight March 7, 2022, 10:21 p.m. UTC | #43

From: Mike Rapoport
> Sent: 07 March 2022 18:57
...
> > The sigframe thing, OTOH, seems genuinely useful if CRIU would actually use
> > it to save the full register state.  Generating a signal frame from scratch
> > is a pain.  That being said, if CRIU isn't excited, then don't bother.
> 
> CRIU is excited :)
> 
> I just was looking for the minimal possible interface that will allow us to
> call sigreturn. Rick is right and CRIU does try to expose as little as
> possible and handle the pain in the userspace.
> 
> The SIGFRAME approach is indeed very helpful, especially if we can make it
> work on other architectures eventually.

I thought the full sigframe layout depends very much on what the kernel
decides it needs to save?
Some parts are exposed to the signal handler, but there are large
blocks of data that XSAVE (etc) save that have to be put onto the
signal stack.
Is it even vaguely feasible to replicate what a specific kernel
generates on specific hardware in a userspace library?
The size of this data is getting bigger and bigger - causing
issues with the SIGALTSTACK (and even thread stack) minimum sizes.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Mike Rapoport May 31, 2022, 11:59 a.m. UTC | #44

Hi all,

On Mon, Mar 07, 2022 at 11:07:01AM -0800, H.J. Lu wrote:
> On Mon, Mar 7, 2022 at 10:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Fri, Mar 04, 2022 at 11:13:19AM -0800, Andy Lutomirski wrote:
> > > On 3/3/22 17:30, Edgecombe, Rick P wrote:
> 
> Here is the CET ptrace patch on CET 5.16 kernel branch:
> 
> https://github.com/hjl-tools/linux/commit/3a43ec29ddac56f87807161b5aeafa80f632363d

It took me a while, but at last I have a version of CRIU that knows how to
handle shadow stack. For the shadow stack manipulation during dump and for
the creation of the sigframe for sigreturn I used the CET ptrace patch for
5.16 (thanks H.J).

For the restore I had to add two modifications to the kernel APIs on top of
this version of the shadow stack series:

* add address parameter to map_shadow_stack() so that it'll call mmap()
with MAP_FIXED if the address is requested. This is required to restore the
shadow stack at the same address as it was at dump time.

* add ability to unlock shadow stack features using ptrace. This is
required because the current glibc (or at least in the version I used for
tests) locks shadow stack state when it loads a program. This locking means
that a process will either have shadow stack disabled without an ability to
enable it or it will have shadow stack enabled with WRSS disabled and
again, there is no way to re-enable WRSS. With that, ptrace looked like the
most sensible interface to interfere with the shadow stack locking.

I've pushed the kernel modifications here:

https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=cet/kvm

and CRIU modifications here:

https://github.com/rppt/criu/tree/cet/v0.1

Edgecombe, Rick P May 31, 2022, 4:25 p.m. UTC | #45

Mike,

Thanks for doing this. Glad to hear this is solvable with the current
paradigm.

On Tue, 2022-05-31 at 14:59 +0300, Mike Rapoport wrote:
> * add ability to unlock shadow stack features using ptrace. This is
> required because the current glibc (or at least in the version I used
> for
> tests) locks shadow stack state when it loads a program. This locking
> means
> that a process will either have shadow stack disabled without an
> ability to
> enable it or it will have shadow stack enabled with WRSS disabled and
> again, there is no way to re-enable WRSS. With that, ptrace looked
> like the
> most sensible interface to interfere with the shadow stack locking.

So whatever glibc you have lock's features even if it doesn't enable
shadow stack? Hmm, I've not encountered this. Which glibc is it?

WRSS is a feature where you would usually want to lock it as disabled,
but WRSS cannot be enabled if shadow stack is not enabled. Locking
shadow stack and WRSS off together doesn't have any security benefits
in theory. so I'm thinking glibc doesn't need to do this. The kernel
could even refuse to lock WRSS without shadow stack being enabled.
Could we avoid the extra ptrace functionality then?

Rick

Mike Rapoport May 31, 2022, 4:36 p.m. UTC | #46

On Tue, May 31, 2022 at 04:25:13PM +0000, Edgecombe, Rick P wrote:
> Mike,
> 
> Thanks for doing this. Glad to hear this is solvable with the current
> paradigm.
> 
> On Tue, 2022-05-31 at 14:59 +0300, Mike Rapoport wrote:
> > * add ability to unlock shadow stack features using ptrace. This is
> > required because the current glibc (or at least in the version I used
> > for
> > tests) locks shadow stack state when it loads a program. This locking
> > means
> > that a process will either have shadow stack disabled without an
> > ability to
> > enable it or it will have shadow stack enabled with WRSS disabled and
> > again, there is no way to re-enable WRSS. With that, ptrace looked
> > like the
> > most sensible interface to interfere with the shadow stack locking.
> 
> So whatever glibc you have lock's features even if it doesn't enable
> shadow stack? Hmm, I've not encountered this. Which glibc is it?

I use glibc from here:
https://gitlab.com/x86-glibc/glibc/, commit b6f9a22a00c1f8ae8c0991886f0a714f2f5da002

AFAIU, it's H.J cet work.

 
> WRSS is a feature where you would usually want to lock it as disabled,
> but WRSS cannot be enabled if shadow stack is not enabled. Locking
> shadow stack and WRSS off together doesn't have any security benefits
> in theory. so I'm thinking glibc doesn't need to do this. The kernel
> could even refuse to lock WRSS without shadow stack being enabled.
> Could we avoid the extra ptrace functionality then?

What I see for is that a program can support shadow stack, glibc enables
shadow stack, does not enable WRSS and than calls

	arch_prctl(ARCH_X86_FEATURE_LOCK,
		   LINUX_X86_FEATURE_SHSTK | LINUX_X86_FEATURE_WRSS);

so that WRSS cannot be re-enabled.

For the programs that do not support shadow stack, both SHSTK and WRSS are
disabled, but still there is the same call to
arch_prctl(ARCH_X86_FEATURE_LOCK, ...) and then neither shadow stack nor
WRSS can be enabled.

My original plan was to run CRIU with no shadow stack, enable shadow stack
and WRSS in the restored tasks using arch_prct() and after the shadow stack
contents is restored disable WRSS.

Obviously, this didn't work with glibc I have :)

On the bright side, having a ptrace call to unlock shadow stack and wrss
allows running CRIU itself with shadow stack.
 
> Rick

Edgecombe, Rick P May 31, 2022, 5:34 p.m. UTC | #47

On Tue, 2022-05-31 at 19:36 +0300, Mike Rapoport wrote:
> > WRSS is a feature where you would usually want to lock it as
> > disabled,
> > but WRSS cannot be enabled if shadow stack is not enabled. Locking
> > shadow stack and WRSS off together doesn't have any security
> > benefits
> > in theory. so I'm thinking glibc doesn't need to do this. The
> > kernel
> > could even refuse to lock WRSS without shadow stack being enabled.
> > Could we avoid the extra ptrace functionality then?
> 
> What I see for is that a program can support shadow stack, glibc
> enables
> shadow stack, does not enable WRSS and than calls
> 
>         arch_prctl(ARCH_X86_FEATURE_LOCK,
>                    LINUX_X86_FEATURE_SHSTK | LINUX_X86_FEATURE_WRSS);

I see the logic is glibc will lock SHSTK|IBT if either is enabled in
the elf header. I guess that is why I didn't see the locking happening
for me, because my manual enablement test doesn't have either set in
the header.

It can't see where that glibc knows about WRSS though...

The glibc logic seems wrong to me also, because shadow stack or IBT
could be force-disabled via glibc tunables. I don't see why the elf
header bit should exclusively control the feature locking. Or why both
should be locked if only one is in the header.

> 
> so that WRSS cannot be re-enabled.
> 
> For the programs that do not support shadow stack, both SHSTK and
> WRSS are
> disabled, but still there is the same call to
> arch_prctl(ARCH_X86_FEATURE_LOCK, ...) and then neither shadow stack
> nor
> WRSS can be enabled.
> 
> My original plan was to run CRIU with no shadow stack, enable shadow
> stack
> and WRSS in the restored tasks using arch_prct() and after the shadow
> stack
> contents is restored disable WRSS.
> 
> Obviously, this didn't work with glibc I have :)

Were you disabling shadow stack via glibc tunnable? Or was the elf
header marked for IBT? If it was a plain old binary, the code looks to
me like it should not lock any features.

> 
> On the bright side, having a ptrace call to unlock shadow stack and
> wrss
> allows running CRIU itself with shadow stack.
>  

Yea, having something working is really great. My only hesitancy is
that, per a discussion on the LAM patchset, we are going to make this
enabling API CET only (same semantics for though). I suppose the
locking API arch_prctl() could still be support other arch features,
but it might be a second CET only regset. It's not the end of the
world.

I guess the other consideration is tieing CRIU to glibc peculiarities.
Like even if we fix glibc, then CRIU may not work with some other libc
or app that force disables for some weird reason. Is it supposed to be
libc-agnostic?

H.J. Lu May 31, 2022, 6 p.m. UTC | #48

On Tue, May 31, 2022 at 10:34 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2022-05-31 at 19:36 +0300, Mike Rapoport wrote:
> > > WRSS is a feature where you would usually want to lock it as
> > > disabled,
> > > but WRSS cannot be enabled if shadow stack is not enabled. Locking
> > > shadow stack and WRSS off together doesn't have any security
> > > benefits
> > > in theory. so I'm thinking glibc doesn't need to do this. The
> > > kernel
> > > could even refuse to lock WRSS without shadow stack being enabled.
> > > Could we avoid the extra ptrace functionality then?
> >
> > What I see for is that a program can support shadow stack, glibc
> > enables
> > shadow stack, does not enable WRSS and than calls
> >
> >         arch_prctl(ARCH_X86_FEATURE_LOCK,
> >                    LINUX_X86_FEATURE_SHSTK | LINUX_X86_FEATURE_WRSS);
>
> I see the logic is glibc will lock SHSTK|IBT if either is enabled in
> the elf header. I guess that is why I didn't see the locking happening
> for me, because my manual enablement test doesn't have either set in
> the header.
>
> It can't see where that glibc knows about WRSS though...
>
> The glibc logic seems wrong to me also, because shadow stack or IBT
> could be force-disabled via glibc tunables. I don't see why the elf
> header bit should exclusively control the feature locking. Or why both
> should be locked if only one is in the header.

glibc locks SHSTK and IBT only if they are enabled at run-time. It doesn't
enable/disable/lock WRSS at the moment.  If WRSS can be enabled
via arch_prctl at any time, we can't lock it.  If WRSS should be locked early,
how should it be enabled in application?  Also can WRSS be enabled
from a dlopened object?

> >
> > so that WRSS cannot be re-enabled.
> >
> > For the programs that do not support shadow stack, both SHSTK and
> > WRSS are
> > disabled, but still there is the same call to
> > arch_prctl(ARCH_X86_FEATURE_LOCK, ...) and then neither shadow stack
> > nor
> > WRSS can be enabled.
> >
> > My original plan was to run CRIU with no shadow stack, enable shadow
> > stack
> > and WRSS in the restored tasks using arch_prct() and after the shadow
> > stack
> > contents is restored disable WRSS.
> >
> > Obviously, this didn't work with glibc I have :)
>
> Were you disabling shadow stack via glibc tunnable? Or was the elf
> header marked for IBT? If it was a plain old binary, the code looks to
> me like it should not lock any features.
>
> >
> > On the bright side, having a ptrace call to unlock shadow stack and
> > wrss
> > allows running CRIU itself with shadow stack.
> >
>
> Yea, having something working is really great. My only hesitancy is
> that, per a discussion on the LAM patchset, we are going to make this
> enabling API CET only (same semantics for though). I suppose the
> locking API arch_prctl() could still be support other arch features,
> but it might be a second CET only regset. It's not the end of the
> world.
>
> I guess the other consideration is tieing CRIU to glibc peculiarities.
> Like even if we fix glibc, then CRIU may not work with some other libc
> or app that force disables for some weird reason. Is it supposed to be
> libc-agnostic?
>

Mike Rapoport June 1, 2022, 8:06 a.m. UTC | #49

On Tue, May 31, 2022 at 05:34:50PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2022-05-31 at 19:36 +0300, Mike Rapoport wrote:
> > > WRSS is a feature where you would usually want to lock it as
> > > disabled,
> > > but WRSS cannot be enabled if shadow stack is not enabled. Locking
> > > shadow stack and WRSS off together doesn't have any security
> > > benefits
> > > in theory. so I'm thinking glibc doesn't need to do this. The
> > > kernel
> > > could even refuse to lock WRSS without shadow stack being enabled.
> > > Could we avoid the extra ptrace functionality then?
> > 
> > What I see for is that a program can support shadow stack, glibc
> > enables
> > shadow stack, does not enable WRSS and than calls
> > 
> >         arch_prctl(ARCH_X86_FEATURE_LOCK,
> >                    LINUX_X86_FEATURE_SHSTK | LINUX_X86_FEATURE_WRSS);
> 
> I see the logic is glibc will lock SHSTK|IBT if either is enabled in
> the elf header. I guess that is why I didn't see the locking happening
> for me, because my manual enablement test doesn't have either set in
> the header.

The locking was quite a surprise for me when I moved from standalone test
to a system with CET-enabled glibc :)
 
> It can't see where that glibc knows about WRSS though...

Right, it was my mistake, as H.J. said glibc locks SHSTK and IBT.
 
> The glibc logic seems wrong to me also, because shadow stack or IBT
> could be force-disabled via glibc tunables. I don't see why the elf
> header bit should exclusively control the feature locking. Or why both
> should be locked if only one is in the header.
> 
> > 
> > so that WRSS cannot be re-enabled.
> > 
> > For the programs that do not support shadow stack, both SHSTK and
> > WRSS are
> > disabled, but still there is the same call to
> > arch_prctl(ARCH_X86_FEATURE_LOCK, ...) and then neither shadow stack
> > nor
> > WRSS can be enabled.
> > 
> > My original plan was to run CRIU with no shadow stack, enable shadow
> > stack
> > and WRSS in the restored tasks using arch_prct() and after the shadow
> > stack
> > contents is restored disable WRSS.
> > 
> > Obviously, this didn't work with glibc I have :)
> 
> Were you disabling shadow stack via glibc tunnable? Or was the elf
> header marked for IBT? If it was a plain old binary, the code looks to
> me like it should not lock any features.

I built criu as a plain old binary, there were no SHSTK or IBT markers. And
I've seen that there was a call to arch_prctl that locked the features as
disabled. 
 
> > On the bright side, having a ptrace call to unlock shadow stack and
> > wrss
> > allows running CRIU itself with shadow stack.
> 
> Yea, having something working is really great. My only hesitancy is
> that, per a discussion on the LAM patchset, we are going to make this
> enabling API CET only (same semantics for though). I suppose the
> locking API arch_prctl() could still be support other arch features,
> but it might be a second CET only regset. It's not the end of the
> world.

The support for CET in criu is anyway experimental for now, if the kernel
API will be slightly different in the end, we'll update criu.
The important things are the ability to control tracee shadow stack
from ptrace, the ability to map the shadow stack at fixed address and the
ability to control the features at least from ptrace.
As long as we have APIs that provide those, it should be Ok.
 
> I guess the other consideration is tieing CRIU to glibc peculiarities.
> Like even if we fix glibc, then CRIU may not work with some other libc
> or app that force disables for some weird reason. Is it supposed to be
> libc-agnostic?

Actually using the ptrace to control the CET features does not tie criu to
glibc. The current proposal for the arch_prctl() allows libc to lock CET
features and having a ptrace call to control the lock makes criu agnostic
to libc behaviour.

Edgecombe, Rick P June 1, 2022, 5:24 p.m. UTC | #50

On Wed, 2022-06-01 at 11:06 +0300, Mike Rapoport wrote:
> > Yea, having something working is really great. My only hesitancy is
> > that, per a discussion on the LAM patchset, we are going to make
> > this
> > enabling API CET only (same semantics for though). I suppose the
> > locking API arch_prctl() could still be support other arch
> > features,
> > but it might be a second CET only regset. It's not the end of the
> > world.
> 
> The support for CET in criu is anyway experimental for now, if the
> kernel
> API will be slightly different in the end, we'll update criu.
> The important things are the ability to control tracee shadow stack
> from ptrace, the ability to map the shadow stack at fixed address and
> the
> ability to control the features at least from ptrace.
> As long as we have APIs that provide those, it should be Ok.
>  
> > I guess the other consideration is tieing CRIU to glibc
> > peculiarities.
> > Like even if we fix glibc, then CRIU may not work with some other
> > libc
> > or app that force disables for some weird reason. Is it supposed to
> > be
> > libc-agnostic?
> 
> Actually using the ptrace to control the CET features does not tie
> criu to
> glibc. The current proposal for the arch_prctl() allows libc to lock
> CET
> features and having a ptrace call to control the lock makes criu
> agnostic
> to libc behaviour.

From staring at the glibc code, I'm suspicious something was weird with
your test setup, as I don't think it should be locking. But I guess to
be completely proper you would need to save and restore the lock state
anyway. So, ok yea, on balance probably better to have an extra
interface.

Should we make it a GET/SET interface?

Edgecombe, Rick P June 1, 2022, 5:27 p.m. UTC | #51

On Tue, 2022-05-31 at 11:00 -0700, H.J. Lu wrote:
> > The glibc logic seems wrong to me also, because shadow stack or IBT
> > could be force-disabled via glibc tunables. I don't see why the elf
> > header bit should exclusively control the feature locking. Or why
> > both
> > should be locked if only one is in the header.
> 
> glibc locks SHSTK and IBT only if they are enabled at run-time.

It's not what I saw in the code. Somehow Mike saw something different
as well.

>  It doesn't
> enable/disable/lock WRSS at the moment.  If WRSS can be enabled
> via arch_prctl at any time, we can't lock it.  If WRSS should be
> locked early,
> how should it be enabled in application?  Also can WRSS be enabled
> from a dlopened object?

I think in the past we discussed having another elf header bit that
behaved differently (OR vs AND).

H.J. Lu June 1, 2022, 7:27 p.m. UTC | #52

On Wed, Jun 1, 2022 at 10:27 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2022-05-31 at 11:00 -0700, H.J. Lu wrote:
> > > The glibc logic seems wrong to me also, because shadow stack or IBT
> > > could be force-disabled via glibc tunables. I don't see why the elf
> > > header bit should exclusively control the feature locking. Or why
> > > both
> > > should be locked if only one is in the header.
> >
> > glibc locks SHSTK and IBT only if they are enabled at run-time.
>
> It's not what I saw in the code. Somehow Mike saw something different
> as well.

The current glibc cet branch:

https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/cet/master

locks only available CET features.  Since only SHSTK is available, I
saw

arch_prctl(0x3003 /* ARCH_??? */, 0x2)  = 0

CET features are always enabled early in ld.so to allow function
calls in the CET enabled ld.so.   ld.so always locks CET features
even if they are disabled when the program or a dependency
library isn't CET enabled.

> >  It doesn't
> > enable/disable/lock WRSS at the moment.  If WRSS can be enabled
> > via arch_prctl at any time, we can't lock it.  If WRSS should be
> > locked early,
> > how should it be enabled in application?  Also can WRSS be enabled
> > from a dlopened object?
>
> I think in the past we discussed having another elf header bit that
> behaved differently (OR vs AND).

We should have a complete list of use cases and design a way to
support them.

Mike Rapoport June 9, 2022, 6:04 p.m. UTC | #53

On Wed, Jun 01, 2022 at 05:24:26PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2022-06-01 at 11:06 +0300, Mike Rapoport wrote:
> > > Yea, having something working is really great. My only hesitancy is
> > > that, per a discussion on the LAM patchset, we are going to make
> > > this
> > > enabling API CET only (same semantics for though). I suppose the
> > > locking API arch_prctl() could still be support other arch
> > > features,
> > > but it might be a second CET only regset. It's not the end of the
> > > world.
> > 
> > The support for CET in criu is anyway experimental for now, if the
> > kernel
> > API will be slightly different in the end, we'll update criu.
> > The important things are the ability to control tracee shadow stack
> > from ptrace, the ability to map the shadow stack at fixed address and
> > the
> > ability to control the features at least from ptrace.
> > As long as we have APIs that provide those, it should be Ok.
> >  
> > > I guess the other consideration is tieing CRIU to glibc
> > > peculiarities.
> > > Like even if we fix glibc, then CRIU may not work with some other
> > > libc
> > > or app that force disables for some weird reason. Is it supposed to
> > > be
> > > libc-agnostic?
> > 
> > Actually using the ptrace to control the CET features does not tie
> > criu to
> > glibc. The current proposal for the arch_prctl() allows libc to lock
> > CET
> > features and having a ptrace call to control the lock makes criu
> > agnostic
> > to libc behaviour.
> 
> From staring at the glibc code, I'm suspicious something was weird with
> your test setup, as I don't think it should be locking. But I guess to
> be completely proper you would need to save and restore the lock state
> anyway. So, ok yea, on balance probably better to have an extra
> interface.
> 
> Should we make it a GET/SET interface?

Yes, I think so.

[00/35] Shadow stacks for userspace

Message

Comments