[v2] proc: allow restricting /proc/pid/mem writes

Message ID	20240301213442.198443-1-adrian.ratiu@collabora.com (mailing list archive)
State	Handled Elsewhere
Headers	show Received: from madrid.collaboradmins.com (madrid.collaboradmins.com [46.235.227.194]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C0E152575F; Fri, 1 Mar 2024 21:35:31 +0000 (UTC) sender: aratiu) by madrid.collaboradmins.com (Postfix) with ESMTPSA id 09BB7378208B; Fri, 1 Mar 2024 21:35:28 +0000 (UTC) From: Adrian Ratiu <adrian.ratiu@collabora.com> To: linux-fsdevel@vger.kernel.org Cc: kernel@collabora.com, linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Adrian Ratiu <adrian.ratiu@collabora.com>, Guenter Roeck <groeck@chromium.org>, Doug Anderson <dianders@chromium.org>, Kees Cook <keescook@chromium.org>, Jann Horn <jannh@google.com>, Andrew Morton <akpm@linux-foundation.org>, Randy Dunlap <rdunlap@infradead.org>, Christian Brauner <brauner@kernel.org>, Mike Frysinger <vapier@chromium.org> Subject: [PATCH v2] proc: allow restricting /proc/pid/mem writes Date: Fri, 1 Mar 2024 23:34:42 +0200 Message-ID: <20240301213442.198443-1-adrian.ratiu@collabora.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[v2] proc: allow restricting /proc/pid/mem writes \| expand [v2] proc: allow restricting /proc/pid/mem writes

Adrian Ratiu March 1, 2024, 9:34 p.m. UTC

Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
after which it got allowed in commit 198214a7ee50 ("proc: enable
writing to /proc/pid/mem"). Famous last words from that patch:
"no longer a security hazard". :)

Afterwards exploits appeared started causing drama like [1]. The
/proc/*/mem exploits can be rather sophisticated like [2] which
installed an arbitrary payload from noexec storage into a running
process then exec'd it, which itself could include an ELF loader
to run arbitrary code off noexec storage.

As part of hardening against these types of attacks, distrbutions
can restrict /proc/*/mem to only allow writes when they makes sense,
like in case of debuggers which have ptrace permissions, as they
are able to access memory anyway via PTRACE_POKEDATA and friends.

Dropping the mode bits disables write access for non-root users.
Trying to `chmod` the paths back fails as the kernel rejects it.

For users with CAP_DAC_OVERRIDE (usually just root) we have to
disable the mem_write callback to avoid bypassing the mode bits.

Writes can be used to bypass permissions on memory maps, even if a
memory region is mapped r-x (as is a program's executable pages),
the process can open its own /proc/self/mem file and write to the
pages directly.

Even if seccomp filters block mmap/mprotect calls with W|X perms,
they often cannot block open calls as daemons want to read/write
their own runtime state and seccomp filters cannot check file paths.
Write calls also can't be blocked in general via seccomp.

Since the mem file is part of the dynamic /proc/<pid>/ space, we
can't run chmod once at boot to restrict it (and trying to react
to every process and run chmod doesn't scale, and the kernel no
longer allows chmod on any of these paths).

SELinux could be used with a rule to cover all /proc/*/mem files,
but even then having multiple ways to deny an attack is useful in
case on layer fails.

[1] https://lwn.net/Articles/476947/
[2] https://issues.chromium.org/issues/40089045

Based on an initial patch by Mike Frysinger <vapier@chromium.org>.

Cc: Guenter Roeck <groeck@chromium.org>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Co-developed-by: Mike Frysinger <vapier@chromium.org>
Signed-off-by: Mike Frysinger <vapier@chromium.org>
Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
---
Changes in v2:
 * Added boot time parameter with default kconfig option
 * Moved check earlier in mem_open() instead of mem_write()
 * Simplified implementation branching
 * Removed dependency on CONFIG_MEMCG
---
 .../admin-guide/kernel-parameters.txt         |  4 ++
 fs/proc/base.c                                | 47 ++++++++++++++++++-
 security/Kconfig                              | 22 +++++++++
 3 files changed, 71 insertions(+), 2 deletions(-)

Kees Cook March 1, 2024, 11:55 p.m. UTC | #1

On Fri, Mar 01, 2024 at 11:34:42PM +0200, Adrian Ratiu wrote:
> Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
> after which it got allowed in commit 198214a7ee50 ("proc: enable
> writing to /proc/pid/mem"). Famous last words from that patch:
> "no longer a security hazard". :)
> 
> Afterwards exploits appeared started causing drama like [1]. The

nit: I think "appeared" can be dropped here.

> /proc/*/mem exploits can be rather sophisticated like [2] which
> installed an arbitrary payload from noexec storage into a running
> process then exec'd it, which itself could include an ELF loader
> to run arbitrary code off noexec storage.
> 
> As part of hardening against these types of attacks, distrbutions
> can restrict /proc/*/mem to only allow writes when they makes sense,
> like in case of debuggers which have ptrace permissions, as they
> are able to access memory anyway via PTRACE_POKEDATA and friends.
> 
> Dropping the mode bits disables write access for non-root users.
> Trying to `chmod` the paths back fails as the kernel rejects it.
> 
> For users with CAP_DAC_OVERRIDE (usually just root) we have to
> disable the mem_write callback to avoid bypassing the mode bits.
> 
> Writes can be used to bypass permissions on memory maps, even if a
> memory region is mapped r-x (as is a program's executable pages),
> the process can open its own /proc/self/mem file and write to the
> pages directly.
> 
> Even if seccomp filters block mmap/mprotect calls with W|X perms,
> they often cannot block open calls as daemons want to read/write
> their own runtime state and seccomp filters cannot check file paths.
> Write calls also can't be blocked in general via seccomp.
> 
> Since the mem file is part of the dynamic /proc/<pid>/ space, we
> can't run chmod once at boot to restrict it (and trying to react
> to every process and run chmod doesn't scale, and the kernel no
> longer allows chmod on any of these paths).
> 
> SELinux could be used with a rule to cover all /proc/*/mem files,
> but even then having multiple ways to deny an attack is useful in
> case on layer fails.

Everything above here is good to keep in the commit log, but it's all
the "background". Please also write here what has been done to address
the background above it. e.g.:

"Introduce a CONFIG and a __ro_after_init runtime toggle to make
it so only processes that are already tracing the task to write to
/proc/<pid>/mem." etc

> 
> [1] https://lwn.net/Articles/476947/
> [2] https://issues.chromium.org/issues/40089045

These can be:

Link: https://lwn.net/Articles/476947/ [1]
Link: https://issues.chromium.org/issues/40089045 [2]

> Based on an initial patch by Mike Frysinger <vapier@chromium.org>.
> 
> Cc: Guenter Roeck <groeck@chromium.org>
> Cc: Doug Anderson <dianders@chromium.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Co-developed-by: Mike Frysinger <vapier@chromium.org>
> Signed-off-by: Mike Frysinger <vapier@chromium.org>
> Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> ---
> Changes in v2:
>  * Added boot time parameter with default kconfig option
>  * Moved check earlier in mem_open() instead of mem_write()
>  * Simplified implementation branching
>  * Removed dependency on CONFIG_MEMCG

Can you mention in the commit log what behaviors have been tested with
this patch? For example, I assume gdb still works with
restrict_proc_mem_write=y ?

When this is enabled, what _does_ break that people might expect to
work?

> ---
>  .../admin-guide/kernel-parameters.txt         |  4 ++
>  fs/proc/base.c                                | 47 ++++++++++++++++++-
>  security/Kconfig                              | 22 +++++++++
>  3 files changed, 71 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 460b97a1d0da..0647e2f54248 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5618,6 +5618,10 @@
>  	reset_devices	[KNL] Force drivers to reset the underlying device
>  			during initialization.
>  
> +	restrict_proc_mem_write= [KNL]

Please add here:

			Format: <bool>

> +			Enable or disable write access to /proc/*/mem files.
> +			Default is SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON.
> +
>  	resume=		[SWSUSP]
>  			Specify the partition device for software suspend
>  			Format:
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 98a031ac2648..92f668191312 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -152,6 +152,30 @@ struct pid_entry {
>  		NULL, &proc_pid_attr_operations,	\
>  		{ .lsmid = LSMID })
>  
> +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE

Please drop this CONFIG entirely -- it should be always available for
all builds of the kernel. Only CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON
needs to remain.

> +DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> +			   restrict_proc_mem_write);
> +static int __init early_restrict_proc_mem_write(char *buf)
> +{
> +	int ret;
> +	bool bool_result;
> +
> +	ret = kstrtobool(buf, &bool_result);
> +	if (ret)
> +		return ret;
> +
> +	if (bool_result)
> +		static_branch_enable(&restrict_proc_mem_write);
> +	else
> +		static_branch_disable(&restrict_proc_mem_write);
> +	return 0;
> +}
> +early_param("restrict_proc_mem_write", early_restrict_proc_mem_write);
> +# define PROC_PID_MEM_MODE S_IRUSR
> +#else
> +# define PROC_PID_MEM_MODE (S_IRUSR|S_IWUSR)
> +#endif

PROC_PID_MEM_MODE will need to be a __ro_after_init variable, set by
early_restrict_proc_mem_write, otherwise the mode won't change based on
the runtime setting. e.g.:

#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON
mode_t proc_pid_mem_mode __ro_after_init = S_IRUSR;
#else
mode_t proc_pid_mem_mode __ro_after_init = (S_IRUSR|S_IWUSR);
#endif

DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
			   restrict_proc_mem_write);
...
	if (bool_result) {
		static_branch_enable(&restrict_proc_mem_write);
		proc_pid_mem_mode = S_IRUSR;
	} else {
		static_branch_disable(&restrict_proc_mem_write);
		proc_pid_mem_mode = (S_IRUSR|S_IWUSR);
	}
...
	REG("mem",        proc_pid_mem_mode, proc_mem_operations),


> +
>  /*
>   * Count the number of hardlinks for the pid_entry table, excluding the .
>   * and .. links.
> @@ -829,6 +853,25 @@ static int mem_open(struct inode *inode, struct file *file)
>  {
>  	int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
>  
> +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE

Drop this ifdef (as mentioned above).

> +	struct mm_struct *mm = file->private_data;
> +	struct task_struct *task = get_proc_task(inode);
> +
> +	if (mm && task) {
> +		/* Only allow writes by processes already ptracing the target task */
> +		if (file->f_mode & FMODE_WRITE &&
> +		    static_branch_maybe(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> +					&restrict_proc_mem_write)) {

Do we need to also do an mm_access() on the task to verify that the task
we're about to check has its mm still matching file->private_data? The
PID can change out from under us (but the mm cannot).

> +			rcu_read_lock();
> +			if (!ptracer_capable(current, mm->user_ns) ||
> +			    current != ptrace_parent(task))

If you're just allowing "already ptracing", why include the
ptracer_capable() check?

> +				ret = -EACCES;
> +			rcu_read_unlock();
> +		}
> +		put_task_struct(task);
> +	}
> +#endif
> +
>  	/* OK to pass negative loff_t, we can catch out-of-range */
>  	file->f_mode |= FMODE_UNSIGNED_OFFSET;
>  
> @@ -3281,7 +3324,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #ifdef CONFIG_NUMA
>  	REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
>  #endif
> -	REG("mem",        S_IRUSR|S_IWUSR, proc_mem_operations),
> +	REG("mem",        PROC_PID_MEM_MODE, proc_mem_operations),
>  	LNK("cwd",        proc_cwd_link),
>  	LNK("root",       proc_root_link),
>  	LNK("exe",        proc_exe_link),
> @@ -3631,7 +3674,7 @@ static const struct pid_entry tid_base_stuff[] = {
>  #ifdef CONFIG_NUMA
>  	REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
>  #endif
> -	REG("mem",       S_IRUSR|S_IWUSR, proc_mem_operations),
> +	REG("mem",       PROC_PID_MEM_MODE, proc_mem_operations),
>  	LNK("cwd",       proc_cwd_link),
>  	LNK("root",      proc_root_link),
>  	LNK("exe",       proc_exe_link),
> diff --git a/security/Kconfig b/security/Kconfig
> index 412e76f1575d..ffee9e847ed9 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -19,6 +19,28 @@ config SECURITY_DMESG_RESTRICT
>  
>  	  If you are unsure how to answer this question, answer N.
>  
> +config SECURITY_PROC_MEM_RESTRICT_WRITE
> +	bool "Restrict /proc/*/mem write access"
> +	default n
> +	help
> +	  This restricts writes to /proc/<pid>/mem, except when the current
> +	  process ptraces the /proc/<pid>/mem task, because a ptracer already
> +	  has write access to the tracee memory.
> +
> +	  Write access to this file allows bypassing memory map permissions,
> +	  such as modifying read-only code.
> +
> +	  If you are unsure how to answer this question, answer N.
> +
> +config SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON
> +	bool "Default state of /proc/*/mem write restriction"
> +	depends on SECURITY_PROC_MEM_RESTRICT_WRITE
> +	default y
> +	help
> +	  /proc/*/mem write access is controlled by kernel boot param
> +	  "restrict_proc_mem_write" and this config chooses the default
> +	  boot state.

As mentioned, I'd say merge the help texts here, but drop
SECURITY_PROC_MEM_RESTRICT_WRITE.

> +
>  config SECURITY
>  	bool "Enable different security models"
>  	depends on SYSFS
> -- 
> 2.30.2
> 

Thanks for this! I look forward to turning it on. :)

-Kees

Adrian Ratiu March 2, 2024, 10:31 a.m. UTC | #2

On Saturday, March 02, 2024 01:55 EET, Kees Cook <keescook@chromium.org> wrote:

> On Fri, Mar 01, 2024 at 11:34:42PM +0200, Adrian Ratiu wrote:
> > Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
> > after which it got allowed in commit 198214a7ee50 ("proc: enable
> > writing to /proc/pid/mem"). Famous last words from that patch:
> > "no longer a security hazard". :)
> > 
> > Afterwards exploits appeared started causing drama like [1]. The
> 
> nit: I think "appeared" can be dropped here.
> 
> > /proc/*/mem exploits can be rather sophisticated like [2] which
> > installed an arbitrary payload from noexec storage into a running
> > process then exec'd it, which itself could include an ELF loader
> > to run arbitrary code off noexec storage.
> > 
> > As part of hardening against these types of attacks, distrbutions
> > can restrict /proc/*/mem to only allow writes when they makes sense,
> > like in case of debuggers which have ptrace permissions, as they
> > are able to access memory anyway via PTRACE_POKEDATA and friends.
> > 
> > Dropping the mode bits disables write access for non-root users.
> > Trying to `chmod` the paths back fails as the kernel rejects it.
> > 
> > For users with CAP_DAC_OVERRIDE (usually just root) we have to
> > disable the mem_write callback to avoid bypassing the mode bits.
> > 
> > Writes can be used to bypass permissions on memory maps, even if a
> > memory region is mapped r-x (as is a program's executable pages),
> > the process can open its own /proc/self/mem file and write to the
> > pages directly.
> > 
> > Even if seccomp filters block mmap/mprotect calls with W|X perms,
> > they often cannot block open calls as daemons want to read/write
> > their own runtime state and seccomp filters cannot check file paths.
> > Write calls also can't be blocked in general via seccomp.
> > 
> > Since the mem file is part of the dynamic /proc/<pid>/ space, we
> > can't run chmod once at boot to restrict it (and trying to react
> > to every process and run chmod doesn't scale, and the kernel no
> > longer allows chmod on any of these paths).
> > 
> > SELinux could be used with a rule to cover all /proc/*/mem files,
> > but even then having multiple ways to deny an attack is useful in
> > case on layer fails.
> 
> Everything above here is good to keep in the commit log, but it's all
> the "background". Please also write here what has been done to address
> the background above it. e.g.:
> 
> "Introduce a CONFIG and a __ro_after_init runtime toggle to make
> it so only processes that are already tracing the task to write to
> /proc/<pid>/mem." etc
> 
> > 
> > [1] https://lwn.net/Articles/476947/
> > [2] https://issues.chromium.org/issues/40089045
> 
> These can be:
> 
> Link: https://lwn.net/Articles/476947/ [1]
> Link: https://issues.chromium.org/issues/40089045 [2]
> 
> > Based on an initial patch by Mike Frysinger <vapier@chromium.org>.
> > 
> > Cc: Guenter Roeck <groeck@chromium.org>
> > Cc: Doug Anderson <dianders@chromium.org>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Randy Dunlap <rdunlap@infradead.org>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Co-developed-by: Mike Frysinger <vapier@chromium.org>
> > Signed-off-by: Mike Frysinger <vapier@chromium.org>
> > Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> > ---
> > Changes in v2:
> >  * Added boot time parameter with default kconfig option
> >  * Moved check earlier in mem_open() instead of mem_write()
> >  * Simplified implementation branching
> >  * Removed dependency on CONFIG_MEMCG
> 
> Can you mention in the commit log what behaviors have been tested with
> this patch? For example, I assume gdb still works with
> restrict_proc_mem_write=y ?
> 

Thanks, I will address all the above commit message feedback in v3.

Yes, gdb and gdbserver work with restrict_proc_mem_write=y. My testing is
focused on the correct functioning of GDB and gdbserver (lldb/server use 
ptrace POKEDATA so they work regardless of restrict_proc_mem_write).

This all started from my attempt to fix gdbserver by adding a ptrace fallback
in case /proc/pid/mem writes are blocked without any exception, because
that breaks basic functionality like setting breakpoints.

GDB upstream NAK'ed my ptrace fallback approach because it's doesn't
work well with their /proc/pid/mem focused design required for non-stop 
mode (the default all-stop mode is emulated on top of non-stop), as well
as ptrace peek/poke requiring a live task which can cause memory access
problems if the ptraced task dies.

Other solutions were considered by GDB upstream, including using the 
process_vm_writev & co, but they respect page permissions and GDB has
to write RO pages to set breakpoints.

In the end GDB maintainers directed me to do a proper kernel fix with an
exception for tasks already ptracing others, because from a security
perspective, they can already access tracee memory regardless of
/proc/pid/mem restrictions, so here we are. :)

> When this is enabled, what _does_ break that people might expect to
> work?

With the current iteration, all things I tested work as expected. It is rather
hard to come up with things that break with restrict_proc_mem_write=y,
because other than debuggers and exploits I don't have other use-cases.

Obvious things like "echo >/proc/self/mem" get permission denied, but
that is expected with restrict_proc_mem_write=y, so I wouldn't classify
it as breakage.

In theory there might be some weird/legacy apps which might break, so
that is why I suggest we land the mechanism as default off, and later,
after it gets tested in various distributions, pull the trigger to make it
default on.

> 
> > ---
> >  .../admin-guide/kernel-parameters.txt         |  4 ++
> >  fs/proc/base.c                                | 47 ++++++++++++++++++-
> >  security/Kconfig                              | 22 +++++++++
> >  3 files changed, 71 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 460b97a1d0da..0647e2f54248 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5618,6 +5618,10 @@
> >  	reset_devices	[KNL] Force drivers to reset the underlying device
> >  			during initialization.
> >  
> > +	restrict_proc_mem_write= [KNL]
> 
> Please add here:
> 
> 			Format: <bool>
> 

Ack, will do in v3.

> > +			Enable or disable write access to /proc/*/mem files.
> > +			Default is SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON.
> > +
> >  	resume=		[SWSUSP]
> >  			Specify the partition device for software suspend
> >  			Format:
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 98a031ac2648..92f668191312 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -152,6 +152,30 @@ struct pid_entry {
> >  		NULL, &proc_pid_attr_operations,	\
> >  		{ .lsmid = LSMID })
> >  
> > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> 
> Please drop this CONFIG entirely -- it should be always available for
> all builds of the kernel. Only CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON
> needs to remain.
> 

Ack, will do in v3.

> > +DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> > +			   restrict_proc_mem_write);
> > +static int __init early_restrict_proc_mem_write(char *buf)
> > +{
> > +	int ret;
> > +	bool bool_result;
> > +
> > +	ret = kstrtobool(buf, &bool_result);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (bool_result)
> > +		static_branch_enable(&restrict_proc_mem_write);
> > +	else
> > +		static_branch_disable(&restrict_proc_mem_write);
> > +	return 0;
> > +}
> > +early_param("restrict_proc_mem_write", early_restrict_proc_mem_write);
> > +# define PROC_PID_MEM_MODE S_IRUSR
> > +#else
> > +# define PROC_PID_MEM_MODE (S_IRUSR|S_IWUSR)
> > +#endif
> 
> PROC_PID_MEM_MODE will need to be a __ro_after_init variable, set by
> early_restrict_proc_mem_write, otherwise the mode won't change based on
> the runtime setting. e.g.:
> 
> #ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON
> mode_t proc_pid_mem_mode __ro_after_init = S_IRUSR;
> #else
> mode_t proc_pid_mem_mode __ro_after_init = (S_IRUSR|S_IWUSR);
> #endif
> 
> DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> 			   restrict_proc_mem_write);
> ...
> 	if (bool_result) {
> 		static_branch_enable(&restrict_proc_mem_write);
> 		proc_pid_mem_mode = S_IRUSR;
> 	} else {
> 		static_branch_disable(&restrict_proc_mem_write);
> 		proc_pid_mem_mode = (S_IRUSR|S_IWUSR);
> 	}
> ...
> 	REG("mem",        proc_pid_mem_mode, proc_mem_operations),
> 
> 

Ack, will do in v3.

> > +
> >  /*
> >   * Count the number of hardlinks for the pid_entry table, excluding the .
> >   * and .. links.
> > @@ -829,6 +853,25 @@ static int mem_open(struct inode *inode, struct file *file)
> >  {
> >  	int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> >  
> > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> 
> Drop this ifdef (as mentioned above).
> 
> > +	struct mm_struct *mm = file->private_data;
> > +	struct task_struct *task = get_proc_task(inode);
> > +
> > +	if (mm && task) {
> > +		/* Only allow writes by processes already ptracing the target task */
> > +		if (file->f_mode & FMODE_WRITE &&
> > +		    static_branch_maybe(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> > +					&restrict_proc_mem_write)) {
> 
> Do we need to also do an mm_access() on the task to verify that the task
> we're about to check has its mm still matching file->private_data? The
> PID can change out from under us (but the mm cannot).
> 

Likely yes, will look into this for v3.

> > +			rcu_read_lock();
> > +			if (!ptracer_capable(current, mm->user_ns) ||
> > +			    current != ptrace_parent(task))
> 
> If you're just allowing "already ptracing", why include the
> ptracer_capable() check?
> 

It is a very good observation that the check is redundant. :)

It is a remnant from a previous iteration of this patch, from
when I was proposing solutions to GDB upstream. I left it there
because it doesn't do much harm to verify capability as well,
more of a precaution / test invariant than anything else.

I'll remove it in v3 since it might cause confusion.

> > +				ret = -EACCES;
> > +			rcu_read_unlock();
> > +		}
> > +		put_task_struct(task);
> > +	}
> > +#endif
> > +
> >  	/* OK to pass negative loff_t, we can catch out-of-range */
> >  	file->f_mode |= FMODE_UNSIGNED_OFFSET;
> >  
> > @@ -3281,7 +3324,7 @@ static const struct pid_entry tgid_base_stuff[] = {
> >  #ifdef CONFIG_NUMA
> >  	REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
> >  #endif
> > -	REG("mem",        S_IRUSR|S_IWUSR, proc_mem_operations),
> > +	REG("mem",        PROC_PID_MEM_MODE, proc_mem_operations),
> >  	LNK("cwd",        proc_cwd_link),
> >  	LNK("root",       proc_root_link),
> >  	LNK("exe",        proc_exe_link),
> > @@ -3631,7 +3674,7 @@ static const struct pid_entry tid_base_stuff[] = {
> >  #ifdef CONFIG_NUMA
> >  	REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
> >  #endif
> > -	REG("mem",       S_IRUSR|S_IWUSR, proc_mem_operations),
> > +	REG("mem",       PROC_PID_MEM_MODE, proc_mem_operations),
> >  	LNK("cwd",       proc_cwd_link),
> >  	LNK("root",      proc_root_link),
> >  	LNK("exe",       proc_exe_link),
> > diff --git a/security/Kconfig b/security/Kconfig
> > index 412e76f1575d..ffee9e847ed9 100644
> > --- a/security/Kconfig
> > +++ b/security/Kconfig
> > @@ -19,6 +19,28 @@ config SECURITY_DMESG_RESTRICT
> >  
> >  	  If you are unsure how to answer this question, answer N.
> >  
> > +config SECURITY_PROC_MEM_RESTRICT_WRITE
> > +	bool "Restrict /proc/*/mem write access"
> > +	default n
> > +	help
> > +	  This restricts writes to /proc/<pid>/mem, except when the current
> > +	  process ptraces the /proc/<pid>/mem task, because a ptracer already
> > +	  has write access to the tracee memory.
> > +
> > +	  Write access to this file allows bypassing memory map permissions,
> > +	  such as modifying read-only code.
> > +
> > +	  If you are unsure how to answer this question, answer N.
> > +
> > +config SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON
> > +	bool "Default state of /proc/*/mem write restriction"
> > +	depends on SECURITY_PROC_MEM_RESTRICT_WRITE
> > +	default y
> > +	help
> > +	  /proc/*/mem write access is controlled by kernel boot param
> > +	  "restrict_proc_mem_write" and this config chooses the default
> > +	  boot state.
> 
> As mentioned, I'd say merge the help texts here, but drop
> SECURITY_PROC_MEM_RESTRICT_WRITE.
> 

Ack, will do in v3.

> > +
> >  config SECURITY
> >  	bool "Enable different security models"
> >  	depends on SYSFS
> > -- 
> > 2.30.2
> > 
> 
> Thanks for this! I look forward to turning it on. :)
>

Thank you very much for all your feedback!
It is much appreciated.

I'll wait a few more days before sending v3 to let others
comment, then address everything.
 
> -Kees
> 
> -- 
> Kees Cook

Christian Brauner March 4, 2024, 1:20 p.m. UTC | #3

On Fri, Mar 01, 2024 at 11:34:42PM +0200, Adrian Ratiu wrote:
> Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
> after which it got allowed in commit 198214a7ee50 ("proc: enable
> writing to /proc/pid/mem"). Famous last words from that patch:
> "no longer a security hazard". :)
> 
> Afterwards exploits appeared started causing drama like [1]. The
> /proc/*/mem exploits can be rather sophisticated like [2] which
> installed an arbitrary payload from noexec storage into a running
> process then exec'd it, which itself could include an ELF loader
> to run arbitrary code off noexec storage.
> 
> As part of hardening against these types of attacks, distrbutions
> can restrict /proc/*/mem to only allow writes when they makes sense,
> like in case of debuggers which have ptrace permissions, as they
> are able to access memory anyway via PTRACE_POKEDATA and friends.
> 
> Dropping the mode bits disables write access for non-root users.
> Trying to `chmod` the paths back fails as the kernel rejects it.
> 
> For users with CAP_DAC_OVERRIDE (usually just root) we have to
> disable the mem_write callback to avoid bypassing the mode bits.
> 
> Writes can be used to bypass permissions on memory maps, even if a
> memory region is mapped r-x (as is a program's executable pages),
> the process can open its own /proc/self/mem file and write to the
> pages directly.
> 
> Even if seccomp filters block mmap/mprotect calls with W|X perms,
> they often cannot block open calls as daemons want to read/write
> their own runtime state and seccomp filters cannot check file paths.
> Write calls also can't be blocked in general via seccomp.
> 
> Since the mem file is part of the dynamic /proc/<pid>/ space, we
> can't run chmod once at boot to restrict it (and trying to react
> to every process and run chmod doesn't scale, and the kernel no
> longer allows chmod on any of these paths).
> 
> SELinux could be used with a rule to cover all /proc/*/mem files,
> but even then having multiple ways to deny an attack is useful in
> case on layer fails.
> 
> [1] https://lwn.net/Articles/476947/
> [2] https://issues.chromium.org/issues/40089045
> 
> Based on an initial patch by Mike Frysinger <vapier@chromium.org>.
> 
> Cc: Guenter Roeck <groeck@chromium.org>
> Cc: Doug Anderson <dianders@chromium.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Co-developed-by: Mike Frysinger <vapier@chromium.org>
> Signed-off-by: Mike Frysinger <vapier@chromium.org>
> Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> ---
> Changes in v2:
>  * Added boot time parameter with default kconfig option
>  * Moved check earlier in mem_open() instead of mem_write()
>  * Simplified implementation branching
>  * Removed dependency on CONFIG_MEMCG
> ---
>  .../admin-guide/kernel-parameters.txt         |  4 ++
>  fs/proc/base.c                                | 47 ++++++++++++++++++-
>  security/Kconfig                              | 22 +++++++++
>  3 files changed, 71 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 460b97a1d0da..0647e2f54248 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5618,6 +5618,10 @@
>  	reset_devices	[KNL] Force drivers to reset the underlying device
>  			during initialization.
>  
> +	restrict_proc_mem_write= [KNL]
> +			Enable or disable write access to /proc/*/mem files.
> +			Default is SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON.
> +
>  	resume=		[SWSUSP]
>  			Specify the partition device for software suspend
>  			Format:
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 98a031ac2648..92f668191312 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -152,6 +152,30 @@ struct pid_entry {
>  		NULL, &proc_pid_attr_operations,	\
>  		{ .lsmid = LSMID })
>  
> +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> +DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> +			   restrict_proc_mem_write);
> +static int __init early_restrict_proc_mem_write(char *buf)
> +{
> +	int ret;
> +	bool bool_result;
> +
> +	ret = kstrtobool(buf, &bool_result);
> +	if (ret)
> +		return ret;
> +
> +	if (bool_result)
> +		static_branch_enable(&restrict_proc_mem_write);
> +	else
> +		static_branch_disable(&restrict_proc_mem_write);
> +	return 0;
> +}
> +early_param("restrict_proc_mem_write", early_restrict_proc_mem_write);
> +# define PROC_PID_MEM_MODE S_IRUSR
> +#else
> +# define PROC_PID_MEM_MODE (S_IRUSR|S_IWUSR)
> +#endif
> +
>  /*
>   * Count the number of hardlinks for the pid_entry table, excluding the .
>   * and .. links.
> @@ -829,6 +853,25 @@ static int mem_open(struct inode *inode, struct file *file)
>  {
>  	int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
>  
> +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> +	struct mm_struct *mm = file->private_data;
> +	struct task_struct *task = get_proc_task(inode);
> +
> +	if (mm && task) {
> +		/* Only allow writes by processes already ptracing the target task */
> +		if (file->f_mode & FMODE_WRITE &&
> +		    static_branch_maybe(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> +					&restrict_proc_mem_write)) {
> +			rcu_read_lock();
> +			if (!ptracer_capable(current, mm->user_ns) ||
> +			    current != ptrace_parent(task))
> +				ret = -EACCES;

Uhm, this will break the seccomp notifier, no? So you can't turn on
SECURITY_PROC_MEM_RESTRICT_WRITE when you want to use the seccomp
notifier to do system call interception and rewrite memory locations of
the calling task, no? Which is very much relied upon in various
container managers and possibly other security tools.

Which means that you can't turn this on in any of the regular distros.

So you need to either account for the calling task being a seccomp
supervisor for the task whose memory it is trying to access or you need
to provide a migration path by adding an api that let's caller's perform
these writes through the seccomp notifier.

Adrian Ratiu March 4, 2024, 1:48 p.m. UTC | #4

On Monday, March 04, 2024 15:20 EET, Christian Brauner <brauner@kernel.org> wrote:

> On Fri, Mar 01, 2024 at 11:34:42PM +0200, Adrian Ratiu wrote:
> > Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
> > after which it got allowed in commit 198214a7ee50 ("proc: enable
> > writing to /proc/pid/mem"). Famous last words from that patch:
> > "no longer a security hazard". :)
> > 
> > Afterwards exploits appeared started causing drama like [1]. The
> > /proc/*/mem exploits can be rather sophisticated like [2] which
> > installed an arbitrary payload from noexec storage into a running
> > process then exec'd it, which itself could include an ELF loader
> > to run arbitrary code off noexec storage.
> > 
> > As part of hardening against these types of attacks, distrbutions
> > can restrict /proc/*/mem to only allow writes when they makes sense,
> > like in case of debuggers which have ptrace permissions, as they
> > are able to access memory anyway via PTRACE_POKEDATA and friends.
> > 
> > Dropping the mode bits disables write access for non-root users.
> > Trying to `chmod` the paths back fails as the kernel rejects it.
> > 
> > For users with CAP_DAC_OVERRIDE (usually just root) we have to
> > disable the mem_write callback to avoid bypassing the mode bits.
> > 
> > Writes can be used to bypass permissions on memory maps, even if a
> > memory region is mapped r-x (as is a program's executable pages),
> > the process can open its own /proc/self/mem file and write to the
> > pages directly.
> > 
> > Even if seccomp filters block mmap/mprotect calls with W|X perms,
> > they often cannot block open calls as daemons want to read/write
> > their own runtime state and seccomp filters cannot check file paths.
> > Write calls also can't be blocked in general via seccomp.
> > 
> > Since the mem file is part of the dynamic /proc/<pid>/ space, we
> > can't run chmod once at boot to restrict it (and trying to react
> > to every process and run chmod doesn't scale, and the kernel no
> > longer allows chmod on any of these paths).
> > 
> > SELinux could be used with a rule to cover all /proc/*/mem files,
> > but even then having multiple ways to deny an attack is useful in
> > case on layer fails.
> > 
> > [1] https://lwn.net/Articles/476947/
> > [2] https://issues.chromium.org/issues/40089045
> > 
> > Based on an initial patch by Mike Frysinger <vapier@chromium.org>.
> > 
> > Cc: Guenter Roeck <groeck@chromium.org>
> > Cc: Doug Anderson <dianders@chromium.org>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Randy Dunlap <rdunlap@infradead.org>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Co-developed-by: Mike Frysinger <vapier@chromium.org>
> > Signed-off-by: Mike Frysinger <vapier@chromium.org>
> > Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> > ---
> > Changes in v2:
> >  * Added boot time parameter with default kconfig option
> >  * Moved check earlier in mem_open() instead of mem_write()
> >  * Simplified implementation branching
> >  * Removed dependency on CONFIG_MEMCG
> > ---
> >  .../admin-guide/kernel-parameters.txt         |  4 ++
> >  fs/proc/base.c                                | 47 ++++++++++++++++++-
> >  security/Kconfig                              | 22 +++++++++
> >  3 files changed, 71 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 460b97a1d0da..0647e2f54248 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5618,6 +5618,10 @@
> >  	reset_devices	[KNL] Force drivers to reset the underlying device
> >  			during initialization.
> >  
> > +	restrict_proc_mem_write= [KNL]
> > +			Enable or disable write access to /proc/*/mem files.
> > +			Default is SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON.
> > +
> >  	resume=		[SWSUSP]
> >  			Specify the partition device for software suspend
> >  			Format:
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 98a031ac2648..92f668191312 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -152,6 +152,30 @@ struct pid_entry {
> >  		NULL, &proc_pid_attr_operations,	\
> >  		{ .lsmid = LSMID })
> >  
> > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> > +DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> > +			   restrict_proc_mem_write);
> > +static int __init early_restrict_proc_mem_write(char *buf)
> > +{
> > +	int ret;
> > +	bool bool_result;
> > +
> > +	ret = kstrtobool(buf, &bool_result);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (bool_result)
> > +		static_branch_enable(&restrict_proc_mem_write);
> > +	else
> > +		static_branch_disable(&restrict_proc_mem_write);
> > +	return 0;
> > +}
> > +early_param("restrict_proc_mem_write", early_restrict_proc_mem_write);
> > +# define PROC_PID_MEM_MODE S_IRUSR
> > +#else
> > +# define PROC_PID_MEM_MODE (S_IRUSR|S_IWUSR)
> > +#endif
> > +
> >  /*
> >   * Count the number of hardlinks for the pid_entry table, excluding the .
> >   * and .. links.
> > @@ -829,6 +853,25 @@ static int mem_open(struct inode *inode, struct file *file)
> >  {
> >  	int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> >  
> > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> > +	struct mm_struct *mm = file->private_data;
> > +	struct task_struct *task = get_proc_task(inode);
> > +
> > +	if (mm && task) {
> > +		/* Only allow writes by processes already ptracing the target task */
> > +		if (file->f_mode & FMODE_WRITE &&
> > +		    static_branch_maybe(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> > +					&restrict_proc_mem_write)) {
> > +			rcu_read_lock();
> > +			if (!ptracer_capable(current, mm->user_ns) ||
> > +			    current != ptrace_parent(task))
> > +				ret = -EACCES;
> 
> Uhm, this will break the seccomp notifier, no? So you can't turn on
> SECURITY_PROC_MEM_RESTRICT_WRITE when you want to use the seccomp
> notifier to do system call interception and rewrite memory locations of
> the calling task, no? Which is very much relied upon in various
> container managers and possibly other security tools.
> 
> Which means that you can't turn this on in any of the regular distros.
> 
> So you need to either account for the calling task being a seccomp
> supervisor for the task whose memory it is trying to access or you need
> to provide a migration path by adding an api that let's caller's perform
> these writes through the seccomp notifier.

Thanks for raising this!

I did test seccomp filtering/blocking functionality which seemed to work but I'll make sure to also test syscall interception before sending v3, to confirm whether it breaks.

The simplest solution is to add an exception for seccomp supervisors just like we did for tracers, yes, so I'm inclined to go with that if needed. :)

Ideally we find all exceptions and fix them before defaulting this to on -- unforeseen breakages is why I want to default it to OFF, at least initially, until we can be reasonably sure all cases have been covered.

Christian Brauner March 4, 2024, 2:05 p.m. UTC | #5

On Mon, Mar 04, 2024 at 01:48:19PM +0000, Adrian Ratiu wrote:
> On Monday, March 04, 2024 15:20 EET, Christian Brauner <brauner@kernel.org> wrote:
> 
> > On Fri, Mar 01, 2024 at 11:34:42PM +0200, Adrian Ratiu wrote:
> > > Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
> > > after which it got allowed in commit 198214a7ee50 ("proc: enable
> > > writing to /proc/pid/mem"). Famous last words from that patch:
> > > "no longer a security hazard". :)
> > > 
> > > Afterwards exploits appeared started causing drama like [1]. The
> > > /proc/*/mem exploits can be rather sophisticated like [2] which
> > > installed an arbitrary payload from noexec storage into a running
> > > process then exec'd it, which itself could include an ELF loader
> > > to run arbitrary code off noexec storage.
> > > 
> > > As part of hardening against these types of attacks, distrbutions
> > > can restrict /proc/*/mem to only allow writes when they makes sense,
> > > like in case of debuggers which have ptrace permissions, as they
> > > are able to access memory anyway via PTRACE_POKEDATA and friends.
> > > 
> > > Dropping the mode bits disables write access for non-root users.
> > > Trying to `chmod` the paths back fails as the kernel rejects it.
> > > 
> > > For users with CAP_DAC_OVERRIDE (usually just root) we have to
> > > disable the mem_write callback to avoid bypassing the mode bits.
> > > 
> > > Writes can be used to bypass permissions on memory maps, even if a
> > > memory region is mapped r-x (as is a program's executable pages),
> > > the process can open its own /proc/self/mem file and write to the
> > > pages directly.
> > > 
> > > Even if seccomp filters block mmap/mprotect calls with W|X perms,
> > > they often cannot block open calls as daemons want to read/write
> > > their own runtime state and seccomp filters cannot check file paths.
> > > Write calls also can't be blocked in general via seccomp.
> > > 
> > > Since the mem file is part of the dynamic /proc/<pid>/ space, we
> > > can't run chmod once at boot to restrict it (and trying to react
> > > to every process and run chmod doesn't scale, and the kernel no
> > > longer allows chmod on any of these paths).
> > > 
> > > SELinux could be used with a rule to cover all /proc/*/mem files,
> > > but even then having multiple ways to deny an attack is useful in
> > > case on layer fails.
> > > 
> > > [1] https://lwn.net/Articles/476947/
> > > [2] https://issues.chromium.org/issues/40089045
> > > 
> > > Based on an initial patch by Mike Frysinger <vapier@chromium.org>.
> > > 
> > > Cc: Guenter Roeck <groeck@chromium.org>
> > > Cc: Doug Anderson <dianders@chromium.org>
> > > Cc: Kees Cook <keescook@chromium.org>
> > > Cc: Jann Horn <jannh@google.com>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Randy Dunlap <rdunlap@infradead.org>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Co-developed-by: Mike Frysinger <vapier@chromium.org>
> > > Signed-off-by: Mike Frysinger <vapier@chromium.org>
> > > Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> > > ---
> > > Changes in v2:
> > >  * Added boot time parameter with default kconfig option
> > >  * Moved check earlier in mem_open() instead of mem_write()
> > >  * Simplified implementation branching
> > >  * Removed dependency on CONFIG_MEMCG
> > > ---
> > >  .../admin-guide/kernel-parameters.txt         |  4 ++
> > >  fs/proc/base.c                                | 47 ++++++++++++++++++-
> > >  security/Kconfig                              | 22 +++++++++
> > >  3 files changed, 71 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index 460b97a1d0da..0647e2f54248 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -5618,6 +5618,10 @@
> > >  	reset_devices	[KNL] Force drivers to reset the underlying device
> > >  			during initialization.
> > >  
> > > +	restrict_proc_mem_write= [KNL]
> > > +			Enable or disable write access to /proc/*/mem files.
> > > +			Default is SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON.
> > > +
> > >  	resume=		[SWSUSP]
> > >  			Specify the partition device for software suspend
> > >  			Format:
> > > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > > index 98a031ac2648..92f668191312 100644
> > > --- a/fs/proc/base.c
> > > +++ b/fs/proc/base.c
> > > @@ -152,6 +152,30 @@ struct pid_entry {
> > >  		NULL, &proc_pid_attr_operations,	\
> > >  		{ .lsmid = LSMID })
> > >  
> > > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> > > +DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> > > +			   restrict_proc_mem_write);
> > > +static int __init early_restrict_proc_mem_write(char *buf)
> > > +{
> > > +	int ret;
> > > +	bool bool_result;
> > > +
> > > +	ret = kstrtobool(buf, &bool_result);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	if (bool_result)
> > > +		static_branch_enable(&restrict_proc_mem_write);
> > > +	else
> > > +		static_branch_disable(&restrict_proc_mem_write);
> > > +	return 0;
> > > +}
> > > +early_param("restrict_proc_mem_write", early_restrict_proc_mem_write);
> > > +# define PROC_PID_MEM_MODE S_IRUSR
> > > +#else
> > > +# define PROC_PID_MEM_MODE (S_IRUSR|S_IWUSR)
> > > +#endif
> > > +
> > >  /*
> > >   * Count the number of hardlinks for the pid_entry table, excluding the .
> > >   * and .. links.
> > > @@ -829,6 +853,25 @@ static int mem_open(struct inode *inode, struct file *file)
> > >  {
> > >  	int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> > >  
> > > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> > > +	struct mm_struct *mm = file->private_data;
> > > +	struct task_struct *task = get_proc_task(inode);
> > > +
> > > +	if (mm && task) {
> > > +		/* Only allow writes by processes already ptracing the target task */
> > > +		if (file->f_mode & FMODE_WRITE &&
> > > +		    static_branch_maybe(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> > > +					&restrict_proc_mem_write)) {
> > > +			rcu_read_lock();
> > > +			if (!ptracer_capable(current, mm->user_ns) ||
> > > +			    current != ptrace_parent(task))
> > > +				ret = -EACCES;
> > 
> > Uhm, this will break the seccomp notifier, no? So you can't turn on
> > SECURITY_PROC_MEM_RESTRICT_WRITE when you want to use the seccomp
> > notifier to do system call interception and rewrite memory locations of
> > the calling task, no? Which is very much relied upon in various
> > container managers and possibly other security tools.
> > 
> > Which means that you can't turn this on in any of the regular distros.
> > 
> > So you need to either account for the calling task being a seccomp
> > supervisor for the task whose memory it is trying to access or you need
> > to provide a migration path by adding an api that let's caller's perform
> > these writes through the seccomp notifier.
> 
> Thanks for raising this!
> 
> I did test seccomp filtering/blocking functionality which seemed to
> work but I'll make sure to also test syscall interception before
> sending v3, to confirm whether it breaks.
> 
> The simplest solution is to add an exception for seccomp supervisors
> just like we did for tracers, yes, so I'm inclined to go with that if
> needed. :)

Ok. Note that your patch also doesn't cover process_vm_writev() which
means that you can still use that as an alternative to write to memory -
albeit with a lote more raciness. IOW, a seccomp notifier can do the
dance of:

pidfd = clone3(CLONE_PIDFD)
// handle error
int fd_mem = open("/proc/$pid/mem", O_RDWR);:
// handle error
if (pidfd_send_signal(pidfd, 0, ...) == 0)
        write(fd_mem, ...);

which lets it avoid the raciness. That's not possible with
process_vm_writev() especially if it's received via AF_UNIX sockets
which happens if the seccomp notifier lives in a proxy process. I know
that happens in the wild.

So overall, it seems a bit odd to me because why block /proc/<pid>/mem
specifically and not also cover process_vm_writev()? Because that's easy
to block via regular seccomp system call filtering?

Adrian Ratiu March 4, 2024, 2:06 p.m. UTC | #6

On Saturday, March 02, 2024 01:55 EET, Kees Cook <keescook@chromium.org> wrote:

> On Fri, Mar 01, 2024 at 11:34:42PM +0200, Adrian Ratiu wrote:
> > Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
> > after which it got allowed in commit 198214a7ee50 ("proc: enable
> > writing to /proc/pid/mem"). Famous last words from that patch:
> > "no longer a security hazard". :)
> > 
> > Afterwards exploits appeared started causing drama like [1]. The
> 
> nit: I think "appeared" can be dropped here.
> 
> > /proc/*/mem exploits can be rather sophisticated like [2] which
> > installed an arbitrary payload from noexec storage into a running
> > process then exec'd it, which itself could include an ELF loader
> > to run arbitrary code off noexec storage.
> > 
> > As part of hardening against these types of attacks, distrbutions
> > can restrict /proc/*/mem to only allow writes when they makes sense,
> > like in case of debuggers which have ptrace permissions, as they
> > are able to access memory anyway via PTRACE_POKEDATA and friends.
> > 
> > Dropping the mode bits disables write access for non-root users.
> > Trying to `chmod` the paths back fails as the kernel rejects it.
> > 
> > For users with CAP_DAC_OVERRIDE (usually just root) we have to
> > disable the mem_write callback to avoid bypassing the mode bits.
> > 
> > Writes can be used to bypass permissions on memory maps, even if a
> > memory region is mapped r-x (as is a program's executable pages),
> > the process can open its own /proc/self/mem file and write to the
> > pages directly.
> > 
> > Even if seccomp filters block mmap/mprotect calls with W|X perms,
> > they often cannot block open calls as daemons want to read/write
> > their own runtime state and seccomp filters cannot check file paths.
> > Write calls also can't be blocked in general via seccomp.
> > 
> > Since the mem file is part of the dynamic /proc/<pid>/ space, we
> > can't run chmod once at boot to restrict it (and trying to react
> > to every process and run chmod doesn't scale, and the kernel no
> > longer allows chmod on any of these paths).
> > 
> > SELinux could be used with a rule to cover all /proc/*/mem files,
> > but even then having multiple ways to deny an attack is useful in
> > case on layer fails.
> 
> Everything above here is good to keep in the commit log, but it's all
> the "background". Please also write here what has been done to address
> the background above it. e.g.:
> 
> "Introduce a CONFIG and a __ro_after_init runtime toggle to make
> it so only processes that are already tracing the task to write to
> /proc/<pid>/mem." etc
> 
> > 
> > [1] https://lwn.net/Articles/476947/
> > [2] https://issues.chromium.org/issues/40089045
> 
> These can be:
> 
> Link: https://lwn.net/Articles/476947/ [1]
> Link: https://issues.chromium.org/issues/40089045 [2]
> 
> > Based on an initial patch by Mike Frysinger <vapier@chromium.org>.
> > 
> > Cc: Guenter Roeck <groeck@chromium.org>
> > Cc: Doug Anderson <dianders@chromium.org>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Randy Dunlap <rdunlap@infradead.org>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Co-developed-by: Mike Frysinger <vapier@chromium.org>
> > Signed-off-by: Mike Frysinger <vapier@chromium.org>
> > Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> > ---
> > Changes in v2:
> >  * Added boot time parameter with default kconfig option
> >  * Moved check earlier in mem_open() instead of mem_write()
> >  * Simplified implementation branching
> >  * Removed dependency on CONFIG_MEMCG
> 
> Can you mention in the commit log what behaviors have been tested with
> this patch? For example, I assume gdb still works with
> restrict_proc_mem_write=y ?
> 
> When this is enabled, what _does_ break that people might expect to
> work?
> 
> > ---
> >  .../admin-guide/kernel-parameters.txt         |  4 ++
> >  fs/proc/base.c                                | 47 ++++++++++++++++++-
> >  security/Kconfig                              | 22 +++++++++
> >  3 files changed, 71 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 460b97a1d0da..0647e2f54248 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5618,6 +5618,10 @@
> >  	reset_devices	[KNL] Force drivers to reset the underlying device
> >  			during initialization.
> >  
> > +	restrict_proc_mem_write= [KNL]
> 
> Please add here:
> 
> 			Format: <bool>
> 
> > +			Enable or disable write access to /proc/*/mem files.
> > +			Default is SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON.
> > +
> >  	resume=		[SWSUSP]
> >  			Specify the partition device for software suspend
> >  			Format:
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 98a031ac2648..92f668191312 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -152,6 +152,30 @@ struct pid_entry {
> >  		NULL, &proc_pid_attr_operations,	\
> >  		{ .lsmid = LSMID })
> >  
> > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> 
> Please drop this CONFIG entirely -- it should be always available for
> all builds of the kernel. Only CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON
> needs to remain.
> 
> > +DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> > +			   restrict_proc_mem_write);
> > +static int __init early_restrict_proc_mem_write(char *buf)
> > +{
> > +	int ret;
> > +	bool bool_result;
> > +
> > +	ret = kstrtobool(buf, &bool_result);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (bool_result)
> > +		static_branch_enable(&restrict_proc_mem_write);
> > +	else
> > +		static_branch_disable(&restrict_proc_mem_write);
> > +	return 0;
> > +}
> > +early_param("restrict_proc_mem_write", early_restrict_proc_mem_write);
> > +# define PROC_PID_MEM_MODE S_IRUSR
> > +#else
> > +# define PROC_PID_MEM_MODE (S_IRUSR|S_IWUSR)
> > +#endif
> 
> PROC_PID_MEM_MODE will need to be a __ro_after_init variable, set by
> early_restrict_proc_mem_write, otherwise the mode won't change based on
> the runtime setting. e.g.:
> 
> #ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON
> mode_t proc_pid_mem_mode __ro_after_init = S_IRUSR;
> #else
> mode_t proc_pid_mem_mode __ro_after_init = (S_IRUSR|S_IWUSR);
> #endif
> 
> DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> 			   restrict_proc_mem_write);
> ...
> 	if (bool_result) {
> 		static_branch_enable(&restrict_proc_mem_write);
> 		proc_pid_mem_mode = S_IRUSR;
> 	} else {
> 		static_branch_disable(&restrict_proc_mem_write);
> 		proc_pid_mem_mode = (S_IRUSR|S_IWUSR);
> 	}
> ...
> 	REG("mem",        proc_pid_mem_mode, proc_mem_operations),

I'm having trouble implementing this because the proc_pid_mem_mode initializer needs to be a compile-time constant, so I can't set a runtime value in the REG() definition like suggested above.

This was not an issue in v2 because we had two separate kconfigs:
-  one which enabled the feature and controlled the static build-time file modes (CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE)
- another which set the default runtime value and more importantly which depended on the first one so the values are consistent (CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON)

Do you have a suggestion how to fix this? Maybe store the flags in a static key? I'm asking because I'm not very familiar with static keys.

Or maybe we can continue using the 2 kconfig options?

> 
> 
> > +
> >  /*
> >   * Count the number of hardlinks for the pid_entry table, excluding the .
> >   * and .. links.
> > @@ -829,6 +853,25 @@ static int mem_open(struct inode *inode, struct file *file)
> >  {
> >  	int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> >  
> > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> 
> Drop this ifdef (as mentioned above).
> 
> > +	struct mm_struct *mm = file->private_data;
> > +	struct task_struct *task = get_proc_task(inode);
> > +
> > +	if (mm && task) {
> > +		/* Only allow writes by processes already ptracing the target task */
> > +		if (file->f_mode & FMODE_WRITE &&
> > +		    static_branch_maybe(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> > +					&restrict_proc_mem_write)) {
> 
> Do we need to also do an mm_access() on the task to verify that the task
> we're about to check has its mm still matching file->private_data? The
> PID can change out from under us (but the mm cannot).
> 
> > +			rcu_read_lock();
> > +			if (!ptracer_capable(current, mm->user_ns) ||
> > +			    current != ptrace_parent(task))
> 
> If you're just allowing "already ptracing", why include the
> ptracer_capable() check?
> 
> > +				ret = -EACCES;
> > +			rcu_read_unlock();
> > +		}
> > +		put_task_struct(task);
> > +	}
> > +#endif
> > +
> >  	/* OK to pass negative loff_t, we can catch out-of-range */
> >  	file->f_mode |= FMODE_UNSIGNED_OFFSET;
> >  
> > @@ -3281,7 +3324,7 @@ static const struct pid_entry tgid_base_stuff[] = {
> >  #ifdef CONFIG_NUMA
> >  	REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
> >  #endif
> > -	REG("mem",        S_IRUSR|S_IWUSR, proc_mem_operations),
> > +	REG("mem",        PROC_PID_MEM_MODE, proc_mem_operations),
> >  	LNK("cwd",        proc_cwd_link),
> >  	LNK("root",       proc_root_link),
> >  	LNK("exe",        proc_exe_link),
> > @@ -3631,7 +3674,7 @@ static const struct pid_entry tid_base_stuff[] = {
> >  #ifdef CONFIG_NUMA
> >  	REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
> >  #endif
> > -	REG("mem",       S_IRUSR|S_IWUSR, proc_mem_operations),
> > +	REG("mem",       PROC_PID_MEM_MODE, proc_mem_operations),
> >  	LNK("cwd",       proc_cwd_link),
> >  	LNK("root",      proc_root_link),
> >  	LNK("exe",       proc_exe_link),
> > diff --git a/security/Kconfig b/security/Kconfig
> > index 412e76f1575d..ffee9e847ed9 100644
> > --- a/security/Kconfig
> > +++ b/security/Kconfig
> > @@ -19,6 +19,28 @@ config SECURITY_DMESG_RESTRICT
> >  
> >  	  If you are unsure how to answer this question, answer N.
> >  
> > +config SECURITY_PROC_MEM_RESTRICT_WRITE
> > +	bool "Restrict /proc/*/mem write access"
> > +	default n
> > +	help
> > +	  This restricts writes to /proc/<pid>/mem, except when the current
> > +	  process ptraces the /proc/<pid>/mem task, because a ptracer already
> > +	  has write access to the tracee memory.
> > +
> > +	  Write access to this file allows bypassing memory map permissions,
> > +	  such as modifying read-only code.
> > +
> > +	  If you are unsure how to answer this question, answer N.
> > +
> > +config SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON
> > +	bool "Default state of /proc/*/mem write restriction"
> > +	depends on SECURITY_PROC_MEM_RESTRICT_WRITE
> > +	default y
> > +	help
> > +	  /proc/*/mem write access is controlled by kernel boot param
> > +	  "restrict_proc_mem_write" and this config chooses the default
> > +	  boot state.
> 
> As mentioned, I'd say merge the help texts here, but drop
> SECURITY_PROC_MEM_RESTRICT_WRITE.
> 
> > +
> >  config SECURITY
> >  	bool "Enable different security models"
> >  	depends on SYSFS
> > -- 
> > 2.30.2
> > 
> 
> Thanks for this! I look forward to turning it on. :)
> 
> -Kees
> 
> -- 
> Kees Cook

Adrian Ratiu March 4, 2024, 2:35 p.m. UTC | #7

On Monday, March 04, 2024 16:05 EET, Christian Brauner <brauner@kernel.org> wrote:

> On Mon, Mar 04, 2024 at 01:48:19PM +0000, Adrian Ratiu wrote:
> > On Monday, March 04, 2024 15:20 EET, Christian Brauner <brauner@kernel.org> wrote:
> > 
> > > On Fri, Mar 01, 2024 at 11:34:42PM +0200, Adrian Ratiu wrote:
> > > > Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
> > > > after which it got allowed in commit 198214a7ee50 ("proc: enable
> > > > writing to /proc/pid/mem"). Famous last words from that patch:
> > > > "no longer a security hazard". :)
> > > > 
> > > > Afterwards exploits appeared started causing drama like [1]. The
> > > > /proc/*/mem exploits can be rather sophisticated like [2] which
> > > > installed an arbitrary payload from noexec storage into a running
> > > > process then exec'd it, which itself could include an ELF loader
> > > > to run arbitrary code off noexec storage.
> > > > 
> > > > As part of hardening against these types of attacks, distrbutions
> > > > can restrict /proc/*/mem to only allow writes when they makes sense,
> > > > like in case of debuggers which have ptrace permissions, as they
> > > > are able to access memory anyway via PTRACE_POKEDATA and friends.
> > > > 
> > > > Dropping the mode bits disables write access for non-root users.
> > > > Trying to `chmod` the paths back fails as the kernel rejects it.
> > > > 
> > > > For users with CAP_DAC_OVERRIDE (usually just root) we have to
> > > > disable the mem_write callback to avoid bypassing the mode bits.
> > > > 
> > > > Writes can be used to bypass permissions on memory maps, even if a
> > > > memory region is mapped r-x (as is a program's executable pages),
> > > > the process can open its own /proc/self/mem file and write to the
> > > > pages directly.
> > > > 
> > > > Even if seccomp filters block mmap/mprotect calls with W|X perms,
> > > > they often cannot block open calls as daemons want to read/write
> > > > their own runtime state and seccomp filters cannot check file paths.
> > > > Write calls also can't be blocked in general via seccomp.
> > > > 
> > > > Since the mem file is part of the dynamic /proc/<pid>/ space, we
> > > > can't run chmod once at boot to restrict it (and trying to react
> > > > to every process and run chmod doesn't scale, and the kernel no
> > > > longer allows chmod on any of these paths).
> > > > 
> > > > SELinux could be used with a rule to cover all /proc/*/mem files,
> > > > but even then having multiple ways to deny an attack is useful in
> > > > case on layer fails.
> > > > 
> > > > [1] https://lwn.net/Articles/476947/
> > > > [2] https://issues.chromium.org/issues/40089045
> > > > 
> > > > Based on an initial patch by Mike Frysinger <vapier@chromium.org>.
> > > > 
> > > > Cc: Guenter Roeck <groeck@chromium.org>
> > > > Cc: Doug Anderson <dianders@chromium.org>
> > > > Cc: Kees Cook <keescook@chromium.org>
> > > > Cc: Jann Horn <jannh@google.com>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: Randy Dunlap <rdunlap@infradead.org>
> > > > Cc: Christian Brauner <brauner@kernel.org>
> > > > Co-developed-by: Mike Frysinger <vapier@chromium.org>
> > > > Signed-off-by: Mike Frysinger <vapier@chromium.org>
> > > > Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> > > > ---
> > > > Changes in v2:
> > > >  * Added boot time parameter with default kconfig option
> > > >  * Moved check earlier in mem_open() instead of mem_write()
> > > >  * Simplified implementation branching
> > > >  * Removed dependency on CONFIG_MEMCG
> > > > ---
> > > >  .../admin-guide/kernel-parameters.txt         |  4 ++
> > > >  fs/proc/base.c                                | 47 ++++++++++++++++++-
> > > >  security/Kconfig                              | 22 +++++++++
> > > >  3 files changed, 71 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > index 460b97a1d0da..0647e2f54248 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -5618,6 +5618,10 @@
> > > >  	reset_devices	[KNL] Force drivers to reset the underlying device
> > > >  			during initialization.
> > > >  
> > > > +	restrict_proc_mem_write= [KNL]
> > > > +			Enable or disable write access to /proc/*/mem files.
> > > > +			Default is SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON.
> > > > +
> > > >  	resume=		[SWSUSP]
> > > >  			Specify the partition device for software suspend
> > > >  			Format:
> > > > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > > > index 98a031ac2648..92f668191312 100644
> > > > --- a/fs/proc/base.c
> > > > +++ b/fs/proc/base.c
> > > > @@ -152,6 +152,30 @@ struct pid_entry {
> > > >  		NULL, &proc_pid_attr_operations,	\
> > > >  		{ .lsmid = LSMID })
> > > >  
> > > > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> > > > +DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> > > > +			   restrict_proc_mem_write);
> > > > +static int __init early_restrict_proc_mem_write(char *buf)
> > > > +{
> > > > +	int ret;
> > > > +	bool bool_result;
> > > > +
> > > > +	ret = kstrtobool(buf, &bool_result);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	if (bool_result)
> > > > +		static_branch_enable(&restrict_proc_mem_write);
> > > > +	else
> > > > +		static_branch_disable(&restrict_proc_mem_write);
> > > > +	return 0;
> > > > +}
> > > > +early_param("restrict_proc_mem_write", early_restrict_proc_mem_write);
> > > > +# define PROC_PID_MEM_MODE S_IRUSR
> > > > +#else
> > > > +# define PROC_PID_MEM_MODE (S_IRUSR|S_IWUSR)
> > > > +#endif
> > > > +
> > > >  /*
> > > >   * Count the number of hardlinks for the pid_entry table, excluding the .
> > > >   * and .. links.
> > > > @@ -829,6 +853,25 @@ static int mem_open(struct inode *inode, struct file *file)
> > > >  {
> > > >  	int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> > > >  
> > > > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> > > > +	struct mm_struct *mm = file->private_data;
> > > > +	struct task_struct *task = get_proc_task(inode);
> > > > +
> > > > +	if (mm && task) {
> > > > +		/* Only allow writes by processes already ptracing the target task */
> > > > +		if (file->f_mode & FMODE_WRITE &&
> > > > +		    static_branch_maybe(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> > > > +					&restrict_proc_mem_write)) {
> > > > +			rcu_read_lock();
> > > > +			if (!ptracer_capable(current, mm->user_ns) ||
> > > > +			    current != ptrace_parent(task))
> > > > +				ret = -EACCES;
> > > 
> > > Uhm, this will break the seccomp notifier, no? So you can't turn on
> > > SECURITY_PROC_MEM_RESTRICT_WRITE when you want to use the seccomp
> > > notifier to do system call interception and rewrite memory locations of
> > > the calling task, no? Which is very much relied upon in various
> > > container managers and possibly other security tools.
> > > 
> > > Which means that you can't turn this on in any of the regular distros.
> > > 
> > > So you need to either account for the calling task being a seccomp
> > > supervisor for the task whose memory it is trying to access or you need
> > > to provide a migration path by adding an api that let's caller's perform
> > > these writes through the seccomp notifier.
> > 
> > Thanks for raising this!
> > 
> > I did test seccomp filtering/blocking functionality which seemed to
> > work but I'll make sure to also test syscall interception before
> > sending v3, to confirm whether it breaks.
> > 
> > The simplest solution is to add an exception for seccomp supervisors
> > just like we did for tracers, yes, so I'm inclined to go with that if
> > needed. :)
> 
> Ok. Note that your patch also doesn't cover process_vm_writev() which
> means that you can still use that as an alternative to write to memory -
> albeit with a lote more raciness. IOW, a seccomp notifier can do the
> dance of:
> 
> pidfd = clone3(CLONE_PIDFD)
> // handle error
> int fd_mem = open("/proc/$pid/mem", O_RDWR);:
> // handle error
> if (pidfd_send_signal(pidfd, 0, ...) == 0)
>         write(fd_mem, ...);
> 
> which lets it avoid the raciness. That's not possible with
> process_vm_writev() especially if it's received via AF_UNIX sockets
> which happens if the seccomp notifier lives in a proxy process. I know
> that happens in the wild.
> 
> So overall, it seems a bit odd to me because why block /proc/<pid>/mem
> specifically and not also cover process_vm_writev()? Because that's easy
> to block via regular seccomp system call filtering?

Yes, easy to block and also respect page permissions (can't write read-only memory) as well as require ptrace access anyway by checking PTRACE_MODE_ATTACH_REALCREDS.

Kees Cook March 4, 2024, 5:42 p.m. UTC | #8

On Mon, Mar 04, 2024 at 02:06:43PM +0000, Adrian Ratiu wrote:
> On Saturday, March 02, 2024 01:55 EET, Kees Cook <keescook@chromium.org> wrote:
> > On Fri, Mar 01, 2024 at 11:34:42PM +0200, Adrian Ratiu wrote:
> > > [...]
> > > +# define PROC_PID_MEM_MODE S_IRUSR
> > > +#else
> > > +# define PROC_PID_MEM_MODE (S_IRUSR|S_IWUSR)
> > > +#endif
> > 
> > PROC_PID_MEM_MODE will need to be a __ro_after_init variable, set by
> > early_restrict_proc_mem_write, otherwise the mode won't change based on
> > the runtime setting. e.g.:
> > 
> > #ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON
> > mode_t proc_pid_mem_mode __ro_after_init = S_IRUSR;
> > #else
> > mode_t proc_pid_mem_mode __ro_after_init = (S_IRUSR|S_IWUSR);
> > #endif
> > 
> > DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> > 			   restrict_proc_mem_write);
> > ...
> > 	if (bool_result) {
> > 		static_branch_enable(&restrict_proc_mem_write);
> > 		proc_pid_mem_mode = S_IRUSR;
> > 	} else {
> > 		static_branch_disable(&restrict_proc_mem_write);
> > 		proc_pid_mem_mode = (S_IRUSR|S_IWUSR);
> > 	}
> > ...
> > 	REG("mem",        proc_pid_mem_mode, proc_mem_operations),
> 
> I'm having trouble implementing this because the proc_pid_mem_mode initializer needs to be a compile-time constant, so I can't set a runtime value in the REG() definition like suggested above.

Ah. Yeah, so I guess just drop the perms change -- you're already
checking the behavior in the open(), so you can just leave the perms
alone.

Kees Cook March 4, 2024, 5:49 p.m. UTC | #9

On Mon, Mar 04, 2024 at 02:20:22PM +0100, Christian Brauner wrote:
> On Fri, Mar 01, 2024 at 11:34:42PM +0200, Adrian Ratiu wrote:
> > Prior to v2.6.39 write access to /proc/<pid>/mem was restricted,
> > after which it got allowed in commit 198214a7ee50 ("proc: enable
> > writing to /proc/pid/mem"). Famous last words from that patch:
> > "no longer a security hazard". :)
> > 
> > Afterwards exploits appeared started causing drama like [1]. The
> > /proc/*/mem exploits can be rather sophisticated like [2] which
> > installed an arbitrary payload from noexec storage into a running
> > process then exec'd it, which itself could include an ELF loader
> > to run arbitrary code off noexec storage.
> > 
> > As part of hardening against these types of attacks, distrbutions
> > can restrict /proc/*/mem to only allow writes when they makes sense,
> > like in case of debuggers which have ptrace permissions, as they
> > are able to access memory anyway via PTRACE_POKEDATA and friends.
> > 
> > Dropping the mode bits disables write access for non-root users.
> > Trying to `chmod` the paths back fails as the kernel rejects it.
> > 
> > For users with CAP_DAC_OVERRIDE (usually just root) we have to
> > disable the mem_write callback to avoid bypassing the mode bits.
> > 
> > Writes can be used to bypass permissions on memory maps, even if a
> > memory region is mapped r-x (as is a program's executable pages),
> > the process can open its own /proc/self/mem file and write to the
> > pages directly.
> > 
> > Even if seccomp filters block mmap/mprotect calls with W|X perms,
> > they often cannot block open calls as daemons want to read/write
> > their own runtime state and seccomp filters cannot check file paths.
> > Write calls also can't be blocked in general via seccomp.
> > 
> > Since the mem file is part of the dynamic /proc/<pid>/ space, we
> > can't run chmod once at boot to restrict it (and trying to react
> > to every process and run chmod doesn't scale, and the kernel no
> > longer allows chmod on any of these paths).
> > 
> > SELinux could be used with a rule to cover all /proc/*/mem files,
> > but even then having multiple ways to deny an attack is useful in
> > case on layer fails.
> > 
> > [1] https://lwn.net/Articles/476947/
> > [2] https://issues.chromium.org/issues/40089045
> > 
> > Based on an initial patch by Mike Frysinger <vapier@chromium.org>.
> > 
> > Cc: Guenter Roeck <groeck@chromium.org>
> > Cc: Doug Anderson <dianders@chromium.org>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Randy Dunlap <rdunlap@infradead.org>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Co-developed-by: Mike Frysinger <vapier@chromium.org>
> > Signed-off-by: Mike Frysinger <vapier@chromium.org>
> > Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> > ---
> > Changes in v2:
> >  * Added boot time parameter with default kconfig option
> >  * Moved check earlier in mem_open() instead of mem_write()
> >  * Simplified implementation branching
> >  * Removed dependency on CONFIG_MEMCG
> > ---
> >  .../admin-guide/kernel-parameters.txt         |  4 ++
> >  fs/proc/base.c                                | 47 ++++++++++++++++++-
> >  security/Kconfig                              | 22 +++++++++
> >  3 files changed, 71 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 460b97a1d0da..0647e2f54248 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5618,6 +5618,10 @@
> >  	reset_devices	[KNL] Force drivers to reset the underlying device
> >  			during initialization.
> >  
> > +	restrict_proc_mem_write= [KNL]
> > +			Enable or disable write access to /proc/*/mem files.
> > +			Default is SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON.
> > +
> >  	resume=		[SWSUSP]
> >  			Specify the partition device for software suspend
> >  			Format:
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 98a031ac2648..92f668191312 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -152,6 +152,30 @@ struct pid_entry {
> >  		NULL, &proc_pid_attr_operations,	\
> >  		{ .lsmid = LSMID })
> >  
> > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> > +DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> > +			   restrict_proc_mem_write);
> > +static int __init early_restrict_proc_mem_write(char *buf)
> > +{
> > +	int ret;
> > +	bool bool_result;
> > +
> > +	ret = kstrtobool(buf, &bool_result);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (bool_result)
> > +		static_branch_enable(&restrict_proc_mem_write);
> > +	else
> > +		static_branch_disable(&restrict_proc_mem_write);
> > +	return 0;
> > +}
> > +early_param("restrict_proc_mem_write", early_restrict_proc_mem_write);
> > +# define PROC_PID_MEM_MODE S_IRUSR
> > +#else
> > +# define PROC_PID_MEM_MODE (S_IRUSR|S_IWUSR)
> > +#endif
> > +
> >  /*
> >   * Count the number of hardlinks for the pid_entry table, excluding the .
> >   * and .. links.
> > @@ -829,6 +853,25 @@ static int mem_open(struct inode *inode, struct file *file)
> >  {
> >  	int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
> >  
> > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE
> > +	struct mm_struct *mm = file->private_data;
> > +	struct task_struct *task = get_proc_task(inode);
> > +
> > +	if (mm && task) {
> > +		/* Only allow writes by processes already ptracing the target task */
> > +		if (file->f_mode & FMODE_WRITE &&
> > +		    static_branch_maybe(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON,
> > +					&restrict_proc_mem_write)) {
> > +			rcu_read_lock();
> > +			if (!ptracer_capable(current, mm->user_ns) ||
> > +			    current != ptrace_parent(task))
> > +				ret = -EACCES;
> 
> Uhm, this will break the seccomp notifier, no? So you can't turn on
> SECURITY_PROC_MEM_RESTRICT_WRITE when you want to use the seccomp
> notifier to do system call interception and rewrite memory locations of
> the calling task, no? Which is very much relied upon in various
> container managers and possibly other security tools.
> 
> Which means that you can't turn this on in any of the regular distros.

FWIW, it's a run-time toggle, but yes, let's make sure this works
correctly.

> So you need to either account for the calling task being a seccomp
> supervisor for the task whose memory it is trying to access or you need
> to provide a migration path by adding an api that let's caller's perform
> these writes through the seccomp notifier.

How do seccomp supervisors that use USER_NOTIF do those kinds of
memory writes currently? I thought they were actually using ptrace?
Everything I'm familiar with is just using SECCOMP_IOCTL_NOTIF_ADDFD,
and not doing fancy memory pokes.

-Kees

Kees Cook March 4, 2024, 5:56 p.m. UTC | #10

On Mon, Mar 04, 2024 at 02:35:29PM +0000, Adrian Ratiu wrote:
> Yes, easy to block and also respect page permissions (can't write
> read-only memory) as well as require ptrace access anyway by checking
> PTRACE_MODE_ATTACH_REALCREDS.

right, I don't think process_vm_writev() ignores page permissions? i.e. I
don't see where it is using FOLL_FORCE, which is one of the central
problems with /proc/$pid/mem. (Which reminds me, this is worth mentioning
more explicitly in the commit log for v3.)

Christian Brauner March 5, 2024, 8:59 a.m. UTC | #11

> > Uhm, this will break the seccomp notifier, no? So you can't turn on
> > SECURITY_PROC_MEM_RESTRICT_WRITE when you want to use the seccomp
> > notifier to do system call interception and rewrite memory locations of
> > the calling task, no? Which is very much relied upon in various
> > container managers and possibly other security tools.
> > 
> > Which means that you can't turn this on in any of the regular distros.
> 
> FWIW, it's a run-time toggle, but yes, let's make sure this works
> correctly.
> 
> > So you need to either account for the calling task being a seccomp
> > supervisor for the task whose memory it is trying to access or you need
> > to provide a migration path by adding an api that let's caller's perform
> > these writes through the seccomp notifier.
> 
> How do seccomp supervisors that use USER_NOTIF do those kinds of
> memory writes currently? I thought they were actually using ptrace?
> Everything I'm familiar with is just using SECCOMP_IOCTL_NOTIF_ADDFD,
> and not doing fancy memory pokes.

For example, incus has a seccomp supervisor such that each container
gets it's own goroutine that is responsible for handling system call
interception.

If a container is started the container runtime connects to an AF_UNIX
socket to register with the seccomp supervisor. It stays connected until
it stops. Everytime a system call is performed that is registered in the
seccomp notifier filter the container runtime will send a AF_UNIX
message to the seccomp supervisor. This will include the following fds:

- the pidfd of the task that performed the system call (we should
  actually replace this with SO_PEERPIDFD now that we have that)
- the fd of the task's memory to /proc/<pid>/mem

The seccomp supervisor will then perform the system call interception
including the required memory reads and writes.

There's no ptrace involved. That was the whole point of the seccomp
notifier. :)

Kees Cook March 5, 2024, 9:41 a.m. UTC | #12

On Tue, Mar 05, 2024 at 09:59:47AM +0100, Christian Brauner wrote:
> > > Uhm, this will break the seccomp notifier, no? So you can't turn on
> > > SECURITY_PROC_MEM_RESTRICT_WRITE when you want to use the seccomp
> > > notifier to do system call interception and rewrite memory locations of
> > > the calling task, no? Which is very much relied upon in various
> > > container managers and possibly other security tools.
> > > 
> > > Which means that you can't turn this on in any of the regular distros.
> > 
> > FWIW, it's a run-time toggle, but yes, let's make sure this works
> > correctly.
> > 
> > > So you need to either account for the calling task being a seccomp
> > > supervisor for the task whose memory it is trying to access or you need
> > > to provide a migration path by adding an api that let's caller's perform
> > > these writes through the seccomp notifier.
> > 
> > How do seccomp supervisors that use USER_NOTIF do those kinds of
> > memory writes currently? I thought they were actually using ptrace?
> > Everything I'm familiar with is just using SECCOMP_IOCTL_NOTIF_ADDFD,
> > and not doing fancy memory pokes.
> 
> For example, incus has a seccomp supervisor such that each container
> gets it's own goroutine that is responsible for handling system call
> interception.
> 
> If a container is started the container runtime connects to an AF_UNIX
> socket to register with the seccomp supervisor. It stays connected until
> it stops. Everytime a system call is performed that is registered in the
> seccomp notifier filter the container runtime will send a AF_UNIX
> message to the seccomp supervisor. This will include the following fds:
> 
> - the pidfd of the task that performed the system call (we should
>   actually replace this with SO_PEERPIDFD now that we have that)
> - the fd of the task's memory to /proc/<pid>/mem
> 
> The seccomp supervisor will then perform the system call interception
> including the required memory reads and writes.

Okay, so the patch would very much break that. Some questions, though:
- why not use process_vm_writev()?
- does the supervisor depend on FOLL_FORCE?

Perhaps is is sufficient to block the use of FOLL_FORCE?

I took a look at the Chrome OS exploit, and I _think_ it is depending
on the FOLL_FORCE behavior (it searches for a symbol to overwrite that
if I'm following correctly is in a read-only region), but some of the
binaries don't include source code, so I couldn't easily see what was
being injected. Mike or Adrian can you confirm this?

Christian Brauner March 5, 2024, 9:58 a.m. UTC | #13

On Tue, Mar 05, 2024 at 01:41:29AM -0800, Kees Cook wrote:
> On Tue, Mar 05, 2024 at 09:59:47AM +0100, Christian Brauner wrote:
> > > > Uhm, this will break the seccomp notifier, no? So you can't turn on
> > > > SECURITY_PROC_MEM_RESTRICT_WRITE when you want to use the seccomp
> > > > notifier to do system call interception and rewrite memory locations of
> > > > the calling task, no? Which is very much relied upon in various
> > > > container managers and possibly other security tools.
> > > > 
> > > > Which means that you can't turn this on in any of the regular distros.
> > > 
> > > FWIW, it's a run-time toggle, but yes, let's make sure this works
> > > correctly.
> > > 
> > > > So you need to either account for the calling task being a seccomp
> > > > supervisor for the task whose memory it is trying to access or you need
> > > > to provide a migration path by adding an api that let's caller's perform
> > > > these writes through the seccomp notifier.
> > > 
> > > How do seccomp supervisors that use USER_NOTIF do those kinds of
> > > memory writes currently? I thought they were actually using ptrace?
> > > Everything I'm familiar with is just using SECCOMP_IOCTL_NOTIF_ADDFD,
> > > and not doing fancy memory pokes.
> > 
> > For example, incus has a seccomp supervisor such that each container
> > gets it's own goroutine that is responsible for handling system call
> > interception.
> > 
> > If a container is started the container runtime connects to an AF_UNIX
> > socket to register with the seccomp supervisor. It stays connected until
> > it stops. Everytime a system call is performed that is registered in the
> > seccomp notifier filter the container runtime will send a AF_UNIX
> > message to the seccomp supervisor. This will include the following fds:
> > 
> > - the pidfd of the task that performed the system call (we should
> >   actually replace this with SO_PEERPIDFD now that we have that)
> > - the fd of the task's memory to /proc/<pid>/mem
> > 
> > The seccomp supervisor will then perform the system call interception
> > including the required memory reads and writes.
> 
> Okay, so the patch would very much break that. Some questions, though:
> - why not use process_vm_writev()?

Because it's inherently racy as I've explained in an earlier mail in
this thread. Opening /proc/<pid>/mem we can guard via:

// Assume we hold @pidfd for supervised process

int fd_mem = open("/proc/$pid/mem", O_RDWR);:

if (pidfd_send_signal(pidfd, 0, ...) == 0)
        write(fd_mem, ...);

But we can't exactly do:

process_vm_writev(pid, WRITE_TO_MEMORY, ...);
if (pidfd_send_signal(pidfd, 0, ...) == 0)
        write(fd_mem, ...);

That's always racy. The process might have been reaped before we even
call pidfd_send_signal() and we're writing to some random process
memory.

If we wanted to support this we'd need to implement a proposal I had a
while ago:

#define PROCESS_VM_RW_PIDFD (1 << 0)

process_vm_readv(pidfd,  ..., PROCESS_VM_RW_PIDFD);
process_vm_writev(pidfd, ..., PROCESS_VM_RW_PIDFD);

which is similar to what we did for waitid(pidfd, P_PIDFD, ...)

That would make it possible to use a pidfd instead of a pid in the two
system calls. Then we can get rid of the raciness and actually use those
system calls. As they are now, we can't.

> - does the supervisor depend on FOLL_FORCE?

Since the write handler for /proc/<pid>/mem does raise FOLL_FORCE
unconditionally it likely would implicitly. But I'm not familiar enough
with FOLL_FORCE to say for sure.

> Perhaps is is sufficient to block the use of FOLL_FORCE?
> 
> I took a look at the Chrome OS exploit, and I _think_ it is depending
> on the FOLL_FORCE behavior (it searches for a symbol to overwrite that
> if I'm following correctly is in a read-only region), but some of the
> binaries don't include source code, so I couldn't easily see what was
> being injected. Mike or Adrian can you confirm this?

Kees Cook March 5, 2024, 10:12 a.m. UTC | #14

On Tue, Mar 05, 2024 at 10:58:25AM +0100, Christian Brauner wrote:
> Since the write handler for /proc/<pid>/mem does raise FOLL_FORCE
> unconditionally it likely would implicitly. But I'm not familiar enough
> with FOLL_FORCE to say for sure.

I should phrase the question better. :) Is the supervisor writing into
read-only regions of the child process?

Christian Brauner March 5, 2024, 10:32 a.m. UTC | #15

On Tue, Mar 05, 2024 at 02:12:26AM -0800, Kees Cook wrote:
> On Tue, Mar 05, 2024 at 10:58:25AM +0100, Christian Brauner wrote:
> > Since the write handler for /proc/<pid>/mem does raise FOLL_FORCE
> > unconditionally it likely would implicitly. But I'm not familiar enough
> > with FOLL_FORCE to say for sure.
> 
> I should phrase the question better. :) Is the supervisor writing into
> read-only regions of the child process?

Hm... I suspect we don't. Let's take two concrete examples so you can
tell me.

Incus intercepts the sysinfo() syscall. It prepares a struct sysinfo
with cgroup aware values for the supervised process and then does:

unix.Pwrite(siov.memFd, &sysinfo, sizeof(struct sysinfo), seccomp_data.args[0]))

It also intercepts some bpf system calls attaching bpf programs for the
caller. If that fails we update the log buffer for the supervised
process:

union bpf_attr attr = {}, new_attr = {};

// read struct bpf_attr from mem_fd
ret = pread(mem_fd, &attr, attr_len, req->data.args[1]);
if (ret < 0)
        return -errno;

// Do stuff with attr. Stuff fails. Update log buffer for supervised process:
if ((new_attr.log_size) > 0 && (pwrite(mem_fd, new_attr.log_buf, new_attr.log_size, attr.log_buf) != new_attr.log_size))

But I'm not sure if there are other use-cases that would require this.

Christian Brauner March 5, 2024, 11:03 a.m. UTC | #16

On Tue, Mar 05, 2024 at 10:58:31AM +0100, Christian Brauner wrote:
> On Tue, Mar 05, 2024 at 01:41:29AM -0800, Kees Cook wrote:
> > On Tue, Mar 05, 2024 at 09:59:47AM +0100, Christian Brauner wrote:
> > > > > Uhm, this will break the seccomp notifier, no? So you can't turn on
> > > > > SECURITY_PROC_MEM_RESTRICT_WRITE when you want to use the seccomp
> > > > > notifier to do system call interception and rewrite memory locations of
> > > > > the calling task, no? Which is very much relied upon in various
> > > > > container managers and possibly other security tools.
> > > > > 
> > > > > Which means that you can't turn this on in any of the regular distros.
> > > > 
> > > > FWIW, it's a run-time toggle, but yes, let's make sure this works
> > > > correctly.
> > > > 
> > > > > So you need to either account for the calling task being a seccomp
> > > > > supervisor for the task whose memory it is trying to access or you need
> > > > > to provide a migration path by adding an api that let's caller's perform
> > > > > these writes through the seccomp notifier.
> > > > 
> > > > How do seccomp supervisors that use USER_NOTIF do those kinds of
> > > > memory writes currently? I thought they were actually using ptrace?
> > > > Everything I'm familiar with is just using SECCOMP_IOCTL_NOTIF_ADDFD,
> > > > and not doing fancy memory pokes.
> > > 
> > > For example, incus has a seccomp supervisor such that each container
> > > gets it's own goroutine that is responsible for handling system call
> > > interception.
> > > 
> > > If a container is started the container runtime connects to an AF_UNIX
> > > socket to register with the seccomp supervisor. It stays connected until
> > > it stops. Everytime a system call is performed that is registered in the
> > > seccomp notifier filter the container runtime will send a AF_UNIX
> > > message to the seccomp supervisor. This will include the following fds:
> > > 
> > > - the pidfd of the task that performed the system call (we should
> > >   actually replace this with SO_PEERPIDFD now that we have that)
> > > - the fd of the task's memory to /proc/<pid>/mem
> > > 
> > > The seccomp supervisor will then perform the system call interception
> > > including the required memory reads and writes.
> > 
> > Okay, so the patch would very much break that. Some questions, though:
> > - why not use process_vm_writev()?
> 
> Because it's inherently racy as I've explained in an earlier mail in
> this thread. Opening /proc/<pid>/mem we can guard via:
> 
> // Assume we hold @pidfd for supervised process
> 
> int fd_mem = open("/proc/$pid/mem", O_RDWR);:
> 
> if (pidfd_send_signal(pidfd, 0, ...) == 0)
>         write(fd_mem, ...);
> 
> But we can't exactly do:
> 
> process_vm_writev(pid, WRITE_TO_MEMORY, ...);
> if (pidfd_send_signal(pidfd, 0, ...) == 0)
>         write(fd_mem, ...);
> 
> That's always racy. The process might have been reaped before we even
> call pidfd_send_signal() and we're writing to some random process
> memory.
> 
> If we wanted to support this we'd need to implement a proposal I had a
> while ago:
> 
> #define PROCESS_VM_RW_PIDFD (1 << 0)
> 
> process_vm_readv(pidfd,  ..., PROCESS_VM_RW_PIDFD);
> process_vm_writev(pidfd, ..., PROCESS_VM_RW_PIDFD);
> 
> which is similar to what we did for waitid(pidfd, P_PIDFD, ...)
> 
> That would make it possible to use a pidfd instead of a pid in the two
> system calls. Then we can get rid of the raciness and actually use those
> system calls. As they are now, we can't.

What btw, is the Linux sandbox on Chromium doing? Did they finally move
away from SECCOMP_RET_TRAP to SECCOMP_RET_USER_NOTIF? I see:

https://issues.chromium.org/issues/40145101

What ever became of this?

Adrian Ratiu March 5, 2024, 3:38 p.m. UTC | #17

On Tuesday, March 05, 2024 11:41 EET, Kees Cook <keescook@chromium.org> wrote:

> On Tue, Mar 05, 2024 at 09:59:47AM +0100, Christian Brauner wrote:
> > > > Uhm, this will break the seccomp notifier, no? So you can't turn on
> > > > SECURITY_PROC_MEM_RESTRICT_WRITE when you want to use the seccomp
> > > > notifier to do system call interception and rewrite memory locations of
> > > > the calling task, no? Which is very much relied upon in various
> > > > container managers and possibly other security tools.
> > > > 
> > > > Which means that you can't turn this on in any of the regular distros.
> > > 
> > > FWIW, it's a run-time toggle, but yes, let's make sure this works
> > > correctly.
> > > 
> > > > So you need to either account for the calling task being a seccomp
> > > > supervisor for the task whose memory it is trying to access or you need
> > > > to provide a migration path by adding an api that let's caller's perform
> > > > these writes through the seccomp notifier.
> > > 
> > > How do seccomp supervisors that use USER_NOTIF do those kinds of
> > > memory writes currently? I thought they were actually using ptrace?
> > > Everything I'm familiar with is just using SECCOMP_IOCTL_NOTIF_ADDFD,
> > > and not doing fancy memory pokes.
> > 
> > For example, incus has a seccomp supervisor such that each container
> > gets it's own goroutine that is responsible for handling system call
> > interception.
> > 
> > If a container is started the container runtime connects to an AF_UNIX
> > socket to register with the seccomp supervisor. It stays connected until
> > it stops. Everytime a system call is performed that is registered in the
> > seccomp notifier filter the container runtime will send a AF_UNIX
> > message to the seccomp supervisor. This will include the following fds:
> > 
> > - the pidfd of the task that performed the system call (we should
> >   actually replace this with SO_PEERPIDFD now that we have that)
> > - the fd of the task's memory to /proc/<pid>/mem
> > 
> > The seccomp supervisor will then perform the system call interception
> > including the required memory reads and writes.
> 
> Okay, so the patch would very much break that. Some questions, though:
> - why not use process_vm_writev()?
> - does the supervisor depend on FOLL_FORCE?
> 
> Perhaps is is sufficient to block the use of FOLL_FORCE?
> 
> I took a look at the Chrome OS exploit, and I _think_ it is depending
> on the FOLL_FORCE behavior (it searches for a symbol to overwrite that
> if I'm following correctly is in a read-only region), but some of the
> binaries don't include source code, so I couldn't easily see what was
> being injected. Mike or Adrian can you confirm this?

I can't speak for what is acceptable for ChromeOS security because 
I'm not part of that project, so I'll let Mike answer whether blocking
writes is mandatory for them or blocking FOLL_FORCE is enough.

From a design perspective, the question is whether to
1. block writes and allow known good exceptions 
or
2. allow writes and block known bad/exploitable exceptions. 
 
I am looking into reproducing and adding an exception for the
container syscall intercept use-case raised by Christian, because
I think it's easier to justify allowing known good exceptions from
a security perspective.

Otherwise I'm fine with both approaches.

@Mike WDYT ?

Kees Cook March 5, 2024, 6:33 p.m. UTC | #18

On Tue, Mar 05, 2024 at 12:03:21PM +0100, Christian Brauner wrote:
> What btw, is the Linux sandbox on Chromium doing? Did they finally move
> away from SECCOMP_RET_TRAP to SECCOMP_RET_USER_NOTIF? I see:
> 
> https://issues.chromium.org/issues/40145101
> 
> What ever became of this?

I'm not entirely sure. I don't think it's been a priority lately
(obviously).

Kees Cook March 5, 2024, 6:37 p.m. UTC | #19

On Tue, Mar 05, 2024 at 11:32:04AM +0100, Christian Brauner wrote:
> On Tue, Mar 05, 2024 at 02:12:26AM -0800, Kees Cook wrote:
> > On Tue, Mar 05, 2024 at 10:58:25AM +0100, Christian Brauner wrote:
> > > Since the write handler for /proc/<pid>/mem does raise FOLL_FORCE
> > > unconditionally it likely would implicitly. But I'm not familiar enough
> > > with FOLL_FORCE to say for sure.
> > 
> > I should phrase the question better. :) Is the supervisor writing into
> > read-only regions of the child process?
> 
> Hm... I suspect we don't. Let's take two concrete examples so you can
> tell me.
> 
> Incus intercepts the sysinfo() syscall. It prepares a struct sysinfo
> with cgroup aware values for the supervised process and then does:
> 
> unix.Pwrite(siov.memFd, &sysinfo, sizeof(struct sysinfo), seccomp_data.args[0]))
> 
> It also intercepts some bpf system calls attaching bpf programs for the
> caller. If that fails we update the log buffer for the supervised
> process:
> 
> union bpf_attr attr = {}, new_attr = {};
> 
> // read struct bpf_attr from mem_fd
> ret = pread(mem_fd, &attr, attr_len, req->data.args[1]);
> if (ret < 0)
>         return -errno;
> 
> // Do stuff with attr. Stuff fails. Update log buffer for supervised process:
> if ((new_attr.log_size) > 0 && (pwrite(mem_fd, new_attr.log_buf, new_attr.log_size, attr.log_buf) != new_attr.log_size))

This is almost certainly in writable memory (either stack or .data).

> But I'm not sure if there are other use-cases that would require this.

Maybe this option needs to be per-process (like no_new_privs), and with
a few access levels:

- as things are now
- no FOLL_FORCE unless by ptracer
- no writes unless by ptracer
- no FOLL_FORCE ever
- no writes ever
- no reads unless by ptracer
- no reads ever

Which feels more like 3 toggles: read, write, FOLL_FORCE. Each set to
"DAC", "ptracer", and "none"?

Adrian Ratiu March 5, 2024, 7:34 p.m. UTC | #20

On Tuesday, March 05, 2024 20:37 EET, Kees Cook <keescook@chromium.org> wrote:

> On Tue, Mar 05, 2024 at 11:32:04AM +0100, Christian Brauner wrote:
> > On Tue, Mar 05, 2024 at 02:12:26AM -0800, Kees Cook wrote:
> > > On Tue, Mar 05, 2024 at 10:58:25AM +0100, Christian Brauner wrote:
> > > > Since the write handler for /proc/<pid>/mem does raise FOLL_FORCE
> > > > unconditionally it likely would implicitly. But I'm not familiar enough
> > > > with FOLL_FORCE to say for sure.
> > > 
> > > I should phrase the question better. :) Is the supervisor writing into
> > > read-only regions of the child process?
> > 
> > Hm... I suspect we don't. Let's take two concrete examples so you can
> > tell me.
> > 
> > Incus intercepts the sysinfo() syscall. It prepares a struct sysinfo
> > with cgroup aware values for the supervised process and then does:
> > 
> > unix.Pwrite(siov.memFd, &sysinfo, sizeof(struct sysinfo), seccomp_data.args[0]))
> > 
> > It also intercepts some bpf system calls attaching bpf programs for the
> > caller. If that fails we update the log buffer for the supervised
> > process:
> > 
> > union bpf_attr attr = {}, new_attr = {};
> > 
> > // read struct bpf_attr from mem_fd
> > ret = pread(mem_fd, &attr, attr_len, req->data.args[1]);
> > if (ret < 0)
> >         return -errno;
> > 
> > // Do stuff with attr. Stuff fails. Update log buffer for supervised process:
> > if ((new_attr.log_size) > 0 && (pwrite(mem_fd, new_attr.log_buf, new_attr.log_size, attr.log_buf) != new_attr.log_size))
> 
> This is almost certainly in writable memory (either stack or .data).

Mostly yes, but we can't be certain where it comes from, because
SECCOMP_IOCTL_NOTIF_RECV passes any addresses set by the
caller to the supervisor process.

It is a kind of "implementation defined" behavior, just like we
can't predict what the supervisor will do with the caller mem :)

> 
> > But I'm not sure if there are other use-cases that would require this.
> 
> Maybe this option needs to be per-process (like no_new_privs), and with
> a few access levels:
> 
> - as things are now
> - no FOLL_FORCE unless by ptracer
> - no writes unless by ptracer
> - no FOLL_FORCE ever
> - no writes ever
> - no reads unless by ptracer
> - no reads ever
> 
> Which feels more like 3 toggles: read, write, FOLL_FORCE. Each set to
> "DAC", "ptracer", and "none"?

I really like this approach because it provides a  mechanism
with maximum flexibility without imposing a specific policy.

What does DAC mean in this context? My mind jumps to
Digital to Analog Converter :)

Shall I give it a try in v3?

> 
> -- 
> Kees Cook

Kees Cook March 5, 2024, 7:38 p.m. UTC | #21

On Tue, Mar 05, 2024 at 07:34:34PM +0000, Adrian Ratiu wrote:
> On Tuesday, March 05, 2024 20:37 EET, Kees Cook <keescook@chromium.org> wrote:
> 
> > On Tue, Mar 05, 2024 at 11:32:04AM +0100, Christian Brauner wrote:
> > > On Tue, Mar 05, 2024 at 02:12:26AM -0800, Kees Cook wrote:
> > > > On Tue, Mar 05, 2024 at 10:58:25AM +0100, Christian Brauner wrote:
> > > > > Since the write handler for /proc/<pid>/mem does raise FOLL_FORCE
> > > > > unconditionally it likely would implicitly. But I'm not familiar enough
> > > > > with FOLL_FORCE to say for sure.
> > > > 
> > > > I should phrase the question better. :) Is the supervisor writing into
> > > > read-only regions of the child process?
> > > 
> > > Hm... I suspect we don't. Let's take two concrete examples so you can
> > > tell me.
> > > 
> > > Incus intercepts the sysinfo() syscall. It prepares a struct sysinfo
> > > with cgroup aware values for the supervised process and then does:
> > > 
> > > unix.Pwrite(siov.memFd, &sysinfo, sizeof(struct sysinfo), seccomp_data.args[0]))
> > > 
> > > It also intercepts some bpf system calls attaching bpf programs for the
> > > caller. If that fails we update the log buffer for the supervised
> > > process:
> > > 
> > > union bpf_attr attr = {}, new_attr = {};
> > > 
> > > // read struct bpf_attr from mem_fd
> > > ret = pread(mem_fd, &attr, attr_len, req->data.args[1]);
> > > if (ret < 0)
> > >         return -errno;
> > > 
> > > // Do stuff with attr. Stuff fails. Update log buffer for supervised process:
> > > if ((new_attr.log_size) > 0 && (pwrite(mem_fd, new_attr.log_buf, new_attr.log_size, attr.log_buf) != new_attr.log_size))
> > 
> > This is almost certainly in writable memory (either stack or .data).
> 
> Mostly yes, but we can't be certain where it comes from, because
> SECCOMP_IOCTL_NOTIF_RECV passes any addresses set by the
> caller to the supervisor process.
> 
> It is a kind of "implementation defined" behavior, just like we
> can't predict what the supervisor will do with the caller mem :)
> 
> > 
> > > But I'm not sure if there are other use-cases that would require this.
> > 
> > Maybe this option needs to be per-process (like no_new_privs), and with
> > a few access levels:
> > 
> > - as things are now
> > - no FOLL_FORCE unless by ptracer
> > - no writes unless by ptracer
> > - no FOLL_FORCE ever
> > - no writes ever
> > - no reads unless by ptracer
> > - no reads ever
> > 
> > Which feels more like 3 toggles: read, write, FOLL_FORCE. Each set to
> > "DAC", "ptracer", and "none"?
> 
> I really like this approach because it provides a  mechanism
> with maximum flexibility without imposing a specific policy.
> 
> What does DAC mean in this context? My mind jumps to
> Digital to Analog Converter :)

Ah yes, sorry, this is Discretionary Access Control (which is my
short-hand for saying "basic file permissions"). But I guess that's kind
of not really true since the open() access checks are doing a
"ptrace-able" check in addition to the file perms check.

> Shall I give it a try in v3?

Yeah, though maybe see if Mike or Jann chime in over the next few days?

-Kees

Christian Brauner March 6, 2024, 10:31 a.m. UTC | #22

On Tue, Mar 05, 2024 at 10:37:20AM -0800, Kees Cook wrote:
> On Tue, Mar 05, 2024 at 11:32:04AM +0100, Christian Brauner wrote:
> > On Tue, Mar 05, 2024 at 02:12:26AM -0800, Kees Cook wrote:
> > > On Tue, Mar 05, 2024 at 10:58:25AM +0100, Christian Brauner wrote:
> > > > Since the write handler for /proc/<pid>/mem does raise FOLL_FORCE
> > > > unconditionally it likely would implicitly. But I'm not familiar enough
> > > > with FOLL_FORCE to say for sure.
> > > 
> > > I should phrase the question better. :) Is the supervisor writing into
> > > read-only regions of the child process?
> > 
> > Hm... I suspect we don't. Let's take two concrete examples so you can
> > tell me.
> > 
> > Incus intercepts the sysinfo() syscall. It prepares a struct sysinfo
> > with cgroup aware values for the supervised process and then does:
> > 
> > unix.Pwrite(siov.memFd, &sysinfo, sizeof(struct sysinfo), seccomp_data.args[0]))
> > 
> > It also intercepts some bpf system calls attaching bpf programs for the
> > caller. If that fails we update the log buffer for the supervised
> > process:
> > 
> > union bpf_attr attr = {}, new_attr = {};
> > 
> > // read struct bpf_attr from mem_fd
> > ret = pread(mem_fd, &attr, attr_len, req->data.args[1]);
> > if (ret < 0)
> >         return -errno;
> > 
> > // Do stuff with attr. Stuff fails. Update log buffer for supervised process:
> > if ((new_attr.log_size) > 0 && (pwrite(mem_fd, new_attr.log_buf, new_attr.log_size, attr.log_buf) != new_attr.log_size))
> 
> This is almost certainly in writable memory (either stack or .data).
> 
> > But I'm not sure if there are other use-cases that would require this.
> 
> Maybe this option needs to be per-process (like no_new_privs), and with
> a few access levels:
> 
> - as things are now
> - no FOLL_FORCE unless by ptracer
> - no writes unless by ptracer
> - no FOLL_FORCE ever
> - no writes ever
> - no reads unless by ptracer
> - no reads ever

Doing it as a prctl() would be fine.

Matt Denton March 6, 2024, 10:49 a.m. UTC | #23

The SECCOMP_RET_USER_NOTIF sandbox is partially implemented but the
reason we needed it (glibc blocking signals during certain syscalls we
wanted to emulate) got reverted and we haven't had any important
issues with the SECCOMP_RET_TRAP sandbox since then. /proc/pid/mem was
always restricted on ChromeOS so the plan was to use
process_vm_readv() and process_vm_writev() in the unsandboxed broker
process. We knew about the pid race of course, but this would be far
from the only place that Chrome would be potentially vulnerable to the
race so it didn't seem any worse.

We did need to use process_vm_writev() for some syscalls, like
emulating stat() required us to write into the supervised process.

On Tue, Mar 5, 2024 at 3:03 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Mar 05, 2024 at 10:58:31AM +0100, Christian Brauner wrote:
> > On Tue, Mar 05, 2024 at 01:41:29AM -0800, Kees Cook wrote:
> > > On Tue, Mar 05, 2024 at 09:59:47AM +0100, Christian Brauner wrote:
> > > > > > Uhm, this will break the seccomp notifier, no? So you can't turn on
> > > > > > SECURITY_PROC_MEM_RESTRICT_WRITE when you want to use the seccomp
> > > > > > notifier to do system call interception and rewrite memory locations of
> > > > > > the calling task, no? Which is very much relied upon in various
> > > > > > container managers and possibly other security tools.
> > > > > >
> > > > > > Which means that you can't turn this on in any of the regular distros.
> > > > >
> > > > > FWIW, it's a run-time toggle, but yes, let's make sure this works
> > > > > correctly.
> > > > >
> > > > > > So you need to either account for the calling task being a seccomp
> > > > > > supervisor for the task whose memory it is trying to access or you need
> > > > > > to provide a migration path by adding an api that let's caller's perform
> > > > > > these writes through the seccomp notifier.
> > > > >
> > > > > How do seccomp supervisors that use USER_NOTIF do those kinds of
> > > > > memory writes currently? I thought they were actually using ptrace?
> > > > > Everything I'm familiar with is just using SECCOMP_IOCTL_NOTIF_ADDFD,
> > > > > and not doing fancy memory pokes.
> > > >
> > > > For example, incus has a seccomp supervisor such that each container
> > > > gets it's own goroutine that is responsible for handling system call
> > > > interception.
> > > >
> > > > If a container is started the container runtime connects to an AF_UNIX
> > > > socket to register with the seccomp supervisor. It stays connected until
> > > > it stops. Everytime a system call is performed that is registered in the
> > > > seccomp notifier filter the container runtime will send a AF_UNIX
> > > > message to the seccomp supervisor. This will include the following fds:
> > > >
> > > > - the pidfd of the task that performed the system call (we should
> > > >   actually replace this with SO_PEERPIDFD now that we have that)
> > > > - the fd of the task's memory to /proc/<pid>/mem
> > > >
> > > > The seccomp supervisor will then perform the system call interception
> > > > including the required memory reads and writes.
> > >
> > > Okay, so the patch would very much break that. Some questions, though:
> > > - why not use process_vm_writev()?
> >
> > Because it's inherently racy as I've explained in an earlier mail in
> > this thread. Opening /proc/<pid>/mem we can guard via:
> >
> > // Assume we hold @pidfd for supervised process
> >
> > int fd_mem = open("/proc/$pid/mem", O_RDWR);:
> >
> > if (pidfd_send_signal(pidfd, 0, ...) == 0)
> >         write(fd_mem, ...);
> >
> > But we can't exactly do:
> >
> > process_vm_writev(pid, WRITE_TO_MEMORY, ...);
> > if (pidfd_send_signal(pidfd, 0, ...) == 0)
> >         write(fd_mem, ...);
> >
> > That's always racy. The process might have been reaped before we even
> > call pidfd_send_signal() and we're writing to some random process
> > memory.
> >
> > If we wanted to support this we'd need to implement a proposal I had a
> > while ago:
> >
> > #define PROCESS_VM_RW_PIDFD (1 << 0)
> >
> > process_vm_readv(pidfd,  ..., PROCESS_VM_RW_PIDFD);
> > process_vm_writev(pidfd, ..., PROCESS_VM_RW_PIDFD);
> >
> > which is similar to what we did for waitid(pidfd, P_PIDFD, ...)
> >
> > That would make it possible to use a pidfd instead of a pid in the two
> > system calls. Then we can get rid of the raciness and actually use those
> > system calls. As they are now, we can't.
>
> What btw, is the Linux sandbox on Chromium doing? Did they finally move
> away from SECCOMP_RET_TRAP to SECCOMP_RET_USER_NOTIF? I see:
>
> https://issues.chromium.org/issues/40145101
>
> What ever became of this?

[v2] proc: allow restricting /proc/pid/mem writes

Commit Message

Comments

Patch