[v2,1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

Message ID 20200603162328.854164-2-areber@redhat.com
State New
Series
  • capabilities: Introduce CAP_CHECKPOINT_RESTORE

Commit Message

Adrian Reber June 3, 2020, 4:23 p.m. UTC
This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
checkpoint/restore for non-root users.

Over the last years, the CRIU (Checkpoint/Restore In Userspace) team has been
asked numerous times if it is possible to checkpoint/restore a process as
non-root. The answer usually was: 'almost'.

The main blocker to restoring a process as non-root was controlling the PID of
the restored process. This feature, available via the clone3 system call or via
/proc/sys/kernel/ns_last_pid, is unfortunately guarded by CAP_SYS_ADMIN.
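
For readers unfamiliar with the ns_last_pid interface, here is a minimal
userspace sketch of how a restorer might request a specific PID. It assumes the
kernel hands out ns_last_pid + 1 to the next task created in the PID namespace.
The helper names format_ns_last_pid() and set_ns_last_pid() are hypothetical and
are not taken from CRIU or from this patch. The open or write is expected to
fail for a caller without the required capability, and a task forked by someone
else between the write and the caller's own fork() can still take the PID.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: format the value to write to ns_last_pid so that
 * the next task created in this PID namespace is expected to get `desired`.
 * Returns the string length, or a negative errno value.
 */
int format_ns_last_pid(pid_t desired, char *buf, size_t len)
{
	int n;

	assert(buf != NULL);
	if (desired < 2)	/* PID 1 is init, nothing sensible precedes it */
		return -EINVAL;
	n = snprintf(buf, len, "%d", (int)desired - 1);
	if (n < 0 || (size_t)n >= len)
		return -ENOSPC;
	return n;
}

/*
 * Hypothetical helper: ask the kernel to use `desired` for the next task.
 * Returns 0 on success or a negative errno value. The caller still has
 * to fork() afterwards and verify the PID it actually got.
 */
int set_ns_last_pid(pid_t desired)
{
	char buf[32];
	ssize_t written;
	int fd, n, ret = 0;

	n = format_ns_last_pid(desired, buf, sizeof(buf));
	if (n < 0)
		return n;
	fd = open("/proc/sys/kernel/ns_last_pid", O_WRONLY);
	if (fd < 0)
		return -errno;
	written = write(fd, buf, (size_t)n);
	if (written < 0)
		ret = -errno;
	else if (written != n)
		ret = -EIO;
	close(fd);
	return ret;
}
```

A caller would invoke set_ns_last_pid(pid), then fork() and compare the child's
PID against the requested one, retrying or giving up on a mismatch.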

In the past two years, requests for non-root checkpoint/restore have increased
due to the following use cases:
* Checkpoint/Restore in an HPC environment in combination with a resource
  manager distributing jobs where users are always running as non-root.
  There is a desire to provide a way to checkpoint and restore long running
  jobs.
* Container migration as non-root
* We have been in contact with JVM developers who are integrating
  CRIU into a Java VM to decrease the startup time. These checkpoint/restore
  applications are not meant to be running with CAP_SYS_ADMIN.

We have seen the following workarounds:
* Use a setuid wrapper around CRIU:
  See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
* Use a setuid helper that writes to ns_last_pid.
  Unfortunately, this helper delegation technique is impossible to use with
  clone3, and is thus prone to races.
  See https://github.com/twosigma/set_ns_last_pid
* Cycle through PIDs with fork() until the desired PID is reached:
  This has been demonstrated to work with cycling rates of 100,000 PIDs/s.
  See https://github.com/twosigma/set_ns_last_pid
* Patch out the CAP_SYS_ADMIN check from the kernel
* Run the desired application in a new user and PID namespace to provide
  a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
  typical container environments (e.g., Kubernetes) as /proc is
  typically protected with read-only layers (e.g., /proc/sys) for hardening
  purposes. Read-only layers prevent additional /proc mounts (due to proc's
  SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
  certain applications need access to /proc matching their PID namespace.

The introduced capability allows a process to:
* Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
  for the corresponding PID namespace via ns_last_pid/clone3.
* Open files in /proc/pid/map_files when the current user is
  CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
  files that are unreachable via the file system such as deleted files, or memfd
  files.

See the corresponding selftest for an example with clone3().

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
---
v2:
 - Renamed CAP_RESTORE to CAP_CHECKPOINT_RESTORE
 - Added a test
 - Added details about CRIU's use of map_files
 - Allow changing /proc/self/exe link with CAP_CHECKPOINT_RESTORE
---
 fs/proc/base.c                      | 8 ++++----
 include/linux/capability.h          | 6 ++++++
 include/uapi/linux/capability.h     | 9 ++++++++-
 kernel/pid.c                        | 2 +-
 kernel/pid_namespace.c              | 2 +-
 security/selinux/include/classmap.h | 5 +++--
 6 files changed, 23 insertions(+), 9 deletions(-)

Comments

Cyrill Gorcunov June 3, 2020, 5:01 p.m. UTC | #1
On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
...
>  
>  /*
> - * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
> - * symlinks may be used to bypass permissions on ancestor directories in the
> - * path to the file in question.
> + * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
> + * to concerns about how the symlinks may be used to bypass permissions on
> + * ancestor directories in the path to the file in question.
>   */
>  static const char *
>  proc_map_files_get_link(struct dentry *dentry,
>  			struct inode *inode,
>  		        struct delayed_call *done)
>  {
> -	if (!capable(CAP_SYS_ADMIN))
> +	if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
>  		return ERR_PTR(-EPERM);

You know, I'm still not sure we need this capable() check at all, since
we already have proc_fd_access_allowed() called. But anyway, can we please
make this if() condition more explicit:

	if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
		return ERR_PTR(-EPERM);

though I won't insist. And I'll reread the series a bit later once I have
some spare time.
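
The two spellings of the condition discussed above are logically equivalent
(De Morgan), so the choice is purely about readability. The following userspace
sketch checks that over all four input combinations. The boolean parameters are
stand-ins for the results of capable(CAP_SYS_ADMIN) and
capable(CAP_CHECKPOINT_RESTORE), and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Form used in the patch: deny unless at least one capability is held. */
bool denied_patch_form(bool has_sys_admin, bool has_cr)
{
	return !(has_sys_admin || has_cr);
}

/* Form suggested in the review: deny when both capabilities are missing. */
bool denied_review_form(bool has_sys_admin, bool has_cr)
{
	return !has_sys_admin && !has_cr;
}

/* Walk the whole truth table and assert that both forms agree. */
void assert_forms_agree(void)
{
	for (int a = 0; a <= 1; a++)
		for (int b = 0; b <= 1; b++)
			assert(denied_patch_form(a, b) ==
			       denied_review_form(a, b));
}
```

Short-circuit evaluation is also the same in both forms: the second operand is
evaluated only when the first capability is not held.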
Andrei Vagin June 9, 2020, 3:42 a.m. UTC | #2
On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
> This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> checkpoint/restore for non-root users.
> 
> Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> asked numerous times if it is possible to checkpoint/restore a process as
> non-root. The answer usually was: 'almost'.
> 
> The main blocker to restore a process as non-root was to control the PID of the
> restored process. This feature available via the clone3 system call, or via
> /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
> 
> In the past two years, requests for non-root checkpoint/restore have increased
> due to the following use cases:
> * Checkpoint/Restore in an HPC environment in combination with a resource
>   manager distributing jobs where users are always running as non-root.
>   There is a desire to provide a way to checkpoint and restore long running
>   jobs.
> * Container migration as non-root
> * We have been in contact with JVM developers who are integrating
>   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
>   applications are not meant to be running with CAP_SYS_ADMIN.
> 
...
> 
> The introduced capability allows to:
> * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
>   for the corresponding PID namespace via ns_last_pid/clone3.
> * Open files in /proc/pid/map_files when the current user is
>   CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
>   files that are unreachable via the file system such as deleted files, or memfd
>   files.

PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
CAP_SYS_ADMIN too.

Thanks,
Andrei
Christian Brauner June 9, 2020, 7:44 a.m. UTC | #3
On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
> On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
> > This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> > checkpoint/restore for non-root users.
> > 
> > Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> > asked numerous times if it is possible to checkpoint/restore a process as
> > non-root. The answer usually was: 'almost'.
> > 
> > The main blocker to restore a process as non-root was to control the PID of the
> > restored process. This feature available via the clone3 system call, or via
> > /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
> > 
> > In the past two years, requests for non-root checkpoint/restore have increased
> > due to the following use cases:
> > * Checkpoint/Restore in an HPC environment in combination with a resource
> >   manager distributing jobs where users are always running as non-root.
> >   There is a desire to provide a way to checkpoint and restore long running
> >   jobs.
> > * Container migration as non-root
> > * We have been in contact with JVM developers who are integrating
> >   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> >   applications are not meant to be running with CAP_SYS_ADMIN.
> > 
> ...
> > 
> > The introduced capability allows to:
> > * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
> >   for the corresponding PID namespace via ns_last_pid/clone3.
> > * Open files in /proc/pid/map_files when the current user is
> >   CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
> >   files that are unreachable via the file system such as deleted files, or memfd
> >   files.
> 
> PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
> CAP_SYS_ADMIN too.

This is currently capable(CAP_SYS_ADMIN) (init_ns capable). Why is it
safe to allow unprivileged users to suspend security policies? That
sounds like a bad idea.

	if (unlikely(data & PTRACE_O_SUSPEND_SECCOMP)) {
		if (!IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
		    !IS_ENABLED(CONFIG_SECCOMP))
			return -EINVAL;

		if (!capable(CAP_SYS_ADMIN))
			return -EPERM;

		if (seccomp_mode(&current->seccomp) != SECCOMP_MODE_DISABLED ||
		    current->ptrace & PT_SUSPEND_SECCOMP)
			return -EPERM;
	}

Christian
Andrei Vagin June 9, 2020, 4:06 p.m. UTC | #4
On Tue, Jun 09, 2020 at 09:44:22AM +0200, Christian Brauner wrote:
> On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
> > On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
> > > This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> > > checkpoint/restore for non-root users.
> > > 
> > > Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> > > asked numerous times if it is possible to checkpoint/restore a process as
> > > non-root. The answer usually was: 'almost'.
> > > 
> > > The main blocker to restore a process as non-root was to control the PID of the
> > > restored process. This feature available via the clone3 system call, or via
> > > /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
> > > 
> > > In the past two years, requests for non-root checkpoint/restore have increased
> > > due to the following use cases:
> > > * Checkpoint/Restore in an HPC environment in combination with a resource
> > >   manager distributing jobs where users are always running as non-root.
> > >   There is a desire to provide a way to checkpoint and restore long running
> > >   jobs.
> > > * Container migration as non-root
> > > * We have been in contact with JVM developers who are integrating
> > >   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> > >   applications are not meant to be running with CAP_SYS_ADMIN.
> > > 
> > ...
> > > 
> > > The introduced capability allows to:
> > > * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
> > >   for the corresponding PID namespace via ns_last_pid/clone3.
> > > * Open files in /proc/pid/map_files when the current user is
> > >   CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
> > >   files that are unreachable via the file system such as deleted files, or memfd
> > >   files.
> > 
> > PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
> > CAP_SYS_ADMIN too.
> 
> This is currently capable(CAP_SYS_ADMIN) (init_ns capable) why is it
> safe to allow unprivileged users to suspend security policies? That
> sounds like a bad idea.

Why do you think so badly of me? ;) I am not suggesting that we remove or
downgrade this capability check. The patch allows all C/R-related
operations if current has CAP_CHECKPOINT_RESTORE.

So in this case the check:
     if (!capable(CAP_SYS_ADMIN))
             return -EPERM;

will be converted to:
     if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
             return -EPERM;

If we want to think about how to convert this capable to ns_capable, we
need to do this in a separate series. And the logic may be that a
process is able to suspend only filters that have been added from the
current user-namespace or its descendants. But we need to think about
this more carefully, maybe there are more pitfalls.
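
To make the capable() versus ns_capable() distinction in this exchange concrete,
here is a simplified userspace model. Every type and function in it is a
hypothetical stand-in, not kernel code: it ignores the rule that the owner of a
user namespace holds all capabilities in it, and it models the initial user
namespace as the one without a parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumption). */
struct user_ns {
	struct user_ns *parent;	/* NULL for the initial user namespace */
};

struct cred {
	struct user_ns *ns;	/* namespace the capabilities are held in */
	unsigned long caps;	/* bitmask of held capabilities */
};

#define MODEL_CAP_SYS_ADMIN		(1UL << 0)
#define MODEL_CAP_CHECKPOINT_RESTORE	(1UL << 1)

/*
 * Model of ns_capable(): a capability held in `target` or in one of its
 * ancestors applies to `target`; one held in a descendant does not.
 */
bool model_ns_capable(const struct cred *c, const struct user_ns *target,
		      unsigned long cap)
{
	assert(c != NULL && target != NULL);
	for (const struct user_ns *ns = target; ns; ns = ns->parent)
		if (ns == c->ns)
			return (c->caps & cap) != 0;
	return false;
}

/* Model of capable(): always checks against the initial user namespace. */
bool model_capable(const struct cred *c, const struct user_ns *init_ns,
		   unsigned long cap)
{
	return model_ns_capable(c, init_ns, cap);
}

/* The converted check from the mail above, expressed in the model. */
bool may_suspend_seccomp(const struct cred *c, const struct user_ns *init_ns)
{
	return model_capable(c, init_ns, MODEL_CAP_SYS_ADMIN) ||
	       model_capable(c, init_ns, MODEL_CAP_CHECKPOINT_RESTORE);
}
```

In this model, a task holding the new capability in the initial namespace passes
the check, while a task that is fully capable only inside a child user namespace
does not. That is the property the init_ns capable() check preserves.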
Christian Brauner June 9, 2020, 4:14 p.m. UTC | #5
On Tue, Jun 09, 2020 at 09:06:27AM -0700, Andrei Vagin wrote:
> On Tue, Jun 09, 2020 at 09:44:22AM +0200, Christian Brauner wrote:
> > On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
> > > On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
> > > > This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> > > > checkpoint/restore for non-root users.
> > > > 
> > > > Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> > > > asked numerous times if it is possible to checkpoint/restore a process as
> > > > non-root. The answer usually was: 'almost'.
> > > > 
> > > > The main blocker to restore a process as non-root was to control the PID of the
> > > > restored process. This feature available via the clone3 system call, or via
> > > > /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
> > > > 
> > > > In the past two years, requests for non-root checkpoint/restore have increased
> > > > due to the following use cases:
> > > > * Checkpoint/Restore in an HPC environment in combination with a resource
> > > >   manager distributing jobs where users are always running as non-root.
> > > >   There is a desire to provide a way to checkpoint and restore long running
> > > >   jobs.
> > > > * Container migration as non-root
> > > > * We have been in contact with JVM developers who are integrating
> > > >   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> > > >   applications are not meant to be running with CAP_SYS_ADMIN.
> > > > 
> > > ...
> > > > 
> > > > The introduced capability allows to:
> > > > * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
> > > >   for the corresponding PID namespace via ns_last_pid/clone3.
> > > > * Open files in /proc/pid/map_files when the current user is
> > > >   CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
> > > >   files that are unreachable via the file system such as deleted files, or memfd
> > > >   files.
> > > 
> > > PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
> > > CAP_SYS_ADMIN too.
> > 
> > This is currently capable(CAP_SYS_ADMIN) (init_ns capable) why is it
> > safe to allow unprivileged users to suspend security policies? That
> > sounds like a bad idea.
> 
> Why do you think so bad about me;). I don't suggest to remove or

Andrei, nothing could be further from me than to think bad about you!
You've done way too much excellent work. ;)

> downgrade this capability check. The patch allows all c/r related
> operations if the current has CAP_CHECKPOINT_RESTORE.
> 
> So in this case the check:
>      if (!capable(CAP_SYS_ADMIN))
>              return -EPERM;
> 
> will be converted in:
>      if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
>              return -EPERM;

Yeah, I got that but what's the goal here? Isn't it that you want to
make it safe to install the criu binary with the CAP_CHECKPOINT_RESTORE
fscap set so that unprivileged users can restore their own processes
without creating a new user namespace or am I missing something? The
use-cases in the cover-letter make it sound like that's what this is
leading up to:

> > > > * Checkpoint/Restore in an HPC environment in combination with a resource
> > > >   manager distributing jobs where users are always running as non-root.
> > > >   There is a desire to provide a way to checkpoint and restore long running
> > > >   jobs.
> > > > * Container migration as non-root
> > > > * We have been in contact with JVM developers who are integrating
> > > >   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> > > >   applications are not meant to be running with CAP_SYS_ADMIN.

But maybe I'm just misunderstanding crucial bits (likely (TM)).

Christian
Cyrill Gorcunov June 9, 2020, 6:45 p.m. UTC | #6
On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
> This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> checkpoint/restore for non-root users.
> 
> Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> asked numerous times if it is possible to checkpoint/restore a process as
> non-root. The answer usually was: 'almost'.
> 
> The main blocker to restore a process as non-root was to control the PID of the
> restored process. This feature available via the clone3 system call, or via
> /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
...
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d86c0afc8a85..ce02f3a4b2d7 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2189,16 +2189,16 @@ struct map_files_info {
>  };
>  
>  /*
> - * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
> - * symlinks may be used to bypass permissions on ancestor directories in the
> - * path to the file in question.
> + * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
> + * to concerns about how the symlinks may be used to bypass permissions on
> + * ancestor directories in the path to the file in question.
>   */
>  static const char *
>  proc_map_files_get_link(struct dentry *dentry,
>  			struct inode *inode,
>  		        struct delayed_call *done)
>  {
> -	if (!capable(CAP_SYS_ADMIN))
> +	if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
>  		return ERR_PTR(-EPERM);

First of all, sorry for the late reply. You know, looking into this code more,
I think this CAP_SYS_ADMIN check is simply wrong: for example, I can't even fetch
links for /proc/self/map_files. Still, /proc/$pid/maps (which also points
to the opened files) only tests for ptrace-read permission. I think we need a
ptrace-may-attach test here instead of these capabilities (if I can attach
to a process I can read any data needed, including the content of the
mapped files, unless I'm missing something obvious).
Nicolas Viennot June 9, 2020, 8:09 p.m. UTC | #7
>>  proc_map_files_get_link(struct dentry *dentry,
>>  			struct inode *inode,
>>  		        struct delayed_call *done)
>>  {
>> -	if (!capable(CAP_SYS_ADMIN))
>> +	if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
>>  		return ERR_PTR(-EPERM);

> First of all -- sorry for late reply. You know, looking into this code more I think this CAP_SYS_ADMIN is simply wrong: for example I can't even fetch links for /proc/self/map_files. Still /proc/$pid/maps (which as well points to the files opened) test for ptrace-read permission. I think we need ptrace-may-attach test here instead of these capabilities (if I can attach to a process I can read any data needed, including the content of the mapped files, if only I'm not missing something obvious).

Currently /proc/pid/map_files/* have exactly the same permission checks as /proc/pid/fd/*, with the exception of the extra CAP_SYS_ADMIN check. The check originated from the following discussions where 3 security issues are discussed:
http://lkml.iu.edu/hypermail/linux/kernel/1505.2/02524.html
http://lkml.iu.edu/hypermail/linux/kernel/1505.2/04030.html

From what I understand, the extra CAP_SYS_ADMIN comes from the following issues:
1. Being able to open dma-buf / kdbus region (referred in the referenced email as problem #1). I don't fully understand what the dangers are, but perhaps we could do CAP_SYS_ADMIN check only for such dangerous files, as opposed to all files.
2. /proc/pid/fd/* is already a security hole (Andy says "I hope to fix that some day"). He essentially says that fds being insecure is no reason for map_files to be insecure too. He seems to claim that mapped files that are then closed are a bigger concern than other opened files. However, at present (5 years after these email conversations), the fd directory still does not have the CAP_SYS_ADMIN check, which doesn't convince me that the holes in /proc/pid/fd/* are such a big deal. I'm not entirely sure what security issue Andy refers to, but I understand it to be something along the lines of: some process gets an fd for a file opened read-only (via a unix socket for example, or after a chroot), and gets to re-open the file with write access via /proc/self/fd/N to do some damage.
3. Being able to ftruncate a file after a chroot+privilege drop. I may be wrong, but if privileges were dropped, then there's no reason that the then unprivileged user would have write access to the mmaped file inode. Seems a false problem.

It turns out that some of these concerns have been addressed by memfd with seals, introduced around the same time as the map_files discussions took place. These seals allow one to share write access to an mmap region with an insecure program without fear of getting a SIGBUS because the insecure program called ftruncate() on the fd. More on that at https://lwn.net/Articles/593918/ . Also, that article says "There are a number of fairly immediate use cases for the sealing functionality in general. Graphics drivers could use it to safely receive buffers from applications. The upcoming kdbus transport can benefit from sealing.". This rings a bell with problem #1. Perhaps memfd is a solution to Andy's concerns?
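
As an illustration of the sealing behaviour described above, here is a minimal
userspace sketch. It assumes a kernel and libc that provide memfd_create() and
the F_ADD_SEALS fcntl command. The function name shrink_sealed_memfd() and the
memfd name "sealed-example" are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a 4096-byte memfd, seal it against shrinking, then try to
 * truncate it to zero. Returns the errno of the rejected ftruncate()
 * (EPERM is the documented result), 0 if the shrink unexpectedly
 * succeeded, or -1 if the setup itself failed.
 */
int shrink_sealed_memfd(void)
{
	int fd, ret = -1;

	fd = memfd_create("sealed-example", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, 4096) < 0)
		goto out;
	/* From here on nobody holding this file can shrink it. */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
		goto out;
	/* The seal should be reported back by F_GET_SEALS. */
	assert(fcntl(fd, F_GET_SEALS) & F_SEAL_SHRINK);
	ret = ftruncate(fd, 0) < 0 ? errno : 0;
out:
	close(fd);
	return ret;
}
```

A receiver that checks for this seal before mapping the fd does not have to
worry about the sender truncating the file underneath the mapping.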

Overall, I think the extra CAP_SYS_ADMIN check on map_files/ compared to fd/ does not improve security in practice. Fds will be given to insecure programs. Better security can be achieved with memfd seals, and sane permissioning on files, regardless of whether they were once closed.

I think Adrian added a CAP_CHECKPOINT_RESTORE check on map_files to avoid opening a can of worms. But I guess the cat is out of the bag now.

-Nico
Eric W. Biederman June 9, 2020, 9:05 p.m. UTC | #8
Nicolas Viennot <Nicolas.Viennot@twosigma.com> writes:

>>>  proc_map_files_get_link(struct dentry *dentry,
>>>  			struct inode *inode,
>>>  		        struct delayed_call *done)
>>>  {
>>> -	if (!capable(CAP_SYS_ADMIN))
>>> +	if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
>>>  		return ERR_PTR(-EPERM);
>
>> First of all -- sorry for late reply. You know, looking into this
>> code more I think this CAP_SYS_ADMIN is simply wrong: for example I
>> can't even fetch links for /proc/self/map_files. Still
>> /proc/$pid/maps (which as well points to the files opened) test for
>> ptrace-read permission. I think we need ptrace-may-attach test here
>> instead of these capabilities (if I can attach to a process I can
>> read any data needed, including the content of the mapped files, if
>> only I'm not missing something obvious).
>
> Currently /proc/pid/map_files/* have exactly the same permission
> checks as /proc/pid/fd/*, with the exception of the extra
> CAP_SYS_ADMIN check. The check originated from the following
> discussions where 3 security issues are discussed:
> http://lkml.iu.edu/hypermail/linux/kernel/1505.2/02524.html
> http://lkml.iu.edu/hypermail/linux/kernel/1505.2/04030.html
>
> From what I understand, the extra CAP_SYS_ADMIN comes from the
> following issues:

> 1. Being able to open dma-buf / kdbus region (referred in the
> referenced email as problem #1). I don't fully understand what the
> dangers are, but perhaps we could do CAP_SYS_ADMIN check only for such
> dangerous files, as opposed to all files.

I don't know precisely what the concern is, but my memory is that some drivers do
interesting things when mmaped.  Possibly even changing the vm_file.

I think that is worth running to ground and figuring out in the
context of checkpoint/restart because the ordinary checkpoint/restart
code won't be able to deal with them either.

So I vote for figuring that case out and dealing with it.


> 2. /proc/pid/fd/* is already a security hole (Andy says "I hope to fix
> that some day"). He essentially says that it's not because fds are
> insecure that map_files should be too. He seems to claim that mapped
> files that are then closed seems to be a bigger concern than other
> opened files. However, in the present time (5 years after these email
> conversations), the fd directory does not have the CAP_SYS_ADMIN check
> which doesn't convinces me that the holes of /proc/pid/fd/* are such a
> big of a deal. I'm not entirely sure what security issue Andy refers
> to, but, I understand something along the lines of: Some process gets
> an fd of a file read-only opened (via a unix socket for example, or
> after a chroot), and gets to re-open the file in write access via
> /proc/self/fd/N to do some damage.

I would hope the other permission checks on such a file will prevent
some of that nonsense.  But definitely worth taking a hard look at.

> 3. Being able to ftruncate a file after a chroot+privilege drop. I may
> be wrong, but if privileges were dropped, then there's no reason that
> the then unprivileged user would have write access to the mmaped file
> inode. Seems a false problem.

Yes.

> It turns out that some of these concerns have been addressed with the
> introduction of memfd with seals, introduced around the same time
> where the map_files discussions took place. These seals allow one to
> share write access of an mmap region to an unsecure program, without
> fearing of getting a SIGBUS because the unsecure program could call
> ftruncate() on the fd. More on that at
> https://lwn.net/Articles/593918/ . Also, that article says "There are
> a number of fairly immediate use cases for the sealing functionality
> in general. Graphics drivers could use it to safely receive buffers
> from applications. The upcoming kdbus transport can benefit from
> sealing.". This rings a bell with problem #1. Perhaps memfd is a
> solution to Andy's concerns?

> Overall, I think the CAP_SYS_ADMIN map_files/ extra check compared to
> fd/ does not improve security in practice. Fds will be given to
> insecure programs. Better security can be achieved with memfd seals,
> and sane permissioning on files, regardless if they were once closed.

I would love to see the work put in to safely relax the permission check
from capable to ns_capable.  Which is just dealing with point 1.

There might be some other assumptions that a process can't get at mmaped
regions.

Eric
Cyrill Gorcunov June 9, 2020, 9:28 p.m. UTC | #9
On Tue, Jun 09, 2020 at 08:09:49PM +0000, Nicolas Viennot wrote:
> >>  proc_map_files_get_link(struct dentry *dentry,
> >>  			struct inode *inode,
> >>  		        struct delayed_call *done)
> >>  {
> >> -	if (!capable(CAP_SYS_ADMIN))
> >> +	if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
> >>  		return ERR_PTR(-EPERM);
> 
> > First of all -- sorry for late reply. You know, looking into this code more I think
> > this CAP_SYS_ADMIN is simply wrong: for example I can't even fetch links for /proc/self/map_files.
> > Still /proc/$pid/maps (which as well points to the files opened) test for ptrace-read permission.
> > I think we need ptrace-may-attach test here instead of these capabilities (if I can attach to
> > a process I can read any data needed, including the content of the mapped files, if only
> > I'm not missing something obvious).
> 

Nicolas, could you please wrap the text lines next time? I had to add newlines to the reply manually :)

> Currently /proc/pid/map_files/* have exactly the same permission checks as /proc/pid/fd/*, with the exception
> of the extra CAP_SYS_ADMIN check. The check originated from the following discussions where 3 security issues are discussed:
> http://lkml.iu.edu/hypermail/linux/kernel/1505.2/02524.html
> http://lkml.iu.edu/hypermail/linux/kernel/1505.2/04030.html
> 
> From what I understand, the extra CAP_SYS_ADMIN comes from the following issues:
> 1. Being able to open dma-buf / kdbus region (referred in the referenced email as problem #1).
> I don't fully understand what the dangers are, but perhaps we could do CAP_SYS_ADMIN check
> only for such dangerous files, as opposed to all files.

As far as I remember we only need to read the content of mmap'ed files, and if I have ptrace-attach
permission we can already inject our own code into a process and read anything we wish. That said, we probably
should fix up this interface along these lines: test the open mode, and if it is read-only then ptrace-attach
should be enough; if it is write mode, then we require being the node's admin, instead of just adding
a new capability here. And thanks a lot for the mail references, I'll take a look once time permits.
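
The policy suggested above can be sketched as a small decision function. This is
a userspace illustration only: struct caller and map_files_open_allowed() are
hypothetical, the two booleans stand in for whatever ptrace and capability checks
the kernel would actually perform, and whether write opens should additionally
require the ptrace check is left open, as it is in the mail.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs standing in for the kernel's own checks. */
struct caller {
	bool may_ptrace_attach;	/* result of a ptrace-attach permission test */
	bool is_admin;		/* holds the administrative capability */
};

/*
 * Sketch of the suggested policy: read-only opens only need
 * ptrace-attach permission, anything that can write needs admin.
 */
bool map_files_open_allowed(const struct caller *c, int open_flags)
{
	assert(c != NULL);
	if ((open_flags & O_ACCMODE) == O_RDONLY)
		return c->may_ptrace_attach;
	return c->is_admin;
}
```

Under this sketch a plain tracer could read deleted or memfd-backed mappings of
its tracee, but could not reopen them for writing.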
Andrei Vagin June 10, 2020, 7:59 a.m. UTC | #10
On Tue, Jun 09, 2020 at 06:14:27PM +0200, Christian Brauner wrote:
> On Tue, Jun 09, 2020 at 09:06:27AM -0700, Andrei Vagin wrote:
> > On Tue, Jun 09, 2020 at 09:44:22AM +0200, Christian Brauner wrote:
> > > On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
...
> > > > PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
> > > > CAP_SYS_ADMIN too.
> > > 
> > > This is currently capable(CAP_SYS_ADMIN) (init_ns capable) why is it
> > > safe to allow unprivileged users to suspend security policies? That
> > > sounds like a bad idea.
> > 
...
> > I don't suggest to remove or
> > downgrade this capability check. The patch allows all c/r related
> > operations if the current has CAP_CHECKPOINT_RESTORE.
> > 
> > So in this case the check:
> >      if (!capable(CAP_SYS_ADMIN))
> >              return -EPERM;
> > 
> > will be converted in:
> >      if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
> >              return -EPERM;
> 
> Yeah, I got that but what's the goal here? Isn't it that you want to
> make it safe to install the criu binary with the CAP_CHECKPOINT_RESTORE
> fscap set so that unprivileged users can restore their own processes
> without creating a new user namespace or am I missing something? The
> use-cases in the cover-letter make it sound like that's what this is
> leading up to:
> > > > > * Checkpoint/Restore in an HPC environment in combination with a resource
> > > > >   manager distributing jobs where users are always running as non-root.
> > > > >   There is a desire to provide a way to checkpoint and restore long running
> > > > >   jobs.
> > > > > * Container migration as non-root
> > > > > * We have been in contact with JVM developers who are integrating
> > > > >   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> > > > >   applications are not meant to be running with CAP_SYS_ADMIN.
> 
> But maybe I'm just misunderstanding crucial bits (likely (TM)).

I think you understand this right. The goal is to make it possible to
use C/R functionality for unprivileged processes. For me, there are
two separate tasks here. The first one is how to allow unprivileged users to
use C/R from the root user namespace. This is what we discuss here.

The other one is how to allow the use of C/R functionality from non-root
user namespaces. The second task is about downgrading capable to
ns_capable for map_files and PTRACE_O_SUSPEND_SECCOMP.

Thanks,
Andrei
Casey Schaufler June 10, 2020, 3:41 p.m. UTC | #11
On 6/10/2020 12:59 AM, Andrei Vagin wrote:
> On Tue, Jun 09, 2020 at 06:14:27PM +0200, Christian Brauner wrote:
>> On Tue, Jun 09, 2020 at 09:06:27AM -0700, Andrei Vagin wrote:
>>> On Tue, Jun 09, 2020 at 09:44:22AM +0200, Christian Brauner wrote:
>>>> On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
> ...
>>>>> PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
>>>>> CAP_SYS_ADMIN too.
>>>> This is currently capable(CAP_SYS_ADMIN) (init_ns capable) why is it
>>>> safe to allow unprivileged users to suspend security policies? That
>>>> sounds like a bad idea.
> ...
>>> I don't suggest to remove or
>>> downgrade this capability check. The patch allows all c/r related
>>> operations if the current has CAP_CHECKPOINT_RESTORE.
>>>
>>> So in this case the check:
>>>      if (!capable(CAP_SYS_ADMIN))
>>>              return -EPERM;
>>>
>>> will be converted in:
>>>      if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
>>>              return -EPERM;
>> Yeah, I got that but what's the goal here? Isn't it that you want to
>> make it safe to install the criu binary with the CAP_CHECKPOINT_RESTORE
>> fscap set so that unprivileged users can restore their own processes
>> without creating a new user namespace or am I missing something? The
>> use-cases in the cover-letter make it sound like that's what this is
>> leading up to:
>>>>>> * Checkpoint/Restore in an HPC environment in combination with a resource
>>>>>>   manager distributing jobs where users are always running as non-root.
>>>>>>   There is a desire to provide a way to checkpoint and restore long running
>>>>>>   jobs.
>>>>>> * Container migration as non-root
>>>>>> * We have been in contact with JVM developers who are integrating
>>>>>>   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
>>>>>>   applications are not meant to be running with CAP_SYS_ADMIN.
>> But maybe I'm just misunderstanding crucial bits (likely (TM)).
> I think you understand this right. The goal is to make it possible to
> use C/R functionality for unprivileged processes.

Y'all keep saying "unprivileged processes" when you mean
"processes with less than root privilege". A process with
CAP_CHECKPOINT_RESTORE *is* a privileged process. It would
have different privilege from a process with CAP_SYS_ADMIN
(the current case) but is not "unprivileged".

>  And for me, here are
> two separate tasks. The first one is how to allow unprivileged users to
> use C/R from the root user namespace. This is what we discuss here.
>
> And another one is how to allow to use C/R functionality from a non-root
> user namespaces. The second task is about downgrading capable to
> ns_capable for map_files and PTRACE_O_SUSPEND_SECCOMP.
>
> Thanks,
> Andrei
Christian Brauner June 10, 2020, 3:48 p.m. UTC | #12
On Wed, Jun 10, 2020 at 08:41:29AM -0700, Casey Schaufler wrote:
> 
> On 6/10/2020 12:59 AM, Andrei Vagin wrote:
> > On Tue, Jun 09, 2020 at 06:14:27PM +0200, Christian Brauner wrote:
> >> On Tue, Jun 09, 2020 at 09:06:27AM -0700, Andrei Vagin wrote:
> >>> On Tue, Jun 09, 2020 at 09:44:22AM +0200, Christian Brauner wrote:
> >>>> On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
> > ...
> >>>>> PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
> >>>>> CAP_SYS_ADMIN too.
> >>>> This is currently capable(CAP_SYS_ADMIN) (init_ns capable) why is it
> >>>> safe to allow unprivileged users to suspend security policies? That
> >>>> sounds like a bad idea.
> > ...
> >>> I don't suggest to remove or
> >>> downgrade this capability check. The patch allows all c/r related
> >>> operations if the current has CAP_CHECKPOINT_RESTORE.
> >>>
> >>> So in this case the check:
> >>>      if (!capable(CAP_SYS_ADMIN))
> >>>              return -EPERM;
> >>>
> >>> will be converted in:
> >>>      if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
> >>>              return -EPERM;
> >> Yeah, I got that but what's the goal here? Isn't it that you want to
> >> make it safe to install the criu binary with the CAP_CHECKPOINT_RESTORE
> >> fscap set so that unprivileged users can restore their own processes
> >> without creating a new user namespace or am I missing something? The
> >> use-cases in the cover-letter make it sound like that's what this is
> >> leading up to:
> >>>>>> * Checkpoint/Restore in an HPC environment in combination with a resource
> >>>>>>   manager distributing jobs where users are always running as non-root.
> >>>>>>   There is a desire to provide a way to checkpoint and restore long running
> >>>>>>   jobs.
> >>>>>> * Container migration as non-root
> >>>>>> * We have been in contact with JVM developers who are integrating
> >>>>>>   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> >>>>>>   applications are not meant to be running with CAP_SYS_ADMIN.
> >> But maybe I'm just misunderstanding crucial bits (likely (TM)).
> > I think you understand this right. The goal is to make it possible to
> > use C/R functionality for unprivileged processes.
> 
> Y'all keep saying "unprivileged processes" when you mean
> "processes with less than root privilege". A process with
> CAP_CHECKPOINT_RESTORE *is* a privileged process. It would

That was me being imprecise. What I mean is "unprivileged user",
not "unprivileged process". It makes me a little uneasy that an
unprivileged _user_ can call the criu binary with the
CAP_CHECKPOINT_RESTORE fscap set and suspend seccomp of a process (which
is what my original question here was about). Maybe this is paranoia, but
shouldn't suspending _security_ mechanisms be kept under either
CAP_SYS_ADMIN or CAP_MAC_ADMIN?

Christian
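
To make the fscap scenario a bit more concrete: a binary started with a
file capability could check at startup whether that capability actually
ended up in its effective set. A rough sketch using the raw capget(2)
system call follows. has_effective_cap() and NCAP_BITS are hypothetical
names, and the fallback value 40 is taken from this patch for headers that
do not define CAP_CHECKPOINT_RESTORE yet.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CAP_CHECKPOINT_RESTORE
#define CAP_CHECKPOINT_RESTORE 40 /* assumption: value proposed by this patch */
#endif

/* Number of capability bits covered by the v3 capget() layout. */
#define NCAP_BITS (32 * _LINUX_CAPABILITY_U32S_3)

static_assert(CAP_CHECKPOINT_RESTORE < NCAP_BITS,
	      "capability must fit the v3 capget layout");

/*
 * Returns 1 if 'cap' is in the calling process's effective set,
 * 0 if it is not, and -1 on error or for an out-of-range value.
 */
static int has_effective_cap(int cap)
{
	struct __user_cap_header_struct hdr = {
		.version = _LINUX_CAPABILITY_VERSION_3,
		.pid = 0, /* 0 means the calling process */
	};
	struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3] = { { 0 } };

	if (cap < 0 || cap >= NCAP_BITS)
		return -1;
	if (syscall(SYS_capget, &hdr, data) != 0)
		return -1;
	/* Each array element carries 32 capability bits. */
	return (data[cap / 32].effective >> (cap % 32)) & 1;
}
```

Calling has_effective_cap(CAP_CHECKPOINT_RESTORE) early in main() would let
such a binary refuse to continue when the capability is missing, rather
than failing later with -EPERM in the middle of a restore.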

Patch

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d86c0afc8a85..ce02f3a4b2d7 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2189,16 +2189,16 @@  struct map_files_info {
 };
 
 /*
- * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
- * symlinks may be used to bypass permissions on ancestor directories in the
- * path to the file in question.
+ * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
+ * to concerns about how the symlinks may be used to bypass permissions on
+ * ancestor directories in the path to the file in question.
  */
 static const char *
 proc_map_files_get_link(struct dentry *dentry,
 			struct inode *inode,
 		        struct delayed_call *done)
 {
-	if (!capable(CAP_SYS_ADMIN))
+	if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
 		return ERR_PTR(-EPERM);
 
 	return proc_pid_get_link(dentry, inode, done);
diff --git a/include/linux/capability.h b/include/linux/capability.h
index b4345b38a6be..1e7fe311cabe 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -261,6 +261,12 @@  static inline bool bpf_capable(void)
 	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
 }
 
+static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
+{
+	return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
+		ns_capable(ns, CAP_SYS_ADMIN);
+}
+
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 48ff0757ae5e..395dd0df8d08 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -408,7 +408,14 @@  struct vfs_ns_cap_data {
  */
 #define CAP_BPF			39
 
-#define CAP_LAST_CAP         CAP_BPF
+
+/* Allow checkpoint/restore related operations */
+/* Allow PID selection during clone3() */
+/* Allow writing to ns_last_pid */
+
+#define CAP_CHECKPOINT_RESTORE	40
+
+#define CAP_LAST_CAP         CAP_CHECKPOINT_RESTORE
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 3122043fe364..ada55c7b2b19 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -198,7 +198,7 @@  struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			if (tid != 1 && !tmp->child_reaper)
 				goto out_free;
 			retval = -EPERM;
-			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+			if (!checkpoint_restore_ns_capable(tmp->user_ns))
 				goto out_free;
 			set_tid_size--;
 		}
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0e5ac162c3a8..ac135bd600eb 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -269,7 +269,7 @@  static int pid_ns_ctl_handler(struct ctl_table *table, int write,
 	struct ctl_table tmp = *table;
 	int ret, next;
 
-	if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+	if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
 		return -EPERM;
 
 	/*
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 98e1513b608a..40cebde62856 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,10 @@ 
 	    "audit_control", "setfcap"
 
 #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
-		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
+		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
+		"checkpoint_restore"
 
-#if CAP_LAST_CAP > CAP_BPF
+#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
 #error New capability defined, please update COMMON_CAP2_PERMS.
 #endif
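
The alloc_pid() hunk above is what lets a process holding
CAP_CHECKPOINT_RESTORE choose the PID of a new child through clone3().
Below is a minimal userspace sketch of such a call, assuming a kernel with
clone3() set_tid support. spawn_with_pid() and struct sketch_clone_args are
hypothetical names; the struct is a local copy of the argument layout up to
set_tid_size so the example does not depend on recent headers, and 435 is
assumed as the clone3 syscall number where the headers do not provide it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_clone3
#define SYS_clone3 435 /* assumption: clone3 number on most architectures */
#endif

/* Local copy of the clone3() argument layout (80 bytes, ends at set_tid_size). */
struct sketch_clone_args {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
	uint64_t set_tid;      /* pointer to an array of requested PIDs */
	uint64_t set_tid_size; /* number of entries in that array */
};

/*
 * Try to create a child with PID 'want' in the current PID namespace.
 * Returns the child's PID on success or a negative errno value on failure.
 * The child exits immediately; the parent reaps it before returning.
 */
static pid_t spawn_with_pid(pid_t want)
{
	pid_t set_tid[1] = { want };
	struct sketch_clone_args args = {
		.exit_signal = SIGCHLD,
		.set_tid = (uint64_t)(uintptr_t)set_tid,
		.set_tid_size = 1,
	};
	long ret = syscall(SYS_clone3, &args, sizeof(args));

	if (ret < 0)
		return -errno;
	if (ret == 0)
		_exit(0); /* child: nothing to do in this sketch */
	waitpid((pid_t)ret, NULL, 0); /* parent: reap the child */
	return (pid_t)ret;
}
```

Without CAP_SYS_ADMIN (or, with this patch applied, CAP_CHECKPOINT_RESTORE)
in the user namespace that owns the PID namespace, the call is expected to
fail with EPERM. If the requested PID is already in use, it is expected to
fail with EEXIST.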