diff mbox series

[v5,1/3] open: add close_range()

Message ID 20200602204219.186620-2-christian.brauner@ubuntu.com
State New, archived
Headers show
Series close_range() | expand

Commit Message

Christian Brauner June 2, 2020, 8:42 p.m. UTC
This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.

I've also coordinated with some FreeBSD developers who got in touch with
me (Cced below). FreeBSD intends to add the same syscall once we merged it.
Quite a bunch of projects in userspace are waiting on this syscall
including Python and systemd.

The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.

First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):

        /* that exec is sensitive */
        unshare(CLONE_FILES);
        /* we don't want anything past stderr here */
        close_range(3, ~0U);
        execve(....);

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
task is multi-threaded and shares the file descriptor table with another
thread in which case two threads could race with one thread allocating file
descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)

Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
  - Python (cf. [2])
  - Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).

The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:

static int close_all_fds(void)
{
        int dir_fd;
        DIR *dir;
        struct dirent *direntp;

        dir = opendir("/proc/self/fd");
        if (!dir)
                return -1;
        dir_fd = dirfd(dir);
        while ((direntp = readdir(dir))) {
                int fd;
                if (strcmp(direntp->d_name, ".") == 0)
                        continue;
                if (strcmp(direntp->d_name, "..") == 0)
                        continue;
                fd = atoi(direntp->d_name);
                if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                        continue;
                close(fd);
        }
        closedir(dir);
        return 0;
}

to close_range() yields:
1. closing 4 open files:
   - close_all_fds(): ~280 us
   - close_range():    ~24 us

2. closing 1000 open files:
   - close_all_fds(): ~5000 us
   - close_range():   ~800 us

close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.

From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
  by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
  My reasoning behind this is based on the nature of how __close_fd() needs
  to release an fd. But maybe I misunderstood specifics:
  We take the files_lock and rcu-dereference the fdtable of the calling
  task, we find the entry in the fdtable, get the file and need to release
  files_lock before calling filp_close().
  In the meantime the fdtable might have been altered so we can't just
  retake the spinlock and keep the old rcu-reference of the fdtable
  around. Instead we need to grab a fresh reference to the fdtable.
  If my reasoning is correct then there's really no point in fancyfying
  __close_range(): We just need to rcu-dereference the fdtable of the
  calling task once to cap the max_fd value correctly and then go on
  calling __close_fd() in a loop.

/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
[5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
     Note that this is an internal implementation that is not exported.
     Currently, libc seems to not provide an exported version of this
     because of missing kernel support to do this.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
     Rust's solution is slightly different but is equally unperformant.
     Rust calls getdtablesize() which is a glibc library function that
     simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
     goes on to call close() on each fd. That's obviously overkill for most
     tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
     OPEN_MAX.
     Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
     to 1024. Even in this case, there's a very high chance that in the
     common case Rust is calling the close() syscall 1021 times pointlessly
     if the task just has 0, 1, and 2 open.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kyle Evans <self@kyle-evans.net>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
---
/* v2 */
- Linus Torvalds <torvalds@linux-foundation.org>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors
- Al Viro <viro@zeniv.linux.org.uk>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors

/* v3 */
unchanged

/* v4 */
- Oleg Nesterov <oleg@redhat.com>:
  - fix braino: s/max()/min()/

/* v5 */
unchanged
---
 fs/file.c                | 62 ++++++++++++++++++++++++++++++++++------
 fs/open.c                | 20 +++++++++++++
 include/linux/fdtable.h  |  2 ++
 include/linux/syscalls.h |  2 ++
 4 files changed, 78 insertions(+), 8 deletions(-)

Comments

Florian Weimer June 2, 2020, 11:30 p.m. UTC | #1
* Christian Brauner:

> The performance is striking. For good measure, comparing the following
> simple close_all_fds() userspace implementation that is essentially just
> glibc's version in [6]:
>
> static int close_all_fds(void)
> {
>         int dir_fd;
>         DIR *dir;
>         struct dirent *direntp;
>
>         dir = opendir("/proc/self/fd");
>         if (!dir)
>                 return -1;
>         dir_fd = dirfd(dir);
>         while ((direntp = readdir(dir))) {
>                 int fd;
>                 if (strcmp(direntp->d_name, ".") == 0)
>                         continue;
>                 if (strcmp(direntp->d_name, "..") == 0)
>                         continue;
>                 fd = atoi(direntp->d_name);
>                 if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
>                         continue;
>                 close(fd);
>         }
>         closedir(dir);
>         return 0;
> }
>

> [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
>      Note that this is an internal implementation that is not exported.
>      Currently, libc seems to not provide an exported version of this
>      because of missing kernel support to do this.

Just to be clear, this code is not compiled into glibc anymore in
typical configurations.  I have posted a patch to turn grantpt into a
no-op: <https://sourceware.org/pipermail/libc-alpha/2020-May/114379.html>

I'm not entirely convinced that it's safe to keep iterating over
/proc/self/fd while also closing descriptors.  Ideally, I think an
application should call getdents64, process the file names for
descriptors in the buffer, and if any have been closed, seek to zero
before the next getdents64 call.  Maybe procfs is different, but with
other file systems, unlinking files can trigger directory reordering,
and then you get strange effects.  The d_ino behavior for
/proc/self/fd is a bit strange as well (it's not consistently
descriptor plus 3).
Christian Brauner June 2, 2020, 11:37 p.m. UTC | #2
On Wed, Jun 03, 2020 at 01:30:57AM +0200, Florian Weimer wrote:
> * Christian Brauner:
> 
> > The performance is striking. For good measure, comparing the following
> > simple close_all_fds() userspace implementation that is essentially just
> > glibc's version in [6]:
> >
> > static int close_all_fds(void)
> > {
> >         int dir_fd;
> >         DIR *dir;
> >         struct dirent *direntp;
> >
> >         dir = opendir("/proc/self/fd");
> >         if (!dir)
> >                 return -1;
> >         dir_fd = dirfd(dir);
> >         while ((direntp = readdir(dir))) {
> >                 int fd;
> >                 if (strcmp(direntp->d_name, ".") == 0)
> >                         continue;
> >                 if (strcmp(direntp->d_name, "..") == 0)
> >                         continue;
> >                 fd = atoi(direntp->d_name);
> >                 if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
> >                         continue;
> >                 close(fd);
> >         }
> >         closedir(dir);
> >         return 0;
> > }
> >
> 
> > [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
> >      Note that this is an internal implementation that is not exported.
> >      Currently, libc seems to not provide an exported version of this
> >      because of missing kernel support to do this.
> 
> Just to be clear, this code is not compiled into glibc anymore in
> typical configurations.  I have posted a patch to turn grantpt into a
> no-op: <https://sourceware.org/pipermail/libc-alpha/2020-May/114379.html>

That's great! (I remember commenting on that thread.)
Michael Kerrisk (man-pages) June 3, 2020, 10:24 a.m. UTC | #3
Hi Christian,

Could we have a manual page for this API (best before it's merged)?

Thanks,

Michael

On Tue, 2 Jun 2020 at 22:44, Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> This adds the close_range() syscall. It allows to efficiently close a range
> of file descriptors up to all file descriptors of a calling task.
>
> I've also coordinated with some FreeBSD developers who got in touch with
> me (Cced below). FreeBSD intends to add the same syscall once we merged it.
> Quite a bunch of projects in userspace are waiting on this syscall
> including Python and systemd.
>
> The syscall came up in a recent discussion around the new mount API and
> making new file descriptor types cloexec by default. During this
> discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> syscall in this manner has been requested by various people over time.
>
> First, it helps to close all file descriptors of an exec()ing task. This
> can be done safely via (quoting Al's example from [1] verbatim):
>
>         /* that exec is sensitive */
>         unshare(CLONE_FILES);
>         /* we don't want anything past stderr here */
>         close_range(3, ~0U);
>         execve(....);
>
> The code snippet above is one way of working around the problem that file
> descriptors are not cloexec by default. This is aggravated by the fact that
> we can't just switch them over without massively regressing userspace. For
> a whole class of programs having an in-kernel method of closing all file
> descriptors is very helpful (e.g. demons, service managers, programming
> language standard libraries, container managers etc.).
> (Please note, unshare(CLONE_FILES) should only be needed if the calling
> task is multi-threaded and shares the file descriptor table with another
> thread in which case two threads could race with one thread allocating file
> descriptors and the other one closing them via close_range(). For the
> general case close_range() before the execve() is sufficient.)
>
> Second, it allows userspace to avoid implementing closing all file
> descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> file descriptor. From looking at various large(ish) userspace code bases
> this or similar patterns are very common in:
> - service managers (cf. [4])
> - libcs (cf. [6])
> - container runtimes (cf. [5])
> - programming language runtimes/standard libraries
>   - Python (cf. [2])
>   - Rust (cf. [7], [8])
> As Dmitry pointed out there's even a long-standing glibc bug about missing
> kernel support for this task (cf. [3]).
> In addition, the syscall will also work for tasks that do not have procfs
> mounted and on kernels that do not have procfs support compiled in. In such
> situations the only way to make sure that all file descriptors are closed
> is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> OPEN_MAX trickery (cf. comment [8] on Rust).
>
> The performance is striking. For good measure, comparing the following
> simple close_all_fds() userspace implementation that is essentially just
> glibc's version in [6]:
>
> static int close_all_fds(void)
> {
>         int dir_fd;
>         DIR *dir;
>         struct dirent *direntp;
>
>         dir = opendir("/proc/self/fd");
>         if (!dir)
>                 return -1;
>         dir_fd = dirfd(dir);
>         while ((direntp = readdir(dir))) {
>                 int fd;
>                 if (strcmp(direntp->d_name, ".") == 0)
>                         continue;
>                 if (strcmp(direntp->d_name, "..") == 0)
>                         continue;
>                 fd = atoi(direntp->d_name);
>                 if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
>                         continue;
>                 close(fd);
>         }
>         closedir(dir);
>         return 0;
> }
>
> to close_range() yields:
> 1. closing 4 open files:
>    - close_all_fds(): ~280 us
>    - close_range():    ~24 us
>
> 2. closing 1000 open files:
>    - close_all_fds(): ~5000 us
>    - close_range():   ~800 us
>
> close_range() is designed to allow for some flexibility. Specifically, it
> does not simply always close all open file descriptors of a task. Instead,
> callers can specify an upper bound.
> This is e.g. useful for scenarios where specific file descriptors are
> created with well-known numbers that are supposed to be excluded from
> getting closed.
> For extra paranoia close_range() comes with a flags argument. This can e.g.
> be used to implement extension. Once can imagine userspace wanting to stop
> at the first error instead of ignoring errors under certain circumstances.
> There might be other valid ideas in the future. In any case, a flag
> argument doesn't hurt and keeps us on the safe side.
>
> From an implementation side this is kept rather dumb. It saw some input
> from David and Jann but all nonsense is obviously my own!
> - Errors to close file descriptors are currently ignored. (Could be changed
>   by setting a flag in the future if needed.)
> - __close_range() is a rather simplistic wrapper around __close_fd().
>   My reasoning behind this is based on the nature of how __close_fd() needs
>   to release an fd. But maybe I misunderstood specifics:
>   We take the files_lock and rcu-dereference the fdtable of the calling
>   task, we find the entry in the fdtable, get the file and need to release
>   files_lock before calling filp_close().
>   In the meantime the fdtable might have been altered so we can't just
>   retake the spinlock and keep the old rcu-reference of the fdtable
>   around. Instead we need to grab a fresh reference to the fdtable.
>   If my reasoning is correct then there's really no point in fancyfying
>   __close_range(): We just need to rcu-dereference the fdtable of the
>   calling task once to cap the max_fd value correctly and then go on
>   calling __close_fd() in a loop.
>
> /* References */
> [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
> [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
> [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
> [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
> [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
> [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
>      Note that this is an internal implementation that is not exported.
>      Currently, libc seems to not provide an exported version of this
>      because of missing kernel support to do this.
> [7]: https://github.com/rust-lang/rust/issues/12148
> [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
>      Rust's solution is slightly different but is equally unperformant.
>      Rust calls getdtablesize() which is a glibc library function that
>      simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
>      goes on to call close() on each fd. That's obviously overkill for most
>      tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
>      OPEN_MAX.
>      Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
>      to 1024. Even in this case, there's a very high chance that in the
>      common case Rust is calling the close() syscall 1021 times pointlessly
>      if the task just has 0, 1, and 2 open.
>
> Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Kyle Evans <self@kyle-evans.net>
> Cc: Jann Horn <jannh@google.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Dmitry V. Levin <ldv@altlinux.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: linux-api@vger.kernel.org
> ---
> /* v2 */
> - Linus Torvalds <torvalds@linux-foundation.org>:
>   - add cond_resched() to yield cpu when closing a lot of file descriptors
> - Al Viro <viro@zeniv.linux.org.uk>:
>   - add cond_resched() to yield cpu when closing a lot of file descriptors
>
> /* v3 */
> unchanged
>
> /* v4 */
> - Oleg Nesterov <oleg@redhat.com>:
>   - fix braino: s/max()/min()/
>
> /* v5 */
> unchanged
> ---
>  fs/file.c                | 62 ++++++++++++++++++++++++++++++++++------
>  fs/open.c                | 20 +++++++++++++
>  include/linux/fdtable.h  |  2 ++
>  include/linux/syscalls.h |  2 ++
>  4 files changed, 78 insertions(+), 8 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index abb8b7081d7a..e260bfe687d1 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -10,6 +10,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/export.h>
>  #include <linux/fs.h>
> +#include <linux/kernel.h>
>  #include <linux/mm.h>
>  #include <linux/sched/signal.h>
>  #include <linux/slab.h>
> @@ -620,12 +621,9 @@ void fd_install(unsigned int fd, struct file *file)
>
>  EXPORT_SYMBOL(fd_install);
>
> -/*
> - * The same warnings as for __alloc_fd()/__fd_install() apply here...
> - */
> -int __close_fd(struct files_struct *files, unsigned fd)
> +static struct file *pick_file(struct files_struct *files, unsigned fd)
>  {
> -       struct file *file;
> +       struct file *file = NULL;
>         struct fdtable *fdt;
>
>         spin_lock(&files->file_lock);
> @@ -637,15 +635,63 @@ int __close_fd(struct files_struct *files, unsigned fd)
>                 goto out_unlock;
>         rcu_assign_pointer(fdt->fd[fd], NULL);
>         __put_unused_fd(files, fd);
> -       spin_unlock(&files->file_lock);
> -       return filp_close(file, files);
>
>  out_unlock:
>         spin_unlock(&files->file_lock);
> -       return -EBADF;
> +       return file;
> +}
> +
> +/*
> + * The same warnings as for __alloc_fd()/__fd_install() apply here...
> + */
> +int __close_fd(struct files_struct *files, unsigned fd)
> +{
> +       struct file *file;
> +
> +       file = pick_file(files, fd);
> +       if (!file)
> +               return -EBADF;
> +
> +       return filp_close(file, files);
>  }
>  EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
>
> +/**
> + * __close_range() - Close all file descriptors in a given range.
> + *
> + * @fd:     starting file descriptor to close
> + * @max_fd: last file descriptor to close
> + *
> + * This closes a range of file descriptors. All file descriptors
> + * from @fd up to and including @max_fd are closed.
> + */
> +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +{
> +       unsigned int cur_max;
> +
> +       if (fd > max_fd)
> +               return -EINVAL;
> +
> +       rcu_read_lock();
> +       cur_max = files_fdtable(files)->max_fds;
> +       rcu_read_unlock();
> +
> +       /* cap to last valid index into fdtable */
> +       max_fd = min(max_fd, (cur_max - 1));
> +       while (fd <= max_fd) {
> +               struct file *file;
> +
> +               file = pick_file(files, fd++);
> +               if (!file)
> +                       continue;
> +
> +               filp_close(file, files);
> +               cond_resched();
> +       }
> +
> +       return 0;
> +}
> +
>  /*
>   * variant of __close_fd that gets a ref on the file for later fput.
>   * The caller must ensure that filp_close() called on the file, and then
> diff --git a/fs/open.c b/fs/open.c
> index 719b320ede52..87e076e9e127 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1279,6 +1279,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
>         return retval;
>  }
>
> +/**
> + * close_range() - Close all file descriptors in a given range.
> + *
> + * @fd:     starting file descriptor to close
> + * @max_fd: last file descriptor to close
> + * @flags:  reserved for future extensions
> + *
> + * This closes a range of file descriptors. All file descriptors
> + * from @fd up to and including @max_fd are closed.
> + * Currently, errors to close a given file descriptor are ignored.
> + */
> +SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
> +               unsigned int, flags)
> +{
> +       if (flags)
> +               return -EINVAL;
> +
> +       return __close_range(current->files, fd, max_fd);
> +}
> +
>  /*
>   * This routine simulates a hangup on the tty, to arrange that users
>   * are given clean terminals at login time.
> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
> index f07c55ea0c22..fcd07181a365 100644
> --- a/include/linux/fdtable.h
> +++ b/include/linux/fdtable.h
> @@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
>                       unsigned int fd, struct file *file);
>  extern int __close_fd(struct files_struct *files,
>                       unsigned int fd);
> +extern int __close_range(struct files_struct *files, unsigned int fd,
> +                        unsigned int max_fd);
>  extern int __close_fd_get_file(unsigned int fd, struct file **res);
>
>  extern struct kmem_cache *files_cachep;
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 1815065d52f3..18fea399329b 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -442,6 +442,8 @@ asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
>  asmlinkage long sys_openat2(int dfd, const char __user *filename,
>                             struct open_how *how, size_t size);
>  asmlinkage long sys_close(unsigned int fd);
> +asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
> +                               unsigned int flags);
>  asmlinkage long sys_vhangup(void);
>
>  /* fs/pipe.c */
> --
> 2.26.2
>
Szabolcs Nagy June 5, 2020, 2:55 p.m. UTC | #4
* Christian Brauner <christian.brauner@ubuntu.com> [2020-06-02 22:42:17 +0200]:
> This adds the close_range() syscall. It allows to efficiently close a range
> of file descriptors up to all file descriptors of a calling task.
> 
> I've also coordinated with some FreeBSD developers who got in touch with
> me (Cced below). FreeBSD intends to add the same syscall once we merged it.
> Quite a bunch of projects in userspace are waiting on this syscall
> including Python and systemd.
> 
> The syscall came up in a recent discussion around the new mount API and
> making new file descriptor types cloexec by default. During this
> discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> syscall in this manner has been requested by various people over time.
> 
> First, it helps to close all file descriptors of an exec()ing task. This
> can be done safely via (quoting Al's example from [1] verbatim):
> 
>         /* that exec is sensitive */
>         unshare(CLONE_FILES);
>         /* we don't want anything past stderr here */
>         close_range(3, ~0U);
>         execve(....);

this api needs a documentation patch if there isn't yet.

currently there is no libc interface contract in place that
says which calls may use libc internal fds e.g. i've seen

  openlog(...) // opens libc internal syslog fd
  ...
  fork()
  closefrom(...) // close syslog fd
  open(...) // something that reuses the closed fd
  syslog(...) // unsafe: uses the wrong fd
  execve(...)

syslog uses a libc internal fd that the user trampled on and
this can go bad in many ways depending on what libc apis are
used between closefrom (or equivalent) and exec.

> The code snippet above is one way of working around the problem that file
> descriptors are not cloexec by default. This is aggravated by the fact that
> we can't just switch them over without massively regressing userspace. For

why is a switch_to_cloexec_range worse than close_range?
the former seems safer to me. (and allows libc calls
to be made between such switch and exec: libc internal
fds have to be cloexec anyway)

> a whole class of programs having an in-kernel method of closing all file
> descriptors is very helpful (e.g. demons, service managers, programming
> language standard libraries, container managers etc.).
> (Please note, unshare(CLONE_FILES) should only be needed if the calling
> task is multi-threaded and shares the file descriptor table with another
> thread in which case two threads could race with one thread allocating file
> descriptors and the other one closing them via close_range(). For the
> general case close_range() before the execve() is sufficient.)
> 
> Second, it allows userspace to avoid implementing closing all file
> descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> file descriptor. From looking at various large(ish) userspace code bases
> this or similar patterns are very common in:
> - service managers (cf. [4])
> - libcs (cf. [6])
> - container runtimes (cf. [5])
> - programming language runtimes/standard libraries
>   - Python (cf. [2])
>   - Rust (cf. [7], [8])
> As Dmitry pointed out there's even a long-standing glibc bug about missing
> kernel support for this task (cf. [3]).
> In addition, the syscall will also work for tasks that do not have procfs
> mounted and on kernels that do not have procfs support compiled in. In such
> situations the only way to make sure that all file descriptors are closed
> is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> OPEN_MAX trickery (cf. comment [8] on Rust).

close_range still seems like a bad operation to expose.

if users really want closing behaviour (instead of marking
fds cloexec) then they likely need coordination with libc
and other libraries.

e.g. this usage does not work:

  maxfd = findmaxfd();
  call_that_may_leak_fds();
  close_range(maxfd,~0U);

as far as i can tell only the close right before exec works.
Kyle Evans June 6, 2020, 2:54 a.m. UTC | #5
On Fri, Jun 5, 2020 at 9:55 AM Szabolcs Nagy <nsz@port70.net> wrote:
>
> * Christian Brauner <christian.brauner@ubuntu.com> [2020-06-02 22:42:17 +0200]:
> > [... snip ...]
> >
> > First, it helps to close all file descriptors of an exec()ing task. This
> > can be done safely via (quoting Al's example from [1] verbatim):
> >
> >         /* that exec is sensitive */
> >         unshare(CLONE_FILES);
> >         /* we don't want anything past stderr here */
> >         close_range(3, ~0U);
> >         execve(....);
>
> this api needs a documentation patch if there isn't yet.
>
> currently there is no libc interface contract in place that
> says which calls may use libc internal fds e.g. i've seen
>
>   openlog(...) // opens libc internal syslog fd
>   ...
>   fork()
>   closefrom(...) // close syslog fd
>   open(...) // something that reuses the closed fd
>   syslog(...) // unsafe: uses the wrong fd
>   execve(...)
>
> syslog uses a libc internal fd that the user trampled on and
> this can go bad in many ways depending on what libc apis are
> used between closefrom (or equivalent) and exec.
>

Documentation is good. :-) I think you'll find that while this example
seems to be innocuous on FreeBSD (and likely other *BSD), this is an
atypical scenario and generally not advised.  You would usually not
start closing until you're actually ready to exec/fail.

> > The code snippet above is one way of working around the problem that file
> > descriptors are not cloexec by default. This is aggravated by the fact that
> > we can't just switch them over without massively regressing userspace. For
>
> why is a switch_to_cloexec_range worse than close_range?
> the former seems safer to me. (and allows libc calls
> to be made between such switch and exec: libc internal
> fds have to be cloexec anyway)
>

I wouldn't say it's worse, but it only solves half the problem. While
closefrom -> exec is the most common usage by a long shot, there are
also times (e.g. post-fork without intent to exec for a daemon/service
type) that you want to go ahead and close everything except maybe a
pipe fd that you've opened for IPC. While uncommon, there's no reason
this needs to devolve into a loop to close 'all the fds' when you can
instead introduce close_range to solve both the exec case and other
less common scenarios.

> > a whole class of programs having an in-kernel method of closing all file
> > descriptors is very helpful (e.g. demons, service managers, programming
> > language standard libraries, container managers etc.).
> > (Please note, unshare(CLONE_FILES) should only be needed if the calling
> > task is multi-threaded and shares the file descriptor table with another
> > thread in which case two threads could race with one thread allocating file
> > descriptors and the other one closing them via close_range(). For the
> > general case close_range() before the execve() is sufficient.)
> >
> > Second, it allows userspace to avoid implementing closing all file
> > descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> > file descriptor. From looking at various large(ish) userspace code bases
> > this or similar patterns are very common in:
> > - service managers (cf. [4])
> > - libcs (cf. [6])
> > - container runtimes (cf. [5])
> > - programming language runtimes/standard libraries
> >   - Python (cf. [2])
> >   - Rust (cf. [7], [8])
> > As Dmitry pointed out there's even a long-standing glibc bug about missing
> > kernel support for this task (cf. [3]).
> > In addition, the syscall will also work for tasks that do not have procfs
> > mounted and on kernels that do not have procfs support compiled in. In such
> > situations the only way to make sure that all file descriptors are closed
> > is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> > OPEN_MAX trickery (cf. comment [8] on Rust).
>
> close_range still seems like a bad operation to expose.
>
> if users really want closing behaviour (instead of marking
> fds cloexec) then they likely need coordination with libc
> and other libraries.
>
> e.g. this usage does not work:
>
>   maxfd = findmaxfd();
>   call_that_may_leak_fds();
>   close_range(maxfd,~0U);
>
> as far as i can tell only the close right before exec works.

I don't have much to say on this one, except that's also an unusual
case. I don't know that I'd anticipate close_range getting used for
that kind of cleanup job (I've never seen a similar use of closefrom),
which seems to just be papering over application/library bugs.

Coordination with libc is generally not much of an issue, because this
is really one of the last things you do before exec() or swiftly
failing miserably. Applications that currently loop over all fd <=
maxfd and close(fd) right now are subject to the very same
constraints, this is just a much more efficient way and
debugger-friendly way to accomplish it. You've absolutely not lived
life until you've had to watch thousands of close() calls painfully
scroll by in truss/strace.

Thank,

Kyle Evans
Kyle Evans June 6, 2020, 3:11 a.m. UTC | #6
On Fri, Jun 5, 2020 at 9:54 PM Kyle Evans <kevans@freebsd.org> wrote:
>
> On Fri, Jun 5, 2020 at 9:55 AM Szabolcs Nagy <nsz@port70.net> wrote:
> >
> > * Christian Brauner <christian.brauner@ubuntu.com> [2020-06-02 22:42:17 +0200]:
> > > [... snip ...]
> > >
> > > First, it helps to close all file descriptors of an exec()ing task. This
> > > can be done safely via (quoting Al's example from [1] verbatim):
> > >
> > >         /* that exec is sensitive */
> > >         unshare(CLONE_FILES);
> > >         /* we don't want anything past stderr here */
> > >         close_range(3, ~0U);
> > >         execve(....);
> >
> > this api needs a documentation patch if there isn't yet.
> >
> > currently there is no libc interface contract in place that
> > says which calls may use libc internal fds e.g. i've seen
> >
> >   openlog(...) // opens libc internal syslog fd
> >   ...
> >   fork()
> >   closefrom(...) // close syslog fd
> >   open(...) // something that reuses the closed fd
> >   syslog(...) // unsafe: uses the wrong fd
> >   execve(...)
> >
> > syslog uses a libc internal fd that the user trampled on and
> > this can go bad in many ways depending on what libc apis are
> > used between closefrom (or equivalent) and exec.
> >
>
> Documentation is good. :-) I think you'll find that while this example
> seems to be innocuous on FreeBSD (and likely other *BSD), this is an
> atypical scenario and generally not advised.  You would usually not
> start closing until you're actually ready to exec/fail.
>

Minor correction: not innocuous here, either; O_CLOFORK is not yet a thing. :-)
Szabolcs Nagy June 6, 2020, 11:55 a.m. UTC | #7
* Kyle Evans <kevans@freebsd.org> [2020-06-05 21:54:56 -0500]:
> On Fri, Jun 5, 2020 at 9:55 AM Szabolcs Nagy <nsz@port70.net> wrote:
> > this api needs a documentation patch if there isn't yet.
> >
> > currently there is no libc interface contract in place that
> > says which calls may use libc internal fds e.g. i've seen
> >
> >   openlog(...) // opens libc internal syslog fd
> >   ...
> >   fork()
> >   closefrom(...) // close syslog fd
> >   open(...) // something that reuses the closed fd
> >   syslog(...) // unsafe: uses the wrong fd
> >   execve(...)
> >
> > syslog uses a libc internal fd that the user trampled on and
> > this can go bad in many ways depending on what libc apis are
> > used between closefrom (or equivalent) and exec.
> >
> 
> Documentation is good. :-) I think you'll find that while this example
> seems to be innocuous on FreeBSD (and likely other *BSD), this is an
> atypical scenario and generally not advised.  You would usually not
> start closing until you're actually ready to exec/fail.

it's a recent bug https://bugs.kde.org/show_bug.cgi?id=420921

but not the first closefrom bug i saw: it is a fundamentally
unsafe operation that frees resources owned by others.

> > > The code snippet above is one way of working around the problem that file
> > > descriptors are not cloexec by default. This is aggravated by the fact that
> > > we can't just switch them over without massively regressing userspace. For
> >
> > why is a switch_to_cloexec_range worse than close_range?
> > the former seems safer to me. (and allows libc calls
> > to be made between such switch and exec: libc internal
> > fds have to be cloexec anyway)
> >
> 
> I wouldn't say it's worse, but it only solves half the problem. While
> closefrom -> exec is the most common usage by a long shot, there are
> also times (e.g. post-fork without intent to exec for a daemon/service
> type) that you want to go ahead and close everything except maybe a
> pipe fd that you've opened for IPC. While uncommon, there's no reason
> this needs to devolve into a loop to close 'all the fds' when you can
> instead introduce close_range to solve both the exec case and other
> less common scenarios.

the syslog example shows why post-fork closefrom without
intent to exec does not work: there is no contract about
which api calls behave like syslog, so calling anything
after closefrom can be broken.

libc can introduce new api contracts e.g. that some apis
don't use fds internally or after a closefrom call some
apis behave differently, if this is the expected direction
then it would be nice to propose that on the libc-coord
at openwall.com list.

> Coordination with libc is generally not much of an issue, because this
> is really one of the last things you do before exec() or swiftly
> failing miserably. Applications that currently loop over all fd <=
> maxfd and close(fd) right now are subject to the very same
> constraints, this is just a much more efficient way and
> debugger-friendly way to accomplish it. You've absolutely not lived
> life until you've had to watch thousands of close() calls painfully
> scroll by in truss/strace.

applications do a 'close all fds' operation because there
is no alternative. (i think there are better ways to do
this than looping: you can poll on the fds and only close
the ones that didnt POLLNVAL, this should be more portable
than /proc, but it's besides my point) optimizing this
operation may not be the only way to achive whatever those
applications are trying to do.

if closefrom only works before exec then that should be
documented and callers that do otherwise fixed, if
important users do things between closefrom and exec then
i think a different design is needed with libc maintainers
involved.
Kyle Evans June 6, 2020, 2:43 p.m. UTC | #8
On Sat, Jun 6, 2020 at 6:55 AM Szabolcs Nagy <nsz@port70.net> wrote:
>
> * Kyle Evans <kevans@freebsd.org> [2020-06-05 21:54:56 -0500]:
> > On Fri, Jun 5, 2020 at 9:55 AM Szabolcs Nagy <nsz@port70.net> wrote:
> > > this api needs a documentation patch if there isn't yet.
> > >
> > > currently there is no libc interface contract in place that
> > > says which calls may use libc internal fds e.g. i've seen
> > >
> > >   openlog(...) // opens libc internal syslog fd
> > >   ...
> > >   fork()
> > >   closefrom(...) // close syslog fd
> > >   open(...) // something that reuses the closed fd
> > >   syslog(...) // unsafe: uses the wrong fd
> > >   execve(...)
> > >
> > > syslog uses a libc internal fd that the user trampled on and
> > > this can go bad in many ways depending on what libc apis are
> > > used between closefrom (or equivalent) and exec.
> > >
> >
> > Documentation is good. :-) I think you'll find that while this example
> > seems to be innocuous on FreeBSD (and likely other *BSD), this is an
> > atypical scenario and generally not advised.  You would usually not
> > start closing until you're actually ready to exec/fail.
>
> it's a recent bug https://bugs.kde.org/show_bug.cgi?id=420921
>
> but not the first closefrom bug i saw: it is a fundamentally
> unsafe operation that frees resources owned by others.
>

Yes, close() is an inherently unsafe operation, and they managed this
bug without even having closefrom/close_range. I'm not entirely
convinced folks are going to spontaneously develop a need to massively
close things just because close_range exists. If they have a need,
they're already doing it with what they have available and causing
bugs like the above.

> > > > The code snippet above is one way of working around the problem that file
> > > > descriptors are not cloexec by default. This is aggravated by the fact that
> > > > we can't just switch them over without massively regressing userspace. For
> > >
> > > why is a switch_to_cloexec_range worse than close_range?
> > > the former seems safer to me. (and allows libc calls
> > > to be made between such switch and exec: libc internal
> > > fds have to be cloexec anyway)
> > >
> >
> > I wouldn't say it's worse, but it only solves half the problem. While
> > closefrom -> exec is the most common usage by a long shot, there are
> > also times (e.g. post-fork without intent to exec for a daemon/service
> > type) that you want to go ahead and close everything except maybe a
> > pipe fd that you've opened for IPC. While uncommon, there's no reason
> > this needs to devolve into a loop to close 'all the fds' when you can
> > instead introduce close_range to solve both the exec case and other
> > less common scenarios.
>
> the syslog example shows why post-fork closefrom without
> intent to exec does not work: there is no contract about
> which api calls behave like syslog, so calling anything
> after closefrom can be broken.
>

I think that example shows one scenario where it's not safe, that's
again in firmly "don't do that" territory. You can close arbitrary fds
very early or very late, but not somewhere in the middle of an even
remotely complex application. This problem exists with or without
close_range.

Like I said before, this is already done quite successfully now, along
with other not-even-forked uses. e.g. OpenSSH sshd will closefrom()
just after argument parsing:
https://github.com/openbsd/src/blob/master/usr.bin/ssh/sshd.c#L1582

> libc can introduce new api contracts e.g. that some apis
> don't use fds internally or after a closefrom call some
> apis behave differently, if this is the expected direction
> then it would be nice to propose that on the libc-coord
> at openwall.com list.
>

I suspect it's likely better to document that one should close
arbitrary fds very early or very late instead. Documenting which APIs
are inherently unsafe before/after would seem to be fraught with peril
-- you can enumerate what in libc is a potential problem, but there
are other libs in use by applications that will also use fds
internally and potentially cause problems, but we can't possibly raise
awareness of all of them. We can, however, raise awareness of the
valid and incredibly useful use-cases.

> > Coordination with libc is generally not much of an issue, because this
> > is really one of the last things you do before exec() or swiftly
> > failing miserably. Applications that currently loop over all fd <=
> > maxfd and close(fd) right now are subject to the very same
> > constraints, this is just a much more efficient way and
> > debugger-friendly way to accomplish it. You've absolutely not lived
> > life until you've had to watch thousands of close() calls painfully
> > scroll by in truss/strace.
>
> applications do a 'close all fds' operation because there
> is no alternative. (i think there are better ways to do
> this than looping: you can poll on the fds and only close
> the ones that didnt POLLNVAL, this should be more portable
> than /proc, but it's besides my point) optimizing this
> operation may not be the only way to achive whatever those
> applications are trying to do.

In most cases, those applications really are just trying to close all
fds other than the ones they want to stick around. Polling the fds
before trying to close them would just *double* the current problem
without fixing your other concern at all -- you still can't arbitrary
close open fds without understanding their provenance, and now you've
doubled the number of syscalls required just to close what you don't
need.

fdwalk() + close() is an (IMO) great alternative that's even more
flexible, but that still has its own problems.

Thanks,

Kyle Evans
David Laight June 7, 2020, 1:22 p.m. UTC | #9
From: Szabolcs Nagy
> Sent: 05 June 2020 15:56
...
> currently there is no libc interface contract in place that
> says which calls may use libc internal fds e.g. i've seen
> 
>   openlog(...) // opens libc internal syslog fd
>   ...
>   fork()
>   closefrom(...) // close syslog fd
>   open(...) // something that reuses the closed fd
>   syslog(...) // unsafe: uses the wrong fd
>   execve(...)
> 
> syslog uses a libc internal fd that the user trampled on and
> this can go bad in many ways depending on what libc apis are
> used between closefrom (or equivalent) and exec.

It is, of course, traditional that daemons only call
close(0); close(1); close(2);
Took us ages to discover that a misspelt fprintf()
was adding data to the stdout buffer and eventually
flushing 10k of ascii text into an inter-process pipe
that had a 32bit field for 'message extension length'.

FWIW isn't syslog() going to go badly wrong after fork()
anyway?
Unless libc's fork() calls closelog().

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Michael Kerrisk (man-pages) Sept. 17, 2020, 7:52 a.m. UTC | #10
Hey Christian,

Could we please have a manual page for the close_range(2) syscall
that's about to land in 5.9?

Thanks,

Michael

On Wed, 3 Jun 2020 at 12:24, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> Hi Christian,
>
> Could we have a manual page for this API (best before it's merged)?
>
> Thanks,
>
> Michael
>
> On Tue, 2 Jun 2020 at 22:44, Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > This adds the close_range() syscall. It allows to efficiently close a range
> > of file descriptors up to all file descriptors of a calling task.
> >
> > I've also coordinated with some FreeBSD developers who got in touch with
> > me (Cced below). FreeBSD intends to add the same syscall once we merged it.
> > Quite a bunch of projects in userspace are waiting on this syscall
> > including Python and systemd.
> >
> > The syscall came up in a recent discussion around the new mount API and
> > making new file descriptor types cloexec by default. During this
> > discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> > syscall in this manner has been requested by various people over time.
> >
> > First, it helps to close all file descriptors of an exec()ing task. This
> > can be done safely via (quoting Al's example from [1] verbatim):
> >
> >         /* that exec is sensitive */
> >         unshare(CLONE_FILES);
> >         /* we don't want anything past stderr here */
> >         close_range(3, ~0U);
> >         execve(....);
> >
> > The code snippet above is one way of working around the problem that file
> > descriptors are not cloexec by default. This is aggravated by the fact that
> > we can't just switch them over without massively regressing userspace. For
> > a whole class of programs having an in-kernel method of closing all file
> > descriptors is very helpful (e.g. demons, service managers, programming
> > language standard libraries, container managers etc.).
> > (Please note, unshare(CLONE_FILES) should only be needed if the calling
> > task is multi-threaded and shares the file descriptor table with another
> > thread in which case two threads could race with one thread allocating file
> > descriptors and the other one closing them via close_range(). For the
> > general case close_range() before the execve() is sufficient.)
> >
> > Second, it allows userspace to avoid implementing closing all file
> > descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> > file descriptor. From looking at various large(ish) userspace code bases
> > this or similar patterns are very common in:
> > - service managers (cf. [4])
> > - libcs (cf. [6])
> > - container runtimes (cf. [5])
> > - programming language runtimes/standard libraries
> >   - Python (cf. [2])
> >   - Rust (cf. [7], [8])
> > As Dmitry pointed out there's even a long-standing glibc bug about missing
> > kernel support for this task (cf. [3]).
> > In addition, the syscall will also work for tasks that do not have procfs
> > mounted and on kernels that do not have procfs support compiled in. In such
> > situations the only way to make sure that all file descriptors are closed
> > is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> > OPEN_MAX trickery (cf. comment [8] on Rust).
> >
> > The performance is striking. For good measure, comparing the following
> > simple close_all_fds() userspace implementation that is essentially just
> > glibc's version in [6]:
> >
> > static int close_all_fds(void)
> > {
> >         int dir_fd;
> >         DIR *dir;
> >         struct dirent *direntp;
> >
> >         dir = opendir("/proc/self/fd");
> >         if (!dir)
> >                 return -1;
> >         dir_fd = dirfd(dir);
> >         while ((direntp = readdir(dir))) {
> >                 int fd;
> >                 if (strcmp(direntp->d_name, ".") == 0)
> >                         continue;
> >                 if (strcmp(direntp->d_name, "..") == 0)
> >                         continue;
> >                 fd = atoi(direntp->d_name);
> >                 if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
> >                         continue;
> >                 close(fd);
> >         }
> >         closedir(dir);
> >         return 0;
> > }
> >
> > to close_range() yields:
> > 1. closing 4 open files:
> >    - close_all_fds(): ~280 us
> >    - close_range():    ~24 us
> >
> > 2. closing 1000 open files:
> >    - close_all_fds(): ~5000 us
> >    - close_range():   ~800 us
> >
> > close_range() is designed to allow for some flexibility. Specifically, it
> > does not simply always close all open file descriptors of a task. Instead,
> > callers can specify an upper bound.
> > This is e.g. useful for scenarios where specific file descriptors are
> > created with well-known numbers that are supposed to be excluded from
> > getting closed.
> > For extra paranoia close_range() comes with a flags argument. This can e.g.
> > be used to implement extension. Once can imagine userspace wanting to stop
> > at the first error instead of ignoring errors under certain circumstances.
> > There might be other valid ideas in the future. In any case, a flag
> > argument doesn't hurt and keeps us on the safe side.
> >
> > From an implementation side this is kept rather dumb. It saw some input
> > from David and Jann but all nonsense is obviously my own!
> > - Errors to close file descriptors are currently ignored. (Could be changed
> >   by setting a flag in the future if needed.)
> > - __close_range() is a rather simplistic wrapper around __close_fd().
> >   My reasoning behind this is based on the nature of how __close_fd() needs
> >   to release an fd. But maybe I misunderstood specifics:
> >   We take the files_lock and rcu-dereference the fdtable of the calling
> >   task, we find the entry in the fdtable, get the file and need to release
> >   files_lock before calling filp_close().
> >   In the meantime the fdtable might have been altered so we can't just
> >   retake the spinlock and keep the old rcu-reference of the fdtable
> >   around. Instead we need to grab a fresh reference to the fdtable.
> >   If my reasoning is correct then there's really no point in fancyfying
> >   __close_range(): We just need to rcu-dereference the fdtable of the
> >   calling task once to cap the max_fd value correctly and then go on
> >   calling __close_fd() in a loop.
> >
> > /* References */
> > [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
> > [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
> > [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
> > [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
> > [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
> > [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
> >      Note that this is an internal implementation that is not exported.
> >      Currently, libc seems to not provide an exported version of this
> >      because of missing kernel support to do this.
> > [7]: https://github.com/rust-lang/rust/issues/12148
> > [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
> >      Rust's solution is slightly different but is equally unperformant.
> >      Rust calls getdtablesize() which is a glibc library function that
> >      simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
> >      goes on to call close() on each fd. That's obviously overkill for most
> >      tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
> >      OPEN_MAX.
> >      Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
> >      to 1024. Even in this case, there's a very high chance that in the
> >      common case Rust is calling the close() syscall 1021 times pointlessly
> >      if the task just has 0, 1, and 2 open.
> >
> > Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: Kyle Evans <self@kyle-evans.net>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: David Howells <dhowells@redhat.com>
> > Cc: Dmitry V. Levin <ldv@altlinux.org>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Florian Weimer <fweimer@redhat.com>
> > Cc: linux-api@vger.kernel.org
> > ---
> > /* v2 */
> > - Linus Torvalds <torvalds@linux-foundation.org>:
> >   - add cond_resched() to yield cpu when closing a lot of file descriptors
> > - Al Viro <viro@zeniv.linux.org.uk>:
> >   - add cond_resched() to yield cpu when closing a lot of file descriptors
> >
> > /* v3 */
> > unchanged
> >
> > /* v4 */
> > - Oleg Nesterov <oleg@redhat.com>:
> >   - fix braino: s/max()/min()/
> >
> > /* v5 */
> > unchanged
> > ---
> >  fs/file.c                | 62 ++++++++++++++++++++++++++++++++++------
> >  fs/open.c                | 20 +++++++++++++
> >  include/linux/fdtable.h  |  2 ++
> >  include/linux/syscalls.h |  2 ++
> >  4 files changed, 78 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index abb8b7081d7a..e260bfe687d1 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -10,6 +10,7 @@
> >  #include <linux/syscalls.h>
> >  #include <linux/export.h>
> >  #include <linux/fs.h>
> > +#include <linux/kernel.h>
> >  #include <linux/mm.h>
> >  #include <linux/sched/signal.h>
> >  #include <linux/slab.h>
> > @@ -620,12 +621,9 @@ void fd_install(unsigned int fd, struct file *file)
> >
> >  EXPORT_SYMBOL(fd_install);
> >
> > -/*
> > - * The same warnings as for __alloc_fd()/__fd_install() apply here...
> > - */
> > -int __close_fd(struct files_struct *files, unsigned fd)
> > +static struct file *pick_file(struct files_struct *files, unsigned fd)
> >  {
> > -       struct file *file;
> > +       struct file *file = NULL;
> >         struct fdtable *fdt;
> >
> >         spin_lock(&files->file_lock);
> > @@ -637,15 +635,63 @@ int __close_fd(struct files_struct *files, unsigned fd)
> >                 goto out_unlock;
> >         rcu_assign_pointer(fdt->fd[fd], NULL);
> >         __put_unused_fd(files, fd);
> > -       spin_unlock(&files->file_lock);
> > -       return filp_close(file, files);
> >
> >  out_unlock:
> >         spin_unlock(&files->file_lock);
> > -       return -EBADF;
> > +       return file;
> > +}
> > +
> > +/*
> > + * The same warnings as for __alloc_fd()/__fd_install() apply here...
> > + */
> > +int __close_fd(struct files_struct *files, unsigned fd)
> > +{
> > +       struct file *file;
> > +
> > +       file = pick_file(files, fd);
> > +       if (!file)
> > +               return -EBADF;
> > +
> > +       return filp_close(file, files);
> >  }
> >  EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
> >
> > +/**
> > + * __close_range() - Close all file descriptors in a given range.
> > + *
> > + * @fd:     starting file descriptor to close
> > + * @max_fd: last file descriptor to close
> > + *
> > + * This closes a range of file descriptors. All file descriptors
> > + * from @fd up to and including @max_fd are closed.
> > + */
> > +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > +{
> > +       unsigned int cur_max;
> > +
> > +       if (fd > max_fd)
> > +               return -EINVAL;
> > +
> > +       rcu_read_lock();
> > +       cur_max = files_fdtable(files)->max_fds;
> > +       rcu_read_unlock();
> > +
> > +       /* cap to last valid index into fdtable */
> > +       max_fd = min(max_fd, (cur_max - 1));
> > +       while (fd <= max_fd) {
> > +               struct file *file;
> > +
> > +               file = pick_file(files, fd++);
> > +               if (!file)
> > +                       continue;
> > +
> > +               filp_close(file, files);
> > +               cond_resched();
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> >  /*
> >   * variant of __close_fd that gets a ref on the file for later fput.
> >   * The caller must ensure that filp_close() called on the file, and then
> > diff --git a/fs/open.c b/fs/open.c
> > index 719b320ede52..87e076e9e127 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -1279,6 +1279,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
> >         return retval;
> >  }
> >
> > +/**
> > + * close_range() - Close all file descriptors in a given range.
> > + *
> > + * @fd:     starting file descriptor to close
> > + * @max_fd: last file descriptor to close
> > + * @flags:  reserved for future extensions
> > + *
> > + * This closes a range of file descriptors. All file descriptors
> > + * from @fd up to and including @max_fd are closed.
> > + * Currently, errors to close a given file descriptor are ignored.
> > + */
> > +SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
> > +               unsigned int, flags)
> > +{
> > +       if (flags)
> > +               return -EINVAL;
> > +
> > +       return __close_range(current->files, fd, max_fd);
> > +}
> > +
> >  /*
> >   * This routine simulates a hangup on the tty, to arrange that users
> >   * are given clean terminals at login time.
> > diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
> > index f07c55ea0c22..fcd07181a365 100644
> > --- a/include/linux/fdtable.h
> > +++ b/include/linux/fdtable.h
> > @@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
> >                       unsigned int fd, struct file *file);
> >  extern int __close_fd(struct files_struct *files,
> >                       unsigned int fd);
> > +extern int __close_range(struct files_struct *files, unsigned int fd,
> > +                        unsigned int max_fd);
> >  extern int __close_fd_get_file(unsigned int fd, struct file **res);
> >
> >  extern struct kmem_cache *files_cachep;
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 1815065d52f3..18fea399329b 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -442,6 +442,8 @@ asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
> >  asmlinkage long sys_openat2(int dfd, const char __user *filename,
> >                             struct open_how *how, size_t size);
> >  asmlinkage long sys_close(unsigned int fd);
> > +asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
> > +                               unsigned int flags);
> >  asmlinkage long sys_vhangup(void);
> >
> >  /* fs/pipe.c */
> > --
> > 2.26.2
> >
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
diff mbox series

Patch

diff --git a/fs/file.c b/fs/file.c
index abb8b7081d7a..e260bfe687d1 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -10,6 +10,7 @@ 
 #include <linux/syscalls.h>
 #include <linux/export.h>
 #include <linux/fs.h>
+#include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/sched/signal.h>
 #include <linux/slab.h>
@@ -620,12 +621,9 @@  void fd_install(unsigned int fd, struct file *file)
 
 EXPORT_SYMBOL(fd_install);
 
-/*
- * The same warnings as for __alloc_fd()/__fd_install() apply here...
- */
-int __close_fd(struct files_struct *files, unsigned fd)
+static struct file *pick_file(struct files_struct *files, unsigned fd)
 {
-	struct file *file;
+	struct file *file = NULL;
 	struct fdtable *fdt;
 
 	spin_lock(&files->file_lock);
@@ -637,15 +635,63 @@  int __close_fd(struct files_struct *files, unsigned fd)
 		goto out_unlock;
 	rcu_assign_pointer(fdt->fd[fd], NULL);
 	__put_unused_fd(files, fd);
-	spin_unlock(&files->file_lock);
-	return filp_close(file, files);
 
 out_unlock:
 	spin_unlock(&files->file_lock);
-	return -EBADF;
+	return file;
+}
+
+/*
+ * The same warnings as for __alloc_fd()/__fd_install() apply here...
+ */
+int __close_fd(struct files_struct *files, unsigned fd)
+{
+	struct file *file;
+
+	file = pick_file(files, fd);
+	if (!file)
+		return -EBADF;
+
+	return filp_close(file, files);
 }
 EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
 
+/**
+ * __close_range() - Close all file descriptors in a given range.
+ *
+ * @fd:     starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ */
+int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
+{
+	unsigned int cur_max;
+
+	if (fd > max_fd)
+		return -EINVAL;
+
+	rcu_read_lock();
+	cur_max = files_fdtable(files)->max_fds;
+	rcu_read_unlock();
+
+	/* cap to last valid index into fdtable */
+	max_fd = min(max_fd, (cur_max - 1));
+	while (fd <= max_fd) {
+		struct file *file;
+
+		file = pick_file(files, fd++);
+		if (!file)
+			continue;
+
+		filp_close(file, files);
+		cond_resched();
+	}
+
+	return 0;
+}
+
 /*
  * variant of __close_fd that gets a ref on the file for later fput.
  * The caller must ensure that filp_close() called on the file, and then
diff --git a/fs/open.c b/fs/open.c
index 719b320ede52..87e076e9e127 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1279,6 +1279,26 @@  SYSCALL_DEFINE1(close, unsigned int, fd)
 	return retval;
 }
 
+/**
+ * close_range() - Close all file descriptors in a given range.
+ *
+ * @fd:     starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ * @flags:  reserved for future extensions
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ * Currently, errors to close a given file descriptor are ignored.
+ */
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+		unsigned int, flags)
+{
+	if (flags)
+		return -EINVAL;
+
+	return __close_range(current->files, fd, max_fd);
+}
+
 /*
  * This routine simulates a hangup on the tty, to arrange that users
  * are given clean terminals at login time.
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..fcd07181a365 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -121,6 +121,8 @@  extern void __fd_install(struct files_struct *files,
 		      unsigned int fd, struct file *file);
 extern int __close_fd(struct files_struct *files,
 		      unsigned int fd);
+extern int __close_range(struct files_struct *files, unsigned int fd,
+			 unsigned int max_fd);
 extern int __close_fd_get_file(unsigned int fd, struct file **res);
 
 extern struct kmem_cache *files_cachep;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1815065d52f3..18fea399329b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -442,6 +442,8 @@  asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
 asmlinkage long sys_openat2(int dfd, const char __user *filename,
 			    struct open_how *how, size_t size);
 asmlinkage long sys_close(unsigned int fd);
+asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
+				unsigned int flags);
 asmlinkage long sys_vhangup(void);
 
 /* fs/pipe.c */