| Message ID | 20240915-alice-file-v10-6-88484f7a3dcf@google.com (mailing list archive) |
|---|---|
| State | New |
| Series | File abstractions needed by Rust Binder |
On Sun, Sep 15, 2024 at 02:31:32PM +0000, Alice Ryhl wrote:

> +impl Drop for FileDescriptorReservation {
> +    fn drop(&mut self) {
> +        // SAFETY: By the type invariants of this type, `self.fd` was previously returned by
> +        // `get_unused_fd_flags`. We have not yet used the fd, so it is still valid, and `current`
> +        // still refers to the same task, as this type cannot be moved across task boundaries.
> +        unsafe { bindings::put_unused_fd(self.fd) };
> +    }
> +}

FWIW, it's a bit more delicate. The real rules for API users are:

1) Anything you get from get_unused_fd_flags() (well, alloc_fd(),
internally) must be passed either to put_unused_fd() or fd_install()
before you return from the syscall. That should be done by the same
thread, and all calls of put_unused_fd() or fd_install() should be
paired with some get_unused_fd_flags() in that manner (i.e. by the same
thread, within the same syscall, etc.)

2) The calling thread MUST NOT unshare the descriptor table while it has
any reserved descriptors. I.e.

	fd = get_unused_fd();
	unshare_files();
	fd_install(fd, file);

is a bug. Reservations are discarded by that. Getting rid of that
constraint would require tracking the sets of reserved descriptors
separately for each thread that happens to share the descriptor table.
Conceptually they *are* per-thread - the same thread that has done the
reservation must either discard it or use it. However, it's easier to
keep "it's reserved by some thread" represented in the descriptor table
itself (bit set in the ->open_fds bitmap, NULL file reference in the
->fd[] array) than to try and keep track of who's reserved what. The
constraint is basically "all reservations can stay with the old copy",
i.e. "the caller has no reservations of its own to transfer into the new
private copy it gets". It's not particularly onerous[*] and it
simplifies things quite a bit. However, if we are documenting things, it
needs to be stated explicitly. With respect to Rust, if you do e.g.
binfmt-in-rust support it will immediately become an issue -
begin_new_exec() calls unshare_files(), so the example above can become
a problem.

Internally (in fs/file.c, that is) we have an additional safety rule -
anything that might be given an arbitrary descriptor (e.g. the do_dup2()
destination can come directly from a dup2(2) argument, the
file_close_fd_locked() victim can come directly from a close(2) one,
etc.) must leave reserved descriptors alone. Not an issue API users need
to watch out for, though.

[*] Unsharing the descriptor table is done by
	+ close_range(2), which has no reason to allocate any descriptors
and is only called by userland.
	+ unshare(2), which has no reason to allocate any descriptors
and is only called by userland.
	+ a place in early init that calls ksys_unshare() while arranging
the environment for /linuxrc from the initrd image to be run. Again, no
reserved descriptors there.
	+ the coredumping thread at the beginning of do_coredump().
The caller is at the point of signal delivery, which means that it had
already left whatever syscall it might have been in. Which means that
all reservations must have been undone by that point.
	+ execve() at the point of no return (in begin_new_exec()).
That's the only place where violation of that constraint by some later
change is plausible. That one needs to be watched out for.
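The "reservation lives in the table itself" representation also explains why unsharing discards reservations: the copy made for the new private table only keeps slots that actually hold a file. The following userspace toy model (all `toy_*` names are mine, not kernel code) sketches that behavior under the stated simplifications (a boolean array instead of a real bitmap, `void *` instead of `struct file *`):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TABLE_SIZE 8

struct toy_fdtable {
    bool open_fds[TABLE_SIZE]; /* models the ->open_fds bitmap */
    void *fd[TABLE_SIZE];      /* models the ->fd[] file pointer array */
};

/* models get_unused_fd_flags(): mark the slot busy, install no file yet */
static int toy_reserve(struct toy_fdtable *t)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!t->open_fds[i]) {
            t->open_fds[i] = true;
            t->fd[i] = NULL; /* reserved: bit set, NULL file */
            return i;
        }
    }
    return -1;
}

/* models fd_install(): bind the file to the reserved slot */
static void toy_install(struct toy_fdtable *t, int fd, void *file)
{
    t->fd[fd] = file;
}

/* models the copy done for unshare_files(): slots with a NULL file are
 * cleared in the copy, i.e. reservations do not survive unsharing */
static void toy_unshare(const struct toy_fdtable *old, struct toy_fdtable *priv)
{
    memset(priv, 0, sizeof(*priv));
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (old->open_fds[i] && old->fd[i]) {
            priv->open_fds[i] = true;
            priv->fd[i] = old->fd[i];
        }
    }
}
```

This is exactly why `get_unused_fd(); unshare_files(); fd_install(...)` is a bug in the model too: after `toy_unshare`, the reserved slot no longer exists in the private copy.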
On Sun, Sep 15, 2024 at 07:39:05PM +0100, Al Viro wrote:

> 2) calling thread MUST NOT unshare descriptor table while it has
> any reserved descriptors. I.e.
> 	fd = get_unused_fd();
> 	unshare_files();
> 	fd_install(fd, file);
> is a bug. Reservations are discarded by that. Getting rid of that
> constraint would require tracking the sets of reserved descriptors
> separately for each thread that happens to share the descriptor table.
> Conceptually they *are* per-thread - the same thread that has done
> reservation must either discard it or use it. However, it's easier to
> keep the "it's reserved by some thread" represented in descriptor table
> itself (bit set in ->open_fds bitmap, file reference in ->fd[] array is
> NULL) than try and keep track of who's reserved what. The constraint is
> basically "all reservations can stay with the old copy", i.e. "caller has
> no reservations of its own to transfer into the new private copy it gets".

FWIW, I toyed with the idea of having reservations kept per-thread; it
is possible and it simplifies some things, but I hadn't been able to
find a way to do that without buggering syscall latency for open()
et.al.

It would keep track of the set of reservations in task_struct (a count,
a two-element array for the first two, plus a page pointer for
spillovers, for the rare threads that need more than two reserved
simultaneously).

Representation in the fdtable:

	state		open_fds bit	value in ->fd[] array
	free		clear		0UL
	reserved	set		0UL
	uncommitted	set		1UL | (unsigned long)file
	open		set		(unsigned long)file

with file lookup treating any odd value as 0 (i.e. as reserved),
fd_install() switching reserved to uncommitted, *AND* a separate
"commit" operation that does this:

	if current->reservation_count == 0
		return
	if failure
		for each descriptor in our reserved set
			v = fdtable->fd[descriptor]
			if (v) {
				fdtable->fd[descriptor] = 0;
				fput((struct file *)(v & ~1));
			}
			clear bit in fdtable->open_fds[]
	else
		for each descriptor in our reserved set
			v = fdtable->fd[descriptor]
			if (v)
				fdtable->fd[descriptor] = v & ~1;
			else
				BUG
	current->reservation_count = 0

That "commit" thing would be called on return from syscall for userland
threads and would be called explicitly in kernel threads that have a
descriptor table and work with it.

The benefit would be that fd_install() would *NOT* have to be done after
the last possible failure exit - something that installs a lot of files
wouldn't have to play convoluted games on cleanup. Simply returning an
error would do the right thing.

Two stores into ->fd[] instead of one is not a big deal; however,
anything like task_work_add() to arrange the call of "commit" ends up
being bloody awful. We could have it called from syscall glue directly,
but that means touching assembler for quite a few architectures, and I
hadn't gotten around to that. Can be done, but it's not pleasant work...
On Sun, Sep 15, 2024 at 8:39 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sun, Sep 15, 2024 at 02:31:32PM +0000, Alice Ryhl wrote:
>
> > +impl Drop for FileDescriptorReservation {
> > +    fn drop(&mut self) {
> > +        // SAFETY: By the type invariants of this type, `self.fd` was previously returned by
> > +        // `get_unused_fd_flags`. We have not yet used the fd, so it is still valid, and `current`
> > +        // still refers to the same task, as this type cannot be moved across task boundaries.
> > +        unsafe { bindings::put_unused_fd(self.fd) };
> > +    }
> > +}
>
> FWIW, it's a bit more delicate. The real rules for API users are
>
> 1) anything you get from get_unused_fd_flags() (well, alloc_fd(),
> internally) must be passed either to put_unused_fd() or fd_install() before
> you return from syscall. That should be done by the same thread and
> all calls of put_unused_fd() or fd_install() should be paired with
> some get_unused_fd_flags() in that manner (i.e. by the same thread,
> within the same syscall, etc.)

Ok, we have to use it before returning from the syscall. That's fine.

What happens if I call `get_unused_fd_flags`, and then never call
`put_unused_fd`? Assume that I don't try to use the fd in the future,
and that I just forgot to clean up after myself.

> 2) calling thread MUST NOT unshare descriptor table while it has
> any reserved descriptors. I.e.
> 	fd = get_unused_fd();
> 	unshare_files();
> 	fd_install(fd, file);
> is a bug. Reservations are discarded by that. Getting rid of that
> constraint would require tracking the sets of reserved descriptors
> separately for each thread that happens to share the descriptor table.
> Conceptually they *are* per-thread - the same thread that has done
> reservation must either discard it or use it. However, it's easier to
> keep the "it's reserved by some thread" represented in descriptor table
> itself (bit set in ->open_fds bitmap, file reference in ->fd[] array is
> NULL) than try and keep track of who's reserved what. The constraint is
> basically "all reservations can stay with the old copy", i.e. "caller has
> no reservations of its own to transfer into the new private copy it gets".
> It's not particularly onerous[*] and it simplifies things
> quite a bit. However, if we are documenting thing, it needs to be
> put explicitly. With respect to Rust, if you do e.g. binfmt-in-rust
> support it will immediately become an issue - begin_new_exec() is calling
> unshare_files(), so the example above can become an issue.
>
> Internally (in fs/file.c, that is) we have additional safety
> rule - anything that might be given an arbitrary descriptor (e.g.
> do_dup2() destination can come directly from dup2(2) argument,
> file_close_fd_locked() victim can come directly from close(2) one,
> etc.) must leave reserved descriptors alone. Not an issue API users
> need to watch out for, though.
>
> [*] unsharing the descriptor table is done by
> 	+ close_range(2), which has no reason to allocate any descriptors
> and is only called by userland.
> 	+ unshare(2), which has no reason to allocate any descriptors
> and is only called by userland.
> 	+ a place in early init that call ksys_unshare() while arranging
> the environment for /linuxrc from initrd image to be run. Again, no
> reserved descriptors there.
> 	+ coredumping thread in the beginning of do_coredump().
> The caller is at the point of signal delivery, which means that it had
> already left whatever syscall it might have been in. Which means
> that all reservations must have been undone by that point.
> 	+ execve() at the point of no return (in begin_new_exec()).
> That's the only place where violation of that constraint on some later
> changes is plausible. That one needs to be watched out for.

Thanks for going through that. From a Rust perspective, it sounds
easiest to just declare that execve() is an unsafe operation, and that
one of the conditions for using it is that you don't hold any fd
reservations. Trying to encode this in the fd reservation logic seems
too onerous, and I'm guessing execve is not used particularly often
anyway.

Alice
> What happens if I call `get_unused_fd_flags`, and then never call
> `put_unused_fd`? Assume that I don't try to use the fd in the future,
> and that I just forgot to clean up after myself.

A moderate amount of bogosities while the thread exists. For one thing,
the descriptor is leaked - for open() et.al. it will look like it's in
use. For close() it will look like it's _not_ open (i.e. trying to close
it from userland will quietly do nothing). Trying to overwrite it with
dup2() will keep failing with -EBUSY. Kernel-side it definitely violates
assertions, but currently nothing will break. That might or might not
remain true in the future. Doing that again and again would lead to an
inflated descriptor table, but then so would dup2(0, something_large).
IOW, it would be a bug, but it's probably not going to be a high-impact
security hole.

> > + execve() at the point of no return (in begin_new_exec()).

execve(2), sorry.

> > That's the only place where violation of that constraint on some later
> > changes is plausible. That one needs to be watched out for.

> Thanks for going through that.

I'm in the middle of writing documentation on the descriptor table and
struct file handling right now anyway...

> From a Rust perspective, it sounds
> easiest to just declare that execve() is an unsafe operation, and that
> one of the conditions for using it is that you don't hold any fd
> reservations. Trying to encode this in the fd reservation logic seems
> too onerous, and I'm guessing execve is not used particularly often
> anyway.

Sorry, bad editing on my part - I should've clearly marked execve() as a
syscall. It's not that it's an unsafe operation - it's only called from
userland anyway, so it's not going to happen in the scope of any
reserved descriptor. The problem is different:

* Userland calls execve("/usr/bin/whatever", argv, envp).

* That proceeds through several wrappers to do_execveat_common(). There
are several syscalls in that family and they all converge on
do_execveat_common() - the wrappers just deal with differences in
calling conventions.

* do_execveat_common() sets up the exec context (struct linux_binprm).
That opens the binary to be executed, creates the memory context to be
used, calculates argc, sets up argv and envp on what will become the
userland stack for the new program, etc. - basically, all the work of
marshalling the data from the caller's memory. Then it calls
bprm_execve(), which is where the rest of the work will be done.

* bprm_execve() eventually calls exec_binprm(). That calls
search_binary_handler(), which is where we finally get a look at the
binary we are trying to load. search_binary_handler() goes through the
known binary formats (ELF, script, etc.) and tries to offer the exec
context to the ->load_binary() of each.

* A ->load_binary() instance looks at the binary (starting with the
first 256 bytes read for us by prepare_binprm(), called at the beginning
of search_binary_handler()). If it doesn't have the right magic values,
->load_binary() just returns -ENOEXEC, so that the next format will be
tried. If it *does* look like something this format is going to deal
with, more sanity checks are done, things are set up, etc. - the details
depend upon the binary format in question. See load_elf_binary() for
some taste of that. Eventually it decides to actually discard the old
memory and switch to the new binary. Up to that point it can return an
error - -ENOEXEC for soft ones ("not mine, after all, try other
formats"), something like -EINVAL/-ENOMEM/-EPERM/-EIO/etc. for hard ones
("fail execve(2) with that error"). _After_ that point we have nothing
to return to; the old binary is not mapped anymore, the userland stack
frame is gone, etc. Any errors past that point are treated as "kill me".

At the point of no return we call begin_new_exec(). That kills all other
threads, unshares the descriptor table in case it had been shared wider
than the thread group, switches the memory context, etc. Once
begin_new_exec() is done, we can safely map whatever we want to map,
handle relocations, etc. Among other things, we modify the userland
register values saved on syscall entry, so that once we return from the
syscall we'll end up with the right register state at the right userland
entry point, etc.

If everything goes fine, ->load_binary() returns 0 into
search_binary_handler() and we are pretty much done - some cleanups on
the way out and off to the loaded binary. Alternatively, we may decide
to mangle the exec context - that's what #! handling does (see
load_script() - basically, it opens the interpreter and massages the
arguments, so that something like

	debian/rules build-arch

turns into

	/usr/bin/make -f debian/rules build-arch

and tells the caller to deal with that; this is what the loop in
search_binary_handler() is about).

There's not a lot of binary formats (5 of those currently - all in
fs/binmt_*.c), but there's nothing to prohibit more of them. If somebody
decides to add the infrastructure for writing those in Rust, the
begin_new_exec() wrapper will need to be documented as "never call that
in the scope of a reserved descriptor". Maybe by marking that wrapper
unsafe and telling the users about the restriction wrt descriptor
reservations, maybe by somehow telling the compiler to watch out for
that - or maybe the constraint will be gone by that time.

In any case, the underlying constraint ("a thread with reserved
descriptors should not try to get a private descriptor table until all
those descriptors are disposed of one way or another") needs to be
documented.
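The search_binary_handler() loop described above can be sketched as a userspace toy model (all `toy_*` names are mine; the real handlers do vastly more work): each handler peeks at the image and returns -ENOEXEC ("not mine, try the next format") or a final result.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

struct toy_binprm { const char *buf; }; /* stands in for linux_binprm */

/* models load_elf_binary(): check the magic, then (not modeled here)
 * proceed past the point of no return */
static int toy_load_elf(struct toy_binprm *b)
{
    if (strncmp(b->buf, "\177ELF", 4) != 0)
        return -ENOEXEC; /* soft error: try other formats */
    return 0;
}

/* models load_script(): recognize #! (the real one rewrites the argv
 * and loops back to try the interpreter's format) */
static int toy_load_script(struct toy_binprm *b)
{
    if (strncmp(b->buf, "#!", 2) != 0)
        return -ENOEXEC;
    return 0;
}

/* models search_binary_handler(): offer the context to each format */
static int toy_search_binary_handler(struct toy_binprm *b)
{
    int (*fmts[])(struct toy_binprm *) = { toy_load_elf, toy_load_script };
    for (unsigned i = 0; i < sizeof(fmts) / sizeof(fmts[0]); i++) {
        int err = fmts[i](b);
        if (err != -ENOEXEC)
            return err; /* handled, or a hard failure */
    }
    return -ENOEXEC;    /* no format recognized it */
}
```

The real loop also re-runs after #! mangling, which the model omits; the point is just the -ENOEXEC-means-keep-trying convention.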
On Sun, Sep 15, 2024 at 11:01:26PM +0100, Al Viro wrote:

> There's not a lot of binary formats (5 of those currently -
> all in fs/binmt_*.c), but there's nothing to prohibit more

binfmt_*.c, sorry.

> of them. If somebody decides to add the infrastructure for
> writing those in Rust, begin_new_exec() wrapper will need
> to be documented as "never call that in scope of reserved
> descriptor". Maybe by marking that wrapper unsafe and
> telling the users about the restriction wrt descriptor
> reservations, maybe by somehow telling the compiler to
> watch out for that - or maybe the constraint will be gone
> by that time.
>
> In any case, the underlying constraint ("a thread with
> reserved descriptors should not try to get a private
> descriptor table until all those descriptors are disposed
> of one way or another") needs to be documented.
On Sun, Sep 15, 2024 at 08:34:43PM +0100, Al Viro wrote:

> FWIW, I toyed with the idea of having reservations kept per-thread;
> it is possible and it simplifies some things, but I hadn't been able to
> find a way to do that without buggering syscall latency for open() et.al.

Hmm... How about the following:

* Add an equivalent of an array of (fd, file) pairs to task_struct; the
representation could be e.g. (a _VERY_ preliminary variant)

	unsigned fd_count;
	int fds[2];
	struct file *fp[2];
	void *spillover;

with 'spillover' being a separately allocated array of pairs to deal
with the moments when we have more than 2 simultaneously reserved
descriptors. Initially NULL, allocated the first time we need more than
2. Always empty outside of a syscall.

* Inline primitives:

	count_reserved_fds()
	reserved_descriptor(index)
	reserved_file(index)

* int reserve_fd(flags) - returns -E... or an index:

	slot = current->fd_count
	if (unlikely(slot == 2) && !current->spillover) {
		allocate spillover
		if failed
			return -ENOMEM
		set current->spillover
	}
	if slot is maximal allowed (2 + how much fits into allocated part?)
		return -E<something>
	fd = get_unused_fd_flags(flags);
	if (unlikely(fd < 0))
		return fd;
	if (likely(slot < 2)) {
		current->fds[slot] = fd;
		current->fp[slot] = NULL;
	} else {
		store (fd, NULL) into element #(slot - 2) of current->spillover
	}
	current->fd_count = slot + 1;

* void install_file(index, file):

	if (likely(slot < 2))
		current->fp[slot] = file;
	else
		store file to element #(slot - 2) of current->spillover

* void __commit_reservations(unsigned count, bool failed):

	// count == current->fd_count
	while (count--) {
		fd = reserved_descriptor(count);
		file = reserved_file(count);
		if (!file)
			put_unused_fd(fd);
		else if (!failed)
			fd_install(fd, file);
		else {
			put_unused_fd(fd);
			fput(file);
		}
	}
	current->fd_count = 0;

* static inline void commit_fd_reservations(bool failed) - called in
syscall glue, right after the syscall returns:

	unsigned slots = current->fd_count;
	if (unlikely(slots))
		__commit_reservations(slots, failed);

Then we can (in addition to the current use of get_unused_fd_flags()
et.al. - that still works) do e.g. things like

	for (i = 0; i < 69; i++) {
		index = reserve_fd(FD_CLOEXEC);
		if (unlikely(index < 0))
			return index;
		file = some_driver_shite(some_shite, i);
		if (IS_ERR(file))
			return PTR_ERR(file);
		install_file(index, file); // consumed file
		ioctl_result.some_array[i] = reserved_descriptor(index);
		....
	}
	...
	if (copy_to_user(arg, &ioctl_result, sizeof(ioctl_result)))
		return -EFAULT;
	...
	return 0;

and have it DTRT on all failures, no matter how many files we have
added, etc. - on syscall return we will either commit all reservations
(on success) or release all reserved descriptors and drop all files we
had planned to put into the descriptor table. Getting that right
manually is doable (drm has some examples), but it's _not_ pleasant. The
win here is in simpler cleanup code. And it can coexist with the current
API just fine.

The PITA is in the need to add the call of commit_fd_reservations() in
the syscall exit glue and have that done on all architectures ;-/

FWIW, I suspect that it won't be slower than the current API, even if
used on hot paths. pipe(2) would be an interesting testcase for that -
converting it is easy, and there's plenty of loads where the latency of
pipe(2) would be visible.

Comments?
diff --git a/rust/kernel/fs/file.rs b/rust/kernel/fs/file.rs
index 3c1f51719804..e03dbe14d62a 100644
--- a/rust/kernel/fs/file.rs
+++ b/rust/kernel/fs/file.rs
@@ -11,7 +11,7 @@
     bindings,
     cred::Credential,
     error::{code::*, Error, Result},
-    types::{ARef, AlwaysRefCounted, Opaque},
+    types::{ARef, AlwaysRefCounted, NotThreadSafe, Opaque},
 };
 use core::ptr;
@@ -368,6 +368,79 @@ fn deref(&self) -> &LocalFile {
     }
 }
 
+/// A file descriptor reservation.
+///
+/// This allows the creation of a file descriptor in two steps: first, we reserve a slot for it,
+/// then we commit or drop the reservation. The first step may fail (e.g., the current process ran
+/// out of available slots), but commit and drop never fail (and are mutually exclusive).
+///
+/// Dropping the reservation happens in the destructor of this type.
+///
+/// # Invariants
+///
+/// The fd stored in this struct must correspond to a reserved file descriptor of the current task.
+pub struct FileDescriptorReservation {
+    fd: u32,
+    /// Prevent values of this type from being moved to a different task.
+    ///
+    /// The `fd_install` and `put_unused_fd` functions assume that the value of `current` is
+    /// unchanged since the call to `get_unused_fd_flags`. By adding this marker to this type, we
+    /// prevent it from being moved across task boundaries, which ensures that `current` does not
+    /// change while this value exists.
+    _not_send: NotThreadSafe,
+}
+
+impl FileDescriptorReservation {
+    /// Creates a new file descriptor reservation.
+    pub fn get_unused_fd_flags(flags: u32) -> Result<Self> {
+        // SAFETY: FFI call, there are no safety requirements on `flags`.
+        let fd: i32 = unsafe { bindings::get_unused_fd_flags(flags) };
+        if fd < 0 {
+            return Err(Error::from_errno(fd));
+        }
+        Ok(Self {
+            fd: fd as u32,
+            _not_send: NotThreadSafe,
+        })
+    }
+
+    /// Returns the file descriptor number that was reserved.
+    pub fn reserved_fd(&self) -> u32 {
+        self.fd
+    }
+
+    /// Commits the reservation.
+    ///
+    /// The previously reserved file descriptor is bound to `file`. This method consumes the
+    /// [`FileDescriptorReservation`], so it will not be usable after this call.
+    pub fn fd_install(self, file: ARef<File>) {
+        // SAFETY: `self.fd` was previously returned by `get_unused_fd_flags`. We have not yet used
+        // the fd, so it is still valid, and `current` still refers to the same task, as this type
+        // cannot be moved across task boundaries.
+        //
+        // Furthermore, the file pointer is guaranteed to own a refcount by its type invariants,
+        // and we take ownership of that refcount by not running the destructor below.
+        // Additionally, the file is known to not have any non-shared `fdget_pos` calls, so even if
+        // this process starts using the file position, this will not result in a data race on the
+        // file position.
+        unsafe { bindings::fd_install(self.fd, file.as_ptr()) };
+
+        // `fd_install` consumes both the file descriptor and the file reference, so we cannot run
+        // the destructors.
+        core::mem::forget(self);
+        core::mem::forget(file);
+    }
+}
+
+impl Drop for FileDescriptorReservation {
+    fn drop(&mut self) {
+        // SAFETY: By the type invariants of this type, `self.fd` was previously returned by
+        // `get_unused_fd_flags`. We have not yet used the fd, so it is still valid, and `current`
+        // still refers to the same task, as this type cannot be moved across task boundaries.
+        unsafe { bindings::put_unused_fd(self.fd) };
+    }
+}
+
 /// Represents the `EBADF` error code.
 ///
 /// Used for methods that can only fail with `EBADF`.