Message ID | 20240926-pocht-sittlich-87108178c093@brauner (mailing list archive) |
---|---|
State | Handled Elsewhere |
Headers | show |
Series | [RFC] rust: add PidNamespace wrapper | expand |
On Thu, Sep 26, 2024 at 6:36 PM Christian Brauner <brauner@kernel.org> wrote: > > Ok, so here's my feeble attempt at getting something going for wrapping > struct pid_namespace as struct pid_namespace indirectly came up in the > file abstraction thread. This looks great! > The lifetime of a pid namespace is intimately tied to the lifetime of > task. The pid namespace of a task doesn't ever change. A > unshare(CLONE_NEWPID) or setns(fd_pidns/pidfd, CLONE_NEWPID) will not > change the task's pid namespace only the pid namespace of children > spawned by the task. This invariant is important to keep in mind. > > After a task is reaped it will be detached from its associated struct > pids via __unhash_process(). This will also set task->thread_pid to > NULL. > > In order to retrieve the pid namespace of a task task_active_pid_ns() > can be used. The helper works on both current and non-current taks but > the requirements are slightly different in both cases and it depends on > where the helper is called. > > The rules for this are simple but difficult for me to translate into > Rust. If task_active_pid_ns() is called on current then no RCU locking > is needed as current is obviously alive. On the other hand calling > task_active_pid_ns() after release_task() would work but it would mean > task_active_pid_ns() will return NULL. > > Calling task_active_pid_ns() on a non-current task, while valid, must be > under RCU or other protection mechanism as the task might be > release_task() and thus in __unhash_process(). Just to confirm, calling task_active_pid_ns() on a non-current task requires the rcu lock even if you own a refcont on the task? Alice
On Fri, Sep 27, 2024 at 02:04:13PM GMT, Alice Ryhl wrote: > On Thu, Sep 26, 2024 at 6:36 PM Christian Brauner <brauner@kernel.org> wrote: > > > > Ok, so here's my feeble attempt at getting something going for wrapping > > struct pid_namespace as struct pid_namespace indirectly came up in the > > file abstraction thread. > > This looks great! Thanks! > > > The lifetime of a pid namespace is intimately tied to the lifetime of > > task. The pid namespace of a task doesn't ever change. A > > unshare(CLONE_NEWPID) or setns(fd_pidns/pidfd, CLONE_NEWPID) will not > > change the task's pid namespace only the pid namespace of children > > spawned by the task. This invariant is important to keep in mind. > > > > After a task is reaped it will be detached from its associated struct > > pids via __unhash_process(). This will also set task->thread_pid to > > NULL. > > > > In order to retrieve the pid namespace of a task task_active_pid_ns() > > can be used. The helper works on both current and non-current taks but > > the requirements are slightly different in both cases and it depends on > > where the helper is called. > > > > The rules for this are simple but difficult for me to translate into > > Rust. If task_active_pid_ns() is called on current then no RCU locking > > is needed as current is obviously alive. On the other hand calling > > task_active_pid_ns() after release_task() would work but it would mean > > task_active_pid_ns() will return NULL. > > > > Calling task_active_pid_ns() on a non-current task, while valid, must be > > under RCU or other protection mechanism as the task might be > > release_task() and thus in __unhash_process(). > > Just to confirm, calling task_active_pid_ns() on a non-current task > requires the rcu lock even if you own a refcont on the task? Interesting question. Afaik, yes. task_active_pid_ns() goes via task->thread_pid which is a shorthand for task->pid_links[PIDTYPE_PID]. This will be NULLed when the task exits and is dead (so usually when someone has waited on it - ignoring ptrace for sanity reasons and autoreaping the latter amounts to the same thing just in-kernel): T1 T2 T3 exit(0); wait(T1) -> wait_task_zombie() -> release_task() -> __exit_signals() -> __unash_process() // sets task->thread_pid == NULL task_active_pid_ns(T1) // task->pid_links[PIDTYPE_PID] == NULL So having a reference to struct task_struct doesn't prevent task->thread_pid becoming NULL. And you touch upon a very interesting point. The lifetime of struct pid_namespace is actually tied to struct pid much tighter than it is to struct task_struct. So when a task is released (transitions from zombie to dead in the common case) the following happens: release_task() -> __exit_signals() -> thread_pid = get_pid(task->thread_pid) -> __unhash_process() -> detach_pid(PIDTYPE_PID) -> __change_pid() { task->thread_pid = NULL; task->pid_links[PIDTYPE_PID] = NULL; free_pid(thread_pid) } put_pid(thread_pid) And the free_pid() in __change_pid() does a delayed_put_pid() via call_rcu(). So afaiu, taking the rcu_read_lock() synchronizes against that delayed_put_pid() in __change_pid() so the call_rcu() will wait until everyone who does rcu_read_lock() task_active_pid_ns(task) rcu_read_unlock() and sees task->thread_pid non-NULL, is done. This way no additional reference count on struct task_struct or struct pid is needed before plucking the pid namespace from there. Does that make sense or have I gotten it all wrong?
On Fri, Sep 27, 2024 at 4:21 PM Christian Brauner <brauner@kernel.org> wrote: > > On Fri, Sep 27, 2024 at 02:04:13PM GMT, Alice Ryhl wrote: > > On Thu, Sep 26, 2024 at 6:36 PM Christian Brauner <brauner@kernel.org> wrote: > > > > > > Ok, so here's my feeble attempt at getting something going for wrapping > > > struct pid_namespace as struct pid_namespace indirectly came up in the > > > file abstraction thread. > > > > This looks great! > > Thanks! > > > > > > The lifetime of a pid namespace is intimately tied to the lifetime of > > > task. The pid namespace of a task doesn't ever change. A > > > unshare(CLONE_NEWPID) or setns(fd_pidns/pidfd, CLONE_NEWPID) will not > > > change the task's pid namespace only the pid namespace of children > > > spawned by the task. This invariant is important to keep in mind. > > > > > > After a task is reaped it will be detached from its associated struct > > > pids via __unhash_process(). This will also set task->thread_pid to > > > NULL. > > > > > > In order to retrieve the pid namespace of a task task_active_pid_ns() > > > can be used. The helper works on both current and non-current taks but > > > the requirements are slightly different in both cases and it depends on > > > where the helper is called. > > > > > > The rules for this are simple but difficult for me to translate into > > > Rust. If task_active_pid_ns() is called on current then no RCU locking > > > is needed as current is obviously alive. On the other hand calling > > > task_active_pid_ns() after release_task() would work but it would mean > > > task_active_pid_ns() will return NULL. > > > > > > Calling task_active_pid_ns() on a non-current task, while valid, must be > > > under RCU or other protection mechanism as the task might be > > > release_task() and thus in __unhash_process(). > > > > Just to confirm, calling task_active_pid_ns() on a non-current task > > requires the rcu lock even if you own a refcont on the task? > > Interesting question. Afaik, yes. task_active_pid_ns() goes via > task->thread_pid which is a shorthand for task->pid_links[PIDTYPE_PID]. > > This will be NULLed when the task exits and is dead (so usually when > someone has waited on it - ignoring ptrace for sanity reasons and > autoreaping the latter amounts to the same thing just in-kernel): > > T1 T2 T3 > exit(0); > wait(T1) > -> wait_task_zombie() > -> release_task() > -> __exit_signals() > -> __unash_process() > // sets task->thread_pid == NULL task_active_pid_ns(T1) > // task->pid_links[PIDTYPE_PID] == NULL > > So having a reference to struct task_struct doesn't prevent > task->thread_pid becoming NULL. > > And you touch upon a very interesting point. The lifetime of struct > pid_namespace is actually tied to struct pid much tighter than it is to > struct task_struct. So when a task is released (transitions from zombie > to dead in the common case) the following happens: > > release_task() > -> __exit_signals() > -> thread_pid = get_pid(task->thread_pid) > -> __unhash_process() > -> detach_pid(PIDTYPE_PID) > -> __change_pid() > { > task->thread_pid = NULL; > task->pid_links[PIDTYPE_PID] = NULL; > free_pid(thread_pid) > } > put_pid(thread_pid) > > And the free_pid() in __change_pid() does a delayed_put_pid() via > call_rcu(). > > So afaiu, taking the rcu_read_lock() synchronizes against that > delayed_put_pid() in __change_pid() so the call_rcu() will wait until > everyone who does > > rcu_read_lock() > task_active_pid_ns(task) > rcu_read_unlock() > > and sees task->thread_pid non-NULL, is done. This way no additional > reference count on struct task_struct or struct pid is needed before > plucking the pid namespace from there. Does that make sense or have I > gotten it all wrong? Okay. I agree that the code you have is the best we can do; at least until we get an rcu guard in Rust. The macro doesn't quite work. You need to do something to constrain the lifetime used by `PidNamespace::from_ptr`. Right now, there is no constraint on the lifetime, so the caller can just pick the lifetime 'static which is the lifetime that never ends. We want to constrain it to a lifetime that ends before the task dies. The easiest is to create a local variable and use the lifetime of that local variable. That way, the reference can never escape the current function, and hence, can't escape the current task. More generally, I'm sure there are lots of fields in current where we can access them without rcu only because we know the current task isn't going to die on us. I don't think we should have a macro for every single one. I think we can put together a single macro for getting a lifetime that ends before returning to userspace, and then reuse that lifetime for both `current` and `current_pid_ns`, and possibly also the `DeferredFd` patch. Alice
diff --git a/rust/helpers/helpers.c b/rust/helpers/helpers.c index 62022b18caf5..d553ad9361ce 100644 --- a/rust/helpers/helpers.c +++ b/rust/helpers/helpers.c @@ -17,6 +17,7 @@ #include "kunit.c" #include "mutex.c" #include "page.c" +#include "pid_namespace.c" #include "rbtree.c" #include "refcount.c" #include "security.c" diff --git a/rust/helpers/pid_namespace.c b/rust/helpers/pid_namespace.c new file mode 100644 index 000000000000..f41482bdec9a --- /dev/null +++ b/rust/helpers/pid_namespace.c @@ -0,0 +1,26 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <linux/pid_namespace.h> +#include <linux/cleanup.h> + +struct pid_namespace *rust_helper_get_pid_ns(struct pid_namespace *ns) +{ + return get_pid_ns(ns); +} + +void rust_helper_put_pid_ns(struct pid_namespace *ns) +{ + put_pid_ns(ns); +} + +/* Get a reference on a task's pid namespace. */ +struct pid_namespace *rust_helper_task_get_pid_ns(struct task_struct *task) +{ + struct pid_namespace *pid_ns; + + guard(rcu)(); + pid_ns = task_active_pid_ns(task); + if (pid_ns) + get_pid_ns(pid_ns); + return pid_ns; +} diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs index ff7d88022c57..0e78ec9d06e0 100644 --- a/rust/kernel/lib.rs +++ b/rust/kernel/lib.rs @@ -44,6 +44,7 @@ #[cfg(CONFIG_NET)] pub mod net; pub mod page; +pub mod pid_namespace; pub mod prelude; pub mod print; pub mod sizes; diff --git a/rust/kernel/pid_namespace.rs b/rust/kernel/pid_namespace.rs new file mode 100644 index 000000000000..cd12c21a68cb --- /dev/null +++ b/rust/kernel/pid_namespace.rs @@ -0,0 +1,68 @@ +// SPDX-License-Identifier: GPL-2.0 + +//! Pid namespaces. +//! +//! C header: [`include/linux/pid_namespace.h`](srctree/include/linux/pid_namespace.h) and +//! [`include/linux/pid.h`](srctree/include/linux/pid.h) + +use crate::{ + bindings, + types::{AlwaysRefCounted, Opaque}, +}; +use core::{ + ptr, +}; + +/// Wraps the kernel's `struct pid_namespace`. Thread safe. +/// +/// This structure represents the Rust abstraction for a C `struct pid_namespace`. This +/// implementation abstracts the usage of an already existing C `struct pid_namespace` within Rust +/// code that we get passed from the C side. +#[repr(transparent)] +pub struct PidNamespace { + inner: Opaque<bindings::pid_namespace>, +} + +impl PidNamespace { + /// Returns a raw pointer to the inner C struct. + #[inline] + pub fn as_ptr(&self) -> *mut bindings::pid_namespace { + self.inner.get() + } + + /// Creates a reference to a [`PidNamespace`] from a valid pointer. + /// + /// # Safety + /// + /// The caller must ensure that `ptr` is valid and remains valid for the lifetime of the + /// returned [`PidNamespace`] reference. + pub unsafe fn from_ptr<'a>(ptr: *const bindings::pid_namespace) -> &'a Self { + // SAFETY: The safety requirements guarantee the validity of the dereference, while the + // `PidNamespace` type being transparent makes the cast ok. + unsafe { &*ptr.cast() } + } +} + +// SAFETY: Instances of `PidNamespace` are always reference-counted. +unsafe impl AlwaysRefCounted for PidNamespace { + #[inline] + fn inc_ref(&self) { + // SAFETY: The existence of a shared reference means that the refcount is nonzero. + unsafe { bindings::get_pid_ns(self.as_ptr()) }; + } + + #[inline] + unsafe fn dec_ref(obj: ptr::NonNull<PidNamespace>) { + // SAFETY: The safety requirements guarantee that the refcount is non-zero. + unsafe { bindings::put_pid_ns(obj.cast().as_ptr()) } + } +} + +// SAFETY: +// - `PidNamespace::dec_ref` can be called from any thread. +// - It is okay to send ownership of `PidNamespace` across thread boundaries. +unsafe impl Send for PidNamespace {} + +// SAFETY: It's OK to access `PidNamespace` through shared references from other threads because +// we're either accessing properties that don't change or that are properly synchronised by C code. +unsafe impl Sync for PidNamespace {} diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs index 1a36a9f19368..89a431dfac5d 100644 --- a/rust/kernel/task.rs +++ b/rust/kernel/task.rs @@ -6,7 +6,8 @@ use crate::{ bindings, - types::{NotThreadSafe, Opaque}, + pid_namespace::PidNamespace, + types::{ARef, NotThreadSafe, Opaque}, }; use core::{ cmp::{Eq, PartialEq}, @@ -36,6 +37,37 @@ macro_rules! current { }; } +/// Returns the currently running task's pid namespace. +/// +/// The lifetime of `PidNamespace` is intimately tied to the lifetime of `Task`. The pid namespace +/// of a `Task` doesn't ever change. A `unshare(CLONE_NEWPID)` or `setns(fd_pidns/pidfd, +/// CLONE_NEWPID)` will not change the task's pid namespace. This invariant is important to keep in +/// mind. +/// +/// After a task is reaped it will be detached from its associated `struct pid`s via +/// __unhash_process(). This will specifically set `task->thread_pid` to `NULL`. +/// +/// In order to retrieve the pid namespace of a task `task_active_pid_ns()` can be used. The rules +/// for this are simple but difficult for me to translate into Rust. If `task_active_pid_ns()` is +/// called from `current` then no RCU locking is needed as current is obviously alive. However, +/// calling `task_active_pid_ns()` on a non-`current` task, while valid, must be under RCU or other +/// protection as the task might be in __unhash_process(). +/// +/// We could add an always refcounted `PidNamespace` impl to `Task` but that would be pointless +/// refcount bumping for the usual case where the caller retrieves the pid namespace of `current`. +/// +/// So I added a macro that gets the active pid namespace of `current` and a `task_get_pid_ns()` +/// impl that returns an `ARef<PidNamespace>` or `None` if the pid namespace is `NULL`. Returning +/// an `Option<ARef<PidNamespace>>` forces the caller to make a conscious decision what instead of +/// just silently translating a `NULL` to `current`'s pid namespace. +#[macro_export] +macro_rules! current_pid_ns { + () => { + let ptr = current() + unsafe { PidNamespace::from_ptr(bindings::task_active_pid_ns(ptr)) } + }; +} + /// Wraps the kernel's `struct task_struct`. /// /// # Invariants @@ -182,11 +214,23 @@ pub fn signal_pending(&self) -> bool { unsafe { bindings::signal_pending(self.0.get()) != 0 } } - /// Returns the given task's pid in the current pid namespace. - pub fn pid_in_current_ns(&self) -> Pid { - // SAFETY: We know that `self.0.get()` is valid by the type invariant, and passing a null - // pointer as the namespace is correct for using the current namespace. - unsafe { bindings::task_tgid_nr_ns(self.0.get(), ptr::null_mut()) } + /// Returns task's pid namespace with elevated reference count + pub fn task_get_pid_ns(&self) -> Option<ARef<PidNamespace>> { + let ptr = unsafe { bindings::task_get_pid_ns(self.0.get()) }; + if ptr.is_null() { + None + } else { + // SAFETY: `ptr` is valid by the safety requirements of this function. And we own a + // reference count via `task_get_pid_ns()`. + // CAST: `Self` is a `repr(transparent)` wrapper around `bindings::pid_namespace`. + Some(unsafe { ARef::from_raw(ptr::NonNull::new_unchecked(ptr.cast::<PidNamespace>())) }) + } + } + + /// Returns the given task's pid in the provided pid namespace. + pub fn task_tgid_nr_ns(&self, pidns: &PidNamespace) -> Pid { + // SAFETY: We know that `self.0.get()` is valid by the type invariant. + unsafe { bindings::task_tgid_nr_ns(self.0.get(), pidns.as_ptr()) } } /// Wakes up the task.
Ok, so here's my feeble attempt at getting something going for wrapping struct pid_namespace as struct pid_namespace indirectly came up in the file abstraction thread. The lifetime of a pid namespace is intimately tied to the lifetime of task. The pid namespace of a task doesn't ever change. A unshare(CLONE_NEWPID) or setns(fd_pidns/pidfd, CLONE_NEWPID) will not change the task's pid namespace only the pid namespace of children spawned by the task. This invariant is important to keep in mind. After a task is reaped it will be detached from its associated struct pids via __unhash_process(). This will also set task->thread_pid to NULL. In order to retrieve the pid namespace of a task task_active_pid_ns() can be used. The helper works on both current and non-current taks but the requirements are slightly different in both cases and it depends on where the helper is called. The rules for this are simple but difficult for me to translate into Rust. If task_active_pid_ns() is called on current then no RCU locking is needed as current is obviously alive. On the other hand calling task_active_pid_ns() after release_task() would work but it would mean task_active_pid_ns() will return NULL. Calling task_active_pid_ns() on a non-current task, while valid, must be under RCU or other protection mechanism as the task might be release_task() and thus in __unhash_process(). Handling that in a single impl seemed cumbersome but that may just be my lack of kernel Rust experience. It would of course be possible to add an always refcounted PidNamespace impl to Task but that would be pointless refcount bumping for the usual case where the caller retrieves the pid namespace of current. Instead I added a macro that gets the active pid namespace of current and a task_get_pid_ns() impl that returns an Option<ARef<PidNamespace>>. Returning an Option<ARef<PidNamespace>> forces the caller to make a conscious decision instead of just silently translating a NULL to current pid namespace when passed to e.g., task_tgid_nr_ns(). Signed-off-by: Christian Brauner <brauner@kernel.org> --- rust/helpers/helpers.c | 1 + rust/helpers/pid_namespace.c | 26 ++++++++++++++ rust/kernel/lib.rs | 1 + rust/kernel/pid_namespace.rs | 68 ++++++++++++++++++++++++++++++++++++ rust/kernel/task.rs | 56 +++++++++++++++++++++++++---- 5 files changed, 146 insertions(+), 6 deletions(-) create mode 100644 rust/helpers/pid_namespace.c create mode 100644 rust/kernel/pid_namespace.rs