diff mbox series

[RFC] rust: add PidNamespace wrapper

Message ID 20240926-pocht-sittlich-87108178c093@brauner (mailing list archive)
State New
Headers show
Series [RFC] rust: add PidNamespace wrapper | expand

Commit Message

Christian Brauner Sept. 26, 2024, 4:35 p.m. UTC
Ok, so here's my feeble attempt at getting something going for wrapping
struct pid_namespace as struct pid_namespace indirectly came up in the
file abstraction thread.

The lifetime of a pid namespace is intimately tied to the lifetime of
task. The pid namespace of a task doesn't ever change. A
unshare(CLONE_NEWPID) or setns(fd_pidns/pidfd, CLONE_NEWPID) will not
change the task's pid namespace only the pid namespace of children
spawned by the task. This invariant is important to keep in mind.

After a task is reaped it will be detached from its associated struct
pids via __unhash_process(). This will also set task->thread_pid to
NULL.

In order to retrieve the pid namespace of a task task_active_pid_ns()
can be used. The helper works on both current and non-current taks but
the requirements are slightly different in both cases and it depends on
where the helper is called.

The rules for this are simple but difficult for me to translate into
Rust. If task_active_pid_ns() is called on current then no RCU locking
is needed as current is obviously alive. On the other hand calling
task_active_pid_ns() after release_task() would work but it would mean
task_active_pid_ns() will return NULL.

Calling task_active_pid_ns() on a non-current task, while valid, must be
under RCU or other protection mechanism as the task might be
release_task() and thus in __unhash_process().

Handling that in a single impl seemed cumbersome but that may just be
my lack of kernel Rust experience.

It would of course be possible to add an always refcounted PidNamespace
impl to Task but that would be pointless refcount bumping for the usual
case where the caller retrieves the pid namespace of current.

Instead I added a macro that gets the active pid namespace of current
and a task_get_pid_ns() impl that returns an Option<ARef<PidNamespace>>.
Returning an Option<ARef<PidNamespace>> forces the caller to make a
conscious decision instead of just silently translating a NULL to
current pid namespace when passed to e.g., task_tgid_nr_ns().

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 rust/helpers/helpers.c       |  1 +
 rust/helpers/pid_namespace.c | 26 ++++++++++++++
 rust/kernel/lib.rs           |  1 +
 rust/kernel/pid_namespace.rs | 68 ++++++++++++++++++++++++++++++++++++
 rust/kernel/task.rs          | 56 +++++++++++++++++++++++++----
 5 files changed, 146 insertions(+), 6 deletions(-)
 create mode 100644 rust/helpers/pid_namespace.c
 create mode 100644 rust/kernel/pid_namespace.rs

Comments

Alice Ryhl Sept. 27, 2024, 12:04 p.m. UTC | #1
On Thu, Sep 26, 2024 at 6:36 PM Christian Brauner <brauner@kernel.org> wrote:
>
> Ok, so here's my feeble attempt at getting something going for wrapping
> struct pid_namespace as struct pid_namespace indirectly came up in the
> file abstraction thread.

This looks great!

> The lifetime of a pid namespace is intimately tied to the lifetime of
> task. The pid namespace of a task doesn't ever change. A
> unshare(CLONE_NEWPID) or setns(fd_pidns/pidfd, CLONE_NEWPID) will not
> change the task's pid namespace only the pid namespace of children
> spawned by the task. This invariant is important to keep in mind.
>
> After a task is reaped it will be detached from its associated struct
> pids via __unhash_process(). This will also set task->thread_pid to
> NULL.
>
> In order to retrieve the pid namespace of a task task_active_pid_ns()
> can be used. The helper works on both current and non-current taks but
> the requirements are slightly different in both cases and it depends on
> where the helper is called.
>
> The rules for this are simple but difficult for me to translate into
> Rust. If task_active_pid_ns() is called on current then no RCU locking
> is needed as current is obviously alive. On the other hand calling
> task_active_pid_ns() after release_task() would work but it would mean
> task_active_pid_ns() will return NULL.
>
> Calling task_active_pid_ns() on a non-current task, while valid, must be
> under RCU or other protection mechanism as the task might be
> release_task() and thus in __unhash_process().

Just to confirm, calling task_active_pid_ns() on a non-current task
requires the rcu lock even if you own a refcont on the task?

Alice
Christian Brauner Sept. 27, 2024, 2:21 p.m. UTC | #2
On Fri, Sep 27, 2024 at 02:04:13PM GMT, Alice Ryhl wrote:
> On Thu, Sep 26, 2024 at 6:36 PM Christian Brauner <brauner@kernel.org> wrote:
> >
> > Ok, so here's my feeble attempt at getting something going for wrapping
> > struct pid_namespace as struct pid_namespace indirectly came up in the
> > file abstraction thread.
> 
> This looks great!

Thanks!

> 
> > The lifetime of a pid namespace is intimately tied to the lifetime of
> > task. The pid namespace of a task doesn't ever change. A
> > unshare(CLONE_NEWPID) or setns(fd_pidns/pidfd, CLONE_NEWPID) will not
> > change the task's pid namespace only the pid namespace of children
> > spawned by the task. This invariant is important to keep in mind.
> >
> > After a task is reaped it will be detached from its associated struct
> > pids via __unhash_process(). This will also set task->thread_pid to
> > NULL.
> >
> > In order to retrieve the pid namespace of a task task_active_pid_ns()
> > can be used. The helper works on both current and non-current taks but
> > the requirements are slightly different in both cases and it depends on
> > where the helper is called.
> >
> > The rules for this are simple but difficult for me to translate into
> > Rust. If task_active_pid_ns() is called on current then no RCU locking
> > is needed as current is obviously alive. On the other hand calling
> > task_active_pid_ns() after release_task() would work but it would mean
> > task_active_pid_ns() will return NULL.
> >
> > Calling task_active_pid_ns() on a non-current task, while valid, must be
> > under RCU or other protection mechanism as the task might be
> > release_task() and thus in __unhash_process().
> 
> Just to confirm, calling task_active_pid_ns() on a non-current task
> requires the rcu lock even if you own a refcont on the task?

Interesting question. Afaik, yes. task_active_pid_ns() goes via
task->thread_pid which is a shorthand for task->pid_links[PIDTYPE_PID].

This will be NULLed when the task exits and is dead (so usually when
someone has waited on it - ignoring ptrace for sanity reasons and
autoreaping the latter amounts to the same thing just in-kernel):

T1                      T2                                                   T3
exit(0);
                        wait(T1)
                        -> wait_task_zombie()
                           -> release_task()
                              -> __exit_signals()
                                 -> __unash_process()
                                    // sets task->thread_pid == NULL         task_active_pid_ns(T1)
                                    // task->pid_links[PIDTYPE_PID] == NULL

So having a reference to struct task_struct doesn't prevent
task->thread_pid becoming NULL.

And you touch upon a very interesting point. The lifetime of struct
pid_namespace is actually tied to struct pid much tighter than it is to
struct task_struct. So when a task is released (transitions from zombie
to dead in the common case) the following happens:

release_task()
-> __exit_signals()
   -> thread_pid = get_pid(task->thread_pid)
      -> __unhash_process()
         -> detach_pid(PIDTYPE_PID)
            -> __change_pid()
               {
                       task->thread_pid = NULL;
                       task->pid_links[PIDTYPE_PID] = NULL;
                       free_pid(thread_pid)
               }
         put_pid(thread_pid)

And the free_pid() in __change_pid() does a delayed_put_pid() via
call_rcu().

So afaiu, taking the rcu_read_lock() synchronizes against that
delayed_put_pid() in __change_pid() so the call_rcu() will wait until
everyone who does

rcu_read_lock()
task_active_pid_ns(task)
rcu_read_unlock()

and sees task->thread_pid non-NULL, is done. This way no additional
reference count on struct task_struct or struct pid is needed before
plucking the pid namespace from there. Does that make sense or have I
gotten it all wrong?
Alice Ryhl Sept. 27, 2024, 2:58 p.m. UTC | #3
On Fri, Sep 27, 2024 at 4:21 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, Sep 27, 2024 at 02:04:13PM GMT, Alice Ryhl wrote:
> > On Thu, Sep 26, 2024 at 6:36 PM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > Ok, so here's my feeble attempt at getting something going for wrapping
> > > struct pid_namespace as struct pid_namespace indirectly came up in the
> > > file abstraction thread.
> >
> > This looks great!
>
> Thanks!
>
> >
> > > The lifetime of a pid namespace is intimately tied to the lifetime of
> > > task. The pid namespace of a task doesn't ever change. A
> > > unshare(CLONE_NEWPID) or setns(fd_pidns/pidfd, CLONE_NEWPID) will not
> > > change the task's pid namespace only the pid namespace of children
> > > spawned by the task. This invariant is important to keep in mind.
> > >
> > > After a task is reaped it will be detached from its associated struct
> > > pids via __unhash_process(). This will also set task->thread_pid to
> > > NULL.
> > >
> > > In order to retrieve the pid namespace of a task task_active_pid_ns()
> > > can be used. The helper works on both current and non-current taks but
> > > the requirements are slightly different in both cases and it depends on
> > > where the helper is called.
> > >
> > > The rules for this are simple but difficult for me to translate into
> > > Rust. If task_active_pid_ns() is called on current then no RCU locking
> > > is needed as current is obviously alive. On the other hand calling
> > > task_active_pid_ns() after release_task() would work but it would mean
> > > task_active_pid_ns() will return NULL.
> > >
> > > Calling task_active_pid_ns() on a non-current task, while valid, must be
> > > under RCU or other protection mechanism as the task might be
> > > release_task() and thus in __unhash_process().
> >
> > Just to confirm, calling task_active_pid_ns() on a non-current task
> > requires the rcu lock even if you own a refcont on the task?
>
> Interesting question. Afaik, yes. task_active_pid_ns() goes via
> task->thread_pid which is a shorthand for task->pid_links[PIDTYPE_PID].
>
> This will be NULLed when the task exits and is dead (so usually when
> someone has waited on it - ignoring ptrace for sanity reasons and
> autoreaping the latter amounts to the same thing just in-kernel):
>
> T1                      T2                                                   T3
> exit(0);
>                         wait(T1)
>                         -> wait_task_zombie()
>                            -> release_task()
>                               -> __exit_signals()
>                                  -> __unash_process()
>                                     // sets task->thread_pid == NULL         task_active_pid_ns(T1)
>                                     // task->pid_links[PIDTYPE_PID] == NULL
>
> So having a reference to struct task_struct doesn't prevent
> task->thread_pid becoming NULL.
>
> And you touch upon a very interesting point. The lifetime of struct
> pid_namespace is actually tied to struct pid much tighter than it is to
> struct task_struct. So when a task is released (transitions from zombie
> to dead in the common case) the following happens:
>
> release_task()
> -> __exit_signals()
>    -> thread_pid = get_pid(task->thread_pid)
>       -> __unhash_process()
>          -> detach_pid(PIDTYPE_PID)
>             -> __change_pid()
>                {
>                        task->thread_pid = NULL;
>                        task->pid_links[PIDTYPE_PID] = NULL;
>                        free_pid(thread_pid)
>                }
>          put_pid(thread_pid)
>
> And the free_pid() in __change_pid() does a delayed_put_pid() via
> call_rcu().
>
> So afaiu, taking the rcu_read_lock() synchronizes against that
> delayed_put_pid() in __change_pid() so the call_rcu() will wait until
> everyone who does
>
> rcu_read_lock()
> task_active_pid_ns(task)
> rcu_read_unlock()
>
> and sees task->thread_pid non-NULL, is done. This way no additional
> reference count on struct task_struct or struct pid is needed before
> plucking the pid namespace from there. Does that make sense or have I
> gotten it all wrong?

Okay. I agree that the code you have is the best we can do; at least
until we get an rcu guard in Rust.

The macro doesn't quite work. You need to do something to constrain
the lifetime used by `PidNamespace::from_ptr`. Right now, there is no
constraint on the lifetime, so the caller can just pick the lifetime
'static which is the lifetime that never ends. We want to constrain it
to a lifetime that ends before the task dies. The easiest is to create
a local variable and use the lifetime of that local variable. That
way, the reference can never escape the current function, and hence,
can't escape the current task.

More generally, I'm sure there are lots of fields in current where we
can access them without rcu only because we know the current task
isn't going to die on us. I don't think we should have a macro for
every single one. I think we can put together a single macro for
getting a lifetime that ends before returning to userspace, and then
reuse that lifetime for both `current` and `current_pid_ns`, and
possibly also the `DeferredFd` patch.

Alice
diff mbox series

Patch

diff --git a/rust/helpers/helpers.c b/rust/helpers/helpers.c
index 62022b18caf5..d553ad9361ce 100644
--- a/rust/helpers/helpers.c
+++ b/rust/helpers/helpers.c
@@ -17,6 +17,7 @@ 
 #include "kunit.c"
 #include "mutex.c"
 #include "page.c"
+#include "pid_namespace.c"
 #include "rbtree.c"
 #include "refcount.c"
 #include "security.c"
diff --git a/rust/helpers/pid_namespace.c b/rust/helpers/pid_namespace.c
new file mode 100644
index 000000000000..f41482bdec9a
--- /dev/null
+++ b/rust/helpers/pid_namespace.c
@@ -0,0 +1,26 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/pid_namespace.h>
+#include <linux/cleanup.h>
+
+struct pid_namespace *rust_helper_get_pid_ns(struct pid_namespace *ns)
+{
+	return get_pid_ns(ns);
+}
+
+void rust_helper_put_pid_ns(struct pid_namespace *ns)
+{
+	put_pid_ns(ns);
+}
+
+/* Get a reference on a task's pid namespace. */
+struct pid_namespace *rust_helper_task_get_pid_ns(struct task_struct *task)
+{
+	struct pid_namespace *pid_ns;
+
+	guard(rcu)();
+	pid_ns = task_active_pid_ns(task);
+	if (pid_ns)
+		get_pid_ns(pid_ns);
+	return pid_ns;
+}
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index ff7d88022c57..0e78ec9d06e0 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -44,6 +44,7 @@ 
 #[cfg(CONFIG_NET)]
 pub mod net;
 pub mod page;
+pub mod pid_namespace;
 pub mod prelude;
 pub mod print;
 pub mod sizes;
diff --git a/rust/kernel/pid_namespace.rs b/rust/kernel/pid_namespace.rs
new file mode 100644
index 000000000000..cd12c21a68cb
--- /dev/null
+++ b/rust/kernel/pid_namespace.rs
@@ -0,0 +1,68 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+//! Pid namespaces.
+//!
+//! C header: [`include/linux/pid_namespace.h`](srctree/include/linux/pid_namespace.h) and
+//! [`include/linux/pid.h`](srctree/include/linux/pid.h)
+
+use crate::{
+    bindings,
+    types::{AlwaysRefCounted, Opaque},
+};
+use core::{
+    ptr,
+};
+
+/// Wraps the kernel's `struct pid_namespace`. Thread safe.
+///
+/// This structure represents the Rust abstraction for a C `struct pid_namespace`. This
+/// implementation abstracts the usage of an already existing C `struct pid_namespace` within Rust
+/// code that we get passed from the C side.
+#[repr(transparent)]
+pub struct PidNamespace {
+    inner: Opaque<bindings::pid_namespace>,
+}
+
+impl PidNamespace {
+    /// Returns a raw pointer to the inner C struct.
+    #[inline]
+    pub fn as_ptr(&self) -> *mut bindings::pid_namespace {
+        self.inner.get()
+    }
+
+    /// Creates a reference to a [`PidNamespace`] from a valid pointer.
+    ///
+    /// # Safety
+    ///
+    /// The caller must ensure that `ptr` is valid and remains valid for the lifetime of the
+    /// returned [`PidNamespace`] reference.
+    pub unsafe fn from_ptr<'a>(ptr: *const bindings::pid_namespace) -> &'a Self {
+        // SAFETY: The safety requirements guarantee the validity of the dereference, while the
+        // `PidNamespace` type being transparent makes the cast ok.
+        unsafe { &*ptr.cast() }
+    }
+}
+
+// SAFETY: Instances of `PidNamespace` are always reference-counted.
+unsafe impl AlwaysRefCounted for PidNamespace {
+    #[inline]
+    fn inc_ref(&self) {
+        // SAFETY: The existence of a shared reference means that the refcount is nonzero.
+        unsafe { bindings::get_pid_ns(self.as_ptr()) };
+    }
+
+    #[inline]
+    unsafe fn dec_ref(obj: ptr::NonNull<PidNamespace>) {
+        // SAFETY: The safety requirements guarantee that the refcount is non-zero.
+        unsafe { bindings::put_pid_ns(obj.cast().as_ptr()) }
+    }
+}
+
+// SAFETY:
+// - `PidNamespace::dec_ref` can be called from any thread.
+// - It is okay to send ownership of `PidNamespace` across thread boundaries.
+unsafe impl Send for PidNamespace {}
+
+// SAFETY: It's OK to access `PidNamespace` through shared references from other threads because
+// we're either accessing properties that don't change or that are properly synchronised by C code.
+unsafe impl Sync for PidNamespace {}
diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
index 1a36a9f19368..89a431dfac5d 100644
--- a/rust/kernel/task.rs
+++ b/rust/kernel/task.rs
@@ -6,7 +6,8 @@ 
 
 use crate::{
     bindings,
-    types::{NotThreadSafe, Opaque},
+    pid_namespace::PidNamespace,
+    types::{ARef, NotThreadSafe, Opaque},
 };
 use core::{
     cmp::{Eq, PartialEq},
@@ -36,6 +37,37 @@  macro_rules! current {
     };
 }
 
+/// Returns the currently running task's pid namespace.
+///
+/// The lifetime of `PidNamespace` is intimately tied to the lifetime of `Task`. The pid namespace
+/// of a `Task` doesn't ever change. A `unshare(CLONE_NEWPID)` or `setns(fd_pidns/pidfd,
+/// CLONE_NEWPID)` will not change the task's pid namespace. This invariant is important to keep in
+/// mind.
+///
+/// After a task is reaped it will be detached from its associated `struct pid`s via
+/// __unhash_process(). This will specifically set `task->thread_pid` to `NULL`.
+///
+/// In order to retrieve the pid namespace of a task `task_active_pid_ns()` can be used. The rules
+/// for this are simple but difficult for me to translate into Rust. If `task_active_pid_ns()` is
+/// called from `current` then no RCU locking is needed as current is obviously alive. However,
+/// calling `task_active_pid_ns()` on a non-`current` task, while valid, must be under RCU or other
+/// protection as the task might be in __unhash_process().
+///
+/// We could add an always refcounted `PidNamespace` impl to `Task` but that would be pointless
+/// refcount bumping for the usual case where the caller retrieves the pid namespace of `current`.
+///
+/// So I added a macro that gets the active pid namespace of `current` and a `task_get_pid_ns()`
+/// impl that returns an `ARef<PidNamespace>` or `None` if the pid namespace is `NULL`. Returning
+/// an `Option<ARef<PidNamespace>>` forces the caller to make a conscious decision what instead of
+/// just silently translating a `NULL` to `current`'s pid namespace.
+#[macro_export]
+macro_rules! current_pid_ns {
+    () => {
+        let ptr = current()
+        unsafe { PidNamespace::from_ptr(bindings::task_active_pid_ns(ptr)) }
+    };
+}
+
 /// Wraps the kernel's `struct task_struct`.
 ///
 /// # Invariants
@@ -182,11 +214,23 @@  pub fn signal_pending(&self) -> bool {
         unsafe { bindings::signal_pending(self.0.get()) != 0 }
     }
 
-    /// Returns the given task's pid in the current pid namespace.
-    pub fn pid_in_current_ns(&self) -> Pid {
-        // SAFETY: We know that `self.0.get()` is valid by the type invariant, and passing a null
-        // pointer as the namespace is correct for using the current namespace.
-        unsafe { bindings::task_tgid_nr_ns(self.0.get(), ptr::null_mut()) }
+    /// Returns task's pid namespace with elevated reference count
+    pub fn task_get_pid_ns(&self) -> Option<ARef<PidNamespace>> {
+        let ptr = unsafe { bindings::task_get_pid_ns(self.0.get()) };
+        if ptr.is_null() {
+            None
+        } else {
+            // SAFETY: `ptr` is valid by the safety requirements of this function. And we own a
+            // reference count via `task_get_pid_ns()`.
+            // CAST: `Self` is a `repr(transparent)` wrapper around `bindings::pid_namespace`.
+            Some(unsafe { ARef::from_raw(ptr::NonNull::new_unchecked(ptr.cast::<PidNamespace>())) })
+        }
+    }
+
+    /// Returns the given task's pid in the provided pid namespace.
+    pub fn task_tgid_nr_ns(&self, pidns: &PidNamespace) -> Pid {
+        // SAFETY: We know that `self.0.get()` is valid by the type invariant.
+        unsafe { bindings::task_tgid_nr_ns(self.0.get(), pidns.as_ptr()) }
     }
 
     /// Wakes up the task.