
[RFC,06/16] fs: create detached mounts from detached mounts

Message ID 20250221-brauner-open_tree-v1-6-dbcfcb98c676@kernel.org (mailing list archive)
State New
Series fs: expand abilities of anonymous mount namespaces

Commit Message

Christian Brauner Feb. 21, 2025, 1:13 p.m. UTC
Add the ability to create detached mounts from detached mounts.

Currently, detached mounts can only be created from attached mounts.
This limitation prevents various use-cases, for example mounting a
subdirectory without ever having to make the whole filesystem visible
first.
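
The subdirectory use-case could look roughly like the following from
userspace. This is a minimal sketch, not part of the patch: the helper
names clone_detached() and detached_subdir() are hypothetical, the raw
syscall is used so that no particular libc wrapper is assumed, and the
second open_tree() call is only expected to succeed on a kernel that
carries this change (and for a caller with the required privileges).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>        /* AT_FDCWD */
#include <linux/mount.h>  /* OPEN_TREE_CLONE, OPEN_TREE_CLOEXEC */
#include <sys/syscall.h>  /* SYS_open_tree */
#include <unistd.h>       /* syscall(), close() */

/*
 * Hypothetical helper: create a detached copy of the mount at
 * (dfd, path). Returns a file descriptor on success or a negative
 * errno value on failure.
 */
static int clone_detached(int dfd, const char *path)
{
	long fd;

	assert(path != NULL);
	fd = syscall(SYS_open_tree, dfd, path,
		     OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	return fd < 0 ? -errno : (int)fd;
}

/*
 * Hypothetical helper: obtain a detached mount of @subdir inside the
 * filesystem mounted at @fs_root without attaching anything to the
 * caller's mount namespace.
 */
static int detached_subdir(const char *fs_root, const char *subdir)
{
	int whole, sub;

	/* Step 1: detached mount created from an attached mount. */
	whole = clone_detached(AT_FDCWD, fs_root);
	if (whole < 0)
		return whole;

	/*
	 * Step 2: detached mount created from the detached mount of
	 * step 1. This is the case the patch is meant to allow; it is
	 * expected to fail on kernels without it.
	 */
	sub = clone_detached(whole, subdir);
	close(whole);
	return sub;
}
```

The returned file descriptor could then be attached somewhere with
move_mount(), so that only the subdirectory ever becomes visible.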

The current permission model for the OPEN_TREE_CLONE flag of the
open_tree() system call is:

(1) Check that the caller is privileged over the owning user namespace
    of its current mount namespace.

(2) Check that the caller is located in the mount namespace of the mount
    it wants to create a detached copy of.

While it is not strictly necessary to do it this way, it is consistently
applied in the new mount api. This model will also be used when allowing
the creation of a detached mount from another detached mount.

Requirement (1) can simply be met by performing the same check as
for the non-detached case, i.e., verify that the caller is privileged
over its current mount namespace.

To meet requirement (2), it must be possible to infer the origin
mount namespace that the anonymous mount namespace of the detached mount
was created from.

The origin mount namespace of an anonymous mount is the mount namespace
that the mounts that were copied into the anonymous mount namespace
originate from.

The origin mount namespace of the anonymous mount namespace must be the
same as the caller's mount namespace. To establish this, the sequence
number of the caller's mount namespace and the origin sequence number of
the anonymous mount namespace are compared.

The caller is always located in a non-anonymous mount namespace since
anonymous mount namespaces cannot be setns()ed into. The caller's mount
namespace will thus always have a valid sequence number.

The owning namespace of any mount namespace, anonymous or non-anonymous,
can never change. A mount attached to a non-anonymous mount namespace
can never change mount namespace.

If the sequence number of the non-anonymous mount namespace and the
origin sequence number of the anonymous mount namespace match, the
owning namespaces must match as well.

Hence, the capability check on the owning namespace of the caller's
mount namespace ensures that the caller has the ability to copy the
mount tree.
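
The argument above can be illustrated with a small userspace model.
This is only a sketch under stated assumptions and not kernel code: the
toy_* types and functions are hypothetical stand-ins, anonymous mount
namespaces are modelled as having seq == 0, an unmounted mount is
modelled as having no namespace, and the nsfs and pidfs cases of
may_copy_tree() are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a mount namespace; only the fields the check reads. */
struct toy_mnt_ns {
	uint64_t seq;        /* assumption: 0 marks an anonymous namespace */
	uint64_t seq_origin; /* seq of the namespace the mounts were copied from */
};

/* Toy stand-in for a mount; NULL mnt_ns models "not mounted". */
struct toy_mount {
	const struct toy_mnt_ns *mnt_ns;
};

static bool toy_is_anon_ns(const struct toy_mnt_ns *ns)
{
	return ns->seq == 0;
}

/* Models check_mnt(): the mount lives in the caller's mount namespace. */
static bool toy_check_mnt(const struct toy_mount *mnt,
			  const struct toy_mnt_ns *caller_ns)
{
	return mnt->mnt_ns == caller_ns;
}

/* Models check_anonymous_mnt(): anonymous and copied from the caller's ns. */
static bool toy_check_anonymous_mnt(const struct toy_mount *mnt,
				    const struct toy_mnt_ns *caller_ns)
{
	/* The caller can never be located in an anonymous namespace. */
	assert(!toy_is_anon_ns(caller_ns));

	return toy_is_anon_ns(mnt->mnt_ns) &&
	       mnt->mnt_ns->seq_origin == caller_ns->seq;
}

/* Models the attached and anonymous branches of may_copy_tree(). */
static bool toy_may_copy_tree(const struct toy_mount *mnt,
			      const struct toy_mnt_ns *caller_ns)
{
	if (toy_check_mnt(mnt, caller_ns))
		return true;

	if (mnt->mnt_ns == NULL)
		return false;

	return toy_check_anonymous_mnt(mnt, caller_ns);
}
```

In this model, a mount in an anonymous namespace whose seq_origin
equals the caller's seq may be copied, while one that originates from a
different mount namespace may not.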

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

Patch

diff --git a/fs/namespace.c b/fs/namespace.c
index c61b9704499a..66b9cea1cf66 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -998,6 +998,12 @@  static inline int check_mnt(struct mount *mnt)
 	return mnt->mnt_ns == current->nsproxy->mnt_ns;
 }
 
+static inline bool check_anonymous_mnt(struct mount *mnt)
+{
+	return is_anon_ns(mnt->mnt_ns) &&
+	       mnt->mnt_ns->seq_origin == current->nsproxy->mnt_ns->seq;
+}
+
 /*
  * vfsmount lock must be held for write
  */
@@ -2822,6 +2828,32 @@  static int do_change_type(struct path *path, int ms_flags)
  *     namespace, i.e., the caller is trying to copy a mount namespace
  *     entry from nsfs.
  * (3) The caller tries to copy a pidfs mount referring to a pidfd.
+ * (4) The caller is trying to copy a mount tree that belongs to an
+ *     anonymous mount namespace.
+ *
+ *     For that to be safe, this helper enforces that the origin mount
+ *     namespace the anonymous mount namespace was created from is the
+ *     same as the caller's mount namespace by comparing the sequence
+ *     numbers.
+ *
+ *     This is not strictly necessary. The current semantics of the new
+ *     mount api enforce that the caller must be located in the same
+ *     mount namespace as the mount tree it interacts with. Using the
+ *     origin sequence number preserves these semantics even for
+ *     anonymous mount namespaces. However, one could envision extending
+ *     the api to directly operate across mount namespace if needed.
+ *
+ *     The ownership of a non-anonymous mount namespace such as the
+ *     caller's cannot change.
+ *     => We know that the caller's mount namespace is stable.
+ *
+ *     If the origin sequence number of the anonymous mount namespace is
+ *     the same as the sequence number of the caller's mount namespace.
+ *     => The owning namespaces are the same.
+ *
+ *     ==> The earlier capability check on the owning namespace of the
+ *         caller's mount namespace ensures that the caller has the
+ *         ability to copy the mount tree.
  *
  * Returns true if the mount tree can be copied, false otherwise.
  */
@@ -2840,9 +2872,13 @@  static inline bool may_copy_tree(struct path *path)
 	if (d_op == &pidfs_dentry_operations)
 		return true;
 
-	return false;
+	if (!is_mounted(path->mnt))
+		return false;
+
+	return check_anonymous_mnt(mnt);
 }
 
+
 static struct mount *__do_loopback(struct path *old_path, int recurse)
 {
 	struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt);