diff mbox series

fs/namespace: eliminate unnecessary mount counting

Message ID 20220123100448.GA1468@haolee.io (mailing list archive)
State New, archived
Headers show
Series fs/namespace: eliminate unnecessary mount counting | expand

Commit Message

Hao Lee Jan. 23, 2022, 10:04 a.m. UTC
propagate_one() counts the number of propagated mounts in each
propagation. We can count them in advance and use the number in
subsequent propagation.

Signed-off-by: Hao Lee <haolee.swjtu@gmail.com>
---
 fs/namespace.c | 27 +++++++++++++++++++--------
 fs/pnode.c     | 12 +++++++-----
 fs/pnode.h     |  5 +++--
 3 files changed, 29 insertions(+), 15 deletions(-)

Comments

Al Viro Feb. 14, 2022, 3:04 a.m. UTC | #1
On Sun, Jan 23, 2022 at 10:04:48AM +0000, Hao Lee wrote:
> propagate_one() counts the number of propagated mounts in each
> propagation. We can count them in advance and use the number in
> subsequent propagation.

You are relying upon highly non-obvious assumptions.  Namely, that
copies will have the same amount of mounts as source_mnt.  AFAICS,
it's not true in case of mount --move - there source_mnt might very
well contain the things that would be skipped in subsequent copies.
E.g. anything marked unbindable.  Or mntns binds - anything that would
be skipped by copy_tree() without special flags.

Sure, we could make count_mounts() return just the number of those
that will go into subsequent copies (with mount --move we don't add
the original subtree - it's been in the namespace and thus is already
counted), but
	1) it creates an extra dependency in already convoluted code
(copy_tree() and count_mounts() need to be kept in sync, in case we ever
add new classes of mounts to be skipped)
	2) I'm *NOT* certain that we won't ever run into the non-move
cases where the original tree contains something that would be skipped
from subsequent ones, and there we want to count the original.	Matter of
fact, we do run into that.  Look:

# arrange a private tree at /tmp/a
mkdir /tmp/a
mount --bind /tmp/a /tmp/a
mount --make-rprivate /tmp/a
# mountpoint at /tmp/a/x
mkdir /tmp/a/x
mount --bind /tmp/a/x /tmp/a/x
# this will be a peer of /tmp/a/x
mkdir /tmp/a/y
# ... and this - a mountpoint in it
mkdir /tmp/a/x/v
# ... rbind fodder:
mkdir /tmp/a/z
touch /tmp/a/z/f
# start a new mntns, so we won't run afoul of loop checks
unshare -m &
# ... and bind it on /tmp/a/z/f
mount --bind /proc/$!/ns/mnt /tmp/a/z/f
# now we can do the rest - it won't spread into child namespace
# make /tmp/a/x a peer of /tmp/b/x
mount --make-shared /tmp/a/x
mount --bind /tmp/a/x /tmp/a/y
# ... and rbind /tmp/a/z at /tmp/a/x/v
# which will propagate a copy to /tmp/b/x/v
# except that mntns bound on /tmp/a/x/v/f will *not* propagate.
mount --rbind /tmp/a/z /tmp/a/x/v
# verify that
stat /tmp/a/x/v
stat /tmp/a/y/v
stat /tmp/a/x/v/f
stat /tmp/a/y/v/f

Result:
  File: /tmp/a/x/v/
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: 808h/2056d      Inode: 270607      Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-13 21:43:45.058485130 -0500
Modify: 2022-02-13 21:42:37.142457622 -0500
Change: 2022-02-13 21:42:37.142457622 -0500
 Birth: 2022-02-13 21:42:37.142457622 -0500
  File: /tmp/a/y/v/
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: 808h/2056d      Inode: 270607      Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-13 21:43:45.058485130 -0500
Modify: 2022-02-13 21:42:37.142457622 -0500
Change: 2022-02-13 21:42:37.142457622 -0500
 Birth: 2022-02-13 21:42:37.142457622 -0500
  File: /tmp/a/x/v/f
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: 4h/4d   Inode: 4026532237  Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-13 21:42:37.146457624 -0500
Modify: 2022-02-13 21:42:37.146457624 -0500
Change: 2022-02-13 21:42:37.146457624 -0500
 Birth: -
  File: /tmp/a/y/v/f
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: 808h/2056d      Inode: 270608      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-13 21:42:37.142457622 -0500
Modify: 2022-02-13 21:42:37.142457622 -0500
Change: 2022-02-13 21:42:37.142457622 -0500
 Birth: 2022-02-13 21:42:37.142457622 -0500

	Note that /tmp/a/x/v and /tmp/a/y/v resolve to the same directory
(otherwise seen at /tmp/a/z), but /tmp/a/x/v/f and /tmp/a/y/v/f do *not*
resolve to the same thing - the latter is a regular file on /dev/sda8
(nothing got propagated there), while the former is *not* - it's an
mntns descriptor we'd bound on /tmp/a/z/f

	IOW, the first copy has two mount nodes, the second - only one.
Initial copy at rbind does get mntns binds copied into it - look at
CL_COPY_MNT_NS_FILE in arguments of copy_tree() call in __do_loopback().
However, we do *not* propagate that subsequent copies (propagate_one()
never passes CL_COPY_MNT_NS_FILE).  So that's at least one case where we
want different contributions from the first copy and every subsequent one.

	So we'd need to run *two* counts, the one to be used from
attach_recursive_mnt() and another for propagate_one().  With even more
places where the things could go wrong...

	I don't believe it's worth the trouble.  Sure, you run that loop
only once, instead of once per copy.  And if that's more than noise,
compared to allocating the same mounts we'd been counting, connecting
them into tree, hashing, etc., I would be *very* surprised.

NAKed-by: Al Viro <viro@zeniv.linux.org.uk>
Hao Lee Feb. 14, 2022, 8:29 a.m. UTC | #2
On Mon, Feb 14, 2022 at 11:04 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sun, Jan 23, 2022 at 10:04:48AM +0000, Hao Lee wrote:
> > propagate_one() counts the number of propagated mounts in each
> > propagation. We can count them in advance and use the number in
> > subsequent propagation.
>
> You are relying upon highly non-obvious assumptions.  Namely, that
> copies will have the same amount of mounts as source_mnt.  AFAICS,
> it's not true in case of mount --move - there source_mnt might very
> well contain the things that would be skipped in subsequent copies.
> E.g. anything marked unbindable.  Or mntns binds - anything that would
> be skipped by copy_tree() without special flags.
>
> Sure, we could make count_mounts() return just the number of those
> that will go into subsequent copies (with mount --move we don't add
> the original subtree - it's been in the namespace and thus is already
> counted), but
>         1) it creates an extra dependency in already convoluted code
> (copy_tree() and count_mounts() need to be kept in sync, in case we ever
> add new classes of mounts to be skipped)
>         2) I'm *NOT* certain that we won't ever run into the non-move
> cases where the original tree contains something that would be skipped
> from subsequent ones, and there we want to count the original.  Matter of
> fact, we do run into that.  Look:
>
> # arrange a private tree at /tmp/a
> mkdir /tmp/a
> mount --bind /tmp/a /tmp/a
> mount --make-rprivate /tmp/a
> # mountpoint at /tmp/a/x
> mkdir /tmp/a/x
> mount --bind /tmp/a/x /tmp/a/x
> # this will be a peer of /tmp/a/x
> mkdir /tmp/a/y
> # ... and this - a mountpoint in it
> mkdir /tmp/a/x/v
> # ... rbind fodder:
> mkdir /tmp/a/z
> touch /tmp/a/z/f
> # start a new mntns, so we won't run afoul of loop checks
> unshare -m &
> # ... and bind it on /tmp/a/z/f
> mount --bind /proc/$!/ns/mnt /tmp/a/z/f
> # now we can do the rest - it won't spread into child namespace
> # make /tmp/a/x a peer of /tmp/b/x
> mount --make-shared /tmp/a/x
> mount --bind /tmp/a/x /tmp/a/y
> # ... and rbind /tmp/a/z at /tmp/a/x/v
> # which will propagate a copy to /tmp/b/x/v
> # except that mntns bound on /tmp/a/x/v/f will *not* propagate.
> mount --rbind /tmp/a/z /tmp/a/x/v
> # verify that
> stat /tmp/a/x/v
> stat /tmp/a/y/v
> stat /tmp/a/x/v/f
> stat /tmp/a/y/v/f
>
> Result:
>   File: /tmp/a/x/v/
>   Size: 4096            Blocks: 8          IO Block: 4096   directory
> Device: 808h/2056d      Inode: 270607      Links: 2
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2022-02-13 21:43:45.058485130 -0500
> Modify: 2022-02-13 21:42:37.142457622 -0500
> Change: 2022-02-13 21:42:37.142457622 -0500
>  Birth: 2022-02-13 21:42:37.142457622 -0500
>   File: /tmp/a/y/v/
>   Size: 4096            Blocks: 8          IO Block: 4096   directory
> Device: 808h/2056d      Inode: 270607      Links: 2
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2022-02-13 21:43:45.058485130 -0500
> Modify: 2022-02-13 21:42:37.142457622 -0500
> Change: 2022-02-13 21:42:37.142457622 -0500
>  Birth: 2022-02-13 21:42:37.142457622 -0500
>   File: /tmp/a/x/v/f
>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> Device: 4h/4d   Inode: 4026532237  Links: 1
> Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2022-02-13 21:42:37.146457624 -0500
> Modify: 2022-02-13 21:42:37.146457624 -0500
> Change: 2022-02-13 21:42:37.146457624 -0500
>  Birth: -
>   File: /tmp/a/y/v/f
>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> Device: 808h/2056d      Inode: 270608      Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2022-02-13 21:42:37.142457622 -0500
> Modify: 2022-02-13 21:42:37.142457622 -0500
> Change: 2022-02-13 21:42:37.142457622 -0500
>  Birth: 2022-02-13 21:42:37.142457622 -0500
>
>         Note that /tmp/a/x/v and /tmp/a/y/v resolve to the same directory
> (otherwise seen at /tmp/a/z), but /tmp/a/x/v/f and /tmp/a/y/v/f do *not*
> resolve to the same thing - the latter is a regular file on /dev/sda8
> (nothing got propagated there), while the former is *not* - it's an
> mntns descriptor we'd bound on /tmp/a/z/f
>
>         IOW, the first copy has two mount nodes, the second - only one.
> Initial copy at rbind does get mntns binds copied into it - look at
> CL_COPY_MNT_NS_FILE in arguments of copy_tree() call in __do_loopback().
> However, we do *not* propagate that subsequent copies (propagate_one()
> never passes CL_COPY_MNT_NS_FILE).  So that's at least one case where we
> want different contributions from the first copy and every subsequent one.
>
>         So we'd need to run *two* counts, the one to be used from
> attach_recursive_mnt() and another for propagate_one().  With even more
> places where the things could go wrong...

This is really a classic counterexample.
Thanks for your detailed explanation!

>
>         I don't believe it's worth the trouble.  Sure, you run that loop
> only once, instead of once per copy.  And if that's more than noise,
> compared to allocating the same mounts we'd been counting, connecting
> them into tree, hashing, etc., I would be *very* surprised.

Got it. Thanks!

>
> NAKed-by: Al Viro <viro@zeniv.linux.org.uk>
diff mbox series

Patch

diff --git a/fs/namespace.c b/fs/namespace.c
index c6feb92209a6..5d05392854ca 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2082,18 +2082,15 @@  static int invent_group_ids(struct mount *mnt, bool recurse)
 	return 0;
 }
 
-int count_mounts(struct mnt_namespace *ns, struct mount *mnt)
+int update_pending_mounts(struct mnt_namespace *ns, unsigned int mounts)
 {
 	unsigned int max = READ_ONCE(sysctl_mount_max);
-	unsigned int mounts = 0, old, pending, sum;
-	struct mount *p;
-
-	for (p = mnt; p; p = next_mnt(p, mnt))
-		mounts++;
+	unsigned int old, pending, sum;
 
 	old = ns->mounts;
 	pending = ns->pending_mounts;
 	sum = old + pending;
+
 	if ((old > sum) ||
 	    (pending > sum) ||
 	    (max < sum) ||
@@ -2104,6 +2101,17 @@  int count_mounts(struct mnt_namespace *ns, struct mount *mnt)
 	return 0;
 }
 
+unsigned int count_mounts(struct mount *mnt)
+{
+	unsigned int mounts = 0;
+	struct mount *p;
+
+	for (p = mnt; p; p = next_mnt(p, mnt))
+		mounts++;
+
+	return mounts;
+}
+
 /*
  *  @source_mnt : mount tree to be attached
  *  @nd         : place the mount tree @source_mnt is attached
@@ -2178,6 +2186,7 @@  static int attach_recursive_mnt(struct mount *source_mnt,
 	struct mountpoint *smp;
 	struct mount *child, *p;
 	struct hlist_node *n;
+	unsigned int nr_mounts;
 	int err;
 
 	/* Preallocate a mountpoint in case the new mounts need
@@ -2187,9 +2196,10 @@  static int attach_recursive_mnt(struct mount *source_mnt,
 	if (IS_ERR(smp))
 		return PTR_ERR(smp);
 
+	nr_mounts = count_mounts(source_mnt);
 	/* Is there space to add these mounts to the mount namespace? */
 	if (!moving) {
-		err = count_mounts(ns, source_mnt);
+		err = update_pending_mounts(ns, nr_mounts);
 		if (err)
 			goto out;
 	}
@@ -2198,7 +2208,8 @@  static int attach_recursive_mnt(struct mount *source_mnt,
 		err = invent_group_ids(source_mnt, true);
 		if (err)
 			goto out;
-		err = propagate_mnt(dest_mnt, dest_mp, source_mnt, &tree_list);
+		err = propagate_mnt(dest_mnt, dest_mp, source_mnt, nr_mounts,
+				    &tree_list);
 		lock_mount_hash();
 		if (err)
 			goto out_cleanup_ids;
diff --git a/fs/pnode.c b/fs/pnode.c
index 1106137c747a..877de718fc35 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -222,7 +222,7 @@  static inline bool peers(struct mount *m1, struct mount *m2)
 	return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
 }
 
-static int propagate_one(struct mount *m)
+static int propagate_one(struct mount *m, unsigned int nr_mounts)
 {
 	struct mount *child;
 	int type;
@@ -269,7 +269,7 @@  static int propagate_one(struct mount *m)
 	last_dest = m;
 	last_source = child;
 	hlist_add_head(&child->mnt_hash, list);
-	return count_mounts(m->mnt_ns, child);
+	return update_pending_mounts(m->mnt_ns, nr_mounts);
 }
 
 /*
@@ -284,9 +284,11 @@  static int propagate_one(struct mount *m)
  * @dest_dentry: destination dentry.
  * @source_mnt: source mount.
  * @tree_list : list of heads of trees to be attached.
+ * @nr_mounts : the number of mounts in source_mnt
  */
 int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
-		    struct mount *source_mnt, struct hlist_head *tree_list)
+		    struct mount *source_mnt, unsigned int nr_mounts,
+		    struct hlist_head *tree_list)
 {
 	struct mount *m, *n;
 	int ret = 0;
@@ -305,7 +307,7 @@  int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
 
 	/* all peers of dest_mnt, except dest_mnt itself */
 	for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
-		ret = propagate_one(n);
+		ret = propagate_one(n, nr_mounts);
 		if (ret)
 			goto out;
 	}
@@ -316,7 +318,7 @@  int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
 		/* everything in that slave group */
 		n = m;
 		do {
-			ret = propagate_one(n);
+			ret = propagate_one(n, nr_mounts);
 			if (ret)
 				goto out;
 			n = next_peer(n);
diff --git a/fs/pnode.h b/fs/pnode.h
index 988f1aa9b02a..005355c0dd49 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -38,7 +38,7 @@  static inline void set_mnt_shared(struct mount *mnt)
 
 void change_mnt_propagation(struct mount *, int);
 int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
-		struct hlist_head *);
+		  unsigned int, struct hlist_head *);
 int propagate_umount(struct list_head *);
 int propagate_mount_busy(struct mount *, int);
 void propagate_mount_unlock(struct mount *);
@@ -52,5 +52,6 @@  void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp,
 struct mount *copy_tree(struct mount *, struct dentry *, int);
 bool is_path_reachable(struct mount *, struct dentry *,
 			 const struct path *root);
-int count_mounts(struct mnt_namespace *ns, struct mount *mnt);
+int update_pending_mounts(struct mnt_namespace *ns, unsigned int mounts);
+unsigned int count_mounts(struct mount *mnt);
 #endif /* _LINUX_PNODE_H */