[REVIEW] mnt: Tuck mounts under others instead of creating shadow/side mounts.

On Sun, May 14, 2017 at 04:26:18AM -0500, Eric W. Biederman wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
> 
> > Andrei Vagin <avagin@virtuozzo.com> writes:
> >
> >> Hi Eric,
> >>
> >> I found one issue about this patch. Take a look at the next script. The
> >> idea is simple, we call mount twice and than we call umount twice, but
> >> the second umount fails.
> >
> > Definitely an odd.  I will take a look.
> 
> After a little more looking I have to say the only thing I can see wrong
> in the behavior is that the first umount doesn't unmount everything.
> 
> Changing the mount propagation tree while it is being traversed
> apparently prevents a complete traversal of the propgation tree.

Is it be enough to find topper which will not be umounted? I attached a
patch which does this. I can't find a reason why this will not work.

The idea of the patch is that we try to find a topper, check whether it is
marked, and if it is marked, then we try to find the next one. All
marked toppers are umounted and the first unmarked topper is attached to
the parent mount.

And I think we need to add tests in tools/testing/selftests/...

> 
> Last time I was looking at this I was playing with multipass algorithm
> because of these peculiarities that happen with propagation trees.
> 
> I suspect we are going to need to fix umount to behave correct before
> we tune it for performance.  So that we actually know what correct
> behavior is.
> 
> Eric
> 
> 
> >>
> >> [root@fc24 ~]# cat m.sh
> >> #!/bin/sh
> >>
> >> set -x -e
> >> mount -t tmpfs xxx /mnt
> >> mkdir -p /mnt/1
> >> mkdir -p /mnt/2
> >> mount --bind /mnt/1 /mnt/1
> >> mount --make-shared /mnt/1
> >> mount --bind /mnt/1 /mnt/2
> >> mkdir -p /mnt/1/1
> >> for i in `seq 2`; do
> >> 	mount --bind /mnt/1/1 /mnt/1/1
> >> done
> >> for i in `seq 2`; do
> >> 	umount /mnt/1/1 || {
> >> 		cat /proc/self/mountinfo | grep xxx
> >> 		exit 1
> >> 	}
> >> done
> >>
> >> [root@fc24 ~]# unshare -Urm  ./m.sh
> >> + mount -t tmpfs xxx /mnt
> >> + mkdir -p /mnt/1
> >> + mkdir -p /mnt/2
> >> + mount --bind /mnt/1 /mnt/1
> >> + mount --make-shared /mnt/1
> >> + mount --bind /mnt/1 /mnt/2
> >> + mkdir -p /mnt/1/1
> >> ++ seq 2
> >> + for i in '`seq 2`'
> >> + mount --bind /mnt/1/1 /mnt/1/1
> >> + for i in '`seq 2`'
> >> + mount --bind /mnt/1/1 /mnt/1/1
> >> ++ seq 2
> >> + for i in '`seq 2`'
> >> + umount /mnt/1/1
> >> + for i in '`seq 2`'
> >> + umount /mnt/1/1
> >> umount: /mnt/1/1: not mounted
> >> + cat /proc/self/mountinfo
> >> + grep xxx
> >> 147 116 0:42 / /mnt rw,relatime - tmpfs xxx rw
> >> 148 147 0:42 /1 /mnt/1 rw,relatime shared:65 - tmpfs xxx rw
> >> 149 147 0:42 /1 /mnt/2 rw,relatime shared:65 - tmpfs xxx rw
> >> 157 149 0:42 /1/1 /mnt/2/1 rw,relatime shared:65 - tmpfs xxx rw
> >> + exit 1
> >>
> >> And you can see that /mnt/2 contains a mount, but it should not.
> >>
> >> Thanks,
> >> Andrei
> >>
> >> On Thu, Jan 05, 2017 at 10:04:14AM +1300, Eric W. Biederman wrote:
> >>> 
> >>> Ever since mount propagation was introduced in cases where a mount in
> >>> propagated to parent mount mountpoint pair that is already in use the
> >>> code has placed the new mount behind the old mount in the mount hash
> >>> table.
> >>> 
> >>> This implementation detail is problematic as it allows creating
> >>> arbitrary length mount hash chains.
> >>> 
> >>> Furthermore it invalidates the constraint maintained elsewhere in the
> >>> mount code that a parent mount and a mountpoint pair will have exactly
> >>> one mount upon them.  Making it hard to deal with and to talk about
> >>> this special case in the mount code.
> >>> 
> >>> Modify mount propagation to notice when there is already a mount at
> >>> the parent mount and mountpoint where a new mount is propagating to
> >>> and place that preexisting mount on top of the new mount.
> >>> 
> >>> Modify unmount propagation to notice when a mount that is being
> >>> unmounted has another mount on top of it (and no other children), and
> >>> to replace the unmounted mount with the mount on top of it.
> >>> 
> >>> Move the MNT_UMUONT test from __lookup_mnt_last into
> >>> __propagate_umount as that is the only call of __lookup_mnt_last where
> >>> MNT_UMOUNT may be set on any mount visible in the mount hash table.
> >>> 
> >>> These modifications allow:
> >>>  - __lookup_mnt_last to be removed.
> >>>  - attach_shadows to be renamed __attach_mnt and the it's shadow
> >>>    handling to be removed.
> >>>  - commit_tree to be simplified
> >>>  - copy_tree to be simplified
> >>> 
> >>> The result is an easier to understand tree of mounts that does not
> >>> allow creation of arbitrary length hash chains in the mount hash table.
> >>> 
> >>> v2: Updated to mnt_change_mountpoint to not call dput or mntput
> >>> and instead to decrement the counts directly.  It is guaranteed
> >>> that there will be other references when mnt_change_mountpoint is
> >>> called so this is safe.
> >>> 
> >>> Cc: stable@vger.kernel.org
> >>> Fixes: b90fa9ae8f51 ("[PATCH] shared mount handling: bind and rbind")
> >>> Tested-by: Andrei Vagin <avagin@virtuozzo.com>
> >>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >>> ---
> >>> 
> >>> Since the last version some of you may have seen I have modified
> >>> my implementation of mnt_change_mountpoint so that it no longer calls
> >>> mntput or dput but instead relies on the knowledge that it can not
> >>> possibly have the last reference to the mnt and dentry of interest.
> >>> This avoids code checking tools from complaining bitterly.
> >>> 
> >>> This is on top of my previous patch that sorts out locking of the
> >>> mountpoint hash table.  After time giving ample time for review I intend
> >>> to push this and the previous bug fix to Linus.
> >>> 
> >>>  fs/mount.h     |   1 -
> >>>  fs/namespace.c | 110 +++++++++++++++++++++++++++++++--------------------------
> >>>  fs/pnode.c     |  27 ++++++++++----
> >>>  fs/pnode.h     |   2 ++
> >>>  4 files changed, 82 insertions(+), 58 deletions(-)
> >>> 
> >>> diff --git a/fs/mount.h b/fs/mount.h
> >>> index 2c856fc47ae3..2826543a131d 100644
> >>> --- a/fs/mount.h
> >>> +++ b/fs/mount.h
> >>> @@ -89,7 +89,6 @@ static inline int is_mounted(struct vfsmount *mnt)
> >>>  }
> >>>  
> >>>  extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *);
> >>> -extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
> >>>  
> >>>  extern int __legitimize_mnt(struct vfsmount *, unsigned);
> >>>  extern bool legitimize_mnt(struct vfsmount *, unsigned);
> >>> diff --git a/fs/namespace.c b/fs/namespace.c
> >>> index 487ba30bb5c6..91ccfb73f0e0 100644
> >>> --- a/fs/namespace.c
> >>> +++ b/fs/namespace.c
> >>> @@ -637,28 +637,6 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
> >>>  }
> >>>  
> >>>  /*
> >>> - * find the last mount at @dentry on vfsmount @mnt.
> >>> - * mount_lock must be held.
> >>> - */
> >>> -struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry)
> >>> -{
> >>> -	struct mount *p, *res = NULL;
> >>> -	p = __lookup_mnt(mnt, dentry);
> >>> -	if (!p)
> >>> -		goto out;
> >>> -	if (!(p->mnt.mnt_flags & MNT_UMOUNT))
> >>> -		res = p;
> >>> -	hlist_for_each_entry_continue(p, mnt_hash) {
> >>> -		if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry)
> >>> -			break;
> >>> -		if (!(p->mnt.mnt_flags & MNT_UMOUNT))
> >>> -			res = p;
> >>> -	}
> >>> -out:
> >>> -	return res;
> >>> -}
> >>> -
> >>> -/*
> >>>   * lookup_mnt - Return the first child mount mounted at path
> >>>   *
> >>>   * "First" means first mounted chronologically.  If you create the
> >>> @@ -878,6 +856,13 @@ void mnt_set_mountpoint(struct mount *mnt,
> >>>  	hlist_add_head(&child_mnt->mnt_mp_list, &mp->m_list);
> >>>  }
> >>>  
> >>> +static void __attach_mnt(struct mount *mnt, struct mount *parent)
> >>> +{
> >>> +	hlist_add_head_rcu(&mnt->mnt_hash,
> >>> +			   m_hash(&parent->mnt, mnt->mnt_mountpoint));
> >>> +	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
> >>> +}
> >>> +
> >>>  /*
> >>>   * vfsmount lock must be held for write
> >>>   */
> >>> @@ -886,28 +871,45 @@ static void attach_mnt(struct mount *mnt,
> >>>  			struct mountpoint *mp)
> >>>  {
> >>>  	mnt_set_mountpoint(parent, mp, mnt);
> >>> -	hlist_add_head_rcu(&mnt->mnt_hash, m_hash(&parent->mnt, mp->m_dentry));
> >>> -	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
> >>> +	__attach_mnt(mnt, parent);
> >>>  }
> >>>  
> >>> -static void attach_shadowed(struct mount *mnt,
> >>> -			struct mount *parent,
> >>> -			struct mount *shadows)
> >>> +void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
> >>>  {
> >>> -	if (shadows) {
> >>> -		hlist_add_behind_rcu(&mnt->mnt_hash, &shadows->mnt_hash);
> >>> -		list_add(&mnt->mnt_child, &shadows->mnt_child);
> >>> -	} else {
> >>> -		hlist_add_head_rcu(&mnt->mnt_hash,
> >>> -				m_hash(&parent->mnt, mnt->mnt_mountpoint));
> >>> -		list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
> >>> -	}
> >>> +	struct mountpoint *old_mp = mnt->mnt_mp;
> >>> +	struct dentry *old_mountpoint = mnt->mnt_mountpoint;
> >>> +	struct mount *old_parent = mnt->mnt_parent;
> >>> +
> >>> +	list_del_init(&mnt->mnt_child);
> >>> +	hlist_del_init(&mnt->mnt_mp_list);
> >>> +	hlist_del_init_rcu(&mnt->mnt_hash);
> >>> +
> >>> +	attach_mnt(mnt, parent, mp);
> >>> +
> >>> +	put_mountpoint(old_mp);
> >>> +
> >>> +	/*
> >>> +	 * Safely avoid even the suggestion this code might sleep or
> >>> +	 * lock the mount hash by taking avantage of the knowlege that
> >>> +	 * mnt_change_mounpoint will not release the final reference
> >>> +	 * to a mountpoint.
> >>> +	 *
> >>> +	 * During mounting, another mount will continue to use the old
> >>> +	 * mountpoint and during unmounting, the old mountpoint will
> >>> +	 * continue to exist until namespace_unlock which happens well
> >>> +	 * after mnt_change_mountpoint.
> >>> +	 */
> >>> +	spin_lock(&old_mountpoint->d_lock);
> >>> +	old_mountpoint->d_lockref.count--;
> >>> +	spin_unlock(&old_mountpoint->d_lock);
> >>> +
> >>> +	mnt_add_count(old_parent, -1);
> >>>  }
> >>>  
> >>>  /*
> >>>   * vfsmount lock must be held for write
> >>>   */
> >>> -static void commit_tree(struct mount *mnt, struct mount *shadows)
> >>> +static void commit_tree(struct mount *mnt)
> >>>  {
> >>>  	struct mount *parent = mnt->mnt_parent;
> >>>  	struct mount *m;
> >>> @@ -925,7 +927,7 @@ static void commit_tree(struct mount *mnt, struct mount *shadows)
> >>>  	n->mounts += n->pending_mounts;
> >>>  	n->pending_mounts = 0;
> >>>  
> >>> -	attach_shadowed(mnt, parent, shadows);
> >>> +	__attach_mnt(mnt, parent);
> >>>  	touch_mnt_namespace(n);
> >>>  }
> >>>  
> >>> @@ -1764,7 +1766,6 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
> >>>  			continue;
> >>>  
> >>>  		for (s = r; s; s = next_mnt(s, r)) {
> >>> -			struct mount *t = NULL;
> >>>  			if (!(flag & CL_COPY_UNBINDABLE) &&
> >>>  			    IS_MNT_UNBINDABLE(s)) {
> >>>  				s = skip_mnt_tree(s);
> >>> @@ -1786,14 +1787,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
> >>>  				goto out;
> >>>  			lock_mount_hash();
> >>>  			list_add_tail(&q->mnt_list, &res->mnt_list);
> >>> -			mnt_set_mountpoint(parent, p->mnt_mp, q);
> >>> -			if (!list_empty(&parent->mnt_mounts)) {
> >>> -				t = list_last_entry(&parent->mnt_mounts,
> >>> -					struct mount, mnt_child);
> >>> -				if (t->mnt_mp != p->mnt_mp)
> >>> -					t = NULL;
> >>> -			}
> >>> -			attach_shadowed(q, parent, t);
> >>> +			attach_mnt(q, parent, p->mnt_mp);
> >>>  			unlock_mount_hash();
> >>>  		}
> >>>  	}
> >>> @@ -1992,10 +1986,18 @@ static int attach_recursive_mnt(struct mount *source_mnt,
> >>>  {
> >>>  	HLIST_HEAD(tree_list);
> >>>  	struct mnt_namespace *ns = dest_mnt->mnt_ns;
> >>> +	struct mountpoint *smp;
> >>>  	struct mount *child, *p;
> >>>  	struct hlist_node *n;
> >>>  	int err;
> >>>  
> >>> +	/* Preallocate a mountpoint in case the new mounts need
> >>> +	 * to be tucked under other mounts.
> >>> +	 */
> >>> +	smp = get_mountpoint(source_mnt->mnt.mnt_root);
> >>> +	if (IS_ERR(smp))
> >>> +		return PTR_ERR(smp);
> >>> +
> >>>  	/* Is there space to add these mounts to the mount namespace? */
> >>>  	if (!parent_path) {
> >>>  		err = count_mounts(ns, source_mnt);
> >>> @@ -2022,17 +2024,22 @@ static int attach_recursive_mnt(struct mount *source_mnt,
> >>>  		touch_mnt_namespace(source_mnt->mnt_ns);
> >>>  	} else {
> >>>  		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
> >>> -		commit_tree(source_mnt, NULL);
> >>> +		commit_tree(source_mnt);
> >>>  	}
> >>>  
> >>>  	hlist_for_each_entry_safe(child, n, &tree_list, mnt_hash) {
> >>> -		struct mount *q;
> >>>  		hlist_del_init(&child->mnt_hash);
> >>> -		q = __lookup_mnt_last(&child->mnt_parent->mnt,
> >>> -				      child->mnt_mountpoint);
> >>> -		commit_tree(child, q);
> >>> +		if (child->mnt.mnt_root == smp->m_dentry) {
> >>> +			struct mount *q;
> >>> +			q = __lookup_mnt(&child->mnt_parent->mnt,
> >>> +					 child->mnt_mountpoint);
> >>> +			if (q)
> >>> +				mnt_change_mountpoint(child, smp, q);
> >>> +		}
> >>> +		commit_tree(child);
> >>>  	}
> >>>  	unlock_mount_hash();
> >>> +	put_mountpoint(smp);
> >>>  
> >>>  	return 0;
> >>>  
> >>> @@ -2046,6 +2053,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
> >>>  	cleanup_group_ids(source_mnt, NULL);
> >>>   out:
> >>>  	ns->pending_mounts = 0;
> >>> +	put_mountpoint(smp);
> >>>  	return err;
> >>>  }
> >>>  
> >>> diff --git a/fs/pnode.c b/fs/pnode.c
> >>> index 06a793f4ae38..eb4331240fd1 100644
> >>> --- a/fs/pnode.c
> >>> +++ b/fs/pnode.c
> >>> @@ -327,6 +327,9 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
> >>>   */
> >>>  static inline int do_refcount_check(struct mount *mnt, int count)
> >>>  {
> >>> +	struct mount *topper = __lookup_mnt(&mnt->mnt, mnt->mnt.mnt_root);
> >>> +	if (topper)
> >>> +		count++;
> >>>  	return mnt_get_count(mnt) > count;
> >>>  }
> >>>  
> >>> @@ -359,7 +362,7 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
> >>>  
> >>>  	for (m = propagation_next(parent, parent); m;
> >>>  	     		m = propagation_next(m, parent)) {
> >>> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
> >>> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
> >>>  		if (child && list_empty(&child->mnt_mounts) &&
> >>>  		    (ret = do_refcount_check(child, 1)))
> >>>  			break;
> >>> @@ -381,7 +384,7 @@ void propagate_mount_unlock(struct mount *mnt)
> >>>  
> >>>  	for (m = propagation_next(parent, parent); m;
> >>>  			m = propagation_next(m, parent)) {
> >>> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
> >>> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
> >>>  		if (child)
> >>>  			child->mnt.mnt_flags &= ~MNT_LOCKED;
> >>>  	}
> >>> @@ -399,9 +402,11 @@ static void mark_umount_candidates(struct mount *mnt)
> >>>  
> >>>  	for (m = propagation_next(parent, parent); m;
> >>>  			m = propagation_next(m, parent)) {
> >>> -		struct mount *child = __lookup_mnt_last(&m->mnt,
> >>> +		struct mount *child = __lookup_mnt(&m->mnt,
> >>>  						mnt->mnt_mountpoint);
> >>> -		if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) {
> >>> +		if (!child || (child->mnt.mnt_flags & MNT_UMOUNT))
> >>> +			continue;
> >>> +		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
> >>>  			SET_MNT_MARK(child);
> >>>  		}
> >>>  	}
> >>> @@ -420,8 +425,8 @@ static void __propagate_umount(struct mount *mnt)
> >>>  
> >>>  	for (m = propagation_next(parent, parent); m;
> >>>  			m = propagation_next(m, parent)) {
> >>> -
> >>> -		struct mount *child = __lookup_mnt_last(&m->mnt,
> >>> +		struct mount *topper;
> >>> +		struct mount *child = __lookup_mnt(&m->mnt,
> >>>  						mnt->mnt_mountpoint);
> >>>  		/*
> >>>  		 * umount the child only if the child has no children
> >>> @@ -430,6 +435,16 @@ static void __propagate_umount(struct mount *mnt)
> >>>  		if (!child || !IS_MNT_MARKED(child))
> >>>  			continue;
> >>>  		CLEAR_MNT_MARK(child);
> >>> +
> >>> +		/* If there is exactly one mount covering all of child
> >>> +		 * replace child with that mount.
> >>> +		 */
> >>> +		topper = __lookup_mnt(&child->mnt, child->mnt.mnt_root);
> >>> +		if (topper &&
> >>> +		    (child->mnt_mounts.next == &topper->mnt_child) &&
> >>> +		    (topper->mnt_child.next == &child->mnt_mounts))
> >>> +			mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, topper);
> >>> +
> >>>  		if (list_empty(&child->mnt_mounts)) {
> >>>  			list_del_init(&child->mnt_child);
> >>>  			child->mnt.mnt_flags |= MNT_UMOUNT;
> >>> diff --git a/fs/pnode.h b/fs/pnode.h
> >>> index 550f5a8b4fcf..dc87e65becd2 100644
> >>> --- a/fs/pnode.h
> >>> +++ b/fs/pnode.h
> >>> @@ -49,6 +49,8 @@ int get_dominating_id(struct mount *mnt, const struct path *root);
> >>>  unsigned int mnt_get_count(struct mount *mnt);
> >>>  void mnt_set_mountpoint(struct mount *, struct mountpoint *,
> >>>  			struct mount *);
> >>> +void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp,
> >>> +			   struct mount *mnt);
> >>>  struct mount *copy_tree(struct mount *, struct dentry *, int);
> >>>  bool is_path_reachable(struct mount *, struct dentry *,
> >>>  			 const struct path *root);
> >>> -- 
> >>> 2.10.1
> >>>

[REVIEW] mnt: Tuck mounts under others instead of creating shadow/side mounts.

Commit Message

Comments

Patch