
[REVIEW] mnt: Tuck mounts under others instead of creating shadow/side mounts.

Message ID 87a8b6r0z5.fsf_-_@xmission.com (mailing list archive)
State New, archived

Commit Message

Eric W. Biederman Jan. 4, 2017, 9:04 p.m. UTC
Ever since mount propagation was introduced, in cases where a mount is
propagated to a parent mount and mountpoint pair that is already in use,
the code has placed the new mount behind the old mount in the mount hash
table.

This implementation detail is problematic as it allows creating
arbitrary length mount hash chains.

Furthermore it invalidates the constraint maintained elsewhere in the
mount code that a parent mount and a mountpoint pair will have exactly
one mount upon them, making this special case hard to deal with and to
talk about in the mount code.

Modify mount propagation to notice when there is already a mount at
the parent mount and mountpoint where a new mount is propagating to
and place that preexisting mount on top of the new mount.

Modify unmount propagation to notice when a mount that is being
unmounted has another mount on top of it (and no other children), and
to replace the unmounted mount with the mount on top of it.

Move the MNT_UMOUNT test from __lookup_mnt_last into
__propagate_umount as that is the only call of __lookup_mnt_last where
MNT_UMOUNT may be set on any mount visible in the mount hash table.

These modifications allow:
 - __lookup_mnt_last to be removed.
 - attach_shadows to be renamed __attach_mnt and the it's shadow
   handling to be removed.
 - commit_tree to be simplified
 - copy_tree to be simplified

The result is an easier to understand tree of mounts that does not
allow creation of arbitrary length hash chains in the mount hash table.

v2: Updated mnt_change_mountpoint to not call dput or mntput
and instead to decrement the counts directly.  It is guaranteed
that there will be other references when mnt_change_mountpoint is
called so this is safe.

Cc: stable@vger.kernel.org
Fixes: b90fa9ae8f51 ("[PATCH] shared mount handling: bind and rbind")
Tested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

Since the last version, as some of you may have seen, I have modified
my implementation of mnt_change_mountpoint so that it no longer calls
mntput or dput but instead relies on the knowledge that it cannot
possibly hold the last reference to the mnt and dentry of interest.
This keeps code checking tools from complaining bitterly.

This is on top of my previous patch that sorts out locking of the
mountpoint hash table.  After giving ample time for review I intend
to push this and the previous bug fix to Linus.

 fs/mount.h     |   1 -
 fs/namespace.c | 110 +++++++++++++++++++++++++++++++--------------------------
 fs/pnode.c     |  27 ++++++++++----
 fs/pnode.h     |   2 ++
 4 files changed, 82 insertions(+), 58 deletions(-)

Comments

Al Viro Jan. 7, 2017, 5:06 a.m. UTC | #1
On Thu, Jan 05, 2017 at 10:04:14AM +1300, Eric W. Biederman wrote:

>  - attach_shadows to be renamed __attach_mnt and the it's shadow
>    handling to be removed.

Er...  s/the it's/its/, presumably?  Or am I misparsing that?

> v2: Updated to mnt_change_mountpoint to not call dput or mntput
> and instead to decrement the counts directly.  It is guaranteed
> that there will be other references when mnt_change_mountpoint is
> called so this is safe.

> +void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
>  {

Too generic name, IMO, and I really wonder if "mount" (== interpose) and
"umount" (== excise?) cases would be better off separately.

> +	struct mountpoint *old_mp = mnt->mnt_mp;
> +	struct dentry *old_mountpoint = mnt->mnt_mountpoint;
> +	struct mount *old_parent = mnt->mnt_parent;
> +
> +	list_del_init(&mnt->mnt_child);
> +	hlist_del_init(&mnt->mnt_mp_list);
> +	hlist_del_init_rcu(&mnt->mnt_hash);
> +
> +	attach_mnt(mnt, parent, mp);
> +
> +	put_mountpoint(old_mp);

> +	 *
> +	 * During mounting, another mount will continue to use the old
> +	 * mountpoint and during unmounting, the old mountpoint will
> +	 * continue to exist until namespace_unlock which happens well
> +	 * after mnt_change_mountpoint.
> +	 */

Umm...  AFAICS, in the former case "another mount" is simply parent, right?

> +	spin_lock(&old_mountpoint->d_lock);
> +	old_mountpoint->d_lockref.count--;
> +	spin_unlock(&old_mountpoint->d_lock);
> +	mnt_add_count(old_parent, -1);


> +		if (child->mnt.mnt_root == smp->m_dentry) {

Explain, please.  In which case is that condition _not_ satisfied, and
what should happen i

> +			struct mount *q;
> +			q = __lookup_mnt(&child->mnt_parent->mnt,
> +					 child->mnt_mountpoint);
> +			if (q)
> +				mnt_change_mountpoint(child, smp, q);


>  	unlock_mount_hash();
> +	put_mountpoint(smp);

Wrong order...

>  	ns->pending_mounts = 0;
> +	put_mountpoint(smp);

... and worse yet here.


>  static inline int do_refcount_check(struct mount *mnt, int count)
>  {
> +	struct mount *topper = __lookup_mnt(&mnt->mnt, mnt->mnt.mnt_root);
> +	if (topper)
> +		count++;
>  	return mnt_get_count(mnt) > count;
>  }

> @@ -359,7 +362,7 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
>  
>  	for (m = propagation_next(parent, parent); m;
>  	     		m = propagation_next(m, parent)) {
> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
>  		if (child && list_empty(&child->mnt_mounts) &&
>  		    (ret = do_refcount_check(child, 1)))
>  			break;

Er...  You do realize that you can end up with more than one such
propagation, right?  IOW, there might be more than one thing slipped in.

> +		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
>  			SET_MNT_MARK(child);

Reread the condition, please...  And yes, I realize that original is
also rather odd; at a guess it was meant to be !(, not (!, but that's
just a guess - it's your code, IIRC.

> @@ -420,8 +425,8 @@ static void __propagate_umount(struct mount *mnt)
>  
>  	for (m = propagation_next(parent, parent); m;
>  			m = propagation_next(m, parent)) {
> -
> -		struct mount *child = __lookup_mnt_last(&m->mnt,
> +		struct mount *topper;
> +		struct mount *child = __lookup_mnt(&m->mnt,
>  						mnt->mnt_mountpoint);
>  		/*
>  		 * umount the child only if the child has no children
> @@ -430,6 +435,16 @@ static void __propagate_umount(struct mount *mnt)
>  		if (!child || !IS_MNT_MARKED(child))
>  			continue;
>  		CLEAR_MNT_MARK(child);
> +
> +		/* If there is exactly one mount covering all of child
> +		 * replace child with that mount.
> +		 */
> +		topper = __lookup_mnt(&child->mnt, child->mnt.mnt_root);
> +		if (topper &&

> +		    (child->mnt_mounts.next == &topper->mnt_child) &&
> +		    (topper->mnt_child.next == &child->mnt_mounts))

Weird way to spell child->mnt_mounts.next == child->mnt_mounts.prev, that...
Or, perhaps, the entire thing ought to be
		if (list_is_singular(&child->mnt_mounts)) {
			topper = list_first_entry(&child->mnt_mounts,
						  struct mount, mnt_child);
			if (topper->mnt_parent == child &&
			    topper->mnt_mountpoint == child->mnt.mnt_root)

to avoid hash lookups.
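The equivalence being argued about here can be checked with a minimal
userspace rendition of the list primitives (illustrative only, not the
kernel's <linux/list.h>): with a circular doubly-linked list, the
open-coded two-pointer test in the patch holds exactly when the list is
singular and the sole entry is the topper.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list in the style of <linux/list.h>. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Non-empty list with exactly one entry. */
static int list_is_singular(const struct list_head *head)
{
	return head->next != head && head->next == head->prev;
}
```

When the sole member is known (here, from the hash lookup), the patch's
`mounts.next == &entry && entry.next == &mounts` test and
`list_is_singular(&mounts)` agree; Al's version simply avoids the hash
lookup by reading the first entry out of the singular list instead.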

> +			mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, topper);


FWIW, my main worry here is your handling of the umount.  For
example, what happens if
	* something is mounted on A (m1)
	* something else is mounted on A/bar (m2)
	* D is a slave of C
	* something has been mounted on D/foo (n)
	* you do mount --rbind A C/foo (m1' on C/foo, m2' on m1'/bar,
					m1'' interposed on D/foo under n,
					m2'' on m1''/bar,
					m1'' slave of m1', m2'' slave of m2)
	* you make C/foo and C/foo/bar private (m1'' and m2'' are not getting
					propagation from m1' and m2' anymore)
	* you umount C/foo/bar		(m2' is unmounted)
	* you umount C/foo
m1' gets unmounted, all right, but what of m1''?  D is a slave of C, so we
get propagation of umount from C/foo to D/foo; m1'' still has m2'' attached
to it.  AFAICS, your logics will happily slip m1'' from under n (assuming
that n itself is not busy), and leak both m1'' and m2''.

OTOH, the case of multiple slip-under (Z is slave of Y, which is a slave of X,
mount on Z, then mount of Y, then mount on X) the check for being busy would
do very odd things.

Something's fishy on the umount side...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman Jan. 11, 2017, 12:10 a.m. UTC | #2
Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Thu, Jan 05, 2017 at 10:04:14AM +1300, Eric W. Biederman wrote:
>
>>  - attach_shadows to be renamed __attach_mnt and the it's shadow
>>    handling to be removed.
>
> Er...  s/the it's/its/, presumably?  Or am I misparsing that?
>
>> v2: Updated to mnt_change_mountpoint to not call dput or mntput
>> and instead to decrement the counts directly.  It is guaranteed
>> that there will be other references when mnt_change_mountpoint is
>> called so this is safe.
>
>> +void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
>>  {
>
> Too generic name, IMO, and I really wonder if "mount" (== interpose) and
> "umount" (== excise?) cases would be better off separately.
>
>> +	struct mountpoint *old_mp = mnt->mnt_mp;
>> +	struct dentry *old_mountpoint = mnt->mnt_mountpoint;
>> +	struct mount *old_parent = mnt->mnt_parent;
>> +
>> +	list_del_init(&mnt->mnt_child);
>> +	hlist_del_init(&mnt->mnt_mp_list);
>> +	hlist_del_init_rcu(&mnt->mnt_hash);
>> +
>> +	attach_mnt(mnt, parent, mp);
>> +
>> +	put_mountpoint(old_mp);
>
>> +	 *
>> +	 * During mounting, another mount will continue to use the old
>> +	 * mountpoint and during unmounting, the old mountpoint will
>> +	 * continue to exist until namespace_unlock which happens well
>> +	 * after mnt_change_mountpoint.
>> +	 */
>
> Umm...  AFAICS, in the former case "another mount" is simply parent,
> right?

Yes it is the new parent mountpoint.  I was looking at it from a
different perspective.


>> +	spin_lock(&old_mountpoint->d_lock);
>> +	old_mountpoint->d_lockref.count--;
>> +	spin_unlock(&old_mountpoint->d_lock);
>> +	mnt_add_count(old_parent, -1);
>
>
>> +		if (child->mnt.mnt_root == smp->m_dentry) {
>
> Explain, please.  In which case is that condition _not_ satisfied, and
> what should happen i

When a tree is grafted in, that condition does not apply to the lower
leaves of the tree.  At the same time nothing needs to be done for those
leaves.  Only the primary mountpoint needs to worry about tucking.


>> +			struct mount *q;
>> +			q = __lookup_mnt(&child->mnt_parent->mnt,
>> +					 child->mnt_mountpoint);
>> +			if (q)
>> +				mnt_change_mountpoint(child, smp, q);
>
>
>>  	unlock_mount_hash();
>> +	put_mountpoint(smp);
>
> Wrong order...
>
>>  	ns->pending_mounts = 0;
>> +	put_mountpoint(smp);
>
> ... and worse yet here.

Definitely.  I totally spaced on propagating the locking changes to this
patch when I rebased it.

>>  static inline int do_refcount_check(struct mount *mnt, int count)
>>  {
>> +	struct mount *topper = __lookup_mnt(&mnt->mnt, mnt->mnt.mnt_root);
>> +	if (topper)
>> +		count++;
>>  	return mnt_get_count(mnt) > count;
>>  }
>
>> @@ -359,7 +362,7 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
>>  
>>  	for (m = propagation_next(parent, parent); m;
>>  	     		m = propagation_next(m, parent)) {
>> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
>> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
>>  		if (child && list_empty(&child->mnt_mounts) &&
>>  		    (ret = do_refcount_check(child, 1)))
>>  			break;
>
> Er...  You do realize that you can end up with more that one such
> propagation, right?  IOW, there might be more than one thing slipped
> in.

So I have stared at this a lot and I don't see what you seem to be
seeing here.  I do however see that propagate_mount_busy has been
buggy since the beginning, as in the propagation case it only fails if
the list of children is empty.

I also see that my modification to the code is buggy, since the
list_empty check precludes my changes to do_refcount_check from being
effective.

I have looked hard and your point with multiple propagations eludes me.

I am going to add a patch to fix propagate_mount_busy, and then rebase
this patch on top of that, and post it all for review.


>> +		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
>>  			SET_MNT_MARK(child);
>
> Reread the condition, please...  And yes, I realize that original is
> also rather odd; at a guess it was meant to be !(, not (!, but that's
> just a guess - it's your code, IIRC.

The intent is to find all trees that we can unmount where the point
at which the tree meets the rest of the mounts is not locked.

The mark is later used to see if it is ok to unmount a mount or if
we will reveal information to userspace (by breaking a lock).

Therefore the mark needs to be set if the mount is unlocked,
and recursively the mark needs to be set for every child of
that mount where the mark is set (the second condition).

Which makes the code essentially correct.

Unfortunately the code does not handle multiple unmounts from the same
parent mountpoint, which means shadow/side mount support and untucking
of mounts both fail to handle multiple unmounts from the same parent.

The way the mark is used fundamentally assumes only one operation on
each mountpoint, and that is broken.

I intend to work on propagate_umount some more and fix that brokenness
and hopefully fix the performance issues as well.  But I am leaving that
work for another change as it is going to require stripping out and
replacing algorithms, and so far I don't have good solutions.
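The marking rule described above can be reduced to a toy model (plain C,
illustrative names; not the kernel's code): visiting mounts in
parent-before-child order, a mount is marked if it is not locked, or if
its parent is already marked, so a locked child may still be unmounted
together with an unlocked subtree root.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mnt {
	struct toy_mnt *parent;	/* NULL for a subtree root */
	int locked;
	int marked;
};

/*
 * One step of the walk.  Callers must visit a parent before any of its
 * children, as the propagation walk does.
 */
static void mark_one(struct toy_mnt *m)
{
	if (!m->locked || (m->parent && m->parent->marked))
		m->marked = 1;
}
```

This is the sense in which the `||` is not "mark it again": the second
disjunct inherits the mark from an already-marked parent, extending the
unmountable subtree downward through locked children.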

>> @@ -420,8 +425,8 @@ static void __propagate_umount(struct mount *mnt)
>>  
>>  	for (m = propagation_next(parent, parent); m;
>>  			m = propagation_next(m, parent)) {
>> -
>> -		struct mount *child = __lookup_mnt_last(&m->mnt,
>> +		struct mount *topper;
>> +		struct mount *child = __lookup_mnt(&m->mnt,
>>  						mnt->mnt_mountpoint);
>>  		/*
>>  		 * umount the child only if the child has no children
>> @@ -430,6 +435,16 @@ static void __propagate_umount(struct mount *mnt)
>>  		if (!child || !IS_MNT_MARKED(child))
>>  			continue;
>>  		CLEAR_MNT_MARK(child);
>> +
>> +		/* If there is exactly one mount covering all of child
>> +		 * replace child with that mount.
>> +		 */
>> +		topper = __lookup_mnt(&child->mnt, child->mnt.mnt_root);
>> +		if (topper &&
>
>> +		    (child->mnt_mounts.next == &topper->mnt_child) &&
>> +		    (topper->mnt_child.next == &child->mnt_mounts))
>
> Weird way to spell child->mnt_mounts.next == child->mnt_mounts.prev, that...
> Or, perhaps, the entire thing ought to be

Except it is clearer than that.  It verifies not just that there is one
list item but that topper is that one list item.

Further it is the exact same logic as is used in do_refcount_check and
at this stage in development I figured it was more important to have
a recognizable pattern than to have the absolute most performant code.
Especially as the basic complexity of the code is the same either way.

> 		if (list_is_singular(&child->mnt_mounts)) {
> 			topper = list_first_entry(&child->mnt_mounts,
> 						  struct mount, mnt_child);
> 			if (topper->mnt_parent == child &&
> 			    topper->mnt_mountpoint == child->mnt.mnt_root)
>
> to avoid hash lookups.

That it would, and now that I see that list_is_singular exists, it looks
like a reasonable option.

>> +			mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, topper);
>
>
> FWIW, the my main worry here is your handling of the umount.  For
> example, what happens if
> 	* something is mounted on A (m1)
> 	* something else is mounted on A/bar (m2)
> 	* D is a slave of C
> 	* something has been mounted on D/foo (n)
> 	* you do mount --rbind A C/foo (m1' on C/foo, m2' on m1'/bar,
> 					m1'' interposed on D/foo under n,
> 					m2'' on m1''/bar,
> 					m1'' slave of m1', m2'' slave of m2)
> 	* you make C/foo and C/foo/bar private (m1'' and m2'' are not getting
> 					propagation from m1' and m2' anymore)
> 	* you umount C/foo/bar		(m2' is unmounted)
> 	* you umount C/foo
> m1' gets unmounted, all right, but what of m1''?  D is a slave of C, so we
> get propagation of umount from C/foo to D/foo; m1'' still has m2'' attached
> to it.  AFAICS, your logics will happily slip m1'' from under n (assuming
> that n itself is not busy), and leak both m1'' and m2''.

Yes.  This is exactly the same behavior we have today without my patch.
The only difference is who is the parent mount.

$ cat > viro1.sh << EOF
#!/bin/sh
set -e
set -x

mount -t tmpfs base /mnt
mkdir -p /mnt/A
mount -t tmpfs m1 /mnt/A
mkdir -p /mnt/A/bar
mount -t tmpfs m2 /mnt/A/bar

mkdir -p /mnt/D
mkdir -p /mnt/C
mount -t tmpfs mC /mnt/C
mkdir -p /mnt/C/foo
mount --make-shared /mnt/C
mount --bind /mnt/C /mnt/D
mount --make-slave /mnt/D
mount -t tmpfs n /mnt/D/foo
mount --rbind /mnt/A /mnt/C/foo

echo
cat /proc/self/mountinfo

mount --make-private /mnt/C/foo
mount --make-private /mnt/C/foo/bar

echo
cat /proc/self/mountinfo

umount /mnt/C/foo/bar

echo
cat /proc/self/mountinfo

umount /mnt/C/foo

echo
cat /proc/self/mountinfo
EOF
$ chmod +x ./viro1.sh
$ unshare -Urm ./viro1.sh

At least when I run the above on a kernel with and without my patch
under discussion, all that differs is the mnt_id of the parent.  Which is
exactly what we should expect from this change.

Did I make a mistake in creating my script?

Or are you referring to the fact that propagate_mount_busy is just
plain buggy?

> OTOH, the case of multiple slip-under (Z is slave of Y, which is a slave of X,
> mount on Z, then mount of Y, then mount on X) the check for being busy would
> do very odd things.

I don't see what you are referring to.  Reposting shortly.

Eric
Al Viro Jan. 11, 2017, 4:11 a.m. UTC | #3
On Wed, Jan 11, 2017 at 01:10:57PM +1300, Eric W. Biederman wrote:
> >> +		if (child->mnt.mnt_root == smp->m_dentry) {
> >
> > Explain, please.  In which case is that condition _not_ satisfied, and
> > what should happen i
> 
> When a tree is grafted in that condition does not apply to the lower
> leaves of the tree.  At the same time nothing needs to be done for those
> leaves.  Only the primary mountpoint needs to worry about tucking.

	How in hell would those lower leaves end up on the list in
attach_recursive_mnt()?  IDGI...

> >> +		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
> >>  			SET_MNT_MARK(child);
> >
> > Reread the condition, please...  And yes, I realize that original is
> > also rather odd; at a guess it was meant to be !(, not (!, but that's
> > just a guess - it's your code, IIRC.
> 
> The intent is to find all trees that we can unmount where the point
> at which the tree meets the rest of the mounts is not locked.
> 
> The mark is later used to see if it ok to unmount a mount or if
> we will reveal information to userspace (by breaking a lock).
> 
> Therefore the mark needs to be set if the mount is unlocked,
> and recursively the mark needs to be set for every child of
> that mount where the mark is set (the second condition).

*blink*

You have "if mount is not locked - mark it; if mount is already marked -
mark it again".  The latter part (|| IS_MNT_MARKED(mnt), that is) looks
very odd, won't you agree?  What the hell was that (its counterpart in
the earlier code) about?

I could understand something along the lines "mark it unless it's locked
or already marked", but your code is "mark it if it's not locked *or*
if it's already marked".  Makes no sense in that form.

> > FWIW, the my main worry here is your handling of the umount.  For
> > example, what happens if
> > 	* something is mounted on A (m1)
> > 	* something else is mounted on A/bar (m2)
> > 	* D is a slave of C
> > 	* something has been mounted on D/foo (n)
> > 	* you do mount --rbind A C/foo (m1' on C/foo, m2' on m1'/bar,
> > 					m1'' interposed on D/foo under n,
> > 					m2'' on m1''/bar,
> > 					m1'' slave of m1', m2'' slave of m2)
> > 	* you make C/foo and C/foo/bar private (m1'' and m2'' are not getting
> > 					propagation from m1' and m2' anymore)
> > 	* you umount C/foo/bar		(m2' is unmounted)
> > 	* you umount C/foo
> > m1' gets unmounted, all right, but what of m1''?  D is a slave of C, so we
> > get propagation of umount from C/foo to D/foo; m1'' still has m2'' attached
> > to it.  AFAICS, your logics will happily slip m1'' from under n (assuming
> > that n itself is not busy), and leak both m1'' and m2''.
> 
> Yes.  This is exactly the same behavior we have today without my patch.
> The only difference is who is the parent mount.

Not quite.  In the current tree m1'' should get stuck there (and be exposed
when n gets unmounted); AFAICS, your change will have it kicked out, with
m2'' still attached and still contributing to refcount of m1''.

I might be missing something (and I hadn't checked your script - right now
I'm at 16 hours of uptime after only 4 hours of sleep).  Will take a look
at that after I grab some sleep...
Eric W. Biederman Jan. 11, 2017, 4:03 p.m. UTC | #4
Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Wed, Jan 11, 2017 at 01:10:57PM +1300, Eric W. Biederman wrote:
>> >> +		if (child->mnt.mnt_root == smp->m_dentry) {
>> >
>> > Explain, please.  In which case is that condition _not_ satisfied, and
>> > what should happen i
>> 
>> When a tree is grafted in that condition does not apply to the lower
>> leaves of the tree.  At the same time nothing needs to be done for those
>> leaves.  Only the primary mountpoint needs to worry about tucking.
>
> 	How in hell would those lower leaves end up on the list in
> attach_recursive_mnt()?  IDGI...

The submounts of a mount tree that is being attached need to have
commit_tree called on them to attach them to a mount namespace.

>
>> >> +		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
>> >>  			SET_MNT_MARK(child);
>> >
>> > Reread the condition, please...  And yes, I realize that original is
>> > also rather odd; at a guess it was meant to be !(, not (!, but that's
>> > just a guess - it's your code, IIRC.
>> 
>> The intent is to find all trees that we can unmount where the point
>> at which the tree meets the rest of the mounts is not locked.
>> 
>> The mark is later used to see if it ok to unmount a mount or if
>> we will reveal information to userspace (by breaking a lock).
>> 
>> Therefore the mark needs to be set if the mount is unlocked,
>> and recursively the mark needs to be set for every child of
>> that mount where the mark is set (the second condition).
>
> *blink*
>
> You have "if mount is not locked - mark it; if mount is already marked -
> mark it again".  The latter part (|| IS_MNT_MARKED(mnt), that is) looks
> very odd, won't you agree?  What the hell was that (its counterpart in
> the earlier code) about?

Not mark it again.  If the parent is marked, mark the child.

This is about finding subtrees where the root of the subtree is unlocked
but the children may be locked to that root.  Still we can safely
unmount the entire subtree without revealing anything to userspace.

The walk is designed to happen from parent to the child mounts.

> I could understand something along the lines "mark it unless it's locked
> or already marked", but your code is "mark it if it's not locked *or*
> if it's already marked".  Makes no sense in that form.
>
>> > FWIW, the my main worry here is your handling of the umount.  For
>> > example, what happens if
>> > 	* something is mounted on A (m1)
>> > 	* something else is mounted on A/bar (m2)
>> > 	* D is a slave of C
>> > 	* something has been mounted on D/foo (n)
>> > 	* you do mount --rbind A C/foo (m1' on C/foo, m2' on m1'/bar,
>> > 					m1'' interposed on D/foo under n,
>> > 					m2'' on m1''/bar,
>> > 					m1'' slave of m1', m2'' slave of m2)
>> > 	* you make C/foo and C/foo/bar private (m1'' and m2'' are not getting
>> > 					propagation from m1' and m2' anymore)
>> > 	* you umount C/foo/bar		(m2' is unmounted)
>> > 	* you umount C/foo
>> > m1' gets unmounted, all right, but what of m1''?  D is a slave of C, so we
>> > get propagation of umount from C/foo to D/foo; m1'' still has m2'' attached
>> > to it.  AFAICS, your logics will happily slip m1'' from under n (assuming
>> > that n itself is not busy), and leak both m1'' and m2''.
>> 
>> Yes.  This is exactly the same behavior we have today without my patch.
>> The only difference is who is the parent mount.
>
> Not quite.  In the current tree m1'' should get stuck there (and be exposed
> when n gets unmounted); AFAICS, your change will have it kicked out, with
> m2'' still attached and still contributing to refcount of m1''.
>
> I might be missing something (and I hadn't checked your script - right now
> I'm at 16 hours of uptime after only 4 hours of sleep).  Will take a look
> at that after I grab some sleep...

Please look with fresh rested eyes.  I will see about posting a new
version of the patch shortly.

Eric

Al Viro Jan. 12, 2017, 5:03 a.m. UTC | #5
On Thu, Jan 12, 2017 at 05:03:42AM +1300, Eric W. Biederman wrote:
> Al Viro <viro@ZenIV.linux.org.uk> writes:
> 
> > On Wed, Jan 11, 2017 at 01:10:57PM +1300, Eric W. Biederman wrote:
> >> >> +		if (child->mnt.mnt_root == smp->m_dentry) {
> >> >
> >> > Explain, please.  In which case is that condition _not_ satisfied, and
> >> > what should happen i
> >> 
> >> When a tree is grafted in that condition does not apply to the lower
> >> leaves of the tree.  At the same time nothing needs to be done for those
> >> leaves.  Only the primary mountpoint needs to worry about tucking.
> >
> > 	How in hell would those lower leaves end up on the list in
> > attach_recursive_mnt()?  IDGI...
> 
> The submounts of a mount tree that is being attached need to have
> commit_tree called on them to attach them to a mount namespace.

Huh?  commit_tree() is called once for each copy of the source tree.  This
        list_add_tail(&head, &mnt->mnt_list);
        list_for_each_entry(m, &head, mnt_list)
                m->mnt_ns = n;
is what goes through submounts in each of them, _not_ the loop in the caller.

What we get out of propagate_mnt() is the list of copies of source tree,
one for each of the mountpoints that should get propagation from the
target.
	->mnt_mountpoint/->mnt_parent is fully set for all nodes.
	Everything except the roots of those trees is hashed and
has ->mnt_child set up.
	->mnt_hash of the roots of those copies host the cyclic list,
anchored in tree_list passed to propagate_mnt().
	->mnt_list in each copy forms an unanchored cyclic list
going through all mounts in that copy.

The loop in attach_recursive_mnt() takes the tree_list apart and for each
element (== each copy of source tree) we have commit_tree() called once,
doing the remaining work:
	* splices the ->mnt_list into the namespace's mount list
	* sets ->mnt_ns for all nodes (root and submounts alike)
	* sets ->mnt_child and ->mnt_hash for the root.

Again, the loop in attach_recursive_mnt() is over the set of secondary
copies of the source tree; it goes *only* through their roots.  Submounts
are seen only by commit_tree(), in the list_for_each_entry() loop in
that function.  Hell, try to add else WARN_ON(1); to that if (...) of yours
and see if you can trigger it if you don't believe the above...
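The `list_add_tail(&head, &mnt->mnt_list)` trick described above can be
reproduced in miniature (plain userspace C, illustrative names): an
unanchored cyclic list through all mounts of a copy gets a local head
spliced in, after which one linear walk stamps root and submounts alike.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct toy_mnt {
	struct list_head mnt_list;
	int mnt_ns;		/* toy namespace id, 0 = none */
};

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Build the unanchored two-node cycle a <-> b, as copy_tree leaves it. */
static void link_pair(struct toy_mnt *a, struct toy_mnt *b)
{
	a->mnt_list.next = a->mnt_list.prev = &b->mnt_list;
	b->mnt_list.next = b->mnt_list.prev = &a->mnt_list;
}

static void toy_commit_tree(struct toy_mnt *root, int ns)
{
	struct list_head head;

	/* Anchor a head in the cycle, like commit_tree() does. */
	list_add_tail(&head, &root->mnt_list);

	/* One pass over root and submounts alike. */
	for (struct list_head *p = head.next; p != &head; p = p->next) {
		struct toy_mnt *m = (struct toy_mnt *)
			((char *)p - offsetof(struct toy_mnt, mnt_list));
		m->mnt_ns = ns;
	}
}
```

This mirrors the point being made: submounts are reached only inside
commit_tree()'s own list walk, never by the caller's loop over tree roots.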

> > You have "if mount is not locked - mark it; if mount is already marked -
> > mark it again".  The latter part (|| IS_MNT_MARKED(mnt), that is) looks
> > very odd, won't you agree?  What the hell was that (its counterpart in
> > the earlier code) about?
> 
> Not mark it again.  If the parent is marked mark the child.

*doh*
Andrey Vagin May 14, 2017, 2:15 a.m. UTC | #6
Hi Eric,

I found one issue with this patch.  Take a look at the following script.
The idea is simple: we call mount twice and then umount twice, but the
second umount fails.

[root@fc24 ~]# cat m.sh
#!/bin/sh

set -x -e
mount -t tmpfs xxx /mnt
mkdir -p /mnt/1
mkdir -p /mnt/2
mount --bind /mnt/1 /mnt/1
mount --make-shared /mnt/1
mount --bind /mnt/1 /mnt/2
mkdir -p /mnt/1/1
for i in `seq 2`; do
	mount --bind /mnt/1/1 /mnt/1/1
done
for i in `seq 2`; do
	umount /mnt/1/1 || {
		cat /proc/self/mountinfo | grep xxx
		exit 1
	}
done

[root@fc24 ~]# unshare -Urm  ./m.sh
+ mount -t tmpfs xxx /mnt
+ mkdir -p /mnt/1
+ mkdir -p /mnt/2
+ mount --bind /mnt/1 /mnt/1
+ mount --make-shared /mnt/1
+ mount --bind /mnt/1 /mnt/2
+ mkdir -p /mnt/1/1
++ seq 2
+ for i in '`seq 2`'
+ mount --bind /mnt/1/1 /mnt/1/1
+ for i in '`seq 2`'
+ mount --bind /mnt/1/1 /mnt/1/1
++ seq 2
+ for i in '`seq 2`'
+ umount /mnt/1/1
+ for i in '`seq 2`'
+ umount /mnt/1/1
umount: /mnt/1/1: not mounted
+ cat /proc/self/mountinfo
+ grep xxx
147 116 0:42 / /mnt rw,relatime - tmpfs xxx rw
148 147 0:42 /1 /mnt/1 rw,relatime shared:65 - tmpfs xxx rw
149 147 0:42 /1 /mnt/2 rw,relatime shared:65 - tmpfs xxx rw
157 149 0:42 /1/1 /mnt/2/1 rw,relatime shared:65 - tmpfs xxx rw
+ exit 1

And you can see that /mnt/2 contains a mount, but it should not.

Thanks,
Andrei

On Thu, Jan 05, 2017 at 10:04:14AM +1300, Eric W. Biederman wrote:
> 
> Ever since mount propagation was introduced, in cases where a mount is
> propagated to a parent mount and mountpoint pair that is already in use,
> the code has placed the new mount behind the old mount in the mount hash
> table.
> 
> This implementation detail is problematic as it allows creating
> arbitrary length mount hash chains.
> 
> Furthermore it invalidates the constraint maintained elsewhere in the
> mount code that a parent mount and a mountpoint pair will have exactly
> one mount upon them.  Making it hard to deal with and to talk about
> this special case in the mount code.
> 
> Modify mount propagation to notice when there is already a mount at
> the parent mount and mountpoint where a new mount is propagating to
> and place that preexisting mount on top of the new mount.
> 
> Modify unmount propagation to notice when a mount that is being
> unmounted has another mount on top of it (and no other children), and
> to replace the unmounted mount with the mount on top of it.
> 
> Move the MNT_UMOUNT test from __lookup_mnt_last into
> __propagate_umount as that is the only call of __lookup_mnt_last where
> MNT_UMOUNT may be set on any mount visible in the mount hash table.
> 
> These modifications allow:
>  - __lookup_mnt_last to be removed.
>  - attach_shadows to be renamed __attach_mnt and the it's shadow
>    handling to be removed.
>  - commit_tree to be simplified
>  - copy_tree to be simplified
> 
> The result is an easier to understand tree of mounts that does not
> allow creation of arbitrary length hash chains in the mount hash table.
> 
> v2: Updated to mnt_change_mountpoint to not call dput or mntput
> and instead to decrement the counts directly.  It is guaranteed
> that there will be other references when mnt_change_mountpoint is
> called so this is safe.
> 
> Cc: stable@vger.kernel.org
> Fixes: b90fa9ae8f51 ("[PATCH] shared mount handling: bind and rbind")
> Tested-by: Andrei Vagin <avagin@virtuozzo.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> 
> Since the last version, as some of you may have seen, I have modified
> my implementation of mnt_change_mountpoint so that it no longer calls
> mntput or dput but instead relies on the knowledge that it cannot
> possibly have the last reference to the mnt and dentry of interest.
> This keeps code checking tools from complaining bitterly.
> 
> This is on top of my previous patch that sorts out locking of the
> mountpoint hash table.  After giving ample time for review I intend
> to push this and the previous bug fix to Linus.
> 
>  fs/mount.h     |   1 -
>  fs/namespace.c | 110 +++++++++++++++++++++++++++++++--------------------------
>  fs/pnode.c     |  27 ++++++++++----
>  fs/pnode.h     |   2 ++
>  4 files changed, 82 insertions(+), 58 deletions(-)
> 
> diff --git a/fs/mount.h b/fs/mount.h
> index 2c856fc47ae3..2826543a131d 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -89,7 +89,6 @@ static inline int is_mounted(struct vfsmount *mnt)
>  }
>  
>  extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *);
> -extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
>  
>  extern int __legitimize_mnt(struct vfsmount *, unsigned);
>  extern bool legitimize_mnt(struct vfsmount *, unsigned);
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 487ba30bb5c6..91ccfb73f0e0 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -637,28 +637,6 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
>  }
>  
>  /*
> - * find the last mount at @dentry on vfsmount @mnt.
> - * mount_lock must be held.
> - */
> -struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry)
> -{
> -	struct mount *p, *res = NULL;
> -	p = __lookup_mnt(mnt, dentry);
> -	if (!p)
> -		goto out;
> -	if (!(p->mnt.mnt_flags & MNT_UMOUNT))
> -		res = p;
> -	hlist_for_each_entry_continue(p, mnt_hash) {
> -		if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry)
> -			break;
> -		if (!(p->mnt.mnt_flags & MNT_UMOUNT))
> -			res = p;
> -	}
> -out:
> -	return res;
> -}
> -
> -/*
>   * lookup_mnt - Return the first child mount mounted at path
>   *
>   * "First" means first mounted chronologically.  If you create the
> @@ -878,6 +856,13 @@ void mnt_set_mountpoint(struct mount *mnt,
>  	hlist_add_head(&child_mnt->mnt_mp_list, &mp->m_list);
>  }
>  
> +static void __attach_mnt(struct mount *mnt, struct mount *parent)
> +{
> +	hlist_add_head_rcu(&mnt->mnt_hash,
> +			   m_hash(&parent->mnt, mnt->mnt_mountpoint));
> +	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
> +}
> +
>  /*
>   * vfsmount lock must be held for write
>   */
> @@ -886,28 +871,45 @@ static void attach_mnt(struct mount *mnt,
>  			struct mountpoint *mp)
>  {
>  	mnt_set_mountpoint(parent, mp, mnt);
> -	hlist_add_head_rcu(&mnt->mnt_hash, m_hash(&parent->mnt, mp->m_dentry));
> -	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
> +	__attach_mnt(mnt, parent);
>  }
>  
> -static void attach_shadowed(struct mount *mnt,
> -			struct mount *parent,
> -			struct mount *shadows)
> +void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
>  {
> -	if (shadows) {
> -		hlist_add_behind_rcu(&mnt->mnt_hash, &shadows->mnt_hash);
> -		list_add(&mnt->mnt_child, &shadows->mnt_child);
> -	} else {
> -		hlist_add_head_rcu(&mnt->mnt_hash,
> -				m_hash(&parent->mnt, mnt->mnt_mountpoint));
> -		list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
> -	}
> +	struct mountpoint *old_mp = mnt->mnt_mp;
> +	struct dentry *old_mountpoint = mnt->mnt_mountpoint;
> +	struct mount *old_parent = mnt->mnt_parent;
> +
> +	list_del_init(&mnt->mnt_child);
> +	hlist_del_init(&mnt->mnt_mp_list);
> +	hlist_del_init_rcu(&mnt->mnt_hash);
> +
> +	attach_mnt(mnt, parent, mp);
> +
> +	put_mountpoint(old_mp);
> +
> +	/*
> +	 * Safely avoid even the suggestion this code might sleep or
> +	 * lock the mount hash by taking advantage of the knowledge that
> +	 * mnt_change_mountpoint will not release the final reference
> +	 * to a mountpoint.
> +	 *
> +	 * During mounting, another mount will continue to use the old
> +	 * mountpoint and during unmounting, the old mountpoint will
> +	 * continue to exist until namespace_unlock which happens well
> +	 * after mnt_change_mountpoint.
> +	 */
> +	spin_lock(&old_mountpoint->d_lock);
> +	old_mountpoint->d_lockref.count--;
> +	spin_unlock(&old_mountpoint->d_lock);
> +
> +	mnt_add_count(old_parent, -1);
>  }
>  
>  /*
>   * vfsmount lock must be held for write
>   */
> -static void commit_tree(struct mount *mnt, struct mount *shadows)
> +static void commit_tree(struct mount *mnt)
>  {
>  	struct mount *parent = mnt->mnt_parent;
>  	struct mount *m;
> @@ -925,7 +927,7 @@ static void commit_tree(struct mount *mnt, struct mount *shadows)
>  	n->mounts += n->pending_mounts;
>  	n->pending_mounts = 0;
>  
> -	attach_shadowed(mnt, parent, shadows);
> +	__attach_mnt(mnt, parent);
>  	touch_mnt_namespace(n);
>  }
>  
> @@ -1764,7 +1766,6 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
>  			continue;
>  
>  		for (s = r; s; s = next_mnt(s, r)) {
> -			struct mount *t = NULL;
>  			if (!(flag & CL_COPY_UNBINDABLE) &&
>  			    IS_MNT_UNBINDABLE(s)) {
>  				s = skip_mnt_tree(s);
> @@ -1786,14 +1787,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
>  				goto out;
>  			lock_mount_hash();
>  			list_add_tail(&q->mnt_list, &res->mnt_list);
> -			mnt_set_mountpoint(parent, p->mnt_mp, q);
> -			if (!list_empty(&parent->mnt_mounts)) {
> -				t = list_last_entry(&parent->mnt_mounts,
> -					struct mount, mnt_child);
> -				if (t->mnt_mp != p->mnt_mp)
> -					t = NULL;
> -			}
> -			attach_shadowed(q, parent, t);
> +			attach_mnt(q, parent, p->mnt_mp);
>  			unlock_mount_hash();
>  		}
>  	}
> @@ -1992,10 +1986,18 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>  {
>  	HLIST_HEAD(tree_list);
>  	struct mnt_namespace *ns = dest_mnt->mnt_ns;
> +	struct mountpoint *smp;
>  	struct mount *child, *p;
>  	struct hlist_node *n;
>  	int err;
>  
> +	/* Preallocate a mountpoint in case the new mounts need
> +	 * to be tucked under other mounts.
> +	 */
> +	smp = get_mountpoint(source_mnt->mnt.mnt_root);
> +	if (IS_ERR(smp))
> +		return PTR_ERR(smp);
> +
>  	/* Is there space to add these mounts to the mount namespace? */
>  	if (!parent_path) {
>  		err = count_mounts(ns, source_mnt);
> @@ -2022,17 +2024,22 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>  		touch_mnt_namespace(source_mnt->mnt_ns);
>  	} else {
>  		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
> -		commit_tree(source_mnt, NULL);
> +		commit_tree(source_mnt);
>  	}
>  
>  	hlist_for_each_entry_safe(child, n, &tree_list, mnt_hash) {
> -		struct mount *q;
>  		hlist_del_init(&child->mnt_hash);
> -		q = __lookup_mnt_last(&child->mnt_parent->mnt,
> -				      child->mnt_mountpoint);
> -		commit_tree(child, q);
> +		if (child->mnt.mnt_root == smp->m_dentry) {
> +			struct mount *q;
> +			q = __lookup_mnt(&child->mnt_parent->mnt,
> +					 child->mnt_mountpoint);
> +			if (q)
> +				mnt_change_mountpoint(child, smp, q);
> +		}
> +		commit_tree(child);
>  	}
>  	unlock_mount_hash();
> +	put_mountpoint(smp);
>  
>  	return 0;
>  
> @@ -2046,6 +2053,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>  	cleanup_group_ids(source_mnt, NULL);
>   out:
>  	ns->pending_mounts = 0;
> +	put_mountpoint(smp);
>  	return err;
>  }
>  
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 06a793f4ae38..eb4331240fd1 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -327,6 +327,9 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
>   */
>  static inline int do_refcount_check(struct mount *mnt, int count)
>  {
> +	struct mount *topper = __lookup_mnt(&mnt->mnt, mnt->mnt.mnt_root);
> +	if (topper)
> +		count++;
>  	return mnt_get_count(mnt) > count;
>  }
>  
> @@ -359,7 +362,7 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
>  
>  	for (m = propagation_next(parent, parent); m;
>  	     		m = propagation_next(m, parent)) {
> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
>  		if (child && list_empty(&child->mnt_mounts) &&
>  		    (ret = do_refcount_check(child, 1)))
>  			break;
> @@ -381,7 +384,7 @@ void propagate_mount_unlock(struct mount *mnt)
>  
>  	for (m = propagation_next(parent, parent); m;
>  			m = propagation_next(m, parent)) {
> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
>  		if (child)
>  			child->mnt.mnt_flags &= ~MNT_LOCKED;
>  	}
> @@ -399,9 +402,11 @@ static void mark_umount_candidates(struct mount *mnt)
>  
>  	for (m = propagation_next(parent, parent); m;
>  			m = propagation_next(m, parent)) {
> -		struct mount *child = __lookup_mnt_last(&m->mnt,
> +		struct mount *child = __lookup_mnt(&m->mnt,
>  						mnt->mnt_mountpoint);
> -		if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) {
> +		if (!child || (child->mnt.mnt_flags & MNT_UMOUNT))
> +			continue;
> +		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
>  			SET_MNT_MARK(child);
>  		}
>  	}
> @@ -420,8 +425,8 @@ static void __propagate_umount(struct mount *mnt)
>  
>  	for (m = propagation_next(parent, parent); m;
>  			m = propagation_next(m, parent)) {
> -
> -		struct mount *child = __lookup_mnt_last(&m->mnt,
> +		struct mount *topper;
> +		struct mount *child = __lookup_mnt(&m->mnt,
>  						mnt->mnt_mountpoint);
>  		/*
>  		 * umount the child only if the child has no children
> @@ -430,6 +435,16 @@ static void __propagate_umount(struct mount *mnt)
>  		if (!child || !IS_MNT_MARKED(child))
>  			continue;
>  		CLEAR_MNT_MARK(child);
> +
> +		/* If there is exactly one mount covering all of child
> +		 * replace child with that mount.
> +		 */
> +		topper = __lookup_mnt(&child->mnt, child->mnt.mnt_root);
> +		if (topper &&
> +		    (child->mnt_mounts.next == &topper->mnt_child) &&
> +		    (topper->mnt_child.next == &child->mnt_mounts))
> +			mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, topper);
> +
>  		if (list_empty(&child->mnt_mounts)) {
>  			list_del_init(&child->mnt_child);
>  			child->mnt.mnt_flags |= MNT_UMOUNT;
> diff --git a/fs/pnode.h b/fs/pnode.h
> index 550f5a8b4fcf..dc87e65becd2 100644
> --- a/fs/pnode.h
> +++ b/fs/pnode.h
> @@ -49,6 +49,8 @@ int get_dominating_id(struct mount *mnt, const struct path *root);
>  unsigned int mnt_get_count(struct mount *mnt);
>  void mnt_set_mountpoint(struct mount *, struct mountpoint *,
>  			struct mount *);
> +void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp,
> +			   struct mount *mnt);
>  struct mount *copy_tree(struct mount *, struct dentry *, int);
>  bool is_path_reachable(struct mount *, struct dentry *,
>  			 const struct path *root);
> -- 
> 2.10.1
>
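Before the comments below, the tuck rule may be easier to see in miniature. The following is a hedged sketch, not kernel code: a toy Python model (the `Mount` class, `table` dict, and `attach`/`propagate` helpers are hypothetical stand-ins) of the invariant the patch establishes — each (parent mount, mountpoint) pair holds exactly one mount, and propagation to an occupied pair tucks the preexisting mount on top of the new one.

```python
# Toy model of the "tuck" rule from the patch above.  Illustrative
# Python only: Mount, table, attach and propagate are hypothetical
# stand-ins, not kernel APIs.  The mount hash is modelled as a dict
# mapping (parent, mountpoint) -> mount, so the invariant "exactly one
# mount per parent/mountpoint pair" is enforced by construction.

class Mount:
    def __init__(self, name):
        self.name = name

table = {}  # (parent Mount, mountpoint path) -> Mount

def attach(mnt, parent, mountpoint):
    # Invariant from the patch: a pair never holds more than one mount.
    assert (parent, mountpoint) not in table
    table[(parent, mountpoint)] = mnt

def propagate(new_mnt, parent, mountpoint):
    # If the pair is occupied, the preexisting mount is tucked on top
    # of the new mount (at the new mount's root), mirroring what
    # mnt_change_mountpoint() does; no shadow/side chain is created.
    old = table.pop((parent, mountpoint), None)
    attach(new_mnt, parent, mountpoint)
    if old is not None:
        attach(old, new_mnt, "/")

base = Mount("base")
a = Mount("a")
attach(a, base, "/mnt")

b = Mount("b")
propagate(b, base, "/mnt")   # b propagates to the occupied pair

assert table[(base, "/mnt")] is b   # the new mount sits at the pair
assert table[(b, "/")] is a         # the old mount now covers it
```

In this model a lookup at one pair can never return a chain of shadow mounts, which is the same property that lets the patch remove `__lookup_mnt_last`.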
Eric W. Biederman May 14, 2017, 4:05 a.m. UTC | #7
Andrei Vagin <avagin@virtuozzo.com> writes:

> Hi Eric,
>
> I found one issue with this patch. Take a look at the script below. The
> idea is simple: we call mount twice and then we call umount twice, but
> the second umount fails.

Definitely odd.  I will take a look.

Eric

>
> [root@fc24 ~]# cat m.sh
> #!/bin/sh
>
> set -x -e
> mount -t tmpfs xxx /mnt
> mkdir -p /mnt/1
> mkdir -p /mnt/2
> mount --bind /mnt/1 /mnt/1
> mount --make-shared /mnt/1
> mount --bind /mnt/1 /mnt/2
> mkdir -p /mnt/1/1
> for i in `seq 2`; do
> 	mount --bind /mnt/1/1 /mnt/1/1
> done
> for i in `seq 2`; do
> 	umount /mnt/1/1 || {
> 		cat /proc/self/mountinfo | grep xxx
> 		exit 1
> 	}
> done
>
> [root@fc24 ~]# unshare -Urm  ./m.sh
> + mount -t tmpfs xxx /mnt
> + mkdir -p /mnt/1
> + mkdir -p /mnt/2
> + mount --bind /mnt/1 /mnt/1
> + mount --make-shared /mnt/1
> + mount --bind /mnt/1 /mnt/2
> + mkdir -p /mnt/1/1
> ++ seq 2
> + for i in '`seq 2`'
> + mount --bind /mnt/1/1 /mnt/1/1
> + for i in '`seq 2`'
> + mount --bind /mnt/1/1 /mnt/1/1
> ++ seq 2
> + for i in '`seq 2`'
> + umount /mnt/1/1
> + for i in '`seq 2`'
> + umount /mnt/1/1
> umount: /mnt/1/1: not mounted
> + cat /proc/self/mountinfo
> + grep xxx
> 147 116 0:42 / /mnt rw,relatime - tmpfs xxx rw
> 148 147 0:42 /1 /mnt/1 rw,relatime shared:65 - tmpfs xxx rw
> 149 147 0:42 /1 /mnt/2 rw,relatime shared:65 - tmpfs xxx rw
> 157 149 0:42 /1/1 /mnt/2/1 rw,relatime shared:65 - tmpfs xxx rw
> + exit 1
>
> And you can see that /mnt/2 contains a mount, but it should not.
>
> Thanks,
> Andrei
>
Eric W. Biederman May 14, 2017, 9:26 a.m. UTC | #8
ebiederm@xmission.com (Eric W. Biederman) writes:

> Andrei Vagin <avagin@virtuozzo.com> writes:
>
>> Hi Eric,
>>
>> I found one issue with this patch. Take a look at the script below. The
>> idea is simple: we call mount twice and then we call umount twice, but
>> the second umount fails.
>
> Definitely odd.  I will take a look.

After a little more looking I have to say the only thing I can see wrong
in the behavior is that the first umount doesn't unmount everything.

Changing the mount propagation tree while it is being traversed
apparently prevents a complete traversal of the propagation tree.

Last time I was looking at this I was playing with a multipass algorithm
because of these peculiarities that happen with propagation trees.

I suspect we are going to need to fix umount to behave correctly before
we tune it for performance.  So that we actually know what correct
behavior is.

Eric
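The traversal hazard Eric describes is easy to reproduce in miniature. A hedged sketch in plain Python (illustrative only, unrelated to the kernel's actual data structures): removing an element from a sequence while iterating over it shifts later elements into already-visited positions, so the walk skips them — much as reparenting or removing mounts mid-walk can hide parts of the propagation tree.

```python
# Sketch of the mid-traversal mutation hazard described above.
# Plain Python, illustrative only.

peers = ["m1", "m2", "m3", "m4"]   # stand-ins for peer mounts

visited = []
for m in peers:             # like walking propagation_next()
    visited.append(m)
    if m == "m2":
        peers.remove("m2")  # mutate the structure mid-walk

# "m3" was shifted into the slot the iterator had already passed,
# so it is never visited.
assert visited == ["m1", "m2", "m4"]
```

A multipass approach, as mentioned above, sidesteps this by collecting candidates in one pass and mutating the tree only afterwards.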


>>> +
>>> +	attach_mnt(mnt, parent, mp);
>>> +
>>> +	put_mountpoint(old_mp);
>>> +
>>> +	/*
>>> +	 * Safely avoid even the suggestion this code might sleep or
>>> +	 * lock the mount hash by taking advantage of the knowledge that
>>> +	 * mnt_change_mountpoint will not release the final reference
>>> +	 * to a mountpoint.
>>> +	 *
>>> +	 * During mounting, another mount will continue to use the old
>>> +	 * mountpoint and during unmounting, the old mountpoint will
>>> +	 * continue to exist until namespace_unlock which happens well
>>> +	 * after mnt_change_mountpoint.
>>> +	 */
>>> +	spin_lock(&old_mountpoint->d_lock);
>>> +	old_mountpoint->d_lockref.count--;
>>> +	spin_unlock(&old_mountpoint->d_lock);
>>> +
>>> +	mnt_add_count(old_parent, -1);
>>>  }
>>>  
>>>  /*
>>>   * vfsmount lock must be held for write
>>>   */
>>> -static void commit_tree(struct mount *mnt, struct mount *shadows)
>>> +static void commit_tree(struct mount *mnt)
>>>  {
>>>  	struct mount *parent = mnt->mnt_parent;
>>>  	struct mount *m;
>>> @@ -925,7 +927,7 @@ static void commit_tree(struct mount *mnt, struct mount *shadows)
>>>  	n->mounts += n->pending_mounts;
>>>  	n->pending_mounts = 0;
>>>  
>>> -	attach_shadowed(mnt, parent, shadows);
>>> +	__attach_mnt(mnt, parent);
>>>  	touch_mnt_namespace(n);
>>>  }
>>>  
>>> @@ -1764,7 +1766,6 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
>>>  			continue;
>>>  
>>>  		for (s = r; s; s = next_mnt(s, r)) {
>>> -			struct mount *t = NULL;
>>>  			if (!(flag & CL_COPY_UNBINDABLE) &&
>>>  			    IS_MNT_UNBINDABLE(s)) {
>>>  				s = skip_mnt_tree(s);
>>> @@ -1786,14 +1787,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
>>>  				goto out;
>>>  			lock_mount_hash();
>>>  			list_add_tail(&q->mnt_list, &res->mnt_list);
>>> -			mnt_set_mountpoint(parent, p->mnt_mp, q);
>>> -			if (!list_empty(&parent->mnt_mounts)) {
>>> -				t = list_last_entry(&parent->mnt_mounts,
>>> -					struct mount, mnt_child);
>>> -				if (t->mnt_mp != p->mnt_mp)
>>> -					t = NULL;
>>> -			}
>>> -			attach_shadowed(q, parent, t);
>>> +			attach_mnt(q, parent, p->mnt_mp);
>>>  			unlock_mount_hash();
>>>  		}
>>>  	}
>>> @@ -1992,10 +1986,18 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>>>  {
>>>  	HLIST_HEAD(tree_list);
>>>  	struct mnt_namespace *ns = dest_mnt->mnt_ns;
>>> +	struct mountpoint *smp;
>>>  	struct mount *child, *p;
>>>  	struct hlist_node *n;
>>>  	int err;
>>>  
>>> +	/* Preallocate a mountpoint in case the new mounts need
>>> +	 * to be tucked under other mounts.
>>> +	 */
>>> +	smp = get_mountpoint(source_mnt->mnt.mnt_root);
>>> +	if (IS_ERR(smp))
>>> +		return PTR_ERR(smp);
>>> +
>>>  	/* Is there space to add these mounts to the mount namespace? */
>>>  	if (!parent_path) {
>>>  		err = count_mounts(ns, source_mnt);
>>> @@ -2022,17 +2024,22 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>>>  		touch_mnt_namespace(source_mnt->mnt_ns);
>>>  	} else {
>>>  		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
>>> -		commit_tree(source_mnt, NULL);
>>> +		commit_tree(source_mnt);
>>>  	}
>>>  
>>>  	hlist_for_each_entry_safe(child, n, &tree_list, mnt_hash) {
>>> -		struct mount *q;
>>>  		hlist_del_init(&child->mnt_hash);
>>> -		q = __lookup_mnt_last(&child->mnt_parent->mnt,
>>> -				      child->mnt_mountpoint);
>>> -		commit_tree(child, q);
>>> +		if (child->mnt.mnt_root == smp->m_dentry) {
>>> +			struct mount *q;
>>> +			q = __lookup_mnt(&child->mnt_parent->mnt,
>>> +					 child->mnt_mountpoint);
>>> +			if (q)
>>> +				mnt_change_mountpoint(child, smp, q);
>>> +		}
>>> +		commit_tree(child);
>>>  	}
>>>  	unlock_mount_hash();
>>> +	put_mountpoint(smp);
>>>  
>>>  	return 0;
>>>  
>>> @@ -2046,6 +2053,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>>>  	cleanup_group_ids(source_mnt, NULL);
>>>   out:
>>>  	ns->pending_mounts = 0;
>>> +	put_mountpoint(smp);
>>>  	return err;
>>>  }
>>>  
>>> diff --git a/fs/pnode.c b/fs/pnode.c
>>> index 06a793f4ae38..eb4331240fd1 100644
>>> --- a/fs/pnode.c
>>> +++ b/fs/pnode.c
>>> @@ -327,6 +327,9 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
>>>   */
>>>  static inline int do_refcount_check(struct mount *mnt, int count)
>>>  {
>>> +	struct mount *topper = __lookup_mnt(&mnt->mnt, mnt->mnt.mnt_root);
>>> +	if (topper)
>>> +		count++;
>>>  	return mnt_get_count(mnt) > count;
>>>  }
>>>  
>>> @@ -359,7 +362,7 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
>>>  
>>>  	for (m = propagation_next(parent, parent); m;
>>>  	     		m = propagation_next(m, parent)) {
>>> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
>>> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
>>>  		if (child && list_empty(&child->mnt_mounts) &&
>>>  		    (ret = do_refcount_check(child, 1)))
>>>  			break;
>>> @@ -381,7 +384,7 @@ void propagate_mount_unlock(struct mount *mnt)
>>>  
>>>  	for (m = propagation_next(parent, parent); m;
>>>  			m = propagation_next(m, parent)) {
>>> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
>>> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
>>>  		if (child)
>>>  			child->mnt.mnt_flags &= ~MNT_LOCKED;
>>>  	}
>>> @@ -399,9 +402,11 @@ static void mark_umount_candidates(struct mount *mnt)
>>>  
>>>  	for (m = propagation_next(parent, parent); m;
>>>  			m = propagation_next(m, parent)) {
>>> -		struct mount *child = __lookup_mnt_last(&m->mnt,
>>> +		struct mount *child = __lookup_mnt(&m->mnt,
>>>  						mnt->mnt_mountpoint);
>>> -		if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) {
>>> +		if (!child || (child->mnt.mnt_flags & MNT_UMOUNT))
>>> +			continue;
>>> +		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
>>>  			SET_MNT_MARK(child);
>>>  		}
>>>  	}
>>> @@ -420,8 +425,8 @@ static void __propagate_umount(struct mount *mnt)
>>>  
>>>  	for (m = propagation_next(parent, parent); m;
>>>  			m = propagation_next(m, parent)) {
>>> -
>>> -		struct mount *child = __lookup_mnt_last(&m->mnt,
>>> +		struct mount *topper;
>>> +		struct mount *child = __lookup_mnt(&m->mnt,
>>>  						mnt->mnt_mountpoint);
>>>  		/*
>>>  		 * umount the child only if the child has no children
>>> @@ -430,6 +435,16 @@ static void __propagate_umount(struct mount *mnt)
>>>  		if (!child || !IS_MNT_MARKED(child))
>>>  			continue;
>>>  		CLEAR_MNT_MARK(child);
>>> +
>>> +		/* If there is exactly one mount covering all of child
>>> +		 * replace child with that mount.
>>> +		 */
>>> +		topper = __lookup_mnt(&child->mnt, child->mnt.mnt_root);
>>> +		if (topper &&
>>> +		    (child->mnt_mounts.next == &topper->mnt_child) &&
>>> +		    (topper->mnt_child.next == &child->mnt_mounts))
>>> +			mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, topper);
>>> +
>>>  		if (list_empty(&child->mnt_mounts)) {
>>>  			list_del_init(&child->mnt_child);
>>>  			child->mnt.mnt_flags |= MNT_UMOUNT;
>>> diff --git a/fs/pnode.h b/fs/pnode.h
>>> index 550f5a8b4fcf..dc87e65becd2 100644
>>> --- a/fs/pnode.h
>>> +++ b/fs/pnode.h
>>> @@ -49,6 +49,8 @@ int get_dominating_id(struct mount *mnt, const struct path *root);
>>>  unsigned int mnt_get_count(struct mount *mnt);
>>>  void mnt_set_mountpoint(struct mount *, struct mountpoint *,
>>>  			struct mount *);
>>> +void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp,
>>> +			   struct mount *mnt);
>>>  struct mount *copy_tree(struct mount *, struct dentry *, int);
>>>  bool is_path_reachable(struct mount *, struct dentry *,
>>>  			 const struct path *root);
>>> -- 
>>> 2.10.1
>>>
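The umount side of the change can be sketched the same way: when the mount being unmounted has exactly one child, and that child covers it at its root (the "topper"), the topper is moved into its place rather than being torn down with it. Again a toy model in plain Python with invented names, not kernel code; the real check is the list-pointer comparison in __propagate_umount.

```python
# Toy sketch of the umount-side rule added in __propagate_umount
# (illustrative only; invented names, no locking or refcounting).

class Mount:
    def __init__(self, name, root="/"):
        self.name, self.root = name, root
        self.parent = None
        self.mountpoint = None
        self.children = {}          # mountpoint -> child Mount

def umount(mnt):
    """Remove mnt.  If its only child covers it at its root,
    promote that child ("topper") into mnt's place."""
    topper = mnt.children.get(mnt.root)
    if topper is not None and len(mnt.children) == 1:
        # Analogue of mnt_change_mountpoint(mnt->mnt_parent,
        # mnt->mnt_mp, topper) in the patch.
        mnt.parent.children[mnt.mountpoint] = topper
        topper.parent, topper.mountpoint = mnt.parent, mnt.mountpoint
        mnt.children.clear()
        return True
    if not mnt.children:
        del mnt.parent.children[mnt.mountpoint]
        return True
    return False                    # busy: has other children

parent, tucked, topper = Mount("parent"), Mount("tucked"), Mount("topper")
parent.children["/mnt"] = tucked
tucked.parent, tucked.mountpoint = parent, "/mnt"
tucked.children["/"] = topper
topper.parent, topper.mountpoint = tucked, "/"

assert umount(tucked)
assert parent.children["/mnt"] is topper   # topper took tucked's place
```

This is also why do_refcount_check gains the topper test: a mount that is covered only by its topper is still considered unmountable, since the topper will survive in its place.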