diff mbox

[v5] mnt: Tuck mounts under others instead of creating shadow/side mounts.

Message ID 87lgtn167n.fsf@xmission.com (mailing list archive)
State New, archived
Headers show

Commit Message

Eric W. Biederman Feb. 3, 2017, 6:26 p.m. UTC
Ram Pai <linuxram@us.ibm.com> writes:

> On Fri, Feb 03, 2017 at 11:54:21PM +1300, Eric W. Biederman wrote:
>> ebiederm@xmission.com (Eric W. Biederman) writes:
>> 
>> > Ram Pai <linuxram@us.ibm.com> writes:
>> >
>> >> On Sat, Jan 21, 2017 at 05:15:29PM +1300, Eric W. Biederman wrote:
>> >>> Ram Pai <linuxram@us.ibm.com> writes:
>> >>> 
>> >>> >> @@ -359,12 +373,24 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
>> >>> >> 
>> >>> >>  	for (m = propagation_next(parent, parent); m;
>> >>> >>  	     		m = propagation_next(m, parent)) {
>> >>> >> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
>> >>> >> -		if (child && list_empty(&child->mnt_mounts) &&
>> >>> >> -		    (ret = do_refcount_check(child, 1)))
>> >>> >> -			break;
>> >>> >> +		int count = 1;
>> >>> >> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
>> >>> >> +		if (!child)
>> >>> >> +			continue;
>> >>> >> +
>> >>> >> +		/* Is there exactly one mount on the child that covers
>> >>> >> +		 * it completely whose reference should be ignored?
>> >>> >> +		 */
>> >>> >> +		topper = find_topper(child);
>> >>> >
>> >>> > This is tricky. I understand it is trying to identify the case where a
>> >>> > mount got tucked-in because of propagation.  But this will not
>> >>> > distinguish the case where a mount got over-mounted genuinely, not because of
>> >>> > propagation, but because of explicit user action.
>> >>> >
>> >>> >
>> >>> > example:
>> >>> >
>> >>> > case 1: (explicit user action)
>> >>> > 	B is a slave of A
>> >>> > 	mount something on A/a , it will propagate to B/a
>> >>> > 	and than mount something on B/a
>> >>> >
>> >>> > case 2: (tucked mount)
>> >>> > 	B is a slave of A
>> >>> > 	mount something on B/a
>> >>> > 	and than mount something on A/a
>> >>> >
>> >>> > Both case 1 and case 2 lead to the same mount configuration.
>> >>> >
>> >>> >
>> >>> > 	  however 'umount A/a' in case 1 should fail.
>> >>> > 	  and 'umount A/a' in case 2 should pass.
>> >>> >
>> >>> > Right? in other words, umounts of 'tucked mounts' should pass(case 2).
>> >>> > 	whereas umounts of mounts on which overmounts exist should
>> >>> > 		fail.(case 1)
>> >>> 
>> >>> Looking at your example.  I agree that case 1 will fail today.
>> >>
>> >> And should continue to fail. right? Your semantics change will pass it.
>> >
>> > I don't see why it should continue to fail.
>> >
>> >>> However my actual expectation would be for both mount configurations
>> >>> to behave the same.  In both cases something has been explicitly mounted
>> >>> on B/a and something has propagated to B/a.  In both cases the mount
>> >>> on top is what was explicitly mounted, and the mount below is what was
>> >>> propagated to B/a.
>> >>> 
>> >>> I don't see why the order of operations should matter.
>> >>
>> >> One of the subtle expectation is reversibility.
>> >>
>> >> Mount followed immediately by unmount has always passed and that is the
>> >> standard expectation always. Your proposed code will ensure that.
>> >>
>> >> However there is one other subtle expectaton.
>> >>
>> >> A mount cannot disappear if a user has explicitly mounted on top of it.
>> >>
>> >> your proposed code will not meet that expectation. 
>> >>
>> >> In other words, these two expectations make it behave differently even
>> >> when; arguably, they feel like the same configuration.
>> >
>> > I am not seeing that.
>> >
>> >
>> >
>> >>> 
>> >>> > maybe we need a flag to identify tucked mounts?
>> >>> 
>> >>> To preserve our exact current semantics yes.
>> >>> 
>> >>> The mount configurations that are delibearately constructed that I am
>> >>> aware of are comparatively simple.  I don't think anyone has even taken
>> >>> advantage of the shadow/side mounts at this point.  I made a reasonable
>> >>> effort to find out and no one was even aware they existed.  Much less
>> >>> what they were.  And certainly no one I talked to could find code that
>> >>> used them.
>> >>
>> >> But someday; even if its after a decade, someone ;) will
>> >> stumble into this semantics and wonder 'why?'. Its better to get it right
>> >> sooner. Sorry, I am blaming myself; for keeping some of the problems
>> >> open thinking no one will bump into them.
>> >
>> > Oh definitely.  If we have people ready to talk it through I am happy to
>> > dot as many i's and cross as many t's as we productively can.
>> >
>> > I was just pointing out that I don't have any reason to expect that any
>> > one depends on the subtle details of the implementation today so we
>> > still have some wiggle room to fix them.  Even if they are visible to
>> > user space.
>> 
>> So I haven't seen a reply, and we are getting awfully close to the merge
>> window.  Is there anything concrete we can do to ease concerns?
>> 
>> Right now I am thinking my last version of the patch is the likely the
>> best we have time and energy to manage and it would be good to merge it
>> before the code bit rots.
>
> I was waiting for some other opinions on the behavior, since I
> continue to think that 'one should not be able to unmount mounts on
> which a user has explicitly mounted upon'. I am happy to be overruled,
> since your patch significantly improves the rest of the semantics.
>
> Viro?

Ram Pai, just to be clear you were hoping to add the logic below to my patch?

My objections to the snippet below are:

- It makes it hard for the CRIU folks (yet more state they have to find
  and restore).

- It feels subjectively worse to me.

- We already have cases where mounts are unmounted transparently (umount on rmdir).

- Al Viro claims that the side/shadow mounts are ordinary mounts and
  maintaining this extra logic that remembers if we tucked one mount
  under another seems to make this them less ordinary.

- The symmetry for unmounting exists for a tucked mount.  We can unmount
  it via propagation or we can unmount the mount above it, and then we
  can unmount the new underlying mount.  So I don't see why we don't
  want symmetry in the other case just because we mounted on top of
  the mount and rather than had the mount tucked under us.


Eric

Comments

Ram Pai Feb. 3, 2017, 8:28 p.m. UTC | #1
On Sat, Feb 04, 2017 at 07:26:20AM +1300, Eric W. Biederman wrote:
> Ram Pai <linuxram@us.ibm.com> writes:
> 
> > On Fri, Feb 03, 2017 at 11:54:21PM +1300, Eric W. Biederman wrote:
> >> ebiederm@xmission.com (Eric W. Biederman) writes:
> >> 
> >> > Ram Pai <linuxram@us.ibm.com> writes:
> >> >
> >> >> On Sat, Jan 21, 2017 at 05:15:29PM +1300, Eric W. Biederman wrote:
> >> >>> Ram Pai <linuxram@us.ibm.com> writes:
> >> >>> 
> >> >>> >> @@ -359,12 +373,24 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
> >> >>> >> 
> >> >>> >>  	for (m = propagation_next(parent, parent); m;
> >> >>> >>  	     		m = propagation_next(m, parent)) {
> >> >>> >> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
> >> >>> >> -		if (child && list_empty(&child->mnt_mounts) &&
> >> >>> >> -		    (ret = do_refcount_check(child, 1)))
> >> >>> >> -			break;
> >> >>> >> +		int count = 1;
> >> >>> >> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
> >> >>> >> +		if (!child)
> >> >>> >> +			continue;
> >> >>> >> +
> >> >>> >> +		/* Is there exactly one mount on the child that covers
> >> >>> >> +		 * it completely whose reference should be ignored?
> >> >>> >> +		 */
> >> >>> >> +		topper = find_topper(child);
> >> >>> >
> >> >>> > This is tricky. I understand it is trying to identify the case where a
> >> >>> > mount got tucked-in because of propagation.  But this will not
> >> >>> > distinguish the case where a mount got over-mounted genuinely, not because of
> >> >>> > propagation, but because of explicit user action.
> >> >>> >
> >> >>> >
> >> >>> > example:
> >> >>> >
> >> >>> > case 1: (explicit user action)
> >> >>> > 	B is a slave of A
> >> >>> > 	mount something on A/a , it will propagate to B/a
> >> >>> > 	and than mount something on B/a
> >> >>> >
> >> >>> > case 2: (tucked mount)
> >> >>> > 	B is a slave of A
> >> >>> > 	mount something on B/a
> >> >>> > 	and than mount something on A/a
> >> >>> >
> >> >>> > Both case 1 and case 2 lead to the same mount configuration.
> >> >>> >
> >> >>> >
> >> >>> > 	  however 'umount A/a' in case 1 should fail.
> >> >>> > 	  and 'umount A/a' in case 2 should pass.
> >> >>> >
> >> >>> > Right? in other words, umounts of 'tucked mounts' should pass(case 2).
> >> >>> > 	whereas umounts of mounts on which overmounts exist should
> >> >>> > 		fail.(case 1)
> >> >>> 
> >> >>> Looking at your example.  I agree that case 1 will fail today.
> >> >>
> >> >> And should continue to fail. right? Your semantics change will pass it.
> >> >
> >> > I don't see why it should continue to fail.
> >> >
> >> >>> However my actual expectation would be for both mount configurations
> >> >>> to behave the same.  In both cases something has been explicitly mounted
> >> >>> on B/a and something has propagated to B/a.  In both cases the mount
> >> >>> on top is what was explicitly mounted, and the mount below is what was
> >> >>> propagated to B/a.
> >> >>> 
> >> >>> I don't see why the order of operations should matter.
> >> >>
> >> >> One of the subtle expectation is reversibility.
> >> >>
> >> >> Mount followed immediately by unmount has always passed and that is the
> >> >> standard expectation always. Your proposed code will ensure that.
> >> >>
> >> >> However there is one other subtle expectaton.
> >> >>
> >> >> A mount cannot disappear if a user has explicitly mounted on top of it.
> >> >>
> >> >> your proposed code will not meet that expectation. 
> >> >>
> >> >> In other words, these two expectations make it behave differently even
> >> >> when; arguably, they feel like the same configuration.
> >> >
> >> > I am not seeing that.
> >> >
> >> >
> >> >
> >> >>> 
> >> >>> > maybe we need a flag to identify tucked mounts?
> >> >>> 
> >> >>> To preserve our exact current semantics yes.
> >> >>> 
> >> >>> The mount configurations that are delibearately constructed that I am
> >> >>> aware of are comparatively simple.  I don't think anyone has even taken
> >> >>> advantage of the shadow/side mounts at this point.  I made a reasonable
> >> >>> effort to find out and no one was even aware they existed.  Much less
> >> >>> what they were.  And certainly no one I talked to could find code that
> >> >>> used them.
> >> >>
> >> >> But someday; even if its after a decade, someone ;) will
> >> >> stumble into this semantics and wonder 'why?'. Its better to get it right
> >> >> sooner. Sorry, I am blaming myself; for keeping some of the problems
> >> >> open thinking no one will bump into them.
> >> >
> >> > Oh definitely.  If we have people ready to talk it through I am happy to
> >> > dot as many i's and cross as many t's as we productively can.
> >> >
> >> > I was just pointing out that I don't have any reason to expect that any
> >> > one depends on the subtle details of the implementation today so we
> >> > still have some wiggle room to fix them.  Even if they are visible to
> >> > user space.
> >> 
> >> So I haven't seen a reply, and we are getting awfully close to the merge
> >> window.  Is there anything concrete we can do to ease concerns?
> >> 
> >> Right now I am thinking my last version of the patch is the likely the
> >> best we have time and energy to manage and it would be good to merge it
> >> before the code bit rots.
> >
> > I was waiting for some other opinions on the behavior, since I
> > continue to think that 'one should not be able to unmount mounts on
> > which a user has explicitly mounted upon'. I am happy to be overruled,
> > since your patch significantly improves the rest of the semantics.
> >
> > Viro?
> 
> Ram Pai, just to be clear you were hoping to add the logic below to my patch?

Yes. the behavior of your patch below is what I was proposing.

> 
> My objections to the snippet below are:
> 
> - It makes it hard for the CRIU folks (yet more state they have to find
>   and restore).

true. unfortunately one more subtle detail to be aware off.

> 
> - It feels subjectively worse to me.
> 
> - We already have cases where mounts are unmounted transparently (umount on rmdir).

sorry. i am not aware of this case. some details will help.

> 
> - Al Viro claims that the side/shadow mounts are ordinary mounts and
>   maintaining this extra logic that remembers if we tucked one mount
>   under another seems to make this them less ordinary.

I tend to argue that they are a bit more than ordinary, for they have the
ability to tuck.

> 
> - The symmetry for unmounting exists for a tucked mount.  We can unmount
>   it via propagation or we can unmount the mount above it, and then we
>   can unmount the new underlying mount.

this is fine with me.

>   So I don't see why we don't
>   want symmetry in the other case just because we mounted on top of
>   the mount and rather than had the mount tucked under us.

A tucked mount should be un-tuckable. I agree.  But a non-tucked mount
cannot pretend to be tucked and this is where I disagree.


> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 8bfad42c1ccf..8b00e0548438 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2047,8 +2047,10 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>  		hlist_del_init(&child->mnt_hash);
>  		q = __lookup_mnt(&child->mnt_parent->mnt,
>  				 child->mnt_mountpoint);
> -		if (q)
> +		if (q) {
>  			mnt_change_mountpoint(child, smp, q);
> +			child->mnt.mnt_flags |= MNT_TUCKED;
> +		}
>  		commit_tree(child);
>  	}
>  	put_mountpoint(smp);
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 5bc7896d122a..e2a6ac68feb9 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -327,6 +327,9 @@ static struct mount *find_topper(struct mount *mnt)
>  	/* If there is exactly one mount covering mnt completely return it. */
>  	struct mount *child;
> 
> +	if (!(mnt->mnt.mnt_flags & MNT_TUCKED))
> +		return NULL;
> +	
>  	if (!list_is_singular(&mnt->mnt_mounts))
>  		return NULL;
> 
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 8e0352af06b7..25ca398b19b3 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -52,6 +52,7 @@ struct mnt_namespace;
> 
>  #define MNT_INTERNAL	0x4000
> 
> +#define MNT_TUCKED		0x020000
>  #define MNT_LOCK_ATIME		0x040000
>  #define MNT_LOCK_NOEXEC		0x080000
>  #define MNT_LOCK_NOSUID		0x100000
> 
> Eric
Eric W. Biederman Feb. 3, 2017, 8:58 p.m. UTC | #2
Ram Pai <linuxram@us.ibm.com> writes:

> On Sat, Feb 04, 2017 at 07:26:20AM +1300, Eric W. Biederman wrote:
>> Ram Pai <linuxram@us.ibm.com> writes:
>> 
>> > On Fri, Feb 03, 2017 at 11:54:21PM +1300, Eric W. Biederman wrote:
>> >> ebiederm@xmission.com (Eric W. Biederman) writes:
>> >> 
>> >> > Ram Pai <linuxram@us.ibm.com> writes:
>> >> >
>> >> >> On Sat, Jan 21, 2017 at 05:15:29PM +1300, Eric W. Biederman wrote:
>> >> >>> Ram Pai <linuxram@us.ibm.com> writes:
>> >> >>> 
>> >> >>> >> @@ -359,12 +373,24 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
>> >> >>> >> 
>> >> >>> >>  	for (m = propagation_next(parent, parent); m;
>> >> >>> >>  	     		m = propagation_next(m, parent)) {
>> >> >>> >> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
>> >> >>> >> -		if (child && list_empty(&child->mnt_mounts) &&
>> >> >>> >> -		    (ret = do_refcount_check(child, 1)))
>> >> >>> >> -			break;
>> >> >>> >> +		int count = 1;
>> >> >>> >> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
>> >> >>> >> +		if (!child)
>> >> >>> >> +			continue;
>> >> >>> >> +
>> >> >>> >> +		/* Is there exactly one mount on the child that covers
>> >> >>> >> +		 * it completely whose reference should be ignored?
>> >> >>> >> +		 */
>> >> >>> >> +		topper = find_topper(child);
>> >> >>> >
>> >> >>> > This is tricky. I understand it is trying to identify the case where a
>> >> >>> > mount got tucked-in because of propagation.  But this will not
>> >> >>> > distinguish the case where a mount got over-mounted genuinely, not because of
>> >> >>> > propagation, but because of explicit user action.
>> >> >>> >
>> >> >>> >
>> >> >>> > example:
>> >> >>> >
>> >> >>> > case 1: (explicit user action)
>> >> >>> > 	B is a slave of A
>> >> >>> > 	mount something on A/a , it will propagate to B/a
>> >> >>> > 	and than mount something on B/a
>> >> >>> >
>> >> >>> > case 2: (tucked mount)
>> >> >>> > 	B is a slave of A
>> >> >>> > 	mount something on B/a
>> >> >>> > 	and than mount something on A/a
>> >> >>> >
>> >> >>> > Both case 1 and case 2 lead to the same mount configuration.
>> >> >>> >
>> >> >>> >
>> >> >>> > 	  however 'umount A/a' in case 1 should fail.
>> >> >>> > 	  and 'umount A/a' in case 2 should pass.
>> >> >>> >
>> >> >>> > Right? in other words, umounts of 'tucked mounts' should pass(case 2).
>> >> >>> > 	whereas umounts of mounts on which overmounts exist should
>> >> >>> > 		fail.(case 1)
>> >> >>> 
>> >> >>> Looking at your example.  I agree that case 1 will fail today.
>> >> >>
>> >> >> And should continue to fail. right? Your semantics change will pass it.
>> >> >
>> >> > I don't see why it should continue to fail.
>> >> >
>> >> >>> However my actual expectation would be for both mount configurations
>> >> >>> to behave the same.  In both cases something has been explicitly mounted
>> >> >>> on B/a and something has propagated to B/a.  In both cases the mount
>> >> >>> on top is what was explicitly mounted, and the mount below is what was
>> >> >>> propagated to B/a.
>> >> >>> 
>> >> >>> I don't see why the order of operations should matter.
>> >> >>
>> >> >> One of the subtle expectation is reversibility.
>> >> >>
>> >> >> Mount followed immediately by unmount has always passed and that is the
>> >> >> standard expectation always. Your proposed code will ensure that.
>> >> >>
>> >> >> However there is one other subtle expectaton.
>> >> >>
>> >> >> A mount cannot disappear if a user has explicitly mounted on top of it.
>> >> >>
>> >> >> your proposed code will not meet that expectation. 
>> >> >>
>> >> >> In other words, these two expectations make it behave differently even
>> >> >> when; arguably, they feel like the same configuration.
>> >> >
>> >> > I am not seeing that.
>> >> >
>> >> >
>> >> >
>> >> >>> 
>> >> >>> > maybe we need a flag to identify tucked mounts?
>> >> >>> 
>> >> >>> To preserve our exact current semantics yes.
>> >> >>> 
>> >> >>> The mount configurations that are delibearately constructed that I am
>> >> >>> aware of are comparatively simple.  I don't think anyone has even taken
>> >> >>> advantage of the shadow/side mounts at this point.  I made a reasonable
>> >> >>> effort to find out and no one was even aware they existed.  Much less
>> >> >>> what they were.  And certainly no one I talked to could find code that
>> >> >>> used them.
>> >> >>
>> >> >> But someday; even if its after a decade, someone ;) will
>> >> >> stumble into this semantics and wonder 'why?'. Its better to get it right
>> >> >> sooner. Sorry, I am blaming myself; for keeping some of the problems
>> >> >> open thinking no one will bump into them.
>> >> >
>> >> > Oh definitely.  If we have people ready to talk it through I am happy to
>> >> > dot as many i's and cross as many t's as we productively can.
>> >> >
>> >> > I was just pointing out that I don't have any reason to expect that any
>> >> > one depends on the subtle details of the implementation today so we
>> >> > still have some wiggle room to fix them.  Even if they are visible to
>> >> > user space.
>> >> 
>> >> So I haven't seen a reply, and we are getting awfully close to the merge
>> >> window.  Is there anything concrete we can do to ease concerns?
>> >> 
>> >> Right now I am thinking my last version of the patch is the likely the
>> >> best we have time and energy to manage and it would be good to merge it
>> >> before the code bit rots.
>> >
>> > I was waiting for some other opinions on the behavior, since I
>> > continue to think that 'one should not be able to unmount mounts on
>> > which a user has explicitly mounted upon'. I am happy to be overruled,
>> > since your patch significantly improves the rest of the semantics.
>> >
>> > Viro?
>> 
>> Ram Pai, just to be clear you were hoping to add the logic below to my patch?
>
> Yes. the behavior of your patch below is what I was proposing.
>
>> 
>> My objections to the snippet below are:
>> 
>> - It makes it hard for the CRIU folks (yet more state they have to find
>>   and restore).
>
> true. unfortunately one more subtle detail to be aware off.

A bit more than that, as it means that it requires an almost exact
playback of the sequence of mounts in all mount namespaces to
get to the point of reproducing a mount namespace.

>> - It feels subjectively worse to me.
>> 
>> - We already have cases where mounts are unmounted transparently (umount on rmdir).
>
> sorry. i am not aware of this case. some details will help.

The question:

What happens when we rmdir a directory that has a mount on it in another
mount namespace?

What happens when someone on the nfs server deletes a directory there
is a mount on?


It used to be that we returned -EBUSY, and refused the rmdir operation,
and we lied in the vfs about the nfs dentry being deleted to preserve
the mount.

In recent kernels I have done the work so that we transparently unmount
the mounts and allow the rmdir to happen.  An unprivileged user mounting
over say glibc and blocking the yum update of it is a pretty serious
bug.

>> - Al Viro claims that the side/shadow mounts are ordinary mounts and
>>   maintaining this extra logic that remembers if we tucked one mount
>>   under another seems to make this them less ordinary.
>
> I tend to argue that they are a bit more than ordinary, for they have the
> ability to tuck.
>
>> 
>> - The symmetry for unmounting exists for a tucked mount.  We can unmount
>>   it via propagation or we can unmount the mount above it, and then we
>>   can unmount the new underlying mount.
>
> this is fine with me.
>
>>   So I don't see why we don't
>>   want symmetry in the other case just because we mounted on top of
>>   the mount and rather than had the mount tucked under us.
>
> A tucked mount should be un-tuckable. I agree.  But a non-tucked mount
> cannot pretend to be tucked and this is where I disagree.

I have always seen the question as: Should a mount that is propagated be
unmountable via umount propagation.

Which leads me to think that allowing the umount propagation when it
won't change the applications view of files and filesystems is a good
thing.  From my perspective it also better preserves the reversability
property that is important.   The mount propgated and now the unmount
propagated.


From a system management point of view one of the largest practical
problems with mount namespaces and mount propagation is: mounts that
propagate into another mount namespaces but don't get unmounted.


Which is to say not unmounting something (especially silently) and
leaving the filesystem busy when something could be unmounted is a
practical problem for people.


I am going to be out for a week, and I am leaving in a few minutes.
So I am going to push my patch to the my for-next branch, so there
is a reasonable chance of merging things when the merge window opens.

If the feedback is to add the MNT_TUCKED annotations to make the patch
suitable for merging to Linus's tree I will take care of that when
I get back.

Eric
Andrey Vagin Feb. 6, 2017, 3:25 a.m. UTC | #3
On Sat, Feb 04, 2017 at 09:58:39AM +1300, Eric W. Biederman wrote:
> Ram Pai <linuxram@us.ibm.com> writes:
> 
> > On Sat, Feb 04, 2017 at 07:26:20AM +1300, Eric W. Biederman wrote:
> >> Ram Pai <linuxram@us.ibm.com> writes:
> >> 
> >> > On Fri, Feb 03, 2017 at 11:54:21PM +1300, Eric W. Biederman wrote:
> >> >> ebiederm@xmission.com (Eric W. Biederman) writes:
> >> >> 
> >> >> > Ram Pai <linuxram@us.ibm.com> writes:
> >> >> >
> >> >> >> On Sat, Jan 21, 2017 at 05:15:29PM +1300, Eric W. Biederman wrote:
> >> >> >>> Ram Pai <linuxram@us.ibm.com> writes:
> >> >> >>> 
> >> >> >>> >> @@ -359,12 +373,24 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
> >> >> >>> >> 
> >> >> >>> >>  	for (m = propagation_next(parent, parent); m;
> >> >> >>> >>  	     		m = propagation_next(m, parent)) {
> >> >> >>> >> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
> >> >> >>> >> -		if (child && list_empty(&child->mnt_mounts) &&
> >> >> >>> >> -		    (ret = do_refcount_check(child, 1)))
> >> >> >>> >> -			break;
> >> >> >>> >> +		int count = 1;
> >> >> >>> >> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
> >> >> >>> >> +		if (!child)
> >> >> >>> >> +			continue;
> >> >> >>> >> +
> >> >> >>> >> +		/* Is there exactly one mount on the child that covers
> >> >> >>> >> +		 * it completely whose reference should be ignored?
> >> >> >>> >> +		 */
> >> >> >>> >> +		topper = find_topper(child);
> >> >> >>> >
> >> >> >>> > This is tricky. I understand it is trying to identify the case where a
> >> >> >>> > mount got tucked-in because of propagation.  But this will not
> >> >> >>> > distinguish the case where a mount got over-mounted genuinely, not because of
> >> >> >>> > propagation, but because of explicit user action.
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > example:
> >> >> >>> >
> >> >> >>> > case 1: (explicit user action)
> >> >> >>> > 	B is a slave of A
> >> >> >>> > 	mount something on A/a , it will propagate to B/a
> >> >> >>> > 	and than mount something on B/a
> >> >> >>> >
> >> >> >>> > case 2: (tucked mount)
> >> >> >>> > 	B is a slave of A
> >> >> >>> > 	mount something on B/a
> >> >> >>> > 	and than mount something on A/a
> >> >> >>> >
> >> >> >>> > Both case 1 and case 2 lead to the same mount configuration.
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > 	  however 'umount A/a' in case 1 should fail.
> >> >> >>> > 	  and 'umount A/a' in case 2 should pass.
> >> >> >>> >
> >> >> >>> > Right? in other words, umounts of 'tucked mounts' should pass(case 2).
> >> >> >>> > 	whereas umounts of mounts on which overmounts exist should
> >> >> >>> > 		fail.(case 1)
> >> >> >>> 
> >> >> >>> Looking at your example.  I agree that case 1 will fail today.
> >> >> >>
> >> >> >> And should continue to fail. right? Your semantics change will pass it.
> >> >> >
> >> >> > I don't see why it should continue to fail.
> >> >> >
> >> >> >>> However my actual expectation would be for both mount configurations
> >> >> >>> to behave the same.  In both cases something has been explicitly mounted
> >> >> >>> on B/a and something has propagated to B/a.  In both cases the mount
> >> >> >>> on top is what was explicitly mounted, and the mount below is what was
> >> >> >>> propagated to B/a.
> >> >> >>> 
> >> >> >>> I don't see why the order of operations should matter.
> >> >> >>
> >> >> >> One of the subtle expectation is reversibility.
> >> >> >>
> >> >> >> Mount followed immediately by unmount has always passed and that is the
> >> >> >> standard expectation always. Your proposed code will ensure that.
> >> >> >>
> >> >> >> However there is one other subtle expectaton.
> >> >> >>
> >> >> >> A mount cannot disappear if a user has explicitly mounted on top of it.
> >> >> >>
> >> >> >> your proposed code will not meet that expectation. 
> >> >> >>
> >> >> >> In other words, these two expectations make it behave differently even
> >> >> >> when; arguably, they feel like the same configuration.
> >> >> >
> >> >> > I am not seeing that.
> >> >> >
> >> >> >
> >> >> >
> >> >> >>> 
> >> >> >>> > maybe we need a flag to identify tucked mounts?
> >> >> >>> 
> >> >> >>> To preserve our exact current semantics yes.
> >> >> >>> 
> >> >> >>> The mount configurations that are delibearately constructed that I am
> >> >> >>> aware of are comparatively simple.  I don't think anyone has even taken
> >> >> >>> advantage of the shadow/side mounts at this point.  I made a reasonable
> >> >> >>> effort to find out and no one was even aware they existed.  Much less
> >> >> >>> what they were.  And certainly no one I talked to could find code that
> >> >> >>> used them.
> >> >> >>
> >> >> >> But someday; even if its after a decade, someone ;) will
> >> >> >> stumble into this semantics and wonder 'why?'. Its better to get it right
> >> >> >> sooner. Sorry, I am blaming myself; for keeping some of the problems
> >> >> >> open thinking no one will bump into them.
> >> >> >
> >> >> > Oh definitely.  If we have people ready to talk it through I am happy to
> >> >> > dot as many i's and cross as many t's as we productively can.
> >> >> >
> >> >> > I was just pointing out that I don't have any reason to expect that any
> >> >> > one depends on the subtle details of the implementation today so we
> >> >> > still have some wiggle room to fix them.  Even if they are visible to
> >> >> > user space.
> >> >> 
> >> >> So I haven't seen a reply, and we are getting awfully close to the merge
> >> >> window.  Is there anything concrete we can do to ease concerns?
> >> >> 
> >> >> Right now I am thinking my last version of the patch is the likely the
> >> >> best we have time and energy to manage and it would be good to merge it
> >> >> before the code bit rots.
> >> >
> >> > I was waiting for some other opinions on the behavior, since I
> >> > continue to think that 'one should not be able to unmount mounts on
> >> > which a user has explicitly mounted upon'. I am happy to be overruled,
> >> > since your patch significantly improves the rest of the semantics.
> >> >
> >> > Viro?
> >> 
> >> Ram Pai, just to be clear you were hoping to add the logic below to my patch?
> >
> > Yes. the behavior of your patch below is what I was proposing.
> >
> >> 
> >> My objections to the snippet below are:
> >> 
> >> - It makes it hard for the CRIU folks (yet more state they have to find
> >>   and restore).
> >
> > true. unfortunately one more subtle detail to be aware off.
> 
> A bit more than that, as it means that it requires an almost exact
> playback of the sequence of mounts in all mount namespaces to
> get to the point of reproducing a mount namespace.

Currently dump and restore of mount namespaces is the most complicated
part of CRIU. The main problem is that we don't know how a tree of
mounts was created.  Mounts have two types of relationships:
child<->parent and shared groups. Currently both this relationships
can't be restored directly.  We can not add a mount into an existing
group, the group can be only inherited from a source mount.  And we can
not restore "tucked" mounts, which can be only appeared due to
propagation.

Now we don't know an algorithm to dump and restore any set of mount
points for a reqsonable time. The problem becomes more complex if we
start thinking how to restore mount namespaces which lives in different
user namespaces.


This patch from Eric together with my patch https://lkml.org/lkml/2017/1/23/712
can solve the problem of dumping and restoring mount namespaces.

> 
> >> - It feels subjectively worse to me.
> >> 
> >> - We already have cases where mounts are unmounted transparently (umount on rmdir).
> >
> > sorry. i am not aware of this case. some details will help.
> 
> The question:
> 
> What happens when we rmdir a directory that has a mount on it in another
> mount namespace?
> 
> What happens when someone on the nfs server deletes a directory there
> is a mount on?
> 
> 
> It used to be that we returned -EBUSY, and refused the rmdir operation,
> and we lied in the vfs about the nfs dentry being deleted to preserve
> the mount.
> 
> In recent kernels I have done the work so that we transparently unmount
> the mounts and allow the rmdir to happen.  An unprivileged user mounting
> over say glibc and blocking the yum update of it is a pretty serious
> bug.
> 
> >> - Al Viro claims that the side/shadow mounts are ordinary mounts and
> >>   maintaining this extra logic that remembers if we tucked one mount
> >>   under another seems to make this them less ordinary.
> >
> > I tend to argue that they are a bit more than ordinary, for they have the
> > ability to tuck.
> >
> >> 
> >> - The symmetry for unmounting exists for a tucked mount.  We can unmount
> >>   it via propagation or we can unmount the mount above it, and then we
> >>   can unmount the new underlying mount.
> >
> > this is fine with me.
> >
> >>   So I don't see why we don't
> >>   want symmetry in the other case just because we mounted on top of
> >>   the mount and rather than had the mount tucked under us.
> >
> > A tucked mount should be un-tuckable. I agree.  But a non-tucked mount
> > cannot pretend to be tucked and this is where I disagree.
> 
> I have always seen the question as: Should a mount that is propagated be
> unmountable via umount propagation.
> 
> Which leads me to think that allowing the umount propagation when it
> won't change the applications view of files and filesystems is a good
> thing.  From my perspective it also better preserves the reversability
> property that is important.   The mount propgated and now the unmount
> propagated.
> 
> 
> From a system management point of view one of the largest practical
> problems with mount namespaces and mount propagation is: mounts that
> propagate into another mount namespaces but don't get unmounted.
> 
> 
> Which is to say not unmounting something (especially silently) and
> leaving the filesystem busy when something could be unmounted is a
> practical problem for people.
> 
> 
> I am going to be out for a week, and I am leaving in a few minutes.
> So I am going to push my patch to the my for-next branch, so there
> is a reasonable chance of merging things when the merge window opens.
> 
> If the feedback is to add the MNT_TUCKED annotations to make the patch
> suitable for merging to Linus's tree I will take care of that when
> I get back.
> 
> Eric
> 
> 
>
Ram Pai Feb. 6, 2017, 9:40 p.m. UTC | #4
On Sun, Feb 05, 2017 at 07:25:09PM -0800, Andrei Vagin wrote:
> On Sat, Feb 04, 2017 at 09:58:39AM +1300, Eric W. Biederman wrote:
> > Ram Pai <linuxram@us.ibm.com> writes:
> > 
> > > On Sat, Feb 04, 2017 at 07:26:20AM +1300, Eric W. Biederman wrote:
> > >> Ram Pai <linuxram@us.ibm.com> writes:
> > >> 
> > >> > On Fri, Feb 03, 2017 at 11:54:21PM +1300, Eric W. Biederman wrote:
> > >> >> ebiederm@xmission.com (Eric W. Biederman) writes:
> > >> >> 
> > >> >> > Ram Pai <linuxram@us.ibm.com> writes:
> > >> >> >
> > >> >> >> On Sat, Jan 21, 2017 at 05:15:29PM +1300, Eric W. Biederman wrote:
> > >> >> >>> Ram Pai <linuxram@us.ibm.com> writes:
> > >> >> >>> 
> > >> >> >>> >> @@ -359,12 +373,24 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
> > >> >> >>> >> 
....snip....
> > 
> > A bit more than that, as it means that it requires an almost exact
> > playback of the sequence of mounts in all mount namespaces to
> > get to the point of reproducing a mount namespace.

> 
> Currently dump and restore of mount namespaces is the most complicated
> part of CRIU. The main problem is that we don't know how a tree of
> mounts was created.  Mounts have two types of relationships:
> child<->parent and shared groups. Currently both this relationships
> can't be restored directly.  We can not add a mount into an existing
> group, the group can be only inherited from a source mount.  And we can
> not restore "tucked" mounts, which can be only appeared due to
> propagation.
> 
> Now we don't know an algorithm to dump and restore any set of mount
> points for a reqsonable time. The problem becomes more complex if we
> start thinking how to restore mount namespaces which lives in different
> user namespaces.
> 
> 
> This patch from Eric together with my patch https://lkml.org/lkml/2017/1/23/712
> can solve the problem of dumping and restoring mount namespaces.

Lets say we exposed the tucked flag for the mount in the
/proc/*/mountinfo, than there will be no need to know the history.
Its just a matter or setting that flag as part of
your new MS_SET_GROUP mount() call. right?

Looking at the patch at https://lkml.org/lkml/2017/1/23/712, I am
guessing the CRUI dumps information from mountinfo to restore it
later.

RP
Andrey Vagin Feb. 7, 2017, 6:35 a.m. UTC | #5
On Mon, Feb 06, 2017 at 01:40:53PM -0800, Ram Pai wrote:
> On Sun, Feb 05, 2017 at 07:25:09PM -0800, Andrei Vagin wrote:
> > On Sat, Feb 04, 2017 at 09:58:39AM +1300, Eric W. Biederman wrote:
> > > Ram Pai <linuxram@us.ibm.com> writes:
> > > 
> > > > On Sat, Feb 04, 2017 at 07:26:20AM +1300, Eric W. Biederman wrote:
> > > >> Ram Pai <linuxram@us.ibm.com> writes:
> > > >> 
> > > >> > On Fri, Feb 03, 2017 at 11:54:21PM +1300, Eric W. Biederman wrote:
> > > >> >> ebiederm@xmission.com (Eric W. Biederman) writes:
> > > >> >> 
> > > >> >> > Ram Pai <linuxram@us.ibm.com> writes:
> > > >> >> >
> > > >> >> >> On Sat, Jan 21, 2017 at 05:15:29PM +1300, Eric W. Biederman wrote:
> > > >> >> >>> Ram Pai <linuxram@us.ibm.com> writes:
> > > >> >> >>> 
> > > >> >> >>> >> @@ -359,12 +373,24 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
> > > >> >> >>> >> 
> ....snip....
> > > 
> > > A bit more than that, as it means that it requires an almost exact
> > > playback of the sequence of mounts in all mount namespaces to
> > > get to the point of reproducing a mount namespace.
> 
> > 
> > Currently dump and restore of mount namespaces is the most complicated
> > part of CRIU. The main problem is that we don't know how a tree of
> > mounts was created.  Mounts have two types of relationships:
> > child<->parent and shared groups. Currently both this relationships
> > can't be restored directly.  We can not add a mount into an existing
> > group, the group can be only inherited from a source mount.  And we can
> > not restore "tucked" mounts, which can be only appeared due to

Sorry, I used a wrong term. I mean "shadow" mounts.

> > propagation.
> > 
> > Now we don't know an algorithm to dump and restore any set of mount
> > points for a reqsonable time. The problem becomes more complex if we
> > start thinking how to restore mount namespaces which lives in different
> > user namespaces.
> > 
> > 
> > This patch from Eric together with my patch https://lkml.org/lkml/2017/1/23/712
> > can solve the problem of dumping and restoring mount namespaces.
> 
> Lets say we exposed the tucked flag for the mount in the
> /proc/*/mountinfo, than there will be no need to know the history.
> Its just a matter or setting that flag as part of
> your new MS_SET_GROUP mount() call. right?

This is right.

> 
> Looking at the patch at https://lkml.org/lkml/2017/1/23/712, I am
> guessing the CRUI dumps information from mountinfo to restore it
> later.
> 
> RP
>
diff mbox

Patch

diff --git a/fs/namespace.c b/fs/namespace.c
index 8bfad42c1ccf..8b00e0548438 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2047,8 +2047,10 @@  static int attach_recursive_mnt(struct mount *source_mnt,
 		hlist_del_init(&child->mnt_hash);
 		q = __lookup_mnt(&child->mnt_parent->mnt,
 				 child->mnt_mountpoint);
-		if (q)
+		if (q) {
 			mnt_change_mountpoint(child, smp, q);
+			child->mnt.mnt_flags |= MNT_TUCKED;
+		}
 		commit_tree(child);
 	}
 	put_mountpoint(smp);
diff --git a/fs/pnode.c b/fs/pnode.c
index 5bc7896d122a..e2a6ac68feb9 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -327,6 +327,9 @@  static struct mount *find_topper(struct mount *mnt)
 	/* If there is exactly one mount covering mnt completely return it. */
 	struct mount *child;
 
+	if (!(mnt->mnt.mnt_flags & MNT_TUCKED))
+		return NULL;
+	
 	if (!list_is_singular(&mnt->mnt_mounts))
 		return NULL;
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 8e0352af06b7..25ca398b19b3 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -52,6 +52,7 @@  struct mnt_namespace;
 
 #define MNT_INTERNAL	0x4000
 
+#define MNT_TUCKED		0x020000
 #define MNT_LOCK_ATIME		0x040000
 #define MNT_LOCK_NOEXEC		0x080000
 #define MNT_LOCK_NOSUID		0x100000