Message ID: 87a8b6r0z5.fsf_-_@xmission.com (mailing list archive)
State: New, archived
On Thu, Jan 05, 2017 at 10:04:14AM +1300, Eric W. Biederman wrote:

> - attach_shadows to be renamed __attach_mnt and the it's shadow
>   handling to be removed.

Er... s/the it's/its/, presumably?  Or am I misparsing that?

> v2: Updated to mnt_change_mountpoint to not call dput or mntput
> and instead to decrement the counts directly.  It is guaranteed
> that there will be other references when mnt_change_mountpoint is
> called so this is safe.

> +void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
> {

Too generic name, IMO, and I really wonder if "mount" (== interpose) and
"umount" (== excise?) cases would be better off separately.

> +	struct mountpoint *old_mp = mnt->mnt_mp;
> +	struct dentry *old_mountpoint = mnt->mnt_mountpoint;
> +	struct mount *old_parent = mnt->mnt_parent;
> +
> +	list_del_init(&mnt->mnt_child);
> +	hlist_del_init(&mnt->mnt_mp_list);
> +	hlist_del_init_rcu(&mnt->mnt_hash);
> +
> +	attach_mnt(mnt, parent, mp);
> +
> +	put_mountpoint(old_mp);

> +	 *
> +	 * During mounting, another mount will continue to use the old
> +	 * mountpoint and during unmounting, the old mountpoint will
> +	 * continue to exist until namespace_unlock which happens well
> +	 * after mnt_change_mountpoint.
> +	 */

Umm...  AFAICS, in the former case "another mount" is simply parent,
right?

> +	spin_lock(&old_mountpoint->d_lock);
> +	old_mountpoint->d_lockref.count--;
> +	spin_unlock(&old_mountpoint->d_lock);
> +	mnt_add_count(old_parent, -1);

> +		if (child->mnt.mnt_root == smp->m_dentry) {

Explain, please.  In which case is that condition _not_ satisfied, and
what should happen i

> +			struct mount *q;
> +			q = __lookup_mnt(&child->mnt_parent->mnt,
> +					 child->mnt_mountpoint);
> +			if (q)
> +				mnt_change_mountpoint(child, smp, q);

> 	unlock_mount_hash();
> +	put_mountpoint(smp);

Wrong order...

> 	ns->pending_mounts = 0;
> +	put_mountpoint(smp);

... and worse yet here.
> static inline int do_refcount_check(struct mount *mnt, int count)
> {
> +	struct mount *topper = __lookup_mnt(&mnt->mnt, mnt->mnt.mnt_root);
> +	if (topper)
> +		count++;
> 	return mnt_get_count(mnt) > count;
> }

> @@ -359,7 +362,7 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
>
> 	for (m = propagation_next(parent, parent); m;
> 	     m = propagation_next(m, parent)) {
> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
> 		if (child && list_empty(&child->mnt_mounts) &&
> 		    (ret = do_refcount_check(child, 1)))
> 			break;

Er...  You do realize that you can end up with more than one such
propagation, right?  IOW, there might be more than one thing slipped
in.

> +		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
> 			SET_MNT_MARK(child);

Reread the condition, please...  And yes, I realize that original is
also rather odd; at a guess it was meant to be !(, not (!, but that's
just a guess - it's your code, IIRC.

> @@ -420,8 +425,8 @@ static void __propagate_umount(struct mount *mnt)
>
> 	for (m = propagation_next(parent, parent); m;
> 	     m = propagation_next(m, parent)) {
> -
> -		struct mount *child = __lookup_mnt_last(&m->mnt,
> +		struct mount *topper;
> +		struct mount *child = __lookup_mnt(&m->mnt,
> 						mnt->mnt_mountpoint);
> 	/*
> 	 * umount the child only if the child has no children
> @@ -430,6 +435,16 @@ static void __propagate_umount(struct mount *mnt)
> 		if (!child || !IS_MNT_MARKED(child))
> 			continue;
> 		CLEAR_MNT_MARK(child);
> +
> +		/* If there is exactly one mount covering all of child
> +		 * replace child with that mount.
> +		 */
> +		topper = __lookup_mnt(&child->mnt, child->mnt.mnt_root);
> +		if (topper &&
> +		    (child->mnt_mounts.next == &topper->mnt_child) &&
> +		    (topper->mnt_child.next == &child->mnt_mounts))

Weird way to spell child->mnt_mounts.next == child->mnt_mounts.prev, that...
Or, perhaps, the entire thing ought to be

	if (list_is_singular(&child->mnt_mounts)) {
		topper = list_first_entry(&child->mnt_mounts,
					  struct mount, mnt_child);
		if (topper->mnt_parent == child &&
		    topper->mnt_mountpoint == child->mnt.mnt_root)

to avoid hash lookups.

> +			mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, topper);

FWIW, my main worry here is your handling of the umount.  For example,
what happens if
	* something is mounted on A (m1)
	* something else is mounted on A/bar (m2)
	* D is a slave of C
	* something has been mounted on D/foo (n)
	* you do mount --rbind A C/foo (m1' on C/foo, m2' on m1'/bar,
	  m1'' interposed on D/foo under n,
	  m2'' on m1''/bar,
	  m1'' slave of m1', m2'' slave of m2)
	* you make C/foo and C/foo/bar private (m1'' and m2'' are not getting
	  propagation from m1' and m2' anymore)
	* you umount C/foo/bar (m2' is unmounted)
	* you umount C/foo
m1' gets unmounted, all right, but what of m1''?  D is a slave of C, so we
get propagation of umount from C/foo to D/foo; m1'' still has m2'' attached
to it.  AFAICS, your logics will happily slip m1'' from under n (assuming
that n itself is not busy), and leak both m1'' and m2''.

OTOH, in the case of multiple slip-under (Z is slave of Y, which is a slave
of X, mount on Z, then mount on Y, then mount on X) the check for being busy
would do very odd things.

Something's fishy on the umount side...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Thu, Jan 05, 2017 at 10:04:14AM +1300, Eric W. Biederman wrote:
>
>> - attach_shadows to be renamed __attach_mnt and the it's shadow
>>   handling to be removed.
>
> Er... s/the it's/its/, presumably?  Or am I misparsing that?
>
>> v2: Updated to mnt_change_mountpoint to not call dput or mntput
>> and instead to decrement the counts directly.  It is guaranteed
>> that there will be other references when mnt_change_mountpoint is
>> called so this is safe.
>
>> +void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
>> {
>
> Too generic name, IMO, and I really wonder if "mount" (== interpose) and
> "umount" (== excise?) cases would be better off separately.
>
>> +	struct mountpoint *old_mp = mnt->mnt_mp;
>> +	struct dentry *old_mountpoint = mnt->mnt_mountpoint;
>> +	struct mount *old_parent = mnt->mnt_parent;
>> +
>> +	list_del_init(&mnt->mnt_child);
>> +	hlist_del_init(&mnt->mnt_mp_list);
>> +	hlist_del_init_rcu(&mnt->mnt_hash);
>> +
>> +	attach_mnt(mnt, parent, mp);
>> +
>> +	put_mountpoint(old_mp);
>
>> +	 *
>> +	 * During mounting, another mount will continue to use the old
>> +	 * mountpoint and during unmounting, the old mountpoint will
>> +	 * continue to exist until namespace_unlock which happens well
>> +	 * after mnt_change_mountpoint.
>> +	 */
>
> Umm...  AFAICS, in the former case "another mount" is simply parent,
> right?

Yes, it is the new parent's mountpoint.  I was looking at it from a
different perspective.

>> +	spin_lock(&old_mountpoint->d_lock);
>> +	old_mountpoint->d_lockref.count--;
>> +	spin_unlock(&old_mountpoint->d_lock);
>> +	mnt_add_count(old_parent, -1);

>> +		if (child->mnt.mnt_root == smp->m_dentry) {
>
> Explain, please.  In which case is that condition _not_ satisfied, and
> what should happen i

When a tree is grafted in, that condition does not apply to the lower
leaves of the tree.  At the same time nothing needs to be done for those
leaves.
Only the primary mountpoint needs to worry about tucking.

>> +			struct mount *q;
>> +			q = __lookup_mnt(&child->mnt_parent->mnt,
>> +					 child->mnt_mountpoint);
>> +			if (q)
>> +				mnt_change_mountpoint(child, smp, q);

>> 	unlock_mount_hash();
>> +	put_mountpoint(smp);
>
> Wrong order...
>
>> 	ns->pending_mounts = 0;
>> +	put_mountpoint(smp);
>
> ... and worse yet here.

Definitely.  I totally spaced on propagating the locking changes to this
patch when I rebased it.

>> static inline int do_refcount_check(struct mount *mnt, int count)
>> {
>> +	struct mount *topper = __lookup_mnt(&mnt->mnt, mnt->mnt.mnt_root);
>> +	if (topper)
>> +		count++;
>> 	return mnt_get_count(mnt) > count;
>> }

>> @@ -359,7 +362,7 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
>>
>> 	for (m = propagation_next(parent, parent); m;
>> 	     m = propagation_next(m, parent)) {
>> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
>> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
>> 		if (child && list_empty(&child->mnt_mounts) &&
>> 		    (ret = do_refcount_check(child, 1)))
>> 			break;
>
> Er...  You do realize that you can end up with more than one such
> propagation, right?  IOW, there might be more than one thing slipped
> in.

So I have stared at this a lot and I don't see what you seem to be
seeing here.

I do however see that propagate_mount_busy has been buggy since the
beginning, as it only fails in the propagation case if the list of
children is empty.  I also see my modification to the code is buggy,
since the list_empty check precludes my changes to do_refcount_check
from being effective.

I have looked hard and your point with multiple propagations eludes me.
I am going to add a patch to fix propagate_mount_busy, and then rebase
this patch on top of that, and post it all for review.

>> +		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
>> 			SET_MNT_MARK(child);
>
> Reread the condition, please...  And yes, I realize that original is
> also rather odd; at a guess it was meant to be !(, not (!, but that's
> just a guess - it's your code, IIRC.

The intent is to find all trees that we can unmount where the point at
which the tree meets the rest of the mounts is not locked.

The mark is later used to see if it is ok to unmount a mount or if we
will reveal information to userspace (by breaking a lock).

Therefore the mark needs to be set if the mount is unlocked, and
recursively the mark needs to be set for every child of that mount
where the mark is set (the second condition).  Which makes the code
essentially correct.

Unfortunately the code does not handle multiple unmounts from the same
parent mount point.  Which means shadow/side mount support and
untucking of mounts fails to handle multiple unmounts from the same
parent mount.  The way the mark is used fundamentally assumes only one
operation on each mountpoint, and that is broken.

I intend to work on propagate_umount some more and fix that brokenness,
and hopefully fix the performance issues as well.  But I am leaving
that work for another change as it is going to require stripping out
and replacing algorithms, and so far I don't have good solutions.

>> @@ -420,8 +425,8 @@ static void __propagate_umount(struct mount *mnt)
>>
>> 	for (m = propagation_next(parent, parent); m;
>> 	     m = propagation_next(m, parent)) {
>> -
>> -		struct mount *child = __lookup_mnt_last(&m->mnt,
>> +		struct mount *topper;
>> +		struct mount *child = __lookup_mnt(&m->mnt,
>> 						mnt->mnt_mountpoint);
>> 	/*
>> 	 * umount the child only if the child has no children
>> @@ -430,6 +435,16 @@ static void __propagate_umount(struct mount *mnt)
>> 		if (!child || !IS_MNT_MARKED(child))
>> 			continue;
>> 		CLEAR_MNT_MARK(child);
>> +
>> +		/* If there is exactly one mount covering all of child
>> +		 * replace child with that mount.
>> +		 */
>> +		topper = __lookup_mnt(&child->mnt, child->mnt.mnt_root);
>> +		if (topper &&
>> +		    (child->mnt_mounts.next == &topper->mnt_child) &&
>> +		    (topper->mnt_child.next == &child->mnt_mounts))
>
> Weird way to spell child->mnt_mounts.next == child->mnt_mounts.prev, that...
> Or, perhaps, the entire thing ought to be

Except it is clearer than that.  It verifies not just that there is one
list item but that topper is that one list item.  Further it is the
exact same logic as is used in do_refcount_check, and at this stage in
development I figured it was more important to have a recognizable
pattern than to have the absolute most performant code.  Especially as
the basic complexity of the code is the same either way.

> 	if (list_is_singular(&child->mnt_mounts)) {
> 		topper = list_first_entry(&child->mnt_mounts,
> 					  struct mount, mnt_child);
> 		if (topper->mnt_parent == child &&
> 		    topper->mnt_mountpoint == child->mnt.mnt_root)
>
> to avoid hash lookups.

That it would, and now that I see that list_is_singular exists it looks
like a reasonable option.

>> +			mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, topper);
>
> FWIW, my main worry here is your handling of the umount.  For example,
> what happens if
> 	* something is mounted on A (m1)
> 	* something else is mounted on A/bar (m2)
> 	* D is a slave of C
> 	* something has been mounted on D/foo (n)
> 	* you do mount --rbind A C/foo (m1' on C/foo, m2' on m1'/bar,
> 	  m1'' interposed on D/foo under n,
> 	  m2'' on m1''/bar,
> 	  m1'' slave of m1', m2'' slave of m2)
> 	* you make C/foo and C/foo/bar private (m1'' and m2'' are not getting
> 	  propagation from m1' and m2' anymore)
> 	* you umount C/foo/bar (m2' is unmounted)
> 	* you umount C/foo
> m1' gets unmounted, all right, but what of m1''?  D is a slave of C, so we
> get propagation of umount from C/foo to D/foo; m1'' still has m2'' attached
> to it.  AFAICS, your logics will happily slip m1'' from under n (assuming
> that n itself is not busy), and leak both m1'' and m2''.

Yes.
This is exactly the same behavior we have today without my patch.  The
only difference is who is the parent mount.

$ cat > viro1.sh << EOF
#!/bin/sh
set -e
set -x
mount -t tmpfs base /mnt
mkdir -p /mnt/A
mount -t tmpfs m1 /mnt/A
mkdir -p /mnt/A/bar
mount -t tmpfs m2 /mnt/A/bar
mkdir -p /mnt/D
mkdir -p /mnt/C
mount -t tmpfs mC /mnt/C
mkdir -p /mnt/C/foo
mount --make-shared /mnt/C
mount --bind /mnt/C /mnt/D
mount --make-slave /mnt/D
mount -t tmpfs n /mnt/D/foo
mount --rbind /mnt/A /mnt/C/foo
echo
cat /proc/self/mountinfo
mount --make-private /mnt/C/foo
mount --make-private /mnt/C/foo/bar
echo
cat /proc/self/mountinfo
umount /mnt/C/foo/bar
echo
cat /proc/self/mountinfo
umount /mnt/C/foo
echo
cat /proc/self/mountinfo
EOF
$ chmod +x ./viro1.sh
$ unshare -Urm ./viro1.sh

At least when I run the above on a kernel with and without my patch
under discussion, all I see different is the mnt_id of the parent.
Which is exactly what we should expect from this change.

Did I make a mistake in creating my script?  Or are you referring to
the fact that propagate_mount_busy is just plain buggy?

> OTOH, in the case of multiple slip-under (Z is slave of Y, which is a slave
> of X, mount on Z, then mount on Y, then mount on X) the check for being busy
> would do very odd things.

I don't see what you are referring to.

Reposting shortly.

Eric
On Wed, Jan 11, 2017 at 01:10:57PM +1300, Eric W. Biederman wrote:

> >> +		if (child->mnt.mnt_root == smp->m_dentry) {
> >
> > Explain, please.  In which case is that condition _not_ satisfied, and
> > what should happen i
>
> When a tree is grafted in, that condition does not apply to the lower
> leaves of the tree.  At the same time nothing needs to be done for those
> leaves.  Only the primary mountpoint needs to worry about tucking.

How in hell would those lower leaves end up on the list in
attach_recursive_mnt()?  IDGI...

> >> +		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
> >> 			SET_MNT_MARK(child);
> >
> > Reread the condition, please...  And yes, I realize that original is
> > also rather odd; at a guess it was meant to be !(, not (!, but that's
> > just a guess - it's your code, IIRC.
>
> The intent is to find all trees that we can unmount where the point
> at which the tree meets the rest of the mounts is not locked.
>
> The mark is later used to see if it is ok to unmount a mount or if
> we will reveal information to userspace (by breaking a lock).
>
> Therefore the mark needs to be set if the mount is unlocked,
> and recursively the mark needs to be set for every child of
> that mount where the mark is set (the second condition).

*blink*

You have "if mount is not locked - mark it; if mount is already marked -
mark it again".  The latter part (|| IS_MNT_MARKED(mnt), that is) looks
very odd, won't you agree?  What the hell was that (its counterpart in
the earlier code) about?

I could understand something along the lines of "mark it unless it's
locked or already marked", but your code is "mark it if it's not locked
*or* if it's already marked".  Makes no sense in that form.

> > FWIW, my main worry here is your handling of the umount.  For example,
> > what happens if
> > 	* something is mounted on A (m1)
> > 	* something else is mounted on A/bar (m2)
> > 	* D is a slave of C
> > 	* something has been mounted on D/foo (n)
> > 	* you do mount --rbind A C/foo (m1' on C/foo, m2' on m1'/bar,
> > 	  m1'' interposed on D/foo under n,
> > 	  m2'' on m1''/bar,
> > 	  m1'' slave of m1', m2'' slave of m2)
> > 	* you make C/foo and C/foo/bar private (m1'' and m2'' are not getting
> > 	  propagation from m1' and m2' anymore)
> > 	* you umount C/foo/bar (m2' is unmounted)
> > 	* you umount C/foo
> > m1' gets unmounted, all right, but what of m1''?  D is a slave of C, so we
> > get propagation of umount from C/foo to D/foo; m1'' still has m2'' attached
> > to it.  AFAICS, your logics will happily slip m1'' from under n (assuming
> > that n itself is not busy), and leak both m1'' and m2''.
>
> Yes.  This is exactly the same behavior we have today without my patch.
> The only difference is who is the parent mount.

Not quite.  In the current tree m1'' should get stuck there (and be exposed
when n gets unmounted); AFAICS, your change will have it kicked out, with
m2'' still attached and still contributing to the refcount of m1''.

I might be missing something (and I hadn't checked your script - right now
I'm at 16 hours of uptime after only 4 hours of sleep).  Will take a look
at that after I grab some sleep...
Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Wed, Jan 11, 2017 at 01:10:57PM +1300, Eric W. Biederman wrote:
>> >> +		if (child->mnt.mnt_root == smp->m_dentry) {
>> >
>> > Explain, please.  In which case is that condition _not_ satisfied, and
>> > what should happen i
>>
>> When a tree is grafted in, that condition does not apply to the lower
>> leaves of the tree.  At the same time nothing needs to be done for those
>> leaves.  Only the primary mountpoint needs to worry about tucking.
>
> How in hell would those lower leaves end up on the list in
> attach_recursive_mnt()?  IDGI...

The submounts of a mount tree that is being attached need to have
commit_tree called on them to attach them to a mount namespace.

>> >> +		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
>> >> 			SET_MNT_MARK(child);
>> >
>> > Reread the condition, please...  And yes, I realize that original is
>> > also rather odd; at a guess it was meant to be !(, not (!, but that's
>> > just a guess - it's your code, IIRC.
>>
>> The intent is to find all trees that we can unmount where the point
>> at which the tree meets the rest of the mounts is not locked.
>>
>> The mark is later used to see if it is ok to unmount a mount or if
>> we will reveal information to userspace (by breaking a lock).
>>
>> Therefore the mark needs to be set if the mount is unlocked,
>> and recursively the mark needs to be set for every child of
>> that mount where the mark is set (the second condition).
>
> *blink*
>
> You have "if mount is not locked - mark it; if mount is already marked -
> mark it again".  The latter part (|| IS_MNT_MARKED(mnt), that is) looks
> very odd, won't you agree?  What the hell was that (its counterpart in
> the earlier code) about?

Not mark it again.  If the parent is marked, mark the child.

This is about finding subtrees where the root of the subtree is
unlocked but the children may be locked to that root.  Still we can
safely unmount the entire subtree without revealing anything to
userspace.
The walk is designed to happen from parent to the child mounts.

> I could understand something along the lines of "mark it unless it's
> locked or already marked", but your code is "mark it if it's not locked
> *or* if it's already marked".  Makes no sense in that form.
>
>> > FWIW, my main worry here is your handling of the umount.  For example,
>> > what happens if
>> > 	* something is mounted on A (m1)
>> > 	* something else is mounted on A/bar (m2)
>> > 	* D is a slave of C
>> > 	* something has been mounted on D/foo (n)
>> > 	* you do mount --rbind A C/foo (m1' on C/foo, m2' on m1'/bar,
>> > 	  m1'' interposed on D/foo under n,
>> > 	  m2'' on m1''/bar,
>> > 	  m1'' slave of m1', m2'' slave of m2)
>> > 	* you make C/foo and C/foo/bar private (m1'' and m2'' are not getting
>> > 	  propagation from m1' and m2' anymore)
>> > 	* you umount C/foo/bar (m2' is unmounted)
>> > 	* you umount C/foo
>> > m1' gets unmounted, all right, but what of m1''?  D is a slave of C, so we
>> > get propagation of umount from C/foo to D/foo; m1'' still has m2'' attached
>> > to it.  AFAICS, your logics will happily slip m1'' from under n (assuming
>> > that n itself is not busy), and leak both m1'' and m2''.
>>
>> Yes.  This is exactly the same behavior we have today without my patch.
>> The only difference is who is the parent mount.
>
> Not quite.  In the current tree m1'' should get stuck there (and be exposed
> when n gets unmounted); AFAICS, your change will have it kicked out, with
> m2'' still attached and still contributing to the refcount of m1''.
>
> I might be missing something (and I hadn't checked your script - right now
> I'm at 16 hours of uptime after only 4 hours of sleep).  Will take a look
> at that after I grab some sleep...

Please look with fresh rested eyes.

I will see about posting a new version of the patch shortly.
Eric
On Thu, Jan 12, 2017 at 05:03:42AM +1300, Eric W. Biederman wrote:

> Al Viro <viro@ZenIV.linux.org.uk> writes:
>
> > On Wed, Jan 11, 2017 at 01:10:57PM +1300, Eric W. Biederman wrote:
> >> >> +		if (child->mnt.mnt_root == smp->m_dentry) {
> >> >
> >> > Explain, please.  In which case is that condition _not_ satisfied, and
> >> > what should happen i
> >>
> >> When a tree is grafted in, that condition does not apply to the lower
> >> leaves of the tree.  At the same time nothing needs to be done for those
> >> leaves.  Only the primary mountpoint needs to worry about tucking.
> >
> > How in hell would those lower leaves end up on the list in
> > attach_recursive_mnt()?  IDGI...
>
> The submounts of a mount tree that is being attached need to have
> commit_tree called on them to attach them to a mount namespace.

Huh?  commit_tree() is called once for each copy of the source tree.  This

	list_add_tail(&head, &mnt->mnt_list);
	list_for_each_entry(m, &head, mnt_list)
		m->mnt_ns = n;

is what goes through submounts in each of them, _not_ the loop in the
caller.

What we get out of propagate_mnt() is the list of copies of the source
tree, one for each of the mountpoints that should get propagation from
the target.  ->mnt_mountpoint/->mnt_parent is fully set for all nodes.
Everything except the roots of those trees is hashed and has ->mnt_child
set up.  ->mnt_hash of the roots of those copies hosts the cyclic list,
anchored in tree_list passed to propagate_mnt().  ->mnt_list in each copy
forms an unanchored cyclic list going through all mounts in that copy.

The loop in attach_recursive_mnt() takes the tree_list apart and for each
element (== each copy of the source tree) we have commit_tree() called
once, doing the remaining work:
	* splices the ->mnt_list into the namespace's mount list
	* sets ->mnt_ns for all nodes (root and submounts alike)
	* sets ->mnt_child and ->mnt_hash for the root.
Again, the loop in attach_recursive_mnt() is over the set of secondary
copies of the source tree; it goes *only* through their roots.  Submounts
are seen only by commit_tree(), in the list_for_each_entry() loop in that
function.

Hell, try to add

	else
		WARN_ON(1);

to that if (...) of yours and see if you can trigger it if you don't
believe the above...

> > You have "if mount is not locked - mark it; if mount is already marked -
> > mark it again".  The latter part (|| IS_MNT_MARKED(mnt), that is) looks
> > very odd, won't you agree?  What the hell was that (its counterpart in
> > the earlier code) about?
>
> Not mark it again.  If the parent is marked, mark the child.

*doh*
Hi Eric,

I found one issue with this patch.  Take a look at the following
script.  The idea is simple: we call mount twice and then we call
umount twice, but the second umount fails.

[root@fc24 ~]# cat m.sh
#!/bin/sh
set -x -e
mount -t tmpfs xxx /mnt
mkdir -p /mnt/1
mkdir -p /mnt/2
mount --bind /mnt/1 /mnt/1
mount --make-shared /mnt/1
mount --bind /mnt/1 /mnt/2
mkdir -p /mnt/1/1
for i in `seq 2`; do
	mount --bind /mnt/1/1 /mnt/1/1
done
for i in `seq 2`; do
	umount /mnt/1/1 || {
		cat /proc/self/mountinfo | grep xxx
		exit 1
	}
done

[root@fc24 ~]# unshare -Urm ./m.sh
+ mount -t tmpfs xxx /mnt
+ mkdir -p /mnt/1
+ mkdir -p /mnt/2
+ mount --bind /mnt/1 /mnt/1
+ mount --make-shared /mnt/1
+ mount --bind /mnt/1 /mnt/2
+ mkdir -p /mnt/1/1
++ seq 2
+ for i in '`seq 2`'
+ mount --bind /mnt/1/1 /mnt/1/1
+ for i in '`seq 2`'
+ mount --bind /mnt/1/1 /mnt/1/1
++ seq 2
+ for i in '`seq 2`'
+ umount /mnt/1/1
+ for i in '`seq 2`'
+ umount /mnt/1/1
umount: /mnt/1/1: not mounted
+ cat /proc/self/mountinfo
+ grep xxx
147 116 0:42 / /mnt rw,relatime - tmpfs xxx rw
148 147 0:42 /1 /mnt/1 rw,relatime shared:65 - tmpfs xxx rw
149 147 0:42 /1 /mnt/2 rw,relatime shared:65 - tmpfs xxx rw
157 149 0:42 /1/1 /mnt/2/1 rw,relatime shared:65 - tmpfs xxx rw
+ exit 1

And you can see that /mnt/2 contains a mount, but it should not.

Thanks,
Andrei

On Thu, Jan 05, 2017 at 10:04:14AM +1300, Eric W. Biederman wrote:
>
> Ever since mount propagation was introduced, in cases where a mount is
> propagated to a parent mount and mountpoint pair that is already in use,
> the code has placed the new mount behind the old mount in the mount hash
> table.
>
> This implementation detail is problematic as it allows creating
> arbitrary length mount hash chains.
>
> Furthermore it invalidates the constraint maintained elsewhere in the
> mount code that a parent mount and a mountpoint pair will have exactly
> one mount upon them.  Making it hard to deal with and to talk about
> this special case in the mount code.
>
> Modify mount propagation to notice when there is already a mount at
> the parent mount and mountpoint where a new mount is propagating to,
> and place that preexisting mount on top of the new mount.
>
> Modify unmount propagation to notice when a mount that is being
> unmounted has another mount on top of it (and no other children), and
> to replace the unmounted mount with the mount on top of it.
>
> Move the MNT_UMOUNT test from __lookup_mnt_last into
> __propagate_umount as that is the only call of __lookup_mnt_last where
> MNT_UMOUNT may be set on any mount visible in the mount hash table.
>
> These modifications allow:
>  - __lookup_mnt_last to be removed.
>  - attach_shadows to be renamed __attach_mnt and the it's shadow
>    handling to be removed.
>  - commit_tree to be simplified
>  - copy_tree to be simplified
>
> The result is an easier to understand tree of mounts that does not
> allow creation of arbitrary length hash chains in the mount hash table.
>
> v2: Updated to mnt_change_mountpoint to not call dput or mntput
> and instead to decrement the counts directly.  It is guaranteed
> that there will be other references when mnt_change_mountpoint is
> called so this is safe.
>
> Cc: stable@vger.kernel.org
> Fixes: b90fa9ae8f51 ("[PATCH] shared mount handling: bind and rbind")
> Tested-by: Andrei Vagin <avagin@virtuozzo.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>
> Since the last version, as some of you may have seen, I have modified
> my implementation of mnt_change_mountpoint so that it no longer calls
> mntput or dput but instead relies on the knowledge that it can not
> possibly have the last reference to the mnt and dentry of interest.
> This avoids code checking tools from complaining bitterly.
>
> This is on top of my previous patch that sorts out locking of the
> mountpoint hash table.  After giving ample time for review I intend
> to push this and the previous bug fix to Linus.
>
>  fs/mount.h     |   1 -
>  fs/namespace.c | 110 +++++++++++++++++++++++++++++++--------------------------
>  fs/pnode.c     |  27 ++++++++++----
>  fs/pnode.h     |   2 ++
>  4 files changed, 82 insertions(+), 58 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index 2c856fc47ae3..2826543a131d 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -89,7 +89,6 @@ static inline int is_mounted(struct vfsmount *mnt)
>  }
>
>  extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *);
> -extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
>
>  extern int __legitimize_mnt(struct vfsmount *, unsigned);
>  extern bool legitimize_mnt(struct vfsmount *, unsigned);
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 487ba30bb5c6..91ccfb73f0e0 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -637,28 +637,6 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
>  }
>
>  /*
> - * find the last mount at @dentry on vfsmount @mnt.
> - * mount_lock must be held.
> - */
> -struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry)
> -{
> -	struct mount *p, *res = NULL;
> -	p = __lookup_mnt(mnt, dentry);
> -	if (!p)
> -		goto out;
> -	if (!(p->mnt.mnt_flags & MNT_UMOUNT))
> -		res = p;
> -	hlist_for_each_entry_continue(p, mnt_hash) {
> -		if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry)
> -			break;
> -		if (!(p->mnt.mnt_flags & MNT_UMOUNT))
> -			res = p;
> -	}
> -out:
> -	return res;
> -}
> -
> -/*
>   * lookup_mnt - Return the first child mount mounted at path
>   *
>   * "First" means first mounted chronologically.  If you create the
> @@ -878,6 +856,13 @@ void mnt_set_mountpoint(struct mount *mnt,
>  	hlist_add_head(&child_mnt->mnt_mp_list, &mp->m_list);
>  }
>
> +static void __attach_mnt(struct mount *mnt, struct mount *parent)
> +{
> +	hlist_add_head_rcu(&mnt->mnt_hash,
> +			   m_hash(&parent->mnt, mnt->mnt_mountpoint));
> +	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
> +}
> +
>  /*
>   * vfsmount lock must be held for write
>   */
> @@ -886,28 +871,45 @@ static void attach_mnt(struct mount *mnt,
>  			struct mountpoint *mp)
>  {
>  	mnt_set_mountpoint(parent, mp, mnt);
> -	hlist_add_head_rcu(&mnt->mnt_hash, m_hash(&parent->mnt, mp->m_dentry));
> -	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
> +	__attach_mnt(mnt, parent);
>  }
>
> -static void attach_shadowed(struct mount *mnt,
> -			struct mount *parent,
> -			struct mount *shadows)
> +void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
>  {
> -	if (shadows) {
> -		hlist_add_behind_rcu(&mnt->mnt_hash, &shadows->mnt_hash);
> -		list_add(&mnt->mnt_child, &shadows->mnt_child);
> -	} else {
> -		hlist_add_head_rcu(&mnt->mnt_hash,
> -				m_hash(&parent->mnt, mnt->mnt_mountpoint));
> -		list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
> -	}
> +	struct mountpoint *old_mp = mnt->mnt_mp;
> +	struct dentry *old_mountpoint = mnt->mnt_mountpoint;
> +	struct mount *old_parent = mnt->mnt_parent;
> +
> +	list_del_init(&mnt->mnt_child);
> +	hlist_del_init(&mnt->mnt_mp_list);
> +	hlist_del_init_rcu(&mnt->mnt_hash);
> +
> +	attach_mnt(mnt, parent, mp);
> +
> +	put_mountpoint(old_mp);
> +
> +	/*
> +	 * Safely avoid even the suggestion this code might sleep or
> +	 * lock the mount hash by taking avantage of the knowlege that
> +	 * mnt_change_mounpoint will not release the final reference
> +	 * to a mountpoint.
> +	 *
> +	 * During mounting, another mount will continue to use the old
> +	 * mountpoint and during unmounting, the old mountpoint will
> +	 * continue to exist until namespace_unlock which happens well
> +	 * after mnt_change_mountpoint.
> +	 */
> +	spin_lock(&old_mountpoint->d_lock);
> +	old_mountpoint->d_lockref.count--;
> +	spin_unlock(&old_mountpoint->d_lock);
> +
> +	mnt_add_count(old_parent, -1);
>  }
>
>  /*
>   * vfsmount lock must be held for write
>   */
> -static void commit_tree(struct mount *mnt, struct mount *shadows)
> +static void commit_tree(struct mount *mnt)
>  {
>  	struct mount *parent = mnt->mnt_parent;
>  	struct mount *m;
> @@ -925,7 +927,7 @@ static void commit_tree(struct mount *mnt, struct mount *shadows)
>  	n->mounts += n->pending_mounts;
>  	n->pending_mounts = 0;
>
> -	attach_shadowed(mnt, parent, shadows);
> +	__attach_mnt(mnt, parent);
>  	touch_mnt_namespace(n);
>  }
>
> @@ -1764,7 +1766,6 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
>  			continue;
>
>  		for (s = r; s; s = next_mnt(s, r)) {
> -			struct mount *t = NULL;
>  			if (!(flag & CL_COPY_UNBINDABLE) &&
>  			    IS_MNT_UNBINDABLE(s)) {
>  				s = skip_mnt_tree(s);
> @@ -1786,14 +1787,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
>  				goto out;
>  			lock_mount_hash();
>  			list_add_tail(&q->mnt_list, &res->mnt_list);
> -			mnt_set_mountpoint(parent, p->mnt_mp, q);
> -			if (!list_empty(&parent->mnt_mounts)) {
> -				t = list_last_entry(&parent->mnt_mounts,
> -						struct mount, mnt_child);
> -				if (t->mnt_mp != p->mnt_mp)
> -					t = NULL;
> -			}
> -			attach_shadowed(q, parent, t);
> +			attach_mnt(q, parent, p->mnt_mp);
>  			unlock_mount_hash();
>  		}
>  	}
> @@ -1992,10 +1986,18 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>  {
>  	HLIST_HEAD(tree_list);
>  	struct mnt_namespace *ns = dest_mnt->mnt_ns;
> +	struct mountpoint *smp;
>  	struct mount *child, *p;
>  	struct hlist_node *n;
>  	int err;
>
> +	/* Preallocate a mountpoint in case the new mounts need
> +	 * to be tucked under other mounts.
> +	 */
> +	smp = get_mountpoint(source_mnt->mnt.mnt_root);
> +	if (IS_ERR(smp))
> +		return PTR_ERR(smp);
> +
>  	/* Is there space to add these mounts to the mount namespace? */
>  	if (!parent_path) {
>  		err = count_mounts(ns, source_mnt);
> @@ -2022,17 +2024,22 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>  		touch_mnt_namespace(source_mnt->mnt_ns);
>  	} else {
>  		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
> -		commit_tree(source_mnt, NULL);
> +		commit_tree(source_mnt);
>  	}
>
>  	hlist_for_each_entry_safe(child, n, &tree_list, mnt_hash) {
> -		struct mount *q;
>  		hlist_del_init(&child->mnt_hash);
> -		q = __lookup_mnt_last(&child->mnt_parent->mnt,
> -				      child->mnt_mountpoint);
> -		commit_tree(child, q);
> +		if (child->mnt.mnt_root == smp->m_dentry) {
> +			struct mount *q;
> +			q = __lookup_mnt(&child->mnt_parent->mnt,
> +					 child->mnt_mountpoint);
> +			if (q)
> +				mnt_change_mountpoint(child, smp, q);
> +		}
> +		commit_tree(child);
>  	}
>  	unlock_mount_hash();
> +	put_mountpoint(smp);
>
>  	return 0;
>
> @@ -2046,6 +2053,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>  		cleanup_group_ids(source_mnt, NULL);
>  out:
>  	ns->pending_mounts = 0;
> +	put_mountpoint(smp);
>  	return err;
>  }
>
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 06a793f4ae38..eb4331240fd1 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -327,6 +327,9 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
>   */
>  static inline int do_refcount_check(struct mount *mnt, int count)
>  {
> +	struct mount *topper = __lookup_mnt(&mnt->mnt, mnt->mnt.mnt_root);
> +	if (topper)
> +		count++;
>  	return mnt_get_count(mnt) > count;
>  }
>
> @@ -359,7 +362,7 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
>
>  	for (m = propagation_next(parent, parent); m;
>  	     m = propagation_next(m, parent)) {
> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
>  		if (child && list_empty(&child->mnt_mounts) &&
>  		    (ret =
do_refcount_check(child, 1))) > break; > @@ -381,7 +384,7 @@ void propagate_mount_unlock(struct mount *mnt) > > for (m = propagation_next(parent, parent); m; > m = propagation_next(m, parent)) { > - child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint); > + child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint); > if (child) > child->mnt.mnt_flags &= ~MNT_LOCKED; > } > @@ -399,9 +402,11 @@ static void mark_umount_candidates(struct mount *mnt) > > for (m = propagation_next(parent, parent); m; > m = propagation_next(m, parent)) { > - struct mount *child = __lookup_mnt_last(&m->mnt, > + struct mount *child = __lookup_mnt(&m->mnt, > mnt->mnt_mountpoint); > - if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) { > + if (!child || (child->mnt.mnt_flags & MNT_UMOUNT)) > + continue; > + if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) { > SET_MNT_MARK(child); > } > } > @@ -420,8 +425,8 @@ static void __propagate_umount(struct mount *mnt) > > for (m = propagation_next(parent, parent); m; > m = propagation_next(m, parent)) { > - > - struct mount *child = __lookup_mnt_last(&m->mnt, > + struct mount *topper; > + struct mount *child = __lookup_mnt(&m->mnt, > mnt->mnt_mountpoint); > /* > * umount the child only if the child has no children > @@ -430,6 +435,16 @@ static void __propagate_umount(struct mount *mnt) > if (!child || !IS_MNT_MARKED(child)) > continue; > CLEAR_MNT_MARK(child); > + > + /* If there is exactly one mount covering all of child > + * replace child with that mount. 
> + */ > + topper = __lookup_mnt(&child->mnt, child->mnt.mnt_root); > + if (topper && > + (child->mnt_mounts.next == &topper->mnt_child) && > + (topper->mnt_child.next == &child->mnt_mounts)) > + mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, topper); > + > if (list_empty(&child->mnt_mounts)) { > list_del_init(&child->mnt_child); > child->mnt.mnt_flags |= MNT_UMOUNT; > diff --git a/fs/pnode.h b/fs/pnode.h > index 550f5a8b4fcf..dc87e65becd2 100644 > --- a/fs/pnode.h > +++ b/fs/pnode.h > @@ -49,6 +49,8 @@ int get_dominating_id(struct mount *mnt, const struct path *root); > unsigned int mnt_get_count(struct mount *mnt); > void mnt_set_mountpoint(struct mount *, struct mountpoint *, > struct mount *); > +void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, > + struct mount *mnt); > struct mount *copy_tree(struct mount *, struct dentry *, int); > bool is_path_reachable(struct mount *, struct dentry *, > const struct path *root); > -- > 2.10.1 >
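The core idea of the patch, per its commit message, is "tucking": when a mount propagates to a (parent mount, mountpoint) pair that is already occupied, the preexisting mount is re-attached on top of the new one, so the mount hash keeps at most one mount per pair instead of growing shadow chains. A toy model can sketch that invariant. This is illustrative Python, not kernel code; the `Mount` class, `mount_hash` dict, and `propagate()` helper are all invented for the sketch.

```python
# Toy model of the "tucking" invariant this patch establishes: the mount
# hash keeps at most one mount per (parent, mountpoint) pair. Everything
# here (Mount, mount_hash, propagate) is invented for illustration; it is
# not kernel code.

class Mount:
    def __init__(self, name):
        self.name = name
        self.parent = None        # parent Mount (None for a root)
        self.mountpoint = None    # path within the parent

# stand-in for the kernel's mount hash table
mount_hash = {}

def attach(mnt, parent, mountpoint):
    mnt.parent, mnt.mountpoint = parent, mountpoint
    mount_hash[(parent, mountpoint)] = mnt

def propagate(new_mnt, parent, mountpoint):
    """Propagate new_mnt to (parent, mountpoint). If that slot is already
    occupied, tuck: the new mount goes underneath, and the old occupant is
    re-attached on top of the new mount's root (cf. mnt_change_mountpoint),
    instead of shadowing it in a hash chain."""
    old = mount_hash.get((parent, mountpoint))
    attach(new_mnt, parent, mountpoint)
    if old is not None:
        attach(old, new_mnt, "/")

root = Mount("root")
a = Mount("A")
propagate(a, root, "/mnt")
b = Mount("B")
propagate(b, root, "/mnt")    # propagates onto an occupied pair

assert mount_hash[(root, "/mnt")] is b          # exactly one mount per pair
assert a.parent is b and a.mountpoint == "/"    # A is now tucked on top of B
```

The old behavior would have hashed both A and B at (root, "/mnt"); the tuck keeps lookups O(1) per pair and preserves the one-mount-per-mountpoint constraint the rest of the code assumes.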
Andrei Vagin <avagin@virtuozzo.com> writes: > Hi Eric, > > I found one issue about this patch. Take a look at the next script. The > idea is simple, we call mount twice and then we call umount twice, but > the second umount fails. Definitely odd. I will take a look. Eric > > [root@fc24 ~]# cat m.sh > #!/bin/sh > > set -x -e > mount -t tmpfs xxx /mnt > mkdir -p /mnt/1 > mkdir -p /mnt/2 > mount --bind /mnt/1 /mnt/1 > mount --make-shared /mnt/1 > mount --bind /mnt/1 /mnt/2 > mkdir -p /mnt/1/1 > for i in `seq 2`; do > mount --bind /mnt/1/1 /mnt/1/1 > done > for i in `seq 2`; do > umount /mnt/1/1 || { > cat /proc/self/mountinfo | grep xxx > exit 1 > } > done > > [root@fc24 ~]# unshare -Urm ./m.sh > + mount -t tmpfs xxx /mnt > + mkdir -p /mnt/1 > + mkdir -p /mnt/2 > + mount --bind /mnt/1 /mnt/1 > + mount --make-shared /mnt/1 > + mount --bind /mnt/1 /mnt/2 > + mkdir -p /mnt/1/1 > ++ seq 2 > + for i in '`seq 2`' > + mount --bind /mnt/1/1 /mnt/1/1 > + for i in '`seq 2`' > + mount --bind /mnt/1/1 /mnt/1/1 > ++ seq 2 > + for i in '`seq 2`' > + umount /mnt/1/1 > + for i in '`seq 2`' > + umount /mnt/1/1 > umount: /mnt/1/1: not mounted > + cat /proc/self/mountinfo > + grep xxx > 147 116 0:42 / /mnt rw,relatime - tmpfs xxx rw > 148 147 0:42 /1 /mnt/1 rw,relatime shared:65 - tmpfs xxx rw > 149 147 0:42 /1 /mnt/2 rw,relatime shared:65 - tmpfs xxx rw > 157 149 0:42 /1/1 /mnt/2/1 rw,relatime shared:65 - tmpfs xxx rw > + exit 1 > > And you can see that /mnt/2 contains a mount, but it should not. > > Thanks, > Andrei > > On Thu, Jan 05, 2017 at 10:04:14AM +1300, Eric W. Biederman wrote: >> >> Ever since mount propagation was introduced in cases where a mount in >> propagated to parent mount mountpoint pair that is already in use the >> code has placed the new mount behind the old mount in the mount hash >> table. >> >> This implementation detail is problematic as it allows creating >> arbitrary length mount hash chains. 
>> >> Furthermore it invalidates the constraint maintained elsewhere in the >> mount code that a parent mount and a mountpoint pair will have exactly >> one mount upon them. Making it hard to deal with and to talk about >> this special case in the mount code. >> >> Modify mount propagation to notice when there is already a mount at >> the parent mount and mountpoint where a new mount is propagating to >> and place that preexisting mount on top of the new mount. >> >> Modify unmount propagation to notice when a mount that is being >> unmounted has another mount on top of it (and no other children), and >> to replace the unmounted mount with the mount on top of it. >> >> Move the MNT_UMUONT test from __lookup_mnt_last into >> __propagate_umount as that is the only call of __lookup_mnt_last where >> MNT_UMOUNT may be set on any mount visible in the mount hash table. >> >> These modifications allow: >> - __lookup_mnt_last to be removed. >> - attach_shadows to be renamed __attach_mnt and the it's shadow >> handling to be removed. >> - commit_tree to be simplified >> - copy_tree to be simplified >> >> The result is an easier to understand tree of mounts that does not >> allow creation of arbitrary length hash chains in the mount hash table. >> >> v2: Updated to mnt_change_mountpoint to not call dput or mntput >> and instead to decrement the counts directly. It is guaranteed >> that there will be other references when mnt_change_mountpoint is >> called so this is safe. >> >> Cc: stable@vger.kernel.org >> Fixes: b90fa9ae8f51 ("[PATCH] shared mount handling: bind and rbind") >> Tested-by: Andrei Vagin <avagin@virtuozzo.com> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> >> --- >> >> Since the last version some of you may have seen I have modified >> my implementation of mnt_change_mountpoint so that it no longer calls >> mntput or dput but instead relies on the knowledge that it can not >> possibly have the last reference to the mnt and dentry of interest. 
>> This avoids code checking tools from complaining bitterly. >> >> This is on top of my previous patch that sorts out locking of the >> mountpoint hash table. After time giving ample time for review I intend >> to push this and the previous bug fix to Linus. >> >> [ diffstat and patch body re-quoted verbatim; identical to the posting above, snipped ]
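The unmount side of the change can be sketched the same way: `__propagate_umount` now looks for a single "topper" mount covering all of the dying mount's root and splices it into the dying mount's place via `mnt_change_mountpoint`. Below is a hedged toy model in Python; the names and list handling are invented for illustration and do not mirror the kernel's actual hlist/list machinery.

```python
# Toy model of the umount-side change: when a mount slated for unmount has
# exactly one child mounted over its root (a "topper"), the topper is
# re-attached in the dying mount's place (cf. the mnt_change_mountpoint
# call in __propagate_umount). Names and structure are invented for the
# sketch; this is not kernel code.

class Mount:
    def __init__(self, name, parent=None, mountpoint=None):
        self.name = name
        self.parent = parent
        self.mountpoint = mountpoint
        self.children = []
        if parent is not None:
            parent.children.append(self)

def propagate_umount(mnt):
    """Unmount mnt if possible; a sole topper child takes over its place."""
    topper = None
    if len(mnt.children) == 1 and mnt.children[0].mountpoint == "/":
        topper = mnt.children[0]
    if topper is not None:
        # splice the topper into the dying mount's (parent, mountpoint)
        mnt.children.remove(topper)
        topper.parent, topper.mountpoint = mnt.parent, mnt.mountpoint
        mnt.parent.children.append(topper)
    if not mnt.children:          # only childless mounts can go away
        mnt.parent.children.remove(mnt)
        return True
    return False

root = Mount("root")
child = Mount("child", root, "/mnt/1/1")  # the propagated mount being unmounted
top = Mount("topper", child, "/")         # the single mount covering it

assert propagate_umount(child)                               # child goes away
assert top.parent is root and top.mountpoint == "/mnt/1/1"   # topper replaced it
```

This mirrors why `do_refcount_check` in the patch counts a topper as an expected extra reference: a mount with only a topper on it is still considered unmountable, because the topper will be re-homed rather than blocking the unmount.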
ebiederm@xmission.com (Eric W. Biederman) writes: > Andrei Vagin <avagin@virtuozzo.com> writes: > >> Hi Eric, >> >> I found one issue about this patch. Take a look at the next script. The >> idea is simple, we call mount twice and then we call umount twice, but >> the second umount fails. > > Definitely odd. I will take a look. After a little more looking I have to say the only thing I can see wrong in the behavior is that the first umount doesn't unmount everything. Changing the mount propagation tree while it is being traversed apparently prevents a complete traversal of the propagation tree. Last time I was looking at this I was playing with a multipass algorithm because of these peculiarities that happen with propagation trees. I suspect we are going to need to fix umount to behave correctly before we tune it for performance, so that we actually know what correct behavior is. Eric >> [ reproducer script, its output, and the full patch re-quoted verbatim; snipped ]
diff --git a/fs/mount.h b/fs/mount.h
index 2c856fc47ae3..2826543a131d 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -89,7 +89,6 @@ static inline int is_mounted(struct vfsmount *mnt)
 }
 
 extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *);
-extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
 
 extern int __legitimize_mnt(struct vfsmount *, unsigned);
 extern bool legitimize_mnt(struct vfsmount *, unsigned);
diff --git a/fs/namespace.c b/fs/namespace.c
index 487ba30bb5c6..91ccfb73f0e0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -637,28 +637,6 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
 }
 
 /*
- * find the last mount at @dentry on vfsmount @mnt.
- * mount_lock must be held.
- */
-struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry)
-{
-	struct mount *p, *res = NULL;
-	p = __lookup_mnt(mnt, dentry);
-	if (!p)
-		goto out;
-	if (!(p->mnt.mnt_flags & MNT_UMOUNT))
-		res = p;
-	hlist_for_each_entry_continue(p, mnt_hash) {
-		if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry)
-			break;
-		if (!(p->mnt.mnt_flags & MNT_UMOUNT))
-			res = p;
-	}
-out:
-	return res;
-}
-
-/*
  * lookup_mnt - Return the first child mount mounted at path
  *
  * "First" means first mounted chronologically.  If you create the
@@ -878,6 +856,13 @@ void mnt_set_mountpoint(struct mount *mnt,
 	hlist_add_head(&child_mnt->mnt_mp_list, &mp->m_list);
 }
 
+static void __attach_mnt(struct mount *mnt, struct mount *parent)
+{
+	hlist_add_head_rcu(&mnt->mnt_hash,
+			   m_hash(&parent->mnt, mnt->mnt_mountpoint));
+	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+}
+
 /*
  * vfsmount lock must be held for write
  */
@@ -886,28 +871,45 @@ static void attach_mnt(struct mount *mnt,
 			struct mountpoint *mp)
 {
 	mnt_set_mountpoint(parent, mp, mnt);
-	hlist_add_head_rcu(&mnt->mnt_hash, m_hash(&parent->mnt, mp->m_dentry));
-	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+	__attach_mnt(mnt, parent);
 }
 
-static void attach_shadowed(struct mount *mnt,
-			struct mount *parent,
-			struct mount *shadows)
+void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
 {
-	if (shadows) {
-		hlist_add_behind_rcu(&mnt->mnt_hash, &shadows->mnt_hash);
-		list_add(&mnt->mnt_child, &shadows->mnt_child);
-	} else {
-		hlist_add_head_rcu(&mnt->mnt_hash,
-				m_hash(&parent->mnt, mnt->mnt_mountpoint));
-		list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
-	}
+	struct mountpoint *old_mp = mnt->mnt_mp;
+	struct dentry *old_mountpoint = mnt->mnt_mountpoint;
+	struct mount *old_parent = mnt->mnt_parent;
+
+	list_del_init(&mnt->mnt_child);
+	hlist_del_init(&mnt->mnt_mp_list);
+	hlist_del_init_rcu(&mnt->mnt_hash);
+
+	attach_mnt(mnt, parent, mp);
+
+	put_mountpoint(old_mp);
+
+	/*
+	 * Safely avoid even the suggestion this code might sleep or
+	 * lock the mount hash by taking avantage of the knowlege that
+	 * mnt_change_mounpoint will not release the final reference
+	 * to a mountpoint.
+	 *
+	 * During mounting, another mount will continue to use the old
+	 * mountpoint and during unmounting, the old mountpoint will
+	 * continue to exist until namespace_unlock which happens well
+	 * after mnt_change_mountpoint.
+	 */
+	spin_lock(&old_mountpoint->d_lock);
+	old_mountpoint->d_lockref.count--;
+	spin_unlock(&old_mountpoint->d_lock);
+
+	mnt_add_count(old_parent, -1);
 }
 
 /*
  * vfsmount lock must be held for write
  */
-static void commit_tree(struct mount *mnt, struct mount *shadows)
+static void commit_tree(struct mount *mnt)
 {
 	struct mount *parent = mnt->mnt_parent;
 	struct mount *m;
@@ -925,7 +927,7 @@ static void commit_tree(struct mount *mnt, struct mount *shadows)
 	n->mounts += n->pending_mounts;
 	n->pending_mounts = 0;
 
-	attach_shadowed(mnt, parent, shadows);
+	__attach_mnt(mnt, parent);
 	touch_mnt_namespace(n);
 }
 
@@ -1764,7 +1766,6 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
 			continue;
 
 		for (s = r; s; s = next_mnt(s, r)) {
-			struct mount *t = NULL;
 			if (!(flag & CL_COPY_UNBINDABLE) &&
 			    IS_MNT_UNBINDABLE(s)) {
 				s = skip_mnt_tree(s);
@@ -1786,14 +1787,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
 				goto out;
 			lock_mount_hash();
 			list_add_tail(&q->mnt_list, &res->mnt_list);
-			mnt_set_mountpoint(parent, p->mnt_mp, q);
-			if (!list_empty(&parent->mnt_mounts)) {
-				t = list_last_entry(&parent->mnt_mounts,
-						struct mount, mnt_child);
-				if (t->mnt_mp != p->mnt_mp)
-					t = NULL;
-			}
-			attach_shadowed(q, parent, t);
+			attach_mnt(q, parent, p->mnt_mp);
 			unlock_mount_hash();
 		}
 	}
@@ -1992,10 +1986,18 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 {
 	HLIST_HEAD(tree_list);
 	struct mnt_namespace *ns = dest_mnt->mnt_ns;
+	struct mountpoint *smp;
 	struct mount *child, *p;
 	struct hlist_node *n;
 	int err;
 
+	/* Preallocate a mountpoint in case the new mounts need
+	 * to be tucked under other mounts.
+	 */
+	smp = get_mountpoint(source_mnt->mnt.mnt_root);
+	if (IS_ERR(smp))
+		return PTR_ERR(smp);
+
 	/* Is there space to add these mounts to the mount namespace? */
 	if (!parent_path) {
 		err = count_mounts(ns, source_mnt);
@@ -2022,17 +2024,22 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 		touch_mnt_namespace(source_mnt->mnt_ns);
 	} else {
 		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
-		commit_tree(source_mnt, NULL);
+		commit_tree(source_mnt);
 	}
 
 	hlist_for_each_entry_safe(child, n, &tree_list, mnt_hash) {
-		struct mount *q;
 		hlist_del_init(&child->mnt_hash);
-		q = __lookup_mnt_last(&child->mnt_parent->mnt,
-				      child->mnt_mountpoint);
-		commit_tree(child, q);
+		if (child->mnt.mnt_root == smp->m_dentry) {
+			struct mount *q;
+			q = __lookup_mnt(&child->mnt_parent->mnt,
+					 child->mnt_mountpoint);
+			if (q)
+				mnt_change_mountpoint(child, smp, q);
+		}
+		commit_tree(child);
 	}
 	unlock_mount_hash();
+	put_mountpoint(smp);
 
 	return 0;
 
@@ -2046,6 +2053,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 	cleanup_group_ids(source_mnt, NULL);
 out:
 	ns->pending_mounts = 0;
+	put_mountpoint(smp);
 	return err;
 }
 
diff --git a/fs/pnode.c b/fs/pnode.c
index 06a793f4ae38..eb4331240fd1 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -327,6 +327,9 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
  */
 static inline int do_refcount_check(struct mount *mnt, int count)
 {
+	struct mount *topper = __lookup_mnt(&mnt->mnt, mnt->mnt.mnt_root);
+	if (topper)
+		count++;
 	return mnt_get_count(mnt) > count;
 }
 
@@ -359,7 +362,7 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
 
 	for (m = propagation_next(parent, parent); m;
 	     m = propagation_next(m, parent)) {
-		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
+		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
 		if (child && list_empty(&child->mnt_mounts) &&
 		    (ret = do_refcount_check(child, 1)))
 			break;
@@ -381,7 +384,7 @@ void propagate_mount_unlock(struct mount *mnt)
 
 	for (m = propagation_next(parent, parent); m;
 	     m = propagation_next(m, parent)) {
-		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
+		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
 		if (child)
 			child->mnt.mnt_flags &= ~MNT_LOCKED;
 	}
@@ -399,9 +402,11 @@ static void mark_umount_candidates(struct mount *mnt)
 
 	for (m = propagation_next(parent, parent); m;
 	     m = propagation_next(m, parent)) {
-		struct mount *child = __lookup_mnt_last(&m->mnt,
+		struct mount *child = __lookup_mnt(&m->mnt,
 						mnt->mnt_mountpoint);
-		if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) {
+		if (!child || (child->mnt.mnt_flags & MNT_UMOUNT))
+			continue;
+		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
 			SET_MNT_MARK(child);
 		}
 	}
@@ -420,8 +425,8 @@ static void __propagate_umount(struct mount *mnt)
 
 	for (m = propagation_next(parent, parent); m;
 	     m = propagation_next(m, parent)) {
-
-		struct mount *child = __lookup_mnt_last(&m->mnt,
+		struct mount *topper;
+		struct mount *child = __lookup_mnt(&m->mnt,
 						mnt->mnt_mountpoint);
 		/*
 		 * umount the child only if the child has no children
@@ -430,6 +435,16 @@ static void __propagate_umount(struct mount *mnt)
 		if (!child || !IS_MNT_MARKED(child))
 			continue;
 		CLEAR_MNT_MARK(child);
+
+		/* If there is exactly one mount covering all of child
+		 * replace child with that mount.
+		 */
+		topper = __lookup_mnt(&child->mnt, child->mnt.mnt_root);
+		if (topper &&
+		    (child->mnt_mounts.next == &topper->mnt_child) &&
+		    (topper->mnt_child.next == &child->mnt_mounts))
+			mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, topper);
+
 		if (list_empty(&child->mnt_mounts)) {
 			list_del_init(&child->mnt_child);
 			child->mnt.mnt_flags |= MNT_UMOUNT;
diff --git a/fs/pnode.h b/fs/pnode.h
index 550f5a8b4fcf..dc87e65becd2 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -49,6 +49,8 @@ int get_dominating_id(struct mount *mnt, const struct path *root);
 unsigned int mnt_get_count(struct mount *mnt);
 void mnt_set_mountpoint(struct mount *, struct mountpoint *,
 			struct mount *);
+void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp,
+			   struct mount *mnt);
 struct mount *copy_tree(struct mount *, struct dentry *, int);
 bool is_path_reachable(struct mount *, struct dentry *,
 		       const struct path *root);