Message ID | 148029910861.27779.4517883721395202453.stgit@pluto.themaw.net (mailing list archive) |
---|---|
State | New, archived |
So as far as I can tell, this patch series is intended to address Al's review comments against the http://lkml.kernel.org/r/20161011053352.27645.83962.stgit@pluto.themaw.net series?
On Wed, 2016-11-30 at 14:22 -0800, Andrew Morton wrote:
> So as far as I can tell, this patch series is intended to address Al's
> review comments against the
> http://lkml.kernel.org/r/20161011053352.27645.83962.stgit@pluto.themaw.net
> series?

That's right, and also to fix an additional problem I found in subsequent
testing (the may_umount_tree() changes).

Ian
FWIW, I've folded that pile into vfs.git#work.autofs.

Problems:

* (fixed) __path_is_mountpoint() should _not_ treat NULL from
__lookup_mnt() as "nothing's mounted there" until it has checked that
mount_lock hadn't been touched - mount --move on something unrelated can
race with lockless hash lookup and lead to false negatives.

* linux/mount.h might be the wrong place for path_is_mountpoint(). Or it
shouldn't be inlined. I don't like the includes you've added there.

* path_has_submounts() is broken. At the very least, it's AB-BA between
mount_lock and rename_lock. I would suggest trying to put
read_seqlock_excl(&mount_lock) around the call of d_walk() in there, and
using __lookup_mnt() in the callback (without retries on the mount_lock,
of course - read_seqlock_excl done on the outside is enough). I'm not
sure if it won't cause trouble with contention, though; that needs
testing. As it is, that function is broken in #work.autofs, same as it is
in -mm and -next.

* the last one (propagation-related) is too ugly to live - at the very
least, its pieces should live in fs/pnode.c; exposing propagate_next() is
simply wrong. I hadn't picked that one at all, and I would suggest
coordinating anything in that area with ebiederman - he has some work
around fs/pnode.c and you risk stepping on his toes.
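For readers trying to picture the path_has_submounts() suggestion, a rough
sketch of the shape it could take follows. It is only an illustration of
the approach described above, not the code in vfs.git#work.autofs: it
assumes the function lives in fs/dcache.c (where d_walk() is static and
the fs/mount.h internals are reachable), and the d_walk() callback
arrangement shown (with a trailing finish argument) is the one used by
kernels of that era.

struct check_mount {
	struct vfsmount *mnt;		/* vfsmount of the path being checked */
	unsigned int mounted;		/* set once a submount is found */
};

static enum d_walk_ret path_check_mount(void *data, struct dentry *dentry)
{
	struct check_mount *info = data;

	/*
	 * mount_lock is held (read_seqlock_excl) by the caller, so a single
	 * __lookup_mnt() call is enough here - no seqcount retry loop.
	 */
	if (likely(!d_mountpoint(dentry)))
		return D_WALK_CONTINUE;
	if (__lookup_mnt(info->mnt, dentry)) {
		info->mounted = 1;
		return D_WALK_QUIT;
	}
	return D_WALK_CONTINUE;
}

/*
 * Return true if any dentry below parent->dentry is a mountpoint on
 * parent->mnt in the current namespace.
 */
bool path_has_submounts(const struct path *parent)
{
	struct check_mount data = { .mnt = parent->mnt, .mounted = 0 };

	/*
	 * Take mount_lock outside d_walk() (which takes rename_lock), so the
	 * lock order is always mount_lock -> rename_lock and the AB-BA goes
	 * away.
	 */
	read_seqlock_excl(&mount_lock);
	d_walk(parent->dentry, &data, path_check_mount, NULL);
	read_sequnlock_excl(&mount_lock);

	return data.mounted;
}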
On Sat, Dec 03, 2016 at 05:13:22AM +0000, Al Viro wrote:
> * path_has_submounts() is broken. At the very least, it's AB-BA between
> mount_lock and rename_lock. I would suggest trying to put
> read_seqlock_excl(&mount_lock) around the call of d_walk() in there, and
> using __lookup_mnt() in the callback (without retries on the mount_lock,
> of course - read_seqlock_excl done on the outside is enough). I'm not
> sure if it won't cause trouble with contention, though; that needs
> testing. As it is, that function is broken in #work.autofs, same as it
> is in -mm and -next.

Fix for path_has_submounts() (as above) force-pushed. It does need
testing and profiling, obviously.
On Sat, 2016-12-03 at 05:13 +0000, Al Viro wrote:
> FWIW, I've folded that pile into vfs.git#work.autofs.
>
> Problems:
>
> * (fixed) __path_is_mountpoint() should _not_ treat NULL from
> __lookup_mnt() as "nothing's mounted there" until it has checked
> that mount_lock hadn't been touched - mount --move on something unrelated
> can race with lockless hash lookup and lead to false negatives.

Right, looking at what you've done there the mistake is so obvious now!
Thanks for helping with it.

> * linux/mount.h might be the wrong place for path_is_mountpoint().
> Or it shouldn't be inlined. I don't like the includes you've added there.
>
> * path_has_submounts() is broken. At the very least, it's
> AB-BA between mount_lock and rename_lock. I would suggest trying to
> put read_seqlock_excl(&mount_lock) around the call of d_walk() in there,
> and using __lookup_mnt() in the callback (without retries on the
> mount_lock, of course - read_seqlock_excl done on the outside is enough).
> I'm not sure if it won't cause trouble with contention, though; that
> needs testing. As it is, that function is broken in #work.autofs, same
> as it is in -mm and -next.

Umm ... that's a much more obvious dumb mistake, and what you've done
there didn't occur to me even after you spelled it out; I'll take some
time to digest it. And thanks for that one too.

Ian
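To make the race above concrete, here is a hedged sketch of a lookup loop
that does not trust a NULL result until the mount_lock seqcount has been
seen unchanged. It keeps the rest of the posted patch's behaviour
(including treating an MNT_SYNC_UMOUNT result as final), would have to
live in fs/namespace.c where struct mount, __lookup_mnt() and mount_lock
are visible, and is an illustration of the fix being discussed, not
necessarily the exact code now in #work.autofs.

bool __path_is_mountpoint(const struct path *path)
{
	struct mount *mount;
	struct vfsmount *mnt;
	unsigned seq;

	rcu_read_lock();
	do {
		seq = read_seqbegin(&mount_lock);
		mount = __lookup_mnt(path->mnt, path->dentry);
		mnt = mount ? &mount->mnt : NULL;
		/*
		 * The point from Al's review: retry whenever mount_lock was
		 * touched, even if the lookup returned NULL - a concurrent
		 * "mount --move" elsewhere can rehash entries under us and
		 * make that NULL a false negative.  A mount already marked
		 * MNT_SYNC_UMOUNT is taken as a final answer, as in the
		 * posted patch.
		 */
	} while (!(mnt && (mnt->mnt_flags & MNT_SYNC_UMOUNT)) &&
		 read_seqretry(&mount_lock, seq));
	rcu_read_unlock();

	return mnt != NULL;
}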
On Sat, 2016-12-03 at 23:29 +0000, Al Viro wrote:
> On Sat, Dec 03, 2016 at 05:13:22AM +0000, Al Viro wrote:
> >
> > * path_has_submounts() is broken. At the very least, it's
> > AB-BA between mount_lock and rename_lock. I would suggest trying to
> > put read_seqlock_excl(&mount_lock) around the call of d_walk() in there,
> > and using __lookup_mnt() in the callback (without retries on the
> > mount_lock, of course - read_seqlock_excl done on the outside is
> > enough). I'm not sure if it won't cause trouble with contention,
> > though; that needs testing. As it is, that function is broken in
> > #work.autofs, same as it is in -mm and -next.
> Fix for path_has_submounts() (as above) force-pushed. It does
> need testing and profiling, obviously.

I'll run my tests against it and re-run with oprofile if all goes well.

The submount-test I use should show contention if it's a problem, but I'm
not sure the number of mounts used will be sufficient to show up scaling
problems.

Basically each case of the test (there are two) runs for 100 iterations
using 10 processes, with timing set to promote expire to mount contention,
mainly to test for expire to mount races.

If I don't see contention I might need to analyse further whether the test
has adequate coverage.

Ian
On Sun, 2016-12-04 at 10:18 +0800, Ian Kent wrote:
> On Sat, 2016-12-03 at 23:29 +0000, Al Viro wrote:
> > On Sat, Dec 03, 2016 at 05:13:22AM +0000, Al Viro wrote:
> > >
> > > * path_has_submounts() is broken. At the very least, it's
> > > AB-BA between mount_lock and rename_lock. I would suggest trying to
> > > put read_seqlock_excl(&mount_lock) around the call of d_walk() in
> > > there, and using __lookup_mnt() in the callback (without retries on
> > > the mount_lock, of course - read_seqlock_excl done on the outside is
> > > enough). I'm not sure if it won't cause trouble with contention,
> > > though; that needs testing. As it is, that function is broken in
> > > #work.autofs, same as it is in -mm and -next.
> > Fix for path_has_submounts() (as above) force-pushed. It does
> > need testing and profiling, obviously.
> I'll run my tests against it and re-run with oprofile if all goes well.
>
> The submount-test I use should show contention if it's a problem, but
> I'm not sure the number of mounts used will be sufficient to show up
> scaling problems.
>
> Basically each case of the test (there are two) runs for 100 iterations
> using 10 processes, with timing set to promote expire to mount
> contention, mainly to test for expire to mount races.
>
> If I don't see contention I might need to analyse further whether the
> test has adequate coverage.

I have run my usual tests on a build of vfs.git#work.autofs.

Both tests ran without problem multiple times (even though they should be
deterministic, experience has shown they sometimes aren't). I also did a
system wide oprofile run of an additional run of the submount-test.

I hadn't noticed that the submount-test runs are shorter than I have come
to expect. This might be due to changes to the behaviour of the expire and
mount timing over time. But I always check that I'm seeing expire and
mount behaviour that should lead to contention during the test run, and
that looked as expected.

The run time was about 55 minutes for each of the two cases I test,
whereas I had come to expect a run time of around 70 minutes each. It's
been quite a while since I've actually paid attention to this and a lot
has changed in the VFS since. It might be due to Neil Brown's changes to
satisfy path walks in rcu-walk mode where possible rather than always
dropping into ref-walk mode; I didn't profile those changes when they were
implemented because I didn't notice unexpectedly different run times when
testing them.

I was going to talk about why the autofs usage of path_has_submounts()
should not be problematic, but that would be tedious for everyone. Instead
I'll just say that it is clearly possible to abuse path_has_submounts() by
calling it on mount points that have a large number of directories, but
autofs shouldn't do that and the profile verifies this.

I'm not sure how accurate the profile is (I don't do it very often). There
were about 400000 out of 22000000 samples missed. And hopefully the report
output is readable enough to be useful after posting ....
Anyway, a system wide opreport (such as it is) showed this:

CPU: Intel Haswell microarchitecture, speed 3700 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000

samples   %        image name                  app name      symbol name
3897186   17.6585  libc-2.22.so                operf         __strcmp_sse2_unaligned
3125714   14.1629  libstdc++.so.6.0.21         operf         std::_Rb_tree_increment(std::_Rb_tree_node_base const*)
1500262    6.7978  libsqlite3.so.0.8.6         evolution     /usr/lib64/libsqlite3.so.0.8.6
 856262    3.8798  operf                       operf         /usr/bin/operf
 749603    3.3965  libglib-2.0.so.0.4600.2     evolution     /usr/lib64/libglib-2.0.so.0.4600.2
 560812    2.5411  libdbus-1.so.3.14.8         dbus-daemon   /usr/lib64/libdbus-1.so.3.14.8
 378227    1.7138  systemd                     systemd       /usr/lib/systemd/systemd
 301175    1.3647  libcamel-1.2.so.54.0.0      evolution     /usr/lib64/libcamel-1.2.so.54.0.0
 223545    1.0129  dbus-daemon                 dbus-daemon   /usr/bin/dbus-daemon
 187783    0.8509  libc-2.22.so                dbus-daemon   __strcmp_sse2_unaligned
 157564    0.7139  libc-2.22.so                evolution     __strcmp_sse2_unaligned
 157218    0.7124  libc-2.22.so                evolution     _int_malloc
 156089    0.7073  libgobject-2.0.so.0.4600.2  evolution     /usr/lib64/libgobject-2.0.so.0.4600.2
 116277    0.5269  libc-2.22.so                evolution     _int_free
 116275    0.5269  libxul.so                   firefox       /usr/lib64/firefox/libxul.so
 104278    0.4725  libc-2.22.so                evolution     __GI_____strtoull_l_internal
  92937    0.4211  libc-2.22.so                ps            _IO_vfscanf
...
    101   4.6e-04  vmlinux-4.9.0-rc7           opendir       path_is_mountpoint
...
     23   1.0e-04  vmlinux-4.9.0-rc7           opendir       path_has_submounts
...

where opendir is a program the test uses to trigger a mount and do various
checks on the result.

Not sure if this is enough or is what's needed, let me know if there's
something else specific I should do.

Ian
On Sat, 2016-12-03 at 05:13 +0000, Al Viro wrote:
> FWIW, I've folded that pile into vfs.git#work.autofs.
>
> Problems:

snip ...

> * the last one (propagation-related) is too ugly to live - at the
> very least, its pieces should live in fs/pnode.c; exposing
> propagate_next() is simply wrong. I hadn't picked that one at all, and
> I would suggest coordinating anything in that area with ebiederman - he
> has some work around fs/pnode.c and you risk stepping on his toes.

The earlier patches seem to be ok now, so how about we talk a little about
this last one.

Eric, Al mentioned that you are working with fs/pnode.c and recommended I
co-ordinate with you.

So is my working on this (which is most likely going to live in pnode.c if
I can get something acceptable) going to cause complications for you? Is
what you're doing at a point where it would be worth doing as Al suggests?

Anyway, the problem that this patch is supposed to solve is to check if
any of the list of mnt_mounts, or any of the mounts propagated from each,
are in use.

One obvious problem with it is the propagation could be very large.

But now I look at it again there's no reason to have to visit every tree,
because if one tree is busy then the set of trees is busy. But every tree
would be visited if it's not busy, so it's perhaps still a problem.

The difficult thing is working out if a tree is busy, more so because
there could be a struct path holding references within any of the trees,
so I don't know of a simpler, more efficient way to check for this.

Anyone have any suggestions at all?

Ian
Ian Kent <raven@themaw.net> writes:

> On Sat, 2016-12-03 at 05:13 +0000, Al Viro wrote:
>> FWIW, I've folded that pile into vfs.git#work.autofs.
>>
>> Problems:
>
> snip ...
>
>> * the last one (propagation-related) is too ugly to live - at the
>> very least, its pieces should live in fs/pnode.c; exposing
>> propagate_next() is simply wrong. I hadn't picked that one at all, and
>> I would suggest coordinating anything in that area with ebiederman -
>> he has some work around fs/pnode.c and you risk stepping on his toes.
>
> The earlier patches seem to be ok now, so how about we talk a little
> about this last one.
>
> Eric, Al mentioned that you are working with fs/pnode.c and recommended
> I co-ordinate with you.
>
> So is my working on this (which is most likely going to live in pnode.c
> if I can get something acceptable) going to cause complications for you?
> Is what you're doing at a point where it would be worth doing as Al
> suggests?
>
> Anyway, the problem that this patch is supposed to solve is to check if
> any of the list of mnt_mounts, or any of the mounts propagated from
> each, are in use.
>
> One obvious problem with it is the propagation could be very large.
>
> But now I look at it again there's no reason to have to visit every
> tree, because if one tree is busy then the set of trees is busy. But
> every tree would be visited if it's not busy, so it's perhaps still a
> problem.
>
> The difficult thing is working out if a tree is busy, more so because
> there could be a struct path holding references within any of the
> trees, so I don't know of a simpler, more efficient way to check for
> this.

So coordination seems to be in order. Not so much because of your current
implementation (which won't tell you what you want to know) but because an
implementation that tells you what you are looking for has really bad
(worse than O(N^2)) complexity walking the propagation tree in the worst
case.

To be correct your code would need to use next_mnt and walk the tree and
on each mount use propagate_mnt_busy to see if that entry can be
unmounted. Possibly with small variations to match your case. Mounts in
one part of the tree may propagate to places that mounts in other parts of
the tree will not.

I am assuming you are not worried about MNT_DETACH, as that case would not
need may_umount_tree at all.

I am researching how to make walking propagation for multiple mounts
efficient but it is a non-trivial problem.

> Anyone have any suggestions at all?

In the short term I recommend just not doing this, but if you want to see
if you can find an efficient algorithm with me, I am happy to bring you up
to speed on all of the gory details.

Eric
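To give the suggestion above some shape, here is a rough sketch of such a
walk. It is an illustration only, not the posted patch: next_mnt() is
private to fs/namespace.c and the busy check used is fs/pnode.c's
propagate_mount_busy() (the helper referred to in this thread as
propagate_mnt_busy), so code like this would have to live alongside them,
with namespace_sem held as may_umount() does. As the comment notes, the
existing helper still reports any mount that has children as busy -
exactly the limitation Ian raises below - so a usable version would need a
variant that folds the count of child mounts into the reference baseline.

/*
 * Sketch: walk every mount in the tree rooted at 'root' with next_mnt()
 * and ask propagate_mount_busy() whether that mount could be unmounted.
 */
static bool tree_is_busy(struct mount *root)
{
	struct mount *p;
	bool busy = false;

	lock_mount_hash();
	for (p = root; p; p = next_mnt(p, root)) {
		/*
		 * "2" mirrors the reference baseline do_umount() and
		 * may_umount() use.  As it stands propagate_mount_busy()
		 * treats any mount that still has children as busy, so this
		 * loop only really answers the question for leaf mounts;
		 * intermediate mounts need the "count of submounts" variant
		 * discussed in this thread.
		 */
		if (propagate_mount_busy(p, 2)) {
			busy = true;
			break;
		}
	}
	unlock_mount_hash();

	return busy;
}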
On Thu, 2016-12-08 at 10:30 +1300, Eric W. Biederman wrote:
> Ian Kent <raven@themaw.net> writes:
>
> > On Sat, 2016-12-03 at 05:13 +0000, Al Viro wrote:
> > > FWIW, I've folded that pile into vfs.git#work.autofs.
> > >
> > > Problems:
> >
> > snip ...
> >
> > > * the last one (propagation-related) is too ugly to live - at the
> > > very least, its pieces should live in fs/pnode.c; exposing
> > > propagate_next() is simply wrong. I hadn't picked that one at all,
> > > and I would suggest coordinating anything in that area with
> > > ebiederman - he has some work around fs/pnode.c and you risk
> > > stepping on his toes.
> >
> > The earlier patches seem to be ok now, so how about we talk a little
> > about this last one.
> >
> > Eric, Al mentioned that you are working with fs/pnode.c and
> > recommended I co-ordinate with you.
> >
> > So is my working on this (which is most likely going to live in
> > pnode.c if I can get something acceptable) going to cause
> > complications for you? Is what you're doing at a point where it would
> > be worth doing as Al suggests?
> >
> > Anyway, the problem that this patch is supposed to solve is to check
> > if any of the list of mnt_mounts, or any of the mounts propagated
> > from each, are in use.
> >
> > One obvious problem with it is the propagation could be very large.
> >
> > But now I look at it again there's no reason to have to visit every
> > tree, because if one tree is busy then the set of trees is busy. But
> > every tree would be visited if it's not busy, so it's perhaps still a
> > problem.
> >
> > The difficult thing is working out if a tree is busy, more so because
> > there could be a struct path holding references within any of the
> > trees, so I don't know of a simpler, more efficient way to check for
> > this.
>
> So coordination seems to be in order. Not so much because of your
> current implementation (which won't tell you what you want to know)

Umm ... ok, I've missed something then, because the patch I posted does
appear to get the calculation right. Perhaps I've missed a case where it
gets it wrong ....

But now I'm looking deeper into it I'm beginning to wonder why it worked
for one of the cases. Maybe I need to do some more tests with even more
printks.

> but because an implementation that tells you what you are looking for
> has really bad (worse than O(N^2)) complexity walking the propagation
> tree in the worst case.

Yes, it is bad indeed.

> To be correct your code would need to use next_mnt and walk the tree and
> on each mount use propagate_mnt_busy to see if that entry can be
> unmounted. Possibly with small variations to match your case. Mounts
> in one part of the tree may propagate to places that mounts in other
> parts of the tree will not.

Right, the point there being use propagate_mnt_busy() for the check.

I tried to use that when I started this but had trouble, I'll have another
look at using it.

I thought the reason I couldn't use propagate_mnt_busy() is because it
will see any mount with submounts as busy, and that's not what's needed.
It's mounts below each of the mounts having submounts that make them busy
in my case.

You probably don't want to read about this but, unfortunately, an example
is possibly the best way to describe what I need to achieve.

A simple example is the autofs direct mount map entry (a so called autofs
multi-mount):

/absolute/path/to/direct/mount-point \
    /offset/path/one   server:/export/path/one \
    /offset2/path/two  server2:/export/path/other

The path /absolute/path/to/direct/mount-point is an autofs direct mount
trigger and is essentially an autofs mount. When the mount is triggered,
each of the offset trigger mounts, /offset/path/one and /offset2/path/two,
is mounted so they can trigger mounts independently.

When the direct mount /absolute/path/to/direct/mount-point is expired, the
two offsets need to be umounted if they are not busy, so
/absolute/path/to/direct/mount-point having submounts doesn't necessarily
make it unable to be umounted.

But this is (supposed) to be for only one level of mounts; for example if
I had:

/absolute/path/to/direct/mount-point \
    /offset/path/one          server:/export/path/one \
    /offset2/path/two         server2:/export/path/other \
    /offset/path/one/nested   server3:/export/path3/nested

then the offsets below the nesting point /offset/path/one make it busy and
need to be expired first.

So why would I even need to do this? It's functionality of Sun autofs
maps, and that's the primary map format autofs supports, so that's why
it's needed rather than something I chose to do this way.

Another example of one of these multi-mounts is:

/absolute/path/to/direct/mount-point \
    /                  servern:/another/export \
    /offset/path/one   server:/export/path/one \
    /offset2/path/two  server2:/export/path/other

where an NFS mount is done on the root of the offset tree (similar to the
case where nesting is present in the example map entry above). Once that
happens, autofs can't get a callback to mount offsets unless there are
offset triggers mounted to trigger the mounts.

> I am assuming you are not worried about MNT_DETACH, as that case would
> not need may_umount_tree at all.

I'm not.

> I am researching how to make walking propagation for multiple mounts
> efficient but it is a non-trivial problem.

Sure is.

> > Anyone have any suggestions at all?
>
> In the short term I recommend just not doing this, but if you want to
> see if you can find an efficient algorithm with me, I am happy to bring
> you up to speed on all of the gory details.

I need to do this to make autofs play better in the presence of namespaces
and that's a problem now and is long overdue to be resolved.

So I'm happy to do whatever I can to work toward that.

I could leave this part of the implementation until later, because the
false positive this solves leads to an expire fail and that is (must be)
handled properly by the user space daemon.

But when I first noticed the false positive the expire failure was handled
ok by the daemon (AFAICS), but there was what looked like a reference
counting problem (but not quite that either, odd) which I was going to
return to later.

Maybe I've stumbled onto a subtle propagation bug that really needs
fixing ....

Ian
Ian Kent <raven@themaw.net> writes:

> On Thu, 2016-12-08 at 10:30 +1300, Eric W. Biederman wrote:
>> Ian Kent <raven@themaw.net> writes:
>>
>> > On Sat, 2016-12-03 at 05:13 +0000, Al Viro wrote:
>> > > FWIW, I've folded that pile into vfs.git#work.autofs.
>> > >
>> > > Problems:
>> >
>> > snip ...
>> >
>> > > * the last one (propagation-related) is too ugly to live - at the
>> > > very least, its pieces should live in fs/pnode.c; exposing
>> > > propagate_next() is simply wrong. I hadn't picked that one at all,
>> > > and I would suggest coordinating anything in that area with
>> > > ebiederman - he has some work around fs/pnode.c and you risk
>> > > stepping on his toes.
>> >
>> > The earlier patches seem to be ok now, so how about we talk a little
>> > about this last one.
>> >
>> > Eric, Al mentioned that you are working with fs/pnode.c and
>> > recommended I co-ordinate with you.
>> >
>> > So is my working on this (which is most likely going to live in
>> > pnode.c if I can get something acceptable) going to cause
>> > complications for you? Is what you're doing at a point where it
>> > would be worth doing as Al suggests?
>> >
>> > Anyway, the problem that this patch is supposed to solve is to check
>> > if any of the list of mnt_mounts, or any of the mounts propagated
>> > from each, are in use.
>> >
>> > One obvious problem with it is the propagation could be very large.
>> >
>> > But now I look at it again there's no reason to have to visit every
>> > tree, because if one tree is busy then the set of trees is busy. But
>> > every tree would be visited if it's not busy, so it's perhaps still
>> > a problem.
>> >
>> > The difficult thing is working out if a tree is busy, more so because
>> > there could be a struct path holding references within any of the
>> > trees, so I don't know of a simpler, more efficient way to check for
>> > this.
>>
>> So coordination seems to be in order. Not so much because of your
>> current implementation (which won't tell you what you want to know)
>
> Umm ... ok, I've missed something then, because the patch I posted does
> appear to get the calculation right. Perhaps I've missed a case where
> it gets it wrong ....
>
> But now I'm looking deeper into it I'm beginning to wonder why it
> worked for one of the cases. Maybe I need to do some more tests with
> even more printks.

So the case that I am concerned about is that if someone happens to do a
mount --bind, the mount propagation trees would not just be mirror images
in single filesystems but have leaves in odd places.

Consider .../base/one/two (where one and two are automounts) and base is
the shared mountpoint that is propagated between namespaces.

If someone does mount --bind .../base/one ../somepath/one in any namespace
then there will be places where an unmount of "two" will propagate to, but
that an unmount of "one" won't, as their respective propagation trees are
different.

Not accounting for different mount points having different propagation
trees is what I meant when I said your code was wrong.

>> but because an implementation that tells you what you are looking for
>> has really bad (worse than O(N^2)) complexity walking the propagation
>> tree in the worst case.
>
> Yes, it is bad indeed.
>
>> To be correct your code would need to use next_mnt and walk the tree
>> and on each mount use propagate_mnt_busy to see if that entry can be
>> unmounted. Possibly with small variations to match your case. Mounts
>> in one part of the tree may propagate to places that mounts in other
>> parts of the tree will not.
>
> Right, the point there being use propagate_mnt_busy() for the check.

Or a variant of propagate_mnt_busy.

> I tried to use that when I started this but had trouble, I'll have
> another look at using it.
>
> I thought the reason I couldn't use propagate_mnt_busy() is because it
> will see any mount with submounts as busy, and that's not what's
> needed. It's mounts below each of the mounts having submounts that make
> them busy in my case.

If you have the count of submounts passed into propagate_mnt_busy() it
will come very close to what you need. It is still possible to have false
positives in the face of propagation slaves that have had some of your
autofs mounts unmounted, and something else mounted upon them.

If all you need is a better approximation, something like your current
code with comments about its limitations might even be reasonable.

> You probably don't want to read about this but, unfortunately, an
> example is possibly the best way to describe what I need to achieve.

[snip good example]

>> > Anyone have any suggestions at all?
>>
>> In the short term I recommend just not doing this, but if you want to
>> see if you can find an efficient algorithm with me, I am happy to
>> bring you up to speed on all of the gory details.
>
> I need to do this to make autofs play better in the presence of
> namespaces and that's a problem now and is long overdue to be resolved.
>
> So I'm happy to do whatever I can to work toward that.
>
> I could leave this part of the implementation until later, because the
> false positive this solves leads to an expire fail and that is (must
> be) handled properly by the user space daemon.
>
> But when I first noticed the false positive the expire failure was
> handled ok by the daemon (AFAICS), but there was what looked like a
> reference counting problem (but not quite that either, odd) which I was
> going to return to later.
>
> Maybe I've stumbled onto a subtle propagation bug that really needs
> fixing ....

MNT_DETACH has worst case performance issues when a tree of mounts is
unmounted that would really be nice to fix, as it can propagate pretty
horribly. You are trying to do something very similar to MNT_DETACH so an
efficient solution for one is likely an efficient solution for the other.

Eric
On Thu, 2016-12-08 at 17:28 +1300, Eric W. Biederman wrote:
> Ian Kent <raven@themaw.net> writes:
>
> > On Thu, 2016-12-08 at 10:30 +1300, Eric W. Biederman wrote:
> > > Ian Kent <raven@themaw.net> writes:
> > >
> > > > On Sat, 2016-12-03 at 05:13 +0000, Al Viro wrote:
> > > > > FWIW, I've folded that pile into vfs.git#work.autofs.
> > > > >
> > > > > Problems:
> > > >
> > > > snip ...
> > > >
> > > > > * the last one (propagation-related) is too ugly to live - at
> > > > > the very least, its pieces should live in fs/pnode.c; exposing
> > > > > propagate_next() is simply wrong. I hadn't picked that one at
> > > > > all, and I would suggest coordinating anything in that area
> > > > > with ebiederman - he has some work around fs/pnode.c and you
> > > > > risk stepping on his toes.
> > > >
> > > > The earlier patches seem to be ok now, so how about we talk a
> > > > little about this last one.
> > > >
> > > > Eric, Al mentioned that you are working with fs/pnode.c and
> > > > recommended I co-ordinate with you.
> > > >
> > > > So is my working on this (which is most likely going to live in
> > > > pnode.c if I can get something acceptable) going to cause
> > > > complications for you? Is what you're doing at a point where it
> > > > would be worth doing as Al suggests?
> > > >
> > > > Anyway, the problem that this patch is supposed to solve is to
> > > > check if any of the list of mnt_mounts, or any of the mounts
> > > > propagated from each, are in use.
> > > >
> > > > One obvious problem with it is the propagation could be very
> > > > large.
> > > >
> > > > But now I look at it again there's no reason to have to visit
> > > > every tree, because if one tree is busy then the set of trees is
> > > > busy. But every tree would be visited if it's not busy, so it's
> > > > perhaps still a problem.
> > > >
> > > > The difficult thing is working out if a tree is busy, more so
> > > > because there could be a struct path holding references within
> > > > any of the trees, so I don't know of a simpler, more efficient
> > > > way to check for this.
> > >
> > > So coordination seems to be in order. Not so much because of your
> > > current implementation (which won't tell you what you want to know)
> >
> > Umm ... ok, I've missed something then, because the patch I posted
> > does appear to get the calculation right. Perhaps I've missed a case
> > where it gets it wrong ....
> >
> > But now I'm looking deeper into it I'm beginning to wonder why it
> > worked for one of the cases. Maybe I need to do some more tests with
> > even more printks.
>
> So the case that I am concerned about is that if someone happens to do
> a mount --bind, the mount propagation trees would not just be mirror
> images in single filesystems but have leaves in odd places.
>
> Consider .../base/one/two (where one and two are automounts) and base
> is the shared mountpoint that is propagated between namespaces.
>
> If someone does mount --bind .../base/one ../somepath/one in any
> namespace then there will be places where an unmount of "two" will
> propagate to, but that an unmount of "one" won't, as their respective
> propagation trees are different.
>
> Not accounting for different mount points having different propagation
> trees is what I meant when I said your code was wrong.

Yeah, it's all too easy for me to miss important cases like this because
such things are not allowed (or rather not supported, since it can be
done) within automount managed directories. Even without propagation it
can be problematic.

But you're right, that's no excuse for a VFS function to not cater for, or
at least handle sensibly, cases like this.

> > > but because an implementation that tells you what you are looking
> > > for has really bad (worse than O(N^2)) complexity walking the
> > > propagation tree in the worst case.
> >
> > Yes, it is bad indeed.
> >
> > > To be correct your code would need to use next_mnt and walk the
> > > tree and on each mount use propagate_mnt_busy to see if that entry
> > > can be unmounted. Possibly with small variations to match your
> > > case. Mounts in one part of the tree may propagate to places that
> > > mounts in other parts of the tree will not.
> >
> > Right, the point there being use propagate_mnt_busy() for the check.
>
> Or a variant of propagate_mnt_busy.
>
> > I tried to use that when I started this but had trouble, I'll have
> > another look at using it.
> >
> > I thought the reason I couldn't use propagate_mnt_busy() is because
> > it will see any mount with submounts as busy, and that's not what's
> > needed. It's mounts below each of the mounts having submounts that
> > make them busy in my case.
>
> If you have the count of submounts passed into propagate_mnt_busy() it
> will come very close to what you need. It is still possible to have
> false positives in the face of propagation slaves that have had some of
> your autofs mounts unmounted, and something else mounted upon them.

That shouldn't be done within an automount managed mount, even though
there's no way for the autofs module to veto such mounts and the daemon
would probably handle a good number of cases of it.

The justification being that automount managed mounts are defined by maps
that refer to specific paths which are specifically what the administrator
wishes to provide. So, strictly speaking, passing an autofs mount point to
a container with a volume option should use the same path as the automount
managed directory; although that isn't (and can't be) enforced and works
ok, it's wrong in principle.

Still, this is also an aside from the fact that the cases you cite must be
handled by general kernel code.

> If all you need is a better approximation, something like your current
> code with comments about its limitations might even be reasonable.

Maybe, but I worry about the propagation list becoming very large, which
is likely for some larger autofs users. So I really do need to worry about
these cases. I was aware of this problem with the initial patch but really
needed a starting point for discussion.

> > You probably don't want to read about this but, unfortunately, an
> > example is possibly the best way to describe what I need to achieve.
>
> [snip good example]
>
> > > > Anyone have any suggestions at all?
> > >
> > > In the short term I recommend just not doing this, but if you want
> > > to see if you can find an efficient algorithm with me, I am happy
> > > to bring you up to speed on all of the gory details.
> >
> > I need to do this to make autofs play better in the presence of
> > namespaces and that's a problem now and is long overdue to be
> > resolved.
> >
> > So I'm happy to do whatever I can to work toward that.
> >
> > I could leave this part of the implementation until later, because
> > the false positive this solves leads to an expire fail and that is
> > (must be) handled properly by the user space daemon.
> >
> > But when I first noticed the false positive the expire failure was
> > handled ok by the daemon (AFAICS), but there was what looked like a
> > reference counting problem (but not quite that either, odd) which I
> > was going to return to later.
> >
> > Maybe I've stumbled onto a subtle propagation bug that really needs
> > fixing ....
>
> MNT_DETACH has worst case performance issues when a tree of mounts is
> unmounted that would really be nice to fix, as it can propagate pretty
> horribly. You are trying to do something very similar to MNT_DETACH so
> an efficient solution for one is likely an efficient solution for the
> other.

Let me digest that for a while.

Thanks
Ian
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index c4df881..dd2ea5d 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -437,13 +437,8 @@ static int autofs4_d_manage(struct path *path, bool rcu_walk)
 
 	/* The daemon never waits. */
 	if (autofs4_oz_mode(sbi)) {
-		if (rcu_walk) {
-			if (!path_is_mountpoint_rcu(path))
-				return -EISDIR;
-		} else {
-			if (!path_is_mountpoint(path))
-				return -EISDIR;
-		}
+		if (!path_is_mountpoint(path))
+			return -EISDIR;
 		return 0;
 	}
 
@@ -471,7 +466,7 @@ static int autofs4_d_manage(struct path *path, bool rcu_walk)
 
 		if (ino->flags & AUTOFS_INF_WANT_EXPIRE)
 			return 0;
-		if (path_is_mountpoint_rcu(path))
+		if (path_is_mountpoint(path))
 			return 0;
 		inode = d_inode_rcu(dentry);
 		if (inode && S_ISLNK(inode->i_mode))
diff --git a/fs/namespace.c b/fs/namespace.c
index 79473ee..da1cd87 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1160,12 +1160,23 @@ struct vfsmount *mntget(struct vfsmount *mnt)
 }
 EXPORT_SYMBOL(mntget);
 
-static bool __path_is_mountpoint(struct path *path)
+/* __path_is_mountpoint() - Check if path is a mount in the current
+ * namespace.
+ *
+ * d_mountpoint() can only be used reliably to establish if a dentry is
+ * not mounted in any namespace and that common case is handled inline.
+ * d_mountpoint() isn't aware of the possibility there may be multiple
+ * mounts using a given dentry in a different namespace. This function
+ * checks if the passed in path is a mountpoint rather than the dentry
+ * alone.
+ */
+bool __path_is_mountpoint(const struct path *path)
 {
 	struct mount *mount;
 	struct vfsmount *mnt;
 	unsigned seq;
 
+	rcu_read_lock();
 	do {
 		seq = read_seqbegin(&mount_lock);
 		mount = __lookup_mnt(path->mnt, path->dentry);
@@ -1173,35 +1184,11 @@ static bool __path_is_mountpoint(struct path *path)
 	} while (mnt &&
 		 !(mnt->mnt_flags & MNT_SYNC_UMOUNT) &&
 		 read_seqretry(&mount_lock, seq));
-
-	return mnt != NULL;
-}
-
-/* Check if path is a mount in current namespace */
-bool path_is_mountpoint(struct path *path)
-{
-	bool res;
-
-	if (!d_mountpoint(path->dentry))
-		return false;
-
-	rcu_read_lock();
-	res = __path_is_mountpoint(path);
 	rcu_read_unlock();
 
-	return res;
-}
-EXPORT_SYMBOL(path_is_mountpoint);
-
-/* Check if path is a mount in current namespace */
-bool path_is_mountpoint_rcu(struct path *path)
-{
-	if (!d_mountpoint(path->dentry))
-		return false;
-
-	return __path_is_mountpoint(path);
+	return mnt != NULL;
 }
-EXPORT_SYMBOL(path_is_mountpoint_rcu);
+EXPORT_SYMBOL(__path_is_mountpoint);
 
 struct vfsmount *mnt_clone_internal(struct path *path)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 03a5a39..83de8b6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2095,8 +2095,6 @@ extern int vfs_ustat(dev_t, struct kstatfs *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
 extern bool our_mnt(struct vfsmount *mnt);
-extern bool path_is_mountpoint(struct path *);
-extern bool path_is_mountpoint_rcu(struct path *);
 
 extern int current_umask(void);
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 1172cce..42dc62b 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -15,6 +15,8 @@
 #include <linux/spinlock.h>
 #include <linux/seqlock.h>
 #include <linux/atomic.h>
+#include <linux/path.h>
+#include <linux/dcache.h>
 
 struct super_block;
 struct vfsmount;
@@ -98,4 +100,13 @@ extern dev_t name_to_dev_t(const char *name);
 
 extern unsigned int sysctl_mount_max;
 
+extern bool __path_is_mountpoint(const struct path *path);
+static inline bool path_is_mountpoint(const struct path *path)
+{
+	if (!d_mountpoint(path->dentry))
+		return 0;
+
+	return __path_is_mountpoint(path);
+}
+
 #endif /* _LINUX_MOUNT_H */