
[1/7] vfs - merge path_is_mountpoint() and path_is_mountpoint_rcu()

Message ID 148029910861.27779.4517883721395202453.stgit@pluto.themaw.net (mailing list archive)
State New, archived

Commit Message

Ian Kent Nov. 28, 2016, 2:11 a.m. UTC
From: Ian Kent <ikent@redhat.com>

Forgetting that the rcu lock allows nesting, I added a superfluous rcu
version of path_is_mountpoint().

Merge it with the rcu version, make the common case (d_mountpoint()
returning false) inline, and make the path parameter const.

Also move the function definition to include/linux/mount.h as it
seems a more sensible place for it.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
---
 fs/autofs4/root.c     |   11 +++--------
 fs/namespace.c        |   41 ++++++++++++++---------------------------
 include/linux/fs.h    |    2 --
 include/linux/mount.h |   11 +++++++++++
 4 files changed, 28 insertions(+), 37 deletions(-)
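
A small aside on the "rcu lock allows nesting" point above: rcu_read_lock()
read-side critical sections nest, which is why a single helper that takes the
RCU read lock internally is safe to call from both ref-walk and rcu-walk
callers, and why the separate _rcu variant was superfluous. A minimal
illustration of just the nesting property (not taken from the patch; the
function names are made up):

#include <linux/rcupdate.h>

static bool helper(void)
{
	bool ret;

	rcu_read_lock();	/* nests harmlessly if the caller already holds it */
	ret = true;		/* a lockless lookup would go here */
	rcu_read_unlock();
	return ret;
}

static bool caller_already_under_rcu(void)
{
	bool ret;

	rcu_read_lock();	/* outer read-side critical section, e.g. rcu-walk */
	ret = helper();		/* inner rcu_read_lock() only bumps the nesting depth */
	rcu_read_unlock();
	return ret;
}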



Comments

Andrew Morton Nov. 30, 2016, 10:22 p.m. UTC | #1
So as far as I can tell, this patch series is intended to
address Al's review comments against the
http://lkml.kernel.org/r/20161011053352.27645.83962.stgit@pluto.themaw.net
series?
Ian Kent Dec. 1, 2016, 12:13 a.m. UTC | #2
On Wed, 2016-11-30 at 14:22 -0800, Andrew Morton wrote:
> So a far as I can tell, this patch series is intended to
> address Al's review comments against the
> http://lkml.kernel.org/r/20161011053352.27645.83962.stgit@pluto.themaw.net
> series?

That's right, and also to fix an additional problem I found in subsequent
testing (the may_umount_tree() changes).

Ian
Al Viro Dec. 3, 2016, 5:13 a.m. UTC | #3
FWIW, I've folded that pile into vfs.git#work.autofs.

Problems:
	* (fixed) __path_is_mountpoint() should _not_ treat NULL from
__lookup_mnt() as "nothing's mounted there" until it has checked
that mount_lock hadn't been touched - mount --move on something unrelated
can race with lockless hash lookup and lead to false negatives.
	* linux/mount.h might be the wrong place for path_is_mountpoint().
Or it shouldn't be inlined.  I don't like the includes you've added there.
	* path_has_submounts() is broken.  At the very least, it's
AB-BA between mount_lock and rename_lock.  I would suggest trying to
put read_seqlock_excl(&mount_lock) around the call of d_walk() in there,
and using __lookup_mnt() in the callback (without retries on the mount_lock,
of course - read_seqlock_excl done on the outside is enough).  I'm not sure
if it won't cause trouble with contention, though; that needs testing.  As
it is, that function is broken in #work.autofs, same as it is in -mm and
-next.
	* the last one (propagation-related) is too ugly to live - at the
very least, its pieces should live in fs/pnode.c; exposing propagate_next()
is simply wrong.  I hadn't picked that one at all, and I would suggest
coordinating anything in that area with ebiederman - he has some work
around fs/pnode.c and you risk stepping on his toes.
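
For reference, a rough sketch of the shape being suggested for
path_has_submounts(): take mount_lock once with read_seqlock_excl() around
d_walk(), and do a plain __lookup_mnt() in the callback with no seqcount retry
loop. The struct and callback names and the exact d_walk() signature here are
assumptions for illustration, not necessarily what ended up in
vfs.git#work.autofs:

/* assumes fs/dcache.c context, where d_walk() and mount_lock are visible */
struct check_mount {
	struct vfsmount *mnt;
	unsigned int mounted;
};

static enum d_walk_ret path_check_mount(void *data, struct dentry *dentry)
{
	struct check_mount *info = data;

	if (likely(!d_mountpoint(dentry)))
		return D_WALK_CONTINUE;
	/*
	 * The caller holds mount_lock via read_seqlock_excl(), so a single
	 * lockless hash lookup is stable here - no retry loop needed.
	 */
	if (__lookup_mnt(info->mnt, dentry)) {
		info->mounted = 1;
		return D_WALK_QUIT;
	}
	return D_WALK_CONTINUE;
}

bool path_has_submounts(const struct path *parent)
{
	struct check_mount data = { .mnt = parent->mnt, .mounted = 0 };

	read_seqlock_excl(&mount_lock);	/* taken once, outside d_walk()'s rename_lock */
	d_walk(parent->dentry, &data, path_check_mount, NULL);
	read_sequnlock_excl(&mount_lock);

	return data.mounted;
}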
Al Viro Dec. 3, 2016, 11:29 p.m. UTC | #4
On Sat, Dec 03, 2016 at 05:13:22AM +0000, Al Viro wrote:
> 	* path_has_submounts() is broken.  At the very least, it's
> AB-BA between mount_lock and rename_lock.  I would suggest trying to
> put read_seqlock_excl(&mount_lock) around the call of d_walk() in there,
> and using __lookup_mnt() in the callback (without retries on the mount_lock,
> of course - read_seqlock_excl done on the outside is enough).  I'm not sure
> if it won't cause trouble with contention, though; that needs testing.  As
> it is, that function is broken in #work.autofs, same as it is in -mm and
> -next.

	Fix for path_has_submounts() (as above) force-pushed.  It does
need testing and profiling, obviously.
Ian Kent Dec. 4, 2016, 2:07 a.m. UTC | #5
On Sat, 2016-12-03 at 05:13 +0000, Al Viro wrote:
> 	FWIW, I've folded that pile into vfs.git#work.autofs.
> 
> Problems:
> 	* (fixed) __path_is_mountpoint() should _not_ treat NULL from
> __lookup_mnt() as "nothing's mounted there" until it has checked
> that mount_lock hadn't been touched - mount --move on something unrelated
> can race with lockless hash lookup and lead to false negatives.

Right, looking at what you've done there, the mistake is so obvious now!
Thanks for helping with it.

> 	* linux/mount.h might be the wrong place for path_is_mountpoint().
> Or it shouldn't be inlined.  I don't like the includes you've added there.
> 	* path_has_submounts() is broken.  At the very least, it's
> AB-BA between mount_lock and rename_lock.  I would suggest trying to
> put read_seqlock_excl(&mount_lock) around the call of d_walk() in there,
> and using __lookup_mnt() in the callback (without retries on the mount_lock,
> of course - read_seqlock_excl done on the outside is enough).  I'm not sure
> if it won't cause trouble with contention, though; that needs testing.  As
> it is, that function is broken in #work.autofs, same as it is in -mm and
> -next.

Umm ... that's a much more obvious dumb mistake, and what you've done there
didn't occur to me even after you spelled it out, so I'll take some time to
digest it. And thanks for that one too.

Ian
Ian Kent Dec. 4, 2016, 2:18 a.m. UTC | #6
On Sat, 2016-12-03 at 23:29 +0000, Al Viro wrote:
> On Sat, Dec 03, 2016 at 05:13:22AM +0000, Al Viro wrote:
> > 
> > 	* path_has_submounts() is broken.  At the very least, it's
> > AB-BA between mount_lock and rename_lock.  I would suggest trying to
> > put read_seqlock_excl(&mount_lock) around the call of d_walk() in there,
> > and using __lookup_mnt() in the callback (without retries on the mount_lock,
> > of course - read_seqlock_excl done on the outside is enough).  I'm not sure
> > if it won't cause trouble with contention, though; that needs testing.  As
> > it is, that function is broken in #work.autofs, same as it is in -mm and
> > -next.
> 	Fix for path_has_submounts() (as above) force-pushed.  It does
> need testing and profiling, obviously.

I'll run my tests against it and re-run with oprofile if all goes well.

The submount-test I use should show contention if it's a problem, but I'm not
sure the number of mounts used will be sufficient to show up scaling problems.

Basically each case of the test (there are two) runs for 100 iterations using
10 processes, with timing set to promote expire-to-mount contention, mainly to
test for expire-to-mount races.

If I don't see contention I might need to analyse further whether the test has
adequate coverage.

Ian
Ian Kent Dec. 5, 2016, 2:50 a.m. UTC | #7
On Sun, 2016-12-04 at 10:18 +0800, Ian Kent wrote:
> On Sat, 2016-12-03 at 23:29 +0000, Al Viro wrote:
> > 
> > On Sat, Dec 03, 2016 at 05:13:22AM +0000, Al Viro wrote:
> > > 
> > > 
> > > 	* path_has_submounts() is broken.  At the very least, it's
> > > AB-BA between mount_lock and rename_lock.  I would suggest trying to
> > > put read_seqlock_excl(&mount_lock) around the call of d_walk() in there,
> > and using __lookup_mnt() in the callback (without retries on the mount_lock,
> > of course - read_seqlock_excl done on the outside is enough).  I'm not sure
> > if it won't cause trouble with contention, though; that needs testing.  As
> > > it is, that function is broken in #work.autofs, same as it is in -mm and
> > > -next.
> > 	Fix for path_has_submounts() (as above) force-pushed.  It does
> > need testing and profiling, obviously.
> I'll run my tests against it and re-run with oprofile if all goes well.
> 
> The submount-test I use should show contention if it's a problem but I'm not
> sure the number of mounts used will be sufficient to show up scale problems.
> 
> Basically each case of the test (there are two) runs for 100 iterations using
> 10 processes with timing set to promote expire to mount contention, mainly to
> test for expire to mount races.
> 
> If I don't see contention I might need to analyse further whether the test has
> adequate coverage.

I have run my usual tests on a build of vfs.git#work.autofs.

Both tests ran without problem multiple times (even though they should be
deterministic, experience has shown they sometimes aren't).

I also did a system-wide oprofile run of an additional run of the submount-test.

I hadn't noticed that the submount-test runs are shorter than I had come to
expect. This might be due to changes to the behaviour of the expire and mount
timing over time.  But I always check that I'm seeing expire and mount behaviour
that should lead to contention during the test run, and that looked as expected.

The run time was about 55 minutes for each of the two cases I test, whereas I
had come to expect a run time of around 70 minutes each. It's been quite a while
since I've actually paid attention to this and a lot has changed in the VFS
since.

It might be due to Neil Brown's changes to satisfy path walks in rcu-walk mode
where possible rather than always dropping into ref-walk mode. I didn't profile
those changes when they were implemented because I didn't notice unexpectedly
different run times when testing them.

I was going to talk about why the autofs usage of path_has_submounts() should
not be problematic, but that would be tedious for everyone. Instead I'll just
say that it is clearly possible to abuse path_has_submounts() by calling it on
mount points that have a large number of directories, but autofs shouldn't do
that and the profile verifies this.

I'm not sure how accurate the profile is (I don't do it very often). There were
about 400000 out of 22000000 samples missed.

And hopefully the report output is readable enough to be useful after posting
....

Anyway, a system wide opreport (such as it is) showed this:

CPU: Intel Haswell microarchitecture, speed 3700 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               app name                 symbol name
3897186  17.6585  libc-2.22.so             operf                    __strcmp_sse2_unaligned
3125714  14.1629  libstdc++.so.6.0.21      operf                    std::_Rb_tree_increment(std::_Rb_tree_node_base const*)
1500262   6.7978  libsqlite3.so.0.8.6      evolution                /usr/lib64/libsqlite3.so.0.8.6
856262    3.8798  operf                    operf                    /usr/bin/operf
749603    3.3965  libglib-2.0.so.0.4600.2  evolution                /usr/lib64/libglib-2.0.so.0.4600.2
560812    2.5411  libdbus-1.so.3.14.8      dbus-daemon              /usr/lib64/libdbus-1.so.3.14.8
378227    1.7138  systemd                  systemd                  /usr/lib/systemd/systemd
301175    1.3647  libcamel-1.2.so.54.0.0   evolution                /usr/lib64/libcamel-1.2.so.54.0.0
223545    1.0129  dbus-daemon              dbus-daemon              /usr/bin/dbus-daemon
187783    0.8509  libc-2.22.so             dbus-daemon              __strcmp_sse2_unaligned
157564    0.7139  libc-2.22.so             evolution                __strcmp_sse2_unaligned
157218    0.7124  libc-2.22.so             evolution                _int_malloc
156089    0.7073  libgobject-2.0.so.0.4600.2 evolution                /usr/lib64/libgobject-2.0.so.0.4600.2
116277    0.5269  libc-2.22.so             evolution                _int_free
116275    0.5269  libxul.so                firefox                  /usr/lib64/firefox/libxul.so
104278    0.4725  libc-2.22.so             evolution                __GI_____strtoull_l_internal
92937     0.4211  libc-2.22.so             ps                       _IO_vfscanf
...
101      4.6e-04  vmlinux-4.9.0-rc7        opendir                  path_is_mountpoint
...
23       1.0e-04  vmlinux-4.9.0-rc7        opendir                  path_has_submounts
...

where opendir is a program the test uses to trigger a mount and do various
checks on the result.

Not sure if this is enough or what's needed; let me know if there's something
else specific I should do.

Ian
Ian Kent Dec. 6, 2016, 9:37 a.m. UTC | #8
On Sat, 2016-12-03 at 05:13 +0000, Al Viro wrote:
> 	FWIW, I've folded that pile into vfs.git#work.autofs.
> 
> Problems:

snip ...

> 	* the last one (propagation-related) is too ugly to live - at the
> very least, its pieces should live in fs/pnode.c; exposing propagate_next()
> is simply wrong.  I hadn't picked that one at all, and I would suggest
> coordinating anything in that area with ebiederman - he has some work
> around fs/pnode.c and you risk stepping on his toes.

The earlier patches seem to be ok now, so how about we talk a little about this
last one.

Eric, Al mentioned that you are working with fs/pnode.c and recommended I
co-ordinate with you.

So is my working on this (which is most likely going to live in pnode.c if I
can get something acceptable) going to cause complications for you?
Is what you're doing at a point where it would be worth doing as Al suggests?

Anyway, the problem that this patch is supposed to solve is to check if any of
the list of mnt_mounts or any of the mounts propagated from each are in use.

One obvious problem with it is the propagation could be very large.

But now I look at it again, there's no reason to have to visit every tree,
because if one tree is busy then the set of trees is busy. But every tree would
be visited if they're not busy, so it's perhaps still a problem.

The difficult thing is working out if a tree is busy, more so because there
could be a struct path holding references within any of the trees, so I don't
know of a simpler, more efficient way to check for this.

Anyone have any suggestions at all?
Ian
Eric W. Biederman Dec. 7, 2016, 9:30 p.m. UTC | #9
Ian Kent <raven@themaw.net> writes:

> On Sat, 2016-12-03 at 05:13 +0000, Al Viro wrote:
>> 	FWIW, I've folded that pile into vfs.git#work.autofs.
>> 
>> Problems:
>
> snip ...
>
>> 	* the last one (propagation-related) is too ugly to live - at the
>> very least, its pieces should live in fs/pnode.c; exposing propagate_next()
>> is simply wrong.  I hadn't picked that one at all, and I would suggest
>> coordinating anything in that area with ebiederman - he has some work
>> around fs/pnode.c and you risk stepping on his toes.
>
> The earlier patches seem to be ok now so how about we talk a little about this
> last one.
>
> Eric, Al mentioned that you are working with fs/pnode.c and recommended I co-
> ordinate with you.
>
> So is my working on this this (which is most likely going to live in pnode.c if
> I can can get something acceptable) going to cause complications for you?
> Is what your doing at a point were it would be worth doing as Al
> suggests?
>
> Anyway, the problem that this patch is supposed to solve is to check if any of
> the list of mnt_mounts or any of the mounts propagated from each are in use.
>
> One obvious problem with it is the propagation could be very large.
>
> But now I look at it again there's no reason to have to every tree because if
> one tree is busy then the the set of trees is busy. But every tree would be
> visited if the not busy so it's perhaps still a problem.
>
> The difficult thing is working out if a tree is busy, more so because there
> could be a struct path holding references within any the trees so I don't know
> of a simpler, more efficient way to check for this.

So coordination seems to be in order.  Not so much because of your
current implementation (which won't tell you what you want to know)
but because an implementation that tells you what you are looking for
has really bad > O(N^2) complexity walking the propagation tree in
the worst case.

To be correct your code would need to use next_mnt and walk the tree and
on each mount use propagate_mnt_busy to see if that entry can be
unmounted.  Possibly with small variations to match your case.  Mounts
in one part of the tree may propagate to places that mounts in other
parts of the tree will not.
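
A very rough sketch of that walk, just to make the shape concrete - next_mnt()
is currently static to fs/namespace.c, the expected reference count is
illustrative, and the necessary locking (namespace_sem, mount_lock) is glossed
over:

static bool subtree_is_busy(struct mount *root)
{
	struct mount *m;

	for (m = root; m; m = next_mnt(m, root)) {
		/*
		 * propagate_mnt_busy() also checks the peers/slaves that
		 * this particular mount propagates to, so each mount gets
		 * its own propagation check.
		 */
		if (propagate_mnt_busy(m, 2 /* illustrative refcount */))
			return true;
	}
	return false;
}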

I am assuming you are not worried about MNT_DETACH, as that case would
not need may_umount_tree at all.

I am researching how to make walking propagation for multiple mounts
efficient but it is a non-trivial problem.

> Anyone have any suggestions at all?

In the short term I recommend just not doing this, but if you want to
see if you can find an efficient algorithm with me, I am happy to bring
you up to speed on all of the gory details.

Eric


Ian Kent Dec. 8, 2016, 3:29 a.m. UTC | #10
On Thu, 2016-12-08 at 10:30 +1300, Eric W. Biederman wrote:
> Ian Kent <raven@themaw.net> writes:
> 
> > On Sat, 2016-12-03 at 05:13 +0000, Al Viro wrote:
> > > 	FWIW, I've folded that pile into vfs.git#work.autofs.
> > > 
> > > Problems:
> > 
> > snip ...
> > 
> > > 	* the last one (propagation-related) is too ugly to live - at the
> > > very least, its pieces should live in fs/pnode.c; exposing
> > > propagate_next()
> > > is simply wrong.  I hadn't picked that one at all, and I would suggest
> > > coordinating anything in that area with ebiederman - he has some work
> > > around fs/pnode.c and you risk stepping on his toes.
> > 
> > The earlier patches seem to be ok now so how about we talk a little about
> > this
> > last one.
> > 
> > Eric, Al mentioned that you are working with fs/pnode.c and recommended I
> > co-
> > ordinate with you.
> > 
> > So is my working on this this (which is most likely going to live in pnode.c
> > if
> > I can can get something acceptable) going to cause complications for you?
> > Is what your doing at a point were it would be worth doing as Al
> > suggests?
> > 
> > Anyway, the problem that this patch is supposed to solve is to check if any
> > of
> > the list of mnt_mounts or any of the mounts propagated from each are in use.
> > 
> > One obvious problem with it is the propagation could be very large.
> > 
> > But now I look at it again there's no reason to have to every tree because
> > if
> > one tree is busy then the the set of trees is busy. But every tree would be
> > visited if the not busy so it's perhaps still a problem.
> > 
> > The difficult thing is working out if a tree is busy, more so because there
> > could be a struct path holding references within any the trees so I don't
> > know
> > of a simpler, more efficient way to check for this.
> 
> So coordination seems to be in order.  Not so much because of your
> current implementation (which won't tell you what you want to know)

Umm ... ok, I've missed something then, because the patch I posted does appear
to get the calculation right. Perhaps I've missed a case where it gets it wrong
....

But now that I'm looking deeper into it, I'm beginning to wonder why it worked
for one of the cases. Maybe I need to do some more tests with even more printks.

> but because an implementation that tells you what you are looking for
> has really bad > O(N^2) complexity walking the propagation tree in
> the worst case.

Yes it is bad indeed.

> 
> To be correct you code would need to use next_mnt and walk the tree and
> on each mount use propagate_mnt_busy to see if that entry can be
> unmounted.  Possibly with small variations to match your case.  Mounts
> in one part of the tree may propagate to places that mounts in other
> parts of the tree will not.

Right, the point there being to use propagate_mnt_busy() for the check.

I tried to use that when I started this but had trouble; I'll have another look
at using it.

I thought the reason I couldn't use propagate_mnt_busy() is because it will see
any mount with submounts as busy, and that's not what's needed. It's the mounts
below each of the mounts having submounts that make them busy in my case.

You probably don't want to read about this but, unfortunately, an example is
possibly the best way to describe what I need to achieve.

A simple example is the autofs direct mount map entry (a so-called autofs
multi-mount):

/absolute/path/to/direct/mount-point \
    /offset/path/one      server:/export/path/one \
    /offset2/path/two     server2:/export/path/other

The path /absolute/path/to/direct/mount-point is an autofs direct mount trigger
and is essentially an autofs mount.

When the mount is triggered, the offset trigger mounts for each of the offsets,
/offset/path/one and /offset2/path/two, are mounted so they can trigger mounts
independently.

When the direct mount /absolute/path/to/direct/mount-point is expired, the two
offsets need to be umounted if they are not busy, so
/absolute/path/to/direct/mount-point having submounts doesn't necessarily make
it unable to be umounted.

But this is (supposed) to be for only one level of mounts; for example, if I
had:

/absolute/path/to/direct/mount-point \
    /offset/path/one        server:/export/path/one \
    /offset2/path/two       server2:/export/path/other \
    /offset/path/one/nested server3:/export/path3/nested

then the offsets below the nesting point /offset/path/one make it busy and
need to be expired first.

So why would I even need to do this?

It's functionality of Sun autofs maps, and that's the primary map format autofs
supports, so that's why it's needed, not why I do it this way.

Another example of one these multi-mounts is:

/absolute/path/to/direct/mount-point \
    /                     servern:/another/export \
    /offset/path/one      server:/export/path/one \
    /offset2/path/two     server2:/export/path/other

where an NFS mount is done on the root of the offset tree (similar to the case
where nesting is present in the example map entry above).

Once that happens, autofs can't get a callback to mount offsets unless there
are offset triggers mounted to trigger the mounts.

> 
> I am assuming you are not worried about MNT_DETACH, as that case would
> not need may_umount_tree at all.

I'm not. 

> 
> I am researching how to make walking propagation for multiple mounts
> efficient but it is a non-trivial problem.

Sure is.

> 
> > Anyone have any suggestions at all?
> 
> In the short term I recommend just not doing this, but if you want to
> see if you can find an efficient algorithm with me, I am happy to bring
> you up to speed on all of the gory details.

I need to do this to make autofs play better in the presence of namespaces,
and that's a problem now and is long overdue to be resolved.

So I'm happy to do whatever I can to work toward that.

I could leave this part of the implementation until later because the false
positive this solves leads to an expire fail and that is (must be) handled
properly by the user space daemon.

But when I first noticed the false positive, the expire failure was handled ok
by the daemon (AFAICS), but there was what looked like a reference counting
problem (but not quite that either, odd) which I was going to return to later.

Maybe I've stumbled onto a subtle propagation bug that really needs fixing ....

Ian
Eric W. Biederman Dec. 8, 2016, 4:28 a.m. UTC | #11
Ian Kent <raven@themaw.net> writes:

> On Thu, 2016-12-08 at 10:30 +1300, Eric W. Biederman wrote:
>> Ian Kent <raven@themaw.net> writes:
>> 
>> > On Sat, 2016-12-03 at 05:13 +0000, Al Viro wrote:
>> > > 	FWIW, I've folded that pile into vfs.git#work.autofs.
>> > > 
>> > > Problems:
>> > 
>> > snip ...
>> > 
>> > > 	* the last one (propagation-related) is too ugly to live - at the
>> > > very least, its pieces should live in fs/pnode.c; exposing
>> > > propagate_next()
>> > > is simply wrong.  I hadn't picked that one at all, and I would suggest
>> > > coordinating anything in that area with ebiederman - he has some work
>> > > around fs/pnode.c and you risk stepping on his toes.
>> > 
>> > The earlier patches seem to be ok now so how about we talk a little about
>> > this
>> > last one.
>> > 
>> > Eric, Al mentioned that you are working with fs/pnode.c and recommended I
>> > co-
>> > ordinate with you.
>> > 
>> > So is my working on this this (which is most likely going to live in pnode.c
>> > if
>> > I can can get something acceptable) going to cause complications for you?
>> > Is what your doing at a point were it would be worth doing as Al
>> > suggests?
>> > 
>> > Anyway, the problem that this patch is supposed to solve is to check if any
>> > of
>> > the list of mnt_mounts or any of the mounts propagated from each are in use.
>> > 
>> > One obvious problem with it is the propagation could be very large.
>> > 
>> > But now I look at it again there's no reason to have to every tree because
>> > if
>> > one tree is busy then the the set of trees is busy. But every tree would be
>> > visited if the not busy so it's perhaps still a problem.
>> > 
>> > The difficult thing is working out if a tree is busy, more so because there
>> > could be a struct path holding references within any the trees so I don't
>> > know
>> > of a simpler, more efficient way to check for this.
>> 
>> So coordination seems to be in order.  Not so much because of your
>> current implementation (which won't tell you what you want to know)
>
> Umm ... ok I've missed something then because the patch I posted does appear to
> get the calculation right. Perhaps I've missed a case where it gets it wrong
> ....
>
> But now I'm looking deeper into it I'm beginning to wonder why it worked for one
> of the cases. Maybe I need to do some more tests with even more
> printks.

So the case that I am concerned about is that if someone happens to do a
mount --bind, the mount propagation trees would not just be mirror images
in single filesystems but would have leaves in odd places.

Consider .../base/one/two (where one and two are automounts) and base
is the shared mountpoint that is propagated between namespaces.

If someone does mount --bind .../base/one ../somepath/one in any
namespace then there will be places where an unmount of "two" will
propagate to, but to which an unmount of "one" won't, as their respective
propagation trees are different.

Not accounting for different mount points having different propagation
trees is what I meant when I said your code was wrong.

>> but because an implementation that tells you what you are looking for
>> has really bad > O(N^2) complexity walking the propagation tree in
>> the worst case.
>
> Yes it is bad indeed.
>
>> 
>> To be correct you code would need to use next_mnt and walk the tree and
>> on each mount use propagate_mnt_busy to see if that entry can be
>> unmounted.  Possibly with small variations to match your case.  Mounts
>> in one part of the tree may propagate to places that mounts in other
>> parts of the tree will not.
>
> Right, the point there being use propagate_mnt_busy() for the check.

Or a variant of propagate_mnt_busy.

> I tried to use that when I started this but had trouble, I'll have another look
> at using it.
>
> I thought the reason I couldn't use propagate_mnt_busy() is because it will see
> any mount with submounts as busy and that's not what's need. It's mounts below
> each of mounts having submounts that make them busy in my case.

If you have the count of submounts passed into propagate_mnt_busy() it
will come very close to what you need.  It is still possible to have
false positives in the face of propagation slaves that have had some of
your autofs mounts unmounted, and something else mounted upon them.
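
As a sketch of that idea, a hypothetical helper might fold the number of direct
submounts into the reference count handed to propagate_mnt_busy(); the function
name, the baseline count and the (omitted) locking are all assumptions for
illustration:

static bool autofs_tree_busy(struct mount *mnt)
{
	struct mount *child;
	int refs = 2;	/* illustrative baseline: parent reference plus our own */

	/* each direct submount holds one count on its parent */
	list_for_each_entry(child, &mnt->mnt_mounts, mnt_child)
		refs++;

	return propagate_mnt_busy(mnt, refs) != 0;
}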

If all you need is a better approximation, something like your current
code with comments about its limitations might even be reasonable.

> You probably don't want to read about this but, unfortunately, an example is
> possibly the best way to describe what I need to achieve.

[snip good example]

>> > Anyone have any suggestions at all?
>> 
>> In the short term I recommend just not doing this, but if you want to
>> see if you can find an efficient algorithm with me, I am happy to bring
>> you up to speed on all of the gory details.
>
> I need to do this to make autofs play better in the presence of namespaces and
> that's a problem now and is long overdue to be resolved.
>
> So I'm happy to do whatever I can to work toward that.
>
> I could leave this part of the implementation until later because the false
> positive this solves leads to an expire fail and that is (must be) handled
> properly by the user space daemon.
>
> But when I first noticed the false positive the expire failure was handled ok by
> the daemon (AFAICS) but there was what looked like a reference counting problem
> (but not quite that either, odd) which I was going to return to later.
>
> Maybe I've stumbled onto a subtle propagation bug that really needs
> fixing ....

MNT_DETACH has worst-case performance issues when a tree of mounts is
unmounted that would really be nice to fix, as it can propagate pretty
horribly.  You are trying to do something very similar to MNT_DETACH so
an efficient solution for one is likely an efficient solution for the
other.


Eric
Ian Kent Dec. 8, 2016, 6:18 a.m. UTC | #12
On Thu, 2016-12-08 at 17:28 +1300, Eric W. Biederman wrote:
> Ian Kent <raven@themaw.net> writes:
> 
> > On Thu, 2016-12-08 at 10:30 +1300, Eric W. Biederman wrote:
> > > Ian Kent <raven@themaw.net> writes:
> > > 
> > > > On Sat, 2016-12-03 at 05:13 +0000, Al Viro wrote:
> > > > > 	FWIW, I've folded that pile into vfs.git#work.autofs.
> > > > > 
> > > > > Problems:
> > > > 
> > > > snip ...
> > > > 
> > > > > 	* the last one (propagation-related) is too ugly to live - at
> > > > > the
> > > > > very least, its pieces should live in fs/pnode.c; exposing
> > > > > propagate_next()
> > > > > is simply wrong.  I hadn't picked that one at all, and I would suggest
> > > > > coordinating anything in that area with ebiederman - he has some work
> > > > > around fs/pnode.c and you risk stepping on his toes.
> > > > 
> > > > The earlier patches seem to be ok now so how about we talk a little
> > > > about
> > > > this
> > > > last one.
> > > > 
> > > > Eric, Al mentioned that you are working with fs/pnode.c and recommended
> > > > I
> > > > co-
> > > > ordinate with you.
> > > > 
> > > > So is my working on this this (which is most likely going to live in
> > > > pnode.c
> > > > if
> > > > I can can get something acceptable) going to cause complications for
> > > > you?
> > > > Is what your doing at a point were it would be worth doing as Al
> > > > suggests?
> > > > 
> > > > Anyway, the problem that this patch is supposed to solve is to check if
> > > > any
> > > > of
> > > > the list of mnt_mounts or any of the mounts propagated from each are in
> > > > use.
> > > > 
> > > > One obvious problem with it is the propagation could be very large.
> > > > 
> > > > But now I look at it again there's no reason to have to every tree
> > > > because
> > > > if
> > > > one tree is busy then the the set of trees is busy. But every tree would
> > > > be
> > > > visited if the not busy so it's perhaps still a problem.
> > > > 
> > > > The difficult thing is working out if a tree is busy, more so because
> > > > there
> > > > could be a struct path holding references within any the trees so I
> > > > don't
> > > > know
> > > > of a simpler, more efficient way to check for this.
> > > 
> > > So coordination seems to be in order.  Not so much because of your
> > > current implementation (which won't tell you what you want to know)
> > 
> > Umm ... ok I've missed something then because the patch I posted does appear
> > to
> > get the calculation right. Perhaps I've missed a case where it gets it wrong
> > ....
> > 
> > But now I'm looking deeper into it I'm beginning to wonder why it worked for
> > one
> > of the cases. Maybe I need to do some more tests with even more
> > printks.
> 
> So the case that I am concerned about is that if someone happens to do
> mount --bind the mount propagation trees would not just be mirror images
> in single filesystems but have leaves in odd places.
> 
> Consider .../base/one/two (where one and two are automounts) and base
> is the shared mountpoint that is propagated between namespaces.
> 
> If someone does mount --bind .../base/one ../somepath/one in any
> namespace then there will be places where an unmount of "two" will
> propagate to, but that an unmount of "one" won't as their respective
> propagation trees are different.
> 
> Not accounting for different mount points having different propagation
> trees is what I meant when I said your code was wrong.

Yeah, it's all too easy for me to miss important cases like this because such
things are not allowed (or rather not supported, since it can be done) within
automount-managed directories. Even without propagation it can be problematic.

But you're right, that's no excuse for a VFS function to not cater for, or at
least handle sensibly, cases like this.

> 
> > > but because an implementation that tells you what you are looking for
> > > has really bad > O(N^2) complexity walking the propagation tree in
> > > the worst case.
> > 
> > Yes it is bad indeed.
> > 
> > > 
> > > To be correct you code would need to use next_mnt and walk the tree and
> > > on each mount use propagate_mnt_busy to see if that entry can be
> > > unmounted.  Possibly with small variations to match your case.  Mounts
> > > in one part of the tree may propagate to places that mounts in other
> > > parts of the tree will not.
> > 
> > Right, the point there being use propagate_mnt_busy() for the check.
> 
> Or a variant of propagate_mnt_busy.
> 
> > I tried to use that when I started this but had trouble, I'll have another
> > look
> > at using it.
> > 
> > I thought the reason I couldn't use propagate_mnt_busy() is because it will
> > see
> > any mount with submounts as busy and that's not what's need. It's mounts
> > below
> > each of mounts having submounts that make them busy in my case.
> 
> If you have the count of submounts passed into propagate_mnt_busy() it
> will come very close to what you need.  It is still possible to have
> false positives in the face of propagation slaves that have had some of
> your autofs mounts unmounted, and something else mounted upon them.

That shouldn't be done within an automount managed mount even though there's no
way for the autofs module to veto such mounts, and the daemon would probably
handle a good number of cases of it.

The justification being that automount managed mounts are defined by maps that
refer to specific paths, which are specifically what the administrator wishes to
provide.

So, strictly speaking, passing an autofs mount point to a container with a
volume option should use the same path as the automount managed directory;
although it isn't (can't be) enforced and works ok, it's wrong in principle.

Still, this is also an aside from the fact that the cases you cite must be
handled by general kernel code.

> 
> If all you need is a better approximation something like your current
> code with comments about it's limitations might even be reasonable.

Maybe, but I worry about the propagation list becoming very large, which is
likely for some larger autofs users. So I really do need to worry about these
cases.

I was aware of this problem with the initial patch but really needed a starting
point for discussion.

> 
> > You probably don't want to read about this but, unfortunately, an example is
> > possibly the best way to describe what I need to achieve.
> 
> [snip good example]
> 
> > > > Anyone have any suggestions at all?
> > > 
> > > In the short term I recommend just not doing this, but if you want to
> > > see if you can find an efficient algorithm with me, I am happy to bring
> > > you up to speed on all of the gory details.
> > 
> > I need to do this to make autofs play better in the presence of namespaces
> > and
> > that's a problem now and is long overdue to be resolved.
> > 
> > So I'm happy to do whatever I can to work toward that.
> > 
> > I could leave this part of the implementation until later because the false
> > positive this solves leads to an expire fail and that is (must be) handled
> > properly by the user space daemon.
> > 
> > But when I first noticed the false positive the expire failure was handled
> > ok by
> > the daemon (AFAICS) but there was what looked like a reference counting
> > problem
> > (but not quite that either, odd) which I was going to return to later.
> > 
> > Maybe I've stumbled onto a subtle propagation bug that really needs
> > fixing ....
> 
> MNT_DETACH has worst case performance issues when a tree of mounts is
> unmounted that would really be nice to fix, as it can propogate pretty
> horrible.  You are trying to do something very similar to MNT_DETACH so
> an efficient solution for one is likely an efficient solution for the
> other.

Let me digest that for a while.

Thanks
Ian


Patch

diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index c4df881..dd2ea5d 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -437,13 +437,8 @@  static int autofs4_d_manage(struct path *path, bool rcu_walk)
 
 	/* The daemon never waits. */
 	if (autofs4_oz_mode(sbi)) {
-		if (rcu_walk) {
-			if (!path_is_mountpoint_rcu(path))
-				return -EISDIR;
-		} else {
-			if (!path_is_mountpoint(path))
-				return -EISDIR;
-		}
+		if (!path_is_mountpoint(path))
+			return -EISDIR;
 		return 0;
 	}
 
@@ -471,7 +466,7 @@  static int autofs4_d_manage(struct path *path, bool rcu_walk)
 
 		if (ino->flags & AUTOFS_INF_WANT_EXPIRE)
 			return 0;
-		if (path_is_mountpoint_rcu(path))
+		if (path_is_mountpoint(path))
 			return 0;
 		inode = d_inode_rcu(dentry);
 		if (inode && S_ISLNK(inode->i_mode))
diff --git a/fs/namespace.c b/fs/namespace.c
index 79473ee..da1cd87 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1160,12 +1160,23 @@  struct vfsmount *mntget(struct vfsmount *mnt)
 }
 EXPORT_SYMBOL(mntget);
 
-static bool __path_is_mountpoint(struct path *path)
+/* __path_is_mountpoint() - Check if path is a mount in the current
+ *                          namespace.
+ *
+ *  d_mountpoint() can only be used reliably to establish if a dentry is
+ *  not mounted in any namespace and that common case is handled inline.
+ *  d_mountpoint() isn't aware of the possibility there may be multiple
+ *  mounts using a given dentry in a different namespace. This function
+ *  checks if the passed in path is a mountpoint rather than the dentry
+ *  alone.
+ */
+bool __path_is_mountpoint(const struct path *path)
 {
 	struct mount *mount;
 	struct vfsmount *mnt;
 	unsigned seq;
 
+	rcu_read_lock();
 	do {
 		seq = read_seqbegin(&mount_lock);
 		mount = __lookup_mnt(path->mnt, path->dentry);
@@ -1173,35 +1184,11 @@  static bool __path_is_mountpoint(struct path *path)
 	} while (mnt &&
 		 !(mnt->mnt_flags & MNT_SYNC_UMOUNT) &&
 		 read_seqretry(&mount_lock, seq));
-
-	return mnt != NULL;
-}
-
-/* Check if path is a mount in current namespace */
-bool path_is_mountpoint(struct path *path)
-{
-	bool res;
-
-	if (!d_mountpoint(path->dentry))
-		return false;
-
-	rcu_read_lock();
-	res = __path_is_mountpoint(path);
 	rcu_read_unlock();
 
-	return res;
-}
-EXPORT_SYMBOL(path_is_mountpoint);
-
-/* Check if path is a mount in current namespace */
-bool path_is_mountpoint_rcu(struct path *path)
-{
-	if (!d_mountpoint(path->dentry))
-		return false;
-
-	return __path_is_mountpoint(path);
+	return mnt != NULL;
 }
-EXPORT_SYMBOL(path_is_mountpoint_rcu);
+EXPORT_SYMBOL(__path_is_mountpoint);
 
 struct vfsmount *mnt_clone_internal(struct path *path)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 03a5a39..83de8b6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2095,8 +2095,6 @@  extern int vfs_ustat(dev_t, struct kstatfs *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
 extern bool our_mnt(struct vfsmount *mnt);
-extern bool path_is_mountpoint(struct path *);
-extern bool path_is_mountpoint_rcu(struct path *);
 
 extern int current_umask(void);
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 1172cce..42dc62b 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -15,6 +15,8 @@ 
 #include <linux/spinlock.h>
 #include <linux/seqlock.h>
 #include <linux/atomic.h>
+#include <linux/path.h>
+#include <linux/dcache.h>
 
 struct super_block;
 struct vfsmount;
@@ -98,4 +100,13 @@  extern dev_t name_to_dev_t(const char *name);
 
 extern unsigned int sysctl_mount_max;
 
+extern bool __path_is_mountpoint(const struct path *path);
+static inline bool path_is_mountpoint(const struct path *path)
+{
+	if (!d_mountpoint(path->dentry))
+		return 0;
+
+	return __path_is_mountpoint(path);
+}
+
 #endif /* _LINUX_MOUNT_H */