[review,3/8] dcache: Clearly separate the two directory rename cases in d_splice_alias

Message ID	87fv3mjxsc.fsf_-_@x220.int.ebiederm.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: ebiederm@xmission.com (Eric W. Biederman) To: Linux Containers <containers@lists.linux-foundation.org> Cc: linux-fsdevel@vger.kernel.org, Al Viro <viro@ZenIV.linux.org.uk>, Andy Lutomirski <luto@amacapital.net>, "Serge E. Hallyn" <serge@hallyn.com>, Richard Weinberger <richard@nod.at>, Andrey Vagin <avagin@openvz.org>, Jann Horn <jann@thejh.net>, Willy Tarreau <w@1wt.eu>, Omar Sandoval <osandov@osandov.com>, Miklos Szeredi <miklos@szeredi.hu>, Linus Torvalds <torvalds@linux-foundation.org>, "J. Bruce Fields" <bfields@fieldses.org> References: <871tncuaf6.fsf@x220.int.ebiederm.org> <87mw5xq7lt.fsf@x220.int.ebiederm.org> <87a8yqou41.fsf_-_@x220.int.ebiederm.org> <874moq9oyb.fsf_-_@x220.int.ebiederm.org> <871tfkawu9.fsf_-_@x220.int.ebiederm.org> <87egjk9i61.fsf_-_@x220.int.ebiederm.org> <20150810043637.GC14139@ZenIV.linux.org.uk> <877foymrwt.fsf@x220.int.ebiederm.org> <87wpwyjxwc.fsf_-_@x220.int.ebiederm.org> Date: Thu, 13 Aug 2015 23:31:47 -0500 In-Reply-To: <87wpwyjxwc.fsf_-_@x220.int.ebiederm.org> (Eric W. Biederman's message of "Thu, 13 Aug 2015 23:29:23 -0500") Message-ID: <87fv3mjxsc.fsf_-_@x220.int.ebiederm.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain parse: 2.0 (0.1%), extract_message_metadata: 18 (1.1%), get_uri_detail_list: 5 (0.3%), tests_pri_-1000: 7 (0.4%), tests_pri_-950: 1.47 (0.1%), tests_pri_-900: 1.21 (0.1%), tests_pri_-400: 36 (2.2%), check_bayes: 34 (2.2%), b_tokenize: 13 (0.8%), b_tok_get_all: 11 (0.7%), b_comp_prob: 2.9 (0.2%), b_tok_touch_all: 6 (0.3%), b_finish: 0.76 (0.0%), tests_pri_0: 1503 (94.3%), tests_pri_500: 11 (0.7%), rewrite_mail: 0.00 (0.0%) Subject: [PATCH review 3/8] dcache: Clearly separate the two directory rename cases in d_splice_alias Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Eric W. Biederman Aug. 14, 2015, 4:31 a.m. UTC

There are two scenarios that can result in an alias being found in
d_splice_alias.  The first scenario is a disconnected dentry being
looked up and reconnected (common with nfs file handles, and
open_by_handle_at).  The second scenario is a lookup on a directory
lazily discovering that it was renamed on the remote filesystem.

The locking challenge for handling these two scenarios is that
the instance of lookup calling d_splice_alias has already
taken the dentry->d_parent->d_inode->i_mutex.

If the mutex was not held the locking we could just do:
	rename_mutex = &inode->i_sb->s_vfs_rename_mutex;
	mutex_lock(rename_mutex);
	if (d_ancestor(alias, dentry)) {
		mutex_unlock(rename_mutex);
		pr_warn_ratelimited(...);
		return -ELOOP;
	}
	m1 = &dentry->d_parent->d_inode->i_mutex;
	mutex_lock_nested(m1, I_MUTEX_PARENT);
	if (!IS_ROOT(alias) && (alias->d_parent != dentry->d_parent)) {
		m2 = &alias->d_parent->d_inode->i_mutex;
		mutex_lock_nested(m2, I_MUTEX_PARENT2);
	}
	d_move(alias, dentry);
	if (m2)
		mutex_unlock(m2);
	mutex_unlock(m1);
	mutex_unlock(rename_mutex);

Which is essentially lock_rename and unlock_reaname with added
handling of the IS_ROOT case.

The problem is that as the lookup locking stands today grabbing the
s_vfs_rename_mutex must use mutex_trylock which can fail, so for reliability
reasons we need to avoid using the rename_mutex as much as possible.

For the case where a disconnected dentry is being connected (aka the
IS_ROOT case) the d_ancestor test can be placed under the rename_lock
removing any chance that taking locks will cause a failure.

For the lazy rename case things are trickier because when
dentry->d_parent and alias->d_parent are not equal the code need to
take the s_vfs_rename_mutex, or risk the d_ancestor call in
lock_rename being inaccurate.  As s_vfs_rename_mutex is take first
there are no locking ordering issues with taking
alias->d_parent->d_inode->i_mutex.  Furthermore as games with
rename_lock are unnecessary d_move instead of __d_move can be called.

Sleeping in d_splice_alias is not something new as
security_d_instantiate is a sleeping function.

Compared to the current implementation this change introduces a
function for each case __d_rename_alias and __d_connect_alias and
moves the acquisition of rename_lock down into those functions.

In the case of __d_rename_alias which was __d_unalias this allows the
existing lock ordering rules to be followed much more closely removing
the need for a trylock when acquiring
alias->d_parent->d_inode->i_mutex, and allowing the use of d_move.

A common helper that prints the warning message when a loop is
detected is factored out into d_alias_is_ancestor so that code does not
need to be duplicated in both cases.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/dcache.c | 111 ++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 67 insertions(+), 44 deletions(-)

Al Viro Aug. 15, 2015, 6:16 a.m. UTC | #1

On Thu, Aug 13, 2015 at 11:31:47PM -0500, Eric W. Biederman wrote:

> The problem is that as the lookup locking stands today grabbing the
> s_vfs_rename_mutex must use mutex_trylock which can fail, so for reliability
> reasons we need to avoid using the rename_mutex as much as possible.

I really don't like it.  For one thing, you *still* are taking it - have to,
really.  So this argument is moot anyway.  ESTALE can happen here.  For
another, I'm not convinced that this "we don't need no stinkin'
extra locks for attaching a detached subtree" is correct.  E.g. what's
to protect that IS_ROOT(new) from changing right under you?

Consider a corrupted filesystem (or a bogus server, or a sufficiently
unpleasant race with another client, etc.)

You have a detached subtree with lookups from *TWO* directories trying to
attach its root.  Now what?  ->i_mutex on either prospective parent won't
give a damn thing - different inodes.  The current tree would've serialized
those on rename_lock and treated that as attach a detached followed by
relocate a misplaced.  With your change we get a race in there.

This place is subtle and nasty, and we had rather nasty races there.
Repeatedly.  Any non-trivial locking changes in that area should go
separately from everything else and only with an accurate analysis of
those changes.  It's one of the easiest places to fuck up in.  Been
there, done that, and so had many other people.

I'm not saying that it wouldn't benefit from cleaner locking - it sure
as hell would.  But it's in a really incestous relationship with a lot
of other pieces, both in fs/dcache.c and elsewhere.  Let's not mix that
into anything else - driveby cleanups in that place are very likely to
cause serious trouble.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Aug. 15, 2015, 6:25 p.m. UTC | #2

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Thu, Aug 13, 2015 at 11:31:47PM -0500, Eric W. Biederman wrote:
>
>> The problem is that as the lookup locking stands today grabbing the
>> s_vfs_rename_mutex must use mutex_trylock which can fail, so for reliability
>> reasons we need to avoid using the rename_mutex as much as possible.
>
> I really don't like it.  For one thing, you *still* are taking it - have to,
> really.  So this argument is moot anyway.  

Not moot.  We don't take s_vfs_rename_mutex when we are connecting a
disconnected dentry alias.  If d_splice_alias could sleep and take
s_vfs_rename_mutex then we could have a single path through the code.

Unfortunately the code can't sleep when taking s_vfs_rename_mutex so
attempting to take s_vfs_rename_mutex for both paths will introduce
unnecessary -ESTALE failures into d_splice_alias.

> ESTALE can happen here.  For
> another, I'm not convinced that this "we don't need no stinkin'
> extra locks for attaching a detached subtree" is correct.  E.g. what's
> to protect that IS_ROOT(new) from changing right under you?

You are quite correct that I missed that nothing protects the result of
IS_ROOT(new).   So my change does introduce a case where we don't
hold the appropriate inode mutexes when renaming a dentry and that
introduces races elsewhere in the code so it is not acceptable.

But it is true that we only take rename_lock and don't take any
additional mutex when connecting a disconnected dentry.  Aka "we don't
need no stinkin' extra locks".  We clearly can not take the
new->d_parent->d_inode->i_mutex when IS_ROOT(new) as that is
meaningless.  Further I do not see a point in taking s_vfs_rename_mutex
in that case.

Not for this round but if you can see any reason why our not taking
s_vfs_rename_mutex when connecting disconnected dentries is wrong
and we need to take it and risk -ESTALE.  I would love to know because I
would love to clean up that mess.

> I'm not saying that it wouldn't benefit from cleaner locking - it sure
> as hell would.  But it's in a really incestous relationship with a lot
> of other pieces, both in fs/dcache.c and elsewhere.  Let's not mix that
> into anything else - driveby cleanups in that place are very likely to
> cause serious trouble.

Fair enough. 

I do have to touch d_splice_alias and change the locking, so
unfortunately I have to do some of the nasty locking analysis anyway.

I am keeping the i_lock cleanup because it is trivial and removes the
need to figure out if there is any existing ordering between i_lock and
mount_lock, and if taking mount_lock could induce a deadlock.  That
audit would not be trivial.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Aug. 15, 2015, 6:35 p.m. UTC | #3

It is possible in some situations to rename a file or directory through
one mount point such that it can start out inside of a bind mount and
after the rename wind up outside of the bind mount. Unfortunately with
user namespaces these conditions can be trivially created by creating a
bind mount under an existing bind mount.

I have identified four situations in which this may be a problem.
- __d_path and d_absolute_path need to error on disconnected paths
that can not reach some root directory or lsm path based security
checks can incorrectly succeed.

- Normal path name resolution following .. can lead to a directory
that is outside of the original loopback mount.

- file handle reconsititution aka exportfs_decode_fh can yield a dentry
from which d_parent can be followed up to mnt->sb->s_root, but
d_parent can not be followed up to mnt->mnt_root.

- Mounts on a path that has been renamed outside of a loopback mount
become unreachable, as there is no possible path that can be passed
to umount to unmount them.

My strategy:

o File handle reconsitituion problems can be prevented by enabling
the nfsd subtree checks for nfs exports, and open_by_handle_at
requires capable(CAP_DAC_READ_SEARCH) so is only usable by the global
root. This makes any problems difficult if not impossible to exploit
in practice so I have not yet written code to address that issue.

o The functions __d_path and d_absolute_path are agumented so that the
security modules will not be fed a problematic path to work with.

o Following of .. has been agumented to test that after d_parent has
been resolved the original directory is connected, and if not
an error of -ENOENT is returned.

o I do not worry about mounts that are disconnected from their bind
mount as these mounts can always be freed by either umount -l on
the bind mount they have escaped from, or by freeing the mount
namespace. So I do not believe there is an actual problem.

Pathname resolution is a common fast path and most of the code in this
patchset to support keeping .. from becoming expensive in the common
case.

After hearing the Al's feedback and running some numbers I have given
up attempting to keeping the number of d_ancestor calls during pathname
resolution to an absolute minimum. It appears that simply preventing
calls d_ancestor unless a directory has escaped is good enough. This
change in approach has significantly simplified the code.

The implementation change this round is I have dropped my patch cleaning
up d_splice_alias. Al Viro found a race that makes the technique I was
using fundamentally racy. I now have d_splice_alias taking mount_lock
around rename_lock. Since I don't have to sleep in d_splice_alias
change is minimal and sufficient for this purpose.

Barring some other idiocy I think this will be the final version of this
patchset.

These changes are all against v4.2-rc1.

For those who like to see everything in a single tree the code is at:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

Eric W. Biederman (7):
dcache: Handle escaped paths in prepend_path
dcache: Reduce the scope of i_lock in d_splice_alias
mnt: Track which mounts use a dentry as root.
dcache: Implement d_common_ancestor
dcache: Only read d_flags once in d_is_dir
mnt: Track when a directory escapes a bind mount
vfs: Test for and handle paths that are unreachable from their mnt_root

fs/dcache.c | 91 +++++++++++++++++++++++++++--
fs/mount.h | 9 +++
fs/namei.c | 26 ++++++++-
fs/namespace.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/dcache.h | 11 +++-
include/linux/mount.h | 1 +
6 files changed, 279 insertions(+), 11 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Linus Torvalds Aug. 15, 2015, 7:36 p.m. UTC | #4

On Sat, Aug 15, 2015 at 11:35 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> The implementation change this round is I have dropped my patch cleaning
> up d_splice_alias.  Al Viro found a race that makes the technique I was
> using fundamentally racy.  I now have d_splice_alias taking mount_lock
> around rename_lock.  Since I don't have to sleep in d_splice_alias
> change is minimal and sufficient for this purpose.

Quite frankly, I hate this series.

Not all of it. Patches 1,2 and 5 look fine to me. But 3,4,6 and 7 I'm
not happy with. Particularly patch 6.

It just smells like a bad hack to me. My gut feel says that it's all
wrong. It doesn't feel clean or right.

I'd much rather make ".." handling more expensive than add and
maintain that MNT_DIR_ESCAPED flag.  My gut feel is that yes, we
should look seriously at making ".." much smarter (so I don't object
to the concept of patch 7/7 at all), but I think we should strive to
look at that ".." handling *without* adding the crufty odd special
bind mount crud.

Put another way: the whole "escape bind mount" thing is not at all a
new issue, it smells very much like the very traditional Unix "escape
chroot" thing. And I detest how this adds magic rules for bind mounts,
when it feels like a much more generic issue.

Ok, so I haven't really thought deeply about this, this literally is
just a "gut feel" kind of thing.

Can we really not validate ".." some clever way _without_ adding all
those "mount escape" flags? And by "clever" I potentially mean "not
clever" and in fact just fairly brute force. I'd almost prefer to just
walk the parent chains all the way to the root and validate the ".."
that way..

                 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds Aug. 15, 2015, 7:48 p.m. UTC | #5

On Sat, Aug 15, 2015 at 12:36 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Can we really not validate ".." some clever way _without_ adding all
> those "mount escape" flags? And by "clever" I potentially mean "not
> clever" and in fact just fairly brute force. I'd almost prefer to just
> walk the parent chains all the way to the root and validate the ".."
> that way..

For example: while it's true that walking a logn chain of parents (to
validate that we hit root etc) would be expensive, I don't think we'd
necessarily need to do it for the common case.

For example, if out current "mnt->mnt_root" is a _real_ root (so
IS_ROOT() is true), then we know we're not in some possibly partial
bind mount, so we don't need to check anything else, and we can
happily move to the parent dentry *without* having to be particularly
careful.

Otherwise we might need to walk the dentry parent chain to check that
yes, we will hit that mnt->mnt_root" entry, and that we're not
possibly escaping the bind mount. But even that walk is "just"
following a chain of pointers. It's not *that* expensive.

I'd much rather make ".." more expensive, if it means that we don't
have to track the status of whether a mount has a potentially escaped
directory in it or not.  Because I think we can avoid the costs for
traditional non-bind mounts.

No?

                      Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Aug. 15, 2015, 9:07 p.m. UTC | #6

On August 15, 2015 2:48:34 PM CDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Sat, Aug 15, 2015 at 12:36 PM, Linus Torvalds
><torvalds@linux-foundation.org> wrote:
>>
>> Can we really not validate ".." some clever way _without_ adding all
>> those "mount escape" flags? And by "clever" I potentially mean "not
>> clever" and in fact just fairly brute force. I'd almost prefer to
>just
>> walk the parent chains all the way to the root and validate the ".."
>> that way..
>
>For example: while it's true that walking a logn chain of parents (to
>validate that we hit root etc) would be expensive, I don't think we'd
>necessarily need to do it for the common case.
>
>For example, if out current "mnt->mnt_root" is a _real_ root (so
>IS_ROOT() is true), then we know we're not in some possibly partial
>bind mount, so we don't need to check anything else, and we can
>happily move to the parent dentry *without* having to be particularly
>careful.
>
>Otherwise we might need to walk the dentry parent chain to check that
>yes, we will hit that mnt->mnt_root" entry, and that we're not
>possibly escaping the bind mount. But even that walk is "just"
>following a chain of pointers. It's not *that* expensive.
>
>I'd much rather make ".." more expensive, if it means that we don't
>have to track the status of whether a mount has a potentially escaped
>directory in it or not.  Because I think we can avoid the costs for
>traditional non-bind mounts.
>
>No?

Yes we can compare s_root and mnt_root and only call is_subir  if they don't match.

At this point it is a matter of trade offs.

If there is not an escape I do not expect my current implementation will have a measurable cost.   And I don't expect there will be any escapes.

That said if you and Al would be happy with what you are proposing I can easily implement it.

My only concern at this point is that I know some containers run  with a bind mount for their root directory so it might be a change with a measurable cost.  At the same time shallow directory paths are the norm so I don't expect there to be much of a cost.

Eric


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds Aug. 15, 2015, 10:47 p.m. UTC | #7

On Sat, Aug 15, 2015 at 2:07 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Yes we can compare s_root and mnt_root and only call is_subir  if they don't match.

Not even "is_subdir()" - for the RCU traversal case, just d_ancestor()
should be sufficient since we'd already be in an RCU read-locked
region and the RCU lookup checks the rename sequence number around it
all.

And d_ancestor() should really be pretty low-cost - even *if* we have
to call it, which wouldn't even be the case for the normal situation.

> At this point it is a matter of trade offs.
>
> If there is not an escape I do not expect my current implementation will have a measurable cost.
> And I don't expect there will be any escapes.

So the cost I worry about is not the CPU cost, but the complexity and
correctness. If anything goes subtly wrong, the end result is going to
be some very very subtle bugs.

And personally, I'd be much happier with something that is a bit more
straightforward, even if it makes ".." lookup slower. Especially since
I think we can limit the costs to fairly obvious cases (ie only for
partial bind mounts). Keep the code more straightforward, and *if* we
ever see the cost of dentry traversal

But it's up to Al, I think.

Al, comments?

                Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Aug. 16, 2015, 12:59 a.m. UTC | #8

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, Aug 15, 2015 at 2:07 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Yes we can compare s_root and mnt_root and only call is_subir  if they don't match.
>
> Not even "is_subdir()" - for the RCU traversal case, just d_ancestor()
> should be sufficient since we'd already be in an RCU read-locked
> region and the RCU lookup checks the rename sequence number around it
> all.

We check the dentry sequence number and the mount sequence number, which
may be enough to catch a local rename but is certainly not enough to
catch what d_ancestor cares about.

Further we have the partial rcu to non-rcu walk case represented by
unlazy_walk that means we can't blithely do something that might be
wrong and only check the sequence numbers at each step.

> And d_ancestor() should really be pretty low-cost - even *if* we have
> to call it, which wouldn't even be the case for the normal situation.

>
>> At this point it is a matter of trade offs.
>>
>> If there is not an escape I do not expect my current implementation will have a measurable cost.
>> And I don't expect there will be any escapes.
>
> So the cost I worry about is not the CPU cost, but the complexity and
> correctness. If anything goes subtly wrong, the end result is going to
> be some very very subtle bugs.

Fair enough.  I like simple low complexity code, but I don't want to
mess up the pathname lookup fastpath.

> And personally, I'd be much happier with something that is a bit more
> straightforward, even if it makes ".." lookup slower. Especially since
> I think we can limit the costs to fairly obvious cases (ie only for
> partial bind mounts). Keep the code more straightforward, and *if* we
> ever see the cost of dentry traversal
>
> But it's up to Al, I think.
>
> Al, comments?

At the very beginning of this I got shot down by Al Viro for a simple
implementation that essentially had everything except the check for
being a bind mount.  Knowing what I know now I realize it was a bit
buggy, calling d_ancestor in the rcu walk instead of d_subdir, but it
was shot down for the cpu cost.  Then Al suggested the basic approach I
have taken in these patches.

As soon as I am done testing I am going to post the revised version of
my final patch that only performs is_subdir checks on bind mounts.

Then we can decide to merge whichever version of the code you and Al are
happy with.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Aug. 16, 2015, 2:12 a.m. UTC | #9

On Sat, Aug 15, 2015 at 03:47:50PM -0700, Linus Torvalds wrote:
> So the cost I worry about is not the CPU cost, but the complexity and
> correctness. If anything goes subtly wrong, the end result is going to
> be some very very subtle bugs.
> 
> And personally, I'd be much happier with something that is a bit more
> straightforward, even if it makes ".." lookup slower. Especially since
> I think we can limit the costs to fairly obvious cases (ie only for
> partial bind mounts). Keep the code more straightforward, and *if* we
> ever see the cost of dentry traversal
> 
> But it's up to Al, I think.
> 
> Al, comments?

I think you are underestimating the frequency of .. traversals.  Any build
process that creates relative symlinks will be hitting it all the time,
for one thing.  And less-than-entire-fs mounts are not something pathological -
I've got quite a few here, things like container setups often create such
beasts, etc.  Not to mention that things like NFSv4 will often look like such
partial mounts, BTW.

I really don't understand why the hell do we need *anything* complicated
around __d_move() callers - just take sodding spinlock of mount_lock in
the (few) callers around rename_lock, and that's it.  No need for anything
subtle and brittle.

Redoing the locking in d_splice_alias() simply doesn't belong anywhere near
that work.

Basically, all places where we change tree topology go through __d_move()
(and take rename_lock around it).  There are very few of those.  And we
can find and taint affected mounts quite easily - I think we all agree that
beginning of that series looks sane (locating mounts by mnt_root), right?

Let's just add "mount_lock should be held read-exclusive by all callers of
__d_move()" and do the find-and-taint logics from
dentry_lock_for_move().  Move these
        BUG_ON(d_ancestor(dentry, target));
        BUG_ON(d_ancestor(target, dentry));
into dentry_lock_for_move(), while we are at it.  And have it begin with
finding the last common ancestor of dentry and target.  Which turns those
checks into ancestor == dentry and ancestor == target...  Then we need to
taint everything with root being an ancestor of dentry, but not an ancestor
of target.  Which is trivial with LCA already found.  For exchange case we
need to do that both for dentry and target, of course.  That's it.  After that
we just use that taint instead of "is it a partial?" in .. handling.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds Aug. 16, 2015, 2:25 a.m. UTC | #10

On Sat, Aug 15, 2015 at 7:12 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I think you are underestimating the frequency of .. traversals.  Any build
> process that creates relative symlinks will be hitting it all the time,
> for one thing.

I suspect you're over-estimating how expensive it is to just walk down
to the mount-point. It's just a few pointer traversals.

Realistically, we probably do more than that for a *regular* path
component lookup, when we follow the hash chains. Following a d_parent
chain for ".." isn't that different.

Just looking at the last patch Eric sent, that one looks _trivial_. It
didn't need *any* preparation or new rules. Compared to the mess with
marking things MNT_DIR_ESCAPED etc, I know which approach I'd prefer.

But hey, if you think you can simplify it... I just don't think that
even totally ignoring the d_splice_alias() things, and totally
ignoring any locking around __d_move(), the whole "mark things
MNT_DIR_ESCAPED" is a lot more complex.

              Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Aug. 16, 2015, 4:53 a.m. UTC | #11

On Sat, Aug 15, 2015 at 07:25:41PM -0700, Linus Torvalds wrote:
> On Sat, Aug 15, 2015 at 7:12 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > I think you are underestimating the frequency of .. traversals.  Any build
> > process that creates relative symlinks will be hitting it all the time,
> > for one thing.
> 
> I suspect you're over-estimating how expensive it is to just walk down
> to the mount-point. It's just a few pointer traversals.
> 
> Realistically, we probably do more than that for a *regular* path
> component lookup, when we follow the hash chains. Following a d_parent
> chain for ".." isn't that different.

Point, but...  Keep in mind that there's another PITA in there - unreachable
submounts are not as harmless as Eric hopes.  umount -l of the entire tainted
mount is a very large hammer _and_ userland needs to know when to use
it in the first place; otherwise we'll end up with dirty reboots.
So slightly longer term I want to have something done to them when they become
unreachable.  Namely, detach and leave in their place a trap that would
give EINVAL on attempt to cross.  Details depend on another pile of patches
to review and serialize (nfsd-related fs_pin stuff), but catching the moments
when they become unreachable is going to be useful (IOW, I don't see how to do
it without catching those; there might be an elegant solution I'm missing, of
course).

> Just looking at the last patch Eric sent, that one looks _trivial_. It
> didn't need *any* preparation or new rules. Compared to the mess with
> marking things MNT_DIR_ESCAPED etc, I know which approach I'd prefer.
> 
> But hey, if you think you can simplify it... I just don't think that
> even totally ignoring the d_splice_alias() things, and totally
> ignoring any locking around __d_move(), the whole "mark things
> MNT_DIR_ESCAPED" is a lot more complex.

	Basically what I have in mind is a few helpers called from 
dentry_lock_for_move() with d_move(), d_exchange() and d_splice_alias()
doing read_seqlock_excl(&mount_lock); just before grabbing rename_lock and
dropping it right after dropping rename_lock.

find_and_taint(dentry, ancestor)
{
	if (!is_dir(dentry))
		return;
	for (p = dentry->d_parent; p != ancestor; p = next) {
		if (unlikely(d_is_someones_root(p)))
			taint_mounts_of_this_subtree(p);
			// ... with dentry passed there as well when we
			// start handling unreachable submounts.
		next = p->d_parent;
		if (p == next)
			break;
	}
}

depth(d)
{
	int n;
	for (n = 0; !IS_ROOT(d); d = d->d_parent, n++)
		;
	return n;
}

/* find the last common ancestor of d1 and d2; NULL if there isn't one */
LCA(d1, d2)
{
	int n1 = depth(d1), n2 = depth(d2);
	if (n1 > n2)
		do d1 = d1->d_parent; while (--n1 != n2);
	else if (n1 < n2)
		do d2 = d2->d_parent; while (--n2 != n1);
	while (d1 != d2) {
		if (unlikely(IS_ROOT(d1)))
			return NULL;
		d1 = d1->d_parent;
		d2 = d2->d_parent;
	}
	return d1;
}

dentry_lock_for_move(dentry, target, exchange)
{
	ancestor = LCA(dentry, target);
	BUG_ON(ancestor == dentry);	// these BUG_ON are antisocial, BTW
	BUG_ON(ancestor == target);
	find_and_taint(dentry, ancestor);
	if (exchange)
		find_and_taint(target, ancestor);
	// the rest - as we do now
}
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Aug. 16, 2015, 6:22 a.m. UTC | #12

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Sat, Aug 15, 2015 at 07:25:41PM -0700, Linus Torvalds wrote:
>> On Sat, Aug 15, 2015 at 7:12 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> >
>> > I think you are underestimating the frequency of .. traversals.  Any build
>> > process that creates relative symlinks will be hitting it all the time,
>> > for one thing.
>> 
>> I suspect you're over-estimating how expensive it is to just walk down
>> to the mount-point. It's just a few pointer traversals.
>> 
>> Realistically, we probably do more than that for a *regular* path
>> component lookup, when we follow the hash chains. Following a d_parent
>> chain for ".." isn't that different.
>
> Point, but...  Keep in mind that there's another PITA in there - unreachable
> submounts are not as harmless as Eric hopes.  umount -l of the entire tainted
> mount is a very large hammer _and_ userland needs to know when to use
> it in the first place; otherwise we'll end up with dirty reboots.
> So slightly longer term I want to have something done to them when they become
> unreachable.  Namely, detach and leave in their place a trap that would
> give EINVAL on attempt to cross.  Details depend on another pile of patches
> to review and serialize (nfsd-related fs_pin stuff), but catching the moments
> when they become unreachable is going to be useful (IOW, I don't see how to do
> it without catching those; there might be an elegant solution I'm missing, of
> course).

*Rolls my eyes*

This has to be one of the most inconsistent lines of reasoning I have
ever heard.  Either escaping from a bind mount is as rare as hen's teeth
and our handling of the case is to keep it that way.  Or if they are
common we really need to handle people performing lookups on bind mounts
people have escaped from.

If escaping from bind mounts is as rare as hen's teeth (which it had
better be), then we don't have to be clever, we don't have to be nice,
and umount -l is perfectly sufficient.  And most of the time it really
will be an exit of a mount namespace that blows things away anyway.

Certainly all of the easy to create cases require creating a user
namespace and which onws a mount namespace that a less privileged user
can manipulate.

If you gave someone you don't trust permission to move a directory
under which you have a mount point in the initial mount namespace
and that causes problems I am pretty certain that is a PBKAC error.

But even if I accept that the unmount code has to be done you have been
objecting to the infrastructure that is needed to make it happen.

That is to unmount something we must take namespace_sem from
d_splice_alias.  But more specifically we must call namespace_lock()
and namespace_unlock() from d_splice_alias.  And there are no hacks
like trylock for running synchronize_rcu.  That is we must sleep in
d_splice_alias to unmount things.   Arranging the locks so that it can
happen is what you have rather been strenuously objecting to.

Further if you want to unmount things you can't do it from __d_move
inside the rename_lock.  The logic must be outside the rename_lock.

>> Just looking at the last patch Eric sent, that one looks _trivial_. It
>> didn't need *any* preparation or new rules. Compared to the mess with
>> marking things MNT_DIR_ESCAPED etc, I know which approach I'd prefer.
>> 
>> But hey, if you think you can simplify it... I just don't think that
>> even totally ignoring the d_splice_alias() things, and totally
>> ignoring any locking around __d_move(), the whole "mark things
>> MNT_DIR_ESCAPED" is a lot more complex.
>
> 	Basically what I have in mind is a few helpers called from 
> dentry_lock_for_move() with d_move(), d_exchange() and d_splice_alias()
> doing read_seqlock_excl(&mount_lock); just before grabbing rename_lock and
> dropping it right after dropping rename_lock.

So you are arguing for code that has heavier locking than what I have
done, and you are conveniently ignoring the cost of reviewing all of
the code to see if i_lock is ever taken under mount_lock.

> find_and_taint(dentry, ancestor)
> {
> 	if (!is_dir(dentry))
> 		return;
> 	for (p = dentry->d_parent; p != ancestor; p = next) {
> 		if (unlikely(d_is_someones_root(p)))
> 			taint_mounts_of_this_subtree(p);
> 			// ... with dentry passed there as well when we
> 			// start handling unreachable submounts.
> 		next = p->d_parent;
> 		if (p == next)
> 			break;
> 	}
> }
>
> depth(d)
> {
> 	int n;
> 	for (n = 0; !IS_ROOT(d); d = d->d_parent, n++)
> 		;
> 	return n;
> }
>
> /* find the last common ancestor of d1 and d2; NULL if there isn't one */
> LCA(d1, d2)
> {
You are of course missing the common case and the obvious optimization
here:
	if (d1->d_parent == d2->d_parent)
		return d1->d_parent;

> 	int n1 = depth(d1), n2 = depth(d2);
> 	if (n1 > n2)
> 		do d1 = d1->d_parent; while (--n1 != n2);
> 	else if (n1 < n2)
> 		do d2 = d2->d_parent; while (--n2 != n1);
> 	while (d1 != d2) {
> 		if (unlikely(IS_ROOT(d1)))
> 			return NULL;
> 		d1 = d1->d_parent;
> 		d2 = d2->d_parent;
> 	}
> 	return d1;
> }
>
> dentry_lock_for_move(dentry, target, exchange)
> {
> 	ancestor = LCA(dentry, target);
> 	BUG_ON(ancestor == dentry);	// these BUG_ON are antisocial, BTW
> 	BUG_ON(ancestor == target);

> 	find_and_taint(dentry, ancestor);
> 	if (exchange)
> 		find_and_taint(target, ancestor);
> 	// the rest - as we do now

Which begs the question.  What is the point of doing this in
dentry_lock_for_move.  None of that code has any logical connection to
the function of dentry_lock_for_move.  And if the rest of the logic is
untouched it is just stupid to do the work in dentry_lock_for_move.

> }

Al I am sorry.  I am having a very hard time taking your comments on
code direction seriously at this point.  I have tested working code. 

You have some sort of half thought out proof of concept code that you
have never published.  And you are complaining that my working code is
not your half thought through code.

In my last round of patches that I sent out today.  I did put mount_lock
just outside of rename_lock, in d_splice_alias.  But apparently you
haven't noticed.

Now at this point I have hit the limit of my time available for rewrites
before the merge window.  We can go with my 7 patch variant I posted
today (whose only sin appears not to be your implemenation), it's
trivial reduction that Linus likes because it is simple, someone else
can write one, or this can all wait until the next development cycle.

Given that I have working code with no known defects I was really hoping
I could plug this security hole this development cycle.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Aug. 16, 2015, 6:55 a.m. UTC | #13

On Sun, Aug 16, 2015 at 01:22:41AM -0500, Eric W. Biederman wrote:

> In my last round of patches that I sent out today.  I did put mount_lock
> just outside of rename_lock, in d_splice_alias.  But apparently you
> haven't noticed.

I have.  The problem I have with that one is that you end up with duplicated
logics rather than taking it to one place.

> Now at this point I have hit the limit of my time available for rewrites
> before the merge window.  We can go with my 7 patch variant I posted
> today (whose only sin appears not to be your implemenation), it's
> trivial reduction that Linus likes because it is simple, someone else
> can write one, or this can all wait until the next development cycle.

... or either of us can do merging those checks into a single place,
be it as a followup to your 7-patch series, or folded with the
fs/dcache.c-affecting patches in there.  If you have no time left, I can
certainly do that followup myself - not a problem[1]

And umount-related followups are just that - I'm not asking you to do those,
especially since as I said this stuff is sensitive to fs_pin details (so far
it appears to fold nicely with the __detach_mounts()/umount_tree() stuff,
BTW).

[1] with credits for your patches preserved - normally I would assume that
this goes without saying, but your reply seems to imply that I'm playing some
kind of political BS games, so I'd rather spell that out.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Aug. 16, 2015, 7:04 a.m. UTC | #14

On Sun, Aug 16, 2015 at 07:55:03AM +0100, Al Viro wrote:

> And umount-related followups are just that - I'm not asking you to do those,
> especially since as I said this stuff is sensitive to fs_pin details (so far
> it appears to fold nicely with the __detach_mounts()/umount_tree() stuff,
> BTW).

PS: it doesn't need namespace_sem taken inside __d_move() - actual
detaching does, of course, but that part gets done via task_work_add(),
in a reasonably sane locking environment.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Aug. 16, 2015, 11:33 a.m. UTC | #15

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Sun, Aug 16, 2015 at 01:22:41AM -0500, Eric W. Biederman wrote:
>
>> In my last round of patches that I sent out today.  I did put mount_lock
>> just outside of rename_lock, in d_splice_alias.  But apparently you
>> haven't noticed.
>
> I have.  The problem I have with that one is that you end up with duplicated
> logics rather than taking it to one place.

We are talking about a very small duplication of code.  But you are also
talking about combining logic that can get you in trouble pretty easily.

As I read the code sketch you posted it had the issue that if a
disconnected dentry was also a mount point it would mark that mount
point as escaped (because there was no common ancestor).

In the split out logic I don't even bother because you trivially can't
have escaped from anywhere when IS_ROOT(dentry).

>> Now at this point I have hit the limit of my time available for rewrites
>> before the merge window.  We can go with my 7 patch variant I posted
>> today (whose only sin appears not to be your implemenation), it's
>> trivial reduction that Linus likes because it is simple, someone else
>> can write one, or this can all wait until the next development cycle.
>
> ... or either of us can do merging those checks into a single place,
> be it as a followup to your 7-patch series, or folded with the
> fs/dcache.c-affecting patches in there.  If you have no time left, I can
> certainly do that followup myself - not a problem[1]

I don't have time.  Everytime I have worked with this it has take pretty
much full days of staring at the code, and I don't have any more full
days left before the merge window.

> And umount-related followups are just that - I'm not asking you to do those,
> especially since as I said this stuff is sensitive to fs_pin details (so far
> it appears to fold nicely with the __detach_mounts()/umount_tree() stuff,
> BTW).
>
> PS: it doesn't need namespace_sem taken inside __d_move() - actual
> detaching does, of course, but that part gets done via task_work_add(),
> in a reasonably sane locking environment.

The part that boggles my mind about that whole approach is that just
outside of rename_lock there already is a sane locking environment.

Admittedly it will take a touch of work in d_splice_alias to get it
sane, but that isn't particularly hard.

	write_seqlock(&rename_lock);
        if (!IS_ROOT(new))
		__d_unalias(...);

__d_unalias(...)
{	
        /* Parent's don't go away so !IS_ROOT(new) will stay valid */
        write_sequnlock(&rename_lock);
        /* Hooray! The code can be normal and sleep if it needs to,
         * before calling into __d_move.
         */
         ...
         d_move();
}

Similarly for duplicate logic removal.  Teaching d_splice_alias when
it is performing an ordinarly rename to be able to just call d_move
gets you far more from a maintenance perspective than cramming
everything into __d_move.

> [1] with credits for your patches preserved - normally I would assume that
> this goes without saying, but your reply seems to imply that I'm playing some
> kind of political BS games, so I'd rather spell that out.

My issue wasn't political games, but rather holding code to an
apparently BS standard.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Aug. 16, 2015, 11:51 a.m. UTC | #16

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, Aug 15, 2015 at 7:12 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> I think you are underestimating the frequency of .. traversals.  Any build
>> process that creates relative symlinks will be hitting it all the time,
>> for one thing.
>
> I suspect you're over-estimating how expensive it is to just walk down
> to the mount-point. It's just a few pointer traversals.
>
> Realistically, we probably do more than that for a *regular* path
> component lookup, when we follow the hash chains. Following a d_parent
> chain for ".." isn't that different.
>
> Just looking at the last patch Eric sent, that one looks _trivial_. It
> didn't need *any* preparation or new rules. Compared to the mess with
> marking things MNT_DIR_ESCAPED etc, I know which approach I'd prefer.
>
> But hey, if you think you can simplify it... I just don't think that
> even totally ignoring the d_splice_alias() things, and totally
> ignoring any locking around __d_move(), the whole "mark things
> MNT_DIR_ESCAPED" is a lot more complex.

It occurs to me that there is a fairly simple way we can emperically
test to see how expensive calling is_subdir for every .. on a bind mount
is in practice.

- Take my last patch
- run a benchmark outside of a bind mount (perhaps a kernel compile).
- run the same benchmark inside of a bind mount.

See if the performance differs.

I am going to try to find time to do this, but I am travelling for the
next couple of days.

If someone who has a bit more time wants to try it and beats me to that
would be great.

I think having some emperical numbers would be nice in this part of the
conversation.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Willy Tarreau Aug. 16, 2015, 10:29 p.m. UTC | #17

On Sun, Aug 16, 2015 at 06:51:33AM -0500, Eric W. Biederman wrote:
> It occurs to me that there is a fairly simple way we can emperically
> test to see how expensive calling is_subdir for every .. on a bind mount
> is in practice.
> 
> - Take my last patch
> - run a benchmark outside of a bind mount (perhaps a kernel compile).
> - run the same benchmark inside of a bind mount.
> 
> See if the performance differs.
> 
> I am going to try to find time to do this, but I am travelling for the
> next couple of days.
> 
> If someone who has a bit more time wants to try it and beats me to that
> would be great.
> 
> I think having some emperical numbers would be nice in this part of the
> conversation.

I took a bit of time to do something simpler though less scientific : I
just ran a kernel build under strace -c -f to get a rough idea of the
number of syscalls. It takes 2m23 without strace and 3m00 under strace.
7.2M syscalls were seen out of which only 3.2M were related to FS access
(mostly open(), fstat() and stat()), the rest is minor. That's 22k
syscalls per second, or 45 us between two syscalls. From this point I'm
having a hard time imagining that any extra tests in the code path would
have a significant impact to cause a measurable difference compared to
these 45us, especially since I'm counting all FS accesses (worst case)
and not just those referencing "..".

Just my two cents,
Willy

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Aug. 21, 2015, 7:51 a.m. UTC | #18

On Sun, Aug 16, 2015 at 06:33:21AM -0500, Eric W. Biederman wrote:

> > ... or either of us can do merging those checks into a single place,
> > be it as a followup to your 7-patch series, or folded with the
> > fs/dcache.c-affecting patches in there.  If you have no time left, I can
> > certainly do that followup myself - not a problem[1]
> 
> I don't have time.  Everytime I have worked with this it has take pretty
> much full days of staring at the code, and I don't have any more full
> days left before the merge window.

OK, at that point I've pretty much given up on fs_pin for this cycle.
And testing your variant with unconditional checks on .. appears to have
fairly low overhead.  I still want to deal with catching and unmounting the
unreachable suckers, so fs/dcache.c side of things will get used when we get
to that stuff, but for now I've taken your 1/7, 2/7 plus the variant of
"vfs: Test for and handle paths that are unreachable from their mnt_root"
that doesn't care whether anything escaped or not.

3--6 are held in a local branch for now; I *am* going to use them
come next cycle.  And there's another pile of fun around that area, also
for the next cycle - kernel-initiated subtree removals on things like
sysfs et.al.; handling of the locking in those is inconsistent and tied
with the fun we have for d_move()/__d_unalias().  Sigh...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Aug. 21, 2015, 3:27 p.m. UTC | #19

On August 21, 2015 12:51:05 AM PDT, Al Viro <viro@ZenIV.linux.org.uk> wrote:
>On Sun, Aug 16, 2015 at 06:33:21AM -0500, Eric W. Biederman wrote:
>
>> > ... or either of us can do merging those checks into a single
>place,
>> > be it as a followup to your 7-patch series, or folded with the
>> > fs/dcache.c-affecting patches in there.  If you have no time left,
>I can
>> > certainly do that followup myself - not a problem[1]
>> 
>> I don't have time.  Everytime I have worked with this it has take
>pretty
>> much full days of staring at the code, and I don't have any more full
>> days left before the merge window.
>
>OK, at that point I've pretty much given up on fs_pin for this cycle.
>And testing your variant with unconditional checks on .. appears to
>have
>fairly low overhead.  I still want to deal with catching and unmounting
>the
>unreachable suckers, so fs/dcache.c side of things will get used when
>we get
>to that stuff, but for now I've taken your 1/7, 2/7 plus the variant of
>"vfs: Test for and handle paths that are unreachable from their
>mnt_root"
>that doesn't care whether anything escaped or not.
>
>3--6 are held in a local branch for now; I *am* going to use them
>come next cycle.  And there's another pile of fun around that area,
>also
>for the next cycle - kernel-initiated subtree removals on things like
>sysfs et.al.; handling of the locking in those is inconsistent and tied
>with the fun we have for d_move()/__d_unalias().  Sigh...


I am sorry to hear about a mess in sysfs. 

I am glad to hear that we do not appear to need MNT_DIR_ESCAPED to avoid performance regressions.  That should make back porters lives much easier.


As for the future I have a suspicion that we want to look at the rcu readable dynamically sized hash tables the networking guys cooked up.

Al thank you very much for performance testing the simple version.

Eric

 
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[review,3/8] dcache: Clearly separate the two directory rename cases in d_splice_alias

Commit Message

Comments

Patch