Message ID | 87fv3mjxsc.fsf_-_@x220.int.ebiederm.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Aug 13, 2015 at 11:31:47PM -0500, Eric W. Biederman wrote: > The problem is that as the lookup locking stands today grabbing the > s_vfs_rename_mutex must use mutex_trylock which can fail, so for reliability > reasons we need to avoid using the rename_mutex as much as possible. I really don't like it. For one thing, you *still* are taking it - have to, really. So this argument is moot anyway. ESTALE can happen here. For another, I'm not convinced that this "we don't need no stinkin' extra locks for attaching a detached subtree" is correct. E.g. what's to protect that IS_ROOT(new) from changing right under you? Consider a corrupted filesystem (or a bogus server, or a sufficiently unpleasant race with another client, etc.) You have a detached subtree with lookups from *TWO* directories trying to attach its root. Now what? ->i_mutex on either prospective parent won't give a damn thing - different inodes. The current tree would've serialized those on rename_lock and treated that as attach a detached followed by relocate a misplaced. With your change we get a race in there. This place is subtle and nasty, and we had rather nasty races there. Repeatedly. Any non-trivial locking changes in that area should go separately from everything else and only with an accurate analysis of those changes. It's one of the easiest places to fuck up in. Been there, done that, and so had many other people. I'm not saying that it wouldn't benefit from cleaner locking - it sure as hell would. But it's in a really incestous relationship with a lot of other pieces, both in fs/dcache.c and elsewhere. Let's not mix that into anything else - driveby cleanups in that place are very likely to cause serious trouble. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Al Viro <viro@ZenIV.linux.org.uk> writes: > On Thu, Aug 13, 2015 at 11:31:47PM -0500, Eric W. Biederman wrote: > >> The problem is that as the lookup locking stands today grabbing the >> s_vfs_rename_mutex must use mutex_trylock which can fail, so for reliability >> reasons we need to avoid using the rename_mutex as much as possible. > > I really don't like it. For one thing, you *still* are taking it - have to, > really. So this argument is moot anyway. Not moot. We don't take s_vfs_rename_mutex when we are connecting a disconnected dentry alias. If d_splice_alias could sleep and take s_vfs_rename_mutex then we could have a single path through the code. Unfortunately the code can't sleep when taking s_vfs_rename_mutex so attempting to take s_vfs_rename_mutex for both paths will introduce unnecessary -ESTALE failures into d_splice_alias. > ESTALE can happen here. For > another, I'm not convinced that this "we don't need no stinkin' > extra locks for attaching a detached subtree" is correct. E.g. what's > to protect that IS_ROOT(new) from changing right under you? You are quite correct that I missed that nothing protects the result of IS_ROOT(new). So my change does introduce a case where we don't hold the appropriate inode mutexes when renaming a dentry and that introduces races elsewhere in the code so it is not acceptable. But it is true that we only take rename_lock and don't take any additional mutex when connecting a disconnected dentry. Aka "we don't need no stinkin' extra locks". We clearly can not take the new->d_parent->d_inode->i_mutex when IS_ROOT(new) as that is meaningless. Further I do not see a point in taking s_vfs_rename_mutex in that case. Not for this round but if you can see any reason why our not taking s_vfs_rename_mutex when connecting disconnected dentries is wrong and we need to take it and risk -ESTALE. I would love to know because I would love to clean up that mess. > I'm not saying that it wouldn't benefit from cleaner locking - it sure > as hell would. But it's in a really incestous relationship with a lot > of other pieces, both in fs/dcache.c and elsewhere. Let's not mix that > into anything else - driveby cleanups in that place are very likely to > cause serious trouble. Fair enough. I do have to touch d_splice_alias and change the locking, so unfortunately I have to do some of the nasty locking analysis anyway. I am keeping the i_lock cleanup because it is trivial and removes the need to figure out if there is any existing ordering between i_lock and mount_lock, and if taking mount_lock could induce a deadlock. That audit would not be trivial. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
It is possible in some situations to rename a file or directory through one mount point such that it can start out inside of a bind mount and after the rename wind up outside of the bind mount. Unfortunately with user namespaces these conditions can be trivially created by creating a bind mount under an existing bind mount. I have identified four situations in which this may be a problem. - __d_path and d_absolute_path need to error on disconnected paths that can not reach some root directory or lsm path based security checks can incorrectly succeed. - Normal path name resolution following .. can lead to a directory that is outside of the original loopback mount. - file handle reconsititution aka exportfs_decode_fh can yield a dentry from which d_parent can be followed up to mnt->sb->s_root, but d_parent can not be followed up to mnt->mnt_root. - Mounts on a path that has been renamed outside of a loopback mount become unreachable, as there is no possible path that can be passed to umount to unmount them. My strategy: o File handle reconsitituion problems can be prevented by enabling the nfsd subtree checks for nfs exports, and open_by_handle_at requires capable(CAP_DAC_READ_SEARCH) so is only usable by the global root. This makes any problems difficult if not impossible to exploit in practice so I have not yet written code to address that issue. o The functions __d_path and d_absolute_path are agumented so that the security modules will not be fed a problematic path to work with. o Following of .. has been agumented to test that after d_parent has been resolved the original directory is connected, and if not an error of -ENOENT is returned. o I do not worry about mounts that are disconnected from their bind mount as these mounts can always be freed by either umount -l on the bind mount they have escaped from, or by freeing the mount namespace. So I do not believe there is an actual problem. Pathname resolution is a common fast path and most of the code in this patchset to support keeping .. from becoming expensive in the common case. After hearing the Al's feedback and running some numbers I have given up attempting to keeping the number of d_ancestor calls during pathname resolution to an absolute minimum. It appears that simply preventing calls d_ancestor unless a directory has escaped is good enough. This change in approach has significantly simplified the code. The implementation change this round is I have dropped my patch cleaning up d_splice_alias. Al Viro found a race that makes the technique I was using fundamentally racy. I now have d_splice_alias taking mount_lock around rename_lock. Since I don't have to sleep in d_splice_alias change is minimal and sufficient for this purpose. Barring some other idiocy I think this will be the final version of this patchset. These changes are all against v4.2-rc1. For those who like to see everything in a single tree the code is at: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing Eric W. Biederman (7): dcache: Handle escaped paths in prepend_path dcache: Reduce the scope of i_lock in d_splice_alias mnt: Track which mounts use a dentry as root. dcache: Implement d_common_ancestor dcache: Only read d_flags once in d_is_dir mnt: Track when a directory escapes a bind mount vfs: Test for and handle paths that are unreachable from their mnt_root fs/dcache.c | 91 +++++++++++++++++++++++++++-- fs/mount.h | 9 +++ fs/namei.c | 26 ++++++++- fs/namespace.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++-- include/linux/dcache.h | 11 +++- include/linux/mount.h | 1 + 6 files changed, 279 insertions(+), 11 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Aug 15, 2015 at 11:35 AM, Eric W. Biederman <ebiederm@xmission.com> wrote: > > The implementation change this round is I have dropped my patch cleaning > up d_splice_alias. Al Viro found a race that makes the technique I was > using fundamentally racy. I now have d_splice_alias taking mount_lock > around rename_lock. Since I don't have to sleep in d_splice_alias > change is minimal and sufficient for this purpose. Quite frankly, I hate this series. Not all of it. Patches 1,2 and 5 look fine to me. But 3,4,6 and 7 I'm not happy with. Particularly patch 6. It just smells like a bad hack to me. My gut feel says that it's all wrong. It doesn't feel clean or right. I'd much rather make ".." handling more expensive than add and maintain that MNT_DIR_ESCAPED flag. My gut feel is that yes, we should look seriously at making ".." much smarter (so I don't object to the concept of patch 7/7 at all), but I think we should strive to look at that ".." handling *without* adding the crufty odd special bind mount crud. Put another way: the whole "escape bind mount" thing is not at all a new issue, it smells very much like the very traditional Unix "escape chroot" thing. And I detest how this adds magic rules for bind mounts, when it feels like a much more generic issue. Ok, so I haven't really thought deeply about this, this literally is just a "gut feel" kind of thing. Can we really not validate ".." some clever way _without_ adding all those "mount escape" flags? And by "clever" I potentially mean "not clever" and in fact just fairly brute force. I'd almost prefer to just walk the parent chains all the way to the root and validate the ".." that way.. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Aug 15, 2015 at 12:36 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Can we really not validate ".." some clever way _without_ adding all > those "mount escape" flags? And by "clever" I potentially mean "not > clever" and in fact just fairly brute force. I'd almost prefer to just > walk the parent chains all the way to the root and validate the ".." > that way.. For example: while it's true that walking a logn chain of parents (to validate that we hit root etc) would be expensive, I don't think we'd necessarily need to do it for the common case. For example, if out current "mnt->mnt_root" is a _real_ root (so IS_ROOT() is true), then we know we're not in some possibly partial bind mount, so we don't need to check anything else, and we can happily move to the parent dentry *without* having to be particularly careful. Otherwise we might need to walk the dentry parent chain to check that yes, we will hit that mnt->mnt_root" entry, and that we're not possibly escaping the bind mount. But even that walk is "just" following a chain of pointers. It's not *that* expensive. I'd much rather make ".." more expensive, if it means that we don't have to track the status of whether a mount has a potentially escaped directory in it or not. Because I think we can avoid the costs for traditional non-bind mounts. No? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On August 15, 2015 2:48:34 PM CDT, Linus Torvalds <torvalds@linux-foundation.org> wrote: >On Sat, Aug 15, 2015 at 12:36 PM, Linus Torvalds ><torvalds@linux-foundation.org> wrote: >> >> Can we really not validate ".." some clever way _without_ adding all >> those "mount escape" flags? And by "clever" I potentially mean "not >> clever" and in fact just fairly brute force. I'd almost prefer to >just >> walk the parent chains all the way to the root and validate the ".." >> that way.. > >For example: while it's true that walking a logn chain of parents (to >validate that we hit root etc) would be expensive, I don't think we'd >necessarily need to do it for the common case. > >For example, if out current "mnt->mnt_root" is a _real_ root (so >IS_ROOT() is true), then we know we're not in some possibly partial >bind mount, so we don't need to check anything else, and we can >happily move to the parent dentry *without* having to be particularly >careful. > >Otherwise we might need to walk the dentry parent chain to check that >yes, we will hit that mnt->mnt_root" entry, and that we're not >possibly escaping the bind mount. But even that walk is "just" >following a chain of pointers. It's not *that* expensive. > >I'd much rather make ".." more expensive, if it means that we don't >have to track the status of whether a mount has a potentially escaped >directory in it or not. Because I think we can avoid the costs for >traditional non-bind mounts. > >No? Yes we can compare s_root and mnt_root and only call is_subir if they don't match. At this point it is a matter of trade offs. If there is not an escape I do not expect my current implementation will have a measurable cost. And I don't expect there will be any escapes. That said if you and Al would be happy with what you are proposing I can easily implement it. My only concern at this point is that I know some containers run with a bind mount for their root directory so it might be a change with a measurable cost. At the same time shallow directory paths are the norm so I don't expect there to be much of a cost. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Aug 15, 2015 at 2:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > > Yes we can compare s_root and mnt_root and only call is_subir if they don't match. Not even "is_subdir()" - for the RCU traversal case, just d_ancestor() should be sufficient since we'd already be in an RCU read-locked region and the RCU lookup checks the rename sequence number around it all. And d_ancestor() should really be pretty low-cost - even *if* we have to call it, which wouldn't even be the case for the normal situation. > At this point it is a matter of trade offs. > > If there is not an escape I do not expect my current implementation will have a measurable cost. > And I don't expect there will be any escapes. So the cost I worry about is not the CPU cost, but the complexity and correctness. If anything goes subtly wrong, the end result is going to be some very very subtle bugs. And personally, I'd be much happier with something that is a bit more straightforward, even if it makes ".." lookup slower. Especially since I think we can limit the costs to fairly obvious cases (ie only for partial bind mounts). Keep the code more straightforward, and *if* we ever see the cost of dentry traversal But it's up to Al, I think. Al, comments? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Sat, Aug 15, 2015 at 2:07 PM, Eric W. Biederman > <ebiederm@xmission.com> wrote: >> >> Yes we can compare s_root and mnt_root and only call is_subir if they don't match. > > Not even "is_subdir()" - for the RCU traversal case, just d_ancestor() > should be sufficient since we'd already be in an RCU read-locked > region and the RCU lookup checks the rename sequence number around it > all. We check the dentry sequence number and the mount sequence number, which may be enough to catch a local rename but is certainly not enough to catch what d_ancestor cares about. Further we have the partial rcu to non-rcu walk case represented by unlazy_walk that means we can't blithely do something that might be wrong and only check the sequence numbers at each step. > And d_ancestor() should really be pretty low-cost - even *if* we have > to call it, which wouldn't even be the case for the normal situation. > >> At this point it is a matter of trade offs. >> >> If there is not an escape I do not expect my current implementation will have a measurable cost. >> And I don't expect there will be any escapes. > > So the cost I worry about is not the CPU cost, but the complexity and > correctness. If anything goes subtly wrong, the end result is going to > be some very very subtle bugs. Fair enough. I like simple low complexity code, but I don't want to mess up the pathname lookup fastpath. > And personally, I'd be much happier with something that is a bit more > straightforward, even if it makes ".." lookup slower. Especially since > I think we can limit the costs to fairly obvious cases (ie only for > partial bind mounts). Keep the code more straightforward, and *if* we > ever see the cost of dentry traversal > > But it's up to Al, I think. > > Al, comments? At the very beginning of this I got shot down by Al Viro for a simple implementation that essentially had everything except the check for being a bind mount. Knowing what I know now I realize it was a bit buggy, calling d_ancestor in the rcu walk instead of d_subdir, but it was shot down for the cpu cost. Then Al suggested the basic approach I have taken in these patches. As soon as I am done testing I am going to post the revised version of my final patch that only performs is_subdir checks on bind mounts. Then we can decide to merge whichever version of the code you and Al are happy with. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Aug 15, 2015 at 03:47:50PM -0700, Linus Torvalds wrote: > So the cost I worry about is not the CPU cost, but the complexity and > correctness. If anything goes subtly wrong, the end result is going to > be some very very subtle bugs. > > And personally, I'd be much happier with something that is a bit more > straightforward, even if it makes ".." lookup slower. Especially since > I think we can limit the costs to fairly obvious cases (ie only for > partial bind mounts). Keep the code more straightforward, and *if* we > ever see the cost of dentry traversal > > But it's up to Al, I think. > > Al, comments? I think you are underestimating the frequency of .. traversals. Any build process that creates relative symlinks will be hitting it all the time, for one thing. And less-than-entire-fs mounts are not something pathological - I've got quite a few here, things like container setups often create such beasts, etc. Not to mention that things like NFSv4 will often look like such partial mounts, BTW. I really don't understand why the hell do we need *anything* complicated around __d_move() callers - just take sodding spinlock of mount_lock in the (few) callers around rename_lock, and that's it. No need for anything subtle and brittle. Redoing the locking in d_splice_alias() simply doesn't belong anywhere near that work. Basically, all places where we change tree topology go through __d_move() (and take rename_lock around it). There are very few of those. And we can find and taint affected mounts quite easily - I think we all agree that beginning of that series looks sane (locating mounts by mnt_root), right? Let's just add "mount_lock should be held read-exclusive by all callers of __d_move()" and do the find-and-taint logics from dentry_lock_for_move(). Move these BUG_ON(d_ancestor(dentry, target)); BUG_ON(d_ancestor(target, dentry)); into dentry_lock_for_move(), while we are at it. And have it begin with finding the last common ancestor of dentry and target. Which turns those checks into ancestor == dentry and ancestor == target... Then we need to taint everything with root being an ancestor of dentry, but not an ancestor of target. Which is trivial with LCA already found. For exchange case we need to do that both for dentry and target, of course. That's it. After that we just use that taint instead of "is it a partial?" in .. handling. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Aug 15, 2015 at 7:12 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > I think you are underestimating the frequency of .. traversals. Any build > process that creates relative symlinks will be hitting it all the time, > for one thing. I suspect you're over-estimating how expensive it is to just walk down to the mount-point. It's just a few pointer traversals. Realistically, we probably do more than that for a *regular* path component lookup, when we follow the hash chains. Following a d_parent chain for ".." isn't that different. Just looking at the last patch Eric sent, that one looks _trivial_. It didn't need *any* preparation or new rules. Compared to the mess with marking things MNT_DIR_ESCAPED etc, I know which approach I'd prefer. But hey, if you think you can simplify it... I just don't think that even totally ignoring the d_splice_alias() things, and totally ignoring any locking around __d_move(), the whole "mark things MNT_DIR_ESCAPED" is a lot more complex. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Aug 15, 2015 at 07:25:41PM -0700, Linus Torvalds wrote: > On Sat, Aug 15, 2015 at 7:12 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > I think you are underestimating the frequency of .. traversals. Any build > > process that creates relative symlinks will be hitting it all the time, > > for one thing. > > I suspect you're over-estimating how expensive it is to just walk down > to the mount-point. It's just a few pointer traversals. > > Realistically, we probably do more than that for a *regular* path > component lookup, when we follow the hash chains. Following a d_parent > chain for ".." isn't that different. Point, but... Keep in mind that there's another PITA in there - unreachable submounts are not as harmless as Eric hopes. umount -l of the entire tainted mount is a very large hammer _and_ userland needs to know when to use it in the first place; otherwise we'll end up with dirty reboots. So slightly longer term I want to have something done to them when they become unreachable. Namely, detach and leave in their place a trap that would give EINVAL on attempt to cross. Details depend on another pile of patches to review and serialize (nfsd-related fs_pin stuff), but catching the moments when they become unreachable is going to be useful (IOW, I don't see how to do it without catching those; there might be an elegant solution I'm missing, of course). > Just looking at the last patch Eric sent, that one looks _trivial_. It > didn't need *any* preparation or new rules. Compared to the mess with > marking things MNT_DIR_ESCAPED etc, I know which approach I'd prefer. > > But hey, if you think you can simplify it... I just don't think that > even totally ignoring the d_splice_alias() things, and totally > ignoring any locking around __d_move(), the whole "mark things > MNT_DIR_ESCAPED" is a lot more complex. Basically what I have in mind is a few helpers called from dentry_lock_for_move() with d_move(), d_exchange() and d_splice_alias() doing read_seqlock_excl(&mount_lock); just before grabbing rename_lock and dropping it right after dropping rename_lock. find_and_taint(dentry, ancestor) { if (!is_dir(dentry)) return; for (p = dentry->d_parent; p != ancestor; p = next) { if (unlikely(d_is_someones_root(p))) taint_mounts_of_this_subtree(p); // ... with dentry passed there as well when we // start handling unreachable submounts. next = p->d_parent; if (p == next) break; } } depth(d) { int n; for (n = 0; !IS_ROOT(d); d = d->d_parent, n++) ; return n; } /* find the last common ancestor of d1 and d2; NULL if there isn't one */ LCA(d1, d2) { int n1 = depth(d1), n2 = depth(d2); if (n1 > n2) do d1 = d1->d_parent; while (--n1 != n2); else if (n1 < n2) do d2 = d2->d_parent; while (--n2 != n1); while (d1 != d2) { if (unlikely(IS_ROOT(d1))) return NULL; d1 = d1->d_parent; d2 = d2->d_parent; } return d1; } dentry_lock_for_move(dentry, target, exchange) { ancestor = LCA(dentry, target); BUG_ON(ancestor == dentry); // these BUG_ON are antisocial, BTW BUG_ON(ancestor == target); find_and_taint(dentry, ancestor); if (exchange) find_and_taint(target, ancestor); // the rest - as we do now } -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Al Viro <viro@ZenIV.linux.org.uk> writes: > On Sat, Aug 15, 2015 at 07:25:41PM -0700, Linus Torvalds wrote: >> On Sat, Aug 15, 2015 at 7:12 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: >> > >> > I think you are underestimating the frequency of .. traversals. Any build >> > process that creates relative symlinks will be hitting it all the time, >> > for one thing. >> >> I suspect you're over-estimating how expensive it is to just walk down >> to the mount-point. It's just a few pointer traversals. >> >> Realistically, we probably do more than that for a *regular* path >> component lookup, when we follow the hash chains. Following a d_parent >> chain for ".." isn't that different. > > Point, but... Keep in mind that there's another PITA in there - unreachable > submounts are not as harmless as Eric hopes. umount -l of the entire tainted > mount is a very large hammer _and_ userland needs to know when to use > it in the first place; otherwise we'll end up with dirty reboots. > So slightly longer term I want to have something done to them when they become > unreachable. Namely, detach and leave in their place a trap that would > give EINVAL on attempt to cross. Details depend on another pile of patches > to review and serialize (nfsd-related fs_pin stuff), but catching the moments > when they become unreachable is going to be useful (IOW, I don't see how to do > it without catching those; there might be an elegant solution I'm missing, of > course). *Rolls my eyes* This has to be one of the most inconsistent lines of reasoning I have ever heard. Either escaping from a bind mount is as rare as hen's teeth and our handling of the case is to keep it that way. Or if they are common we really need to handle people performing lookups on bind mounts people have escaped from. If escaping from bind mounts is as rare as hen's teeth (which it had better be), then we don't have to be clever, we don't have to be nice, and umount -l is perfectly sufficient. And most of the time it really will be an exit of a mount namespace that blows things away anyway. Certainly all of the easy to create cases require creating a user namespace and which onws a mount namespace that a less privileged user can manipulate. If you gave someone you don't trust permission to move a directory under which you have a mount point in the initial mount namespace and that causes problems I am pretty certain that is a PBKAC error. But even if I accept that the unmount code has to be done you have been objecting to the infrastructure that is needed to make it happen. That is to unmount something we must take namespace_sem from d_splice_alias. But more specifically we must call namespace_lock() and namespace_unlock() from d_splice_alias. And there are no hacks like trylock for running synchronize_rcu. That is we must sleep in d_splice_alias to unmount things. Arranging the locks so that it can happen is what you have rather been strenuously objecting to. Further if you want to unmount things you can't do it from __d_move inside the rename_lock. The logic must be outside the rename_lock. >> Just looking at the last patch Eric sent, that one looks _trivial_. It >> didn't need *any* preparation or new rules. Compared to the mess with >> marking things MNT_DIR_ESCAPED etc, I know which approach I'd prefer. >> >> But hey, if you think you can simplify it... I just don't think that >> even totally ignoring the d_splice_alias() things, and totally >> ignoring any locking around __d_move(), the whole "mark things >> MNT_DIR_ESCAPED" is a lot more complex. > > Basically what I have in mind is a few helpers called from > dentry_lock_for_move() with d_move(), d_exchange() and d_splice_alias() > doing read_seqlock_excl(&mount_lock); just before grabbing rename_lock and > dropping it right after dropping rename_lock. So you are arguing for code that has heavier locking than what I have done, and you are conveniently ignoring the cost of reviewing all of the code to see if i_lock is ever taken under mount_lock. > find_and_taint(dentry, ancestor) > { > if (!is_dir(dentry)) > return; > for (p = dentry->d_parent; p != ancestor; p = next) { > if (unlikely(d_is_someones_root(p))) > taint_mounts_of_this_subtree(p); > // ... with dentry passed there as well when we > // start handling unreachable submounts. > next = p->d_parent; > if (p == next) > break; > } > } > > depth(d) > { > int n; > for (n = 0; !IS_ROOT(d); d = d->d_parent, n++) > ; > return n; > } > > /* find the last common ancestor of d1 and d2; NULL if there isn't one */ > LCA(d1, d2) > { You are of course missing the common case and the obvious optimization here: if (d1->d_parent == d2->d_parent) return d1->d_parent; > int n1 = depth(d1), n2 = depth(d2); > if (n1 > n2) > do d1 = d1->d_parent; while (--n1 != n2); > else if (n1 < n2) > do d2 = d2->d_parent; while (--n2 != n1); > while (d1 != d2) { > if (unlikely(IS_ROOT(d1))) > return NULL; > d1 = d1->d_parent; > d2 = d2->d_parent; > } > return d1; > } > > dentry_lock_for_move(dentry, target, exchange) > { > ancestor = LCA(dentry, target); > BUG_ON(ancestor == dentry); // these BUG_ON are antisocial, BTW > BUG_ON(ancestor == target); > find_and_taint(dentry, ancestor); > if (exchange) > find_and_taint(target, ancestor); > // the rest - as we do now Which begs the question. What is the point of doing this in dentry_lock_for_move. None of that code has any logical connection to the function of dentry_lock_for_move. And if the rest of the logic is untouched it is just stupid to do the work in dentry_lock_for_move. > } Al I am sorry. I am having a very hard time taking your comments on code direction seriously at this point. I have tested working code. You have some sort of half thought out proof of concept code that you have never published. And you are complaining that my working code is not your half thought through code. In my last round of patches that I sent out today. I did put mount_lock just outside of rename_lock, in d_splice_alias. But apparently you haven't noticed. Now at this point I have hit the limit of my time available for rewrites before the merge window. We can go with my 7 patch variant I posted today (whose only sin appears not to be your implemenation), it's trivial reduction that Linus likes because it is simple, someone else can write one, or this can all wait until the next development cycle. Given that I have working code with no known defects I was really hoping I could plug this security hole this development cycle. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Aug 16, 2015 at 01:22:41AM -0500, Eric W. Biederman wrote: > In my last round of patches that I sent out today. I did put mount_lock > just outside of rename_lock, in d_splice_alias. But apparently you > haven't noticed. I have. The problem I have with that one is that you end up with duplicated logics rather than taking it to one place. > Now at this point I have hit the limit of my time available for rewrites > before the merge window. We can go with my 7 patch variant I posted > today (whose only sin appears not to be your implemenation), it's > trivial reduction that Linus likes because it is simple, someone else > can write one, or this can all wait until the next development cycle. ... or either of us can do merging those checks into a single place, be it as a followup to your 7-patch series, or folded with the fs/dcache.c-affecting patches in there. If you have no time left, I can certainly do that followup myself - not a problem[1] And umount-related followups are just that - I'm not asking you to do those, especially since as I said this stuff is sensitive to fs_pin details (so far it appears to fold nicely with the __detach_mounts()/umount_tree() stuff, BTW). [1] with credits for your patches preserved - normally I would assume that this goes without saying, but your reply seems to imply that I'm playing some kind of political BS games, so I'd rather spell that out. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Aug 16, 2015 at 07:55:03AM +0100, Al Viro wrote: > And umount-related followups are just that - I'm not asking you to do those, > especially since as I said this stuff is sensitive to fs_pin details (so far > it appears to fold nicely with the __detach_mounts()/umount_tree() stuff, > BTW). PS: it doesn't need namespace_sem taken inside __d_move() - actual detaching does, of course, but that part gets done via task_work_add(), in a reasonably sane locking environment. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Al Viro <viro@ZenIV.linux.org.uk> writes: > On Sun, Aug 16, 2015 at 01:22:41AM -0500, Eric W. Biederman wrote: > >> In my last round of patches that I sent out today. I did put mount_lock >> just outside of rename_lock, in d_splice_alias. But apparently you >> haven't noticed. > > I have. The problem I have with that one is that you end up with duplicated > logics rather than taking it to one place. We are talking about a very small duplication of code. But you are also talking about combining logic that can get you in trouble pretty easily. As I read the code sketch you posted it had the issue that if a disconnected dentry was also a mount point it would mark that mount point as escaped (because there was no common ancestor). In the split out logic I don't even bother because you trivially can't have escaped from anywhere when IS_ROOT(dentry). >> Now at this point I have hit the limit of my time available for rewrites >> before the merge window. We can go with my 7 patch variant I posted >> today (whose only sin appears not to be your implemenation), it's >> trivial reduction that Linus likes because it is simple, someone else >> can write one, or this can all wait until the next development cycle. > > ... or either of us can do merging those checks into a single place, > be it as a followup to your 7-patch series, or folded with the > fs/dcache.c-affecting patches in there. If you have no time left, I can > certainly do that followup myself - not a problem[1] I don't have time. Everytime I have worked with this it has take pretty much full days of staring at the code, and I don't have any more full days left before the merge window. > And umount-related followups are just that - I'm not asking you to do those, > especially since as I said this stuff is sensitive to fs_pin details (so far > it appears to fold nicely with the __detach_mounts()/umount_tree() stuff, > BTW). > > PS: it doesn't need namespace_sem taken inside __d_move() - actual > detaching does, of course, but that part gets done via task_work_add(), > in a reasonably sane locking environment. The part that boggles my mind about that whole approach is that just outside of rename_lock there already is a sane locking environment. Admittedly it will take a touch of work in d_splice_alias to get it sane, but that isn't particularly hard. write_seqlock(&rename_lock); if (!IS_ROOT(new)) __d_unalias(...); __d_unalias(...) { /* Parent's don't go away so !IS_ROOT(new) will stay valid */ write_sequnlock(&rename_lock); /* Hooray! The code can be normal and sleep if it needs to, * before calling into __d_move. */ ... d_move(); } Similarly for duplicate logic removal. Teaching d_splice_alias when it is performing an ordinarly rename to be able to just call d_move gets you far more from a maintenance perspective than cramming everything into __d_move. > [1] with credits for your patches preserved - normally I would assume that > this goes without saying, but your reply seems to imply that I'm playing some > kind of political BS games, so I'd rather spell that out. My issue wasn't political games, but rather holding code to an apparently BS standard. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Sat, Aug 15, 2015 at 7:12 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: >> >> I think you are underestimating the frequency of .. traversals. Any build >> process that creates relative symlinks will be hitting it all the time, >> for one thing. > > I suspect you're over-estimating how expensive it is to just walk down > to the mount-point. It's just a few pointer traversals. > > Realistically, we probably do more than that for a *regular* path > component lookup, when we follow the hash chains. Following a d_parent > chain for ".." isn't that different. > > Just looking at the last patch Eric sent, that one looks _trivial_. It > didn't need *any* preparation or new rules. Compared to the mess with > marking things MNT_DIR_ESCAPED etc, I know which approach I'd prefer. > > But hey, if you think you can simplify it... I just don't think that > even totally ignoring the d_splice_alias() things, and totally > ignoring any locking around __d_move(), the whole "mark things > MNT_DIR_ESCAPED" is a lot more complex. It occurs to me that there is a fairly simple way we can emperically test to see how expensive calling is_subdir for every .. on a bind mount is in practice. - Take my last patch - run a benchmark outside of a bind mount (perhaps a kernel compile). - run the same benchmark inside of a bind mount. See if the performance differs. I am going to try to find time to do this, but I am travelling for the next couple of days. If someone who has a bit more time wants to try it and beats me to that would be great. I think having some emperical numbers would be nice in this part of the conversation. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Aug 16, 2015 at 06:51:33AM -0500, Eric W. Biederman wrote: > It occurs to me that there is a fairly simple way we can emperically > test to see how expensive calling is_subdir for every .. on a bind mount > is in practice. > > - Take my last patch > - run a benchmark outside of a bind mount (perhaps a kernel compile). > - run the same benchmark inside of a bind mount. > > See if the performance differs. > > I am going to try to find time to do this, but I am travelling for the > next couple of days. > > If someone who has a bit more time wants to try it and beats me to that > would be great. > > I think having some emperical numbers would be nice in this part of the > conversation. I took a bit of time to do something simpler though less scientific : I just ran a kernel build under strace -c -f to get a rough idea of the number of syscalls. It takes 2m23 without strace and 3m00 under strace. 7.2M syscalls were seen out of which only 3.2M were related to FS access (mostly open(), fstat() and stat()), the rest is minor. That's 22k syscalls per second, or 45 us between two syscalls. From this point I'm having a hard time imagining that any extra tests in the code path would have a significant impact to cause a measurable difference compared to these 45us, especially since I'm counting all FS accesses (worst case) and not just those referencing "..". Just my two cents, Willy -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Aug 16, 2015 at 06:33:21AM -0500, Eric W. Biederman wrote: > > ... or either of us can do merging those checks into a single place, > > be it as a followup to your 7-patch series, or folded with the > > fs/dcache.c-affecting patches in there. If you have no time left, I can > > certainly do that followup myself - not a problem[1] > > I don't have time. Everytime I have worked with this it has take pretty > much full days of staring at the code, and I don't have any more full > days left before the merge window. OK, at that point I've pretty much given up on fs_pin for this cycle. And testing your variant with unconditional checks on .. appears to have fairly low overhead. I still want to deal with catching and unmounting the unreachable suckers, so fs/dcache.c side of things will get used when we get to that stuff, but for now I've taken your 1/7, 2/7 plus the variant of "vfs: Test for and handle paths that are unreachable from their mnt_root" that doesn't care whether anything escaped or not. 3--6 are held in a local branch for now; I *am* going to use them come next cycle. And there's another pile of fun around that area, also for the next cycle - kernel-initiated subtree removals on things like sysfs et.al.; handling of the locking in those is inconsistent and tied with the fun we have for d_move()/__d_unalias(). Sigh... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On August 21, 2015 12:51:05 AM PDT, Al Viro <viro@ZenIV.linux.org.uk> wrote: >On Sun, Aug 16, 2015 at 06:33:21AM -0500, Eric W. Biederman wrote: > >> > ... or either of us can do merging those checks into a single >place, >> > be it as a followup to your 7-patch series, or folded with the >> > fs/dcache.c-affecting patches in there. If you have no time left, >I can >> > certainly do that followup myself - not a problem[1] >> >> I don't have time. Everytime I have worked with this it has take >pretty >> much full days of staring at the code, and I don't have any more full >> days left before the merge window. > >OK, at that point I've pretty much given up on fs_pin for this cycle. >And testing your variant with unconditional checks on .. appears to >have >fairly low overhead. I still want to deal with catching and unmounting >the >unreachable suckers, so fs/dcache.c side of things will get used when >we get >to that stuff, but for now I've taken your 1/7, 2/7 plus the variant of >"vfs: Test for and handle paths that are unreachable from their >mnt_root" >that doesn't care whether anything escaped or not. > >3--6 are held in a local branch for now; I *am* going to use them >come next cycle. And there's another pile of fun around that area, >also >for the next cycle - kernel-initiated subtree removals on things like >sysfs et.al.; handling of the locking in those is inconsistent and tied >with the fun we have for d_move()/__d_unalias(). Sigh... I am sorry to hear about a mess in sysfs. I am glad to hear that we do not appear to need MNT_DIR_ESCAPED to avoid performance regressions. That should make back porters lives much easier. As for the future I have a suspicion that we want to look at the rcu readable dynamically sized hash tables the networking guys cooked up. Al thank you very much for performance testing the simple version. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/dcache.c b/fs/dcache.c index 53b7f1e63beb..c1eece74621f 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -2711,41 +2711,77 @@ struct dentry *d_ancestor(struct dentry *p1, struct dentry *p2) return NULL; } +static struct dentry *d_alias_is_ancestor(struct dentry *alias, + struct dentry *dentry) +{ + struct dentry *ancestor = d_ancestor(alias, dentry); + if (ancestor) { + pr_warn_ratelimited( + "VFS: Lookup of '%s' in %s %s would have caused loop\n", + dentry->d_name.name, + dentry->d_sb->s_type->name, + dentry->d_sb->s_id); + } + return ancestor; +} + /* * This helper attempts to cope with remotely renamed directories * * It assumes that the caller is already holding - * dentry->d_parent->d_inode->i_mutex, and rename_lock + * dentry->d_parent->d_inode->i_mutex * * Note: If ever the locking in lock_rename() changes, then please * remember to update this too... */ -static int __d_unalias(struct inode *inode, - struct dentry *dentry, struct dentry *alias) +static int __d_rename_alias(struct dentry *alias, struct dentry *dentry) { - struct mutex *m1 = NULL, *m2 = NULL; - int ret = -ESTALE; + struct mutex *rename_mutex = NULL, *parent_mutex = NULL; - /* If alias and dentry share a parent, then no extra locks required */ - if (alias->d_parent == dentry->d_parent) - goto out_unalias; + /* Holding dentry->d_parent->d_inode->i_mutex guarantees this + * that the equality or inequality alias->d_parent and + * dentry->d_parent remains stable. + */ + if (alias->d_parent != dentry->d_parent) { + /* See lock_rename() */ + rename_mutex = &dentry->d_sb->s_vfs_rename_mutex; + if (!mutex_trylock(rename_mutex)) + return -ESTALE; + + if (d_alias_is_ancestor(alias, dentry)) { + mutex_unlock(rename_mutex); + return -ELOOP; + } + parent_mutex = &alias->d_parent->d_inode->i_mutex; + mutex_lock_nested(parent_mutex, I_MUTEX_PARENT2); + } + d_move(alias, dentry); + if (parent_mutex) { + mutex_unlock(parent_mutex); + mutex_unlock(rename_mutex); + } + return 0; +} - /* See lock_rename() */ - if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex)) - goto out_err; - m1 = &dentry->d_sb->s_vfs_rename_mutex; - if (!mutex_trylock(&alias->d_parent->d_inode->i_mutex)) - goto out_err; - m2 = &alias->d_parent->d_inode->i_mutex; -out_unalias: +/* + * This helper connects disconnected dentries. + * + * It assumes that the caller is already holding + * dentry->d_parent->d_inode->i_mutex + * + */ +static int __d_connect_alias(struct inode *inode, + struct dentry *alias, struct dentry *dentry) +{ + write_seqlock(&rename_lock); + if (unlikely(d_alias_is_ancestor(alias, dentry))) { + write_sequnlock(&rename_lock); + return -ELOOP; + } __d_move(alias, dentry, false); - ret = 0; -out_err: - if (m2) - mutex_unlock(m2); - if (m1) - mutex_unlock(m1); - return ret; + write_sequnlock(&rename_lock); + security_d_instantiate(alias, inode); + return 0; } /** @@ -2786,30 +2822,17 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry) if (S_ISDIR(inode->i_mode)) { struct dentry *new = __d_find_any_alias(inode); if (unlikely(new)) { + int err; /* The reference to new ensures it remains an alias */ spin_unlock(&inode->i_lock); - write_seqlock(&rename_lock); - if (unlikely(d_ancestor(new, dentry))) { - write_sequnlock(&rename_lock); + + if (!IS_ROOT(new)) + err = __d_rename_alias(new, dentry); + else + err = __d_connect_alias(inode, new, dentry); + if (err) { dput(new); - new = ERR_PTR(-ELOOP); - pr_warn_ratelimited( - "VFS: Lookup of '%s' in %s %s" - " would have caused loop\n", - dentry->d_name.name, - inode->i_sb->s_type->name, - inode->i_sb->s_id); - } else if (!IS_ROOT(new)) { - int err = __d_unalias(inode, dentry, new); - write_sequnlock(&rename_lock); - if (err) { - dput(new); - new = ERR_PTR(err); - } - } else { - __d_move(new, dentry, false); - write_sequnlock(&rename_lock); - security_d_instantiate(new, inode); + new = ERR_PTR(err); } iput(inode); return new;
There are two scenarios that can result in an alias being found in d_splice_alias. The first scenario is a disconnected dentry being looked up and reconnected (common with nfs file handles, and open_by_handle_at). The second scenario is a lookup on a directory lazily discovering that it was renamed on the remote filesystem. The locking challenge for handling these two scenarios is that the instance of lookup calling d_splice_alias has already taken the dentry->d_parent->d_inode->i_mutex. If the mutex was not held the locking we could just do: rename_mutex = &inode->i_sb->s_vfs_rename_mutex; mutex_lock(rename_mutex); if (d_ancestor(alias, dentry)) { mutex_unlock(rename_mutex); pr_warn_ratelimited(...); return -ELOOP; } m1 = &dentry->d_parent->d_inode->i_mutex; mutex_lock_nested(m1, I_MUTEX_PARENT); if (!IS_ROOT(alias) && (alias->d_parent != dentry->d_parent)) { m2 = &alias->d_parent->d_inode->i_mutex; mutex_lock_nested(m2, I_MUTEX_PARENT2); } d_move(alias, dentry); if (m2) mutex_unlock(m2); mutex_unlock(m1); mutex_unlock(rename_mutex); Which is essentially lock_rename and unlock_reaname with added handling of the IS_ROOT case. The problem is that as the lookup locking stands today grabbing the s_vfs_rename_mutex must use mutex_trylock which can fail, so for reliability reasons we need to avoid using the rename_mutex as much as possible. For the case where a disconnected dentry is being connected (aka the IS_ROOT case) the d_ancestor test can be placed under the rename_lock removing any chance that taking locks will cause a failure. For the lazy rename case things are trickier because when dentry->d_parent and alias->d_parent are not equal the code need to take the s_vfs_rename_mutex, or risk the d_ancestor call in lock_rename being inaccurate. As s_vfs_rename_mutex is take first there are no locking ordering issues with taking alias->d_parent->d_inode->i_mutex. Furthermore as games with rename_lock are unnecessary d_move instead of __d_move can be called. Sleeping in d_splice_alias is not something new as security_d_instantiate is a sleeping function. Compared to the current implementation this change introduces a function for each case __d_rename_alias and __d_connect_alias and moves the acquisition of rename_lock down into those functions. In the case of __d_rename_alias which was __d_unalias this allows the existing lock ordering rules to be followed much more closely removing the need for a trylock when acquiring alias->d_parent->d_inode->i_mutex, and allowing the use of d_move. A common helper that prints the warning message when a loop is detected is factored out into d_alias_is_ancestor so that code does not need to be duplicated in both cases. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- fs/dcache.c | 111 ++++++++++++++++++++++++++++++++++++------------------------ 1 file changed, 67 insertions(+), 44 deletions(-)