Message ID | 166147984370.25420.13019217727422217511.stgit@noble.brown (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Improve scalability of directory operations | expand |
On Thu, Aug 25, 2022 at 7:16 PM NeilBrown <neilb@suse.de> wrote: > > If a filesystem supports parallel modification in directories, it sets > FS_PAR_DIR_UPDATE on the file_system_type. lookup_open() and the new > lookup_hash_update() notice the flag and take a shared lock on the > directory, and rely on a lock-bit in d_flags, much like parallel lookup > relies on DCACHE_PAR_LOOKUP. Ugh. I absolutely believe in the DCACHE_PAR_LOOKUP model, and in "parallel updates" being important, but I *despise* locking code like this + if (wq && IS_PAR_UPDATE(dir)) + inode_lock_shared_nested(dir, I_MUTEX_PARENT); + else + inode_lock_nested(dir, I_MUTEX_PARENT); and I really really hope there's some better model for this. That "wq" test in particular is just absolutely disgusting. So now it doesn't just depend on whether the directory supports parallel updates, now the caller can choose whether to do the parallel thing or not, and that's how "create" is different from "rename". And that last difference is, I think, the one that makes me go "No. HELL NO". Instead of it being up to the filesystem to say "I can do parallel creates, but I need to serialize renames", this whole thing has been set up to be about the caller making that decision. That's just feels horribly horribly wrong. Yes, I realize that to you that's just a "later patches will do renames", but what if it really is a filesystem issue where the filesystem can easily handle new names, but needs something else for renames because it has various cross-directory issues, perhaps? So I feel this is fundamentally wrong, and this ugliness is a small effect of that wrongness. I think we should strive for (a) make that 'wq' and DCACHE_PAR_LOOKUP bit be unconditional (b) aim for the inode lock being taken *after* the _lookup_hash(), since the VFS layer side has to be able to handle the concurrency on the dcache side anyway (c) at that point, the serialization really ends up being about the call into the filesystem, and aim to simply move the inode_lock{_shared]_nested() into the filesystem so that there's no need for a flag and related conditional locking at all. Because right now I think the main reason we cannot move the lock into the filesystem is literally that we've made the lock cover not just the filesystem part, but the "lookup and create dentry" part too. But once you have that "DCACHE_PAR_LOOKUP" bit and the d_alloc_parallel() logic to serialize a _particular_ dentry being created (as opposed to serializing all the sleeping ops to that directly), I really think we should strive to move the locking - that no longer helps the VFS dcache layer - closer to the filesystem call and eventually into it. Please? I think these patches are "mostly going in the right direction", but at the same time I feel like there's some serious mis-design going on. Linus
On Sat, 27 Aug 2022, Linus Torvalds wrote: > On Thu, Aug 25, 2022 at 7:16 PM NeilBrown <neilb@suse.de> wrote: > > > > If a filesystem supports parallel modification in directories, it sets > > FS_PAR_DIR_UPDATE on the file_system_type. lookup_open() and the new > > lookup_hash_update() notice the flag and take a shared lock on the > > directory, and rely on a lock-bit in d_flags, much like parallel lookup > > relies on DCACHE_PAR_LOOKUP. > > Ugh. Thanks :-) - no, really - thanks for the high-level review! > > I absolutely believe in the DCACHE_PAR_LOOKUP model, and in "parallel > updates" being important, but I *despise* locking code like this > > + if (wq && IS_PAR_UPDATE(dir)) > + inode_lock_shared_nested(dir, I_MUTEX_PARENT); > + else > + inode_lock_nested(dir, I_MUTEX_PARENT); > > and I really really hope there's some better model for this. > > That "wq" test in particular is just absolutely disgusting. So now it > doesn't just depend on whether the directory supports parallel > updates, now the caller can choose whether to do the parallel thing or > not, and that's how "create" is different from "rename". As you note, by the end of the series "create" is not more different from "rename" than it already is. I only broke up the patches to make review more manageable. The "wq" can be removed. There are two options. One is to change every kern_path_create() or user_path_create() caller to passed in a wq. Then we can assume that a wq is always available. There are about a dozen of these calls, so not an enormous change, but one that I didn't want to think about just yet. I could add a patch at the front of the series which did this. Alternate option is to never pass in a wq for create operation, and use var_waitqueue() (or something similar) to provide a global shared wait queue (which is essentially what I am using to wait for DCACHE_PAR_UPDATE to clear). The more I think about it, the more I think this is the best way forward. Maybe we'll want to increase WAIT_TABLE_BITS ... I wonder how to measure how much contention there is on these shared queues. > > And that last difference is, I think, the one that makes me go "No. HELL NO". > > Instead of it being up to the filesystem to say "I can do parallel > creates, but I need to serialize renames", this whole thing has been > set up to be about the caller making that decision. I think that is a misunderstanding. The caller isn't making a decision - except the IS_PAR_UPDATE() test which is simply acting on the fs request. What you are seeing is a misguided attempt to leave in place some existing interfaces which assumed exclusive locking and didn't provide wqs. > > That's just feels horribly horribly wrong. > > Yes, I realize that to you that's just a "later patches will do > renames", but what if it really is a filesystem issue where the > filesystem can easily handle new names, but needs something else for > renames because it has various cross-directory issues, perhaps? Obviously a filesystem can add its own locking - and they will have to, though at a finer grain that the VFS can do. > > So I feel this is fundamentally wrong, and this ugliness is a small > effect of that wrongness. > > I think we should strive for > > (a) make that 'wq' and DCACHE_PAR_LOOKUP bit be unconditional Agreed (in an earlier version DCACHE_PAR_LOOKUP was optional, but I realised that you wouldn't like that :-) > > (b) aim for the inode lock being taken *after* the _lookup_hash(), > since the VFS layer side has to be able to handle the concurrency on > the dcache side anyway I think you are suggesting that we change ->lookup call to NOT require i_rwsem be held. That is not a small change. I agree that it makes sense in the long term. Getting there .... won't be a quick as I'd hoped. > > (c) at that point, the serialization really ends up being about the > call into the filesystem, and aim to simply move the > inode_lock{_shared]_nested() into the filesystem so that there's no > need for a flag and related conditional locking at all. It might be nice to take a shared lock in VFS, and let the FS upgrade it to exclusive if needed, but we don't have upgrade_read() ... maybe it would be deadlock-prone. > > Because right now I think the main reason we cannot move the lock into > the filesystem is literally that we've made the lock cover not just > the filesystem part, but the "lookup and create dentry" part too. > > But once you have that "DCACHE_PAR_LOOKUP" bit and the > d_alloc_parallel() logic to serialize a _particular_ dentry being > created (as opposed to serializing all the sleeping ops to that > directly), I really think we should strive to move the locking - that > no longer helps the VFS dcache layer - closer to the filesystem call > and eventually into it. > > Please? I think these patches are "mostly going in the right > direction", but at the same time I feel like there's some serious > mis-design going on. Hmmm.... I'll dig more deeply into ->lookup and see if I can understand the locking well enough to feel safe removing i_rwsem from it. Thanks, NeilBrown
On Fri, Aug 26, 2022 at 4:07 PM NeilBrown <neilb@suse.de> wrote: > > As you note, by the end of the series "create" is not more different > from "rename" than it already is. I only broke up the patches to make > review more manageable. Yes, I understand. But I'm saying that maybe a filesystem actually might want to treat them differently. That said, the really nasty part was that 'wq' thing that meant that different paths had different directory locking not because of low-level filesystem issues, but because of caller issues. So that's the one I _really_ disliked, and that I don't think should exist even as a partial first step. The "tie every operation together with one flag" I can live with, in case it turns out that yes, that one flag is all anybody ever really wants. > Alternate option is to never pass in a wq for create operation, and use > var_waitqueue() (or something similar) to provide a global shared wait > queue (which is essentially what I am using to wait for > DCACHE_PAR_UPDATE to clear). I _think_ this is what I would prefer. I say that I _think_ I prefer that, because maybe there are issues with it. But since you basically do that DCACHE_PAR_UPDATE thing anyway, and it's one of the main users of this var_waitqueue, it feels right to me. But then if it just end sup not working well for some practical reason, at that point maybe I'd just say "I was wrong, I thought it would work, but it's better to spread it out to be a per-thread wait-queue on the stack". IOW, my preference would be to simply just try it, knowing that you *can* do the "pass explicit wait-queue down" thing if we need to. Hmm? > > Instead of it being up to the filesystem to say "I can do parallel > > creates, but I need to serialize renames", this whole thing has been > > set up to be about the caller making that decision. > > I think that is a misunderstanding. The caller isn't making a decision > - except the IS_PAR_UPDATE() test which is simply acting on the fs > request. What you are seeing is a misguided attempt to leave in place > some existing interfaces which assumed exclusive locking and didn't > provide wqs. Ok. I still would prefer to have unified locking, not that "do this for one filesystem, do that for another" conditional one. > > (b) aim for the inode lock being taken *after* the _lookup_hash(), > > since the VFS layer side has to be able to handle the concurrency on > > the dcache side anyway > > I think you are suggesting that we change ->lookup call to NOT > require i_rwsem be held. Yes and no. One issue for me is that with your change as-is, then 99% of all people who don't happen to use NFS, the inode lock gives all that VFS code mutual exclusion. Take that lookup_hash_update() function as a practical case: all the *common* filesystems will be running with that function basically 100% serialized per directory, because they'll be doing that inode_lock_nested(dir); ... inode_unlock(dir); around it all. At the same time, all that code is supposed to work even *without* the lock, because once it's a IS_PAR_UPDATE() filesystem, there's effectively no locking at all. What exclusive directory locks even remain at that point? IOW, to me it feels like you are trying to make the code go towards a future with basically no locking at all as far as the VFS layer is concerned (because once all the directory modifications take the inode lock as shared, *all* the inode locking is shared, and is basically a no-op). BUT you are doing so while not having most people even *test* that situation. See what I'm trying to say (but possibly expressing very badly)? So I feel like if the VFS code cannot rely on locking *anyway* in the general case, and should work without it, then we really shouldn't have any locking around any of the VFS operations. The logical conclusion of that would be to push it all down into the filesystem (possibly with the help of a coccinelle script). Now it doesn't have to go that far - at least not initially - but I do think we should at least make sure that as much as possible of the actual VFS code sees that "new world order" of no directory locking, so that that situation gets *tested* as widely as possible. > That is not a small change. Now, that I agree with. I guss we won't get there soon (or ever). But see above what I dislike about the directory locking model change. > It might be nice to take a shared lock in VFS, and let the FS upgrade it > to exclusive if needed, but we don't have upgrade_read() ... maybe it > would be deadlock-prone. Yes, upgrading a read lock is fundamentally impossible and will deadlock trivially (think just two readers that both want to do the upgrade - they'll block each other from doing so). So it's not actually a possible operation. Linus
On Fri, Aug 26, 2022 at 12:06:55PM -0700, Linus Torvalds wrote: > Because right now I think the main reason we cannot move the lock into > the filesystem is literally that we've made the lock cover not just > the filesystem part, but the "lookup and create dentry" part too. How about rename loop prevention? mount vs. rmdir? d_splice_alias() fun on tree reconnects? > But once you have that "DCACHE_PAR_LOOKUP" bit and the > d_alloc_parallel() logic to serialize a _particular_ dentry being > created (as opposed to serializing all the sleeping ops to that > directly), I really think we should strive to move the locking - that > no longer helps the VFS dcache layer - closer to the filesystem call > and eventually into it. See above.
On Fri, Aug 26, 2022 at 05:13:38PM -0700, Linus Torvalds wrote: > So I feel like if the VFS code cannot rely on locking *anyway* in the > general case, and should work without it, then we really shouldn't > have any locking around any of the VFS operations. That's a really big if. There's a bunch of places where we rely upon ->i_rwsem on directories and _not_ to provide exclusion for fs methods. I'm still halfway through the Neil's patchset, but verifying correctness won't be easy and I'm not optimistic about getting rid of those uses...
On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote: > +/** > + * d_lock_update_nested - lock a dentry before updating > + * @dentry: the dentry to be locked > + * @base: the parent, or %NULL > + * @name: the name in that parent, or %NULL > + * @subclass: lockdep locking class. > + * > + * Lock a dentry in a directory on which a shared-lock may be held, and > + * on which parallel updates are permitted. > + * If the base and name are given, then on success the dentry will still > + * have that base and name - it will not have raced with rename. > + * On success, a positive dentry will still be hashed, ensuring there > + * was no race with unlink. > + * If they are not given, then on success the dentry will be negative, > + * which again ensures no race with rename, or unlink. I'm not sure it's a good idea to have that in one function, TBH. Looking at the callers, there are * lookup_hash_update() lookup_hash_update_len() nfsd shite filename_create_one() filename_create_one_len() nfsd shite filename_create() kern_path_create() user_path_create() do_mknodat() do_mkdirat() do_symlinkat() do_linkat() do_rmdir() do_unlinkat() * fuckloads of callers in lock_rename_lookup() * d_lock_update() atomic_open() lookup_open() Only the last two can get NULL base or name. Already interesting, isn't it? What's more, splitup between O_CREATE open() on one side and everything else that might create, remove or rename on the other looks really weird. > + rcu_read_lock(); /* for d_same_name() */ > + if (d_unhashed(dentry) && d_is_positive(dentry)) { > + /* name was unlinked while we waited */ > + ret = false; BTW, what happens if somebody has ->lookup() returning a positive unhashed? Livelock on attempt to hit it with any directory-modifying syscall? Right now such behaviour is permitted; I don't know if anything in the tree actually does it, but at the very least it would need to be documented. Note that *negative* unhashed is not just permitted, it's actively used e.g. by autofs. That really needs to be well commented... > + } else if (base && > + (dentry->d_parent != base || > + dentry->d_name.hash != name->hash || > + !d_same_name(dentry, base, name))) { > + /* dentry was renamed - possibly silly-rename */ > + ret = false; > + } else if (!base && d_is_positive(dentry)) { > + ret = false; > + } else { > + dentry->d_flags |= DCACHE_PAR_UPDATE; > + } > + * Parent directory has inode locked exclusive, or possibly shared if wq > + * is given. In the later case the fs has explicitly allowed concurrent > + * updates in this directory. This is the one and only case > + * when ->lookup() may be called on a non in-lookup dentry. What Linus already said about wq... To add a reason he hadn't mentioned, the length of call chain one needs to track to figure out whether it's NULL or not is... excessive. And I don't mean just "greater than 0". We have places like that, and sometimes we have to, but it's never a good thing... > static struct dentry *__lookup_hash(const struct qstr *name, > - struct dentry *base, unsigned int flags) > + struct dentry *base, unsigned int flags, > + wait_queue_head_t *wq) > - dentry = d_alloc(base, name); > + if (wq) > + dentry = d_alloc_parallel(base, name, wq); > + else > + dentry = d_alloc(base, name); > if (unlikely(!dentry)) > return ERR_PTR(-ENOMEM); > + if (IS_ERR(dentry)) > + return dentry; BTW, considering the fact that we have 12 callers of d_alloc() (including this one) and 28 callers of its wrapper (d_alloc_name()), I would seriously consider converting d_alloc() from "NULL or new dentry" to "ERR_PTR(-ENOMEM) or new dentry". Especially since quite a few of those callers will be happier that way. Grep and you'll see... As a side benefit, if (unlikely(!dentry)) turns into if (IS_ERR(dentry)). > +static struct dentry *lookup_hash_update( > + const struct qstr *name, > + struct dentry *base, unsigned int flags, > + wait_queue_head_t *wq) > +{ > + struct dentry *dentry; > + struct inode *dir = base->d_inode; > + int err; > + > + if (wq && IS_PAR_UPDATE(dir)) > + inode_lock_shared_nested(dir, I_MUTEX_PARENT); > + else > + inode_lock_nested(dir, I_MUTEX_PARENT); > + > +retry: > + dentry = __lookup_hash(name, base, flags, wq); > + if (IS_ERR(dentry)) { > + err = PTR_ERR(dentry); > + goto out_err; > + } > + if (!d_lock_update_nested(dentry, base, name, I_MUTEX_PARENT)) { > + /* > + * Failed to get lock due to race with unlink or rename > + * - try again > + */ > + d_lookup_done(dentry); When would we get out of __lookup_hash() with in-lookup dentry? Confused... > +struct dentry *lookup_hash_update_len(const char *name, int nlen, > + struct path *path, unsigned int flags, const struct path *, damnit... > + wait_queue_head_t *wq) > +{ > + struct qstr this; > + int err = lookup_one_common(mnt_user_ns(path->mnt), name, > + path->dentry, nlen, &this); > + if (err) > + return ERR_PTR(err); > + return lookup_hash_update(&this, path->dentry, flags, wq); > +} > +EXPORT_SYMBOL(lookup_hash_update_len); Frankly, the calling conventions of the "..._one_len" family is something I've kept regretting for a long time. Oh, well... > +static void done_path_update(struct path *path, struct dentry *dentry, > + bool with_wq) > +{ > + struct inode *dir = path->dentry->d_inode; > + > + d_lookup_done(dentry); > + d_unlock_update(dentry); > + if (IS_PAR_UPDATE(dir) && with_wq) > + inode_unlock_shared(dir); > + else > + inode_unlock(dir); > +} const struct path *, again... > @@ -3400,6 +3499,12 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, > dentry = res; > } > } > + /* > + * If dentry is negative and this is a create we need to get > + * DCACHE_PAR_UPDATE. > + */ > + if ((open_flag & O_CREAT) && !dentry->d_inode) > + have_par_update = d_lock_update(dentry, NULL, NULL); > > /* Negative dentry, just create the file */ > if (!dentry->d_inode && (open_flag & O_CREAT)) { Fold the above here, please. What's more, losing the flag would've made it much easier to follow... > @@ -3419,9 +3524,13 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, > error = create_error; > goto out_dput; > } > + if (have_par_update) > + d_unlock_update(dentry); > return dentry; > > out_dput: > + if (have_par_update) > + d_unlock_update(dentry); > @@ -3821,27 +3962,28 @@ struct dentry *kern_path_create(int dfd, const char *pathname, > struct path *path, unsigned int lookup_flags) BTW, there's 9 callers of that sucker in the entire tree, along with 3 callers of user_path_create() and 14 callers of done_path_create(). Not a big deal to add the wq in those, especially since it can be easily split into preparatory patch (with wq passed, but being unused). > -void done_path_create(struct path *path, struct dentry *dentry) > +void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq) Why "with_wq" and not the wq itself? > - * The caller must hold dir->i_mutex. > + * The caller must either hold a write-lock on dir->i_rwsem, or > + * a have atomically set DCACHE_PAR_UPDATE, or both. ??? > + * If the filesystem permits (IS_PAR_UPDATE()), we take a shared lock on the > + * directory and set DCACHE_PAR_UPDATE to get exclusive access to the dentry. The latter happens unconditionally here, unless I'm misreading the code (as well as your cover letter). It does *NOT* happen on rename(), though, contrary to the same. And while your later patch adds it in lock_rename_lookup(), existing lock_rename() callers do not get that at all. Likely to be a problem... > --- a/include/linux/dcache.h > +++ b/include/linux/dcache.h > @@ -13,7 +13,9 @@ > #include <linux/rcupdate.h> > #include <linux/lockref.h> > #include <linux/stringhash.h> > +#include <linux/sched.h> Bloody hell, man... > +static inline void d_unlock_update(struct dentry *dentry) > +{ > + if (IS_ERR_OR_NULL(dentry)) > + return; Do explain... When could we ever get NULL or ERR_PTR() passed to that? BTW, I would seriously look into splitting the "let's add a helper that combines locking parent with __lookup_hash()" into a preliminary patch. Would be easier to follow.
On Fri, Aug 26, 2022 at 05:13:38PM -0700, Linus Torvalds wrote: > On Fri, Aug 26, 2022 at 4:07 PM NeilBrown <neilb@suse.de> wrote: > > > > As you note, by the end of the series "create" is not more different > > from "rename" than it already is. I only broke up the patches to make > > review more manageable. > > Yes, I understand. But I'm saying that maybe a filesystem actually > might want to treat them differently. > > That said, the really nasty part was that 'wq' thing that meant that > different paths had different directory locking not because of > low-level filesystem issues, but because of caller issues. > > So that's the one I _really_ disliked, and that I don't think should > exist even as a partial first step. > > The "tie every operation together with one flag" I can live with, in > case it turns out that yes, that one flag is all anybody ever really > wants. FWIW, what's really missing is the set of rules describing what the methods can expect from their arguments. Things like "oh, we can safely use ->d_parent here - we know that foo_rmdir(dir, child) is called only with dir held exclusive and child that had been observed to be a child of dentry alias of dir after dir had been locked, while all places that might change child->d_parent will be doing that only with child->d_parent->d_inode held at least shared" rely upon the current locking scheme. Change that 'held exclusive' to 'held shared' and we need something different, presumably 'this new bitlock on the child is held by the caller'. That's nice, but... What's to guarantee that we won't be hit by __d_unalias()? It won't care about the bitlock on existing alias, would it? And it only holds the old parent shared, so... My comments had been along the lines of "doing that would make the series easier to reason about"; I don't hate the approach, but * in the current form it's hard to read; there might be problems I hadn't even noticed yet * it's much easier to verify that stated assertions are guaranteed by the callers and sufficient for safety of callees if they *ARE* stated. Spelling them out is on the patch series authors, and IME doing that helps a lot when writing a series like that. At least on the level of internal notes... Especially since NFS is... special (or, as they say in New York, "sophisticated" - sorry). There's a plenty of things that are true for it, but do not hold for filesystems in general. And without an explicitly spelled out warranties it's very easy to end up with a mess that would be hell to apply to other filesystems. I really don't want to see an explosion of cargo-culted logics that might or might not remain valid for NFS by the time it gets copied around...
On Sat, 27 Aug 2022, Al Viro wrote: > On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote: > > > +/** > > + * d_lock_update_nested - lock a dentry before updating > > + * @dentry: the dentry to be locked > > + * @base: the parent, or %NULL > > + * @name: the name in that parent, or %NULL > > + * @subclass: lockdep locking class. > > + * > > + * Lock a dentry in a directory on which a shared-lock may be held, and > > + * on which parallel updates are permitted. > > + * If the base and name are given, then on success the dentry will still > > + * have that base and name - it will not have raced with rename. > > + * On success, a positive dentry will still be hashed, ensuring there > > + * was no race with unlink. > > + * If they are not given, then on success the dentry will be negative, > > + * which again ensures no race with rename, or unlink. > > I'm not sure it's a good idea to have that in one function, TBH. > Looking at the callers, there are > * lookup_hash_update() > lookup_hash_update_len() > nfsd shite > filename_create_one() > filename_create_one_len() > nfsd shite > filename_create() > kern_path_create() > user_path_create() > do_mknodat() > do_mkdirat() > do_symlinkat() > do_linkat() > do_rmdir() > do_unlinkat() > * fuckloads of callers in lock_rename_lookup() > * d_lock_update() > atomic_open() > lookup_open() > > Only the last two can get NULL base or name. Already interesting, > isn't it? What's more, splitup between O_CREATE open() on one > side and everything else that might create, remove or rename on > the other looks really weird. Well, O_CREATE is a bit weird. But I can see that it would be cleaner to pass the dir/name into those two calls that currently get NULL - then remove the handling of NULL. Thanks. > > > + rcu_read_lock(); /* for d_same_name() */ > > + if (d_unhashed(dentry) && d_is_positive(dentry)) { > > + /* name was unlinked while we waited */ > > + ret = false; > > BTW, what happens if somebody has ->lookup() returning a positive > unhashed? Livelock on attempt to hit it with any directory-modifying > syscall? Right now such behaviour is permitted; I don't know if > anything in the tree actually does it, but at the very least it > would need to be documented. > > Note that *negative* unhashed is not just permitted, it's > actively used e.g. by autofs. That really needs to be well > commented... I hadn't thought that ->lookup() could return anything unhashed. I'll look into that - thanks. > > > + } else if (base && > > + (dentry->d_parent != base || > > + dentry->d_name.hash != name->hash || > > + !d_same_name(dentry, base, name))) { > > + /* dentry was renamed - possibly silly-rename */ > > + ret = false; > > + } else if (!base && d_is_positive(dentry)) { > > + ret = false; > > + } else { > > + dentry->d_flags |= DCACHE_PAR_UPDATE; > > + } > > > + * Parent directory has inode locked exclusive, or possibly shared if wq > > + * is given. In the later case the fs has explicitly allowed concurrent > > + * updates in this directory. This is the one and only case > > + * when ->lookup() may be called on a non in-lookup dentry. > > What Linus already said about wq... To add a reason he hadn't mentioned, > the length of call chain one needs to track to figure out whether it's > NULL or not is... excessive. And I don't mean just "greater than 0". > We have places like that, and sometimes we have to, but it's never a good > thing... > > > static struct dentry *__lookup_hash(const struct qstr *name, > > - struct dentry *base, unsigned int flags) > > + struct dentry *base, unsigned int flags, > > + wait_queue_head_t *wq) > > > - dentry = d_alloc(base, name); > > + if (wq) > > + dentry = d_alloc_parallel(base, name, wq); > > + else > > + dentry = d_alloc(base, name); > > if (unlikely(!dentry)) > > return ERR_PTR(-ENOMEM); > > + if (IS_ERR(dentry)) > > + return dentry; > > BTW, considering the fact that we have 12 callers of d_alloc() > (including this one) and 28 callers of its wrapper (d_alloc_name()), I > would seriously consider converting d_alloc() from "NULL or new dentry" > to "ERR_PTR(-ENOMEM) or new dentry". Especially since quite a few of > those callers will be happier that way. Grep and you'll see... As a > side benefit, if (unlikely(!dentry)) turns into if (IS_ERR(dentry)). > > > +static struct dentry *lookup_hash_update( > > + const struct qstr *name, > > + struct dentry *base, unsigned int flags, > > + wait_queue_head_t *wq) > > +{ > > + struct dentry *dentry; > > + struct inode *dir = base->d_inode; > > + int err; > > + > > + if (wq && IS_PAR_UPDATE(dir)) > > + inode_lock_shared_nested(dir, I_MUTEX_PARENT); > > + else > > + inode_lock_nested(dir, I_MUTEX_PARENT); > > + > > +retry: > > + dentry = __lookup_hash(name, base, flags, wq); > > + if (IS_ERR(dentry)) { > > + err = PTR_ERR(dentry); > > + goto out_err; > > + } > > + if (!d_lock_update_nested(dentry, base, name, I_MUTEX_PARENT)) { > > + /* > > + * Failed to get lock due to race with unlink or rename > > + * - try again > > + */ > > + d_lookup_done(dentry); > > When would we get out of __lookup_hash() with in-lookup dentry? > Confused... Whenever wq is passed in and ->lookup() decides, based on the flags, to do nothing. NFS does this for LOOKUP_CREATE|LOOKUP_EXCL and for LOOKUP_RENAME_TARGET > > > +struct dentry *lookup_hash_update_len(const char *name, int nlen, > > + struct path *path, unsigned int flags, > > const struct path *, damnit... Sorry.... > > > + wait_queue_head_t *wq) > > +{ > > + struct qstr this; > > + int err = lookup_one_common(mnt_user_ns(path->mnt), name, > > + path->dentry, nlen, &this); > > + if (err) > > + return ERR_PTR(err); > > + return lookup_hash_update(&this, path->dentry, flags, wq); > > +} > > +EXPORT_SYMBOL(lookup_hash_update_len); > > Frankly, the calling conventions of the "..._one_len" family is something > I've kept regretting for a long time. Oh, well... > > > +static void done_path_update(struct path *path, struct dentry *dentry, > > + bool with_wq) > > +{ > > + struct inode *dir = path->dentry->d_inode; > > + > > + d_lookup_done(dentry); > > + d_unlock_update(dentry); > > + if (IS_PAR_UPDATE(dir) && with_wq) > > + inode_unlock_shared(dir); > > + else > > + inode_unlock(dir); > > +} > > const struct path *, again... > > > @@ -3400,6 +3499,12 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, > > dentry = res; > > } > > } > > + /* > > + * If dentry is negative and this is a create we need to get > > + * DCACHE_PAR_UPDATE. > > + */ > > + if ((open_flag & O_CREAT) && !dentry->d_inode) > > + have_par_update = d_lock_update(dentry, NULL, NULL); > > > > /* Negative dentry, just create the file */ > > if (!dentry->d_inode && (open_flag & O_CREAT)) { > > Fold the above here, please. What's more, losing the flag would've > made it much easier to follow... Yes, that can certainly be tided up - thanks. > > > @@ -3419,9 +3524,13 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, > > error = create_error; > > goto out_dput; > > } > > + if (have_par_update) > > + d_unlock_update(dentry); > > return dentry; > > > > out_dput: > > + if (have_par_update) > > + d_unlock_update(dentry); > > > > @@ -3821,27 +3962,28 @@ struct dentry *kern_path_create(int dfd, const char *pathname, > > struct path *path, unsigned int lookup_flags) > > BTW, there's 9 callers of that sucker in the entire tree, along with > 3 callers of user_path_create() and 14 callers of done_path_create(). > Not a big deal to add the wq in those, especially since it can be easily > split into preparatory patch (with wq passed, but being unused). > > > -void done_path_create(struct path *path, struct dentry *dentry) > > +void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq) > > Why "with_wq" and not the wq itself? > I did that at first. However when I did silly rename in NFS I found that I wanted to call the 'done' thing when I didn't still have the wq. ...which might mean I have a bug. And the done_path_create_wq() doesn't do anything with the wq so ... I'll look at that again - thanks. > > - * The caller must hold dir->i_mutex. > > + * The caller must either hold a write-lock on dir->i_rwsem, or > > + * a have atomically set DCACHE_PAR_UPDATE, or both. > > ??? That's a hangover from an earlier version where I didn't set DCACHE_PAR_UPDATE when we had an exclusive lock. Will fix. > > > + * If the filesystem permits (IS_PAR_UPDATE()), we take a shared lock on the > > + * directory and set DCACHE_PAR_UPDATE to get exclusive access to the dentry. > > The latter happens unconditionally here, unless I'm misreading the code (as well > as your cover letter). It does *NOT* happen on rename(), though, contrary to > the same. And while your later patch adds it in lock_rename_lookup(), existing > lock_rename() callers do not get that at all. Likely to be a problem... Between the patch were DCACHE_PAR_UPDATE is added and the patch were lock_rename_lookup() is added, all filesystems use exclusive locks and none of them check DCACHE_PAR_UPDATE. So how can there be a problem? > > > --- a/include/linux/dcache.h > > +++ b/include/linux/dcache.h > > @@ -13,7 +13,9 @@ > > #include <linux/rcupdate.h> > > #include <linux/lockref.h> > > #include <linux/stringhash.h> > > +#include <linux/sched.h> > > Bloody hell, man... > I wonder what that was for .... removed. > > +static inline void d_unlock_update(struct dentry *dentry) > > +{ > > + if (IS_ERR_OR_NULL(dentry)) > > + return; > > Do explain... When could we ever get NULL or ERR_PTR() passed to that? Another hangover from earlier iterations - removed. Thanks. > > > BTW, I would seriously look into splitting the "let's add a helper > that combines locking parent with __lookup_hash()" into a preliminary > patch. Would be easier to follow. > Will look into that. Thanks a lot for the thorough review! NeilBrown
On Sat, 27 Aug 2022, Al Viro wrote: > On Fri, Aug 26, 2022 at 12:06:55PM -0700, Linus Torvalds wrote: > > > Because right now I think the main reason we cannot move the lock into > > the filesystem is literally that we've made the lock cover not just > > the filesystem part, but the "lookup and create dentry" part too. > > How about rename loop prevention? mount vs. rmdir? d_splice_alias() > fun on tree reconnects? Thanks for this list. I think the "mount vs. rmdir" usage of inode_lock() is independent of the usage for directory operations, so we can change the latter as much as we like without materially affecting the former. The lock we take on the directory being removed DOES ensure no new objects are linked into the directory, so for that reason we still need at least a shared lock when adding links to a directory. Moving that lock into the filesystem would invert the locking order in rmdir between the child being removed and the parent being locked. That would require some consideration. d_splice_alias() happens at ->lookup time so it is already under a shared lock. I don't see that it depends on i_rwsem - it uses i_lock for the important locking. Rename loop prevention is largely managed by s_vfs_rename_mutex. Once that is taken, nothing can be moved to a different directory. That means 'trap' will keep any relationship it had to new_path and old_path. It could be renamed within it's parent, but as long as it isn't removed the comparisons with old_dentry and new_dentry should still be reliable. As 'trap' clearly isn't empty, we trust that the filesystem won't allow an rmdir to succeed. What have I missed? Thanks, NeilBrown > > > But once you have that "DCACHE_PAR_LOOKUP" bit and the > > d_alloc_parallel() logic to serialize a _particular_ dentry being > > created (as opposed to serializing all the sleeping ops to that > > directly), I really think we should strive to move the locking - that > > no longer helps the VFS dcache layer - closer to the filesystem call > > and eventually into it. > > See above. >
On Thu, Sep 01, 2022 at 10:31:10AM +1000, NeilBrown wrote: > Thanks for this list. Keep in mind that this list is just off the top of my head - it's nowhere near complete. > d_splice_alias() happens at ->lookup time so it is already under a > shared lock. I don't see that it depends on i_rwsem - it uses i_lock > for the important locking. Nope. Process 1: rmdir foo/bar foo found and locked exclusive [*] dentry of bar found ->rmdir() instance called Process 2: stat foo/splat foo found and locked shared [*] dentry of splat does not exist anywhere dentry allocated, marked in-lookup ->lookup() instance called inode found and passed to d_splice_alias() ... ... which finds that it's a directory inode ... ... and foo/bar refers to it. E.g. it's on NFS and another client has just done mv bar splat __d_unalias() is called, to try and move existing alias (foo/bar) into the right place. It sees that no change of parent is involved, so it can just proceed to __d_move(). Process 1: forms an rmdir request to server, using ->d_name (and possibly ->d_parent) of dentry of foo/bar. It knows that ->d_name is stable, since the caller holds foo locked exclusive and all callers of __d_move() hold the old parent at least shared. In mainline process 2 will block (or, in case if it deals with different parent, try to grab the old parent of the existing alias shared and fail and with -ESTALE). With your changes process 1 will be holding foo/ locked shared, so process 2 will succeed and proceed to __d_move(), right under the nose of process 1 accessing ->d_name. If the names involved had been longer than 32 characters, it would risk accessing kfreed memory. Or fetching the length from old name and pointer from new one, walking past the end of kmalloc'ed object, etc. Sure, assuming that we are talking about NFS, server would have probably failed the RMDIR request - if you managed to form that request without oopsing, that is.
On Mon, Aug 29, 2022 at 11:59:02AM +1000, NeilBrown wrote: > > When would we get out of __lookup_hash() with in-lookup dentry? > > Confused... > > Whenever wq is passed in and ->lookup() decides, based on the flags, to do > nothing. > NFS does this for LOOKUP_CREATE|LOOKUP_EXCL and for LOOKUP_RENAME_TARGET Frankly, I would rather do what all other callers of ->lookup() do and just follow it with d_lookup_done(dentry), no matter what it returns. It's cheap enough...
On Sat, 03 Sep 2022, Al Viro wrote: > On Mon, Aug 29, 2022 at 11:59:02AM +1000, NeilBrown wrote: > > > > When would we get out of __lookup_hash() with in-lookup dentry? > > > Confused... > > > > Whenever wq is passed in and ->lookup() decides, based on the flags, to do > > nothing. > > NFS does this for LOOKUP_CREATE|LOOKUP_EXCL and for LOOKUP_RENAME_TARGET > > Frankly, I would rather do what all other callers of ->lookup() do and > just follow it with d_lookup_done(dentry), no matter what it returns. > It's cheap enough... > I don't think that is a good idea. Once you call d_lookup_done() (without having first called d_add() or similar) the dentry becomes invisible to normal path lookup, so another might be created. But the dentry will still be used for the 'create' or 'rename' and may then be added to the dcache - at which point you could have two dentries with the same name. When ->lookup() returns success without d_add()ing the dentry, that means that something else will complete the d_add() if/when necessary. For NFS, it specifically means that the lookup is effectively being combined with the following CREATE or RENAME. In this case there is no d_lookup_done() until the full operation is complete. For autofs (thanks for pointing me to that) the operation is completed when d_automount() signals the daemon to create the directory or symlink. In that case there IS a d_lookup_done() call and autofs needs some extra magic (the internal 'active' list) to make sure subsequent ->lookup requests can see that dentry which is still in the process of being set up. It might be nice if the dentry passed to autofs_lookup() could remain "d_inlookup()" until after d_automount has completed. Then autofs wouldn't need that active list. However I haven't yet looked at how disruptive such a change might be. Thanks, NeilBrown
On Sat, Sep 03, 2022 at 11:40:44AM +1000, NeilBrown wrote: > I don't think that is a good idea. Once you call d_lookup_done() > (without having first called d_add() or similar) the dentry becomes > invisible to normal path lookup, so another might be created. But the > dentry will still be used for the 'create' or 'rename' and may then be > added to the dcache - at which point you could have two dentries with the > same name. > > When ->lookup() returns success without d_add()ing the dentry, that > means that something else will complete the d_add() if/when necessary. > For NFS, it specifically means that the lookup is effectively being > combined with the following CREATE or RENAME. In this case there is no > d_lookup_done() until the full operation is complete. > > For autofs (thanks for pointing me to that) the operation is completed > when d_automount() signals the daemon to create the directory or > symlink. In that case there IS a d_lookup_done() call and autofs needs > some extra magic (the internal 'active' list) to make sure subsequent > ->lookup requests can see that dentry which is still in the process of > being set up. > > It might be nice if the dentry passed to autofs_lookup() could remain > "d_inlookup()" until after d_automount has completed. Then autofs > wouldn't need that active list. However I haven't yet looked at how > disruptive such a change might be. Very much so. You are starting to invent new rules for ->lookup() that just never had been there, basing on nothing better than a couple of examples. They are nowhere near everything there is. And you can't rely upon d_add() done by a method, for very obvious reasons. They are out of your control, they might very well decide that object creation has failed and drop the damn thing. Which is not allowed for in-lookup dentries without d_lookup_done(). Neil, *IF* you are introducing new rules like that, the absolutely minimal requirement is having them in Documentation/filesystems/porting.rst. And that includes "such-and-such method might be called with parent locked only shared; in that case it's guaranteed such-and-such things about its arguments (bitlocks held, etc.)". One thing we really need to avoid is that thing coming undocumented, with "NFS copes, nobody else has it enabled, whoever does it for other filesystems will just have to RTFS". I hope it's obvious that this is not an option. Because I can bloody guarantee that it will be cargo-culted over to other filesystems, with nobody (you and me included) understanding the resulting code.
On Sat, Sep 03, 2022 at 03:12:26AM +0100, Al Viro wrote: > Very much so. You are starting to invent new rules for ->lookup() that > just never had been there, basing on nothing better than a couple of > examples. They are nowhere near everything there is. A few examples besides NFS and autofs: ext4, f2fs and xfs might bloody well return NULL without hashing - happens on negative lookups with 'casefolding' crap. kernfs - treatment of inactive nodes. afs_dynroot_lookup() treatment of @cell... names. afs_lookup() treatment of @sys... names. There might very well be more - both merged into mainline and in development trees of various filesystems (including devel branches of in-tree ones - I'm not talking about out-of-tree projects). Note, BTW, that with the current rules it's perfectly possible to have this kind of fun: a name that resolves to different files for different processes unlink(2) is allowed and results depend upon the calling process All it takes is ->lookup() deliberately *NOT* hashing the sucker and ->unlink() acting according to dentry it has gotten from the caller. unlink(2) from different callers are serialized and none of that stuff is ever going to be hashed. d_alloc_parallel() might pick an in-lookup dentry from another caller of e.g. stat(2), but it will wait for in-lookup state ending, notice that the sucker is not hashed, drop it and retry. Suboptimal, but it works. Nothing in the mainline currently does that. Nothing that I know of, that is. Sure, it could be made work with the changes you seem to imply (if I'm not misreading you) - all it takes is lookup calling d_lookup_done() on its argument before returning NULL. But that's subtle, non-obvious and not documented anywhere... Another interesting question is the rules for unhashing dentries. What is needed for somebody to do temporary unhash, followed by rehashing?
On Sun, 04 Sep 2022, Al Viro wrote: > On Sat, Sep 03, 2022 at 03:12:26AM +0100, Al Viro wrote: > > > Very much so. You are starting to invent new rules for ->lookup() that > > just never had been there, basing on nothing better than a couple of > > examples. They are nowhere near everything there is. > > A few examples besides NFS and autofs: Hi Al, thanks for these - very helpful. I will give them due consideration when I write relevant documentation to include in the next posting of the series. Thanks a lot, NeilBrown > > ext4, f2fs and xfs might bloody well return NULL without hashing - happens > on negative lookups with 'casefolding' crap. > > kernfs - treatment of inactive nodes. > > afs_dynroot_lookup() treatment of @cell... names. > > afs_lookup() treatment of @sys... names. > > There might very well be more - both merged into mainline and in > development trees of various filesystems (including devel branches > of in-tree ones - I'm not talking about out-of-tree projects). > > Note, BTW, that with the current rules it's perfectly possible to > have this kind of fun: > a name that resolves to different files for different processes > unlink(2) is allowed and results depend upon the calling process > > All it takes is ->lookup() deliberately *NOT* hashing the sucker and > ->unlink() acting according to dentry it has gotten from the caller. > unlink(2) from different callers are serialized and none of that > stuff is ever going to be hashed. d_alloc_parallel() might pick an > in-lookup dentry from another caller of e.g. stat(2), but it will > wait for in-lookup state ending, notice that the sucker is not hashed, > drop it and retry. Suboptimal, but it works. > > Nothing in the mainline currently does that. Nothing that I know of, > that is. Sure, it could be made work with the changes you seem to > imply (if I'm not misreading you) - all it takes is lookup > calling d_lookup_done() on its argument before returning NULL. > But that's subtle, non-obvious and not documented anywhere... > > Another interesting question is the rules for unhashing dentries. > What is needed for somebody to do temporary unhash, followed by > rehashing? >
diff --git a/fs/dcache.c b/fs/dcache.c index bb0c4d0038db..d6bfa49b143b 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1765,6 +1765,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name) struct dentry *dentry; char *dname; int err; + static struct lock_class_key __key; dentry = kmem_cache_alloc_lru(dentry_cache, &sb->s_dentry_lru, GFP_KERNEL); @@ -1820,6 +1821,8 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name) INIT_LIST_HEAD(&dentry->d_child); d_set_d_op(dentry, dentry->d_sb->s_d_op); + lockdep_init_map(&dentry->d_update_map, "DENTRY_PAR_UPDATE", &__key, 0); + if (dentry->d_op && dentry->d_op->d_init) { err = dentry->d_op->d_init(dentry); if (err) { @@ -2626,7 +2629,7 @@ static inline void end_dir_add(struct inode *dir, unsigned int n, static void d_wait_lookup(struct dentry *dentry) { if (d_in_lookup(dentry)) { - DECLARE_WAITQUEUE(wait, current); + DEFINE_WAIT(wait); add_wait_queue(dentry->d_wait, &wait); do { set_current_state(TASK_UNINTERRUPTIBLE); @@ -3274,6 +3277,67 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode) } EXPORT_SYMBOL(d_tmpfile); +/** + * d_lock_update_nested - lock a dentry before updating + * @dentry: the dentry to be locked + * @base: the parent, or %NULL + * @name: the name in that parent, or %NULL + * @subclass: lockdep locking class. + * + * Lock a dentry in a directory on which a shared-lock may be held, and + * on which parallel updates are permitted. + * If the base and name are given, then on success the dentry will still + * have that base and name - it will not have raced with rename. + * On success, a positive dentry will still be hashed, ensuring there + * was no race with unlink. + * If they are not given, then on success the dentry will be negative, + * which again ensures no race with rename, or unlink. + * + * @subclass uses the I_MUTEX_ classes. + * If the parent directory is locked with I_MUTEX_NORMAL, use I_MUTEX_NORMAL. + * If the parent is locked with I_MUTEX_PARENT, I_MUTEX_PARENT2 or + * I_MUTEX_CHILD, use I_MUTEX_PARENT or, for the second in a rename, + * I_MUTEX_PARENT2. + */ +bool d_lock_update_nested(struct dentry *dentry, + struct dentry *base, const struct qstr *name, + int subclass) +{ + bool ret = true; + + lock_acquire_exclusive(&dentry->d_update_map, subclass, + 0, NULL, _THIS_IP_); + spin_lock(&dentry->d_lock); + if (dentry->d_flags & DCACHE_PAR_UPDATE) + ___wait_var_event(&dentry->d_flags, + !(dentry->d_flags & DCACHE_PAR_UPDATE), + TASK_UNINTERRUPTIBLE, 0, 0, + (spin_unlock(&dentry->d_lock), + schedule(), + spin_lock(&dentry->d_lock)) + ); + rcu_read_lock(); /* for d_same_name() */ + if (d_unhashed(dentry) && d_is_positive(dentry)) { + /* name was unlinked while we waited */ + ret = false; + } else if (base && + (dentry->d_parent != base || + dentry->d_name.hash != name->hash || + !d_same_name(dentry, base, name))) { + /* dentry was renamed - possibly silly-rename */ + ret = false; + } else if (!base && d_is_positive(dentry)) { + ret = false; + } else { + dentry->d_flags |= DCACHE_PAR_UPDATE; + } + rcu_read_unlock(); + spin_unlock(&dentry->d_lock); + if (!ret) + lock_map_release(&dentry->d_update_map); + return ret; +} + static __initdata unsigned long dhash_entries; static int __init set_dhash_entries(char *str) { diff --git a/fs/namei.c b/fs/namei.c index 53b4bc094db2..c008dfd01e30 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1574,14 +1574,14 @@ static struct dentry *lookup_dcache(const struct qstr *name, } /* - * Parent directory has inode locked exclusive. This is one - * and only case when ->lookup() gets called on non in-lookup - * dentries - as the matter of fact, this only gets called - * when directory is guaranteed to have no in-lookup children - * at all. + * Parent directory has inode locked exclusive, or possibly shared if wq + * is given. In the later case the fs has explicitly allowed concurrent + * updates in this directory. This is the one and only case + * when ->lookup() may be called on a non in-lookup dentry. */ static struct dentry *__lookup_hash(const struct qstr *name, - struct dentry *base, unsigned int flags) + struct dentry *base, unsigned int flags, + wait_queue_head_t *wq) { struct dentry *dentry = lookup_dcache(name, base, flags); struct dentry *old; @@ -1594,18 +1594,103 @@ static struct dentry *__lookup_hash(const struct qstr *name, if (unlikely(IS_DEADDIR(dir))) return ERR_PTR(-ENOENT); - dentry = d_alloc(base, name); + if (wq) + dentry = d_alloc_parallel(base, name, wq); + else + dentry = d_alloc(base, name); if (unlikely(!dentry)) return ERR_PTR(-ENOMEM); + if (IS_ERR(dentry)) + return dentry; + + if (wq && !d_in_lookup(dentry)) + /* Must have raced with another thread doing the lookup */ + return dentry; old = dir->i_op->lookup(dir, dentry, flags); if (unlikely(old)) { + d_lookup_done(dentry); dput(dentry); dentry = old; } return dentry; } +/* + * Parent directory (base) is not locked. We take either an exclusive + * or shared lock depending on the fs preference, then do a lookup, + * and then set the DCACHE_PAR_UPDATE bit on the child if a shared lock + * was taken on the parent. + */ +static struct dentry *lookup_hash_update( + const struct qstr *name, + struct dentry *base, unsigned int flags, + wait_queue_head_t *wq) +{ + struct dentry *dentry; + struct inode *dir = base->d_inode; + int err; + + if (wq && IS_PAR_UPDATE(dir)) + inode_lock_shared_nested(dir, I_MUTEX_PARENT); + else + inode_lock_nested(dir, I_MUTEX_PARENT); + +retry: + dentry = __lookup_hash(name, base, flags, wq); + if (IS_ERR(dentry)) { + err = PTR_ERR(dentry); + goto out_err; + } + if (!d_lock_update_nested(dentry, base, name, I_MUTEX_PARENT)) { + /* + * Failed to get lock due to race with unlink or rename + * - try again + */ + d_lookup_done(dentry); + dput(dentry); + goto retry; + } + return dentry; + +out_err: + if (wq && IS_PAR_UPDATE(dir)) + inode_unlock_shared(dir); + else + inode_unlock(dir); + return ERR_PTR(err); +} + +static int lookup_one_common(struct user_namespace *mnt_userns, + const char *name, struct dentry *base, int len, + struct qstr *this); + +struct dentry *lookup_hash_update_len(const char *name, int nlen, + struct path *path, unsigned int flags, + wait_queue_head_t *wq) +{ + struct qstr this; + int err = lookup_one_common(mnt_user_ns(path->mnt), name, + path->dentry, nlen, &this); + if (err) + return ERR_PTR(err); + return lookup_hash_update(&this, path->dentry, flags, wq); +} +EXPORT_SYMBOL(lookup_hash_update_len); + +static void done_path_update(struct path *path, struct dentry *dentry, + bool with_wq) +{ + struct inode *dir = path->dentry->d_inode; + + d_lookup_done(dentry); + d_unlock_update(dentry); + if (IS_PAR_UPDATE(dir) && with_wq) + inode_unlock_shared(dir); + else + inode_unlock(dir); +} + static struct dentry *lookup_fast(struct nameidata *nd) { struct dentry *dentry, *parent = nd->path.dentry; @@ -2570,7 +2655,7 @@ static struct dentry *__kern_path_locked(struct filename *name, struct path *pat return ERR_PTR(-EINVAL); } inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT); - d = __lookup_hash(&last, path->dentry, 0); + d = __lookup_hash(&last, path->dentry, 0, NULL); if (IS_ERR(d)) { inode_unlock(path->dentry->d_inode); path_put(path); @@ -3271,11 +3356,24 @@ static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry, if (nd->flags & LOOKUP_DIRECTORY) open_flag |= O_DIRECTORY; + /* + * dentry is negative or d_in_lookup(). In case this is a + * shared-lock create we need to set DCACHE_PAR_UPDATE to ensure + * exclusion. + */ + if (open_flag & O_CREAT) { + if (!d_lock_update(dentry, NULL, NULL)) + /* already exists, non-atomic open */ + return dentry; + } + file->f_path.dentry = DENTRY_NOT_SET; file->f_path.mnt = nd->path.mnt; error = dir->i_op->atomic_open(dir, dentry, file, open_to_namei_flags(open_flag), mode); d_lookup_done(dentry); + if ((open_flag & O_CREAT)) + d_unlock_update(dentry); if (!error) { if (file->f_mode & FMODE_OPENED) { if (unlikely(dentry != file->f_path.dentry)) { @@ -3327,6 +3425,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, int error, create_error = 0; umode_t mode = op->mode; DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); + bool have_par_update = false; if (unlikely(IS_DEADDIR(dir_inode))) return ERR_PTR(-ENOENT); @@ -3400,6 +3499,12 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, dentry = res; } } + /* + * If dentry is negative and this is a create we need to get + * DCACHE_PAR_UPDATE. + */ + if ((open_flag & O_CREAT) && !dentry->d_inode) + have_par_update = d_lock_update(dentry, NULL, NULL); /* Negative dentry, just create the file */ if (!dentry->d_inode && (open_flag & O_CREAT)) { @@ -3419,9 +3524,13 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, error = create_error; goto out_dput; } + if (have_par_update) + d_unlock_update(dentry); return dentry; out_dput: + if (have_par_update) + d_unlock_update(dentry); dput(dentry); return ERR_PTR(error); } @@ -3434,6 +3543,7 @@ static const char *open_last_lookups(struct nameidata *nd, bool got_write = false; struct dentry *dentry; const char *res; + bool shared; nd->flags |= op->intent; @@ -3474,14 +3584,15 @@ static const char *open_last_lookups(struct nameidata *nd, * dropping this one anyway. */ } - if (open_flag & O_CREAT) + shared = !(open_flag & O_CREAT) || IS_PAR_UPDATE(dir->d_inode); + if (!shared) inode_lock(dir->d_inode); else inode_lock_shared(dir->d_inode); dentry = lookup_open(nd, file, op, got_write); if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED)) fsnotify_create(dir->d_inode, dentry); - if (open_flag & O_CREAT) + if (!shared) inode_unlock(dir->d_inode); else inode_unlock_shared(dir->d_inode); @@ -3750,41 +3861,29 @@ struct file *do_file_open_root(const struct path *root, return file; } -static struct dentry *filename_create(int dfd, struct filename *name, - struct path *path, unsigned int lookup_flags) +static struct dentry *filename_create_one(struct qstr *last, struct path *path, + unsigned int lookup_flags, + wait_queue_head_t *wq) { - struct dentry *dentry = ERR_PTR(-EEXIST); - struct qstr last; + struct dentry *dentry; bool want_dir = lookup_flags & LOOKUP_DIRECTORY; unsigned int reval_flag = lookup_flags & LOOKUP_REVAL; unsigned int create_flags = LOOKUP_CREATE | LOOKUP_EXCL; - int type; int err2; int error; - error = filename_parentat(dfd, name, reval_flag, path, &last, &type); - if (error) - return ERR_PTR(error); - - /* - * Yucky last component or no last component at all? - * (foo/., foo/.., /////) - */ - if (unlikely(type != LAST_NORM)) - goto out; - /* don't fail immediately if it's r/o, at least try to report other errors */ err2 = mnt_want_write(path->mnt); /* * Do the final lookup. Suppress 'create' if there is a trailing * '/', and a directory wasn't requested. */ - if (last.name[last.len] && !want_dir) + if (last->name[last->len] && !want_dir) create_flags = 0; - inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT); - dentry = __lookup_hash(&last, path->dentry, reval_flag | create_flags); + dentry = lookup_hash_update(last, path->dentry, + reval_flag | create_flags, wq); if (IS_ERR(dentry)) - goto unlock; + goto drop_write; error = -EEXIST; if (d_is_positive(dentry)) @@ -3806,14 +3905,56 @@ static struct dentry *filename_create(int dfd, struct filename *name, } return dentry; fail: + done_path_update(path, dentry, !!wq); dput(dentry); dentry = ERR_PTR(error); -unlock: - inode_unlock(path->dentry->d_inode); +drop_write: if (!err2) mnt_drop_write(path->mnt); + return dentry; +} + +struct dentry *filename_create_one_len(const char *name, int nlen, + struct path *path, + unsigned int lookup_flags, + wait_queue_head_t *wq) +{ + struct qstr this; + int err; + + err = lookup_one_common(mnt_user_ns(path->mnt), name, + path->dentry, nlen, &this); + if (err) + return ERR_PTR(err); + return filename_create_one(&this, path, lookup_flags, wq); +} +EXPORT_SYMBOL(filename_create_one_len); + +static struct dentry *filename_create(int dfd, struct filename *name, + struct path *path, unsigned int lookup_flags, + wait_queue_head_t *wq) +{ + struct dentry *dentry = ERR_PTR(-EEXIST); + struct qstr last; + unsigned int reval_flag = lookup_flags & LOOKUP_REVAL; + int type; + int error; + + error = filename_parentat(dfd, name, reval_flag, path, &last, &type); + if (error) + return ERR_PTR(error); + + /* + * Yucky last component or no last component at all? + * (foo/., foo/.., /////) + */ + if (unlikely(type != LAST_NORM)) + goto out; + + dentry = filename_create_one(&last, path, lookup_flags, wq); out: - path_put(path); + if (IS_ERR(dentry)) + path_put(path); return dentry; } @@ -3821,27 +3962,28 @@ struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path, unsigned int lookup_flags) { struct filename *filename = getname_kernel(pathname); - struct dentry *res = filename_create(dfd, filename, path, lookup_flags); + struct dentry *res = filename_create(dfd, filename, path, lookup_flags, + NULL); putname(filename); return res; } EXPORT_SYMBOL(kern_path_create); -void done_path_create(struct path *path, struct dentry *dentry) +void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq) { + done_path_update(path, dentry, with_wq); dput(dentry); - inode_unlock(path->dentry->d_inode); mnt_drop_write(path->mnt); path_put(path); } -EXPORT_SYMBOL(done_path_create); +EXPORT_SYMBOL(done_path_create_wq); inline struct dentry *user_path_create(int dfd, const char __user *pathname, - struct path *path, unsigned int lookup_flags) + struct path *path, unsigned int lookup_flags) { struct filename *filename = getname(pathname); - struct dentry *res = filename_create(dfd, filename, path, lookup_flags); + struct dentry *res = filename_create(dfd, filename, path, lookup_flags, NULL); putname(filename); return res; @@ -3921,12 +4063,13 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode, struct path path; int error; unsigned int lookup_flags = 0; + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); error = may_mknod(mode); if (error) goto out1; retry: - dentry = filename_create(dfd, name, &path, lookup_flags); + dentry = filename_create(dfd, name, &path, lookup_flags, &wq); error = PTR_ERR(dentry); if (IS_ERR(dentry)) goto out1; @@ -3954,7 +4097,7 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode, break; } out2: - done_path_create(&path, dentry); + done_path_create_wq(&path, dentry, true); if (retry_estale(error, lookup_flags)) { lookup_flags |= LOOKUP_REVAL; goto retry; @@ -4023,9 +4166,10 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode) struct path path; int error; unsigned int lookup_flags = LOOKUP_DIRECTORY; + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); retry: - dentry = filename_create(dfd, name, &path, lookup_flags); + dentry = filename_create(dfd, name, &path, lookup_flags, &wq); error = PTR_ERR(dentry); if (IS_ERR(dentry)) goto out_putname; @@ -4038,7 +4182,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode) error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry, mode); } - done_path_create(&path, dentry); + done_path_create_wq(&path, dentry, true); if (retry_estale(error, lookup_flags)) { lookup_flags |= LOOKUP_REVAL; goto retry; @@ -4122,6 +4266,7 @@ int do_rmdir(int dfd, struct filename *name) struct qstr last; int type; unsigned int lookup_flags = 0; + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); retry: error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type); if (error) @@ -4143,8 +4288,7 @@ int do_rmdir(int dfd, struct filename *name) if (error) goto exit2; - inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT); - dentry = __lookup_hash(&last, path.dentry, lookup_flags); + dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq); error = PTR_ERR(dentry); if (IS_ERR(dentry)) goto exit3; @@ -4158,9 +4302,9 @@ int do_rmdir(int dfd, struct filename *name) mnt_userns = mnt_user_ns(path.mnt); error = vfs_rmdir(mnt_userns, path.dentry->d_inode, dentry); exit4: + done_path_update(&path, dentry, true); dput(dentry); exit3: - inode_unlock(path.dentry->d_inode); mnt_drop_write(path.mnt); exit2: path_put(&path); @@ -4185,13 +4329,14 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname) * @dentry: victim * @delegated_inode: returns victim inode, if the inode is delegated. * - * The caller must hold dir->i_mutex. + * The caller must either hold a write-lock on dir->i_rwsem, or + * a have atomically set DCACHE_PAR_UPDATE, or both. * * If vfs_unlink discovers a delegation, it will return -EWOULDBLOCK and * return a reference to the inode in delegated_inode. The caller * should then break the delegation on that inode and retry. Because * breaking a delegation may take a long time, the caller should drop - * dir->i_mutex before doing so. + * dir->i_rwsem before doing so. * * Alternatively, a caller may pass NULL for delegated_inode. This may * be appropriate for callers that expect the underlying filesystem not @@ -4250,9 +4395,11 @@ EXPORT_SYMBOL(vfs_unlink); /* * Make sure that the actual truncation of the file will occur outside its - * directory's i_mutex. Truncate can take a long time if there is a lot of + * directory's i_rwsem. Truncate can take a long time if there is a lot of * writeout happening, and we don't want to prevent access to the directory * while waiting on the I/O. + * If the filesystem permits (IS_PAR_UPDATE()), we take a shared lock on the + * directory and set DCACHE_PAR_UPDATE to get exclusive access to the dentry. */ int do_unlinkat(int dfd, struct filename *name) { @@ -4264,6 +4411,7 @@ int do_unlinkat(int dfd, struct filename *name) struct inode *inode = NULL; struct inode *delegated_inode = NULL; unsigned int lookup_flags = 0; + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); retry: error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type); if (error) @@ -4276,9 +4424,9 @@ int do_unlinkat(int dfd, struct filename *name) error = mnt_want_write(path.mnt); if (error) goto exit2; + retry_deleg: - inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT); - dentry = __lookup_hash(&last, path.dentry, lookup_flags); + dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq); error = PTR_ERR(dentry); if (!IS_ERR(dentry)) { struct user_namespace *mnt_userns; @@ -4297,9 +4445,9 @@ int do_unlinkat(int dfd, struct filename *name) error = vfs_unlink(mnt_userns, path.dentry->d_inode, dentry, &delegated_inode); exit3: + done_path_update(&path, dentry, true); dput(dentry); } - inode_unlock(path.dentry->d_inode); if (inode) iput(inode); /* truncate the inode here */ inode = NULL; @@ -4388,13 +4536,14 @@ int do_symlinkat(struct filename *from, int newdfd, struct filename *to) struct dentry *dentry; struct path path; unsigned int lookup_flags = 0; + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); if (IS_ERR(from)) { error = PTR_ERR(from); goto out_putnames; } retry: - dentry = filename_create(newdfd, to, &path, lookup_flags); + dentry = filename_create(newdfd, to, &path, lookup_flags, &wq); error = PTR_ERR(dentry); if (IS_ERR(dentry)) goto out_putnames; @@ -4407,7 +4556,7 @@ int do_symlinkat(struct filename *from, int newdfd, struct filename *to) error = vfs_symlink(mnt_userns, path.dentry->d_inode, dentry, from->name); } - done_path_create(&path, dentry); + done_path_create_wq(&path, dentry, true); if (retry_estale(error, lookup_flags)) { lookup_flags |= LOOKUP_REVAL; goto retry; @@ -4536,6 +4685,7 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, struct inode *delegated_inode = NULL; int how = 0; int error; + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0) { error = -EINVAL; @@ -4559,7 +4709,7 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, goto out_putnames; new_dentry = filename_create(newdfd, new, &new_path, - (how & LOOKUP_REVAL)); + (how & LOOKUP_REVAL), &wq); error = PTR_ERR(new_dentry); if (IS_ERR(new_dentry)) goto out_putpath; @@ -4577,7 +4727,7 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, error = vfs_link(old_path.dentry, mnt_userns, new_path.dentry->d_inode, new_dentry, &delegated_inode); out_dput: - done_path_create(&new_path, new_dentry); + done_path_create_wq(&new_path, new_dentry, true); if (delegated_inode) { error = break_deleg_wait(&delegated_inode); if (!error) { @@ -4847,7 +4997,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd, retry_deleg: trap = lock_rename(new_path.dentry, old_path.dentry); - old_dentry = __lookup_hash(&old_last, old_path.dentry, lookup_flags); + old_dentry = __lookup_hash(&old_last, old_path.dentry, + lookup_flags, NULL); error = PTR_ERR(old_dentry); if (IS_ERR(old_dentry)) goto exit3; @@ -4855,7 +5006,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd, error = -ENOENT; if (d_is_negative(old_dentry)) goto exit4; - new_dentry = __lookup_hash(&new_last, new_path.dentry, lookup_flags | target_flags); + new_dentry = __lookup_hash(&new_last, new_path.dentry, + lookup_flags | target_flags, NULL); error = PTR_ERR(new_dentry); if (IS_ERR(new_dentry)) goto exit4; diff --git a/include/linux/dcache.h b/include/linux/dcache.h index 92c78ed02b54..d71c12907617 100644 --- a/include/linux/dcache.h +++ b/include/linux/dcache.h @@ -13,7 +13,9 @@ #include <linux/rcupdate.h> #include <linux/lockref.h> #include <linux/stringhash.h> +#include <linux/sched.h> #include <linux/wait.h> +#include <linux/wait_bit.h> struct path; struct vfsmount; @@ -96,6 +98,8 @@ struct dentry { unsigned long d_time; /* used by d_revalidate */ void *d_fsdata; /* fs-specific data */ + /* lockdep tracking of DCACHE_PAR_UPDATE locks */ + struct lockdep_map d_update_map; union { struct list_head d_lru; /* LRU list */ wait_queue_head_t *d_wait; /* in-lookup ones only */ @@ -211,6 +215,7 @@ struct dentry_operations { #define DCACHE_PAR_LOOKUP 0x10000000 /* being looked up (with parent locked shared) */ #define DCACHE_DENTRY_CURSOR 0x20000000 #define DCACHE_NORCU 0x40000000 /* No RCU delay for freeing */ +#define DCACHE_PAR_UPDATE 0x80000000 /* Being created or unlinked with shared lock */ extern seqlock_t rename_lock; @@ -598,4 +603,27 @@ struct name_snapshot { void take_dentry_name_snapshot(struct name_snapshot *, struct dentry *); void release_dentry_name_snapshot(struct name_snapshot *); +bool d_lock_update_nested(struct dentry *dentry, + struct dentry *base, const struct qstr *name, + int subclass); +static inline bool d_lock_update(struct dentry *dentry, + struct dentry *base, const struct qstr *name) +{ + /* 0 is I_MUTEX_NORMAL, but fs.h might not be included */ + return d_lock_update_nested(dentry, base, name, 0); +} + +static inline void d_unlock_update(struct dentry *dentry) +{ + if (IS_ERR_OR_NULL(dentry)) + return; + if (dentry->d_flags & DCACHE_PAR_UPDATE) { + lock_map_release(&dentry->d_update_map); + spin_lock(&dentry->d_lock); + dentry->d_flags &= ~DCACHE_PAR_UPDATE; + spin_unlock(&dentry->d_lock); + wake_up_var(&dentry->d_flags); + } +} + #endif /* __LINUX_DCACHE_H */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 9eced4cc286e..a9a48306806a 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2524,12 +2524,13 @@ int sync_inode_metadata(struct inode *inode, int wait); struct file_system_type { const char *name; int fs_flags; -#define FS_REQUIRES_DEV 1 +#define FS_REQUIRES_DEV 1 #define FS_BINARY_MOUNTDATA 2 #define FS_HAS_SUBTYPE 4 #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */ #define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */ -#define FS_ALLOW_IDMAP 32 /* FS has been updated to handle vfs idmappings. */ +#define FS_ALLOW_IDMAP 32 /* FS has been updated to handle vfs idmappings. */ +#define FS_PAR_DIR_UPDATE BIT(6) /* Allow parallel directory updates */ #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */ int (*init_fs_context)(struct fs_context *); const struct fs_parameter_spec *parameters; diff --git a/include/linux/namei.h b/include/linux/namei.h index caeb08a98536..b7a123b489b1 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -61,7 +61,14 @@ extern int kern_path(const char *, unsigned, struct path *); extern struct dentry *kern_path_create(int, const char *, struct path *, unsigned int); extern struct dentry *user_path_create(int, const char __user *, struct path *, unsigned int); -extern void done_path_create(struct path *, struct dentry *); +extern struct dentry *lookup_hash_update_len(const char *name, int nlen, + struct path *path, unsigned int flags, + wait_queue_head_t *wq); +extern void done_path_create_wq(struct path *, struct dentry *, bool); +static inline void done_path_create(struct path *path, struct dentry *dentry) +{ + done_path_create_wq(path, dentry, false); +} extern struct dentry *kern_path_locked(const char *, struct path *); extern struct dentry *try_lookup_one_len(const char *, struct dentry *, int); @@ -75,6 +82,13 @@ struct dentry *lookup_one_unlocked(struct user_namespace *mnt_userns, struct dentry *lookup_one_positive_unlocked(struct user_namespace *mnt_userns, const char *name, struct dentry *base, int len); +struct dentry *filename_create_one_len(const char *name, int nlen, + struct path *path, + unsigned int lookup_flags, + wait_queue_head_t *wq); +static inline bool IS_PAR_UPDATE(struct inode *dir) { + return !!(dir->i_sb->s_type->fs_flags & FS_PAR_DIR_UPDATE); +} extern int follow_down_one(struct path *); extern int follow_down(struct path *);