Message ID: 20160703062917.GG14480@ZenIV.linux.org.uk (mailing list archive)
State: New, archived
On Sun, Jul 03, 2016 at 07:29:46AM +0100, Al Viro wrote:

> Tentative NFS patch follows; I don't understand Lustre well enough, but it
> looks like a plausible strategy there as well.

Speaking of Lustre: WTF is

	/* Open dentry. */
	if (S_ISFIFO(d_inode(dentry)->i_mode)) {
		/* We cannot call open here as it would
		 * deadlock.
		 */
		if (it_disposition(it, DISP_ENQ_OPEN_REF))
			ptlrpc_req_finished(
				(struct ptlrpc_request *)it->d.lustre.it_data);
		rc = finish_no_open(file, de);
	} else {

about, and why do we only do that to FIFOs?  What about symlinks or device
nodes?  Directories, for that matter...  Shouldn't that be if (!S_ISREG(...))
instead?
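For reference, a sketch of the variant Al is proposing, i.e. the quoted block
with the condition widened to everything non-regular (illustrative only; the
identifiers come from the quoted Lustre code, the comment text is not):

	/* Open dentry. */
	if (!S_ISREG(d_inode(dentry)->i_mode)) {
		/* Don't call ->open for anything that isn't a regular
		 * file: e.g. a FIFO's open can block waiting for a peer
		 * while we hold locks that must be released promptly.
		 */
		if (it_disposition(it, DISP_ENQ_OPEN_REF))
			ptlrpc_req_finished(
				(struct ptlrpc_request *)it->d.lustre.it_data);
		rc = finish_no_open(file, de);
	} else {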
On Jul 3, 2016, at 8:08 PM, Al Viro wrote:

> On Sun, Jul 03, 2016 at 07:29:46AM +0100, Al Viro wrote:
>
>> Tentative NFS patch follows; I don't understand Lustre well enough, but it
>> looks like a plausible strategy there as well.
>
> Speaking of Lustre: WTF is
> [...]
> about, and why do we only do that to FIFOs?  What about symlinks or device
> nodes?  Directories, for that matter...  Shouldn't that be if (!S_ISREG(...))
> instead?

Hm… This dates to sometime in 2006 and my memory is a bit hazy here. I think
when we called into the open, it went into the FIFO open and got stuck there
waiting for the other opener, something like that. And we cannot afford to get
stuck there, because we are holding locks that need to be released in
predictable time.

This code is actually unreachable now, because the server never returns an
open handle for special device nodes anymore (there's a comment about it in
the current staging tree, but I guess you are looking at some prior version).

I imagine device nodes might have posed a similar risk, but it did not occur
to me to test them separately, and the test suite does not cover it either.
Directories do not get stuck when you open them, so they are fine and we can
atomically open them too, I guess. Symlinks are handled specially on the
server, and the open never returns an actual open handle for them, so this
path is unreachable for those as well.
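Background for the deadlock Oleg describes: opening a FIFO for one direction
blocks until a peer opens the other end, which is fatal while cluster locks
are held. A minimal userspace illustration (the path is hypothetical):

	#include <fcntl.h>
	#include <sys/stat.h>

	int main(void)
	{
		/* hypothetical path, for illustration only */
		mkfifo("/tmp/demo.fifo", 0600);

		/* With no writer present, this open() sleeps in fifo_open()
		 * until some other process opens the FIFO for writing.
		 * The same wait inside a filesystem's ->atomic_open(),
		 * with DLM locks held, would stall lock revocation.
		 */
		int fd = open("/tmp/demo.fifo", O_RDONLY);

		return fd < 0;
	}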
On Jul 3, 2016, at 2:29 AM, Al Viro wrote:

> On Sat, Jun 25, 2016 at 12:38:40PM -0400, Oleg Drokin wrote:
>
>> Sorry to nag you about this, but did any of those pan out?
>>
>> d_alloc_parallel() sounds like a bit too heavy there, esp. considering we
>> came in with a dentry already (though a potentially shared one, I
>> understand).
>> Would not it be better to try and establish some dentry locking rule for
>> calling into d_splice_alias() instead? At least then the callers can make
>> sure the dentry does not change under them?
>> Though I guess if there's dentry locking like that, we might as well do all
>> the checking in d_splice_alias(), but that means the unhashed dentries would
>> no longer be disallowed, which is a change of semantics from now.
>
> FWIW, the only interesting case here is this:
> 	* no O_CREAT in flags (otherwise the parent is held exclusive).
> 	* dentry is found in hash
> 	* dentry is negative
> 	* dentry has passed ->d_revalidate() (i.e. in case of
> NFS it had nfs_neg_need_reval() return false).
>
> Only two instances are non-trivial in that respect - NFS and Lustre.
> Everything else will simply fail open() with ENOENT in that case.
>
> And at least for NFS we could bloody well do d_drop + d_alloc_parallel +
> finish_no_open and bugger off in case it's not in_lookup, otherwise do
> pretty much what we do in case we'd got in_lookup from the very beginning.
> Some adjustments are needed for that case (basically, we need to make
> sure we hit d_lookup_done() matching that d_alloc_parallel() and deal
> with refcounting correctly).
>
> Tentative NFS patch follows; I don't understand Lustre well enough, but it
> looks like a plausible strategy there as well.

This patch seems to have brought the other crash back (or something similar):
the one with a negative dentry being hashed when it's already hashed. It's not
as easy to hit as before, but a lot easier than the race we are hitting here.

Also, in all cases it is ls that is crashing now, which seems to suggest it is
taking some of the new paths added. I have half a dozen crashdumps if you want
me to look something up in there. This is on top of 4.7.0-rc6
(a99cde438de0c4c0cecc1d1af1a55a75b10bfdef) with just your patch on top.

I can reproduce it with both the complex workload and a simplified one
(attached); it takes anywhere from 5 to 30 minutes to hit.

> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index d8015a03..5474e39 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1485,11 +1485,13 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
>  		    struct file *file, unsigned open_flags,
>  		    umode_t mode, int *opened)
>  {
> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>  	struct nfs_open_context *ctx;
>  	struct dentry *res;
>  	struct iattr attr = { .ia_valid = ATTR_OPEN };
>  	struct inode *inode;
>  	unsigned int lookup_flags = 0;
> +	bool switched = false;
>  	int err;
>
>  	/* Expect a negative dentry */
> @@ -1528,6 +1530,17 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
>  		attr.ia_size = 0;
>  	}
>
> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
> +		d_drop(dentry);
> +		switched = true;
> +		dentry = d_alloc_parallel(dentry->d_parent,
> +					  &dentry->d_name, &wq);

Hm, d_alloc_parallel() can return some preexisting dentry it was able to look
up in RCU mode, attached to the same parent with the same name, it seems?
What's to stop a parallel thread from rehashing the dentry after we dropped it
here, so that d_alloc_parallel() will happily find it? I am running a test to
verify this theory now.

> [...]
ah, and of course the testcase that I forgot to attach ;)

On Jul 4, 2016, at 10:28 PM, Oleg Drokin wrote:

> [...]
On Jul 4, 2016, at 10:28 PM, Oleg Drokin wrote:

> [...]
>
>> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
>> +		d_drop(dentry);
>> +		switched = true;
>> +		dentry = d_alloc_parallel(dentry->d_parent,
>> +					  &dentry->d_name, &wq);
>
> Hm, d_alloc_parallel() can return some preexisting dentry it was able to
> look up in RCU mode, attached to the same parent with the same name, it
> seems?
> What's to stop a parallel thread from rehashing the dentry after we dropped
> it here, so that d_alloc_parallel() will happily find it?
> I am running a test to verify this theory now.

Ok, so at least in my case we do not come out of d_alloc_parallel() with a
hashed dentry, though I wonder why you are not trying to catch this case here.
If you rely on the !d_in_lookup(dentry) check below, that seems racy: if it
was __d_lookup_rcu() that found the dentry, we do not seem to perform any
additional checks on it and just return it. But what if it was just added by a
parallel caller that has finished d_splice_alias() (i.e. a hashed dentry, so
it is not rejected by __d_lookup_rcu()) but has not yet progressed all the way
to the d_lookup_done() call below?

Interesting that in some of the dumps d_hash is all zeroes, which means the
dentry is unhashed, yet the assertion was triggered, so there was some
parallel caller that performed a d_drop() on us (but nothing of interest on
the other CPUs). In all cases d_flags = 140, so we do not have
DCACHE_PAR_LOOKUP set at the point of the crash. Also, in part of the dumps
the dentry is actually positive, not negative.

>> +		if (IS_ERR(dentry))
>> +			return PTR_ERR(dentry);
>> +		if (unlikely(!d_in_lookup(dentry)))
>> +			return finish_no_open(file, dentry);
>> +	}
> [...]
On Jul 3, 2016, at 2:29 AM, Al Viro wrote:

> [...]
>
> Tentative NFS patch follows; I don't understand Lustre well enough, but it
> looks like a plausible strategy there as well.

I think I know why this does not work, or at least part of the reason.

> @@ -1528,6 +1530,17 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
>  		attr.ia_size = 0;
>  	}
>
> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {

So we come racing here from multiple threads (say 3 or more - we have seen
this in the older crash reports, so it's totally possible).

> +		d_drop(dentry);

One lucky one does this first, before the others perform the !d_unhashed()
check above. That makes the other ones not enter here, and we are back to the
original problem of multiple threads trying to instantiate the same dentry.
Only this time the race is somehow even wider, because of some other
interactions I don't really understand; at least with this patch removed I
feel like I am having a harder time hitting the original issue.

> [...]
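To make the interleaving concrete, a hypothetical three-thread timeline for
the race Oleg describes (names taken from the quoted hunk):

	/*
	 * Hypothetical interleaving; D is a hashed negative dentry.
	 *
	 *   T1                     T2                     T3
	 *   d_unhashed(D)? no
	 *   d_drop(D)
	 *                          d_unhashed(D)? yes    d_unhashed(D)? yes
	 *                          (skips the new block) (skips the new block)
	 *   d_alloc_parallel(...)
	 *                          proceeds with D       proceeds with D
	 *
	 * T2 and T3 bypass the serialization that d_alloc_parallel() was
	 * supposed to provide, so both can try to instantiate and hash D,
	 * reproducing the original "hashed negative dentry" crash.
	 */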
On Tue, Jul 05, 2016 at 02:22:48AM -0400, Oleg Drokin wrote:

>> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {

s/d_unhashed/d_in_lookup/ in that.

> So we come racing here from multiple threads (say 3 or more - we have seen
> this in the older crash reports, so it's totally possible).
>
>> +		d_drop(dentry);
>
> One lucky one does this first, before the others perform the !d_unhashed()
> check above. That makes the other ones not enter here, and we are back to
> the original problem of multiple threads trying to instantiate the same
> dentry.

Yep.  See above - it should've been using d_in_lookup() in the first place,
through the entire nfs_atomic_open().  Same in the Lustre part of the fixes,
obviously.
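With that substitution, the hunk becomes (a sketch of the fix Al describes;
the authoritative version is the one in #for-linus):

	if (!(open_flags & O_CREAT) && !d_in_lookup(dentry)) {
		d_drop(dentry);
		switched = true;
		dentry = d_alloc_parallel(dentry->d_parent,
					  &dentry->d_name, &wq);
		if (IS_ERR(dentry))
			return PTR_ERR(dentry);
		if (unlikely(!d_in_lookup(dentry)))
			return finish_no_open(file, dentry);
	}

Checking d_in_lookup() instead of d_unhashed() means a racing opener that
loses the d_drop() race still funnels through d_alloc_parallel(), where it
either becomes the in-lookup dentry or waits for whoever did.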
On Tue, Jul 05, 2016 at 01:31:10PM +0100, Al Viro wrote:

> [...]
>
> Yep.  See above - it should've been using d_in_lookup() in the first place,
> through the entire nfs_atomic_open().  Same in the Lustre part of the fixes,
> obviously.

See current #for-linus for hopefully fixed variants (both lustre and nfs)
On Jul 5, 2016, at 9:51 AM, Al Viro wrote:

> [...]
>
> See current #for-linus for hopefully fixed variants (both lustre and nfs)

The first patch of the series:

> @@ -416,9 +416,9 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
> ...
> -	if (d_unhashed(*de)) {
> +	if (d_in_lookup(*de)) {
>  		struct dentry *alias;
>
>  		alias = ll_splice_alias(inode, *de);

This breaks Lustre, because we now might progress further in this function
without calling into ll_splice_alias(), and that's the only place where we do
ll_d_init(), which later code depends on; so we crash violently the next time
we call e.g. d_lustre_revalidate() further down that code path.

Also, I still wonder what's to stop d_alloc_parallel() from returning a hashed
dentry with d_in_lookup() still true. Certainly there's a big gap between
hashing the dentry and dropping the PAR bit in there that I imagine might
allow __d_lookup_rcu() to pick it up in between?
On Jul 5, 2016, at 9:51 AM, Al Viro wrote:

> [...]
>
> See current #for-linus for hopefully fixed variants (both lustre and nfs)

So at first it looked like we just need another d_init in the other arm of
that "if (d_in_lookup())" statement, but alas.

Also, the patch that changed the d_unhashed() check to d_in_lookup() now
leaves a stale comment:

	/* Only hash *de if it is unhashed (new dentry).
	 * Atomic_open may passing hashed dentries for open.
	 */
	if (d_in_lookup(*de)) {

Since we no longer check for d_unhashed(), what would be a better choice of
words here? "Only hash *de if it is a new dentry coming from lookup"?

This also makes me question the whole thing some more. We are definitely in
lookup when this hits, so the dentry is already new, yet it does not check off
as d_in_lookup(). That also means that by skipping ll_splice_alias() we are
failing to hash it, and that causes needless lookups later?

Looking some way back into the history of commits, d_in_lookup() is meant to
tell us that we are in the middle of a lookup. How can we be in the middle of
the lookup path and not have this set on the dentry? We know the dentry was
not substituted with anything here, because we did not call into
ll_splice_alias(). So what's going on then? Here's a backtrace:

[  146.045148]  [<ffffffffa017baa6>] lbug_with_loc+0x46/0xb0 [libcfs]
[  146.045158]  [<ffffffffa05baef3>] ll_lookup_it_finish+0x713/0xaa0 [lustre]
[  146.045160]  [<ffffffff810e6dcd>] ? trace_hardirqs_on+0xd/0x10
[  146.045167]  [<ffffffffa05bb51b>] ll_lookup_it+0x29b/0x710 [lustre]
[  146.045173]  [<ffffffffa05b8830>] ? md_set_lock_data.part.25+0x60/0x60 [lustre]
[  146.045179]  [<ffffffffa05bc6a4>] ll_lookup_nd+0x84/0x190 [lustre]
[  146.045180]  [<ffffffff81276e94>] __lookup_hash+0x64/0xa0
[  146.045181]  [<ffffffff810e1b88>] ? down_write_nested+0xa8/0xc0
[  146.045182]  [<ffffffff8127d55f>] do_unlinkat+0x1bf/0x2f0
[  146.045183]  [<ffffffff8127e12b>] SyS_unlinkat+0x1b/0x30
[  146.045185]  [<ffffffff8188b3bc>] entry_SYSCALL_64_fastpath+0x1f/0xbd

__lookup_hash() does d_alloc() (not the parallel variant) and falls through
into the ->lookup() of the filesystem, so the dots do not connect.

The more I look at it, the more I suspect it's wrong. Everywhere else you
changed in that patch was in *atomic_open(), with a very well-understood
impact; ll_lookup_it_finish(), on the other hand, is in the generic lookup
path, not just for atomic opens. I took out that part of your patch and the
problems went away, it seems.
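The backtrace goes through __lookup_hash(); a paraphrased sketch of that
helper (condensed from fs/namei.c of this era, not verbatim) shows why such a
dentry reaches ->lookup() without the PAR bit:

	static struct dentry *__lookup_hash(const struct qstr *name,
					    struct dentry *base,
					    unsigned int flags)
	{
		struct dentry *dentry = lookup_dcache(name, base, flags);
		struct dentry *old;

		if (dentry)
			return dentry;

		/* Plain d_alloc(): the parent is locked exclusive, so the
		 * dentry is never entered into the in-lookup hash and
		 * DCACHE_PAR_LOOKUP is never set; d_in_lookup() is
		 * therefore false inside ->lookup().
		 */
		dentry = d_alloc(base, name);
		if (unlikely(!dentry))
			return ERR_PTR(-ENOMEM);

		old = base->d_inode->i_op->lookup(base->d_inode, dentry, flags);
		if (unlikely(old)) {
			dput(dentry);
			dentry = old;
		}
		return dentry;
	}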
On Tue, Jul 05, 2016 at 11:21:32AM -0400, Oleg Drokin wrote:

>> ...
>> -	if (d_unhashed(*de)) {
>> +	if (d_in_lookup(*de)) {
>>  		struct dentry *alias;
>>
>>  		alias = ll_splice_alias(inode, *de);
>
> This breaks Lustre, because we now might progress further in this function
> without calling into ll_splice_alias(), and that's the only place where we
> do ll_d_init(), which later code depends on; so we crash violently the next
> time we call e.g. d_lustre_revalidate() further down that code path.

Huh?  How the hell do those conditions differ there?

> Also, I still wonder what's to stop d_alloc_parallel() from returning a
> hashed dentry with d_in_lookup() still true.

The fact that such dentries do not exist at any point?

> Certainly there's a big gap between hashing the dentry and dropping the PAR
> bit in there that I imagine might allow __d_lookup_rcu() to pick it up in
> between?

WTF?  Where do you see that gap?  In-lookup dentries get hashed only in one
place - __d_add().  And there (besides holding ->d_lock around both) we drop
that bit in flags *before* _d_rehash().  AFAICS, the situation with barriers
is OK there, due to lockref_get_not_dead() serving as an ACQUIRE operation; I
could be missing something subtle, but a wide gap...  Where?
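For the archive, the ordering Al refers to, paraphrased from __d_add() in
fs/dcache.c (condensed, not verbatim):

	static inline void __d_add(struct dentry *dentry, struct inode *inode)
	{
		spin_lock(&dentry->d_lock);
		if (unlikely(d_in_lookup(dentry))) {
			/* clears DCACHE_PAR_LOOKUP, removes the dentry from
			 * the in-lookup hash and wakes any waiters */
			__d_lookup_done(dentry);
		}
		if (inode) {
			/* ... attach the inode and update ->d_flags under
			 * ->d_seq ... */
		}
		_d_rehash(dentry);	/* only now can __d_lookup_rcu()
					 * find the dentry */
		spin_unlock(&dentry->d_lock);
	}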
On Tue, Jul 05, 2016 at 12:33:09PM -0400, Oleg Drokin wrote:

> This also makes me question the whole thing some more. We are definitely in
> lookup when this hits, so the dentry is already new, yet it does not check
> off as d_in_lookup(). That also means that by skipping ll_splice_alias() we
> are failing to hash it, and that causes needless lookups later?
> Looking some way back into the history of commits, d_in_lookup() is meant to
> tell us that we are in the middle of a lookup. How can we be in the middle
> of the lookup path and not have this set on the dentry? We know the dentry
> was not substituted with anything here, because we did not call into
> ll_splice_alias(). So what's going on then?

Lookup in a directory locked exclusive, that's what...  In unlink(), in your
testcase.  And yes, this piece of 1/3 is incorrect; what I do not understand
is the logics of what you are doing with dcache in ll_splice_alias() and in
its caller ;-/
On Jul 5, 2016, at 1:42 PM, Al Viro wrote:

> On Tue, Jul 05, 2016 at 11:21:32AM -0400, Oleg Drokin wrote:
> [...]
>
> Huh?  How the hell do those conditions differ there?

As explained in my other email: because this is in the normal lookup path, we
can get here with a new dentry that was allocated in __lookup_hash() via
d_alloc() (not the parallel variant), and so is not marked with the PAR bit.

>> Also, I still wonder what's to stop d_alloc_parallel() from returning a
>> hashed dentry with d_in_lookup() still true.
>
> The fact that such dentries do not exist at any point?
>
>> Certainly there's a big gap between hashing the dentry and dropping the PAR
>> bit in there that I imagine might allow __d_lookup_rcu() to pick it up in
>> between?
>
> WTF?  Where do you see that gap?  In-lookup dentries get hashed only in one
> place - __d_add().  And there (besides holding ->d_lock around both) we drop
> that bit in flags *before* _d_rehash().

Oh!  I see, I missed that __d_add() drops the PAR bit as well, not just the
code at the end of the caller that did d_alloc_parallel(). Then indeed there
is no gap; sorry for the false alarm.
On Jul 5, 2016, at 2:08 PM, Al Viro wrote:

> On Tue, Jul 05, 2016 at 12:33:09PM -0400, Oleg Drokin wrote:
> [...]
>
> Lookup in a directory locked exclusive, that's what...  In unlink(), in your
> testcase.  And yes, this piece of 1/3 is incorrect; what I do not understand
> is the logics of what you are doing with dcache in ll_splice_alias() and in
> its caller ;-/

When d_in_lookup() was introduced, it was described as:

    New primitives: d_in_lookup() (a predicate checking if dentry is in
    the in-lookup state) and d_lookup_done() (tells the system that
    we are done with lookup and if it's still marked as in-lookup, it
    should cease to be such).

I don't see where it mentions anything about exclusive vs. parallel lookup,
which probably led to some confusion.

So with Lustre the dentry can be in three states, really:

1. a hashed dentry that's all A-ok to reuse;
2. a hashed dentry that's NOT valid (the DLM lock went away) - this is
   distinguished in d_compare by looking at a bit in the fs_data;
3. an unhashed dentry (I guess it could be both valid and invalid
   Lustre-wise).

So the logic in ll_lookup_it_finish() (part of the regular lookup) is this:
if the dentry we have is not hashed, this is a new lookup, so we need to call
into ll_splice_alias() to see if there's a better dentry we should reuse, one
that was already rejected by the VFS before because we did not have the
necessary locks, but we do have them now. The comment at the top of
ll_dcompare() explains why we don't just unhash the dentry on lock loss: that
apparently leads to a loop around real_lookup for really contended dentries.
This is also why we cannot use d_splice_alias() here - such cases are possible
for regular files and directories.

Anyway, I guess an additional point of confusion here is why
ll_lookup_it_finish() needs to check for hashedness of the dentry at all,
since it's in lookup, so we should be unhashed here. I checked the commit
history, and this check was added along with atomic_open support, so I
imagine we can just move it up into ll_atomic_open(), and then your change
starts to make sense along with a few other things.
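A condensed sketch of the flow described above (a hypothetical pseudo-helper,
not the actual ll_lookup_it_finish() source):

	/* Hypothetical condensation of the dentry handling described above. */
	static int lookup_finish_sketch(struct inode *inode, struct dentry **de)
	{
		if (d_unhashed(*de)) {
			/* New lookup: ll_splice_alias() either finds a better
			 * existing alias (perhaps one rejected earlier for
			 * lack of DLM locks, usable now that we hold them) or
			 * hashes this dentry; either way it performs
			 * ll_d_init(), which d_lustre_revalidate() relies on
			 * later.
			 */
			struct dentry *alias = ll_splice_alias(inode, *de);

			if (IS_ERR(alias))
				return PTR_ERR(alias);
			*de = alias;
		}
		/* A hashed dentry may still be Lustre-invalid (DLM lock
		 * lost); ll_dcompare() consults a flag in ->d_fsdata instead
		 * of unhashing, to avoid looping around real_lookup() on
		 * contended names.
		 */
		return 0;
	}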
On Tue, Jul 05, 2016 at 03:12:44PM -0400, Oleg Drokin wrote:

> When d_in_lookup() was introduced, it was described as:
>     New primitives: d_in_lookup() (a predicate checking if dentry is in
>     the in-lookup state) and d_lookup_done() (tells the system that
>     we are done with lookup and if it's still marked as in-lookup, it
>     should cease to be such).
>
> I don't see where it mentions anything about exclusive vs. parallel lookup,
> which probably led to some confusion.

In the same commit:

    #define DCACHE_PAR_LOOKUP	0x10000000 /* being looked up (with parent
    					      locked shared) */

    static inline int d_in_lookup(struct dentry *dentry)
    {
    	return dentry->d_flags & DCACHE_PAR_LOOKUP;
    }

Sure, we could use d_alloc_parallel() for all lookups, but it wouldn't buy us
anything in terms of exclusion (parent locked exclusive => no other lookup
attempts on that parent/name pair anyway), and it would cost extra searches
both in the primary and in-lookup hashes, as well as insertions into and
removals from the latter.

Hell knows - perhaps teaching d_alloc_parallel() that NULL wq => just allocate
and mark in-lookup, without touching either the primary or the in-lookup hash
(and scream bloody murder if the parent isn't locked exclusive), would be a
good idea. A few places in fs/dcache.c would need to check for ->d_wait being
non-NULL (__d_lookup_done(), __d_add() and __d_move()); we could use that for
lookups when the parent is locked exclusive; then d_in_lookup() would be true
for *all* dentries passed to ->lookup(). I'll look into that, but it's
obviously next-cycle fodder.

> So with Lustre the dentry can be in three states, really:
> [...]
> I checked the commit history, and this check was added along with
> atomic_open support, so I imagine we can just move it up into
> ll_atomic_open(), and then your change starts to make sense along with a few
> other things.

So basically this

	} else if (!it_disposition(it, DISP_LOOKUP_NEG)  &&
		   !it_disposition(it, DISP_OPEN_CREATE)) {
		/* With DISP_OPEN_CREATE dentry will be
		 * instantiated in ll_create_it.
		 */
		LASSERT(!d_inode(*de));
		d_instantiate(*de, inode);
	}

is something we should do only for a negative hashed dentry fed to it by
->atomic_open(), and that - only if we have no O_CREAT in flags? Then, since
3/3 eliminates that case completely, we could just rip that else-if out,
along with the d_unhashed check itself, making the call of ll_splice_alias()
unconditional there. Or am I misreading what you said?
Do you see any problems with what's in #for-linus now (head at 11f0083)?
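For what it's worth, a purely illustrative sketch of the NULL-wq idea floated
above; none of this exists in the tree, and the WARN_ON policy is an
assumption:

	struct dentry *d_alloc_parallel(struct dentry *parent,
					const struct qstr *name,
					wait_queue_head_t *wq)
	{
		struct dentry *new = d_alloc(parent, name);

		if (unlikely(!new))
			return ERR_PTR(-ENOMEM);
		if (!wq) {
			/* Proposed semantics: the caller holds the parent
			 * locked exclusive, so no one else can be looking up
			 * this name. Mark the dentry in-lookup without
			 * touching the primary or in-lookup hash;
			 * __d_lookup_done(), __d_add() and __d_move() would
			 * then have to tolerate a NULL ->d_wait.
			 */
			WARN_ON(!inode_is_locked(parent->d_inode));
			new->d_flags |= DCACHE_PAR_LOOKUP;
			return new;
		}
		/* ... existing path: search the primary hash, then insert
		 * into (or wait on) the in-lookup hash (elided) ... */
		return new;
	}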
On Jul 5, 2016, at 9:51 AM, Al Viro wrote:

> [...]
>
> See current #for-linus for hopefully fixed variants (both lustre and nfs)

So, returning to the NFS side: the patches held up for 24 hours under various
loads with no crashes or other ill effects, so I think the NFS patches are
good to go.

The Lustre patches - I only tried my own, and those seem to be doing well too.
Do you want me to send you my Lustre patches just for this issue, or should I
send everything to Greg as usual?

Thanks.
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index d8015a03..5474e39 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1485,11 +1485,13 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 		    struct file *file, unsigned open_flags,
 		    umode_t mode, int *opened)
 {
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 	struct nfs_open_context *ctx;
 	struct dentry *res;
 	struct iattr attr = { .ia_valid = ATTR_OPEN };
 	struct inode *inode;
 	unsigned int lookup_flags = 0;
+	bool switched = false;
 	int err;
 
 	/* Expect a negative dentry */
@@ -1528,6 +1530,17 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 		attr.ia_size = 0;
 	}
 
+	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
+		d_drop(dentry);
+		switched = true;
+		dentry = d_alloc_parallel(dentry->d_parent,
+					  &dentry->d_name, &wq);
+		if (IS_ERR(dentry))
+			return PTR_ERR(dentry);
+		if (unlikely(!d_in_lookup(dentry)))
+			return finish_no_open(file, dentry);
+	}
+
 	ctx = create_nfs_open_context(dentry, open_flags);
 	err = PTR_ERR(ctx);
 	if (IS_ERR(ctx))
@@ -1563,14 +1576,23 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 	trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
 	put_nfs_open_context(ctx);
 out:
+	if (unlikely(switched)) {
+		d_lookup_done(dentry);
+		dput(dentry);
+	}
 	return err;
 
 no_open:
 	res = nfs_lookup(dir, dentry, lookup_flags);
-	err = PTR_ERR(res);
+	if (switched) {
+		d_lookup_done(dentry);
+		if (!res)
+			res = dentry;
+		else
+			dput(dentry);
+	}
 	if (IS_ERR(res))
-		goto out;
-
+		return PTR_ERR(res);
 	return finish_no_open(file, res);
 }
 EXPORT_SYMBOL_GPL(nfs_atomic_open);