Message ID | 20160229130924.GV17997@ZenIV.linux.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Al Viro <viro@ZenIV.linux.org.uk> wrote: > David, Linus, do you see any problems with that? To me it looks saner > that way and as cheap as the current code, but I might be missing something > here... You're effectively converting to this: read d_seq.sequence smp_rmb() read d_inode, d_flags smp_rmb() check d_seq.sequence in the read path and this: write d_seq.sequence smp_wmb() write d_inode, d_flags smp_wmb() write d_seq.sequence This should work - especially if we're wangling these sequence points anyway, and so have to pay the barrier penalties whatever. In fact, you actually take a barrier out, I think. I have had a problem with getting the ordering of d_inode and d_flags right because of __d_clear_type_and_inode() where we're required to unset a dentry so that it can be repurposed[*] as a negative dentry rather than replacing it. This is something we for performance sake - and it's something we can only do if the dentry isn't referenced, whereas if we properly followed the RCU model, we would have to wait a grace period after delisting an unlinked dentry before we could repurpose it - but that significantly slows down rename, unlink and rmdir. [*] Al and I disagree on whether this is a reuse or merely a change of state. David -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 29, 2016 at 2:09 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Sun, Feb 28, 2016 at 08:01:01PM +0000, Al Viro wrote: >> On Sun, Feb 28, 2016 at 05:01:34PM +0000, Al Viro wrote: >> >> > Erm... What's to order ->d_inode and ->d_flags fetches there? David? >> > Looks like the barrier in d_is_negative() is on the wrong side of fetch. >> > Confused... >> >> OK, as per David's suggestion, let's flip them around, bringing the >> barrier in d_is_negative() between them. Dmitry, could you try this on >> top of mainline? Again, it's until the first warning. > > Hmm... Reordering is definitely wrong, but what I really wonder is if > dentry_rcuwalk_invalidate() is right outside of __d_drop(). IOW, is > it right in __d_instantiate() and dentry_unlink_inode()? The code > dealing with ->d_flags in RCU mode is more interested in coherency between > ->d_flags and ->d_inode and it looks like we'd need *two* increments - > even-to-odd before updating both and odd-to-even after both are in sync. > The more I look at the situation with d_is_...() wrt barriers and ->d_seq, > the less I understand it; outside of RCU mode we don't really need the > barriers for that stuff and in RCU mode ->d_flags handling had been > a serious headache all along... > > I'm tempted to do as below; the amount of smp_wmb() remains the same and > so's the amount of stores (splitting that += 2 in two doesn't affect that - > we dirty the same cacheline before and after anyway). OTOH, that would > mean that ->d_seq match guarantees ->d_flags and ->d_inode being in sync. > And I suspect that we could drop _read_ barriers in d_is_...() after that; > in non-RCU mode we don't give a damn anyway and in RCU one ->d_seq check > would provide one; it doesn't really matter on x86, but smp_rmb() may be > costly. Splitting ->d_seq increments *does* matter on x86 wrt correctness; > in-between state becomes guaranteed ->d_seq mismatch and that just might > be it. Dmitry, could you add this on top of the previous patch? Regardless of whether reordering is wrong or not, do we see how it can fix the WARNINGs/oopses? Because it does seem to. I've tried to revert just this part: - *inode = d_backing_inode(dentry); negative = d_is_negative(dentry); + *inode = d_backing_inode(dentry); And got: [ 976.609688] WARNING: CPU: 0 PID: 12126 at fs/namei.c:1587 lookup_fast+0x3fa/0x450() [ 976.626768] WARNING: CPU: 0 PID: 12126 at fs/namei.c:3123 path_openat+0x12bc/0x1520() in 15 minutes. In particular, applying this on top the previous patch will be inconclusive, because I already don't see the warnings. > David, Linus, do you see any problems with that? To me it looks saner > that way and as cheap as the current code, but I might be missing something > here... > > diff --git a/fs/dcache.c b/fs/dcache.c > index 92d5140..2c08cce 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -279,7 +279,6 @@ static inline void __d_set_inode_and_type(struct dentry *dentry, > unsigned flags; > > dentry->d_inode = inode; > - smp_wmb(); > flags = READ_ONCE(dentry->d_flags); > flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); > flags |= type_flags; > @@ -300,7 +299,6 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry) > > flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); > WRITE_ONCE(dentry->d_flags, flags); > - smp_wmb(); > dentry->d_inode = NULL; > } > > @@ -370,9 +368,11 @@ static void dentry_unlink_inode(struct dentry * dentry) > __releases(dentry->d_inode->i_lock) > { > struct inode *inode = dentry->d_inode; > + > + raw_write_seqcount_begin(&dentry->d_seq); > __d_clear_type_and_inode(dentry); > hlist_del_init(&dentry->d_u.d_alias); > - dentry_rcuwalk_invalidate(dentry); > + raw_write_seqcount_end(&dentry->d_seq); > spin_unlock(&dentry->d_lock); > spin_unlock(&inode->i_lock); > if (!inode->i_nlink) > @@ -1758,8 +1758,9 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode) > spin_lock(&dentry->d_lock); > if (inode) > hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry); > + raw_write_seqcount_begin(&dentry->d_seq); > __d_set_inode_and_type(dentry, inode, add_flags); > - dentry_rcuwalk_invalidate(dentry); > + raw_write_seqcount_end(&dentry->d_seq); > spin_unlock(&dentry->d_lock); > fsnotify_d_instantiate(dentry, inode); > } -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 29, 2016 at 04:54:54PM +0100, Dmitry Vyukov wrote: > Regardless of whether reordering is wrong or not, do we see how it can > fix the WARNINGs/oopses? Because it does seem to. I've tried to revert > just this part: > > - *inode = d_backing_inode(dentry); > negative = d_is_negative(dentry); > + *inode = d_backing_inode(dentry); > > And got: > > [ 976.609688] WARNING: CPU: 0 PID: 12126 at fs/namei.c:1587 > lookup_fast+0x3fa/0x450() > [ 976.626768] WARNING: CPU: 0 PID: 12126 at fs/namei.c:3123 > path_openat+0x12bc/0x1520() > > in 15 minutes. dentry going from negative to positive lookup_fast() fetch NULL ->d_inode store non-NULL ->d_inode store new ->d_flags fetch new ->d_flags check ->d_seq bump ->d_seq by 2 Change the order of fetches and you'll get rid of that scenario. > In particular, applying this on top the previous patch will be > inconclusive, because I already don't see the warnings. Apply it with that reordering reversed, please. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 29, 2016 at 5:09 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > The more I look at the situation with d_is_...() wrt barriers and ->d_seq, > the less I understand it; outside of RCU mode we don't really need the > barriers for that stuff and in RCU mode ->d_flags handling had been > a serious headache all along... Yeah, one of my least favorite "recent" vfs improvements. > I'm tempted to do as below .. [ changing it to be unde the seqlock ] > > David, Linus, do you see any problems with that? To me it looks saner > that way and as cheap as the current code, but I might be missing something > here... I'd absolutely love to see this. The memory ordering for the flags updates and reading was always really confusing, and I hated how it was hidden inside the random access functions. And apparently it wasn't just confusing, it was buggy too. But I'd love it _more_ if this also means that we can get rid of the rmb's, which your patch didn't. Can we? Or does the ordering still remain for some other issue? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 29, 2016 at 08:45:37AM -0800, Linus Torvalds wrote: > > David, Linus, do you see any problems with that? To me it looks saner > > that way and as cheap as the current code, but I might be missing something > > here... > > I'd absolutely love to see this. The memory ordering for the flags > updates and reading was always really confusing, and I hated how it > was hidden inside the random access functions. And apparently it > wasn't just confusing, it was buggy too. > > But I'd love it _more_ if this also means that we can get rid of the > rmb's, which your patch didn't. Can we? Or does the ordering still > remain for some other issue? In __d_entry_type(), you mean? Should be, along with READ_ONCE() there. AFAICS, ordering shouldn't be an issue anymore... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 29, 2016 at 04:50:31PM +0000, Al Viro wrote: > In __d_entry_type(), you mean? Should be, along with READ_ONCE() there. > AFAICS, ordering shouldn't be an issue anymore... FWIW, I've pushed the candidate fix (including the above) into vfs.git#for-linus; I'd prefer to wait for confirmation that dcache.c part is sufficient to fix the problem (_without_ reordering in lookup_fast()) before sending a pull request, though. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 29, 2016 at 8:50 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > In __d_entry_type(), you mean? Should be, along with READ_ONCE() there. > AFAICS, ordering shouldn't be an issue anymore... That's the one. It results in those barriers in very subtle places, and it was always unclear whether any of the users of those "d_is_*()" helper functions really understood the subtle memory ordering involved. The smp_rmb() is also potentially quite expensive on some architectures. So getting rid of those hidden memory orderings would be a goodness quite apart from fixing the bug Dmitry found. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 29, 2016 at 5:19 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Mon, Feb 29, 2016 at 04:54:54PM +0100, Dmitry Vyukov wrote: > >> Regardless of whether reordering is wrong or not, do we see how it can >> fix the WARNINGs/oopses? Because it does seem to. I've tried to revert >> just this part: >> >> - *inode = d_backing_inode(dentry); >> negative = d_is_negative(dentry); >> + *inode = d_backing_inode(dentry); >> >> And got: >> >> [ 976.609688] WARNING: CPU: 0 PID: 12126 at fs/namei.c:1587 >> lookup_fast+0x3fa/0x450() >> [ 976.626768] WARNING: CPU: 0 PID: 12126 at fs/namei.c:3123 >> path_openat+0x12bc/0x1520() >> >> in 15 minutes. > > dentry going from negative to positive lookup_fast() > fetch NULL ->d_inode > store non-NULL ->d_inode > store new ->d_flags > fetch new ->d_flags > check ->d_seq > bump ->d_seq by 2 > > Change the order of fetches and you'll get rid of that scenario. > >> In particular, applying this on top the previous patch will be >> inconclusive, because I already don't see the warnings. > > Apply it with that reordering reversed, please. OK, just wanted to make sure that we keep track of the situation. Restarted testing with combined patch. Here is it just in case: https://gist.githubusercontent.com/dvyukov/67fe363d5ce2e2b06c71/raw/4d1b6c23f8dff7e0f8e2e3cab7e50208fddb0570/gistfile1.txt -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 29, 2016 at 7:19 PM, Dmitry Vyukov <dvyukov@google.com> wrote: > On Mon, Feb 29, 2016 at 5:19 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: >> On Mon, Feb 29, 2016 at 04:54:54PM +0100, Dmitry Vyukov wrote: >> >>> Regardless of whether reordering is wrong or not, do we see how it can >>> fix the WARNINGs/oopses? Because it does seem to. I've tried to revert >>> just this part: >>> >>> - *inode = d_backing_inode(dentry); >>> negative = d_is_negative(dentry); >>> + *inode = d_backing_inode(dentry); >>> >>> And got: >>> >>> [ 976.609688] WARNING: CPU: 0 PID: 12126 at fs/namei.c:1587 >>> lookup_fast+0x3fa/0x450() >>> [ 976.626768] WARNING: CPU: 0 PID: 12126 at fs/namei.c:3123 >>> path_openat+0x12bc/0x1520() >>> >>> in 15 minutes. >> >> dentry going from negative to positive lookup_fast() >> fetch NULL ->d_inode >> store non-NULL ->d_inode >> store new ->d_flags >> fetch new ->d_flags >> check ->d_seq >> bump ->d_seq by 2 >> >> Change the order of fetches and you'll get rid of that scenario. >> >>> In particular, applying this on top the previous patch will be >>> inconclusive, because I already don't see the warnings. >> >> Apply it with that reordering reversed, please. > > > OK, just wanted to make sure that we keep track of the situation. > Restarted testing with combined patch. Here is it just in case: > https://gist.githubusercontent.com/dvyukov/67fe363d5ce2e2b06c71/raw/4d1b6c23f8dff7e0f8e2e3cab7e50208fddb0570/gistfile1.txt No warnings/crashes in 15 hours on 3 VMs! -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/dcache.c b/fs/dcache.c index 92d5140..2c08cce 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -279,7 +279,6 @@ static inline void __d_set_inode_and_type(struct dentry *dentry, unsigned flags; dentry->d_inode = inode; - smp_wmb(); flags = READ_ONCE(dentry->d_flags); flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); flags |= type_flags; @@ -300,7 +299,6 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry) flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); WRITE_ONCE(dentry->d_flags, flags); - smp_wmb(); dentry->d_inode = NULL; } @@ -370,9 +368,11 @@ static void dentry_unlink_inode(struct dentry * dentry) __releases(dentry->d_inode->i_lock) { struct inode *inode = dentry->d_inode; + + raw_write_seqcount_begin(&dentry->d_seq); __d_clear_type_and_inode(dentry); hlist_del_init(&dentry->d_u.d_alias); - dentry_rcuwalk_invalidate(dentry); + raw_write_seqcount_end(&dentry->d_seq); spin_unlock(&dentry->d_lock); spin_unlock(&inode->i_lock); if (!inode->i_nlink) @@ -1758,8 +1758,9 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode) spin_lock(&dentry->d_lock); if (inode) hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry); + raw_write_seqcount_begin(&dentry->d_seq); __d_set_inode_and_type(dentry, inode, add_flags); - dentry_rcuwalk_invalidate(dentry); + raw_write_seqcount_end(&dentry->d_seq); spin_unlock(&dentry->d_lock); fsnotify_d_instantiate(dentry, inode); }