v4.2-rc dcache regression, probably 75a6f82a0d10

Message ID	20150801072603.GV17109@ZenIV.linux.org.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> Date: Sat, 1 Aug 2015 08:26:03 +0100 From: Al Viro <viro@ZenIV.linux.org.uk> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dominique Martinet <asmadeus@codewreck.org>, Hugh Dickins <hughd@google.com>, "J. Bruce Fields" <bfields@fieldses.org>, Dominique Martinet <dominique.martinet@cea.fr>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, David Howells <dhowells@redhat.com> Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10 Message-ID: <20150801072603.GV17109@ZenIV.linux.org.uk> References: <alpine.LSU.2.11.1507311022320.10162@eggly.anvils> <CA+55aFxGp3Hn-CeFE1NTYDr7vt3xjqe2mV1BbZ2tYHpCxenuAA@mail.gmail.com> <alpine.LSU.2.11.1507311207160.11122@eggly.anvils> <20150731205036.GA3752@nautica> <CA+55aFz9jj_wAjXNoFbKFbV3M1Ki2KhHuoMwQGPi0wiQjMXdsw@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <CA+55aFz9jj_wAjXNoFbKFbV3M1Ki2KhHuoMwQGPi0wiQjMXdsw@mail.gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Al Viro Aug. 1, 2015, 7:26 a.m. UTC

On Fri, Jul 31, 2015 at 03:52:38PM -0700, Linus Torvalds wrote:

> Is that correct? Maybe, I haven't checked. And maybe it's a big bad
> bug. Regardless, it sure as hell isn't just changing the order of the
> access to those fields. That "DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU"
> clearing came from __d_instantiate(), but now it hits __d_obtain_alias
> too.

Actually, the shit had hit the fan earlier.  Look: in
commit b18825a7c8e37a7cf6abb97a12a6ad71af160de7
Author: David Howells <dhowells@redhat.com>
Date:   Thu Sep 12 19:22:53 2013 +0100

    VFS: Put a small type field into struct dentry::d_flags

we have this:
@@ -1823,7 +1794,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
                        if (err)
                                return err;
                }
-               if (!can_lookup(nd->inode)) {
+               if (!d_is_directory(nd->path.dentry)) {
                        err = -ENOTDIR; 
                        break;
                }

And that has turned the check done to an inode that *was* ours at some
point (i.e. fetching it had been followed by checking that ->d_seq had
been still valid) into something completely unprotected.  Suppose we
are in lazy mode and somebody had evicted nd->path.dentry after we'd looked
it up and before that check.  Sure, its ->d_seq had been bumped by that,
and we would've failed anyway.  With ECHILD.  Which, unlike ENOTDIR, is
"repeat in non-lazy mode".

AFAICS, that's where the problem is.  It affects only RCU mode and only
the places where dentry isn't pinned.  That place in link_path_walk() is
trivial - we just need to do
                if (unlikely(!d_can_lookup(nd->path.dentry))) {
			if (nd->flags & LOOKUP_RCU) {
				if (unlazy_walk(nd, NULL, 0))
					return -ECHILD;
			}
			return -ENOTDIR;
		}
there.  AFAICS, other places of that sort are not a problem anymore.

Folks, could you check if this fixes the problems you are seeing?

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dominique Martinet Aug. 1, 2015, 10:19 a.m. UTC | #1

Al Viro wrote on Sat, Aug 01, 2015:
> And that has turned the check done to an inode that *was* ours at some
> point (i.e. fetching it had been followed by checking that ->d_seq had
> been still valid) into something completely unprotected.  Suppose we
> are in lazy mode and somebody had evicted nd->path.dentry after we'd looked
> it up and before that check.  Sure, its ->d_seq had been bumped by that,
> and we would've failed anyway.  With ECHILD.  Which, unlike ENOTDIR, is
> "repeat in non-lazy mode".

That sounds like a good find, I was looking at how to claim/protect the
entry somehow as well but I just have no idea...

> Folks, could you check if this fixes the problems you are seeing?
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index ae4e4c1..b16c3a7 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1954,7 +1954,11 @@ OK:
>  				continue;
>  			}
>  		}
> -		if (unlikely(!d_can_lookup(nd->path.dentry)))
> +		if (unlikely(!d_can_lookup(nd->path.dentry))) {
> +			if (nd->flags & LOOKUP_RCU) {
> +				if (unlazy_walk(nd, NULL, 0))
> +					return -ECHILD;
> +			}
>  			return -ENOTDIR;
>  	}
>  }

Unfortunately, still happens for me.

I had to adapt a bit because using an old kernel (4bf46a272), will try
again with a recent master to doublecheck, but I had a break on
the "if (nd->flags & LOOKUP_RCU)" check:
 - sometimes fails without ever hitting the check. I think this fixes
the "ENOTDIR" I had described, but there's at least another way to
fail?
 - When we do hit it, we're into LOOKUP_RCU at this point alright,
unlazy_walk fails and we try again without RCU -- can confirm the
recovery process goes OK (well, that it went OK at least once)


Thanks,

Dominique Martinet Aug. 1, 2015, 10:50 a.m. UTC | #2

Dominique Martinet wrote on Sat, Aug 01, 2015:
> I had to adapt a bit because using an old kernel (4bf46a272), will try
> again with a recent master to doublecheck

There have been more changes than what I thought, can't seem to
reproduce in a while on linus' HEAD with that fix (it fell in that
check quite a few times already).

I'll let test run all weekend to be safe, but it looks good to me!

Linus Torvalds Aug. 1, 2015, 4:09 p.m. UTC | #3

On Sat, Aug 1, 2015 at 12:26 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Actually, the shit had hit the fan earlier.  Look: in
> commit b18825a7c8e37a7cf6abb97a12a6ad71af160de7
> Author: David Howells <dhowells@redhat.com>
> Date:   Thu Sep 12 19:22:53 2013 +0100
>
>     VFS: Put a small type field into struct dentry::d_flags
>
> we have this:
>
> -               if (!can_lookup(nd->inode)) {
> +               if (!d_is_directory(nd->path.dentry)) {

Ahh. That's subtle, yes. Inodes are stable in ways that dentries
aren't. And the reason why Dominique bisected it to 4bf46a272647 would
seem to be that while dentries aren't really stable, the dentry flags
generally don't change. But dentry_iput() changed to actually clear
the type when clearing the inode, so that probably added a few cases
where it went from "stable in practice" to be more easily triggered.

Your patch looks obviously correct, with the slight worry that there
might be other cases of this.

> there.  AFAICS, other places of that sort are not a problem anymore.

Al, do you plan a pull request? It would be good to get this into rc5
(tomorrow) regardless of whether there might be other issues lurking
too.

                   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Aug. 1, 2015, 5:09 p.m. UTC | #4

On Sat, Aug 01, 2015 at 09:09:24AM -0700, Linus Torvalds wrote:
> On Sat, Aug 1, 2015 at 12:26 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Actually, the shit had hit the fan earlier.  Look: in
> > commit b18825a7c8e37a7cf6abb97a12a6ad71af160de7
> > Author: David Howells <dhowells@redhat.com>
> > Date:   Thu Sep 12 19:22:53 2013 +0100
> >
> >     VFS: Put a small type field into struct dentry::d_flags
> >
> > we have this:
> >
> > -               if (!can_lookup(nd->inode)) {
> > +               if (!d_is_directory(nd->path.dentry)) {
> 
> Ahh. That's subtle, yes. Inodes are stable in ways that dentries
> aren't. And the reason why Dominique bisected it to 4bf46a272647 would
> seem to be that while dentries aren't really stable, the dentry flags
> generally don't change. But dentry_iput() changed to actually clear
> the type when clearing the inode, so that probably added a few cases
> where it went from "stable in practice" to be more easily triggered.
> 
> Your patch looks obviously correct, with the slight worry that there
> might be other cases of this.
> 
> > there.  AFAICS, other places of that sort are not a problem anymore.
> 
> Al, do you plan a pull request? It would be good to get this into rc5
> (tomorrow) regardless of whether there might be other issues lurking
> too.

Will do later today.  -stable branches will be interesting, though ;-/

Al, still digging through the loads of fun stuff in l-k mailbox...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Aug. 2, 2015, 12:14 a.m. UTC | #5

On Sat, Aug 01, 2015 at 09:09:24AM -0700, Linus Torvalds wrote:

> Al, do you plan a pull request? It would be good to get this into rc5
> (tomorrow) regardless of whether there might be other issues lurking
> too.

Please, pull from the usual place:
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

Shortlog:
Al Viro (1):
      link_path_walk(): be careful when failing with ENOTDIR

Diffstat:
 fs/namei.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Aug. 2, 2015, 12:23 a.m. UTC | #6

On Sun, Aug 02, 2015 at 01:14:02AM +0100, Al Viro wrote:
> On Sat, Aug 01, 2015 at 09:09:24AM -0700, Linus Torvalds wrote:
> 
> > Al, do you plan a pull request? It would be good to get this into rc5
> > (tomorrow) regardless of whether there might be other issues lurking
> > too.
> 
> Please, pull from the usual place:
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus
> 
> Shortlog:
> Al Viro (1):
>       link_path_walk(): be careful when failing with ENOTDIR
> 
> Diffstat:
>  fs/namei.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)

Diffstat:
 fs/namei.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

actually.  Note to self: git diff HEAD *before* pushing, to verify that
no uncommitted typo fixes are sitting around...

My apologies.  Branch head should be at 97242f9, just to make sure you get
the right one...

BTW, is there any convenient way to have git itself check things like
that?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds Aug. 2, 2015, 12:42 a.m. UTC | #7

On Sat, Aug 1, 2015 at 5:23 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> BTW, is there any convenient way to have git itself check things like
> that?

I guess you could just do a "pre-push" hook, and have it exit with an
error if the tree isn't clean (to fail the push) or at least warn.

See "man githooks" for details.

             Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds Aug. 2, 2015, 12:57 a.m. UTC | #8

On Sat, Aug 1, 2015 at 5:23 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Branch head should be at 97242f9, just to make sure you get
> the right one...

Ok, merged in my tree.

However, looking at this, I'm struck by how all the callers of
"link_path_walk()" tend to have very similar patterns wrt error
handling.

And I'm wondering - wouldn't it be nicer to extend that pattern a bit
more, and make the *callers* of link_path_walk() all do

        if (error) {
                if (nd->flags & LOOKUP_RCU) {
                        if (unlazy_walk(nd, NULL, 0))
                                error = -ECHILD;
                }
        }

and maybe even make that part of terminate_walk() that everybody calls
after getting here.

Because it's not just that "!d_can_lookup()" case that triggers it,
you also have that pattern in the RCU error case for may_lookup(), and
get_link().

So why don't we make the rule that *every* single error we get during
an RCU walk should do that unlazy_walk() and turn the error into
ECHILD on failure. Hmm? We're almost there as-is.

             Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Aug. 2, 2015, 1:41 a.m. UTC | #9

On Sat, Aug 01, 2015 at 05:57:44PM -0700, Linus Torvalds wrote:

> Because it's not just that "!d_can_lookup()" case that triggers it,
> you also have that pattern in the RCU error case for may_lookup(), and
> get_link().

It feels like it might make sense to handle that in caller, but...
that goes only for cases when we are *NOT* going to continue after
successful transition to non-lazy mode.  And these two are not of
that sort - we do want to continue rather than restart everything
from scratch.

BTW, unlazy_walk() has too many arguments, all for the sake of one caller
(everything except lookup_fast() calls it with (nd, NULL, 0) as arguments)
and it might make sense to split the damn thing in two.  I have that in
a pending pile since the last cycle, but back then you have asked to stop
piling them up and let it settle, so I'd postponed that one along with other
cleanups...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds Aug. 2, 2015, 2:39 a.m. UTC | #10

On Sat, Aug 1, 2015 at 6:41 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> It feels like it might make sense to handle that in caller, but...
> that goes only for cases when we are *NOT* going to continue after
> successful transition to non-lazy mode.  And these two are not of
> that sort - we do want to continue rather than restart everything
> from scratch.

Ahh. Yes. I didn't notice that they actually don't return an error at
all if the unlazy_walk succeeds, but just continue.

Ok, so they really are very different.

> BTW, unlazy_walk() has too many arguments, all for the sake of one caller
> (everything except lookup_fast() calls it with (nd, NULL, 0) as arguments)
> and it might make sense to split the damn thing in two.  I have that in
> a pending pile since the last cycle, but back then you have asked to stop
> piling them up and let it settle, so I'd postponed that one along with other
> cleanups...

Well, I'd not be against continuing cleanups for 4.3... Well, as long
as we can make sure 4.2 is solid first, of course. I'd still like to
have Hugh verify that the current -git tree works for his load, but
obviously that wasn't easily reproducible, so that will presumably
take a few days. Dominique seems to at least not see it any more with
that patch.

               Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hugh Dickins Aug. 2, 2015, 4:06 a.m. UTC | #11

On Sat, 1 Aug 2015, Linus Torvalds wrote:
> 
> Well, I'd not be against continuing cleanups for 4.3... Well, as long
> as we can make sure 4.2 is solid first, of course. I'd still like to
> have Hugh verify that the current -git tree works for his load, but
> obviously that wasn't easily reproducible, so that will presumably
> take a few days. Dominique seems to at least not see it any more with
> that patch.

I've had Al's patch under load since this morning, and it's going fine.
But I've not actually managed to reproduce the issue for several days
now, however bogus a "fix" I've been testing.  So certainly don't wait
to hear from me: but of course I'll keep on with the loads, and shout
if it does go wrong again.

(I don't actually understand why the clearing of DCACHE_ENTRY_TYPE in
dentry_iput() is not of continuing concern; but don't worry, there's
plenty I don't understand - so long as you're both satisfied that
it's not a concern, no need to persuade me.)

Do we have any idea why a bug introduced in v3.13 should only now
stand out, both for Dominique and for me?  Has the RCU lookup somehow
become much more effective recently?  (I don't think symlinks were a
big deal for either of us, though the kernel build does use some).

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Aug. 2, 2015, 4:39 a.m. UTC | #12

On Sat, Aug 01, 2015 at 09:06:55PM -0700, Hugh Dickins wrote:

> (I don't actually understand why the clearing of DCACHE_ENTRY_TYPE in
> dentry_iput() is not of continuing concern; but don't worry, there's
> plenty I don't understand - so long as you're both satisfied that
> it's not a concern, no need to persuade me.)

Because before we even get to dentry_iput(), we evict the fucker from hash.
And that will do dentry_rcuwalk_invalidate(dentry), which will bump ->d_seq
*AFTER* having it unhashed.  Now look at __d_lookup_rcu() - there we fetch
->d_seq, then verify that it's still hashed.

So having hit dentry_iput() means that everyone who'd found it via RCU
lookup will be guaranteed a ->d_seq mismatch.  The same goes for things
like d_drop() and d_instantiate().  Look for dentry_rcuwalk_invalidate()
callers in there...

Clearing DCACHE_ENTRY_TYPE there is fine - dentry *is* made negative there,
after all.  What we want is to have ->d_inode stable at least as long as
->d_seq remains so.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds Aug. 2, 2015, 4:42 a.m. UTC | #13

On Sat, Aug 1, 2015 at 9:06 PM, Hugh Dickins <hughd@google.com> wrote:
>
> (I don't actually understand why the clearing of DCACHE_ENTRY_TYPE in
> dentry_iput() is not of continuing concern; but don't worry, there's
> plenty I don't understand - so long as you're both satisfied that
> it's not a concern, no need to persuade me.)

So dentry_iput() is only called as the dentry is being thrown away,
and is stale.

Yes, such a stale dentry can be seen by an RCU lookup, but the RCU
lookups should always revalidate things after the lookup, so it
shouldn't matter. The problem here was that there was a missing
revalidate of the RCU lookup for an error case, so the error that
_should_ have been a harmless race that got handled later by the
proper validation instead turned into a real user-visible error.

But we didn't use to clear the flags in dentry_iput, so before things
generally "happened to work" anyway, because this rare error case
didn't actually ever trigger in the first place.

(And I still don't think we necessarily *should* clear the flags in
dentry_iput(), but it really shouldn't be a correctness issue)

> Do we have any idea why a bug introduced in v3.13 should only now
> stand out, both for Dominique and for me?  Has the RCU lookup somehow
> become much more effective recently?

So I do think that the clearing of the dentry flags exposed a
situation that was harder to hit before.

The fact that we now do RCU lookups even over symlinks probably does
end up widening the possibilities for this happening too, although as
you say, that shouldn't be very common during a kernel build.

                      Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hugh Dickins Aug. 2, 2015, 6:53 p.m. UTC | #14

On Sat, 1 Aug 2015, Linus Torvalds wrote:
> On Sat, Aug 1, 2015 at 9:06 PM, Hugh Dickins <hughd@google.com> wrote:
> >
> > (I don't actually understand why the clearing of DCACHE_ENTRY_TYPE in
> > dentry_iput() is not of continuing concern; but don't worry, there's
> > plenty I don't understand - so long as you're both satisfied that
> > it's not a concern, no need to persuade me.)
> 
> So dentry_iput() is only called as the dentry is being thrown away,
> and is stale.
> 
> Yes, such a stale dentry can be seen by an RCU lookup, but the RCU
> lookups should always revalidate things after the lookup, so it
> shouldn't matter. The problem here was that there was a missing
> revalidate of the RCU lookup for an error case, so the error that
> _should_ have been a harmless race that got handled later by the
> proper validation instead turned into a real user-visible error.

Thank you both for leading me through that: I really should have
rechecked the sequence count invalidation in the source for myself
(I had a wrong picture of it in my head), before inserting that
parenthesis and taking your time over it; but had been in a hurry
to get a response back.

> 
> But we didn't use to clear the flags in dentry_iput, so before things
> generally "happened to work" anyway, because this rare error case
> didn't actually ever trigger in the first place.
> 
> (And I still don't think we necessarily *should* clear the flags in
> dentry_iput(), but it really shouldn't be a correctness issue)
> 
> > Do we have any idea why a bug introduced in v3.13 should only now
> > stand out, both for Dominique and for me?  Has the RCU lookup somehow
> > become much more effective recently?
> 
> So I do think that the clearing of the dentry flags exposed a
> situation that was harder to hit before.

Right, that does indeed make sense of why it appeared now.

I cannot actually report success from yesterday's testing, since
it hung after 20 hours for, I believe, the same unrelated reason
that I ran into before.

I mentioned jbd2 last time, but I doubt that's at fault: it's almost
certainly an issue with recent vmscan changes and/or recent loop
changes - the business of page reclaim waiting on page writeback
has always been tricky and fragile and deadlock-prone, the more
so when loop is involved: probably the balance has got shifted
slightly by recent changes, I'll look into it (but definitely
not rc5 material).

Thanks,
Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

v4.2-rc dcache regression, probably 75a6f82a0d10

Commit Message

Comments

Patch