diff mbox

VFS: Add back check for !inode in walk_component()

Message ID 20150507125241.4da739ac@gandalf.local.home (mailing list archive)
State New, archived
Headers show

Commit Message

Steven Rostedt May 7, 2015, 4:52 p.m. UTC
Commit 698934df8b45 "VFS: Combine inode checks with d_is_negative() and
d_is_positive() in pathwalk" removed a check for inode being NULL in
walk_component() where the type is tested. Stressing my tracefs create
and remove instances while reading the files now triggers this:

BUG: unable to handle kernel NULL pointer dereference at 0000001c
IP: [<c05383a6>] inode_permission+0x2d/0x6c
*pdpt = 0000000030d9a001 *pde = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 microcode ppdev parport_pc parport r8169
CPU: 0 PID: 3201 Comm: ftrace-test-mki Not tainted 4.1.0-rc2-test+ #94
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
task: efa2ac20 ti: ed7a8000 task.ti: ed7a8000
EIP: 0060:[<c05383a6>] EFLAGS: 00010282 CPU: 0
EIP is at inode_permission+0x2d/0x6c
EAX: 00000001 EBX: 00000000 ECX: 00000006 EDX: 00000081
ESI: 00000000 EDI: ef92201b EBP: ed7a9e0c ESP: ed7a9df8
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: 0000001c CR3: 31cb4c20 CR4: 001407f0
Stack:
 c0538379 c101c228 00000000 00000081 ed7a9ef8 ed7a9e64 c0538c7e c0538c33
 c1012e98 00000000 ed7a9ef8 00000000 00000006 efa2ac20 efa2ac20 e1919d01
 00000000 c04767e1 ed7a9e64 c05374fb f1392210 ee3c0240 00000000 c05391b5
Call Trace:
 [<c0538379>] ? __inode_permission+0x91/0x91
 [<c0538c7e>] link_path_walk+0x7a/0x3db
 [<c0538c33>] ? link_path_walk+0x2f/0x3db
 [<c04767e1>] ? trace_hardirqs_on+0xb/0xd
 [<c05374fb>] ? read_seqcount_begin+0x6a/0x77
 [<c05391b5>] ? path_init+0x1d6/0x326
 [<c05392f9>] path_init+0x31a/0x326
 [<c0538fdf>] ? link_path_walk+0x3db/0x3db
 [<c05312a3>] ? get_empty_filp+0x128/0x190
 [<c053ad56>] path_openat+0x1a3/0x3da
 [<c0408c1a>] ? native_sched_clock+0x46/0x4b
 [<c053bcdf>] do_filp_open+0x2e/0x6f
 [<c052f591>] do_sys_open+0x7c/0x108
 [<c052f558>] ? do_sys_open+0x43/0x108
 [<c0cc4f87>] ? sysenter_exit+0xf/0x16
 [<c052f63d>] SyS_open+0x20/0x22
 [<c0cc4f58>] sysenter_do_call+0x12/0x12
Code: e5 53 83 ec 10 3e 8d 74 26 00 89 c3 89 44 24 08 a1 d8 7b 25 c1 89 55 f8 c7 04 24 79 83 53 c0 89 44 24 04 e8 0e b0 f9 ff 8b 55 f8 <8b> 4b 1c f6 c2 02 74 2e f6 41 30 01 66 8b 03 74 25 25 00 f0 00
EIP: [<c05383a6>] inode_permission+0x2d/0x6c SS:ESP 0068:ed7a9df8
CR2: 000000000000001c
---[ end trace 54b6a3ccfbef84c6 ]---

Adding a bunch of debug I found that the race is the following:

  CPU1				CPU2
  ----				----
			    mkdir(foo)
			      d_instantiate(dentry, inode);
				spin_lock(inode->i_lock);
				  spin_lock(dentry->d_lock);
				    __d_set_inode_and_type();
link_path_walk()
  walk_component()
    lookup_fast(nd, path, &inode);
      *inode = path->d_entry->d_inode; (inode == NULL)

				      dentry->d_inode = inode;
				      smp_wmb();
				      dentry->d_flags = flags;

    if (d_is_negative(path->d_entry))
      [ fails ]

    [...]

    nd->inode = inode; (where inode is NULL);

Then in the next loop of link_path_walk()

  err = may_lookup(nd);
    inode_permission(nd->inode...)
      reference to nd->inode->i_sb (BOOM!)

Without this patch I can easily cause the bug, with this patch, I have
yet to trigger it.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 fs/namei.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Al Viro May 7, 2015, 5:28 p.m. UTC | #1
On Thu, May 07, 2015 at 12:52:41PM -0400, Steven Rostedt wrote:
> 
> Commit 698934df8b45 "VFS: Combine inode checks with d_is_negative() and
> d_is_positive() in pathwalk" removed a check for inode being NULL in
> walk_component() where the type is tested. Stressing my tracefs create
> and remove instances while reading the files now triggers this:

So you get NULL ->d_inode with stale flags?  The thing is, ->d_inode
becoming NULL should happen via d_delete(), which goes throug this:
        unsigned flags = READ_ONCE(dentry->d_flags);

        flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
        WRITE_ONCE(dentry->d_flags, flags);
        smp_wmb();
        dentry->d_inode = NULL;

and after that assignment to ->d_flags you'll see d_is_negative() being
true.  OTOH, we have
                *inode = dentry->d_inode;
                if (read_seqcount_retry(&dentry->d_seq, seq))
in lookup_fast(), and read_seqcount_retry() is
{
        smp_rmb();
        return __read_seqcount_retry(s, start);
}

IOW, we have smp_rmb() between fetching ->d_inode and checking ->d_flags.

If you can reproduce that at will, could you make it dump nd->flags along with
dentry involved?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steven Rostedt May 7, 2015, 5:39 p.m. UTC | #2
On Thu, 7 May 2015 18:28:34 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Thu, May 07, 2015 at 12:52:41PM -0400, Steven Rostedt wrote:
> > 
> > Commit 698934df8b45 "VFS: Combine inode checks with d_is_negative() and
> > d_is_positive() in pathwalk" removed a check for inode being NULL in
> > walk_component() where the type is tested. Stressing my tracefs create
> > and remove instances while reading the files now triggers this:
> 
> So you get NULL ->d_inode with stale flags?  The thing is, ->d_inode
> becoming NULL should happen via d_delete(), which goes throug this:

But it's not the delete, it's the creation of a new d_entry. Pehaps it
gets linked premature? The tracing I had happened in
tracefs_create_file().


>         unsigned flags = READ_ONCE(dentry->d_flags);
> 
>         flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
>         WRITE_ONCE(dentry->d_flags, flags);
>         smp_wmb();
>         dentry->d_inode = NULL;
> 
> and after that assignment to ->d_flags you'll see d_is_negative() being
> true.  OTOH, we have
>                 *inode = dentry->d_inode;
>                 if (read_seqcount_retry(&dentry->d_seq, seq))
> in lookup_fast(), and read_seqcount_retry() is
> {
>         smp_rmb();
>         return __read_seqcount_retry(s, start);
> }
> 
> IOW, we have smp_rmb() between fetching ->d_inode and checking ->d_flags.
> 
> If you can reproduce that at will, could you make it dump nd->flags along with
> dentry involved?

I had them printed in my previous traces. The flags were 0x200088, and
they were 0 just before the call.

 ftrace-test-mki-3201  [000]    85.306761: bprint:               walk_component: inode=(nil) dentry=0xee3c0240 neg:1
 ftrace-test-mki-3201  [000]    85.306761: bprint:               walk_component: inode=(nil) dentry=0xee3c0240 neg:0 flags:200088

That was with this:

        trace_printk("inode=%p dentry=%p neg:%d\n", inode, path->dentry,
                     d_is_negative(path->dentry));
        err = -ENOENT;
        if (d_is_negative(path->dentry))
                goto out_path_put;
        trace_printk("inode=%p dentry=%p neg:%d flags:%lx\n", inode, path->dentry,
                     d_is_negative(path->dentry), (long)path->dentry->d_flags);



-- Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steven Rostedt May 7, 2015, 5:43 p.m. UTC | #3
On Thu, 7 May 2015 13:39:35 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 7 May 2015 18:28:34 +0100
> Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> > On Thu, May 07, 2015 at 12:52:41PM -0400, Steven Rostedt wrote:
> > > 
> > > Commit 698934df8b45 "VFS: Combine inode checks with d_is_negative() and
> > > d_is_positive() in pathwalk" removed a check for inode being NULL in
> > > walk_component() where the type is tested. Stressing my tracefs create
> > > and remove instances while reading the files now triggers this:
> > 
> > So you get NULL ->d_inode with stale flags?  The thing is, ->d_inode
> > becoming NULL should happen via d_delete(), which goes throug this:
> 
> But it's not the delete, it's the creation of a new d_entry. Pehaps it
> gets linked premature? The tracing I had happened in
> tracefs_create_file().
> 

Note, this could be because of my special hack to have mkdir create
files too. We never came up with a clean solution for that.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro May 7, 2015, 6:13 p.m. UTC | #4
On Thu, May 07, 2015 at 01:39:35PM -0400, Steven Rostedt wrote:
> I had them printed in my previous traces. The flags were 0x200088, and
> they were 0 just before the call.

Not dentry->d_flags, nd->flags.  Most interesting part is bit 6 in those
(LOOKUP_RCU, 0x40).

As for creation...  I think I see what might be going on:

A: finds a negative dentry, picks NULL ->d_inode from it and whatever
->d_seq it had.
B: d_instantiate(): sets ->d_inode non-NULL, ->d_flags accordingly and
bumps ->d_seq.
A: fetches ->d_flags, sees non-negative, assumes ->d_inode is non-NULL.

In reality, the last assumption should've been "->d_inode is non-NULL or
we have a stale ->d_seq and will end up discarding that fscker anyway".

Hmm...  Smells like we ought to
a) in lookup_fast() turn
                if (read_seqcount_retry(&dentry->d_seq, seq))
                        return -ECHILD;
into
		if (unlikely(d_is_negative(dentry))) {
			if (read_seqcount_retry(&dentry->d_seq, seq))
				return -ECHILD;
			else
				return -ENOENT;
		}
		if (read_seqcount_retry(&dentry->d_seq, seq))
			return -ECHILD;
and
        if (likely(!err))
                *inode = path->dentry->d_inode;
into
	if (likely(!err)) {
                *inode = path->dentry->d_inode;
		if (unlikely(d_is_negative(dentry))) {
			path_to_nameidata(path, nd);
			err = -ENOENT;
		}
	}
b) in walk_component() and do_last():finish_lookup move the d_is_negative()
checks a bit up - into the body of preceding if () in the former and just
prior to the finish_lookup: in the latter.

AFAICS, the rest of d_is_negative() in fs/namei.c doesn't suffer that kind
of problem...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steven Rostedt May 7, 2015, 7:20 p.m. UTC | #5
On Thu, 7 May 2015 19:13:35 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Thu, May 07, 2015 at 01:39:35PM -0400, Steven Rostedt wrote:
> > I had them printed in my previous traces. The flags were 0x200088, and
> > they were 0 just before the call.
> 
> Not dentry->d_flags, nd->flags.  Most interesting part is bit 6 in those
> (LOOKUP_RCU, 0x40).

nd->flags == 0x51

-- Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/namei.c b/fs/namei.c
index 4a8d998b7274..cdd066680de9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1585,7 +1585,7 @@  static inline int walk_component(struct nameidata *nd, struct path *path,
 		inode = path->dentry->d_inode;
 	}
 	err = -ENOENT;
-	if (d_is_negative(path->dentry))
+	if (!inode || d_is_negative(path->dentry))
 		goto out_path_put;
 
 	if (should_follow_link(path->dentry, follow)) {