fs: NULL deref in atime_needs_update

Message ID	20160228200100.GP17997@ZenIV.linux.org.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> Date: Sun, 28 Feb 2016 20:01:01 +0000 From: Al Viro <viro@ZenIV.linux.org.uk> To: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Kent <raven@themaw.net>, =?iso-8859-1?Q?Micka=EBl_Sala=FCn?= <mic@digikod.net>, "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, syzkaller <syzkaller@googlegroups.com>, Kostya Serebryany <kcc@google.com>, Alexander Potapenko <glider@google.com>, Sasha Levin <sasha.levin@oracle.com>, Linus Torvalds <torvalds@linux-foundation.org>, David Howells <dhowells@redhat.com> Subject: Re: fs: NULL deref in atime_needs_update Message-ID: <20160228200100.GP17997@ZenIV.linux.org.uk> References: <20160224151511.GR17997@ZenIV.linux.org.uk> <CACT4Y+amZqudFFpc5R_vd+G_6i1Tq8LDNvX_2qwjoiiTFO4DxQ@mail.gmail.com> <20160225163927.GW17997@ZenIV.linux.org.uk> <20160226212137.GF17997@ZenIV.linux.org.uk> <CACT4Y+Z2RmFvtxdRY1=3h2sXMwmafFq+qgcZWGOoeZ79Yrwxzw@mail.gmail.com> <20160226220759.GH17997@ZenIV.linux.org.uk> <20160227222748.GI17997@ZenIV.linux.org.uk> <CACT4Y+beVWKoQJ8Wfo4i8ij067hQLGmdcnrfP8CU-3eECgOLFQ@mail.gmail.com> <CACT4Y+YztEDaepwKnzOnh=0+BN1NdpoCVOE-FLXquvKfNgndMg@mail.gmail.com> <20160228170133.GM17997@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160228170133.GM17997@ZenIV.linux.org.uk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Message ID

20160228200100.GP17997@ZenIV.linux.org.uk (mailing list archive)

State

New, archived

Headers

Date: Sun, 28 Feb 2016 20:01:01 +0000
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Kent <raven@themaw.net>,
	=?iso-8859-1?Q?Micka=EBl_Sala=FCn?= <mic@digikod.net>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	syzkaller <syzkaller@googlegroups.com>,
	Kostya Serebryany <kcc@google.com>,
	Alexander Potapenko <glider@google.com>,
	Sasha Levin <sasha.levin@oracle.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	David Howells <dhowells@redhat.com>
Subject: Re: fs: NULL deref in atime_needs_update
Message-ID: <20160228200100.GP17997@ZenIV.linux.org.uk>
References: <20160224151511.GR17997@ZenIV.linux.org.uk>
	<CACT4Y+amZqudFFpc5R_vd+G_6i1Tq8LDNvX_2qwjoiiTFO4DxQ@mail.gmail.com>
	<20160225163927.GW17997@ZenIV.linux.org.uk>
	<20160226212137.GF17997@ZenIV.linux.org.uk>
	<CACT4Y+Z2RmFvtxdRY1=3h2sXMwmafFq+qgcZWGOoeZ79Yrwxzw@mail.gmail.com>
	<20160226220759.GH17997@ZenIV.linux.org.uk>
	<20160227222748.GI17997@ZenIV.linux.org.uk>
	<CACT4Y+beVWKoQJ8Wfo4i8ij067hQLGmdcnrfP8CU-3eECgOLFQ@mail.gmail.com>
	<CACT4Y+YztEDaepwKnzOnh=0+BN1NdpoCVOE-FLXquvKfNgndMg@mail.gmail.com>
	<20160228170133.GM17997@ZenIV.linux.org.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160228170133.GM17997@ZenIV.linux.org.uk>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk

Commit Message

Al Viro Feb. 28, 2016, 8:01 p.m. UTC

On Sun, Feb 28, 2016 at 05:01:34PM +0000, Al Viro wrote:

> Erm...  What's to order ->d_inode and ->d_flags fetches there?  David?
> Looks like the barrier in d_is_negative() is on the wrong side of fetch.
> Confused...

OK, as per David's suggestion, let's flip them around, bringing the
barrier in d_is_negative() between them.  Dmitry, could you try this on
top of mainline?  Again, it's until the first warning.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Dmitry Vyukov Feb. 29, 2016, 9:38 a.m. UTC | #1

On Sun, Feb 28, 2016 at 9:01 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Feb 28, 2016 at 05:01:34PM +0000, Al Viro wrote:
>
>> Erm...  What's to order ->d_inode and ->d_flags fetches there?  David?
>> Looks like the barrier in d_is_negative() is on the wrong side of fetch.
>> Confused...
>
> OK, as per David's suggestion, let's flip them around, bringing the
> barrier in d_is_negative() between them.  Dmitry, could you try this on
> top of mainline?  Again, it's until the first warning.


Good news, I was able to trigger these warnings on a plain C program:
https://gist.githubusercontent.com/dvyukov/1a81426b8a5dd3620d6f/raw/fe6d03cfb0d219ad3d979f8bd6c016a5a1b93212/gistfile1.txt
Unfortunately, the failure rate is significantly lower than with
syzkaller. Syzkaller triggered it 8 times in 18 hours on a single VM;
while the C program triggered it once on 2 VMs. Al, maybe you can
modify the program to increase failure rate? I would expect that we
need to clash 2 (or maybe 3) syscalls with right timing to trigger it.
You must have a better idea as to what are these syscalls.
P.S. this is still with the previous patch, not the latest one.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dmitry Vyukov Feb. 29, 2016, 12:34 p.m. UTC | #2

On Mon, Feb 29, 2016 at 10:38 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Sun, Feb 28, 2016 at 9:01 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Sun, Feb 28, 2016 at 05:01:34PM +0000, Al Viro wrote:
>>
>>> Erm...  What's to order ->d_inode and ->d_flags fetches there?  David?
>>> Looks like the barrier in d_is_negative() is on the wrong side of fetch.
>>> Confused...
>>
>> OK, as per David's suggestion, let's flip them around, bringing the
>> barrier in d_is_negative() between them.  Dmitry, could you try this on
>> top of mainline?  Again, it's until the first warning.


I am testing the new patch for several hours in several VMs now, I
would expect a WARNING to already fire. But no warnings fired so far.
I will keep it running. But meanwhile, do you an explanation of how:

-               *inode = d_backing_inode(dentry);
                negative = d_is_negative(dentry);
+               *inode = d_backing_inode(dentry);

can fix the bug? Explanation other than a missing rmw, because I am
running on x86 so I would not expect rmb to be the root cause (though,
it still may be necessary for other arches and to prevent possible
miscompilations). If you do have it and assuming that I will not see
warnings till tomorrow, then hopefully we can consider it as fixed!

It's not that I really understand what happens here, but looking at
the diff: is it the case that negative and inode can change under our
feet? If so, we still probably can get an inconsistent picture (i.e.
negative dentry but not NULL inode), can it be an issue? Is
non-negative->negative->non-negative->negative transition possible? If
so, we still probably can get the same crash regardless of order of
negative/inode loads.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Feb. 29, 2016, 4:11 p.m. UTC | #3

On Mon, Feb 29, 2016 at 01:34:13PM +0100, Dmitry Vyukov wrote:

> It's not that I really understand what happens here, but looking at
> the diff: is it the case that negative and inode can change under our
> feet? If so, we still probably can get an inconsistent picture (i.e.
> negative dentry but not NULL inode), can it be an issue? Is
> non-negative->negative->non-negative->negative transition possible? If
> so, we still probably can get the same crash regardless of order of
> negative/inode loads.

Yes, we can - relying on the ordering is brittle and wrong.  See other
posting for possible solution; at least that one has much more understandable
rules:
	* ->d_seq is bumped before and after modifications of ->d_inode and
->d_flags, which both provides the barriers and (which is what matters for
x86) guarantees that ->d_seq match on recheck (which we do anyway) means
that ->d_inode and ->d_flags match each other.
	* RCU users of that part of ->d_flags should be verified by ->d_seq
check (we already are doing that - should_follow_link() didn't and that was
one of the bugs that got fixed).
	* non-RCU users either have the parent locked (which stabilizes
everything) or have dentry pinned and positive (ditto).  Checking that
dentry is negative (either by looking at inode or flags) does not guarantee
that it will stay such unless the parent is locked anyway.  IOW, the
games with barriers and order of assignments between ->d_inode and ->d_flags
do not actually buy us anything useful.
	* in case of __dentry_kill() we do *NOT* surround the stores to
->d_inode/->d_flags with ->d_seq bumps, but that's safe since by that point
we had already done __d_drop(), so RCU reader either doesn't find the
dentry in the first place, or gets ->d_seq bumped (by 2) between the
moment it's been fetched by __d_lookup_rcu() finding the sucker and
the moment when ->d_inode/->d_flags get changed.  If RCU reader gets in
before that, it sees consistent ->d_inode/->d_flags, as they used to be.
It will eventually fail to grab a reference to that dentry, but that's not
our problem.  If it gets in late enough to see ->d_inode and/or ->d_flags
changed, it will fail ->d_seq check and ignore the values it has fetched.
Again, no need for a barrier between ->d_inode and ->d_flags stores in
that case.

As the matter of fact, I'm somewhat tempted to make
static void dentry_unlink_inode(struct dentry * dentry)
        __releases(dentry->d_lock)
        __releases(dentry->d_inode->i_lock)
{
        struct inode *inode = dentry->d_inode;
	bool hashed = !d_unhashed(dentry);

	if (hashed)
		raw_write_seqcount_begin(&dentry->d_seq);
        __d_clear_type_and_inode(dentry);
        hlist_del_init(&dentry->d_u.d_alias);
	if (hashed)
		raw_write_seqcount_end(&dentry->d_seq);
        spin_unlock(&dentry->d_lock);
        spin_unlock(&inode->i_lock);
        if (!inode->i_nlink)
                fsnotify_inoderemove(inode);
        if (dentry->d_op && dentry->d_op->d_iput)
                dentry->d_op->d_iput(dentry, inode);
        else
                iput(inode);
}

and replace dentry_iput() in its only caller with

	if (dentry->d_inode)
		dentry_unlink_inode(dentry);	/* will drop ->d_lock */
	else
		spin_unlock(&dentry->d_lock);

That would get rid of annoying code duplication, but I would like to see
profiles - the cost of branches might very well get unpleasant.  Not sure,
and that part definitely isn't a -stable fodder.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index c6d7d3d..86f81e3 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -323,6 +323,7 @@  static struct dentry *autofs4_mountpoint_changed(struct path *path)
 		struct dentry *new = d_lookup(parent, &dentry->d_name);
 		if (!new)
 			return NULL;
+		WARN_ON(d_is_negative(new));
 		ino = autofs4_dentry_ino(new);
 		ino->last_used = jiffies;
 		dput(path->dentry);
diff --git a/fs/namei.c b/fs/namei.c
index 9c590e0..630d222 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1209,6 +1209,7 @@  static int follow_managed(struct path *path, struct nameidata *nd)
 		/* Handle an automount point */
 		if (managed & DCACHE_NEED_AUTOMOUNT) {
 			ret = follow_automount(path, nd, &need_mntput);
+			WARN_ON(d_is_negative(path->dentry));
 			if (ret < 0)
 				break;
 			continue;
@@ -1260,6 +1261,7 @@  static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 {
 	for (;;) {
 		struct mount *mounted;
+		void *p;
 		/*
 		 * Don't forget we might have a non-mountpoint managed dentry
 		 * that wants to block transit.
@@ -1289,7 +1291,9 @@  static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		 * dentry sequence number here after this d_inode read,
 		 * because a mount-point is always pinned.
 		 */
-		*inode = path->dentry->d_inode;
+		p = *inode = path->dentry->d_inode;
+		if (unlikely(!p))
+			WARN_ON(!read_seqretry(&mount_lock, nd->m_seq));
 	}
 	return !read_seqretry(&mount_lock, nd->m_seq) &&
 		!(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT);
@@ -1550,8 +1554,8 @@  static int lookup_fast(struct nameidata *nd,
 		 * This sequence count validates that the inode matches
 		 * the dentry name information from lookup.
 		 */
-		*inode = d_backing_inode(dentry);
 		negative = d_is_negative(dentry);
+		*inode = d_backing_inode(dentry);
 		if (read_seqcount_retry(&dentry->d_seq, seq))
 			return -ECHILD;
 
@@ -1580,6 +1584,7 @@  static int lookup_fast(struct nameidata *nd,
 		 */
 		if (negative)
 			return -ENOENT;
+		WARN_ON(!*inode);		// ->d_seq was fucked somehow
 		path->mnt = mnt;
 		path->dentry = dentry;
 		if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
@@ -1613,8 +1618,10 @@  unlazy:
 	path->mnt = mnt;
 	path->dentry = dentry;
 	err = follow_managed(path, nd);
-	if (likely(!err))
+	if (likely(!err)) {
 		*inode = d_backing_inode(path->dentry);
+		WARN_ON(!*inode);
+	}
 	return err;
 
 need_lookup:
@@ -1717,6 +1724,7 @@  static inline int should_follow_link(struct nameidata *nd, struct path *link,
 		if (read_seqcount_retry(&link->dentry->d_seq, seq))
 			return -ECHILD;
 	}
+	WARN_ON(!inode);		// now, _that_ should not happen.
 	return pick_link(nd, link, inode, seq);
 }
 
@@ -3111,8 +3119,10 @@  static int do_last(struct nameidata *nd,
 			nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
 		/* we _can_ be in RCU mode here */
 		error = lookup_fast(nd, &path, &inode, &seq);
-		if (likely(!error))
+		if (likely(!error)) {
+			WARN_ON(!inode);
 			goto finish_lookup;
+		}
 
 		if (error < 0)
 			return error;
@@ -3203,6 +3213,7 @@  retry_lookup:
 		return -ENOENT;
 	}
 	inode = d_backing_inode(path.dentry);
+	WARN_ON(!inode);
 finish_lookup:
 	if (nd->depth)
 		put_link(nd);
diff --git a/fs/namespace.c b/fs/namespace.c
index 4fb1691..4128a5c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1060,6 +1060,8 @@  static void cleanup_mnt(struct mount *mnt)
 	 * so mnt_get_writers() below is safe.
 	 */
 	WARN_ON(mnt_get_writers(mnt));
+	WARN_ON(!mnt->mnt.mnt_root->d_inode);	// some joker has managed to
+						// make mnt_root negative on us
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	fsnotify_vfsmount_delete(&mnt->mnt);

fs: NULL deref in atime_needs_update

Commit Message

Comments

Patch