Message ID | 20160228200100.GP17997@ZenIV.linux.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Sun, Feb 28, 2016 at 9:01 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Sun, Feb 28, 2016 at 05:01:34PM +0000, Al Viro wrote: > >> Erm... What's to order ->d_inode and ->d_flags fetches there? David? >> Looks like the barrier in d_is_negative() is on the wrong side of fetch. >> Confused... > > OK, as per David's suggestion, let's flip them around, bringing the > barrier in d_is_negative() between them. Dmitry, could you try this on > top of mainline? Again, it's until the first warning. Good news, I was able to trigger these warnings on a plain C program: https://gist.githubusercontent.com/dvyukov/1a81426b8a5dd3620d6f/raw/fe6d03cfb0d219ad3d979f8bd6c016a5a1b93212/gistfile1.txt Unfortunately, the failure rate is significantly lower than with syzkaller. Syzkaller triggered it 8 times in 18 hours on a single VM; while the C program triggered it once on 2 VMs. Al, maybe you can modify the program to increase failure rate? I would expect that we need to clash 2 (or maybe 3) syscalls with right timing to trigger it. You must have a better idea as to what are these syscalls. P.S. this is still with the previous patch, not the latest one. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 29, 2016 at 10:38 AM, Dmitry Vyukov <dvyukov@google.com> wrote: > On Sun, Feb 28, 2016 at 9:01 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: >> On Sun, Feb 28, 2016 at 05:01:34PM +0000, Al Viro wrote: >> >>> Erm... What's to order ->d_inode and ->d_flags fetches there? David? >>> Looks like the barrier in d_is_negative() is on the wrong side of fetch. >>> Confused... >> >> OK, as per David's suggestion, let's flip them around, bringing the >> barrier in d_is_negative() between them. Dmitry, could you try this on >> top of mainline? Again, it's until the first warning. I am testing the new patch for several hours in several VMs now, I would expect a WARNING to already fire. But no warnings fired so far. I will keep it running. But meanwhile, do you an explanation of how: - *inode = d_backing_inode(dentry); negative = d_is_negative(dentry); + *inode = d_backing_inode(dentry); can fix the bug? Explanation other than a missing rmw, because I am running on x86 so I would not expect rmb to be the root cause (though, it still may be necessary for other arches and to prevent possible miscompilations). If you do have it and assuming that I will not see warnings till tomorrow, then hopefully we can consider it as fixed! It's not that I really understand what happens here, but looking at the diff: is it the case that negative and inode can change under our feet? If so, we still probably can get an inconsistent picture (i.e. negative dentry but not NULL inode), can it be an issue? Is non-negative->negative->non-negative->negative transition possible? If so, we still probably can get the same crash regardless of order of negative/inode loads. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 29, 2016 at 01:34:13PM +0100, Dmitry Vyukov wrote: > It's not that I really understand what happens here, but looking at > the diff: is it the case that negative and inode can change under our > feet? If so, we still probably can get an inconsistent picture (i.e. > negative dentry but not NULL inode), can it be an issue? Is > non-negative->negative->non-negative->negative transition possible? If > so, we still probably can get the same crash regardless of order of > negative/inode loads. Yes, we can - relying on the ordering is brittle and wrong. See other posting for possible solution; at least that one has much more understandable rules: * ->d_seq is bumped before and after modifications of ->d_inode and ->d_flags, which both provides the barriers and (which is what matters for x86) guarantees that ->d_seq match on recheck (which we do anyway) means that ->d_inode and ->d_flags match each other. * RCU users of that part of ->d_flags should be verified by ->d_seq check (we already are doing that - should_follow_link() didn't and that was one of the bugs that got fixed). * non-RCU users either have the parent locked (which stabilizes everything) or have dentry pinned and positive (ditto). Checking that dentry is negative (either by looking at inode or flags) does not guarantee that it will stay such unless the parent is locked anyway. IOW, the games with barriers and order of assignments between ->d_inode and ->d_flags do not actually buy us anything useful. * in case of __dentry_kill() we do *NOT* surround the stores to ->d_inode/->d_flags with ->d_seq bumps, but that's safe since by that point we had already done __d_drop(), so RCU reader either doesn't find the dentry in the first place, or gets ->d_seq bumped (by 2) between the moment it's been fetched by __d_lookup_rcu() finding the sucker and the moment when ->d_inode/->d_flags get changed. If RCU reader gets in before that, it sees consistent ->d_inode/->d_flags, as they used to be. It will eventually fail to grab a reference to that dentry, but that's not our problem. If it gets in late enough to see ->d_inode and/or ->d_flags changed, it will fail ->d_seq check and ignore the values it has fetched. Again, no need for a barrier between ->d_inode and ->d_flags stores in that case. As the matter of fact, I'm somewhat tempted to make static void dentry_unlink_inode(struct dentry * dentry) __releases(dentry->d_lock) __releases(dentry->d_inode->i_lock) { struct inode *inode = dentry->d_inode; bool hashed = !d_unhashed(dentry); if (hashed) raw_write_seqcount_begin(&dentry->d_seq); __d_clear_type_and_inode(dentry); hlist_del_init(&dentry->d_u.d_alias); if (hashed) raw_write_seqcount_end(&dentry->d_seq); spin_unlock(&dentry->d_lock); spin_unlock(&inode->i_lock); if (!inode->i_nlink) fsnotify_inoderemove(inode); if (dentry->d_op && dentry->d_op->d_iput) dentry->d_op->d_iput(dentry, inode); else iput(inode); } and replace dentry_iput() in its only caller with if (dentry->d_inode) dentry_unlink_inode(dentry); /* will drop ->d_lock */ else spin_unlock(&dentry->d_lock); That would get rid of annoying code duplication, but I would like to see profiles - the cost of branches might very well get unpleasant. Not sure, and that part definitely isn't a -stable fodder. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c index c6d7d3d..86f81e3 100644 --- a/fs/autofs4/root.c +++ b/fs/autofs4/root.c @@ -323,6 +323,7 @@ static struct dentry *autofs4_mountpoint_changed(struct path *path) struct dentry *new = d_lookup(parent, &dentry->d_name); if (!new) return NULL; + WARN_ON(d_is_negative(new)); ino = autofs4_dentry_ino(new); ino->last_used = jiffies; dput(path->dentry); diff --git a/fs/namei.c b/fs/namei.c index 9c590e0..630d222 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1209,6 +1209,7 @@ static int follow_managed(struct path *path, struct nameidata *nd) /* Handle an automount point */ if (managed & DCACHE_NEED_AUTOMOUNT) { ret = follow_automount(path, nd, &need_mntput); + WARN_ON(d_is_negative(path->dentry)); if (ret < 0) break; continue; @@ -1260,6 +1261,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, { for (;;) { struct mount *mounted; + void *p; /* * Don't forget we might have a non-mountpoint managed dentry * that wants to block transit. @@ -1289,7 +1291,9 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, * dentry sequence number here after this d_inode read, * because a mount-point is always pinned. */ - *inode = path->dentry->d_inode; + p = *inode = path->dentry->d_inode; + if (unlikely(!p)) + WARN_ON(!read_seqretry(&mount_lock, nd->m_seq)); } return !read_seqretry(&mount_lock, nd->m_seq) && !(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT); @@ -1550,8 +1554,8 @@ static int lookup_fast(struct nameidata *nd, * This sequence count validates that the inode matches * the dentry name information from lookup. */ - *inode = d_backing_inode(dentry); negative = d_is_negative(dentry); + *inode = d_backing_inode(dentry); if (read_seqcount_retry(&dentry->d_seq, seq)) return -ECHILD; @@ -1580,6 +1584,7 @@ static int lookup_fast(struct nameidata *nd, */ if (negative) return -ENOENT; + WARN_ON(!*inode); // ->d_seq was fucked somehow path->mnt = mnt; path->dentry = dentry; if (likely(__follow_mount_rcu(nd, path, inode, seqp))) @@ -1613,8 +1618,10 @@ unlazy: path->mnt = mnt; path->dentry = dentry; err = follow_managed(path, nd); - if (likely(!err)) + if (likely(!err)) { *inode = d_backing_inode(path->dentry); + WARN_ON(!*inode); + } return err; need_lookup: @@ -1717,6 +1724,7 @@ static inline int should_follow_link(struct nameidata *nd, struct path *link, if (read_seqcount_retry(&link->dentry->d_seq, seq)) return -ECHILD; } + WARN_ON(!inode); // now, _that_ should not happen. return pick_link(nd, link, inode, seq); } @@ -3111,8 +3119,10 @@ static int do_last(struct nameidata *nd, nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; /* we _can_ be in RCU mode here */ error = lookup_fast(nd, &path, &inode, &seq); - if (likely(!error)) + if (likely(!error)) { + WARN_ON(!inode); goto finish_lookup; + } if (error < 0) return error; @@ -3203,6 +3213,7 @@ retry_lookup: return -ENOENT; } inode = d_backing_inode(path.dentry); + WARN_ON(!inode); finish_lookup: if (nd->depth) put_link(nd); diff --git a/fs/namespace.c b/fs/namespace.c index 4fb1691..4128a5c 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1060,6 +1060,8 @@ static void cleanup_mnt(struct mount *mnt) * so mnt_get_writers() below is safe. */ WARN_ON(mnt_get_writers(mnt)); + WARN_ON(!mnt->mnt.mnt_root->d_inode); // some joker has managed to + // make mnt_root negative on us if (unlikely(mnt->mnt_pins.first)) mnt_pin_kill(mnt); fsnotify_vfsmount_delete(&mnt->mnt);