fs: NULL deref in atime_needs_update

On Wed, Feb 24, 2016 at 11:12:13AM +0800, Ian Kent wrote:
> On Wed, 2016-02-17 at 00:40 +0100, Mickaël Salaün wrote:
> > Hi,
> > 
> > Actually I found the same bug (without fuzzing) and I can reproduce it
> > in a deterministic way (e.g. by creating a LSM that return 1 for the
> > security_file_open hook). At least, from v4.2.8 I can easily trigger
> > traces like this :
> 
> Reading through this thread I wonder if this is a new problem.
> 
> Is there a previous kernel it can't be reproduced on?
> Perhaps a bisect will shed some light on what's happening.

There are several things in the mix.  What Mickaël has found is that a bunch
of places where _positive_ number returned instead of expected 0 or -E... can
get propagated all way back to caller of do_last(), confusing the hell out
of it.  That's not what Dmitry has triggered, though.  WARN_ON() in the
end of do_last() would've triggered, and IMO this one, along with mitigation
(map that "error value" to -EINVAL) should go into mainline and all -stable.
Bogus ->open() returning a positive number had always been bad news; in the
best case it would be returned to userland, leading to "open(2) has failed
and returned a positive number".  Hell knows if we ever had such instances
(or have them right now), but I wouldn't bet on their absense.  Rare
failure exits returning bogus values in an ->open() instance in some driver
can easily stay unnoticed for a long time.  Starting from at least 3.6
(circa the atomic_open support) it got more unpleasant than simple "confuse
the hell out of userland".  ->open() isn't the only vector for injection of
such crap - ->permission() would also serve, same for several LSM turds, etc.

Again, that's a separate problem.  What Dmitry seems to be catching is getting
crap values fed to should_follow_link() as inode.  I see one bug in that
area that does need fixing (fix present in the last patch I've posted, with
WARN_ON() to indicate that this thing has triggered; _that_ WARN_ON() should
be gone from the final variant, since this can trigger without driver bugs,
etc.)  In RCU mode after we'd checked that dentry is a symlink one, we need
to verify that it hadn't been changed since before we'd fetched the inode.
It might have been e.g. a regular file, which got unlinked with symlink
created in its place.  Then we'd go into get_link() with non-symlink inode
and oops on attempt to call its ->i_op->get_link().  That race is real, very
hard to hit (you need both the unlink(2) and symlink(2) to happen between
lookup_fast() and should_follow_link() and unless you have PREEMPT_RCU you
can't lose the timeslice there) and would've manifested differently.

But that leaves other two kinds of bugs getting triggered: inode of some
non-symlink is possible, but what we saw included NULL inode when we'd
reached finish_open: in do_last().  Should be flat-out impossible - we either
get lookup_fast(..., &inode, ...) return 0 and store NULL in inode, or
get NULL inode from pinned d_is_symlink() dentry after having grabbed
a reference and left RCU mode.  Neither should be possible without either
something weird happening to lookup_fast() (and we would've seen oopsen in
link_path_walk() if that could happen, BTW) or screwed dentry refcounting
somewhere, combined with a race that managed to turn...

Oh, shit.  No screwed refcounting is needed.  Look:
        BUG_ON(nd->flags & LOOKUP_RCU);
        inode = d_backing_inode(path.dentry);
        seq = 0;        /* out of RCU mode, so the value doesn't matter */
        if (unlikely(d_is_negative(path.dentry))) {
                path_to_nameidata(&path, nd);
                return -ENOENT;
        }
Suppose we come here with negative path.dentry.  We are holding a reference,
all right, and for a _postive_ dentry that would've been enough to keep
it positive.  Not so for a negative one, though - symlink(2) on another
CPU doint d_instantiate() just before the d_is_negative() check and we
are fucked - inode is stale and we sail through all the checks, all the
way into should_follow_link().

We also have the same kind of crap in walk_component() -
                err = lookup_slow(nd, &path);
                if (err < 0)
                        return err;
                inode = d_backing_inode(path.dentry);
                seq = 0;        /* we are already out of RCU mode */
                err = -ENOENT;
                if (d_is_negative(path.dentry))  
                        goto out_path_put;
There it's much harder to hit, though - we need it not just d_instantiate()
overlapping those lines; we need the racing syscall to get from locking
the parent to d_instantiate() between the point where lookup_slow() has
unlocked the parent and d_is_negative().  And lookup_slow() couldn't have
gone into mountpoint crossing, so it's really hard to hit - you pretty
much have to get preempted just after fetching inode.

OK, the next delta to try, and there definitely are several commits in
that pile.  It still does not explain the traces with GPF at 0x50, though -
for that we need not just a NULL getting to should_follow_link() but
something non-NULL with NULL at offset 40 from it (offset of ->i_sb in
struct inode).  That something *can't* be a valid struct inode or had been
one in recent past - ->i_sb is assigned in new_inode(), value is always
non-NULL and never modified all the way until RCU-delayed freeing of struct
inode.  So that has to be something entirely different...  Anyway, the
patch so far follows:

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

fs: NULL deref in atime_needs_update

Commit Message

Comments

Patch