diff mbox

aio: fix a user triggered use after free (and fix freeze protection of aio writes)

Message ID 1477727070-18806-2-git-send-email-hch@lst.de (mailing list archive)
State New, archived
Headers show

Commit Message

Christoph Hellwig Oct. 29, 2016, 7:44 a.m. UTC
From: Jan Kara <jack@suse.cz>

Currently we dropped freeze protection of aio writes just after IO was
submitted. Thus aio write could be in flight while the filesystem was
frozen and that could result in unexpected situation like aio completion
wanting to convert extent type on frozen filesystem. Testcase from
Dmitry triggering this is like:

for ((i=0;i<60;i++));do fsfreeze -f /mnt ;sleep 1;fsfreeze -u /mnt;done &
fio --bs=4k --ioengine=libaio --iodepth=128 --size=1g --direct=1 \
    --runtime=60 --filename=/mnt/file --name=rand-write --rw=randwrite

Fix the problem by dropping freeze protection only once IO is completed
in aio_complete().

[hch: The above was the changelog of the original patch from Jan.
 It turns out that it fixes something even more important - a use
 after free of the file structucture given that the direct I/O
 code calls fput and potentially drops the last reference to it in
 aio_complete.  Together with two racing threads and a zero sized
 I/O this seems easily exploitable]

Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
[hch: switch to use __sb_writers_acquired and file_inode(file),
      updated changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: stable@vger.kernel.org
---
 fs/aio.c           | 28 +++++++++++++++++++++++++---
 include/linux/fs.h |  1 +
 2 files changed, 26 insertions(+), 3 deletions(-)

Comments

Al Viro Oct. 29, 2016, 12:24 p.m. UTC | #1
On Sat, Oct 29, 2016 at 09:44:29AM +0200, Christoph Hellwig wrote:

> -		if (rw == WRITE)
> +		if (rw == WRITE) {
>  			file_start_write(file);
> +			req->ki_flags |= IOCB_WRITE;
> +		}

> +		if (rw == WRITE) {
> +			/*
> +			 * We release freeze protection in aio_complete(). Fool
> +			 * lockdep by telling it the lock got released so that
> +			 * it doesn't complain about held lock when we return
> +			 * to userspace.
> +			 */
> +			__sb_writers_release(file_inode(file)->i_sb,
> +					SB_FREEZE_WRITE);
> +		}

How about taking this chunk (i.e. telling lockdep that we are not holding this
thing) past the iter_op() call, where file_end_write() used to be?

As it is, you risk hiding the lock dependencies the current mainline would've
caught.  Other than that I see no problems with the patch...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 29, 2016, 3:20 p.m. UTC | #2
On Sat, Oct 29, 2016 at 01:24:51PM +0100, Al Viro wrote:
> How about taking this chunk (i.e. telling lockdep that we are not holding this
> thing) past the iter_op() call, where file_end_write() used to be?

We can't as that would not fix the use after free (at least for the lockdep
case - otherwise the call is a no-op).  Once iter_op returns aio_complete
might have dropped our reference to the file, and another thread might
have closed the fd so that the fput from aio_complete was the last one.

This is something that xfstests/323 can reproduce under the right conditions.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Oct. 29, 2016, 4:12 p.m. UTC | #3
On Sat, Oct 29, 2016 at 05:20:17PM +0200, Christoph Hellwig wrote:
> On Sat, Oct 29, 2016 at 01:24:51PM +0100, Al Viro wrote:
> > How about taking this chunk (i.e. telling lockdep that we are not holding this
> > thing) past the iter_op() call, where file_end_write() used to be?
> 
> We can't as that would not fix the use after free (at least for the lockdep
> case - otherwise the call is a no-op).  Once iter_op returns aio_complete
> might have dropped our reference to the file, and another thread might
> have closed the fd so that the fput from aio_complete was the last one.
> 
> This is something that xfstests/323 can reproduce under the right conditions.

Eww...  IOW, as soon as we'd submitted an async iocb, nobody can so much as
look at struct file *or* iocb, right?  Or underlying inode, or any fs-private
data structures attached to it...

I certainly agree that it's a bug, but IMO you are just papering over it.
Just look at e.g.
        written = mapping->a_ops->direct_IO(iocb, &data);

        /*
         * Finally, try again to invalidate clean pages which might have been
         * cached by non-direct readahead, or faulted in by get_user_pages()
         * if the source of the write was an mmap'ed region of the file
         * we're writing.  Either one is a pretty crazy thing to do,
         * so we don't support it 100%.  If this invalidation
         * fails, tough, the write still worked...
         */
        if (mapping->nrpages) {
                invalidate_inode_pages2_range(mapping,
                                              pos >> PAGE_SHIFT, end);
        }
in generic_file_direct_write() - who said that mapping doesn't point into
freed memory by that point?  Or that
        inode_lock(inode);
        ret = generic_write_checks(iocb, from);
        if (ret > 0)
                ret = __generic_file_write_iter(iocb, from);
        inode_unlock(inode);
in generic_file_write_iter() won't blow up on inode_unlock() of a freed inode?
Or that anything of that sort won't happen in other ->write_iter instances,
for that matter?

NAK, with apologies for not having looked at that earlier.  The bug is real,
all right, but this is not a solution - both incomplete and far too brittle.

Why do we play that kind of insane games, anyway?  Why not e.g. refcount
the (async) iocb and have kiocb_free() drop the reference, with io_submit_one()
holding one across the call of aio_run_iocb()?  Cacheline bouncing issues?
Anything more subtle?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Oct. 29, 2016, 4:29 p.m. UTC | #4
On Sat, Oct 29, 2016 at 05:12:30PM +0100, Al Viro wrote:

> NAK, with apologies for not having looked at that earlier.  The bug is real,
> all right, but this is not a solution - both incomplete and far too brittle.
> 
> Why do we play that kind of insane games, anyway?  Why not e.g. refcount
> the (async) iocb and have kiocb_free() drop the reference, with io_submit_one()
> holding one across the call of aio_run_iocb()?  Cacheline bouncing issues?
> Anything more subtle?

PS: I'm not saying that refcounting kiocb is the best solution - grabbing an
extra reference to struct file might be better (we have just dirtied that
cacheline, so the fact that struct file is shared more than kiocb shouldn't
matter much), but I really think that "file and everything attached to it
might disappear as soon as you get async IO started" is insanely brittle -
e.g. xfs_rw_iunlock(ip, iolock) in xfs_file_dio_aio_write() is also unsafe
<checks -next - yup, still there>

If struct file might be gone, so might struct inode and everything behind
it.  Which means that we either are not allowed to hold any locks across
__blockdev_direct_IO(), or need have end_io() callback taking care of
dropping those (and adjust the callers accordingly).  It might be not
impossible, but... *ouch*
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 30, 2016, 6:32 a.m. UTC | #5
On Sat, Oct 29, 2016 at 05:12:30PM +0100, Al Viro wrote:
> Eww...  IOW, as soon as we'd submitted an async iocb, nobody can so much as
> look at struct file *or* iocb, right?  Or underlying inode, or any fs-private
> data structures attached to it...

Yeah.

> I certainly agree that it's a bug, but IMO you are just papering over it.
> Just look at e.g.
>         written = mapping->a_ops->direct_IO(iocb, &data);
> 
>         /*
>          * Finally, try again to invalidate clean pages which might have been
>          * cached by non-direct readahead, or faulted in by get_user_pages()
>          * if the source of the write was an mmap'ed region of the file
>          * we're writing.  Either one is a pretty crazy thing to do,
>          * so we don't support it 100%.  If this invalidation
>          * fails, tough, the write still worked...
>          */
>         if (mapping->nrpages) {
>                 invalidate_inode_pages2_range(mapping,
>                                               pos >> PAGE_SHIFT, end);
>         }
> in generic_file_direct_write() - who said that mapping doesn't point into
> freed memory by that point?

True, somewhat unlike due to inode caching, but for sure possible
and something that needs to be deal with.

> Why do we play that kind of insane games, anyway?  Why not e.g. refcount
> the (async) iocb and have kiocb_free() drop the reference, with io_submit_one()
> holding one across the call of aio_run_iocb()?  Cacheline bouncing issues?
> Anything more subtle?

There shouldn't really be a need to refcount the iocb itself - nothing
worth looking at.  The one we consider was struct file, and I didn't
like it because of the cacheline bouncing if we decrement if from both
the submitter and completion context.  But it starts to sounds like
the sane version more and more.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/aio.c b/fs/aio.c
index 1157e13..bf315cd 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1078,6 +1078,17 @@  static void aio_complete(struct kiocb *kiocb, long res, long res2)
 	unsigned tail, pos, head;
 	unsigned long	flags;
 
+	if (kiocb->ki_flags & IOCB_WRITE) {
+		struct file *file = kiocb->ki_filp;
+
+		/*
+		 * Tell lockdep we inherited freeze protection from submission
+		 * thread.
+		 */
+		__sb_writers_acquired(file_inode(file)->i_sb, SB_FREEZE_WRITE);
+		file_end_write(file);
+	}
+
 	/*
 	 * Special case handling for sync iocbs:
 	 *  - events go directly into the iocb for fast handling
@@ -1460,13 +1471,24 @@  static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
 			return ret;
 		}
 
-		if (rw == WRITE)
+		if (rw == WRITE) {
 			file_start_write(file);
+			req->ki_flags |= IOCB_WRITE;
+		}
+
+		if (rw == WRITE) {
+			/*
+			 * We release freeze protection in aio_complete(). Fool
+			 * lockdep by telling it the lock got released so that
+			 * it doesn't complain about held lock when we return
+			 * to userspace.
+			 */
+			__sb_writers_release(file_inode(file)->i_sb,
+					SB_FREEZE_WRITE);
+		}
 
 		ret = iter_op(req, &iter);
 
-		if (rw == WRITE)
-			file_end_write(file);
 		kfree(iovec);
 		break;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 16d2b6e..db600e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -321,6 +321,7 @@  struct writeback_control;
 #define IOCB_HIPRI		(1 << 3)
 #define IOCB_DSYNC		(1 << 4)
 #define IOCB_SYNC		(1 << 5)
+#define IOCB_WRITE		(1 << 6)
 
 struct kiocb {
 	struct file		*ki_filp;