diff mbox

aio: fix a user triggered use after free (and fix freeze protection of aio writes)

Message ID CA+55aFxubzEr6JUB9US2HBuijCCe5Vs5tR0nbST+tj=gkrDtqg@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Linus Torvalds Oct. 29, 2016, 5:47 p.m. UTC
On Sat, Oct 29, 2016 at 8:20 AM, Christoph Hellwig <hch@lst.de> wrote:
>
> We can't as that would not fix the use after free (at least for the lockdep
> case - otherwise the call is a no-op).  Once iter_op returns aio_complete
> might have dropped our reference to the file, and another thread might
> have closed the fd so that the fput from aio_complete was the last one.

I don't concpetually mind the patch per se, but the repeated

             if (rw == WRITE) {
                   ..
             }

             if (rw == WRITE) {
                   ..
             }

is just insane and makes the code less legible than it should be.

Also, honestly, make it use a helper: "aio_file_start_write()" and
"aio_file_end_write()" that has the comments and the lockdep games.

Because that patch is just too effing ugly.

Does something like the attached work for you guys?

                Linus
fs/aio.c           | 33 +++++++++++++++++++++++++++++----
 include/linux/fs.h |  1 +
 2 files changed, 30 insertions(+), 4 deletions(-)

Comments

Al Viro Oct. 29, 2016, 6:52 p.m. UTC | #1
On Sat, Oct 29, 2016 at 10:47:58AM -0700, Linus Torvalds wrote:

> Also, honestly, make it use a helper: "aio_file_start_write()" and
> "aio_file_end_write()" that has the comments and the lockdep games.
> 
> Because that patch is just too effing ugly.
> 
> Does something like the attached work for you guys?

No.  The use-after-free problem is real, nasty and only papered over by
that patch.

What happens is that io_submit_one()
	* allocates aio_kiocb
	* does fget() and stuffs the struct file * into kiocb
	* in case of early problems we call kiocb_free(), freeing kiocb and
doing fput() on file, then bugger off.
	* otherwise, eventually we get to passing that iocb to
->read_iter()/->write_iter().
	* if that has resulted in anything other than -EIOCBQUEUED, we
call aio_complete(), which calls kiocb_free(), freeing kiocb and doing fput()
on file.
	* if ->{read,write}_iter() returns -EIOCBQUEUED, we expect
aio_complete() to be called asynchronously.

And that call can happen as soon as we return from __blockdev_direct_IO()
(even earlier, actually).  As soon as that happens, the reference to
struct file we'd acquired in io_submit_one() is dropped.  If descriptor
table had been shared, another thread might have already closed that sucker,
and fput() from aio_complete() would free struct file.

That's what this patch is papering over.  Because if we hit that scenario
and struct file *does* get closed asynchronously just as our ->write_iter()
is returning from __blockdev_direct_IO(), we are fucked.  Not only struct
file might be freed - struct inode might've been gone too.  And a bunch
of ->write_iter/->read_iter instances do access struct inode after the call
of __blockdev_direct_IO().  file_write_end() is just the tip of the iceberg -
see examples I've posted today for the things we *can't* move around.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Oct. 29, 2016, 7:07 p.m. UTC | #2
On Sat, Oct 29, 2016 at 11:52 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> And that call can happen as soon as we return from __blockdev_direct_IO()
> (even earlier, actually).  As soon as that happens, the reference to
> struct file we'd acquired in io_submit_one() is dropped.  If descriptor
> table had been shared, another thread might have already closed that sucker,
> and fput() from aio_complete() would free struct file.

But that's the point. We don't *do* anything like that any more. We
now always do the final access from aio_complete(). So it doesn't
matter if that is called asynchronously (very early) or not.

That's the whole point of the patch. Exactly to do everything either
*before* we even submit it (at which point no completion can happen),
or doing it in aio_complete() which is guaranteed to be after the
submission. No races, no use-after-free.

What am I missing?

             Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Oct. 29, 2016, 7:17 p.m. UTC | #3
On Sat, Oct 29, 2016 at 12:07:07PM -0700, Linus Torvalds wrote:
> On Sat, Oct 29, 2016 at 11:52 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > And that call can happen as soon as we return from __blockdev_direct_IO()
> > (even earlier, actually).  As soon as that happens, the reference to
> > struct file we'd acquired in io_submit_one() is dropped.  If descriptor
> > table had been shared, another thread might have already closed that sucker,
> > and fput() from aio_complete() would free struct file.
> 
> But that's the point. We don't *do* anything like that any more. We
> now always do the final access from aio_complete(). So it doesn't
> matter if that is called asynchronously (very early) or not.
> 
> That's the whole point of the patch. Exactly to do everything either
> *before* we even submit it (at which point no completion can happen),
> or doing it in aio_complete() which is guaranteed to be after the
> submission. No races, no use-after-free.
> 
> What am I missing?

        if (ret > 0)
                ret = __generic_file_write_iter(iocb, from);
        inode_unlock(inode);
in generic_file_write_iter() (inode might be gone here)
        written = mapping->a_ops->direct_IO(iocb, &data);

        /*
         * Finally, try again to invalidate clean pages which might have been
         * cached by non-direct readahead, or faulted in by get_user_pages()
         * if the source of the write was an mmap'ed region of the file
         * we're writing.  Either one is a pretty crazy thing to do,
         * so we don't support it 100%.  If this invalidation
         * fails, tough, the write still worked...
         */
        if (mapping->nrpages) {
                invalidate_inode_pages2_range(mapping,
                                              pos >> PAGE_SHIFT, end);
in generic_file_direct_write() (mapping points into inode, which might be
freed)
        ret = __blockdev_direct_IO(iocb, inode, target->bt_bdev, &data,
                        xfs_get_blocks_direct, NULL, NULL, 0);
        if (ret >= 0) {
                iocb->ki_pos += ret;
                iov_iter_advance(to, ret);
        }
        xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
in xfs_file_dio_aio_read() (ip points to xfs-private part of inode).

And that - just from a quick look.  We *do* access inode between the call
of __blockdev_direct_IO() and return from ->write_iter().  What's more,
as soon as aio_complete() has happened, what's to stop close from another
thread + umount + rmmod unmapping that ->write_iter() completely?

AFAICS, the possibility of dropping the last reference to struct file
before ->write_iter() has returned is fundamentally broken.  I might be
missing something subtle here, but...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Oct. 29, 2016, 8:09 p.m. UTC | #4
On Sat, Oct 29, 2016 at 12:17 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> AFAICS, the possibility of dropping the last reference to struct file
> before ->write_iter() has returned is fundamentally broken.  I might be
> missing something subtle here, but...

Ok, let's add a get_file(); fput(); around that whole iter call sequence.

And that's a separate issue from "we should hold the fs freezer lock
around the whole operation". So I think we need both.

Does that make everybody happy?

              Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Oct. 30, 2016, 9:44 a.m. UTC | #5
On Sat 29-10-16 13:09:25, Linus Torvalds wrote:
> On Sat, Oct 29, 2016 at 12:17 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > AFAICS, the possibility of dropping the last reference to struct file
> > before ->write_iter() has returned is fundamentally broken.  I might be
> > missing something subtle here, but...
> 
> Ok, let's add a get_file(); fput(); around that whole iter call sequence.
> 
> And that's a separate issue from "we should hold the fs freezer lock
> around the whole operation". So I think we need both.

Yup, these are two separate issues. I'm fine with adding get_file(); fput()
around the iter call sequence. Then, when we have struct file available for
the whole time ->write_iter runs, I'd prefer to keep the call to fool
lockdep where original file_end_write() call was - that gives us proper
lockdep coverage for all the code behind iter_op(). The downside is we
cannot keep helpers as elegant as you suggested in your patch but I believe
it's bearable and worth the additional lockdep coverage. I'll send patches
shortly...

								Honza
Jan Kara Oct. 30, 2016, 10:52 a.m. UTC | #6
On Sun 30-10-16 10:44:37, Jan Kara wrote:
> On Sat 29-10-16 13:09:25, Linus Torvalds wrote:
> > On Sat, Oct 29, 2016 at 12:17 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > AFAICS, the possibility of dropping the last reference to struct file
> > > before ->write_iter() has returned is fundamentally broken.  I might be
> > > missing something subtle here, but...
> > 
> > Ok, let's add a get_file(); fput(); around that whole iter call sequence.
> > 
> > And that's a separate issue from "we should hold the fs freezer lock
> > around the whole operation". So I think we need both.
> 
> Yup, these are two separate issues. I'm fine with adding get_file(); fput()
> around the iter call sequence. Then, when we have struct file available for
> the whole time ->write_iter runs, I'd prefer to keep the call to fool
> lockdep where original file_end_write() call was - that gives us proper
> lockdep coverage for all the code behind iter_op(). The downside is we
> cannot keep helpers as elegant as you suggested in your patch but I believe
> it's bearable and worth the additional lockdep coverage. I'll send patches
> shortly...

Hum, the additional refcount patch oopses on me when running generic/323,
I'll have to board to my flight to US shortly so I won't be able to send it
soon - maybe when I'm transferring in Denver ;).

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 30, 2016, 3:58 p.m. UTC | #7
On Sun, Oct 30, 2016 at 11:52:31AM +0100, Jan Kara wrote:
> Hum, the additional refcount patch oopses on me when running generic/323,
> I'll have to board to my flight to US shortly so I won't be able to send it
> soon - maybe when I'm transferring in Denver ;).

I've got a version that works and the rest of the patches rebased on top
of it.

And I've also got internet on my flight, so I'll be able to post it as
soon as testing finishes :)

> 
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
---end quoted text---
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/aio.c b/fs/aio.c
index 1157e13a36d6..3f66331ef90c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1066,6 +1066,27 @@  static struct kioctx *lookup_ioctx(unsigned long ctx_id)
 	return ret;
 }
 
+/*
+ * We do file_start_write/file_end_write() to make sure
+ * we have filesystem freeze protection over the whole
+ * AIO write sequence, but to make sure that lockdep does
+ * not complain about the held lock when we return to
+ * userspace, we tell it that we release and reaquire the
+ * lock.
+ */
+static void aio_file_start_write(struct file *file)
+{
+	file_start_write(file);
+	__sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE);
+}
+
+static void aio_file_end_write(struct file *file)
+{
+	__sb_writers_acquired(file_inode(file)->i_sb, SB_FREEZE_WRITE);
+	file_end_write(file);
+}
+
+
 /* aio_complete
  *	Called when the io request on the given iocb is complete.
  */
@@ -1078,6 +1099,9 @@  static void aio_complete(struct kiocb *kiocb, long res, long res2)
 	unsigned tail, pos, head;
 	unsigned long	flags;
 
+	if (kiocb->ki_flags & IOCB_WRITE)
+		aio_file_end_write(kiocb->ki_filp);
+
 	/*
 	 * Special case handling for sync iocbs:
 	 *  - events go directly into the iocb for fast handling
@@ -1460,13 +1484,14 @@  static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
 			return ret;
 		}
 
-		if (rw == WRITE)
-			file_start_write(file);
+		if (rw == WRITE) {
+			/* aio_complete() will end the write */
+			req->ki_flags |= IOCB_WRITE;
+			aio_file_start_write(file);
+		}
 
 		ret = iter_op(req, &iter);
 
-		if (rw == WRITE)
-			file_end_write(file);
 		kfree(iovec);
 		break;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 16d2b6e874d6..db600e9bb1b4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -321,6 +321,7 @@  struct writeback_control;
 #define IOCB_HIPRI		(1 << 3)
 #define IOCB_DSYNC		(1 << 4)
 #define IOCB_SYNC		(1 << 5)
+#define IOCB_WRITE		(1 << 6)
 
 struct kiocb {
 	struct file		*ki_filp;