Message ID | CA+55aFxubzEr6JUB9US2HBuijCCe5Vs5tR0nbST+tj=gkrDtqg@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Sat, Oct 29, 2016 at 10:47:58AM -0700, Linus Torvalds wrote: > Also, honestly, make it use a helper: "aio_file_start_write()" and > "aio_file_end_write()" that has the comments and the lockdep games. > > Because that patch is just too effing ugly. > > Does something like the attached work for you guys? No. The use-after-free problem is real, nasty and only papered over by that patch. What happens is that io_submit_one() * allocates aio_kiocb * does fget() and stuffs the struct file * into kiocb * in case of early problems we call kiocb_free(), freeing kiocb and doing fput() on file, then bugger off. * otherwise, eventually we get to passing that iocb to ->read_iter()/->write_iter(). * if that has resulted in anything other than -EIOCBQUEUED, we call aio_complete(), which calls kiocb_free(), freeing kiocb and doing fput() on file. * if ->{read,write}_iter() returns -EIOCBQUEUED, we expect aio_complete() to be called asynchronously. And that call can happen as soon as we return from __blockdev_direct_IO() (even earlier, actually). As soon as that happens, the reference to struct file we'd acquired in io_submit_one() is dropped. If descriptor table had been shared, another thread might have already closed that sucker, and fput() from aio_complete() would free struct file. That's what this patch is papering over. Because if we hit that scenario and struct file *does* get closed asynchronously just as our ->write_iter() is returning from __blockdev_direct_IO(), we are fucked. Not only struct file might be freed - struct inode might've been gone too. And a bunch of ->write_iter/->read_iter instances do access struct inode after the call of __blockdev_direct_IO(). file_write_end() is just the tip of the iceberg - see examples I've posted today for the things we *can't* move around. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Oct 29, 2016 at 11:52 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > And that call can happen as soon as we return from __blockdev_direct_IO() > (even earlier, actually). As soon as that happens, the reference to > struct file we'd acquired in io_submit_one() is dropped. If descriptor > table had been shared, another thread might have already closed that sucker, > and fput() from aio_complete() would free struct file. But that's the point. We don't *do* anything like that any more. We now always do the final access from aio_complete(). So it doesn't matter if that is called asynchronously (very early) or not. That's the whole point of the patch. Exactly to do everything either *before* we even submit it (at which point no completion can happen), or doing it in aio_complete() which is guaranteed to be after the submission. No races, no use-after-free. What am I missing? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Oct 29, 2016 at 12:07:07PM -0700, Linus Torvalds wrote: > On Sat, Oct 29, 2016 at 11:52 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > And that call can happen as soon as we return from __blockdev_direct_IO() > > (even earlier, actually). As soon as that happens, the reference to > > struct file we'd acquired in io_submit_one() is dropped. If descriptor > > table had been shared, another thread might have already closed that sucker, > > and fput() from aio_complete() would free struct file. > > But that's the point. We don't *do* anything like that any more. We > now always do the final access from aio_complete(). So it doesn't > matter if that is called asynchronously (very early) or not. > > That's the whole point of the patch. Exactly to do everything either > *before* we even submit it (at which point no completion can happen), > or doing it in aio_complete() which is guaranteed to be after the > submission. No races, no use-after-free. > > What am I missing? if (ret > 0) ret = __generic_file_write_iter(iocb, from); inode_unlock(inode); in generic_file_write_iter() (inode might be gone here) written = mapping->a_ops->direct_IO(iocb, &data); /* * Finally, try again to invalidate clean pages which might have been * cached by non-direct readahead, or faulted in by get_user_pages() * if the source of the write was an mmap'ed region of the file * we're writing. Either one is a pretty crazy thing to do, * so we don't support it 100%. If this invalidation * fails, tough, the write still worked... */ if (mapping->nrpages) { invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT, end); in generic_file_direct_write() (mapping points into inode, which might be freed) ret = __blockdev_direct_IO(iocb, inode, target->bt_bdev, &data, xfs_get_blocks_direct, NULL, NULL, 0); if (ret >= 0) { iocb->ki_pos += ret; iov_iter_advance(to, ret); } xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED); in xfs_file_dio_aio_read() (ip points to xfs-private part of inode). And that - just from a quick look. We *do* access inode between the call of __blockdev_direct_IO() and return from ->write_iter(). What's more, as soon as aio_complete() has happened, what's to stop close from another thread + umount + rmmod unmapping that ->write_iter() completely? AFAICS, the possibility of dropping the last reference to struct file before ->write_iter() has returned is fundamentally broken. I might be missing something subtle here, but... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Oct 29, 2016 at 12:17 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > AFAICS, the possibility of dropping the last reference to struct file > before ->write_iter() has returned is fundamentally broken. I might be > missing something subtle here, but... Ok, let's add a get_file(); fput(); around that whole iter call sequence. And that's a separate issue from "we should hold the fs freezer lock around the whole operation". So I think we need both. Does that make everybody happy? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat 29-10-16 13:09:25, Linus Torvalds wrote: > On Sat, Oct 29, 2016 at 12:17 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > AFAICS, the possibility of dropping the last reference to struct file > > before ->write_iter() has returned is fundamentally broken. I might be > > missing something subtle here, but... > > Ok, let's add a get_file(); fput(); around that whole iter call sequence. > > And that's a separate issue from "we should hold the fs freezer lock > around the whole operation". So I think we need both. Yup, these are two separate issues. I'm fine with adding get_file(); fput() around the iter call sequence. Then, when we have struct file available for the whole time ->write_iter runs, I'd prefer to keep the call to fool lockdep where original file_end_write() call was - that gives us proper lockdep coverage for all the code behind iter_op(). The downside is we cannot keep helpers as elegant as you suggested in your patch but I believe it's bearable and worth the additional lockdep coverage. I'll send patches shortly... Honza
On Sun 30-10-16 10:44:37, Jan Kara wrote: > On Sat 29-10-16 13:09:25, Linus Torvalds wrote: > > On Sat, Oct 29, 2016 at 12:17 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > > AFAICS, the possibility of dropping the last reference to struct file > > > before ->write_iter() has returned is fundamentally broken. I might be > > > missing something subtle here, but... > > > > Ok, let's add a get_file(); fput(); around that whole iter call sequence. > > > > And that's a separate issue from "we should hold the fs freezer lock > > around the whole operation". So I think we need both. > > Yup, these are two separate issues. I'm fine with adding get_file(); fput() > around the iter call sequence. Then, when we have struct file available for > the whole time ->write_iter runs, I'd prefer to keep the call to fool > lockdep where original file_end_write() call was - that gives us proper > lockdep coverage for all the code behind iter_op(). The downside is we > cannot keep helpers as elegant as you suggested in your patch but I believe > it's bearable and worth the additional lockdep coverage. I'll send patches > shortly... Hum, the additional refcount patch oopses on me when running generic/323, I'll have to board to my flight to US shortly so I won't be able to send it soon - maybe when I'm transferring in Denver ;). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Oct 30, 2016 at 11:52:31AM +0100, Jan Kara wrote: > Hum, the additional refcount patch oopses on me when running generic/323, > I'll have to board to my flight to US shortly so I won't be able to send it > soon - maybe when I'm transferring in Denver ;). I've got a version that works and the rest of the patches rebased on top of it. And I've also got internet on my flight, so I'll be able to post it as soon as testing finishes :) > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ---end quoted text--- -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/aio.c b/fs/aio.c index 1157e13a36d6..3f66331ef90c 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1066,6 +1066,27 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id) return ret; } +/* + * We do file_start_write/file_end_write() to make sure + * we have filesystem freeze protection over the whole + * AIO write sequence, but to make sure that lockdep does + * not complain about the held lock when we return to + * userspace, we tell it that we release and reaquire the + * lock. + */ +static void aio_file_start_write(struct file *file) +{ + file_start_write(file); + __sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE); +} + +static void aio_file_end_write(struct file *file) +{ + __sb_writers_acquired(file_inode(file)->i_sb, SB_FREEZE_WRITE); + file_end_write(file); +} + + /* aio_complete * Called when the io request on the given iocb is complete. */ @@ -1078,6 +1099,9 @@ static void aio_complete(struct kiocb *kiocb, long res, long res2) unsigned tail, pos, head; unsigned long flags; + if (kiocb->ki_flags & IOCB_WRITE) + aio_file_end_write(kiocb->ki_filp); + /* * Special case handling for sync iocbs: * - events go directly into the iocb for fast handling @@ -1460,13 +1484,14 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode, return ret; } - if (rw == WRITE) - file_start_write(file); + if (rw == WRITE) { + /* aio_complete() will end the write */ + req->ki_flags |= IOCB_WRITE; + aio_file_start_write(file); + } ret = iter_op(req, &iter); - if (rw == WRITE) - file_end_write(file); kfree(iovec); break; diff --git a/include/linux/fs.h b/include/linux/fs.h index 16d2b6e874d6..db600e9bb1b4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -321,6 +321,7 @@ struct writeback_control; #define IOCB_HIPRI (1 << 3) #define IOCB_DSYNC (1 << 4) #define IOCB_SYNC (1 << 5) +#define IOCB_WRITE (1 << 6) struct kiocb { struct file *ki_filp;