diff mbox

block: flush writeback dwork before detaching a bdev inode from it

Message ID 20160620133146.GE6882@quack2.suse.cz (mailing list archive)
State New, archived
Headers show

Commit Message

Jan Kara June 20, 2016, 1:31 p.m. UTC
Hi,

On Fri 17-06-16 12:04:05, Tejun Heo wrote:
> 43d1c0eb7e11 ("block: detach bdev inode from its wb in
> __blkdev_put()") detached bdev inode from its wb as the bdev inode may
> outlive the underlying bdi and thus the wb.  This is accomplished by
> invoking inode_detach_wb() from __blkdev_put(); however, while the
> inode can't be dirtied by the time control reaches there, that doesn't
> guarantee that writeback isn't already in progress on the inode.  This
> can lead to the inode being disassociated from its wb while writeback
> operation is in flight causing oopses like the following.

<snip>

> Fix it by flushing the wb dwork before detaching the inode.  Combined
> with the fact that the inode can no longer be dirtied, this guarantees
> that no writeback operation can be in flight or initiated.

Sorry for the late reply but now when thinking about the patch I don't
think it is quite right. Writeback can happen from other contexts than just
the worker one (e.g. kswapd can do writeback, or in some out-of-memory
situations we may punt to doing writeback directly instead of calling the
worker, or sync(2) calls fdatawrite() for block device inode directly when
iterating through blockdev superblock). So flushing the workqueue IMHO is
not covering 100% of cases. 

I wanted to suggest to use inode_wait_for_writeback() which will make sure
writeback is done with the inode. However we effectively already do this
by calling bdev_write_inode() which calls writeback_single_inode() in
WB_SYNC_ALL mode. So by the time that call completes we are sure writeback
code is not looking at the inode. *However* what I think is happening is
that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes
the inode and clears all dirty bits, however the inode still stays in the
b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently
once flusher work runs, it finds the inode, looks at it and boom. And the
problem seems to be that write_inode_now(inode, true) does not result in
I_DIRTY_TIME being cleared.

Attached patch should fix this issue - it is compile-tested only. Dmitry,
can you check whether this patch fixes the issue for you as well?

								Honza

Comments

Dmitry Vyukov June 20, 2016, 1:38 p.m. UTC | #1
On Mon, Jun 20, 2016 at 3:31 PM, Jan Kara <jack@suse.cz> wrote:
> Hi,
>
> On Fri 17-06-16 12:04:05, Tejun Heo wrote:
>> 43d1c0eb7e11 ("block: detach bdev inode from its wb in
>> __blkdev_put()") detached bdev inode from its wb as the bdev inode may
>> outlive the underlying bdi and thus the wb.  This is accomplished by
>> invoking inode_detach_wb() from __blkdev_put(); however, while the
>> inode can't be dirtied by the time control reaches there, that doesn't
>> guarantee that writeback isn't already in progress on the inode.  This
>> can lead to the inode being disassociated from its wb while writeback
>> operation is in flight causing oopses like the following.
>
> <snip>
>
>> Fix it by flushing the wb dwork before detaching the inode.  Combined
>> with the fact that the inode can no longer be dirtied, this guarantees
>> that no writeback operation can be in flight or initiated.
>
> Sorry for the late reply but now when thinking about the patch I don't
> think it is quite right. Writeback can happen from other contexts than just
> the worker one (e.g. kswapd can do writeback, or in some out-of-memory
> situations we may punt to doing writeback directly instead of calling the
> worker, or sync(2) calls fdatawrite() for block device inode directly when
> iterating through blockdev superblock). So flushing the workqueue IMHO is
> not covering 100% of cases.
>
> I wanted to suggest to use inode_wait_for_writeback() which will make sure
> writeback is done with the inode. However we effectively already do this
> by calling bdev_write_inode() which calls writeback_single_inode() in
> WB_SYNC_ALL mode. So by the time that call completes we are sure writeback
> code is not looking at the inode. *However* what I think is happening is
> that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes
> the inode and clears all dirty bits, however the inode still stays in the
> b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently
> once flusher work runs, it finds the inode, looks at it and boom. And the
> problem seems to be that write_inode_now(inode, true) does not result in
> I_DIRTY_TIME being cleared.
>
> Attached patch should fix this issue - it is compile-tested only. Dmitry,
> can you check whether this patch fixes the issue for you as well?


I can't directly test it because crash happened very infrequently.
If Tejun/Ted agree that it is the right way to fix it, then I can
patch it, restart the fuzzer and leave it running for a while.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo June 20, 2016, 5:40 p.m. UTC | #2
Hello,

On Mon, Jun 20, 2016 at 03:38:41PM +0200, Dmitry Vyukov wrote:
> > Sorry for the late reply but now when thinking about the patch I don't
> > think it is quite right. Writeback can happen from other contexts than just
> > the worker one (e.g. kswapd can do writeback, or in some out-of-memory
> > situations we may punt to doing writeback directly instead of calling the
> > worker, or sync(2) calls fdatawrite() for block device inode directly when
> > iterating through blockdev superblock). So flushing the workqueue IMHO is
> > not covering 100% of cases.

Hmmm, yeah, the patch undoes what the cgroup writeback changes added
but it looks like the addition was exposing the existing problem more
rather than causing a new one.

> > I wanted to suggest to use inode_wait_for_writeback() which will make sure
> > writeback is done with the inode. However we effectively already do this
> > by calling bdev_write_inode() which calls writeback_single_inode() in
> > WB_SYNC_ALL mode. So by the time that call completes we are sure writeback
> > code is not looking at the inode. *However* what I think is happening is
> > that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes
> > the inode and clears all dirty bits, however the inode still stays in the
> > b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently
> > once flusher work runs, it finds the inode, looks at it and boom. And the
> > problem seems to be that write_inode_now(inode, true) does not result in
> > I_DIRTY_TIME being cleared.
> >
> > Attached patch should fix this issue - it is compile-tested only. Dmitry,
> > can you check whether this patch fixes the issue for you as well?
> 
> I can't directly test it because crash happened very infrequently.
> If Tejun/Ted agree that it is the right way to fix it, then I can
> patch it, restart the fuzzer and leave it running for a while.

Yes, please try out the patch.

Thanks a lot.
Dmitry Vyukov June 21, 2016, 12:58 p.m. UTC | #3
On Mon, Jun 20, 2016 at 7:40 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Mon, Jun 20, 2016 at 03:38:41PM +0200, Dmitry Vyukov wrote:
>> > Sorry for the late reply but now when thinking about the patch I don't
>> > think it is quite right. Writeback can happen from other contexts than just
>> > the worker one (e.g. kswapd can do writeback, or in some out-of-memory
>> > situations we may punt to doing writeback directly instead of calling the
>> > worker, or sync(2) calls fdatawrite() for block device inode directly when
>> > iterating through blockdev superblock). So flushing the workqueue IMHO is
>> > not covering 100% of cases.
>
> Hmmm, yeah, the patch undoes what the cgroup writeback changes added
> but it looks like the addition was exposing the existing problem more
> rather than causing a new one.
>
>> > I wanted to suggest to use inode_wait_for_writeback() which will make sure
>> > writeback is done with the inode. However we effectively already do this
>> > by calling bdev_write_inode() which calls writeback_single_inode() in
>> > WB_SYNC_ALL mode. So by the time that call completes we are sure writeback
>> > code is not looking at the inode. *However* what I think is happening is
>> > that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes
>> > the inode and clears all dirty bits, however the inode still stays in the
>> > b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently
>> > once flusher work runs, it finds the inode, looks at it and boom. And the
>> > problem seems to be that write_inode_now(inode, true) does not result in
>> > I_DIRTY_TIME being cleared.
>> >
>> > Attached patch should fix this issue - it is compile-tested only. Dmitry,
>> > can you check whether this patch fixes the issue for you as well?
>>
>> I can't directly test it because crash happened very infrequently.
>> If Tejun/Ted agree that it is the right way to fix it, then I can
>> patch it, restart the fuzzer and leave it running for a while.
>
> Yes, please try out the patch.

Done
At least it should catch any new introduced bugs.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

From 907aceb33f6b598f00c9b1d10a110c74e4a0f954 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 20 Jun 2016 15:13:42 +0200
Subject: [PATCH] writeback: Write dirty times for WB_SYNC_ALL writeback

Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and
queue_io() so that inodes which have only dirty timestamps are properly
written on fsync(2) and sync(2). However there are other call sites -
most notably going through write_inode_now() - which expect inode to be
clean after WB_SYNC_ALL writeback. This is not currently true as we do
not clear I_DIRTY_TIME in __writeback_single_inode() even for
WB_SYNC_ALL writeback in all the cases. This then resulted in the
following oops because bdev_write_inode() did not clean the inode and
writeback code later stumbled over a dirty inode with detached wb.

  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
  Modules linked in:
  CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: writeback wb_workfn (flush-11:0)
  task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
  RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
  locked_inode_to_wb_and_lock_list+0xa2/0x750
  RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
  RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
  RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
  R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
  R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
  FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
  DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
  Stack:
   ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
   ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
   ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
  Call Trace:
   [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
   [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
   [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
   [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
   [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
   [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
   [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
   [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
   [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
   [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
  Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
  00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
  80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
  RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
  RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
  fs/fs-writeback.c:281
   RSP <ffff88006cdaf7d0>
  ---[ end trace 986a4d314dcb2694 ]---

Fix the problem by making sure __writeback_single_inode() writes inode
only with dirty times in WB_SYNC_ALL mode.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 989a2cef6b76..51efb644410a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1291,6 +1291,7 @@  __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	dirty = inode->i_state & I_DIRTY;
 	if (inode->i_state & I_DIRTY_TIME) {
 		if ((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) ||
+		    wbc->sync_mode == WB_SYNC_ALL ||
 		    unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) ||
 		    unlikely(time_after(jiffies,
 					(inode->dirtied_time_when +
-- 
2.6.6