From patchwork Mon Jun 20 13:31:46 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Kara X-Patchwork-Id: 9187411 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 196AF6075E for ; Mon, 20 Jun 2016 13:32:18 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 053D426992 for ; Mon, 20 Jun 2016 13:32:18 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id ED960276AE; Mon, 20 Jun 2016 13:32:17 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, T_TVD_MIME_EPI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id DBDAB26992 for ; Mon, 20 Jun 2016 13:32:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752962AbcFTNbw (ORCPT ); Mon, 20 Jun 2016 09:31:52 -0400 Received: from mx2.suse.de ([195.135.220.15]:42342 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751496AbcFTNbu (ORCPT ); Mon, 20 Jun 2016 09:31:50 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay1.suse.de (charybdis-ext.suse.de [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id 5F490AD14; Mon, 20 Jun 2016 13:31:48 +0000 (UTC) Received: by quack2.suse.cz (Postfix, from userid 1000) id AE8391E0ECC; Mon, 20 Jun 2016 15:31:46 +0200 (CEST) Date: Mon, 20 Jun 2016 15:31:46 +0200 From: Jan Kara To: Tejun Heo Cc: Jens Axboe , Andrey Ryabinin , Alexander Viro , "linux-fsdevel@vger.kernel.org" , LKML , syzkaller , Kostya Serebryany , Alexander Potapenko , Ilya Dryomov , Jan Kara , Dmitry Vyukov , kernel-team@fb.com, Ted Tso Subject: Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it Message-ID: <20160620133146.GE6882@quack2.suse.cz> References: <20160420211456.GE4775@htj.duckdns.org> <20160421170635.GO7822@mtj.duckdns.org> <20160617160405.GJ3262@mtj.duckdns.org> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20160617160405.GJ3262@mtj.duckdns.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Hi, On Fri 17-06-16 12:04:05, Tejun Heo wrote: > 43d1c0eb7e11 ("block: detach bdev inode from its wb in > __blkdev_put()") detached bdev inode from its wb as the bdev inode may > outlive the underlying bdi and thus the wb. This is accomplished by > invoking inode_detach_wb() from __blkdev_put(); however, while the > inode can't be dirtied by the time control reaches there, that doesn't > guarantee that writeback isn't already in progress on the inode. This > can lead to the inode being disassociated from its wb while writeback > operation is in flight causing oopses like the following. > Fix it by flushing the wb dwork before detaching the inode. Combined > with the fact that the inode can no longer be dirtied, this guarantees > that no writeback operation can be in flight or initiated. Sorry for the late reply but now when thinking about the patch I don't think it is quite right. Writeback can happen from other contexts than just the worker one (e.g. kswapd can do writeback, or in some out-of-memory situations we may punt to doing writeback directly instead of calling the worker, or sync(2) calls fdatawrite() for block device inode directly when iterating through blockdev superblock). So flushing the workqueue IMHO is not covering 100% of cases. I wanted to suggest to use inode_wait_for_writeback() which will make sure writeback is done with the inode. However we effectively already do this by calling bdev_write_inode() which calls writeback_single_inode() in WB_SYNC_ALL mode. So by the time that call completes we are sure writeback code is not looking at the inode. *However* what I think is happening is that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes the inode and clears all dirty bits, however the inode still stays in the b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently once flusher work runs, it finds the inode, looks at it and boom. And the problem seems to be that write_inode_now(inode, true) does not result in I_DIRTY_TIME being cleared. Attached patch should fix this issue - it is compile-tested only. Dmitry, can you check whether this patch fixes the issue for you as well? Honza From 907aceb33f6b598f00c9b1d10a110c74e4a0f954 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 20 Jun 2016 15:13:42 +0200 Subject: [PATCH] writeback: Write dirty times for WB_SYNC_ALL writeback Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and queue_io() so that inodes which have only dirty timestamps are properly written on fsync(2) and sync(2). However there are other call sites - most notably going through write_inode_now() - which expect inode to be clean after WB_SYNC_ALL writeback. This is not currently true as we do not clear I_DIRTY_TIME in __writeback_single_inode() even for WB_SYNC_ALL writeback in all the cases. This then resulted in the following oops because bdev_write_inode() did not clean the inode and writeback code later stumbled over a dirty inode with detached wb. general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN Modules linked in: CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-11:0) task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000 RIP: 0010:[] [] locked_inode_to_wb_and_lock_list+0xa2/0x750 RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050 RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286 RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000 R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0 R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40 FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0 DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 Stack: ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948 ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948 ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e Call Trace: [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309 [] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554 [] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600 [] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709 [< inline >] wb_do_writeback fs/fs-writeback.c:1844 [] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884 [] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094 [] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228 [] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 [] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392 Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3 RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212 RIP [] locked_inode_to_wb_and_lock_list+0xa2/0x750 fs/fs-writeback.c:281 RSP ---[ end trace 986a4d314dcb2694 ]--- Fix the problem by making sure __writeback_single_inode() writes inode only with dirty times in WB_SYNC_ALL mode. Reported-by: Dmitry Vyukov Signed-off-by: Jan Kara --- fs/fs-writeback.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 989a2cef6b76..51efb644410a 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1291,6 +1291,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) dirty = inode->i_state & I_DIRTY; if (inode->i_state & I_DIRTY_TIME) { if ((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) || + wbc->sync_mode == WB_SYNC_ALL || unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) || unlikely(time_after(jiffies, (inode->dirtied_time_when + -- 2.6.6