From patchwork Thu Oct 8 15:31:47 2015
X-Patchwork-Submitter: kernel@kyup.com
X-Patchwork-Id: 7353811
From: Nikolay Borisov <kernel@kyup.com>
To: tytso@mit.edu, adilger.kernel@dilger.ca, viro@zeniv.linux.org.uk,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Cc: operations@siteground.com, mm@1h.com
Subject: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI
Date: Thu, 8 Oct 2015 18:31:47 +0300
Message-Id: <1444318308-27560-1-git-send-email-kernel@kyup.com>
X-Mailer: git-send-email 1.7.1
X-Mailing-List: linux-fsdevel@vger.kernel.org

Currently, ext4_finish_bio completes bios by first disabling local
interrupts and then acquiring a bit_spin_lock on the page's buffer-head
state. However, the buffer heads in question might be under async write,
so the wait on the bit_spin_lock can leave the CPU spinning with
interrupts disabled for an arbitrary period of time.
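For reference, the locking order in question currently looks like this
(a condensed excerpt of ext4_finish_bio() in fs/ext4/page-io.c; the
per-buffer processing is elided, the full context is in the diff below):

	local_irq_save(flags);		/* interrupts go off first...	*/
	bit_spin_lock(BH_Uptodate_Lock, &head->b_state);	/* ...then we spin */
	do {
		/* per-buffer end-io processing elided */
	} while ((bh = bh->b_this_page) != head);
	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
	local_irq_restore(flags);

If another CPU holds BH_Uptodate_Lock for a long time, this CPU spins
inside bit_spin_lock() with interrupts disabled and cannot service the
page-draining IPI described below.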
If in the meantime there is demand for memory that cannot otherwise be
satisfied, the allocator might have to resort to draining the per-CPU
page lists, like so:

PID: 31111  TASK: ffff881cbb2fb870  CPU: 2   COMMAND: "kworker/u96:0"
 #0 [ffff881fffa46dc0] crash_nmi_callback at ffffffff8106f24e
 #1 [ffff881fffa46de0] nmi_handle at ffffffff8104c152
 #2 [ffff881fffa46e70] do_nmi at ffffffff8104c3b4
 #3 [ffff881fffa46ef0] end_repeat_nmi at ffffffff81656e2e
    [exception RIP: smp_call_function_many+577]
    RIP: ffffffff810e7f81  RSP: ffff880d35b815c8  RFLAGS: 00000202
    RAX: 0000000000000017  RBX: ffffffff81142690  RCX: 0000000000000017
    RDX: ffff883fff375478  RSI: 0000000000000040  RDI: 0000000000000040
    RBP: ffff880d35b81628   R8: ffff881fffa51ec8   R9: 0000000000000000
    R10: 0000000000000000  R11: ffffffff812943f3  R12: 0000000000000000
    R13: ffff881fffa51ec0  R14: ffff881fffa51ec8  R15: 0000000000011f00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff880d35b815c8] smp_call_function_many at ffffffff810e7f81
 #5 [ffff880d35b81630] on_each_cpu_mask at ffffffff810e801c
 #6 [ffff880d35b81660] drain_all_pages at ffffffff81140178
 #7 [ffff880d35b81690] __alloc_pages_nodemask at ffffffff8114310b
 #8 [ffff880d35b81810] alloc_pages_current at ffffffff81181c5e
 #9 [ffff880d35b81860] new_slab at ffffffff81188305

However, this call will never return, since on_each_cpu_mask is invoked
with its last argument set to 1, i.e. it waits until the IPI handler has
been executed on every CPU. Additionally, if there is another thread on
which ext4_finish_bio depends in order to complete, e.g.:

PID: 34220  TASK: ffff883937660810  CPU: 44  COMMAND: "kworker/u98:39"
 #0 [ffff88209d5b10b8] __schedule at ffffffff81653d5a
 #1 [ffff88209d5b1150] schedule at ffffffff816542f9
 #2 [ffff88209d5b1160] schedule_preempt_disabled at ffffffff81654686
 #3 [ffff88209d5b1180] __mutex_lock_slowpath at ffffffff816521eb
 #4 [ffff88209d5b1200] mutex_lock at ffffffff816522d1
 #5 [ffff88209d5b1220] new_read at ffffffffa0152a7e [dm_bufio]
 #6 [ffff88209d5b1280] dm_bufio_get at ffffffffa0152ba6 [dm_bufio]
 #7 [ffff88209d5b1290] dm_bm_read_try_lock at ffffffffa015c878 [dm_persistent_data]
 #8 [ffff88209d5b12e0] dm_tm_read_lock at ffffffffa015f7ad [dm_persistent_data]
 #9 [ffff88209d5b12f0] bn_read_lock at ffffffffa016281b [dm_persistent_data]

and this second thread in turn depends on the original allocation
succeeding, a hard lockup occurs: ext4_finish_bio is waiting for
block_write_full_page to complete, which depends on the original memory
allocation succeeding, which in turn depends on the IPI being executed
on every core.
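The wait is explicit in the allocator. Roughly, the call visible in the
first backtrace looks like this (a paraphrase of drain_all_pages() in
mm/page_alloc.c of this era; the exact code varies between kernel
versions):

	/*
	 * Ask every CPU that holds per-cpu pages to free them.  The final
	 * argument (wait == 1) makes on_each_cpu_mask() block until
	 * drain_local_pages() has finished on every CPU in the mask.  A
	 * CPU spinning in ext4_finish_bio with interrupts disabled never
	 * handles the IPI, so this call never returns.
	 */
	on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);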
For completeness, here is what the call stack of the hung
ext4_finish_bio looks like:

[427160.405277] NMI backtrace for cpu 23
[427160.405279] CPU: 23 PID: 4611 Comm: kworker/u98:7 Tainted: G W 3.12.47-clouder1 #1
[427160.405281] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
[427160.405285] Workqueue: writeback bdi_writeback_workfn (flush-252:148)
[427160.405286] task: ffff8825aa819830 ti: ffff882b19180000 task.ti: ffff882b19180000
[427160.405290] RIP: 0010:[] [] ext4_finish_bio+0x273/0x2a0
[427160.405291] RSP: 0000:ffff883fff3639b0 EFLAGS: 00000002
[427160.405292] RAX: ffff882b19180000 RBX: ffff883f67480a80 RCX: 0000000000000110
[427160.405292] RDX: ffff882b19180000 RSI: 0000000000000000 RDI: ffff883f67480a80
[427160.405293] RBP: ffff883fff363a70 R08: 0000000000014b80 R09: ffff881fff454f00
[427160.405294] R10: ffffea00473214c0 R11: ffffffff8113bfd7 R12: ffff880826272138
[427160.405295] R13: 0000000000000000 R14: 0000000000000000 R15: ffffea00aeaea400
[427160.405296] FS: 0000000000000000(0000) GS:ffff883fff360000(0000) knlGS:0000000000000000
[427160.405296] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[427160.405297] CR2: 0000003c5b009c24 CR3: 0000000001c0b000 CR4: 00000000001407e0
[427160.405297] Stack:
[427160.405305]  0000000000000000 ffffffff8203f230 ffff883fff363a00 ffff882b19180000
[427160.405312]  ffff882b19180000 ffff882b19180000 00000400018e0af8 ffff882b19180000
[427160.405319]  ffff883f67480a80 0000000000000000 0000000000000202 00000000d219e720
[427160.405320] Call Trace:
[427160.405324]  <IRQ>
[427160.405327]  [] ext4_end_bio+0xc8/0x120
[427160.405335]  [] bio_endio+0x1d/0x40
[427160.405341]  [] dec_pending+0x1c1/0x360
[427160.405345]  [] clone_endio+0x76/0xa0
[427160.405350]  [] bio_endio+0x1d/0x40
[427160.405353]  [] dec_pending+0x1c1/0x360
[427160.405358]  [] clone_endio+0x76/0xa0
[427160.405362]  [] bio_endio+0x1d/0x40
[427160.405365]  [] dec_pending+0x1c1/0x360
[427160.405369]  [] clone_endio+0x76/0xa0
[427160.405373]  [] bio_endio+0x1d/0x40
[427160.405380]  [] blk_update_request+0x21b/0x450
[427160.405385]  [] blk_update_bidi_request+0x27/0xb0
[427160.405389]  [] blk_end_bidi_request+0x2f/0x80
[427160.405392]  [] blk_end_request+0x10/0x20
[427160.405400]  [] scsi_io_completion+0xbc/0x620
[427160.405404]  [] scsi_finish_command+0xc9/0x130
[427160.405408]  [] scsi_softirq_done+0x147/0x170
[427160.405413]  [] blk_done_softirq+0x7d/0x90
[427160.405418]  [] __do_softirq+0x137/0x2e0
[427160.405422]  [] call_softirq+0x1c/0x30
[427160.405427]  [] do_softirq+0x8d/0xc0
[427160.405428]  [] irq_exit+0x95/0xa0
[427160.405431]  [] smp_call_function_single_interrupt+0x35/0x40
[427160.405434]  [] call_function_single_interrupt+0x6f/0x80
[427160.405436]  <EOI>
[427160.405438]  [] ? memcpy+0x6/0x110
[427160.405440]  [] ? __bio_clone+0x26/0x70
[427160.405442]  [] __clone_and_map_data_bio+0x139/0x160
[427160.405445]  [] __split_and_process_bio+0x3ed/0x490
[427160.405447]  [] dm_request+0x136/0x1e0
[427160.405449]  [] generic_make_request+0xca/0x100
[427160.405451]  [] submit_bio+0x79/0x160
[427160.405453]  [] ? account_page_writeback+0x2d/0x40
[427160.405455]  [] ? __test_set_page_writeback+0x16d/0x1f0
[427160.405457]  [] ext4_io_submit+0x29/0x50
[427160.405459]  [] ext4_bio_write_page+0x12b/0x2f0
[427160.405461]  [] mpage_submit_page+0x68/0x90
[427160.405463]  [] mpage_process_page_bufs+0xf0/0x110
[427160.405465]  [] mpage_prepare_extent_to_map+0x210/0x310
[427160.405468]  [] ? ext4_writepages+0x361/0xc60
[427160.405472]  [] ? __ext4_journal_start_sb+0x79/0x110
[427160.405474]  [] ext4_writepages+0x398/0xc60
[427160.405477]  [] ? blk_finish_plug+0x18/0x50
[427160.405479]  [] do_writepages+0x20/0x40
[427160.405483]  [] __writeback_single_inode+0x49/0x2b0
[427160.405487]  [] ? wake_up_bit+0x2f/0x40
[427160.405488]  [] writeback_sb_inodes+0x2de/0x540
[427160.405492]  [] ? put_super+0x25/0x50
[427160.405494]  [] __writeback_inodes_wb+0x9e/0xd0
[427160.405495]  [] wb_writeback+0x23b/0x340
[427160.405497]  [] wb_do_writeback+0x99/0x230
[427160.405500]  [] ? set_worker_desc+0x81/0x90
[427160.405503]  [] ? dequeue_task_fair+0x36a/0x4c0
[427160.405505]  [] bdi_writeback_workfn+0x88/0x260
[427160.405509]  [] ? finish_task_switch+0x4e/0xe0
[427160.405511]  [] ? __schedule+0x2dc/0x760
[427160.405514]  [] process_one_work+0x195/0x550
[427160.405517]  [] worker_thread+0x13a/0x430
[427160.405519]  [] ? manage_workers+0x2c0/0x2c0
[427160.405521]  [] kthread+0xce/0xe0
[427160.405523]  [] ? kthread_freezable_should_stop+0x80/0x80
[427160.405525]  [] ret_from_fork+0x58/0x90
[427160.405527]  [] ? kthread_freezable_should_stop+0x80/0x80

To fix the situation, this patch changes the order in which the
bit_spin_lock is acquired and local interrupts are disabled: the lock is
now taken first and interrupts are disabled only afterwards. The
expected effect is that even if a core is spinning on the bit lock, it
still has its interrupts enabled and is able to respond to IPIs. This
eventually allows memory allocations that require draining the per-CPU
page lists to succeed.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
---
 fs/ext4/page-io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 84ba4d2..095331b 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -96,8 +96,8 @@ static void ext4_finish_bio(struct bio *bio)
 		 * We check all buffers in the page under BH_Uptodate_Lock
 		 * to avoid races with other end io clearing async_write flags
 		 */
-		local_irq_save(flags);
 		bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
+		local_irq_save(flags);
 		do {
 			if (bh_offset(bh) < bio_start ||
 			    bh_offset(bh) + bh->b_size > bio_end) {
@@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
 			if (bio->bi_error)
 				buffer_io_error(bh);
 		} while ((bh = bh->b_this_page) != head);
-		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
 		local_irq_restore(flags);
+		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
 		if (!under_io) {
 #ifdef CONFIG_EXT4_FS_ENCRYPTION
 			if (ctx)
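With the patch applied, the critical section has the following shape
(condensed in the same way as the sketch above, per-buffer processing
elided):

	bit_spin_lock(BH_Uptodate_Lock, &head->b_state);	/* spin with IRQs on */
	local_irq_save(flags);		/* IRQs off only once the lock is held */
	do {
		/* per-buffer end-io processing elided */
	} while ((bh = bh->b_this_page) != head);
	local_irq_restore(flags);
	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);

A CPU contending for BH_Uptodate_Lock now spins with interrupts enabled,
so it can still service the drain_all_pages() IPI while it waits for the
lock.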