From patchwork Wed Apr 12 17:43:49 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sumit Semwal X-Patchwork-Id: 9678033 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 68CBF60382 for ; Wed, 12 Apr 2017 17:44:55 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5E57228589 for ; Wed, 12 Apr 2017 17:44:55 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 52DC82868D; Wed, 12 Apr 2017 17:44:55 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CADF128589 for ; Wed, 12 Apr 2017 17:44:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752220AbdDLRow (ORCPT ); Wed, 12 Apr 2017 13:44:52 -0400 Received: from mail-pg0-f51.google.com ([74.125.83.51]:35967 "EHLO mail-pg0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755093AbdDLRor (ORCPT ); Wed, 12 Apr 2017 13:44:47 -0400 Received: by mail-pg0-f51.google.com with SMTP id g2so17975847pge.3 for ; Wed, 12 Apr 2017 10:44:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=18SwAii/IQRWqTWRoBqgj1Esna8k0ayO/Q1NCJQo/6w=; b=XVbZJLhczyVYpKY+Ek9JjI34RNlq5nEBMBt9K+T4fMOqPNCdQgwi3lYQwRqyq6WXvj JbiSt1wtq6Ge85zOOUzOLPJY8Tzf4ii2i/2BzW6k07eX8rOewnXQ3p5gW+TNQHrqGmzM AQC9ZxCmBd37epuTBRh/8ap+ogpyyV2pavsLY= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=18SwAii/IQRWqTWRoBqgj1Esna8k0ayO/Q1NCJQo/6w=; b=Hh4L1GxfmLjd3K/JJF0JEt6IybVv8z+Af6VQX08lZq1+Pfig3D1W2xa138NYyIhnuk zUfBtfkAkNMY94cuzYoZ8jEnBlU97KC9JPe2Wfy+TwECapEGw9T1He9t5rbX6Hf6pM26 VsDm3Ognjfq6P6LXJNm7OwrKJ58Wgmi1sQnkRgo5/LqCvgh/vc+0WS5+nF4T4/0OaHKS ICsxD4VAdIPow3rWCoCasyBMwEC66wNyc+Zfrai5zmYsq6G/9SoBbVcXY9dGN3Q7Hs5A SAAoHudEqAZhT1BZeZRX7g53xwuGI3g+5XvlJAW2O4u5JjpVuzfLqekPL8C7ez8FchYc 5LFg== X-Gm-Message-State: AN3rC/78tK6FQks3t+fnjbf+LhY5kvXhLHVlWRMhhFPL7znl7PyV5f5Vrp12Zu3Z6kwxR4SF X-Received: by 10.99.149.86 with SMTP id t22mr29732246pgn.154.1492019086292; Wed, 12 Apr 2017 10:44:46 -0700 (PDT) Received: from phantom.lan ([106.51.225.38]) by smtp.gmail.com with ESMTPSA id r17sm37750019pfa.13.2017.04.12.10.44.42 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Wed, 12 Apr 2017 10:44:45 -0700 (PDT) From: Sumit Semwal To: stable@vger.kernel.org Cc: Gabriel Krisman Bertazi , Brian King , Douglas Miller , linux-block@vger.kernel.org, linux-scsi@vger.kernel.org, Jens Axboe , Sumit Semwal Subject: [v2 PATCH for-4.4 15/16] blk-mq: Avoid memory reclaim when remapping queues Date: Wed, 12 Apr 2017 23:13:49 +0530 Message-Id: <1492019030-13567-16-git-send-email-sumit.semwal@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1492019030-13567-1-git-send-email-sumit.semwal@linaro.org> References: <1492019030-13567-1-git-send-email-sumit.semwal@linaro.org> Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi [ Upstream commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09 ] While stressing memory and IO at the same time we changed SMT settings, we were able to consistently trigger deadlocks in the mm system, which froze the entire machine. I think that under memory stress conditions, the large allocations performed by blk_mq_init_rq_map may trigger a reclaim, which stalls waiting on the block layer remmaping completion, thus deadlocking the system. The trace below was collected after the machine stalled, waiting for the hotplug event completion. The simplest fix for this is to make allocations in this path non-reclaimable, with GFP_NOIO. With this patch, We couldn't hit the issue anymore. This should apply on top of Jens's for-next branch cleanly. Changes since v1: - Use GFP_NOIO instead of GFP_NOWAIT. Call Trace: [c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable) [c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430 [c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0 [c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0 [c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30 [c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0 [c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0 [c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs] [c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs] [c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs] [c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210 [c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0 [c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0 [c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520 [c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250 [c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0 [c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380 [c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360 [c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0 [c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250 [c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100 [c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0 [c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0 [c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250 [c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120 [c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0 [c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150 [c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0 [c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120 [c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0 [c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0 [c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0 [c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250 [c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0 [c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270 [c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110 [c000000f0160be30] [c000000000009204] system_call+0x38/0xec Signed-off-by: Gabriel Krisman Bertazi Cc: Brian King Cc: Douglas Miller Cc: linux-block@vger.kernel.org Cc: linux-scsi@vger.kernel.org Signed-off-by: Jens Axboe Signed-off-by: Sumit Semwal --- block/blk-mq.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index d8d63c3..0d1af3e 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1470,7 +1470,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set, INIT_LIST_HEAD(&tags->page_list); tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *), - GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY, + GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, set->numa_node); if (!tags->rqs) { blk_mq_free_tags(tags); @@ -1496,7 +1496,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set, do { page = alloc_pages_node(set->numa_node, - GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO, + GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO, this_order); if (page) break; @@ -1517,7 +1517,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set, * Allow kmemleak to scan these pages as they contain pointers * to additional allocations like via ops->init_request(). */ - kmemleak_alloc(p, order_to_size(this_order), 1, GFP_KERNEL); + kmemleak_alloc(p, order_to_size(this_order), 1, GFP_NOIO); entries_per_page = order_to_size(this_order) / rq_size; to_do = min(entries_per_page, set->queue_depth - i); left -= to_do * rq_size;