From patchwork Thu Jan 5 10:54:00 2017
X-Patchwork-Submitter: Jinpu Wang
X-Patchwork-Id: 9498865
From: 王金浦 (Jinpu Wang)
Date: Thu, 5 Jan 2017 11:54:00 +0100
Subject: Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
In-Reply-To: <20170104185046.GA982@redhat.com>
References: <1467990243-3531-1-git-send-email-lars.ellenberg@linbit.com>
 <1467990243-3531-2-git-send-email-lars.ellenberg@linbit.com>
 <20160711141042.GY13335@soda.linbit>
 <76d9bf14-d848-4405-8358-3771c0a93d39@profitbricks.com>
 <20161223114553.GP4138@soda.linbit>
 <878tqrmmqx.fsf@notabene.neil.brown.name>
 <20170104185046.GA982@redhat.com>
To: Mike Snitzer
Cc: NeilBrown, Mikulas Patocka, Jack Wang, Lars Ellenberg, Jens Axboe,
 linux-raid, Michael Wang, Peter Zijlstra, Jiri Kosina, Ming Lei, LKML,
 Zheng Liu, linux-block@vger.kernel.org, Takashi Iwai,
 linux-bcache@vger.kernel.org, Ingo Molnar, Alasdair Kergon,
 "Martin K. Petersen", Keith Busch, device-mapper development,
 Shaohua Li, Kent Overstreet, "Kirill A. Shutemov", Roland Kammerer,
 Jinpu Wang
Sender: linux-block-owner@vger.kernel.org
X-Mailing-List: linux-block@vger.kernel.org

2017-01-04 19:50 GMT+01:00 Mike Snitzer:
> On Wed, Jan 04 2017 at 12:12am -0500,
> NeilBrown wrote:
>
>> On Tue, Jan 03 2017, Jack Wang wrote:
>>
>> > 2016-12-23 12:45 GMT+01:00 Lars Ellenberg:
>> >> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
>> >>> Dear Maintainers,
>> >>>
>> >>> I'd like to ask about the status of this patch, since we hit the
>> >>> issue too during our testing on md raid1.
>> >>>
>> >>> The split remainder bio_A was queued ahead, followed by bio_B for
>> >>> the lower device. At this moment the raid started freezing: the
>> >>> loop takes out bio_A first and delivers it, which hangs since the
>> >>> raid is freezing, while the freeze never ends since it is waiting
>> >>> for bio_B to finish, and bio_B is still on the queue, waiting for
>> >>> bio_A to finish...
>> >>>
>> >>> We're looking for a good solution and we found this patch had
>> >>> already progressed a lot, but we can't find it in linux-next,
>> >>> so we'd like to ask: is this fix still planned for upstream?
>> >>
>> >> I don't see why not; I'd even like to have it in older kernels,
>> >> but did not have the time and energy to push it.
>> >>
>> >> Thanks for the bump.
>> >>
>> >> Lars
>> >>
>> > Hi folks,
>> >
>> > As Michael mentioned, we hit the bug this patch is trying to fix.
>> > Neil suggested another way to fix it; I attached it below.
>> > I personally prefer Neil's version, as it is a smaller and more
>> > straightforward change.
>> >
>> > Could you share your comments, so we can get one fix into mainline?
>> >
>> > Thanks,
>> > Jinpu
>> >
>> > From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001
>> > From: NeilBrown
>> > Date: Wed, 14 Dec 2016 16:55:52 +0100
>> > Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()
>> >
>> > When we call wait_barrier, we might have some bios waiting
>> > in current->bio_list, which prevents the freeze_array call from
>> > completing. Those can only be internal READs, which have already
>> > passed the wait_barrier call (thus incrementing nr_pending), but
>> > were still not submitted to the lower level, due to the
>> > generic_make_request logic to avoid recursive calls. In such a case
>> > we have a deadlock:
>> > - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
>> > - internal READ bios will not be submitted, and thus freeze_array will
>> >   never complete.
>> >
>> > To fix this, modify generic_make_request to always sort bio_list_on_stack
>> > first with lowest level, then higher, until same level.
>> >
>> > Sent to the linux-raid mailing list:
>> > https://marc.info/?l=linux-raid&m=148232453107685&w=2
>> >
>>
>> This should probably also have
>>
>> Inspired-by: Lars Ellenberg
>>
>> or something like that, as I was building on Lars' ideas when I wrote this.
>>
>> It would also be worth noting in the description that this addresses
>> issues with dm and drbd as well as md.
>
> I never saw this patch but certainly like the relative simplicity of the
> solution when compared with other approaches taken, e.g. (5 topmost
> commits on this branch):
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
>
>> In fact, I think that with this patch in place, much of the need for the
>> rescue_workqueue won't exist any more. I cannot promise it can be
>> removed completely, but it should not be too hard to make it optional and
>> only enable it for those few block devices that will still need it.
>> The rescuer should only be needed for a bioset which can be allocated
>> from twice in the one call to ->make_request_fn. This would include
>> raid0 for example, though raid0_make_request could be rewritten to not
>> use a loop and to just call generic_make_request(bio) if bio != split.
>
> Mikulas, would you be willing to try the below patch with the
> dm-snapshot deadlock scenario and report back on whether it fixes that?
>
> Patch below looks to be the same as here:
> https://marc.info/?l=linux-raid&m=148232453107685&q=p3
>
> Neil and/or others, if that isn't the patch that should be tested, please
> provide a pointer to the latest.
>
> Thanks,
> Mike

Thanks Mike,

I've rebased the patch onto Linux 4.10-rc2 and updated the description
as Neil suggested. If Mikulas gets positive feedback, then we can go
with it.

Cheers,
Jinpu

From 4ffaefb719c129ed51f9fcb235b945caf56de8d1 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Wed, 14 Dec 2016 16:55:52 +0100
Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()

When we call wait_barrier, we might have some bios waiting
in current->bio_list, which prevents the freeze_array call from
completing. Those can only be internal READs, which have already
passed the wait_barrier call (thus incrementing nr_pending), but
were still not submitted to the lower level, due to the
generic_make_request logic to avoid recursive calls. In such a case
we have a deadlock:
- array_frozen is already set to 1, so wait_barrier unconditionally waits, so
- internal READ bios will not be submitted, and thus freeze_array will
  never complete.

To fix this, modify generic_make_request to always sort bio_list_on_stack
first with lowest level, then higher, until same level.

This also addresses issues with dm and drbd, as well as md.
Sent to the linux-raid mailing list:
https://marc.info/?l=linux-raid&m=148232453107685&w=2

Inspired-by: Lars Ellenberg
Suggested-by: NeilBrown
Signed-off-by: Jack Wang
---
 block/blk-core.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c..2f74129 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2019,9 +2019,30 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			struct bio_list lower, same, hold;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_init(&hold);
+			bio_list_merge(&hold, &bio_list_on_stack);
+			bio_list_init(&bio_list_on_stack);
+
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 
 			bio = bio_list_pop(current->bio_list);
 		} else {
-- 
2.7.4