From patchwork Thu Jan 5 10:54:00 2017
X-Patchwork-Submitter: Jinpu Wang <jinpuwang@gmail.com>
X-Patchwork-Id: 9498949
X-Patchwork-Delegate: snitzer@redhat.com
MIME-Version: 1.0
In-Reply-To: <20170104185046.GA982@redhat.com>
References:
<1467990243-3531-1-git-send-email-lars.ellenberg@linbit.com>
 <1467990243-3531-2-git-send-email-lars.ellenberg@linbit.com>
 <20160711141042.GY13335@soda.linbit>
 <76d9bf14-d848-4405-8358-3771c0a93d39@profitbricks.com>
 <20161223114553.GP4138@soda.linbit>
 <878tqrmmqx.fsf@notabene.neil.brown.name>
 <20170104185046.GA982@redhat.com>
From: Jinpu Wang <jinpuwang@gmail.com>
Date: Thu, 5 Jan 2017 11:54:00 +0100
To: Mike Snitzer
Cc: Peter Zijlstra, Ming Lei, LKML, Keith Busch,
 device-mapper development, Jack Wang, Jinpu Wang, Alasdair Kergon,
 Roland Kammerer, Michael Wang, Takashi Iwai, Zheng Liu, Shaohua Li,
 Kent Overstreet, linux-block@vger.kernel.org, linux-bcache@vger.kernel.org,
 Mikulas Patocka, Ingo Molnar, Jens Axboe, linux-raid,
 "Martin K. Petersen", Jiri Kosina, NeilBrown, Lars Ellenberg,
 "Kirill A. Shutemov"
Subject: Re: [dm-devel] [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2017-01-04 19:50 GMT+01:00 Mike Snitzer:
> On Wed, Jan 04 2017 at 12:12am -0500,
> NeilBrown wrote:
>
>> On Tue, Jan 03 2017, Jack Wang wrote:
>>
>> > 2016-12-23 12:45 GMT+01:00 Lars Ellenberg:
>> >> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
>> >>> Dear Maintainers,
>> >>>
>> >>> I'd like to ask about the status of this patch, since we hit the
>> >>> issue too during our testing on md raid1.
>> >>>
>> >>> The split remainder bio_A was queued first, followed by bio_B for
>> >>> the lower device. At that moment the raid started freezing: the loop
>> >>> takes out bio_A first and delivers it, which hangs because the raid
>> >>> is freezing, while the freeze never ends because it is waiting for
>> >>> bio_B to finish, and bio_B is still on the queue, waiting for
>> >>> bio_A to finish...
>> >>>
>> >>> We're looking for a good solution, and we found this patch had
>> >>> already progressed a lot, but we can't find it in linux-next,
>> >>> so we'd like to ask: is this fix still planned for
>> >>> upstream?
>> >>
>> >> I don't see why not; I'd even like to have it in older kernels,
>> >> but did not have the time and energy to push it.
>> >>
>> >> Thanks for the bump.
>> >>
>> >> Lars
>> >>
>> > Hi folks,
>> >
>> > As Michael mentioned, we hit a bug this patch is trying to fix.
>> > Neil suggested another way to fix it, attached below.
>> > I personally prefer Neil's version, as it is a smaller code change
>> > and more straightforward.
>> >
>> > Could you share your comments, so we can get one fix into mainline?
>> >
>> > Thanks,
>> > Jinpu
>> > From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001
>> > From: NeilBrown
>> > Date: Wed, 14 Dec 2016 16:55:52 +0100
>> > Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()
>> >
>> > When we call wait_barrier, we might have some bios waiting
>> > in current->bio_list, which prevents the array_freeze call from
>> > completing. Those can only be internal READs, which have already
>> > passed the wait_barrier call (thus incrementing nr_pending), but
>> > were still not submitted to the lower level, due to the
>> > generic_make_request logic for avoiding recursive calls. In such a
>> > case we have a deadlock:
>> > - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
>> > - internal READ bios will not be submitted, and thus freeze_array will
>> > never complete.
>> >
>> > To fix this, modify generic_make_request to always sort bio_list_on_stack
>> > first with the lowest level, then higher, up to the same level.
>> >
>> > Sent to the linux-raid mailing list:
>> > https://marc.info/?l=linux-raid&m=148232453107685&w=2
>> >
>>
>> This should probably also have
>>
>> Inspired-by: Lars Ellenberg
>>
>> or something like that, as I was building on Lars' ideas when I wrote this.
>>
>> It would also be worth noting in the description that this addresses
>> issues with dm and drbd as well as md.
>
> I never saw this patch but certainly like the relative simplicity of the
> solution when compared with other approaches taken, e.g. (5 topmost
> commits on this branch):
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
>
>> In fact, I think that with this patch in place, much of the need for the
>> rescue_workqueue won't exist any more. I cannot promise it can be
>> removed completely, but it should not be too hard to make it optional and
>> only enabled for those few block devices that will still need it.
>> The rescuer should only be needed for a bioset which can be allocated
>> from twice in the one call to ->make_request_fn. This would include
>> raid0, for example, though raid0_make_request could be rewritten to not
>> use a loop and to just call generic_make_request(bio) if bio != split.
>
> Mikulas, would you be willing to try the below patch with the
> dm-snapshot deadlock scenario and report back on whether it fixes that?
>
> Patch below looks to be the same as here:
> https://marc.info/?l=linux-raid&m=148232453107685&q=p3
>
> Neil and/or others, if that isn't the patch that should be tested, please
> provide a pointer to the latest.
>
> Thanks,
> Mike

Thanks Mike,

I've rebased the patch onto Linux 4.10-rc2 and updated the description
as Neil suggested. If Mikulas gets positive feedback, we can go with it.

Cheers,
Jinpu

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

From 4ffaefb719c129ed51f9fcb235b945caf56de8d1 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Wed, 14 Dec 2016 16:55:52 +0100
Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()

When we call wait_barrier, we might have some bios waiting
in current->bio_list, which prevents the array_freeze call from
completing. Those can only be internal READs, which have already
passed the wait_barrier call (thus incrementing nr_pending), but
were still not submitted to the lower level, due to the
generic_make_request logic for avoiding recursive calls. In such a
case we have a deadlock:
- array_frozen is already set to 1, so wait_barrier unconditionally waits, so
- internal READ bios will not be submitted, and thus freeze_array will
never complete.

To fix this, modify generic_make_request to always sort bio_list_on_stack
first with the lowest level, then higher, up to the same level.

This also addresses issues with dm and drbd as well as md.
Sent to the linux-raid mailing list:
https://marc.info/?l=linux-raid&m=148232453107685&w=2

Inspired-by: Lars Ellenberg
Suggested-by: NeilBrown
Signed-off-by: Jack Wang
---
 block/blk-core.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c..2f74129 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2019,9 +2019,30 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			struct bio_list lower, same, hold;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_init(&hold);
+			bio_list_merge(&hold, &bio_list_on_stack);
+			bio_list_init(&bio_list_on_stack);
+
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 
 			bio = bio_list_pop(current->bio_list);
 		} else {
-- 
2.7.4