From patchwork Tue Aug 18 06:29:48 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: NeilBrown X-Patchwork-Id: 7028331 X-Patchwork-Delegate: snitzer@redhat.com Return-Path: X-Original-To: patchwork-dm-devel@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork1.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.136]) by patchwork1.web.kernel.org (Postfix) with ESMTP id 0DAEA9F344 for ; Tue, 18 Aug 2015 06:34:34 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 18A9A206E3 for ; Tue, 18 Aug 2015 06:34:33 +0000 (UTC) Received: from mx4-phx2.redhat.com (mx4-phx2.redhat.com [209.132.183.25]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id ECC6D206D6 for ; Tue, 18 Aug 2015 06:34:31 +0000 (UTC) Received: from lists01.pubmisc.prod.ext.phx2.redhat.com (lists01.pubmisc.prod.ext.phx2.redhat.com [10.5.19.33]) by mx4-phx2.redhat.com (8.13.8/8.13.8) with ESMTP id t7I6Tx0J011008; Tue, 18 Aug 2015 02:30:00 -0400 Received: from int-mx14.intmail.prod.int.phx2.redhat.com (int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27]) by lists01.pubmisc.prod.ext.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id t7I6TwGb020320 for ; Tue, 18 Aug 2015 02:29:58 -0400 Received: from mx1.redhat.com (ext-mx05.extmail.prod.ext.phx2.redhat.com [10.5.110.29]) by int-mx14.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id t7I6Twu7030587 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Tue, 18 Aug 2015 02:29:58 -0400 Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) by mx1.redhat.com (Postfix) with ESMTPS id 36FFC344F70; Tue, 18 Aug 2015 06:29:57 +0000 (UTC) X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay1.suse.de (charybdis-ext.suse.de [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id C3A2FAAC2; Tue, 18 Aug 2015 06:29:55 +0000 (UTC) Date: Tue, 18 Aug 2015 16:29:48 +1000 From: NeilBrown To: Mikulas Patocka Message-ID: <20150818162948.4e27a969@noble> In-Reply-To: References: <20150810135551.64d7dbac@noble> <20150812114631.15268065@noble> <20150813104355.485ed63d@noble> MIME-Version: 1.0 X-RedHat-Spam-Score: -4.601 (BAYES_50, DCC_REPUT_00_12, RCVD_IN_DNSWL_HI, SPF_PASS) 195.135.220.15 mx2.suse.de 195.135.220.15 mx2.suse.de X-Scanned-By: MIMEDefang 2.68 on 10.5.11.27 X-Scanned-By: MIMEDefang 2.78 on 10.5.110.29 X-loop: dm-devel@redhat.com Cc: dm-devel@redhat.com Subject: Re: [dm-devel] dm-snap deadlock in pending_complete() X-BeenThere: dm-devel@redhat.com X-Mailman-Version: 2.1.12 Precedence: junk Reply-To: device-mapper development List-Id: device-mapper development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: dm-devel-bounces@redhat.com Errors-To: dm-devel-bounces@redhat.com X-Spam-Status: No, score=-7.3 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On Mon, 17 Aug 2015 09:48:57 -0400 (EDT) Mikulas Patocka wrote: > > > On Thu, 13 Aug 2015, NeilBrown wrote: > > Or we could change __split_and_process_bio to use bio_split() to split > > the bio, then handle the first and call generic_make_request on the > > second. That would effectively put the second half on the end of > > bio_list so it wouldn't be tried until all requests derived from the > > first half have been handled. > > I don't think it will fix the bug - even if you put the second half of the > bio at the end of bio_list, it will still wait until other entries on the > bio list are processed. > > For example - device 1 gets a bio, splits it to bio1 and bio2, forwards > them to device 2 and put them on current->bio_list > > Device 2 receives bio1 (popped from curretn->bio_list), splits it to bio3 > and bio4, forwards them to device 3 and puts them at the end of > current->bio_list > > Device 2 receives bio2 (popped from current->bio_list), waits until bio1 > finishes, but bio1 won't ever finish because it depends on bio3 and bio4 > that are on current->bio_list. > Yes, you are right. I think I was thinking of a slightly different sort of problem. That is more insidious than I imagined, but maybe it leads to a fairly straight forward solution. Suppose we treated current->bio_list like a stack instead of a queue. In your example, bio would be split into bio1 and bio2 that are both pushed onto the stack - lets push them in reverse order to maintain a similar set of dependencies. So push bio2 then bio1. Then bio1 is popped off, split into bio3 and bio4, and pushed back. bio3 is popped and handled - nothing new pushed. bio4 is popped and handled - nothing new pushed. bio2 is popped, waits for bio1 - which will complete as nothing stops it - and then proceed (probably gets split into bio4 and bio5). The key idea here is that a bio for a lower-level device (lower in the stacking order, where filesystem is at the top and hardware at the bottom) is never placed after (beneath) a bio for an upper level device in the stack. So we always handle the lower-level bios first. This makes sense as upper level devices may want to wait for lower level devices, but never the reverse. We probably don't want a strict stack discipline as if one driver submits two bios, they should still be processed in that order. Rather: each call to a make_request_fn gets a new queue to attach bios to, and when it completes, this queue is pushed onto the stack of other pending bios. Something like this. Though that is wrong at least because it doesn't create a proper linkage between the old 'bio' and the clone created by clone_bio(). This would result in cloned part being fully processed before the remainder is looked at at all. NeilBrown --- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel diff --git a/block/blk-core.c b/block/blk-core.c index 627ed0c593fb..b4663eed7615 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1961,12 +1961,15 @@ void generic_make_request(struct bio *bio) * bio_list, and call into ->make_request() again. */ BUG_ON(bio->bi_next); - bio_list_init(&bio_list_on_stack); - current->bio_list = &bio_list_on_stack; do { struct request_queue *q = bdev_get_queue(bio->bi_bdev); + struct bio_list remainder; + remainder = bio_list_on_stack; + bio_list_init(&bio_list_on_stack); + current->bio_list = &bio_list_on_stack; q->make_request_fn(q, bio); + bio_list_merge(&bio_list_on_stack, &remainder); bio = bio_list_pop(current->bio_list); } while (bio); So before handling a bio we put all other delayed bios (which must be for devices on the same or a higher level) in 'remainder' and initialize bio_list_on_stack which becomes current->bio_list. bios submitted by that make_request_fn call (all for lower-level devices) are collected in bio_list_on_stack which is then pushed onto the remainder (actually remainder is spliced underneath bio_list_on_stack). Then the top bio is handled. If the most recent make_request_fn submitted any bios, this will be the first of those, otherwise it will whatever is next from some previous call. This isn't sufficient by itself. It still needs dm_make_request to split a bio of the front of the original bio, map that, then submit the mapped bio and the remains of the split bio. I imagine something vaguely like this: diff --git a/drivers/md/dm.c b/drivers/md/dm.c index ab37ae114e94..977678ff8d82 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1711,8 +1711,11 @@ static void __split_and_process_bio(struct mapped_device *md, } else { ci.bio = bio; ci.sector_count = bio_sectors(bio); - while (ci.sector_count && !error) - error = __split_and_process_non_flush(&ci); + error = __split_and_process_non_flush(&ci); + if (!error && ci.sector_count) { + bio->bi_iter.bi_size = to_bytes(ci.sector_count); + generic_make_request(bio); + } } /* drop the extra reference count */