From patchwork Wed Oct 14 20:47:39 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mike Snitzer X-Patchwork-Id: 7398411 X-Patchwork-Delegate: snitzer@redhat.com Return-Path: X-Original-To: patchwork-dm-devel@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork1.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.136]) by patchwork1.web.kernel.org (Postfix) with ESMTP id 4A5F79F302 for ; Wed, 14 Oct 2015 20:51:57 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 1F98B207A5 for ; Wed, 14 Oct 2015 20:51:56 +0000 (UTC) Received: from mx4-phx2.redhat.com (mx4-phx2.redhat.com [209.132.183.25]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id C24E120772 for ; Wed, 14 Oct 2015 20:51:54 +0000 (UTC) Received: from lists01.pubmisc.prod.ext.phx2.redhat.com (lists01.pubmisc.prod.ext.phx2.redhat.com [10.5.19.33]) by mx4-phx2.redhat.com (8.13.8/8.13.8) with ESMTP id t9EKlgw1029469; Wed, 14 Oct 2015 16:47:43 -0400 Received: from int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24]) by lists01.pubmisc.prod.ext.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id t9EKle1v019914 for ; Wed, 14 Oct 2015 16:47:41 -0400 Received: from localhost (vpn-54-18.rdu2.redhat.com [10.10.54.18]) by int-mx11.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id t9EKlerX016565 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Wed, 14 Oct 2015 16:47:40 -0400 Date: Wed, 14 Oct 2015 16:47:39 -0400 From: Mike Snitzer To: Jens Axboe Message-ID: <20151014204739.GA23449@redhat.com> References: <5384CE82.90601@kernel.dk> <20151005205943.GB25762@redhat.com> <20151006185016.GA31955@redhat.com> <20151006201637.GA4158@redhat.com> <20151008150859.GA11770@redhat.com> <20151009195203.GA18790@redhat.com> <20151009195907.GB18790@redhat.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20151009195907.GB18790@redhat.com> User-Agent: Mutt/1.5.23 (2014-03-12) X-Scanned-By: MIMEDefang 2.68 on 10.5.11.24 X-loop: dm-devel@redhat.com Cc: dm-devel@redhat.com, linux-kernel@vger.kernel.org, jmoyer@redhat.com, Mikulas Patocka , kent.overstreet@gmail.com, "Alasdair G. Kergon" Subject: [dm-devel] [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock X-BeenThere: dm-devel@redhat.com X-Mailman-Version: 2.1.12 Precedence: junk Reply-To: device-mapper development List-Id: device-mapper development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: dm-devel-bounces@redhat.com Errors-To: dm-devel-bounces@redhat.com X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, T_RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Mikulas Patocka The block layer uses per-process bio list to avoid recursion in generic_make_request. When generic_make_request is called recursively, the bio is added to current->bio_list and generic_make_request returns immediately. The top-level instance of generic_make_request takes bios from current->bio_list and processes them. Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers") created a workqueue for every bio set and code in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by redirecting bios queued on current->bio_list to the workqueue if the system is low on memory. However another deadlock (see below **) may happen, without any low memory condition, because generic_make_request is queuing bios to current->bio_list (rather than submitting them). Fix this deadlock by redirecting any bios on current->bio_list to the bio_set's rescue workqueue on every schedule call. Consequently, when the process blocks on a mutex, the bios queued on current->bio_list are dispatched to independent workqueus and they can complete without waiting for the mutex to be available. Also, now we can remove punt_bios_to_rescuer() and bio_alloc_bioset()'s calls to it because bio_alloc_bioset() will implicitly punt all bios on current->bio_list if it performs a blocking allocation. ** Here is the dm-snapshot deadlock that was observed: 1) Process A sends one-page read bio to the dm-snapshot target. The bio spans snapshot chunk boundary and so it is split to two bios by device mapper. 2) Device mapper creates the first sub-bio and sends it to the snapshot driver. 3) The function snapshot_map calls track_chunk (that allocates a structure dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps the bio to the underlying device and exits with DM_MAPIO_REMAPPED. 4) The remapped bio is submitted with generic_make_request, but it isn't issued - it is added to current->bio_list instead. 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the chunk affected be the first remapped bio, it takes down_write(&s->lock) and then loops in __check_for_conflicting_io, waiting for dm_snap_tracked_chunk created in step 3) to be released. 6) Process A continues, it creates a second sub-bio for the rest of the original bio. 7) snapshot_map is called for this new bio, it waits on down_write(&s->lock) that is held by Process B (in step 5). Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650 Signed-off-by: Mikulas Patocka Signed-off-by: Mike Snitzer Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers") Cc: stable@vger.kernel.org Reviewed-by: Jeff Moyer --- block/bio.c | 75 +++++++++++++++++++------------------------------- include/linux/blkdev.h | 19 +++++++++++-- kernel/sched/core.c | 7 ++--- 3 files changed, 48 insertions(+), 53 deletions(-) v3: improved patch header, changed sched/core.c block callout to blk_flush_queued_io(), io_schedule_timeout() also updated to use blk_flush_queued_io(), blk_flush_bio_list() now takes a @tsk argument rather than assuming current. v3 is now being submitted with more feeling now that (ab)using the onstack plugging proved problematic, please see: https://www.redhat.com/archives/dm-devel/2015-October/msg00087.html diff --git a/block/bio.c b/block/bio.c index ad3f276..99f5a2ad 100644 --- a/block/bio.c +++ b/block/bio.c @@ -354,35 +354,35 @@ static void bio_alloc_rescue(struct work_struct *work) } } -static void punt_bios_to_rescuer(struct bio_set *bs) +/** + * blk_flush_bio_list + * @tsk: task_struct whose bio_list must be flushed + * + * Pop bios queued on @tsk->bio_list and submit each of them to + * their rescue workqueue. + * + * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list. + * However, stacking drivers should use bio_set, so this shouldn't be + * an issue. + */ +void blk_flush_bio_list(struct task_struct *tsk) { - struct bio_list punt, nopunt; struct bio *bio; + struct bio_list list = *tsk->bio_list; + bio_list_init(tsk->bio_list); - /* - * In order to guarantee forward progress we must punt only bios that - * were allocated from this bio_set; otherwise, if there was a bio on - * there for a stacking driver higher up in the stack, processing it - * could require allocating bios from this bio_set, and doing that from - * our own rescuer would be bad. - * - * Since bio lists are singly linked, pop them all instead of trying to - * remove from the middle of the list: - */ - - bio_list_init(&punt); - bio_list_init(&nopunt); - - while ((bio = bio_list_pop(current->bio_list))) - bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio); - - *current->bio_list = nopunt; - - spin_lock(&bs->rescue_lock); - bio_list_merge(&bs->rescue_list, &punt); - spin_unlock(&bs->rescue_lock); + while ((bio = bio_list_pop(&list))) { + struct bio_set *bs = bio->bi_pool; + if (unlikely(!bs)) { + bio_list_add(tsk->bio_list, bio); + continue; + } - queue_work(bs->rescue_workqueue, &bs->rescue_work); + spin_lock(&bs->rescue_lock); + bio_list_add(&bs->rescue_list, bio); + queue_work(bs->rescue_workqueue, &bs->rescue_work); + spin_unlock(&bs->rescue_lock); + } } /** @@ -422,7 +422,6 @@ static void punt_bios_to_rescuer(struct bio_set *bs) */ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs) { - gfp_t saved_gfp = gfp_mask; unsigned front_pad; unsigned inline_vecs; unsigned long idx = BIO_POOL_NONE; @@ -457,23 +456,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs) * reserve. * * We solve this, and guarantee forward progress, with a rescuer - * workqueue per bio_set. If we go to allocate and there are - * bios on current->bio_list, we first try the allocation - * without __GFP_WAIT; if that fails, we punt those bios we - * would be blocking to the rescuer workqueue before we retry - * with the original gfp_flags. + * workqueue per bio_set. If an allocation would block (due to + * __GFP_WAIT) the scheduler will first punt all bios on + * current->bio_list to the rescuer workqueue. */ - - if (current->bio_list && !bio_list_empty(current->bio_list)) - gfp_mask &= ~__GFP_WAIT; - p = mempool_alloc(bs->bio_pool, gfp_mask); - if (!p && gfp_mask != saved_gfp) { - punt_bios_to_rescuer(bs); - gfp_mask = saved_gfp; - p = mempool_alloc(bs->bio_pool, gfp_mask); - } - front_pad = bs->front_pad; inline_vecs = BIO_INLINE_VECS; } @@ -486,12 +473,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs) if (nr_iovecs > inline_vecs) { bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool); - if (!bvl && gfp_mask != saved_gfp) { - punt_bios_to_rescuer(bs); - gfp_mask = saved_gfp; - bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool); - } - if (unlikely(!bvl)) goto err_free; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 19c2e94..5dc7415 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1084,6 +1084,22 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk) !list_empty(&plug->cb_list)); } +extern void blk_flush_bio_list(struct task_struct *tsk); + +static inline void blk_flush_queued_io(struct task_struct *tsk) +{ + /* + * Flush any queued bios to corresponding rescue threads. + */ + if (tsk->bio_list && !bio_list_empty(tsk->bio_list)) + blk_flush_bio_list(tsk); + /* + * Flush any plugged IO that is queued. + */ + if (blk_needs_flush_plug(tsk)) + blk_schedule_flush_plug(tsk); +} + /* * tag stuff */ @@ -1671,11 +1687,10 @@ static inline void blk_flush_plug(struct task_struct *task) { } -static inline void blk_schedule_flush_plug(struct task_struct *task) +static inline void blk_flush_queued_io(struct task_struct *tsk) { } - static inline bool blk_needs_flush_plug(struct task_struct *tsk) { return false; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 10a8faa..eaf9eb3 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3127,11 +3127,10 @@ static inline void sched_submit_work(struct task_struct *tsk) if (!tsk->state || tsk_is_pi_blocked(tsk)) return; /* - * If we are going to sleep and we have plugged IO queued, + * If we are going to sleep and we have queued IO, * make sure to submit it to avoid deadlocks. */ - if (blk_needs_flush_plug(tsk)) - blk_schedule_flush_plug(tsk); + blk_flush_queued_io(tsk); } asmlinkage __visible void __sched schedule(void) @@ -4718,7 +4717,7 @@ long __sched io_schedule_timeout(long timeout) long ret; current->in_iowait = 1; - blk_schedule_flush_plug(current); + blk_flush_queued_io(current); delayacct_blkio_start(); rq = raw_rq();