[v3,for-4.4] block: flush queued bios when process blocks to avoid deadlock
diff mbox

Message ID 20151014204739.GA23449@redhat.com
State Deferred, archived
Delegated to: Mike Snitzer
Headers show

Commit Message

Mike Snitzer Oct. 14, 2015, 8:47 p.m. UTC
From: Mikulas Patocka <mpatocka@redhat.com>

The block layer uses per-process bio list to avoid recursion in
generic_make_request.  When generic_make_request is called recursively,
the bio is added to current->bio_list and generic_make_request returns
immediately.  The top-level instance of generic_make_request takes bios
from current->bio_list and processes them.

Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
stacking drivers") created a workqueue for every bio set and code
in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
redirecting bios queued on current->bio_list to the workqueue if the
system is low on memory.  However another deadlock (see below **) may
happen, without any low memory condition, because generic_make_request
is queuing bios to current->bio_list (rather than submitting them).

Fix this deadlock by redirecting any bios on current->bio_list to the
bio_set's rescue workqueue on every schedule call.  Consequently, when
the process blocks on a mutex, the bios queued on current->bio_list are
dispatched to independent workqueus and they can complete without
waiting for the mutex to be available.

Also, now we can remove punt_bios_to_rescuer() and bio_alloc_bioset()'s
calls to it because bio_alloc_bioset() will implicitly punt all bios on
current->bio_list if it performs a blocking allocation.

** Here is the dm-snapshot deadlock that was observed:

1) Process A sends one-page read bio to the dm-snapshot target. The bio
spans snapshot chunk boundary and so it is split to two bios by device
mapper.

2) Device mapper creates the first sub-bio and sends it to the snapshot
driver.

3) The function snapshot_map calls track_chunk (that allocates a structure
dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
the bio to the underlying device and exits with DM_MAPIO_REMAPPED.

4) The remapped bio is submitted with generic_make_request, but it isn't
issued - it is added to current->bio_list instead.

5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
chunk affected be the first remapped bio, it takes down_write(&s->lock)
and then loops in __check_for_conflicting_io, waiting for
dm_snap_tracked_chunk created in step 3) to be released.

6) Process A continues, it creates a second sub-bio for the rest of the
original bio.

7) snapshot_map is called for this new bio, it waits on
down_write(&s->lock) that is held by Process B (in step 5).

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
Cc: stable@vger.kernel.org
---
 block/bio.c            | 75 +++++++++++++++++++-------------------------------
 include/linux/blkdev.h | 19 +++++++++++--
 kernel/sched/core.c    |  7 ++---
 3 files changed, 48 insertions(+), 53 deletions(-)

v3: improved patch header, changed sched/core.c block callout to blk_flush_queued_io(),
    io_schedule_timeout() also updated to use blk_flush_queued_io(), blk_flush_bio_list()
    now takes a @tsk argument rather than assuming current. v3 is now being submitted with
    more feeling now that (ab)using the onstack plugging proved problematic, please see:
    https://www.redhat.com/archives/dm-devel/2015-October/msg00087.html

Comments

Jeff Moyer Oct. 14, 2015, 9:44 p.m. UTC | #1
I don't see a problem with this.  Jens, I'm not sure what you were
getting at about the using the existing plugging infrastructure.  I
couldn't think of a clean way to integrate this code with the plugging.
They really do serve two separate purposes, and I don't think growing a
conditional in the scheduler hook is all that onerous.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

Mike Snitzer <snitzer@redhat.com> writes:

> From: Mikulas Patocka <mpatocka@redhat.com>
>
> The block layer uses per-process bio list to avoid recursion in
> generic_make_request.  When generic_make_request is called recursively,
> the bio is added to current->bio_list and generic_make_request returns
> immediately.  The top-level instance of generic_make_request takes bios
> from current->bio_list and processes them.
>
> Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
> stacking drivers") created a workqueue for every bio set and code
> in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
> redirecting bios queued on current->bio_list to the workqueue if the
> system is low on memory.  However another deadlock (see below **) may
> happen, without any low memory condition, because generic_make_request
> is queuing bios to current->bio_list (rather than submitting them).
>
> Fix this deadlock by redirecting any bios on current->bio_list to the
> bio_set's rescue workqueue on every schedule call.  Consequently, when
> the process blocks on a mutex, the bios queued on current->bio_list are
> dispatched to independent workqueus and they can complete without
> waiting for the mutex to be available.
>
> Also, now we can remove punt_bios_to_rescuer() and bio_alloc_bioset()'s
> calls to it because bio_alloc_bioset() will implicitly punt all bios on
> current->bio_list if it performs a blocking allocation.
>
> ** Here is the dm-snapshot deadlock that was observed:
>
> 1) Process A sends one-page read bio to the dm-snapshot target. The bio
> spans snapshot chunk boundary and so it is split to two bios by device
> mapper.
>
> 2) Device mapper creates the first sub-bio and sends it to the snapshot
> driver.
>
> 3) The function snapshot_map calls track_chunk (that allocates a structure
> dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
> the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
>
> 4) The remapped bio is submitted with generic_make_request, but it isn't
> issued - it is added to current->bio_list instead.
>
> 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
> chunk affected be the first remapped bio, it takes down_write(&s->lock)
> and then loops in __check_for_conflicting_io, waiting for
> dm_snap_tracked_chunk created in step 3) to be released.
>
> 6) Process A continues, it creates a second sub-bio for the rest of the
> original bio.
>
> 7) snapshot_map is called for this new bio, it waits on
> down_write(&s->lock) that is held by Process B (in step 5).
>
> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
> Cc: stable@vger.kernel.org
> ---
>  block/bio.c            | 75 +++++++++++++++++++-------------------------------
>  include/linux/blkdev.h | 19 +++++++++++--
>  kernel/sched/core.c    |  7 ++---
>  3 files changed, 48 insertions(+), 53 deletions(-)
>
> v3: improved patch header, changed sched/core.c block callout to blk_flush_queued_io(),
>     io_schedule_timeout() also updated to use blk_flush_queued_io(), blk_flush_bio_list()
>     now takes a @tsk argument rather than assuming current. v3 is now being submitted with
>     more feeling now that (ab)using the onstack plugging proved problematic, please see:
>     https://www.redhat.com/archives/dm-devel/2015-October/msg00087.html
>
> diff --git a/block/bio.c b/block/bio.c
> index ad3f276..99f5a2ad 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -354,35 +354,35 @@ static void bio_alloc_rescue(struct work_struct *work)
>  	}
>  }
>  
> -static void punt_bios_to_rescuer(struct bio_set *bs)
> +/**
> + * blk_flush_bio_list
> + * @tsk: task_struct whose bio_list must be flushed
> + *
> + * Pop bios queued on @tsk->bio_list and submit each of them to
> + * their rescue workqueue.
> + *
> + * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
> + * However, stacking drivers should use bio_set, so this shouldn't be
> + * an issue.
> + */
> +void blk_flush_bio_list(struct task_struct *tsk)
>  {
> -	struct bio_list punt, nopunt;
>  	struct bio *bio;
> +	struct bio_list list = *tsk->bio_list;
> +	bio_list_init(tsk->bio_list);
>  
> -	/*
> -	 * In order to guarantee forward progress we must punt only bios that
> -	 * were allocated from this bio_set; otherwise, if there was a bio on
> -	 * there for a stacking driver higher up in the stack, processing it
> -	 * could require allocating bios from this bio_set, and doing that from
> -	 * our own rescuer would be bad.
> -	 *
> -	 * Since bio lists are singly linked, pop them all instead of trying to
> -	 * remove from the middle of the list:
> -	 */
> -
> -	bio_list_init(&punt);
> -	bio_list_init(&nopunt);
> -
> -	while ((bio = bio_list_pop(current->bio_list)))
> -		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> -
> -	*current->bio_list = nopunt;
> -
> -	spin_lock(&bs->rescue_lock);
> -	bio_list_merge(&bs->rescue_list, &punt);
> -	spin_unlock(&bs->rescue_lock);
> +	while ((bio = bio_list_pop(&list))) {
> +		struct bio_set *bs = bio->bi_pool;
> +		if (unlikely(!bs)) {
> +			bio_list_add(tsk->bio_list, bio);
> +			continue;
> +		}
>  
> -	queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +		spin_lock(&bs->rescue_lock);
> +		bio_list_add(&bs->rescue_list, bio);
> +		queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +		spin_unlock(&bs->rescue_lock);
> +	}
>  }
>  
>  /**
> @@ -422,7 +422,6 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>   */
>  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  {
> -	gfp_t saved_gfp = gfp_mask;
>  	unsigned front_pad;
>  	unsigned inline_vecs;
>  	unsigned long idx = BIO_POOL_NONE;
> @@ -457,23 +456,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  		 * reserve.
>  		 *
>  		 * We solve this, and guarantee forward progress, with a rescuer
> -		 * workqueue per bio_set. If we go to allocate and there are
> -		 * bios on current->bio_list, we first try the allocation
> -		 * without __GFP_WAIT; if that fails, we punt those bios we
> -		 * would be blocking to the rescuer workqueue before we retry
> -		 * with the original gfp_flags.
> +		 * workqueue per bio_set. If an allocation would block (due to
> +		 * __GFP_WAIT) the scheduler will first punt all bios on
> +		 * current->bio_list to the rescuer workqueue.
>  		 */
> -
> -		if (current->bio_list && !bio_list_empty(current->bio_list))
> -			gfp_mask &= ~__GFP_WAIT;
> -
>  		p = mempool_alloc(bs->bio_pool, gfp_mask);
> -		if (!p && gfp_mask != saved_gfp) {
> -			punt_bios_to_rescuer(bs);
> -			gfp_mask = saved_gfp;
> -			p = mempool_alloc(bs->bio_pool, gfp_mask);
> -		}
> -
>  		front_pad = bs->front_pad;
>  		inline_vecs = BIO_INLINE_VECS;
>  	}
> @@ -486,12 +473,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  
>  	if (nr_iovecs > inline_vecs) {
>  		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> -		if (!bvl && gfp_mask != saved_gfp) {
> -			punt_bios_to_rescuer(bs);
> -			gfp_mask = saved_gfp;
> -			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> -		}
> -
>  		if (unlikely(!bvl))
>  			goto err_free;
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 19c2e94..5dc7415 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1084,6 +1084,22 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
>  		 !list_empty(&plug->cb_list));
>  }
>  
> +extern void blk_flush_bio_list(struct task_struct *tsk);
> +
> +static inline void blk_flush_queued_io(struct task_struct *tsk)
> +{
> +	/*
> +	 * Flush any queued bios to corresponding rescue threads.
> +	 */
> +	if (tsk->bio_list && !bio_list_empty(tsk->bio_list))
> +		blk_flush_bio_list(tsk);
> +	/*
> +	 * Flush any plugged IO that is queued.
> +	 */
> +	if (blk_needs_flush_plug(tsk))
> +		blk_schedule_flush_plug(tsk);
> +}
> +
>  /*
>   * tag stuff
>   */
> @@ -1671,11 +1687,10 @@ static inline void blk_flush_plug(struct task_struct *task)
>  {
>  }
>  
> -static inline void blk_schedule_flush_plug(struct task_struct *task)
> +static inline void blk_flush_queued_io(struct task_struct *tsk)
>  {
>  }
>  
> -
>  static inline bool blk_needs_flush_plug(struct task_struct *tsk)
>  {
>  	return false;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 10a8faa..eaf9eb3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3127,11 +3127,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
>  	if (!tsk->state || tsk_is_pi_blocked(tsk))
>  		return;
>  	/*
> -	 * If we are going to sleep and we have plugged IO queued,
> +	 * If we are going to sleep and we have queued IO,
>  	 * make sure to submit it to avoid deadlocks.
>  	 */
> -	if (blk_needs_flush_plug(tsk))
> -		blk_schedule_flush_plug(tsk);
> +	blk_flush_queued_io(tsk);
>  }
>  
>  asmlinkage __visible void __sched schedule(void)
> @@ -4718,7 +4717,7 @@ long __sched io_schedule_timeout(long timeout)
>  	long ret;
>  
>  	current->in_iowait = 1;
> -	blk_schedule_flush_plug(current);
> +	blk_flush_queued_io(current);
>  
>  	delayacct_blkio_start();
>  	rq = raw_rq();

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Ming Lei Oct. 17, 2015, 4:04 p.m. UTC | #2
On Thu, Oct 15, 2015 at 4:47 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> From: Mikulas Patocka <mpatocka@redhat.com>
>
> The block layer uses per-process bio list to avoid recursion in
> generic_make_request.  When generic_make_request is called recursively,
> the bio is added to current->bio_list and generic_make_request returns
> immediately.  The top-level instance of generic_make_request takes bios
> from current->bio_list and processes them.
>
> Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
> stacking drivers") created a workqueue for every bio set and code
> in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
> redirecting bios queued on current->bio_list to the workqueue if the
> system is low on memory.  However another deadlock (see below **) may
> happen, without any low memory condition, because generic_make_request
> is queuing bios to current->bio_list (rather than submitting them).
>
> Fix this deadlock by redirecting any bios on current->bio_list to the
> bio_set's rescue workqueue on every schedule call.  Consequently, when
> the process blocks on a mutex, the bios queued on current->bio_list are
> dispatched to independent workqueus and they can complete without
> waiting for the mutex to be available.

It isn't common to acquire mutex/semaphone inside .make_request()
or .request_fn(), so I am wondering it is good to reuse the rescuing
workqueue for this unusual case.

Also sometimes it can hurt performance by converting I/O submission
from one context into concurrent contexts of workqueue, especially
in case of sequential I/O, since plug & plug merge can't be used any
more.

>
> Also, now we can remove punt_bios_to_rescuer() and bio_alloc_bioset()'s
> calls to it because bio_alloc_bioset() will implicitly punt all bios on
> current->bio_list if it performs a blocking allocation.
>
> ** Here is the dm-snapshot deadlock that was observed:
>
> 1) Process A sends one-page read bio to the dm-snapshot target. The bio
> spans snapshot chunk boundary and so it is split to two bios by device
> mapper.
>
> 2) Device mapper creates the first sub-bio and sends it to the snapshot
> driver.
>
> 3) The function snapshot_map calls track_chunk (that allocates a structure
> dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
> the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
>
> 4) The remapped bio is submitted with generic_make_request, but it isn't
> issued - it is added to current->bio_list instead.
>
> 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
> chunk affected be the first remapped bio, it takes down_write(&s->lock)
> and then loops in __check_for_conflicting_io, waiting for
> dm_snap_tracked_chunk created in step 3) to be released.
>
> 6) Process A continues, it creates a second sub-bio for the rest of the
> original bio.
>
> 7) snapshot_map is called for this new bio, it waits on
> down_write(&s->lock) that is held by Process B (in step 5).
>
> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
> Cc: stable@vger.kernel.org
> ---
>  block/bio.c            | 75 +++++++++++++++++++-------------------------------
>  include/linux/blkdev.h | 19 +++++++++++--
>  kernel/sched/core.c    |  7 ++---
>  3 files changed, 48 insertions(+), 53 deletions(-)
>
> v3: improved patch header, changed sched/core.c block callout to blk_flush_queued_io(),
>     io_schedule_timeout() also updated to use blk_flush_queued_io(), blk_flush_bio_list()
>     now takes a @tsk argument rather than assuming current. v3 is now being submitted with
>     more feeling now that (ab)using the onstack plugging proved problematic, please see:
>     https://www.redhat.com/archives/dm-devel/2015-October/msg00087.html
>
> diff --git a/block/bio.c b/block/bio.c
> index ad3f276..99f5a2ad 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -354,35 +354,35 @@ static void bio_alloc_rescue(struct work_struct *work)
>         }
>  }
>
> -static void punt_bios_to_rescuer(struct bio_set *bs)
> +/**
> + * blk_flush_bio_list
> + * @tsk: task_struct whose bio_list must be flushed
> + *
> + * Pop bios queued on @tsk->bio_list and submit each of them to
> + * their rescue workqueue.
> + *
> + * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
> + * However, stacking drivers should use bio_set, so this shouldn't be
> + * an issue.
> + */
> +void blk_flush_bio_list(struct task_struct *tsk)
>  {
> -       struct bio_list punt, nopunt;
>         struct bio *bio;
> +       struct bio_list list = *tsk->bio_list;
> +       bio_list_init(tsk->bio_list);
>
> -       /*
> -        * In order to guarantee forward progress we must punt only bios that
> -        * were allocated from this bio_set; otherwise, if there was a bio on
> -        * there for a stacking driver higher up in the stack, processing it
> -        * could require allocating bios from this bio_set, and doing that from
> -        * our own rescuer would be bad.
> -        *
> -        * Since bio lists are singly linked, pop them all instead of trying to
> -        * remove from the middle of the list:
> -        */
> -
> -       bio_list_init(&punt);
> -       bio_list_init(&nopunt);
> -
> -       while ((bio = bio_list_pop(current->bio_list)))
> -               bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> -
> -       *current->bio_list = nopunt;
> -
> -       spin_lock(&bs->rescue_lock);
> -       bio_list_merge(&bs->rescue_list, &punt);
> -       spin_unlock(&bs->rescue_lock);
> +       while ((bio = bio_list_pop(&list))) {
> +               struct bio_set *bs = bio->bi_pool;
> +               if (unlikely(!bs)) {
> +                       bio_list_add(tsk->bio_list, bio);
> +                       continue;
> +               }
>
> -       queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +               spin_lock(&bs->rescue_lock);
> +               bio_list_add(&bs->rescue_list, bio);
> +               queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +               spin_unlock(&bs->rescue_lock);
> +       }

Not like rescuring path, schedule out can be quite frequent, and the
above change will switch to submit these I/Os from wq concurrently,
which might hurt performance for sequential I/O.

Also I am wondering why not submit these I/Os in 'current' context
just like what flush plug does?

>  }
>
>  /**
> @@ -422,7 +422,6 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>   */
>  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  {
> -       gfp_t saved_gfp = gfp_mask;
>         unsigned front_pad;
>         unsigned inline_vecs;
>         unsigned long idx = BIO_POOL_NONE;
> @@ -457,23 +456,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>                  * reserve.
>                  *
>                  * We solve this, and guarantee forward progress, with a rescuer
> -                * workqueue per bio_set. If we go to allocate and there are
> -                * bios on current->bio_list, we first try the allocation
> -                * without __GFP_WAIT; if that fails, we punt those bios we
> -                * would be blocking to the rescuer workqueue before we retry
> -                * with the original gfp_flags.
> +                * workqueue per bio_set. If an allocation would block (due to
> +                * __GFP_WAIT) the scheduler will first punt all bios on
> +                * current->bio_list to the rescuer workqueue.
>                  */
> -
> -               if (current->bio_list && !bio_list_empty(current->bio_list))
> -                       gfp_mask &= ~__GFP_WAIT;
> -
>                 p = mempool_alloc(bs->bio_pool, gfp_mask);
> -               if (!p && gfp_mask != saved_gfp) {
> -                       punt_bios_to_rescuer(bs);
> -                       gfp_mask = saved_gfp;
> -                       p = mempool_alloc(bs->bio_pool, gfp_mask);
> -               }
> -
>                 front_pad = bs->front_pad;
>                 inline_vecs = BIO_INLINE_VECS;
>         }
> @@ -486,12 +473,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>
>         if (nr_iovecs > inline_vecs) {
>                 bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> -               if (!bvl && gfp_mask != saved_gfp) {
> -                       punt_bios_to_rescuer(bs);
> -                       gfp_mask = saved_gfp;
> -                       bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> -               }
> -

Looks you touched rescuing path for bio allocation, and better to just
do one thing in one patch.

>                 if (unlikely(!bvl))
>                         goto err_free;
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 19c2e94..5dc7415 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1084,6 +1084,22 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
>                  !list_empty(&plug->cb_list));
>  }
>
> +extern void blk_flush_bio_list(struct task_struct *tsk);
> +
> +static inline void blk_flush_queued_io(struct task_struct *tsk)
> +{
> +       /*
> +        * Flush any queued bios to corresponding rescue threads.
> +        */
> +       if (tsk->bio_list && !bio_list_empty(tsk->bio_list))
> +               blk_flush_bio_list(tsk);
> +       /*
> +        * Flush any plugged IO that is queued.
> +        */
> +       if (blk_needs_flush_plug(tsk))
> +               blk_schedule_flush_plug(tsk);
> +}
> +
>  /*
>   * tag stuff
>   */
> @@ -1671,11 +1687,10 @@ static inline void blk_flush_plug(struct task_struct *task)
>  {
>  }
>
> -static inline void blk_schedule_flush_plug(struct task_struct *task)
> +static inline void blk_flush_queued_io(struct task_struct *tsk)
>  {
>  }
>
> -
>  static inline bool blk_needs_flush_plug(struct task_struct *tsk)
>  {
>         return false;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 10a8faa..eaf9eb3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3127,11 +3127,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
>         if (!tsk->state || tsk_is_pi_blocked(tsk))
>                 return;
>         /*
> -        * If we are going to sleep and we have plugged IO queued,
> +        * If we are going to sleep and we have queued IO,
>          * make sure to submit it to avoid deadlocks.
>          */
> -       if (blk_needs_flush_plug(tsk))
> -               blk_schedule_flush_plug(tsk);
> +       blk_flush_queued_io(tsk);
>  }
>
>  asmlinkage __visible void __sched schedule(void)
> @@ -4718,7 +4717,7 @@ long __sched io_schedule_timeout(long timeout)
>         long ret;
>
>         current->in_iowait = 1;
> -       blk_schedule_flush_plug(current);
> +       blk_flush_queued_io(current);
>
>         delayacct_blkio_start();
>         rq = raw_rq();
> --
> 2.3.8 (Apple Git-58)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer Oct. 20, 2015, 7:57 p.m. UTC | #3
On Sat, Oct 17 2015 at 12:04pm -0400,
Ming Lei <tom.leiming@gmail.com> wrote:

> On Thu, Oct 15, 2015 at 4:47 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> > From: Mikulas Patocka <mpatocka@redhat.com>
> >
> > The block layer uses per-process bio list to avoid recursion in
> > generic_make_request.  When generic_make_request is called recursively,
> > the bio is added to current->bio_list and generic_make_request returns
> > immediately.  The top-level instance of generic_make_request takes bios
> > from current->bio_list and processes them.
> >
> > Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
> > stacking drivers") created a workqueue for every bio set and code
> > in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
> > redirecting bios queued on current->bio_list to the workqueue if the
> > system is low on memory.  However another deadlock (see below **) may
> > happen, without any low memory condition, because generic_make_request
> > is queuing bios to current->bio_list (rather than submitting them).
> >
> > Fix this deadlock by redirecting any bios on current->bio_list to the
> > bio_set's rescue workqueue on every schedule call.  Consequently, when
> > the process blocks on a mutex, the bios queued on current->bio_list are
> > dispatched to independent workqueus and they can complete without
> > waiting for the mutex to be available.
> 
> It isn't common to acquire mutex/semaphone inside .make_request()
> or .request_fn(), so I am wondering it is good to reuse the rescuing
> workqueue for this unusual case.

Which specific locking are you concerned about?

> Also sometimes it can hurt performance by converting I/O submission
> from one context into concurrent contexts of workqueue, especially
> in case of sequential I/O, since plug & plug merge can't be used any
> more.

True, plug and plug merge wouldn't be usable but this recursive call to
generic_make_request isn't expected to be common.

This patch was to fix a relatively obscure bio spliting scenario that
fell out of the complexity of dm-snapshot.

> > diff --git a/block/bio.c b/block/bio.c
> > index ad3f276..99f5a2ad 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -354,35 +354,35 @@ static void bio_alloc_rescue(struct work_struct *work)
> >         }
> >  }
> >
> > -static void punt_bios_to_rescuer(struct bio_set *bs)
> > +/**
> > + * blk_flush_bio_list
> > + * @tsk: task_struct whose bio_list must be flushed
> > + *
> > + * Pop bios queued on @tsk->bio_list and submit each of them to
> > + * their rescue workqueue.
> > + *
> > + * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
> > + * However, stacking drivers should use bio_set, so this shouldn't be
> > + * an issue.
> > + */
> > +void blk_flush_bio_list(struct task_struct *tsk)
...
> > +       while ((bio = bio_list_pop(&list))) {
> > +               struct bio_set *bs = bio->bi_pool;
> > +               if (unlikely(!bs)) {
> > +                       bio_list_add(tsk->bio_list, bio);
> > +                       continue;
> > +               }
> >
> > -       queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > +               spin_lock(&bs->rescue_lock);
> > +               bio_list_add(&bs->rescue_list, bio);
> > +               queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > +               spin_unlock(&bs->rescue_lock);
> > +       }
> 
> Not like rescuring path, schedule out can be quite frequent, and the
> above change will switch to submit these I/Os from wq concurrently,
> which might hurt performance for sequential I/O.
> 
> Also I am wondering why not submit these I/Os in 'current' context
> just like what flush plug does?

Flush plug during schedule makes use of kblockd so I'm not sure what
you're referring to here.

> >  }
> >
> >  /**
> > @@ -422,7 +422,6 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
> >   */
> >  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >  {
> > -       gfp_t saved_gfp = gfp_mask;
> >         unsigned front_pad;
> >         unsigned inline_vecs;
> >         unsigned long idx = BIO_POOL_NONE;
> > @@ -457,23 +456,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >                  * reserve.
> >                  *
> >                  * We solve this, and guarantee forward progress, with a rescuer
> > -                * workqueue per bio_set. If we go to allocate and there are
> > -                * bios on current->bio_list, we first try the allocation
> > -                * without __GFP_WAIT; if that fails, we punt those bios we
> > -                * would be blocking to the rescuer workqueue before we retry
> > -                * with the original gfp_flags.
> > +                * workqueue per bio_set. If an allocation would block (due to
> > +                * __GFP_WAIT) the scheduler will first punt all bios on
> > +                * current->bio_list to the rescuer workqueue.
> >                  */
> > -
> > -               if (current->bio_list && !bio_list_empty(current->bio_list))
> > -                       gfp_mask &= ~__GFP_WAIT;
> > -
> >                 p = mempool_alloc(bs->bio_pool, gfp_mask);
> > -               if (!p && gfp_mask != saved_gfp) {
> > -                       punt_bios_to_rescuer(bs);
> > -                       gfp_mask = saved_gfp;
> > -                       p = mempool_alloc(bs->bio_pool, gfp_mask);
> > -               }
> > -
> >                 front_pad = bs->front_pad;
> >                 inline_vecs = BIO_INLINE_VECS;
> >         }
> > @@ -486,12 +473,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >
> >         if (nr_iovecs > inline_vecs) {
> >                 bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> > -               if (!bvl && gfp_mask != saved_gfp) {
> > -                       punt_bios_to_rescuer(bs);
> > -                       gfp_mask = saved_gfp;
> > -                       bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> > -               }
> > -
> 
> Looks you touched rescuing path for bio allocation, and better to just
> do one thing in one patch.

Yes, good point, I've split the patches, you can see the result in the 2
topmost commits in this branch:

http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

Definitely helped clean up the patches and make them more
approachable/reviewable.  Thanks for the suggestion.

I'll hold off on sending out v4 of these patches until I can better
undersatnd your concerns about using the rescue workqueue during
schedule.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mikulas Patocka Oct. 20, 2015, 8:03 p.m. UTC | #4
On Sun, 18 Oct 2015, Ming Lei wrote:

> On Thu, Oct 15, 2015 at 4:47 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> > From: Mikulas Patocka <mpatocka@redhat.com>
> >
> > The block layer uses per-process bio list to avoid recursion in
> > generic_make_request.  When generic_make_request is called recursively,
> > the bio is added to current->bio_list and generic_make_request returns
> > immediately.  The top-level instance of generic_make_request takes bios
> > from current->bio_list and processes them.
> >
> > Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
> > stacking drivers") created a workqueue for every bio set and code
> > in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
> > redirecting bios queued on current->bio_list to the workqueue if the
> > system is low on memory.  However another deadlock (see below **) may
> > happen, without any low memory condition, because generic_make_request
> > is queuing bios to current->bio_list (rather than submitting them).
> >
> > Fix this deadlock by redirecting any bios on current->bio_list to the
> > bio_set's rescue workqueue on every schedule call.  Consequently, when
> > the process blocks on a mutex, the bios queued on current->bio_list are
> > dispatched to independent workqueus and they can complete without
> > waiting for the mutex to be available.
> 
> It isn't common to acquire mutex/semaphone inside .make_request()
> or .request_fn(), so I am wondering it is good to reuse the rescuing
> workqueue for this unusual case.

Some drivers (dm-snapshot, dm-thin) do acquire a mutex in .make_requests() 
for every bio. It wouldn't be practical to convert them to not acquire the 
mutex (and it would also degrade performance of these drivers, if they had 
to offload every bio to a worker thread that can acquire the mutex).

You can't acquire a mutex in .request_fn() because .request_fn() is called 
with queue spinlock held.

> Also sometimes it can hurt performance by converting I/O submission
> from one context into concurrent contexts of workqueue, especially
> in case of sequential I/O, since plug & plug merge can't be used any
> more.

You can add blk_start_plug/blk_finish_plug to the function 
bio_alloc_rescue. That would be reasonable to make sure that the requests 
are merged even when they are offloaded to rescue thread.

> > -       queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > +               spin_lock(&bs->rescue_lock);
> > +               bio_list_add(&bs->rescue_list, bio);
> > +               queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > +               spin_unlock(&bs->rescue_lock);
> > +       }
> 
> Not like rescuring path, schedule out can be quite frequent, and the
> above change will switch to submit these I/Os from wq concurrently,
> which might hurt performance for sequential I/O.
> 
> Also I am wondering why not submit these I/Os in 'current' context
> just like what flush plug does?

Processing requests doesn't block (they only take the queue spinlock).

Processing bios can block (they can take other mutexes or semaphores), so 
processing them from the schedule hook is impossible - the bio's 
make_request function could attempt to take some lock that is already 
held. So - we must offload the bios to a separate workqueue.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Ming Lei Oct. 21, 2015, 4:38 p.m. UTC | #5
On Wed, Oct 21, 2015 at 4:03 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Sun, 18 Oct 2015, Ming Lei wrote:
>
>> On Thu, Oct 15, 2015 at 4:47 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> > From: Mikulas Patocka <mpatocka@redhat.com>
>> >
>> > The block layer uses per-process bio list to avoid recursion in
>> > generic_make_request.  When generic_make_request is called recursively,
>> > the bio is added to current->bio_list and generic_make_request returns
>> > immediately.  The top-level instance of generic_make_request takes bios
>> > from current->bio_list and processes them.
>> >
>> > Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
>> > stacking drivers") created a workqueue for every bio set and code
>> > in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
>> > redirecting bios queued on current->bio_list to the workqueue if the
>> > system is low on memory.  However another deadlock (see below **) may
>> > happen, without any low memory condition, because generic_make_request
>> > is queuing bios to current->bio_list (rather than submitting them).
>> >
>> > Fix this deadlock by redirecting any bios on current->bio_list to the
>> > bio_set's rescue workqueue on every schedule call.  Consequently, when
>> > the process blocks on a mutex, the bios queued on current->bio_list are
>> > dispatched to independent workqueus and they can complete without
>> > waiting for the mutex to be available.
>>
>> It isn't common to acquire mutex/semaphone inside .make_request()
>> or .request_fn(), so I am wondering it is good to reuse the rescuing
>> workqueue for this unusual case.
>
> Some drivers (dm-snapshot, dm-thin) do acquire a mutex in .make_requests()
> for every bio. It wouldn't be practical to convert them to not acquire the
> mutex (and it would also degrade performance of these drivers, if they had
> to offload every bio to a worker thread that can acquire the mutex).

Lots of drivers handle I/O in that way, and this way makes AIO not possible
basically for dm-snapshot.

>
> You can't acquire a mutex in .request_fn() because .request_fn() is called
> with queue spinlock held.

Lots of drivers are dropping the lock in the request_fn(), but I admit
.request_fn shouldn't be blocked.

>
>> Also sometimes it can hurt performance by converting I/O submission
>> from one context into concurrent contexts of workqueue, especially
>> in case of sequential I/O, since plug & plug merge can't be used any
>> more.
>
> You can add blk_start_plug/blk_finish_plug to the function
> bio_alloc_rescue. That would be reasonable to make sure that the requests
> are merged even when they are offloaded to rescue thread.

The IOs submitted from each wq context becomes not contineous any
more, so plug merge isn't doable, not mention the extra context switch
cost.

This kind of cost can be introduced for all bio devices just for handling
the unusual case, fair?

>
>> > -       queue_work(bs->rescue_workqueue, &bs->rescue_work);
>> > +               spin_lock(&bs->rescue_lock);
>> > +               bio_list_add(&bs->rescue_list, bio);
>> > +               queue_work(bs->rescue_workqueue, &bs->rescue_work);
>> > +               spin_unlock(&bs->rescue_lock);
>> > +       }
>>
>> Not like rescuring path, schedule out can be quite frequent, and the
>> above change will switch to submit these I/Os from wq concurrently,
>> which might hurt performance for sequential I/O.
>>
>> Also I am wondering why not submit these I/Os in 'current' context
>> just like what flush plug does?
>
> Processing requests doesn't block (they only take the queue spinlock).
>
> Processing bios can block (they can take other mutexes or semaphores), so
> processing them from the schedule hook is impossible - the bio's
> make_request function could attempt to take some lock that is already
> held. So - we must offload the bios to a separate workqueue.

Yes, so better to just handle dm-snapshot in this way.


Thanks,
Ming Lei

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mikulas Patocka Oct. 21, 2015, 9:49 p.m. UTC | #6
On Thu, 22 Oct 2015, Ming Lei wrote:

> > Some drivers (dm-snapshot, dm-thin) do acquire a mutex in .make_requests()
> > for every bio. It wouldn't be practical to convert them to not acquire the
> > mutex (and it would also degrade performance of these drivers, if they had
> > to offload every bio to a worker thread that can acquire the mutex).
> 
> Lots of drivers handle I/O in that way, and this way makes AIO not possible
> basically for dm-snapshot.

It doesn't have to do anything with asynchronous I/O. Of course you can do 
asynchronous I/O on dm-snapshot.

> >> Also sometimes it can hurt performance by converting I/O submission
> >> from one context into concurrent contexts of workqueue, especially
> >> in case of sequential I/O, since plug & plug merge can't be used any
> >> more.
> >
> > You can add blk_start_plug/blk_finish_plug to the function
> > bio_alloc_rescue. That would be reasonable to make sure that the requests
> > are merged even when they are offloaded to rescue thread.
> 
> The IOs submitted from each wq context becomes not contineous any
> more, so plug merge isn't doable, not mention the extra context switch
> cost.

If the requests are mergeable, blk_start_plug/blk_finish_plug will merge 
them, if not, it won't.

> This kind of cost can be introduced for all bio devices just for handling
> the unusual case, fair?

Offloading bios to a worker thread when the make_request_fn function 
blocks is required to avoid a deadlock (BTW. the deadlock became more 
common in the kernel 4.3 due to unrestricted size of bios).

The bio list current->bio_list introduces a false locking dependency - 
completion of a bio depends on completion of other bios on 
current->bio_list directed for different devices, thus it could create 
circular dependency resulting in deadlock.

To avoid the circular dependency, each bio must be offloaded to a specific 
workqueue, so that completion of bio for device A no longer depends on 
completion of another bio for device B.

> >> > -       queue_work(bs->rescue_workqueue, &bs->rescue_work);
> >> > +               spin_lock(&bs->rescue_lock);
> >> > +               bio_list_add(&bs->rescue_list, bio);
> >> > +               queue_work(bs->rescue_workqueue, &bs->rescue_work);
> >> > +               spin_unlock(&bs->rescue_lock);
> >> > +       }
> >>
> >> Not like rescuring path, schedule out can be quite frequent, and the
> >> above change will switch to submit these I/Os from wq concurrently,
> >> which might hurt performance for sequential I/O.
> >>
> >> Also I am wondering why not submit these I/Os in 'current' context
> >> just like what flush plug does?
> >
> > Processing requests doesn't block (they only take the queue spinlock).
> >
> > Processing bios can block (they can take other mutexes or semaphores), so
> > processing them from the schedule hook is impossible - the bio's
> > make_request function could attempt to take some lock that is already
> > held. So - we must offload the bios to a separate workqueue.
> 
> Yes, so better to just handle dm-snapshot in this way.

All dm targets and almost all other bio-processing drivers can block in 
the make_request_fn function (for example, they may block when allocating 
from a mempool).

Mikulas

> Thanks,
> Ming Lei
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Ming Lei Oct. 22, 2015, 1:53 a.m. UTC | #7
On Thu, Oct 22, 2015 at 5:49 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Thu, 22 Oct 2015, Ming Lei wrote:
>
>> > Some drivers (dm-snapshot, dm-thin) do acquire a mutex in .make_requests()
>> > for every bio. It wouldn't be practical to convert them to not acquire the
>> > mutex (and it would also degrade performance of these drivers, if they had
>> > to offload every bio to a worker thread that can acquire the mutex).
>>
>> Lots of drivers handle I/O in that way, and this way makes AIO not possible
>> basically for dm-snapshot.
>
> It doesn't have to do anything with asynchronous I/O. Of course you can do
> asynchronous I/O on dm-snapshot.

But it is actually sync I/O because submit_bio() will block on the semaphore,
so this behavious isn't acceptable from AIO view. That should be why there
are very few drivers to use semaphoe/mutex in I/O path.

I have found zram has removed its rwlock from its I/O path in v3.17.

>
>> >> Also sometimes it can hurt performance by converting I/O submission
>> >> from one context into concurrent contexts of workqueue, especially
>> >> in case of sequential I/O, since plug & plug merge can't be used any
>> >> more.
>> >
>> > You can add blk_start_plug/blk_finish_plug to the function
>> > bio_alloc_rescue. That would be reasonable to make sure that the requests
>> > are merged even when they are offloaded to rescue thread.
>>
>> The IOs submitted from each wq context becomes not contineous any
>> more, so plug merge isn't doable, not mention the extra context switch
>> cost.
>
> If the requests are mergeable, blk_start_plug/blk_finish_plug will merge
> them, if not, it won't.

Plug is per task, and these requests are mergeable previously from one
same task, but after you submit them from different context via wq, it
becomes not mergeable any more via plug. Also context switch isn't free.

>
>> This kind of cost can be introduced for all bio devices just for handling
>> the unusual case, fair?
>
> Offloading bios to a worker thread when the make_request_fn function
> blocks is required to avoid a deadlock (BTW. the deadlock became more
> common in the kernel 4.3 due to unrestricted size of bios).
>
> The bio list current->bio_list introduces a false locking dependency -
> completion of a bio depends on completion of other bios on
> current->bio_list directed for different devices, thus it could create
> circular dependency resulting in deadlock.
>
> To avoid the circular dependency, each bio must be offloaded to a specific
> workqueue, so that completion of bio for device A no longer depends on
> completion of another bio for device B.

You can do that for dm-snapshot only by using per-device thread/wq or
whatever, but it isn't good to make other drivers involved by reusing
rescure wq.

>
>> >> > -       queue_work(bs->rescue_workqueue, &bs->rescue_work);
>> >> > +               spin_lock(&bs->rescue_lock);
>> >> > +               bio_list_add(&bs->rescue_list, bio);
>> >> > +               queue_work(bs->rescue_workqueue, &bs->rescue_work);
>> >> > +               spin_unlock(&bs->rescue_lock);
>> >> > +       }
>> >>
>> >> Not like rescuring path, schedule out can be quite frequent, and the
>> >> above change will switch to submit these I/Os from wq concurrently,
>> >> which might hurt performance for sequential I/O.
>> >>
>> >> Also I am wondering why not submit these I/Os in 'current' context
>> >> just like what flush plug does?
>> >
>> > Processing requests doesn't block (they only take the queue spinlock).
>> >
>> > Processing bios can block (they can take other mutexes or semaphores), so
>> > processing them from the schedule hook is impossible - the bio's
>> > make_request function could attempt to take some lock that is already
>> > held. So - we must offload the bios to a separate workqueue.
>>
>> Yes, so better to just handle dm-snapshot in this way.
>
> All dm targets and almost all other bio-processing drivers can block in
> the make_request_fn function (for example, they may block when allocating
> from a mempool).

blk_queue_bio() and its mq pairs can block in get_request() too, and that is
acceptable, but it isn't friendly/acceptable for AIO to use mutex/semaphoe
in I/O path which can block quite frequently.

--
Ming Lei

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Patch
diff mbox

diff --git a/block/bio.c b/block/bio.c
index ad3f276..99f5a2ad 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -354,35 +354,35 @@  static void bio_alloc_rescue(struct work_struct *work)
 	}
 }
 
-static void punt_bios_to_rescuer(struct bio_set *bs)
+/**
+ * blk_flush_bio_list
+ * @tsk: task_struct whose bio_list must be flushed
+ *
+ * Pop bios queued on @tsk->bio_list and submit each of them to
+ * their rescue workqueue.
+ *
+ * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
+ * However, stacking drivers should use bio_set, so this shouldn't be
+ * an issue.
+ */
+void blk_flush_bio_list(struct task_struct *tsk)
 {
-	struct bio_list punt, nopunt;
 	struct bio *bio;
+	struct bio_list list = *tsk->bio_list;
+	bio_list_init(tsk->bio_list);
 
-	/*
-	 * In order to guarantee forward progress we must punt only bios that
-	 * were allocated from this bio_set; otherwise, if there was a bio on
-	 * there for a stacking driver higher up in the stack, processing it
-	 * could require allocating bios from this bio_set, and doing that from
-	 * our own rescuer would be bad.
-	 *
-	 * Since bio lists are singly linked, pop them all instead of trying to
-	 * remove from the middle of the list:
-	 */
-
-	bio_list_init(&punt);
-	bio_list_init(&nopunt);
-
-	while ((bio = bio_list_pop(current->bio_list)))
-		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
-
-	*current->bio_list = nopunt;
-
-	spin_lock(&bs->rescue_lock);
-	bio_list_merge(&bs->rescue_list, &punt);
-	spin_unlock(&bs->rescue_lock);
+	while ((bio = bio_list_pop(&list))) {
+		struct bio_set *bs = bio->bi_pool;
+		if (unlikely(!bs)) {
+			bio_list_add(tsk->bio_list, bio);
+			continue;
+		}
 
-	queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		spin_lock(&bs->rescue_lock);
+		bio_list_add(&bs->rescue_list, bio);
+		queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		spin_unlock(&bs->rescue_lock);
+	}
 }
 
 /**
@@ -422,7 +422,6 @@  static void punt_bios_to_rescuer(struct bio_set *bs)
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
-	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	unsigned long idx = BIO_POOL_NONE;
@@ -457,23 +456,11 @@  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		 * reserve.
 		 *
 		 * We solve this, and guarantee forward progress, with a rescuer
-		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_WAIT; if that fails, we punt those bios we
-		 * would be blocking to the rescuer workqueue before we retry
-		 * with the original gfp_flags.
+		 * workqueue per bio_set. If an allocation would block (due to
+		 * __GFP_WAIT) the scheduler will first punt all bios on
+		 * current->bio_list to the rescuer workqueue.
 		 */
-
-		if (current->bio_list && !bio_list_empty(current->bio_list))
-			gfp_mask &= ~__GFP_WAIT;
-
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
-		if (!p && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			p = mempool_alloc(bs->bio_pool, gfp_mask);
-		}
-
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
 	}
@@ -486,12 +473,6 @@  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 
 	if (nr_iovecs > inline_vecs) {
 		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		if (!bvl && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		}
-
 		if (unlikely(!bvl))
 			goto err_free;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 19c2e94..5dc7415 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1084,6 +1084,22 @@  static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 		 !list_empty(&plug->cb_list));
 }
 
+extern void blk_flush_bio_list(struct task_struct *tsk);
+
+static inline void blk_flush_queued_io(struct task_struct *tsk)
+{
+	/*
+	 * Flush any queued bios to corresponding rescue threads.
+	 */
+	if (tsk->bio_list && !bio_list_empty(tsk->bio_list))
+		blk_flush_bio_list(tsk);
+	/*
+	 * Flush any plugged IO that is queued.
+	 */
+	if (blk_needs_flush_plug(tsk))
+		blk_schedule_flush_plug(tsk);
+}
+
 /*
  * tag stuff
  */
@@ -1671,11 +1687,10 @@  static inline void blk_flush_plug(struct task_struct *task)
 {
 }
 
-static inline void blk_schedule_flush_plug(struct task_struct *task)
+static inline void blk_flush_queued_io(struct task_struct *tsk)
 {
 }
 
-
 static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 {
 	return false;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 10a8faa..eaf9eb3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3127,11 +3127,10 @@  static inline void sched_submit_work(struct task_struct *tsk)
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
 		return;
 	/*
-	 * If we are going to sleep and we have plugged IO queued,
+	 * If we are going to sleep and we have queued IO,
 	 * make sure to submit it to avoid deadlocks.
 	 */
-	if (blk_needs_flush_plug(tsk))
-		blk_schedule_flush_plug(tsk);
+	blk_flush_queued_io(tsk);
 }
 
 asmlinkage __visible void __sched schedule(void)
@@ -4718,7 +4717,7 @@  long __sched io_schedule_timeout(long timeout)
 	long ret;
 
 	current->in_iowait = 1;
-	blk_schedule_flush_plug(current);
+	blk_flush_queued_io(current);
 
 	delayacct_blkio_start();
 	rq = raw_rq();