
[RFC] block: fix blk_queue_split() resource exhaustion

Message ID 1466583730-28595-1-git-send-email-lars.ellenberg@linbit.com (mailing list archive)
State Deferred, archived
Delegated to: Mike Snitzer

Commit Message

Lars Ellenberg June 22, 2016, 8:22 a.m. UTC
For a long time, generic_make_request() converts recursion into
iteration by queuing recursive arguments on current->bio_list.

This is convenient for stacking drivers:
the top-most driver takes the originally submitted bio
and re-submits a re-mapped version of it, one or more clones,
or one or more newly allocated bios to its backend(s), which
are then simply processed in turn; each can again queue
more "backend-bios" until we reach the bottom of the driver stack
and actually dispatch to the real backend device.

Any stacking driver ->make_request_fn() could expect that,
once it returns, any backend-bios it submitted via recursive calls
to generic_make_request() would now be processed and dispatched, before
the current task would call into this driver again.
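
For reference, this is roughly what that conversion looks like: a condensed
sketch of the generic_make_request() of that era, with
generic_make_request_checks(), blk_queue_enter()/blk_queue_exit(), error
paths and the returned cookie omitted.

	blk_qc_t generic_make_request(struct bio *bio)
	{
		struct bio_list bio_list_on_stack;

		/* Recursive call: an iteration further up this task's
		 * stack is already running; just queue the bio at the
		 * tail and let that iteration pick it up. */
		if (current->bio_list) {
			bio_list_add(current->bio_list, bio);
			return BLK_QC_T_NONE;
		}

		/* Top-level call: become the iterator for this task. */
		bio_list_init(&bio_list_on_stack);
		current->bio_list = &bio_list_on_stack;
		do {
			struct request_queue *q = bdev_get_queue(bio->bi_bdev);

			q->make_request_fn(q, bio); /* may queue more bios */
			bio = bio_list_pop(current->bio_list);
		} while (bio);
		current->bio_list = NULL; /* deactivate */
		return BLK_QC_T_NONE;
	}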

This is changed by commit
  54efd50 block: make generic_make_request handle arbitrarily sized bios

Drivers may call blk_queue_split() inside their ->make_request_fn(),
which may split the current bio into a front-part to be dealt with
immediately, and a remainder-part, which may need to be split even
further. That remainder-part will simply also be pushed to
current->bio_list, and would end up being head-of-queue, in front
of any backend-bios the current make_request_fn() might submit during
processing of the front-part.

Which means the current task would immediately end up back in the same
make_request_fn() of the same driver again, before any of its backend
bios have even been processed.

This can lead to resource starvation deadlock.
Drivers could avoid this by learning to not need blk_queue_split(),
or by submitting their backend bios in a different context (dedicated
kernel thread, work_queue context, ...). Or by playing funny re-ordering
games with entries on current->bio_list.

Instead, I suggest distinguishing between recursive calls to
generic_make_request() and pushing back the remainder part in
blk_queue_split(), by pointing current->bio_lists to a
	struct recursion_to_iteration_bio_lists {
		struct bio_list recursion;
		struct bio_list remainder;
	}

To have all bios targeted at drivers lower in the stack processed
before the next piece of a bio targeted at a higher level,
bios queued as a result of recursion will continue to be processed
in FIFO order, for as long as any are available.
Pushed-back bio parts resulting from blk_queue_split() will be processed
in LIFO order, one by one, whenever the recursion list becomes empty.
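
The heart of the resulting drain order, condensed from the
generic_make_request() hunk of the patch below ('ret' and 'q' as in the
full function):

	do {
		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

		ret = q->make_request_fn(q, bio); /* may queue bios on either list */

		/* drain recursively submitted backend-bios first, FIFO ... */
		bio = bio_list_pop(&current->bio_lists->recursion);
		/* ... then one pushed-back remainder part at a time, LIFO */
		if (!bio)
			bio = bio_list_pop(&current->bio_lists->remainder);
	} while (bio);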

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
---

This is not a theoretical problem.
At least in DRBD, with an unfortunately high IO concurrency wrt. the
"max-buffers" setting, we have a reproducible deadlock without this patch.

I'm unsure if it could cause problems in md raid in some corner cases
or when a rebuild/scrub/... starts in a "bad" moment.

It also may increase "stress" on any involved bio_set,
because it may increase the number of bios that are allocated
before the first of them is actually processed, causing more
frequent calls of punt_bios_to_rescuer().
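
For reference, a condensed sketch of what punt_bios_to_rescuer() does today
(the block/bio.c hunk below merely points it at the new recursion list);
the more bios pile up on the current task's list before the first is
processed, the more likely allocations from the same bio_set take this path:

	static void punt_bios_to_rescuer(struct bio_set *bs)
	{
		struct bio_list punt, nopunt;
		struct bio *bio;

		bio_list_init(&punt);
		bio_list_init(&nopunt);

		/* Move every queued bio that was allocated from this
		 * bio_set over to a per-bio_set rescuer workqueue, so a
		 * blocking mempool_alloc() on the same set can make
		 * forward progress. */
		while ((bio = bio_list_pop(current->bio_list)))
			bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
		*current->bio_list = nopunt;

		spin_lock(&bs->rescue_lock);
		bio_list_merge(&bs->rescue_list, &punt);
		spin_unlock(&bs->rescue_lock);

		queue_work(bs->rescue_workqueue, &bs->rescue_work);
	}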

Alternatively,
a simple one-line change to generic_make_request() would also "fix" it,
by processing all recursive bios in LIFO order.
But that may change the order in which bios reach the "physical" layer,
in case they are in fact split into many smaller pieces, for example in
a striping setup, which may have other performance implications.

 | diff --git a/block/blk-core.c b/block/blk-core.c
 | index 2475b1c7..a5623f6 100644
 | --- a/block/blk-core.c
 | +++ b/block/blk-core.c
 | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 |  	 * should be added at the tail
 |  	 */
 |  	if (current->bio_list) {
 | -		bio_list_add(current->bio_list, bio);
 | +		bio_list_add_head(current->bio_list, bio);
 |  		goto out;
 |  	}

Any fix would need to go to stable 4.3.x and up.
This patch does apply as is to 4.5 and up,
with a trivial fixup (or a cherry-pick of cda2264) to 4.4,
and with some more (also trivial) fixup to 4.3 because of
GFP_WAIT -> GFP_RECLAIM and comment rewrap.

Any input?
How should I proceed?

---
 block/bio.c               | 14 +++++++-------
 block/blk-core.c          | 32 +++++++++++++++++---------------
 block/blk-merge.c         |  3 ++-
 drivers/md/bcache/btree.c | 12 ++++++------
 drivers/md/dm-bufio.c     |  2 +-
 drivers/md/md.h           |  7 +++++++
 drivers/md/raid1.c        |  5 ++---
 drivers/md/raid10.c       |  5 ++---
 include/linux/bio.h       | 11 +++++++++++
 include/linux/sched.h     |  4 ++--
 10 files changed, 57 insertions(+), 38 deletions(-)

Comments

Ming Lei June 24, 2016, 11:36 a.m. UTC | #1
On Wed, Jun 22, 2016 at 4:22 PM, Lars Ellenberg
<lars.ellenberg@linbit.com> wrote:
> For a long time, generic_make_request() converts recursion into
> iteration by queuing recursive arguments on current->bio_list.
>
> This is convenient for stacking drivers:
> the top-most driver takes the originally submitted bio
> and re-submits a re-mapped version of it, one or more clones,
> or one or more newly allocated bios to its backend(s), which
> are then simply processed in turn; each can again queue
> more "backend-bios" until we reach the bottom of the driver stack
> and actually dispatch to the real backend device.
>
> Any stacking driver ->make_request_fn() could expect that,
> once it returns, any backend-bios it submitted via recursive calls
> to generic_make_request() would now be processed and dispatched, before
> the current task would call into this driver again.
>
> This is changed by commit
>   54efd50 block: make generic_make_request handle arbitrarily sized bios
>
> Drivers may call blk_queue_split() inside their ->make_request_fn(),
> which may split the current bio into a front-part to be dealt with
> immediately, and a remainder-part, which may need to be split even
> further. That remainder-part will simply also be pushed to
> current->bio_list, and would end up being head-of-queue, in front
> of any backend-bios the current make_request_fn() might submit during
> processing of the front-part.
>
> Which means the current task would immediately end up back in the same
> make_request_fn() of the same driver again, before any of its backend
> bios have even been processed.
>
> This can lead to resource starvation deadlock.

Could you explain a bit what the resource is and how it is
exhausted?

> Drivers could avoid this by learning to not need blk_queue_split(),
> or by submitting their backend bios in a different context (dedicated
> kernel thread, work_queue context, ...). Or by playing funny re-ordering
> games with entries on current->bio_list.
>
> Instead, I suggest distinguishing between recursive calls to
> generic_make_request() and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
>         struct recursion_to_iteration_bio_lists {
>                 struct bio_list recursion;
>                 struct bio_list remainder;
>         }
>
> To have all bios targeted at drivers lower in the stack processed
> before the next piece of a bio targeted at a higher level,
> bios queued as a result of recursion will continue to be processed
> in FIFO order, for as long as any are available.
> Pushed-back bio parts resulting from blk_queue_split() will be processed
> in LIFO order, one by one, whenever the recursion list becomes empty.
>
> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
> ---
>
> This is not a theoretical problem.
> At least in DRBD, with an unfortunately high IO concurrency wrt. the
> "max-buffers" setting, we have a reproducible deadlock without this patch.

Is there any log about the deadlock? And is there any lockdep warning
if it is enabled?

Thanks,

>
> I'm unsure if it could cause problems in md raid in some corner cases
> or when a rebuild/scrub/... starts in a "bad" moment.
>
> It also may increase "stress" on any involved bio_set,
> because it may increase the number of bios that are allocated
> before the first of them is actually processed, causing more
> frequent calls of punt_bios_to_rescuer().
>
> Alternatively,
> a simple one-line change to generic_make_request() would also "fix" it,
> by processing all recursive bios in LIFO order.
> But that may change the order in which bios reach the "physical" layer,
> in case they are in fact split into many smaller pieces, for example in
> a striping setup, which may have other performance implications.
>
>  | diff --git a/block/blk-core.c b/block/blk-core.c
>  | index 2475b1c7..a5623f6 100644
>  | --- a/block/blk-core.c
>  | +++ b/block/blk-core.c
>  | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>  |       * should be added at the tail
>  |       */
>  |      if (current->bio_list) {
>  | -            bio_list_add(current->bio_list, bio);
>  | +            bio_list_add_head(current->bio_list, bio);
>  |              goto out;
>  |      }
>
> Any fix would need to go to stable 4.3.x and up.
> This patch does apply as is to 4.5 and up,
> with a trivial fixup (or a cherry-pick of cda2264) to 4.4,
> and with some more (also trivial) fixup to 4.3 because of
> GFP_WAIT -> GFP_RECLAIM and comment rewrap.
>
> Any input?
> How should I proceed?
>
> ---
>  block/bio.c               | 14 +++++++-------
>  block/blk-core.c          | 32 +++++++++++++++++---------------
>  block/blk-merge.c         |  3 ++-
>  drivers/md/bcache/btree.c | 12 ++++++------
>  drivers/md/dm-bufio.c     |  2 +-
>  drivers/md/md.h           |  7 +++++++
>  drivers/md/raid1.c        |  5 ++---
>  drivers/md/raid10.c       |  5 ++---
>  include/linux/bio.h       | 11 +++++++++++
>  include/linux/sched.h     |  4 ++--
>  10 files changed, 57 insertions(+), 38 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 0e4aa42..2ffcea0 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -368,10 +368,10 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>         bio_list_init(&punt);
>         bio_list_init(&nopunt);
>
> -       while ((bio = bio_list_pop(current->bio_list)))
> +       while ((bio = bio_list_pop(&current->bio_lists->recursion)))
>                 bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
>
> -       *current->bio_list = nopunt;
> +       current->bio_lists->recursion = nopunt;
>
>         spin_lock(&bs->rescue_lock);
>         bio_list_merge(&bs->rescue_list, &punt);
> @@ -453,13 +453,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>                  *
>                  * We solve this, and guarantee forward progress, with a rescuer
>                  * workqueue per bio_set. If we go to allocate and there are
> -                * bios on current->bio_list, we first try the allocation
> -                * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> -                * bios we would be blocking to the rescuer workqueue before
> -                * we retry with the original gfp_flags.
> +                * bios on current->bio_lists->recursion, we first try the
> +                * allocation without __GFP_DIRECT_RECLAIM; if that fails, we
> +                * punt those bios we would be blocking to the rescuer
> +                * workqueue before we retry with the original gfp_flags.
>                  */
>
> -               if (current->bio_list && !bio_list_empty(current->bio_list))
> +               if (current->bio_lists && !bio_list_empty(&current->bio_lists->recursion))
>                         gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>
>                 p = mempool_alloc(bs->bio_pool, gfp_mask);
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 2475b1c7..f03ff4c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2031,7 +2031,7 @@ end_io:
>   */
>  blk_qc_t generic_make_request(struct bio *bio)
>  {
> -       struct bio_list bio_list_on_stack;
> +       struct recursion_to_iteration_bio_lists bio_lists_on_stack;
>         blk_qc_t ret = BLK_QC_T_NONE;
>
>         if (!generic_make_request_checks(bio))
> @@ -2040,15 +2040,18 @@ blk_qc_t generic_make_request(struct bio *bio)
>         /*
>          * We only want one ->make_request_fn to be active at a time, else
>          * stack usage with stacked devices could be a problem.  So use
> -        * current->bio_list to keep a list of requests submited by a
> -        * make_request_fn function.  current->bio_list is also used as a
> +        * current->bio_lists to keep a list of requests submited by a
> +        * make_request_fn function.  current->bio_lists is also used as a
>          * flag to say if generic_make_request is currently active in this
>          * task or not.  If it is NULL, then no make_request is active.  If
>          * it is non-NULL, then a make_request is active, and new requests
> -        * should be added at the tail
> +        * should be added at the tail of current->bio_lists->recursion;
> +        * remainder bios resulting from a call to bio_queue_split() from
> +        * within ->make_request_fn() should be added to the head of
> +        * current->bio_lists->remainder.
>          */
> -       if (current->bio_list) {
> -               bio_list_add(current->bio_list, bio);
> +       if (current->bio_lists) {
> +               bio_list_add(&current->bio_lists->recursion, bio);
>                 goto out;
>         }
>
> @@ -2057,7 +2060,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>          * Before entering the loop, bio->bi_next is NULL (as all callers
>          * ensure that) so we have a list with a single bio.
>          * We pretend that we have just taken it off a longer list, so
> -        * we assign bio_list to a pointer to the bio_list_on_stack,
> +        * we assign bio_list to a pointer to the bio_lists_on_stack,
>          * thus initialising the bio_list of new bios to be
>          * added.  ->make_request() may indeed add some more bios
>          * through a recursive call to generic_make_request.  If it
> @@ -2067,8 +2070,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>          * bio_list, and call into ->make_request() again.
>          */
>         BUG_ON(bio->bi_next);
> -       bio_list_init(&bio_list_on_stack);
> -       current->bio_list = &bio_list_on_stack;
> +       bio_list_init(&bio_lists_on_stack.recursion);
> +       bio_list_init(&bio_lists_on_stack.remainder);
> +       current->bio_lists = &bio_lists_on_stack;
>         do {
>                 struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>
> @@ -2076,16 +2080,14 @@ blk_qc_t generic_make_request(struct bio *bio)
>                         ret = q->make_request_fn(q, bio);
>
>                         blk_queue_exit(q);
> -
> -                       bio = bio_list_pop(current->bio_list);
>                 } else {
> -                       struct bio *bio_next = bio_list_pop(current->bio_list);
> -
>                         bio_io_error(bio);
> -                       bio = bio_next;
>                 }
> +               bio = bio_list_pop(&current->bio_lists->recursion);
> +               if (!bio)
> +                       bio = bio_list_pop(&current->bio_lists->remainder);
>         } while (bio);
> -       current->bio_list = NULL; /* deactivate */
> +       current->bio_lists = NULL; /* deactivate */
>
>  out:
>         return ret;
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 2613531..8da0c22 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>         struct bio *split, *res;
>         unsigned nsegs;
>
> +       BUG_ON(!current->bio_lists);
>         if ((*bio)->bi_rw & REQ_DISCARD)
>                 split = blk_bio_discard_split(q, *bio, bs, &nsegs);
>         else if ((*bio)->bi_rw & REQ_WRITE_SAME)
> @@ -190,7 +191,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>
>                 bio_chain(split, *bio);
>                 trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> -               generic_make_request(*bio);
> +               bio_list_add_head(&current->bio_lists->remainder, *bio);
>                 *bio = split;
>         }
>  }
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index eab505e..eb0b41a 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -450,7 +450,7 @@ void __bch_btree_node_write(struct btree *b, struct closure *parent)
>
>         trace_bcache_btree_write(b);
>
> -       BUG_ON(current->bio_list);
> +       BUG_ON(current->bio_lists);
>         BUG_ON(b->written >= btree_blocks(b));
>         BUG_ON(b->written && !i->keys);
>         BUG_ON(btree_bset_first(b)->seq != i->seq);
> @@ -544,7 +544,7 @@ static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
>
>         /* Force write if set is too big */
>         if (set_bytes(i) > PAGE_SIZE - 48 &&
> -           !current->bio_list)
> +           !current->bio_lists)
>                 bch_btree_node_write(b, NULL);
>  }
>
> @@ -889,7 +889,7 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
>  {
>         struct btree *b;
>
> -       BUG_ON(current->bio_list);
> +       BUG_ON(current->bio_lists);
>
>         lockdep_assert_held(&c->bucket_lock);
>
> @@ -976,7 +976,7 @@ retry:
>         b = mca_find(c, k);
>
>         if (!b) {
> -               if (current->bio_list)
> +               if (current->bio_lists)
>                         return ERR_PTR(-EAGAIN);
>
>                 mutex_lock(&c->bucket_lock);
> @@ -2127,7 +2127,7 @@ static int bch_btree_insert_node(struct btree *b, struct btree_op *op,
>
>         return 0;
>  split:
> -       if (current->bio_list) {
> +       if (current->bio_lists) {
>                 op->lock = b->c->root->level + 1;
>                 return -EAGAIN;
>         } else if (op->lock <= b->c->root->level) {
> @@ -2209,7 +2209,7 @@ int bch_btree_insert(struct cache_set *c, struct keylist *keys,
>         struct btree_insert_op op;
>         int ret = 0;
>
> -       BUG_ON(current->bio_list);
> +       BUG_ON(current->bio_lists);
>         BUG_ON(bch_keylist_empty(keys));
>
>         bch_btree_op_init(&op.op, 0);
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index cd77216..b64d4f0 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -174,7 +174,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
>  #define DM_BUFIO_CACHE(c)      (dm_bufio_caches[dm_bufio_cache_index(c)])
>  #define DM_BUFIO_CACHE_NAME(c) (dm_bufio_cache_names[dm_bufio_cache_index(c)])
>
> -#define dm_bufio_in_request()  (!!current->bio_list)
> +#define dm_bufio_in_request()  (!!current->bio_lists)
>
>  static void dm_bufio_lock(struct dm_bufio_client *c)
>  {
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index b5c4be7..7afc71c 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -663,6 +663,13 @@ static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
>         }
>  }
>
> +static inline bool current_has_pending_bios(void)
> +{
> +       return current->bio_lists && (
> +             !bio_list_empty(&current->bio_lists->recursion) ||
> +             !bio_list_empty(&current->bio_lists->remainder));
> +}
> +
>  extern struct md_cluster_operations *md_cluster_ops;
>  static inline int mddev_is_clustered(struct mddev *mddev)
>  {
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index c7c8cde..811f2c0 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -876,8 +876,7 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>                                     (!conf->barrier ||
>                                      ((conf->start_next_window <
>                                        conf->next_resync + RESYNC_SECTORS) &&
> -                                     current->bio_list &&
> -                                     !bio_list_empty(current->bio_list))),
> +                                     current_has_pending_bios())),
>                                     conf->resync_lock);
>                 conf->nr_waiting--;
>         }
> @@ -1014,7 +1013,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
>         struct r1conf *conf = mddev->private;
>         struct bio *bio;
>
> -       if (from_schedule || current->bio_list) {
> +       if (from_schedule || current->bio_lists) {
>                 spin_lock_irq(&conf->device_lock);
>                 bio_list_merge(&conf->pending_bio_list, &plug->pending);
>                 conf->pending_count += plug->pending_cnt;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index c7de2a5..2c70717 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -945,8 +945,7 @@ static void wait_barrier(struct r10conf *conf)
>                 wait_event_lock_irq(conf->wait_barrier,
>                                     !conf->barrier ||
>                                     (conf->nr_pending &&
> -                                    current->bio_list &&
> -                                    !bio_list_empty(current->bio_list)),
> +                                    current_has_pending_bios()),
>                                     conf->resync_lock);
>                 conf->nr_waiting--;
>         }
> @@ -1022,7 +1021,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
>         struct r10conf *conf = mddev->private;
>         struct bio *bio;
>
> -       if (from_schedule || current->bio_list) {
> +       if (from_schedule || current->bio_lists) {
>                 spin_lock_irq(&conf->device_lock);
>                 bio_list_merge(&conf->pending_bio_list, &plug->pending);
>                 conf->pending_count += plug->pending_cnt;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 9faebf7..57e45a0 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -598,6 +598,17 @@ struct bio_list {
>         struct bio *tail;
>  };
>
> +/* for generic_make_request() */
> +struct recursion_to_iteration_bio_lists {
> +       /* For stacking drivers submitting to their respective backend.
> +        * Added to tail, processed in FIFO order. */
> +       struct bio_list recursion;
> +       /* For pushing back the "remainder" part resulting from calling
> +        * blk_queue_split(). Added to head, processed in LIFO order,
> +        * once the "recursion" list has been drained. */
> +       struct bio_list remainder;
> +};
> +
>  static inline int bio_list_empty(const struct bio_list *bl)
>  {
>         return bl->head == NULL;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6e42ada..146eedc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -128,7 +128,7 @@ struct sched_attr {
>
>  struct futex_pi_state;
>  struct robust_list_head;
> -struct bio_list;
> +struct recursion_to_iteration_bio_lists;
>  struct fs_struct;
>  struct perf_event_context;
>  struct blk_plug;
> @@ -1727,7 +1727,7 @@ struct task_struct {
>         void *journal_info;
>
>  /* stacked block device info */
> -       struct bio_list *bio_list;
> +       struct recursion_to_iteration_bio_lists *bio_lists;
>
>  #ifdef CONFIG_BLOCK
>  /* stack plugging */
> --
> 1.9.1
>

Lars Ellenberg June 24, 2016, 2:27 p.m. UTC | #2
On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote:
> >
> > This is not a theoretical problem.
> > At least in DRBD, with an unfortunately high IO concurrency wrt. the
> > "max-buffers" setting, we have a reproducible deadlock without this patch.
> 
> Is there any log about the deadlock? And is there any lockdep warning
> if it is enabled?

In DRBD, to avoid potentially very long internal queues as we wait for
our replication peer device and local backend, we limit the number of
in-flight bios we accept, and block in our ->make_request_fn() if that
number exceeds a configured watermark ("max-buffers").

Works fine, as long as we could assume that once our make_request_fn()
returns, any bios we "recursively" submitted against the local backend
would be dispatched. Which used to be the case.

commit 54efd50 block: make generic_make_request handle arbitrarily sized bios
introduced blk_queue_split(), which will split any bio that is
violating the queue limits into a smaller piece to be processed
right away, and queue the "rest" on current->bio_list to be
processed by the iteration in generic_make_request() once the
current ->make_request_fn() returns.

Before that, any user was supposed to go through bio_add_page(),
which would call our merge bvec function, and that should already
be sufficient to not violate queue limits.
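
A hypothetical submitter following that pattern would look roughly like the
sketch below; 'page', 'len', 'offset' and 'pages_left' are made-up
bookkeeping variables, and a READ is assumed for illustration:

	struct bio *bio = NULL;

	while (pages_left) {
		if (!bio)
			bio = bio_alloc(GFP_NOIO, min(pages_left, BIO_MAX_PAGES));
		/* bio_add_page() consults q->merge_bvec_fn; once the
		 * driver cannot take any more, submit what we have and
		 * start a new bio. */
		if (bio_add_page(bio, page, len, offset) < len) {
			submit_bio(READ, bio);
			bio = NULL;
			continue;
		}
		/* ... advance page/len/offset, decrement pages_left ... */
	}
	if (bio)
		submit_bio(READ, bio);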

Previously, typically the next in line bio to be processed would
be the cloned one we passed down to our backend device, which in
case of further stacking devices (e.g. logical volume, raid1 or
similar) may again push further bios on that list.
All of which would simply be processed in turn.

Now, with blk_queue_split(), the next-in-line bio would be the
next piece of the "too big" original bio, so we end up calling
several times into our ->make_request_fn() without even
dispatching one bio to the backend.

With highly concurrent IO and/or very small "max-buffers" setting,
this can deadlock, where the current submit path would wait for
the completion of the bios supposedly already passed to the
backend, but which actually are not even dispatched yet, rotting
away on current->bio_list.
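
In code, the deadlocking pattern looks roughly like this. This is a hedged
sketch, not actual DRBD source: the function name, clone_and_remap() and the
max_buffers field are made up for illustration.

	static blk_qc_t drbd_like_make_request(struct request_queue *q, struct bio *bio)
	{
		struct drbd_device *device = q->queuedata;

		/* May push a remainder part onto current->bio_list, which
		 * the generic_make_request() iteration will process
		 * *before* any backend clone submitted below. */
		blk_queue_split(q, &bio, q->bio_split);

		/* Deadlocks once the in-flight count can only drop after
		 * bios that are still stuck behind that remainder on
		 * current->bio_list get dispatched. */
		wait_event(device->misc_wait,
			   atomic_read(&device->ap_bio_cnt) < device->max_buffers);

		atomic_inc(&device->ap_bio_cnt);
		generic_make_request(clone_and_remap(device, bio));
		return BLK_QC_T_NONE;
	}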

I could code around that in various ways inside the DRBD make_request_fn,
or always dispatch the backend bios to a different (thread or work_queue)
context, but I'd rather have the distinction between "recursively
submitted" bios and "pushed back part of originally passed in bio" as
implemented in this patch.

Thanks,

    Lars

> > I'm unsure if it could cause problems in md raid in some corner cases
> > or when a rebuild/scrub/... starts in a "bad" moment.
> >
> > It also may increase "stress" on any involved bio_set,
> > because it may increase the number of bios that are allocated
> > before the first of them is actually processed, causing more
> > frequent calls of punt_bios_to_rescuer().
> >
> > Alternatively,
> > a simple one-line change to generic_make_request() would also "fix" it,
> > by processing all recursive bios in LIFO order.
> > But that may change the order in which bios reach the "physical" layer,
> > in case they are in fact split into many smaller pieces, for example in
> > a striping setup, which may have other performance implications.
> >
> >  | diff --git a/block/blk-core.c b/block/blk-core.c
> >  | index 2475b1c7..a5623f6 100644
> >  | --- a/block/blk-core.c
> >  | +++ b/block/blk-core.c
> >  | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
> >  |       * should be added at the tail
> >  |       */
> >  |      if (current->bio_list) {
> >  | -            bio_list_add(current->bio_list, bio);
> >  | +            bio_list_add_head(current->bio_list, bio);
> >  |              goto out;
> >  |      }

> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 2475b1c7..f03ff4c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -2031,7 +2031,7 @@ end_io:
> >   */
> >  blk_qc_t generic_make_request(struct bio *bio)
> >  {
> > -       struct bio_list bio_list_on_stack;
> > +       struct recursion_to_iteration_bio_lists bio_lists_on_stack;
> >         blk_qc_t ret = BLK_QC_T_NONE;
> >
> >         if (!generic_make_request_checks(bio))
> > @@ -2040,15 +2040,18 @@ blk_qc_t generic_make_request(struct bio *bio)

> > -       if (current->bio_list) {
> > -               bio_list_add(current->bio_list, bio);
> > +       if (current->bio_lists) {
> > +               bio_list_add(&current->bio_lists->recursion, bio);
> >                 goto out;
> >         }
> >

> > @@ -2067,8 +2070,9 @@ blk_qc_t generic_make_request(struct bio *bio)
> >          * bio_list, and call into ->make_request() again.
> >          */
> >         BUG_ON(bio->bi_next);
> > -       bio_list_init(&bio_list_on_stack);
> > -       current->bio_list = &bio_list_on_stack;
> > +       bio_list_init(&bio_lists_on_stack.recursion);
> > +       bio_list_init(&bio_lists_on_stack.remainder);
> > +       current->bio_lists = &bio_lists_on_stack;
> >         do {
> >                 struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> >
> > @@ -2076,16 +2080,14 @@ blk_qc_t generic_make_request(struct bio *bio)
> >                         ret = q->make_request_fn(q, bio);
> >
> >                         blk_queue_exit(q);
> > -
> > -                       bio = bio_list_pop(current->bio_list);
> >                 } else {
> > -                       struct bio *bio_next = bio_list_pop(current->bio_list);
> > -
> >                         bio_io_error(bio);
> > -                       bio = bio_next;
> >                 }
> > +               bio = bio_list_pop(&current->bio_lists->recursion);
> > +               if (!bio)
> > +                       bio = bio_list_pop(&current->bio_lists->remainder);
> >         } while (bio);
> > -       current->bio_list = NULL; /* deactivate */
> > +       current->bio_lists = NULL; /* deactivate */
> >
> >  out:
> >         return ret;
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 2613531..8da0c22 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
> >         struct bio *split, *res;
> >         unsigned nsegs;
> >
> > +       BUG_ON(!current->bio_lists);
> >         if ((*bio)->bi_rw & REQ_DISCARD)
> >                 split = blk_bio_discard_split(q, *bio, bs, &nsegs);
> >         else if ((*bio)->bi_rw & REQ_WRITE_SAME)
> > @@ -190,7 +191,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
> >
> >                 bio_chain(split, *bio);
> >                 trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> > -               generic_make_request(*bio);
> > +               bio_list_add_head(&current->bio_lists->remainder, *bio);
> >                 *bio = split;

Mike Snitzer June 24, 2016, 3:15 p.m. UTC | #3
On Fri, Jun 24 2016 at 10:27am -0400,
Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

> On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote:
> > >
> > > This is not a theoretical problem.
> > > At least in DRBD, with an unfortunately high IO concurrency wrt. the
> > > "max-buffers" setting, we have a reproducible deadlock without this patch.
> > 
> > Is there any log about the deadlock? And is there any lockdep warning
> > if it is enabled?
> 
> In DRBD, to avoid potentially very long internal queues as we wait for
> our replication peer device and local backend, we limit the number of
> in-flight bios we accept, and block in our ->make_request_fn() if that
> number exceeds a configured watermark ("max-buffers").
> 
> Works fine, as long as we could assume that once our make_request_fn()
> returns, any bios we "recursively" submitted against the local backend
> would be dispatched. Which used to be the case.

It'd be useful to know whether this patch fixes your issue:
https://patchwork.kernel.org/patch/7398411/

Ming Lei didn't like it due to concerns about IO contexts changing
(thereby breaking merging that occurs via plugging).

But if it _does_ fix your issue then the case for the change is
increased; and we just need to focus on addressing Ming's concerns
(Mikulas has some ideas).

Conversely, and in parallel, Mikulas can look to see if your approach
fixes the observed dm-snapshot deadlock that he set out to fix.

Ming Lei June 25, 2016, 9:30 a.m. UTC | #4
On Fri, Jun 24, 2016 at 10:27 PM, Lars Ellenberg
<lars.ellenberg@linbit.com> wrote:
> On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote:
>> >
>> > This is not a theoretical problem.
>> > At least in DRBD, with an unfortunately high IO concurrency wrt. the
>> > "max-buffers" setting, we have a reproducible deadlock without this patch.
>>
>> Is there any log about the deadlock? And is there any lockdep warning
>> if it is enabled?
>
> In DRBD, to avoid potentially very long internal queues as we wait for
> our replication peer device and local backend, we limit the number of
> in-flight bios we accept, and block in our ->make_request_fn() if that
> number exceeds a configured watermark ("max-buffers").
>
> Works fine, as long as we could assume that once our make_request_fn()
> returns, any bios we "recursively" submitted against the local backend
> would be dispatched. Which used to be the case.
>
> commit 54efd50 block: make generic_make_request handle arbitrarily sized bios
> introduced blk_queue_split(), which will split any bio that is
> violating the queue limits into a smaller piece to be processed
> right away, and queue the "rest" on current->bio_list to be
> processed by the iteration in generic_make_request() once the
> current ->make_request_fn() returns.
>
> Before that, any user was supposed to go through bio_add_page(),
> which would call our merge bvec function, and that should already
> be sufficient to not violate queue limits.
>
> Previously, typically the next in line bio to be processed would
> be the cloned one we passed down to our backend device, which in
> case of further stacking devices (e.g. logical volume, raid1 or
> similar) may again push further bios on that list.
> All of which would simply be processed in turn.

Could you clarify if the issue is triggered in drbd without dm/md involved?
Or is the issue always triggered with dm/md over drbd?

As Mike mentioned, there is one known issue with dm-snapshot.

>
> Now, with blk_queue_split(), the next-in-line bio would be the
> next piece of the "too big" original bio, so we end up calling
> several times into our ->make_request_fn() without even
> dispatching one bio to the backend.

If your ->make_request_fn() is called several times, each time
you should dispatch at least one bio returned from blk_queue_split(),
so I don't understand why not even one bio is dispatched.

>
> With highly concurrent IO and/or very small "max-buffers" setting,
> this can deadlock, where the current submit path would wait for
> the completion of the bios supposedly already passed to the
> backend, but which actually are not even dispatched yet, rotting
> away on current->bio_list.

Suppose your ->make_request_fn() only handles the bio returned
from blk_queue_split(); then I don't understand why your driver waits
for the 'rest' bios from blk_queue_split().

Maybe the drbd driver calls generic_make_request() in its
->make_request_fn() and assumes the bio is always dispatched
once generic_make_request() returns; is that right?

>
> I could code around that in various ways inside the DRBD make_request_fn,
> or always dispatch the backend bios to a different (thread or work_queue)
> context, but I'd rather have the distinction between "recursively
> submitted" bios and "pushed back part of originally passed in bio" as
> implemented in this patch.
>
> Thanks,
>
>     Lars
>
>> > I'm unsure if it could cause problems in md raid in some corner cases
>> > or when a rebuild/scrub/... starts in a "bad" moment.
>> >
>> > It also may increase "stress" on any involved bio_set,
>> > because it may increase the number of bios that are allocated
>> > before the first of them is actually processed, causing more
>> > frequent calls of punt_bios_to_rescuer().
>> >
>> > Alternatively,
>> > a simple one-line change to generic_make_request() would also "fix" it,
>> > by processing all recursive bios in LIFO order.
>> > But that may change the order in which bios reach the "physical" layer,
>> > in case they are in fact split into many smaller pieces, for example in
>> > a striping setup, which may have other performance implications.
>> >
>> >  | diff --git a/block/blk-core.c b/block/blk-core.c
>> >  | index 2475b1c7..a5623f6 100644
>> >  | --- a/block/blk-core.c
>> >  | +++ b/block/blk-core.c
>> >  | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>> >  |       * should be added at the tail
>> >  |       */
>> >  |      if (current->bio_list) {
>> >  | -            bio_list_add(current->bio_list, bio);
>> >  | +            bio_list_add_head(current->bio_list, bio);
>> >  |              goto out;
>> >  |      }
>
>> > diff --git a/block/blk-core.c b/block/blk-core.c
>> > index 2475b1c7..f03ff4c 100644
>> > --- a/block/blk-core.c
>> > +++ b/block/blk-core.c
>> > @@ -2031,7 +2031,7 @@ end_io:
>> >   */
>> >  blk_qc_t generic_make_request(struct bio *bio)
>> >  {
>> > -       struct bio_list bio_list_on_stack;
>> > +       struct recursion_to_iteration_bio_lists bio_lists_on_stack;
>> >         blk_qc_t ret = BLK_QC_T_NONE;
>> >
>> >         if (!generic_make_request_checks(bio))
>> > @@ -2040,15 +2040,18 @@ blk_qc_t generic_make_request(struct bio *bio)
>
>> > -       if (current->bio_list) {
>> > -               bio_list_add(current->bio_list, bio);
>> > +       if (current->bio_lists) {
>> > +               bio_list_add(&current->bio_lists->recursion, bio);
>> >                 goto out;
>> >         }
>> >
>
>> > @@ -2067,8 +2070,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>> >          * bio_list, and call into ->make_request() again.
>> >          */
>> >         BUG_ON(bio->bi_next);
>> > -       bio_list_init(&bio_list_on_stack);
>> > -       current->bio_list = &bio_list_on_stack;
>> > +       bio_list_init(&bio_lists_on_stack.recursion);
>> > +       bio_list_init(&bio_lists_on_stack.remainder);
>> > +       current->bio_lists = &bio_lists_on_stack;
>> >         do {
>> >                 struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>> >
>> > @@ -2076,16 +2080,14 @@ blk_qc_t generic_make_request(struct bio *bio)
>> >                         ret = q->make_request_fn(q, bio);
>> >
>> >                         blk_queue_exit(q);
>> > -
>> > -                       bio = bio_list_pop(current->bio_list);
>> >                 } else {
>> > -                       struct bio *bio_next = bio_list_pop(current->bio_list);
>> > -
>> >                         bio_io_error(bio);
>> > -                       bio = bio_next;
>> >                 }
>> > +               bio = bio_list_pop(&current->bio_lists->recursion);
>> > +               if (!bio)
>> > +                       bio = bio_list_pop(&current->bio_lists->remainder);
>> >         } while (bio);
>> > -       current->bio_list = NULL; /* deactivate */
>> > +       current->bio_lists = NULL; /* deactivate */
>> >
>> >  out:
>> >         return ret;
>> > diff --git a/block/blk-merge.c b/block/blk-merge.c
>> > index 2613531..8da0c22 100644
>> > --- a/block/blk-merge.c
>> > +++ b/block/blk-merge.c
>> > @@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>> >         struct bio *split, *res;
>> >         unsigned nsegs;
>> >
>> > +       BUG_ON(!current->bio_lists);
>> >         if ((*bio)->bi_rw & REQ_DISCARD)
>> >                 split = blk_bio_discard_split(q, *bio, bs, &nsegs);
>> >         else if ((*bio)->bi_rw & REQ_WRITE_SAME)
>> > @@ -190,7 +191,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>> >
>> >                 bio_chain(split, *bio);
>> >                 trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
>> > -               generic_make_request(*bio);
>> > +               bio_list_add_head(&current->bio_lists->remainder, *bio);
>> >                 *bio = split;
>

Lars Ellenberg June 28, 2016, 8:24 a.m. UTC | #5
On Fri, Jun 24, 2016 at 11:15:47AM -0400, Mike Snitzer wrote:
> On Fri, Jun 24 2016 at 10:27am -0400,
> Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> 
> > On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote:
> > > >
> > > > This is not a theoretical problem.
> > > > At least in DRBD, with an unfortunately high IO concurrency wrt. the
> > > > "max-buffers" setting, we have a reproducible deadlock without this patch.
> > > 
> > > Is there any log about the deadlock? And is there any lockdep warning
> > > if it is enabled?
> > 
> > In DRBD, to avoid potentially very long internal queues as we wait for
> > our replication peer device and local backend, we limit the number of
> > in-flight bios we accept, and block in our ->make_request_fn() if that
> > number exceeds a configured watermark ("max-buffers").
> > 
> > Works fine, as long as we could assume that once our make_request_fn()
> > returns, any bios we "recursively" submitted against the local backend
> > would be dispatched. Which used to be the case.
> 
> It'd be useful to know whether this patch fixes your issue:
> https://patchwork.kernel.org/patch/7398411/

I would assume so, because if current is blocked for any reason,
it will dispatch all bios that are still on current->bio_list
to be processed from other contexts.

Which means we will not deadlock, but make progress,
if unblocking current depends on processing of those bios.

Also see my other mail on the issue,
where I try to better explain the mechanics of "my" deadlock.

    Lars

Lars Ellenberg June 28, 2016, 8:45 a.m. UTC | #6
On Sat, Jun 25, 2016 at 05:30:29PM +0800, Ming Lei wrote:
> On Fri, Jun 24, 2016 at 10:27 PM, Lars Ellenberg
> <lars.ellenberg@linbit.com> wrote:
> > On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote:
> >> >
> >> > This is not a theoretical problem.
> >> > At least in DRBD, with an unfortunately high IO concurrency wrt. the
> >> > "max-buffers" setting, we have a reproducible deadlock without this patch.
> >>
> >> Is there any log about the deadlock? And is there any lockdep warning
> >> if it is enabled?
> >
> > In DRBD, to avoid potentially very long internal queues as we wait for
> > our replication peer device and local backend, we limit the number of
> > in-flight bios we accept, and block in our ->make_request_fn() if that
> > number exceeds a configured watermark ("max-buffers").
> >
> > Works fine, as long as we could assume that once our make_request_fn()
> > returns, any bios we "recursively" submitted against the local backend
> > would be dispatched. Which used to be the case.
> >
> > commit 54efd50 block: make generic_make_request handle arbitrarily sized bios
> > introduced blk_queue_split(), which will split any bio that is
> > violating the queue limits into a smaller piece to be processed
> > right away, and queue the "rest" on current->bio_list to be
> > processed by the iteration in generic_make_request() once the
> > current ->make_request_fn() returns.
> >
> > Before that, any user was supposed to go through bio_add_page(),
> > which would call our merge bvec function, and that should already
> > be sufficient to not violate queue limits.
> >
> > Previously, typically the next in line bio to be processed would
> > be the cloned one we passed down to our backend device, which in
> > case of further stacking devices (e.g. logical volume, raid1 or
> > similar) may again push further bios on that list.
> > All of which would simply be processed in turn.
> 
> Could you clarify if the issue is triggered in drbd without dm/md involved?
> Or is the issue always triggered with dm/md over drbd?
> 
> As Mike mentioned, there is one known issue with dm-snapshot.


The issue can always be triggered, even if only DRBD is involved.

> If your ->make_request_fn() is called several times, each time
> you should dispatch at least one bio returned from blk_queue_split(),
> so I don't understand why not even one bio is dispatched.

I'll try to "visualize" the mechanics of "my" deadlock here.

Just to clarify the effect, assume we had a driver that
for $reasons would limit the number of in-flight bios
to a single one.

=== before blk_queue_split()

Previously, if something wanted to read some range of blocks,
it would allocate a bio, call bio_add_page() a number of times,
and once the bio was "full", call generic_make_request(),
then fill the next bio and submit that one.

Stacking: "single_in_flight" (q1) -> "sda" (q2)

  generic_make_request(bio)	current->bio_list	in-flight
   B_orig_1				NULL		0
    q1->make_request_fn(B_orig_1)	empty
    							1
	recursive call, queues:		B_1_remapped
    iterative call:
    q2->make_request_fn(B_1_remapped)	empty
    		-> actually dispatched to hardware
  return of generic_make_request(B_orig_1).
   B_orig_2
    q1->make_request_fn(B_orig_2)
    							1
	blocks, waits for in-flight to drop
	...
	completion of B_orig_1				0

	recursive call, queues:		B_2_remapped
    iterative call:
    q2->make_request_fn(B_2_remapped)	empty
    		-> actually dispatched to hardware
  return of generic_make_request(B_orig_2).


=== with blk_queue_split()

Now, upper layers build one large bio.

  generic_make_request(bio)	current->bio_list	in-flight
   B_orig				NULL		0
    q1->make_request_fn(B_orig)		empty
    blk_queue_split(B_orig) splits into
    					B_orig_r1
    B_orig_s1
    							1
	recursive call, queues:		B_orig_r1, B_s1_remapped
    iterative call:
    q1->make_request_fn(B_orig_r1)	B_s1_remapped
	blocks, waits for in-flight to drop
	...
	which never happens,
	because B_s1_remapped has not even been submitted to q2 yet,
	let alone dispatched to hardware.


Obviously we don't limit ourselves to just one request, but with larger
incoming bios, with the number of times we split them, with the number
of stacking layers below us, or even layers below us that *also*
call blk_queue_split (or equivalent open coded clone and split)
themselves to split even further, things get worse.

    Lars

Ming Lei July 2, 2016, 10:03 a.m. UTC | #7
On Tue, Jun 28, 2016 at 4:45 PM, Lars Ellenberg
<lars.ellenberg@linbit.com> wrote:
> On Sat, Jun 25, 2016 at 05:30:29PM +0800, Ming Lei wrote:
>> On Fri, Jun 24, 2016 at 10:27 PM, Lars Ellenberg
>> <lars.ellenberg@linbit.com> wrote:
>> > On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote:
>> >> >
>> >> > This is not a theoretical problem.
>> >> > At least in DRBD, with an unfortunately high IO concurrency wrt. the
>> >> > "max-buffers" setting, we have a reproducible deadlock without this patch.
>> >>
>> >> Is there any log about the deadlock? And is there any lockdep warning
>> >> if it is enabled?
>> >
>> > In DRBD, to avoid potentially very long internal queues as we wait for
>> > our replication peer device and local backend, we limit the number of
>> > in-flight bios we accept, and block in our ->make_request_fn() if that
>> > number exceeds a configured watermark ("max-buffers").
>> >
>> > Works fine, as long as we could assume that once our make_request_fn()
>> > returns, any bios we "recursively" submitted against the local backend
>> > would be dispatched. Which used to be the case.
>> >
>> > commit 54efd50 block: make generic_make_request handle arbitrarily sized bios
>> > introduced blk_queue_split(), which will split any bio that is
>> > violating the queue limits into a smaller piece to be processed
>> > right away, and queue the "rest" on current->bio_list to be
>> > processed by the iteration in generic_make_request() once the
>> > current ->make_request_fn() returns.
>> >
>> > Before that, any user was supposed to go through bio_add_page(),
>> > which would call our merge bvec function, and that should already
>> > be sufficient to not violate queue limits.
>> >
>> > Previously, typically the next in line bio to be processed would
>> > be the cloned one we passed down to our backend device, which in
>> > case of further stacking devices (e.g. logical volume, raid1 or
>> > similar) may again push further bios on that list.
>> > All of which would simply be processed in turn.
>>
>> Could you clarify if the issue is triggered in drbd without dm/md involved?
>> Or is the issue always triggered with dm/md over drbd?
>>
>> As Mike mentioned, there is one known issue with dm-snapshot.
>
>
> The issue can always be triggered, even if only DRBD is involved.
>
>> If your ->make_request_fn() is called several times, each time
>> you should dispatch at least one bio returned from blk_queue_split(),
>> so I don't understand why not even one bio is dispatched.
>
> I'll try to "visualize" the mechanics of "my" deadlock here.
>
> Just to clarify the effect, assume we had a driver that
> for $reasons would limit the number of in-flight bios
> to a single one.
>
> === before blk_queue_split()
>
> Previously, if something wanted to read some range of blocks,
> it would allocate a bio, call bio_add_page() a number of times,
> and once the bio was "full", call generic_make_request(),
> then fill the next bio and submit that one.
>
> Stacking: "single_in_flight" (q1) -> "sda" (q2)
>
>   generic_make_request(bio)     current->bio_list       in-flight
>    B_orig_1                             NULL            0
>     q1->make_request_fn(B_orig_1)       empty
>                                                         1
>         recursive call, queues:         B_1_remapped
>     iterative call:
>     q2->make_request_fn(B_1_remapped)   empty
>                 -> actually dispatched to hardware
>   return of generic_make_request(B_orig_1).
>    B_orig_2
>     q1->make_request_fn(B_orig_2)
>                                                         1
>         blocks, waits for in-flight to drop
>         ...
>         completion of B_orig_1                          0
>
>         recursive call, queues:         B_2_remapped
>     iterative call:
>     q2->make_request_fn(B_2_remapped)   empty
>                 -> actually dispatched to hardware
>   return of generic_make_request(B_orig_2).
>
>
> === with blk_queue_split()
>
> Now, upper layers build one large bio.
>
>   generic_make_request(bio)     current->bio_list       in-flight
>    B_orig                               NULL            0
>     q1->make_request_fn(B_orig)         empty
>     blk_queue_split(B_orig) splits into
>                                         B_orig_r1
>     B_orig_s1
>                                                         1
>         recursive call, queues:         B_orig_r1, B_s1_remapped
>     iterative call:
>     q1->make_request_fn(B_orig_r1)      B_s1_remapped
>         blocks, waits for in-flight to drop
>         ...
>         which never happens,
>         because B_s1_remapped has not even been submitted to q2 yet,
>         let alone dispatched to hardware.
>
>
> Obviously we don't limit ourselves to just one request, but with larger
> incoming bios, with the number of times we split them, with the number
> of stacking layers below us, or even layers below us that *also*
> call blk_queue_split (or equivalent open coded clone and split)
> themselves to split even further, things get worse.

Thanks for your detailed explanation!

Now I believe the issue really comes from blk_queue_split(), and I will
comment on your patch later.

thanks,

Ming Lei July 2, 2016, 10:28 a.m. UTC | #8
On Wed, Jun 22, 2016 at 4:22 PM, Lars Ellenberg
<lars.ellenberg@linbit.com> wrote:
> For a long time, generic_make_request() converts recursion into
> iteration by queuing recursive arguments on current->bio_list.
>
> This is convenient for stacking drivers:
> the top-most driver takes the originally submitted bio
> and re-submits a re-mapped version of it, one or more clones,
> or one or more newly allocated bios to its backend(s), which
> are then simply processed in turn; each can again queue
> more "backend-bios" until we reach the bottom of the driver stack
> and actually dispatch to the real backend device.
>
> Any stacking driver ->make_request_fn() could expect that,
> once it returns, any backend-bios it submitted via recursive calls
> to generic_make_request() would now be processed and dispatched, before
> the current task would call into this driver again.
>
> This is changed by commit
>   54efd50 block: make generic_make_request handle arbitrarily sized bios
>
> Drivers may call blk_queue_split() inside their ->make_request_fn(),
> which may split the current bio into a front-part to be dealt with
> immediately, and a remainder-part, which may need to be split even
> further. That remainder-part will simply also be pushed to
> current->bio_list, and would end up being head-of-queue, in front
> of any backend-bios the current make_request_fn() might submit during
> processing of the front-part.
>
> Which means the current task would immediately end up back in the same
> make_request_fn() of the same driver again, before any of its backend
> bios have even been processed.
>
> This can lead to resource starvation deadlock.
> Drivers could avoid this by learning to not need blk_queue_split(),
> or by submitting their backend bios in a different context (dedicated
> kernel thread, work_queue context, ...). Or by playing funny re-ordering
> games with entries on current->bio_list.
>
> Instead, I suggest distinguishing between recursive calls to
> generic_make_request() and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
>         struct recursion_to_iteration_bio_lists {
>                 struct bio_list recursion;
>                 struct bio_list remainder;
>         }
>
> To have all bios targeted at drivers lower in the stack processed
> before the next piece of a bio targeted at a higher level,
> bios queued as a result of recursion will continue to be processed
> in FIFO order, for as long as any are available.
> Pushed-back bio parts resulting from blk_queue_split() will be processed
> in LIFO order, one by one, whenever the recursion list becomes empty.
>
> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
> ---
>
> This is not a theoretical problem.
> At least in DRBD, with an unfortunately high IO concurrency wrt. the
> "max-buffers" setting, we have a reproducible deadlock without this patch.
>
> I'm unsure if it could cause problems in md raid in some corner cases
> or when a rebuild/scrub/... starts in a "bad" moment.
>
> It also may increase "stress" on any involved bio_set,
> because it may increase the number of bios that are allocated
> before the first of them is actually processed, causing more
> frequent calls of punt_bios_to_rescuer().
>
> Alternatively,
> a simple one-line change to generic_make_request() would also "fix" it,
> by processing all recursive bios in LIFO order.
> But that may change the order in which bios reach the "physical" layer,
> in case they are in fact split into many smaller pieces, for example in
> a striping setup, which may have other performance implications.
>
>  | diff --git a/block/blk-core.c b/block/blk-core.c
>  | index 2475b1c7..a5623f6 100644
>  | --- a/block/blk-core.c
>  | +++ b/block/blk-core.c
>  | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>  |       * should be added at the tail
>  |       */
>  |      if (current->bio_list) {
>  | -            bio_list_add(current->bio_list, bio);
>  | +            bio_list_add_head(current->bio_list, bio);
>  |              goto out;
>  |      }
>
> Any fix would need to go to stable 4.3.x and up.
> This patch does apply as is to 4.5 and up,
> with a trivial fixup (or a cherry-pick of cda2264) to 4.4,
> and with some more (also trivial) fixup to 4.3 because of
> GFP_WAIT -> GFP_RECLAIM and comment rewrap.
>
> Any input?

The idea looks good, but I am not sure it can cover all cases of
dm over drbd or drbd over dm, and maintaining two lists
becomes too complicated.

One clean solution may be to convert the loop in generic_make_request()
into something like the following:

do {
    struct bio *splitted, *remainder = NULL;
    struct request_queue *q = bdev_get_queue(bio->bi_bdev);

    blk_queue_split(q, &bio, &remainder, q->bio_split);

    ret = q->make_request_fn(q, bio);

    if (remainder)
         bio_list_add(current->bio_list, remainder);
    bio = bio_list_pop(current->bio_list);
} while (bio);

And blk_queue_split() needs a bit of change, and all external calls to it
have to be removed.

Of course we also need to fix blk_queue_bounce() first, but that can
be done easily by handling the bounced bio via generic_make_request().
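
To make that concrete, a rough and untested sketch of such a
remainder-returning blk_queue_split() variant, modeled on the current
block/blk-merge.c but with the REQ_DISCARD/REQ_WRITE_SAME branches
elided; the name and the "remainder" out-parameter are assumptions
taken from the loop above, not an existing kernel API:

static void blk_queue_split_ret(struct request_queue *q, struct bio **bio,
				struct bio **remainder, struct bio_set *bs)
{
	unsigned nsegs;
	struct bio *split;

	/* split off only the front part that fits the queue limits */
	split = blk_bio_segment_split(q, *bio, bs, &nsegs);
	if (split) {
		/* hand the chained rest back instead of resubmitting it */
		bio_chain(split, *bio);
		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
		*remainder = *bio;
		*bio = split;
	} else {
		*remainder = NULL;
	}
}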

> How should I proceed?

I think the issue needs to be fixed.

Thanks,
Ming

>
> ---
>  block/bio.c               | 14 +++++++-------
>  block/blk-core.c          | 32 +++++++++++++++++---------------
>  block/blk-merge.c         |  3 ++-
>  drivers/md/bcache/btree.c | 12 ++++++------
>  drivers/md/dm-bufio.c     |  2 +-
>  drivers/md/md.h           |  7 +++++++
>  drivers/md/raid1.c        |  5 ++---
>  drivers/md/raid10.c       |  5 ++---
>  include/linux/bio.h       | 11 +++++++++++
>  include/linux/sched.h     |  4 ++--
>  10 files changed, 57 insertions(+), 38 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 0e4aa42..2ffcea0 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -368,10 +368,10 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>         bio_list_init(&punt);
>         bio_list_init(&nopunt);
>
> -       while ((bio = bio_list_pop(current->bio_list)))
> +       while ((bio = bio_list_pop(&current->bio_lists->recursion)))
>                 bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
>
> -       *current->bio_list = nopunt;
> +       current->bio_lists->recursion = nopunt;
>
>         spin_lock(&bs->rescue_lock);
>         bio_list_merge(&bs->rescue_list, &punt);
> @@ -453,13 +453,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>                  *
>                  * We solve this, and guarantee forward progress, with a rescuer
>                  * workqueue per bio_set. If we go to allocate and there are
> -                * bios on current->bio_list, we first try the allocation
> -                * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> -                * bios we would be blocking to the rescuer workqueue before
> -                * we retry with the original gfp_flags.
> +                * bios on current->bio_lists->recursion, we first try the
> +                * allocation without __GFP_DIRECT_RECLAIM; if that fails, we
> +                * punt those bios we would be blocking to the rescuer
> +                * workqueue before we retry with the original gfp_flags.
>                  */
>
> -               if (current->bio_list && !bio_list_empty(current->bio_list))
> +               if (current->bio_lists && !bio_list_empty(&current->bio_lists->recursion))
>                         gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>
>                 p = mempool_alloc(bs->bio_pool, gfp_mask);
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 2475b1c7..f03ff4c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2031,7 +2031,7 @@ end_io:
>   */
>  blk_qc_t generic_make_request(struct bio *bio)
>  {
> -       struct bio_list bio_list_on_stack;
> +       struct recursion_to_iteration_bio_lists bio_lists_on_stack;
>         blk_qc_t ret = BLK_QC_T_NONE;
>
>         if (!generic_make_request_checks(bio))
> @@ -2040,15 +2040,18 @@ blk_qc_t generic_make_request(struct bio *bio)
>         /*
>          * We only want one ->make_request_fn to be active at a time, else
>          * stack usage with stacked devices could be a problem.  So use
> -        * current->bio_list to keep a list of requests submited by a
> -        * make_request_fn function.  current->bio_list is also used as a
> +        * current->bio_lists to keep a list of requests submited by a
> +        * make_request_fn function.  current->bio_lists is also used as a
>          * flag to say if generic_make_request is currently active in this
>          * task or not.  If it is NULL, then no make_request is active.  If
>          * it is non-NULL, then a make_request is active, and new requests
> -        * should be added at the tail
> +        * should be added at the tail of current->bio_lists->recursion;
> +        * remainder bios resulting from a call to bio_queue_split() from
> +        * within ->make_request_fn() should be added to the head of
> +        * current->bio_lists->remainder.
>          */
> -       if (current->bio_list) {
> -               bio_list_add(current->bio_list, bio);
> +       if (current->bio_lists) {
> +               bio_list_add(&current->bio_lists->recursion, bio);
>                 goto out;
>         }
>
> @@ -2057,7 +2060,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>          * Before entering the loop, bio->bi_next is NULL (as all callers
>          * ensure that) so we have a list with a single bio.
>          * We pretend that we have just taken it off a longer list, so
> -        * we assign bio_list to a pointer to the bio_list_on_stack,
> +        * we assign bio_list to a pointer to the bio_lists_on_stack,
>          * thus initialising the bio_list of new bios to be
>          * added.  ->make_request() may indeed add some more bios
>          * through a recursive call to generic_make_request.  If it
> @@ -2067,8 +2070,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>          * bio_list, and call into ->make_request() again.
>          */
>         BUG_ON(bio->bi_next);
> -       bio_list_init(&bio_list_on_stack);
> -       current->bio_list = &bio_list_on_stack;
> +       bio_list_init(&bio_lists_on_stack.recursion);
> +       bio_list_init(&bio_lists_on_stack.remainder);
> +       current->bio_lists = &bio_lists_on_stack;
>         do {
>                 struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>
> @@ -2076,16 +2080,14 @@ blk_qc_t generic_make_request(struct bio *bio)
>                         ret = q->make_request_fn(q, bio);
>
>                         blk_queue_exit(q);
> -
> -                       bio = bio_list_pop(current->bio_list);
>                 } else {
> -                       struct bio *bio_next = bio_list_pop(current->bio_list);
> -
>                         bio_io_error(bio);
> -                       bio = bio_next;
>                 }
> +               bio = bio_list_pop(&current->bio_lists->recursion);
> +               if (!bio)
> +                       bio = bio_list_pop(&current->bio_lists->remainder);
>         } while (bio);
> -       current->bio_list = NULL; /* deactivate */
> +       current->bio_lists = NULL; /* deactivate */
>
>  out:
>         return ret;
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 2613531..8da0c22 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>         struct bio *split, *res;
>         unsigned nsegs;
>
> +       BUG_ON(!current->bio_lists);
>         if ((*bio)->bi_rw & REQ_DISCARD)
>                 split = blk_bio_discard_split(q, *bio, bs, &nsegs);
>         else if ((*bio)->bi_rw & REQ_WRITE_SAME)
> @@ -190,7 +191,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>
>                 bio_chain(split, *bio);
>                 trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> -               generic_make_request(*bio);
> +               bio_list_add_head(&current->bio_lists->remainder, *bio);
>                 *bio = split;
>         }
>  }
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index eab505e..eb0b41a 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -450,7 +450,7 @@ void __bch_btree_node_write(struct btree *b, struct closure *parent)
>
>         trace_bcache_btree_write(b);
>
> -       BUG_ON(current->bio_list);
> +       BUG_ON(current->bio_lists);
>         BUG_ON(b->written >= btree_blocks(b));
>         BUG_ON(b->written && !i->keys);
>         BUG_ON(btree_bset_first(b)->seq != i->seq);
> @@ -544,7 +544,7 @@ static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
>
>         /* Force write if set is too big */
>         if (set_bytes(i) > PAGE_SIZE - 48 &&
> -           !current->bio_list)
> +           !current->bio_lists)
>                 bch_btree_node_write(b, NULL);
>  }
>
> @@ -889,7 +889,7 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
>  {
>         struct btree *b;
>
> -       BUG_ON(current->bio_list);
> +       BUG_ON(current->bio_lists);
>
>         lockdep_assert_held(&c->bucket_lock);
>
> @@ -976,7 +976,7 @@ retry:
>         b = mca_find(c, k);
>
>         if (!b) {
> -               if (current->bio_list)
> +               if (current->bio_lists)
>                         return ERR_PTR(-EAGAIN);
>
>                 mutex_lock(&c->bucket_lock);
> @@ -2127,7 +2127,7 @@ static int bch_btree_insert_node(struct btree *b, struct btree_op *op,
>
>         return 0;
>  split:
> -       if (current->bio_list) {
> +       if (current->bio_lists) {
>                 op->lock = b->c->root->level + 1;
>                 return -EAGAIN;
>         } else if (op->lock <= b->c->root->level) {
> @@ -2209,7 +2209,7 @@ int bch_btree_insert(struct cache_set *c, struct keylist *keys,
>         struct btree_insert_op op;
>         int ret = 0;
>
> -       BUG_ON(current->bio_list);
> +       BUG_ON(current->bio_lists);
>         BUG_ON(bch_keylist_empty(keys));
>
>         bch_btree_op_init(&op.op, 0);
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index cd77216..b64d4f0 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -174,7 +174,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
>  #define DM_BUFIO_CACHE(c)      (dm_bufio_caches[dm_bufio_cache_index(c)])
>  #define DM_BUFIO_CACHE_NAME(c) (dm_bufio_cache_names[dm_bufio_cache_index(c)])
>
> -#define dm_bufio_in_request()  (!!current->bio_list)
> +#define dm_bufio_in_request()  (!!current->bio_lists)
>
>  static void dm_bufio_lock(struct dm_bufio_client *c)
>  {
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index b5c4be7..7afc71c 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -663,6 +663,13 @@ static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
>         }
>  }
>
> +static inline bool current_has_pending_bios(void)
> +{
> +       return current->bio_lists && (
> +             !bio_list_empty(&current->bio_lists->recursion) ||
> +             !bio_list_empty(&current->bio_lists->remainder));
> +}
> +
>  extern struct md_cluster_operations *md_cluster_ops;
>  static inline int mddev_is_clustered(struct mddev *mddev)
>  {
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index c7c8cde..811f2c0 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -876,8 +876,7 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>                                     (!conf->barrier ||
>                                      ((conf->start_next_window <
>                                        conf->next_resync + RESYNC_SECTORS) &&
> -                                     current->bio_list &&
> -                                     !bio_list_empty(current->bio_list))),
> +                                     current_has_pending_bios())),
>                                     conf->resync_lock);
>                 conf->nr_waiting--;
>         }
> @@ -1014,7 +1013,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
>         struct r1conf *conf = mddev->private;
>         struct bio *bio;
>
> -       if (from_schedule || current->bio_list) {
> +       if (from_schedule || current->bio_lists) {
>                 spin_lock_irq(&conf->device_lock);
>                 bio_list_merge(&conf->pending_bio_list, &plug->pending);
>                 conf->pending_count += plug->pending_cnt;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index c7de2a5..2c70717 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -945,8 +945,7 @@ static void wait_barrier(struct r10conf *conf)
>                 wait_event_lock_irq(conf->wait_barrier,
>                                     !conf->barrier ||
>                                     (conf->nr_pending &&
> -                                    current->bio_list &&
> -                                    !bio_list_empty(current->bio_list)),
> +                                    current_has_pending_bios()),
>                                     conf->resync_lock);
>                 conf->nr_waiting--;
>         }
> @@ -1022,7 +1021,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
>         struct r10conf *conf = mddev->private;
>         struct bio *bio;
>
> -       if (from_schedule || current->bio_list) {
> +       if (from_schedule || current->bio_lists) {
>                 spin_lock_irq(&conf->device_lock);
>                 bio_list_merge(&conf->pending_bio_list, &plug->pending);
>                 conf->pending_count += plug->pending_cnt;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 9faebf7..57e45a0 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -598,6 +598,17 @@ struct bio_list {
>         struct bio *tail;
>  };
>
> +/* for generic_make_request() */
> +struct recursion_to_iteration_bio_lists {
> +       /* For stacking drivers submitting to their respective backend.
> +        * Added to tail, processed in FIFO order. */
> +       struct bio_list recursion;
> +       /* For pushing back the "remainder" part resulting from calling
> +        * blk_queue_split(). Added to head, processed in LIFO order,
> +        * once the "recursion" list has been drained. */
> +       struct bio_list remainder;
> +};
> +
>  static inline int bio_list_empty(const struct bio_list *bl)
>  {
>         return bl->head == NULL;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6e42ada..146eedc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -128,7 +128,7 @@ struct sched_attr {
>
>  struct futex_pi_state;
>  struct robust_list_head;
> -struct bio_list;
> +struct recursion_to_iteration_bio_lists;
>  struct fs_struct;
>  struct perf_event_context;
>  struct blk_plug;
> @@ -1727,7 +1727,7 @@ struct task_struct {
>         void *journal_info;
>
>  /* stacked block device info */
> -       struct bio_list *bio_list;
> +       struct recursion_to_iteration_bio_lists *bio_lists;
>
>  #ifdef CONFIG_BLOCK
>  /* stack plugging */
> --
> 1.9.1
>

Lars Ellenberg July 4, 2016, 8:20 a.m. UTC | #9
On Sat, Jul 02, 2016 at 06:28:29PM +0800, Ming Lei wrote:
> The idea looks good, but I am not sure it can cover all cases of
> dm over drbd or drbd over dm, and maintaining two lists
> becomes too complicated.
> 
> One clean solution may be to convert the loop of generic_make_request()
> in the following way:
> 
> do {
>     struct bio *remainder = NULL;
>     struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> 
>     blk_queue_split(q, &bio, &remainder, q->bio_split);
> 
>     ret = q->make_request_fn(q, bio);
> 
>     if (remainder)
>          bio_list_add(current->bio_list, remainder);
>     bio = bio_list_pop(current->bio_list);
> } while (bio);

Not good enough.

Consider DRBD on device-mapper on device-mapper on scsi,
or insert multipath and / or md raid into the stack.
The iterative call to q->make_request_fn() in the second iteration
may queue bios after the remainder.

Goal was to first process all "deeper level" bios
before processing the remainder.

You can achieve this by doing last-in-first-out on bio_list,
or by using two lists, as I suggested in the original post.

    Lars

Ming Lei July 4, 2016, 10:47 a.m. UTC | #10
On Mon, Jul 4, 2016 at 4:20 PM, Lars Ellenberg
<lars.ellenberg@linbit.com> wrote:
> On Sat, Jul 02, 2016 at 06:28:29PM +0800, Ming Lei wrote:
>> The idea looks good, but I am not sure it can cover all cases of
>> dm over drbd or drbd over dm, and maintaining two lists
>> becomes too complicated.
>>
>> One clean solution may be to convert the loop of generic_make_request()
>> in the following way:
>>
>> do {
>>     struct bio *remainder = NULL;
>>     struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>
>>     blk_queue_split(q, &bio, &remainder, q->bio_split);
>>
>>     ret = q->make_request_fn(q, bio);
>>
>>     if (remainder)
>>          bio_list_add(current->bio_list, remainder);
>>     bio = bio_list_pop(current->bio_list);
>> } while (bio);
>
> Not good enough.
>
> Consider DRBD on device-mapper on device-mapper on scsi,
> or insert multipath and / or md raid into the stack.
> The iterative call to q->make_request_fn() in the second iteration
> may queue bios after the remainder.

But this remainder is not the top remainder any more. I guess you mean
the following situation:

- drbd_make_request(bio)
       ->generic_make_request(bio)
              ->bio is added into current->bio_list
- bio is split into bio_a and bio_b in generic_make_request()
- dm_make_request(bio_a)
        ->generic_make_request(bio_a)
               ->bio_a is added into current->bio_list
- bio_a is split into bio_a_a and bio_a_b in generic_make_request()
- dm_make_request(bio_a_a)
      ->.....
- bio_a_b is added into current->bio_list
- dm_make_request(bio_a_b)

But it is correct because bio_a depends on both bio_a_a and bio_a_b.
Or do I understand you wrong?

>
> Goal was to first process all "deeper level" bios
> before processing the remainder.

For the reported bio splitting issue, I think the goal is to make sure all
BIOs generated from 'bio' inside .make_request_fn(bio) are queued
before the 'current' remainder, because that is the issue introduced by
blk_queue_split().

Thanks,
Ming

>
> You can achieve this by doing last-in-first-out on bio_list,
> or by using two lists, as I suggested in the original post.
>
>     Lars
>

Lars Ellenberg July 6, 2016, 12:38 p.m. UTC | #11
On Mon, Jul 04, 2016 at 06:47:29PM +0800, Ming Lei wrote:
> >> One clean solution may be to convert the loop of generic_make_request()
> >> in the following way:
> >>
> >> do {
> >>     struct bio *remainder = NULL;
> >>     struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> >>
> >>     blk_queue_split(q, &bio, &remainder, q->bio_split);
> >>
> >>     ret = q->make_request_fn(q, bio);
> >>
> >>     if (remainder)
> >>          bio_list_add(current->bio_list, remainder);
> >>     bio = bio_list_pop(current->bio_list);
> >> } while (bio);
> >
> > Not good enough.


> > Goal was to first process all "deeper level" bios
> > before processing the remainder.
> 
> For the reported bio splitting issue, I think the goal is to make sure all
> BIOs generated from 'bio' inside .make_request_fn(bio) are queued
> before the 'current' remainder, because that is the issue introduced by
> blk_queue_split().

Stacking:
	qA (limiting layer)
	-> qB (remapping)
	-> qC (remapping)
	-> qD (hardware)

[in fact you don't even need four layers,
 this is just to clarify that the stack may be more complex than you
 assume it would be]

Columns below:
1) argument to generic_make_request() and its target queue.
2) current->bio_list
3) "in-flight" counter of qA.

==== In your new picture, iiuc:

generic_make_request(bio_orig)
		NULL			in-flight=0
bio_orig	empty
  blk_queue_split()
  result:
  bio_s, bio_r	
qA->make_request_fn(bio_s)
					in-flight=1
  bio_c = bio_clone(bio_s)
  generic_make_request(bio_c to qB)
		bio_c
<-return
  bio_list_add(bio_r)
  		bio_c, bio_r
  bio_list_pop()
  		bio_r
qB->make_request_fn(bio_c)
  (Assume it does not clone, but only remap.
  But it may also be a striping layer,
  and queue more than one bio here.)
  generic_make_request(bio_c to qC)
  		bio_r, bio_c
<-return
  bio_list_pop()
  		bio_c
qA->make_request_fn(bio_r)		in-flight still 1

	BLOCKS, because bio_c has not been processed to its final
	destination qD yet, and not dispatched to hardware.


==== my suggestion

generic_make_request(bio_orig)
		NULL			in-flight=0
bio_orig	empty			in-flight=0
qA->make_request_fn(bio_orig)
  blk_queue_split()
  result:
  bio_s, and bio_r stuffed away to head of remainder list.
					in-flight=1
  bio_c = bio_clone(bio_s)
  generic_make_request(bio_c to qB)
		bio_c
<-return
  		bio_c
  bio_list_pop()
  		empty
qB->make_request_fn(bio_c)
  (Assume it does not clone, but only remap.
  But it may also be a striping layer,
  and queue more than one bio here.)
  generic_make_request(bio_c to qC)
  		bio_c
<-return
  bio_list_pop()
  		empty
qC->make_request_fn(bio_c)
  generic_make_request(bio_c to qD)
  		bio_c
<-return
  bio_list_pop()
  		empty
qD->make_request_fn(bio_c)
	dispatches to hardware
<-return
		empty
   bio_list_pop()
   NULL, great, let's pop from remainder list
qA->make_request_fn(bio_r)		in-flight=?

	May block, but only until completion of bio_c.
	Which may already have happened.

	*makes progress*

Thanks,

    Lars

Ming Lei July 6, 2016, 3:57 p.m. UTC | #12
On Wed, Jul 6, 2016 at 8:38 PM, Lars Ellenberg
<lars.ellenberg@linbit.com> wrote:
> On Mon, Jul 04, 2016 at 06:47:29PM +0800, Ming Lei wrote:
>> >> One clean solution may be to convert the loop of generic_make_request()
>> >> in the following way:
>> >>
>> >> do {
>> >>     struct bio *remainder = NULL;
>> >>     struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>> >>
>> >>     blk_queue_split(q, &bio, &remainder, q->bio_split);
>> >>
>> >>     ret = q->make_request_fn(q, bio);
>> >>
>> >>     if (remainder)
>> >>          bio_list_add(current->bio_list, remainder);
>> >>     bio = bio_list_pop(current->bio_list);
>> >> } while (bio);
>> >
>> > Not good enough.
>
>
>> > Goal was to first process all "deeper level" bios
>> > before processing the remainder.
>>
>> For the reported bio splitting issue, I think the goal is to make sure all
>> BIOs generated from 'bio' inside .make_request_fn(bio) are queued
>> before the 'current' remainder, because that is the issue introduced by
>> blk_queue_split().

I believe the above description is correct, but my previous solution
is wrong. It looks like what we need is a binary tree, in which the left
child is the split bio and the right child is the remainder bio, and the
tree needs to be traversed in pre-order to make sure all BIOs generated
from 'bio' inside .make_request_fn(bio) are queued before the
remainder of the bio split.

Another property of the tree is that if one node has only one child,
the child has to be the left one.

I will think about it further to see if it is doable to implement.
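
Just to pin down the mapping, a hypothetical sketch of that model
(bio_node, submit_preorder() and process() are made-up names for
illustration, not proposed kernel code):

struct bio_node {
	struct bio *bio;
	struct bio_node *left;	/* split/cloned part, handled first */
	struct bio_node *right;	/* remainder, after the left subtree */
};

static void submit_preorder(struct bio_node *n)
{
	if (!n)
		return;
	process(n->bio);	   /* "visit": q->make_request_fn(q, bio) */
	submit_preorder(n->left);  /* all bios generated from the split part */
	submit_preorder(n->right); /* the remainder comes last */
}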

>
> Stacking:
>         qA (limiting layer)
>         -> qB (remapping)
>         -> qC (remapping)
>         -> qD (hardware)
>
> [in fact you don't even need four layers,
>  this is just to clarify that the stack may be more complex than you
>  assume it would be]
>
> Columns below:
> 1) argument to generic_make_request() and its target queue.
> 2) current->bio_list
> 3) "in-flight" counter of qA.
>
> ==== In your new picture, iiuc:
>
> generic_make_request(bio_orig)
>                 NULL                    in-flight=0
> bio_orig        empty
>   blk_queue_split()
>   result:
>   bio_s, bio_r
> qA->make_request_fn(bio_s)
>                                         in-flight=1
>   bio_c = bio_clone(bio_s)
>   generic_make_request(bio_c to qB)
>                 bio_c
> <-return
>   bio_list_add(bio_r)
>                 bio_c, bio_r
>   bio_list_pop()
>                 bio_r
> qB->make_request_fn(bio_c)
>   (Assume it does not clone, but only remap.
>   But it may also be a striping layer,
>   and queue more than one bio here.)
>   generic_make_request(bio_c to qC)
>                 bio_r, bio_c
> <-return
>   bio_list_pop()
>                 bio_c
> qA->make_request_fn(bio_r)              in-flight still 1
>
>         BLOCKS, because bio_c has not been processed to its final
>         destination qD yet, and not dispatched to hardware.

Yes, you are right.

>
>
> ==== my suggestion
>
> generic_make_request(bio_orig)
>                 NULL                    in-flight=0
> bio_orig        empty                   in-flight=0
> qA->make_request_fn(bio_orig)
>   blk_queue_split()
>   result:
>   bio_s, and bio_r stuffed away to head of remainder list.
>                                         in-flight=1
>   bio_c = bio_clone(bio_s)
>   generic_make_request(bio_c to qB)
>                 bio_c
> <-return
>                 bio_c
>   bio_list_pop()
>                 empty
> qB->make_request_fn(bio_c)
>   (Assume it does not clone, but only remap.
>   But it may also be a striping layer,
>   and queue more than one bio here.)
>   generic_make_request(bio_c to qC)
>                 bio_c
> <-return
>   bio_list_pop()
>                 empty
> qC->make_request_fn(bio_c)
>   generic_make_request(bio_c to qD)
>                 bio_c
> <-return
>   bio_list_pop()
>                 empty
> qD->make_request_fn(bio_c)
>         dispatches to hardware
> <-return
>                 empty
>    bio_list_pop()
>    NULL, great, let's pop from remainder list
> qA->make_request_fn(bio_r)              in-flight=?
>
>         May block, but only until completion of bio_c.
>         Which may already have happened.
>
>         *makes progress*

I admit your solution is smart, but it isn't easy to prove it correct
in theory.  But if the traversal can be mapped onto a pre-order traversal
of the above binary tree, it may be correct.

Thanks,
Ming

>
> Thanks,
>
>     Lars
>

NeilBrown July 7, 2016, 5:35 a.m. UTC | #13
On Wed, Jun 22 2016, Lars Ellenberg wrote:

> For a long time, generic_make_request() converts recursion into
> iteration by queuing recursive arguments on current->bio_list.
>
> This is convenient for stacking drivers,
> the top-most driver would take the originally submitted bio,
> and re-submit a re-mapped version of it, or one or more clones,
> or one or more new allocated bios to its backend(s). Which
> are then simply processed in turn, and each can again queue
> more "backend-bios" until we reach the bottom of the driver stack,
> and actually dispatch to the real backend device.
>
> Any stacking driver ->make_request_fn() could expect that,
> once it returns, any backend-bios it submitted via recursive calls
> to generic_make_request() would now be processed and dispatched, before
> the current task would call into this driver again.
>
> This is changed by commit
>   54efd50 block: make generic_make_request handle arbitrarily sized bios
>
> Drivers may call blk_queue_split() inside their ->make_request_fn(),
> which may split the current bio into a front-part to be dealt with
> immediately, and a remainder-part, which may need to be split even
> further. That remainder-part will simply also be pushed to
> current->bio_list, and would end up being head-of-queue, in front
> of any backend-bios the current make_request_fn() might submit during
> > processing of the front-part.
>
> Which means the current task would immediately end up back in the same
> make_request_fn() of the same driver again, before any of its backend
> bios have even been processed.
>
> This can lead to resource starvation deadlock.
> Drivers could avoid this by learning to not need blk_queue_split(),
> or by submitting their backend bios in a different context (dedicated
> kernel thread, work_queue context, ...). Or by playing funny re-ordering
> games with entries on current->bio_list.
>
> Instead, I suggest to distinguish between recursive calls to
> generic_make_request(), and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
> 	struct recursion_to_iteration_bio_lists {
> 		struct bio_list recursion;
> 		struct bio_list remainder;
> 	}
>
> To have all bios targeted to drivers lower in the stack processed before
> processing the next piece of a bio targeted at the higher levels,
> as long as queued bios resulting from recursion are available,
> they will continue to be processed in FIFO order.
> Pushed back bio-parts resulting from blk_queue_split() will be processed
> in LIFO order, one-by-one, whenever the recursion list becomes empty.

I really like this change.  It seems to precisely address the problem.
The "problem" being that requests for "this" device are potentially
mixed up with requests from underlying devices.
However I'm not sure it is quite general enough.

The "remainder" list is a stack of requests aimed at "this" level or
higher, and I think it will always exactly fit that description.
The "recursion" list needs to be a queue of requests aimed at the next
level down, and that doesn't quite work, because once you start acting
on the first entry in that list, all the rest become "this" level.

I think you can address this by always calling ->make_request_fn with an
empty "recursion", then after the call completes, splice the "recursion"
list that resulted (if any) on top of the "remainder" stack.

This way, the "remainder" stack is always "requests for lower-level
devices before request for upper level devices" and the "recursion"
queue is always "requests for devices below the current level".

I also really *don't* like the idea of punting to a separate thread - it
seems to be just delaying the problem.

Can you try move the bio_list_init(->recursion) call to just before
the ->make_request_fn() call, and adding
    bio_list_merge_head(->remainder, ->recursion)
just after?
(or something like that) and confirm it makes sense, and works?
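
Something like this untested sketch of the loop in generic_make_request(),
on top of your patch (blk_queue_enter()/blk_queue_exit() and the error
path elided):

	current->bio_lists = &bio_lists_on_stack;
	bio_list_init(&bio_lists_on_stack.remainder);
	do {
		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

		/* each call starts with an empty recursion list ... */
		bio_list_init(&bio_lists_on_stack.recursion);
		ret = q->make_request_fn(q, bio);
		/* ... and whatever it queued for the levels below goes in
		 * front of the pushed-back bios for this level and above */
		bio_list_merge_head(&bio_lists_on_stack.remainder,
				    &bio_lists_on_stack.recursion);

		bio = bio_list_pop(&bio_lists_on_stack.remainder);
	} while (bio);
	current->bio_lists = NULL; /* deactivate */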

Thanks!

NeilBrown
Lars Ellenberg July 7, 2016, 8:03 a.m. UTC | #14
On Wed, Jul 06, 2016 at 11:57:51PM +0800, Ming Lei wrote:
> > ==== my suggestion
> >
> > generic_make_request(bio_orig)
> >                 NULL                    in-flight=0
> > bio_orig        empty                   in-flight=0
> > qA->make_request_fn(bio_orig)
> >   blk_queue_split()
> >   result:
> >   bio_s, and bio_r stuffed away to head of remainder list.
> >                                         in-flight=1
> >   bio_c = bio_clone(bio_s)
> >   generic_make_request(bio_c to qB)
> >                 bio_c
> > <-return
> >                 bio_c
> >   bio_list_pop()
> >                 empty
> > qB->make_request_fn(bio_c)
> >   (Assume it does not clone, but only remap.
> >   But it may also be a striping layer,
> >   and queue more than one bio here.)
> >   generic_make_request(bio_c to qC)
> >                 bio_c
> > <-return
> >   bio_list_pop()
> >                 empty
> > qC->make_request_fn(bio_c)
> >   generic_make_request(bio_c to qD)
> >                 bio_c
> > <-return
> >   bio_list_pop()
> >                 empty
> > qD->make_request_fn(bio_c)
> >         dispatches to hardware
> > <-return
> >                 empty
> >    bio_list_pop()
> >    NULL, great, lets pop from remainder list
> > qA->make_request_fn(bio_r)              in-flight=?
> >
> >         May block, but only until completion of bio_c.
> >         Which may already have happened.
> >
> >         *makes progress*
> 
> I admit your solution is smart, but it isn't easy to prove it correct
> in theory.  But if the traversal can be mapped onto a pre-order traversal
> of the above binary tree, it may be correct.

What are you talking about?
There is no tree.
There is a single fifo.
And I suggest to make that one fifo, and one lifo instead.

  |<------ original bio ----->|
  |piece|----remainder--------|
  
  |piece| is then processed, just as it was before,
  all recursive submissions turned into iterative processing,
  in the exact order they have been called recursively.
  Until all deeper level submissions have been fully processed.
  
  If deeper levels are again calling blk_queue_split(), their
  respective remainder are queued in front of the "top level"
  remainder.
  
  And only then, the remainders are processed,
  just as if they did come in as "original bio", see above.
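
Condensed, that is exactly the dispatch order in the patch at the top
of this thread:

	/* drain recursive submissions first, in FIFO order ... */
	bio = bio_list_pop(&current->bio_lists->recursion);
	/* ... only then the most recently pushed-back remainder, LIFO */
	if (!bio)
		bio = bio_list_pop(&current->bio_lists->remainder);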

So if it did make progress before,
it will make progress now.

    Lars

Lars Ellenberg July 7, 2016, 8:16 a.m. UTC | #15
On Thu, Jul 07, 2016 at 03:35:48PM +1000, NeilBrown wrote:
> On Wed, Jun 22 2016, Lars Ellenberg wrote:
> 
> > For a long time, generic_make_request() converts recursion into
> > iteration by queuing recursive arguments on current->bio_list.
> >
> > This is convenient for stacking drivers,
> > the top-most driver would take the originally submitted bio,
> > and re-submit a re-mapped version of it, or one or more clones,
> > or one or more new allocated bios to its backend(s). Which
> > are then simply processed in turn, and each can again queue
> > more "backend-bios" until we reach the bottom of the driver stack,
> > and actually dispatch to the real backend device.
> >
> > Any stacking driver ->make_request_fn() could expect that,
> > once it returns, any backend-bios it submitted via recursive calls
> > to generic_make_request() would now be processed and dispatched, before
> > the current task would call into this driver again.
> >
> > This is changed by commit
> >   54efd50 block: make generic_make_request handle arbitrarily sized bios
> >
> > Drivers may call blk_queue_split() inside their ->make_request_fn(),
> > which may split the current bio into a front-part to be dealt with
> > immediately, and a remainder-part, which may need to be split even
> > further. That remainder-part will simply also be pushed to
> > current->bio_list, and would end up being head-of-queue, in front
> > of any backend-bios the current make_request_fn() might submit during
> > > processing of the front-part.
> >
> > Which means the current task would immediately end up back in the same
> > make_request_fn() of the same driver again, before any of its backend
> > bios have even been processed.
> >
> > This can lead to resource starvation deadlock.
> > Drivers could avoid this by learning to not need blk_queue_split(),
> > or by submitting their backend bios in a different context (dedicated
> > kernel thread, work_queue context, ...). Or by playing funny re-ordering
> > games with entries on current->bio_list.
> >
> > Instead, I suggest to distinguish between recursive calls to
> > generic_make_request(), and pushing back the remainder part in
> > blk_queue_split(), by pointing current->bio_lists to a
> > 	struct recursion_to_iteration_bio_lists {
> > 		struct bio_list recursion;
> > 		struct bio_list remainder;
> > 	}
> >
> > To have all bios targeted to drivers lower in the stack processed before
> > processing the next piece of a bio targeted at the higher levels,
> > as long as queued bios resulting from recursion are available,
> > they will continue to be processed in FIFO order.
> > Pushed back bio-parts resulting from blk_queue_split() will be processed
> > in LIFO order, one-by-one, whenever the recursion list becomes empty.
> 
> I really like this change.  It seems to precisely address the problem.
> The "problem" being that requests for "this" device are potentially
> mixed up with requests from underlying devices.
> However I'm not sure it is quite general enough.
> 
> The "remainder" list is a stack of requests aimed at "this" level or
> higher, and I think it will always exactly fit that description.
> The "recursion" list needs to be a queue of requests aimed at the next
> level down, and that doesn't quiet work, because once you start acting
> on the first entry in that list, all the rest become "this" level.

Uhm, well,
that's how it has been since you introduced this back in 2007, d89d879.
And it worked.

> I think you can address this by always calling ->make_request_fn with an
> empty "recursion", then after the call completes, splice the "recursion"
> list that resulted (if any) on top of the "remainder" stack.
> 
> This way, the "remainder" stack is always "requests for lower-level
> devices before request for upper level devices" and the "recursion"
> queue is always "requests for devices below the current level".

Yes, I guess that would work as well,
but may need "empirical proof" to check for performance regressions.

> I also really *don't* like the idea of punting to a separate thread - it
> seems to be just delaying the problem.
> 
> Can you try move the bio_list_init(->recursion) call to just before
> the ->make_request_fn() call, and adding
>     bio_list_merge_head(->remainder, ->recursion)
> just after?
> (or something like that) and confirm it makes sense, and works?

Sure, will do.
I'd suggest this would be a patch on its own though, on top of this one,
because it would change the order in which stacked bios are processed
wrt the way it has been since 2007 (my suggestion as is does not).

Which may change performance metrics.
It may even improve some of them,
or maybe it does nothing, but we don't know.

Thanks,

    Lars

Mike Snitzer July 7, 2016, 12:45 p.m. UTC | #16
On Thu, Jul 07 2016 at  1:35am -0400,
NeilBrown <neilb@suse.com> wrote:

> On Wed, Jun 22 2016, Lars Ellenberg wrote:
> 
> > For a long time, generic_make_request() converts recursion into
> > iteration by queuing recursive arguments on current->bio_list.
> >
> > This is convenient for stacking drivers,
> > the top-most driver would take the originally submitted bio,
> > and re-submit a re-mapped version of it, or one or more clones,
> > or one or more new allocated bios to its backend(s). Which
> > are then simply processed in turn, and each can again queue
> > more "backend-bios" until we reach the bottom of the driver stack,
> > and actually dispatch to the real backend device.
> >
> > Any stacking driver ->make_request_fn() could expect that,
> > once it returns, any backend-bios it submitted via recursive calls
> > to generic_make_request() would now be processed and dispatched, before
> > the current task would call into this driver again.
> >
> > This is changed by commit
> >   54efd50 block: make generic_make_request handle arbitrarily sized bios
> >
> > Drivers may call blk_queue_split() inside their ->make_request_fn(),
> > which may split the current bio into a front-part to be dealt with
> > immediately, and a remainder-part, which may need to be split even
> > further. That remainder-part will simply also be pushed to
> > current->bio_list, and would end up being head-of-queue, in front
> > of any backend-bios the current make_request_fn() might submit during
> > processing of the front-part.
> >
> > Which means the current task would immediately end up back in the same
> > make_request_fn() of the same driver again, before any of its backend
> > bios have even been processed.
> >
> > This can lead to resource starvation deadlock.
> > Drivers could avoid this by learning to not need blk_queue_split(),
> > or by submitting their backend bios in a different context (dedicated
> > kernel thread, work_queue context, ...). Or by playing funny re-ordering
> > games with entries on current->bio_list.
> >
> > Instead, I suggest to distinguish between recursive calls to
> > generic_make_request(), and pushing back the remainder part in
> > blk_queue_split(), by pointing current->bio_lists to a
> > 	struct recursion_to_iteration_bio_lists {
> > 		struct bio_list recursion;
> > 		struct bio_list remainder;
> > 	}
> >
> > To have all bios targeted to drivers lower in the stack processed before
> > processing the next piece of a bio targeted at the higher levels,
> > as long as queued bios resulting from recursion are available,
> > they will continue to be processed in FIFO order.
> > Pushed back bio-parts resulting from blk_queue_split() will be processed
> > in LIFO order, one-by-one, whenever the recursion list becomes empty.
> 
> I really like this change.  It seems to precisely address the problem.
> The "problem" being that requests for "this" device are potentially
> mixed up with requests from underlying devices.
> However I'm not sure it is quite general enough.
> 
> The "remainder" list is a stack of requests aimed at "this" level or
> higher, and I think it will always exactly fit that description.
> The "recursion" list needs to be a queue of requests aimed at the next
> level down, and that doesn't quiet work, because once you start acting
> on the first entry in that list, all the rest become "this" level.
> 
> I think you can address this by always calling ->make_request_fn with an
> empty "recursion", then after the call completes, splice the "recursion"
> list that resulted (if any) on top of the "remainder" stack.
> 
> This way, the "remainder" stack is always "requests for lower-level
> devices before request for upper level devices" and the "recursion"
> queue is always "requests for devices below the current level".
> 
> I also really *don't* like the idea of punting to a separate thread

Hi Neil,

Was this concern about "punting to a separate thread" in reference to
the line of work from Mikulas at the top of this 'wip' branch?
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

> - it seems to be just delaying the problem.

Have you looked at this closely?  I'm not seeing how you can say that,
given that on schedule the bios on current->bio_list are flushed.

The incremental work to delay the offload of queued bios is just meant
to preserve existing bio submission order unless there is reason to
believe a deadlock exists.

I would agree that this timer-based approach is rather "gross" to some
degree _but_ it beats deadlocks!  This code needs fixing.  And the fix
cannot be constrained to blk_queue_split() because DM isn't even using
it.

Mike

Ming Lei July 7, 2016, 1:14 p.m. UTC | #17
On Thu, Jul 7, 2016 at 4:03 PM, Lars Ellenberg
<lars.ellenberg@linbit.com> wrote:
> On Wed, Jul 06, 2016 at 11:57:51PM +0800, Ming Lei wrote:
>> > ==== my suggestion
>> >
>> > generic_make_request(bio_orig)
>> >                 NULL                    in-flight=0
>> > bio_orig        empty                   in-flight=0
>> > qA->make_request_fn(bio_orig)
>> >   blk_queue_split()
>> >   result:
>> >   bio_s, and bio_r stuffed away to head of remainder list.
>> >                                         in-flight=1
>> >   bio_c = bio_clone(bio_s)
>> >   generic_make_request(bio_c to qB)
>> >                 bio_c
>> > <-return
>> >                 bio_c
>> >   bio_list_pop()
>> >                 empty
>> > qB->make_request_fn(bio_c)
>> >   (Assume it does not clone, but only remap.
>> >   But it may also be a striping layer,
>> >   and queue more than one bio here.)
>> >   generic_make_request(bio_c to qC)
>> >                 bio_c
>> > <-return
>> >   bio_list_pop()
>> >                 empty
>> > qC->make_request_fn(bio_c)
>> >   generic_make_request(bio_c to qD)
>> >                 bio_c
>> > <-return
>> >   bio_list_pop()
>> >                 empty
>> > qD->make_request_fn(bio_c)
>> >         dispatches to hardware
>> > <-return
>> >                 empty
>> >    bio_list_pop()
>> >    NULL, great, lets pop from remainder list
>> > qA->make_request_fn(bio_r)              in-flight=?
>> >
>> >         May block, but only until completion of bio_c.
>> >         Which may already have happened.
>> >
>> >         *makes progress*
>>
>> I admit your solution is smart, but it isn't easy to prove it correct
>> in theory.  But if the traversal can be mapped onto a pre-order traversal
>> of the above binary tree, it may be correct.
>
> What are you talking about.
> There is no tree.
> There is a single fifo.
> And I suggest to make that one fifo, and one lifo instead.

The implementation may use a fifo (queue) or a lifo (stack), but the traversal
order actually is pre-order on a binary tree if we think of the split bio and
the cloned bio as the left child, and the remainder bio as the right child.

The big benefit of modeling the algorithm as a tree is that the implementation
can be proved correct easily.

Your patch is very similar to the non-recursive pre-order algorithm.

And the one-line-change approach is just the typical recursive
pre-order algorithm on a binary tree.

Let's abstract the one-line change into the following pseudocode:

bio_list_add_head(current->bio_list, bio);

while (!bio_list_empty(current->bio_list)) {
        bio = bio_list_pop(current->bio_list);
        q = bdev_get_queue(bio->bi_bdev);
        q->make_request_fn(q, bio);     /* visit the node */
}

q->make_request_fn(q, bio)
       -> blk_queue_split(split, bio)
              -> generic_make_request(remainder)
                    ->bio_list_add_head(current->bio_list, bio);
              -> push(split) & pop(split) & visit the split bio
                        ->generic_make_request(cloned_bio)
                              ->bio_list_add_head(current->bio_list, bio);

If you compare the above with the non-recursive pre-order traversal
implementation[1], it is basically the same.

[1] http://algorithmsandme.in/2015/03/preorder-traversal-of-tree-without-recursion/

The main difference between the one-line change and this patch is the
submission order in the following situation:

q->make_request_fn(bio)
          ->generic_make_request(cloned_bio0)
          ->generic_make_request(cloned_bio1)
          ->generic_make_request(cloned_bio2)

But the three BIOs are often submitted to different queues, so the order
may not make a big difference. Even if they are submitted to the same queue
(disk), the order change may not be an issue either, because there is still
the I/O scheduler.

And do we have a stacked driver which submits several BIOs to the same queue
in .make_request_fn()?

If there isn't such a case, I suggest the one-line change, together with
the pre-order traversal comment, since it is very simple.

Thanks,

>
>   |<------ original bio ----->|
>   |piece|----remainder--------|
>
>   |piece| is then processed, just as it was before,
>   all recursive submissions turned into iterative processing,
>   in the exact order they have been called recursively.
>   Until all deeper level submissions have been fully processed.
>
>   If deeper levels are again calling blk_queue_split(), their
>   respective remainder are queued in front of the "top level"
>   remainder.
>
>   And only then, the remainders are processed,
>   just as if they did come in as "original bio", see above.
>
> So if it did make progress before,
> it will make progress now.
>
>     Lars
>

Mike Snitzer July 7, 2016, 2:36 p.m. UTC | #18
On Wed, Jun 22 2016 at  4:22am -0400,
Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

> For a long time, generic_make_request() converts recursion into
> iteration by queuing recursive arguments on current->bio_list.
> 
> This is convenient for stacking drivers,
> the top-most driver would take the originally submitted bio,
> and re-submit a re-mapped version of it, or one or more clones,
> or one or more new allocated bios to its backend(s). Which
> are then simply processed in turn, and each can again queue
> more "backend-bios" until we reach the bottom of the driver stack,
> and actually dispatch to the real backend device.
> 
> Any stacking driver ->make_request_fn() could expect that,
> once it returns, any backend-bios it submitted via recursive calls
> to generic_make_request() would now be processed and dispatched, before
> the current task would call into this driver again.
> 
> This is changed by commit
>   54efd50 block: make generic_make_request handle arbitrarily sized bios
> 
> Drivers may call blk_queue_split() inside their ->make_request_fn(),
> which may split the current bio into a front-part to be dealt with
> immediately, and a remainder-part, which may need to be split even
> further. That remainder-part will simply also be pushed to
> current->bio_list, and would end up being head-of-queue, in front
> of any backend-bios the current make_request_fn() might submit during
> processing of the front-part.
> 
> Which means the current task would immediately end up back in the same
> make_request_fn() of the same driver again, before any of its backend
> bios have even been processed.
> 
> This can lead to resource starvation deadlock.
> Drivers could avoid this by learning to not need blk_queue_split(),
> or by submitting their backend bios in a different context (dedicated
> kernel thread, work_queue context, ...). Or by playing funny re-ordering
> games with entries on current->bio_list.
> 
> Instead, I suggest to distinguish between recursive calls to
> generic_make_request(), and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
> 	struct recursion_to_iteration_bio_lists {
> 		struct bio_list recursion;
> 		struct bio_list remainder;
> 	}
> 
> To have all bios targeted to drivers lower in the stack processed before
> processing the next piece of a bio targeted at the higher levels,
> as long as queued bios resulting from recursion are available,
> they will continue to be processed in FIFO order.
> Pushed back bio-parts resulting from blk_queue_split() will be processed
> in LIFO order, one-by-one, whenever the recursion list becomes empty.
> 
> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>

I've rebased this patch against Jens' for-4.8/core (resolved conflict in
blk-merge.c) and pushed the result to this wip2 branch, feel free to use
it to resubmit for inclusion if/when that is the way forward, see:

http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip2&id=36cee4b1ddef0a46562045b421792a847c570b6b

NeilBrown July 7, 2016, 10:07 p.m. UTC | #19
On Thu, Jul 07 2016, Lars Ellenberg wrote:

> On Thu, Jul 07, 2016 at 03:35:48PM +1000, NeilBrown wrote:
>> On Wed, Jun 22 2016, Lars Ellenberg wrote:
>> 
>> > For a long time, generic_make_request() converts recursion into
>> > iteration by queuing recursive arguments on current->bio_list.
>> >
>> > This is convenient for stacking drivers,
>> > the top-most driver would take the originally submitted bio,
>> > and re-submit a re-mapped version of it, or one or more clones,
>> > or one or more new allocated bios to its backend(s). Which
>> > are then simply processed in turn, and each can again queue
>> > more "backend-bios" until we reach the bottom of the driver stack,
>> > and actually dispatch to the real backend device.
>> >
>> > Any stacking driver ->make_request_fn() could expect that,
>> > once it returns, any backend-bios it submitted via recursive calls
>> > to generic_make_request() would now be processed and dispatched, before
>> > the current task would call into this driver again.
>> >
>> > This is changed by commit
>> >   54efd50 block: make generic_make_request handle arbitrarily sized bios
>> >
>> > Drivers may call blk_queue_split() inside their ->make_request_fn(),
>> > which may split the current bio into a front-part to be dealt with
>> > immediately, and a remainder-part, which may need to be split even
>> > further. That remainder-part will simply also be pushed to
>> > current->bio_list, and would end up being head-of-queue, in front
>> > of any backend-bios the current make_request_fn() might submit during
>> > processing of the front-part.
>> >
>> > Which means the current task would immediately end up back in the same
>> > make_request_fn() of the same driver again, before any of its backend
>> > bios have even been processed.
>> >
>> > This can lead to resource starvation deadlock.
>> > Drivers could avoid this by learning to not need blk_queue_split(),
>> > or by submitting their backend bios in a different context (dedicated
>> > kernel thread, work_queue context, ...). Or by playing funny re-ordering
>> > games with entries on current->bio_list.
>> >
>> > Instead, I suggest to distinguish between recursive calls to
>> > generic_make_request(), and pushing back the remainder part in
>> > blk_queue_split(), by pointing current->bio_lists to a
>> > 	struct recursion_to_iteration_bio_lists {
>> > 		struct bio_list recursion;
>> > 		struct bio_list remainder;
>> > 	}
>> >
>> > To have all bios targeted to drivers lower in the stack processed before
>> > processing the next piece of a bio targeted at the higher levels,
>> > as long as queued bios resulting from recursion are available,
>> > they will continue to be processed in FIFO order.
>> > Pushed back bio-parts resulting from blk_queue_split() will be processed
>> > in LIFO order, one-by-one, whenever the recursion list becomes empty.
>> 
>> I really like this change.  It seems to precisely address the problem.
>> The "problem" being that requests for "this" device are potentially
>> mixed up with requests from underlying devices.
>> However I'm not sure it is quite general enough.
>> 
>> The "remainder" list is a stack of requests aimed at "this" level or
>> higher, and I think it will always exactly fit that description.
>> The "recursion" list needs to be a queue of requests aimed at the next
>> level down, and that doesn't quite work, because once you start acting
>> on the first entry in that list, all the rest become "this" level.
>
> Uhm, well,
> that's how it has been since you introduced this back in 2007, d89d879.
> And it worked.

Just because it "worked" doesn't mean it was "right" :-)

>
>> I think you can address this by always calling ->make_request_fn with an
>> empty "recursion", then after the call completes, splice the "recursion"
>> list that resulted (if any) on top of the "remainder" stack.
>> 
>> This way, the "remainder" stack is always "requests for lower-level
>> devices before request for upper level devices" and the "recursion"
>> queue is always "requests for devices below the current level".
>
> Yes, I guess that would work as well,
> but may need "empirical proof" to check for performance regressions.
>
>> I also really *don't* like the idea of punting to a separate thread - it
>> seems to be just delaying the problem.
>> 
>> Can you try move the bio_list_init(->recursion) call to just before
>> the ->make_request_fn() call, and adding
>>     bio_list_merge_head(->remainder, ->recursion)
>> just after?
>> (or something like that) and confirm it makes sense, and works?
>
> Sure, will do.
> I'd suggest this would be a patch on its own though, on top of this one,
> because it would change the order in which stacked bios are processed
> wrt the way it has been since 2007 (my suggestion as is does not).

Yes, because I think the original order is wrong, as you have helped me
to see.
Before I introduced the recursion limiting, requests were handled as an
in-order tree walk.  The code I wrote tried to preserve that but didn't
for several reasons.  I think we need to restore the original in-order
walk because it makes the most sense.
So after processing a particular bio, we should then process all the
'child' bios - bios send to underlying devices.  Then the 'sibling'
bios, that were split off, and then any remaining parents and ancestors.

Your patch created the right structures for doing this, my proposal took
it a step closer, but now after more careful analysis I don't think it
is quite right.
With my previous proposal (and your latest patch - thanks!) requests for
"this" level are stacked, but they should be queued.
If a make_request_fn only ever submits one request for this level and
zero or more for lower levels, then the difference between a queue and a
stack is irrelevant.  If it submitted more than one, a stack would cause
them to be handled in the reverse order.

To make the patch "perfect", and maybe even more elegant, we could treat
->remainder and ->recursion more alike.
i.e.:
  - generic_make_request() has a private "stack" of requests.
  - before calling ->make_request_fn(), both ->remainder and ->recursion
    are initialised
  - after ->make_request_fn(), ->remainder is spliced onto the top of
    'stack', then ->recursion is spliced onto that.
  - If the stack is not empty, the top request is popped and we loop to the top.

This reliably follows in-order execution, and handles siblings correctly
(in submitted order) if/when a request splits off multiple siblings.
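
Roughly, as a completely untested sketch (reusing the struct from your
patch; "stack" is the new private list, and error handling and
blk_queue_enter() are elided):

	struct recursion_to_iteration_bio_lists on_stack;
	struct bio_list stack;	/* private to generic_make_request() */

	bio_list_init(&stack);
	current->bio_lists = &on_stack;
	do {
		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

		bio_list_init(&on_stack.recursion);
		bio_list_init(&on_stack.remainder);
		q->make_request_fn(q, bio);
		/* siblings split off at this level go on top of the stack */
		bio_list_merge_head(&stack, &on_stack.remainder);
		/* bios for the level below go in front of those siblings */
		bio_list_merge_head(&stack, &on_stack.recursion);
	} while ((bio = bio_list_pop(&stack)));
	current->bio_lists = NULL;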

>
> Which may change performance metrics.
> It may even improve some of them,
> or maybe it does nothing, but we don't know.

I think that as long as requests are submitted in the order they are
created at each level there is no reason to expect performance
regressions.
All we are doing is changing the ordering between requests generated at
different levels, and I think we are restoring a more natural order.

Thanks,
NeilBrown
NeilBrown July 7, 2016, 10:40 p.m. UTC | #20
On Thu, Jul 07 2016, Mike Snitzer wrote:

> On Thu, Jul 07 2016 at  1:35am -0400,
> NeilBrown <neilb@suse.com> wrote:
>
>> On Wed, Jun 22 2016, Lars Ellenberg wrote:
>> 
>> [...]
>> 
>> I also really *don't* like the idea of punting to a separate thread
>
> Hi Neil,
>
> Was this concern about "punting to a separate thread" in reference to
> the line of work from Mikulas at the top of this 'wip' branch?
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

A little bit in response to
  https://patchwork.kernel.org/patch/7398411/
which you linked upthread, but more in response to the per-bio_set
rescuer workqueue threads we now have (commit df2cb6daa4).  That really
seemed like a lazy solution.
It may well be that the current proposal makes these threads redundant by
completely handling the first section split off a request before looking
to see if there is any more to split.

For this to work, each split would need to be a binary split.
i.e.
  If request is too big:
    1/ split into A that isn't too big and B
    2/ add B to the "remainder" queue
    3/ generate sub requests for A and submit them to "recursion" queue

Then 'A' would be completely submitted before B was started, so it would
be safe for B to wait for any resources (like something from a mempool)
that A might be using.
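
In a make_request_fn, that pattern might look something like this
(untested sketch; example_make_request() and submit_to_backends() are
made-up names, the ->remainder plumbing is from Lars's patch):

	static blk_qc_t example_make_request(struct request_queue *q,
					     struct bio *bio)
	{
		unsigned int max = queue_max_sectors(q); /* limit for "A" */
		struct bio *split;

		if (bio_sectors(bio) > max) {
			/* 1/ binary split: "A" (split) fits, "B" (bio) is the rest */
			split = bio_split(bio, max, GFP_NOIO, q->bio_split);
			bio_chain(split, bio);
			/* 2/ park "B"; it is only looked at again after "A",
			 * and everything "A" generated, has been processed */
			bio_list_add_head(&current->bio_lists->remainder, bio);
			bio = split;
		}
		/* 3/ remap/clone "A" and submit to the backend(s); those
		 * generic_make_request() calls queue onto ->recursion */
		return submit_to_backends(q, bio);
	}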

>
>> - it seems to be just delaying the problem.
>
> Have you looked at this closely?  Not seeing how you can say that given
> that on schedule() the bios on current->bio_list are flushed.

I fully accept that it is similar to the current solution.  I just don't
like either.

In the sequence of steps given in
https://patchwork.kernel.org/patch/7398411/

step 6 would now not happen until after the bio mentioned in step 4 has
been completely submitted, and so after step 5 has had a chance to
release the lock.

>
> The incremental work to delay the offload of queued bios is just meant
> to preserve existing bio submission order unless there is reason to
> believe a deadlock exists.
>
> I would agree that this timer based approach is rather "gross" to some
> degree _but_ it beats deadlocks!  This code needs fixing.  And the fix
cannot be constrained to blk_queue_split() because DM isn't even using
> it.

Yes, the 1-second timer isn't the most elegant thing ever.  Certainly
better than a deadlock.  But I think we can do better.

Certainly DM needs fixing, as well as md (and maybe others).  The fix
should be fairly easy though.
For md/raid0, the loop in raid0_make_request would be discarded
and the
	} while (split != bio);
at the end would be replaced by
       if (split != bio)
              generic_queue_request(bio);

though probably with a better function name.
generic_queue_request() would add the bio to the ->remainder list.
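
Something like this, say (untested, and only a suggestion):

	/* Queue a bio aimed at "this" level; it will only be looked at
	 * again after everything the current make_request_fn has
	 * submitted to lower levels has been processed. */
	static inline void generic_queue_request(struct bio *bio)
	{
		BUG_ON(!current->bio_lists);
		bio_list_add_head(&current->bio_lists->remainder, bio);
	}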

Presumably DM could do something similar, but I'm not familiar enough
with the code to say precisely what.

Thanks,
NeilBrown
Lars Ellenberg July 8, 2016, 8:02 a.m. UTC | #21
On Fri, Jul 08, 2016 at 08:07:52AM +1000, NeilBrown wrote:
> Before I introduced the recursion limiting, requests were handled as an
> in-order tree walk.  The code I wrote tried to preserve that but didn't
> for several reasons.  I think we need to restore the original in-order
> walk because it makes the most sense.
> So after processing a particular bio, we should then process all the
> 'child' bios - bios sent to underlying devices.  Then the 'sibling'
> bios, that were split off, and then any remaining parents and ancestors.
> 
> Your patch created the right structures for doing this, my proposal took
> it a step closer, but now after more careful analysis I don't think it
> is quite right.
> With my previous proposal (and your latest patch - thanks!) requests for
> "this" level are stacked, but they should be queued.
> If a make_request_fn only ever submits one request for this level and
> zero or more lower levels, then the difference between a queue and a
> stack is irrelevant.  If it submitted more than one, a stack would cause
> them to be handled in the reverse order.

We have a device stack.
q_this_level->make_request_fn() cannot possibly submit anything
on "this_level", or it would create a device loop, I think.

So we start with the initial, "top most" call to generic_make_request().
That is one single bio. All queues are empty.

This bio is then passed on to its destination queue make_request_fn().

Which may choose to split it (via blk_queue_split(), or as dm does, or
otherwise). If it, like blk_queue_split() does, splits it into
"piece-I-can-handle-now" and "remainder", both still targeted at the
top-most (current) queue, I think the "remainder" should just be pushed
back, which will make it look as if upper layers did
	generic_make_request("piece-I-can-handle-now");
	generic_make_request("remainder");
Which I do, by using bio_list_add_head(remainder, bio); (*head*).

I don't see any other way for a make_request_fn(bio(l=x)) to generate
"sibling" bios to the same level (l=x) as its own argument.

This same q(l=x)->make_request_fn(bio(l=x)) may now call
generic_make_request() for zero or more "child" bios (l=x+1),
which are queued in order: bio_list_add(recursion, bio); (*tail*).
Then, once l=x returns, the queue generated by it is spliced
in front of the "remainder" (*head*).
All bios are processed in the order they have been queued,
by peeling off of the head.

After all "child" bios of level l>=x+1 have been processed,
the next bio to be processed will be the "pushed back" remainder.

All "Natural order".

> To make the patch "perfect", and maybe even more elegant, we could treat
> ->remainder and ->recursion more alike.
> i.e.:
>   - generic_make_request() has a private "stack" of requests.
>   - before calling ->make_request_fn(), both ->remainder and ->recursion
>     are initialised
>   - after ->make_request_fn(), ->remainder is spliced onto the top of
>     'stack', then ->recursion is spliced onto that.
>   - If stack is not empty, the top request is popped and we loop to top.
> 
> This reliably follows in-order execution, and handles siblings correctly
> (in submitted order) if/when a request splits off multiple siblings.

The only splitting that creates siblings on the current level
is blk_queue_split(), which splits the current bio into
"front piece" and "remainder", already processed in this order.

Anything else creating "siblings" is not creating siblings for the
current layer, but for the next deeper layer, which are queued on
"recursion" and also processed in the order they have been generated.

> I think that as long as requests are submitted in the order they are
> created at each level there is no reason to expect performance
> regressions.
> All we are doing is changing the ordering between requests generated at
> different levels, and I think we are restoring a more natural order.

I believe both patches combined are doing exactly this already.
I could rename .remainder to .todo or .incoming, though.

.incoming = [ bio(l=0) ]
.recursion = []

split

.incoming = [ bio(l=0,now_1), bio(l=0,remainder_1) ]
.recursion = []

process head of .incoming

.incoming = [ bio(l=0,remainder_1) ]
.recursion = [ bio(l=1,a), bio(l=1,b), bio(l=1,c), ... ]

merge_head

.incoming = [ bio(l=1,a), bio(l=1,b), bio(l=1,c), ...,
		bio(l=0,remainder_1) ]
.recursion = []

process head of .incoming, potentially split first

.incoming = [ bio(l=1,a,now), bio(l=1,a,remainder), bio(l=1,b), bio(l=1,c), ...,
		bio(l=0,remainder_1) ]
...
.incoming = [ bio(l=1,a,remainder), bio(l=1,b), bio(l=1,c), ...,
		bio(l=0,remainder_1) ]
.recursion = [ bio(l=2,aa), bio(l=2,ab), ... ]

merge_head

.incoming = [ bio(l=2,aa), bio(l=2,ab), ...,
		bio(l=1,a,remainder), bio(l=1,b), bio(l=1,c), ...,
		bio(l=0,remainder_1) ]
.recursion = []

...

process away ... until back at l=0

.incoming = [ bio(l=0,remainder_1) ]
.recursion = []

potentially split further
.incoming = [ bio(l=0,now_2), bio(l=0,remainder_2) ]
.recursion = []

rinse, repeat.
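
In code, the loop driving the above would be something like this sketch
(with .remainder renamed to .incoming; error handling elided):

	bio_list_add(&bio_lists_on_stack.incoming, bio);
	current->bio_lists = &bio_lists_on_stack;
	while ((bio = bio_list_pop(&bio_lists_on_stack.incoming))) {
		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

		bio_list_init(&bio_lists_on_stack.recursion);
		q->make_request_fn(q, bio);
		/* merge_head: children go in front of any remainder */
		bio_list_merge_head(&bio_lists_on_stack.incoming,
				    &bio_lists_on_stack.recursion);
	}
	current->bio_lists = NULL;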

Thanks,

    Lars Ellenberg


NeilBrown July 8, 2016, 9:39 a.m. UTC | #22
On Fri, Jul 08 2016, Lars Ellenberg wrote:

> On Fri, Jul 08, 2016 at 08:07:52AM +1000, NeilBrown wrote:
>> Before I introduced the recursion limiting, requests were handled as an
>> in-order tree walk.  The code I wrote tried to preserve that but didn't
>> for several reasons.  I think we need to restore the original in-order
>> walk because it makes the most sense.
>> So after processing a particular bio, we should then process all the
>> 'child' bios - bios sent to underlying devices.  Then the 'sibling'
>> bios, that were split off, and then any remaining parents and ancestors.
>> 
>> Your patch created the right structures for doing this, my proposal took
>> it a step closer, but now after more careful analysis I don't think it
>> is quite right.
>> With my previous proposal (and your latest patch - thanks!) requests for
>> "this" level are stacked, but they should be queued.
>> If a make_request_fn only ever submits one request for this level and
>> zero or more lower levels, then the difference between a queue and a
>> stack is irrelevant.  If it submitted more than one, a stack would cause
>> them to be handled in the reverse order.
>
> We have a device stack.
> q_this_level->make_request_fn() cannot possibly submit anything
> on "this_level", or it would create a device loop, I think.
>
> So we start with the initial, "top most" call to generic_make_request().
> That is one single bio. All queues are empty.
>
> This bio is then passed on to its destination queue make_request_fn().
>
> Which may choose to split it (via blk_queue_split(), or as dm does, or
> otherwise). If it, like blk_queue_split() does, splits it into
> "piece-I-can-handle-now" and "remainder", both still targeted at the
> top-most (current) queue, I think the "remainder" should just be pushed
> back, which will make it look as if upper layers did
> 	generic_make_request("piece-I-can-handle-now");
> 	generic_make_request("remainder");
> Which I do, by using bio_list_add_head(remainder, bio); (*head*).

This is exactly what I mean by "submitting a request at 'this' level".
Maybe that is a poor way to express it.  Within a make_request_fn, you
cannot "submit" a request, you can only "queue" a request.
generic_make_request() hides that by doing something different for a call
from a make_request_fn than for a call from anywhere else.
The important thing is to have 2 queues to queue to.

>
> I don't see any other way for a make_request_fn(bio(l=x)) to generate
> "sibling" bios to the same level (l=x) as its own argument.

Yes, it only comes from splitting.

>
> This same q(l=x)->make_request_fn(bio(l=x)) may now call
> generic_make_request() for zero or more "child" bios (l=x+1),
> which are queued in order: bio_list_add(recursion, bio); (*tail*).
> Then, once l=x returns, the queue generated by it is spliced
> in front of the "remainder" (*head*).
> All bios are processed in the order they have been queued,
> by peeling off of the head.
>
> After all "child" bios of level l>=x+1 have been processed,
> the next bio to be processed will be the "pushed back" remainder.
>
> All "Natural order".
>
>> To make the patch "perfect", and maybe even more elegant, we could treat
>> ->remainder and ->recursion more alike.
>> i.e.:
>>   - generic_make_request() has a private "stack" of requests.
>>   - before calling ->make_request_fn(), both ->remainder and ->recursion
>>     are initialised
>>   - after ->make_request_fn(), ->remainder is spliced onto the top of
>>     'stack', then ->recursion is spliced onto that.
>>   - If stack is not empty, the top request is popped and we loop to top.
>> 
>> This reliably follows in-order execution, and handles siblings correctly
>> (in submitted order) if/when a request splits off multiple siblings.
>
> The only splitting that creates siblings on the current level
> is blk_queue_split(), which splits the current bio into
> "front piece" and "remainder", already processed in this order.

Yes.
I imagine that a driver *could* split a bio into three parts with a
single allocation, but I cannot actually see any point in doing it.  So
I was over-complicating things.

>
> Anything else creating "siblings" is not creating siblings for the
> current layer, but for the next deeper layer, which are queued on
> "recursion" and also processed in the order they have been generated.
>
>> I think that as long as requests are submitted in the order they are
>> created at each level there is no reason to expect performance
>> regressions.
>> All we are doing is changing the ordering between requests generated at
>> different levels, and I think we are restoring a more natural order.
>
> I believe both patches combined are doing exactly this already.
> I could rename .remainder to .todo or .incoming, though.

:-)  neither "remainder" nor "recursion" seems like a brilliant name to me,
but I don't have anything better to suggest.  Naming is hard!
As long as a comment explains the name clearly I could cope with X and Y.

>
> [...]

I think we just might be in violent agreement.

Thanks,
NeilBrown
Ming Lei July 8, 2016, 11:08 a.m. UTC | #23
On Fri, Jul 8, 2016 at 6:07 AM, NeilBrown <neilb@suse.com> wrote:
> On Thu, Jul 07 2016, Lars Ellenberg wrote:
>
>> On Thu, Jul 07, 2016 at 03:35:48PM +1000, NeilBrown wrote:
>>> On Wed, Jun 22 2016, Lars Ellenberg wrote:
>>>
>>> [...]
>
> Yes, because I think the original order is wrong, as you have helped me
> to see.
> Before I introduced the recursion limiting, requests were handled as an
> in-order tree walk.  The code I wrote tried to preserve that but didn't
> for several reasons.  I think we need to restore the original in-order
> walk because it makes the most sense.

> So after processing a particular bio, we should then process all the
> 'child' bios - bios sent to underlying devices.  Then the 'sibling'
> bios, that were split off, and then any remaining parents and ancestors.

IMHO, that is just what the one-line patch is doing, isn't it?

| diff --git a/block/blk-core.c b/block/blk-core.c
 | index 2475b1c7..a5623f6 100644
 | --- a/block/blk-core.c
 | +++ b/block/blk-core.c
 | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 |       * should be added at the tail
 |       */
 |      if (current->bio_list) {
 | -            bio_list_add(current->bio_list, bio);
 | +            bio_list_add_head(current->bio_list, bio);
 |              goto out;
 |      }

Thanks,


Lars Ellenberg July 8, 2016, 12:52 p.m. UTC | #24
On Fri, Jul 08, 2016 at 07:08:32PM +0800, Ming Lei wrote:
> > So after processing a particular bio, we should then process all the
> > 'child' bios - bios sent to underlying devices.  Then the 'sibling'
> > bios, that were split off, and then any remaining parents and ancestors.
> 
> IMHO, that is just what the oneline patch is doing, isn't it?
> 
> | diff --git a/block/blk-core.c b/block/blk-core.c
>  | index 2475b1c7..a5623f6 100644
>  | --- a/block/blk-core.c
>  | +++ b/block/blk-core.c
>  | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>  |       * should be added at the tail
>  |       */
>  |      if (current->bio_list) {
>  | -            bio_list_add(current->bio_list, bio);
>  | +            bio_list_add_head(current->bio_list, bio);
>  |              goto out;
>  |      }

Almost, but not quite.

As explained earlier, this will re-order.
It will still process bios in "deepest level first" order,
but it will process "sibling" bios in reverse submission order.

Think "very large bio" submitted to a stripe set
with small stripe width/stripe unit size.

So I'd expect this to be a performance hit in some scenarios,
unless the stack at some deeper level does back-merging in its elevator.
(If some driver is not able to merge stuff because of "reverse submission
order" this can easily mean saturating IOPS of the physical device with
small requests, throttling bandwidth to a minimum.)

That's why I mentioned it as "potential easy fix for the deadlock",
but did not suggest it as the proper way to fix this.

If however the powers that be decide that this was a non-issue,
we could use it this way.

	Lars

Lars Ellenberg July 8, 2016, 1 p.m. UTC | #25
On Fri, Jul 08, 2016 at 07:39:36PM +1000, NeilBrown wrote:
> [...]
> >
> > I believe both patches combined are doing exactly this already.
> > I could rename .remainder to .todo or .incoming, though.
> 
> :-)  neither "remainder" or "recursion" seem like brilliant names to me,
> but I don't have anything better to suggest.  Naming is hard!
> As long as a comment explains the name clearly I could cope with X and Y.

...

> I think we just might be in violent agreement.

I thought so, too :-)

Should I merge both patches,
rename to ".queue" and ".tmp",
and submit for inclusion?

    Lars Ellenberg

Mike Snitzer July 8, 2016, 1:05 p.m. UTC | #26
On Fri, Jul 08 2016 at  8:52am -0400,
Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

> On Fri, Jul 08, 2016 at 07:08:32PM +0800, Ming Lei wrote:
> > > So after processing a particular bio, we should then process all the
> > > 'child' bios - bios sent to underlying devices.  Then the 'sibling'
> > > bios, that were split off, and then any remaining parents and ancestors.
> > 
> > IMHO, that is just what the one-line patch is doing, isn't it?
> > 
> > | diff --git a/block/blk-core.c b/block/blk-core.c
> >  | index 2475b1c7..a5623f6 100644
> >  | --- a/block/blk-core.c
> >  | +++ b/block/blk-core.c
> >  | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
> >  |       * should be added at the tail
> >  |       */
> >  |      if (current->bio_list) {
> >  | -            bio_list_add(current->bio_list, bio);
> >  | +            bio_list_add_head(current->bio_list, bio);
> >  |              goto out;
> >  |      }
> 
> Almost, but not quite.
> 
> As explained earlier, this will re-order.
> It will still process bios in "deepest level first" order,
> but it will process "sibling" bios in reverse submission order.
> 
> Think "very large bio" submitted to a stripe set
> with small stripe width/stripe unit size.
> 
> So I'd expect this to be a performance hit in some scenarios,
> unless the stack at some deeper level does back-merging in its elevator.
> (If some driver is not able to merge stuff because of "reverse submission
> order" this can easily mean saturating IOPS of the physical device with
> small requests, throttling bandwidth to a minimum.)
> 
> That's why I mentioned it as "potential easy fix for the deadlock",
> but did not suggest it as the proper way to fix this.
> 
> If however the powers that be decide that this was a non-issue,
> we could use it this way.

No, we cannot do this.  blk-mq doesn't have any of the more
elaborate IO scheduling that request_fn request_queues have.  We should
not be knowingly mangling the order of IO with the thought that some
other layer will fix it up.

I think it best for you to rebase your work (against jens' for-4.8/core)
into a single coherent patch and resubmit for 4.8 inclusion.  I really
don't see a huge benefit to keeping neilb's suggestion split out -- but
if you or others do that's fine.

The only concern I have relative to DM is: DM doesn't use
blk_queue_split, so will it need to open-code setting
recursion/remainder in order to ensure forward progress?  neilb seemed
to think the rework in generic_make_request would "just work" for the
dm-snapshot deadlock case though so maybe this isn't a valid
concern... unfortunately we don't have a quick reproducer for that
dm-snapshot issue so it'll take a bit to prove.
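
If it does, I'd imagine something along these lines wherever DM
currently resubmits the remainder of a split itself (pure speculation
on my part, not tested against the dm code):

	/* hypothetical: push the remainder back so it is only looked at
	 * again after all bios for the underlying devices have been
	 * processed, instead of resubmitting it recursively */
	if (current->bio_lists)
		bio_list_add_head(&current->bio_lists->remainder, remainder);
	else
		generic_make_request(remainder);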

Mike


Patch

diff --git a/block/bio.c b/block/bio.c
index 0e4aa42..2ffcea0 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -368,10 +368,10 @@  static void punt_bios_to_rescuer(struct bio_set *bs)
 	bio_list_init(&punt);
 	bio_list_init(&nopunt);
 
-	while ((bio = bio_list_pop(current->bio_list)))
+	while ((bio = bio_list_pop(&current->bio_lists->recursion)))
 		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
 
-	*current->bio_list = nopunt;
+	current->bio_lists->recursion = nopunt;
 
 	spin_lock(&bs->rescue_lock);
 	bio_list_merge(&bs->rescue_list, &punt);
@@ -453,13 +453,13 @@  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		 *
 		 * We solve this, and guarantee forward progress, with a rescuer
 		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
-		 * bios we would be blocking to the rescuer workqueue before
-		 * we retry with the original gfp_flags.
+		 * bios on current->bio_lists->recursion, we first try the
+		 * allocation without __GFP_DIRECT_RECLAIM; if that fails, we
+		 * punt those bios we would be blocking to the rescuer
+		 * workqueue before we retry with the original gfp_flags.
 		 */
 
-		if (current->bio_list && !bio_list_empty(current->bio_list))
+		if (current->bio_lists && !bio_list_empty(&current->bio_lists->recursion))
 			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
diff --git a/block/blk-core.c b/block/blk-core.c
index 2475b1c7..f03ff4c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2031,7 +2031,7 @@  end_io:
  */
 blk_qc_t generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	struct recursion_to_iteration_bio_lists bio_lists_on_stack;
 	blk_qc_t ret = BLK_QC_T_NONE;
 
 	if (!generic_make_request_checks(bio))
@@ -2040,15 +2040,18 @@  blk_qc_t generic_make_request(struct bio *bio)
 	/*
 	 * We only want one ->make_request_fn to be active at a time, else
 	 * stack usage with stacked devices could be a problem.  So use
-	 * current->bio_list to keep a list of requests submited by a
-	 * make_request_fn function.  current->bio_list is also used as a
+	 * current->bio_lists to keep a list of requests submitted by a
+	 * make_request_fn function.  current->bio_lists is also used as a
 	 * flag to say if generic_make_request is currently active in this
 	 * task or not.  If it is NULL, then no make_request is active.  If
 	 * it is non-NULL, then a make_request is active, and new requests
-	 * should be added at the tail
+	 * should be added at the tail of current->bio_lists->recursion;
+	 * remainder bios resulting from a call to blk_queue_split() from
+	 * within ->make_request_fn() should be added to the head of
+	 * current->bio_lists->remainder.
 	 */
-	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+	if (current->bio_lists) {
+		bio_list_add(&current->bio_lists->recursion, bio);
 		goto out;
 	}
 
@@ -2057,7 +2060,7 @@  blk_qc_t generic_make_request(struct bio *bio)
 	 * Before entering the loop, bio->bi_next is NULL (as all callers
 	 * ensure that) so we have a list with a single bio.
 	 * We pretend that we have just taken it off a longer list, so
-	 * we assign bio_list to a pointer to the bio_list_on_stack,
+	 * we assign bio_list to a pointer to the bio_lists_on_stack,
 	 * thus initialising the bio_list of new bios to be
 	 * added.  ->make_request() may indeed add some more bios
 	 * through a recursive call to generic_make_request.  If it
@@ -2067,8 +2070,9 @@  blk_qc_t generic_make_request(struct bio *bio)
 	 * bio_list, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
+	bio_list_init(&bio_lists_on_stack.recursion);
+	bio_list_init(&bio_lists_on_stack.remainder);
+	current->bio_lists = &bio_lists_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
@@ -2076,16 +2080,14 @@  blk_qc_t generic_make_request(struct bio *bio)
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
-
-			bio = bio_list_pop(current->bio_list);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(&current->bio_lists->recursion);
+		if (!bio)
+			bio = bio_list_pop(&current->bio_lists->remainder);
 	} while (bio);
-	current->bio_list = NULL; /* deactivate */
+	current->bio_lists = NULL; /* deactivate */
 
 out:
 	return ret;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2613531..8da0c22 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -172,6 +172,7 @@  void blk_queue_split(struct request_queue *q, struct bio **bio,
 	struct bio *split, *res;
 	unsigned nsegs;
 
+	BUG_ON(!current->bio_lists);
 	if ((*bio)->bi_rw & REQ_DISCARD)
 		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
 	else if ((*bio)->bi_rw & REQ_WRITE_SAME)
@@ -190,7 +191,7 @@  void blk_queue_split(struct request_queue *q, struct bio **bio,
 
 		bio_chain(split, *bio);
 		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
-		generic_make_request(*bio);
+		bio_list_add_head(&current->bio_lists->remainder, *bio);
 		*bio = split;
 	}
 }
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index eab505e..eb0b41a 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -450,7 +450,7 @@  void __bch_btree_node_write(struct btree *b, struct closure *parent)
 
 	trace_bcache_btree_write(b);
 
-	BUG_ON(current->bio_list);
+	BUG_ON(current->bio_lists);
 	BUG_ON(b->written >= btree_blocks(b));
 	BUG_ON(b->written && !i->keys);
 	BUG_ON(btree_bset_first(b)->seq != i->seq);
@@ -544,7 +544,7 @@  static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
 
 	/* Force write if set is too big */
 	if (set_bytes(i) > PAGE_SIZE - 48 &&
-	    !current->bio_list)
+	    !current->bio_lists)
 		bch_btree_node_write(b, NULL);
 }
 
@@ -889,7 +889,7 @@  static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
 {
 	struct btree *b;
 
-	BUG_ON(current->bio_list);
+	BUG_ON(current->bio_lists);
 
 	lockdep_assert_held(&c->bucket_lock);
 
@@ -976,7 +976,7 @@  retry:
 	b = mca_find(c, k);
 
 	if (!b) {
-		if (current->bio_list)
+		if (current->bio_lists)
 			return ERR_PTR(-EAGAIN);
 
 		mutex_lock(&c->bucket_lock);
@@ -2127,7 +2127,7 @@  static int bch_btree_insert_node(struct btree *b, struct btree_op *op,
 
 	return 0;
 split:
-	if (current->bio_list) {
+	if (current->bio_lists) {
 		op->lock = b->c->root->level + 1;
 		return -EAGAIN;
 	} else if (op->lock <= b->c->root->level) {
@@ -2209,7 +2209,7 @@  int bch_btree_insert(struct cache_set *c, struct keylist *keys,
 	struct btree_insert_op op;
 	int ret = 0;
 
-	BUG_ON(current->bio_list);
+	BUG_ON(current->bio_lists);
 	BUG_ON(bch_keylist_empty(keys));
 
 	bch_btree_op_init(&op.op, 0);
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index cd77216..b64d4f0 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -174,7 +174,7 @@  static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
 #define DM_BUFIO_CACHE(c)	(dm_bufio_caches[dm_bufio_cache_index(c)])
 #define DM_BUFIO_CACHE_NAME(c)	(dm_bufio_cache_names[dm_bufio_cache_index(c)])
 
-#define dm_bufio_in_request()	(!!current->bio_list)
+#define dm_bufio_in_request()	(!!current->bio_lists)
 
 static void dm_bufio_lock(struct dm_bufio_client *c)
 {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b5c4be7..7afc71c 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -663,6 +663,13 @@  static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
 	}
 }
 
+static inline bool current_has_pending_bios(void)
+{
+	return current->bio_lists && (
+	      !bio_list_empty(&current->bio_lists->recursion) ||
+	      !bio_list_empty(&current->bio_lists->remainder));
+}
+
 extern struct md_cluster_operations *md_cluster_ops;
 static inline int mddev_is_clustered(struct mddev *mddev)
 {
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index c7c8cde..811f2c0 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -876,8 +876,7 @@  static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
 				    (!conf->barrier ||
 				     ((conf->start_next_window <
 				       conf->next_resync + RESYNC_SECTORS) &&
-				      current->bio_list &&
-				      !bio_list_empty(current->bio_list))),
+				      current_has_pending_bios())),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
@@ -1014,7 +1013,7 @@  static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	struct r1conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule || current->bio_lists) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		conf->pending_count += plug->pending_cnt;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c7de2a5..2c70717 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -945,8 +945,7 @@  static void wait_barrier(struct r10conf *conf)
 		wait_event_lock_irq(conf->wait_barrier,
 				    !conf->barrier ||
 				    (conf->nr_pending &&
-				     current->bio_list &&
-				     !bio_list_empty(current->bio_list)),
+				     current_has_pending_bios()),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
@@ -1022,7 +1021,7 @@  static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	struct r10conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule || current->bio_lists) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		conf->pending_count += plug->pending_cnt;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 9faebf7..57e45a0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -598,6 +598,17 @@  struct bio_list {
 	struct bio *tail;
 };
 
+/* for generic_make_request() */
+struct recursion_to_iteration_bio_lists {
+	/* For stacking drivers submitting to their respective backend.
+	 * Added to tail, processed in FIFO order. */
+	struct bio_list recursion;
+	/* For pushing back the "remainder" part resulting from calling
+	 * blk_queue_split(). Added to head, processed in LIFO order,
+	 * once the "recursion" list has been drained. */
+	struct bio_list remainder;
+};
+
 static inline int bio_list_empty(const struct bio_list *bl)
 {
 	return bl->head == NULL;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada..146eedc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -128,7 +128,7 @@  struct sched_attr {
 
 struct futex_pi_state;
 struct robust_list_head;
-struct bio_list;
+struct recursion_to_iteration_bio_lists;
 struct fs_struct;
 struct perf_event_context;
 struct blk_plug;
@@ -1727,7 +1727,7 @@  struct task_struct {
 	void *journal_info;
 
 /* stacked block device info */
-	struct bio_list *bio_list;
+	struct recursion_to_iteration_bio_lists *bio_lists;
 
 #ifdef CONFIG_BLOCK
 /* stack plugging */