[0/4] Fix order when split bio and send remaining back to itself

Message ID 1609233522-25837-1-git-send-email-dannyshih@synology.com (mailing list archive)

Message

Danny Shih Dec. 29, 2020, 9:18 a.m. UTC
From: Danny Shih <dannyshih@synology.com>

We found that split bios may be handled out of order when a large bio
is split both by blk_queue_split() and again inside a stacking block
device, such as an md device, because of its chunk size boundary limit.

After splitting the original bio, a stacking block device normally uses
submit_bio_noacct(), which adds the remaining bio to the tail of
current->bio_list. Therefore, when the bio is split the first time, the
last part of the bio is added to the bio_list; when it is split a second
time, the middle part is added. As a result, the middle part ends up
behind the last part of the bio.
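
For reference, the split-and-resubmit pattern in a stacking driver looks
roughly like the sketch below (modeled on the raid0/raid10 code around
v5.10; 'split_sectors' and 'bs' stand in for the driver's own boundary
calculation and private bio_set, and error handling is omitted):

#include <linux/bio.h>
#include <linux/blkdev.h>

static void split_and_resubmit(struct bio *bio, unsigned int split_sectors,
                               struct bio_set *bs)
{
        if (split_sectors < bio_sectors(bio)) {
                /* Split off the part we can handle now. */
                struct bio *split = bio_split(bio, split_sectors,
                                              GFP_NOIO, bs);

                bio_chain(split, bio);
                /*
                 * Send the remainder back to ourselves:
                 * submit_bio_noacct() appends it to the *tail* of
                 * current->bio_list.
                 */
                submit_bio_noacct(bio);
                bio = split;
        }

        /* ... remap 'bio' and submit it to the lower-level device ... */
}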

For example:
	There is a RAID0 md device with max_sectors_kb = 2 (i.e. 2 KB)
	and chunk_size = 1 KB

	1. A read bio comes to the md device, wanting to read 0-7 KB
	2. In blk_queue_split(), the bio is split into (0-1) and (2-7),
	   and (2-7) is sent back to the md device

	   current->bio_list = bio_list_on_stack[0]: (md 2-7)
	3. RAID0 splits bio (0-1) into (0) and (1), since the chunk size
	   is 1 KB, and sends (1) back to the md device

	   bio_list_on_stack[0]: (md 2-7) -> (md 1)
	4. (0) is remapped and sent to the lower-layer device

	   bio_list_on_stack[0]: (md 2-7) -> (md 1) -> (lower 0)
	5. __submit_bio_noacct() sorts the bios so that bios for lower
	   devices are handled first
	   bio_list_on_stack[0]: (lower 0) -> (md 2-7) -> (md 1)
	   pop (lower 0)
	   move bio_list_on_stack[0] to bio_list_on_stack[1]

	   bio_list_on_stack[1]: (md 2-7) -> (md 1)
	6. After the lower bio is handled, (md 2-7) is handled first and
	   split in blk_queue_split() into (2-3) and (4-7); (4-7) is sent back

	   bio_list_on_stack[0]: (md 4-7)
	   bio_list_on_stack[1]: (md 1)
	7. RAID0 splits bio (2-3) into (2) and (3) and sends (3) back

	   bio_list_on_stack[0]: (md 4-7) -> (md 3)
	   bio_list_on_stack[1]: (md 1)
	...
	In the end, the split bios are handled in the order
	0 -> 2 -> 4 -> 6 -> 7 -> 5 -> 3 -> 1
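
For reference, the sorting in step 5 corresponds to the loop in
__submit_bio_noacct(); paraphrased from block/blk-core.c around v5.10,
it looks roughly like this:

static blk_qc_t __submit_bio_noacct(struct bio *bio)
{
        struct bio_list bio_list_on_stack[2];
        blk_qc_t ret = BLK_QC_T_NONE;

        bio_list_init(&bio_list_on_stack[0]);
        current->bio_list = bio_list_on_stack;

        do {
                struct request_queue *q = bio->bi_disk->queue;
                struct bio_list lower, same;

                /* Create a fresh bio_list for all subordinate requests. */
                bio_list_on_stack[1] = bio_list_on_stack[0];
                bio_list_init(&bio_list_on_stack[0]);

                ret = __submit_bio(bio);

                /*
                 * Sort new bios into those for a lower level and those
                 * for the same level.
                 */
                bio_list_init(&lower);
                bio_list_init(&same);
                while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
                        if (q == bio->bi_disk->queue)
                                bio_list_add(&same, bio);
                        else
                                bio_list_add(&lower, bio);

                /* Now assemble so we handle the lowest level first. */
                bio_list_merge(&bio_list_on_stack[0], &lower);
                bio_list_merge(&bio_list_on_stack[0], &same);
                bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
        } while ((bio = bio_list_pop(&bio_list_on_stack[0])));

        current->bio_list = NULL;
        return ret;
}

Everything popped from bio_list_on_stack[0] in one iteration is merged in
ahead of what is already pending in bio_list_on_stack[1], which is why
(md 1) ends up behind (md 2-7) in step 5 above.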

Reversing the order of same-queue bios when sorting in
__submit_bio_noacct() could solve this issue, but its impact might be
too broad. Instead, we provide an alternative version of
submit_bio_noacct(), named submit_bio_noacct_add_head(), for cases that
need to add a bio to the head of current->bio_list, and replace
submit_bio_noacct() with submit_bio_noacct_add_head() in block drivers
that split a bio and send the remainder back to themselves.
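
The first patch introduces the helper. As a rough sketch of the idea
(not the actual diff; the helper names below follow the v5.10
block/blk-core.c internals), it differs from submit_bio_noacct() only
in pushing a recursive submission onto the head of current->bio_list[0]:

blk_qc_t submit_bio_noacct_add_head(struct bio *bio)
{
        if (!submit_bio_checks(bio))
                return BLK_QC_T_NONE;

        /*
         * Same recursion-avoidance scheme as submit_bio_noacct(), but
         * the bio is pushed onto the *head* of the list so that it is
         * handled before bios queued earlier at the same level.
         */
        if (current->bio_list) {
                bio_list_add_head(&current->bio_list[0], bio);
                return BLK_QC_T_NONE;
        }

        if (!bio->bi_disk->fops->submit_bio)
                return __submit_bio_noacct_mq(bio);
        return __submit_bio_noacct(bio);
}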

Danny Shih (4):
  block: introduce submit_bio_noacct_add_head
  block: use submit_bio_noacct_add_head for split bio sending back
  dm: use submit_bio_noacct_add_head for split bio sending back
  md: use submit_bio_noacct_add_head for split bio sending back

 block/blk-core.c       | 44 +++++++++++++++++++++++++++++++++-----------
 block/blk-merge.c      |  2 +-
 block/bounce.c         |  2 +-
 drivers/md/dm.c        |  2 +-
 drivers/md/md-linear.c |  2 +-
 drivers/md/raid0.c     |  4 ++--
 drivers/md/raid1.c     |  4 ++--
 drivers/md/raid10.c    |  4 ++--
 drivers/md/raid5.c     |  2 +-
 include/linux/blkdev.h |  1 +
 10 files changed, 45 insertions(+), 22 deletions(-)

Comments

Mike Snitzer Dec. 30, 2020, 11:34 p.m. UTC | #1
On Tue, Dec 29 2020 at  4:18am -0500,
dannyshih <dannyshih@synology.com> wrote:

> From: Danny Shih <dannyshih@synology.com>
> 
> [...]
> 
> Reversing the order of same-queue bios when sorting in
> __submit_bio_noacct() could solve this issue, but its impact might be
> too broad. Instead, we provide an alternative version of
> submit_bio_noacct(), named submit_bio_noacct_add_head(), for cases that
> need to add a bio to the head of current->bio_list, and replace
> submit_bio_noacct() with submit_bio_noacct_add_head() in block drivers
> that split a bio and send the remainder back to themselves.

Ordering aside, you cannot split more than once.  So your proposed fix
to insert at head isn't valid because you're still implicitly allocating
more than one bio from the bioset which could cause deadlock in a low
memory situation.

I had to deal with a comparable issue with DM core not too long ago, see
this commit:

commit ee1dfad5325ff1cfb2239e564cd411b3bfe8667a
Author: Mike Snitzer <snitzer@redhat.com>
Date:   Mon Sep 14 13:04:19 2020 -0400

    dm: fix bio splitting and its bio completion order for regular IO

    dm_queue_split() is removed because __split_and_process_bio() _must_
    handle splitting bios to ensure proper bio submission and completion
    ordering as a bio is split.

    Otherwise, multiple recursive calls to ->submit_bio will cause multiple
    split bios to be allocated from the same ->bio_split mempool at the same
    time. This would result in deadlock in low memory conditions because no
    progress could be made (only one bio is available in ->bio_split
    mempool).

    This fix has been verified to still fix the loss of performance, due
    to excess splitting, that commit 120c9257f5f1 provided.

    Fixes: 120c9257f5f1 ("Revert "dm: always call blk_queue_split() in dm_process_bio()"")
    Cc: stable@vger.kernel.org # 5.0+, requires custom backport due to 5.9 changes
    Reported-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

Basically you cannot split the same bio more than once without
recursing.  Your elaborate documentation shows things going wrong quite
early in step 3.  That additional split and recursing back to MD
shouldn't happen before the first bio split completes.

Seems the proper fix is to disallow max_sectors_kb to be imposed, via
blk_queue_split(), if MD has further splitting constraints, via
chunk_sectors, that negate max_sectors_kb anyway.
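
For reference, the per-bio limit that blk_queue_split() applies comes,
via get_max_io_size(), from blk_max_size_offset(), which around v5.10
already takes chunk_sectors into account:

static inline unsigned int blk_max_size_offset(struct request_queue *q,
                                               sector_t offset)
{
        if (!q->limits.chunk_sectors)
                return q->limits.max_sectors;

        return min(q->limits.max_sectors,
                   (unsigned int)(q->limits.chunk_sectors -
                        (offset & (q->limits.chunk_sectors - 1))));
}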

Mike
Danny Shih Dec. 31, 2020, 8:28 a.m. UTC | #2
Mike Snitzer writes:
>> submit_bio_noacct() with submit_bio_noacct_add_head() in block drivers
>> that split a bio and send the remainder back to themselves.
> Ordering aside, you cannot split more than once.  So your proposed fix
> to insert at head isn't valid because you're still implicitly allocating
> more than one bio from the bioset which could cause deadlock in a low
> memory situation.
>
> I had to deal with a comparable issue with DM core not too long ago, see
> this commit:
>
> commit ee1dfad5325ff1cfb2239e564cd411b3bfe8667a
> Author: Mike Snitzer <snitzer@redhat.com>
> Date:   Mon Sep 14 13:04:19 2020 -0400
>
>      dm: fix bio splitting and its bio completion order for regular IO
>
>      [...]
>
> Basically you cannot split the same bio more than once without
> recursing.  Your elaborate documentation shows things going wrong quite
> early in step 3.  That additional split and recursing back to MD
> shouldn't happen before the first bio split completes.
>
> Seems the proper fix is to disallow max_sectors_kb to be imposed, via
> blk_queue_split(), if MD has further splitting constraints, via
> chunk_sectors, that negate max_sectors_kb anyway.
>
> Mike


Hi Mike,

I think you're right that a driver should not split the same bio more
than once without recursing when using the same mempool.

If a driver only splits the bio once, the out-of-order issue no longer
exists. (Therefore, this problem won't occur on DM devices.)

However, the MD devices use their own private biosets (mddev->bio_set
or conf->bio_split) for the splits they perform themselves, which are
not the bioset used in blk_queue_split() (i.e. q->bio_split). The
deadlock you mentioned might therefore not apply to them.

I think there are two possible directions:

1. In case the MD devices want to switch to using q->bio_split someday
    without this out-of-order issue, making them split only once would
    be a solution.

2. If the MD devices should keep splitting the bio twice, so that the
    limits in blk_queue_split() and each RAID level's own limits (raid0,
    raid1, raid5, ...) are handled separately, I will try to find
    another solution for this case.

    After reconsidering the problem, I realize my proposal is not suitable:

    Suppose a bio is split into an A part and a B part:

    +------+------+
    |   A  |   B  |
    +------+------+

    I think a driver should make sure the A part is always handled
    before the B part. Inserting a bio at the head of current->bio_list
    while submitting bios during the handling of the A part does make
    the bios generated from the A part be handled before the B part,
    but it breaks the ordering among the bios generated from the A part
    themselves.

    (Maybe I should find a way to put the B part at the head of
    bio_list_on_stack[1] while submitting it...)

Thanks for your comments.
I will try to figure out a better way to fix it in the next version.

Best regards,
Danny Shih