
[v3] block: make sure big bio is split into at most 256 bvecs

Message ID 1471273882-3938-1-git-send-email-ming.lei@canonical.com (mailing list archive)
State New, archived

Commit Message

Ming Lei Aug. 15, 2016, 3:11 p.m. UTC
Now that arbitrary bio sizes are supported, the incoming bio may
be very big. We have to split the bio into smaller bios so that
each holds at most BIO_MAX_PAGES bvecs, for safety reasons such
as bio_clone().
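
The core of the fix is a per-segment counter in blk_bio_segment_split();
the helper below is only an illustrative sketch of that check (the real
hunk is in the diff at the end of this page):

#include <linux/bio.h>

/*
 * Illustrative sketch, not the literal patch: walk the live segments of
 * the incoming bio and report whether it carries more bvecs than
 * BIO_MAX_PAGES (256), i.e. more than bio_clone() can reliably
 * re-allocate in one biovec.
 */
static bool bio_exceeds_bvec_limit(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	unsigned bvecs = 0;

	bio_for_each_segment(bv, bio, iter) {
		if (++bvecs > BIO_MAX_PAGES)
			return true;
	}
	return false;
}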

This patch fixes the following kernel crash:

> [  172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
> [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
> [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> [  172.660399] Oops: 0000 [#1] SMP
> [...]
> [  172.664780] Call Trace:
> [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
> [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
> [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]

The issue can be reproduced by the following steps:
	- create one raid1 over two virtio-blk devices
	- build a bcache device over the above raid1 and another cache
	device, with the bucket size set to 2 Mbytes
	- set the cache mode to writeback
	- run random writes over ext4 on the bcache device

Fixes: 54efd50bf ("block: make generic_make_request handle arbitrarily sized bios")
Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
Cc: stable@vger.kernel.org (4.3+)
Cc: Shaohua Li <shli@fb.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
V3:
	- rebase against v4.8-rc1 since .bi_rw of bio has been renamed
	to .bi_opf

V2:
	- don't mark the bio as REQ_NOMERGE when it is split only for
	reaching the bvec count limit
V1:
	- Kent pointed out that using the max io size can't cover
	the case of non-full bvecs/pages

 block/blk-merge.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

Comments

Christoph Hellwig Aug. 15, 2016, 6:23 p.m. UTC | #1
On Mon, Aug 15, 2016 at 11:11:22PM +0800, Ming Lei wrote:
> After arbitrary bio size is supported, the incoming bio may
> be very big. We have to split the bio into small bios so that
> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> as bio_clone().

I still think working around a rough driver submitting too large
I/O is a bad thing until we've done a full audit of all consuming
bios through ->make_request, and we've enabled it for the common
path as well.

>  	bool do_split = true;
>  	struct bio *new = NULL;
>  	const unsigned max_sectors = get_max_io_size(q, bio);
> +	unsigned bvecs = 0;
> +
> +	*no_merge = true;
>  
>  	bio_for_each_segment(bv, bio, iter) {
>  		/*
> +		 * With arbitrary bio size, the incoming bio may be very
> +		 * big. We have to split the bio into small bios so that
> +		 * each holds at most BIO_MAX_PAGES bvecs because
> +		 * bio_clone() can fail to allocate big bvecs.
> +		 *
> +		 * It should have been better to apply the limit per
> +		 * request queue in which bio_clone() is involved,
> +		 * instead of globally. The biggest blocker is
> +		 * bio_clone() in bio bounce.
> +		 *
> +		 * If bio is splitted by this reason, we should allow
> +		 * to continue bios merging.
> +		 *
> +		 * TODO: deal with bio bounce's bio_clone() gracefully
> +		 * and convert the global limit into per-queue limit.
> +		 */
> +		if (bvecs++ >= BIO_MAX_PAGES) {
> +			*no_merge = false;
> +			goto split;
> +		}

That being said, the if check here is simple enough that it's
probably fine.  But I see no need to uglify the whole code path
with that no_merge flag.  Please drop it for now, and if we start
caring about this path in common code we should just move the
REQ_NOMERGE setting into the actual blk_bio_*_split helpers.
Kent Overstreet Aug. 15, 2016, 7:12 p.m. UTC | #2
On Mon, Aug 15, 2016 at 11:23:28AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 15, 2016 at 11:11:22PM +0800, Ming Lei wrote:
> > After arbitrary bio size is supported, the incoming bio may
> > be very big. We have to split the bio into small bios so that
> > each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> > as bio_clone().
> 
> I still think working around a rough driver submitting too large
> I/O is a bad thing until we've done a full audit of all consuming
> bios through ->make_request, and we've enabled it for the common
> path as well.

bcache originally had workaround code to split too-large bios when it first went
upstream - that was dropped only after the patches to make
generic_make_request() handle arbitrary size bios went in. So to do what you're
suggesting would mean reverting that bcache patch and bringing that code back,
which from my perspective would be a step in the wrong direction. I just want to
get this over and done with.

re: interactions with other drivers - bio_clone() has already been changed to
clone only the biovecs that are live for the current bi_iter, so there
shouldn't be any safety issues. A driver would have to be intentionally doing
its own open-coded bio cloning that copies all of bi_io_vec, not just the
active entries - but a driver doing that is already broken, because a driver
isn't allowed to look at bi_vcnt on a bio it doesn't own; bi_vcnt is 0 on bios
that don't own their biovec (i.e. those created by bio_clone_fast()).

I audited the cloning and bi_vcnt usage very thoroughly back when I was
working on immutable biovecs, and I had to do a fair amount of
cleanup/refactoring before that stuff could go in.
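
A sketch of the ownership rule described above (illustrative helpers, not
from the patch): iterating via bi_iter sees only the live segments, while
poking at bi_io_vec/bi_vcnt directly is valid only on bios the driver
itself built:

#include <linux/bio.h>

/* Correct: honours bi_iter, so it also works on cloned or split bios. */
static unsigned count_live_segments(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	unsigned n = 0;

	bio_for_each_segment(bv, bio, iter)
		n++;
	return n;
}

/*
 * Broken on bios the driver doesn't own: bi_vcnt is 0 on bios created
 * by bio_clone_fast(), so this sees no segments at all.
 */
static unsigned count_segments_broken(struct bio *bio)
{
	unsigned i, n = 0;

	for (i = 0; i < bio->bi_vcnt; i++)
		n++;
	return n;
}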

> 
> >  	bool do_split = true;
> >  	struct bio *new = NULL;
> >  	const unsigned max_sectors = get_max_io_size(q, bio);
> > +	unsigned bvecs = 0;
> > +
> > +	*no_merge = true;
> >  
> >  	bio_for_each_segment(bv, bio, iter) {
> >  		/*
> > +		 * With arbitrary bio size, the incoming bio may be very
> > +		 * big. We have to split the bio into small bios so that
> > +		 * each holds at most BIO_MAX_PAGES bvecs because
> > +		 * bio_clone() can fail to allocate big bvecs.
> > +		 *
> > +		 * It should have been better to apply the limit per
> > +		 * request queue in which bio_clone() is involved,
> > +		 * instead of globally. The biggest blocker is
> > +		 * bio_clone() in bio bounce.
> > +		 *
> > +		 * If bio is splitted by this reason, we should allow
> > +		 * to continue bios merging.
> > +		 *
> > +		 * TODO: deal with bio bounce's bio_clone() gracefully
> > +		 * and convert the global limit into per-queue limit.
> > +		 */
> > +		if (bvecs++ >= BIO_MAX_PAGES) {
> > +			*no_merge = false;
> > +			goto split;
> > +		}
> 
> That being said this simple if check here is simple enough that it's
> probably fine.  But I see no need to uglify the whole code path
> with that no_merge flag.  Please drop if for now, and if we start
> caring for this path in common code we should just move the
> REQ_NOMERGE setting into the actual blk_bio_*_split helpers.

Agreed about the no_merge thing.
Eric Wheeler Aug. 19, 2016, 12:41 a.m. UTC | #3
> On Mon, Aug 15, 2016 at 11:23:28AM -0700, Christoph Hellwig wrote:
> > On Mon, Aug 15, 2016 at 11:11:22PM +0800, Ming Lei wrote:
> > > After arbitrary bio size is supported, the incoming bio may
> > > be very big. We have to split the bio into small bios so that
> > > each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> > > as bio_clone().
> > 
> > I still think working around a rough driver submitting too large
> > I/O is a bad thing until we've done a full audit of all consuming
> > bios through ->make_request, and we've enabled it for the common
> > path as well.
> 
> bcache originally had workaround code to split too-large bios when it 
> first went upstream - that was dropped only after the patches to make 
> generic_make_request() handle arbitrary size bios went in. So to do what 
> you're suggesting would mean reverting that bcache patch and bringing 
> that code back, which from my perspective would be a step in the wrong 
> direction. I just want to get this over and done with.
> 
> > 
> > >  	bool do_split = true;
> > >  	struct bio *new = NULL;
> > >  	const unsigned max_sectors = get_max_io_size(q, bio);
> > > +	unsigned bvecs = 0;
> > > +
> > > +	*no_merge = true;
> > >  
> > >  	bio_for_each_segment(bv, bio, iter) {
> > >  		/*
> > > +		 * With arbitrary bio size, the incoming bio may be very
> > > +		 * big. We have to split the bio into small bios so that
> > > +		 * each holds at most BIO_MAX_PAGES bvecs because
> > > +		 * bio_clone() can fail to allocate big bvecs.
> > > +		 *
> > > +		 * It should have been better to apply the limit per
> > > +		 * request queue in which bio_clone() is involved,
> > > +		 * instead of globally. The biggest blocker is
> > > +		 * bio_clone() in bio bounce.
> > > +		 *
> > > +		 * If bio is splitted by this reason, we should allow
> > > +		 * to continue bios merging.
> > > +		 *
> > > +		 * TODO: deal with bio bounce's bio_clone() gracefully
> > > +		 * and convert the global limit into per-queue limit.
> > > +		 */
> > > +		if (bvecs++ >= BIO_MAX_PAGES) {
> > > +			*no_merge = false;
> > > +			goto split;
> > > +		}
> > 
> > That being said this simple if check here is simple enough that it's
> > probably fine.  But I see no need to uglify the whole code path
> > with that no_merge flag.  Please drop if for now, and if we start
> > caring for this path in common code we should just move the
> > REQ_NOMERGE setting into the actual blk_bio_*_split helpers.
> 
> Agreed about the no_merge thing.

By removing `no_merge`, this patch should cherry-pick into stable v4.3+
without merge issues, since it also avoids interference from the bi_rw
refactor.

Ming, can you send out a V4 without `no_merge` ?

--
Eric Wheeler



Ming Lei Aug. 21, 2016, 9:31 a.m. UTC | #4
On Fri, Aug 19, 2016 at 8:41 AM, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
>> On Mon, Aug 15, 2016 at 11:23:28AM -0700, Christoph Hellwig wrote:
>> > On Mon, Aug 15, 2016 at 11:11:22PM +0800, Ming Lei wrote:
>> > > After arbitrary bio size is supported, the incoming bio may
>> > > be very big. We have to split the bio into small bios so that
>> > > each holds at most BIO_MAX_PAGES bvecs for safety reason, such
>> > > as bio_clone().
>> >
>> > I still think working around a rough driver submitting too large
>> > I/O is a bad thing until we've done a full audit of all consuming
>> > bios through ->make_request, and we've enabled it for the common
>> > path as well.
>>
>> bcache originally had workaround code to split too-large bios when it
>> first went upstream - that was dropped only after the patches to make
>> generic_make_request() handle arbitrary size bios went in. So to do what
>> you're suggesting would mean reverting that bcache patch and bringing
>> that code back, which from my perspective would be a step in the wrong
>> direction. I just want to get this over and done with.
>>
>> >
>> > >   bool do_split = true;
>> > >   struct bio *new = NULL;
>> > >   const unsigned max_sectors = get_max_io_size(q, bio);
>> > > + unsigned bvecs = 0;
>> > > +
>> > > + *no_merge = true;
>> > >
>> > >   bio_for_each_segment(bv, bio, iter) {
>> > >           /*
>> > > +          * With arbitrary bio size, the incoming bio may be very
>> > > +          * big. We have to split the bio into small bios so that
>> > > +          * each holds at most BIO_MAX_PAGES bvecs because
>> > > +          * bio_clone() can fail to allocate big bvecs.
>> > > +          *
>> > > +          * It should have been better to apply the limit per
>> > > +          * request queue in which bio_clone() is involved,
>> > > +          * instead of globally. The biggest blocker is
>> > > +          * bio_clone() in bio bounce.
>> > > +          *
>> > > +          * If bio is splitted by this reason, we should allow
>> > > +          * to continue bios merging.
>> > > +          *
>> > > +          * TODO: deal with bio bounce's bio_clone() gracefully
>> > > +          * and convert the global limit into per-queue limit.
>> > > +          */
>> > > +         if (bvecs++ >= BIO_MAX_PAGES) {
>> > > +                 *no_merge = false;
>> > > +                 goto split;
>> > > +         }
>> >
>> > That being said this simple if check here is simple enough that it's
>> > probably fine.  But I see no need to uglify the whole code path
>> > with that no_merge flag.  Please drop if for now, and if we start
>> > caring for this path in common code we should just move the
>> > REQ_NOMERGE setting into the actual blk_bio_*_split helpers.
>>
>> Agreed about the no_merge thing.
>
> By removing `no_merge` this patch should cherry-peck into stable v4.3+
> without merge issues by avoiding bi_rw refactor interference, too.
>
> Ming, can you send out a V4 without `no_merge` ?

This patch is definitely correct, and I don't see why the 'no_merge'
handling should be removed.

In this case the bio can still be merged with others: it doesn't violate
any limit of the queue, it simply can't be held in 256 bvecs. That means
it is correct to clear no_merge in this situation.

Also I don't think it is complicated or ugly to deal with the flag.

Jens, could you merge this fix?

Thanks,
Ming

>
> --
> Eric Wheeler
Kent Overstreet Aug. 21, 2016, 5:58 p.m. UTC | #5
On Sun, Aug 21, 2016 at 05:31:33PM +0800, Ming Lei wrote:
> This patch is definitely correct, and I don't see dealing with 'no_merge'
> should be removed.
> 
> In this case, the bio is still possible to merge with others because
> it doesn't violate any limit of the queue because it just can't be held in
> 256 bvecs, that means it is correct to clear no_merge for this situation.
> 
> Also I don't think it is complicated or ugly to deal with the flag.

It's extra unnecessary code. All other things being equal, less code is always
better than more code. It works just fine without it, what's the justification
for adding the flag?

Don't be so dismissive of other maintainer's concerns, we've got to deal with
this code too. Especially when I and Christoph are agreeing on something - how
often does that happen?
Ming Lei Aug. 22, 2016, 8:57 a.m. UTC | #6
On Mon, Aug 22, 2016 at 1:58 AM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Sun, Aug 21, 2016 at 05:31:33PM +0800, Ming Lei wrote:
>> This patch is definitely correct, and I don't see dealing with 'no_merge'
>> should be removed.
>>
>> In this case, the bio is still possible to merge with others because
>> it doesn't violate any limit of the queue because it just can't be held in
>> 256 bvecs, that means it is correct to clear no_merge for this situation.
>>
>> Also I don't think it is complicated or ugly to deal with the flag.
>
> It's extra unnecessary code. All other things being equal, less code is always
> better than more code. It works just fine without it, what's the justification
> for adding the flag?

From a correctness point of view, the 'no_merge' flag needs to be cleared
because the split bio doesn't violate any of the queue's limits and should
be allowed to merge; it is actually possible to submit all these split
bios to the hardware via one request, which is different from the other
splitting cases.

Given that you introduced support for arbitrarily sized bios in commit
54efd50bf ("block: make generic_make_request handle arbitrarily sized
bios"), other drivers/filesystems might use this to simplify bio
submission too in the future.
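
To make that concrete, a hypothetical helper (not part of the patch)
expressing the condition above: each piece produced by a purely
bvec-count-driven split still fits within the queue's limits, so merging
it back is legitimate:

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Hypothetical helper, for illustration only: a piece produced solely
 * because of the BIO_MAX_PAGES cap still respects the queue's size and
 * segment limits, so the block layer may merge it back into a single
 * request.
 */
static bool split_piece_within_queue_limits(struct request_queue *q,
					    struct bio *piece)
{
	return bio_sectors(piece) <= queue_max_sectors(q) &&
	       bio_segments(piece) <= queue_max_segments(q);
}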

>
> Don't be so dismissive of other maintainer's concerns, we've got to deal with
> this code too. Especially when I and Christoph are agreeing on something - how
> often does that happen?

OK, I will submit a V4 and add a comment noting that merging is disallowed
just for simplicity's sake.


Thanks,
Ming

Patch

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3eec75a..c44d3e9 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -85,7 +85,8 @@  static inline unsigned get_max_io_size(struct request_queue *q,
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 					 struct bio *bio,
 					 struct bio_set *bs,
-					 unsigned *segs)
+					 unsigned *segs,
+					 bool *no_merge)
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
@@ -94,9 +95,34 @@  static struct bio *blk_bio_segment_split(struct request_queue *q,
 	bool do_split = true;
 	struct bio *new = NULL;
 	const unsigned max_sectors = get_max_io_size(q, bio);
+	unsigned bvecs = 0;
+
+	*no_merge = true;
 
 	bio_for_each_segment(bv, bio, iter) {
 		/*
+		 * With arbitrary bio size, the incoming bio may be very
+		 * big. We have to split the bio into small bios so that
+		 * each holds at most BIO_MAX_PAGES bvecs because
+		 * bio_clone() can fail to allocate big bvecs.
+		 *
+		 * It should have been better to apply the limit per
+		 * request queue in which bio_clone() is involved,
+		 * instead of globally. The biggest blocker is
+		 * bio_clone() in bio bounce.
+		 *
+		 * If the bio is split for this reason, we should allow
+		 * the resulting bios to keep merging.
+		 *
+		 * TODO: deal with bio bounce's bio_clone() gracefully
+		 * and convert the global limit into per-queue limit.
+		 */
+		if (bvecs++ >= BIO_MAX_PAGES) {
+			*no_merge = false;
+			goto split;
+		}
+
+		/*
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
 		 */
@@ -171,13 +197,15 @@  void blk_queue_split(struct request_queue *q, struct bio **bio,
 {
 	struct bio *split, *res;
 	unsigned nsegs;
+	bool no_merge_for_split = true;
 
 	if (bio_op(*bio) == REQ_OP_DISCARD)
 		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
 	else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
 		split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
 	else
-		split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
+		split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs,
+				&no_merge_for_split);
 
 	/* physical segments can be figured out during splitting */
 	res = split ? split : *bio;
@@ -186,7 +214,8 @@  void blk_queue_split(struct request_queue *q, struct bio **bio,
 
 	if (split) {
 		/* there isn't chance to merge the splitted bio */
-		split->bi_opf |= REQ_NOMERGE;
+		if (no_merge_for_split)
+			split->bi_opf |= REQ_NOMERGE;
 
 		bio_chain(split, *bio);
 		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);