diff mbox

block: make sure a big bio is split into at most 256 bvecs

Message ID 1459878246-9249-1-git-send-email-ming.lei@canonical.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ming Lei April 5, 2016, 5:44 p.m. UTC
After arbitrary bio sizes became supported, the incoming bio may
be very big. We have to split the bio into smaller bios so that
each holds at most BIO_MAX_PAGES bvecs, for the safety of code such
as bio_clone().

This patch fixes the following kernel crash:

> [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000028
> [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
> [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> [  172.660399] Oops: 0000 [#1] SMP
> [...]
> [  172.664780] Call Trace:
> [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
> [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
> [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]

Fixes: 54efd50 ("block: make generic_make_request handle arbitrarily sized bios")
Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
Cc: stable@vger.kernel.org (4.2+)
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
I can reproduce the issue and verify the fix by the following approach:
	- create one raid1 over two virtio-blk devices
	- build a bcache device over the above raid1 and another cache device
	- set the cache mode to writeback
	- run random writes over ext4 on the bcache device
	- the crash can then be triggered

 block/blk-merge.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
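
For reference, the cap the patch applies works out to 1 MiB per bio. A
minimal standalone sketch of the arithmetic (assuming 4k pages, i.e.
PAGE_CACHE_SHIFT == 12, and BIO_MAX_PAGES == 256, as in kernels of this era):

#include <stdio.h>

#define BIO_MAX_PAGES		256
#define PAGE_CACHE_SHIFT	12

int main(void)
{
	/* same expression as the get_max_io_size() hunk quoted later in the thread */
	unsigned cap = BIO_MAX_PAGES << (PAGE_CACHE_SHIFT - 9);

	printf("%u sectors = %u KiB\n", cap, cap / 2);	/* 2048 sectors = 1 MiB */
	return 0;
}

So a bio made of full-page bvecs is split before it can exceed 256 bvecs
(the discussion below covers the non-full-page case).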

Comments

Shaohua Li April 5, 2016, 6:27 p.m. UTC | #1
On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
> After arbitrary bio size is supported, the incoming bio may
> be very big. We have to split the bio into small bios so that
> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> as bio_clone().
> 
> This patch fixes the following kernel crash:
> 
> > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
> > 0000000000000028
> > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
> > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> > [  172.660399] Oops: 0000 [#1] SMP
> > [...]
> > [  172.664780] Call Trace:
> > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
> > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
> > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
> 
> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)

this bug was introduced by d2be537c3ba
> Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
> Cc: stable@vger.kernel.org (4.2+)
> Cc: Shaohua Li <shli@fb.com>
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
> I can reproduce the issue and verify the fix by the following approach:
> 	- create one raid1 over two virtio-blk 
> 	- build bcache device over the above raid1 and another cache device.
> 	- set cache mode as writeback
> 	- run random write over ext4 on the bcache device
> 	- then the crash can be triggered

can you explain why this is better than my original patch?
Kent Overstreet April 6, 2016, 12:27 a.m. UTC | #2
On Tue, Apr 05, 2016 at 11:27:21AM -0700, Shaohua Li wrote:
> On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
> > After arbitrary bio size is supported, the incoming bio may
> > be very big. We have to split the bio into small bios so that
> > each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> > as bio_clone().
> > 
> > This patch fixes the following kernel crash:
> > 
> > > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
> > > 0000000000000028
> > > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
> > > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> > > [  172.660399] Oops: 0000 [#1] SMP
> > > [...]
> > > [  172.664780] Call Trace:
> > > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> > > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> > > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> > > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> > > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
> > > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
> > > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
> > 
> > Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
> 
> this bug is introduced by d2be537c3ba
> > Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
> > Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
> > Cc: stable@vger.kernel.org (4.2+)
> > Cc: Shaohua Li <shli@fb.com>
> > Signed-off-by: Ming Lei <ming.lei@canonical.com>
> > ---
> > I can reproduce the issue and verify the fix by the following approach:
> > 	- create one raid1 over two virtio-blk 
> > 	- build bcache device over the above raid1 and another cache device.
> > 	- set cache mode as writeback
> > 	- run random write over ext4 on the bcache device
> > 	- then the crash can be triggered
> 
> can you explain why this is better than my original patch?

Shaohua, what was your original patch? I'm sorry, I know I saw it at one point
but I can't remember what it was.

I didn't see Jeff's patch that introduced this bug until your email just now.
Argh.

Jeff, "block: bump BLK_DEF_MAX_SECTORS to 2560" doesn't make much sense as far as
I can tell without changing the BIO_MAX_PAGES too - that's probably why you
weren't seeing much performance increase from that patch...

But BLK_DEF_MAX_SECTORS should never have been what enforced the BIO_MAX_PAGES
limit, so that patch was not at fault.
Shaohua Li April 6, 2016, 12:30 a.m. UTC | #3
On Tue, Apr 05, 2016 at 04:27:33PM -0800, Kent Overstreet wrote:
> On Tue, Apr 05, 2016 at 11:27:21AM -0700, Shaohua Li wrote:
> > On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
> > > After arbitrary bio size is supported, the incoming bio may
> > > be very big. We have to split the bio into small bios so that
> > > each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> > > as bio_clone().
> > > 
> > > This patch fixes the following kernel crash:
> > > 
> > > > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
> > > > 0000000000000028
> > > > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
> > > > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> > > > [  172.660399] Oops: 0000 [#1] SMP
> > > > [...]
> > > > [  172.664780] Call Trace:
> > > > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> > > > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> > > > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> > > > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> > > > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
> > > > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
> > > > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
> > > 
> > > Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
> > 
> > this bug is introduced by d2be537c3ba
> > > Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
> > > Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
> > > Cc: stable@vger.kernel.org (4.2+)
> > > Cc: Shaohua Li <shli@fb.com>
> > > Signed-off-by: Ming Lei <ming.lei@canonical.com>
> > > ---
> > > I can reproduce the issue and verify the fix by the following approach:
> > > 	- create one raid1 over two virtio-blk 
> > > 	- build bcache device over the above raid1 and another cache device.
> > > 	- set cache mode as writeback
> > > 	- run random write over ext4 on the bcache device
> > > 	- then the crash can be triggered
> > 
> > can you explain why this is better than my original patch?
> 
> Shaohua, what was your original patch? I'm sorry, I know I saw it at one point
> but I can't remember what it was.

this one:
http://marc.info/?l=linux-kernel&m=145926976808760&w=2
 
Kent Overstreet April 6, 2016, 12:30 a.m. UTC | #4
On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
> After arbitrary bio size is supported, the incoming bio may
> be very big. We have to split the bio into small bios so that
> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> as bio_clone().
> 
> This patch fixes the following kernel crash:

Ming, let's not do it this way; drivers that don't clone biovecs are the norm -
instead, md has its own queue limits that it ought to be setting up correctly.
Kent Overstreet April 6, 2016, 12:36 a.m. UTC | #5
On Tue, Apr 05, 2016 at 05:30:07PM -0700, Shaohua Li wrote:
> this one:
> http://marc.info/?l=linux-kernel&m=145926976808760&w=2

Ah, that patch won't actually fix the bug, since md isn't using
blk_default_limits, it's using blk_set_stacking_limits().
Shaohua Li April 6, 2016, 12:41 a.m. UTC | #6
On Tue, Apr 05, 2016 at 04:36:04PM -0800, Kent Overstreet wrote:
> On Tue, Apr 05, 2016 at 05:30:07PM -0700, Shaohua Li wrote:
> > this one:
> > http://marc.info/?l=linux-kernel&m=145926976808760&w=2
> 
> Ah. that patch won't actually fix the bug, since md isn't using
> blk_default_limits, it's using blk_set_stacking_limits().

Not really, the limit is set by the underlying disk, not md; otherwise it
would be BLK_SAFE_MAX_SECTORS, but the reported bio has 2560 sectors.
blk_set_stacking_limits() will use it.
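
Concretely, a self-contained sketch of that stacking behaviour, using the
values from this thread (min_not_zero() below imitates the kernel helper of
the same name; 2560 is BLK_DEF_MAX_SECTORS after d2be537c3ba):

#include <stdio.h>
#include <limits.h>

static unsigned min_not_zero(unsigned a, unsigned b)
{
	return a == 0 ? b : (b == 0 ? a : (a < b ? a : b));
}

int main(void)
{
	unsigned md_max_sectors = UINT_MAX;	/* blk_set_stacking_limits(): unlimited */
	unsigned disk_max_sectors = 2560;	/* set by the underlying virtio-blk disk */

	/* blk_stack_limits() keeps the most restrictive nonzero value */
	md_max_sectors = min_not_zero(md_max_sectors, disk_max_sectors);
	printf("stacked max_sectors = %u\n", md_max_sectors);	/* 2560 */
	return 0;
}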
Kent Overstreet April 6, 2016, 12:45 a.m. UTC | #7
On Tue, Apr 05, 2016 at 05:41:47PM -0700, Shaohua Li wrote:
> On Tue, Apr 05, 2016 at 04:36:04PM -0800, Kent Overstreet wrote:
> > On Tue, Apr 05, 2016 at 05:30:07PM -0700, Shaohua Li wrote:
> > > this one:
> > > http://marc.info/?l=linux-kernel&m=145926976808760&w=2
> > 
> > Ah. that patch won't actually fix the bug, since md isn't using
> > blk_default_limits, it's using blk_set_stacking_limits().
> 
> Not really, the limit is set by under layer disk not md, otherwise it
> should be BLK_SAFE_MAX_SECTORS, but the reported bio has 2560 sectors.
> blk_set_stacking_limits() will use it.

What? Well, that code should just be deleted; there's no reason anymore for md
to care about the queue limits of the devices underneath it.

Regardless, using BLK_DEF_MAX_SECTORS to limit # of pages in the biovec is
_crazy_. Why would you even do that? We have a separate field in queue limits
for # max segments, use it.
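
Concretely, something like this in the driver's setup path (a sketch, not a
tested patch):

	blk_queue_max_segments(q, BIO_MAX_PAGES);	/* cap at 256 segments */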
Ming Lei April 6, 2016, 12:47 a.m. UTC | #8
On Wed, Apr 6, 2016 at 2:27 AM, Shaohua Li <shli@fb.com> wrote:
> On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
>> After arbitrary bio size is supported, the incoming bio may
>> be very big. We have to split the bio into small bios so that
>> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
>> as bio_clone().
>>
>> This patch fixes the following kernel crash:
>>
>> > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
>> > 0000000000000028
>> > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
>> > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
>> > [  172.660399] Oops: 0000 [#1] SMP
>> > [...]
>> > [  172.664780] Call Trace:
>> > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
>> > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
>> > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
>> > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
>> > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
>> > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
>> > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
>>
>> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
>
> this bug is introduced by d2be537c3ba
>> Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
>> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
>> Cc: stable@vger.kernel.org (4.2+)
>> Cc: Shaohua Li <shli@fb.com>
>> Signed-off-by: Ming Lei <ming.lei@canonical.com>
>> ---
>> I can reproduce the issue and verify the fix by the following approach:
>>       - create one raid1 over two virtio-blk
>>       - build bcache device over the above raid1 and another cache device.
>>       - set cache mode as writeback
>>       - run random write over ext4 on the bcache device
>>       - then the crash can be triggered
>
> can you explain why this is better than my original patch?

Shaohua, your patch is wrong, please see the following link:

    https://lkml.org/lkml/2016/3/30/893

Thanks,

Ming Lei April 6, 2016, 12:59 a.m. UTC | #9
On Wed, Apr 6, 2016 at 8:30 AM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
>> After arbitrary bio size is supported, the incoming bio may
>> be very big. We have to split the bio into small bios so that
>> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
>> as bio_clone().
>>
>> This patch fixes the following kernel crash:
>
> Ming, let's not do it this way; drivers that don't clone biovecs are the norm -
> instead, md has its own queue limits that it ought to be setting up correctly.

Except for md, there are also several other users of bio_clone:

         - drbd
         - osdblk
         - pktcdvd
         - xen-blkfront
         - verify code of bcache

I don't like bio_clone() either, since it can cause trouble for multipage bvecs.

How about fixing the issue with this simple patch first? Then once we limit
all the above queues by max sectors, the global limit can be removed, as
mentioned in the comment.

thanks,
Ming

Shaohua Li April 6, 2016, 12:59 a.m. UTC | #10
On Tue, Apr 05, 2016 at 04:45:55PM -0800, Kent Overstreet wrote:
> On Tue, Apr 05, 2016 at 05:41:47PM -0700, Shaohua Li wrote:
> > On Tue, Apr 05, 2016 at 04:36:04PM -0800, Kent Overstreet wrote:
> > > On Tue, Apr 05, 2016 at 05:30:07PM -0700, Shaohua Li wrote:
> > > > this one:
> > > > http://marc.info/?l=linux-kernel&m=145926976808760&w=2
> > > 
> > > Ah. that patch won't actually fix the bug, since md isn't using
> > > blk_default_limits, it's using blk_set_stacking_limits().
> > 
> > Not really, the limit is set by under layer disk not md, otherwise it
> > should be BLK_SAFE_MAX_SECTORS, but the reported bio has 2560 sectors.
> > blk_set_stacking_limits() will use it.
> 
> What? Well, that could should just be deleted, there's no reason anymore for md
> to care about the queue limits of the devices underneath it.
> 
> Regardless, using BLK_DEF_MAX_SECTORS to limit # of pages in the biovec is
> _crazy_. Why would you even do that? We have a separate field in queue limits
> for # max segments, use it.

We don't limit the max segments in blk_queue_max_segments(), but we could
add one. On the other hand, limiting max segments to 256 could be a problem:
since a bvec isn't always a full 4k page, this might make some bios smaller
than necessary.
Ming Lei April 6, 2016, 1:02 a.m. UTC | #11
On Wed, Apr 6, 2016 at 1:44 AM, Ming Lei <ming.lei@canonical.com> wrote:
> After arbitrary bio size is supported, the incoming bio may
> be very big. We have to split the bio into small bios so that
> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> as bio_clone().
>
> This patch fixes the following kernel crash:
>
>> [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
>> 0000000000000028
>> [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
>> [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
>> [  172.660399] Oops: 0000 [#1] SMP
>> [...]
>> [  172.664780] Call Trace:
>> [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
>> [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
>> [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
>> [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
>> [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
>> [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
>> [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
>
> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
> Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
> Cc: stable@vger.kernel.org (4.2+)
> Cc: Shaohua Li <shli@fb.com>
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
> I can reproduce the issue and verify the fix by the following approach:
>         - create one raid1 over two virtio-blk
>         - build bcache device over the above raid1 and another cache device.
>         - set cache mode as writeback
>         - run random write over ext4 on the bcache device
>         - then the crash can be triggered

For anyone who is interested in the issue/fix, I forgot to mention:

       The bucket size should be set bigger than 1M when making the bcache device.
       In my test, the bucket size is 2M.
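
Roughly why the bucket size matters - a sketch, assuming bcache issues one
bucket-sized bio here (which the prio_io() frame in the trace suggests) and
4k pages:

#include <stdio.h>

int main(void)
{
	unsigned bucket_bytes = 2 << 20;	/* the 2M bucket from the test */
	unsigned bvecs = bucket_bytes >> 12;	/* one single-page bvec per 4k */

	printf("%u bvecs\n", bvecs);		/* 512: twice BIO_MAX_PAGES */
	return 0;
}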

Thanks,
Ming

>
>  block/blk-merge.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 2613531..9a8651f 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -79,6 +79,18 @@ static inline unsigned get_max_io_size(struct request_queue *q,
>         /* aligned to logical block size */
>         sectors &= ~(mask >> 9);
>
> +       /*
> +        * With arbitrary bio size, the incoming bio may be very big.
> +        * We have to split the bio into small bios so that each holds
> +        * at most BIO_MAX_PAGES bvecs for safety reason, such as
> +        * bio_clone().
> +        *
> +        * In the future, the limit might be converted into per-queue
> +        * flag.
> +        */
> +       sectors = min_t(unsigned, sectors, BIO_MAX_PAGES <<
> +                       (PAGE_CACHE_SHIFT - 9));
> +
>         return sectors;
>  }
>
> --
> 1.9.1
>
Shaohua Li April 6, 2016, 1:04 a.m. UTC | #12
On Wed, Apr 06, 2016 at 08:47:56AM +0800, Ming Lei wrote:
> On Wed, Apr 6, 2016 at 2:27 AM, Shaohua Li <shli@fb.com> wrote:
> > On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
> >> After arbitrary bio size is supported, the incoming bio may
> >> be very big. We have to split the bio into small bios so that
> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> >> as bio_clone().
> >>
> >> This patch fixes the following kernel crash:
> >>
> >> > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
> >> > 0000000000000028
> >> > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
> >> > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> >> > [  172.660399] Oops: 0000 [#1] SMP
> >> > [...]
> >> > [  172.664780] Call Trace:
> >> > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> >> > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> >> > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> >> > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> >> > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
> >> > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
> >> > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
> >>
> >> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
> >
> > this bug is introduced by d2be537c3ba
> >> Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
> >> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
> >> Cc: stable@vger.kernel.org (4.2+)
> >> Cc: Shaohua Li <shli@fb.com>
> >> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> >> ---
> >> I can reproduce the issue and verify the fix by the following approach:
> >>       - create one raid1 over two virtio-blk
> >>       - build bcache device over the above raid1 and another cache device.
> >>       - set cache mode as writeback
> >>       - run random write over ext4 on the bcache device
> >>       - then the crash can be triggered
> >
> > can you explain why this is better than my original patch?
> 
> Shaohua, your patch is wrong, please see the following link:
> 
>     https://lkml.org/lkml/2016/3/30/893

I don't see why, really, except it declares you are right :)

Why is it 2560 instead of 2048?
Ming Lei April 6, 2016, 1:05 a.m. UTC | #13
On Wed, Apr 6, 2016 at 8:59 AM, Shaohua Li <shli@fb.com> wrote:
> On Tue, Apr 05, 2016 at 04:45:55PM -0800, Kent Overstreet wrote:
>> On Tue, Apr 05, 2016 at 05:41:47PM -0700, Shaohua Li wrote:
>> > On Tue, Apr 05, 2016 at 04:36:04PM -0800, Kent Overstreet wrote:
>> > > On Tue, Apr 05, 2016 at 05:30:07PM -0700, Shaohua Li wrote:
>> > > > this one:
>> > > > http://marc.info/?l=linux-kernel&m=145926976808760&w=2
>> > >
>> > > Ah. that patch won't actually fix the bug, since md isn't using
>> > > blk_default_limits, it's using blk_set_stacking_limits().
>> >
>> > Not really, the limit is set by under layer disk not md, otherwise it
>> > should be BLK_SAFE_MAX_SECTORS, but the reported bio has 2560 sectors.
>> > blk_set_stacking_limits() will use it.
>>
>> What? Well, that could should just be deleted, there's no reason anymore for md
>> to care about the queue limits of the devices underneath it.
>>
>> Regardless, using BLK_DEF_MAX_SECTORS to limit # of pages in the biovec is
>> _crazy_. Why would you even do that? We have a separate field in queue limits
>> for # max segments, use it.
>
> We don't limit the max segments in blk_queue_max_segments(), but we can
> add. On the other hand, limit max segments to 256 could be a problem,
> because bvec page isn't always 4k, this might make some bio smaller.

It has nothing to do with the max segments limit; it is about the max
sectors limit. Think about one big bio which holds 2 Mbytes: 512 bvecs
are required, but the 2M buffer can be physically contiguous.
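
A sketch of that distinction, with the numbers above (assuming 4k pages):

#include <stdio.h>

int main(void)
{
	unsigned bytes = 2 << 20;	/* one physically contiguous 2M buffer */
	unsigned bvecs = bytes >> 12;	/* 512 single-page bvecs are required */
	unsigned hw_segs = 1;		/* yet it merges into one hw segment */

	/* so a 256 max_segments limit would never see the 512 bvecs */
	printf("bvecs=%u hw_segs=%u\n", bvecs, hw_segs);
	return 0;
}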


Kent Overstreet April 6, 2016, 1:10 a.m. UTC | #14
On Wed, Apr 06, 2016 at 08:59:31AM +0800, Ming Lei wrote:
> On Wed, Apr 6, 2016 at 8:30 AM, Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> > On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
> >> After arbitrary bio size is supported, the incoming bio may
> >> be very big. We have to split the bio into small bios so that
> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> >> as bio_clone().
> >>
> >> This patch fixes the following kernel crash:
> >
> > Ming, let's not do it this way; drivers that don't clone biovecs are the norm -
> > instead, md has its own queue limits that it ought to be setting up correctly.
> 
> Except for md, there are also several usages of bio_clone:
> 
>          - drbd
>          - osdblk
>          - pktcdvd
>          - xen-blkfront
>          - verify code of bcache
> 
> I don't like bio_clone() too, which can cause trouble to multipage bvecs.
> 
> How about fixing the issue by this simple patch first? Then once we limits
> all above queues by max sectors, the global limit can be removed as
> mentioned by the comment.

just do this:

void blk_set_limit_clonable(struct queue_limits *lim)
{
	lim->max_segments = min(lim->max_segments, BIO_MAX_PAGES);
}

and then call that from the appropriate drivers. It should be like 20 minutes of
work.

My issue is that your approach of just enforcing a global limit is a step in the
wrong direction - we want to get _away_ from that and move towards drivers
specifying _directly_ what their limits are: more straightforward, less opaque.

Also, your patch is wrong, as it'll break if there are bvecs that aren't full
pages.
Ming Lei April 6, 2016, 1:11 a.m. UTC | #15
On Wed, Apr 6, 2016 at 9:04 AM, Shaohua Li <shli@fb.com> wrote:
> On Wed, Apr 06, 2016 at 08:47:56AM +0800, Ming Lei wrote:
>> On Wed, Apr 6, 2016 at 2:27 AM, Shaohua Li <shli@fb.com> wrote:
>> > On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
>> >> After arbitrary bio size is supported, the incoming bio may
>> >> be very big. We have to split the bio into small bios so that
>> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
>> >> as bio_clone().
>> >>
>> >> This patch fixes the following kernel crash:
>> >>
>> >> > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
>> >> > 0000000000000028
>> >> > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
>> >> > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
>> >> > [  172.660399] Oops: 0000 [#1] SMP
>> >> > [...]
>> >> > [  172.664780] Call Trace:
>> >> > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
>> >> > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
>> >> > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
>> >> > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
>> >> > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
>> >> > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
>> >> > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
>> >>
>> >> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
>> >
>> > this bug is introduced by d2be537c3ba
>> >> Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
>> >> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
>> >> Cc: stable@vger.kernel.org (4.2+)
>> >> Cc: Shaohua Li <shli@fb.com>
>> >> Signed-off-by: Ming Lei <ming.lei@canonical.com>
>> >> ---
>> >> I can reproduce the issue and verify the fix by the following approach:
>> >>       - create one raid1 over two virtio-blk
>> >>       - build bcache device over the above raid1 and another cache device.
>> >>       - set cache mode as writeback
>> >>       - run random write over ext4 on the bcache device
>> >>       - then the crash can be triggered
>> >
>> > can you explain why this is better than my original patch?
>>
>> Shaohua, your patch is wrong, please see the following link:
>>
>>     https://lkml.org/lkml/2016/3/30/893
>
> I don't see why, really, except it declares you are right :)

Never mind, I'll post it again; maybe it was my poor English, :-)

blk_rq_get_max_sectors(), which uses the max sectors limit, is used for
merging bios/reqs, and that means limits.max_sectors limits the max
sectors in one request or transfer. One request may include lots of
bios. Now your patch decreases that limit just to enforce a single bio's
256-bvec limitation.
Is that correct? That is the reason why I suggest changing get_max_io_size()
to enforce the bio's 256-bvec limit.

On the contrary, the default max sectors should have been increased,
since hardware is becoming quicker and we should send more to the drive
in one request, IMO.

>
> why it's 2560 instead of 2048?

I don't know the exact reason why Jeff chose 2560, but I feel it could be
bigger because hardware is becoming quicker.

Thanks,
Ming

Ming Lei April 6, 2016, 1:20 a.m. UTC | #16
On Wed, Apr 6, 2016 at 9:10 AM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Wed, Apr 06, 2016 at 08:59:31AM +0800, Ming Lei wrote:
>> On Wed, Apr 6, 2016 at 8:30 AM, Kent Overstreet
>> <kent.overstreet@gmail.com> wrote:
>> > On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
>> >> After arbitrary bio size is supported, the incoming bio may
>> >> be very big. We have to split the bio into small bios so that
>> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
>> >> as bio_clone().
>> >>
>> >> This patch fixes the following kernel crash:
>> >
>> > Ming, let's not do it this way; drivers that don't clone biovecs are the norm -
>> > instead, md has its own queue limits that it ought to be setting up correctly.
>>
>> Except for md, there are also several usages of bio_clone:
>>
>>          - drbd
>>          - osdblk
>>          - pktcdvd
>>          - xen-blkfront
>>          - verify code of bcache
>>
>> I don't like bio_clone() too, which can cause trouble to multipage bvecs.
>>
>> How about fixing the issue by this simple patch first? Then once we limits
>> all above queues by max sectors, the global limit can be removed as
>> mentioned by the comment.
>
> just do this:
>
> void blk_set_limit_clonable(struct queue_limits *lim)
> {
>         lim->max_segments = min(lim->max_segments, BIO_MAX_PAGES);
> }

As I mentioned, it is __not__ correct to use max_segments, and the issue is
related to max sectors; please see the code of bio_clone_bioset():

      bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);

bio_segments() actually returns pages.
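
So the failure chain, as I read the trace above - an illustrative fragment,
not the exact raid1 code ('offset' and 'size' stand in for the real
arguments):

	struct bio *clone = bio_clone_mddev(bio, GFP_NOIO, mddev);
	/*
	 * bio_alloc_bioset() has no bvec pool bigger than BIO_MAX_PAGES,
	 * so for bio_segments(bio_src) > 256 it returns NULL...
	 */
	bio_trim(clone, offset, size);	/* ...and this dereferences NULL */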

>
> and then call that from the appropriate drivers. It should be like 20 minutes of
> work.
>
> My issue is that your approach of just enforcing a global limit is a step in the
> wrong direction - we want to get _away_ from that and move towards drivers
> specifying _directly_ what their limits are: more straightforward, less opaque.
>
> Also, your patch is wrong, as it'll break if there's bvecs that aren't full
> pages.

I don't understand why my patch is wrong, since we can split anywhere
in a bio. Could you explain it a bit?

Thanks,
Ming

Kent Overstreet April 6, 2016, 1:28 a.m. UTC | #17
On Wed, Apr 06, 2016 at 09:20:59AM +0800, Ming Lei wrote:
> On Wed, Apr 6, 2016 at 9:10 AM, Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> > On Wed, Apr 06, 2016 at 08:59:31AM +0800, Ming Lei wrote:
> >> On Wed, Apr 6, 2016 at 8:30 AM, Kent Overstreet
> >> <kent.overstreet@gmail.com> wrote:
> >> > On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
> >> >> After arbitrary bio size is supported, the incoming bio may
> >> >> be very big. We have to split the bio into small bios so that
> >> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> >> >> as bio_clone().
> >> >>
> >> >> This patch fixes the following kernel crash:
> >> >
> >> > Ming, let's not do it this way; drivers that don't clone biovecs are the norm -
> >> > instead, md has its own queue limits that it ought to be setting up correctly.
> >>
> >> Except for md, there are also several usages of bio_clone:
> >>
> >>          - drbd
> >>          - osdblk
> >>          - pktcdvd
> >>          - xen-blkfront
> >>          - verify code of bcache
> >>
> >> I don't like bio_clone() too, which can cause trouble to multipage bvecs.
> >>
> >> How about fixing the issue by this simple patch first? Then once we limits
> >> all above queues by max sectors, the global limit can be removed as
> >> mentioned by the comment.
> >
> > just do this:
> >
> > void blk_set_limit_clonable(struct queue_limits *lim)
> > {
> >         lim->max_segments = min(lim->max_segments, BIO_MAX_PAGES);
> > }
> 
> As I memtioned it is __not__ correct to use max_segments, and the issue is
> related with max sectors, please see the code of bio_clone_bioset():

I know how bio_clone_bioset() works, but I'm not seeing how that has anything to
do with max sectors. The way it copies the biovec is not going to merge
segments; if the original bio had non-full-page segments, then so will the clone.

>       bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
> 
> bio_segments() returns pages actually.
> 
> >
> > and then call that from the appropriate drivers. It should be like 20 minutes of
> > work.
> >
> > My issue is that your approach of just enforcing a global limit is a step in the
> > wrong direction - we want to get _away_ from that and move towards drivers
> > specifying _directly_ what their limits are: more straightforward, less opaque.
> >
> > Also, your patch is wrong, as it'll break if there's bvecs that aren't full
> > pages.
> 
> I don't understand why my patch is wrong, since we can split anywhere
> in a bio, could you explain it a bit?

If you have a bio that has > BIO_MAX_PAGES segments, but all the segments are a
single sector (not a full page!) - then think about what'll happen...

It can happen with userspace issuing direct IOs.
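
Worked through - a sketch assuming 512-byte logical blocks and the 4k-page
cap from the patch:

#include <stdio.h>

int main(void)
{
	unsigned cap_sectors = 256 << (12 - 9);	/* the patch's cap: 2048 sectors */
	unsigned bvecs = cap_sectors;		/* worst case: one 512-byte bvec each */

	printf("%u bvecs\n", bvecs);		/* 2048, still far above 256 */
	return 0;
}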
Ming Lei April 6, 2016, 1:51 a.m. UTC | #18
On Wed, Apr 6, 2016 at 9:28 AM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Wed, Apr 06, 2016 at 09:20:59AM +0800, Ming Lei wrote:
>> On Wed, Apr 6, 2016 at 9:10 AM, Kent Overstreet
>> <kent.overstreet@gmail.com> wrote:
>> > On Wed, Apr 06, 2016 at 08:59:31AM +0800, Ming Lei wrote:
>> >> On Wed, Apr 6, 2016 at 8:30 AM, Kent Overstreet
>> >> <kent.overstreet@gmail.com> wrote:
>> >> > On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
>> >> >> After arbitrary bio size is supported, the incoming bio may
>> >> >> be very big. We have to split the bio into small bios so that
>> >> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
>> >> >> as bio_clone().
>> >> >>
>> >> >> This patch fixes the following kernel crash:
>> >> >
>> >> > Ming, let's not do it this way; drivers that don't clone biovecs are the norm -
>> >> > instead, md has its own queue limits that it ought to be setting up correctly.
>> >>
>> >> Except for md, there are also several usages of bio_clone:
>> >>
>> >>          - drbd
>> >>          - osdblk
>> >>          - pktcdvd
>> >>          - xen-blkfront
>> >>          - verify code of bcache
>> >>
>> >> I don't like bio_clone() too, which can cause trouble to multipage bvecs.
>> >>
>> >> How about fixing the issue by this simple patch first? Then once we limits
>> >> all above queues by max sectors, the global limit can be removed as
>> >> mentioned by the comment.
>> >
>> > just do this:
>> >
>> > void blk_set_limit_clonable(struct queue_limits *lim)
>> > {
>> >         lim->max_segments = min(lim->max_segments, BIO_MAX_PAGES);
>> > }
>>
>> As I memtioned it is __not__ correct to use max_segments, and the issue is
>> related with max sectors, please see the code of bio_clone_bioset():
>
> I know how bio_clone_bioset() works but I'm not seeing how that has anything to
> do with max sectors. The way it copies the biovec is not going to merge
> segments, if the original bio had non full page segments then so is the clone.

OK, I see, now it is a totally new limit, and no current queue limit fits
the purpose.

It looks like we need to introduce a new limit, io_max_vecs, which can be
applied in blk_bio_segment_split().

But a queue flag would be better than a queue limit, since it is a 'limit'
imposed by the software/driver.
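
A purely hypothetical sketch of what that could look like inside
blk_bio_segment_split() - the flag name QUEUE_FLAG_SPLIT_BVECS and the
nvecs counter are invented here, nothing like this is merged:

	bio_for_each_segment(bv, bio, iter) {
		if (test_bit(QUEUE_FLAG_SPLIT_BVECS, &q->queue_flags) &&
		    ++nvecs > BIO_MAX_PAGES)
			goto split;	/* split before exceeding 256 bvecs */
		/* ... existing segment/sector checks ... */
	}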

>
>>       bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
>>
>> bio_segments() returns pages actually.
>>
>> >
>> > and then call that from the appropriate drivers. It should be like 20 minutes of
>> > work.
>> >
>> > My issue is that your approach of just enforcing a global limit is a step in the
>> > wrong direction - we want to get _away_ from that and move towards drivers
>> > specifying _directly_ what their limits are: more straightforward, less opaque.
>> >
>> > Also, your patch is wrong, as it'll break if there's bvecs that aren't full
>> > pages.
>>
>> I don't understand why my patch is wrong, since we can split anywhere
>> in a bio, could you explain it a bit?
>
> If you have a bio that has > BIO_MAX_PAGES segments, but all the segments are a
> single sector (not a full page!) - then think about what'll happen...
>
> It can happen with userspace issuing direct IOs

Yeah, I will cook a patch for review.

Thanks,
Ming
Kent Overstreet April 6, 2016, 2:22 a.m. UTC | #19
On Wed, Apr 06, 2016 at 09:51:02AM +0800, Ming Lei wrote:
> On Wed, Apr 6, 2016 at 9:28 AM, Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> > On Wed, Apr 06, 2016 at 09:20:59AM +0800, Ming Lei wrote:
> >> On Wed, Apr 6, 2016 at 9:10 AM, Kent Overstreet
> >> <kent.overstreet@gmail.com> wrote:
> >> > On Wed, Apr 06, 2016 at 08:59:31AM +0800, Ming Lei wrote:
> >> >> On Wed, Apr 6, 2016 at 8:30 AM, Kent Overstreet
> >> >> <kent.overstreet@gmail.com> wrote:
> >> >> > On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
> >> >> >> After arbitrary bio size is supported, the incoming bio may
> >> >> >> be very big. We have to split the bio into small bios so that
> >> >> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> >> >> >> as bio_clone().
> >> >> >>
> >> >> >> This patch fixes the following kernel crash:
> >> >> >
> >> >> > Ming, let's not do it this way; drivers that don't clone biovecs are the norm -
> >> >> > instead, md has its own queue limits that it ought to be setting up correctly.
> >> >>
> >> >> Except for md, there are also several usages of bio_clone:
> >> >>
> >> >>          - drbd
> >> >>          - osdblk
> >> >>          - pktcdvd
> >> >>          - xen-blkfront
> >> >>          - verify code of bcache
> >> >>
> >> >> I don't like bio_clone() too, which can cause trouble to multipage bvecs.
> >> >>
> >> >> How about fixing the issue by this simple patch first? Then once we limits
> >> >> all above queues by max sectors, the global limit can be removed as
> >> >> mentioned by the comment.
> >> >
> >> > just do this:
> >> >
> >> > void blk_set_limit_clonable(struct queue_limits *lim)
> >> > {
> >> >         lim->max_segments = min(lim->max_segments, BIO_MAX_PAGES);
> >> > }
> >>
> >> As I memtioned it is __not__ correct to use max_segments, and the issue is
> >> related with max sectors, please see the code of bio_clone_bioset():
> >
> > I know how bio_clone_bioset() works but I'm not seeing how that has anything to
> > do with max sectors. The way it copies the biovec is not going to merge
> > segments, if the original bio had non full page segments then so is the clone.
> 
> OK, I see, now it is a totally new limit, and no current queue limit can fit
> the purpose.
> 
> Looks we need to introduce the new limit of io_max_vecs, which can be
> applied into blk_bio_segment_split().
> 
> But a queue flag should be better than queue limit since it is a 'limit' from
> software/driver.

Why is max_segments not appropriate?
Ming Lei April 6, 2016, 2:30 a.m. UTC | #20
On Wed, Apr 6, 2016 at 10:22 AM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Wed, Apr 06, 2016 at 09:51:02AM +0800, Ming Lei wrote:
>> On Wed, Apr 6, 2016 at 9:28 AM, Kent Overstreet
>> <kent.overstreet@gmail.com> wrote:
>> > On Wed, Apr 06, 2016 at 09:20:59AM +0800, Ming Lei wrote:
>> >> On Wed, Apr 6, 2016 at 9:10 AM, Kent Overstreet
>> >> <kent.overstreet@gmail.com> wrote:
>> >> > On Wed, Apr 06, 2016 at 08:59:31AM +0800, Ming Lei wrote:
>> >> >> On Wed, Apr 6, 2016 at 8:30 AM, Kent Overstreet
>> >> >> <kent.overstreet@gmail.com> wrote:
>> >> >> > On Wed, Apr 06, 2016 at 01:44:06AM +0800, Ming Lei wrote:
>> >> >> >> After arbitrary bio size is supported, the incoming bio may
>> >> >> >> be very big. We have to split the bio into small bios so that
>> >> >> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
>> >> >> >> as bio_clone().
>> >> >> >>
>> >> >> >> This patch fixes the following kernel crash:
>> >> >> >
>> >> >> > Ming, let's not do it this way; drivers that don't clone biovecs are the norm -
>> >> >> > instead, md has its own queue limits that it ought to be setting up correctly.
>> >> >>
>> >> >> Except for md, there are also several usages of bio_clone:
>> >> >>
>> >> >>          - drbd
>> >> >>          - osdblk
>> >> >>          - pktcdvd
>> >> >>          - xen-blkfront
>> >> >>          - verify code of bcache
>> >> >>
>> >> >> I don't like bio_clone() too, which can cause trouble to multipage bvecs.
>> >> >>
>> >> >> How about fixing the issue by this simple patch first? Then once we limits
>> >> >> all above queues by max sectors, the global limit can be removed as
>> >> >> mentioned by the comment.
>> >> >
>> >> > just do this:
>> >> >
>> >> > void blk_set_limit_clonable(struct queue_limits *lim)
>> >> > {
>> >> >         lim->max_segments = min(lim->max_segments, BIO_MAX_PAGES);
>> >> > }
>> >>
>> >> As I memtioned it is __not__ correct to use max_segments, and the issue is
>> >> related with max sectors, please see the code of bio_clone_bioset():
>> >
>> > I know how bio_clone_bioset() works but I'm not seeing how that has anything to
>> > do with max sectors. The way it copies the biovec is not going to merge
>> > segments, if the original bio had non full page segments then so is the clone.
>>
>> OK, I see, now it is a totally new limit, and no current queue limit can fit
>> the purpose.
>>
>> Looks we need to introduce the new limit of io_max_vecs, which can be
>> applied into blk_bio_segment_split().
>>
>> But a queue flag should be better than queue limit since it is a 'limit' from
>> software/driver.
>
> Why is max_segments not appropriate?

Now limits.max_segments is for limiting segments from the hardware's view;
one such segment may include lots of pages/bvecs.

The current bio_clone() issue is that we can't clone from one bio which
includes more than 256 bvecs, and maybe all these 256 bvecs belong to
one and the same hw segment.


Thanks,
Ming

Kent Overstreet April 6, 2016, 2:34 a.m. UTC | #21
On Wed, Apr 06, 2016 at 10:30:22AM +0800, Ming Lei wrote:
> Now limit.max_segments is for limiting segments from hw view, one this
> segment may include lots of pages/bvecs.
> 
> The current bio_clone() issue is that we can't clone from one bio which
> includes more than 256 bvecs, maybe all these 256 bvecs belong to
> one same hw segment.

So the distinction is purely a post multipage bvec thing?
Ming Lei April 6, 2016, 2:37 a.m. UTC | #22
On Wed, Apr 6, 2016 at 10:34 AM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Wed, Apr 06, 2016 at 10:30:22AM +0800, Ming Lei wrote:
>> Now limit.max_segments is for limiting segments from hw view, one this
>> segment may include lots of pages/bvecs.
>>
>> The current bio_clone() issue is that we can't clone from one bio which
>> includes more than 256 bvecs, maybe all these 256 bvecs belong to
>> one same hw segment.
>
> So the distinction is purely a post multipage bvec thing?

Even after multipage bvecs are applied, the limit on max bvecs is still needed
for some cases like bio bounce, in which bio_clone() needs to clone
single-page bvecs.

Thanks,


Kent Overstreet April 6, 2016, 2:40 a.m. UTC | #23
On Wed, Apr 06, 2016 at 10:37:05AM +0800, Ming Lei wrote:
> On Wed, Apr 6, 2016 at 10:34 AM, Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> > On Wed, Apr 06, 2016 at 10:30:22AM +0800, Ming Lei wrote:
> >> Now limit.max_segments is for limiting segments from hw view, one this
> >> segment may include lots of pages/bvecs.
> >>
> >> The current bio_clone() issue is that we can't clone from one bio which
> >> includes more than 256 bvecs, maybe all these 256 bvecs belong to
> >> one same hw segment.
> >
> > So the distinction is purely a post multipage bvec thing?
> 
> Even after multipage bvec is applied, the limit for max bvecs is still needed
> for some cases like bio bounce, in which bio_clone() need to
> clone single page bvec.

s/max bvecs/max pages/?

What I meant is that until we have multipage bvecs, unless I'm missing something,
max_segments is exactly what we want. After multipage bvecs, things do get more
complicated, I agree.
Ming Lei April 6, 2016, 2:51 a.m. UTC | #24
On Wed, Apr 6, 2016 at 10:40 AM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Wed, Apr 06, 2016 at 10:37:05AM +0800, Ming Lei wrote:
>> On Wed, Apr 6, 2016 at 10:34 AM, Kent Overstreet
>> <kent.overstreet@gmail.com> wrote:
>> > On Wed, Apr 06, 2016 at 10:30:22AM +0800, Ming Lei wrote:
>> >> Now limit.max_segments is for limiting segments from hw view, one this
>> >> segment may include lots of pages/bvecs.
>> >>
>> >> The current bio_clone() issue is that we can't clone from one bio which
>> >> includes more than 256 bvecs, maybe all these 256 bvecs belong to
>> >> one same hw segment.
>> >
>> > So the distinction is purely a post multipage bvec thing?
>>
>> Even after multipage bvec is applied, the limit for max bvecs is still needed
>> for some cases like bio bounce, in which bio_clone() need to
>> clone single page bvec.
>
> s/max bvecs/max pages/?

Exactly, :-)

>
> What I meant is that until we have multipage bvecs, unless I'm missing something
> max_segments is exactly what we want. After multipage bvecs, things do get more

Yes, but now we need to fix the current issue and backport the fix. Given
that bio bounce isn't easy to fix, unlike the other users of bio_clone, I
think we still need to apply the max pages limit globally. What do you think?

For multipage bvecs, there may be a long way ahead, :-)

thanks,
Ming

Kent Overstreet April 6, 2016, 2:58 a.m. UTC | #25
On Wed, Apr 06, 2016 at 10:51:40AM +0800, Ming Lei wrote:
> On Wed, Apr 6, 2016 at 10:40 AM, Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> > On Wed, Apr 06, 2016 at 10:37:05AM +0800, Ming Lei wrote:
> >> On Wed, Apr 6, 2016 at 10:34 AM, Kent Overstreet
> >> <kent.overstreet@gmail.com> wrote:
> >> > On Wed, Apr 06, 2016 at 10:30:22AM +0800, Ming Lei wrote:
> >> >> Now limit.max_segments is for limiting segments from hw view, one this
> >> >> segment may include lots of pages/bvecs.
> >> >>
> >> >> The current bio_clone() issue is that we can't clone from one bio which
> >> >> includes more than 256 bvecs, maybe all these 256 bvecs belong to
> >> >> one same hw segment.
> >> >
> >> > So the distinction is purely a post multipage bvec thing?
> >>
> >> Even after multipage bvec is applied, the limit for max bvecs is still needed
> >> for some cases like bio bounce, in which bio_clone() need to
> >> clone single page bvec.
> >
> > s/max bvecs/max pages/?
> 
> Exactly, :-)
> 
> >
> > What I meant is that until we have multipage bvecs, unless I'm missing something
> > max_segments is exactly what we want. After multipage bvecs, things do get more
> 
> Yes, but now we need to fix current issue and backport the fix. Given bio
> bounce isn't easy to fix, not like other users of bio_clone, I think
> we still need
> to apply the max pages limit globally, what do you think of it?

Ugh, yeah, bouncing is an issue; with that counted, it's most of the drivers.

Ok, yeah I guess the global limit makes sense for now.

Please add a giant comment explaining why the global limit is there, precisely
what it's needed for and that we'd like to get rid of it.
Eric Wheeler April 7, 2016, 1:36 a.m. UTC | #26
On Wed, 6 Apr 2016, Ming Lei wrote:

> After arbitrary bio size is supported, the incoming bio may
> be very big. We have to split the bio into small bios so that
> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> as bio_clone().
> 
> This patch fixes the following kernel crash:
> 
> > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
> > 0000000000000028
> > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
> > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> > [  172.660399] Oops: 0000 [#1] SMP
> > [...]
> > [  172.664780] Call Trace:
> > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
> > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
> > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
> 
> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
> Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
> Cc: stable@vger.kernel.org (4.2+)

Ming Lei,

get_max_io_size doesn't appear until 4.5 based on a quick LXR check, so I
don't think we will see it in stable.

It would be nice to see this fixed in 4.1 (if affected there).  Is there 
another place this could be applied to be a bit more backward compatible?

-Eric


--
Eric Wheeler

	

> Cc: Shaohua Li <shli@fb.com>
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
> I can reproduce the issue and verify the fix by the following approach:
> 	- create one raid1 over two virtio-blk 
> 	- build bcache device over the above raid1 and another cache device.
> 	- set cache mode as writeback
> 	- run random write over ext4 on the bcache device
> 	- then the crash can be triggered
> 
>  block/blk-merge.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 2613531..9a8651f 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -79,6 +79,18 @@ static inline unsigned get_max_io_size(struct request_queue *q,
>  	/* aligned to logical block size */
>  	sectors &= ~(mask >> 9);
>  
> +	/*
> +	 * With arbitrary bio size, the incoming bio may be very big.
> +	 * We have to split the bio into small bios so that each holds
> +	 * at most BIO_MAX_PAGES bvecs for safety reason, such as
> +	 * bio_clone().
> +	 *
> +	 * In the future, the limit might be converted into per-queue
> +	 * flag.
> +	 */
> +	sectors = min_t(unsigned, sectors, BIO_MAX_PAGES <<
> +			(PAGE_CACHE_SHIFT - 9));
> +
>  	return sectors;
>  }
>  
> -- 
> 1.9.1
> 
> 
Eric Wheeler April 7, 2016, 1:48 a.m. UTC | #27
On Wed, 6 Apr 2016, Ming Lei wrote:

> On Wed, Apr 6, 2016 at 1:44 AM, Ming Lei <ming.lei@canonical.com> wrote:
> > After arbitrary bio size is supported, the incoming bio may
> > be very big. We have to split the bio into small bios so that
> > each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> > as bio_clone().
> >
> > This patch fixes the following kernel crash:
> >
> >> [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
> >> 0000000000000028
> >> [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
> >> [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> >> [  172.660399] Oops: 0000 [#1] SMP
> >> [...]
> >> [  172.664780] Call Trace:
> >> [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> >> [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> >> [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> >> [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> >> [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
> >> [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
> >> [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
> >
> > Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
> > Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
> > Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
> > Cc: stable@vger.kernel.org (4.2+)
> > Cc: Shaohua Li <shli@fb.com>
> > Signed-off-by: Ming Lei <ming.lei@canonical.com>
> > ---
> > I can reproduce the issue and verify the fix by the following approach:
> >         - create one raid1 over two virtio-blk
> >         - build bcache device over the above raid1 and another cache device.
> >         - set cache mode as writeback
> >         - run random write over ext4 on the bcache device
> >         - then the crash can be triggered
> 
> For anyone who is interested in the issue/fix, I forgot to mention:
> 
>        The bucket size should be set bigger than 1M when making the bcache device.
>        In my test, the bucket size is 2M.

Does the bucket size dictate the ideal cached data size, or is it just an 
optimization for erase block boundaries on the SSD?  

Are reads/writes smaller than the bucket size still cached effectively, or 
does a 2MB bucket slurp up 2MB of backing data along with it?  

For example, if 64k is our ideal IO size, should we use 64k buckets?
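
(Side note on the 1M threshold mentioned above: if I read the patch right, 
with 4 KiB pages the BIO_MAX_PAGES cap works out to exactly 1 MiB, so 
buckets above 1M are what produce bios too big to clone safely.  A quick 
user-space sketch of the arithmetic, assuming 4 KiB pages and 512-byte 
sectors:

/* Back-of-the-envelope check only, not kernel code: with 4 KiB pages,
 * BIO_MAX_PAGES (256) caps a safely clonable bio at 1 MiB. */
#include <stdio.h>

#define BIO_MAX_PAGES     256
#define PAGE_CACHE_SHIFT  12	/* assumption: 4 KiB pages */

int main(void)
{
	/* Same arithmetic as the patch; a sector is 512 bytes (1 << 9). */
	unsigned max_sectors = BIO_MAX_PAGES << (PAGE_CACHE_SHIFT - 9);

	printf("cap = %u sectors = %u KiB\n", max_sectors, max_sectors / 2);
	/* Prints: cap = 2048 sectors = 1024 KiB, i.e. 1 MiB. */
	return 0;
}

So a 2M bucket is comfortably past the cap, which matches the reproducer.)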

--
Eric Wheeler

Ming Lei April 7, 2016, 1:49 a.m. UTC | #28
On Thu, Apr 7, 2016 at 9:36 AM, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> On Wed, 6 Apr 2016, Ming Lei wrote:
>
>> After arbitrary bio size is supported, the incoming bio may
>> be very big. We have to split the bio into small bios so that
>> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
>> as bio_clone().
>>
>> This patch fixes the following kernel crash:
>>
>> > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
>> > 0000000000000028
>> > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
>> > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
>> > [  172.660399] Oops: 0000 [#1] SMP
>> > [...]
>> > [  172.664780] Call Trace:
>> > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
>> > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
>> > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
>> > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
>> > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
>> > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
>> > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
>>
>> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
>> Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
>> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
>> Cc: stable@vger.kernel.org (4.2+)
>
> Ming Lei,
>
> get_max_io_size doesn't appear until 4.5 based on a quick LXR check so we
> won't see it in stable I don't think.
>
> It would be nice to see this fixed in 4.1 (if affected there).  Is there

The issue was introduced in v4.3 via 54efd50

> another place this could be applied to be a bit more backward compatible?

The v1 doesn't change get_max_io_size(), so it should be simple enough
to backport to earlier stable kernels; please try it:

http://marc.info/?l=linux-block&m=145991422422927&w=2
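
Roughly, the idea in v1 is to split so that no piece carries more than 
BIO_MAX_PAGES bvecs.  A user-space sketch just to show the shape of it 
(illustrative names only, not the kernel API):

/* Illustrative only: split a run of page-sized segments so that no
 * piece holds more than BIO_MAX_PAGES of them, which is the invariant
 * the v1 patch enforces inside blk_bio_segment_split(). */
#include <stdio.h>

#define BIO_MAX_PAGES 256

static void split_segments(unsigned total_segs)
{
	unsigned piece = 0;

	while (total_segs) {
		unsigned n = total_segs > BIO_MAX_PAGES ?
				BIO_MAX_PAGES : total_segs;

		printf("piece %u: %u segments\n", piece++, n);
		total_segs -= n;
	}
}

int main(void)
{
	split_segments(600);	/* a 600-page bio -> 256 + 256 + 88 */
	return 0;
}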

Thanks,

Eric Wheeler April 7, 2016, 1:56 a.m. UTC | #29
On Thu, 7 Apr 2016, Ming Lei wrote:

> On Thu, Apr 7, 2016 at 9:36 AM, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > On Wed, 6 Apr 2016, Ming Lei wrote:
> >
> >> After arbitrary bio size is supported, the incoming bio may
> >> be very big. We have to split the bio into small bios so that
> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> >> as bio_clone().
> >>
> >> This patch fixes the following kernel crash:
> >>
> >> > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
> >> > 0000000000000028
> >> > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
> >> > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> >> > [  172.660399] Oops: 0000 [#1] SMP
> >> > [...]
> >> > [  172.664780] Call Trace:
> >> > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> >> > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> >> > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> >> > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> >> > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
> >> > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
> >> > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
> >>
> >> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
> >> Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
> >> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
> >> Cc: stable@vger.kernel.org (4.2+)
> >
> > Ming Lei,
> >
> > get_max_io_size doesn't appear until 4.5 based on a quick LXR check so we
> > won't see it in stable I don't think.
> >
> > It would be nice to see this fixed in 4.1 (if affected there).  Is there
> 
> The issue should be introduced to v4.3 via 54efd50
> 
> > another place this could be applied to be a bit more backward compatible?
> 
> The v1 needn't change to get_max_io_size(), and it should be simple enough
> to backport to previous stables, please try it:
> 
> http://marc.info/?l=linux-block&m=145991422422927&w=2

V1 changes blk_bio_segment_split(), which doesn't exist until 4.3.

How might you port this to v4.1.y?

--
Eric Wheeler
Ming Lei April 7, 2016, 2:16 a.m. UTC | #30
On Thu, Apr 7, 2016 at 9:56 AM, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> On Thu, 7 Apr 2016, Ming Lei wrote:
>
>> On Thu, Apr 7, 2016 at 9:36 AM, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
>> > On Wed, 6 Apr 2016, Ming Lei wrote:
>> >
>> >> After arbitrary bio size is supported, the incoming bio may
>> >> be very big. We have to split the bio into small bios so that
>> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
>> >> as bio_clone().
>> >>
>> >> This patch fixes the following kernel crash:
>> >>
>> >> > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
>> >> > 0000000000000028
>> >> > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
>> >> > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
>> >> > [  172.660399] Oops: 0000 [#1] SMP
>> >> > [...]
>> >> > [  172.664780] Call Trace:
>> >> > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
>> >> > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
>> >> > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
>> >> > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
>> >> > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
>> >> > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
>> >> > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
>> >>
>> >> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
>> >> Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
>> >> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
>> >> Cc: stable@vger.kernel.org (4.2+)
>> >
>> > Ming Lei,
>> >
>> > get_max_io_size doesn't appear until 4.5 based on a quick LXR check so we
>> > won't see it in stable I don't think.
>> >
>> > It would be nice to see this fixed in 4.1 (if affected there).  Is there
>>
>> The issue should be introduced to v4.3 via 54efd50
>>
>> > another place this could be applied to be a bit more backward compatible?
>>
>> The v1 needn't change to get_max_io_size(), and it should be simple enough
>> to backport to previous stables, please try it:
>>
>> http://marc.info/?l=linux-block&m=145991422422927&w=2
>
> V1 changes blk_bio_segment_split() which doesn't exist until 4.3.
>
> How might you port this to v4.1.y?

Can you actually reproduce the issue on v4.1?

You mentioned there are three reports:

> [2014-02-04] https://bugzilla.redhat.com/show_bug.cgi?id=1061339
> [2016-01-13] http://www.spinics.net/lists/linux-bcache/msg03335.html
             https://bugzilla.kernel.org/show_bug.cgi?id=110771
> [2016-03-25] http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3607 [this thread]
             https://bugzilla.kernel.org/show_bug.cgi?id=114871

The first one was fixed by '8423ae3 block: Fix cloning of
discard/write same bios', as mentioned in the bugzilla.

The other two were reported on v4.4 and v4.5.

If you look at the patch description, the fix is only needed for 4.3+.

Or am I wrong?


Thanks,

Eric Wheeler April 7, 2016, 11:29 p.m. UTC | #31
On Thu, 7 Apr 2016, Ming Lei wrote:

> On Thu, Apr 7, 2016 at 9:56 AM, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > On Thu, 7 Apr 2016, Ming Lei wrote:
> >
> >> On Thu, Apr 7, 2016 at 9:36 AM, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> >> > On Wed, 6 Apr 2016, Ming Lei wrote:
> >> >
> >> >> After arbitrary bio size is supported, the incoming bio may
> >> >> be very big. We have to split the bio into small bios so that
> >> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> >> >> as bio_clone().
> >> >>
> >> >> This patch fixes the following kernel crash:
> >> >>
> >> >> > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
> >> >> > 0000000000000028
> >> >> > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
> >> >> > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> >> >> > [  172.660399] Oops: 0000 [#1] SMP
> >> >> > [...]
> >> >> > [  172.664780] Call Trace:
> >> >> > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> >> >> > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> >> >> > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> >> >> > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> >> >> > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
> >> >> > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
> >> >> > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
> >> >>
> >> >> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
> >> >> Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
> >> >> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
> >> >> Cc: stable@vger.kernel.org (4.2+)
> >> >
> >> > Ming Lei,
> >> >
> >> > get_max_io_size doesn't appear until 4.5 based on a quick LXR check so we
> >> > won't see it in stable I don't think.
> >> >
> >> > It would be nice to see this fixed in 4.1 (if affected there).  Is there
> >>
> >> The issue should be introduced to v4.3 via 54efd50
> >>
> >> > another place this could be applied to be a bit more backward compatible?
> >>
> >> The v1 needn't change to get_max_io_size(), and it should be simple enough
> >> to backport to previous stables, please try it:
> >>
> >> http://marc.info/?l=linux-block&m=145991422422927&w=2
> >
> > V1 changes blk_bio_segment_split() which doesn't exist until 4.3.
> >
> > How might you port this to v4.1.y?
> 
> Can you see the issue with v4.1?
> 
> You mentioned there are three reports:
> 
> > [2014-02-04] https://bugzilla.redhat.com/show_bug.cgi?id=1061339
> > [2016-01-13] http://www.spinics.net/lists/linux-bcache/msg03335.html
>              https://bugzilla.kernel.org/show_bug.cgi?id=110771
> > [2016-03-25] http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3607 [this thread]
>              https://bugzilla.kernel.org/show_bug.cgi?id=114871
> 
> The first one has been fixed by '8423ae3 block: Fix cloning of
> discard/write same bios', as mentioned in bugzilla.
> 
> The other two are reported on v4.4 and v4.5.
> 
> If you look at the patch description, it is just needed for 4.3+.


Oh, that could be, but just to be sure:

I had thought this might be an old issue, since the first mention of 
this backtrace (though without bcache) was on 3.14 back in 2014, based 
on this post:
  https://bugzilla.redhat.com/show_bug.cgi?id=1061339

Is this relevant?

--
Eric Wheeler
Ming Lei April 8, 2016, 12:21 a.m. UTC | #32
On Fri, Apr 8, 2016 at 7:29 AM, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> On Thu, 7 Apr 2016, Ming Lei wrote:
>
>> On Thu, Apr 7, 2016 at 9:56 AM, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
>> > On Thu, 7 Apr 2016, Ming Lei wrote:
>> >
>> >> On Thu, Apr 7, 2016 at 9:36 AM, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
>> >> > On Wed, 6 Apr 2016, Ming Lei wrote:
>> >> >
>> >> >> After arbitrary bio size is supported, the incoming bio may
>> >> >> be very big. We have to split the bio into small bios so that
>> >> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
>> >> >> as bio_clone().
>> >> >>
>> >> >> This patch fixes the following kernel crash:
>> >> >>
>> >> >> > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
>> >> >> > 0000000000000028
>> >> >> > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
>> >> >> > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
>> >> >> > [  172.660399] Oops: 0000 [#1] SMP
>> >> >> > [...]
>> >> >> > [  172.664780] Call Trace:
>> >> >> > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
>> >> >> > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
>> >> >> > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
>> >> >> > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
>> >> >> > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
>> >> >> > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
>> >> >> > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
>> >> >>
>> >> >> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
>> >> >> Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
>> >> >> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
>> >> >> Cc: stable@vger.kernel.org (4.2+)
>> >> >
>> >> > Ming Lei,
>> >> >
>> >> > get_max_io_size doesn't appear until 4.5 based on a quick LXR check so we
>> >> > won't see it in stable I don't think.
>> >> >
>> >> > It would be nice to see this fixed in 4.1 (if affected there).  Is there
>> >>
>> >> The issue should be introduced to v4.3 via 54efd50
>> >>
>> >> > another place this could be applied to be a bit more backward compatible?
>> >>
>> >> The v1 needn't change to get_max_io_size(), and it should be simple enough
>> >> to backport to previous stables, please try it:
>> >>
>> >> http://marc.info/?l=linux-block&m=145991422422927&w=2
>> >
>> > V1 changes blk_bio_segment_split() which doesn't exist until 4.3.
>> >
>> > How might you port this to v4.1.y?
>>
>> Can you see the issue with v4.1?
>>
>> You mentioned there are three reports:
>>
>> > [2014-02-04] https://bugzilla.redhat.com/show_bug.cgi?id=1061339
>> > [2016-01-13] http://www.spinics.net/lists/linux-bcache/msg03335.html
>>              https://bugzilla.kernel.org/show_bug.cgi?id=110771
>> > [2016-03-25] http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3607 [this thread]
>>              https://bugzilla.kernel.org/show_bug.cgi?id=114871
>>
>> The first one has been fixed by '8423ae3 block: Fix cloning of
>> discard/write same bios', as mentioned in bugzilla.
>>
>> The other two are reported on v4.4 and v4.5.
>>
>> If you look at the patch description, it is just needed for 4.3+.
>
>
> Oh, that could be---but just to be sure:
>
> I had thought perhaps this was an old issue since the first mention of
> this backtrace (but not bcache) was in 3.14 back in 2014 based on this
> post:
>   https://bugzilla.redhat.com/show_bug.cgi?id=1061339
>
> Is this relevant?

Eric,

That is a different bio_clone() issue, which was fixed by commit
'8423ae3 block: Fix cloning of discard/write same bios'.

If you read the bugzilla comment and my reply, you should already have
the answer.

I am sorry to have to answer the same question again and again...

diff mbox

Patch

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2613531..9a8651f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -79,6 +79,18 @@  static inline unsigned get_max_io_size(struct request_queue *q,
 	/* aligned to logical block size */
 	sectors &= ~(mask >> 9);
 
+	/*
+	 * With arbitrary bio size, the incoming bio may be very big.
+	 * We have to split the bio into small bios so that each holds
+	 * at most BIO_MAX_PAGES bvecs for safety reason, such as
+	 * bio_clone().
+	 *
+	 * In the future, the limit might be converted into per-queue
+	 * flag.
+	 */
+	sectors = min_t(unsigned, sectors, BIO_MAX_PAGES <<
+			(PAGE_CACHE_SHIFT - 9));
+
 	return sectors;
 }
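
For readers following the arithmetic, here is a minimal user-space 
approximation of the capped expression above (assuming 4 KiB pages; this 
min_t() is a simplified stand-in for the kernel macro, which compares both 
operands after casting them to the named type):

/* Simplified stand-in for the kernel's min_t(): cast both operands to
 * the named type and take the smaller one. */
#include <stdio.h>

#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

#define BIO_MAX_PAGES     256
#define PAGE_CACHE_SHIFT  12	/* assumption: 4 KiB pages */

int main(void)
{
	/* 8192 sectors of 512 bytes each, i.e. a 4 MiB bio. */
	unsigned sectors = 8192;

	sectors = min_t(unsigned, sectors,
			BIO_MAX_PAGES << (PAGE_CACHE_SHIFT - 9));
	printf("%u\n", sectors);	/* 2048: capped at 1 MiB */
	return 0;
}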