diff mbox series

block: revert to using min_not_zero() when stacking chunk_sectors

Message ID 20201130171805.77712-1-snitzer@redhat.com (mailing list archive)
State New, archived
Headers show
Series block: revert to using min_not_zero() when stacking chunk_sectors | expand

Commit Message

Mike Snitzer Nov. 30, 2020, 5:18 p.m. UTC
chunk_sectors must reflect the most limited of all devices in the IO
stack.

Otherwise malformed IO may result. E.g.: prior to this fix,
->chunk_sectors = lcm_not_zero(8, 128) would result in
blk_max_size_offset() splitting IO at 128 sectors rather than the
required more restrictive 8 sectors.

Fixes: 22ada802ede8 ("block: use lcm_not_zero() when stacking chunk_sectors")
Cc: stable@vger.kernel.org
Reported-by: John Dorminy <jdorminy@redhat.com>
Reported-by: Bruce Johnston <bjohnsto@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-settings.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

John Dorminy Nov. 30, 2020, 8:51 p.m. UTC | #1
I don't think this suffices, as it allows IOs that span max(a,b) chunk
boundaries.

Chunk sectors is defined as "if set, it will prevent merging across
chunk boundaries". Pulling the example from the last change:
It is possible, albeit more unlikely, for a block device to have a non
power-of-2 for chunk_sectors (e.g. 10+2 RAID6 with 128K chunk_sectors,
which results in a full-stripe size of 1280K. This causes the RAID6's
io_opt to be advertised as 1280K, and a stacked device _could_ then be
made to use a blocksize, aka chunk_sectors, that matches non power-of-2
io_opt of underlying RAID6 -- resulting in stacked device's
chunk_sectors being a non power-of-2).

Suppose the stacked device had a block size/chunk_sectors of 256k.
Then, with this change, some IOs issued by the stacked device to the
RAID beneath could span 1280k sector boundaries, and require further
splitting still. I think combining as the GCD is better, since any IO
of size gcd(a,b) definitely spans neither a a-chunk nor a b-chunk
boundary.

But it's possible I'm misunderstanding the purpose of chunk_sectors,
or there should be a check that the one of the two devices' chunk
sizes divides the other.

Thanks.

-John

On Mon, Nov 30, 2020 at 12:18 PM Mike Snitzer <snitzer@redhat.com> wrote:
>
> chunk_sectors must reflect the most limited of all devices in the IO
> stack.
>
> Otherwise malformed IO may result. E.g.: prior to this fix,
> ->chunk_sectors = lcm_not_zero(8, 128) would result in
> blk_max_size_offset() splitting IO at 128 sectors rather than the
> required more restrictive 8 sectors.
>
> Fixes: 22ada802ede8 ("block: use lcm_not_zero() when stacking chunk_sectors")
> Cc: stable@vger.kernel.org
> Reported-by: John Dorminy <jdorminy@redhat.com>
> Reported-by: Bruce Johnston <bjohnsto@redhat.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  block/blk-settings.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 9741d1d83e98..1d9decd4646e 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -547,7 +547,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>
>         t->io_min = max(t->io_min, b->io_min);
>         t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
> -       t->chunk_sectors = lcm_not_zero(t->chunk_sectors, b->chunk_sectors);
> +
> +       if (b->chunk_sectors)
> +               t->chunk_sectors = min_not_zero(t->chunk_sectors,
> +                                               b->chunk_sectors);
>
>         /* Physical block size a multiple of the logical block size? */
>         if (t->physical_block_size & (t->logical_block_size - 1)) {
> --
> 2.15.0
>
Mike Snitzer Nov. 30, 2020, 11:24 p.m. UTC | #2
On Mon, Nov 30 2020 at  3:51pm -0500,
John Dorminy <jdorminy@redhat.com> wrote:

> I don't think this suffices, as it allows IOs that span max(a,b) chunk
> boundaries.
> 
> Chunk sectors is defined as "if set, it will prevent merging across
> chunk boundaries". Pulling the example from the last change:

If you're going to cherry pick a portion of a commit header please
reference the commit id and use quotes or indentation to make it clear
what is being referenced, etc.

> It is possible, albeit more unlikely, for a block device to have a non
> power-of-2 for chunk_sectors (e.g. 10+2 RAID6 with 128K chunk_sectors,
> which results in a full-stripe size of 1280K. This causes the RAID6's
> io_opt to be advertised as 1280K, and a stacked device _could_ then be
> made to use a blocksize, aka chunk_sectors, that matches non power-of-2
> io_opt of underlying RAID6 -- resulting in stacked device's
> chunk_sectors being a non power-of-2).

This was from the header for commit 07d098e6bba ("block: allow
'chunk_sectors' to be non-power-of-2")

> Suppose the stacked device had a block size/chunk_sectors of 256k.

Quite the tangent just to setup an a toy example of say: thinp with 256K
blocksize/chunk_sectors ontop of a RAID6 with a chunk_sectors of 128K
and stripesize of 1280K.

> Then, with this change, some IOs issued by the stacked device to the
> RAID beneath could span 1280k sector boundaries, and require further
> splitting still.
> I think combining as the GCD is better, since any IO
> of size gcd(a,b) definitely spans neither a a-chunk nor a b-chunk
> boundary.

To be clear, you are _not_ saying using lcm_not_zero() is correct.
You're saying that simply reverting block core back to using
min_not_zero() may not be as good as using gcd().

While that may be true (not sure yet) you've now muddied a conservative
fix (that reverts block core back to its longstanding use of
min_not_zero for chunk_sectors) in pursuit of addressing some different
concern than the case that you _really_ care about getting fixed
(I'm inferring based on your regression report):
4K chunk_sectors stacked on larger chunk_sectors, e.g. 256K

My patch fixes the case and doesn't try to innovate, it tries to get
block core back to sane chunk_sectors stacking (which I broke).

> But it's possible I'm misunderstanding the purpose of chunk_sectors,
> or there should be a check that the one of the two devices' chunk
> sizes divides the other.

Seriously not amused by your response, I now have to do damage control
because you have a concern that you really weren't able to communicate
very effectively.

But I got this far, so for your above toy example (stacking 128K and
256K chunk_sectors):
min_not_zero = 128K
gcd = 128K

SO please explain to me why gcd() is better at setting a chunk_sectors
that ensures IO doesn't span 1280K stripesize (nevermind that
chunk_sectors has no meaningful relation to io_opt to begin with!).

Mike


> 
> On Mon, Nov 30, 2020 at 12:18 PM Mike Snitzer <snitzer@redhat.com> wrote:
> >
> > chunk_sectors must reflect the most limited of all devices in the IO
> > stack.
> >
> > Otherwise malformed IO may result. E.g.: prior to this fix,
> > ->chunk_sectors = lcm_not_zero(8, 128) would result in
> > blk_max_size_offset() splitting IO at 128 sectors rather than the
> > required more restrictive 8 sectors.
> >
> > Fixes: 22ada802ede8 ("block: use lcm_not_zero() when stacking chunk_sectors")
> > Cc: stable@vger.kernel.org
> > Reported-by: John Dorminy <jdorminy@redhat.com>
> > Reported-by: Bruce Johnston <bjohnsto@redhat.com>
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > ---
> >  block/blk-settings.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > index 9741d1d83e98..1d9decd4646e 100644
> > --- a/block/blk-settings.c
> > +++ b/block/blk-settings.c
> > @@ -547,7 +547,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> >
> >         t->io_min = max(t->io_min, b->io_min);
> >         t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
> > -       t->chunk_sectors = lcm_not_zero(t->chunk_sectors, b->chunk_sectors);
> > +
> > +       if (b->chunk_sectors)
> > +               t->chunk_sectors = min_not_zero(t->chunk_sectors,
> > +                                               b->chunk_sectors);
> >
> >         /* Physical block size a multiple of the logical block size? */
> >         if (t->physical_block_size & (t->logical_block_size - 1)) {
> > --
> > 2.15.0
> >
>
John Dorminy Dec. 1, 2020, 12:21 a.m. UTC | #3
> If you're going to cherry pick a portion of a commit header please
> reference the commit id and use quotes or indentation to make it clear
> what is being referenced, etc.
Apologies.

> Quite the tangent just to setup an a toy example of say: thinp with 256K
> blocksize/chunk_sectors ontop of a RAID6 with a chunk_sectors of 128K
> and stripesize of 1280K.

I screwed up my math ... many apologies :/

Consider a thinp of chunk_sectors 512K atop a RAID6 with chunk_sectors 1280K.
(Previously, this RAID6 would be disallowed because chunk_sectors
could only be a power of 2, but 07d098e6bba removed this constraint.)
-With lcm_not_zero(), a full-device IO would be split into 2560K IOs,
which obviously spans both 512K and 1280K chunk boundaries.
-With min_not_zero(), a full-device IO would be split into 512K IOs,
some of which would span 1280k chunk boundaries. For instance, one IO
would span from offset 1024K to 1536K.
-With the hypothetical gcd_not_zero(), a full-device IO would be split
into 256K IOs, which span neither 512K nor 1280K chunk boundaries.

>
> To be clear, you are _not_ saying using lcm_not_zero() is correct.
> You're saying that simply reverting block core back to using
> min_not_zero() may not be as good as using gcd().

Assuming my understanding of chunk_sectors is correct -- which as per
blk-settings.c seems to be "a driver will not receive a bio that spans
a chunk_sector boundary, except in single-page cases" -- I believe
using lcm_not_zero() and min_not_zero() can both violate this
requirement. The current lcm_not_zero() is not correct, but also
reverting block core back to using min_not_zero() leaves edge cases as
above.

I believe gcd provides the requirement, but min_not_zero() +
disallowing non-power-of-2 chunk_sectors also provides the
requirement.

> > But it's possible I'm misunderstanding the purpose of chunk_sectors,
> > or there should be a check that the one of the two devices' chunk
> > sizes divides the other.
>
> Seriously not amused by your response, I now have to do damage control
> because you have a concern that you really weren't able to communicate
> very effectively.

Apologies.
Mike Snitzer Dec. 1, 2020, 2:12 a.m. UTC | #4
On Mon, Nov 30 2020 at  7:21pm -0500,
John Dorminy <jdorminy@redhat.com> wrote:

> > If you're going to cherry pick a portion of a commit header please
> > reference the commit id and use quotes or indentation to make it clear
> > what is being referenced, etc.
> Apologies.
> 
> > Quite the tangent just to setup an a toy example of say: thinp with 256K
> > blocksize/chunk_sectors ontop of a RAID6 with a chunk_sectors of 128K
> > and stripesize of 1280K.
> 
> I screwed up my math ... many apologies :/
> 
> Consider a thinp of chunk_sectors 512K atop a RAID6 with chunk_sectors 1280K.
> (Previously, this RAID6 would be disallowed because chunk_sectors
> could only be a power of 2, but 07d098e6bba removed this constraint.)

Think you have your example messed up still.  RAID 10+2 with 128K
chunk_sectors, 1280K full stripe (io_opt). Then thinp stacked ontop of
it with chunk_sectors of 1280K was usecase that wasn't supported before.

So stacked chunk_sectors = min_not_zero(128K, 1280K) = 128K

> -With lcm_not_zero(), a full-device IO would be split into 2560K IOs,
> which obviously spans both 512K and 1280K chunk boundaries.

Sure, think we both agree lcm_not_zero() shouldn't be used.

> -With min_not_zero(), a full-device IO would be split into 512K IOs,
> some of which would span 1280k chunk boundaries. For instance, one IO
> would span from offset 1024K to 1536K.

RAID6 with chunk_sectors of 1280K is pretty insane...
And yet you're saying full device IO is 1280K...
So something still isn't adding up.

Anyway, if we run with your example of chunk_sectors (512K, 1280K), yes
there is serious potential for IO to span the RAID6 layer's chunk_sector
boundary.

> -With the hypothetical gcd_not_zero(), a full-device IO would be split
> into 256K IOs, which span neither 512K nor 1280K chunk boundaries.

Yeap, I see.

> > To be clear, you are _not_ saying using lcm_not_zero() is correct.
> > You're saying that simply reverting block core back to using
> > min_not_zero() may not be as good as using gcd().
> 
> Assuming my understanding of chunk_sectors is correct -- which as per
> blk-settings.c seems to be "a driver will not receive a bio that spans
> a chunk_sector boundary, except in single-page cases" -- I believe
> using lcm_not_zero() and min_not_zero() can both violate this
> requirement. The current lcm_not_zero() is not correct, but also
> reverting block core back to using min_not_zero() leaves edge cases as
> above.

But your chunk_sectors (512K, 1280K) example is a misconfigured IO
stack.  Really not sure it worth being concerned about it.

> I believe gcd provides the requirement, but min_not_zero() +
> disallowing non-power-of-2 chunk_sectors also provides the
> requirement.

Kind of on the fence on this... think I'd like to get Martin's take.

Using gcd() instead of min_not_zero() to stack chunk_sectors isn't a big
deal; given the nature of chunk_sectors coupled with it being able to be
a non-power-of-2 _does_ add a new wrinkle.

So you had a valid point all along, just that you made me work pretty
hard to understand you.

> > > But it's possible I'm misunderstanding the purpose of chunk_sectors,
> > > or there should be a check that the one of the two devices' chunk
> > > sizes divides the other.
> >
> > Seriously not amused by your response, I now have to do damage control
> > because you have a concern that you really weren't able to communicate
> > very effectively.
> 
> Apologies.

Eh, I need to build my pain threshold back up.. been away from it all
for more than a week.. ;)

Mike
diff mbox series

Patch

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 9741d1d83e98..1d9decd4646e 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -547,7 +547,10 @@  int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 
 	t->io_min = max(t->io_min, b->io_min);
 	t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
-	t->chunk_sectors = lcm_not_zero(t->chunk_sectors, b->chunk_sectors);
+
+	if (b->chunk_sectors)
+		t->chunk_sectors = min_not_zero(t->chunk_sectors,
+						b->chunk_sectors);
 
 	/* Physical block size a multiple of the logical block size? */
 	if (t->physical_block_size & (t->logical_block_size - 1)) {