diff mbox series

mkfs.xfs: sanity check stripe geometry from blkid

Message ID a673fbd3-5038-2dc8-8135-a58c24042734@redhat.com (mailing list archive)
State New, archived
Headers show
Series mkfs.xfs: sanity check stripe geometry from blkid | expand

Commit Message

Eric Sandeen May 15, 2020, 7:14 p.m. UTC
We validate commandline options for stripe unit and stripe width, and
if a device returns nonsensical values via libblkid, the superbock write
verifier will eventually catch it and fail (noisily and cryptically) but
it seems a bit cleaner to just do a basic sanity check on the numbers
as soon as we get them from blkid, and if they're bogus, ignore them from
the start.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

Comments

Darrick J. Wong May 15, 2020, 8:48 p.m. UTC | #1
On Fri, May 15, 2020 at 02:14:17PM -0500, Eric Sandeen wrote:
> We validate commandline options for stripe unit and stripe width, and
> if a device returns nonsensical values via libblkid, the superbock write
> verifier will eventually catch it and fail (noisily and cryptically) but
> it seems a bit cleaner to just do a basic sanity check on the numbers
> as soon as we get them from blkid, and if they're bogus, ignore them from
> the start.
> 
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> ---
> 
> diff --git a/libfrog/topology.c b/libfrog/topology.c
> index b1b470c9..38ed03b7 100644
> --- a/libfrog/topology.c
> +++ b/libfrog/topology.c
> @@ -213,6 +213,19 @@ static void blkid_get_topology(
>  	val = blkid_topology_get_optimal_io_size(tp);
>  	*swidth = val;
>  
> +        /*
> +	 * Occasionally, firmware is broken and returns optimal < minimum,
> +	 * or optimal which is not a multiple of minimum.
> +	 * In that case just give up and set both to zero, we can't trust
> +	 * information from this device. Similar to xfs_validate_sb_common().
> +	 */
> +        if (*sunit) {
> +                if ((*sunit > *swidth) || (*swidth % *sunit != 0)) {

I feel like we're copypasting this sunit/swidth checking logic all over
xfsprogs and yet we're still losing the stripe unit validation whackamole
game.

In the end, we want to check more or less the same things for each pair
of stripe unit and stripe width:

 * integer overflows of either value
 * sunit and swidth alignment wrt sector size
 * if either sunit or swidth are zero, both should be zero
 * swidth must be a multiple of sunit

All four of these rules apply to the blkid_get_toplogy answers for the
data device, the log device, and the realtime device; and any mkfs CLI
overrides of those values.

IOWs, is there some way to refactor those four rules into a single
validation function and call that in the six(ish) places we need it?
Especially since you're the one who played the last round of whackamole,
back in May 2018. :)

--D

> +                        *sunit = 0;
> +                        *swidth = 0;
> +                }
> +        }
> +
>  	/*
>  	 * If the reported values are the same as the physical sector size
>  	 * do not bother to report anything.  It will only cause warnings
>
Eric Sandeen May 15, 2020, 8:54 p.m. UTC | #2
On 5/15/20 3:48 PM, Darrick J. Wong wrote:
> On Fri, May 15, 2020 at 02:14:17PM -0500, Eric Sandeen wrote:
>> We validate commandline options for stripe unit and stripe width, and
>> if a device returns nonsensical values via libblkid, the superbock write
>> verifier will eventually catch it and fail (noisily and cryptically) but
>> it seems a bit cleaner to just do a basic sanity check on the numbers
>> as soon as we get them from blkid, and if they're bogus, ignore them from
>> the start.
>>
>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>> ---
>>
>> diff --git a/libfrog/topology.c b/libfrog/topology.c
>> index b1b470c9..38ed03b7 100644
>> --- a/libfrog/topology.c
>> +++ b/libfrog/topology.c
>> @@ -213,6 +213,19 @@ static void blkid_get_topology(
>>  	val = blkid_topology_get_optimal_io_size(tp);
>>  	*swidth = val;
>>  
>> +        /*
>> +	 * Occasionally, firmware is broken and returns optimal < minimum,
>> +	 * or optimal which is not a multiple of minimum.
>> +	 * In that case just give up and set both to zero, we can't trust
>> +	 * information from this device. Similar to xfs_validate_sb_common().
>> +	 */
>> +        if (*sunit) {
>> +                if ((*sunit > *swidth) || (*swidth % *sunit != 0)) {
> 
> I feel like we're copypasting this sunit/swidth checking logic all over
> xfsprogs 

That's because we are!

> and yet we're still losing the stripe unit validation whackamole
> game.

Need moar hammers!

> In the end, we want to check more or less the same things for each pair
> of stripe unit and stripe width:
> 
>  * integer overflows of either value
>  * sunit and swidth alignment wrt sector size
>  * if either sunit or swidth are zero, both should be zero
>  * swidth must be a multiple of sunit
> 
> All four of these rules apply to the blkid_get_toplogy answers for the
> data device, the log device, and the realtime device; and any mkfs CLI
> overrides of those values.
> 
> IOWs, is there some way to refactor those four rules into a single
> validation function and call that in the six(ish) places we need it?
> Especially since you're the one who played the last round of whackamole,
> back in May 2018. :)

So .... I would like to do that refactoring.  I'd also like to fix this
with some expediency, TBH...

Refactoring is going to be a little more complicated, I fear, because sanity
on "what came straight from blkid" is a little different from "what came from
cmdline" and has slightly different checks than "how does it fit into the
superblock we just read?"

This (swidth-vs-sunit-is-borken) is common enough that I wanted to just kill
it with fire, and um ... make it all better/cohesive at some later date.

I don't like arguing for expediency over beauty but well... here I am.

-Eric

> --D
> 
>> +                        *sunit = 0;
>> +                        *swidth = 0;
>> +                }
>> +        }
>> +
>>  	/*
>>  	 * If the reported values are the same as the physical sector size
>>  	 * do not bother to report anything.  It will only cause warnings
>>
>
Eric Sandeen May 15, 2020, 9:06 p.m. UTC | #3
On 5/15/20 3:54 PM, Eric Sandeen wrote:
> On 5/15/20 3:48 PM, Darrick J. Wong wrote:
>> On Fri, May 15, 2020 at 02:14:17PM -0500, Eric Sandeen wrote:
>>> We validate commandline options for stripe unit and stripe width, and
>>> if a device returns nonsensical values via libblkid, the superbock write
>>> verifier will eventually catch it and fail (noisily and cryptically) but
>>> it seems a bit cleaner to just do a basic sanity check on the numbers
>>> as soon as we get them from blkid, and if they're bogus, ignore them from
>>> the start.
>>>
>>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>>> ---
>>>
>>> diff --git a/libfrog/topology.c b/libfrog/topology.c
>>> index b1b470c9..38ed03b7 100644
>>> --- a/libfrog/topology.c
>>> +++ b/libfrog/topology.c
>>> @@ -213,6 +213,19 @@ static void blkid_get_topology(
>>>  	val = blkid_topology_get_optimal_io_size(tp);
>>>  	*swidth = val;
>>>  
>>> +        /*
>>> +	 * Occasionally, firmware is broken and returns optimal < minimum,
>>> +	 * or optimal which is not a multiple of minimum.
>>> +	 * In that case just give up and set both to zero, we can't trust
>>> +	 * information from this device. Similar to xfs_validate_sb_common().
>>> +	 */
>>> +        if (*sunit) {
>>> +                if ((*sunit > *swidth) || (*swidth % *sunit != 0)) {
>>
>> I feel like we're copypasting this sunit/swidth checking logic all over
>> xfsprogs 
> 
> That's because we are!
> 
>> and yet we're still losing the stripe unit validation whackamole
>> game.
> 
> Need moar hammers!
> 
>> In the end, we want to check more or less the same things for each pair
>> of stripe unit and stripe width:
>>
>>  * integer overflows of either value
>>  * sunit and swidth alignment wrt sector size
>>  * if either sunit or swidth are zero, both should be zero
>>  * swidth must be a multiple of sunit
>>
>> All four of these rules apply to the blkid_get_toplogy answers for the
>> data device, the log device, and the realtime device; and any mkfs CLI
>> overrides of those values.
>>
>> IOWs, is there some way to refactor those four rules into a single
>> validation function and call that in the six(ish) places we need it?
>> Especially since you're the one who played the last round of whackamole,
>> back in May 2018. :)
> 
> So .... I would like to do that refactoring.  I'd also like to fix this
> with some expediency, TBH...
> 
> Refactoring is going to be a little more complicated, I fear, because sanity
> on "what came straight from blkid" is a little different from "what came from
> cmdline" and has slightly different checks than "how does it fit into the
> superblock we just read?"
> 
> This (swidth-vs-sunit-is-borken) is common enough that I wanted to just kill
> it with fire, and um ... make it all better/cohesive at some later date.
> 
> I don't like arguing for expediency over beauty but well... here I am.

A little more context, failures like this are caught today, but only
in the validators, and it's very noisy and nonobvious when it happens.
And we seem to keep finding devices that are broken like this, so would be
nice to just ignore them and move on.

-Eric
Darrick J. Wong May 15, 2020, 9:10 p.m. UTC | #4
On Fri, May 15, 2020 at 03:54:34PM -0500, Eric Sandeen wrote:
> On 5/15/20 3:48 PM, Darrick J. Wong wrote:
> > On Fri, May 15, 2020 at 02:14:17PM -0500, Eric Sandeen wrote:
> >> We validate commandline options for stripe unit and stripe width, and
> >> if a device returns nonsensical values via libblkid, the superbock write
> >> verifier will eventually catch it and fail (noisily and cryptically) but
> >> it seems a bit cleaner to just do a basic sanity check on the numbers
> >> as soon as we get them from blkid, and if they're bogus, ignore them from
> >> the start.
> >>
> >> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> >> ---
> >>
> >> diff --git a/libfrog/topology.c b/libfrog/topology.c
> >> index b1b470c9..38ed03b7 100644
> >> --- a/libfrog/topology.c
> >> +++ b/libfrog/topology.c
> >> @@ -213,6 +213,19 @@ static void blkid_get_topology(
> >>  	val = blkid_topology_get_optimal_io_size(tp);
> >>  	*swidth = val;
> >>  
> >> +        /*

Tabs not spaces

> >> +	 * Occasionally, firmware is broken and returns optimal < minimum,
> >> +	 * or optimal which is not a multiple of minimum.
> >> +	 * In that case just give up and set both to zero, we can't trust
> >> +	 * information from this device. Similar to xfs_validate_sb_common().
> >> +	 */
> >> +        if (*sunit) {
> >> +                if ((*sunit > *swidth) || (*swidth % *sunit != 0)) {

Why not combine these?

if (*sunit != 0 && (*sunit > *swidth || *swidth % *sunit != 0)) {

Aside from that the code looks fine I guess...

> > I feel like we're copypasting this sunit/swidth checking logic all over
> > xfsprogs 
> 
> That's because we are!
> 
> > and yet we're still losing the stripe unit validation whackamole
> > game.
> 
> Need moar hammers!
> 
> > In the end, we want to check more or less the same things for each pair
> > of stripe unit and stripe width:
> > 
> >  * integer overflows of either value
> >  * sunit and swidth alignment wrt sector size
> >  * if either sunit or swidth are zero, both should be zero
> >  * swidth must be a multiple of sunit
> > 
> > All four of these rules apply to the blkid_get_toplogy answers for the
> > data device, the log device, and the realtime device; and any mkfs CLI
> > overrides of those values.
> > 
> > IOWs, is there some way to refactor those four rules into a single
> > validation function and call that in the six(ish) places we need it?
> > Especially since you're the one who played the last round of whackamole,
> > back in May 2018. :)
> 
> So .... I would like to do that refactoring.  I'd also like to fix this
> with some expediency, TBH...
> 
> Refactoring is going to be a little more complicated, I fear, because sanity
> on "what came straight from blkid" is a little different from "what came from
> cmdline" and has slightly different checks than "how does it fit into the
> superblock we just read?"

Admittedly I wondered if "refactor all these checks" would fall apart
because each tool has its own slightly different reporting and logging
requirements.  You could make a checker function return an enum of what
it's mad about and each caller could either have a message catalogue or
just bail depending on the circumstances, but now I've probably
overengineered the corner case catching code.

> This (swidth-vs-sunit-is-borken) is common enough that I wanted to just kill
> it with fire, and um ... make it all better/cohesive at some later date.
> 
> I don't like arguing for expediency over beauty but well... here I am.

:(

--D

> -Eric
> 
> > --D
> > 
> >> +                        *sunit = 0;
> >> +                        *swidth = 0;
> >> +                }
> >> +        }
> >> +
> >>  	/*
> >>  	 * If the reported values are the same as the physical sector size
> >>  	 * do not bother to report anything.  It will only cause warnings
> >>
> > 
>
Eric Sandeen May 15, 2020, 9:34 p.m. UTC | #5
On 5/15/20 4:10 PM, Darrick J. Wong wrote:
> On Fri, May 15, 2020 at 03:54:34PM -0500, Eric Sandeen wrote:
>> On 5/15/20 3:48 PM, Darrick J. Wong wrote:
>>> On Fri, May 15, 2020 at 02:14:17PM -0500, Eric Sandeen wrote:
>>>> We validate commandline options for stripe unit and stripe width, and
>>>> if a device returns nonsensical values via libblkid, the superbock write
>>>> verifier will eventually catch it and fail (noisily and cryptically) but
>>>> it seems a bit cleaner to just do a basic sanity check on the numbers
>>>> as soon as we get them from blkid, and if they're bogus, ignore them from
>>>> the start.
>>>>
>>>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>>>> ---
>>>>
>>>> diff --git a/libfrog/topology.c b/libfrog/topology.c
>>>> index b1b470c9..38ed03b7 100644
>>>> --- a/libfrog/topology.c
>>>> +++ b/libfrog/topology.c
>>>> @@ -213,6 +213,19 @@ static void blkid_get_topology(
>>>>  	val = blkid_topology_get_optimal_io_size(tp);
>>>>  	*swidth = val;
>>>>  
>>>> +        /*
> 
> Tabs not spaces

yeah I have no idea how that happened :P

>>>> +	 * Occasionally, firmware is broken and returns optimal < minimum,
>>>> +	 * or optimal which is not a multiple of minimum.
>>>> +	 * In that case just give up and set both to zero, we can't trust
>>>> +	 * information from this device. Similar to xfs_validate_sb_common().
>>>> +	 */
>>>> +        if (*sunit) {
>>>> +                if ((*sunit > *swidth) || (*swidth % *sunit != 0)) {
> 
> Why not combine these?
> 
> if (*sunit != 0 && (*sunit > *swidth || *swidth % *sunit != 0)) {

was making it look a little like the kernel sb checks but *shrug*

> Aside from that the code looks fine I guess...
> 
>>> I feel like we're copypasting this sunit/swidth checking logic all over
>>> xfsprogs 
>>
>> That's because we are!
>>
>>> and yet we're still losing the stripe unit validation whackamole
>>> game.
>>
>> Need moar hammers!
>>
>>> In the end, we want to check more or less the same things for each pair
>>> of stripe unit and stripe width:
>>>
>>>  * integer overflows of either value
>>>  * sunit and swidth alignment wrt sector size
>>>  * if either sunit or swidth are zero, both should be zero
>>>  * swidth must be a multiple of sunit
>>>
>>> All four of these rules apply to the blkid_get_toplogy answers for the
>>> data device, the log device, and the realtime device; and any mkfs CLI
>>> overrides of those values.
>>>
>>> IOWs, is there some way to refactor those four rules into a single
>>> validation function and call that in the six(ish) places we need it?
>>> Especially since you're the one who played the last round of whackamole,
>>> back in May 2018. :)
>>
>> So .... I would like to do that refactoring.  I'd also like to fix this
>> with some expediency, TBH...
>>
>> Refactoring is going to be a little more complicated, I fear, because sanity
>> on "what came straight from blkid" is a little different from "what came from
>> cmdline" and has slightly different checks than "how does it fit into the
>> superblock we just read?"
> 
> Admittedly I wondered if "refactor all these checks" would fall apart
> because each tool has its own slightly different reporting and logging
> requirements.  You could make a checker function return an enum of what
> it's mad about and each caller could either have a message catalogue or
> just bail depending on the circumstances, but now I've probably
> overengineered the corner case catching code.
> 
>> This (swidth-vs-sunit-is-borken) is common enough that I wanted to just kill
>> it with fire, and um ... make it all better/cohesive at some later date.
>>
>> I don't like arguing for expediency over beauty but well... here I am.
> 
> :(

Well, I guess it's not actually that urgent; we don't handle it well
today but maybe I should resist the urge to do another spot-fix that
could(?) be handled better....

i'll put this on the backburner & give it a bit more thought I guess.

Thanks,
-Eric
diff mbox series

Patch

diff --git a/libfrog/topology.c b/libfrog/topology.c
index b1b470c9..38ed03b7 100644
--- a/libfrog/topology.c
+++ b/libfrog/topology.c
@@ -213,6 +213,19 @@  static void blkid_get_topology(
 	val = blkid_topology_get_optimal_io_size(tp);
 	*swidth = val;
 
+        /*
+	 * Occasionally, firmware is broken and returns optimal < minimum,
+	 * or optimal which is not a multiple of minimum.
+	 * In that case just give up and set both to zero, we can't trust
+	 * information from this device. Similar to xfs_validate_sb_common().
+	 */
+        if (*sunit) {
+                if ((*sunit > *swidth) || (*swidth % *sunit != 0)) {
+                        *sunit = 0;
+                        *swidth = 0;
+                }
+        }
+
 	/*
 	 * If the reported values are the same as the physical sector size
 	 * do not bother to report anything.  It will only cause warnings