diff mbox series

[v3,11/11] dm-zoned: ensure only power of 2 zone sizes are allowed

Message ID 20220506081105.29134-12-p.raghav@samsung.com (mailing list archive)
State New, archived
Headers show
Series [v3,01/11] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze | expand

Commit Message

Pankaj Raghav May 6, 2022, 8:11 a.m. UTC
From: Luis Chamberlain <mcgrof@kernel.org>

Today dm-zoned relies on the assumption that you have a zone size
with a power of 2. Even though the block layer today enforces this
requirement, these devices do exist and so provide a stop-gap measure
to ensure these devices cannot be used by mistake

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/md/dm-zone.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Damien Le Moal May 6, 2022, 3:41 p.m. UTC | #1
On 2022/05/06 17:11, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Today dm-zoned relies on the assumption that you have a zone size
> with a power of 2. Even though the block layer today enforces this
> requirement, these devices do exist and so provide a stop-gap measure
> to ensure these devices cannot be used by mistake
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/md/dm-zone.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 3e7b1fe15..27dc4ddf2 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
>  	struct request_queue *q = md->queue;
>  	unsigned int noio_flag;
>  	int ret;
> +	struct block_device *bdev = md->disk->part0;
> +	sector_t zone_sectors;
> +	char bname[BDEVNAME_SIZE];
> +
> +	zone_sectors = bdev_zone_sectors(bdev);
> +
> +	if (!is_power_of_2(zone_sectors)) {
> +		DMWARN("%s: %s only power of two zone size supported\n",
> +		       dm_device_name(md),
> +		       bdevname(bdev, bname));
> +		return 1;

return -EINVAL;

The error propagates to dm_table_set_restrictions() so a proper error code must
be returned.


> +	}
>  
>  	/*
>  	 * Check if something changed. If yes, cleanup the current resources
Pankaj Raghav May 9, 2022, 11:03 a.m. UTC | #2
>> ---
>>  drivers/md/dm-zone.c | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
>> index 3e7b1fe15..27dc4ddf2 100644
>> --- a/drivers/md/dm-zone.c
>> +++ b/drivers/md/dm-zone.c
>> @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
>>  	struct request_queue *q = md->queue;
>>  	unsigned int noio_flag;
>>  	int ret;
>> +	struct block_device *bdev = md->disk->part0;
>> +	sector_t zone_sectors;
>> +	char bname[BDEVNAME_SIZE];
>> +
>> +	zone_sectors = bdev_zone_sectors(bdev);
>> +
>> +	if (!is_power_of_2(zone_sectors)) {
>> +		DMWARN("%s: %s only power of two zone size supported\n",
>> +		       dm_device_name(md),
>> +		       bdevname(bdev, bname));
>> +		return 1;
> 
> return -EINVAL;
> 
> The error propagates to dm_table_set_restrictions() so a proper error code must
> be returned.
> 
Good point. I will add this in the next rev.
Mike Snitzer May 9, 2022, 4:05 p.m. UTC | #3
On Mon, May 09 2022 at  7:03P -0400,
Pankaj Raghav <p.raghav@samsung.com> wrote:

> >> ---
> >>  drivers/md/dm-zone.c | 12 ++++++++++++
> >>  1 file changed, 12 insertions(+)
> >>
> >> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> >> index 3e7b1fe15..27dc4ddf2 100644
> >> --- a/drivers/md/dm-zone.c
> >> +++ b/drivers/md/dm-zone.c
> >> @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
> >>  	struct request_queue *q = md->queue;
> >>  	unsigned int noio_flag;
> >>  	int ret;
> >> +	struct block_device *bdev = md->disk->part0;
> >> +	sector_t zone_sectors;
> >> +	char bname[BDEVNAME_SIZE];
> >> +
> >> +	zone_sectors = bdev_zone_sectors(bdev);
> >> +
> >> +	if (!is_power_of_2(zone_sectors)) {
> >> +		DMWARN("%s: %s only power of two zone size supported\n",
> >> +		       dm_device_name(md),
> >> +		       bdevname(bdev, bname));
> >> +		return 1;
> > 
> > return -EINVAL;
> > 
> > The error propagates to dm_table_set_restrictions() so a proper error code must
> > be returned.
> > 
> Good point. I will add this in the next rev.

Also, DMWARN already provides the trailing newline, so please remove
the above newline.
David Sterba May 9, 2022, 6:54 p.m. UTC | #4
On Fri, May 06, 2022 at 10:11:05AM +0200, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Today dm-zoned relies on the assumption that you have a zone size
> with a power of 2. Even though the block layer today enforces this
> requirement, these devices do exist and so provide a stop-gap measure
> to ensure these devices cannot be used by mistake
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/md/dm-zone.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 3e7b1fe15..27dc4ddf2 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
>  	struct request_queue *q = md->queue;
>  	unsigned int noio_flag;
>  	int ret;
> +	struct block_device *bdev = md->disk->part0;
> +	sector_t zone_sectors;
> +	char bname[BDEVNAME_SIZE];
> +
> +	zone_sectors = bdev_zone_sectors(bdev);
> +
> +	if (!is_power_of_2(zone_sectors)) {

is_power_of_2 takes 'unsigned long' and sector_t is u64, so this is not
32bit clean and we had an actual bug where value 1<<48 was not
recognized as power of 2.

> +		DMWARN("%s: %s only power of two zone size supported\n",
> +		       dm_device_name(md),
> +		       bdevname(bdev, bname));
> +		return 1;
> +	}
>  
>  	/*
>  	 * Check if something changed. If yes, cleanup the current resources
> -- 
> 2.25.1
Pankaj Raghav May 11, 2022, 2:39 p.m. UTC | #5
Hi David,

On 2022-05-09 20:54, David Sterba wrote:>> diff --git
a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
>> index 3e7b1fe15..27dc4ddf2 100644
>> --- a/drivers/md/dm-zone.c
>> +++ b/drivers/md/dm-zone.c
>> @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
>>  	struct request_queue *q = md->queue;
>>  	unsigned int noio_flag;
>>  	int ret;
>> +	struct block_device *bdev = md->disk->part0;
>> +	sector_t zone_sectors;
>> +	char bname[BDEVNAME_SIZE];
>> +
>> +	zone_sectors = bdev_zone_sectors(bdev);
>> +
>> +	if (!is_power_of_2(zone_sectors)) {
> 
> is_power_of_2 takes 'unsigned long' and sector_t is u64, so this is not
> 32bit clean and we had an actual bug where value 1<<48 was not
> recognized as power of 2.
> 
Good catch. Now I understand why btrfs has a helper for is_power_of_two_u64.

But the zone size can never be more than 32bit value so the zone size
sect will never greater than unsigned long.

With that said, we have two options:

1.) We can put a comment explaining that even though it is 32 bit
unsafe, zone size sect can never be a 32bit value

or

2) We should move the btrfs only helper `is_power_of_two_u64` to some
common header and use it everywhere.

Let me know your thoughts.
David Sterba May 11, 2022, 4 p.m. UTC | #6
On Wed, May 11, 2022 at 04:39:17PM +0200, Pankaj Raghav wrote:
> Hi David,
> 
> On 2022-05-09 20:54, David Sterba wrote:>> diff --git
> a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> >> index 3e7b1fe15..27dc4ddf2 100644
> >> --- a/drivers/md/dm-zone.c
> >> +++ b/drivers/md/dm-zone.c
> >> @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
> >>  	struct request_queue *q = md->queue;
> >>  	unsigned int noio_flag;
> >>  	int ret;
> >> +	struct block_device *bdev = md->disk->part0;
> >> +	sector_t zone_sectors;
> >> +	char bname[BDEVNAME_SIZE];
> >> +
> >> +	zone_sectors = bdev_zone_sectors(bdev);
> >> +
> >> +	if (!is_power_of_2(zone_sectors)) {
> > 
> > is_power_of_2 takes 'unsigned long' and sector_t is u64, so this is not
> > 32bit clean and we had an actual bug where value 1<<48 was not
> > recognized as power of 2.
> > 
> Good catch. Now I understand why btrfs has a helper for is_power_of_two_u64.
> 
> But the zone size can never be more than 32bit value so the zone size
> sect will never greater than unsigned long.

We've set the maximum supported zone size in btrfs to be 8G, which is a
lot and should be sufficient for some time, but this also means that the
value is larger than 32bit maximum. I have actually tested btrfs on top
of such emaulated zoned device via TCMU, so it's not dm-zoned, so it's
up to you to make sure that a silent overflow won't happen.

> With that said, we have two options:
> 
> 1.) We can put a comment explaining that even though it is 32 bit
> unsafe, zone size sect can never be a 32bit value

This is probably part of the protocol and specification of the zoned
devices, the filesystem either accepts the spec or makes some room for
larger values in case it's not too costly.

> or
> 
> 2) We should move the btrfs only helper `is_power_of_two_u64` to some
> common header and use it everywhere.

Yeah, that can be done independently. With some macro magic it can be
made type-safe for any argument while preserving the 'is_power_of_2'
name.
Pankaj Raghav May 12, 2022, 8:27 a.m. UTC | #7
>>>> +	zone_sectors = bdev_zone_sectors(bdev);
>>>> +
>>>> +	if (!is_power_of_2(zone_sectors)) {
>>>
>>> is_power_of_2 takes 'unsigned long' and sector_t is u64, so this is not
>>> 32bit clean and we had an actual bug where value 1<<48 was not
>>> recognized as power of 2.
>>>
>> Good catch. Now I understand why btrfs has a helper for is_power_of_two_u64.
>>
>> But the zone size can never be more than 32bit value so the zone size
>> sect will never greater than unsigned long.
> 
> We've set the maximum supported zone size in btrfs to be 8G, which is a
> lot and should be sufficient for some time, but this also means that the
> value is larger than 32bit maximum. I have actually tested btrfs on top
> of such emaulated zoned device via TCMU, so it's not dm-zoned, so it's
> up to you to make sure that a silent overflow won't happen.
> 

bdev_zone_sectors is used in this case and not the actual size in bytes.
So the zone size need to be 2TB for the sectors value to cross the 32bit
limit. This is likely not an issue in the near future.

>> With that said, we have two options:
>>
>> 1.) We can put a comment explaining that even though it is 32 bit
>> unsafe, zone size sect can never be a 32bit value
> 
> This is probably part of the protocol and specification of the zoned
> devices, the filesystem either accepts the spec or makes some room for
> larger values in case it's not too costly.
> 
>> or
>>
>> 2) We should move the btrfs only helper `is_power_of_two_u64` to some
>> common header and use it everywhere.
> 
> Yeah, that can be done independently. With some macro magic it can be
> made type-safe for any argument while preserving the 'is_power_of_2'
> name.
But I agree with your point that we need a type safe power of 2
implementation in a common header so that we can avoid silent overflows
in 32 bit architectures.

I will keep the change as is in this patch and follow up on the type
safe power of 2 later independently. Thanks.
diff mbox series

Patch

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 3e7b1fe15..27dc4ddf2 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -231,6 +231,18 @@  static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
 	struct request_queue *q = md->queue;
 	unsigned int noio_flag;
 	int ret;
+	struct block_device *bdev = md->disk->part0;
+	sector_t zone_sectors;
+	char bname[BDEVNAME_SIZE];
+
+	zone_sectors = bdev_zone_sectors(bdev);
+
+	if (!is_power_of_2(zone_sectors)) {
+		DMWARN("%s: %s only power of two zone size supported\n",
+		       dm_device_name(md),
+		       bdevname(bdev, bname));
+		return 1;
+	}
 
 	/*
 	 * Check if something changed. If yes, cleanup the current resources