diff mbox series

[6/6] nvme: Add consistency check for zone count

Message ID 20200625122152.17359-7-javier@javigon.com (mailing list archive)
State New, archived
Headers show
Series ZNS: Extra features for current patches | expand

Commit Message

Javier González June 25, 2020, 12:21 p.m. UTC
From: Javier González <javier.gonz@samsung.com>

Since the number of zones is calculated through the reported device
capacity and the ZNS specification allows to report the total number of
zones in the device, add an extra check to guarantee consistency between
the device and the kernel.

Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 drivers/nvme/host/zns.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Matias Bjorling June 25, 2020, 1:16 p.m. UTC | #1
On 25/06/2020 14.21, Javier González wrote:
> From: Javier González <javier.gonz@samsung.com>
>
> Since the number of zones is calculated through the reported device
> capacity and the ZNS specification allows to report the total number of
> zones in the device, add an extra check to guarantee consistency between
> the device and the kernel.
>
> Signed-off-by: Javier González <javier.gonz@samsung.com>
> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>   drivers/nvme/host/zns.c | 7 +++++++
>   1 file changed, 7 insertions(+)
>
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 7d8381fe7665..de806788a184 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>   		sector += ns->zsze * nz;
>   	}
>   
> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
> +				zone_idx, ns->nr_zones);
> +		ret = -EINVAL;
> +		goto out_free;
> +	}
> +
>   	ret = zone_idx;
>   out_free:
>   	kvfree(report);

Sounds like a check for a broken implementation. For implementations in 
the wild that exhibits this behavior, a quirk can be added. This kind of 
check is generally not needed. This can easily be checked by having a 
test case in a validation suite. The kernel should not have to check for it.
Javier González June 25, 2020, 7:45 p.m. UTC | #2
On 25.06.2020 15:16, Matias Bjørling wrote:
>On 25/06/2020 14.21, Javier González wrote:
>>From: Javier González <javier.gonz@samsung.com>
>>
>>Since the number of zones is calculated through the reported device
>>capacity and the ZNS specification allows to report the total number of
>>zones in the device, add an extra check to guarantee consistency between
>>the device and the kernel.
>>
>>Signed-off-by: Javier González <javier.gonz@samsung.com>
>>Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>---
>>  drivers/nvme/host/zns.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>>diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>index 7d8381fe7665..de806788a184 100644
>>--- a/drivers/nvme/host/zns.c
>>+++ b/drivers/nvme/host/zns.c
>>@@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  		sector += ns->zsze * nz;
>>  	}
>>+	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>+		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>+				zone_idx, ns->nr_zones);
>>+		ret = -EINVAL;
>>+		goto out_free;
>>+	}
>>+
>>  	ret = zone_idx;
>>  out_free:
>>  	kvfree(report);
>
>Sounds like a check for a broken implementation. For implementations 
>in the wild that exhibits this behavior, a quirk can be added. This 
>kind of check is generally not needed. This can easily be checked by 
>having a test case in a validation suite. The kernel should not have 
>to check for it.
>

I don't believe it hurts to validate as ZNS provides a method to
retrieve the actual number of zones. It can help people detecting an
issue that can hide for some time.

If the general opinion is that this belongs to a test suite, we can add
it to blktests (already have it there internally). We can also have it
in both places.

Javier
Keith Busch June 25, 2020, 9:49 p.m. UTC | #3
On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>  drivers/nvme/host/zns.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 7d8381fe7665..de806788a184 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  		sector += ns->zsze * nz;
>  	}
>  
> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
> +				zone_idx, ns->nr_zones);
> +		ret = -EINVAL;
> +		goto out_free;
> +	}
> +
>  	ret = zone_idx;

nr_zones is unsigned, so it's never < 0.

The API we're providing doesn't require zone_idx equal the namespace's
nr_zones at the end, though. A subset of the total number of zones can
be requested here.
Damien Le Moal June 26, 2020, 12:04 a.m. UTC | #4
On 2020/06/26 6:49, Keith Busch wrote:
> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>  drivers/nvme/host/zns.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>> index 7d8381fe7665..de806788a184 100644
>> --- a/drivers/nvme/host/zns.c
>> +++ b/drivers/nvme/host/zns.c
>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  		sector += ns->zsze * nz;
>>  	}
>>  
>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>> +				zone_idx, ns->nr_zones);
>> +		ret = -EINVAL;
>> +		goto out_free;
>> +	}
>> +
>>  	ret = zone_idx;
> 
> nr_zones is unsigned, so it's never < 0.
> 
> The API we're providing doesn't require zone_idx equal the namespace's
> nr_zones at the end, though. A subset of the total number of zones can
> be requested here.
> 

Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
reported zone descriptor in the current report range requested by the user,
which is not necessarily for the entire drive (i.e., provided nr zones is less
than the total number of zones of the disk and/or start sector is > 0). So
zone_idx indicates the actual number of zones reported, it is not the total
number of zones of the drive.
Javier González June 26, 2020, 6:13 a.m. UTC | #5
On 26.06.2020 00:04, Damien Le Moal wrote:
>On 2020/06/26 6:49, Keith Busch wrote:
>> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>>  drivers/nvme/host/zns.c | 7 +++++++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>> index 7d8381fe7665..de806788a184 100644
>>> --- a/drivers/nvme/host/zns.c
>>> +++ b/drivers/nvme/host/zns.c
>>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>  		sector += ns->zsze * nz;
>>>  	}
>>>
>>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>> +				zone_idx, ns->nr_zones);
>>> +		ret = -EINVAL;
>>> +		goto out_free;
>>> +	}
>>> +
>>>  	ret = zone_idx;
>>
>> nr_zones is unsigned, so it's never < 0.
>>
>> The API we're providing doesn't require zone_idx equal the namespace's
>> nr_zones at the end, though. A subset of the total number of zones can
>> be requested here.
>>

I did see nr_zones coming with -1; guess it is my compiler.

>
>Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
>reported zone descriptor in the current report range requested by the user,
>which is not necessarily for the entire drive (i.e., provided nr zones is less
>than the total number of zones of the disk and/or start sector is > 0). So
>zone_idx indicates the actual number of zones reported, it is not the total

I see. As I can see, when nr_zones comes undefined I believed we could
assume that zone_idx is absolute, but I can be wrong.

Does it make sense to support this check with an additional counter and
a explicit nr_zones initialization when undefined or you
prefer to just remove it as Matias suggested?

Javier
Damien Le Moal June 26, 2020, 6:49 a.m. UTC | #6
On 2020/06/26 15:13, Javier González wrote:
> On 26.06.2020 00:04, Damien Le Moal wrote:
>> On 2020/06/26 6:49, Keith Busch wrote:
>>> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>>>  drivers/nvme/host/zns.c | 7 +++++++
>>>>  1 file changed, 7 insertions(+)
>>>>
>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>> index 7d8381fe7665..de806788a184 100644
>>>> --- a/drivers/nvme/host/zns.c
>>>> +++ b/drivers/nvme/host/zns.c
>>>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>>  		sector += ns->zsze * nz;
>>>>  	}
>>>>
>>>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>>> +				zone_idx, ns->nr_zones);
>>>> +		ret = -EINVAL;
>>>> +		goto out_free;
>>>> +	}
>>>> +
>>>>  	ret = zone_idx;
>>>
>>> nr_zones is unsigned, so it's never < 0.
>>>
>>> The API we're providing doesn't require zone_idx equal the namespace's
>>> nr_zones at the end, though. A subset of the total number of zones can
>>> be requested here.
>>>
> 
> I did see nr_zones coming with -1; guess it is my compiler.

See include/linux/blkdev.h. -1 is:

#define BLK_ALL_ZONES  ((unsigned int)-1)

Which is documented in block/blk-zoned.c:

/**
 * blkdev_report_zones - Get zones information
 * @bdev:       Target block device
 * @sector:     Sector from which to report zones
 * @nr_zones:   Maximum number of zones to report
 * @cb:         Callback function called for each reported zone
 * @data:       Private data for the callback
 *
 * Description:
 *    Get zone information starting from the zone containing @sector for at most
 *    @nr_zones, and call @cb for each zone reported by the device.
 *    To report all zones in a device starting from @sector, the BLK_ALL_ZONES
 *    constant can be passed to @nr_zones.
 *    Returns the number of zones reported by the device, or a negative errno
 *    value in case of failure.
 *
 *    Note: The caller must use memalloc_noXX_save/restore() calls to control
 *    memory allocations done within this function.
 */
int blkdev_report_zones(struct block_device *bdev, sector_t sector,
                        unsigned int nr_zones, report_zones_cb cb, void *data)

> 
>>
>> Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
>> reported zone descriptor in the current report range requested by the user,
>> which is not necessarily for the entire drive (i.e., provided nr zones is less
>> than the total number of zones of the disk and/or start sector is > 0). So
>> zone_idx indicates the actual number of zones reported, it is not the total
> 
> I see. As I can see, when nr_zones comes undefined I believed we could
> assume that zone_idx is absolute, but I can be wrong.

No. zone_idx is *always* the index of the zone in the current report. Whatever
that report is, regardless of the report starting point and number of zones
requested. E.g. For a single zone report (nr_zones = 1), you will always see
zone_idx = 0. For a full report, zone_idx will correspond to the zone number.
This is used for example in blk_revalidate_disk_zones() to initialize the zone
bitmaps.

> Does it make sense to support this check with an additional counter and
> a explicit nr_zones initialization when undefined or you
> prefer to just remove it as Matias suggested?

The check is not needed at all.

If the device is buggy and reports more zones than the device capacity or any
other bugs, the driver can catch that when it processes the report.
blk_revalidate_disk_zones() also has many checks.
Javier González June 26, 2020, 6:55 a.m. UTC | #7
On 26.06.2020 06:49, Damien Le Moal wrote:
>On 2020/06/26 15:13, Javier González wrote:
>> On 26.06.2020 00:04, Damien Le Moal wrote:
>>> On 2020/06/26 6:49, Keith Busch wrote:
>>>> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>>>>  drivers/nvme/host/zns.c | 7 +++++++
>>>>>  1 file changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>>> index 7d8381fe7665..de806788a184 100644
>>>>> --- a/drivers/nvme/host/zns.c
>>>>> +++ b/drivers/nvme/host/zns.c
>>>>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>>>  		sector += ns->zsze * nz;
>>>>>  	}
>>>>>
>>>>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>>>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>>>> +				zone_idx, ns->nr_zones);
>>>>> +		ret = -EINVAL;
>>>>> +		goto out_free;
>>>>> +	}
>>>>> +
>>>>>  	ret = zone_idx;
>>>>
>>>> nr_zones is unsigned, so it's never < 0.
>>>>
>>>> The API we're providing doesn't require zone_idx equal the namespace's
>>>> nr_zones at the end, though. A subset of the total number of zones can
>>>> be requested here.
>>>>
>>
>> I did see nr_zones coming with -1; guess it is my compiler.
>
>See include/linux/blkdev.h. -1 is:
>
>#define BLK_ALL_ZONES  ((unsigned int)-1)
>
>Which is documented in block/blk-zoned.c:
>
>/**
> * blkdev_report_zones - Get zones information
> * @bdev:       Target block device
> * @sector:     Sector from which to report zones
> * @nr_zones:   Maximum number of zones to report
> * @cb:         Callback function called for each reported zone
> * @data:       Private data for the callback
> *
> * Description:
> *    Get zone information starting from the zone containing @sector for at most
> *    @nr_zones, and call @cb for each zone reported by the device.
> *    To report all zones in a device starting from @sector, the BLK_ALL_ZONES
> *    constant can be passed to @nr_zones.
> *    Returns the number of zones reported by the device, or a negative errno
> *    value in case of failure.
> *
> *    Note: The caller must use memalloc_noXX_save/restore() calls to control
> *    memory allocations done within this function.
> */
>int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>                        unsigned int nr_zones, report_zones_cb cb, void *data)
>
>>
>>>
>>> Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
>>> reported zone descriptor in the current report range requested by the user,
>>> which is not necessarily for the entire drive (i.e., provided nr zones is less
>>> than the total number of zones of the disk and/or start sector is > 0). So
>>> zone_idx indicates the actual number of zones reported, it is not the total
>>
>> I see. As I can see, when nr_zones comes undefined I believed we could
>> assume that zone_idx is absolute, but I can be wrong.
>
>No. zone_idx is *always* the index of the zone in the current report. Whatever
>that report is, regardless of the report starting point and number of zones
>requested. E.g. For a single zone report (nr_zones = 1), you will always see
>zone_idx = 0. For a full report, zone_idx will correspond to the zone number.
>This is used for example in blk_revalidate_disk_zones() to initialize the zone
>bitmaps.
>
>> Does it make sense to support this check with an additional counter and
>> a explicit nr_zones initialization when undefined or you
>> prefer to just remove it as Matias suggested?
>
>The check is not needed at all.
>
>If the device is buggy and reports more zones than the device capacity or any
>other bugs, the driver can catch that when it processes the report.
>blk_revalidate_disk_zones() also has many checks.

I have managed to create a QEMU ZNS device that gave me a headache with
a little bit of extra capacity that triggered an additional zone report.
This was the motivation for the patch.

I will look at the checks to cover this case too then.

Thanks,
Javier
Damien Le Moal June 26, 2020, 7:09 a.m. UTC | #8
On 2020/06/26 15:55, Javier González wrote:
> On 26.06.2020 06:49, Damien Le Moal wrote:
>> On 2020/06/26 15:13, Javier González wrote:
>>> On 26.06.2020 00:04, Damien Le Moal wrote:
>>>> On 2020/06/26 6:49, Keith Busch wrote:
>>>>> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>>>>>  drivers/nvme/host/zns.c | 7 +++++++
>>>>>>  1 file changed, 7 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>>>> index 7d8381fe7665..de806788a184 100644
>>>>>> --- a/drivers/nvme/host/zns.c
>>>>>> +++ b/drivers/nvme/host/zns.c
>>>>>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>>>>  		sector += ns->zsze * nz;
>>>>>>  	}
>>>>>>
>>>>>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>>>>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>>>>> +				zone_idx, ns->nr_zones);
>>>>>> +		ret = -EINVAL;
>>>>>> +		goto out_free;
>>>>>> +	}
>>>>>> +
>>>>>>  	ret = zone_idx;
>>>>>
>>>>> nr_zones is unsigned, so it's never < 0.
>>>>>
>>>>> The API we're providing doesn't require zone_idx equal the namespace's
>>>>> nr_zones at the end, though. A subset of the total number of zones can
>>>>> be requested here.
>>>>>
>>>
>>> I did see nr_zones coming with -1; guess it is my compiler.
>>
>> See include/linux/blkdev.h. -1 is:
>>
>> #define BLK_ALL_ZONES  ((unsigned int)-1)
>>
>> Which is documented in block/blk-zoned.c:
>>
>> /**
>> * blkdev_report_zones - Get zones information
>> * @bdev:       Target block device
>> * @sector:     Sector from which to report zones
>> * @nr_zones:   Maximum number of zones to report
>> * @cb:         Callback function called for each reported zone
>> * @data:       Private data for the callback
>> *
>> * Description:
>> *    Get zone information starting from the zone containing @sector for at most
>> *    @nr_zones, and call @cb for each zone reported by the device.
>> *    To report all zones in a device starting from @sector, the BLK_ALL_ZONES
>> *    constant can be passed to @nr_zones.
>> *    Returns the number of zones reported by the device, or a negative errno
>> *    value in case of failure.
>> *
>> *    Note: The caller must use memalloc_noXX_save/restore() calls to control
>> *    memory allocations done within this function.
>> */
>> int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>>                        unsigned int nr_zones, report_zones_cb cb, void *data)
>>
>>>
>>>>
>>>> Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
>>>> reported zone descriptor in the current report range requested by the user,
>>>> which is not necessarily for the entire drive (i.e., provided nr zones is less
>>>> than the total number of zones of the disk and/or start sector is > 0). So
>>>> zone_idx indicates the actual number of zones reported, it is not the total
>>>
>>> I see. As I can see, when nr_zones comes undefined I believed we could
>>> assume that zone_idx is absolute, but I can be wrong.
>>
>> No. zone_idx is *always* the index of the zone in the current report. Whatever
>> that report is, regardless of the report starting point and number of zones
>> requested. E.g. For a single zone report (nr_zones = 1), you will always see
>> zone_idx = 0. For a full report, zone_idx will correspond to the zone number.
>> This is used for example in blk_revalidate_disk_zones() to initialize the zone
>> bitmaps.
>>
>>> Does it make sense to support this check with an additional counter and
>>> a explicit nr_zones initialization when undefined or you
>>> prefer to just remove it as Matias suggested?
>>
>> The check is not needed at all.
>>
>> If the device is buggy and reports more zones than the device capacity or any
>> other bugs, the driver can catch that when it processes the report.
>> blk_revalidate_disk_zones() also has many checks.
> 
> I have managed to create a QEMU ZNS device that gave me a headache with
> a little bit of extra capacity that triggered an additional zone report.
> This was the motivation for the patch.

The device emulation sound buggy... If the capacity is wrong, then the report
will be too since zones are all supposed to be sequential (no holes between
zones) and up to the disk capacity only (last zone start + len = capacity + 1)

If one or the other is wrong, this should be easy to detect. Normally,
blk_revalidate_disk_zones() should be able to catch that.

> 
> I will look at the checks to cover this case too then.
> 
> Thanks,
> Javier
>
Javier González June 26, 2020, 7:29 a.m. UTC | #9
On 26.06.2020 07:09, Damien Le Moal wrote:
>On 2020/06/26 15:55, Javier González wrote:
>> On 26.06.2020 06:49, Damien Le Moal wrote:
>>> On 2020/06/26 15:13, Javier González wrote:
>>>> On 26.06.2020 00:04, Damien Le Moal wrote:
>>>>> On 2020/06/26 6:49, Keith Busch wrote:
>>>>>> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>>>>>>  drivers/nvme/host/zns.c | 7 +++++++
>>>>>>>  1 file changed, 7 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>>>>> index 7d8381fe7665..de806788a184 100644
>>>>>>> --- a/drivers/nvme/host/zns.c
>>>>>>> +++ b/drivers/nvme/host/zns.c
>>>>>>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>>>>>  		sector += ns->zsze * nz;
>>>>>>>  	}
>>>>>>>
>>>>>>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>>>>>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>>>>>> +				zone_idx, ns->nr_zones);
>>>>>>> +		ret = -EINVAL;
>>>>>>> +		goto out_free;
>>>>>>> +	}
>>>>>>> +
>>>>>>>  	ret = zone_idx;
>>>>>>
>>>>>> nr_zones is unsigned, so it's never < 0.
>>>>>>
>>>>>> The API we're providing doesn't require zone_idx equal the namespace's
>>>>>> nr_zones at the end, though. A subset of the total number of zones can
>>>>>> be requested here.
>>>>>>
>>>>
>>>> I did see nr_zones coming with -1; guess it is my compiler.
>>>
>>> See include/linux/blkdev.h. -1 is:
>>>
>>> #define BLK_ALL_ZONES  ((unsigned int)-1)
>>>
>>> Which is documented in block/blk-zoned.c:
>>>
>>> /**
>>> * blkdev_report_zones - Get zones information
>>> * @bdev:       Target block device
>>> * @sector:     Sector from which to report zones
>>> * @nr_zones:   Maximum number of zones to report
>>> * @cb:         Callback function called for each reported zone
>>> * @data:       Private data for the callback
>>> *
>>> * Description:
>>> *    Get zone information starting from the zone containing @sector for at most
>>> *    @nr_zones, and call @cb for each zone reported by the device.
>>> *    To report all zones in a device starting from @sector, the BLK_ALL_ZONES
>>> *    constant can be passed to @nr_zones.
>>> *    Returns the number of zones reported by the device, or a negative errno
>>> *    value in case of failure.
>>> *
>>> *    Note: The caller must use memalloc_noXX_save/restore() calls to control
>>> *    memory allocations done within this function.
>>> */
>>> int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>>>                        unsigned int nr_zones, report_zones_cb cb, void *data)
>>>
>>>>
>>>>>
>>>>> Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
>>>>> reported zone descriptor in the current report range requested by the user,
>>>>> which is not necessarily for the entire drive (i.e., provided nr zones is less
>>>>> than the total number of zones of the disk and/or start sector is > 0). So
>>>>> zone_idx indicates the actual number of zones reported, it is not the total
>>>>
>>>> I see. As I can see, when nr_zones comes undefined I believed we could
>>>> assume that zone_idx is absolute, but I can be wrong.
>>>
>>> No. zone_idx is *always* the index of the zone in the current report. Whatever
>>> that report is, regardless of the report starting point and number of zones
>>> requested. E.g. For a single zone report (nr_zones = 1), you will always see
>>> zone_idx = 0. For a full report, zone_idx will correspond to the zone number.
>>> This is used for example in blk_revalidate_disk_zones() to initialize the zone
>>> bitmaps.
>>>
>>>> Does it make sense to support this check with an additional counter and
>>>> a explicit nr_zones initialization when undefined or you
>>>> prefer to just remove it as Matias suggested?
>>>
>>> The check is not needed at all.
>>>
>>> If the device is buggy and reports more zones than the device capacity or any
>>> other bugs, the driver can catch that when it processes the report.
>>> blk_revalidate_disk_zones() also has many checks.
>>
>> I have managed to create a QEMU ZNS device that gave me a headache with
>> a little bit of extra capacity that triggered an additional zone report.
>> This was the motivation for the patch.
>
>The device emulation sound buggy... If the capacity is wrong, then the report
>will be too since zones are all supposed to be sequential (no holes between
>zones) and up to the disk capacity only (last zone start + len = capacity + 1)
>
>If one or the other is wrong, this should be easy to detect. Normally,
>blk_revalidate_disk_zones() should be able to catch that.

We have the capability to select the reported device capacity manually
for a number of reasons. One of the different test configurations in our
CI did go through.

But it is OK, I will remove the check on V2.

Javier
Damien Le Moal June 26, 2020, 7:42 a.m. UTC | #10
On 2020/06/26 16:29, Javier González wrote:
> On 26.06.2020 07:09, Damien Le Moal wrote:
>> On 2020/06/26 15:55, Javier González wrote:
>>> On 26.06.2020 06:49, Damien Le Moal wrote:
>>>> On 2020/06/26 15:13, Javier González wrote:
>>>>> On 26.06.2020 00:04, Damien Le Moal wrote:
>>>>>> On 2020/06/26 6:49, Keith Busch wrote:
>>>>>>> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>>>>>>>  drivers/nvme/host/zns.c | 7 +++++++
>>>>>>>>  1 file changed, 7 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>>>>>> index 7d8381fe7665..de806788a184 100644
>>>>>>>> --- a/drivers/nvme/host/zns.c
>>>>>>>> +++ b/drivers/nvme/host/zns.c
>>>>>>>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>>>>>>  		sector += ns->zsze * nz;
>>>>>>>>  	}
>>>>>>>>
>>>>>>>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>>>>>>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>>>>>>> +				zone_idx, ns->nr_zones);
>>>>>>>> +		ret = -EINVAL;
>>>>>>>> +		goto out_free;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>>  	ret = zone_idx;
>>>>>>>
>>>>>>> nr_zones is unsigned, so it's never < 0.
>>>>>>>
>>>>>>> The API we're providing doesn't require zone_idx equal the namespace's
>>>>>>> nr_zones at the end, though. A subset of the total number of zones can
>>>>>>> be requested here.
>>>>>>>
>>>>>
>>>>> I did see nr_zones coming with -1; guess it is my compiler.
>>>>
>>>> See include/linux/blkdev.h. -1 is:
>>>>
>>>> #define BLK_ALL_ZONES  ((unsigned int)-1)
>>>>
>>>> Which is documented in block/blk-zoned.c:
>>>>
>>>> /**
>>>> * blkdev_report_zones - Get zones information
>>>> * @bdev:       Target block device
>>>> * @sector:     Sector from which to report zones
>>>> * @nr_zones:   Maximum number of zones to report
>>>> * @cb:         Callback function called for each reported zone
>>>> * @data:       Private data for the callback
>>>> *
>>>> * Description:
>>>> *    Get zone information starting from the zone containing @sector for at most
>>>> *    @nr_zones, and call @cb for each zone reported by the device.
>>>> *    To report all zones in a device starting from @sector, the BLK_ALL_ZONES
>>>> *    constant can be passed to @nr_zones.
>>>> *    Returns the number of zones reported by the device, or a negative errno
>>>> *    value in case of failure.
>>>> *
>>>> *    Note: The caller must use memalloc_noXX_save/restore() calls to control
>>>> *    memory allocations done within this function.
>>>> */
>>>> int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>>>>                        unsigned int nr_zones, report_zones_cb cb, void *data)
>>>>
>>>>>
>>>>>>
>>>>>> Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
>>>>>> reported zone descriptor in the current report range requested by the user,
>>>>>> which is not necessarily for the entire drive (i.e., provided nr zones is less
>>>>>> than the total number of zones of the disk and/or start sector is > 0). So
>>>>>> zone_idx indicates the actual number of zones reported, it is not the total
>>>>>
>>>>> I see. As I can see, when nr_zones comes undefined I believed we could
>>>>> assume that zone_idx is absolute, but I can be wrong.
>>>>
>>>> No. zone_idx is *always* the index of the zone in the current report. Whatever
>>>> that report is, regardless of the report starting point and number of zones
>>>> requested. E.g. For a single zone report (nr_zones = 1), you will always see
>>>> zone_idx = 0. For a full report, zone_idx will correspond to the zone number.
>>>> This is used for example in blk_revalidate_disk_zones() to initialize the zone
>>>> bitmaps.
>>>>
>>>>> Does it make sense to support this check with an additional counter and
>>>>> a explicit nr_zones initialization when undefined or you
>>>>> prefer to just remove it as Matias suggested?
>>>>
>>>> The check is not needed at all.
>>>>
>>>> If the device is buggy and reports more zones than the device capacity or any
>>>> other bugs, the driver can catch that when it processes the report.
>>>> blk_revalidate_disk_zones() also has many checks.
>>>
>>> I have managed to create a QEMU ZNS device that gave me a headache with
>>> a little bit of extra capacity that triggered an additional zone report.
>>> This was the motivation for the patch.
>>
>> The device emulation sound buggy... If the capacity is wrong, then the report
>> will be too since zones are all supposed to be sequential (no holes between
>> zones) and up to the disk capacity only (last zone start + len = capacity + 1)
>>
>> If one or the other is wrong, this should be easy to detect. Normally,
>> blk_revalidate_disk_zones() should be able to catch that.
> 
> We have the capability to select the reported device capacity manually
> for a number of reasons. One of the different test configurations in our
> CI did go through.

If you change the drive capacity on the fly (e.g. with a low level format ?),
you must revalidate the disk/drive to get the changed capacity. A lot of things
will break otherwise This is not just report zones that will be incorrect.

> 
> But it is OK, I will remove the check on V2.
> 
> Javier
>
Christoph Hellwig June 26, 2020, 9:16 a.m. UTC | #11
As a bunch of folks noticed I don't zone_idx does what you think it
does.  That being said I'm very happy about any kind of validation
(except maybe in a supper hot path).  So if we want to validate the
zone number we can do, just not as in this patch.
Javier González June 26, 2020, 10:03 a.m. UTC | #12
On 26.06.2020 11:16, Christoph Hellwig wrote:
>As a bunch of folks noticed I don't zone_idx does what you think it
>does.  That being said I'm very happy about any kind of validation
>(except maybe in a supper hot path).  So if we want to validate the
>zone number we can do, just not as in this patch.

Ok. I will send a proper implementation of it and then we see if it
fits.
diff mbox series

Patch

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 7d8381fe7665..de806788a184 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -234,6 +234,13 @@  static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 		sector += ns->zsze * nz;
 	}
 
+	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
+		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
+				zone_idx, ns->nr_zones);
+		ret = -EINVAL;
+		goto out_free;
+	}
+
 	ret = zone_idx;
 out_free:
 	kvfree(report);