
nvme: Fix io_opt limit setting

Message ID 20200514015452.1055278-1-damien.lemoal@wdc.com (mailing list archive)
State New, archived
Series nvme: Fix io_opt limit setting

Commit Message

Damien Le Moal May 14, 2020, 1:54 a.m. UTC
Currently, a namespace's io_opt queue limit is set by default to the
physical sector size of the namespace, or to the optimal write size
(NOWS) when the namespace reports this value. This causes problems
with block limits stacking in blk_stack_limits() when a namespace block
device is combined with an HDD, which generally does not report any
optimal transfer size (io_opt limit is 0). The code:

/* Optimal I/O a multiple of the physical block size? */
if (t->io_opt & (t->physical_block_size - 1)) {
	t->io_opt = 0;
	t->misaligned = 1;
	ret = -1;
}

causes blk_stack_limits() to return an error when the combined
devices have different but compatible physical sector sizes (e.g. a
512B sector SSD with 4KB sector disks).

Fix this by not setting the optimal IO size limit if the namespace does
not report an optimal write size value.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/nvme/host/core.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Martin K. Petersen May 14, 2020, 3:29 a.m. UTC | #1
Damien,

> results in blk_stack_limits() to return an error when the combined
> devices have different but compatible physical sector sizes (e.g. 512B
> sector SSD with 4KB sector disks).

We'll need to get that stacking logic fixed up to take io_opt into
account when scaling pbs/min. Just as a safety measure in case we don't
catch devices reporting crazy values in the LLDs.

> Fix this by not setting the optiomal IO size limit if the namespace

optimal

> does not report an optimal write size value.

Setting io_opt to the logical block size in the NVMe driver is
equivalent to telling the filesystems that they should not submit I/Os
larger than one sector. That makes no sense. This change is correct.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Keith Busch May 14, 2020, 3:40 a.m. UTC | #2
On Thu, May 14, 2020 at 10:54:52AM +0900, Damien Le Moal wrote:
> Currently, a namespace io_opt queue limit is set by default to the
> physical sector size of the namespace and to the the write optimal
> size (NOWS) when the namespace reports this value. This causes problems
> with block limits stacking in blk_stack_limits() when a namespace block
> device is combined with an HDD which generally do not report any optimal
> transfer size (io_opt limit is 0). The code:
> 
> /* Optimal I/O a multiple of the physical block size? */
> if (t->io_opt & (t->physical_block_size - 1)) {
> 	t->io_opt = 0;
> 	t->misaligned = 1;
> 	ret = -1;
> }
> 
> results in blk_stack_limits() to return an error when the combined
> devices have different but compatible physical sector sizes (e.g. 512B
> sector SSD with 4KB sector disks).
> 
> Fix this by not setting the optiomal IO size limit if the namespace does
> not report an optimal write size value.

Won't this continue to break if a controller does report NOWS that's not
a multiple of the physical block size of the device it's stacking with?
Damien Le Moal May 14, 2020, 3:47 a.m. UTC | #3
On 2020/05/14 12:40, Keith Busch wrote:
> On Thu, May 14, 2020 at 10:54:52AM +0900, Damien Le Moal wrote:
>> Currently, a namespace io_opt queue limit is set by default to the
>> physical sector size of the namespace and to the the write optimal
>> size (NOWS) when the namespace reports this value. This causes problems
>> with block limits stacking in blk_stack_limits() when a namespace block
>> device is combined with an HDD which generally do not report any optimal
>> transfer size (io_opt limit is 0). The code:
>>
>> /* Optimal I/O a multiple of the physical block size? */
>> if (t->io_opt & (t->physical_block_size - 1)) {
>> 	t->io_opt = 0;
>> 	t->misaligned = 1;
>> 	ret = -1;
>> }
>>
>> results in blk_stack_limits() to return an error when the combined
>> devices have different but compatible physical sector sizes (e.g. 512B
>> sector SSD with 4KB sector disks).
>>
>> Fix this by not setting the optiomal IO size limit if the namespace does
>> not report an optimal write size value.
> 
> Won't this continue to break if a controller does report NOWS that's not
> a multiple of the physical block size of the device it's stacking with?

When io_opt stacking is handled, the physical sector size for the stacked device
is already resolved to a common value. If the NOWS value cannot accommodate this
resolved physical sector size, this is an incompatible stacking, so failing is
OK in that case.
Keith Busch May 14, 2020, 4:12 a.m. UTC | #4
On Thu, May 14, 2020 at 03:47:56AM +0000, Damien Le Moal wrote:
> On 2020/05/14 12:40, Keith Busch wrote:
> > On Thu, May 14, 2020 at 10:54:52AM +0900, Damien Le Moal wrote:
> >> Currently, a namespace io_opt queue limit is set by default to the
> >> physical sector size of the namespace and to the the write optimal
> >> size (NOWS) when the namespace reports this value. This causes problems
> >> with block limits stacking in blk_stack_limits() when a namespace block
> >> device is combined with an HDD which generally do not report any optimal
> >> transfer size (io_opt limit is 0). The code:
> >>
> >> /* Optimal I/O a multiple of the physical block size? */
> >> if (t->io_opt & (t->physical_block_size - 1)) {
> >> 	t->io_opt = 0;
> >> 	t->misaligned = 1;
> >> 	ret = -1;
> >> }
> >>
> >> results in blk_stack_limits() to return an error when the combined
> >> devices have different but compatible physical sector sizes (e.g. 512B
> >> sector SSD with 4KB sector disks).
> >>
> >> Fix this by not setting the optiomal IO size limit if the namespace does
> >> not report an optimal write size value.
> > 
> > Won't this continue to break if a controller does report NOWS that's not
> > a multiple of the physical block size of the device it's stacking with?
> 
> When io_opt stacking is handled, the physical sector size for the stacked device
> is already resolved to a common value. If the NOWS value cannot accommodate this
> resolved physical sector size, this is an incompatible stacking, so failing is
> OK in that case.

I see, though it's not strictly incompatible as io_opt is merely a hint
that could continue to work if the stacked limit was recalculated as:

	if (t->io_opt & (t->physical_block_size - 1))
		t->io_opt = lcm(t->io_opt, t->physical_block_size);

Regardless, your patch does make sense, but it does have a merge
conflict with nvme-5.8.
Damien Le Moal May 14, 2020, 4:13 a.m. UTC | #5
On 2020/05/14 13:12, Keith Busch wrote:
> On Thu, May 14, 2020 at 03:47:56AM +0000, Damien Le Moal wrote:
>> On 2020/05/14 12:40, Keith Busch wrote:
>>> On Thu, May 14, 2020 at 10:54:52AM +0900, Damien Le Moal wrote:
>>>> Currently, a namespace io_opt queue limit is set by default to the
>>>> physical sector size of the namespace and to the the write optimal
>>>> size (NOWS) when the namespace reports this value. This causes problems
>>>> with block limits stacking in blk_stack_limits() when a namespace block
>>>> device is combined with an HDD which generally do not report any optimal
>>>> transfer size (io_opt limit is 0). The code:
>>>>
>>>> /* Optimal I/O a multiple of the physical block size? */
>>>> if (t->io_opt & (t->physical_block_size - 1)) {
>>>> 	t->io_opt = 0;
>>>> 	t->misaligned = 1;
>>>> 	ret = -1;
>>>> }
>>>>
>>>> results in blk_stack_limits() to return an error when the combined
>>>> devices have different but compatible physical sector sizes (e.g. 512B
>>>> sector SSD with 4KB sector disks).
>>>>
>>>> Fix this by not setting the optiomal IO size limit if the namespace does
>>>> not report an optimal write size value.
>>>
>>> Won't this continue to break if a controller does report NOWS that's not
>>> a multiple of the physical block size of the device it's stacking with?
>>
>> When io_opt stacking is handled, the physical sector size for the stacked device
>> is already resolved to a common value. If the NOWS value cannot accommodate this
>> resolved physical sector size, this is an incompatible stacking, so failing is
>> OK in that case.
> 
> I see, though it's not strictly incompatible as io_opt is merely a hint
> that could continue to work if the stacked limit was recalculated as:
> 
> 	if (t->io_opt & (t->physical_block_size - 1))
> 	 	t->io_opt = lcm(t->io_opt, t->physical_block_size);
> 
> Regardless, your patch does make sense, but it does have a merge
> conflict with nvme-5.8.

Oops. I will rebase and resend.

And maybe we should send your suggestion above as a proper patch?

>
Bart Van Assche May 14, 2020, 4:47 a.m. UTC | #6
On 2020-05-13 18:54, Damien Le Moal wrote:
> @@ -1848,7 +1847,8 @@ static void nvme_update_disk_info(struct gendisk *disk,
>  	 */
>  	blk_queue_physical_block_size(disk->queue, min(phys_bs, atomic_bs));
>  	blk_queue_io_min(disk->queue, phys_bs);
> -	blk_queue_io_opt(disk->queue, io_opt);
> +	if (io_opt)
> +		blk_queue_io_opt(disk->queue, io_opt);

The above change looks confusing to me. We want the NVMe driver to set
io_opt, so why only call blk_queue_io_opt() if io_opt != 0? That means
that the io_opt value will be left to any value set by the block layer
core if io_opt == 0 instead of properly being set to zero.

Thanks,

Bart.
Damien Le Moal May 14, 2020, 4:49 a.m. UTC | #7
On 2020/05/14 13:47, Bart Van Assche wrote:
> On 2020-05-13 18:54, Damien Le Moal wrote:
>> @@ -1848,7 +1847,8 @@ static void nvme_update_disk_info(struct gendisk *disk,
>>  	 */
>>  	blk_queue_physical_block_size(disk->queue, min(phys_bs, atomic_bs));
>>  	blk_queue_io_min(disk->queue, phys_bs);
>> -	blk_queue_io_opt(disk->queue, io_opt);
>> +	if (io_opt)
>> +		blk_queue_io_opt(disk->queue, io_opt);
> 
> The above change looks confusing to me. We want the NVMe driver to set
> io_opt, so why only call blk_queue_io_opt() if io_opt != 0? That means
> that the io_opt value will be left to any value set by the block layer
> core if io_opt == 0 instead of properly being set to zero.

OK. I will remove the "if".

> 
> Thanks,
> 
> Bart.
>
Hannes Reinecke May 14, 2020, 6:11 a.m. UTC | #8
On 5/14/20 3:54 AM, Damien Le Moal wrote:
> Currently, a namespace io_opt queue limit is set by default to the
> physical sector size of the namespace and to the the write optimal
> size (NOWS) when the namespace reports this value. This causes problems
> with block limits stacking in blk_stack_limits() when a namespace block
> device is combined with an HDD which generally do not report any optimal
> transfer size (io_opt limit is 0). The code:
> 
> /* Optimal I/O a multiple of the physical block size? */
> if (t->io_opt & (t->physical_block_size - 1)) {
> 	t->io_opt = 0;
> 	t->misaligned = 1;
> 	ret = -1;
> }
> 
> results in blk_stack_limits() to return an error when the combined
> devices have different but compatible physical sector sizes (e.g. 512B
> sector SSD with 4KB sector disks).
> 
> Fix this by not setting the optiomal IO size limit if the namespace does
> not report an optimal write size value.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> ---
>   drivers/nvme/host/core.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
Ah, so you beat me to it :-)

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
Martin K. Petersen May 14, 2020, 10:19 p.m. UTC | #9
Bart,

> The above change looks confusing to me. We want the NVMe driver to set
> io_opt, so why only call blk_queue_io_opt() if io_opt != 0? That means
> that the io_opt value will be left to any value set by the block layer
> core if io_opt == 0 instead of properly being set to zero.

We do explicitly set it to 0 when allocating a queue. But no biggie.

Patch

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f3c037f5a9ba..0729173053ed 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1809,7 +1809,7 @@  static void nvme_update_disk_info(struct gendisk *disk,
 {
 	sector_t capacity = nvme_lba_to_sect(ns, le64_to_cpu(id->nsze));
 	unsigned short bs = 1 << ns->lba_shift;
-	u32 atomic_bs, phys_bs, io_opt;
+	u32 atomic_bs, phys_bs, io_opt = 0;
 
 	if (ns->lba_shift > PAGE_SHIFT) {
 		/* unsupported block size, set capacity to 0 later */
@@ -1832,12 +1832,11 @@  static void nvme_update_disk_info(struct gendisk *disk,
 		atomic_bs = bs;
 	}
 	phys_bs = bs;
-	io_opt = bs;
 	if (id->nsfeat & (1 << 4)) {
 		/* NPWG = Namespace Preferred Write Granularity */
 		phys_bs *= 1 + le16_to_cpu(id->npwg);
 		/* NOWS = Namespace Optimal Write Size */
-		io_opt *= 1 + le16_to_cpu(id->nows);
+		io_opt = bs * (1 + le16_to_cpu(id->nows));
 	}
 
 	blk_queue_logical_block_size(disk->queue, bs);
@@ -1848,7 +1847,8 @@  static void nvme_update_disk_info(struct gendisk *disk,
 	 */
 	blk_queue_physical_block_size(disk->queue, min(phys_bs, atomic_bs));
 	blk_queue_io_min(disk->queue, phys_bs);
-	blk_queue_io_opt(disk->queue, io_opt);
+	if (io_opt)
+		blk_queue_io_opt(disk->queue, io_opt);
 
 	if (ns->ms && !ns->ext &&
 	    (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED))