diff mbox series

[RFC,v2,1/4] block: change REQ_OP_ZONE_RESET from 6 to 13

Message ID 20200516035434.82809-2-colyli@suse.de (mailing list archive)
State New, archived
Headers show
Series block layer change necessary for bcache zoned device support | expand

Commit Message

Coly Li May 16, 2020, 3:54 a.m. UTC
For a zoned device, e.g. host managed SMR hard drive, REQ_OP_ZONE_RESET
is to reset the LBA of a zone's write pointer back to the start LBA of
this zone. After the write point is reset, all previously stored data
in this zone is invalid and unaccessible anymore. Therefore, this op
code changes on disk data, belongs to a WRITE request op code.

Current REQ_OP_ZONE_RESET is defined as number 6, but the convention of
the op code is, READ requests are even numbers, and WRITE requests are
odd numbers. See how op_is_write defined,

397 static inline bool op_is_write(unsigned int op)
398 {
399         return (op & 1);
400 }

When create bcache device on top of the zoned SMR drive, bcache driver
has to handle the REQ_OP_ZONE_RESET bio by invalidate all cached data
belongs to the LBA range of the restting zone. A wrong op code value
will make the this zone management bio goes into bcache' read requests
code path and causes undefined result.

This patch changes REQ_OP_ZONE_RESET value from 6 to 13, the new odd
number will make bcache driver handle this zone management bio properly
in write requests code path.

Fixes: 87374179c535 ("block: add a proper block layer data direction encoding")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@fb.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Shaun Tancheff <shaun.tancheff@seagate.com>
---
 include/linux/blk_types.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Chaitanya Kulkarni May 16, 2020, 4:06 a.m. UTC | #1
On 05/15/2020 08:55 PM, Coly Li wrote:
>   	REQ_OP_ZONE_FINISH	= 12,
> +	/* reset a zone write pointer */
> +	REQ_OP_ZONE_RESET	= 13,

I think 13 is taken, I found this in block tree branch :-
for-5.8/block

316         REQ_OP_ZONE_FINISH      = 12,
317         /* write data at the current zone write pointer */
318         REQ_OP_ZONE_APPEND      = 13,
Coly Li May 16, 2020, 9:33 a.m. UTC | #2
On 2020/5/16 12:06, Chaitanya Kulkarni wrote:
> On 05/15/2020 08:55 PM, Coly Li wrote:
>>   	REQ_OP_ZONE_FINISH	= 12,
>> +	/* reset a zone write pointer */
>> +	REQ_OP_ZONE_RESET	= 13,
> 
> I think 13 is taken, I found this in block tree branch :-
> for-5.8/block
> 
> 316         REQ_OP_ZONE_FINISH      = 12,
> 317         /* write data at the current zone write pointer */
> 318         REQ_OP_ZONE_APPEND      = 13,
> 

Aha, sure. Then I have to look at 15.

Thanks for the hint.

Coly Li
Christoph Hellwig May 16, 2020, 12:38 p.m. UTC | #3
On Sat, May 16, 2020 at 11:54:31AM +0800, Coly Li wrote:
> For a zoned device, e.g. host managed SMR hard drive, REQ_OP_ZONE_RESET
> is to reset the LBA of a zone's write pointer back to the start LBA of
> this zone. After the write point is reset, all previously stored data
> in this zone is invalid and unaccessible anymore. Therefore, this op
> code changes on disk data, belongs to a WRITE request op code.
> 
> Current REQ_OP_ZONE_RESET is defined as number 6, but the convention of
> the op code is, READ requests are even numbers, and WRITE requests are
> odd numbers. See how op_is_write defined,

The convention is all about data transfer, and zone reset does not
transfer any data.
Coly Li May 16, 2020, 12:44 p.m. UTC | #4
On 2020/5/16 20:38, Christoph Hellwig wrote:
> On Sat, May 16, 2020 at 11:54:31AM +0800, Coly Li wrote:
>> For a zoned device, e.g. host managed SMR hard drive, REQ_OP_ZONE_RESET
>> is to reset the LBA of a zone's write pointer back to the start LBA of
>> this zone. After the write point is reset, all previously stored data
>> in this zone is invalid and unaccessible anymore. Therefore, this op
>> code changes on disk data, belongs to a WRITE request op code.
>>
>> Current REQ_OP_ZONE_RESET is defined as number 6, but the convention of
>> the op code is, READ requests are even numbers, and WRITE requests are
>> odd numbers. See how op_is_write defined,
> 
> The convention is all about data transfer, and zone reset does not
> transfer any data.
> 

Yes you are right, just like REQ_OP_DISCARD which does not transfer any
data but changes the data on device. If the request changes the stored
data, it does transfer data.

This is why in patch "block: set bi_size to REQ_OP_ZONE_RESET bio" I set
bi_size to the REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL bios.

Coly Li
Christoph Hellwig May 16, 2020, 12:50 p.m. UTC | #5
On Sat, May 16, 2020 at 08:44:45PM +0800, Coly Li wrote:
> Yes you are right, just like REQ_OP_DISCARD which does not transfer any
> data but changes the data on device. If the request changes the stored
> data, it does transfer data.

REQ_OP_DISCARD is a special case, because most implementation end up
transferring data, it just gets attached in the low-level driver.
Coly Li May 16, 2020, 1:05 p.m. UTC | #6
On 2020/5/16 20:50, Christoph Hellwig wrote:
> On Sat, May 16, 2020 at 08:44:45PM +0800, Coly Li wrote:
>> Yes you are right, just like REQ_OP_DISCARD which does not transfer any
>> data but changes the data on device. If the request changes the stored
>> data, it does transfer data.
> 
> REQ_OP_DISCARD is a special case, because most implementation end up
> transferring data, it just gets attached in the low-level driver.
> 

Yes, REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are quite similar to
REQ_OP_DISCARD. Data read from the LBA range of reset zone is not
suggested and the content is undefined.

For bcache, similar to REQ_OP_DISCARD, REQ_OP_ZONE_RESET and
REQ_OP_ZONE_RESET_ALL are handled in same way: If the backing device
supports discard/zone_reset, and the operation successes, then cached
data on SSD covered by the LBA range should be invalid, otherwise users
will read outdated and garbage data.

We should treat REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL being in
WRITE direction.

Coly Li
Christoph Hellwig May 16, 2020, 3:36 p.m. UTC | #7
On Sat, May 16, 2020 at 09:05:39PM +0800, Coly Li wrote:
> On 2020/5/16 20:50, Christoph Hellwig wrote:
> > On Sat, May 16, 2020 at 08:44:45PM +0800, Coly Li wrote:
> >> Yes you are right, just like REQ_OP_DISCARD which does not transfer any
> >> data but changes the data on device. If the request changes the stored
> >> data, it does transfer data.
> > 
> > REQ_OP_DISCARD is a special case, because most implementation end up
> > transferring data, it just gets attached in the low-level driver.
> > 
> 
> Yes, REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are quite similar to
> REQ_OP_DISCARD. Data read from the LBA range of reset zone is not
> suggested and the content is undefined.
> 
> For bcache, similar to REQ_OP_DISCARD, REQ_OP_ZONE_RESET and
> REQ_OP_ZONE_RESET_ALL are handled in same way: If the backing device
> supports discard/zone_reset, and the operation successes, then cached
> data on SSD covered by the LBA range should be invalid, otherwise users
> will read outdated and garbage data.
> 
> We should treat REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL being in
> WRITE direction.

No, the difference is that the underlying SCSI/ATA/NVMe command tend to
usually actually transfer data.  Take a look at the special_vec field
in struct request and the RQF_SPECIAL_PAYLOAD flag.
Coly Li May 17, 2020, 5:30 a.m. UTC | #8
On 2020/5/16 23:36, Christoph Hellwig wrote:
> On Sat, May 16, 2020 at 09:05:39PM +0800, Coly Li wrote:
>> On 2020/5/16 20:50, Christoph Hellwig wrote:
>>> On Sat, May 16, 2020 at 08:44:45PM +0800, Coly Li wrote:
>>>> Yes you are right, just like REQ_OP_DISCARD which does not transfer any
>>>> data but changes the data on device. If the request changes the stored
>>>> data, it does transfer data.
>>>
>>> REQ_OP_DISCARD is a special case, because most implementation end up
>>> transferring data, it just gets attached in the low-level driver.
>>>
>>
>> Yes, REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are quite similar to
>> REQ_OP_DISCARD. Data read from the LBA range of reset zone is not
>> suggested and the content is undefined.
>>
>> For bcache, similar to REQ_OP_DISCARD, REQ_OP_ZONE_RESET and
>> REQ_OP_ZONE_RESET_ALL are handled in same way: If the backing device
>> supports discard/zone_reset, and the operation successes, then cached
>> data on SSD covered by the LBA range should be invalid, otherwise users
>> will read outdated and garbage data.
>>
>> We should treat REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL being in
>> WRITE direction.
> 
> No, the difference is that the underlying SCSI/ATA/NVMe command tend to
> usually actually transfer data.  Take a look at the special_vec field
> in struct request and the RQF_SPECIAL_PAYLOAD flag.
> 

Then bio_data_dir() will be problematic, as it is defined,
 52 /*
 53  * Return the data direction, READ or WRITE.
 54  */
 55 #define bio_data_dir(bio) \
 56         (op_is_write(bio_op(bio)) ? WRITE : READ)

For the REQ_OP_ZONE_RESET bio, bio_data_dir() will report it as READ.

Since the author is you, how to fix this ?

Coly Li
Damien Le Moal May 18, 2020, 12:33 a.m. UTC | #9
On 2020/05/16 12:55, Coly Li wrote:
> For a zoned device, e.g. host managed SMR hard drive, REQ_OP_ZONE_RESET
> is to reset the LBA of a zone's write pointer back to the start LBA of
> this zone. After the write point is reset, all previously stored data
> in this zone is invalid and unaccessible anymore. Therefore, this op
> code changes on disk data, belongs to a WRITE request op code.
> 
> Current REQ_OP_ZONE_RESET is defined as number 6, but the convention of
> the op code is, READ requests are even numbers, and WRITE requests are
> odd numbers. See how op_is_write defined,
> 
> 397 static inline bool op_is_write(unsigned int op)
> 398 {
> 399         return (op & 1);
> 400 }
> 
> When create bcache device on top of the zoned SMR drive, bcache driver
> has to handle the REQ_OP_ZONE_RESET bio by invalidate all cached data
> belongs to the LBA range of the restting zone. A wrong op code value
> will make the this zone management bio goes into bcache' read requests
> code path and causes undefined result.
> 
> This patch changes REQ_OP_ZONE_RESET value from 6 to 13, the new odd
> number will make bcache driver handle this zone management bio properly
> in write requests code path.
> 
> Fixes: 87374179c535 ("block: add a proper block layer data direction encoding")
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Jens Axboe <axboe@fb.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Shaun Tancheff <shaun.tancheff@seagate.com>
> ---
>  include/linux/blk_types.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 31eb92876be7..8f7bc15a6de3 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -282,8 +282,6 @@ enum req_opf {
>  	REQ_OP_DISCARD		= 3,
>  	/* securely erase sectors */
>  	REQ_OP_SECURE_ERASE	= 5,
> -	/* reset a zone write pointer */
> -	REQ_OP_ZONE_RESET	= 6,
>  	/* write the same sector many times */
>  	REQ_OP_WRITE_SAME	= 7,
>  	/* reset all the zone present on the device */
> @@ -296,6 +294,8 @@ enum req_opf {
>  	REQ_OP_ZONE_CLOSE	= 11,
>  	/* Transition a zone to full */
>  	REQ_OP_ZONE_FINISH	= 12,
> +	/* reset a zone write pointer */
> +	REQ_OP_ZONE_RESET	= 13,

As Chaitanya pointed out, this is already used. Please rebase on Jens
block/for-5.8/block branch.

I do not see any problem with this change, but as Christoph commented, since
zone reset does not transfer any data, I do not really see the necessity for
this. Zone reset is indeed data-destructive, but it is not a "write" command.
But following DISCARD and having the op number as an odd number is OK I think.

>  
>  	/* SCSI passthrough using struct scsi_request */
>  	REQ_OP_SCSI_IN		= 32,
>
Chaitanya Kulkarni May 18, 2020, 5:09 a.m. UTC | #10
On 5/17/20 5:33 PM, Damien Le Moal wrote:
> As Chaitanya pointed out, this is already used. Please rebase on Jens
> block/for-5.8/block branch.
> 
> I do not see any problem with this change, but as Christoph commented, since
> zone reset does not transfer any data, I do not really see the necessity for
> this. Zone reset is indeed data-destructive, but it is not a "write" command.
> But following DISCARD and having the op number as an odd number is OK I think.
Let's keep reset consistent with discard and write-zeroes.
Hannes Reinecke May 18, 2020, 6:53 a.m. UTC | #11
On 5/17/20 7:30 AM, Coly Li wrote:
> On 2020/5/16 23:36, Christoph Hellwig wrote:
>> On Sat, May 16, 2020 at 09:05:39PM +0800, Coly Li wrote:
>>> On 2020/5/16 20:50, Christoph Hellwig wrote:
>>>> On Sat, May 16, 2020 at 08:44:45PM +0800, Coly Li wrote:
>>>>> Yes you are right, just like REQ_OP_DISCARD which does not transfer any
>>>>> data but changes the data on device. If the request changes the stored
>>>>> data, it does transfer data.
>>>>
>>>> REQ_OP_DISCARD is a special case, because most implementation end up
>>>> transferring data, it just gets attached in the low-level driver.
>>>>
>>>
>>> Yes, REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are quite similar to
>>> REQ_OP_DISCARD. Data read from the LBA range of reset zone is not
>>> suggested and the content is undefined.
>>>
>>> For bcache, similar to REQ_OP_DISCARD, REQ_OP_ZONE_RESET and
>>> REQ_OP_ZONE_RESET_ALL are handled in same way: If the backing device
>>> supports discard/zone_reset, and the operation successes, then cached
>>> data on SSD covered by the LBA range should be invalid, otherwise users
>>> will read outdated and garbage data.
>>>
>>> We should treat REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL being in
>>> WRITE direction.
>>
>> No, the difference is that the underlying SCSI/ATA/NVMe command tend to
>> usually actually transfer data.  Take a look at the special_vec field
>> in struct request and the RQF_SPECIAL_PAYLOAD flag.
>>
> 
> Then bio_data_dir() will be problematic, as it is defined,
>   52 /*
>   53  * Return the data direction, READ or WRITE.
>   54  */
>   55 #define bio_data_dir(bio) \
>   56         (op_is_write(bio_op(bio)) ? WRITE : READ)
> 
> For the REQ_OP_ZONE_RESET bio, bio_data_dir() will report it as READ.
> 
> Since the author is you, how to fix this ?
> 

Well, but I do think that Coly is right in that bio_data_dir() is very 
unconvincing, or at least improperly documented.

I can't quite follow Christoph's argument in that bio_data_dir() 
reflects whether data is _actually_ transferred (if so, why would 
REQ_OP_ZONE_CLOSE() be marked as WRITE?)
However, in most cases bio_data_dir() will only come into effect if 
there are attached bvecs.

So the _correct_ usage for bio_data_dir() would be to check if there is 
data attached to the bio (eg by using bio_has_data()), and only call 
bio_data_dir() if that returns true. For false you'd need to add a 
switch evaluating the individual commands.

And, btw, bio_has_data() needs updating, too; all the zone commands are 
missing from there, too.

Cheers,

Hannes
Damien Le Moal May 18, 2020, 6:56 a.m. UTC | #12
On 2020/05/18 15:53, Hannes Reinecke wrote:
> On 5/17/20 7:30 AM, Coly Li wrote:
>> On 2020/5/16 23:36, Christoph Hellwig wrote:
>>> On Sat, May 16, 2020 at 09:05:39PM +0800, Coly Li wrote:
>>>> On 2020/5/16 20:50, Christoph Hellwig wrote:
>>>>> On Sat, May 16, 2020 at 08:44:45PM +0800, Coly Li wrote:
>>>>>> Yes you are right, just like REQ_OP_DISCARD which does not transfer any
>>>>>> data but changes the data on device. If the request changes the stored
>>>>>> data, it does transfer data.
>>>>>
>>>>> REQ_OP_DISCARD is a special case, because most implementation end up
>>>>> transferring data, it just gets attached in the low-level driver.
>>>>>
>>>>
>>>> Yes, REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are quite similar to
>>>> REQ_OP_DISCARD. Data read from the LBA range of reset zone is not
>>>> suggested and the content is undefined.
>>>>
>>>> For bcache, similar to REQ_OP_DISCARD, REQ_OP_ZONE_RESET and
>>>> REQ_OP_ZONE_RESET_ALL are handled in same way: If the backing device
>>>> supports discard/zone_reset, and the operation successes, then cached
>>>> data on SSD covered by the LBA range should be invalid, otherwise users
>>>> will read outdated and garbage data.
>>>>
>>>> We should treat REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL being in
>>>> WRITE direction.
>>>
>>> No, the difference is that the underlying SCSI/ATA/NVMe command tend to
>>> usually actually transfer data.  Take a look at the special_vec field
>>> in struct request and the RQF_SPECIAL_PAYLOAD flag.
>>>
>>
>> Then bio_data_dir() will be problematic, as it is defined,
>>   52 /*
>>   53  * Return the data direction, READ or WRITE.
>>   54  */
>>   55 #define bio_data_dir(bio) \
>>   56         (op_is_write(bio_op(bio)) ? WRITE : READ)
>>
>> For the REQ_OP_ZONE_RESET bio, bio_data_dir() will report it as READ.
>>
>> Since the author is you, how to fix this ?
>>
> 
> Well, but I do think that Coly is right in that bio_data_dir() is very 
> unconvincing, or at least improperly documented.
> 
> I can't quite follow Christoph's argument in that bio_data_dir() 
> reflects whether data is _actually_ transferred (if so, why would 
> REQ_OP_ZONE_CLOSE() be marked as WRITE?)
> However, in most cases bio_data_dir() will only come into effect if 
> there are attached bvecs.
> 
> So the _correct_ usage for bio_data_dir() would be to check if there is 
> data attached to the bio (eg by using bio_has_data()), and only call 
> bio_data_dir() if that returns true. For false you'd need to add a 
> switch evaluating the individual commands.
> 
> And, btw, bio_has_data() needs updating, too; all the zone commands are 
> missing from there, too.

Not really. They all have 0 size, so bio_has_data() always return false for zone
management ops. It will return true only for report zones.

> 
> Cheers,
> 
> Hannes
>
diff mbox series

Patch

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 31eb92876be7..8f7bc15a6de3 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -282,8 +282,6 @@  enum req_opf {
 	REQ_OP_DISCARD		= 3,
 	/* securely erase sectors */
 	REQ_OP_SECURE_ERASE	= 5,
-	/* reset a zone write pointer */
-	REQ_OP_ZONE_RESET	= 6,
 	/* write the same sector many times */
 	REQ_OP_WRITE_SAME	= 7,
 	/* reset all the zone present on the device */
@@ -296,6 +294,8 @@  enum req_opf {
 	REQ_OP_ZONE_CLOSE	= 11,
 	/* Transition a zone to full */
 	REQ_OP_ZONE_FINISH	= 12,
+	/* reset a zone write pointer */
+	REQ_OP_ZONE_RESET	= 13,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,