Message ID | 20200516035434.82809-2-colyli@suse.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | block layer change necessary for bcache zoned device support | expand |
On 05/15/2020 08:55 PM, Coly Li wrote: > REQ_OP_ZONE_FINISH = 12, > + /* reset a zone write pointer */ > + REQ_OP_ZONE_RESET = 13, I think 13 is taken, I found this in block tree branch :- for-5.8/block 316 REQ_OP_ZONE_FINISH = 12, 317 /* write data at the current zone write pointer */ 318 REQ_OP_ZONE_APPEND = 13,
On 2020/5/16 12:06, Chaitanya Kulkarni wrote: > On 05/15/2020 08:55 PM, Coly Li wrote: >> REQ_OP_ZONE_FINISH = 12, >> + /* reset a zone write pointer */ >> + REQ_OP_ZONE_RESET = 13, > > I think 13 is taken, I found this in block tree branch :- > for-5.8/block > > 316 REQ_OP_ZONE_FINISH = 12, > 317 /* write data at the current zone write pointer */ > 318 REQ_OP_ZONE_APPEND = 13, > Aha, sure. Then I have to look at 15. Thanks for the hint. Coly Li
On Sat, May 16, 2020 at 11:54:31AM +0800, Coly Li wrote: > For a zoned device, e.g. host managed SMR hard drive, REQ_OP_ZONE_RESET > is to reset the LBA of a zone's write pointer back to the start LBA of > this zone. After the write point is reset, all previously stored data > in this zone is invalid and unaccessible anymore. Therefore, this op > code changes on disk data, belongs to a WRITE request op code. > > Current REQ_OP_ZONE_RESET is defined as number 6, but the convention of > the op code is, READ requests are even numbers, and WRITE requests are > odd numbers. See how op_is_write defined, The convention is all about data transfer, and zone reset does not transfer any data.
On 2020/5/16 20:38, Christoph Hellwig wrote: > On Sat, May 16, 2020 at 11:54:31AM +0800, Coly Li wrote: >> For a zoned device, e.g. host managed SMR hard drive, REQ_OP_ZONE_RESET >> is to reset the LBA of a zone's write pointer back to the start LBA of >> this zone. After the write point is reset, all previously stored data >> in this zone is invalid and unaccessible anymore. Therefore, this op >> code changes on disk data, belongs to a WRITE request op code. >> >> Current REQ_OP_ZONE_RESET is defined as number 6, but the convention of >> the op code is, READ requests are even numbers, and WRITE requests are >> odd numbers. See how op_is_write defined, > > The convention is all about data transfer, and zone reset does not > transfer any data. > Yes you are right, just like REQ_OP_DISCARD which does not transfer any data but changes the data on device. If the request changes the stored data, it does transfer data. This is why in patch "block: set bi_size to REQ_OP_ZONE_RESET bio" I set bi_size to the REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL bios. Coly Li
On Sat, May 16, 2020 at 08:44:45PM +0800, Coly Li wrote: > Yes you are right, just like REQ_OP_DISCARD which does not transfer any > data but changes the data on device. If the request changes the stored > data, it does transfer data. REQ_OP_DISCARD is a special case, because most implementation end up transferring data, it just gets attached in the low-level driver.
On 2020/5/16 20:50, Christoph Hellwig wrote: > On Sat, May 16, 2020 at 08:44:45PM +0800, Coly Li wrote: >> Yes you are right, just like REQ_OP_DISCARD which does not transfer any >> data but changes the data on device. If the request changes the stored >> data, it does transfer data. > > REQ_OP_DISCARD is a special case, because most implementation end up > transferring data, it just gets attached in the low-level driver. > Yes, REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are quite similar to REQ_OP_DISCARD. Data read from the LBA range of reset zone is not suggested and the content is undefined. For bcache, similar to REQ_OP_DISCARD, REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are handled in same way: If the backing device supports discard/zone_reset, and the operation successes, then cached data on SSD covered by the LBA range should be invalid, otherwise users will read outdated and garbage data. We should treat REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL being in WRITE direction. Coly Li
On Sat, May 16, 2020 at 09:05:39PM +0800, Coly Li wrote: > On 2020/5/16 20:50, Christoph Hellwig wrote: > > On Sat, May 16, 2020 at 08:44:45PM +0800, Coly Li wrote: > >> Yes you are right, just like REQ_OP_DISCARD which does not transfer any > >> data but changes the data on device. If the request changes the stored > >> data, it does transfer data. > > > > REQ_OP_DISCARD is a special case, because most implementation end up > > transferring data, it just gets attached in the low-level driver. > > > > Yes, REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are quite similar to > REQ_OP_DISCARD. Data read from the LBA range of reset zone is not > suggested and the content is undefined. > > For bcache, similar to REQ_OP_DISCARD, REQ_OP_ZONE_RESET and > REQ_OP_ZONE_RESET_ALL are handled in same way: If the backing device > supports discard/zone_reset, and the operation successes, then cached > data on SSD covered by the LBA range should be invalid, otherwise users > will read outdated and garbage data. > > We should treat REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL being in > WRITE direction. No, the difference is that the underlying SCSI/ATA/NVMe command tend to usually actually transfer data. Take a look at the special_vec field in struct request and the RQF_SPECIAL_PAYLOAD flag.
On 2020/5/16 23:36, Christoph Hellwig wrote: > On Sat, May 16, 2020 at 09:05:39PM +0800, Coly Li wrote: >> On 2020/5/16 20:50, Christoph Hellwig wrote: >>> On Sat, May 16, 2020 at 08:44:45PM +0800, Coly Li wrote: >>>> Yes you are right, just like REQ_OP_DISCARD which does not transfer any >>>> data but changes the data on device. If the request changes the stored >>>> data, it does transfer data. >>> >>> REQ_OP_DISCARD is a special case, because most implementation end up >>> transferring data, it just gets attached in the low-level driver. >>> >> >> Yes, REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are quite similar to >> REQ_OP_DISCARD. Data read from the LBA range of reset zone is not >> suggested and the content is undefined. >> >> For bcache, similar to REQ_OP_DISCARD, REQ_OP_ZONE_RESET and >> REQ_OP_ZONE_RESET_ALL are handled in same way: If the backing device >> supports discard/zone_reset, and the operation successes, then cached >> data on SSD covered by the LBA range should be invalid, otherwise users >> will read outdated and garbage data. >> >> We should treat REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL being in >> WRITE direction. > > No, the difference is that the underlying SCSI/ATA/NVMe command tend to > usually actually transfer data. Take a look at the special_vec field > in struct request and the RQF_SPECIAL_PAYLOAD flag. > Then bio_data_dir() will be problematic, as it is defined, 52 /* 53 * Return the data direction, READ or WRITE. 54 */ 55 #define bio_data_dir(bio) \ 56 (op_is_write(bio_op(bio)) ? WRITE : READ) For the REQ_OP_ZONE_RESET bio, bio_data_dir() will report it as READ. Since the author is you, how to fix this ? Coly Li
On 2020/05/16 12:55, Coly Li wrote: > For a zoned device, e.g. host managed SMR hard drive, REQ_OP_ZONE_RESET > is to reset the LBA of a zone's write pointer back to the start LBA of > this zone. After the write point is reset, all previously stored data > in this zone is invalid and unaccessible anymore. Therefore, this op > code changes on disk data, belongs to a WRITE request op code. > > Current REQ_OP_ZONE_RESET is defined as number 6, but the convention of > the op code is, READ requests are even numbers, and WRITE requests are > odd numbers. See how op_is_write defined, > > 397 static inline bool op_is_write(unsigned int op) > 398 { > 399 return (op & 1); > 400 } > > When create bcache device on top of the zoned SMR drive, bcache driver > has to handle the REQ_OP_ZONE_RESET bio by invalidate all cached data > belongs to the LBA range of the restting zone. A wrong op code value > will make the this zone management bio goes into bcache' read requests > code path and causes undefined result. > > This patch changes REQ_OP_ZONE_RESET value from 6 to 13, the new odd > number will make bcache driver handle this zone management bio properly > in write requests code path. > > Fixes: 87374179c535 ("block: add a proper block layer data direction encoding") > Signed-off-by: Coly Li <colyli@suse.de> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Damien Le Moal <damien.lemoal@wdc.com> > Cc: Hannes Reinecke <hare@suse.de> > Cc: Jens Axboe <axboe@fb.com> > Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> > Cc: Keith Busch <kbusch@kernel.org> > Cc: Shaun Tancheff <shaun.tancheff@seagate.com> > --- > include/linux/blk_types.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h > index 31eb92876be7..8f7bc15a6de3 100644 > --- a/include/linux/blk_types.h > +++ b/include/linux/blk_types.h > @@ -282,8 +282,6 @@ enum req_opf { > REQ_OP_DISCARD = 3, > /* securely erase sectors */ > REQ_OP_SECURE_ERASE = 5, > - /* reset a zone write pointer */ > - REQ_OP_ZONE_RESET = 6, > /* write the same sector many times */ > REQ_OP_WRITE_SAME = 7, > /* reset all the zone present on the device */ > @@ -296,6 +294,8 @@ enum req_opf { > REQ_OP_ZONE_CLOSE = 11, > /* Transition a zone to full */ > REQ_OP_ZONE_FINISH = 12, > + /* reset a zone write pointer */ > + REQ_OP_ZONE_RESET = 13, As Chaitanya pointed out, this is already used. Please rebase on Jens block/for-5.8/block branch. I do not see any problem with this change, but as Christoph commented, since zone reset does not transfer any data, I do not really see the necessity for this. Zone reset is indeed data-destructive, but it is not a "write" command. But following DISCARD and having the op number as an odd number is OK I think. > > /* SCSI passthrough using struct scsi_request */ > REQ_OP_SCSI_IN = 32, >
On 5/17/20 5:33 PM, Damien Le Moal wrote: > As Chaitanya pointed out, this is already used. Please rebase on Jens > block/for-5.8/block branch. > > I do not see any problem with this change, but as Christoph commented, since > zone reset does not transfer any data, I do not really see the necessity for > this. Zone reset is indeed data-destructive, but it is not a "write" command. > But following DISCARD and having the op number as an odd number is OK I think. Let's keep reset consistent with discard and write-zeroes.
On 5/17/20 7:30 AM, Coly Li wrote: > On 2020/5/16 23:36, Christoph Hellwig wrote: >> On Sat, May 16, 2020 at 09:05:39PM +0800, Coly Li wrote: >>> On 2020/5/16 20:50, Christoph Hellwig wrote: >>>> On Sat, May 16, 2020 at 08:44:45PM +0800, Coly Li wrote: >>>>> Yes you are right, just like REQ_OP_DISCARD which does not transfer any >>>>> data but changes the data on device. If the request changes the stored >>>>> data, it does transfer data. >>>> >>>> REQ_OP_DISCARD is a special case, because most implementation end up >>>> transferring data, it just gets attached in the low-level driver. >>>> >>> >>> Yes, REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are quite similar to >>> REQ_OP_DISCARD. Data read from the LBA range of reset zone is not >>> suggested and the content is undefined. >>> >>> For bcache, similar to REQ_OP_DISCARD, REQ_OP_ZONE_RESET and >>> REQ_OP_ZONE_RESET_ALL are handled in same way: If the backing device >>> supports discard/zone_reset, and the operation successes, then cached >>> data on SSD covered by the LBA range should be invalid, otherwise users >>> will read outdated and garbage data. >>> >>> We should treat REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL being in >>> WRITE direction. >> >> No, the difference is that the underlying SCSI/ATA/NVMe command tend to >> usually actually transfer data. Take a look at the special_vec field >> in struct request and the RQF_SPECIAL_PAYLOAD flag. >> > > Then bio_data_dir() will be problematic, as it is defined, > 52 /* > 53 * Return the data direction, READ or WRITE. > 54 */ > 55 #define bio_data_dir(bio) \ > 56 (op_is_write(bio_op(bio)) ? WRITE : READ) > > For the REQ_OP_ZONE_RESET bio, bio_data_dir() will report it as READ. > > Since the author is you, how to fix this ? > Well, but I do think that Coly is right in that bio_data_dir() is very unconvincing, or at least improperly documented. I can't quite follow Christoph's argument in that bio_data_dir() reflects whether data is _actually_ transferred (if so, why would REQ_OP_ZONE_CLOSE() be marked as WRITE?) However, in most cases bio_data_dir() will only come into effect if there are attached bvecs. So the _correct_ usage for bio_data_dir() would be to check if there is data attached to the bio (eg by using bio_has_data()), and only call bio_data_dir() if that returns true. For false you'd need to add a switch evaluating the individual commands. And, btw, bio_has_data() needs updating, too; all the zone commands are missing from there, too. Cheers, Hannes
On 2020/05/18 15:53, Hannes Reinecke wrote: > On 5/17/20 7:30 AM, Coly Li wrote: >> On 2020/5/16 23:36, Christoph Hellwig wrote: >>> On Sat, May 16, 2020 at 09:05:39PM +0800, Coly Li wrote: >>>> On 2020/5/16 20:50, Christoph Hellwig wrote: >>>>> On Sat, May 16, 2020 at 08:44:45PM +0800, Coly Li wrote: >>>>>> Yes you are right, just like REQ_OP_DISCARD which does not transfer any >>>>>> data but changes the data on device. If the request changes the stored >>>>>> data, it does transfer data. >>>>> >>>>> REQ_OP_DISCARD is a special case, because most implementation end up >>>>> transferring data, it just gets attached in the low-level driver. >>>>> >>>> >>>> Yes, REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are quite similar to >>>> REQ_OP_DISCARD. Data read from the LBA range of reset zone is not >>>> suggested and the content is undefined. >>>> >>>> For bcache, similar to REQ_OP_DISCARD, REQ_OP_ZONE_RESET and >>>> REQ_OP_ZONE_RESET_ALL are handled in same way: If the backing device >>>> supports discard/zone_reset, and the operation successes, then cached >>>> data on SSD covered by the LBA range should be invalid, otherwise users >>>> will read outdated and garbage data. >>>> >>>> We should treat REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL being in >>>> WRITE direction. >>> >>> No, the difference is that the underlying SCSI/ATA/NVMe command tend to >>> usually actually transfer data. Take a look at the special_vec field >>> in struct request and the RQF_SPECIAL_PAYLOAD flag. >>> >> >> Then bio_data_dir() will be problematic, as it is defined, >> 52 /* >> 53 * Return the data direction, READ or WRITE. >> 54 */ >> 55 #define bio_data_dir(bio) \ >> 56 (op_is_write(bio_op(bio)) ? WRITE : READ) >> >> For the REQ_OP_ZONE_RESET bio, bio_data_dir() will report it as READ. >> >> Since the author is you, how to fix this ? >> > > Well, but I do think that Coly is right in that bio_data_dir() is very > unconvincing, or at least improperly documented. > > I can't quite follow Christoph's argument in that bio_data_dir() > reflects whether data is _actually_ transferred (if so, why would > REQ_OP_ZONE_CLOSE() be marked as WRITE?) > However, in most cases bio_data_dir() will only come into effect if > there are attached bvecs. > > So the _correct_ usage for bio_data_dir() would be to check if there is > data attached to the bio (eg by using bio_has_data()), and only call > bio_data_dir() if that returns true. For false you'd need to add a > switch evaluating the individual commands. > > And, btw, bio_has_data() needs updating, too; all the zone commands are > missing from there, too. Not really. They all have 0 size, so bio_has_data() always return false for zone management ops. It will return true only for report zones. > > Cheers, > > Hannes >
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 31eb92876be7..8f7bc15a6de3 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -282,8 +282,6 @@ enum req_opf { REQ_OP_DISCARD = 3, /* securely erase sectors */ REQ_OP_SECURE_ERASE = 5, - /* reset a zone write pointer */ - REQ_OP_ZONE_RESET = 6, /* write the same sector many times */ REQ_OP_WRITE_SAME = 7, /* reset all the zone present on the device */ @@ -296,6 +294,8 @@ enum req_opf { REQ_OP_ZONE_CLOSE = 11, /* Transition a zone to full */ REQ_OP_ZONE_FINISH = 12, + /* reset a zone write pointer */ + REQ_OP_ZONE_RESET = 13, /* SCSI passthrough using struct scsi_request */ REQ_OP_SCSI_IN = 32,
For a zoned device, e.g. host managed SMR hard drive, REQ_OP_ZONE_RESET is to reset the LBA of a zone's write pointer back to the start LBA of this zone. After the write point is reset, all previously stored data in this zone is invalid and unaccessible anymore. Therefore, this op code changes on disk data, belongs to a WRITE request op code. Current REQ_OP_ZONE_RESET is defined as number 6, but the convention of the op code is, READ requests are even numbers, and WRITE requests are odd numbers. See how op_is_write defined, 397 static inline bool op_is_write(unsigned int op) 398 { 399 return (op & 1); 400 } When create bcache device on top of the zoned SMR drive, bcache driver has to handle the REQ_OP_ZONE_RESET bio by invalidate all cached data belongs to the LBA range of the restting zone. A wrong op code value will make the this zone management bio goes into bcache' read requests code path and causes undefined result. This patch changes REQ_OP_ZONE_RESET value from 6 to 13, the new odd number will make bcache driver handle this zone management bio properly in write requests code path. Fixes: 87374179c535 ("block: add a proper block layer data direction encoding") Signed-off-by: Coly Li <colyli@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Jens Axboe <axboe@fb.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Shaun Tancheff <shaun.tancheff@seagate.com> --- include/linux/blk_types.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)