diff mbox series

[RFC,v2,4/4] block: set bi_size to REQ_OP_ZONE_RESET bio

Message ID 20200516035434.82809-5-colyli@suse.de (mailing list archive)
State New, archived
Headers show
Series block layer change necessary for bcache zoned device support | expand

Commit Message

Coly Li May 16, 2020, 3:54 a.m. UTC
Now for zoned device, zone management ioctl commands are converted into
zone management bios and handled by blkdev_zone_mgmt(). There are 4 zone
management bios are handled, their op code is,
- REQ_OP_ZONE_RESET
  Reset the zone's writer pointer and empty all previously stored data.
- REQ_OP_ZONE_OPEN
  Open the zones in the specified sector range, no influence on data.
- REQ_OP_ZONE_CLOSE
  Close the zones in the specified sector range, no influence on data.
- REQ_OP_ZONE_FINISH
  Mark the zone as full, no influence on data.
All the zone management bios has 0 byte size, a.k.a their bi_size is 0.

Exept for REQ_OP_ZONE_RESET request, zero length bio works fine for
other zone management bio, before the zoned device e.g. host managed SMR
hard drive can be created as a bcache device.

When a bcache device (virtual block device to forward bios like md raid
drivers) can be created on top of the zoned device, and a fast SSD is
attached as a cache device, bcache driver may cache the frequent random
READ requests on fast SSD to accelerate hot data READ performance.

When bcache driver receives a zone management bio for REQ_OP_ZONE_RESET
op, while forwarding the request to underlying zoned device e.g. host
managed SMR hard drive, it should also invalidate all cached data from
SSD for the resetting zone. Otherwise bcache will continue provide the
outdated cached data to READ request and cause potential data storage
inconsistency and corruption.

In order to invalidate outdated data from SSD for the reset zone, bcache
needs to know not only the start LBA but also the range length of the
resetting zone. Otherwise, bcache won't be able to accurately invalidate
the outdated cached data.

Is it possible to simply set the bi_size inside bcache driver? The
answer is NO. Although every REQ_OP_ZONE_RESET bio has exact length as
zone size or q->limits.chunk_sectors, it is possible that some other
layer stacking block driver (in the future) exists between bcache driver
and blkdev_zone_mgmt() where the zone management bio is made.

The best location to set bi_size is where the zone management bio is
composed in blkdev_zone_mgmt(), then no matter how this bio is split
before bcache driver receives it, bcache driver can always correctly
invalidate the resetting range.

This patch sets the bi_size of REQ_OP_ZONE_RESET bio for each resetting
zone. Here REQ_OP_ZONE_RESET_ALL is special whose bi_size should be set
as capacity of whole drive size, then bcache can invalidate all cached
data from SSD for the zoned backing device.

With this change, now bcache code can handle REQ_OP_ZONE_RESET bio in
the way very similar to REQ_OP_DISCARD bio with very little change.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Ajay Joshi <ajay.joshi@wdc.com>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Keith Busch <kbusch@kernel.org>
---
Changelog:
v2: fix typo for REQ_OP_ZONE_RESET_ALL.
v1: initial version.

 block/blk-zoned.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Christoph Hellwig May 16, 2020, 12:53 p.m. UTC | #1
This looks ok, but I want to hear Damiens opinion.
Damien Le Moal May 18, 2020, 12:59 a.m. UTC | #2
On 2020/05/16 12:55, Coly Li wrote:
> Now for zoned device, zone management ioctl commands are converted into
> zone management bios and handled by blkdev_zone_mgmt(). There are 4 zone
> management bios are handled, their op code is,
> - REQ_OP_ZONE_RESET
>   Reset the zone's writer pointer and empty all previously stored data.
> - REQ_OP_ZONE_OPEN
>   Open the zones in the specified sector range, no influence on data.
> - REQ_OP_ZONE_CLOSE
>   Close the zones in the specified sector range, no influence on data.
> - REQ_OP_ZONE_FINISH
>   Mark the zone as full, no influence on data.
> All the zone management bios has 0 byte size, a.k.a their bi_size is 0.
> 
> Exept for REQ_OP_ZONE_RESET request, zero length bio works fine for
> other zone management bio, before the zoned device e.g. host managed SMR
> hard drive can be created as a bcache device.
> 
> When a bcache device (virtual block device to forward bios like md raid
> drivers) can be created on top of the zoned device, and a fast SSD is
> attached as a cache device, bcache driver may cache the frequent random
> READ requests on fast SSD to accelerate hot data READ performance.
> 
> When bcache driver receives a zone management bio for REQ_OP_ZONE_RESET
> op, while forwarding the request to underlying zoned device e.g. host
> managed SMR hard drive, it should also invalidate all cached data from
> SSD for the resetting zone. Otherwise bcache will continue provide the
> outdated cached data to READ request and cause potential data storage
> inconsistency and corruption.
> 
> In order to invalidate outdated data from SSD for the reset zone, bcache
> needs to know not only the start LBA but also the range length of the
> resetting zone. Otherwise, bcache won't be able to accurately invalidate
> the outdated cached data.
> 
> Is it possible to simply set the bi_size inside bcache driver? The
> answer is NO. Although every REQ_OP_ZONE_RESET bio has exact length as
> zone size or q->limits.chunk_sectors, it is possible that some other
> layer stacking block driver (in the future) exists between bcache driver
> and blkdev_zone_mgmt() where the zone management bio is made.

But bcache "knows" what underlying devices it is using, right ? So getting the
SMR drive zone size using blk_queue_zone_sectors(), bdev_zone_sectors() or by
directly accessing q->limits->chunk_sectors should not be a problem at all, no ?

> 
> The best location to set bi_size is where the zone management bio is
> composed in blkdev_zone_mgmt(), then no matter how this bio is split
> before bcache driver receives it, bcache driver can always correctly
> invalidate the resetting range.
> 
> This patch sets the bi_size of REQ_OP_ZONE_RESET bio for each resetting
> zone. Here REQ_OP_ZONE_RESET_ALL is special whose bi_size should be set
> as capacity of whole drive size, then bcache can invalidate all cached
> data from SSD for the zoned backing device.

NACK. The problem here is that struct bio_vec bv_len field and struct bvec_iter
bi_size field are both "unsigned int". So they can describe at most 4G x 512B =
2TB ranges. For REQ_OP_ZONE_RESET_ALL, that is simply way too small (we already
are at 20TB capacity for SMR). One cannot issue a BIO with a length that is the
entire disk capacity.

I really think that bcache should handle the zone management ops as a special
case. I understand the goal of trying to minimize that by integrating them as
much as possible into the regular bcache IO path, but that really seems
dangerous to me. Since these operations will need remapping in bcache anyway
(for handling the first hidden zone containing the super block), all special
processing can be done under a single "if" which should not impact too much
performance of regular read/write commands in the hot path.

Device mapper has such code: see __split_and_process_bio() and its use of
op_is_zone_mgmt() function to handle zone management requests slightly
differently than other BIOs. That remove the need for relying on op_is_write()
direction (patch 1 and 2 in this series) for reset zones too, which in the end
will I think make your bcache code a lot more solid.

> 
> With this change, now bcache code can handle REQ_OP_ZONE_RESET bio in
> the way very similar to REQ_OP_DISCARD bio with very little change.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Ajay Joshi <ajay.joshi@wdc.com>
> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>
> ---
> Changelog:
> v2: fix typo for REQ_OP_ZONE_RESET_ALL.
> v1: initial version.
> 
>  block/blk-zoned.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 1e0708c68267..01d91314399b 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -227,11 +227,15 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
>  		if (op == REQ_OP_ZONE_RESET &&
>  		    blkdev_allow_reset_all_zones(bdev, sector, nr_sectors)) {
>  			bio->bi_opf = REQ_OP_ZONE_RESET_ALL;
> +			bio->bi_iter.bi_sector = sector;
> +			bio->bi_iter.bi_size = nr_sectors;
>  			break;
>  		}
>  
>  		bio->bi_opf = op | REQ_SYNC;
>  		bio->bi_iter.bi_sector = sector;
> +		if (op == REQ_OP_ZONE_RESET)
> +			bio->bi_iter.bi_size = zone_sectors;
>  		sector += zone_sectors;
>  
>  		/* This may take a while, so be nice to others */
>
Coly Li May 18, 2020, 2:32 a.m. UTC | #3
On 2020/5/18 08:59, Damien Le Moal wrote:
> On 2020/05/16 12:55, Coly Li wrote:
>> Now for zoned device, zone management ioctl commands are converted into
>> zone management bios and handled by blkdev_zone_mgmt(). There are 4 zone
>> management bios are handled, their op code is,
>> - REQ_OP_ZONE_RESET
>>   Reset the zone's writer pointer and empty all previously stored data.
>> - REQ_OP_ZONE_OPEN
>>   Open the zones in the specified sector range, no influence on data.
>> - REQ_OP_ZONE_CLOSE
>>   Close the zones in the specified sector range, no influence on data.
>> - REQ_OP_ZONE_FINISH
>>   Mark the zone as full, no influence on data.
>> All the zone management bios has 0 byte size, a.k.a their bi_size is 0.
>>
>> Exept for REQ_OP_ZONE_RESET request, zero length bio works fine for
>> other zone management bio, before the zoned device e.g. host managed SMR
>> hard drive can be created as a bcache device.
>>
>> When a bcache device (virtual block device to forward bios like md raid
>> drivers) can be created on top of the zoned device, and a fast SSD is
>> attached as a cache device, bcache driver may cache the frequent random
>> READ requests on fast SSD to accelerate hot data READ performance.
>>
>> When bcache driver receives a zone management bio for REQ_OP_ZONE_RESET
>> op, while forwarding the request to underlying zoned device e.g. host
>> managed SMR hard drive, it should also invalidate all cached data from
>> SSD for the resetting zone. Otherwise bcache will continue provide the
>> outdated cached data to READ request and cause potential data storage
>> inconsistency and corruption.
>>
>> In order to invalidate outdated data from SSD for the reset zone, bcache
>> needs to know not only the start LBA but also the range length of the
>> resetting zone. Otherwise, bcache won't be able to accurately invalidate
>> the outdated cached data.
>>
>> Is it possible to simply set the bi_size inside bcache driver? The
>> answer is NO. Although every REQ_OP_ZONE_RESET bio has exact length as
>> zone size or q->limits.chunk_sectors, it is possible that some other
>> layer stacking block driver (in the future) exists between bcache driver
>> and blkdev_zone_mgmt() where the zone management bio is made.
> 
> But bcache "knows" what underlying devices it is using, right ? So getting the
> SMR drive zone size using blk_queue_zone_sectors(), bdev_zone_sectors() or by
> directly accessing q->limits->chunk_sectors should not be a problem at all, no ?
> 
>>
>> The best location to set bi_size is where the zone management bio is
>> composed in blkdev_zone_mgmt(), then no matter how this bio is split
>> before bcache driver receives it, bcache driver can always correctly
>> invalidate the resetting range.
>>
>> This patch sets the bi_size of REQ_OP_ZONE_RESET bio for each resetting
>> zone. Here REQ_OP_ZONE_RESET_ALL is special whose bi_size should be set
>> as capacity of whole drive size, then bcache can invalidate all cached
>> data from SSD for the zoned backing device.
> 
> NACK. The problem here is that struct bio_vec bv_len field and struct bvec_iter
> bi_size field are both "unsigned int". So they can describe at most 4G x 512B =
> 2TB ranges. For REQ_OP_ZONE_RESET_ALL, that is simply way too small (we already
> are at 20TB capacity for SMR). One cannot issue a BIO with a length that is the
> entire disk capacity.
> 
> I really think that bcache should handle the zone management ops as a special
> case. I understand the goal of trying to minimize that by integrating them as
> much as possible into the regular bcache IO path, but that really seems
> dangerous to me. Since these operations will need remapping in bcache anyway
> (for handling the first hidden zone containing the super block), all special
> processing can be done under a single "if" which should not impact too much
> performance of regular read/write commands in the hot path.
> 
> Device mapper has such code: see __split_and_process_bio() and its use of
> op_is_zone_mgmt() function to handle zone management requests slightly
> differently than other BIOs. That remove the need for relying on op_is_write()
> direction (patch 1 and 2 in this series) for reset zones too, which in the end
> will I think make your bcache code a lot more solid.
> 

Copied. I understand and agree with you and Christoph now. Let me
reconstruct the bcache code and maybe the depended change of generic
block layer code can be avoided.

Damien and Christoph, thank you all for the comments.

Coly Li
diff mbox series

Patch

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 1e0708c68267..01d91314399b 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -227,11 +227,15 @@  int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
 		if (op == REQ_OP_ZONE_RESET &&
 		    blkdev_allow_reset_all_zones(bdev, sector, nr_sectors)) {
 			bio->bi_opf = REQ_OP_ZONE_RESET_ALL;
+			bio->bi_iter.bi_sector = sector;
+			bio->bi_iter.bi_size = nr_sectors;
 			break;
 		}
 
 		bio->bi_opf = op | REQ_SYNC;
 		bio->bi_iter.bi_sector = sector;
+		if (op == REQ_OP_ZONE_RESET)
+			bio->bi_iter.bi_size = zone_sectors;
 		sector += zone_sectors;
 
 		/* This may take a while, so be nice to others */