[RFC,v2] bcache: export zoned information for backing device
diff mbox series

Message ID 20200510165232.67500-1-colyli@suse.de
State New
Headers show
Series
  • [RFC,v2] bcache: export zoned information for backing device
Related show

Commit Message

Coly Li May 10, 2020, 4:52 p.m. UTC
This is a very basic zoned device support. With this patch, bcache
device is able to,
- Export zoned device attribution via sysfs
- Response report zones request, e.g. by command 'blkzone report'
But the bcache device is still NOT able to,
- Response any zoned device management request or IOCTL command

Here are the testings I have done,
- read /sys/block/bcache0/queue/zoned, content is 'host-managed'
- read /sys/block/bcache0/queue/nr_zones, content is number of zones
  including all zone types.
- read /sys/block/bcache0/queue/chunk_sectors, content is zone size
  in sectors.
- run 'blkzone report /dev/bcache0', all zones information displayed.
- run 'blkzone reset /dev/bcache0', operation is rejected with error
  information: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed:
  Operation not supported"
- Sequential writes by dd, I can see some zones' write pointer 'wptr'
  values updated.

All of these are very basic testings, if you have better testing
tools or cases, please offer me hint.

Thanks in advance for your review and comments.

Signed-off-by: Coly Li <colyli@suse.de>
CC: Hannes Reinecke <hare@suse.com>
CC: Damien Le Moal <damien.lemoal@wdc.com>
CC: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 drivers/md/bcache/bcache.h  | 10 +++++
 drivers/md/bcache/request.c | 74 +++++++++++++++++++++++++++++++++++++
 drivers/md/bcache/super.c   | 51 +++++++++++++++++++++----
 3 files changed, 128 insertions(+), 7 deletions(-)

Comments

Coly Li May 10, 2020, 5:24 p.m. UTC | #1
On 2020/5/11 00:52, Coly Li wrote:
> This is a very basic zoned device support. With this patch, bcache
> device is able to,
> - Export zoned device attribution via sysfs
> - Response report zones request, e.g. by command 'blkzone report'
> But the bcache device is still NOT able to,
> - Response any zoned device management request or IOCTL command
> 
> Here are the testings I have done,
> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
> - read /sys/block/bcache0/queue/nr_zones, content is number of zones
>   including all zone types.
> - read /sys/block/bcache0/queue/chunk_sectors, content is zone size
>   in sectors.
> - run 'blkzone report /dev/bcache0', all zones information displayed.
> - run 'blkzone reset /dev/bcache0', operation is rejected with error
>   information: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed:
>   Operation not supported"
> - Sequential writes by dd, I can see some zones' write pointer 'wptr'
>   values updated.
> 
> All of these are very basic testings, if you have better testing
> tools or cases, please offer me hint.
> 
> Thanks in advance for your review and comments.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> CC: Hannes Reinecke <hare@suse.com>
> CC: Damien Le Moal <damien.lemoal@wdc.com>
> CC: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---

Hi Damien and Johannes,

With this patch the bcache device with a SMR drive can export the zone
information and format zonefs on top of it. Writeback mode does not work
at this moment (it requires on-disk format change and on my to-do list),
writethrough and writearound mode can be used on the bcache device to
accelerate random read when hitting.

During my testing, there are 2 things needs to fix.

1, mkzonefs report the first zone size does not match.
   Because bcache occupies the first 8KB of the backing SMR drive, so
the first zone size is 8KB less. By ignoring unmatched zone 0 size,
mkzonefs works and the bcache device is formated.

2, Direct I/O on files under seq/ directory can not be accessed.
   I need help here. It seems direct I/O write fails with -EINVAL. I
found the failure happens in fs/iomap/direct-io.c:iomap_dio_bio_actor(),
211         if ((pos | length | align) & ((1 << blkbits) - 1))
212                 return -EINVAL;

When I write to seq/1 file on offset 0 with 4096 bytes, in the above
line, align is 205427296, and  (pos | length | align) & ((1 << blkbits)
- 1) is non-zero. Then all writes to files under seq/ fail with -EINVAL.

I guess there should be something missing when I do the direct I/O
write. Could you all give me some hint ?

Thanks in advance.

Coly Li
Hannes Reinecke May 11, 2020, 6:05 a.m. UTC | #2
On 5/10/20 6:52 PM, Coly Li wrote:
> This is a very basic zoned device support. With this patch, bcache
> device is able to,
> - Export zoned device attribution via sysfs
> - Response report zones request, e.g. by command 'blkzone report'
> But the bcache device is still NOT able to,
> - Response any zoned device management request or IOCTL command
> 
> Here are the testings I have done,
> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
> - read /sys/block/bcache0/queue/nr_zones, content is number of zones
>    including all zone types.
> - read /sys/block/bcache0/queue/chunk_sectors, content is zone size
>    in sectors.
> - run 'blkzone report /dev/bcache0', all zones information displayed.
> - run 'blkzone reset /dev/bcache0', operation is rejected with error
>    information: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed:
>    Operation not supported"
> - Sequential writes by dd, I can see some zones' write pointer 'wptr'
>    values updated.
> 
> All of these are very basic testings, if you have better testing
> tools or cases, please offer me hint.
> 
> Thanks in advance for your review and comments.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> CC: Hannes Reinecke <hare@suse.com>
> CC: Damien Le Moal <damien.lemoal@wdc.com>
> CC: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   drivers/md/bcache/bcache.h  | 10 +++++
>   drivers/md/bcache/request.c | 74 +++++++++++++++++++++++++++++++++++++
>   drivers/md/bcache/super.c   | 51 +++++++++++++++++++++----
>   3 files changed, 128 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 74a9849ea164..0d298b48707f 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -221,6 +221,7 @@ BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1);
>   struct search;
>   struct btree;
>   struct keybuf;
> +struct bch_report_zones_args;
>   
>   struct keybuf_key {
>   	struct rb_node		node;
> @@ -277,6 +278,8 @@ struct bcache_device {
>   			  struct bio *bio, unsigned int sectors);
>   	int (*ioctl)(struct bcache_device *d, fmode_t mode,
>   		     unsigned int cmd, unsigned long arg);
> +	int (*report_zones)(struct bch_report_zones_args *args,
> +			    unsigned int nr_zones);
>   };
>   
>   struct io {
> @@ -748,6 +751,13 @@ struct bbio {
>   	struct bio		bio;
>   };
>   
> +struct bch_report_zones_args {
> +	struct bcache_device *bcache_device;
> +	sector_t sector;
> +	void *orig_data;
> +	report_zones_cb orig_cb;
> +};
> +
>   #define BTREE_PRIO		USHRT_MAX
>   #define INITIAL_PRIO		32768U
>   
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 71a90fbec314..bd50204caac7 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -1190,6 +1190,19 @@ blk_qc_t cached_dev_make_request(struct request_queue *q, struct bio *bio)
>   		}
>   	}
>   
> +	/*
> +	 * zone management request may change the data layout and content
> +	 * implicitly on backing device, which introduces unacceptible
> +	 * inconsistency between the backing device and the cache device.
> +	 * Therefore all zone management related request will be denied here.
> +	 */
> +	if (op_is_zone_mgmt(bio->bi_opf)) {
> +		pr_err_ratelimited("zoned device management request denied.");
> +		bio->bi_status = BLK_STS_NOTSUPP;
> +		bio_endio(bio);
> +		return BLK_QC_T_NONE;
> +	}
> +
>   	generic_start_io_acct(q,
>   			      bio_op(bio),
>   			      bio_sectors(bio),
> @@ -1233,6 +1246,24 @@ static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
>   	if (dc->io_disable)
>   		return -EIO;
>   
> +	/*
> +	 * zone management ioctl commands may change the data layout
> +	 * and content implicitly on backing device, which introduces
> +	 * unacceptible inconsistency between the backing device and
> +	 * the cache device. Therefore all zone management related
> +	 * ioctl commands will be denied here.
> +	 */
> +	switch (cmd) {
> +	case BLKRESETZONE:
> +	case BLKOPENZONE:
> +	case BLKCLOSEZONE:
> +	case BLKFINISHZONE:
> +		return -EOPNOTSUPP;
> +	default:
> +		/* Other commands fall through*/
> +		break;
> +	}
> +
>   	return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg);
>   }
>   
> @@ -1261,6 +1292,48 @@ static int cached_dev_congested(void *data, int bits)
>   	return ret;
>   }
>   
> +static int cached_dev_report_zones_cb(struct blk_zone *zone,
> +				      unsigned int idx,
> +				      void *data)
> +{
> +	struct bch_report_zones_args *args = data;
> +	struct bcache_device *d = args->bcache_device;
> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
> +	unsigned long data_offset = dc->sb.data_offset;
> +
> +	if (zone->start >= data_offset) {
> +		zone->start -= data_offset;
> +		zone->wp -= data_offset;
> +	} else {
> +		/*
> +		 * Normally the first zone is conventional zone,
> +		 * we don't need to worry about how to maintain
> +		 * the write pointer.
> +		 * But the zone->len is special, because the
> +		 * sector 0 on bcache device is 8KB offset on
> +		 * backing device indeed.
> +		 */
> +		zone->len -= data_offset;
> +	}
> +
> +	return args->orig_cb(zone, idx, args->orig_data);
> +}
> +
Well, I'm afraid that this is not going to work.
For SMR drives only the _last_ sector may be of a different length than 
the other sectors (specifications vary, but implementations do follow 
this model).
So either you blank out the first zone completely (and modify each and 
every zone in the report zones output) or you move the metadata to the 
last zone (but then this zone might be sequential, requiring even more 
changes to the metadata).

So all things considered I guess the easiest way would be to blank out 
the first zone...

> +static int cached_dev_report_zones(struct bch_report_zones_args *args,
> +				   unsigned int nr_zones)
> +{
> +	struct bcache_device *d = args->bcache_device;
> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
> +	sector_t sector = args->sector + dc->sb.data_offset;
> +
> +	/* sector is real LBA of backing device */
> +	return blkdev_report_zones(dc->bdev,
> +				   sector,
> +				   nr_zones,
> +				   cached_dev_report_zones_cb,
> +				   args);
> +}
> +
>   void bch_cached_dev_request_init(struct cached_dev *dc)
>   {
>   	struct gendisk *g = dc->disk.disk;
> @@ -1268,6 +1341,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc)
>   	g->queue->backing_dev_info->congested_fn = cached_dev_congested;
>   	dc->disk.cache_miss			= cached_dev_cache_miss;
>   	dc->disk.ioctl				= cached_dev_ioctl;
> +	dc->disk.report_zones			= cached_dev_report_zones;
>   }
>   
>   /* Flash backed devices */
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index d98354fa28e3..b7d496040cee 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -679,10 +679,31 @@ static int ioctl_dev(struct block_device *b, fmode_t mode,
>   	return d->ioctl(d, mode, cmd, arg);
>   }
>   
> +static int report_zones_dev(struct gendisk *disk,
> +			    sector_t sector,
> +			    unsigned int nr_zones,
> +			    report_zones_cb cb,
> +			    void *data)
> +{
> +	struct bcache_device *d = disk->private_data;
> +	struct bch_report_zones_args args = {
> +		.bcache_device = d,
> +		.sector = sector,
> +		.orig_data = data,
> +		.orig_cb = cb,
> +	};
> +
> +	if (d->report_zones)
> +		return d->report_zones(&args, nr_zones);
> +
> +	return -EOPNOTSUPP;
> +}
> +
>   static const struct block_device_operations bcache_ops = {
>   	.open		= open_dev,
>   	.release	= release_dev,
>   	.ioctl		= ioctl_dev,
> +	.report_zones	= report_zones_dev,
>   	.owner		= THIS_MODULE,
>   };
>   
> @@ -815,8 +836,12 @@ static void bcache_device_free(struct bcache_device *d)
>   	closure_debug_destroy(&d->cl);
>   }
>   
> -static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
> -			      sector_t sectors, make_request_fn make_request_fn)
> +static int bcache_device_init(struct cached_dev *dc,
> +			      struct bcache_device *d,
> +			      unsigned int block_size,
> +			      sector_t sectors,
> +			      make_request_fn make_request_fn)
> +
>   {
>   	struct request_queue *q;
>   	const size_t max_stripes = min_t(size_t, INT_MAX,
> @@ -886,6 +911,17 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>   	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, d->disk->queue);
>   	blk_queue_flag_set(QUEUE_FLAG_DISCARD, d->disk->queue);
>   
> +	/*
> +	 * If this is for backing device registration, and it is an
> +	 * zoned device (e.g. host-managed S.M.R. hard drive), set
> +	 * up zoned device attribution properly for sysfs interface.
> +	 */
> +	if (dc && bdev_is_zoned(dc->bdev)) {
> +		q->limits.zoned = bdev_zoned_model(dc->bdev);
> +		q->nr_zones = blkdev_nr_zones(dc->bdev->bd_disk);
> +		q->limits.chunk_sectors = bdev_zone_sectors(dc->bdev);
> +	}
> +
>   	blk_queue_write_cache(q, true, true);
>   
>   	return 0;
> @@ -1337,9 +1373,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>   		dc->partial_stripes_expensive =
>   			q->limits.raid_partial_stripes_expensive;
>   
> -	ret = bcache_device_init(&dc->disk, block_size,
> -			 dc->bdev->bd_part->nr_sects - dc->sb.data_offset,
> -			 cached_dev_make_request);
> +	ret = bcache_device_init(dc, &dc->disk, block_size,
> +			dc->bdev->bd_part->nr_sects - dc->sb.data_offset,
> +			cached_dev_make_request);
>   	if (ret)
>   		return ret;
>   

If you already pass in 'dc' as argument, can't you drop '&dc->disk' as 
argument?

> @@ -1451,8 +1487,9 @@ static int flash_dev_run(struct cache_set *c, struct uuid_entry *u)
>   
>   	kobject_init(&d->kobj, &bch_flash_dev_ktype);
>   
> -	if (bcache_device_init(d, block_bytes(c), u->sectors,
> -			flash_dev_make_request))
> +	if (bcache_device_init(NULL, d, block_bytes(c),
> +			       u->sectors,
> +			       flash_dev_make_request))
>   		goto err;
>   
>   	bcache_device_attach(d, c, u - c->uuids);
> 
Ah. Hmm. Probably not. Ignore my previous comment.

Cheers,

Hannes
Damien Le Moal May 11, 2020, 7:05 a.m. UTC | #3
Hi Coli,

On 2020/05/11 2:24, Coly Li wrote:
> On 2020/5/11 00:52, Coly Li wrote:
>> This is a very basic zoned device support. With this patch, bcache
>> device is able to,
>> - Export zoned device attribution via sysfs
>> - Response report zones request, e.g. by command 'blkzone report'
>> But the bcache device is still NOT able to,
>> - Response any zoned device management request or IOCTL command
>>
>> Here are the testings I have done,
>> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
>> - read /sys/block/bcache0/queue/nr_zones, content is number of zones
>>   including all zone types.
>> - read /sys/block/bcache0/queue/chunk_sectors, content is zone size
>>   in sectors.
>> - run 'blkzone report /dev/bcache0', all zones information displayed.
>> - run 'blkzone reset /dev/bcache0', operation is rejected with error
>>   information: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed:
>>   Operation not supported"
>> - Sequential writes by dd, I can see some zones' write pointer 'wptr'
>>   values updated.
>>
>> All of these are very basic testings, if you have better testing
>> tools or cases, please offer me hint.
>>
>> Thanks in advance for your review and comments.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> CC: Hannes Reinecke <hare@suse.com>
>> CC: Damien Le Moal <damien.lemoal@wdc.com>
>> CC: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
> 
> Hi Damien and Johannes,
> 
> With this patch the bcache device with a SMR drive can export the zone
> information and format zonefs on top of it. Writeback mode does not work
> at this moment (it requires on-disk format change and on my to-do list),
> writethrough and writearound mode can be used on the bcache device to
> accelerate random read when hitting.
> 
> During my testing, there are 2 things needs to fix.
> 
> 1, mkzonefs report the first zone size does not match.
>    Because bcache occupies the first 8KB of the backing SMR drive, so
> the first zone size is 8KB less. By ignoring unmatched zone 0 size,
> mkzonefs works and the bcache device is formated.

Hannes was faster than me to comment on this. I can only repeat: this will not
work as there is the assumptions that all zones are the same size, except
eventually for the last one. I am even surprised that mkzonefs worked... Looks I
have some patching to do :)

The simplest solution is as Hannes pointed out: use the first zone in its
entirety for bcache super block and nothing else. And do not show this zone in
the zone report so that the bcache device user cannot reset or overwrite it.
That will mean as Hannes pointed out that the report zones device method for
bcache will need to remap zone start and wp sectors to start from sector 0.
Basically, substracting zone size to all zone start and to zone wp for non-full
zones will do.

> 2, Direct I/O on files under seq/ directory can not be accessed.
>    I need help here. It seems direct I/O write fails with -EINVAL. I
> found the failure happens in fs/iomap/direct-io.c:iomap_dio_bio_actor(),
> 211         if ((pos | length | align) & ((1 << blkbits) - 1))
> 212                 return -EINVAL;
> 
> When I write to seq/1 file on offset 0 with 4096 bytes, in the above
> line, align is 205427296, and  (pos | length | align) & ((1 << blkbits)
> - 1) is non-zero. Then all writes to files under seq/ fail with -EINVAL.

205427296 is 195MB, so offset 0 in file seq/1 does not align to zone 1 start
sector. This is why you see the  error (unaligned write). This is due to the
first zone not being the same size, which zonefs assumes.

If you could do all that, it probably also mean that you forgot to call
blk_revalidate_disk_zones() when setting up the bcache device. That is necessary
to set all queue limits and check the zone sizes are equal.

I will check your patch shortly.

> I guess there should be something missing when I do the direct I/O
> write. Could you all give me some hint ?
> 
> Thanks in advance.
> 
> Coly Li
>
Damien Le Moal May 11, 2020, 9:06 a.m. UTC | #4
On 2020/05/11 1:52, Coly Li wrote:
> This is a very basic zoned device support. With this patch, bcache
> device is able to,
> - Export zoned device attribution via sysfs
> - Response report zones request, e.g. by command 'blkzone report'
> But the bcache device is still NOT able to,
> - Response any zoned device management request or IOCTL command
> 
> Here are the testings I have done,
> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
> - read /sys/block/bcache0/queue/nr_zones, content is number of zones
>   including all zone types.
> - read /sys/block/bcache0/queue/chunk_sectors, content is zone size
>   in sectors.
> - run 'blkzone report /dev/bcache0', all zones information displayed.
> - run 'blkzone reset /dev/bcache0', operation is rejected with error
>   information: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed:
>   Operation not supported"

Weird. If report zones works, reset should also, modulo the zone size problem
for the first zone. You may get EINVAL, but not ENOTTY.

> - Sequential writes by dd, I can see some zones' write pointer 'wptr'
>   values updated.
> 
> All of these are very basic testings, if you have better testing
> tools or cases, please offer me hint.
> 
> Thanks in advance for your review and comments.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> CC: Hannes Reinecke <hare@suse.com>
> CC: Damien Le Moal <damien.lemoal@wdc.com>
> CC: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  drivers/md/bcache/bcache.h  | 10 +++++
>  drivers/md/bcache/request.c | 74 +++++++++++++++++++++++++++++++++++++
>  drivers/md/bcache/super.c   | 51 +++++++++++++++++++++----
>  3 files changed, 128 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 74a9849ea164..0d298b48707f 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -221,6 +221,7 @@ BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1);
>  struct search;
>  struct btree;
>  struct keybuf;
> +struct bch_report_zones_args;
>  
>  struct keybuf_key {
>  	struct rb_node		node;
> @@ -277,6 +278,8 @@ struct bcache_device {
>  			  struct bio *bio, unsigned int sectors);
>  	int (*ioctl)(struct bcache_device *d, fmode_t mode,
>  		     unsigned int cmd, unsigned long arg);
> +	int (*report_zones)(struct bch_report_zones_args *args,
> +			    unsigned int nr_zones);
>  };
>  
>  struct io {
> @@ -748,6 +751,13 @@ struct bbio {
>  	struct bio		bio;
>  };
>  
> +struct bch_report_zones_args {
> +	struct bcache_device *bcache_device;
> +	sector_t sector;
> +	void *orig_data;
> +	report_zones_cb orig_cb;
> +};
> +
>  #define BTREE_PRIO		USHRT_MAX
>  #define INITIAL_PRIO		32768U
>  
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 71a90fbec314..bd50204caac7 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -1190,6 +1190,19 @@ blk_qc_t cached_dev_make_request(struct request_queue *q, struct bio *bio)
>  		}
>  	}
>  
> +	/*
> +	 * zone management request may change the data layout and content
> +	 * implicitly on backing device, which introduces unacceptible

s/unacceptible/unacceptable

> +	 * inconsistency between the backing device and the cache device.
> +	 * Therefore all zone management related request will be denied here.
> +	 */
> +	if (op_is_zone_mgmt(bio->bi_opf)) {
> +		pr_err_ratelimited("zoned device management request denied.");
> +		bio->bi_status = BLK_STS_NOTSUPP;
> +		bio_endio(bio);
> +		return BLK_QC_T_NONE;

OK. That explains the operation not supported for "blkzone reset". I do not
think this is good. With this, the drive will be writable only exactly once,
without the possibility for the user to reset any zone and rewrite them. All
zone compliant file systems will fail (f2fs, zonefs, btrfs coming).

At the very least REQ_OP_ZONE_RESET should be allowed and trigger an
invalidation on the cache device of all blocks of the zone that are cached.

Note: the zone management operations will need to be remapped like report zones,
but in reverse: the BIO start sector must be increase by the zone size.

> +	}
> +
>  	generic_start_io_acct(q,
>  			      bio_op(bio),
>  			      bio_sectors(bio),
> @@ -1233,6 +1246,24 @@ static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
>  	if (dc->io_disable)
>  		return -EIO;
>  
> +	/*
> +	 * zone management ioctl commands may change the data layout
> +	 * and content implicitly on backing device, which introduces
> +	 * unacceptible inconsistency between the backing device and
> +	 * the cache device. Therefore all zone management related
> +	 * ioctl commands will be denied here.
> +	 */
> +	switch (cmd) {
> +	case BLKRESETZONE:
> +	case BLKOPENZONE:
> +	case BLKCLOSEZONE:
> +	case BLKFINISHZONE:
> +		return -EOPNOTSUPP;

Same comment as above. Of note is that only BLKRESETZONE will result in cache
inconsistency if not taken care of with an invalidation of the cached blocks of
the zone on the cache device. Open and close operations have no effect on data.
Finish zone will artificially increase the zone write pointer to the end of the
zone to make it full but without actually writing any data. So there is no need
I think to change anything on the cache device in that case.

> +	default:
> +		/* Other commands fall through*/
> +		break;
> +	}
> +
>  	return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg);
>  }
>  
> @@ -1261,6 +1292,48 @@ static int cached_dev_congested(void *data, int bits)
>  	return ret;
>  }
>  
> +static int cached_dev_report_zones_cb(struct blk_zone *zone,
> +				      unsigned int idx,
> +				      void *data)
> +{
> +	struct bch_report_zones_args *args = data;
> +	struct bcache_device *d = args->bcache_device;
> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
> +	unsigned long data_offset = dc->sb.data_offset;
> +
> +	if (zone->start >= data_offset) {
> +		zone->start -= data_offset;
> +		zone->wp -= data_offset;

Since the zone that contains the super block has to be ignored, the remapping of
the zone start can be done unconditionally. For the write pointer remapping, you
need to handle several cases: conventional zones, full zones and read-only zones
do not have a valid write pointer value, so no updating. You also need to skip
offline zones.

> +	} else {
> +		/*
> +		 * Normally the first zone is conventional zone,
> +		 * we don't need to worry about how to maintain
> +		 * the write pointer.
> +		 * But the zone->len is special, because the
> +		 * sector 0 on bcache device is 8KB offset on
> +		 * backing device indeed.
> +		 */
> +		zone->len -= data_offset;

This should not be necessary if the first zone containing the super block is
skipped entirely.

> +	}
> +
> +	return args->orig_cb(zone, idx, args->orig_data);
> +}
> +
> +static int cached_dev_report_zones(struct bch_report_zones_args *args,
> +				   unsigned int nr_zones)
> +{
> +	struct bcache_device *d = args->bcache_device;
> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
> +	sector_t sector = args->sector + dc->sb.data_offset;
> +
> +	/* sector is real LBA of backing device */
> +	return blkdev_report_zones(dc->bdev,
> +				   sector,
> +				   nr_zones,
> +				   cached_dev_report_zones_cb,
> +				   args);
> +}
> +
>  void bch_cached_dev_request_init(struct cached_dev *dc)
>  {
>  	struct gendisk *g = dc->disk.disk;
> @@ -1268,6 +1341,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc)
>  	g->queue->backing_dev_info->congested_fn = cached_dev_congested;
>  	dc->disk.cache_miss			= cached_dev_cache_miss;
>  	dc->disk.ioctl				= cached_dev_ioctl;
> +	dc->disk.report_zones			= cached_dev_report_zones;
>  }
>  
>  /* Flash backed devices */
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index d98354fa28e3..b7d496040cee 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -679,10 +679,31 @@ static int ioctl_dev(struct block_device *b, fmode_t mode,
>  	return d->ioctl(d, mode, cmd, arg);
>  }
>  
> +static int report_zones_dev(struct gendisk *disk,
> +			    sector_t sector,
> +			    unsigned int nr_zones,
> +			    report_zones_cb cb,
> +			    void *data)
> +{
> +	struct bcache_device *d = disk->private_data;
> +	struct bch_report_zones_args args = {
> +		.bcache_device = d,
> +		.sector = sector,
> +		.orig_data = data,
> +		.orig_cb = cb,
> +	};
> +
> +	if (d->report_zones)
> +		return d->report_zones(&args, nr_zones);

It looks to me like this is not necessary. This function could just be
cached_dev_report_zones() and you can drop the dc->disk.report_zones method.

> +
> +	return -EOPNOTSUPP;
> +}
> +
>  static const struct block_device_operations bcache_ops = {
>  	.open		= open_dev,
>  	.release	= release_dev,
>  	.ioctl		= ioctl_dev,
> +	.report_zones	= report_zones_dev,
>  	.owner		= THIS_MODULE,
>  };
>  
> @@ -815,8 +836,12 @@ static void bcache_device_free(struct bcache_device *d)
>  	closure_debug_destroy(&d->cl);
>  }
>  
> -static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
> -			      sector_t sectors, make_request_fn make_request_fn)
> +static int bcache_device_init(struct cached_dev *dc,
> +			      struct bcache_device *d,
> +			      unsigned int block_size,
> +			      sector_t sectors,
> +			      make_request_fn make_request_fn)
> +
>  {
>  	struct request_queue *q;
>  	const size_t max_stripes = min_t(size_t, INT_MAX,
> @@ -886,6 +911,17 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>  	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, d->disk->queue);
>  	blk_queue_flag_set(QUEUE_FLAG_DISCARD, d->disk->queue);
>  
> +	/*
> +	 * If this is for backing device registration, and it is an
> +	 * zoned device (e.g. host-managed S.M.R. hard drive), set
> +	 * up zoned device attribution properly for sysfs interface.
> +	 */
> +	if (dc && bdev_is_zoned(dc->bdev)) {
> +		q->limits.zoned = bdev_zoned_model(dc->bdev);
> +		q->nr_zones = blkdev_nr_zones(dc->bdev->bd_disk);
> +		q->limits.chunk_sectors = bdev_zone_sectors(dc->bdev);

You need to call blk_revalidate_disk_zones() here:

	blk_revalidate_disk_zones(dc->bdev->bd_disk);

but call it *after* setting the bc device capacity to

	get_capacity(dc->bdev->bd_disk) - bdev_zone_sectors(dc->bdev);

Which I think is in fact the sectors argument to this function ?

With that information blk_revalidate_disk_zones() will check the zones and set
q->nr_zones.


> +	}
> +
>  	blk_queue_write_cache(q, true, true);
>  
>  	return 0;
> @@ -1337,9 +1373,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>  		dc->partial_stripes_expensive =
>  			q->limits.raid_partial_stripes_expensive;
>  
> -	ret = bcache_device_init(&dc->disk, block_size,
> -			 dc->bdev->bd_part->nr_sects - dc->sb.data_offset,
> -			 cached_dev_make_request);
> +	ret = bcache_device_init(dc, &dc->disk, block_size,
> +			dc->bdev->bd_part->nr_sects - dc->sb.data_offset,

dc->sb.data_offset should be the device zone size (chunk sectors) to skip the
entire first zone and preserve the "all zones have the same size" constraint.

> +			cached_dev_make_request);
>  	if (ret)
>  		return ret;
>  
> @@ -1451,8 +1487,9 @@ static int flash_dev_run(struct cache_set *c, struct uuid_entry *u)
>  
>  	kobject_init(&d->kobj, &bch_flash_dev_ktype);
>  
> -	if (bcache_device_init(d, block_bytes(c), u->sectors,
> -			flash_dev_make_request))
> +	if (bcache_device_init(NULL, d, block_bytes(c),
> +			       u->sectors,
> +			       flash_dev_make_request))
>  		goto err;
>  
>  	bcache_device_attach(d, c, u - c->uuids);
>
Coly Li May 11, 2020, 11 a.m. UTC | #5
On 2020/5/11 17:06, Damien Le Moal wrote:
> On 2020/05/11 1:52, Coly Li wrote:
>> This is a very basic zoned device support. With this patch, bcache
>> device is able to,
>> - Export zoned device attribution via sysfs
>> - Response report zones request, e.g. by command 'blkzone report'
>> But the bcache device is still NOT able to,
>> - Response any zoned device management request or IOCTL command
>>
>> Here are the testings I have done,
>> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
>> - read /sys/block/bcache0/queue/nr_zones, content is number of zones
>>   including all zone types.
>> - read /sys/block/bcache0/queue/chunk_sectors, content is zone size
>>   in sectors.
>> - run 'blkzone report /dev/bcache0', all zones information displayed.
>> - run 'blkzone reset /dev/bcache0', operation is rejected with error
>>   information: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed:
>>   Operation not supported"
> 
> Weird. If report zones works, reset should also, modulo the zone size problem
> for the first zone. You may get EINVAL, but not ENOTTY.
> 

This is on purpose. Because reset the underlying zones layout may
corrupt the bcache mapping between cache device and backing device. So
such management commands are rejected.



>> - Sequential writes by dd, I can see some zones' write pointer 'wptr'
>>   values updated.
>>
>> All of these are very basic testings, if you have better testing
>> tools or cases, please offer me hint.
>>
>> Thanks in advance for your review and comments.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> CC: Hannes Reinecke <hare@suse.com>
>> CC: Damien Le Moal <damien.lemoal@wdc.com>
>> CC: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>>  drivers/md/bcache/bcache.h  | 10 +++++
>>  drivers/md/bcache/request.c | 74 +++++++++++++++++++++++++++++++++++++
>>  drivers/md/bcache/super.c   | 51 +++++++++++++++++++++----
>>  3 files changed, 128 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> index 74a9849ea164..0d298b48707f 100644
>> --- a/drivers/md/bcache/bcache.h
>> +++ b/drivers/md/bcache/bcache.h
>> @@ -221,6 +221,7 @@ BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1);
>>  struct search;
>>  struct btree;
>>  struct keybuf;
>> +struct bch_report_zones_args;
>>  
>>  struct keybuf_key {
>>  	struct rb_node		node;
>> @@ -277,6 +278,8 @@ struct bcache_device {
>>  			  struct bio *bio, unsigned int sectors);
>>  	int (*ioctl)(struct bcache_device *d, fmode_t mode,
>>  		     unsigned int cmd, unsigned long arg);
>> +	int (*report_zones)(struct bch_report_zones_args *args,
>> +			    unsigned int nr_zones);
>>  };
>>  
>>  struct io {
>> @@ -748,6 +751,13 @@ struct bbio {
>>  	struct bio		bio;
>>  };
>>  
>> +struct bch_report_zones_args {
>> +	struct bcache_device *bcache_device;
>> +	sector_t sector;
>> +	void *orig_data;
>> +	report_zones_cb orig_cb;
>> +};
>> +
>>  #define BTREE_PRIO		USHRT_MAX
>>  #define INITIAL_PRIO		32768U
>>  
>> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
>> index 71a90fbec314..bd50204caac7 100644
>> --- a/drivers/md/bcache/request.c
>> +++ b/drivers/md/bcache/request.c
>> @@ -1190,6 +1190,19 @@ blk_qc_t cached_dev_make_request(struct request_queue *q, struct bio *bio)
>>  		}
>>  	}
>>  
>> +	/*
>> +	 * zone management request may change the data layout and content
>> +	 * implicitly on backing device, which introduces unacceptible
> 
> s/unacceptible/unacceptable
> 
>> +	 * inconsistency between the backing device and the cache device.
>> +	 * Therefore all zone management related request will be denied here.
>> +	 */
>> +	if (op_is_zone_mgmt(bio->bi_opf)) {
>> +		pr_err_ratelimited("zoned device management request denied.");
>> +		bio->bi_status = BLK_STS_NOTSUPP;
>> +		bio_endio(bio);
>> +		return BLK_QC_T_NONE;
> 
> OK. That explains the operation not supported for "blkzone reset". I do not
> think this is good. With this, the drive will be writable only exactly once,
> without the possibility for the user to reset any zone and rewrite them. All
> zone compliant file systems will fail (f2fs, zonefs, btrfs coming).
> 
> At the very least REQ_OP_ZONE_RESET should be allowed and trigger an
> invalidation on the cache device of all blocks of the zone that are cached.
> 

Copied. Then I should find a method to invalid the cached data on SSD in
a proper way.

> Note: the zone management operations will need to be remapped like report zones,
> but in reverse: the BIO start sector must be increase by the zone size.
> 

Thanks for the hint :-)


>> +	}
>> +
>>  	generic_start_io_acct(q,
>>  			      bio_op(bio),
>>  			      bio_sectors(bio),
>> @@ -1233,6 +1246,24 @@ static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
>>  	if (dc->io_disable)
>>  		return -EIO;
>>  
>> +	/*
>> +	 * zone management ioctl commands may change the data layout
>> +	 * and content implicitly on backing device, which introduces
>> +	 * unacceptible inconsistency between the backing device and
>> +	 * the cache device. Therefore all zone management related
>> +	 * ioctl commands will be denied here.
>> +	 */
>> +	switch (cmd) {
>> +	case BLKRESETZONE:
>> +	case BLKOPENZONE:
>> +	case BLKCLOSEZONE:
>> +	case BLKFINISHZONE:
>> +		return -EOPNOTSUPP;
> 
> Same comment as above. Of note is that only BLKRESETZONE will result in cache
> inconsistency if not taken care of with an invalidation of the cached blocks of
> the zone on the cache device. Open and close operations have no effect on data.
> Finish zone will artificially increase the zone write pointer to the end of the
> zone to make it full but without actually writing any data. So there is no need
> I think to change anything on the cache device in that case.
> 

Copied. I will handle this.

One thing not cleared to me is, what is the purpose of BLKOPENZONE and
BLKCLOSEZONE, is it used to update writer pointer only ? Or it is also
used to set a specific zone to be offline ?


>> +	default:
>> +		/* Other commands fall through*/
>> +		break;
>> +	}
>> +
>>  	return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg);
>>  }
>>  
>> @@ -1261,6 +1292,48 @@ static int cached_dev_congested(void *data, int bits)
>>  	return ret;
>>  }
>>  
>> +static int cached_dev_report_zones_cb(struct blk_zone *zone,
>> +				      unsigned int idx,
>> +				      void *data)
>> +{
>> +	struct bch_report_zones_args *args = data;
>> +	struct bcache_device *d = args->bcache_device;
>> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
>> +	unsigned long data_offset = dc->sb.data_offset;
>> +
>> +	if (zone->start >= data_offset) {
>> +		zone->start -= data_offset;
>> +		zone->wp -= data_offset;
> 
> Since the zone that contains the super block has to be ignored, the remapping of
> the zone start can be done unconditionally. For the write pointer remapping, you
> need to handle several cases: conventional zones, full zones and read-only zones
> do not have a valid write pointer value, so no updating. You also need to skip
> offline zones.
> 

Copied, I will fix here in next version.


>> +	} else {
>> +		/*
>> +		 * Normally the first zone is conventional zone,
>> +		 * we don't need to worry about how to maintain
>> +		 * the write pointer.
>> +		 * But the zone->len is special, because the
>> +		 * sector 0 on bcache device is 8KB offset on
>> +		 * backing device indeed.
>> +		 */
>> +		zone->len -= data_offset;
> 
> This should not be necessary if the first zone containing the super block is
> skipped entirely.
> 

Yes, it will be.


>> +	}
>> +
>> +	return args->orig_cb(zone, idx, args->orig_data);
>> +}
>> +
>> +static int cached_dev_report_zones(struct bch_report_zones_args *args,
>> +				   unsigned int nr_zones)
>> +{
>> +	struct bcache_device *d = args->bcache_device;
>> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
>> +	sector_t sector = args->sector + dc->sb.data_offset;
>> +
>> +	/* sector is real LBA of backing device */
>> +	return blkdev_report_zones(dc->bdev,
>> +				   sector,
>> +				   nr_zones,
>> +				   cached_dev_report_zones_cb,
>> +				   args);
>> +}
>> +
>>  void bch_cached_dev_request_init(struct cached_dev *dc)
>>  {
>>  	struct gendisk *g = dc->disk.disk;
>> @@ -1268,6 +1341,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc)
>>  	g->queue->backing_dev_info->congested_fn = cached_dev_congested;
>>  	dc->disk.cache_miss			= cached_dev_cache_miss;
>>  	dc->disk.ioctl				= cached_dev_ioctl;
>> +	dc->disk.report_zones			= cached_dev_report_zones;
>>  }
>>  
>>  /* Flash backed devices */
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index d98354fa28e3..b7d496040cee 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -679,10 +679,31 @@ static int ioctl_dev(struct block_device *b, fmode_t mode,
>>  	return d->ioctl(d, mode, cmd, arg);
>>  }
>>  
>> +static int report_zones_dev(struct gendisk *disk,
>> +			    sector_t sector,
>> +			    unsigned int nr_zones,
>> +			    report_zones_cb cb,
>> +			    void *data)
>> +{
>> +	struct bcache_device *d = disk->private_data;
>> +	struct bch_report_zones_args args = {
>> +		.bcache_device = d,
>> +		.sector = sector,
>> +		.orig_data = data,
>> +		.orig_cb = cb,
>> +	};
>> +
>> +	if (d->report_zones)
>> +		return d->report_zones(&args, nr_zones);
> 
> It looks to me like this is not necessary. This function could just be
> cached_dev_report_zones() and you can drop the dc->disk.report_zones method.
> 

Yes, I will fix it.


>> +
>> +	return -EOPNOTSUPP;
>> +}
>> +
>>  static const struct block_device_operations bcache_ops = {
>>  	.open		= open_dev,
>>  	.release	= release_dev,
>>  	.ioctl		= ioctl_dev,
>> +	.report_zones	= report_zones_dev,
>>  	.owner		= THIS_MODULE,
>>  };
>>  
>> @@ -815,8 +836,12 @@ static void bcache_device_free(struct bcache_device *d)
>>  	closure_debug_destroy(&d->cl);
>>  }
>>  
>> -static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>> -			      sector_t sectors, make_request_fn make_request_fn)
>> +static int bcache_device_init(struct cached_dev *dc,
>> +			      struct bcache_device *d,
>> +			      unsigned int block_size,
>> +			      sector_t sectors,
>> +			      make_request_fn make_request_fn)
>> +
>>  {
>>  	struct request_queue *q;
>>  	const size_t max_stripes = min_t(size_t, INT_MAX,
>> @@ -886,6 +911,17 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>>  	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, d->disk->queue);
>>  	blk_queue_flag_set(QUEUE_FLAG_DISCARD, d->disk->queue);
>>  
>> +	/*
>> +	 * If this is for backing device registration, and it is an
>> +	 * zoned device (e.g. host-managed S.M.R. hard drive), set
>> +	 * up zoned device attribution properly for sysfs interface.
>> +	 */
>> +	if (dc && bdev_is_zoned(dc->bdev)) {
>> +		q->limits.zoned = bdev_zoned_model(dc->bdev);
>> +		q->nr_zones = blkdev_nr_zones(dc->bdev->bd_disk);
>> +		q->limits.chunk_sectors = bdev_zone_sectors(dc->bdev);
> 
> You need to call blk_revalidate_disk_zones() here:
> 
> 	blk_revalidate_disk_zones(dc->bdev->bd_disk);
> 
> but call it *after* setting the bc device capacity to
> 
> 	get_capacity(dc->bdev->bd_disk) - bdev_zone_sectors(dc->bdev);
> 
> Which I think is in fact the sectors argument to this function ?
> 
> With that information blk_revalidate_disk_zones() will check the zones and set
> q->nr_zones.
> 
> 

This is something I didn't notice. Yes I should revalidate teh zones.
Will do it in next version.


>> +	}
>> +
>>  	blk_queue_write_cache(q, true, true);
>>  
>>  	return 0;
>> @@ -1337,9 +1373,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>>  		dc->partial_stripes_expensive =
>>  			q->limits.raid_partial_stripes_expensive;
>>  
>> -	ret = bcache_device_init(&dc->disk, block_size,
>> -			 dc->bdev->bd_part->nr_sects - dc->sb.data_offset,
>> -			 cached_dev_make_request);
>> +	ret = bcache_device_init(dc, &dc->disk, block_size,
>> +			dc->bdev->bd_part->nr_sects - dc->sb.data_offset,
> 
> dc->sb.data_offset should be the device zone size (chunk sectors) to skip the
> entire first zone and preserve the "all zones have the same size" constraint.
> 


Sure, and the bcache-tools should be fixed to recognize zoned device as
well.

>> +			cached_dev_make_request);
>>  	if (ret)
>>  		return ret;
>>  
>> @@ -1451,8 +1487,9 @@ static int flash_dev_run(struct cache_set *c, struct uuid_entry *u)
>>  
>>  	kobject_init(&d->kobj, &bch_flash_dev_ktype);
>>  
>> -	if (bcache_device_init(d, block_bytes(c), u->sectors,
>> -			flash_dev_make_request))
>> +	if (bcache_device_init(NULL, d, block_bytes(c),
>> +			       u->sectors,
>> +			       flash_dev_make_request))
>>  		goto err;
>>  
>>  	bcache_device_attach(d, c, u - c->uuids);
>>
> 
> 

Thanks for your review. I will post v2 soon.

Coly Li
Damien Le Moal May 11, 2020, 11:20 a.m. UTC | #6
On 2020/05/11 20:00, Coly Li wrote:
> On 2020/5/11 17:06, Damien Le Moal wrote:
>> On 2020/05/11 1:52, Coly Li wrote:
>>> This is a very basic zoned device support. With this patch, bcache
>>> device is able to,
>>> - Export zoned device attribution via sysfs
>>> - Response report zones request, e.g. by command 'blkzone report'
>>> But the bcache device is still NOT able to,
>>> - Response any zoned device management request or IOCTL command
>>>
>>> Here are the testings I have done,
>>> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
>>> - read /sys/block/bcache0/queue/nr_zones, content is number of zones
>>>   including all zone types.
>>> - read /sys/block/bcache0/queue/chunk_sectors, content is zone size
>>>   in sectors.
>>> - run 'blkzone report /dev/bcache0', all zones information displayed.
>>> - run 'blkzone reset /dev/bcache0', operation is rejected with error
>>>   information: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed:
>>>   Operation not supported"
>>
>> Weird. If report zones works, reset should also, modulo the zone size problem
>> for the first zone. You may get EINVAL, but not ENOTTY.
>>
> 
> This is on purpose. Because reset the underlying zones layout may
> corrupt the bcache mapping between cache device and backing device. So
> such management commands are rejected.
> 
> 
> 
>>> - Sequential writes by dd, I can see some zones' write pointer 'wptr'
>>>   values updated.
>>>
>>> All of these are very basic testings, if you have better testing
>>> tools or cases, please offer me hint.
>>>
>>> Thanks in advance for your review and comments.
>>>
>>> Signed-off-by: Coly Li <colyli@suse.de>
>>> CC: Hannes Reinecke <hare@suse.com>
>>> CC: Damien Le Moal <damien.lemoal@wdc.com>
>>> CC: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>> ---
>>>  drivers/md/bcache/bcache.h  | 10 +++++
>>>  drivers/md/bcache/request.c | 74 +++++++++++++++++++++++++++++++++++++
>>>  drivers/md/bcache/super.c   | 51 +++++++++++++++++++++----
>>>  3 files changed, 128 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>>> index 74a9849ea164..0d298b48707f 100644
>>> --- a/drivers/md/bcache/bcache.h
>>> +++ b/drivers/md/bcache/bcache.h
>>> @@ -221,6 +221,7 @@ BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1);
>>>  struct search;
>>>  struct btree;
>>>  struct keybuf;
>>> +struct bch_report_zones_args;
>>>  
>>>  struct keybuf_key {
>>>  	struct rb_node		node;
>>> @@ -277,6 +278,8 @@ struct bcache_device {
>>>  			  struct bio *bio, unsigned int sectors);
>>>  	int (*ioctl)(struct bcache_device *d, fmode_t mode,
>>>  		     unsigned int cmd, unsigned long arg);
>>> +	int (*report_zones)(struct bch_report_zones_args *args,
>>> +			    unsigned int nr_zones);
>>>  };
>>>  
>>>  struct io {
>>> @@ -748,6 +751,13 @@ struct bbio {
>>>  	struct bio		bio;
>>>  };
>>>  
>>> +struct bch_report_zones_args {
>>> +	struct bcache_device *bcache_device;
>>> +	sector_t sector;
>>> +	void *orig_data;
>>> +	report_zones_cb orig_cb;
>>> +};
>>> +
>>>  #define BTREE_PRIO		USHRT_MAX
>>>  #define INITIAL_PRIO		32768U
>>>  
>>> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
>>> index 71a90fbec314..bd50204caac7 100644
>>> --- a/drivers/md/bcache/request.c
>>> +++ b/drivers/md/bcache/request.c
>>> @@ -1190,6 +1190,19 @@ blk_qc_t cached_dev_make_request(struct request_queue *q, struct bio *bio)
>>>  		}
>>>  	}
>>>  
>>> +	/*
>>> +	 * zone management request may change the data layout and content
>>> +	 * implicitly on backing device, which introduces unacceptible
>>
>> s/unacceptible/unacceptable
>>
>>> +	 * inconsistency between the backing device and the cache device.
>>> +	 * Therefore all zone management related request will be denied here.
>>> +	 */
>>> +	if (op_is_zone_mgmt(bio->bi_opf)) {
>>> +		pr_err_ratelimited("zoned device management request denied.");
>>> +		bio->bi_status = BLK_STS_NOTSUPP;
>>> +		bio_endio(bio);
>>> +		return BLK_QC_T_NONE;
>>
>> OK. That explains the operation not supported for "blkzone reset". I do not
>> think this is good. With this, the drive will be writable only exactly once,
>> without the possibility for the user to reset any zone and rewrite them. All
>> zone compliant file systems will fail (f2fs, zonefs, btrfs coming).
>>
>> At the very least REQ_OP_ZONE_RESET should be allowed and trigger an
>> invalidation on the cache device of all blocks of the zone that are cached.
>>
> 
> Copied. Then I should find a method to invalid the cached data on SSD in
> a proper way.
> 
>> Note: the zone management operations will need to be remapped like report zones,
>> but in reverse: the BIO start sector must be increase by the zone size.
>>
> 
> Thanks for the hint :-)
> 
> 
>>> +	}
>>> +
>>>  	generic_start_io_acct(q,
>>>  			      bio_op(bio),
>>>  			      bio_sectors(bio),
>>> @@ -1233,6 +1246,24 @@ static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
>>>  	if (dc->io_disable)
>>>  		return -EIO;
>>>  
>>> +	/*
>>> +	 * zone management ioctl commands may change the data layout
>>> +	 * and content implicitly on backing device, which introduces
>>> +	 * unacceptible inconsistency between the backing device and
>>> +	 * the cache device. Therefore all zone management related
>>> +	 * ioctl commands will be denied here.
>>> +	 */
>>> +	switch (cmd) {
>>> +	case BLKRESETZONE:
>>> +	case BLKOPENZONE:
>>> +	case BLKCLOSEZONE:
>>> +	case BLKFINISHZONE:
>>> +		return -EOPNOTSUPP;
>>
>> Same comment as above. Of note is that only BLKRESETZONE will result in cache
>> inconsistency if not taken care of with an invalidation of the cached blocks of
>> the zone on the cache device. Open and close operations have no effect on data.
>> Finish zone will artificially increase the zone write pointer to the end of the
>> zone to make it full but without actually writing any data. So there is no need
>> I think to change anything on the cache device in that case.
>>
> 
> Copied. I will handle this.
> 
> One thing not cleared to me is, what is the purpose of BLKOPENZONE and
> BLKCLOSEZONE, is it used to update writer pointer only ? Or it is also
> used to set a specific zone to be offline ?

open and close have no effect on the wp of a zone. No change. The open operation
is a hint to the drive to tell it "I am going to write intensively to this zone,
so keep resources for it around" and the close operation frees the zone
resources. Everything works fine without these for SMR HDDs and the performance
gain from using open/close can hardly be measured at all. These operations will
however be more important for upcoming ZNS devices due to the different media
characteristics. Since these operations have no impact on the zone data, they
are safe for bcache, regardless of the cache state.

Only reset and finish change a zone wp. And only reset needs invalidation of the
cache device blocks of the zone.

> 
> 
>>> +	default:
>>> +		/* Other commands fall through*/
>>> +		break;
>>> +	}
>>> +
>>>  	return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg);
>>>  }
>>>  
>>> @@ -1261,6 +1292,48 @@ static int cached_dev_congested(void *data, int bits)
>>>  	return ret;
>>>  }
>>>  
>>> +static int cached_dev_report_zones_cb(struct blk_zone *zone,
>>> +				      unsigned int idx,
>>> +				      void *data)
>>> +{
>>> +	struct bch_report_zones_args *args = data;
>>> +	struct bcache_device *d = args->bcache_device;
>>> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
>>> +	unsigned long data_offset = dc->sb.data_offset;
>>> +
>>> +	if (zone->start >= data_offset) {
>>> +		zone->start -= data_offset;
>>> +		zone->wp -= data_offset;
>>
>> Since the zone that contains the super block has to be ignored, the remapping of
>> the zone start can be done unconditionally. For the write pointer remapping, you
>> need to handle several cases: conventional zones, full zones and read-only zones
>> do not have a valid write pointer value, so no updating. You also need to skip
>> offline zones.
>>
> 
> Copied, I will fix here in next version.
> 
> 
>>> +	} else {
>>> +		/*
>>> +		 * Normally the first zone is conventional zone,
>>> +		 * we don't need to worry about how to maintain
>>> +		 * the write pointer.
>>> +		 * But the zone->len is special, because the
>>> +		 * sector 0 on bcache device is 8KB offset on
>>> +		 * backing device indeed.
>>> +		 */
>>> +		zone->len -= data_offset;
>>
>> This should not be necessary if the first zone containing the super block is
>> skipped entirely.
>>
> 
> Yes, it will be.
> 
> 
>>> +	}
>>> +
>>> +	return args->orig_cb(zone, idx, args->orig_data);
>>> +}
>>> +
>>> +static int cached_dev_report_zones(struct bch_report_zones_args *args,
>>> +				   unsigned int nr_zones)
>>> +{
>>> +	struct bcache_device *d = args->bcache_device;
>>> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
>>> +	sector_t sector = args->sector + dc->sb.data_offset;
>>> +
>>> +	/* sector is real LBA of backing device */
>>> +	return blkdev_report_zones(dc->bdev,
>>> +				   sector,
>>> +				   nr_zones,
>>> +				   cached_dev_report_zones_cb,
>>> +				   args);
>>> +}
>>> +
>>>  void bch_cached_dev_request_init(struct cached_dev *dc)
>>>  {
>>>  	struct gendisk *g = dc->disk.disk;
>>> @@ -1268,6 +1341,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc)
>>>  	g->queue->backing_dev_info->congested_fn = cached_dev_congested;
>>>  	dc->disk.cache_miss			= cached_dev_cache_miss;
>>>  	dc->disk.ioctl				= cached_dev_ioctl;
>>> +	dc->disk.report_zones			= cached_dev_report_zones;
>>>  }
>>>  
>>>  /* Flash backed devices */
>>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>>> index d98354fa28e3..b7d496040cee 100644
>>> --- a/drivers/md/bcache/super.c
>>> +++ b/drivers/md/bcache/super.c
>>> @@ -679,10 +679,31 @@ static int ioctl_dev(struct block_device *b, fmode_t mode,
>>>  	return d->ioctl(d, mode, cmd, arg);
>>>  }
>>>  
>>> +static int report_zones_dev(struct gendisk *disk,
>>> +			    sector_t sector,
>>> +			    unsigned int nr_zones,
>>> +			    report_zones_cb cb,
>>> +			    void *data)
>>> +{
>>> +	struct bcache_device *d = disk->private_data;
>>> +	struct bch_report_zones_args args = {
>>> +		.bcache_device = d,
>>> +		.sector = sector,
>>> +		.orig_data = data,
>>> +		.orig_cb = cb,
>>> +	};
>>> +
>>> +	if (d->report_zones)
>>> +		return d->report_zones(&args, nr_zones);
>>
>> It looks to me like this is not necessary. This function could just be
>> cached_dev_report_zones() and you can drop the dc->disk.report_zones method.
>>
> 
> Yes, I will fix it.
> 
> 
>>> +
>>> +	return -EOPNOTSUPP;
>>> +}
>>> +
>>>  static const struct block_device_operations bcache_ops = {
>>>  	.open		= open_dev,
>>>  	.release	= release_dev,
>>>  	.ioctl		= ioctl_dev,
>>> +	.report_zones	= report_zones_dev,
>>>  	.owner		= THIS_MODULE,
>>>  };
>>>  
>>> @@ -815,8 +836,12 @@ static void bcache_device_free(struct bcache_device *d)
>>>  	closure_debug_destroy(&d->cl);
>>>  }
>>>  
>>> -static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>>> -			      sector_t sectors, make_request_fn make_request_fn)
>>> +static int bcache_device_init(struct cached_dev *dc,
>>> +			      struct bcache_device *d,
>>> +			      unsigned int block_size,
>>> +			      sector_t sectors,
>>> +			      make_request_fn make_request_fn)
>>> +
>>>  {
>>>  	struct request_queue *q;
>>>  	const size_t max_stripes = min_t(size_t, INT_MAX,
>>> @@ -886,6 +911,17 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>>>  	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, d->disk->queue);
>>>  	blk_queue_flag_set(QUEUE_FLAG_DISCARD, d->disk->queue);
>>>  
>>> +	/*
>>> +	 * If this is for backing device registration, and it is an
>>> +	 * zoned device (e.g. host-managed S.M.R. hard drive), set
>>> +	 * up zoned device attribution properly for sysfs interface.
>>> +	 */
>>> +	if (dc && bdev_is_zoned(dc->bdev)) {
>>> +		q->limits.zoned = bdev_zoned_model(dc->bdev);
>>> +		q->nr_zones = blkdev_nr_zones(dc->bdev->bd_disk);
>>> +		q->limits.chunk_sectors = bdev_zone_sectors(dc->bdev);
>>
>> You need to call blk_revalidate_disk_zones() here:
>>
>> 	blk_revalidate_disk_zones(dc->bdev->bd_disk);
>>
>> but call it *after* setting the bc device capacity to
>>
>> 	get_capacity(dc->bdev->bd_disk) - bdev_zone_sectors(dc->bdev);
>>
>> Which I think is in fact the sectors argument to this function ?
>>
>> With that information blk_revalidate_disk_zones() will check the zones and set
>> q->nr_zones.
>>
>>
> 
> This is something I didn't notice. Yes I should revalidate teh zones.
> Will do it in next version.
> 
> 
>>> +	}
>>> +
>>>  	blk_queue_write_cache(q, true, true);
>>>  
>>>  	return 0;
>>> @@ -1337,9 +1373,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>>>  		dc->partial_stripes_expensive =
>>>  			q->limits.raid_partial_stripes_expensive;
>>>  
>>> -	ret = bcache_device_init(&dc->disk, block_size,
>>> -			 dc->bdev->bd_part->nr_sects - dc->sb.data_offset,
>>> -			 cached_dev_make_request);
>>> +	ret = bcache_device_init(dc, &dc->disk, block_size,
>>> +			dc->bdev->bd_part->nr_sects - dc->sb.data_offset,
>>
>> dc->sb.data_offset should be the device zone size (chunk sectors) to skip the
>> entire first zone and preserve the "all zones have the same size" constraint.
>>
> 
> 
> Sure, and the bcache-tools should be fixed to recognize zoned device as
> well.
> 
>>> +			cached_dev_make_request);
>>>  	if (ret)
>>>  		return ret;
>>>  
>>> @@ -1451,8 +1487,9 @@ static int flash_dev_run(struct cache_set *c, struct uuid_entry *u)
>>>  
>>>  	kobject_init(&d->kobj, &bch_flash_dev_ktype);
>>>  
>>> -	if (bcache_device_init(d, block_bytes(c), u->sectors,
>>> -			flash_dev_make_request))
>>> +	if (bcache_device_init(NULL, d, block_bytes(c),
>>> +			       u->sectors,
>>> +			       flash_dev_make_request))
>>>  		goto err;
>>>  
>>>  	bcache_device_attach(d, c, u - c->uuids);
>>>
>>
>>
> 
> Thanks for your review. I will post v2 soon.
> 
> Coly Li
> 
>

Patch
diff mbox series

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 74a9849ea164..0d298b48707f 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -221,6 +221,7 @@  BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1);
 struct search;
 struct btree;
 struct keybuf;
+struct bch_report_zones_args;
 
 struct keybuf_key {
 	struct rb_node		node;
@@ -277,6 +278,8 @@  struct bcache_device {
 			  struct bio *bio, unsigned int sectors);
 	int (*ioctl)(struct bcache_device *d, fmode_t mode,
 		     unsigned int cmd, unsigned long arg);
+	int (*report_zones)(struct bch_report_zones_args *args,
+			    unsigned int nr_zones);
 };
 
 struct io {
@@ -748,6 +751,13 @@  struct bbio {
 	struct bio		bio;
 };
 
+struct bch_report_zones_args {
+	struct bcache_device *bcache_device;
+	sector_t sector;
+	void *orig_data;
+	report_zones_cb orig_cb;
+};
+
 #define BTREE_PRIO		USHRT_MAX
 #define INITIAL_PRIO		32768U
 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 71a90fbec314..bd50204caac7 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -1190,6 +1190,19 @@  blk_qc_t cached_dev_make_request(struct request_queue *q, struct bio *bio)
 		}
 	}
 
+	/*
+	 * zone management request may change the data layout and content
+	 * implicitly on backing device, which introduces unacceptible
+	 * inconsistency between the backing device and the cache device.
+	 * Therefore all zone management related request will be denied here.
+	 */
+	if (op_is_zone_mgmt(bio->bi_opf)) {
+		pr_err_ratelimited("zoned device management request denied.");
+		bio->bi_status = BLK_STS_NOTSUPP;
+		bio_endio(bio);
+		return BLK_QC_T_NONE;
+	}
+
 	generic_start_io_acct(q,
 			      bio_op(bio),
 			      bio_sectors(bio),
@@ -1233,6 +1246,24 @@  static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
 	if (dc->io_disable)
 		return -EIO;
 
+	/*
+	 * zone management ioctl commands may change the data layout
+	 * and content implicitly on backing device, which introduces
+	 * unacceptible inconsistency between the backing device and
+	 * the cache device. Therefore all zone management related
+	 * ioctl commands will be denied here.
+	 */
+	switch (cmd) {
+	case BLKRESETZONE:
+	case BLKOPENZONE:
+	case BLKCLOSEZONE:
+	case BLKFINISHZONE:
+		return -EOPNOTSUPP;
+	default:
+		/* Other commands fall through*/
+		break;
+	}
+
 	return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg);
 }
 
@@ -1261,6 +1292,48 @@  static int cached_dev_congested(void *data, int bits)
 	return ret;
 }
 
+static int cached_dev_report_zones_cb(struct blk_zone *zone,
+				      unsigned int idx,
+				      void *data)
+{
+	struct bch_report_zones_args *args = data;
+	struct bcache_device *d = args->bcache_device;
+	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
+	unsigned long data_offset = dc->sb.data_offset;
+
+	if (zone->start >= data_offset) {
+		zone->start -= data_offset;
+		zone->wp -= data_offset;
+	} else {
+		/*
+		 * Normally the first zone is conventional zone,
+		 * we don't need to worry about how to maintain
+		 * the write pointer.
+		 * But the zone->len is special, because the
+		 * sector 0 on bcache device is 8KB offset on
+		 * backing device indeed.
+		 */
+		zone->len -= data_offset;
+	}
+
+	return args->orig_cb(zone, idx, args->orig_data);
+}
+
+static int cached_dev_report_zones(struct bch_report_zones_args *args,
+				   unsigned int nr_zones)
+{
+	struct bcache_device *d = args->bcache_device;
+	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
+	sector_t sector = args->sector + dc->sb.data_offset;
+
+	/* sector is real LBA of backing device */
+	return blkdev_report_zones(dc->bdev,
+				   sector,
+				   nr_zones,
+				   cached_dev_report_zones_cb,
+				   args);
+}
+
 void bch_cached_dev_request_init(struct cached_dev *dc)
 {
 	struct gendisk *g = dc->disk.disk;
@@ -1268,6 +1341,7 @@  void bch_cached_dev_request_init(struct cached_dev *dc)
 	g->queue->backing_dev_info->congested_fn = cached_dev_congested;
 	dc->disk.cache_miss			= cached_dev_cache_miss;
 	dc->disk.ioctl				= cached_dev_ioctl;
+	dc->disk.report_zones			= cached_dev_report_zones;
 }
 
 /* Flash backed devices */
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index d98354fa28e3..b7d496040cee 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -679,10 +679,31 @@  static int ioctl_dev(struct block_device *b, fmode_t mode,
 	return d->ioctl(d, mode, cmd, arg);
 }
 
+static int report_zones_dev(struct gendisk *disk,
+			    sector_t sector,
+			    unsigned int nr_zones,
+			    report_zones_cb cb,
+			    void *data)
+{
+	struct bcache_device *d = disk->private_data;
+	struct bch_report_zones_args args = {
+		.bcache_device = d,
+		.sector = sector,
+		.orig_data = data,
+		.orig_cb = cb,
+	};
+
+	if (d->report_zones)
+		return d->report_zones(&args, nr_zones);
+
+	return -EOPNOTSUPP;
+}
+
 static const struct block_device_operations bcache_ops = {
 	.open		= open_dev,
 	.release	= release_dev,
 	.ioctl		= ioctl_dev,
+	.report_zones	= report_zones_dev,
 	.owner		= THIS_MODULE,
 };
 
@@ -815,8 +836,12 @@  static void bcache_device_free(struct bcache_device *d)
 	closure_debug_destroy(&d->cl);
 }
 
-static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
-			      sector_t sectors, make_request_fn make_request_fn)
+static int bcache_device_init(struct cached_dev *dc,
+			      struct bcache_device *d,
+			      unsigned int block_size,
+			      sector_t sectors,
+			      make_request_fn make_request_fn)
+
 {
 	struct request_queue *q;
 	const size_t max_stripes = min_t(size_t, INT_MAX,
@@ -886,6 +911,17 @@  static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
 	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, d->disk->queue);
 	blk_queue_flag_set(QUEUE_FLAG_DISCARD, d->disk->queue);
 
+	/*
+	 * If this is for backing device registration, and it is an
+	 * zoned device (e.g. host-managed S.M.R. hard drive), set
+	 * up zoned device attribution properly for sysfs interface.
+	 */
+	if (dc && bdev_is_zoned(dc->bdev)) {
+		q->limits.zoned = bdev_zoned_model(dc->bdev);
+		q->nr_zones = blkdev_nr_zones(dc->bdev->bd_disk);
+		q->limits.chunk_sectors = bdev_zone_sectors(dc->bdev);
+	}
+
 	blk_queue_write_cache(q, true, true);
 
 	return 0;
@@ -1337,9 +1373,9 @@  static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
 		dc->partial_stripes_expensive =
 			q->limits.raid_partial_stripes_expensive;
 
-	ret = bcache_device_init(&dc->disk, block_size,
-			 dc->bdev->bd_part->nr_sects - dc->sb.data_offset,
-			 cached_dev_make_request);
+	ret = bcache_device_init(dc, &dc->disk, block_size,
+			dc->bdev->bd_part->nr_sects - dc->sb.data_offset,
+			cached_dev_make_request);
 	if (ret)
 		return ret;
 
@@ -1451,8 +1487,9 @@  static int flash_dev_run(struct cache_set *c, struct uuid_entry *u)
 
 	kobject_init(&d->kobj, &bch_flash_dev_ktype);
 
-	if (bcache_device_init(d, block_bytes(c), u->sectors,
-			flash_dev_make_request))
+	if (bcache_device_init(NULL, d, block_bytes(c),
+			       u->sectors,
+			       flash_dev_make_request))
 		goto err;
 
 	bcache_device_attach(d, c, u - c->uuids);