[RFC,v3,8/9] md: Implement ->corrupted_range()

Message ID	20201215121414.253660-9-ruansy.fnst@cn.fujitsu.com (mailing list archive)
State	New, archived
Headers	From: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
	To: <linux-kernel@vger.kernel.org>, <linux-xfs@vger.kernel.org>, <linux-nvdimm@lists.01.org>, <linux-mm@kvack.org>
	CC: linux-fsdevel@vger.kernel.org, linux-raid@vger.kernel.org, darrick.wong@oracle.com, david@fromorbit.com, hch@lst.de, song@kernel.org, rgoldwyn@suse.de, qi.fuli@fujitsu.com, y-goto@fujitsu.com
	Subject: [RFC PATCH v3 8/9] md: Implement ->corrupted_range()
	Date: Tue, 15 Dec 2020 20:14:13 +0800
	In-Reply-To: <20201215121414.253660-1-ruansy.fnst@cn.fujitsu.com>
	Archived-At: <https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/message/RDGJJXWWPF2X7IGAGQL35QQJ6BH7KR22/>
Series	fsdax: introduce fs query to support reflink
	[RFC,v3,0/9] fsdax: introduce fs query to support reflink
	[RFC,v3,1/9] pagemap: Introduce ->memory_failure()
	[RFC,v3,2/9] blk: Introduce ->corrupted_range() for block device
	[RFC,v3,3/9] fs: Introduce ->corrupted_range() for superblock
	[RFC,v3,4/9] mm, fsdax: Refactor memory-failure handler for dax mapping
	[RFC,v3,5/9] mm, pmem: Implement ->memory_failure() in pmem driver
	[RFC,v3,6/9] pmem: Implement ->corrupted_range() for pmem driver
	[RFC,v3,7/9] dm: Introduce ->rmap() to find bdev offset
	[RFC,v3,8/9] md: Implement ->corrupted_range()
	[RFC,v3,9/9] xfs: Implement ->corrupted_range() for XFS

Ruan Shiyang Dec. 15, 2020, 12:14 p.m. UTC

With the support of ->rmap(), it is possible to obtain the superblock on
a mapped device.

If a pmem device is used as one target of a mapped device, we cannot
obtain its superblock directly.  With the help of SYSFS, the mapped
device can be found from the target device.  So, we iterate over
bdev->bd_holder_disks to obtain its mapped device.

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 drivers/md/dm.c       | 66 +++++++++++++++++++++++++++++++++++++++++++
 drivers/nvdimm/pmem.c |  9 ++++--
 fs/block_dev.c        | 21 ++++++++++++++
 include/linux/genhd.h |  7 +++++
 4 files changed, 100 insertions(+), 3 deletions(-)

Darrick J. Wong Dec. 15, 2020, 8:51 p.m. UTC | #1

On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:
> With the support of ->rmap(), it is possible to obtain the superblock on
> a mapped device.
> 
> If a pmem device is used as one target of mapped device, we cannot
> obtain its superblock directly.  With the help of SYSFS, the mapped
> device can be found on the target devices.  So, we iterate the
> bdev->bd_holder_disks to obtain its mapped device.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
> ---
>  drivers/md/dm.c       | 66 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/nvdimm/pmem.c |  9 ++++--
>  fs/block_dev.c        | 21 ++++++++++++++
>  include/linux/genhd.h |  7 +++++
>  4 files changed, 100 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 4e0cbfe3f14d..9da1f9322735 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -507,6 +507,71 @@ static int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
>  #define dm_blk_report_zones		NULL
>  #endif /* CONFIG_BLK_DEV_ZONED */
>  
> +struct dm_blk_corrupt {
> +	struct block_device *bdev;
> +	sector_t offset;
> +};
> +
> +static int dm_blk_corrupt_fn(struct dm_target *ti, struct dm_dev *dev,
> +				sector_t start, sector_t len, void *data)
> +{
> +	struct dm_blk_corrupt *bc = data;
> +
> +	return bc->bdev == (void *)dev->bdev &&
> +			(start <= bc->offset && bc->offset < start + len);
> +}
> +
> +static int dm_blk_corrupted_range(struct gendisk *disk,
> +				  struct block_device *target_bdev,
> +				  loff_t target_offset, size_t len, void *data)
> +{
> +	struct mapped_device *md = disk->private_data;
> +	struct block_device *md_bdev = md->bdev;
> +	struct dm_table *map;
> +	struct dm_target *ti;
> +	struct super_block *sb;
> +	int srcu_idx, i, rc = 0;
> +	bool found = false;
> +	sector_t disk_sec, target_sec = to_sector(target_offset);
> +
> +	map = dm_get_live_table(md, &srcu_idx);
> +	if (!map)
> +		return -ENODEV;
> +
> +	for (i = 0; i < dm_table_get_num_targets(map); i++) {
> +		ti = dm_table_get_target(map, i);
> +		if (ti->type->iterate_devices && ti->type->rmap) {
> +			struct dm_blk_corrupt bc = {target_bdev, target_sec};
> +
> +			found = ti->type->iterate_devices(ti, dm_blk_corrupt_fn, &bc);
> +			if (!found)
> +				continue;
> +			disk_sec = ti->type->rmap(ti, target_sec);

What happens if the dm device has multiple reverse mappings because the
physical storage is being shared at multiple LBAs?  (e.g. a
deduplication target)

> +			break;
> +		}
> +	}
> +
> +	if (!found) {
> +		rc = -ENODEV;
> +		goto out;
> +	}
> +
> +	sb = get_super(md_bdev);
> +	if (!sb) {
> +		rc = bd_disk_holder_corrupted_range(md_bdev, to_bytes(disk_sec), len, data);
> +		goto out;
> +	} else if (sb->s_op->corrupted_range) {
> +		loff_t off = to_bytes(disk_sec - get_start_sect(md_bdev));
> +
> +		rc = sb->s_op->corrupted_range(sb, md_bdev, off, len, data);

This "call bd_disk_holder_corrupted_range or sb->s_op->corrupted_range"
logic appears twice; should it be refactored into a common helper?

Or, should the superblock dispatch part move to
bd_disk_holder_corrupted_range?

> +	}
> +	drop_super(sb);
> +
> +out:
> +	dm_put_live_table(md, srcu_idx);
> +	return rc;
> +}
> +
>  static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
>  			    struct block_device **bdev)
>  {
> @@ -3084,6 +3149,7 @@ static const struct block_device_operations dm_blk_dops = {
>  	.getgeo = dm_blk_getgeo,
>  	.report_zones = dm_blk_report_zones,
>  	.pr_ops = &dm_pr_ops,
> +	.corrupted_range = dm_blk_corrupted_range,
>  	.owner = THIS_MODULE
>  };
>  
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 4688bff19c20..e8cfaf860149 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -267,11 +267,14 @@ static int pmem_corrupted_range(struct gendisk *disk, struct block_device *bdev,
>  
>  	bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;
>  	sb = get_super(bdev);
> -	if (sb && sb->s_op->corrupted_range) {
> +	if (!sb) {
> +		rc = bd_disk_holder_corrupted_range(bdev, bdev_offset, len, data);
> +		goto out;
> +	} else if (sb->s_op->corrupted_range)
>  		rc = sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, data);
> -		drop_super(sb);

This is out of scope for this patch(set) but do you think that the scsi
disk driver should intercept media errors from sense data and call
->corrupted_range too?  ISTR Ted muttering that one of his employers had
a patchset to do more with sense data than the upstream kernel currently
does...

> -	}
> +	drop_super(sb);
>  
> +out:
>  	bdput(bdev);
>  	return rc;
>  }
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 9e84b1928b94..d3e6bddb8041 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1171,6 +1171,27 @@ struct bd_holder_disk {
>  	int			refcnt;
>  };
>  
> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off, size_t len, void *data)
> +{
> +	struct bd_holder_disk *holder;
> +	struct gendisk *disk;
> +	int rc = 0;
> +
> +	if (list_empty(&(bdev->bd_holder_disks)))
> +		return -ENODEV;
> +
> +	list_for_each_entry(holder, &bdev->bd_holder_disks, list) {
> +		disk = holder->disk;
> +		if (disk->fops->corrupted_range) {
> +			rc = disk->fops->corrupted_range(disk, bdev, off, len, data);
> +			if (rc != -ENODEV)
> +				break;
> +		}
> +	}
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(bd_disk_holder_corrupted_range);
> +
>  static struct bd_holder_disk *bd_find_holder_disk(struct block_device *bdev,
>  						  struct gendisk *disk)
>  {
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index ed06209008b8..fba247b852fa 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -382,9 +382,16 @@ int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
>  long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
>  
>  #ifdef CONFIG_SYSFS
> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
> +				   size_t len, void *data);
>  int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk);
>  void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk);
>  #else
> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
> +				   size_t len, void *data)
> +{
> +	return 0;
> +}
>  static inline int bd_link_disk_holder(struct block_device *bdev,
>  				      struct gendisk *disk)
>  {
> -- 
> 2.29.2
> 
> 
>

Dave Chinner Dec. 15, 2020, 11:28 p.m. UTC | #2

On Tue, Dec 15, 2020 at 12:51:02PM -0800, Darrick J. Wong wrote:
> On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index 4688bff19c20..e8cfaf860149 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -267,11 +267,14 @@ static int pmem_corrupted_range(struct gendisk *disk, struct block_device *bdev,
> >  
> >  	bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;
> >  	sb = get_super(bdev);
> > -	if (sb && sb->s_op->corrupted_range) {
> > +	if (!sb) {
> > +		rc = bd_disk_holder_corrupted_range(bdev, bdev_offset, len, data);
> > +		goto out;
> > +	} else if (sb->s_op->corrupted_range)
> >  		rc = sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, data);
> > -		drop_super(sb);
> 
> This is out of scope for this patch(set) but do you think that the scsi
> disk driver should intercept media errors from sense data and call
> ->corrupted_range too?  ISTR Ted muttering that one of his employers had
> a patchset to do more with sense data than the upstream kernel currently
> does...

Most definitely!

That's the whole point of layering corrupt range reporting through
the device layers like this - the corrupted range reporting is not
limited specifically to pmem devices and so generic storage failures
(e.g.  RAID failures, hardware media failures, etc) can be reported
back up to the filesystem and we can take immediate, appropriate
action, including reporting to userspace that they just lost data in
file X at offset Y...

Combine that with the proposed "watch_sb()" syscall for reporting
such errors in a generic manner to interested listeners, and we've
got a fairly solid generic path for reporting data loss events to
userspace for an appropriate user-defined action to be taken...

Cheers,

Dave.

Jane Chu Dec. 16, 2020, 5:43 a.m. UTC | #3

On 12/15/2020 4:14 AM, Shiyang Ruan wrote:
>   #ifdef CONFIG_SYSFS
> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
> +				   size_t len, void *data);
>   int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk);
>   void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk);
>   #else
> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,

Did you mean
   static inline int bd_disk_holder_corrupted_range(..
?

thanks,
-jane

> +				   size_t len, void *data)
> +{
> +	return 0;
> +}
>   static inline int bd_link_disk_holder(struct block_device *bdev,
>   				      struct gendisk *disk)

Ruan Shiyang Dec. 18, 2020, 1:50 a.m. UTC | #4

On 2020/12/16 1:43 PM, Jane Chu wrote:
> On 12/15/2020 4:14 AM, Shiyang Ruan wrote:
>>   #ifdef CONFIG_SYSFS
>> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
>> +                   size_t len, void *data);
>>   int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk);
>>   void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk);
>>   #else
>> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
> 
> Did you mean
>    static inline int bd_disk_holder_corrupted_range(..
> ?

Yes, it's my fault.  Thanks a lot.
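
For reference, a minimal sketch of the stub pattern being discussed, assuming the !CONFIG_SYSFS stub should keep returning 0 as in the posted patch.  Without `static inline`, every file that includes the header would emit its own external definition of the function and the link would likely fail with duplicate symbols.  `long long` stands in for the kernel's loff_t here only so that the fragment compiles in userspace.

```c
#include <assert.h>
#include <stddef.h>

struct block_device;	/* opaque here; the kernel defines the real type */

#ifdef CONFIG_SYSFS
/* With SYSFS, the real implementation lives in fs/block_dev.c. */
int bd_disk_holder_corrupted_range(struct block_device *bdev, long long off,
				   size_t len, void *data);
#else
/* Header-only stub: static inline avoids duplicate definitions at link time. */
static inline int bd_disk_holder_corrupted_range(struct block_device *bdev,
						 long long off, size_t len,
						 void *data)
{
	(void)bdev;
	(void)off;
	(void)len;
	(void)data;
	return 0;
}
#endif
```

Whether 0 is the right return value for the stub (as opposed to an error such as -ENODEV) is left open here; the sketch only mirrors what the patch does.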


--
Thanks,
Ruan Shiyang.

> 
> thanks,
> -jane
> 
>> +                   size_t len, void *data)
>> +{
>> +    return 0;
>> +}
>>   static inline int bd_link_disk_holder(struct block_device *bdev,
>>                         struct gendisk *disk)
> 
>

Ruan Shiyang Dec. 18, 2020, 2:11 a.m. UTC | #5

On 2020/12/16 4:51 AM, Darrick J. Wong wrote:
> On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:
>> With the support of ->rmap(), it is possible to obtain the superblock on
>> a mapped device.
>>
>> If a pmem device is used as one target of mapped device, we cannot
>> obtain its superblock directly.  With the help of SYSFS, the mapped
>> device can be found on the target devices.  So, we iterate the
>> bdev->bd_holder_disks to obtain its mapped device.
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
>> ---
>>   drivers/md/dm.c       | 66 +++++++++++++++++++++++++++++++++++++++++++
>>   drivers/nvdimm/pmem.c |  9 ++++--
>>   fs/block_dev.c        | 21 ++++++++++++++
>>   include/linux/genhd.h |  7 +++++
>>   4 files changed, 100 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index 4e0cbfe3f14d..9da1f9322735 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -507,6 +507,71 @@ static int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
>>   #define dm_blk_report_zones		NULL
>>   #endif /* CONFIG_BLK_DEV_ZONED */
>>   
>> +struct dm_blk_corrupt {
>> +	struct block_device *bdev;
>> +	sector_t offset;
>> +};
>> +
>> +static int dm_blk_corrupt_fn(struct dm_target *ti, struct dm_dev *dev,
>> +				sector_t start, sector_t len, void *data)
>> +{
>> +	struct dm_blk_corrupt *bc = data;
>> +
>> +	return bc->bdev == (void *)dev->bdev &&
>> +			(start <= bc->offset && bc->offset < start + len);
>> +}
>> +
>> +static int dm_blk_corrupted_range(struct gendisk *disk,
>> +				  struct block_device *target_bdev,
>> +				  loff_t target_offset, size_t len, void *data)
>> +{
>> +	struct mapped_device *md = disk->private_data;
>> +	struct block_device *md_bdev = md->bdev;
>> +	struct dm_table *map;
>> +	struct dm_target *ti;
>> +	struct super_block *sb;
>> +	int srcu_idx, i, rc = 0;
>> +	bool found = false;
>> +	sector_t disk_sec, target_sec = to_sector(target_offset);
>> +
>> +	map = dm_get_live_table(md, &srcu_idx);
>> +	if (!map)
>> +		return -ENODEV;
>> +
>> +	for (i = 0; i < dm_table_get_num_targets(map); i++) {
>> +		ti = dm_table_get_target(map, i);
>> +		if (ti->type->iterate_devices && ti->type->rmap) {
>> +			struct dm_blk_corrupt bc = {target_bdev, target_sec};
>> +
>> +			found = ti->type->iterate_devices(ti, dm_blk_corrupt_fn, &bc);
>> +			if (!found)
>> +				continue;
>> +			disk_sec = ti->type->rmap(ti, target_sec);
> 
> What happens if the dm device has multiple reverse mappings because the
> physical storage is being shared at multiple LBAs?  (e.g. a
> deduplication target)

I thought that the dm device knows the mapping relationship, and that
this can be handled by the implementation of ->rmap() in each target.
Did I misunderstand?

> 
>> +			break;
>> +		}
>> +	}
>> +
>> +	if (!found) {
>> +		rc = -ENODEV;
>> +		goto out;
>> +	}
>> +
>> +	sb = get_super(md_bdev);
>> +	if (!sb) {
>> +		rc = bd_disk_holder_corrupted_range(md_bdev, to_bytes(disk_sec), len, data);
>> +		goto out;
>> +	} else if (sb->s_op->corrupted_range) {
>> +		loff_t off = to_bytes(disk_sec - get_start_sect(md_bdev));
>> +
>> +		rc = sb->s_op->corrupted_range(sb, md_bdev, off, len, data);
> 
> This "call bd_disk_holder_corrupted_range or sb->s_op->corrupted_range"
> logic appears twice; should it be refactored into a common helper?
> 
> Or, should the superblock dispatch part move to
> bd_disk_holder_corrupted_range?

bd_disk_holder_corrupted_range() requires CONFIG_SYSFS.  I introduced
it to handle block devices whose superblock cannot be obtained with
`get_super()`.

Usually, if we create a filesystem directly on a pmem device, or on a
partition of it, we can use `get_super()` to get the superblock.  In
other cases, such as creating an LVM volume on a pmem device,
`get_super()` does not work.

So, I think refactoring it into a common helper looks better.
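
A rough userspace sketch of what such a common helper could look like, purely as an illustration.  All names here (`fake_sb`, `fake_bdev`, `notify_corrupted_range`, `record_notice`) are hypothetical and not part of the patch.  The model assumes a single holder per device, whereas the kernel keeps a list in bd_holder_disks, and it skips the offset translation that ->rmap() performs when climbing to the holder.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct super_block and its s_op hook. */
struct fake_sb {
	/* NULL when the filesystem provides no handler. */
	int (*corrupted_range)(struct fake_sb *sb, long long off, size_t len);
	int notified;
};

/* Hypothetical stand-in for struct block_device. */
struct fake_bdev {
	struct fake_sb *sb;		/* what get_super() would return, or NULL */
	struct fake_bdev *holder;	/* single holder; the kernel keeps a list */
};

/* Example handler: just counts how many notices the filesystem received. */
static int record_notice(struct fake_sb *sb, long long off, size_t len)
{
	(void)off;
	(void)len;
	sb->notified++;
	return 0;
}

/*
 * Common dispatch: use the superblock if this device has one, otherwise
 * climb to the holder and try again.  Returns -ENODEV when no layer in
 * the stack has a superblock.
 */
static int notify_corrupted_range(struct fake_bdev *bdev, long long off,
				  size_t len)
{
	for (; bdev; bdev = bdev->holder) {
		struct fake_sb *sb = bdev->sb;

		if (!sb)
			continue;	/* no filesystem here, try the holder */
		if (!sb->corrupted_range)
			return 0;	/* mirrors the patch: rc stays 0 */
		return sb->corrupted_range(sb, off, len);
	}
	return -ENODEV;
}
```

With a helper of this shape, both the pmem driver and dm could call one function instead of repeating the "superblock or holder" branch.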


--
Thanks,
Ruan Shiyang.

> 
>> +	}
>> +	drop_super(sb);
>> +
>> +out:
>> +	dm_put_live_table(md, srcu_idx);
>> +	return rc;
>> +}
>> +
>>   static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
>>   			    struct block_device **bdev)
>>   {
>> @@ -3084,6 +3149,7 @@ static const struct block_device_operations dm_blk_dops = {
>>   	.getgeo = dm_blk_getgeo,
>>   	.report_zones = dm_blk_report_zones,
>>   	.pr_ops = &dm_pr_ops,
>> +	.corrupted_range = dm_blk_corrupted_range,
>>   	.owner = THIS_MODULE
>>   };
>>   
>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> index 4688bff19c20..e8cfaf860149 100644
>> --- a/drivers/nvdimm/pmem.c
>> +++ b/drivers/nvdimm/pmem.c
>> @@ -267,11 +267,14 @@ static int pmem_corrupted_range(struct gendisk *disk, struct block_device *bdev,
>>   
>>   	bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;
>>   	sb = get_super(bdev);
>> -	if (sb && sb->s_op->corrupted_range) {
>> +	if (!sb) {
>> +		rc = bd_disk_holder_corrupted_range(bdev, bdev_offset, len, data);
>> +		goto out;
>> +	} else if (sb->s_op->corrupted_range)
>>   		rc = sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, data);
>> -		drop_super(sb);
> 
> This is out of scope for this patch(set) but do you think that the scsi
> disk driver should intercept media errors from sense data and call
> ->corrupted_range too?  ISTR Ted muttering that one of his employers had
> a patchset to do more with sense data than the upstream kernel currently
> does...
> 
>> -	}
>> +	drop_super(sb);
>>   
>> +out:
>>   	bdput(bdev);
>>   	return rc;
>>   }
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 9e84b1928b94..d3e6bddb8041 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -1171,6 +1171,27 @@ struct bd_holder_disk {
>>   	int			refcnt;
>>   };
>>   
>> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off, size_t len, void *data)
>> +{
>> +	struct bd_holder_disk *holder;
>> +	struct gendisk *disk;
>> +	int rc = 0;
>> +
>> +	if (list_empty(&(bdev->bd_holder_disks)))
>> +		return -ENODEV;
>> +
>> +	list_for_each_entry(holder, &bdev->bd_holder_disks, list) {
>> +		disk = holder->disk;
>> +		if (disk->fops->corrupted_range) {
>> +			rc = disk->fops->corrupted_range(disk, bdev, off, len, data);
>> +			if (rc != -ENODEV)
>> +				break;
>> +		}
>> +	}
>> +	return rc;
>> +}
>> +EXPORT_SYMBOL_GPL(bd_disk_holder_corrupted_range);
>> +
>>   static struct bd_holder_disk *bd_find_holder_disk(struct block_device *bdev,
>>   						  struct gendisk *disk)
>>   {
>> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
>> index ed06209008b8..fba247b852fa 100644
>> --- a/include/linux/genhd.h
>> +++ b/include/linux/genhd.h
>> @@ -382,9 +382,16 @@ int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
>>   long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
>>   
>>   #ifdef CONFIG_SYSFS
>> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
>> +				   size_t len, void *data);
>>   int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk);
>>   void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk);
>>   #else
>> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
>> +				   size_t len, void *data)
>> +{
>> +	return 0;
>> +}
>>   static inline int bd_link_disk_holder(struct block_device *bdev,
>>   				      struct gendisk *disk)
>>   {
>> -- 
>> 2.29.2
>>
>>
>>
> 
>

Darrick J. Wong Jan. 4, 2021, 11:34 p.m. UTC | #6

On Fri, Dec 18, 2020 at 10:11:54AM +0800, Ruan Shiyang wrote:
> 
> 
> On 2020/12/16 4:51 AM, Darrick J. Wong wrote:
> > On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:
> > > With the support of ->rmap(), it is possible to obtain the superblock on
> > > a mapped device.
> > > 
> > > If a pmem device is used as one target of mapped device, we cannot
> > > obtain its superblock directly.  With the help of SYSFS, the mapped
> > > device can be found on the target devices.  So, we iterate the
> > > bdev->bd_holder_disks to obtain its mapped device.
> > > 
> > > Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
> > > ---
> > >   drivers/md/dm.c       | 66 +++++++++++++++++++++++++++++++++++++++++++
> > >   drivers/nvdimm/pmem.c |  9 ++++--
> > >   fs/block_dev.c        | 21 ++++++++++++++
> > >   include/linux/genhd.h |  7 +++++
> > >   4 files changed, 100 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > index 4e0cbfe3f14d..9da1f9322735 100644
> > > --- a/drivers/md/dm.c
> > > +++ b/drivers/md/dm.c
> > > @@ -507,6 +507,71 @@ static int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
> > >   #define dm_blk_report_zones		NULL
> > >   #endif /* CONFIG_BLK_DEV_ZONED */
> > > +struct dm_blk_corrupt {
> > > +	struct block_device *bdev;
> > > +	sector_t offset;
> > > +};
> > > +
> > > +static int dm_blk_corrupt_fn(struct dm_target *ti, struct dm_dev *dev,
> > > +				sector_t start, sector_t len, void *data)
> > > +{
> > > +	struct dm_blk_corrupt *bc = data;
> > > +
> > > +	return bc->bdev == (void *)dev->bdev &&
> > > +			(start <= bc->offset && bc->offset < start + len);
> > > +}
> > > +
> > > +static int dm_blk_corrupted_range(struct gendisk *disk,
> > > +				  struct block_device *target_bdev,
> > > +				  loff_t target_offset, size_t len, void *data)
> > > +{
> > > +	struct mapped_device *md = disk->private_data;
> > > +	struct block_device *md_bdev = md->bdev;
> > > +	struct dm_table *map;
> > > +	struct dm_target *ti;
> > > +	struct super_block *sb;
> > > +	int srcu_idx, i, rc = 0;
> > > +	bool found = false;
> > > +	sector_t disk_sec, target_sec = to_sector(target_offset);
> > > +
> > > +	map = dm_get_live_table(md, &srcu_idx);
> > > +	if (!map)
> > > +		return -ENODEV;
> > > +
> > > +	for (i = 0; i < dm_table_get_num_targets(map); i++) {
> > > +		ti = dm_table_get_target(map, i);
> > > +		if (ti->type->iterate_devices && ti->type->rmap) {
> > > +			struct dm_blk_corrupt bc = {target_bdev, target_sec};
> > > +
> > > +			found = ti->type->iterate_devices(ti, dm_blk_corrupt_fn, &bc);
> > > +			if (!found)
> > > +				continue;
> > > +			disk_sec = ti->type->rmap(ti, target_sec);
> > 
> > What happens if the dm device has multiple reverse mappings because the
> > physical storage is being shared at multiple LBAs?  (e.g. a
> > deduplication target)
> 
> I thought that the dm device knows the mapping relationship, and it can be
> done by implementation of ->rmap() in each target.  Did I understand it
> wrong?

The dm device /does/ know the mapping relationship.  I'm asking what
happens if there are *multiple* mappings.  For example, a deduplicating
dm device could observe that the upper level code wrote some data to
sector 200 and now it wants to write the same data to sector 500.
Instead of writing twice, it simply maps sector 500 in its LBA space to
the same space that it mapped sector 200.

Pretend that sector 200 on the dm-dedupe device maps to sector 64 on the
underlying storage (call it /dev/pmem1 and let's say it's the only
target sitting underneath the dm-dedupe device).

If /dev/pmem1 then notices that sector 64 has gone bad, it will start
calling ->corrupted_range handlers until it calls dm_blk_corrupted_range
on the dm-dedupe device.  At least in theory, the dm-dedupe driver's
rmap method ought to return both (64 -> 200) and (64 -> 500) so that
dm_blk_corrupted_range can pass on both corruption notices to whatever's
sitting atop the dedupe device.

At the moment, your ->rmap prototype is only capable of returning one
sector_t mapping per target, and there's only the one target under the
dedupe device, so we cannot report the loss of sectors 200 and 500 to
whatever device is sitting on top of dm-dedupe.
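
To make the sector 64 -> {200, 500} case concrete, here is a small userspace sketch of a reverse-mapping interface that can report more than one result by invoking a callback per match.  This is only one possible shape, under the assumption that a callback is acceptable; `dedupe_map_entry`, `rmap_report_fn`, `dedupe_rmap_all`, `sector_list` and `collect_sector` are hypothetical names, and the flat table stands in for whatever structure a real dedupe target would keep.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;	/* userspace stand-in for the kernel type */

/* One forward mapping: logical sector on the dedupe device -> physical. */
struct dedupe_map_entry {
	sector_t logical;
	sector_t physical;
};

/* Called once for every logical sector that maps to the bad sector. */
typedef int (*rmap_report_fn)(sector_t logical, void *data);

/*
 * Report every logical sector backed by @physical.  Returns the number
 * of mappings reported, or the first negative error from @report.
 */
static int dedupe_rmap_all(const struct dedupe_map_entry *map, size_t n,
			   sector_t physical, rmap_report_fn report,
			   void *data)
{
	int hits = 0;

	for (size_t i = 0; i < n; i++) {
		int rc;

		if (map[i].physical != physical)
			continue;
		rc = report(map[i].logical, data);
		if (rc < 0)
			return rc;
		hits++;
	}
	return hits;
}

/* Example consumer: collect the affected logical sectors in an array. */
struct sector_list {
	sector_t sectors[8];
	size_t count;
};

static int collect_sector(sector_t logical, void *data)
{
	struct sector_list *list = data;

	if (list->count >= sizeof(list->sectors) / sizeof(list->sectors[0]))
		return -ENOSPC;
	list->sectors[list->count++] = logical;
	return 0;
}
```

In dm_blk_corrupted_range the callback would presumably forward each reported sector to the layer above, rather than collecting into an array as done here.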

--D

> > 
> > > +			break;
> > > +		}
> > > +	}
> > > +
> > > +	if (!found) {
> > > +		rc = -ENODEV;
> > > +		goto out;
> > > +	}
> > > +
> > > +	sb = get_super(md_bdev);
> > > +	if (!sb) {
> > > +		rc = bd_disk_holder_corrupted_range(md_bdev, to_bytes(disk_sec), len, data);
> > > +		goto out;
> > > +	} else if (sb->s_op->corrupted_range) {
> > > +		loff_t off = to_bytes(disk_sec - get_start_sect(md_bdev));
> > > +
> > > +		rc = sb->s_op->corrupted_range(sb, md_bdev, off, len, data);
> > 
> > This "call bd_disk_holder_corrupted_range or sb->s_op->corrupted_range"
> > logic appears twice; should it be refactored into a common helper?
> > 
> > Or, should the superblock dispatch part move to
> > bd_disk_holder_corrupted_range?
> 
> bd_disk_holder_corrupted_range() requires SYSFS configuration.  I introduce
> it to handle those block devices that can not obtain superblock by
> `get_super()`.
> 
> Usually, if we create filesystem directly on a pmem device, or make some
> partitions at first, we can use `get_super()` to get the superblock.  In
> other case, such as creating a LVM on pmem device, `get_super()` does not
> work.
> 
> So, I think refactoring it into a common helper looks better.
> 
> 
> --
> Thanks,
> Ruan Shiyang.
> 
> > 
> > > +	}
> > > +	drop_super(sb);
> > > +
> > > +out:
> > > +	dm_put_live_table(md, srcu_idx);
> > > +	return rc;
> > > +}
> > > +
> > >   static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
> > >   			    struct block_device **bdev)
> > >   {
> > > @@ -3084,6 +3149,7 @@ static const struct block_device_operations dm_blk_dops = {
> > >   	.getgeo = dm_blk_getgeo,
> > >   	.report_zones = dm_blk_report_zones,
> > >   	.pr_ops = &dm_pr_ops,
> > > +	.corrupted_range = dm_blk_corrupted_range,
> > >   	.owner = THIS_MODULE
> > >   };
> > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > index 4688bff19c20..e8cfaf860149 100644
> > > --- a/drivers/nvdimm/pmem.c
> > > +++ b/drivers/nvdimm/pmem.c
> > > @@ -267,11 +267,14 @@ static int pmem_corrupted_range(struct gendisk *disk, struct block_device *bdev,
> > >   	bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;
> > >   	sb = get_super(bdev);
> > > -	if (sb && sb->s_op->corrupted_range) {
> > > +	if (!sb) {
> > > +		rc = bd_disk_holder_corrupted_range(bdev, bdev_offset, len, data);
> > > +		goto out;
> > > +	} else if (sb->s_op->corrupted_range)
> > >   		rc = sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, data);
> > > -		drop_super(sb);
> > 
> > This is out of scope for this patch(set) but do you think that the scsi
> > disk driver should intercept media errors from sense data and call
> > ->corrupted_range too?  ISTR Ted muttering that one of his employers had
> > a patchset to do more with sense data than the upstream kernel currently
> > does...
> > 
> > > -	}
> > > +	drop_super(sb);
> > > +out:
> > >   	bdput(bdev);
> > >   	return rc;
> > >   }
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 9e84b1928b94..d3e6bddb8041 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -1171,6 +1171,27 @@ struct bd_holder_disk {
> > >   	int			refcnt;
> > >   };
> > > +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off, size_t len, void *data)
> > > +{
> > > +	struct bd_holder_disk *holder;
> > > +	struct gendisk *disk;
> > > +	int rc = 0;
> > > +
> > > +	if (list_empty(&(bdev->bd_holder_disks)))
> > > +		return -ENODEV;
> > > +
> > > +	list_for_each_entry(holder, &bdev->bd_holder_disks, list) {
> > > +		disk = holder->disk;
> > > +		if (disk->fops->corrupted_range) {
> > > +			rc = disk->fops->corrupted_range(disk, bdev, off, len, data);
> > > +			if (rc != -ENODEV)
> > > +				break;
> > > +		}
> > > +	}
> > > +	return rc;
> > > +}
> > > +EXPORT_SYMBOL_GPL(bd_disk_holder_corrupted_range);
> > > +
> > >   static struct bd_holder_disk *bd_find_holder_disk(struct block_device *bdev,
> > >   						  struct gendisk *disk)
> > >   {
> > > diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> > > index ed06209008b8..fba247b852fa 100644
> > > --- a/include/linux/genhd.h
> > > +++ b/include/linux/genhd.h
> > > @@ -382,9 +382,16 @@ int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
> > >   long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
> > >   #ifdef CONFIG_SYSFS
> > > +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
> > > +				   size_t len, void *data);
> > >   int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk);
> > >   void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk);
> > >   #else
> > > +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
> > > +				   size_t len, void *data)
> > > +{
> > > +	return 0;
> > > +}
> > >   static inline int bd_link_disk_holder(struct block_device *bdev,
> > >   				      struct gendisk *disk)
> > >   {
> > > -- 
> > > 2.29.2
> > > 
> > > 
> > > 
> > 
> > 
> 
>

Ruan Shiyang Jan. 8, 2021, 9:52 a.m. UTC | #7

On 2021/1/5 7:34 AM, Darrick J. Wong wrote:
> On Fri, Dec 18, 2020 at 10:11:54AM +0800, Ruan Shiyang wrote:
>>
>>
>> On 2020/12/16 4:51 AM, Darrick J. Wong wrote:
>>> On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:
>>>> With the support of ->rmap(), it is possible to obtain the superblock on
>>>> a mapped device.
>>>>
>>>> If a pmem device is used as one target of mapped device, we cannot
>>>> obtain its superblock directly.  With the help of SYSFS, the mapped
>>>> device can be found on the target devices.  So, we iterate the
>>>> bdev->bd_holder_disks to obtain its mapped device.
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
>>>> ---
>>>>    drivers/md/dm.c       | 66 +++++++++++++++++++++++++++++++++++++++++++
>>>>    drivers/nvdimm/pmem.c |  9 ++++--
>>>>    fs/block_dev.c        | 21 ++++++++++++++
>>>>    include/linux/genhd.h |  7 +++++
>>>>    4 files changed, 100 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>>> index 4e0cbfe3f14d..9da1f9322735 100644
>>>> --- a/drivers/md/dm.c
>>>> +++ b/drivers/md/dm.c
>>>> @@ -507,6 +507,71 @@ static int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
>>>>    #define dm_blk_report_zones		NULL
>>>>    #endif /* CONFIG_BLK_DEV_ZONED */
>>>> +struct dm_blk_corrupt {
>>>> +	struct block_device *bdev;
>>>> +	sector_t offset;
>>>> +};
>>>> +
>>>> +static int dm_blk_corrupt_fn(struct dm_target *ti, struct dm_dev *dev,
>>>> +				sector_t start, sector_t len, void *data)
>>>> +{
>>>> +	struct dm_blk_corrupt *bc = data;
>>>> +
>>>> +	return bc->bdev == (void *)dev->bdev &&
>>>> +			(start <= bc->offset && bc->offset < start + len);
>>>> +}
>>>> +
>>>> +static int dm_blk_corrupted_range(struct gendisk *disk,
>>>> +				  struct block_device *target_bdev,
>>>> +				  loff_t target_offset, size_t len, void *data)
>>>> +{
>>>> +	struct mapped_device *md = disk->private_data;
>>>> +	struct block_device *md_bdev = md->bdev;
>>>> +	struct dm_table *map;
>>>> +	struct dm_target *ti;
>>>> +	struct super_block *sb;
>>>> +	int srcu_idx, i, rc = 0;
>>>> +	bool found = false;
>>>> +	sector_t disk_sec, target_sec = to_sector(target_offset);
>>>> +
>>>> +	map = dm_get_live_table(md, &srcu_idx);
>>>> +	if (!map)
>>>> +		return -ENODEV;
>>>> +
>>>> +	for (i = 0; i < dm_table_get_num_targets(map); i++) {
>>>> +		ti = dm_table_get_target(map, i);
>>>> +		if (ti->type->iterate_devices && ti->type->rmap) {
>>>> +			struct dm_blk_corrupt bc = {target_bdev, target_sec};
>>>> +
>>>> +			found = ti->type->iterate_devices(ti, dm_blk_corrupt_fn, &bc);
>>>> +			if (!found)
>>>> +				continue;
>>>> +			disk_sec = ti->type->rmap(ti, target_sec);
>>>
>>> What happens if the dm device has multiple reverse mappings because the
>>> physical storage is being shared at multiple LBAs?  (e.g. a
>>> deduplication target)
>>
>> I thought that the dm device knows the mapping relationship, and it can be
>> done by implementation of ->rmap() in each target.  Did I understand it
>> wrong?
> 
> The dm device /does/ know the mapping relationship.  I'm asking what
> happens if there are *multiple* mappings.  For example, a deduplicating
> dm device could observe that the upper level code wrote some data to
> sector 200 and now it wants to write the same data to sector 500.
> Instead of writing twice, it simply maps sector 500 in its LBA space to
> the same space that it mapped sector 200.
> 
> Pretend that sector 200 on the dm-dedupe device maps to sector 64 on the
> underlying storage (call it /dev/pmem1 and let's say it's the only
> target sitting underneath the dm-dedupe device).
> 
> If /dev/pmem1 then notices that sector 64 has gone bad, it will start
> calling ->corrupted_range handlers until it calls dm_blk_corrupted_range
> on the dm-dedupe device.  At least in theory, the dm-dedupe driver's
> rmap method ought to return both (64 -> 200) and (64 -> 500) so that
> dm_blk_corrupted_range can pass on both corruption notices to whatever's
> sitting atop the dedupe device.
> 
> At the moment, your ->rmap prototype is only capable of returning one
> sector_t mapping per target, and there's only the one target under the
> dedupe device, so we cannot report the loss of sectors 200 and 500 to
> whatever device is sitting on top of dm-dedupe.

Got it.  I didn't know there was a kind of dm device called dm-dedupe.
Thanks for the guidance.


--
Thanks,
Ruan Shiyang.

> 
> --D
> 
>>>
>>>> +			break;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	if (!found) {
>>>> +		rc = -ENODEV;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	sb = get_super(md_bdev);
>>>> +	if (!sb) {
>>>> +		rc = bd_disk_holder_corrupted_range(md_bdev, to_bytes(disk_sec), len, data);
>>>> +		goto out;
>>>> +	} else if (sb->s_op->corrupted_range) {
>>>> +		loff_t off = to_bytes(disk_sec - get_start_sect(md_bdev));
>>>> +
>>>> +		rc = sb->s_op->corrupted_range(sb, md_bdev, off, len, data);
>>>
>>> This "call bd_disk_holder_corrupted_range or sb->s_op->corrupted_range"
>>> logic appears twice; should it be refactored into a common helper?
>>>
>>> Or, should the superblock dispatch part move to
>>> bd_disk_holder_corrupted_range?
>>
>> bd_disk_holder_corrupted_range() requires SYSFS configuration.  I introduced
>> it to handle those block devices whose superblock cannot be obtained by
>> `get_super()`.
>>
>> Usually, if we create a filesystem directly on a pmem device, or make some
>> partitions first, we can use `get_super()` to get the superblock.  In
>> other cases, such as creating an LVM volume on a pmem device, `get_super()`
>> does not work.
>>
>> So, I think refactoring it into a common helper looks better.
>>
>>
>> --
>> Thanks,
>> Ruan Shiyang.
>>
>>>
>>>> +	}
>>>> +	drop_super(sb);
>>>> +
>>>> +out:
>>>> +	dm_put_live_table(md, srcu_idx);
>>>> +	return rc;
>>>> +}
>>>> +
>>>>    static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
>>>>    			    struct block_device **bdev)
>>>>    {
>>>> @@ -3084,6 +3149,7 @@ static const struct block_device_operations dm_blk_dops = {
>>>>    	.getgeo = dm_blk_getgeo,
>>>>    	.report_zones = dm_blk_report_zones,
>>>>    	.pr_ops = &dm_pr_ops,
>>>> +	.corrupted_range = dm_blk_corrupted_range,
>>>>    	.owner = THIS_MODULE
>>>>    };
>>>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>>>> index 4688bff19c20..e8cfaf860149 100644
>>>> --- a/drivers/nvdimm/pmem.c
>>>> +++ b/drivers/nvdimm/pmem.c
>>>> @@ -267,11 +267,14 @@ static int pmem_corrupted_range(struct gendisk *disk, struct block_device *bdev,
>>>>    	bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;
>>>>    	sb = get_super(bdev);
>>>> -	if (sb && sb->s_op->corrupted_range) {
>>>> +	if (!sb) {
>>>> +		rc = bd_disk_holder_corrupted_range(bdev, bdev_offset, len, data);
>>>> +		goto out;
>>>> +	} else if (sb->s_op->corrupted_range)
>>>>    		rc = sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, data);
>>>> -		drop_super(sb);
>>>
>>> This is out of scope for this patch(set) but do you think that the scsi
>>> disk driver should intercept media errors from sense data and call
>>> ->corrupted_range too?  ISTR Ted muttering that one of his employers had
>>> a patchset to do more with sense data than the upstream kernel currently
>>> does...
>>>
>>>> -	}
>>>> +	drop_super(sb);
>>>> +out:
>>>>    	bdput(bdev);
>>>>    	return rc;
>>>>    }
>>>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>>>> index 9e84b1928b94..d3e6bddb8041 100644
>>>> --- a/fs/block_dev.c
>>>> +++ b/fs/block_dev.c
>>>> @@ -1171,6 +1171,27 @@ struct bd_holder_disk {
>>>>    	int			refcnt;
>>>>    };
>>>> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off, size_t len, void *data)
>>>> +{
>>>> +	struct bd_holder_disk *holder;
>>>> +	struct gendisk *disk;
>>>> +	int rc = 0;
>>>> +
>>>> +	if (list_empty(&(bdev->bd_holder_disks)))
>>>> +		return -ENODEV;
>>>> +
>>>> +	list_for_each_entry(holder, &bdev->bd_holder_disks, list) {
>>>> +		disk = holder->disk;
>>>> +		if (disk->fops->corrupted_range) {
>>>> +			rc = disk->fops->corrupted_range(disk, bdev, off, len, data);
>>>> +			if (rc != -ENODEV)
>>>> +				break;
>>>> +		}
>>>> +	}
>>>> +	return rc;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(bd_disk_holder_corrupted_range);
>>>> +
>>>>    static struct bd_holder_disk *bd_find_holder_disk(struct block_device *bdev,
>>>>    						  struct gendisk *disk)
>>>>    {
>>>> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
>>>> index ed06209008b8..fba247b852fa 100644
>>>> --- a/include/linux/genhd.h
>>>> +++ b/include/linux/genhd.h
>>>> @@ -382,9 +382,16 @@ int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
>>>>    long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
>>>>    #ifdef CONFIG_SYSFS
>>>> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
>>>> +				   size_t len, void *data);
>>>>    int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk);
>>>>    void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk);
>>>>    #else
>>>> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
>>>> +				   size_t len, void *data)
>>>> +{
>>>> +	return 0;
>>>> +}
>>>>    static inline int bd_link_disk_holder(struct block_device *bdev,
>>>>    				      struct gendisk *disk)
>>>>    {
>>>> -- 
>>>> 2.29.2
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
> 
>

Darrick J. Wong Jan. 8, 2021, 7:05 p.m. UTC | #8

On Fri, Jan 08, 2021 at 05:52:11PM +0800, Ruan Shiyang wrote:
> 
> 
> On 2021/1/5 7:34 AM, Darrick J. Wong wrote:
> > On Fri, Dec 18, 2020 at 10:11:54AM +0800, Ruan Shiyang wrote:
> > > 
> > > 
> > > On 2020/12/16 4:51 AM, Darrick J. Wong wrote:
> > > > On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:
> > > > > With the support of ->rmap(), it is possible to obtain the superblock on
> > > > > a mapped device.
> > > > > 
> > > > > If a pmem device is used as one target of mapped device, we cannot
> > > > > obtain its superblock directly.  With the help of SYSFS, the mapped
> > > > > device can be found on the target devices.  So, we iterate the
> > > > > bdev->bd_holder_disks to obtain its mapped device.
> > > > > 
> > > > > Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
> > > > > ---
> > > > >    drivers/md/dm.c       | 66 +++++++++++++++++++++++++++++++++++++++++++
> > > > >    drivers/nvdimm/pmem.c |  9 ++++--
> > > > >    fs/block_dev.c        | 21 ++++++++++++++
> > > > >    include/linux/genhd.h |  7 +++++
> > > > >    4 files changed, 100 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > > index 4e0cbfe3f14d..9da1f9322735 100644
> > > > > --- a/drivers/md/dm.c
> > > > > +++ b/drivers/md/dm.c
> > > > > @@ -507,6 +507,71 @@ static int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
> > > > >    #define dm_blk_report_zones		NULL
> > > > >    #endif /* CONFIG_BLK_DEV_ZONED */
> > > > > +struct dm_blk_corrupt {
> > > > > +	struct block_device *bdev;
> > > > > +	sector_t offset;
> > > > > +};
> > > > > +
> > > > > +static int dm_blk_corrupt_fn(struct dm_target *ti, struct dm_dev *dev,
> > > > > +				sector_t start, sector_t len, void *data)
> > > > > +{
> > > > > +	struct dm_blk_corrupt *bc = data;
> > > > > +
> > > > > +	return bc->bdev == (void *)dev->bdev &&
> > > > > +			(start <= bc->offset && bc->offset < start + len);
> > > > > +}
> > > > > +
> > > > > +static int dm_blk_corrupted_range(struct gendisk *disk,
> > > > > +				  struct block_device *target_bdev,
> > > > > +				  loff_t target_offset, size_t len, void *data)
> > > > > +{
> > > > > +	struct mapped_device *md = disk->private_data;
> > > > > +	struct block_device *md_bdev = md->bdev;
> > > > > +	struct dm_table *map;
> > > > > +	struct dm_target *ti;
> > > > > +	struct super_block *sb;
> > > > > +	int srcu_idx, i, rc = 0;
> > > > > +	bool found = false;
> > > > > +	sector_t disk_sec, target_sec = to_sector(target_offset);
> > > > > +
> > > > > +	map = dm_get_live_table(md, &srcu_idx);
> > > > > +	if (!map)
> > > > > +		return -ENODEV;
> > > > > +
> > > > > +	for (i = 0; i < dm_table_get_num_targets(map); i++) {
> > > > > +		ti = dm_table_get_target(map, i);
> > > > > +		if (ti->type->iterate_devices && ti->type->rmap) {
> > > > > +			struct dm_blk_corrupt bc = {target_bdev, target_sec};
> > > > > +
> > > > > +			found = ti->type->iterate_devices(ti, dm_blk_corrupt_fn, &bc);
> > > > > +			if (!found)
> > > > > +				continue;
> > > > > +			disk_sec = ti->type->rmap(ti, target_sec);
> > > > 
> > > > What happens if the dm device has multiple reverse mappings because the
> > > > physical storage is being shared at multiple LBAs?  (e.g. a
> > > > deduplication target)
> > > 
> > > I thought that the dm device knows the mapping relationship, and it can be
> > > done by implementation of ->rmap() in each target.  Did I understand it
> > > wrong?
> > 
> > The dm device /does/ know the mapping relationship.  I'm asking what
> > happens if there are *multiple* mappings.  For example, a deduplicating
> > dm device could observe that the upper level code wrote some data to
> > sector 200 and now it wants to write the same data to sector 500.
> > Instead of writing twice, it simply maps sector 500 in its LBA space to
> > the same space that it mapped sector 200.
> > 
> > Pretend that sector 200 on the dm-dedupe device maps to sector 64 on the
> > underlying storage (call it /dev/pmem1 and let's say it's the only
> > target sitting underneath the dm-dedupe device).
> > 
> > If /dev/pmem1 then notices that sector 64 has gone bad, it will start
> > calling ->corrupted_range handlers until it calls dm_blk_corrupted_range
> > on the dm-dedupe device.  At least in theory, the dm-dedupe driver's
> > rmap method ought to return both (64 -> 200) and (64 -> 500) so that
> > dm_blk_corrupted_range can pass on both corruption notices to whatever's
> > sitting atop the dedupe device.
> > 
> > At the moment, your ->rmap prototype is only capable of returning one
> > sector_t mapping per target, and there's only the one target under the
> > dedupe device, so we cannot report the loss of sectors 200 and 500 to
> > whatever device is sitting on top of dm-dedupe.
> 
> Got it.  I didn't know there was a kind of dm device called dm-dedupe. Thanks
> for the guidance.

There isn't one upstream, but there are out-of-tree deduplication
drivers (VDO), and in principle any dm target can have multiple forward
mappings to a single block on the lower device.
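
To make the multi-mapping case concrete, here is a rough user-space
sketch (not kernel code) of an rmap that reports every upper sector
backed by a bad lower sector through a callback, instead of returning a
single sector_t.  The names rmap_fn, dedupe_rmap_all, record_lost and the
toy mapping table are all hypothetical and exist only for illustration;
they are not part of this patchset or of any dm target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Hypothetical callback: called once per upper sector that maps to the bad lower sector. */
typedef int (*rmap_fn)(sector_t upper_sec, void *data);

/* Toy forward map of a dedupe-like target: upper sector -> lower sector. */
struct fwd_map {
	sector_t upper;
	sector_t lower;
};

static const struct fwd_map dedupe_map[] = {
	{ 200, 64 },
	{ 500, 64 },	/* deduplicated: shares lower sector 64 with upper 200 */
	{ 300, 72 },
};

/*
 * Walk all forward mappings and call fn for each upper sector backed by
 * lower_sec.  Returns the number of mappings reported, or the first
 * nonzero value returned by the callback.
 */
static int dedupe_rmap_all(sector_t lower_sec, rmap_fn fn, void *data)
{
	size_t i;
	int n = 0;

	assert(fn != NULL);
	for (i = 0; i < sizeof(dedupe_map) / sizeof(dedupe_map[0]); i++) {
		int rc;

		if (dedupe_map[i].lower != lower_sec)
			continue;
		rc = fn(dedupe_map[i].upper, data);
		if (rc)
			return rc;
		n++;
	}
	return n;
}

/* Example consumer: collect the lost upper sectors so each can be reported upward. */
struct lost {
	sector_t sec[8];
	size_t n;
};

static int record_lost(sector_t upper_sec, void *data)
{
	struct lost *l = data;

	if (l->n >= sizeof(l->sec) / sizeof(l->sec[0]))
		return -1;	/* no room left in the toy buffer */
	l->sec[l->n++] = upper_sec;
	return 0;
}
```

With this shape, a failure of lower sector 64 yields two callbacks (for
upper sectors 200 and 500), so both corruption notices could be passed
on to whatever sits on top of the mapped device.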

--D

> 
> --
> Thanks,
> Ruan Shiyang.
> 
> > 
> > --D
> > 
> > > > 
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	if (!found) {
> > > > > +		rc = -ENODEV;
> > > > > +		goto out;
> > > > > +	}
> > > > > +
> > > > > +	sb = get_super(md_bdev);
> > > > > +	if (!sb) {
> > > > > +		rc = bd_disk_holder_corrupted_range(md_bdev, to_bytes(disk_sec), len, data);
> > > > > +		goto out;
> > > > > +	} else if (sb->s_op->corrupted_range) {
> > > > > +		loff_t off = to_bytes(disk_sec - get_start_sect(md_bdev));
> > > > > +
> > > > > +		rc = sb->s_op->corrupted_range(sb, md_bdev, off, len, data);
> > > > 
> > > > This "call bd_disk_holder_corrupted_range or sb->s_op->corrupted_range"
> > > > logic appears twice; should it be refactored into a common helper?
> > > > 
> > > > Or, should the superblock dispatch part move to
> > > > bd_disk_holder_corrupted_range?
> > > 
> > > bd_disk_holder_corrupted_range() requires SYSFS configuration.  I introduced
> > > it to handle those block devices whose superblock cannot be obtained by
> > > `get_super()`.
> > > 
> > > Usually, if we create a filesystem directly on a pmem device, or make some
> > > partitions first, we can use `get_super()` to get the superblock.  In
> > > other cases, such as creating an LVM volume on a pmem device, `get_super()`
> > > does not work.
> > > 
> > > So, I think refactoring it into a common helper looks better.
> > > 
> > > 
> > > --
> > > Thanks,
> > > Ruan Shiyang.
> > > 
> > > > 
> > > > > +	}
> > > > > +	drop_super(sb);
> > > > > +
> > > > > +out:
> > > > > +	dm_put_live_table(md, srcu_idx);
> > > > > +	return rc;
> > > > > +}
> > > > > +
> > > > >    static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
> > > > >    			    struct block_device **bdev)
> > > > >    {
> > > > > @@ -3084,6 +3149,7 @@ static const struct block_device_operations dm_blk_dops = {
> > > > >    	.getgeo = dm_blk_getgeo,
> > > > >    	.report_zones = dm_blk_report_zones,
> > > > >    	.pr_ops = &dm_pr_ops,
> > > > > +	.corrupted_range = dm_blk_corrupted_range,
> > > > >    	.owner = THIS_MODULE
> > > > >    };
> > > > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > > > index 4688bff19c20..e8cfaf860149 100644
> > > > > --- a/drivers/nvdimm/pmem.c
> > > > > +++ b/drivers/nvdimm/pmem.c
> > > > > @@ -267,11 +267,14 @@ static int pmem_corrupted_range(struct gendisk *disk, struct block_device *bdev,
> > > > >    	bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;
> > > > >    	sb = get_super(bdev);
> > > > > -	if (sb && sb->s_op->corrupted_range) {
> > > > > +	if (!sb) {
> > > > > +		rc = bd_disk_holder_corrupted_range(bdev, bdev_offset, len, data);
> > > > > +		goto out;
> > > > > +	} else if (sb->s_op->corrupted_range)
> > > > >    		rc = sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, data);
> > > > > -		drop_super(sb);
> > > > 
> > > > This is out of scope for this patch(set) but do you think that the scsi
> > > > disk driver should intercept media errors from sense data and call
> > > > ->corrupted_range too?  ISTR Ted muttering that one of his employers had
> > > > a patchset to do more with sense data than the upstream kernel currently
> > > > does...
> > > > 
> > > > > -	}
> > > > > +	drop_super(sb);
> > > > > +out:
> > > > >    	bdput(bdev);
> > > > >    	return rc;
> > > > >    }
> > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > > index 9e84b1928b94..d3e6bddb8041 100644
> > > > > --- a/fs/block_dev.c
> > > > > +++ b/fs/block_dev.c
> > > > > @@ -1171,6 +1171,27 @@ struct bd_holder_disk {
> > > > >    	int			refcnt;
> > > > >    };
> > > > > +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off, size_t len, void *data)
> > > > > +{
> > > > > +	struct bd_holder_disk *holder;
> > > > > +	struct gendisk *disk;
> > > > > +	int rc = 0;
> > > > > +
> > > > > +	if (list_empty(&(bdev->bd_holder_disks)))
> > > > > +		return -ENODEV;
> > > > > +
> > > > > +	list_for_each_entry(holder, &bdev->bd_holder_disks, list) {
> > > > > +		disk = holder->disk;
> > > > > +		if (disk->fops->corrupted_range) {
> > > > > +			rc = disk->fops->corrupted_range(disk, bdev, off, len, data);
> > > > > +			if (rc != -ENODEV)
> > > > > +				break;
> > > > > +		}
> > > > > +	}
> > > > > +	return rc;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(bd_disk_holder_corrupted_range);
> > > > > +
> > > > >    static struct bd_holder_disk *bd_find_holder_disk(struct block_device *bdev,
> > > > >    						  struct gendisk *disk)
> > > > >    {
> > > > > diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> > > > > index ed06209008b8..fba247b852fa 100644
> > > > > --- a/include/linux/genhd.h
> > > > > +++ b/include/linux/genhd.h
> > > > > @@ -382,9 +382,16 @@ int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
> > > > >    long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
> > > > >    #ifdef CONFIG_SYSFS
> > > > > +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
> > > > > +				   size_t len, void *data);
> > > > >    int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk);
> > > > >    void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk);
> > > > >    #else
> > > > > +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
> > > > > +				   size_t len, void *data)
> > > > > +{
> > > > > +	return 0;
> > > > > +}
> > > > >    static inline int bd_link_disk_holder(struct block_device *bdev,
> > > > >    				      struct gendisk *disk)
> > > > >    {
> > > > > -- 
> > > > > 2.29.2
> > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> 
>
