
[RFC] dax,pmem: Provide a dax operation to zero range of memory

Message ID 20200123165249.GA7664@redhat.com (mailing list archive)
State New, archived
Series [RFC] dax,pmem: Provide a dax operation to zero range of memory

Commit Message

Vivek Goyal Jan. 23, 2020, 4:52 p.m. UTC
Hi,

This is an RFC patch to provide a dax operation to zero a range of memory.
It will also clear poison in the process. This is primarily a compile-tested
patch; I don't have real hardware to test the poison logic. I am posting
this to figure out whether this is the right direction.

The motivation for this patch comes from Christoph's feedback that he would
rather have a dax way to zero a range instead of relying on calling
blkdev_issue_zeroout() in __dax_zero_page_range().

https://lkml.org/lkml/2019/8/26/361

My motivation for this change is virtiofs DAX support. There we use DAX
but we don't have a block device, so any dax code that assumes there is
always an associated block device is a problem. This is a cleanup of one
of the places where dax has this dependency on a block device; adding a
dax operation for zeroing a range avoids having to call
blkdev_issue_zeroout() in the dax path.

I have yet to take care of stacked block drivers (dm/md).

The current poison-clearing logic is primarily written with the assumption
that I/O is sector aligned. With this new method that assumption no longer
holds, and one can pass any range of memory to zero. I have fixed a few
places in the existing logic to handle an arbitrary start/end. I am not
sure whether there are other dependencies that might need fixing or that
prohibit us from providing this method.
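
To illustrate, the sector trimming this patch does in pmem_clear_poison()
can be shown stand-alone (a userspace sketch; 512-byte sectors assumed and
the variable names are only illustrative):

#include <stdio.h>

#define SECTOR_SIZE 512ULL

/*
 * Given an arbitrary byte range [pos, pos + len), find the whole
 * 512-byte sectors fully contained in it. Poison can only be cleared
 * in full sectors, so partial head/tail bytes get zeroed but their
 * sectors are not cleared.
 */
int main(void)
{
	unsigned long long pos = 1000, len = 3000;
	unsigned long long sector_start, sector_end, nr_sectors = 0;

	sector_start = (pos + SECTOR_SIZE - 1) / SECTOR_SIZE;	/* round up */
	sector_end = (pos + len) / SECTOR_SIZE;			/* round down */
	if (sector_end > sector_start)
		nr_sectors = sector_end - sector_start;

	/* prints: sectors 2..6 (5 total) */
	printf("sectors %llu..%llu (%llu total)\n",
	       sector_start, sector_end - 1, nr_sectors);
	return 0;
}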

Any feedback or comment is welcome.

Thanks
Vivek

---
 drivers/dax/super.c   |   13 +++++++++
 drivers/nvdimm/pmem.c |   67 ++++++++++++++++++++++++++++++++++++++++++--------
 fs/dax.c              |   39 ++++++++---------------------
 include/linux/dax.h   |    3 ++
 4 files changed, 85 insertions(+), 37 deletions(-)

Comments

Darrick J. Wong Jan. 23, 2020, 7:01 p.m. UTC | #1
On Thu, Jan 23, 2020 at 11:52:49AM -0500, Vivek Goyal wrote:
> Hi,
> 
> This is an RFC patch to provide a dax operation to zero a range of memory.
> It will also clear poison in the process. This is primarily a compile-tested
> patch; I don't have real hardware to test the poison logic. I am posting
> this to figure out whether this is the right direction.
> 
> The motivation for this patch comes from Christoph's feedback that he would
> rather have a dax way to zero a range instead of relying on calling
> blkdev_issue_zeroout() in __dax_zero_page_range().
> 
> https://lkml.org/lkml/2019/8/26/361
> 
> My motivation for this change is virtiofs DAX support. There we use DAX
> but we don't have a block device, so any dax code that assumes there is
> always an associated block device is a problem. This is a cleanup of one
> of the places where dax has this dependency on a block device; adding a
> dax operation for zeroing a range avoids having to call
> blkdev_issue_zeroout() in the dax path.
> 
> I have yet to take care of stacked block drivers (dm/md).
> 
> The current poison-clearing logic is primarily written with the assumption
> that I/O is sector aligned. With this new method that assumption no longer
> holds, and one can pass any range of memory to zero. I have fixed a few
> places in the existing logic to handle an arbitrary start/end. I am not
> sure whether there are other dependencies that might need fixing or that
> prohibit us from providing this method.
> 
> Any feedback or comment is welcome.

So who gets to use this? :)

Should we (XFS) make fallocate(ZERO_RANGE) detect when it's operating on
a written extent in a DAX file and call this instead of what it does now
(punch range and reallocate unwritten)?

Is this the kind of thing XFS should just do on its own when DAX tells us that
some range of pmem has gone bad and now we need to (a) race with the
userland programs to write /something/ to the range to prevent a machine
check, (b) whack all the programs that think they have a mapping to
their data, (c) see if we have a DRAM copy and just write that back, (d)
set wb_err so fsyncs fail, and/or (e) regenerate metadata as necessary?

<cough> Will XFS ever get that "your storage went bad" hook that was
promised ages ago?

Though I guess it only does this a single page at a time, which won't be
awesome if we're trying to zero (say) 100GB of pmem.  I was expecting to
see one big memset() call to zero the entire range followed by
pmem_clear_poison() on the entire range, but I guess you did tag this
RFC. :)
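
Roughly, I had imagined something like this (purely an untested sketch
reusing this patch's pmem helpers; note pmem_clear_poison() takes an
unsigned int length, so a real version would need to cap or loop):

/* Zero the whole range in one go, then clear poison once. */
static int pmem_dax_zero_range_bulk(struct dax_device *dax_dev,
				    pgoff_t pgoff, unsigned int offset,
				    loff_t len)
{
	struct pmem_device *pmem = dax_get_private(dax_dev);
	phys_addr_t pmem_off = pgoff * PAGE_SIZE + offset + pmem->data_offset;
	void *addr = pmem->virt_addr + pmem_off;

	memset(addr, 0, len);
	dax_flush(dax_dev, addr, len);
	if (pmem_clear_poison(pmem, pmem_off, len) != BLK_STS_OK)
		return -EIO;
	return 0;
}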

--D

> Thanks
> Vivek
> 
> ---
>  drivers/dax/super.c   |   13 +++++++++
>  drivers/nvdimm/pmem.c |   67 ++++++++++++++++++++++++++++++++++++++++++--------
>  fs/dax.c              |   39 ++++++++---------------------
>  include/linux/dax.h   |    3 ++
>  4 files changed, 85 insertions(+), 37 deletions(-)
> 
> Index: rhvgoyal-linux/drivers/nvdimm/pmem.c
> ===================================================================
> --- rhvgoyal-linux.orig/drivers/nvdimm/pmem.c	2020-01-23 11:32:11.075139183 -0500
> +++ rhvgoyal-linux/drivers/nvdimm/pmem.c	2020-01-23 11:32:28.660139183 -0500
> @@ -52,8 +52,8 @@ static void hwpoison_clear(struct pmem_d
>  	if (is_vmalloc_addr(pmem->virt_addr))
>  		return;
>  
> -	pfn_start = PHYS_PFN(phys);
> -	pfn_end = pfn_start + PHYS_PFN(len);
> +	pfn_start = PFN_UP(phys);
> +	pfn_end = PFN_DOWN(phys + len);
>  	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
>  		struct page *page = pfn_to_page(pfn);
>  
> @@ -71,22 +71,24 @@ static blk_status_t pmem_clear_poison(st
>  		phys_addr_t offset, unsigned int len)
>  {
>  	struct device *dev = to_dev(pmem);
> -	sector_t sector;
> +	sector_t sector_start, sector_end;
>  	long cleared;
>  	blk_status_t rc = BLK_STS_OK;
> +	int nr_sectors;
>  
> -	sector = (offset - pmem->data_offset) / 512;
> +	sector_start = ALIGN((offset - pmem->data_offset), 512) / 512;
> +	sector_end = ALIGN_DOWN((offset - pmem->data_offset + len), 512)/512;
> +	nr_sectors =  sector_end - sector_start;
>  
>  	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
>  	if (cleared < len)
>  		rc = BLK_STS_IOERR;
> -	if (cleared > 0 && cleared / 512) {
> +	if (cleared > 0 && nr_sectors > 0) {
>  		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
> -		cleared /= 512;
> -		dev_dbg(dev, "%#llx clear %ld sector%s\n",
> -				(unsigned long long) sector, cleared,
> -				cleared > 1 ? "s" : "");
> -		badblocks_clear(&pmem->bb, sector, cleared);
> +		dev_dbg(dev, "%#llx clear %d sector%s\n",
> +				(unsigned long long) sector_start, nr_sectors,
> +				nr_sectors > 1 ? "s" : "");
> +		badblocks_clear(&pmem->bb, sector_start, nr_sectors);
>  		if (pmem->bb_state)
>  			sysfs_notify_dirent(pmem->bb_state);
>  	}
> @@ -268,6 +270,50 @@ static const struct block_device_operati
>  	.revalidate_disk =	nvdimm_revalidate_disk,
>  };
>  
> +static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> +				    unsigned int offset, loff_t len)
> +{
> +	int rc = 0;
> +	phys_addr_t phys_pos = pgoff * PAGE_SIZE + offset;
> +	struct pmem_device *pmem = dax_get_private(dax_dev);
> +	struct page *page = ZERO_PAGE(0);
> +
> +	do {
> +		unsigned bytes, nr_sectors = 0;
> +		sector_t sector_start, sector_end;
> +		bool bad_pmem = false;
> +		phys_addr_t pmem_off = phys_pos + pmem->data_offset;
> +		void *pmem_addr = pmem->virt_addr + pmem_off;
> +		unsigned int page_offset;
> +
> +		page_offset = offset_in_page(phys_pos);
> +		bytes = min_t(loff_t, PAGE_SIZE - page_offset, len);
> +
> +		sector_start = ALIGN(phys_pos, 512)/512;
> +		sector_end = ALIGN_DOWN(phys_pos + bytes, 512)/512;
> +		if (sector_end > sector_start)
> +			nr_sectors = sector_end - sector_start;
> +
> +		if (nr_sectors &&
> +		    unlikely(is_bad_pmem(&pmem->bb, sector_start,
> +					 nr_sectors * 512)))
> +			bad_pmem = true;
> +
> +		write_pmem(pmem_addr, page, 0, bytes);
> +		if (unlikely(bad_pmem)) {
> +			rc = pmem_clear_poison(pmem, pmem_off, bytes);
> +			write_pmem(pmem_addr, page, 0, bytes);
> +		}
> +		if (rc > 0)
> +			return -EIO;
> +
> +		phys_pos += bytes;
> +		len -= bytes;
> +	} while (len > 0);
> +
> +	return 0;
> +}
> +
>  static long pmem_dax_direct_access(struct dax_device *dax_dev,
>  		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
>  {
> @@ -299,6 +345,7 @@ static const struct dax_operations pmem_
>  	.dax_supported = generic_fsdax_supported,
>  	.copy_from_iter = pmem_copy_from_iter,
>  	.copy_to_iter = pmem_copy_to_iter,
> +	.zero_page_range = pmem_dax_zero_page_range,
>  };
>  
>  static const struct attribute_group *pmem_attribute_groups[] = {
> Index: rhvgoyal-linux/include/linux/dax.h
> ===================================================================
> --- rhvgoyal-linux.orig/include/linux/dax.h	2020-01-23 11:25:23.814139183 -0500
> +++ rhvgoyal-linux/include/linux/dax.h	2020-01-23 11:32:17.799139183 -0500
> @@ -34,6 +34,8 @@ struct dax_operations {
>  	/* copy_to_iter: required operation for fs-dax direct-i/o */
>  	size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *, size_t,
>  			struct iov_iter *);
> +	/* zero_page_range: optional operation for fs-dax direct-i/o */
> +	int (*zero_page_range)(struct dax_device *, pgoff_t, unsigned, loff_t);
>  };
>  
>  extern struct attribute_group dax_attribute_group;
> @@ -209,6 +211,7 @@ size_t dax_copy_from_iter(struct dax_dev
>  		size_t bytes, struct iov_iter *i);
>  size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
>  		size_t bytes, struct iov_iter *i);
> +int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, unsigned offset, loff_t len);
>  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
>  
>  ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
> Index: rhvgoyal-linux/fs/dax.c
> ===================================================================
> --- rhvgoyal-linux.orig/fs/dax.c	2020-01-23 11:25:23.814139183 -0500
> +++ rhvgoyal-linux/fs/dax.c	2020-01-23 11:32:17.801139183 -0500
> @@ -1044,38 +1044,23 @@ static vm_fault_t dax_load_hole(struct x
>  	return ret;
>  }
>  
> -static bool dax_range_is_aligned(struct block_device *bdev,
> -				 unsigned int offset, unsigned int length)
> -{
> -	unsigned short sector_size = bdev_logical_block_size(bdev);
> -
> -	if (!IS_ALIGNED(offset, sector_size))
> -		return false;
> -	if (!IS_ALIGNED(length, sector_size))
> -		return false;
> -
> -	return true;
> -}
> -
>  int __dax_zero_page_range(struct block_device *bdev,
>  		struct dax_device *dax_dev, sector_t sector,
>  		unsigned int offset, unsigned int size)
>  {
> -	if (dax_range_is_aligned(bdev, offset, size)) {
> -		sector_t start_sector = sector + (offset >> 9);
> +	pgoff_t pgoff;
> +	long rc, id;
>  
> -		return blkdev_issue_zeroout(bdev, start_sector,
> -				size >> 9, GFP_NOFS, 0);
> -	} else {
> -		pgoff_t pgoff;
> -		long rc, id;
> +	rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
> +	if (rc)
> +		return rc;
> +
> +	id = dax_read_lock();
> +	rc = dax_zero_page_range(dax_dev, pgoff, offset, size);
> +	if (rc == -EOPNOTSUPP) {
>  		void *kaddr;
>  
> -		rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
> -		if (rc)
> -			return rc;
> -
> -		id = dax_read_lock();
> +		/* If driver does not implement zero page range, fall back */
>  		rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
>  		if (rc < 0) {
>  			dax_read_unlock(id);
> @@ -1083,9 +1068,9 @@ int __dax_zero_page_range(struct block_d
>  		}
>  		memset(kaddr + offset, 0, size);
>  		dax_flush(dax_dev, kaddr + offset, size);
> -		dax_read_unlock(id);
>  	}
> -	return 0;
> +	dax_read_unlock(id);
> +	return rc;
>  }
>  EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
> Index: rhvgoyal-linux/drivers/dax/super.c
> ===================================================================
> --- rhvgoyal-linux.orig/drivers/dax/super.c	2020-01-23 11:25:23.814139183 -0500
> +++ rhvgoyal-linux/drivers/dax/super.c	2020-01-23 11:32:17.802139183 -0500
> @@ -344,6 +344,19 @@ size_t dax_copy_to_iter(struct dax_devic
>  }
>  EXPORT_SYMBOL_GPL(dax_copy_to_iter);
>  
> +int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> +			unsigned offset, loff_t len)
> +{
> +	if (!dax_alive(dax_dev))
> +		return 0;
> +
> +	if (!dax_dev->ops->zero_page_range)
> +		return -EOPNOTSUPP;
> +
> +	return dax_dev->ops->zero_page_range(dax_dev, pgoff, offset, len);
> +}
> +EXPORT_SYMBOL_GPL(dax_zero_page_range);
> +
>  #ifdef CONFIG_ARCH_HAS_PMEM_API
>  void arch_wb_cache_pmem(void *addr, size_t size);
>  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
>
Vivek Goyal Jan. 24, 2020, 1:52 p.m. UTC | #2
On Thu, Jan 23, 2020 at 11:01:03AM -0800, Darrick J. Wong wrote:
> On Thu, Jan 23, 2020 at 11:52:49AM -0500, Vivek Goyal wrote:
> > Hi,
> > 
> > This is an RFC patch to provide a dax operation to zero a range of memory.
> > It will also clear poison in the process. This is primarily a compile-tested
> > patch; I don't have real hardware to test the poison logic. I am posting
> > this to figure out whether this is the right direction.
> > 
> > The motivation for this patch comes from Christoph's feedback that he would
> > rather have a dax way to zero a range instead of relying on calling
> > blkdev_issue_zeroout() in __dax_zero_page_range().
> > 
> > https://lkml.org/lkml/2019/8/26/361
> > 
> > My motivation for this change is virtiofs DAX support. There we use DAX
> > but we don't have a block device, so any dax code that assumes there is
> > always an associated block device is a problem. This is a cleanup of one
> > of the places where dax has this dependency on a block device; adding a
> > dax operation for zeroing a range avoids having to call
> > blkdev_issue_zeroout() in the dax path.
> > 
> > I have yet to take care of stacked block drivers (dm/md).
> > 
> > The current poison-clearing logic is primarily written with the assumption
> > that I/O is sector aligned. With this new method that assumption no longer
> > holds, and one can pass any range of memory to zero. I have fixed a few
> > places in the existing logic to handle an arbitrary start/end. I am not
> > sure whether there are other dependencies that might need fixing or that
> > prohibit us from providing this method.
> > 
> > Any feedback or comment is welcome.
> 
> So who gets to use this? :)

Right now iomap_zero_range() is the only user. Maybe there can be
other users as well. Not sure.
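
(For reference, the call chain today is iomap_zero_range() ->
iomap_zero_range_actor() -> iomap_dax_zero() -> __dax_zero_page_range()
for IS_DAX() inodes.)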

> 
> Should we (XFS) make fallocate(ZERO_RANGE) detect when it's operating on
> a written extent in a DAX file and call this instead of what it does now
> (punch range and reallocate unwritten)?

Maybe, if this method turns out to be more efficient. But if zeroing
blocks right away were more efficient, then blkdev_issue_zeroout() should
already give you that today; I am assuming that's what you are using to
zero full blocks.

> 
> Is this the kind of thing XFS should just do on its own when DAX tells us that
> some range of pmem has gone bad and now we need to (a) race with the
> userland programs to write /something/ to the range to prevent a machine
> check, (b) whack all the programs that think they have a mapping to
> their data, (c) see if we have a DRAM copy and just write that back, (d)
> set wb_err so fsyncs fail, and/or (e) regenerate metadata as necessary?

I am not sure, but the current idea seems to be that if there are bad
blocks (poisoned memory), the user comes to know about it when reading
(the read fails, or user space gets SIGBUS in the case of a mapped file),
and user space then takes action to clear the poison. So if user space is
driving the clearing of bad blocks/poison, XFS probably does not have to
know about the poisoned memory locations.

I am not aware of all the discussions around this design. This seems
like a new thing which should be addressed through a different set 
of patches.

> 
> <cough> Will XFS ever get that "your storage went bad" hook that was
> promised ages ago?
> 
> Though I guess it only does this a single page at a time, which won't be
> awesome if we're trying to zero (say) 100GB of pmem.  I was expecting to
> see one big memset() call to zero the entire range followed by
> pmem_clear_poison() on the entire range, but I guess you did tag this
> RFC. :)

I was thinking about memset(), but in this first attempt I just wanted
to do what the existing code does, to make sure it works. I was not sure
what issues might come up if I first call memset() on the full range.

One open issue is what to do if we hit errors while clearing poison/bad
sectors/hwpoison. Is it better to go page by page and abort as soon as
clearing poison fails, or to zero the whole range first and then abort
the poison clearing whenever we hit an error? I don't know.

Thanks
Vivek

> 
> --D
> 
> > Thanks
> > Vivek
> > 
> > ---
> >  drivers/dax/super.c   |   13 +++++++++
> >  drivers/nvdimm/pmem.c |   67 ++++++++++++++++++++++++++++++++++++++++++--------
> >  fs/dax.c              |   39 ++++++++---------------------
> >  include/linux/dax.h   |    3 ++
> >  4 files changed, 85 insertions(+), 37 deletions(-)
> > 
> > Index: rhvgoyal-linux/drivers/nvdimm/pmem.c
> > ===================================================================
> > --- rhvgoyal-linux.orig/drivers/nvdimm/pmem.c	2020-01-23 11:32:11.075139183 -0500
> > +++ rhvgoyal-linux/drivers/nvdimm/pmem.c	2020-01-23 11:32:28.660139183 -0500
> > @@ -52,8 +52,8 @@ static void hwpoison_clear(struct pmem_d
> >  	if (is_vmalloc_addr(pmem->virt_addr))
> >  		return;
> >  
> > -	pfn_start = PHYS_PFN(phys);
> > -	pfn_end = pfn_start + PHYS_PFN(len);
> > +	pfn_start = PFN_UP(phys);
> > +	pfn_end = PFN_DOWN(phys + len);
> >  	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
> >  		struct page *page = pfn_to_page(pfn);
> >  
> > @@ -71,22 +71,24 @@ static blk_status_t pmem_clear_poison(st
> >  		phys_addr_t offset, unsigned int len)
> >  {
> >  	struct device *dev = to_dev(pmem);
> > -	sector_t sector;
> > +	sector_t sector_start, sector_end;
> >  	long cleared;
> >  	blk_status_t rc = BLK_STS_OK;
> > +	int nr_sectors;
> >  
> > -	sector = (offset - pmem->data_offset) / 512;
> > +	sector_start = ALIGN((offset - pmem->data_offset), 512) / 512;
> > +	sector_end = ALIGN_DOWN((offset - pmem->data_offset + len), 512)/512;
> > +	nr_sectors =  sector_end - sector_start;
> >  
> >  	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
> >  	if (cleared < len)
> >  		rc = BLK_STS_IOERR;
> > -	if (cleared > 0 && cleared / 512) {
> > +	if (cleared > 0 && nr_sectors > 0) {
> >  		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
> > -		cleared /= 512;
> > -		dev_dbg(dev, "%#llx clear %ld sector%s\n",
> > -				(unsigned long long) sector, cleared,
> > -				cleared > 1 ? "s" : "");
> > -		badblocks_clear(&pmem->bb, sector, cleared);
> > +		dev_dbg(dev, "%#llx clear %d sector%s\n",
> > +				(unsigned long long) sector_start, nr_sectors,
> > +				nr_sectors > 1 ? "s" : "");
> > +		badblocks_clear(&pmem->bb, sector_start, nr_sectors);
> >  		if (pmem->bb_state)
> >  			sysfs_notify_dirent(pmem->bb_state);
> >  	}
> > @@ -268,6 +270,50 @@ static const struct block_device_operati
> >  	.revalidate_disk =	nvdimm_revalidate_disk,
> >  };
> >  
> > +static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> > +				    unsigned int offset, loff_t len)
> > +{
> > +	int rc = 0;
> > +	phys_addr_t phys_pos = pgoff * PAGE_SIZE + offset;
> > +	struct pmem_device *pmem = dax_get_private(dax_dev);
> > +	struct page *page = ZERO_PAGE(0);
> > +
> > +	do {
> > +		unsigned bytes, nr_sectors = 0;
> > +		sector_t sector_start, sector_end;
> > +		bool bad_pmem = false;
> > +		phys_addr_t pmem_off = phys_pos + pmem->data_offset;
> > +		void *pmem_addr = pmem->virt_addr + pmem_off;
> > +		unsigned int page_offset;
> > +
> > +		page_offset = offset_in_page(phys_pos);
> > +		bytes = min_t(loff_t, PAGE_SIZE - page_offset, len);
> > +
> > +		sector_start = ALIGN(phys_pos, 512)/512;
> > +		sector_end = ALIGN_DOWN(phys_pos + bytes, 512)/512;
> > +		if (sector_end > sector_start)
> > +			nr_sectors = sector_end - sector_start;
> > +
> > +		if (nr_sectors &&
> > +		    unlikely(is_bad_pmem(&pmem->bb, sector_start,
> > +					 nr_sectors * 512)))
> > +			bad_pmem = true;
> > +
> > +		write_pmem(pmem_addr, page, 0, bytes);
> > +		if (unlikely(bad_pmem)) {
> > +			rc = pmem_clear_poison(pmem, pmem_off, bytes);
> > +			write_pmem(pmem_addr, page, 0, bytes);
> > +		}
> > +		if (rc > 0)
> > +			return -EIO;
> > +
> > +		phys_pos += bytes;
> > +		len -= bytes;
> > +	} while (len > 0);
> > +
> > +	return 0;
> > +}
> > +
> >  static long pmem_dax_direct_access(struct dax_device *dax_dev,
> >  		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
> >  {
> > @@ -299,6 +345,7 @@ static const struct dax_operations pmem_
> >  	.dax_supported = generic_fsdax_supported,
> >  	.copy_from_iter = pmem_copy_from_iter,
> >  	.copy_to_iter = pmem_copy_to_iter,
> > +	.zero_page_range = pmem_dax_zero_page_range,
> >  };
> >  
> >  static const struct attribute_group *pmem_attribute_groups[] = {
> > Index: rhvgoyal-linux/include/linux/dax.h
> > ===================================================================
> > --- rhvgoyal-linux.orig/include/linux/dax.h	2020-01-23 11:25:23.814139183 -0500
> > +++ rhvgoyal-linux/include/linux/dax.h	2020-01-23 11:32:17.799139183 -0500
> > @@ -34,6 +34,8 @@ struct dax_operations {
> >  	/* copy_to_iter: required operation for fs-dax direct-i/o */
> >  	size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *, size_t,
> >  			struct iov_iter *);
> > +	/* zero_page_range: optional operation for fs-dax direct-i/o */
> > +	int (*zero_page_range)(struct dax_device *, pgoff_t, unsigned, loff_t);
> >  };
> >  
> >  extern struct attribute_group dax_attribute_group;
> > @@ -209,6 +211,7 @@ size_t dax_copy_from_iter(struct dax_dev
> >  		size_t bytes, struct iov_iter *i);
> >  size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
> >  		size_t bytes, struct iov_iter *i);
> > +int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, unsigned offset, loff_t len);
> >  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
> >  
> >  ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
> > Index: rhvgoyal-linux/fs/dax.c
> > ===================================================================
> > --- rhvgoyal-linux.orig/fs/dax.c	2020-01-23 11:25:23.814139183 -0500
> > +++ rhvgoyal-linux/fs/dax.c	2020-01-23 11:32:17.801139183 -0500
> > @@ -1044,38 +1044,23 @@ static vm_fault_t dax_load_hole(struct x
> >  	return ret;
> >  }
> >  
> > -static bool dax_range_is_aligned(struct block_device *bdev,
> > -				 unsigned int offset, unsigned int length)
> > -{
> > -	unsigned short sector_size = bdev_logical_block_size(bdev);
> > -
> > -	if (!IS_ALIGNED(offset, sector_size))
> > -		return false;
> > -	if (!IS_ALIGNED(length, sector_size))
> > -		return false;
> > -
> > -	return true;
> > -}
> > -
> >  int __dax_zero_page_range(struct block_device *bdev,
> >  		struct dax_device *dax_dev, sector_t sector,
> >  		unsigned int offset, unsigned int size)
> >  {
> > -	if (dax_range_is_aligned(bdev, offset, size)) {
> > -		sector_t start_sector = sector + (offset >> 9);
> > +	pgoff_t pgoff;
> > +	long rc, id;
> >  
> > -		return blkdev_issue_zeroout(bdev, start_sector,
> > -				size >> 9, GFP_NOFS, 0);
> > -	} else {
> > -		pgoff_t pgoff;
> > -		long rc, id;
> > +	rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
> > +	if (rc)
> > +		return rc;
> > +
> > +	id = dax_read_lock();
> > +	rc = dax_zero_page_range(dax_dev, pgoff, offset, size);
> > +	if (rc == -EOPNOTSUPP) {
> >  		void *kaddr;
> >  
> > -		rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
> > -		if (rc)
> > -			return rc;
> > -
> > -		id = dax_read_lock();
> > +		/* If driver does not implement zero page range, fall back */
> >  		rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
> >  		if (rc < 0) {
> >  			dax_read_unlock(id);
> > @@ -1083,9 +1068,9 @@ int __dax_zero_page_range(struct block_d
> >  		}
> >  		memset(kaddr + offset, 0, size);
> >  		dax_flush(dax_dev, kaddr + offset, size);
> > -		dax_read_unlock(id);
> >  	}
> > -	return 0;
> > +	dax_read_unlock(id);
> > +	return rc;
> >  }
> >  EXPORT_SYMBOL_GPL(__dax_zero_page_range);
> >  
> > Index: rhvgoyal-linux/drivers/dax/super.c
> > ===================================================================
> > --- rhvgoyal-linux.orig/drivers/dax/super.c	2020-01-23 11:25:23.814139183 -0500
> > +++ rhvgoyal-linux/drivers/dax/super.c	2020-01-23 11:32:17.802139183 -0500
> > @@ -344,6 +344,19 @@ size_t dax_copy_to_iter(struct dax_devic
> >  }
> >  EXPORT_SYMBOL_GPL(dax_copy_to_iter);
> >  
> > +int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> > +			unsigned offset, loff_t len)
> > +{
> > +	if (!dax_alive(dax_dev))
> > +		return 0;
> > +
> > +	if (!dax_dev->ops->zero_page_range)
> > +		return -EOPNOTSUPP;
> > +
> > +	return dax_dev->ops->zero_page_range(dax_dev, pgoff, offset, len);
> > +}
> > +EXPORT_SYMBOL_GPL(dax_zero_page_range);
> > +
> >  #ifdef CONFIG_ARCH_HAS_PMEM_API
> >  void arch_wb_cache_pmem(void *addr, size_t size);
> >  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
> > 
>
Christoph Hellwig Jan. 31, 2020, 5:36 a.m. UTC | #3
On Thu, Jan 23, 2020 at 11:52:49AM -0500, Vivek Goyal wrote:
> Hi,
> 
> This is an RFC patch to provide a dax operation to zero a range of memory.
> It will also clear poison in the process. This is primarily a compile-tested
> patch; I don't have real hardware to test the poison logic. I am posting
> this to figure out whether this is the right direction.
> 
> The motivation for this patch comes from Christoph's feedback that he would
> rather have a dax way to zero a range instead of relying on calling
> blkdev_issue_zeroout() in __dax_zero_page_range().
> 
> https://lkml.org/lkml/2019/8/26/361
> 
> My motivation for this change is virtiofs DAX support. There we use DAX
> but we don't have a block device, so any dax code that assumes there is
> always an associated block device is a problem. This is a cleanup of one
> of the places where dax has this dependency on a block device; adding a
> dax operation for zeroing a range avoids having to call
> blkdev_issue_zeroout() in the dax path.
> 
> I have yet to take care of stacked block drivers (dm/md).
> 
> The current poison-clearing logic is primarily written with the assumption
> that I/O is sector aligned. With this new method that assumption no longer
> holds, and one can pass any range of memory to zero. I have fixed a few
> places in the existing logic to handle an arbitrary start/end. I am not
> sure whether there are other dependencies that might need fixing or that
> prohibit us from providing this method.
> 
> Any feedback or comment is welcome.
> 
> Thanks
> Vivek
> 
> ---
>  drivers/dax/super.c   |   13 +++++++++
>  drivers/nvdimm/pmem.c |   67 ++++++++++++++++++++++++++++++++++++++++++--------
>  fs/dax.c              |   39 ++++++++---------------------
>  include/linux/dax.h   |    3 ++
>  4 files changed, 85 insertions(+), 37 deletions(-)
> 
> Index: rhvgoyal-linux/drivers/nvdimm/pmem.c
> ===================================================================
> --- rhvgoyal-linux.orig/drivers/nvdimm/pmem.c	2020-01-23 11:32:11.075139183 -0500
> +++ rhvgoyal-linux/drivers/nvdimm/pmem.c	2020-01-23 11:32:28.660139183 -0500
> @@ -52,8 +52,8 @@ static void hwpoison_clear(struct pmem_d
>  	if (is_vmalloc_addr(pmem->virt_addr))
>  		return;
>  
> -	pfn_start = PHYS_PFN(phys);
> -	pfn_end = pfn_start + PHYS_PFN(len);
> +	pfn_start = PFN_UP(phys);
> +	pfn_end = PFN_DOWN(phys + len);
>  	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
>  		struct page *page = pfn_to_page(pfn);
>  

This change looks unrelated to the rest.

> +	sector_end = ALIGN_DOWN((offset - pmem->data_offset + len), 512)/512;
> +	nr_sectors =  sector_end - sector_start;
>  
>  	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
>  	if (cleared < len)
>  		rc = BLK_STS_IOERR;
> -	if (cleared > 0 && cleared / 512) {
> +	if (cleared > 0 && nr_sectors > 0) {
>  		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
> -		cleared /= 512;
> -		dev_dbg(dev, "%#llx clear %ld sector%s\n",
> -				(unsigned long long) sector, cleared,
> -				cleared > 1 ? "s" : "");
> -		badblocks_clear(&pmem->bb, sector, cleared);
> +		dev_dbg(dev, "%#llx clear %d sector%s\n",
> +				(unsigned long long) sector_start, nr_sectors,
> +				nr_sectors > 1 ? "s" : "");
> +		badblocks_clear(&pmem->bb, sector_start, nr_sectors);
>  		if (pmem->bb_state)
>  			sysfs_notify_dirent(pmem->bb_state);
>  	}

As does this one?

>  int __dax_zero_page_range(struct block_device *bdev,
>  		struct dax_device *dax_dev, sector_t sector,
>  		unsigned int offset, unsigned int size)
>  {
> +	pgoff_t pgoff;
> +	long rc, id;
>  
> +	rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
> +	if (rc)
> +		return rc;
> +
> +	id = dax_read_lock();
> +	rc = dax_zero_page_range(dax_dev, pgoff, offset, size);
> +	if (rc == -EOPNOTSUPP) {
>  		void *kaddr;
>  
> +		/* If driver does not implement zero page range, fall back */

I think we'll want to restructure this a bit.  First make the new
method mandatory, and just provide a generic_dax_zero_page_range or
similar for the non-pmem instances.

Then __dax_zero_page_range and iomap_dax_zero should merge, and maybe
eventually iomap_zero_range_actor and iomap_zero_range should be split
into a pagecache and DAX variant, lifting the IS_DAX() check into the
callers.
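
Something like this should be all the generic version needs (sketch only,
mirroring the existing fallback; the caller would hold dax_read_lock()):

static int generic_dax_zero_page_range(struct dax_device *dax_dev,
		pgoff_t pgoff, unsigned int offset, loff_t len)
{
	void *kaddr;
	long rc;

	/* same single-page fallback as __dax_zero_page_range has today */
	rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
	if (rc < 0)
		return rc;
	memset(kaddr + offset, 0, len);
	dax_flush(dax_dev, kaddr + offset, len);
	return 0;
}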
Dan Williams Jan. 31, 2020, 11:31 p.m. UTC | #4
On Thu, Jan 23, 2020 at 11:07 AM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
>
> On Thu, Jan 23, 2020 at 11:52:49AM -0500, Vivek Goyal wrote:
> > Hi,
> >
> > This is an RFC patch to provide a dax operation to zero a range of memory.
> > It will also clear poison in the process. This is primarily a compile-tested
> > patch; I don't have real hardware to test the poison logic. I am posting
> > this to figure out whether this is the right direction.
> >
> > The motivation for this patch comes from Christoph's feedback that he would
> > rather have a dax way to zero a range instead of relying on calling
> > blkdev_issue_zeroout() in __dax_zero_page_range().
> >
> > https://lkml.org/lkml/2019/8/26/361
> >
> > My motivation for this change is virtiofs DAX support. There we use DAX
> > but we don't have a block device, so any dax code that assumes there is
> > always an associated block device is a problem. This is a cleanup of one
> > of the places where dax has this dependency on a block device; adding a
> > dax operation for zeroing a range avoids having to call
> > blkdev_issue_zeroout() in the dax path.
> >
> > I have yet to take care of stacked block drivers (dm/md).
> >
> > The current poison-clearing logic is primarily written with the assumption
> > that I/O is sector aligned. With this new method that assumption no longer
> > holds, and one can pass any range of memory to zero. I have fixed a few
> > places in the existing logic to handle an arbitrary start/end. I am not
> > sure whether there are other dependencies that might need fixing or that
> > prohibit us from providing this method.
> >
> > Any feedback or comment is welcome.
>
> So who gets to use this? :)
>
> Should we (XFS) make fallocate(ZERO_RANGE) detect when it's operating on
> a written extent in a DAX file and call this instead of what it does now
> (punch range and reallocate unwritten)?

If it eliminates more block assumptions, then yes. In general I think
there are opportunities to use "native" direct_access instead of
block-i/o for other areas too, like metadata i/o.

> > Is this the kind of thing XFS should just do on its own when DAX tells us that
> some range of pmem has gone bad and now we need to (a) race with the
> userland programs to write /something/ to the range to prevent a machine
> > check, (b) whack all the programs that think they have a mapping to
> their data, (c) see if we have a DRAM copy and just write that back, (d)
> set wb_err so fsyncs fail, and/or (e) regenerate metadata as necessary?

(a), (b) duplicate what memory error handling already does. So yes,
could be done but it only helps if machine check handling is broken or
missing.

(c) what DRAM copy in the DAX case?

(d) dax fsync is just cache flush, so it can't fail, or are you
talking about errors in metadata?

(e) I thought our solution for dax metadata redundancy is to use a
realtime data device and raid mirror for the metadata device.

> <cough> Will XFS ever get that "your storage went bad" hook that was
> promised ages ago?

pmem developers don't scale?

> Though I guess it only does this a single page at a time, which won't be
> awesome if we're trying to zero (say) 100GB of pmem.  I was expecting to
> see one big memset() call to zero the entire range followed by
> pmem_clear_poison() on the entire range, but I guess you did tag this
> RFC. :)

Until movdir64b is available the only way to clear poison is by making
a call to the BIOS. The BIOS may not be efficient at bulk clearing.
Christoph Hellwig Feb. 3, 2020, 8:20 a.m. UTC | #5
On Fri, Jan 31, 2020 at 03:31:58PM -0800, Dan Williams wrote:
> > Should we (XFS) make fallocate(ZERO_RANGE) detect when it's operating on
> > a written extent in a DAX file and call this instead of what it does now
> > (punch range and reallocate unwritten)?
> 
> If it eliminates more block assumptions, then yes. In general I think
> there are opportunities to use "native" direct_access instead of
> block-i/o for other areas too, like metadata i/o.

Yes, and at least for XFS there aren't too many places where we rely
on block I/O after this.  It is the buffer cache and the log code,
and I actually have a WIP conversion for the latter here:

	http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-log-dax

which I need to dust off, and similarly for the cache flushing changes.

But more importantly with just the patch in this thread we should be
able to drop the block device pointer in struct iomap for DAX file
systems, and thus be able to union the bdev, dax_dev and inline data
fields, which should make their usage much more clear, and reduce the
stack footprint.
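
(i.e. hypothetically something like the following, with the other iomap
fields unchanged:)

struct iomap {
	u64			addr;	/* disk offset of mapping, bytes */
	loff_t			offset;	/* file offset of mapping, bytes */
	u64			length;	/* length of mapping, bytes */
	u16			type;
	u16			flags;
	union {
		struct block_device	*bdev;		/* block filesystems */
		struct dax_device	*dax_dev;	/* DAX filesystems */
		void			*inline_data;	/* inline mappings */
	};
};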

> (d) dax fsync is just cache flush, so it can't fail, or are you
> talking about errors in metadata?

And based on our discussion even that cache flush sounds like a bad
idea, and might be a reason why all the file system bypass or
weirdo file systems are faster than XFS.
Darrick J. Wong Feb. 4, 2020, 11:23 p.m. UTC | #6
On Fri, Jan 31, 2020 at 03:31:58PM -0800, Dan Williams wrote:
> On Thu, Jan 23, 2020 at 11:07 AM Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> >
> > On Thu, Jan 23, 2020 at 11:52:49AM -0500, Vivek Goyal wrote:
> > > Hi,
> > >
> > > This is an RFC patch to provide a dax operation to zero a range of memory.
> > > It will also clear poison in the process. This is primarily a compile-tested
> > > patch; I don't have real hardware to test the poison logic. I am posting
> > > this to figure out whether this is the right direction.
> > >
> > > The motivation for this patch comes from Christoph's feedback that he would
> > > rather have a dax way to zero a range instead of relying on calling
> > > blkdev_issue_zeroout() in __dax_zero_page_range().
> > >
> > > https://lkml.org/lkml/2019/8/26/361
> > >
> > > My motivation for this change is virtiofs DAX support. There we use DAX
> > > but we don't have a block device, so any dax code that assumes there is
> > > always an associated block device is a problem. This is a cleanup of one
> > > of the places where dax has this dependency on a block device; adding a
> > > dax operation for zeroing a range avoids having to call
> > > blkdev_issue_zeroout() in the dax path.
> > >
> > > I have yet to take care of stacked block drivers (dm/md).
> > >
> > > The current poison-clearing logic is primarily written with the assumption
> > > that I/O is sector aligned. With this new method that assumption no longer
> > > holds, and one can pass any range of memory to zero. I have fixed a few
> > > places in the existing logic to handle an arbitrary start/end. I am not
> > > sure whether there are other dependencies that might need fixing or that
> > > prohibit us from providing this method.
> > >
> > > Any feedback or comment is welcome.
> >
> > So who gets to use this? :)
> >
> > Should we (XFS) make fallocate(ZERO_RANGE) detect when it's operating on
> > a written extent in a DAX file and call this instead of what it does now
> > (punch range and reallocate unwritten)?
> 
> If it eliminates more block assumptions, then yes. In general I think
> there are opportunities to use "native" direct_access instead of
> block-i/o for other areas too, like metadata i/o.
> 
> > Is this the kind of thing XFS should just do on its own when DAX tells us that
> > some range of pmem has gone bad and now we need to (a) race with the
> > userland programs to write /something/ to the range to prevent a machine
> > check, (b) whack all the programs that think they have a mapping to
> > their data, (c) see if we have a DRAM copy and just write that back, (d)
> > set wb_err so fsyncs fail, and/or (e) regenerate metadata as necessary?
> 
> (a), (b) duplicate what memory error handling already does. So yes,
> could be done but it only helps if machine check handling is broken or
> missing.

<nod> 

> (c) what DRAM copy in the DAX case?

Sorry, I was talking about the fs metadata that we cache in DRAM.

> (d) dax fsync is just cache flush, so it can't fail, or are you
> talking about errors in metadata?

I'm talking about an S_DAX file that someone is doing regular write()s
to:

1. Open file O_RDWR
2. Write something to the file
3. Some time later, something decides the pmem is bad.
4. Program calls fsync(); does it return EIO?
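
(Hypothetically, I'd expect the media-error notification path to end up
doing something like this, so that the next fsync() observes the error:)

/* sketch: tag the mapping so a later fsync() returns -EIO via wb_err */
static void dax_note_media_error(struct address_space *mapping)
{
	filemap_set_wb_err(mapping, -EIO);
}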

(I shouldn't have mixed the metadata/file data cases, sorry...)

> (e) I thought our solution for dax metadata redundancy is to use a
> realtime data device and raid mirror for the metadata device.

In the end it was set aside on the grounds that reserving space for
a separate metadata device was too costly and too complex for now.
We might get back to it later when the <cough> economics improve.

> > <cough> Will XFS ever get that "your storage went bad" hook that was
> > promised ages ago?
> 
> pmem developers don't scale?

Ah, sorry. :/

> > Though I guess it only does this a single page at a time, which won't be
> > awesome if we're trying to zero (say) 100GB of pmem.  I was expecting to
> > see one big memset() call to zero the entire range followed by
> > pmem_clear_poison() on the entire range, but I guess you did tag this
> > RFC. :)
> 
> Until movdir64b is available the only way to clear poison is by making
> a call to the BIOS. The BIOS may not be efficient at bulk clearing.

Well then let's port XFS to SMM mode. <duck>

(No, please don't...)

--D

Patch

Index: rhvgoyal-linux/drivers/nvdimm/pmem.c
===================================================================
--- rhvgoyal-linux.orig/drivers/nvdimm/pmem.c	2020-01-23 11:32:11.075139183 -0500
+++ rhvgoyal-linux/drivers/nvdimm/pmem.c	2020-01-23 11:32:28.660139183 -0500
@@ -52,8 +52,8 @@  static void hwpoison_clear(struct pmem_d
 	if (is_vmalloc_addr(pmem->virt_addr))
 		return;
 
-	pfn_start = PHYS_PFN(phys);
-	pfn_end = pfn_start + PHYS_PFN(len);
+	pfn_start = PFN_UP(phys);
+	pfn_end = PFN_DOWN(phys + len);
 	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
@@ -71,22 +71,24 @@  static blk_status_t pmem_clear_poison(st
 		phys_addr_t offset, unsigned int len)
 {
 	struct device *dev = to_dev(pmem);
-	sector_t sector;
+	sector_t sector_start, sector_end;
 	long cleared;
 	blk_status_t rc = BLK_STS_OK;
+	int nr_sectors;
 
-	sector = (offset - pmem->data_offset) / 512;
+	sector_start = ALIGN((offset - pmem->data_offset), 512) / 512;
+	sector_end = ALIGN_DOWN((offset - pmem->data_offset + len), 512)/512;
+	nr_sectors =  sector_end - sector_start;
 
 	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
 	if (cleared < len)
 		rc = BLK_STS_IOERR;
-	if (cleared > 0 && cleared / 512) {
+	if (cleared > 0 && nr_sectors > 0) {
 		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
-		cleared /= 512;
-		dev_dbg(dev, "%#llx clear %ld sector%s\n",
-				(unsigned long long) sector, cleared,
-				cleared > 1 ? "s" : "");
-		badblocks_clear(&pmem->bb, sector, cleared);
+		dev_dbg(dev, "%#llx clear %d sector%s\n",
+				(unsigned long long) sector_start, nr_sectors,
+				nr_sectors > 1 ? "s" : "");
+		badblocks_clear(&pmem->bb, sector_start, nr_sectors);
 		if (pmem->bb_state)
 			sysfs_notify_dirent(pmem->bb_state);
 	}
@@ -268,6 +270,50 @@  static const struct block_device_operati
 	.revalidate_disk =	nvdimm_revalidate_disk,
 };
 
+static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+				    unsigned int offset, loff_t len)
+{
+	int rc = 0;
+	phys_addr_t phys_pos = pgoff * PAGE_SIZE + offset;
+	struct pmem_device *pmem = dax_get_private(dax_dev);
+	struct page *page = ZERO_PAGE(0);
+
+	do {
+		unsigned bytes, nr_sectors = 0;
+		sector_t sector_start, sector_end;
+		bool bad_pmem = false;
+		phys_addr_t pmem_off = phys_pos + pmem->data_offset;
+		void *pmem_addr = pmem->virt_addr + pmem_off;
+		unsigned int page_offset;
+
+		page_offset = offset_in_page(phys_pos);
+		bytes = min_t(loff_t, PAGE_SIZE - page_offset, len);
+
+		sector_start = ALIGN(phys_pos, 512)/512;
+		sector_end = ALIGN_DOWN(phys_pos + bytes, 512)/512;
+		if (sector_end > sector_start)
+			nr_sectors = sector_end - sector_start;
+
+		if (nr_sectors &&
+		    unlikely(is_bad_pmem(&pmem->bb, sector_start,
+					 nr_sectors * 512)))
+			bad_pmem = true;
+
+		write_pmem(pmem_addr, page, 0, bytes);
+		if (unlikely(bad_pmem)) {
+			rc = pmem_clear_poison(pmem, pmem_off, bytes);
+			write_pmem(pmem_addr, page, 0, bytes);
+		}
+		if (rc > 0)
+			return -EIO;
+
+		phys_pos += bytes;
+		len -= bytes;
+	} while (len > 0);
+
+	return 0;
+}
+
 static long pmem_dax_direct_access(struct dax_device *dax_dev,
 		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
 {
@@ -299,6 +345,7 @@  static const struct dax_operations pmem_
 	.dax_supported = generic_fsdax_supported,
 	.copy_from_iter = pmem_copy_from_iter,
 	.copy_to_iter = pmem_copy_to_iter,
+	.zero_page_range = pmem_dax_zero_page_range,
 };
 
 static const struct attribute_group *pmem_attribute_groups[] = {
Index: rhvgoyal-linux/include/linux/dax.h
===================================================================
--- rhvgoyal-linux.orig/include/linux/dax.h	2020-01-23 11:25:23.814139183 -0500
+++ rhvgoyal-linux/include/linux/dax.h	2020-01-23 11:32:17.799139183 -0500
@@ -34,6 +34,8 @@  struct dax_operations {
 	/* copy_to_iter: required operation for fs-dax direct-i/o */
 	size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *, size_t,
 			struct iov_iter *);
+	/* zero_page_range: optional operation for fs-dax direct-i/o */
+	int (*zero_page_range)(struct dax_device *, pgoff_t, unsigned, loff_t);
 };
 
 extern struct attribute_group dax_attribute_group;
@@ -209,6 +211,7 @@  size_t dax_copy_from_iter(struct dax_dev
 		size_t bytes, struct iov_iter *i);
 size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 		size_t bytes, struct iov_iter *i);
+int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, unsigned offset, loff_t len);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
Index: rhvgoyal-linux/fs/dax.c
===================================================================
--- rhvgoyal-linux.orig/fs/dax.c	2020-01-23 11:25:23.814139183 -0500
+++ rhvgoyal-linux/fs/dax.c	2020-01-23 11:32:17.801139183 -0500
@@ -1044,38 +1044,23 @@  static vm_fault_t dax_load_hole(struct x
 	return ret;
 }
 
-static bool dax_range_is_aligned(struct block_device *bdev,
-				 unsigned int offset, unsigned int length)
-{
-	unsigned short sector_size = bdev_logical_block_size(bdev);
-
-	if (!IS_ALIGNED(offset, sector_size))
-		return false;
-	if (!IS_ALIGNED(length, sector_size))
-		return false;
-
-	return true;
-}
-
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int size)
 {
-	if (dax_range_is_aligned(bdev, offset, size)) {
-		sector_t start_sector = sector + (offset >> 9);
+	pgoff_t pgoff;
+	long rc, id;
 
-		return blkdev_issue_zeroout(bdev, start_sector,
-				size >> 9, GFP_NOFS, 0);
-	} else {
-		pgoff_t pgoff;
-		long rc, id;
+	rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
+	if (rc)
+		return rc;
+
+	id = dax_read_lock();
+	rc = dax_zero_page_range(dax_dev, pgoff, offset, size);
+	if (rc == -EOPNOTSUPP) {
 		void *kaddr;
 
-		rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
-		if (rc)
-			return rc;
-
-		id = dax_read_lock();
+		/* If driver does not implement zero page range, fall back */
 		rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
 		if (rc < 0) {
 			dax_read_unlock(id);
@@ -1083,9 +1068,9 @@  int __dax_zero_page_range(struct block_d
 		}
 		memset(kaddr + offset, 0, size);
 		dax_flush(dax_dev, kaddr + offset, size);
-		dax_read_unlock(id);
 	}
-	return 0;
+	dax_read_unlock(id);
+	return rc;
 }
 EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
Index: rhvgoyal-linux/drivers/dax/super.c
===================================================================
--- rhvgoyal-linux.orig/drivers/dax/super.c	2020-01-23 11:25:23.814139183 -0500
+++ rhvgoyal-linux/drivers/dax/super.c	2020-01-23 11:32:17.802139183 -0500
@@ -344,6 +344,19 @@  size_t dax_copy_to_iter(struct dax_devic
 }
 EXPORT_SYMBOL_GPL(dax_copy_to_iter);
 
+int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+			unsigned offset, loff_t len)
+{
+	if (!dax_alive(dax_dev))
+		return 0;
+
+	if (!dax_dev->ops->zero_page_range)
+		return -EOPNOTSUPP;
+
+	return dax_dev->ops->zero_page_range(dax_dev, pgoff, offset, len);
+}
+EXPORT_SYMBOL_GPL(dax_zero_page_range);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)