Message ID | 20200203200029.4592-2-vgoyal@redhat.com (mailing list archive) |
---|---|
State | Superseded, archived |
Delegated to: | Mike Snitzer |
Series | dax, pmem: Provide a dax operation to zero range of memory |
> +	/*
> +	 * There are no users as of now. Once users are there, fix dm code
> +	 * to be able to split a long range across targets.
> +	 */

This comment confused me.  I think this wants to say something like:

	/*
	 * There are no callers that want to zero across a page boundary as of
	 * now.  Once there are users this check can be removed after the
	 * device mapper code has been updated to split ranges across targets.
	 */

> +static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> +				unsigned int offset, size_t len)
> +{
> +	int rc = 0;
> +	phys_addr_t phys_pos = pgoff * PAGE_SIZE + offset;

Any reason not to pass a phys_addr_t in the calling convention for the
method and maybe also for dax_zero_page_range itself?

> +	sector_start = ALIGN(phys_pos, 512)/512;
> +	sector_end = ALIGN_DOWN(phys_pos + bytes, 512)/512;

Missing whitespaces.  Also this could use DIV_ROUND_UP and
DIV_ROUND_DOWN.

> +	if (sector_end > sector_start)
> +		nr_sectors = sector_end - sector_start;
> +
> +	if (nr_sectors &&
> +	    unlikely(is_bad_pmem(&pmem->bb, sector_start,
> +				 nr_sectors * 512)))
> +		bad_pmem = true;

How could nr_sectors be zero?

> +	write_pmem(pmem_addr, page, 0, bytes);
> +	if (unlikely(bad_pmem)) {
> +		/*
> +		 * Pass block aligned offset and length. That seems
> +		 * to work as of now. Other finer grained alignment
> +		 * cases can be addressed later if need be.
> +		 */
> +		rc = pmem_clear_poison(pmem, ALIGN(pmem_off, 512),
> +				       nr_sectors * 512);
> +		write_pmem(pmem_addr, page, 0, bytes);
> +	}

This code largely duplicates the write side of pmem_do_bvec.  I
think it might make sense to split pmem_do_bvec into a read and a write
side as a prep patch, and then reuse the write side here.

> +int generic_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> +				 unsigned int offset, size_t len);

This should probably go into a separate area of the header and have a
comment about being a section for generic helpers for drivers.

-- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, Feb 05, 2020 at 10:30:50AM -0800, Christoph Hellwig wrote:
> > +	/*
> > +	 * There are no users as of now. Once users are there, fix dm code
> > +	 * to be able to split a long range across targets.
> > +	 */
>
> This comment confused me.  I think this wants to say something like:
>
> 	/*
> 	 * There are no callers that want to zero across a page boundary as of
> 	 * now.  Once there are users this check can be removed after the
> 	 * device mapper code has been updated to split ranges across targets.
> 	 */

Yes, that's what I wanted to say but I missed one line. Thanks. Will fix
it.

>
> > +static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> > +				unsigned int offset, size_t len)
> > +{
> > +	int rc = 0;
> > +	phys_addr_t phys_pos = pgoff * PAGE_SIZE + offset;
>
> Any reason not to pass a phys_addr_t in the calling convention for the
> method and maybe also for dax_zero_page_range itself?

I don't have any reason not to pass phys_addr_t. If that sounds better,
will make changes.

>
> > +	sector_start = ALIGN(phys_pos, 512)/512;
> > +	sector_end = ALIGN_DOWN(phys_pos + bytes, 512)/512;
>
> Missing whitespaces.  Also this could use DIV_ROUND_UP and
> DIV_ROUND_DOWN.

Will do.

>
> > +	if (sector_end > sector_start)
> > +		nr_sectors = sector_end - sector_start;
> > +
> > +	if (nr_sectors &&
> > +	    unlikely(is_bad_pmem(&pmem->bb, sector_start,
> > +				 nr_sectors * 512)))
> > +		bad_pmem = true;
>
> How could nr_sectors be zero?

If somebody specifies a range that spans two sectors but does not
completely cover either of them, then nr_sectors will be zero. In fact
this check should probably be nr_sectors > 0, as writes within a sector
will lead to nr_sectors being -1. Am I missing something?

>
> > +	write_pmem(pmem_addr, page, 0, bytes);
> > +	if (unlikely(bad_pmem)) {
> > +		/*
> > +		 * Pass block aligned offset and length. That seems
> > +		 * to work as of now. Other finer grained alignment
> > +		 * cases can be addressed later if need be.
> > +		 */
> > +		rc = pmem_clear_poison(pmem, ALIGN(pmem_off, 512),
> > +				       nr_sectors * 512);
> > +		write_pmem(pmem_addr, page, 0, bytes);
> > +	}
>
> This code largely duplicates the write side of pmem_do_bvec.  I
> think it might make sense to split pmem_do_bvec into a read and a write
> side as a prep patch, and then reuse the write side here.

Ok, I will look into it. How about just adding a helper function for the
write side and using that function both here and in pmem_do_bvec()?

>
> > +int generic_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> > +				 unsigned int offset, size_t len);
>
> This should probably go into a separate area of the header and have a
> comment about being a section for generic helpers for drivers.

ok, will do.

Thanks
Vivek

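The nr_sectors question above can be checked outside the kernel. What follows is a minimal userspace sketch, not the patch's code: `align_up()`, `align_down()` and `full_sectors()` are hypothetical stand-ins for the kernel's `ALIGN()` and `ALIGN_DOWN()` macros and for the open-coded computation in `pmem_dax_zero_page_range()`. It assumes 512-byte sectors, as the patch does.

```c
#include <assert.h>
#include <stdint.h>

#define EX_SECTOR_SIZE UINT64_C(512)

/* Hypothetical userspace stand-ins for the kernel's ALIGN()/ALIGN_DOWN(). */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) / a * a;
}

static uint64_t align_down(uint64_t x, uint64_t a)
{
	return x / a * a;
}

/*
 * Number of 512-byte sectors that lie entirely inside [pos, pos + bytes).
 * The first full sector starts at pos rounded up, the last one ends at
 * pos + bytes rounded down. When the range stays inside one sector the
 * rounded-down end is smaller than the rounded-up start, so the
 * subtraction is guarded and the result is 0 rather than a negative
 * (or, for unsigned types, wrapped) value.
 */
static uint64_t full_sectors(uint64_t pos, uint64_t bytes)
{
	uint64_t start, end;

	assert(pos + bytes >= pos);	/* sketch assumes no wraparound */

	start = align_up(pos, EX_SECTOR_SIZE) / EX_SECTOR_SIZE;
	end = align_down(pos + bytes, EX_SECTOR_SIZE) / EX_SECTOR_SIZE;

	return end > start ? end - start : 0;
}
```

Under these assumptions, a write of 50 bytes at offset 100 gives start sector 1 and end sector 0, and a write of 512 bytes at offset 256 touches two sectors without covering either. Both cases yield zero full sectors when the `end > start` guard is present, as it is in the patch.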
On Wed, Feb 5, 2020 at 12:03 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Wed, Feb 05, 2020 at 10:30:50AM -0800, Christoph Hellwig wrote:
> > > +	/*
> > > +	 * There are no users as of now. Once users are there, fix dm code
> > > +	 * to be able to split a long range across targets.
> > > +	 */
> >
> > This comment confused me.  I think this wants to say something like:
> >
> > 	/*
> > 	 * There are no callers that want to zero across a page boundary as of
> > 	 * now.  Once there are users this check can be removed after the
> > 	 * device mapper code has been updated to split ranges across targets.
> > 	 */
>
> Yes, that's what I wanted to say but I missed one line. Thanks. Will fix
> it.
>
> >
> > > +static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> > > +				unsigned int offset, size_t len)
> > > +{
> > > +	int rc = 0;
> > > +	phys_addr_t phys_pos = pgoff * PAGE_SIZE + offset;
> >
> > Any reason not to pass a phys_addr_t in the calling convention for the
> > method and maybe also for dax_zero_page_range itself?
>
> I don't have any reason not to pass phys_addr_t. If that sounds better,
> will make changes.

The problem is device-mapper. That wants to use offset to route
through the map to the leaf device. If it weren't for the firmware
communication requirement you could do:

    dax_direct_access(...)
    generic_dax_zero_page_range(...)

...but as long as the firmware error clearing path is required I think
we need to pass the pgoff through the interface and do the pgoff to
virt / phys translation inside the ops handler.

On Wed, Feb 05, 2020 at 04:40:44PM -0800, Dan Williams wrote:
> > I don't have any reason not to pass phys_addr_t. If that sounds better,
> > will make changes.
>
> The problem is device-mapper. That wants to use offset to route
> through the map to the leaf device. If it weren't for the firmware
> communication requirement you could do:
>
>     dax_direct_access(...)
>     generic_dax_zero_page_range(...)
>
> ...but as long as the firmware error clearing path is required I think
> we need to pass the pgoff through the interface and do the pgoff to
> virt / phys translation inside the ops handler.

Maybe phys_addr_t was the wrong type - but why do we split the offset
into the block device into a pgoff and an offset into the page instead
of passing a single 64-bit value?

On Wed, Feb 05, 2020 at 04:40:44PM -0800, Dan Williams wrote:
> On Wed, Feb 5, 2020 at 12:03 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Wed, Feb 05, 2020 at 10:30:50AM -0800, Christoph Hellwig wrote:
> > > > +	/*
> > > > +	 * There are no users as of now. Once users are there, fix dm code
> > > > +	 * to be able to split a long range across targets.
> > > > +	 */
> > >
> > > This comment confused me.  I think this wants to say something like:
> > >
> > > 	/*
> > > 	 * There are no callers that want to zero across a page boundary as of
> > > 	 * now.  Once there are users this check can be removed after the
> > > 	 * device mapper code has been updated to split ranges across targets.
> > > 	 */
> >
> > Yes, that's what I wanted to say but I missed one line. Thanks. Will fix
> > it.
> >
> > >
> > > > +static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> > > > +				unsigned int offset, size_t len)
> > > > +{
> > > > +	int rc = 0;
> > > > +	phys_addr_t phys_pos = pgoff * PAGE_SIZE + offset;
> > >
> > > Any reason not to pass a phys_addr_t in the calling convention for the
> > > method and maybe also for dax_zero_page_range itself?
> >
> > I don't have any reason not to pass phys_addr_t. If that sounds better,
> > will make changes.
>
> The problem is device-mapper. That wants to use offset to route
> through the map to the leaf device. If it weren't for the firmware
> communication requirement you could do:
>
>     dax_direct_access(...)
>     generic_dax_zero_page_range(...)
>
> ...but as long as the firmware error clearing path is required I think
> we need to pass the pgoff through the interface and do the pgoff to
> virt / phys translation inside the ops handler.

Hi Dan,

Drivers can easily convert offset into dax device (say phys_addr_t) to
pgoff and offset into page, isn't it?

    pgoff_t = phys_addr_t/PAGE_SIZE;
    offset = phys_addr_t & (PAGE_SIZE - 1);

And pgoff can easily be converted into sectors which dm/md can use for
mapping and come up with pgoff in target device.

Anyway, I am fine with anything.

Thanks
Vivek

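The conversion described in the message above can be sketched in userspace C. This is an illustration only: `struct page_pos`, `split_byte_offset()`, `join_byte_offset()` and `pgoff_to_sector()` are hypothetical names, and the 4096-byte page and 512-byte sector sizes are assumptions (the kernel's `PAGE_SIZE` depends on the architecture and configuration).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for this sketch: 4 KiB pages and 512-byte sectors. */
#define EX_PAGE_SHIFT	12
#define EX_PAGE_SIZE	(UINT64_C(1) << EX_PAGE_SHIFT)
#define EX_SECTOR_SIZE	UINT64_C(512)

/* A device-relative position split into page index and offset in page. */
struct page_pos {
	uint64_t pgoff;		/* page index within the device */
	unsigned int offset;	/* byte offset within that page */
};

/* Split a 64-bit device-relative byte offset into (pgoff, offset). */
static struct page_pos split_byte_offset(uint64_t dev_off)
{
	struct page_pos p = {
		.pgoff = dev_off >> EX_PAGE_SHIFT,
		.offset = (unsigned int)(dev_off & (EX_PAGE_SIZE - 1)),
	};

	return p;
}

/* Inverse of split_byte_offset(). */
static uint64_t join_byte_offset(struct page_pos p)
{
	assert(p.offset < EX_PAGE_SIZE);
	return (p.pgoff << EX_PAGE_SHIFT) + p.offset;
}

/* First 512-byte sector of a page, the unit a block-layer map works in. */
static uint64_t pgoff_to_sector(uint64_t pgoff)
{
	return pgoff * (EX_PAGE_SIZE / EX_SECTOR_SIZE);
}
```

Because the split is lossless, an interface can carry either form. The thread goes on to settle on a single device-relative byte offset, which a driver could split this way where it needs a page index.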
On Wed, Feb 5, 2020 at 11:41 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Feb 05, 2020 at 04:40:44PM -0800, Dan Williams wrote:
> > > I don't have any reason not to pass phys_addr_t. If that sounds better,
> > > will make changes.
> >
> > The problem is device-mapper. That wants to use offset to route
> > through the map to the leaf device. If it weren't for the firmware
> > communication requirement you could do:
> >
> >     dax_direct_access(...)
> >     generic_dax_zero_page_range(...)
> >
> > ...but as long as the firmware error clearing path is required I think
> > we need to pass the pgoff through the interface and do the pgoff to
> > virt / phys translation inside the ops handler.
>
> Maybe phys_addr_t was the wrong type - but why do we split the offset
> into the block device into a pgoff and an offset into the page instead
> of passing a single 64-bit value?

Oh, got it. Yes, that looks odd for sub-page zeroing. Yes, let's just
have one device-relative byte offset.

On Thu, Feb 6, 2020 at 6:35 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Wed, Feb 05, 2020 at 04:40:44PM -0800, Dan Williams wrote:
> > On Wed, Feb 5, 2020 at 12:03 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Wed, Feb 05, 2020 at 10:30:50AM -0800, Christoph Hellwig wrote:
> > > > > +	/*
> > > > > +	 * There are no users as of now. Once users are there, fix dm code
> > > > > +	 * to be able to split a long range across targets.
> > > > > +	 */
> > > >
> > > > This comment confused me.  I think this wants to say something like:
> > > >
> > > > 	/*
> > > > 	 * There are no callers that want to zero across a page boundary as of
> > > > 	 * now.  Once there are users this check can be removed after the
> > > > 	 * device mapper code has been updated to split ranges across targets.
> > > > 	 */
> > >
> > > Yes, that's what I wanted to say but I missed one line. Thanks. Will fix
> > > it.
> > >
> > > >
> > > > > +static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> > > > > +				unsigned int offset, size_t len)
> > > > > +{
> > > > > +	int rc = 0;
> > > > > +	phys_addr_t phys_pos = pgoff * PAGE_SIZE + offset;
> > > >
> > > > Any reason not to pass a phys_addr_t in the calling convention for the
> > > > method and maybe also for dax_zero_page_range itself?
> > >
> > > I don't have any reason not to pass phys_addr_t. If that sounds better,
> > > will make changes.
> >
> > The problem is device-mapper. That wants to use offset to route
> > through the map to the leaf device. If it weren't for the firmware
> > communication requirement you could do:
> >
> >     dax_direct_access(...)
> >     generic_dax_zero_page_range(...)
> >
> > ...but as long as the firmware error clearing path is required I think
> > we need to pass the pgoff through the interface and do the pgoff to
> > virt / phys translation inside the ops handler.
>
> Hi Dan,
>
> Drivers can easily convert offset into dax device (say phys_addr_t) to
> pgoff and offset into page, isn't it?

It's not a phys_addr_t, it's a 64-bit device-relative offset.

On Fri, Feb 07, 2020 at 08:57:39AM -0800, Dan Williams wrote:
> On Wed, Feb 5, 2020 at 11:41 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Wed, Feb 05, 2020 at 04:40:44PM -0800, Dan Williams wrote:
> > > > I don't have any reason not to pass phys_addr_t. If that sounds better,
> > > > will make changes.
> > >
> > > The problem is device-mapper. That wants to use offset to route
> > > through the map to the leaf device. If it weren't for the firmware
> > > communication requirement you could do:
> > >
> > >     dax_direct_access(...)
> > >     generic_dax_zero_page_range(...)
> > >
> > > ...but as long as the firmware error clearing path is required I think
> > > we need to pass the pgoff through the interface and do the pgoff to
> > > virt / phys translation inside the ops handler.
> >
> > Maybe phys_addr_t was the wrong type - but why do we split the offset
> > into the block device into a pgoff and an offset into the page instead
> > of passing a single 64-bit value?
>
> Oh, got it. Yes, that looks odd for sub-page zeroing. Yes, let's just
> have one device-relative byte offset.

So what's the best type to represent this offset: "u64", "phys_addr_t",
"loff_t", or something else? I like phys_addr_t followed by u64.

Vivek

On Fri, Feb 7, 2020 at 9:02 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Feb 07, 2020 at 08:57:39AM -0800, Dan Williams wrote:
> > On Wed, Feb 5, 2020 at 11:41 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Wed, Feb 05, 2020 at 04:40:44PM -0800, Dan Williams wrote:
> > > > > I don't have any reason not to pass phys_addr_t. If that sounds better,
> > > > > will make changes.
> > > >
> > > > The problem is device-mapper. That wants to use offset to route
> > > > through the map to the leaf device. If it weren't for the firmware
> > > > communication requirement you could do:
> > > >
> > > >     dax_direct_access(...)
> > > >     generic_dax_zero_page_range(...)
> > > >
> > > > ...but as long as the firmware error clearing path is required I think
> > > > we need to pass the pgoff through the interface and do the pgoff to
> > > > virt / phys translation inside the ops handler.
> > >
> > > Maybe phys_addr_t was the wrong type - but why do we split the offset
> > > into the block device into a pgoff and an offset into the page instead
> > > of passing a single 64-bit value?
> >
> > Oh, got it. Yes, that looks odd for sub-page zeroing. Yes, let's just
> > have one device-relative byte offset.
>
> So what's the best type to represent this offset: "u64", "phys_addr_t",
> "loff_t", or something else? I like phys_addr_t followed by u64.

Let's make it u64. phys_addr_t has already led to confusion in this
thread because the first question I ask when I read it is "why call
->direct_access() to do the translation when you already have the
physical address?".

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 26a654dbc69a..371744256fe5 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -344,6 +344,26 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 }
 EXPORT_SYMBOL_GPL(dax_copy_to_iter);
 
+int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+			unsigned offset, size_t len)
+{
+	if (!dax_alive(dax_dev))
+		return -ENXIO;
+
+	if (!dax_dev->ops->zero_page_range)
+		return -EOPNOTSUPP;
+
+	/*
+	 * There are no users as of now. Once users are there, fix dm code
+	 * to be able to split a long range across targets.
+	 */
+	if (offset + len > PAGE_SIZE)
+		return -EIO;
+
+	return dax_dev->ops->zero_page_range(dax_dev, pgoff, offset, len);
+}
+EXPORT_SYMBOL_GPL(dax_zero_page_range);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ad8e4df1282b..8739244a72a4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -268,6 +268,55 @@ static const struct block_device_operations pmem_fops = {
 	.revalidate_disk =	nvdimm_revalidate_disk,
 };
 
+static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+				unsigned int offset, size_t len)
+{
+	int rc = 0;
+	phys_addr_t phys_pos = pgoff * PAGE_SIZE + offset;
+	struct pmem_device *pmem = dax_get_private(dax_dev);
+	struct page *page = ZERO_PAGE(0);
+	unsigned bytes, nr_sectors = 0;
+	sector_t sector_start, sector_end;
+	bool bad_pmem = false;
+	phys_addr_t pmem_off = phys_pos + pmem->data_offset;
+	void *pmem_addr = pmem->virt_addr + pmem_off;
+
+	bytes = min_t(size_t, PAGE_SIZE - offset_in_page(phys_pos),
+		      len);
+	/*
+	 * As of now zeroing only with-in a page is supported. This can be
+	 * changed once there are users of zeroing across multiple pages
+	 */
+	if (WARN_ON(len > bytes))
+		return -EIO;
+
+	sector_start = ALIGN(phys_pos, 512)/512;
+	sector_end = ALIGN_DOWN(phys_pos + bytes, 512)/512;
+	if (sector_end > sector_start)
+		nr_sectors = sector_end - sector_start;
+
+	if (nr_sectors &&
+	    unlikely(is_bad_pmem(&pmem->bb, sector_start,
+				 nr_sectors * 512)))
+		bad_pmem = true;
+
+	write_pmem(pmem_addr, page, 0, bytes);
+	if (unlikely(bad_pmem)) {
+		/*
+		 * Pass block aligned offset and length. That seems
+		 * to work as of now. Other finer grained alignment
+		 * cases can be addressed later if need be.
+		 */
+		rc = pmem_clear_poison(pmem, ALIGN(pmem_off, 512),
+				       nr_sectors * 512);
+		write_pmem(pmem_addr, page, 0, bytes);
+	}
+	if (rc > 0)
+		return -EIO;
+
+	return 0;
+}
+
 static long pmem_dax_direct_access(struct dax_device *dax_dev,
 		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
 {
@@ -299,6 +348,7 @@ static const struct dax_operations pmem_dax_ops = {
 	.dax_supported = generic_fsdax_supported,
 	.copy_from_iter = pmem_copy_from_iter,
 	.copy_to_iter = pmem_copy_to_iter,
+	.zero_page_range = pmem_dax_zero_page_range,
 };
 
 static const struct attribute_group *pmem_attribute_groups[] = {
diff --git a/fs/dax.c b/fs/dax.c
index 1f1f0201cad1..35631a4d0295 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1057,6 +1057,21 @@ static bool dax_range_is_aligned(struct block_device *bdev,
 	return true;
 }
 
+int generic_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+				 unsigned int offset, size_t len)
+{
+	long rc;
+	void *kaddr;
+
+	rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
+	if (rc < 0)
+		return rc;
+	memset(kaddr + offset, 0, len);
+	dax_flush(dax_dev, kaddr + offset, len);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(generic_dax_zero_page_range);
+
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int size)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9bd8528bd305..3356b874c55d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -34,6 +34,8 @@ struct dax_operations {
 	/* copy_to_iter: required operation for fs-dax direct-i/o */
 	size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *, size_t,
 			struct iov_iter *);
+	/* zero_page_range: required operation for fs-dax direct-i/o */
+	int (*zero_page_range)(struct dax_device *, pgoff_t, unsigned, size_t);
 };
 
 extern struct attribute_group dax_attribute_group;
@@ -209,6 +211,10 @@ size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 		size_t bytes, struct iov_iter *i);
 size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 		size_t bytes, struct iov_iter *i);
+int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+			unsigned offset, size_t len);
+int generic_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+				 unsigned int offset, size_t len);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
Add a dax operation zero_page_range, to zero a range of memory. This will
also clear any poison in the range being zeroed.

As of now, zeroing of up to one page is allowed in a single call. There
are no callers which are trying to zero more than a page in a single call.
Once we grow the callers which zero more than a page in single call, we
can add that support. Primary reason for not doing that yet is that this
will add little complexity in dm implementation where a range might be
spanning multiple underlying targets and one will have to split the range
into multiple sub ranges and call zero_page_range() on individual targets.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 drivers/dax/super.c   | 20 +++++++++++++++++
 drivers/nvdimm/pmem.c | 50 +++++++++++++++++++++++++++++++++++++++++++
 fs/dax.c              | 15 +++++++++++++
 include/linux/dax.h   |  6 ++++++
 4 files changed, 91 insertions(+)
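
The commit message notes that a longer range would have to be split into sub-ranges before each piece is handed to zero_page_range(). A minimal userspace sketch of the page-boundary half of that splitting follows. It is not part of the patch: `zero_range_by_page()`, `zero_chunk_fn`, `struct chunk_stats` and `count_chunk()` are hypothetical names, the 4096-byte page size is an assumption, and splitting at device-mapper target boundaries is not modelled at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PAGE_SIZE UINT64_C(4096)	/* assumed page size for this sketch */

/* Hypothetical per-chunk callback, shaped like the zero_page_range op. */
typedef int (*zero_chunk_fn)(uint64_t pgoff, unsigned int offset,
			     size_t len, void *ctx);

/*
 * Walk [pos, pos + len) and invoke fn once per page touched, so that every
 * call satisfies offset + len <= EX_PAGE_SIZE. Stops at the first non-zero
 * return value and passes it back to the caller.
 */
static int zero_range_by_page(uint64_t pos, uint64_t len,
			      zero_chunk_fn fn, void *ctx)
{
	while (len > 0) {
		unsigned int offset = (unsigned int)(pos % EX_PAGE_SIZE);
		uint64_t chunk = EX_PAGE_SIZE - offset;
		int rc;

		if (chunk > len)
			chunk = len;
		assert(offset + chunk <= EX_PAGE_SIZE);

		rc = fn(pos / EX_PAGE_SIZE, offset, (size_t)chunk, ctx);
		if (rc)
			return rc;

		pos += chunk;
		len -= chunk;
	}
	return 0;
}

/* Example callback: records how many chunks and bytes were requested. */
struct chunk_stats {
	unsigned int calls;
	uint64_t bytes;
};

static int count_chunk(uint64_t pgoff, unsigned int offset,
		       size_t len, void *ctx)
{
	struct chunk_stats *stats = ctx;

	(void)pgoff;
	(void)offset;
	stats->calls++;
	stats->bytes += len;
	return 0;
}
```

With these assumptions, a 5000-byte range starting at byte 4000 is delivered as three chunks of 96, 4096 and 808 bytes. A device-mapper implementation would additionally need to cut chunks wherever the range crosses from one underlying target to the next.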