| Message ID | 1441047584-14664-1-git-send-email-ross.zwisler@linux.intel.com (mailing list archive) |
|---|---|
| State | New, archived |
On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote: > For DAX msync we just need to flush the given range using > wb_cache_pmem(), which is now a public part of the PMEM API. > > The inclusion of <linux/dax.h> in fs/dax.c was done to make checkpatch > happy. Previously it was complaining about a bunch of undeclared > functions that could be made static. Should this be abstracted by adding a ->msync method? Maybe not worth doing for now, but it might be worth keeping in mind. Otherwise this looks fine to me.
On Mon, Aug 31, 2015 at 09:06:19PM +0200, Christoph Hellwig wrote: > On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote: > > For DAX msync we just need to flush the given range using > > wb_cache_pmem(), which is now a public part of the PMEM API. > > > > The inclusion of <linux/dax.h> in fs/dax.c was done to make checkpatch > > happy. Previously it was complaining about a bunch of undeclared > > functions that could be made static. > > Should this be abstracted by adding a ->msync method? Maybe not > worth doing for now, but it might be worth keeping in mind. Where would we add the ->msync method? Do you mean to the PMEM API, or somewhere else?
On Mon, Aug 31, 2015 at 01:26:40PM -0600, Ross Zwisler wrote: > > Should this be abstracted by adding a ->msync method? Maybe not > > worth doing for now, but it might be worth keeping in mind. > > Where would we add the ->msync method? Do you mean to the PMEM API, or > somewhere else? vm_operations_struct would be the most logical choice, with filesystems having a dax-specific instance of that.
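[Illustration only: a rough sketch of what Christoph's suggestion could look like. vm_operations_struct has no .msync hook today; the dax_vm_msync() helper and the example_dax_* handler names are made up for this sketch.]

```c
/* Hypothetical: as if vm_operations_struct grew an .msync hook. */
static int dax_vm_msync(struct vm_area_struct *vma, unsigned long start,
			unsigned long end)
{
	/* Write the CPU cache back for the dirty user range. */
	dax_sync_range(start, end - start);
	return 0;
}

/*
 * A filesystem would then install a DAX-specific instance from its
 * ->mmap method; the fault handlers below are placeholders, not
 * existing functions.
 */
static const struct vm_operations_struct example_dax_vm_ops = {
	.fault		= example_dax_fault,
	.page_mkwrite	= example_dax_mkwrite,
	.msync		= dax_vm_msync,		/* hypothetical member */
};
```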
On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote: > For DAX msync we just need to flush the given range using > wb_cache_pmem(), which is now a public part of the PMEM API. This is wrong, because it still leaves fsync() broken on dax. Flushing dirty data to stable storage is the responsibility of the writeback infrastructure, not the VMA/mm infrastructure. For non-dax configurations, msync defers all that to vfs_fsync_range(), because it has to be implemented there for fsync() to work. Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit the backing store allocations to stable storage, so there's no getting around the fact that msync is the wrong place to be flushing DAX mappings to persistent storage. I pointed this out almost 6 months ago (i.e. that fsync was broken) and hinted at how to solve it. Fix fsync, and msync gets fixed for free: https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html I've also reported to Willy that DAX write page faults don't work correctly, either. xfstests generic/080 exposes this: a read from a page followed immediately by a write to that page does not result in ->page_mkwrite being called on the write and so backing store is not allocated for the page, nor are the timestamps for the file updated. This will also result in fsync (and msync) not working properly. Fix fsync first - if fsync works then msync will work. Cheers, Dave.
On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote: > On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote: > > For DAX msync we just need to flush the given range using > > wb_cache_pmem(), which is now a public part of the PMEM API. > > This is wrong, because it still leaves fsync() broken on dax. > > Flushing dirty data to stable storage is the responsibility of the > writeback infrastructure, not the VMA/mm infrastructure. For non-dax > configurations, msync defers all that to vfs_fsync_range(), because > it has to be implemented there for fsync() to work. > > Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit > the backing store allocations to stable storage, so there's no > getting around the fact that msync is the wrong place to be flushing > DAX mappings to persistent storage. DAX does call ->fsync before and after this patch. And with all the recent fixes we take care to ensure data is written through the cache for everything but mmap-access. With this patch from Ross we ensure msync writes back the cache before calling ->fsync so that the filesystem can then do its work, like converting unwritten extents. The only downside is that previously on Linux you could always use fsync as a replacement for msync, which isn't true anymore for DAX. But given that clwb() needs the user virtual address to flush the cache, I can't see how to do this differently.
On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote: > On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote: > > For DAX msync we just need to flush the given range using > > wb_cache_pmem(), which is now a public part of the PMEM API. > > This is wrong, because it still leaves fsync() broken on dax. > > Flushing dirty data to stable storage is the responsibility of the > writeback infrastructure, not the VMA/mm infrastructure. Writeback infrastructure is non-existent for DAX. Without struct page we don't have anything to transfer pte_dirty() to. And I'm not sure we need to invent some. For DAX, flushing in place can be cheaper than dirty tracking beyond page tables. > For non-dax configurations, msync defers all that to vfs_fsync_range(), > because it has to be implemented there for fsync() to work. Not necessarily. I think fsync() for DAX can be implemented with rmap over all the file's VMAs, msync()ing them and committing metadata afterwards. But we also need to commit to persistent storage on zap_page_range() to make it work. > Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit > the backing store allocations to stable storage, so there's no > getting around the fact that msync is the wrong place to be flushing > DAX mappings to persistent storage. Why? IIUC, msync() doesn't have any requirements wrt metadata, right? > I pointed this out almost 6 months ago (i.e. that fsync was broken) > and hinted at how to solve it. Fix fsync, and msync gets fixed for > free: > > https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html > > I've also reported to Willy that DAX write page faults don't work > correctly, either. xfstests generic/080 exposes this: a read > from a page followed immediately by a write to that page does not > result in ->page_mkwrite being called on the write and so > backing store is not allocated for the page, nor are the timestamps > for the file updated. This will also result in fsync (and msync) > not working properly. Is that because XFS doesn't provide vm_ops->pfn_mkwrite?
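[A minimal sketch of the rmap idea Kirill describes, not existing kernel code: the helper name is invented, locking is simplified, and a real version would clamp to the requested byte range instead of flushing whole VMAs.]

```c
/*
 * Sketch: write back the CPU cache for every VMA that maps this file,
 * which is roughly what "rmap over all the file's VMAs and msync()
 * them" means.  Invented helper; a real implementation would clamp to
 * the fsync range and cope with private mappings.
 */
static void dax_writeback_mappings(struct address_space *mapping)
{
	struct vm_area_struct *vma;
	pgoff_t first = 0, last = ULONG_MAX;

	i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, first, last)
		dax_sync_range(vma->vm_start, vma->vm_end - vma->vm_start);
	i_mmap_unlock_read(mapping);
}
```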
On 09/01/2015 01:08 PM, Kirill A. Shutemov wrote: <> > > Is that because XFS doesn't provide vm_ops->pfn_mkwrite? > Right, that would explain it, because I sent that patch exactly to solve this problem. I haven't looked at the latest code for a while, but I should check out the latest and make a patch for xfs if it is indeed missing. Thanks Boaz
On 09/01/2015 10:06 AM, Christoph Hellwig wrote: > On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote: >> On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote: >>> For DAX msync we just need to flush the given range using >>> wb_cache_pmem(), which is now a public part of the PMEM API. >> >> This is wrong, because it still leaves fsync() broken on dax. >> >> Flushing dirty data to stable storage is the responsibility of the >> writeback infrastructure, not the VMA/mm infrasrtucture. For non-dax >> configurations, msync defers all that to vfs_fsync_range(), because >> it has to be implemented there for fsync() to work. >> >> Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit >> the backing store allocations to stable storage, so there's not >> getting around the fact msync is the wrong place to be flushing >> DAX mappings to persistent storage. > > DAX does call ->fsync before and after this patch. And with all > the recent fixes we take care to ensure data is written though the > cache for everything but mmap-access. With this patch from Ross > we ensure msync writes back the cache before calling ->fsync so that > the filesystem can then do it's work like converting unwritten extents. > > The only downside is that previously on Linux you could always use > fsync as a replaement for msymc, which isn't true anymore for DAX. > Hi Christoph So the approach we took was a bit different to exactly solve these problem, and to also not over flush too much. here is what we did. * At vm_operations_struct we also override the .close vector (say call it dax_vm_close) * At dax_vm_close() on writable files call ->fsync(,vma->vm_start, vma->vm_end,) (We have an inode flag if the file was actually dirtied, but even if not, that will not be that bad, so a file was opened for write, mmapped, but actually never modified. Not a lot of these, and the do nothing cl_flushing is very fast) * At ->fsync() do the actual cl_flush for all cases but only iff if (mapping_mapped(inode->i_mapping) == 0) return 0; This is because data written not through mmap is already persistent and we do not need the cl_flushing Apps expect all these to work: 1. open mmap m-write msync ... close 2. open mmap m-write fsync ... close 3. open mmap m-write unmap ... fsync close 4. open mmap m-write sync ... The first 3 are supported with above, because what happens is that at [3] the fsync actually happens on unmap and fsync is redundant in that case. The only broken scenario is [3]. We do not have a list of "dax-dirty" inodes per sb to iterate on and call inode-sync on. This cause problems mostly in freeze because with actual [3] scenario the file will be eventually closed and persistent, but after the call to sync returns. Its on my TODO to fix [3] based on instructions from Dave. The mmap call will put the inode on the list and the dax_vm_close will remove it. One of the regular dirty list should be used as suggested by Dave. > But given that we need the virtual address to write back the cache > I can't see how to do this differently given that clwb() needs the > user virtual address to flush the cache. On Intel or any systems that have physical-based caching this is not a problem you just iterate on all get_block() of the range and flush the Kernel's virt_addr of the block, this is easy. With ARCHs with per VM caching you need to go through the i_mapping VMAs list and flush like that. I guess there is a way to schedule yourself as a process VMA somehow. 
I'm not sure how to solve this split; perhaps two generic functions that are selected through the ARCH. Just my $0.017 Boaz
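[A rough sketch of the flush-on-unmap scheme Boaz describes above (overriding .close in the DAX vm_operations_struct); the function name and range calculation are illustrative, this is not the code that was actually shipped.]

```c
/*
 * Sketch only: override .close in the DAX vm_operations_struct so that
 * a writable shared mapping gets synced before the VMA disappears.
 * After this runs, an fsync that comes after munmap has nothing left
 * to flush for this range.
 */
static void dax_vm_close(struct vm_area_struct *vma)
{
	struct file *file = vma->vm_file;
	loff_t start = (loff_t)vma->vm_pgoff << PAGE_SHIFT;
	loff_t end = start + (vma->vm_end - vma->vm_start) - 1;

	if (file && (vma->vm_flags & VM_SHARED))
		vfs_fsync_range(file, start, end, 1 /* datasync */);
}
```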
On 08/31/2015 09:59 PM, Ross Zwisler wrote: > For DAX msync we just need to flush the given range using > wb_cache_pmem(), which is now a public part of the PMEM API. > > The inclusion of <linux/dax.h> in fs/dax.c was done to make checkpatch > happy. Previously it was complaining about a bunch of undeclared > functions that could be made static. > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > --- > This patch is based on libnvdimm-for-next from our NVDIMM tree: > > https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git/ > > with some DAX patches on top. The baseline tree can be found here: > > https://github.com/01org/prd/tree/dax_msync > --- > arch/x86/include/asm/pmem.h | 13 +++++++------ > fs/dax.c | 17 +++++++++++++++++ > include/linux/dax.h | 1 + > include/linux/pmem.h | 22 +++++++++++++++++++++- > mm/msync.c | 10 +++++++++- > 5 files changed, 55 insertions(+), 8 deletions(-) > > diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h > index d8ce3ec..85c07b2 100644 > --- a/arch/x86/include/asm/pmem.h > +++ b/arch/x86/include/asm/pmem.h > @@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void) > } > > /** > - * __arch_wb_cache_pmem - write back a cache range with CLWB > - * @vaddr: virtual start address > + * arch_wb_cache_pmem - write back a cache range with CLWB > + * @addr: virtual start address > * @size: number of bytes to write back > * > * Write back a cache range using the CLWB (cache line write back) > * instruction. This function requires explicit ordering with an > - * arch_wmb_pmem() call. This API is internal to the x86 PMEM implementation. > + * arch_wmb_pmem() call. > */ > -static inline void __arch_wb_cache_pmem(void *vaddr, size_t size) > +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) > { > u16 x86_clflush_size = boot_cpu_data.x86_clflush_size; > unsigned long clflush_mask = x86_clflush_size - 1; > + void *vaddr = (void __force *)addr; > void *vend = vaddr + size; > void *p; > > @@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes, > len = copy_from_iter_nocache(vaddr, bytes, i); > > if (__iter_needs_pmem_wb(i)) > - __arch_wb_cache_pmem(vaddr, bytes); > + arch_wb_cache_pmem(addr, bytes); > > return len; > } > @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) > else > memset(vaddr, 0, size); > > - __arch_wb_cache_pmem(vaddr, size); > + arch_wb_cache_pmem(addr, size); > } > > static inline bool __arch_has_wmb_pmem(void) > diff --git a/fs/dax.c b/fs/dax.c > index fbe18b8..ed6aec1 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -17,6 +17,7 @@ > #include <linux/atomic.h> > #include <linux/blkdev.h> > #include <linux/buffer_head.h> > +#include <linux/dax.h> > #include <linux/fs.h> > #include <linux/genhd.h> > #include <linux/highmem.h> > @@ -25,6 +26,7 @@ > #include <linux/mutex.h> > #include <linux/pmem.h> > #include <linux/sched.h> > +#include <linux/sizes.h> > #include <linux/uio.h> > #include <linux/vmstat.h> > > @@ -753,3 +755,18 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) > return dax_zero_page_range(inode, from, length, get_block); > } > EXPORT_SYMBOL_GPL(dax_truncate_page); > + > +void dax_sync_range(unsigned long addr, size_t len) > +{ > + while (len) { > + size_t chunk_len = min_t(size_t, SZ_1G, len); > + Where does the SZ_1G come from is it because you want to do cond_resched() every 1G bytes so not to get stuck for a long time? 
It took me a while to catch, At first I thought it might be do to wb_cache_pmem() limitations. Would you put a comment in the next iteration? > + wb_cache_pmem((void __pmem *)addr, chunk_len); > + len -= chunk_len; > + addr += chunk_len; > + if (len) > + cond_resched(); > + } > + wmb_pmem(); > +} > +EXPORT_SYMBOL_GPL(dax_sync_range); > diff --git a/include/linux/dax.h b/include/linux/dax.h > index b415e52..504b33f 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -14,6 +14,7 @@ int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t, > dax_iodone_t); > int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t, > dax_iodone_t); > +void dax_sync_range(unsigned long addr, size_t len); > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *, > unsigned int flags, get_block_t, dax_iodone_t); > diff --git a/include/linux/pmem.h b/include/linux/pmem.h > index 85f810b3..aa29ebb 100644 > --- a/include/linux/pmem.h > +++ b/include/linux/pmem.h > @@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) > { > BUG(); See below > } > + > +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) > +{ > + BUG(); There is a clflush_cache_range() defined for generic use. On ADR systems (even without pcommit) this works perfectly and is persistent. why not use that in the generic case? Also all the above and below can be implements via this one. One usage of pmem is overlooked by all this API. The use of DRAM as pmem, across a VM or cross reboot. you have a piece of memory exposed as pmem to the subsytem which survives past the boot of that system. The CPU cache still needs flushing in this case. (People are already using this for logs and crash dumps) So all arches including all x86 variants can have a working generic implementation that will work, based on clflush_cache_range() I'll work on this ASAP and send patches ... > +} > #endif > > /* > * Architectures that define ARCH_HAS_PMEM_API must provide > * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(), > - * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem(). > + * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem() > + * and arch_has_wmb_pmem(). > */ > static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size) > { > @@ -202,4 +208,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size) > else > default_clear_pmem(addr, size); > } > + > +/** > + * wb_cache_pmem - write back a range of cache lines > + * @vaddr: virtual start address > + * @size: number of bytes to write back > + * > + * Write back the cache lines starting at 'vaddr' for 'size' bytes. > + * This function requires explicit ordering with an wmb_pmem() call. > + */ > +static inline void wb_cache_pmem(void __pmem *addr, size_t size) > +{ > + if (arch_has_pmem_api()) > + arch_wb_cache_pmem(addr, size); > +} > #endif /* __PMEM_H__ */ > diff --git a/mm/msync.c b/mm/msync.c > index bb04d53..2a4739c 100644 > --- a/mm/msync.c > +++ b/mm/msync.c > @@ -7,6 +7,7 @@ > /* > * The msync() system call. > */ > +#include <linux/dax.h> > #include <linux/fs.h> > #include <linux/mm.h> > #include <linux/mman.h> > @@ -59,6 +60,7 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags) > for (;;) { > struct file *file; > loff_t fstart, fend; > + unsigned long range_len; > > /* Still start < end. 
*/ > error = -ENOMEM; > @@ -77,10 +79,16 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags) > error = -EBUSY; > goto out_unlock; > } > + > + range_len = min(end, vma->vm_end) - start; > + > + if (vma_is_dax(vma)) > + dax_sync_range(start, range_len); > + Ye no I hate this. (No need to touch mm) All we need to do is define a dax_fsync() dax FS registers a special dax vector for ->fsync() that vector needs to call dax_fsync(); first as part of its fsync operation. Then dax_fsync() is: dax_fsync() { /* we always write with sync so only fsync if the file mmap'ed */ if (mapping_mapped(inode->i_mapping) == 0) return 0; dax_sync_range(start, range_len); } Thanks Boaz > file = vma->vm_file; > fstart = (start - vma->vm_start) + > ((loff_t)vma->vm_pgoff << PAGE_SHIFT); > - fend = fstart + (min(end, vma->vm_end) - start) - 1; > + fend = fstart + range_len - 1; > start = vma->vm_end; > if ((flags & MS_SYNC) && file && > (vma->vm_flags & VM_SHARED)) { > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
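[Boaz's dax_fsync() pseudo-code above, written out as a compilable sketch; the parameters and the dax_writeback_mappings() helper (from the earlier sketch) are illustrative stand-ins, not an existing interface.]

```c
/*
 * Sketch of the dax_fsync() idea above.  A DAX filesystem's ->fsync
 * would call this first: data written through read()/write() already
 * went through the cache, so only mmapped files need the CPU cache
 * written back.  Mapping file offsets to user addresses is glossed
 * over here.
 */
static int dax_fsync(struct inode *inode, loff_t start, loff_t end)
{
	/* We always write with sync, so only flush if the file is mmapped. */
	if (!mapping_mapped(inode->i_mapping))
		return 0;

	dax_writeback_mappings(inode->i_mapping);	/* see earlier sketch */
	return 0;
}
```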
On Tue, Sep 01, 2015 at 09:06:08AM +0200, Christoph Hellwig wrote: > On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote: > > On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote: > > > For DAX msync we just need to flush the given range using > > > wb_cache_pmem(), which is now a public part of the PMEM API. > > > > This is wrong, because it still leaves fsync() broken on dax. > > > > Flushing dirty data to stable storage is the responsibility of the > > writeback infrastructure, not the VMA/mm infrasrtucture. For non-dax > > configurations, msync defers all that to vfs_fsync_range(), because > > it has to be implemented there for fsync() to work. > > > > Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit > > the backing store allocations to stable storage, so there's not > > getting around the fact msync is the wrong place to be flushing > > DAX mappings to persistent storage. > > DAX does call ->fsync before and after this patch. And with all > the recent fixes we take care to ensure data is written though the > cache for everything but mmap-access. With this patch from Ross > we ensure msync writes back the cache before calling ->fsync so that > the filesystem can then do it's work like converting unwritten extents. > > The only downside is that previously on Linux you could always use > fsync as a replaement for msymc, which isn't true anymore for DAX. Which means applications that should "just work" without modification on DAX are now subtly broken and don't actually guarantee data is safe after a crash. That's a pretty nasty landmine, and goes against *everything* we've claimed about using DAX with existing applications. That's wrong, and needs fixing. Cheers, Dave.
On Tue, Sep 01, 2015 at 01:08:04PM +0300, Kirill A. Shutemov wrote: > On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote: > > On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote: > > Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit > > the backing store allocations to stable storage, so there's no > > getting around the fact that msync is the wrong place to be flushing > > DAX mappings to persistent storage. > > Why? > IIUC, msync() doesn't have any requirements wrt metadata, right? Of course it does. If the backing store allocation has not been committed, then after a crash there will be a hole in the file and so it will read as zeroes regardless of what data was written and flushed. > > I pointed this out almost 6 months ago (i.e. that fsync was broken) > > and hinted at how to solve it. Fix fsync, and msync gets fixed for > > free: > > > > https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html > > > > I've also reported to Willy that DAX write page faults don't work > > correctly, either. xfstests generic/080 exposes this: a read > > from a page followed immediately by a write to that page does not > > result in ->page_mkwrite being called on the write and so > > backing store is not allocated for the page, nor are the timestamps > > for the file updated. This will also result in fsync (and msync) > > not working properly. > > Is that because XFS doesn't provide vm_ops->pfn_mkwrite? I didn't know that had been committed. I don't recall seeing a pull request with that in it; none of the XFS DAX patches conflicted against it, and there have been no runtime errors. I'll fix it up. As such, shouldn't there be a check in the VM (in ->mmap callers) that if the vma is returned with VM_MIXEDMAP enabled, ->pfn_mkwrite is not NULL? It's now clear to me that any filesystem that sets VM_MIXEDMAP needs to support both page_mkwrite and pfn_mkwrite, and such a check would have caught this immediately... Cheers, Dave.
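[A sketch of the sanity check Dave is asking for; where exactly it would be called from (mmap_region() or another ->mmap caller) is an assumption, and the helper is not existing kernel code.]

```c
/*
 * Sketch: after ->mmap has set up the VMA, warn about (or refuse) a
 * writable shared VM_MIXEDMAP mapping whose vm_ops lack the mkwrite
 * handlers needed for write faults to allocate backing store.
 */
static int vma_check_mkwrite(struct vm_area_struct *vma)
{
	const vm_flags_t need = VM_MIXEDMAP | VM_SHARED | VM_WRITE;

	if ((vma->vm_flags & need) == need && vma->vm_ops &&
	    (!vma->vm_ops->page_mkwrite || !vma->vm_ops->pfn_mkwrite)) {
		WARN_ON_ONCE(1);
		return -EINVAL;
	}
	return 0;
}
```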
On Wed, Sep 02, 2015 at 08:21:20AM +1000, Dave Chinner wrote: > Which means applications that should "just work" without > modification on DAX are now subtly broken and don't actually > guarantee data is safe after a crash. That's a pretty nasty > landmine, and goes against *everything* we've claimed about using > DAX with existing applications. > > That's wrong, and needs fixing. I agree that we need to fix fsync as well, and that the fsync solution could be used to implement msync if we choose to go that route. I think we might want to consider keeping the msync and fsync implementations separate, though, for two reasons. 1) The current msync implementation is much more efficient than what will be needed for fsync. Fsync will need to call into the filesystem, traverse all the blocks, get kernel virtual addresses from those and then call wb_cache_pmem() on those kernel addresses. I think this is a necessary evil for fsync since you don't have a VMA, but for msync we do and we can just flush using the user addresses without any fs lookups. 2) I believe that the near-term fsync code will rely on struct pages for PMEM, which I believe are possible but optional as of Dan's last patch set: https://lkml.org/lkml/2015/8/25/841 I believe that this means that if we don't have struct pages for PMEM (because ZONE_DEVICE et al. are turned off) fsync won't work. It'd be nice not to lose msync as well.
On Tue, Sep 01, 2015 at 09:19:45PM -0600, Ross Zwisler wrote: > On Wed, Sep 02, 2015 at 08:21:20AM +1000, Dave Chinner wrote: > > Which means applications that should "just work" without > > modification on DAX are now subtly broken and don't actually > > guarantee data is safe after a crash. That's a pretty nasty > > landmine, and goes against *everything* we've claimed about using > > DAX with existing applications. > > > > That's wrong, and needs fixing. > > I agree that we need to fix fsync as well, and that the fsync solution could > be used to implement msync if we choose to go that route. I think we might > want to consider keeping the msync and fsync implementations separate, though, > for two reasons. > > 1) The current msync implementation is much more efficient than what will be > needed for fsync. Fsync will need to call into the filesystem, traverse all > the blocks, get kernel virtual addresses from those and then call > wb_cache_pmem() on those kernel addresses. I think this is a necessary evil > for fsync since you don't have a VMA, but for msync we do and we can just > flush using the user addresses without any fs lookups. Yet you're ignoring the fact that flushing the entire range of the relevant VMAs may not be very efficient. It may be a very large mapping with only a few pages that need flushing from the cache, but you still iterate the mappings flushing GB ranges from the cache at a time. We don't need struct pages to track page dirty state. We already have a method for doing this that does not rely on having a struct page and can be used for efficient lookup of exact dirty ranges. i.e the per-page dirty tag that is kept in the mapping radix tree. It's fine grained, and extremely efficient in terms of lookups to find dirty pages. Indeed, the mapping tree tags were specifically designed to avoid this "fsync doesn't know what range to flush" problem for normal files. That same problem still exists here for msync - these patches are just hitting it with a goddamn massive hammer "because it is easy" rather than attempting to do the flushing efficiently. > 2) I believe that the near-term fsync code will rely on struct pages for > PMEM, which I believe are possible but optional as of Dan's last patch set: > > https://lkml.org/lkml/2015/8/25/841 > > I believe that this means that if we don't have struct pages for PMEM (becuase > ZONE_DEVICE et al. are turned off) fsync won't work. I'd be nice not to lose > msync as well. I don't think this is an either-or situation. If we use struct pages for PMEM, then fsync will work without modification as DAX will need to use struct pages and hence we can insert them in the mapping radix tree directly and use the normal set/clear_page_dirty() calls to track dirty state. It will give us fine grained flush capability and we won't want msync() to be using the big hammer if we can avoid it. If we make the existing pfn-based DAX code track dirty pages via mapping radix tree tags right now, then we allow fsync to work by reusing most of the infrastructure we already have. That means DAX and fsync will work exactly the same regardless of how we index/reference PMEM in future and we won't have to come back and fix it all up again. Cheers, Dave.
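[A sketch of the radix-tree-tag idea Dave describes, assuming an entry for each faulted index was inserted into the mapping radix tree at fault time; what exactly is stored per index and how it maps back to a pmem address are deliberately glossed over, and the helper names are invented.]

```c
/*
 * Sketch of pfn-based dirty tracking via mapping radix tree tags.
 * An entry for the index must already exist in the tree (inserted at
 * fault time) for radix_tree_tag_set() to have something to tag.
 */

/* Called from the mkwrite path when a mapping is made writeable. */
static void dax_mark_dirty(struct address_space *mapping, pgoff_t index)
{
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);
}

/* Called from ->fsync/->msync: write back only what is tagged dirty. */
static void dax_writeback_dirty(struct address_space *mapping,
				pgoff_t start, pgoff_t end)
{
	struct radix_tree_iter iter;
	void **slot;

	rcu_read_lock();
	radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, start,
				   PAGECACHE_TAG_DIRTY) {
		if (iter.index > end)
			break;
		/*
		 * Look up the kernel address backing this index and
		 * wb_cache_pmem() one page worth of it (omitted), then
		 * clear the dirty tag under tree_lock.
		 */
	}
	rcu_read_unlock();
}
```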
On 09/02/2015 06:19 AM, Ross Zwisler wrote: > On Wed, Sep 02, 2015 at 08:21:20AM +1000, Dave Chinner wrote: >> Which means applications that should "just work" without >> modification on DAX are now subtly broken and don't actually >> guarantee data is safe after a crash. That's a pretty nasty >> landmine, and goes against *everything* we've claimed about using >> DAX with existing applications. >> >> That's wrong, and needs fixing. > > I agree that we need to fix fsync as well, and that the fsync solution could > be used to implement msync if we choose to go that route. I think we might > want to consider keeping the msync and fsync implementations separate, though, > for two reasons. > > 1) The current msync implementation is much more efficient than what will be > needed for fsync. Fsync will need to call into the filesystem, traverse all > the blocks, get kernel virtual addresses from those and then call > wb_cache_pmem() on those kernel addresses. I was thinking about this some more, and no this is not what we need to do because of the virtual-based-cache ARCHs. And what we do for these systems will also work for physical-based-cache ARCHs. What we need to do, is dig into the mapping structure and pic up the current VMA on the call to fsync. Then just flush that one on that virtual address, (since it is current at the context of the fsync sys call) And of course we need to do like I wrote, we must call fsync on vm_operations->close before the VMA mappings goes away. Then an fsync after unmap is a no-op. > I think this is a necessary evil > for fsync since you don't have a VMA, but for msync we do and we can just > flush using the user addresses without any fs lookups. > right see above > 2) I believe that the near-term fsync code will rely on struct pages for > PMEM, which I believe are possible but optional as of Dan's last patch set: > > https://lkml.org/lkml/2015/8/25/841 > > I believe that this means that if we don't have struct pages for PMEM (becuase > ZONE_DEVICE et al. are turned off) fsync won't work. I'd be nice not to lose > msync as well. Please see above it can be made to work. Actually what we do is the traversal-kernel-ptr thing, and the fsync-on-unmap. And it works we have heavy persistence testing and it is all very good. So no, without pages it can all work very-well. There is only the sync problem that I intend to fix soon, is only a matter of keeping a dax-dirty inode-list per sb. So no this is not an excuse. Cheers Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/02/2015 08:17 AM, Dave Chinner wrote: > On Tue, Sep 01, 2015 at 09:19:45PM -0600, Ross Zwisler wrote: >> On Wed, Sep 02, 2015 at 08:21:20AM +1000, Dave Chinner wrote: >>> Which means applications that should "just work" without >>> modification on DAX are now subtly broken and don't actually >>> guarantee data is safe after a crash. That's a pretty nasty >>> landmine, and goes against *everything* we've claimed about using >>> DAX with existing applications. >>> >>> That's wrong, and needs fixing. >> >> I agree that we need to fix fsync as well, and that the fsync solution could >> be used to implement msync if we choose to go that route. I think we might >> want to consider keeping the msync and fsync implementations separate, though, >> for two reasons. >> >> 1) The current msync implementation is much more efficient than what will be >> needed for fsync. Fsync will need to call into the filesystem, traverse all >> the blocks, get kernel virtual addresses from those and then call >> wb_cache_pmem() on those kernel addresses. I think this is a necessary evil >> for fsync since you don't have a VMA, but for msync we do and we can just >> flush using the user addresses without any fs lookups. > > Yet you're ignoring the fact that flushing the entire range of the > relevant VMAs may not be very efficient. It may be a very > large mapping with only a few pages that need flushing from the > cache, but you still iterate the mappings flushing GB ranges from > the cache at a time. > So actually you are wrong about this. We have a working system and as part of our testing rig we do performance measurements, constantly. Our random mmap 4k writes test performs very well and is on par with the random-direct-write implementation even though on every unmap, we do a VMA->start/end cl_flushing. The cl_flush operation is a no-op if the cacheline is not dirty and is a memory bus storm with all the CLs that are dirty. So the only cost is the iteration from vma->start to vma->end in i += 64 steps. Let us please agree that we should do the correct thing for now, and let the complaints roll in about the slowness later. You will find that my proposed solution is not so slow. > We don't need struct pages to track page dirty state. We already > have a method for doing this that does not rely on having a struct > page and can be used for efficient lookup of exact dirty ranges. i.e. > the per-page dirty tag that is kept in the mapping radix tree. It's > fine grained, and extremely efficient in terms of lookups to find > dirty pages. > In fact you will find that this solution is actually slower. Because you need an extra lock on every major page fault and you need to keep the radix tree populated. Today with dax the radix tree is completely empty; it is only ever used if one reads in holes. But we found that this is not common at all. Usually mmap applications read what is really there. So the extra work per page will be more than actually doing the fast no-op cl_flush. > Indeed, the mapping tree tags were specifically designed to avoid > this "fsync doesn't know what range to flush" problem for normal > files. That same problem still exists here for msync - these patches > are just hitting it with a goddamn massive hammer "because it is > easy" rather than attempting to do the flushing efficiently. > You come from the disk world, where every extra synced block is a huge waste. This is memory; it is not what you are used to. Again we have benchmarks and mmap works very very well,
including that contraption of cl_flushing vma->start..vma->end. I'll try and open up some time to send a rough draft. My boss will kill me, but he'll survive. Thanks Boaz
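[For reference, the per-VMA "cl_flush vma->start..vma->end" loop Boaz keeps describing is essentially the following; the 64-byte line size and clflushopt() availability are x86 assumptions, and the loop must run while the user mapping is current, as noted above.]

```c
/*
 * Sketch of the per-VMA cache flush loop.  On x86 this is essentially
 * what clflush_cache_range() does; the flush is cheap when the line is
 * not dirty, which is the argument being made above.
 */
static void flush_user_range(unsigned long start, unsigned long end)
{
	unsigned long p;

	for (p = start & ~63UL; p < end; p += 64)
		clflushopt((void *)p);	/* assumes CLFLUSHOPT support */
	wmb();				/* order the flushes */
}
```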
On 09/02/2015 03:27 AM, Boaz Harrosh wrote: >> > Yet you're ignoring the fact that flushing the entire range of the >> > relevant VMAs may not be very efficient. It may be a very >> > large mapping with only a few pages that need flushing from the >> > cache, but you still iterate the mappings flushing GB ranges from >> > the cache at a time. >> > > So actually you are wrong about this. We have a working system and as part > of our testing rig we do performance measurements, constantly. Our random > mmap 4k writes test performs very well and is on par with the random-direct-write > implementation even though on every unmap, we do a VMA->start/end cl_flushing. > > The cl_flush operation is a no-op if the cacheline is not dirty and is a > memory bus storm with all the CLs that are dirty. So the only cost > is the iteration of vma->start-to-vma->end i+=64 I'd be curious what the cost is in practice. Do you have any actual numbers of the cost of doing it this way? Even if the instruction is a "noop", I'd really expect the overhead to really add up for a tens-of-gigabytes mapping, no matter how much the CPU optimizes it.
On 09/02/2015 05:23 PM, Dave Hansen wrote: <> > I'd be curious what the cost is in practice. Do you have any actual > numbers of the cost of doing it this way? > > Even if the instruction is a "noop", I'd really expect the overhead to > really add up for a tens-of-gigabytes mapping, no matter how much the > CPU optimizes it. What tens-of-gigabytes mapping? I have yet to encounter an application that does that. Our tests show that usually the mmaps are small. I can send you micro-benchmark results of an mmap vs direct-io random write. Our code will jump over holes in the file BTW, but I'll ask to also run it with falloc, which will make all blocks allocated. Give me a few days to collect this. I guess one optimization we should do is jump over holes and zero-extents. This will save the case of a mostly sparse very big file. Thanks Boaz
On 09/02/2015 08:18 AM, Boaz Harrosh wrote: > On 09/02/2015 05:23 PM, Dave Hansen wrote: >> > I'd be curious what the cost is in practice. Do you have any actual >> > numbers of the cost of doing it this way? >> > >> > Even if the instruction is a "noop", I'd really expect the overhead to >> > really add up for a tens-of-gigabytes mapping, no matter how much the >> > CPU optimizes it. > What tens-of-gigabytes mapping? I have yet to encounter an application > that does that. Our tests show that usually the mmaps are small. We are going to have 2-socket systems with 6TB of persistent memory in them. I think it's important to design this mechanism so that it scales to memory sizes like that and supports large mmap()s. I'm not sure the applications you've seen thus far are very representative of what we want to design for. > I can send you micro-benchmark results of an mmap vs direct-io random > write. Our code will jump over holes in the file BTW, but I'll ask to also > run it with falloc, which will make all blocks allocated. I'm really just more curious about actual clflush performance on large ranges. I'm curious how good the CPU is at optimizing it.
On 09/02/2015 06:39 PM, Dave Hansen wrote: > On 09/02/2015 08:18 AM, Boaz Harrosh wrote: >> On 09/02/2015 05:23 PM, Dave Hansen wrote: >>>> I'd be curious what the cost is in practice. Do you have any actual >>>> numbers of the cost of doing it this way? >>>> >>>> Even if the instruction is a "noop", I'd really expect the overhead to >>>> really add up for a tens-of-gigabytes mapping, no matter how much the >>>> CPU optimizes it. >> What tens-of-gigabytes mapping? I have yet to encounter an application >> that does that. Our tests show that usually the mmaps are small. > > We are going to have 2-socket systems with 6TB of persistent memory in > them. I think it's important to design this mechanism so that it scales > to memory sizes like that and supports large mmap()s. > > I'm not sure the applications you've seen thus far are very > representative of what we want to design for. > We have a patch pending to introduce a new mmap flag that pmem-aware applications can set to eliminate any kind of flushing. MMAP_PMEM_AWARE. This is good for the likes of libnvdimm, which does one large mmap of all 6TB and does not want the clflush penalty on unmap. >> I can send you micro-benchmark results of an mmap vs direct-io random >> write. Our code will jump over holes in the file BTW, but I'll ask to also >> run it with falloc, which will make all blocks allocated. > > I'm really just more curious about actual clflush performance on large > ranges. I'm curious how good the CPU is at optimizing it. > Again our test does not do this, because it will only flush written extents of the file. The most we have in one machine is 64G of pmem, so even on a very large mmap the most that can be dirty is 64G of data, and the actual modification of 64G of data will be much slower than the added clflush of each cache line. Thanks Boaz
On 09/02/2015 09:00 AM, Boaz Harrosh wrote: >> > We are going to have 2-socket systems with 6TB of persistent memory in >> > them. I think it's important to design this mechanism so that it scales >> > to memory sizes like that and supports large mmap()s. >> > >> > I'm not sure the applications you've seen thus far are very >> > representative of what we want to design for. >> > > We have a patch pending to introduce a new mmap flag that pmem-aware > applications can set to eliminate any kind of flushing. MMAP_PMEM_AWARE. Great! Do you have a link so that I can review it and compare it to Ross's approach?
On Tue, Sep 01, 2015 at 04:12:42PM +0300, Boaz Harrosh wrote: > On 08/31/2015 09:59 PM, Ross Zwisler wrote: > > @@ -753,3 +755,18 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) > > return dax_zero_page_range(inode, from, length, get_block); > > } > > EXPORT_SYMBOL_GPL(dax_truncate_page); > > + > > +void dax_sync_range(unsigned long addr, size_t len) > > +{ > > + while (len) { > > + size_t chunk_len = min_t(size_t, SZ_1G, len); > > + > > Where does the SZ_1G come from is it because you want to do cond_resched() > every 1G bytes so not to get stuck for a long time? > > It took me a while to catch, At first I thought it might be do to wb_cache_pmem() > limitations. Would you put a comment in the next iteration? Yep, the SZ_1G is just to make sure we cond_reshced() every once in a while. Is there a documented guideline somewhere as to how long a kernel thread is allowed to spin before calling cond_resched()? So far I haven' been able to find anything solid on this - it seems like each developer has their own preferences, and that those preferences vary pretty widely. In any case, assuming we continue to separate the msync() and fsync() implementations for DAX (which right now I'm doubting, to be honest), I'll add in a comment to explain this logic. > > diff --git a/include/linux/pmem.h b/include/linux/pmem.h > > index 85f810b3..aa29ebb 100644 > > --- a/include/linux/pmem.h > > +++ b/include/linux/pmem.h > > @@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) > > { > > BUG(); > > See below > > > } > > + > > +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) > > +{ > > + BUG(); > > There is a clflush_cache_range() defined for generic use. On ADR systems (even without pcommit) > this works perfectly and is persistent. why not use that in the generic case? Nope, we really do need to use wb_cache_pmem() because clflush_cache_range() isn't an architecture neutral API. wb_cache_pmem() also has the advantage that on x86 it will take advantage of the new CLWB instruction if it is available on the platform, and it doesn't introduce any unnecessary memory fencing. This works on both PCOMMIT-aware systems and on ADR boxes without PCOMMIT. > One usage of pmem is overlooked by all this API. The use of DRAM as pmem, across a VM > or cross reboot. you have a piece of memory exposed as pmem to the subsytem which survives > past the boot of that system. The CPU cache still needs flushing in this case. > (People are already using this for logs and crash dumps) I'm confused about this "DRAM as pmem" use case - are the requirements essentially the same as the ADR case? You need to make sure that pre-reboot the dirty cache lines have been flushed from the processor cache, but if they are in platform buffers (the "safe zone" for ADR) you're fine? If so, we're good to go, I think. Dan's most recent patch series made it so we correctly handle systems that have the PMEM API but not PCOMMIT: https://lists.01.org/pipermail/linux-nvdimm/2015-August/002005.html If the "DRAM as pmem across reboots" case isn't okay with your dirty data being in the ADR safe zone, I think you're toast. Without PCOMMIT the kernel cannot guarantee that the data has ever made it durably to the DIMMs, regardless what clflush/clflushopt/clwb magic you do. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 01, 2015 at 03:18:41PM +0300, Boaz Harrosh wrote: > So the approach we took was a bit different to exactly solve these > problem, and to also not over flush too much. here is what we did. > > * At vm_operations_struct we also override the .close vector (say call it dax_vm_close) > > * At dax_vm_close() on writable files call ->fsync(,vma->vm_start, vma->vm_end,) > (We have an inode flag if the file was actually dirtied, but even if not, that will > not be that bad, so a file was opened for write, mmapped, but actually never > modified. Not a lot of these, and the do nothing cl_flushing is very fast) > > * At ->fsync() do the actual cl_flush for all cases but only iff > if (mapping_mapped(inode->i_mapping) == 0) > return 0; > > This is because data written not through mmap is already persistent and we > do not need the cl_flushing > > Apps expect all these to work: > 1. open mmap m-write msync ... close > 2. open mmap m-write fsync ... close > 3. open mmap m-write unmap ... fsync close > > 4. open mmap m-write sync ... So basically you made close have an implicit fsync? What about the flow that looks like this: 5. open mmap close m-write This guy definitely needs an msync/fsync at the end to make sure that the m-write becomes durable. Also, the CLOSE(2) man page specifically says that a flush does not occur at close: A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a filesystem to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored, use fsync(2). (It will depend on the disk hardware at this point.) I don't think that adding an implicit fsync to close is the right solution - we just need to get msync and fsync correctly working. > The first 3 are supported with above, because what happens is that at [3] > the fsync actually happens on unmap and fsync is redundant in that case. > > The only broken scenario is [3]. We do not have a list of "dax-dirty" inodes > per sb to iterate on and call inode-sync on. This cause problems mostly in > freeze because with actual [3] scenario the file will be eventually closed > and persistent, but after the call to sync returns. > > Its on my TODO to fix [3] based on instructions from Dave. > The mmap call will put the inode on the list and the dax_vm_close will > remove it. One of the regular dirty list should be used as suggested by > Dave. I believe in the above two paragraphs you meant [4], so the 4. open mmap m-write sync ... case needs to be fixed so that we can detect DAX-dirty inodes? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 02, 2015 at 01:04:01PM -0600, Ross Zwisler wrote: > On Tue, Sep 01, 2015 at 03:18:41PM +0300, Boaz Harrosh wrote: > > So the approach we took was a bit different to exactly solve these > > problem, and to also not over flush too much. here is what we did. > > > > * At vm_operations_struct we also override the .close vector (say call it dax_vm_close) > > > > * At dax_vm_close() on writable files call ->fsync(,vma->vm_start, vma->vm_end,) > > (We have an inode flag if the file was actually dirtied, but even if not, that will > > not be that bad, so a file was opened for write, mmapped, but actually never > > modified. Not a lot of these, and the do nothing cl_flushing is very fast) > > > > * At ->fsync() do the actual cl_flush for all cases but only iff > > if (mapping_mapped(inode->i_mapping) == 0) > > return 0; > > > > This is because data written not through mmap is already persistent and we > > do not need the cl_flushing > > > > Apps expect all these to work: > > 1. open mmap m-write msync ... close > > 2. open mmap m-write fsync ... close > > 3. open mmap m-write unmap ... fsync close > > > > 4. open mmap m-write sync ... > > So basically you made close have an implicit fsync? What about the flow that > looks like this: > > 5. open mmap close m-write > > This guy definitely needs an msync/fsync at the end to make sure that the > m-write becomes durable. We can sync on pte_dirty() during zap_page_range(): it's practically free, since we page walk anyway. With this approach it probably makes sense to come back to a page walk on the msync() side too, to be consistent wrt the meaning of pte_dirty(). > Also, the CLOSE(2) man page specifically says that a flush does not occur at > close: > A successful close does not guarantee that the data has been > successfully saved to disk, as the kernel defers writes. It > is not common for a filesystem to flush the buffers when the stream is > closed. If you need to be sure that the data is physically stored, > use fsync(2). (It will depend on the disk hardware at this point.) > > I don't think that adding an implicit fsync to close is the right solution - > we just need to get msync and fsync correctly working. It doesn't mean we can't sync if we can do it without noticeable performance degradation.
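[A sketch of Kirill's "sync on pte_dirty() during zap_page_range()" idea; the helper name and the exact hook point in the zap path are illustrative, not an existing kernel interface.]

```c
/*
 * Sketch: while the unmap page walk is tearing down PTEs, a dirty PTE
 * in a DAX VMA identifies exactly which user page needs its cache
 * lines written back.  Where in the zap path this would be called from
 * is an assumption for illustration.
 */
static void dax_zap_flush_pte(struct vm_area_struct *vma,
			      unsigned long addr, pte_t pte)
{
	if (!vma_is_dax(vma))
		return;
	if (pte_present(pte) && pte_dirty(pte))
		dax_sync_range(addr, PAGE_SIZE);	/* one page worth */
}
```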
On 09/02/2015 10:04 PM, Ross Zwisler wrote: > On Tue, Sep 01, 2015 at 03:18:41PM +0300, Boaz Harrosh wrote: <> >> Apps expect all these to work: >> 1. open mmap m-write msync ... close >> 2. open mmap m-write fsync ... close >> 3. open mmap m-write unmap ... fsync close >> >> 4. open mmap m-write sync ... > > So basically you made close have an implicit fsync? What about the flow that > looks like this: > > 5. open mmap close m-write > What? No, close means unmap, because you need a file* attached to your vma. And you misunderstood me: the vm_ops->close is the *unmap* operation, not the file::close() operation. I meant memory-cl_flush on unmap, before the vma goes away. > This guy definitely needs an msync/fsync at the end to make sure that the > m-write becomes durable. > Exactly; done at unmap time. > Also, the CLOSE(2) man page specifically says that a flush does not occur at > close: > A successful close does not guarantee that the data has been > successfully saved to disk, as the kernel defers writes. It > is not common for a filesystem to flush the buffers when the stream is > closed. If you need to be sure that the data is physically stored, > use fsync(2). (It will depend on the disk hardware at this point.) > > I don't think that adding an implicit fsync to close is the right solution - > we just need to get msync and fsync correctly working. > So the above is not relevant, and we are doing that: taking care of CPU-cache flushing. This is not disk flushing; on a long memcpy from usermode most of the data is already durable, it is only the leftover margins. Just as dax_io in the kernel implies direct_io always, all we are trying to do is have the least performance hit in memory-cache flushing. It has nothing to do with the text above. >> The first 3 are supported with above, because what happens is that at [3] >> the fsync actually happens on unmap and fsync is redundant in that case. >> >> The only broken scenario is [3]. We do not have a list of "dax-dirty" inodes >> per sb to iterate on and call inode-sync on. This cause problems mostly in >> freeze because with actual [3] scenario the file will be eventually closed >> and persistent, but after the call to sync returns. >> >> Its on my TODO to fix [3] based on instructions from Dave. >> The mmap call will put the inode on the list and the dax_vm_close will >> remove it. One of the regular dirty list should be used as suggested by >> Dave. > > I believe in the above two paragraphs you meant [4], so the > > 4. open mmap m-write sync ... > > case needs to be fixed so that we can detect DAX-dirty inodes? > Yes, I'll be working on sync (the DAX-dirty i_list) soon, but it needs a working fsync to be in place (e.g. dax_fsync(inode)). Thanks Boaz
On 09/02/2015 07:19 PM, Dave Hansen wrote: > On 09/02/2015 09:00 AM, Boaz Harrosh wrote: >>>> We are going to have 2-socket systems with 6TB of persistent memory in >>>> them. I think it's important to design this mechanism so that it scales >>>> to memory sizes like that and supports large mmap()s. >>>> >>>> I'm not sure the application you've seen thus far are very >>>> representative of what we want to design for. >>>> >> We have a patch pending to introduce a new mmap flag that pmem aware >> applications can set to eliminate any kind of flushing. MMAP_PMEM_AWARE. > > Great! Do you have a link so that I can review it and compare it to > Ross's approach? > Ha? I have not seen a new mmap flag from Ross, yet I have been off lately so it is logical that I might have missed it. Could you send me a link? (BTW my patch I did not release yet, I'll cc you once its done) Thanks Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 03, 2015 at 09:32:02AM +0300, Boaz Harrosh wrote: > On 09/02/2015 10:04 PM, Ross Zwisler wrote: > > On Tue, Sep 01, 2015 at 03:18:41PM +0300, Boaz Harrosh wrote: > <> > >> Apps expect all these to work: > >> 1. open mmap m-write msync ... close > >> 2. open mmap m-write fsync ... close > >> 3. open mmap m-write unmap ... fsync close > >> > >> 4. open mmap m-write sync ... > > > > So basically you made close have an implicit fsync? What about the flow that > > looks like this: > > > > 5. open mmap close m-write > > > > What? No, close means unmap, because you need a file* attached to your vma > > And you misunderstood me: the vm_ops->close is the *unmap* operation, not > the file::close() operation. > > I meant memory-cl_flush on unmap, before the vma goes away. Ah, got it, thanks for the clarification.
```diff
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec..85c07b2 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void)
 }
 
 /**
- * __arch_wb_cache_pmem - write back a cache range with CLWB
- * @vaddr: virtual start address
+ * arch_wb_cache_pmem - write back a cache range with CLWB
+ * @addr: virtual start address
  * @size: number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
  * instruction. This function requires explicit ordering with an
- * arch_wmb_pmem() call. This API is internal to the x86 PMEM implementation.
+ * arch_wmb_pmem() call.
  */
-static inline void __arch_wb_cache_pmem(void *vaddr, size_t size)
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
 {
 	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
 	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vaddr = (void __force *)addr;
 	void *vend = vaddr + size;
 	void *p;
 
@@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
 	len = copy_from_iter_nocache(vaddr, bytes, i);
 
 	if (__iter_needs_pmem_wb(i))
-		__arch_wb_cache_pmem(vaddr, bytes);
+		arch_wb_cache_pmem(addr, bytes);
 
 	return len;
 }
@@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 	else
 		memset(vaddr, 0, size);
 
-	__arch_wb_cache_pmem(vaddr, size);
+	arch_wb_cache_pmem(addr, size);
 }
 
 static inline bool __arch_has_wmb_pmem(void)
diff --git a/fs/dax.c b/fs/dax.c
index fbe18b8..ed6aec1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -17,6 +17,7 @@
 #include <linux/atomic.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/genhd.h>
 #include <linux/highmem.h>
@@ -25,6 +26,7 @@
 #include <linux/mutex.h>
 #include <linux/pmem.h>
 #include <linux/sched.h>
+#include <linux/sizes.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
 
@@ -753,3 +755,18 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 	return dax_zero_page_range(inode, from, length, get_block);
 }
 EXPORT_SYMBOL_GPL(dax_truncate_page);
+
+void dax_sync_range(unsigned long addr, size_t len)
+{
+	while (len) {
+		size_t chunk_len = min_t(size_t, SZ_1G, len);
+
+		wb_cache_pmem((void __pmem *)addr, chunk_len);
+		len -= chunk_len;
+		addr += chunk_len;
+		if (len)
+			cond_resched();
+	}
+	wmb_pmem();
+}
+EXPORT_SYMBOL_GPL(dax_sync_range);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b415e52..504b33f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -14,6 +14,7 @@ int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
 		dax_iodone_t);
 int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
 		dax_iodone_t);
+void dax_sync_range(unsigned long addr, size_t len);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
 		unsigned int flags, get_block_t, dax_iodone_t);
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 85f810b3..aa29ebb 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	BUG();
 }
+
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	BUG();
+}
 #endif
 
 /*
  * Architectures that define ARCH_HAS_PMEM_API must provide
  * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
- * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
+ * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem()
+ * and arch_has_wmb_pmem().
  */
 static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
 {
@@ -202,4 +208,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size)
 	else
 		default_clear_pmem(addr, size);
 }
+
+/**
+ * wb_cache_pmem - write back a range of cache lines
+ * @vaddr: virtual start address
+ * @size: number of bytes to write back
+ *
+ * Write back the cache lines starting at 'vaddr' for 'size' bytes.
+ * This function requires explicit ordering with an wmb_pmem() call.
+ */
+static inline void wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	if (arch_has_pmem_api())
+		arch_wb_cache_pmem(addr, size);
+}
 #endif /* __PMEM_H__ */
diff --git a/mm/msync.c b/mm/msync.c
index bb04d53..2a4739c 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -7,6 +7,7 @@
 /*
  * The msync() system call.
  */
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
@@ -59,6 +60,7 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 	for (;;) {
 		struct file *file;
 		loff_t fstart, fend;
+		unsigned long range_len;
 
 		/* Still start < end. */
 		error = -ENOMEM;
@@ -77,10 +79,16 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 			error = -EBUSY;
 			goto out_unlock;
 		}
+
+		range_len = min(end, vma->vm_end) - start;
+
+		if (vma_is_dax(vma))
+			dax_sync_range(start, range_len);
+
 		file = vma->vm_file;
 		fstart = (start - vma->vm_start) +
 			 ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
-		fend = fstart + (min(end, vma->vm_end) - start) - 1;
+		fend = fstart + range_len - 1;
 		start = vma->vm_end;
 		if ((flags & MS_SYNC) && file &&
 		    (vma->vm_flags & VM_SHARED)) {
```
For DAX msync we just need to flush the given range using
wb_cache_pmem(), which is now a public part of the PMEM API.

The inclusion of <linux/dax.h> in fs/dax.c was done to make checkpatch
happy. Previously it was complaining about a bunch of undeclared
functions that could be made static.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
This patch is based on libnvdimm-for-next from our NVDIMM tree:

https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git/

with some DAX patches on top. The baseline tree can be found here:

https://github.com/01org/prd/tree/dax_msync
---
 arch/x86/include/asm/pmem.h | 13 +++++++------
 fs/dax.c                    | 17 +++++++++++++++++
 include/linux/dax.h         |  1 +
 include/linux/pmem.h        | 22 +++++++++++++++++++++-
 mm/msync.c                  | 10 +++++++++-
 5 files changed, 55 insertions(+), 8 deletions(-)