
dax, pmem: add support for msync

Message ID 1441047584-14664-1-git-send-email-ross.zwisler@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Ross Zwisler Aug. 31, 2015, 6:59 p.m. UTC
For DAX msync we just need to flush the given range using
wb_cache_pmem(), which is now a public part of the PMEM API.

The inclusion of <linux/dax.h> in fs/dax.c was done to make checkpatch
happy.  Previously it was complaining about a bunch of undeclared
functions that could be made static.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
This patch is based on libnvdimm-for-next from our NVDIMM tree:

https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git/

with some DAX patches on top.  The baseline tree can be found here:

https://github.com/01org/prd/tree/dax_msync
---
 arch/x86/include/asm/pmem.h | 13 +++++++------
 fs/dax.c                    | 17 +++++++++++++++++
 include/linux/dax.h         |  1 +
 include/linux/pmem.h        | 22 +++++++++++++++++++++-
 mm/msync.c                  | 10 +++++++++-
 5 files changed, 55 insertions(+), 8 deletions(-)

Comments

Christoph Hellwig Aug. 31, 2015, 7:06 p.m. UTC | #1
On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
> For DAX msync we just need to flush the given range using
> wb_cache_pmem(), which is now a public part of the PMEM API.
> 
> The inclusion of <linux/dax.h> in fs/dax.c was done to make checkpatch
> happy.  Previously it was complaining about a bunch of undeclared
> functions that could be made static.

Should this be abstracted by adding a ->msync method?  Maybe not
worth to do for now, but it might be worth to keep that in mind.

Otherwise this looks fine to me.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ross Zwisler Aug. 31, 2015, 7:26 p.m. UTC | #2
On Mon, Aug 31, 2015 at 09:06:19PM +0200, Christoph Hellwig wrote:
> On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
> > For DAX msync we just need to flush the given range using
> > wb_cache_pmem(), which is now a public part of the PMEM API.
> > 
> > The inclusion of <linux/dax.h> in fs/dax.c was done to make checkpatch
> > happy.  Previously it was complaining about a bunch of undeclared
> > functions that could be made static.
> 
> Should this be abstracted by adding a ->msync method?  Maybe not
> worth to do for now, but it might be worth to keep that in mind.

Where would we add the ->msync method?  Do you mean to the PMEM API, or
somewhere else?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Aug. 31, 2015, 7:34 p.m. UTC | #3
On Mon, Aug 31, 2015 at 01:26:40PM -0600, Ross Zwisler wrote:
> > Should this be abstracted by adding a ->msync method?  Maybe not
> > worth to do for now, but it might be worth to keep that in mind.
> 
> Where would we add the ->msync method?  Do you mean to the PMEM API, or
> somewhere else?

vm_operations_struct would be the most logical choice, with filesystems
having a dax-specific instance of that.
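
Roughly something like this, say (illustrative only; there is no ->msync member
in vm_operations_struct today, and the handler name is made up):

static int dax_vma_msync(struct vm_area_struct *vma,
			 unsigned long start, unsigned long end)
{
	/* write back the CPU cache for the user mapping of this range */
	dax_sync_range(start, end - start);
	return 0;
}

static const struct vm_operations_struct dax_file_vm_ops = {
	/* .fault and .page_mkwrite as the filesystem already provides them */
	.msync	= dax_vma_msync,	/* hypothetical new member */
};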
Dave Chinner Aug. 31, 2015, 11:38 p.m. UTC | #4
On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
> For DAX msync we just need to flush the given range using
> wb_cache_pmem(), which is now a public part of the PMEM API.

This is wrong, because it still leaves fsync() broken on dax.

Flushing dirty data to stable storage is the responsibility of the
writeback infrastructure, not the VMA/mm infrastructure. For non-dax
configurations, msync defers all that to vfs_fsync_range(), because
it has to be implemented there for fsync() to work.

Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit
the backing store allocations to stable storage, so there's no
getting around the fact that msync is the wrong place to be flushing
DAX mappings to persistent storage.

I pointed this out almost 6 months ago (i.e. that fsync was broken)
and hinted at how to solve it. Fix fsync, and msync gets fixed for
free:

https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html

I've also reported to Willy that DAX write page faults don't work
correctly, either. xfstests generic/080 exposes this: a read
from a page followed immediately by a write to that page does not
result in ->page_mkwrite being called on the write and so
backing store is not allocated for the page, nor are the timestamps
for the file updated. This will also result in fsync (and msync)
not working properly.

Fix fsync first - if fsync works then msync will work.

Cheers,

Dave.
Christoph Hellwig Sept. 1, 2015, 7:06 a.m. UTC | #5
On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote:
> On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
> > For DAX msync we just need to flush the given range using
> > wb_cache_pmem(), which is now a public part of the PMEM API.
> 
> This is wrong, because it still leaves fsync() broken on dax.
> 
> Flushing dirty data to stable storage is the responsibility of the
> writeback infrastructure, not the VMA/mm infrastructure. For non-dax
> configurations, msync defers all that to vfs_fsync_range(), because
> it has to be implemented there for fsync() to work.
> 
> Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit
> the backing store allocations to stable storage, so there's no
> getting around the fact that msync is the wrong place to be flushing
> DAX mappings to persistent storage.

DAX does call ->fsync before and after this patch.  And with all
the recent fixes we take care to ensure data is written through the
cache for everything but mmap-access.  With this patch from Ross
we ensure msync writes back the cache before calling ->fsync so that
the filesystem can then do its work like converting unwritten extents.

The only downside is that previously on Linux you could always use
fsync as a replacement for msync, which isn't true anymore for DAX.

But given that clwb() needs the user virtual address to flush the
cache, I can't see how to do this differently.
Kirill A . Shutemov Sept. 1, 2015, 10:08 a.m. UTC | #6
On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote:
> On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
> > For DAX msync we just need to flush the given range using
> > wb_cache_pmem(), which is now a public part of the PMEM API.
> 
> This is wrong, because it still leaves fsync() broken on dax.
> 
> Flushing dirty data to stable storage is the responsibility of the
> writeback infrastructure, not the VMA/mm infrastructure.

Writeback infrastructure is non-existent for DAX. Without struct page we
don't have anything to transfer pte_dirty() to. And I'm not sure we need
to invent some. For DAX flushing in-place can be cheaper than dirty
tracking beyond page tables.

> For non-dax configurations, msync defers all that to vfs_fsync_range(),
> because it has to be implemented there for fsync() to work.

Not necessarily. I think fsync() for DAX can be implemented with an rmap walk over
all of the file's VMAs, msync()ing them and committing the metadata afterwards.
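
Something like the following sketch, say (using dax_sync_range() from this
patch; the function name is made up, and it glosses over the fact that the
user addresses must be accessible from the context doing the flush):

static void dax_flush_mapped_vmas(struct address_space *mapping)
{
	struct vm_area_struct *vma;

	/* walk every VMA that maps the file and write back its CPU cache */
	i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
		if (vma->vm_flags & VM_SHARED)
			dax_sync_range(vma->vm_start,
				       vma->vm_end - vma->vm_start);
	}
	i_mmap_unlock_read(mapping);
}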

But we also need to commit to persistent storage on zap_page_range() to make it
work.

> Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit
> > the backing store allocations to stable storage, so there's no
> > getting around the fact that msync is the wrong place to be flushing
> DAX mappings to persistent storage.

Why?
IIUC, msync() doesn't have any requirements wrt metadata, right?

> I pointed this out almost 6 months ago (i.e. that fsync was broken)
> and hinted at how to solve it. Fix fsync, and msync gets fixed for
> free:
> 
> https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html
> 
> I've also reported to Willy that DAX write page faults don't work
> correctly, either. xfstests generic/080 exposes this: a read
> from a page followed immediately by a write to that page does not
> result in ->page_mkwrite being called on the write and so
> backing store is not allocated for the page, nor are the timestamps
> for the file updated. This will also result in fsync (and msync)
> not working properly.

Is that because XFS doesn't provide vm_ops->pfn_mkwrite?
Boaz Harrosh Sept. 1, 2015, 11:27 a.m. UTC | #7
On 09/01/2015 01:08 PM, Kirill A. Shutemov wrote:
<>
> 
> Is that because XFS doesn't provide vm_ops->pfn_mkwrite?
> 

Right, that would explain it, because I sent that patch exactly to solve
this problem. I haven't looked at the latest code for a while, but I should
check out the latest and make a patch for xfs if it is indeed missing.

Thanks
Boaz

Boaz Harrosh Sept. 1, 2015, 12:18 p.m. UTC | #8
On 09/01/2015 10:06 AM, Christoph Hellwig wrote:
> On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote:
>> On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
>>> For DAX msync we just need to flush the given range using
>>> wb_cache_pmem(), which is now a public part of the PMEM API.
>>
>> This is wrong, because it still leaves fsync() broken on dax.
>>
>> Flushing dirty data to stable storage is the responsibility of the
>> writeback infrastructure, not the VMA/mm infrastructure. For non-dax
>> configurations, msync defers all that to vfs_fsync_range(), because
>> it has to be implemented there for fsync() to work.
>>
>> Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit
>> the backing store allocations to stable storage, so there's no
>> getting around the fact that msync is the wrong place to be flushing
>> DAX mappings to persistent storage.
> 
> DAX does call ->fsync before and after this patch.  And with all
> the recent fixes we take care to ensure data is written through the
> cache for everything but mmap-access.  With this patch from Ross
> we ensure msync writes back the cache before calling ->fsync so that
> the filesystem can then do its work like converting unwritten extents.
> 
> The only downside is that previously on Linux you could always use
> fsync as a replacement for msync, which isn't true anymore for DAX.
> 

Hi Christoph

So the approach we took was a bit different, to exactly solve these
problems and to also not over-flush too much. Here is what we did.

* At vm_operations_struct we also override the .close vector (say call it dax_vm_close)

* At dax_vm_close() on writable files call ->fsync(,vma->vm_start, vma->vm_end,)
  (We have an inode flag if the file was actually dirtied, but even if not, that will
   not be that bad, so a file was opened for write, mmapped, but actually never
   modified. Not a lot of these, and the do nothing cl_flushing is very fast)

* At ->fsync() do the actual cl_flush for all cases but only iff
	if (mapping_mapped(inode->i_mapping) == 0)
		return 0;

  This is because data written not through mmap is already persistent and we
  do not need the cl_flushing
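
A minimal sketch of that dax_vm_close() (the name and the offset math are
illustrative only):

static void dax_vm_close(struct vm_area_struct *vma)
{
	struct file *file = vma->vm_file;
	loff_t start = (loff_t)vma->vm_pgoff << PAGE_SHIFT;
	loff_t end = start + (vma->vm_end - vma->vm_start) - 1;

	/* flush the mmap'ed range before the VMA goes away */
	if (file && (vma->vm_flags & VM_SHARED))
		vfs_fsync_range(file, start, end, 1);
}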

Apps expect all these to work:
1. open mmap m-write msync ... close
2. open mmap m-write fsync ... close
3. open mmap m-write unmap ... fsync close

4. open mmap m-write sync ...

The first 3 are supported with the above, because what happens is that at [3]
the fsync actually happens on unmap, and the fsync is redundant in that case.

The only broken scenario is [3]. We do not have a list of "dax-dirty" inodes
per sb to iterate on and call inode-sync on. This causes problems mostly in
freeze, because with the actual [3] scenario the file will eventually be closed
and persistent, but only after the call to sync returns.

It's on my TODO to fix [3] based on instructions from Dave.
The mmap call will put the inode on the list and dax_vm_close will
remove it. One of the regular dirty lists should be used, as suggested by
Dave.

> But given that we need the virtual address to write back the cache
> I can't see how to do this differently given that clwb() needs the
> user virtual address to flush the cache.

On Intel, or any system with physically-indexed caches, this is not
a problem: you just iterate over all get_block() results for the range and flush
the kernel's virt_addr of each block. This is easy.

On ARCHs with per-VM (virtually-indexed) caches you need to go through the i_mapping VMA list
and flush through those mappings instead. I guess there is a way to schedule yourself
into a process's VMA somehow.
I'm not sure how to solve this split; perhaps two generic functions that
are selected through the ARCH.

Just my $0.017
Boaz

Boaz Harrosh Sept. 1, 2015, 1:12 p.m. UTC | #9
On 08/31/2015 09:59 PM, Ross Zwisler wrote:
> For DAX msync we just need to flush the given range using
> wb_cache_pmem(), which is now a public part of the PMEM API.
> 
> The inclusion of <linux/dax.h> in fs/dax.c was done to make checkpatch
> happy.  Previously it was complaining about a bunch of undeclared
> functions that could be made static.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
> This patch is based on libnvdimm-for-next from our NVDIMM tree:
> 
> https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git/
> 
> with some DAX patches on top.  The baseline tree can be found here:
> 
> https://github.com/01org/prd/tree/dax_msync
> ---
>  arch/x86/include/asm/pmem.h | 13 +++++++------
>  fs/dax.c                    | 17 +++++++++++++++++
>  include/linux/dax.h         |  1 +
>  include/linux/pmem.h        | 22 +++++++++++++++++++++-
>  mm/msync.c                  | 10 +++++++++-
>  5 files changed, 55 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
> index d8ce3ec..85c07b2 100644
> --- a/arch/x86/include/asm/pmem.h
> +++ b/arch/x86/include/asm/pmem.h
> @@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void)
>  }
>  
>  /**
> - * __arch_wb_cache_pmem - write back a cache range with CLWB
> - * @vaddr:	virtual start address
> + * arch_wb_cache_pmem - write back a cache range with CLWB
> + * @addr:	virtual start address
>   * @size:	number of bytes to write back
>   *
>   * Write back a cache range using the CLWB (cache line write back)
>   * instruction.  This function requires explicit ordering with an
> - * arch_wmb_pmem() call.  This API is internal to the x86 PMEM implementation.
> + * arch_wmb_pmem() call.
>   */
> -static inline void __arch_wb_cache_pmem(void *vaddr, size_t size)
> +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
>  {
>  	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
>  	unsigned long clflush_mask = x86_clflush_size - 1;
> +	void *vaddr = (void __force *)addr;
>  	void *vend = vaddr + size;
>  	void *p;
>  
> @@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
>  	len = copy_from_iter_nocache(vaddr, bytes, i);
>  
>  	if (__iter_needs_pmem_wb(i))
> -		__arch_wb_cache_pmem(vaddr, bytes);
> +		arch_wb_cache_pmem(addr, bytes);
>  
>  	return len;
>  }
> @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
>  	else
>  		memset(vaddr, 0, size);
>  
> -	__arch_wb_cache_pmem(vaddr, size);
> +	arch_wb_cache_pmem(addr, size);
>  }
>  
>  static inline bool __arch_has_wmb_pmem(void)
> diff --git a/fs/dax.c b/fs/dax.c
> index fbe18b8..ed6aec1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -17,6 +17,7 @@
>  #include <linux/atomic.h>
>  #include <linux/blkdev.h>
>  #include <linux/buffer_head.h>
> +#include <linux/dax.h>
>  #include <linux/fs.h>
>  #include <linux/genhd.h>
>  #include <linux/highmem.h>
> @@ -25,6 +26,7 @@
>  #include <linux/mutex.h>
>  #include <linux/pmem.h>
>  #include <linux/sched.h>
> +#include <linux/sizes.h>
>  #include <linux/uio.h>
>  #include <linux/vmstat.h>
>  
> @@ -753,3 +755,18 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
>  	return dax_zero_page_range(inode, from, length, get_block);
>  }
>  EXPORT_SYMBOL_GPL(dax_truncate_page);
> +
> +void dax_sync_range(unsigned long addr, size_t len)
> +{
> +	while (len) {
> +		size_t chunk_len = min_t(size_t, SZ_1G, len);
> +

Where does the SZ_1G come from? Is it because you want to do cond_resched()
every 1G bytes so as not to get stuck for a long time?

It took me a while to catch. At first I thought it might be due to wb_cache_pmem()
limitations. Would you put a comment in the next iteration?

> +		wb_cache_pmem((void __pmem *)addr, chunk_len);
> +		len -= chunk_len;
> +		addr += chunk_len;
> +		if (len)
> +			cond_resched();
> +	}
> +	wmb_pmem();
> +}
> +EXPORT_SYMBOL_GPL(dax_sync_range);
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b415e52..504b33f 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -14,6 +14,7 @@ int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
>  		dax_iodone_t);
>  int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
>  		dax_iodone_t);
> +void dax_sync_range(unsigned long addr, size_t len);
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
>  				unsigned int flags, get_block_t, dax_iodone_t);
> diff --git a/include/linux/pmem.h b/include/linux/pmem.h
> index 85f810b3..aa29ebb 100644
> --- a/include/linux/pmem.h
> +++ b/include/linux/pmem.h
> @@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
>  {
>  	BUG();

See below

>  }
> +
> +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
> +{
> +	BUG();

There is a clflush_cache_range() defined for generic use. On ADR systems (even without pcommit)
this works perfectly and is persistent. why not use that in the generic case?

Also, all the above and below can be implemented via this one.

One usage of pmem is overlooked by all this API: the use of DRAM as pmem, across a VM
or across a reboot. You have a piece of memory exposed as pmem to the subsystem which survives
past the boot of that system. The CPU cache still needs flushing in this case.
(People are already using this for logs and crash dumps.)

So all arches, including all x86 variants, can have a generic implementation
that will work, based on clflush_cache_range().

I'll work on this ASAP and send patches ...
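
Something along these lines, assuming clflush_cache_range() (today an x86
helper) were made available to common code:

static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
{
	/* write back (and invalidate) the cache lines covering the range */
	clflush_cache_range((void __force *)addr, size);
}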

> +}
>  #endif
>  
>  /*
>   * Architectures that define ARCH_HAS_PMEM_API must provide
>   * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
> - * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
> + * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem()
> + * and arch_has_wmb_pmem().
>   */
>  static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
>  {
> @@ -202,4 +208,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size)
>  	else
>  		default_clear_pmem(addr, size);
>  }
> +
> +/**
> + * wb_cache_pmem - write back a range of cache lines
> + * @vaddr:	virtual start address
> + * @size:	number of bytes to write back
> + *
> + * Write back the cache lines starting at 'vaddr' for 'size' bytes.
> + * This function requires explicit ordering with an wmb_pmem() call.
> + */
> +static inline void wb_cache_pmem(void __pmem *addr, size_t size)
> +{
> +	if (arch_has_pmem_api())
> +		arch_wb_cache_pmem(addr, size);
> +}
>  #endif /* __PMEM_H__ */
> diff --git a/mm/msync.c b/mm/msync.c
> index bb04d53..2a4739c 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -7,6 +7,7 @@
>  /*
>   * The msync() system call.
>   */
> +#include <linux/dax.h>
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/mman.h>
> @@ -59,6 +60,7 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
>  	for (;;) {
>  		struct file *file;
>  		loff_t fstart, fend;
> +		unsigned long range_len;
>  
>  		/* Still start < end. */
>  		error = -ENOMEM;
> @@ -77,10 +79,16 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
>  			error = -EBUSY;
>  			goto out_unlock;
>  		}
> +
> +		range_len = min(end, vma->vm_end) - start;
> +
> +		if (vma_is_dax(vma))
> +			dax_sync_range(start, range_len);
> +

Yeah, no, I hate this. (There is no need to touch mm.)

All we need to do is define a dax_fsync().

A dax FS registers a special dax vector for ->fsync(); that vector
needs to call dax_fsync() first as part of its fsync operation.
Then dax_fsync() is:

int dax_fsync(struct inode *inode, unsigned long start, size_t range_len)
{
	/* we always write with sync, so only flush if the file is mmap'ed */
	if (mapping_mapped(inode->i_mapping) == 0)
		return 0;

	dax_sync_range(start, range_len);
	return 0;
}

Thanks
Boaz

>  		file = vma->vm_file;
>  		fstart = (start - vma->vm_start) +
>  			 ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> -		fend = fstart + (min(end, vma->vm_end) - start) - 1;
> +		fend = fstart + range_len - 1;
>  		start = vma->vm_end;
>  		if ((flags & MS_SYNC) && file &&
>  				(vma->vm_flags & VM_SHARED)) {
> 

Dave Chinner Sept. 1, 2015, 10:21 p.m. UTC | #10
On Tue, Sep 01, 2015 at 09:06:08AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote:
> > On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
> > > For DAX msync we just need to flush the given range using
> > > wb_cache_pmem(), which is now a public part of the PMEM API.
> > 
> > This is wrong, because it still leaves fsync() broken on dax.
> > 
> > Flushing dirty data to stable storage is the responsibility of the
> > writeback infrastructure, not the VMA/mm infrastructure. For non-dax
> > configurations, msync defers all that to vfs_fsync_range(), because
> > it has to be implemented there for fsync() to work.
> > 
> > Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit
> > the backing store allocations to stable storage, so there's no
> > getting around the fact that msync is the wrong place to be flushing
> > DAX mappings to persistent storage.
> 
> DAX does call ->fsync before and after this patch.  And with all
> the recent fixes we take care to ensure data is written through the
> cache for everything but mmap-access.  With this patch from Ross
> we ensure msync writes back the cache before calling ->fsync so that
> the filesystem can then do its work like converting unwritten extents.
> 
> The only downside is that previously on Linux you could always use
> fsync as a replacement for msync, which isn't true anymore for DAX.

Which means applications that should "just work" without
modification on DAX are now subtly broken and don't actually
guarantee data is safe after a crash. That's a pretty nasty
landmine, and goes against *everything* we've claimed about using
DAX with existing applications.

That's wrong, and needs fixing.

Cheers,

Dave.
Dave Chinner Sept. 1, 2015, 10:49 p.m. UTC | #11
On Tue, Sep 01, 2015 at 01:08:04PM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote:
> > On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
> > Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit
> > the backing store allocations to stable storage, so there's no
> > getting around the fact that msync is the wrong place to be flushing
> > DAX mappings to persistent storage.
> 
> Why?
> IIUC, msync() doesn't have any requirements wrt metadata, right?

Of course it does. If the backing store allocation has not been
committed, then after a crash there will be a hole in the file and
so it will read as zeroes regardless of what data was written and
flushed.

> > I pointed this out almost 6 months ago (i.e. that fsync was broken)
> > and hinted at how to solve it. Fix fsync, and msync gets fixed for
> > free:
> > 
> > https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html
> > 
> > I've also reported to Willy that DAX write page faults don't work
> > correctly, either. xfstests generic/080 exposes this: a read
> > from a page followed immediately by a write to that page does not
> > result in ->page_mkwrite being called on the write and so
> > backing store is not allocated for the page, nor are the timestamps
> > for the file updated. This will also result in fsync (and msync)
> > not working properly.
> 
> Is that because XFS doesn't provide vm_ops->pfn_mkwrite?

I didn't know that had been committed. I don't recall seeing a pull
request with that in it, none of the XFS DAX patches conflicted
against it, and there have been no runtime errors. I'll fix it up.

As such, shouldn't there be a check in the VM (in ->mmap callers)
that if the vma is returned with VM_MIXEDMAP enabled then
->pfn_mkwrite is not NULL?  It's now clear to me that any filesystem
that sets VM_MIXEDMAP needs to support both page_mkwrite and
pfn_mkwrite, and such a check would have caught this immediately...
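
A rough sketch of such a check, say right after the ->mmap call (purely
illustrative; no such check exists today):

static void mmap_check_dax_vm_ops(struct vm_area_struct *vma)
{
	/*
	 * A writable shared VM_MIXEDMAP mapping must be able to handle
	 * both page_mkwrite and pfn_mkwrite write faults.
	 */
	if ((vma->vm_flags & (VM_MIXEDMAP | VM_SHARED | VM_WRITE)) ==
			     (VM_MIXEDMAP | VM_SHARED | VM_WRITE))
		WARN_ON_ONCE(!vma->vm_ops ||
			     !vma->vm_ops->page_mkwrite ||
			     !vma->vm_ops->pfn_mkwrite);
}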

Cheers,

Dave.
Ross Zwisler Sept. 2, 2015, 3:19 a.m. UTC | #12
On Wed, Sep 02, 2015 at 08:21:20AM +1000, Dave Chinner wrote:
> Which means applications that should "just work" without
> modification on DAX are now subtly broken and don't actually
> guarantee data is safe after a crash. That's a pretty nasty
> landmine, and goes against *everything* we've claimed about using
> DAX with existing applications.
> 
> That's wrong, and needs fixing.

I agree that we need to fix fsync as well, and that the fsync solution could
be used to implement msync if we choose to go that route.  I think we might
want to consider keeping the msync and fsync implementations separate, though,
for two reasons.

1) The current msync implementation is much more efficient than what will be
needed for fsync.  Fsync will need to call into the filesystem, traverse all
the blocks, get kernel virtual addresses from those and then call
wb_cache_pmem() on those kernel addresses.  I think this is a necessary evil
for fsync since you don't have a VMA, but for msync we do and we can just
flush using the user addresses without any fs lookups.

2) I believe that the near-term fsync code will rely on struct pages for
PMEM, which I believe are possible but optional as of Dan's last patch set:

https://lkml.org/lkml/2015/8/25/841

I believe that this means that if we don't have struct pages for PMEM (because
ZONE_DEVICE et al. are turned off) fsync won't work.  It'd be nice not to lose
msync as well.
Dave Chinner Sept. 2, 2015, 5:17 a.m. UTC | #13
On Tue, Sep 01, 2015 at 09:19:45PM -0600, Ross Zwisler wrote:
> On Wed, Sep 02, 2015 at 08:21:20AM +1000, Dave Chinner wrote:
> > Which means applications that should "just work" without
> > modification on DAX are now subtly broken and don't actually
> > guarantee data is safe after a crash. That's a pretty nasty
> > landmine, and goes against *everything* we've claimed about using
> > DAX with existing applications.
> > 
> > That's wrong, and needs fixing.
> 
> I agree that we need to fix fsync as well, and that the fsync solution could
> be used to implement msync if we choose to go that route.  I think we might
> want to consider keeping the msync and fsync implementations separate, though,
> for two reasons.
> 
> 1) The current msync implementation is much more efficient than what will be
> needed for fsync.  Fsync will need to call into the filesystem, traverse all
> the blocks, get kernel virtual addresses from those and then call
> wb_cache_pmem() on those kernel addresses.  I think this is a necessary evil
> for fsync since you don't have a VMA, but for msync we do and we can just
> flush using the user addresses without any fs lookups.

Yet you're ignoring the fact that flushing the entire range of the
relevant VMAs may not be very efficient. It may be a very
large mapping with only a few pages that need flushing from the
cache, but you still iterate the mappings flushing GB ranges from
the cache at a time.

We don't need struct pages to track page dirty state. We already
have a method for doing this that does not rely on having a struct
page and can be used for efficient lookup of exact dirty ranges. i.e
the per-page dirty tag that is kept in the mapping radix tree. It's
fine grained, and extremely efficient in terms of lookups to find
dirty pages.
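
As a rough sketch of that idea (the helper below is hypothetical, the
primitives are not): tag the mapping entry dirty at write-fault time, then at
fsync/msync time walk only the tagged entries instead of flushing whole VMAs.

static void dax_tag_entry_dirty(struct address_space *mapping, pgoff_t index)
{
	/*
	 * Assumes an entry already exists in the radix tree at @index
	 * (DAX would have to insert one at fault time); tagging a
	 * not-present slot is not allowed.
	 */
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);
}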

Indeed, the mapping tree tags were specifically designed to avoid
this "fsync doesn't know what range to flush" problem for normal
files. That same problem still exists here for msync - these patches
are just hitting it with a goddamn massive hammer "because it is
easy" rather than attempting to do the flushing efficiently.

> 2) I believe that the near-term fsync code will rely on struct pages for
> PMEM, which I believe are possible but optional as of Dan's last patch set:
> 
> https://lkml.org/lkml/2015/8/25/841
> 
> I believe that this means that if we don't have struct pages for PMEM (because
> ZONE_DEVICE et al. are turned off) fsync won't work.  It'd be nice not to lose
> msync as well.

I don't think this is an either-or situation. If we use struct pages
for PMEM, then fsync will work without modification as DAX will need
to use struct pages and hence we can insert them in the mapping
radix tree directly and use the normal set/clear_page_dirty() calls
to track dirty state. It will give us fine grained flush capability
and we won't want msync() to be using the big hammer if we can avoid
it.

If we make the existing pfn-based DAX code track dirty pages via
mapping radix tree tags right now, then we allow fsync to work by
reusing most of the infrastructure we already have.  That means DAX
and fsync will work exactly the same regardless of how we
index/reference PMEM in future and we won't have to come back and
fix it all up again.

Cheers,

Dave.
Boaz Harrosh Sept. 2, 2015, 10:04 a.m. UTC | #14
On 09/02/2015 06:19 AM, Ross Zwisler wrote:
> On Wed, Sep 02, 2015 at 08:21:20AM +1000, Dave Chinner wrote:
>> Which means applications that should "just work" without
>> modification on DAX are now subtly broken and don't actually
>> guarantee data is safe after a crash. That's a pretty nasty
>> landmine, and goes against *everything* we've claimed about using
>> DAX with existing applications.
>>
>> That's wrong, and needs fixing.
> 
> I agree that we need to fix fsync as well, and that the fsync solution could
> be used to implement msync if we choose to go that route.  I think we might
> want to consider keeping the msync and fsync implementations separate, though,
> for two reasons.
> 
> 1) The current msync implementation is much more efficient than what will be
> needed for fsync.  Fsync will need to call into the filesystem, traverse all
> the blocks, get kernel virtual addresses from those and then call
> wb_cache_pmem() on those kernel addresses.  

I was thinking about this some more, and no this is not what we need to do
because of the virtual-based-cache ARCHs. And what we do for these systems
will also work for physical-based-cache ARCHs.

What we need to do is dig into the mapping structure and pick up the current
VMA on the call to fsync, then just flush that one through its virtual address
(since it is current in the context of the fsync syscall).

And of course we need to do as I wrote: we must call fsync on vm_operations->close
before the VMA mapping goes away. Then an fsync after unmap is a no-op.

> I think this is a necessary evil
> for fsync since you don't have a VMA, but for msync we do and we can just
> flush using the user addresses without any fs lookups.
> 

right see above

> 2) I believe that the near-term fsync code will rely on struct pages for
> PMEM, which I believe are possible but optional as of Dan's last patch set:
> 
> https://lkml.org/lkml/2015/8/25/841
> 
> I believe that this means that if we don't have struct pages for PMEM (because
> ZONE_DEVICE et al. are turned off) fsync won't work.  It'd be nice not to lose
> msync as well.

Please see above: it can be made to work. Actually what we do is the
kernel-pointer-traversal thing, plus the fsync-on-unmap. And it works; we have heavy
persistence testing and it is all very good.

So no, without pages it can all work very well. There is only the sync problem,
which I intend to fix soon; it is only a matter of keeping a dax-dirty inode list
per sb.

So no, this is not an excuse.

Cheers
Boaz

Boaz Harrosh Sept. 2, 2015, 10:27 a.m. UTC | #15
On 09/02/2015 08:17 AM, Dave Chinner wrote:
> On Tue, Sep 01, 2015 at 09:19:45PM -0600, Ross Zwisler wrote:
>> On Wed, Sep 02, 2015 at 08:21:20AM +1000, Dave Chinner wrote:
>>> Which means applications that should "just work" without
>>> modification on DAX are now subtly broken and don't actually
>>> guarantee data is safe after a crash. That's a pretty nasty
>>> landmine, and goes against *everything* we've claimed about using
>>> DAX with existing applications.
>>>
>>> That's wrong, and needs fixing.
>>
>> I agree that we need to fix fsync as well, and that the fsync solution could
>> be used to implement msync if we choose to go that route.  I think we might
>> want to consider keeping the msync and fsync implementations separate, though,
>> for two reasons.
>>
>> 1) The current msync implementation is much more efficient than what will be
>> needed for fsync.  Fsync will need to call into the filesystem, traverse all
>> the blocks, get kernel virtual addresses from those and then call
>> wb_cache_pmem() on those kernel addresses.  I think this is a necessary evil
>> for fsync since you don't have a VMA, but for msync we do and we can just
>> flush using the user addresses without any fs lookups.
> 
> Yet you're ignoring the fact that flushing the entire range of the
> relevant VMAs may not be very efficient. It may be a very
> large mapping with only a few pages that need flushing from the
> cache, but you still iterate the mappings flushing GB ranges from
> the cache at a time.
> 

So actually you are wrong about this. We have a working system and as part
of our testing rig we do performance measurements, constantly. Our random
mmap 4k writes test performs very well and is on par with the random-direct-write
implementation, even though on every unmap we do a VMA->start/end cl_flushing.

The cl_flush operation is a no-op if the cacheline is not dirty, and a
memory bus storm for all the CLs that are dirty. So the only cost
is the iteration from vma->start to vma->end in i += 64 steps.

Let us please agree that we should do the correct thing for now, and let
the complaints roll in about the slowness later. You will find that my
proposed solution is not so slow.

> We don't need struct pages to track page dirty state. We already
> have a method for doing this that does not rely on having a struct
> page and can be used for efficient lookup of exact dirty ranges. i.e
> the per-page dirty tag that is kept in the mapping radix tree. It's
> fine grained, and extremely efficient in terms of lookups to find
> dirty pages.
> 

In fact you will find that this solution is actually slower, because
you need an extra lock on every major page fault and you need to
keep the radix tree populated. Today with dax the radix tree is
completely empty; it is only ever used if one reads holes. But we
found that this is not common at all; usually mmap applications read
what is really there.

So the extra work per page will be more than actually doing the fast
no-op cl_flush.

> Indeed, the mapping tree tags were specifically designed to avoid
> this "fsync doesn't know what range to flush" problem for normal
> files. That same problem still exists here for msync - these patches
> are just hitting it with a goddamn massive hammer "because it is
> easy" rather than attempting to do the flushing efficiently.
> 

You come from the disk world, where every extra synced block is a huge waste.
This is memory; it is not what you are used to.

Again, we have benchmarks and mmap works very, very well, including that
contraption of cl_flushing vma->start..vma->end.

I'll try and open up some time to send a rough draft. My boss will kill me,
but he'll survive.

Thanks
Boaz

Dave Hansen Sept. 2, 2015, 2:23 p.m. UTC | #16
On 09/02/2015 03:27 AM, Boaz Harrosh wrote:
>> > Yet you're ignoring the fact that flushing the entire range of the
>> > relevant VMAs may not be very efficient. It may be a very
>> > large mapping with only a few pages that need flushing from the
>> > cache, but you still iterate the mappings flushing GB ranges from
>> > the cache at a time.
>> > 
> So actually you are wrong about this. We have a working system and as part
> of our testing rig we do performance measurements, constantly. Our random
> mmap 4k writes test preforms very well and is in par with the random-direct-write
> implementation even though on every unmap, we do a VMA->start/end cl_flushing.
> 
> The cl_flush operation is a no-op if the cacheline is not dirty and is a
> memory bus storm with all the CLs that are dirty. So the only cost
> is the iteration of vma->start-to-vma->end i+=64

I'd be curious what the cost is in practice.  Do you have any actual
numbers of the cost of doing it this way?

Even if the instruction is a "noop", I'd really expect the overhead to
really add up for a tens-of-gigabytes mapping, no matter how much the
CPU optimizes it.
Boaz Harrosh Sept. 2, 2015, 3:18 p.m. UTC | #17
On 09/02/2015 05:23 PM, Dave Hansen wrote:
<>
> I'd be curious what the cost is in practice.  Do you have any actual
> numbers of the cost of doing it this way?
> 
> Even if the instruction is a "noop", I'd really expect the overhead to
> really add up for a tens-of-gigabytes mapping, no matter how much the
> CPU optimizes it.

What tens-of-gigabytes mapping? I have yet to encounter an application
that does that. Our tests show that usually the mmaps are small.

I can send you micro-benchmark results of an mmap vs direct-io random
write. Our code will jump over holes in the file BTW, but I'll ask to also
run it with falloc so that all blocks are allocated.

Give me a few days to collect this.

I guess one optimization we should do is jump over holes and zero-extents.
This will help the case of a mostly sparse, very big file.

Thanks
Boaz

Dave Hansen Sept. 2, 2015, 3:39 p.m. UTC | #18
On 09/02/2015 08:18 AM, Boaz Harrosh wrote:
> On 09/02/2015 05:23 PM, Dave Hansen wrote:
>> > I'd be curious what the cost is in practice.  Do you have any actual
>> > numbers of the cost of doing it this way?
>> > 
>> > Even if the instruction is a "noop", I'd really expect the overhead to
>> > really add up for a tens-of-gigabytes mapping, no matter how much the
>> > CPU optimizes it.
> What tens-of-gigabytes mapping? I have yet to encounter an application
> that does that. Our tests show that usually the mmaps are small.

We are going to have 2-socket systems with 6TB of persistent memory in
them.  I think it's important to design this mechanism so that it scales
to memory sizes like that and supports large mmap()s.

I'm not sure the applications you've seen thus far are very
representative of what we want to design for.

>> I can send you micro-benchmark results of an mmap vs direct-io random
>> write. Our code will jump over holes in the file BTW, but I'll ask to also
>> run it with falloc so that all blocks are allocated.

I'm really just more curious about actual clflush performance on large
ranges.  I'm curious how good the CPU is at optimizing it.
Boaz Harrosh Sept. 2, 2015, 4 p.m. UTC | #19
On 09/02/2015 06:39 PM, Dave Hansen wrote:
> On 09/02/2015 08:18 AM, Boaz Harrosh wrote:
>> On 09/02/2015 05:23 PM, Dave Hansen wrote:
>>>> I'd be curious what the cost is in practice.  Do you have any actual
>>>> numbers of the cost of doing it this way?
>>>>
>>>> Even if the instruction is a "noop", I'd really expect the overhead to
>>>> really add up for a tens-of-gigabytes mapping, no matter how much the
>>>> CPU optimizes it.
>> What tens-of-gigabytes mapping? I have yet to encounter an application
>> that does that. Our tests show that usually the mmaps are small.
> 
> We are going to have 2-socket systems with 6TB of persistent memory in
> them.  I think it's important to design this mechanism so that it scales
> to memory sizes like that and supports large mmap()s.
> 
> I'm not sure the applications you've seen thus far are very
> representative of what we want to design for.
> 

We have a patch pending to introduce a new mmap flag that pmem aware
applications can set to eliminate any kind of flushing. MMAP_PMEM_AWARE.

This is good for the likes of libnvdimm that do one large mmap of
all 6TB and do not want the clflush penalty on unmap.
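
As a sketch, the flush-on-unmap path could then simply bail out early (the
flag and helper names below are placeholders; the actual patch has not been
posted yet):

static void dax_vm_close(struct vm_area_struct *vma)
{
	/* the application manages persistence itself: skip the flush */
	if (vma->vm_flags & VM_PMEM_AWARE)	/* placeholder flag name */
		return;

	dax_flush_vma_on_close(vma);		/* placeholder helper */
}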

>> I can send you micro-benchmark results of an mmap vs direct-io random
>> write. Our code will jump over holes in the file BTW, but I'll ask to also
>> run it with falloc so that all blocks are allocated.
> 
> I'm really just more curious about actual clflush performance on large
> ranges.  I'm curious how good the CPU is at optimizing it.
> 

Again, our test does not do this, because it will only flush written extents
of the file. The most we have in one machine is 64G of pmem, so even on a
very large mmap the most that can be there is 64G of data, and the actual modification
of 64G of data will be much slower than the added clflush of each cache line.

Thanks
Boaz

Dave Hansen Sept. 2, 2015, 4:19 p.m. UTC | #20
On 09/02/2015 09:00 AM, Boaz Harrosh wrote:
>> > We are going to have 2-socket systems with 6TB of persistent memory in
>> > them.  I think it's important to design this mechanism so that it scales
>> > to memory sizes like that and supports large mmap()s.
>> > 
>> > I'm not sure the applications you've seen thus far are very
>> > representative of what we want to design for.
>> > 
> We have a patch pending to introduce a new mmap flag that pmem aware
> applications can set to eliminate any kind of flushing. MMAP_PMEM_AWARE.

Great!  Do you have a link so that I can review it and compare it to
Ross's approach?

Ross Zwisler Sept. 2, 2015, 5:47 p.m. UTC | #21
On Tue, Sep 01, 2015 at 04:12:42PM +0300, Boaz Harrosh wrote:
> On 08/31/2015 09:59 PM, Ross Zwisler wrote:
> > @@ -753,3 +755,18 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
> >  	return dax_zero_page_range(inode, from, length, get_block);
> >  }
> >  EXPORT_SYMBOL_GPL(dax_truncate_page);
> > +
> > +void dax_sync_range(unsigned long addr, size_t len)
> > +{
> > +	while (len) {
> > +		size_t chunk_len = min_t(size_t, SZ_1G, len);
> > +
> 
> Where does the SZ_1G come from? Is it because you want to do cond_resched()
> every 1G bytes so as not to get stuck for a long time?
> 
> It took me a while to catch. At first I thought it might be due to wb_cache_pmem()
> limitations. Would you put a comment in the next iteration?

Yep, the SZ_1G is just to make sure we cond_resched() every once in a while.
Is there a documented guideline somewhere as to how long a kernel thread is
allowed to spin before calling cond_resched()?   So far I haven't been able to
find anything solid on this - it seems like each developer has their own
preferences, and that those preferences vary pretty widely.

In any case, assuming we continue to separate the msync() and fsync()
implementations for DAX (which right now I'm doubting, to be honest), I'll add
in a comment to explain this logic.

> > diff --git a/include/linux/pmem.h b/include/linux/pmem.h
> > index 85f810b3..aa29ebb 100644
> > --- a/include/linux/pmem.h
> > +++ b/include/linux/pmem.h
> > @@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
> >  {
> >  	BUG();
> 
> See below
> 
> >  }
> > +
> > +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
> > +{
> > +	BUG();
> 
> There is a clflush_cache_range() defined for generic use. On ADR systems (even without pcommit)
> this works perfectly and is persistent. why not use that in the generic case?

Nope, we really do need to use wb_cache_pmem() because clflush_cache_range()
isn't an architecture neutral API.  wb_cache_pmem() also has the advantage
that on x86 it will take advantage of the new CLWB instruction if it is
available on the platform, and it doesn't introduce any unnecessary memory
fencing.  This works on both PCOMMIT-aware systems and on ADR boxes without
PCOMMIT.
 
> One usage of pmem is overlooked by all this API: the use of DRAM as pmem, across a VM
> or across a reboot. You have a piece of memory exposed as pmem to the subsystem which survives
> past the boot of that system. The CPU cache still needs flushing in this case.
> (People are already using this for logs and crash dumps.)

I'm confused about this "DRAM as pmem" use case - are the requirements
essentially the same as the ADR case?  You need to make sure that pre-reboot
the dirty cache lines have been flushed from the processor cache, but if they
are in platform buffers (the "safe zone" for ADR) you're fine?

If so, we're good to go, I think.  Dan's most recent patch series made it so
we correctly handle systems that have the PMEM API but not PCOMMIT:

https://lists.01.org/pipermail/linux-nvdimm/2015-August/002005.html

If the "DRAM as pmem across reboots" case isn't okay with your dirty data
being in the ADR safe zone, I think you're toast.  Without PCOMMIT the kernel
cannot guarantee that the data has ever made it durably to the DIMMs,
regardless what clflush/clflushopt/clwb magic you do.
Ross Zwisler Sept. 2, 2015, 7:04 p.m. UTC | #22
On Tue, Sep 01, 2015 at 03:18:41PM +0300, Boaz Harrosh wrote:
> So the approach we took was a bit different, to exactly solve these
> problems and to also not over-flush too much. Here is what we did.
> 
> * At vm_operations_struct we also override the .close vector (say call it dax_vm_close)
> 
> * At dax_vm_close() on writable files call ->fsync(,vma->vm_start, vma->vm_end,)
>   (We have an inode flag if the file was actually dirtied, but even if not, that will
>    not be that bad, so a file was opened for write, mmapped, but actually never
>    modified. Not a lot of these, and the do nothing cl_flushing is very fast)
> 
> * At ->fsync() do the actual cl_flush for all cases but only iff
> 	if (mapping_mapped(inode->i_mapping) == 0)
> 		return 0;
> 
>   This is because data written not through mmap is already persistent and we
>   do not need the cl_flushing
> 
> Apps expect all these to work:
> 1. open mmap m-write msync ... close
> 2. open mmap m-write fsync ... close
> 3. open mmap m-write unmap ... fsync close
> 
> 4. open mmap m-write sync ...

So basically you made close have an implicit fsync?  What about the flow that
looks like this:

5. open mmap close m-write

This guy definitely needs an msync/fsync at the end to make sure that the
m-write becomes durable.  

Also, the CLOSE(2) man page specifically says that a flush does not occur at
close:
	A successful close does not guarantee that the data has been
	successfully  saved  to  disk,  as  the  kernel defers  writes.   It
	is not common for a filesystem to flush the buffers when the stream is
	closed.  If you need to be sure that the data is physically stored,
	use fsync(2).  (It will depend on the disk  hardware  at this point.)

I don't think that adding an implicit fsync to close is the right solution -
we just need to get msync and fsync correctly working.

> The first 3 are supported with above, because what happens is that at [3]
> the fsync actually happens on unmap and fsync is redundant in that case.
> 
> The only broken scenario is [3]. We do not have a list of "dax-dirty" inodes
> per sb to iterate on and call inode-sync on. This cause problems mostly in
> freeze because with actual [3] scenario the file will be eventually closed
> and persistent, but after the call to sync returns.
> 
> Its on my TODO to fix [3] based on instructions from Dave.
> The mmap call will put the inode on the list and the dax_vm_close will
> remove it. One of the regular dirty list should be used as suggested by
> Dave.

I believe in the above two paragraphs you meant [4], so the 

4. open mmap m-write sync ...

case needs to be fixed so that we can detect DAX-dirty inodes?
Kirill A . Shutemov Sept. 2, 2015, 8:17 p.m. UTC | #23
On Wed, Sep 02, 2015 at 01:04:01PM -0600, Ross Zwisler wrote:
> On Tue, Sep 01, 2015 at 03:18:41PM +0300, Boaz Harrosh wrote:
> > So the approach we took was a bit different, to exactly solve these
> > problems and to also not over-flush too much. Here is what we did.
> > 
> > * At vm_operations_struct we also override the .close vector (say call it dax_vm_close)
> > 
> > * At dax_vm_close() on writable files call ->fsync(,vma->vm_start, vma->vm_end,)
> >   (We have an inode flag if the file was actually dirtied, but even if not, that will
> >    not be that bad, so a file was opened for write, mmapped, but actually never
> >    modified. Not a lot of these, and the do nothing cl_flushing is very fast)
> > 
> > * At ->fsync() do the actual cl_flush for all cases but only iff
> > 	if (mapping_mapped(inode->i_mapping) == 0)
> > 		return 0;
> > 
> >   This is because data written not through mmap is already persistent and we
> >   do not need the cl_flushing
> > 
> > Apps expect all these to work:
> > 1. open mmap m-write msync ... close
> > 2. open mmap m-write fsync ... close
> > 3. open mmap m-write unmap ... fsync close
> > 
> > 4. open mmap m-write sync ...
> 
> So basically you made close have an implicit fsync?  What about the flow that
> looks like this:
> 
> 5. open mmap close m-write
> 
> This guy definitely needs an msync/fsync at the end to make sure that the
> m-write becomes durable.  

We can sync on pte_dirty() during zap_page_range(): it's practically free,
since we page walk anyway.

With this approach it probably makes sense to come back to a page walk on
the msync() side too, to be consistent wrt the pte_dirty() meaning.

> Also, the CLOSE(2) man page specifically says that a flush does not occur at
> close:
> 	A successful close does not guarantee that the data has been
> 	successfully  saved  to  disk,  as  the  kernel defers  writes.   It
> 	is not common for a filesystem to flush the buffers when the stream is
> 	closed.  If you need to be sure that the data is physically stored,
> 	use fsync(2).  (It will depend on the disk  hardware  at this point.)
> 
> I don't think that adding an implicit fsync to close is the right solution -
> we just need to get msync and fsync correctly working.

It doesn't mean we can't sync if we can do it without noticeable performance
degradation.
Boaz Harrosh Sept. 3, 2015, 6:32 a.m. UTC | #24
On 09/02/2015 10:04 PM, Ross Zwisler wrote:
> On Tue, Sep 01, 2015 at 03:18:41PM +0300, Boaz Harrosh wrote:
<>
>> Apps expect all these to work:
>> 1. open mmap m-write msync ... close
>> 2. open mmap m-write fsync ... close
>> 3. open mmap m-write unmap ... fsync close
>>
>> 4. open mmap m-write sync ...
> 
> So basically you made close have an implicit fsync?  What about the flow that
> looks like this:
> 
> 5. open mmap close m-write
> 

What? No, close means unmap, because you need a file* attached to your vma.

And you misunderstood me: the vm_ops->close is the *unmap* operation, not
the file::close() operation.

I meant memory-cl_flush on unmap before the vma goes away.

> This guy definitely needs an msync/fsync at the end to make sure that the
> m-write becomes durable.  
> 

Exactly; it is done at unmap time.

> Also, the CLOSE(2) man page specifically says that a flush does not occur at
> close:
> 	A successful close does not guarantee that the data has been
> 	successfully  saved  to  disk,  as  the  kernel defers  writes.   It
> 	is not common for a filesystem to flush the buffers when the stream is
> 	closed.  If you need to be sure that the data is physically stored,
> 	use fsync(2).  (It will depend on the disk  hardware  at this point.)
> 
> I don't think that adding an implicit fsync to close is the right solution -
> we just need to get msync and fsync correctly working.
> 

So the above is not relevant, and we are doing that: taking care of cpu-cache flushing.
This is not disk flushing; on a long memcpy from usermode most of the data is
already durable, it is only the leftover margins. Like dax_io in the kernel,
dax implies direct_io always; all we are trying to do is have the least
performance hit from memory-cache flushing.

It has nothing to do with the text above.

>> The first 3 are supported with above, because what happens is that at [3]
>> the fsync actually happens on unmap and fsync is redundant in that case.
>>
>> The only broken scenario is [3]. We do not have a list of "dax-dirty" inodes
>> per sb to iterate on and call inode-sync on. This cause problems mostly in
>> freeze because with actual [3] scenario the file will be eventually closed
>> and persistent, but after the call to sync returns.
>>
>> Its on my TODO to fix [3] based on instructions from Dave.
>> The mmap call will put the inode on the list and the dax_vm_close will
>> remove it. One of the regular dirty list should be used as suggested by
>> Dave.
> 
> I believe in the above two paragraphs you meant [4], so the 
> 
> 4. open mmap m-write sync ...
> 
> case needs to be fixed so that we can detect DAX-dirty inodes?
> 

Yes, I'll be working on sync (the DAX-dirty inode list) soon, but it needs a working
fsync to be in place (e.g. dax_fsync(inode)).

Thanks
Boaz

Boaz Harrosh Sept. 3, 2015, 6:41 a.m. UTC | #25
On 09/02/2015 07:19 PM, Dave Hansen wrote:
> On 09/02/2015 09:00 AM, Boaz Harrosh wrote:
>>>> We are going to have 2-socket systems with 6TB of persistent memory in
>>>> them.  I think it's important to design this mechanism so that it scales
>>>> to memory sizes like that and supports large mmap()s.
>>>>
>>>> I'm not sure the application you've seen thus far are very
>>>> representative of what we want to design for.
>>>>
>> We have a patch pending to introduce a new mmap flag that pmem aware
>> applications can set to eliminate any kind of flushing. MMAP_PMEM_AWARE.
> 
> Great!  Do you have a link so that I can review it and compare it to
> Ross's approach?
> 

Ha? I have not seen a new mmap flag from Ross, but I have been off lately,
so it is quite possible that I missed it.

Could you send me a link?

(BTW, I did not release my patch yet; I'll cc you once it's done.)

Thanks
Boaz

Ross Zwisler Sept. 3, 2015, 4:44 p.m. UTC | #26
On Thu, Sep 03, 2015 at 09:32:02AM +0300, Boaz Harrosh wrote:
> On 09/02/2015 10:04 PM, Ross Zwisler wrote:
> > On Tue, Sep 01, 2015 at 03:18:41PM +0300, Boaz Harrosh wrote:
> <>
> >> Apps expect all these to work:
> >> 1. open mmap m-write msync ... close
> >> 2. open mmap m-write fsync ... close
> >> 3. open mmap m-write unmap ... fsync close
> >>
> >> 4. open mmap m-write sync ...
> > 
> > So basically you made close have an implicit fsync?  What about the flow that
> > looks like this:
> > 
> > 5. open mmap close m-write
> > 
> 
> What? no, close means ummap because you need a file* attached to your vma
> 
> And you miss-understood me, the vm_opts->close is the *unmap* operation not
> the file::close() operation.
> 
> I meant memory-cl_flush on unmap before the vma goes away.

Ah, got it, thanks for the clarification.

Patch

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec..85c07b2 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -67,18 +67,19 @@  static inline void arch_wmb_pmem(void)
 }
 
 /**
- * __arch_wb_cache_pmem - write back a cache range with CLWB
- * @vaddr:	virtual start address
+ * arch_wb_cache_pmem - write back a cache range with CLWB
+ * @addr:	virtual start address
  * @size:	number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
  * instruction.  This function requires explicit ordering with an
- * arch_wmb_pmem() call.  This API is internal to the x86 PMEM implementation.
+ * arch_wmb_pmem() call.
  */
-static inline void __arch_wb_cache_pmem(void *vaddr, size_t size)
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
 {
 	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
 	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vaddr = (void __force *)addr;
 	void *vend = vaddr + size;
 	void *p;
 
@@ -115,7 +116,7 @@  static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
 	len = copy_from_iter_nocache(vaddr, bytes, i);
 
 	if (__iter_needs_pmem_wb(i))
-		__arch_wb_cache_pmem(vaddr, bytes);
+		arch_wb_cache_pmem(addr, bytes);
 
 	return len;
 }
@@ -138,7 +139,7 @@  static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 	else
 		memset(vaddr, 0, size);
 
-	__arch_wb_cache_pmem(vaddr, size);
+	arch_wb_cache_pmem(addr, size);
 }
 
 static inline bool __arch_has_wmb_pmem(void)
diff --git a/fs/dax.c b/fs/dax.c
index fbe18b8..ed6aec1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -17,6 +17,7 @@ 
 #include <linux/atomic.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/genhd.h>
 #include <linux/highmem.h>
@@ -25,6 +26,7 @@ 
 #include <linux/mutex.h>
 #include <linux/pmem.h>
 #include <linux/sched.h>
+#include <linux/sizes.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
 
@@ -753,3 +755,18 @@  int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 	return dax_zero_page_range(inode, from, length, get_block);
 }
 EXPORT_SYMBOL_GPL(dax_truncate_page);
+
+void dax_sync_range(unsigned long addr, size_t len)
+{
+	while (len) {
+		size_t chunk_len = min_t(size_t, SZ_1G, len);
+
+		wb_cache_pmem((void __pmem *)addr, chunk_len);
+		len -= chunk_len;
+		addr += chunk_len;
+		if (len)
+			cond_resched();
+	}
+	wmb_pmem();
+}
+EXPORT_SYMBOL_GPL(dax_sync_range);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b415e52..504b33f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -14,6 +14,7 @@  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
 		dax_iodone_t);
 int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
 		dax_iodone_t);
+void dax_sync_range(unsigned long addr, size_t len);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
 				unsigned int flags, get_block_t, dax_iodone_t);
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 85f810b3..aa29ebb 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -53,12 +53,18 @@  static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	BUG();
 }
+
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	BUG();
+}
 #endif
 
 /*
  * Architectures that define ARCH_HAS_PMEM_API must provide
  * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
- * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
+ * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem()
+ * and arch_has_wmb_pmem().
  */
 static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
 {
@@ -202,4 +208,18 @@  static inline void clear_pmem(void __pmem *addr, size_t size)
 	else
 		default_clear_pmem(addr, size);
 }
+
+/**
+ * wb_cache_pmem - write back a range of cache lines
+ * @vaddr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back the cache lines starting at 'vaddr' for 'size' bytes.
+ * This function requires explicit ordering with an wmb_pmem() call.
+ */
+static inline void wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	if (arch_has_pmem_api())
+		arch_wb_cache_pmem(addr, size);
+}
 #endif /* __PMEM_H__ */
diff --git a/mm/msync.c b/mm/msync.c
index bb04d53..2a4739c 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -7,6 +7,7 @@ 
 /*
  * The msync() system call.
  */
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
@@ -59,6 +60,7 @@  SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 	for (;;) {
 		struct file *file;
 		loff_t fstart, fend;
+		unsigned long range_len;
 
 		/* Still start < end. */
 		error = -ENOMEM;
@@ -77,10 +79,16 @@  SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 			error = -EBUSY;
 			goto out_unlock;
 		}
+
+		range_len = min(end, vma->vm_end) - start;
+
+		if (vma_is_dax(vma))
+			dax_sync_range(start, range_len);
+
 		file = vma->vm_file;
 		fstart = (start - vma->vm_start) +
 			 ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
-		fend = fstart + (min(end, vma->vm_end) - start) - 1;
+		fend = fstart + range_len - 1;
 		start = vma->vm_end;
 		if ((flags & MS_SYNC) && file &&
 				(vma->vm_flags & VM_SHARED)) {