diff mbox

[2/2] dax: move writeback calls into the filesystems

Message ID 1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ross Zwisler Feb. 7, 2016, 7:19 a.m. UTC
Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
dax_writeback_mapping_range() needs a struct block_device, and it used to
get that from inode->i_sb->s_bdev.  This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem or
raw block device fsync/msync code so that they can supply us with a valid
block device.

It should be noted that this will reduce the number of calls to
dax_writeback_mapping_range() because filemap_write_and_wait_range() is
called in the various filesystems for operations other than just
fsync/msync.  Both ext4 & XFS call filemap_write_and_wait_range() outside
of ->fsync for hole punch, truncate, and block relocation
(xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()).

I don't believe that these extra flushes are necessary in the DAX case.  In
the page cache case when we have dirty data in the page cache, that data
will be actively lost if we evict a dirty page cache page without flushing
it to media first.  For DAX, though, the data will remain consistent with
the physical address to which it was written regardless of whether it's in
the processor cache or not - really the only reason I see to flush is in
response to a fsync or msync so that our data is durable on media in case
of a power loss.  The case where we could throw dirty data out of the page
cache and essentially lose writes simply doesn't exist.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/block_dev.c      |  7 +++++++
 fs/dax.c            |  5 ++---
 fs/ext2/file.c      | 10 ++++++++++
 fs/ext4/fsync.c     | 10 +++++++++-
 fs/xfs/xfs_file.c   | 12 ++++++++++--
 include/linux/dax.h |  4 ++--
 mm/filemap.c        |  6 ------
 7 files changed, 40 insertions(+), 14 deletions(-)

Comments

Dan Williams Feb. 7, 2016, 7:13 p.m. UTC | #1
On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> dax_writeback_mapping_range() needs a struct block_device, and it used to
> get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> block devices and for XFS real-time files.
>
> Instead, call dax_writeback_mapping_range() directly from the filesystem or
> raw block device fsync/msync code so that they can supply us with a valid
> block device.
>
> It should be noted that this will reduce the number of calls to
> dax_writeback_mapping_range() because filemap_write_and_wait_range() is
> called in the various filesystems for operations other than just
> fsync/msync.  Both ext4 & XFS call filemap_write_and_wait_range() outside
> of ->fsync for hole punch, truncate, and block relocation
> (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()).
>
> I don't believe that these extra flushes are necessary in the DAX case.  In
> the page cache case when we have dirty data in the page cache, that data
> will be actively lost if we evict a dirty page cache page without flushing
> it to media first.  For DAX, though, the data will remain consistent with
> the physical address to which it was written regardless of whether it's in
> the processor cache or not - really the only reason I see to flush is in
> response to a fsync or msync so that our data is durable on media in case
> of a power loss.  The case where we could throw dirty data out of the page
> cache and essentially lose writes simply doesn't exist.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/block_dev.c      |  7 +++++++
>  fs/dax.c            |  5 ++---
>  fs/ext2/file.c      | 10 ++++++++++
>  fs/ext4/fsync.c     | 10 +++++++++-
>  fs/xfs/xfs_file.c   | 12 ++++++++++--
>  include/linux/dax.h |  4 ++--
>  mm/filemap.c        |  6 ------
>  7 files changed, 40 insertions(+), 14 deletions(-)

This sprinkling of dax specific fixups outside of vm_operations_struct
routines still has me thinking that we are going in the wrong
direction for fsync/msync support.

If an application is both unaware of DAX and doing mmap I/O it is
better served by the page cache where writeback is durable by default.
We expect DAX-aware applications to assume responsibility for cpu
cache management [1].  Making DAX mmap semantics explicit opt-in
solves not only durability support, but also the current problem that
DAX gets silently disabled leaving an app to wonder if it really got a
direct mapping. DAX also silently picks pud, pmd, or pte mappings
which is information an application would really like to know at map
time.

The proposal: make applications explicitly request DAX semantics with
a new MAP_DAX flag and fail if DAX is unavailable.  Document that a
successful MAP_DAX request mandates that the application assumes
responsibility for cpu cache management.  Require that all
applications that mmap the file agree on MAP_DAX.  This also solves
the future problem of DAX support on virtually tagged cache
architectures where it is difficult for the kernel to know what alias
addresses need flushing.

[1]: https://github.com/pmem/nvml
Dave Chinner Feb. 7, 2016, 9:50 p.m. UTC | #2
On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote:
> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> > dax_writeback_mapping_range() needs a struct block_device, and it used to
> > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> > block devices and for XFS real-time files.
> >
> > Instead, call dax_writeback_mapping_range() directly from the filesystem or
> > raw block device fsync/msync code so that they can supply us with a valid
> > block device.
> >
> > It should be noted that this will reduce the number of calls to
> > dax_writeback_mapping_range() because filemap_write_and_wait_range() is
> > called in the various filesystems for operations other than just
> > fsync/msync.  Both ext4 & XFS call filemap_write_and_wait_range() outside
> > of ->fsync for hole punch, truncate, and block relocation
> > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()).
> >
> > I don't believe that these extra flushes are necessary in the DAX case.  In
> > the page cache case when we have dirty data in the page cache, that data
> > will be actively lost if we evict a dirty page cache page without flushing
> > it to media first.  For DAX, though, the data will remain consistent with
> > the physical address to which it was written regardless of whether it's in
> > the processor cache or not - really the only reason I see to flush is in
> > response to a fsync or msync so that our data is durable on media in case
> > of a power loss.  The case where we could throw dirty data out of the page
> > cache and essentially lose writes simply doesn't exist.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/block_dev.c      |  7 +++++++
> >  fs/dax.c            |  5 ++---
> >  fs/ext2/file.c      | 10 ++++++++++
> >  fs/ext4/fsync.c     | 10 +++++++++-
> >  fs/xfs/xfs_file.c   | 12 ++++++++++--
> >  include/linux/dax.h |  4 ++--
> >  mm/filemap.c        |  6 ------
> >  7 files changed, 40 insertions(+), 14 deletions(-)
> 
> This sprinkling of dax specific fixups outside of vm_operations_struct
> routines still has me thinking that we are going in the wrong
> direction for fsync/msync support.
> 
> If an application is both unaware of DAX and doing mmap I/O it is
> better served by the page cache where writeback is durable by default.
> We expect DAX-aware applications to assume responsibility for cpu
> cache management [1].  Making DAX mmap semantics explicit opt-in
> solves not only durability support, but also the current problem that
> DAX gets silently disabled leaving an app to wonder if it really got a
> direct mapping. DAX also silently picks pud, pmd, or pte mappings
> which is information an application would really like to know at map
> time.
> 
> The proposal: make applications explicitly request DAX semantics with
> a new MAP_DAX flag and fail if DAX is unavailable.

No.

As I've stated before, the entire purpose of enabling DAX through
existing filesytsems like XFS and ext4 is so that existing
applications work with DAX *without modification*.

That is, applications can be entirely unaware of the fact that the
filesystem is giving them direct access to the storage because the
access and failure semantics of DAX enabled mmap are *identical to
the existing mmap semantics*.

Given this, the app doesn't need to care whether DAX is enabled or
not; all that will be seen is a difference in speed of access.
Enabling and disabling DAX is, at this point, purely an
administration decision - if the hardware and filesystem supports
it, it can be turned on without having to wait years for application
developers to add support for it....

-Dave.
Dan Williams Feb. 8, 2016, 8:18 a.m. UTC | #3
On Sun, Feb 7, 2016 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote:
>> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
>> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
>> > dax_writeback_mapping_range() needs a struct block_device, and it used to
>> > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
>> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
>> > block devices and for XFS real-time files.
>> >
>> > Instead, call dax_writeback_mapping_range() directly from the filesystem or
>> > raw block device fsync/msync code so that they can supply us with a valid
>> > block device.
>> >
>> > It should be noted that this will reduce the number of calls to
>> > dax_writeback_mapping_range() because filemap_write_and_wait_range() is
>> > called in the various filesystems for operations other than just
>> > fsync/msync.  Both ext4 & XFS call filemap_write_and_wait_range() outside
>> > of ->fsync for hole punch, truncate, and block relocation
>> > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()).
>> >
>> > I don't believe that these extra flushes are necessary in the DAX case.  In
>> > the page cache case when we have dirty data in the page cache, that data
>> > will be actively lost if we evict a dirty page cache page without flushing
>> > it to media first.  For DAX, though, the data will remain consistent with
>> > the physical address to which it was written regardless of whether it's in
>> > the processor cache or not - really the only reason I see to flush is in
>> > response to a fsync or msync so that our data is durable on media in case
>> > of a power loss.  The case where we could throw dirty data out of the page
>> > cache and essentially lose writes simply doesn't exist.
>> >
>> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> > ---
>> >  fs/block_dev.c      |  7 +++++++
>> >  fs/dax.c            |  5 ++---
>> >  fs/ext2/file.c      | 10 ++++++++++
>> >  fs/ext4/fsync.c     | 10 +++++++++-
>> >  fs/xfs/xfs_file.c   | 12 ++++++++++--
>> >  include/linux/dax.h |  4 ++--
>> >  mm/filemap.c        |  6 ------
>> >  7 files changed, 40 insertions(+), 14 deletions(-)
>>
>> This sprinkling of dax specific fixups outside of vm_operations_struct
>> routines still has me thinking that we are going in the wrong
>> direction for fsync/msync support.
>>
>> If an application is both unaware of DAX and doing mmap I/O it is
>> better served by the page cache where writeback is durable by default.
>> We expect DAX-aware applications to assume responsibility for cpu
>> cache management [1].  Making DAX mmap semantics explicit opt-in
>> solves not only durability support, but also the current problem that
>> DAX gets silently disabled leaving an app to wonder if it really got a
>> direct mapping. DAX also silently picks pud, pmd, or pte mappings
>> which is information an application would really like to know at map
>> time.
>>
>> The proposal: make applications explicitly request DAX semantics with
>> a new MAP_DAX flag and fail if DAX is unavailable.
>
> No.
>
> As I've stated before, the entire purpose of enabling DAX through
> existing filesytsems like XFS and ext4 is so that existing
> applications work with DAX *without modification*.
>
> That is, applications can be entirely unaware of the fact that the
> filesystem is giving them direct access to the storage because the
> access and failure semantics of DAX enabled mmap are *identical to
> the existing mmap semantics*.
>
> Given this, the app doesn't need to care whether DAX is enabled or
> not; all that will be seen is a difference in speed of access.
> Enabling and disabling DAX is, at this point, purely an
> administration decision - if the hardware and filesystem supports
> it, it can be turned on without having to wait years for application
> developers to add support for it....

Setting aside the current block zeroing problem you seem to assuming
that DAX will always be faster and that may not be true at a media
level.  Waiting years for some applications to determine if DAX makes
sense for their use case seems completely reasonable.  In the meantime
the apps that are already making these changes want to know that a DAX
mapping request has not silently dropped backed to page cache.  They
also want to know if they successfully jumped through all the hoops to
get a larger than pte mapping.

I agree it is useful to be able to force DAX on an unmodified
application to see what happens, and it follows that if those
applications want to run in that mode they will need functional
fsync()...

I would feel better if we were talking about specific applications and
performance numbers to know if forcing DAX on application is a debug
facility or a production level capability.  You seem to have already
made that determination and I'm curious what I'm missing.
Jan Kara Feb. 8, 2016, 10:48 a.m. UTC | #4
On Sun 07-02-16 00:19:13, Ross Zwisler wrote:
> Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> dax_writeback_mapping_range() needs a struct block_device, and it used to
> get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> block devices and for XFS real-time files.
> 
> Instead, call dax_writeback_mapping_range() directly from the filesystem or
> raw block device fsync/msync code so that they can supply us with a valid
> block device.
> 
> It should be noted that this will reduce the number of calls to
> dax_writeback_mapping_range() because filemap_write_and_wait_range() is
> called in the various filesystems for operations other than just
> fsync/msync.  Both ext4 & XFS call filemap_write_and_wait_range() outside
> of ->fsync for hole punch, truncate, and block relocation
> (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()).
> 
> I don't believe that these extra flushes are necessary in the DAX case.  In
> the page cache case when we have dirty data in the page cache, that data
> will be actively lost if we evict a dirty page cache page without flushing
> it to media first.  For DAX, though, the data will remain consistent with
> the physical address to which it was written regardless of whether it's in
> the processor cache or not - really the only reason I see to flush is in
> response to a fsync or msync so that our data is durable on media in case
> of a power loss.  The case where we could throw dirty data out of the page
> cache and essentially lose writes simply doesn't exist.

You should at least note that sync(2) won't make data durable with this
patch in the changelog. Dave and Christoph have told you that Linux users
depend on sync(2) to make data durable and I fully agree with them.  Given
current options, I think we can live with this for 4.5 but long term this
is IMO unacceptable.

								Honza
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/block_dev.c      |  7 +++++++
>  fs/dax.c            |  5 ++---
>  fs/ext2/file.c      | 10 ++++++++++
>  fs/ext4/fsync.c     | 10 +++++++++-
>  fs/xfs/xfs_file.c   | 12 ++++++++++--
>  include/linux/dax.h |  4 ++--
>  mm/filemap.c        |  6 ------
>  7 files changed, 40 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index fa0507a..312ad44 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -356,8 +356,15 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
>  {
>  	struct inode *bd_inode = bdev_file_inode(filp);
>  	struct block_device *bdev = I_BDEV(bd_inode);
> +	struct address_space *mapping = bd_inode->i_mapping;
>  	int error;
>  	
> +	if (dax_mapping(mapping) && mapping->nrexceptional) {
> +		error = dax_writeback_mapping_range(mapping, bdev, start, end);
> +		if (error)
> +			return error;
> +	}
> +
>  	error = filemap_write_and_wait_range(filp->f_mapping, start, end);
>  	if (error)
>  		return error;
> diff --git a/fs/dax.c b/fs/dax.c
> index 4592241..4b5006a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev,
>   * end]. This is required by data integrity operations to ensure file data is
>   * on persistent storage prior to completion of the operation.
>   */
> -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
> -		loff_t end)
> +int dax_writeback_mapping_range(struct address_space *mapping,
> +		struct block_device *bdev, loff_t start, loff_t end)
>  {
>  	struct inode *inode = mapping->host;
> -	struct block_device *bdev = inode->i_sb->s_bdev;
>  	pgoff_t start_index, end_index, pmd_index;
>  	pgoff_t indices[PAGEVEC_SIZE];
>  	struct pagevec pvec;
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index 2c88d68..d1abf53 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -162,6 +162,16 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>  	int ret;
>  	struct super_block *sb = file->f_mapping->host->i_sb;
>  	struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
> +#ifdef CONFIG_FS_DAX
> +	struct address_space *inode_mapping = file->f_inode->i_mapping;
> +
> +	if (dax_mapping(inode_mapping) && inode_mapping->nrexceptional) {
> +		ret = dax_writeback_mapping_range(inode_mapping, sb->s_bdev,
> +				start, end);
> +		if (ret)
> +			return ret;
> +	}
> +#endif
>  
>  	ret = generic_file_fsync(file, start, end, datasync);
>  	if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 8850254..e9cf53b 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -27,6 +27,7 @@
>  #include <linux/sched.h>
>  #include <linux/writeback.h>
>  #include <linux/blkdev.h>
> +#include <linux/dax.h>
>  
>  #include "ext4.h"
>  #include "ext4_jbd2.h"
> @@ -83,10 +84,10 @@ static int ext4_sync_parent(struct inode *inode)
>   * What we do is just kick off a commit and wait on it.  This will snapshot the
>   * inode to disk.
>   */
> -
>  int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  {
>  	struct inode *inode = file->f_mapping->host;
> +	struct address_space *mapping = inode->i_mapping;
>  	struct ext4_inode_info *ei = EXT4_I(inode);
>  	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
>  	int ret = 0, err;
> @@ -97,6 +98,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  
>  	trace_ext4_sync_file_enter(file, datasync);
>  
> +	if (dax_mapping(mapping) && mapping->nrexceptional) {
> +		err = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
> +				start, end);
> +		if (err)
> +			goto out;
> +	}
> +
>  	if (inode->i_sb->s_flags & MS_RDONLY) {
>  		/* Make sure that we read updated s_mount_flags value */
>  		smp_rmb();
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 52883ac..84e95cc 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -209,7 +209,8 @@ xfs_file_fsync(
>  	loff_t			end,
>  	int			datasync)
>  {
> -	struct inode		*inode = file->f_mapping->host;
> +	struct address_space	*mapping = file->f_mapping;
> +	struct inode		*inode = mapping->host;
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
>  	int			error = 0;
> @@ -218,7 +219,14 @@ xfs_file_fsync(
>  
>  	trace_xfs_file_fsync(ip);
>  
> -	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> +	if (dax_mapping(mapping) && mapping->nrexceptional) {
> +		error = dax_writeback_mapping_range(mapping,
> +				xfs_find_bdev_for_inode(inode), start, end);
> +		if (error)
> +			return error;
> +	}
> +
> +	error = filemap_write_and_wait_range(mapping, start, end);
>  	if (error)
>  		return error;
>  
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index bad27b0..8e9f114 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -42,6 +42,6 @@ static inline bool dax_mapping(struct address_space *mapping)
>  {
>  	return mapping->host && IS_DAX(mapping->host);
>  }
> -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
> -		loff_t end);
> +int dax_writeback_mapping_range(struct address_space *mapping,
> +		struct block_device *bdev, loff_t start, loff_t end);
>  #endif
> diff --git a/mm/filemap.c b/mm/filemap.c
> index bc94386..c4286eb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -482,12 +482,6 @@ int filemap_write_and_wait_range(struct address_space *mapping,
>  {
>  	int err = 0;
>  
> -	if (dax_mapping(mapping) && mapping->nrexceptional) {
> -		err = dax_writeback_mapping_range(mapping, lstart, lend);
> -		if (err)
> -			return err;
> -	}
> -
>  	if (mapping->nrpages) {
>  		err = __filemap_fdatawrite_range(mapping, lstart, lend,
>  						 WB_SYNC_ALL);
> -- 
> 2.5.0
> 
>
Ross Zwisler Feb. 8, 2016, 4:12 p.m. UTC | #5
On Mon, Feb 08, 2016 at 11:48:50AM +0100, Jan Kara wrote:
> On Sun 07-02-16 00:19:13, Ross Zwisler wrote:
> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> > dax_writeback_mapping_range() needs a struct block_device, and it used to
> > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> > block devices and for XFS real-time files.
> > 
> > Instead, call dax_writeback_mapping_range() directly from the filesystem or
> > raw block device fsync/msync code so that they can supply us with a valid
> > block device.
> > 
> > It should be noted that this will reduce the number of calls to
> > dax_writeback_mapping_range() because filemap_write_and_wait_range() is
> > called in the various filesystems for operations other than just
> > fsync/msync.  Both ext4 & XFS call filemap_write_and_wait_range() outside
> > of ->fsync for hole punch, truncate, and block relocation
> > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()).
> > 
> > I don't believe that these extra flushes are necessary in the DAX case.  In
> > the page cache case when we have dirty data in the page cache, that data
> > will be actively lost if we evict a dirty page cache page without flushing
> > it to media first.  For DAX, though, the data will remain consistent with
> > the physical address to which it was written regardless of whether it's in
> > the processor cache or not - really the only reason I see to flush is in
> > response to a fsync or msync so that our data is durable on media in case
> > of a power loss.  The case where we could throw dirty data out of the page
> > cache and essentially lose writes simply doesn't exist.
> 
> You should at least note that sync(2) won't make data durable with this
> patch in the changelog. Dave and Christoph have told you that Linux users
> depend on sync(2) to make data durable and I fully agree with them.  Given
> current options, I think we can live with this for 4.5 but long term this
> is IMO unacceptable.
> 
> 								Honza

I agree.  I'll add a note to the changelog and will work on adding support for
sync(2).

> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/block_dev.c      |  7 +++++++
> >  fs/dax.c            |  5 ++---
> >  fs/ext2/file.c      | 10 ++++++++++
> >  fs/ext4/fsync.c     | 10 +++++++++-
> >  fs/xfs/xfs_file.c   | 12 ++++++++++--
> >  include/linux/dax.h |  4 ++--
> >  mm/filemap.c        |  6 ------
> >  7 files changed, 40 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index fa0507a..312ad44 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -356,8 +356,15 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
> >  {
> >  	struct inode *bd_inode = bdev_file_inode(filp);
> >  	struct block_device *bdev = I_BDEV(bd_inode);
> > +	struct address_space *mapping = bd_inode->i_mapping;
> >  	int error;
> >  	
> > +	if (dax_mapping(mapping) && mapping->nrexceptional) {
> > +		error = dax_writeback_mapping_range(mapping, bdev, start, end);
> > +		if (error)
> > +			return error;
> > +	}
> > +
> >  	error = filemap_write_and_wait_range(filp->f_mapping, start, end);
> >  	if (error)
> >  		return error;
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 4592241..4b5006a 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev,
> >   * end]. This is required by data integrity operations to ensure file data is
> >   * on persistent storage prior to completion of the operation.
> >   */
> > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
> > -		loff_t end)
> > +int dax_writeback_mapping_range(struct address_space *mapping,
> > +		struct block_device *bdev, loff_t start, loff_t end)
> >  {
> >  	struct inode *inode = mapping->host;
> > -	struct block_device *bdev = inode->i_sb->s_bdev;
> >  	pgoff_t start_index, end_index, pmd_index;
> >  	pgoff_t indices[PAGEVEC_SIZE];
> >  	struct pagevec pvec;
> > diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> > index 2c88d68..d1abf53 100644
> > --- a/fs/ext2/file.c
> > +++ b/fs/ext2/file.c
> > @@ -162,6 +162,16 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> >  	int ret;
> >  	struct super_block *sb = file->f_mapping->host->i_sb;
> >  	struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
> > +#ifdef CONFIG_FS_DAX
> > +	struct address_space *inode_mapping = file->f_inode->i_mapping;
> > +
> > +	if (dax_mapping(inode_mapping) && inode_mapping->nrexceptional) {
> > +		ret = dax_writeback_mapping_range(inode_mapping, sb->s_bdev,
> > +				start, end);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +#endif
> >  
> >  	ret = generic_file_fsync(file, start, end, datasync);
> >  	if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
> > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> > index 8850254..e9cf53b 100644
> > --- a/fs/ext4/fsync.c
> > +++ b/fs/ext4/fsync.c
> > @@ -27,6 +27,7 @@
> >  #include <linux/sched.h>
> >  #include <linux/writeback.h>
> >  #include <linux/blkdev.h>
> > +#include <linux/dax.h>
> >  
> >  #include "ext4.h"
> >  #include "ext4_jbd2.h"
> > @@ -83,10 +84,10 @@ static int ext4_sync_parent(struct inode *inode)
> >   * What we do is just kick off a commit and wait on it.  This will snapshot the
> >   * inode to disk.
> >   */
> > -
> >  int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> >  {
> >  	struct inode *inode = file->f_mapping->host;
> > +	struct address_space *mapping = inode->i_mapping;
> >  	struct ext4_inode_info *ei = EXT4_I(inode);
> >  	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> >  	int ret = 0, err;
> > @@ -97,6 +98,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> >  
> >  	trace_ext4_sync_file_enter(file, datasync);
> >  
> > +	if (dax_mapping(mapping) && mapping->nrexceptional) {
> > +		err = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
> > +				start, end);
> > +		if (err)
> > +			goto out;
> > +	}
> > +
> >  	if (inode->i_sb->s_flags & MS_RDONLY) {
> >  		/* Make sure that we read updated s_mount_flags value */
> >  		smp_rmb();
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 52883ac..84e95cc 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -209,7 +209,8 @@ xfs_file_fsync(
> >  	loff_t			end,
> >  	int			datasync)
> >  {
> > -	struct inode		*inode = file->f_mapping->host;
> > +	struct address_space	*mapping = file->f_mapping;
> > +	struct inode		*inode = mapping->host;
> >  	struct xfs_inode	*ip = XFS_I(inode);
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	int			error = 0;
> > @@ -218,7 +219,14 @@ xfs_file_fsync(
> >  
> >  	trace_xfs_file_fsync(ip);
> >  
> > -	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> > +	if (dax_mapping(mapping) && mapping->nrexceptional) {
> > +		error = dax_writeback_mapping_range(mapping,
> > +				xfs_find_bdev_for_inode(inode), start, end);
> > +		if (error)
> > +			return error;
> > +	}
> > +
> > +	error = filemap_write_and_wait_range(mapping, start, end);
> >  	if (error)
> >  		return error;
> >  
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index bad27b0..8e9f114 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -42,6 +42,6 @@ static inline bool dax_mapping(struct address_space *mapping)
> >  {
> >  	return mapping->host && IS_DAX(mapping->host);
> >  }
> > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
> > -		loff_t end);
> > +int dax_writeback_mapping_range(struct address_space *mapping,
> > +		struct block_device *bdev, loff_t start, loff_t end);
> >  #endif
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index bc94386..c4286eb 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -482,12 +482,6 @@ int filemap_write_and_wait_range(struct address_space *mapping,
> >  {
> >  	int err = 0;
> >  
> > -	if (dax_mapping(mapping) && mapping->nrexceptional) {
> > -		err = dax_writeback_mapping_range(mapping, lstart, lend);
> > -		if (err)
> > -			return err;
> > -	}
> > -
> >  	if (mapping->nrpages) {
> >  		err = __filemap_fdatawrite_range(mapping, lstart, lend,
> >  						 WB_SYNC_ALL);
> > -- 
> > 2.5.0
> > 
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Ross Zwisler Feb. 8, 2016, 6:31 p.m. UTC | #6
On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote:
> The proposal: make applications explicitly request DAX semantics with
> a new MAP_DAX flag and fail if DAX is unavailable.  Document that a
> successful MAP_DAX request mandates that the application assumes
> responsibility for cpu cache management.  

> Require that all applications that mmap the file agree on MAP_DAX.

I think this proposal could run into issues with aliasing.  For example, say
you have two threads accessing the same region, and one wants to use DAX and
the other wants to use the page cache.  What happens?

If we satisfy both requests, we end up with one user reading and writing to
the page cache, while the other is reading and writing directly to the media.
They can't see each other's changes, and you get data corruption.

If we satisfy the request of whoever asked first, sort of lock the inode into
that mode, and then return an error to the second thread because they are
asking for the other mode, we have now introduced a new weird failure case
where mmaps can randomly fail based on the behavior of other applications.
I think this is where you were going with the last line quoted above, but I
don't understand how it would work in an acceptable way.

It seems like we have to have the decision about whether or not to use DAX
made in the same way for all users of the inode so that we don't run into
these types of conflicts.

> This also solves
> the future problem of DAX support on virtually tagged cache
> architectures where it is difficult for the kernel to know what alias
> addresses need flushing.
> 
> [1]: https://github.com/pmem/nvml
Dan Williams Feb. 8, 2016, 7:23 p.m. UTC | #7
On Mon, Feb 8, 2016 at 10:31 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote:
>> The proposal: make applications explicitly request DAX semantics with
>> a new MAP_DAX flag and fail if DAX is unavailable.  Document that a
>> successful MAP_DAX request mandates that the application assumes
>> responsibility for cpu cache management.
>
>> Require that all applications that mmap the file agree on MAP_DAX.
>
> I think this proposal could run into issues with aliasing.  For example, say
> you have two threads accessing the same region, and one wants to use DAX and
> the other wants to use the page cache.  What happens?
>
> If we satisfy both requests, we end up with one user reading and writing to
> the page cache, while the other is reading and writing directly to the media.
> They can't see each other's changes, and you get data corruption.
>
> If we satisfy the request of whoever asked first, sort of lock the inode into
> that mode, and then return an error to the second thread because they are
> asking for the other mode, we have now introduced a new weird failure case
> where mmaps can randomly fail based on the behavior of other applications.
> I think this is where you were going with the last line quoted above, but I
> don't understand how it would work in an acceptable way.
>
> It seems like we have to have the decision about whether or not to use DAX
> made in the same way for all users of the inode so that we don't run into
> these types of conflicts.

We haven't solved the conflict problem by pushing it out to the inode,
see the recent revert of blkdev_daxset().  We're heading in a
direction where an application can't develop it's own policies about
DAX usage, it's always an administrative decision.  However, maybe
that is ok.  Dave is right that if an application is using an existing
filesystem it should get all the existing semantics.

If the existing semantics (or overhead of maintaining the existing
semantics) turn out not to fit a given pmem-aware application then we
may just need new interfaces (separate from fs/dax.c) to persistent
memory.  I admit we're a ways off from knowing if that is needed.
Dave Chinner Feb. 8, 2016, 8:18 p.m. UTC | #8
On Mon, Feb 08, 2016 at 12:18:11AM -0800, Dan Williams wrote:
> On Sun, Feb 7, 2016 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote:
> >> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> >> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> >> > dax_writeback_mapping_range() needs a struct block_device, and it used to
> >> > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> >> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> >> > block devices and for XFS real-time files.
> >> >
> >> > Instead, call dax_writeback_mapping_range() directly from the filesystem or
> >> > raw block device fsync/msync code so that they can supply us with a valid
> >> > block device.
> >> >
> >> > It should be noted that this will reduce the number of calls to
> >> > dax_writeback_mapping_range() because filemap_write_and_wait_range() is
> >> > called in the various filesystems for operations other than just
> >> > fsync/msync.  Both ext4 & XFS call filemap_write_and_wait_range() outside
> >> > of ->fsync for hole punch, truncate, and block relocation
> >> > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()).
> >> >
> >> > I don't believe that these extra flushes are necessary in the DAX case.  In
> >> > the page cache case when we have dirty data in the page cache, that data
> >> > will be actively lost if we evict a dirty page cache page without flushing
> >> > it to media first.  For DAX, though, the data will remain consistent with
> >> > the physical address to which it was written regardless of whether it's in
> >> > the processor cache or not - really the only reason I see to flush is in
> >> > response to a fsync or msync so that our data is durable on media in case
> >> > of a power loss.  The case where we could throw dirty data out of the page
> >> > cache and essentially lose writes simply doesn't exist.
> >> >
> >> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> > ---
> >> >  fs/block_dev.c      |  7 +++++++
> >> >  fs/dax.c            |  5 ++---
> >> >  fs/ext2/file.c      | 10 ++++++++++
> >> >  fs/ext4/fsync.c     | 10 +++++++++-
> >> >  fs/xfs/xfs_file.c   | 12 ++++++++++--
> >> >  include/linux/dax.h |  4 ++--
> >> >  mm/filemap.c        |  6 ------
> >> >  7 files changed, 40 insertions(+), 14 deletions(-)
> >>
> >> This sprinkling of dax specific fixups outside of vm_operations_struct
> >> routines still has me thinking that we are going in the wrong
> >> direction for fsync/msync support.
> >>
> >> If an application is both unaware of DAX and doing mmap I/O it is
> >> better served by the page cache where writeback is durable by default.
> >> We expect DAX-aware applications to assume responsibility for cpu
> >> cache management [1].  Making DAX mmap semantics explicit opt-in
> >> solves not only durability support, but also the current problem that
> >> DAX gets silently disabled leaving an app to wonder if it really got a
> >> direct mapping. DAX also silently picks pud, pmd, or pte mappings
> >> which is information an application would really like to know at map
> >> time.
> >>
> >> The proposal: make applications explicitly request DAX semantics with
> >> a new MAP_DAX flag and fail if DAX is unavailable.
> >
> > No.
> >
> > As I've stated before, the entire purpose of enabling DAX through
> > existing filesytsems like XFS and ext4 is so that existing
> > applications work with DAX *without modification*.
> >
> > That is, applications can be entirely unaware of the fact that the
> > filesystem is giving them direct access to the storage because the
> > access and failure semantics of DAX enabled mmap are *identical to
> > the existing mmap semantics*.
> >
> > Given this, the app doesn't need to care whether DAX is enabled or
> > not; all that will be seen is a difference in speed of access.
> > Enabling and disabling DAX is, at this point, purely an
> > administration decision - if the hardware and filesystem supports
> > it, it can be turned on without having to wait years for application
> > developers to add support for it....
> 
> Setting aside the current block zeroing problem you seem to assuming
> that DAX will always be faster and that may not be true at a media
> level.  Waiting years for some applications to determine if DAX makes
> sense for their use case seems completely reasonable.  In the meantime
> the apps that are already making these changes want to know that a DAX
> mapping request has not silently dropped backed to page cache.  They
> also want to know if they successfully jumped through all the hoops to
> get a larger than pte mapping.
> 
> I agree it is useful to be able to force DAX on an unmodified
> application to see what happens, and it follows that if those
> applications want to run in that mode they will need functional
> fsync()...
> 
> I would feel better if we were talking about specific applications and
> performance numbers to know if forcing DAX on application is a debug
> facility or a production level capability.  You seem to have already
> made that determination and I'm curious what I'm missing.

I'm not setting any policy here at all.  This whole argument is
based around the DAX mount option doing "global fs enable or
silently turning it off" and the application not knowing about that.

The whole point of having a persistent per-inode DAX flags is that
it is a policy mechanism, not a policy.  The application can, if it
is DAX aware, directly control whether DAX is used on a file or not.
The application can even query and clear that persistent inode flag
if it is configured not to (or cannot) use DAX.

If the filesystem cannot support DAX, then we can error out attempts
to set the DAX flag and then the app knows DAX is not available.
i.e. the attempt to set policy failed. If the flag is set, then the
inode will *always* use DAX - there is no "fall back to page cache"
when DAX is enabled.

If the applicaiton is not DAX aware, then the admin can control the
DAX policy by manipulating these flags themselves, and hence control
whether DAX is used by the application or not.

If you think I'm dictating policy for DAX users and application,
then you haven't understood anything I've previously said about why
the DAX mount option needs to die before any of this is considered
production ready. DAX is not an opaque "all or nothing" option. XFS
will provide apps and admins with fine-grained, persistent,
discoverable policy flags to allow admins and applications to set
DAX policies however they see fit. This simply cannot be done if the
only knob you have is a mount option that may or may not stick.

Cheers,

Dave.
Dan Williams Feb. 8, 2016, 8:55 p.m. UTC | #9
On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner <david@fromorbit.com> wrote:
[..]
>> Setting aside the current block zeroing problem you seem to assuming
>> that DAX will always be faster and that may not be true at a media
>> level.  Waiting years for some applications to determine if DAX makes
>> sense for their use case seems completely reasonable.  In the meantime
>> the apps that are already making these changes want to know that a DAX
>> mapping request has not silently dropped backed to page cache.  They
>> also want to know if they successfully jumped through all the hoops to
>> get a larger than pte mapping.
>>
>> I agree it is useful to be able to force DAX on an unmodified
>> application to see what happens, and it follows that if those
>> applications want to run in that mode they will need functional
>> fsync()...
>>
>> I would feel better if we were talking about specific applications and
>> performance numbers to know if forcing DAX on application is a debug
>> facility or a production level capability.  You seem to have already
>> made that determination and I'm curious what I'm missing.
>
> I'm not setting any policy here at all.  This whole argument is
> based around the DAX mount option doing "global fs enable or
> silently turning it off" and the application not knowing about that.
>
> The whole point of having a persistent per-inode DAX flags is that
> it is a policy mechanism, not a policy.  The application can, if it
> is DAX aware, directly control whether DAX is used on a file or not.
> The application can even query and clear that persistent inode flag
> if it is configured not to (or cannot) use DAX.
>
> If the filesystem cannot support DAX, then we can error out attempts
> to set the DAX flag and then the app knows DAX is not available.
> i.e. the attempt to set policy failed. If the flag is set, then the
> inode will *always* use DAX - there is no "fall back to page cache"
> when DAX is enabled.
>
> If the applicaiton is not DAX aware, then the admin can control the
> DAX policy by manipulating these flags themselves, and hence control
> whether DAX is used by the application or not.
>
> If you think I'm dictating policy for DAX users and application,
> then you haven't understood anything I've previously said about why
> the DAX mount option needs to die before any of this is considered
> production ready. DAX is not an opaque "all or nothing" option. XFS
> will provide apps and admins with fine-grained, persistent,
> discoverable policy flags to allow admins and applications to set
> DAX policies however they see fit. This simply cannot be done if the
> only knob you have is a mount option that may or may not stick.

I agree the mount option needs to die, and I fully grok the reasoning.
  What I'm concerned with is that a system using fully-DAX-aware
applications is forced to incur the overhead of maintaining *sync
semantics, periodic sync(2) in particular,  even if it is not relying
on those semantics.

However, like I said in my other mail, we can solve that with
alternate interfaces to persistent memory if that becomes an issue and
not require that "disable *sync" capability to come through DAX.
Jeff Moyer Feb. 8, 2016, 8:58 p.m. UTC | #10
Dan Williams <dan.j.williams@intel.com> writes:

> I agree the mount option needs to die, and I fully grok the reasoning.
>   What I'm concerned with is that a system using fully-DAX-aware
> applications is forced to incur the overhead of maintaining *sync
> semantics, periodic sync(2) in particular,  even if it is not relying
> on those semantics.
>
> However, like I said in my other mail, we can solve that with
> alternate interfaces to persistent memory if that becomes an issue and
> not require that "disable *sync" capability to come through DAX.

What do you envision these alternate interfaces looking like?

-Jeff
Dan Williams Feb. 8, 2016, 10:05 p.m. UTC | #11
On Mon, Feb 8, 2016 at 12:58 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> I agree the mount option needs to die, and I fully grok the reasoning.
>>   What I'm concerned with is that a system using fully-DAX-aware
>> applications is forced to incur the overhead of maintaining *sync
>> semantics, periodic sync(2) in particular,  even if it is not relying
>> on those semantics.
>>
>> However, like I said in my other mail, we can solve that with
>> alternate interfaces to persistent memory if that becomes an issue and
>> not require that "disable *sync" capability to come through DAX.
>
> What do you envision these alternate interfaces looking like?

Well, plan-A was making DAX be explicit opt-in for applications, I
haven't thought too much about plan-B.  I expect it to be driven by
real performance numbers and application use cases once the *sync
compat work completes.
Jan Kara Feb. 9, 2016, 9:43 a.m. UTC | #12
On Mon 08-02-16 12:55:24, Dan Williams wrote:
> On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner <david@fromorbit.com> wrote:
> [..]
> >> Setting aside the current block zeroing problem you seem to assuming
> >> that DAX will always be faster and that may not be true at a media
> >> level.  Waiting years for some applications to determine if DAX makes
> >> sense for their use case seems completely reasonable.  In the meantime
> >> the apps that are already making these changes want to know that a DAX
> >> mapping request has not silently dropped backed to page cache.  They
> >> also want to know if they successfully jumped through all the hoops to
> >> get a larger than pte mapping.
> >>
> >> I agree it is useful to be able to force DAX on an unmodified
> >> application to see what happens, and it follows that if those
> >> applications want to run in that mode they will need functional
> >> fsync()...
> >>
> >> I would feel better if we were talking about specific applications and
> >> performance numbers to know if forcing DAX on application is a debug
> >> facility or a production level capability.  You seem to have already
> >> made that determination and I'm curious what I'm missing.
> >
> > I'm not setting any policy here at all.  This whole argument is
> > based around the DAX mount option doing "global fs enable or
> > silently turning it off" and the application not knowing about that.
> >
> > The whole point of having a persistent per-inode DAX flags is that
> > it is a policy mechanism, not a policy.  The application can, if it
> > is DAX aware, directly control whether DAX is used on a file or not.
> > The application can even query and clear that persistent inode flag
> > if it is configured not to (or cannot) use DAX.
> >
> > If the filesystem cannot support DAX, then we can error out attempts
> > to set the DAX flag and then the app knows DAX is not available.
> > i.e. the attempt to set policy failed. If the flag is set, then the
> > inode will *always* use DAX - there is no "fall back to page cache"
> > when DAX is enabled.
> >
> > If the applicaiton is not DAX aware, then the admin can control the
> > DAX policy by manipulating these flags themselves, and hence control
> > whether DAX is used by the application or not.
> >
> > If you think I'm dictating policy for DAX users and application,
> > then you haven't understood anything I've previously said about why
> > the DAX mount option needs to die before any of this is considered
> > production ready. DAX is not an opaque "all or nothing" option. XFS
> > will provide apps and admins with fine-grained, persistent,
> > discoverable policy flags to allow admins and applications to set
> > DAX policies however they see fit. This simply cannot be done if the
> > only knob you have is a mount option that may or may not stick.
> 
> I agree the mount option needs to die, and I fully grok the reasoning.
>   What I'm concerned with is that a system using fully-DAX-aware
> applications is forced to incur the overhead of maintaining *sync
> semantics, periodic sync(2) in particular,  even if it is not relying
> on those semantics.

Let me somewhat correct this: IMO hard requirement is maintaining sync(2)
semantics. Periodic writeback does not have any hard durability guarantees
and we are free to ignore such requests in ->writepages() (that function
has enough information in the writeback_control structure to differentiate
between periodic writeback and data integrity sync) if we decide it is
useful. Actually, we could do that even for 4.5.


								Honza
Jan Kara Feb. 9, 2016, 4:01 p.m. UTC | #13
On Tue 09-02-16 10:43:53, Jan Kara wrote:
> On Mon 08-02-16 12:55:24, Dan Williams wrote:
> > On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner <david@fromorbit.com> wrote:
> > [..]
> > >> Setting aside the current block zeroing problem you seem to assuming
> > >> that DAX will always be faster and that may not be true at a media
> > >> level.  Waiting years for some applications to determine if DAX makes
> > >> sense for their use case seems completely reasonable.  In the meantime
> > >> the apps that are already making these changes want to know that a DAX
> > >> mapping request has not silently dropped backed to page cache.  They
> > >> also want to know if they successfully jumped through all the hoops to
> > >> get a larger than pte mapping.
> > >>
> > >> I agree it is useful to be able to force DAX on an unmodified
> > >> application to see what happens, and it follows that if those
> > >> applications want to run in that mode they will need functional
> > >> fsync()...
> > >>
> > >> I would feel better if we were talking about specific applications and
> > >> performance numbers to know if forcing DAX on application is a debug
> > >> facility or a production level capability.  You seem to have already
> > >> made that determination and I'm curious what I'm missing.
> > >
> > > I'm not setting any policy here at all.  This whole argument is
> > > based around the DAX mount option doing "global fs enable or
> > > silently turning it off" and the application not knowing about that.
> > >
> > > The whole point of having a persistent per-inode DAX flags is that
> > > it is a policy mechanism, not a policy.  The application can, if it
> > > is DAX aware, directly control whether DAX is used on a file or not.
> > > The application can even query and clear that persistent inode flag
> > > if it is configured not to (or cannot) use DAX.
> > >
> > > If the filesystem cannot support DAX, then we can error out attempts
> > > to set the DAX flag and then the app knows DAX is not available.
> > > i.e. the attempt to set policy failed. If the flag is set, then the
> > > inode will *always* use DAX - there is no "fall back to page cache"
> > > when DAX is enabled.
> > >
> > > If the applicaiton is not DAX aware, then the admin can control the
> > > DAX policy by manipulating these flags themselves, and hence control
> > > whether DAX is used by the application or not.
> > >
> > > If you think I'm dictating policy for DAX users and application,
> > > then you haven't understood anything I've previously said about why
> > > the DAX mount option needs to die before any of this is considered
> > > production ready. DAX is not an opaque "all or nothing" option. XFS
> > > will provide apps and admins with fine-grained, persistent,
> > > discoverable policy flags to allow admins and applications to set
> > > DAX policies however they see fit. This simply cannot be done if the
> > > only knob you have is a mount option that may or may not stick.
> > 
> > I agree the mount option needs to die, and I fully grok the reasoning.
> >   What I'm concerned with is that a system using fully-DAX-aware
> > applications is forced to incur the overhead of maintaining *sync
> > semantics, periodic sync(2) in particular,  even if it is not relying
> > on those semantics.
> 
> Let me somewhat correct this: IMO hard requirement is maintaining sync(2)
> semantics. Periodic writeback does not have any hard durability guarantees
> and we are free to ignore such requests in ->writepages() (that function
> has enough information in the writeback_control structure to differentiate
> between periodic writeback and data integrity sync) if we decide it is
> useful. Actually, we could do that even for 4.5.

Attached is a version of Ross' patch that will work for sync(2) and
fsync(2) and we won't flush caches during periodic writeback. The patch is
only compile-tested. Ross?

								Honza
Ross Zwisler Feb. 9, 2016, 6:06 p.m. UTC | #14
On Tue, Feb 09, 2016 at 05:01:34PM +0100, Jan Kara wrote:
> On Tue 09-02-16 10:43:53, Jan Kara wrote:
> > On Mon 08-02-16 12:55:24, Dan Williams wrote:
> > > On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > [..]
> > > >> Setting aside the current block zeroing problem you seem to assuming
> > > >> that DAX will always be faster and that may not be true at a media
> > > >> level.  Waiting years for some applications to determine if DAX makes
> > > >> sense for their use case seems completely reasonable.  In the meantime
> > > >> the apps that are already making these changes want to know that a DAX
> > > >> mapping request has not silently dropped backed to page cache.  They
> > > >> also want to know if they successfully jumped through all the hoops to
> > > >> get a larger than pte mapping.
> > > >>
> > > >> I agree it is useful to be able to force DAX on an unmodified
> > > >> application to see what happens, and it follows that if those
> > > >> applications want to run in that mode they will need functional
> > > >> fsync()...
> > > >>
> > > >> I would feel better if we were talking about specific applications and
> > > >> performance numbers to know if forcing DAX on application is a debug
> > > >> facility or a production level capability.  You seem to have already
> > > >> made that determination and I'm curious what I'm missing.
> > > >
> > > > I'm not setting any policy here at all.  This whole argument is
> > > > based around the DAX mount option doing "global fs enable or
> > > > silently turning it off" and the application not knowing about that.
> > > >
> > > > The whole point of having a persistent per-inode DAX flags is that
> > > > it is a policy mechanism, not a policy.  The application can, if it
> > > > is DAX aware, directly control whether DAX is used on a file or not.
> > > > The application can even query and clear that persistent inode flag
> > > > if it is configured not to (or cannot) use DAX.
> > > >
> > > > If the filesystem cannot support DAX, then we can error out attempts
> > > > to set the DAX flag and then the app knows DAX is not available.
> > > > i.e. the attempt to set policy failed. If the flag is set, then the
> > > > inode will *always* use DAX - there is no "fall back to page cache"
> > > > when DAX is enabled.
> > > >
> > > > If the applicaiton is not DAX aware, then the admin can control the
> > > > DAX policy by manipulating these flags themselves, and hence control
> > > > whether DAX is used by the application or not.
> > > >
> > > > If you think I'm dictating policy for DAX users and application,
> > > > then you haven't understood anything I've previously said about why
> > > > the DAX mount option needs to die before any of this is considered
> > > > production ready. DAX is not an opaque "all or nothing" option. XFS
> > > > will provide apps and admins with fine-grained, persistent,
> > > > discoverable policy flags to allow admins and applications to set
> > > > DAX policies however they see fit. This simply cannot be done if the
> > > > only knob you have is a mount option that may or may not stick.
> > > 
> > > I agree the mount option needs to die, and I fully grok the reasoning.
> > >   What I'm concerned with is that a system using fully-DAX-aware
> > > applications is forced to incur the overhead of maintaining *sync
> > > semantics, periodic sync(2) in particular,  even if it is not relying
> > > on those semantics.
> > 
> > Let me somewhat correct this: IMO hard requirement is maintaining sync(2)
> > semantics. Periodic writeback does not have any hard durability guarantees
> > and we are free to ignore such requests in ->writepages() (that function
> > has enough information in the writeback_control structure to differentiate
> > between periodic writeback and data integrity sync) if we decide it is
> > useful. Actually, we could do that even for 4.5.
> 
> Attached is a version of Ross' patch that will work for sync(2) and
> fsync(2) and we won't flush caches during periodic writeback. The patch is
> only compile-tested. Ross?

This looks great.  I'll send out a v2 with this and with the
dax_clear_sectors() changes after I'm done testing.
diff mbox

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index fa0507a..312ad44 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -356,8 +356,15 @@  int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
 {
 	struct inode *bd_inode = bdev_file_inode(filp);
 	struct block_device *bdev = I_BDEV(bd_inode);
+	struct address_space *mapping = bd_inode->i_mapping;
 	int error;
 	
+	if (dax_mapping(mapping) && mapping->nrexceptional) {
+		error = dax_writeback_mapping_range(mapping, bdev, start, end);
+		if (error)
+			return error;
+	}
+
 	error = filemap_write_and_wait_range(filp->f_mapping, start, end);
 	if (error)
 		return error;
diff --git a/fs/dax.c b/fs/dax.c
index 4592241..4b5006a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -484,11 +484,10 @@  static int dax_writeback_one(struct block_device *bdev,
  * end]. This is required by data integrity operations to ensure file data is
  * on persistent storage prior to completion of the operation.
  */
-int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
-		loff_t end)
+int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, loff_t start, loff_t end)
 {
 	struct inode *inode = mapping->host;
-	struct block_device *bdev = inode->i_sb->s_bdev;
 	pgoff_t start_index, end_index, pmd_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 2c88d68..d1abf53 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -162,6 +162,16 @@  int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	int ret;
 	struct super_block *sb = file->f_mapping->host->i_sb;
 	struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+#ifdef CONFIG_FS_DAX
+	struct address_space *inode_mapping = file->f_inode->i_mapping;
+
+	if (dax_mapping(inode_mapping) && inode_mapping->nrexceptional) {
+		ret = dax_writeback_mapping_range(inode_mapping, sb->s_bdev,
+				start, end);
+		if (ret)
+			return ret;
+	}
+#endif
 
 	ret = generic_file_fsync(file, start, end, datasync);
 	if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 8850254..e9cf53b 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -27,6 +27,7 @@ 
 #include <linux/sched.h>
 #include <linux/writeback.h>
 #include <linux/blkdev.h>
+#include <linux/dax.h>
 
 #include "ext4.h"
 #include "ext4_jbd2.h"
@@ -83,10 +84,10 @@  static int ext4_sync_parent(struct inode *inode)
  * What we do is just kick off a commit and wait on it.  This will snapshot the
  * inode to disk.
  */
-
 int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 {
 	struct inode *inode = file->f_mapping->host;
+	struct address_space *mapping = inode->i_mapping;
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
 	int ret = 0, err;
@@ -97,6 +98,13 @@  int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 
 	trace_ext4_sync_file_enter(file, datasync);
 
+	if (dax_mapping(mapping) && mapping->nrexceptional) {
+		err = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
+				start, end);
+		if (err)
+			goto out;
+	}
+
 	if (inode->i_sb->s_flags & MS_RDONLY) {
 		/* Make sure that we read updated s_mount_flags value */
 		smp_rmb();
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 52883ac..84e95cc 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -209,7 +209,8 @@  xfs_file_fsync(
 	loff_t			end,
 	int			datasync)
 {
-	struct inode		*inode = file->f_mapping->host;
+	struct address_space	*mapping = file->f_mapping;
+	struct inode		*inode = mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	int			error = 0;
@@ -218,7 +219,14 @@  xfs_file_fsync(
 
 	trace_xfs_file_fsync(ip);
 
-	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
+	if (dax_mapping(mapping) && mapping->nrexceptional) {
+		error = dax_writeback_mapping_range(mapping,
+				xfs_find_bdev_for_inode(inode), start, end);
+		if (error)
+			return error;
+	}
+
+	error = filemap_write_and_wait_range(mapping, start, end);
 	if (error)
 		return error;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index bad27b0..8e9f114 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -42,6 +42,6 @@  static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
-int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
-		loff_t end);
+int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, loff_t start, loff_t end);
 #endif
diff --git a/mm/filemap.c b/mm/filemap.c
index bc94386..c4286eb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -482,12 +482,6 @@  int filemap_write_and_wait_range(struct address_space *mapping,
 {
 	int err = 0;
 
-	if (dax_mapping(mapping) && mapping->nrexceptional) {
-		err = dax_writeback_mapping_range(mapping, lstart, lend);
-		if (err)
-			return err;
-	}
-
 	if (mapping->nrpages) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);