mbox series

[V5,00/12] Enable per-file/per-directory DAX operations V5

Message ID 20200227052442.22524-1-ira.weiny@intel.com (mailing list archive)
Headers show
Series Enable per-file/per-directory DAX operations V5 | expand

Message

Ira Weiny Feb. 27, 2020, 5:24 a.m. UTC
From: Ira Weiny <ira.weiny@intel.com>

Changes from V4:
	* Open code the aops lock rather than add it to the xfs_ilock()
	  subsystem (Darrick's comments were obsoleted by this change)
	* Fix lkp build suggestions and bugs

Changes from V3:
	* Remove global locking...  :-D
	* put back per inode locking and remove pre-mature optimizations
	* Fix issues with Directories having IS_DAX() set
	* Fix kernel crash issues reported by Jeff
	* Add some clean up patches
	* Consolidate diflags to iflags functions
	* Update/add documentation
	* Reorder/rename patches quite a bit

Changes from V2:

	* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
		Internal discussions with Dan determined this would be easier,
		just as performant, and slightly less overhead that having it
		in the SB as suggested by Jan
	* Fix locking order in comments and throughout code
	* Change "mode" to "state" throughout commits
	* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
		configured
	* Add static branch for which is activated by a device which supports
		DAX in XFS
	* Change "lock/unlock" to up/down read/write as appropriate
		Previous names were over simplified
	* Update comments/documentation

	* Remove the xfs specific lock to the vfs (global) layer.
	* Fix i_dax_sem locking order and comments

	* Move 'i_mapped' count from struct inode to struct address_space and
		rename it to mmap_count
	* Add inode_has_mappings() call

	* Fix build issues
	* Clean up syntax spacing and minor issues
	* Update man page text for STATX_ATTR_DAX
	* Add reviewed-by's
	* Rebase to 5.6

	Rename patch:
		from: fs/xfs: Add lock/unlock state to xfs
		to: fs/xfs: Add write DAX lock to xfs layer
	Add patch:
		fs/xfs: Clarify lockdep dependency for xfs_isilocked()
	Drop patch:
		fs/xfs: Fix truncate up


At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
consumption due to their inability to detect whether the kernel will
instantiate page cache for a file, and cases where a global dax enable via a
mount option is too coarse.

The following patch series enables selecting the use of DAX on individual files
and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
scheme the dax mount option can be omitted to allow the per-file property to
take effect.

The insight at LSF/MM was to separate the per-mount or per-file "physical"
capability switch from an "effective" attribute for the file.

At LSF/MM we discussed the difficulties of switching the DAX state of a file
with active mappings / page cache.  It was thought the races could be avoided
by limiting DAX state flips to 0-length files.

However, this turns out to not be true.[3] This is because address space
operations (a_ops) may be in use at any time the inode is referenced and users
have expressed a desire to be able to change the DAX state on a file with data
in it.  For those reasons this patch set allows changing the DAX state flag on
a file as long as it is not current mapped.

Details of when and how DAX state can be changed on a file is included in a
documentation patch.

It should be noted that the physical DAX flag inheritance is not shown in this
patch set as it was maintained from previous work on XFS.  The physical DAX
flag and it's inheritance will need to be added to other file systems for user
control. 

As submitted this works on real hardware testing.


[1] https://lwn.net/Articles/787973/
[2] https://lwn.net/Articles/787233/
[3] https://lkml.org/lkml/2019/10/20/96
[4] https://patchwork.kernel.org/patch/11310511/


To: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org


Ira Weiny (12):
  fs/xfs: Remove unnecessary initialization of i_rwsem
  fs: Remove unneeded IS_DAX() check
  fs/stat: Define DAX statx attribute
  fs/xfs: Isolate the physical DAX flag from enabled
  fs/xfs: Create function xfs_inode_enable_dax()
  fs: Add locking for a dynamic address space operations state
  fs: Prevent DAX state change if file is mmap'ed
  fs/xfs: Hold off aops users while changing DAX state
  fs/xfs: Clean up locking in dax invalidate
  fs/xfs: Allow toggle of effective DAX flag
  fs/xfs: Remove xfs_diflags_to_linux()
  Documentation/dax: Update Usage section

 Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++-
 Documentation/filesystems/vfs.rst | 16 +++++
 fs/attr.c                         |  1 +
 fs/inode.c                        | 16 ++++-
 fs/iomap/buffered-io.c            |  1 +
 fs/open.c                         |  4 ++
 fs/stat.c                         |  5 ++
 fs/xfs/xfs_icache.c               |  5 +-
 fs/xfs/xfs_inode.h                |  2 +
 fs/xfs/xfs_ioctl.c                | 98 +++++++++++++++----------------
 fs/xfs/xfs_iops.c                 | 69 +++++++++++++++-------
 include/linux/fs.h                | 73 ++++++++++++++++++++++-
 include/uapi/linux/stat.h         |  1 +
 mm/fadvise.c                      |  7 ++-
 mm/filemap.c                      |  4 ++
 mm/huge_memory.c                  |  1 +
 mm/khugepaged.c                   |  2 +
 mm/mmap.c                         | 19 +++++-
 mm/util.c                         |  9 ++-
 19 files changed, 328 insertions(+), 89 deletions(-)

Comments

Christoph Hellwig March 5, 2020, 3:51 p.m. UTC | #1
FYI, I still will fully NAK any series that adds additional locks
and thus atomic instructions to basically every fs call, and grows
the inode by a rw_semaphore plus and atomic64_t.  I also think the
whole idea of switching operation vectors at runtime is fatally flawed
and we should never add such code, nevermind just for a fringe usecase
of a fringe feature.

On Wed, Feb 26, 2020 at 09:24:30PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Changes from V4:
> 	* Open code the aops lock rather than add it to the xfs_ilock()
> 	  subsystem (Darrick's comments were obsoleted by this change)
> 	* Fix lkp build suggestions and bugs
> 
> Changes from V3:
> 	* Remove global locking...  :-D
> 	* put back per inode locking and remove pre-mature optimizations
> 	* Fix issues with Directories having IS_DAX() set
> 	* Fix kernel crash issues reported by Jeff
> 	* Add some clean up patches
> 	* Consolidate diflags to iflags functions
> 	* Update/add documentation
> 	* Reorder/rename patches quite a bit
> 
> Changes from V2:
> 
> 	* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
> 		Internal discussions with Dan determined this would be easier,
> 		just as performant, and slightly less overhead that having it
> 		in the SB as suggested by Jan
> 	* Fix locking order in comments and throughout code
> 	* Change "mode" to "state" throughout commits
> 	* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
> 		configured
> 	* Add static branch for which is activated by a device which supports
> 		DAX in XFS
> 	* Change "lock/unlock" to up/down read/write as appropriate
> 		Previous names were over simplified
> 	* Update comments/documentation
> 
> 	* Remove the xfs specific lock to the vfs (global) layer.
> 	* Fix i_dax_sem locking order and comments
> 
> 	* Move 'i_mapped' count from struct inode to struct address_space and
> 		rename it to mmap_count
> 	* Add inode_has_mappings() call
> 
> 	* Fix build issues
> 	* Clean up syntax spacing and minor issues
> 	* Update man page text for STATX_ATTR_DAX
> 	* Add reviewed-by's
> 	* Rebase to 5.6
> 
> 	Rename patch:
> 		from: fs/xfs: Add lock/unlock state to xfs
> 		to: fs/xfs: Add write DAX lock to xfs layer
> 	Add patch:
> 		fs/xfs: Clarify lockdep dependency for xfs_isilocked()
> 	Drop patch:
> 		fs/xfs: Fix truncate up
> 
> 
> At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
> consumption due to their inability to detect whether the kernel will
> instantiate page cache for a file, and cases where a global dax enable via a
> mount option is too coarse.
> 
> The following patch series enables selecting the use of DAX on individual files
> and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
> scheme the dax mount option can be omitted to allow the per-file property to
> take effect.
> 
> The insight at LSF/MM was to separate the per-mount or per-file "physical"
> capability switch from an "effective" attribute for the file.
> 
> At LSF/MM we discussed the difficulties of switching the DAX state of a file
> with active mappings / page cache.  It was thought the races could be avoided
> by limiting DAX state flips to 0-length files.
> 
> However, this turns out to not be true.[3] This is because address space
> operations (a_ops) may be in use at any time the inode is referenced and users
> have expressed a desire to be able to change the DAX state on a file with data
> in it.  For those reasons this patch set allows changing the DAX state flag on
> a file as long as it is not current mapped.
> 
> Details of when and how DAX state can be changed on a file is included in a
> documentation patch.
> 
> It should be noted that the physical DAX flag inheritance is not shown in this
> patch set as it was maintained from previous work on XFS.  The physical DAX
> flag and it's inheritance will need to be added to other file systems for user
> control. 
> 
> As submitted this works on real hardware testing.
> 
> 
> [1] https://lwn.net/Articles/787973/
> [2] https://lwn.net/Articles/787233/
> [3] https://lkml.org/lkml/2019/10/20/96
> [4] https://patchwork.kernel.org/patch/11310511/
> 
> 
> To: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
> Cc: Jan Kara <jack@suse.cz>
> Cc: linux-ext4@vger.kernel.org
> Cc: linux-xfs@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> 
> 
> Ira Weiny (12):
>   fs/xfs: Remove unnecessary initialization of i_rwsem
>   fs: Remove unneeded IS_DAX() check
>   fs/stat: Define DAX statx attribute
>   fs/xfs: Isolate the physical DAX flag from enabled
>   fs/xfs: Create function xfs_inode_enable_dax()
>   fs: Add locking for a dynamic address space operations state
>   fs: Prevent DAX state change if file is mmap'ed
>   fs/xfs: Hold off aops users while changing DAX state
>   fs/xfs: Clean up locking in dax invalidate
>   fs/xfs: Allow toggle of effective DAX flag
>   fs/xfs: Remove xfs_diflags_to_linux()
>   Documentation/dax: Update Usage section
> 
>  Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++-
>  Documentation/filesystems/vfs.rst | 16 +++++
>  fs/attr.c                         |  1 +
>  fs/inode.c                        | 16 ++++-
>  fs/iomap/buffered-io.c            |  1 +
>  fs/open.c                         |  4 ++
>  fs/stat.c                         |  5 ++
>  fs/xfs/xfs_icache.c               |  5 +-
>  fs/xfs/xfs_inode.h                |  2 +
>  fs/xfs/xfs_ioctl.c                | 98 +++++++++++++++----------------
>  fs/xfs/xfs_iops.c                 | 69 +++++++++++++++-------
>  include/linux/fs.h                | 73 ++++++++++++++++++++++-
>  include/uapi/linux/stat.h         |  1 +
>  mm/fadvise.c                      |  7 ++-
>  mm/filemap.c                      |  4 ++
>  mm/huge_memory.c                  |  1 +
>  mm/khugepaged.c                   |  2 +
>  mm/mmap.c                         | 19 +++++-
>  mm/util.c                         |  9 ++-
>  19 files changed, 328 insertions(+), 89 deletions(-)
> 
> -- 
> 2.21.0
---end quoted text---
Ira Weiny March 9, 2020, 5:04 p.m. UTC | #2
On Thu, Mar 05, 2020 at 04:51:44PM +0100, Christoph Hellwig wrote:
> FYI, I still will fully NAK any series that adds additional locks
> and thus atomic instructions to basically every fs call, and grows
> the inode by a rw_semaphore plus and atomic64_t.  I also think the
> whole idea of switching operation vectors at runtime is fatally flawed
> and we should never add such code, nevermind just for a fringe usecase
> of a fringe feature.

Being new to this area of the kernel I'm not clear on the history...

It was my understanding that the per-file flag support was a requirement to
removing the experimental designation from DAX.  Is this still the case?

Ira

> 
> On Wed, Feb 26, 2020 at 09:24:30PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Changes from V4:
> > 	* Open code the aops lock rather than add it to the xfs_ilock()
> > 	  subsystem (Darrick's comments were obsoleted by this change)
> > 	* Fix lkp build suggestions and bugs
> > 
> > Changes from V3:
> > 	* Remove global locking...  :-D
> > 	* put back per inode locking and remove pre-mature optimizations
> > 	* Fix issues with Directories having IS_DAX() set
> > 	* Fix kernel crash issues reported by Jeff
> > 	* Add some clean up patches
> > 	* Consolidate diflags to iflags functions
> > 	* Update/add documentation
> > 	* Reorder/rename patches quite a bit
> > 
> > Changes from V2:
> > 
> > 	* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
> > 		Internal discussions with Dan determined this would be easier,
> > 		just as performant, and slightly less overhead that having it
> > 		in the SB as suggested by Jan
> > 	* Fix locking order in comments and throughout code
> > 	* Change "mode" to "state" throughout commits
> > 	* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
> > 		configured
> > 	* Add static branch for which is activated by a device which supports
> > 		DAX in XFS
> > 	* Change "lock/unlock" to up/down read/write as appropriate
> > 		Previous names were over simplified
> > 	* Update comments/documentation
> > 
> > 	* Remove the xfs specific lock to the vfs (global) layer.
> > 	* Fix i_dax_sem locking order and comments
> > 
> > 	* Move 'i_mapped' count from struct inode to struct address_space and
> > 		rename it to mmap_count
> > 	* Add inode_has_mappings() call
> > 
> > 	* Fix build issues
> > 	* Clean up syntax spacing and minor issues
> > 	* Update man page text for STATX_ATTR_DAX
> > 	* Add reviewed-by's
> > 	* Rebase to 5.6
> > 
> > 	Rename patch:
> > 		from: fs/xfs: Add lock/unlock state to xfs
> > 		to: fs/xfs: Add write DAX lock to xfs layer
> > 	Add patch:
> > 		fs/xfs: Clarify lockdep dependency for xfs_isilocked()
> > 	Drop patch:
> > 		fs/xfs: Fix truncate up
> > 
> > 
> > At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
> > consumption due to their inability to detect whether the kernel will
> > instantiate page cache for a file, and cases where a global dax enable via a
> > mount option is too coarse.
> > 
> > The following patch series enables selecting the use of DAX on individual files
> > and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
> > scheme the dax mount option can be omitted to allow the per-file property to
> > take effect.
> > 
> > The insight at LSF/MM was to separate the per-mount or per-file "physical"
> > capability switch from an "effective" attribute for the file.
> > 
> > At LSF/MM we discussed the difficulties of switching the DAX state of a file
> > with active mappings / page cache.  It was thought the races could be avoided
> > by limiting DAX state flips to 0-length files.
> > 
> > However, this turns out to not be true.[3] This is because address space
> > operations (a_ops) may be in use at any time the inode is referenced and users
> > have expressed a desire to be able to change the DAX state on a file with data
> > in it.  For those reasons this patch set allows changing the DAX state flag on
> > a file as long as it is not current mapped.
> > 
> > Details of when and how DAX state can be changed on a file is included in a
> > documentation patch.
> > 
> > It should be noted that the physical DAX flag inheritance is not shown in this
> > patch set as it was maintained from previous work on XFS.  The physical DAX
> > flag and it's inheritance will need to be added to other file systems for user
> > control. 
> > 
> > As submitted this works on real hardware testing.
> > 
> > 
> > [1] https://lwn.net/Articles/787973/
> > [2] https://lwn.net/Articles/787233/
> > [3] https://lkml.org/lkml/2019/10/20/96
> > [4] https://patchwork.kernel.org/patch/11310511/
> > 
> > 
> > To: linux-kernel@vger.kernel.org
> > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: linux-ext4@vger.kernel.org
> > Cc: linux-xfs@vger.kernel.org
> > Cc: linux-fsdevel@vger.kernel.org
> > 
> > 
> > Ira Weiny (12):
> >   fs/xfs: Remove unnecessary initialization of i_rwsem
> >   fs: Remove unneeded IS_DAX() check
> >   fs/stat: Define DAX statx attribute
> >   fs/xfs: Isolate the physical DAX flag from enabled
> >   fs/xfs: Create function xfs_inode_enable_dax()
> >   fs: Add locking for a dynamic address space operations state
> >   fs: Prevent DAX state change if file is mmap'ed
> >   fs/xfs: Hold off aops users while changing DAX state
> >   fs/xfs: Clean up locking in dax invalidate
> >   fs/xfs: Allow toggle of effective DAX flag
> >   fs/xfs: Remove xfs_diflags_to_linux()
> >   Documentation/dax: Update Usage section
> > 
> >  Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++-
> >  Documentation/filesystems/vfs.rst | 16 +++++
> >  fs/attr.c                         |  1 +
> >  fs/inode.c                        | 16 ++++-
> >  fs/iomap/buffered-io.c            |  1 +
> >  fs/open.c                         |  4 ++
> >  fs/stat.c                         |  5 ++
> >  fs/xfs/xfs_icache.c               |  5 +-
> >  fs/xfs/xfs_inode.h                |  2 +
> >  fs/xfs/xfs_ioctl.c                | 98 +++++++++++++++----------------
> >  fs/xfs/xfs_iops.c                 | 69 +++++++++++++++-------
> >  include/linux/fs.h                | 73 ++++++++++++++++++++++-
> >  include/uapi/linux/stat.h         |  1 +
> >  mm/fadvise.c                      |  7 ++-
> >  mm/filemap.c                      |  4 ++
> >  mm/huge_memory.c                  |  1 +
> >  mm/khugepaged.c                   |  2 +
> >  mm/mmap.c                         | 19 +++++-
> >  mm/util.c                         |  9 ++-
> >  19 files changed, 328 insertions(+), 89 deletions(-)
> > 
> > -- 
> > 2.21.0
> ---end quoted text---
Darrick J. Wong March 11, 2020, 3:36 a.m. UTC | #3
On Mon, Mar 09, 2020 at 10:04:37AM -0700, Ira Weiny wrote:
> On Thu, Mar 05, 2020 at 04:51:44PM +0100, Christoph Hellwig wrote:
> > FYI, I still will fully NAK any series that adds additional locks
> > and thus atomic instructions to basically every fs call, and grows
> > the inode by a rw_semaphore plus and atomic64_t.  I also think the
> > whole idea of switching operation vectors at runtime is fatally flawed
> > and we should never add such code, nevermind just for a fringe usecase
> > of a fringe feature.
> 
> Being new to this area of the kernel I'm not clear on the history...

I /think/ the TLDR version in no particular order is:

- Some people expressed interest in being able to control page cache vs.
  direct access on pmem hardware at a higher granularity than the entire
  fs.

- Dave Chinner(?) added the per-inode flag intending it to be the sign
  that would flip on DAX.

- Someone (I forget who) made it a mount option that would enable it for
  all files regardless of inode flags and whatnot.

- Eric Sandeen(?) complained that the behavior of the dax inode flag was
  weird, particularly the part where you set the iflag and at some point
  if and when the inode gets reclaimed and then reconstituted then the
  change will finally take place.

- Christoph Hellwig objected on various grounds (the kernel is
  responsible for selecting the most appropriate hardware abstraction
  for the usage pattern; a binary flag doesn't capture enough detail for
  potential future pmem hardware; and now additional locking overhead).

- There's been (I hope) a long term understanding that the mount option
  will go away eventually, and not after we remove the EXPERIMENTAL
  tags.

(FWIW I tend to agree with Eric and Christoph, but I also thought it
would be useful at least to see what changeable file operations would be
like; if there were other users who had already implemented it; and how
much of an apetite there was for revoke().)

Hopefully I summarized that more or less accurately...

> It was my understanding that the per-file flag support was a requirement to
> removing the experimental designation from DAX.  Is this still the case?

Nailing down the meaning of the per-file dax flag is/was the requirement,
even if we kill it off entirely in the end.

Given Christoph's veto threat, I suppose that leaves the following
options?

1) Leave the inode flag (FS_XFLAG_DAX) as it is, and export the S_DAX
status via statx.  Document that changes to FS_XFLAG_DAX do not take
effect immediately and that one must check statx to find out the real
mode.  If we choose this, I would also deprecate the dax mount option;
send in my mkfs.xfs patch to make it so that you can set FS_XFLAG_DAX on
all files at mkfs time; and we can finally lay this whole thing to rest.
This is the closest to what we have today.

2) Withdraw FS_XFLAG_DAX entirely, and let the kernel choose based on
usage patterns, hardware heuristics, or spiteful arbitrariness.

Can we please pick (1) and just be done with this?  I want to move on.

--D

There are still other things that need to be ironed out WRT pmem:

a) reflink and page/pfn/whatever sharing -- fix the mm or (ab)use the
xfs buffer cache, or something worse?

b) getting our stories straight on how to clear poison, and whether or
not we can come up with a better story for ZERO_FILE_RANGE on pmem.  In
the ideal world I'd love to see Z_F_R actually memset(0) the pmem and
clear poison, at least if the file->pmem mappings were contiguous.

c) wiring up xfs to hwpoison, or wiring up hwpoison to xfs, or otherwise
figuring out how to get storage to tell xfs that it lost something so
that maybe xfs can fix it quickly

> Ira
> 
> > 
> > On Wed, Feb 26, 2020 at 09:24:30PM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > Changes from V4:
> > > 	* Open code the aops lock rather than add it to the xfs_ilock()
> > > 	  subsystem (Darrick's comments were obsoleted by this change)
> > > 	* Fix lkp build suggestions and bugs
> > > 
> > > Changes from V3:
> > > 	* Remove global locking...  :-D
> > > 	* put back per inode locking and remove pre-mature optimizations
> > > 	* Fix issues with Directories having IS_DAX() set
> > > 	* Fix kernel crash issues reported by Jeff
> > > 	* Add some clean up patches
> > > 	* Consolidate diflags to iflags functions
> > > 	* Update/add documentation
> > > 	* Reorder/rename patches quite a bit
> > > 
> > > Changes from V2:
> > > 
> > > 	* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
> > > 		Internal discussions with Dan determined this would be easier,
> > > 		just as performant, and slightly less overhead that having it
> > > 		in the SB as suggested by Jan
> > > 	* Fix locking order in comments and throughout code
> > > 	* Change "mode" to "state" throughout commits
> > > 	* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
> > > 		configured
> > > 	* Add static branch for which is activated by a device which supports
> > > 		DAX in XFS
> > > 	* Change "lock/unlock" to up/down read/write as appropriate
> > > 		Previous names were over simplified
> > > 	* Update comments/documentation
> > > 
> > > 	* Remove the xfs specific lock to the vfs (global) layer.
> > > 	* Fix i_dax_sem locking order and comments
> > > 
> > > 	* Move 'i_mapped' count from struct inode to struct address_space and
> > > 		rename it to mmap_count
> > > 	* Add inode_has_mappings() call
> > > 
> > > 	* Fix build issues
> > > 	* Clean up syntax spacing and minor issues
> > > 	* Update man page text for STATX_ATTR_DAX
> > > 	* Add reviewed-by's
> > > 	* Rebase to 5.6
> > > 
> > > 	Rename patch:
> > > 		from: fs/xfs: Add lock/unlock state to xfs
> > > 		to: fs/xfs: Add write DAX lock to xfs layer
> > > 	Add patch:
> > > 		fs/xfs: Clarify lockdep dependency for xfs_isilocked()
> > > 	Drop patch:
> > > 		fs/xfs: Fix truncate up
> > > 
> > > 
> > > At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
> > > consumption due to their inability to detect whether the kernel will
> > > instantiate page cache for a file, and cases where a global dax enable via a
> > > mount option is too coarse.
> > > 
> > > The following patch series enables selecting the use of DAX on individual files
> > > and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
> > > scheme the dax mount option can be omitted to allow the per-file property to
> > > take effect.
> > > 
> > > The insight at LSF/MM was to separate the per-mount or per-file "physical"
> > > capability switch from an "effective" attribute for the file.
> > > 
> > > At LSF/MM we discussed the difficulties of switching the DAX state of a file
> > > with active mappings / page cache.  It was thought the races could be avoided
> > > by limiting DAX state flips to 0-length files.
> > > 
> > > However, this turns out to not be true.[3] This is because address space
> > > operations (a_ops) may be in use at any time the inode is referenced and users
> > > have expressed a desire to be able to change the DAX state on a file with data
> > > in it.  For those reasons this patch set allows changing the DAX state flag on
> > > a file as long as it is not current mapped.
> > > 
> > > Details of when and how DAX state can be changed on a file is included in a
> > > documentation patch.
> > > 
> > > It should be noted that the physical DAX flag inheritance is not shown in this
> > > patch set as it was maintained from previous work on XFS.  The physical DAX
> > > flag and it's inheritance will need to be added to other file systems for user
> > > control. 
> > > 
> > > As submitted this works on real hardware testing.
> > > 
> > > 
> > > [1] https://lwn.net/Articles/787973/
> > > [2] https://lwn.net/Articles/787233/
> > > [3] https://lkml.org/lkml/2019/10/20/96
> > > [4] https://patchwork.kernel.org/patch/11310511/
> > > 
> > > 
> > > To: linux-kernel@vger.kernel.org
> > > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > > Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Cc: Christoph Hellwig <hch@lst.de>
> > > Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: linux-ext4@vger.kernel.org
> > > Cc: linux-xfs@vger.kernel.org
> > > Cc: linux-fsdevel@vger.kernel.org
> > > 
> > > 
> > > Ira Weiny (12):
> > >   fs/xfs: Remove unnecessary initialization of i_rwsem
> > >   fs: Remove unneeded IS_DAX() check
> > >   fs/stat: Define DAX statx attribute
> > >   fs/xfs: Isolate the physical DAX flag from enabled
> > >   fs/xfs: Create function xfs_inode_enable_dax()
> > >   fs: Add locking for a dynamic address space operations state
> > >   fs: Prevent DAX state change if file is mmap'ed
> > >   fs/xfs: Hold off aops users while changing DAX state
> > >   fs/xfs: Clean up locking in dax invalidate
> > >   fs/xfs: Allow toggle of effective DAX flag
> > >   fs/xfs: Remove xfs_diflags_to_linux()
> > >   Documentation/dax: Update Usage section
> > > 
> > >  Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++-
> > >  Documentation/filesystems/vfs.rst | 16 +++++
> > >  fs/attr.c                         |  1 +
> > >  fs/inode.c                        | 16 ++++-
> > >  fs/iomap/buffered-io.c            |  1 +
> > >  fs/open.c                         |  4 ++
> > >  fs/stat.c                         |  5 ++
> > >  fs/xfs/xfs_icache.c               |  5 +-
> > >  fs/xfs/xfs_inode.h                |  2 +
> > >  fs/xfs/xfs_ioctl.c                | 98 +++++++++++++++----------------
> > >  fs/xfs/xfs_iops.c                 | 69 +++++++++++++++-------
> > >  include/linux/fs.h                | 73 ++++++++++++++++++++++-
> > >  include/uapi/linux/stat.h         |  1 +
> > >  mm/fadvise.c                      |  7 ++-
> > >  mm/filemap.c                      |  4 ++
> > >  mm/huge_memory.c                  |  1 +
> > >  mm/khugepaged.c                   |  2 +
> > >  mm/mmap.c                         | 19 +++++-
> > >  mm/util.c                         |  9 ++-
> > >  19 files changed, 328 insertions(+), 89 deletions(-)
> > > 
> > > -- 
> > > 2.21.0
> > ---end quoted text---
Christoph Hellwig March 11, 2020, 6:29 a.m. UTC | #4
On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote:
> 1) Leave the inode flag (FS_XFLAG_DAX) as it is, and export the S_DAX
> status via statx.  Document that changes to FS_XFLAG_DAX do not take
> effect immediately and that one must check statx to find out the real
> mode.  If we choose this, I would also deprecate the dax mount option;
> send in my mkfs.xfs patch to make it so that you can set FS_XFLAG_DAX on
> all files at mkfs time; and we can finally lay this whole thing to rest.
> This is the closest to what we have today.
> 
> 2) Withdraw FS_XFLAG_DAX entirely, and let the kernel choose based on
> usage patterns, hardware heuristics, or spiteful arbitrariness.

3) Only allow changing FS_XFLAG_DAX on directories or files that do
not have blocks allocated to them yet, and side step all the hard
problems.

Which of course still side steps the hard question of what it actually
is supposed to mean..
Dave Chinner March 11, 2020, 6:39 a.m. UTC | #5
On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote:
> There are still other things that need to be ironed out WRT pmem:
> 
> a) reflink and page/pfn/whatever sharing -- fix the mm or (ab)use the
> xfs buffer cache, or something worse?

I don't think we need either. We just need to remove the DAX page
association for hwpoison that requires the struct page to store the
mapping and index. Get rid of that and we should be able to  safely
map the same page into different inode address spaces at the same
time. When we write a shared page, we COW it immediately and replace
the page in the inode's mapping tree, so we can't actually write to
a shared page...

IOWs, the dax_associate_page() related functionality probably needs
to be a filesystem callout - part of the aops vector, I think, so
that device dax can still use it. That way XFS can go it's own way,
while ext4 and device dax can continue to use the existing mechanism
mechanisn that is currently implemented....

XFS can then make use of rmapbt to find the owners on a bad page
notification, and run the "kill userspace dead dead dead" lookup on
each mapping/index tuple rather than pass it around on a struct
page. i.e. we'll do a kill scan for each mapping/index owner tuple
we find, not just one.

That requires converting all the current vma killer code to pass
mapping/index tuples around rather than the struct page. That kill
code doesn't actually need the struct page, it just needs the
mapping/index tuple to match to the vmas that have it mapped into
userspace.

> b) getting our stories straight on how to clear poison, and whether or
> not we can come up with a better story for ZERO_FILE_RANGE on pmem.  In
> the ideal world I'd love to see Z_F_R actually memset(0) the pmem and
> clear poison, at least if the file->pmem mappings were contiguous.

Are you talking about ZFR from userspace through the filesystem (how
do you clear poison in free space?) or ZFR on the dax device fro
either userspace or the kernel filesystem code?

> c) wiring up xfs to hwpoison, or wiring up hwpoison to xfs, or otherwise
> figuring out how to get storage to tell xfs that it lost something so
> that maybe xfs can fix it quickly

Yup, I think that's a dax device callback into the filesystem. i.e
the hwpoison gets delivered to the dax device, which then calls the
fs function rather than do it's current "dax_lock_page(), kill
userspace dead dead dead, dax_unlock_page()" dance. The filesystem
can do a much more intricate dance and wreak far more havoc on
userspace than what the dax device can do.....

Copious amounts of unused time are things I don't have,
unfortunately. Only having 7 fingers to type with right now doesn't
help, either.

Cheers,

Dave.
Christoph Hellwig March 11, 2020, 6:44 a.m. UTC | #6
On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote:
> IOWs, the dax_associate_page() related functionality probably needs
> to be a filesystem callout - part of the aops vector, I think, so
> that device dax can still use it. That way XFS can go it's own way,
> while ext4 and device dax can continue to use the existing mechanism
> mechanisn that is currently implemented....

s/XFS/XFS with rmap/, as most XFS file systems currently don't have
that enabled we'll also need to keep the legacy path around.
Dan Williams March 11, 2020, 5:07 p.m. UTC | #7
On Tue, Mar 10, 2020 at 11:30 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote:
> > 1) Leave the inode flag (FS_XFLAG_DAX) as it is, and export the S_DAX
> > status via statx.  Document that changes to FS_XFLAG_DAX do not take
> > effect immediately and that one must check statx to find out the real
> > mode.  If we choose this, I would also deprecate the dax mount option;
> > send in my mkfs.xfs patch to make it so that you can set FS_XFLAG_DAX on
> > all files at mkfs time; and we can finally lay this whole thing to rest.
> > This is the closest to what we have today.
> >
> > 2) Withdraw FS_XFLAG_DAX entirely, and let the kernel choose based on
> > usage patterns, hardware heuristics, or spiteful arbitrariness.
>
> 3) Only allow changing FS_XFLAG_DAX on directories or files that do
> not have blocks allocated to them yet, and side step all the hard
> problems.

This sounds reasonable to me.

As for deprecating the mount option, I think at a minimum it needs to
continue be accepted as an option even if it is ignored to not break
existing setups. We're currently going through the prolonged flag day
of people discovering that if they update xfsprogs they need to
specify "-m reflink=0" to mkfs.xfs. That pain seems to have only been
a road bump not a showstopper based on the bug reports I've seen. If
anything it has added helpful pressure towards getting reflink support
bumped up in the priority. Hopefully the xfs position that the dax
mount option can be ignored makes it possible to implement the same
policy on ext4, and we can just move on...

> Which of course still side steps the hard question of what it actually
> is supposed to mean..

If we have statx to indicate the effective dax-state that addresses
the pain for applications that want to account for dax in their page
cache pressure estimates, and lets FS_XFLAG_DAX not need to specify
precise semantics.
Dan Williams March 11, 2020, 5:07 p.m. UTC | #8
On Tue, Mar 10, 2020 at 11:44 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote:
> > IOWs, the dax_associate_page() related functionality probably needs
> > to be a filesystem callout - part of the aops vector, I think, so
> > that device dax can still use it. That way XFS can go it's own way,
> > while ext4 and device dax can continue to use the existing mechanism
> > mechanisn that is currently implemented....
>
> s/XFS/XFS with rmap/, as most XFS file systems currently don't have
> that enabled we'll also need to keep the legacy path around.

Agree, it needs to be an opt-in capability as ext4 and xfs w/o rmap
will stay on the legacy path for the foreseeable future.
Dave Chinner March 12, 2020, 12:49 a.m. UTC | #9
On Wed, Mar 11, 2020 at 07:44:12AM +0100, Christoph Hellwig wrote:
> On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote:
> > IOWs, the dax_associate_page() related functionality probably needs
> > to be a filesystem callout - part of the aops vector, I think, so
> > that device dax can still use it. That way XFS can go it's own way,
> > while ext4 and device dax can continue to use the existing mechanism
> > mechanisn that is currently implemented....
> 
> s/XFS/XFS with rmap/, as most XFS file systems currently don't have
> that enabled we'll also need to keep the legacy path around.

Sure, that's trivially easy to handle in the XFS code once the
callouts are in place.

But, quite frankly, we can enforce rmap to be enabled 
enabled because nobody is using a reflink enabled FS w/ DAX right
now. Everyone will have to mkfs their filesystems anyway to enable
reflink+dax, so we simply don't allow reflink+dax to be enabled
unless rmap is also enabled. Simple, easy, trivial.

Cheers,

Dave.
Darrick J. Wong March 12, 2020, 3 a.m. UTC | #10
On Thu, Mar 12, 2020 at 11:49:32AM +1100, Dave Chinner wrote:
> On Wed, Mar 11, 2020 at 07:44:12AM +0100, Christoph Hellwig wrote:
> > On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote:
> > > IOWs, the dax_associate_page() related functionality probably needs
> > > to be a filesystem callout - part of the aops vector, I think, so
> > > that device dax can still use it. That way XFS can go it's own way,
> > > while ext4 and device dax can continue to use the existing mechanism
> > > mechanisn that is currently implemented....
> > 
> > s/XFS/XFS with rmap/, as most XFS file systems currently don't have
> > that enabled we'll also need to keep the legacy path around.
> 
> Sure, that's trivially easy to handle in the XFS code once the
> callouts are in place.
> 
> But, quite frankly, we can enforce rmap to be enabled 
> enabled because nobody is using a reflink enabled FS w/ DAX right
> now. Everyone will have to mkfs their filesystems anyway to enable
> reflink+dax, so we simply don't allow reflink+dax to be enabled
> unless rmap is also enabled. Simple, easy, trivial.

Heh, this reminds me that I need to get that rmap performance analysis
report out to the list... it does have a fairly substantial performance
impact (in its current not-terribly-optimized state) but otoh enables
self-repair.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Christoph Hellwig March 12, 2020, 7:27 a.m. UTC | #11
On Thu, Mar 12, 2020 at 11:49:32AM +1100, Dave Chinner wrote:
> > > IOWs, the dax_associate_page() related functionality probably needs
> > > to be a filesystem callout - part of the aops vector, I think, so
> > > that device dax can still use it. That way XFS can go it's own way,
> > > while ext4 and device dax can continue to use the existing mechanism
> > > mechanisn that is currently implemented....
> > 
> > s/XFS/XFS with rmap/, as most XFS file systems currently don't have
> > that enabled we'll also need to keep the legacy path around.
> 
> Sure, that's trivially easy to handle in the XFS code once the
> callouts are in place.
> 
> But, quite frankly, we can enforce rmap to be enabled 
> enabled because nobody is using a reflink enabled FS w/ DAX right
> now. Everyone will have to mkfs their filesystems anyway to enable
> reflink+dax, so we simply don't allow reflink+dax to be enabled
> unless rmap is also enabled. Simple, easy, trivial.

True, I think rmap will be required for DAX+reflink.  But we still
need the legacy infrastructure for error reporting.
Jan Kara March 16, 2020, 9:52 a.m. UTC | #12
On Wed 11-03-20 10:07:18, Dan Williams wrote:
> On Tue, Mar 10, 2020 at 11:30 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote:
> > > 1) Leave the inode flag (FS_XFLAG_DAX) as it is, and export the S_DAX
> > > status via statx.  Document that changes to FS_XFLAG_DAX do not take
> > > effect immediately and that one must check statx to find out the real
> > > mode.  If we choose this, I would also deprecate the dax mount option;
> > > send in my mkfs.xfs patch to make it so that you can set FS_XFLAG_DAX on
> > > all files at mkfs time; and we can finally lay this whole thing to rest.
> > > This is the closest to what we have today.
> > >
> > > 2) Withdraw FS_XFLAG_DAX entirely, and let the kernel choose based on
> > > usage patterns, hardware heuristics, or spiteful arbitrariness.
> >
> > 3) Only allow changing FS_XFLAG_DAX on directories or files that do
> > not have blocks allocated to them yet, and side step all the hard
> > problems.
> 
> This sounds reasonable to me.
> 
> As for deprecating the mount option, I think at a minimum it needs to
> continue be accepted as an option even if it is ignored to not break
> existing setups.

Agreed. But that's how we usually deprecate mount options. Also I'd say
that statx() support for reporting DAX state and some education of
programmers using DAX is required before we deprecate the mount option
since currently applications check 'dax' mount option to determine how much
memory they need to set aside for page cache before they consume everything
else on the machine...

								Honza
Christoph Hellwig March 16, 2020, 9:55 a.m. UTC | #13
On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote:
> > This sounds reasonable to me.
> > 
> > As for deprecating the mount option, I think at a minimum it needs to
> > continue be accepted as an option even if it is ignored to not break
> > existing setups.
> 
> Agreed. But that's how we usually deprecate mount options. Also I'd say
> that statx() support for reporting DAX state and some education of
> programmers using DAX is required before we deprecate the mount option
> since currently applications check 'dax' mount option to determine how much
> memory they need to set aside for page cache before they consume everything
> else on the machine...

I don't even think we should deprecate it.  It isn't painful to maintain
and actually useful for testing.  Instead we should expand it into a
tristate:

  dax=off
  dax=flag
  dax=always

where the existing "dax" option maps to "dax=always" and nodax maps
to "dax=off". and dax=flag becomes the default for DAX capable devices.
Darrick J. Wong April 1, 2020, 4 a.m. UTC | #14
On Mon, Mar 16, 2020 at 10:55:09AM +0100, Christoph Hellwig wrote:
> On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote:
> > > This sounds reasonable to me.
> > > 
> > > As for deprecating the mount option, I think at a minimum it needs to
> > > continue be accepted as an option even if it is ignored to not break
> > > existing setups.
> > 
> > Agreed. But that's how we usually deprecate mount options. Also I'd say
> > that statx() support for reporting DAX state and some education of
> > programmers using DAX is required before we deprecate the mount option
> > since currently applications check 'dax' mount option to determine how much
> > memory they need to set aside for page cache before they consume everything
> > else on the machine...
> 
> I don't even think we should deprecate it.  It isn't painful to maintain
> and actually useful for testing.  Instead we should expand it into a
> tristate:
> 
>   dax=off
>   dax=flag
>   dax=always
> 
> where the existing "dax" option maps to "dax=always" and nodax maps
> to "dax=off". and dax=flag becomes the default for DAX capable devices.

That works for me.  In summary:

 - Applications must call statx to discover the current S_DAX state.

 - There exists an advisory file inode flag FS_XFLAG_DAX that can be
   changed on files that have no blocks allocated to them.  Changing
   this flag does not necessarily change the S_DAX state immediately
   but programs can query the S_DAX state via statx.

   If FS_XFLAG_DAX is set and the fs is on pmem then it will always
   enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will
   never enable S_DAX.  Unless overridden...

 - There exists a dax= mount option.  dax=off means "never set S_DAX,
   ignore FS_XFLAG_DAX"; dax=always means "always set S_DAX (at least on
   pmem), ignore FS_XFLAG_DAX"; and dax=iflag means "follow FS_XFLAG_DAX"
   and is the default.  "dax" by itself means "dax=always".  "nodax"
   means "dax=off".

 - There exists an advisory directory inode flag FS_XFLAG_DAX that can
   be changed at any time.  The flag state is copied into any files or
   subdirectories created within that directory.  If programs require
   that file access runs in S_DAX mode, they'll have to create those
   files themselves inside a directory with FS_XFLAG_DAX set, or mount
   the fs with dax=always.

Ok?  Let's please get this part finished for 5.8, then we can get back
to arguing about fs-rmap and reflink and dax and whatnot.

--D
Jan Kara April 1, 2020, 10:25 a.m. UTC | #15
On Wed 01-04-20 04:00:21, Darrick J. Wong wrote:
> On Mon, Mar 16, 2020 at 10:55:09AM +0100, Christoph Hellwig wrote:
> > On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote:
> > > > This sounds reasonable to me.
> > > > 
> > > > As for deprecating the mount option, I think at a minimum it needs to
> > > > continue be accepted as an option even if it is ignored to not break
> > > > existing setups.
> > > 
> > > Agreed. But that's how we usually deprecate mount options. Also I'd say
> > > that statx() support for reporting DAX state and some education of
> > > programmers using DAX is required before we deprecate the mount option
> > > since currently applications check 'dax' mount option to determine how much
> > > memory they need to set aside for page cache before they consume everything
> > > else on the machine...
> > 
> > I don't even think we should deprecate it.  It isn't painful to maintain
> > and actually useful for testing.  Instead we should expand it into a
> > tristate:
> > 
> >   dax=off
> >   dax=flag
> >   dax=always
> > 
> > where the existing "dax" option maps to "dax=always" and nodax maps
> > to "dax=off". and dax=flag becomes the default for DAX capable devices.
> 
> That works for me.  In summary:
> 
>  - Applications must call statx to discover the current S_DAX state.
> 
>  - There exists an advisory file inode flag FS_XFLAG_DAX that can be
>    changed on files that have no blocks allocated to them.  Changing
>    this flag does not necessarily change the S_DAX state immediately
>    but programs can query the S_DAX state via statx.

I generally like the proposal but I think the fact that toggling
FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite some
confusion and ultimately bug reports. I'm thinking whether we could somehow
improve this... For example an ioctl that would try to make set inode flags
effective by evicting the inode (and returning EBUSY if the eviction is
impossible for some reason)?

								Honza

> 
>    If FS_XFLAG_DAX is set and the fs is on pmem then it will always
>    enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will
>    never enable S_DAX.  Unless overridden...
> 
>  - There exists a dax= mount option.  dax=off means "never set S_DAX,
>    ignore FS_XFLAG_DAX"; dax=always means "always set S_DAX (at least on
>    pmem), ignore FS_XFLAG_DAX"; and dax=iflag means "follow FS_XFLAG_DAX"
>    and is the default.  "dax" by itself means "dax=always".  "nodax"
>    means "dax=off".
> 
>  - There exists an advisory directory inode flag FS_XFLAG_DAX that can
>    be changed at any time.  The flag state is copied into any files or
>    subdirectories created within that directory.  If programs require
>    that file access runs in S_DAX mode, they'll have to create those
>    files themselves inside a directory with FS_XFLAG_DAX set, or mount
>    the fs with dax=always.
> 
> Ok?  Let's please get this part finished for 5.8, then we can get back
> to arguing about fs-rmap and reflink and dax and whatnot.
> 
> --D
Christoph Hellwig April 2, 2020, 8:53 a.m. UTC | #16
On Wed, Apr 01, 2020 at 12:25:11PM +0200, Jan Kara wrote:
> >  - Applications must call statx to discover the current S_DAX state.
> > 
> >  - There exists an advisory file inode flag FS_XFLAG_DAX that can be
> >    changed on files that have no blocks allocated to them.  Changing
> >    this flag does not necessarily change the S_DAX state immediately
> >    but programs can query the S_DAX state via statx.
> 
> I generally like the proposal but I think the fact that toggling
> FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite some
> confusion and ultimately bug reports. I'm thinking whether we could somehow
> improve this... For example an ioctl that would try to make set inode flags
> effective by evicting the inode (and returning EBUSY if the eviction is
> impossible for some reason)?

I'd just return an error for that case, don't play silly games like
evicting the inode.
Ira Weiny April 2, 2020, 8:55 p.m. UTC | #17
On Thu, Apr 02, 2020 at 10:53:27AM +0200, Christoph Hellwig wrote:
> On Wed, Apr 01, 2020 at 12:25:11PM +0200, Jan Kara wrote:
> > >  - Applications must call statx to discover the current S_DAX state.
> > > 
> > >  - There exists an advisory file inode flag FS_XFLAG_DAX that can be
> > >    changed on files that have no blocks allocated to them.  Changing
> > >    this flag does not necessarily change the S_DAX state immediately
> > >    but programs can query the S_DAX state via statx.
> > 
> > I generally like the proposal but I think the fact that toggling
> > FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite some
> > confusion and ultimately bug reports. I'm thinking whether we could somehow
> > improve this... For example an ioctl that would try to make set inode flags
> > effective by evicting the inode (and returning EBUSY if the eviction is
> > impossible for some reason)?
> 
> I'd just return an error for that case, don't play silly games like
> evicting the inode.

I think I agree with Christoph here.  But I want to clarify.  I was heading in
a direction of failing the ioctl completely.  But we could have the flag change
with an appropriate error which could let the user know the change has been
delayed.

But I don't immediately see what error code is appropriate for such an
indication.  Candidates I can envision:

EAGAIN
ERESTART
EUSERS
EINPROGRESS

None are perfect but I'm leaning toward EINPROGRESS.

Ira
Ira Weiny April 3, 2020, 4:39 a.m. UTC | #18
On Wed, Apr 01, 2020 at 04:00:21AM +0000, Darrick J. Wong wrote:
> On Mon, Mar 16, 2020 at 10:55:09AM +0100, Christoph Hellwig wrote:
> > On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote:
> > > > This sounds reasonable to me.
> > > > 
> > > > As for deprecating the mount option, I think at a minimum it needs to
> > > > continue be accepted as an option even if it is ignored to not break
> > > > existing setups.
> > > 
> > > Agreed. But that's how we usually deprecate mount options. Also I'd say
> > > that statx() support for reporting DAX state and some education of
> > > programmers using DAX is required before we deprecate the mount option
> > > since currently applications check 'dax' mount option to determine how much
> > > memory they need to set aside for page cache before they consume everything
> > > else on the machine...
> > 
> > I don't even think we should deprecate it.  It isn't painful to maintain
> > and actually useful for testing.  Instead we should expand it into a
> > tristate:
> > 
> >   dax=off
> >   dax=flag
> >   dax=always
> > 
> > where the existing "dax" option maps to "dax=always" and nodax maps
> > to "dax=off". and dax=flag becomes the default for DAX capable devices.
> 
> That works for me.  In summary:
> 
>  - Applications must call statx to discover the current S_DAX state.
> 
>  - There exists an advisory file inode flag FS_XFLAG_DAX that can be
>    changed on files that have no blocks allocated to them.  Changing
>    this flag does not necessarily change the S_DAX state immediately
>    but programs can query the S_DAX state via statx.
> 
>    If FS_XFLAG_DAX is set and the fs is on pmem then it will always
>    enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will
>    never enable S_DAX.  Unless overridden...
> 
>  - There exists a dax= mount option.  dax=off means "never set S_DAX,
>    ignore FS_XFLAG_DAX"; dax=always means "always set S_DAX (at least on
>    pmem), ignore FS_XFLAG_DAX"; and dax=iflag means "follow FS_XFLAG_DAX"
>    and is the default.  "dax" by itself means "dax=always".  "nodax"
>    means "dax=off".
> 
>  - There exists an advisory directory inode flag FS_XFLAG_DAX that can
>    be changed at any time.  The flag state is copied into any files or
>    subdirectories created within that directory.  If programs require
>    that file access runs in S_DAX mode, they'll have to create those
>    files themselves inside a directory with FS_XFLAG_DAX set, or mount
>    the fs with dax=always.

One other thing to add here.  They _can_ set the FS_XFLAG_DAX on a file with
data and force an eviction to get S_DAX to change.

I think that is a nice reason to have a different error code returned.

> 
> Ok?  Let's please get this part finished for 5.8, then we can get back
> to arguing about fs-rmap and reflink and dax and whatnot.

I'm happy to see you motivated to get this in.

I'm starting with a new xfstest to make sure we agree on the semantics prior to
more patches.  I hope to have the xfstest patch sent tomorrow sometime.

Ira

> 
> --D
Christoph Hellwig April 3, 2020, 7:27 a.m. UTC | #19
On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > I'd just return an error for that case, don't play silly games like
> > evicting the inode.
> 
> I think I agree with Christoph here.  But I want to clarify.  I was heading in
> a direction of failing the ioctl completely.  But we could have the flag change
> with an appropriate error which could let the user know the change has been
> delayed.
> 
> But I don't immediately see what error code is appropriate for such an
> indication.  Candidates I can envision:
> 
> EAGAIN
> ERESTART
> EUSERS
> EINPROGRESS
> 
> None are perfect but I'm leaning toward EINPROGRESS.

I really, really dislike that idea.  The whole point of not forcing
evictions is to make it clear - no this inode is "busy" you can't
do that.  A reasonably smart application can try to evict itself.

But returning an error and doing a lazy change anyway is straight from
the playbook for arcane and confusing API designs.
Ira Weiny April 3, 2020, 3:48 p.m. UTC | #20
On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > I'd just return an error for that case, don't play silly games like
> > > evicting the inode.
> > 
> > I think I agree with Christoph here.  But I want to clarify.  I was heading in
> > a direction of failing the ioctl completely.  But we could have the flag change
> > with an appropriate error which could let the user know the change has been
> > delayed.
> > 
> > But I don't immediately see what error code is appropriate for such an
> > indication.  Candidates I can envision:
> > 
> > EAGAIN
> > ERESTART
> > EUSERS
> > EINPROGRESS
> > 
> > None are perfect but I'm leaning toward EINPROGRESS.
> 
> I really, really dislike that idea.  The whole point of not forcing
> evictions is to make it clear - no this inode is "busy" you can't
> do that.  A reasonably smart application can try to evict itself.

I don't understand.  What Darrick proposed would never need any evictions.  If
the file has blocks allocated the FS_XFLAG_DAX flag can not be changed.  So I
don't see what good eviction would do at all.

> 
> But returning an error and doing a lazy change anyway is straight from
> the playbook for arcane and confusing API designs.

Jan countered with a proposal that the FS_XFLAG_DAX does change with blocks
allocated.  But that S_DAX would change on eviction.  Adding that some eviction
ioctl could be added.

You then proposed just returning an error for that case.  (This lead me to
believe that you were ok with an eviction based change of S_DAX.)

So I agreed that changing S_DAX could be delayed until an explicit eviction.
But, to aid the 'smart application', a different error code could be used to
indicate that the FS_XFLAG_DAX had been changed but that until that explicit
eviction occurs S_DAX would remain.

So I don't fully follow what you mean by 'lazy change'?

Do you still really, really dislike an explicit eviction method for changing
the S_DAX flag?

If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
user wants to change the mode of operations on their 'data'; they would have to
create a new file with the proper setting and move the data there.  For example
copy the file into a directory marked FS_XFLAG_DAX==true?

I'm ok with either interface as I think both could be clear if documented.

Jan?

Ira
Darrick J. Wong April 3, 2020, 4:05 p.m. UTC | #21
On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > I'd just return an error for that case, don't play silly games like
> > > evicting the inode.
> > 
> > I think I agree with Christoph here.  But I want to clarify.  I was heading in
> > a direction of failing the ioctl completely.  But we could have the flag change
> > with an appropriate error which could let the user know the change has been
> > delayed.
> > 
> > But I don't immediately see what error code is appropriate for such an
> > indication.  Candidates I can envision:
> > 
> > EAGAIN
> > ERESTART
> > EUSERS
> > EINPROGRESS
> > 
> > None are perfect but I'm leaning toward EINPROGRESS.
> 
> I really, really dislike that idea.  The whole point of not forcing
> evictions is to make it clear - no this inode is "busy" you can't
> do that.  A reasonably smart application can try to evict itself.
> 
> But returning an error and doing a lazy change anyway is straight from
> the playbook for arcane and confusing API designs.

Agreed.  That's why I wrote that applications can set FS_XFLAG_DAX and
then query statx for STATX_ATTR_DAX to find out if it actually took
effect, and that if applications require it immediately they can either
create a file in a FS_XFLAG_DAX directory, or the admin can mount with
dax=always.  No magic return values required or desired anywhere.

I don't know what "try to evict the inode" magic means, but I'm fairly
sure I don't want to. ;)

--D
Jan Kara April 3, 2020, 5:03 p.m. UTC | #22
On Fri 03-04-20 08:48:29, Ira Weiny wrote:
> On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > > I'd just return an error for that case, don't play silly games like
> > > > evicting the inode.
> > > 
> > > I think I agree with Christoph here.  But I want to clarify.  I was heading in
> > > a direction of failing the ioctl completely.  But we could have the flag change
> > > with an appropriate error which could let the user know the change has been
> > > delayed.
> > > 
> > > But I don't immediately see what error code is appropriate for such an
> > > indication.  Candidates I can envision:
> > > 
> > > EAGAIN
> > > ERESTART
> > > EUSERS
> > > EINPROGRESS
> > > 
> > > None are perfect but I'm leaning toward EINPROGRESS.
> > 
> > I really, really dislike that idea.  The whole point of not forcing
> > evictions is to make it clear - no this inode is "busy" you can't
> > do that.  A reasonably smart application can try to evict itself.
> 
> I don't understand.  What Darrick proposed would never need any
> evictions.  If the file has blocks allocated the FS_XFLAG_DAX flag can
> not be changed.  So I don't see what good eviction would do at all.

I guess there's some confusion here (may well be than on my side). Darrick
propose that we can switch FS_XFLAG_DAX only when file has no blocks
allocated - fine by me. But that still does not mean than we can switch
S_DAX immediately, does it? Because that would still mean we need to switch
aops on living inode and that's ... difficult and Christoph didn't want to
clutter the code with it.

So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag,
S_DAX flag will magically switch when inode gets evicted and the inode gets
reloaded from the disk again. Did I misunderstand anything?

And my thinking was that this is surprising behavior for the user and so it
will likely generate lots of bug reports along the lines of "DAX inode flag
does not work!". So I was pondering how to make the behavior less
confusing... The ioctl I've suggested was just a poor attempt at that.

> > But returning an error and doing a lazy change anyway is straight from
> > the playbook for arcane and confusing API designs.
> 
> Jan countered with a proposal that the FS_XFLAG_DAX does change with
> blocks allocated.  But that S_DAX would change on eviction.  Adding that
> some eviction ioctl could be added.

No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I
was still speaking about the case without blocks allocated.

> You then proposed just returning an error for that case.  (This lead me to
> believe that you were ok with an eviction based change of S_DAX.)
> 
> So I agreed that changing S_DAX could be delayed until an explicit eviction.
> But, to aid the 'smart application', a different error code could be used to
> indicate that the FS_XFLAG_DAX had been changed but that until that explicit
> eviction occurs S_DAX would remain.
> 
> So I don't fully follow what you mean by 'lazy change'?
> 
> Do you still really, really dislike an explicit eviction method for changing
> the S_DAX flag?
> 
> If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
> user wants to change the mode of operations on their 'data'; they would have to
> create a new file with the proper setting and move the data there.  For example
> copy the file into a directory marked FS_XFLAG_DAX==true?
> 
> I'm ok with either interface as I think both could be clear if documented.

I agree that what Darrick suggested is technically easily doable and can be
documented. But it is not natural behavior (i.e., different than all inode
flags we have) and we know how careful people are when reading
documentation...


								Honza
Ira Weiny April 3, 2020, 6:18 p.m. UTC | #23
On Fri, Apr 03, 2020 at 07:03:38PM +0200, Jan Kara wrote:
> On Fri 03-04-20 08:48:29, Ira Weiny wrote:
> > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > > > I'd just return an error for that case, don't play silly games like
> > > > > evicting the inode.
> > > > 
> > > > I think I agree with Christoph here.  But I want to clarify.  I was heading in
> > > > a direction of failing the ioctl completely.  But we could have the flag change
> > > > with an appropriate error which could let the user know the change has been
> > > > delayed.
> > > > 
> > > > But I don't immediately see what error code is appropriate for such an
> > > > indication.  Candidates I can envision:
> > > > 
> > > > EAGAIN
> > > > ERESTART
> > > > EUSERS
> > > > EINPROGRESS
> > > > 
> > > > None are perfect but I'm leaning toward EINPROGRESS.
> > > 
> > > I really, really dislike that idea.  The whole point of not forcing
> > > evictions is to make it clear - no this inode is "busy" you can't
> > > do that.  A reasonably smart application can try to evict itself.
> > 
> > I don't understand.  What Darrick proposed would never need any
> > evictions.  If the file has blocks allocated the FS_XFLAG_DAX flag can
> > not be changed.  So I don't see what good eviction would do at all.
> 
> I guess there's some confusion here (may well be than on my side). Darrick
> propose that we can switch FS_XFLAG_DAX only when file has no blocks
> allocated - fine by me. But that still does not mean than we can switch
> S_DAX immediately, does it? Because that would still mean we need to switch
> aops on living inode and that's ... difficult and Christoph didn't want to
> clutter the code with it.
> 
> So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag,
> S_DAX flag will magically switch when inode gets evicted and the inode gets
> reloaded from the disk again. Did I misunderstand anything?
> 
> And my thinking was that this is surprising behavior for the user and so it
> will likely generate lots of bug reports along the lines of "DAX inode flag
> does not work!". So I was pondering how to make the behavior less
> confusing... The ioctl I've suggested was just a poor attempt at that.

Ok but then I don't understand Christophs comment to "just return an error for
that case"?  Which case?

> 
> > > But returning an error and doing a lazy change anyway is straight from
> > > the playbook for arcane and confusing API designs.
> > 
> > Jan countered with a proposal that the FS_XFLAG_DAX does change with
> > blocks allocated.  But that S_DAX would change on eviction.  Adding that
> > some eviction ioctl could be added.
> 
> No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I
> was still speaking about the case without blocks allocated.

Ah ok good point.  But again what 'error' do we return when FS_XFLAG_DAX
changed but S_DAX did not?

> 
> > You then proposed just returning an error for that case.  (This lead me to
> > believe that you were ok with an eviction based change of S_DAX.)
> > 
> > So I agreed that changing S_DAX could be delayed until an explicit eviction.
> > But, to aid the 'smart application', a different error code could be used to
> > indicate that the FS_XFLAG_DAX had been changed but that until that explicit
> > eviction occurs S_DAX would remain.
> > 
> > So I don't fully follow what you mean by 'lazy change'?
> > 
> > Do you still really, really dislike an explicit eviction method for changing
> > the S_DAX flag?
> > 
> > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
> > user wants to change the mode of operations on their 'data'; they would have to
> > create a new file with the proper setting and move the data there.  For example
> > copy the file into a directory marked FS_XFLAG_DAX==true?
> > 
> > I'm ok with either interface as I think both could be clear if documented.
> 
> I agree that what Darrick suggested is technically easily doable and can be
> documented. But it is not natural behavior (i.e., different than all inode
> flags we have) and we know how careful people are when reading
> documentation...
> 

Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_
_all_...

In summary:

 - Applications must call statx to discover the current S_DAX state.

 - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
   the parent directory FS_XFLAG_DAX inode flag.  (There is no way to change
   this flag after file creation.)

   If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
   inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
   Unless overridden...

 - There exists a dax= mount option.

   "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX"
   	"-o nodax" means "dax=off"
   "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
   	"-o dax" by itself means "dax=always"
   "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default

 - There exists an advisory directory inode flag FS_XFLAG_DAX that can be
   changed at any time.  The flag state is copied into any files or
   subdirectories when they are created within that directory.  If programs
   require file access runs in S_DAX mode, they'll have to create those files
   inside a directory with FS_XFLAG_DAX set, or mount the fs with an
   appropriate dax mount option.


???

Ira
Ira Weiny April 3, 2020, 6:21 p.m. UTC | #24
On Fri, Apr 03, 2020 at 11:18:43AM -0700, 'Ira Weiny' wrote:
[snip]

> 
> Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_
> _all_...

To be fair, I believe Christoph advocated for this at one time or another...

Ira

> 
> In summary:
> 
>  - Applications must call statx to discover the current S_DAX state.
> 
>  - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
>    the parent directory FS_XFLAG_DAX inode flag.  (There is no way to change
>    this flag after file creation.)
> 
>    If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
>    inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
>    Unless overridden...
> 
>  - There exists a dax= mount option.
> 
>    "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX"
>    	"-o nodax" means "dax=off"
>    "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
>    	"-o dax" by itself means "dax=always"
>    "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default
> 
>  - There exists an advisory directory inode flag FS_XFLAG_DAX that can be
>    changed at any time.  The flag state is copied into any files or
>    subdirectories when they are created within that directory.  If programs
>    require file access runs in S_DAX mode, they'll have to create those files
>    inside a directory with FS_XFLAG_DAX set, or mount the fs with an
>    appropriate dax mount option.
> 
> 
> ???
> 
> Ira
>
Darrick J. Wong April 3, 2020, 6:29 p.m. UTC | #25
On Fri, Apr 03, 2020 at 07:03:38PM +0200, Jan Kara wrote:
> On Fri 03-04-20 08:48:29, Ira Weiny wrote:
> > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > > > I'd just return an error for that case, don't play silly games like
> > > > > evicting the inode.
> > > > 
> > > > I think I agree with Christoph here.  But I want to clarify.  I was heading in
> > > > a direction of failing the ioctl completely.  But we could have the flag change
> > > > with an appropriate error which could let the user know the change has been
> > > > delayed.
> > > > 
> > > > But I don't immediately see what error code is appropriate for such an
> > > > indication.  Candidates I can envision:
> > > > 
> > > > EAGAIN
> > > > ERESTART
> > > > EUSERS
> > > > EINPROGRESS
> > > > 
> > > > None are perfect but I'm leaning toward EINPROGRESS.
> > > 
> > > I really, really dislike that idea.  The whole point of not forcing
> > > evictions is to make it clear - no this inode is "busy" you can't
> > > do that.  A reasonably smart application can try to evict itself.
> > 
> > I don't understand.  What Darrick proposed would never need any
> > evictions.  If the file has blocks allocated the FS_XFLAG_DAX flag can
> > not be changed.  So I don't see what good eviction would do at all.
> 
> I guess there's some confusion here (may well be than on my side). Darrick
> propose that we can switch FS_XFLAG_DAX only when file has no blocks
> allocated - fine by me. But that still does not mean than we can switch
> S_DAX immediately, does it? Because that would still mean we need to switch
> aops on living inode and that's ... difficult and Christoph didn't want to
> clutter the code with it.

IIRC, the reason Ira was trying to introduce this file operations lock
is because there isn't any other safe way to change the operations
dynamically, because some of those operations don't start taking any
locks at all until after we've accessed the file operations pointer.

Now that I think about it some more, that's also means we can't change
S_DAX even on files without blocks allocated.  Look at fallocate, it
doesn't even take i_rwsem until we've called into (say)
xfs_file_fallocate.

Ok, so with that in mind, I think we simply have to say that
FS_XFLAG_DAX is 100% advisory, and S_DAX will magically switch some time
in the future or after the next umount/mount cycle.

FSSETXATTR can't evict an inode it has just set FS_XFLAG_DAX on, because
it applies that change to the fd passed into ioctl(), which means that the
caller has a reference to the fd -> file -> inode, which means the inode
is unevictable until after the call completes.

IOWs, 

> So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag,
> S_DAX flag will magically switch when inode gets evicted and the inode gets
> reloaded from the disk again. Did I misunderstand anything?

That was /my/ understanding. :)

> And my thinking was that this is surprising behavior for the user and so it
> will likely generate lots of bug reports along the lines of "DAX inode flag
> does not work!". So I was pondering how to make the behavior less
> confusing... The ioctl I've suggested was just a poor attempt at that.

At best, userspace uses FSSETXATTR to change FS_XFLAG_DAX, and then
calls some as-yet-undefined ioctl to try to evict the inode from memory.
I'm not sure how you'd actually do that, though, considering that you'd
have to close the fd and as soon as that happens the inode can disappear
permanently.

Ok, that's not sane.  Forget I ever wrote that.

> > > But returning an error and doing a lazy change anyway is straight from
> > > the playbook for arcane and confusing API designs.
> > 
> > Jan countered with a proposal that the FS_XFLAG_DAX does change with
> > blocks allocated.  But that S_DAX would change on eviction.  Adding that
> > some eviction ioctl could be added.
> 
> No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I
> was still speaking about the case without blocks allocated.
> 
> > You then proposed just returning an error for that case.  (This lead me to
> > believe that you were ok with an eviction based change of S_DAX.)
> > 
> > So I agreed that changing S_DAX could be delayed until an explicit eviction.
> > But, to aid the 'smart application', a different error code could be used to
> > indicate that the FS_XFLAG_DAX had been changed but that until that explicit
> > eviction occurs S_DAX would remain.

There's no point in returning a magic error code from FSSETXATTR, as I
realized above.  To restate: FSSETXATTR can't evict the inode, so it
would always have to return the magic error code.  The best we can do is
tell userspace that they can set the advisory FS_XFLAG_DAX flag and then
check STATX_ATTR_DAX immediately afterwards.  If some day in the future
we get smarter and can change it immediately, the statx output will
reflect that.

> > So I don't fully follow what you mean by 'lazy change'?
> > 
> > Do you still really, really dislike an explicit eviction method for changing
> > the S_DAX flag?
> > 
> > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
> > user wants to change the mode of operations on their 'data'; they would have to
> > create a new file with the proper setting and move the data there.  For example
> > copy the file into a directory marked FS_XFLAG_DAX==true?
> > 
> > I'm ok with either interface as I think both could be clear if documented.
> 
> I agree that what Darrick suggested is technically easily doable and can be
> documented. But it is not natural behavior (i.e., different than all inode
> flags we have) and we know how careful people are when reading
> documentation...

To reflect all that I've rambled in this thread, I withdraw the previous
paragraph and submit this one for consideration:

 - There exists an advisory file inode flag FS_XFLAG_DAX.

   If FS_XFLAG_DAX is set and the fs is on pmem then it will always
   enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will
   never enable S_DAX.  The advice can be overridden by mount option.

   Changing this flag does not necessarily change the S_DAX state
   immediately but programs can query the S_DAX state via statx to
   detect when the new advice has gone into effect.

Consider this from the perspective of minimizing changes to userspace
programs between now and the future.  If your program really wants
S_DAX, you can write:

	fd = open(...);

	ioctl(fd, FSGETXATTR, &fsx);

	if (fsx.xflags & FS_XFLAG_DAX)
		return 0;

	fsx.xflags |= FS_XFLAG_DAX;
	ioctl(fd, FSSETXATTR, &fsx);

	do {
		statx(fd, STATX_GIVE_ME_EVERYTHING, &statx);
		if (statx.attrs & STATX_ATTR_DAX)
			return 0;

		/* do stupid magic to evict things */
	} while (we haven't gotten bored and wandered away);

This code snippet will work with the current limitations of the kernel,
and it'll continue working even on Linux v5000 where we finally figure
out how to change S_DAX on the fly.

--D

> 
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Darrick J. Wong April 3, 2020, 6:37 p.m. UTC | #26
On Fri, Apr 03, 2020 at 11:18:43AM -0700, Ira Weiny wrote:
> On Fri, Apr 03, 2020 at 07:03:38PM +0200, Jan Kara wrote:
> > On Fri 03-04-20 08:48:29, Ira Weiny wrote:
> > > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> > > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > > > > I'd just return an error for that case, don't play silly games like
> > > > > > evicting the inode.
> > > > > 
> > > > > I think I agree with Christoph here.  But I want to clarify.  I was heading in
> > > > > a direction of failing the ioctl completely.  But we could have the flag change
> > > > > with an appropriate error which could let the user know the change has been
> > > > > delayed.
> > > > > 
> > > > > But I don't immediately see what error code is appropriate for such an
> > > > > indication.  Candidates I can envision:
> > > > > 
> > > > > EAGAIN
> > > > > ERESTART
> > > > > EUSERS
> > > > > EINPROGRESS
> > > > > 
> > > > > None are perfect but I'm leaning toward EINPROGRESS.
> > > > 
> > > > I really, really dislike that idea.  The whole point of not forcing
> > > > evictions is to make it clear - no this inode is "busy" you can't
> > > > do that.  A reasonably smart application can try to evict itself.
> > > 
> > > I don't understand.  What Darrick proposed would never need any
> > > evictions.  If the file has blocks allocated the FS_XFLAG_DAX flag can
> > > not be changed.  So I don't see what good eviction would do at all.
> > 
> > I guess there's some confusion here (may well be than on my side). Darrick
> > propose that we can switch FS_XFLAG_DAX only when file has no blocks
> > allocated - fine by me. But that still does not mean than we can switch
> > S_DAX immediately, does it? Because that would still mean we need to switch
> > aops on living inode and that's ... difficult and Christoph didn't want to
> > clutter the code with it.
> > 
> > So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag,
> > S_DAX flag will magically switch when inode gets evicted and the inode gets
> > reloaded from the disk again. Did I misunderstand anything?
> > 
> > And my thinking was that this is surprising behavior for the user and so it
> > will likely generate lots of bug reports along the lines of "DAX inode flag
> > does not work!". So I was pondering how to make the behavior less
> > confusing... The ioctl I've suggested was just a poor attempt at that.
> 
> Ok but then I don't understand Christophs comment to "just return an error for
> that case"?  Which case?
> 
> > 
> > > > But returning an error and doing a lazy change anyway is straight from
> > > > the playbook for arcane and confusing API designs.
> > > 
> > > Jan countered with a proposal that the FS_XFLAG_DAX does change with
> > > blocks allocated.  But that S_DAX would change on eviction.  Adding that
> > > some eviction ioctl could be added.
> > 
> > No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I
> > was still speaking about the case without blocks allocated.
> 
> Ah ok good point.  But again what 'error' do we return when FS_XFLAG_DAX
> changed but S_DAX did not?
> 
> > 
> > > You then proposed just returning an error for that case.  (This lead me to
> > > believe that you were ok with an eviction based change of S_DAX.)
> > > 
> > > So I agreed that changing S_DAX could be delayed until an explicit eviction.
> > > But, to aid the 'smart application', a different error code could be used to
> > > indicate that the FS_XFLAG_DAX had been changed but that until that explicit
> > > eviction occurs S_DAX would remain.
> > > 
> > > So I don't fully follow what you mean by 'lazy change'?
> > > 
> > > Do you still really, really dislike an explicit eviction method for changing
> > > the S_DAX flag?
> > > 
> > > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
> > > user wants to change the mode of operations on their 'data'; they would have to
> > > create a new file with the proper setting and move the data there.  For example
> > > copy the file into a directory marked FS_XFLAG_DAX==true?
> > > 
> > > I'm ok with either interface as I think both could be clear if documented.
> > 
> > I agree that what Darrick suggested is technically easily doable and can be
> > documented. But it is not natural behavior (i.e., different than all inode
> > flags we have) and we know how careful people are when reading
> > documentation...
> > 
> 
> Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_
> _all_...
> 
> In summary:
> 
>  - Applications must call statx to discover the current S_DAX state.

Ok.

>  - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
>    the parent directory FS_XFLAG_DAX inode flag.  (There is no way to change
>    this flag after file creation.)
> 
>    If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
>    inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
>    Unless overridden...

Ok, fine with me. :)

>  - There exists a dax= mount option.
> 
>    "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX"
>    	"-o nodax" means "dax=off"

I surveyed the three fses that support dax and found that none of the
filesystems actually have a 'nodax' flag.  Now would be the time not to
add such a thing, and make people specify dax=off instead.  It would
be handy if we could have a single fsparam_enum for figuring out the dax
mount options.

>    "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
>    	"-o dax" by itself means "dax=always"
>    "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default
> 
>  - There exists an advisory directory inode flag FS_XFLAG_DAX that can be
>    changed at any time.  The flag state is copied into any files or
>    subdirectories when they are created within that directory.  If programs
>    require file access runs in S_DAX mode, they'll have to create those files

"...they must create..."

>    inside a directory with FS_XFLAG_DAX set, or mount the fs with an
>    appropriate dax mount option.

Otherwise seems ok to me.

--D

> 
> 
> ???
> 
> Ira
>
Ira Weiny April 5, 2020, 6:19 a.m. UTC | #27
> > 
> > In summary:
> > 
> >  - Applications must call statx to discover the current S_DAX state.
> 
> Ok.
> 
> >  - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
> >    the parent directory FS_XFLAG_DAX inode flag.  (There is no way to change
> >    this flag after file creation.)
> > 
> >    If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
> >    inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
> >    Unless overridden...
> 
> Ok, fine with me. :)

:-D

> 
> >  - There exists a dax= mount option.
> > 
> >    "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX"
> >    	"-o nodax" means "dax=off"
> 
> I surveyed the three fses that support dax and found that none of the
> filesystems actually have a 'nodax' flag.  Now would be the time not to
> add such a thing, and make people specify dax=off instead.  It would
> be handy if we could have a single fsparam_enum for figuring out the dax
> mount options.

yes good point.

I'm working on updating the documentation patch and I think this might also
be better as:

	-o dax=never

Which is the opposite of 'always'.

> 
> >    "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
> >    	"-o dax" by itself means "dax=always"
> >    "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default
> > 
> >  - There exists an advisory directory inode flag FS_XFLAG_DAX that can be
> >    changed at any time.  The flag state is copied into any files or
> >    subdirectories when they are created within that directory.  If programs
> >    require file access runs in S_DAX mode, they'll have to create those files
> 
> "...they must create..."

yes

> 
> >    inside a directory with FS_XFLAG_DAX set, or mount the fs with an
> >    appropriate dax mount option.
> 
> Otherwise seems ok to me.

Thanks!
Ira
Jan Kara April 6, 2020, 10 a.m. UTC | #28
On Fri 03-04-20 11:18:43, Ira Weiny wrote:
> Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_
> _all_...
> 
> In summary:
> 
>  - Applications must call statx to discover the current S_DAX state.
> 
>  - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
>    the parent directory FS_XFLAG_DAX inode flag.  (There is no way to change
>    this flag after file creation.)
> 
>    If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
>    inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
>    Unless overridden...

OK, after considering all the options we were hashing out here, I think
this is the best API. There isn't the confusing "S_DAX will magically
switch on inode eviction" and although the functionality is limited, I
think 90% of users would end up using the functionality like this anyway.

>  - There exists a dax= mount option.
> 
>    "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX"
>    	"-o nodax" means "dax=off"
>    "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
>    	"-o dax" by itself means "dax=always"
>    "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default
> 
>  - There exists an advisory directory inode flag FS_XFLAG_DAX that can be
>    changed at any time.  The flag state is copied into any files or
>    subdirectories when they are created within that directory.  If programs
>    require file access runs in S_DAX mode, they'll have to create those files
>    inside a directory with FS_XFLAG_DAX set, or mount the fs with an
>    appropriate dax mount option.

								Honza