[v3,00/12] Enable per-file/directory DAX operations V3

Message ID 20200208193445.27421-1-ira.weiny@intel.com (mailing list archive)

Message

Ira Weiny Feb. 8, 2020, 7:34 p.m. UTC
From: Ira Weiny <ira.weiny@intel.com>

Changes from V2:

	* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
		Internal discussions with Dan determined this would be easier,
		just as performant, and slightly less overhead than having it
		in the SB as suggested by Jan
	* Fix locking order in comments and throughout code
	* Change "mode" to "state" throughout commits
	* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
		configured
	* Add static branch which is activated by a device which supports
		DAX in XFS
	* Change "lock/unlock" to up/down read/write as appropriate
		Previous names were oversimplified
	* Update comments/documentation

	* Move the xfs specific lock to the vfs (global) layer.
	* Fix i_dax_sem locking order and comments

	* Move 'i_mapped' count from struct inode to struct address_space and
		rename it to mmap_count
	* Add inode_has_mappings() call

	* Fix build issues
	* Clean up syntax spacing and minor issues
	* Update man page text for STATX_ATTR_DAX
	* Add reviewed-by's
	* Rebase to latest linux-next

	Rename patch:
		from: fs/xfs: Add lock/unlock state to xfs
		to: fs/xfs: Add write DAX lock to xfs layer
	Add patch:
		fs/xfs: Clarify lockdep dependency for xfs_isilocked()
	Drop patch:
		fs/xfs: Fix truncate up

	https://github.com/weiny2/linux-kernel/tree/dax-file-state-change-v3

At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
consumption due to their inability to detect whether the kernel will
instantiate page cache for a file, and cases where a global dax enable via a
mount option is too coarse.

The following patch series enables selecting the use of DAX on individual files
and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
scheme the dax mount option can be omitted to allow the per-file property to
take effect.

The insight at LSF/MM was to separate the per-mount or per-file "physical"
capability switch from an "effective" attribute for the file.

At LSF/MM we discussed the difficulties of switching the DAX state of a file
with active mappings / page cache.  It was thought the races could be avoided
by limiting DAX state flips to 0-length files.

However, this turns out to not be true.[3] This is because address space
operations (a_ops) may be in use at any time the inode is referenced and users
have expressed a desire to be able to change the DAX state on a file with data
in it.  For those reasons this patch set allows changing the DAX state flag on
a file as long as it is not currently mapped.

Furthermore, DAX is a property of the inode and as such, many operations other
than address space operations need to be protected during a DAX state change.

Therefore callbacks are placed within the inode operations and used to lock the
inode as appropriate.

As in V1, users are able to query the effective and physical flags separately
at any time.  Specifically the addition of the statx attribute bit allows them
to ensure the file is operating in the DAX state they intend.  This 'effective
flag' and physical flags could differ when the filesystem is mounted with the
dax flag for example.
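
As a rough userspace illustration of that query (a sketch, not part of the
series): the helper name file_is_dax() is made up here, and the fallback value
for STATX_ATTR_DAX is an assumption taken from the mainline uapi header, so it
may not match the bit this version of the series defines.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* AT_FDCWD */
#include <sys/stat.h>   /* statx(), struct statx, STATX_BASIC_STATS */

/* Assumption: value from mainline <linux/stat.h>; older headers lack it. */
#ifndef STATX_ATTR_DAX
#define STATX_ATTR_DAX 0x00200000
#endif

/*
 * Hypothetical helper: report the effective DAX state of a file.
 * Returns 1 if the kernel reports the file as operating in the DAX state,
 * 0 if it does not, and -1 if statx() itself failed (errno is set).
 */
static int file_is_dax(const char *path)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) != 0)
		return -1;

	return (stx.stx_attributes & STATX_ATTR_DAX) ? 1 : 0;
}
```

This only reads the effective state; it says nothing about the physical
on-disk flag, which is queried through a different interface.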

It should be noted that the physical DAX flag inheritance is not shown in this
patch set as it was maintained from previous work on XFS.  The physical DAX
flag and its inheritance will need to be added to other file systems for user
control. 


[1] https://lwn.net/Articles/787973/
[2] https://lwn.net/Articles/787233/
[3] https://lkml.org/lkml/2019/10/20/96
[4] https://patchwork.kernel.org/patch/11310511/


To: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org


Ira Weiny (12):
  fs/stat: Define DAX statx attribute
  fs/xfs: Isolate the physical DAX flag from effective
  fs/xfs: Separate functionality of xfs_inode_supports_dax()
  fs/xfs: Clean up DAX support check
  fs: remove unneeded IS_DAX() check
  fs/xfs: Check if the inode supports DAX under lock
  fs: Add locking for a dynamic DAX state
  fs/xfs: Clarify lockdep dependency for xfs_isilocked()
  fs/xfs: Add write DAX lock to xfs layer
  fs: Prevent DAX state change if file is mmap'ed
  fs/xfs: Clean up locking in dax invalidate
  fs/xfs: Allow toggle of effective DAX flag

 Documentation/filesystems/vfs.rst | 17 ++++++
 fs/attr.c                         |  1 +
 fs/dax.c                          |  3 ++
 fs/inode.c                        | 15 ++++--
 fs/iomap/buffered-io.c            |  1 +
 fs/open.c                         |  4 ++
 fs/stat.c                         |  5 ++
 fs/super.c                        |  3 ++
 fs/xfs/xfs_inode.c                | 24 +++++++--
 fs/xfs/xfs_inode.h                |  8 ++-
 fs/xfs/xfs_ioctl.c                | 56 ++++++++++++--------
 fs/xfs/xfs_iops.c                 | 51 ++++++++++++++----
 fs/xfs/xfs_iops.h                 |  2 +
 fs/xfs/xfs_super.c                | 16 +++---
 include/linux/fs.h                | 86 +++++++++++++++++++++++++++++--
 include/uapi/linux/stat.h         |  1 +
 mm/fadvise.c                      | 10 +++-
 mm/filemap.c                      |  4 ++
 mm/huge_memory.c                  |  1 +
 mm/khugepaged.c                   |  2 +
 mm/madvise.c                      |  3 ++
 mm/mmap.c                         | 19 ++++++-
 mm/util.c                         |  9 +++-
 23 files changed, 287 insertions(+), 54 deletions(-)

Comments

Jeff Moyer Feb. 10, 2020, 3:15 p.m. UTC | #1
Hi, Ira,

Could you please include documentation patches as part of this series?

Thanks,
Jeff
Ira Weiny Feb. 11, 2020, 8:17 p.m. UTC | #2
On Mon, Feb 10, 2020 at 10:15:47AM -0500, Jeff Moyer wrote:
> Hi, Ira,
> 
> Could you please include documentation patches as part of this series?

I do have an update to the vfs.rst doc in

	fs: Add locking for a dynamic DAX state

I'm happy to do more but was there something specific you would like to see?
Or documentation in xfs perhaps?

Ira

> 
> Thanks,
> Jeff
>
Jeff Moyer Feb. 12, 2020, 7:49 p.m. UTC | #3
Ira Weiny <ira.weiny@intel.com> writes:

> On Mon, Feb 10, 2020 at 10:15:47AM -0500, Jeff Moyer wrote:
>> Hi, Ira,
>> 
>> Could you please include documentation patches as part of this series?
>
> I do have an update to the vfs.rst doc in
>
> 	fs: Add locking for a dynamic DAX state
>
> I'm happy to do more but was there something specific you would like to see?
> Or documentation in xfs perhaps?

Sorry, I was referring to your statx man page addition.  It would be
nice if we could find a home for the information in your cover letter,
too.  Right now, I'm not sure how application developers are supposed to
figure out how to use the per-inode settings.

If I read your cover letter correctly, the mount option overrides any
on-disk setting.  Is that right?  Given that we document the dax mount
option as "the way to get dax," it may be a good idea to allow for a
user to selectively disable dax, even when -o dax is specified.  Is that
possible?

-Jeff
Ira Weiny Feb. 13, 2020, 7:01 p.m. UTC | #4
On Wed, Feb 12, 2020 at 02:49:48PM -0500, Jeff Moyer wrote:
> Ira Weiny <ira.weiny@intel.com> writes:
> 
> > On Mon, Feb 10, 2020 at 10:15:47AM -0500, Jeff Moyer wrote:
> >> Hi, Ira,
> >> 
> >> Could you please include documentation patches as part of this series?
> >
> > I do have an update to the vfs.rst doc in
> >
> > 	fs: Add locking for a dynamic DAX state
> >
> > I'm happy to do more but was there something specific you would like to see?
> > Or documentation in xfs perhaps?
> 
> Sorry, I was referring to your statx man page addition.

Ah yea I guess I could include that as a patch.  I just wanted to get buy off
on the whole thing prior to setting documentation in.

> It would be
> nice if we could find a home for the information in your cover letter,
> too.  Right now, I'm not sure how application developers are supposed to
> figure out how to use the per-inode settings.

I'm not sure either.  But this is probably a good start:

https://www.kernel.org/doc/Documentation/filesystems/dax.txt

Something under the Usage section like:

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 679729442fd2..1bab5d5d775b 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -20,8 +20,18 @@ Usage
 If you have a block device which supports DAX, you can make a filesystem
 on it as usual.  The DAX code currently only supports files with a block
 size equal to your kernel's PAGE_SIZE, so you may need to specify a block
-size when creating the filesystem.  When mounting it, use the "-o dax"
-option on the command line or add 'dax' to the options in /etc/fstab.
+size when creating the filesystem.
+
+Files can then be enabled to use dax using the statx system call or an
+application using it like 'xfs_io'.  Directories can also be enabled for dax
+to have the file system automatically enable dax on all files within those
+directories.
+
+Alternately, when mounting it one can use the "-o dax" option on the command
+line or add 'dax' to the options in /etc/fstab to globally override all files to
+use dax on that filesystem.  Using the "-o dax" does not change the state of
+individual files so remounting without "-o dax" will revert them to the state
+saved in the filesystem meta data.
 
 
 Implementation Tips for Block Driver Writers

> 
> If I read your cover letter correctly, the mount option overrides any
> on-disk setting.  Is that right?

Yes

> Given that we document the dax mount
> option as "the way to get dax," it may be a good idea to allow for a
> user to selectively disable dax, even when -o dax is specified.  Is that
> possible?

Not with this patch set.  And I'm not sure how that would work.  The idea was
that -o dax was simply an override for users who were used to having their
entire FS be dax.  We wanted to deprecate the use of "-o dax" in general.  The
individual settings are saved so I don't think it makes sense to ignore the -o
dax in favor of those settings.  Basically that would IMO make the -o dax
useless.

Ira
Ira Weiny Feb. 13, 2020, 7:05 p.m. UTC | #5
On Thu, Feb 13, 2020 at 11:01:57AM -0800, 'Ira Weiny' wrote:
> On Wed, Feb 12, 2020 at 02:49:48PM -0500, Jeff Moyer wrote:
> > Ira Weiny <ira.weiny@intel.com> writes:
> > 
 
[snip]

> > Given that we document the dax mount
> > option as "the way to get dax," it may be a good idea to allow for a
> > user to selectively disable dax, even when -o dax is specified.  Is that
> > possible?
> 
> Not with this patch set.  And I'm not sure how that would work.  The idea was
> that -o dax was simply an override for users who were used to having their
> entire FS be dax.  We wanted to depreciate the use of "-o dax" in general.  The
> individual settings are saved so I don't think it makes sense to ignore the -o
> dax in favor of those settings.  Basically that would IMO make the -o dax
> useless.

Oh and I forgot to mention that setting 'dax' on the root of the FS basically
provides '-o dax' functionality by default with the ability to "turn it off"
for files.

Ira

> 
> Ira
>
Darrick J. Wong Feb. 13, 2020, 7:58 p.m. UTC | #6
On Thu, Feb 13, 2020 at 11:05:13AM -0800, Ira Weiny wrote:
> On Thu, Feb 13, 2020 at 11:01:57AM -0800, 'Ira Weiny' wrote:
> > On Wed, Feb 12, 2020 at 02:49:48PM -0500, Jeff Moyer wrote:
> > > Ira Weiny <ira.weiny@intel.com> writes:
> > > 
>  
> [snip]
> 
> > > Given that we document the dax mount
> > > option as "the way to get dax," it may be a good idea to allow for a
> > > user to selectively disable dax, even when -o dax is specified.  Is that
> > > possible?
> > 
> > Not with this patch set.  And I'm not sure how that would work.  The idea was
> > that -o dax was simply an override for users who were used to having their
> > entire FS be dax.  We wanted to depreciate the use of "-o dax" in general.  The
> > individual settings are saved so I don't think it makes sense to ignore the -o
> > dax in favor of those settings.  Basically that would IMO make the -o dax
> > useless.
> 
> Oh and I forgot to mention that setting 'dax' on the root of the FS basically
> provides '-o dax' functionality by default with the ability to "turn it off"
> for files.

Please don't further confuse FS_XFLAG_DAX and S_DAX.  They are two
separate flags with two separate behaviors:

FS_XFLAG_DAX is a filesystem inode metadata flag.

Setting FS_XFLAG_DAX on a directory causes all files and directories
created within that directory to inherit FS_XFLAG_DAX.

Mounting with -o dax causes all files and directories created to have
FS_XFLAG_DAX set regardless of the parent's status.

The FS_XFLAG_DAX flag can be read and set via the fs[gs]etxattr ioctls.
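
For illustration only, a minimal userspace sketch of that read-modify-write.
The helper name set_dax_xflag() is hypothetical; the ioctl numbers, struct
fsxattr and FS_XFLAG_DAX are assumed to come from <linux/fs.h> as on a
reasonably recent kernel header set.

```c
#include <sys/ioctl.h>  /* ioctl() */
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FS[GS]ETXATTR, FS_XFLAG_DAX */

/*
 * Hypothetical helper: set (enable != 0) or clear (enable == 0)
 * FS_XFLAG_DAX on an already opened file or directory.
 * Returns 0 on success, -1 on failure with errno set by ioctl().
 */
static int set_dax_xflag(int fd, int enable)
{
	struct fsxattr fsx;

	/* Read the current attributes first so other xflags are preserved. */
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) != 0)
		return -1;

	if (enable)
		fsx.fsx_xflags |= FS_XFLAG_DAX;
	else
		fsx.fsx_xflags &= ~FS_XFLAG_DAX;

	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx) == 0 ? 0 : -1;
}
```

Whether the set call succeeds depends on the filesystem; this sketch just
passes the error back to the caller.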

-------

S_DAX is the flag that controls the IO path in the kernel for a given
inode.

Loading a file inode into the kernel (via _iget) with FS_XFLAG_DAX set
or creating a file inode that inherits FS_XFLAG_DAX causes the incore
inode to have the S_DAX flag set if the storage device supports it.

Files with S_DAX set use the dax IO paths through the kernel.

The S_DAX flag can be queried via statx.

--D

> Ira
> 
> > 
> > Ira
> >
Ira Weiny Feb. 13, 2020, 11:29 p.m. UTC | #7
On Thu, Feb 13, 2020 at 11:58:39AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 13, 2020 at 11:05:13AM -0800, Ira Weiny wrote:
> > On Thu, Feb 13, 2020 at 11:01:57AM -0800, 'Ira Weiny' wrote:
> > > On Wed, Feb 12, 2020 at 02:49:48PM -0500, Jeff Moyer wrote:
> > > > Ira Weiny <ira.weiny@intel.com> writes:
> > > > 
> >  
> > [snip]
> > 
> > > > Given that we document the dax mount
> > > > option as "the way to get dax," it may be a good idea to allow for a
> > > > user to selectively disable dax, even when -o dax is specified.  Is that
> > > > possible?
> > > 
> > > Not with this patch set.  And I'm not sure how that would work.  The idea was
> > > that -o dax was simply an override for users who were used to having their
> > > entire FS be dax.  We wanted to depreciate the use of "-o dax" in general.  The
> > > individual settings are saved so I don't think it makes sense to ignore the -o
> > > dax in favor of those settings.  Basically that would IMO make the -o dax
> > > useless.
> > 
> > Oh and I forgot to mention that setting 'dax' on the root of the FS basically
> > provides '-o dax' functionality by default with the ability to "turn it off"
> > for files.
> 
> Please don't further confuse FS_XFLAG_DAX and S_DAX.

Yes...  the above text is wrong WRT statx.  But setting the physical
XFS_DIFLAG2_DAX flag on the root directory will by default cause all files and
directories created there to be XFS_DIFLAG2_DAX and so forth on down the tree
unless explicitly changed.  This will be the same as mounting with '-o dax' but
with the ability to turn off dax for individual files.  Which I think is the
functionality Jeff is wanting.
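
A toy model of that inheritance, just to make the rule concrete.  Nothing here
is kernel code; struct model_inode and model_create() are made-up names, and
disk_dax stands in for XFS_DIFLAG2_DAX.

```c
#include <stdbool.h>

/* Hypothetical model of an inode; only the on-disk DAX flag is tracked. */
struct model_inode {
	bool disk_dax;  /* stands in for XFS_DIFLAG2_DAX */
};

/* A newly created file or directory starts with its parent's flag. */
static struct model_inode model_create(const struct model_inode *parent)
{
	struct model_inode child = { .disk_dax = parent->disk_dax };

	return child;
}
```

Clearing disk_dax on one child afterwards affects only that child (and, for a
directory, whatever is created under it later), which is the "turn it off"
case described above.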

>
> They are two
> separate flags with two separate behaviors:
> 
> FS_XFLAG_DAX is a filesystem inode metadata flag.
> 
> Setting FS_XFLAG_DAX on a directory causes all files and directories
> created within that directory to inherit FS_XFLAG_DAX.
> 
> Mounting with -o dax causes all files and directories created to have
> FS_XFLAG_DAX set regardless of the parent's status.

I don't believe this is true, either before _or_ after this patch set.

'-o dax' only causes XFS_MOUNT_DAX to be set, which then causes S_DAX to be set.
It does not affect FS_XFLAG_DAX.  This is important because we don't want a
mount with '-o dax' to leave all files converted to DAX once '-o dax' is no
longer used.

> 
> The FS_XFLAG_DAX can be get and set via the fs[g]etxattr ioctl.

Right statx was the wrong tool...

fs[g|s]etxattr via xfs_io -c 'chattr|lsattr' is the correct tool.

> 
> -------
> 
> S_DAX is the flag that controls the IO path in the kernel for a given
> inode.
> 
> Loading a file inode into the kernel (via _iget) with FS_XFLAG_DAX set
> or creating a file inode that inherits FS_XFLAG_DAX causes the incore
> inode to have the S_DAX flag set if the storage device supports it.

Yes after reworking "Clean up DAX support check" I believe I've got it correct
now.  Soon to be in V4.

> 
> Files with S_DAX set use the dax IO paths through the kernel.
> 
> The S_DAX flag can be queried via statx.

Yes as a verification that the file is at that moment operating as dax.  It
will not return true for a directory ever.  My bad for saying that.  Sorry I
got my tools and flags mixed up...

Ira
Dan Williams Feb. 14, 2020, 12:16 a.m. UTC | #8
On Thu, Feb 13, 2020 at 3:29 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Thu, Feb 13, 2020 at 11:58:39AM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 13, 2020 at 11:05:13AM -0800, Ira Weiny wrote:
> > > On Thu, Feb 13, 2020 at 11:01:57AM -0800, 'Ira Weiny' wrote:
> > > > On Wed, Feb 12, 2020 at 02:49:48PM -0500, Jeff Moyer wrote:
> > > > > Ira Weiny <ira.weiny@intel.com> writes:
> > > > >
> > >
> > > [snip]
> > >
> > > > > Given that we document the dax mount
> > > > > option as "the way to get dax," it may be a good idea to allow for a
> > > > > user to selectively disable dax, even when -o dax is specified.  Is that
> > > > > possible?
> > > >
> > > > Not with this patch set.  And I'm not sure how that would work.  The idea was
> > > > that -o dax was simply an override for users who were used to having their
> > > > entire FS be dax.  We wanted to depreciate the use of "-o dax" in general.  The
> > > > individual settings are saved so I don't think it makes sense to ignore the -o
> > > > dax in favor of those settings.  Basically that would IMO make the -o dax
> > > > useless.
> > >
> > > Oh and I forgot to mention that setting 'dax' on the root of the FS basically
> > > provides '-o dax' functionality by default with the ability to "turn it off"
> > > for files.
> >
> > Please don't further confuse FS_XFLAG_DAX and S_DAX.
>
> Yes...  the above text is wrong WRT statx.  But setting the physical
> XFS_DIFLAG2_DAX flag on the root directory will by default cause all files and
> directories created there to be XFS_DIFLAG2_DAX and so forth on down the tree
> unless explicitly changed.  This will be the same as mounting with '-o dax' but
> with the ability to turn off dax for individual files.  Which I think is the
> functionality Jeff is wanting.

To be clear you mean turn off XFS_DIFLAG2_DAX, not mask S_DAX when you
say "turn off dax", right?

The mount option simply forces "S_DAX" on all regular files as long as
the underlying device (or soon to be superblock for virtiofs) supports
it. There is no method to mask S_DAX when the filesystem was mounted
with -o dax. Otherwise we would seem to need yet another physical flag
to "always disable" dax.
Ira Weiny Feb. 14, 2020, 8:06 p.m. UTC | #9
On Thu, Feb 13, 2020 at 04:16:17PM -0800, Dan Williams wrote:
> On Thu, Feb 13, 2020 at 3:29 PM Ira Weiny <ira.weiny@intel.com> wrote:
> >
> > On Thu, Feb 13, 2020 at 11:58:39AM -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 13, 2020 at 11:05:13AM -0800, Ira Weiny wrote:
> > > > On Thu, Feb 13, 2020 at 11:01:57AM -0800, 'Ira Weiny' wrote:
> > > > > On Wed, Feb 12, 2020 at 02:49:48PM -0500, Jeff Moyer wrote:
> > > > > > Ira Weiny <ira.weiny@intel.com> writes:
> > > > > >
> > > >
> > > > [snip]
> > > >
> > > > > > Given that we document the dax mount
> > > > > > option as "the way to get dax," it may be a good idea to allow for a
> > > > > > user to selectively disable dax, even when -o dax is specified.  Is that
> > > > > > possible?
> > > > >
> > > > > Not with this patch set.  And I'm not sure how that would work.  The idea was
> > > > > that -o dax was simply an override for users who were used to having their
> > > > > entire FS be dax.  We wanted to depreciate the use of "-o dax" in general.  The
> > > > > individual settings are saved so I don't think it makes sense to ignore the -o
> > > > > dax in favor of those settings.  Basically that would IMO make the -o dax
> > > > > useless.
> > > >
> > > > Oh and I forgot to mention that setting 'dax' on the root of the FS basically
> > > > provides '-o dax' functionality by default with the ability to "turn it off"
> > > > for files.
> > >
> > > Please don't further confuse FS_XFLAG_DAX and S_DAX.
> >
> > Yes...  the above text is wrong WRT statx.  But setting the physical
> > XFS_DIFLAG2_DAX flag on the root directory will by default cause all files and
> > directories created there to be XFS_DIFLAG2_DAX and so forth on down the tree
> > unless explicitly changed.  This will be the same as mounting with '-o dax' but
> > with the ability to turn off dax for individual files.  Which I think is the
> > functionality Jeff is wanting.
> 
> To be clear you mean turn off XFS_DIFLAG2_DAX, not mask S_DAX when you
> say "turn off dax", right?

Yes.

[disclaimer: the following assumes the underlying 'device' (superblock)
supports DAX]

... which results in S_DAX == false when the file is opened without the mount
option.  The key would be that all directories/files created under a root with
XFS_DIFLAG2_DAX == true would inherit their flag and be XFS_DIFLAG2_DAX == true
all the way down the tree.  Any file not wanting DAX would need to set
XFS_DIFLAG2_DAX == false.  And setting false could be used on a directory to
allow a user or group to not use dax on files in that sub-tree.

Then without '-o dax' (XFS_MOUNT_DAX == false) all files when opened set S_DAX
equal to XFS_DIFLAG2_DAX value.  (Directories, as of V4, never get S_DAX set.)

If '-o dax' (XFS_MOUNT_DAX == true) then S_DAX is set on all files.


[IF the underlying 'device' (superblock) does _not_ support DAX]

... S_DAX is _never_ set but the underlying XFS_DIFLAG2_DAX flags can be
toggled and will be inherited as above.  Because S_DAX is never set access to
that file will be restricted to "not dax"...[1]
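
To put the rules above in one place, here is a small model of how the
effective flag falls out of the inputs.  This is a sketch of the behavior as
described in this mail, not the code in the series; struct dax_inputs and
effective_s_dax() are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical inputs; the names mirror the flags discussed above. */
struct dax_inputs {
	bool dev_supports_dax;  /* underlying 'device' (superblock) supports DAX */
	bool mount_dax;         /* XFS_MOUNT_DAX, i.e. mounted with '-o dax' */
	bool diflag2_dax;       /* XFS_DIFLAG2_DAX on the inode */
	bool is_regular_file;   /* directories never get S_DAX (as of V4) */
};

/* Returns the value S_DAX would take when the file is opened. */
static bool effective_s_dax(const struct dax_inputs *in)
{
	/* No device support, or not a regular file: never S_DAX. */
	if (!in->dev_supports_dax || !in->is_regular_file)
		return false;

	/* '-o dax' forces S_DAX; otherwise follow the on-disk flag. */
	return in->mount_dax || in->diflag2_dax;
}
```

Note there is no input that masks S_DAX once mount_dax is true, which is the
limitation being discussed in this thread.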

I could go into that level of detail in the doc if needed?  I feel like we need
a more general name for XFS_DIFLAG2_DAX if I do.[2]

> 
> The mount option simply forces "S_DAX" on all regular files as long as
> the underlying device (or soon to be superblock for virtiofs) supports
> it. There is no method to mask S_DAX when the filesystem was mounted
> with -o dax. Otherwise we would seem to need yet another physical flag
> to "always disable" dax.

Exactly.  I don't think we want to support that.  From this thread alone it
seems we have enough complexity and that would be another layer...

;-)

Ira

[1] I'm beginning to think that if I type dax one more time I'm going to go
crazy...  :-P

[2] I have patches in the wings to introduce EXT4_DAX_FL as an ext4 on-disk bit
which would be equivalent to XFS_DIFLAG2_DAX.  If anyone wants a better name
let me know.
Jeff Moyer Feb. 14, 2020, 9:23 p.m. UTC | #10
Ira Weiny <ira.weiny@intel.com> writes:

> [disclaimer: the following assumes the underlying 'device' (superblock)
> supports DAX]
>
> ... which results in S_DAX == false when the file is opened without the mount
> option.  The key would be that all directories/files created under a root with
> XFS_DIFLAG2_DAX == true would inherit their flag and be XFS_DIFLAG2_DAX == true
> all the way down the tree.  Any file not wanting DAX would need to set
> XFS_DIFLAG2_DAX == false.  And setting false could be used on a directory to
> allow a user or group to not use dax on files in that sub-tree.
>
> Then without '-o dax' (XFS_MOUNT_DAX == false) all files when opened set S_DAX
> equal to XFS_DIFLAG2_DAX value.  (Directories, as of V4, never get S_DAX set.)
>
> If '-o dax' (XFS_MOUNT_DAX == true) then S_DAX is set on all files.

One more clarifying question.  Let's say I set XFS_DIFLAG2_DAX on an
inode.  I then open the file, and perform mmap/load/store/etc.  I close
the file, and I unset XFS_DIFLAG2_DAX.  Will the next open treat the
file as S_DAX or not?  My guess is the inode won't be evicted, and so
S_DAX will remain set.

The reason I ask is I've had requests from application developers to do
just this.  They want to be able to switch back and forth between dax
modes.

Thanks,
Jeff

> [1] I'm beginning to think that if I type dax one more time I'm going to go
> crazy...  :-P

dax dax dax!
Ira Weiny Feb. 14, 2020, 9:58 p.m. UTC | #11
On Fri, Feb 14, 2020 at 04:23:19PM -0500, Jeff Moyer wrote:
> Ira Weiny <ira.weiny@intel.com> writes:
> 
> > [disclaimer: the following assumes the underlying 'device' (superblock)
> > supports DAX]
> >
> > ... which results in S_DAX == false when the file is opened without the mount
> > option.  The key would be that all directories/files created under a root with
> > XFS_DIFLAG2_DAX == true would inherit their flag and be XFS_DIFLAG2_DAX == true
> > all the way down the tree.  Any file not wanting DAX would need to set
> > XFS_DIFLAG2_DAX == false.  And setting false could be used on a directory to
> > allow a user or group to not use dax on files in that sub-tree.
> >
> > Then without '-o dax' (XFS_MOUNT_DAX == false) all files when opened set S_DAX
> > equal to XFS_DIFLAG2_DAX value.  (Directories, as of V4, never get S_DAX set.)
> >
> > If '-o dax' (XFS_MOUNT_DAX == true) then S_DAX is set on all files.
> 
> One more clarifying question.  Let's say I set XFS_DIFLAG2_DAX on an
> inode.  I then open the file, and perform mmap/load/store/etc.  I close
> the file, and I unset XFS_DIFLAG2_DAX.  Will the next open treat the
> file as S_DAX or not?  My guess is the inode won't be evicted, and so
> S_DAX will remain set.

The inode will not be evicted, or even if it happens to be, xfs_io will reload it
to unset the XFS_DIFLAG2_DAX flag.  And the S_DAX flag changes _with_ the
XFS_DIFLAG2_DAX change when it can (when the underlying storage supports
S_DAX).

Trying to change XFS_DIFLAG2_DAX while the file is mmap'ed returns -EBUSY.
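
As a sketch of that behavior (a userspace model, not the kernel code): the
names struct model_file and model_set_dax() are made up, mmap_count stands in
for the address_space mmap_count added in this series, and the model assumes
the filesystem is mounted without '-o dax'.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the per-file state relevant to a DAX flip. */
struct model_file {
	unsigned int mmap_count;  /* models the address_space mmap_count */
	bool disk_dax;            /* models XFS_DIFLAG2_DAX */
	bool s_dax;               /* models the effective S_DAX flag */
};

/* Returns 0 on success or -EBUSY if the file is currently mmap'ed. */
static int model_set_dax(struct model_file *f, bool enable,
			 bool dev_supports_dax)
{
	/* Refuse to change state while any mapping exists. */
	if (f->mmap_count > 0)
		return -EBUSY;

	f->disk_dax = enable;
	/* S_DAX follows the on-disk flag only when the storage supports it. */
	f->s_dax = enable && dev_supports_dax;
	return 0;
}
```

So an application that wants to flip modes has to drop its mappings first and
then retry the change.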

Ira

> 
> The reason I ask is I've had requests from application developers to do
> just this.  They want to be able to switch back and forth between dax
> modes.
> 
> Thanks,
> Jeff
> 
> > [1] I'm beginning to think that if I type dax one more time I'm going to go
> > crazy...  :-P
> 
> dax dax dax!
>
Jeff Moyer Feb. 14, 2020, 10:06 p.m. UTC | #12
Ira Weiny <ira.weiny@intel.com> writes:

> On Fri, Feb 14, 2020 at 04:23:19PM -0500, Jeff Moyer wrote:
>> Ira Weiny <ira.weiny@intel.com> writes:
>> 
>> > [disclaimer: the following assumes the underlying 'device' (superblock)
>> > supports DAX]
>> >
>> > ... which results in S_DAX == false when the file is opened without the mount
>> > option.  The key would be that all directories/files created under a root with
>> > XFS_DIFLAG2_DAX == true would inherit their flag and be XFS_DIFLAG2_DAX == true
>> > all the way down the tree.  Any file not wanting DAX would need to set
>> > XFS_DIFLAG2_DAX == false.  And setting false could be used on a directory to
>> > allow a user or group to not use dax on files in that sub-tree.
>> >
>> > Then without '-o dax' (XFS_MOUNT_DAX == false) all files when opened set S_DAX
>> > equal to XFS_DIFLAG2_DAX value.  (Directories, as of V4, never get S_DAX set.)
>> >
>> > If '-o dax' (XFS_MOUNT_DAX == true) then S_DAX is set on all files.
>> 
>> One more clarifying question.  Let's say I set XFS_DIFLAG2_DAX on an
>> inode.  I then open the file, and perform mmap/load/store/etc.  I close
>> the file, and I unset XFS_DIFLAG2_DAX.  Will the next open treat the
>> file as S_DAX or not?  My guess is the inode won't be evicted, and so
>> S_DAX will remain set.
>
> The inode will not be evicted, or even it happens to be xfs_io will reload it
> to unset the XFS_DIFLAG2_DAX flag.  And the S_DAX flag changes _with_ the
> XFS_DIFLAG2_DAX change when it can (when the underlying storage supports
> S_DAX).

OK, so it will be possible to change the effective mode.

I'll try to get some testing in on this series, now.

Thanks!
Jeff
Jeff Moyer Feb. 14, 2020, 10:58 p.m. UTC | #13
Hi, Ira,

Jeff Moyer <jmoyer@redhat.com> writes:

> I'll try to get some testing in on this series, now.

This series panics in xfstests generic/013, when run like so:

MKFS_OPTIONS="-m reflink=0" MOUNT_OPTIONS="-o dax" ./check -g auto

I'd dig in further, but it's late on a Friday.  You understand.  :)

Cheers,
Jeff
Jeff Moyer Feb. 14, 2020, 11:03 p.m. UTC | #14
Jeff Moyer <jmoyer@redhat.com> writes:

> Hi, Ira,
>
> Jeff Moyer <jmoyer@redhat.com> writes:
>
>> I'll try to get some testing in on this series, now.
>
> This series panics in xfstests generic/013, when run like so:
>
> MKFS_OPTIONS="-m reflink=0" MOUNT_OPTIONS="-o dax" ./check -g auto
>
> I'd dig in further, but it's late on a Friday.  You understand.  :)

Sorry, I should have at least given you a clue.  Below is the stack
trace.  We're going down the buffered I/O path, even though the fs is
mounted with -o dax.  Somewhere the inode isn't getting marked properly.

-Jeff

[  549.461099] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  549.468053] #PF: supervisor instruction fetch in kernel mode
[  549.473713] #PF: error_code(0x0010) - not-present page
[  549.478851] PGD 17c7e06067 P4D 17c7e06067 PUD 17c7e01067 PMD 0 
[  549.484773] Oops: 0010 [#1] SMP NOPTI
[  549.488438] CPU: 68 PID: 19851 Comm: fsstress Not tainted 5.6.0-rc1+ #42
[  549.495134] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.0D.01.0395.022720191340 02/27/2019
[  549.505562] RIP: 0010:0x0
[  549.508186] Code: Bad RIP value.
[  549.511418] RSP: 0018:ffffab132dc9fa98 EFLAGS: 00010246
[  549.516642] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[  549.523768] RDX: 0000000000000000 RSI: ffffdcf75e7060c0 RDI: ffff8c3805d22300
[  549.530900] RBP: ffffab132dc9fb08 R08: 0000000000000000 R09: 00002308a18f9f3f
[  549.538030] R10: 0000000000000000 R11: ffffdcf75f4b19c0 R12: ffff8c37cfe6d2b8
[  549.545155] R13: ffffab132dc9fb60 R14: ffffdcf75e7060c8 R15: ffffdcf75e7060c0
[  549.552288] FS:  00007f849d20cb80(0000) GS:ffff8c3821100000(0000) knlGS:0000000000000000
[  549.560373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  549.566117] CR2: ffffffffffffffd6 CR3: 00000017d3088005 CR4: 00000000007606e0
[  549.573250] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  549.580383] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  549.587515] PKRU: 55555554
[  549.590228] Call Trace:
[  549.592683]  read_pages+0x120/0x190
[  549.596173]  __do_page_cache_readahead+0x1c1/0x1e0
[  549.600965]  ondemand_readahead+0x182/0x2f0
[  549.605152]  generic_file_buffered_read+0x5a6/0xaf0
[  549.610032]  ? security_inode_permission+0x30/0x50
[  549.614824]  ? _cond_resched+0x15/0x30
[  549.618620]  xfs_file_buffered_aio_read+0x47/0xe0 [xfs]
[  549.623861]  xfs_file_read_iter+0x6e/0xd0 [xfs]
[  549.628394]  generic_file_splice_read+0x100/0x220
[  549.633099]  splice_direct_to_actor+0xd5/0x220
[  549.637543]  ? pipe_to_sendpage+0xa0/0xa0
[  549.641557]  do_splice_direct+0x9a/0xd0
[  549.645396]  vfs_copy_file_range+0x153/0x320
[  549.649667]  __x64_sys_copy_file_range+0xdd/0x200
[  549.654375]  do_syscall_64+0x55/0x1d0
[  549.658039]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  549.663091] RIP: 0033:0x7f849c7086bd
[  549.666671] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 9b 67 2c 00 f7 d8 64 89 01 48
[  549.685414] RSP: 002b:00007fff81b78678 EFLAGS: 00000246 ORIG_RAX: 0000000000000146
[  549.692980] RAX: ffffffffffffffda RBX: 00000000000000c9 RCX: 00007f849c7086bd
[  549.700112] RDX: 0000000000000004 RSI: 00007fff81b786b0 RDI: 0000000000000003
[  549.707242] RBP: 00000000005d92ef R08: 0000000000006cb4 R09: 0000000000000000
[  549.714375] R10: 00007fff81b786b8 R11: 0000000000000246 R12: 0000000000000003
[  549.721508] R13: 0000000000006cb4 R14: 0000000000037f58 R15: 0000000000365871
[  549.728641] Modules linked in: xt_CHECKSUM nft_chain_nat xt_MASQUERADE nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 nft_counter nft_compat nf_tables nfnetlink tun bridge stp llc rfkill sunrpc vfat fat intel_rapl_msr intel_rapl_common skx_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt iTCO_vendor_support kvm irqbypass crct10dif_pclmul ipmi_ssif crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mei_me ipmi_si joydev intel_rapl_perf ioatdma pcspkr ipmi_devintf sg i2c_i801 lpc_ich mei dca ipmi_msghandler dax_pmem dax_pmem_core acpi_power_meter acpi_pad xfs libcrc32c nd_pmem nd_btt sd_mod sr_mod cdrom ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec crc32c_intel i40e drm nvme ahci nvme_core libahci t10_pi libata wmi nfit libnvdimm
[  549.805384] CR2: 0000000000000000
[  549.808744] ---[ end trace 62568a4ecc43ee90 ]---
Ira Weiny Feb. 18, 2020, 2:35 a.m. UTC | #15
On Fri, Feb 14, 2020 at 05:58:10PM -0500, Jeff Moyer wrote:
> Hi, Ira,
> 
> Jeff Moyer <jmoyer@redhat.com> writes:
> 
> > I'll try to get some testing in on this series, now.
> 
> This series panics in xfstests generic/013, when run like so:
> 
> MKFS_OPTIONS="-m reflink=0" MOUNT_OPTIONS="-o dax" ./check -g auto
> 
> I'd dig in further, but it's late on a Friday.  You understand.  :)

Yep...  and a long weekend if you are in the US...  I ran the test with V4 and
got the panic below.

Is this similar to what you see?  If so I'll work on it in V4.  FWIW with '-o
dax' specified I don't see how fsstress is causing an issue with my patch set.
Does fsstress attempt to change dax states?  I don't see that in the test but
I'm not real familiar with generic/013 and fsstress.

If my disassembly of read_pages is correct it looks like readpage is null which
makes sense because all files should be IS_DAX() == true due to the mount option...

But tracing code indicates that the patch:

	fs: remove unneeded IS_DAX() check

... may be the culprit and the following fix may work...

diff --git a/mm/filemap.c b/mm/filemap.c
index 3a7863ba51b9..7eaf74a2a39b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
        if (!count)
                goto out; /* skip atime */
 
-       if (iocb->ki_flags & IOCB_DIRECT) {
+       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
                struct file *file = iocb->ki_filp;
                struct address_space *mapping = file->f_mapping;
                struct inode *inode = mapping->host;
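
As posted, the hunk tests IS_DAX(inode) before inode is declared, so the
lookup has to move above the condition.  A minimal userspace sketch of the
intended dispatch, assuming nothing beyond what the diff shows; every model_*
name and MODEL_* constant below is a hypothetical stand-in, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for IOCB_DIRECT and S_DAX; values are arbitrary. */
#define MODEL_IOCB_DIRECT 0x1u
#define MODEL_S_DAX       0x2u

/* Only the fields the dispatch needs; not the real kernel structures. */
struct model_inode   { unsigned int i_flags; };
struct model_mapping { struct model_inode *host; };
struct model_file    { struct model_mapping *f_mapping; };
struct model_kiocb   { struct model_file *ki_filp; unsigned int ki_flags; };

enum model_read_path { MODEL_PATH_DIRECT, MODEL_PATH_BUFFERED };

static bool model_is_dax(const struct model_inode *inode)
{
	return (inode->i_flags & MODEL_S_DAX) != 0;
}

static enum model_read_path model_pick_read_path(const struct model_kiocb *iocb)
{
	const struct model_inode *inode;

	assert(iocb && iocb->ki_filp && iocb->ki_filp->f_mapping);

	/* Hoisted lookup: inode must exist before the condition tests it. */
	inode = iocb->ki_filp->f_mapping->host;

	/* DAX inodes take the direct path and never reach readahead. */
	if ((iocb->ki_flags & MODEL_IOCB_DIRECT) || model_is_dax(inode))
		return MODEL_PATH_DIRECT;

	return MODEL_PATH_BUFFERED;
}
```

In this model a DAX inode never reaches the buffered branch, which is the
branch that ends up calling readpage in the trace below.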

And of course now my server is not responding so I can't reboot to test it...
:-(

I'll continue tomorrow after I go press the power button on the machine since
IPMI has decided not to work...  :-(

Ira

[ 1204.461801] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1204.472375] #PF: supervisor instruction fetch in kernel mode
[ 1204.481440] #PF: error_code(0x0010) - not-present page
[ 1204.489920] PGD 80000003c273d067 P4D 80000003c273d067 PUD 36a73b067 PMD 0 
[ 1204.500396] Oops: 0010 [#1] SMP KASAN PTI
[ 1204.507617] CPU: 6 PID: 15714 Comm: fsstress Not tainted 5.5.0-next-20200207+ #1
[ 1204.518632] Hardware name: Intel Corporation SandyBridge Platform/00, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
[ 1204.533715] RIP: 0010:0x0
[ 1204.539444] Code: Bad RIP value.
[ 1204.545813] RSP: 0018:ffff88837dedf528 EFLAGS: 00010246
[ 1204.554454] RAX: 0000000000000000 RBX: ffffea000cb6ae08 RCX: ffffffff813765fc
[ 1204.565223] RDX: dffffc0000000000 RSI: ffffea000cb6ae00 RDI: ffff8887b032a800
[ 1204.575943] RBP: ffff88837dedf618 R08: fffff9400196d5c1 R09: fffff9400196d5c1
[ 1204.586657] R10: fffff9400196d5c0 R11: ffffea000cb6ae07 R12: ffffea000cb6ae00
[ 1204.597362] R13: ffffffffa0ac3da0 R14: 0000000000000000 R15: ffff888342a14040
[ 1204.608061] FS:  00007fc47c0a8b80(0000) GS:ffff8883c7100000(0000) knlGS:0000000000000000
[ 1204.619869] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1204.629010] CR2: ffffffffffffffd6 CR3: 0000000325892002 CR4: 00000000000606e0
[ 1204.639695] Call Trace:
[ 1204.645128]  read_pages+0x23d/0x2f0
[ 1204.651691]  ? read_cache_pages+0x2b0/0x2b0
[ 1204.659030]  ? policy_node+0x56/0x60
[ 1204.665675]  __do_page_cache_readahead+0x28b/0x2b0
[ 1204.673641]  ? read_pages+0x2f0/0x2f0
[ 1204.680286]  ondemand_readahead+0x2bf/0x5d0
[ 1204.687561]  generic_file_buffered_read+0x992/0x1170
[ 1204.695703]  ? read_cache_page_gfp+0x20/0x20
[ 1204.703052]  ? down_read_nested+0x10b/0x2d0
[ 1204.710266]  ? downgrade_write+0x270/0x270
[ 1204.717399]  ? lock_acquire+0x101/0x200
[ 1204.724171]  ? generic_file_splice_read+0x20d/0x350
[ 1204.732067]  ? generic_file_read_iter+0x3b/0x220
[ 1204.739736]  ? xfs_file_buffered_aio_read+0x87/0x1d0 [xfs]
[ 1204.748351]  xfs_file_buffered_aio_read+0x92/0x1d0 [xfs]
[ 1204.756759]  xfs_file_read_iter+0x120/0x1f0 [xfs]
[ 1204.764420]  generic_file_splice_read+0x239/0x350
[ 1204.772072]  ? pipe_to_user+0x80/0x80
[ 1204.778476]  splice_direct_to_actor+0x1d8/0x460
[ 1204.785831]  ? pipe_to_sendpage+0x1a0/0x1a0
[ 1204.792769]  ? do_splice_to+0xc0/0xc0
[ 1204.799144]  ? selinux_file_permission+0x1d2/0x210
[ 1204.806734]  do_splice_direct+0x10c/0x170
[ 1204.813393]  ? splice_direct_to_actor+0x460/0x460
[ 1204.820830]  ? debug_lockdep_rcu_enabled+0x23/0x60
[ 1204.828349]  ? __sb_start_write+0x12c/0x1f0
[ 1204.835177]  vfs_copy_file_range+0x309/0x5c0
[ 1204.842085]  ? __x64_sys_sendfile+0x160/0x160
[ 1204.849039]  ? from_kgid+0xa0/0xa0
[ 1204.854879]  ? _copy_to_user+0x6a/0x80
[ 1204.861067]  ? cp_new_stat+0x271/0x2c0
[ 1204.867238]  ? __ia32_sys_lstat+0x30/0x30
[ 1204.873672]  ? down_read_non_owner+0x2e0/0x2e0
[ 1204.880579]  __x64_sys_copy_file_range+0x17a/0x310
[ 1204.887844]  ? __ia32_sys_copy_file_range+0x320/0x320
[ 1204.895369]  ? lockdep_hardirqs_off+0x1a/0x140
[ 1204.902142]  do_syscall_64+0x78/0x300
[ 1204.908056]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1204.915507] RIP: 0033:0x7fc47c1a1d6d
[ 1204.921283] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d eb 80 0c 00 f7 d8 64 89 01 48
[ 1204.946109] RSP: 002b:00007ffd9cc4cc08 EFLAGS: 00000202 ORIG_RAX: 0000000000000146
[ 1204.956531] RAX: ffffffffffffffda RBX: 0000000000000047 RCX: 00007fc47c1a1d6d
[ 1204.966477] RDX: 0000000000000004 RSI: 00007ffd9cc4cc40 RDI: 0000000000000003
[ 1204.976394] RBP: 000000000061a35e R08: 000000000000f15b R09: 0000000000000000
[ 1204.986279] R10: 00007ffd9cc4cc48 R11: 0000000000000202 R12: 0000000000000003
[ 1204.996142] R13: 000000000000f15b R14: 00000000000326f7 R15: 0000000000159373
[ 1205.005983] Modules linked in: vfat fat isofs rfkill ib_isert
iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp opa_vnic
rpcrdma sunrpc rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm dax_pmem_compat
libiscsi iTCO_wdt device_dax nd_pmem ib_cm dax_pmem_core nd_btt
iTCO_vendor_support scsi_transport_iscsi snd_hda_codec_realtek
snd_hda_codec_generic ledtrig_audio sb_edac x86_pkg_temp_thermal
intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
snd_hda_intel hfi1 snd_intel_dspcfg snd_hda_codec snd_hda_core aesni_intel
snd_hwdep crypto_simd snd_pcm rdmavt snd_timer nd_e820 ib_uverbs cryptd
libnvdimm snd ib_core soundcore glue_helper ipmi_si mei_me ipmi_devintf pcspkr
mei i2c_i801 ipmi_msghandler lpc_ich mfd_core ioatdma wmi acpi_cpufreq
sch_fq_codel xfs libcrc32c mlx4_en sr_mod cdrom sd_mod t10_pi mgag200
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_vram_helper
drm_ttm_helper ahci ttm crc32c_intel libahci mlx4_core isci igb libsas drm
[ 1205.006047]  dca scsi_transport_sas firewire_ohci i2c_algo_bit firewire_core
crc_itu_t libata i2c_core dm_mod [last unloaded: mlx4_ib]
[ 1205.140277] CR2: 0000000000000000
[ 1205.146689] ---[ end trace cf133ac3f2876827 ]---

> 
> Cheers,
> Jeff
>
Jeff Moyer Feb. 18, 2020, 2:22 p.m. UTC | #16
Ira Weiny <ira.weiny@intel.com> writes:

> Yep...  and a long weekend if you are in the US...  I ran the test with V4 and
> got the panic below.
>
> Is this similar to what you see?  If so I'll work on it in V4.  FWIW with '-o

Yes, precisely.

> dax' specified I don't see how fsstress is causing an issue with my patch set.
> Does fsstress attempt to change dax states?  I don't see that in the test but
> I'm not real familiar with generic/013 and fsstress.

Not that I'm aware of, no.

> If my disassembly of read_pages is correct it looks like readpage is null which
> makes sense because all files should be IS_DAX() == true due to the mount option...
>
> But tracing code indicates that the patch:
>
> 	fs: remove unneeded IS_DAX() check
>
> ... may be the culprit and the following fix may work...
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 3a7863ba51b9..7eaf74a2a39b 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>         if (!count)
>                 goto out; /* skip atime */
>  
> -       if (iocb->ki_flags & IOCB_DIRECT) {
> +       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
>                 struct file *file = iocb->ki_filp;
>                 struct address_space *mapping = file->f_mapping;
>                 struct inode *inode = mapping->host;

Well, you'll have to up-level the inode variable instantiation,
obviously.  That solves this particular issue.  The next traceback
you'll hit is in the writeback path:

[  116.044545] ------------[ cut here ]------------
[  116.049163] WARNING: CPU: 48 PID: 4469 at fs/dax.c:862 dax_writeback_mapping_range+0x397/0x530
...
[  116.134509] CPU: 48 PID: 4469 Comm: fsstress Not tainted 5.6.0-rc1+ #43
[  116.141121] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.0D.01.0395.022720191340 02/27/2019
[  116.151549] RIP: 0010:dax_writeback_mapping_range+0x397/0x530
[  116.157294] Code: ff ff 31 db 48 8b 7c 24 28 c6 07 00 0f 1f 40 00 fb 48 8b 7c 24 10 e8 98 fc 29 00 0f 1f 44 00 00 e9 f1 fc ff ff 4c 8b 64 24 08 <0f> 0b be fb ff ff ff 4c 89 e7 e8 fa 87 ed ff f0 41 80 8c 24 80 00
[  116.176036] RSP: 0018:ffffb9b162fa7c18 EFLAGS: 00010046
[  116.181261] RAX: 0000000000000000 RBX: 00000000000001ac RCX: 0000000000000020
[  116.188387] RDX: 0000000000000000 RSI: 00000000000001ac RDI: ffffb9b162fa7c40
[  116.195519] RBP: 0000000000000020 R08: ffff9a73dc24d6b0 R09: 0000000000000020
[  116.202648] R10: 0000000000000000 R11: 0000000000000238 R12: ffff9a73d92c66b8
[  116.209774] R13: ffffe4a09f0cb200 R14: 0000000000000000 R15: ffffe4a09f0cb200
[  116.216907] FS:  00007f2dbcd22b80(0000) GS:ffff9a7420c00000(0000) knlGS:0000000000000000
[  116.224992] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  116.230735] CR2: 00007fa21808b648 CR3: 000000179e0a2003 CR4: 00000000007606e0
[  116.237860] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  116.244990] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  116.252115] PKRU: 55555554
[  116.254827] Call Trace:
[  116.257286]  do_writepages+0x41/0xd0
[  116.260862]  __filemap_fdatawrite_range+0xcb/0x100
[  116.265653]  filemap_write_and_wait_range+0x38/0x90
[  116.270579]  xfs_setattr_size+0x2c2/0x3e0 [xfs]
[  116.275126]  xfs_file_fallocate+0x239/0x440 [xfs]
[  116.279831]  ? selinux_file_permission+0x108/0x140
[  116.284622]  vfs_fallocate+0x14d/0x2f0
[  116.288374]  ksys_fallocate+0x3c/0x80
[  116.292039]  __x64_sys_fallocate+0x1a/0x20
[  116.296139]  do_syscall_64+0x55/0x1d0
[  116.299806]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  116.304856] RIP: 0033:0x7f2dbc21983b
[  116.308435] Code: ff ff eb ba 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 8d 05 25 0e 2d 00 49 89 ca 8b 00 85 c0 75 14 b8 1d 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5d c3 0f 1f 40 00 41 55 49 89 cd 41 54 49 89

That's here:

        /*
         * A page got tagged dirty in DAX mapping? Something is seriously
         * wrong.
         */
        if (WARN_ON(!xa_is_value(entry)))
                return -EIO;
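
For anyone not familiar with the check: DAX stores XArray "value" entries
(low bit set) in the mapping rather than struct page pointers, so an ordinary
page-cache pointer fails xa_is_value().  A userspace sketch of that encoding,
modelled on xa_mk_value()/xa_is_value() in include/linux/xarray.h; the model_*
names are mine and the helpers are simplified (no range checking):

```c
#include <stdbool.h>
#include <stdint.h>

/* Encode an integer as a value entry: shift left and set the low bit. */
static void *model_xa_mk_value(unsigned long v)
{
	return (void *)(uintptr_t)((v << 1) | 1);
}

/* A value entry is recognised purely by its low bit being set. */
static bool model_xa_is_value(const void *entry)
{
	return ((uintptr_t)entry & 1) != 0;
}

/* Recover the integer that was stored in a value entry. */
static unsigned long model_xa_to_value(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}

/*
 * Mirrors the sanity check quoted above: in a DAX mapping every dirty
 * entry is expected to be a value entry, never an aligned pointer.
 */
static bool model_dax_entry_is_sane(const void *entry)
{
	return model_xa_is_value(entry);
}
```

A pointer to any naturally aligned object has its low bit clear, so it fails
the check.  That is what the WARN_ON reports: a real page got tagged dirty in
a mapping that should only ever hold DAX entries.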

Cheers,
Jeff
Ira Weiny Feb. 18, 2020, 11:54 p.m. UTC | #17
On Tue, Feb 18, 2020 at 09:22:58AM -0500, Jeff Moyer wrote:
> Ira Weiny <ira.weiny@intel.com> writes:
> 
> > Yep...  and a long weekend if you are in the US...  I ran the test with V4 and
> > got the panic below.
> >
> > Is this similar to what you see?  If so I'll work on it in V4.  FWIW with '-o
> 
> Yes, precisely.

Ok...

> 
> > dax' specified I don't see how fsstress is causing an issue with my patch set.
> > Does fsstress attempt to change dax states?  I don't see that in the test but
> > I'm not real familiar with generic/013 and fsstress.
> 
> Not that I'm aware of, no.
> 
> > If my disassembly of read_pages is correct it looks like readpage is null which
> > makes sense because all files should be IS_DAX() == true due to the mount option...
> >
> > But tracing code indicates that the patch:
> >
> > 	fs: remove unneeded IS_DAX() check
> >
> > ... may be the culprit and the following fix may work...
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 3a7863ba51b9..7eaf74a2a39b 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> >         if (!count)
> >                 goto out; /* skip atime */
> >  
> > -       if (iocb->ki_flags & IOCB_DIRECT) {
> > +       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
> >                 struct file *file = iocb->ki_filp;
> >                 struct address_space *mapping = file->f_mapping;
> >                 struct inode *inode = mapping->host;
> 
> Well, you'll have to up-level the inode variable instantiation,
> obviously.  That solves this particular issue.

Well...  This seems to be a random issue.  I've had BMC issues with
my server most of the day...  But even with this patch I still get the failure
in read_pages().  :-/

And I have gotten it to both succeed and fail with qemu...  :-/

> The next traceback
> you'll hit is in the writeback path:

> 
> [  116.044545] ------------[ cut here ]------------
> [  116.049163] WARNING: CPU: 48 PID: 4469 at fs/dax.c:862 dax_writeback_mapping_range+0x397/0x530
> ...
> [  116.134509] CPU: 48 PID: 4469 Comm: fsstress Not tainted 5.6.0-rc1+ #43
> [  116.141121] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.0D.01.0395.022720191340 02/27/2019
> [  116.151549] RIP: 0010:dax_writeback_mapping_range+0x397/0x530
> [  116.157294] Code: ff ff 31 db 48 8b 7c 24 28 c6 07 00 0f 1f 40 00 fb 48 8b 7c 24 10 e8 98 fc 29 00 0f 1f 44 00 00 e9 f1 fc ff ff 4c 8b 64 24 08 <0f> 0b be fb ff ff ff 4c 89 e7 e8 fa 87 ed ff f0 41 80 8c 24 80 00
> [  116.176036] RSP: 0018:ffffb9b162fa7c18 EFLAGS: 00010046
> [  116.181261] RAX: 0000000000000000 RBX: 00000000000001ac RCX: 0000000000000020
> [  116.188387] RDX: 0000000000000000 RSI: 00000000000001ac RDI: ffffb9b162fa7c40
> [  116.195519] RBP: 0000000000000020 R08: ffff9a73dc24d6b0 R09: 0000000000000020
> [  116.202648] R10: 0000000000000000 R11: 0000000000000238 R12: ffff9a73d92c66b8
> [  116.209774] R13: ffffe4a09f0cb200 R14: 0000000000000000 R15: ffffe4a09f0cb200
> [  116.216907] FS:  00007f2dbcd22b80(0000) GS:ffff9a7420c00000(0000) knlGS:0000000000000000
> [  116.224992] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  116.230735] CR2: 00007fa21808b648 CR3: 000000179e0a2003 CR4: 00000000007606e0
> [  116.237860] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  116.244990] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  116.252115] PKRU: 55555554
> [  116.254827] Call Trace:
> [  116.257286]  do_writepages+0x41/0xd0
> [  116.260862]  __filemap_fdatawrite_range+0xcb/0x100
> [  116.265653]  filemap_write_and_wait_range+0x38/0x90
> [  116.270579]  xfs_setattr_size+0x2c2/0x3e0 [xfs]
> [  116.275126]  xfs_file_fallocate+0x239/0x440 [xfs]
> [  116.279831]  ? selinux_file_permission+0x108/0x140
> [  116.284622]  vfs_fallocate+0x14d/0x2f0
> [  116.288374]  ksys_fallocate+0x3c/0x80
> [  116.292039]  __x64_sys_fallocate+0x1a/0x20
> [  116.296139]  do_syscall_64+0x55/0x1d0
> [  116.299806]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  116.304856] RIP: 0033:0x7f2dbc21983b
> [  116.308435] Code: ff ff eb ba 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 8d 05 25 0e 2d 00 49 89 ca 8b 00 85 c0 75 14 b8 1d 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5d c3 0f 1f 40 00 41 55 49 89 cd 41 54 49 89
> 
> That's here:
> 
>         /*
>          * A page got tagged dirty in DAX mapping? Something is seriously
>          * wrong.
>          */
>         if (WARN_ON(!xa_is_value(entry)))
>                 return -EIO;

I have not gotten this.  Having to walk to the lab to power cycle the machine
has slowed my progress...

Ira

> 
> Cheers,
> Jeff
>
Ira Weiny Feb. 20, 2020, 4:20 p.m. UTC | #18
On Tue, Feb 18, 2020 at 03:54:30PM -0800, 'Ira Weiny' wrote:
> On Tue, Feb 18, 2020 at 09:22:58AM -0500, Jeff Moyer wrote:
> > Ira Weiny <ira.weiny@intel.com> writes:
> > > If my disassembly of read_pages is correct it looks like readpage is null which
> > > makes sense because all files should be IS_DAX() == true due to the mount option...
> > >
> > > But tracing code indicates that the patch:
> > >
> > > 	fs: remove unneeded IS_DAX() check
> > >
> > > ... may be the culprit and the following fix may work...
> > >
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 3a7863ba51b9..7eaf74a2a39b 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> > >         if (!count)
> > >                 goto out; /* skip atime */
> > >  
> > > -       if (iocb->ki_flags & IOCB_DIRECT) {
> > > +       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
> > >                 struct file *file = iocb->ki_filp;
> > >                 struct address_space *mapping = file->f_mapping;
> > >                 struct inode *inode = mapping->host;
> > 
> > Well, you'll have to up-level the inode variable instantiation,
> > obviously.  That solves this particular issue.
> 
> Well...  This seems to be a random issue.  I've had BMC issues with
> my server most of the day...  But even with this patch I still get the failure
> in read_pages().  :-/
> 
> And I have gotten it to both succeed and fail with qemu...  :-/

... here is the fix.  I made the change in xfs_diflags_to_linux() early on
without factoring in the flag logic changes we have agreed upon...

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 62d9f622bad1..d592949ad396 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1123,11 +1123,11 @@ xfs_diflags_to_linux(
                inode->i_flags |= S_NOATIME;
        else
                inode->i_flags &= ~S_NOATIME;
-       if (xflags & FS_XFLAG_DAX)
+
+       if (xfs_inode_enable_dax(ip))
                inode->i_flags |= S_DAX;
        else
                inode->i_flags &= ~S_DAX;
-
 }

But the one thing which tripped me up, and concerns me, is we have 2 functions
which set the inode flags.

xfs_diflags_to_iflags()
xfs_diflags_to_linux()

xfs_diflags_to_iflags() is geared toward initialization but logically they do
the same thing.  I see no reason to keep them separate.  Does anyone?
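
To illustrate why two setters are a hazard, here is a userspace sketch in
which one helper derives S_DAX from the effective state and the other from the
raw per-inode flag.  The rule "mount option OR per-inode flag" is only an
assumption for the sketch, not necessarily what xfs_inode_enable_dax() ends up
doing, and all sketch_* names and SKETCH_* constants are hypothetical:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for S_DAX and FS_XFLAG_DAX; values are arbitrary. */
#define SKETCH_S_DAX     0x1u
#define SKETCH_XFLAG_DAX 0x2u

struct sketch_inode {
	unsigned int i_flags;   /* VFS-level flags (models inode->i_flags) */
	unsigned int xflags;    /* per-inode on-disk flags */
	bool mount_dax;         /* models mounting with '-o dax' */
};

/* Assumed rule: the mount option or the per-inode flag enables DAX. */
static bool sketch_inode_enable_dax(const struct sketch_inode *ip)
{
	return ip->mount_dax || (ip->xflags & SKETCH_XFLAG_DAX) != 0;
}

/* Setter that consults the effective state. */
static void sketch_diflags_to_iflags(struct sketch_inode *ip)
{
	if (sketch_inode_enable_dax(ip))
		ip->i_flags |= SKETCH_S_DAX;
	else
		ip->i_flags &= ~SKETCH_S_DAX;
}

/* Setter that looks only at the raw flag, as the pre-fix code did. */
static void sketch_old_diflags_to_linux(struct sketch_inode *ip)
{
	if (ip->xflags & SKETCH_XFLAG_DAX)
		ip->i_flags |= SKETCH_S_DAX;
	else
		ip->i_flags &= ~SKETCH_S_DAX;
}
```

With mount_dax set and no per-inode flag, the raw-flag setter clears S_DAX on
an inode that was initialised as DAX, while the other setter keeps it.  Under
the assumption above, that mismatch is one way to get the intermittent failure
with '-o dax'.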

Based on this find, the discussion on behavior in this thread, and the comments
from Dave I'm reworking the series because the flag check/set functions have
all changed and I really want to be as clear as possible with both the patches
and the resulting code.[*]  So v4 should be out today including attempting to
document what we have discussed here and being as clear as possible on the
behavior.  :-D

Thanks so much for testing this!

Ira

[*] I will probably throw in a patch to remove xfs_diflags_to_iflags() as I
really don't see a reason to keep it.
Darrick J. Wong Feb. 20, 2020, 4:30 p.m. UTC | #19
On Thu, Feb 20, 2020 at 08:20:28AM -0800, Ira Weiny wrote:
> On Tue, Feb 18, 2020 at 03:54:30PM -0800, 'Ira Weiny' wrote:
> > On Tue, Feb 18, 2020 at 09:22:58AM -0500, Jeff Moyer wrote:
> > > Ira Weiny <ira.weiny@intel.com> writes:
> > > > If my disassembly of read_pages is correct it looks like readpage is null which
> > > > makes sense because all files should be IS_DAX() == true due to the mount option...
> > > >
> > > > But tracing code indicates that the patch:
> > > >
> > > > 	fs: remove unneeded IS_DAX() check
> > > >
> > > > ... may be the culprit and the following fix may work...
> > > >
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 3a7863ba51b9..7eaf74a2a39b 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> > > >         if (!count)
> > > >                 goto out; /* skip atime */
> > > >  
> > > > -       if (iocb->ki_flags & IOCB_DIRECT) {
> > > > +       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
> > > >                 struct file *file = iocb->ki_filp;
> > > >                 struct address_space *mapping = file->f_mapping;
> > > >                 struct inode *inode = mapping->host;
> > > 
> > > Well, you'll have to up-level the inode variable instantiation,
> > > obviously.  That solves this particular issue.
> > 
> > Well...  This seems to be a random issue.  I've had BMC issues with
> > my server most of the day...  But even with this patch I still get the failure
> > in read_pages().  :-/
> > 
> > And I have gotten it to both succeed and fail with qemu...  :-/
> 
> ... here is the fix.  I made the change in xfs_diflags_to_linux() early on
> without factoring in the flag logic changes we have agreed upon...
> 
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 62d9f622bad1..d592949ad396 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1123,11 +1123,11 @@ xfs_diflags_to_linux(
>                 inode->i_flags |= S_NOATIME;
>         else
>                 inode->i_flags &= ~S_NOATIME;
> -       if (xflags & FS_XFLAG_DAX)
> +
> +       if (xfs_inode_enable_dax(ip))
>                 inode->i_flags |= S_DAX;
>         else
>                 inode->i_flags &= ~S_DAX;
> -
>  }
> 
> But the one thing which tripped me up, and concerns me, is we have 2 functions
> which set the inode flags.
> 
> xfs_diflags_to_iflags()
> xfs_diflags_to_linux()
> 
> xfs_diflags_to_iflags() is geared toward initialization but logically they do
> the same thing.  I see no reason to keep them separate.  Does anyone?
> 
> Based on this find, the discussion on behavior in this thread, and the comments
> from Dave I'm reworking the series because the flag check/set functions have
> all changed and I really want to be as clear as possible with both the patches
> and the resulting code.[*]  So v4 should be out today including attempting to
> document what we have discussed here and being as clear as possible on the
> behavior.  :-D
> 
> Thanks so much for testing this!
> 
> Ira
> 
> [*] I will probably throw in a patch to remove xfs_diflags_to_iflags() as I
> really don't see a reason to keep it.
> 

I prefer you keep the one in xfs_iops.c since ioctls are a higher-level
function than general inode operations.

--D
Ira Weiny Feb. 20, 2020, 4:49 p.m. UTC | #20
On Thu, Feb 20, 2020 at 08:30:24AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 20, 2020 at 08:20:28AM -0800, Ira Weiny wrote:
> > On Tue, Feb 18, 2020 at 03:54:30PM -0800, 'Ira Weiny' wrote:
> > > On Tue, Feb 18, 2020 at 09:22:58AM -0500, Jeff Moyer wrote:
> > > > Ira Weiny <ira.weiny@intel.com> writes:
> > > > > If my disassembly of read_pages is correct it looks like readpage is null which
> > > > > makes sense because all files should be IS_DAX() == true due to the mount option...
> > > > >
> > > > > But tracing code indicates that the patch:
> > > > >
> > > > > 	fs: remove unneeded IS_DAX() check
> > > > >
> > > > > ... may be the culprit and the following fix may work...
> > > > >
> > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > index 3a7863ba51b9..7eaf74a2a39b 100644
> > > > > --- a/mm/filemap.c
> > > > > +++ b/mm/filemap.c
> > > > > @@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> > > > >         if (!count)
> > > > >                 goto out; /* skip atime */
> > > > >  
> > > > > -       if (iocb->ki_flags & IOCB_DIRECT) {
> > > > > +       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
> > > > >                 struct file *file = iocb->ki_filp;
> > > > >                 struct address_space *mapping = file->f_mapping;
> > > > >                 struct inode *inode = mapping->host;
> > > > 
> > > > Well, you'll have to up-level the inode variable instantiation,
> > > > obviously.  That solves this particular issue.
> > > 
> > > Well...  This seems to be a random issue.  I've had BMC issues with
> > > my server most of the day...  But even with this patch I still get the failure
> > > in read_pages().  :-/
> > > 
> > > And I have gotten it to both succeed and fail with qemu...  :-/
> > 
> > ... here is the fix.  I made the change in xfs_diflags_to_linux() early on
> > without factoring in the flag logic changes we have agreed upon...
> > 
> > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > index 62d9f622bad1..d592949ad396 100644
> > --- a/fs/xfs/xfs_ioctl.c
> > +++ b/fs/xfs/xfs_ioctl.c
> > @@ -1123,11 +1123,11 @@ xfs_diflags_to_linux(
> >                 inode->i_flags |= S_NOATIME;
> >         else
> >                 inode->i_flags &= ~S_NOATIME;
> > -       if (xflags & FS_XFLAG_DAX)
> > +
> > +       if (xfs_inode_enable_dax(ip))
> >                 inode->i_flags |= S_DAX;
> >         else
> >                 inode->i_flags &= ~S_DAX;
> > -
> >  }
> > 
> > But the one thing which tripped me up, and concerns me, is we have 2 functions
> > which set the inode flags.
> > 
> > xfs_diflags_to_iflags()
> > xfs_diflags_to_linux()
> > 
> > xfs_diflags_to_iflags() is geared toward initialization but logically they do
> > the same thing.  I see no reason to keep them separate.  Does anyone?
> > 
> > Based on this find, the discussion on behavior in this thread, and the comments
> > from Dave I'm reworking the series because the flag check/set functions have
> > all changed and I really want to be as clear as possible with both the patches
> > and the resulting code.[*]  So v4 should be out today including attempting to
> > document what we have discussed here and being as clear as possible on the
> > behavior.  :-D
> > 
> > Thanks so much for testing this!
> > 
> > Ira
> > 
> > [*] I will probably throw in a patch to remove xfs_diflags_to_iflags() as I
> > really don't see a reason to keep it.
> > 
> 
> I prefer you keep the one in xfs_iops.c since ioctls are a higher level
> function than general inode operations.

Makes sense.  Do you prefer the xfs_diflags_to_iflags() name as well?

Ira

> 
> --D
Darrick J. Wong Feb. 20, 2020, 5 p.m. UTC | #21
On Thu, Feb 20, 2020 at 08:49:57AM -0800, Ira Weiny wrote:
> On Thu, Feb 20, 2020 at 08:30:24AM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 20, 2020 at 08:20:28AM -0800, Ira Weiny wrote:
> > > On Tue, Feb 18, 2020 at 03:54:30PM -0800, 'Ira Weiny' wrote:
> > > > On Tue, Feb 18, 2020 at 09:22:58AM -0500, Jeff Moyer wrote:
> > > > > Ira Weiny <ira.weiny@intel.com> writes:
> > > > > > If my disassembly of read_pages is correct it looks like readpage is null which
> > > > > > makes sense because all files should be IS_DAX() == true due to the mount option...
> > > > > >
> > > > > > But tracing code indicates that the patch:
> > > > > >
> > > > > > 	fs: remove unneeded IS_DAX() check
> > > > > >
> > > > > > ... may be the culprit and the following fix may work...
> > > > > >
> > > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > > index 3a7863ba51b9..7eaf74a2a39b 100644
> > > > > > --- a/mm/filemap.c
> > > > > > +++ b/mm/filemap.c
> > > > > > @@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> > > > > >         if (!count)
> > > > > >                 goto out; /* skip atime */
> > > > > >  
> > > > > > -       if (iocb->ki_flags & IOCB_DIRECT) {
> > > > > > +       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
> > > > > >                 struct file *file = iocb->ki_filp;
> > > > > >                 struct address_space *mapping = file->f_mapping;
> > > > > >                 struct inode *inode = mapping->host;
> > > > > 
> > > > > Well, you'll have to up-level the inode variable instantiation,
> > > > > obviously.  That solves this particular issue.
> > > > 
> > > > Well...  This seems to be a random issue.  I've had BMC issues with
> > > > my server most of the day...  But even with this patch I still get the failure
> > > > in read_pages().  :-/
> > > > 
> > > > And I have gotten it to both succeed and fail with qemu...  :-/
> > > 
> > > ... here is the fix.  I made the change in xfs_diflags_to_linux() early on
> > > without factoring in the flag logic changes we have agreed upon...
> > > 
> > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > > index 62d9f622bad1..d592949ad396 100644
> > > --- a/fs/xfs/xfs_ioctl.c
> > > +++ b/fs/xfs/xfs_ioctl.c
> > > @@ -1123,11 +1123,11 @@ xfs_diflags_to_linux(
> > >                 inode->i_flags |= S_NOATIME;
> > >         else
> > >                 inode->i_flags &= ~S_NOATIME;
> > > -       if (xflags & FS_XFLAG_DAX)
> > > +
> > > +       if (xfs_inode_enable_dax(ip))
> > >                 inode->i_flags |= S_DAX;
> > >         else
> > >                 inode->i_flags &= ~S_DAX;
> > > -
> > >  }
> > > 
> > > But the one thing which tripped me up, and concerns me, is we have 2 functions
> > > which set the inode flags.
> > > 
> > > xfs_diflags_to_iflags()
> > > xfs_diflags_to_linux()
> > > 
> > > xfs_diflags_to_iflags() is geared toward initialization but logically they do
> > > the same thing.  I see no reason to keep them separate.  Does anyone?
> > > 
> > > Based on this find, the discussion on behavior in this thread, and the comments
> > > from Dave I'm reworking the series because the flag check/set functions have
> > > all changed and I really want to be as clear as possible with both the patches
> > > and the resulting code.[*]  So v4 should be out today including attempting to
> > > document what we have discussed here and being as clear as possible on the
> > > behavior.  :-D
> > > 
> > > Thanks so much for testing this!
> > > 
> > > Ira
> > > 
> > > [*] I will probably throw in a patch to remove xfs_diflags_to_iflags() as I
> > > really don't see a reason to keep it.
> > > 
> > 
> > I prefer you keep the one in xfs_iops.c since ioctls are a higher level
> > function than general inode operations.
> 
> Makes sense.  Do you prefer the xfs_diflags_to_iflags() name as well?

I don't really care one way or another, so ... iflags wins by arbitrary
choice! 8)

--D

> Ira
> 
> > 
> > --D