diff mbox

[1/7] xfs: always use DAX if mount option is used

Message ID 20170925231404.32723-2-ross.zwisler@linux.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ross Zwisler Sept. 25, 2017, 11:13 p.m. UTC
Before support for the per-inode DAX flag was disabled the XFS the code had
an issue where the user couldn't reliably tell whether or not DAX was being
used to service page faults and I/O when the DAX mount option was used.  In
this case each inode within the mounted filesystem started with S_DAX set
due to the mount option, but it could be cleared if someone touched the
individual inode flag.

For example (v4.13 and before):

  # mount | grep dax
  /dev/pmem0 on /mnt type xfs
  (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)

  # touch /mnt/a /mnt/b   # both files currently use DAX

  # xfs_io -c "lsattr" /mnt/*  # neither has the DAX inode option set
  ----------e----- /mnt/a
  ----------e----- /mnt/b

  # xfs_io -c "chattr -x" /mnt/a  # this clears S_DAX for /mnt/a

  # xfs_io -c "lsattr" /mnt/*
  ----------e----- /mnt/a
  ----------e----- /mnt/b

We end up with both /mnt/a and /mnt/b looking identical from the point of
view of the mount option and from lsattr, but one is using DAX and the
other is not.

Fix this by always doing DAX I/O when either the mount option is set or
when the DAX inode flag is set.  This means that DAX will always be used
for all inodes on a filesystem mounted with -o dax, making the usage
reliable and detectable.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_ioctl.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Comments

Dave Chinner Sept. 25, 2017, 11:38 p.m. UTC | #1
On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
> Before support for the per-inode DAX flag was disabled the XFS the code had
> an issue where the user couldn't reliably tell whether or not DAX was being
> used to service page faults and I/O when the DAX mount option was used.  In
> this case each inode within the mounted filesystem started with S_DAX set
> due to the mount option, but it could be cleared if someone touched the
> individual inode flag.
> 
> For example (v4.13 and before):
> 
>   # mount | grep dax
>   /dev/pmem0 on /mnt type xfs
>   (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
> 
>   # touch /mnt/a /mnt/b   # both files currently use DAX
> 
>   # xfs_io -c "lsattr" /mnt/*  # neither has the DAX inode option set
>   ----------e----- /mnt/a
>   ----------e----- /mnt/b
> 
>   # xfs_io -c "chattr -x" /mnt/a  # this clears S_DAX for /mnt/a
> 
>   # xfs_io -c "lsattr" /mnt/*
>   ----------e----- /mnt/a
>   ----------e----- /mnt/b

That's really a bug in the lsattr code, yes? If we've cleared the
S_DAX flag for the inode, then why is it being reported in lsattr?
Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
then isn't that the bug that needs fixing?

Remember, the whole point of the dax inode flag was to be able to
override the mount option setting so that admins could turn off/on
dax for the things that didn't/did work with DAX correctly so they
didn't need multiple filesystems on pmem to segregate the apps that
did/didn't work with DAX...

Cheers,

Dave.
Jan Kara Sept. 26, 2017, 9:35 a.m. UTC | #2
On Tue 26-09-17 09:38:12, Dave Chinner wrote:
> On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
> > Before support for the per-inode DAX flag was disabled the XFS the code had
> > an issue where the user couldn't reliably tell whether or not DAX was being
> > used to service page faults and I/O when the DAX mount option was used.  In
> > this case each inode within the mounted filesystem started with S_DAX set
> > due to the mount option, but it could be cleared if someone touched the
> > individual inode flag.
> > 
> > For example (v4.13 and before):
> > 
> >   # mount | grep dax
> >   /dev/pmem0 on /mnt type xfs
> >   (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
> > 
> >   # touch /mnt/a /mnt/b   # both files currently use DAX
> > 
> >   # xfs_io -c "lsattr" /mnt/*  # neither has the DAX inode option set
> >   ----------e----- /mnt/a
> >   ----------e----- /mnt/b
> > 
> >   # xfs_io -c "chattr -x" /mnt/a  # this clears S_DAX for /mnt/a
> > 
> >   # xfs_io -c "lsattr" /mnt/*
> >   ----------e----- /mnt/a
> >   ----------e----- /mnt/b
> 
> That's really a bug in the lsattr code, yes? If we've cleared the
> S_DAX flag for the inode, then why is it being reported in lsattr?
> Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
> then isn't that the bug that needs fixing?
> 
> Remember, the whole point of the dax inode flag was to be able to
> override the mount option setting so that admins could turn off/on
> dax for the things that didn't/did work with DAX correctly so they
> didn't need multiple filesystems on pmem to segregate the apps that
> did/didn't work with DAX...

So I think there is some confusion that is created by the fact that whether
DAX is used or not is controlled by both a mount option and an inode flag.
We could define that "Inode flag always wins" which is what you seem to
suggest above but then mount option has no practical effect since on-disk
S_DAX flag will always overrule it.

Ross suggests that DAX should be used if "Inode flag or mount option is
set". Which is similar to how e.g. noatime inode flag works but does not
allow to selectively disable DAX.

So if we wanted both mount option to work and selective disabling of DAX,
we would need three states of inode setting - force DAX, disable DAX,
inherit from mount option.

								Honza
Dave Chinner Sept. 26, 2017, 11:09 a.m. UTC | #3
On Tue, Sep 26, 2017 at 11:35:48AM +0200, Jan Kara wrote:
> On Tue 26-09-17 09:38:12, Dave Chinner wrote:
> > On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
> > > Before support for the per-inode DAX flag was disabled the XFS the code had
> > > an issue where the user couldn't reliably tell whether or not DAX was being
> > > used to service page faults and I/O when the DAX mount option was used.  In
> > > this case each inode within the mounted filesystem started with S_DAX set
> > > due to the mount option, but it could be cleared if someone touched the
> > > individual inode flag.
> > > 
> > > For example (v4.13 and before):
> > > 
> > >   # mount | grep dax
> > >   /dev/pmem0 on /mnt type xfs
> > >   (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
> > > 
> > >   # touch /mnt/a /mnt/b   # both files currently use DAX
> > > 
> > >   # xfs_io -c "lsattr" /mnt/*  # neither has the DAX inode option set
> > >   ----------e----- /mnt/a
> > >   ----------e----- /mnt/b
> > > 
> > >   # xfs_io -c "chattr -x" /mnt/a  # this clears S_DAX for /mnt/a
> > > 
> > >   # xfs_io -c "lsattr" /mnt/*
> > >   ----------e----- /mnt/a
> > >   ----------e----- /mnt/b
> > 
> > That's really a bug in the lsattr code, yes? If we've cleared the
> > S_DAX flag for the inode, then why is it being reported in lsattr?
> > Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
> > then isn't that the bug that needs fixing?
> > 
> > Remember, the whole point of the dax inode flag was to be able to
> > override the mount option setting so that admins could turn off/on
> > dax for the things that didn't/did work with DAX correctly so they
> > didn't need multiple filesystems on pmem to segregate the apps that
> > did/didn't work with DAX...
> 
> So I think there is some confusion that is created by the fact that whether
> DAX is used or not is controlled by both a mount option and an inode flag.
> We could define that "Inode flag always wins" which is what you seem to
> suggest above but then mount option has no practical effect since on-disk
> S_DAX flag will always overrule it.

Well, quite frankly, I never wanted the mount option for XFS. It was
supposed to be for initial testing only, then we'd /always/ use the
the inode flags. For a filesystem to default to using DAX, we
set the DAX flag on the root inode at mkfs time, and then everything
inode flag based just works.

But it seems that we're now stuck with the stupid, blunt, brute
force mount option because that's what the first commit on ext4
used.  Now we're just about stuck with this silly "but we can't turn
it off" problem because of the mount option overriding everything.

If we have to keep the mount option, then lets fix it to mean "mount
option sets inheritable inode flag on directory creation" and
/maybe/ "mount option sets inode flag on file creation".

This then allows the inode flag to control everything else. i.e the
mount option sets the initial flag value rather than the behaviour
of the inode. The behaviour of the inode should be entirely
controlled by the inode flag, hence after initial creation the
chattr +/-x commands do what they advertise regardless of the mount
option value.

Yes, it means that existing users are going to have to run chattr -R
+x on their pmem filesystems to get the inode flags on disk, but
this is all tagged with EXPERIMENTAL and this is the sort of change
that is expected from experimental functionality.

Cheers,

Dave.
Christoph Hellwig Sept. 26, 2017, 2:37 p.m. UTC | #4
On Tue, Sep 26, 2017 at 09:09:57PM +1000, Dave Chinner wrote:
> Well, quite frankly, I never wanted the mount option for XFS. It was
> supposed to be for initial testing only, then we'd /always/ use the
> the inode flags. For a filesystem to default to using DAX, we
> set the DAX flag on the root inode at mkfs time, and then everything
> inode flag based just works.

And I deeply fundamentally disagree.  The mount option is a nice
enough big hammer to try a mode without encoding nitty gritty details
into the application ABI.

The per-inode persistent flag is the biggest nightmare ever, as we see
in all these discussions about it.

What does it even mean?  Right now it forces direct addressing as long
as the underlying media supports that.  But what about media that
you directly access but you really don't want to because it's really slow?
Or media that is so god damn fast that you never want to buffer?  Or
media where you want to buffer for writes (or at least some of them)
but not for reads?

It encodes a very specific mechanism for an early direct access
implementation into the ABI.  What we really need is for applications
to declare an intent, not specify a particular mechanism.
Ross Zwisler Sept. 26, 2017, 5:30 p.m. UTC | #5
On Tue, Sep 26, 2017 at 04:37:43PM +0200, Christoph Hellwig wrote:
> On Tue, Sep 26, 2017 at 09:09:57PM +1000, Dave Chinner wrote:
> > Well, quite frankly, I never wanted the mount option for XFS. It was
> > supposed to be for initial testing only, then we'd /always/ use the
> > the inode flags. For a filesystem to default to using DAX, we
> > set the DAX flag on the root inode at mkfs time, and then everything
> > inode flag based just works.
> 
> And I deeply fundamentally disagree.  The mount option is a nice
> enough big hammer to try a mode without encoding nitty gritty details
> into the application ABI.
> 
> The per-inode persistent flag is the biggest nightmare ever, as we see
> in all these discussions about it.
> 
> What does it even mean?  Right now it forces direct addressing as long
> as the underlying media supports that.  But what about media that
> you directly access but you really don't want to because it's really slow?
> Or media that is so god damn fast that you never want to buffer?  Or
> media where you want to buffer for writes (or at least some of them)
> but not for reads?
> 
> It encodes a very specific mechanism for an early direct access
> implementation into the ABI.  What we really need is for applications
> to declare an intent, not specify a particular mechanism.

I agree that Christoph's idea about having the system intelligently adjust to
use DAX based on performance information it gathers about the underlying
persistent memory (probably via the HMAT on x86_64 systems) is interesting,
but I think we're still a ways away from that.

FWIW, as my patches suggest and Jan observed I think that we should allow
users to turn on DAX by treating the inode flag and the mount flag as an 'or'
operation.  i.e. you get DAX if either the mount option is specified or if the
inode flag is set, and you can continue to manipulate the per-inode flag as
you want regardless of the mount option.  I think this provides maximum
flexibility of the mechanism to select DAX without enforcing policy.

In the end, though, I think what's really important is that we figure out what
the various options mean, have the same story for both XFS and ext4, and
document it as hch suggested in response to my patch 7 in this series.

Does it make sense at this point to just start a "dax" man page that can
contain info about the mount options, inode flags, kernel config options, how
to get PMDs, etc?  Or does this documentation need to be sprinkled around more
in existing man pages?
Eric Sandeen Sept. 26, 2017, 6:02 p.m. UTC | #6
On 9/26/17 6:09 AM, Dave Chinner wrote:
> On Tue, Sep 26, 2017 at 11:35:48AM +0200, Jan Kara wrote:
>> On Tue 26-09-17 09:38:12, Dave Chinner wrote:
>>> On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
>>>> Before support for the per-inode DAX flag was disabled the XFS the code had
>>>> an issue where the user couldn't reliably tell whether or not DAX was being
>>>> used to service page faults and I/O when the DAX mount option was used.  In
>>>> this case each inode within the mounted filesystem started with S_DAX set
>>>> due to the mount option, but it could be cleared if someone touched the
>>>> individual inode flag.
>>>>
>>>> For example (v4.13 and before):
>>>>
>>>>   # mount | grep dax
>>>>   /dev/pmem0 on /mnt type xfs
>>>>   (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
>>>>
>>>>   # touch /mnt/a /mnt/b   # both files currently use DAX
>>>>
>>>>   # xfs_io -c "lsattr" /mnt/*  # neither has the DAX inode option set
>>>>   ----------e----- /mnt/a
>>>>   ----------e----- /mnt/b
>>>>
>>>>   # xfs_io -c "chattr -x" /mnt/a  # this clears S_DAX for /mnt/a
>>>>
>>>>   # xfs_io -c "lsattr" /mnt/*
>>>>   ----------e----- /mnt/a
>>>>   ----------e----- /mnt/b
>>>
>>> That's really a bug in the lsattr code, yes? If we've cleared the
>>> S_DAX flag for the inode, then why is it being reported in lsattr?
>>> Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
>>> then isn't that the bug that needs fixing?
>>>
>>> Remember, the whole point of the dax inode flag was to be able to
>>> override the mount option setting so that admins could turn off/on
>>> dax for the things that didn't/did work with DAX correctly so they
>>> didn't need multiple filesystems on pmem to segregate the apps that
>>> did/didn't work with DAX...
>>
>> So I think there is some confusion that is created by the fact that whether
>> DAX is used or not is controlled by both a mount option and an inode flag.
>> We could define that "Inode flag always wins" which is what you seem to
>> suggest above but then mount option has no practical effect since on-disk
>> S_DAX flag will always overrule it.
> 
> Well, quite frankly, I never wanted the mount option for XFS. It was
> supposed to be for initial testing only, then we'd /always/ use the
> the inode flags. For a filesystem to default to using DAX, we
> set the DAX flag on the root inode at mkfs time, and then everything
> inode flag based just works.
> 
> But it seems that we're now stuck with the stupid, blunt, brute
> force mount option because that's what the first commit on ext4
> used.  Now we're just about stuck with this silly "but we can't turn
> it off" problem because of the mount option overriding everything.

I don't think the existence of a mount option in ext4 makes us any
more "stuck" than the mount option in xfs does.

fs/xfs/xfs_super.c:		"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
fs/ext4/super.c:		"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");

so when^wif this argument ever gets settled, I think there is plenty
of latitude to do the right thing, potentially breaking the old thing.

> If we have to keep the mount option, then lets fix it to mean "mount
> option sets inheritable inode flag on directory creation" and
> /maybe/ "mount option sets inode flag on file creation".
> 
> This then allows the inode flag to control everything else. i.e the
> mount option sets the initial flag value rather than the behaviour
> of the inode. The behaviour of the inode should be entirely
> controlled by the inode flag, hence after initial creation the
> chattr +/-x commands do what they advertise regardless of the mount
> option value.
> 
> Yes, it means that existing users are going to have to run chattr -R
> +x on their pmem filesystems to get the inode flags on disk, but
> this is all tagged with EXPERIMENTAL and this is the sort of change
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> that is expected from experimental functionality.

Right.

-Eric

> Cheers,
> 
> Dave.
>
Ross Zwisler Sept. 26, 2017, 6:50 p.m. UTC | #7
On Tue, Sep 26, 2017 at 09:38:12AM +1000, Dave Chinner wrote:
> On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
> > Before support for the per-inode DAX flag was disabled the XFS the code had
> > an issue where the user couldn't reliably tell whether or not DAX was being
> > used to service page faults and I/O when the DAX mount option was used.  In
> > this case each inode within the mounted filesystem started with S_DAX set
> > due to the mount option, but it could be cleared if someone touched the
> > individual inode flag.
> > 
> > For example (v4.13 and before):
> > 
> >   # mount | grep dax
> >   /dev/pmem0 on /mnt type xfs
> >   (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
> > 
> >   # touch /mnt/a /mnt/b   # both files currently use DAX
> > 
> >   # xfs_io -c "lsattr" /mnt/*  # neither has the DAX inode option set
> >   ----------e----- /mnt/a
> >   ----------e----- /mnt/b
> > 
> >   # xfs_io -c "chattr -x" /mnt/a  # this clears S_DAX for /mnt/a
> > 
> >   # xfs_io -c "lsattr" /mnt/*
> >   ----------e----- /mnt/a
> >   ----------e----- /mnt/b
> 
> That's really a bug in the lsattr code, yes? If we've cleared the
> S_DAX flag for the inode, then why is it being reported in lsattr?
> Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
> then isn't that the bug that needs fixing?

No, I think lsattr/chattr are working correctly.  In both the examples above
the DAX inode flag (which is represeted by an 'x') is never set.  S_DAX is the
in-memory inode flag (not the on-media inode flag) which is not manipulated
directly by lsattr/chattr, but instead reflects whether the inode is actually
using DAX or not.

Manipulating and displaying the on-media inode flag works as expected with
lsattr/chattr:

  # xfs_io -c "lsattr" ./a 
  ---------------- ./a 
  
  # xfs_io -c "chattr +x" ./a 
  
  # xfs_io -c "lsattr" ./a 
  --------------x- ./a 

- Ross
Darrick J. Wong Sept. 26, 2017, 7:48 p.m. UTC | #8
On Tue, Sep 26, 2017 at 11:30:57AM -0600, Ross Zwisler wrote:
> On Tue, Sep 26, 2017 at 04:37:43PM +0200, Christoph Hellwig wrote:
> > On Tue, Sep 26, 2017 at 09:09:57PM +1000, Dave Chinner wrote:
> > > Well, quite frankly, I never wanted the mount option for XFS. It was
> > > supposed to be for initial testing only, then we'd /always/ use the
> > > the inode flags. For a filesystem to default to using DAX, we
> > > set the DAX flag on the root inode at mkfs time, and then everything
> > > inode flag based just works.
> > 
> > And I deeply fundamentally disagree.  The mount option is a nice
> > enough big hammer to try a mode without encoding nitty gritty details
> > into the application ABI.
> > 
> > The per-inode persistent flag is the biggest nightmare ever, as we see
> > in all these discussions about it.
> > 
> > What does it even mean?  Right now it forces direct addressing as long
> > as the underlying media supports that.  But what about media that
> > you directly access but you really don't want to because it's really slow?
> > Or media that is so god damn fast that you never want to buffer?  Or
> > media where you want to buffer for writes (or at least some of them)
> > but not for reads?
> > 
> > It encodes a very specific mechanism for an early direct access
> > implementation into the ABI.  What we really need is for applications
> > to declare an intent, not specify a particular mechanism.
> 
> I agree that Christoph's idea about having the system intelligently adjust to
> use DAX based on performance information it gathers about the underlying
> persistent memory (probably via the HMAT on x86_64 systems) is interesting,
> but I think we're still a ways away from that.
> 
> FWIW, as my patches suggest and Jan observed I think that we should allow
> users to turn on DAX by treating the inode flag and the mount flag as an 'or'
> operation.  i.e. you get DAX if either the mount option is specified or if the
> inode flag is set, and you can continue to manipulate the per-inode flag as
> you want regardless of the mount option.  I think this provides maximum
> flexibility of the mechanism to select DAX without enforcing policy.
> 
> In the end, though, I think what's really important is that we figure out what
> the various options mean, have the same story for both XFS and ext4, and
> document it as hch suggested in response to my patch 7 in this series.

Agreed.  We have a fundamental conflict between letting the sysadmin or
user decide how they want an inode to behave vs. letting the kernel make
all the decisions based on whatever information it gathers.

I'm pulled this patch out of -fixes and for-next because I feel strongly
discouraged about taking any more patches that change the user-control
parts of the DAX implementation until we reach a consensus on what to
do.

Given that DAX and pmem support in filesystems is still experimental,
I'm open to changing the interface as needed.  Where do we think we'll
be in a few years once ACPI or whatever reaches the point of being able
to tell the kernel about the general performance characteristics of the
pmem?  What choices about the interface do we need to make now so that
we can get there while minimizing the number of insufficient interfaces
to deprecate?

My personal guess is that most programs will not care enough to want to
make a syscall so we might as well give them the most performant option
available.

Roughly speaking, here are the use cases I can think of:

 * Regular buffered read/write: we can let the kernel decide if it wants
   to push the IO through the page cache, directly access the pmem, or
   some future combination of the two.

 * O_DIRECT read/write: Straight to pmem.

 * Regular mmap: This seems fairly agnostic to how we actually make the
   memory mapping work, right?

 * MAP_DIRECT/MAP_SYNC mmap: If userspace actually goes to the trouble
   of making sure the whole range is allocated and pre-written and uses
   these flags then they get direct access.

I've wondered off and on if an acceptable solution is to define a number
of things surrounding an inode for which XFS /could/ optimize, and let
the user tell us which one thing matters most to them: total manual
control over everything like we do now, sequential io, random io,
fastest mmap access possible, most direct access to storage, etc.
If you set a hint other than full manual control then XFS reserves the
right to change inode flags at any time to satisfy the hint.

For the most part I'm in favor of Christoph's suggestion to let the
kernel decide on its own, and I don't see the point in encoding details
of the storage medium access strategy on the disk, particularly since
filesystems are supposed to be fairly independent of storage.  But
frankly, so many people have asked me over the years if there's some way
to influence the decision-making that I won't quite let go of file hints
as a way to influence the decisions XFS makes around storage media.

> Does it make sense at this point to just start a "dax" man page that can
> contain info about the mount options, inode flags, kernel config options, how
> to get PMDs, etc?  Or does this documentation need to be sprinkled around more
> in existing man pages?

Personally it'd be a lot easier to tell internal groups to go look at a
single documentation page that discusses everything you'd want to know
about enabling dax -- how to control it, how to make large page table
entries work, etc.  Some of those things will get into fs internals,
however, which probably belong in the ext4/xfs manpages.  I suggest
laying out the general details in a single dax manpage and pointing
people at each fs's documentation for specific details.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Sept. 26, 2017, 10 p.m. UTC | #9
On Tue, Sep 26, 2017 at 12:48:30PM -0700, Darrick J. Wong wrote:
> For the most part I'm in favor of Christoph's suggestion to let the
> kernel decide on its own, and I don't see the point in encoding details
> of the storage medium access strategy on the disk, particularly since
> filesystems are supposed to be fairly independent of storage.  But
> frankly, so many people have asked me over the years if there's some way
> to influence the decision-making that I won't quite let go of file hints
> as a way to influence the decisions XFS makes around storage media.

And that's pretty much it. The discussion here is not about whether
there should be a flag, but what semantics it should have when the
flag is not set. If "flag not set" means "kernel selects
automatically", then that's fine by me.

But history tells us that users and admins want a way to be able to
override the kernel's automatic behaviours because they are /never
100% correct/ for everyone. There are always exceptions, otherwise
we wouldn't have the great plethora of mkfs, mount, proc and sysfs
options for our filesystems or storage. Anyone who says "the kernel
will always do the right thing for everyone automatically" is living
in a dream world.

Note: I agree that the kernel should do the right thing w.r.t. DAX
automatically. We don't need a mount option for that - we can probe
for dax support automatically and use it automatically already.
However, in a world where the kernel automatically uses that
functionality when it is present, admins and users need a way to
solve the "default behaviour is bad for me, let me control this
manually" problem. That's where the inode flags come in....

i.e. What I'm advocating is a model DAX gets enabled automatically
if the underlying device supports is using whatever the kernel
thinks is optimal at the time the access is made, but the user can
override/direct behvaiour on a case by case basis via persistent
inode flags/xattrs/whatever.

Cheers,

Dave.
Christoph Hellwig Sept. 27, 2017, 6:40 a.m. UTC | #10
On Tue, Sep 26, 2017 at 11:30:57AM -0600, Ross Zwisler wrote:
> I agree that Christoph's idea about having the system intelligently adjust to
> use DAX based on performance information it gathers about the underlying
> persistent memory (probably via the HMAT on x86_64 systems) is interesting,
> but I think we're still a ways away from that.

So what are the missing blockers for a getting started?

> FWIW, as my patches suggest and Jan observed I think that we should allow
> users to turn on DAX by treating the inode flag and the mount flag as an 'or'
> operation.  i.e. you get DAX if either the mount option is specified or if the
> inode flag is set, and you can continue to manipulate the per-inode flag as
> you want regardless of the mount option.  I think this provides maximum
> flexibility of the mechanism to select DAX without enforcing policy.

IFF we stick to the dax flag that's the only workable way.  The only
major issue I still see with that is that this allows unprivilegued
users to enable DAX on a any file they own / have write access to.
So there isn't really any way to effectively disable the DAX path
by the sysadmin.

> Does it make sense at this point to just start a "dax" man page that can
> contain info about the mount options, inode flags, kernel config options, how
> to get PMDs, etc?  Or does this documentation need to be sprinkled around more
> in existing man pages?

A dax manpage would be good.
Ross Zwisler Sept. 27, 2017, 4:15 p.m. UTC | #11
On Tue, Sep 26, 2017 at 11:40:01PM -0700, Christoph Hellwig wrote:
> On Tue, Sep 26, 2017 at 11:30:57AM -0600, Ross Zwisler wrote:
> > I agree that Christoph's idea about having the system intelligently adjust to
> > use DAX based on performance information it gathers about the underlying
> > persistent memory (probably via the HMAT on x86_64 systems) is interesting,
> > but I think we're still a ways away from that.
> 
> So what are the missing blockers for a getting started?

Well, I don't know if platforms that support HMAT + PMEM are widely available,
but we have all the details in the ACPI spec, so we could begin to code it up
and things will "just work" when platforms arrive.

> > FWIW, as my patches suggest and Jan observed I think that we should allow
> > users to turn on DAX by treating the inode flag and the mount flag as an 'or'
> > operation.  i.e. you get DAX if either the mount option is specified or if the
> > inode flag is set, and you can continue to manipulate the per-inode flag as
> > you want regardless of the mount option.  I think this provides maximum
> > flexibility of the mechanism to select DAX without enforcing policy.
> 
> IFF we stick to the dax flag that's the only workable way.  The only
> major issue I still see with that is that this allows unprivilegued
> users to enable DAX on a any file they own / have write access to.
> So there isn't really any way to effectively disable the DAX path
> by the sysadmin.

Hum, I wonder if maybe we need/want three different mount modes?  What about:

autodax (the default): the filesystem is free to use DAX or not, as it sees
fit and thinks is optimal.  For the time being we can make this mean "don't
use DAX", and phase in DAX usage as we add support for the HMAT, etc.

Users can manually turn on DAX for a given inode by setting the DAX inode
flag, but there is no way for the user to *prevent* DAX for an inode - the
kernel can always choose to turn it on.

MAP_DIRECT and MAP_SYNC work.

nodax: Don't use DAX.  The kernel won't choose to use DAX, and any DAX inode
flags will be ignored.  This gives the sysadmin the override that I think
you're looking for.  The user can still manipulate the inode flags as they see
fit.

MAP_DIRECT and MAP_SYNC both fail.

dax: Use DAX for all inodes in the filesystem.  Again the inode flags are
essentially ignored, but the user can manipulate the inode flags as they see
fit.  This is basically unchanged from how it works today, modulo the bug
where DAX can get turned off if you unset the inode flag where it wasn't even
set (patch 1 in my series).

MAP_DIRECT and MAP_SYNC work.

> > Does it make sense at this point to just start a "dax" man page that can
> > contain info about the mount options, inode flags, kernel config options, how
> > to get PMDs, etc?  Or does this documentation need to be sprinkled around more
> > in existing man pages?
> 
> A dax manpage would be good.

Okay, I'll start with a manpage, and once we agree on whats in there we can
start working on code again. :)
Christoph Hellwig Oct. 1, 2017, 8:17 a.m. UTC | #12
On Wed, Sep 27, 2017 at 10:15:10AM -0600, Ross Zwisler wrote:
> Well, I don't know if platforms that support HMAT + PMEM are widely available,
> but we have all the details in the ACPI spec, so we could begin to code it up
> and things will "just work" when platforms arrive.

Then again currently all actually shipping NVDIMMs are battery backed
dram and DAX mode should work just fine for them.  Things will get
interesting once companies start shipping actually persistent technologies
that will be significantly slower than DRAM.  And we sould make sure
we have the infrastruture for that in place.

> Hum, I wonder if maybe we need/want three different mount modes?  What about:
> 
> autodax (the default): the filesystem is free to use DAX or not, as it sees
> fit and thinks is optimal.  For the time being we can make this mean "don't
> use DAX", and phase in DAX usage as we add support for the HMAT, etc.

What does "use DAX" really mean anyway?

I think we are conflating a few things:

 a) use a block device or use a dax_device for accessing the device
 b) use the pagecache for caching data in DRAM or not.

Now we actually have a really nice way to control a) already, it's
called O_DIRECT.  Currently O_DIRECT only works with read/write I/O,
but with a byte addressable scheme we now can implement it for mmap
as well, which is what the DAX mmap path does.

b) right now is implied by a), but it's really an implementation
detail.

So the modes would be more like two options to:

 a) disallow any byte-level access.  The right way to do that would
    be to mount the /dev/dax* device instead of the block device
    to allow byte access, and disallow any DAXish operation if you
    mount the block device in the long run.
 b) have a mode to always force an O_DIRECT-like mode for devices
    that are fast enough.  We should always do that with the right
    HMAT entries if mounting the /dev/dax devices, and maybe have
    a mount option to force it.
diff mbox

Patch

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 5049e8a..26faeb9 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1013,7 +1013,7 @@  xfs_diflags_to_linux(
 	else
 		inode->i_flags &= ~S_NOATIME;
 #if 0	/* disabled until the flag switching races are sorted out */
-	if (xflags & FS_XFLAG_DAX)
+	if ((xflags & FS_XFLAG_DAX) || (ip->i_mount->m_flags & XFS_MOUNT_DAX))
 		inode->i_flags |= S_DAX;
 	else
 		inode->i_flags &= ~S_DAX;
@@ -1104,7 +1104,14 @@  xfs_ioctl_setattr_dax_invalidate(
 			return -EINVAL;
 	}
 
-	/* If the DAX state is not changing, we have nothing to do here. */
+	/*
+	 * If the DAX state is not changing, we have nothing to do here.  If
+	 * the DAX mount option was used we will update the DAX inode flag as
+	 * the user requested but we will continue to use DAX for I/O and page
+	 * faults regardless of how the inode flag is set.
+	 */
+	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
+		return 0;
 	if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
 		return 0;
 	if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))