[v2,2/3] mm, dax: add VM_DAX flag for DAX VMAs
diff mbox

Message ID 20160915230748.GS30497@dastard
State New, archived
Headers show

Commit Message

Dave Chinner Sept. 15, 2016, 11:07 p.m. UTC
On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> >> The DAX property, page cache bypass, of a VMA is only detectable via the
> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
> >> only available internal to the kernel and is a property that userspace
> >> applications would like to interrogate.
> >
> > They have absolutely no business knowing such an implementation detail.
> 
> Hasn't that train already left the station with FS_XFLAG_DAX?

No, that's an admin flag, not a runtime hint for applications. Just
because that flag is set on an inode, it does not mean that DAX is
actually in use - it will be ignored if the backing dev is not dax
capable.

> The other problem with hiding the DAX property is that it turns out to
> not be a transparent acceleration feature.  See xfs/086 xfs/088
> xfs/089 xfs/091 which fail with DAX and, as far as I understand, it is
> due to the fact that DAX disallows delayed allocation behavior.

Which is not a bug, nor is it something that app developers should
be surprised by.

i.e. Subtle differences in error reporting behaviour occur in
filesystems /all the time/. Run the test on a non-dax filesystem
with an extent size hint. It fails /exactly the same way as DAX/.
Run it with direct IO - fails the same way as DAX. Run it
with synchronous writes - it fails the same way as DAX.

IOWs, if an app can't handle the way DAX reports errors, then they
are /broken/. Delayed allocation requires checking the return value
of fsync() or close() to capture the allocation error - many more
apps get that wrong than the ones that expect the immediate errors
from write()...

Anyway: to domeonstrate that the nothign is actually broken, and
you might sometimes need to fix tests and send patches to
fstests@vger.kernel.org, this makes xfs/086 pass for me on DAX:

Cheers,

Dave.

Comments

Dan Williams Sept. 15, 2016, 11:19 p.m. UTC | #1
On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>> >> only available internal to the kernel and is a property that userspace
>> >> applications would like to interrogate.
>> >
>> > They have absolutely no business knowing such an implementation detail.
>>
>> Hasn't that train already left the station with FS_XFLAG_DAX?
>
> No, that's an admin flag, not a runtime hint for applications. Just
> because that flag is set on an inode, it does not mean that DAX is
> actually in use - it will be ignored if the backing dev is not dax
> capable.

Ok, but then VM_DAX does not suffer from that problem.  I'm trying to
understand why VM_DAX has no business being in the smaps "VmFlags"
line, but something ambiguous to userspace like VM_MIXEDMAP does?

>
>> The other problem with hiding the DAX property is that it turns out to
>> not be a transparent acceleration feature.  See xfs/086 xfs/088
>> xfs/089 xfs/091 which fail with DAX and, as far as I understand, it is
>> due to the fact that DAX disallows delayed allocation behavior.
>
> Which is not a bug, nor is it something that app developers should
> be surprised by.
>
> i.e. Subtle differences in error reporting behaviour occur in
> filesystems /all the time/. Run the test on a non-dax filesystem
> with an extent size hint. It fails /exactly the same way as DAX/.
> Run it with direct IO - fails the same way as DAX. Run it
> with synchronous writes - it fails the same way as DAX.
>
> IOWs, if an app can't handle the way DAX reports errors, then they
> are /broken/. Delayed allocation requires checking the return value
> of fsync() or close() to capture the allocation error - many more
> apps get that wrong than the ones that expect the immediate errors
> from write()...
>
> Anyway: to domeonstrate that the nothign is actually broken, and
> you might sometimes need to fix tests and send patches to
> fstests@vger.kernel.org, this makes xfs/086 pass for me on DAX:
>
> --- a/tests/xfs/086
> +++ b/tests/xfs/086
> @@ -96,7 +96,8 @@ _scratch_mount
>
>  echo "+ modify files"
>  for x in `seq 1 64`; do
> -       $XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full
> +       $XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" \
> +               >> $seqres.full 2>&1
>  done
>  umount "${SCRATCH_MNT}"

Thanks for that!  Wasn't immediately obvious to me, and didn't get
that response when I asked on the list a while back.
Dan Williams Sept. 16, 2016, 12:16 a.m. UTC | #2
On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>> >> only available internal to the kernel and is a property that userspace
>> >> applications would like to interrogate.
>> >
>> > They have absolutely no business knowing such an implementation detail.
>>
>> Hasn't that train already left the station with FS_XFLAG_DAX?
>
> No, that's an admin flag, not a runtime hint for applications. Just
> because that flag is set on an inode, it does not mean that DAX is
> actually in use - it will be ignored if the backing dev is not dax
> capable.
>

What's the point of an admin flag if an admin can't do cat /proc/<pid
of interest>/smaps, or some other mechanism, to validate that the
setting the admin cares about is in effect?
Dave Chinner Sept. 16, 2016, 1:24 a.m. UTC | #3
On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
> >> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
> >> >> only available internal to the kernel and is a property that userspace
> >> >> applications would like to interrogate.
> >> >
> >> > They have absolutely no business knowing such an implementation detail.
> >>
> >> Hasn't that train already left the station with FS_XFLAG_DAX?
> >
> > No, that's an admin flag, not a runtime hint for applications. Just
> > because that flag is set on an inode, it does not mean that DAX is
> > actually in use - it will be ignored if the backing dev is not dax
> > capable.
> 
> What's the point of an admin flag if an admin can't do cat /proc/<pid
> of interest>/smaps, or some other mechanism, to validate that the
> setting the admin cares about is in effect?

Sorry, I don't follow - why would you be looking at mapping file
regions in /proc to determine if some file somewhere in a filesystem
has a specific flag set on it or not?

FS_XFLAG_DAX is an inode attribute flag, not something you can
query or administrate through mmap:

I.e.
# xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
 --------------- foo
 --------------x foo
 --------------- foo
#

What happens when that flag is set on an inode is determined by a
whole bunch of other things that are completely separate to the
management of the inode flag itself.

Cheers,

Dave.
Dan Williams Sept. 16, 2016, 2:04 a.m. UTC | #4
On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>> >> >> only available internal to the kernel and is a property that userspace
>> >> >> applications would like to interrogate.
>> >> >
>> >> > They have absolutely no business knowing such an implementation detail.
>> >>
>> >> Hasn't that train already left the station with FS_XFLAG_DAX?
>> >
>> > No, that's an admin flag, not a runtime hint for applications. Just
>> > because that flag is set on an inode, it does not mean that DAX is
>> > actually in use - it will be ignored if the backing dev is not dax
>> > capable.
>>
>> What's the point of an admin flag if an admin can't do cat /proc/<pid
>> of interest>/smaps, or some other mechanism, to validate that the
>> setting the admin cares about is in effect?
>
> Sorry, I don't follow - why would you be looking at mapping file
> regions in /proc to determine if some file somewhere in a filesystem
> has a specific flag set on it or not?
>
> FS_XFLAG_DAX is an inode attribute flag, not something you can
> query or administrate through mmap:
>
> I.e.
> # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
>  --------------- foo
>  --------------x foo
>  --------------- foo
> #
>
> What happens when that flag is set on an inode is determined by a
> whole bunch of other things that are completely separate to the
> management of the inode flag itself.

Right, I understand that, but how does an admin audit those "bunch of
other things" that actually gate whether DAX ends up being used in
practice?  There's currently no way for userspace to observe that a
file with FS_XFLAG_DAX actually results in a change in mmap behavior.
Dan Williams Sept. 16, 2016, 3:41 a.m. UTC | #5
On Thu, Sep 15, 2016 at 7:04 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote:
>> On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
>>> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>>> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>>> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>>> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>>> >> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>>> >> >> only available internal to the kernel and is a property that userspace
>>> >> >> applications would like to interrogate.
>>> >> >
>>> >> > They have absolutely no business knowing such an implementation detail.
>>> >>
>>> >> Hasn't that train already left the station with FS_XFLAG_DAX?
>>> >
>>> > No, that's an admin flag, not a runtime hint for applications. Just
>>> > because that flag is set on an inode, it does not mean that DAX is
>>> > actually in use - it will be ignored if the backing dev is not dax
>>> > capable.
>>>
>>> What's the point of an admin flag if an admin can't do cat /proc/<pid
>>> of interest>/smaps, or some other mechanism, to validate that the
>>> setting the admin cares about is in effect?
>>
>> Sorry, I don't follow - why would you be looking at mapping file
>> regions in /proc to determine if some file somewhere in a filesystem
>> has a specific flag set on it or not?
>>
>> FS_XFLAG_DAX is an inode attribute flag, not something you can
>> query or administrate through mmap:
>>
>> I.e.
>> # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
>>  --------------- foo
>>  --------------x foo
>>  --------------- foo
>> #
>>
>> What happens when that flag is set on an inode is determined by a
>> whole bunch of other things that are completely separate to the
>> management of the inode flag itself.
>
> Right, I understand that, but how does an admin audit those "bunch of
> other things" that actually gate whether DAX ends up being used in
> practice?  There's currently no way for userspace to observe that a
> file with FS_XFLAG_DAX actually results in a change in mmap behavior.

Let me put it another way, if we inadvertently break DAX causing it to
be disabled in scenarios when it should be enabled.  What is the
interface for the admin to check "I have the DAX inode flag set, but
the file this application expects to be mapped DAX is mapped with page
cache"?
Dave Chinner Sept. 16, 2016, 5:36 a.m. UTC | #6
On Thu, Sep 15, 2016 at 07:04:27PM -0700, Dan Williams wrote:
> On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
> >> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
> >> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
> >> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
> >> >> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
> >> >> >> only available internal to the kernel and is a property that userspace
> >> >> >> applications would like to interrogate.
> >> >> >
> >> >> > They have absolutely no business knowing such an implementation detail.
> >> >>
> >> >> Hasn't that train already left the station with FS_XFLAG_DAX?
> >> >
> >> > No, that's an admin flag, not a runtime hint for applications. Just
> >> > because that flag is set on an inode, it does not mean that DAX is
> >> > actually in use - it will be ignored if the backing dev is not dax
> >> > capable.
> >>
> >> What's the point of an admin flag if an admin can't do cat /proc/<pid
> >> of interest>/smaps, or some other mechanism, to validate that the
> >> setting the admin cares about is in effect?
> >
> > Sorry, I don't follow - why would you be looking at mapping file
> > regions in /proc to determine if some file somewhere in a filesystem
> > has a specific flag set on it or not?
> >
> > FS_XFLAG_DAX is an inode attribute flag, not something you can
> > query or administrate through mmap:
> >
> > I.e.
> > # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
> >  --------------- foo
> >  --------------x foo
> >  --------------- foo
> > #
> >
> > What happens when that flag is set on an inode is determined by a
> > whole bunch of other things that are completely separate to the
> > management of the inode flag itself.
> 
> Right, I understand that, but how does an admin audit those "bunch of
> other things"

Filesystem mounts checks all the various stuff that determines
whether DAX can be used. It logs to the console that it is "Dax
capable". Any file that then has FS_XFLAG_DAX set will result in DAX
being used. There is no other possibility when these two things are
reported.

/me points at runtime diagnostic tracepoints like
trace_xfs_file_dax_read() and notes that dax is sadly lacking in
diagnostic tracepoints.

Besides, userspace can't do anything useful with this information,
because the FS_XFLAG_DAX can be changed /at any time/ by an admin.
And the filesystem is free to remove it at any time, too, if it
needs to (e.g. file gets reflinked or snapshotted).

That's right, an inode can dynamically change from DAX to non-DAX
underneath the application, and the application /will not notice/.
That's because changing the flag will sync and invalidate the
existing mappings and the next application access will simply fault
it back in using whatever mechanism the inode is now configured
with.

Plain and simple: userspace has absolutely no fucking idea of
whether DAX is enabled or not, and whatever the kernel returns to
userspace above the DAX configuration is stale before it even got
out of the kernel....

Cheers,

Dave.
Dan Williams Sept. 16, 2016, 10:47 a.m. UTC | #7
On Thu, Sep 15, 2016 at 10:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 15, 2016 at 07:04:27PM -0700, Dan Williams wrote:
>> On Thu, Sep 15, 2016 at 6:24 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Sep 15, 2016 at 05:16:42PM -0700, Dan Williams wrote:
>> >> On Thu, Sep 15, 2016 at 4:07 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Thu, Sep 15, 2016 at 10:01:03AM -0700, Dan Williams wrote:
>> >> >> On Thu, Sep 15, 2016 at 1:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> >> > On Wed, Sep 14, 2016 at 11:54:38PM -0700, Dan Williams wrote:
>> >> >> >> The DAX property, page cache bypass, of a VMA is only detectable via the
>> >> >> >> vma_is_dax() helper to check the S_DAX inode flag.  However, this is
>> >> >> >> only available internal to the kernel and is a property that userspace
>> >> >> >> applications would like to interrogate.
>> >> >> >
>> >> >> > They have absolutely no business knowing such an implementation detail.
>> >> >>
>> >> >> Hasn't that train already left the station with FS_XFLAG_DAX?
>> >> >
>> >> > No, that's an admin flag, not a runtime hint for applications. Just
>> >> > because that flag is set on an inode, it does not mean that DAX is
>> >> > actually in use - it will be ignored if the backing dev is not dax
>> >> > capable.
>> >>
>> >> What's the point of an admin flag if an admin can't do cat /proc/<pid
>> >> of interest>/smaps, or some other mechanism, to validate that the
>> >> setting the admin cares about is in effect?
>> >
>> > Sorry, I don't follow - why would you be looking at mapping file
>> > regions in /proc to determine if some file somewhere in a filesystem
>> > has a specific flag set on it or not?
>> >
>> > FS_XFLAG_DAX is an inode attribute flag, not something you can
>> > query or administrate through mmap:
>> >
>> > I.e.
>> > # xfs_io -c "lsattr" -c "chattr +x" -c lsattr -c "chattr -x" -c "lsattr" foo
>> >  --------------- foo
>> >  --------------x foo
>> >  --------------- foo
>> > #
>> >
>> > What happens when that flag is set on an inode is determined by a
>> > whole bunch of other things that are completely separate to the
>> > management of the inode flag itself.
>>
>> Right, I understand that, but how does an admin audit those "bunch of
>> other things"
>
> Filesystem mounts checks all the various stuff that determines
> whether DAX can be used. It logs to the console that it is "Dax
> capable". Any file that then has FS_XFLAG_DAX set will result in DAX
> being used. There is no other possibility when these two things are
> reported.
>
> /me points at runtime diagnostic tracepoints like
> trace_xfs_file_dax_read() and notes that dax is sadly lacking in
> diagnostic tracepoints.
>
> Besides, userspace can't do anything useful with this information,
> because the FS_XFLAG_DAX can be changed /at any time/ by an admin.
> And the filesystem is free to remove it at any time, too, if it
> needs to (e.g. file gets reflinked or snapshotted).
>
> That's right, an inode can dynamically change from DAX to non-DAX
> underneath the application, and the application /will not notice/.
> That's because changing the flag will sync and invalidate the
> existing mappings and the next application access will simply fault
> it back in using whatever mechanism the inode is now configured
> with.
>
> Plain and simple: userspace has absolutely no fucking idea of
> whether DAX is enabled or not, and whatever the kernel returns to
> userspace above the DAX configuration is stale before it even got
> out of the kernel....

smaps is already known to be an ephemeral interface, but we output
useful information there nonetheless.

Patch
diff mbox

--- a/tests/xfs/086
+++ b/tests/xfs/086
@@ -96,7 +96,8 @@  _scratch_mount
 
 echo "+ modify files"
 for x in `seq 1 64`; do
-	$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full
+	$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" \
+		>> $seqres.full 2>&1
 done
 umount "${SCRATCH_MNT}"