diff mbox

[RFC,2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem

Message ID 149766213493.22552.4057048843646200083.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dan Williams June 17, 2017, 1:15 a.m. UTC
To date, the full promise of byte-addressable access to persistent
memory has only been half realized via the filesystem-dax interface. The
current filesystem-dax mechanism allows an application to consume (read)
data from persistent storage at byte-size granularity, bypassing the
full page reads required by traditional storage devices.

Now, for writes, applications still need to contend with
page-granularity dirtying and flushing semantics as well as filesystem
coordination for metadata updates after any mmap write. The current
situation precludes use cases that leverage byte-granularity / in-place
updates to persistent media.

To get around this limitation there are some specialized applications
that are using the device-dax interface to bypass the overhead and
data-safety problems of the current filesystem-dax mmap-write path.
QEMU-KVM is forced to use device-dax to safely pass through persistent
memory to a guest [1]. Some specialized databases are using device-dax
for byte-granularity writes. Outside of those cases, device-dax is
difficult for general purpose persistent memory applications to consume.
There is demand for access to pmem without needing to contend with
special device configuration and other device-dax limitations.

The 'daxfile' interface satisfies this demand and realizes one of Dave
Chinner's ideas for allowing pmem applications to safely bypass
fsync/msync requirements. The idea is to make the file immutable with
respect to the offset-to-block mappings for every extent in the file
[2]. It turns out that filesystems already need to make this guarantee
today. This property is needed for files marked as swap files.

The new daxctl() syscall manages setting a file into 'static-dax' mode
whereby it arranges for the file to be treated as a swapfile as far as
the filesystem is concerned, but not registered with the core-mm as
swapfile space. A file in this mode is then safe to be mapped and
written without the requirement to fsync/msync the writes.  The cpu
cache management for flushing data to persistence can be handled
completely in userspace.

[1]: https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg01207.html
[2]: https://lkml.org/lkml/2016/9/11/159

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 include/linux/dax.h                    |    9 ++
 include/linux/fs.h                     |    3 +
 include/linux/syscalls.h               |    1 
 include/uapi/linux/dax.h               |    8 +
 mm/Kconfig                             |    5 +
 mm/Makefile                            |    1 
 mm/daxfile.c                           |  186 ++++++++++++++++++++++++++++++++
 mm/page_io.c                           |   31 +++++
 9 files changed, 245 insertions(+)
 create mode 100644 include/uapi/linux/dax.h
 create mode 100644 mm/daxfile.c

Comments

Andy Lutomirski June 17, 2017, 4:25 p.m. UTC | #1
On Fri, Jun 16, 2017 at 6:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> To date, the full promise of byte-addressable access to persistent
> memory has only been half realized via the filesystem-dax interface. The
> current filesystem-dax mechanism allows an application to consume (read)
> data from persistent storage at byte-size granularity, bypassing the
> full page reads required by traditional storage devices.
>
> Now, for writes, applications still need to contend with
> page-granularity dirtying and flushing semantics as well as filesystem
> coordination for metadata updates after any mmap write. The current
> situation precludes use cases that leverage byte-granularity / in-place
> updates to persistent media.
>
> To get around this limitation there are some specialized applications
> that are using the device-dax interface to bypass the overhead and
> data-safety problems of the current filesystem-dax mmap-write path.
> QEMU-KVM is forced to use device-dax to safely pass through persistent
> memory to a guest [1]. Some specialized databases are using device-dax
> for byte-granularity writes. Outside of those cases, device-dax is
> difficult for general purpose persistent memory applications to consume.
> There is demand for access to pmem without needing to contend with
> special device configuration and other device-dax limitations.
>
> The 'daxfile' interface satisfies this demand and realizes one of Dave
> Chinner's ideas for allowing pmem applications to safely bypass
> fsync/msync requirements. The idea is to make the file immutable with
> respect to the offset-to-block mappings for every extent in the file
> [2]. It turns out that filesystems already need to make this guarantee
> today. This property is needed for files marked as swap files.
>
> The new daxctl() syscall manages setting a file into 'static-dax' mode
> whereby it arranges for the file to be treated as a swapfile as far as
> the filesystem is concerned, but not registered with the core-mm as
> swapfile space. A file in this mode is then safe to be mapped and
> written without the requirement to fsync/msync the writes.  The cpu
> cache management for flushing data to persistence can be handled
> completely in userspace.

Can you remind those of us who haven't played with DAX in a while what
the problem is with mmapping a DAX file without this patchset?  If
there's some bookkkeeping needed to make sure that the filesystem will
invalidate all the mappings if it decides to move the file, maybe that
should be the default rather than needing a new syscall.

--Andy
Dan Williams June 17, 2017, 9:52 p.m. UTC | #2
On Sat, Jun 17, 2017 at 9:25 AM, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Jun 16, 2017 at 6:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> To date, the full promise of byte-addressable access to persistent
>> memory has only been half realized via the filesystem-dax interface. The
>> current filesystem-dax mechanism allows an application to consume (read)
>> data from persistent storage at byte-size granularity, bypassing the
>> full page reads required by traditional storage devices.
>>
>> Now, for writes, applications still need to contend with
>> page-granularity dirtying and flushing semantics as well as filesystem
>> coordination for metadata updates after any mmap write. The current
>> situation precludes use cases that leverage byte-granularity / in-place
>> updates to persistent media.
>>
>> To get around this limitation there are some specialized applications
>> that are using the device-dax interface to bypass the overhead and
>> data-safety problems of the current filesystem-dax mmap-write path.
>> QEMU-KVM is forced to use device-dax to safely pass through persistent
>> memory to a guest [1]. Some specialized databases are using device-dax
>> for byte-granularity writes. Outside of those cases, device-dax is
>> difficult for general purpose persistent memory applications to consume.
>> There is demand for access to pmem without needing to contend with
>> special device configuration and other device-dax limitations.
>>
>> The 'daxfile' interface satisfies this demand and realizes one of Dave
>> Chinner's ideas for allowing pmem applications to safely bypass
>> fsync/msync requirements. The idea is to make the file immutable with
>> respect to the offset-to-block mappings for every extent in the file
>> [2]. It turns out that filesystems already need to make this guarantee
>> today. This property is needed for files marked as swap files.
>>
>> The new daxctl() syscall manages setting a file into 'static-dax' mode
>> whereby it arranges for the file to be treated as a swapfile as far as
>> the filesystem is concerned, but not registered with the core-mm as
>> swapfile space. A file in this mode is then safe to be mapped and
>> written without the requirement to fsync/msync the writes.  The cpu
>> cache management for flushing data to persistence can be handled
>> completely in userspace.
>
> Can you remind those of us who haven't played with DAX in a while what
> the problem is with mmapping a DAX file without this patchset?  If
> there's some bookkkeeping needed to make sure that the filesystem will
> invalidate all the mappings if it decides to move the file, maybe that
> should be the default rather than needing a new syscall.

The bookkeeping to invalidate mappings when the filesystem moves a
block is already there.

Without this patchset an application needs to call fsync/msync after
any write to a DAX mapping otherwise there is no guarantee the
filesystem has written the metadata to find the updated block after a
crash or power loss event. Even if the sync operation is reduced to a
minimal cmpxchg in userspace to check if the filesystem-metadata is
dirty, that mechanism doesn't translate to a virtualized environment,
as requiring guests to trigger host fsync()s is not feasible. It's a
half-step solution when you can instead just ask the filesystem to
never move blocks, as Dave proposed many months back.

We stepped back from that proposal when it looked like a significant
amount of per-filesystem work to introduce the capability and it was
not clear that application developers would tolerate the side effects
of this 'immutable' semantic. However, the implementation is dead
simple since ext4 and xfs already need to make
block-allocation-immutable semantics available for swapfiles. We also
have application developers telling us they are ok with the semantics,
especially because it catches Linux up to other operating environments
that are already on board with allowing this type of access to pmem
through a filesystem. This patchset gives pmem application developers
what they want without any additional burden on filesystem
implementations.
Andy Lutomirski June 17, 2017, 11:50 p.m. UTC | #3
On Sat, Jun 17, 2017 at 2:52 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sat, Jun 17, 2017 at 9:25 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> Can you remind those of us who haven't played with DAX in a while what
>> the problem is with mmapping a DAX file without this patchset?  If
>> there's some bookkkeeping needed to make sure that the filesystem will
>> invalidate all the mappings if it decides to move the file, maybe that
>> should be the default rather than needing a new syscall.
>
> The bookkeeping to invalidate mappings when the filesystem moves a
> block is already there.
>
> Without this patchset an application needs to call fsync/msync after
> any write to a DAX mapping otherwise there is no guarantee the
> filesystem has written the metadata to find the updated block after a
> crash or power loss event. Even if the sync operation is reduced to a
> minimal cmpxchg in userspace to check if the filesystem-metadata is
> dirty, that mechanism doesn't translate to a virtualized environment,
> as requiring guests to trigger host fsync()s is not feasible. It's a
> half-step solution when you can instead just ask the filesystem to
> never move blocks, as Dave proposed many months back.
>
> We stepped back from that proposal when it looked like a significant
> amount of per-filesystem work to introduce the capability and it was
> not clear that application developers would tolerate the side effects
> of this 'immutable' semantic. However, the implementation is dead
> simple since ext4 and xfs already need to make
> block-allocation-immutable semantics available for swapfiles. We also
> have application developers telling us they are ok with the semantics,
> especially because it catches Linux up to other operating environments
> that are already on board with allowing this type of access to pmem
> through a filesystem. This patchset gives pmem application developers
> what they want without any additional burden on filesystem
> implementations.

I see.

I have a couple of minor-ish issues with the current proposal, then.
One is that I think the terminology should be changed to still make
sense if filesystems or VFS improves to make this approach
unnecessary.  Rather than saying "this file is now static", perhaps
users should set a flag with the explicit semantics that "mmaps of
this file are guaranteed not to lose data due to the kernel's
activities", IOW that mmaps will be at least as durable as a direct
mapping of DAX memory.  Then the kernel has the flexibility to add a
future implementation in which, instead of pinning the file, the
filesystem just knows to keep metadata synced before allowing
page_mkwrite to re-enable writes to an mmapped DAX file.

My other objection is that the syscall intentionally leaks a reference
to the file.  This means it needs overflow protection and it probably
shouldn't ever be allowed to use it without privilege.

Why can't the underlying issue be easily fixed, though?  Could
.page_mkwrite just make sure that metadata is synced when the FS uses
DAX?  On a DAX fs, syncing metadata should be extremely fast.  This
could be conditioned on an madvise or mmap flag if performance might
be an issue.  As far as I know, this change alone should be
sufficient.
Dan Williams June 18, 2017, 3:15 a.m. UTC | #4
On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Sat, Jun 17, 2017 at 2:52 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Sat, Jun 17, 2017 at 9:25 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> Can you remind those of us who haven't played with DAX in a while what
>>> the problem is with mmapping a DAX file without this patchset?  If
>>> there's some bookkkeeping needed to make sure that the filesystem will
>>> invalidate all the mappings if it decides to move the file, maybe that
>>> should be the default rather than needing a new syscall.
>>
>> The bookkeeping to invalidate mappings when the filesystem moves a
>> block is already there.
>>
>> Without this patchset an application needs to call fsync/msync after
>> any write to a DAX mapping otherwise there is no guarantee the
>> filesystem has written the metadata to find the updated block after a
>> crash or power loss event. Even if the sync operation is reduced to a
>> minimal cmpxchg in userspace to check if the filesystem-metadata is
>> dirty, that mechanism doesn't translate to a virtualized environment,
>> as requiring guests to trigger host fsync()s is not feasible. It's a
>> half-step solution when you can instead just ask the filesystem to
>> never move blocks, as Dave proposed many months back.
>>
>> We stepped back from that proposal when it looked like a significant
>> amount of per-filesystem work to introduce the capability and it was
>> not clear that application developers would tolerate the side effects
>> of this 'immutable' semantic. However, the implementation is dead
>> simple since ext4 and xfs already need to make
>> block-allocation-immutable semantics available for swapfiles. We also
>> have application developers telling us they are ok with the semantics,
>> especially because it catches Linux up to other operating environments
>> that are already on board with allowing this type of access to pmem
>> through a filesystem. This patchset gives pmem application developers
>> what they want without any additional burden on filesystem
>> implementations.
>
> I see.
>
> I have a couple of minor-ish issues with the current proposal, then.
> One is that I think the terminology should be changed to still make
> sense if filesystems or VFS improves to make this approach
> unnecessary.  Rather than saying "this file is now static", perhaps
> users should set a flag with the explicit semantics that "mmaps of
> this file are guaranteed not to lose data due to the kernel's
> activities", IOW that mmaps will be at least as durable as a direct
> mapping of DAX memory.  Then the kernel has the flexibility to add a
> future implementation in which, instead of pinning the file, the
> filesystem just knows to keep metadata synced before allowing
> page_mkwrite to re-enable writes to an mmapped DAX file.

Yes, sounds good to me. Rename the flag to DAXCTL_F_SYNC to indicate
updates via mmap to this file are synchronous as far as block
allocation metadata is concerned. Future filesystems are then free to
always support this synchronous mode without using the swapfile hack.

> My other objection is that the syscall intentionally leaks a reference
> to the file.  This means it needs overflow protection and it probably
> shouldn't ever be allowed to use it without privilege.

We only hold the one reference while S_DAXFILE is set, so I think the
protection is there, and per Dave's original proposal this requires
CAP_LINUX_IMMUTABLE.

> Why can't the underlying issue be easily fixed, though?  Could
> .page_mkwrite just make sure that metadata is synced when the FS uses
> DAX?

Yes, it most definitely could and that idea has been floated.

> On a DAX fs, syncing metadata should be extremely fast.  This
> could be conditioned on an madvise or mmap flag if performance might
> be an issue.  As far as I know, this change alone should be
> sufficient.

The hang up is that it requires per-fs enabling as it needs to be
careful to manage mmap_sem vs fs journal locks for example. I know the
in-development NOVA [1] filesystem is planning to support this out of
the gate. ext4 would be open to implementing it, but I think xfs is
cold on the idea. Christoph originally proposed it here [2], before
Dave went on to propose immutable semantics.

[1]: https://github.com/NVSL/NOVA
[2]: https://lists.01.org/pipermail/linux-nvdimm/2016-February/004609.html
Andy Lutomirski June 18, 2017, 5:05 a.m. UTC | #5
On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> My other objection is that the syscall intentionally leaks a reference
>> to the file.  This means it needs overflow protection and it probably
>> shouldn't ever be allowed to use it without privilege.
>
> We only hold the one reference while S_DAXFILE is set, so I think the
> protection is there, and per Dave's original proposal this requires
> CAP_LINUX_IMMUTABLE.
>
>> Why can't the underlying issue be easily fixed, though?  Could
>> .page_mkwrite just make sure that metadata is synced when the FS uses
>> DAX?
>
> Yes, it most definitely could and that idea has been floated.
>
>> On a DAX fs, syncing metadata should be extremely fast.  This
>> could be conditioned on an madvise or mmap flag if performance might
>> be an issue.  As far as I know, this change alone should be
>> sufficient.
>
> The hang up is that it requires per-fs enabling as it needs to be
> careful to manage mmap_sem vs fs journal locks for example. I know the
> in-development NOVA [1] filesystem is planning to support this out of
> the gate. ext4 would be open to implementing it, but I think xfs is
> cold on the idea. Christoph originally proposed it here [2], before
> Dave went on to propose immutable semantics.

Hmm.  Given a choice between a very clean API that works without
privilege but is awkward to implement on XFS and an awkward-to-use
API, I'd personally choose the former.

Dave, even with the lock ordering issue, couldn't XFS implement
MAP_PMEM_AWARE by having .page_mkwrite work roughly like this:

if (metadata is dirty) {
  up_write(&mmap_sem);
  sync the metadata;
  down_write(&mmap_sem);
  return 0;  /* retry the fault */
} else {
  return whatever success code;
}

This might require returning VM_FAULT_RETRY instead of 0 and it might
require auditing the core mm code to make sure that it can handle
mmap_sem being dropped like this.  I don't see why it couldn't work in
principle, though.
Christoph Hellwig June 18, 2017, 8:18 a.m. UTC | #6
On Sat, Jun 17, 2017 at 08:15:05PM -0700, Dan Williams wrote:
> The hang up is that it requires per-fs enabling as it needs to be
> careful to manage mmap_sem vs fs journal locks for example. I know the
> in-development NOVA [1] filesystem is planning to support this out of
> the gate. ext4 would be open to implementing it, but I think xfs is
> cold on the idea. Christoph originally proposed it here [2], before
> Dave went on to propose immutable semantics.
> 
> [1]: https://github.com/NVSL/NOVA
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2016-February/004609.html

And I stand to that statement.  Let's get DAX stable first, and
properly cleaned up (e.g. follow on work with separating it entirely
from the block device).  Then think hard about how most of the 
persistent memory technologies actually work, including the point that
for a lot of workloads page cache will be required at least on the
write side.   And then come up with actual real use cases and we can
look into it.

And stop trying to shoe-horn crap like this in.
Dan Williams June 19, 2017, 1:51 a.m. UTC | #7
On Sun, Jun 18, 2017 at 1:18 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Sat, Jun 17, 2017 at 08:15:05PM -0700, Dan Williams wrote:
>> The hang up is that it requires per-fs enabling as it needs to be
>> careful to manage mmap_sem vs fs journal locks for example. I know the
>> in-development NOVA [1] filesystem is planning to support this out of
>> the gate. ext4 would be open to implementing it, but I think xfs is
>> cold on the idea. Christoph originally proposed it here [2], before
>> Dave went on to propose immutable semantics.
>>
>> [1]: https://github.com/NVSL/NOVA
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2016-February/004609.html
>
> And I stand to that statement.  Let's get DAX stable first, and
> properly cleaned up (e.g. follow on work with separating it entirely
> from the block device).  Then think hard about how most of the
> persistent memory technologies actually work, including the point that
> for a lot of workloads page cache will be required at least on the
> write side.   And then come up with actual real use cases and we can
> look into it.

I see it differently. We're already at a good point in time to start
iterating on a fix for this issue. Ross and Jan have done a lot of
good work on the dax stability front, and the block-device separation
of dax is well underway.

> And stop trying to shoe-horn crap like this in.

The kernel shoe-horning all pmem+filesystem-dax applications into
abiding page-cache semantics is a problem, and this RFC has already
helped move the needle on a couple fronts. 1/ Swapfiles are subtly
broken which is something worth fixing, and if it gets us a
synchronous-dax mode without major filesystem surgery then that's all
for the better. 2/ There's an appetite for just fixing this
incrementally in each filesystem's fault handler, so if ext4 was able
to prove out an interface / implementation for synchronous faults we
could go with that instead of a pre-allocated + immutable interface
and let other filesystems set their own timelines.
Dave Chinner June 19, 2017, 1:21 p.m. UTC | #8
On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote:
> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >> My other objection is that the syscall intentionally leaks a reference
> >> to the file.  This means it needs overflow protection and it probably
> >> shouldn't ever be allowed to use it without privilege.
> >
> > We only hold the one reference while S_DAXFILE is set, so I think the
> > protection is there, and per Dave's original proposal this requires
> > CAP_LINUX_IMMUTABLE.
> >
> >> Why can't the underlying issue be easily fixed, though?  Could
> >> .page_mkwrite just make sure that metadata is synced when the FS uses
> >> DAX?
> >
> > Yes, it most definitely could and that idea has been floated.
> >
> >> On a DAX fs, syncing metadata should be extremely fast.

<sigh>

This again....

Persistent memory means the *I/O* is fast. It does not mean that
*complex filesystem operations* are fast.

Don't forget that there's an shitload of CPU that gets burnt to make
sure that the metadata is synced correctly. Do that /synchronously/
on *every* write page fault (which, BTW, modify mtime, so will
always have dirty metadata to sync) and now you have a serious
performance problem with your "fast" DAX access method.

And that's before we even consider all the problems with running
sync operations in page fault context....

> >> This
> >> could be conditioned on an madvise or mmap flag if performance might
> >> be an issue.  As far as I know, this change alone should be
> >> sufficient.
> >
> > The hang up is that it requires per-fs enabling as it needs to be
> > careful to manage mmap_sem vs fs journal locks for example. I know the
> > in-development NOVA [1] filesystem is planning to support this out of
> > the gate. ext4 would be open to implementing it, but I think xfs is
> > cold on the idea. Christoph originally proposed it here [2], before
> > Dave went on to propose immutable semantics.
> 
> Hmm.  Given a choice between a very clean API that works without
> privilege but is awkward to implement on XFS and an awkward-to-use
> API, I'd personally choose the former.

Yup, you have the choice of a clean kernel API that will be
substantially slower than the existing "dirty page" tracking and
having the app run fsync() when necessary, or having to do a little
more work in a library routine that preallocates a file and sets a
flag on it?

The apps will use the library API, not the kernel API, so who really
cares if there's a few steps to setting up the file state
appropriately?

> Dave, even with the lock ordering issue, couldn't XFS implement
> MAP_PMEM_AWARE by having .page_mkwrite work roughly like this:
> 
> if (metadata is dirty) {
>   up_write(&mmap_sem);
>   sync the metadata;
>   down_write(&mmap_sem);
>   return 0;  /* retry the fault */
> } else {
>   return whatever success code;
> }

How do you know that there is dependent filesystem metadata that
needs syncing at a level that you can safely manipulate the
mmap_sem? And how, exactly, do you do this without races? It'd be
trivial to DOS such retryable DAX faults simply by touching the file
in a tight loop in a separate process...

Cheers,

Dave.
Andy Lutomirski June 19, 2017, 3:22 p.m. UTC | #9
On Mon, Jun 19, 2017 at 6:21 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote:
>> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> >> My other objection is that the syscall intentionally leaks a reference
>> >> to the file.  This means it needs overflow protection and it probably
>> >> shouldn't ever be allowed to use it without privilege.
>> >
>> > We only hold the one reference while S_DAXFILE is set, so I think the
>> > protection is there, and per Dave's original proposal this requires
>> > CAP_LINUX_IMMUTABLE.
>> >
>> >> Why can't the underlying issue be easily fixed, though?  Could
>> >> .page_mkwrite just make sure that metadata is synced when the FS uses
>> >> DAX?
>> >
>> > Yes, it most definitely could and that idea has been floated.
>> >
>> >> On a DAX fs, syncing metadata should be extremely fast.
>
> <sigh>
>
> This again....
>
> Persistent memory means the *I/O* is fast. It does not mean that
> *complex filesystem operations* are fast.
>
> Don't forget that there's an shitload of CPU that gets burnt to make
> sure that the metadata is synced correctly. Do that /synchronously/
> on *every* write page fault (which, BTW, modify mtime, so will
> always have dirty metadata to sync) and now you have a serious
> performance problem with your "fast" DAX access method.

I think the mtime issue can and should be solved separately.  But it'
s a fair point that there would be workloads for which this could be
excessively expensive.  In particular, simply creating a file,
mmapping a large range, and touching the pages one by one -- delalloc
would be completely defeated.

But here's a strawman for solving both issues.  First, mtime.  I
consider it to be either a bug or a misfeature that .page_mkwrite
*ever* dirties an inode just to update mtime.  I have old patches to
fix this, and those patches could be updated and merged.  With them
applied, there's just a set_bit() in .page_mkwrite() to handle mtime.

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4

Second: syncing extents.  Here's a straw man.  Forget the mmap() flag.
Instead add a new msync() operation:

msync(start, length, MSYNC_PMEM_PREPARE_WRITE);

If this operation succeeds, it guarantees that all future writes
through this mapping on this range will hit actual storage and that
all the metadata operations needed to make this write persistent will
hit storage such that they are ordered before the user's writes.

As an implementation detail, this will flush out the extents if
needed.  In addition, if the FS has any mechanism that would cause
problems asyncronously later on (dedupe?  deallocated extents full of
zeros?  defrag?), it may also need to set a flag on the VMA that
changes the behavior of future .page_mkwrite operations.

(On x86, for example, this would permit the FS to do WC/streaming
writes without SFENCE if the FS were structured in a way that this
worked.)

Now we have an API that should work going forward without introducing
baggage.  And XFS is free to implement this API by making the entire
file act like a swap file if XFS wants to do so, but this doesn't
force other filesystems (ext4? NOVA?) to do the same thing.

>
> And that's before we even consider all the problems with running
> sync operations in page fault context....
>
>> >> This
>> >> could be conditioned on an madvise or mmap flag if performance might
>> >> be an issue.  As far as I know, this change alone should be
>> >> sufficient.
>> >
>> > The hang up is that it requires per-fs enabling as it needs to be
>> > careful to manage mmap_sem vs fs journal locks for example. I know the
>> > in-development NOVA [1] filesystem is planning to support this out of
>> > the gate. ext4 would be open to implementing it, but I think xfs is
>> > cold on the idea. Christoph originally proposed it here [2], before
>> > Dave went on to propose immutable semantics.
>>
>> Hmm.  Given a choice between a very clean API that works without
>> privilege but is awkward to implement on XFS and an awkward-to-use
>> API, I'd personally choose the former.
>
> Yup, you have the choice of a clean kernel API that will be
> substantially slower than the existing "dirty page" tracking and
> having the app run fsync() when necessary, or having to do a little
> more work in a library routine that preallocates a file and sets a
> flag on it?
>
> The apps will use the library API, not the kernel API, so who really
> cares if there's a few steps to setting up the file state
> appropriately?
>
>> Dave, even with the lock ordering issue, couldn't XFS implement
>> MAP_PMEM_AWARE by having .page_mkwrite work roughly like this:
>>
>> if (metadata is dirty) {
>>   up_write(&mmap_sem);
>>   sync the metadata;
>>   down_write(&mmap_sem);
>>   return 0;  /* retry the fault */
>> } else {
>>   return whatever success code;
>> }
>
> How do you know that there is dependent filesystem metadata that
> needs syncing at a level that you can safely manipulate the
> mmap_sem? And how, exactly, do you do this without races?

I have no idea, but I expect that all the locking issues are solvable.

> It'd be
> trivial to DOS such retryable DAX faults simply by touching the file
> in a tight loop in a separate process...

If the code were smart enough to only cause a retry when the extent
being touched is dirty, this problem wouldn't exist.
Dave Chinner June 20, 2017, 12:46 a.m. UTC | #10
On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote:
> On Mon, Jun 19, 2017 at 6:21 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote:
> >> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >> >> My other objection is that the syscall intentionally leaks a reference
> >> >> to the file.  This means it needs overflow protection and it probably
> >> >> shouldn't ever be allowed to use it without privilege.
> >> >
> >> > We only hold the one reference while S_DAXFILE is set, so I think the
> >> > protection is there, and per Dave's original proposal this requires
> >> > CAP_LINUX_IMMUTABLE.
> >> >
> >> >> Why can't the underlying issue be easily fixed, though?  Could
> >> >> .page_mkwrite just make sure that metadata is synced when the FS uses
> >> >> DAX?
> >> >
> >> > Yes, it most definitely could and that idea has been floated.
> >> >
> >> >> On a DAX fs, syncing metadata should be extremely fast.
> >
> > <sigh>
> >
> > This again....
> >
> > Persistent memory means the *I/O* is fast. It does not mean that
> > *complex filesystem operations* are fast.
> >
> > Don't forget that there's an shitload of CPU that gets burnt to make
> > sure that the metadata is synced correctly. Do that /synchronously/
> > on *every* write page fault (which, BTW, modify mtime, so will
> > always have dirty metadata to sync) and now you have a serious
> > performance problem with your "fast" DAX access method.
> 
> I think the mtime issue can and should be solved separately.  But it'
> s a fair point that there would be workloads for which this could be
> excessively expensive.  In particular, simply creating a file,
> mmapping a large range, and touching the pages one by one -- delalloc
> would be completely defeated.
> 
> But here's a strawman for solving both issues.  First, mtime.  I
> consider it to be either a bug or a misfeature that .page_mkwrite
> *ever* dirties an inode just to update mtime.  I have old patches to
> fix this, and those patches could be updated and merged.  With them
> applied, there's just a set_bit() in .page_mkwrite() to handle mtime.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4

Yup, I remember that - it delays the update to data writeback time,
IOWs the proposed MAP_SYNC page fault semantics result in the same
(poor) behaviour because the sync operation will trigger mtime
updates instead of the page fault.

Unless, of course, you are implying that MAP_SYNC should not
actually sync all known dirty metadata on an inode.

<smacks head on desk>

> Second: syncing extents.  Here's a straw man.  Forget the mmap() flag.
> Instead add a new msync() operation:
> 
> msync(start, length, MSYNC_PMEM_PREPARE_WRITE);

How's this any different from the fallocate command I proposed to do
this (apart from the fact that msync() is not intended to be abused
as a file preallocation/zeroing interface)?

> If this operation succeeds, it guarantees that all future writes
> through this mapping on this range will hit actual storage and that
> all the metadata operations needed to make this write persistent will
> hit storage such that they are ordered before the user's writes.
> As an implementation detail, this will flush out the extents if
> needed.  In addition, if the FS has any mechanism that would cause
> problems asyncronously later on (dedupe?  deallocated extents full
> of zeros?  defrag?),

Hole punch, truncate, reflink, dedupe, snapshots, scrubbing and
other background filesystem maintenance operations, etc can all
change the extent layout asynchronously if there's no mechanism to
tell the fs not to modify the inode extent layout.

> it may also need to set a flag on the VMA
> that changes the behavior of future .page_mkwrite operations.
> 
> (On x86, for example, this would permit the FS to do WC/streaming
> writes without SFENCE if the FS were structured in a way that this
> worked.)
> 
> Now we have an API that should work going forward without
> introducing baggage.  And XFS is free to implement this API by
> making the entire file act like a swap file if XFS wants to do so,
> but this doesn't force other filesystems (ext4? NOVA?) to do the
> same thing.

Sure, you are providing a simple programmatic API, but this does not
provide a viable feature management strategy.

i.e. the API you are now proposing requires the filesystem to ensure
an inode's extent map cannot be modified ever again in the future
(that "guarantees all future writes" bit).  This requires, at
minimum, a persistent flag to be set on the inode so the VFS and
filesystem implementations can use it to prevent anything that, for
example, relies on copy-on-write semantics being done on those
files. That means the proposed msync operation will need to check
the filesystem can support this feature and *fail* if it can't.

Further, administrators need to be aware of this application
requirement so they can plan their backup and disaster recovery
operations appropriately (e.g. reflink and/or snapshots cannot be
used as part of thei backup strategy). Hence the point of such
restricted file manipulation functionality requiring permissions to
be granted - it ensures sysadmins know they've got something less
than obvious going on they may need special processes to handle
safely.

Unsurprisingly, this is exactly what the "DAX immutable" inode flag
I proposed provides.  It provides an explicit, standardised and
*documented* management strategy that is common across all
filesystems. It uses mechanisms that *already exist*, the VFS and
filesystems already implement, and adminstrators are familiar with
using to manage their systems (e.g. setting the "NODUMP" inode flag
to exclude files from backups). This also avoids the management
level fragmentation which would occur if filesystems each solve the
"DAX userspace data sync" problem differently via different
management tools, behaviours and semantics.

Keep in mind that there are more uses for immutable extent maps than
just DAX. e.g. every so often someone pops up and says "I have this
high speed data aquisition hardware and we'd like to DMA data direct
to the storage because it's far too slow pushing it through memory
and then the OS to get it to storage. How do I map the storage
blocks and guarantee the mapping won't change while we are
transferring data direct from the hardware?". A file with an
allocated, immutable extent map solves this problem for these sorts
of esoteric applications.

As such, can we please drop all the mmap/msync special API snowflake
proposals and instead address the problem how to set up and manage
files with immutable extent maps efficiently through fallocate?
Once we have them, pmem aware applications don't need to do anything
special with mmap to be able to use userspace data sync instructions
safely.  i.e. the pmem library should select the correct data sync
method according to how the file was set up and what functionality
the underlying filesystem DAX implementation supports.

> >> Dave, even with the lock ordering issue, couldn't XFS implement
> >> MAP_PMEM_AWARE by having .page_mkwrite work roughly like this:
> >>
> >> if (metadata is dirty) { up_write(&mmap_sem); sync the
> >> metadata; down_write(&mmap_sem); return 0;  /* retry the fault
> >> */ } else { return whatever success code; }
> >
> > How do you know that there is dependent filesystem metadata that
> > needs syncing at a level that you can safely manipulate the
> > mmap_sem? And how, exactly, do you do this without races?
> 
> I have no idea, but I expect that all the locking issues are
> solvable.

Yay, Dunning-Kruger to the rescue!

Cheers,

Dave.
Darrick J. Wong June 20, 2017, 5:22 a.m. UTC | #11
[add linux-xfs to the fray]

On Fri, Jun 16, 2017 at 06:15:35PM -0700, Dan Williams wrote:
> To date, the full promise of byte-addressable access to persistent
> memory has only been half realized via the filesystem-dax interface. The
> current filesystem-dax mechanism allows an application to consume (read)
> data from persistent storage at byte-size granularity, bypassing the
> full page reads required by traditional storage devices.
> 
> Now, for writes, applications still need to contend with
> page-granularity dirtying and flushing semantics as well as filesystem
> coordination for metadata updates after any mmap write. The current
> situation precludes use cases that leverage byte-granularity / in-place
> updates to persistent media.
> 
> To get around this limitation there are some specialized applications
> that are using the device-dax interface to bypass the overhead and
> data-safety problems of the current filesystem-dax mmap-write path.
> QEMU-KVM is forced to use device-dax to safely pass through persistent
> memory to a guest [1]. Some specialized databases are using device-dax
> for byte-granularity writes. Outside of those cases, device-dax is
> difficult for general purpose persistent memory applications to consume.
> There is demand for access to pmem without needing to contend with
> special device configuration and other device-dax limitations.
> 
> The 'daxfile' interface satisfies this demand and realizes one of Dave
> Chinner's ideas for allowing pmem applications to safely bypass
> fsync/msync requirements. The idea is to make the file immutable with
> respect to the offset-to-block mappings for every extent in the file
> [2]. It turns out that filesystems already need to make this guarantee
> today. This property is needed for files marked as swap files.
> 
> The new daxctl() syscall manages setting a file into 'static-dax' mode
> whereby it arranges for the file to be treated as a swapfile as far as
> the filesystem is concerned, but not registered with the core-mm as
> swapfile space. A file in this mode is then safe to be mapped and
> written without the requirement to fsync/msync the writes.  The cpu
> cache management for flushing data to persistence can be handled
> completely in userspace.
> 
> [1]: https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg01207.html
> [2]: https://lkml.org/lkml/2016/9/11/159
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/x86/entry/syscalls/syscall_64.tbl |    1 
>  include/linux/dax.h                    |    9 ++
>  include/linux/fs.h                     |    3 +
>  include/linux/syscalls.h               |    1 
>  include/uapi/linux/dax.h               |    8 +
>  mm/Kconfig                             |    5 +
>  mm/Makefile                            |    1 
>  mm/daxfile.c                           |  186 ++++++++++++++++++++++++++++++++
>  mm/page_io.c                           |   31 +++++
>  9 files changed, 245 insertions(+)
>  create mode 100644 include/uapi/linux/dax.h
>  create mode 100644 mm/daxfile.c
> 
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5aef183e2f85..795eb93d6beb 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -339,6 +339,7 @@
>  330	common	pkey_alloc		sys_pkey_alloc
>  331	common	pkey_free		sys_pkey_free
>  332	common	statx			sys_statx
> +333	64	daxctl			sys_daxctl
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 5ec1f6c47716..5f1d0e0ed30f 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -4,8 +4,17 @@
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/radix-tree.h>
> +#include <uapi/linux/dax.h>
>  #include <asm/pgtable.h>
>  
> +/*
> + * TODO: make sys_daxctl() be the generic interface for toggling S_DAX
> + * across filesystems. For now, mark DAXCTL_F_DAX as an invalid flag
> + */
> +#define DAXCTL_VALID_FLAGS (DAXCTL_F_GET | DAXCTL_F_STATIC)
> +
> +int daxfile_activate(struct file *daxfile, unsigned align);
> +
>  struct iomap_ops;
>  struct dax_device;
>  struct dax_operations {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3e68cabb8457..3af649fb669f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1824,8 +1824,10 @@ struct super_operations {
>  #define S_NOSEC		4096	/* no suid or xattr security attributes */
>  #ifdef CONFIG_FS_DAX
>  #define S_DAX		8192	/* Direct Access, avoiding the page cache */
> +#define S_DAXFILE	16384	/* no truncate (swapfile) semantics + dax */
>  #else
>  #define S_DAX		0	/* Make all the DAX code disappear */
> +#define S_DAXFILE	0
>  #endif
>  
>  /*
> @@ -1865,6 +1867,7 @@ struct super_operations {
>  #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
>  #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
>  #define IS_DAX(inode)		((inode)->i_flags & S_DAX)
> +#define IS_DAXFILE(inode)	((inode)->i_flags & S_DAXFILE)
>  
>  #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
>  				 (inode)->i_rdev == WHITEOUT_DEV)
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 980c3c9b06f8..49e5cc4c192e 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -701,6 +701,7 @@ asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
>  			unsigned long arg4, unsigned long arg5);
>  asmlinkage long sys_swapon(const char __user *specialfile, int swap_flags);
>  asmlinkage long sys_swapoff(const char __user *specialfile);
> +asmlinkage long sys_daxctl(const char __user *path, int flags, int align);
>  asmlinkage long sys_sysctl(struct __sysctl_args __user *args);
>  asmlinkage long sys_sysinfo(struct sysinfo __user *info);
>  asmlinkage long sys_sysfs(int option,
> diff --git a/include/uapi/linux/dax.h b/include/uapi/linux/dax.h
> new file mode 100644
> index 000000000000..78a41bb392c0
> --- /dev/null
> +++ b/include/uapi/linux/dax.h
> @@ -0,0 +1,8 @@
> +#ifndef _UAPI_LINUX_DAX_H
> +#define _UAPI_LINUX_DAX_H
> +
> +#define DAXCTL_F_GET    (1 << 0)
> +#define DAXCTL_F_DAX    (1 << 1)
> +#define DAXCTL_F_STATIC (1 << 2)
> +
> +#endif /* _UAPI_LINUX_DAX_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index beb7a455915d..b874565c34eb 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -450,6 +450,11 @@ config	TRANSPARENT_HUGE_PAGECACHE
>  	def_bool y
>  	depends on TRANSPARENT_HUGEPAGE
>  
> +config DAXFILE
> +	def_bool y
> +	depends on FS_DAX
> +	depends on SWAP
> +
>  #
>  # UP and nommu archs use km based percpu allocator
>  #
> diff --git a/mm/Makefile b/mm/Makefile
> index 026f6a828a50..38d9025a3e37 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -56,6 +56,7 @@ endif
>  obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
>  
>  obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
> +obj-$(CONFIG_DAXFILE)	+= daxfile.o
>  obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
>  obj-$(CONFIG_ZSWAP)	+= zswap.o
>  obj-$(CONFIG_HAS_DMA)	+= dmapool.o
> diff --git a/mm/daxfile.c b/mm/daxfile.c
> new file mode 100644
> index 000000000000..fe230199c855
> --- /dev/null
> +++ b/mm/daxfile.c
> @@ -0,0 +1,186 @@
> +/*
> + * Copyright(c) 2017 Intel Corporation. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of version 2 of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/dax.h>
> +#include <linux/slab.h>
> +#include <linux/highmem.h>
> +#include <linux/pagemap.h>
> +#include <linux/syscalls.h>
> +
> +/*
> + * TODO: a list to lookup daxfiles assumes a low number of instances,
> + * revisit.
> + */
> +static LIST_HEAD(daxfiles);
> +static DEFINE_SPINLOCK(dax_lock);
> +
> +struct dax_info {
> +	struct list_head list;
> +	struct file *daxfile;
> +};
> +
> +static int daxfile_disable(struct file *victim)
> +{
> +	int found = 0;
> +	struct dax_info *d;
> +	struct inode *inode;
> +	struct file *daxfile;
> +	struct address_space *mapping;
> +
> +	mapping = victim->f_mapping;
> +	spin_lock(&dax_lock);
> +	list_for_each_entry(d, &daxfiles, list)
> +		if (d->daxfile->f_mapping == mapping) {
> +			list_del(&d->list);
> +			found = 1;
> +			break;
> +		}
> +	spin_unlock(&dax_lock);
> +
> +	if (!found)
> +		return -EINVAL;
> +
> +	daxfile = d->daxfile;
> +
> +	inode = mapping->host;
> +	inode->i_flags &= ~(S_SWAPFILE | S_DAXFILE);
> +	filp_close(daxfile, NULL);
> +
> +	return 0;
> +}
> +
> +static int claim_daxfile_checks(struct inode *inode)
> +{
> +	if (!S_ISREG(inode->i_mode))
> +		return -EINVAL;
> +
> +	if (!IS_DAX(inode))
> +		return -EINVAL;
> +
> +	if (IS_SWAPFILE(inode) || IS_DAXFILE(inode))
> +		return -EBUSY;
> +
> +	return 0;
> +}
> +
> +int daxfile_enable(struct file *daxfile, int align)
> +{
> +	struct address_space *mapping;
> +	struct inode *inode;
> +	struct dax_info *d;
> +	int rc;
> +
> +	if (align < 0)
> +		return -EINVAL;
> +
> +	mapping = daxfile->f_mapping;
> +	inode = mapping->host;
> +
> +	rc = claim_daxfile_checks(inode);
> +	if (rc)
> +		return rc;
> +
> +	rc = daxfile_activate(daxfile, align);
> +	if (rc)
> +		return rc;
> +
> +	d = kzalloc(sizeof(*d), GFP_KERNEL);
> +	if (!d)
> +		return -ENOMEM;
> +	INIT_LIST_HEAD(&d->list);
> +	d->daxfile = daxfile;
> +
> +	spin_lock(&dax_lock);
> +	list_add(&d->list, &daxfiles);
> +	spin_unlock(&dax_lock);
> +
> +	/*
> +	 * We set S_SWAPFILE to gain "no truncate" / static block
> +	 * allocation semantics, and S_DAXFILE so we can differentiate
> +	 * traditional swapfiles and assume static block mappings in the
> +	 * dax mmap path.
> +	 */
> +	inode->i_flags |= S_SWAPFILE | S_DAXFILE;

Yikes.  You know, I hadn't even thought about considering swap files as
a subcase of files with immutable block maps, but here we are.  Both
swap files and DAX require absolutely stable block mappings, they are
both (probably) intolerant of inode metadata changes (size, mtime, etc.)

But on the other hand, the bmap interface is so... yuck.  We return zero
to indicate no mapping or error or whatever, it doesn't actually tell us
/which/ device it's returning offsets into, etc.  I was writing a
regression test earlier to check that we've sealed off XFS RT files from
becoming swap files (because bmap is broken, not (afaik) because of any
weird limitation of xfs) and forgot that quirk long enough to waste time
wondering why it failed to fail on a 4.11 kernel.  That's right, the
first rt file gets block zero and magically doesn't fail to fail if
that's the swap file.

Gross.  I've now ranted twice this month about how bmap() doesn't work
on reflinked files on XFS.

Honestly, I realize we've gone back, forth, and around all over the
place on this.  I still prefer something similar to a permanent flag,
similar to what Dave suggested, though I hate the name PMEM_IMMUTABLE
and some of the semantics.

First, a new inode flag S_IOMAP_FROZEN that means the file's block map
cannot change.

Second, some kind of function to toggle the S_IOMAP_FROZEN flag.
Turning it on will lock the inode, check the extent map for holes,
shared, or unwritten bits, and bail out if it finds any, or set the
flag.  Not sure if we should require CAP_LINUX_IMMUTABLE -- probably
yes, at least at first.  I don't currently have any objection to writing
non-iomap inode metadata out to disk.

Third, the flag can only be cleared if the file isn't mapped.

Fourth, the VFS entry points for things like read, write, truncate,
utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a
file, so that the block map cannot be modified.  mmap is still allowed,
as we've discussed.  /Maybe/ we can allow fallocate to extend a file
with zeroed extents (it will be slow) as I've heard murmurs about
wanting to be able to extend a file, maybe not.

Fifth, swapfiles now require the S_IOMAP_FROZEN flag since they want
stable iomap but probably don't care about things like mtime.  Maybe
they can call iomap too.

Sixth, XFS can record the S_IOMAP_FROZEN state in di_flags2 and set it
whenever the in-core inode gets constructed.  This enables us to
prohibit reflinking and other such undesirable activity.

If we actually /do/ come up with a reference implementation for XFS, I'd
be ok with tacking it on the end of my dev branch, which will give us a
loooong runway to try it out.  The end of the dev branch is beyond
online XFS fsck and repair and the "root metadata btrees in inodes"
rework; since that's ~90 patches with my name on it that I cannot also
review, it won't go in for a long time indeed!

(Yes, that was also sort of a plea for someone to go review the XFS
scrub patches.)

> +	return 0;
> +}
> +
> +SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, align)

I was /about/ to grouse about this syscall, then realized that maybe it
/is/ useful to be able to check a specific alignment.  Maybe not, since
I had something more permanent in mind anyway.  In any case, just pass
in an opened fd if this sticks around.

--D

> +{
> +	int rc;
> +	struct filename *name;
> +	struct inode *inode = NULL;
> +	struct file *daxfile = NULL;
> +	struct address_space *mapping;
> +
> +	if (flags & ~DAXCTL_VALID_FLAGS)
> +		return -EINVAL;
> +
> +	name = getname(path);
> +	if (IS_ERR(name))
> +		return PTR_ERR(name);
> +
> +	daxfile = file_open_name(name, O_RDWR|O_LARGEFILE, 0);
> +	if (IS_ERR(daxfile)) {
> +		rc = PTR_ERR(daxfile);
> +		daxfile = NULL;
> +		goto out;
> +	}
> +
> +	mapping = daxfile->f_mapping;
> +	inode = mapping->host;
> +	if (flags & DAXCTL_F_GET) {
> +		/*
> +		 * We only report the state of DAXCTL_F_STATIC since
> +		 * there is no actions for applications to take based on
> +		 * the setting of S_DAX. However, if this interface is
> +		 * used for toggling S_DAX presumably userspace would
> +		 * want to know the state of the flag.
> +		 *
> +		 * TODO: revisit whether we want to report DAXCTL_F_DAX
> +		 * in the IS_DAX() case.
> +		 */
> +		if (IS_DAXFILE(inode))
> +			rc = DAXCTL_F_STATIC;
> +		else
> +			rc = 0;
> +
> +		goto out;
> +	}
> +
> +	/*
> +	 * TODO: Should unprivileged users be allowed to control daxfile
> +	 * behavior? Perhaps a mount flag... is -o dax that flag?
> +	 */
> +	if (!capable(CAP_LINUX_IMMUTABLE)) {
> +		rc = -EPERM;
> +		goto out;
> +	}
> +
> +	inode_lock(inode);
> +	if (!IS_DAXFILE(inode) && (flags & DAXCTL_F_STATIC)) {
> +		rc = daxfile_enable(daxfile, align);
> +		/* if successfully enabled hold daxfile open */
> +		if (rc == 0)
> +			daxfile = NULL;
> +	} else if (IS_DAXFILE(inode) && !(flags & DAXCTL_F_STATIC))
> +		rc = daxfile_disable(daxfile);
> +	else
> +		rc = 0;
> +	inode_unlock(inode);
> +
> +out:
> +	if (daxfile)
> +		filp_close(daxfile, NULL);
> +	if (name)
> +		putname(name);
> +	return rc;
> +}
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 5cec9a3d49f2..35160ad9c51f 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -244,6 +244,37 @@ static int bmap_walk(struct file *file, const unsigned page_size,
>  	goto out;
>  }
>  
> +static int daxfile_check(sector_t block, unsigned long page_no,
> +		enum bmap_check type, void *none)
> +{
> +	if (type == BMAP_WALK_DONE)
> +		return 0;
> +
> +	/*
> +	 * Unlike the swapfile case, fail daxfile_activate() if any file
> +	 * extent is not page aligned.
> +	 */
> +	if (type != BMAP_WALK_FULLPAGE)
> +		return -EINVAL;
> +	return 0;
> +}
> +
> +int daxfile_activate(struct file *daxfile, unsigned align)
> +{
> +	int rc;
> +
> +	if (!align)
> +		align = PAGE_SIZE;
> +
> +	if (align < PAGE_SIZE || !is_power_of_2(align))
> +		return -EINVAL;
> +
> +	rc = bmap_walk(daxfile, align, ULONG_MAX, NULL, daxfile_check, NULL);
> +	if (rc)
> +		pr_debug("daxctl: daxfile has holes\n");
> +	return rc;
> +}
> +
>  static int swapfile_check(sector_t block, unsigned long page_no,
>  		enum bmap_check type, void *_sis)
>  {
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Lutomirski June 20, 2017, 5:53 a.m. UTC | #12
On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote:
>> On Mon, Jun 19, 2017 at 6:21 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote:
>> >> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> >> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> >> >> My other objection is that the syscall intentionally leaks a reference
>> >> >> to the file.  This means it needs overflow protection and it probably
>> >> >> shouldn't ever be allowed to use it without privilege.
>> >> >
>> >> > We only hold the one reference while S_DAXFILE is set, so I think the
>> >> > protection is there, and per Dave's original proposal this requires
>> >> > CAP_LINUX_IMMUTABLE.
>> >> >
>> >> >> Why can't the underlying issue be easily fixed, though?  Could
>> >> >> .page_mkwrite just make sure that metadata is synced when the FS uses
>> >> >> DAX?
>> >> >
>> >> > Yes, it most definitely could and that idea has been floated.
>> >> >
>> >> >> On a DAX fs, syncing metadata should be extremely fast.
>> >
>> > <sigh>
>> >
>> > This again....
>> >
>> > Persistent memory means the *I/O* is fast. It does not mean that
>> > *complex filesystem operations* are fast.
>> >
>> > Don't forget that there's an shitload of CPU that gets burnt to make
>> > sure that the metadata is synced correctly. Do that /synchronously/
>> > on *every* write page fault (which, BTW, modify mtime, so will
>> > always have dirty metadata to sync) and now you have a serious
>> > performance problem with your "fast" DAX access method.
>>
>> I think the mtime issue can and should be solved separately.  But it'
>> s a fair point that there would be workloads for which this could be
>> excessively expensive.  In particular, simply creating a file,
>> mmapping a large range, and touching the pages one by one -- delalloc
>> would be completely defeated.
>>
>> But here's a strawman for solving both issues.  First, mtime.  I
>> consider it to be either a bug or a misfeature that .page_mkwrite
>> *ever* dirties an inode just to update mtime.  I have old patches to
>> fix this, and those patches could be updated and merged.  With them
>> applied, there's just a set_bit() in .page_mkwrite() to handle mtime.
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4
>
> Yup, I remember that - it delays the update to data writeback time,
> IOWs the proposed MAP_SYNC page fault semantics result in the same
> (poor) behaviour because the sync operation will trigger mtime
> updates instead of the page fault.
>
> Unless, of course, you are implying that MAP_SYNC should not
> actually sync all known dirty metadata on an inode.
>
> <smacks head on desk>
>
>> Second: syncing extents.  Here's a straw man.  Forget the mmap() flag.
>> Instead add a new msync() operation:
>>
>> msync(start, length, MSYNC_PMEM_PREPARE_WRITE);
>
> How's this any different from the fallocate command I proposed to do
> this (apart from the fact that msync() is not intended to be abused
> as a file preallocation/zeroing interface)?

I must have missed that suggestion.

But it's different in a major way.  fallocate() takes an fd parameter,
which means that, if some flag gets set, it's set on the struct file.
The persistence property seems to me like it belongs on the vma, not
the file.  But it doesn't have to be msync() -- it could be madvise or
even a new mallocate().  (Although mallocate() is possible the worst
name ever.)

>
>> If this operation succeeds, it guarantees that all future writes
>> through this mapping on this range will hit actual storage and that
>> all the metadata operations needed to make this write persistent will
>> hit storage such that they are ordered before the user's writes.
>> As an implementation detail, this will flush out the extents if
>> needed.  In addition, if the FS has any mechanism that would cause
>> problems asyncronously later on (dedupe?  deallocated extents full
>> of zeros?  defrag?),
>
> Hole punch, truncate, reflink, dedupe, snapshots, scrubbing and
> other background filesystem maintenance operations, etc can all
> change the extent layout asynchronously if there's no mechanism to
> tell the fs not to modify the inode extent layout.

But that's my whole point.  The kernel doesn't really need to prevent
all these background maintenance operations -- it just needs to block
.page_mkwrite until they are synced.  I think that whatever new
mechanism we add for this should be sticky, but I see no reason why
the filesystem should have to block reflink on a DAX file entirely.

In fact, the daxctl() proposal seems actively problematic for some
usecases.  I think I should be able to mmap() a DAX file and then,
while it's still mapped, extend the file, mmap the new part (with the
appropriate flag, madvise(), msync(), fallocate(), whatever), and
write directly through that mapping and through the original mapping,
concurrently, with the full persistence guarantee.  This seems really
awkward to do using daxctl().

>
>> it may also need to set a flag on the VMA
>> that changes the behavior of future .page_mkwrite operations.
>>
>> (On x86, for example, this would permit the FS to do WC/streaming
>> writes without SFENCE if the FS were structured in a way that this
>> worked.)
>>
>> Now we have an API that should work going forward without
>> introducing baggage.  And XFS is free to implement this API by
>> making the entire file act like a swap file if XFS wants to do so,
>> but this doesn't force other filesystems (ext4? NOVA?) to do the
>> same thing.
>
> Sure, you are providing a simple programmatic API, but this does not
> provide a viable feature management strategy.
>
> i.e. the API you are now proposing requires the filesystem to ensure
> an inode's extent map cannot be modified ever again in the future
> (that "guarantees all future writes" bit).  This requires, at
> minimum, a persistent flag to be set on the inode so the VFS and
> filesystem implementations can use it to prevent anything that, for
> example, relies on copy-on-write semantics being done on those
> files. That means the proposed msync operation will need to check
> the filesystem can support this feature and *fail* if it can't.

No it doesn't.  A filesystem *could* implement it like that, but it
could also implement it using .page_mkwrite.  And yes, a filesystem
that can't can't guarantee durability with CLFLUSHOPT; SFENCE on a
mapping should fail this operation to indicate that it can't support
it.

>
> Further, administrators need to be aware of this application
> requirement so they can plan their backup and disaster recovery
> operations appropriately (e.g. reflink and/or snapshots cannot be
> used as part of thei backup strategy).

Or they could use a filesystem that will understand that the new
operation needs to break COW.

>
> Unsurprisingly, this is exactly what the "DAX immutable" inode flag
> I proposed provides.  It provides an explicit, standardised and
> *documented* management strategy that is common across all
> filesystems. It uses mechanisms that *already exist*, the VFS and
> filesystems already implement, and adminstrators are familiar with
> using to manage their systems (e.g. setting the "NODUMP" inode flag
> to exclude files from backups). This also avoids the management
> level fragmentation which would occur if filesystems each solve the
> "DAX userspace data sync" problem differently via different
> management tools, behaviours and semantics.

The DAX immutable flag is really nasty for my software that would like
to use DAX.  I have quite a few processes, all unprivileged, that
create files that they'd like to map using DAX and write to durably
without needing to fsync() (using CLFLUSHOPT; SFENCE or perhaps a WT
mapping).  If I were to use daxctl(), I'd have to have to write a
privileged daemon to manage it, and that would be rather nasty.

If, instead, we had a nice unprivileged per-vma or per-fd mechanism to
tell the filesystem that I want DAX durability, I could just use it
without any fuss.  If it worked on ext4 before it worked on xfs, then
I'd use ext4.  If it ended up being heavier weight on XFS than it was
on ext4 because XFS needed to lock down the extent map for the inode
whereas ext4 could manage it through .page_mkwrite(), then I'd
benchmark it and see which fs would win.  (For my particular use case,
I doubt it would matter, since I aggressively offload fs metadata
operations to a thread whose performance I don't really care about.)


--Andy
Christoph Hellwig June 20, 2017, 8:49 a.m. UTC | #13
[stripped giant fullquotes]

On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
> But that's my whole point.  The kernel doesn't really need to prevent
> all these background maintenance operations -- it just needs to block
> .page_mkwrite until they are synced.  I think that whatever new
> mechanism we add for this should be sticky, but I see no reason why
> the filesystem should have to block reflink on a DAX file entirely.

Agreed - IFF we want to support write through semantics this is the
only somewhat feasible way.  It still has massive downsides of forcing
the full sync machinery to run from the page fauly handler, which
I'm rather scared off, but that's still better than creating a magic
special case that isn't managable at all.

> If, instead, we had a nice unprivileged per-vma or per-fd mechanism to
> tell the filesystem that I want DAX durability, I could just use it
> without any fuss.  If it worked on ext4 before it worked on xfs, then
> I'd use ext4.  If it ended up being heavier weight on XFS than it was
> on ext4 because XFS needed to lock down the extent map for the inode
> whereas ext4 could manage it through .page_mkwrite(), then I'd
> benchmark it and see which fs would win.  (For my particular use case,
> I doubt it would matter, since I aggressively offload fs metadata
> operations to a thread whose performance I don't really care about.)

ext4 and XFS have the same fundamental issue:  both have a file system
wide log of modified data that needs to be flushed to stable storage
to ensure everything is safe.  So if you solve the issue for one of
them you've solved it for the other one as well modulo implementation
details.
Dave Chinner June 20, 2017, 10:11 a.m. UTC | #14
On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
> On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote:
> >> Second: syncing extents.  Here's a straw man.  Forget the mmap() flag.
> >> Instead add a new msync() operation:
> >>
> >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE);
> >
> > How's this any different from the fallocate command I proposed to do
> > this (apart from the fact that msync() is not intended to be abused
> > as a file preallocation/zeroing interface)?
> 
> I must have missed that suggestion.
> 
> But it's different in a major way.  fallocate() takes an fd parameter,
> which means that, if some flag gets set, it's set on the struct file.

DAX is a property of the inode, not the VMA or struct file as it
needs to be consistent across all VMAs and struct files that
reference that inode. Also, fallocate() manipulates state and
metadata hidden behind the struct inode, not the struct file, so it
seems to me like the right API to use.

And, as mmap() requires a fd to set up the mapping and fallocate()
would have to be run *before* mmap() is used to access the data
directly, I don't see why using fallocate would be a problem here...

> >> If this operation succeeds, it guarantees that all future writes
> >> through this mapping on this range will hit actual storage and that
> >> all the metadata operations needed to make this write persistent will
> >> hit storage such that they are ordered before the user's writes.
> >> As an implementation detail, this will flush out the extents if
> >> needed.  In addition, if the FS has any mechanism that would cause
> >> problems asyncronously later on (dedupe?  deallocated extents full
> >> of zeros?  defrag?),
> >
> > Hole punch, truncate, reflink, dedupe, snapshots, scrubbing and
> > other background filesystem maintenance operations, etc can all
> > change the extent layout asynchronously if there's no mechanism to
> > tell the fs not to modify the inode extent layout.
> 
> But that's my whole point.  The kernel doesn't really need to prevent
> all these background maintenance operations -- it just needs to block
> .page_mkwrite until they are synced.  I think that whatever new
> mechanism we add for this should be sticky, but I see no reason why
> the filesystem should have to block reflink on a DAX file entirely.

I understand the problem quite well, thank you very much. Yes,
COW operations (and other things) can be handled by invalidating DAX
mappings and blocking new page faults.  I see little difference
between this and running the sync path after page-mkwrite has
triggered filesystem metadata changes (e.g.  block allocation). i.e.
If MAP_SYNC is going to be used, then all the things you are talking
about comes along for the ride via invalidations.

The MAP_SYNC proposal is effectively "run the metadata side of
fdatasync() on every page fault". If the inode is not metadata
dirty, then it will do nothing, otherwise it will do what it needs
to stabilise the inode for userspace to be able to sync the data and
it will block until it is done.

Prediction for the MAP_SYNC future: frequent bug reports about huge,
unpredictable page fault latencies on DAX files because every so
often a page fault is required to sync tens of thousands of
unrelated dirty objects because of filesystem journal ordering
constraints....

Cheers,

Dave.
Ross Zwisler June 20, 2017, 3:42 p.m. UTC | #15
On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote:
<>
> Fourth, the VFS entry points for things like read, write, truncate,
> utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a
> file, so that the block map cannot be modified.  mmap is still allowed,
> as we've discussed.  /Maybe/ we can allow fallocate to extend a file
> with zeroed extents (it will be slow) as I've heard murmurs about
> wanting to be able to extend a file, maybe not.

Read and write should still be allowed, right?
Andy Lutomirski June 20, 2017, 4:14 p.m. UTC | #16
On Tue, Jun 20, 2017 at 3:11 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
>> On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote:
>> >> Second: syncing extents.  Here's a straw man.  Forget the mmap() flag.
>> >> Instead add a new msync() operation:
>> >>
>> >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE);
>> >
>> > How's this any different from the fallocate command I proposed to do
>> > this (apart from the fact that msync() is not intended to be abused
>> > as a file preallocation/zeroing interface)?
>>
>> I must have missed that suggestion.
>>
>> But it's different in a major way.  fallocate() takes an fd parameter,
>> which means that, if some flag gets set, it's set on the struct file.
>
> DAX is a property of the inode, not the VMA or struct file as it
> needs to be consistent across all VMAs and struct files that
> reference that inode. Also, fallocate() manipulates state and
> metadata hidden behind the struct inode, not the struct file, so it
> seems to me like the right API to use.

I'm not sure I see why.  I can think of a few different scenarios:

 - Reflink while a DAX-using program is running.  It would be nice for
performance, but not critical for functionality, if trying to reflink
a file that's mapped for DAX would copy instead of COWing.  But
breaking COW on the next page_mkwrite would work, too.  A per-inode
count of the number of live DAX mappings or of the number of struct
file instances that have requested DAX would work here.

 - Trying to use DAX on a file that is already reflinked.  The order
of operations doesn't matter hugely, except that breaking COW for the
entire range in question all at once would be faster and result in
better allocation.

 - Checksumming filesystems.  I think it's basically impossible to do
DAX writes to a file like this.  A filesystem could skip checksumming
on extents that are actively mapped for DAX, but something like chattr
to tell, say, btrfs that a given file is intended for DAX is probably
better.  (But, if so, it should presumably be persistent like chattr
and not something that's purely in memory like daxctl().)

 - Defrag and such on an otherwise simple filesystem.  Defragging a
file while it's not actively mapped for DAX should be allowed.
Defragging a file while it is actively mapped for DAX may be slow, but
it a filesystem could certainly make it work.

 - RAID.  Not going to work.

>
> And, as mmap() requires a fd to set up the mapping and fallocate()
> would have to be run *before* mmap() is used to access the data
> directly, I don't see why using fallocate would be a problem here...

Fair enough.  But ISTM fallocate() has no business setting up
important state in the struct file.  It's an operation on inodes.

Looking at the above scenarios, it seems to may that two or three
separate mechanisms may be ideal here.

1. The most important: some way to tell the kernel that an open file
description or an mmap or some subrange thereof is going to be used
for DAX writes and for the kernel to respond by saying "yes, okay" or
"no, not okay".  This should be 100% reliable, which means that all
the corner cases have to work.  This means that, if one task says
"make this file work for DAX" and another task extends the file using
truncate (without calling fallocate), then whatever the kernel
promised to the first task should remain true.

2. Some way to tell filesystems like btrfs to make a file that will be
DAX-able.  chattr +C might already fit the bill.  Without this, I'd
expect the normal incantation to DAX-map a file on btrfs to return an
error.

3. (Not strictly related to DAX.) A way to tell the kernel "I have
this file mmapped for write.  Please go out of your way to avoid
future page faults."  I've wanted this for ordinary files on ext4.
The kernel could, but presently does not, use hardware dirty tracking
instead of software dirty tracking to decide when to write the page
back.  The kernel could also, in principle, write back dirty pages
without ever write protecting them.  For DAX, this might change
behavior to prevent any operation that would relocate blocks or to
have the kernel go out of its way to only do such operations when
absolutely necessary and to immediately update and unwriteprotect the
relevant pages.

(3) is optional and could be delayed to the distant future.  It's not
needed for correctness.

I really don't want to see a scenario where DAX works if you use some
fancy special-purpose library exactly as intended but occasionally
eats your data if you don't use the library exactly as intended.
Getting a valid DAX mapping, writing to it, and doing
CLFLUSHOPT;SFENCE must make that write durable no matter what (unless
the underlying hardware actually fails).

> The MAP_SYNC proposal is effectively "run the metadata side of
> fdatasync() on every page fault". If the inode is not metadata
> dirty, then it will do nothing, otherwise it will do what it needs
> to stabilise the inode for userspace to be able to sync the data and
> it will block until it is done.
>
> Prediction for the MAP_SYNC future: frequent bug reports about huge,
> unpredictable page fault latencies on DAX files because every so
> often a page fault is required to sync tens of thousands of
> unrelated dirty objects because of filesystem journal ordering
> constraints....

Is this really so bad?  Someone might ask for relaxed journal ordering
or they might switch to a different filesystem.  IIRC some filesystems
(ZFS?) have explicit support for this use case.  I suspect that users
saying "your filesystem is slower than I'd like for such-and-such use
case" isn't all that rare.
Dan Williams June 20, 2017, 4:17 p.m. UTC | #17
On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig <hch@lst.de> wrote:
> [stripped giant fullquotes]
>
> On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
>> But that's my whole point.  The kernel doesn't really need to prevent
>> all these background maintenance operations -- it just needs to block
>> .page_mkwrite until they are synced.  I think that whatever new
>> mechanism we add for this should be sticky, but I see no reason why
>> the filesystem should have to block reflink on a DAX file entirely.
>
> Agreed - IFF we want to support write through semantics this is the
> only somewhat feasible way.  It still has massive downsides of forcing
> the full sync machinery to run from the page fauly handler, which
> I'm rather scared off, but that's still better than creating a magic
> special case that isn't managable at all.

An immutable-extent DAX-file and a reflink-capable DAX-file are not
mutually exclusive, and I have yet to hear a need for reflink support
without fsync/msync. Instead I have heard the need for an immutable
file for RDMA purposes, especially for hardware that can't trigger an
mmu fault. The special management of an immutable file is acceptable
to get these capabilities.
Andy Lutomirski June 20, 2017, 4:26 p.m. UTC | #18
On Tue, Jun 20, 2017 at 9:17 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig <hch@lst.de> wrote:
>> [stripped giant fullquotes]
>>
>> On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
>>> But that's my whole point.  The kernel doesn't really need to prevent
>>> all these background maintenance operations -- it just needs to block
>>> .page_mkwrite until they are synced.  I think that whatever new
>>> mechanism we add for this should be sticky, but I see no reason why
>>> the filesystem should have to block reflink on a DAX file entirely.
>>
>> Agreed - IFF we want to support write through semantics this is the
>> only somewhat feasible way.  It still has massive downsides of forcing
>> the full sync machinery to run from the page fauly handler, which
>> I'm rather scared off, but that's still better than creating a magic
>> special case that isn't managable at all.
>
> An immutable-extent DAX-file and a reflink-capable DAX-file are not
> mutually exclusive, and I have yet to hear a need for reflink support
> without fsync/msync. Instead I have heard the need for an immutable
> file for RDMA purposes, especially for hardware that can't trigger an
> mmu fault. The special management of an immutable file is acceptable
> to get these capabilities.

I guess this applies to any user of get_user_pages() on a DAX-mapped file.  Hmm.
Dave Chinner June 20, 2017, 11:53 p.m. UTC | #19
On Tue, Jun 20, 2017 at 09:17:36AM -0700, Dan Williams wrote:
> On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig <hch@lst.de> wrote:
> > [stripped giant fullquotes]
> >
> > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
> >> But that's my whole point.  The kernel doesn't really need to prevent
> >> all these background maintenance operations -- it just needs to block
> >> .page_mkwrite until they are synced.  I think that whatever new
> >> mechanism we add for this should be sticky, but I see no reason why
> >> the filesystem should have to block reflink on a DAX file entirely.
> >
> > Agreed - IFF we want to support write through semantics this is the
> > only somewhat feasible way.  It still has massive downsides of forcing
> > the full sync machinery to run from the page fauly handler, which
> > I'm rather scared off, but that's still better than creating a magic
> > special case that isn't managable at all.
> 
> An immutable-extent DAX-file and a reflink-capable DAX-file are not
> mutually exclusive,

Actually, they are mutually exclusive: when the immutable extent DAX
inode is breaking the extent sharing done during the reflink
operation, the copy-on-write operation requires allocating and
freeing extents on the inode that has immutable extents. Which, if
the inode really has immutable extents, cannot be done.

That said, if the extent sharing is broken on the other side of the
reflink (i.e. the non-immutable inode created by the reflink) then
the extent map of the inode with immutable extents will remain
unchanged. i.e. there are two sides to this, and if you only see one
side you might come to the wrong conclusion.

However, we cannot guarantee that no writes occur to the inode with
immutable extent maps (especially as the whole point is to allow
userspace writes and commits without the kernel being involved), so
extent sharing on immutable extent maps cannot be allowed...

Cheers,

Dave.
Darrick J. Wong June 21, 2017, 1:24 a.m. UTC | #20
On Wed, Jun 21, 2017 at 09:53:46AM +1000, Dave Chinner wrote:
> On Tue, Jun 20, 2017 at 09:17:36AM -0700, Dan Williams wrote:
> > On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig <hch@lst.de> wrote:
> > > [stripped giant fullquotes]
> > >
> > > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
> > >> But that's my whole point.  The kernel doesn't really need to prevent
> > >> all these background maintenance operations -- it just needs to block
> > >> .page_mkwrite until they are synced.  I think that whatever new
> > >> mechanism we add for this should be sticky, but I see no reason why
> > >> the filesystem should have to block reflink on a DAX file entirely.
> > >
> > > Agreed - IFF we want to support write through semantics this is the
> > > only somewhat feasible way.  It still has massive downsides of forcing
> > > the full sync machinery to run from the page fauly handler, which
> > > I'm rather scared off, but that's still better than creating a magic
> > > special case that isn't managable at all.
> > 
> > An immutable-extent DAX-file and a reflink-capable DAX-file are not
> > mutually exclusive,
> 
> Actually, they are mutually exclusive: when the immutable extent DAX
> inode is breaking the extent sharing done during the reflink
> operation, the copy-on-write operation requires allocating and
> freeing extents on the inode that has immutable extents. Which, if
> the inode really has immutable extents, cannot be done.
> 
> That said, if the extent sharing is broken on the other side of the
> reflink (i.e. the non-immutable inode created by the reflink) then
> the extent map of the inode with immutable extents will remain
> unchanged. i.e. there are two sides to this, and if you only see one
> side you might come to the wrong conclusion.
> 
> However, we cannot guarantee that no writes occur to the inode with
> immutable extent maps (especially as the whole point is to allow
> userspace writes and commits without the kernel being involved), so
> extent sharing on immutable extent maps cannot be allowed...

Just to play devil's advocate...

/If/ you have rmap and /if/ you discover that there's only one
IOMAP_IMMUTABLE file owning this same block and /if/ you're willing to
relocate every other mapping on the whole filesystem, /then/ you could
/in theory/ support shared daxfiles.

However, that's so many on-disk metadata lookups to shove into a
pagefault handler that I don't think anyone in XFSland would entertain
such an ugly fantasy.  You'd be making a lot of metadata requests, and
you'd have to lock the rmapbt while grabbing inodes, which is insane.

Much easier to have a per-inode flag that says "the block map of this
file does not change" and put up with the restricted semantics.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner June 21, 2017, 1:40 a.m. UTC | #21
On Tue, Jun 20, 2017 at 09:14:24AM -0700, Andy Lutomirski wrote:
> On Tue, Jun 20, 2017 at 3:11 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
> >> On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote:
> >> >> Second: syncing extents.  Here's a straw man.  Forget the mmap() flag.
> >> >> Instead add a new msync() operation:
> >> >>
> >> >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE);
> >> >
> >> > How's this any different from the fallocate command I proposed to do
> >> > this (apart from the fact that msync() is not intended to be abused
> >> > as a file preallocation/zeroing interface)?
> >>
> >> I must have missed that suggestion.
> >>
> >> But it's different in a major way.  fallocate() takes an fd parameter,
> >> which means that, if some flag gets set, it's set on the struct file.
> >
> > DAX is a property of the inode, not the VMA or struct file as it
> > needs to be consistent across all VMAs and struct files that
> > reference that inode. Also, fallocate() manipulates state and
> > metadata hidden behind the struct inode, not the struct file, so it
> > seems to me like the right API to use.
> 
> I'm not sure I see why.  I can think of a few different scenarios:
> 
>  - Reflink while a DAX-using program is running.  It would be nice for
> performance, but not critical for functionality, if trying to reflink
> a file that's mapped for DAX would copy instead of COWing.  But
> breaking COW on the next page_mkwrite would work, too. 

Your mangling terminology here. We don't "break COW" - we *use*
copy-on-write to break *extent sharing*. We can break extent sharing
in page_mkwrite - that's exactly what we do for normal pagecache
based mmap writes, and it's done in page_mkwrite.

It hasn't been enabled it for DAX yet because it simply hasn't been
robustly tested yet.

> A per-inode
> count of the number of live DAX mappings or of the number of struct
> file instances that have requested DAX would work here.

For what purpose does this serve? The reflink invalidates all the
existing mappings, so the next write access causes a fault and then
page_mkwrite is called and the shared extent will get COWed....

>  - Trying to use DAX on a file that is already reflinked.  The order
> of operations doesn't matter hugely, except that breaking COW for the
> entire range in question all at once would be faster and result in
> better allocation.

We have COW extent size hints for that. i.e. if you want to COW a
huge page at a time, set the COW extent size hint to the huge page
size...

>  - Checksumming filesystems.  I think it's basically impossible to do
> DAX writes to a file like this.

No shit, sherlock. See my previous comments about compression and
encryption. DAX cannot be used where in-place data manipulations
would be required by the IO path.

> A filesystem could skip checksumming
> on extents that are actively mapped for DAX, but something like chattr
> to tell, say, btrfs that a given file is intended for DAX is probably
> better.  (But, if so, it should presumably be persistent like chattr
> and not something that's purely in memory like daxctl().)

You can already do this on btrfs regardless of DAX. There's an inode
flag:

#define BTRFS_INODE_NODATASUM           (1 << 0)

And you set it by turning off copy-on-write when the file is empty.
i.e. just after creation, run:

	ioctl(FS_IOC_GETFLAGS, &flags).
	flags |= FS_NOCOW_FL;
	ioctl(FS_IOC_SETFLAGS, &flags).

You're talking about stuff that already exists - stop trying to tell
me about basic filesystem functionality and DAX requirements that I
understood years ago. Please start with the assumption that I know a
lot more about this than you do, and if there's something you write
that I don't understand I'll ask you to explain....

>  - Defrag and such on an otherwise simple filesystem.  Defragging a
> file while it's not actively mapped for DAX should be allowed.
> Defragging a file while it is actively mapped for DAX may be slow, but
> it a filesystem could certainly make it work.

FYI, XFS already does this - it skips mapped files altogether, so
DAX state simply doesn't matter here. We also have the FS_NODEFRAG
ioctl flag, so admins can choose to mark files that they want defrag
to skip...

> > And, as mmap() requires a fd to set up the mapping and fallocate()
> > would have to be run *before* mmap() is used to access the data
> > directly, I don't see why using fallocate would be a problem here...
> 
> Fair enough.  But ISTM fallocate() has no business setting up
> important state in the struct file.  It's an operation on inodes.

Yup, the state in question is kept in inode->i_flags rather than
duplicated into the struct file. Which leads to code like this in
the page fault handlers. e.g. in xfs_filemap_page_mkwrite():

	struct inode            *inode = file_inode(vma->vm_file);
	....
	if (IS_DAX(inode)) {
		.....

Yeah, the page fault behaviour is determined by state kept on the
inode, not the struct file or vma....

> Looking at the above scenarios, it seems to may that two or three
> separate mechanisms may be ideal here.
> 
> 1. The most important: some way to tell the kernel that an open file
> description or an mmap or some subrange thereof is going to be used
> for DAX writes and for the kernel to respond by saying "yes, okay" or
> "no, not okay". 

A file can only be accessed by DAX or through the page cache at a
point in time - it can't do both because the mapping infrastructure
(e.g. the radix tree) can only handle one or the other at any point
in time. So, you want to change the DAX access mode of an open file?
We have that on XFS. Turn on DAX:

	ioctl(FS_IOC_FSGETXATTR, &flags).
	flags |= FS_XFLAG_DAX;
	ioctl(FS_IOC_FSSETXATTR, &flags).

This will throw an error if the kernel and/or fs does not support
DAX. The fs will lock out page faults on the inode, invalidate all
the existing mappings, remove the cached pages and switch to DAX
mode.

Turn off DAX:

	ioctl(FS_IOC_FSGETXATTR, &flags).
	flags &= ~FS_XFLAG_DAX;
	ioctl(FS_IOC_FSSETXATTR, &flags).

And the filesystem will lock out page faults, invalidate all the DAX
mappings and switch to cached mode...

> 2. Some way to tell filesystems like btrfs to make a file that will be
> DAX-able.

See above.

> > Prediction for the MAP_SYNC future: frequent bug reports about
> > huge, unpredictable page fault latencies on DAX files because
> > every so often a page fault is required to sync tens of
> > thousands of unrelated dirty objects because of filesystem
> > journal ordering constraints....
> 
> Is this really so bad?

Apparently it is. There are people telling us that mtime
updates in page faults introduce too much unpredictable latency and
that screws over their low latency real time applications.

Those same people are telling use that dirty tracking in page faults
for msync/fsync on DAX is too heavyweight and calling msync is too
onerous and has unpredictable latencies because it might result in
having to sync tens of thousands of unrelated dirty objects. Hence
they want to use userspace data sync primitives to avoid this
overhead and so filesystems need to make it possible to provide this
userspace idata sync capability.

So, we came up with a method that removed all overhead and
unpredictability from the page fault path, and now we're being told
that calling fallocate() is too hard and difficult and the
restrictions that prevent all the page fault overhead (operations
which you won't want to be doing for low-latency apps in the first
place) are too onerous.

And so the proposed solution is an API that requires the filesystem
to *run fdatasync in every write page fault*?

Can you see the contradictions in the requirements here? On one hand
we're being told page faults have to be low overhead and have
predictable latency, but when presented with a solution we're told
you want page faults to be blocked arbitrarily and for unbound
lengths of time.

Put simply: I don't care what gets implemented. What I care about is
that everyone understands that we're being given contradictory
requirements and that neither of the proposed solutions solves both
sets of requirements.

> I suspect that users saying "your filesystem is slower than I'd
> like for such-and-such use case" isn't all that rare.

Except the people asking for userspace data sync are asking for it
for *performance reasons*. Making that explicit case *slower* is the
exact opposite of what we've been asked to provide....

Cheers,

Dave.
Dave Chinner June 21, 2017, 2:19 a.m. UTC | #22
On Tue, Jun 20, 2017 at 06:24:03PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 21, 2017 at 09:53:46AM +1000, Dave Chinner wrote:
> > On Tue, Jun 20, 2017 at 09:17:36AM -0700, Dan Williams wrote:
> > > An immutable-extent DAX-file and a reflink-capable DAX-file are not
> > > mutually exclusive,
> > 
> > Actually, they are mutually exclusive: when the immutable extent DAX
> > inode is breaking the extent sharing done during the reflink
> > operation, the copy-on-write operation requires allocating and
> > freeing extents on the inode that has immutable extents. Which, if
> > the inode really has immutable extents, cannot be done.
> > 
> > That said, if the extent sharing is broken on the other side of the
> > reflink (i.e. the non-immutable inode created by the reflink) then
> > the extent map of the inode with immutable extents will remain
> > unchanged. i.e. there are two sides to this, and if you only see one
> > side you might come to the wrong conclusion.
> > 
> > However, we cannot guarantee that no writes occur to the inode with
> > immutable extent maps (especially as the whole point is to allow
> > userspace writes and commits without the kernel being involved), so
> > extent sharing on immutable extent maps cannot be allowed...
> 
> Just to play devil's advocate...
> 
> /If/ you have rmap and /if/ you discover that there's only one
> IOMAP_IMMUTABLE file owning this same block and /if/ you're willing to
> relocate every other mapping on the whole filesystem, /then/ you could
> /in theory/ support shared daxfiles.

I figured that nobody apart from experienced filesystem developers
would understand the complexities of rmap and refcounts and how they
could be abused to do this. I also assumed that that people like you
would understand this is possible but completely impractical....

> However, that's so many on-disk metadata lookups to shove into a
> pagefault handler that I don't think anyone in XFSland would entertain
> such an ugly fantasy.  You'd be making a lot of metadata requests, and
> you'd have to lock the rmapbt while grabbing inodes, which is insane.

Exactly. But while I understand this, consider the amount of assumed
filesystem and XFS knowledge in that one simple paragraph. Most
non-experts would have stopped *understanding* at "/If/ you have
rmap" and go away with the wrong ideas in their heads. Hence I now
tend to omit mentioning "possible but impractical" things in mixed
expertise discussions....

> Much easier to have a per-inode flag that says "the block map of this
> file does not change" and put up with the restricted semantics.

In a nutshell.

Cheers,

Dave.
Andy Lutomirski June 21, 2017, 5:18 a.m. UTC | #23
On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> Your mangling terminology here. We don't "break COW" - we *use*
> copy-on-write to break *extent sharing*. We can break extent sharing
> in page_mkwrite - that's exactly what we do for normal pagecache
> based mmap writes, and it's done in page_mkwrite.

Right, my bad.

>
> It hasn't been enabled it for DAX yet because it simply hasn't been
> robustly tested yet.
>
>> A per-inode
>> count of the number of live DAX mappings or of the number of struct
>> file instances that have requested DAX would work here.
>
> For what purpose does this serve? The reflink invalidates all the
> existing mappings, so the next write access causes a fault and then
> page_mkwrite is called and the shared extent will get COWed....

The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it
right), except that IMO an API that doesn't involve making a change to
an inode that sticks around would be nice.  The inode flag has the
unfortunate property that, if two different programs each try to set
the flag, mmap, write, and clear the flag, they'll stomp on each other
and risk data corruption.

I admit I'm now thoroughly confused as to exactly what XFS does here
-- does FS_XFLAG_DAX persist across unmount/mount?

>
>>  - Trying to use DAX on a file that is already reflinked.  The order
>> of operations doesn't matter hugely, except that breaking COW for the
>> entire range in question all at once would be faster and result in
>> better allocation.
>
> We have COW extent size hints for that. i.e. if you want to COW a
> huge page at a time, set the COW extent size hint to the huge page
> size...

Nifty.

> Apparently it is. There are people telling us that mtime
> updates in page faults introduce too much unpredictable latency and
> that screws over their low latency real time applications.

I was one of those, and I even wrote patches.  I should try to dust them off.

>
> Those same people are telling use that dirty tracking in page faults
> for msync/fsync on DAX is too heavyweight and calling msync is too
> onerous and has unpredictable latencies because it might result in
> having to sync tens of thousands of unrelated dirty objects. Hence
> they want to use userspace data sync primitives to avoid this
> overhead and so filesystems need to make it possible to provide this
> userspace idata sync capability.

If I were using DAX in production, I'd have exactly this issue.  Let
me quote myself:

On Tue, Jun 20, 2017 at 9:14 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 3. (Not strictly related to DAX.) A way to tell the kernel "I have
> this file mmapped for write.  Please go out of your way to avoid
> future page faults."  I've wanted this for ordinary files on ext4.
> The kernel could, but presently does not, use hardware dirty tracking
> instead of software dirty tracking to decide when to write the page
> back.  The kernel could also, in principle, write back dirty pages
> without ever write protecting them.  For DAX, this might change
> behavior to prevent any operation that would relocate blocks or to
> have the kernel go out of its way to only do such operations when
> absolutely necessary and to immediately update and unwriteprotect the
> relevant pages.

I agree that this is a real issue, but it's not limited to DAX.  I've
wanted a mode where I tell the kernel "I'm a high-performance
application mmapping this file and I'm going to write to it a lot.  Do
your best to avoid any page faults, even if it adversely affects the
performance of the system."  This mode could do lots of things.  It
could cause the system to leave the page writable even after writeback
and, if possible, to use hardware dirty tracking.  It could cause the
system to make a copy of the page and write back from the copy if
there is anything in play that could need stable pages during
writeback.  And, for DAX, it could tell the system to keep the page
pinned and disallow moving it and reflinking it.

(Of course, the above requires that we either deal with mtime like my
patches do or that this heavyweight mechanism disable mtime updates.
I prefer the former.)

Here's the overall point I'm trying to make: unprivileged programs
that want to write to DAX files with userspace commit mechanisms
(CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without
privilege, and with reasonably clean APIs.  Ideally they could do this
to any file they have write access to.  Programs that want to write to
mmapped files, DAX or otherwise, without latency spikes due to
.page_mkwrite should be able to opt in to a heavier weight mechanism.
But these two issues are someone independent, and I think they should
be solved separately.
Dave Chinner June 21, 2017, 11:37 p.m. UTC | #24
On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote:
> [add linux-xfs to the fray]
> 
> On Fri, Jun 16, 2017 at 06:15:35PM -0700, Dan Williams wrote:
> > +	spin_lock(&dax_lock);
> > +	list_add(&d->list, &daxfiles);
> > +	spin_unlock(&dax_lock);
> > +
> > +	/*
> > +	 * We set S_SWAPFILE to gain "no truncate" / static block
> > +	 * allocation semantics, and S_DAXFILE so we can differentiate
> > +	 * traditional swapfiles and assume static block mappings in the
> > +	 * dax mmap path.
> > +	 */
> > +	inode->i_flags |= S_SWAPFILE | S_DAXFILE;
> 
> Yikes.  You know, I hadn't even thought about considering swap files as
> a subcase of files with immutable block maps, but here we are.  Both
> swap files and DAX require absolutely stable block mappings, they are
> both (probably) intolerant of inode metadata changes (size, mtime, etc.)

Swap files are intolerant of any metadata changes because once the
mapping has been sucked into the swapfile code, the inode is never
looked at again. DAX file data access always goes through the inode,
so they is much more tolerant of metadata changes given certain
constraints.

<snip bmap rant>

> Honestly, I realize we've gone back, forth, and around all over the
> place on this.  I still prefer something similar to a permanent flag,
> similar to what Dave suggested, though I hate the name PMEM_IMMUTABLE
> and some of the semantics.
> 
> First, a new inode flag S_IOMAP_FROZEN that means the file's block map
> cannot change.

I've been calling it "immutable extents" - freezing has implications
that it's only temporary (i.e. freezing filesystems) and will be
followed shortly by a thaw. That isn't the case here - we truly want
the extent/block map to be immutable....

> Second, some kind of function to toggle the S_IOMAP_FROZEN flag.
> Turning it on will lock the inode, check the extent map for holes,
> shared, or unwritten bits, and bail out if it finds any, or set the
> flag. 

Hmmm, I disagree on the unwritten state here.  We want swap files to
be able to use unwritten extents - it means we can preallocate the
swap file and hand it straight to swapon without having to zero it
(i.e. no I/O needed to demand allocate more swap space when memory
is very low).  Also, anyone who tries to read the swap file from
userspace will be reading unwritten extents, which will always
return zeros rather than whatever is in the swap file...

> Not sure if we should require CAP_LINUX_IMMUTABLE -- probably
> yes, at least at first.  I don't currently have any objection to writing
> non-iomap inode metadata out to disk.
> 
> Third, the flag can only be cleared if the file isn't mapped.

How do we check this from the fs without racing? AFAICT we can't
prevent a concurrent map operation from occurring while we are
changing the state of the inode - we can only block page faults
after then inode is mapped....

> Fourth, the VFS entry points for things like read, write, truncate,
> utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a
> file, so that the block map cannot be modified.
> mmap is still allowed,
> as we've discussed.  /Maybe/ we can allow fallocate to extend a file
> with zeroed extents (it will be slow) as I've heard murmurs about
> wanting to be able to extend a file, maybe not.

read is fine, write should be fine as long as the iomap call can
error out operations that would require extent map modifications.
fallocate should be allowed to modify the extent map, too, because
it should be the mechanism used be applications to set up file
extents in the correct form for applications to use as immutable
(i.e. lock out page faults, allocate, zero, extend and fsync in
one atomic operation)....

> Fifth, swapfiles now require the S_IOMAP_FROZEN flag since they want
> stable iomap but probably don't care about things like mtime.  Maybe
> they can call iomap too.
> 
> Sixth, XFS can record the S_IOMAP_FROZEN state in di_flags2 and set it
> whenever the in-core inode gets constructed.  This enables us to
> prohibit reflinking and other such undesirable activity.

*nod*

> If we actually /do/ come up with a reference implementation for XFS, I'd
> be ok with tacking it on the end of my dev branch, which will give us a
> loooong runway to try it out.  The end of the dev branch is beyond
> online XFS fsck and repair and the "root metadata btrees in inodes"
> rework; since that's ~90 patches with my name on it that I cannot also
> review, it won't go in for a long time indeed!

I don't think it's so complex to need such a long dev time -
all the infrastructure we need is pretty much there already...

> (Yes, that was also sort of a plea for someone to go review the XFS
> scrub patches.)
> 
> > +	return 0;
> > +}
> > +
> > +SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, align)
> 
> I was /about/ to grouse about this syscall, then realized that maybe it
> /is/ useful to be able to check a specific alignment.  Maybe not, since
> I had something more permanent in mind anyway.  In any case, just pass
> in an opened fd if this sticks around.

We can do all that via fallocate(), too...

Cheers,

Dave.
Dave Chinner June 22, 2017, 12:02 a.m. UTC | #25
On Tue, Jun 20, 2017 at 10:18:24PM -0700, Andy Lutomirski wrote:
> On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> A per-inode
> >> count of the number of live DAX mappings or of the number of struct
> >> file instances that have requested DAX would work here.
> >
> > For what purpose does this serve? The reflink invalidates all the
> > existing mappings, so the next write access causes a fault and then
> > page_mkwrite is called and the shared extent will get COWed....
> 
> The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it
> right), except that IMO an API that doesn't involve making a change to
> an inode that sticks around would be nice.  The inode flag has the
> unfortunate property that, if two different programs each try to set
> the flag, mmap, write, and clear the flag, they'll stomp on each other
> and risk data corruption.
> 
> I admit I'm now thoroughly confused as to exactly what XFS does here
> -- does FS_XFLAG_DAX persist across unmount/mount?

Yes, it is.

i.e. DAX on XFS does not rely on a naive fs-wide mount option. You
can have applications on pmem filesystems use either DAX or normal
IO based on directory/inode flags.  Something doesn't work with DAX,
so just remove the DAX flags from the directories/inodes, and it
will safely and transparently switch to page-cache based IO.

<snip>

> Here's the overall point I'm trying to make: unprivileged programs
> that want to write to DAX files with userspace commit mechanisms
> (CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without
> privilege, and with reasonably clean APIs.  Ideally they could do this
> to any file they have write access to.

The privilege argument is irrelevant now - it was /suggested/
initially as a way of preventing people from shooting themselves in
the foot based on the immutable file model. It's clear that's not
desired, and it's not a show stopper. 

> Programs that want to write to
> mmapped files, DAX or otherwise, without latency spikes due to
> .page_mkwrite should be able to opt in to a heavier weight mechanism.
> But these two issues are someone independent, and I think they should
> be solved separately.

You seem to be calling the "fdatasync on every page fault" the
"lightweight" option. That's the brute-force-with-big-hammer
solution - it's most definitely not lightweight as every page fault
has extra overhead to call ->fsync(). Sure, the API is simple, but
the runtime overhead is significant.

The lightweight *runtime* option is to set up the file in such a
way that there is never any extra overhead at page fault time.  This
is what immutable extent maps provide.  Indeed, because the mappings
never change, you could use hardware dirty tracking if you wanted,
as there's no need to look up the filesystem to do writeback as
everything needed for writeback was mapped at page fault time.  This
"map first and then just write when you need to" is *exactly how
swap files work*.

Even if you are considering the complexity of the APIs, it's hardly
a "heavyweight" when it only requires a single call to fallocate()
before mmap() to set up the immutable extents on the file...

Cheers,

Dave.
Andy Lutomirski June 22, 2017, 4:07 a.m. UTC | #26
On Wed, Jun 21, 2017 at 5:02 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> You seem to be calling the "fdatasync on every page fault" the

It's the opposite of fdatasync().  It needs to sync whatever metadata
is needed to find the data.  The data doesn't need to be synced.

> "lightweight" option. That's the brute-force-with-big-hammer
> solution - it's most definitely not lightweight as every page fault
> has extra overhead to call ->fsync(). Sure, the API is simple, but
> the runtime overhead is significant.

It's lightweight in terms of its impact on the filesystem.  It doesn't
need any persistent setup -- you can just use it.

> Even if you are considering the complexity of the APIs, it's hardly
> a "heavyweight" when it only requires a single call to fallocate()
> before mmap() to set up the immutable extents on the file...

So what would the exact semantics be?  In particular, how can it fail?
 If I do the fallocate(), is it absolutely promised that the extent
map won't get out of sync between what mmap sees and what's on disk?
Do user programs need to worry about colliding with each other when
one does fallocate() to DAXify a file and the other does fallocate()
to unDAXify a file?  Does this particular fallocate() call still keep
its effect after a reboot?

These issues are why I think it would be nicer to have an API that
makes a particular mapping or fd be unconditionally *correct* and then
to provide something else that makes it avoid latency spikes.

Is there an actual concrete proposal that's reviewable?
Darrick J. Wong June 22, 2017, 7:09 a.m. UTC | #27
On Tue, Jun 20, 2017 at 09:42:55AM -0600, Ross Zwisler wrote:
> On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote:
> <>
> > Fourth, the VFS entry points for things like read, write, truncate,
> > utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a
> > file, so that the block map cannot be modified.  mmap is still allowed,
> > as we've discussed.  /Maybe/ we can allow fallocate to extend a file
> > with zeroed extents (it will be slow) as I've heard murmurs about
> > wanting to be able to extend a file, maybe not.
> 
> Read and write should still be allowed, right?

<shrug> I had thought the usage model was pretty slanted towards mmap,
but it's not a big deal to turn read/writes into glorified memcpy,
provided we reject the io request if it goes past EOF.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong June 22, 2017, 7:23 a.m. UTC | #28
On Thu, Jun 22, 2017 at 09:37:14AM +1000, Dave Chinner wrote:
> On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote:
> > [add linux-xfs to the fray]
> > 
> > On Fri, Jun 16, 2017 at 06:15:35PM -0700, Dan Williams wrote:
> > > +	spin_lock(&dax_lock);
> > > +	list_add(&d->list, &daxfiles);
> > > +	spin_unlock(&dax_lock);
> > > +
> > > +	/*
> > > +	 * We set S_SWAPFILE to gain "no truncate" / static block
> > > +	 * allocation semantics, and S_DAXFILE so we can differentiate
> > > +	 * traditional swapfiles and assume static block mappings in the
> > > +	 * dax mmap path.
> > > +	 */
> > > +	inode->i_flags |= S_SWAPFILE | S_DAXFILE;
> > 
> > Yikes.  You know, I hadn't even thought about considering swap files as
> > a subcase of files with immutable block maps, but here we are.  Both
> > swap files and DAX require absolutely stable block mappings, they are
> > both (probably) intolerant of inode metadata changes (size, mtime, etc.)
> 
> Swap files are intolerant of any metadata changes because once the
> mapping has been sucked into the swapfile code, the inode is never
> looked at again. DAX file data access always goes through the inode,
> so they is much more tolerant of metadata changes given certain
> constraints.

Fair enough.

> <snip bmap rant>
> 
> > Honestly, I realize we've gone back, forth, and around all over the
> > place on this.  I still prefer something similar to a permanent flag,
> > similar to what Dave suggested, though I hate the name PMEM_IMMUTABLE
> > and some of the semantics.
> > 
> > First, a new inode flag S_IOMAP_FROZEN that means the file's block map
> > cannot change.
> 
> I've been calling it "immutable extents" - freezing has implications
> that it's only temporary (i.e. freezing filesystems) and will be
> followed shortly by a thaw. That isn't the case here - we truly want
> the extent/block map to be immutable....

<nod> S_IOMAP_IMMUTABLE it is, then.

> > Second, some kind of function to toggle the S_IOMAP_FROZEN flag.
> > Turning it on will lock the inode, check the extent map for holes,
> > shared, or unwritten bits, and bail out if it finds any, or set the
> > flag. 
> 
> Hmmm, I disagree on the unwritten state here.  We want swap files to
> be able to use unwritten extents - it means we can preallocate the
> swap file and hand it straight to swapon without having to zero it
> (i.e. no I/O needed to demand allocate more swap space when memory
> is very low).  Also, anyone who tries to read the swap file from
> userspace will be reading unwritten extents, which will always
> return zeros rather than whatever is in the swap file...

Now I've twisted all the way around to thinking that swap files
should be /totally/ unwritten, except for the file header. :)

> > Not sure if we should require CAP_LINUX_IMMUTABLE -- probably
> > yes, at least at first.  I don't currently have any objection to writing
> > non-iomap inode metadata out to disk.
> > 
> > Third, the flag can only be cleared if the file isn't mapped.
> 
> How do we check this from the fs without racing? AFAICT we can't
> prevent a concurrent map operation from occurring while we are
> changing the state of the inode - we can only block page faults
> after then inode is mapped....

I'd thought we could coordinate that via xfs_file_mmap, but tbh my brain
paged that out a while ago.

> > Fourth, the VFS entry points for things like read, write, truncate,
> > utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a
> > file, so that the block map cannot be modified.
> > mmap is still allowed,
> > as we've discussed.  /Maybe/ we can allow fallocate to extend a file
> > with zeroed extents (it will be slow) as I've heard murmurs about
> > wanting to be able to extend a file, maybe not.
> 
> read is fine, write should be fine as long as the iomap call can
> error out operations that would require extent map modifications.

Ok.

> fallocate should be allowed to modify the extent map, too, because
> it should be the mechanism used be applications to set up file
> extents in the correct form for applications to use as immutable
> (i.e. lock out page faults, allocate, zero, extend and fsync in
> one atomic operation)....

<nod>

> > Fifth, swapfiles now require the S_IOMAP_FROZEN flag since they want
> > stable iomap but probably don't care about things like mtime.  Maybe
> > they can call iomap too.
> > 
> > Sixth, XFS can record the S_IOMAP_FROZEN state in di_flags2 and set it
> > whenever the in-core inode gets constructed.  This enables us to
> > prohibit reflinking and other such undesirable activity.
> 
> *nod*
> 
> > If we actually /do/ come up with a reference implementation for XFS, I'd
> > be ok with tacking it on the end of my dev branch, which will give us a
> > loooong runway to try it out.  The end of the dev branch is beyond
> > online XFS fsck and repair and the "root metadata btrees in inodes"
> > rework; since that's ~90 patches with my name on it that I cannot also
> > review, it won't go in for a long time indeed!
> 
> I don't think it's so complex to need such a long dev time -
> all the infrastructure we need is pretty much there already...

It definitely is, but we've been bikeshedding so long it now has
momentum. :)

That said, it (and having a go at dax+reflink) are things that I'd like
to look at (after pushing the scrub stuff) for the rest of the year.

> > (Yes, that was also sort of a plea for someone to go review the XFS
> > scrub patches.)
> > 
> > > +	return 0;
> > > +}
> > > +
> > > +SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, align)
> > 
> > I was /about/ to grouse about this syscall, then realized that maybe it
> > /is/ useful to be able to check a specific alignment.  Maybe not, since
> > I had something more permanent in mind anyway.  In any case, just pass
> > in an opened fd if this sticks around.
> 
> We can do all that via fallocate(), too...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner June 23, 2017, 12:52 a.m. UTC | #29
On Wed, Jun 21, 2017 at 09:07:57PM -0700, Andy Lutomirski wrote:
> On Wed, Jun 21, 2017 at 5:02 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > You seem to be calling the "fdatasync on every page fault" the
> 
> It's the opposite of fdatasync().  It needs to sync whatever metadata
> is needed to find the data.  The data doesn't need to be synced.

So much wrong with that statement.

Andy, what does fdatasync() do when you have a data-clean,
metadata-dirty file (e.g. you just punched a hole  or preallocated
more space via fallocate())?  Hint: it doesn't sync any data
because the mapping tree is clean, but it still syncs the dirty
metadata needed to access the data.

Now, what does a file where we do direct IO writes look like? Yup,
the mapping tree always remains clean and so it's only ever going to
appear to the kernel as a *data-clean, metadata-dirty* file. So,
after a direct IO write is done, what operation do we need to run to
ensure that we can always access the data?

Yup, it's fdatasync().

So, what does a DAX file that does userspace data flushes look like
to the kernel? Yup, again the mapping tree always remains clean and
so it's only ever going to be a *data-clean, metadata-dirty* file.

It should be clear now why I said "fdatasync on every page fault"
because that's exactly the mechanism we'd use to implement this
functionality....

It should also be clear that DAX is not introducing any new data
integrity problems to the filesystems that direct IO hasn't already
introduced. Both DAX with userspace data sync and Direct IO writes
are completely untracked by the kernel.  IOWs, direct IO is a form
of "kernel bypass", just like DAX+userspace data sync is.  All that
is different is the method by which data is written to the storage
media from userspace, which in the case of DAX is via mmap rather
than read/write.

> > "lightweight" option. That's the brute-force-with-big-hammer
> > solution - it's most definitely not lightweight as every page fault
> > has extra overhead to call ->fsync(). Sure, the API is simple, but
> > the runtime overhead is significant.
> 
> It's lightweight in terms of its impact on the filesystem.  It doesn't
> need any persistent setup -- you can just use it.

Well, no, that's wrong, because we have to co-ordinate multiple
concurrent accesses to the data in the kernel. What happens when
some other process writes to the file *at the same time* but does
not use userspace sync? We aren't tracking dirty regions on the
inode mapping because we've been told not to do that, so fsync()
from that other process *won't sync the data it wrote*. IOws, the
kernel has failed to provide the guarantee that userspace wants it
to provide.

The single mapping tree is central to the problem here - we can't
mix modes of dirty tracking across different processes. Either
everything uses userspace sync, or everything uses kernel controlled
dirty tracking so fsync() works correctly in all cases. Put simply -
dirty tracking is a per-inode function, not a per-file or per-vma
function.

As the direct IO kernel-bypass model demonstrates, as soon as you
start considering multi-process data coherency and durability with
mixed kernel+kernel bypass methods in play, lots of potential
problems and issues crop up that can't easily be solved by the
kernel or filesystems. We try to minimise the problems, but we don't
guarantee mixed mode coherency (and hence integrity) as we've
delegated data coherency and integrity responsibility to the app
bypassing the kernel data coherency and integrity mechanisms.

What I'd like to avoid is creating another kernel bypass mechanism
where we allow coherency and/or integrity to be fucked up in a way that
we can't fix without giving up all the performance that the kernel
bypass provides userspace apps. Constrain the cases where kernel
bypass is allowed, and we avoid all the crappy corner cases where
our only answer to users with corrupt data is "the man page advises
application developers not to do that".

If in future we work out how to implement everything without
needing immutable extents in the inode, we can relax the
restrictions we've placed on userspace DAX data sync....

> > Even if you are considering the complexity of the APIs, it's hardly
> > a "heavyweight" when it only requires a single call to fallocate()
> > before mmap() to set up the immutable extents on the file...
> 
> So what would the exact semantics be?  In particular, how can it fail?
>  If I do the fallocate(), is it absolutely promised that the extent
> map won't get out of sync between what mmap sees and what's on disk?

That's precisely the guarantee I documented would be given by
immutable extents in my very first proposal.

> Do user programs need to worry about colliding with each other when
> one does fallocate() to DAXify a file and the other does fallocate()
> to unDAXify a file?

Yes, it can. This was one of the reasons for putting it under
privilege - so only the app has full control of the extent map
changes and nobody else can fuck with it.

> Does this particular fallocate() call still keep
> its effect after a reboot?

Yes, it does, because it has to be transparent and behave
consistently with all of userspace, not just the app that owns the
file, and not just while that app is running. (e.g. defrag could be
running on the file before the app starts, and then you're screwed
when defrag modifies the extent map after app startup...)

> Is there an actual concrete proposal that's reviewable?

Yes, the first posting where I proposed this functionality many
months ago spelled this all out in detail.

Cheers,

Dave.
Andy Lutomirski June 23, 2017, 3:07 a.m. UTC | #30
On Thu, Jun 22, 2017 at 5:52 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Jun 21, 2017 at 09:07:57PM -0700, Andy Lutomirski wrote:
>> On Wed, Jun 21, 2017 at 5:02 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >
>> > You seem to be calling the "fdatasync on every page fault" the
>>
>> It's the opposite of fdatasync().  It needs to sync whatever metadata
>> is needed to find the data.  The data doesn't need to be synced.
>
> So much wrong with that statement.
>
> Andy, what does fdatasync() do when you have a data-clean,
> metadata-dirty file (e.g. you just punched a hole  or preallocated
> more space via fallocate())?  Hint: it doesn't sync any data
> because the mapping tree is clean, but it still syncs the dirty
> metadata needed to access the data.
>
> Now, what does a file where we do direct IO writes look like? Yup,
> the mapping tree always remains clean and so it's only ever going to
> appear to the kernel as a *data-clean, metadata-dirty* file. So,
> after a direct IO write is done, what operation do we need to run to
> ensure that we can always access the data?
>
> Yup, it's fdatasync().

Fair enough.  Except that fdatasync() goes through dax_writeback_one()
(I think), which deals with cache flushes (via wb_cache_pmem()).  This
special type of sync shouldn't need to do that, so it's not really
quite fdatasync().

>> > "lightweight" option. That's the brute-force-with-big-hammer
>> > solution - it's most definitely not lightweight as every page fault
>> > has extra overhead to call ->fsync(). Sure, the API is simple, but
>> > the runtime overhead is significant.
>>
>> It's lightweight in terms of its impact on the filesystem.  It doesn't
>> need any persistent setup -- you can just use it.
>
> Well, no, that's wrong, because we have to co-ordinate multiple
> concurrent accesses to the data in the kernel. What happens when
> some other process writes to the file *at the same time* but does
> not use userspace sync? We aren't tracking dirty regions on the
> inode mapping because we've been told not to do that, so fsync()
> from that other process *won't sync the data it wrote*. IOws, the
> kernel has failed to provide the guarantee that userspace wants it
> to provide.

...

> What I'd like to avoid is creating another kernel bypass mechanism
> where we allow coherency and/or integrity to be fucked up in a way that
> we can't fix without giving up all the performance that the kernel
> bypass provides userspace apps. Constrain the cases where kernel
> bypass is allowed, and we avoid all the crappy corner cases where
> our only answer to users with corrupt data is "the man page advises
> application developers not to do that".

Ah, I see, a DAX file makes regular write() flush out the cache
automatically.  But I think the situation may be fucked up
integrity-wise anyway.  If you make an immutable-extent DAX file and a
DAX-unaware process mmaps() it and writes to the mapping, what flushes
the CPU cache?  Isn't part of the point of the magic immutable-extent
mode that it wouldn't have to track dirty extents?  Can it keep track
of which mappings are DAX-aware (via an mmap() flag, I assume)?  Would
all mappings of a DAX immutable-extent file be forced to be uncached
(or writethrough or WC or some type that allows fsync to be fast)?

Can you send a link to your fallocate email?  I'm having trouble finding it.
diff mbox

Patch

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..795eb93d6beb 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@ 
 330	common	pkey_alloc		sys_pkey_alloc
 331	common	pkey_free		sys_pkey_free
 332	common	statx			sys_statx
+333	64	daxctl			sys_daxctl
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 5ec1f6c47716..5f1d0e0ed30f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -4,8 +4,17 @@ 
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/radix-tree.h>
+#include <uapi/linux/dax.h>
 #include <asm/pgtable.h>
 
+/*
+ * TODO: make sys_daxctl() be the generic interface for toggling S_DAX
+ * across filesystems. For now, mark DAXCTL_F_DAX as an invalid flag
+ */
+#define DAXCTL_VALID_FLAGS (DAXCTL_F_GET | DAXCTL_F_STATIC)
+
+int daxfile_activate(struct file *daxfile, unsigned align);
+
 struct iomap_ops;
 struct dax_device;
 struct dax_operations {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3e68cabb8457..3af649fb669f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1824,8 +1824,10 @@  struct super_operations {
 #define S_NOSEC		4096	/* no suid or xattr security attributes */
 #ifdef CONFIG_FS_DAX
 #define S_DAX		8192	/* Direct Access, avoiding the page cache */
+#define S_DAXFILE	16384	/* no truncate (swapfile) semantics + dax */
 #else
 #define S_DAX		0	/* Make all the DAX code disappear */
+#define S_DAXFILE	0
 #endif
 
 /*
@@ -1865,6 +1867,7 @@  struct super_operations {
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
 #define IS_DAX(inode)		((inode)->i_flags & S_DAX)
+#define IS_DAXFILE(inode)	((inode)->i_flags & S_DAXFILE)
 
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 980c3c9b06f8..49e5cc4c192e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -701,6 +701,7 @@  asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
 			unsigned long arg4, unsigned long arg5);
 asmlinkage long sys_swapon(const char __user *specialfile, int swap_flags);
 asmlinkage long sys_swapoff(const char __user *specialfile);
+asmlinkage long sys_daxctl(const char __user *path, int flags, int align);
 asmlinkage long sys_sysctl(struct __sysctl_args __user *args);
 asmlinkage long sys_sysinfo(struct sysinfo __user *info);
 asmlinkage long sys_sysfs(int option,
diff --git a/include/uapi/linux/dax.h b/include/uapi/linux/dax.h
new file mode 100644
index 000000000000..78a41bb392c0
--- /dev/null
+++ b/include/uapi/linux/dax.h
@@ -0,0 +1,8 @@ 
+#ifndef _UAPI_LINUX_DAX_H
+#define _UAPI_LINUX_DAX_H
+
+#define DAXCTL_F_GET    (1 << 0)
+#define DAXCTL_F_DAX    (1 << 1)
+#define DAXCTL_F_STATIC (1 << 2)
+
+#endif /* _UAPI_LINUX_DAX_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index beb7a455915d..b874565c34eb 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -450,6 +450,11 @@  config	TRANSPARENT_HUGE_PAGECACHE
 	def_bool y
 	depends on TRANSPARENT_HUGEPAGE
 
+config DAXFILE
+	def_bool y
+	depends on FS_DAX
+	depends on SWAP
+
 #
 # UP and nommu archs use km based percpu allocator
 #
diff --git a/mm/Makefile b/mm/Makefile
index 026f6a828a50..38d9025a3e37 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -56,6 +56,7 @@  endif
 obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_DAXFILE)	+= daxfile.o
 obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
diff --git a/mm/daxfile.c b/mm/daxfile.c
new file mode 100644
index 000000000000..fe230199c855
--- /dev/null
+++ b/mm/daxfile.c
@@ -0,0 +1,186 @@ 
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/dax.h>
+#include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+
+/*
+ * TODO: a list to lookup daxfiles assumes a low number of instances,
+ * revisit.
+ */
+static LIST_HEAD(daxfiles);
+static DEFINE_SPINLOCK(dax_lock);
+
+struct dax_info {
+	struct list_head list;
+	struct file *daxfile;
+};
+
+static int daxfile_disable(struct file *victim)
+{
+	int found = 0;
+	struct dax_info *d;
+	struct inode *inode;
+	struct file *daxfile;
+	struct address_space *mapping;
+
+	mapping = victim->f_mapping;
+	spin_lock(&dax_lock);
+	list_for_each_entry(d, &daxfiles, list)
+		if (d->daxfile->f_mapping == mapping) {
+			list_del(&d->list);
+			found = 1;
+			break;
+		}
+	spin_unlock(&dax_lock);
+
+	if (!found)
+		return -EINVAL;
+
+	daxfile = d->daxfile;
+
+	inode = mapping->host;
+	inode->i_flags &= ~(S_SWAPFILE | S_DAXFILE);
+	filp_close(daxfile, NULL);
+
+	return 0;
+}
+
+static int claim_daxfile_checks(struct inode *inode)
+{
+	if (!S_ISREG(inode->i_mode))
+		return -EINVAL;
+
+	if (!IS_DAX(inode))
+		return -EINVAL;
+
+	if (IS_SWAPFILE(inode) || IS_DAXFILE(inode))
+		return -EBUSY;
+
+	return 0;
+}
+
+int daxfile_enable(struct file *daxfile, int align)
+{
+	struct address_space *mapping;
+	struct inode *inode;
+	struct dax_info *d;
+	int rc;
+
+	if (align < 0)
+		return -EINVAL;
+
+	mapping = daxfile->f_mapping;
+	inode = mapping->host;
+
+	rc = claim_daxfile_checks(inode);
+	if (rc)
+		return rc;
+
+	rc = daxfile_activate(daxfile, align);
+	if (rc)
+		return rc;
+
+	d = kzalloc(sizeof(*d), GFP_KERNEL);
+	if (!d)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&d->list);
+	d->daxfile = daxfile;
+
+	spin_lock(&dax_lock);
+	list_add(&d->list, &daxfiles);
+	spin_unlock(&dax_lock);
+
+	/*
+	 * We set S_SWAPFILE to gain "no truncate" / static block
+	 * allocation semantics, and S_DAXFILE so we can differentiate
+	 * traditional swapfiles and assume static block mappings in the
+	 * dax mmap path.
+	 */
+	inode->i_flags |= S_SWAPFILE | S_DAXFILE;
+	return 0;
+}
+
+SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, align)
+{
+	int rc;
+	struct filename *name;
+	struct inode *inode = NULL;
+	struct file *daxfile = NULL;
+	struct address_space *mapping;
+
+	if (flags & ~DAXCTL_VALID_FLAGS)
+		return -EINVAL;
+
+	name = getname(path);
+	if (IS_ERR(name))
+		return PTR_ERR(name);
+
+	daxfile = file_open_name(name, O_RDWR|O_LARGEFILE, 0);
+	if (IS_ERR(daxfile)) {
+		rc = PTR_ERR(daxfile);
+		daxfile = NULL;
+		goto out;
+	}
+
+	mapping = daxfile->f_mapping;
+	inode = mapping->host;
+	if (flags & DAXCTL_F_GET) {
+		/*
+		 * We only report the state of DAXCTL_F_STATIC since
+		 * there is no actions for applications to take based on
+		 * the setting of S_DAX. However, if this interface is
+		 * used for toggling S_DAX presumably userspace would
+		 * want to know the state of the flag.
+		 *
+		 * TODO: revisit whether we want to report DAXCTL_F_DAX
+		 * in the IS_DAX() case.
+		 */
+		if (IS_DAXFILE(inode))
+			rc = DAXCTL_F_STATIC;
+		else
+			rc = 0;
+
+		goto out;
+	}
+
+	/*
+	 * TODO: Should unprivileged users be allowed to control daxfile
+	 * behavior? Perhaps a mount flag... is -o dax that flag?
+	 */
+	if (!capable(CAP_LINUX_IMMUTABLE)) {
+		rc = -EPERM;
+		goto out;
+	}
+
+	inode_lock(inode);
+	if (!IS_DAXFILE(inode) && (flags & DAXCTL_F_STATIC)) {
+		rc = daxfile_enable(daxfile, align);
+		/* if successfully enabled hold daxfile open */
+		if (rc == 0)
+			daxfile = NULL;
+	} else if (IS_DAXFILE(inode) && !(flags & DAXCTL_F_STATIC))
+		rc = daxfile_disable(daxfile);
+	else
+		rc = 0;
+	inode_unlock(inode);
+
+out:
+	if (daxfile)
+		filp_close(daxfile, NULL);
+	if (name)
+		putname(name);
+	return rc;
+}
diff --git a/mm/page_io.c b/mm/page_io.c
index 5cec9a3d49f2..35160ad9c51f 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -244,6 +244,37 @@  static int bmap_walk(struct file *file, const unsigned page_size,
 	goto out;
 }
 
+static int daxfile_check(sector_t block, unsigned long page_no,
+		enum bmap_check type, void *none)
+{
+	if (type == BMAP_WALK_DONE)
+		return 0;
+
+	/*
+	 * Unlike the swapfile case, fail daxfile_activate() if any file
+	 * extent is not page aligned.
+	 */
+	if (type != BMAP_WALK_FULLPAGE)
+		return -EINVAL;
+	return 0;
+}
+
+int daxfile_activate(struct file *daxfile, unsigned align)
+{
+	int rc;
+
+	if (!align)
+		align = PAGE_SIZE;
+
+	if (align < PAGE_SIZE || !is_power_of_2(align))
+		return -EINVAL;
+
+	rc = bmap_walk(daxfile, align, ULONG_MAX, NULL, daxfile_check, NULL);
+	if (rc)
+		pr_debug("daxctl: daxfile has holes\n");
+	return rc;
+}
+
 static int swapfile_check(sector_t block, unsigned long page_no,
 		enum bmap_check type, void *_sis)
 {