Message ID | 149766213493.22552.4057048843646200083.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Jun 16, 2017 at 6:15 PM, Dan Williams <dan.j.williams@intel.com> wrote: > To date, the full promise of byte-addressable access to persistent > memory has only been half realized via the filesystem-dax interface. The > current filesystem-dax mechanism allows an application to consume (read) > data from persistent storage at byte-size granularity, bypassing the > full page reads required by traditional storage devices. > > Now, for writes, applications still need to contend with > page-granularity dirtying and flushing semantics as well as filesystem > coordination for metadata updates after any mmap write. The current > situation precludes use cases that leverage byte-granularity / in-place > updates to persistent media. > > To get around this limitation there are some specialized applications > that are using the device-dax interface to bypass the overhead and > data-safety problems of the current filesystem-dax mmap-write path. > QEMU-KVM is forced to use device-dax to safely pass through persistent > memory to a guest [1]. Some specialized databases are using device-dax > for byte-granularity writes. Outside of those cases, device-dax is > difficult for general purpose persistent memory applications to consume. > There is demand for access to pmem without needing to contend with > special device configuration and other device-dax limitations. > > The 'daxfile' interface satisfies this demand and realizes one of Dave > Chinner's ideas for allowing pmem applications to safely bypass > fsync/msync requirements. The idea is to make the file immutable with > respect to the offset-to-block mappings for every extent in the file > [2]. It turns out that filesystems already need to make this guarantee > today. This property is needed for files marked as swap files. > > The new daxctl() syscall manages setting a file into 'static-dax' mode > whereby it arranges for the file to be treated as a swapfile as far as > the filesystem is concerned, but not registered with the core-mm as > swapfile space. A file in this mode is then safe to be mapped and > written without the requirement to fsync/msync the writes. The cpu > cache management for flushing data to persistence can be handled > completely in userspace. Can you remind those of us who haven't played with DAX in a while what the problem is with mmapping a DAX file without this patchset? If there's some bookkkeeping needed to make sure that the filesystem will invalidate all the mappings if it decides to move the file, maybe that should be the default rather than needing a new syscall. --Andy
On Sat, Jun 17, 2017 at 9:25 AM, Andy Lutomirski <luto@kernel.org> wrote: > On Fri, Jun 16, 2017 at 6:15 PM, Dan Williams <dan.j.williams@intel.com> wrote: >> To date, the full promise of byte-addressable access to persistent >> memory has only been half realized via the filesystem-dax interface. The >> current filesystem-dax mechanism allows an application to consume (read) >> data from persistent storage at byte-size granularity, bypassing the >> full page reads required by traditional storage devices. >> >> Now, for writes, applications still need to contend with >> page-granularity dirtying and flushing semantics as well as filesystem >> coordination for metadata updates after any mmap write. The current >> situation precludes use cases that leverage byte-granularity / in-place >> updates to persistent media. >> >> To get around this limitation there are some specialized applications >> that are using the device-dax interface to bypass the overhead and >> data-safety problems of the current filesystem-dax mmap-write path. >> QEMU-KVM is forced to use device-dax to safely pass through persistent >> memory to a guest [1]. Some specialized databases are using device-dax >> for byte-granularity writes. Outside of those cases, device-dax is >> difficult for general purpose persistent memory applications to consume. >> There is demand for access to pmem without needing to contend with >> special device configuration and other device-dax limitations. >> >> The 'daxfile' interface satisfies this demand and realizes one of Dave >> Chinner's ideas for allowing pmem applications to safely bypass >> fsync/msync requirements. The idea is to make the file immutable with >> respect to the offset-to-block mappings for every extent in the file >> [2]. It turns out that filesystems already need to make this guarantee >> today. This property is needed for files marked as swap files. >> >> The new daxctl() syscall manages setting a file into 'static-dax' mode >> whereby it arranges for the file to be treated as a swapfile as far as >> the filesystem is concerned, but not registered with the core-mm as >> swapfile space. A file in this mode is then safe to be mapped and >> written without the requirement to fsync/msync the writes. The cpu >> cache management for flushing data to persistence can be handled >> completely in userspace. > > Can you remind those of us who haven't played with DAX in a while what > the problem is with mmapping a DAX file without this patchset? If > there's some bookkkeeping needed to make sure that the filesystem will > invalidate all the mappings if it decides to move the file, maybe that > should be the default rather than needing a new syscall. The bookkeeping to invalidate mappings when the filesystem moves a block is already there. Without this patchset an application needs to call fsync/msync after any write to a DAX mapping otherwise there is no guarantee the filesystem has written the metadata to find the updated block after a crash or power loss event. Even if the sync operation is reduced to a minimal cmpxchg in userspace to check if the filesystem-metadata is dirty, that mechanism doesn't translate to a virtualized environment, as requiring guests to trigger host fsync()s is not feasible. It's a half-step solution when you can instead just ask the filesystem to never move blocks, as Dave proposed many months back. We stepped back from that proposal when it looked like a significant amount of per-filesystem work to introduce the capability and it was not clear that application developers would tolerate the side effects of this 'immutable' semantic. However, the implementation is dead simple since ext4 and xfs already need to make block-allocation-immutable semantics available for swapfiles. We also have application developers telling us they are ok with the semantics, especially because it catches Linux up to other operating environments that are already on board with allowing this type of access to pmem through a filesystem. This patchset gives pmem application developers what they want without any additional burden on filesystem implementations.
On Sat, Jun 17, 2017 at 2:52 PM, Dan Williams <dan.j.williams@intel.com> wrote: > On Sat, Jun 17, 2017 at 9:25 AM, Andy Lutomirski <luto@kernel.org> wrote: >> >> Can you remind those of us who haven't played with DAX in a while what >> the problem is with mmapping a DAX file without this patchset? If >> there's some bookkkeeping needed to make sure that the filesystem will >> invalidate all the mappings if it decides to move the file, maybe that >> should be the default rather than needing a new syscall. > > The bookkeeping to invalidate mappings when the filesystem moves a > block is already there. > > Without this patchset an application needs to call fsync/msync after > any write to a DAX mapping otherwise there is no guarantee the > filesystem has written the metadata to find the updated block after a > crash or power loss event. Even if the sync operation is reduced to a > minimal cmpxchg in userspace to check if the filesystem-metadata is > dirty, that mechanism doesn't translate to a virtualized environment, > as requiring guests to trigger host fsync()s is not feasible. It's a > half-step solution when you can instead just ask the filesystem to > never move blocks, as Dave proposed many months back. > > We stepped back from that proposal when it looked like a significant > amount of per-filesystem work to introduce the capability and it was > not clear that application developers would tolerate the side effects > of this 'immutable' semantic. However, the implementation is dead > simple since ext4 and xfs already need to make > block-allocation-immutable semantics available for swapfiles. We also > have application developers telling us they are ok with the semantics, > especially because it catches Linux up to other operating environments > that are already on board with allowing this type of access to pmem > through a filesystem. This patchset gives pmem application developers > what they want without any additional burden on filesystem > implementations. I see. I have a couple of minor-ish issues with the current proposal, then. One is that I think the terminology should be changed to still make sense if filesystems or VFS improves to make this approach unnecessary. Rather than saying "this file is now static", perhaps users should set a flag with the explicit semantics that "mmaps of this file are guaranteed not to lose data due to the kernel's activities", IOW that mmaps will be at least as durable as a direct mapping of DAX memory. Then the kernel has the flexibility to add a future implementation in which, instead of pinning the file, the filesystem just knows to keep metadata synced before allowing page_mkwrite to re-enable writes to an mmapped DAX file. My other objection is that the syscall intentionally leaks a reference to the file. This means it needs overflow protection and it probably shouldn't ever be allowed to use it without privilege. Why can't the underlying issue be easily fixed, though? Could .page_mkwrite just make sure that metadata is synced when the FS uses DAX? On a DAX fs, syncing metadata should be extremely fast. This could be conditioned on an madvise or mmap flag if performance might be an issue. As far as I know, this change alone should be sufficient.
On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote: > On Sat, Jun 17, 2017 at 2:52 PM, Dan Williams <dan.j.williams@intel.com> wrote: >> On Sat, Jun 17, 2017 at 9:25 AM, Andy Lutomirski <luto@kernel.org> wrote: >>> >>> Can you remind those of us who haven't played with DAX in a while what >>> the problem is with mmapping a DAX file without this patchset? If >>> there's some bookkkeeping needed to make sure that the filesystem will >>> invalidate all the mappings if it decides to move the file, maybe that >>> should be the default rather than needing a new syscall. >> >> The bookkeeping to invalidate mappings when the filesystem moves a >> block is already there. >> >> Without this patchset an application needs to call fsync/msync after >> any write to a DAX mapping otherwise there is no guarantee the >> filesystem has written the metadata to find the updated block after a >> crash or power loss event. Even if the sync operation is reduced to a >> minimal cmpxchg in userspace to check if the filesystem-metadata is >> dirty, that mechanism doesn't translate to a virtualized environment, >> as requiring guests to trigger host fsync()s is not feasible. It's a >> half-step solution when you can instead just ask the filesystem to >> never move blocks, as Dave proposed many months back. >> >> We stepped back from that proposal when it looked like a significant >> amount of per-filesystem work to introduce the capability and it was >> not clear that application developers would tolerate the side effects >> of this 'immutable' semantic. However, the implementation is dead >> simple since ext4 and xfs already need to make >> block-allocation-immutable semantics available for swapfiles. We also >> have application developers telling us they are ok with the semantics, >> especially because it catches Linux up to other operating environments >> that are already on board with allowing this type of access to pmem >> through a filesystem. This patchset gives pmem application developers >> what they want without any additional burden on filesystem >> implementations. > > I see. > > I have a couple of minor-ish issues with the current proposal, then. > One is that I think the terminology should be changed to still make > sense if filesystems or VFS improves to make this approach > unnecessary. Rather than saying "this file is now static", perhaps > users should set a flag with the explicit semantics that "mmaps of > this file are guaranteed not to lose data due to the kernel's > activities", IOW that mmaps will be at least as durable as a direct > mapping of DAX memory. Then the kernel has the flexibility to add a > future implementation in which, instead of pinning the file, the > filesystem just knows to keep metadata synced before allowing > page_mkwrite to re-enable writes to an mmapped DAX file. Yes, sounds good to me. Rename the flag to DAXCTL_F_SYNC to indicate updates via mmap to this file are synchronous as far as block allocation metadata is concerned. Future filesystems are then free to always support this synchronous mode without using the swapfile hack. > My other objection is that the syscall intentionally leaks a reference > to the file. This means it needs overflow protection and it probably > shouldn't ever be allowed to use it without privilege. We only hold the one reference while S_DAXFILE is set, so I think the protection is there, and per Dave's original proposal this requires CAP_LINUX_IMMUTABLE. > Why can't the underlying issue be easily fixed, though? Could > .page_mkwrite just make sure that metadata is synced when the FS uses > DAX? Yes, it most definitely could and that idea has been floated. > On a DAX fs, syncing metadata should be extremely fast. This > could be conditioned on an madvise or mmap flag if performance might > be an issue. As far as I know, this change alone should be > sufficient. The hang up is that it requires per-fs enabling as it needs to be careful to manage mmap_sem vs fs journal locks for example. I know the in-development NOVA [1] filesystem is planning to support this out of the gate. ext4 would be open to implementing it, but I think xfs is cold on the idea. Christoph originally proposed it here [2], before Dave went on to propose immutable semantics. [1]: https://github.com/NVSL/NOVA [2]: https://lists.01.org/pipermail/linux-nvdimm/2016-February/004609.html
On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@intel.com> wrote: > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote: >> My other objection is that the syscall intentionally leaks a reference >> to the file. This means it needs overflow protection and it probably >> shouldn't ever be allowed to use it without privilege. > > We only hold the one reference while S_DAXFILE is set, so I think the > protection is there, and per Dave's original proposal this requires > CAP_LINUX_IMMUTABLE. > >> Why can't the underlying issue be easily fixed, though? Could >> .page_mkwrite just make sure that metadata is synced when the FS uses >> DAX? > > Yes, it most definitely could and that idea has been floated. > >> On a DAX fs, syncing metadata should be extremely fast. This >> could be conditioned on an madvise or mmap flag if performance might >> be an issue. As far as I know, this change alone should be >> sufficient. > > The hang up is that it requires per-fs enabling as it needs to be > careful to manage mmap_sem vs fs journal locks for example. I know the > in-development NOVA [1] filesystem is planning to support this out of > the gate. ext4 would be open to implementing it, but I think xfs is > cold on the idea. Christoph originally proposed it here [2], before > Dave went on to propose immutable semantics. Hmm. Given a choice between a very clean API that works without privilege but is awkward to implement on XFS and an awkward-to-use API, I'd personally choose the former. Dave, even with the lock ordering issue, couldn't XFS implement MAP_PMEM_AWARE by having .page_mkwrite work roughly like this: if (metadata is dirty) { up_write(&mmap_sem); sync the metadata; down_write(&mmap_sem); return 0; /* retry the fault */ } else { return whatever success code; } This might require returning VM_FAULT_RETRY instead of 0 and it might require auditing the core mm code to make sure that it can handle mmap_sem being dropped like this. I don't see why it couldn't work in principle, though.
On Sat, Jun 17, 2017 at 08:15:05PM -0700, Dan Williams wrote: > The hang up is that it requires per-fs enabling as it needs to be > careful to manage mmap_sem vs fs journal locks for example. I know the > in-development NOVA [1] filesystem is planning to support this out of > the gate. ext4 would be open to implementing it, but I think xfs is > cold on the idea. Christoph originally proposed it here [2], before > Dave went on to propose immutable semantics. > > [1]: https://github.com/NVSL/NOVA > [2]: https://lists.01.org/pipermail/linux-nvdimm/2016-February/004609.html And I stand to that statement. Let's get DAX stable first, and properly cleaned up (e.g. follow on work with separating it entirely from the block device). Then think hard about how most of the persistent memory technologies actually work, including the point that for a lot of workloads page cache will be required at least on the write side. And then come up with actual real use cases and we can look into it. And stop trying to shoe-horn crap like this in.
On Sun, Jun 18, 2017 at 1:18 AM, Christoph Hellwig <hch@lst.de> wrote: > On Sat, Jun 17, 2017 at 08:15:05PM -0700, Dan Williams wrote: >> The hang up is that it requires per-fs enabling as it needs to be >> careful to manage mmap_sem vs fs journal locks for example. I know the >> in-development NOVA [1] filesystem is planning to support this out of >> the gate. ext4 would be open to implementing it, but I think xfs is >> cold on the idea. Christoph originally proposed it here [2], before >> Dave went on to propose immutable semantics. >> >> [1]: https://github.com/NVSL/NOVA >> [2]: https://lists.01.org/pipermail/linux-nvdimm/2016-February/004609.html > > And I stand to that statement. Let's get DAX stable first, and > properly cleaned up (e.g. follow on work with separating it entirely > from the block device). Then think hard about how most of the > persistent memory technologies actually work, including the point that > for a lot of workloads page cache will be required at least on the > write side. And then come up with actual real use cases and we can > look into it. I see it differently. We're already at a good point in time to start iterating on a fix for this issue. Ross and Jan have done a lot of good work on the dax stability front, and the block-device separation of dax is well underway. > And stop trying to shoe-horn crap like this in. The kernel shoe-horning all pmem+filesystem-dax applications into abiding page-cache semantics is a problem, and this RFC has already helped move the needle on a couple fronts. 1/ Swapfiles are subtly broken which is something worth fixing, and if it gets us a synchronous-dax mode without major filesystem surgery then that's all for the better. 2/ There's an appetite for just fixing this incrementally in each filesystem's fault handler, so if ext4 was able to prove out an interface / implementation for synchronous faults we could go with that instead of a pre-allocated + immutable interface and let other filesystems set their own timelines.
On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote: > On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@intel.com> wrote: > > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote: > >> My other objection is that the syscall intentionally leaks a reference > >> to the file. This means it needs overflow protection and it probably > >> shouldn't ever be allowed to use it without privilege. > > > > We only hold the one reference while S_DAXFILE is set, so I think the > > protection is there, and per Dave's original proposal this requires > > CAP_LINUX_IMMUTABLE. > > > >> Why can't the underlying issue be easily fixed, though? Could > >> .page_mkwrite just make sure that metadata is synced when the FS uses > >> DAX? > > > > Yes, it most definitely could and that idea has been floated. > > > >> On a DAX fs, syncing metadata should be extremely fast. <sigh> This again.... Persistent memory means the *I/O* is fast. It does not mean that *complex filesystem operations* are fast. Don't forget that there's an shitload of CPU that gets burnt to make sure that the metadata is synced correctly. Do that /synchronously/ on *every* write page fault (which, BTW, modify mtime, so will always have dirty metadata to sync) and now you have a serious performance problem with your "fast" DAX access method. And that's before we even consider all the problems with running sync operations in page fault context.... > >> This > >> could be conditioned on an madvise or mmap flag if performance might > >> be an issue. As far as I know, this change alone should be > >> sufficient. > > > > The hang up is that it requires per-fs enabling as it needs to be > > careful to manage mmap_sem vs fs journal locks for example. I know the > > in-development NOVA [1] filesystem is planning to support this out of > > the gate. ext4 would be open to implementing it, but I think xfs is > > cold on the idea. Christoph originally proposed it here [2], before > > Dave went on to propose immutable semantics. > > Hmm. Given a choice between a very clean API that works without > privilege but is awkward to implement on XFS and an awkward-to-use > API, I'd personally choose the former. Yup, you have the choice of a clean kernel API that will be substantially slower than the existing "dirty page" tracking and having the app run fsync() when necessary, or having to do a little more work in a library routine that preallocates a file and sets a flag on it? The apps will use the library API, not the kernel API, so who really cares if there's a few steps to setting up the file state appropriately? > Dave, even with the lock ordering issue, couldn't XFS implement > MAP_PMEM_AWARE by having .page_mkwrite work roughly like this: > > if (metadata is dirty) { > up_write(&mmap_sem); > sync the metadata; > down_write(&mmap_sem); > return 0; /* retry the fault */ > } else { > return whatever success code; > } How do you know that there is dependent filesystem metadata that needs syncing at a level that you can safely manipulate the mmap_sem? And how, exactly, do you do this without races? It'd be trivial to DOS such retryable DAX faults simply by touching the file in a tight loop in a separate process... Cheers, Dave.
On Mon, Jun 19, 2017 at 6:21 AM, Dave Chinner <david@fromorbit.com> wrote: > On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote: >> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@intel.com> wrote: >> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote: >> >> My other objection is that the syscall intentionally leaks a reference >> >> to the file. This means it needs overflow protection and it probably >> >> shouldn't ever be allowed to use it without privilege. >> > >> > We only hold the one reference while S_DAXFILE is set, so I think the >> > protection is there, and per Dave's original proposal this requires >> > CAP_LINUX_IMMUTABLE. >> > >> >> Why can't the underlying issue be easily fixed, though? Could >> >> .page_mkwrite just make sure that metadata is synced when the FS uses >> >> DAX? >> > >> > Yes, it most definitely could and that idea has been floated. >> > >> >> On a DAX fs, syncing metadata should be extremely fast. > > <sigh> > > This again.... > > Persistent memory means the *I/O* is fast. It does not mean that > *complex filesystem operations* are fast. > > Don't forget that there's an shitload of CPU that gets burnt to make > sure that the metadata is synced correctly. Do that /synchronously/ > on *every* write page fault (which, BTW, modify mtime, so will > always have dirty metadata to sync) and now you have a serious > performance problem with your "fast" DAX access method. I think the mtime issue can and should be solved separately. But it' s a fair point that there would be workloads for which this could be excessively expensive. In particular, simply creating a file, mmapping a large range, and touching the pages one by one -- delalloc would be completely defeated. But here's a strawman for solving both issues. First, mtime. I consider it to be either a bug or a misfeature that .page_mkwrite *ever* dirties an inode just to update mtime. I have old patches to fix this, and those patches could be updated and merged. With them applied, there's just a set_bit() in .page_mkwrite() to handle mtime. https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4 Second: syncing extents. Here's a straw man. Forget the mmap() flag. Instead add a new msync() operation: msync(start, length, MSYNC_PMEM_PREPARE_WRITE); If this operation succeeds, it guarantees that all future writes through this mapping on this range will hit actual storage and that all the metadata operations needed to make this write persistent will hit storage such that they are ordered before the user's writes. As an implementation detail, this will flush out the extents if needed. In addition, if the FS has any mechanism that would cause problems asyncronously later on (dedupe? deallocated extents full of zeros? defrag?), it may also need to set a flag on the VMA that changes the behavior of future .page_mkwrite operations. (On x86, for example, this would permit the FS to do WC/streaming writes without SFENCE if the FS were structured in a way that this worked.) Now we have an API that should work going forward without introducing baggage. And XFS is free to implement this API by making the entire file act like a swap file if XFS wants to do so, but this doesn't force other filesystems (ext4? NOVA?) to do the same thing. > > And that's before we even consider all the problems with running > sync operations in page fault context.... > >> >> This >> >> could be conditioned on an madvise or mmap flag if performance might >> >> be an issue. As far as I know, this change alone should be >> >> sufficient. >> > >> > The hang up is that it requires per-fs enabling as it needs to be >> > careful to manage mmap_sem vs fs journal locks for example. I know the >> > in-development NOVA [1] filesystem is planning to support this out of >> > the gate. ext4 would be open to implementing it, but I think xfs is >> > cold on the idea. Christoph originally proposed it here [2], before >> > Dave went on to propose immutable semantics. >> >> Hmm. Given a choice between a very clean API that works without >> privilege but is awkward to implement on XFS and an awkward-to-use >> API, I'd personally choose the former. > > Yup, you have the choice of a clean kernel API that will be > substantially slower than the existing "dirty page" tracking and > having the app run fsync() when necessary, or having to do a little > more work in a library routine that preallocates a file and sets a > flag on it? > > The apps will use the library API, not the kernel API, so who really > cares if there's a few steps to setting up the file state > appropriately? > >> Dave, even with the lock ordering issue, couldn't XFS implement >> MAP_PMEM_AWARE by having .page_mkwrite work roughly like this: >> >> if (metadata is dirty) { >> up_write(&mmap_sem); >> sync the metadata; >> down_write(&mmap_sem); >> return 0; /* retry the fault */ >> } else { >> return whatever success code; >> } > > How do you know that there is dependent filesystem metadata that > needs syncing at a level that you can safely manipulate the > mmap_sem? And how, exactly, do you do this without races? I have no idea, but I expect that all the locking issues are solvable. > It'd be > trivial to DOS such retryable DAX faults simply by touching the file > in a tight loop in a separate process... If the code were smart enough to only cause a retry when the extent being touched is dirty, this problem wouldn't exist.
On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote: > On Mon, Jun 19, 2017 at 6:21 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote: > >> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@intel.com> wrote: > >> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote: > >> >> My other objection is that the syscall intentionally leaks a reference > >> >> to the file. This means it needs overflow protection and it probably > >> >> shouldn't ever be allowed to use it without privilege. > >> > > >> > We only hold the one reference while S_DAXFILE is set, so I think the > >> > protection is there, and per Dave's original proposal this requires > >> > CAP_LINUX_IMMUTABLE. > >> > > >> >> Why can't the underlying issue be easily fixed, though? Could > >> >> .page_mkwrite just make sure that metadata is synced when the FS uses > >> >> DAX? > >> > > >> > Yes, it most definitely could and that idea has been floated. > >> > > >> >> On a DAX fs, syncing metadata should be extremely fast. > > > > <sigh> > > > > This again.... > > > > Persistent memory means the *I/O* is fast. It does not mean that > > *complex filesystem operations* are fast. > > > > Don't forget that there's an shitload of CPU that gets burnt to make > > sure that the metadata is synced correctly. Do that /synchronously/ > > on *every* write page fault (which, BTW, modify mtime, so will > > always have dirty metadata to sync) and now you have a serious > > performance problem with your "fast" DAX access method. > > I think the mtime issue can and should be solved separately. But it' > s a fair point that there would be workloads for which this could be > excessively expensive. In particular, simply creating a file, > mmapping a large range, and touching the pages one by one -- delalloc > would be completely defeated. > > But here's a strawman for solving both issues. First, mtime. I > consider it to be either a bug or a misfeature that .page_mkwrite > *ever* dirties an inode just to update mtime. I have old patches to > fix this, and those patches could be updated and merged. With them > applied, there's just a set_bit() in .page_mkwrite() to handle mtime. > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4 Yup, I remember that - it delays the update to data writeback time, IOWs the proposed MAP_SYNC page fault semantics result in the same (poor) behaviour because the sync operation will trigger mtime updates instead of the page fault. Unless, of course, you are implying that MAP_SYNC should not actually sync all known dirty metadata on an inode. <smacks head on desk> > Second: syncing extents. Here's a straw man. Forget the mmap() flag. > Instead add a new msync() operation: > > msync(start, length, MSYNC_PMEM_PREPARE_WRITE); How's this any different from the fallocate command I proposed to do this (apart from the fact that msync() is not intended to be abused as a file preallocation/zeroing interface)? > If this operation succeeds, it guarantees that all future writes > through this mapping on this range will hit actual storage and that > all the metadata operations needed to make this write persistent will > hit storage such that they are ordered before the user's writes. > As an implementation detail, this will flush out the extents if > needed. In addition, if the FS has any mechanism that would cause > problems asyncronously later on (dedupe? deallocated extents full > of zeros? defrag?), Hole punch, truncate, reflink, dedupe, snapshots, scrubbing and other background filesystem maintenance operations, etc can all change the extent layout asynchronously if there's no mechanism to tell the fs not to modify the inode extent layout. > it may also need to set a flag on the VMA > that changes the behavior of future .page_mkwrite operations. > > (On x86, for example, this would permit the FS to do WC/streaming > writes without SFENCE if the FS were structured in a way that this > worked.) > > Now we have an API that should work going forward without > introducing baggage. And XFS is free to implement this API by > making the entire file act like a swap file if XFS wants to do so, > but this doesn't force other filesystems (ext4? NOVA?) to do the > same thing. Sure, you are providing a simple programmatic API, but this does not provide a viable feature management strategy. i.e. the API you are now proposing requires the filesystem to ensure an inode's extent map cannot be modified ever again in the future (that "guarantees all future writes" bit). This requires, at minimum, a persistent flag to be set on the inode so the VFS and filesystem implementations can use it to prevent anything that, for example, relies on copy-on-write semantics being done on those files. That means the proposed msync operation will need to check the filesystem can support this feature and *fail* if it can't. Further, administrators need to be aware of this application requirement so they can plan their backup and disaster recovery operations appropriately (e.g. reflink and/or snapshots cannot be used as part of thei backup strategy). Hence the point of such restricted file manipulation functionality requiring permissions to be granted - it ensures sysadmins know they've got something less than obvious going on they may need special processes to handle safely. Unsurprisingly, this is exactly what the "DAX immutable" inode flag I proposed provides. It provides an explicit, standardised and *documented* management strategy that is common across all filesystems. It uses mechanisms that *already exist*, the VFS and filesystems already implement, and adminstrators are familiar with using to manage their systems (e.g. setting the "NODUMP" inode flag to exclude files from backups). This also avoids the management level fragmentation which would occur if filesystems each solve the "DAX userspace data sync" problem differently via different management tools, behaviours and semantics. Keep in mind that there are more uses for immutable extent maps than just DAX. e.g. every so often someone pops up and says "I have this high speed data aquisition hardware and we'd like to DMA data direct to the storage because it's far too slow pushing it through memory and then the OS to get it to storage. How do I map the storage blocks and guarantee the mapping won't change while we are transferring data direct from the hardware?". A file with an allocated, immutable extent map solves this problem for these sorts of esoteric applications. As such, can we please drop all the mmap/msync special API snowflake proposals and instead address the problem how to set up and manage files with immutable extent maps efficiently through fallocate? Once we have them, pmem aware applications don't need to do anything special with mmap to be able to use userspace data sync instructions safely. i.e. the pmem library should select the correct data sync method according to how the file was set up and what functionality the underlying filesystem DAX implementation supports. > >> Dave, even with the lock ordering issue, couldn't XFS implement > >> MAP_PMEM_AWARE by having .page_mkwrite work roughly like this: > >> > >> if (metadata is dirty) { up_write(&mmap_sem); sync the > >> metadata; down_write(&mmap_sem); return 0; /* retry the fault > >> */ } else { return whatever success code; } > > > > How do you know that there is dependent filesystem metadata that > > needs syncing at a level that you can safely manipulate the > > mmap_sem? And how, exactly, do you do this without races? > > I have no idea, but I expect that all the locking issues are > solvable. Yay, Dunning-Kruger to the rescue! Cheers, Dave.
[add linux-xfs to the fray] On Fri, Jun 16, 2017 at 06:15:35PM -0700, Dan Williams wrote: > To date, the full promise of byte-addressable access to persistent > memory has only been half realized via the filesystem-dax interface. The > current filesystem-dax mechanism allows an application to consume (read) > data from persistent storage at byte-size granularity, bypassing the > full page reads required by traditional storage devices. > > Now, for writes, applications still need to contend with > page-granularity dirtying and flushing semantics as well as filesystem > coordination for metadata updates after any mmap write. The current > situation precludes use cases that leverage byte-granularity / in-place > updates to persistent media. > > To get around this limitation there are some specialized applications > that are using the device-dax interface to bypass the overhead and > data-safety problems of the current filesystem-dax mmap-write path. > QEMU-KVM is forced to use device-dax to safely pass through persistent > memory to a guest [1]. Some specialized databases are using device-dax > for byte-granularity writes. Outside of those cases, device-dax is > difficult for general purpose persistent memory applications to consume. > There is demand for access to pmem without needing to contend with > special device configuration and other device-dax limitations. > > The 'daxfile' interface satisfies this demand and realizes one of Dave > Chinner's ideas for allowing pmem applications to safely bypass > fsync/msync requirements. The idea is to make the file immutable with > respect to the offset-to-block mappings for every extent in the file > [2]. It turns out that filesystems already need to make this guarantee > today. This property is needed for files marked as swap files. > > The new daxctl() syscall manages setting a file into 'static-dax' mode > whereby it arranges for the file to be treated as a swapfile as far as > the filesystem is concerned, but not registered with the core-mm as > swapfile space. A file in this mode is then safe to be mapped and > written without the requirement to fsync/msync the writes. The cpu > cache management for flushing data to persistence can be handled > completely in userspace. > > [1]: https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg01207.html > [2]: https://lkml.org/lkml/2016/9/11/159 > > Cc: Jan Kara <jack@suse.cz> > Cc: Jeff Moyer <jmoyer@redhat.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Dave Chinner <david@fromorbit.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > arch/x86/entry/syscalls/syscall_64.tbl | 1 > include/linux/dax.h | 9 ++ > include/linux/fs.h | 3 + > include/linux/syscalls.h | 1 > include/uapi/linux/dax.h | 8 + > mm/Kconfig | 5 + > mm/Makefile | 1 > mm/daxfile.c | 186 ++++++++++++++++++++++++++++++++ > mm/page_io.c | 31 +++++ > 9 files changed, 245 insertions(+) > create mode 100644 include/uapi/linux/dax.h > create mode 100644 mm/daxfile.c > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index 5aef183e2f85..795eb93d6beb 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -339,6 +339,7 @@ > 330 common pkey_alloc sys_pkey_alloc > 331 common pkey_free sys_pkey_free > 332 common statx sys_statx > +333 64 daxctl sys_daxctl > > # > # x32-specific system call numbers start at 512 to avoid cache impact > diff --git a/include/linux/dax.h b/include/linux/dax.h > index 5ec1f6c47716..5f1d0e0ed30f 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -4,8 +4,17 @@ > #include <linux/fs.h> > #include <linux/mm.h> > #include <linux/radix-tree.h> > +#include <uapi/linux/dax.h> > #include <asm/pgtable.h> > > +/* > + * TODO: make sys_daxctl() be the generic interface for toggling S_DAX > + * across filesystems. For now, mark DAXCTL_F_DAX as an invalid flag > + */ > +#define DAXCTL_VALID_FLAGS (DAXCTL_F_GET | DAXCTL_F_STATIC) > + > +int daxfile_activate(struct file *daxfile, unsigned align); > + > struct iomap_ops; > struct dax_device; > struct dax_operations { > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3e68cabb8457..3af649fb669f 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1824,8 +1824,10 @@ struct super_operations { > #define S_NOSEC 4096 /* no suid or xattr security attributes */ > #ifdef CONFIG_FS_DAX > #define S_DAX 8192 /* Direct Access, avoiding the page cache */ > +#define S_DAXFILE 16384 /* no truncate (swapfile) semantics + dax */ > #else > #define S_DAX 0 /* Make all the DAX code disappear */ > +#define S_DAXFILE 0 > #endif > > /* > @@ -1865,6 +1867,7 @@ struct super_operations { > #define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT) > #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) > #define IS_DAX(inode) ((inode)->i_flags & S_DAX) > +#define IS_DAXFILE(inode) ((inode)->i_flags & S_DAXFILE) > > #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \ > (inode)->i_rdev == WHITEOUT_DEV) > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index 980c3c9b06f8..49e5cc4c192e 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -701,6 +701,7 @@ asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3, > unsigned long arg4, unsigned long arg5); > asmlinkage long sys_swapon(const char __user *specialfile, int swap_flags); > asmlinkage long sys_swapoff(const char __user *specialfile); > +asmlinkage long sys_daxctl(const char __user *path, int flags, int align); > asmlinkage long sys_sysctl(struct __sysctl_args __user *args); > asmlinkage long sys_sysinfo(struct sysinfo __user *info); > asmlinkage long sys_sysfs(int option, > diff --git a/include/uapi/linux/dax.h b/include/uapi/linux/dax.h > new file mode 100644 > index 000000000000..78a41bb392c0 > --- /dev/null > +++ b/include/uapi/linux/dax.h > @@ -0,0 +1,8 @@ > +#ifndef _UAPI_LINUX_DAX_H > +#define _UAPI_LINUX_DAX_H > + > +#define DAXCTL_F_GET (1 << 0) > +#define DAXCTL_F_DAX (1 << 1) > +#define DAXCTL_F_STATIC (1 << 2) > + > +#endif /* _UAPI_LINUX_DAX_H */ > diff --git a/mm/Kconfig b/mm/Kconfig > index beb7a455915d..b874565c34eb 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -450,6 +450,11 @@ config TRANSPARENT_HUGE_PAGECACHE > def_bool y > depends on TRANSPARENT_HUGEPAGE > > +config DAXFILE > + def_bool y > + depends on FS_DAX > + depends on SWAP > + > # > # UP and nommu archs use km based percpu allocator > # > diff --git a/mm/Makefile b/mm/Makefile > index 026f6a828a50..38d9025a3e37 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -56,6 +56,7 @@ endif > obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o > > obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o > +obj-$(CONFIG_DAXFILE) += daxfile.o > obj-$(CONFIG_FRONTSWAP) += frontswap.o > obj-$(CONFIG_ZSWAP) += zswap.o > obj-$(CONFIG_HAS_DMA) += dmapool.o > diff --git a/mm/daxfile.c b/mm/daxfile.c > new file mode 100644 > index 000000000000..fe230199c855 > --- /dev/null > +++ b/mm/daxfile.c > @@ -0,0 +1,186 @@ > +/* > + * Copyright(c) 2017 Intel Corporation. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of version 2 of the GNU General Public License as > + * published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, but > + * WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. > + */ > +#include <linux/dax.h> > +#include <linux/slab.h> > +#include <linux/highmem.h> > +#include <linux/pagemap.h> > +#include <linux/syscalls.h> > + > +/* > + * TODO: a list to lookup daxfiles assumes a low number of instances, > + * revisit. > + */ > +static LIST_HEAD(daxfiles); > +static DEFINE_SPINLOCK(dax_lock); > + > +struct dax_info { > + struct list_head list; > + struct file *daxfile; > +}; > + > +static int daxfile_disable(struct file *victim) > +{ > + int found = 0; > + struct dax_info *d; > + struct inode *inode; > + struct file *daxfile; > + struct address_space *mapping; > + > + mapping = victim->f_mapping; > + spin_lock(&dax_lock); > + list_for_each_entry(d, &daxfiles, list) > + if (d->daxfile->f_mapping == mapping) { > + list_del(&d->list); > + found = 1; > + break; > + } > + spin_unlock(&dax_lock); > + > + if (!found) > + return -EINVAL; > + > + daxfile = d->daxfile; > + > + inode = mapping->host; > + inode->i_flags &= ~(S_SWAPFILE | S_DAXFILE); > + filp_close(daxfile, NULL); > + > + return 0; > +} > + > +static int claim_daxfile_checks(struct inode *inode) > +{ > + if (!S_ISREG(inode->i_mode)) > + return -EINVAL; > + > + if (!IS_DAX(inode)) > + return -EINVAL; > + > + if (IS_SWAPFILE(inode) || IS_DAXFILE(inode)) > + return -EBUSY; > + > + return 0; > +} > + > +int daxfile_enable(struct file *daxfile, int align) > +{ > + struct address_space *mapping; > + struct inode *inode; > + struct dax_info *d; > + int rc; > + > + if (align < 0) > + return -EINVAL; > + > + mapping = daxfile->f_mapping; > + inode = mapping->host; > + > + rc = claim_daxfile_checks(inode); > + if (rc) > + return rc; > + > + rc = daxfile_activate(daxfile, align); > + if (rc) > + return rc; > + > + d = kzalloc(sizeof(*d), GFP_KERNEL); > + if (!d) > + return -ENOMEM; > + INIT_LIST_HEAD(&d->list); > + d->daxfile = daxfile; > + > + spin_lock(&dax_lock); > + list_add(&d->list, &daxfiles); > + spin_unlock(&dax_lock); > + > + /* > + * We set S_SWAPFILE to gain "no truncate" / static block > + * allocation semantics, and S_DAXFILE so we can differentiate > + * traditional swapfiles and assume static block mappings in the > + * dax mmap path. > + */ > + inode->i_flags |= S_SWAPFILE | S_DAXFILE; Yikes. You know, I hadn't even thought about considering swap files as a subcase of files with immutable block maps, but here we are. Both swap files and DAX require absolutely stable block mappings, they are both (probably) intolerant of inode metadata changes (size, mtime, etc.) But on the other hand, the bmap interface is so... yuck. We return zero to indicate no mapping or error or whatever, it doesn't actually tell us /which/ device it's returning offsets into, etc. I was writing a regression test earlier to check that we've sealed off XFS RT files from becoming swap files (because bmap is broken, not (afaik) because of any weird limitation of xfs) and forgot that quirk long enough to waste time wondering why it failed to fail on a 4.11 kernel. That's right, the first rt file gets block zero and magically doesn't fail to fail if that's the swap file. Gross. I've now ranted twice this month about how bmap() doesn't work on reflinked files on XFS. Honestly, I realize we've gone back, forth, and around all over the place on this. I still prefer something similar to a permanent flag, similar to what Dave suggested, though I hate the name PMEM_IMMUTABLE and some of the semantics. First, a new inode flag S_IOMAP_FROZEN that means the file's block map cannot change. Second, some kind of function to toggle the S_IOMAP_FROZEN flag. Turning it on will lock the inode, check the extent map for holes, shared, or unwritten bits, and bail out if it finds any, or set the flag. Not sure if we should require CAP_LINUX_IMMUTABLE -- probably yes, at least at first. I don't currently have any objection to writing non-iomap inode metadata out to disk. Third, the flag can only be cleared if the file isn't mapped. Fourth, the VFS entry points for things like read, write, truncate, utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a file, so that the block map cannot be modified. mmap is still allowed, as we've discussed. /Maybe/ we can allow fallocate to extend a file with zeroed extents (it will be slow) as I've heard murmurs about wanting to be able to extend a file, maybe not. Fifth, swapfiles now require the S_IOMAP_FROZEN flag since they want stable iomap but probably don't care about things like mtime. Maybe they can call iomap too. Sixth, XFS can record the S_IOMAP_FROZEN state in di_flags2 and set it whenever the in-core inode gets constructed. This enables us to prohibit reflinking and other such undesirable activity. If we actually /do/ come up with a reference implementation for XFS, I'd be ok with tacking it on the end of my dev branch, which will give us a loooong runway to try it out. The end of the dev branch is beyond online XFS fsck and repair and the "root metadata btrees in inodes" rework; since that's ~90 patches with my name on it that I cannot also review, it won't go in for a long time indeed! (Yes, that was also sort of a plea for someone to go review the XFS scrub patches.) > + return 0; > +} > + > +SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, align) I was /about/ to grouse about this syscall, then realized that maybe it /is/ useful to be able to check a specific alignment. Maybe not, since I had something more permanent in mind anyway. In any case, just pass in an opened fd if this sticks around. --D > +{ > + int rc; > + struct filename *name; > + struct inode *inode = NULL; > + struct file *daxfile = NULL; > + struct address_space *mapping; > + > + if (flags & ~DAXCTL_VALID_FLAGS) > + return -EINVAL; > + > + name = getname(path); > + if (IS_ERR(name)) > + return PTR_ERR(name); > + > + daxfile = file_open_name(name, O_RDWR|O_LARGEFILE, 0); > + if (IS_ERR(daxfile)) { > + rc = PTR_ERR(daxfile); > + daxfile = NULL; > + goto out; > + } > + > + mapping = daxfile->f_mapping; > + inode = mapping->host; > + if (flags & DAXCTL_F_GET) { > + /* > + * We only report the state of DAXCTL_F_STATIC since > + * there is no actions for applications to take based on > + * the setting of S_DAX. However, if this interface is > + * used for toggling S_DAX presumably userspace would > + * want to know the state of the flag. > + * > + * TODO: revisit whether we want to report DAXCTL_F_DAX > + * in the IS_DAX() case. > + */ > + if (IS_DAXFILE(inode)) > + rc = DAXCTL_F_STATIC; > + else > + rc = 0; > + > + goto out; > + } > + > + /* > + * TODO: Should unprivileged users be allowed to control daxfile > + * behavior? Perhaps a mount flag... is -o dax that flag? > + */ > + if (!capable(CAP_LINUX_IMMUTABLE)) { > + rc = -EPERM; > + goto out; > + } > + > + inode_lock(inode); > + if (!IS_DAXFILE(inode) && (flags & DAXCTL_F_STATIC)) { > + rc = daxfile_enable(daxfile, align); > + /* if successfully enabled hold daxfile open */ > + if (rc == 0) > + daxfile = NULL; > + } else if (IS_DAXFILE(inode) && !(flags & DAXCTL_F_STATIC)) > + rc = daxfile_disable(daxfile); > + else > + rc = 0; > + inode_unlock(inode); > + > +out: > + if (daxfile) > + filp_close(daxfile, NULL); > + if (name) > + putname(name); > + return rc; > +} > diff --git a/mm/page_io.c b/mm/page_io.c > index 5cec9a3d49f2..35160ad9c51f 100644 > --- a/mm/page_io.c > +++ b/mm/page_io.c > @@ -244,6 +244,37 @@ static int bmap_walk(struct file *file, const unsigned page_size, > goto out; > } > > +static int daxfile_check(sector_t block, unsigned long page_no, > + enum bmap_check type, void *none) > +{ > + if (type == BMAP_WALK_DONE) > + return 0; > + > + /* > + * Unlike the swapfile case, fail daxfile_activate() if any file > + * extent is not page aligned. > + */ > + if (type != BMAP_WALK_FULLPAGE) > + return -EINVAL; > + return 0; > +} > + > +int daxfile_activate(struct file *daxfile, unsigned align) > +{ > + int rc; > + > + if (!align) > + align = PAGE_SIZE; > + > + if (align < PAGE_SIZE || !is_power_of_2(align)) > + return -EINVAL; > + > + rc = bmap_walk(daxfile, align, ULONG_MAX, NULL, daxfile_check, NULL); > + if (rc) > + pr_debug("daxctl: daxfile has holes\n"); > + return rc; > +} > + > static int swapfile_check(sector_t block, unsigned long page_no, > enum bmap_check type, void *_sis) > { > > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote: >> On Mon, Jun 19, 2017 at 6:21 AM, Dave Chinner <david@fromorbit.com> wrote: >> > On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote: >> >> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@intel.com> wrote: >> >> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote: >> >> >> My other objection is that the syscall intentionally leaks a reference >> >> >> to the file. This means it needs overflow protection and it probably >> >> >> shouldn't ever be allowed to use it without privilege. >> >> > >> >> > We only hold the one reference while S_DAXFILE is set, so I think the >> >> > protection is there, and per Dave's original proposal this requires >> >> > CAP_LINUX_IMMUTABLE. >> >> > >> >> >> Why can't the underlying issue be easily fixed, though? Could >> >> >> .page_mkwrite just make sure that metadata is synced when the FS uses >> >> >> DAX? >> >> > >> >> > Yes, it most definitely could and that idea has been floated. >> >> > >> >> >> On a DAX fs, syncing metadata should be extremely fast. >> > >> > <sigh> >> > >> > This again.... >> > >> > Persistent memory means the *I/O* is fast. It does not mean that >> > *complex filesystem operations* are fast. >> > >> > Don't forget that there's an shitload of CPU that gets burnt to make >> > sure that the metadata is synced correctly. Do that /synchronously/ >> > on *every* write page fault (which, BTW, modify mtime, so will >> > always have dirty metadata to sync) and now you have a serious >> > performance problem with your "fast" DAX access method. >> >> I think the mtime issue can and should be solved separately. But it' >> s a fair point that there would be workloads for which this could be >> excessively expensive. In particular, simply creating a file, >> mmapping a large range, and touching the pages one by one -- delalloc >> would be completely defeated. >> >> But here's a strawman for solving both issues. First, mtime. I >> consider it to be either a bug or a misfeature that .page_mkwrite >> *ever* dirties an inode just to update mtime. I have old patches to >> fix this, and those patches could be updated and merged. With them >> applied, there's just a set_bit() in .page_mkwrite() to handle mtime. >> >> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4 > > Yup, I remember that - it delays the update to data writeback time, > IOWs the proposed MAP_SYNC page fault semantics result in the same > (poor) behaviour because the sync operation will trigger mtime > updates instead of the page fault. > > Unless, of course, you are implying that MAP_SYNC should not > actually sync all known dirty metadata on an inode. > > <smacks head on desk> > >> Second: syncing extents. Here's a straw man. Forget the mmap() flag. >> Instead add a new msync() operation: >> >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE); > > How's this any different from the fallocate command I proposed to do > this (apart from the fact that msync() is not intended to be abused > as a file preallocation/zeroing interface)? I must have missed that suggestion. But it's different in a major way. fallocate() takes an fd parameter, which means that, if some flag gets set, it's set on the struct file. The persistence property seems to me like it belongs on the vma, not the file. But it doesn't have to be msync() -- it could be madvise or even a new mallocate(). (Although mallocate() is possible the worst name ever.) > >> If this operation succeeds, it guarantees that all future writes >> through this mapping on this range will hit actual storage and that >> all the metadata operations needed to make this write persistent will >> hit storage such that they are ordered before the user's writes. >> As an implementation detail, this will flush out the extents if >> needed. In addition, if the FS has any mechanism that would cause >> problems asyncronously later on (dedupe? deallocated extents full >> of zeros? defrag?), > > Hole punch, truncate, reflink, dedupe, snapshots, scrubbing and > other background filesystem maintenance operations, etc can all > change the extent layout asynchronously if there's no mechanism to > tell the fs not to modify the inode extent layout. But that's my whole point. The kernel doesn't really need to prevent all these background maintenance operations -- it just needs to block .page_mkwrite until they are synced. I think that whatever new mechanism we add for this should be sticky, but I see no reason why the filesystem should have to block reflink on a DAX file entirely. In fact, the daxctl() proposal seems actively problematic for some usecases. I think I should be able to mmap() a DAX file and then, while it's still mapped, extend the file, mmap the new part (with the appropriate flag, madvise(), msync(), fallocate(), whatever), and write directly through that mapping and through the original mapping, concurrently, with the full persistence guarantee. This seems really awkward to do using daxctl(). > >> it may also need to set a flag on the VMA >> that changes the behavior of future .page_mkwrite operations. >> >> (On x86, for example, this would permit the FS to do WC/streaming >> writes without SFENCE if the FS were structured in a way that this >> worked.) >> >> Now we have an API that should work going forward without >> introducing baggage. And XFS is free to implement this API by >> making the entire file act like a swap file if XFS wants to do so, >> but this doesn't force other filesystems (ext4? NOVA?) to do the >> same thing. > > Sure, you are providing a simple programmatic API, but this does not > provide a viable feature management strategy. > > i.e. the API you are now proposing requires the filesystem to ensure > an inode's extent map cannot be modified ever again in the future > (that "guarantees all future writes" bit). This requires, at > minimum, a persistent flag to be set on the inode so the VFS and > filesystem implementations can use it to prevent anything that, for > example, relies on copy-on-write semantics being done on those > files. That means the proposed msync operation will need to check > the filesystem can support this feature and *fail* if it can't. No it doesn't. A filesystem *could* implement it like that, but it could also implement it using .page_mkwrite. And yes, a filesystem that can't can't guarantee durability with CLFLUSHOPT; SFENCE on a mapping should fail this operation to indicate that it can't support it. > > Further, administrators need to be aware of this application > requirement so they can plan their backup and disaster recovery > operations appropriately (e.g. reflink and/or snapshots cannot be > used as part of thei backup strategy). Or they could use a filesystem that will understand that the new operation needs to break COW. > > Unsurprisingly, this is exactly what the "DAX immutable" inode flag > I proposed provides. It provides an explicit, standardised and > *documented* management strategy that is common across all > filesystems. It uses mechanisms that *already exist*, the VFS and > filesystems already implement, and adminstrators are familiar with > using to manage their systems (e.g. setting the "NODUMP" inode flag > to exclude files from backups). This also avoids the management > level fragmentation which would occur if filesystems each solve the > "DAX userspace data sync" problem differently via different > management tools, behaviours and semantics. The DAX immutable flag is really nasty for my software that would like to use DAX. I have quite a few processes, all unprivileged, that create files that they'd like to map using DAX and write to durably without needing to fsync() (using CLFLUSHOPT; SFENCE or perhaps a WT mapping). If I were to use daxctl(), I'd have to have to write a privileged daemon to manage it, and that would be rather nasty. If, instead, we had a nice unprivileged per-vma or per-fd mechanism to tell the filesystem that I want DAX durability, I could just use it without any fuss. If it worked on ext4 before it worked on xfs, then I'd use ext4. If it ended up being heavier weight on XFS than it was on ext4 because XFS needed to lock down the extent map for the inode whereas ext4 could manage it through .page_mkwrite(), then I'd benchmark it and see which fs would win. (For my particular use case, I doubt it would matter, since I aggressively offload fs metadata operations to a thread whose performance I don't really care about.) --Andy
[stripped giant fullquotes] On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: > But that's my whole point. The kernel doesn't really need to prevent > all these background maintenance operations -- it just needs to block > .page_mkwrite until they are synced. I think that whatever new > mechanism we add for this should be sticky, but I see no reason why > the filesystem should have to block reflink on a DAX file entirely. Agreed - IFF we want to support write through semantics this is the only somewhat feasible way. It still has massive downsides of forcing the full sync machinery to run from the page fauly handler, which I'm rather scared off, but that's still better than creating a magic special case that isn't managable at all. > If, instead, we had a nice unprivileged per-vma or per-fd mechanism to > tell the filesystem that I want DAX durability, I could just use it > without any fuss. If it worked on ext4 before it worked on xfs, then > I'd use ext4. If it ended up being heavier weight on XFS than it was > on ext4 because XFS needed to lock down the extent map for the inode > whereas ext4 could manage it through .page_mkwrite(), then I'd > benchmark it and see which fs would win. (For my particular use case, > I doubt it would matter, since I aggressively offload fs metadata > operations to a thread whose performance I don't really care about.) ext4 and XFS have the same fundamental issue: both have a file system wide log of modified data that needs to be flushed to stable storage to ensure everything is safe. So if you solve the issue for one of them you've solved it for the other one as well modulo implementation details.
On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: > On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote: > >> Second: syncing extents. Here's a straw man. Forget the mmap() flag. > >> Instead add a new msync() operation: > >> > >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE); > > > > How's this any different from the fallocate command I proposed to do > > this (apart from the fact that msync() is not intended to be abused > > as a file preallocation/zeroing interface)? > > I must have missed that suggestion. > > But it's different in a major way. fallocate() takes an fd parameter, > which means that, if some flag gets set, it's set on the struct file. DAX is a property of the inode, not the VMA or struct file as it needs to be consistent across all VMAs and struct files that reference that inode. Also, fallocate() manipulates state and metadata hidden behind the struct inode, not the struct file, so it seems to me like the right API to use. And, as mmap() requires a fd to set up the mapping and fallocate() would have to be run *before* mmap() is used to access the data directly, I don't see why using fallocate would be a problem here... > >> If this operation succeeds, it guarantees that all future writes > >> through this mapping on this range will hit actual storage and that > >> all the metadata operations needed to make this write persistent will > >> hit storage such that they are ordered before the user's writes. > >> As an implementation detail, this will flush out the extents if > >> needed. In addition, if the FS has any mechanism that would cause > >> problems asyncronously later on (dedupe? deallocated extents full > >> of zeros? defrag?), > > > > Hole punch, truncate, reflink, dedupe, snapshots, scrubbing and > > other background filesystem maintenance operations, etc can all > > change the extent layout asynchronously if there's no mechanism to > > tell the fs not to modify the inode extent layout. > > But that's my whole point. The kernel doesn't really need to prevent > all these background maintenance operations -- it just needs to block > .page_mkwrite until they are synced. I think that whatever new > mechanism we add for this should be sticky, but I see no reason why > the filesystem should have to block reflink on a DAX file entirely. I understand the problem quite well, thank you very much. Yes, COW operations (and other things) can be handled by invalidating DAX mappings and blocking new page faults. I see little difference between this and running the sync path after page-mkwrite has triggered filesystem metadata changes (e.g. block allocation). i.e. If MAP_SYNC is going to be used, then all the things you are talking about comes along for the ride via invalidations. The MAP_SYNC proposal is effectively "run the metadata side of fdatasync() on every page fault". If the inode is not metadata dirty, then it will do nothing, otherwise it will do what it needs to stabilise the inode for userspace to be able to sync the data and it will block until it is done. Prediction for the MAP_SYNC future: frequent bug reports about huge, unpredictable page fault latencies on DAX files because every so often a page fault is required to sync tens of thousands of unrelated dirty objects because of filesystem journal ordering constraints.... Cheers, Dave.
On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote: <> > Fourth, the VFS entry points for things like read, write, truncate, > utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a > file, so that the block map cannot be modified. mmap is still allowed, > as we've discussed. /Maybe/ we can allow fallocate to extend a file > with zeroed extents (it will be slow) as I've heard murmurs about > wanting to be able to extend a file, maybe not. Read and write should still be allowed, right?
On Tue, Jun 20, 2017 at 3:11 AM, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: >> On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner <david@fromorbit.com> wrote: >> > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote: >> >> Second: syncing extents. Here's a straw man. Forget the mmap() flag. >> >> Instead add a new msync() operation: >> >> >> >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE); >> > >> > How's this any different from the fallocate command I proposed to do >> > this (apart from the fact that msync() is not intended to be abused >> > as a file preallocation/zeroing interface)? >> >> I must have missed that suggestion. >> >> But it's different in a major way. fallocate() takes an fd parameter, >> which means that, if some flag gets set, it's set on the struct file. > > DAX is a property of the inode, not the VMA or struct file as it > needs to be consistent across all VMAs and struct files that > reference that inode. Also, fallocate() manipulates state and > metadata hidden behind the struct inode, not the struct file, so it > seems to me like the right API to use. I'm not sure I see why. I can think of a few different scenarios: - Reflink while a DAX-using program is running. It would be nice for performance, but not critical for functionality, if trying to reflink a file that's mapped for DAX would copy instead of COWing. But breaking COW on the next page_mkwrite would work, too. A per-inode count of the number of live DAX mappings or of the number of struct file instances that have requested DAX would work here. - Trying to use DAX on a file that is already reflinked. The order of operations doesn't matter hugely, except that breaking COW for the entire range in question all at once would be faster and result in better allocation. - Checksumming filesystems. I think it's basically impossible to do DAX writes to a file like this. A filesystem could skip checksumming on extents that are actively mapped for DAX, but something like chattr to tell, say, btrfs that a given file is intended for DAX is probably better. (But, if so, it should presumably be persistent like chattr and not something that's purely in memory like daxctl().) - Defrag and such on an otherwise simple filesystem. Defragging a file while it's not actively mapped for DAX should be allowed. Defragging a file while it is actively mapped for DAX may be slow, but it a filesystem could certainly make it work. - RAID. Not going to work. > > And, as mmap() requires a fd to set up the mapping and fallocate() > would have to be run *before* mmap() is used to access the data > directly, I don't see why using fallocate would be a problem here... Fair enough. But ISTM fallocate() has no business setting up important state in the struct file. It's an operation on inodes. Looking at the above scenarios, it seems to may that two or three separate mechanisms may be ideal here. 1. The most important: some way to tell the kernel that an open file description or an mmap or some subrange thereof is going to be used for DAX writes and for the kernel to respond by saying "yes, okay" or "no, not okay". This should be 100% reliable, which means that all the corner cases have to work. This means that, if one task says "make this file work for DAX" and another task extends the file using truncate (without calling fallocate), then whatever the kernel promised to the first task should remain true. 2. Some way to tell filesystems like btrfs to make a file that will be DAX-able. chattr +C might already fit the bill. Without this, I'd expect the normal incantation to DAX-map a file on btrfs to return an error. 3. (Not strictly related to DAX.) A way to tell the kernel "I have this file mmapped for write. Please go out of your way to avoid future page faults." I've wanted this for ordinary files on ext4. The kernel could, but presently does not, use hardware dirty tracking instead of software dirty tracking to decide when to write the page back. The kernel could also, in principle, write back dirty pages without ever write protecting them. For DAX, this might change behavior to prevent any operation that would relocate blocks or to have the kernel go out of its way to only do such operations when absolutely necessary and to immediately update and unwriteprotect the relevant pages. (3) is optional and could be delayed to the distant future. It's not needed for correctness. I really don't want to see a scenario where DAX works if you use some fancy special-purpose library exactly as intended but occasionally eats your data if you don't use the library exactly as intended. Getting a valid DAX mapping, writing to it, and doing CLFLUSHOPT;SFENCE must make that write durable no matter what (unless the underlying hardware actually fails). > The MAP_SYNC proposal is effectively "run the metadata side of > fdatasync() on every page fault". If the inode is not metadata > dirty, then it will do nothing, otherwise it will do what it needs > to stabilise the inode for userspace to be able to sync the data and > it will block until it is done. > > Prediction for the MAP_SYNC future: frequent bug reports about huge, > unpredictable page fault latencies on DAX files because every so > often a page fault is required to sync tens of thousands of > unrelated dirty objects because of filesystem journal ordering > constraints.... Is this really so bad? Someone might ask for relaxed journal ordering or they might switch to a different filesystem. IIRC some filesystems (ZFS?) have explicit support for this use case. I suspect that users saying "your filesystem is slower than I'd like for such-and-such use case" isn't all that rare.
On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig <hch@lst.de> wrote: > [stripped giant fullquotes] > > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: >> But that's my whole point. The kernel doesn't really need to prevent >> all these background maintenance operations -- it just needs to block >> .page_mkwrite until they are synced. I think that whatever new >> mechanism we add for this should be sticky, but I see no reason why >> the filesystem should have to block reflink on a DAX file entirely. > > Agreed - IFF we want to support write through semantics this is the > only somewhat feasible way. It still has massive downsides of forcing > the full sync machinery to run from the page fauly handler, which > I'm rather scared off, but that's still better than creating a magic > special case that isn't managable at all. An immutable-extent DAX-file and a reflink-capable DAX-file are not mutually exclusive, and I have yet to hear a need for reflink support without fsync/msync. Instead I have heard the need for an immutable file for RDMA purposes, especially for hardware that can't trigger an mmu fault. The special management of an immutable file is acceptable to get these capabilities.
On Tue, Jun 20, 2017 at 9:17 AM, Dan Williams <dan.j.williams@intel.com> wrote: > On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig <hch@lst.de> wrote: >> [stripped giant fullquotes] >> >> On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: >>> But that's my whole point. The kernel doesn't really need to prevent >>> all these background maintenance operations -- it just needs to block >>> .page_mkwrite until they are synced. I think that whatever new >>> mechanism we add for this should be sticky, but I see no reason why >>> the filesystem should have to block reflink on a DAX file entirely. >> >> Agreed - IFF we want to support write through semantics this is the >> only somewhat feasible way. It still has massive downsides of forcing >> the full sync machinery to run from the page fauly handler, which >> I'm rather scared off, but that's still better than creating a magic >> special case that isn't managable at all. > > An immutable-extent DAX-file and a reflink-capable DAX-file are not > mutually exclusive, and I have yet to hear a need for reflink support > without fsync/msync. Instead I have heard the need for an immutable > file for RDMA purposes, especially for hardware that can't trigger an > mmu fault. The special management of an immutable file is acceptable > to get these capabilities. I guess this applies to any user of get_user_pages() on a DAX-mapped file. Hmm.
On Tue, Jun 20, 2017 at 09:17:36AM -0700, Dan Williams wrote: > On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig <hch@lst.de> wrote: > > [stripped giant fullquotes] > > > > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: > >> But that's my whole point. The kernel doesn't really need to prevent > >> all these background maintenance operations -- it just needs to block > >> .page_mkwrite until they are synced. I think that whatever new > >> mechanism we add for this should be sticky, but I see no reason why > >> the filesystem should have to block reflink on a DAX file entirely. > > > > Agreed - IFF we want to support write through semantics this is the > > only somewhat feasible way. It still has massive downsides of forcing > > the full sync machinery to run from the page fauly handler, which > > I'm rather scared off, but that's still better than creating a magic > > special case that isn't managable at all. > > An immutable-extent DAX-file and a reflink-capable DAX-file are not > mutually exclusive, Actually, they are mutually exclusive: when the immutable extent DAX inode is breaking the extent sharing done during the reflink operation, the copy-on-write operation requires allocating and freeing extents on the inode that has immutable extents. Which, if the inode really has immutable extents, cannot be done. That said, if the extent sharing is broken on the other side of the reflink (i.e. the non-immutable inode created by the reflink) then the extent map of the inode with immutable extents will remain unchanged. i.e. there are two sides to this, and if you only see one side you might come to the wrong conclusion. However, we cannot guarantee that no writes occur to the inode with immutable extent maps (especially as the whole point is to allow userspace writes and commits without the kernel being involved), so extent sharing on immutable extent maps cannot be allowed... Cheers, Dave.
On Wed, Jun 21, 2017 at 09:53:46AM +1000, Dave Chinner wrote: > On Tue, Jun 20, 2017 at 09:17:36AM -0700, Dan Williams wrote: > > On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig <hch@lst.de> wrote: > > > [stripped giant fullquotes] > > > > > > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: > > >> But that's my whole point. The kernel doesn't really need to prevent > > >> all these background maintenance operations -- it just needs to block > > >> .page_mkwrite until they are synced. I think that whatever new > > >> mechanism we add for this should be sticky, but I see no reason why > > >> the filesystem should have to block reflink on a DAX file entirely. > > > > > > Agreed - IFF we want to support write through semantics this is the > > > only somewhat feasible way. It still has massive downsides of forcing > > > the full sync machinery to run from the page fauly handler, which > > > I'm rather scared off, but that's still better than creating a magic > > > special case that isn't managable at all. > > > > An immutable-extent DAX-file and a reflink-capable DAX-file are not > > mutually exclusive, > > Actually, they are mutually exclusive: when the immutable extent DAX > inode is breaking the extent sharing done during the reflink > operation, the copy-on-write operation requires allocating and > freeing extents on the inode that has immutable extents. Which, if > the inode really has immutable extents, cannot be done. > > That said, if the extent sharing is broken on the other side of the > reflink (i.e. the non-immutable inode created by the reflink) then > the extent map of the inode with immutable extents will remain > unchanged. i.e. there are two sides to this, and if you only see one > side you might come to the wrong conclusion. > > However, we cannot guarantee that no writes occur to the inode with > immutable extent maps (especially as the whole point is to allow > userspace writes and commits without the kernel being involved), so > extent sharing on immutable extent maps cannot be allowed... Just to play devil's advocate... /If/ you have rmap and /if/ you discover that there's only one IOMAP_IMMUTABLE file owning this same block and /if/ you're willing to relocate every other mapping on the whole filesystem, /then/ you could /in theory/ support shared daxfiles. However, that's so many on-disk metadata lookups to shove into a pagefault handler that I don't think anyone in XFSland would entertain such an ugly fantasy. You'd be making a lot of metadata requests, and you'd have to lock the rmapbt while grabbing inodes, which is insane. Much easier to have a per-inode flag that says "the block map of this file does not change" and put up with the restricted semantics. --D > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Tue, Jun 20, 2017 at 09:14:24AM -0700, Andy Lutomirski wrote: > On Tue, Jun 20, 2017 at 3:11 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote: > >> On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner <david@fromorbit.com> wrote: > >> > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote: > >> >> Second: syncing extents. Here's a straw man. Forget the mmap() flag. > >> >> Instead add a new msync() operation: > >> >> > >> >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE); > >> > > >> > How's this any different from the fallocate command I proposed to do > >> > this (apart from the fact that msync() is not intended to be abused > >> > as a file preallocation/zeroing interface)? > >> > >> I must have missed that suggestion. > >> > >> But it's different in a major way. fallocate() takes an fd parameter, > >> which means that, if some flag gets set, it's set on the struct file. > > > > DAX is a property of the inode, not the VMA or struct file as it > > needs to be consistent across all VMAs and struct files that > > reference that inode. Also, fallocate() manipulates state and > > metadata hidden behind the struct inode, not the struct file, so it > > seems to me like the right API to use. > > I'm not sure I see why. I can think of a few different scenarios: > > - Reflink while a DAX-using program is running. It would be nice for > performance, but not critical for functionality, if trying to reflink > a file that's mapped for DAX would copy instead of COWing. But > breaking COW on the next page_mkwrite would work, too. Your mangling terminology here. We don't "break COW" - we *use* copy-on-write to break *extent sharing*. We can break extent sharing in page_mkwrite - that's exactly what we do for normal pagecache based mmap writes, and it's done in page_mkwrite. It hasn't been enabled it for DAX yet because it simply hasn't been robustly tested yet. > A per-inode > count of the number of live DAX mappings or of the number of struct > file instances that have requested DAX would work here. For what purpose does this serve? The reflink invalidates all the existing mappings, so the next write access causes a fault and then page_mkwrite is called and the shared extent will get COWed.... > - Trying to use DAX on a file that is already reflinked. The order > of operations doesn't matter hugely, except that breaking COW for the > entire range in question all at once would be faster and result in > better allocation. We have COW extent size hints for that. i.e. if you want to COW a huge page at a time, set the COW extent size hint to the huge page size... > - Checksumming filesystems. I think it's basically impossible to do > DAX writes to a file like this. No shit, sherlock. See my previous comments about compression and encryption. DAX cannot be used where in-place data manipulations would be required by the IO path. > A filesystem could skip checksumming > on extents that are actively mapped for DAX, but something like chattr > to tell, say, btrfs that a given file is intended for DAX is probably > better. (But, if so, it should presumably be persistent like chattr > and not something that's purely in memory like daxctl().) You can already do this on btrfs regardless of DAX. There's an inode flag: #define BTRFS_INODE_NODATASUM (1 << 0) And you set it by turning off copy-on-write when the file is empty. i.e. just after creation, run: ioctl(FS_IOC_GETFLAGS, &flags). flags |= FS_NOCOW_FL; ioctl(FS_IOC_SETFLAGS, &flags). You're talking about stuff that already exists - stop trying to tell me about basic filesystem functionality and DAX requirements that I understood years ago. Please start with the assumption that I know a lot more about this than you do, and if there's something you write that I don't understand I'll ask you to explain.... > - Defrag and such on an otherwise simple filesystem. Defragging a > file while it's not actively mapped for DAX should be allowed. > Defragging a file while it is actively mapped for DAX may be slow, but > it a filesystem could certainly make it work. FYI, XFS already does this - it skips mapped files altogether, so DAX state simply doesn't matter here. We also have the FS_NODEFRAG ioctl flag, so admins can choose to mark files that they want defrag to skip... > > And, as mmap() requires a fd to set up the mapping and fallocate() > > would have to be run *before* mmap() is used to access the data > > directly, I don't see why using fallocate would be a problem here... > > Fair enough. But ISTM fallocate() has no business setting up > important state in the struct file. It's an operation on inodes. Yup, the state in question is kept in inode->i_flags rather than duplicated into the struct file. Which leads to code like this in the page fault handlers. e.g. in xfs_filemap_page_mkwrite(): struct inode *inode = file_inode(vma->vm_file); .... if (IS_DAX(inode)) { ..... Yeah, the page fault behaviour is determined by state kept on the inode, not the struct file or vma.... > Looking at the above scenarios, it seems to may that two or three > separate mechanisms may be ideal here. > > 1. The most important: some way to tell the kernel that an open file > description or an mmap or some subrange thereof is going to be used > for DAX writes and for the kernel to respond by saying "yes, okay" or > "no, not okay". A file can only be accessed by DAX or through the page cache at a point in time - it can't do both because the mapping infrastructure (e.g. the radix tree) can only handle one or the other at any point in time. So, you want to change the DAX access mode of an open file? We have that on XFS. Turn on DAX: ioctl(FS_IOC_FSGETXATTR, &flags). flags |= FS_XFLAG_DAX; ioctl(FS_IOC_FSSETXATTR, &flags). This will throw an error if the kernel and/or fs does not support DAX. The fs will lock out page faults on the inode, invalidate all the existing mappings, remove the cached pages and switch to DAX mode. Turn off DAX: ioctl(FS_IOC_FSGETXATTR, &flags). flags &= ~FS_XFLAG_DAX; ioctl(FS_IOC_FSSETXATTR, &flags). And the filesystem will lock out page faults, invalidate all the DAX mappings and switch to cached mode... > 2. Some way to tell filesystems like btrfs to make a file that will be > DAX-able. See above. > > Prediction for the MAP_SYNC future: frequent bug reports about > > huge, unpredictable page fault latencies on DAX files because > > every so often a page fault is required to sync tens of > > thousands of unrelated dirty objects because of filesystem > > journal ordering constraints.... > > Is this really so bad? Apparently it is. There are people telling us that mtime updates in page faults introduce too much unpredictable latency and that screws over their low latency real time applications. Those same people are telling use that dirty tracking in page faults for msync/fsync on DAX is too heavyweight and calling msync is too onerous and has unpredictable latencies because it might result in having to sync tens of thousands of unrelated dirty objects. Hence they want to use userspace data sync primitives to avoid this overhead and so filesystems need to make it possible to provide this userspace idata sync capability. So, we came up with a method that removed all overhead and unpredictability from the page fault path, and now we're being told that calling fallocate() is too hard and difficult and the restrictions that prevent all the page fault overhead (operations which you won't want to be doing for low-latency apps in the first place) are too onerous. And so the proposed solution is an API that requires the filesystem to *run fdatasync in every write page fault*? Can you see the contradictions in the requirements here? On one hand we're being told page faults have to be low overhead and have predictable latency, but when presented with a solution we're told you want page faults to be blocked arbitrarily and for unbound lengths of time. Put simply: I don't care what gets implemented. What I care about is that everyone understands that we're being given contradictory requirements and that neither of the proposed solutions solves both sets of requirements. > I suspect that users saying "your filesystem is slower than I'd > like for such-and-such use case" isn't all that rare. Except the people asking for userspace data sync are asking for it for *performance reasons*. Making that explicit case *slower* is the exact opposite of what we've been asked to provide.... Cheers, Dave.
On Tue, Jun 20, 2017 at 06:24:03PM -0700, Darrick J. Wong wrote: > On Wed, Jun 21, 2017 at 09:53:46AM +1000, Dave Chinner wrote: > > On Tue, Jun 20, 2017 at 09:17:36AM -0700, Dan Williams wrote: > > > An immutable-extent DAX-file and a reflink-capable DAX-file are not > > > mutually exclusive, > > > > Actually, they are mutually exclusive: when the immutable extent DAX > > inode is breaking the extent sharing done during the reflink > > operation, the copy-on-write operation requires allocating and > > freeing extents on the inode that has immutable extents. Which, if > > the inode really has immutable extents, cannot be done. > > > > That said, if the extent sharing is broken on the other side of the > > reflink (i.e. the non-immutable inode created by the reflink) then > > the extent map of the inode with immutable extents will remain > > unchanged. i.e. there are two sides to this, and if you only see one > > side you might come to the wrong conclusion. > > > > However, we cannot guarantee that no writes occur to the inode with > > immutable extent maps (especially as the whole point is to allow > > userspace writes and commits without the kernel being involved), so > > extent sharing on immutable extent maps cannot be allowed... > > Just to play devil's advocate... > > /If/ you have rmap and /if/ you discover that there's only one > IOMAP_IMMUTABLE file owning this same block and /if/ you're willing to > relocate every other mapping on the whole filesystem, /then/ you could > /in theory/ support shared daxfiles. I figured that nobody apart from experienced filesystem developers would understand the complexities of rmap and refcounts and how they could be abused to do this. I also assumed that that people like you would understand this is possible but completely impractical.... > However, that's so many on-disk metadata lookups to shove into a > pagefault handler that I don't think anyone in XFSland would entertain > such an ugly fantasy. You'd be making a lot of metadata requests, and > you'd have to lock the rmapbt while grabbing inodes, which is insane. Exactly. But while I understand this, consider the amount of assumed filesystem and XFS knowledge in that one simple paragraph. Most non-experts would have stopped *understanding* at "/If/ you have rmap" and go away with the wrong ideas in their heads. Hence I now tend to omit mentioning "possible but impractical" things in mixed expertise discussions.... > Much easier to have a per-inode flag that says "the block map of this > file does not change" and put up with the restricted semantics. In a nutshell. Cheers, Dave.
On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner <david@fromorbit.com> wrote: > Your mangling terminology here. We don't "break COW" - we *use* > copy-on-write to break *extent sharing*. We can break extent sharing > in page_mkwrite - that's exactly what we do for normal pagecache > based mmap writes, and it's done in page_mkwrite. Right, my bad. > > It hasn't been enabled it for DAX yet because it simply hasn't been > robustly tested yet. > >> A per-inode >> count of the number of live DAX mappings or of the number of struct >> file instances that have requested DAX would work here. > > For what purpose does this serve? The reflink invalidates all the > existing mappings, so the next write access causes a fault and then > page_mkwrite is called and the shared extent will get COWed.... The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it right), except that IMO an API that doesn't involve making a change to an inode that sticks around would be nice. The inode flag has the unfortunate property that, if two different programs each try to set the flag, mmap, write, and clear the flag, they'll stomp on each other and risk data corruption. I admit I'm now thoroughly confused as to exactly what XFS does here -- does FS_XFLAG_DAX persist across unmount/mount? > >> - Trying to use DAX on a file that is already reflinked. The order >> of operations doesn't matter hugely, except that breaking COW for the >> entire range in question all at once would be faster and result in >> better allocation. > > We have COW extent size hints for that. i.e. if you want to COW a > huge page at a time, set the COW extent size hint to the huge page > size... Nifty. > Apparently it is. There are people telling us that mtime > updates in page faults introduce too much unpredictable latency and > that screws over their low latency real time applications. I was one of those, and I even wrote patches. I should try to dust them off. > > Those same people are telling use that dirty tracking in page faults > for msync/fsync on DAX is too heavyweight and calling msync is too > onerous and has unpredictable latencies because it might result in > having to sync tens of thousands of unrelated dirty objects. Hence > they want to use userspace data sync primitives to avoid this > overhead and so filesystems need to make it possible to provide this > userspace idata sync capability. If I were using DAX in production, I'd have exactly this issue. Let me quote myself: On Tue, Jun 20, 2017 at 9:14 AM, Andy Lutomirski <luto@kernel.org> wrote: > 3. (Not strictly related to DAX.) A way to tell the kernel "I have > this file mmapped for write. Please go out of your way to avoid > future page faults." I've wanted this for ordinary files on ext4. > The kernel could, but presently does not, use hardware dirty tracking > instead of software dirty tracking to decide when to write the page > back. The kernel could also, in principle, write back dirty pages > without ever write protecting them. For DAX, this might change > behavior to prevent any operation that would relocate blocks or to > have the kernel go out of its way to only do such operations when > absolutely necessary and to immediately update and unwriteprotect the > relevant pages. I agree that this is a real issue, but it's not limited to DAX. I've wanted a mode where I tell the kernel "I'm a high-performance application mmapping this file and I'm going to write to it a lot. Do your best to avoid any page faults, even if it adversely affects the performance of the system." This mode could do lots of things. It could cause the system to leave the page writable even after writeback and, if possible, to use hardware dirty tracking. It could cause the system to make a copy of the page and write back from the copy if there is anything in play that could need stable pages during writeback. And, for DAX, it could tell the system to keep the page pinned and disallow moving it and reflinking it. (Of course, the above requires that we either deal with mtime like my patches do or that this heavyweight mechanism disable mtime updates. I prefer the former.) Here's the overall point I'm trying to make: unprivileged programs that want to write to DAX files with userspace commit mechanisms (CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without privilege, and with reasonably clean APIs. Ideally they could do this to any file they have write access to. Programs that want to write to mmapped files, DAX or otherwise, without latency spikes due to .page_mkwrite should be able to opt in to a heavier weight mechanism. But these two issues are someone independent, and I think they should be solved separately.
On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote: > [add linux-xfs to the fray] > > On Fri, Jun 16, 2017 at 06:15:35PM -0700, Dan Williams wrote: > > + spin_lock(&dax_lock); > > + list_add(&d->list, &daxfiles); > > + spin_unlock(&dax_lock); > > + > > + /* > > + * We set S_SWAPFILE to gain "no truncate" / static block > > + * allocation semantics, and S_DAXFILE so we can differentiate > > + * traditional swapfiles and assume static block mappings in the > > + * dax mmap path. > > + */ > > + inode->i_flags |= S_SWAPFILE | S_DAXFILE; > > Yikes. You know, I hadn't even thought about considering swap files as > a subcase of files with immutable block maps, but here we are. Both > swap files and DAX require absolutely stable block mappings, they are > both (probably) intolerant of inode metadata changes (size, mtime, etc.) Swap files are intolerant of any metadata changes because once the mapping has been sucked into the swapfile code, the inode is never looked at again. DAX file data access always goes through the inode, so they is much more tolerant of metadata changes given certain constraints. <snip bmap rant> > Honestly, I realize we've gone back, forth, and around all over the > place on this. I still prefer something similar to a permanent flag, > similar to what Dave suggested, though I hate the name PMEM_IMMUTABLE > and some of the semantics. > > First, a new inode flag S_IOMAP_FROZEN that means the file's block map > cannot change. I've been calling it "immutable extents" - freezing has implications that it's only temporary (i.e. freezing filesystems) and will be followed shortly by a thaw. That isn't the case here - we truly want the extent/block map to be immutable.... > Second, some kind of function to toggle the S_IOMAP_FROZEN flag. > Turning it on will lock the inode, check the extent map for holes, > shared, or unwritten bits, and bail out if it finds any, or set the > flag. Hmmm, I disagree on the unwritten state here. We want swap files to be able to use unwritten extents - it means we can preallocate the swap file and hand it straight to swapon without having to zero it (i.e. no I/O needed to demand allocate more swap space when memory is very low). Also, anyone who tries to read the swap file from userspace will be reading unwritten extents, which will always return zeros rather than whatever is in the swap file... > Not sure if we should require CAP_LINUX_IMMUTABLE -- probably > yes, at least at first. I don't currently have any objection to writing > non-iomap inode metadata out to disk. > > Third, the flag can only be cleared if the file isn't mapped. How do we check this from the fs without racing? AFAICT we can't prevent a concurrent map operation from occurring while we are changing the state of the inode - we can only block page faults after then inode is mapped.... > Fourth, the VFS entry points for things like read, write, truncate, > utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a > file, so that the block map cannot be modified. > mmap is still allowed, > as we've discussed. /Maybe/ we can allow fallocate to extend a file > with zeroed extents (it will be slow) as I've heard murmurs about > wanting to be able to extend a file, maybe not. read is fine, write should be fine as long as the iomap call can error out operations that would require extent map modifications. fallocate should be allowed to modify the extent map, too, because it should be the mechanism used be applications to set up file extents in the correct form for applications to use as immutable (i.e. lock out page faults, allocate, zero, extend and fsync in one atomic operation).... > Fifth, swapfiles now require the S_IOMAP_FROZEN flag since they want > stable iomap but probably don't care about things like mtime. Maybe > they can call iomap too. > > Sixth, XFS can record the S_IOMAP_FROZEN state in di_flags2 and set it > whenever the in-core inode gets constructed. This enables us to > prohibit reflinking and other such undesirable activity. *nod* > If we actually /do/ come up with a reference implementation for XFS, I'd > be ok with tacking it on the end of my dev branch, which will give us a > loooong runway to try it out. The end of the dev branch is beyond > online XFS fsck and repair and the "root metadata btrees in inodes" > rework; since that's ~90 patches with my name on it that I cannot also > review, it won't go in for a long time indeed! I don't think it's so complex to need such a long dev time - all the infrastructure we need is pretty much there already... > (Yes, that was also sort of a plea for someone to go review the XFS > scrub patches.) > > > + return 0; > > +} > > + > > +SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, align) > > I was /about/ to grouse about this syscall, then realized that maybe it > /is/ useful to be able to check a specific alignment. Maybe not, since > I had something more permanent in mind anyway. In any case, just pass > in an opened fd if this sticks around. We can do all that via fallocate(), too... Cheers, Dave.
On Tue, Jun 20, 2017 at 10:18:24PM -0700, Andy Lutomirski wrote: > On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner <david@fromorbit.com> wrote: > >> A per-inode > >> count of the number of live DAX mappings or of the number of struct > >> file instances that have requested DAX would work here. > > > > For what purpose does this serve? The reflink invalidates all the > > existing mappings, so the next write access causes a fault and then > > page_mkwrite is called and the shared extent will get COWed.... > > The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it > right), except that IMO an API that doesn't involve making a change to > an inode that sticks around would be nice. The inode flag has the > unfortunate property that, if two different programs each try to set > the flag, mmap, write, and clear the flag, they'll stomp on each other > and risk data corruption. > > I admit I'm now thoroughly confused as to exactly what XFS does here > -- does FS_XFLAG_DAX persist across unmount/mount? Yes, it is. i.e. DAX on XFS does not rely on a naive fs-wide mount option. You can have applications on pmem filesystems use either DAX or normal IO based on directory/inode flags. Something doesn't work with DAX, so just remove the DAX flags from the directories/inodes, and it will safely and transparently switch to page-cache based IO. <snip> > Here's the overall point I'm trying to make: unprivileged programs > that want to write to DAX files with userspace commit mechanisms > (CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without > privilege, and with reasonably clean APIs. Ideally they could do this > to any file they have write access to. The privilege argument is irrelevant now - it was /suggested/ initially as a way of preventing people from shooting themselves in the foot based on the immutable file model. It's clear that's not desired, and it's not a show stopper. > Programs that want to write to > mmapped files, DAX or otherwise, without latency spikes due to > .page_mkwrite should be able to opt in to a heavier weight mechanism. > But these two issues are someone independent, and I think they should > be solved separately. You seem to be calling the "fdatasync on every page fault" the "lightweight" option. That's the brute-force-with-big-hammer solution - it's most definitely not lightweight as every page fault has extra overhead to call ->fsync(). Sure, the API is simple, but the runtime overhead is significant. The lightweight *runtime* option is to set up the file in such a way that there is never any extra overhead at page fault time. This is what immutable extent maps provide. Indeed, because the mappings never change, you could use hardware dirty tracking if you wanted, as there's no need to look up the filesystem to do writeback as everything needed for writeback was mapped at page fault time. This "map first and then just write when you need to" is *exactly how swap files work*. Even if you are considering the complexity of the APIs, it's hardly a "heavyweight" when it only requires a single call to fallocate() before mmap() to set up the immutable extents on the file... Cheers, Dave.
On Wed, Jun 21, 2017 at 5:02 PM, Dave Chinner <david@fromorbit.com> wrote: > > You seem to be calling the "fdatasync on every page fault" the It's the opposite of fdatasync(). It needs to sync whatever metadata is needed to find the data. The data doesn't need to be synced. > "lightweight" option. That's the brute-force-with-big-hammer > solution - it's most definitely not lightweight as every page fault > has extra overhead to call ->fsync(). Sure, the API is simple, but > the runtime overhead is significant. It's lightweight in terms of its impact on the filesystem. It doesn't need any persistent setup -- you can just use it. > Even if you are considering the complexity of the APIs, it's hardly > a "heavyweight" when it only requires a single call to fallocate() > before mmap() to set up the immutable extents on the file... So what would the exact semantics be? In particular, how can it fail? If I do the fallocate(), is it absolutely promised that the extent map won't get out of sync between what mmap sees and what's on disk? Do user programs need to worry about colliding with each other when one does fallocate() to DAXify a file and the other does fallocate() to unDAXify a file? Does this particular fallocate() call still keep its effect after a reboot? These issues are why I think it would be nicer to have an API that makes a particular mapping or fd be unconditionally *correct* and then to provide something else that makes it avoid latency spikes. Is there an actual concrete proposal that's reviewable?
On Tue, Jun 20, 2017 at 09:42:55AM -0600, Ross Zwisler wrote: > On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote: > <> > > Fourth, the VFS entry points for things like read, write, truncate, > > utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a > > file, so that the block map cannot be modified. mmap is still allowed, > > as we've discussed. /Maybe/ we can allow fallocate to extend a file > > with zeroed extents (it will be slow) as I've heard murmurs about > > wanting to be able to extend a file, maybe not. > > Read and write should still be allowed, right? <shrug> I had thought the usage model was pretty slanted towards mmap, but it's not a big deal to turn read/writes into glorified memcpy, provided we reject the io request if it goes past EOF. --D > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jun 22, 2017 at 09:37:14AM +1000, Dave Chinner wrote: > On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote: > > [add linux-xfs to the fray] > > > > On Fri, Jun 16, 2017 at 06:15:35PM -0700, Dan Williams wrote: > > > + spin_lock(&dax_lock); > > > + list_add(&d->list, &daxfiles); > > > + spin_unlock(&dax_lock); > > > + > > > + /* > > > + * We set S_SWAPFILE to gain "no truncate" / static block > > > + * allocation semantics, and S_DAXFILE so we can differentiate > > > + * traditional swapfiles and assume static block mappings in the > > > + * dax mmap path. > > > + */ > > > + inode->i_flags |= S_SWAPFILE | S_DAXFILE; > > > > Yikes. You know, I hadn't even thought about considering swap files as > > a subcase of files with immutable block maps, but here we are. Both > > swap files and DAX require absolutely stable block mappings, they are > > both (probably) intolerant of inode metadata changes (size, mtime, etc.) > > Swap files are intolerant of any metadata changes because once the > mapping has been sucked into the swapfile code, the inode is never > looked at again. DAX file data access always goes through the inode, > so they is much more tolerant of metadata changes given certain > constraints. Fair enough. > <snip bmap rant> > > > Honestly, I realize we've gone back, forth, and around all over the > > place on this. I still prefer something similar to a permanent flag, > > similar to what Dave suggested, though I hate the name PMEM_IMMUTABLE > > and some of the semantics. > > > > First, a new inode flag S_IOMAP_FROZEN that means the file's block map > > cannot change. > > I've been calling it "immutable extents" - freezing has implications > that it's only temporary (i.e. freezing filesystems) and will be > followed shortly by a thaw. That isn't the case here - we truly want > the extent/block map to be immutable.... <nod> S_IOMAP_IMMUTABLE it is, then. > > Second, some kind of function to toggle the S_IOMAP_FROZEN flag. > > Turning it on will lock the inode, check the extent map for holes, > > shared, or unwritten bits, and bail out if it finds any, or set the > > flag. > > Hmmm, I disagree on the unwritten state here. We want swap files to > be able to use unwritten extents - it means we can preallocate the > swap file and hand it straight to swapon without having to zero it > (i.e. no I/O needed to demand allocate more swap space when memory > is very low). Also, anyone who tries to read the swap file from > userspace will be reading unwritten extents, which will always > return zeros rather than whatever is in the swap file... Now I've twisted all the way around to thinking that swap files should be /totally/ unwritten, except for the file header. :) > > Not sure if we should require CAP_LINUX_IMMUTABLE -- probably > > yes, at least at first. I don't currently have any objection to writing > > non-iomap inode metadata out to disk. > > > > Third, the flag can only be cleared if the file isn't mapped. > > How do we check this from the fs without racing? AFAICT we can't > prevent a concurrent map operation from occurring while we are > changing the state of the inode - we can only block page faults > after then inode is mapped.... I'd thought we could coordinate that via xfs_file_mmap, but tbh my brain paged that out a while ago. > > Fourth, the VFS entry points for things like read, write, truncate, > > utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a > > file, so that the block map cannot be modified. > > mmap is still allowed, > > as we've discussed. /Maybe/ we can allow fallocate to extend a file > > with zeroed extents (it will be slow) as I've heard murmurs about > > wanting to be able to extend a file, maybe not. > > read is fine, write should be fine as long as the iomap call can > error out operations that would require extent map modifications. Ok. > fallocate should be allowed to modify the extent map, too, because > it should be the mechanism used be applications to set up file > extents in the correct form for applications to use as immutable > (i.e. lock out page faults, allocate, zero, extend and fsync in > one atomic operation).... <nod> > > Fifth, swapfiles now require the S_IOMAP_FROZEN flag since they want > > stable iomap but probably don't care about things like mtime. Maybe > > they can call iomap too. > > > > Sixth, XFS can record the S_IOMAP_FROZEN state in di_flags2 and set it > > whenever the in-core inode gets constructed. This enables us to > > prohibit reflinking and other such undesirable activity. > > *nod* > > > If we actually /do/ come up with a reference implementation for XFS, I'd > > be ok with tacking it on the end of my dev branch, which will give us a > > loooong runway to try it out. The end of the dev branch is beyond > > online XFS fsck and repair and the "root metadata btrees in inodes" > > rework; since that's ~90 patches with my name on it that I cannot also > > review, it won't go in for a long time indeed! > > I don't think it's so complex to need such a long dev time - > all the infrastructure we need is pretty much there already... It definitely is, but we've been bikeshedding so long it now has momentum. :) That said, it (and having a go at dax+reflink) are things that I'd like to look at (after pushing the scrub stuff) for the rest of the year. > > (Yes, that was also sort of a plea for someone to go review the XFS > > scrub patches.) > > > > > + return 0; > > > +} > > > + > > > +SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, align) > > > > I was /about/ to grouse about this syscall, then realized that maybe it > > /is/ useful to be able to check a specific alignment. Maybe not, since > > I had something more permanent in mind anyway. In any case, just pass > > in an opened fd if this sticks around. > > We can do all that via fallocate(), too... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 21, 2017 at 09:07:57PM -0700, Andy Lutomirski wrote: > On Wed, Jun 21, 2017 at 5:02 PM, Dave Chinner <david@fromorbit.com> wrote: > > > > You seem to be calling the "fdatasync on every page fault" the > > It's the opposite of fdatasync(). It needs to sync whatever metadata > is needed to find the data. The data doesn't need to be synced. So much wrong with that statement. Andy, what does fdatasync() do when you have a data-clean, metadata-dirty file (e.g. you just punched a hole or preallocated more space via fallocate())? Hint: it doesn't sync any data because the mapping tree is clean, but it still syncs the dirty metadata needed to access the data. Now, what does a file where we do direct IO writes look like? Yup, the mapping tree always remains clean and so it's only ever going to appear to the kernel as a *data-clean, metadata-dirty* file. So, after a direct IO write is done, what operation do we need to run to ensure that we can always access the data? Yup, it's fdatasync(). So, what does a DAX file that does userspace data flushes look like to the kernel? Yup, again the mapping tree always remains clean and so it's only ever going to be a *data-clean, metadata-dirty* file. It should be clear now why I said "fdatasync on every page fault" because that's exactly the mechanism we'd use to implement this functionality.... It should also be clear that DAX is not introducing any new data integrity problems to the filesystems that direct IO hasn't already introduced. Both DAX with userspace data sync and Direct IO writes are completely untracked by the kernel. IOWs, direct IO is a form of "kernel bypass", just like DAX+userspace data sync is. All that is different is the method by which data is written to the storage media from userspace, which in the case of DAX is via mmap rather than read/write. > > "lightweight" option. That's the brute-force-with-big-hammer > > solution - it's most definitely not lightweight as every page fault > > has extra overhead to call ->fsync(). Sure, the API is simple, but > > the runtime overhead is significant. > > It's lightweight in terms of its impact on the filesystem. It doesn't > need any persistent setup -- you can just use it. Well, no, that's wrong, because we have to co-ordinate multiple concurrent accesses to the data in the kernel. What happens when some other process writes to the file *at the same time* but does not use userspace sync? We aren't tracking dirty regions on the inode mapping because we've been told not to do that, so fsync() from that other process *won't sync the data it wrote*. IOws, the kernel has failed to provide the guarantee that userspace wants it to provide. The single mapping tree is central to the problem here - we can't mix modes of dirty tracking across different processes. Either everything uses userspace sync, or everything uses kernel controlled dirty tracking so fsync() works correctly in all cases. Put simply - dirty tracking is a per-inode function, not a per-file or per-vma function. As the direct IO kernel-bypass model demonstrates, as soon as you start considering multi-process data coherency and durability with mixed kernel+kernel bypass methods in play, lots of potential problems and issues crop up that can't easily be solved by the kernel or filesystems. We try to minimise the problems, but we don't guarantee mixed mode coherency (and hence integrity) as we've delegated data coherency and integrity responsibility to the app bypassing the kernel data coherency and integrity mechanisms. What I'd like to avoid is creating another kernel bypass mechanism where we allow coherency and/or integrity to be fucked up in a way that we can't fix without giving up all the performance that the kernel bypass provides userspace apps. Constrain the cases where kernel bypass is allowed, and we avoid all the crappy corner cases where our only answer to users with corrupt data is "the man page advises application developers not to do that". If in future we work out how to implement everything without needing immutable extents in the inode, we can relax the restrictions we've placed on userspace DAX data sync.... > > Even if you are considering the complexity of the APIs, it's hardly > > a "heavyweight" when it only requires a single call to fallocate() > > before mmap() to set up the immutable extents on the file... > > So what would the exact semantics be? In particular, how can it fail? > If I do the fallocate(), is it absolutely promised that the extent > map won't get out of sync between what mmap sees and what's on disk? That's precisely the guarantee I documented would be given by immutable extents in my very first proposal. > Do user programs need to worry about colliding with each other when > one does fallocate() to DAXify a file and the other does fallocate() > to unDAXify a file? Yes, it can. This was one of the reasons for putting it under privilege - so only the app has full control of the extent map changes and nobody else can fuck with it. > Does this particular fallocate() call still keep > its effect after a reboot? Yes, it does, because it has to be transparent and behave consistently with all of userspace, not just the app that owns the file, and not just while that app is running. (e.g. defrag could be running on the file before the app starts, and then you're screwed when defrag modifies the extent map after app startup...) > Is there an actual concrete proposal that's reviewable? Yes, the first posting where I proposed this functionality many months ago spelled this all out in detail. Cheers, Dave.
On Thu, Jun 22, 2017 at 5:52 PM, Dave Chinner <david@fromorbit.com> wrote: > On Wed, Jun 21, 2017 at 09:07:57PM -0700, Andy Lutomirski wrote: >> On Wed, Jun 21, 2017 at 5:02 PM, Dave Chinner <david@fromorbit.com> wrote: >> > >> > You seem to be calling the "fdatasync on every page fault" the >> >> It's the opposite of fdatasync(). It needs to sync whatever metadata >> is needed to find the data. The data doesn't need to be synced. > > So much wrong with that statement. > > Andy, what does fdatasync() do when you have a data-clean, > metadata-dirty file (e.g. you just punched a hole or preallocated > more space via fallocate())? Hint: it doesn't sync any data > because the mapping tree is clean, but it still syncs the dirty > metadata needed to access the data. > > Now, what does a file where we do direct IO writes look like? Yup, > the mapping tree always remains clean and so it's only ever going to > appear to the kernel as a *data-clean, metadata-dirty* file. So, > after a direct IO write is done, what operation do we need to run to > ensure that we can always access the data? > > Yup, it's fdatasync(). Fair enough. Except that fdatasync() goes through dax_writeback_one() (I think), which deals with cache flushes (via wb_cache_pmem()). This special type of sync shouldn't need to do that, so it's not really quite fdatasync(). >> > "lightweight" option. That's the brute-force-with-big-hammer >> > solution - it's most definitely not lightweight as every page fault >> > has extra overhead to call ->fsync(). Sure, the API is simple, but >> > the runtime overhead is significant. >> >> It's lightweight in terms of its impact on the filesystem. It doesn't >> need any persistent setup -- you can just use it. > > Well, no, that's wrong, because we have to co-ordinate multiple > concurrent accesses to the data in the kernel. What happens when > some other process writes to the file *at the same time* but does > not use userspace sync? We aren't tracking dirty regions on the > inode mapping because we've been told not to do that, so fsync() > from that other process *won't sync the data it wrote*. IOws, the > kernel has failed to provide the guarantee that userspace wants it > to provide. ... > What I'd like to avoid is creating another kernel bypass mechanism > where we allow coherency and/or integrity to be fucked up in a way that > we can't fix without giving up all the performance that the kernel > bypass provides userspace apps. Constrain the cases where kernel > bypass is allowed, and we avoid all the crappy corner cases where > our only answer to users with corrupt data is "the man page advises > application developers not to do that". Ah, I see, a DAX file makes regular write() flush out the cache automatically. But I think the situation may be fucked up integrity-wise anyway. If you make an immutable-extent DAX file and a DAX-unaware process mmaps() it and writes to the mapping, what flushes the CPU cache? Isn't part of the point of the magic immutable-extent mode that it wouldn't have to track dirty extents? Can it keep track of which mappings are DAX-aware (via an mmap() flag, I assume)? Would all mappings of a DAX immutable-extent file be forced to be uncached (or writethrough or WC or some type that allows fsync to be fast)? Can you send a link to your fallocate email? I'm having trouble finding it.
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 5aef183e2f85..795eb93d6beb 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -339,6 +339,7 @@ 330 common pkey_alloc sys_pkey_alloc 331 common pkey_free sys_pkey_free 332 common statx sys_statx +333 64 daxctl sys_daxctl # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/dax.h b/include/linux/dax.h index 5ec1f6c47716..5f1d0e0ed30f 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -4,8 +4,17 @@ #include <linux/fs.h> #include <linux/mm.h> #include <linux/radix-tree.h> +#include <uapi/linux/dax.h> #include <asm/pgtable.h> +/* + * TODO: make sys_daxctl() be the generic interface for toggling S_DAX + * across filesystems. For now, mark DAXCTL_F_DAX as an invalid flag + */ +#define DAXCTL_VALID_FLAGS (DAXCTL_F_GET | DAXCTL_F_STATIC) + +int daxfile_activate(struct file *daxfile, unsigned align); + struct iomap_ops; struct dax_device; struct dax_operations { diff --git a/include/linux/fs.h b/include/linux/fs.h index 3e68cabb8457..3af649fb669f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1824,8 +1824,10 @@ struct super_operations { #define S_NOSEC 4096 /* no suid or xattr security attributes */ #ifdef CONFIG_FS_DAX #define S_DAX 8192 /* Direct Access, avoiding the page cache */ +#define S_DAXFILE 16384 /* no truncate (swapfile) semantics + dax */ #else #define S_DAX 0 /* Make all the DAX code disappear */ +#define S_DAXFILE 0 #endif /* @@ -1865,6 +1867,7 @@ struct super_operations { #define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT) #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) #define IS_DAX(inode) ((inode)->i_flags & S_DAX) +#define IS_DAXFILE(inode) ((inode)->i_flags & S_DAXFILE) #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \ (inode)->i_rdev == WHITEOUT_DEV) diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 980c3c9b06f8..49e5cc4c192e 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -701,6 +701,7 @@ asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5); asmlinkage long sys_swapon(const char __user *specialfile, int swap_flags); asmlinkage long sys_swapoff(const char __user *specialfile); +asmlinkage long sys_daxctl(const char __user *path, int flags, int align); asmlinkage long sys_sysctl(struct __sysctl_args __user *args); asmlinkage long sys_sysinfo(struct sysinfo __user *info); asmlinkage long sys_sysfs(int option, diff --git a/include/uapi/linux/dax.h b/include/uapi/linux/dax.h new file mode 100644 index 000000000000..78a41bb392c0 --- /dev/null +++ b/include/uapi/linux/dax.h @@ -0,0 +1,8 @@ +#ifndef _UAPI_LINUX_DAX_H +#define _UAPI_LINUX_DAX_H + +#define DAXCTL_F_GET (1 << 0) +#define DAXCTL_F_DAX (1 << 1) +#define DAXCTL_F_STATIC (1 << 2) + +#endif /* _UAPI_LINUX_DAX_H */ diff --git a/mm/Kconfig b/mm/Kconfig index beb7a455915d..b874565c34eb 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -450,6 +450,11 @@ config TRANSPARENT_HUGE_PAGECACHE def_bool y depends on TRANSPARENT_HUGEPAGE +config DAXFILE + def_bool y + depends on FS_DAX + depends on SWAP + # # UP and nommu archs use km based percpu allocator # diff --git a/mm/Makefile b/mm/Makefile index 026f6a828a50..38d9025a3e37 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -56,6 +56,7 @@ endif obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o +obj-$(CONFIG_DAXFILE) += daxfile.o obj-$(CONFIG_FRONTSWAP) += frontswap.o obj-$(CONFIG_ZSWAP) += zswap.o obj-$(CONFIG_HAS_DMA) += dmapool.o diff --git a/mm/daxfile.c b/mm/daxfile.c new file mode 100644 index 000000000000..fe230199c855 --- /dev/null +++ b/mm/daxfile.c @@ -0,0 +1,186 @@ +/* + * Copyright(c) 2017 Intel Corporation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ +#include <linux/dax.h> +#include <linux/slab.h> +#include <linux/highmem.h> +#include <linux/pagemap.h> +#include <linux/syscalls.h> + +/* + * TODO: a list to lookup daxfiles assumes a low number of instances, + * revisit. + */ +static LIST_HEAD(daxfiles); +static DEFINE_SPINLOCK(dax_lock); + +struct dax_info { + struct list_head list; + struct file *daxfile; +}; + +static int daxfile_disable(struct file *victim) +{ + int found = 0; + struct dax_info *d; + struct inode *inode; + struct file *daxfile; + struct address_space *mapping; + + mapping = victim->f_mapping; + spin_lock(&dax_lock); + list_for_each_entry(d, &daxfiles, list) + if (d->daxfile->f_mapping == mapping) { + list_del(&d->list); + found = 1; + break; + } + spin_unlock(&dax_lock); + + if (!found) + return -EINVAL; + + daxfile = d->daxfile; + + inode = mapping->host; + inode->i_flags &= ~(S_SWAPFILE | S_DAXFILE); + filp_close(daxfile, NULL); + + return 0; +} + +static int claim_daxfile_checks(struct inode *inode) +{ + if (!S_ISREG(inode->i_mode)) + return -EINVAL; + + if (!IS_DAX(inode)) + return -EINVAL; + + if (IS_SWAPFILE(inode) || IS_DAXFILE(inode)) + return -EBUSY; + + return 0; +} + +int daxfile_enable(struct file *daxfile, int align) +{ + struct address_space *mapping; + struct inode *inode; + struct dax_info *d; + int rc; + + if (align < 0) + return -EINVAL; + + mapping = daxfile->f_mapping; + inode = mapping->host; + + rc = claim_daxfile_checks(inode); + if (rc) + return rc; + + rc = daxfile_activate(daxfile, align); + if (rc) + return rc; + + d = kzalloc(sizeof(*d), GFP_KERNEL); + if (!d) + return -ENOMEM; + INIT_LIST_HEAD(&d->list); + d->daxfile = daxfile; + + spin_lock(&dax_lock); + list_add(&d->list, &daxfiles); + spin_unlock(&dax_lock); + + /* + * We set S_SWAPFILE to gain "no truncate" / static block + * allocation semantics, and S_DAXFILE so we can differentiate + * traditional swapfiles and assume static block mappings in the + * dax mmap path. + */ + inode->i_flags |= S_SWAPFILE | S_DAXFILE; + return 0; +} + +SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, align) +{ + int rc; + struct filename *name; + struct inode *inode = NULL; + struct file *daxfile = NULL; + struct address_space *mapping; + + if (flags & ~DAXCTL_VALID_FLAGS) + return -EINVAL; + + name = getname(path); + if (IS_ERR(name)) + return PTR_ERR(name); + + daxfile = file_open_name(name, O_RDWR|O_LARGEFILE, 0); + if (IS_ERR(daxfile)) { + rc = PTR_ERR(daxfile); + daxfile = NULL; + goto out; + } + + mapping = daxfile->f_mapping; + inode = mapping->host; + if (flags & DAXCTL_F_GET) { + /* + * We only report the state of DAXCTL_F_STATIC since + * there is no actions for applications to take based on + * the setting of S_DAX. However, if this interface is + * used for toggling S_DAX presumably userspace would + * want to know the state of the flag. + * + * TODO: revisit whether we want to report DAXCTL_F_DAX + * in the IS_DAX() case. + */ + if (IS_DAXFILE(inode)) + rc = DAXCTL_F_STATIC; + else + rc = 0; + + goto out; + } + + /* + * TODO: Should unprivileged users be allowed to control daxfile + * behavior? Perhaps a mount flag... is -o dax that flag? + */ + if (!capable(CAP_LINUX_IMMUTABLE)) { + rc = -EPERM; + goto out; + } + + inode_lock(inode); + if (!IS_DAXFILE(inode) && (flags & DAXCTL_F_STATIC)) { + rc = daxfile_enable(daxfile, align); + /* if successfully enabled hold daxfile open */ + if (rc == 0) + daxfile = NULL; + } else if (IS_DAXFILE(inode) && !(flags & DAXCTL_F_STATIC)) + rc = daxfile_disable(daxfile); + else + rc = 0; + inode_unlock(inode); + +out: + if (daxfile) + filp_close(daxfile, NULL); + if (name) + putname(name); + return rc; +} diff --git a/mm/page_io.c b/mm/page_io.c index 5cec9a3d49f2..35160ad9c51f 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -244,6 +244,37 @@ static int bmap_walk(struct file *file, const unsigned page_size, goto out; } +static int daxfile_check(sector_t block, unsigned long page_no, + enum bmap_check type, void *none) +{ + if (type == BMAP_WALK_DONE) + return 0; + + /* + * Unlike the swapfile case, fail daxfile_activate() if any file + * extent is not page aligned. + */ + if (type != BMAP_WALK_FULLPAGE) + return -EINVAL; + return 0; +} + +int daxfile_activate(struct file *daxfile, unsigned align) +{ + int rc; + + if (!align) + align = PAGE_SIZE; + + if (align < PAGE_SIZE || !is_power_of_2(align)) + return -EINVAL; + + rc = bmap_walk(daxfile, align, ULONG_MAX, NULL, daxfile_check, NULL); + if (rc) + pr_debug("daxctl: daxfile has holes\n"); + return rc; +} + static int swapfile_check(sector_t block, unsigned long page_no, enum bmap_check type, void *_sis) {
To date, the full promise of byte-addressable access to persistent memory has only been half realized via the filesystem-dax interface. The current filesystem-dax mechanism allows an application to consume (read) data from persistent storage at byte-size granularity, bypassing the full page reads required by traditional storage devices. Now, for writes, applications still need to contend with page-granularity dirtying and flushing semantics as well as filesystem coordination for metadata updates after any mmap write. The current situation precludes use cases that leverage byte-granularity / in-place updates to persistent media. To get around this limitation there are some specialized applications that are using the device-dax interface to bypass the overhead and data-safety problems of the current filesystem-dax mmap-write path. QEMU-KVM is forced to use device-dax to safely pass through persistent memory to a guest [1]. Some specialized databases are using device-dax for byte-granularity writes. Outside of those cases, device-dax is difficult for general purpose persistent memory applications to consume. There is demand for access to pmem without needing to contend with special device configuration and other device-dax limitations. The 'daxfile' interface satisfies this demand and realizes one of Dave Chinner's ideas for allowing pmem applications to safely bypass fsync/msync requirements. The idea is to make the file immutable with respect to the offset-to-block mappings for every extent in the file [2]. It turns out that filesystems already need to make this guarantee today. This property is needed for files marked as swap files. The new daxctl() syscall manages setting a file into 'static-dax' mode whereby it arranges for the file to be treated as a swapfile as far as the filesystem is concerned, but not registered with the core-mm as swapfile space. A file in this mode is then safe to be mapped and written without the requirement to fsync/msync the writes. The cpu cache management for flushing data to persistence can be handled completely in userspace. [1]: https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg01207.html [2]: https://lkml.org/lkml/2016/9/11/159 Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- arch/x86/entry/syscalls/syscall_64.tbl | 1 include/linux/dax.h | 9 ++ include/linux/fs.h | 3 + include/linux/syscalls.h | 1 include/uapi/linux/dax.h | 8 + mm/Kconfig | 5 + mm/Makefile | 1 mm/daxfile.c | 186 ++++++++++++++++++++++++++++++++ mm/page_io.c | 31 +++++ 9 files changed, 245 insertions(+) create mode 100644 include/uapi/linux/dax.h create mode 100644 mm/daxfile.c