Message ID | 20170925231404.32723-2-ross.zwisler@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote: > Before support for the per-inode DAX flag was disabled the XFS the code had > an issue where the user couldn't reliably tell whether or not DAX was being > used to service page faults and I/O when the DAX mount option was used. In > this case each inode within the mounted filesystem started with S_DAX set > due to the mount option, but it could be cleared if someone touched the > individual inode flag. > > For example (v4.13 and before): > > # mount | grep dax > /dev/pmem0 on /mnt type xfs > (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota) > > # touch /mnt/a /mnt/b # both files currently use DAX > > # xfs_io -c "lsattr" /mnt/* # neither has the DAX inode option set > ----------e----- /mnt/a > ----------e----- /mnt/b > > # xfs_io -c "chattr -x" /mnt/a # this clears S_DAX for /mnt/a > > # xfs_io -c "lsattr" /mnt/* > ----------e----- /mnt/a > ----------e----- /mnt/b That's really a bug in the lsattr code, yes? If we've cleared the S_DAX flag for the inode, then why is it being reported in lsattr? Or if we failed to clear the S_DAX flag in the 'chattr -x' call, then isn't that the bug that needs fixing? Remember, the whole point of the dax inode flag was to be able to override the mount option setting so that admins could turn off/on dax for the things that didn't/did work with DAX correctly so they didn't need multiple filesystems on pmem to segregate the apps that did/didn't work with DAX... Cheers, Dave.
On Tue 26-09-17 09:38:12, Dave Chinner wrote: > On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote: > > Before support for the per-inode DAX flag was disabled the XFS the code had > > an issue where the user couldn't reliably tell whether or not DAX was being > > used to service page faults and I/O when the DAX mount option was used. In > > this case each inode within the mounted filesystem started with S_DAX set > > due to the mount option, but it could be cleared if someone touched the > > individual inode flag. > > > > For example (v4.13 and before): > > > > # mount | grep dax > > /dev/pmem0 on /mnt type xfs > > (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota) > > > > # touch /mnt/a /mnt/b # both files currently use DAX > > > > # xfs_io -c "lsattr" /mnt/* # neither has the DAX inode option set > > ----------e----- /mnt/a > > ----------e----- /mnt/b > > > > # xfs_io -c "chattr -x" /mnt/a # this clears S_DAX for /mnt/a > > > > # xfs_io -c "lsattr" /mnt/* > > ----------e----- /mnt/a > > ----------e----- /mnt/b > > That's really a bug in the lsattr code, yes? If we've cleared the > S_DAX flag for the inode, then why is it being reported in lsattr? > Or if we failed to clear the S_DAX flag in the 'chattr -x' call, > then isn't that the bug that needs fixing? > > Remember, the whole point of the dax inode flag was to be able to > override the mount option setting so that admins could turn off/on > dax for the things that didn't/did work with DAX correctly so they > didn't need multiple filesystems on pmem to segregate the apps that > did/didn't work with DAX... So I think there is some confusion that is created by the fact that whether DAX is used or not is controlled by both a mount option and an inode flag. We could define that "Inode flag always wins" which is what you seem to suggest above but then mount option has no practical effect since on-disk S_DAX flag will always overrule it. Ross suggests that DAX should be used if "Inode flag or mount option is set". Which is similar to how e.g. noatime inode flag works but does not allow to selectively disable DAX. So if we wanted both mount option to work and selective disabling of DAX, we would need three states of inode setting - force DAX, disable DAX, inherit from mount option. Honza
On Tue, Sep 26, 2017 at 11:35:48AM +0200, Jan Kara wrote: > On Tue 26-09-17 09:38:12, Dave Chinner wrote: > > On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote: > > > Before support for the per-inode DAX flag was disabled the XFS the code had > > > an issue where the user couldn't reliably tell whether or not DAX was being > > > used to service page faults and I/O when the DAX mount option was used. In > > > this case each inode within the mounted filesystem started with S_DAX set > > > due to the mount option, but it could be cleared if someone touched the > > > individual inode flag. > > > > > > For example (v4.13 and before): > > > > > > # mount | grep dax > > > /dev/pmem0 on /mnt type xfs > > > (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota) > > > > > > # touch /mnt/a /mnt/b # both files currently use DAX > > > > > > # xfs_io -c "lsattr" /mnt/* # neither has the DAX inode option set > > > ----------e----- /mnt/a > > > ----------e----- /mnt/b > > > > > > # xfs_io -c "chattr -x" /mnt/a # this clears S_DAX for /mnt/a > > > > > > # xfs_io -c "lsattr" /mnt/* > > > ----------e----- /mnt/a > > > ----------e----- /mnt/b > > > > That's really a bug in the lsattr code, yes? If we've cleared the > > S_DAX flag for the inode, then why is it being reported in lsattr? > > Or if we failed to clear the S_DAX flag in the 'chattr -x' call, > > then isn't that the bug that needs fixing? > > > > Remember, the whole point of the dax inode flag was to be able to > > override the mount option setting so that admins could turn off/on > > dax for the things that didn't/did work with DAX correctly so they > > didn't need multiple filesystems on pmem to segregate the apps that > > did/didn't work with DAX... > > So I think there is some confusion that is created by the fact that whether > DAX is used or not is controlled by both a mount option and an inode flag. > We could define that "Inode flag always wins" which is what you seem to > suggest above but then mount option has no practical effect since on-disk > S_DAX flag will always overrule it. Well, quite frankly, I never wanted the mount option for XFS. It was supposed to be for initial testing only, then we'd /always/ use the the inode flags. For a filesystem to default to using DAX, we set the DAX flag on the root inode at mkfs time, and then everything inode flag based just works. But it seems that we're now stuck with the stupid, blunt, brute force mount option because that's what the first commit on ext4 used. Now we're just about stuck with this silly "but we can't turn it off" problem because of the mount option overriding everything. If we have to keep the mount option, then lets fix it to mean "mount option sets inheritable inode flag on directory creation" and /maybe/ "mount option sets inode flag on file creation". This then allows the inode flag to control everything else. i.e the mount option sets the initial flag value rather than the behaviour of the inode. The behaviour of the inode should be entirely controlled by the inode flag, hence after initial creation the chattr +/-x commands do what they advertise regardless of the mount option value. Yes, it means that existing users are going to have to run chattr -R +x on their pmem filesystems to get the inode flags on disk, but this is all tagged with EXPERIMENTAL and this is the sort of change that is expected from experimental functionality. Cheers, Dave.
On Tue, Sep 26, 2017 at 09:09:57PM +1000, Dave Chinner wrote: > Well, quite frankly, I never wanted the mount option for XFS. It was > supposed to be for initial testing only, then we'd /always/ use the > the inode flags. For a filesystem to default to using DAX, we > set the DAX flag on the root inode at mkfs time, and then everything > inode flag based just works. And I deeply fundamentally disagree. The mount option is a nice enough big hammer to try a mode without encoding nitty gritty details into the application ABI. The per-inode persistent flag is the biggest nightmare ever, as we see in all these discussions about it. What does it even mean? Right now it forces direct addressing as long as the underlying media supports that. But what about media that you directly access but you really don't want to because it's really slow? Or media that is so god damn fast that you never want to buffer? Or media where you want to buffer for writes (or at least some of them) but not for reads? It encodes a very specific mechanism for an early direct access implementation into the ABI. What we really need is for applications to declare an intent, not specify a particular mechanism.
On Tue, Sep 26, 2017 at 04:37:43PM +0200, Christoph Hellwig wrote: > On Tue, Sep 26, 2017 at 09:09:57PM +1000, Dave Chinner wrote: > > Well, quite frankly, I never wanted the mount option for XFS. It was > > supposed to be for initial testing only, then we'd /always/ use the > > the inode flags. For a filesystem to default to using DAX, we > > set the DAX flag on the root inode at mkfs time, and then everything > > inode flag based just works. > > And I deeply fundamentally disagree. The mount option is a nice > enough big hammer to try a mode without encoding nitty gritty details > into the application ABI. > > The per-inode persistent flag is the biggest nightmare ever, as we see > in all these discussions about it. > > What does it even mean? Right now it forces direct addressing as long > as the underlying media supports that. But what about media that > you directly access but you really don't want to because it's really slow? > Or media that is so god damn fast that you never want to buffer? Or > media where you want to buffer for writes (or at least some of them) > but not for reads? > > It encodes a very specific mechanism for an early direct access > implementation into the ABI. What we really need is for applications > to declare an intent, not specify a particular mechanism. I agree that Christoph's idea about having the system intelligently adjust to use DAX based on performance information it gathers about the underlying persistent memory (probably via the HMAT on x86_64 systems) is interesting, but I think we're still a ways away from that. FWIW, as my patches suggest and Jan observed I think that we should allow users to turn on DAX by treating the inode flag and the mount flag as an 'or' operation. i.e. you get DAX if either the mount option is specified or if the inode flag is set, and you can continue to manipulate the per-inode flag as you want regardless of the mount option. I think this provides maximum flexibility of the mechanism to select DAX without enforcing policy. In the end, though, I think what's really important is that we figure out what the various options mean, have the same story for both XFS and ext4, and document it as hch suggested in response to my patch 7 in this series. Does it make sense at this point to just start a "dax" man page that can contain info about the mount options, inode flags, kernel config options, how to get PMDs, etc? Or does this documentation need to be sprinkled around more in existing man pages?
On 9/26/17 6:09 AM, Dave Chinner wrote: > On Tue, Sep 26, 2017 at 11:35:48AM +0200, Jan Kara wrote: >> On Tue 26-09-17 09:38:12, Dave Chinner wrote: >>> On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote: >>>> Before support for the per-inode DAX flag was disabled the XFS the code had >>>> an issue where the user couldn't reliably tell whether or not DAX was being >>>> used to service page faults and I/O when the DAX mount option was used. In >>>> this case each inode within the mounted filesystem started with S_DAX set >>>> due to the mount option, but it could be cleared if someone touched the >>>> individual inode flag. >>>> >>>> For example (v4.13 and before): >>>> >>>> # mount | grep dax >>>> /dev/pmem0 on /mnt type xfs >>>> (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota) >>>> >>>> # touch /mnt/a /mnt/b # both files currently use DAX >>>> >>>> # xfs_io -c "lsattr" /mnt/* # neither has the DAX inode option set >>>> ----------e----- /mnt/a >>>> ----------e----- /mnt/b >>>> >>>> # xfs_io -c "chattr -x" /mnt/a # this clears S_DAX for /mnt/a >>>> >>>> # xfs_io -c "lsattr" /mnt/* >>>> ----------e----- /mnt/a >>>> ----------e----- /mnt/b >>> >>> That's really a bug in the lsattr code, yes? If we've cleared the >>> S_DAX flag for the inode, then why is it being reported in lsattr? >>> Or if we failed to clear the S_DAX flag in the 'chattr -x' call, >>> then isn't that the bug that needs fixing? >>> >>> Remember, the whole point of the dax inode flag was to be able to >>> override the mount option setting so that admins could turn off/on >>> dax for the things that didn't/did work with DAX correctly so they >>> didn't need multiple filesystems on pmem to segregate the apps that >>> did/didn't work with DAX... >> >> So I think there is some confusion that is created by the fact that whether >> DAX is used or not is controlled by both a mount option and an inode flag. >> We could define that "Inode flag always wins" which is what you seem to >> suggest above but then mount option has no practical effect since on-disk >> S_DAX flag will always overrule it. > > Well, quite frankly, I never wanted the mount option for XFS. It was > supposed to be for initial testing only, then we'd /always/ use the > the inode flags. For a filesystem to default to using DAX, we > set the DAX flag on the root inode at mkfs time, and then everything > inode flag based just works. > > But it seems that we're now stuck with the stupid, blunt, brute > force mount option because that's what the first commit on ext4 > used. Now we're just about stuck with this silly "but we can't turn > it off" problem because of the mount option overriding everything. I don't think the existence of a mount option in ext4 makes us any more "stuck" than the mount option in xfs does. fs/xfs/xfs_super.c: "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); fs/ext4/super.c: "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); so when^wif this argument ever gets settled, I think there is plenty of latitude to do the right thing, potentially breaking the old thing. > If we have to keep the mount option, then lets fix it to mean "mount > option sets inheritable inode flag on directory creation" and > /maybe/ "mount option sets inode flag on file creation". > > This then allows the inode flag to control everything else. i.e the > mount option sets the initial flag value rather than the behaviour > of the inode. The behaviour of the inode should be entirely > controlled by the inode flag, hence after initial creation the > chattr +/-x commands do what they advertise regardless of the mount > option value. > > Yes, it means that existing users are going to have to run chattr -R > +x on their pmem filesystems to get the inode flags on disk, but > this is all tagged with EXPERIMENTAL and this is the sort of change ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > that is expected from experimental functionality. Right. -Eric > Cheers, > > Dave. >
On Tue, Sep 26, 2017 at 09:38:12AM +1000, Dave Chinner wrote: > On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote: > > Before support for the per-inode DAX flag was disabled the XFS the code had > > an issue where the user couldn't reliably tell whether or not DAX was being > > used to service page faults and I/O when the DAX mount option was used. In > > this case each inode within the mounted filesystem started with S_DAX set > > due to the mount option, but it could be cleared if someone touched the > > individual inode flag. > > > > For example (v4.13 and before): > > > > # mount | grep dax > > /dev/pmem0 on /mnt type xfs > > (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota) > > > > # touch /mnt/a /mnt/b # both files currently use DAX > > > > # xfs_io -c "lsattr" /mnt/* # neither has the DAX inode option set > > ----------e----- /mnt/a > > ----------e----- /mnt/b > > > > # xfs_io -c "chattr -x" /mnt/a # this clears S_DAX for /mnt/a > > > > # xfs_io -c "lsattr" /mnt/* > > ----------e----- /mnt/a > > ----------e----- /mnt/b > > That's really a bug in the lsattr code, yes? If we've cleared the > S_DAX flag for the inode, then why is it being reported in lsattr? > Or if we failed to clear the S_DAX flag in the 'chattr -x' call, > then isn't that the bug that needs fixing? No, I think lsattr/chattr are working correctly. In both the examples above the DAX inode flag (which is represeted by an 'x') is never set. S_DAX is the in-memory inode flag (not the on-media inode flag) which is not manipulated directly by lsattr/chattr, but instead reflects whether the inode is actually using DAX or not. Manipulating and displaying the on-media inode flag works as expected with lsattr/chattr: # xfs_io -c "lsattr" ./a ---------------- ./a # xfs_io -c "chattr +x" ./a # xfs_io -c "lsattr" ./a --------------x- ./a - Ross
On Tue, Sep 26, 2017 at 11:30:57AM -0600, Ross Zwisler wrote: > On Tue, Sep 26, 2017 at 04:37:43PM +0200, Christoph Hellwig wrote: > > On Tue, Sep 26, 2017 at 09:09:57PM +1000, Dave Chinner wrote: > > > Well, quite frankly, I never wanted the mount option for XFS. It was > > > supposed to be for initial testing only, then we'd /always/ use the > > > the inode flags. For a filesystem to default to using DAX, we > > > set the DAX flag on the root inode at mkfs time, and then everything > > > inode flag based just works. > > > > And I deeply fundamentally disagree. The mount option is a nice > > enough big hammer to try a mode without encoding nitty gritty details > > into the application ABI. > > > > The per-inode persistent flag is the biggest nightmare ever, as we see > > in all these discussions about it. > > > > What does it even mean? Right now it forces direct addressing as long > > as the underlying media supports that. But what about media that > > you directly access but you really don't want to because it's really slow? > > Or media that is so god damn fast that you never want to buffer? Or > > media where you want to buffer for writes (or at least some of them) > > but not for reads? > > > > It encodes a very specific mechanism for an early direct access > > implementation into the ABI. What we really need is for applications > > to declare an intent, not specify a particular mechanism. > > I agree that Christoph's idea about having the system intelligently adjust to > use DAX based on performance information it gathers about the underlying > persistent memory (probably via the HMAT on x86_64 systems) is interesting, > but I think we're still a ways away from that. > > FWIW, as my patches suggest and Jan observed I think that we should allow > users to turn on DAX by treating the inode flag and the mount flag as an 'or' > operation. i.e. you get DAX if either the mount option is specified or if the > inode flag is set, and you can continue to manipulate the per-inode flag as > you want regardless of the mount option. I think this provides maximum > flexibility of the mechanism to select DAX without enforcing policy. > > In the end, though, I think what's really important is that we figure out what > the various options mean, have the same story for both XFS and ext4, and > document it as hch suggested in response to my patch 7 in this series. Agreed. We have a fundamental conflict between letting the sysadmin or user decide how they want an inode to behave vs. letting the kernel make all the decisions based on whatever information it gathers. I'm pulled this patch out of -fixes and for-next because I feel strongly discouraged about taking any more patches that change the user-control parts of the DAX implementation until we reach a consensus on what to do. Given that DAX and pmem support in filesystems is still experimental, I'm open to changing the interface as needed. Where do we think we'll be in a few years once ACPI or whatever reaches the point of being able to tell the kernel about the general performance characteristics of the pmem? What choices about the interface do we need to make now so that we can get there while minimizing the number of insufficient interfaces to deprecate? My personal guess is that most programs will not care enough to want to make a syscall so we might as well give them the most performant option available. Roughly speaking, here are the use cases I can think of: * Regular buffered read/write: we can let the kernel decide if it wants to push the IO through the page cache, directly access the pmem, or some future combination of the two. * O_DIRECT read/write: Straight to pmem. * Regular mmap: This seems fairly agnostic to how we actually make the memory mapping work, right? * MAP_DIRECT/MAP_SYNC mmap: If userspace actually goes to the trouble of making sure the whole range is allocated and pre-written and uses these flags then they get direct access. I've wondered off and on if an acceptable solution is to define a number of things surrounding an inode for which XFS /could/ optimize, and let the user tell us which one thing matters most to them: total manual control over everything like we do now, sequential io, random io, fastest mmap access possible, most direct access to storage, etc. If you set a hint other than full manual control then XFS reserves the right to change inode flags at any time to satisfy the hint. For the most part I'm in favor of Christoph's suggestion to let the kernel decide on its own, and I don't see the point in encoding details of the storage medium access strategy on the disk, particularly since filesystems are supposed to be fairly independent of storage. But frankly, so many people have asked me over the years if there's some way to influence the decision-making that I won't quite let go of file hints as a way to influence the decisions XFS makes around storage media. > Does it make sense at this point to just start a "dax" man page that can > contain info about the mount options, inode flags, kernel config options, how > to get PMDs, etc? Or does this documentation need to be sprinkled around more > in existing man pages? Personally it'd be a lot easier to tell internal groups to go look at a single documentation page that discusses everything you'd want to know about enabling dax -- how to control it, how to make large page table entries work, etc. Some of those things will get into fs internals, however, which probably belong in the ext4/xfs manpages. I suggest laying out the general details in a single dax manpage and pointing people at each fs's documentation for specific details. --D > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 26, 2017 at 12:48:30PM -0700, Darrick J. Wong wrote: > For the most part I'm in favor of Christoph's suggestion to let the > kernel decide on its own, and I don't see the point in encoding details > of the storage medium access strategy on the disk, particularly since > filesystems are supposed to be fairly independent of storage. But > frankly, so many people have asked me over the years if there's some way > to influence the decision-making that I won't quite let go of file hints > as a way to influence the decisions XFS makes around storage media. And that's pretty much it. The discussion here is not about whether there should be a flag, but what semantics it should have when the flag is not set. If "flag not set" means "kernel selects automatically", then that's fine by me. But history tells us that users and admins want a way to be able to override the kernel's automatic behaviours because they are /never 100% correct/ for everyone. There are always exceptions, otherwise we wouldn't have the great plethora of mkfs, mount, proc and sysfs options for our filesystems or storage. Anyone who says "the kernel will always do the right thing for everyone automatically" is living in a dream world. Note: I agree that the kernel should do the right thing w.r.t. DAX automatically. We don't need a mount option for that - we can probe for dax support automatically and use it automatically already. However, in a world where the kernel automatically uses that functionality when it is present, admins and users need a way to solve the "default behaviour is bad for me, let me control this manually" problem. That's where the inode flags come in.... i.e. What I'm advocating is a model DAX gets enabled automatically if the underlying device supports is using whatever the kernel thinks is optimal at the time the access is made, but the user can override/direct behvaiour on a case by case basis via persistent inode flags/xattrs/whatever. Cheers, Dave.
On Tue, Sep 26, 2017 at 11:30:57AM -0600, Ross Zwisler wrote: > I agree that Christoph's idea about having the system intelligently adjust to > use DAX based on performance information it gathers about the underlying > persistent memory (probably via the HMAT on x86_64 systems) is interesting, > but I think we're still a ways away from that. So what are the missing blockers for a getting started? > FWIW, as my patches suggest and Jan observed I think that we should allow > users to turn on DAX by treating the inode flag and the mount flag as an 'or' > operation. i.e. you get DAX if either the mount option is specified or if the > inode flag is set, and you can continue to manipulate the per-inode flag as > you want regardless of the mount option. I think this provides maximum > flexibility of the mechanism to select DAX without enforcing policy. IFF we stick to the dax flag that's the only workable way. The only major issue I still see with that is that this allows unprivilegued users to enable DAX on a any file they own / have write access to. So there isn't really any way to effectively disable the DAX path by the sysadmin. > Does it make sense at this point to just start a "dax" man page that can > contain info about the mount options, inode flags, kernel config options, how > to get PMDs, etc? Or does this documentation need to be sprinkled around more > in existing man pages? A dax manpage would be good.
On Tue, Sep 26, 2017 at 11:40:01PM -0700, Christoph Hellwig wrote: > On Tue, Sep 26, 2017 at 11:30:57AM -0600, Ross Zwisler wrote: > > I agree that Christoph's idea about having the system intelligently adjust to > > use DAX based on performance information it gathers about the underlying > > persistent memory (probably via the HMAT on x86_64 systems) is interesting, > > but I think we're still a ways away from that. > > So what are the missing blockers for a getting started? Well, I don't know if platforms that support HMAT + PMEM are widely available, but we have all the details in the ACPI spec, so we could begin to code it up and things will "just work" when platforms arrive. > > FWIW, as my patches suggest and Jan observed I think that we should allow > > users to turn on DAX by treating the inode flag and the mount flag as an 'or' > > operation. i.e. you get DAX if either the mount option is specified or if the > > inode flag is set, and you can continue to manipulate the per-inode flag as > > you want regardless of the mount option. I think this provides maximum > > flexibility of the mechanism to select DAX without enforcing policy. > > IFF we stick to the dax flag that's the only workable way. The only > major issue I still see with that is that this allows unprivilegued > users to enable DAX on a any file they own / have write access to. > So there isn't really any way to effectively disable the DAX path > by the sysadmin. Hum, I wonder if maybe we need/want three different mount modes? What about: autodax (the default): the filesystem is free to use DAX or not, as it sees fit and thinks is optimal. For the time being we can make this mean "don't use DAX", and phase in DAX usage as we add support for the HMAT, etc. Users can manually turn on DAX for a given inode by setting the DAX inode flag, but there is no way for the user to *prevent* DAX for an inode - the kernel can always choose to turn it on. MAP_DIRECT and MAP_SYNC work. nodax: Don't use DAX. The kernel won't choose to use DAX, and any DAX inode flags will be ignored. This gives the sysadmin the override that I think you're looking for. The user can still manipulate the inode flags as they see fit. MAP_DIRECT and MAP_SYNC both fail. dax: Use DAX for all inodes in the filesystem. Again the inode flags are essentially ignored, but the user can manipulate the inode flags as they see fit. This is basically unchanged from how it works today, modulo the bug where DAX can get turned off if you unset the inode flag where it wasn't even set (patch 1 in my series). MAP_DIRECT and MAP_SYNC work. > > Does it make sense at this point to just start a "dax" man page that can > > contain info about the mount options, inode flags, kernel config options, how > > to get PMDs, etc? Or does this documentation need to be sprinkled around more > > in existing man pages? > > A dax manpage would be good. Okay, I'll start with a manpage, and once we agree on whats in there we can start working on code again. :)
On Wed, Sep 27, 2017 at 10:15:10AM -0600, Ross Zwisler wrote: > Well, I don't know if platforms that support HMAT + PMEM are widely available, > but we have all the details in the ACPI spec, so we could begin to code it up > and things will "just work" when platforms arrive. Then again currently all actually shipping NVDIMMs are battery backed dram and DAX mode should work just fine for them. Things will get interesting once companies start shipping actually persistent technologies that will be significantly slower than DRAM. And we sould make sure we have the infrastruture for that in place. > Hum, I wonder if maybe we need/want three different mount modes? What about: > > autodax (the default): the filesystem is free to use DAX or not, as it sees > fit and thinks is optimal. For the time being we can make this mean "don't > use DAX", and phase in DAX usage as we add support for the HMAT, etc. What does "use DAX" really mean anyway? I think we are conflating a few things: a) use a block device or use a dax_device for accessing the device b) use the pagecache for caching data in DRAM or not. Now we actually have a really nice way to control a) already, it's called O_DIRECT. Currently O_DIRECT only works with read/write I/O, but with a byte addressable scheme we now can implement it for mmap as well, which is what the DAX mmap path does. b) right now is implied by a), but it's really an implementation detail. So the modes would be more like two options to: a) disallow any byte-level access. The right way to do that would be to mount the /dev/dax* device instead of the block device to allow byte access, and disallow any DAXish operation if you mount the block device in the long run. b) have a mode to always force an O_DIRECT-like mode for devices that are fast enough. We should always do that with the right HMAT entries if mounting the /dev/dax devices, and maybe have a mount option to force it.
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 5049e8a..26faeb9 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -1013,7 +1013,7 @@ xfs_diflags_to_linux( else inode->i_flags &= ~S_NOATIME; #if 0 /* disabled until the flag switching races are sorted out */ - if (xflags & FS_XFLAG_DAX) + if ((xflags & FS_XFLAG_DAX) || (ip->i_mount->m_flags & XFS_MOUNT_DAX)) inode->i_flags |= S_DAX; else inode->i_flags &= ~S_DAX; @@ -1104,7 +1104,14 @@ xfs_ioctl_setattr_dax_invalidate( return -EINVAL; } - /* If the DAX state is not changing, we have nothing to do here. */ + /* + * If the DAX state is not changing, we have nothing to do here. If + * the DAX mount option was used we will update the DAX inode flag as + * the user requested but we will continue to use DAX for I/O and page + * faults regardless of how the inode flag is set. + */ + if (ip->i_mount->m_flags & XFS_MOUNT_DAX) + return 0; if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode)) return 0; if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))