Message ID | 20200208193445.27421-8-ira.weiny@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Enable per-file/directory DAX operations V3 | expand |
On Sat, Feb 08, 2020 at 11:34:40AM -0800, ira.weiny@intel.com wrote: > From: Ira Weiny <ira.weiny@intel.com> > > DAX requires special address space operations but many other functions > check the IS_DAX() state. > > While DAX is a property of the inode we perfer a lock at the super block > level because of the overhead of a rwsem within the inode. > > Define a vfs per superblock percpu rs semaphore to lock the DAX state ???? > while performing various VFS layer operations. Write lock calls are > provided here but are used in subsequent patches by the file systems > themselves. > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > --- > Changes from V2 > > Rebase on linux-next-08-02-2020 > > Fix locking order > Change all references from mode to state where appropriate > add CONFIG_FS_DAX requirement for state change > Use a static branch to enable locking only when a dax capable > device has been seen. > > Move the lock to a global vfs lock > > this does a few things > 1) preps us better for ext4 support > 2) removes funky callbacks from inode ops > 3) remove complexity from XFS and probably from > ext4 later > > We can do this because > 1) the locking order is required to be at the > highest level anyway, so why complicate xfs > 2) We had to move the sem to the super_block > because it is too heavy for the inode. > 3) After internal discussions with Dan we > decided that this would be easier, just as > performant, and with slightly less overhead > than in the VFS SB. > > We also change the functions names to up/down; > read/write as appropriate. Previous names were over > simplified. This, IMO, is a bit of a train wreck. This patch has nothing to do with "DAX state", it's about serialising access to the aops vector. There should be zero references to DAX in this patch at all, except maybe to say "switching DAX on dynamically requires atomic switching of address space ops". Big problems I see here: 1. static key to turn it on globally. - just a gross hack around doing it properly with a per-sb mechanism and enbaling it only on filesystems that are on DAX capable block devices. - you're already passing in an inode to all these functions. It's trivial to do: if (!inode->i_sb->s_flags & S_DYNAMIC_AOPS) return /* do sb->s_aops_lock manipulation */ 2. global lock - OMG! - global lock will cause entire system IO/page fault stalls when someone does recursive/bulk DAX flag modification operations. Per-cpu rwsem Contention on large systems will be utterly awful. - ext4's use case almost never hits the exclusive lock side of the percpu-rwsem - only when changing the journal mode flag on the inode. And it only affects writeback in progress, so it's not going to have massive concurrency on it like a VFS level global lock has. -> Bad model to follow. - per-sb lock is trivial - see #1 - which limits scope to single filesystem - per-inode rwsem would make this problem go away entirely. 3. If we can use a global per-cpu rwsem, why can't we just use a per-inode rwsem? - locking context rules are the same - rwsem scales pretty damn well for shared ops - no "global" contention problems - small enough that we can put another rwsem in the inode. 4. "inode_dax_state_up_read" - Eye bleeds. - this is about the aops structure serialisation, not dax. - The name makes no sense in the places that it has been added. 5. We already have code that does almost exactly what we need: the superblock freezing infrastructure. - freezing implements a "hold operations on this superblock until further notice" model we can easily follow. - sb_start_write/sb_end_write provides a simple API model and a clean, clear and concise naming convention we can use, too. Really, I'm struggling to understand how we got to "global locking that stops the world" from "need to change per-inode state atomically". Can someone please explain to me why this patch isn't just a simple set of largely self-explanitory functions like this: XX_start_aop() XX_end_aop() XX_lock_aops() XX_switch_aops(aops) XX_unlock_aops() where "XX" is "sb" or "inode" depending on what locking granularity is used... Cheers, Dave.
On Tue, Feb 11, 2020 at 07:00:35PM +1100, Dave Chinner wrote: > On Sat, Feb 08, 2020 at 11:34:40AM -0800, ira.weiny@intel.com wrote: > > From: Ira Weiny <ira.weiny@intel.com> > > > > DAX requires special address space operations but many other functions > > check the IS_DAX() state. > > > > While DAX is a property of the inode we perfer a lock at the super block > > level because of the overhead of a rwsem within the inode. > > > > Define a vfs per superblock percpu rs semaphore to lock the DAX state > > ???? oops... I must have forgotten to update the commit message when I did the global RW sem. I implemented the per-SB, percpu rwsem first but it was suggested that the percpu nature of the lock combined with the anticipated infrequent use of the write side made using a global easier. But before I proceed on V4 I'd like to get consensus on which of the 2 locking models to go with. 1) percpu per superblock lock 2) per inode rwsem Depending on my mood I can convince myself of both being performant but the percpu is very attractive because I don't anticipate many changes of state during run time. OTOH concurrent threads accessing the same file at run time may also be low so there is likely to be little read contention across CPU's on the per-inode lock? Opinions? > > > while performing various VFS layer operations. Write lock calls are > > provided here but are used in subsequent patches by the file systems > > themselves. > > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > > > --- > > Changes from V2 > > > > Rebase on linux-next-08-02-2020 > > > > Fix locking order > > Change all references from mode to state where appropriate > > add CONFIG_FS_DAX requirement for state change > > Use a static branch to enable locking only when a dax capable > > device has been seen. > > > > Move the lock to a global vfs lock > > > > this does a few things > > 1) preps us better for ext4 support > > 2) removes funky callbacks from inode ops > > 3) remove complexity from XFS and probably from > > ext4 later > > > > We can do this because > > 1) the locking order is required to be at the > > highest level anyway, so why complicate xfs > > 2) We had to move the sem to the super_block > > because it is too heavy for the inode. > > 3) After internal discussions with Dan we > > decided that this would be easier, just as > > performant, and with slightly less overhead > > than in the VFS SB. > > > > We also change the functions names to up/down; > > read/write as appropriate. Previous names were over > > simplified. > > This, IMO, is a bit of a train wreck. > > This patch has nothing to do with "DAX state", it's about > serialising access to the aops vector. It is a bit more than just the aops vector. It also has to protect against checks of IS_DAX() which occur through many of the file operations. > > There should be zero > references to DAX in this patch at all, except maybe to say > "switching DAX on dynamically requires atomic switching of address > space ops". I'm ok with removing the dax references. However, it very specifically protects against IS_DAX checks. Adding in protection for other states will require care IMO. This is why I preserved the dax naming because I did not want to over step what is protected with the locking. I do _want_ it to protect more. But I can't guarantee it, have not tested it, and so made the calls specific to dax. > > Big problems I see here: > > 1. static key to turn it on globally. > - just a gross hack around doing it properly with a per-sb > mechanism and enbaling it only on filesystems that are on DAX > capable block devices. Fair enough. This was a reaction to Darricks desire to get this out of the way when DAX was not in the system. The static branch seemed like a good thing for users who don't have DAX capable hardware while running kernels and FS's which have DAX enabled. http://lkml.iu.edu/hypermail/linux/kernel/2001.1/05691.html > - you're already passing in an inode to all these functions. It's > trivial to do: > > if (!inode->i_sb->s_flags & S_DYNAMIC_AOPS) > return > /* do sb->s_aops_lock manipulation */ Yea that would be ok IMO. Darrick would just having this be CONFIG_FS_DAX as per this patch be ok with you. I guess the static key may have been a bit of overkill? > > 2. global lock > - OMG! I know. The thinking on this is that the locking is percpu which is near 0 overhead in the read case and we are rarely going to take exclusive access. > - global lock will cause entire system IO/page fault stalls > when someone does recursive/bulk DAX flag modification > operations. Per-cpu rwsem Contention on large systems will be > utterly awful. Yea that is probably bad... I certainly did not test the responsiveness of this. FWIW if the system only has 1 FS the per-SB lock is not going to be any different from the global which was part of our thinking. > - ext4's use case almost never hits the exclusive lock side of the > percpu-rwsem - only when changing the journal mode flag on the > inode. And it only affects writeback in progress, so it's not > going to have massive concurrency on it like a VFS level global > lock has. -> Bad model to follow. I admit I did not research the ext4's journal mode locking that much. > - per-sb lock is trivial - see #1 - which limits scope to single > filesystem I agree and... well the commit message shows I actually implemented it that way at first... :-/ > - per-inode rwsem would make this problem go away entirely. But would that be ok for concurrent read performance which is going to be used 99.99% of the time? Maybe Darricks comments made me panic a little bit too much overhead WRT locking and its potential impact on users not using DAX? > > 3. If we can use a global per-cpu rwsem, why can't we just use a > per-inode rwsem? Per-cpu lock was attractive because of its near 0 overhead to take the read lock which happens a lot during normal operations. > - locking context rules are the same > - rwsem scales pretty damn well for shared ops Does it? I'm not sure. > - no "global" contention problems > - small enough that we can put another rwsem in the inode. Everything else I agree with... :-D > > 4. "inode_dax_state_up_read" > - Eye bleeds. > - this is about the aops structure serialisation, not dax. > - The name makes no sense in the places that it has been added. Again it is about more than just the aops. IS_DAX() needs to be protected in all the file operations calls as well or we can get races with the logic in those calls and a state switch. > > 5. We already have code that does almost exactly what we need: the > superblock freezing infrastructure. I think freeze does a lot more than we need. > - freezing implements a "hold operations on this superblock until > further notice" model we can easily follow. > - sb_start_write/sb_end_write provides a simple API model and a > clean, clear and concise naming convention we can use, too. Ok as a model... If going with the per-SB lock. After working through my response I'm leaning toward a per-inode lock again. This was the way I did this to begin with. I want feedback before reworking for V4, please. > > Really, I'm struggling to understand how we got to "global locking > that stops the world" from "need to change per-inode state > atomically". Can someone please explain to me why this patch isn't > just a simple set of largely self-explanitory functions like this: > > XX_start_aop() > XX_end_aop() > > XX_lock_aops() > XX_switch_aops(aops) > XX_unlock_aops() > > where "XX" is "sb" or "inode" depending on what locking granularity > is used... Because failing to protect the logic around IS_DAX() results in failures in the read/write and direct IO paths. So we need to lock the file operations as well. Thanks for the review, :-D Ira > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Tue, Feb 11, 2020 at 12:14 PM Ira Weiny <ira.weiny@intel.com> wrote: > > On Tue, Feb 11, 2020 at 07:00:35PM +1100, Dave Chinner wrote: > > On Sat, Feb 08, 2020 at 11:34:40AM -0800, ira.weiny@intel.com wrote: > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > DAX requires special address space operations but many other functions > > > check the IS_DAX() state. > > > > > > While DAX is a property of the inode we perfer a lock at the super block > > > level because of the overhead of a rwsem within the inode. > > > > > > Define a vfs per superblock percpu rs semaphore to lock the DAX state > > > > ???? > > oops... I must have forgotten to update the commit message when I did the > global RW sem. I implemented the per-SB, percpu rwsem first but it was > suggested that the percpu nature of the lock combined with the anticipated > infrequent use of the write side made using a global easier. > > But before I proceed on V4 I'd like to get consensus on which of the 2 locking > models to go with. > > 1) percpu per superblock lock > 2) per inode rwsem > > Depending on my mood I can convince myself of both being performant but the > percpu is very attractive because I don't anticipate many changes of state > during run time. OTOH concurrent threads accessing the same file at run time > may also be low so there is likely to be little read contention across CPU's on > the per-inode lock? > > Opinions? As one who thought a global lock would be reasonable relative to how often dax address_space_operations are swapped out (on the order of taking cgroup_threadgroup_rwsem for write), I think a per-superblock lock is also an ok starting point. We can always go to finer grained locking in the future if we see evidence that the benefits of percpu are lost to stopping the world at the superblock level.
On Tue, Feb 11, 2020 at 12:14:31PM -0800, Ira Weiny wrote: > On Tue, Feb 11, 2020 at 07:00:35PM +1100, Dave Chinner wrote: > > On Sat, Feb 08, 2020 at 11:34:40AM -0800, ira.weiny@intel.com wrote: > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > DAX requires special address space operations but many other functions > > > check the IS_DAX() state. > > > > > > While DAX is a property of the inode we perfer a lock at the super block > > > level because of the overhead of a rwsem within the inode. > > > > > > Define a vfs per superblock percpu rs semaphore to lock the DAX state > > > > ???? > > oops... I must have forgotten to update the commit message when I did the > global RW sem. I implemented the per-SB, percpu rwsem first but it was > suggested that the percpu nature of the lock combined with the anticipated > infrequent use of the write side made using a global easier. > > But before I proceed on V4 I'd like to get consensus on which of the 2 locking > models to go with. > > 1) percpu per superblock lock > 2) per inode rwsem > > Depending on my mood I can convince myself of both being performant but the > percpu is very attractive because I don't anticipate many changes of state > during run time. OTOH concurrent threads accessing the same file at run time > may also be low so there is likely to be little read contention across CPU's on > the per-inode lock? > > Opinions? > > > > > > while performing various VFS layer operations. Write lock calls are > > > provided here but are used in subsequent patches by the file systems > > > themselves. > > > > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > > > > > --- > > > Changes from V2 > > > > > > Rebase on linux-next-08-02-2020 > > > > > > Fix locking order > > > Change all references from mode to state where appropriate > > > add CONFIG_FS_DAX requirement for state change > > > Use a static branch to enable locking only when a dax capable > > > device has been seen. > > > > > > Move the lock to a global vfs lock > > > > > > this does a few things > > > 1) preps us better for ext4 support > > > 2) removes funky callbacks from inode ops > > > 3) remove complexity from XFS and probably from > > > ext4 later > > > > > > We can do this because > > > 1) the locking order is required to be at the > > > highest level anyway, so why complicate xfs > > > 2) We had to move the sem to the super_block > > > because it is too heavy for the inode. > > > 3) After internal discussions with Dan we > > > decided that this would be easier, just as > > > performant, and with slightly less overhead > > > than in the VFS SB. > > > > > > We also change the functions names to up/down; > > > read/write as appropriate. Previous names were over > > > simplified. > > > > This, IMO, is a bit of a train wreck. > > > > This patch has nothing to do with "DAX state", it's about > > serialising access to the aops vector. > > It is a bit more than just the aops vector. It also has to protect against > checks of IS_DAX() which occur through many of the file operations. I think you are looking at this incorrectly. The IS_DAX() construct is just determining what path to the aops method we take. regardless of whether IS_DAX() is true or not, we need to execute an aop method, and so aops vector protection is required regardless of how IS_DAX() evaluates. But we do require IS_DAX() to evaluate consistently through an entire aops protected region, as there may be multiple aops method calls in a aops context (e.g. write page faults). Hence we have to ensure that IS_DAX() changes atomically w.r.t. to the aops vector switches. This is simply: sb_aops_lock() inode->i_flags |= S_DAX sb_aops_switch(new_aops) sb_aops_unlock(). This guarantees that inside sb_aops_start/end(), IS_DAX() checks are stable because we change the state atomically with the aops vector. We are *not* providing serialisation of inode->i_flags access or updates here; all we need to do is ensure that the S_DAX flag is consistent and stable across an aops access region. If we are not in an aops call chain or will not call an aops method, we just don't care if the IS_DAX() call is racy as whatever we call is still static and if it's DAAX sensitive it can call IS_DAX() itself when needed. Again, this isn't about DAX at all, it's about being able to switch aops vectors in a safe and reliable manner. The IS_DAX() constraints are really a minor addition on top of the "stable aops vector" regions we are laying down here. > > Big problems I see here: > > > > 1. static key to turn it on globally. > > - just a gross hack around doing it properly with a per-sb > > mechanism and enbaling it only on filesystems that are on DAX > > capable block devices. > > Fair enough. This was a reaction to Darricks desire to get this out of the way > when DAX was not in the system. The static branch seemed like a good thing for > users who don't have DAX capable hardware while running kernels and FS's which > have DAX enabled. > > http://lkml.iu.edu/hypermail/linux/kernel/2001.1/05691.html I think that's largely premature optimisation, and it backs us into the "all or nothing" corner which is a bad place to be for what is per-filesystem functionality. > > - you're already passing in an inode to all these functions. It's > > trivial to do: > > > > if (!inode->i_sb->s_flags & S_DYNAMIC_AOPS) > > return > > /* do sb->s_aops_lock manipulation */ > > Yea that would be ok IMO. > > Darrick would just having this be CONFIG_FS_DAX as per this patch be ok with > you. I guess the static key may have been a bit of overkill? > > > > > 2. global lock > > - OMG! > > I know. The thinking on this is that the locking is percpu which is near > 0 overhead in the read case and we are rarely going to take exclusive access. The problem is that users can effectively run: $ xfs_io -c "chattr -R -D -x" /path/to/dax_fs And then it walks over millions of individual inodes turning off the DAX flag on each of them. And if each of those inodes takes a global per-cpu rwsem that blocks all read/write IO and page faults on everything even for a short time, then this is will have a major impact on the entire system and users will be very unhappy. > > - global lock will cause entire system IO/page fault stalls > > when someone does recursive/bulk DAX flag modification > > operations. Per-cpu rwsem Contention on large systems will be > > utterly awful. > > Yea that is probably bad... I certainly did not test the responsiveness of > this. FWIW if the system only has 1 FS the per-SB lock is not going to be any > different from the global which was part of our thinking. Don't forget that things like /proc, /sys, etc are all filesystems. Hence the global lock will affect accesses to -everything- in the system, not just the DAX enabled filesystem. Global locks are -bad-. > > - ext4's use case almost never hits the exclusive lock side of the > > percpu-rwsem - only when changing the journal mode flag on the > > inode. And it only affects writeback in progress, so it's not > > going to have massive concurrency on it like a VFS level global > > lock has. -> Bad model to follow. > > I admit I did not research the ext4's journal mode locking that much. > > > - per-sb lock is trivial - see #1 - which limits scope to single > > filesystem > > I agree and... well the commit message shows I actually implemented it that > way at first... :-/ > > > - per-inode rwsem would make this problem go away entirely. > > But would that be ok for concurrent read performance which is going to be used > 99.99% of the time? Maybe Darricks comments made me panic a little bit too > much overhead WRT locking and its potential impact on users not using DAX? I know that a rwsem in shared mode can easily handle 2M lock cycles/s across a 32p machine without contention (concurrent AIO-DIO read/writes to a single file) so the baseline performance of a rwsem is likely good enough to start from. It's simple, and we can run tests easily enough to find out where it starts to become a performance limitation. This whole "global percpu rwsem thing stinks like premature optimisation. Do the simple, straight forward thing first, then get numbers and analysis the limitations to determine what the second generation of the functionality needs to fix. IMO, we don't have to solve every scalability problem with the initial implementation. Let's just make it work first, then worry about extreme scalability when we have some idea of where those scalability problems are. > > 3. If we can use a global per-cpu rwsem, why can't we just use a > > per-inode rwsem? > > Per-cpu lock was attractive because of its near 0 overhead to take the read > lock which happens a lot during normal operations. > > > - locking context rules are the same > > - rwsem scales pretty damn well for shared ops > > Does it? I'm not sure. If you haven't tested it, then we are definitely in the realm of premature optimisation... > > > - no "global" contention problems > > - small enough that we can put another rwsem in the inode. > > Everything else I agree with... :-D > > > > > 4. "inode_dax_state_up_read" > > - Eye bleeds. > > - this is about the aops structure serialisation, not dax. > > - The name makes no sense in the places that it has been added. > > Again it is about more than just the aops. IS_DAX() needs to be protected in > all the file operations calls as well or we can get races with the logic in > those calls and a state switch. > > > > > 5. We already have code that does almost exactly what we need: the > > superblock freezing infrastructure. > > I think freeze does a lot more than we need. > > > - freezing implements a "hold operations on this superblock until > > further notice" model we can easily follow. > > - sb_start_write/sb_end_write provides a simple API model and a > > clean, clear and concise naming convention we can use, too. > > Ok as a model... If going with the per-SB lock. Ok, you completely missed my point. You're still looking at this as a set of "locks" and serialisation. Freezing is *not a lock* - it provides a series of "drain points" where we can transparently block new operations, then wait for all existing operations to complete so we can make a state change, and then once that is done we unblock all the waiters.... IOWs, the freeze model provides an ordered barrier mechanism, and that's precisely what we need for aops protection... Yes, it may be implemented with locks internally, but that's implementation detail of the barrier mechanism, not an indication what functionality it is actually providing. > After working through my > response I'm leaning toward a per-inode lock again. This was the way I did > this to begin with. > > I want feedback before reworking for V4, please. IMO, always do the simple thing first. Only do a more complex thing if the simple thing doesn't work or doesn't perform sufficiently well for an initial implemenation. Otherwise we end up with crazy solutions from premature optimisation that simply aren't viable. > > Really, I'm struggling to understand how we got to "global locking > > that stops the world" from "need to change per-inode state > > atomically". Can someone please explain to me why this patch isn't > > just a simple set of largely self-explanitory functions like this: > > > > XX_start_aop() > > XX_end_aop() > > > > XX_lock_aops() > > XX_switch_aops(aops) > > XX_unlock_aops() > > > > where "XX" is "sb" or "inode" depending on what locking granularity > > is used... > > Because failing to protect the logic around IS_DAX() results in failures in the > read/write and direct IO paths. > > So we need to lock the file operations as well. Again, there is no "locking" here. We just need to annotate regions where the aops vector must remain stable. If a fop calls an aop that the aops vector does not change while it is the middle of executing the fop. Hence such fops calls will need to be inside XX_start_aop()/XX_end_aop() markers. Cheers, Dave.
On Wed, Feb 12, 2020 at 08:49:17AM +1100, Dave Chinner wrote: > On Tue, Feb 11, 2020 at 12:14:31PM -0800, Ira Weiny wrote: > > On Tue, Feb 11, 2020 at 07:00:35PM +1100, Dave Chinner wrote: > > > On Sat, Feb 08, 2020 at 11:34:40AM -0800, ira.weiny@intel.com wrote: > > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > > > DAX requires special address space operations but many other functions > > > > check the IS_DAX() state. > > > > > > > > While DAX is a property of the inode we perfer a lock at the super block > > > > level because of the overhead of a rwsem within the inode. > > > > > > > > Define a vfs per superblock percpu rs semaphore to lock the DAX state > > > > > > ???? > > > > oops... I must have forgotten to update the commit message when I did the > > global RW sem. I implemented the per-SB, percpu rwsem first but it was > > suggested that the percpu nature of the lock combined with the anticipated > > infrequent use of the write side made using a global easier. > > > > But before I proceed on V4 I'd like to get consensus on which of the 2 locking > > models to go with. > > > > 1) percpu per superblock lock > > 2) per inode rwsem > > > > Depending on my mood I can convince myself of both being performant but the > > percpu is very attractive because I don't anticipate many changes of state > > during run time. OTOH concurrent threads accessing the same file at run time > > may also be low so there is likely to be little read contention across CPU's on > > the per-inode lock? > > > > Opinions? > > > > > > > > > while performing various VFS layer operations. Write lock calls are > > > > provided here but are used in subsequent patches by the file systems > > > > themselves. > > > > > > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > > > > > > > --- > > > > Changes from V2 > > > > > > > > Rebase on linux-next-08-02-2020 > > > > > > > > Fix locking order > > > > Change all references from mode to state where appropriate > > > > add CONFIG_FS_DAX requirement for state change > > > > Use a static branch to enable locking only when a dax capable > > > > device has been seen. > > > > > > > > Move the lock to a global vfs lock > > > > > > > > this does a few things > > > > 1) preps us better for ext4 support > > > > 2) removes funky callbacks from inode ops > > > > 3) remove complexity from XFS and probably from > > > > ext4 later > > > > > > > > We can do this because > > > > 1) the locking order is required to be at the > > > > highest level anyway, so why complicate xfs > > > > 2) We had to move the sem to the super_block > > > > because it is too heavy for the inode. > > > > 3) After internal discussions with Dan we > > > > decided that this would be easier, just as > > > > performant, and with slightly less overhead > > > > than in the VFS SB. > > > > > > > > We also change the functions names to up/down; > > > > read/write as appropriate. Previous names were over > > > > simplified. > > > > > > This, IMO, is a bit of a train wreck. > > > > > > This patch has nothing to do with "DAX state", it's about > > > serialising access to the aops vector. > > > > It is a bit more than just the aops vector. It also has to protect against > > checks of IS_DAX() which occur through many of the file operations. > > I think you are looking at this incorrectly. The IS_DAX() construct > is just determining what path to the aops method we take. regardless > of whether IS_DAX() is true or not, we need to execute an aop > method, and so aops vector protection is required regardless of how > IS_DAX() evaluates. > > But we do require IS_DAX() to evaluate consistently through an > entire aops protected region, as there may be multiple aops method > calls in a aops context (e.g. write page faults). Hence we have to > ensure that IS_DAX() changes atomically w.r.t. to the aops vector > switches. This is simply: > > sb_aops_lock() > inode->i_flags |= S_DAX > sb_aops_switch(new_aops) > sb_aops_unlock(). > > This guarantees that inside sb_aops_start/end(), IS_DAX() checks > are stable because we change the state atomically with the aops > vector. > > We are *not* providing serialisation of inode->i_flags access or > updates here; all we need to do is ensure that the S_DAX flag is > consistent and stable across an aops access region. If we are not > in an aops call chain or will not call an aops method, we just don't > care if the IS_DAX() call is racy as whatever we call is still > static and if it's DAAX sensitive it can call IS_DAX() itself when > needed. > > Again, this isn't about DAX at all, it's about being able to switch > aops vectors in a safe and reliable manner. The IS_DAX() constraints > are really a minor addition on top of the "stable aops vector" > regions we are laying down here. > > > > > Big problems I see here: > > > > > > 1. static key to turn it on globally. > > > - just a gross hack around doing it properly with a per-sb > > > mechanism and enbaling it only on filesystems that are on DAX > > > capable block devices. > > > > Fair enough. This was a reaction to Darricks desire to get this out of the way > > when DAX was not in the system. The static branch seemed like a good thing for > > users who don't have DAX capable hardware while running kernels and FS's which > > have DAX enabled. > > > > http://lkml.iu.edu/hypermail/linux/kernel/2001.1/05691.html > > I think that's largely premature optimisation, and it backs us into > the "all or nothing" corner which is a bad place to be for what is > per-filesystem functionality. Oh, wow, uh... this was a total misunderstanding. Having a per-inode primitive to grant ourselves the ability to change fops/aops safely was fine, I just wanted to have it compile out of existence if CONFIG_DAX=n (or some other cleverish way if this mount will never support DAX (e.g. scsi disk)). I wasn't asking for it to move to the superblock or become a Big Kernel Primitive. --D > > > > - you're already passing in an inode to all these functions. It's > > > trivial to do: > > > > > > if (!inode->i_sb->s_flags & S_DYNAMIC_AOPS) > > > return > > > /* do sb->s_aops_lock manipulation */ > > > > Yea that would be ok IMO. > > > > Darrick would just having this be CONFIG_FS_DAX as per this patch be ok with > > you. I guess the static key may have been a bit of overkill? > > > > > > > > 2. global lock > > > - OMG! > > > > I know. The thinking on this is that the locking is percpu which is near > > 0 overhead in the read case and we are rarely going to take exclusive access. > > The problem is that users can effectively run: > > $ xfs_io -c "chattr -R -D -x" /path/to/dax_fs > > And then it walks over millions of individual inodes turning off > the DAX flag on each of them. And if each of those inodes takes a > global per-cpu rwsem that blocks all read/write IO and page faults > on everything even for a short time, then this is will have a major > impact on the entire system and users will be very unhappy. > > > > - global lock will cause entire system IO/page fault stalls > > > when someone does recursive/bulk DAX flag modification > > > operations. Per-cpu rwsem Contention on large systems will be > > > utterly awful. > > > > Yea that is probably bad... I certainly did not test the responsiveness of > > this. FWIW if the system only has 1 FS the per-SB lock is not going to be any > > different from the global which was part of our thinking. > > Don't forget that things like /proc, /sys, etc are all filesystems. > Hence the global lock will affect accesses to -everything- in the > system, not just the DAX enabled filesystem. Global locks are -bad-. > > > > - ext4's use case almost never hits the exclusive lock side of the > > > percpu-rwsem - only when changing the journal mode flag on the > > > inode. And it only affects writeback in progress, so it's not > > > going to have massive concurrency on it like a VFS level global > > > lock has. -> Bad model to follow. > > > > I admit I did not research the ext4's journal mode locking that much. > > > > > - per-sb lock is trivial - see #1 - which limits scope to single > > > filesystem > > > > I agree and... well the commit message shows I actually implemented it that > > way at first... :-/ > > > > > - per-inode rwsem would make this problem go away entirely. > > > > But would that be ok for concurrent read performance which is going to be used > > 99.99% of the time? Maybe Darricks comments made me panic a little bit too > > much overhead WRT locking and its potential impact on users not using DAX? > > I know that a rwsem in shared mode can easily handle 2M lock > cycles/s across a 32p machine without contention (concurrent AIO-DIO > read/writes to a single file) so the baseline performance of a rwsem > is likely good enough to start from. > > It's simple, and we can run tests easily enough to find out where it > starts to become a performance limitation. This whole "global percpu > rwsem thing stinks like premature optimisation. Do the simple, > straight forward thing first, then get numbers and analysis the > limitations to determine what the second generation of the > functionality needs to fix. > > IMO, we don't have to solve every scalability problem with the > initial implementation. Let's just make it work first, then worry > about extreme scalability when we have some idea of where those > scalability problems are. > > > > 3. If we can use a global per-cpu rwsem, why can't we just use a > > > per-inode rwsem? > > > > Per-cpu lock was attractive because of its near 0 overhead to take the read > > lock which happens a lot during normal operations. > > > > > - locking context rules are the same > > > - rwsem scales pretty damn well for shared ops > > > > Does it? I'm not sure. > > If you haven't tested it, then we are definitely in the realm of > premature optimisation... > > > > > > - no "global" contention problems > > > - small enough that we can put another rwsem in the inode. > > > > Everything else I agree with... :-D > > > > > > > > 4. "inode_dax_state_up_read" > > > - Eye bleeds. > > > - this is about the aops structure serialisation, not dax. > > > - The name makes no sense in the places that it has been added. > > > > Again it is about more than just the aops. IS_DAX() needs to be protected in > > all the file operations calls as well or we can get races with the logic in > > those calls and a state switch. > > > > > > > > 5. We already have code that does almost exactly what we need: the > > > superblock freezing infrastructure. > > > > I think freeze does a lot more than we need. > > > > > - freezing implements a "hold operations on this superblock until > > > further notice" model we can easily follow. > > > - sb_start_write/sb_end_write provides a simple API model and a > > > clean, clear and concise naming convention we can use, too. > > > > Ok as a model... If going with the per-SB lock. > > Ok, you completely missed my point. You're still looking at this as > a set of "locks" and serialisation. > > Freezing is *not a lock* - it provides a series of "drain points" > where we can transparently block new operations, then wait for all > existing operations to complete so we can make a state change, and > then once that is done we unblock all the waiters.... > > IOWs, the freeze model provides an ordered barrier mechanism, and > that's precisely what we need for aops protection... > > Yes, it may be implemented with locks internally, but that's > implementation detail of the barrier mechanism, not an indication > what functionality it is actually providing. > > > After working through my > > response I'm leaning toward a per-inode lock again. This was the way I did > > this to begin with. > > > > I want feedback before reworking for V4, please. > > IMO, always do the simple thing first. > > Only do a more complex thing if the simple thing doesn't work or > doesn't perform sufficiently well for an initial implemenation. > Otherwise we end up with crazy solutions from premature optimisation > that simply aren't viable. > > > > Really, I'm struggling to understand how we got to "global locking > > > that stops the world" from "need to change per-inode state > > > atomically". Can someone please explain to me why this patch isn't > > > just a simple set of largely self-explanitory functions like this: > > > > > > XX_start_aop() > > > XX_end_aop() > > > > > > XX_lock_aops() > > > XX_switch_aops(aops) > > > XX_unlock_aops() > > > > > > where "XX" is "sb" or "inode" depending on what locking granularity > > > is used... > > > > Because failing to protect the logic around IS_DAX() results in failures in the > > read/write and direct IO paths. > > > > So we need to lock the file operations as well. > > Again, there is no "locking" here. We just need to annotate regions > where the aops vector must remain stable. If a fop calls an aop that > the aops vector does not change while it is the middle of executing > the fop. Hence such fops calls will need to be inside > XX_start_aop()/XX_end_aop() markers. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 7d4d09dd5e6d..cd011ceb4b72 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -934,6 +934,23 @@ cache in your filesystem. The following members are defined: Called during swapoff on files where swap_activate was successful. +Changing DAX 'state' dynamically +---------------------------------- + +Some file systems which support DAX want to be able to change the DAX state +dyanically. To switch the state safely we lock the inode state in all "normal" +file system operations and restrict state changes to those operations. The +specific rules are. + + 1) the direct_IO address_space_operation must be supported in all + potential a_ops vectors for any state suported by the inode. + 2) FS's should enable the static branch lock_dax_state_static_key when a DAX + capable device is detected. + 3) DAX state changes shall not be allowed while the file is mmap'ed + 4) For non-mmaped operations the VFS layer must take the read lock for any + use of IS_DAX() + 5) Filesystems take the write lock when changing DAX states. + The File Object =============== diff --git a/fs/attr.c b/fs/attr.c index b4bbdbd4c8ca..9b15f73d1079 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -332,6 +332,7 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de if (error) return error; + /* DAX read state should already be held here */ if (inode->i_op->setattr) error = inode->i_op->setattr(dentry, attr); else diff --git a/fs/dax.c b/fs/dax.c index 1f1f0201cad1..96136866f151 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -30,6 +30,9 @@ #define CREATE_TRACE_POINTS #include <trace/events/fs_dax.h> +DEFINE_STATIC_KEY_FALSE(lock_dax_state_static_key); +EXPORT_SYMBOL(lock_dax_state_static_key); + static inline unsigned int pe_order(enum page_entry_size pe_size) { if (pe_size == PE_SIZE_PTE) diff --git a/fs/inode.c b/fs/inode.c index 7d57068b6b7a..7d0227f9e3e8 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1616,11 +1616,19 @@ EXPORT_SYMBOL(iput); */ int bmap(struct inode *inode, sector_t *block) { - if (!inode->i_mapping->a_ops->bmap) - return -EINVAL; + int ret = 0; + + inode_dax_state_down_read(inode); + if (!inode->i_mapping->a_ops->bmap) { + ret = -EINVAL; + goto err; + } *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); - return 0; + +err: + inode_dax_state_up_read(inode); + return ret; } EXPORT_SYMBOL(bmap); #endif diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 7c84c4c027c4..e313a34d5fa6 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -999,6 +999,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count, offset = offset_in_page(pos); bytes = min_t(loff_t, PAGE_SIZE - offset, count); + /* DAX state read should already be held here */ if (IS_DAX(inode)) status = iomap_dax_zero(pos, offset, bytes, iomap); else diff --git a/fs/open.c b/fs/open.c index 0788b3715731..148980e30611 100644 --- a/fs/open.c +++ b/fs/open.c @@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs, if (ret) newattrs.ia_valid |= ret | ATTR_FORCE; + inode_dax_state_down_read(dentry->d_inode); inode_lock(dentry->d_inode); /* Note any delegations or leases have already been broken: */ ret = notify_change(dentry, &newattrs, NULL); inode_unlock(dentry->d_inode); + inode_dax_state_up_read(dentry->d_inode); return ret; } @@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) return -EOPNOTSUPP; file_start_write(file); + inode_dax_state_down_read(inode); ret = file->f_op->fallocate(file, mode, offset, len); + inode_dax_state_up_read(inode); /* * Create inotify and fanotify events. diff --git a/fs/stat.c b/fs/stat.c index 894699c74dde..bf8841314c08 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -79,8 +79,10 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat, if (IS_AUTOMOUNT(inode)) stat->attributes |= STATX_ATTR_AUTOMOUNT; + inode_dax_state_down_read(inode); if (IS_DAX(inode)) stat->attributes |= STATX_ATTR_DAX; + inode_dax_state_up_read(inode); if (inode->i_op->getattr) return inode->i_op->getattr(path, stat, request_mask, diff --git a/fs/super.c b/fs/super.c index cd352530eca9..3e26e3a1d860 100644 --- a/fs/super.c +++ b/fs/super.c @@ -51,6 +51,9 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = { "sb_internal", }; +DEFINE_PERCPU_RWSEM(sb_dax_rwsem); +EXPORT_SYMBOL(sb_dax_rwsem); + /* * One thing we have to be careful of with a per-sb shrinker is that we don't * drop the last active reference to the superblock from within the shrinker. diff --git a/include/linux/fs.h b/include/linux/fs.h index 63d1e533a07d..1a22cd94c4ab 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -40,6 +40,7 @@ #include <linux/fs_types.h> #include <linux/build_bug.h> #include <linux/stddef.h> +#include <linux/percpu-rwsem.h> #include <asm/byteorder.h> #include <uapi/linux/fs.h> @@ -359,6 +360,11 @@ typedef struct { typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long); +/** + * NOTE: DO NOT define new functions in address_space_operations without first + * considering how dynamic DAX states are to be supported. See the + * inode_dax_state_*_read() functions + */ struct address_space_operations { int (*writepage)(struct page *page, struct writeback_control *wbc); int (*readpage)(struct file *, struct page *); @@ -1817,6 +1823,11 @@ struct block_device_operations; struct iov_iter; +/** + * NOTE: DO NOT define new functions in file_operations without first + * considering how dynamic DAX states are to be supported. See the + * inode_dax_state_*_read() functions + */ struct file_operations { struct module *owner; loff_t (*llseek) (struct file *, loff_t, int); @@ -1889,16 +1900,79 @@ struct inode_operations { int (*set_acl)(struct inode *, struct posix_acl *, int); } ____cacheline_aligned; +#if defined(CONFIG_FS_DAX) +/* + * Filesystems wishing to support dynamic DAX states must do the following. + * + * 1) the direct_IO address_space_operation must be supported in all + * potential a_ops vectors for any state suported by the inode. + * 2) FS's should enable the static branch lock_dax_state_static_key when a DAX + * capable device is detected. + * 3) DAX state changes shall not be allowed while the file is mmap'ed + * 4) For non-mmaped operations the VFS layer must take the read lock for any + * use of IS_DAX() + * 5) Filesystems take the write lock when changing DAX states. + */ +DECLARE_STATIC_KEY_FALSE(lock_dax_state_static_key); +extern struct percpu_rw_semaphore sb_dax_rwsem; +static inline void inode_dax_state_down_read(struct inode *inode) +{ + if (!static_branch_unlikely(&lock_dax_state_static_key)) + return; + percpu_down_read(&sb_dax_rwsem); +} +static inline void inode_dax_state_up_read(struct inode *inode) +{ + if (!static_branch_unlikely(&lock_dax_state_static_key)) + return; + percpu_up_read(&sb_dax_rwsem); +} +static inline void inode_dax_state_down_write(struct inode *inode) +{ + if (!static_branch_unlikely(&lock_dax_state_static_key)) + return; + percpu_down_write(&sb_dax_rwsem); +} +static inline void inode_dax_state_up_write(struct inode *inode) +{ + if (!static_branch_unlikely(&lock_dax_state_static_key)) + return; + percpu_up_write(&sb_dax_rwsem); +} +static inline void enable_dax_state_static_branch(void) +{ + static_branch_enable(&lock_dax_state_static_key); +} +#else /* !CONFIG_FS_DAX */ +#define inode_dax_state_down_read(inode) do { (void)(inode); } while (0) +#define inode_dax_state_up_read(inode) do { (void)(inode); } while (0) +#define inode_dax_state_down_write(inode) do { (void)(inode); } while (0) +#define inode_dax_state_up_write(inode) do { (void)(inode); } while (0) +#define enable_dax_state_static_branch() +#endif /* CONFIG_FS_DAX */ + static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio, struct iov_iter *iter) { - return file->f_op->read_iter(kio, iter); + struct inode *inode = file_inode(kio->ki_filp); + ssize_t ret; + + inode_dax_state_down_read(inode); + ret = file->f_op->read_iter(kio, iter); + inode_dax_state_up_read(inode); + return ret; } static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio, struct iov_iter *iter) { - return file->f_op->write_iter(kio, iter); + struct inode *inode = file_inode(kio->ki_filp); + ssize_t ret; + + inode_dax_state_down_read(inode); + ret = file->f_op->write_iter(kio, iter); + inode_dax_state_up_read(inode); + return ret; } static inline int call_mmap(struct file *file, struct vm_area_struct *vma) diff --git a/mm/fadvise.c b/mm/fadvise.c index 4f17c83db575..ac85eb778c74 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -47,7 +47,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) bdi = inode_to_bdi(mapping->host); + inode_dax_state_down_read(inode); if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) { + int ret = 0; + switch (advice) { case POSIX_FADV_NORMAL: case POSIX_FADV_RANDOM: @@ -58,10 +61,13 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) /* no bad return value, but ignore advice */ break; default: - return -EINVAL; + ret = -EINVAL; } - return 0; + + inode_dax_state_up_read(inode); + return ret; } + inode_dax_state_up_read(inode); /* * Careful about overflows. Len == 0 means "as much as possible". Use diff --git a/mm/filemap.c b/mm/filemap.c index 1784478270e1..3a7863ba51b9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter) * and return. Otherwise fallthrough to buffered io for * the rest of the read. Buffered reads will not work for * DAX files, so don't bother trying. + * + * IS_DAX is protected under ->read_iter lock */ if (retval < 0 || !count || iocb->ki_pos >= size || IS_DAX(inode)) @@ -3377,6 +3379,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from) * holes, for example. For DAX files, a buffered write will * not succeed (even if it did, DAX does not handle dirty * page-cache pages correctly). + * + * IS_DAX is protected under ->write_iter lock */ if (written < 0 || !iov_iter_count(from) || IS_DAX(inode)) goto out; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b08b199f9a11..3d05bd10d83e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -572,6 +572,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long ret; loff_t off = (loff_t)pgoff << PAGE_SHIFT; + /* Should not need locking here because mmap is not allowed */ if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD)) goto out; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index b679908743cb..3bec46277886 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm, } else { /* !is_shmem */ if (!page || xa_is_value(page)) { xas_unlock_irq(&xas); + inode_dax_state_down_read(file->f_inode); page_cache_sync_readahead(mapping, &file->f_ra, file, index, PAGE_SIZE); + inode_dax_state_up_read(file->f_inode); /* drain pagevecs to help isolate_lru_page() */ lru_add_drain(); page = find_lock_page(mapping, index); diff --git a/mm/madvise.c b/mm/madvise.c index 43b47d3fae02..419b7c26216b 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -275,10 +275,13 @@ static long madvise_willneed(struct vm_area_struct *vma, return -EBADF; #endif + inode_dax_state_down_read(file_inode(file)); if (IS_DAX(file_inode(file))) { + inode_dax_state_up_read(file_inode(file)); /* no bad return value, but ignore advice */ return 0; } + inode_dax_state_up_read(file_inode(file)); /* * Filesystem's fadvise may need to take various locks. We need to diff --git a/mm/util.c b/mm/util.c index 988d11e6c17c..8dfb9958f2a6 100644 --- a/mm/util.c +++ b/mm/util.c @@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr, ret = security_mmap_file(file, prot, flag); if (!ret) { - if (down_write_killable(&mm->mmap_sem)) + if (file) + inode_dax_state_down_read(file_inode(file)); + if (down_write_killable(&mm->mmap_sem)) { + if (file) + inode_dax_state_up_read(file_inode(file)); return -EINTR; + } ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, &populate, &uf); up_write(&mm->mmap_sem); + if (file) + inode_dax_state_up_read(file_inode(file)); userfaultfd_unmap_complete(mm, &uf); if (populate) mm_populate(ret, populate);