Message ID: 20180907015155.32559-1-david@fromorbit.com (mailing list archive)
State: Accepted
Series: xfs: avoid lockdep false positives in xfs_trans_alloc
On Fri, Sep 07, 2018 at 11:51:55AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> We've had a few reports of lockdep tripping over memory reclaim
> context vs filesystem freeze "deadlocks". They all have looked
> to be false positives on analysis, but it seems that they are
> being tripped because we take freeze references before we run
> a GFP_KERNEL allocation for the struct xfs_trans.
>
> We can avoid this false positive vector just by re-ordering the
> operations in xfs_trans_alloc(). That is, we need to allocate the
> structure before we take the freeze reference and enter the GFP_NOFS
> allocation context that follows the xfs_trans around. This prevents
> lockdep from seeing the GFP_KERNEL allocation inside the transaction
> context, and that prevents it from triggering the freeze level vs
> alloc context vs reclaim warnings.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

We have an old RHEL bug for this problem and I recall dwelling on this exact change for a bit. I don't recall whether I just lost track of it, there was some kind of problem, or I just didn't think it worth the change to quiet a false positive. I think the latter, because I don't remember actually testing it and don't otherwise have any notes to suggest it's problematic in any way.

The only difference I can think of is that on a frozen fs, we'd allow some number of transaction allocations to occur and either immediately block on the freeze/write lock or drop into reclaim. Those instances should be limited, however, because most user-driven operations should block at the write level rather than the internal/transaction level.
I'll run some tests to see if anything explodes, but in the meantime this seems reasonable to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_trans.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index bedc5a5133a5..912b42f5fe4a 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -259,6 +259,14 @@ xfs_trans_alloc(
>  	struct xfs_trans	*tp;
>  	int			error;
>
> +	/*
> +	 * Allocate the handle before we do our freeze accounting and setting up
> +	 * GFP_NOFS allocation context so that we avoid lockdep false positives
> +	 * by doing GFP_KERNEL allocations inside sb_start_intwrite().
> +	 */
> +	tp = kmem_zone_zalloc(xfs_trans_zone,
> +		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
> +
>  	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
>  		sb_start_intwrite(mp->m_super);
>
> @@ -270,8 +278,6 @@ xfs_trans_alloc(
>  		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
>  	atomic_inc(&mp->m_active_trans);
>
> -	tp = kmem_zone_zalloc(xfs_trans_zone,
> -		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
>  	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
>  	tp->t_flags = flags;
>  	tp->t_mountp = mp;
> --
> 2.17.0
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
[CC Ted and Jan to see if there are lessons here that apply to ext2 ext4]

On Fri, Sep 7, 2018 at 6:03 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> We've had a few reports of lockdep tripping over memory reclaim
> context vs filesystem freeze "deadlocks". They all have looked
> to be false positives on analysis, but it seems that they are
> being tripped because we take freeze references before we run
> a GFP_KERNEL allocation for the struct xfs_trans.
[...]
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Dave,

First of all, you may add

Tested-by: Amir Goldstein <amir73il@gmail.com>

and I think attribution of Reported-by [2] to Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> is called for.

I was getting the lockdep warning below reliably with the stress test overlay/019 (over base fs xfs with reflink) ever since kernel v4.18. The warning is tripped on my system after 2 minutes of stress test.

The possibly interesting part about this particular splat is that, unlike previously reported traces [1][2], sb_internal is not taken by kswapd from the pagewrite path, which, as you wrote, is not possible past the internal freeze level. In my splats sb_internal is taken by kswapd from the dcache shrink path.

It's true that overlayfs adds some quirks to the equation, and this specific stress test (by Ted) is especially evil for making changes on the overlay lower layer.
This could result, for example, in an unlinked xfs inode with an elevated reference from an overlay dentry which wasn't informed about the underlying unlink. However, the case of an unlinked inode reference being held by an open file or by an inotify event doesn't seem so unlikely on a frozen filesystem, so I am not really sure that the "false positive" claim holds water. If you agree with this argument, you might want to tone down the "false positive" language in the commit message, and you may name overlay/019 as a reliable reproducer if you see fit.

Thanks, Amir.

>8->8->8->8 extracted only the interesting snip >8->8->8->8

-> #0 (sb_internal#2){.+.+}:
       lock_acquire+0x150/0x182
       __sb_start_write+0x93/0x164
       xfs_trans_alloc+0x3c/0x1ca
       xfs_inactive_truncate+0x3d/0x122
       xfs_inactive+0x15a/0x1f4
       xfs_fs_destroy_inode+0x160/0x246
       __dentry_kill+0xdd/0x159
       dentry_kill+0x138/0x173
       dput+0x230/0x23d
       ovl_destroy_inode+0x15/0x69
       __dentry_kill+0xdd/0x159
       shrink_dentry_list+0x286/0x2c5
       prune_dcache_sb+0x5a/0x78
       super_cache_scan+0xfe/0x17b
       shrink_slab.constprop.31+0x27a/0x438
       shrink_node+0x74/0x251
       kswapd+0x502/0x62f

[1] https://marc.info/?l=linux-xfs&m=152954339303385&w=2
[2] https://marc.info/?l=linux-xfs&m=153627727029090&w=2
On Sun, Sep 30, 2018 at 10:56:02AM +0300, Amir Goldstein wrote:
> [CC Ted and Jan to see if there are lessons here that apply to ext2 ext4]
>
> On Fri, Sep 7, 2018 at 6:03 AM Dave Chinner <david@fromorbit.com> wrote:
[...]
> Dave,
>
> First of all, you may add
> Tested-by: Amir Goldstein <amir73il@gmail.com>

Too late, already pushed.

> and I think attribution of Reported-by [2] to
> Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> is called for.

Not for multiply-reported bugs. i.e. this has been reported in RHEL, by syzbot, by various users and several developers, so giving attribution to just one person is not the right thing to do. It's either all or none, and given that the commit message is not for tracking a list of people who hit the issue, I decided "none".

> I was getting the lockdep warning below reliably with the stress test
> overlay/019 (over base fs xfs with reflink) ever since kernel v4.18.
> The warning is tripped on my system after 2 minutes of stress test.
> The possibly interesting part about this particular splat is that, unlike
> previously reported traces [1][2], sb_internal is not taken by kswapd
> from the pagewrite path, which as you wrote is not possible during freeze
> level internal. In my splats sb_internal is taken by kswapd from
> the dcache shrink path.

Which is exactly the same case. i.e. a transaction is being run from kswapd's reclaim context. It doesn't matter if it's an extent allocation transaction in the direct page writeback path, or prune_dcache_sb() killing a dentry and dropping the last reference to an unlinked inode triggering a truncate, or indeed prune_icache_sb() dropping an inode off the LRU and triggering a truncate of speculatively preallocated blocks beyond EOF.

i.e. lockdep is warning about a transaction being run in kswapd's reclaim context - this is something we are allowed to do (and need to do to make forwards progress) because the kswapd reclaim context is GFP_KERNEL....

Cheers, Dave.
On Mon, Oct 1, 2018 at 4:09 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Sep 30, 2018 at 10:56:02AM +0300, Amir Goldstein wrote:
> > [CC Ted and Jan to see if there are lessons here that apply to ext2 ext4]
[...]
> > The possibly interesting part about this particular splat is that, unlike
> > previously reported traces [1][2], sb_internal is not taken by kswapd
> > from the pagewrite path, which as you wrote is not possible during freeze
> > level internal. In my splats sb_internal is taken by kswapd from
> > the dcache shrink path.
>
> Which is exactly the same case. i.e. a transaction is being run
> from kswapd's reclaim context.
> It doesn't matter if it's an extent
> allocation transaction in the direct page writeback path, or
> prune_dcache_sb() killing a dentry and dropping the last reference
> to an unlinked inode triggering a truncate, or indeed prune_icache_sb
> dropping an inode off the LRU and triggering a truncate of
> speculatively preallocated blocks beyond EOF.
>
> i.e. Lockdep is warning about a transaction being run in kswapd's
> reclaim context - this is something we are allowed to do (and need
> to do to make forwards progress) because the kswapd reclaim context
> is GFP_KERNEL....

I understand that, but I would still like to ask for a clarification on one point. In response to one of the lockdep warning reports [1] you wrote:

"It's not a deadlock - for anything to deadlock in this path, we have to be in the middle of a freeze and have frozen the transaction subsystem. Which we cannot do until we've cleaned all the dirty cached pages in the filesystem and frozen all new writes. Which means kswapd cannot enter this direct writeback path because we can't have dirty pages on the filesystem."

I don't see how this argument holds for the shrinker case. That is, if the filesystem is already past freezing the transaction subsystem and then kswapd comes along and runs the shrinkers. So while the statement "it's not a deadlock" may still be true, I am not yet convinced that the claim that there are no dirty pages to write when the filesystem is frozen is sufficient to back that claim.

Are you sure there was no deadlock lurking in there while the fs is past SB_FREEZE_FS and the kswapd shrinker races with another process releasing the last reference to an(other) inode?

I appreciate your patience; I would really like to better understand all the aspects of freezing (it is essential for ovl snapshots).

Thanks, Amir.

[1] https://marc.info/?l=linux-xfs&m=153627727029090&w=2
On Mon, Oct 01, 2018 at 10:56:25AM +0300, Amir Goldstein wrote:
> On Mon, Oct 1, 2018 at 4:09 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sun, Sep 30, 2018 at 10:56:02AM +0300, Amir Goldstein wrote:
[...]
> > > The possibly interesting part about this particular splat is that, unlike
> > > previously reported traces [1][2], sb_internal is not taken by kswapd
> > > from the pagewrite path, which as you wrote is not possible during freeze
> > > level internal. In my splats sb_internal is taken by kswapd from
> > > the dcache shrink path.
> >
> > Which is exactly the same case. i.e.
> > a transaction is being run
> > from kswapd's reclaim context. It doesn't matter if it's an extent
> > allocation transaction in the direct page writeback path, or
> > prune_dcache_sb() killing a dentry and dropping the last reference
> > to an unlinked inode triggering a truncate, or indeed prune_icache_sb()
> > dropping an inode off the LRU and triggering a truncate of
> > speculatively preallocated blocks beyond EOF.
> >
> > i.e. Lockdep is warning about a transaction being run in kswapd's
> > reclaim context - this is something we are allowed to do (and need
> > to do to make forwards progress) because the kswapd reclaim context
> > is GFP_KERNEL....
>
> I understand that, but I would still like to ask for a clarification on one
> point. In response to one of the lockdep warning reports [1] you wrote:
>
> "It's not a deadlock - for anything to deadlock in this path, we have
> to be in the middle of a freeze and have frozen the transaction
> subsystem. Which we cannot do until we've cleaned all the dirty
> cached pages in the filesystem and frozen all new writes. Which means
> kswapd cannot enter this direct writeback path because we can't have
> dirty pages on the filesystem."
>
> I don't see how this argument holds for the shrinker case.
> That is, if filesystem is already past freezing the transaction subsystem
> and then kswapd comes along and runs the shrinkers.

We've already cleaned all the dirty inodes before we pruned out all the reclaimable inodes from the cache as part of the freeze process. Hence the XFS filesystem should not be holding any inodes that need post-release transactions to be run. i.e. freeze has already cleaned up anything that shrinkers might trip over.

> So while the statement "it's not a deadlock" may still be true, I am not
> yet convinced that the claim that there are no dirty pages to write when
> filesystem is frozen is sufficient to back that claim.
> Are you sure there was no deadlock lurking in there while fs is past
> SB_FREEZE_FS and kswapd shrinker races with another process
> releasing the last reference to an(other) inode?

The inodes being released by the shrinkers should be clean, and hence releasing them does not require post-release transactions to be run.

It does concern me that the overlay dcache shrinker is dropping the last reference to an XFS inode and it does not get put on the LRU for the correct superblock inode cache shrinker to free it. That implies that the overlay dcache shrinker is dropping the last reference to an unlinked inode.

AFAIA, the dcache shrinker should never be freeing the last reference to an unlinked inode - it should always be done from the context that unlinked the inode or the context that closed the final fd on that inode, i.e. a task context, not a shrinker or kswapd. Can you confirm what the state of the inode being dropped in that lockdep trace is?

Cheers, Dave.
On Tue, Oct 2, 2018 at 1:32 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Oct 01, 2018 at 10:56:25AM +0300, Amir Goldstein wrote:
[...]
> > So while the statement "it's not a deadlock" may still be true, I am not
> > yet convinced that the claim that there are no dirty pages to write when
> > filesystem is frozen is sufficient to back that claim.
> >
> > Are you sure there was no deadlock lurking in there while fs is past
> > SB_FREEZE_FS and kswapd shrinker races with another process
> > releasing the last reference to an(other) inode?
>
> The inodes being released by the shrinkers should be clean, and
> hence releasing them does not require post-release transactions to
> be run.
>
> It does concern me that the overlay dcache shrinker is dropping the
> last reference to an XFS inode and it does not get put on the LRU
> for the correct superblock inode cache shrinker to free it. That
> implies that the overlay dcache shrinker is dropping the last
> reference to an unlinked inode.

Yes, there is nothing that prevents overlayfs (or ecryptfs) from ending up with the last reference on an underlying unlinked inode. It will require an action of unlink from the underlying layer, which is not normal user behavior, but it is possible.

> AFAIA, the dcache shrinker should never be freeing the last
> reference to an unlinked inode - it should always be done from the
> context that unlinked the inode or the context that closed the final
> fd on that inode. i.e. a task context, not a shrinker or kswapd. Can
> you confirm what the state of the inode being dropped in that
> lockdep trace is?

Given that the stress test runs fsstress in parallel (same seed) on the overlay and the underlying lower fs, I would say that it is very likely to trip on the shrinker putting an overlay dentry/inode holding the last reference to an unlinked underlying fs inode.

How concerned are you about this? Is it inherently hard to deal with this situation in XFS?
Pardon my ignorance, but can't the memory shrinker trigger the OOM killer, indirectly causing deleted inodes to be released? I'm not sure in which context those files/inodes get released.

Thanks, Amir.
On Tue, Oct 02, 2018 at 07:02:08AM +0300, Amir Goldstein wrote:
> On Tue, Oct 2, 2018 at 1:32 AM Dave Chinner <david@fromorbit.com> wrote:
[...]
> > It does concern me that the overlay dcache shrinker is dropping the
> > last reference to an XFS inode and it does not get put on the LRU
> > for the correct superblock inode cache shrinker to free it. That
> > implies that the overlay dcache shrinker is dropping the last
> > reference to an unlinked inode.
>
> Yes, there is nothing that prevents overlayfs (or ecryptfs) to end up
> with the last reference on an underlying unlinked inode.

/me groans

> It will require an action of unlink from underlying layer, which is not
> normal user behavior, but it is possible.

Foot, meet Shotgun.

> > AFAIA, the dcache shrinker should never be freeing the last
> > reference to an unlinked inode - it should always be done from the
> > context that unlinked the inode or the context that closed the final
> > fd on that inode. i.e. a task context, not a shrinker or kswapd. Can
> > you confirm what the state of the inode being dropped in that
> > lockdep trace is?
> Given that the stress test runs fsstress in parallel (same seed) on
> overlay and underlying lower fs, I would say that it is very likely to
> trip on shrinker putting an overlay dentry/inode holding the last reference
> to an unlinked underlying fs inode.

The shrinker still shouldn't trip on them. Whatever releases the last reference to an unlinked file should be freeing the inode.

> How concerned are you about this?

Scares the hell outta me.

> Is it inherently hard to deal with
> this situation in XFS?

Nothing to do with the messenger. Have a look at shrink_dcache_sb() and shrink_dcache_for_umount() and what they imply about the dentries that take references to an inode on a different superblock. Then look at generic_shutdown_super() - pay attention to what happens if there are still allocated inodes after all the superblock dentries have been pruned and inodes evicted. i.e. this will trigger if some other superblock holds references to them:

	if (!list_empty(&sb->s_inodes)) {
		printk("VFS: Busy inodes after unmount of %s. "
		       "Self-destruct in 5 seconds. Have a nice day...\n",
		       sb->s_id);
	}

The per-superblock cache infrastructure is built around the assumption that the dentries referencing an inode all point to the same superblock as the inode. Hence to free all the external references to the superblock's inode cache, all we need to do is prune the superblock's dentry cache. If something else is taking external references to an inode, then this all breaks down.

The superblock shrinker is based around the same assumptions and so makes the further assumption that it doesn't interact directly with any other superblock. i.e. a superblock shrinker will only change the state of objects the superblock directly owns. This means a filesystem implementation only has to worry about its own shrinker(s) when considering memory reclaim recursion and other issues like invalidating caches to prevent shrinker actions when frozen.
If we have cross-superblock object references then it gets very complicated very quickly. Freeze is a good example, as you suspected. i.e. if the lower filesystem is frozen but overlay holds open unlinked files on the lower filesystem, then overlay is going to get blocked by the lower frozen filesystem when it tries to release the final inode reference. If overlay puts unlinked dentries on its LRU where the superblock shrinker may clean them up and release the final reference to unlinked inodes, then whatever calls the shrinker will get blocked. If kswapd does the shrinking, then the whole system can lock up because kswapd can't make progress until the filesystem is unfrozen. And if the process that does that unfreezing needs memory....

I can think of several other similar ways that we can probably be screwed by cross-superblock references and memory reclaim interactions. I can't think of any way to avoid them except for not getting into that mess in the first place.

> pardon my ignorance, but can't memory shrinker
> trigger oom killer indirectly causing to release deleted inodes?
> Not sure in which context those files/inodes get released?

The OOM killer doesn't free anything. It marks a process as being killed, then when that process context gets scheduled it exits, releasing all the open files it holds and so dropping the last reference to the dentries in task context, not the OOM kill context.

Cheers, Dave.
On Tue, Oct 2, 2018 at 8:39 AM, Dave Chinner <david@fromorbit.com> wrote:

> Have a look at shrink_dcache_sb() and shrink_dcache_for_umount() and
> what they imply about the dentries that take references to an inode
> on a different superblock. Then look at generic_shutdown_super() -
> pay attention to what happens if there are still allocated inodes
> after all the superblock dentries have been pruned and inodes
> evicted. i.e. this will trigger if some other superblock holds
> references to them:
>
>	if (!list_empty(&sb->s_inodes)) {
>		printk("VFS: Busy inodes after unmount of %s. "
>		       "Self-destruct in 5 seconds. Have a nice day...\n",
>		       sb->s_id);
>	}

Overlay holds references to the underlying sb's (through a set of internal mounts), so that's not going to happen.

> If overlay puts unlinked dentries on it's LRU where the superblock
> shrinker may clean them up and release the final reference to
> unlinked inodes, then whatever calls the shrinker will get blocked.
> If kswapd does the shrinking, then the whole system can lock up
> because kswapd can't make progress until the filesystem is unfrozen.
> And if the process that does that unfreezing needs memory....

Seems like freezing any of the layers if overlay itself is not frozen is not a good idea.

Preventing unlink (or modification generally) on the underlying layers if part of an overlay is also doable, but it would likely run into such backward compat issues that no one would be happy. So I don't think that's the way to go now. We could have done that initially, but it turns out allowing modification of the underlying layers can be useful at times.

> I can think of several other similar ways that we can probably be
> screwed by cross-superblock references and memory reclaim
> interactions. I can't think of any way to avoid them except for
> not getting into that mess in the first place.

The freezer interaction can be solved by letting the freezer know about the dependencies of filesystems.
So if an underlying layer needs to be frozen, then all stacks containing that underlying layer would need to be frozen first. I guess any of the other bad interactions would be solvable in a similar manner.

Thanks, Miklos

> > pardon my ignorance, but can't memory shrinker
> > trigger oom killer indirectly causing to release deleted inodes?
> > Not sure in which context those files/inodes get released?
>
> The oom killer doesn't free anything. It marks a process as being
> killed, then when that process context gets scheduled it exits,
> releasing all the open files it holds and so dropping the last
> reference to the dentries in task context, not the OOM kill context.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Tue, Oct 02, 2018 at 09:33:30AM +0200, Miklos Szeredi wrote:
> On Tue, Oct 2, 2018 at 8:39 AM, Dave Chinner <david@fromorbit.com> wrote:
[...]
> Overlay holds references to the underlying sb's (through a set of
> internal mounts), so that's not going to happen.

Ok, so that example is not going to trigger. That doesn't mean it isn't a problem, though....

> > If overlay puts unlinked dentries on it's LRU where the superblock
> > shrinker may clean them up and release the final reference to
> > unlinked inodes, then whatever calls the shrinker will get blocked.
> > If kswapd does the shrinking, then the whole system can lock up
> > because kswapd can't make progress until the filesystem is unfrozen.
> > And if the process that does that unfreezing needs memory....
>
> Seems like freezing any of the layers if overlay itself is not frozen
> is not a good idea.

That's something we can't directly control. e.g. the lower filesystem is on a DM volume. DM can freeze the lower filesystem through the block device when a dm command is run. It may well be that the admins that set up the storage and filesystem layers have no idea that there are now overlay users on top of the filesystem they originally set up. Indeed, the admins may not even know that dm operations freeze filesystems, because it happens completely transparently to them.
> Preventing unlink (or modification generally) on the underlying layers
> if part of an overlay is also doable, but it would likely run into
> such backward compat issues that no one would be happy. So I don't
> think that's now the way to go. We could have done that initially,
> but it turns out allowing modification of the underlying layers can be
> useful at times.

Which means we're stuck with a fundamental "overlay can panic/deadlock machines" problem, yes?

> > I can think of several other similar ways that we can probably be
> > screwed by cross-superblock references and memory reclaim
> > interactions. I can't think of any way to avoid them except for
> > not getting into that mess in the first place.
>
> The freezer interaction can be solved by letting the freezer know
> about the dependencies of filesystems.

What freezer is that - the userspace application that calls the freeze ioctl()?

> So if an underlying layer
> needs to be frozen, then all stacks containing that underlying layer
> would need to be frozen first. I guess any of the other bad
> interactions would be solvable in a similar manner.

We've talked about this in the context of fixing the hibernation mess - freezing multiple layers requires determining the dependencies between filesystem layers, and it has to work both up and down the dependency tree. And that tree has to include block devices, because things like loopback devices and locally mounted network devices form part of that dependency tree.

Doing that as a global "freeze everything" operation can be solved by reverse walking the superblock list (as it's in sb instantiation order), but we have no obvious way of solving this dependency problem for any random superblock in the system. If you have a solution to this, I'm all ears :P

Cheers, Dave.
On Wed, Oct 3, 2018 at 2:14 AM Dave Chinner <david@fromorbit.com> wrote:
[...]
> > Seems like freezing any of the layers if overlay itself is not frozen
> > is not a good idea.
>
> That's something we can't directly control. e.g. lower filesystem is
> on a DM volume. DM can freeze the lower filesystem through the block
> device when a dm command is run. It may well be that the admins that
> set up the storage and filesystem layer have no idea that there are
> now overlay users on top of the filesystem they originally set up.
> Indeed, the admins may not even know that dm operations freeze
> filesystems because it happens completely transparently to them.

I don't think we should be binding the stacked filesystem issues to the stacked block-over-fs issues. The latter is more complex to solve generally and has, by design, unlimited stack depth. The former has limited stack depth (2) and each sb knows its own stack depth, which is already used in overlay to annotate lockdep correctly. If the vfs stores a reverse tree of stacked fs dependencies, then individual sb freeze can be solved.

Drawing the fire away from overlayfs... I personally find the behavior that a process that only has files open for read could block when the filesystem is frozen somewhat unexpected to users (even if I can expect it). I wonder out loud if it wouldn't be friendlier for any filesystem to defer "garbage collection" (e.g. truncating deleted inode blocks) to thawing time, just as those operations are already run on mount (post crash) anyway.

Thanks, Amir.
On Wed, Oct 03, 2018 at 06:45:13AM +0300, Amir Goldstein wrote:
> On Wed, Oct 3, 2018 at 2:14 AM Dave Chinner <david@fromorbit.com> wrote:
> [...]
> > > Seems like freezing any of the layers if overlay itself is not frozen
> > > is not a good idea.
> >
> > That's something we can't directly control. e.g. lower filesystem is
> > on a DM volume. DM can freeze the lower filesystem through the block
> > device when a dm command is run. It may well be that the admins that
> > set up the storage and filesystem layer have no idea that there are
> > now overlay users on top of the filesystem they originally set up.
> > Indeed, the admins may not even know that dm operations freeze
> > filesystems because it happens completely transparently to them.
> >
>
> I don't think we should be binding the stacked filesystem issues with
> the stacked block over fs issues.

It's the same problem. Hacking a one-off solution to hide a specific
overlay symptom does not address the root problem. And, besides, if
you stack like this:

	overlay
	lower_fs
	loopback dev
	loop img fs

And freeze the loop img fs, overlay can still get stuck in its
shrinker because the lower_fs gets stuck doing IO on the frozen
loop img fs.

i.e. it's the same issue - kswapd will get stuck doing reclaim from
the overlay shrinker.

> The latter is more complex to solve
> generally and has by design an unlimited stack depth. The former has
> a limited stack depth (2) and each sb knows its own stack depth, which
> is already used in overlay to annotate lockdep correctly.
>
> If vfs stores a reverse tree of stacked fs dependencies, then
> individual sb freeze can be solved.

Don't make me mention bind mounts... :/

> Drawing the fire away from overlayfs.. I personally find the behavior
> that a process that only has files open for read could block when the
> filesystem is frozen somewhat unexpected to users (even if I can
> expect it).

Filesystem reads have always been able to modify the file (e.g.
atime updates).
Not to mention that filesystem reads require memory allocation, and
that means any GFP_KERNEL direct reclaim can get stuck on a frozen
filesystem if that filesystem hasn't properly cleared out its
dangerous reclaimable objects when freezing.

> I wonder out loud if it wouldn't be friendlier for any filesystem to
> defer "garbage collection" (e.g. truncate deleted inode blocks) to
> thawing time,

https://marc.info/?l=linux-xfs&m=153022904909523&w=2

Been on the list of "nice to have" unlink optimisations for XFS since
2008. But it's a performance optimisation and precursor for offlining
AGs for online repair, not something we've ever considered as needed
for correctness or to prevent deadlocks.

> just as those operations are already run on mount (post crash) anyway.

That's a completely different context - log recovery is much more
constrained in the amount of work it needs to do and has much more
freedom for handling errors (i.e. it can just leak bad unlinked
inodes). Runtime deferral of post-unlink, post-reference inode
reclaim is a *lot* more complex than processing pending unlinks in
log recovery.

Cheers,

Dave.
On Thu, Oct 4, 2018 at 12:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 03, 2018 at 06:45:13AM +0300, Amir Goldstein wrote:
>> On Wed, Oct 3, 2018 at 2:14 AM Dave Chinner <david@fromorbit.com> wrote:
>> [...]
>> > > Seems like freezing any of the layers if overlay itself is not frozen
>> > > is not a good idea.
>> >
>> > That's something we can't directly control. e.g. lower filesystem is
>> > on a DM volume. DM can freeze the lower filesystem through the block
>> > device when a dm command is run. It may well be that the admins that
>> > set up the storage and filesystem layer have no idea that there are
>> > now overlay users on top of the filesystem they originally set up.
>> > Indeed, the admins may not even know that dm operations freeze
>> > filesystems because it happens completely transparently to them.
>> >
>>
>> I don't think we should be binding the stacked filesystem issues with
>> the stacked block over fs issues.
>
> It's the same problem. Hacking a one-off solution to hide a specific
> overlay symptom does not address the root problem. And, besides, if
> you stack like this:
>
>	overlay
>	lower_fs
>	loopback dev
>	loop img fs
>
> And freeze the loop img fs, overlay can still get stuck in its
> shrinker because the lower_fs gets stuck doing IO on the frozen
> loop img fs.
>
> i.e. it's the same issue - kswapd will get stuck doing reclaim from
> the overlay shrinker.

Is overlay making the situation any worse in this case? IOW, would
reclaim on fs-over-loopback not have the same issue *without*
overlay? If not, why?

>> The latter is more complex to solve
>> generally and has by design an unlimited stack depth. The former has
>> a limited stack depth (2) and each sb knows its own stack depth, which
>> is already used in overlay to annotate lockdep correctly.
>>
>> If vfs stores a reverse tree of stacked fs dependencies, then
>> individual sb freeze can be solved.
>
> Don't make me mention bind mounts...
> :/

How do mounts have anything to do with this? We are talking about
filesystem (i.e. superblock) dependencies, not mount dependencies
(which would make zero sense anyway).

Thanks,
Miklos
On Thu, Oct 04, 2018 at 01:14:12AM +0200, Miklos Szeredi wrote:
> On Thu, Oct 4, 2018 at 12:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Oct 03, 2018 at 06:45:13AM +0300, Amir Goldstein wrote:
> >> On Wed, Oct 3, 2018 at 2:14 AM Dave Chinner <david@fromorbit.com> wrote:
> >> [...]
> >> > > Seems like freezing any of the layers if overlay itself is not frozen
> >> > > is not a good idea.
> >> >
> >> > That's something we can't directly control. e.g. lower filesystem is
> >> > on a DM volume. DM can freeze the lower filesystem through the block
> >> > device when a dm command is run. It may well be that the admins that
> >> > set up the storage and filesystem layer have no idea that there are
> >> > now overlay users on top of the filesystem they originally set up.
> >> > Indeed, the admins may not even know that dm operations freeze
> >> > filesystems because it happens completely transparently to them.
> >> >
> >>
> >> I don't think we should be binding the stacked filesystem issues with
> >> the stacked block over fs issues.
> >
> > It's the same problem. Hacking a one-off solution to hide a specific
> > overlay symptom does not address the root problem. And, besides, if
> > you stack like this:
> >
> >	overlay
> >	lower_fs
> >	loopback dev
> >	loop img fs
> >
> > And freeze the loop img fs, overlay can still get stuck in its
> > shrinker because the lower_fs gets stuck doing IO on the frozen
> > loop img fs.
> >
> > i.e. it's the same issue - kswapd will get stuck doing reclaim from
> > the overlay shrinker.
>
> Is overlay making the situation any worse in this case? IOW, would

No, it's not. My point is that it's the same layering problem, not
that it is an issue unique to overlay. Fixing one without addressing
the other does not make the problem go away.

> >> If vfs stores a reverse tree of stacked fs dependencies, then
> >> individual sb freeze can be solved.
> >
> > Don't make me mention bind mounts...
> > :/
>
> How do mounts have anything to do with this?

Bind mounts layer superblocks in unpredictable manners. e.g. the
lowerfs for overlay could be a directory hierarchy made up of
multiple bind mounts from different source filesystems. So it's not
just a 1:1 upper/lower superblock relationship - it could be 1:many.
Dependency graphs get complex quickly in such configurations.

You can also have the same superblock above overlay through a bind
mount as well as below as the lower fs. e.g. a user could freeze the
bind mounted hierarchy after checking it's a different device to the
filesystem they are also writing to (e.g. logs to the "root" device,
freezes the "home" device). IOWs, they could be completely unaware
that the freeze target is the same filesystem that underlies the
"root" device, and so when they freeze the "home" target their root
device also freezes. That sort of thing registers high up on the user
WTF-o-meter, and it's not at all obvious what went wrong or how to
get out of that mess.

It's not clear to me if bind mounts should accept path based
superblock admin commands like freeze, because they potentially
expose problems far outside the realm that specific user/mount
namespace is allowed access to.

My point is that it's not obvious that there is a simple, clear
dependency hierarchy between superblocks because there are so many
ways they can be layered by userspace and layered filesystems and
block devices.

Cheers,

Dave.
On Thu, Oct 4, 2018 at 7:38 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Oct 04, 2018 at 01:14:12AM +0200, Miklos Szeredi wrote:
>> On Thu, Oct 4, 2018 at 12:59 AM, Dave Chinner <david@fromorbit.com> wrote:
>> >	overlay
>> >	lower_fs
>> >	loopback dev
>> >	loop img fs
>> >
>> > And freeze the loop img fs, overlay can still get stuck in its
>> > shrinker because the lower_fs gets stuck doing IO on the frozen
>> > loop img fs.
>> >
>> > i.e. it's the same issue - kswapd will get stuck doing reclaim from
>> > the overlay shrinker.
>>
>> Is overlay making the situation any worse in this case? IOW, would
>
> No, it's not. My point is that it's the same layering problem, not
> that it is an issue unique to overlay. Fixing one without addressing
> the other does not make the problem go away.

But you were suggesting that this should be solved by somehow
preventing the shrink-after-unlink issue in overlayfs. The freeze
dependency is not unique to that: if filesystem A is frozen and a
file in that fs is loopback mounted in filesystem B, then B also
needs to be frozen before A is frozen to prevent deadlocking through
kswapd.

Both could be solved by creating a new "struct freeze_dep" object
with ->freeze/unfreeze methods, that can be embedded into overlay and
the loopback device and whatever else, and which is installed onto a
"struct list_head s_freeze_deps" list in the superblock.

Does that make sense?

>
>> >> If vfs stores a reverse tree of stacked fs dependencies, then
>> >> individual sb freeze can be solved.
>> >
>> > Don't make me mention bind mounts... :/
>>
>> How do mounts have anything to do with this?
>
> Bind mounts layer superblocks in unpredictable manners. e.g. the
> lowerfs for overlay could be a directory hierarchy made up of
> multiple bind mounts from different source filesystems. So it's not
> just a 1:1 upper/lower superblock relationship - it could be 1:many.
> Dependency graphs get complex quickly in such configurations.
Overlayfs avoids that sort of mess by not overlaying mount trees,
just single mounts.

> My point is that it's not obvious that there is a simple, clear
> dependency hierarchy between superblocks because there are so many
> ways they can be layered by userspace and layered filesystems and
> block devices.

For overlay and loopback there's a simple dependency graph. As for
stacked block devs, I have no idea if such an infrastructure makes
sense or is possible at all.

Thanks,
Miklos
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index bedc5a5133a5..912b42f5fe4a 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -259,6 +259,14 @@ xfs_trans_alloc(
 	struct xfs_trans	*tp;
 	int			error;
 
+	/*
+	 * Allocate the handle before we do our freeze accounting and setting up
+	 * GFP_NOFS allocation context so that we avoid lockdep false positives
+	 * by doing GFP_KERNEL allocations inside sb_start_intwrite().
+	 */
+	tp = kmem_zone_zalloc(xfs_trans_zone,
+		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
+
 	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
 		sb_start_intwrite(mp->m_super);
 
@@ -270,8 +278,6 @@ xfs_trans_alloc(
 		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
 	atomic_inc(&mp->m_active_trans);
 
-	tp = kmem_zone_zalloc(xfs_trans_zone,
-		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
 	tp->t_flags = flags;
 	tp->t_mountp = mp;