
xfs: avoid lockdep false positives in xfs_trans_alloc

Message ID 20180907015155.32559-1-david@fromorbit.com (mailing list archive)
State Accepted
Series xfs: avoid lockdep false positives in xfs_trans_alloc

Commit Message

Dave Chinner Sept. 7, 2018, 1:51 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

We've had a few reports of lockdep tripping over memory reclaim
context vs filesystem freeze "deadlocks". They have all looked
to be false positives on analysis, but it seems that they are
being tripped because we take freeze references before we run
a GFP_KERNEL allocation for the struct xfs_trans.

We can avoid this false positive vector just by re-ordering the
operations in xfs_trans_alloc(). That is, we need to allocate the
structure before we take the freeze reference and enter the GFP_NOFS
allocation context that follows the xfs_trans around. This prevents
lockdep from seeing the GFP_KERNEL allocation inside the transaction
context, and that prevents it from triggering the freeze level vs
alloc context vs reclaim warnings.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trans.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
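
For reference, the reordered flow in condensed form. This is only a
simplified sketch of the change below; the transaction reservation,
error handling and freeze-state assertions are elided:

	/*
	 * Allocate first, while the allocation may still use GFP_KERNEL,
	 * so lockdep never sees a reclaim-capable allocation inside the
	 * freeze protection taken below.
	 */
	tp = kmem_zone_zalloc(xfs_trans_zone,
			(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);

	/*
	 * Only now take the freeze reference; the GFP_NOFS allocation
	 * context follows the transaction from here on.
	 */
	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
		sb_start_intwrite(mp->m_super);

	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
	tp->t_flags = flags;
	tp->t_mountp = mp;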

Comments

Brian Foster Sept. 7, 2018, 2:07 p.m. UTC | #1
On Fri, Sep 07, 2018 at 11:51:55AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We've had a few reports of lockdep tripping over memory reclaim
> context vs filesystem freeze "deadlocks". They have all looked
> to be false positives on analysis, but it seems that they are
> being tripped because we take freeze references before we run
> a GFP_KERNEL allocation for the struct xfs_trans.
> 
> We can avoid this false positive vector just by re-ordering the
> operations in xfs_trans_alloc(). That is, we need to allocate the
> structure before we take the freeze reference and enter the GFP_NOFS
> allocation context that follows the xfs_trans around. This prevents
> lockdep from seeing the GFP_KERNEL allocation inside the transaction
> context, and that prevents it from triggering the freeze level vs
> alloc context vs reclaim warnings.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

We have an old RHEL bug for this problem and I recall dwelling on this
exact change for a bit. I don't recall whether I just lost track of it,
whether there was some kind of problem, or whether I just didn't think
it worth the change to quiet a false positive. I think the latter,
because I don't remember actually testing it and don't otherwise have
any notes to suggest it's problematic in any way.

The only difference I can think of is that on a frozen fs, we'd allow
some number of transaction allocations to occur and either immediately
block on the freeze/write lock or drop into reclaim. Those instances
should be limited, however, because most user-driven operations should
block at the write level rather than the internal/transaction level.
I'll run some tests to see if anything explodes, but in the meantime
this seems reasonable to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_trans.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index bedc5a5133a5..912b42f5fe4a 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -259,6 +259,14 @@ xfs_trans_alloc(
>  	struct xfs_trans	*tp;
>  	int			error;
>  
> +	/*
> +	 * Allocate the handle before we do our freeze accounting and setting up
> +	 * GFP_NOFS allocation context so that we avoid lockdep false positives
> +	 * by doing GFP_KERNEL allocations inside sb_start_intwrite().
> +	 */
> +	tp = kmem_zone_zalloc(xfs_trans_zone,
> +		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
> +
>  	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
>  		sb_start_intwrite(mp->m_super);
>  
> @@ -270,8 +278,6 @@ xfs_trans_alloc(
>  		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
>  	atomic_inc(&mp->m_active_trans);
>  
> -	tp = kmem_zone_zalloc(xfs_trans_zone,
> -		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
>  	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
>  	tp->t_flags = flags;
>  	tp->t_mountp = mp;
> -- 
> 2.17.0
>
Christoph Hellwig Sept. 10, 2018, 6:47 a.m. UTC | #2
Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
Amir Goldstein Sept. 30, 2018, 7:56 a.m. UTC | #3
[CC Ted and Jan to see if there are lessons here that apply to ext2 ext4]

On Fri, Sep 7, 2018 at 6:03 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> We've had a few reports of lockdep tripping over memory reclaim
> context vs filesystem freeze "deadlocks". They have all looked
> to be false positives on analysis, but it seems that they are
> being tripped because we take freeze references before we run
> a GFP_KERNEL allocation for the struct xfs_trans.
>
> We can avoid this false positive vector just by re-ordering the
> operations in xfs_trans_alloc(). That is, we need to allocate the
> structure before we take the freeze reference and enter the GFP_NOFS
> allocation context that follows the xfs_trans around. This prevents
> lockdep from seeing the GFP_KERNEL allocation inside the transaction
> context, and that prevents it from triggering the freeze level vs
> alloc context vs reclaim warnings.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Dave,

First of all, you may add
Tested-by: Amir Goldstein <amir73il@gmail.com>
and I think attribution of Reported-by [2] to
Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
is called for.

I was getting the lockdep warning below reliably with the stress test
overlay/019 (over base fs xfs with reflink) ever since kernel v4.18.
The warning is tripped on my system after 2 minutes of stress test.

The possibly interesting part about this particular splat is that, unlike
previously reported traces [1][2], sb_internals is not taken by kswapd
from the pagewrite path, which, as you wrote, is not possible once the
freeze has reached the internal level. In my splats sb_internals is
taken by kswapd from the dcache shrink path.

It's true that overlayfs adds some quirks to the equation and this
specific stress test (by Ted) is especially evil for making changes
on the overlay lower layer. This could result, for example, in an
unlinked xfs inode with an elevated reference from an overlay dentry,
which wasn't informed about the underlying unlink.

However, the case of an unlinked inode reference being held by an
open file or by an inotify event doesn't seem so unlikely on a frozen
filesystem, so I am not really sure that the "false positive" claim
holds water. If you agree with this argument, you might want to tone
down the "false positive" language in the commit message, and you may
name overlay/019 as a reliable reproducer if you see fit.

Thanks,
Amir.

>8->8->8->8 extracted only the interesting snip >8->8->8->8
 -> #0 (sb_internal#2){.+.+}:
        lock_acquire+0x150/0x182
        __sb_start_write+0x93/0x164
        xfs_trans_alloc+0x3c/0x1ca
        xfs_inactive_truncate+0x3d/0x122
        xfs_inactive+0x15a/0x1f4
        xfs_fs_destroy_inode+0x160/0x246
        __dentry_kill+0xdd/0x159
        dentry_kill+0x138/0x173
        dput+0x230/0x23d
        ovl_destroy_inode+0x15/0x69
        __dentry_kill+0xdd/0x159
        shrink_dentry_list+0x286/0x2c5
        prune_dcache_sb+0x5a/0x78
        super_cache_scan+0xfe/0x17b
        shrink_slab.constprop.31+0x27a/0x438
        shrink_node+0x74/0x251
        kswapd+0x502/0x62f

[1] https://marc.info/?l=linux-xfs&m=152954339303385&w=2
[2] https://marc.info/?l=linux-xfs&m=153627727029090&w=2
Dave Chinner Oct. 1, 2018, 1:09 a.m. UTC | #4
On Sun, Sep 30, 2018 at 10:56:02AM +0300, Amir Goldstein wrote:
> [CC Ted and Jan to see if there are lessons here that apply to ext2 ext4]
> 
> On Fri, Sep 7, 2018 at 6:03 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > We've had a few reports of lockdep tripping over memory reclaim
> > context vs filesystem freeze "deadlocks". They have all looked
> > to be false positives on analysis, but it seems that they are
> > being tripped because we take freeze references before we run
> > a GFP_KERNEL allocation for the struct xfs_trans.
> >
> > We can avoid this false positive vector just by re-ordering the
> > operations in xfs_trans_alloc(). That is, we need to allocate the
> > structure before we take the freeze reference and enter the GFP_NOFS
> > allocation context that follows the xfs_trans around. This prevents
> > lockdep from seeing the GFP_KERNEL allocation inside the transaction
> > context, and that prevents it from triggering the freeze level vs
> > alloc context vs reclaim warnings.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Dave,
> 
> First of all, you may add
> Tested-by: Amir Goldstein <amir73il@gmail.com>

Too late, already pushed.

> and I think attribution of Reported-by [2] to
> Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> is called for.

Not for multiply-reported bugs. i.e. this has been reported in RHEL,
by syzbot, and by various users and several developers, so giving
attribution to just one person is not the right thing to do. It's
either all or none, and given that the commit message is not for
tracking a list of people who hit the issue, I decided "none".

> I was getting the lockdep warning below reliably with the stress test
> overlay/019 (over base fs xfs with reflink) ever since kernel v4.18.
> The warning is tripped on my system after 2 minutes of stress test.
> 
> The possibly interesting part about this particular splat is that, unlike
> previously reported traces [1][2], sb_internals is not taken by kswapd
> from the pagewrite path, which, as you wrote, is not possible once the
> freeze has reached the internal level. In my splats sb_internals is
> taken by kswapd from the dcache shrink path.

Which is exactly the same case. i.e. a transaction is being run
from kswapd's reclaim context. It doesn't matter if it's an extent
allocation transaction in the direct page writeback path, or
prune_dcache_sb() killing a dentry and dropping the last reference
to an unlinked inode triggering a truncate, or indeed prune_icache_sb
dropping an inode off the LRU and triggering a truncate of
speculatively preallocated blocks beyond EOF.

i.e. Lockdep is warning about a transaction being run in kswapd's
reclaim context - this is something we are allowed to do (and need
to do to make forwards progress) because the kswapd reclaim context
is GFP_KERNEL....

Cheers,

Dave.
Amir Goldstein Oct. 1, 2018, 7:56 a.m. UTC | #5
On Mon, Oct 1, 2018 at 4:09 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Sep 30, 2018 at 10:56:02AM +0300, Amir Goldstein wrote:
> > [CC Ted and Jan to see if there are lessons here that apply to ext2 ext4]
> >
> > On Fri, Sep 7, 2018 at 6:03 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > We've had a few reports of lockdep tripping over memory reclaim
> > > context vs filesystem freeze "deadlocks". They have all looked
> > > to be false positives on analysis, but it seems that they are
> > > being tripped because we take freeze references before we run
> > > a GFP_KERNEL allocation for the struct xfs_trans.
> > >
> > > We can avoid this false positive vector just by re-ordering the
> > > operations in xfs_trans_alloc(). That is, we need to allocate the
> > > structure before we take the freeze reference and enter the GFP_NOFS
> > > allocation context that follows the xfs_trans around. This prevents
> > > lockdep from seeing the GFP_KERNEL allocation inside the transaction
> > > context, and that prevents it from triggering the freeze level vs
> > > alloc context vs reclaim warnings.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> >
[...]
> > I was getting the lockdep warning below reliably with the stress test
> > overlay/019 (over base fs xfs with reflink) ever since kernel v4.18.
> > The warning is tripped on my system after 2 minutes of stress test.
> >
> > The possibly interesting part about this particular splat is that, unlike
> > previously reported traces [1][2], sb_internals is not taken by kswapd
> > from the pagewrite path, which, as you wrote, is not possible once the
> > freeze has reached the internal level. In my splats sb_internals is
> > taken by kswapd from the dcache shrink path.
>
> Which is exactly the same case. i.e. a transaction is being run
> from kswapd's reclaim context. It doesn't matter if it's an extent
> allocation transaction in the direct page writeback path, or
> prune_dcache_sb() killing a dentry and dropping the last reference
> to an unlinked inode triggering a truncate, or indeed prune_icache_sb
> dropping an inode off the LRU and triggering a truncate of
> speculatively preallocated blocks beyond EOF.
>
> i.e. Lockdep is warning about a transaction being run in kswapd's
> reclaim context - this is something we are allowed to do (and need
> to do to make forwards progress) because the kswapd reclaim context
> is GFP_KERNEL....
>

I understand that, but I would still like to ask for a clarification on one
point. In response to one of the lockdep warning reports [1] you wrote:
"It's not a deadlock - for anything to deadlock in this path, we have
to be in the middle of a freeze and have frozen the transaction
subsystem. Which we cannot do until we've cleaned all the dirty
cached pages in the filesystem and frozen all new writes. Which means
kswapd cannot enter this direct writeback path because we can't have
dirty pages on the filesystem."

I don't see how this argument holds for the shrinker case.
That is, if the filesystem is already past freezing the transaction subsystem
and then kswapd comes along and runs the shrinkers.

So while the statement "it's not a deadlock" may still be true, I am not
yet convinced that the claim that there are no dirty pages to write when
the filesystem is frozen is sufficient to back that claim.

Are you sure there was no deadlock lurking in there while the fs is past
SB_FREEZE_FS and the kswapd shrinker races with another process
releasing the last reference to an(other) inode?

I appreciate your patience, I would really like to better understand
all the aspects of freezing (it is essential for ovl snapshots).

Thanks,
Amir.

[1] https://marc.info/?l=linux-xfs&m=153627727029090&w=2
Dave Chinner Oct. 1, 2018, 10:32 p.m. UTC | #6
On Mon, Oct 01, 2018 at 10:56:25AM +0300, Amir Goldstein wrote:
> On Mon, Oct 1, 2018 at 4:09 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sun, Sep 30, 2018 at 10:56:02AM +0300, Amir Goldstein wrote:
> > > [CC Ted and Jan to see if there are lessons here that apply to ext2 ext4]
> > >
> > > On Fri, Sep 7, 2018 at 6:03 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > We've had a few reports of lockdep tripping over memory reclaim
> > > > context vs filesystem freeze "deadlocks". They have all looked
> > > > to be false positives on analysis, but it seems that they are
> > > > being tripped because we take freeze references before we run
> > > > a GFP_KERNEL allocation for the struct xfs_trans.
> > > >
> > > > We can avoid this false positive vector just by re-ordering the
> > > > operations in xfs_trans_alloc(). That is, we need to allocate the
> > > > structure before we take the freeze reference and enter the GFP_NOFS
> > > > allocation context that follows the xfs_trans around. This prevents
> > > > lockdep from seeing the GFP_KERNEL allocation inside the transaction
> > > > context, and that prevents it from triggering the freeze level vs
> > > > alloc context vs reclaim warnings.
> > > >
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > >
> [...]
> > > I was getting the lockdep warning below reliably with the stress test
> > > overlay/019 (over base fs xfs with reflink) ever since kernel v4.18.
> > > The warning is tripped on my system after 2 minutes of stress test.
> > >
> > > The possibly interesting part about this particular splat is that, unlike
> > > previously reported traces [1][2], sb_internals is not taken by kswapd
> > > from the pagewrite path, which, as you wrote, is not possible once the
> > > freeze has reached the internal level. In my splats sb_internals is
> > > taken by kswapd from the dcache shrink path.
> >
> > Which is exactly the same case. i.e. a transaction is being run
> > from kswapd's reclaim context. It doesn't matter if it's an extent
> > allocation transaction in the direct page writeback path, or
> > prune_dcache_sb() killing a dentry and dropping the last reference
> > to an unlinked inode triggering a truncate, or indeed prune_icache_sb
> > dropping an inode off the LRU and triggering a truncate of
> > speculatively preallocated blocks beyond EOF.
> >
> > i.e. Lockdep is warning about a transaction being run in kswapd's
> > reclaim context - this is something we are allowed to do (and need
> > to do to make forwards progress) because the kswapd reclaim context
> > is GFP_KERNEL....
> >
> 
> I understand that, but I would still like to ask for a clarification on one
> point. In response to one of the lockdep warning reports [1] you wrote:
> "It's not a deadlock - for anything to deadlock in this path, we have
> to be in the middle of a freeze and have frozen the transaction
> subsystem. Which we cannot do until we've cleaned all the dirty
> cached pages in the filesystem and frozen all new writes. Which means
> kswapd cannot enter this direct writeback path because we can't have
> dirty pages on the filesystem."
> 
> I don't see how this argument holds for the shrinker case.
> That is, if the filesystem is already past freezing the transaction subsystem
> and then kswapd comes along and runs the shrinkers.

We've already cleaned all the dirty inodes before we pruned out all
the reclaimable inodes from the cache as part of the freeze process.
Hence the XFS filesystem should not be holding any inodes that need
post-release transactions to be run. i.e. freeze has already cleaned
up anything that shrinkers might trip over.

> So while the statement "it's not a deadlock" may still be true, I am not
> yet convinced that the claim that there are no dirty pages to write when
> the filesystem is frozen is sufficient to back that claim.
> 
> Are you sure there was no deadlock lurking in there while the fs is past
> SB_FREEZE_FS and the kswapd shrinker races with another process
> releasing the last reference to an(other) inode?

The inodes being released by the shrinkers should be clean, and
hence releasing them does not require post-release transactions to
be run.

It does concern me that the overlay dcache shrinker is dropping the
last reference to an XFS inode and it does not get put on the LRU
for the correct superblock inode cache shrinker to free it. That
implies that the overlay dcache shrinker is dropping the last
reference to an unlinked inode.

AFAIA, the dcache shrinker should never be freeing the last
reference to an unlinked inode - it should always be done from the
context that unlinked the inode or the context that closed the final
fd on that inode. i.e. a task context, not a shrinker or kswapd. Can
you confirm what the state of the inode being dropped in that
lockdep trace is?

Cheers,

Dave.
Amir Goldstein Oct. 2, 2018, 4:02 a.m. UTC | #7
On Tue, Oct 2, 2018 at 1:32 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Oct 01, 2018 at 10:56:25AM +0300, Amir Goldstein wrote:
[...]
> > So while the statement "it's not a deadlock" may still be true, I am not
> > yet convinced that the claim that there are no dirty pages to write when
> > the filesystem is frozen is sufficient to back that claim.
> >
> > Are you sure there was no deadlock lurking in there while the fs is past
> > SB_FREEZE_FS and the kswapd shrinker races with another process
> > releasing the last reference to an(other) inode?
>
> The inodes being released by the shrinkers should be clean, and
> hence releasing them does not require post-release transactions to
> be run.
>
> It does concern me that the overlay dcache shrinker is dropping the
> last reference to an XFS inode and it does not get put on the LRU
> for the correct superblock inode cache shrinker to free it. That
> implies that the overlay dcache shrinker is dropping the last
> reference to an unlinked inode.
>

Yes, there is nothing that prevents overlayfs (or ecryptfs) from ending
up with the last reference on an underlying unlinked inode.
It requires an unlink action on the underlying layer, which is not
normal user behavior, but it is possible.

> AFAIA, the dcache shrinker should never be freeing the last
> reference to an unlinked inode - it should always be done from the
> context that unlinked the inode or the context that closed the final
> fd on that inode. i.e. a task context, not a shrinker or kswapd. Can
> you confirm what the state of the inode being dropped in that
> lockdep trace is?
>

Given that the stress test runs fsstress in parallel (same seed) on
the overlay and the underlying lower fs, I would say that it is very likely
to trip on the shrinker putting an overlay dentry/inode holding the last
reference to an unlinked underlying fs inode.

How concerned are you about this? Is it inherently hard to deal with
this situation in XFS? Pardon my ignorance, but can't the memory shrinker
trigger the oom killer, indirectly causing deleted inodes to be released?
I'm not sure in which context those files/inodes get released.

Thanks,
Amir.
Dave Chinner Oct. 2, 2018, 6:39 a.m. UTC | #8
On Tue, Oct 02, 2018 at 07:02:08AM +0300, Amir Goldstein wrote:
> On Tue, Oct 2, 2018 at 1:32 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Oct 01, 2018 at 10:56:25AM +0300, Amir Goldstein wrote:
> [...]
> > > So while the statement "it's not a deadlock" may still be true, I am not
> > > yet convinced that the claim that there are no dirty pages to write when
> > > the filesystem is frozen is sufficient to back that claim.
> > >
> > > Are you sure there was no deadlock lurking in there while the fs is past
> > > SB_FREEZE_FS and the kswapd shrinker races with another process
> > > releasing the last reference to an(other) inode?
> >
> > The inodes being released by the shrinkers should be clean, and
> > hence releasing them does not require post-release transactions to
> > be run.
> >
> > It does concern me that the overlay dcache shrinker is dropping the
> > last reference to an XFS inode and it does not get put on the LRU
> > for the correct superblock inode cache shrinker to free it. That
> > implies that the overlay dcache shrinker is dropping the last
> > reference to an unlinked inode.
> 
> Yes, there is nothing that prevents overlayfs (or ecryptfs) from ending
> up with the last reference on an underlying unlinked inode.

/me groans

> It requires an unlink action on the underlying layer, which is not
> normal user behavior, but it is possible.

Foot, meet Shotgun.

> > AFAIA, the dcache shrinker should never be freeing the last
> > reference to an unlinked inode - it should always be done from the
> > context that unlinked the inode or the context that closed the final
> > fd on that inode. i.e. a task context, not a shrinker or kswapd. Can
> > you confirm what the state of the inode being dropped in that
> > lockdep trace is?
> >
> 
> Given that the stress test runs fsstress in parallel (same seed) on
> the overlay and the underlying lower fs, I would say that it is very likely
> to trip on the shrinker putting an overlay dentry/inode holding the last
> reference to an unlinked underlying fs inode.

The shrinker still shouldn't trip on them. Whatever releases the
last reference to an unlinked file should be freeing the inode.

> How concerned are you about this?

Scares the hell outta me.

> Is it inherently hard to deal with
> this situation in XFS?

Nothing to do with the messenger.

Have a look at shrink_dcache_sb() and shrink_dcache_for_umount() and
what they imply about the dentries that take references to an inode
on a different superblock. Then look at generic_shutdown_super() -
pay attention to what happens if there are still allocated inodes
after all the superblock dentries have been pruned and inodes
evicted. i.e. this will trigger if some other superblock holds
references to them:

                if (!list_empty(&sb->s_inodes)) {
                        printk("VFS: Busy inodes after unmount of %s. "
                           "Self-destruct in 5 seconds.  Have a nice day...\n",
                           sb->s_id);
                }

The per-superblock cache infrastructure is built around the
assumption that the dentries referencing an inode all point to the
same superblock as the inode. Hence to free all the external
references to the superblock's inode cache, all we need to do is
prune the superblock's dentry cache. If something else is taking
external references to an inode, then this all breaks down.

The superblock shrinker is based around the same assumptions and so
makes the further assumption that it doesn't interact directly with
any other superblock. i.e. a superblock shrinker will only change
the state of objects the superblock directly owns. This means a
filesystem implementation only has to worry about its own
shrinker(s) when considering memory reclaim recursion and other
issues like invalidating caches to prevent shrinker actions when
frozen.

If we have cross-superblock object references then it gets very
complicated very quickly. Freeze is a good example, as you
suspected. i.e. if the lower filesystem is frozen but overlay holds
open unlinked files on the lower filesystem, then overlay if going
to get blocked by the lower frozen filesystem when it tries to
release the final inode reference.

If overlay puts unlinked dentries on its LRU where the superblock
shrinker may clean them up and release the final reference to
unlinked inodes, then whatever calls the shrinker will get blocked.
If kswapd does the shrinking, then the whole system can lock up
because kswapd can't make progress until the filesystem is unfrozen.
And if the process that does that unfreezing needs memory....

I can think of several other similar ways that we can probably be
screwed by cross-superblock references and memory reclaim
interactions. I can't think of any way to avoid them except for
not getting into that mess in the first place.

> Pardon my ignorance, but can't the memory shrinker
> trigger the oom killer, indirectly causing deleted inodes to be released?
> I'm not sure in which context those files/inodes get released.

The oom killer doesn't free anything. It marks a process as being
killed, then when that process context gets scheduled it exits,
releasing all the open files it holds and so dropping the last
reference to the dentries in task context, not the OOM kill context.

Cheers,

Dave.
Miklos Szeredi Oct. 2, 2018, 7:33 a.m. UTC | #9
On Tue, Oct 2, 2018 at 8:39 AM, Dave Chinner <david@fromorbit.com> wrote:

> Have a look at shrink_dcache_sb() and shrink_dcache_for_umount() and
> what they imply about the dentries that take references to an inode
> on a different superblock. Then look at generic_shutdown_super() -
> pay attention to what happens if there are still allocated inodes
> after all the superblock dentries have been pruned and inodes
> evicted. i.e. this will trigger if some other superblock holds
> references to them:
>
>                 if (!list_empty(&sb->s_inodes)) {
>                         printk("VFS: Busy inodes after unmount of %s. "
>                            "Self-destruct in 5 seconds.  Have a nice day...\n",
>                            sb->s_id);
>                 }
>

Overlay holds references to the underlying sb's (through a set of
internal mounts), so that's not going to happen.

> If overlay puts unlinked dentries on its LRU where the superblock
> shrinker may clean them up and release the final reference to
> unlinked inodes, then whatever calls the shrinker will get blocked.
> If kswapd does the shrinking, then the whole system can lock up
> because kswapd can't make progress until the filesystem is unfrozen.
> And if the process that does that unfreezing needs memory....

Seems like freezing any of the layers if overlay itself is not frozen
is not a good idea.

Preventing unlink (or modification generally) on the underlying layers
if they are part of an overlay is also doable, but it would likely run
into such backward compat issues that no one would be happy.  So I
don't think that's the way to go now.  We could have done that
initially, but it turns out allowing modification of the underlying
layers can be useful at times.

> I can think of several other similar ways that we can probably be
> screwed by cross-superblock references and memory reclaim
> interactions. I can't think of any way to avoid them except for
> not getting into that mess in the first place.

The freezer interaction can be solved by letting the freezer know
about the dependencies of filesystems.   So if an underlying layer
needs to be frozen, then all stacks containing that underlying layer
would need to be frozen first.  I guess any of the other bad
interactions would be solvable in a similar manner.

Thanks,
Miklos

>
>> Pardon my ignorance, but can't the memory shrinker
>> trigger the oom killer, indirectly causing deleted inodes to be released?
>> I'm not sure in which context those files/inodes get released.
>
> The oom killer doesn't free anything. It marks a process as being
> killed, then when that process context gets scheduled it exits,
> releasing all the open files it holds and so dropping the last
> reference to the dentries in task context, not the OOM kill context.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
Dave Chinner Oct. 2, 2018, 11:14 p.m. UTC | #10
On Tue, Oct 02, 2018 at 09:33:30AM +0200, Miklos Szeredi wrote:
> On Tue, Oct 2, 2018 at 8:39 AM, Dave Chinner <david@fromorbit.com> wrote:
> 
> > Have a look at shrink_dcache_sb() and shrink_dcache_for_umount() and
> > what they imply about the dentries that take references to an inode
> > on a different superblock. Then look at generic_shutdown_super() -
> > pay attention to what happens if there are still allocated inodes
> > after all the superblock dentries have been pruned and inodes
> > evicted. i.e. this will trigger if some other superblock holds
> > references to them:
> >
> >                 if (!list_empty(&sb->s_inodes)) {
> >                         printk("VFS: Busy inodes after unmount of %s. "
> >                            "Self-destruct in 5 seconds.  Have a nice day...\n",
> >                            sb->s_id);
> >                 }
> >
> 
> Overlay holds references to the underlying sb's (through a set of
> internal mounts), so that's not going to happen.

Ok, so that example is not going to trigger. That doesn't mean it
isn't a problem, though....

> > If overlay puts unlinked dentries on its LRU where the superblock
> > shrinker may clean them up and release the final reference to
> > unlinked inodes, then whatever calls the shrinker will get blocked.
> > If kswapd does the shrinking, then the whole system can lock up
> > because kswapd can't make progress until the filesystem is unfrozen.
> > And if the process that does that unfreezing needs memory....
> 
> Seems like freezing any of the layers if overlay itself is not frozen
> is not a good idea.

That's something we can't directly control. e.g. lower filesystem is
on a DM volume. DM can freeze the lower filesystem through the block
device when a dm command is run. It may well be that the admins that
set up the storage and filesystem layer have no idea that there are
now overlay users on top of the filesystem they originally set up.
Indeed, the admins may not even know that dm operations freeze
filesystems because it happens completely transparently to them.

> Preventing unlink (or modification generally) on the underlying layers
> if they are part of an overlay is also doable, but it would likely run
> into such backward compat issues that no one would be happy.  So I
> don't think that's the way to go now.  We could have done that
> initially, but it turns out allowing modification of the underlying
> layers can be useful at times.

Which means we're stuck with a fundamental "overlay can
panic/deadlock machines" problem, yes?

> > I can think of several other similar ways that we can probably be
> > screwed by cross-superblock references and memory reclaim
> > interactions. I can't think of any way to avoid them except for
> > not getting into that mess in the first place.
> 
> The freezer interaction can be solved by letting the freezer know
> about the dependencies of filesystems.

What freezer is that - the userspace application that calls the
freeze ioctl()?

> So if an underlying layer
> needs to be frozen, then all stacks containing that underlying layer
> would need to be frozen first.  I guess any of the other bad
> interactions would be solvable in a similar manner.

We've talked about this in the context of fixing the hibernation
mess - freezing multiple layers requires determining the
dependencies between filesystem layers, and it has to work both
up and down the dependency tree. And that tree has to include block
devices, because things like loopback devices and locally mounted
network devices form part of that dependency tree.

Doing that as a global "freeze everything" operation can be solved
by reverse walking the superblock list (as it's in sb instantiation
order), but we have no obvious way of solving this dependency
problem for any random superblock in the system. If you have a
solution to this, I'm all ears :P
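
(For the global case, the reverse walk would be something like the
following untested sketch inside fs/super.c, where the super_blocks
list lives; locking, active references and error handling are all
elided:)

	struct super_block *sb;

	/*
	 * Walk the superblock list backwards so later-mounted (upper)
	 * filesystems are frozen before the filesystems they sit on
	 * top of.
	 */
	list_for_each_entry_reverse(sb, &super_blocks, s_list)
		if (sb->s_root && !sb_rdonly(sb))
			freeze_super(sb);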

Cheers,

Dave.
Amir Goldstein Oct. 3, 2018, 3:45 a.m. UTC | #11
On Wed, Oct 3, 2018 at 2:14 AM Dave Chinner <david@fromorbit.com> wrote:
[...]
> > Seems like freezing any of the layers if overlay itself is not frozen
> > is not a good idea.
>
> That's something we can't directly control. e.g. lower filesystem is
> on a DM volume. DM can freeze the lower filesystem through the block
> device when a dm command is run. It may well be that the admins that
> set up the storage and filesystem layer have no idea that there are
> now overlay users on top of the filesystem they originally set up.
> Indeed, the admins may not even know that dm operations freeze
> filesystems because it happens completely transparently to them.
>

I don't think we should be binding the stacked filesystem issues with
the stacked block over fs issues. The latter is more complex to solve
generally and has, by design, unlimited stack depth. The former has
limited stack depth (2) and each sb knows its own stack depth, which
is already used in overlay to annotate lockdep correctly.

If vfs stores a reverse tree of stacked fs dependencies, then individual
sb freeze can be solved.

Drawing the fire away from overlayfs... I personally find the behavior that
a process that only has files open for read could block when the filesystem
is frozen somewhat unexpected to users (even if I can expect it).

I wonder out loud if it wouldn't be friendlier for any filesystem to defer
"garbage collection" (e.g. truncate deleted inode blocks) to thawing time,
just as those operations are already run on mount (post crash) anyway.

Thanks,
Amir.
Dave Chinner Oct. 3, 2018, 10:59 p.m. UTC | #12
On Wed, Oct 03, 2018 at 06:45:13AM +0300, Amir Goldstein wrote:
> On Wed, Oct 3, 2018 at 2:14 AM Dave Chinner <david@fromorbit.com> wrote:
> [...]
> > > Seems like freezing any of the layers if overlay itself is not frozen
> > > is not a good idea.
> >
> > That's something we can't directly control. e.g. lower filesystem is
> > on a DM volume. DM can freeze the lower filesystem through the block
> > device when a dm command is run. It may well be that the admins that
> > set up the storage and filesystem layer have no idea that there are
> > now overlay users on top of the filesystem they originally set up.
> > Indeed, the admins may not even know that dm operations freeze
> > filesystems because it happens completely transparently to them.
> >
> 
> I don't think we should be binding the stacked filesystem issues with
> the stacked block over fs issues.

It's the same problem.  Hacking a one-off solution to hide a specific
overlay symptom does not address the root problem. And, besides, if
you stack like this:

overlay
  lower_fs
    loopback dev
      loop img fs

And freeze the loop img fs, overlay can still get stuck in its
shrinker because the lower_fs gets stuck doing IO on the frozen
loop img fs.

i.e. it's the same issue - kswapd will get stuck doing reclaim from
the overlay shrinker.

> The latter is more complex to solve
> generally and has, by design, unlimited stack depth. The former has
> limited stack depth (2) and each sb knows its own stack depth, which
> is already used in overlay to annotate lockdep correctly.
> 
> If vfs stores a reverse tree of stacked fs dependencies, then individual
> sb freeze can be solved.

Don't make me mention bind mounts... :/

> Drawing the fire away from overlayfs... I personally find the behavior that
> a process that only has files open for read could block when the filesystem
> is frozen somewhat unexpected to users (even if I can expect it).

Filesystem reads have always been able to modify the file (e.g.
atime updates). Not to mention filesystem reads require memory
allocation, and that means any GFP_KERNEL direct reclaim can get
stuck on a frozen filesystem if that filesystem hasn't properly
cleared out its dangerous reclaimable objects when freezing.

> I wonder out loud if it wouldn't be friendlier for any filesystem to defer
> "garbage collection" (e.g. truncate deleted inode blocks) to thawing time,

https://marc.info/?l=linux-xfs&m=153022904909523&w=2

Been on the list of "nice to have" unlink optimisations for XFS
since 2008.

But it's a performance optimisation and precursor for offlining AGs
for online repair, not something we've ever considered as needed for
correctness or to prevent deadlocks.

> just as those operations are already run on mount (post crash) anyway.

That's a completely different context - log recovery is much more
constrained in the amount of work it needs to do and has much more
freedom for handling errors (i.e. it can just leak bad unlinked
inodes). Runtime deferral of post-unlink, post-reference inode
reclaim is a *lot* more complex than processing pending unlinks in
log recovery.

Cheers,

Dave.
Miklos Szeredi Oct. 3, 2018, 11:14 p.m. UTC | #13
On Thu, Oct 4, 2018 at 12:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 03, 2018 at 06:45:13AM +0300, Amir Goldstein wrote:
>> On Wed, Oct 3, 2018 at 2:14 AM Dave Chinner <david@fromorbit.com> wrote:
>> [...]
>> > > Seems like freezing any of the layers if overlay itself is not frozen
>> > > is not a good idea.
>> >
>> > That's something we can't directly control. e.g. lower filesystem is
>> > on a DM volume. DM can freeze the lower filesystem through the block
>> > device when a dm command is run. It may well be that the admins that
>> > set up the storage and filesystem layer have no idea that there are
>> > now overlay users on top of the filesystem they originally set up.
>> > Indeed, the admins may not even know that dm operations freeze
>> > filesystems because it happens completely transparently to them.
>> >
>>
>> I don't think we should be binding the stacked filesystem issues with
>> the stacked block over fs issues.
>
> It's the same problem.  Hacking a one-off solution to hide a specific
> overlay symptom does not address the root problem. And, besides, if
> you stack like this:
>
> overlay
>   lower_fs
>     loopback dev
>       loop img fs
>
> And freeze the loop img fs, overlay can still get stuck in its
> shrinker because the lower_fs gets stuck doing IO on the frozen
> loop img fs.
>
> i.e. it's the same issue - kswapd will get stuck doing reclaim from
> the overlay shrinker.

Is overlay making the situation any worse in this case?  IOW, would
reclaim on fs-over-loopback not have the same issue *without* overlay?
 If not, why?

>> The latter is more complex to solve
>> generally and has, by design, unlimited stack depth. The former has
>> limited stack depth (2) and each sb knows its own stack depth, which
>> is already used in overlay to annotate lockdep correctly.
>>
>> If vfs stores a reverse tree of stacked fs dependencies, then individual
>> sb freeze can be solved.
>
> Don't make me mention bind mounts... :/

How do mounts have anything to do with this?  We are talking about
filesystem (i.e. superblock) dependencies, not mount dependencies
(which would make zero sense anyway).

Thanks,
Miklos
Dave Chinner Oct. 4, 2018, 5:38 a.m. UTC | #14
On Thu, Oct 04, 2018 at 01:14:12AM +0200, Miklos Szeredi wrote:
> On Thu, Oct 4, 2018 at 12:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Oct 03, 2018 at 06:45:13AM +0300, Amir Goldstein wrote:
> >> On Wed, Oct 3, 2018 at 2:14 AM Dave Chinner <david@fromorbit.com> wrote:
> >> [...]
> >> > > Seems like freezing any of the layers if overlay itself is not frozen
> >> > > is not a good idea.
> >> >
> >> > That's something we can't directly control. e.g. lower filesystem is
> >> > on a DM volume. DM can freeze the lower filesystem through the block
> >> > device when a dm command is run. It may well be that the admins that
> >> > set up the storage and filesystem layer have no idea that there are
> >> > now overlay users on top of the filesystem they originally set up.
> >> > Indeed, the admins may not even know that dm operations freeze
> >> > filesystems because it happens completely transparently to them.
> >> >
> >>
> >> I don't think we should be binding the stacked filesystem issues with
> >> the stacked block over fs issues.
> >
> > It's the same problem.  Hacking a one-off solution to hide a specific
> > overlay symptom does not address the root problem. And, besides, if
> > you stack like this:
> >
> > overlay
> >   lower_fs
> >     loopback dev
> >       loop img fs
> >
> > And freeze the loop img fs, overlay can still get stuck in its
> > shrinker because the lower_fs gets stuck doing IO on the frozen
> > loop img fs.
> >
> > i.e. it's the same issue - kswapd will get stuck doing reclaim from
> > the overlay shrinker.
> 
> Is overlay making the situation any worse in this case?  IOW, would

No, it's not. My point is that it's the same layering problem, not
that it is an issue unique to overlay. Fixing one without addressing
the other does not make the problem go away.

> >> If vfs stores a reverse tree of stacked fs dependencies, then individual
> >> sb freeze can be solved.
> >
> > Don't make me mention bind mounts... :/
> 
> How do mounts have anything to do with this?

Bind mounts layer superblocks in unpredictable manners. e.g. the
lowerfs for overlay could be a directory hierarchy made up of
multiple bind mounts from different source filesystems. So it's not
just a 1:1 upper/lower superblock relationship; it could be 1:many.
Dependency graphs get complex quickly in such configurations.

You can also have the same superblock above overlay through a bind
mount as well as below as the lower fs.  e.g. A user could freeze
the bind mounted hierarchy after checking it's a different device to
the filesystem they are also writing to (e.g. logs to "root" device,
freezes "home" device). IOWs, they could be completely unaware that
the freeze target is the same filesystem that underlies the "root"
device, and so when they freeze the "home" target their root device
also freezes.

That sort of thing registers high up on the user WTF-o-meter, and
it's not at all obvious what went wrong or how to get out of that
mess. It's not clear to me if bind mounts should accept path based
superblock admin commands like freeze, because they potentially expose
problems far outside the realm that specific user/mount namespace is
allowed access to.

My point is that it's not obvious that there is a simple, clear
dependency hierarchy between superblocks because there are so many
ways they can be layered by userspace and layered filesystems and
block devices.

Cheers,

Dave.
Miklos Szeredi Oct. 4, 2018, 7:33 a.m. UTC | #15
On Thu, Oct 4, 2018 at 7:38 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Oct 04, 2018 at 01:14:12AM +0200, Miklos Szeredi wrote:
>> On Thu, Oct 4, 2018 at 12:59 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > overlay
>> >   lower_fs
>> >     loopback dev
>> >       loop img fs
>> >
>> > And freeze the loop img fs, overlay can still get stuck in its
>> > shrinker because the lower_fs gets stuck doing IO on the frozen
>> > loop img fs.
>> >
>> > i.e. it's the same issue - kswapd will get stuck doing reclaim from
>> > the overlay shrinker.
>>
>> Is overlay making the situation any worse in this case?  IOW, would
>
> No, it's not. My point is that it's the same layering problem, not
> that it is an issue unique to overlay. Fixing one without addressing
> the other does not make the problem go away.

But you were suggesting that this should be solved by somehow
preventing the shrink-after-unlink issue in overlayfs.

But the freeze dependency is not unique to that: if filesystem A is
frozen and a file in that fs is loopback mounted as filesystem B, then
B also needs to be frozen before A is frozen to prevent deadlocking
through kswapd.

Both could be solved by creating a new "struct freeze_dep" object with
->freeze/unfreeze methods, that can be embedded into overlay, the
loopback device and whatever else, and which is installed onto a
"struct list_head s_freeze_deps" list in the superblock.

Does that make sense?
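
Roughly like this; a hypothetical sketch only, since none of these
types, fields or callbacks exist today, and locking is entirely elided:

	/* Hypothetical: one entry per object stacked on a superblock. */
	struct freeze_dep {
		struct list_head	list;	/* on lower sb->s_freeze_deps */
		void			(*freeze)(struct freeze_dep *);
		void			(*unfreeze)(struct freeze_dep *);
	};

	/* freeze_super() on the lower sb would first freeze dependents: */
	static void freeze_dependents(struct super_block *sb)
	{
		struct freeze_dep *dep;

		/* Freeze everything stacked on top of us first. */
		list_for_each_entry(dep, &sb->s_freeze_deps, list)
			dep->freeze(dep);
	}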

>
>> >> If vfs stores a reverse tree of stacked fs dependencies, then individual
>> >> sb freeze can be solved.
>> >
>> > Don't make me mention bind mounts... :/
>>
>> How do mounts have anything to do with this?
>
> Bind mounts layer superblocks in unpredictable manners. e.g. the
> lowerfs for overlay could be a directory hierarchy made up of
> multiple bind mounts from different source filesystems. So it's not
> just a 1:1 upper/lower superblock relationship; it could be 1:many.
> Dependency graphs get complex quickly in such configurations.

Overlayfs avoids that sort of mess by not overlaying mount trees, just
single mounts.

> My point is that it's not obvious that there is a simple, clear
> dependency hierarchy between superblocks because there are so many
> ways they can be layered by userspace and layered fileystems and
> block devices.

For overlay and loopback there's a simple dependency graph.  As for
stacked block devs, I have no idea if such an infrastructure makes
sense or is possible at all.

Thanks,
Miklos

Patch

diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index bedc5a5133a5..912b42f5fe4a 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -259,6 +259,14 @@  xfs_trans_alloc(
 	struct xfs_trans	*tp;
 	int			error;
 
+	/*
+	 * Allocate the handle before we do our freeze accounting and setting up
+	 * GFP_NOFS allocation context so that we avoid lockdep false positives
+	 * by doing GFP_KERNEL allocations inside sb_start_intwrite().
+	 */
+	tp = kmem_zone_zalloc(xfs_trans_zone,
+		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
+
 	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
 		sb_start_intwrite(mp->m_super);
 
@@ -270,8 +278,6 @@  xfs_trans_alloc(
 		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
 	atomic_inc(&mp->m_active_trans);
 
-	tp = kmem_zone_zalloc(xfs_trans_zone,
-		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
 	tp->t_flags = flags;
 	tp->t_mountp = mp;