[PATCHSET v8 00/20] xfs: deferred inode inactivation

Message ID 162758423315.332903.16799817941903734904.stgit@magnolia

Message

Darrick J. Wong July 29, 2021, 6:43 p.m. UTC
Hi all,

This patch series implements deferred inode inactivation.  Inactivation
is what happens when an open file loses its last incore reference: if
the file has speculative preallocations, they must be freed, and if the
file is unlinked, all forks must be truncated, and the inode marked
freed in the inode chunk and the inode btrees.

Currently, all of this activity is performed in frontend threads when
the last in-memory reference is lost and/or the vfs decides to drop the
inode.  Three complaints stem from this behavior: first, that the time
to unlink (in the worst case) depends on both the complexity of the
directory and the number of extents in that file; second,
that deleting a directory tree is inefficient and seeky because we free
the inodes in readdir order, not disk order; and third, the upcoming
online repair feature needs to be able to xfs_irele while scanning a
filesystem in transaction context.  It cannot perform inode inactivation
in this context because xfs does not support nested transactions.

The implementation will be familiar to those who have studied how XFS
scans for reclaimable in-core inodes -- we create a couple more inode
state flags to mark an inode as needing inactivation and being in the
middle of inactivation.  When inodes need inactivation, we set
NEED_INACTIVE in iflags, set the INACTIVE radix tree tag, and schedule a
deferred work item.  The deferred worker runs in an unbounded workqueue,
scanning the inode radix tree for tagged inodes to inactivate, and
performing all the on-disk metadata updates.  Once the inode has been
inactivated, it is left in the reclaim state and the background reclaim
worker (or direct reclaim) will get to it eventually.
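
Roughly, the tag-and-defer step looks like the sketch below.  This is a
minimal illustration only: the flag, tag, and helper names used here
(XFS_NEED_INACTIVE, XFS_ICI_INACTIVE_TAG, m_inodegc_wq, pag_inodegc_work)
are stand-ins and may not match the exact symbols in the series.

/*
 * Sketch: mark an inode as needing inactivation and kick the deferred
 * worker.  Imagined as living in fs/xfs/xfs_icache.c.
 */
static void
xfs_inode_mark_need_inactive(
	struct xfs_perag	*pag,
	struct xfs_inode	*ip)
{
	struct xfs_mount	*mp = pag->pag_mount;

	/* Record the new incore state on the inode itself. */
	spin_lock(&ip->i_flags_lock);
	ip->i_flags |= XFS_NEED_INACTIVE;
	spin_unlock(&ip->i_flags_lock);

	/* Tag the inode in the per-AG radix tree so the sweep can find it. */
	spin_lock(&pag->pag_ici_lock);
	radix_tree_tag_set(&pag->pag_ici_root,
			XFS_INO_TO_AGINO(mp, ip->i_ino),
			XFS_ICI_INACTIVE_TAG);
	spin_unlock(&pag->pag_ici_lock);

	/* Schedule the unbounded-workqueue sweep; v7 used a ~100ms delay. */
	queue_delayed_work(mp->m_inodegc_wq, &pag->pag_inodegc_work,
			msecs_to_jiffies(100));
}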

Doing the inactivations from kernel threads solves the first problem by
constraining the amount of work done by the unlink() call to removing
the directory entry.  It solves the third problem by moving inactivation
to a separate process.  Because the inactivations are done in order of
inode number, we solve the second problem by performing updates in (we
hope) disk order.  This also decreases the amount of time it takes to
let go of an inode cluster if we're deleting entire directory trees.

There are three big warts I can think of in this series: first, because
the actual freeing of nlink==0 inodes is now done in the background,
this means that the system will be busy making metadata updates for some
time after the unlink() call returns.  This temporarily reduces
available iops.  Second, in order to retain the behavior that deleting
100TB of unshared data should result in a free space gain of 100TB, the
statvfs and quota reporting ioctls wait for inactivation to finish,
which increases the long tail latency of those calls.  This behavior is,
unfortunately, key to not introducing regressions in fstests.  The third
problem is that the deferrals keep memory usage higher for longer,
reduce opportunities to throttle the frontend when metadata load is
heavy, and the unbounded workqueues can create transaction storms.
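
For the second wart, the reporting paths end up draining the queue before
they sample the counters.  A rough sketch of the statfs side, where
xfs_inodegc_flush() is a stand-in name for that drain:

/*
 * Sketch: wait for queued inactivations before reporting free space, so
 * that an rm -rf that just returned is reflected in f_bfree.
 */
static int
xfs_fs_statfs_sketch(
	struct dentry		*dentry,
	struct kstatfs		*statp)
{
	struct xfs_mount	*mp = XFS_M(dentry->d_sb);

	/* This wait is where the extra long-tail latency comes from. */
	xfs_inodegc_flush(mp);

	statp->f_bfree = percpu_counter_sum(&mp->m_fdblocks) -
			 mp->m_alloc_set_aside;
	/* ...the remaining counters are filled in as before... */
	return 0;
}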

v1-v2: NYE patchbombs
v3: rebase against 5.12-rc2 for submission.
v4: combine the can/has eofblocks predicates, clean up incore inode tree
    walks, fix inobt deadlock
v5: actually freeze the inode gc threads when we freeze the filesystem,
    consolidate the code that deals with inode tagging, and use
    foreground inactivation during quotaoff to avoid cycling dquots
v6: rebase to 5.13-rc4, fix quotaoff not to require foreground inactivation,
    refactor to use inode walk goals, use atomic bitflags to control the
    scheduling of gc workers
v7: simplify the inodegc worker, which simplifies how flushes work, break
    up the patch into smaller pieces, flush inactive inodes on syncfs to
    simplify freeze/ro-remount handling, separate inode selection filtering
    in iget, refactor inode recycling further, change gc delay to 100ms,
    decrease the gc delay when space or quota are low, move most of the
    destroy_inode logic to mark_reclaimable, get rid of the fallocate flush
    scan thing, get rid of polled flush mode
v8: rebase against 5.14-rc2, hook the memory shrinkers so that we requeue
    inactivation immediately when memory starts to get tight and force
    callers queueing inodes for inactivation to wait for the inactivation
    workers to run (i.e. throttling the frontend) to reduce memory storms,
    add hch's quotaoff removal series as a dependency to shut down arguments
    about quota walks

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=deferred-inactivation-5.15
---
 Documentation/admin-guide/xfs.rst |   10 
 fs/xfs/libxfs/xfs_ag.c            |   13 +
 fs/xfs/libxfs/xfs_ag.h            |   13 +
 fs/xfs/scrub/common.c             |   10 
 fs/xfs/xfs_dquot.h                |   10 
 fs/xfs/xfs_globals.c              |    5 
 fs/xfs/xfs_icache.c               |  848 ++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_icache.h               |    7 
 fs/xfs/xfs_inode.c                |   53 ++
 fs/xfs/xfs_inode.h                |   21 +
 fs/xfs/xfs_itable.c               |   42 ++
 fs/xfs/xfs_iwalk.c                |   33 +
 fs/xfs/xfs_linux.h                |    1 
 fs/xfs/xfs_log_recover.c          |    7 
 fs/xfs/xfs_mount.c                |   42 ++
 fs/xfs/xfs_mount.h                |   29 +
 fs/xfs/xfs_qm_syscalls.c          |    8 
 fs/xfs/xfs_super.c                |  110 +++--
 fs/xfs/xfs_sysctl.c               |    9 
 fs/xfs/xfs_sysctl.h               |    1 
 fs/xfs/xfs_trace.h                |  295 +++++++++++++
 fs/xfs/xfs_trans.c                |    5 
 22 files changed, 1463 insertions(+), 109 deletions(-)

Comments

Dave Chinner Aug. 2, 2021, 10:35 a.m. UTC | #1
On Thu, Jul 29, 2021 at 11:43:53AM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> This patch series implements deferred inode inactivation.  Inactivation
> is what happens when an open file loses its last incore reference: if
> the file has speculative preallocations, they must be freed, and if the
> file is unlinked, all forks must be truncated, and the inode marked
> freed in the inode chunk and the inode btrees.
> 
> Currently, all of this activity is performed in frontend threads when
> the last in-memory reference is lost and/or the vfs decides to drop the
> inode.  Three complaints stem from this behavior: first, that the time
> to unlink (in the worst case) depends on both the complexity of the
> directory and the number of extents in that file; second,
> that deleting a directory tree is inefficient and seeky because we free
> the inodes in readdir order, not disk order; and third, the upcoming
> online repair feature needs to be able to xfs_irele while scanning a
> filesystem in transaction context.  It cannot perform inode inactivation
> in this context because xfs does not support nested transactions.
> 
> The implementation will be familiar to those who have studied how XFS
> scans for reclaimable in-core inodes -- we create a couple more inode
> state flags to mark an inode as needing inactivation and being in the
> middle of inactivation.  When inodes need inactivation, we set
> NEED_INACTIVE in iflags, set the INACTIVE radix tree tag, and schedule a
> deferred work item.  The deferred worker runs in an unbounded workqueue,
> scanning the inode radix tree for tagged inodes to inactivate, and
> performing all the on-disk metadata updates.  Once the inode has been
> inactivated, it is left in the reclaim state and the background reclaim
> worker (or direct reclaim) will get to it eventually.
> 
> Doing the inactivations from kernel threads solves the first problem by
> constraining the amount of work done by the unlink() call to removing
> the directory entry.  It solves the third problem by moving inactivation
> to a separate process.  Because the inactivations are done in order of
> inode number, we solve the second problem by performing updates in (we
> hope) disk order.  This also decreases the amount of time it takes to
> let go of an inode cluster if we're deleting entire directory trees.
> 
> There are three big warts I can think of in this series: first, because
> the actual freeing of nlink==0 inodes is now done in the background,
> this means that the system will be busy making metadata updates for some
> time after the unlink() call returns.  This temporarily reduces
> available iops.  Second, in order to retain the behavior that deleting
> 100TB of unshared data should result in a free space gain of 100TB, the
> statvfs and quota reporting ioctls wait for inactivation to finish,
> which increases the long tail latency of those calls.  This behavior is,
> unfortunately, key to not introducing regressions in fstests.  The third
> problem is that the deferrals keep memory usage higher for longer,
> reduce opportunities to throttle the frontend when metadata load is
> heavy, and the unbounded workqueues can create transaction storms.

Yup, those transaction storms are a problem (where all the CIL
contention is coming from, I think), but there's a bigger
issue....

The problem I see is that the way the background work has been
structured means we lose all the implicit CPU affinity we get from
having all the objects in a directory be in the same AG.
This means that when we do work across different directories in
different processes, the filesystem objects they operate on are
largely CPU, node and AG affine. The only contention point is
typically the transaction commit (i.e. the CIL).

However, the use of the mark-and-sweep AG work technique for
inactivation breaks this implicit CPU affinity for the inactivation
work that is usually done in process context when it drops the last
reference to the file. We now just mark the inode for inactivation,
and move on to the next thing. Some time later, an unbound workqueue
thread walks across all the AGs and processes all those
inactivations. It's not affine to the process that placed the inode
in the queue, so the inactivation takes place on a mostly cold data
cache.  Hence we end up loading the data set into CPU caches a
second time to do the inactivation, rather than running on a hot
local cache.

The other problem here is that with the really deep queues (65536
inodes before throttling) we grow the unlinked lists a lot and so
there is more overhead managing them. i.e. We end up with more cold
cache misses and buffer lookups...

So, really, now that I've dug into the workings of the perag-based
deferred mechanism and tried to help it regain all the lost
performance, I'm not sure that mark-and-sweep works for this
infrastructure.

To put my money where my mouth is, I just wrote a small patch on top
of this series that rips out the mark-and-sweep stuff and replaces it
with a small, simple, lockless per-cpu inactivation queue. I set it
up to trigger the work to be run at 32 queued inodes, and converted
the workqueue to be bound to the CPU and have no concurrency on the
CPU. IOWs, we get a per-cpu worker thread to process the per-cpu
inactivation queue. I gutted all the throttling, all the enospc
and quota stuff, and neutered recycling of inodes that need
inactivation. Yeah, I ripped the guts out of it. But with that, the
unlink performance is largely back to the pre-inactivation rate.
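
Purely to illustrate the shape of that idea (this is not the patch itself,
and the names xfs_inodegc, i_gclist and m_inodegc here are made up for the
sketch), a lockless per-cpu queue along those lines might look like:

/* Per-cpu, lockless queue of inodes awaiting inactivation. */
struct xfs_inodegc {
	struct llist_head	list;	/* inodes queued on this CPU */
	struct work_struct	work;	/* worker bound to this CPU */
	unsigned int		items;	/* approximate queue depth */
};

#define XFS_INODEGC_BATCH	32	/* kick the worker at 32 inodes */

static void
xfs_inodegc_queue_sketch(
	struct xfs_mount	*mp,
	struct xfs_inode	*ip)
{
	struct xfs_inodegc	*gc = get_cpu_ptr(mp->m_inodegc);

	llist_add(&ip->i_gclist, &gc->list);
	if (++gc->items >= XFS_INODEGC_BATCH) {
		gc->items = 0;
		/* Bound to this CPU so the data set stays cache-hot. */
		queue_work_on(smp_processor_id(), mp->m_inodegc_wq, &gc->work);
	}
	put_cpu_ptr(mp->m_inodegc);
}

static void
xfs_inodegc_worker_sketch(
	struct work_struct	*work)
{
	struct xfs_inodegc	*gc = container_of(work, struct xfs_inodegc, work);
	struct llist_node	*node = llist_del_all(&gc->list);
	struct xfs_inode	*ip, *n;

	/* Runs on the queueing CPU, shortly after the queueing process. */
	llist_for_each_entry_safe(ip, n, node, i_gclist)
		xfs_inactive(ip);	/* existing inactivation path */
}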

I see AGI buffer lockstepping where I'd expect to see it, and
there's no additional contention on other locks because it mostly
keeps the cpu affinity of the data set being inactivated intact.
Being bound to the same CPU and only running for a short period of
time means that it pre-empts the process that released the inodes
and doesn't trigger it to be scheduled away onto a different CPU.
Hence the workload still largely runs on hot(-ish) caches and so
runs at nearly the same efficiency/IPC as the pre-inactivation code.

Hacky patch below, let me know what you think.

Cheers,

Dave.