[13/24] xfs: make inode reclaim almost non-blocking

Message ID 20200522035029.3022405-14-david@fromorbit.com (mailing list archive)
State Superseded
Series xfs: rework inode flushing to make inode reclaim fully asynchronous

Commit Message

Dave Chinner May 22, 2020, 3:50 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Now that dirty inode writeback doesn't cause read-modify-write
cycles on the inode cluster buffer under memory pressure, the need
to throttle memory reclaim to the rate at which we can clean dirty
inodes goes away. That is due to the fact that we no longer thrash
inode cluster buffers under memory pressure to clean dirty inodes.

This means inode writeback no longer stalls on memory allocation
or read IO, and hence can be done asynchronously without generating
memory pressure. As a result, blocking inode writeback in reclaim is
no longer necessary to prevent reclaim priority windup as cleaning
dirty inodes is no longer dependent on having memory reserves
available for the filesystem to make progress reclaiming inodes.

Hence we can convert inode reclaim to be non-blocking for shrinker
callouts, both for direct reclaim and kswapd.

On a vanilla kernel, running a 16-way fsmark create workload on a
4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
userspace mlock(). The OOM killer gets invoked at 15GB of
pinned RAM.
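
The pinning was done with a trivial mlock() tool along these lines (a
minimal sketch, not the exact harness used for these numbers; the
size handling, and the need for root or a raised RLIMIT_MEMLOCK, are
illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Pin <MB> of anonymous memory so the system has to run in what's left. */
int main(int argc, char **argv)
{
	size_t size;
	void *p;

	if (argc < 2)
		return 1;
	size = (size_t)strtoul(argv[1], NULL, 0) << 20;	/* MB -> bytes */

	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED || mlock(p, size) != 0) {
		perror("pin");
		return 1;
	}
	memset(p, 0xa5, size);	/* fault every page in */
	printf("pinned %zu bytes\n", size);
	pause();		/* hold the pin until killed */
	return 0;
}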

With this patch alone, pinning memory triggers premature OOM
killer invocation, sometimes with as much as 45% of RAM being free.
It's trivially easy to trigger the OOM killer when reclaim does not
block.

With pinning inode clusters in RAM and then adding this patch, I can
reliably pin 14.5GB of RAM and still have the fsmark workload run to
completion. The OOM killer gets invoked at 14.75GB of pinned RAM, which
is only a small amount of memory less than the vanilla kernel. It is
much more reliable than just with async reclaim alone.

simoops shows that allocation stalls go away when async reclaim is
used. Vanilla kernel:

Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)

With inode cluster pinning and async reclaim:

Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)

Latencies don't really change much, nor does the work rate. However,
allocation almost never stalls with these changes, whilst the
vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
period. This difference is due to inode reclaim being largely
non-blocking now.

IOWs, once we have pinned inode cluster buffers, we can make inode
reclaim non-blocking without a major risk of premature and/or
spurious OOM killer invocation, and without any changes to memory
reclaim infrastructure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Amir Goldstein May 22, 2020, 12:19 p.m. UTC | #1
On Fri, May 22, 2020 at 6:51 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Now that dirty inode writeback doesn't cause read-modify-write
> cycles on the inode cluster buffer under memory pressure, the need
> to throttle memory reclaim to the rate at which we can clean dirty
> inodes goes away. That is due to the fact that we no longer thrash
> inode cluster buffers under memory pressure to clean dirty inodes.
>
> This means inode writeback no longer stalls on memory allocation
> or read IO, and hence can be done asynchrnously without generating
> memory pressure. As a result, blocking inode writeback in reclaim is
> no longer necessary to prevent reclaim priority windup as cleaning
> dirty inodes is no longer dependent on having memory reserves
> available for the filesystem to make progress reclaiming inodes.
>
> Hence we can convert inode reclaim to be non-blocking for shrinker
> callouts, both for direct reclaim and kswapd.
>
> On a vanilla kernel, running a 16-way fsmark create workload on a
> 4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
> userspace mlock(). The OOM killer gets invoked at 15GB of
> pinned RAM.
>
> With this patch alone, pinning memory triggers premature OOM
> killer invocation, sometimes with as much as 45% of RAM being free.
> It's trivially easy to trigger the OOM killer when reclaim does not
> block.
>
> With pinning inode clusters in RAM adn then adding this patch, I can

typo: adn

> reliably pin 14.5GB of RAM and still have the fsmark workload run to
> completion. The OOM killer gets invoked 14.75GB of pinned RAM, which
> is only a small amount of memory less than the vanilla kernel. It is
> much more reliable than just with async reclaim alone.
>
> simoops shows that allocation stalls go away when async reclaim is
> used. Vanilla kernel:
>
> Run time: 1924 seconds
> Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
> Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
> Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
> work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
> alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
>
> With inode cluster pinning and async reclaim:
>
> Run time: 1924 seconds
> Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
> Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
> Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
> work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
> alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
>
> Latencies don't really change much, nor does the work rate. However,
> allocation almost never stalls with these changes, whilst the
> vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
> period. This difference is due to inode reclaim being largely
> non-blocking now.
>
> IOWs, once we have pinned inode cluster buffers, we can make inode
> reclaim non-blocking without a major risk of premature and/or
> spurious OOM killer invocation, and without any changes to memory
> reclaim infrastructure.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

I can confirm observing the stalls on production servers and that
this patch alone fixes the stalls.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.
Darrick J. Wong May 22, 2020, 10:48 p.m. UTC | #2
On Fri, May 22, 2020 at 01:50:18PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that dirty inode writeback doesn't cause read-modify-write
> cycles on the inode cluster buffer under memory pressure, the need
> to throttle memory reclaim to the rate at which we can clean dirty
> inodes goes away. That is due to the fact that we no longer thrash
> inode cluster buffers under memory pressure to clean dirty inodes.
> 
> This means inode writeback no longer stalls on memory allocation
> or read IO, and hence can be done asynchrnously without generating

"...asynchronously..."

> memory pressure. As a result, blocking inode writeback in reclaim is
> no longer necessary to prevent reclaim priority windup as cleaning
> dirty inodes is no longer dependent on having memory reserves
> available for the filesystem to make progress reclaiming inodes.
> 
> Hence we can convert inode reclaim to be non-blocking for shrinker
> callouts, both for direct reclaim and kswapd.
> 
> On a vanilla kernel, running a 16-way fsmark create workload on a
> 4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
> userspace mlock(). The OOM killer gets invoked at 15GB of
> pinned RAM.
> 
> With this patch alone, pinning memory triggers premature OOM
> killer invocation, sometimes with as much as 45% of RAM being free.
> It's trivially easy to trigger the OOM killer when reclaim does not
> block.
> 
> With pinning inode clusters in RAM adn then adding this patch, I can
> reliably pin 14.5GB of RAM and still have the fsmark workload run to
> completion. The OOM killer gets invoked 14.75GB of pinned RAM, which
> is only a small amount of memory less than the vanilla kernel. It is
> much more reliable than just with async reclaim alone.

So the lack of OOM kills is the result of not having to do RMW and
ratcheting up the reclaim priority, right?

And, {con|per}versely, can I run fstests with 400MB of RAM now? :D

> simoops shows that allocation stalls go away when async reclaim is
> used. Vanilla kernel:
> 
> Run time: 1924 seconds
> Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
> Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
> Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
> work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
> alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
> 
> With inode cluster pinning and async reclaim:
> 
> Run time: 1924 seconds
> Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
> Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
> Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)

I'm not familiar with simoops, and ElGoog is not helpful.  What are the
units here?

> work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
> alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
> 
> Latencies don't really change much, nor does the work rate. However,
> allocation almost never stalls with these changes, whilst the
> vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
> period. This difference is due to inode reclaim being largely
> non-blocking now.

<nod>

> IOWs, once we have pinned inode cluster buffers, we can make inode
> reclaim non-blocking without a major risk of premature and/or
> spurious OOM killer invocation, and without any changes to memory
> reclaim infrastructure.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok to me, provided my suppositions are correct...
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_icache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index d806d3bfa8936..0f0f8fcd61b03 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1420,7 +1420,7 @@ xfs_reclaim_inodes_nr(
>  	xfs_reclaim_work_queue(mp);
>  	xfs_ail_push_all(mp->m_ail);
>  
> -	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
> +	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
>  }
>  
>  /*
> -- 
> 2.26.2.761.g0e0b3e54be
>
Dave Chinner May 23, 2020, 10:29 p.m. UTC | #3
On Fri, May 22, 2020 at 03:48:06PM -0700, Darrick J. Wong wrote:
> On Fri, May 22, 2020 at 01:50:18PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that dirty inode writeback doesn't cause read-modify-write
> > cycles on the inode cluster buffer under memory pressure, the need
> > to throttle memory reclaim to the rate at which we can clean dirty
> > inodes goes away. That is due to the fact that we no longer thrash
> > inode cluster buffers under memory pressure to clean dirty inodes.
> > 
> > This means inode writeback no longer stalls on memory allocation
> > or read IO, and hence can be done asynchrnously without generating
> 
> "...asynchronously..."
> 
> > memory pressure. As a result, blocking inode writeback in reclaim is
> > no longer necessary to prevent reclaim priority windup as cleaning
> > dirty inodes is no longer dependent on having memory reserves
> > available for the filesystem to make progress reclaiming inodes.
> > 
> > Hence we can convert inode reclaim to be non-blocking for shrinker
> > callouts, both for direct reclaim and kswapd.
> > 
> > On a vanilla kernel, running a 16-way fsmark create workload on a
> > 4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
> > userspace mlock(). The OOM killer gets invoked at 15GB of
> > pinned RAM.
> > 
> > With this patch alone, pinning memory triggers premature OOM
> > killer invocation, sometimes with as much as 45% of RAM being free.
> > It's trivially easy to trigger the OOM killer when reclaim does not
> > block.
> > 
> > With pinning inode clusters in RAM adn then adding this patch, I can
> > reliably pin 14.5GB of RAM and still have the fsmark workload run to
> > completion. The OOM killer gets invoked 14.75GB of pinned RAM, which
> > is only a small amount of memory less than the vanilla kernel. It is
> > much more reliable than just with async reclaim alone.
> 
> So the lack of OOM kills is the result of not having to do RMW and
> ratcheting up the reclaim priority, right?

Effectively. The ratcheting up of the reclaim priority without
writeback is a secondary effect of RMW in inode writeback.

That is, the AIL blocks on memory reclaim doing dirty inode
writeback because it has unbound demand (async flushing). Hence it
exhausts memory reserves if there are lots of dirty inodes. It's
also PF_MEMALLOC so, like kswapd, it can dip into certain reserves
that normal allocation can't.

The synchronous write behaviour of reclaim, however, bounds memory
demand at (N * ag count * pages per inode cluster), and hence it is
much more likely to make forwards progress, albeit slowly. The
synchronous write also has the effect of throttling the rate at
which reclaim cycles, hence slowing down the speed at which it ramps
up the reclaim priority. IOWs, we get both forwards progress
and lower reclaim priority because we block reclaim like this.
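
To put illustrative numbers on that bound (the values are invented
for the example, not measured, taking N as the number of concurrent
reclaimers): with 16 reclaimers, 4 AGs and 32kB inode cluster buffers
(8 x 4kB pages), worst case demand is 16 * 4 * 8 = 512 pages, i.e.
about 2MB. Async flushing has no equivalent bound.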

IOWs, removing the synchronous writeback from reclaim does two
things. The first is that it removes the ability to make forwards
progress reclaiming inodes from XFS when there is very low free
memory. This is bad for obvious reasons.

The second is that it allows reclaim to think it can't free
inode memory quickly and that's what causes the increase in reclaim
priority. i.e. it needs more scan loops to free inodes because
writeback of dirty inodes is slow and not making progress. This is
also bad, because we can make progress, just not as fast as memory
reclaim is capable of backing off to.

The sync writeback of inode clusters from reclaim mitigated both of
these issues when they occurred, at the cost of increased allocation
latency under extreme OOM conditions...

This is why, despite everyone with OOM latency problems claiming "it
works for them so you should just merge it", just skipping inode
writeback in the shrinker has not been a solution to the problem -
it didn't solve the underlying "reclaim of dirty inodes can create
unbound memory demand" problem that the sync inode writeback
controlled.

Previous attempts to solve this problem had been focussed on
replacing the throttling the shrinker did with backoffs in the core
reclaim algorithms, but that's made no progress on the mm/ side of
things. Hence this patchset - trying to tackle the problem from a
different direction so we are no longer reliant on changing core OS
infrastructure to solve problems XFS users are having.

> And, {con|per}versely, can I run fstests with 400MB of RAM now? :D

If it is bound on sync inode writeback from memory reclaim, then it
will help, otherwise it may make things worse because the trade off
we are making here is that dirty inodes can pin substantially more
memory in cache while they queue to be written back.

Yup, that's the ugly downside of this approach. Rather than have the
core memory reclaim throttle and wait according to what we need it
to do, we simply make the XFS cache footprint larger every time we
dirty an inode. It also costs us 1-2% extra CPU per transaction, so
this change certainly isn't free. IMO, it's most definitely not the
most efficient, performant or desirable solution to the problem, but
it's one that works and is wholly contained within XFS.

> > simoops shows that allocation stalls go away when async reclaim is
> > used. Vanilla kernel:
> > 
> > Run time: 1924 seconds
> > Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
> > Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
> > Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
> > work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
> > alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
> > 
> > With inode cluster pinning and async reclaim:
> > 
> > Run time: 1924 seconds
> > Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
> > Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
> > Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
> 
> I'm not familiar with simoops, and ElGoog is not helpful.  What are the
> units here?

Microseconds, IIRC.
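
(Which would put the p50 read latency above at roughly 3.3 seconds
and the p50 allocation latency at roughly 2.6-2.7 seconds in both
configurations.)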

Cheers,

Dave.

Patch

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index d806d3bfa8936..0f0f8fcd61b03 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1420,7 +1420,7 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
 }
 
 /*