diff mbox series

[1/2] Revert "mm: don't reclaim inodes with many attached pages"

Message ID 20190130041707.27750-2-david@fromorbit.com (mailing list archive)
State New, archived
Headers show
Series mm: shrinkers are now way too aggressive | expand

Commit Message

Dave Chinner Jan. 30, 2019, 4:17 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.

This change causes serious changes to page cache and inode cache
behaviour and balance, resulting in major performance regressions
when combining workloads such as large file copies and kernel
compiles.

https://bugzilla.kernel.org/show_bug.cgi?id=202441

This change is a hack to work around the problems introduced by
changing how aggressive shrinkers are on small caches in commit
172b06c32b94 ("mm: slowly shrink slabs with a relatively small
number of objects"). It creates more problems than it solves and
wasn't adequately reviewed or tested, so it needs to be reverted.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

Comments

Chris Mason Jan. 30, 2019, 12:21 p.m. UTC | #1
On 29 Jan 2019, at 23:17, Dave Chinner wrote:

> From: Dave Chinner <dchinner@redhat.com>
>
> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
>
> This change causes serious changes to page cache and inode cache
> behaviour and balance, resulting in major performance regressions
> when combining workloads such as large file copies and kernel
> compiles.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=202441

I'm a little confused by the latest comment in the bz:

https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24

Are these reverts sufficient?

Roman beat me to suggesting Rik's followup.  We hit a different problem 
in prod with small slabs, and have a lot of instrumentation on Rik's 
code helping.

-chris
Dave Chinner Jan. 31, 2019, 1:34 a.m. UTC | #2
On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
> 
> 
> On 29 Jan 2019, at 23:17, Dave Chinner wrote:
> 
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
> >
> > This change causes serious changes to page cache and inode cache
> > behaviour and balance, resulting in major performance regressions
> > when combining workloads such as large file copies and kernel
> > compiles.
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=202441
> 
> I'm a little confused by the latest comment in the bz:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24

Which says the first patch that changed the shrinker behaviour is
the underlying cause of the regression.

> Are these reverts sufficient?

I think so.

> Roman beat me to suggesting Rik's followup.  We hit a different problem 
> in prod with small slabs, and have a lot of instrumentation on Rik's 
> code helping.

I think that's just another nasty, expedient hack that doesn't solve
the underlying problem. Solving the underlying problem does not
require changing core reclaim algorithms and upsetting a page
reclaim/shrinker balance that has been stable and worked well for
just about everyone for years.

Cheers,

Dave.
Michal Hocko Jan. 31, 2019, 9:10 a.m. UTC | #3
On Thu 31-01-19 12:34:03, Dave Chinner wrote:
> On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
> > 
> > 
> > On 29 Jan 2019, at 23:17, Dave Chinner wrote:
> > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
> > >
> > > This change causes serious changes to page cache and inode cache
> > > behaviour and balance, resulting in major performance regressions
> > > when combining workloads such as large file copies and kernel
> > > compiles.
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=202441
> > 
> > I'm a little confused by the latest comment in the bz:
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
> 
> Which says the first patch that changed the shrinker behaviour is
> the underlying cause of the regression.
> 
> > Are these reverts sufficient?
> 
> I think so.
> 
> > Roman beat me to suggesting Rik's followup.  We hit a different problem 
> > in prod with small slabs, and have a lot of instrumentation on Rik's 
> > code helping.
> 
> I think that's just another nasty, expedient hack that doesn't solve
> the underlying problem. Solving the underlying problem does not
> require changing core reclaim algorithms and upsetting a page
> reclaim/shrinker balance that has been stable and worked well for
> just about everyone for years.

I tend to agree with Dave here. Slab pressure balancing is quite subtle
and easy to get wrong. If we want to plug the problem with offline
memcgs then the fix should be targeted at that problem. So maybe we want
to emulate high pressure on offline memcgs only. There might be other
issues to resolve for small caches but let's start with something more
targeted first please.
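
For illustration only, here is a minimal userspace sketch of that targeted
idea (the scan_target() helper and the numbers are hypothetical, not kernel
code): offline memcgs get their freeable objects drained aggressively, while
online memcgs keep the normal priority-scaled pressure.

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical model of "emulate high pressure on offline memcgs only". */
static unsigned long scan_target(unsigned long freeable, int priority,
                                 bool memcg_online)
{
        if (!memcg_online)
                return freeable;            /* dying memcg: drain it hard */
        return freeable >> priority;        /* normal priority-scaled pressure */
}

int main(void)
{
        /* The same small cache in an online memcg vs. a dying one. */
        printf("online : %lu\n", scan_target(500, 12, true));   /* 0 at low priority */
        printf("offline: %lu\n", scan_target(500, 12, false));  /* 500, reclaimed promptly */
        return 0;
}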
Chris Mason Jan. 31, 2019, 3:48 p.m. UTC | #4
On 30 Jan 2019, at 20:34, Dave Chinner wrote:

> On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
>>
>>
>> On 29 Jan 2019, at 23:17, Dave Chinner wrote:
>>
>>> From: Dave Chinner <dchinner@redhat.com>
>>>
>>> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
>>>
>>> This change causes serious changes to page cache and inode cache
>>> behaviour and balance, resulting in major performance regressions
>>> when combining workloads such as large file copies and kernel
>>> compiles.
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=202441
>>
>> I'm a little confused by the latest comment in the bz:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
>
> Which says the first patch that changed the shrinker behaviour is
> the underlying cause of the regression.
>
>> Are these reverts sufficient?
>
> I think so.

Based on the latest comment:

"If I had been less strict in my testing I probably would have 
discovered that the problem was present earlier than 4.19.3. Mr Gushins 
commit made it more visible.
I'm going back to work after two days off, so I might not be able to 
respond inside your working hours, but I'll keep checking in on this as 
I get a chance."

I don't think the reverts are sufficient.

>
>> Roman beat me to suggesting Rik's followup.  We hit a different 
>> problem
>> in prod with small slabs, and have a lot of instrumentation on Rik's
>> code helping.
>
> I think that's just another nasty, expedient hack that doesn't solve
> the underlying problem. Solving the underlying problem does not
> require changing core reclaim algorithms and upsetting a page
> reclaim/shrinker balance that has been stable and worked well for
> just about everyone for years.
>

Things are definitely breaking down in non-specialized workloads, and 
have been for a long time.

-chris
Roman Gushchin Jan. 31, 2019, 6:57 p.m. UTC | #5
On Thu, Jan 31, 2019 at 10:10:11AM +0100, Michal Hocko wrote:
> On Thu 31-01-19 12:34:03, Dave Chinner wrote:
> > On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
> > > 
> > > 
> > > On 29 Jan 2019, at 23:17, Dave Chinner wrote:
> > > 
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
> > > >
> > > > This change causes serious changes to page cache and inode cache
> > > > behaviour and balance, resulting in major performance regressions
> > > > when combining workloads such as large file copies and kernel
> > > > compiles.
> > > >
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=202441
> > > 
> > > I'm a little confused by the latest comment in the bz:
> > > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
> > 
> > Which says the first patch that changed the shrinker behaviour is
> > the underlying cause of the regression.
> > 
> > > Are these reverts sufficient?
> > 
> > I think so.
> > 
> > > Roman beat me to suggesting Rik's followup.  We hit a different problem 
> > > in prod with small slabs, and have a lot of instrumentation on Rik's 
> > > code helping.
> > 
> > I think that's just another nasty, expedient hack that doesn't solve
> > the underlying problem. Solving the underlying problem does not
> > require changing core reclaim algorithms and upsetting a page
> > reclaim/shrinker balance that has been stable and worked well for
> > just about everyone for years.
> 
> I tend to agree with Dave here. Slab pressure balancing is quite subtle
> and easy to get wrong. If we want to plug the problem with offline
> memcgs then the fix should be targeted at that problem. So maybe we want
> to emulate high pressure on offline memcgs only. There might be other
> issues to resolve for small caches but let's start with something more
> targeted first please.

First, the path proposed by Dave is not regression-safe either. A slab object
can be used by other cgroups as well, so creating artificial pressure on
the dying cgroup might well affect the rest of the system. We do reparent
slab lists on offlining, so there is even no easy way to iterate over them.
Also, creating artificial pressure will create unnecessary CPU load.

So I'd really prefer to make the "natural" memory pressure be applied
in a way that doesn't leave any stale objects behind.

Second, the code around slab pressure has not "worked well for years": as far
as I can see, the latest major change was made about a year ago by Josef Bacik
(9092c71bb724 "mm: use sc->priority for slab shrink targets").

The existing balance, even if it works perfectly for some cases, isn't something
set in stone. We're really under-scanning small cgroups, and I strongly believe
that what Rik is proposing is the right thing to do. If we don't scan objects
in small cgroups unless we have really strong memory pressure, we're basically
wasting memory.

And it really makes no sense to reclaim inodes with tons of attached pagecache
as readily as "empty" inodes. In the end, all we need is to free some memory, and
treating a sub-page object the same as an object with many thousands of attached
pages is just strange. If this is simply "wrong" and I am missing something,
please explain. Maybe we need something more complicated than my patch, but
saying that the existing code is just perfect and can't be touched at all makes
no sense to me.

So, assuming all this, can we, please, first check if Rik's patch is addressing
the regression?

Thanks!
Dave Chinner Jan. 31, 2019, 10:19 p.m. UTC | #6
On Thu, Jan 31, 2019 at 06:57:10PM +0000, Roman Gushchin wrote:
> On Thu, Jan 31, 2019 at 10:10:11AM +0100, Michal Hocko wrote:
> > On Thu 31-01-19 12:34:03, Dave Chinner wrote:
> > > On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
> > > > 
> > > > 
> > > > On 29 Jan 2019, at 23:17, Dave Chinner wrote:
> > > > 
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > >
> > > > > This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
> > > > >
> > > > > This change causes serious changes to page cache and inode cache
> > > > > behaviour and balance, resulting in major performance regressions
> > > > > when combining workloads such as large file copies and kernel
> > > > > compiles.
> > > > >
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=202441
> > > > 
> > > > I'm a little confused by the latest comment in the bz:
> > > > 
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
> > > 
> > > Which says the first patch that changed the shrinker behaviour is
> > > the underlying cause of the regression.
> > > 
> > > > Are these reverts sufficient?
> > > 
> > > I think so.
> > > 
> > > > Roman beat me to suggesting Rik's followup.  We hit a different problem 
> > > > in prod with small slabs, and have a lot of instrumentation on Rik's 
> > > > code helping.
> > > 
> > > I think that's just another nasty, expedient hack that doesn't solve
> > > the underlying problem. Solving the underlying problem does not
> > > require changing core reclaim algorithms and upsetting a page
> > > reclaim/shrinker balance that has been stable and worked well for
> > > just about everyone for years.
> > 
> > I tend to agree with Dave here. Slab pressure balancing is quite subtle
> > and easy to get wrong. If we want to plug the problem with offline
> > memcgs then the fix should be targeted at that problem. So maybe we want
> > to emulate high pressure on offline memcgs only. There might be other
> > issues to resolve for small caches but let's start with something more
> > targeted first please.
> 
> First, the path proposed by Dave is not regression-safe either. A slab object
> can be used by other cgroups as well, so creating artificial pressure on
> the dying cgroup might well affect the rest of the system.

Isolating regressions to the memcg dying path is far better than the
current changes you've made, which have already been *proven* to
affect the rest of the system.

> We do reparent
> slab lists on offlining, so there is even no easy way to iterate over them.
> Also, creating artificial pressure will create unnecessary CPU load.

Yes, so do your changes - they increase small slab scanning by 100x
which means reclaim CPU has most definitely increased.

However, the memcg reaper *doesn't need to be perfect* to solve the
"takes too long to clean up dying memcgs". Even if it leaves shared
objects behind (which we want to do!), it still trims those memcgs
down to /just the shared objects still in use/.  And given that
objects shared by memcgs are in the minority (according to past
discussions about the difficulties of accounting them correctly) I
think this is just fine.

Besides, those remaining shared objects are the ones we want to
naturally age out under memory pressure, but otherwise the memcgs
will have been shaken clean of all other objects accounted to them.
i.e. the "dying memcg" memory footprint goes down massively and the
"long term buildup" of dying memcgs basically goes away.

> So I'd really prefer to make the "natural" memory pressure be applied
> in a way that doesn't leave any stale objects behind.

Except that "natural" memory pressure isn't enough to do this, so
you've artificially inflated that "natural" pressure and screwed up
the reclaim for everyone else.

> Second, the code around slab pressure has not "worked well for years": as far
> as I can see, the latest major change was made about a year ago by Josef Bacik
> (9092c71bb724 "mm: use sc->priority for slab shrink targets").

Strawman. False equivalence.

The commit you mention was effectively a minor tweak - instead of
using the ratio of pages scanned to pages reclaimed to define the
amount of slab work, (a second order measure of reclaim urgency), it
was changed to use the primary reclaim urgency directive that the
page scanning uses - the reclaim priority.

This *isn't a major change* in algorithm - it conveys essentially
exactly the same information, and does not change the overall page
cache vs shrinker reclaim balance. It just behaves more correctly in
corner cases where there is very little page cache/lots of slab and
vice versa.  It barely even changed the amount or balance of
reclaim being done under normal circumstances.

> The existing balance, even if it works perfectly for some cases, isn't something
> set in stone. We're really under-scanning small cgroups, and I strongly believe
> that what Rik is proposing is the right thing to do.

You keep saying "small cgroups". That is incorrect.

The shrinker knows nothing about the size of the cache. All it knows
is how many /freeable objects/ are in the cache. A large cache can
have zero freeable objects and, to the shrinker, look just like a
small cache with only a few freeable objects. You're trying to
derive cache characterisations from accounting data that carries no
such information.

IOWs, adding yet another broken, racy, unpredictable "deferred scan
count" to triggers when there is few freeable objects in the cache
is not an improvement. The code is completely buggy, because
shrinkers can run in parallel and so shrinker->small_scan can be
modified by multiple shrinker instances running at the same time.
That's the problem the shrinker->nr_deferred[] array and the atomic
exchanges solve (which has other problems we really need to fix
before we even start thinking about changing shrinker balances at
all).

> If we don't scan objects
> in small cgroups unless we have really strong memory pressure, we're basically
> wasting memory.

Maybe for memcgs, but that's exactly the opposite of what we want to
do for global caches (e.g. filesystem metadata caches). We need to
make sure that a single, heavily pressured cache doesn't evict small
caches that lower pressure but are equally important for
performance.

e.g. I've noticed recently a significant increase in RMW cycles in
XFS inode cache writeback during various benchmarks. It hasn't
affected performance because the machine has IO and CPU to burn, but
on slower machines and storage, it will have a major impact.

The reason these RMW cycles occur is that the dirty inode in memory
needs to be flushed to the backing buffer before it can be written.
That backing buffer is in a metadata slab cache controlled by a
shrinker (not the superblock shrinker which reclaims inodes).  It's
*much smaller* than the inode cache, but it also needs to be turned
over much more slowly than the inode cache. If it gets turned over
too quickly because of inode cache pressure, then flushing dirty
inodes (in the superblock inode cache shrinker!) needs to read the
backing buffer back in to memory before it can write the inode and
reclaim it.

And when you change the shrinker infrastructure to reclaim small
caches more aggressively? It changes the balance of cache reclaim
within the filesystem, and so opens up these "should have been in
the cache but wasn't" performance problems. Where are all the
problems that users have reported? In the superblock shrinker,
reclaiming XFS inodes, waiting on IO to be done on really slow IO
subsystems. Basically, we're stuck because the shrinker changes have
screwed up the intra-cache reclaim balance.

IOWs, there are complex dependencies between shrinkers and memory
reclaim, and changing the balance of large vs small caches can have
very adverse affects when memory pressure occurs. For XFS,
increasing small cache pressure vs large cache pressure has the
effect of *increasing the memory demand* of inode reclaim because
writeback has to allocate buffers, and that's one of the reasons
it's the canary for naive memory reclaim algorithm changes.

Just because something "works for your memcg heavy workload" doesn't
mean it works for everyone, or is even a desirable algorithmic
change.

> And it really makes no sense to reclaim inodes with tons of attached pagecache
> as readily as "empty" inodes.

You didn't even know this happened until recently. Have you looked
up its history?  e.g. from the initial git commit in 2.6.12, almost
15 years ago now, there's this comment on prune_icache():

 * Any inodes which are pinned purely because of attached pagecache have their
 * pagecache removed.  We expect the final iput() on that inode to add it to
 * the front of the inode_unused list.  So look for it there and if the
 * inode is still freeable, proceed.  The right inode is found 99.9% of the
 * time in testing on a 4-way.

Indeed, did you know this inode cache based page reclaim is what
PGINODESTEAL and KSWAPD_INODESTEAL /proc/vmstat record for us? e.g.
looking at my workstation:

$ grep inodesteal /proc/vmstat
pginodesteal 14617964
kswapd_inodesteal 6674859
$

Page cache reclaim from clean, unreferenced, aged-out inodes
actually happens an awful lot in real life.

> So, assuming all this, can we, please, first check if Rik's patch is addressing
> the regression?

Nope, it's broken and buggy, and reintroduces problems with racing
deferred counts that were fixed years ago when I originally
wrote the numa aware shrinker infrastructure.

Cheers,

Dave.
Dave Chinner Feb. 1, 2019, 11:39 p.m. UTC | #7
On Thu, Jan 31, 2019 at 03:48:11PM +0000, Chris Mason wrote:
> On 30 Jan 2019, at 20:34, Dave Chinner wrote:
> 
> > On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
> >>
> >>
> >> On 29 Jan 2019, at 23:17, Dave Chinner wrote:
> >>
> >>> From: Dave Chinner <dchinner@redhat.com>
> >>>
> >>> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
> >>>
> >>> This change causes serious changes to page cache and inode cache
> >>> behaviour and balance, resulting in major performance regressions
> >>> when combining workloads such as large file copies and kernel
> >>> compiles.
> >>>
> >>> https://bugzilla.kernel.org/show_bug.cgi?id=202441
> >>
> >> I'm a little confused by the latest comment in the bz:
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
> >
> > Which says the first patch that changed the shrinker behaviour is
> > the underlying cause of the regression.
> >
> >> Are these reverts sufficient?
> >
> > I think so.
> 
> Based on the latest comment:
> 
> "If I had been less strict in my testing I probably would have 
> discovered that the problem was present earlier than 4.19.3. Mr Gushins 
> commit made it more visible.
> I'm going back to work after two days off, so I might not be able to 
> respond inside your working hours, but I'll keep checking in on this as 
> I get a chance."
> 
> I don't think the reverts are sufficient.

Roger has tested the two reverts more heavily against 5.0.0-rc3.
Without the reverts, the machine locks up hard. With the two reverts
applied, it runs along smoothly under extremely heavy load.

https://bugzilla.kernel.org/show_bug.cgi?id=202441#c26

So, yes, these changes need to be reverted.

Cheers,

Dave.
Dave Chinner Feb. 4, 2019, 9:47 p.m. UTC | #8
On Fri, Feb 01, 2019 at 09:19:04AM +1100, Dave Chinner wrote:
> > So, assuming all this, can we, please, first check if Rik's patch is addressing
> > the regression?
> 
> Nope, it's broken and buggy, and reintroduces problems with racing
> deferred counts that were fixed years ago when I originally
> wrote the numa aware shrinker infrastructure.

So, the first thing to do here is fix the non-deterministic deferred
reclaim behaviour of the shrinker. Any reclaim deferred because of
things like GFP_NOFS contexts gets landed on the very next
shrinker instance that runs, be it kswapd or direct reclaim, and
regardless of whether that instance can even perform reclaim
(e.g. it might be a GFP_NOFS context itself).

As a result, there is no predicting when a shrinker instance might
get landed with a huge amount of work that isn't its own. We can't
even guarantee that kswapd reclaim context sees this deferred work,
because if there is another reclaimer on that node running at the
same time, it will have scooped away the deferred work and kswapd
will only do a small amount of work.

How small? Well, a node with 1.8m freeable items, reclaim priority
12 (i.e. lowest priority), and seeks = 2 will result in a scan count
of:

	delta = 1,800,000 >> 12 = 440
	delta *= 4 = 1760
	delta /= 2 = 880.

i.e. the shrinker will only scan 880 objects when there is light memory
pressure. That's only about 0.05% of the cache, which is insignificant.
That's fine for direct reclaim, but for kswapd we need it to do more
work when there is deferred work.

So, say we have another 1m objects of deferred work (very common
on filesystem (and therefore GFP_NOFS) heavy workloads), we'll do:

	total_scan = deferred_objects;
	....
	total_scan += delta;

(ignoring clamping)

which means that we'll actually try to scan 1,000,880 objects from
the cache on the next pass. This may be clamped down to
(freeable_objects / 2), but that's still 900,000 objects in this
case.
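
For illustration, here is a self-contained userspace model of that
arithmetic (a sketch of the scan count logic described above, not the
kernel code itself; the clamp is simplified to the case at hand):

#include <stdio.h>

int main(void)
{
        unsigned long long freeable = 1800000;  /* freeable objects on the node */
        unsigned long long deferred = 1000000;  /* work deferred from GFP_NOFS contexts */
        int priority = 12;                      /* lowest reclaim priority */
        int seeks = 2;                          /* DEFAULT_SEEKS */

        /* Per-pass scan count under light memory pressure: tiny. */
        unsigned long long delta = (freeable >> priority) * 4 / seeks;

        /* The next shrinker instance inherits the whole deferred backlog. */
        unsigned long long total_scan = deferred + delta;
        if (delta < freeable / 4 && total_scan > freeable / 2)
                total_scan = freeable / 2;      /* clamped to freeable/2 */

        printf("delta      = %llu\n", delta);           /* ~880 */
        printf("total_scan = %llu\n", total_scan);      /* 900,000 */
        return 0;
}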

IOWs, we just increased the reclaim work of this shrinker instance
by a factor of ~1000.  That's where all the long tail shrinker
reclaim latencies are coming from, and where a large number of the
"XFS is doing inode IO from the shrinker" complaints come from, as
they drive it straight through all the clean inodes and into dirty reclaimable
inodes. With the additional "small cache" pressure being added and
then deferred since 4.18-rc5, this is much more likely to happen
with direct reclaim because the deferred count from GFP_NOFS
allocations are wound up faster.

So, if we want to prevent direct reclaim from this obvious long tail
latency problem, we have to stop direct reclaim from ever doing
deferred work. i.e. we need to move deferred work to kswapd and, for
XFS, we then have to ensure that kswapd will not block on dirty
inodes so it can do all this work as quickly as possible. And then
for kswapd, we need to limit the amount of deferred work so that it
doesn't spend all its time emptying a single cache at low
priorities, but will attempt to perform all the deferred work if
reclaim priority winds up far enough.
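
A minimal userspace sketch of that split (illustrative only, not the actual
patch; the scan_count() helper and numbers are hypothetical): direct reclaim
never inherits deferred work, while kswapd works off a slice of the backlog
that grows as reclaim priority rises.

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical split of deferred shrinker work between reclaim contexts. */
static unsigned long scan_count(unsigned long delta, unsigned long deferred,
                                int priority, bool is_kswapd)
{
        if (!is_kswapd)
                return delta;                   /* direct reclaim: its own work only */
        /* kswapd: add a priority-scaled slice of the deferred backlog */
        return delta + (deferred >> priority);
}

int main(void)
{
        unsigned long delta = 880, deferred = 1000000;

        printf("direct reclaim     : %lu\n", scan_count(delta, deferred, 12, false));
        printf("kswapd, priority 12: %lu\n", scan_count(delta, deferred, 12, true));
        printf("kswapd, priority  2: %lu\n", scan_count(delta, deferred, 2, true));
        return 0;
}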

This gives us a solid, predictable "deferred work" infrastructure
for the shrinkers. It gets rid of the nasty behaviour, and paves the
way for adding different sorts of deferred work (like Rik's "small
cache" pressure increase) to kswapd rather than in random reclaim
contexts. It also allows us to use a different "how much work should
we do" calculation for kswapd. i.e. one that is appropriate for
background, non-blocking scanning rather than being tailored to
limiting the work that any one direct reclaim context must do.

So, if I just set the XFS inode cache shrinker to skip inode
writeback (as everyone seems to want to do), fsmark is OOM-killed at
27M inodes, right when the log runs out of space, the tail-pushing
thread goes near to being CPU bound, and inode writeback from
reclaim is necessary to retire the dirty inode from the log before
it can be reclaimed. It's easily reproducible, and sometimes the
oom-killer chooses daemons rather than fsmark and it has killed the
system on occasion. Reverting the two patches in this thread makes
the OOM kill problem go away - it just turns it back into a
performance issue.

So, on top of the reverts, the patch below that moves the deferred
shrinker work to kswapd is the first "XFS inode shrinker doesn't
block kswapd" patch I've been able to get through my benchmarks and
memory stress workloads without triggering random OOM-kills or
seeing substantial performance regressions. Indeed, it appears to
behave better than the existing code (fsmark inode create is
marginally faster, simoops long tails have completely gone(*)).

This indicates to me that we really should be considering fixing the
deferred work problems before adding new types of deferred work into
the shrinker infrastructure (for whatever reason). Get the
infrastructure reliable, predictable and somewhat deterministic,
then we can start trialling pressure/balance changes knowing exactly
where we are directing that extra work....

Cheers,

Dave.

(*) Chris, FYI, the last output before simoops died because "too
many open files" - p99 latency is nearly identical to p50 latency:

Run time: 10873 seconds
Read latency (p50: 3,084,288) (p95: 3,158,016) (p99: 3,256,320)
Write latency (p50: 7,479,296) (p95: 8,101,888) (p99: 8,437,760)
Allocation latency (p50: 479,744) (p95: 744,448) (p99: 1,016,832)
work rate = 8.63/sec (avg 9.05/sec) (p50: 9.19) (p95: 9.91) (p99: 11.42)
alloc stall rate = 0.02/sec (avg: 0.01) (p50: 0.01) (p95: 0.01) (p99: 0.01)

This is the same machine that I originally ran simoops on back in
~4.9 when you first proposed async kswapd behaviour for XFS. Typical
long tail latencies back then were:

[https://www.spinics.net/lists/linux-xfs/msg02235.html]

Run time: 1140 seconds
Read latency (p50: 3,035,136) (p95: 4,415,488) (p99: 6,119,424)
Write latency (p50: 27,557,888) (p95: 31,490,048) (p99: 45,285,376)
Allocation latency (p50: 247,552) (p95: 1,497,088) (p99: 19,496,960)
work rate = 3.68/sec (avg 3.71/sec) (p50: 3.71) (p95: 4.04) (p99: 4.04)
alloc stall rate = 1.65/sec (avg: 0.12) (p50: 0.00) (p95: 0.12) (p99: 0.12)
Jan Kara Feb. 7, 2019, 10:27 a.m. UTC | #9
On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> Maybe for memcgs, but that's exactly the opposite of what we want to
> do for global caches (e.g. filesystem metadata caches). We need to
> make sure that a single, heavily pressured cache doesn't evict small
> caches that lower pressure but are equally important for
> performance.
> 
> e.g. I've noticed recently a significant increase in RMW cycles in
> XFS inode cache writeback during various benchmarks. It hasn't
> affected performance because the machine has IO and CPU to burn, but
> on slower machines and storage, it will have a major impact.

Just as a data point, our performance testing infrastructure has bisected
down to the commits discussed in this thread as the cause of about 40%
regression in XFS file delete performance in bonnie++ benchmark.

								Honza
Andrew Morton Feb. 8, 2019, 5:37 a.m. UTC | #10
On Thu, 7 Feb 2019 11:27:50 +0100 Jan Kara <jack@suse.cz> wrote:

> On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> > Maybe for memcgs, but that's exactly the opposite of what we want to
> > do for global caches (e.g. filesystem metadata caches). We need to
> > make sure that a single, heavily pressured cache doesn't evict small
> > caches that lower pressure but are equally important for
> > performance.
> > 
> > e.g. I've noticed recently a significant increase in RMW cycles in
> > XFS inode cache writeback during various benchmarks. It hasn't
> > affected performance because the machine has IO and CPU to burn, but
> > on slower machines and storage, it will have a major impact.
> 
> Just as a data point, our performance testing infrastructure has bisected
> down to the commits discussed in this thread as the cause of about 40%
> regression in XFS file delete performance in bonnie++ benchmark.
> 

Has anyone done significant testing with Rik's maybe-fix?



From: Rik van Riel <riel@surriel.com>
Subject: mm, slab, vmscan: accumulate gradual pressure on small slabs

There are a few issues with the way the number of slab objects to scan is
calculated in do_shrink_slab.  First, for zero-seek slabs, we could leave
the last object around forever.  That could result in pinning a dying
cgroup into memory, instead of reclaiming it.  The fix for that is
trivial.

Secondly, small slabs receive much more pressure, relative to their size,
than larger slabs, due to "rounding up" the minimum number of scanned
objects to batch_size.

We can keep the pressure on all slabs equal relative to their size by
accumulating the scan pressure on small slabs over time, resulting in
sometimes scanning an object, instead of always scanning several.

This results in lower system CPU use, and a lower major fault rate, as
actively used entries from smaller caches get reclaimed less aggressively,
and need to be reloaded/recreated less often.

[akpm@linux-foundation.org: whitespace fixes, per Roman]
[riel@surriel.com: couple of fixes]
  Link: http://lkml.kernel.org/r/20190129142831.6a373403@imladris.surriel.com
Link: http://lkml.kernel.org/r/20190128143535.7767c397@imladris.surriel.com
Fixes: 4b85afbdacd2 ("mm: zero-seek shrinkers")
Fixes: 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Chris Mason <clm@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jonathan Lemon <bsd@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---


--- a/include/linux/shrinker.h~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
+++ a/include/linux/shrinker.h
@@ -65,6 +65,7 @@ struct shrinker {
 
 	long batch;	/* reclaim batch size, 0 = default */
 	int seeks;	/* seeks to recreate an obj */
+	int small_scan;	/* accumulate pressure on slabs with few objects */
 	unsigned flags;
 
 	/* These are for internal use */
--- a/mm/vmscan.c~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
+++ a/mm/vmscan.c
@@ -488,18 +488,30 @@ static unsigned long do_shrink_slab(stru
 		 * them aggressively under memory pressure to keep
 		 * them from causing refetches in the IO caches.
 		 */
-		delta = freeable / 2;
+		delta = (freeable + 1) / 2;
 	}
 
 	/*
 	 * Make sure we apply some minimal pressure on default priority
-	 * even on small cgroups. Stale objects are not only consuming memory
+	 * even on small cgroups, by accumulating pressure across multiple
+	 * slab shrinker runs. Stale objects are not only consuming memory
 	 * by themselves, but can also hold a reference to a dying cgroup,
 	 * preventing it from being reclaimed. A dying cgroup with all
 	 * corresponding structures like per-cpu stats and kmem caches
 	 * can be really big, so it may lead to a significant waste of memory.
 	 */
-	delta = max_t(unsigned long long, delta, min(freeable, batch_size));
+	if (!delta && shrinker->seeks) {
+		unsigned long nr_considered;
+
+		shrinker->small_scan += freeable;
+		nr_considered = shrinker->small_scan >> priority;
+
+		delta = 4 * nr_considered;
+		do_div(delta, shrinker->seeks);
+
+		if (delta)
+			shrinker->small_scan -= nr_considered << priority;
+	}
 
 	total_scan += delta;
 	if (total_scan < 0) {
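
For a feel of the behaviour, here is a self-contained userspace model of the
accumulation above (single shrinker instance, no concurrency; the cache size
and priority are made-up numbers): a cache with only a few hundred freeable
objects is scanned once every dozen or so passes instead of a full batch on
every pass.

#include <stdio.h>

int main(void)
{
        unsigned long freeable = 300;   /* small cache */
        unsigned long small_scan = 0;   /* shrinker->small_scan in the patch */
        int priority = 12, seeks = 2;

        for (int pass = 1; pass <= 30; pass++) {
                unsigned long delta = (freeable >> priority) * 4 / seeks;  /* 0 here */

                if (!delta) {
                        /* accumulate pressure until it amounts to >= 1 object */
                        small_scan += freeable;
                        unsigned long considered = small_scan >> priority;

                        delta = 4 * considered / seeks;
                        if (delta)
                                small_scan -= considered << priority;
                }
                if (delta)
                        printf("pass %2d: scan %lu objects\n", pass, delta);
                /* the old rounding would have scanned min(freeable, batch_size) every pass */
        }
        return 0;
}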
Jan Kara Feb. 8, 2019, 9:55 a.m. UTC | #11
On Thu 07-02-19 21:37:27, Andrew Morton wrote:
> On Thu, 7 Feb 2019 11:27:50 +0100 Jan Kara <jack@suse.cz> wrote:
> 
> > On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> > > Maybe for memcgs, but that's exactly the opposite of what we want to
> > > do for global caches (e.g. filesystem metadata caches). We need to
> > > make sure that a single, heavily pressured cache doesn't evict small
> > > caches that lower pressure but are equally important for
> > > performance.
> > > 
> > > e.g. I've noticed recently a significant increase in RMW cycles in
> > > XFS inode cache writeback during various benchmarks. It hasn't
> > > affected performance because the machine has IO and CPU to burn, but
> > > on slower machines and storage, it will have a major impact.
> > 
> > Just as a data point, our performance testing infrastructure has bisected
> > down to the commits discussed in this thread as the cause of about 40%
> > regression in XFS file delete performance in bonnie++ benchmark.
> > 
> 
> Has anyone done significant testing with Rik's maybe-fix?

I will give it a spin with bonnie++ today. We'll see what comes out.

								Honza

> 
> 
> 
> From: Rik van Riel <riel@surriel.com>
> Subject: mm, slab, vmscan: accumulate gradual pressure on small slabs
> 
> There are a few issues with the way the number of slab objects to scan is
> calculated in do_shrink_slab.  First, for zero-seek slabs, we could leave
> the last object around forever.  That could result in pinning a dying
> cgroup into memory, instead of reclaiming it.  The fix for that is
> trivial.
> 
> Secondly, small slabs receive much more pressure, relative to their size,
> than larger slabs, due to "rounding up" the minimum number of scanned
> objects to batch_size.
> 
> We can keep the pressure on all slabs equal relative to their size by
> accumulating the scan pressure on small slabs over time, resulting in
> sometimes scanning an object, instead of always scanning several.
> 
> This results in lower system CPU use, and a lower major fault rate, as
> actively used entries from smaller caches get reclaimed less aggressively,
> and need to be reloaded/recreated less often.
> 
> [akpm@linux-foundation.org: whitespace fixes, per Roman]
> [riel@surriel.com: couple of fixes]
>   Link: http://lkml.kernel.org/r/20190129142831.6a373403@imladris.surriel.com
> Link: http://lkml.kernel.org/r/20190128143535.7767c397@imladris.surriel.com
> Fixes: 4b85afbdacd2 ("mm: zero-seek shrinkers")
> Fixes: 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Chris Mason <clm@fb.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Jonathan Lemon <bsd@fb.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: <stable@vger.kernel.org>
> 
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
> 
> --- a/include/linux/shrinker.h~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
> +++ a/include/linux/shrinker.h
> @@ -65,6 +65,7 @@ struct shrinker {
>  
>  	long batch;	/* reclaim batch size, 0 = default */
>  	int seeks;	/* seeks to recreate an obj */
> +	int small_scan;	/* accumulate pressure on slabs with few objects */
>  	unsigned flags;
>  
>  	/* These are for internal use */
> --- a/mm/vmscan.c~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
> +++ a/mm/vmscan.c
> @@ -488,18 +488,30 @@ static unsigned long do_shrink_slab(stru
>  		 * them aggressively under memory pressure to keep
>  		 * them from causing refetches in the IO caches.
>  		 */
> -		delta = freeable / 2;
> +		delta = (freeable + 1) / 2;
>  	}
>  
>  	/*
>  	 * Make sure we apply some minimal pressure on default priority
> -	 * even on small cgroups. Stale objects are not only consuming memory
> +	 * even on small cgroups, by accumulating pressure across multiple
> +	 * slab shrinker runs. Stale objects are not only consuming memory
>  	 * by themselves, but can also hold a reference to a dying cgroup,
>  	 * preventing it from being reclaimed. A dying cgroup with all
>  	 * corresponding structures like per-cpu stats and kmem caches
>  	 * can be really big, so it may lead to a significant waste of memory.
>  	 */
> -	delta = max_t(unsigned long long, delta, min(freeable, batch_size));
> +	if (!delta && shrinker->seeks) {
> +		unsigned long nr_considered;
> +
> +		shrinker->small_scan += freeable;
> +		nr_considered = shrinker->small_scan >> priority;
> +
> +		delta = 4 * nr_considered;
> +		do_div(delta, shrinker->seeks);
> +
> +		if (delta)
> +			shrinker->small_scan -= nr_considered << priority;
> +	}
>  
>  	total_scan += delta;
>  	if (total_scan < 0) {
> _
>
Jan Kara Feb. 8, 2019, 12:50 p.m. UTC | #12
On Fri 08-02-19 10:55:07, Jan Kara wrote:
> On Thu 07-02-19 21:37:27, Andrew Morton wrote:
> > On Thu, 7 Feb 2019 11:27:50 +0100 Jan Kara <jack@suse.cz> wrote:
> > 
> > > On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> > > > Maybe for memcgs, but that's exactly the opposite of what we want to
> > > > do for global caches (e.g. filesystem metadata caches). We need to
> > > > make sure that a single, heavily pressured cache doesn't evict small
> > > > caches that lower pressure but are equally important for
> > > > performance.
> > > > 
> > > > e.g. I've noticed recently a significant increase in RMW cycles in
> > > > XFS inode cache writeback during various benchmarks. It hasn't
> > > > affected performance because the machine has IO and CPU to burn, but
> > > > on slower machines and storage, it will have a major impact.
> > > 
> > > Just as a data point, our performance testing infrastructure has bisected
> > > down to the commits discussed in this thread as the cause of about 40%
> > > regression in XFS file delete performance in bonnie++ benchmark.
> > > 
> > 
> > Has anyone done significant testing with Rik's maybe-fix?
> 
> I will give it a spin with bonnie++ today. We'll see what comes out.

OK, I did a bonnie++ run with Rik's patch (on top of 4.20 to rule out other
differences). This machine does not show so big differences in bonnie++
numbers but the difference is still clearly visible. The results are
(averages of 5 runs):

		 Revert			Base			Rik
SeqCreate del    78.04 (   0.00%)	98.18 ( -25.81%)	90.90 ( -16.48%)
RandCreate del   87.68 (   0.00%)	95.01 (  -8.36%)	87.66 (   0.03%)

'Revert' is 4.20 with "mm: don't reclaim inodes with many attached pages"
and "mm: slowly shrink slabs with a relatively small number of objects"
reverted. 'Base' is the kernel without any reverts. 'Rik' is a 4.20 with
Rik's patch applied.

The numbers are time to do a batch of deletes so lower is better. You can see
that the patch did help somewhat but it was not enough to close the gap
when files are deleted in 'readdir' order.

								Honza

> > From: Rik van Riel <riel@surriel.com>
> > Subject: mm, slab, vmscan: accumulate gradual pressure on small slabs
> > 
> > There are a few issues with the way the number of slab objects to scan is
> > calculated in do_shrink_slab.  First, for zero-seek slabs, we could leave
> > the last object around forever.  That could result in pinning a dying
> > cgroup into memory, instead of reclaiming it.  The fix for that is
> > trivial.
> > 
> > Secondly, small slabs receive much more pressure, relative to their size,
> > than larger slabs, due to "rounding up" the minimum number of scanned
> > objects to batch_size.
> > 
> > We can keep the pressure on all slabs equal relative to their size by
> > accumulating the scan pressure on small slabs over time, resulting in
> > sometimes scanning an object, instead of always scanning several.
> > 
> > This results in lower system CPU use, and a lower major fault rate, as
> > actively used entries from smaller caches get reclaimed less aggressively,
> > and need to be reloaded/recreated less often.
> > 
> > [akpm@linux-foundation.org: whitespace fixes, per Roman]
> > [riel@surriel.com: couple of fixes]
> >   Link: http://lkml.kernel.org/r/20190129142831.6a373403@imladris.surriel.com
> > Link: http://lkml.kernel.org/r/20190128143535.7767c397@imladris.surriel.com
> > Fixes: 4b85afbdacd2 ("mm: zero-seek shrinkers")
> > Fixes: 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > Tested-by: Chris Mason <clm@fb.com>
> > Acked-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Dave Chinner <dchinner@redhat.com>
> > Cc: Jonathan Lemon <bsd@fb.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: <stable@vger.kernel.org>
> > 
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > ---
> > 
> > 
> > --- a/include/linux/shrinker.h~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
> > +++ a/include/linux/shrinker.h
> > @@ -65,6 +65,7 @@ struct shrinker {
> >  
> >  	long batch;	/* reclaim batch size, 0 = default */
> >  	int seeks;	/* seeks to recreate an obj */
> > +	int small_scan;	/* accumulate pressure on slabs with few objects */
> >  	unsigned flags;
> >  
> >  	/* These are for internal use */
> > --- a/mm/vmscan.c~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
> > +++ a/mm/vmscan.c
> > @@ -488,18 +488,30 @@ static unsigned long do_shrink_slab(stru
> >  		 * them aggressively under memory pressure to keep
> >  		 * them from causing refetches in the IO caches.
> >  		 */
> > -		delta = freeable / 2;
> > +		delta = (freeable + 1) / 2;
> >  	}
> >  
> >  	/*
> >  	 * Make sure we apply some minimal pressure on default priority
> > -	 * even on small cgroups. Stale objects are not only consuming memory
> > +	 * even on small cgroups, by accumulating pressure across multiple
> > +	 * slab shrinker runs. Stale objects are not only consuming memory
> >  	 * by themselves, but can also hold a reference to a dying cgroup,
> >  	 * preventing it from being reclaimed. A dying cgroup with all
> >  	 * corresponding structures like per-cpu stats and kmem caches
> >  	 * can be really big, so it may lead to a significant waste of memory.
> >  	 */
> > -	delta = max_t(unsigned long long, delta, min(freeable, batch_size));
> > +	if (!delta && shrinker->seeks) {
> > +		unsigned long nr_considered;
> > +
> > +		shrinker->small_scan += freeable;
> > +		nr_considered = shrinker->small_scan >> priority;
> > +
> > +		delta = 4 * nr_considered;
> > +		do_div(delta, shrinker->seeks);
> > +
> > +		if (delta)
> > +			shrinker->small_scan -= nr_considered << priority;
> > +	}
> >  
> >  	total_scan += delta;
> >  	if (total_scan < 0) {
> > _
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Dave Chinner Feb. 8, 2019, 9:25 p.m. UTC | #13
On Thu, Feb 07, 2019 at 09:37:27PM -0800, Andrew Morton wrote:
> On Thu, 7 Feb 2019 11:27:50 +0100 Jan Kara <jack@suse.cz> wrote:
> 
> > On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> > > Maybe for memcgs, but that's exactly the opposite of what we want to
> > > do for global caches (e.g. filesystem metadata caches). We need to
> > > make sure that a single, heavily pressured cache doesn't evict small
> > > caches that lower pressure but are equally important for
> > > performance.
> > > 
> > > e.g. I've noticed recently a significant increase in RMW cycles in
> > > XFS inode cache writeback during various benchmarks. It hasn't
> > > affected performance because the machine has IO and CPU to burn, but
> > > on slower machines and storage, it will have a major impact.
> > 
> > Just as a data point, our performance testing infrastructure has bisected
> > down to the commits discussed in this thread as the cause of about 40%
> > regression in XFS file delete performance in bonnie++ benchmark.
> > 
> 
> Has anyone done significant testing with Rik's maybe-fix?

Apart from pointing out all the bugs and incorrect algorithmic
assumptions it makes, no.

Cheers,

Dave.
Andrew Morton Feb. 8, 2019, 10:49 p.m. UTC | #14
On Fri, 8 Feb 2019 13:50:49 +0100 Jan Kara <jack@suse.cz> wrote:

> > > Has anyone done significant testing with Rik's maybe-fix?
> > 
> > I will give it a spin with bonnie++ today. We'll see what comes out.
> 
> OK, I did a bonnie++ run with Rik's patch (on top of 4.20 to rule out other
> differences). This machine does not show so big differences in bonnie++
> numbers but the difference is still clearly visible. The results are
> (averages of 5 runs):
> 
> 		 Revert			Base			Rik
> SeqCreate del    78.04 (   0.00%)	98.18 ( -25.81%)	90.90 ( -16.48%)
> RandCreate del   87.68 (   0.00%)	95.01 (  -8.36%)	87.66 (   0.03%)
> 
> 'Revert' is 4.20 with "mm: don't reclaim inodes with many attached pages"
> and "mm: slowly shrink slabs with a relatively small number of objects"
> reverted. 'Base' is the kernel without any reverts. 'Rik' is a 4.20 with
> Rik's patch applied.
> 
> The numbers are time to do a batch of deletes so lower is better. You can see
> that the patch did help somewhat but it was not enough to close the gap
> when files are deleted in 'readdir' order.

OK, thanks.

I guess we need a rethink on Roman's fixes.   I'll queue the reverts.


BTW, one thing I don't think has been discussed (or noticed) is the
effect of "mm: don't reclaim inodes with many attached pages" on 32-bit
highmem machines.  Look why someone added that code in the first place:

: commit f9a316fa9099053a299851762aedbf12881cff42
: Author: Andrew Morton <akpm@digeo.com>
: Date:   Thu Oct 31 04:09:37 2002 -0800
: 
:     [PATCH] strip pagecache from to-be-reaped inodes
:     
:     With large highmem machines and many small cached files it is possible
:     to encounter ZONE_NORMAL allocation failures.  This can be demonstrated
:     with a large number of one-byte files on a 7G machine.
:     
:     All lowmem is filled with icache and all those inodes have a small
:     amount of highmem pagecache which makes them unfreeable.
:     
:     The patch strips the pagecache from inodes as they come off the tail of
:     the inode_unused list.
:     
:     I play tricks in there peeking at the head of the inode_unused list to
:     pick up the inode again after running iput().  The alternatives seemed
:     to involve more widespread changes.
:     
:     Or running invalidate_inode_pages() under inode_lock which would be a
:     bad thing from a scheduling latency and lock contention point of view.

I guess I should have added a comment.  Doh.
Roman Gushchin Feb. 9, 2019, 3:42 a.m. UTC | #15
On Fri, Feb 08, 2019 at 02:49:44PM -0800, Andrew Morton wrote:
> On Fri, 8 Feb 2019 13:50:49 +0100 Jan Kara <jack@suse.cz> wrote:
> 
> > > > Has anyone done significant testing with Rik's maybe-fix?
> > > 
> > > I will give it a spin with bonnie++ today. We'll see what comes out.
> > 
> > OK, I did a bonnie++ run with Rik's patch (on top of 4.20 to rule out other
> > differences). This machine does not show so big differences in bonnie++
> > numbers but the difference is still clearly visible. The results are
> > (averages of 5 runs):
> > 
> > 		 Revert			Base			Rik
> > SeqCreate del    78.04 (   0.00%)	98.18 ( -25.81%)	90.90 ( -16.48%)
> > RandCreate del   87.68 (   0.00%)	95.01 (  -8.36%)	87.66 (   0.03%)
> > 
> > 'Revert' is 4.20 with "mm: don't reclaim inodes with many attached pages"
> > and "mm: slowly shrink slabs with a relatively small number of objects"
> > reverted. 'Base' is the kernel without any reverts. 'Rik' is a 4.20 with
> > Rik's patch applied.
> > 
> > The numbers are time to do a batch of deletes so lower is better. You can see
> > that the patch did help somewhat but it was not enough to close the gap
> > when files are deleted in 'readdir' order.
> 
> OK, thanks.
> 
> I guess we need a rethink on Roman's fixes.   I'll queue the reverts.

Agree.

I still believe that machine-wide memory pressure should clean up any
remains of dead cgroups, and Rik's patch is a step in the right
direction. But we need to run some experiments and probably make some
code changes here to guarantee that we don't regress on performance.

> 
> 
> BTW, one thing I don't think has been discussed (or noticed) is the
> effect of "mm: don't reclaim inodes with many attached pages" on 32-bit
> highmem machines.  Look why someone added that code in the first place:
> 
> : commit f9a316fa9099053a299851762aedbf12881cff42
> : Author: Andrew Morton <akpm@digeo.com>
> : Date:   Thu Oct 31 04:09:37 2002 -0800
> : 
> :     [PATCH] strip pagecache from to-be-reaped inodes
> :     
> :     With large highmem machines and many small cached files it is possible
> :     to encounter ZONE_NORMAL allocation failures.  This can be demonstrated
> :     with a large number of one-byte files on a 7G machine.
> :     
> :     All lowmem is filled with icache and all those inodes have a small
> :     amount of highmem pagecache which makes them unfreeable.
> :     
> :     The patch strips the pagecache from inodes as they come off the tail of
> :     the inode_unused list.
> :     
> :     I play tricks in there peeking at the head of the inode_unused list to
> :     pick up the inode again after running iput().  The alternatives seemed
> :     to involve more widespread changes.
> :     
> :     Or running invalidate_inode_pages() under inode_lock which would be a
> :     bad thing from a scheduling latency and lock contention point of view.
> 
> I guess I should have added a comment.  Doh.
> 

It's a very useful link.

Thanks!
Wolfgang Walter Feb. 11, 2019, 3:34 p.m. UTC | #16
Am Donnerstag, 7. Februar 2019, 11:27:50 schrieb Jan Kara:
> On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> > Maybe for memcgs, but that's exactly the opposite of what we want to
> > do for global caches (e.g. filesystem metadata caches). We need to
> > make sure that a single, heavily pressured cache doesn't evict small
> > caches that lower pressure but are equally important for
> > performance.
> > 
> > e.g. I've noticed recently a significant increase in RMW cycles in
> > XFS inode cache writeback during various benchmarks. It hasn't
> > affected performance because the machine has IO and CPU to burn, but
> > on slower machines and storage, it will have a major impact.
> 
> Just as a data point, our performance testing infrastructure has bisected
> down to the commits discussed in this thread as the cause of about 40%
> regression in XFS file delete performance in bonnie++ benchmark.

We also bisected the big IO-performance problem on an imap server (starting
with 4.19.3) down to

	mm: don't reclaim inodes with many attached pages
	commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0 upstream.

On other servers the filesystems sometimes seem to hang for 10 seconds or
more.

We also see a performance regression compared to 4.14 even with this patch 
reverted, but much less dramatic.

Now I saw this thread and I'll try to revert

172b06c32b949759fe6313abec514bc4f15014f4

and see if this helps.

Regards,
diff mbox series

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 0cd47fe0dbe5..73432e64f874 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -730,11 +730,8 @@  static enum lru_status inode_lru_isolate(struct list_head *item,
 		return LRU_REMOVED;
 	}
 
-	/*
-	 * Recently referenced inodes and inodes with many attached pages
-	 * get one more pass.
-	 */
-	if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {
+	/* recently referenced inodes get one more pass */
+	if (inode->i_state & I_REFERENCED) {
 		inode->i_state &= ~I_REFERENCED;
 		spin_unlock(&inode->i_lock);
 		return LRU_ROTATE;