Message ID: 20190130041707.27750-2-david@fromorbit.com (mailing list archive)
State: New, archived
Series: mm: shrinkers are now way too aggressive
On 29 Jan 2019, at 23:17, Dave Chinner wrote:

> From: Dave Chinner <dchinner@redhat.com>
>
> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
>
> This change causes serious changes to page cache and inode cache
> behaviour and balance, resulting in major performance regressions
> when combining worklaods such as large file copies and kernel
> compiles.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=202441

I'm a little confused by the latest comment in the bz:

https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24

Are these reverts sufficient?

Roman beat me to suggesting Rik's followup. We hit a different problem
in prod with small slabs, and have a lot of instrumentation on Rik's
code helping.

-chris
On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote: > > > On 29 Jan 2019, at 23:17, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0. > > > > This change causes serious changes to page cache and inode cache > > behaviour and balance, resulting in major performance regressions > > when combining worklaods such as large file copies and kernel > > compiles. > > > > https://bugzilla.kernel.org/show_bug.cgi?id=202441 > > I'm a little confused by the latest comment in the bz: > > https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24 Which says the first patch that changed the shrinker behaviour is the underlying cause of the regression. > Are these reverts sufficient? I think so. > Roman beat me to suggesting Rik's followup. We hit a different problem > in prod with small slabs, and have a lot of instrumentation on Rik's > code helping. I think that's just another nasty, expedient hack that doesn't solve the underlying problem. Solving the underlying problem does not require changing core reclaim algorithms and upsetting a page reclaim/shrinker balance that has been stable and worked well for just about everyone for years. Cheers, Dave.
On Thu 31-01-19 12:34:03, Dave Chinner wrote: > On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote: > > > > > > On 29 Jan 2019, at 23:17, Dave Chinner wrote: > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0. > > > > > > This change causes serious changes to page cache and inode cache > > > behaviour and balance, resulting in major performance regressions > > > when combining worklaods such as large file copies and kernel > > > compiles. > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=202441 > > > > I'm a little confused by the latest comment in the bz: > > > > https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24 > > Which says the first patch that changed the shrinker behaviour is > the underlying cause of the regression. > > > Are these reverts sufficient? > > I think so. > > > Roman beat me to suggesting Rik's followup. We hit a different problem > > in prod with small slabs, and have a lot of instrumentation on Rik's > > code helping. > > I think that's just another nasty, expedient hack that doesn't solve > the underlying problem. Solving the underlying problem does not > require changing core reclaim algorithms and upsetting a page > reclaim/shrinker balance that has been stable and worked well for > just about everyone for years. I tend to agree with Dave here. Slab pressure balancing is quite subtle and easy to get wrong. If we want to plug the problem with offline memcgs then the fix should be targeted at that problem. So maybe we want to emulate high pressure on offline memcgs only. There might be other issues to resolve for small caches but let's start with something more targeted first please.
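For illustration only, a minimal sketch of the kind of targeted change Michal is suggesting — extra scan pressure applied to offline memcgs and nothing else. The helper below and its placement in do_shrink_slab() are assumptions, not a tested patch; mem_cgroup_online() is the existing kernel test for whether a cgroup is still online.

/*
 * Hypothetical sketch, not a tested patch: keep the normal scan
 * balance for live cgroups and global caches, and only force a
 * minimum amount of scanning for dying memcgs so their last stale
 * objects cannot pin the cgroup forever.
 */
static unsigned long memcg_offline_pressure(struct mem_cgroup *memcg,
					    unsigned long delta,
					    unsigned long freeable,
					    long batch_size)
{
	if (!memcg || mem_cgroup_online(memcg))
		return delta;	/* leave everyone else alone */

	/* dying cgroup: always scan something, up to one batch */
	return max_t(unsigned long, delta,
		     min_t(unsigned long, freeable, batch_size));
}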
On 30 Jan 2019, at 20:34, Dave Chinner wrote: > On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote: >> >> >> On 29 Jan 2019, at 23:17, Dave Chinner wrote: >> >>> From: Dave Chinner <dchinner@redhat.com> >>> >>> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0. >>> >>> This change causes serious changes to page cache and inode cache >>> behaviour and balance, resulting in major performance regressions >>> when combining worklaods such as large file copies and kernel >>> compiles. >>> >>> https://bugzilla.kernel.org/show_bug.cgi?id=202441 >> >> I'm a little confused by the latest comment in the bz: >> >> https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24 > > Which says the first patch that changed the shrinker behaviour is > the underlying cause of the regression. > >> Are these reverts sufficient? > > I think so. Based on the latest comment: "If I had been less strict in my testing I probably would have discovered that the problem was present earlier than 4.19.3. Mr Gushins commit made it more visible. I'm going back to work after two days off, so I might not be able to respond inside your working hours, but I'll keep checking in on this as I get a chance." I don't think the reverts are sufficient. > >> Roman beat me to suggesting Rik's followup. We hit a different >> problem >> in prod with small slabs, and have a lot of instrumentation on Rik's >> code helping. > > I think that's just another nasty, expedient hack that doesn't solve > the underlying problem. Solving the underlying problem does not > require changing core reclaim algorithms and upsetting a page > reclaim/shrinker balance that has been stable and worked well for > just about everyone for years. > Things are definitely breaking down in non-specialized workloads, and have been for a long time. -chris
On Thu, Jan 31, 2019 at 10:10:11AM +0100, Michal Hocko wrote: > On Thu 31-01-19 12:34:03, Dave Chinner wrote: > > On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote: > > > > > > > > > On 29 Jan 2019, at 23:17, Dave Chinner wrote: > > > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0. > > > > > > > > This change causes serious changes to page cache and inode cache > > > > behaviour and balance, resulting in major performance regressions > > > > when combining worklaods such as large file copies and kernel > > > > compiles. > > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=202441 > > > > > > I'm a little confused by the latest comment in the bz: > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24 > > > > Which says the first patch that changed the shrinker behaviour is > > the underlying cause of the regression. > > > > > Are these reverts sufficient? > > > > I think so. > > > > > Roman beat me to suggesting Rik's followup. We hit a different problem > > > in prod with small slabs, and have a lot of instrumentation on Rik's > > > code helping. > > > > I think that's just another nasty, expedient hack that doesn't solve > > the underlying problem. Solving the underlying problem does not > > require changing core reclaim algorithms and upsetting a page > > reclaim/shrinker balance that has been stable and worked well for > > just about everyone for years. > > I tend to agree with Dave here. Slab pressure balancing is quite subtle > and easy to get wrong. If we want to plug the problem with offline > memcgs then the fix should be targeted at that problem. So maybe we want > to emulate high pressure on offline memcgs only. There might be other > issues to resolve for small caches but let's start with something more > targeted first please. First, the path proposed by Dave is not regression-safe too. A slab object can be used by other cgroups as well, so creating an artificial pressure on the dying cgroup might perfectly affect the rest of the system. We do reparent slab lists on offlining, so there is even no easy way to iterate over them. Also, creating an artifical pressure will create unnecessary CPU load. So I'd really prefer to make the "natural" memory pressure to be applied in a way, that doesn't leave any stalled objects behind. Second, the code around slab pressure is not "worked well for years": as I can see the latest major change was made about a year ago by Josef Bacik (9092c71bb724 "mm: use sc->priority for slab shrink targets"). The existing balance, even if it works perfectly for some cases, isn't something set in stone. We're really under-scanning small cgroups, and I strongly believe that what Rik is proposing is a right thing to do. If we don't scan objects in small cgroups unless we have really strong memory pressure, we're basically wasting memory. And it really makes no sense to reclaim inodes with tons of attached pagecache as easy as "empty" inodes. At the end, all we need is to free some memory, and treating a sub-page object equal to many thousands page object is just strange. If it's simple "wrong" and I do miss something, please, explain. Maybe we need something more complicated than in my patch, but saying that existing code is just perfect and can't be touched at all makes no sense to me. So, assuming all this, can we, please, first check if Rik's patch is addressing the regression? Thanks!
On Thu, Jan 31, 2019 at 06:57:10PM +0000, Roman Gushchin wrote: > On Thu, Jan 31, 2019 at 10:10:11AM +0100, Michal Hocko wrote: > > On Thu 31-01-19 12:34:03, Dave Chinner wrote: > > > On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote: > > > > > > > > > > > > On 29 Jan 2019, at 23:17, Dave Chinner wrote: > > > > > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > > > This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0. > > > > > > > > > > This change causes serious changes to page cache and inode cache > > > > > behaviour and balance, resulting in major performance regressions > > > > > when combining worklaods such as large file copies and kernel > > > > > compiles. > > > > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=202441 > > > > > > > > I'm a little confused by the latest comment in the bz: > > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24 > > > > > > Which says the first patch that changed the shrinker behaviour is > > > the underlying cause of the regression. > > > > > > > Are these reverts sufficient? > > > > > > I think so. > > > > > > > Roman beat me to suggesting Rik's followup. We hit a different problem > > > > in prod with small slabs, and have a lot of instrumentation on Rik's > > > > code helping. > > > > > > I think that's just another nasty, expedient hack that doesn't solve > > > the underlying problem. Solving the underlying problem does not > > > require changing core reclaim algorithms and upsetting a page > > > reclaim/shrinker balance that has been stable and worked well for > > > just about everyone for years. > > > > I tend to agree with Dave here. Slab pressure balancing is quite subtle > > and easy to get wrong. If we want to plug the problem with offline > > memcgs then the fix should be targeted at that problem. So maybe we want > > to emulate high pressure on offline memcgs only. There might be other > > issues to resolve for small caches but let's start with something more > > targeted first please. > > First, the path proposed by Dave is not regression-safe too. A slab object > can be used by other cgroups as well, so creating an artificial pressure on > the dying cgroup might perfectly affect the rest of the system. Isolating regressions to the memcg dying path is far better than the current changes you've made, which have already been *proven* to affect the rest of the system. > We do reparent > slab lists on offlining, so there is even no easy way to iterate over them. > Also, creating an artifical pressure will create unnecessary CPU load. Yes, so do your changes - they increase small slab scanning by 100x which means reclaim CPU has most definitely increased. However, the memcg reaper *doesn't need to be perfect* to solve the "takes too long to clean up dying memcgs". Even if it leaves shared objects behind (which we want to do!), it still trims those memcgs down to /just the shared objects still in use/. And given that objects shared by memcgs are in the minority (according to past discussions about the difficulies of accounting them correctly) I think this is just fine. Besides, those reamining shared objects are the ones we want to naturally age out under memory pressure, but otherwise the memcgs will have been shaken clean of all other objects accounted to them. i.e. the "dying memcg" memory footprint goes down massively and the "long term buildup" of dying memcgs basically goes away. 
> So I'd really prefer to make the "natural" memory pressure to be applied > in a way, that doesn't leave any stalled objects behind. Except that "natural" memory pressure isn't enough to do this, so you've artificially inflated that "natural" pressure and screwed up the reclaim for everyone else. > Second, the code around slab pressure is not "worked well for years": as I can > see the latest major change was made about a year ago by Josef Bacik > (9092c71bb724 "mm: use sc->priority for slab shrink targets"). Strawman. False equivalence. The commit you mention was effectively a minor tweak - instead of using the ratio of pages scanned to pages reclaimed to define the amount of slab work, (a second order measure of reclaim urgency), it was changed to use the primary reclaim urgency directive that the page scanning uses - the reclaim priority. This *isn't a major change* in algorithm - it conveys essentially exactly the same information, and does not change the overall page cache vs shrinker reclaim balance. It just behaves more correctly in corner cases where there is very little page cache/lots of slab and vice versa. It barely even changed the amount or balance of reclaim being done under normal circumstances. > The existing balance, even if it works perfectly for some cases, isn't something > set in stone. We're really under-scanning small cgroups, and I strongly believe > that what Rik is proposing is a right thing to do. You keep saying "small cgroups". That is incorrect. The shrinker knows nothing about the size of the cache. All it knows is how many /freeable objects/ are in the cache. A large cache can have zero freeable objects, and to the shrinker look just like a small cache with just a few freeable objects. You're trying to derive cache characterisations from accounting data that carries no such information. IOWs, adding yet another broken, racy, unpredictable "deferred scan count" to triggers when there is few freeable objects in the cache is not an improvement. The code is completely buggy, because shrinkers can run in parallel and so shrinker->small_scan can be modified by multiple shrinker instances running at the same time. That's the problem the shrinker->nr_deferred[] array and the atomic exchanges solve (which has other problems we really need to fix before we even start thinking about changing shrinker balances at all). > If we don't scan objects > in small cgroups unless we have really strong memory pressure, we're basically > wasting memory. Maybe for memcgs, but that's exactly the oppose of what we want to do for global caches (e.g. filesystem metadata caches). We need to make sure that a single, heavily pressured cache doesn't evict small caches that lower pressure but are equally important for performance. e.g. I've noticed recently a significant increase in RMW cycles in XFS inode cache writeback during various benchmarks. It hasn't affected performance because the machine has IO and CPU to burn, but on slower machines and storage, it will have a major impact. The reason these RMW cycles occur is that the dirty inode in memory needs to be flushed to the backing buffer before it can be written. That backing buffer is in a metadata slab cache controlled by a shrinker (not the superblock shrinker which reclaims inodes). It's *much smaller* than the inode cache, but it also needs to be turned over much more slowly than the inode cache. 
If it gets turned over too quickly because of inode cache pressure, then flushing dirty inodes (in the superblock inode cache shrinker!) needs to read the backing buffer back in to memory before it can write the inode and reclaim it. And when you change the shrinker infrastructure to reclaim small caches more aggressively? It changes the balance of cache reclaim within the filesystem, and so opens up these "should have been in the cache but wasn't" performance problems. Where are all the problems that users have reported? In the superblock shrinker, reclaiming XFS inodes, waiting on IO to be done on really slow Io subsystems. Basically, we're stuck because the shrinker changes have screwed up the intra-cache reclaim balance. IOWs, there's complex dependencies between shrinkers and memory reclaim, and changing the balance of large vs small caches can have very adverse affects when memory pressure occurs. For XFS, increasing small cache pressure vs large cache pressure has the effect of *increasing the memory demand* of inode reclaim because writeback has to allocate buffers, and that's one of the reasons it's the canary for naive memory reclaim algorithm changes. Just because something "works for your memcg heavy workload" doesn't mean it works for everyone, or is even a desirable algorithmic change. > And it really makes no sense to reclaim inodes with tons of attached pagecache > as easy as "empty" inodes. You didn't even know this happened until recently. Have you looked up it's history? e.g. from the initial git commit in 2.6.12, almost 15 years ago now, there's this comment on prune_icache(): * Any inodes which are pinned purely because of attached pagecache have their * pagecache removed. We expect the final iput() on that inode to add it to * the front of the inode_unused list. So look for it there and if the * inode is still freeable, proceed. The right inode is found 99.9% of the * time in testing on a 4-way. Indeed, did you know this inode cache based page reclaim is what PGINODESTEAL and KSWAPD_INODESTEAL /proc/vmstat record for us? e.g. looking at my workstation: $ grep inodesteal /proc/vmstat pginodesteal 14617964 kswapd_inodesteal 6674859 $ Page cache reclaim from clean, unreferenced, aged-out inodes actually happens an awful lot in real life. > So, assuming all this, can we, please, first check if Rik's patch is addressing > the regression? Nope, it's broken and buggy, and reintroduces problems with racing deferred counts that were fixed years ago when I originally wrote the numa aware shrinker infrastructure. Cheers, Dave.
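For reference, the nr_deferred[] mechanism Dave refers to above works roughly as follows. This is a simplified paraphrase of the do_shrink_slab() logic in mm/vmscan.c, not a verbatim excerpt: deferred work is kept per node and handed over with an atomic exchange, so two reclaimers running the same shrinker concurrently cannot both claim, or both re-add, the same work — which a plain "int small_scan" field in struct shrinker cannot guarantee.

	long nr, new_nr, next_deferred;

	/* claim whatever work earlier GFP_NOFS contexts could not do */
	nr = atomic_long_xchg(&shrinker->nr_deferred[sc->nid], 0);
	total_scan = nr + delta;

	/* ... scan up to total_scan objects, work out next_deferred ... */

	/* hand the remainder back for the next reclaimer on this node */
	if (next_deferred > 0)
		new_nr = atomic_long_add_return(next_deferred,
						&shrinker->nr_deferred[sc->nid]);
	else
		new_nr = atomic_long_read(&shrinker->nr_deferred[sc->nid]);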
On Thu, Jan 31, 2019 at 03:48:11PM +0000, Chris Mason wrote: > On 30 Jan 2019, at 20:34, Dave Chinner wrote: > > > On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote: > >> > >> > >> On 29 Jan 2019, at 23:17, Dave Chinner wrote: > >> > >>> From: Dave Chinner <dchinner@redhat.com> > >>> > >>> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0. > >>> > >>> This change causes serious changes to page cache and inode cache > >>> behaviour and balance, resulting in major performance regressions > >>> when combining worklaods such as large file copies and kernel > >>> compiles. > >>> > >>> https://bugzilla.kernel.org/show_bug.cgi?id=202441 > >> > >> I'm a little confused by the latest comment in the bz: > >> > >> https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24 > > > > Which says the first patch that changed the shrinker behaviour is > > the underlying cause of the regression. > > > >> Are these reverts sufficient? > > > > I think so. > > Based on the latest comment: > > "If I had been less strict in my testing I probably would have > discovered that the problem was present earlier than 4.19.3. Mr Gushins > commit made it more visible. > I'm going back to work after two days off, so I might not be able to > respond inside your working hours, but I'll keep checking in on this as > I get a chance." > > I don't think the reverts are sufficient. Roger has tested the two reverts more heavily against 5.0.0-rc3. Without the reverts, the machine locks up hard. With the two reverts applied, it runs along smoothly under extremely heavy load. https://bugzilla.kernel.org/show_bug.cgi?id=202441#c26 So, yes, these changes need to be reverted. Cheers, Dave.
On Fri, Feb 01, 2019 at 09:19:04AM +1100, Dave Chinner wrote: > > So, assuming all this, can we, please, first check if Rik's patch is addressing > > the regression? > > Nope, it's broken and buggy, and reintroduces problems with racing > deferred counts that were fixed years ago when I originally > wrote the numa aware shrinker infrastructure. So, the first thing to do here is fix the non-deterministic deferred reclaim behaviour of the shrinker. Any deferred reclaim because of things like GFP_NOFS contexts get landed on the the very next shrinker instance that runs, be it kswapd, direct reclaim, or whether it can even perform reclaim or not (e.g. it might be a GFP_NOFS context itself). As a result, there is no predicting when a shrinker instance might get landed with a huge amount of work that isn't it's own. We can't even guarantee that kswapd reclaim context sees this deferred work, because if there is another reclaimer on that node running at the same time, it will have scooped away the deferred work and kswapd will only do a small amount of work. How small? Well a node with 1.8m freeable items, reclaim priority 12 (i.e. lowest priority), and seeks = 2 will result in a scan count of: delta = 1,800,000 >> 12 = 440 delta *= 4 = 1600 delta /= 2 = 800. the shrinker will only scan 800 objects when there is light memory pressure. That's only 0.04% of the cache, which is insignificant. That's fine for direct reclaim, but for kswapd we need it to do more work when there is deferred work. So, say we have another 1m objects of deferred work (very common on filesystem (and therefore GFP_NOFS) heavy workloads), we'll do: total_scan = deferred_objects; .... total_scan += delta; (ignoring clamping) which means that we'll actually try to scan 1,000,800 objects from the cache on the next pass. This may be clamped down to (freeable_objects / 2), but that's still 900,000 objects in this case. IOWs, we just increased the reclaim work of this shrinker instance by a factor of 1000x. That's where all the long tail shrinker reclaim latencies are coming from, and where a large amount of the "XFS is doing inode IO from the shrinker" come from as they drive it straight through all the clean inodes and into dirty reclaimable inodes. With the additional "small cache" pressure being added and then deferred since 4.18-rc5, this is much more likely to happen with direct reclaim because the deferred count from GFP_NOFS allocations are wound up faster. So, if we want to prevent direct reclaim from this obvious long tail latency problem, we have to stop direct reclaim from ever doing deferred work. i.e. we need to move deferred work to kswapd and, for XFS, we then have to ensure that kswapd will not block on dirty inodes so it can do all this work as quickly as possible. And then for kswapd, we need to limit the amount of deferred work so that it doesn't spend all it's time emptying a single cache at low priorities, but will attempt to perform all the deferred work if reclaim priority winds up far enough. This gives us a solid, predictable "deferred work" infrastructure for the shrinkers. It gets rid of the nasty behaviour, and paves the way for adding different sorts of deferred work (like Rik's "small cache" pressure increase) to kswapd rather than in random reclaim contexts. It also allows us to use a different "how much work should we do" calculation for kswapd. i.e. 
one that is appropriate for background, non-blocking scanning rather than being tailored to limiting the work that any one direct reclaim context must do. So, if I just set the XFS inode cache shrinker to skip inode writeback (as everyone seems to want to do), fsmark is OOM-killed at 27M inodes, right when the log runs out of space, the tail-pushing thread goes near to being CPU bound, and inode writeback from reclaim is necessary to retire the dirty inode from the log before it can be reclaimed. It's easily reproducable, and sometimes the oom-killer chooses daemons rather than fsmark and it has killed the system on occasion. Reverting the two patches in this thread makes the OOM kill problem go away - it just turns it back into a performance issue. So, on top of the reverts, the patch below that reworks the deferred shrinker work to kswapd is the first patch I've been able to get a "XFs inode shrinker doesn't block kswapd" patch through my benchmarks and memory stress workloads without triggering OOM-kills randomly or seeing substantial performance regressions. Indeed, it appears to behave better that the existing code (fsmark inode create is marginally faster, simoops long tails have completely gone(*)). This indicates to me that we really should be considering fixing the deferred work problems before adding new types of deferred work into the shrinker infrastructure (for whatever reason). Get the infrastructure, reliable, predictable and somewhat deterministic, then we can start trialling pressure/balance changes knowing exactly where we are directing that extra work.... Cheers, Dave. (*) Chris, FYI, the last output before symoops died because "too many open files" - p99 latency is nearly identical to p50 latency: Run time: 10873 seconds Read latency (p50: 3,084,288) (p95: 3,158,016) (p99: 3,256,320) Write latency (p50: 7,479,296) (p95: 8,101,888) (p99: 8,437,760) Allocation latency (p50: 479,744) (p95: 744,448) (p99: 1,016,832) work rate = 8.63/sec (avg 9.05/sec) (p50: 9.19) (p95: 9.91) (p99: 11.42) alloc stall rate = 0.02/sec (avg: 0.01) (p50: 0.01) (p95: 0.01) (p99: 0.01) This is the same machine that I originally ran simoops on back in ~4.9 when you first proposed async kswapd behaviour for XFS. Typical long tail latencies back then were: [https://www.spinics.net/lists/linux-xfs/msg02235.html] Run time: 1140 seconds Read latency (p50: 3,035,136) (p95: 4,415,488) (p99: 6,119,424) Write latency (p50: 27,557,888) (p95: 31,490,048) (p99: 45,285,376) Allocation latency (p50: 247,552) (p95: 1,497,088) (p99: 19,496,960) work rate = 3.68/sec (avg 3.71/sec) (p50: 3.71) (p95: 4.04) (p99: 4.04) alloc stall rate = 1.65/sec (avg: 0.12) (p50: 0.00) (p95: 0.12) (p99: 0.12)
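For anyone who wants to check the arithmetic Dave walks through above, here is a small self-contained userspace program reproducing it. The structure mirrors the delta and deferred-work handling in do_shrink_slab(); the input numbers are his example (his in-text figures are rounded), and the clamping is simplified.

#include <stdio.h>

int main(void)
{
	unsigned long long freeable = 1800000;	/* freeable objects on this node */
	unsigned long long deferred = 1000000;	/* work deferred by GFP_NOFS contexts */
	int priority = 12;			/* lowest pressure (DEF_PRIORITY) */
	int seeks = 2;				/* default shrinker->seeks */

	unsigned long long delta = freeable >> priority;	/* ~440 */
	delta *= 4;
	delta /= seeks;						/* ~880 objects */

	unsigned long long total_scan = deferred + delta;
	/* the clamp Dave mentions: light pressure caps the scan at freeable / 2 */
	if (delta < freeable / 4 && total_scan > freeable / 2)
		total_scan = freeable / 2;

	printf("light pressure, nothing deferred: scan %llu of %llu objects\n",
	       delta, freeable);
	printf("light pressure, 1M deferred:      scan %llu objects\n",
	       total_scan);
	return 0;
}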
On Fri 01-02-19 09:19:04, Dave Chinner wrote: > Maybe for memcgs, but that's exactly the oppose of what we want to > do for global caches (e.g. filesystem metadata caches). We need to > make sure that a single, heavily pressured cache doesn't evict small > caches that lower pressure but are equally important for > performance. > > e.g. I've noticed recently a significant increase in RMW cycles in > XFS inode cache writeback during various benchmarks. It hasn't > affected performance because the machine has IO and CPU to burn, but > on slower machines and storage, it will have a major impact. Just as a data point, our performance testing infrastructure has bisected down to the commits discussed in this thread as the cause of about 40% regression in XFS file delete performance in bonnie++ benchmark. Honza
On Thu, 7 Feb 2019 11:27:50 +0100 Jan Kara <jack@suse.cz> wrote:

> On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> > Maybe for memcgs, but that's exactly the oppose of what we want to
> > do for global caches (e.g. filesystem metadata caches). We need to
> > make sure that a single, heavily pressured cache doesn't evict small
> > caches that lower pressure but are equally important for
> > performance.
> >
> > e.g. I've noticed recently a significant increase in RMW cycles in
> > XFS inode cache writeback during various benchmarks. It hasn't
> > affected performance because the machine has IO and CPU to burn, but
> > on slower machines and storage, it will have a major impact.
>
> Just as a data point, our performance testing infrastructure has bisected
> down to the commits discussed in this thread as the cause of about 40%
> regression in XFS file delete performance in bonnie++ benchmark.

Has anyone done significant testing with Rik's maybe-fix?


From: Rik van Riel <riel@surriel.com>
Subject: mm, slab, vmscan: accumulate gradual pressure on small slabs

There are a few issues with the way the number of slab objects to scan is
calculated in do_shrink_slab. First, for zero-seek slabs, we could leave
the last object around forever. That could result in pinning a dying
cgroup into memory, instead of reclaiming it. The fix for that is
trivial.

Secondly, small slabs receive much more pressure, relative to their size,
than larger slabs, due to "rounding up" the minimum number of scanned
objects to batch_size.

We can keep the pressure on all slabs equal relative to their size by
accumulating the scan pressure on small slabs over time, resulting in
sometimes scanning an object, instead of always scanning several.

This results in lower system CPU use, and a lower major fault rate, as
actively used entries from smaller caches get reclaimed less aggressively,
and need to be reloaded/recreated less often.

[akpm@linux-foundation.org: whitespace fixes, per Roman]
[riel@surriel.com: couple of fixes]
Link: http://lkml.kernel.org/r/20190129142831.6a373403@imladris.surriel.com
Link: http://lkml.kernel.org/r/20190128143535.7767c397@imladris.surriel.com
Fixes: 4b85afbdacd2 ("mm: zero-seek shrinkers")
Fixes: 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Chris Mason <clm@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jonathan Lemon <bsd@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

--- a/include/linux/shrinker.h~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
+++ a/include/linux/shrinker.h
@@ -65,6 +65,7 @@ struct shrinker {
 
 	long batch;	/* reclaim batch size, 0 = default */
 	int seeks;	/* seeks to recreate an obj */
+	int small_scan;	/* accumulate pressure on slabs with few objects */
 	unsigned flags;
 
 	/* These are for internal use */
--- a/mm/vmscan.c~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
+++ a/mm/vmscan.c
@@ -488,18 +488,30 @@ static unsigned long do_shrink_slab(stru
 		 * them aggressively under memory pressure to keep
 		 * them from causing refetches in the IO caches.
 		 */
-		delta = freeable / 2;
+		delta = (freeable + 1) / 2;
 	}
 
 	/*
 	 * Make sure we apply some minimal pressure on default priority
-	 * even on small cgroups. Stale objects are not only consuming memory
+	 * even on small cgroups, by accumulating pressure across multiple
+	 * slab shrinker runs. Stale objects are not only consuming memory
 	 * by themselves, but can also hold a reference to a dying cgroup,
 	 * preventing it from being reclaimed. A dying cgroup with all
 	 * corresponding structures like per-cpu stats and kmem caches
 	 * can be really big, so it may lead to a significant waste of memory.
 	 */
-	delta = max_t(unsigned long long, delta, min(freeable, batch_size));
+	if (!delta && shrinker->seeks) {
+		unsigned long nr_considered;
+
+		shrinker->small_scan += freeable;
+		nr_considered = shrinker->small_scan >> priority;
+
+		delta = 4 * nr_considered;
+		do_div(delta, shrinker->seeks);
+
+		if (delta)
+			shrinker->small_scan -= nr_considered << priority;
+	}
 
 	total_scan += delta;
 	if (total_scan < 0) {
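As a rough illustration of what the accumulation in the hunk above does for a tiny cache, here is a userspace sketch; the cache size, priority, seeks value and run count are made-up example numbers, not measurements.

#include <stdio.h>

int main(void)
{
	unsigned long freeable = 50;		/* a tiny slab cache */
	int priority = 12, seeks = 2;
	unsigned long small_scan = 0;		/* shrinker->small_scan */
	unsigned long total_scanned = 0;

	for (int run = 1; run <= 200; run++) {
		unsigned long delta = freeable >> priority;	/* 0 for small caches */

		if (!delta && seeks) {
			unsigned long nr_considered;

			small_scan += freeable;
			nr_considered = small_scan >> priority;

			delta = 4 * nr_considered / seeks;
			if (delta)
				small_scan -= nr_considered << priority;
		}
		total_scanned += delta;
	}

	/*
	 * With these numbers only a handful of objects are scanned across
	 * all 200 runs, where the current minimum-pressure code would scan
	 * up to min(freeable, batch_size) objects on every single run.
	 */
	printf("objects scanned over 200 runs: %lu\n", total_scanned);
	return 0;
}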
On Thu 07-02-19 21:37:27, Andrew Morton wrote: > On Thu, 7 Feb 2019 11:27:50 +0100 Jan Kara <jack@suse.cz> wrote: > > > On Fri 01-02-19 09:19:04, Dave Chinner wrote: > > > Maybe for memcgs, but that's exactly the oppose of what we want to > > > do for global caches (e.g. filesystem metadata caches). We need to > > > make sure that a single, heavily pressured cache doesn't evict small > > > caches that lower pressure but are equally important for > > > performance. > > > > > > e.g. I've noticed recently a significant increase in RMW cycles in > > > XFS inode cache writeback during various benchmarks. It hasn't > > > affected performance because the machine has IO and CPU to burn, but > > > on slower machines and storage, it will have a major impact. > > > > Just as a data point, our performance testing infrastructure has bisected > > down to the commits discussed in this thread as the cause of about 40% > > regression in XFS file delete performance in bonnie++ benchmark. > > > > Has anyone done significant testing with Rik's maybe-fix? I will give it a spin with bonnie++ today. We'll see what comes out. Honza > > > > From: Rik van Riel <riel@surriel.com> > Subject: mm, slab, vmscan: accumulate gradual pressure on small slabs > > There are a few issues with the way the number of slab objects to scan is > calculated in do_shrink_slab. First, for zero-seek slabs, we could leave > the last object around forever. That could result in pinning a dying > cgroup into memory, instead of reclaiming it. The fix for that is > trivial. > > Secondly, small slabs receive much more pressure, relative to their size, > than larger slabs, due to "rounding up" the minimum number of scanned > objects to batch_size. > > We can keep the pressure on all slabs equal relative to their size by > accumulating the scan pressure on small slabs over time, resulting in > sometimes scanning an object, instead of always scanning several. > > This results in lower system CPU use, and a lower major fault rate, as > actively used entries from smaller caches get reclaimed less aggressively, > and need to be reloaded/recreated less often. 
> > [akpm@linux-foundation.org: whitespace fixes, per Roman] > [riel@surriel.com: couple of fixes] > Link: http://lkml.kernel.org/r/20190129142831.6a373403@imladris.surriel.com > Link: http://lkml.kernel.org/r/20190128143535.7767c397@imladris.surriel.com > Fixes: 4b85afbdacd2 ("mm: zero-seek shrinkers") > Fixes: 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects") > Signed-off-by: Rik van Riel <riel@surriel.com> > Tested-by: Chris Mason <clm@fb.com> > Acked-by: Roman Gushchin <guro@fb.com> > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > Cc: Dave Chinner <dchinner@redhat.com> > Cc: Jonathan Lemon <bsd@fb.com> > Cc: Jan Kara <jack@suse.cz> > Cc: <stable@vger.kernel.org> > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > > --- a/include/linux/shrinker.h~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs > +++ a/include/linux/shrinker.h > @@ -65,6 +65,7 @@ struct shrinker { > > long batch; /* reclaim batch size, 0 = default */ > int seeks; /* seeks to recreate an obj */ > + int small_scan; /* accumulate pressure on slabs with few objects */ > unsigned flags; > > /* These are for internal use */ > --- a/mm/vmscan.c~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs > +++ a/mm/vmscan.c > @@ -488,18 +488,30 @@ static unsigned long do_shrink_slab(stru > * them aggressively under memory pressure to keep > * them from causing refetches in the IO caches. > */ > - delta = freeable / 2; > + delta = (freeable + 1) / 2; > } > > /* > * Make sure we apply some minimal pressure on default priority > - * even on small cgroups. Stale objects are not only consuming memory > + * even on small cgroups, by accumulating pressure across multiple > + * slab shrinker runs. Stale objects are not only consuming memory > * by themselves, but can also hold a reference to a dying cgroup, > * preventing it from being reclaimed. A dying cgroup with all > * corresponding structures like per-cpu stats and kmem caches > * can be really big, so it may lead to a significant waste of memory. > */ > - delta = max_t(unsigned long long, delta, min(freeable, batch_size)); > + if (!delta && shrinker->seeks) { > + unsigned long nr_considered; > + > + shrinker->small_scan += freeable; > + nr_considered = shrinker->small_scan >> priority; > + > + delta = 4 * nr_considered; > + do_div(delta, shrinker->seeks); > + > + if (delta) > + shrinker->small_scan -= nr_considered << priority; > + } > > total_scan += delta; > if (total_scan < 0) { > _ >
On Fri 08-02-19 10:55:07, Jan Kara wrote:
> On Thu 07-02-19 21:37:27, Andrew Morton wrote:
> > On Thu, 7 Feb 2019 11:27:50 +0100 Jan Kara <jack@suse.cz> wrote:
> >
> > > On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> > > > Maybe for memcgs, but that's exactly the oppose of what we want to
> > > > do for global caches (e.g. filesystem metadata caches). We need to
> > > > make sure that a single, heavily pressured cache doesn't evict small
> > > > caches that lower pressure but are equally important for
> > > > performance.
> > > >
> > > > e.g. I've noticed recently a significant increase in RMW cycles in
> > > > XFS inode cache writeback during various benchmarks. It hasn't
> > > > affected performance because the machine has IO and CPU to burn, but
> > > > on slower machines and storage, it will have a major impact.
> > >
> > > Just as a data point, our performance testing infrastructure has bisected
> > > down to the commits discussed in this thread as the cause of about 40%
> > > regression in XFS file delete performance in bonnie++ benchmark.
> >
> > Has anyone done significant testing with Rik's maybe-fix?
>
> I will give it a spin with bonnie++ today. We'll see what comes out.

OK, I did a bonnie++ run with Rik's patch (on top of 4.20 to rule out other
differences). This machine does not show so big differences in bonnie++
numbers but the difference is still clearly visible. The results are
(averages of 5 runs):

                        Revert              Base                Rik
SeqCreate del     78.04 (  0.00%)    98.18 (-25.81%)    90.90 (-16.48%)
RandCreate del    87.68 (  0.00%)    95.01 ( -8.36%)    87.66 (  0.03%)

'Revert' is 4.20 with "mm: don't reclaim inodes with many attached pages"
and "mm: slowly shrink slabs with a relatively small number of objects"
reverted. 'Base' is the kernel without any reverts. 'Rik' is a 4.20 with
Rik's patch applied.

The numbers are time to do a batch of deletes so lower is better. You can see
that the patch did help somewhat but it was not enough to close the gap
when files are deleted in 'readdir' order.

								Honza

> > From: Rik van Riel <riel@surriel.com>
> > Subject: mm, slab, vmscan: accumulate gradual pressure on small slabs
> >
> > There are a few issues with the way the number of slab objects to scan is
> > calculated in do_shrink_slab. First, for zero-seek slabs, we could leave
> > the last object around forever. That could result in pinning a dying
> > cgroup into memory, instead of reclaiming it. The fix for that is
> > trivial.
> >
> > Secondly, small slabs receive much more pressure, relative to their size,
> > than larger slabs, due to "rounding up" the minimum number of scanned
> > objects to batch_size.
> >
> > We can keep the pressure on all slabs equal relative to their size by
> > accumulating the scan pressure on small slabs over time, resulting in
> > sometimes scanning an object, instead of always scanning several.
> >
> > This results in lower system CPU use, and a lower major fault rate, as
> > actively used entries from smaller caches get reclaimed less aggressively,
> > and need to be reloaded/recreated less often.
> > > > [akpm@linux-foundation.org: whitespace fixes, per Roman] > > [riel@surriel.com: couple of fixes] > > Link: http://lkml.kernel.org/r/20190129142831.6a373403@imladris.surriel.com > > Link: http://lkml.kernel.org/r/20190128143535.7767c397@imladris.surriel.com > > Fixes: 4b85afbdacd2 ("mm: zero-seek shrinkers") > > Fixes: 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects") > > Signed-off-by: Rik van Riel <riel@surriel.com> > > Tested-by: Chris Mason <clm@fb.com> > > Acked-by: Roman Gushchin <guro@fb.com> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > Cc: Dave Chinner <dchinner@redhat.com> > > Cc: Jonathan Lemon <bsd@fb.com> > > Cc: Jan Kara <jack@suse.cz> > > Cc: <stable@vger.kernel.org> > > > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > > --- > > > > > > --- a/include/linux/shrinker.h~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs > > +++ a/include/linux/shrinker.h > > @@ -65,6 +65,7 @@ struct shrinker { > > > > long batch; /* reclaim batch size, 0 = default */ > > int seeks; /* seeks to recreate an obj */ > > + int small_scan; /* accumulate pressure on slabs with few objects */ > > unsigned flags; > > > > /* These are for internal use */ > > --- a/mm/vmscan.c~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs > > +++ a/mm/vmscan.c > > @@ -488,18 +488,30 @@ static unsigned long do_shrink_slab(stru > > * them aggressively under memory pressure to keep > > * them from causing refetches in the IO caches. > > */ > > - delta = freeable / 2; > > + delta = (freeable + 1) / 2; > > } > > > > /* > > * Make sure we apply some minimal pressure on default priority > > - * even on small cgroups. Stale objects are not only consuming memory > > + * even on small cgroups, by accumulating pressure across multiple > > + * slab shrinker runs. Stale objects are not only consuming memory > > * by themselves, but can also hold a reference to a dying cgroup, > > * preventing it from being reclaimed. A dying cgroup with all > > * corresponding structures like per-cpu stats and kmem caches > > * can be really big, so it may lead to a significant waste of memory. > > */ > > - delta = max_t(unsigned long long, delta, min(freeable, batch_size)); > > + if (!delta && shrinker->seeks) { > > + unsigned long nr_considered; > > + > > + shrinker->small_scan += freeable; > > + nr_considered = shrinker->small_scan >> priority; > > + > > + delta = 4 * nr_considered; > > + do_div(delta, shrinker->seeks); > > + > > + if (delta) > > + shrinker->small_scan -= nr_considered << priority; > > + } > > > > total_scan += delta; > > if (total_scan < 0) { > > _ > > > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Thu, Feb 07, 2019 at 09:37:27PM -0800, Andrew Morton wrote: > On Thu, 7 Feb 2019 11:27:50 +0100 Jan Kara <jack@suse.cz> wrote: > > > On Fri 01-02-19 09:19:04, Dave Chinner wrote: > > > Maybe for memcgs, but that's exactly the oppose of what we want to > > > do for global caches (e.g. filesystem metadata caches). We need to > > > make sure that a single, heavily pressured cache doesn't evict small > > > caches that lower pressure but are equally important for > > > performance. > > > > > > e.g. I've noticed recently a significant increase in RMW cycles in > > > XFS inode cache writeback during various benchmarks. It hasn't > > > affected performance because the machine has IO and CPU to burn, but > > > on slower machines and storage, it will have a major impact. > > > > Just as a data point, our performance testing infrastructure has bisected > > down to the commits discussed in this thread as the cause of about 40% > > regression in XFS file delete performance in bonnie++ benchmark. > > > > Has anyone done significant testing with Rik's maybe-fix? Apart from pointing out all the bugs and incorrect algorithmic assumptions it makes, no. Cheers, Dave.
On Fri, 8 Feb 2019 13:50:49 +0100 Jan Kara <jack@suse.cz> wrote: > > > Has anyone done significant testing with Rik's maybe-fix? > > > > I will give it a spin with bonnie++ today. We'll see what comes out. > > OK, I did a bonnie++ run with Rik's patch (on top of 4.20 to rule out other > differences). This machine does not show so big differences in bonnie++ > numbers but the difference is still clearly visible. The results are > (averages of 5 runs): > > Revert Base Rik > SeqCreate del 78.04 ( 0.00%) 98.18 ( -25.81%) 90.90 ( -16.48%) > RandCreate del 87.68 ( 0.00%) 95.01 ( -8.36%) 87.66 ( 0.03%) > > 'Revert' is 4.20 with "mm: don't reclaim inodes with many attached pages" > and "mm: slowly shrink slabs with a relatively small number of objects" > reverted. 'Base' is the kernel without any reverts. 'Rik' is a 4.20 with > Rik's patch applied. > > The numbers are time to do a batch of deletes so lower is better. You can see > that the patch did help somewhat but it was not enough to close the gap > when files are deleted in 'readdir' order. OK, thanks. I guess we need a rethink on Roman's fixes. I'll queued the reverts. BTW, one thing I don't think has been discussed (or noticed) is the effect of "mm: don't reclaim inodes with many attached pages" on 32-bit highmem machines. Look why someone added that code in the first place: : commit f9a316fa9099053a299851762aedbf12881cff42 : Author: Andrew Morton <akpm@digeo.com> : Date: Thu Oct 31 04:09:37 2002 -0800 : : [PATCH] strip pagecache from to-be-reaped inodes : : With large highmem machines and many small cached files it is possible : to encounter ZONE_NORMAL allocation failures. This can be demonstrated : with a large number of one-byte files on a 7G machine. : : All lowmem is filled with icache and all those inodes have a small : amount of highmem pagecache which makes them unfreeable. : : The patch strips the pagecache from inodes as they come off the tail of : the inode_unused list. : : I play tricks in there peeking at the head of the inode_unused list to : pick up the inode again after running iput(). The alternatives seemed : to involve more widespread changes. : : Or running invalidate_inode_pages() under inode_lock which would be a : bad thing from a scheduling latency and lock contention point of view. I guess I shold have added a comment. Doh.
On Fri, Feb 08, 2019 at 02:49:44PM -0800, Andrew Morton wrote: > On Fri, 8 Feb 2019 13:50:49 +0100 Jan Kara <jack@suse.cz> wrote: > > > > > Has anyone done significant testing with Rik's maybe-fix? > > > > > > I will give it a spin with bonnie++ today. We'll see what comes out. > > > > OK, I did a bonnie++ run with Rik's patch (on top of 4.20 to rule out other > > differences). This machine does not show so big differences in bonnie++ > > numbers but the difference is still clearly visible. The results are > > (averages of 5 runs): > > > > Revert Base Rik > > SeqCreate del 78.04 ( 0.00%) 98.18 ( -25.81%) 90.90 ( -16.48%) > > RandCreate del 87.68 ( 0.00%) 95.01 ( -8.36%) 87.66 ( 0.03%) > > > > 'Revert' is 4.20 with "mm: don't reclaim inodes with many attached pages" > > and "mm: slowly shrink slabs with a relatively small number of objects" > > reverted. 'Base' is the kernel without any reverts. 'Rik' is a 4.20 with > > Rik's patch applied. > > > > The numbers are time to do a batch of deletes so lower is better. You can see > > that the patch did help somewhat but it was not enough to close the gap > > when files are deleted in 'readdir' order. > > OK, thanks. > > I guess we need a rethink on Roman's fixes. I'll queued the reverts. Agree. I still believe that we should cause the machine-wide memory pressure to clean up any remains of dead cgroups, and Rik's patch is a step into the right direction. But we need to make some experiments and probably some code changes here to guarantee that we don't regress on performance. > > > BTW, one thing I don't think has been discussed (or noticed) is the > effect of "mm: don't reclaim inodes with many attached pages" on 32-bit > highmem machines. Look why someone added that code in the first place: > > : commit f9a316fa9099053a299851762aedbf12881cff42 > : Author: Andrew Morton <akpm@digeo.com> > : Date: Thu Oct 31 04:09:37 2002 -0800 > : > : [PATCH] strip pagecache from to-be-reaped inodes > : > : With large highmem machines and many small cached files it is possible > : to encounter ZONE_NORMAL allocation failures. This can be demonstrated > : with a large number of one-byte files on a 7G machine. > : > : All lowmem is filled with icache and all those inodes have a small > : amount of highmem pagecache which makes them unfreeable. > : > : The patch strips the pagecache from inodes as they come off the tail of > : the inode_unused list. > : > : I play tricks in there peeking at the head of the inode_unused list to > : pick up the inode again after running iput(). The alternatives seemed > : to involve more widespread changes. > : > : Or running invalidate_inode_pages() under inode_lock which would be a > : bad thing from a scheduling latency and lock contention point of view. > > I guess I shold have added a comment. Doh. > It's a very useful link. Thanks!
On Thursday, 7 February 2019 11:27:50, Jan Kara wrote:

> On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> > Maybe for memcgs, but that's exactly the oppose of what we want to
> > do for global caches (e.g. filesystem metadata caches). We need to
> > make sure that a single, heavily pressured cache doesn't evict small
> > caches that lower pressure but are equally important for
> > performance.
> >
> > e.g. I've noticed recently a significant increase in RMW cycles in
> > XFS inode cache writeback during various benchmarks. It hasn't
> > affected performance because the machine has IO and CPU to burn, but
> > on slower machines and storage, it will have a major impact.
>
> Just as a data point, our performance testing infrastructure has bisected
> down to the commits discussed in this thread as the cause of about 40%
> regression in XFS file delete performance in bonnie++ benchmark.

We also bisected a big IO performance problem on an imap server (starting
with 4.19.3) down to

    mm: don't reclaim inodes with many attached pages
    commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0 upstream.

On other servers the filesystems sometimes seem to hang for 10 seconds
and more.

We also see a performance regression compared to 4.14 even with this patch
reverted, but it is much less dramatic.

Now I saw this thread and I'll try to revert
172b06c32b949759fe6313abec514bc4f15014f4 and see if this helps.

Regards,
diff --git a/fs/inode.c b/fs/inode.c
index 0cd47fe0dbe5..73432e64f874 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -730,11 +730,8 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 		return LRU_REMOVED;
 	}
 
-	/*
-	 * Recently referenced inodes and inodes with many attached pages
-	 * get one more pass.
-	 */
-	if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {
+	/* recently referenced inodes get one more pass */
+	if (inode->i_state & I_REFERENCED) {
 		inode->i_state &= ~I_REFERENCED;
 		spin_unlock(&inode->i_lock);
 		return LRU_ROTATE;