
[RFC] mm: don't reclaim inodes with many attached pages

Message ID 20181023164302.20436-1-guro@fb.com (mailing list archive)
State New, archived
Series: [RFC] mm: don't reclaim inodes with many attached pages

Commit Message

Roman Gushchin Oct. 23, 2018, 4:43 p.m. UTC
Spock reported that commit 172b06c32b94 ("mm: slowly shrink slabs
with a relatively small number of objects") leads to a regression on
his setup: periodically the majority of the pagecache is evicted
without an obvious reason, while before the change the amount of free
memory hovered around the watermark.

The reason is that the change mentioned above created some minimal
background pressure on the inode cache. The problem is that when an
inode is chosen for reclaim, all of its attached pagecache pages are
stripped, no matter how many of them there are. So if a huge
multi-gigabyte file is cached in memory, and the goal is to reclaim
only a few slab objects (unused inodes), we can still end up evicting
gigabytes of pagecache at once.

The workload described by Spock has a few large non-mapped files in
the pagecache, so it's especially noticeable.

To solve the problem, let's postpone the reclaim of inodes which have
more than one attached page. Let's wait until the pagecache pages have
been evicted naturally by scanning the corresponding LRU lists, and
only then reclaim the inode structure.

Reported-by: Spock <dairinin@gmail.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/inode.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Andrew Morton Oct. 24, 2018, 10:18 p.m. UTC | #1
On Tue, 23 Oct 2018 16:43:29 +0000 Roman Gushchin <guro@fb.com> wrote:

> Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
> with a relatively small number of objects") leads to a regression on
> his setup: periodically the majority of the pagecache is evicted
> without an obvious reason, while before the change the amount of free
> memory was balancing around the watermark.
> 
> The reason behind is that the mentioned above change created some
> minimal background pressure on the inode cache. The problem is that
> if an inode is considered to be reclaimed, all belonging pagecache
> page are stripped, no matter how many of them are there. So, if a huge
> multi-gigabyte file is cached in the memory, and the goal is to
> reclaim only few slab objects (unused inodes), we still can eventually
> evict all gigabytes of the pagecache at once.
> 
> The workload described by Spock has few large non-mapped files in the
> pagecache, so it's especially noticeable.
> 
> To solve the problem let's postpone the reclaim of inodes, which have
> more than 1 attached page. Let's wait until the pagecache pages will
> be evicted naturally by scanning the corresponding LRU lists, and only
> then reclaim the inode structure.
> 
> ...
>
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -730,8 +730,11 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>  		return LRU_REMOVED;
>  	}
>  
> -	/* recently referenced inodes get one more pass */
> -	if (inode->i_state & I_REFERENCED) {
> +	/*
> +	 * Recently referenced inodes and inodes with many attached pages
> +	 * get one more pass.
> +	 */
> +	if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {
>  		inode->i_state &= ~I_REFERENCED;
>  		spin_unlock(&inode->i_lock);
>  		return LRU_ROTATE;

hm, why "1"?

I guess one could argue that this will encompass long symlinks, but I
just made that up to make "1" appear more justifiable ;)
Andrew Morton Oct. 24, 2018, 10:19 p.m. UTC | #2
On Tue, 23 Oct 2018 16:43:29 +0000 Roman Gushchin <guro@fb.com> wrote:

> Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
> with a relatively small number of objects") leads to a regression on
> his setup: periodically the majority of the pagecache is evicted
> without an obvious reason, while before the change the amount of free
> memory was balancing around the watermark.
> 
> The reason behind is that the mentioned above change created some
> minimal background pressure on the inode cache. The problem is that
> if an inode is considered to be reclaimed, all belonging pagecache
> page are stripped, no matter how many of them are there. So, if a huge
> multi-gigabyte file is cached in the memory, and the goal is to
> reclaim only few slab objects (unused inodes), we still can eventually
> evict all gigabytes of the pagecache at once.
> 
> The workload described by Spock has few large non-mapped files in the
> pagecache, so it's especially noticeable.
> 
> To solve the problem let's postpone the reclaim of inodes, which have
> more than 1 attached page. Let's wait until the pagecache pages will
> be evicted naturally by scanning the corresponding LRU lists, and only
> then reclaim the inode structure.

Is this regression serious enough to warrant fixing 4.19.1?
Roman Gushchin Oct. 24, 2018, 11:49 p.m. UTC | #3
On Wed, Oct 24, 2018 at 03:18:53PM -0700, Andrew Morton wrote:
> On Tue, 23 Oct 2018 16:43:29 +0000 Roman Gushchin <guro@fb.com> wrote:
> 
> > Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
> > with a relatively small number of objects") leads to a regression on
> > his setup: periodically the majority of the pagecache is evicted
> > without an obvious reason, while before the change the amount of free
> > memory was balancing around the watermark.
> > 
> > The reason behind is that the mentioned above change created some
> > minimal background pressure on the inode cache. The problem is that
> > if an inode is considered to be reclaimed, all belonging pagecache
> > page are stripped, no matter how many of them are there. So, if a huge
> > multi-gigabyte file is cached in the memory, and the goal is to
> > reclaim only few slab objects (unused inodes), we still can eventually
> > evict all gigabytes of the pagecache at once.
> > 
> > The workload described by Spock has few large non-mapped files in the
> > pagecache, so it's especially noticeable.
> > 
> > To solve the problem let's postpone the reclaim of inodes, which have
> > more than 1 attached page. Let's wait until the pagecache pages will
> > be evicted naturally by scanning the corresponding LRU lists, and only
> > then reclaim the inode structure.
> > 
> > ...
> >
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -730,8 +730,11 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> >  		return LRU_REMOVED;
> >  	}
> >  
> > -	/* recently referenced inodes get one more pass */
> > -	if (inode->i_state & I_REFERENCED) {
> > +	/*
> > +	 * Recently referenced inodes and inodes with many attached pages
> > +	 * get one more pass.
> > +	 */
> > +	if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {
> >  		inode->i_state &= ~I_REFERENCED;
> >  		spin_unlock(&inode->i_lock);
> >  		return LRU_ROTATE;
> 
> hm, why "1"?
> 
> I guess one could argue that this will encompass long symlinks, but I
> just made that up to make "1" appear more justifiable ;) 
> 

Well, I'm slightly wary of introducing an inode leak here, so I was thinking
about some small number of pages. It definitely makes no sense to reclaim
several GB of pagecache, but throwing away a couple of pages to speed up
inode reuse is totally fine.
But then I realized that I don't have any justification for a number like
4 or 32, so I ended up with 1. I'm pretty open here, but not sure that
switching to 0 is much better.

Thanks!
Roman Gushchin Oct. 24, 2018, 11:51 p.m. UTC | #4
On Wed, Oct 24, 2018 at 03:19:50PM -0700, Andrew Morton wrote:
> On Tue, 23 Oct 2018 16:43:29 +0000 Roman Gushchin <guro@fb.com> wrote:
> 
> > Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
> > with a relatively small number of objects") leads to a regression on
> > his setup: periodically the majority of the pagecache is evicted
> > without an obvious reason, while before the change the amount of free
> > memory was balancing around the watermark.
> > 
> > The reason behind is that the mentioned above change created some
> > minimal background pressure on the inode cache. The problem is that
> > if an inode is considered to be reclaimed, all belonging pagecache
> > page are stripped, no matter how many of them are there. So, if a huge
> > multi-gigabyte file is cached in the memory, and the goal is to
> > reclaim only few slab objects (unused inodes), we still can eventually
> > evict all gigabytes of the pagecache at once.
> > 
> > The workload described by Spock has few large non-mapped files in the
> > pagecache, so it's especially noticeable.
> > 
> > To solve the problem let's postpone the reclaim of inodes, which have
> > more than 1 attached page. Let's wait until the pagecache pages will
> > be evicted naturally by scanning the corresponding LRU lists, and only
> > then reclaim the inode structure.
> 
> Is this regression serious enough to warrant fixing 4.19.1?

I'd give it some testing in the mm tree (and I'll test it myself
on our fleet), and then backport it to 4.19.x.

Thanks!
Michal Hocko Oct. 25, 2018, 9:23 a.m. UTC | #5
On Wed 24-10-18 15:19:50, Andrew Morton wrote:
> On Tue, 23 Oct 2018 16:43:29 +0000 Roman Gushchin <guro@fb.com> wrote:
> 
> > Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
> > with a relatively small number of objects") leads to a regression on
> > his setup: periodically the majority of the pagecache is evicted
> > without an obvious reason, while before the change the amount of free
> > memory was balancing around the watermark.
> > 
> > The reason behind is that the mentioned above change created some
> > minimal background pressure on the inode cache. The problem is that
> > if an inode is considered to be reclaimed, all belonging pagecache
> > page are stripped, no matter how many of them are there. So, if a huge
> > multi-gigabyte file is cached in the memory, and the goal is to
> > reclaim only few slab objects (unused inodes), we still can eventually
> > evict all gigabytes of the pagecache at once.
> > 
> > The workload described by Spock has few large non-mapped files in the
> > pagecache, so it's especially noticeable.
> > 
> > To solve the problem let's postpone the reclaim of inodes, which have
> > more than 1 attached page. Let's wait until the pagecache pages will
> > be evicted naturally by scanning the corresponding LRU lists, and only
> > then reclaim the inode structure.
> 
> Is this regression serious enough to warrant fixing 4.19.1?

Let's not forget about stable tree(s) which backported 172b06c32b94. I
would suggest reverting there.
Andrew Morton Oct. 25, 2018, 7:44 p.m. UTC | #6
On Thu, 25 Oct 2018 11:23:52 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> On Wed 24-10-18 15:19:50, Andrew Morton wrote:
> > On Tue, 23 Oct 2018 16:43:29 +0000 Roman Gushchin <guro@fb.com> wrote:
> > 
> > > Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
> > > with a relatively small number of objects") leads to a regression on
> > > his setup: periodically the majority of the pagecache is evicted
> > > without an obvious reason, while before the change the amount of free
> > > memory was balancing around the watermark.
> > > 
> > > The reason behind is that the mentioned above change created some
> > > minimal background pressure on the inode cache. The problem is that
> > > if an inode is considered to be reclaimed, all belonging pagecache
> > > page are stripped, no matter how many of them are there. So, if a huge
> > > multi-gigabyte file is cached in the memory, and the goal is to
> > > reclaim only few slab objects (unused inodes), we still can eventually
> > > evict all gigabytes of the pagecache at once.
> > > 
> > > The workload described by Spock has few large non-mapped files in the
> > > pagecache, so it's especially noticeable.
> > > 
> > > To solve the problem let's postpone the reclaim of inodes, which have
> > > more than 1 attached page. Let's wait until the pagecache pages will
> > > be evicted naturally by scanning the corresponding LRU lists, and only
> > > then reclaim the inode structure.
> > 
> > Is this regression serious enough to warrant fixing 4.19.1?
> 
> Let's not forget about stable tree(s) which backported 172b06c32b94. I
> would suggest reverting there.

Yup.  Sasha, can you please take care of this?
Sasha Levin Oct. 25, 2018, 8:20 p.m. UTC | #7
On Thu, Oct 25, 2018 at 12:44:42PM -0700, Andrew Morton wrote:
>On Thu, 25 Oct 2018 11:23:52 +0200 Michal Hocko <mhocko@kernel.org> wrote:
>
>> On Wed 24-10-18 15:19:50, Andrew Morton wrote:
>> > On Tue, 23 Oct 2018 16:43:29 +0000 Roman Gushchin <guro@fb.com> wrote:
>> >
>> > > Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
>> > > with a relatively small number of objects") leads to a regression on
>> > > his setup: periodically the majority of the pagecache is evicted
>> > > without an obvious reason, while before the change the amount of free
>> > > memory was balancing around the watermark.
>> > >
>> > > The reason behind is that the mentioned above change created some
>> > > minimal background pressure on the inode cache. The problem is that
>> > > if an inode is considered to be reclaimed, all belonging pagecache
>> > > page are stripped, no matter how many of them are there. So, if a huge
>> > > multi-gigabyte file is cached in the memory, and the goal is to
>> > > reclaim only few slab objects (unused inodes), we still can eventually
>> > > evict all gigabytes of the pagecache at once.
>> > >
>> > > The workload described by Spock has few large non-mapped files in the
>> > > pagecache, so it's especially noticeable.
>> > >
>> > > To solve the problem let's postpone the reclaim of inodes, which have
>> > > more than 1 attached page. Let's wait until the pagecache pages will
>> > > be evicted naturally by scanning the corresponding LRU lists, and only
>> > > then reclaim the inode structure.
>> >
>> > Is this regression serious enough to warrant fixing 4.19.1?
>>
>> Let's not forget about stable tree(s) which backported 172b06c32b94. I
>> would suggest reverting there.
>
>Yup.  Sasha, can you please take care of this?

Sure, I'll revert it from current stable trees.

Should 172b06c32b94 and this commit be backported once Roman confirms
the issue is fixed? As far as I understand 172b06c32b94 addressed an
issue FB were seeing in their fleet and needed to be fixed.


--
Thanks,
Sasha
Matthew Wilcox Oct. 25, 2018, 8:27 p.m. UTC | #8
On Thu, Oct 25, 2018 at 04:20:14PM -0400, Sasha Levin wrote:
> On Thu, Oct 25, 2018 at 12:44:42PM -0700, Andrew Morton wrote:
> > Yup.  Sasha, can you please take care of this?
> 
> Sure, I'll revert it from current stable trees.
> 
> Should 172b06c32b94 and this commit be backported once Roman confirms
> the issue is fixed? As far as I understand 172b06c32b94 addressed an
> issue FB were seeing in their fleet and needed to be fixed.

I'm not sure I see "FB sees an issue in their fleet" and "needs to be
fixed in stable kernels" as related.  FB's workload is different from
most people's workloads and FB has a large and highly-skilled team of
kernel engineers.  Obviously I want this problem fixed in mainline,
but I don't know that most people benefit from having it fixed in stable.
Roman Gushchin Oct. 25, 2018, 8:32 p.m. UTC | #9
On Thu, Oct 25, 2018 at 04:20:14PM -0400, Sasha Levin wrote:
> On Thu, Oct 25, 2018 at 12:44:42PM -0700, Andrew Morton wrote:
> > On Thu, 25 Oct 2018 11:23:52 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > On Wed 24-10-18 15:19:50, Andrew Morton wrote:
> > > > On Tue, 23 Oct 2018 16:43:29 +0000 Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > > Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
> > > > > with a relatively small number of objects") leads to a regression on
> > > > > his setup: periodically the majority of the pagecache is evicted
> > > > > without an obvious reason, while before the change the amount of free
> > > > > memory was balancing around the watermark.
> > > > >
> > > > > The reason behind is that the mentioned above change created some
> > > > > minimal background pressure on the inode cache. The problem is that
> > > > > if an inode is considered to be reclaimed, all belonging pagecache
> > > > > page are stripped, no matter how many of them are there. So, if a huge
> > > > > multi-gigabyte file is cached in the memory, and the goal is to
> > > > > reclaim only few slab objects (unused inodes), we still can eventually
> > > > > evict all gigabytes of the pagecache at once.
> > > > >
> > > > > The workload described by Spock has few large non-mapped files in the
> > > > > pagecache, so it's especially noticeable.
> > > > >
> > > > > To solve the problem let's postpone the reclaim of inodes, which have
> > > > > more than 1 attached page. Let's wait until the pagecache pages will
> > > > > be evicted naturally by scanning the corresponding LRU lists, and only
> > > > > then reclaim the inode structure.
> > > >
> > > > Is this regression serious enough to warrant fixing 4.19.1?
> > > 
> > > Let's not forget about stable tree(s) which backported 172b06c32b94. I
> > > would suggest reverting there.
> > 
> > Yup.  Sasha, can you please take care of this?
> 
> Sure, I'll revert it from current stable trees.
> 
> Should 172b06c32b94 and this commit be backported once Roman confirms
> the issue is fixed? As far as I understand 172b06c32b94 addressed an
> issue FB were seeing in their fleet and needed to be fixed.

The memcg leak was also independently reported by several companies,
so it's not only about our fleet.

The memcg css leak is fixed by a series of commits (as in the mm tree):
  37e521912118 math64: prevent double calculation of DIV64_U64_ROUND_UP() arguments
  c6be4e82b1b3 mm: don't miss the last page because of round-off error
  f2e821fc8c63 mm: drain memcg stocks on css offlining
  03a971b56f18 mm: rework memcg kernel stack accounting
  172b06c32b94 mm: slowly shrink slabs with a relatively small number of objects

The last one by itself isn't enough, and it makes no sense to backport it
without all the other patches. So I'd either backport them all (including
47036ad4032e ("mm: don't reclaim inodes with many attached pages")),
or just revert 172b06c32b94.

Also 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
by itself is fine, but it reveals an independent issue in inode reclaim code,
which 47036ad4032e ("mm: don't reclaim inodes with many attached pages") aims to fix.

Thanks!
Sasha Levin Oct. 25, 2018, 9:44 p.m. UTC | #10
On Thu, Oct 25, 2018 at 01:27:07PM -0700, Matthew Wilcox wrote:
>On Thu, Oct 25, 2018 at 04:20:14PM -0400, Sasha Levin wrote:
>> On Thu, Oct 25, 2018 at 12:44:42PM -0700, Andrew Morton wrote:
>> > Yup.  Sasha, can you please take care of this?
>>
>> Sure, I'll revert it from current stable trees.
>>
>> Should 172b06c32b94 and this commit be backported once Roman confirms
>> the issue is fixed? As far as I understand 172b06c32b94 addressed an
>> issue FB were seeing in their fleet and needed to be fixed.
>
>I'm not sure I see "FB sees an issue in their fleet" and "needs to be
>fixed in stable kernels" as related.  FB's workload is different from
>most people's workloads and FB has a large and highly-skilled team of
>kernel engineers.  Obviously I want this problem fixed in mainline,
>but I don't know that most people benefit from having it fixed in stable.

I don't want to make backporting decisions based on how big a certain
company's kernel team is. I only mentioned FB explicitly to suggest that
this issue is seen on real-life scenarios rather than on synthetic tests
or code review.

So yes, let's not run to fix it just because it's FB but also let's not
ignore it because FB has a world-class kernel team. This should be done
purely based on how likely this patch will regress stable kernels vs the
severity of the bug it fixes.

--
Thanks,
Sasha
Michal Hocko Oct. 26, 2018, 7:33 a.m. UTC | #11
On Thu 25-10-18 20:32:47, Roman Gushchin wrote:
> On Thu, Oct 25, 2018 at 04:20:14PM -0400, Sasha Levin wrote:
> > On Thu, Oct 25, 2018 at 12:44:42PM -0700, Andrew Morton wrote:
> > > On Thu, 25 Oct 2018 11:23:52 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> > > 
> > > > On Wed 24-10-18 15:19:50, Andrew Morton wrote:
> > > > > On Tue, 23 Oct 2018 16:43:29 +0000 Roman Gushchin <guro@fb.com> wrote:
> > > > >
> > > > > > Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
> > > > > > with a relatively small number of objects") leads to a regression on
> > > > > > his setup: periodically the majority of the pagecache is evicted
> > > > > > without an obvious reason, while before the change the amount of free
> > > > > > memory was balancing around the watermark.
> > > > > >
> > > > > > The reason behind is that the mentioned above change created some
> > > > > > minimal background pressure on the inode cache. The problem is that
> > > > > > if an inode is considered to be reclaimed, all belonging pagecache
> > > > > > page are stripped, no matter how many of them are there. So, if a huge
> > > > > > multi-gigabyte file is cached in the memory, and the goal is to
> > > > > > reclaim only few slab objects (unused inodes), we still can eventually
> > > > > > evict all gigabytes of the pagecache at once.
> > > > > >
> > > > > > The workload described by Spock has few large non-mapped files in the
> > > > > > pagecache, so it's especially noticeable.
> > > > > >
> > > > > > To solve the problem let's postpone the reclaim of inodes, which have
> > > > > > more than 1 attached page. Let's wait until the pagecache pages will
> > > > > > be evicted naturally by scanning the corresponding LRU lists, and only
> > > > > > then reclaim the inode structure.
> > > > >
> > > > > Is this regression serious enough to warrant fixing 4.19.1?
> > > > 
> > > > Let's not forget about stable tree(s) which backported 172b06c32b94. I
> > > > would suggest reverting there.
> > > 
> > > Yup.  Sasha, can you please take care of this?
> > 
> > Sure, I'll revert it from current stable trees.
> > 
> > Should 172b06c32b94 and this commit be backported once Roman confirms
> > the issue is fixed? As far as I understand 172b06c32b94 addressed an
> > issue FB were seeing in their fleet and needed to be fixed.
> 
> The memcg leak was also independently reported by several companies,
> so it's not only about our fleet.

By memcg leak you mean a lot of dead memcgs with a small amount of memory
which stay behind, and which global memory pressure removes only very
slowly or almost not at all, right?

I have a vague recollection that systemd can trigger a pattern which
makes this "leak" noticeable. Is that right? If yes what would be a
minimal and safe fix for the stable tree? "mm: don't miss the last page
because of round-off error" would sound like the candidate but I never
got around to review it properly.

> The memcg css leak is fixed by a series of commits (as in the mm tree):
>   37e521912118 math64: prevent double calculation of DIV64_U64_ROUND_UP() arguments
>   c6be4e82b1b3 mm: don't miss the last page because of round-off error
>   f2e821fc8c63 mm: drain memcg stocks on css offlining
>   03a971b56f18 mm: rework memcg kernel stack accounting

Btw, none of these SHAs refer to anything in my git tree. They all
seem to be in the -next tree though.

>   172b06c32b94 mm: slowly shrink slabs with a relatively small number of objects
> 
> The last one by itself isn't enough, and it makes no sense to backport it
> without all other patches. So, I'd either backport them all (including
> 47036ad4032e ("mm: don't reclaim inodes with many attached pages"),
> either just revert 172b06c32b94.
> 
> Also 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
> by itself is fine, but it reveals an independent issue in inode reclaim code,
> which 47036ad4032e ("mm: don't reclaim inodes with many attached pages") aims to fix.

To me it sounds like it needs much more time to settle before it can be
considered safe for the stable tree. Even if the patch itself is correct,
it seems too subtle and reveals behavior which was not anticipated, and
that just proves it is far from straightforward.
Michal Hocko Oct. 26, 2018, 8:57 a.m. UTC | #12
Spock doesn't seem to be cced here - fixed now

On Tue 23-10-18 16:43:29, Roman Gushchin wrote:
> Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
> with a relatively small number of objects") leads to a regression on
> his setup: periodically the majority of the pagecache is evicted
> without an obvious reason, while before the change the amount of free
> memory was balancing around the watermark.
> 
> The reason behind is that the mentioned above change created some
> minimal background pressure on the inode cache. The problem is that
> if an inode is considered to be reclaimed, all belonging pagecache
> page are stripped, no matter how many of them are there. So, if a huge
> multi-gigabyte file is cached in the memory, and the goal is to
> reclaim only few slab objects (unused inodes), we still can eventually
> evict all gigabytes of the pagecache at once.
> 
> The workload described by Spock has few large non-mapped files in the
> pagecache, so it's especially noticeable.
> 
> To solve the problem let's postpone the reclaim of inodes, which have
> more than 1 attached page. Let's wait until the pagecache pages will
> be evicted naturally by scanning the corresponding LRU lists, and only
> then reclaim the inode structure.

Has this actually fixed/worked around the issue?

> Reported-by: Spock <dairinin@gmail.com>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  fs/inode.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 73432e64f874..0cd47fe0dbe5 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -730,8 +730,11 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>  		return LRU_REMOVED;
>  	}
>  
> -	/* recently referenced inodes get one more pass */
> -	if (inode->i_state & I_REFERENCED) {
> +	/*
> +	 * Recently referenced inodes and inodes with many attached pages
> +	 * get one more pass.
> +	 */
> +	if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {

The comment is just confusing. Did you mean to say s@many@any@ ?

>  		inode->i_state &= ~I_REFERENCED;
>  		spin_unlock(&inode->i_lock);
>  		return LRU_ROTATE;
> -- 
> 2.17.2
Roman Gushchin Oct. 26, 2018, 3:54 p.m. UTC | #13
On Fri, Oct 26, 2018 at 09:33:03AM +0200, Michal Hocko wrote:
> On Thu 25-10-18 20:32:47, Roman Gushchin wrote:
> > On Thu, Oct 25, 2018 at 04:20:14PM -0400, Sasha Levin wrote:
> > > On Thu, Oct 25, 2018 at 12:44:42PM -0700, Andrew Morton wrote:
> > > > On Thu, 25 Oct 2018 11:23:52 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> > > > 
> > > > > On Wed 24-10-18 15:19:50, Andrew Morton wrote:
> > > > > > On Tue, 23 Oct 2018 16:43:29 +0000 Roman Gushchin <guro@fb.com> wrote:
> > > > > >
> > > > > > > Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
> > > > > > > with a relatively small number of objects") leads to a regression on
> > > > > > > his setup: periodically the majority of the pagecache is evicted
> > > > > > > without an obvious reason, while before the change the amount of free
> > > > > > > memory was balancing around the watermark.
> > > > > > >
> > > > > > > The reason behind is that the mentioned above change created some
> > > > > > > minimal background pressure on the inode cache. The problem is that
> > > > > > > if an inode is considered to be reclaimed, all belonging pagecache
> > > > > > > page are stripped, no matter how many of them are there. So, if a huge
> > > > > > > multi-gigabyte file is cached in the memory, and the goal is to
> > > > > > > reclaim only few slab objects (unused inodes), we still can eventually
> > > > > > > evict all gigabytes of the pagecache at once.
> > > > > > >
> > > > > > > The workload described by Spock has few large non-mapped files in the
> > > > > > > pagecache, so it's especially noticeable.
> > > > > > >
> > > > > > > To solve the problem let's postpone the reclaim of inodes, which have
> > > > > > > more than 1 attached page. Let's wait until the pagecache pages will
> > > > > > > be evicted naturally by scanning the corresponding LRU lists, and only
> > > > > > > then reclaim the inode structure.
> > > > > >
> > > > > > Is this regression serious enough to warrant fixing 4.19.1?
> > > > > 
> > > > > Let's not forget about stable tree(s) which backported 172b06c32b94. I
> > > > > would suggest reverting there.
> > > > 
> > > > Yup.  Sasha, can you please take care of this?
> > > 
> > > Sure, I'll revert it from current stable trees.
> > > 
> > > Should 172b06c32b94 and this commit be backported once Roman confirms
> > > the issue is fixed? As far as I understand 172b06c32b94 addressed an
> > > issue FB were seeing in their fleet and needed to be fixed.
> > 
> > The memcg leak was also independently reported by several companies,
> > so it's not only about our fleet.
> 
> By memcg leak you mean a lot of dead memcgs with small amount of memory
> which are staying behind and the global memory pressure removes them
> only very slowly or almost not at all, right?

Right.

> 
> I have avague recollection that systemd can trigger a pattern which
> makes this "leak" noticeable. Is that right? If yes what would be a
> minimal and safe fix for the stable tree? "mm: don't miss the last page
> because of round-off error" would sound like the candidate but I never
> got around to review it properly.

Yes, systemd can create and destroy a ton of cgroups under some circumstances,
but there is nothing systemd-specific here. It's quite typical to run services
in new cgroups, so over time the number of dying cgroups tends to grow.

I've listed all the necessary patches; it's the required set (except the last
patch, which has to be squashed). f2e821fc8c63 can probably be skipped, but I
haven't tested without it, and it's the most straightforward patch of the set.

Daniel McGinnes has reported the same issue in the cgroups@ mailing list,
and he confirmed that this patchset solved the problem for him.

> > The memcg css leak is fixed by a series of commits (as in the mm tree):
> >   37e521912118 math64: prevent double calculation of DIV64_U64_ROUND_UP() arguments
> >   c6be4e82b1b3 mm: don't miss the last page because of round-off error
> >   f2e821fc8c63 mm: drain memcg stocks on css offlining
> >   03a971b56f18 mm: rework memcg kernel stack accounting
> 
> btw. none of these sha are refering to anything in my git tree. They all
> seem to be in the next tree though.

Yeah, they all are in the mm tree, and hashes are from Johannes's git.

> 
> >   172b06c32b94 mm: slowly shrink slabs with a relatively small number of objects
> > 
> > The last one by itself isn't enough, and it makes no sense to backport it
> > without all other patches. So, I'd either backport them all (including
> > 47036ad4032e ("mm: don't reclaim inodes with many attached pages"),
> > either just revert 172b06c32b94.
> > 
> > Also 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
> > by itself is fine, but it reveals an independent issue in inode reclaim code,
> > which 47036ad4032e ("mm: don't reclaim inodes with many attached pages") aims to fix.
> 
> To me it sounds it needs much more time to settle before it can be
> considered safe for the stable tree. Even if the patch itself is correct
> it seems too subtle and reveal a behavior which was not anticipated and
> that just proves it is far from straightforward.

Absolutely. I'm not pushing this to stable at all; that single patch
was an accident.

Thanks!
Roman Gushchin Oct. 26, 2018, 3:56 p.m. UTC | #14
On Fri, Oct 26, 2018 at 10:57:35AM +0200, Michal Hocko wrote:
> Spock doesn't seem to be cced here - fixed now
> 
> On Tue 23-10-18 16:43:29, Roman Gushchin wrote:
> > Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
> > with a relatively small number of objects") leads to a regression on
> > his setup: periodically the majority of the pagecache is evicted
> > without an obvious reason, while before the change the amount of free
> > memory was balancing around the watermark.
> > 
> > The reason behind is that the mentioned above change created some
> > minimal background pressure on the inode cache. The problem is that
> > if an inode is considered to be reclaimed, all belonging pagecache
> > page are stripped, no matter how many of them are there. So, if a huge
> > multi-gigabyte file is cached in the memory, and the goal is to
> > reclaim only few slab objects (unused inodes), we still can eventually
> > evict all gigabytes of the pagecache at once.
> > 
> > The workload described by Spock has few large non-mapped files in the
> > pagecache, so it's especially noticeable.
> > 
> > To solve the problem let's postpone the reclaim of inodes, which have
> > more than 1 attached page. Let's wait until the pagecache pages will
> > be evicted naturally by scanning the corresponding LRU lists, and only
> > then reclaim the inode structure.
> 
> Has this actually fixed/worked around the issue?

Spock wrote this earlier to me directly. I believe I can quote it here:

"Patch applied, looks good so far. System behaves like it was with
pre-4.18.15 kernels.
Also tried to add some user-level tests to the generic background activity, like
- stat'ing a bunch of files
- streamed read several large files at once on ext4 and XFS
- random reads on the whole collection with a read size of 16K

I will be monitoring while fragmentation stacks up and report back if
something bad happens."

Spock, please let me know if you have any new results.

Thanks!
Roman Gushchin Oct. 26, 2018, 3:58 p.m. UTC | #15
On Fri, Oct 26, 2018 at 10:57:35AM +0200, Michal Hocko wrote:
> Spock doesn't seem to be cced here - fixed now
> 
> On Tue 23-10-18 16:43:29, Roman Gushchin wrote:
> > Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
> > with a relatively small number of objects") leads to a regression on
> > his setup: periodically the majority of the pagecache is evicted
> > without an obvious reason, while before the change the amount of free
> > memory was balancing around the watermark.
> > 
> > The reason behind is that the mentioned above change created some
> > minimal background pressure on the inode cache. The problem is that
> > if an inode is considered to be reclaimed, all belonging pagecache
> > page are stripped, no matter how many of them are there. So, if a huge
> > multi-gigabyte file is cached in the memory, and the goal is to
> > reclaim only few slab objects (unused inodes), we still can eventually
> > evict all gigabytes of the pagecache at once.
> > 
> > The workload described by Spock has few large non-mapped files in the
> > pagecache, so it's especially noticeable.
> > 
> > To solve the problem let's postpone the reclaim of inodes, which have
> > more than 1 attached page. Let's wait until the pagecache pages will
> > be evicted naturally by scanning the corresponding LRU lists, and only
> > then reclaim the inode structure.
> 
> Has this actually fixed/worked around the issue?
> 
> > Reported-by: Spock <dairinin@gmail.com>
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Rik van Riel <riel@surriel.com>
> > Cc: Randy Dunlap <rdunlap@infradead.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >  fs/inode.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 73432e64f874..0cd47fe0dbe5 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -730,8 +730,11 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> >  		return LRU_REMOVED;
> >  	}
> >  
> > -	/* recently referenced inodes get one more pass */
> > -	if (inode->i_state & I_REFERENCED) {
> > +	/*
> > +	 * Recently referenced inodes and inodes with many attached pages
> > +	 * get one more pass.
> > +	 */
> > +	if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {
> 
> The comment is just confusing. Did you mean to say s@many@any@ ?

No, here many == more than 1.

I'm happy to fix the comment, if you have any suggestions.


Thanks!
Spock Oct. 26, 2018, 5 p.m. UTC | #16
пт, 26 окт. 2018 г. в 18:57, Roman Gushchin <guro@fb.com>:
>
> On Fri, Oct 26, 2018 at 10:57:35AM +0200, Michal Hocko wrote:
> > Spock doesn't seem to be cced here - fixed now
> >
> > On Tue 23-10-18 16:43:29, Roman Gushchin wrote:
> > > Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs
> > > with a relatively small number of objects") leads to a regression on
> > > his setup: periodically the majority of the pagecache is evicted
> > > without an obvious reason, while before the change the amount of free
> > > memory was balancing around the watermark.
> > >
> > > The reason behind is that the mentioned above change created some
> > > minimal background pressure on the inode cache. The problem is that
> > > if an inode is considered to be reclaimed, all belonging pagecache
> > > page are stripped, no matter how many of them are there. So, if a huge
> > > multi-gigabyte file is cached in the memory, and the goal is to
> > > reclaim only few slab objects (unused inodes), we still can eventually
> > > evict all gigabytes of the pagecache at once.
> > >
> > > The workload described by Spock has few large non-mapped files in the
> > > pagecache, so it's especially noticeable.
> > >
> > > To solve the problem let's postpone the reclaim of inodes, which have
> > > more than 1 attached page. Let's wait until the pagecache pages will
> > > be evicted naturally by scanning the corresponding LRU lists, and only
> > > then reclaim the inode structure.
> >
> > Has this actually fixed/worked around the issue?
>
> Spock wrote this earlier to me directly. I believe I can quote it here:
>
> "Patch applied, looks good so far. System behaves like it was with
> pre-4.18.15 kernels.
> Also tried to add some user-level tests to the generic background activity, like
> - stat'ing a bunch of files
> - streamed read several large files at once on ext4 and XFS
> - random reads on the whole collection with a read size of 16K
>
> I will be monitoring while fragmentation stacks up and report back if
> something bad happens."
>
> Spock, please let me know if you have any new results.
>
> Thanks!

Hello,

I'd say the patch fixed the problem, at least with my workload.

MemTotal:        8164968 kB
MemFree:          135852 kB
MemAvailable:    6406088 kB
Buffers:           11988 kB
Cached:          6414124 kB
SwapCached:            0 kB
Active:          1491952 kB
Inactive:        5989576 kB
Active(anon):     542512 kB
Inactive(anon):   523780 kB
Active(file):     949440 kB
Inactive(file):  5465796 kB
Unevictable:        8872 kB
Mlocked:            8872 kB
SwapTotal:       4194300 kB
SwapFree:        4194300 kB
Dirty:               128 kB
Writeback:             0 kB
AnonPages:       1064232 kB
Mapped:            32348 kB
Shmem:              3952 kB
Slab:             205108 kB
SReclaimable:     148792 kB
SUnreclaim:        56316 kB
KernelStack:        3984 kB
PageTables:        11100 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8276784 kB
Committed_AS:    1944792 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
AnonHugePages:      6144 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      271872 kB
DirectMap2M:     8116224 kB

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 73432e64f874..0cd47fe0dbe5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -730,8 +730,11 @@  static enum lru_status inode_lru_isolate(struct list_head *item,
 		return LRU_REMOVED;
 	}
 
-	/* recently referenced inodes get one more pass */
-	if (inode->i_state & I_REFERENCED) {
+	/*
+	 * Recently referenced inodes and inodes with many attached pages
+	 * get one more pass.
+	 */
+	if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {
 		inode->i_state &= ~I_REFERENCED;
 		spin_unlock(&inode->i_lock);
 		return LRU_ROTATE;