diff mbox series

[RESEND,1/2,-mm] mm: account lazy free pages separately

Message ID 1565308665-24747-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
State New, archived
Headers show
Series [RESEND,1/2,-mm] mm: account lazy free pages separately | expand

Commit Message

Yang Shi Aug. 8, 2019, 11:57 p.m. UTC
When doing partial unmap to THP, the pages in the affected range would
be considered to be reclaimable when memory pressure comes in.  And,
such pages would be put on deferred split queue and get minus from the
memory statistics (i.e. /proc/meminfo).

For example, when doing THP split test, /proc/meminfo would show:

Before put on lazy free list:
MemTotal:       45288336 kB
MemFree:        43281376 kB
MemAvailable:   43254048 kB
...
Active(anon):    1096296 kB
Inactive(anon):     8372 kB
...
AnonPages:       1096264 kB
...
AnonHugePages:   1056768 kB

After put on lazy free list:
MemTotal:       45288336 kB
MemFree:        43282612 kB
MemAvailable:   43255284 kB
...
Active(anon):    1094228 kB
Inactive(anon):     8372 kB
...
AnonPages:         49668 kB
...
AnonHugePages:     10240 kB

The THPs confusingly look disappeared although they are still on LRU if
you are not familair the tricks done by kernel.

Accounted the lazy free pages to NR_LAZYFREE, and show them in meminfo
and other places.  With the change the /proc/meminfo would look like:
Before put on lazy free list:
AnonHugePages:   1056768 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
LazyFreePages:         0 kB

After put on lazy free list:
AnonHugePages:     10240 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
LazyFreePages:   1046528 kB

And, this is also the preparation for the following patch to account
lazy free pages to available memory.

Here the lazyfree doesn't count MADV_FREE pages since they are not
actually unmapped until they get reclaimed.  And, they are put on
inactive file LRU, so they have been accounted for available memory.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
I'm not quite sure whether LazyFreePages is a good name or not since "Lazyfree"
is typically referred to MADV_FREE pages.  I could use a more spceific name,
i.e. "DeferredSplitTHP" since it doesn't account MADV_FREE as explained in the
commit log.  But, a more general name would be good for including other type
pages in the future.

 Documentation/filesystems/proc.txt | 12 ++++++++----
 drivers/base/node.c                |  3 +++
 fs/proc/meminfo.c                  |  3 +++
 include/linux/mmzone.h             |  1 +
 mm/huge_memory.c                   |  6 ++++++
 mm/page_alloc.c                    |  2 ++
 mm/vmstat.c                        |  1 +
 7 files changed, 24 insertions(+), 4 deletions(-)

Comments

Michal Hocko Aug. 9, 2019, 8:32 a.m. UTC | #1
On Fri 09-08-19 07:57:44, Yang Shi wrote:
> When doing partial unmap to THP, the pages in the affected range would
> be considered to be reclaimable when memory pressure comes in.  And,
> such pages would be put on deferred split queue and get minus from the
> memory statistics (i.e. /proc/meminfo).
> 
> For example, when doing THP split test, /proc/meminfo would show:
> 
> Before put on lazy free list:
> MemTotal:       45288336 kB
> MemFree:        43281376 kB
> MemAvailable:   43254048 kB
> ...
> Active(anon):    1096296 kB
> Inactive(anon):     8372 kB
> ...
> AnonPages:       1096264 kB
> ...
> AnonHugePages:   1056768 kB
> 
> After put on lazy free list:
> MemTotal:       45288336 kB
> MemFree:        43282612 kB
> MemAvailable:   43255284 kB
> ...
> Active(anon):    1094228 kB
> Inactive(anon):     8372 kB
> ...
> AnonPages:         49668 kB
> ...
> AnonHugePages:     10240 kB
> 
> The THPs confusingly look disappeared although they are still on LRU if
> you are not familair the tricks done by kernel.

Is this a fallout of the recent deferred freeing work?

> Accounted the lazy free pages to NR_LAZYFREE, and show them in meminfo
> and other places.  With the change the /proc/meminfo would look like:
> Before put on lazy free list:

The name is really confusing because I have thought of MADV_FREE immediately.

> +LazyFreePages: Cleanly freeable pages under memory pressure (i.e. deferred
> +               split THP).

What does that mean actually? I have hard time imagine what cleanly
freeable pages mean.
Yang Shi Aug. 9, 2019, 4:19 p.m. UTC | #2
On 8/9/19 1:32 AM, Michal Hocko wrote:
> On Fri 09-08-19 07:57:44, Yang Shi wrote:
>> When doing partial unmap to THP, the pages in the affected range would
>> be considered to be reclaimable when memory pressure comes in.  And,
>> such pages would be put on deferred split queue and get minus from the
>> memory statistics (i.e. /proc/meminfo).
>>
>> For example, when doing THP split test, /proc/meminfo would show:
>>
>> Before put on lazy free list:
>> MemTotal:       45288336 kB
>> MemFree:        43281376 kB
>> MemAvailable:   43254048 kB
>> ...
>> Active(anon):    1096296 kB
>> Inactive(anon):     8372 kB
>> ...
>> AnonPages:       1096264 kB
>> ...
>> AnonHugePages:   1056768 kB
>>
>> After put on lazy free list:
>> MemTotal:       45288336 kB
>> MemFree:        43282612 kB
>> MemAvailable:   43255284 kB
>> ...
>> Active(anon):    1094228 kB
>> Inactive(anon):     8372 kB
>> ...
>> AnonPages:         49668 kB
>> ...
>> AnonHugePages:     10240 kB
>>
>> The THPs confusingly look disappeared although they are still on LRU if
>> you are not familair the tricks done by kernel.
> Is this a fallout of the recent deferred freeing work?

This series follows up the discussion happened when reviewing "Make 
deferred split shrinker memcg aware".

David Rientjes suggested deferred split THP should be accounted into 
available memory since they would be shrunk when memory pressure comes 
in, just like MADV_FREE pages. For the discussion, please refer to: 
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2010115.html

>
>> Accounted the lazy free pages to NR_LAZYFREE, and show them in meminfo
>> and other places.  With the change the /proc/meminfo would look like:
>> Before put on lazy free list:
> The name is really confusing because I have thought of MADV_FREE immediately.

Yes, I agree. We may use a more specific name, i.e. DeferredSplitTHP.

>
>> +LazyFreePages: Cleanly freeable pages under memory pressure (i.e. deferred
>> +               split THP).
> What does that mean actually? I have hard time imagine what cleanly
> freeable pages mean.

Like deferred split THP and MADV_FREE pages, they could be reclaimed 
during memory pressure.

If you just go with "DeferredSplitTHP", these ambiguity would go away.
Michal Hocko Aug. 9, 2019, 6:02 p.m. UTC | #3
On Fri 09-08-19 09:19:13, Yang Shi wrote:
> 
> 
> On 8/9/19 1:32 AM, Michal Hocko wrote:
> > On Fri 09-08-19 07:57:44, Yang Shi wrote:
> > > When doing partial unmap to THP, the pages in the affected range would
> > > be considered to be reclaimable when memory pressure comes in.  And,
> > > such pages would be put on deferred split queue and get minus from the
> > > memory statistics (i.e. /proc/meminfo).
> > > 
> > > For example, when doing THP split test, /proc/meminfo would show:
> > > 
> > > Before put on lazy free list:
> > > MemTotal:       45288336 kB
> > > MemFree:        43281376 kB
> > > MemAvailable:   43254048 kB
> > > ...
> > > Active(anon):    1096296 kB
> > > Inactive(anon):     8372 kB
> > > ...
> > > AnonPages:       1096264 kB
> > > ...
> > > AnonHugePages:   1056768 kB
> > > 
> > > After put on lazy free list:
> > > MemTotal:       45288336 kB
> > > MemFree:        43282612 kB
> > > MemAvailable:   43255284 kB
> > > ...
> > > Active(anon):    1094228 kB
> > > Inactive(anon):     8372 kB
> > > ...
> > > AnonPages:         49668 kB
> > > ...
> > > AnonHugePages:     10240 kB
> > > 
> > > The THPs confusingly look disappeared although they are still on LRU if
> > > you are not familair the tricks done by kernel.
> > Is this a fallout of the recent deferred freeing work?
> 
> This series follows up the discussion happened when reviewing "Make deferred
> split shrinker memcg aware".

OK, so it is a pre-existing problem. Thanks!

> David Rientjes suggested deferred split THP should be accounted into
> available memory since they would be shrunk when memory pressure comes in,
> just like MADV_FREE pages. For the discussion, please refer to:
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2010115.html

Thanks for the reference.

> 
> > 
> > > Accounted the lazy free pages to NR_LAZYFREE, and show them in meminfo
> > > and other places.  With the change the /proc/meminfo would look like:
> > > Before put on lazy free list:
> > The name is really confusing because I have thought of MADV_FREE immediately.
> 
> Yes, I agree. We may use a more specific name, i.e. DeferredSplitTHP.
> 
> > 
> > > +LazyFreePages: Cleanly freeable pages under memory pressure (i.e. deferred
> > > +               split THP).
> > What does that mean actually? I have hard time imagine what cleanly
> > freeable pages mean.
> 
> Like deferred split THP and MADV_FREE pages, they could be reclaimed during
> memory pressure.
> 
> If you just go with "DeferredSplitTHP", these ambiguity would go away.

I have to study the code some more but is there any reason why those
pages are not accounted as proper THPs anymore? Sure they are partially
unmaped but they are still THPs so why cannot we keep them accounted
like that. Having a new counter to reflect that sounds like papering
over the problem to me. But as I've said I might be missing something
important here.
Yang Shi Aug. 9, 2019, 6:26 p.m. UTC | #4
On 8/9/19 11:02 AM, Michal Hocko wrote:
> On Fri 09-08-19 09:19:13, Yang Shi wrote:
>>
>> On 8/9/19 1:32 AM, Michal Hocko wrote:
>>> On Fri 09-08-19 07:57:44, Yang Shi wrote:
>>>> When doing partial unmap to THP, the pages in the affected range would
>>>> be considered to be reclaimable when memory pressure comes in.  And,
>>>> such pages would be put on deferred split queue and get minus from the
>>>> memory statistics (i.e. /proc/meminfo).
>>>>
>>>> For example, when doing THP split test, /proc/meminfo would show:
>>>>
>>>> Before put on lazy free list:
>>>> MemTotal:       45288336 kB
>>>> MemFree:        43281376 kB
>>>> MemAvailable:   43254048 kB
>>>> ...
>>>> Active(anon):    1096296 kB
>>>> Inactive(anon):     8372 kB
>>>> ...
>>>> AnonPages:       1096264 kB
>>>> ...
>>>> AnonHugePages:   1056768 kB
>>>>
>>>> After put on lazy free list:
>>>> MemTotal:       45288336 kB
>>>> MemFree:        43282612 kB
>>>> MemAvailable:   43255284 kB
>>>> ...
>>>> Active(anon):    1094228 kB
>>>> Inactive(anon):     8372 kB
>>>> ...
>>>> AnonPages:         49668 kB
>>>> ...
>>>> AnonHugePages:     10240 kB
>>>>
>>>> The THPs confusingly look disappeared although they are still on LRU if
>>>> you are not familair the tricks done by kernel.
>>> Is this a fallout of the recent deferred freeing work?
>> This series follows up the discussion happened when reviewing "Make deferred
>> split shrinker memcg aware".
> OK, so it is a pre-existing problem. Thanks!
>
>> David Rientjes suggested deferred split THP should be accounted into
>> available memory since they would be shrunk when memory pressure comes in,
>> just like MADV_FREE pages. For the discussion, please refer to:
>> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2010115.html
> Thanks for the reference.
>
>>>> Accounted the lazy free pages to NR_LAZYFREE, and show them in meminfo
>>>> and other places.  With the change the /proc/meminfo would look like:
>>>> Before put on lazy free list:
>>> The name is really confusing because I have thought of MADV_FREE immediately.
>> Yes, I agree. We may use a more specific name, i.e. DeferredSplitTHP.
>>
>>>> +LazyFreePages: Cleanly freeable pages under memory pressure (i.e. deferred
>>>> +               split THP).
>>> What does that mean actually? I have hard time imagine what cleanly
>>> freeable pages mean.
>> Like deferred split THP and MADV_FREE pages, they could be reclaimed during
>> memory pressure.
>>
>> If you just go with "DeferredSplitTHP", these ambiguity would go away.
> I have to study the code some more but is there any reason why those
> pages are not accounted as proper THPs anymore? Sure they are partially
> unmaped but they are still THPs so why cannot we keep them accounted
> like that. Having a new counter to reflect that sounds like papering
> over the problem to me. But as I've said I might be missing something
> important here.

I think we could keep those pages accounted for NR_ANON_THPS since they 
are still THP although they are unmapped as you mentioned if we just 
want to fix the improper accounting.

Here the new counter is introduced for patch 2/2 to account deferred 
split THPs into available memory since NR_ANON_THPS may contain 
non-deferred split THPs.

I could use an internal counter for deferred split THPs, but if it is 
accounted by mod_node_page_state, why not just show it in /proc/meminfo? 
Or we fix NR_ANON_THPS and show deferred split THPs in /proc/meminfo?

>
Yang Shi Aug. 9, 2019, 11:54 p.m. UTC | #5
On 8/9/19 11:26 AM, Yang Shi wrote:
>
>
> On 8/9/19 11:02 AM, Michal Hocko wrote:
>> On Fri 09-08-19 09:19:13, Yang Shi wrote:
>>>
>>> On 8/9/19 1:32 AM, Michal Hocko wrote:
>>>> On Fri 09-08-19 07:57:44, Yang Shi wrote:
>>>>> When doing partial unmap to THP, the pages in the affected range 
>>>>> would
>>>>> be considered to be reclaimable when memory pressure comes in.  And,
>>>>> such pages would be put on deferred split queue and get minus from 
>>>>> the
>>>>> memory statistics (i.e. /proc/meminfo).
>>>>>
>>>>> For example, when doing THP split test, /proc/meminfo would show:
>>>>>
>>>>> Before put on lazy free list:
>>>>> MemTotal:       45288336 kB
>>>>> MemFree:        43281376 kB
>>>>> MemAvailable:   43254048 kB
>>>>> ...
>>>>> Active(anon):    1096296 kB
>>>>> Inactive(anon):     8372 kB
>>>>> ...
>>>>> AnonPages:       1096264 kB
>>>>> ...
>>>>> AnonHugePages:   1056768 kB
>>>>>
>>>>> After put on lazy free list:
>>>>> MemTotal:       45288336 kB
>>>>> MemFree:        43282612 kB
>>>>> MemAvailable:   43255284 kB
>>>>> ...
>>>>> Active(anon):    1094228 kB
>>>>> Inactive(anon):     8372 kB
>>>>> ...
>>>>> AnonPages:         49668 kB
>>>>> ...
>>>>> AnonHugePages:     10240 kB
>>>>>
>>>>> The THPs confusingly look disappeared although they are still on 
>>>>> LRU if
>>>>> you are not familair the tricks done by kernel.
>>>> Is this a fallout of the recent deferred freeing work?
>>> This series follows up the discussion happened when reviewing "Make 
>>> deferred
>>> split shrinker memcg aware".
>> OK, so it is a pre-existing problem. Thanks!
>>
>>> David Rientjes suggested deferred split THP should be accounted into
>>> available memory since they would be shrunk when memory pressure 
>>> comes in,
>>> just like MADV_FREE pages. For the discussion, please refer to:
>>> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2010115.html 
>>>
>> Thanks for the reference.
>>
>>>>> Accounted the lazy free pages to NR_LAZYFREE, and show them in 
>>>>> meminfo
>>>>> and other places.  With the change the /proc/meminfo would look like:
>>>>> Before put on lazy free list:
>>>> The name is really confusing because I have thought of MADV_FREE 
>>>> immediately.
>>> Yes, I agree. We may use a more specific name, i.e. DeferredSplitTHP.
>>>
>>>>> +LazyFreePages: Cleanly freeable pages under memory pressure (i.e. 
>>>>> deferred
>>>>> +               split THP).
>>>> What does that mean actually? I have hard time imagine what cleanly
>>>> freeable pages mean.
>>> Like deferred split THP and MADV_FREE pages, they could be reclaimed 
>>> during
>>> memory pressure.
>>>
>>> If you just go with "DeferredSplitTHP", these ambiguity would go away.
>> I have to study the code some more but is there any reason why those
>> pages are not accounted as proper THPs anymore? Sure they are partially
>> unmaped but they are still THPs so why cannot we keep them accounted
>> like that. Having a new counter to reflect that sounds like papering
>> over the problem to me. But as I've said I might be missing something
>> important here.
>
> I think we could keep those pages accounted for NR_ANON_THPS since 
> they are still THP although they are unmapped as you mentioned if we 
> just want to fix the improper accounting.

By double checking what NR_ANON_THPS really means, 
Documentation/filesystems/proc.txt says "Non-file backed huge pages 
mapped into userspace page tables". Then it makes some sense to dec 
NR_ANON_THPS when removing rmap even though they are still THPs.

I don't think we would like to change the definition, if so a new 
counter may make more sense.

>
> Here the new counter is introduced for patch 2/2 to account deferred 
> split THPs into available memory since NR_ANON_THPS may contain 
> non-deferred split THPs.
>
> I could use an internal counter for deferred split THPs, but if it is 
> accounted by mod_node_page_state, why not just show it in 
> /proc/meminfo? Or we fix NR_ANON_THPS and show deferred split THPs in 
> /proc/meminfo?
>
>>
>
Michal Hocko Aug. 12, 2019, 9:34 a.m. UTC | #6
On Fri 09-08-19 16:54:43, Yang Shi wrote:
> 
> 
> On 8/9/19 11:26 AM, Yang Shi wrote:
> > 
> > 
> > On 8/9/19 11:02 AM, Michal Hocko wrote:
[...]
> > > I have to study the code some more but is there any reason why those
> > > pages are not accounted as proper THPs anymore? Sure they are partially
> > > unmaped but they are still THPs so why cannot we keep them accounted
> > > like that. Having a new counter to reflect that sounds like papering
> > > over the problem to me. But as I've said I might be missing something
> > > important here.
> > 
> > I think we could keep those pages accounted for NR_ANON_THPS since they
> > are still THP although they are unmapped as you mentioned if we just
> > want to fix the improper accounting.
> 
> By double checking what NR_ANON_THPS really means,
> Documentation/filesystems/proc.txt says "Non-file backed huge pages mapped
> into userspace page tables". Then it makes some sense to dec NR_ANON_THPS
> when removing rmap even though they are still THPs.
> 
> I don't think we would like to change the definition, if so a new counter
> may make more sense.

Yes, changing NR_ANON_THPS semantic sounds like a bad idea. Let
me try whether I understand the problem. So we have some THP in
limbo waiting for them to be split and unmapped parts to be freed,
right? I can see that page_remove_anon_compound_rmap does correctly
decrement NR_ANON_MAPPED for sub pages that are no longer mapped by
anybody. LRU pages seem to be accounted properly as well.  As you've
said NR_ANON_THPS reflects the number of THPs mapped and that should be
reflecting the reality already IIUC.

So the only problem seems to be that deferred THP might aggregate a lot
of immediately freeable memory (if none of the subpages are mapped) and
that can confuse MemAvailable because it doesn't know about the fact.
Has an skewed counter resulted in a user observable behavior/failures?
I can see that memcg rss size was the primary problem David was looking
at. But MemAvailable will not help with that, right? Moreover is
accounting the full THP correct? What if subpages are still mapped?
Yang Shi Aug. 12, 2019, 5 p.m. UTC | #7
On 8/12/19 2:34 AM, Michal Hocko wrote:
> On Fri 09-08-19 16:54:43, Yang Shi wrote:
>>
>> On 8/9/19 11:26 AM, Yang Shi wrote:
>>>
>>> On 8/9/19 11:02 AM, Michal Hocko wrote:
> [...]
>>>> I have to study the code some more but is there any reason why those
>>>> pages are not accounted as proper THPs anymore? Sure they are partially
>>>> unmaped but they are still THPs so why cannot we keep them accounted
>>>> like that. Having a new counter to reflect that sounds like papering
>>>> over the problem to me. But as I've said I might be missing something
>>>> important here.
>>> I think we could keep those pages accounted for NR_ANON_THPS since they
>>> are still THP although they are unmapped as you mentioned if we just
>>> want to fix the improper accounting.
>> By double checking what NR_ANON_THPS really means,
>> Documentation/filesystems/proc.txt says "Non-file backed huge pages mapped
>> into userspace page tables". Then it makes some sense to dec NR_ANON_THPS
>> when removing rmap even though they are still THPs.
>>
>> I don't think we would like to change the definition, if so a new counter
>> may make more sense.
> Yes, changing NR_ANON_THPS semantic sounds like a bad idea. Let
> me try whether I understand the problem. So we have some THP in
> limbo waiting for them to be split and unmapped parts to be freed,
> right? I can see that page_remove_anon_compound_rmap does correctly
> decrement NR_ANON_MAPPED for sub pages that are no longer mapped by
> anybody. LRU pages seem to be accounted properly as well.  As you've
> said NR_ANON_THPS reflects the number of THPs mapped and that should be
> reflecting the reality already IIUC.
>
> So the only problem seems to be that deferred THP might aggregate a lot
> of immediately freeable memory (if none of the subpages are mapped) and
> that can confuse MemAvailable because it doesn't know about the fact.
> Has an skewed counter resulted in a user observable behavior/failures?

No. But the skewed counter may make big difference for a big scale 
cluster. The MemAvailable is an important factor for cluster scheduler 
to determine the capacity.

Even though the scheduler could place one more small container due to 
extra available memory, it would make big difference for a cluster with 
thousands of nodes.

> I can see that memcg rss size was the primary problem David was looking
> at. But MemAvailable will not help with that, right? Moreover is

Yes, but David actually would like to have memcg MemAvailable (the 
accounter like the global one), which should be counted like the global 
one and should account per memcg deferred split THP properly.

> accounting the full THP correct? What if subpages are still mapped?

"Deferred split" definitely doesn't mean they are free. When memory 
pressure is hit, they would be split, then the unmapped normal pages 
would be freed. So, when calculating MemAvailable, they are not 
accounted 100%, but like "available += lazyfree - min(lazyfree / 2, 
wmark_low)", just like how page cache is accounted.

We could get more accurate account, i.e. checking each sub page's 
mapcount when accounting, but it may change before shrinker start 
scanning. So, just use the ballpark estimation to trade off the 
complexity for accurate accounting.

>
Michal Hocko Aug. 14, 2019, 11:08 a.m. UTC | #8
On Mon 12-08-19 10:00:17, Yang Shi wrote:
> 
> 
> On 8/12/19 2:34 AM, Michal Hocko wrote:
> > On Fri 09-08-19 16:54:43, Yang Shi wrote:
> > > 
> > > On 8/9/19 11:26 AM, Yang Shi wrote:
> > > > 
> > > > On 8/9/19 11:02 AM, Michal Hocko wrote:
> > [...]
> > > > > I have to study the code some more but is there any reason why those
> > > > > pages are not accounted as proper THPs anymore? Sure they are partially
> > > > > unmaped but they are still THPs so why cannot we keep them accounted
> > > > > like that. Having a new counter to reflect that sounds like papering
> > > > > over the problem to me. But as I've said I might be missing something
> > > > > important here.
> > > > I think we could keep those pages accounted for NR_ANON_THPS since they
> > > > are still THP although they are unmapped as you mentioned if we just
> > > > want to fix the improper accounting.
> > > By double checking what NR_ANON_THPS really means,
> > > Documentation/filesystems/proc.txt says "Non-file backed huge pages mapped
> > > into userspace page tables". Then it makes some sense to dec NR_ANON_THPS
> > > when removing rmap even though they are still THPs.
> > > 
> > > I don't think we would like to change the definition, if so a new counter
> > > may make more sense.
> > Yes, changing NR_ANON_THPS semantic sounds like a bad idea. Let
> > me try whether I understand the problem. So we have some THP in
> > limbo waiting for them to be split and unmapped parts to be freed,
> > right? I can see that page_remove_anon_compound_rmap does correctly
> > decrement NR_ANON_MAPPED for sub pages that are no longer mapped by
> > anybody. LRU pages seem to be accounted properly as well.  As you've
> > said NR_ANON_THPS reflects the number of THPs mapped and that should be
> > reflecting the reality already IIUC.
> > 
> > So the only problem seems to be that deferred THP might aggregate a lot
> > of immediately freeable memory (if none of the subpages are mapped) and
> > that can confuse MemAvailable because it doesn't know about the fact.
> > Has an skewed counter resulted in a user observable behavior/failures?
> 
> No. But the skewed counter may make big difference for a big scale cluster.
> The MemAvailable is an important factor for cluster scheduler to determine
> the capacity.

But MemAvailable is a very rough estimation. Is relying on it really a
good measure? I mean there is a lot of reclaimable memory that is not
reflected there (some fs. internal data structures, networking buffers
etc.)

[...]

> > accounting the full THP correct? What if subpages are still mapped?
> 
> "Deferred split" definitely doesn't mean they are free. When memory pressure
> is hit, they would be split, then the unmapped normal pages would be freed.
> So, when calculating MemAvailable, they are not accounted 100%, but like
> "available += lazyfree - min(lazyfree / 2, wmark_low)", just like how page
> cache is accounted.

Then this is even more dubious IMHO.

> We could get more accurate account, i.e. checking each sub page's mapcount
> when accounting, but it may change before shrinker start scanning. So, just
> use the ballpark estimation to trade off the complexity for accurate
> accounting.

I do not see much point in fixing up one particular counter when there
is a whole lot that is even not considered. I would rather live with the
fact that MemAvailable is only very rough estimate then whack a mole on
any memory consumer that is freeable directly or indirectly via memory
reclaim. Because this is likely to be always subtly broken and only
visible under very specific workloads so there is no way to test for it.
Vlastimil Babka Aug. 14, 2019, 12:49 p.m. UTC | #9
On 8/9/19 8:26 PM, Yang Shi wrote:
> Here the new counter is introduced for patch 2/2 to account deferred 
> split THPs into available memory since NR_ANON_THPS may contain 
> non-deferred split THPs.
> 
> I could use an internal counter for deferred split THPs, but if it is 
> accounted by mod_node_page_state, why not just show it in /proc/meminfo? 

The answer to "Why not" is that it becomes part of userspace API (btw this
patchset should have CC'd linux-api@ - please do for further iterations) and
even if the implementation detail of deferred splitting might change in the
future, we'll basically have to keep the counter (even with 0 value) in
/proc/meminfo forever.

Also, quite recently we have added the following counter:

KReclaimable: Kernel allocations that the kernel will attempt to reclaim
              under memory pressure. Includes SReclaimable (below), and other
              direct allocations with a shrinker.

Although THP allocations are not exactly "kernel allocations", once they are
unmapped, they are in fact kernel-only, so IMHO it wouldn't be a big stretch to
add the lazy THP pages there?

> Or we fix NR_ANON_THPS and show deferred split THPs in /proc/meminfo?
> 
>>
>
Michal Hocko Aug. 14, 2019, 12:53 p.m. UTC | #10
On Wed 14-08-19 14:49:18, Vlastimil Babka wrote:
> On 8/9/19 8:26 PM, Yang Shi wrote:
> > Here the new counter is introduced for patch 2/2 to account deferred 
> > split THPs into available memory since NR_ANON_THPS may contain 
> > non-deferred split THPs.
> > 
> > I could use an internal counter for deferred split THPs, but if it is 
> > accounted by mod_node_page_state, why not just show it in /proc/meminfo? 
> 
> The answer to "Why not" is that it becomes part of userspace API (btw this
> patchset should have CC'd linux-api@ - please do for further iterations) and
> even if the implementation detail of deferred splitting might change in the
> future, we'll basically have to keep the counter (even with 0 value) in
> /proc/meminfo forever.
> 
> Also, quite recently we have added the following counter:
> 
> KReclaimable: Kernel allocations that the kernel will attempt to reclaim
>               under memory pressure. Includes SReclaimable (below), and other
>               direct allocations with a shrinker.
> 
> Although THP allocations are not exactly "kernel allocations", once they are
> unmapped, they are in fact kernel-only, so IMHO it wouldn't be a big stretch to
> add the lazy THP pages there?

That would indeed fit in much better than a dedicated counter.
Vlastimil Babka Aug. 14, 2019, 12:55 p.m. UTC | #11
On 8/12/19 7:00 PM, Yang Shi wrote:
>> I can see that memcg rss size was the primary problem David was looking
>> at. But MemAvailable will not help with that, right? Moreover is
> 
> Yes, but David actually would like to have memcg MemAvailable (the 
> accounter like the global one), which should be counted like the global 
> one and should account per memcg deferred split THP properly.
> 
>> accounting the full THP correct? What if subpages are still mapped?
> 
> "Deferred split" definitely doesn't mean they are free. When memory 
> pressure is hit, they would be split, then the unmapped normal pages 
> would be freed. So, when calculating MemAvailable, they are not 
> accounted 100%, but like "available += lazyfree - min(lazyfree / 2, 
> wmark_low)", just like how page cache is accounted.
> 
> We could get more accurate account, i.e. checking each sub page's 
> mapcount when accounting, but it may change before shrinker start 
> scanning. So, just use the ballpark estimation to trade off the 
> complexity for accurate accounting.

If we know the mapcounts in the moment the deferred split is initiated (I
suppose there has to be a iteration over all subpages already?), we could get
the exact number to adjust the counter with, and also store the number somewhere
(e.g. a unused field in first/second tail page, I think we already do that for
something). Then in the shrinker we just read that number to adjust the counter
back. Then we can ignore the subpage mapping changes before shrinking happens,
they shouldn't change the situation significantly, and importantly we we will be
safe from counter imbalance thanks to the stored number.
Yang Shi Aug. 15, 2019, 4:51 a.m. UTC | #12
On 8/14/19 4:08 AM, Michal Hocko wrote:
> On Mon 12-08-19 10:00:17, Yang Shi wrote:
>>
>> On 8/12/19 2:34 AM, Michal Hocko wrote:
>>> On Fri 09-08-19 16:54:43, Yang Shi wrote:
>>>> On 8/9/19 11:26 AM, Yang Shi wrote:
>>>>> On 8/9/19 11:02 AM, Michal Hocko wrote:
>>> [...]
>>>>>> I have to study the code some more but is there any reason why those
>>>>>> pages are not accounted as proper THPs anymore? Sure they are partially
>>>>>> unmaped but they are still THPs so why cannot we keep them accounted
>>>>>> like that. Having a new counter to reflect that sounds like papering
>>>>>> over the problem to me. But as I've said I might be missing something
>>>>>> important here.
>>>>> I think we could keep those pages accounted for NR_ANON_THPS since they
>>>>> are still THP although they are unmapped as you mentioned if we just
>>>>> want to fix the improper accounting.
>>>> By double checking what NR_ANON_THPS really means,
>>>> Documentation/filesystems/proc.txt says "Non-file backed huge pages mapped
>>>> into userspace page tables". Then it makes some sense to dec NR_ANON_THPS
>>>> when removing rmap even though they are still THPs.
>>>>
>>>> I don't think we would like to change the definition, if so a new counter
>>>> may make more sense.
>>> Yes, changing NR_ANON_THPS semantic sounds like a bad idea. Let
>>> me try whether I understand the problem. So we have some THP in
>>> limbo waiting for them to be split and unmapped parts to be freed,
>>> right? I can see that page_remove_anon_compound_rmap does correctly
>>> decrement NR_ANON_MAPPED for sub pages that are no longer mapped by
>>> anybody. LRU pages seem to be accounted properly as well.  As you've
>>> said NR_ANON_THPS reflects the number of THPs mapped and that should be
>>> reflecting the reality already IIUC.
>>>
>>> So the only problem seems to be that deferred THP might aggregate a lot
>>> of immediately freeable memory (if none of the subpages are mapped) and
>>> that can confuse MemAvailable because it doesn't know about the fact.
>>> Has an skewed counter resulted in a user observable behavior/failures?
>> No. But the skewed counter may make big difference for a big scale cluster.
>> The MemAvailable is an important factor for cluster scheduler to determine
>> the capacity.
> But MemAvailable is a very rough estimation. Is relying on it really a
> good measure? I mean there is a lot of reclaimable memory that is not
> reflected there (some fs. internal data structures, networking buffers
> etc.)

Yes, I agree there are other freeable objects not accounted into 
MemAvailable. Their size depends on the workload. But, deferred split 
THPs seems more common with the common workloads. A simple run with 
MariaDB test of mmtest shows it could generate over fifteen thousand 
deferred split THPs (accumulated around 30G in one hour run, 75% of 40G 
memory for my VM). So, it may be worth accounting deferred split THPs in 
MemAvailable.

>
> [...]
>
>>> accounting the full THP correct? What if subpages are still mapped?
>> "Deferred split" definitely doesn't mean they are free. When memory pressure
>> is hit, they would be split, then the unmapped normal pages would be freed.
>> So, when calculating MemAvailable, they are not accounted 100%, but like
>> "available += lazyfree - min(lazyfree / 2, wmark_low)", just like how page
>> cache is accounted.
> Then this is even more dubious IMHO.
>
>> We could get more accurate account, i.e. checking each sub page's mapcount
>> when accounting, but it may change before shrinker start scanning. So, just
>> use the ballpark estimation to trade off the complexity for accurate
>> accounting.
> I do not see much point in fixing up one particular counter when there
> is a whole lot that is even not considered. I would rather live with the
> fact that MemAvailable is only very rough estimate then whack a mole on
> any memory consumer that is freeable directly or indirectly via memory
> reclaim. Because this is likely to be always subtly broken and only
> visible under very specific workloads so there is no way to test for it.

I saw Vlastimil suggested KReclaimable, it seems a good fit. If so we 
don't need create a new counter anymore.
Yang Shi Aug. 15, 2019, 4:53 a.m. UTC | #13
On 8/14/19 5:49 AM, Vlastimil Babka wrote:
> On 8/9/19 8:26 PM, Yang Shi wrote:
>> Here the new counter is introduced for patch 2/2 to account deferred
>> split THPs into available memory since NR_ANON_THPS may contain
>> non-deferred split THPs.
>>
>> I could use an internal counter for deferred split THPs, but if it is
>> accounted by mod_node_page_state, why not just show it in /proc/meminfo?
> The answer to "Why not" is that it becomes part of userspace API (btw this
> patchset should have CC'd linux-api@ - please do for further iterations) and
> even if the implementation detail of deferred splitting might change in the
> future, we'll basically have to keep the counter (even with 0 value) in
> /proc/meminfo forever.
>
> Also, quite recently we have added the following counter:
>
> KReclaimable: Kernel allocations that the kernel will attempt to reclaim
>                under memory pressure. Includes SReclaimable (below), and other
>                direct allocations with a shrinker.
>
> Although THP allocations are not exactly "kernel allocations", once they are
> unmapped, they are in fact kernel-only, so IMHO it wouldn't be a big stretch to
> add the lazy THP pages there?

Thanks a lot for the suggestion. I agree it may be a good fit. Hope 
"kernel allocations" not cause confusion. But, we can explain in the 
documentation.

>
>> Or we fix NR_ANON_THPS and show deferred split THPs in /proc/meminfo?
>>
Yang Shi Aug. 15, 2019, 4:54 a.m. UTC | #14
On 8/14/19 5:55 AM, Vlastimil Babka wrote:
> On 8/12/19 7:00 PM, Yang Shi wrote:
>>> I can see that memcg rss size was the primary problem David was looking
>>> at. But MemAvailable will not help with that, right? Moreover is
>> Yes, but David actually would like to have memcg MemAvailable (the
>> accounter like the global one), which should be counted like the global
>> one and should account per memcg deferred split THP properly.
>>
>>> accounting the full THP correct? What if subpages are still mapped?
>> "Deferred split" definitely doesn't mean they are free. When memory
>> pressure is hit, they would be split, then the unmapped normal pages
>> would be freed. So, when calculating MemAvailable, they are not
>> accounted 100%, but like "available += lazyfree - min(lazyfree / 2,
>> wmark_low)", just like how page cache is accounted.
>>
>> We could get more accurate account, i.e. checking each sub page's
>> mapcount when accounting, but it may change before shrinker start
>> scanning. So, just use the ballpark estimation to trade off the
>> complexity for accurate accounting.
> If we know the mapcounts in the moment the deferred split is initiated (I
> suppose there has to be a iteration over all subpages already?), we could get
> the exact number to adjust the counter with, and also store the number somewhere
> (e.g. a unused field in first/second tail page, I think we already do that for
> something). Then in the shrinker we just read that number to adjust the counter
> back. Then we can ignore the subpage mapping changes before shrinking happens,
> they shouldn't change the situation significantly, and importantly we we will be
> safe from counter imbalance thanks to the stored number.

Thanks, I'm going to look into this approach. Thanks for the suggestion 
again.
Michal Hocko Aug. 15, 2019, 8:46 a.m. UTC | #15
On Wed 14-08-19 21:51:47, Yang Shi wrote:
> 
> 
> On 8/14/19 4:08 AM, Michal Hocko wrote:
> > On Mon 12-08-19 10:00:17, Yang Shi wrote:
> > > 
> > > On 8/12/19 2:34 AM, Michal Hocko wrote:
> > > > On Fri 09-08-19 16:54:43, Yang Shi wrote:
> > > > > On 8/9/19 11:26 AM, Yang Shi wrote:
> > > > > > On 8/9/19 11:02 AM, Michal Hocko wrote:
> > > > [...]
> > > > > > > I have to study the code some more but is there any reason why those
> > > > > > > pages are not accounted as proper THPs anymore? Sure they are partially
> > > > > > > unmaped but they are still THPs so why cannot we keep them accounted
> > > > > > > like that. Having a new counter to reflect that sounds like papering
> > > > > > > over the problem to me. But as I've said I might be missing something
> > > > > > > important here.
> > > > > > I think we could keep those pages accounted for NR_ANON_THPS since they
> > > > > > are still THP although they are unmapped as you mentioned if we just
> > > > > > want to fix the improper accounting.
> > > > > By double checking what NR_ANON_THPS really means,
> > > > > Documentation/filesystems/proc.txt says "Non-file backed huge pages mapped
> > > > > into userspace page tables". Then it makes some sense to dec NR_ANON_THPS
> > > > > when removing rmap even though they are still THPs.
> > > > > 
> > > > > I don't think we would like to change the definition, if so a new counter
> > > > > may make more sense.
> > > > Yes, changing NR_ANON_THPS semantic sounds like a bad idea. Let
> > > > me try whether I understand the problem. So we have some THP in
> > > > limbo waiting for them to be split and unmapped parts to be freed,
> > > > right? I can see that page_remove_anon_compound_rmap does correctly
> > > > decrement NR_ANON_MAPPED for sub pages that are no longer mapped by
> > > > anybody. LRU pages seem to be accounted properly as well.  As you've
> > > > said NR_ANON_THPS reflects the number of THPs mapped and that should be
> > > > reflecting the reality already IIUC.
> > > > 
> > > > So the only problem seems to be that deferred THP might aggregate a lot
> > > > of immediately freeable memory (if none of the subpages are mapped) and
> > > > that can confuse MemAvailable because it doesn't know about the fact.
> > > > Has an skewed counter resulted in a user observable behavior/failures?
> > > No. But the skewed counter may make big difference for a big scale cluster.
> > > The MemAvailable is an important factor for cluster scheduler to determine
> > > the capacity.
> > But MemAvailable is a very rough estimation. Is relying on it really a
> > good measure? I mean there is a lot of reclaimable memory that is not
> > reflected there (some fs. internal data structures, networking buffers
> > etc.)
> 
> Yes, I agree there are other freeable objects not accounted into
> MemAvailable. Their size depends on the workload. But, deferred split THPs
> seems more common with the common workloads. A simple run with MariaDB test
> of mmtest shows it could generate over fifteen thousand deferred split THPs
> (accumulated around 30G in one hour run, 75% of 40G memory for my VM). So,
> it may be worth accounting deferred split THPs in MemAvailable.

This is a very useful information to put into the changelog.
diff mbox series

Patch

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 99ca040..8dd09d1 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -917,6 +917,7 @@  HardwareCorrupted:   0 kB
 AnonHugePages:   49152 kB
 ShmemHugePages:      0 kB
 ShmemPmdMapped:      0 kB
+LazyFreePages:       0 kB
 
 
     MemTotal: Total usable ram (i.e. physical ram minus a few reserved
@@ -924,12 +925,13 @@  ShmemPmdMapped:      0 kB
      MemFree: The sum of LowFree+HighFree
 MemAvailable: An estimate of how much memory is available for starting new
               applications, without swapping. Calculated from MemFree,
-              SReclaimable, the size of the file LRU lists, and the low
-              watermarks in each zone.
+              SReclaimable, the size of the file LRU lists, LazyFree pages
+              and the low watermarks in each zone.
               The estimate takes into account that the system needs some
               page cache to function well, and that not all reclaimable
-              slab will be reclaimable, due to items being in use. The
-              impact of those factors will vary from system to system.
+              slab and LazyFree pages will be reclaimable, due to items
+              being in use. The impact of those factors will vary from
+              system to system.
      Buffers: Relatively temporary storage for raw disk blocks
               shouldn't get tremendously large (20MB or so)
       Cached: in-memory cache for files read from the disk (the
@@ -967,6 +969,8 @@  AnonHugePages: Non-file backed huge pages mapped into userspace page tables
 ShmemHugePages: Memory used by shared memory (shmem) and tmpfs allocated
               with huge pages
 ShmemPmdMapped: Shared memory mapped into userspace with huge pages
+LazyFreePages: Cleanly freeable pages under memory pressure (i.e. deferred
+               split THP).
 KReclaimable: Kernel allocations that the kernel will attempt to reclaim
               under memory pressure. Includes SReclaimable (below), and other
               direct allocations with a shrinker.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 75b7e6f..9326b4e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -428,6 +428,7 @@  static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d ShmemHugePages: %8lu kB\n"
 		       "Node %d ShmemPmdMapped: %8lu kB\n"
 #endif
+		       "Node %d LazyFreePages:	%8lu kB\n"
 			,
 		       nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
 		       nid, K(node_page_state(pgdat, NR_WRITEBACK)),
@@ -454,6 +455,8 @@  static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
 				       HPAGE_PMD_NR)
 #endif
+		       ,
+		       nid, K(node_page_state(pgdat, NR_LAZYFREE))
 		       );
 	n += hugetlb_report_node_meminfo(nid, buf + n);
 	return n;
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 465ea01..d66491a 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -138,6 +138,9 @@  static int meminfo_proc_show(struct seq_file *m, void *v)
 		    global_node_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR);
 #endif
 
+	show_val_kb(m, "LazyFreePages:  ",
+		    global_node_page_state(NR_LAZYFREE));
+
 #ifdef CONFIG_CMA
 	show_val_kb(m, "CmaTotal:       ", totalcma_pages);
 	show_val_kb(m, "CmaFree:        ",
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d8ec773..4d00b1c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -235,6 +235,7 @@  enum node_stat_item {
 	NR_SHMEM_THPS,
 	NR_SHMEM_PMDMAPPED,
 	NR_ANON_THPS,
+	NR_LAZYFREE,		/* Lazyfree pages, i.e. deferred split THP */
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c9a596e..a20152f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2766,6 +2766,8 @@  int split_huge_page_to_list(struct page *page, struct list_head *list)
 		if (!list_empty(page_deferred_list(head))) {
 			ds_queue->split_queue_len--;
 			list_del(page_deferred_list(head));
+			__mod_lruvec_page_state(head, NR_LAZYFREE,
+						-HPAGE_PMD_NR);
 		}
 		if (mapping)
 			__dec_node_page_state(page, NR_SHMEM_THPS);
@@ -2815,6 +2817,7 @@  void free_transhuge_page(struct page *page)
 	if (!list_empty(page_deferred_list(page))) {
 		ds_queue->split_queue_len--;
 		list_del(page_deferred_list(page));
+		__mod_lruvec_page_state(page, NR_LAZYFREE, -HPAGE_PMD_NR);
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 	free_compound_page(page);
@@ -2846,6 +2849,7 @@  void deferred_split_huge_page(struct page *page)
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (list_empty(page_deferred_list(page))) {
 		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
+		__mod_lruvec_page_state(page, NR_LAZYFREE, HPAGE_PMD_NR);
 		list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
 		ds_queue->split_queue_len++;
 #ifdef CONFIG_MEMCG
@@ -2906,6 +2910,8 @@  static unsigned long deferred_split_scan(struct shrinker *shrink,
 			/* We lost race with put_compound_page() */
 			list_del_init(page_deferred_list(page));
 			ds_queue->split_queue_len--;
+			__mod_lruvec_page_state(page, NR_LAZYFREE,
+						-HPAGE_PMD_NR);
 		}
 		if (!--sc->nr_to_scan)
 			break;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1d1c5d3..1f3eba8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5279,6 +5279,7 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			" shmem_pmdmapped: %lukB"
 			" anon_thp: %lukB"
 #endif
+			" lazyfree:%lukB"
 			" writeback_tmp:%lukB"
 			" unstable:%lukB"
 			" all_unreclaimable? %s"
@@ -5301,6 +5302,7 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 					* HPAGE_PMD_NR),
 			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
 #endif
+			K(node_page_state(pgdat, NR_LAZYFREE)),
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
 			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fd7e16c..4f6eed5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1159,6 +1159,7 @@  int fragmentation_index(struct zone *zone, unsigned int order)
 	"nr_shmem_hugepages",
 	"nr_shmem_pmdmapped",
 	"nr_anon_transparent_hugepages",
+	"nr_lazyfree",
 	"nr_unstable",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",