
[RFC,0/6] Reclaim zero subpages of thp to avoid memory bloat

Message ID 1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com (mailing list archive)

Message

Ning Zhang Oct. 28, 2021, 11:56 a.m. UTC
As we know, THP may lead to memory bloat, which can cause OOM.
Testing with several applications, we found that the cause of the
bloat is that a huge page may contain zero-filled subpages
(whether accessed or not), and that most zero subpages are
concentrated in a few huge pages.

Following is a text_classification_rnn case for tensorflow:

  zero_subpages   huge_pages  waste
  [     0,     1) 186         0.00%
  [     1,     2) 23          0.01%
  [     2,     4) 36          0.02%
  [     4,     8) 67          0.08%
  [     8,    16) 80          0.23%
  [    16,    32) 109         0.61%
  [    32,    64) 44          0.49%
  [    64,   128) 12          0.30%
  [   128,   256) 28          1.54%
  [   256,   513) 159        18.03%

In this case, there are 187 huge pages (25% of the total huge pages)
which contain more than 128 zero subpages, and these huge pages
account for 19.57% waste of the total RSS. That means we can reclaim
19.57% of the memory by splitting those 187 huge pages and
reclaiming their zero subpages.
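The 19.57% and 25% figures follow directly from the histogram above. As a quick illustrative check (plain Python, not part of the patchset):

```python
# Buckets from the table above: (zero-subpage range, huge pages, waste % of RSS).
buckets = [
    ((0, 1), 186, 0.00), ((1, 2), 23, 0.01), ((2, 4), 36, 0.02),
    ((4, 8), 67, 0.08), ((8, 16), 80, 0.23), ((16, 32), 109, 0.61),
    ((32, 64), 44, 0.49), ((64, 128), 12, 0.30), ((128, 256), 28, 1.54),
    ((256, 513), 159, 18.03),
]

# Huge pages with at least 128 zero subpages, and the RSS they waste.
candidates = [(n, w) for (lo, _), n, w in buckets if lo >= 128]
pages = sum(n for n, _ in candidates)   # 187 huge pages
waste = sum(w for _, w in candidates)   # 19.57% of total RSS
total = sum(n for _, n, _ in buckets)   # 744 huge pages in all
print(pages, round(waste, 2), round(pages / total * 100))  # 187 19.57 25
```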

This patchset introduces a new mechanism that splits huge pages
containing zero subpages and reclaims those zero subpages.

We add each anonymous huge page to a list to reduce the cost of
finding candidate huge pages. When memory reclaim is triggered,
the list is walked and huge pages that contain enough zero
subpages may be reclaimed; the zero subpages are replaced by
ZERO_PAGE(0).
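To make the walk concrete, here is a userspace toy model of it (plain Python; names such as ZERO_SUBPAGE_THRESHOLD are invented for illustration and do not correspond to the patchset's actual identifiers):

```python
SUBPAGES = 512                  # subpages in a 2MB huge page (4KB each)
ZERO_SUBPAGE_THRESHOLD = 128    # hypothetical threshold, not the patchset's
ZERO_PAGE = bytes(4096)         # stand-in for the kernel's shared ZERO_PAGE(0)

def zsr_walk(zsr_list):
    """Walk the huge-page list once; for pages with enough zero-filled
    subpages, 'split' them and remap zero subpages to the shared zero page."""
    reclaimed = 0
    for huge_page in zsr_list:  # each entry models 512 subpage buffers
        zeros = sum(1 for sp in huge_page if sp == ZERO_PAGE)
        if zeros < ZERO_SUBPAGE_THRESHOLD:
            continue            # not enough waste: keep the huge page intact
        for i, sp in enumerate(huge_page):
            if sp == ZERO_PAGE:
                huge_page[i] = ZERO_PAGE  # shared mapping frees the private copy
                reclaimed += 1
    return reclaimed

# A huge page with one touched subpage: 511 of its 512 subpages are reclaimed.
hp = [bytearray(4096) for _ in range(SUBPAGES)]
hp[0][:5] = b"hello"
print(zsr_walk([hp]))  # 511
```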

Yu Zhao has done similar work to accelerate the case where a huge
page is swapped out or migrated [1]. We instead do this in the
normal memory shrink path, covering the swap-off case, to avoid OOM.

In the future, we plan to reclaim "cold" huge pages proactively.
The goal is to keep the performance benefit of THP as far as
possible. In addition, some users want memory usage with THP to
equal the usage with 4K pages.

[1] https://lore.kernel.org/linux-mm/20210731063938.1391602-1-yuzhao@google.com/

Ning Zhang (6):
  mm, thp: introduce thp zero subpages reclaim
  mm, thp: add a global interface for zero subpages reclaim
  mm, thp: introduce zero subpages reclaim threshold
  mm, thp: introduce a controller to trigger zero subpages reclaim
  mm, thp: add some statistics for zero subpages reclaim
  mm, thp: add document for zero subpages reclaim

 Documentation/admin-guide/mm/transhuge.rst |  75 ++++++
 include/linux/huge_mm.h                    |  13 +
 include/linux/memcontrol.h                 |  26 ++
 include/linux/mm.h                         |   1 +
 include/linux/mm_types.h                   |   6 +
 include/linux/mmzone.h                     |   9 +
 mm/huge_memory.c                           | 374 ++++++++++++++++++++++++++++-
 mm/memcontrol.c                            | 243 +++++++++++++++++++
 mm/vmscan.c                                |  61 ++++-
 9 files changed, 805 insertions(+), 3 deletions(-)

Comments

Kirill A. Shutemov Oct. 28, 2021, 2:13 p.m. UTC | #1
On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
> As we know, thp may lead to memory bloat which may cause OOM.
> Through testing with some apps, we found that the reason of
> memory bloat is a huge page may contain some zero subpages
> (may accessed or not). And we found that most zero subpages
> are centralized in a few huge pages.
> 
> Following is a text_classification_rnn case for tensorflow:
> 
>   zero_subpages   huge_pages  waste
>   [     0,     1) 186         0.00%
>   [     1,     2) 23          0.01%
>   [     2,     4) 36          0.02%
>   [     4,     8) 67          0.08%
>   [     8,    16) 80          0.23%
>   [    16,    32) 109         0.61%
>   [    32,    64) 44          0.49%
>   [    64,   128) 12          0.30%
>   [   128,   256) 28          1.54%
>   [   256,   513) 159        18.03%
> 
> In the case, there are 187 huge pages (25% of the total huge pages)
> which contain more then 128 zero subpages. And these huge pages
> lead to 19.57% waste of the total rss. It means we can reclaim
> 19.57% memory by splitting the 187 huge pages and reclaiming the
> zero subpages.
> 
> This patchset introduce a new mechanism to split the huge page
> which has zero subpages and reclaim these zero subpages.
> 
> We add the anonymous huge page to a list to reduce the cost of
> finding the huge page. When the memory reclaim is triggering,
> the list will be walked and the huge page contains enough zero
> subpages may be reclaimed. Meanwhile, replace the zero subpages
> by ZERO_PAGE(0). 

Does it actually help your workload?

I mean this will only be triggered via vmscan that was going to split
pages and free anyway.

You prioritize splitting THP and freeing zero subpages over reclaiming
other pages. It may or may not be right thing to do, depending on
workload.

Maybe it makes more sense to check for all-zero pages just after
split_huge_page_to_list() in vmscan and free such pages immediately
rather than add all this complexity?
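In the kernel, the all-zero test could presumably reuse memchr_inv(), as Yu Zhao's series does. A toy userspace version of the suggested post-split pass (Python, illustrative only):

```python
PAGE_SIZE = 4096
ZERO = bytes(PAGE_SIZE)

def free_zero_after_split(subpages):
    """Model of the suggestion: after split_huge_page_to_list(), free any
    subpage that is entirely zero instead of paging it out."""
    kept = [sp for sp in subpages if bytes(sp) != ZERO]
    return kept, len(subpages) - len(kept)

kept, freed = free_zero_after_split([bytes(PAGE_SIZE), b"a" + bytes(PAGE_SIZE - 1)])
print(freed)  # 1
```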

> Yu Zhao has done some similar work when the huge page is swap out
> or migrated to accelerate[1]. While we do this in the normal memory
> shrink path for the swapoff scene to avoid OOM.
> 
> In the future, we will do the proactive reclaim to reclaim the "cold"
> huge page proactively. This is for keeping the performance of thp as
> for as possible. In addition to that, some users want the memory usage
> using thp is equal to the usage using 4K.

Proactive reclaim can be harmful if your max_ptes_none allows to recreate
THP back.
Ning Zhang Oct. 29, 2021, 12:07 p.m. UTC | #2
在 2021/10/28 下午10:13, Kirill A. Shutemov 写道:
> On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
>> As we know, thp may lead to memory bloat which may cause OOM.
>> Through testing with some apps, we found that the reason of
>> memory bloat is a huge page may contain some zero subpages
>> (may accessed or not). And we found that most zero subpages
>> are centralized in a few huge pages.
>>
>> Following is a text_classification_rnn case for tensorflow:
>>
>>    zero_subpages   huge_pages  waste
>>    [     0,     1) 186         0.00%
>>    [     1,     2) 23          0.01%
>>    [     2,     4) 36          0.02%
>>    [     4,     8) 67          0.08%
>>    [     8,    16) 80          0.23%
>>    [    16,    32) 109         0.61%
>>    [    32,    64) 44          0.49%
>>    [    64,   128) 12          0.30%
>>    [   128,   256) 28          1.54%
>>    [   256,   513) 159        18.03%
>>
>> In the case, there are 187 huge pages (25% of the total huge pages)
>> which contain more then 128 zero subpages. And these huge pages
>> lead to 19.57% waste of the total rss. It means we can reclaim
>> 19.57% memory by splitting the 187 huge pages and reclaiming the
>> zero subpages.
>>
>> This patchset introduce a new mechanism to split the huge page
>> which has zero subpages and reclaim these zero subpages.
>>
>> We add the anonymous huge page to a list to reduce the cost of
>> finding the huge page. When the memory reclaim is triggering,
>> the list will be walked and the huge page contains enough zero
>> subpages may be reclaimed. Meanwhile, replace the zero subpages
>> by ZERO_PAGE(0).
> Does it actually help your workload?
>
> I mean this will only be triggered via vmscan that was going to split
> pages and free anyway.
>
> You prioritize splitting THP and freeing zero subpages over reclaiming
> other pages. It may or may not be right thing to do, depending on
> workload.
>
> Maybe it makes more sense to check for all-zero pages just after
> split_huge_page_to_list() in vmscan and free such pages immediately rather
> then add all this complexity?
>
The purpose of zero subpages reclaim (ZSR) is to pick out the huge
pages which contain wasted memory and reclaim them.

We do this for two reasons:
1. If swap is off, anonymous pages will not be scanned, so we never
   get the opportunity to split the huge page. ZSR helps here.
2. If swap is on, splitting first will not only split the huge page
   but also swap out the nonzero subpages, while ZSR only splits the
   huge page. Splitting first results in more performance
   degradation. If ZSR can't reclaim enough pages, swap can still
   work.

Why use a separate ZSR list instead of the default LRU list?

Because scanning for target huge pages may cause high CPU overhead
if there are a lot of both regular and huge pages, and it may be
especially bad when swap is off: we may scan the whole LRU list many
times. A huge page is deleted from the ZSR list when it is scanned,
so each page is scanned only once. The LRU list is hard to use for
this because new pages may be added to it continuously while
scanning.

Also, we can lower the priority threshold to prioritize reclaiming
file-backed pages, for example by only triggering ZSR when the
priority is less than 4.
>> Yu Zhao has done some similar work when the huge page is swap out
>> or migrated to accelerate[1]. While we do this in the normal memory
>> shrink path for the swapoff scene to avoid OOM.
>>
>> In the future, we will do the proactive reclaim to reclaim the "cold"
>> huge page proactively. This is for keeping the performance of thp as
>> for as possible. In addition to that, some users want the memory usage
>> using thp is equal to the usage using 4K.
> Proactive reclaim can be harmful if your max_ptes_none allows to recreate
> THP back.
Thanks! We will consider it.
>
Michal Hocko Oct. 29, 2021, 1:38 p.m. UTC | #3
On Thu 28-10-21 19:56:49, Ning Zhang wrote:
> As we know, thp may lead to memory bloat which may cause OOM.
> Through testing with some apps, we found that the reason of
> memory bloat is a huge page may contain some zero subpages
> (may accessed or not). And we found that most zero subpages
> are centralized in a few huge pages.
> 
> Following is a text_classification_rnn case for tensorflow:
> 
>   zero_subpages   huge_pages  waste
>   [     0,     1) 186         0.00%
>   [     1,     2) 23          0.01%
>   [     2,     4) 36          0.02%
>   [     4,     8) 67          0.08%
>   [     8,    16) 80          0.23%
>   [    16,    32) 109         0.61%
>   [    32,    64) 44          0.49%
>   [    64,   128) 12          0.30%
>   [   128,   256) 28          1.54%
>   [   256,   513) 159        18.03%
> 
> In the case, there are 187 huge pages (25% of the total huge pages)
> which contain more then 128 zero subpages. And these huge pages
> lead to 19.57% waste of the total rss. It means we can reclaim
> 19.57% memory by splitting the 187 huge pages and reclaiming the
> zero subpages.

What is the THP policy configuration in your testing? I assume you are
using the defaults, right? That would be "always" for THP and "madvise"
for defrag. Would it make more sense to use the madvise mode for THP for
your workload? The THP code is rather complex, and just by looking at the
diffstat this adds quite a lot on top. Is this really worth it?
Ning Zhang Oct. 29, 2021, 4:12 p.m. UTC | #4
在 2021/10/29 下午9:38, Michal Hocko 写道:
> On Thu 28-10-21 19:56:49, Ning Zhang wrote:
>> As we know, thp may lead to memory bloat which may cause OOM.
>> Through testing with some apps, we found that the reason of
>> memory bloat is a huge page may contain some zero subpages
>> (may accessed or not). And we found that most zero subpages
>> are centralized in a few huge pages.
>>
>> Following is a text_classification_rnn case for tensorflow:
>>
>>    zero_subpages   huge_pages  waste
>>    [     0,     1) 186         0.00%
>>    [     1,     2) 23          0.01%
>>    [     2,     4) 36          0.02%
>>    [     4,     8) 67          0.08%
>>    [     8,    16) 80          0.23%
>>    [    16,    32) 109         0.61%
>>    [    32,    64) 44          0.49%
>>    [    64,   128) 12          0.30%
>>    [   128,   256) 28          1.54%
>>    [   256,   513) 159        18.03%
>>
>> In the case, there are 187 huge pages (25% of the total huge pages)
>> which contain more then 128 zero subpages. And these huge pages
>> lead to 19.57% waste of the total rss. It means we can reclaim
>> 19.57% memory by splitting the 187 huge pages and reclaiming the
>> zero subpages.
> What is the THP policy configuration in your testing? I assume you are
> using defaults right? That would be always for THP and madvise for
> defrag. Would it make more sense to use madvise mode for THP for your
> workload? The THP code is rather complex and just by looking at the
> diffstat this add quite a lot on top. Is this really worth it?

The THP configuration is always.

Madvise requires users to set MADV_HUGEPAGE themselves if they want
huge pages, but many users don't set it, and they can't control it
well.

For example with Java, users can set the heap and metaspace to use
huge pages with madvise, but there is still memory bloat. Users
still need to test whether their app can tolerate the waste.

For the case above, if we set the THP configuration to madvise, all
the pages it uses will be 4K pages.

Memory bloat is one of the most important reasons users disable
THP. We do this work so that THP can be enabled by default more
widely.
Yang Shi Oct. 29, 2021, 4:56 p.m. UTC | #5
On Fri, Oct 29, 2021 at 5:08 AM ning zhang <ningzhang@linux.alibaba.com> wrote:
>
>
> 在 2021/10/28 下午10:13, Kirill A. Shutemov 写道:
> > On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
> >> As we know, thp may lead to memory bloat which may cause OOM.
> >> Through testing with some apps, we found that the reason of
> >> memory bloat is a huge page may contain some zero subpages
> >> (may accessed or not). And we found that most zero subpages
> >> are centralized in a few huge pages.
> >>
> >> Following is a text_classification_rnn case for tensorflow:
> >>
> >>    zero_subpages   huge_pages  waste
> >>    [     0,     1) 186         0.00%
> >>    [     1,     2) 23          0.01%
> >>    [     2,     4) 36          0.02%
> >>    [     4,     8) 67          0.08%
> >>    [     8,    16) 80          0.23%
> >>    [    16,    32) 109         0.61%
> >>    [    32,    64) 44          0.49%
> >>    [    64,   128) 12          0.30%
> >>    [   128,   256) 28          1.54%
> >>    [   256,   513) 159        18.03%
> >>
> >> In the case, there are 187 huge pages (25% of the total huge pages)
> >> which contain more then 128 zero subpages. And these huge pages
> >> lead to 19.57% waste of the total rss. It means we can reclaim
> >> 19.57% memory by splitting the 187 huge pages and reclaiming the
> >> zero subpages.
> >>
> >> This patchset introduce a new mechanism to split the huge page
> >> which has zero subpages and reclaim these zero subpages.
> >>
> >> We add the anonymous huge page to a list to reduce the cost of
> >> finding the huge page. When the memory reclaim is triggering,
> >> the list will be walked and the huge page contains enough zero
> >> subpages may be reclaimed. Meanwhile, replace the zero subpages
> >> by ZERO_PAGE(0).
> > Does it actually help your workload?
> >
> > I mean this will only be triggered via vmscan that was going to split
> > pages and free anyway.
> >
> > You prioritize splitting THP and freeing zero subpages over reclaiming
> > other pages. It may or may not be right thing to do, depending on
> > workload.
> >
> > Maybe it makes more sense to check for all-zero pages just after
> > split_huge_page_to_list() in vmscan and free such pages immediately rather
> > then add all this complexity?
> >
> The purpose of zero subpages reclaim (ZSR) is to pick out the huge
> pages which contain wasted memory and reclaim them.
>
> We do this for two reasons:
> 1. If swap is off, anonymous pages will not be scanned, so we never
>    get the opportunity to split the huge page. ZSR helps here.
> 2. If swap is on, splitting first will not only split the huge page
>    but also swap out the nonzero subpages, while ZSR only splits the
>    huge page. Splitting first results in more performance
>    degradation. If ZSR can't reclaim enough pages, swap can still
>    work.
>
> Why use a separate ZSR list instead of the default LRU list?
>
> Because scanning for target huge pages may cause high CPU overhead
> if there are a lot of both regular and huge pages, and it may be
> especially bad when swap is off: we may scan the whole LRU list many
> times. A huge page is deleted from the ZSR list when it is scanned,
> so each page is scanned only once. The LRU list is hard to use for
> this because new pages may be added to it continuously while
> scanning.
>
> Also, we can lower the priority threshold to prioritize reclaiming
> file-backed pages, for example by only triggering ZSR when the
> priority is less than 4.

I'm not sure if this will help the workloads in general or not. The
problem is it doesn't check if the huge page is "hot" or not. It just
picks up the first huge page from the list, which seems like a FIFO
list IIUC. But if the huge page is "hot" even though there is some
internal access imbalance it may be better to keep the huge page since
the performance gain may outperform the memory saving. But if the huge
page is not "hot", then I think the question is why it is a THP in the
first place.

Let's step back to think about whether allocating THP upon first
access for such area or workload is good or not. We should be able to
check the access imbalance in allocation stage instead of reclaim
stage. Currently anonymous THP just supports 3 modes: always, madvise
and none. Both always and madvise tries to allocate THP in page fault
path (assuming anonymous THP) upon first access. I'm wondering if we
could add a "defer" mode or not. It defers THP allocation/collapse to
khugepaged instead of in page fault path. Then all the knobs used by
khugepaged could be applied, particularly max_ptes_none in your case.
You could set a low max_ptes_none if you prefer memory saving. IMHO,
this seems much simpler than scanning list (may be quite long) to find
out suitable candidate then split then replace to zero page.

Of course this may have some potential performance impact since the
THP install is delayed for some time. This could be optimized by
respecting  MADV_HUGEPAGE.

Anyway, just some wild idea.
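For reference, khugepaged's collapse decision already honors max_ptes_none (default 511, i.e. HPAGE_PMD_NR - 1). A toy model of the policy being described (Python, illustrative only):

```python
HPAGE_PMD_NR = 512  # PTEs covered by one PMD-sized huge page

def may_collapse(ptes, max_ptes_none):
    """Collapse a 512-PTE region into a THP only if at most max_ptes_none
    of its PTEs are unpopulated (None here models pte_none)."""
    none = sum(1 for pte in ptes if pte is None)
    return none <= max_ptes_none

# A sparsely populated region: collapsed under the permissive default,
# skipped under a conservative setting that prefers memory saving.
region = [None] * 400 + [object()] * 112
print(may_collapse(region, 511), may_collapse(region, 64))  # True False
```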

> >> Yu Zhao has done some similar work when the huge page is swap out
> >> or migrated to accelerate[1]. While we do this in the normal memory
> >> shrink path for the swapoff scene to avoid OOM.
> >>
> >> In the future, we will do the proactive reclaim to reclaim the "cold"
> >> huge page proactively. This is for keeping the performance of thp as
> >> for as possible. In addition to that, some users want the memory usage
> >> using thp is equal to the usage using 4K.
> > Proactive reclaim can be harmful if your max_ptes_none allows to recreate
> > THP back.
> Thanks! We will consider it.
> >
>
Ning Zhang Nov. 1, 2021, 2:50 a.m. UTC | #6
在 2021/10/30 上午12:56, Yang Shi 写道:
> On Fri, Oct 29, 2021 at 5:08 AM ning zhang <ningzhang@linux.alibaba.com> wrote:
>>
>> 在 2021/10/28 下午10:13, Kirill A. Shutemov 写道:
>>> On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
>>>> As we know, thp may lead to memory bloat which may cause OOM.
>>>> Through testing with some apps, we found that the reason of
>>>> memory bloat is a huge page may contain some zero subpages
>>>> (may accessed or not). And we found that most zero subpages
>>>> are centralized in a few huge pages.
>>>>
>>>> Following is a text_classification_rnn case for tensorflow:
>>>>
>>>>     zero_subpages   huge_pages  waste
>>>>     [     0,     1) 186         0.00%
>>>>     [     1,     2) 23          0.01%
>>>>     [     2,     4) 36          0.02%
>>>>     [     4,     8) 67          0.08%
>>>>     [     8,    16) 80          0.23%
>>>>     [    16,    32) 109         0.61%
>>>>     [    32,    64) 44          0.49%
>>>>     [    64,   128) 12          0.30%
>>>>     [   128,   256) 28          1.54%
>>>>     [   256,   513) 159        18.03%
>>>>
>>>> In the case, there are 187 huge pages (25% of the total huge pages)
>>>> which contain more then 128 zero subpages. And these huge pages
>>>> lead to 19.57% waste of the total rss. It means we can reclaim
>>>> 19.57% memory by splitting the 187 huge pages and reclaiming the
>>>> zero subpages.
>>>>
>>>> This patchset introduce a new mechanism to split the huge page
>>>> which has zero subpages and reclaim these zero subpages.
>>>>
>>>> We add the anonymous huge page to a list to reduce the cost of
>>>> finding the huge page. When the memory reclaim is triggering,
>>>> the list will be walked and the huge page contains enough zero
>>>> subpages may be reclaimed. Meanwhile, replace the zero subpages
>>>> by ZERO_PAGE(0).
>>> Does it actually help your workload?
>>>
>>> I mean this will only be triggered via vmscan that was going to split
>>> pages and free anyway.
>>>
>>> You prioritize splitting THP and freeing zero subpages over reclaiming
>>> other pages. It may or may not be right thing to do, depending on
>>> workload.
>>>
>>> Maybe it makes more sense to check for all-zero pages just after
>>> split_huge_page_to_list() in vmscan and free such pages immediately rather
>>> then add all this complexity?
>>>
>> The purpose of zero subpages reclaim (ZSR) is to pick out the huge
>> pages which contain wasted memory and reclaim them.
>>
>> We do this for two reasons:
>> 1. If swap is off, anonymous pages will not be scanned, so we never
>>    get the opportunity to split the huge page. ZSR helps here.
>> 2. If swap is on, splitting first will not only split the huge page
>>    but also swap out the nonzero subpages, while ZSR only splits the
>>    huge page. Splitting first results in more performance
>>    degradation. If ZSR can't reclaim enough pages, swap can still
>>    work.
>>
>> Why use a separate ZSR list instead of the default LRU list?
>>
>> Because scanning for target huge pages may cause high CPU overhead
>> if there are a lot of both regular and huge pages, and it may be
>> especially bad when swap is off: we may scan the whole LRU list many
>> times. A huge page is deleted from the ZSR list when it is scanned,
>> so each page is scanned only once. The LRU list is hard to use for
>> this because new pages may be added to it continuously while
>> scanning.
>>
>> Also, we can lower the priority threshold to prioritize reclaiming
>> file-backed pages, for example by only triggering ZSR when the
>> priority is less than 4.
> I'm not sure if this will help the workloads in general or not. The
> problem is it doesn't check if the huge page is "hot" or not. It just
> picks up the first huge page from the list, which seems like a FIFO
> list IIUC. But if the huge page is "hot" even though there is some
> internal access imbalance it may be better to keep the huge page since
> the performance gain may outperform the memory saving. But if the huge
> page is not "hot", then I think the question is why it is a THP in the
> first place.
We don't split all huge pages, just the ones that contain enough
zero subpages. It's hard to check whether an anonymous page is hot
or cold, and we are working on that.

We only scan 32 huge pages at most per pass, except in the last
loop of reclaim. I think we could start ZSR only when the priority
is 1 or 2, or maybe only when it is 0. In that case, if we don't
start ZSR, the process will be killed by OOM.
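The bound described above could look roughly like this (illustrative Python; ZSR_SCAN_BATCH is an invented name, and the priority cutoffs are the ones proposed, not merged code):

```python
ZSR_SCAN_BATCH = 32  # huge pages scanned per reclaim pass (except the last loop)

def zsr_shrink(zsr_list, priority, nr_to_reclaim, reclaim_page):
    """Run ZSR only under heavy pressure (low vmscan priority), scanning a
    bounded batch except in the final, priority-0 loop before OOM."""
    if priority > 2:
        return 0  # not desperate yet: let regular reclaim work first
    budget = len(zsr_list) if priority == 0 else ZSR_SCAN_BATCH
    reclaimed = 0
    while zsr_list and budget and reclaimed < nr_to_reclaim:
        page = zsr_list.pop(0)          # each page is scanned only once
        reclaimed += reclaim_page(page)
        budget -= 1
    return reclaimed
```

With a list of 100 candidate pages and a reclaim callback that frees one page each, priority 1 reclaims at most the batch of 32, while the last-resort priority-0 loop walks the whole list.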
>
> Let's step back to think about whether allocating THP upon first
> access for such area or workload is good or not. We should be able to
> check the access imbalance in allocation stage instead of reclaim
> stage. Currently anonymous THP just supports 3 modes: always, madvise
> and none. Both always and madvise tries to allocate THP in page fault
> path (assuming anonymous THP) upon first access. I'm wondering if we
> could add a "defer" mode or not. It defers THP allocation/collapse to
> khugepaged instead of in page fault path. Then all the knobs used by
> khugepaged could be applied, particularly max_ptes_none in your case.
> You could set a low max_ptes_none if you prefer memory saving. IMHO,
> this seems much simpler than scanning list (may be quite long) to find
> out suitable candidate then split then replace to zero page.
>
> Of course this may have some potential performance impact since the
> THP install is delayed for some time. This could be optimized by
> respecting  MADV_HUGEPAGE.
>
> Anyway, just some wild idea.
>
>>>> Yu Zhao has done some similar work when the huge page is swap out
>>>> or migrated to accelerate[1]. While we do this in the normal memory
>>>> shrink path for the swapoff scene to avoid OOM.
>>>>
>>>> In the future, we will do the proactive reclaim to reclaim the "cold"
>>>> huge page proactively. This is for keeping the performance of thp as
>>>> for as possible. In addition to that, some users want the memory usage
>>>> using thp is equal to the usage using 4K.
>>> Proactive reclaim can be harmful if your max_ptes_none allows to recreate
>>> THP back.
>> Thanks! We will consider it.
Michal Hocko Nov. 1, 2021, 9:20 a.m. UTC | #7
On Sat 30-10-21 00:12:53, ning zhang wrote:
> 
> 在 2021/10/29 下午9:38, Michal Hocko 写道:
> > On Thu 28-10-21 19:56:49, Ning Zhang wrote:
> > > As we know, thp may lead to memory bloat which may cause OOM.
> > > Through testing with some apps, we found that the reason of
> > > memory bloat is a huge page may contain some zero subpages
> > > (may accessed or not). And we found that most zero subpages
> > > are centralized in a few huge pages.
> > > 
> > > Following is a text_classification_rnn case for tensorflow:
> > > 
> > >    zero_subpages   huge_pages  waste
> > >    [     0,     1) 186         0.00%
> > >    [     1,     2) 23          0.01%
> > >    [     2,     4) 36          0.02%
> > >    [     4,     8) 67          0.08%
> > >    [     8,    16) 80          0.23%
> > >    [    16,    32) 109         0.61%
> > >    [    32,    64) 44          0.49%
> > >    [    64,   128) 12          0.30%
> > >    [   128,   256) 28          1.54%
> > >    [   256,   513) 159        18.03%
> > > 
> > > In the case, there are 187 huge pages (25% of the total huge pages)
> > > which contain more then 128 zero subpages. And these huge pages
> > > lead to 19.57% waste of the total rss. It means we can reclaim
> > > 19.57% memory by splitting the 187 huge pages and reclaiming the
> > > zero subpages.
> > What is the THP policy configuration in your testing? I assume you are
> > using defaults right? That would be always for THP and madvise for
> > defrag. Would it make more sense to use madvise mode for THP for your
> > workload? The THP code is rather complex and just by looking at the
> > diffstat this add quite a lot on top. Is this really worth it?
> 
> The THP configuration is always.
> 
> Madvise needs users to set MADV_HUGEPAGE by themselves if they want use huge
> page, while many users don't do set this, and they can't control this well.

What do you mean they can't control this well?

> Such as java, users can set heap and metaspace to use huge pages with
> madvise, but there is also memory bloat. Users still need to test whether
> their app can accept the waste.

There will always be some internal fragmentation when huge pages are
used. The amount will depend on how well the memory is used but huge
pages give a performance boost in return.

If the memory bloat is a significant problem then overeager THP usage is
certainly not good and I would argue that applying THP always policy is
not a proper configuration. No matter how much the MM code can try to
fix up the situation it will be always a catch up game.
 
> For the case above, if we set THP configuration to be madvise, all the pages
> it uses will be 4K-page.
> 
> Memory bloat is one of the most important reasons that users disable THP. 
> We do this to popularize THP to be default enabled.

To my knowledge the most popular reason to disable THP is the runtime
overhead. A large part of that overhead has been reduced by not doing
heavy compaction during the page fault allocations by default. Memory
overhead is certainly an important aspect as well but there is always
a possibility to reduce that by reducing it to madvised regions for
page fault (i.e. those where author of the code has considered the
costs vs. benefits of the huge page) and setting up a conservative
khugepaged policy. So there are existing tools available. You are trying
to add quite a lot of code so you should have good arguments to add more
complexity. I am not sure that popularizing THP is a strong one TBH.
Ning Zhang Nov. 8, 2021, 3:24 a.m. UTC | #8
在 2021/11/1 下午5:20, Michal Hocko 写道:
> On Sat 30-10-21 00:12:53, ning zhang wrote:
>> 在 2021/10/29 下午9:38, Michal Hocko 写道:
>>> On Thu 28-10-21 19:56:49, Ning Zhang wrote:
>>>> As we know, thp may lead to memory bloat which may cause OOM.
>>>> Through testing with some apps, we found that the reason of
>>>> memory bloat is a huge page may contain some zero subpages
>>>> (may accessed or not). And we found that most zero subpages
>>>> are centralized in a few huge pages.
>>>>
>>>> Following is a text_classification_rnn case for tensorflow:
>>>>
>>>>     zero_subpages   huge_pages  waste
>>>>     [     0,     1) 186         0.00%
>>>>     [     1,     2) 23          0.01%
>>>>     [     2,     4) 36          0.02%
>>>>     [     4,     8) 67          0.08%
>>>>     [     8,    16) 80          0.23%
>>>>     [    16,    32) 109         0.61%
>>>>     [    32,    64) 44          0.49%
>>>>     [    64,   128) 12          0.30%
>>>>     [   128,   256) 28          1.54%
>>>>     [   256,   513) 159        18.03%
>>>>
>>>> In the case, there are 187 huge pages (25% of the total huge pages)
>>>> which contain more then 128 zero subpages. And these huge pages
>>>> lead to 19.57% waste of the total rss. It means we can reclaim
>>>> 19.57% memory by splitting the 187 huge pages and reclaiming the
>>>> zero subpages.
>>> What is the THP policy configuration in your testing? I assume you are
>>> using defaults right? That would be always for THP and madvise for
>>> defrag. Would it make more sense to use madvise mode for THP for your
>>> workload? The THP code is rather complex and just by looking at the
>>> diffstat this add quite a lot on top. Is this really worth it?
>> The THP configuration is always.
>>
>> Madvise needs users to set MADV_HUGEPAGE by themselves if they want use huge
>> page, while many users don't do set this, and they can't control this well.
> What do you mean tey can't control this well?

I mean they don't know where they should use THP.

And even if they use madvise, memory bloat still exists.

>
>> Such as java, users can set heap and metaspace to use huge pages with
>> madvise, but there is also memory bloat. Users still need to test whether
>> their app can accept the waste.
> There will always be some internal fragmentation when huge pages are
> used. The amount will depend on how well the memory is used but huge
> pages give a performance boost in return.
>
> If the memory bloat is a significant problem then overeager THP usage is
> certainly not good and I would argue that applying THP always policy is
> not a proper configuration. No matter how much the MM code can try to
> fix up the situation it will be always a catch up game.
>   
>> For the case above, if we set THP configuration to be madvise, all the pages
>> it uses will be 4K-page.
>>
>> Memory bloat is one of the most important reasons that users disable THP.
>> We do this to popularize THP to be default enabled.
> To my knowledge the most popular reason to disable THP is the runtime
> overhead. A large part of that overhead has been reduced by not doing
> heavy compaction during the page fault allocations by default. Memory
> overhead is certainly an important aspect as well but there is always
> a possibility to reduce that by reducing it to madvised regions for
> page fault (i.e. those where author of the code has considered the
> costs vs. benefits of the huge page) and setting up a conservative
> khugepaged policy. So there are existing tools available. You are trying
> to add quite a lot of code so you should have good arguments to add more
> complexity. I am not sure that popularizing THP is a strong one TBH.

Sorry for replying late. For compaction, we can set the defrag
policy of THP to defer or never, to avoid the overhead produced by
direct reclaim. However, there is no way to reduce memory bloat.

If memory usage reaches the limit and we can't reclaim any pages,
the OOM killer will be triggered and the process will be killed.
Our patchset aims to avoid that OOM.

Much of the code is the interface to control ZSR, and we will try
to reduce the complexity.