[RFC,0/2] Add disable_unmap_file arg to memory.reclaim

Message ID 20240829101918.3454840-1-hezhongkun.hzk@bytedance.com (mailing list archive)

Message

Zhongkun He Aug. 29, 2024, 10:19 a.m. UTC
This patch proposes augmenting the memory.reclaim interface with a
disable_unmap_file argument that will skip the mapped pages in
that reclaim attempt.

For example:

echo "2M disable_unmap_file" > /sys/fs/cgroup/test/memory.reclaim

will reclaim from the test cgroup while skipping mapped file pages.

memory.reclaim is a useful interface: it lets user space carry out
proactive memory reclaim, which can improve memory utilization.

In actual usage we found that even when anonymous pages are plentiful,
mapped file pages, which make up a relatively small share of memory,
are still reclaimed. This is likely to increase refaults and task
delay, because mapped file pages usually hold important executable
code, data, and shared libraries. In our testing, skipping this part
of memory reduces task delay.

IMO, it is difficult to balance the priorities of the various page types
in the kernel because there are too many scenarios to consider. For
proactive memory reclaim driven from user space, however, we can make
a simple judgment ourselves.

Zhongkun He (2):
  mm: vmscan: modify the semantics of scan_control.may_unmap to
    UNMAP_ANON and UNMAP_FILE
  mm: memcg: add disbale_unmap_file arg to memory.reclaim

 include/linux/swap.h |  1 +
 mm/memcontrol.c      |  9 ++++--
 mm/vmscan.c          | 65 ++++++++++++++++++++++++++++++++++----------
 3 files changed, 59 insertions(+), 16 deletions(-)

Comments

Michal Hocko Aug. 29, 2024, 10:23 a.m. UTC | #1
On Thu 29-08-24 18:19:16, Zhongkun He wrote:
> [...]
> 
> In actual usage we found that even when anonymous pages are plentiful,
> mapped file pages, which make up a relatively small share of memory,
> are still reclaimed. This is likely to increase refaults and task
> delay, because mapped file pages usually hold important executable
> code, data, and shared libraries. In our testing, skipping this part
> of memory reduces task delay.

Do you have examples of workloads where this demonstrably helps and
where the behavior cannot be tuned via swappiness?
Zhongkun He Aug. 29, 2024, 10:37 a.m. UTC | #2
On Thu, Aug 29, 2024 at 6:24 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 29-08-24 18:19:16, Zhongkun He wrote:
> > [...]
>
> Do you have examples of workloads where this demonstrably helps and
> where the behavior cannot be tuned via swappiness?

Sorry, I put the test workload in the second patch. Please have a look.

Even when anonymous pages are plentiful and there is only a small amount
of page cache and mapped file pages, mapped file pages are still reclaimed.
Here is an example where anonymous pages are sufficient but mapped file
pages are reclaimed anyway.

Swappiness is set to the maximum value.

cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 3406462976
file 332967936
file_mapped 300302336

echo "1g swappiness=200" > memory.reclaim
cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 2613276672
file 52523008
file_mapped 30982144

echo "1g swappiness=200" > memory.reclaim
cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 1552130048
file 39759872
file_mapped 20299776

With this patch, the file_mapped pages will be skipped.

echo "1g swappiness=200 disable_unmap_file" > memory.reclaim
cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 480059392
file 37978112
file_mapped 20299776
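
The before/after measurements above can be wrapped in a small helper for repeated runs. This is only a sketch: the function names and the cgroup directory argument are illustrative, and the disable_unmap_file keyword would be accepted only by a kernel carrying this RFC.

```shell
#!/bin/sh
# snapshot CGROUP_DIR
# Print the anon, file and file_mapped lines of the cgroup's memory.stat,
# using the same grep filter as the commands above.
snapshot() {
    grep -wE 'anon|file|file_mapped' "$1/memory.stat"
}

# reclaim_once CGROUP_DIR REQUEST
# Show the counters, write REQUEST to memory.reclaim, then show them again.
# REQUEST is passed through unchanged, e.g. "1g swappiness=200".
reclaim_once() {
    snapshot "$1"
    printf '%s\n' "$2" > "$1/memory.reclaim"
    snapshot "$1"
}
```

A run matching the experiment above would be `reclaim_once /sys/fs/cgroup/test "1g swappiness=200"`, executed as root.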



> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 29, 2024, 11:50 a.m. UTC | #3
On Thu 29-08-24 18:37:07, Zhongkun He wrote:
> On Thu, Aug 29, 2024 at 6:24 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 29-08-24 18:19:16, Zhongkun He wrote:
> > > [...]
> >
> > Do you have examples of workloads where this demonstrably helps and
> > where the behavior cannot be tuned via swappiness?
> 
> Sorry, I put the test workload in the second patch. Please have a look.

I missed those because they are not threaded to the cover letter. You
can either use --in-reply-to when sending patches separately from the
cover letter, or you can use --compose/--cover-letter when sending
patches through git-send-email.
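
As an illustration of the --in-reply-to route, the sketch below only prints the command it would run rather than sending anything. The Message-ID is the cover letter's from this thread; the helper name and the patch file names are placeholders.

```shell
#!/bin/sh
# Message-ID of the cover letter, copied from this thread.
cover_msgid='<20240829101918.3454840-1-hezhongkun.hzk@bytedance.com>'

# threaded_send_cmd PATCH...
# Print (do not run) a git send-email command that sends the given patch
# files as replies to the cover letter.
threaded_send_cmd() {
    printf 'git send-email --in-reply-to=%s' "'$cover_msgid'"
    for patch in "$@"; do
        printf ' %s' "$patch"
    done
    printf '\n'
}

# Placeholder file names standing in for the two patches of the series.
threaded_send_cmd 0001-first.patch 0002-second.patch
```

Printing the command first makes it easy to check the Message-ID before anything reaches the list.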

> Even if there are sufficient anonymous pages and a small number of
> page cache and mapped file pages, mapped file pages will still be reclaimed.
> Here is an example of anonymous pages being sufficient but mapped
> file pages still being reclaimed:
> Swappiness has been set to the maximum value.
> 
> cat memory.stat | grep -wE 'anon|file|file_mapped'
> anon 3406462976
> file 332967936
> file_mapped 300302336
> 
> echo "1g swappiness=200" > memory.reclaim
> cat memory.stat | grep -wE 'anon|file|file_mapped'
> anon 2613276672
> file 52523008
> file_mapped 30982144

This seems to be a 73% (anon) vs 27% (file) balance. About 90% of the
file LRU seems to be mapped, which matches about 90% of the memory
reclaimed from the file LRU being mapped. So the reclaim is proportional there.
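
For reference, the split can be rechecked from the two memory.stat snapshots quoted above. The sketch below is plain arithmetic on those byte counts; the variable names and the pct helper are illustrative, and it treats the drop in file_mapped as mapped memory that was reclaimed, which is an approximation.

```shell
#!/bin/sh
# Byte counts copied from the memory.stat snapshots quoted above.
anon_before=3406462976;  anon_after=2613276672
file_before=332967936;   file_after=52523008
mapped_before=300302336; mapped_after=30982144

# pct PART WHOLE: print PART as a percentage of WHOLE with one decimal.
pct() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", 100 * part / whole }'
}

# How much each counter dropped across the first 1g reclaim request.
anon_delta=$((anon_before - anon_after))
file_delta=$((file_before - file_after))
mapped_delta=$((mapped_before - mapped_after))
total_delta=$((anon_delta + file_delta))

anon_share=$(pct "$anon_delta" "$total_delta")
file_share=$(pct "$file_delta" "$total_delta")
mapped_before_share=$(pct "$mapped_before" "$file_before")
mapped_reclaimed_share=$(pct "$mapped_delta" "$file_delta")

echo "reclaimed from anon:            ${anon_share}%"
echo "reclaimed from file:            ${file_share}%"
echo "file LRU mapped before reclaim: ${mapped_before_share}%"
echo "drop in file that was mapped:   ${mapped_reclaimed_share}%"
```

With these inputs the anon/file split comes out at roughly 74% to 26%, and about 90% of the file LRU was mapped before the reclaim, in line with the rounded figures above.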

But I do understand that this is still unexpected when swappiness=200
should make reclaim anon oriented. Is this MGLRU or regular LRU
implementation?

Is this some artificial workload or something real world?
Zhongkun He Aug. 29, 2024, 1:15 p.m. UTC | #4
On Thu, Aug 29, 2024 at 7:51 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 29-08-24 18:37:07, Zhongkun He wrote:
> > On Thu, Aug 29, 2024 at 6:24 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 29-08-24 18:19:16, Zhongkun He wrote:
> > > > [...]
> > >
> > > Do you have examples of workloads where this demonstrably helps and
> > > where the behavior cannot be tuned via swappiness?
> >
> > Sorry, I put the test workload in the second patch. Please have a look.
>
> I missed those because they are not threaded to the cover letter. You
> can either use --in-reply-to when sending patches separately from the
> cover letter, or you can use --compose/--cover-letter when sending
> patches through git-send-email.

Got it, thanks. I encountered a problem after sending the cover letter, so
I resent the others without --in-reply-to.

>
> > Even if there are sufficient anonymous pages and a small number of
> > page cache and mapped file pages, mapped file pages will still be reclaimed.
> > Here is an example of anonymous pages being sufficient but mapped
> > file pages still being reclaimed:
> > Swappiness has been set to the maximum value.
> >
> > cat memory.stat | grep -wE 'anon|file|file_mapped'
> > anon 3406462976
> > file 332967936
> > file_mapped 300302336
> >
> > echo "1g swappiness=200" > memory.reclaim
> > cat memory.stat | grep -wE 'anon|file|file_mapped'
> > anon 2613276672
> > file 52523008
> > file_mapped 30982144
>
> This seems to be a 73% (anon) vs 27% (file) balance. About 90% of the
> file LRU seems to be mapped, which matches about 90% of the memory
> reclaimed from the file LRU being mapped. So the reclaim is proportional there.
>
> But I do understand that this is still unexpected when swappiness=200
> should make reclaim anon oriented. Is this MGLRU or regular LRU
> implementation?
>

This is the regular LRU implementation. MGLRU has the same problem
but performs better. Please have a look:

root@vm:/sys/fs/cgroup/test# cat /sys/kernel/mm/lru_gen/enabled
0x0007

root@vm:/sys/fs/cgroup/test# cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 3310338048
file 293498880
file_mapped 273506304

root@vm:/sys/fs/cgroup/test# echo "1g swappiness=200" > memory.reclaim

root@vm:/sys/fs/cgroup/test# cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 2373173248
file 157233152
file_mapped 146173952

root@vm:/sys/fs/cgroup/test# echo "1g swappiness=200" > memory.reclaim
root@vm:/sys/fs/cgroup/test# cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 1370886144
file 85663744
file_mapped 78118912

> Is this some artificial workload or something real world?
>

This is an artificial workload that makes the details of this case easier
to show, but we have encountered this problem on our servers. If the disk
is slow, like an HDD, the situation becomes even worse: task delay grows
because reading data back takes longer, and hot pages thrash repeatedly
between memory and the disk. That also puts more pressure on the disk, so
other tasks using the same disk are affected as well.

That was the background of this case.

>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 29, 2024, 1:36 p.m. UTC | #5
On Thu 29-08-24 21:15:50, Zhongkun He wrote:
> On Thu, Aug 29, 2024 at 7:51 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Is this some artificial workload or something real world?
> >
> 
> This is an artificial workload to show the detail of this case more
> easily. But we have encountered this problem on our servers.

This is always good to mention in the changelog. If you can observe this
in real workloads it is good to get numbers from those because
artificial workloads tend to overshoot the underlying problem and we can
potentially miss the practical contributors to the problem.

Seeing this, my main question is whether we should focus on swappiness
behavior rather than adding a very strange and very targeted reclaim
mode. After all, we already have protection for mapped memory and
executables in place. So in the end this is more about the balance
between the anon and file LRUs.

> If the performance of the disk is poor, like HDD, the situation will
> become even worse.

Doesn't that impact swapin/out as well? Or do you happen to have a
faster storage for the swap?

> The delay of the task becomes more serious because reading data will
> be slower.  Hot pages will thrash repeatedly between the memory and
> the disk. 

Don't the refault stats and the IO cost aspect of reclaim already deal
with this situation when balancing the LRUs? Why doesn't that work in
your case? Have you tried to investigate that?
Zhongkun He Aug. 29, 2024, 2:30 p.m. UTC | #6
On Thu, Aug 29, 2024 at 9:36 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 29-08-24 21:15:50, Zhongkun He wrote:
> > On Thu, Aug 29, 2024 at 7:51 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > Is this some artificial workload or something real world?
> > >
> >
> > This is an artificial workload to show the detail of this case more
> > easily. But we have encountered this problem on our servers.
>
> This is always good to mention in the changelog. If you can observe this
> in real workloads it is good to get numbers from those because
> artificial workloads tend to overshoot the underlying problem and we can
> potentially miss the practical contributors to the problem.

That sounds reasonable. I will try it.

>
> Seeing this, my main question is whether we should focus on swappiness
> behavior rather than adding a very strange and very targeted reclaim
> mode. After all, we already have protection for mapped memory and
> executables in place. So in the end this is more about the balance
> between the anon and file LRUs.
>

I have a question about swappiness. With swappiness=0 we reclaim only
file pages, but there is no option to disable reclaim from file pages,
even though there are fast swap backends that need no IO, like zram
and zswap. I wonder if we can give it a try in this direction.

> > If the performance of the disk is poor, like HDD, the situation will
> > become even worse.
>
> Doesn't that impact swapin/out as well? Or do you happen to have a
> faster storage for the swap?

Yes, we use ZRAM as the swap storage.

>
> > The delay of the task becomes more serious because reading data will
> > be slower.  Hot pages will thrash repeatedly between the memory and
> > the disk.
>
> Don't the refault stats and the IO cost aspect of reclaim already deal
> with this situation when balancing the LRUs? Why doesn't that work in
> your case? Have you tried to investigate that?

OK, I'll try to reproduce the problem again. But IIUC, we cannot reclaim
pages from only one side. Please see commit d483a5dd009 ("mm: vmscan:
limit the range of LRU type balancing") [1].

Unless this condition is met:
sc->file_is_tiny =
            file + free <= total_high_wmark &&
            !(sc->may_deactivate & DEACTIVATE_ANON) &&
            anon >> sc->priority;

[1]: https://lore.kernel.org/all/20200520232525.798933-15-hannes@cmpxchg.org/T/#u

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 29, 2024, 3 p.m. UTC | #7
On Thu 29-08-24 22:30:09, Zhongkun He wrote:
> On Thu, Aug 29, 2024 at 9:36 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Seeing this, my main question is whether we should focus on swappiness
> > behavior rather than adding a very strange and very targeted reclaim
> > mode. After all, we already have protection for mapped memory and
> > executables in place. So in the end this is more about the balance
> > between the anon and file LRUs.
> >
> 
> I have a question about swappiness. With swappiness=0 we reclaim only
> file pages, but there is no option to disable reclaim from file pages,
> even though there are fast swap backends that need no IO, like zram
> and zswap. I wonder if we can give it a try in this direction.

I do not think we should give any guarantee that 200 will only reclaim
anon pages. But having that heavily anon oriented makes sense and I
thought this was an existing semantic.

[...]
> > > The delay of the task becomes more serious because reading data will
> > > be slower.  Hot pages will thrash repeatedly between the memory and
> > > the disk.
> >
> > Don't the refault stats and the IO cost aspect of reclaim already deal
> > with this situation when balancing the LRUs? Why doesn't that work in
> > your case? Have you tried to investigate that?
> 
> OK, I'll try to reproduce the problem again. But IIUC, we cannot reclaim
> pages from only one side. Please see commit d483a5dd009 ("mm: vmscan:
> limit the range of LRU type balancing") [1].
> 
> Unless this condition is met:
> sc->file_is_tiny =
>             file + free <= total_high_wmark &&
>             !(sc->may_deactivate & DEACTIVATE_ANON) &&
>             anon >> sc->priority;

There have been some changes in this area where swappiness was treated
differently so it would make sense to investigate with the current mm
tree.

> [1]: https://lore.kernel.org/all/20200520232525.798933-15-hannes@cmpxchg.org/T/#u
> 
> > --
> > Michal Hocko
> > SUSE Labs