Message ID | 20240829101918.3454840-1-hezhongkun.hzk@bytedance.com (mailing list archive) |
---|---|
Series | Add disable_unmap_file arg to memory.reclaim |
On Thu 29-08-24 18:19:16, Zhongkun He wrote:
> This patch proposes augmenting the memory.reclaim interface with a
> disable_unmap_file argument that will skip the mapped pages in
> that reclaim attempt.
>
> For example:
>
> echo "2M disable_unmap_file" > /sys/fs/cgroup/test/memory.reclaim
>
> will perform reclaim on the test cgroup with no mapped file page.
>
> The memory.reclaim is a useful interface. We can carry out proactive
> memory reclaim in the user space, which can increase the utilization
> rate of memory.
>
> In the actual usage scenarios, we found that when there are sufficient
> anonymous pages, mapped file pages with a relatively small proportion
> would still be reclaimed. This is likely to cause an increase in
> refaults and an increase in task delay, because mapped file pages
> usually include important executable codes, data, and shared libraries,
> etc. According to the verified situation, if we can skip this part of
> the memory, the task delay will be reduced.

Do you have examples of workloads where this demonstrably helps and
cannot be tuned via swappiness?
On Thu, Aug 29, 2024 at 6:24 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 29-08-24 18:19:16, Zhongkun He wrote:
> > This patch proposes augmenting the memory.reclaim interface with a
> > disable_unmap_file argument that will skip the mapped pages in
> > that reclaim attempt.
> >
> > For example:
> >
> > echo "2M disable_unmap_file" > /sys/fs/cgroup/test/memory.reclaim
> >
> > will perform reclaim on the test cgroup with no mapped file page.
> >
> > The memory.reclaim is a useful interface. We can carry out proactive
> > memory reclaim in the user space, which can increase the utilization
> > rate of memory.
> >
> > In the actual usage scenarios, we found that when there are sufficient
> > anonymous pages, mapped file pages with a relatively small proportion
> > would still be reclaimed. This is likely to cause an increase in
> > refaults and an increase in task delay, because mapped file pages
> > usually include important executable codes, data, and shared libraries,
> > etc. According to the verified situation, if we can skip this part of
> > the memory, the task delay will be reduced.
>
> Do you have examples of workloads where this demonstrably helps and
> cannot be tuned via swappiness?

Sorry, I put the test workload in the second patch. Please have a look.

Even if there are sufficient anonymous pages and only a small number of
page cache and mapped file pages, mapped file pages will still be reclaimed.
Here is an example of anonymous pages being sufficient but mapped
file pages still being reclaimed.
Swappiness has been set to the maximum value.

cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 3406462976
file 332967936
file_mapped 300302336

echo 1g > memory.reclaim swappiness=200 > memory.reclaim
cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 2613276672
file 52523008
file_mapped 30982144

echo 1g > memory.reclaim swappiness=200 > memory.reclaim
cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 1552130048
file 39759872
file_mapped 20299776

With this patch, the file_mapped pages will be skipped.

echo 1g > memory.reclaim swappiness=200 disable_unmap_file > memory.reclaim
cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 480059392
file 37978112
file_mapped 20299776

> --
> Michal Hocko
> SUSE Labs
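Editorial note: the effect of each reclaim request in the transcript above is easier to see as per-step differences. The following is a minimal Python sketch, not part of the patch. The helper name `step_delta` is hypothetical, the numbers are copied from the memory.stat output above, and a drop in a counter is only assumed to approximate what was reclaimed, since the workload may also change these counters on its own.

```python
# memory.stat snapshots copied from the transcript above (values in bytes).
snapshots = [
    {"anon": 3406462976, "file": 332967936, "file_mapped": 300302336},
    {"anon": 2613276672, "file": 52523008, "file_mapped": 30982144},
    {"anon": 1552130048, "file": 39759872, "file_mapped": 20299776},
    {"anon": 480059392, "file": 37978112, "file_mapped": 20299776},
]
# Arguments written to memory.reclaim before each following snapshot.
labels = [
    "1g swappiness=200",
    "1g swappiness=200",
    "1g swappiness=200 disable_unmap_file",
]


def step_delta(before, after):
    """Bytes by which each counter dropped between two snapshots."""
    return {key: before[key] - after[key] for key in before}


deltas = [step_delta(a, b) for a, b in zip(snapshots, snapshots[1:])]

for label, delta in zip(labels, deltas):
    total = delta["anon"] + delta["file"]
    print(
        f"{label}: anon -{delta['anon']}, file -{delta['file']}, "
        f"file_mapped -{delta['file_mapped']}, "
        f"anon share {delta['anon'] / total:.1%}"
    )
```

For the first request, roughly three quarters of the drop comes from anon and the rest from file, and almost all of the file drop is in file_mapped. For the request with disable_unmap_file, file_mapped does not change at all.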
On Thu 29-08-24 18:37:07, Zhongkun He wrote:
> On Thu, Aug 29, 2024 at 6:24 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 29-08-24 18:19:16, Zhongkun He wrote:
> > > This patch proposes augmenting the memory.reclaim interface with a
> > > disable_unmap_file argument that will skip the mapped pages in
> > > that reclaim attempt.
> > >
> > > For example:
> > >
> > > echo "2M disable_unmap_file" > /sys/fs/cgroup/test/memory.reclaim
> > >
> > > will perform reclaim on the test cgroup with no mapped file page.
> > >
> > > The memory.reclaim is a useful interface. We can carry out proactive
> > > memory reclaim in the user space, which can increase the utilization
> > > rate of memory.
> > >
> > > In the actual usage scenarios, we found that when there are sufficient
> > > anonymous pages, mapped file pages with a relatively small proportion
> > > would still be reclaimed. This is likely to cause an increase in
> > > refaults and an increase in task delay, because mapped file pages
> > > usually include important executable codes, data, and shared libraries,
> > > etc. According to the verified situation, if we can skip this part of
> > > the memory, the task delay will be reduced.
> >
> > Do you have examples of workloads where this demonstrably helps and
> > cannot be tuned via swappiness?
>
> Sorry, I put the test workload in the second patch. Please have a look.

I have missed those as they are not threaded to the cover letter. You
can either use --in-reply-to when sending patches separately from the
cover letter, or you can use --compose/--cover-letter when sending patches
through git-send-email.

> Even if there are sufficient anonymous pages and a small number of
> page cache and mapped file pages, mapped file pages will still be reclaimed.
> Here is an example of anonymous pages being sufficient but mapped
> file pages still being reclaimed:
> Swappiness has been set to the maximum value.
>
> cat memory.stat | grep -wE 'anon|file|file_mapped'
> anon 3406462976
> file 332967936
> file_mapped 300302336
>
> echo 1g > memory.reclaim swappiness=200 > memory.reclaim
> cat memory.stat | grep -wE 'anon|file|file_mapped'
> anon 2613276672
> file 52523008
> file_mapped 30982144

This seems to be a 73% (anon) vs 27% (file) balance. 90% of the
file LRU seems to be mapped, which matches 90% of the reclaimed file LRU
memory being mapped. So the reclaim is proportional there.

But I do understand that this is still unexpected when swappiness=200
should make reclaim anon oriented. Is this MGLRU or regular LRU
implementation?

Is this some artificial workload or something real world?
On Thu, Aug 29, 2024 at 7:51 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 29-08-24 18:37:07, Zhongkun He wrote:
> > On Thu, Aug 29, 2024 at 6:24 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 29-08-24 18:19:16, Zhongkun He wrote:
> > > > This patch proposes augmenting the memory.reclaim interface with a
> > > > disable_unmap_file argument that will skip the mapped pages in
> > > > that reclaim attempt.
> > > >
> > > > For example:
> > > >
> > > > echo "2M disable_unmap_file" > /sys/fs/cgroup/test/memory.reclaim
> > > >
> > > > will perform reclaim on the test cgroup with no mapped file page.
> > > >
> > > > The memory.reclaim is a useful interface. We can carry out proactive
> > > > memory reclaim in the user space, which can increase the utilization
> > > > rate of memory.
> > > >
> > > > In the actual usage scenarios, we found that when there are sufficient
> > > > anonymous pages, mapped file pages with a relatively small proportion
> > > > would still be reclaimed. This is likely to cause an increase in
> > > > refaults and an increase in task delay, because mapped file pages
> > > > usually include important executable codes, data, and shared libraries,
> > > > etc. According to the verified situation, if we can skip this part of
> > > > the memory, the task delay will be reduced.
> > >
> > > Do you have examples of workloads where this demonstrably helps and
> > > cannot be tuned via swappiness?
> >
> > Sorry, I put the test workload in the second patch. Please have a look.
>
> I have missed those as they are not threaded to the cover letter. You
> can either use --in-reply-to when sending patches separately from the
> cover letter, or you can use --compose/--cover-letter when sending patches
> through git-send-email.

Got it, thanks. I encountered a problem after sending the cover letter,
so I resent the others without --in-reply-to.

>
> > Even if there are sufficient anonymous pages and a small number of
> > page cache and mapped file pages, mapped file pages will still be reclaimed.
> > Here is an example of anonymous pages being sufficient but mapped
> > file pages still being reclaimed:
> > Swappiness has been set to the maximum value.
> >
> > cat memory.stat | grep -wE 'anon|file|file_mapped'
> > anon 3406462976
> > file 332967936
> > file_mapped 300302336
> >
> > echo 1g > memory.reclaim swappiness=200 > memory.reclaim
> > cat memory.stat | grep -wE 'anon|file|file_mapped'
> > anon 2613276672
> > file 52523008
> > file_mapped 30982144
>
> This seems to be a 73% (anon) vs 27% (file) balance. 90% of the
> file LRU seems to be mapped, which matches 90% of the reclaimed file LRU
> memory being mapped. So the reclaim is proportional there.
>
> But I do understand that this is still unexpected when swappiness=200
> should make reclaim anon oriented. Is this MGLRU or regular LRU
> implementation?
>

This is the regular LRU implementation. MGLRU has the same problem
but performs better. Please have a look:

root@vm:/sys/fs/cgroup/test# cat /sys/kernel/mm/lru_gen/enabled
0x0007

root@vm:/sys/fs/cgroup/test# cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 3310338048
file 293498880
file_mapped 273506304

root@vm:/sys/fs/cgroup/test# echo 1g > memory.reclaim swappiness=200 > memory.reclaim
root@vm:/sys/fs/cgroup/test# cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 2373173248
file 157233152
file_mapped 146173952

root@vm:/sys/fs/cgroup/test# echo 1g > memory.reclaim swappiness=200 > memory.reclaim
root@vm:/sys/fs/cgroup/test# cat memory.stat | grep -wE 'anon|file|file_mapped'
anon 1370886144
file 85663744
file_mapped 78118912

> Is this some artificial workload or something real world?
>

This is an artificial workload to show the detail of this case more
easily. But we have encountered this problem on our servers.

If the disk performs poorly, like an HDD, the situation becomes even
worse. The task delay becomes more serious because reading data is
slower. Hot pages thrash repeatedly between memory and the disk. At the
same time, the pressure on the disk is also greater. If many tasks use
this disk, other tasks are affected as well.

That was the background of this case.

>
> --
> Michal Hocko
> SUSE Labs
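Editorial note: one way to put a number on "performs better" is to look at how much of the initial file_mapped memory is still present after each 1g request under the two implementations. This is a rough Python sketch using only the file_mapped values posted in the thread. The helper `retained_fraction` is hypothetical, and the comparison is only indicative because the two runs started from slightly different states and the counter says nothing about task latency.

```python
# file_mapped counters (bytes) copied from the two transcripts in the thread:
# before reclaim, after the first 1g request, after the second 1g request.
file_mapped = {
    "regular LRU": [300302336, 30982144, 20299776],
    "MGLRU": [273506304, 146173952, 78118912],
}


def retained_fraction(samples):
    """Fraction of the initial file_mapped memory left after each step."""
    initial = samples[0]
    return [value / initial for value in samples[1:]]


for name, samples in file_mapped.items():
    fractions = ", ".join(f"{f:.2f}" for f in retained_fraction(samples))
    print(f"{name}: {fractions}")
```

With the posted values, the regular LRU keeps about a tenth of the mapped file memory after the first request, while MGLRU keeps about half, so mapped file pages are still reclaimed in both cases but more slowly under MGLRU.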
On Thu 29-08-24 21:15:50, Zhongkun He wrote:
> On Thu, Aug 29, 2024 at 7:51 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Is this some artificial workload or something real world?
> >
>
> This is an artificial workload to show the detail of this case more
> easily. But we have encountered this problem on our servers.

This is always good to mention in the changelog. If you can observe this
in real workloads it is good to get numbers from those, because
artificial workloads tend to overshoot the underlying problem and we can
potentially miss the practical contributors to the problem.

Seeing this, my main question is whether we should focus on swappiness
behavior more than adding a very strange and very targeted reclaim
mode. After all, we have mapped memory and executable protection in
place. So in the end this is more about the balance between anon and file
LRUs.

> If the performance of the disk is poor, like HDD, the situation will
> become even worse.

Doesn't that impact swapin/out as well? Or do you happen to have
faster storage for the swap?

> The delay of the task becomes more serious because reading data will
> be slower. Hot pages will thrash repeatedly between the memory and
> the disk.

Don't the refault stats and the IO cost aspect of reclaim already deal
with this situation when balancing LRUs? Why doesn't it work in your
case? Have you tried to investigate that?
On Thu, Aug 29, 2024 at 9:36 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 29-08-24 21:15:50, Zhongkun He wrote:
> > On Thu, Aug 29, 2024 at 7:51 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > Is this some artificial workload or something real world?
> > >
> >
> > This is an artificial workload to show the detail of this case more
> > easily. But we have encountered this problem on our servers.
>
> This is always good to mention in the changelog. If you can observe this
> in real workloads it is good to get numbers from those, because
> artificial workloads tend to overshoot the underlying problem and we can
> potentially miss the practical contributors to the problem.

That sounds reasonable. I will try it.

>
> Seeing this, my main question is whether we should focus on swappiness
> behavior more than adding a very strange and very targeted reclaim
> mode. After all, we have mapped memory and executable protection in
> place. So in the end this is more about the balance between anon and file
> LRUs.
>

I have a question about swappiness. If swappiness=0 is set, we can only
reclaim file pages. However, we do not have an option to disable reclaim
from file pages, which would be useful because there is faster swap storage
without IO, like zram and zswap. I wonder if we can give it a try in this
direction.

> > If the performance of the disk is poor, like HDD, the situation will
> > become even worse.
>
> Doesn't that impact swapin/out as well? Or do you happen to have
> faster storage for the swap?

Yes, we use ZRAM as the swap storage.

>
> > The delay of the task becomes more serious because reading data will
> > be slower. Hot pages will thrash repeatedly between the memory and
> > the disk.
>
> Don't the refault stats and the IO cost aspect of reclaim already deal
> with this situation when balancing LRUs? Why doesn't it work in your
> case? Have you tried to investigate that?

OK, I'll try to reproduce the problem again. But IIUC, we cannot reclaim
pages from only one side. Please see commit d483a5dd009 ("mm: vmscan:
limit the range of LRU type balancing") [1].

Unless this condition is met:

    sc->file_is_tiny =
        file + free <= total_high_wmark &&
        !(sc->may_deactivate & DEACTIVATE_ANON) &&
        anon >> sc->priority;

[1]: https://lore.kernel.org/all/20200520232525.798933-15-hannes@cmpxchg.org/T/#u

> --
> Michal Hocko
> SUSE Labs
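Editorial note: the condition quoted above has three parts, and restating it as a small predicate may make them easier to read. This is a Python sketch under stated assumptions, not kernel code: `file_is_tiny` here is a plain function, `DEACTIVATE_ANON = 1` is an assumed bit value (the thread only uses the name), and all page counts in the usage lines are made up for illustration.

```python
DEACTIVATE_ANON = 1  # assumed bit value; the thread only mentions the name


def file_is_tiny(file, free, anon, total_high_wmark, may_deactivate, priority):
    """Restate the quoted condition as a boolean.

    file, free, anon and total_high_wmark are page counts, may_deactivate
    is a bit mask and priority is the reclaim scan priority.
    """
    return bool(
        # 1. File pages plus free pages do not exceed the high watermarks.
        file + free <= total_high_wmark
        # 2. Anon pages are not about to be deactivated in this pass.
        and not (may_deactivate & DEACTIVATE_ANON)
        # 3. Enough anon pages remain to scan at this priority.
        and anon >> priority
    )


# Hypothetical numbers: very little file memory, plenty of anon.
print(file_is_tiny(file=1000, free=500, anon=1_000_000,
                   total_high_wmark=2000, may_deactivate=0, priority=12))

# Hypothetical numbers: file memory well above the watermarks.
print(file_is_tiny(file=80_000, free=500, anon=1_000_000,
                   total_high_wmark=2000, may_deactivate=0, priority=12))
```

The first call prints True and the second prints False. In the scenario from the thread the file LRU is small relative to anon but not necessarily below the watermarks, so this exception would not be expected to trigger.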
On Thu 29-08-24 22:30:09, Zhongkun He wrote:
> On Thu, Aug 29, 2024 at 9:36 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Seeing this, my main question is whether we should focus on swappiness
> > behavior more than adding a very strange and very targeted reclaim
> > mode. After all, we have mapped memory and executable protection in
> > place. So in the end this is more about the balance between anon and file
> > LRUs.
> >
>
> I have a question about swappiness. If swappiness=0 is set, we can only
> reclaim file pages. However, we do not have an option to disable reclaim
> from file pages, which would be useful because there is faster swap storage
> without IO, like zram and zswap. I wonder if we can give it a try in this
> direction.

I do not think we should give any guarantee that 200 will only reclaim
anon pages. But having that heavily anon oriented makes sense, and I
thought this was an existing semantic.

[...]
> > > The delay of the task becomes more serious because reading data will
> > > be slower. Hot pages will thrash repeatedly between the memory and
> > > the disk.
> >
> > Don't the refault stats and the IO cost aspect of reclaim already deal
> > with this situation when balancing LRUs? Why doesn't it work in your
> > case? Have you tried to investigate that?
>
> OK, I'll try to reproduce the problem again. But IIUC, we cannot reclaim
> pages from only one side. Please see commit d483a5dd009 ("mm: vmscan:
> limit the range of LRU type balancing") [1].
>
> Unless this condition is met:
>
>     sc->file_is_tiny =
>         file + free <= total_high_wmark &&
>         !(sc->may_deactivate & DEACTIVATE_ANON) &&
>         anon >> sc->priority;

There have been some changes in this area where swappiness was treated
differently, so it would make sense to investigate with the current mm
tree.

> [1]: https://lore.kernel.org/all/20200520232525.798933-15-hannes@cmpxchg.org/T/#u
>
> > --
> > Michal Hocko
> > SUSE Labs