Message ID | 20250313034812.3910627-1-hezhongkun.hzk@bytedance.com (mailing list archive) |
---|---|
State | New |
Series | [V1] mm: vmscan: skip the file folios in proactive reclaim if swappiness is MAX |
On Thu 13-03-25 11:48:12, Zhongkun He wrote:
> With this patch 'commit <68cd9050d871> ("mm: add swappiness= arg to
> memory.reclaim")', we can submit an additional swappiness=<val> argument
> to memory.reclaim. It is very useful because we can dynamically adjust
> the reclamation ratio based on the anonymous folios and file folios of
> each cgroup. For example, when swappiness is set to 0, we only reclaim
> from file folios.
>
> However, we have also encountered a new issue: when swappiness is set to
> the MAX_SWAPPINESS, it may still only reclaim file folios. This is due
> to the knob of cache_trim_mode, which depends solely on the ratio of
> inactive folios, regardless of whether there are a large number of cold
> folios in anonymous folio list.
>
> So, we hope to add a new control logic where proactive memory reclaim only
> reclaims from anonymous folios when swappiness is set to MAX_SWAPPINESS.
> For example, something like this:
>
> echo "2M swappiness=200" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 200 (max
> swappiness) regardless of the file folios. Users have a more comprehensive
> view of the application's memory distribution because there are many
> metrics available. For example, if we find that a certain cgroup has a
> large number of inactive anon folios, we can reclaim only those and skip
> file folios, because with the zram/zswap, the IO tradeoff that
> cache_trim_mode is making doesn't hold - file refaults will cause IO,
> whereas anon decompression will not.
>
> With this patch, the swappiness argument of memory.reclaim has a more
> precise semantics: 0 means reclaiming only from file pages, while 200
> means reclaiming just from anonymous pages.

Well, with this patch we have 0 - always swap, 200 - never swap and
anything in between behaves more or less arbitrary, right? Not a new
problem with swappiness but would it make more sense to drop all the
heuristics for scanning LRUs and simply use the given swappiness when
doing the proactive reclaim?
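To make the suggestion above concrete, here is a minimal sketch of what "simply use the given swappiness" could look like: the scan target is divided between the anon and file lists in direct proportion to swappiness, with no other heuristic involved. The names split_by_swappiness and struct scan_split are hypothetical, and this is a userspace model for illustration only, not the code in mm/vmscan.c.

```c
#include <stdint.h>

#define MAX_SWAPPINESS 200

/* Hypothetical result type: how many pages to scan on each list. */
struct scan_split {
    uint64_t anon;
    uint64_t file;
};

/*
 * Split nr_to_scan between the anon and file lists purely by swappiness.
 * Assumption: anon gets swappiness/MAX_SWAPPINESS of the target and file
 * gets the remainder. Values above MAX_SWAPPINESS are clamped.
 */
struct scan_split split_by_swappiness(uint64_t nr_to_scan,
                                      unsigned int swappiness)
{
    struct scan_split s;

    if (swappiness > MAX_SWAPPINESS)
        swappiness = MAX_SWAPPINESS;

    s.anon = nr_to_scan * swappiness / MAX_SWAPPINESS;
    s.file = nr_to_scan - s.anon;
    return s;
}
```

In this model swappiness=60 sends 30% of the scan target to the anon list, and only the two end values (0 and 200) leave one list untouched.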
On Thu, Mar 13, 2025 at 3:57 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 13-03-25 11:48:12, Zhongkun He wrote:
> > With this patch 'commit <68cd9050d871> ("mm: add swappiness= arg to
> > memory.reclaim")', we can submit an additional swappiness=<val> argument
> > to memory.reclaim. It is very useful because we can dynamically adjust
> > the reclamation ratio based on the anonymous folios and file folios of
> > each cgroup. For example, when swappiness is set to 0, we only reclaim
> > from file folios.
> >
> > However, we have also encountered a new issue: when swappiness is set to
> > the MAX_SWAPPINESS, it may still only reclaim file folios. This is due
> > to the knob of cache_trim_mode, which depends solely on the ratio of
> > inactive folios, regardless of whether there are a large number of cold
> > folios in anonymous folio list.
> >
> > So, we hope to add a new control logic where proactive memory reclaim only
> > reclaims from anonymous folios when swappiness is set to MAX_SWAPPINESS.
> > For example, something like this:
> >
> > echo "2M swappiness=200" > /sys/fs/cgroup/memory.reclaim
> >
> > will perform reclaim on the rootcg with a swappiness setting of 200 (max
> > swappiness) regardless of the file folios. Users have a more comprehensive
> > view of the application's memory distribution because there are many
> > metrics available. For example, if we find that a certain cgroup has a
> > large number of inactive anon folios, we can reclaim only those and skip
> > file folios, because with the zram/zswap, the IO tradeoff that
> > cache_trim_mode is making doesn't hold - file refaults will cause IO,
> > whereas anon decompression will not.
> >
> > With this patch, the swappiness argument of memory.reclaim has a more
> > precise semantics: 0 means reclaiming only from file pages, while 200
> > means reclaiming just from anonymous pages.
>
> Well, with this patch we have 0 - always swap, 200 - never swap and
> anything in between behaves more or less arbitrary, right? Not a new
> problem with swappiness but would it make more sense to drop all the
> heuristics for scanning LRUs and simply use the given swappiness when
> doing the proactive reclaim?

Thanks for your suggestion! I totally agree with you. I'm preparing to send
another patch to do this and a new thread to discuss, because I think the
implementation doesn't conflict with this one. Do you think so?

>
> --
> Michal Hocko
> SUSE Labs
On Thu 13-03-25 16:57:34, Zhongkun He wrote:
> On Thu, Mar 13, 2025 at 3:57 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 13-03-25 11:48:12, Zhongkun He wrote:
> > > With this patch 'commit <68cd9050d871> ("mm: add swappiness= arg to
> > > memory.reclaim")', we can submit an additional swappiness=<val> argument
> > > to memory.reclaim. It is very useful because we can dynamically adjust
> > > the reclamation ratio based on the anonymous folios and file folios of
> > > each cgroup. For example, when swappiness is set to 0, we only reclaim
> > > from file folios.
> > >
> > > However, we have also encountered a new issue: when swappiness is set to
> > > the MAX_SWAPPINESS, it may still only reclaim file folios. This is due
> > > to the knob of cache_trim_mode, which depends solely on the ratio of
> > > inactive folios, regardless of whether there are a large number of cold
> > > folios in anonymous folio list.
> > >
> > > So, we hope to add a new control logic where proactive memory reclaim only
> > > reclaims from anonymous folios when swappiness is set to MAX_SWAPPINESS.
> > > For example, something like this:
> > >
> > > echo "2M swappiness=200" > /sys/fs/cgroup/memory.reclaim
> > >
> > > will perform reclaim on the rootcg with a swappiness setting of 200 (max
> > > swappiness) regardless of the file folios. Users have a more comprehensive
> > > view of the application's memory distribution because there are many
> > > metrics available. For example, if we find that a certain cgroup has a
> > > large number of inactive anon folios, we can reclaim only those and skip
> > > file folios, because with the zram/zswap, the IO tradeoff that
> > > cache_trim_mode is making doesn't hold - file refaults will cause IO,
> > > whereas anon decompression will not.
> > >
> > > With this patch, the swappiness argument of memory.reclaim has a more
> > > precise semantics: 0 means reclaiming only from file pages, while 200
> > > means reclaiming just from anonymous pages.
> >
> > Well, with this patch we have 0 - always swap, 200 - never swap and
> > anything in between behaves more or less arbitrary, right? Not a new
> > problem with swappiness but would it make more sense to drop all the
> > heuristics for scanning LRUs and simply use the given swappiness when
> > doing the proactive reclaim?
>
> Thanks for your suggestion! I totally agree with you. I'm preparing to send
> another patch to do this and a new thread to discuss, because I think the
> implementation doesn't conflict with this one. Do you think so?

If the change will enforce SCAN_FRACT for proactive reclaim with
swappiness given then it will make the balancing much smoother but I do
not think the behavior at both ends of the scale would imply only single
LRU scanning mode.
On Thu, Mar 13, 2025 at 5:43 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 13-03-25 16:57:34, Zhongkun He wrote:
> > On Thu, Mar 13, 2025 at 3:57 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 13-03-25 11:48:12, Zhongkun He wrote:
> > > > With this patch 'commit <68cd9050d871> ("mm: add swappiness= arg to
> > > > memory.reclaim")', we can submit an additional swappiness=<val> argument
> > > > to memory.reclaim. It is very useful because we can dynamically adjust
> > > > the reclamation ratio based on the anonymous folios and file folios of
> > > > each cgroup. For example, when swappiness is set to 0, we only reclaim
> > > > from file folios.
> > > >
> > > > However, we have also encountered a new issue: when swappiness is set to
> > > > the MAX_SWAPPINESS, it may still only reclaim file folios. This is due
> > > > to the knob of cache_trim_mode, which depends solely on the ratio of
> > > > inactive folios, regardless of whether there are a large number of cold
> > > > folios in anonymous folio list.
> > > >
> > > > So, we hope to add a new control logic where proactive memory reclaim only
> > > > reclaims from anonymous folios when swappiness is set to MAX_SWAPPINESS.
> > > > For example, something like this:
> > > >
> > > > echo "2M swappiness=200" > /sys/fs/cgroup/memory.reclaim
> > > >
> > > > will perform reclaim on the rootcg with a swappiness setting of 200 (max
> > > > swappiness) regardless of the file folios. Users have a more comprehensive
> > > > view of the application's memory distribution because there are many
> > > > metrics available. For example, if we find that a certain cgroup has a
> > > > large number of inactive anon folios, we can reclaim only those and skip
> > > > file folios, because with the zram/zswap, the IO tradeoff that
> > > > cache_trim_mode is making doesn't hold - file refaults will cause IO,
> > > > whereas anon decompression will not.
> > > >
> > > > With this patch, the swappiness argument of memory.reclaim has a more
> > > > precise semantics: 0 means reclaiming only from file pages, while 200
> > > > means reclaiming just from anonymous pages.
> > >
> > > Well, with this patch we have 0 - always swap, 200 - never swap and
> > > anything in between behaves more or less arbitrary, right? Not a new
> > > problem with swappiness but would it make more sense to drop all the
> > > heuristics for scanning LRUs and simply use the given swappiness when
> > > doing the proactive reclaim?
> >
> > Thanks for your suggestion! I totally agree with you. I'm preparing to send
> > another patch to do this and a new thread to discuss, because I think the
> > implementation doesn't conflict with this one. Do you think so?
>
> If the change will enforce SCAN_FRACT for proactive reclaim with
> swappiness given then it will make the balancing much smoother but I do
> not think the behavior at both ends of the scale would imply only single
> LRU scanning mode.

Hi Michal,

I'm confused by the statement 'I do not think the behavior at both ends of
the scale would imply only single LRU scanning mode', and about what we
should do at the max value of swappiness.

Besides that, I have discovered a new issue. If we drop all the heuristics
for scanning LRUs, the swappiness value will accurately represent the
ratio of memory to be reclaimed each time. This means that before each
proactive reclaim operation, we would need a relatively clear
understanding of the current memory ratio, and we would have to change
swappiness more often, because with proactive reclaim the ratio is always
changing. As a result, the flexibility would be reduced.

However, at both ends of the scale, we would have a clearer intention to
reclaim from a single list. For example, in a cgroup, if we have 10G of
anon pages and 3G of file pages, I would prefer to set swappiness=200 to
reclaim anon pages only. Once the amounts of file and anon pages become
roughly equal, we can set swappiness=100 and rely on the system's original
heuristics to determine the appropriate amount to reclaim.

On the other hand, if we have 1G of anon pages and 10G of page cache, we
would like to set swappiness=0 to reclaim only from file pages even with
cache_trim_mode. At least from the semantic perspective this is clear:
users don't need to worry about the threshold of cache_trim_mode, or even
know that cache_trim_mode exists.

Overall, setting swappiness=0 and swappiness=200 to reclaim from a single
LRU list is intended to address the extreme cases we have actually
encountered. As Johannes mentioned above, with the zram/zswap, the IO
tradeoff that cache_trim_mode is making doesn't hold - file refaults will
cause IO, whereas anon decompression will not. We would like to set
swappiness=200 to reclaim only from the anon list, which really makes
sense to us.

Thanks.

> --
> Michal Hocko
> SUSE Labs
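The policy described in this message can be sketched as a small userspace helper that picks the swappiness value to pass to memory.reclaim from the current anon and file sizes of a cgroup. The function name pick_swappiness and the factor-of-two threshold used for "roughly equal" are assumptions made for illustration; the thread does not define a threshold.

```c
#include <stdint.h>

#define MAX_SWAPPINESS 200

/*
 * Choose a swappiness value for a proactive reclaim request.
 * Assumption: one side is considered dominant when it is more than
 * twice as large as the other; otherwise the sizes count as roughly
 * equal and the balanced value 100 is used.
 */
int pick_swappiness(uint64_t anon_bytes, uint64_t file_bytes)
{
    if (anon_bytes > 2 * file_bytes)
        return MAX_SWAPPINESS;  /* mostly anon: reclaim anon only */
    if (file_bytes > 2 * anon_bytes)
        return 0;               /* mostly file: reclaim file only */
    return 100;                 /* roughly equal: leave it to the heuristics */
}
```

With the numbers from the message, 10G anon and 3G file gives 200, while 1G anon and 10G file gives 0.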
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cb1b4e759b7e..6a4487ead7e0 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1343,6 +1343,10 @@ The following nested keys are defined.
 	same semantics as vm.swappiness applied to memcg reclaim with
 	all the existing limitations and potential future extensions.
 
+	The swappiness have the range [0, 200], 0 means reclaiming only
+	from file folios, 200 (MAX_SWAPPINESS) means reclaiming just from
+	anonymous folios.
+
   memory.peak
 	A read-write single value file which exists on non-root cgroups.
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c767d71c43d7..f4312b41e0e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2438,6 +2438,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		goto out;
 	}
 
+	/*
+	 * Do not bother scanning file folios if the memory reclaim
+	 * invoked by userspace through memory.reclaim and the
+	 * swappiness is MAX_SWAPPINESS.
+	 */
+	if (sc->proactive && (swappiness == MAX_SWAPPINESS)) {
+		scan_balance = SCAN_ANON;
+		goto out;
+	}
+
 	/*
 	 * Do not apply any pressure balancing cleverness when the
 	 * system is close to OOM, scan both anon and file equally
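For readers who want to see the effect of the vmscan.c hunk in isolation, the following is a simplified userspace model of the decision it changes. It only covers the cases discussed in this thread (swappiness 0, the new proactive MAX_SWAPPINESS check, cache_trim_mode, and the SCAN_FRACT fallback); struct reclaim_ctx and choose_scan_balance are hypothetical names, and the other checks in get_scan_count() are deliberately left out.

```c
#include <stdbool.h>

#define MAX_SWAPPINESS 200

/* Subset of the scan modes named in the thread. */
enum scan_balance { SCAN_FRACT, SCAN_ANON, SCAN_FILE };

/* Hypothetical stand-in for the fields of struct scan_control used here. */
struct reclaim_ctx {
    bool proactive;        /* reclaim was requested through memory.reclaim */
    bool cache_trim_mode;  /* heuristic that prefers trimming file cache */
    int swappiness;        /* 0..MAX_SWAPPINESS */
};

enum scan_balance choose_scan_balance(const struct reclaim_ctx *ctx)
{
    /* swappiness 0: reclaim from file folios only */
    if (ctx->swappiness == 0)
        return SCAN_FILE;

    /* new check from the patch: proactive reclaim at max swappiness */
    if (ctx->proactive && ctx->swappiness == MAX_SWAPPINESS)
        return SCAN_ANON;

    /* without the new check, cache_trim_mode wins even at swappiness 200 */
    if (ctx->cache_trim_mode)
        return SCAN_FILE;

    /* otherwise balance anon and file proportionally */
    return SCAN_FRACT;
}
```

The model shows why the position of the new check matters: it has to come before the cache_trim_mode test, otherwise a request with swappiness=200 could still end up scanning only file folios.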