Message ID | 20220104202227.2903605-7-yuzhao@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | Multigenerational LRU Framework |
I am still reading through the series. It is a lot of code and quite hard to wrap ones head around so these are mostly random things I have run into. More will likely follow up. On Tue 04-01-22 13:22:25, Yu Zhao wrote: [...] > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index aba18cd101db..028afdb81c10 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -1393,18 +1393,24 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) > > static inline void lock_page_memcg(struct page *page) > { > + /* to match folio_memcg_rcu() */ > + rcu_read_lock(); > } > > static inline void unlock_page_memcg(struct page *page) > { > + rcu_read_unlock(); > } > > static inline void folio_memcg_lock(struct folio *folio) > { > + /* to match folio_memcg_rcu() */ > + rcu_read_lock(); > } > > static inline void folio_memcg_unlock(struct folio *folio) > { > + rcu_read_unlock(); > } This should go into a separate patch and merge it independently. I haven't really realized that !MEMCG configuration has a different locking scopes. [...] > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 2db9a1432511..9c7a4fae0661 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -57,6 +57,22 @@ struct oom_control { > extern struct mutex oom_lock; > extern struct mutex oom_adj_mutex; > > +#ifdef CONFIG_MMU > +extern struct task_struct *oom_reaper_list; > +extern struct wait_queue_head oom_reaper_wait; > + > +static inline bool oom_reaping_in_progress(void) > +{ > + /* a racy check can be used to reduce the chance of overkilling */ > + return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait); > +} > +#else > +static inline bool oom_reaping_in_progress(void) > +{ > + return false; > +} > +#endif I do not like this. These are internal oom reaper's and no code should really make any decisions based on that. oom_reaping_in_progress is not telling much anyway. This is a global queue for oom reaper that can contain oom victims from different oom scopes (e.g. global OOM, memcg OOM or memory policy OOM). Your lru_gen_age_node uses this to decide whether to trigger out_of_memory and that is clearly wrong for the above reasons. out_of_memory is designed to skip over any action if there is an oom victim pending from the oom domain (have a look at oom_evaluate_task). [...] > +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, > + unsigned long min_ttl) > +{ > + bool need_aging; > + long nr_to_scan; > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > + int swappiness = get_swappiness(memcg); > + DEFINE_MAX_SEQ(lruvec); > + DEFINE_MIN_SEQ(lruvec); > + > + if (mem_cgroup_below_min(memcg)) > + return false; mem_cgroup_below_min requires effective values to be calculated for the reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection
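For reference, the ordering Michal points at mirrors the existing pattern in shrink_node_memcgs(): effective protection values are computed for the hierarchy under reclaim right before they are consulted. A minimal sketch of what that would look like inside age_lruvec(), assuming sc->target_mem_cgroup is the reclaim root as elsewhere in mm/vmscan.c:

/* sketch: compute effective low/min for this hierarchy first */
mem_cgroup_calculate_protection(sc->target_mem_cgroup, memcg);

if (mem_cgroup_below_min(memcg))
	return false;	/* hard protection: skip this lruvec */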
On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> +{
> +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> +		return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> +
> +	return current->reclaim_state->mm_walk;
> +}
> +
> +static void free_mm_walk(struct lru_gen_mm_walk *walk)
> +{
> +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> +		kvfree(walk);
> +}

Do I get it right that you are allocating from the reclaim context? What prevents this from completely depleting the memory, given that the reclaim context is PF_MEMALLOC?
On Thu, Jan 06, 2022 at 05:06:42PM +0100, Michal Hocko wrote: > I am still reading through the series. It is a lot of code and quite > hard to wrap ones head around so these are mostly random things I have > run into. More will likely follow up. > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > [...] > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index aba18cd101db..028afdb81c10 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -1393,18 +1393,24 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) > > > > static inline void lock_page_memcg(struct page *page) > > { > > + /* to match folio_memcg_rcu() */ > > + rcu_read_lock(); > > } > > > > static inline void unlock_page_memcg(struct page *page) > > { > > + rcu_read_unlock(); > > } > > > > static inline void folio_memcg_lock(struct folio *folio) > > { > > + /* to match folio_memcg_rcu() */ > > + rcu_read_lock(); > > } > > > > static inline void folio_memcg_unlock(struct folio *folio) > > { > > + rcu_read_unlock(); > > } > > This should go into a separate patch and merge it independently. I > haven't really realized that !MEMCG configuration has a different > locking scopes. Considered it done. > > diff --git a/include/linux/oom.h b/include/linux/oom.h > > index 2db9a1432511..9c7a4fae0661 100644 > > --- a/include/linux/oom.h > > +++ b/include/linux/oom.h > > @@ -57,6 +57,22 @@ struct oom_control { > > extern struct mutex oom_lock; > > extern struct mutex oom_adj_mutex; > > > > +#ifdef CONFIG_MMU > > +extern struct task_struct *oom_reaper_list; > > +extern struct wait_queue_head oom_reaper_wait; > > + > > +static inline bool oom_reaping_in_progress(void) > > +{ > > + /* a racy check can be used to reduce the chance of overkilling */ > > + return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait); > > +} > > +#else > > +static inline bool oom_reaping_in_progress(void) > > +{ > > + return false; > > +} > > +#endif > > I do not like this. These are internal oom reaper's and no code should > really make any decisions based on that. oom_reaping_in_progress is not > telling much anyway. There is a perfectly legitimate reason for this. If there is already a oom kill victim and the oom reaper is making progress, the system may still be under memory pressure until the oom reaping is done. The page reclaim has two choices in this transient state: kill more processes or keep reclaiming (a few more) hot pages. The first choice, AKA overkilling, is generally a bad one. The oom reaper is single threaded and it can't go faster with additional victims. Additional processes are sacrificed for nothing -- this is an overcorrection of a system that tries to strike a balance between the tendencies to release memory pressure and to improve memory utilization. > This is a global queue for oom reaper that can > contain oom victims from different oom scopes (e.g. global OOM, memcg > OOM or memory policy OOM). True, but this is a wrong reason to make the conclusion below. Oom kill scopes do NOT matter; only the pool the freed memory goes into does. And there is only one global pool free pages. > Your lru_gen_age_node uses this to decide whether to trigger > out_of_memory and that is clearly wrong for the above reasons. I hope my explanation above is clear enough. There is nothing wrong with the purpose and the usage of oom_reaping_in_progress(), and it has been well tested in the Arch Linux Zen kernel. Without it, overkills can be easily reproduced by the following simple script. 
That is, additional oom kills happen to processes other than "tail".

# enable zram
while true;
do
	tail /dev/zero
done

> out_of_memory is designed to skip over any action if there is an oom
> victim pending from the oom domain (have a look at oom_evaluate_task).

Where exactly? Point me to the code, please.

I don't see such logic inside out_of_memory() or oom_evaluate_task(). Currently the only thing that could remotely prevent overkills is oom_lock. But it's inadequate.

This is the entire pipeline:
low on memory -> out_of_memory() -> oom_reaper() -> free memory

To avoid overkills, we need to consider the latter half of it too. oom_reaping_in_progress() is exactly for this purpose.

> > +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
> > +		       unsigned long min_ttl)
> > +{
> > +	bool need_aging;
> > +	long nr_to_scan;
> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +	int swappiness = get_swappiness(memcg);
> > +	DEFINE_MAX_SEQ(lruvec);
> > +	DEFINE_MIN_SEQ(lruvec);
> > +
> > +	if (mem_cgroup_below_min(memcg))
> > +		return false;
>
> mem_cgroup_below_min requires effective values to be calculated for the
> reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection

I always keep that in mind, and age_lruvec() is called *after* mem_cgroup_calculate_protection():

balance_pgdat()
  memcgs_need_aging = 0
  do {
    lru_gen_age_node()
      if (!memcgs_need_aging) {
        memcgs_need_aging = 1
        return
      }
      age_lruvec()

    shrink_node_memcgs()
      mem_cgroup_calculate_protection()
      lru_gen_shrink_lruvec()
        if ...
          memcgs_need_aging = 0
  } while ...
On Thu, Jan 06, 2022 at 05:12:16PM +0100, Michal Hocko wrote: > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > +static struct lru_gen_mm_walk *alloc_mm_walk(void) > > +{ > > + if (!current->reclaim_state || !current->reclaim_state->mm_walk) > > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL); > > + > > + return current->reclaim_state->mm_walk; > > +} > > + > > +static void free_mm_walk(struct lru_gen_mm_walk *walk) > > +{ > > + if (!current->reclaim_state || !current->reclaim_state->mm_walk) > > + kvfree(walk); > > +} > > Do I get it right that you are allocating from the reclaim context? What > prevents this to completely deplete the memory as the reclaim context is > PF_MEMALLOC? Yes, and in general the same reason zram/zswap/etc. allocate memory in the reclaim context: to make more free memory. In this case, lru_gen_mm_walk is small (160 bytes); it's per direct reclaimer; and direct reclaimers rarely come here, i.e., only when kswapd can't keep up in terms of the aging, which is similar to the condition where the inactive list is empty for the active/inactive lru.
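The two paths Yu describes can be summarized from the quoted code (simplified; comments only, not new patch code):

/*
 * kswapd (see lru_gen_age_node() later in the thread):
 *	current->reclaim_state->mm_walk = &pgdat->mm_walk;
 *	walk = alloc_mm_walk();	// returns the preallocated per-node buffer
 *
 * direct reclaim:
 *	walk = alloc_mm_walk();	// reclaim_state->mm_walk is NULL, so this
 *				// kvzalloc()s one 160-byte lru_gen_mm_walk
 *	...
 *	free_mm_walk(walk);	// kvfree()d only on this path
 */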
On Thu 06-01-22 14:27:52, Yu Zhao wrote: > On Thu, Jan 06, 2022 at 05:06:42PM +0100, Michal Hocko wrote: [...] > > > diff --git a/include/linux/oom.h b/include/linux/oom.h > > > index 2db9a1432511..9c7a4fae0661 100644 > > > --- a/include/linux/oom.h > > > +++ b/include/linux/oom.h > > > @@ -57,6 +57,22 @@ struct oom_control { > > > extern struct mutex oom_lock; > > > extern struct mutex oom_adj_mutex; > > > > > > +#ifdef CONFIG_MMU > > > +extern struct task_struct *oom_reaper_list; > > > +extern struct wait_queue_head oom_reaper_wait; > > > + > > > +static inline bool oom_reaping_in_progress(void) > > > +{ > > > + /* a racy check can be used to reduce the chance of overkilling */ > > > + return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait); > > > +} > > > +#else > > > +static inline bool oom_reaping_in_progress(void) > > > +{ > > > + return false; > > > +} > > > +#endif > > > > I do not like this. These are internal oom reaper's and no code should > > really make any decisions based on that. oom_reaping_in_progress is not > > telling much anyway. > > There is a perfectly legitimate reason for this. > > If there is already a oom kill victim and the oom reaper is making > progress, the system may still be under memory pressure until the oom > reaping is done. The page reclaim has two choices in this transient > state: kill more processes or keep reclaiming (a few more) hot pages. > > The first choice, AKA overkilling, is generally a bad one. The oom > reaper is single threaded and it can't go faster with additional > victims. Additional processes are sacrificed for nothing -- this is > an overcorrection of a system that tries to strike a balance between > the tendencies to release memory pressure and to improve memory > utilization. > > > This is a global queue for oom reaper that can > > contain oom victims from different oom scopes (e.g. global OOM, memcg > > OOM or memory policy OOM). > > True, but this is a wrong reason to make the conclusion below. Oom > kill scopes do NOT matter; only the pool the freed memory goes into > does. And there is only one global pool free pages. > > > Your lru_gen_age_node uses this to decide whether to trigger > > out_of_memory and that is clearly wrong for the above reasons. > > I hope my explanation above is clear enough. There is nothing wrong > with the purpose and the usage of oom_reaping_in_progress(), and it > has been well tested in the Arch Linux Zen kernel. I disagree. An ongoing oom kill in one domain (say memcg A) shouldn't be any base for any decisions in reclaim in other domain (say memcg B or even a global reclaim). Those are fundamentally different conditions. > Without it, overkills can be easily reproduced by the following simple > script. That is additional oom kills happen to processes other than > "tail". > > # enable zram > while true; > do > tail /dev/zero > done I would be interested to hear more (care to send oom reports?). > > out_of_memory is designed to skip over any action if there is an oom > > victim pending from the oom domain (have a look at oom_evaluate_task). > > Where exactly? Point me to the code please. > > I don't see such a logic inside out_of_memory() or > oom_evaluate_task(). Currently the only thing that could remotely > prevent overkills is oom_lock. But it's inadequate. OK, let me try to exaplain. The protocol is rather convoluted. Once the oom killer is invoked it choses a victim to kill. 
oom_evaluate_task will evaluate _all_ tasks from the oom respective domain (select_bad_process which distinguishes memcg vs global oom kill and oom_cpuset_eligible for the cpuset domains). If there is any pre-existing oom victim (tsk_is_oom_victim) then the scan is aborted and the oom killer bails out. OOM victim stops being considered as relevant once the oom reaper manages to release its address space (or give up on the mmap_sem contention) and sets MMF_OOM_SKIP flag for the mm. That being said the out_of_memory automatically backs off and relies on the oom reaper to process its queue. Does it make more clear for you now? > This is the entire pipeline: > low on memory -> out_of_memory() -> oom_reaper() -> free memory > > To avoid overkills, we need to consider the later half of it too. > oom_reaping_in_progress() is exactly for this purpose. > > > > +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, > > > + unsigned long min_ttl) > > > +{ > > > + bool need_aging; > > > + long nr_to_scan; > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > > > + int swappiness = get_swappiness(memcg); > > > + DEFINE_MAX_SEQ(lruvec); > > > + DEFINE_MIN_SEQ(lruvec); > > > + > > > + if (mem_cgroup_below_min(memcg)) > > > + return false; > > > > mem_cgroup_below_min requires effective values to be calculated for the > > reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection > > I always keep that in mind, and age_lruvec() is called *after* > mem_cgroup_calculate_protection(): > balance_pgdat() > memcgs_need_aging = 0 > do { > lru_gen_age_node() > if (!memcgs_need_aging) { > memcgs_need_aging = 1 > return > } > age_lruvec() > > shrink_node_memcgs() > mem_cgroup_calculate_protection() > lru_gen_shrink_lruvec() > if ... > memcgs_need_aging = 0 > } while ... Uff, this is really subtle. I really think you should be following the existing pattern when the effective values are calculated right in the same context as they are evaluated.
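The abort Michal describes (tsk_is_oom_victim plus MMF_OOM_SKIP) sits in oom_evaluate_task() in mm/oom_kill.c; condensed here, not verbatim, with the scoring path omitted:

static int oom_evaluate_task(struct task_struct *task, void *arg)
{
	struct oom_control *oc = arg;
	...
	/*
	 * A pre-existing victim that the oom reaper has not yet finished
	 * with (MMF_OOM_SKIP not set) aborts the scan, so out_of_memory()
	 * backs off instead of picking another victim.
	 */
	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;
		goto abort;
	}
	...
}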
On Thu 06-01-22 14:41:12, Yu Zhao wrote: > On Thu, Jan 06, 2022 at 05:12:16PM +0100, Michal Hocko wrote: > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > > +static struct lru_gen_mm_walk *alloc_mm_walk(void) > > > +{ > > > + if (!current->reclaim_state || !current->reclaim_state->mm_walk) > > > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL); > > > + > > > + return current->reclaim_state->mm_walk; > > > +} > > > + > > > +static void free_mm_walk(struct lru_gen_mm_walk *walk) > > > +{ > > > + if (!current->reclaim_state || !current->reclaim_state->mm_walk) > > > + kvfree(walk); > > > +} > > > > Do I get it right that you are allocating from the reclaim context? What > > prevents this to completely deplete the memory as the reclaim context is > > PF_MEMALLOC? > > Yes, and in general the same reason zram/zswap/etc. allocate memory in > the reclaim context: to make more free memory. I have to admit that I am not really familiar with zram/zswap but I find the concept of requiring memory to do the reclaim really problematic. > In this case, lru_gen_mm_walk is small (160 bytes); it's per direct > reclaimer; and direct reclaimers rarely come here, i.e., only when > kswapd can't keep up in terms of the aging, which is similar to the > condition where the inactive list is empty for the active/inactive > lru. Well, this is not a strong argument to be honest. Kswapd being stuck and the majority of the reclaim being done in the direct reclaim context is a situation I have seen many many times. We used to have problems with direct reclaimers throttling to prevent an over eager OOM situations. Have you considered using a pool of preallocated objects instead?
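One possible shape of the preallocated pool Michal suggests, as a hedged sketch: the mempool API is real, but the pool size, the init hook and the GFP_NOWAIT fallback policy are assumptions, not anything proposed in the patch.

#include <linux/mempool.h>

#define NR_MM_WALKS	8	/* assumption: bound on concurrent walkers */

static mempool_t *mm_walk_pool;

static int __init lru_gen_mm_walk_pool_init(void)
{
	/* filled at boot, so reclaim never depends on a fresh allocation */
	mm_walk_pool = mempool_create_kmalloc_pool(NR_MM_WALKS,
					sizeof(struct lru_gen_mm_walk));
	return mm_walk_pool ? 0 : -ENOMEM;
}
late_initcall(lru_gen_mm_walk_pool_init);

static struct lru_gen_mm_walk *alloc_mm_walk(void)
{
	struct lru_gen_mm_walk *walk;

	if (current->reclaim_state && current->reclaim_state->mm_walk)
		return current->reclaim_state->mm_walk;

	/* GFP_NOWAIT: never recurse into reclaim; NULL means "fall back to rmap" */
	walk = mempool_alloc(mm_walk_pool, GFP_NOWAIT);
	if (walk)
		memset(walk, 0, sizeof(*walk));	/* kvzalloc() in the patch zeroes it */
	return walk;
}

free_mm_walk() would correspondingly use mempool_free() for buffers not owned by kswapd.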
On Fri 07-01-22 09:55:09, Michal Hocko wrote:
[...]
> > In this case, lru_gen_mm_walk is small (160 bytes); it's per direct
> > reclaimer; and direct reclaimers rarely come here, i.e., only when
> > kswapd can't keep up in terms of the aging, which is similar to the
> > condition where the inactive list is empty for the active/inactive
> > lru.
>
> Well, this is not a strong argument to be honest. Kswapd being stuck
> and the majority of the reclaim being done in the direct reclaim
> context is a situation I have seen many many times.

Also do not forget that memcg reclaim is effectively only direct reclaim. Not that memcg reclaim indicates a global memory shortage, but it can add up and race with the global reclaim as well.
On Tue 04-01-22 13:22:25, Yu Zhao wrote: [...] > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) > +{ > + struct mem_cgroup *memcg; > + bool success = false; > + unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl); > + > + VM_BUG_ON(!current_is_kswapd()); > + > + current->reclaim_state->mm_walk = &pgdat->mm_walk; > + > + memcg = mem_cgroup_iter(NULL, NULL, NULL); > + do { > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); > + > + if (age_lruvec(lruvec, sc, min_ttl)) > + success = true; > + > + cond_resched(); > + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); > + > + if (!success && mutex_trylock(&oom_lock)) { > + struct oom_control oc = { > + .gfp_mask = sc->gfp_mask, > + .order = sc->order, > + }; > + > + if (!oom_reaping_in_progress()) > + out_of_memory(&oc); > + > + mutex_unlock(&oom_lock); > + } Why do you need to trigger oom killer from this path? Why cannot you rely on the page allocator to do that like we do now?
On Tue 04-01-22 13:22:25, Yu Zhao wrote: [...] > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk) > +{ > + static const struct mm_walk_ops mm_walk_ops = { > + .test_walk = should_skip_vma, > + .p4d_entry = walk_pud_range, > + }; > + > + int err; > +#ifdef CONFIG_MEMCG > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > +#endif > + > + walk->next_addr = FIRST_USER_ADDRESS; > + > + do { > + unsigned long start = walk->next_addr; > + unsigned long end = mm->highest_vm_end; > + > + err = -EBUSY; > + > + rcu_read_lock(); > +#ifdef CONFIG_MEMCG > + if (memcg && atomic_read(&memcg->moving_account)) > + goto contended; > +#endif > + if (!mmap_read_trylock(mm)) > + goto contended; Have you evaluated the behavior under mmap_sem contention? I mean what would be an effect of some mms being excluded from the walk? This path is called from direct reclaim and we do allocate with exclusive mmap_sem IIRC and the trylock can fail in a presence of pending writer if I am not mistaken so even the read lock holder (e.g. an allocation from the #PF) can bypass the walk. Or is this considered statistically insignificant thus a theoretical problem?
On Fri, Jan 07, 2022 at 09:43:49AM +0100, Michal Hocko wrote: > On Thu 06-01-22 14:27:52, Yu Zhao wrote: > > On Thu, Jan 06, 2022 at 05:06:42PM +0100, Michal Hocko wrote: > [...] > > > > diff --git a/include/linux/oom.h b/include/linux/oom.h > > > > index 2db9a1432511..9c7a4fae0661 100644 > > > > --- a/include/linux/oom.h > > > > +++ b/include/linux/oom.h > > > > @@ -57,6 +57,22 @@ struct oom_control { > > > > extern struct mutex oom_lock; > > > > extern struct mutex oom_adj_mutex; > > > > > > > > +#ifdef CONFIG_MMU > > > > +extern struct task_struct *oom_reaper_list; > > > > +extern struct wait_queue_head oom_reaper_wait; > > > > + > > > > +static inline bool oom_reaping_in_progress(void) > > > > +{ > > > > + /* a racy check can be used to reduce the chance of overkilling */ > > > > + return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait); > > > > +} > > > > +#else > > > > +static inline bool oom_reaping_in_progress(void) > > > > +{ > > > > + return false; > > > > +} > > > > +#endif > > > > > > I do not like this. These are internal oom reaper's and no code should > > > really make any decisions based on that. oom_reaping_in_progress is not > > > telling much anyway. > > > > There is a perfectly legitimate reason for this. > > > > If there is already a oom kill victim and the oom reaper is making > > progress, the system may still be under memory pressure until the oom > > reaping is done. The page reclaim has two choices in this transient > > state: kill more processes or keep reclaiming (a few more) hot pages. > > > > The first choice, AKA overkilling, is generally a bad one. The oom > > reaper is single threaded and it can't go faster with additional > > victims. Additional processes are sacrificed for nothing -- this is > > an overcorrection of a system that tries to strike a balance between > > the tendencies to release memory pressure and to improve memory > > utilization. > > > > > This is a global queue for oom reaper that can > > > contain oom victims from different oom scopes (e.g. global OOM, memcg > > > OOM or memory policy OOM). > > > > True, but this is a wrong reason to make the conclusion below. Oom > > kill scopes do NOT matter; only the pool the freed memory goes into > > does. And there is only one global pool free pages. > > > > > Your lru_gen_age_node uses this to decide whether to trigger > > > out_of_memory and that is clearly wrong for the above reasons. > > > > I hope my explanation above is clear enough. There is nothing wrong > > with the purpose and the usage of oom_reaping_in_progress(), and it > > has been well tested in the Arch Linux Zen kernel. > > I disagree. An ongoing oom kill in one domain (say memcg A) shouldn't be > any base for any decisions in reclaim in other domain (say memcg B or > even a global reclaim). Those are fundamentally different conditions. I agree for the memcg A oom and memcg B reclaim case, because memory freed from A doesn't go to B. I still think for the memcg A and the global reclaim case, memory freed from A can be considered when deciding whether to make more kills during global reclaim. But this is something really minor, and I'll go with your suggestion, i.e., getting rid of oom_reaping_in_progress(). > > Without it, overkills can be easily reproduced by the following simple > > script. That is additional oom kills happen to processes other than > > "tail". > > > > # enable zram > > while true; > > do > > tail /dev/zero > > done > > I would be interested to hear more (care to send oom reports?). 
I agree with what said below. I think those additional ooms might have been from different oom domains. I plan to leave this for now and go with your suggestion as mentioned above. > > > out_of_memory is designed to skip over any action if there is an oom > > > victim pending from the oom domain (have a look at oom_evaluate_task). > > > > Where exactly? Point me to the code please. > > > > I don't see such a logic inside out_of_memory() or > > oom_evaluate_task(). Currently the only thing that could remotely > > prevent overkills is oom_lock. But it's inadequate. > > OK, let me try to exaplain. The protocol is rather convoluted. Once the > oom killer is invoked it choses a victim to kill. oom_evaluate_task will > evaluate _all_ tasks from the oom respective domain (select_bad_process > which distinguishes memcg vs global oom kill and oom_cpuset_eligible for > the cpuset domains). If there is any pre-existing oom victim > (tsk_is_oom_victim) then the scan is aborted and the oom killer bails > out. OOM victim stops being considered as relevant once the oom reaper > manages to release its address space (or give up on the mmap_sem > contention) and sets MMF_OOM_SKIP flag for the mm. > > That being said the out_of_memory automatically backs off and relies on > the oom reaper to process its queue. > > Does it make more clear for you now? Yes, you are right, thanks. > > This is the entire pipeline: > > low on memory -> out_of_memory() -> oom_reaper() -> free memory > > > > To avoid overkills, we need to consider the later half of it too. > > oom_reaping_in_progress() is exactly for this purpose. > > > > > > +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, > > > > + unsigned long min_ttl) > > > > +{ > > > > + bool need_aging; > > > > + long nr_to_scan; > > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > > > > + int swappiness = get_swappiness(memcg); > > > > + DEFINE_MAX_SEQ(lruvec); > > > > + DEFINE_MIN_SEQ(lruvec); > > > > + > > > > + if (mem_cgroup_below_min(memcg)) > > > > + return false; > > > > > > mem_cgroup_below_min requires effective values to be calculated for the > > > reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection > > > > I always keep that in mind, and age_lruvec() is called *after* > > mem_cgroup_calculate_protection(): > > > balance_pgdat() > > memcgs_need_aging = 0 > > do { > > lru_gen_age_node() > > if (!memcgs_need_aging) { > > memcgs_need_aging = 1 > > return > > } > > age_lruvec() > > > > shrink_node_memcgs() > > mem_cgroup_calculate_protection() > > lru_gen_shrink_lruvec() > > if ... > > memcgs_need_aging = 0 > > } while ... > > Uff, this is really subtle. I really think you should be following the > existing pattern when the effective values are calculated right in the > same context as they are evaluated. Consider it done.
On Fri, Jan 07, 2022 at 02:11:29PM +0100, Michal Hocko wrote: > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > [...] > > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) > > +{ > > + struct mem_cgroup *memcg; > > + bool success = false; > > + unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl); > > + > > + VM_BUG_ON(!current_is_kswapd()); > > + > > + current->reclaim_state->mm_walk = &pgdat->mm_walk; > > + > > + memcg = mem_cgroup_iter(NULL, NULL, NULL); > > + do { > > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); > > + > > + if (age_lruvec(lruvec, sc, min_ttl)) > > + success = true; > > + > > + cond_resched(); > > + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); > > + > > + if (!success && mutex_trylock(&oom_lock)) { > > + struct oom_control oc = { > > + .gfp_mask = sc->gfp_mask, > > + .order = sc->order, > > + }; > > + > > + if (!oom_reaping_in_progress()) > > + out_of_memory(&oc); > > + > > + mutex_unlock(&oom_lock); > > + } > > Why do you need to trigger oom killer from this path? Why cannot you > rely on the page allocator to do that like we do now? This is per desktop users' (repeated) requests. The can't tolerate thrashing as servers do because of UI lags; and they usually don't have fancy tools like oomd. Related discussions I saw: https://github.com/zen-kernel/zen-kernel/issues/218 https://lore.kernel.org/lkml/20101028191523.GA14972@google.com/ https://lore.kernel.org/lkml/20211213051521.21f02dd2@mail.inbox.lv/ https://lore.kernel.org/lkml/54C2C89C.8080002@gmail.com/ https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com/ From patch 8: Personal computers ------------------ :Thrashing prevention: Write ``N`` to ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of ``N`` milliseconds from getting evicted. The OOM killer is invoked if this working set can't be kept in memory. Based on the average human detectable lag (~100ms), ``N=1000`` usually eliminates intolerable lags due to thrashing. Larger values like ``N=3000`` make lags less noticeable at the cost of more OOM kills.
On Fri, Jan 07, 2022 at 10:00:31AM +0100, Michal Hocko wrote: > On Fri 07-01-22 09:55:09, Michal Hocko wrote: > [...] > > > In this case, lru_gen_mm_walk is small (160 bytes); it's per direct > > > reclaimer; and direct reclaimers rarely come here, i.e., only when > > > kswapd can't keep up in terms of the aging, which is similar to the > > > condition where the inactive list is empty for the active/inactive > > > lru. > > > > Well, this is not a strong argument to be honest. Kswapd being stuck > > and the majority of the reclaim being done in the direct reclaim > > context is a situation I have seen many many times. > > Also do not forget that memcg reclaim is effectivelly only direct > reclaim. Not that the memcg reclaim indicates a global memory shortage > but it can add up and race with the global reclaim as well. I don't dispute any of the above, and I probably don't like this code more than you do. But let's not forget the purposes of PF_MEMALLOC, besides preventing recursive reclaims, include letting reclaim dip into reserves so that it can make more free memory. So I think it's acceptable if the following conditions are met: 1. The allocation size is small. 2. The number of allocations is bounded. 3. Its failure doesn't stall reclaim. And it'd be nice if 4. The allocation happens rarely, e.g., slow path only. The code in question meets all of them. 1. This allocation is 160 bytes. 2. It's bounded by the number of page table walkers which, in the worst, is same as the number of mm_struct's. 3. Most importantly, its failure doesn't stall the aging. The aging will fallback to the rmap-based function lru_gen_look_around(). But this function only gathers the accessed bit from at most 64 PTEs, meaning it's less efficient (retains ~80% performance gains). 4. This allocation is rare, i.e., only when the aging is required, which is similar to the low inactive case for the active/inactive lru. The bottom line is I can try various optimizations, e.g., preallocate a few buffers for a limited number of page walkers and if this number has been reached, fallback to the rmap-based function. But I have yet to see evidence that calls for additional complexity.
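A hedged sketch of the failure path described in point 3, to make the fallback concrete; the caller shown here is illustrative, as the actual call site is not quoted in this thread:

struct lru_gen_mm_walk *walk = alloc_mm_walk();

if (!walk) {
	/*
	 * Point 3: no stall on failure. The page-table walk is simply
	 * skipped; the accessed bit is still harvested, less efficiently,
	 * by the rmap-based lru_gen_look_around(), which looks at no more
	 * than 64 PTEs around each referenced page.
	 */
	return false;
}

walk_mm(lruvec, mm, walk);
free_mm_walk(walk);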
On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote: > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > [...] > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk) > > +{ > > + static const struct mm_walk_ops mm_walk_ops = { > > + .test_walk = should_skip_vma, > > + .p4d_entry = walk_pud_range, > > + }; > > + > > + int err; > > +#ifdef CONFIG_MEMCG > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > > +#endif > > + > > + walk->next_addr = FIRST_USER_ADDRESS; > > + > > + do { > > + unsigned long start = walk->next_addr; > > + unsigned long end = mm->highest_vm_end; > > + > > + err = -EBUSY; > > + > > + rcu_read_lock(); > > +#ifdef CONFIG_MEMCG > > + if (memcg && atomic_read(&memcg->moving_account)) > > + goto contended; > > +#endif > > + if (!mmap_read_trylock(mm)) > > + goto contended; > > Have you evaluated the behavior under mmap_sem contention? I mean what > would be an effect of some mms being excluded from the walk? This path > is called from direct reclaim and we do allocate with exclusive mmap_sem > IIRC and the trylock can fail in a presence of pending writer if I am > not mistaken so even the read lock holder (e.g. an allocation from the #PF) > can bypass the walk. You are right. Here it must be a trylock; otherwise it can deadlock. I think there might be a misunderstanding: the aging doesn't exclusively rely on page table walks to gather the accessed bit. It prefers page table walks but it can also fallback to the rmap-based function, i.e., lru_gen_look_around(), which only gathers the accessed bit from at most 64 PTEs and therefore is less efficient. But it still retains about 80% of the performance gains. > Or is this considered statistically insignificant thus a theoretical > problem? Yes. People who work on the maple tree and SPF at Google expressed the same concern during the design review meeting (all stakeholders on the mailing list were also invited). So we had a counter to monitor the contention in previous versions, i.e., MM_LOCK_CONTENTION in v4 here: https://lore.kernel.org/lkml/20210818063107.2696454-8-yuzhao@google.com/ And we also combined this patchset with the SPF patchset to see if the latter makes any difference. Our conclusion was the contention is statistically insignificant to the performance under memory pressure. This can be explained by how often we create a new generation. (We only walk page tables when we create a new generation. And it's similar to the low inactive condition for the active/inactive lru.) Usually we only do so every few seconds. We'd run into problems with other parts of the kernel, e.g., lru lock contention, i/o congestion, etc. if we create more than a few generation every second.
On Sun 09-01-22 21:47:57, Yu Zhao wrote: > On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote: > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > [...] > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk) > > > +{ > > > + static const struct mm_walk_ops mm_walk_ops = { > > > + .test_walk = should_skip_vma, > > > + .p4d_entry = walk_pud_range, > > > + }; > > > + > > > + int err; > > > +#ifdef CONFIG_MEMCG > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > > > +#endif > > > + > > > + walk->next_addr = FIRST_USER_ADDRESS; > > > + > > > + do { > > > + unsigned long start = walk->next_addr; > > > + unsigned long end = mm->highest_vm_end; > > > + > > > + err = -EBUSY; > > > + > > > + rcu_read_lock(); > > > +#ifdef CONFIG_MEMCG > > > + if (memcg && atomic_read(&memcg->moving_account)) > > > + goto contended; > > > +#endif > > > + if (!mmap_read_trylock(mm)) > > > + goto contended; > > > > Have you evaluated the behavior under mmap_sem contention? I mean what > > would be an effect of some mms being excluded from the walk? This path > > is called from direct reclaim and we do allocate with exclusive mmap_sem > > IIRC and the trylock can fail in a presence of pending writer if I am > > not mistaken so even the read lock holder (e.g. an allocation from the #PF) > > can bypass the walk. > > You are right. Here it must be a trylock; otherwise it can deadlock. Yeah, this is clear. > I think there might be a misunderstanding: the aging doesn't > exclusively rely on page table walks to gather the accessed bit. It > prefers page table walks but it can also fallback to the rmap-based > function, i.e., lru_gen_look_around(), which only gathers the accessed > bit from at most 64 PTEs and therefore is less efficient. But it still > retains about 80% of the performance gains. I have to say that I really have hard time to understand the runtime behavior depending on that interaction. How does the reclaim behave when the virtual scan is enabled, partially enabled and almost completely disabled due to different constrains? I do not see any such an evaluation described in changelogs and I consider this to be a rather important information to judge the overall behavior. > > Or is this considered statistically insignificant thus a theoretical > > problem? > > Yes. People who work on the maple tree and SPF at Google expressed the > same concern during the design review meeting (all stakeholders on the > mailing list were also invited). So we had a counter to monitor the > contention in previous versions, i.e., MM_LOCK_CONTENTION in v4 here: > https://lore.kernel.org/lkml/20210818063107.2696454-8-yuzhao@google.com/ > > And we also combined this patchset with the SPF patchset to see if the > latter makes any difference. Our conclusion was the contention is > statistically insignificant to the performance under memory pressure. > > This can be explained by how often we create a new generation. (We > only walk page tables when we create a new generation. And it's > similar to the low inactive condition for the active/inactive lru.) > > Usually we only do so every few seconds. We'd run into problems with > other parts of the kernel, e.g., lru lock contention, i/o congestion, > etc. if we create more than a few generation every second. This would be a very good information to have in changelogs. Ideally with some numbers and analysis.
On Sun 09-01-22 20:58:02, Yu Zhao wrote: > On Fri, Jan 07, 2022 at 10:00:31AM +0100, Michal Hocko wrote: > > On Fri 07-01-22 09:55:09, Michal Hocko wrote: > > [...] > > > > In this case, lru_gen_mm_walk is small (160 bytes); it's per direct > > > > reclaimer; and direct reclaimers rarely come here, i.e., only when > > > > kswapd can't keep up in terms of the aging, which is similar to the > > > > condition where the inactive list is empty for the active/inactive > > > > lru. > > > > > > Well, this is not a strong argument to be honest. Kswapd being stuck > > > and the majority of the reclaim being done in the direct reclaim > > > context is a situation I have seen many many times. > > > > Also do not forget that memcg reclaim is effectivelly only direct > > reclaim. Not that the memcg reclaim indicates a global memory shortage > > but it can add up and race with the global reclaim as well. > > I don't dispute any of the above, and I probably don't like this code > more than you do. > > But let's not forget the purposes of PF_MEMALLOC, besides preventing > recursive reclaims, include letting reclaim dip into reserves so that > it can make more free memory. So I think it's acceptable if the > following conditions are met: > 1. The allocation size is small. > 2. The number of allocations is bounded. > 3. Its failure doesn't stall reclaim. > And it'd be nice if > 4. The allocation happens rarely, e.g., slow path only. I would add 0. The allocation should be done only if absolutely _necessary_. Please keep in mind that whatever you allocate from that context will be consuming a very precious memory reserves which are shared with other components of the system. Even worse these can go all the way to depleting memory completely where other things can fall apart. > The code in question meets all of them. > > 1. This allocation is 160 bytes. > 2. It's bounded by the number of page table walkers which, in the > worst, is same as the number of mm_struct's. > 3. Most importantly, its failure doesn't stall the aging. The aging > will fallback to the rmap-based function lru_gen_look_around(). > But this function only gathers the accessed bit from at most 64 > PTEs, meaning it's less efficient (retains ~80% performance gains). > 4. This allocation is rare, i.e., only when the aging is required, > which is similar to the low inactive case for the active/inactive > lru. I think this fallback behavior deserves much more detailed explanation in changelogs. > The bottom line is I can try various optimizations, e.g., preallocate > a few buffers for a limited number of page walkers and if this number > has been reached, fallback to the rmap-based function. But I have yet > to see evidence that calls for additional complexity. I would disagree here. This is not an optimization. You should be avoiding allocations from the memory reclaim because any allocation just add a runtime behavior complexity and potential corner cases.
On Thu 06-01-22 17:12:18, Michal Hocko wrote: > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > +static struct lru_gen_mm_walk *alloc_mm_walk(void) > > +{ > > + if (!current->reclaim_state || !current->reclaim_state->mm_walk) > > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL); One thing I have overlooked completely. You cannot really use GFP_KERNEL allocation here because the reclaim context can be constrained (e.g. GFP_NOFS). This allocation will not do any reclaim as it is PF_MEMALLOC but I suspect that the lockdep will complain anyway. Also kvmalloc is not really great here. a) vmalloc path is never executed for small objects and b) we do not really want to make a dependency between vmalloc and the reclaim (by vmalloc -> reclaim -> vmalloc). Even if we rule out vmalloc and look at kmalloc alone. Is this really safe? I do not see any recursion prevention in the SL.B code. Maybe this just happens to work but the dependency should be really documented so that future SL.B changes won't break the whole scheme.
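If the allocation stays, one way to address the gfp concern above is to inherit the reclaim caller's constraints rather than hard-coding GFP_KERNEL, and to drop the kvmalloc/vmalloc path for this small object. A hedged sketch, assuming sc->gfp_mask is plumbed into alloc_mm_walk(), which the quoted patch does not do; GFP_RECLAIM_MASK comes from mm/internal.h:

static struct lru_gen_mm_walk *alloc_mm_walk(struct scan_control *sc)
{
	if (current->reclaim_state && current->reclaim_state->mm_walk)
		return current->reclaim_state->mm_walk;

	/* respect GFP_NOFS/GFP_NOIO callers; a 160-byte object never needs vmalloc */
	return kzalloc(sizeof(struct lru_gen_mm_walk),
		       (sc->gfp_mask & GFP_RECLAIM_MASK) | __GFP_NOWARN);
}

This does not settle the recursion question raised below; the simplest answer to that remains not allocating from the reclaim path at all.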
On Fri 07-01-22 16:36:11, Yu Zhao wrote: > On Fri, Jan 07, 2022 at 02:11:29PM +0100, Michal Hocko wrote: > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > [...] > > > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) > > > +{ > > > + struct mem_cgroup *memcg; > > > + bool success = false; > > > + unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl); > > > + > > > + VM_BUG_ON(!current_is_kswapd()); > > > + > > > + current->reclaim_state->mm_walk = &pgdat->mm_walk; > > > + > > > + memcg = mem_cgroup_iter(NULL, NULL, NULL); > > > + do { > > > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); > > > + > > > + if (age_lruvec(lruvec, sc, min_ttl)) > > > + success = true; > > > + > > > + cond_resched(); > > > + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); > > > + > > > + if (!success && mutex_trylock(&oom_lock)) { > > > + struct oom_control oc = { > > > + .gfp_mask = sc->gfp_mask, > > > + .order = sc->order, > > > + }; > > > + > > > + if (!oom_reaping_in_progress()) > > > + out_of_memory(&oc); > > > + > > > + mutex_unlock(&oom_lock); > > > + } > > > > Why do you need to trigger oom killer from this path? Why cannot you > > rely on the page allocator to do that like we do now? > > This is per desktop users' (repeated) requests. The can't tolerate > thrashing as servers do because of UI lags; and they usually don't > have fancy tools like oomd. > > Related discussions I saw: > https://github.com/zen-kernel/zen-kernel/issues/218 > https://lore.kernel.org/lkml/20101028191523.GA14972@google.com/ > https://lore.kernel.org/lkml/20211213051521.21f02dd2@mail.inbox.lv/ > https://lore.kernel.org/lkml/54C2C89C.8080002@gmail.com/ > https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com/ I do not really see any arguments why an userspace based trashing detection cannot be used for those. Could you clarify? Also my question was pointing to why out_of_memory is called from the reclaim rather than the allocator (memcg charging path). It is the caller of the reclaim to control different reclaim strategies and tell when all the hopes are lost and the oom killer should be invoked. This allows for a different policies at the allocator level and this change will break that AFAICS. E.g. what if the underlying allocation context is __GFP_NORETRY? > >From patch 8: > Personal computers > ------------------ > :Thrashing prevention: Write ``N`` to > ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of > ``N`` milliseconds from getting evicted. The OOM killer is invoked if > this working set can't be kept in memory. Based on the average human > detectable lag (~100ms), ``N=1000`` usually eliminates intolerable > lags due to thrashing. Larger values like ``N=3000`` make lags less > noticeable at the cost of more OOM kills. This is a very good example of something that should be a self contained patch with its own justification. TBH it is really not all that clear to me that we want to provide any user visible knob to control OOM behavior based on a time based QoS.
On 1/10/22 16:01, Michal Hocko wrote: > On Thu 06-01-22 17:12:18, Michal Hocko wrote: >> On Tue 04-01-22 13:22:25, Yu Zhao wrote: >> > +static struct lru_gen_mm_walk *alloc_mm_walk(void) >> > +{ >> > + if (!current->reclaim_state || !current->reclaim_state->mm_walk) >> > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL); > > One thing I have overlooked completely. You cannot really use GFP_KERNEL > allocation here because the reclaim context can be constrained (e.g. > GFP_NOFS). This allocation will not do any reclaim as it is PF_MEMALLOC > but I suspect that the lockdep will complain anyway. > > Also kvmalloc is not really great here. a) vmalloc path is never > executed for small objects and b) we do not really want to make a > dependency between vmalloc and the reclaim (by vmalloc -> reclaim -> > vmalloc). > > Even if we rule out vmalloc and look at kmalloc alone. Is this really > safe? I do not see any recursion prevention in the SL.B code. Maybe this > just happens to work but the dependency should be really documented so > that future SL.B changes won't break the whole scheme. Slab implementations drop all locks before calling into page allocator (thus possibly reclaim) so slab itself should be fine and I don't expect it to change. But we could eventually reach the page allocator recursively again, that's true and not great.
On Mon 10-01-22 17:01:07, Vlastimil Babka wrote: > On 1/10/22 16:01, Michal Hocko wrote: > > On Thu 06-01-22 17:12:18, Michal Hocko wrote: > >> On Tue 04-01-22 13:22:25, Yu Zhao wrote: > >> > +static struct lru_gen_mm_walk *alloc_mm_walk(void) > >> > +{ > >> > + if (!current->reclaim_state || !current->reclaim_state->mm_walk) > >> > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL); > > > > One thing I have overlooked completely. You cannot really use GFP_KERNEL > > allocation here because the reclaim context can be constrained (e.g. > > GFP_NOFS). This allocation will not do any reclaim as it is PF_MEMALLOC > > but I suspect that the lockdep will complain anyway. > > > > Also kvmalloc is not really great here. a) vmalloc path is never > > executed for small objects and b) we do not really want to make a > > dependency between vmalloc and the reclaim (by vmalloc -> reclaim -> > > vmalloc). > > > > Even if we rule out vmalloc and look at kmalloc alone. Is this really > > safe? I do not see any recursion prevention in the SL.B code. Maybe this > > just happens to work but the dependency should be really documented so > > that future SL.B changes won't break the whole scheme. > > Slab implementations drop all locks before calling into page allocator (thus > possibly reclaim) so slab itself should be fine and I don't expect it to > change. But we could eventually reach the page allocator recursively again, > that's true and not great. Thanks for double checking. If recursion is really intended and something SL.B allocators should support then this is definitely worth documenting so that a subtle change won't break in the future.
On Tue 04-01-22 13:22:25, Yu Zhao wrote: [...] > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk) > +{ > + static const struct mm_walk_ops mm_walk_ops = { > + .test_walk = should_skip_vma, > + .p4d_entry = walk_pud_range, > + }; > + > + int err; > +#ifdef CONFIG_MEMCG > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > +#endif > + > + walk->next_addr = FIRST_USER_ADDRESS; > + > + do { > + unsigned long start = walk->next_addr; > + unsigned long end = mm->highest_vm_end; > + > + err = -EBUSY; > + > + rcu_read_lock(); > +#ifdef CONFIG_MEMCG > + if (memcg && atomic_read(&memcg->moving_account)) > + goto contended; > +#endif Why do you need to check for moving_account?
On Mon, Jan 10, 2022 at 04:35:46PM +0100, Michal Hocko wrote: > On Fri 07-01-22 16:36:11, Yu Zhao wrote: > > On Fri, Jan 07, 2022 at 02:11:29PM +0100, Michal Hocko wrote: > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > > [...] > > > > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) > > > > +{ > > > > + struct mem_cgroup *memcg; > > > > + bool success = false; > > > > + unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl); > > > > + > > > > + VM_BUG_ON(!current_is_kswapd()); > > > > + > > > > + current->reclaim_state->mm_walk = &pgdat->mm_walk; > > > > + > > > > + memcg = mem_cgroup_iter(NULL, NULL, NULL); > > > > + do { > > > > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); > > > > + > > > > + if (age_lruvec(lruvec, sc, min_ttl)) > > > > + success = true; > > > > + > > > > + cond_resched(); > > > > + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); > > > > + > > > > + if (!success && mutex_trylock(&oom_lock)) { > > > > + struct oom_control oc = { > > > > + .gfp_mask = sc->gfp_mask, > > > > + .order = sc->order, > > > > + }; > > > > + > > > > + if (!oom_reaping_in_progress()) > > > > + out_of_memory(&oc); > > > > + > > > > + mutex_unlock(&oom_lock); > > > > + } > > > > > > Why do you need to trigger oom killer from this path? Why cannot you > > > rely on the page allocator to do that like we do now? > > > > This is per desktop users' (repeated) requests. The can't tolerate > > thrashing as servers do because of UI lags; and they usually don't > > have fancy tools like oomd. > > > > Related discussions I saw: > > https://github.com/zen-kernel/zen-kernel/issues/218 > > https://lore.kernel.org/lkml/20101028191523.GA14972@google.com/ > > https://lore.kernel.org/lkml/20211213051521.21f02dd2@mail.inbox.lv/ > > https://lore.kernel.org/lkml/54C2C89C.8080002@gmail.com/ > > https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com/ > > I do not really see any arguments why an userspace based trashing > detection cannot be used for those. Could you clarify? It definitely can be done. But who is going to do it for every distro and all individual users? AFAIK, not a single distro provides such a solution for desktop/laptop/phone users. And also there is the theoretical question how reliable a userspace solution can be. What if this usespace solution itself gets stuck in the direct reclaim path. I'm sure if nobody has done some search to prove or debunk it. In addition, what exactly PSI values should be used on different models of consumer electronics? Nobody knows. We have a team working on this and we haven't figured it out for all our Chromebook models. As Andrew said, "a blunt instrument like this would be useful". https://lore.kernel.org/lkml/20211202135824.33d2421bf5116801cfa2040d@linux-foundation.org/ I'd like to have less code in kernel too, but I've learned never to walk over users. If I remove this and they come after me asking why, I'd have a hard time convincing them. > Also my question was pointing to why out_of_memory is called from the > reclaim rather than the allocator (memcg charging path). It is the > caller of the reclaim to control different reclaim strategies and tell > when all the hopes are lost and the oom killer should be invoked. This > allows for a different policies at the allocator level and this change > will break that AFAICS. E.g. what if the underlying allocation context > is __GFP_NORETRY? This is called in kswapd only, and by default (min_ttl=0) it doesn't do anything. So __GFP_NORETRY doesn't apply. 
The question would be more along the lines of long-term ABI support. And I'll add the following comments, if you think we can keep this logic: OOM kill if every generation from all memcgs is younger than min_ttl. Another theoretical possibility is all memcgs are either below min or ineligible at priority 0, but this isn't the main goal. (Please read my reply at the bottom to decide whether we should keep it or not. Thanks.) > > >From patch 8: > > Personal computers > > ------------------ > > :Thrashing prevention: Write ``N`` to > > ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of > > ``N`` milliseconds from getting evicted. The OOM killer is invoked if > > this working set can't be kept in memory. Based on the average human > > detectable lag (~100ms), ``N=1000`` usually eliminates intolerable > > lags due to thrashing. Larger values like ``N=3000`` make lags less > > noticeable at the cost of more OOM kills. > > This is a very good example of something that should be a self contained > patch with its own justification. Consider it done. > TBH it is really not all that clear to > me that we want to provide any user visible knob to control OOM behavior > based on a time based QoS. Agreed, and it didn't exist until v4, i.e., after I was demanded to provide it for several times. For example: https://github.com/zen-kernel/zen-kernel/issues/223 And another example: Your Multigenerational LRU patchset is pretty complex and effective, but does not eliminate thrashing condition fully on an old PCs with slow HDD. I'm kindly asking you to cooperate with hakavlad if it's possible and maybe re-implement parts of le9 patch in your patchset wherever acceptable, as they are quite similar in the core concept. This is excerpt of an email from iam@valdikss.org.ru, and he has posted demo videos in this discussion: https://lore.kernel.org/lkml/2dc51fc8-f14e-17ed-a8c6-0ec70423bf54@valdikss.org.ru/
On Mon 10-01-22 18:18:55, Yu Zhao wrote: > On Mon, Jan 10, 2022 at 04:35:46PM +0100, Michal Hocko wrote: > > On Fri 07-01-22 16:36:11, Yu Zhao wrote: > > > On Fri, Jan 07, 2022 at 02:11:29PM +0100, Michal Hocko wrote: > > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > > > [...] > > > > > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) > > > > > +{ > > > > > + struct mem_cgroup *memcg; > > > > > + bool success = false; > > > > > + unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl); > > > > > + > > > > > + VM_BUG_ON(!current_is_kswapd()); > > > > > + > > > > > + current->reclaim_state->mm_walk = &pgdat->mm_walk; > > > > > + > > > > > + memcg = mem_cgroup_iter(NULL, NULL, NULL); > > > > > + do { > > > > > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); > > > > > + > > > > > + if (age_lruvec(lruvec, sc, min_ttl)) > > > > > + success = true; > > > > > + > > > > > + cond_resched(); > > > > > + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); > > > > > + > > > > > + if (!success && mutex_trylock(&oom_lock)) { > > > > > + struct oom_control oc = { > > > > > + .gfp_mask = sc->gfp_mask, > > > > > + .order = sc->order, > > > > > + }; > > > > > + > > > > > + if (!oom_reaping_in_progress()) > > > > > + out_of_memory(&oc); > > > > > + > > > > > + mutex_unlock(&oom_lock); > > > > > + } > > > > > > > > Why do you need to trigger oom killer from this path? Why cannot you > > > > rely on the page allocator to do that like we do now? > > > > > > This is per desktop users' (repeated) requests. The can't tolerate > > > thrashing as servers do because of UI lags; and they usually don't > > > have fancy tools like oomd. > > > > > > Related discussions I saw: > > > https://github.com/zen-kernel/zen-kernel/issues/218 > > > https://lore.kernel.org/lkml/20101028191523.GA14972@google.com/ > > > https://lore.kernel.org/lkml/20211213051521.21f02dd2@mail.inbox.lv/ > > > https://lore.kernel.org/lkml/54C2C89C.8080002@gmail.com/ > > > https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com/ > > > > I do not really see any arguments why an userspace based trashing > > detection cannot be used for those. Could you clarify? > > It definitely can be done. But who is going to do it for every distro > and all individual users? AFAIK, not a single distro provides such a > solution for desktop/laptop/phone users. If existing interfaces provides sufficient information to make those calls then I would definitely prefer a userspace solution. > And also there is the theoretical question how reliable a userspace > solution can be. What if this usespace solution itself gets stuck in > the direct reclaim path. I'm sure if nobody has done some search to > prove or debunk it. I have to confess I haven't checked oomd or other solutions but with a sufficient care (all the code mlocked in + no allocations done while collecting data) I believe this should be achieveable. > In addition, what exactly PSI values should be used on different > models of consumer electronics? Nobody knows. We have a team working > on this and we haven't figured it out for all our Chromebook models. I believe this is a matter of tuning for a specific deployment. We do not have only psi but also refault counters that can be used. > As Andrew said, "a blunt instrument like this would be useful". > https://lore.kernel.org/lkml/20211202135824.33d2421bf5116801cfa2040d@linux-foundation.org/ > > I'd like to have less code in kernel too, but I've learned never to > walk over users. 
If I remove this and they come after me asking why, > I'd have a hard time convincing them. > > > Also my question was pointing to why out_of_memory is called from the > > reclaim rather than the allocator (memcg charging path). It is the > > caller of the reclaim to control different reclaim strategies and tell > > when all the hopes are lost and the oom killer should be invoked. This > > allows for a different policies at the allocator level and this change > > will break that AFAICS. E.g. what if the underlying allocation context > > is __GFP_NORETRY? > > This is called in kswapd only, and by default (min_ttl=0) it doesn't > do anything. So __GFP_NORETRY doesn't apply. My bad. I must have got lost when traversing the code but I can see you are enforcing that by a VM_BUG_ON. So the limited scope reclaim is not a problem indeed. > The question would be > more along the lines of long-term ABI support. > > And I'll add the following comments, if you think we can keep this > logic: > OOM kill if every generation from all memcgs is younger than min_ttl. > Another theoretical possibility is all memcgs are either below min or > ineligible at priority 0, but this isn't the main goal. > > (Please read my reply at the bottom to decide whether we should keep > it or not. Thanks.) > > > > >From patch 8: > > > Personal computers > > > ------------------ > > > :Thrashing prevention: Write ``N`` to > > > ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of > > > ``N`` milliseconds from getting evicted. The OOM killer is invoked if > > > this working set can't be kept in memory. Based on the average human > > > detectable lag (~100ms), ``N=1000`` usually eliminates intolerable > > > lags due to thrashing. Larger values like ``N=3000`` make lags less > > > noticeable at the cost of more OOM kills. > > > > This is a very good example of something that should be a self contained > > patch with its own justification. > > Consider it done. > > > TBH it is really not all that clear to > > me that we want to provide any user visible knob to control OOM behavior > > based on a time based QoS. > > Agreed, and it didn't exist until v4, i.e., after I was demanded to > provide it for several times. > > For example: > https://github.com/zen-kernel/zen-kernel/issues/223 > > And another example: > Your Multigenerational LRU patchset is pretty complex and > effective, but does not eliminate thrashing condition fully on an > old PCs with slow HDD. > > I'm kindly asking you to cooperate with hakavlad if it's possible > and maybe re-implement parts of le9 patch in your patchset wherever > acceptable, as they are quite similar in the core concept. > > This is excerpt of an email from iam@valdikss.org.ru, and he has > posted demo videos in this discussion: > https://lore.kernel.org/lkml/2dc51fc8-f14e-17ed-a8c6-0ec70423bf54@valdikss.org.ru/ That is all an interesting feedback but we should be really craful about ABI constrains and future maintainability of the knob. I still stand behind my statement that kernel should implement such features only if it is clear that we cannot really implement a similar logic in the userspace.
On Tue 11-01-22 11:21:48, Alexey Avramov wrote:
> > I do not really see any arguments why an userspace based trashing
> > detection cannot be used for those.
>
> Firstly, because this is the task of the kernel, not the user space.
> Memory is managed by the kernel, not by the user space.
> The absence of such a mechanism in the kernel is a fundamental problem.
> The userspace tools are ugly hacks:
> some of them consume a lot of CPU [1],
> some of them consume a lot of memory [2],
> some of them cannot use process_mrelease() (earlyoom, nohang),
> some of them kill only the whole cgroup (systemd-oomd, oomd) [3]
> and depend on systemd and cgroup_v2 (oomd, systemd-oomd).

Thanks for those links. I read through them, and my understanding is that most of those issues are very specific to the tool used and nothing really fundamental caused by a lack of kernel support.

> One of the biggest challenges for userspace oom-killers is that they may
> have to function under intense memory pressure and are prone to getting
> stuck in memory reclaim themselves [4].

This one is more interesting, and the truth is that handling the complete OOM situation from userspace is really tricky, especially with a more complex oom decision policy. In the past we have discussed potential ways to implement an oom kill policy by kernel modules or eBPF, without anybody following up on that.

But I suspect you are mixing up two things here. One of them is the out-of-memory situation where no memory can be reclaimed or allocated. The other is one where memory can be reclaimed and progress is made, but that leads to thrashing because most of the time is spent refaulting memory reclaimed shortly before.

The first one is addressed by the global oom killer, which tries to be as conservative as possible because it is a very disruptive operation. The latter one is more complex, and proper handling really depends on the particular workload, because it is more of a QoS matter than an emergency action to keep the system alive. There are workloads which prefer temporary thrashing of their working set during a peak memory demand over an OOM kill, because way too much work would be thrown away. On the other side, latency-sensitive workloads can see even the direct reclaim as a runtime-visible problem. I hope you can imagine there is a really large gap between those two cases and no simple solution can be applied to the whole range.

Therefore we have PSI and refault stats exported to userspace so that a workload-specific policy can be implemented there. If userspace has a hard time using that data and acting upon it, then let's talk about specifics. For the most steady thrashing situations I have seen, a userspace agent with mlocked memory and code can make forward progress and mediate the situation.

[...]

> [1] https://github.com/facebookincubator/oomd/issues/79
> [2] https://github.com/hakavlad/nohang#memory-and-cpu-usage
> [3] https://github.com/facebookincubator/oomd/issues/125
> [4] https://lore.kernel.org/all/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz+u3bFFyOLg@mail.gmail.com/
> [5] https://github.com/zen-kernel/zen-kernel/issues/223
> [6] https://raw.githubusercontent.com/hakavlad/cache-tests/main/mg-LRU-v3_vs_classic-LRU/3-firefox-tail-OOM/mg-LRU-1/psi2
> [7] https://lore.kernel.org/linux-mm/20211202150614.22440-1-mgorman@techsingularity.net/
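For comparison, the kind of userspace monitor described here (mlocked, no allocations on its hot path, acting on PSI) can be fairly small. A minimal, illustrative sketch that polls /proc/pressure/memory and reacts once the "full" avg10 stall crosses a threshold; the 25% threshold and the reaction itself are deployment-specific placeholders, not a recommendation:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char buf[256];

	/* lock the monitor itself in memory so it cannot get stuck in reclaim */
	if (mlockall(MCL_CURRENT | MCL_FUTURE))
		perror("mlockall");

	for (;;) {
		int fd = open("/proc/pressure/memory", O_RDONLY);
		ssize_t len;
		char *full;
		double avg10 = 0.0;

		if (fd < 0)
			return 1;

		len = read(fd, buf, sizeof(buf) - 1);
		close(fd);
		if (len <= 0)
			return 1;
		buf[len] = '\0';

		/* the second line of the file reads "full avg10=... avg60=..." */
		full = strstr(buf, "full avg10=");
		if (full)
			avg10 = strtod(full + strlen("full avg10="), NULL);

		if (avg10 > 25.0) {
			/* placeholder: pick a victim, notify, log, etc. */
			fprintf(stderr, "memory pressure: full avg10=%.2f\n", avg10);
		}

		sleep(1);
	}
}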
> I do not really see any arguments why an userspace based trashing
> detection cannot be used for those.

Firstly, because this is the task of the kernel, not the user space. Memory is managed by the kernel, not by the user space. The absence of such a mechanism in the kernel is a fundamental problem.

The userspace tools are ugly hacks:
some of them consume a lot of CPU [1],
some of them consume a lot of memory [2],
some of them cannot use process_mrelease() (earlyoom, nohang),
some of them kill only the whole cgroup (systemd-oomd, oomd) [3]
and depend on systemd and cgroup_v2 (oomd, systemd-oomd).

One of the biggest challenges for userspace oom-killers is that they may have to function under intense memory pressure and are prone to getting stuck in memory reclaim themselves [4].

It is strange that after decades of user complaints about thrashing and a non-working OOM killer I have to explain the obvious things. The basic mechanism must be implemented in the kernel. Stop shifting responsibility to the user space!

Secondly, the real reason for the min_ttl_ms mechanism is that without it, multi-minute stalls are possible [5] even when the killer is expected to arrive, and memory pressure is close to 100 during this period [6]. This fixes a bug that does not exist in the mainline LRU (it is an MGLRU-specific bug). BTW, similar symptoms were recently fixed in the mainline [7].

[1] https://github.com/facebookincubator/oomd/issues/79
[2] https://github.com/hakavlad/nohang#memory-and-cpu-usage
[3] https://github.com/facebookincubator/oomd/issues/125
[4] https://lore.kernel.org/all/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz+u3bFFyOLg@mail.gmail.com/
[5] https://github.com/zen-kernel/zen-kernel/issues/223
[6] https://raw.githubusercontent.com/hakavlad/cache-tests/main/mg-LRU-v3_vs_classic-LRU/3-firefox-tail-OOM/mg-LRU-1/psi2
[7] https://lore.kernel.org/linux-mm/20211202150614.22440-1-mgorman@techsingularity.net/

[I am duplicating a previous message here - it was not delivered to mailing lists]
On Mon, Jan 10, 2022 at 04:01:13PM +0100, Michal Hocko wrote: > On Thu 06-01-22 17:12:18, Michal Hocko wrote: > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > > +static struct lru_gen_mm_walk *alloc_mm_walk(void) > > > +{ > > > + if (!current->reclaim_state || !current->reclaim_state->mm_walk) > > > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL); > > One thing I have overlooked completely. I appreciate your attention to details but GFP_KERNEL is legit in the reclaim path. It's been used many years in our production, e.g., page reclaim swap_writepage() frontswap_store() zswap_frontswap_store() zswap_entry_cache_alloc(GFP_KERNEL) (And I always test my changes with lockdep, kasan, DEBUG_VM, etc., no warnings ever seen from using GFP_KERNEL in the reclaim path.) > You cannot really use GFP_KERNEL > allocation here because the reclaim context can be constrained (e.g. > GFP_NOFS). This allocation will not do any reclaim as it is PF_MEMALLOC > but I suspect that the lockdep will complain anyway. > > Also kvmalloc is not really great here. a) vmalloc path is never > executed for small objects and b) we do not really want to make a > dependency between vmalloc and the reclaim (by vmalloc -> reclaim -> > vmalloc). > > Even if we rule out vmalloc and look at kmalloc alone. Is this really > safe? I do not see any recursion prevention in the SL.B code. Maybe this > just happens to work but the dependency should be really documented so > that future SL.B changes won't break the whole scheme. Affirmative, as Vlastimil has clarified. Thanks!
On Mon, Jan 10, 2022 at 05:57:39PM +0100, Michal Hocko wrote: > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > [...] > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk) > > +{ > > + static const struct mm_walk_ops mm_walk_ops = { > > + .test_walk = should_skip_vma, > > + .p4d_entry = walk_pud_range, > > + }; > > + > > + int err; > > +#ifdef CONFIG_MEMCG > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > > +#endif > > + > > + walk->next_addr = FIRST_USER_ADDRESS; > > + > > + do { > > + unsigned long start = walk->next_addr; > > + unsigned long end = mm->highest_vm_end; > > + > > + err = -EBUSY; > > + > > + rcu_read_lock(); > > +#ifdef CONFIG_MEMCG > > + if (memcg && atomic_read(&memcg->moving_account)) > > + goto contended; > > +#endif > > Why do you need to check for moving_account? This check, if succeeds, blocks memcg migration. Our goal is to move pages between different generations of the same lruvec (the first arg). Meanwhile, pages can also be migrated between different memcgs (different lruvecs). The active/inactive lru uses isolation to block memcg migration. Generations account pages similarly to the active/inactive lru, i.e., each generation has nr_pages counter. However, unlike the active/ inactive lru, a page can be moved to a different generation without getting isolated or even without being under the lru lock, as long as the delta is eventually accounted for (which does require the lru lock when it happens). The generation counter in page->flags (folio->flags to be precise) stores 0 when a page is isolated, to synchronize with isolation.
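To make the folio->flags encoding concrete: the generation number lives in a dedicated bit field (LRU_GEN_MASK/LRU_GEN_PGOFF in the series), stored biased by one so that the value 0 means "isolated or not on a multi-generational list". A sketch of the accessors this implies; the helper names here are illustrative, while the series open-codes the same pattern in folio_update_gen() and folio_inc_gen() further down in the patch:

/* sketch: read the generation, -1 if the folio is isolated or unmanaged */
static int folio_lru_gen(struct folio *folio)
{
	unsigned long flags = READ_ONCE(folio->flags);

	return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
}

/* sketch: move the folio to @gen without isolating it */
static void folio_set_lru_gen(struct folio *folio, int gen)
{
	unsigned long old_flags, new_flags;

	do {
		new_flags = old_flags = READ_ONCE(folio->flags);
		new_flags &= ~LRU_GEN_MASK;
		/* store gen + 1 so that 0 keeps meaning "isolated" */
		new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
	} while (new_flags != old_flags &&
		 cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
}

The cmpxchg loop is what allows the move without the lru lock; the per-generation nr_pages delta is batched and applied later under the lock, as described above.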
On Tue 11-01-22 18:01:29, Yu Zhao wrote: > On Mon, Jan 10, 2022 at 05:57:39PM +0100, Michal Hocko wrote: > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > [...] > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk) > > > +{ > > > + static const struct mm_walk_ops mm_walk_ops = { > > > + .test_walk = should_skip_vma, > > > + .p4d_entry = walk_pud_range, > > > + }; > > > + > > > + int err; > > > +#ifdef CONFIG_MEMCG > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > > > +#endif > > > + > > > + walk->next_addr = FIRST_USER_ADDRESS; > > > + > > > + do { > > > + unsigned long start = walk->next_addr; > > > + unsigned long end = mm->highest_vm_end; > > > + > > > + err = -EBUSY; > > > + > > > + rcu_read_lock(); > > > +#ifdef CONFIG_MEMCG > > > + if (memcg && atomic_read(&memcg->moving_account)) > > > + goto contended; > > > +#endif > > > > Why do you need to check for moving_account? > > This check, if succeeds, blocks memcg migration. OK, I can see that you rely on the RCU here for the synchronization. A comment which mentions mem_cgroup_move_charge would be helpful for clarity. Is there any reason you are not using folio_memcg_lock in the pte walk instead?
On Tue 11-01-22 16:16:57, Yu Zhao wrote: > On Mon, Jan 10, 2022 at 04:01:13PM +0100, Michal Hocko wrote: > > On Thu 06-01-22 17:12:18, Michal Hocko wrote: > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > > > +static struct lru_gen_mm_walk *alloc_mm_walk(void) > > > > +{ > > > > + if (!current->reclaim_state || !current->reclaim_state->mm_walk) > > > > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL); > > > > One thing I have overlooked completely. > > I appreciate your attention to details but GFP_KERNEL is legit in the > reclaim path. It's been used many years in our production, e.g., > page reclaim > swap_writepage() > frontswap_store() > zswap_frontswap_store() > zswap_entry_cache_alloc(GFP_KERNEL) > > (And I always test my changes with lockdep, kasan, DEBUG_VM, etc., no > warnings ever seen from using GFP_KERNEL in the reclaim path.) OK, I can see it now. __need_reclaim will check for PF_MEMALLOC and skip the fs_reclaim tracking. I still maintain I am not really happy about (nor in the zswap example) allocations from the direct reclaim context. I would really recommend using a pre-allocated pool of objects. If there are strong reasons for not doing so then at lease change that to kzalloc. Thanks!
On Wed, Jan 12, 2022 at 11:17:53AM +0100, Michal Hocko wrote: > On Tue 11-01-22 18:01:29, Yu Zhao wrote: > > On Mon, Jan 10, 2022 at 05:57:39PM +0100, Michal Hocko wrote: > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > > [...] > > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk) > > > > +{ > > > > + static const struct mm_walk_ops mm_walk_ops = { > > > > + .test_walk = should_skip_vma, > > > > + .p4d_entry = walk_pud_range, > > > > + }; > > > > + > > > > + int err; > > > > +#ifdef CONFIG_MEMCG > > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > > > > +#endif > > > > + > > > > + walk->next_addr = FIRST_USER_ADDRESS; > > > > + > > > > + do { > > > > + unsigned long start = walk->next_addr; > > > > + unsigned long end = mm->highest_vm_end; > > > > + > > > > + err = -EBUSY; > > > > + > > > > + rcu_read_lock(); > > > > +#ifdef CONFIG_MEMCG > > > > + if (memcg && atomic_read(&memcg->moving_account)) > > > > + goto contended; > > > > +#endif > > > > > > Why do you need to check for moving_account? > > > > This check, if succeeds, blocks memcg migration. > > OK, I can see that you rely on the RCU here for the synchronization. A > comment which mentions mem_cgroup_move_charge would be helpful for > clarity. Will do > Is there any reason you are not using folio_memcg_lock in the > pte walk instead? We have a particular lruvec (the first arg), hence a particular memcg to lock. But we don't have a particular page to lock.
On Wed, Jan 12, 2022 at 11:28:57AM +0100, Michal Hocko wrote: > On Tue 11-01-22 16:16:57, Yu Zhao wrote: > > On Mon, Jan 10, 2022 at 04:01:13PM +0100, Michal Hocko wrote: > > > On Thu 06-01-22 17:12:18, Michal Hocko wrote: > > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > > > > +static struct lru_gen_mm_walk *alloc_mm_walk(void) > > > > > +{ > > > > > + if (!current->reclaim_state || !current->reclaim_state->mm_walk) > > > > > + return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL); > > > > > > One thing I have overlooked completely. > > > > I appreciate your attention to details but GFP_KERNEL is legit in the > > reclaim path. It's been used many years in our production, e.g., > > page reclaim > > swap_writepage() > > frontswap_store() > > zswap_frontswap_store() > > zswap_entry_cache_alloc(GFP_KERNEL) > > > > (And I always test my changes with lockdep, kasan, DEBUG_VM, etc., no > > warnings ever seen from using GFP_KERNEL in the reclaim path.) > > OK, I can see it now. __need_reclaim will check for PF_MEMALLOC and skip > the fs_reclaim tracking. > > I still maintain I am not really happy about (nor in the zswap example) > allocations from the direct reclaim context. I would really recommend > using a pre-allocated pool of objects. Not trying to argue anything -- there are many other places in the reclaim path that must allocate memory to make progress, e.g., add_to_swap_cache() xas_nomem() __swap_writepage() bio_alloc() The only way to not allocate memory is drop clean pages. Writing dirty pages (not swap) might require allocations as well. (But we only write dirty pages in kswapd, not in the direct reclaim path.) > If there are strong reasons for not doing so then at lease change that > to kzalloc. Consider it done.
On Mon, Jan 10, 2022 at 03:37:28PM +0100, Michal Hocko wrote: > On Sun 09-01-22 20:58:02, Yu Zhao wrote: > > On Fri, Jan 07, 2022 at 10:00:31AM +0100, Michal Hocko wrote: > > > On Fri 07-01-22 09:55:09, Michal Hocko wrote: > > > [...] > > > > > In this case, lru_gen_mm_walk is small (160 bytes); it's per direct > > > > > reclaimer; and direct reclaimers rarely come here, i.e., only when > > > > > kswapd can't keep up in terms of the aging, which is similar to the > > > > > condition where the inactive list is empty for the active/inactive > > > > > lru. > > > > > > > > Well, this is not a strong argument to be honest. Kswapd being stuck > > > > and the majority of the reclaim being done in the direct reclaim > > > > context is a situation I have seen many many times. > > > > > > Also do not forget that memcg reclaim is effectivelly only direct > > > reclaim. Not that the memcg reclaim indicates a global memory shortage > > > but it can add up and race with the global reclaim as well. > > > > I don't dispute any of the above, and I probably don't like this code > > more than you do. > > > > But let's not forget the purposes of PF_MEMALLOC, besides preventing > > recursive reclaims, include letting reclaim dip into reserves so that > > it can make more free memory. So I think it's acceptable if the > > following conditions are met: > > 1. The allocation size is small. > > 2. The number of allocations is bounded. > > 3. Its failure doesn't stall reclaim. > > And it'd be nice if > > 4. The allocation happens rarely, e.g., slow path only. > > I would add > 0. The allocation should be done only if absolutely _necessary_. > > Please keep in mind that whatever you allocate from that context will be > consuming a very precious memory reserves which are shared with other > components of the system. Even worse these can go all the way to > depleting memory completely where other things can fall apart. I agree but I also see a distinction: 1,2,3 are objective; 0,4 are subjective. For some users, page reclaim itself could be not absolutely necessary because they are okay with OOM kills. But for others, the situation could be reversed. > > The code in question meets all of them. > > > > 1. This allocation is 160 bytes. > > 2. It's bounded by the number of page table walkers which, in the > > worst, is same as the number of mm_struct's. > > 3. Most importantly, its failure doesn't stall the aging. The aging > > will fallback to the rmap-based function lru_gen_look_around(). > > But this function only gathers the accessed bit from at most 64 > > PTEs, meaning it's less efficient (retains ~80% performance gains). > > 4. This allocation is rare, i.e., only when the aging is required, > > which is similar to the low inactive case for the active/inactive > > lru. > > I think this fallback behavior deserves much more detailed explanation > in changelogs. Will do. > > The bottom line is I can try various optimizations, e.g., preallocate > > a few buffers for a limited number of page walkers and if this number > > has been reached, fallback to the rmap-based function. But I have yet > > to see evidence that calls for additional complexity. > > I would disagree here. This is not an optimization. You should be > avoiding allocations from the memory reclaim because any allocation just > add a runtime behavior complexity and potential corner cases. Would __GFP_NOMEMALLOC address your concern? It prevents allocations from accessing the reserves even under PF_MEMALLOC.
On Wed 12-01-22 16:43:15, Yu Zhao wrote: > On Wed, Jan 12, 2022 at 11:17:53AM +0100, Michal Hocko wrote: [...] > > Is there any reason you are not using folio_memcg_lock in the > > pte walk instead? > > We have a particular lruvec (the first arg), hence a particular memcg > to lock. But we don't have a particular page to lock. That is certainly true at this layer but the locking should be needed only for specific pages, no? So you can move the lock down to the callback which examines respective pages. Or is there anything preventing that? To be honest, and that is the reason I am asking, I really do not like to open code the migration synchronization outside of the memcg proper. Code paths which need a stable memcg are supposed to be using folio_memcg_lock for the specific examination time. If you prefer a trylock approach for this usecase then we can add one.
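To illustrate the direction suggested here -- stabilizing the memcg binding per folio at examination time instead of bailing out of the whole walk when memcg->moving_account is set -- a rough sketch of the per-PTE step. walk_one_pte() is a made-up name; folio_update_gen() and struct lru_gen_mm_walk come from the series, and whether a trylock flavour of folio_memcg_lock() is needed in this context is exactly the open question above:

static bool walk_one_pte(struct vm_area_struct *vma, unsigned long addr,
			 pte_t *pte, struct folio *folio,
			 struct mem_cgroup *memcg, struct lru_gen_mm_walk *priv)
{
	bool young = false;

	/* pin the folio's memcg binding only while this folio is examined */
	folio_memcg_lock(folio);

	/* skip folios that have been (or are being) moved to another memcg */
	if (folio_memcg(folio) != memcg)
		goto unlock;

	young = ptep_test_and_clear_young(vma, addr, pte);
	if (young)
		folio_update_gen(folio, priv);
unlock:
	folio_memcg_unlock(folio);

	return young;
}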
On Thu 13-01-22 02:43:38, Yu Zhao wrote: [...] > > > The bottom line is I can try various optimizations, e.g., preallocate > > > a few buffers for a limited number of page walkers and if this number > > > has been reached, fallback to the rmap-based function. But I have yet > > > to see evidence that calls for additional complexity. > > > > I would disagree here. This is not an optimization. You should be > > avoiding allocations from the memory reclaim because any allocation just > > add a runtime behavior complexity and potential corner cases. > > Would __GFP_NOMEMALLOC address your concern? It prevents allocations > from accessing the reserves even under PF_MEMALLOC. __GFP_NOMEMALLOC would deal with the complete memory depletion concern for sure but I am not sure how any of these allocations would succeed when called from the direct reclaim. Some access to memory reserves is necessary if you insist on allocating from the reclaim process. You can have a look at the limited memory reserves access by oom victims for an example of how this can be done.
> The latter one is more complex, and proper handling really depends
> on the particular workload

That is why I advocate the introduction of new tunables.

> There are workloads which prefer temporary thrashing of their working
> set during a peak memory demand over an OOM kill

OK, for such cases, the OOM knobs can be set to 0. It can even be the default value.

> On the other side, latency-sensitive workloads

I daresay that this is the case for most workloads. An internet server that falls into thrashing is a dead server.

> no simple solution can be applied to the whole range

There are several solutions, and they can all be taken into the kernel at the same time; they all work:
- min_ttl_ms + MGLRU
- vm.min_filelist_kbytes-like knobs
- PSI-based solutions.

> For the most steady thrashing situations I have seen, a userspace agent
> with mlocked memory and code can make forward progress and mediate the
> situation.

I still don't see a problem with implementing all of these solutions in kernel space.
On Thu, Jan 13, 2022 at 01:02:26PM +0100, Michal Hocko wrote: > On Thu 13-01-22 02:43:38, Yu Zhao wrote: > [...] > > > > The bottom line is I can try various optimizations, e.g., preallocate > > > > a few buffers for a limited number of page walkers and if this number > > > > has been reached, fallback to the rmap-based function. But I have yet > > > > to see evidence that calls for additional complexity. > > > > > > I would disagree here. This is not an optimization. You should be > > > avoiding allocations from the memory reclaim because any allocation just > > > add a runtime behavior complexity and potential corner cases. > > > > Would __GFP_NOMEMALLOC address your concern? It prevents allocations > > from accessing the reserves even under PF_MEMALLOC. > > __GFP_NOMEMALLOC would deal with the complete memory depletion concern > for sure but I am not sure how any of these allocations would succeed > when called from the direct reclaim. Some access to memory reserves is > necessary if you insist on allocating from the reclaim process. > > You can have a look at the limited memory reserves access by oom victims > for an example of how this can be done. Thanks. I'll change GFP_KERNEL to __GFP_HIGH | __GFP_NOMEMALLOC. __GFP_HIGH allows some access to memory reserves and __GFP_NOMEMALLOC prevents the complete depletion. Basically the combination lower the min watermark by 1/2, and we have been using them for add_to_swap_cache().
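Put together, the change discussed here would make the allocation look roughly like the sketch below (the agreed direction, not the posted patch): kzalloc() instead of kvzalloc(), limited access to reserves via __GFP_HIGH, __GFP_NOMEMALLOC so PF_MEMALLOC cannot drain the reserves, and __GFP_NOWARN added here because a failure is handled by falling back to the rmap-based aging:

static struct lru_gen_mm_walk *alloc_mm_walk(void)
{
	if (current->reclaim_state && current->reclaim_state->mm_walk)
		return current->reclaim_state->mm_walk;

	/*
	 * Small, bounded and optional: a NULL return simply falls back to
	 * lru_gen_look_around(), so no retries and no full reserve access.
	 */
	return kzalloc(sizeof(struct lru_gen_mm_walk),
		       __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
}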
On Mon, Jan 10, 2022 at 11:54:42AM +0100, Michal Hocko wrote: > On Sun 09-01-22 21:47:57, Yu Zhao wrote: > > On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote: > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > > [...] > > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk) > > > > +{ > > > > + static const struct mm_walk_ops mm_walk_ops = { > > > > + .test_walk = should_skip_vma, > > > > + .p4d_entry = walk_pud_range, > > > > + }; > > > > + > > > > + int err; > > > > +#ifdef CONFIG_MEMCG > > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > > > > +#endif > > > > + > > > > + walk->next_addr = FIRST_USER_ADDRESS; > > > > + > > > > + do { > > > > + unsigned long start = walk->next_addr; > > > > + unsigned long end = mm->highest_vm_end; > > > > + > > > > + err = -EBUSY; > > > > + > > > > + rcu_read_lock(); > > > > +#ifdef CONFIG_MEMCG > > > > + if (memcg && atomic_read(&memcg->moving_account)) > > > > + goto contended; > > > > +#endif > > > > + if (!mmap_read_trylock(mm)) > > > > + goto contended; > > > > > > Have you evaluated the behavior under mmap_sem contention? I mean what > > > would be an effect of some mms being excluded from the walk? This path > > > is called from direct reclaim and we do allocate with exclusive mmap_sem > > > IIRC and the trylock can fail in a presence of pending writer if I am > > > not mistaken so even the read lock holder (e.g. an allocation from the #PF) > > > can bypass the walk. > > > > You are right. Here it must be a trylock; otherwise it can deadlock. > > Yeah, this is clear. > > > I think there might be a misunderstanding: the aging doesn't > > exclusively rely on page table walks to gather the accessed bit. It > > prefers page table walks but it can also fallback to the rmap-based > > function, i.e., lru_gen_look_around(), which only gathers the accessed > > bit from at most 64 PTEs and therefore is less efficient. But it still > > retains about 80% of the performance gains. > > I have to say that I really have hard time to understand the runtime > behavior depending on that interaction. How does the reclaim behave when > the virtual scan is enabled, partially enabled and almost completely > disabled due to different constrains? I do not see any such an > evaluation described in changelogs and I consider this to be a rather > important information to judge the overall behavior. It doesn't have (partially) enabled/disabled states nor does its behavior change with different reclaim constraints. Having either would make its design too complex to implement or benchmark. There is feedback loop connecting page table walks and rmap walks by Bloom filters. The Bloom filters hold dense hot areas. Page table walks test whether virtual areas are in the Bloom filters and scan those that were tested positive. Anything they miss will be caught by rmap walks later (shrink_page_list()). And when rmap walks find new dense hot areas, they add those area to the Bloom filters. A dense hot area means it has many accessed pages belonging to the reclaim domain, and clearing the accessed bit in all PTEs within this area by one page table walk is more efficient than doing it one by one by many rmap walks, in terms of cacheline utilization. > > > Or is this considered statistically insignificant thus a theoretical > > > problem? > > > > Yes. 
People who work on the maple tree and SPF at Google expressed the > > same concern during the design review meeting (all stakeholders on the > > mailing list were also invited). So we had a counter to monitor the > > contention in previous versions, i.e., MM_LOCK_CONTENTION in v4 here: > > https://lore.kernel.org/lkml/20210818063107.2696454-8-yuzhao@google.com/ > > > > And we also combined this patchset with the SPF patchset to see if the > > latter makes any difference. Our conclusion was the contention is > > statistically insignificant to the performance under memory pressure. > > > > This can be explained by how often we create a new generation. (We > > only walk page tables when we create a new generation. And it's > > similar to the low inactive condition for the active/inactive lru.) > > > > Usually we only do so every few seconds. We'd run into problems with > > other parts of the kernel, e.g., lru lock contention, i/o congestion, > > etc. if we create more than a few generation every second. > > This would be a very good information to have in changelogs. Ideally > with some numbers and analysis. Will do. Thanks.
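For readers following along: the Bloom filters mentioned above are keyed by the sequence number and the PMD table entry (see test_bloom_filter()/set_bloom_filter() in the quoted walk_pmd_range()), so a later page table walk only descends into PMD ranges that a previous walk or an rmap walk found to be dense and hot; a false positive merely costs one extra PTE-table scan. A generic sketch of such a filter -- the series' own helpers differ in the details:

#include <linux/bitops.h>
#include <linux/hash.h>

#define BLOOM_FILTER_SHIFT	15	/* 2^15 bits, i.e. a 4 KiB bitmap */

/* sketch: derive two independent probe positions from the PMD pointer */
static void bloom_filter_keys(void *item, unsigned int *key0, unsigned int *key1)
{
	unsigned long v = (unsigned long)item;

	*key0 = hash_long(v, BLOOM_FILTER_SHIFT);
	*key1 = hash_long(v >> BLOOM_FILTER_SHIFT, BLOOM_FILTER_SHIFT);
}

static void bloom_filter_set(unsigned long *filter, void *item)
{
	unsigned int key0, key1;

	bloom_filter_keys(item, &key0, &key1);
	set_bit(key0, filter);
	set_bit(key1, filter);
}

static bool bloom_filter_test(unsigned long *filter, void *item)
{
	unsigned int key0, key1;

	bloom_filter_keys(item, &key0, &key1);
	return test_bit(key0, filter) && test_bit(key1, filter);
}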
On Wed 19-01-22 00:04:10, Yu Zhao wrote: > On Mon, Jan 10, 2022 at 11:54:42AM +0100, Michal Hocko wrote: > > On Sun 09-01-22 21:47:57, Yu Zhao wrote: > > > On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote: > > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > > > [...] > > > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk) > > > > > +{ > > > > > + static const struct mm_walk_ops mm_walk_ops = { > > > > > + .test_walk = should_skip_vma, > > > > > + .p4d_entry = walk_pud_range, > > > > > + }; > > > > > + > > > > > + int err; > > > > > +#ifdef CONFIG_MEMCG > > > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > > > > > +#endif > > > > > + > > > > > + walk->next_addr = FIRST_USER_ADDRESS; > > > > > + > > > > > + do { > > > > > + unsigned long start = walk->next_addr; > > > > > + unsigned long end = mm->highest_vm_end; > > > > > + > > > > > + err = -EBUSY; > > > > > + > > > > > + rcu_read_lock(); > > > > > +#ifdef CONFIG_MEMCG > > > > > + if (memcg && atomic_read(&memcg->moving_account)) > > > > > + goto contended; > > > > > +#endif > > > > > + if (!mmap_read_trylock(mm)) > > > > > + goto contended; > > > > > > > > Have you evaluated the behavior under mmap_sem contention? I mean what > > > > would be an effect of some mms being excluded from the walk? This path > > > > is called from direct reclaim and we do allocate with exclusive mmap_sem > > > > IIRC and the trylock can fail in a presence of pending writer if I am > > > > not mistaken so even the read lock holder (e.g. an allocation from the #PF) > > > > can bypass the walk. > > > > > > You are right. Here it must be a trylock; otherwise it can deadlock. > > > > Yeah, this is clear. > > > > > I think there might be a misunderstanding: the aging doesn't > > > exclusively rely on page table walks to gather the accessed bit. It > > > prefers page table walks but it can also fallback to the rmap-based > > > function, i.e., lru_gen_look_around(), which only gathers the accessed > > > bit from at most 64 PTEs and therefore is less efficient. But it still > > > retains about 80% of the performance gains. > > > > I have to say that I really have hard time to understand the runtime > > behavior depending on that interaction. How does the reclaim behave when > > the virtual scan is enabled, partially enabled and almost completely > > disabled due to different constrains? I do not see any such an > > evaluation described in changelogs and I consider this to be a rather > > important information to judge the overall behavior. > > It doesn't have (partially) enabled/disabled states nor does its > behavior change with different reclaim constraints. Having either > would make its design too complex to implement or benchmark. Let me clarify. By "partially enabled" I really meant behavior depedning on runtime conditions. Say mmap_sem cannot be locked for half of scanned tasks and/or allocation for the mm walker fails due to lack of memory. How does this going to affect reclaim efficiency. How does a user/admin know that the memory reclaim is in a "degraded" mode because of the contention?
On Tue 18-01-22 23:31:07, Yu Zhao wrote: > On Thu, Jan 13, 2022 at 01:02:26PM +0100, Michal Hocko wrote: > > On Thu 13-01-22 02:43:38, Yu Zhao wrote: > > [...] > > > > > The bottom line is I can try various optimizations, e.g., preallocate > > > > > a few buffers for a limited number of page walkers and if this number > > > > > has been reached, fallback to the rmap-based function. But I have yet > > > > > to see evidence that calls for additional complexity. > > > > > > > > I would disagree here. This is not an optimization. You should be > > > > avoiding allocations from the memory reclaim because any allocation just > > > > add a runtime behavior complexity and potential corner cases. > > > > > > Would __GFP_NOMEMALLOC address your concern? It prevents allocations > > > from accessing the reserves even under PF_MEMALLOC. > > > > __GFP_NOMEMALLOC would deal with the complete memory depletion concern > > for sure but I am not sure how any of these allocations would succeed > > when called from the direct reclaim. Some access to memory reserves is > > necessary if you insist on allocating from the reclaim process. > > > > You can have a look at the limited memory reserves access by oom victims > > for an example of how this can be done. > > Thanks. I'll change GFP_KERNEL to __GFP_HIGH | __GFP_NOMEMALLOC. > __GFP_HIGH allows some access to memory reserves and __GFP_NOMEMALLOC > prevents the complete depletion. Basically the combination lower the > min watermark by 1/2, and we have been using them for > add_to_swap_cache(). Yes this will prevent the complete memory depletion. There are other users of this portion of memory reserves so the reclaim might be out of luck. How this turns out in practice remains to be seen though but it certainly is an opportunity for corner cases and hard to test behavior.
On Wed, Jan 19, 2022 at 10:42:47AM +0100, Michal Hocko wrote: > On Wed 19-01-22 00:04:10, Yu Zhao wrote: > > On Mon, Jan 10, 2022 at 11:54:42AM +0100, Michal Hocko wrote: > > > On Sun 09-01-22 21:47:57, Yu Zhao wrote: > > > > On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote: > > > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > > > > [...] > > > > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk) > > > > > > +{ > > > > > > + static const struct mm_walk_ops mm_walk_ops = { > > > > > > + .test_walk = should_skip_vma, > > > > > > + .p4d_entry = walk_pud_range, > > > > > > + }; > > > > > > + > > > > > > + int err; > > > > > > +#ifdef CONFIG_MEMCG > > > > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > > > > > > +#endif > > > > > > + > > > > > > + walk->next_addr = FIRST_USER_ADDRESS; > > > > > > + > > > > > > + do { > > > > > > + unsigned long start = walk->next_addr; > > > > > > + unsigned long end = mm->highest_vm_end; > > > > > > + > > > > > > + err = -EBUSY; > > > > > > + > > > > > > + rcu_read_lock(); > > > > > > +#ifdef CONFIG_MEMCG > > > > > > + if (memcg && atomic_read(&memcg->moving_account)) > > > > > > + goto contended; > > > > > > +#endif > > > > > > + if (!mmap_read_trylock(mm)) > > > > > > + goto contended; > > > > > > > > > > Have you evaluated the behavior under mmap_sem contention? I mean what > > > > > would be an effect of some mms being excluded from the walk? This path > > > > > is called from direct reclaim and we do allocate with exclusive mmap_sem > > > > > IIRC and the trylock can fail in a presence of pending writer if I am > > > > > not mistaken so even the read lock holder (e.g. an allocation from the #PF) > > > > > can bypass the walk. > > > > > > > > You are right. Here it must be a trylock; otherwise it can deadlock. > > > > > > Yeah, this is clear. > > > > > > > I think there might be a misunderstanding: the aging doesn't > > > > exclusively rely on page table walks to gather the accessed bit. It > > > > prefers page table walks but it can also fallback to the rmap-based > > > > function, i.e., lru_gen_look_around(), which only gathers the accessed > > > > bit from at most 64 PTEs and therefore is less efficient. But it still > > > > retains about 80% of the performance gains. > > > > > > I have to say that I really have hard time to understand the runtime > > > behavior depending on that interaction. How does the reclaim behave when > > > the virtual scan is enabled, partially enabled and almost completely > > > disabled due to different constrains? I do not see any such an > > > evaluation described in changelogs and I consider this to be a rather > > > important information to judge the overall behavior. > > > > It doesn't have (partially) enabled/disabled states nor does its > > behavior change with different reclaim constraints. Having either > > would make its design too complex to implement or benchmark. > > Let me clarify. By "partially enabled" I really meant behavior depedning > on runtime conditions. Say mmap_sem cannot be locked for half of scanned > tasks and/or allocation for the mm walker fails due to lack of memory. > How does this going to affect reclaim efficiency. Understood. This is not only possible -- it's the default for our ARM hardware that doesn't support the accessed bit, i.e., CPUs that don't automatically set the accessed bit. 
In try_to_inc_max_seq(), we have:

	/*
	 * If the hardware doesn't automatically set the accessed bit, fallback
	 * to lru_gen_look_around(), which only clears the accessed bit in a
	 * handful of PTEs. Spreading the work out over a period of time usually
	 * is less efficient, but it avoids bursty page faults.
	 */
	if the accessed bit is not supported
		return

	if alloc_mm_walk() fails
		return

	walk_mm()
		if mmap_sem contended
			return

		scan page tables

We have a microbenchmark that specifically measures this worst-case scenario by entirely disabling page table scanning. Its results showed that this still retains more than 90% of the optimal performance. I'll share this microbenchmark in another email when answering Barry's questions regarding the accessed bit.

Our profiling infra also indirectly confirms this: it collects data from real users running on hardware with and without the accessed bit. Users running on hardware without the accessed bit indeed suffer a small performance degradation, compared with users running on hardware with it. But they still benefit almost as much, compared with users running on the same hardware but without MGLRU.

> How does a user/admin
> know that the memory reclaim is in a "degraded" mode because of the
> contention?

As we previously discussed here:
https://lore.kernel.org/linux-mm/Ydu6fXg2FmrseQOn@google.com/
there used to be a counter measuring the contention, and it was deemed unnecessary and removed in v4. But I don't have a problem if we want to revive it.
On Thu, Jan 13, 2022 at 12:57:35PM +0100, Michal Hocko wrote: > On Wed 12-01-22 16:43:15, Yu Zhao wrote: > > On Wed, Jan 12, 2022 at 11:17:53AM +0100, Michal Hocko wrote: > [...] > > > Is there any reason you are not using folio_memcg_lock in the > > > pte walk instead? > > > > We have a particular lruvec (the first arg), hence a particular memcg > > to lock. But we don't have a particular page to lock. > > That is certainly true at this layer but the locking should be needed > only for specific pages, no? Yes. > So you can move the lock down to the > callback which examines respective pages. Or is there anything > preventing that? No. > To be honest, and that is the reason I am asking, I really do not like > to open code the migration synchronization outside of the memcg proper. Agreed. > Code paths which need a stable memcg are supposed to be using > folio_memcg_lock for the specific examination time. No argument here, just a clarification: when possible I prefer to lock a batch of pages rather than individual ones. > If you prefer a > trylock approach for this usecase then we can add one. Done. Thanks.
On Sun 23-01-22 14:28:30, Yu Zhao wrote: > On Wed, Jan 19, 2022 at 10:42:47AM +0100, Michal Hocko wrote: > > On Wed 19-01-22 00:04:10, Yu Zhao wrote: > > > On Mon, Jan 10, 2022 at 11:54:42AM +0100, Michal Hocko wrote: > > > > On Sun 09-01-22 21:47:57, Yu Zhao wrote: > > > > > On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote: > > > > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote: > > > > > > [...] > > > > > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk) > > > > > > > +{ > > > > > > > + static const struct mm_walk_ops mm_walk_ops = { > > > > > > > + .test_walk = should_skip_vma, > > > > > > > + .p4d_entry = walk_pud_range, > > > > > > > + }; > > > > > > > + > > > > > > > + int err; > > > > > > > +#ifdef CONFIG_MEMCG > > > > > > > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > > > > > > > +#endif > > > > > > > + > > > > > > > + walk->next_addr = FIRST_USER_ADDRESS; > > > > > > > + > > > > > > > + do { > > > > > > > + unsigned long start = walk->next_addr; > > > > > > > + unsigned long end = mm->highest_vm_end; > > > > > > > + > > > > > > > + err = -EBUSY; > > > > > > > + > > > > > > > + rcu_read_lock(); > > > > > > > +#ifdef CONFIG_MEMCG > > > > > > > + if (memcg && atomic_read(&memcg->moving_account)) > > > > > > > + goto contended; > > > > > > > +#endif > > > > > > > + if (!mmap_read_trylock(mm)) > > > > > > > + goto contended; > > > > > > > > > > > > Have you evaluated the behavior under mmap_sem contention? I mean what > > > > > > would be an effect of some mms being excluded from the walk? This path > > > > > > is called from direct reclaim and we do allocate with exclusive mmap_sem > > > > > > IIRC and the trylock can fail in a presence of pending writer if I am > > > > > > not mistaken so even the read lock holder (e.g. an allocation from the #PF) > > > > > > can bypass the walk. > > > > > > > > > > You are right. Here it must be a trylock; otherwise it can deadlock. > > > > > > > > Yeah, this is clear. > > > > > > > > > I think there might be a misunderstanding: the aging doesn't > > > > > exclusively rely on page table walks to gather the accessed bit. It > > > > > prefers page table walks but it can also fallback to the rmap-based > > > > > function, i.e., lru_gen_look_around(), which only gathers the accessed > > > > > bit from at most 64 PTEs and therefore is less efficient. But it still > > > > > retains about 80% of the performance gains. > > > > > > > > I have to say that I really have hard time to understand the runtime > > > > behavior depending on that interaction. How does the reclaim behave when > > > > the virtual scan is enabled, partially enabled and almost completely > > > > disabled due to different constrains? I do not see any such an > > > > evaluation described in changelogs and I consider this to be a rather > > > > important information to judge the overall behavior. > > > > > > It doesn't have (partially) enabled/disabled states nor does its > > > behavior change with different reclaim constraints. Having either > > > would make its design too complex to implement or benchmark. > > > > Let me clarify. By "partially enabled" I really meant behavior depedning > > on runtime conditions. Say mmap_sem cannot be locked for half of scanned > > tasks and/or allocation for the mm walker fails due to lack of memory. > > How does this going to affect reclaim efficiency. > > Understood. 
This is not only possible -- it's the default for our ARM > hardware that doesn't support the accessed bit, i.e., CPUs that don't > automatically set the accessed bit. > > In try_to_inc_max_seq(), we have: > /* > * If the hardware doesn't automatically set the accessed bit, fallback > * to lru_gen_look_around(), which only clears the accessed bit in a > * handful of PTEs. Spreading the work out over a period of time usually > * is less efficient, but it avoids bursty page faults. > */ > if the accessed bit is not supported > return > > if alloc_mm_walk() fails > return > > walk_mm() > if mmap_sem contented > return > > scan page tables > > We have a microbenchmark that specifically measures this worst case > scenario by entirely disabling page table scanning. Its results showed > that this still retains more than 90% of the optimal performance. I'll > share this microbenchmark in another email when answering Barry's > questions regarding the accessed bit. > > Our profiling infra also indirectly confirms this: it collects data > from real users running on hardware with and without the accessed > bit. Users running on hardware without the accessed bit indeed suffer > a small performance degradation, compared with users running on > hardware with it. But they still benefit almost as much, compared with > users running on the same hardware but without MGLRU. This definitely a good information to have in the cover letter. > > How does a user/admin > > know that the memory reclaim is in a "degraded" mode because of the > > contention? > > As we previously discussed here: > https://lore.kernel.org/linux-mm/Ydu6fXg2FmrseQOn@google.com/ > there used to be a counter measuring the contention, and it was deemed > unnecessary and removed in v4. But I don't have a problem if we want > to revive it. Well, counter might be rather tricky but few trace points would make some sense to me.
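As a concrete shape for such trace points, a skeleton like the one below could report why the walk gave up on an mm (contended mmap_lock, failed walk allocation, pending memcg move). Purely illustrative -- the event name, fields and TRACE_SYSTEM are made up here and not part of the posted series:

#undef TRACE_SYSTEM
#define TRACE_SYSTEM mglru

#if !defined(_TRACE_MGLRU_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MGLRU_H

#include <linux/tracepoint.h>

TRACE_EVENT(mm_lru_gen_walk_skip,

	TP_PROTO(int nid, int reason),

	TP_ARGS(nid, reason),

	TP_STRUCT__entry(
		__field(int, nid)
		__field(int, reason)
	),

	TP_fast_assign(
		__entry->nid = nid;
		__entry->reason = reason;
	),

	TP_printk("nid=%d reason=%d", __entry->nid, __entry->reason)
);

#endif /* _TRACE_MGLRU_H */

/* this part must be outside the protection above */
#include <trace/define_trace.h>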
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index aba18cd101db..028afdb81c10 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1393,18 +1393,24 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) static inline void lock_page_memcg(struct page *page) { + /* to match folio_memcg_rcu() */ + rcu_read_lock(); } static inline void unlock_page_memcg(struct page *page) { + rcu_read_unlock(); } static inline void folio_memcg_lock(struct folio *folio) { + /* to match folio_memcg_rcu() */ + rcu_read_lock(); } static inline void folio_memcg_unlock(struct folio *folio) { + rcu_read_unlock(); } static inline void mem_cgroup_handle_over_high(void) diff --git a/include/linux/mm.h b/include/linux/mm.h index fadbf8e6abcd..3d42118b7f5e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1599,6 +1599,11 @@ static inline unsigned long folio_pfn(struct folio *folio) return page_to_pfn(&folio->page); } +static inline struct folio *pfn_folio(unsigned long pfn) +{ + return page_folio(pfn_to_page(pfn)); +} + /* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */ #ifdef CONFIG_MIGRATION static inline bool is_pinnable_page(struct page *page) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 5b9bc2532c5b..94af12507788 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -304,6 +304,7 @@ enum lruvec_flags { }; struct lruvec; +struct page_vma_mapped_walk; #define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF) #define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF) @@ -410,6 +411,7 @@ struct lru_gen_mm_walk { }; void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec); +void lru_gen_look_around(struct page_vma_mapped_walk *pvmw); #ifdef CONFIG_MEMCG void lru_gen_init_memcg(struct mem_cgroup *memcg); @@ -422,6 +424,10 @@ static inline void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *l { } +static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) +{ +} + #ifdef CONFIG_MEMCG static inline void lru_gen_init_memcg(struct mem_cgroup *memcg) { @@ -1048,6 +1054,10 @@ typedef struct pglist_data { unsigned long flags; +#ifdef CONFIG_LRU_GEN + /* kswap mm walk data */ + struct lru_gen_mm_walk mm_walk; +#endif ZONE_PADDING(_pad2_) /* Per-node vmstats */ diff --git a/include/linux/oom.h b/include/linux/oom.h index 2db9a1432511..9c7a4fae0661 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -57,6 +57,22 @@ struct oom_control { extern struct mutex oom_lock; extern struct mutex oom_adj_mutex; +#ifdef CONFIG_MMU +extern struct task_struct *oom_reaper_list; +extern struct wait_queue_head oom_reaper_wait; + +static inline bool oom_reaping_in_progress(void) +{ + /* a racy check can be used to reduce the chance of overkilling */ + return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait); +} +#else +static inline bool oom_reaping_in_progress(void) +{ + return false; +} +#endif + static inline void set_current_oom_origin(void) { current->signal->oom_flag_origin = true; diff --git a/include/linux/swap.h b/include/linux/swap.h index d1ea44b31f19..bb93bba97115 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -137,6 +137,10 @@ union swap_header { */ struct reclaim_state { unsigned long reclaimed_slab; +#ifdef CONFIG_LRU_GEN + /* per-thread mm walk data */ + struct lru_gen_mm_walk *mm_walk; +#endif }; #ifdef __KERNEL__ diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 1ddabefcfb5a..ef5860fc7d22 100644 --- a/mm/oom_kill.c +++ 
b/mm/oom_kill.c @@ -508,8 +508,8 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm) * victim (if that is possible) to help the OOM killer to move on. */ static struct task_struct *oom_reaper_th; -static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); -static struct task_struct *oom_reaper_list; +DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); +struct task_struct *oom_reaper_list; static DEFINE_SPINLOCK(oom_reaper_lock); bool __oom_reap_task_mm(struct mm_struct *mm) diff --git a/mm/rmap.c b/mm/rmap.c index 163ac4e6bcee..2f023e6c0f82 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -73,6 +73,7 @@ #include <linux/page_idle.h> #include <linux/memremap.h> #include <linux/userfaultfd_k.h> +#include <linux/mm_inline.h> #include <asm/tlbflush.h> @@ -790,6 +791,12 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma, } if (pvmw.pte) { + if (lru_gen_enabled() && pte_young(*pvmw.pte) && + !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) { + lru_gen_look_around(&pvmw); + referenced++; + } + if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) { /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 5eaf22aa446a..fbf1337a1632 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -51,6 +51,8 @@ #include <linux/dax.h> #include <linux/psi.h> #include <linux/memory.h> +#include <linux/pagewalk.h> +#include <linux/shmem_fs.h> #include <asm/tlbflush.h> #include <asm/div64.h> @@ -1555,6 +1557,11 @@ static unsigned int shrink_page_list(struct list_head *page_list, if (!sc->may_unmap && page_mapped(page)) goto keep_locked; + /* folio_update_gen() tried to promote this page? */ + if (lru_gen_enabled() && !ignore_references && + page_mapped(page) && PageReferenced(page)) + goto keep_locked; + may_enter_fs = (sc->gfp_mask & __GFP_FS) || (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); @@ -3047,6 +3054,15 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, * shorthand helpers ******************************************************************************/ +#define DEFINE_MAX_SEQ(lruvec) \ + unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq) + +#define DEFINE_MIN_SEQ(lruvec) \ + unsigned long min_seq[ANON_AND_FILE] = { \ + READ_ONCE((lruvec)->lrugen.min_seq[0]), \ + READ_ONCE((lruvec)->lrugen.min_seq[1]), \ + } + #define for_each_gen_type_zone(gen, type, zone) \ for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \ for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ @@ -3077,6 +3093,12 @@ static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid) return pgdat ? &pgdat->__lruvec : NULL; } +static int get_swappiness(struct mem_cgroup *memcg) +{ + return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ? 
+ mem_cgroup_swappiness(memcg) : 0; +} + static int get_nr_gens(struct lruvec *lruvec, int type) { return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1; @@ -3431,6 +3453,869 @@ static bool get_next_mm(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, return last; } +/****************************************************************************** + * the aging + ******************************************************************************/ + +static void folio_update_gen(struct folio *folio, struct lru_gen_mm_walk *walk) +{ + unsigned long old_flags, new_flags; + int type = folio_is_file_lru(folio); + int zone = folio_zonenum(folio); + int delta = folio_nr_pages(folio); + int old_gen, new_gen = lru_gen_from_seq(walk->max_seq); + + do { + new_flags = old_flags = READ_ONCE(folio->flags); + + /* for shrink_page_list() */ + if (!(new_flags & LRU_GEN_MASK)) { + new_flags |= BIT(PG_referenced); + continue; + } + + new_flags &= ~LRU_GEN_MASK; + new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF; + } while (new_flags != old_flags && + cmpxchg(&folio->flags, old_flags, new_flags) != old_flags); + + old_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; + if (old_gen < 0 || old_gen == new_gen) + return; + + walk->batched++; + walk->nr_pages[old_gen][type][zone] -= delta; + walk->nr_pages[new_gen][type][zone] += delta; +} + +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming) +{ + unsigned long old_flags, new_flags; + int type = folio_is_file_lru(folio); + struct lru_gen_struct *lrugen = &lruvec->lrugen; + int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); + + do { + new_flags = old_flags = READ_ONCE(folio->flags); + VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio); + + new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; + /* folio_update_gen() has promoted this page? 
*/ + if (new_gen >= 0 && new_gen != old_gen) + return new_gen; + + new_gen = (old_gen + 1) % MAX_NR_GENS; + + new_flags &= ~LRU_GEN_MASK; + new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF; + /* for folio_end_writeback() */ + if (reclaiming) + new_flags |= BIT(PG_reclaim); + } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags); + + lru_gen_balance_size(lruvec, folio, old_gen, new_gen); + + return new_gen; +} + +static void reset_batch_size(struct lruvec *lruvec, struct lru_gen_mm_walk *walk) +{ + int gen, type, zone; + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + walk->batched = 0; + + for_each_gen_type_zone(gen, type, zone) { + enum lru_list lru = type * LRU_FILE; + int delta = walk->nr_pages[gen][type][zone]; + + if (!delta) + continue; + + walk->nr_pages[gen][type][zone] = 0; + WRITE_ONCE(lrugen->nr_pages[gen][type][zone], + lrugen->nr_pages[gen][type][zone] + delta); + + if (lru_gen_is_active(lruvec, gen)) + lru += LRU_ACTIVE; + lru_gen_update_size(lruvec, lru, zone, delta); + } +} + +static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *walk) +{ + struct address_space *mapping; + struct vm_area_struct *vma = walk->vma; + struct lru_gen_mm_walk *priv = walk->private; + + if (!vma_is_accessible(vma) || is_vm_hugetlb_page(vma) || + (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ))) + return true; + + if (vma_is_anonymous(vma)) + return !priv->can_swap; + + if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping)) + return true; + + mapping = vma->vm_file->f_mapping; + if (!mapping->a_ops->writepage) + return true; + + return (shmem_mapping(mapping) && !priv->can_swap) || mapping_unevictable(mapping); +} + +/* + * Some userspace memory allocators map many single-page VMAs. Instead of + * returning back to the PGD table for each of such VMAs, finish an entire PMD + * table to reduce zigzags and improve cache performance. + */ +static bool get_next_vma(struct mm_walk *walk, unsigned long mask, unsigned long size, + unsigned long *start, unsigned long *end) +{ + unsigned long next = round_up(*end, size); + + VM_BUG_ON(mask & size); + VM_BUG_ON(*start >= *end); + VM_BUG_ON((next & mask) != (*start & mask)); + + while (walk->vma) { + if (next >= walk->vma->vm_end) { + walk->vma = walk->vma->vm_next; + continue; + } + + if ((next & mask) != (walk->vma->vm_start & mask)) + return false; + + if (should_skip_vma(walk->vma->vm_start, walk->vma->vm_end, walk)) { + walk->vma = walk->vma->vm_next; + continue; + } + + *start = max(next, walk->vma->vm_start); + next = (next | ~mask) + 1; + /* rounded-up boundaries can wrap to 0 */ + *end = next && next < walk->vma->vm_end ? 
next : walk->vma->vm_end; + + return true; + } + + return false; +} + +static bool suitable_to_scan(int total, int young) +{ + int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8); + + /* suitable if the average number of young PTEs per cacheline is >=1 */ + return young * n >= total; +} + +static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + int i; + pte_t *pte; + spinlock_t *ptl; + unsigned long addr; + int total = 0; + int young = 0; + struct lru_gen_mm_walk *priv = walk->private; + struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec); + struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec); + + VM_BUG_ON(pmd_leaf(*pmd)); + + pte = pte_offset_map_lock(walk->mm, pmd, start & PMD_MASK, &ptl); + arch_enter_lazy_mmu_mode(); +restart: + for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) { + struct folio *folio; + unsigned long pfn = pte_pfn(pte[i]); + + total++; + priv->mm_stats[MM_PTE_TOTAL]++; + + if (!pte_present(pte[i]) || is_zero_pfn(pfn)) + continue; + + if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i]))) + continue; + + if (!pte_young(pte[i])) { + priv->mm_stats[MM_PTE_OLD]++; + continue; + } + + VM_BUG_ON(!pfn_valid(pfn)); + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + continue; + + folio = pfn_folio(pfn); + if (folio_nid(folio) != pgdat->node_id) + continue; + + if (folio_memcg_rcu(folio) != memcg) + continue; + + VM_BUG_ON(addr < walk->vma->vm_start || addr >= walk->vma->vm_end); + if (ptep_test_and_clear_young(walk->vma, addr, pte + i)) { + folio_update_gen(folio, priv); + priv->mm_stats[MM_PTE_YOUNG]++; + young++; + } + + if (pte_dirty(pte[i]) && !folio_test_dirty(folio) && + !(folio_test_anon(folio) && folio_test_swapbacked(folio) && + !folio_test_swapcache(folio))) + folio_mark_dirty(folio); + } + + if (i < PTRS_PER_PTE && get_next_vma(walk, PMD_MASK, PAGE_SIZE, &start, &end)) + goto restart; + + arch_leave_lazy_mmu_mode(); + pte_unmap_unlock(pte, ptl); + + return suitable_to_scan(total, young); +} + +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) +static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma, + struct mm_walk *walk, unsigned long *start) +{ + int i; + pmd_t *pmd; + spinlock_t *ptl; + struct lru_gen_mm_walk *priv = walk->private; + struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec); + struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec); + + VM_BUG_ON(pud_leaf(*pud)); + + /* try to batch at most 1+MIN_LRU_BATCH+1 entries */ + if (*start == -1) { + *start = next; + return; + } + + i = next == -1 ? 0 : pmd_index(next) - pmd_index(*start); + if (i && i <= MIN_LRU_BATCH) { + __set_bit(i - 1, priv->bitmap); + return; + } + + pmd = pmd_offset(pud, *start); + ptl = pmd_lock(walk->mm, pmd); + arch_enter_lazy_mmu_mode(); + + do { + struct folio *folio; + unsigned long pfn = pmd_pfn(pmd[i]); + unsigned long addr = i ? 
(*start & PMD_MASK) + i * PMD_SIZE : *start; + + if (!pmd_present(pmd[i]) || is_huge_zero_pmd(pmd[i])) + goto next; + + if (WARN_ON_ONCE(pmd_devmap(pmd[i]))) + goto next; + + if (!pmd_trans_huge(pmd[i])) { + if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)) + pmdp_test_and_clear_young(vma, addr, pmd + i); + goto next; + } + + VM_BUG_ON(!pfn_valid(pfn)); + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + goto next; + + folio = pfn_folio(pfn); + if (folio_nid(folio) != pgdat->node_id) + goto next; + + if (folio_memcg_rcu(folio) != memcg) + goto next; + + VM_BUG_ON(addr < vma->vm_start || addr >= vma->vm_end); + if (pmdp_test_and_clear_young(vma, addr, pmd + i)) { + folio_update_gen(folio, priv); + priv->mm_stats[MM_PTE_YOUNG]++; + } + + if (pmd_dirty(pmd[i]) && !folio_test_dirty(folio) && + !(folio_test_anon(folio) && folio_test_swapbacked(folio) && + !folio_test_swapcache(folio))) + folio_mark_dirty(folio); +next: + i = i > MIN_LRU_BATCH ? 0 : + find_next_bit(priv->bitmap, MIN_LRU_BATCH, i) + 1; + } while (i <= MIN_LRU_BATCH); + + arch_leave_lazy_mmu_mode(); + spin_unlock(ptl); + + *start = -1; + bitmap_zero(priv->bitmap, MIN_LRU_BATCH); +} +#else +static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma, + struct mm_walk *walk, unsigned long *start) +{ +} +#endif + +static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + int i; + pmd_t *pmd; + unsigned long next; + unsigned long addr; + struct vm_area_struct *vma; + unsigned long pos = -1; + struct lru_gen_mm_walk *priv = walk->private; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec); +#endif + + VM_BUG_ON(pud_leaf(*pud)); + + /* + * Finish an entire PMD in two passes: the first only reaches to PTE + * tables to avoid taking the PMD lock; the second, if necessary, takes + * the PMD lock to clear the accessed bit in PMD entries. 
+
+static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+			   struct mm_walk *walk)
+{
+	int i;
+	pmd_t *pmd;
+	unsigned long next;
+	unsigned long addr;
+	struct vm_area_struct *vma;
+	unsigned long pos = -1;
+	struct lru_gen_mm_walk *priv = walk->private;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+#endif
+
+	VM_BUG_ON(pud_leaf(*pud));
+
+	/*
+	 * Finish an entire PMD in two passes: the first only reaches to PTE
+	 * tables to avoid taking the PMD lock; the second, if necessary, takes
+	 * the PMD lock to clear the accessed bit in PMD entries.
+	 */
+	pmd = pmd_offset(pud, start & PUD_MASK);
+restart:
+	/* walk_pte_range() may call get_next_vma() */
+	vma = walk->vma;
+	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+		pmd_t val = pmd_read_atomic(pmd + i);
+
+		/* for pmd_read_atomic() */
+		barrier();
+
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(val)) {
+			priv->mm_stats[MM_PTE_TOTAL]++;
+			continue;
+		}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (pmd_trans_huge(val)) {
+			unsigned long pfn = pmd_pfn(val);
+
+			priv->mm_stats[MM_PTE_TOTAL]++;
+
+			if (is_huge_zero_pmd(val))
+				continue;
+
+			if (!pmd_young(val)) {
+				priv->mm_stats[MM_PTE_OLD]++;
+				continue;
+			}
+
+			if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+				continue;
+
+			walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+			continue;
+		}
+#endif
+		priv->mm_stats[MM_PMD_TOTAL]++;
+
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+		if (!pmd_young(val))
+			continue;
+
+		walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+#endif
+		if (!priv->full_scan && !test_bloom_filter(priv->lruvec, priv->max_seq, pmd + i))
+			continue;
+
+		priv->mm_stats[MM_PMD_FOUND]++;
+
+		if (!walk_pte_range(&val, addr, next, walk))
+			continue;
+
+		set_bloom_filter(priv->lruvec, priv->max_seq + 1, pmd + i);
+
+		priv->mm_stats[MM_PMD_ADDED]++;
+	}
+
+	walk_pmd_range_locked(pud, -1, vma, walk, &pos);
+
+	if (i < PTRS_PER_PMD && get_next_vma(walk, PUD_MASK, PMD_SIZE, &start, &end))
+		goto restart;
+}
+
+static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
+			  struct mm_walk *walk)
+{
+	int i;
+	pud_t *pud;
+	unsigned long addr;
+	unsigned long next;
+	struct lru_gen_mm_walk *priv = walk->private;
+
+	VM_BUG_ON(p4d_leaf(*p4d));
+
+	pud = pud_offset(p4d, start & P4D_MASK);
+restart:
+	for (i = pud_index(start), addr = start; addr != end; i++, addr = next) {
+		pud_t val = READ_ONCE(pud[i]);
+
+		next = pud_addr_end(addr, end);
+
+		if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
+			continue;
+
+		walk_pmd_range(&val, addr, next, walk);
+
+		if (priv->batched >= MAX_LRU_BATCH) {
+			end = (addr | ~PUD_MASK) + 1;
+			goto done;
+		}
+	}
+
+	if (i < PTRS_PER_PUD && get_next_vma(walk, P4D_MASK, PUD_SIZE, &start, &end))
+		goto restart;
+
+	end = round_up(end, P4D_SIZE);
+done:
+	/* rounded-up boundaries can wrap to 0 */
+	priv->next_addr = end && walk->vma ? max(end, walk->vma->vm_start) : 0;
+
+	return -EAGAIN;
+}
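For illustration only (not part of the patch): a minimal user-space sketch of the Bloom filter feedback driven by test_bloom_filter()/set_bloom_filter() above. PMD entries whose PTE tables were dense in this pass are recorded under max_seq + 1 so that only they are rescanned in the next aging pass. The two alternating filters, their size and the hash functions here are assumptions, not the patch's implementation.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define FILTER_BITS (1 << 10)	/* size is an assumption */

static uint8_t filters[2][FILTER_BITS / 8];	/* one filter per generation parity */

static void hashes(const void *item, uint32_t *h1, uint32_t *h2)
{
	uintptr_t p = (uintptr_t)item;

	*h1 = (uint32_t)(p * 2654435761u) % FILTER_BITS;
	*h2 = (uint32_t)((p >> 4) * 40503u) % FILTER_BITS;
}

static void bloom_set(unsigned long seq, const void *item)
{
	uint32_t h1, h2;
	uint8_t *f = filters[seq & 1];

	hashes(item, &h1, &h2);
	f[h1 / 8] |= 1u << (h1 % 8);
	f[h2 / 8] |= 1u << (h2 % 8);
}

static int bloom_test(unsigned long seq, const void *item)
{
	uint32_t h1, h2;
	uint8_t *f = filters[seq & 1];

	hashes(item, &h1, &h2);
	return ((f[h1 / 8] >> (h1 % 8)) & 1) && ((f[h2 / 8] >> (h2 % 8)) & 1);
}

int main(void)
{
	unsigned long max_seq = 4;
	int pmd_slot;	/* stands in for a pmd_t pointer */

	/* this pass found the PTE table dense: remember it for the next pass */
	bloom_set(max_seq + 1, &pmd_slot);

	/* start the next aging pass: the old filter is retired, the new one hits */
	memset(filters[max_seq & 1], 0, sizeof(filters[0]));
	max_seq++;
	printf("rescan in pass %lu? %d\n", max_seq, bloom_test(max_seq, &pmd_slot));
	return 0;
}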
+
+static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+	static const struct mm_walk_ops mm_walk_ops = {
+		.test_walk = should_skip_vma,
+		.p4d_entry = walk_pud_range,
+	};
+
+	int err;
+#ifdef CONFIG_MEMCG
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+#endif
+
+	walk->next_addr = FIRST_USER_ADDRESS;
+
+	do {
+		unsigned long start = walk->next_addr;
+		unsigned long end = mm->highest_vm_end;
+
+		err = -EBUSY;
+
+		rcu_read_lock();
+#ifdef CONFIG_MEMCG
+		if (memcg && atomic_read(&memcg->moving_account))
+			goto contended;
+#endif
+		if (!mmap_read_trylock(mm))
+			goto contended;
+
+		err = walk_page_range(mm, start, end, &mm_walk_ops, walk);
+
+		mmap_read_unlock(mm);
+
+		if (walk->batched) {
+			spin_lock_irq(&lruvec->lru_lock);
+			reset_batch_size(lruvec, walk);
+			spin_unlock_irq(&lruvec->lru_lock);
+		}
+contended:
+		rcu_read_unlock();
+
+		cond_resched();
+	} while (err == -EAGAIN && walk->next_addr && !mm_is_oom_victim(mm));
+}
+
+static struct lru_gen_mm_walk *alloc_mm_walk(void)
+{
+	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
+		return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
+
+	return current->reclaim_state->mm_walk;
+}
+
+static void free_mm_walk(struct lru_gen_mm_walk *walk)
+{
+	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
+		kvfree(walk);
+}
+
+static void inc_min_seq(struct lruvec *lruvec)
+{
+	int gen, type;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+			continue;
+
+		WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+	}
+}
+
+static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
+{
+	int gen, type, zone;
+	bool success = false;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		while (lrugen->max_seq >= min_seq[type] + MIN_NR_GENS) {
+			gen = lru_gen_from_seq(min_seq[type]);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+				if (!list_empty(&lrugen->lists[gen][type][zone]))
+					goto next;
+			}
+
+			min_seq[type]++;
+		}
+next:
+		;
+	}
+
+	/* see the comment in seq_is_valid() */
+	if (can_swap) {
+		min_seq[0] = min(min_seq[0], min_seq[1]);
+		min_seq[1] = max(min_seq[0], lrugen->min_seq[1]);
+	}
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		if (min_seq[type] == lrugen->min_seq[type])
+			continue;
+
+		WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
+		success = true;
+	}
+
+	return success;
+}
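For illustration only (not part of the patch): how a monotonically increasing sequence number maps onto a fixed ring of generation lists, which is why try_to_inc_min_seq() above has to retire the oldest generation before max_seq can keep advancing. MAX_GENS = 4 is an assumed stand-in for MAX_NR_GENS.

#include <stdio.h>

#define MAX_GENS 4	/* assumed stand-in for MAX_NR_GENS */

int main(void)
{
	unsigned long min_seq = 5, max_seq = 8;	/* window is full: four generations */
	unsigned long seq;

	for (seq = min_seq; seq <= max_seq; seq++)
		printf("seq %lu -> ring slot %lu\n", seq, seq % MAX_GENS);

	/*
	 * Advancing max_seq to 9 would need slot 9 % 4 = 1, which still holds
	 * generation 5 (min_seq); the oldest generation has to be emptied and
	 * min_seq bumped first, which is what inc_min_seq() and
	 * try_to_inc_min_seq() take care of.
	 */
	return 0;
}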
+
+static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+{
+	int prev, next;
+	int type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	if (max_seq != lrugen->max_seq)
+		goto unlock;
+
+	inc_min_seq(lruvec);
+
+	/* update the active/inactive lru sizes for compatibility */
+	prev = lru_gen_from_seq(lrugen->max_seq - 1);
+	next = lru_gen_from_seq(lrugen->max_seq + 1);
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			enum lru_list lru = type * LRU_FILE;
+			long delta = lrugen->nr_pages[prev][type][zone] -
+				     lrugen->nr_pages[next][type][zone];
+
+			if (!delta)
+				continue;
+
+			lru_gen_update_size(lruvec, lru, zone, delta);
+			lru_gen_update_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
+		}
+	}
+
+	WRITE_ONCE(lrugen->timestamps[next], jiffies);
+	/* make sure preceding modifications appear */
+	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
+unlock:
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+			       struct scan_control *sc, bool can_swap, bool full_scan)
+{
+	bool last;
+	struct lru_gen_mm_walk *walk;
+	struct mm_struct *mm = NULL;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(max_seq > READ_ONCE(lrugen->max_seq));
+
+	/*
+	 * If the hardware doesn't automatically set the accessed bit, fallback
+	 * to lru_gen_look_around(), which only clears the accessed bit in a
+	 * handful of PTEs. Spreading the work out over a period of time usually
+	 * is less efficient, but it avoids bursty page faults.
+	 */
+	if (!full_scan && !arch_has_hw_pte_young(false)) {
+		inc_max_seq(lruvec, max_seq);
+		return true;
+	}
+
+	walk = alloc_mm_walk();
+	if (!walk)
+		return false;
+
+	walk->lruvec = lruvec;
+	walk->max_seq = max_seq;
+	walk->can_swap = can_swap;
+	walk->full_scan = full_scan;
+
+	do {
+		last = get_next_mm(lruvec, walk, &mm);
+		if (mm)
+			walk_mm(lruvec, mm, walk);
+
+		cond_resched();
+	} while (mm);
+
+	free_mm_walk(walk);
+
+	if (!last) {
+		if (!current_is_kswapd() && sc->priority < DEF_PRIORITY - 2)
+			wait_event_killable(lruvec->mm_state.wait,
+					    max_seq < READ_ONCE(lrugen->max_seq));
+
+		return max_seq < READ_ONCE(lrugen->max_seq);
+	}
+
+	VM_BUG_ON(max_seq != READ_ONCE(lrugen->max_seq));
+
+	inc_max_seq(lruvec, max_seq);
+	/* either this sees any waiters or they will see updated max_seq */
+	if (wq_has_sleeper(&lruvec->mm_state.wait))
+		wake_up_all(&lruvec->mm_state.wait);
+
+	wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+	return true;
+}
+
+static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq, unsigned long *min_seq,
+			     struct scan_control *sc, bool can_swap, bool *need_aging)
+{
+	int gen, type, zone;
+	long max = 0;
+	long min = 0;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	/*
+	 * The upper bound of evictable pages is all eligible pages; the lower
+	 * bound is aged eligible file pages. The aging is due if the number of
+	 * aged generations and the number of aged eligible file pages are both
+	 * low.
+	 */
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		unsigned long seq;
+
+		for (seq = min_seq[type]; seq <= max_seq; seq++) {
+			long size = 0;
+
+			gen = lru_gen_from_seq(seq);
+
+			for (zone = 0; zone <= sc->reclaim_idx; zone++)
+				size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+			max += size;
+			if (type && max_seq >= seq + MIN_NR_GENS)
+				min += size;
+		}
+	}
+
+	*need_aging = max_seq <= min_seq[1] + MIN_NR_GENS && min < MIN_LRU_BATCH;
+
+	return max > 0 ? max : 0;
+}
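For illustration only (not part of the patch): a toy calculation of the two bounds described in get_nr_evictable() above, with made-up page counts and the per-zone detail collapsed. MIN_NR_GENS = 2 and MIN_BATCH = 64 are assumed stand-ins for the kernel constants.

#include <stdio.h>
#include <stdbool.h>

#define MIN_NR_GENS	2	/* assumed value */
#define MIN_BATCH	64	/* assumed stand-in for MIN_LRU_BATCH */

int main(void)
{
	unsigned long min_seq[2] = { 3, 3 };	/* [0] anon, [1] file */
	unsigned long max_seq = 5;
	/* made-up page counts per type and sequence number (min_seq..max_seq) */
	long pages[2][3] = { { 300, 200, 100 }, { 50, 400, 250 } };
	long max = 0, min = 0;
	unsigned long seq;
	int type;
	bool need_aging;

	for (type = 0; type < 2; type++) {
		for (seq = min_seq[type]; seq <= max_seq; seq++) {
			long size = pages[type][seq - min_seq[type]];

			max += size;	/* upper bound: every eligible page */
			if (type && max_seq >= seq + MIN_NR_GENS)
				min += size;	/* lower bound: aged file pages only */
		}
	}

	need_aging = max_seq <= min_seq[1] + MIN_NR_GENS && min < MIN_BATCH;
	printf("max %ld, min %ld, need_aging %d\n", max, min, need_aging);	/* 1300, 50, 1 */
	return 0;
}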
+
+static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
+		       unsigned long min_ttl)
+{
+	bool need_aging;
+	long nr_to_scan;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	int swappiness = get_swappiness(memcg);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	if (mem_cgroup_below_min(memcg))
+		return false;
+
+	if (min_ttl) {
+		int gen = lru_gen_from_seq(min_seq[1]);
+		unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
+
+		if (time_is_after_jiffies(birth + min_ttl))
+			return false;
+	}
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, sc, swappiness, &need_aging);
+	if (!nr_to_scan)
+		return false;
+
+	nr_to_scan >>= sc->priority;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
+		try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false);
+
+	return true;
+}
+
+/* to protect the working set of the last N jiffies */
+static unsigned long lru_gen_min_ttl __read_mostly;
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+	struct mem_cgroup *memcg;
+	bool success = false;
+	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
+
+	VM_BUG_ON(!current_is_kswapd());
+
+	current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		if (age_lruvec(lruvec, sc, min_ttl))
+			success = true;
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	if (!success && mutex_trylock(&oom_lock)) {
+		struct oom_control oc = {
+			.gfp_mask = sc->gfp_mask,
+			.order = sc->order,
+		};
+
+		if (!oom_reaping_in_progress())
+			out_of_memory(&oc);
+
+		mutex_unlock(&oom_lock);
+	}
+
+	current->reclaim_state->mm_walk = NULL;
+}
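For illustration only (not part of the patch): the min_ttl test in age_lruvec() above, re-expressed with plain integers instead of jiffies; a lruvec whose oldest generation is younger than min_ttl is skipped, so the recent working set stays protected.

#include <stdio.h>
#include <stdbool.h>

static bool ttl_expired(unsigned long now, unsigned long birth, unsigned long min_ttl)
{
	/* time_is_after_jiffies(birth + min_ttl) <=> birth + min_ttl > now */
	return !(birth + min_ttl > now);
}

int main(void)
{
	unsigned long now = 1000;

	printf("%d\n", ttl_expired(now, 900, 250));	/* 0: oldest generation still protected */
	printf("%d\n", ttl_expired(now, 700, 250));	/* 1: old enough to be aged or evicted */
	return 0;
}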
+
+/*
+ * This function exploits spatial locality when shrink_page_list() walks the
+ * rmap. It scans the vicinity of a young PTE in a PTE table and promotes
+ * accessed pages. If the scan was done cacheline efficiently, it adds the PMD
+ * entry pointing to this PTE table to the Bloom filter. This process is a
+ * feedback loop from the eviction to the aging.
+ */
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+	int i;
+	pte_t *pte;
+	unsigned long start;
+	unsigned long end;
+	unsigned long addr;
+	struct lru_gen_mm_walk *walk;
+	int total = 0;
+	int young = 0;
+	struct mem_cgroup *memcg = page_memcg(pvmw->page);
+	struct pglist_data *pgdat = page_pgdat(pvmw->page);
+	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	DEFINE_MAX_SEQ(lruvec);
+
+	lockdep_assert_held(pvmw->ptl);
+	VM_BUG_ON_PAGE(PageLRU(pvmw->page), pvmw->page);
+
+	walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+	if (!walk)
+		return;
+
+	walk->max_seq = max_seq;
+
+	start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
+	end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end);
+
+	if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
+		if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
+			end = start + MIN_LRU_BATCH * PAGE_SIZE;
+		else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2)
+			start = end - MIN_LRU_BATCH * PAGE_SIZE;
+		else {
+			start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2;
+			end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2;
+		}
+	}
+
+	pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
+
+	lock_page_memcg(pvmw->page);
+	arch_enter_lazy_mmu_mode();
+
+	for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		struct folio *folio;
+		unsigned long pfn = pte_pfn(pte[i]);
+
+		total++;
+
+		if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+			continue;
+
+		if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+			continue;
+
+		if (!pte_young(pte[i]))
+			continue;
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			continue;
+
+		folio = pfn_folio(pfn);
+		if (folio_nid(folio) != pgdat->node_id)
+			continue;
+
+		if (folio_memcg_rcu(folio) != memcg)
+			continue;
+
+		VM_BUG_ON(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end);
+		if (ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) {
+			folio_update_gen(folio, walk);
+			young++;
+		}
+
+		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			__set_bit(i, walk->bitmap);
+	}
+
+	arch_leave_lazy_mmu_mode();
+	unlock_page_memcg(pvmw->page);
+
+	if (suitable_to_scan(total, young))
+		set_bloom_filter(lruvec, max_seq, pvmw->pmd);
+
+	for_each_set_bit(i, walk->bitmap, MIN_LRU_BATCH)
+		set_page_dirty(pte_page(pte[i]));
+
+	bitmap_zero(walk->bitmap, MIN_LRU_BATCH);
+}
+
 /******************************************************************************
  *                          state change
  ******************************************************************************/
@@ -3649,6 +4534,12 @@ static int __init init_lru_gen(void)
 };
 late_initcall(init_lru_gen);
 
+#else
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -4536,6 +5427,11 @@ static void age_active_anon(struct pglist_data *pgdat,
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
+	if (lru_gen_enabled()) {
+		lru_gen_age_node(pgdat, sc);
+		return;
+	}
+
 	if (!can_age_anon_pages(pgdat, sc))
 		return;