[v6,6/9] mm: multigenerational lru: aging

Message ID 20220104202227.2903605-7-yuzhao@google.com (mailing list archive)
State New, archived
Series Multigenerational LRU Framework

Commit Message

Yu Zhao Jan. 4, 2022, 8:22 p.m. UTC
To avoid confusion, the term "scan" will be applied to PTEs in a page
table and pages on an lru list. It emphasizes consecutive elements in
a set rather than the data structure holding the set together.

The aging produces young generations. Given an lruvec, it iterates
lruvec_memcg()->mm_list and calls walk_page_range() with each
mm_struct on this list to scan PTEs for accessed pages. On finding a
young PTE, it clears the accessed bit and updates the gen counter of
the page mapped by this PTE to (max_seq%MAX_NR_GENS)+1. After each
iteration of this list, it increments max_seq. When max_seq-min_seq+1
reaches MIN_NR_GENS, the aging must run before the eviction can
continue.
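
As a rough sketch only (run_aging_pass(), get_next_mm() and the lrugen
field layout below are illustrative assumptions, not the exact
identifiers in this patch), one aging pass amounts to:

  /* hedged sketch of one aging pass; mmap_lock and error handling omitted */
  static void run_aging_pass(struct lruvec *lruvec)
  {
      struct mm_struct *mm = NULL;
      unsigned long max_seq = READ_ONCE(lruvec->lrugen.max_seq);

      /* walk every mm_struct on lruvec_memcg()->mm_list */
      while ((mm = get_next_mm(lruvec, mm))) {
          /*
           * The PTE-walk callbacks clear the accessed bit and set the gen
           * counter of each young page to (max_seq % MAX_NR_GENS) + 1.
           */
          walk_page_range(mm, FIRST_USER_ADDRESS, mm->highest_vm_end,
                          &mm_walk_ops, lruvec);
      }

      /* one full iteration of the list done: produce a new generation */
      WRITE_ONCE(lruvec->lrugen.max_seq, max_seq + 1);
  }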

To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multigenerational lru, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
lru, as usual.

In other words, the aging promotes a page to the youngest generation
when it finds the page accessed through page tables; demotion happens
as a consequence when the aging creates a new generation. Note that
promotion doesn't require any lru list operations in the aging path,
only updates of the gen counter and the lru sizes; demotion, unless it
is the result of creating a new generation, requires lru list
operations, e.g., lru_deactivate_fn().

The aging uses the following optimizations when walking page tables:
1) It uses the accessed bit in non-leaf PMD entries, the hint from the
   CPU scheduler and the Bloom filters to reduce its search space (a
   generic sketch of the Bloom filter check follows this list).
2) It doesn't zigzag between a PGD table and the same PMD or PTE table
   spanning multiple VMAs. In other words, it finishes all the VMAs
   within the range of the same PMD or PTE table before it returns to
   a PGD table. This improves the cache performance for workloads that
   have large numbers of tiny VMAs, especially when
   CONFIG_PGTABLE_LEVELS=5.
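
The Bloom filters in 1) can be pictured with a generic two-hash sketch
like the one below (the names and size are illustrative assumptions, not
the filter implemented in mm/vmscan.c). A false positive only costs an
extra scan; there are no false negatives, so a recorded PMD table is
never skipped by mistake:

  #include <linux/bitops.h>
  #include <linux/hash.h>

  #define FILTER_BITS 15

  /* remembers PMD tables that contained young PTEs on the previous walk */
  static unsigned long filter[BITS_TO_LONGS(1 << FILTER_BITS)];

  static void filter_set(void *pmd_table)
  {
      u32 hash = hash_ptr(pmd_table, 2 * FILTER_BITS);

      set_bit(hash & ((1 << FILTER_BITS) - 1), filter);
      set_bit(hash >> FILTER_BITS, filter);
  }

  static bool filter_test(void *pmd_table)
  {
      u32 hash = hash_ptr(pmd_table, 2 * FILTER_BITS);

      return test_bit(hash & ((1 << FILTER_BITS) - 1), filter) &&
             test_bit(hash >> FILTER_BITS, filter);
  }

The idea is that the next walk can prioritize the PMD ranges that were
productive last time instead of rescanning everything.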

The aging is only interested in accessed pages and therefore has the
complexity of O(nr_hot_evictable_pages). The worst-case scenario is
that the aging fails to exploit any spatial locality and the eviction
has to promote all accessed pages when walking the rmap, which is
similar to the active/inactive lru. However, generations can still
provide better temporal locality.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
---
 include/linux/memcontrol.h |   6 +
 include/linux/mm.h         |   5 +
 include/linux/mmzone.h     |  10 +
 include/linux/oom.h        |  16 +
 include/linux/swap.h       |   4 +
 mm/oom_kill.c              |   4 +-
 mm/rmap.c                  |   7 +
 mm/vmscan.c                | 896 +++++++++++++++++++++++++++++++++++++
 8 files changed, 946 insertions(+), 2 deletions(-)

Comments

Michal Hocko Jan. 6, 2022, 4:06 p.m. UTC | #1
I am still reading through the series. It is a lot of code and quite
hard to wrap one's head around, so these are mostly random things I have
run into. More will likely follow.

On Tue 04-01-22 13:22:25, Yu Zhao wrote:
[...]
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index aba18cd101db..028afdb81c10 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1393,18 +1393,24 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
>  
>  static inline void lock_page_memcg(struct page *page)
>  {
> +	/* to match folio_memcg_rcu() */
> +	rcu_read_lock();
>  }
>  
>  static inline void unlock_page_memcg(struct page *page)
>  {
> +	rcu_read_unlock();
>  }
>  
>  static inline void folio_memcg_lock(struct folio *folio)
>  {
> +	/* to match folio_memcg_rcu() */
> +	rcu_read_lock();
>  }
>  
>  static inline void folio_memcg_unlock(struct folio *folio)
>  {
> +	rcu_read_unlock();
>  }

This should go into a separate patch and be merged independently. I
hadn't realized that the !MEMCG configuration has different locking
scopes.

[...]
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 2db9a1432511..9c7a4fae0661 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -57,6 +57,22 @@ struct oom_control {
>  extern struct mutex oom_lock;
>  extern struct mutex oom_adj_mutex;
>  
> +#ifdef CONFIG_MMU
> +extern struct task_struct *oom_reaper_list;
> +extern struct wait_queue_head oom_reaper_wait;
> +
> +static inline bool oom_reaping_in_progress(void)
> +{
> +	/* a racy check can be used to reduce the chance of overkilling */
> +	return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait);
> +}
> +#else
> +static inline bool oom_reaping_in_progress(void)
> +{
> +	return false;
> +}
> +#endif

I do not like this. These are the oom reaper's internals and no code
should really make any decisions based on them. oom_reaping_in_progress()
is not telling much anyway. This is a global queue for the oom reaper
that can contain oom victims from different oom scopes (e.g. global OOM,
memcg OOM or memory policy OOM).

Your lru_gen_age_node uses this to decide whether to trigger
out_of_memory and that is clearly wrong for the above reasons.
out_of_memory is designed to skip over any action if there is an oom
victim pending from the oom domain (have a look at oom_evaluate_task).

[...]

> +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
> +		       unsigned long min_ttl)
> +{
> +	bool need_aging;
> +	long nr_to_scan;
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +	int swappiness = get_swappiness(memcg);
> +	DEFINE_MAX_SEQ(lruvec);
> +	DEFINE_MIN_SEQ(lruvec);
> +
> +	if (mem_cgroup_below_min(memcg))
> +		return false;

mem_cgroup_below_min requires effective values to be calculated for the
reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection
Michal Hocko Jan. 6, 2022, 4:12 p.m. UTC | #2
On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> +{
> +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> +		return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> +
> +	return current->reclaim_state->mm_walk;
> +}
> +
> +static void free_mm_walk(struct lru_gen_mm_walk *walk)
> +{
> +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> +		kvfree(walk);
> +}

Do I get it right that you are allocating from the reclaim context? What
prevents this from completely depleting memory, given that the reclaim
context is PF_MEMALLOC?
Yu Zhao Jan. 6, 2022, 9:27 p.m. UTC | #3
On Thu, Jan 06, 2022 at 05:06:42PM +0100, Michal Hocko wrote:
> I am still reading through the series. It is a lot of code and quite
> hard to wrap ones head around so these are mostly random things I have
> run into. More will likely follow up.
> 
> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> [...]
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index aba18cd101db..028afdb81c10 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1393,18 +1393,24 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
> >  
> >  static inline void lock_page_memcg(struct page *page)
> >  {
> > +	/* to match folio_memcg_rcu() */
> > +	rcu_read_lock();
> >  }
> >  
> >  static inline void unlock_page_memcg(struct page *page)
> >  {
> > +	rcu_read_unlock();
> >  }
> >  
> >  static inline void folio_memcg_lock(struct folio *folio)
> >  {
> > +	/* to match folio_memcg_rcu() */
> > +	rcu_read_lock();
> >  }
> >  
> >  static inline void folio_memcg_unlock(struct folio *folio)
> >  {
> > +	rcu_read_unlock();
> >  }
> 
> This should go into a separate patch and merge it independently. I
> haven't really realized that !MEMCG configuration has a different
> locking scopes.

Consider it done.

> > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > index 2db9a1432511..9c7a4fae0661 100644
> > --- a/include/linux/oom.h
> > +++ b/include/linux/oom.h
> > @@ -57,6 +57,22 @@ struct oom_control {
> >  extern struct mutex oom_lock;
> >  extern struct mutex oom_adj_mutex;
> >  
> > +#ifdef CONFIG_MMU
> > +extern struct task_struct *oom_reaper_list;
> > +extern struct wait_queue_head oom_reaper_wait;
> > +
> > +static inline bool oom_reaping_in_progress(void)
> > +{
> > +	/* a racy check can be used to reduce the chance of overkilling */
> > +	return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait);
> > +}
> > +#else
> > +static inline bool oom_reaping_in_progress(void)
> > +{
> > +	return false;
> > +}
> > +#endif
> 
> I do not like this. These are internal oom reaper's and no code should
> really make any decisions based on that. oom_reaping_in_progress is not
> telling much anyway.

There is a perfectly legitimate reason for this.

If there is already an oom kill victim and the oom reaper is making
progress, the system may still be under memory pressure until the oom
reaping is done. The page reclaim has two choices in this transient
state: kill more processes or keep reclaiming (a few more) hot pages.

The first choice, AKA overkilling, is generally a bad one. The oom
reaper is single threaded and it can't go faster with additional
victims. Additional processes are sacrificed for nothing -- this is
an overcorrection of a system that tries to strike a balance between
the tendencies to release memory pressure and to improve memory
utilization.

> This is a global queue for oom reaper that can
> contain oom victims from different oom scopes (e.g. global OOM, memcg
> OOM or memory policy OOM).

True, but this is the wrong reason to reach the conclusion below. Oom
kill scopes do NOT matter; only the pool the freed memory goes into
does. And there is only one global pool of free pages.

> Your lru_gen_age_node uses this to decide whether to trigger
> out_of_memory and that is clearly wrong for the above reasons.

I hope my explanation above is clear enough. There is nothing wrong
with the purpose and the usage of oom_reaping_in_progress(), and it
has been well tested in the Arch Linux Zen kernel.

Without it, overkills can be easily reproduced by the following simple
script. That is, additional oom kills happen to processes other than
"tail".

  # enable zram
  while true;
  do
      tail /dev/zero
  done

> out_of_memory is designed to skip over any action if there is an oom
> victim pending from the oom domain (have a look at oom_evaluate_task).

Where exactly? Point me to the code please.

I don't see such logic inside out_of_memory() or
oom_evaluate_task(). Currently the only thing that could remotely
prevent overkills is oom_lock, but it's inadequate.

This is the entire pipeline:
low on memory -> out_of_memory() -> oom_reaper() -> free memory

To avoid overkills, we need to consider the latter half of it too.
oom_reaping_in_progress() is exactly for this purpose.

> > +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
> > +		       unsigned long min_ttl)
> > +{
> > +	bool need_aging;
> > +	long nr_to_scan;
> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +	int swappiness = get_swappiness(memcg);
> > +	DEFINE_MAX_SEQ(lruvec);
> > +	DEFINE_MIN_SEQ(lruvec);
> > +
> > +	if (mem_cgroup_below_min(memcg))
> > +		return false;
> 
> mem_cgroup_below_min requires effective values to be calculated for the
> reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection

I always keep that in mind, and age_lruvec() is called *after*
mem_cgroup_calculate_protection():

  balance_pgdat()
    memcgs_need_aging = 0
    do {
      lru_gen_age_node()
        if (!memcgs_need_aging) {
            memcgs_need_aging = 1
            return
        }
        age_lruvec()

      shrink_node_memcgs()
        mem_cgroup_calculate_protection()
        lru_gen_shrink_lruvec()
          if ...
            memcgs_need_aging = 0
    } while ...
Yu Zhao Jan. 6, 2022, 9:41 p.m. UTC | #4
On Thu, Jan 06, 2022 at 05:12:16PM +0100, Michal Hocko wrote:
> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> > +{
> > +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > +		return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> > +
> > +	return current->reclaim_state->mm_walk;
> > +}
> > +
> > +static void free_mm_walk(struct lru_gen_mm_walk *walk)
> > +{
> > +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > +		kvfree(walk);
> > +}
> 
> Do I get it right that you are allocating from the reclaim context? What
> prevents this to completely deplete the memory as the reclaim context is
> PF_MEMALLOC?

Yes, and in general for the same reason zram/zswap/etc. allocate memory
in the reclaim context: to make more free memory.

In this case, lru_gen_mm_walk is small (160 bytes); it's per direct
reclaimer; and direct reclaimers rarely come here, i.e., only when
kswapd can't keep up in terms of the aging, which is similar to the
condition where the inactive list is empty for the active/inactive
lru.
Michal Hocko Jan. 7, 2022, 8:43 a.m. UTC | #5
On Thu 06-01-22 14:27:52, Yu Zhao wrote:
> On Thu, Jan 06, 2022 at 05:06:42PM +0100, Michal Hocko wrote:
[...]
> > > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > > index 2db9a1432511..9c7a4fae0661 100644
> > > --- a/include/linux/oom.h
> > > +++ b/include/linux/oom.h
> > > @@ -57,6 +57,22 @@ struct oom_control {
> > >  extern struct mutex oom_lock;
> > >  extern struct mutex oom_adj_mutex;
> > >  
> > > +#ifdef CONFIG_MMU
> > > +extern struct task_struct *oom_reaper_list;
> > > +extern struct wait_queue_head oom_reaper_wait;
> > > +
> > > +static inline bool oom_reaping_in_progress(void)
> > > +{
> > > +	/* a racy check can be used to reduce the chance of overkilling */
> > > +	return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait);
> > > +}
> > > +#else
> > > +static inline bool oom_reaping_in_progress(void)
> > > +{
> > > +	return false;
> > > +}
> > > +#endif
> > 
> > I do not like this. These are internal oom reaper's and no code should
> > really make any decisions based on that. oom_reaping_in_progress is not
> > telling much anyway.
> 
> There is a perfectly legitimate reason for this.
> 
> If there is already a oom kill victim and the oom reaper is making
> progress, the system may still be under memory pressure until the oom
> reaping is done. The page reclaim has two choices in this transient
> state: kill more processes or keep reclaiming (a few more) hot pages.
> 
> The first choice, AKA overkilling, is generally a bad one. The oom
> reaper is single threaded and it can't go faster with additional
> victims. Additional processes are sacrificed for nothing -- this is
> an overcorrection of a system that tries to strike a balance between
> the tendencies to release memory pressure and to improve memory
> utilization.
> 
> > This is a global queue for oom reaper that can
> > contain oom victims from different oom scopes (e.g. global OOM, memcg
> > OOM or memory policy OOM).
> 
> True, but this is a wrong reason to make the conclusion below. Oom
> kill scopes do NOT matter; only the pool the freed memory goes into
> does. And there is only one global pool free pages.
> 
> > Your lru_gen_age_node uses this to decide whether to trigger
> > out_of_memory and that is clearly wrong for the above reasons.
> 
> I hope my explanation above is clear enough. There is nothing wrong
> with the purpose and the usage of oom_reaping_in_progress(), and it
> has been well tested in the Arch Linux Zen kernel.

I disagree. An ongoing oom kill in one domain (say memcg A) shouldn't be
the basis for any decisions in reclaim in another domain (say memcg B or
even global reclaim). Those are fundamentally different conditions.

> Without it, overkills can be easily reproduced by the following simple
> script. That is additional oom kills happen to processes other than
> "tail".
> 
>   # enable zram
>   while true;
>   do
>       tail /dev/zero
>   done

I would be interested to hear more (care to send oom reports?).

> > out_of_memory is designed to skip over any action if there is an oom
> > victim pending from the oom domain (have a look at oom_evaluate_task).
> 
> Where exactly? Point me to the code please.
> 
> I don't see such a logic inside out_of_memory() or
> oom_evaluate_task(). Currently the only thing that could remotely
> prevent overkills is oom_lock. But it's inadequate.

OK, let me try to explain. The protocol is rather convoluted. Once the
oom killer is invoked it chooses a victim to kill. oom_evaluate_task will
evaluate _all_ tasks from the respective oom domain (select_bad_process
distinguishes memcg vs global oom kills, and oom_cpuset_eligible handles
the cpuset domains). If there is any pre-existing oom victim
(tsk_is_oom_victim) then the scan is aborted and the oom killer bails
out. An OOM victim stops being considered relevant once the oom reaper
manages to release its address space (or gives up on the mmap_sem
contention) and sets the MMF_OOM_SKIP flag for the mm.
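
Paraphrasing the relevant part of oom_evaluate_task() (a sketch, not
verbatim kernel code):

  /* paraphrased back-off logic in oom_evaluate_task() */
  if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
      if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
          goto next;  /* the reaper is done with this victim */
      goto abort;     /* a victim is still pending: bail out of the scan */
  }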

That being said, out_of_memory() automatically backs off and relies on
the oom reaper to process its queue.

Does that make it clearer now?

> This is the entire pipeline:
> low on memory -> out_of_memory() -> oom_reaper() -> free memory
> 
> To avoid overkills, we need to consider the later half of it too.
> oom_reaping_in_progress() is exactly for this purpose.
> 
> > > +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
> > > +		       unsigned long min_ttl)
> > > +{
> > > +	bool need_aging;
> > > +	long nr_to_scan;
> > > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > +	int swappiness = get_swappiness(memcg);
> > > +	DEFINE_MAX_SEQ(lruvec);
> > > +	DEFINE_MIN_SEQ(lruvec);
> > > +
> > > +	if (mem_cgroup_below_min(memcg))
> > > +		return false;
> > 
> > mem_cgroup_below_min requires effective values to be calculated for the
> > reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection
> 
> I always keep that in mind, and age_lruvec() is called *after*
> mem_cgroup_calculate_protection():

>   balance_pgdat()
>     memcgs_need_aging = 0
>     do {
>       lru_gen_age_node()
>         if (!memcgs_need_aging) {
>             memcgs_need_aging = 1
>             return
>         }
>         age_lruvec()
> 
>       shrink_node_memcgs()
>         mem_cgroup_calculate_protection()
>         lru_gen_shrink_lruvec()
>           if ...
>             memcgs_need_aging = 0
>     } while ...

Uff, this is really subtle. I really think you should be following the
existing pattern, where the effective values are calculated right in the
same context as they are evaluated.
Michal Hocko Jan. 7, 2022, 8:55 a.m. UTC | #6
On Thu 06-01-22 14:41:12, Yu Zhao wrote:
> On Thu, Jan 06, 2022 at 05:12:16PM +0100, Michal Hocko wrote:
> > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> > > +{
> > > +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > > +		return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> > > +
> > > +	return current->reclaim_state->mm_walk;
> > > +}
> > > +
> > > +static void free_mm_walk(struct lru_gen_mm_walk *walk)
> > > +{
> > > +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > > +		kvfree(walk);
> > > +}
> > 
> > Do I get it right that you are allocating from the reclaim context? What
> > prevents this to completely deplete the memory as the reclaim context is
> > PF_MEMALLOC?
> 
> Yes, and in general the same reason zram/zswap/etc. allocate memory in
> the reclaim context: to make more free memory.

I have to admit that I am not really familiar with zram/zswap but I find
the concept of requiring memory to do the reclaim really problematic.
 
> In this case, lru_gen_mm_walk is small (160 bytes); it's per direct
> reclaimer; and direct reclaimers rarely come here, i.e., only when
> kswapd can't keep up in terms of the aging, which is similar to the
> condition where the inactive list is empty for the active/inactive
> lru.

Well, this is not a strong argument to be honest. Kswapd being stuck
and the majority of the reclaim being done in the direct reclaim
context is a situation I have seen many many times. We used to have
problems with direct reclaimers throttling to prevent over-eager OOM
situations.

Have you considered using a pool of preallocated objects instead?
Michal Hocko Jan. 7, 2022, 9 a.m. UTC | #7
On Fri 07-01-22 09:55:09, Michal Hocko wrote:
[...]
> > In this case, lru_gen_mm_walk is small (160 bytes); it's per direct
> > reclaimer; and direct reclaimers rarely come here, i.e., only when
> > kswapd can't keep up in terms of the aging, which is similar to the
> > condition where the inactive list is empty for the active/inactive
> > lru.
> 
> Well, this is not a strong argument to be honest. Kswapd being stuck
> and the majority of the reclaim being done in the direct reclaim
> context is a situation I have seen many many times.

Also do not forget that memcg reclaim is effectively only direct
reclaim. Not that memcg reclaim indicates a global memory shortage,
but it can add up and race with the global reclaim as well.
Michal Hocko Jan. 7, 2022, 1:11 p.m. UTC | #8
On Tue 04-01-22 13:22:25, Yu Zhao wrote:
[...]
> +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
> +{
> +	struct mem_cgroup *memcg;
> +	bool success = false;
> +	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
> +
> +	VM_BUG_ON(!current_is_kswapd());
> +
> +	current->reclaim_state->mm_walk = &pgdat->mm_walk;
> +
> +	memcg = mem_cgroup_iter(NULL, NULL, NULL);
> +	do {
> +		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +
> +		if (age_lruvec(lruvec, sc, min_ttl))
> +			success = true;
> +
> +		cond_resched();
> +	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> +
> +	if (!success && mutex_trylock(&oom_lock)) {
> +		struct oom_control oc = {
> +			.gfp_mask = sc->gfp_mask,
> +			.order = sc->order,
> +		};
> +
> +		if (!oom_reaping_in_progress())
> +			out_of_memory(&oc);
> +
> +		mutex_unlock(&oom_lock);
> +	}

Why do you need to trigger oom killer from this path? Why cannot you
rely on the page allocator to do that like we do now?
Michal Hocko Jan. 7, 2022, 2:44 p.m. UTC | #9
On Tue 04-01-22 13:22:25, Yu Zhao wrote:
[...]
> +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> +{
> +	static const struct mm_walk_ops mm_walk_ops = {
> +		.test_walk = should_skip_vma,
> +		.p4d_entry = walk_pud_range,
> +	};
> +
> +	int err;
> +#ifdef CONFIG_MEMCG
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +#endif
> +
> +	walk->next_addr = FIRST_USER_ADDRESS;
> +
> +	do {
> +		unsigned long start = walk->next_addr;
> +		unsigned long end = mm->highest_vm_end;
> +
> +		err = -EBUSY;
> +
> +		rcu_read_lock();
> +#ifdef CONFIG_MEMCG
> +		if (memcg && atomic_read(&memcg->moving_account))
> +			goto contended;
> +#endif
> +		if (!mmap_read_trylock(mm))
> +			goto contended;

Have you evaluated the behavior under mmap_sem contention? I mean, what
would be the effect of some mms being excluded from the walk? This path
is called from direct reclaim, and we do allocate with exclusive mmap_sem
IIRC, and the trylock can fail in the presence of a pending writer if I
am not mistaken, so even a read lock holder (e.g. an allocation from the
#PF) can bypass the walk.

Or is this considered statistically insignificant and thus a theoretical
problem?
Yu Zhao Jan. 7, 2022, 9:12 p.m. UTC | #10
On Fri, Jan 07, 2022 at 09:43:49AM +0100, Michal Hocko wrote:
> On Thu 06-01-22 14:27:52, Yu Zhao wrote:
> > On Thu, Jan 06, 2022 at 05:06:42PM +0100, Michal Hocko wrote:
> [...]
> > > > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > > > index 2db9a1432511..9c7a4fae0661 100644
> > > > --- a/include/linux/oom.h
> > > > +++ b/include/linux/oom.h
> > > > @@ -57,6 +57,22 @@ struct oom_control {
> > > >  extern struct mutex oom_lock;
> > > >  extern struct mutex oom_adj_mutex;
> > > >  
> > > > +#ifdef CONFIG_MMU
> > > > +extern struct task_struct *oom_reaper_list;
> > > > +extern struct wait_queue_head oom_reaper_wait;
> > > > +
> > > > +static inline bool oom_reaping_in_progress(void)
> > > > +{
> > > > +	/* a racy check can be used to reduce the chance of overkilling */
> > > > +	return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait);
> > > > +}
> > > > +#else
> > > > +static inline bool oom_reaping_in_progress(void)
> > > > +{
> > > > +	return false;
> > > > +}
> > > > +#endif
> > > 
> > > I do not like this. These are internal oom reaper's and no code should
> > > really make any decisions based on that. oom_reaping_in_progress is not
> > > telling much anyway.
> > 
> > There is a perfectly legitimate reason for this.
> > 
> > If there is already a oom kill victim and the oom reaper is making
> > progress, the system may still be under memory pressure until the oom
> > reaping is done. The page reclaim has two choices in this transient
> > state: kill more processes or keep reclaiming (a few more) hot pages.
> > 
> > The first choice, AKA overkilling, is generally a bad one. The oom
> > reaper is single threaded and it can't go faster with additional
> > victims. Additional processes are sacrificed for nothing -- this is
> > an overcorrection of a system that tries to strike a balance between
> > the tendencies to release memory pressure and to improve memory
> > utilization.
> > 
> > > This is a global queue for oom reaper that can
> > > contain oom victims from different oom scopes (e.g. global OOM, memcg
> > > OOM or memory policy OOM).
> > 
> > True, but this is a wrong reason to make the conclusion below. Oom
> > kill scopes do NOT matter; only the pool the freed memory goes into
> > does. And there is only one global pool free pages.
> > 
> > > Your lru_gen_age_node uses this to decide whether to trigger
> > > out_of_memory and that is clearly wrong for the above reasons.
> > 
> > I hope my explanation above is clear enough. There is nothing wrong
> > with the purpose and the usage of oom_reaping_in_progress(), and it
> > has been well tested in the Arch Linux Zen kernel.
> 
> I disagree. An ongoing oom kill in one domain (say memcg A) shouldn't be
> any base for any decisions in reclaim in other domain (say memcg B or
> even a global reclaim). Those are fundamentally different conditions.

I agree for the memcg A oom and memcg B reclaim case, because memory
freed from A doesn't go to B.

I still think for the memcg A and the global reclaim case, memory
freed from A can be considered when deciding whether to make more
kills during global reclaim.

But this is something really minor, and I'll go with your suggestion,
i.e., getting rid of oom_reaping_in_progress().

> > Without it, overkills can be easily reproduced by the following simple
> > script. That is additional oom kills happen to processes other than
> > "tail".
> > 
> >   # enable zram
> >   while true;
> >   do
> >       tail /dev/zero
> >   done
> 
> I would be interested to hear more (care to send oom reports?).

I agree with what you said below. I think those additional oom kills
might have been from different oom domains. I plan to leave this for now
and go with your suggestion as mentioned above.

> > > out_of_memory is designed to skip over any action if there is an oom
> > > victim pending from the oom domain (have a look at oom_evaluate_task).
> > 
> > Where exactly? Point me to the code please.
> > 
> > I don't see such a logic inside out_of_memory() or
> > oom_evaluate_task(). Currently the only thing that could remotely
> > prevent overkills is oom_lock. But it's inadequate.
> 
> OK, let me try to exaplain. The protocol is rather convoluted. Once the
> oom killer is invoked it choses a victim to kill. oom_evaluate_task will
> evaluate _all_ tasks from the oom respective domain (select_bad_process
> which distinguishes memcg vs global oom kill and oom_cpuset_eligible for
> the cpuset domains). If there is any pre-existing oom victim
> (tsk_is_oom_victim) then the scan is aborted and the oom killer bails
> out. OOM victim stops being considered as relevant once the oom reaper
> manages to release its address space (or give up on the mmap_sem
> contention) and sets MMF_OOM_SKIP flag for the mm.
> 
> That being said the out_of_memory automatically backs off and relies on
> the oom reaper to process its queue.
> 
> Does it make more clear for you now?

Yes, you are right, thanks.

> > This is the entire pipeline:
> > low on memory -> out_of_memory() -> oom_reaper() -> free memory
> > 
> > To avoid overkills, we need to consider the later half of it too.
> > oom_reaping_in_progress() is exactly for this purpose.
> > 
> > > > +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
> > > > +		       unsigned long min_ttl)
> > > > +{
> > > > +	bool need_aging;
> > > > +	long nr_to_scan;
> > > > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > > +	int swappiness = get_swappiness(memcg);
> > > > +	DEFINE_MAX_SEQ(lruvec);
> > > > +	DEFINE_MIN_SEQ(lruvec);
> > > > +
> > > > +	if (mem_cgroup_below_min(memcg))
> > > > +		return false;
> > > 
> > > mem_cgroup_below_min requires effective values to be calculated for the
> > > reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection
> > 
> > I always keep that in mind, and age_lruvec() is called *after*
> > mem_cgroup_calculate_protection():
> 
> >   balance_pgdat()
> >     memcgs_need_aging = 0
> >     do {
> >       lru_gen_age_node()
> >         if (!memcgs_need_aging) {
> >             memcgs_need_aging = 1
> >             return
> >         }
> >         age_lruvec()
> > 
> >       shrink_node_memcgs()
> >         mem_cgroup_calculate_protection()
> >         lru_gen_shrink_lruvec()
> >           if ...
> >             memcgs_need_aging = 0
> >     } while ...
> 
> Uff, this is really subtle. I really think you should be following the
> existing pattern when the effective values are calculated right in the
> same context as they are evaluated.

Consider it done.
Yu Zhao Jan. 7, 2022, 11:36 p.m. UTC | #11
On Fri, Jan 07, 2022 at 02:11:29PM +0100, Michal Hocko wrote:
> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> [...]
> > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
> > +{
> > +	struct mem_cgroup *memcg;
> > +	bool success = false;
> > +	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
> > +
> > +	VM_BUG_ON(!current_is_kswapd());
> > +
> > +	current->reclaim_state->mm_walk = &pgdat->mm_walk;
> > +
> > +	memcg = mem_cgroup_iter(NULL, NULL, NULL);
> > +	do {
> > +		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > +
> > +		if (age_lruvec(lruvec, sc, min_ttl))
> > +			success = true;
> > +
> > +		cond_resched();
> > +	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> > +
> > +	if (!success && mutex_trylock(&oom_lock)) {
> > +		struct oom_control oc = {
> > +			.gfp_mask = sc->gfp_mask,
> > +			.order = sc->order,
> > +		};
> > +
> > +		if (!oom_reaping_in_progress())
> > +			out_of_memory(&oc);
> > +
> > +		mutex_unlock(&oom_lock);
> > +	}
> 
> Why do you need to trigger oom killer from this path? Why cannot you
> rely on the page allocator to do that like we do now?

This is per desktop users' (repeated) requests. They can't tolerate
thrashing the way servers do because of UI lags; and they usually don't
have fancy tools like oomd.

Related discussions I saw:
https://github.com/zen-kernel/zen-kernel/issues/218
https://lore.kernel.org/lkml/20101028191523.GA14972@google.com/
https://lore.kernel.org/lkml/20211213051521.21f02dd2@mail.inbox.lv/
https://lore.kernel.org/lkml/54C2C89C.8080002@gmail.com/
https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com/

From patch 8:
  Personal computers
  ------------------
  :Thrashing prevention: Write ``N`` to
   ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
   ``N`` milliseconds from getting evicted. The OOM killer is invoked if
   this working set can't be kept in memory. Based on the average human
   detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
   lags due to thrashing. Larger values like ``N=3000`` make lags less
   noticeable at the cost of more OOM kills.
Yu Zhao Jan. 10, 2022, 3:58 a.m. UTC | #12
On Fri, Jan 07, 2022 at 10:00:31AM +0100, Michal Hocko wrote:
> On Fri 07-01-22 09:55:09, Michal Hocko wrote:
> [...]
> > > In this case, lru_gen_mm_walk is small (160 bytes); it's per direct
> > > reclaimer; and direct reclaimers rarely come here, i.e., only when
> > > kswapd can't keep up in terms of the aging, which is similar to the
> > > condition where the inactive list is empty for the active/inactive
> > > lru.
> > 
> > Well, this is not a strong argument to be honest. Kswapd being stuck
> > and the majority of the reclaim being done in the direct reclaim
> > context is a situation I have seen many many times.
> 
> Also do not forget that memcg reclaim is effectivelly only direct
> reclaim. Not that the memcg reclaim indicates a global memory shortage
> but it can add up and race with the global reclaim as well.

I don't dispute any of the above, and I probably don't like this code
more than you do.

But let's not forget that the purposes of PF_MEMALLOC, besides
preventing recursive reclaim, include letting reclaim dip into reserves
so that it can make more free memory. So I think it's acceptable if the
following conditions are met:
1. The allocation size is small.
2. The number of allocations is bounded.
3. Its failure doesn't stall reclaim.
And it'd be nice if
4. The allocation happens rarely, e.g., slow path only.

The code in question meets all of them.

1. This allocation is 160 bytes.
2. It's bounded by the number of page table walkers which, in the
   worst case, is the same as the number of mm_structs.
3. Most importantly, its failure doesn't stall the aging. The aging
   will fall back to the rmap-based function lru_gen_look_around().
   But this function only gathers the accessed bit from at most 64
   PTEs, meaning it's less efficient (it retains ~80% of the
   performance gains).
4. This allocation is rare, i.e., only when the aging is required,
   which is similar to the low inactive case for the active/inactive
   lru.

The bottom line is I can try various optimizations, e.g., preallocate
a few buffers for a limited number of page walkers and if this number
has been reached, fallback to the rmap-based function. But I have yet
to see evidence that calls for additional complexity.
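
For what it's worth, the preallocation idea above could look roughly
like the sketch below (hypothetical names and pool size; this is not
part of the posted patch):

  /* hedged sketch: a small static pool of walk buffers; when it runs
   * out, the caller falls back to the rmap-based lru_gen_look_around() */
  #define NR_WALK_CACHE 4

  static struct lru_gen_mm_walk walk_cache[NR_WALK_CACHE];
  static DECLARE_BITMAP(walk_cache_used, NR_WALK_CACHE);

  static struct lru_gen_mm_walk *get_cached_mm_walk(void)
  {
      int i;

      for (i = 0; i < NR_WALK_CACHE; i++) {
          if (!test_and_set_bit(i, walk_cache_used)) {
              memset(&walk_cache[i], 0, sizeof(walk_cache[i]));
              return &walk_cache[i];
          }
      }

      return NULL;  /* pool exhausted: fall back, don't stall reclaim */
  }

  static void put_cached_mm_walk(struct lru_gen_mm_walk *walk)
  {
      if (walk >= walk_cache && walk < walk_cache + NR_WALK_CACHE)
          clear_bit(walk - walk_cache, walk_cache_used);
  }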
Yu Zhao Jan. 10, 2022, 4:47 a.m. UTC | #13
On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote:
> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> [...]
> > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > +{
> > +	static const struct mm_walk_ops mm_walk_ops = {
> > +		.test_walk = should_skip_vma,
> > +		.p4d_entry = walk_pud_range,
> > +	};
> > +
> > +	int err;
> > +#ifdef CONFIG_MEMCG
> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +#endif
> > +
> > +	walk->next_addr = FIRST_USER_ADDRESS;
> > +
> > +	do {
> > +		unsigned long start = walk->next_addr;
> > +		unsigned long end = mm->highest_vm_end;
> > +
> > +		err = -EBUSY;
> > +
> > +		rcu_read_lock();
> > +#ifdef CONFIG_MEMCG
> > +		if (memcg && atomic_read(&memcg->moving_account))
> > +			goto contended;
> > +#endif
> > +		if (!mmap_read_trylock(mm))
> > +			goto contended;
> 
> Have you evaluated the behavior under mmap_sem contention? I mean what
> would be an effect of some mms being excluded from the walk? This path
> is called from direct reclaim and we do allocate with exclusive mmap_sem
> IIRC and the trylock can fail in a presence of pending writer if I am
> not mistaken so even the read lock holder (e.g. an allocation from the #PF)
> can bypass the walk.

You are right. Here it must be a trylock; otherwise it can deadlock.

I think there might be a misunderstanding: the aging doesn't
exclusively rely on page table walks to gather the accessed bit. It
prefers page table walks but it can also fall back to the rmap-based
function, i.e., lru_gen_look_around(), which only gathers the accessed
bit from at most 64 PTEs and therefore is less efficient. But it still
retains about 80% of the performance gains.

> Or is this considered statistically insignificant thus a theoretical
> problem?

Yes. People who work on the maple tree and SPF at Google expressed the
same concern during the design review meeting (all stakeholders on the
mailing list were also invited). So we had a counter to monitor the
contention in previous versions, i.e., MM_LOCK_CONTENTION in v4 here:
https://lore.kernel.org/lkml/20210818063107.2696454-8-yuzhao@google.com/

And we also combined this patchset with the SPF patchset to see if the
latter makes any difference. Our conclusion was the contention is
statistically insignificant to the performance under memory pressure.

This can be explained by how often we create a new generation. (We
only walk page tables when we create a new generation. And it's
similar to the low inactive condition for the active/inactive lru.)

Usually we only do so every few seconds. We'd run into problems with
other parts of the kernel, e.g., lru lock contention, i/o congestion,
etc. if we create more than a few generations every second.
Michal Hocko Jan. 10, 2022, 10:54 a.m. UTC | #14
On Sun 09-01-22 21:47:57, Yu Zhao wrote:
> On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote:
> > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > [...]
> > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > +{
> > > +	static const struct mm_walk_ops mm_walk_ops = {
> > > +		.test_walk = should_skip_vma,
> > > +		.p4d_entry = walk_pud_range,
> > > +	};
> > > +
> > > +	int err;
> > > +#ifdef CONFIG_MEMCG
> > > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > +#endif
> > > +
> > > +	walk->next_addr = FIRST_USER_ADDRESS;
> > > +
> > > +	do {
> > > +		unsigned long start = walk->next_addr;
> > > +		unsigned long end = mm->highest_vm_end;
> > > +
> > > +		err = -EBUSY;
> > > +
> > > +		rcu_read_lock();
> > > +#ifdef CONFIG_MEMCG
> > > +		if (memcg && atomic_read(&memcg->moving_account))
> > > +			goto contended;
> > > +#endif
> > > +		if (!mmap_read_trylock(mm))
> > > +			goto contended;
> > 
> > Have you evaluated the behavior under mmap_sem contention? I mean what
> > would be an effect of some mms being excluded from the walk? This path
> > is called from direct reclaim and we do allocate with exclusive mmap_sem
> > IIRC and the trylock can fail in a presence of pending writer if I am
> > not mistaken so even the read lock holder (e.g. an allocation from the #PF)
> > can bypass the walk.
> 
> You are right. Here it must be a trylock; otherwise it can deadlock.

Yeah, this is clear.

> I think there might be a misunderstanding: the aging doesn't
> exclusively rely on page table walks to gather the accessed bit. It
> prefers page table walks but it can also fallback to the rmap-based
> function, i.e., lru_gen_look_around(), which only gathers the accessed
> bit from at most 64 PTEs and therefore is less efficient. But it still
> retains about 80% of the performance gains.

I have to say that I really have a hard time understanding the runtime
behavior depending on that interaction. How does the reclaim behave when
the virtual scan is enabled, partially enabled and almost completely
disabled due to different constraints? I do not see any such evaluation
described in the changelogs and I consider this to be rather important
information for judging the overall behavior.
 
> > Or is this considered statistically insignificant thus a theoretical
> > problem?
> 
> Yes. People who work on the maple tree and SPF at Google expressed the
> same concern during the design review meeting (all stakeholders on the
> mailing list were also invited). So we had a counter to monitor the
> contention in previous versions, i.e., MM_LOCK_CONTENTION in v4 here:
> https://lore.kernel.org/lkml/20210818063107.2696454-8-yuzhao@google.com/
> 
> And we also combined this patchset with the SPF patchset to see if the
> latter makes any difference. Our conclusion was the contention is
> statistically insignificant to the performance under memory pressure.
> 
> This can be explained by how often we create a new generation. (We
> only walk page tables when we create a new generation. And it's
> similar to the low inactive condition for the active/inactive lru.)
> 
> Usually we only do so every few seconds. We'd run into problems with
> other parts of the kernel, e.g., lru lock contention, i/o congestion,
> etc. if we create more than a few generation every second.

This would be very good information to have in the changelogs, ideally
with some numbers and analysis.
Michal Hocko Jan. 10, 2022, 2:37 p.m. UTC | #15
On Sun 09-01-22 20:58:02, Yu Zhao wrote:
> On Fri, Jan 07, 2022 at 10:00:31AM +0100, Michal Hocko wrote:
> > On Fri 07-01-22 09:55:09, Michal Hocko wrote:
> > [...]
> > > > In this case, lru_gen_mm_walk is small (160 bytes); it's per direct
> > > > reclaimer; and direct reclaimers rarely come here, i.e., only when
> > > > kswapd can't keep up in terms of the aging, which is similar to the
> > > > condition where the inactive list is empty for the active/inactive
> > > > lru.
> > > 
> > > Well, this is not a strong argument to be honest. Kswapd being stuck
> > > and the majority of the reclaim being done in the direct reclaim
> > > context is a situation I have seen many many times.
> > 
> > Also do not forget that memcg reclaim is effectivelly only direct
> > reclaim. Not that the memcg reclaim indicates a global memory shortage
> > but it can add up and race with the global reclaim as well.
> 
> I don't dispute any of the above, and I probably don't like this code
> more than you do.
> 
> But let's not forget the purposes of PF_MEMALLOC, besides preventing
> recursive reclaims, include letting reclaim dip into reserves so that
> it can make more free memory. So I think it's acceptable if the
> following conditions are met:
> 1. The allocation size is small.
> 2. The number of allocations is bounded.
> 3. Its failure doesn't stall reclaim.
> And it'd be nice if
> 4. The allocation happens rarely, e.g., slow path only.

I would add
  0. The allocation should be done only if absolutely _necessary_.

Please keep in mind that whatever you allocate from that context will be
consuming very precious memory reserves which are shared with other
components of the system. Even worse, these can go all the way to
depleting memory completely, at which point other things fall apart.

> The code in question meets all of them.
> 
> 1. This allocation is 160 bytes.
> 2. It's bounded by the number of page table walkers which, in the
>    worst, is same as the number of mm_struct's.
> 3. Most importantly, its failure doesn't stall the aging. The aging
>    will fallback to the rmap-based function lru_gen_look_around().
>    But this function only gathers the accessed bit from at most 64
>    PTEs, meaning it's less efficient (retains ~80% performance gains).
> 4. This allocation is rare, i.e., only when the aging is required,
>    which is similar to the low inactive case for the active/inactive
>    lru.

I think this fallback behavior deserves a much more detailed explanation
in the changelogs.

> The bottom line is I can try various optimizations, e.g., preallocate
> a few buffers for a limited number of page walkers and if this number
> has been reached, fallback to the rmap-based function. But I have yet
> to see evidence that calls for additional complexity.

I would disagree here. This is not an optimization. You should be
avoiding allocations from the memory reclaim path because any allocation
just adds runtime complexity and potential corner cases.
Michal Hocko Jan. 10, 2022, 3:01 p.m. UTC | #16
On Thu 06-01-22 17:12:18, Michal Hocko wrote:
> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> > +{
> > +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > +		return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);

One thing I have overlooked completely: you cannot really use a
GFP_KERNEL allocation here because the reclaim context can be constrained
(e.g. GFP_NOFS). This allocation will not do any reclaim as it is
PF_MEMALLOC, but I suspect that lockdep will complain anyway.

Also, kvmalloc is not really great here: a) the vmalloc path is never
executed for small objects and b) we do not really want to create a
dependency between vmalloc and reclaim (by vmalloc -> reclaim ->
vmalloc).

Even if we rule out vmalloc and look at kmalloc alone, is this really
safe? I do not see any recursion prevention in the SL.B code. Maybe this
just happens to work, but the dependency should really be documented so
that future SL.B changes won't break the whole scheme.
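
Purely to illustrate the concern (a hedged sketch, not a concrete
proposal; it assumes sc->gfp_mask can be plumbed to the call site), a
more constrained allocation might look like:

  /* hedged sketch: honour the caller's reclaim constraints instead of
   * GFP_KERNEL, skip the vmalloc fallback and never dip further into
   * the memory reserves */
  static struct lru_gen_mm_walk *alloc_mm_walk(struct scan_control *sc)
  {
      gfp_t gfp = (sc->gfp_mask & GFP_RECLAIM_MASK) |
                  __GFP_NOWARN | __GFP_NOMEMALLOC;

      if (current->reclaim_state && current->reclaim_state->mm_walk)
          return current->reclaim_state->mm_walk;

      return kzalloc(sizeof(struct lru_gen_mm_walk), gfp);
  }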
Michal Hocko Jan. 10, 2022, 3:35 p.m. UTC | #17
On Fri 07-01-22 16:36:11, Yu Zhao wrote:
> On Fri, Jan 07, 2022 at 02:11:29PM +0100, Michal Hocko wrote:
> > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > [...]
> > > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
> > > +{
> > > +	struct mem_cgroup *memcg;
> > > +	bool success = false;
> > > +	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
> > > +
> > > +	VM_BUG_ON(!current_is_kswapd());
> > > +
> > > +	current->reclaim_state->mm_walk = &pgdat->mm_walk;
> > > +
> > > +	memcg = mem_cgroup_iter(NULL, NULL, NULL);
> > > +	do {
> > > +		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > +
> > > +		if (age_lruvec(lruvec, sc, min_ttl))
> > > +			success = true;
> > > +
> > > +		cond_resched();
> > > +	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> > > +
> > > +	if (!success && mutex_trylock(&oom_lock)) {
> > > +		struct oom_control oc = {
> > > +			.gfp_mask = sc->gfp_mask,
> > > +			.order = sc->order,
> > > +		};
> > > +
> > > +		if (!oom_reaping_in_progress())
> > > +			out_of_memory(&oc);
> > > +
> > > +		mutex_unlock(&oom_lock);
> > > +	}
> > 
> > Why do you need to trigger oom killer from this path? Why cannot you
> > rely on the page allocator to do that like we do now?
> 
> This is per desktop users' (repeated) requests. The can't tolerate
> thrashing as servers do because of UI lags; and they usually don't
> have fancy tools like oomd.
> 
> Related discussions I saw:
> https://github.com/zen-kernel/zen-kernel/issues/218
> https://lore.kernel.org/lkml/20101028191523.GA14972@google.com/
> https://lore.kernel.org/lkml/20211213051521.21f02dd2@mail.inbox.lv/
> https://lore.kernel.org/lkml/54C2C89C.8080002@gmail.com/
> https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com/

I do not really see any arguments why userspace-based thrashing
detection cannot be used for those. Could you clarify?

Also, my question was pointing to why out_of_memory is called from the
reclaim path rather than the allocator (memcg charging path). It is up
to the caller of the reclaim to control different reclaim strategies and
to tell when all hope is lost and the oom killer should be invoked. This
allows for different policies at the allocator level, and this change
will break that AFAICS. E.g. what if the underlying allocation context
is __GFP_NORETRY?
 
> >From patch 8:
>   Personal computers
>   ------------------
>   :Thrashing prevention: Write ``N`` to
>    ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
>    ``N`` milliseconds from getting evicted. The OOM killer is invoked if
>    this working set can't be kept in memory. Based on the average human
>    detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
>    lags due to thrashing. Larger values like ``N=3000`` make lags less
>    noticeable at the cost of more OOM kills.

This is a very good example of something that should be a self-contained
patch with its own justification. TBH it is really not all that clear to
me that we want to provide any user-visible knob to control OOM behavior
based on a time-based QoS.
Vlastimil Babka Jan. 10, 2022, 4:01 p.m. UTC | #18
On 1/10/22 16:01, Michal Hocko wrote:
> On Thu 06-01-22 17:12:18, Michal Hocko wrote:
>> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
>> > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
>> > +{
>> > +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
>> > +		return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> 
> One thing I have overlooked completely. You cannot really use GFP_KERNEL
> allocation here because the reclaim context can be constrained (e.g.
> GFP_NOFS). This allocation will not do any reclaim as it is PF_MEMALLOC
> but I suspect that the lockdep will complain anyway.
> 
> Also kvmalloc is not really great here. a) vmalloc path is never
> executed for small objects and b) we do not really want to make a
> dependency between vmalloc and the reclaim (by vmalloc -> reclaim ->
> vmalloc).
> 
> Even if we rule out vmalloc and look at kmalloc alone. Is this really
> safe? I do not see any recursion prevention in the SL.B code. Maybe this
> just happens to work but the dependency should be really documented so
> that future SL.B changes won't break the whole scheme. 

Slab implementations drop all locks before calling into the page
allocator (thus possibly reclaim), so slab itself should be fine and I
don't expect that to change. But we could eventually reach the page
allocator recursively again; that's true and not great.
Michal Hocko Jan. 10, 2022, 4:25 p.m. UTC | #19
On Mon 10-01-22 17:01:07, Vlastimil Babka wrote:
> On 1/10/22 16:01, Michal Hocko wrote:
> > On Thu 06-01-22 17:12:18, Michal Hocko wrote:
> >> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> >> > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> >> > +{
> >> > +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> >> > +		return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> > 
> > One thing I have overlooked completely. You cannot really use GFP_KERNEL
> > allocation here because the reclaim context can be constrained (e.g.
> > GFP_NOFS). This allocation will not do any reclaim as it is PF_MEMALLOC
> > but I suspect that the lockdep will complain anyway.
> > 
> > Also kvmalloc is not really great here. a) vmalloc path is never
> > executed for small objects and b) we do not really want to make a
> > dependency between vmalloc and the reclaim (by vmalloc -> reclaim ->
> > vmalloc).
> > 
> > Even if we rule out vmalloc and look at kmalloc alone. Is this really
> > safe? I do not see any recursion prevention in the SL.B code. Maybe this
> > just happens to work but the dependency should be really documented so
> > that future SL.B changes won't break the whole scheme. 
> 
> Slab implementations drop all locks before calling into page allocator (thus
> possibly reclaim) so slab itself should be fine and I don't expect it to
> change. But we could eventually reach the page allocator recursively again,
> that's true and not great.

Thanks for double checking. If recursion is really intended and
something the SL.B allocators should support, then this is definitely
worth documenting so that a subtle change won't break it in the future.
Michal Hocko Jan. 10, 2022, 4:57 p.m. UTC | #20
On Tue 04-01-22 13:22:25, Yu Zhao wrote:
[...]
> +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> +{
> +	static const struct mm_walk_ops mm_walk_ops = {
> +		.test_walk = should_skip_vma,
> +		.p4d_entry = walk_pud_range,
> +	};
> +
> +	int err;
> +#ifdef CONFIG_MEMCG
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +#endif
> +
> +	walk->next_addr = FIRST_USER_ADDRESS;
> +
> +	do {
> +		unsigned long start = walk->next_addr;
> +		unsigned long end = mm->highest_vm_end;
> +
> +		err = -EBUSY;
> +
> +		rcu_read_lock();
> +#ifdef CONFIG_MEMCG
> +		if (memcg && atomic_read(&memcg->moving_account))
> +			goto contended;
> +#endif

Why do you need to check for moving_account?
Yu Zhao Jan. 11, 2022, 1:18 a.m. UTC | #21
On Mon, Jan 10, 2022 at 04:35:46PM +0100, Michal Hocko wrote:
> On Fri 07-01-22 16:36:11, Yu Zhao wrote:
> > On Fri, Jan 07, 2022 at 02:11:29PM +0100, Michal Hocko wrote:
> > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > [...]
> > > > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
> > > > +{
> > > > +	struct mem_cgroup *memcg;
> > > > +	bool success = false;
> > > > +	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
> > > > +
> > > > +	VM_BUG_ON(!current_is_kswapd());
> > > > +
> > > > +	current->reclaim_state->mm_walk = &pgdat->mm_walk;
> > > > +
> > > > +	memcg = mem_cgroup_iter(NULL, NULL, NULL);
> > > > +	do {
> > > > +		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > > +
> > > > +		if (age_lruvec(lruvec, sc, min_ttl))
> > > > +			success = true;
> > > > +
> > > > +		cond_resched();
> > > > +	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> > > > +
> > > > +	if (!success && mutex_trylock(&oom_lock)) {
> > > > +		struct oom_control oc = {
> > > > +			.gfp_mask = sc->gfp_mask,
> > > > +			.order = sc->order,
> > > > +		};
> > > > +
> > > > +		if (!oom_reaping_in_progress())
> > > > +			out_of_memory(&oc);
> > > > +
> > > > +		mutex_unlock(&oom_lock);
> > > > +	}
> > > 
> > > Why do you need to trigger oom killer from this path? Why cannot you
> > > rely on the page allocator to do that like we do now?
> > 
> > This is per desktop users' (repeated) requests. The can't tolerate
> > thrashing as servers do because of UI lags; and they usually don't
> > have fancy tools like oomd.
> > 
> > Related discussions I saw:
> > https://github.com/zen-kernel/zen-kernel/issues/218
> > https://lore.kernel.org/lkml/20101028191523.GA14972@google.com/
> > https://lore.kernel.org/lkml/20211213051521.21f02dd2@mail.inbox.lv/
> > https://lore.kernel.org/lkml/54C2C89C.8080002@gmail.com/
> > https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com/
> 
> I do not really see any arguments why an userspace based trashing
> detection cannot be used for those. Could you clarify?

It definitely can be done. But who is going to do it for every distro
and all individual users? AFAIK, not a single distro provides such a
solution for desktop/laptop/phone users.

And there is also the theoretical question of how reliable a userspace
solution can be. What if this userspace solution itself gets stuck in
the direct reclaim path? I'm not sure anybody has done the research to
prove or debunk that.

In addition, exactly what PSI values should be used on different
models of consumer electronics? Nobody knows. We have a team working
on this and we haven't figured it out for all our Chromebook models.

As Andrew said, "a blunt instrument like this would be useful".
https://lore.kernel.org/lkml/20211202135824.33d2421bf5116801cfa2040d@linux-foundation.org/

I'd like to have less code in the kernel too, but I've learned never to
walk over users. If I remove this and they come after me asking why,
I'd have a hard time convincing them.

> Also my question was pointing to why out_of_memory is called from the
> reclaim rather than the allocator (memcg charging path). It is the
> caller of the reclaim to control different reclaim strategies and tell
> when all the hopes are lost and the oom killer should be invoked. This
> allows for a different policies at the allocator level and this change
> will break that AFAICS. E.g. what if the underlying allocation context
> is __GFP_NORETRY?

This is called in kswapd only, and by default (min_ttl=0) it doesn't
do anything. So __GFP_NORETRY doesn't apply. The question would be
more along the lines of long-term ABI support.

And I'll add the following comments, if you think we can keep this
logic:
   OOM kill if every generation from all memcgs is younger than min_ttl.
   Another theoretical possibility is that all memcgs are either below
   min or ineligible at priority 0, but this isn't the main goal.

(Please read my reply at the bottom to decide whether we should keep
 it or not. Thanks.)

> > >From patch 8:
> >   Personal computers
> >   ------------------
> >   :Thrashing prevention: Write ``N`` to
> >    ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> >    ``N`` milliseconds from getting evicted. The OOM killer is invoked if
> >    this working set can't be kept in memory. Based on the average human
> >    detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
> >    lags due to thrashing. Larger values like ``N=3000`` make lags less
> >    noticeable at the cost of more OOM kills.
> 
> This is a very good example of something that should be a self contained
> patch with its own justification.

Consider it done.

> TBH it is really not all that clear to
> me that we want to provide any user visible knob to control OOM behavior
> based on a time based QoS.

Agreed, and it didn't exist until v4, i.e., I only added it after being
asked to provide it several times.

For example:
https://github.com/zen-kernel/zen-kernel/issues/223

And another example:
   Your Multigenerational LRU patchset is pretty complex and
   effective, but does not eliminate thrashing condition fully on an
   old PCs with slow HDD.

   I'm kindly asking you to cooperate with hakavlad if it's possible
   and maybe re-implement parts of le9 patch in your patchset wherever
   acceptable, as they are quite similar in the core concept.

This is an excerpt of an email from iam@valdikss.org.ru, and he has
posted demo videos in this discussion:
https://lore.kernel.org/lkml/2dc51fc8-f14e-17ed-a8c6-0ec70423bf54@valdikss.org.ru/
Michal Hocko Jan. 11, 2022, 9 a.m. UTC | #22
On Mon 10-01-22 18:18:55, Yu Zhao wrote:
> On Mon, Jan 10, 2022 at 04:35:46PM +0100, Michal Hocko wrote:
> > On Fri 07-01-22 16:36:11, Yu Zhao wrote:
> > > On Fri, Jan 07, 2022 at 02:11:29PM +0100, Michal Hocko wrote:
> > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > > [...]
> > > > > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
> > > > > +{
> > > > > +	struct mem_cgroup *memcg;
> > > > > +	bool success = false;
> > > > > +	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
> > > > > +
> > > > > +	VM_BUG_ON(!current_is_kswapd());
> > > > > +
> > > > > +	current->reclaim_state->mm_walk = &pgdat->mm_walk;
> > > > > +
> > > > > +	memcg = mem_cgroup_iter(NULL, NULL, NULL);
> > > > > +	do {
> > > > > +		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > > > +
> > > > > +		if (age_lruvec(lruvec, sc, min_ttl))
> > > > > +			success = true;
> > > > > +
> > > > > +		cond_resched();
> > > > > +	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> > > > > +
> > > > > +	if (!success && mutex_trylock(&oom_lock)) {
> > > > > +		struct oom_control oc = {
> > > > > +			.gfp_mask = sc->gfp_mask,
> > > > > +			.order = sc->order,
> > > > > +		};
> > > > > +
> > > > > +		if (!oom_reaping_in_progress())
> > > > > +			out_of_memory(&oc);
> > > > > +
> > > > > +		mutex_unlock(&oom_lock);
> > > > > +	}
> > > > 
> > > > Why do you need to trigger oom killer from this path? Why cannot you
> > > > rely on the page allocator to do that like we do now?
> > > 
> > > This is per desktop users' (repeated) requests. They can't tolerate
> > > thrashing as servers do because of UI lags; and they usually don't
> > > have fancy tools like oomd.
> > > 
> > > Related discussions I saw:
> > > https://github.com/zen-kernel/zen-kernel/issues/218
> > > https://lore.kernel.org/lkml/20101028191523.GA14972@google.com/
> > > https://lore.kernel.org/lkml/20211213051521.21f02dd2@mail.inbox.lv/
> > > https://lore.kernel.org/lkml/54C2C89C.8080002@gmail.com/
> > > https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com/
> > 
> > I do not really see any arguments why an userspace based trashing
> > detection cannot be used for those. Could you clarify?
> 
> It definitely can be done. But who is going to do it for every distro
> and all individual users? AFAIK, not a single distro provides such a
> solution for desktop/laptop/phone users.

If existing interfaces provide sufficient information to make those
calls then I would definitely prefer a userspace solution.

> And also there is the theoretical question of how reliable a userspace
> solution can be. What if this userspace solution itself gets stuck in
> the direct reclaim path? I'm not sure whether anybody has done the
> research to prove or debunk it.

I have to confess I haven't checked oomd or other solutions, but with
sufficient care (all the code mlocked in + no allocations done while
collecting data) I believe this should be achievable.

> In addition, what exactly PSI values should be used on different
> models of consumer electronics? Nobody knows. We have a team working
> on this and we haven't figured it out for all our Chromebook models.

I believe this is a matter of tuning for a specific deployment. We have
not only PSI but also refault counters that can be used.

> As Andrew said, "a blunt instrument like this would be useful".
> https://lore.kernel.org/lkml/20211202135824.33d2421bf5116801cfa2040d@linux-foundation.org/
> 
> I'd like to have less code in kernel too, but I've learned never to
> walk over users. If I remove this and they come after me asking why,
> I'd have a hard time convincing them.
> 
> > Also my question was pointing to why out_of_memory is called from the
> > reclaim rather than the allocator (memcg charging path). It is the
> > caller of the reclaim to control different reclaim strategies and tell
> > when all the hopes are lost and the oom killer should be invoked. This
> > allows for a different policies at the allocator level and this change
> > will break that AFAICS. E.g. what if the underlying allocation context
> > is __GFP_NORETRY?
> 
> This is called in kswapd only, and by default (min_ttl=0) it doesn't
> do anything. So __GFP_NORETRY doesn't apply.

My bad. I must have got lost when traversing the code, but I can see you
are enforcing that with a VM_BUG_ON. So the limited-scope reclaim is
indeed not a problem.

> The question would be
> more along the lines of long-term ABI support.
> 
> And I'll add the following comments, if you think we can keep this
> logic:
>    OOM kill if every generation from all memcgs is younger than min_ttl.
>    Another theoretical possibility is all memcgs are either below min or
>    ineligible at priority 0, but this isn't the main goal.
> 
> (Please read my reply at the bottom to decide whether we should keep
>  it or not. Thanks.)
> 
> > > From patch 8:
> > >   Personal computers
> > >   ------------------
> > >   :Thrashing prevention: Write ``N`` to
> > >    ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> > >    ``N`` milliseconds from getting evicted. The OOM killer is invoked if
> > >    this working set can't be kept in memory. Based on the average human
> > >    detectable lag (~100ms), ``N=1000`` usually eliminates intolerable
> > >    lags due to thrashing. Larger values like ``N=3000`` make lags less
> > >    noticeable at the cost of more OOM kills.
> > 
> > This is a very good example of something that should be a self contained
> > patch with its own justification.
> 
> Consider it done.
> 
> > TBH it is really not all that clear to
> > me that we want to provide any user visible knob to control OOM behavior
> > based on a time based QoS.
> 
> Agreed, and it didn't exist until v4, i.e., I only added it after being
> asked to provide it several times.
> 
> For example:
> https://github.com/zen-kernel/zen-kernel/issues/223
> 
> And another example:
>    Your Multigenerational LRU patchset is pretty complex and
>    effective, but does not eliminate thrashing condition fully on an
>    old PCs with slow HDD.
> 
>    I'm kindly asking you to cooperate with hakavlad if it's possible
>    and maybe re-implement parts of le9 patch in your patchset wherever
>    acceptable, as they are quite similar in the core concept.
> 
> This is excerpt of an email from iam@valdikss.org.ru, and he has
> posted demo videos in this discussion:
> https://lore.kernel.org/lkml/2dc51fc8-f14e-17ed-a8c6-0ec70423bf54@valdikss.org.ru/

That is all interesting feedback, but we should be really careful about
ABI constraints and the future maintainability of the knob. I still stand
behind my statement that the kernel should implement such features only if
it is clear that we cannot really implement similar logic in
userspace.
Michal Hocko Jan. 11, 2022, 12:15 p.m. UTC | #23
On Tue 11-01-22 11:21:48, Alexey Avramov wrote:
> > I do not really see any arguments why an userspace based trashing
> > detection cannot be used for those.
> 
> Firstly,
> because this is the task of the kernel, not of userspace.
> Memory is managed by the kernel, not by userspace.
> The absence of such a mechanism in the kernel is a fundamental problem.
> The userspace tools are ugly hacks:
> some of them consume a lot of CPU [1],
> some of them consume a lot of memory [2],
> some of them cannot use process_mrelease() (earlyoom, nohang),
> some of them kill only the whole cgroup (systemd-oomd, oomd) [3],
> and some depend on systemd and cgroup_v2 (oomd, systemd-oomd).

Thanks for those links. I read through them and my understanding is that
most of those issues are very specific to the tool used and nothing is
really fundamental due to a lack of kernel support.

> One of the biggest challenges for userspace OOM killers is that they
> potentially have to function under intense memory pressure and are prone
> to getting stuck in memory reclaim themselves [4].

This one is more interesting, and the truth is that handling the complete
OOM situation from userspace is really tricky, especially with
a more complex OOM decision policy. In the past we have discussed
potential ways to implement an OOM kill policy via kernel modules or eBPF,
but nobody followed up on that.

But I suspect you are mixing up two things here. One of them is the
out-of-memory situation where no memory can be reclaimed or allocated.

The other is one where memory can be reclaimed and progress is made,
but that leads to thrashing when most of the time is spent
refaulting memory that was reclaimed shortly before.

The first one is addressed by the global OOM killer, which tries to
be as conservative as possible because this is a very
disruptive operation. But the latter one is more complex, and proper
handling really depends on the particular workload
because it is more of a QoS matter than an emergency action to keep
the system alive.

There are workloads which prefer temporary thrashing of their working
set during a peak memory demand over an OOM kill, because way too
much work would be thrown away. On the other side, workloads that are
latency sensitive can see even direct reclaim as a runtime-visible
problem.

I hope you can imagine there is a really large gap between those
two cases and no simple solution can be applied to the whole
range. Therefore we have PSI and refault stats exported to userspace
so that a workload-specific policy can be implemented there.
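
For illustration, a minimal userspace monitor along those lines, built on
the documented PSI trigger interface (the 150ms-per-1s threshold and the
action are arbitrary examples, not recommendations):

/* sketch: PSI trigger monitor; threshold values are arbitrary */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct pollfd fds;
	/* wake up when >=150ms of "some" memory stall accumulates per 1s window */
	const char trig[] = "some 150000 1000000";

	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0)
		return 1;
	if (write(fds.fd, trig, strlen(trig) + 1) < 0)
		return 1;
	fds.events = POLLPRI;

	for (;;) {
		if (poll(&fds, 1, -1) < 0)
			return 1;
		if (fds.revents & POLLERR)
			return 1;
		if (fds.revents & POLLPRI)
			printf("memory pressure threshold crossed\n");
	}
}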

If userspace has a hard time using that data and acting upon it, then let's
talk about specifics. For the most steady thrashing situations I have
seen, userspace with its memory and code mlocked can make forward
progress and mediate the situation.

[...]

> [1] https://github.com/facebookincubator/oomd/issues/79
> [2] https://github.com/hakavlad/nohang#memory-and-cpu-usage
> [3] https://github.com/facebookincubator/oomd/issues/125
> [4] https://lore.kernel.org/all/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz+u3bFFyOLg@mail.gmail.com/
> [5] https://github.com/zen-kernel/zen-kernel/issues/223
> [6] https://raw.githubusercontent.com/hakavlad/cache-tests/main/mg-LRU-v3_vs_classic-LRU/3-firefox-tail-OOM/mg-LRU-1/psi2
> [7] https://lore.kernel.org/linux-mm/20211202150614.22440-1-mgorman@techsingularity.net/
Alexey Avramov Jan. 11, 2022, 2:22 p.m. UTC | #24
> I do not really see any arguments why an userspace based trashing
> detection cannot be used for those.

Firstly,
because this is the task of the kernel, not of userspace.
Memory is managed by the kernel, not by userspace.
The absence of such a mechanism in the kernel is a fundamental problem.
The userspace tools are ugly hacks:
some of them consume a lot of CPU [1],
some of them consume a lot of memory [2],
some of them cannot use process_mrelease() (earlyoom, nohang),
some of them kill only the whole cgroup (systemd-oomd, oomd) [3],
and some depend on systemd and cgroup_v2 (oomd, systemd-oomd).
One of the biggest challenges for userspace OOM killers is that they
potentially have to function under intense memory pressure and are prone
to getting stuck in memory reclaim themselves [4].

It is strange that after decades of user complaints about thrashing and
a non-working OOM killer I have to explain the obvious things.
The basic mechanism must be implemented in the kernel.
Stop shifting responsibility to userspace!

Secondly,
the real reason for the min_ttl_ms mechanism is that without it,
multi-minute stalls are possible [5] even when the killer is expected to
arrive, and memory pressure stays close to 100 during this period [6].
This fixes a bug that does not exist in the mainline LRU (it is an
MGLRU-specific bug). BTW, similar symptoms were recently fixed in the
mainline [7].

[1] https://github.com/facebookincubator/oomd/issues/79
[2] https://github.com/hakavlad/nohang#memory-and-cpu-usage
[3] https://github.com/facebookincubator/oomd/issues/125
[4] https://lore.kernel.org/all/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz+u3bFFyOLg@mail.gmail.com/
[5] https://github.com/zen-kernel/zen-kernel/issues/223
[6] https://raw.githubusercontent.com/hakavlad/cache-tests/main/mg-LRU-v3_vs_classic-LRU/3-firefox-tail-OOM/mg-LRU-1/psi2
[7] https://lore.kernel.org/linux-mm/20211202150614.22440-1-mgorman@techsingularity.net/

[I am duplicating a previous message here - it was not delivered to mailing lists]
Yu Zhao Jan. 11, 2022, 11:16 p.m. UTC | #25
On Mon, Jan 10, 2022 at 04:01:13PM +0100, Michal Hocko wrote:
> On Thu 06-01-22 17:12:18, Michal Hocko wrote:
> > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> > > +{
> > > +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > > +		return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> 
> One thing I have overlooked completely.

I appreciate your attention to detail, but GFP_KERNEL is legit in the
reclaim path. It's been used for many years in our production, e.g.,
  page reclaim
    swap_writepage()
      frontswap_store()
        zswap_frontswap_store()
          zswap_entry_cache_alloc(GFP_KERNEL)

(And I always test my changes with lockdep, kasan, DEBUG_VM, etc., no
 warnings ever seen from using GFP_KERNEL in the reclaim path.)

> You cannot really use GFP_KERNEL
> allocation here because the reclaim context can be constrained (e.g.
> GFP_NOFS). This allocation will not do any reclaim as it is PF_MEMALLOC
> but I suspect that the lockdep will complain anyway.
> 
> Also kvmalloc is not really great here. a) vmalloc path is never
> executed for small objects and b) we do not really want to make a
> dependency between vmalloc and the reclaim (by vmalloc -> reclaim ->
> vmalloc).
> 
> Even if we rule out vmalloc and look at kmalloc alone. Is this really
> safe? I do not see any recursion prevention in the SL.B code. Maybe this
> just happens to work but the dependency should be really documented so
> that future SL.B changes won't break the whole scheme. 

Affirmative, as Vlastimil has clarified.

Thanks!
Yu Zhao Jan. 12, 2022, 1:01 a.m. UTC | #26
On Mon, Jan 10, 2022 at 05:57:39PM +0100, Michal Hocko wrote:
> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> [...]
> > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > +{
> > +	static const struct mm_walk_ops mm_walk_ops = {
> > +		.test_walk = should_skip_vma,
> > +		.p4d_entry = walk_pud_range,
> > +	};
> > +
> > +	int err;
> > +#ifdef CONFIG_MEMCG
> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +#endif
> > +
> > +	walk->next_addr = FIRST_USER_ADDRESS;
> > +
> > +	do {
> > +		unsigned long start = walk->next_addr;
> > +		unsigned long end = mm->highest_vm_end;
> > +
> > +		err = -EBUSY;
> > +
> > +		rcu_read_lock();
> > +#ifdef CONFIG_MEMCG
> > +		if (memcg && atomic_read(&memcg->moving_account))
> > +			goto contended;
> > +#endif
> 
> Why do you need to check for moving_account?

This check, if it succeeds, blocks memcg migration.

Our goal is to move pages between different generations of the same
lruvec (the first arg). Meanwhile, pages can also be migrated between
different memcgs (different lruvecs).

The active/inactive lru uses isolation to block memcg migration.

Generations account pages similarly to the active/inactive lru, i.e.,
each generation has an nr_pages counter. However, unlike the active/
inactive lru, a page can be moved to a different generation without
getting isolated or even without being under the lru lock, as long as
the delta is eventually accounted for (which does require the lru lock
when it happens).

The generation counter in page->flags (folio->flags to be precise)
stores 0 when a page is isolated, to synchronize with isolation.
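
To make the encoding concrete, a minimal sketch of the decode (mirroring
what folio_update_gen() in this patch does; the function name here is
made up):

/*
 * Sketch only: the gen counter is stored as gen+1 in folio->flags, so 0
 * means the folio is isolated (not on any generation's list) and -1 is
 * returned in that case.
 */
static int sketch_folio_lru_gen(struct folio *folio)
{
	unsigned long flags = READ_ONCE(folio->flags);

	return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
}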
Michal Hocko Jan. 12, 2022, 10:17 a.m. UTC | #27
On Tue 11-01-22 18:01:29, Yu Zhao wrote:
> On Mon, Jan 10, 2022 at 05:57:39PM +0100, Michal Hocko wrote:
> > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > [...]
> > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > +{
> > > +	static const struct mm_walk_ops mm_walk_ops = {
> > > +		.test_walk = should_skip_vma,
> > > +		.p4d_entry = walk_pud_range,
> > > +	};
> > > +
> > > +	int err;
> > > +#ifdef CONFIG_MEMCG
> > > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > +#endif
> > > +
> > > +	walk->next_addr = FIRST_USER_ADDRESS;
> > > +
> > > +	do {
> > > +		unsigned long start = walk->next_addr;
> > > +		unsigned long end = mm->highest_vm_end;
> > > +
> > > +		err = -EBUSY;
> > > +
> > > +		rcu_read_lock();
> > > +#ifdef CONFIG_MEMCG
> > > +		if (memcg && atomic_read(&memcg->moving_account))
> > > +			goto contended;
> > > +#endif
> > 
> > Why do you need to check for moving_account?
> 
> This check, if succeeds, blocks memcg migration.

OK, I can see that you rely on the RCU here for the synchronization. A
comment which mentions mem_cgroup_move_charge would be helpful for
clarity. Is there any reason you are not using folio_memcg_lock in the
pte walk instead?
Michal Hocko Jan. 12, 2022, 10:28 a.m. UTC | #28
On Tue 11-01-22 16:16:57, Yu Zhao wrote:
> On Mon, Jan 10, 2022 at 04:01:13PM +0100, Michal Hocko wrote:
> > On Thu 06-01-22 17:12:18, Michal Hocko wrote:
> > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> > > > +{
> > > > +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > > > +		return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> > 
> > One thing I have overlooked completely.
> 
> I appreciate your attention to details but GFP_KERNEL is legit in the
> reclaim path. It's been used many years in our production, e.g.,
>   page reclaim
>     swap_writepage()
>       frontswap_store()
>         zswap_frontswap_store()
>           zswap_entry_cache_alloc(GFP_KERNEL)
> 
> (And I always test my changes with lockdep, kasan, DEBUG_VM, etc., no
>  warnings ever seen from using GFP_KERNEL in the reclaim path.)

OK, I can see it now. __need_reclaim will check for PF_MEMALLOC and skip
the fs_reclaim tracking.

I still maintain I am not really happy about allocations from the direct
reclaim context (nor in the zswap example). I would really recommend
using a pre-allocated pool of objects.

If there are strong reasons for not doing so, then at least change that
to kzalloc.

Thanks!
Yu Zhao Jan. 12, 2022, 11:43 p.m. UTC | #29
On Wed, Jan 12, 2022 at 11:17:53AM +0100, Michal Hocko wrote:
> On Tue 11-01-22 18:01:29, Yu Zhao wrote:
> > On Mon, Jan 10, 2022 at 05:57:39PM +0100, Michal Hocko wrote:
> > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > [...]
> > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > > +{
> > > > +	static const struct mm_walk_ops mm_walk_ops = {
> > > > +		.test_walk = should_skip_vma,
> > > > +		.p4d_entry = walk_pud_range,
> > > > +	};
> > > > +
> > > > +	int err;
> > > > +#ifdef CONFIG_MEMCG
> > > > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > > +#endif
> > > > +
> > > > +	walk->next_addr = FIRST_USER_ADDRESS;
> > > > +
> > > > +	do {
> > > > +		unsigned long start = walk->next_addr;
> > > > +		unsigned long end = mm->highest_vm_end;
> > > > +
> > > > +		err = -EBUSY;
> > > > +
> > > > +		rcu_read_lock();
> > > > +#ifdef CONFIG_MEMCG
> > > > +		if (memcg && atomic_read(&memcg->moving_account))
> > > > +			goto contended;
> > > > +#endif
> > > 
> > > Why do you need to check for moving_account?
> > 
> > This check, if succeeds, blocks memcg migration.
> 
> OK, I can see that you rely on the RCU here for the synchronization. A
> comment which mentions mem_cgroup_move_charge would be helpful for
> clarity.

Will do

> Is there any reason you are not using folio_memcg_lock in the
> pte walk instead?

We have a particular lruvec (the first arg), hence a particular memcg
to lock. But we don't have a particular page to lock.
Yu Zhao Jan. 13, 2022, 9:25 a.m. UTC | #30
On Wed, Jan 12, 2022 at 11:28:57AM +0100, Michal Hocko wrote:
> On Tue 11-01-22 16:16:57, Yu Zhao wrote:
> > On Mon, Jan 10, 2022 at 04:01:13PM +0100, Michal Hocko wrote:
> > > On Thu 06-01-22 17:12:18, Michal Hocko wrote:
> > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > > > +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> > > > > +{
> > > > > +	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
> > > > > +		return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
> > > 
> > > One thing I have overlooked completely.
> > 
> > I appreciate your attention to details but GFP_KERNEL is legit in the
> > reclaim path. It's been used many years in our production, e.g.,
> >   page reclaim
> >     swap_writepage()
> >       frontswap_store()
> >         zswap_frontswap_store()
> >           zswap_entry_cache_alloc(GFP_KERNEL)
> > 
> > (And I always test my changes with lockdep, kasan, DEBUG_VM, etc., no
> >  warnings ever seen from using GFP_KERNEL in the reclaim path.)
> 
> OK, I can see it now. __need_reclaim will check for PF_MEMALLOC and skip
> the fs_reclaim tracking.
> 
> I still maintain I am not really happy about (nor in the zswap example)
> allocations from the direct reclaim context. I would really recommend
> using a pre-allocated pool of objects.

Not trying to argue anything -- there are many other places in the
reclaim path that must allocate memory to make progress, e.g.,

  add_to_swap_cache()
    xas_nomem()

  __swap_writepage()
    bio_alloc()

The only way to not allocate memory is to drop clean pages. Writing dirty
pages (not swap) might require allocations as well. (But we only write
dirty pages in kswapd, not in the direct reclaim path.)

> If there are strong reasons for not doing so, then at least change that
> to kzalloc.

Consider it done.
Yu Zhao Jan. 13, 2022, 9:43 a.m. UTC | #31
On Mon, Jan 10, 2022 at 03:37:28PM +0100, Michal Hocko wrote:
> On Sun 09-01-22 20:58:02, Yu Zhao wrote:
> > On Fri, Jan 07, 2022 at 10:00:31AM +0100, Michal Hocko wrote:
> > > On Fri 07-01-22 09:55:09, Michal Hocko wrote:
> > > [...]
> > > > > In this case, lru_gen_mm_walk is small (160 bytes); it's per direct
> > > > > reclaimer; and direct reclaimers rarely come here, i.e., only when
> > > > > kswapd can't keep up in terms of the aging, which is similar to the
> > > > > condition where the inactive list is empty for the active/inactive
> > > > > lru.
> > > > 
> > > > Well, this is not a strong argument to be honest. Kswapd being stuck
> > > > and the majority of the reclaim being done in the direct reclaim
> > > > context is a situation I have seen many many times.
> > > 
> > > Also do not forget that memcg reclaim is effectivelly only direct
> > > reclaim. Not that the memcg reclaim indicates a global memory shortage
> > > but it can add up and race with the global reclaim as well.
> > 
> > I don't dispute any of the above, and I probably don't like this code
> > more than you do.
> > 
> > But let's not forget the purposes of PF_MEMALLOC, besides preventing
> > recursive reclaims, include letting reclaim dip into reserves so that
> > it can make more free memory. So I think it's acceptable if the
> > following conditions are met:
> > 1. The allocation size is small.
> > 2. The number of allocations is bounded.
> > 3. Its failure doesn't stall reclaim.
> > And it'd be nice if
> > 4. The allocation happens rarely, e.g., slow path only.
> 
> I would add 
>   0. The allocation should be done only if absolutely _necessary_.
> 
> Please keep in mind that whatever you allocate from that context will be
> consuming a very precious memory reserves which are shared with other
> components of the system. Even worse these can go all the way to
> depleting memory completely where other things can fall apart.

I agree but I also see a distinction:
   1,2,3 are objective;
   0,4 are subjective.

For some users, page reclaim itself might not be absolutely necessary
because they are okay with OOM kills. But for others, the situation
could be reversed.

> > The code in question meets all of them.
> > 
> > 1. This allocation is 160 bytes.
> > 2. It's bounded by the number of page table walkers which, in the
> >    worst, is same as the number of mm_struct's.
> > 3. Most importantly, its failure doesn't stall the aging. The aging
> >    will fallback to the rmap-based function lru_gen_look_around().
> >    But this function only gathers the accessed bit from at most 64
> >    PTEs, meaning it's less efficient (retains ~80% performance gains).
> > 4. This allocation is rare, i.e., only when the aging is required,
> >    which is similar to the low inactive case for the active/inactive
> >    lru.
> 
> I think this fallback behavior deserves much more detailed explanation
> in changelogs.

Will do.

> > The bottom line is I can try various optimizations, e.g., preallocate
> > a few buffers for a limited number of page walkers and if this number
> > has been reached, fallback to the rmap-based function. But I have yet
> > to see evidence that calls for additional complexity.
> 
> I would disagree here. This is not an optimization. You should be
> avoiding allocations from the memory reclaim because any allocation just
> add a runtime behavior complexity and potential corner cases.

Would __GFP_NOMEMALLOC address your concern? It prevents allocations
from accessing the reserves even under PF_MEMALLOC.
Michal Hocko Jan. 13, 2022, 11:57 a.m. UTC | #32
On Wed 12-01-22 16:43:15, Yu Zhao wrote:
> On Wed, Jan 12, 2022 at 11:17:53AM +0100, Michal Hocko wrote:
[...]
> > Is there any reason you are not using folio_memcg_lock in the
> > pte walk instead?
> 
> We have a particular lruvec (the first arg), hence a particular memcg
> to lock. But we don't have a particular page to lock.

That is certainly true at this layer but the locking should be needed
only for specific pages, no? So you can move the lock down to the
callback which examines respective pages. Or is there anything
preventing that?

To be honest, and that is the reason I am asking, I really do not like
to open-code the migration synchronization outside of memcg proper.
Code paths which need a stable memcg are supposed to be using
folio_memcg_lock for the duration of the specific examination. If you
prefer a trylock approach for this use case then we can add one.
Michal Hocko Jan. 13, 2022, 12:02 p.m. UTC | #33
On Thu 13-01-22 02:43:38, Yu Zhao wrote:
[...]
> > > The bottom line is I can try various optimizations, e.g., preallocate
> > > a few buffers for a limited number of page walkers and if this number
> > > has been reached, fallback to the rmap-based function. But I have yet
> > > to see evidence that calls for additional complexity.
> > 
> > I would disagree here. This is not an optimization. You should be
> > avoiding allocations from the memory reclaim because any allocation just
> > add a runtime behavior complexity and potential corner cases.
> 
> Would __GFP_NOMEMALLOC address your concern? It prevents allocations
> from accessing the reserves even under PF_MEMALLOC.

__GFP_NOMEMALLOC would deal with the complete memory depletion concern
for sure but I am not sure how any of these allocations would succeed
when called from the direct reclaim. Some access to memory reserves is
necessary if you insist on allocating from the reclaim process.

You can have a look at the limited memory reserves access by oom victims
for an example of how this can be done.
Alexey Avramov Jan. 13, 2022, 5 p.m. UTC | #34
> But the later one is more complex and a proper
> handling really depends on the particular workload

That is why I advocate the introduction of new tunables.

> There are workloads which prefer a temporary trashing over its working
> set during a peak memory demand rather than an OOM kill

OK, for such cases, the OOM handles can be set to 0.
It can even be the default value.

> On the other side workloads that are
> latency sensitive

I daresay that this is the case with most workloads.
An internet server that falls into thrashing is a dead server.

> no simple solution can be applied to the whole

There are several solutions, and they can all be taken into the kernel
at the same time; they all work:
- min_ttl_ms + MGLRU
- vm.min_filelist_kbytes-like knobs
- PSI-based solutions.

> For the most steady trashing situations I have
> seen the userspace with mlocked memory and the code can make a forward
> progress and mediate the situation.

I still don't see a problem with implementing all of these solutions
in kernel space.
Yu Zhao Jan. 19, 2022, 6:31 a.m. UTC | #35
On Thu, Jan 13, 2022 at 01:02:26PM +0100, Michal Hocko wrote:
> On Thu 13-01-22 02:43:38, Yu Zhao wrote:
> [...]
> > > > The bottom line is I can try various optimizations, e.g., preallocate
> > > > a few buffers for a limited number of page walkers and if this number
> > > > has been reached, fallback to the rmap-based function. But I have yet
> > > > to see evidence that calls for additional complexity.
> > > 
> > > I would disagree here. This is not an optimization. You should be
> > > avoiding allocations from the memory reclaim because any allocation just
> > > add a runtime behavior complexity and potential corner cases.
> > 
> > Would __GFP_NOMEMALLOC address your concern? It prevents allocations
> > from accessing the reserves even under PF_MEMALLOC.
> 
> __GFP_NOMEMALLOC would deal with the complete memory depletion concern
> for sure but I am not sure how any of these allocations would succeed
> when called from the direct reclaim. Some access to memory reserves is
> necessary if you insist on allocating from the reclaim process.
> 
> You can have a look at the limited memory reserves access by oom victims
> for an example of how this can be done.

Thanks. I'll change GFP_KERNEL to __GFP_HIGH | __GFP_NOMEMALLOC.
__GFP_HIGH allows some access to memory reserves and __GFP_NOMEMALLOC
prevents complete depletion. Basically the combination lowers the
min watermark by 1/2, and we have been using them for
add_to_swap_cache().
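
A sketch of what the resulting allocation could look like (the branch that
returns the preallocated kswapd buffer is assumed from context and not
shown in the quoted hunk):

/* sketch only: flags as discussed above; the else branch is assumed */
static struct lru_gen_mm_walk *alloc_mm_walk(void)
{
	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
		return kzalloc(sizeof(struct lru_gen_mm_walk),
			       __GFP_HIGH | __GFP_NOMEMALLOC);

	return current->reclaim_state->mm_walk;
}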
Yu Zhao Jan. 19, 2022, 7:04 a.m. UTC | #36
On Mon, Jan 10, 2022 at 11:54:42AM +0100, Michal Hocko wrote:
> On Sun 09-01-22 21:47:57, Yu Zhao wrote:
> > On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote:
> > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > [...]
> > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > > +{
> > > > +	static const struct mm_walk_ops mm_walk_ops = {
> > > > +		.test_walk = should_skip_vma,
> > > > +		.p4d_entry = walk_pud_range,
> > > > +	};
> > > > +
> > > > +	int err;
> > > > +#ifdef CONFIG_MEMCG
> > > > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > > +#endif
> > > > +
> > > > +	walk->next_addr = FIRST_USER_ADDRESS;
> > > > +
> > > > +	do {
> > > > +		unsigned long start = walk->next_addr;
> > > > +		unsigned long end = mm->highest_vm_end;
> > > > +
> > > > +		err = -EBUSY;
> > > > +
> > > > +		rcu_read_lock();
> > > > +#ifdef CONFIG_MEMCG
> > > > +		if (memcg && atomic_read(&memcg->moving_account))
> > > > +			goto contended;
> > > > +#endif
> > > > +		if (!mmap_read_trylock(mm))
> > > > +			goto contended;
> > > 
> > > Have you evaluated the behavior under mmap_sem contention? I mean what
> > > would be an effect of some mms being excluded from the walk? This path
> > > is called from direct reclaim and we do allocate with exclusive mmap_sem
> > > IIRC and the trylock can fail in a presence of pending writer if I am
> > > not mistaken so even the read lock holder (e.g. an allocation from the #PF)
> > > can bypass the walk.
> > 
> > You are right. Here it must be a trylock; otherwise it can deadlock.
> 
> Yeah, this is clear.
> 
> > I think there might be a misunderstanding: the aging doesn't
> > exclusively rely on page table walks to gather the accessed bit. It
> > prefers page table walks but it can also fallback to the rmap-based
> > function, i.e., lru_gen_look_around(), which only gathers the accessed
> > bit from at most 64 PTEs and therefore is less efficient. But it still
> > retains about 80% of the performance gains.
> 
> I have to say that I really have hard time to understand the runtime
> behavior depending on that interaction. How does the reclaim behave when
> the virtual scan is enabled, partially enabled and almost completely
> disabled due to different constrains? I do not see any such an
> evaluation described in changelogs and I consider this to be a rather
> important information to judge the overall behavior.

It doesn't have (partially) enabled/disabled states nor does its
behavior change with different reclaim constraints. Having either
would make its design too complex to implement or benchmark.

There is a feedback loop connecting page table walks and rmap walks via
Bloom filters. The Bloom filters hold dense hot areas. Page table walks
test whether virtual areas are in the Bloom filters and scan those that
test positive. Anything they miss will be caught by rmap walks
later (shrink_page_list()). And when rmap walks find new dense hot
areas, they add those areas to the Bloom filters.

A dense hot area is one that has many accessed pages belonging to the
reclaim domain; clearing the accessed bit in all PTEs within such an
area with one page table walk is more efficient, in terms of cacheline
utilization, than doing it one by one with many rmap walks.
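
For illustration, a minimal Bloom filter keyed on (mm, PMD-aligned address),
roughly in the spirit described above; all names and the filter size are
assumptions, not the patch's implementation:

/* sketch only: a two-key Bloom filter over PMD-aligned virtual areas */
#define SKETCH_BLOOM_SHIFT	15

static DECLARE_BITMAP(sketch_bloom, 1 << SKETCH_BLOOM_SHIFT);

static void sketch_bloom_keys(struct mm_struct *mm, unsigned long addr,
			      unsigned int *key1, unsigned int *key2)
{
	unsigned long hash = hash_long((unsigned long)mm + (addr & PMD_MASK),
				       2 * SKETCH_BLOOM_SHIFT);

	*key1 = hash & (BIT(SKETCH_BLOOM_SHIFT) - 1);
	*key2 = hash >> SKETCH_BLOOM_SHIFT;
}

/* a page table walk scans a PMD range only if both bits are set */
static bool sketch_bloom_test(struct mm_struct *mm, unsigned long addr)
{
	unsigned int key1, key2;

	sketch_bloom_keys(mm, addr, &key1, &key2);
	return test_bit(key1, sketch_bloom) && test_bit(key2, sketch_bloom);
}

/* an rmap walk records a dense hot area it has just found */
static void sketch_bloom_add(struct mm_struct *mm, unsigned long addr)
{
	unsigned int key1, key2;

	sketch_bloom_keys(mm, addr, &key1, &key2);
	set_bit(key1, sketch_bloom);
	set_bit(key2, sketch_bloom);
}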

> > > Or is this considered statistically insignificant thus a theoretical
> > > problem?
> > 
> > Yes. People who work on the maple tree and SPF at Google expressed the
> > same concern during the design review meeting (all stakeholders on the
> > mailing list were also invited). So we had a counter to monitor the
> > contention in previous versions, i.e., MM_LOCK_CONTENTION in v4 here:
> > https://lore.kernel.org/lkml/20210818063107.2696454-8-yuzhao@google.com/
> > 
> > And we also combined this patchset with the SPF patchset to see if the
> > latter makes any difference. Our conclusion was the contention is
> > statistically insignificant to the performance under memory pressure.
> > 
> > This can be explained by how often we create a new generation. (We
> > only walk page tables when we create a new generation. And it's
> > similar to the low inactive condition for the active/inactive lru.)
> > 
> > Usually we only do so every few seconds. We'd run into problems with
> > other parts of the kernel, e.g., lru lock contention, i/o congestion,
> > etc. if we create more than a few generation every second.
> 
> This would be a very good information to have in changelogs. Ideally
> with some numbers and analysis.

Will do. Thanks.
Michal Hocko Jan. 19, 2022, 9:42 a.m. UTC | #37
On Wed 19-01-22 00:04:10, Yu Zhao wrote:
> On Mon, Jan 10, 2022 at 11:54:42AM +0100, Michal Hocko wrote:
> > On Sun 09-01-22 21:47:57, Yu Zhao wrote:
> > > On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote:
> > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > > [...]
> > > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > > > +{
> > > > > +	static const struct mm_walk_ops mm_walk_ops = {
> > > > > +		.test_walk = should_skip_vma,
> > > > > +		.p4d_entry = walk_pud_range,
> > > > > +	};
> > > > > +
> > > > > +	int err;
> > > > > +#ifdef CONFIG_MEMCG
> > > > > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > > > +#endif
> > > > > +
> > > > > +	walk->next_addr = FIRST_USER_ADDRESS;
> > > > > +
> > > > > +	do {
> > > > > +		unsigned long start = walk->next_addr;
> > > > > +		unsigned long end = mm->highest_vm_end;
> > > > > +
> > > > > +		err = -EBUSY;
> > > > > +
> > > > > +		rcu_read_lock();
> > > > > +#ifdef CONFIG_MEMCG
> > > > > +		if (memcg && atomic_read(&memcg->moving_account))
> > > > > +			goto contended;
> > > > > +#endif
> > > > > +		if (!mmap_read_trylock(mm))
> > > > > +			goto contended;
> > > > 
> > > > Have you evaluated the behavior under mmap_sem contention? I mean what
> > > > would be an effect of some mms being excluded from the walk? This path
> > > > is called from direct reclaim and we do allocate with exclusive mmap_sem
> > > > IIRC and the trylock can fail in a presence of pending writer if I am
> > > > not mistaken so even the read lock holder (e.g. an allocation from the #PF)
> > > > can bypass the walk.
> > > 
> > > You are right. Here it must be a trylock; otherwise it can deadlock.
> > 
> > Yeah, this is clear.
> > 
> > > I think there might be a misunderstanding: the aging doesn't
> > > exclusively rely on page table walks to gather the accessed bit. It
> > > prefers page table walks but it can also fallback to the rmap-based
> > > function, i.e., lru_gen_look_around(), which only gathers the accessed
> > > bit from at most 64 PTEs and therefore is less efficient. But it still
> > > retains about 80% of the performance gains.
> > 
> > I have to say that I really have hard time to understand the runtime
> > behavior depending on that interaction. How does the reclaim behave when
> > the virtual scan is enabled, partially enabled and almost completely
> > disabled due to different constrains? I do not see any such an
> > evaluation described in changelogs and I consider this to be a rather
> > important information to judge the overall behavior.
> 
> It doesn't have (partially) enabled/disabled states nor does its
> behavior change with different reclaim constraints. Having either
> would make its design too complex to implement or benchmark.

Let me clarify. By "partially enabled" I really meant behavior depending
on runtime conditions. Say mmap_sem cannot be locked for half of the
scanned tasks and/or the allocation for the mm walker fails due to lack
of memory. How is this going to affect reclaim efficiency? How does a
user/admin know that memory reclaim is in a "degraded" mode because of
the contention?
Michal Hocko Jan. 19, 2022, 9:44 a.m. UTC | #38
On Tue 18-01-22 23:31:07, Yu Zhao wrote:
> On Thu, Jan 13, 2022 at 01:02:26PM +0100, Michal Hocko wrote:
> > On Thu 13-01-22 02:43:38, Yu Zhao wrote:
> > [...]
> > > > > The bottom line is I can try various optimizations, e.g., preallocate
> > > > > a few buffers for a limited number of page walkers and if this number
> > > > > has been reached, fallback to the rmap-based function. But I have yet
> > > > > to see evidence that calls for additional complexity.
> > > > 
> > > > I would disagree here. This is not an optimization. You should be
> > > > avoiding allocations from the memory reclaim because any allocation just
> > > > add a runtime behavior complexity and potential corner cases.
> > > 
> > > Would __GFP_NOMEMALLOC address your concern? It prevents allocations
> > > from accessing the reserves even under PF_MEMALLOC.
> > 
> > __GFP_NOMEMALLOC would deal with the complete memory depletion concern
> > for sure but I am not sure how any of these allocations would succeed
> > when called from the direct reclaim. Some access to memory reserves is
> > necessary if you insist on allocating from the reclaim process.
> > 
> > You can have a look at the limited memory reserves access by oom victims
> > for an example of how this can be done.
> 
> Thanks. I'll change GFP_KERNEL to __GFP_HIGH | __GFP_NOMEMALLOC.
> __GFP_HIGH allows some access to memory reserves and __GFP_NOMEMALLOC
> prevents the complete depletion. Basically the combination lower the
> min watermark by 1/2, and we have been using them for
> add_to_swap_cache().

Yes, this will prevent complete memory depletion. There are other
users of this portion of the memory reserves, so the reclaim might be out of
luck. How this turns out in practice remains to be seen, but it
certainly is an opportunity for corner cases and hard-to-test behavior.
Yu Zhao Jan. 23, 2022, 9:28 p.m. UTC | #39
On Wed, Jan 19, 2022 at 10:42:47AM +0100, Michal Hocko wrote:
> On Wed 19-01-22 00:04:10, Yu Zhao wrote:
> > On Mon, Jan 10, 2022 at 11:54:42AM +0100, Michal Hocko wrote:
> > > On Sun 09-01-22 21:47:57, Yu Zhao wrote:
> > > > On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote:
> > > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > > > [...]
> > > > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > > > > +{
> > > > > > +	static const struct mm_walk_ops mm_walk_ops = {
> > > > > > +		.test_walk = should_skip_vma,
> > > > > > +		.p4d_entry = walk_pud_range,
> > > > > > +	};
> > > > > > +
> > > > > > +	int err;
> > > > > > +#ifdef CONFIG_MEMCG
> > > > > > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > > > > +#endif
> > > > > > +
> > > > > > +	walk->next_addr = FIRST_USER_ADDRESS;
> > > > > > +
> > > > > > +	do {
> > > > > > +		unsigned long start = walk->next_addr;
> > > > > > +		unsigned long end = mm->highest_vm_end;
> > > > > > +
> > > > > > +		err = -EBUSY;
> > > > > > +
> > > > > > +		rcu_read_lock();
> > > > > > +#ifdef CONFIG_MEMCG
> > > > > > +		if (memcg && atomic_read(&memcg->moving_account))
> > > > > > +			goto contended;
> > > > > > +#endif
> > > > > > +		if (!mmap_read_trylock(mm))
> > > > > > +			goto contended;
> > > > > 
> > > > > Have you evaluated the behavior under mmap_sem contention? I mean what
> > > > > would be an effect of some mms being excluded from the walk? This path
> > > > > is called from direct reclaim and we do allocate with exclusive mmap_sem
> > > > > IIRC and the trylock can fail in a presence of pending writer if I am
> > > > > not mistaken so even the read lock holder (e.g. an allocation from the #PF)
> > > > > can bypass the walk.
> > > > 
> > > > You are right. Here it must be a trylock; otherwise it can deadlock.
> > > 
> > > Yeah, this is clear.
> > > 
> > > > I think there might be a misunderstanding: the aging doesn't
> > > > exclusively rely on page table walks to gather the accessed bit. It
> > > > prefers page table walks but it can also fallback to the rmap-based
> > > > function, i.e., lru_gen_look_around(), which only gathers the accessed
> > > > bit from at most 64 PTEs and therefore is less efficient. But it still
> > > > retains about 80% of the performance gains.
> > > 
> > > I have to say that I really have hard time to understand the runtime
> > > behavior depending on that interaction. How does the reclaim behave when
> > > the virtual scan is enabled, partially enabled and almost completely
> > > disabled due to different constrains? I do not see any such an
> > > evaluation described in changelogs and I consider this to be a rather
> > > important information to judge the overall behavior.
> > 
> > It doesn't have (partially) enabled/disabled states nor does its
> > behavior change with different reclaim constraints. Having either
> > would make its design too complex to implement or benchmark.
> 
> Let me clarify. By "partially enabled" I really meant behavior depending
> on runtime conditions. Say mmap_sem cannot be locked for half of the
> scanned tasks and/or the allocation for the mm walker fails due to lack
> of memory. How is this going to affect reclaim efficiency?

Understood. This is not only possible -- it's the default for our ARM
hardware that doesn't support the accessed bit, i.e., CPUs that don't
automatically set the accessed bit.

In try_to_inc_max_seq(), we have:
    /*
     * If the hardware doesn't automatically set the accessed bit, fallback
     * to lru_gen_look_around(), which only clears the accessed bit in a
     * handful of PTEs. Spreading the work out over a period of time usually
     * is less efficient, but it avoids bursty page faults.
     */
    if the accessed bit is not supported
        return

    if alloc_mm_walk() fails
        return

    walk_mm()
        if mmap_sem contended
            return

        scan page tables

We have a microbenchmark that specifically measures this worst case
scenario by entirely disabling page table scanning. Its results showed
that this still retains more than 90% of the optimal performance. I'll
share this microbenchmark in another email when answering Barry's
questions regarding the accessed bit.

Our profiling infra also indirectly confirms this: it collects data
from real users running on hardware with and without the accessed
bit. Users running on hardware without the accessed bit indeed suffer
a small performance degradation, compared with users running on
hardware with it. But they still benefit almost as much, compared with
users running on the same hardware but without MGLRU.

> How does a user/admin
> know that the memory reclaim is in a "degraded" mode because of the
> contention?

As we previously discussed here:
https://lore.kernel.org/linux-mm/Ydu6fXg2FmrseQOn@google.com/
there used to be a counter measuring the contention, and it was deemed
unnecessary and removed in v4. But I don't have a problem if we want
to revive it.
Yu Zhao Jan. 23, 2022, 9:40 p.m. UTC | #40
On Thu, Jan 13, 2022 at 12:57:35PM +0100, Michal Hocko wrote:
> On Wed 12-01-22 16:43:15, Yu Zhao wrote:
> > On Wed, Jan 12, 2022 at 11:17:53AM +0100, Michal Hocko wrote:
> [...]
> > > Is there any reason you are not using folio_memcg_lock in the
> > > pte walk instead?
> > 
> > We have a particular lruvec (the first arg), hence a particular memcg
> > to lock. But we don't have a particular page to lock.
> 
> That is certainly true at this layer but the locking should be needed
> only for specific pages, no?

Yes.

> So you can move the lock down to the
> callback which examines respective pages. Or is there anything
> preventing that?

No.

> To be honest, and that is the reason I am asking, I really do not like
> to open code the migration synchronization outside of the memcg proper.

Agreed.

> Code paths which need a stable memcg are supposed to be using
> folio_memcg_lock for the specific examination time.

No argument here, just a clarification: when possible I prefer to
lock a batch of pages rather than individual ones.

> If you prefer a
> trylock approach for this usecase then we can add one.

Done. Thanks.
Michal Hocko Jan. 24, 2022, 2:01 p.m. UTC | #41
On Sun 23-01-22 14:28:30, Yu Zhao wrote:
> On Wed, Jan 19, 2022 at 10:42:47AM +0100, Michal Hocko wrote:
> > On Wed 19-01-22 00:04:10, Yu Zhao wrote:
> > > On Mon, Jan 10, 2022 at 11:54:42AM +0100, Michal Hocko wrote:
> > > > On Sun 09-01-22 21:47:57, Yu Zhao wrote:
> > > > > On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote:
> > > > > > On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> > > > > > [...]
> > > > > > > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > > > > > > +{
> > > > > > > +	static const struct mm_walk_ops mm_walk_ops = {
> > > > > > > +		.test_walk = should_skip_vma,
> > > > > > > +		.p4d_entry = walk_pud_range,
> > > > > > > +	};
> > > > > > > +
> > > > > > > +	int err;
> > > > > > > +#ifdef CONFIG_MEMCG
> > > > > > > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > > > > > +#endif
> > > > > > > +
> > > > > > > +	walk->next_addr = FIRST_USER_ADDRESS;
> > > > > > > +
> > > > > > > +	do {
> > > > > > > +		unsigned long start = walk->next_addr;
> > > > > > > +		unsigned long end = mm->highest_vm_end;
> > > > > > > +
> > > > > > > +		err = -EBUSY;
> > > > > > > +
> > > > > > > +		rcu_read_lock();
> > > > > > > +#ifdef CONFIG_MEMCG
> > > > > > > +		if (memcg && atomic_read(&memcg->moving_account))
> > > > > > > +			goto contended;
> > > > > > > +#endif
> > > > > > > +		if (!mmap_read_trylock(mm))
> > > > > > > +			goto contended;
> > > > > > 
> > > > > > Have you evaluated the behavior under mmap_sem contention? I mean what
> > > > > > would be an effect of some mms being excluded from the walk? This path
> > > > > > is called from direct reclaim and we do allocate with exclusive mmap_sem
> > > > > > IIRC and the trylock can fail in a presence of pending writer if I am
> > > > > > not mistaken so even the read lock holder (e.g. an allocation from the #PF)
> > > > > > can bypass the walk.
> > > > > 
> > > > > You are right. Here it must be a trylock; otherwise it can deadlock.
> > > > 
> > > > Yeah, this is clear.
> > > > 
> > > > > I think there might be a misunderstanding: the aging doesn't
> > > > > exclusively rely on page table walks to gather the accessed bit. It
> > > > > prefers page table walks but it can also fallback to the rmap-based
> > > > > function, i.e., lru_gen_look_around(), which only gathers the accessed
> > > > > bit from at most 64 PTEs and therefore is less efficient. But it still
> > > > > retains about 80% of the performance gains.
> > > > 
> > > > I have to say that I really have hard time to understand the runtime
> > > > behavior depending on that interaction. How does the reclaim behave when
> > > > the virtual scan is enabled, partially enabled and almost completely
> > > > disabled due to different constrains? I do not see any such an
> > > > evaluation described in changelogs and I consider this to be a rather
> > > > important information to judge the overall behavior.
> > > 
> > > It doesn't have (partially) enabled/disabled states nor does its
> > > behavior change with different reclaim constraints. Having either
> > > would make its design too complex to implement or benchmark.
> > 
> > Let me clarify. By "partially enabled" I really meant behavior depedning
> > on runtime conditions. Say mmap_sem cannot be locked for half of scanned
> > tasks and/or allocation for the mm walker fails due to lack of memory.
> > How does this going to affect reclaim efficiency.
> 
> Understood. This is not only possible -- it's the default for our ARM
> hardware that doesn't support the accessed bit, i.e., CPUs that don't
> automatically set the accessed bit.
> 
> In try_to_inc_max_seq(), we have:
>     /*
>      * If the hardware doesn't automatically set the accessed bit, fallback
>      * to lru_gen_look_around(), which only clears the accessed bit in a
>      * handful of PTEs. Spreading the work out over a period of time usually
>      * is less efficient, but it avoids bursty page faults.
>      */
>     if the accessed bit is not supported
>         return
> 
>     if alloc_mm_walk() fails
>         return
> 
>     walk_mm()
>         if mmap_sem contented
>             return
> 
>         scan page tables
> 
> We have a microbenchmark that specifically measures this worst case
> scenario by entirely disabling page table scanning. Its results showed
> that this still retains more than 90% of the optimal performance. I'll
> share this microbenchmark in another email when answering Barry's
> questions regarding the accessed bit.
> 
> Our profiling infra also indirectly confirms this: it collects data
> from real users running on hardware with and without the accessed
> bit. Users running on hardware without the accessed bit indeed suffer
> a small performance degradation, compared with users running on
> hardware with it. But they still benefit almost as much, compared with
> users running on the same hardware but without MGLRU.

This is definitely good information to have in the cover letter.

> > How does a user/admin
> > know that the memory reclaim is in a "degraded" mode because of the
> > contention?
> 
> As we previously discussed here:
> https://lore.kernel.org/linux-mm/Ydu6fXg2FmrseQOn@google.com/
> there used to be a counter measuring the contention, and it was deemed
> unnecessary and removed in v4. But I don't have a problem if we want
> to revive it.

Well, a counter might be rather tricky, but a few trace points would make
some sense to me.
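
Something along these lines, for example (a sketch only; the event name and
fields are made up, and the TRACE_SYSTEM/header boilerplate that a real
tracepoint needs is omitted):

/* sketch only: hypothetical tracepoint for a contended mm walk */
TRACE_EVENT(mm_lru_gen_walk_contended,

	TP_PROTO(int nid, unsigned short memcg_id),

	TP_ARGS(nid, memcg_id),

	TP_STRUCT__entry(
		__field(int, nid)
		__field(unsigned short, memcg_id)
	),

	TP_fast_assign(
		__entry->nid = nid;
		__entry->memcg_id = memcg_id;
	),

	TP_printk("nid=%d memcg=%u", __entry->nid, __entry->memcg_id)
);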
diff mbox series

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index aba18cd101db..028afdb81c10 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1393,18 +1393,24 @@  mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
 
 static inline void lock_page_memcg(struct page *page)
 {
+	/* to match folio_memcg_rcu() */
+	rcu_read_lock();
 }
 
 static inline void unlock_page_memcg(struct page *page)
 {
+	rcu_read_unlock();
 }
 
 static inline void folio_memcg_lock(struct folio *folio)
 {
+	/* to match folio_memcg_rcu() */
+	rcu_read_lock();
 }
 
 static inline void folio_memcg_unlock(struct folio *folio)
 {
+	rcu_read_unlock();
 }
 
 static inline void mem_cgroup_handle_over_high(void)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fadbf8e6abcd..3d42118b7f5e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1599,6 +1599,11 @@  static inline unsigned long folio_pfn(struct folio *folio)
 	return page_to_pfn(&folio->page);
 }
 
+static inline struct folio *pfn_folio(unsigned long pfn)
+{
+	return page_folio(pfn_to_page(pfn));
+}
+
 /* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
 #ifdef CONFIG_MIGRATION
 static inline bool is_pinnable_page(struct page *page)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5b9bc2532c5b..94af12507788 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -304,6 +304,7 @@  enum lruvec_flags {
 };
 
 struct lruvec;
+struct page_vma_mapped_walk;
 
 #define LRU_GEN_MASK		((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
 #define LRU_REFS_MASK		((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
@@ -410,6 +411,7 @@  struct lru_gen_mm_walk {
 };
 
 void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
 #ifdef CONFIG_MEMCG
 void lru_gen_init_memcg(struct mem_cgroup *memcg);
@@ -422,6 +424,10 @@  static inline void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *l
 {
 }
 
+static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+}
+
 #ifdef CONFIG_MEMCG
 static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
 {
@@ -1048,6 +1054,10 @@  typedef struct pglist_data {
 
 	unsigned long		flags;
 
+#ifdef CONFIG_LRU_GEN
+	/* kswap mm walk data */
+	struct lru_gen_mm_walk	mm_walk;
+#endif
 	ZONE_PADDING(_pad2_)
 
 	/* Per-node vmstats */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a1432511..9c7a4fae0661 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -57,6 +57,22 @@  struct oom_control {
 extern struct mutex oom_lock;
 extern struct mutex oom_adj_mutex;
 
+#ifdef CONFIG_MMU
+extern struct task_struct *oom_reaper_list;
+extern struct wait_queue_head oom_reaper_wait;
+
+static inline bool oom_reaping_in_progress(void)
+{
+	/* a racy check can be used to reduce the chance of overkilling */
+	return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait);
+}
+#else
+static inline bool oom_reaping_in_progress(void)
+{
+	return false;
+}
+#endif
+
 static inline void set_current_oom_origin(void)
 {
 	current->signal->oom_flag_origin = true;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d1ea44b31f19..bb93bba97115 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -137,6 +137,10 @@  union swap_header {
  */
 struct reclaim_state {
 	unsigned long reclaimed_slab;
+#ifdef CONFIG_LRU_GEN
+	/* per-thread mm walk data */
+	struct lru_gen_mm_walk *mm_walk;
+#endif
 };
 
 #ifdef __KERNEL__
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ddabefcfb5a..ef5860fc7d22 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -508,8 +508,8 @@  bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
-static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 bool __oom_reap_task_mm(struct mm_struct *mm)
diff --git a/mm/rmap.c b/mm/rmap.c
index 163ac4e6bcee..2f023e6c0f82 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -73,6 +73,7 @@ 
 #include <linux/page_idle.h>
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>
 
 #include <asm/tlbflush.h>
 
@@ -790,6 +791,12 @@  static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		if (pvmw.pte) {
+			if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
+			    !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
+				lru_gen_look_around(&pvmw);
+				referenced++;
+			}
+
 			if (ptep_clear_flush_young_notify(vma, address,
 						pvmw.pte)) {
 				/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5eaf22aa446a..fbf1337a1632 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -51,6 +51,8 @@ 
 #include <linux/dax.h>
 #include <linux/psi.h>
 #include <linux/memory.h>
+#include <linux/pagewalk.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -1555,6 +1557,11 @@  static unsigned int shrink_page_list(struct list_head *page_list,
 		if (!sc->may_unmap && page_mapped(page))
 			goto keep_locked;
 
+		/* folio_update_gen() tried to promote this page? */
+		if (lru_gen_enabled() && !ignore_references &&
+		    page_mapped(page) && PageReferenced(page))
+			goto keep_locked;
+
 		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
@@ -3047,6 +3054,15 @@  static bool can_age_anon_pages(struct pglist_data *pgdat,
  *                          shorthand helpers
  ******************************************************************************/
 
+#define DEFINE_MAX_SEQ(lruvec)						\
+	unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq)
+
+#define DEFINE_MIN_SEQ(lruvec)						\
+	unsigned long min_seq[ANON_AND_FILE] = {			\
+		READ_ONCE((lruvec)->lrugen.min_seq[0]),			\
+		READ_ONCE((lruvec)->lrugen.min_seq[1]),			\
+	}
+
 #define for_each_gen_type_zone(gen, type, zone)				\
 	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
 		for ((type) = 0; (type) < ANON_AND_FILE; (type)++)	\
@@ -3077,6 +3093,12 @@  static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 	return pgdat ? &pgdat->__lruvec : NULL;
 }
 
+static int get_swappiness(struct mem_cgroup *memcg)
+{
+	return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
+	       mem_cgroup_swappiness(memcg) : 0;
+}
+
 static int get_nr_gens(struct lruvec *lruvec, int type)
 {
 	return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1;
@@ -3431,6 +3453,869 @@  static bool get_next_mm(struct lruvec *lruvec, struct lru_gen_mm_walk *walk,
 	return last;
 }
 
+/******************************************************************************
+ *                          the aging
+ ******************************************************************************/
+
+static void folio_update_gen(struct folio *folio, struct lru_gen_mm_walk *walk)
+{
+	unsigned long old_flags, new_flags;
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int delta = folio_nr_pages(folio);
+	int old_gen, new_gen = lru_gen_from_seq(walk->max_seq);
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+
+		/* for shrink_page_list() */
+		if (!(new_flags & LRU_GEN_MASK)) {
+			new_flags |= BIT(PG_referenced);
+			continue;
+		}
+
+		new_flags &= ~LRU_GEN_MASK;
+		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
+	} while (new_flags != old_flags &&
+		 cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	old_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+	if (old_gen < 0 || old_gen == new_gen)
+		return;
+
+	walk->batched++;
+	walk->nr_pages[old_gen][type][zone] -= delta;
+	walk->nr_pages[new_gen][type][zone] += delta;
+}
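
As an aside (illustration only, not part of this patch): the gen counter in
folio->flags uses an off-by-one encoding, which is why folio_update_gen() and
folio_inc_gen() store new_gen + 1 and read it back with - 1; a stored value of
0 means the folio isn't on a multigenerational lru list. A minimal userspace
model, with an assumed 3-bit field at bit 0 rather than the patch's real
LRU_GEN_WIDTH/LRU_GEN_PGOFF:

/* Toy model of the off-by-one gen encoding; widths and offsets are assumptions. */
#include <assert.h>

#define GEN_WIDTH	3
#define GEN_PGOFF	0
#define GEN_MASK	(((1UL << GEN_WIDTH) - 1) << GEN_PGOFF)

/* returns the gen number, or -1 if the folio is not on a multigen lru list */
static int flags_to_gen(unsigned long flags)
{
	return (int)((flags & GEN_MASK) >> GEN_PGOFF) - 1;
}

/* stores a gen number using the off-by-one encoding */
static unsigned long gen_to_flags(unsigned long flags, int gen)
{
	return (flags & ~GEN_MASK) | ((gen + 1UL) << GEN_PGOFF);
}

int main(void)
{
	unsigned long flags = 0;

	assert(flags_to_gen(flags) == -1);	/* not on a multigen lru list */
	flags = gen_to_flags(flags, 2);
	assert(flags_to_gen(flags) == 2);
	return 0;
}
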
+
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	unsigned long old_flags, new_flags;
+	int type = folio_is_file_lru(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
+
+		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+		/* folio_update_gen() has promoted this page? */
+		if (new_gen >= 0 && new_gen != old_gen)
+			return new_gen;
+
+		new_gen = (old_gen + 1) % MAX_NR_GENS;
+
+		new_flags &= ~LRU_GEN_MASK;
+		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
+		/* for folio_end_writeback() */
+		if (reclaiming)
+			new_flags |= BIT(PG_reclaim);
+	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
+
+	return new_gen;
+}
+
+static void reset_batch_size(struct lruvec *lruvec, struct lru_gen_mm_walk *walk)
+{
+	int gen, type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	walk->batched = 0;
+
+	for_each_gen_type_zone(gen, type, zone) {
+		enum lru_list lru = type * LRU_FILE;
+		int delta = walk->nr_pages[gen][type][zone];
+
+		if (!delta)
+			continue;
+
+		walk->nr_pages[gen][type][zone] = 0;
+		WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
+			   lrugen->nr_pages[gen][type][zone] + delta);
+
+		if (lru_gen_is_active(lruvec, gen))
+			lru += LRU_ACTIVE;
+		lru_gen_update_size(lruvec, lru, zone, delta);
+	}
+}
+
+static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+	struct address_space *mapping;
+	struct vm_area_struct *vma = walk->vma;
+	struct lru_gen_mm_walk *priv = walk->private;
+
+	if (!vma_is_accessible(vma) || is_vm_hugetlb_page(vma) ||
+	    (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ)))
+		return true;
+
+	if (vma_is_anonymous(vma))
+		return !priv->can_swap;
+
+	if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
+		return true;
+
+	mapping = vma->vm_file->f_mapping;
+	if (!mapping->a_ops->writepage)
+		return true;
+
+	return (shmem_mapping(mapping) && !priv->can_swap) || mapping_unevictable(mapping);
+}
+
+/*
+ * Some userspace memory allocators map many single-page VMAs. Instead of
+ * returning to the PGD table for each such VMA, finish an entire PMD table
+ * to reduce zigzags and improve cache performance.
+ */
+static bool get_next_vma(struct mm_walk *walk, unsigned long mask, unsigned long size,
+			 unsigned long *start, unsigned long *end)
+{
+	unsigned long next = round_up(*end, size);
+
+	VM_BUG_ON(mask & size);
+	VM_BUG_ON(*start >= *end);
+	VM_BUG_ON((next & mask) != (*start & mask));
+
+	while (walk->vma) {
+		if (next >= walk->vma->vm_end) {
+			walk->vma = walk->vma->vm_next;
+			continue;
+		}
+
+		if ((next & mask) != (walk->vma->vm_start & mask))
+			return false;
+
+		if (should_skip_vma(walk->vma->vm_start, walk->vma->vm_end, walk)) {
+			walk->vma = walk->vma->vm_next;
+			continue;
+		}
+
+		*start = max(next, walk->vma->vm_start);
+		next = (next | ~mask) + 1;
+		/* rounded-up boundaries can wrap to 0 */
+		*end = next && next < walk->vma->vm_end ? next : walk->vma->vm_end;
+
+		return true;
+	}
+
+	return false;
+}
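
As an aside (not part of this patch): the expression "(next | ~mask) + 1"
above computes the first address of the next PMD-size or PUD-size block, and
the "rounded-up boundaries can wrap to 0" check guards the very top of the
address space. A minimal sketch, with an assumed 2MB block size standing in
for PMD_SIZE/PMD_MASK:

/* Sketch of get_next_vma()'s boundary arithmetic; the block size is assumed. */
#include <stdio.h>

#define BLK_SIZE	(2UL << 20)		/* stand-in for PMD_SIZE */
#define BLK_MASK	(~(BLK_SIZE - 1))	/* stand-in for PMD_MASK */

int main(void)
{
	unsigned long addr = 0x7f0000201000UL;
	/* first address of the block after the one containing addr */
	unsigned long next = (addr | ~BLK_MASK) + 1;

	printf("next block starts at %#lx\n", next);	/* 0x7f0000400000 */

	/* in the last block of the address space the sum wraps to 0 ... */
	addr = -1UL & BLK_MASK;
	next = (addr | ~BLK_MASK) + 1;
	/* ... hence the "next && next < vm_end" check in get_next_vma() */
	printf("wrapped next: %#lx\n", next);		/* 0 */
	return 0;
}
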
+
+static bool suitable_to_scan(int total, int young)
+{
+	int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
+
+	/* suitable if the average number of young PTEs per cacheline is >=1 */
+	return young * n >= total;
+}
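
For a concrete feel of the heuristic above (assuming 64-byte cachelines and
8-byte PTEs, so n == 8 after the clamp): a PTE table is worth rescanning only
if at least one in every eight scanned PTEs was young, i.e., roughly one young
PTE per cacheline of PTEs. A toy check, not part of the patch:

/* Toy version of suitable_to_scan(); 64-byte lines and 8-byte PTEs assumed. */
#include <stdbool.h>
#include <stdio.h>

static bool suitable(int total, int young)
{
	int n = 64 / 8;	/* PTEs per cacheline; the patch clamps this to [2, 8] */

	return young * n >= total;
}

int main(void)
{
	printf("%d\n", suitable(512, 64));	/* 1: exactly one young PTE per 8 */
	printf("%d\n", suitable(512, 63));	/* 0: just below the threshold */
	return 0;
}
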
+
+static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+			   struct mm_walk *walk)
+{
+	int i;
+	pte_t *pte;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int total = 0;
+	int young = 0;
+	struct lru_gen_mm_walk *priv = walk->private;
+	struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+
+	VM_BUG_ON(pmd_leaf(*pmd));
+
+	pte = pte_offset_map_lock(walk->mm, pmd, start & PMD_MASK, &ptl);
+	arch_enter_lazy_mmu_mode();
+restart:
+	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		struct folio *folio;
+		unsigned long pfn = pte_pfn(pte[i]);
+
+		total++;
+		priv->mm_stats[MM_PTE_TOTAL]++;
+
+		if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+			continue;
+
+		if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+			continue;
+
+		if (!pte_young(pte[i])) {
+			priv->mm_stats[MM_PTE_OLD]++;
+			continue;
+		}
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			continue;
+
+		folio = pfn_folio(pfn);
+		if (folio_nid(folio) != pgdat->node_id)
+			continue;
+
+		if (folio_memcg_rcu(folio) != memcg)
+			continue;
+
+		VM_BUG_ON(addr < walk->vma->vm_start || addr >= walk->vma->vm_end);
+		if (ptep_test_and_clear_young(walk->vma, addr, pte + i)) {
+			folio_update_gen(folio, priv);
+			priv->mm_stats[MM_PTE_YOUNG]++;
+			young++;
+		}
+
+		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			folio_mark_dirty(folio);
+	}
+
+	if (i < PTRS_PER_PTE && get_next_vma(walk, PMD_MASK, PAGE_SIZE, &start, &end))
+		goto restart;
+
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(pte, ptl);
+
+	return suitable_to_scan(total, young);
+}
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+				  struct mm_walk *walk, unsigned long *start)
+{
+	int i;
+	pmd_t *pmd;
+	spinlock_t *ptl;
+	struct lru_gen_mm_walk *priv = walk->private;
+	struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+
+	VM_BUG_ON(pud_leaf(*pud));
+
+	/* try to batch at most 1+MIN_LRU_BATCH+1 entries */
+	if (*start == -1) {
+		*start = next;
+		return;
+	}
+
+	i = next == -1 ? 0 : pmd_index(next) - pmd_index(*start);
+	if (i && i <= MIN_LRU_BATCH) {
+		__set_bit(i - 1, priv->bitmap);
+		return;
+	}
+
+	pmd = pmd_offset(pud, *start);
+	ptl = pmd_lock(walk->mm, pmd);
+	arch_enter_lazy_mmu_mode();
+
+	do {
+		struct folio *folio;
+		unsigned long pfn = pmd_pfn(pmd[i]);
+		unsigned long addr = i ? (*start & PMD_MASK) + i * PMD_SIZE : *start;
+
+		if (!pmd_present(pmd[i]) || is_huge_zero_pmd(pmd[i]))
+			goto next;
+
+		if (WARN_ON_ONCE(pmd_devmap(pmd[i])))
+			goto next;
+
+		if (!pmd_trans_huge(pmd[i])) {
+			if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
+				pmdp_test_and_clear_young(vma, addr, pmd + i);
+			goto next;
+		}
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			goto next;
+
+		folio = pfn_folio(pfn);
+		if (folio_nid(folio) != pgdat->node_id)
+			goto next;
+
+		if (folio_memcg_rcu(folio) != memcg)
+			goto next;
+
+		VM_BUG_ON(addr < vma->vm_start || addr >= vma->vm_end);
+		if (pmdp_test_and_clear_young(vma, addr, pmd + i)) {
+			folio_update_gen(folio, priv);
+			priv->mm_stats[MM_PTE_YOUNG]++;
+		}
+
+		if (pmd_dirty(pmd[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			folio_mark_dirty(folio);
+next:
+		i = i > MIN_LRU_BATCH ? 0 :
+		    find_next_bit(priv->bitmap, MIN_LRU_BATCH, i) + 1;
+	} while (i <= MIN_LRU_BATCH);
+
+	arch_leave_lazy_mmu_mode();
+	spin_unlock(ptl);
+
+	*start = -1;
+	bitmap_zero(priv->bitmap, MIN_LRU_BATCH);
+}
+#else
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+				  struct mm_walk *walk, unsigned long *start)
+{
+}
+#endif
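
As an aside (illustrative only, names below are not from the patch):
walk_pmd_range_locked() defers work by recording the first PMD-entry address
in *start and subsequent offsets in a small bitmap, then drains the whole
batch under the PMD lock when the batch is exceeded or when it is called with
next == -1. A userspace model of that record-then-drain pattern:

/* Userspace model of the record-then-drain batching; constants are assumed. */
#include <stdio.h>

#define BATCH	64

static unsigned long base = -1UL;
static unsigned long bits;	/* one bit per offset 1..BATCH after base */

static void drain(void)
{
	if (base == -1UL)
		return;

	printf("process entry %lu\n", base);	/* the entry that opened the batch */
	for (int i = 0; i < BATCH; i++)
		if (bits & (1UL << i))
			printf("process entry %lu\n", base + i + 1);

	base = -1UL;
	bits = 0;
}

/* idx == -1UL means "no more entries, drain whatever was recorded" */
static void note(unsigned long idx)
{
	if (idx == -1UL) {
		drain();
		return;
	}
	if (base == -1UL) {
		base = idx;
		return;
	}
	if (idx > base && idx - base <= BATCH) {
		bits |= 1UL << (idx - base - 1);
		return;
	}
	drain();	/* too far from base: flush and start a new batch */
	note(idx);
}

int main(void)
{
	note(5); note(6); note(8);
	note(200);	/* far from 5: drains {5, 6, 8}, opens a new batch */
	note(-1UL);	/* final drain: processes {200} */
	return 0;
}
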
+
+static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+			   struct mm_walk *walk)
+{
+	int i;
+	pmd_t *pmd;
+	unsigned long next;
+	unsigned long addr;
+	struct vm_area_struct *vma;
+	unsigned long pos = -1;
+	struct lru_gen_mm_walk *priv = walk->private;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+#endif
+
+	VM_BUG_ON(pud_leaf(*pud));
+
+	/*
+	 * Finish an entire PMD in two passes: the first pass only reaches PTE
+	 * tables, to avoid taking the PMD lock; the second, if necessary, takes
+	 * the PMD lock to clear the accessed bit in PMD entries.
+	 */
+	pmd = pmd_offset(pud, start & PUD_MASK);
+restart:
+	/* walk_pte_range() may call get_next_vma() */
+	vma = walk->vma;
+	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+		pmd_t val = pmd_read_atomic(pmd + i);
+
+		/* for pmd_read_atomic() */
+		barrier();
+
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(val)) {
+			priv->mm_stats[MM_PTE_TOTAL]++;
+			continue;
+		}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (pmd_trans_huge(val)) {
+			unsigned long pfn = pmd_pfn(val);
+
+			priv->mm_stats[MM_PTE_TOTAL]++;
+
+			if (is_huge_zero_pmd(val))
+				continue;
+
+			if (!pmd_young(val)) {
+				priv->mm_stats[MM_PTE_OLD]++;
+				continue;
+			}
+
+			if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+				continue;
+
+			walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+			continue;
+		}
+#endif
+		priv->mm_stats[MM_PMD_TOTAL]++;
+
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+		if (!pmd_young(val))
+			continue;
+
+		walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+#endif
+		if (!priv->full_scan && !test_bloom_filter(priv->lruvec, priv->max_seq, pmd + i))
+			continue;
+
+		priv->mm_stats[MM_PMD_FOUND]++;
+
+		if (!walk_pte_range(&val, addr, next, walk))
+			continue;
+
+		set_bloom_filter(priv->lruvec, priv->max_seq + 1, pmd + i);
+
+		priv->mm_stats[MM_PMD_ADDED]++;
+	}
+
+	walk_pmd_range_locked(pud, -1, vma, walk, &pos);
+
+	if (i < PTRS_PER_PMD && get_next_vma(walk, PUD_MASK, PMD_SIZE, &start, &end))
+		goto restart;
+}
+
+static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
+			  struct mm_walk *walk)
+{
+	int i;
+	pud_t *pud;
+	unsigned long addr;
+	unsigned long next;
+	struct lru_gen_mm_walk *priv = walk->private;
+
+	VM_BUG_ON(p4d_leaf(*p4d));
+
+	pud = pud_offset(p4d, start & P4D_MASK);
+restart:
+	for (i = pud_index(start), addr = start; addr != end; i++, addr = next) {
+		pud_t val = READ_ONCE(pud[i]);
+
+		next = pud_addr_end(addr, end);
+
+		if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
+			continue;
+
+		walk_pmd_range(&val, addr, next, walk);
+
+		if (priv->batched >= MAX_LRU_BATCH) {
+			end = (addr | ~PUD_MASK) + 1;
+			goto done;
+		}
+	}
+
+	if (i < PTRS_PER_PUD && get_next_vma(walk, P4D_MASK, PUD_SIZE, &start, &end))
+		goto restart;
+
+	end = round_up(end, P4D_SIZE);
+done:
+	/* rounded-up boundaries can wrap to 0 */
+	priv->next_addr = end && walk->vma ? max(end, walk->vma->vm_start) : 0;
+
+	return -EAGAIN;
+}
+
+static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+	static const struct mm_walk_ops mm_walk_ops = {
+		.test_walk = should_skip_vma,
+		.p4d_entry = walk_pud_range,
+	};
+
+	int err;
+#ifdef CONFIG_MEMCG
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+#endif
+
+	walk->next_addr = FIRST_USER_ADDRESS;
+
+	do {
+		unsigned long start = walk->next_addr;
+		unsigned long end = mm->highest_vm_end;
+
+		err = -EBUSY;
+
+		rcu_read_lock();
+#ifdef CONFIG_MEMCG
+		if (memcg && atomic_read(&memcg->moving_account))
+			goto contended;
+#endif
+		if (!mmap_read_trylock(mm))
+			goto contended;
+
+		err = walk_page_range(mm, start, end, &mm_walk_ops, walk);
+
+		mmap_read_unlock(mm);
+
+		if (walk->batched) {
+			spin_lock_irq(&lruvec->lru_lock);
+			reset_batch_size(lruvec, walk);
+			spin_unlock_irq(&lruvec->lru_lock);
+		}
+contended:
+		rcu_read_unlock();
+
+		cond_resched();
+	} while (err == -EAGAIN && walk->next_addr && !mm_is_oom_victim(mm));
+}
+
+static struct lru_gen_mm_walk *alloc_mm_walk(void)
+{
+	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
+		return kvzalloc(sizeof(struct lru_gen_mm_walk), GFP_KERNEL);
+
+	return current->reclaim_state->mm_walk;
+}
+
+static void free_mm_walk(struct lru_gen_mm_walk *walk)
+{
+	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
+		kvfree(walk);
+}
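
As an aside (not part of the patch): kswapd reuses the preallocated per-node
buffer wired up through reclaim_state->mm_walk in lru_gen_age_node() below,
while direct reclaim falls back to kvzalloc()/kvfree(). A tiny userspace model
of that reuse-or-allocate pattern, with made-up names:

/* Toy reuse-or-allocate pattern; names are illustrative only. */
#include <stdio.h>
#include <stdlib.h>

struct walk_buf { int dummy; };

static struct walk_buf *preallocated;	/* stands in for reclaim_state->mm_walk */

static struct walk_buf *get_walk_buf(void)
{
	if (preallocated)
		return preallocated;			/* kswapd-like path: reuse */
	return calloc(1, sizeof(struct walk_buf));	/* direct-reclaim-like path */
}

static void put_walk_buf(struct walk_buf *buf)
{
	if (buf && buf != preallocated)
		free(buf);				/* only free what we allocated */
}

int main(void)
{
	struct walk_buf *buf = get_walk_buf();

	printf("%s\n", buf == preallocated ? "reused" : "allocated");
	put_walk_buf(buf);
	return 0;
}
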
+
+static void inc_min_seq(struct lruvec *lruvec)
+{
+	int gen, type;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+			continue;
+
+		WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+	}
+}
+
+static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
+{
+	int gen, type, zone;
+	bool success = false;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		while (lrugen->max_seq >= min_seq[type] + MIN_NR_GENS) {
+			gen = lru_gen_from_seq(min_seq[type]);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+				if (!list_empty(&lrugen->lists[gen][type][zone]))
+					goto next;
+			}
+
+			min_seq[type]++;
+		}
+next:
+		;
+	}
+
+	/* see the comment in seq_is_valid() */
+	if (can_swap) {
+		min_seq[0] = min(min_seq[0], min_seq[1]);
+		min_seq[1] = max(min_seq[0], lrugen->min_seq[1]);
+	}
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		if (min_seq[type] == lrugen->min_seq[type])
+			continue;
+
+		WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
+		success = true;
+	}
+
+	return success;
+}
+
+static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+{
+	int prev, next;
+	int type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	if (max_seq != lrugen->max_seq)
+		goto unlock;
+
+	inc_min_seq(lruvec);
+
+	/* update the active/inactive lru sizes for compatibility */
+	prev = lru_gen_from_seq(lrugen->max_seq - 1);
+	next = lru_gen_from_seq(lrugen->max_seq + 1);
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			enum lru_list lru = type * LRU_FILE;
+			long delta = lrugen->nr_pages[prev][type][zone] -
+				     lrugen->nr_pages[next][type][zone];
+
+			if (!delta)
+				continue;
+
+			lru_gen_update_size(lruvec, lru, zone, delta);
+			lru_gen_update_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
+		}
+	}
+
+	WRITE_ONCE(lrugen->timestamps[next], jiffies);
+	/* make sure preceding modifications appear before the new max_seq */
+	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
+unlock:
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+			       struct scan_control *sc, bool can_swap, bool full_scan)
+{
+	bool last;
+	struct lru_gen_mm_walk *walk;
+	struct mm_struct *mm = NULL;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(max_seq > READ_ONCE(lrugen->max_seq));
+
+	/*
+	 * If the hardware doesn't automatically set the accessed bit, fall back
+	 * to lru_gen_look_around(), which only clears the accessed bit in a
+	 * handful of PTEs. Spreading the work out over a period of time is
+	 * usually less efficient, but it avoids bursty page faults.
+	 */
+	if (!full_scan && !arch_has_hw_pte_young(false)) {
+		inc_max_seq(lruvec, max_seq);
+		return true;
+	}
+
+	walk = alloc_mm_walk();
+	if (!walk)
+		return false;
+
+	walk->lruvec = lruvec;
+	walk->max_seq = max_seq;
+	walk->can_swap = can_swap;
+	walk->full_scan = full_scan;
+
+	do {
+		last = get_next_mm(lruvec, walk, &mm);
+		if (mm)
+			walk_mm(lruvec, mm, walk);
+
+		cond_resched();
+	} while (mm);
+
+	free_mm_walk(walk);
+
+	if (!last) {
+		if (!current_is_kswapd() && sc->priority < DEF_PRIORITY - 2)
+			wait_event_killable(lruvec->mm_state.wait,
+					    max_seq < READ_ONCE(lrugen->max_seq));
+
+		return max_seq < READ_ONCE(lrugen->max_seq);
+	}
+
+	VM_BUG_ON(max_seq != READ_ONCE(lrugen->max_seq));
+
+	inc_max_seq(lruvec, max_seq);
+	/* either this thread sees any waiters or they will see the updated max_seq */
+	if (wq_has_sleeper(&lruvec->mm_state.wait))
+		wake_up_all(&lruvec->mm_state.wait);
+
+	wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+	return true;
+}
+
+static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq, unsigned long *min_seq,
+			     struct scan_control *sc, bool can_swap, bool *need_aging)
+{
+	int gen, type, zone;
+	long max = 0;
+	long min = 0;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	/*
+	 * The upper bound of evictable pages is the total number of eligible
+	 * pages; the lower bound is the number of aged eligible file pages. The
+	 * aging is due if both the number of aged generations and the number of
+	 * aged eligible file pages are low.
+	 */
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		unsigned long seq;
+
+		for (seq = min_seq[type]; seq <= max_seq; seq++) {
+			long size = 0;
+
+			gen = lru_gen_from_seq(seq);
+
+			for (zone = 0; zone <= sc->reclaim_idx; zone++)
+				size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+			max += size;
+			if (type && max_seq >= seq + MIN_NR_GENS)
+				min += size;
+		}
+	}
+
+	*need_aging = max_seq <= min_seq[1] + MIN_NR_GENS && min < MIN_LRU_BATCH;
+
+	return max > 0 ? max : 0;
+}
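
A toy walk-through of the bounds above (not part of the patch), with assumed
values MIN_NR_GENS == 2 and MIN_LRU_BATCH == 64 from earlier in the series,
and only the file type for brevity:

/* Toy model of get_nr_evictable()'s bounds; the constants are assumptions. */
#include <stdbool.h>
#include <stdio.h>

#define MIN_NR_GENS	2
#define MIN_LRU_BATCH	64

int main(void)
{
	/* eligible file pages per sequence number, min_seq..max_seq */
	unsigned long min_seq = 10, max_seq = 12;
	long nr_pages[] = { 500, 300, 200 };	/* seq 10, 11, 12 */
	long max = 0, min = 0;

	for (unsigned long seq = min_seq; seq <= max_seq; seq++) {
		long size = nr_pages[seq - min_seq];

		max += size;			/* upper bound: everything eligible */
		if (max_seq >= seq + MIN_NR_GENS)
			min += size;		/* lower bound: aged file pages only */
	}

	bool need_aging = max_seq <= min_seq + MIN_NR_GENS && min < MIN_LRU_BATCH;

	/* max=1000, min=500: plenty of aged file pages, so need_aging is 0 */
	printf("max=%ld min=%ld need_aging=%d\n", max, min, need_aging);
	return 0;
}
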
+
+static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
+		       unsigned long min_ttl)
+{
+	bool need_aging;
+	long nr_to_scan;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	int swappiness = get_swappiness(memcg);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	if (mem_cgroup_below_min(memcg))
+		return false;
+
+	if (min_ttl) {
+		int gen = lru_gen_from_seq(min_seq[1]);
+		unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
+
+		if (time_is_after_jiffies(birth + min_ttl))
+			return false;
+	}
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, sc, swappiness, &need_aging);
+	if (!nr_to_scan)
+		return false;
+
+	nr_to_scan >>= sc->priority;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
+		try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false);
+
+	return true;
+}
+
+/* to protect the working set of the last N jiffies */
+static unsigned long lru_gen_min_ttl __read_mostly;
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+	struct mem_cgroup *memcg;
+	bool success = false;
+	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
+
+	VM_BUG_ON(!current_is_kswapd());
+
+	current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		if (age_lruvec(lruvec, sc, min_ttl))
+			success = true;
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	if (!success && mutex_trylock(&oom_lock)) {
+		struct oom_control oc = {
+			.gfp_mask = sc->gfp_mask,
+			.order = sc->order,
+		};
+
+		if (!oom_reaping_in_progress())
+			out_of_memory(&oc);
+
+		mutex_unlock(&oom_lock);
+	}
+
+	current->reclaim_state->mm_walk = NULL;
+}
+
+/*
+ * This function exploits spatial locality when shrink_page_list() walks the
+ * rmap. It scans the vicinity of a young PTE in a PTE table and promotes
+ * accessed pages. If the scan turns out to be cacheline-efficient, it adds the
+ * PMD entry pointing to this PTE table to the Bloom filter. This process forms
+ * a feedback loop from the eviction to the aging.
+ */
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+	int i;
+	pte_t *pte;
+	unsigned long start;
+	unsigned long end;
+	unsigned long addr;
+	struct lru_gen_mm_walk *walk;
+	int total = 0;
+	int young = 0;
+	struct mem_cgroup *memcg = page_memcg(pvmw->page);
+	struct pglist_data *pgdat = page_pgdat(pvmw->page);
+	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	DEFINE_MAX_SEQ(lruvec);
+
+	lockdep_assert_held(pvmw->ptl);
+	VM_BUG_ON_PAGE(PageLRU(pvmw->page), pvmw->page);
+
+	walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+	if (!walk)
+		return;
+
+	walk->max_seq = max_seq;
+
+	start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
+	end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end);
+
+	if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
+		if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
+			end = start + MIN_LRU_BATCH * PAGE_SIZE;
+		else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2)
+			start = end - MIN_LRU_BATCH * PAGE_SIZE;
+		else {
+			start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2;
+			end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2;
+		}
+	}
+
+	pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
+
+	lock_page_memcg(pvmw->page);
+	arch_enter_lazy_mmu_mode();
+
+	for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		struct folio *folio;
+		unsigned long pfn = pte_pfn(pte[i]);
+
+		total++;
+
+		if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+			continue;
+
+		if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+			continue;
+
+		if (!pte_young(pte[i]))
+			continue;
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			continue;
+
+		folio = pfn_folio(pfn);
+		if (folio_nid(folio) != pgdat->node_id)
+			continue;
+
+		if (folio_memcg_rcu(folio) != memcg)
+			continue;
+
+		VM_BUG_ON(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end);
+		if (ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) {
+			folio_update_gen(folio, walk);
+			young++;
+		}
+
+		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			__set_bit(i, walk->bitmap);
+	}
+
+	arch_leave_lazy_mmu_mode();
+	unlock_page_memcg(pvmw->page);
+
+	if (suitable_to_scan(total, young))
+		set_bloom_filter(lruvec, max_seq, pvmw->pmd);
+
+	for_each_set_bit(i, walk->bitmap, MIN_LRU_BATCH)
+		set_page_dirty(pte_page(pte[i]));
+
+	bitmap_zero(walk->bitmap, MIN_LRU_BATCH);
+}
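
As an aside (not part of the patch): the window clamping above scans at most
MIN_LRU_BATCH PTEs around the faulting address, biased toward whichever side
of the PTE table has room. A userspace sketch, assuming 4KB pages and a
64-entry batch:

/* Sketch of lru_gen_look_around()'s scan-window clamping; values assumed. */
#include <stdio.h>

#define PAGE_SIZE	4096UL
#define BATCH_BYTES	(64 * PAGE_SIZE)

static void clamp_window(unsigned long addr, unsigned long *start,
			 unsigned long *end)
{
	if (*end - *start <= BATCH_BYTES)
		return;				/* the whole range fits */
	if (addr - *start < BATCH_BYTES / 2)
		*end = *start + BATCH_BYTES;	/* addr near the left edge */
	else if (*end - addr < BATCH_BYTES / 2)
		*start = *end - BATCH_BYTES;	/* addr near the right edge */
	else {
		*start = addr - BATCH_BYTES / 2;	/* centered window */
		*end = addr + BATCH_BYTES / 2;
	}
}

int main(void)
{
	unsigned long start = 0x200000, end = 0x400000, addr = 0x300000;

	clamp_window(addr, &start, &end);
	printf("scan [%#lx, %#lx), %lu pages\n", start, end,
	       (end - start) / PAGE_SIZE);	/* 64 pages around addr */
	return 0;
}
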
+
 /******************************************************************************
  *                          state change
  ******************************************************************************/
@@ -3649,6 +4534,12 @@  static int __init init_lru_gen(void)
 };
 late_initcall(init_lru_gen);
 
+#else
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -4536,6 +5427,11 @@  static void age_active_anon(struct pglist_data *pgdat,
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
+	if (lru_gen_enabled()) {
+		lru_gen_age_node(pgdat, sc);
+		return;
+	}
+
 	if (!can_age_anon_pages(pgdat, sc))
 		return;